**Wil M. P. van der Aalst Josep Carmona (Eds.)**

Tutorial

# LNBIP 448

# **Process Mining Handbook**

# **Lecture Notes in Business Information Processing 448**

Series Editors

Wil van der Aalst *RWTH Aachen University, Aachen, Germany*

John Mylopoulos *University of Trento, Trento, Italy*

Sudha Ram *University of Arizona, Tucson, AZ, USA*

Michael Rosemann *Queensland University of Technology, Brisbane, QLD, Australia*

Clemens Szyperski *Microsoft Research, Redmond, WA, USA*

More information about this series at https://link.springer.com/bookseries/7911


*Editors*

Wil M. P. van der Aalst *RWTH Aachen, Aachen, Germany*

Josep Carmona *Universitat Politècnica de Catalunya, Barcelona, Spain*

ISSN 1865-1348 ISSN 1865-1356 (electronic) Lecture Notes in Business Information Processing ISBN 978-3-031-08847-6 ISBN 978-3-031-08848-3 (eBook) https://doi.org/10.1007/978-3-031-08848-3

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## **Preface**

Process mining emerged as a new discipline around the turn of the century. The combination of event data and process models poses interesting scientific problems. Initially, the focus was on the discovery of process models (e.g., Petri nets) from example traces. However, over time the scope of process mining broadened in several directions. Next to process discovery, topics such as conformance checking and performance analysis were added. Different perspectives were added (e.g., time, resources, roles, costs, and case types) to move beyond control-flow models. Along with directly-follows graphs (DFGs) and Petri nets, a wide range of process model notations has been explored in the context of event data. Examples include declarative process models, process trees, artifact-centric and object-centric process models, UML activity models, and BPMN models. In recent years, the focus also shifted from backward-looking to forward-looking, connecting process mining to neighboring disciplines such as simulation, machine learning, and automation.

Over the past two decades, the discipline did not only expand in terms of scope but also in terms of adoption and tool support. The first commercial process mining tools emerged 15 years ago (Futura Process Intelligence, Disco, etc.). Now there are over 40 commercial products next to open-source process mining tools such as ProM, PM4Py, and bupaR. The adoption in industry has accelerated in the last five years. In several regions of the world, most of the larger companies are already using process mining, and the process mining market is expected to double every 18 months in the coming years.

Given the amazing developments in the last two decades, a comprehensive process mining summer school is long overdue. This book contains the core material of the first Summer School on Process Mining organized by the IEEE Task Force on Process Mining. The Task Force on Process Mining was established in October 2009 as part of the IEEE Computational Intelligence Society. Its activities led to the International Process Mining Conference (ICPM) series, a range of successful workshops (BPI, ATAED, PODS4H, etc.), the Process Mining Manifesto (translated into 15+ languages), the XES standard, publicly available datasets, online courses, and case studies. However, a dedicated summer school on process mining was missing. Therefore, we started the preparations for this in 2020. Due to the COVID-19 pandemic, this was delayed by one year, but this gave us more time to carefully prepare this handbook on process mining.

The summer school took place in Aachen, Germany, during July 4–8, 2022. The location of the summer school was the scenic SuperC building with nice views of the city center and close to the cathedral of Aachen, which was the first UNESCO World Heritage site in Germany.

The local organization was undertaken by the Process and Data Science (PADS) group at RWTH Aachen University. The event was financially supported by Wil M. P. van der Aalst's Alexander von Humboldt (AvH) professorship. The event was also supported by the RWTH Center for Artificial Intelligence, the Center of Excellence Internet of Production (IoP), Celonis, and Springer.

The book starts with a 360-degree overview of the field of process mining (Chapter 1). This first chapter introduces the basic concepts, the different types of process mining, process modeling notations, and storage formats for events.

Chapter 2 presents the foundations of process discovery. It starts with discovering directly-follows graphs from simple event logs and highlighting the challenges. Then basic bottom-up and top-down process discovery techniques are presented that produce Petri nets and BPMN models.

Chapter 3 presents four additional process discovery techniques: an approach based on state-based regions, an approach based on language-based regions, the split mining approach, and the log skeleton-based approach.

Techniques to discover declarative process models are presented in Chapter 4. The chapter focuses on discovering declarative specifications from event logs, monitoring declarative specifications against running process executions to promptly detect violations, and reasoning on declarative process specifications.

Chapter 5 presents techniques for conformance checking. An overview of the applications of conformance checking and a general framework are presented. The goal is to compare modeled and observed behavior.

Chapter 6 discusses event data in more detail, also describing the data-preprocessing pipeline, standards like XES, and data quality problems.

Chapter 7 takes a more applied view and discusses how process mining is used in different industries and the efforts involved in creating an event log. The chapter also lists best practices, illustrated using the order-to-cash (O2C) process in an SAP system.

Chapter 8 introduces a number of techniques for process enhancement, including process extension and process improvement. For example, it is shown how to add additional perspectives to a process model.

Chapter 9 introduces event knowledge graphs as a means to model multiple entities distributed over different perspectives. It is shown how to construct, query, and aggregate event knowledge graphs to get insights into complex behaviors.

Predictive process monitoring techniques are introduced in Chapter 10. This is the branch of process mining that aims at predicting the future of ongoing (uncompleted) process executions.

Streaming process mining refers to the set of techniques and tools which have the goal of processing a stream of data (as opposed to a fixed event log). Chapter 11 presents such techniques.

The topic of responsible process mining is addressed in Chapter 12. The chapter summarizes and discusses current approaches that aim to make process mining responsible by design, using the well-known FACT criteria (Fairness, Accuracy, Confidentiality, and Transparency).

Chapter 13 discusses the evolution of the field of process mining, i.e., the transition from process discovery to process execution management. The focus is on driving business value.

Chapter 14 makes the case that healthcare is a very promising application domain for process mining with a great societal value. An overview of healthcare processes and healthcare process data is given, followed by a discussion of common use cases.

Chapter 15 shows that process mining is a valuable tool for financial auditing. Both internal and external audits are introduced, along with the connection between the two audits and the application of process mining.

Chapter 16 introduces a family of techniques, called robotic process mining, that discover repetitive routines that can be automated using robotic process automation (RPA) technology.

Chapter 17 concludes the book with an analysis of the current state of the process mining discipline and outlook on future developments and challenges. Pointers to the lecture material will be made available via www.process-mining-summer-school.org, www.processmining.org, and www.tf-pm.org. These complement this book.

Finally, we thank all the participants, authors, speakers, and the organizations supporting this once-in-a-lifetime event. In particular, we thank the Alexander von Humboldt Foundation. Enjoy reading!

April 2022 Wil M. P. van der Aalst Josep Carmona

# **Introduction**

# **Process Mining: A 360 Degree Overview**

Wil M. P. van der Aalst

Process and Data Science (PADS), RWTH Aachen University, Aachen, Germany wvdaalst@pads.rwth-aachen.de http://www.vdaalst.com/

**Abstract.** Process mining enables organizations to uncover their actual processes, provide insights, diagnose problems, and automatically trigger corrective actions. Process mining is an emerging scientific discipline positioned at the intersection between process science and data science. The combination of process modeling and analysis with the event data present in today's information systems provides new means to tackle compliance and performance problems. This chapter provides an overview of the field of process mining introducing the different types of process mining (e.g., process discovery and conformance checking) and the basic ingredients, i.e., process models and event data. To prepare for later chapters, event logs are introduced in detail (including pointers to standards for event data such as XES and OCEL). Moreover, a brief overview of process mining applications and software is given.

**Keywords:** Process mining · Event data · Process modeling · Process discovery

## **1 Introduction**

Process mining can be defined as follows: *process mining aims to improve operational processes through the systematic use of event data* [1,2]. By using a combination of event data and process models, process mining techniques provide insights, identify bottlenecks and deviations, anticipate and diagnose performance and compliance problems, and support the automation or removal of repetitive work. Process mining techniques can be *backward-looking* (e.g., finding the root causes of a bottleneck in a production process) or *forward-looking* (e.g., predicting the remaining processing time of a running case or providing recommendations to lower the failure rate). Both backward-looking and forward-looking analyses can trigger *actions* (e.g., countermeasures to address a performance or compliance problem). The focus of process mining is on *operational processes*, i.e., processes requiring the repeated execution of activities to deliver products or services. These can be found in all organizations and industries, including production, logistics, finance, sales, procurement, education, consulting, healthcare, maintenance, and government. This chapter provides a 360° overview of process mining, introducing basic concepts and positioning process mining with respect to other technologies.

The idea of using detailed data about operational processes is not new. For example, Frederick Winslow Taylor (1856–1915) collected data on specific tasks to improve labor productivity [35]. With the increasing availability of computers, spreadsheets and other business intelligence tools were used to monitor and analyze operational processes. However, in most cases, the focus was on a single task in the process, or behavior was reduced to aggregated Key Performance Indicators (KPIs) such as flow time, utilization, and costs. Process mining aims to analyze *end-to-end processes* at the level of *events*, i.e., detailed behavior is considered in order to explain and improve performance and compliance problems.

Process mining research started in the late 1990s [23]. In 2004 the first version of the open-source platform ProM was released with 29 plug-ins. Over time the ProM platform was extended and now includes over 1500 plug-ins. The first commercial process mining tools appeared around 15 years ago. Today, there are over 40 commercial process mining tools and process mining is used by thousands of organizations all over the globe. However, only a small fraction of its potential has been realized. Process mining is generic and can be applied in any organization.

**Fig. 1.** Process mining = data science ∩ process science.

Figure 1 shows that process mining can be seen as the intersection of data science and process science. In [2], the following definition is proposed: "Data science is an interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects." In [2], process science is used as an umbrella term to refer to the broader discipline that combines knowledge from information technology and knowledge from management sciences to improve and run operational processes. In the more recent [12], the following definition is proposed: "Process science is the interdisciplinary study of continuous change. By process, we mean a coherent series of changes that unfold over time and occur at multiple levels." In [12], we emphasize the following key characteristics of process science: (1) processes are in focus, (2) processes are investigated using scientific methods, (3) an interdisciplinary lens is used, and (4) the goal of process science is to influence and change processes to realize measurable improvements. As stated in [2] and visualized in Fig. 1, process mining can be viewed as the link between data science and process science.
Process mining seeks the confrontation between event data (i.e., observed behavior) and process models (hand-made models or automatically discovered models), and aims to exploit event data in a meaningful way, for example, to provide insights, identify bottlenecks, anticipate problems, record policy violations, recommend countermeasures, and streamline processes.

**Fig. 2.** 360° overview of process mining.

Figure 2 shows a high-level view of process mining. *Event data* need to be extracted from *information systems* used to support the processes that need to be analyzed. Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), and Supply Chain Management (SCM) systems store events. Examples are SAP S/4HANA, Oracle E-Business Suite, Microsoft Dynamics 365, and Salesforce CRM. Next to these sector-agnostic software systems, there are more specialized systems such as Health Information Systems (HIS). All of these systems have in common that they are loaded with event data. However, these are scattered over many database tables and need to be converted into a format that can be used for process mining. As a consequence, data extraction is an integral part of any process mining effort, and may be time-consuming. Events are often represented by a case identifier, an activity name, a timestamp, and optional attributes such as resource, location, cost, etc. Object-centric event data allow events to point to any number of objects rather than a single case (see Sect. 3).

Once extracted, event data can be *explored*, *selected*, *filtered*, and *cleaned* (see Fig. 2). Data visualization techniques such as dotted charts and sequence diagrams can be used to understand the data. Often, the data need to be scoped to the process of interest. One can use generic query languages like SQL, SPARQL, and XQuery or a dedicated Process Query Language (PQL). Data may be incomplete, duplicated, or inconsistent. For example, month and day may be swapped during manual data entry. There is a variety of techniques and approaches to address such *data quality* problems [34].

The resulting dataset is often referred to as an *event log*, i.e., a collection of events corresponding to the selected process. *Process discovery* techniques are used to automatically create process models. Commercial tools often still resort to learning the so-called *Directly-Follows Graph* (DFG), which typically leads to underfitting process models [3]. If two activities do not occur in a fixed order, then loops are created. This leads to Spaghetti-like diagrams suggesting repetitions that are not supported by the data. However, there are numerous approaches to learning higher-level models represented using Business Process Model and Notation (BPMN), Petri nets, or Unified Modeling Language (UML) activity diagrams. In contrast to DFGs, such models are able to express concurrency. Example techniques to discover such models are the Alpha algorithm [8], region-based approaches [11,13,33,36], inductive mining techniques [28,29], and the split miner [9]. The process model returned may aim to describe all behavior observed or just the dominant behavior. Note that the event log only contains example behavior, is likely to be incomplete, and at the same time may contain infrequent behavior.

The combination of a process model and event data can be used to conduct conformance checking and performance analysis (Fig. 2). The process model may have been discovered or made by hand. Discovered process models are descriptive and hand-crafted models are often normative. *Conformance checking* relates events in the event log to activities in the process model and compares both. The goal is to find commonalities and discrepancies between the modeled behavior and the observed behavior. If the process model is normative, deviations correspond to undesired behavior (e.g., fraud or inefficiencies). If the model was discovered automatically with the goal of showing the dominant behavior, then deviations correspond to exceptional behavior (i.e., outliers). Note that most processes have a Pareto distribution, e.g., 80% of the cases can be described by only 20% of the process variants. It is often easy and desirable to create a process model describing these 80%. However, the remaining 20% cannot be discarded since these cases cover the remaining 80% of the process variants and often also the majority of performance and compliance problems. Sometimes event logs are even more unbalanced, e.g., it is not uncommon to find logs where 95% of the cases can be described by less than 5% of the process variants. In the latter case, it may be that the remaining 5% of cases (covering 95% of the process variants) consume most of the resources due to rework and exception handling.
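The imbalance between cases and variants described above can be made concrete with a small computation. The following is a minimal sketch, assuming the event log is already available as a list of traces (one list of activity names per case). The function name `variant_coverage` and the toy log are hypothetical and serve only as an illustration.

```python
from collections import Counter


def variant_coverage(traces):
    """Return (variant, case_count, cumulative_case_share) rows,
    ordered from the most frequent variant to the least frequent one."""
    counts = Counter(tuple(trace) for trace in traces)
    total = sum(counts.values())
    rows, covered = [], 0
    for variant, n in counts.most_common():
        covered += n
        rows.append((variant, n, covered / total))
    return rows


# Hypothetical log: 20 cases spread over 5 variants, one of which dominates.
log = (
    [["a", "b", "c"]] * 16
    + [["a", "c", "b"]]
    + [["a", "b", "b", "c"]]
    + [["a", "c"]]
    + [["a", "b", "c", "c"]]
)

rows = variant_coverage(log)
for variant, n, share in rows:
    print(" ".join(variant), n, f"{share:.0%}")
```

In this toy log, one of the five variants (20%) covers 16 of the 20 cases (80%), while the remaining four variants account for the last four cases.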

Since events have timestamps, it is easy to overlay the process model with *performance diagnostics* (service times, waiting times, etc.). After discovering the *control-flow*, the process model can be turned into a *stochastic* model that includes probabilities and delay distributions.

After applying conformance checking and performance analysis techniques, users can see performance and compliance problems. It is possible to perform *root-cause analysis* for such problems. One may find out that critical deviations are often caused by a particular machine or supplier, or that the main bottleneck is caused by poor resource planning or excessive rework for some product types. In a procurement process, price changes by a particular supplier may explain an increase in rework. If "Receive Invoice" often occurs before "Create Purchase Requisition", then this signals a compliance problem in the same process. These are just a few examples. In principle, any process-related problem can be diagnosed as long as event data are available.

The right-hand side of Fig. 2 shows that process mining can be used to (1) transform and improve the process and (2) automatically address observed and predicted problems. The stochastic process models discovered from event data can be used to conduct "what-if" analysis using *simulation* or other techniques from *operations research* (e.g., planning). The combination of event data and process models can be used to generate *Machine Learning* (ML) problems. ML techniques can be used to predict outcomes without being explicitly programmed to do so. The uptake of ML in recent years can be attributed to progress in deep learning, where artificial neural networks having multiple layers progressively extract higher-level features from the raw input. ML techniques cannot be applied directly to event data. However, by replaying event data on discovered process models, it is possible to create a range of supervised learning problems. Examples include:


It is important to note that the left-hand side of Fig. 2 (i.e., extraction, discovery, conformance checking, and performance analysis) cannot be supported using mainstream Artificial Intelligence (AI) and Machine Learning (ML) technologies (e.g., neural networks). One first needs to discover an explicit process model tightly connected to the event data, to pose the right questions. However, process mining can be used to create AI/ML problems. The combination can be used to trigger corrective actions or even complete workflows addressing the problem observed. This way, event data can be turned into actions that actively address performance and compliance problems.

## **2 Process Models**

There are many notations to describe processes, ranging from Directly-Follows Graphs (DFGs) and transition systems, to BPMN and Petri nets. We will use an example to gently introduce these notations. Consider a process involving the following activities: *buy ingredients (bi)*, *create base (cb)*, *add tomato (at)*, *add cheese (ac)*, *add salami (as)*, *bake in oven (bo)*, *eat pizza (ep)*, and *clean kitchen (ck)*. We will call this fictive process the "pizza process" and use this to illustrate the key concepts and notations.

**Fig. 3.** BPMN model of the "pizza process". The three toppings (tomato, cheese, and salami) can be added in any order.

Figure 3 shows a process model using *Business Process Model and Notation* (BPMN) [17]. The process starts with activity *buy ingredients (bi)* followed by activity *create base (cb)*. Then three activities are executed in any order: *add tomato (at)*, *add cheese (ac)*, and *add salami (as)*. After all three toppings (tomato, cheese, and salami) have been added, the activities *bake in oven (bo)*, *eat pizza (ep)*, and *clean kitchen (ck)* are performed in sequence. Assuming that the three concurrent activities are performed in some order (i.e., interleaved), there are 3! = 6 ways to execute the "pizza process". The two diamond-shaped symbols with a **+** inside denote *parallel gateways*. The first one is a so-called *AND-split* starting the three concurrent branches and the second one is a so-called *AND-join*. The BPMN process starts with a *start event* (shown as a circle) and ends with an *end event* (shown as a thick circle).

**Fig. 4.** Petri net modeling the "pizza process" with activities *buy ingredients (bi)*, *create base (cb)*, *add cheese (ac)*, *add tomato (at)*, *add salami (as)*, *bake in oven (bo)*, *eat pizza (ep)*, and *clean kitchen (ck)*.

Figure 4 models the same process in terms of a Petri net. This model also allows for 3! = 6 ways to execute the "pizza process". The circles correspond to *places* (to model states) and the squares correspond to *transitions* (to model activities). Places may hold tokens. A place is called marked if it contains a token. A *marking* is a distribution of tokens over places. In Fig. 4, the source place (i.e., the input place of transition *bi*) is marked, as is indicated by the token (the black dot). A transition is enabled if all input places are marked. In the initial marking shown in Fig. 4, transition *bi* (corresponding to activity *buy ingredients*) is enabled. A transition that is enabled may fire (i.e., it may occur). This means that a token is removed from each of the input places and a token is produced for each of the output places. Note that transition *cb* consumes one token and produces three tokens (one for each output place) and transition *bo* consumes three tokens (one for each input place) and produces one token. The process ends when a token is put on the sink place, i.e., the output place of *ck*. In total there are 2 + 2<sup>3</sup> + 3 = 13 reachable markings. Although the behavior of the Petri net in Fig. 4 is the same as the BPMN model in Fig. 3, it is easier to refer to the states of the process model.
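The token game described above can be replayed in a few lines of code. The sketch below is one possible encoding of the Petri net in Fig. 4, not a reference implementation. The place names (`start`, `p1`, `p_at`, and so on) are assumptions, since the text names only the transitions.

```python
from collections import Counter, deque

# transition -> (input places, output places); place names are hypothetical
NET = {
    "bi": (["start"], ["p1"]),
    "cb": (["p1"], ["p_at", "p_ac", "p_as"]),
    "at": (["p_at"], ["q_at"]),
    "ac": (["p_ac"], ["q_ac"]),
    "as": (["p_as"], ["q_as"]),
    "bo": (["q_at", "q_ac", "q_as"], ["p2"]),
    "ep": (["p2"], ["p3"]),
    "ck": (["p3"], ["end"]),
}


def enabled(marking, t):
    """A transition is enabled if every input place holds a token."""
    return all(marking[p] >= 1 for p in NET[t][0])


def fire(marking, t):
    """Consume one token per input place, produce one per output place."""
    inputs, outputs = NET[t]
    new = Counter(marking)
    new.subtract(inputs)
    new.update(outputs)
    return +new  # unary plus drops places with zero tokens


def reachable_markings(initial):
    """Breadth-first exploration of all markings reachable from `initial`."""
    start = frozenset(Counter(initial).items())
    seen, queue = {start}, deque([start])
    while queue:
        marking = Counter(dict(queue.popleft()))
        for t in NET:
            if enabled(marking, t):
                nxt = frozenset(fire(marking, t).items())
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def runs(marking, final_place="end"):
    """All firing sequences that lead from `marking` to the sink place."""
    if marking[final_place] >= 1:
        return [[]]
    return [[t] + rest
            for t in NET if enabled(marking, t)
            for rest in runs(fire(marking, t), final_place)]


markings = reachable_markings(["start"])
print(len(markings))                    # 13 = 2 + 2**3 + 3
print(len(runs(Counter(["start"]))))    # 6 = 3!
```

The eight markings in the middle correspond to the 2<sup>3</sup> combinations of toppings that have or have not yet been added.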

**Fig. 5.** Process tree of the "pizza process": →(*bi*, *cb*, ∧(*ac*, *at*, *as*), *bo*, *ep*, *ck*).

Figure 5 models the "pizza process" using a *process tree*. This representation is rarely presented to end-users, but several mining algorithms use this internally. Process trees are closer to programming constructs, process algebras, and regular expressions. The graphical representation can be converted to a compact textual format: →(*bi*, *cb*, ∧(*ac*, *at*, *as*), *bo*, *ep*, *ck*). A sequence operator → executes its children in sequential order. The root node in Fig. 5 denotes such a sequence, i.e., the six child nodes are executed in sequence. The third child node models the parallel execution of its three children. This subtree can be denoted by ∧(*ac*, *at*, *as*). Later we will see that there are four types of operators that can be used in a process tree: → (sequential composition), × (exclusive choice), ∧ (parallel composition), and ⟲ (redo loop). The semantics of a process tree can be expressed in terms of Petri nets, e.g., Fig. 5 and Fig. 4 represent the same process.
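To illustrate how the textual format maps to behavior, the following sketch enumerates the traces of a process tree built from the sequence and parallel operators only. The nested-tuple encoding and the operator names `"seq"` and `"par"` are assumptions made for this example; they are not a notation used by any particular tool.

```python
from itertools import product


def interleavings(seqs):
    """All ways to interleave the given sequences, preserving the
    internal order of each sequence."""
    seqs = [s for s in seqs if s]
    if not seqs:
        return {()}
    result = set()
    for i, s in enumerate(seqs):
        rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
        for tail in interleavings(rest):
            result.add((s[0],) + tail)
    return result


def language(tree):
    """Set of traces of a process tree; a leaf is an activity name."""
    if isinstance(tree, str):
        return {(tree,)}
    op, *children = tree
    child_languages = [language(child) for child in children]
    traces = set()
    for combo in product(*child_languages):
        if op == "seq":      # sequential composition (→)
            traces.add(sum(combo, ()))
        elif op == "par":    # parallel composition (∧), interleaved
            traces |= interleavings(list(combo))
        else:
            raise ValueError(f"unsupported operator: {op}")
    return traces


# →(bi, cb, ∧(ac, at, as), bo, ep, ck)
pizza = ("seq", "bi", "cb", ("par", "ac", "at", "as"), "bo", "ep", "ck")
pizza_traces = language(pizza)
for trace in sorted(pizza_traces):
    print(" ".join(trace))
```

The sketch yields the 3! = 6 interleavings of the three toppings, each embedded between *bi*, *cb* and *bo*, *ep*, *ck*.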

**Fig. 6.** DFG of the "pizza process". Note that the behavior is different, e.g., one may add 10 toppings to the pizza.

Most of the process mining tools directly show a *Directly-Follows Graph* (DFG) when loading an event log. This helps get a first impression of the behavior recorded. Figure 6 shows a DFG for our running example. There are two special nodes to model start (▶) and end (■). The other nodes represent activities. The arcs in a DFG denote the "directly-follows relation", e.g., the arc connecting *cb* to *at* shows that immediately after creating the pizza base *cb* one can add tomato paste *at*. Activity *cb* has three outgoing arcs denoting a choice, i.e., *cb* is directly followed by *at*, *ac*, or *as*. Activity *at* also has three outgoing arcs denoting that one can add another topping (*ac* or *as*) or bake the pizza (*bo*). Note that the behavior of the DFG in Fig. 6 is different from the three models shown before (i.e., the BPMN model, the Petri net, and the process tree). The DFG allows for infinitely many ways to execute the "pizza process" (instead of 3! = 6). For example, it is possible to create a pizza where each of the toppings was added 10 times. The problem is that whenever two activities can occur in any order (e.g., *at* and *ac*), there is immediately a loop in the DFG (even when both happen only once).
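The effect can be reproduced with a minimal sketch that derives a DFG from the six traces of the "pizza process" and then checks which traces the resulting graph admits. The helper names `discover_dfg`, `successors`, and `accepts`, and the `START`/`END` markers, are hypothetical and do not refer to any tool's API.

```python
from collections import Counter
from itertools import permutations

START, END = "START", "END"  # stand-ins for the two special nodes


def discover_dfg(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        path = [START, *trace, END]
        for a, b in zip(path, path[1:]):
            dfg[(a, b)] += 1
    return dfg


def successors(dfg, activity):
    """Activities that directly follow `activity` at least once."""
    return sorted(b for (a, b) in dfg if a == activity)


def accepts(dfg, trace):
    """A trace is admitted if every consecutive pair is an arc of the DFG."""
    path = [START, *trace, END]
    return all((a, b) in dfg for a, b in zip(path, path[1:]))


# The six executions allowed by the BPMN model, Petri net, and process tree.
traces = [["bi", "cb", *toppings, "bo", "ep", "ck"]
          for toppings in permutations(["at", "ac", "as"])]
dfg = discover_dfg(traces)

print(successors(dfg, "cb"))   # ['ac', 'as', 'at']
print(successors(dfg, "at"))   # ['ac', 'as', 'bo']

# Never observed, yet admitted by the DFG: tomato and cheese added twice.
extra = ["bi", "cb", "at", "ac", "at", "ac", "as", "bo", "ep", "ck"]
print(extra in traces, accepts(dfg, extra))   # False True
```

Because both (*at*, *ac*) and (*ac*, *at*) are arcs, the graph contains a cycle even though each topping occurs exactly once in every observed trace.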

**Fig. 7.** BPMN model of the extended "pizza process".

To explain other process constructs such as choice, skipping, and looping we extend the "pizza process". First of all, we allow for adding multiple servings of cheese, i.e., activity *ac* can be executed multiple times after creating the pizza base and before putting the pizza in the oven. Second, instead of adding salami as a topping one can add mushrooms, i.e., there is a choice between *as* (add salami) and *am* (add mushrooms). Third, the eating of the pizza may be skipped (i.e., activity *ep* is optional).

Figure 7 shows the BPMN model with these three extensions. In total six exclusive gateways were added: three XOR-splits and three XOR-joins (see the diamond-shaped symbols with a × inside). After adding cheese, one can loop back. There is a choice between adding salami and adding mushrooms. Also the eating of the pizza can be skipped.

**Fig. 8.** Petri net modeling the extended "pizza process" with two silent transitions (to skip eating the pizza and to add more cheese), and a transition *am* corresponding to activity *add mushrooms*.

Figure 8 shows a Petri net modeling the extended process. A new transition *am* (add mushrooms) has been added. Transitions *as* and *am* share an input place. If the input place is marked, then both transitions are enabled, but only one of them can occur. If *as* consumes the token from the shared input place, then *am* gets disabled. If *am* consumes the token from the shared input place, then *as* gets disabled. This way, we model the choice between two toppings: salami and mushrooms. Figure 8 also has two new so-called *silent transitions* denoted by the two black rectangles. Sometimes such silent transitions are denoted as a normal transition with a τ label. Silent transitions do not correspond to activities and are used for routing only, e.g., skipping activities. In Fig. 8, there is one silent transition to repeatedly execute *ac* (to model adding multiple servings of cheese) and one silent transition to skip *ep*.

**Fig. 9.** Process tree of the extended "pizza process": →(*bi*, *cb*, ∧(⟲(*ac*, τ), *at*, ×(*as*, *am*)), *bo*, ×(*ep*, τ), *ck*).

The process tree in Fig. 9 has the same behavior as the BPMN model and Petri net just shown. The process tree uses all four operators: → (sequential composition), × (exclusive choice), ∧ (parallel composition), and ⟲ (redo loop). A silent activity is denoted by τ and cannot be observed. The process tree in Fig. 9 can also be visualized in textual form: →(*bi*, *cb*, ∧(⟲(*ac*, τ), *at*, ×(*as*, *am*)), *bo*, ×(*ep*, τ), *ck*).

To understand the notation, we first look at a few smaller examples. Process tree <sup>×</sup>(a, b) models a choice between activities <sup>a</sup> and <sup>b</sup>. Process tree <sup>×</sup>(a, τ ) can be used to model an activity a that can be skipped. Process tree -(a, τ ) can be used to model the process that executes a at least once. The "redo" part is silent, so the process can loop back without executing any activity. Process tree -(τ, a) models a process that executes a any number of times. The "do" part is now silent and activity a is in the "redo" part. This way it is also possible to not execute a at all.
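The four small examples can be checked with a short sketch. It covers only the exclusive-choice and redo-loop operators and unrolls each loop a bounded number of times, because the true language of a loop is infinite. The tuple encoding, the operator names `"xor"` and `"loop"`, the use of `None` for τ, and the bound `max_redo` are all assumptions made for illustration.

```python
TAU = None  # silent activity: contributes nothing to the observed trace


def language(tree, max_redo=2):
    """Traces of a tree built from exclusive choice and redo loop,
    with each loop unrolled at most `max_redo` times."""
    if tree is TAU:
        return {()}
    if isinstance(tree, str):
        return {(tree,)}
    op, *children = tree
    langs = [language(child, max_redo) for child in children]
    if op == "xor":
        # exclusive choice: exactly one child is executed
        return set().union(*langs)
    if op == "loop":
        # redo loop: "do", then any number of ("redo" followed by "do")
        do, redo = langs
        traces, frontier = set(do), set(do)
        for _ in range(max_redo):
            frontier = {t + r + d for t in frontier for r in redo for d in do}
            traces |= frontier
        return traces
    raise ValueError(f"unsupported operator: {op}")


choice = language(("xor", "a", "b"))
skippable = language(("xor", "a", TAU))
at_least_once = language(("loop", "a", TAU))
any_number = language(("loop", TAU, "a"))

print(sorted(choice))          # [('a',), ('b',)]
print(sorted(skippable))       # [(), ('a',)]
print(sorted(at_least_once))   # [('a',), ('a', 'a'), ('a', 'a', 'a')]
print(sorted(any_number))      # [(), ('a',), ('a', 'a')]
```

With the silent activity in the "redo" position the empty trace never appears, whereas with the silent activity in the "do" position it does, which matches the distinction between "at least once" and "any number of times".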

Now let us take a look at the three modifications of our extended "pizza process": -(*ac*, τ ) models that multiple servings of cheese can be added, <sup>×</sup>(*as*, *am*) models the choice between salami and mushrooms, and <sup>×</sup>(*ep*, τ ) models the ability to skip eating the pizza.
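The semantics of the four operators can be made concrete with a small sketch that enumerates the traces of a process tree. This is only an illustration under stated assumptions: trees are encoded as nested tuples, the operator names `seq`, `xor`, `and`, and `loop` as well as the functions `shuffle` and `language` are hypothetical, and loops are unrolled at most `max_redo` times because an unbounded loop has infinitely many traces.

```python
TAU = "tau"  # silent activity: contributes nothing to a trace

def shuffle(s, t):
    """Return all interleavings of the two tuples s and t."""
    if not s:
        return {t}
    if not t:
        return {s}
    return ({(s[0],) + r for r in shuffle(s[1:], t)}
            | {(t[0],) + r for r in shuffle(s, t[1:])})

def language(tree, max_redo=1):
    """Traces of a process tree, with each loop redone at most max_redo times."""
    if isinstance(tree, str):                      # leaf: activity or tau
        return {()} if tree == TAU else {(tree,)}
    op, *children = tree
    langs = [language(child, max_redo) for child in children]
    if op == "xor":                                # exclusive choice
        return set().union(*langs)
    if op == "seq":                                # sequential composition
        result = {()}
        for lang in langs:
            result = {s + t for s in result for t in lang}
        return result
    if op == "and":                                # parallel composition
        result = {()}
        for lang in langs:
            result = {r for s in result for t in lang for r in shuffle(s, t)}
        return result
    if op == "loop":                               # redo loop: do, then (redo, do) repeated
        do, redo = langs
        result, current = set(do), set(do)
        for _ in range(max_redo):
            current = {s + r + d for s in current for r in redo for d in do}
            result |= current
        return result
    raise ValueError(f"unknown operator: {op}")

# The process tree of Fig. 9 in the tuple encoding assumed here.
PIZZA = ("seq", "bi", "cb",
         ("and", ("loop", "ac", TAU), "at", ("xor", "as", "am")),
         "bo", ("xor", "ep", TAU), "ck")

print(sorted(language(("loop", "a", TAU), max_redo=2)))  # a executed one to three times
print(len(language(PIZZA, max_redo=0)))                  # number of traces without repeated cheese
```

With `max_redo=0`, the three toppings can be interleaved in 3! = 6 ways, there are two topping choices and two options for eating, so the sketch yields 6 · 2 · 2 = 24 traces.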

The DFG shown in Fig. 10 incorporates the three extensions. Again, the behavior is different from Figs. 7, 8, and 9. Unlike the other models, the DFG allows for adding multiple servings of salami, mushrooms, and tomato paste. It is impossible to model concurrency properly, because loops are added the moment the order is not fixed. Therefore, DFGs are suitable for a quick first view of the process, but for more advanced process analytics, higher-level notations such as BPMN, Petri nets, and process trees are needed.

**Fig. 10.** DFG of the extended "pizza process". Note that the process becomes increasingly Spaghetti-like, allowing for process executions different from the BPMN model, the Petri net, and the process tree.

Note that, in this section, we focused on control-flow. However, process models can be extended with frequencies, probabilities, decision rules, roles, costs, and time delays (e.g., mean waiting times). After discovering the control-flow and replaying the event data on the model, it is easy to extend process models with data, resource, cost, and time perspectives.

## **3 Event Data**

Using process mining, we would like to analyze and improve processes using event data. Table 1 shows a fragment of an event log in tabular form. One can think of this as a table in a relational database, a CSV (Comma-Separated Values) file, or an Excel spreadsheet. Each row in the table corresponds to an *event*. An event can have many different attributes. In this simple example, each event has five attributes: *case*, *activity*, *timestamp*, *resource*, and *customer*. Most process mining tools and approaches require at least three attributes: *case* (refers to a process instance), *activity* (refers to the operation, action, or task), and *timestamp* (when the event happened). These three attributes are enough to discover and check the control-flow perspective. A case may refer to an order, a patient, an application, a student, a loan, a car, a suitcase, a speeding ticket, etc. In Table 1, each case refers to a pizza being produced and consumed. In Sect. 2, we showed process models describing this process. However, now we start from the observed behavior recorded in the event log. We can witness the same activities as before: buy ingredients (*bi*), create base (*cb*), add cheese (*ac*), add tomato (*at*), add salami (*as*), add mushrooms (*am*), bake in oven (*bo*), eat pizza (*ep*), and clean kitchen (*ck*). Table 1 uses a simple time format (e.g., *18:10*) to simplify the presentation (i.e., we skipped the date). Systems often use the ISO 8601 standard (or similar) to exchange date- and time-related data, e.g., *2021-09-21T18:10:00+00:00*. In the remainder, we formalize event data and provide useful notions to reason about both observed and modeled behavior. We start with some basic mathematical notations.
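As a minimal sketch of what a single row looks like in code, the event below uses the five attributes of the running example and parses the ISO 8601 timestamp with the Python standard library. The dictionary keys and the name `MANDATORY` are illustrative assumptions, not a prescribed schema.

```python
from datetime import datetime

# One event (row) with the five attributes used in the running example.
event = {
    "case": "pizza-56",
    "activity": "bi",  # buy ingredients
    "timestamp": datetime.fromisoformat("2021-09-21T18:10:00+00:00"),
    "resource": "Stefano",
    "customer": "Valentina",
}

# The three attributes most tools require (hypothetical constant name).
MANDATORY = ("case", "activity", "timestamp")
has_mandatory = all(event.get(key) is not None for key in MANDATORY)

# The abbreviated presentation used in the table drops the date and the offset.
short_time = event["timestamp"].strftime("%H:%M")
print(has_mandatory, short_time)
```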

**Table 1.** Fragment of a larger event log with 6400 events, i.e., the whole table has 6400 rows. These events describe the production of 800 pizzas. Each row refers to an event having five attributes, including the three mandatory ones: case, activity, and timestamp.



#### **3.1 Notations**

$\mathcal{B}(A)$ is the set of all *multisets* over some set $A$. For some multiset $b \in \mathcal{B}(A)$, $b(a)$ denotes the number of times element $a \in A$ appears in $b$. Some examples: $b_1 = [\,]$, $b_2 = [x, x, y]$, $b_3 = [x, y, z]$, $b_4 = [x, x, y, x, y, z]$, and $b_5 = [x^3, y^2, z]$ are multisets over $A = \{x, y, z\}$. $b_1$ is the empty multiset, $b_2$ and $b_3$ both consist of three elements, and $b_4 = b_5$, i.e., the ordering of elements is irrelevant and a more compact notation may be used for repeating elements. The standard set operators can be extended to multisets, e.g., $x \in b_2$, $b_2 \uplus b_3 = b_4$, $b_5 \setminus b_2 = b_3$, $|b_5| = 6$, etc. $\{a \in b\}$ denotes the set with all elements $a$ for which $b(a) \geq 1$. $b(X) = \sum_{a \in X} b(a)$ is the number of elements in $b$ belonging to set $X$, e.g., $b_5(\{x, y\}) = 3 + 2 = 5$. $b \leq b'$ if $b(a) \leq b'(a)$ for all $a \in A$. Hence, $b_3 \leq b_4$ and $b_2 \not\leq b_3$ (because $b_2$ has two $x$'s). $b < b'$ if $b \leq b'$ and $b \neq b'$. Hence, $b_3 < b_4$ and $b_4 \not< b_5$ (because $b_4 = b_5$).
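The multiset examples above can be checked with Python's `collections.Counter`, which behaves like a multiset for addition and subtraction. The helper names `size`, `count_in`, `leq`, and `lt` are introduced here for illustration only.

```python
from collections import Counter

b1 = Counter()
b2 = Counter(["x", "x", "y"])
b3 = Counter(["x", "y", "z"])
b4 = Counter(["x", "x", "y", "x", "y", "z"])
b5 = Counter({"x": 3, "y": 2, "z": 1})   # compact notation [x^3, y^2, z]

def size(b):
    """|b|: total number of elements, counting repetitions."""
    return sum(b.values())

def count_in(b, X):
    """b(X): number of elements of b that belong to the set X."""
    return sum(b[a] for a in X)

def leq(b, c):
    """b <= c: every element occurs in c at least as often as in b."""
    return all(n <= c[a] for a, n in b.items())

def lt(b, c):
    """b < c: b <= c and b differs from c."""
    return leq(b, c) and b != c

print(b4 == b5)                   # ordering of elements is irrelevant
print(b2 + b3 == b4)              # multiset union (sum)
print(b5 - b2 == b3)              # multiset difference
print(size(b5), count_in(b5, {"x", "y"}))
print(leq(b3, b4), leq(b2, b3), lt(b3, b4), lt(b4, b5))
```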

$\sigma = \langle a_1, a_2, \ldots, a_n \rangle \in X^*$ denotes a *sequence* over $X$ of length $|\sigma| = n$. $\sigma_i = a_i$ for $1 \leq i \leq |\sigma|$. $\langle\,\rangle$ is the empty sequence. $\sigma_1 \cdot \sigma_2$ is the concatenation of two sequences, e.g., $\langle x, x, y \rangle \cdot \langle x, y, z \rangle = \langle x, x, y, x, y, z \rangle$. The notation $[a \in \sigma]$ can be used to convert a sequence into a multiset, e.g., $[a \in \langle x, x, y, x, y, z \rangle] = [x^3, y^2, z]$.

$f \in X \to Y$ is a total function, i.e., $f(x) \in Y$ for any $x \in X$. $f \in X \nrightarrow Y$ is a partial function with domain $\mathit{dom}(f) \subseteq X$. If $x \notin \mathit{dom}(f)$, then we write $f(x) = \bot$, i.e., the function is not defined for $x$.

#### **3.2 Standard Event Log**

An event log is a collection of events. An event $e$ can have any number of attributes, and often we require the following three attributes to be present: case $\#_{\mathit{case}}(e)$, activity $\#_{\mathit{act}}(e)$, and timestamp $\#_{\mathit{time}}(e)$. Table 1 shows example events. If $e$ is the first visible event, then $\#_{\mathit{case}}(e) = \text{pizza-56}$, $\#_{\mathit{act}}(e) = \mathit{bi}$ (buy ingredients), and $\#_{\mathit{time}}(e) = \text{18:10}$. For simplicity, we write *18:10*, but the full timestamp includes a date and possibly also seconds and milliseconds.

To formalize event logs, we introduce some basic notations.

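**Definition 1 (Universes).** $\mathcal{U}_{\mathit{ev}}$ *is the universe of events,* $\mathcal{U}_{\mathit{act}}$ *is the universe of activities,* $\mathcal{U}_{\mathit{case}}$ *is the universe of cases,* $\mathcal{U}_{\mathit{time}}$ *is the universe of timestamps,* $\mathcal{U}_{\mathit{att}} = \{\mathit{act}, \mathit{case}, \mathit{time}, \ldots\}$ *is the universe of attributes,* $\mathcal{U}_{\mathit{val}}$ *is the universe of values, and* $\mathcal{U}_{\mathit{map}} = \mathcal{U}_{\mathit{att}} \nrightarrow \mathcal{U}_{\mathit{val}}$ *is the universe of attribute-value mappings. We assume that* $\mathcal{U}_{\mathit{act}} \cup \mathcal{U}_{\mathit{case}} \cup \mathcal{U}_{\mathit{time}} \subseteq \mathcal{U}_{\mathit{val}}$*,* $\bot \notin \mathcal{U}_{\mathit{val}}$*, and for any* $f \in \mathcal{U}_{\mathit{map}}$*:* $f(\mathit{act}) \in \mathcal{U}_{\mathit{act}} \cup \{\bot\}$*,* $f(\mathit{case}) \in \mathcal{U}_{\mathit{case}} \cup \{\bot\}$*, and* $f(\mathit{time}) \in \mathcal{U}_{\mathit{time}} \cup \{\bot\}$*.*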

Note that standard attributes of an event (activity, case, timestamp, etc.) are treated as any other attribute. $f \in \mathcal{U}_{\mathit{map}}$ is a function mapping any subset of attributes onto values. For example, $f$ could be such that $\mathit{dom}(f) = \{\mathit{case}, \mathit{act}, \mathit{time}, \mathit{resource}, \mathit{customer}, \mathit{cost}, \mathit{size}\}$, $f(\mathit{case}) = \text{pizza-56}$, $f(\mathit{act}) = \mathit{bi}$, $f(\mathit{time}) = \text{2021-09-21T18:10:00+00:00}$, $f(\mathit{resource}) = \text{Stefano}$, $f(\mathit{customer}) = \text{Valentina}$, $f(\mathit{size}) = \text{33cm}$, and $f(\mathit{cost}) = \text{€9.99}$. Note that the last two attributes are not shown in Table 1 and that *2021-09-21T18:10:00+00:00* is abbreviated to *18:10*.

To be general, we assume that events are partially ordered. Recall that a strict partial order is *irreflexive* ($e \nprec e$), *transitive* ($e_1 \prec e_2$ and $e_2 \prec e_3$ implies $e_1 \prec e_3$), and *asymmetric* (if $e_1 \prec e_2$, then $e_2 \nprec e_1$).

**Definition 2 (Event Log).** *An event log is a tuple* $L = (E, \#, \prec)$ *consisting of a set of events* $E \subseteq \mathcal{U}_{\mathit{ev}}$*, a mapping* $\# \in E \to \mathcal{U}_{\mathit{map}}$*, and a strict partial ordering* ${\prec} \subseteq E \times E$ *on events.*

*For any* $e \in E$ *and* $\mathit{att} \in \mathit{dom}(\#(e))$*:* $\#_{\mathit{att}}(e) = \#(e)(\mathit{att})$ *is the value of attribute* $\mathit{att}$ *for event* $e$*. For example,* $\#_{\mathit{act}}(e)$*,* $\#_{\mathit{case}}(e)$*, and* $\#_{\mathit{time}}(e)$ *are the activity, case, and timestamp of an event* $e$*.*

*The ordering of events respects time, i.e., if* $e_1, e_2 \in E$*,* $\#_{\mathit{time}}(e_1) \neq \bot$*,* $\#_{\mathit{time}}(e_2) \neq \bot$*, and* $\#_{\mathit{time}}(e_1) < \#_{\mathit{time}}(e_2)$*, then* $e_2 \nprec e_1$*.*

To be general, events can have any number of attributes and no attribute is mandatory. However, when using simplified event logs, we only consider events having a case and activity (with an order derived using timestamps).

Assume $L = (E, \#, \prec)$ is the event log in Table 1. The whole event log has 6400 events, i.e., the table has many more rows. Let $E = \{e_1, e_2, \ldots, e_{6400}\}$ be the whole set of events and assume the first event shown in Table 1 is $e_{433}$. $\#(e_{433})$ is a mapping with $\mathit{dom}(\#(e_{433})) = \{\mathit{case}, \mathit{act}, \mathit{time}, \mathit{resource}, \mathit{customer}\}$ (the columns shown in the table). $\#_{\mathit{case}}(e_{433}) = \text{pizza-56}$, $\#_{\mathit{act}}(e_{433}) = \mathit{bi}$ (buy ingredients), $\#_{\mathit{time}}(e_{433}) = \text{18:10}$, $\#_{\mathit{resource}}(e_{433}) = \text{Stefano}$, and $\#_{\mathit{customer}}(e_{433}) = \text{Valentina}$. Assuming that the event identifiers follow the order shown in Table 1, the last event visible in the table is $e_{456}$, and $\#_{\mathit{case}}(e_{456}) = \text{pizza-58}$, $\#_{\mathit{act}}(e_{456}) = \mathit{ck}$ (clean kitchen), $\#_{\mathit{time}}(e_{456}) = \text{20:51}$, $\#_{\mathit{resource}}(e_{456}) = \text{Mario}$, and $\#_{\mathit{customer}}(e_{456}) = \text{Laura}$. Assuming a total order as shown in the table, $e_{433} \prec e_{434}$, $e_{434} \prec e_{435}$, $e_{455} \prec e_{456}$, $e_{433} \prec e_{456}$, etc.

As stated in Definition 2, $\prec$ is a strict partial order, and the timestamps (when present) and the partial order are not allowed to disagree. Consider the events $e_{433}$ and $e_{456}$ in Table 1: it cannot be that $e_{456} \prec e_{433}$, because $\#_{\mathit{time}}(e_{456}) > \#_{\mathit{time}}(e_{433})$. For two arbitrary events $e_1$ and $e_2$, it cannot be that both $\#_{\mathit{time}}(e_1) < \#_{\mathit{time}}(e_2)$ and $e_2 \prec e_1$. However, it can be that $\#_{\mathit{time}}(e_1) < \#_{\mathit{time}}(e_2)$ and $e_1 \nprec e_2$ (the time perspective is more fine-grained) or that $\#_{\mathit{time}}(e_1) = \#_{\mathit{time}}(e_2)$ and $e_1 \prec e_2$ (the partial order is more fine-grained). Optionally, the partial order can be derived from the timestamps (when present): ${\prec} = \{(e_1, e_2) \in E \times E \mid \#_{\mathit{time}}(e_1) < \#_{\mathit{time}}(e_2)\}$. In this case, the event log is fully defined by $L = (E, \#)$ (no explicit ordering relation is needed).
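The relation between timestamps and the partial order can be sketched as follows. The functions `order_from_timestamps` and `respects_time` are hypothetical helpers, timestamps are simplified to `HH:MM` strings, a missing timestamp (⊥) is represented by `None`, and the event `e_x` is made up to show an event without a timestamp.

```python
def order_from_timestamps(time_of):
    """Derive the order {(e1, e2) | time(e1) < time(e2)} from timestamps.

    Events without a timestamp (None) are not related to any other event.
    """
    return {(e1, e2)
            for e1, t1 in time_of.items()
            for e2, t2 in time_of.items()
            if t1 is not None and t2 is not None and t1 < t2}

def respects_time(time_of, order):
    """Check that no pair (e1, e2) in the order contradicts the timestamps."""
    for e1, e2 in order:                      # e1 precedes e2
        t1, t2 = time_of[e1], time_of[e2]
        if t1 is not None and t2 is not None and t2 < t1:
            return False
    return True

# e433 and e456 use the times mentioned in the text; e_x is a made-up event.
time_of = {"e433": "18:10", "e456": "20:51", "e_x": None}

derived = order_from_timestamps(time_of)
print(derived)
print(respects_time(time_of, derived))
print(respects_time(time_of, {("e456", "e433")}))  # contradicts the timestamps
```

An order that relates `e_x` to other events is accepted by `respects_time`, because the constraint only applies when both timestamps are present.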

It should be noted that in the often-used BPI Challenge 2011 log provided by a Dutch academic hospital [16], 85% of the events have the same timestamp as the previous one. This is because, for many events, only dates are available. Many publicly available event logs have similar issues. For example, in the so-called Sepsis log [30], 30% of the events have the same timestamp as the previous one. In this event log, activities for the same case are sometimes batched, leading to events with the same timestamp. These examples illustrate that one should inspect timestamps and not take the order in the event log for granted. It may be beneficial to use partially ordered event data in case of data quality problems or when there is explicit causal information.
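A quick check of this kind of data quality issue could look like the sketch below. It assumes the timestamps are given in log order and reports the share of events whose timestamp equals that of the preceding event. The function name and the sample dates are made up, and the choice of denominator (all events) is an assumption, since the text does not spell out how the percentages were computed.

```python
def share_same_as_previous(timestamps):
    """Fraction of events whose timestamp equals that of the previous event."""
    if not timestamps:
        return 0.0
    repeats = sum(1 for prev, cur in zip(timestamps, timestamps[1:]) if prev == cur)
    return repeats / len(timestamps)

# Made-up example where only dates are recorded.
dates = ["2021-09-21", "2021-09-21", "2021-09-21", "2021-09-22"]
print(share_same_as_previous(dates))
```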

#### **3.3 Simplified Event Log**

For process mining techniques focusing on control-flow, it often suffices to focus only on the activity attribute and the ordering within a case. This leads to a much simpler event log notion.

**Definition 3 (Simplified Event Log).** *A simplified event log* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *is a multiset of traces. A trace* $\sigma = \langle a_1, a_2, \ldots, a_n \rangle \in \mathcal{U}_{\mathit{act}}^*$ *is a sequence of activities.* $L(\sigma)$ *is the number of times trace* $\sigma$ *appears in event log* $L$*.*

Consider case pizza-56 in Table 1. There are eight events having this case attribute. By ordering these events based on their timestamps we get the trace $\sigma_{\text{pizza-56}} = \langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle$. We can do the same for the other two cases shown in Table 1: $\sigma_{\text{pizza-57}} = \langle \mathit{bi}, \mathit{cb}, \mathit{at}, \mathit{ac}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle$ and $\sigma_{\text{pizza-58}} = \langle \mathit{bi}, \mathit{cb}, \mathit{as}, \mathit{at}, \mathit{ac}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle$. We are using the same shorthands as before, i.e., buy ingredients (*bi*), create base (*cb*), add cheese (*ac*), add tomato (*at*), add salami (*as*), add mushrooms (*am*), bake in oven (*bo*), eat pizza (*ep*), and clean kitchen (*ck*).

The same trace may appear multiple times in a log. For example, $L = [\langle a, b, c, e \rangle^{10}, \langle a, c, b, e \rangle^{5}, \langle a, d, e \rangle]$ is a simple event log with $10 + 5 + 1 = 16$ cases and $40 + 20 + 3 = 63$ events.
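The counting in this example can be reproduced with a `Counter` keyed by traces (tuples of activities). This representation is an illustrative choice, not part of the definition.

```python
from collections import Counter

# Simplified event log: a multiset of traces.
L = Counter({
    ("a", "b", "c", "e"): 10,
    ("a", "c", "b", "e"): 5,
    ("a", "d", "e"): 1,
})

num_cases = sum(L.values())                                   # 10 + 5 + 1
num_events = sum(len(trace) * n for trace, n in L.items())    # 40 + 20 + 3
print(num_cases, num_events)
```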

An event log with events having any number of attributes (Definition 2) can be transformed into a *simplified event log* by ignoring the additional attributes and sequentializing the events belonging to the same case. Events without a case or activity attribute are ignored in the transformation process.

**Definition 4 (Conversion).** *An event log* $L = (E, \#, \prec)$ *defines a simplified event log* $\tilde{L} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *that is constructed as follows:*

	- $E_c = \{e \in E \mid \#_{\mathit{case}}(e) = c\}$ *are the events in* $c$*,*
	- $\sigma_c = \langle e_1, e_2, \ldots, e_n \rangle$ *is a (deterministically chosen) sequentialization of the events in* $c$*, i.e.,* $\sigma_c$ *is such that* $\{e_1, e_2, \ldots, e_n\} = E_c$*,* $|E_c| = |\sigma_c|$*, and for any* $1 \leq i < j \leq n$*:* $e_j \nprec e_i$*.*
	- $\tilde{\sigma}_c = \langle \#_{\mathit{act}}(e_1), \#_{\mathit{act}}(e_2), \ldots, \#_{\mathit{act}}(e_n) \rangle \in A^*$ *is the trace corresponding to* $c$ *(i.e., the events in* $\sigma_c$ *are replaced by the corresponding activities).*

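Let $L = (E, \#, \prec)$ be the event log corresponding to the events visible in Table 1 (assuming the order in the table). Then: $\tilde{L} = [\langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle, \langle \mathit{bi}, \mathit{cb}, \mathit{at}, \mathit{ac}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle, \langle \mathit{bi}, \mathit{cb}, \mathit{as}, \mathit{at}, \mathit{ac}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle]$. Table 1 only shows a fragment of the whole event log. For the whole event log $L = (E, \#, \prec)$, we have $\tilde{L} = [\langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{400}, \langle \mathit{bi}, \mathit{cb}, \mathit{at}, \mathit{ac}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{200}, \langle \mathit{bi}, \mathit{cb}, \mathit{as}, \mathit{at}, \mathit{ac}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{100}, \langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{as}, \mathit{at}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{50}, \langle \mathit{bi}, \mathit{cb}, \mathit{at}, \mathit{as}, \mathit{ac}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{25}, \langle \mathit{bi}, \mathit{cb}, \mathit{as}, \mathit{ac}, \mathit{at}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle^{25}]$. This event log has 800 cases and 6400 events. Using process discovery techniques we can automatically discover the models in Figs. 3, 4, 5 and 6 from such an event log. If the event log also has cases where cheese is added multiple times (e.g., $\langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{ac}, \mathit{ac}, \mathit{as}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle$), mushrooms are added instead of salami (e.g., $\langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{am}, \mathit{bo}, \mathit{ep}, \mathit{ck} \rangle$), and the eating activity is skipped (e.g., $\langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ck} \rangle$), then we can automatically discover the models in Figs. 7, 8, 9 and 10 using suitable process mining techniques.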
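The conversion of Definition 4 can be sketched in a few lines. The sketch assumes that events are dictionaries with the keys `case`, `act`, and `time`, that every retained event has a comparable `time` value, and that the order is derived from timestamps (ties keep their input order, which makes the sequentialization deterministic). The function name `to_simplified_log` and the integer timestamps in the tiny sample are made up.

```python
from collections import Counter, defaultdict

def to_simplified_log(events):
    """Group events by case, order them by time, and keep only the activities."""
    by_case = defaultdict(list)
    for e in events:
        # Events without a case or activity are ignored.
        if e.get("case") is None or e.get("act") is None:
            continue
        by_case[e["case"]].append(e)
    log = Counter()
    for case_events in by_case.values():
        ordered = sorted(case_events, key=lambda e: e["time"])  # stable sort
        log[tuple(e["act"] for e in ordered)] += 1
    return log

# Made-up, truncated sample: two interleaved cases and one event without a case.
events = [
    {"case": "pizza-56", "act": "bi", "time": 1},
    {"case": "pizza-57", "act": "bi", "time": 2},
    {"case": "pizza-56", "act": "cb", "time": 3},
    {"case": "pizza-57", "act": "cb", "time": 4},
    {"case": "pizza-57", "act": "at", "time": 5},
    {"case": "pizza-56", "act": "ac", "time": 6},
    {"act": "ck", "time": 7},
]

simplified = to_simplified_log(events)
print(simplified)
```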

#### **3.4 Object-Centric Event Logs**

Table 1 corresponds to a conventional "flat" event log where each event (i.e., row) refers to a case, activity, and timestamp. It is very natural to assume that an event has indeed a timestamp and refers to an activity. However, the assumption that it refers to precisely one case may cause problems [4]. *Object-Centric Event Logs (OCEL)* aim to overcome this limitation [22]. In OCEL, an event may refer to any number of objects (of different types) rather than a single case. Object-centric process mining techniques may produce Petri nets with different types of objects [7] or artifact-centric process models [18,19].


**Table 2.** Fragment of a larger Object-Centric Event Log (OCEL) with four types of objects: pizza, resource, customer, and location. One event may refer to a set of objects, e.g., three pizzas, three customers, and a location.

To understand the problem, we use Table 2, which shows OCEL data in tabular form. Compared to Table 1, we do not assume a single case notion. Instead, an event may refer to any number of objects. In this toy example, we assume four types of objects: pizza, resource, customer, and location. Assume that $e$ is the first event listed in Table 2. $\#_{\mathit{act}}(e) = \mathit{bi}$ (buy ingredients), $\#_{\mathit{time}}(e) = \text{18:10}$, $\#_{\mathit{pizza}}(e) = \{\text{pizza-56}, \text{pizza-57}, \text{pizza-58}\}$, $\#_{\mathit{resource}}(e) = \{\text{Stefano}\}$, $\#_{\mathit{customer}}(e) = \{\text{Valentina}, \text{Giulia}, \text{Laura}\}$, and $\#_{\mathit{location}}(e) = \{\text{supermarket}\}$. Note that in Table 1 there were three *bi* (buy ingredients) events, one for each pizza. Hence, Table 2 is closer to reality if the ingredients were indeed bought in the same visit to the supermarket. In a classical event log with a single case identifier, we need to artificially replicate events (one *bi* event per pizza). This may lead to misleading statistics, i.e., there was just one trip to the supermarket and not three. The three pizzas were created on demand, so the *bi* event also refers to the three customers. Table 2 also shows that creating the pizza base is teamwork, i.e., all *cb* events are done by both Mario and Stefano. If we assume that $e$ is the last event visible in Table 2, then $\#_{\mathit{act}}(e) = \mathit{ck}$ (clean kitchen), $\#_{\mathit{time}}(e) = \text{20:51}$, $\#_{\mathit{pizza}}(e) = \emptyset$, $\#_{\mathit{resource}}(e) = \{\text{Mario}\}$, $\#_{\mathit{customer}}(e) = \emptyset$, and $\#_{\mathit{location}}(e) = \{\text{kitchen-2}\}$. This expresses that, according to this event log, cleaning the second kitchen is unrelated to the pizza prepared in it.

Definition 2 can be easily extended to allow for *Object-Centric Event Logs* (OCEL). We just need to assume that event attributes include object types and that attribute-value mappings may yield sets of values (e.g., objects) rather than individual values. Without fully formalizing this, we simply assume that $\mathcal{U}_{\mathit{objtyp}} \subseteq \mathcal{U}_{\mathit{att}}$ is the universe of *object types*, $\mathcal{U}_{\mathit{objs}}$ is the universe of *objects*, and $\mathcal{P}(\mathcal{U}_{\mathit{objs}}) \subseteq \mathcal{U}_{\mathit{val}}$ (i.e., values can be sets of objects). Moreover, for any $f \in \mathcal{U}_{\mathit{map}}$ and $\mathit{ot} \in \mathcal{U}_{\mathit{objtyp}} \cap \mathit{dom}(f)$: $f(\mathit{ot}) \subseteq \mathcal{U}_{\mathit{objs}}$. Hence, *attribute-value mappings can also be used to map object types onto sets of objects*.

To apply classical process mining techniques, we need to convert the object-centric event data to traditional event data. For example, we need to convert Table 2 into Table 1 if we pick object type *pizza* as a case notion. This is called "flattening the event log" and always requires picking an object type as a case notion. This can be formalized in a rather straightforward manner.

**Definition 5 (OCEL Conversion).** *Let* $L = (E, \#, \prec)$ *be an event log having an object type* $\mathit{ot} \in \mathcal{U}_{\mathit{objtyp}}$ *such that for any* $e \in E$*:* $\#_{\mathit{ot}}(e) \subseteq \mathcal{U}_{\mathit{objs}}$ *is the set of objects of type* $\mathit{ot}$ *involved in event* $e$*. Based on this assumption, we can create a "flattened event log"* $\tilde{L}_{\mathit{ot}} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *that is constructed as follows:*

	- $E_o = \{e \in E \mid o \in \#_{\mathit{ot}}(e)\}$ *are the events involving object* $o$*,*
	- $\sigma_o = \langle e_1, e_2, \ldots, e_n \rangle$ *is a (deterministically chosen) sequentialization of the events involving* $o$*, i.e.,* $\sigma_o$ *is such that* $\{e_1, e_2, \ldots, e_n\} = E_o$*,* $|E_o| = |\sigma_o|$*, and for any* $1 \leq i < j \leq n$*:* $e_j \nprec e_i$*.*
	- $\tilde{\sigma}_o = \langle \#_{\mathit{act}}(e_1), \#_{\mathit{act}}(e_2), \ldots, \#_{\mathit{act}}(e_n) \rangle \in A^*$ *is the trace corresponding to* $o$ *(i.e., the events in* $\sigma_o$ *are replaced by the corresponding activities).*

Definition 5 shows that any OCEL can be transformed into a simplified event log. The simplified event log is a multiset of traces where each trace refers to the "lifecycle" of an object. Consider for example $\tilde{\sigma}_{\text{pizza-56}} = \langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ep} \rangle$ showing the lifecycle of pizza-56 in Table 2. $\tilde{\sigma}_{\text{Stefano}} = \langle \mathit{bi}, \mathit{cb}, \mathit{cb}, \mathit{bo}, \mathit{bo}, \mathit{cb}, \mathit{bo} \rangle$ is the trace corresponding to resource Stefano. $\tilde{\sigma}_{\text{Valentina}} = \langle \mathit{bi}, \mathit{cb}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{ep} \rangle$ is the trace corresponding to customer Valentina. This trace is now the same as $\tilde{\sigma}_{\text{pizza-56}}$, but this would not be the case if Valentina eats multiple pizzas (e.g., in subsequent visits to the restaurant). $\tilde{\sigma}_{\text{supermarket}} = \langle \mathit{bi} \rangle$ is the trace corresponding to the location "supermarket" (assuming there was just one visit to the supermarket). $\tilde{\sigma}_{\text{restaurant}} = \langle \mathit{ep}, \mathit{ep}, \mathit{ep} \rangle$ is the trace corresponding to the location "restaurant" (again considering only the events visible in Table 2). These traces are rather short because we only consider the events shown in Table 2.

By converting an OCEL to a conventional event log, we can apply all existing process mining techniques. For each object type, we can create a process model showing the "flow of objects" of that type. However, flattening the event log using *ot* as a case notion potentially leads to three problems: *deficiency*, *convergence*, and *divergence*.


The first two problems are easy to understand: events disappear completely (deficiency) or are replicated leading to potentially misleading management information (convergence). The problem of divergence is more subtle. To understand this better, consider $\tilde{\sigma}_{\text{kitchen-1}} = \langle \mathit{cb}, \mathit{cb}, \mathit{at}, \mathit{ac}, \mathit{ac}, \mathit{at}, \mathit{as}, \mathit{bo}, \mathit{as}, \mathit{bo}, \mathit{ck} \rangle$ describing the "lifecycle" of the first kitchen. In this trace one can see *cb* followed by *cb* (two subsequent create pizza base events) and *ac* followed by *ac* (two subsequent add cheese events). However, these events refer to different pizzas and are not causally related. The discovered process model is likely to show loops involving *cb* and *ac*, although these events occur precisely once per pizza.
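Flattening, and the deficiency and convergence problems it can cause, can be illustrated with a small sketch. The function `flatten` and the dictionary layout are assumptions made for illustration. The first and last events follow the description of Table 2 in the text, while the middle *cb* event (its time and location in particular) is made up.

```python
from collections import Counter, defaultdict

def flatten(events, ot):
    """Flatten object-centric events using object type `ot` as the case notion."""
    per_object = defaultdict(list)
    for e in sorted(events, key=lambda e: e["time"]):
        for o in e.get(ot, ()):           # one copy of the event per object of type ot
            per_object[o].append(e["act"])
    return Counter(tuple(trace) for trace in per_object.values())

events = [
    {"act": "bi", "time": "18:10",
     "pizza": {"pizza-56", "pizza-57", "pizza-58"},
     "resource": {"Stefano"},
     "customer": {"Valentina", "Giulia", "Laura"},
     "location": {"supermarket"}},
    # Made-up event for illustration.
    {"act": "cb", "time": "18:20",
     "pizza": {"pizza-56"},
     "resource": {"Mario", "Stefano"},
     "customer": {"Valentina"},
     "location": {"kitchen-1"}},
    {"act": "ck", "time": "20:51",
     "pizza": set(),
     "resource": {"Mario"},
     "customer": set(),
     "location": {"kitchen-2"}},
]

by_pizza = flatten(events, "pizza")
by_location = flatten(events, "location")
print(by_pizza)      # bi appears three times (convergence), ck is gone (deficiency)
print(by_location)
```

With *pizza* as the case notion, the single *bi* event is replicated for each of the three pizzas, and the *ck* event disappears because it refers to no pizza. With *location* as the case notion, every event is kept exactly once in this sample.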

In summary, one can create different views on the process by flattening the event data for selected object types, but one should be careful to interpret these correctly (e.g., be aware of data duplication and the blurring of causalities).

The running "pizza process" example is not very realistic, and is only used to introduce the basic concepts in a clear manner. Earlier, we mentioned CRM systems like Salesforce and ERP systems like SAP S/4HANA, Oracle E-Business Suite, and Microsoft Dynamics 365. These systems are loaded with event data scattered over many database tables. ERP and CRM systems are widely used, broad in scope, and sector-agnostic. Also, more sector-specific systems used in banking, insurance, and healthcare have event data distributed over numerous tables. These tables refer to different types of objects that are often in a one-to-many or many-to-many relation. This immediately leads to the challenges described before.

Let us consider two of the processes almost any organization has: *Purchase-to-Pay* (P2P) and *Order-to-Cash* (O2C). The P2P process is concerned with the buy-side of an organization. The O2C process is concerned with the sell-side of a company. In the P2P process the organization is dealing with purchasing documents, items, suppliers, purchase requisitions, contracts, receipts, etc. Note that there may be many purchase orders per supplier and an order may consist of multiple items. Hence, events may refer to different objects and also multiple objects of the same type. In the O2C process, we can witness similar phenomena. A customer may place three orders on the same day and each order may have several items. Items from different orders may end up in the same delivery. Moreover, items in the same order may end up in different deliveries.

P2P and O2C processes are considered simple and there is a lot of experience with extracting such data from systems such as SAP. Still, these processes are more complicated than what many people think. It is not uncommon to find thousands of process variants. This offers great opportunities for process mining, because unexpected variants provide hints on how to improve the process. However, one should not underestimate the efforts needed for data extraction. Therefore, we discussed OCEL as it sits in-between the real database tables in systems such as SAP, Oracle, and Salesforce, and the flattened event logs assumed by most systems.

#### **3.5 XES Standard**

The initial version of the XES (eXtensible Event Stream) format was defined by the *IEEE Task Force on Process Mining* in September 2010. After several iterations, XES became the official IEEE standard for storing event data in 2016 [24]. XES is supported by most of the open-source process mining tools and many of the leading commercial tools. The goal is to facilitate the seamless exchange of event data between different systems. Of course, it is also possible to do this using relational databases or simple file formats. However, XES adds semantics to the data exchanged. Therefore, we focus on the concepts and refer to [24] for the syntax.

Figure 11 shows the XES meta model expressed in terms of a UML class diagram. A XES document (e.g., an XML file) contains one log consisting of any number of traces. Each trace describes a sequential list of events corresponding to a particular case. The log, its traces, and its events may have any number of attributes. Attributes may be nested. There are five core types: *String*, *Date*, *Int*, *Float*, and *Boolean*. XES does not prescribe a fixed set of mandatory attributes for each element (log, trace, and event), e.g., an event can have any number of attributes. However, to provide semantics for such attributes, the log refers to so-called XES *extensions*. An extension gives semantics to particular attributes. For example, the *Time extension* defines a timestamp attribute of type *xs:dateTime*. This corresponds to the $\#_{\mathit{time}}(e)$ attribute used before. The *Organizational extension* defines a resource attribute of type *xs:string*, i.e., the $\#_{\mathit{resource}}(e)$ attribute. Users can define their own extensions. For example, it is possible to develop domain-specific or even organization-specific extensions.

**Fig. 11.** Meta model of XES [24]. A log contains traces and each trace contains events [2,24]. Log, traces, and events have attributes. Extensions may define new attributes and a log should declare the extensions used in it. Global attributes are attributes that are declared to be mandatory. Such attributes reside at the trace or event level. Attributes may be nested. Event classifiers are defined for the log and assign a "label" (e.g., activity name) to each event. There may be multiple classifiers.

XES also supports three concepts that are of general interest and important for process mining: *classifiers*, *lifecycle* information, and *activity instances*. These concepts are interrelated as is discussed next.

Classifiers are used to attach labels to events. There is always at least one classifier; by default, this is the activity name. When turning an event log $L$ into a simplified event log $\tilde{L} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ in Definition 4, we are using this default classifier: each event $e$ is mapped onto $\#_{\mathit{act}}(e)$. However, it is also possible to project events onto resources, locations, departments, etc., or combinations of attributes. An event classifier assigns to each event an identity, which makes it comparable to other events (via their assigned identity). Event classifiers are defined for the whole log, and there may be an arbitrary number of classifiers.
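One way to picture a classifier is as a function from events to labels. The sketch below is only an illustration on dictionary-based events: the function names are hypothetical, the two sample events are made up, and this is not the XES syntax for declaring classifiers.

```python
def by_activity(e):
    """Default classifier: the activity name."""
    return e["act"]

def by_resource(e):
    """Alternative classifier: project events onto resources."""
    return e["resource"]

def by_activity_and_resource(e):
    """Classifier combining two attributes into one label."""
    return (e["act"], e["resource"])

def project(trace, classifier):
    """Replace each event in a trace by the label the classifier assigns to it."""
    return tuple(classifier(e) for e in trace)

# Made-up trace fragment.
trace = [
    {"act": "bi", "resource": "Stefano"},
    {"act": "cb", "resource": "Mario"},
]

print(project(trace, by_activity))
print(project(trace, by_resource))
print(project(trace, by_activity_and_resource))
```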

Thus far, we implicitly assumed that events are atomic. Therefore, an event has a timestamp. To handle activities that take time, XES provides the possibility to represent *lifecycle information* and to connect events through *activity instances*. An activity instance is a collection of related events that together represent the execution of an activity for a case. For example, an activity instance may be composed of a start event and a complete event. This way, we can derive information about the duration of an activity instance. The XES lifecycle model distinguishes between the following types of events: *schedule*, *assign*, *withdraw*, *reassign*, *start*, *suspend*, *resume*, *abort*, *complete*, *autoskip*, and *manualskip*. Using this XES extension, an event $e$ has an attribute $\#_{\mathit{type}}(e)$. For example, assume that $e_1$ and $e_2$ are two events that belong to the same activity instance and $\#_{\mathit{type}}(e_1) = \text{start}$ and $\#_{\mathit{type}}(e_2) = \text{complete}$. $\#_{\mathit{time}}(e_2) - \#_{\mathit{time}}(e_1)$ is the duration of the activity. Similarly, we can measure waiting times, etc. Note that classifiers can also use lifecycle information, e.g., an event $e$ is identified by the pair $(\#_{\mathit{act}}(e), \#_{\mathit{type}}(e))$. This implies that when we discover process models, there may be activities $(a, \text{start})$ and $(a, \text{complete})$.

Many XES logs contain lifecycle information, but few contain explicit activity instances. This implies that heuristics are needed to link events. For example, $(a, \text{start})$ is coupled to the first $(a, \text{complete})$ following it. However, in the trace $\langle \ldots, (a, \text{start}), \ldots, (a, \text{start}), \ldots, (a, \text{complete}), \ldots, (a, \text{complete}), \ldots \rangle$, there are two possible ways to match starts and ends. Fortunately, it is often possible to extract activity instances from the original data source.
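The ambiguity can be made tangible with a small matching heuristic. The sketch below pairs each complete event with either the oldest open start (first-in, first-out) or the most recent open start (last-in, first-out) of the same activity. The function `match_instances`, the tuple layout `(activity, type, time)`, and the integer timestamps are assumptions for illustration. Neither strategy is prescribed by XES.

```python
def match_instances(trace, lifo=False):
    """Pair start and complete events of the same activity into activity instances.

    Returns a list of (activity, start_time, complete_time) tuples.
    """
    open_starts = {}     # activity -> list of start times not yet matched
    instances = []
    for act, typ, time in trace:
        if typ == "start":
            open_starts.setdefault(act, []).append(time)
        elif typ == "complete" and open_starts.get(act):
            # FIFO takes the oldest open start, LIFO the most recent one.
            start = open_starts[act].pop() if lifo else open_starts[act].pop(0)
            instances.append((act, start, time))
    return instances

# Two overlapping executions of activity a (times are made-up minutes).
trace = [("a", "start", 1), ("a", "start", 2), ("a", "complete", 5), ("a", "complete", 7)]

fifo = match_instances(trace)
lifo = match_instances(trace, lifo=True)
print([end - start for _, start, end in fifo])   # durations under FIFO matching
print([end - start for _, start, end in lifo])   # durations under LIFO matching
```

Both matchings use the same events, yet they yield different durations per activity instance, which is why explicit activity instances from the source system are preferable.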

## **4 Different Types of Process Mining**

After introducing multiple ways to represent process models (BPMN, Petri nets, process trees, and DFGs) and different types of event logs (e.g., XES and OCEL), we now briefly introduce some of the standard process mining tasks (see Fig. 12). As a starting point, we assume that high-quality event data are available. In practice, it is often time-consuming to extract event data from existing systems. As mentioned before, events may be scattered over multiple database tables or even multiple information systems using different identifiers. When starting with process mining, data extraction and data cleaning may take 80% of the time. Of course, the exact percentage depends on the type of process and information system. Also, if the data pipeline is set up properly, this is a one-time effort, and the pipeline can be reused continuously.

#### **4.1 Process Discovery**

Event logs contain example behavior. The challenge is to discover a process model based on such example behavior. The model should not be "overfitting" (i.e., simply enumerating the observed example traces) and not "underfitting" (i.e., allow for behavior unrelated to what was observed). This is a difficult task and numerous algorithms have been proposed in literature, including the Alpha algorithm [8], region-based approaches [11,13,33,36], inductive mining techniques [28,29], and the split miner [9]. A baseline approach is the creation of a DFG, where the observed activities are added as nodes and two nodes a and b are connected through a directed arc if activity a is directly followed by activity b at least once. Obviously, such an approach is too simplistic and leads to underfitting process models. If activity a is directly followed by activity b in one case and activity b is directly followed by activity a in another case, then a loop is introduced. The techniques mentioned above address this problem and are able to uncover concurrency. However, there are many other challenges. The event log may contain *infrequent behavior*, i.e., traces or patterns which are less frequent compared to the mainstream behavior. Should this infrequent behavior be included or not? Hence, most approaches are parameterized to discard rare behavior. On the one hand, we often want to leave out infrequent behavior to simplify models. On the other hand, one cannot assume to have seen all behavior. Concurrency leads to an exponential number of states and a factorial number of possible traces. An unbounded loop leads to infinitely many possible traces. Process discovery is further complicated by the fact that event logs do *not* contain *negative* examples (i.e., traces that cannot happen) and are often *incomplete* (i.e., only a small fraction of all possible behavior is observed).

**Fig. 12.** Six frequently used types of process mining.

It is important to focus on a *particular process* or *problem*, having a *particular goal* in mind. One needs to select and filter the data based on a well-defined goal. Randomly using sliders to simplify process models may be useful for a first exploration, but will rarely lead to the desired insights.

To introduce process discovery, we focus on the *control-flow*, i.e., the ordering of activities. However, process models may include other *perspectives*, including time, data, resources, costs, etc. For example, a choice may be based on the attributes of the case or preceding event, and we may attach resource allocation rules to activities (e.g., role information and authorizations). Process discovery may add such perspectives, but we typically try to get clarity on the control-flow first. If no reasonable control-flow can be established, one should not try to add additional perspectives. Several process discovery techniques are explained in detail in [5,10].

#### **4.2 Conformance Checking**

Conformance checking requires both an event log and a process model as input. The goal is to indicate where log and model disagree. To illustrate this, consider Figs. 7, 8, and 9. These three models describe exactly the same behavior of the extended "pizza process" that can be compactly described as $\rightarrow(bi, cb, \wedge(\circlearrowleft(ac, \tau), at, \times(as, am)), bo, \times(ep, \tau), ck)$. Let $M = \{\langle bi, cb, ac, at, as, bo, ep, ck\rangle, \ldots, \langle bi, cb, am, at, ac, ac, ac, bo, ep, ck\rangle, \ldots, \langle bi, cb, at, ac, am, bo, ck\rangle\}$ be the infinite set of all traces allowed by the BPMN model, Petri net, and process tree depicted in the three figures. Let $L \in \mathcal{B}(\mathcal{U}_{act}^*)$ be an event log containing 800 traces. Assume $\sigma_1 = \langle bi, cb, ac, at, as, bo, ep, ck\rangle \in L$, $\sigma_2 = \langle bi, cb, ac, ac, at, am, ep, ck\rangle \in L$, and $\sigma_3 = \langle bi, cb, at, ac, at, as, bo, ck\rangle \in L$. Hence, $L = [\sigma_1, \sigma_2, \sigma_3, \ldots]$ and $|L| = 800$. $\sigma_1 \in M$, i.e., this is a perfectly fitting trace. $\sigma_2 \notin M$ because activity *bo* (bake in oven) is missing, i.e., someone was eating an uncooked pizza. $\sigma_3 \notin M$ because activity *at* (add tomato) occurs twice. The goal of conformance checking is to detect such deviations.

$L_{fit} = [\sigma \in L \mid \sigma \in M]$ is the multiset of fitting traces and $L_{dev} = [\sigma \in L \mid \sigma \notin M]$ is the multiset of deviating traces. Hence, fitness at the trace level can be defined as $|L_{fit}| / |L|$. The fraction is 1 if all traces are fitting and 0 if none of the traces is fitting.
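As a rough illustration, trace-level fitness can be computed as sketched below. The membership test `fits_pizza_model` is hand-written for the pizza process tree (it is not derived automatically from a model), and the trace frequencies in the example log are hypothetical; only the three traces themselves come from the text.

```python
from collections import Counter

def fits_pizza_model(trace):
    """Check whether a trace is in the language of the extended pizza process."""
    trace = list(trace)
    if trace[:2] != ["bi", "cb"]:
        return False
    # The trace must end with bo, an optional ep, and ck.
    if trace[-3:] == ["bo", "ep", "ck"]:
        middle = trace[2:-3]
    elif trace[-2:] == ["bo", "ck"]:
        middle = trace[2:-2]
    else:
        return False
    # Parallel part: ac at least once, at exactly once, and either as or am.
    counts = Counter(middle)
    return (set(counts) <= {"ac", "at", "as", "am"}
            and counts["ac"] >= 1
            and counts["at"] == 1
            and counts["as"] + counts["am"] == 1)

def trace_fitness(log, fits):
    """Fraction of traces (counting multiplicities) accepted by the model."""
    total = sum(log.values())
    return sum(n for trace, n in log.items() if fits(trace)) / total

s1 = ("bi", "cb", "ac", "at", "as", "bo", "ep", "ck")
s2 = ("bi", "cb", "ac", "ac", "at", "am", "ep", "ck")  # bo is missing
s3 = ("bi", "cb", "at", "ac", "at", "as", "bo", "ck")  # at occurs twice
log = Counter({s1: 6, s2: 1, s3: 1})  # hypothetical frequencies
print(trace_fitness(log, fits_pizza_model))  # 6 of 8 traces fit: 0.75
```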

There are many measures for fitness. For example, the above fraction does not take into account to what degree a trace is fitting or not. Trace $\sigma_4 = \langle bo, bo, bo, at, at, at, at, at\rangle \in L$ is obviously more deviating than $\sigma_2$ and $\sigma_3$. Moreover, it is not enough to produce a number. In practice, good diagnostics are much more important than a single quality measure.

There are many techniques for conformance checking. The two most frequently used approaches are *token-based replay* [32] and *alignments* [6,14]. For token-based replay, the process model is represented as a Petri net and traces in the event log are replayed on the model. If the trace indicates that an activity needs to take place, the corresponding transition is executed. If this is not possible because an input place is empty, a so-called *missing token* is added. Tokens that are never consumed are called *remaining tokens*. The numbers of missing and remaining tokens relative to the numbers of consumed and produced tokens indicate the severity of the conformance problem. Token-based replay can be extended to Petri nets with silent and duplicate activities using heuristics. For example, if there are two activities with the same label, pick the one that is enabled. If both are enabled, pick one of them. Similarly, silent transitions (i.e., transitions not corresponding to recorded activities) are executed when they enable a transition corresponding to the next activity in the event log. This requires an exploration of the states reachable from the current state and may lead to inconclusive results.
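The counting of produced, consumed, missing, and remaining tokens can be sketched as follows. This is a minimal sketch that assumes uniquely labeled transitions and no silent transitions, so none of the heuristics mentioned above are needed. The net `NET` is a small hypothetical example, and `replay_fitness` uses one commonly used way to combine the four counts into a single number; the text above does not prescribe a specific formula.

```python
from collections import Counter

def token_replay(trace, transitions, initial, final):
    """Replay one trace; return (produced, consumed, missing, remaining)."""
    marking = Counter(initial)
    produced = sum(initial.values())  # the environment produces the initial tokens
    consumed = missing = 0

    def consume(place):
        nonlocal consumed, missing
        if marking[place] > 0:
            marking[place] -= 1
        else:
            missing += 1  # a needed token was absent: record it and continue
        consumed += 1

    for activity in trace:
        inputs, outputs = transitions[activity]
        for place in inputs:
            consume(place)
        for place in outputs:
            marking[place] += 1
            produced += 1
    for place, n in final.items():  # the environment consumes the final tokens
        for _ in range(n):
            consume(place)
    remaining = sum(marking.values())
    return produced, consumed, missing, remaining

def replay_fitness(produced, consumed, missing, remaining):
    """Combine the counts into a value between 0 and 1 (assumed formula)."""
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)

# Hypothetical net: a, then (b and c concurrently) or d, then e.
NET = {
    "a": (["start"], ["p1", "p2"]),
    "b": (["p1"], ["p3"]),
    "c": (["p2"], ["p4"]),
    "d": (["p1", "p2"], ["p3", "p4"]),
    "e": (["p3", "p4"], ["end"]),
}
INITIAL, FINAL = {"start": 1}, {"end": 1}

print(token_replay(["a", "b", "c", "e"], NET, INITIAL, FINAL))  # no missing or remaining tokens
print(token_replay(["a", "b", "e"], NET, INITIAL, FINAL))       # c skipped: one missing, one remaining
```

In the second replay, the token that c should have consumed remains in `p2`, and the token c should have produced is missing when e fires. This shows how the counts localize the deviation.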

Compared to computing alignments, token-based replay is fairly efficient, but does not always produce valid paths through the process model. Alignments are often seen as the gold standard for conformance checking because they provide paths through the process model that are as close to the observed behavior as possible. We would like to map observed behavior onto modeled behavior to provide better diagnostics and to relate also non-fitting cases to the model. Alignments were introduced to overcome the limitations of token-based replay. The diagnostics are more detailed and more precise, because each observed trace is mapped onto a model behavior that is as close to what was observed as possible. The alignment shows common behavior, but also skipped and inserted events signaling deviations. Such skipped and inserted events are easier to interpret than missing and remaining tokens. However, for large event logs and processes, alignment computations may be intractable. Moreover, there may be many optimal alignments, making the diagnostics non-deterministic.
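To give an impression of what an alignment looks like, the sketch below aligns an observed trace with one modeled trace using unit costs for skipped and inserted events, and then picks the closest trace from a finite model language. This is a simplification made for illustration: practical alignment techniques search the state space of the model rather than enumerating its traces. The names `align` and `best_alignment` and the `">>"` no-move symbol are illustrative choices.

```python
SKIP = ">>"  # marks the side of a move where nothing happens

def align(observed, modeled):
    """Return (cost, moves) of an optimal alignment of two traces."""
    n, m = len(observed), len(modeled)
    # cost[i][j]: cheapest alignment of observed[:i] with modeled[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + 1
            if observed[i - 1] == modeled[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # synchronous move is free
            cost[i][j] = best
    # Backtrack to recover one optimal sequence of moves.
    moves, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and observed[i - 1] == modeled[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            moves.append((observed[i - 1], modeled[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            moves.append((observed[i - 1], SKIP))  # event only in the log
            i -= 1
        else:
            moves.append((SKIP, modeled[j - 1]))   # event only in the model
            j -= 1
    moves.reverse()
    return cost[n][m], moves

def best_alignment(observed, model_traces):
    """Closest modeled trace; ties are broken by iteration order."""
    return min((align(observed, t) for t in model_traces), key=lambda r: r[0])

model = [("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")]
print(best_alignment(("a", "b", "e"), model))
```

For the observed trace a, b, e, the first two modeled traces are equally close (cost 1 each). Returning only the first one illustrates why diagnostics based on a single optimal alignment can be non-deterministic.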

Several conformance checking techniques are explained in detail in [15]. When comparing observed and modeled behavior, we typically consider four main quality dimensions [1,2,6]: recall (also called replay fitness), precision, generalization, and simplicity.


#### **4.3 Performance Analysis**

The goal of process mining is to improve processes by uncovering problems. These may be the conformance problems just described, but (of course) also include performance problems such as untimely completion of a case, limited production, missed deadlines, tardiness, excessive rework, and recurring quality problems. Using *token-based replay* [32] and *alignments* [6,14] it is possible to relate event data to a process model. As a result, it is fairly straightforward to annotate the process model with frequency and time information. Frequencies of undesired activities and loops can be used to identify quality and efficiency problems. Since events have timestamps, it is possible to measure times in-between activities, including statistics such as mean, median, standard deviation, minimum, and maximum. This allows for analyzing performance indicators, e.g., waiting times, response times, and service times.
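The timing statistics mentioned above can be derived directly from timestamped events, as the following sketch shows. It measures the time between directly following events of the same case without relating them to a process model first. The event layout `(case, activity, timestamp)` and the numeric timestamps (hours) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def time_between_activities(events):
    """Statistics on elapsed time per pair of directly following activities.

    events: iterable of (case, activity, timestamp) tuples.
    """
    by_case = defaultdict(list)
    for case, activity, timestamp in events:
        by_case[case].append((timestamp, activity))
    gaps = defaultdict(list)
    for case_events in by_case.values():
        case_events.sort()  # order the events of one case by time
        for (t1, a1), (t2, a2) in zip(case_events, case_events[1:]):
            gaps[(a1, a2)].append(t2 - t1)
    return {
        pair: {"n": len(v), "mean": mean(v), "median": median(v),
               "stdev": pstdev(v), "min": min(v), "max": max(v)}
        for pair, v in gaps.items()
    }

# Hypothetical events of two cases; timestamps in hours.
events = [(1, "a", 0), (1, "b", 2), (1, "e", 5),
          (2, "a", 1), (2, "b", 7), (2, "e", 8)]
stats = time_between_activities(events)
print(stats[("a", "b")])  # gaps of 2 and 6 hours between a and b
```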

A *Service Level Agreement* (SLA) is an agreement between a service provider and a client. Process mining can be used to analyze SLAs, e.g., when is a particular SLA not met. Some well-known SLAs are churn/abandonment rate (number of cases lost), average speed to answer (response time seen by customer), percentage of cases handled within a predefined timeframe, first-call resolution (cases successfully handled without rerouting), percentage of duplicated cases (e.g., multiple procurement documents corresponding to the same order), mean time between failures, mean time to recovery, etc.

#### **4.4 Comparative Process Mining**

Comparative process mining uses as input *multiple* event logs, e.g., $L_1, L_2, \ldots, L_n \in \mathcal{B}(\mathcal{U}_{act}^*)$. These event logs may refer to different locations, periods, or categories of cases. For example, we may have the event logs $L_{Aachen}$ and $L_{Munich}$ referring to the same processes performed at two locations. We may have the event logs $L_{Jan}, L_{Feb}, L_{Mar}, \ldots, L_{Dec}$ referring to different periods or $L_{Gold}$ and $L_{Silver}$ referring to gold and silver customers.

Having multiple event logs allows for comparison and highly relevant questions. What are the striking differences and commonalities? What factors lead to these differences? Root cause analysis can be used to explain the observed differences. For example, in $L_{Feb}$ waiting times may be much longer than in $L_{Jan}$ due to limited resource availability. Comparative process mining may focus on frequently occurring problems, sometimes referred to as *execution gaps*. Such execution gaps include lost customers, additional work due to price changes, the merging of duplicate orders, and rework due to quality problems.
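One simple way to surface striking differences between two event logs is to compare the relative frequencies of their trace variants, as sketched below. This is only one of many possible comparisons (waiting times or arc frequencies could be compared in the same manner). The logs `log_jan` and `log_feb` and their frequencies are hypothetical.

```python
from collections import Counter

def relative_frequencies(log):
    """Share of cases per trace variant."""
    total = sum(log.values())
    return {trace: n / total for trace, n in log.items()}

def compare_variants(log_a, log_b):
    """Variants sorted by the absolute change in relative frequency."""
    freq_a, freq_b = relative_frequencies(log_a), relative_frequencies(log_b)
    diff = {t: freq_b.get(t, 0.0) - freq_a.get(t, 0.0)
            for t in set(freq_a) | set(freq_b)}
    return sorted(diff.items(), key=lambda item: abs(item[1]), reverse=True)

log_jan = Counter({("a", "b", "c", "e"): 6, ("a", "c", "b", "e"): 3, ("a", "d", "e"): 1})
log_feb = Counter({("a", "b", "c", "e"): 4, ("a", "c", "b", "e"): 2, ("a", "d", "e"): 4})

for trace, change in compare_variants(log_jan, log_feb):
    print(trace, f"{change:+.0%}")  # the variant with d shows the largest change
```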

Comparative process mining is also a great tool for *inter- or intraorganizational benchmarking*. For example, an insurance company may have different regional offices. Using comparative process mining, these offices can learn from each other and increase the overall performance.

#### **4.5 Predictive Process Mining**

Process discovery, conformance checking, performance analysis, and comparative process mining are *backward-looking*. Although the value of such techniques is obvious, the actual goal is to continuously improve processes and respond to changes. Operational processes are subject to many changes, e.g., a sudden increase in the number of orders or disruptions in the supply chain. Moreover, many compliance and performance problems can be foreseen and addressed proactively. Fortunately, process models discovered and enriched using process mining can be used in a *forward-looking* manner.

Process mining can be used to create a range of *ML questions* that can be answered using standard software libraries. For example, when detecting a recurring bottleneck or deviation, it is possible to extract features from the event log and create a predictive model. This leads to a so-called *situation-feature table* with several descriptive features (e.g., people involved, path taken, and time of day) and one target feature (e.g., waiting time or decision). Then standard ML techniques ranging from regression and decision trees to neural networks can be applied to explain the target feature in terms of descriptive features. This leads to better diagnostics and explanations. Moreover, the models can be used in a predictive manner.
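A situation-feature table can be sketched as a list of rows, one per situation of interest. In the sketch below, a situation is an occurrence of a chosen activity, the descriptive features are the preceding activity, the resource, and the hour of day, and the target feature is the waiting time. The event attribute names (`act`, `res`, `time`) and the example data are hypothetical.

```python
def situation_feature_table(cases, target_activity):
    """Build one row per occurrence of target_activity.

    cases: dict mapping a case id to a time-ordered list of event dicts
    with keys 'act', 'res', and 'time' (hours since some reference point).
    """
    rows = []
    for case_id, events in cases.items():
        for previous, current in zip(events, events[1:]):
            if current["act"] == target_activity:
                rows.append({
                    "case": case_id,
                    "prev_act": previous["act"],          # path taken
                    "resource": current["res"],           # people involved
                    "hour_of_day": current["time"] % 24,  # time of day
                    "waiting_time": current["time"] - previous["time"],  # target
                })
    return rows

cases = {
    "c1": [{"act": "a", "res": "Ann", "time": 9},
           {"act": "b", "res": "Bob", "time": 14}],
    "c2": [{"act": "a", "res": "Ann", "time": 10},
           {"act": "c", "res": "Cy", "time": 11},
           {"act": "b", "res": "Bob", "time": 30}],
}
table = situation_feature_table(cases, "b")
for row in table:
    print(row)
```

Such a table can be handed to any standard ML library; which learner is appropriate depends on whether the target feature is numerical (e.g., waiting time) or categorical (e.g., a decision).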

Predictive process mining questions also create specific ML challenges. Most ML techniques assume a fixed number of features as input (i.e., a fixed-length feature vector) and assume inputs to be independent. Artificial recurrent neural network architectures such as Long Short-Term Memory (LSTM) can be used to handle traces of variable length. Contextual features can be added to include information about the utilization of resources. However, this requires fine-tuning and domain knowledge.

A discovered process model can be viewed as a description of the *as-is* situation. Using simulation and model adaptation, it is possible to explore possible *to-be* situations. Simulation enables forward-looking forms of process mining. Comparative process mining can be used to compare the different alternatives.

#### **4.6 Action-Oriented Process Mining**

Process mining can be used to show (1) what has happened, (2) what is happening now, and (3) what will happen next in the process. Hence, it covers the full spectrum from backward-looking to forward-looking types of analysis. Backward-looking forms of process mining can lead to process redesigns and organizational changes. Forward-looking forms of process mining and diagnostics of the current state of a process can trigger improvement actions. Action-oriented process mining aims to turn diagnostics into actions. Assisted by low-code automation platforms, process mining software can trigger workflows. Some examples:


Next to triggering improvement actions, process mining can also detect repetitive work that may be automated using *Robotic Process Automation* (RPA). RPA can be used to automate repetitive tasks done by humans without changing the underlying systems. Typical examples include copying information from one system into another system. Process mining can be used to discover such repetitive tasks. The term *task mining* is often used to refer to the discovery of processes based on user-interface interactions (filling out a form, pushing a button, copying text, etc.). Task mining can be used to uncover repetitive processes that can be automated. There is also a connection to *online scheduling* and other Operations Research (OR) techniques. For example, based on historical information, it is possible to create a robust schedule with events taking place in the future. Differences between scheduled events and the actual events may trigger improvement actions.

## **5 Applications and Software**

Process mining started as an exercise in the late 1990s trying to automatically create a Petri net from example traces [2]. According to Gartner there are now over 40 process mining vendors [26]. Some of them are listed in Table 3. Note that the list is very dynamic with new vendors emerging and large IT companies acquiring smaller process mining vendors. For an up-to-date overview, see the website www.processmining.org which lists all process mining tools.


**Table 3.** Some of the process mining tools available at the end of 2021. For each tool the vendor and website are listed. The last column indicates whether an academic version is available.


All of the tools in Table 3 support the discovery of Directly-Follows Graphs (DFGs) with frequencies and times. Most of them (but not all) support some form of conformance checking and BPMN visualization. Some of the tools target process or data analysts rather than people managing or executing processes. These tools are typically lightweight and can be deployed quickly. *Enterprise-level process mining tools* are more difficult to deploy, but aim to be used by many stakeholders within an organization. For example, within Siemens, over 6000 employees are using the Celonis software to improve a range of processes. Enterprise-level process mining tools have automated connections to existing information systems (e.g., SAP, Salesforce, Oracle, ServiceNow, and Workday) to allow for the continuous ingestion of data. These tools also allow for customized dashboards to lower the threshold to use process mining. In 2020, Gartner estimated the process mining software market revenue to be \$550 million, which was over 70% market size growth from the previous year [26]. The process mining market is forecast to keep growing 50% per year (Compound Annual Growth Rate) in the coming years. Note that this does not include consultancy based on process mining. The Big Four (i.e., Deloitte, Ernst & Young, KPMG, and PwC) all have process mining competence centers providing process mining services all over the globe.

The technology is generic and can be used in any domain. For example, process mining is used in


In [31], several use cases are described in detail. In [26,27], typical applications are described, and in [21] the results of a global process mining survey are presented. These show that the adoption is increasing, e.g., according to the global survey, 83% of companies already using process mining on a global scale plan to expand their initiatives [21]. *Process mining helps organizations to improve processes, provide transparency, reduce costs, ensure compliance, avoid risks, eliminate waste, and redesign problematic processes* [21]. To get a glimpse of the possible applications, the reader can take a look at the use cases collected by the IEEE Task Force on Process Mining [25] and HSPI Management Consulting [20]. Note that these cover just a fraction of the actual applications of process mining. It has become fairly standard to apply process mining to standard processes such as Purchase-to-Pay (P2P) and Order-to-Cash (O2C).

#### **6 Summary and Outlook**

This chapter aimed to provide a 360° overview of the field of process mining. We showed that process mining connects data science and process science leading to data-driven process-centric techniques and approaches. Event data and process models were introduced. Events can be grouped in event logs, but also stored in databases. In the standard setting an event has a few mandatory attributes such as case, activity, and timestamp. This can be further reduced to representing an event log by a multiset of traces where each trace is a sequence of activities. This format is often used for control-flow discovery. However, in real-life settings it is not so easy to find a single case notion. Often events may refer to multiple objects of different types. There may also be data quality problems and data may be scattered over multiple source systems. Moreover, additional attributes such as costs, time, and resources need to be incorporated in models. We introduced Directly-Follows Graphs (DFG), Petri nets, BPMN models, and process trees as basic control-flow representations. These will be used in the remainder.

We informally described six common types of process mining: (1) process discovery, (2) conformance checking, (3) performance analysis, (4) comparative process mining, (5) predictive process mining, and (6) action-oriented process mining. These characterize the scope of process mining and challenges. The chapter also provided pointers to the over 40 process mining tools and case studies.

Although process mining is already used by many of the larger organizations, it is a relatively new technology and only a fraction of its potential is realized today. Three important trends can be witnessed that together lead to a wider adoption.

– Supporting data extraction and analysis through *process-specific and domain-specific adapters and applications* ("process mining apps"). This reduces the effort to get started with process mining and leverages past experiences in other organizations.


Process mining can also play a role in realizing sustainability goals and help to address environmental, social and economic challenges. Process mining can help to quantify and steer sustainability efforts, e.g., by removing waste and quantifying emissions. Process mining can easily handle multiple dimensions, such as time, cash flow, resource usage, and CO<sub>2</sub> emissions, during analysis. Sustainability is just one of many topics where process mining can play a role. Moreover, these applications also pose interesting research questions leading to new concepts and techniques.

**Acknowledgment.** Funded by the Alexander von Humboldt (AvH) Stiftung and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2023 Internet of Production – 390621612.

## **References**



# **Process Discovery**

# **Foundations of Process Discovery**

Wil M. P. van der Aalst

Process and Data Science (PADS), RWTH Aachen University, Aachen, Germany wvdaalst@pads.rwth-aachen.de http://www.vdaalst.com/

**Abstract.** Process discovery is probably the most interesting, but also most challenging, process mining task. The goal is to take an event log containing example behaviors and create a process model that adequately describes the underlying process. This chapter introduces the baseline approach used in most commercial process mining tools. A simplified event log is used to create a so-called *Directly-Follows Graph* (DFG). This baseline is used to explain the challenges one faces when trying to discover a process model. After introducing DFG discovery, we focus on techniques that are able to discover models allowing for concurrency (e.g., Petri nets, process trees, and BPMN models). The chapter distinguishes two types of approaches able to discover such models: (1) *bottom-up process discovery* and (2) *top-down process discovery*. The *Alpha algorithm* is presented as an example of a bottom-up technique. The approach has many limitations, but nicely introduces the idea of discovering local constraints. The basic *inductive mining* algorithm is presented as an example of a top-down technique. This approach, combined with frequency-based filtering, works well on most event logs. These example algorithms are used to illustrate the foundations of process discovery.

**Keywords:** Process discovery · Process models · Petri nets · BPMN

## **1 Introduction**

Process discovery is typically the first step after extracting event data from source systems. Based on the selected event data, process discovery algorithms automatically construct a process model describing the observed behavior. This may be challenging because, in most cases, the event data cannot be assumed to be complete, i.e., we only witnessed example behaviors. There may also be conflicting requirements (e.g., recall, precision, generalization, and simplicity) [1,3]. This makes process discovery both interesting and challenging.

Figure 1 positions this chapter. The input for process discovery is a collection of events and the output is a process model. Such a process model can be used to uncover unexpected deviations and bottlenecks. In the later stages of the process mining pipeline shown in Fig. 1, process models are used to check compliance, compare processes, detect concept drift, and predict performance and compliance problems.

Events may have many attributes and refer to multiple objects of different types [3]. However, in this chapter, we start from very basic event data. We assume that each *event* refers to a *case*, an *activity*, and has a *timestamp*. There may be many other attributes (e.g., resource), but we ignore these. Initially, we assume that timestamps are only used for the ordering of events corresponding to the same case. This implies that each case is represented by a *sequence of activities*. We call this a *trace*. For example, $\sigma = \langle a, b, c, e\rangle$ represents a case for which the activities a, b, c, and e occurred. Note that there may be many cases that have the same trace. Therefore, we represent an event log as a multiset of traces. For example, $L_1 = [\langle a, b, c, e\rangle^{10}, \langle a, c, b, e\rangle^{5}, \langle a, d, e\rangle]$ is an event log describing 16 cases and $10 \times 4 + 5 \times 4 + 1 \times 3 = 63$ events. Note that trace $\sigma = \langle a, b, c, e\rangle$ appears 10 times. In [3], we use the term *simplified* event log. Here we drop the adjective "simplified" since the representation will be used throughout the chapter.

**Fig. 1.** This chapter focuses on process discovery. This is the first step after extracting event data from the source system(s). To set the scene, we consider only control-flow information, i.e., the ordering of activities.

**Definition 1 (Event Log).** $\mathcal{U}_{act}$ *is the universe of activity names. A trace* $\sigma = \langle a_1, a_2, \ldots, a_n\rangle \in \mathcal{U}_{act}^*$ *is a sequence of activities. An event log* $L \in \mathcal{B}(\mathcal{U}_{act}^*)$ *is a multiset of traces.*

Note that $L(\sigma)$ is the number of times trace $\sigma$ appears in event log $L$. For example, $L_1(\langle a, b, c, e\rangle) = 10$, $L_1(\langle a, c, b, e\rangle) = 5$, $L_1(\langle a, d, e\rangle) = 1$, $L_1(\langle b, a\rangle) = 0$, $L_1(\langle c\rangle) = 0$, $L_1(\langle\,\rangle) = 0$, etc.
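A multiset of traces maps naturally onto a counter keyed by tuples. The sketch below is one possible encoding (the helper `to_event_log` and the event layout `(case, activity, timestamp)` are illustrative assumptions); it reproduces the numbers given for the example log.

```python
from collections import Counter, defaultdict

def to_event_log(events):
    """Turn (case, activity, timestamp) events into a multiset of traces."""
    by_case = defaultdict(list)
    for case, activity, timestamp in events:
        by_case[case].append((timestamp, activity))
    # Timestamps are only used to order the events of each case.
    return Counter(tuple(a for _, a in sorted(evts)) for evts in by_case.values())

# The example event log with 10 + 5 + 1 cases.
L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})

num_cases = sum(L1.values())
num_events = sum(len(trace) * n for trace, n in L1.items())
print(num_cases, num_events)                   # 16 cases and 63 events
print(L1[("a", "b", "c", "e")], L1[("b", "a")], L1[()])  # multiplicities 10, 0, 0
```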

Given an event log $L \in \mathcal{B}(\mathcal{U}_{act}^*)$, we would like to learn a process model adequately capturing the observed behavior. Figure 2 shows three process models discovered for $L_1 = [\langle a, b, c, e\rangle^{10}, \langle a, c, b, e\rangle^{5}, \langle a, d, e\rangle]$. The models also show frequencies.

Figure 2(b) shows a Directly-Follows Graph (DFG). The start, end, and five activities are the nodes of the graph. Activities a and e occurred 16 times, b and c occurred 15 times, and d only once. The arcs in Fig. 2(b) show how often an activity is *directly* followed by another activity. For example, a is 10 times directly followed by b, a is 5 times directly followed by c, and a is once directly followed by d. To indicate the start and end of cases, we use a start node $\blacktriangleright$ and an end node $\blacksquare$. One can view $\blacktriangleright$ and $\blacksquare$ as "dummy" activities or states. Although they do not represent real activities, they are needed to describe the process adequately. Since all 16 cases start with a, the arc connecting $\blacktriangleright$ to a has a frequency of 16. Note that due to the cycles in the DFG, traces such as $\langle a, b, c, b, c, b, c, b, e\rangle$ are also possible according to the DFG (but did not appear in the event log).

**Fig. 2.** Three process models learned from event log $L_1 = [\langle a, b, c, e\rangle^{10}, \langle a, c, b, e\rangle^{5}, \langle a, d, e\rangle]$.

Figure 2(c) shows a *Petri net* discovered using the same event log L1. The transitions (i.e., squares) correspond to the five activities in the event log. The places (i.e., circles) constrain the behavior. The Petri net allows for the three traces in the event log and nothing more. Initially, only transition a is enabled. When a fires (i.e., occurs), a token is consumed from the input place and a token is produced for each of the two output places. As a result, transitions b, c, and d become enabled. If d fires, both tokens are removed and two tokens are produced for the input places of e. If b fires, only one token is consumed and one token is produced. After b fires, c is still enabled, and c will fire to enable e. Transition c can also occur before b, i.e., b and c are concurrent and can happen at the same time or in any order. There is a choice between d and the combination of b and c. The start of the process is modeled by the token in the source place. The end of the process is modeled by the double-bordered sink place.

Also, the *process tree* discovered for event log $L_1$ shown in Fig. 2(d) allows for the three traces in the event log and nothing more. The root node is a sequence ($\rightarrow$) with three "child nodes": activity a, a choice, and activity e. These nodes are visited 16 times (once for each case). The choice node ($\times$) has two "child nodes": a parallel node $\wedge$ and an activity node d. The parallel node ($\wedge$) has two "child nodes": activity b and activity c. The whole process tree can be represented by the expression $\rightarrow(a, \times(\wedge(b, c), d), e)$. Note that the d node is visited only once. The $\wedge$, b, and c nodes are visited 15 times. In this example, each node has a unique label, allowing us to refer to nodes easily. Often a tree has multiple nodes with the same label, e.g., $\rightarrow(a, \times(\rightarrow(a, a), a), a)$ where a appears five times and $\rightarrow$ two times.
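To see that the process tree allows for exactly three traces, its language can be enumerated with a small sketch. The nested-tuple encoding and the operator names `"seq"`, `"xor"`, and `"and"` are assumptions made for illustration; loops and silent activities are deliberately left out, so the sketch only covers trees with a finite language.

```python
def shuffle(s, t):
    """All interleavings of two traces that preserve the order within each."""
    if not s:
        return {t}
    if not t:
        return {s}
    return ({(s[0],) + rest for rest in shuffle(s[1:], t)} |
            {(t[0],) + rest for rest in shuffle(s, t[1:])})

def lang(tree):
    """Set of traces of a process tree built from seq, xor, and and nodes."""
    if isinstance(tree, str):      # leaf: a single activity
        return {(tree,)}
    operator, *children = tree
    child_langs = [lang(child) for child in children]
    if operator == "xor":          # choice: union of the children's traces
        return set().union(*child_langs)
    if operator not in ("seq", "and"):
        raise ValueError(f"unsupported operator: {operator}")
    result = {()}
    for child_lang in child_langs:
        if operator == "seq":      # sequence: concatenate
            result = {s + t for s in result for t in child_lang}
        else:                      # parallel: interleave
            result = {u for s in result for t in child_lang for u in shuffle(s, t)}
    return result

tree = ("seq", "a", ("xor", ("and", "b", "c"), "d"), "e")
print(sorted(lang(tree)))  # the three traces of the example event log
```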

In Fig. 2, we just show example results. In the remainder, we will see how such process models can be learned from event data. The goal of this chapter is not to give a complete survey (see also [10] for a recent survey). Instead, we would like to bring forward the essence of process discovery from event data, and introduce the main principles in an intuitive manner.

The remainder of this chapter is organized as follows. Section 2 presents a baseline approach that computes a Directly-Follows Graph (DFG). This approach is simple and highly scalable, but has many limitations (e.g., producing complex underfitting process models) [2]. In Sect. 3, we elaborate on the challenges of process discovery. Section 4 discusses higher-level representations such as Petri nets (Subsect. 4.1), process trees (Subsect. 4.2), and BPMN (Subsect. 4.3). Section 5 introduces "bottom-up" process discovery using the Alpha algorithm [1,9] as an example. Section 6 introduces "top-down" process discovery using the basic inductive mining algorithm [22–24] as an example. Finally, Sect. 7 concludes the chapter with pointers to other discovery approaches (e.g., using state-based or language-based regions).

## **2 Directly-Follows Graphs: A Baseline Approach**

In this section, we present a very simple discovery approach that is supported by most (if not all) process mining tools: constructing a so-called Directly-Follows Graph (DFG) by simply counting how often one activity is directly followed by another activity (see Fig. 2(b)). We also use this to introduce filtering techniques to remove infrequent activities, infrequent variants, and infrequent arcs. The more advanced techniques presented later in this chapter build upon the simple notions introduced in this section.

Let us first try to describe the process discovery problem in abstract terms, independent of the selected process modeling notation. Therefore, we describe a model's behavior as a set of traces.

**Definition 2 (Process Model).** $\mathcal{U}_M$ *is the universe of process models. A process model* $M \in \mathcal{U}_M$ *defines a set of traces* $lang(M) \subseteq \mathcal{U}_{act}^*$*.*

Examples of process models defined later are DFGs $\mathcal{U}_G \subseteq \mathcal{U}_M$ (Sect. 2.1), accepting Petri nets $\mathcal{U}_{AN} \subseteq \mathcal{U}_M$ (Sect. 4.1), process trees $\mathcal{U}_Q \subseteq \mathcal{U}_M$ (Sect. 4.2), and BPMN models $\mathcal{U}_{BPMN} \subseteq \mathcal{U}_M$ (Sect. 4.3). Consider, for example, the process models $M_1$ (DFG), $M_2$ (Petri net), and $M_3$ (process tree) in Fig. 2. $lang(M_2) = lang(M_3) = \{\langle a, b, c, e\rangle, \langle a, c, b, e\rangle, \langle a, d, e\rangle\}$. $lang(M_1) = \{\langle a, b, e\rangle, \langle a, c, e\rangle, \langle a, d, e\rangle, \ldots, \langle a, b, c, b, c, b, c, e\rangle, \ldots\}$ contains infinitely many traces due to the cycle involving b and c.

The goal of a process discovery algorithm is to produce a model that explains the observed behavior.

**Definition 3 (Process Discovery Algorithm).** *A process discovery algorithm is a function* $\mathit{disc} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast}) \rightarrow \mathcal{U}_{M}$*, i.e., based on a multiset of traces, a model is produced.*

Given an event log $L$, a process discovery algorithm $\mathit{disc}$ returns a model allowing for the traces $\mathit{lang}(\mathit{disc}(L))$. A discovery algorithm $\mathit{disc}$ guarantees *perfect replay fitness* if for any $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$: $\{\sigma \in L\} \subseteq \mathit{lang}(\mathit{disc}(L))$. We write $\{\sigma \in L\}$ to turn a multiset of traces into a set of traces and make the model and the log comparable. All three models in Fig. 2 have perfect replay fitness (also called perfect recall).

#### **2.1 Directly-Follows Graphs: Basic Concepts**

We already informally introduced DFGs, but now we formalize the concepts needed to precisely describe the corresponding discovery algorithm.

**Definition 4 (Directly-Follows Graph).** *A Directly-Follows Graph (DFG) is a pair* $G = (A, F)$ *where* $A \subseteq \mathcal{U}_{\mathit{act}}$ *is a set of activities and* $F \in \mathcal{B}((A \times A) \cup (\{\blacktriangleright\} \times A) \cup (A \times \{\blacksquare\}) \cup (\{\blacktriangleright\} \times \{\blacksquare\}))$ *is a multiset of arcs.* $\blacktriangleright$ *is the start node and* $\blacksquare$ *is the end node (*$\{\blacktriangleright, \blacksquare\} \cap \mathcal{U}_{\mathit{act}} = \emptyset$*).* $\mathcal{U}_{G} \subseteq \mathcal{U}_{M}$ *is the set of all DFGs.*

$\blacktriangleright$ and $\blacksquare$ can be viewed as artificially added activities that clearly indicate the start and end of the process. The nodes of a DFG are $\blacktriangleright$ (denoting the beginning), $\blacksquare$ (denoting the end), and the activities in set $A$. Note that $\blacktriangleright \notin A$ and $\blacksquare \notin A$ (this is also important in later sections). There are four types of arcs: $(\blacktriangleright, a)$, $(a_1, a_2)$, $(a, \blacksquare)$, and $(\blacktriangleright, \blacksquare)$ (with $a, a_1, a_2 \in A$). $F((\blacktriangleright, a))$ indicates how many cases start with $a$, $F((a_1, a_2))$ indicates how often activity $a_1$ is directly followed by activity $a_2$, $F((a, \blacksquare))$ indicates how many cases end with $a$, and $F((\blacktriangleright, \blacksquare))$ counts the number of empty cases. In the directly-follows graph, we only consider directly-follows relations within the same case. For example, $F((a, b)) = (10 \times 0) + (10 \times 0) + (10 \times 1) + (10 \times 2) + (10 \times 3) = 60$ given the event log $[\langle a\rangle^{10}, \langle b\rangle^{10}, \langle a,b\rangle^{10}, \langle a,b,a,b\rangle^{10}, \langle a,b,a,b,a,b\rangle^{10}]$.

The DFG in Fig. 2(b) can be described as follows: $M_1 = (A, F)$ with $A = \{a, b, c, d, e\}$ and $F = [(\blacktriangleright,a)^{16}, (a,b)^{10}, (a,c)^{5}, (a,d)^{1}, (b,c)^{10}, (b,e)^{5}, (c,b)^{5}, (c,e)^{10}, (d,e)^{1}, (e,\blacksquare)^{16}]$.

Figure 3 shows process models discovered for another event log $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$. The fact that $b$, $c$, and $d$ occur a variable number of times per case suggests that there is a loop. Figure 3(b) shows the corresponding DFG. This DFG can be described as follows: $M_4 = (A, F)$ with $A = \{a, b, c, d, e\}$ and $F = [(\blacktriangleright,a)^{160}, (a,b)^{90}, (a,c)^{70}, (b,c)^{150}, (b,d)^{40}, (b,e)^{50}, (c,b)^{90}, (c,d)^{40}, (c,e)^{110}, (d,b)^{60}, (d,c)^{20}, (e,\blacksquare)^{160}]$.

**Definition 5 (Traces of a DFG).** *Let* $G = (A, F) \in \mathcal{U}_{G}$ *be a DFG. The set of possible traces described by* $G$ *is* $\mathit{lang}(G) = \{\langle a_2, a_3, \ldots, a_{n-1}\rangle \mid a_1 = \blacktriangleright \ \wedge\ a_n = \blacksquare \ \wedge\ \forall_{1 \leq i < n}\ (a_i, a_{i+1}) \in F\}$*.*

Note that $\blacktriangleright$ and $\blacksquare$ have been added to the DFG to have a clear start and end. However, these "dummy activities" are not part of the language of the DFG.

Consider the DFG $M_1$ shown in Fig. 2(b): $\mathit{lang}(M_1) = \{\langle a,b,e\rangle, \langle a,c,e\rangle, \langle a,d,e\rangle, \langle a,b,c,e\rangle, \langle a,c,b,e\rangle, \langle a,b,c,b,e\rangle, \langle a,c,b,c,e\rangle, \langle a,b,c,b,c,e\rangle, \ldots\}$. The DFG $M_4$ in Fig. 3(b) also has an infinite number of possible traces: $\mathit{lang}(M_4) = \{\langle a,b,e\rangle, \langle a,c,e\rangle, \langle a,b,c,e\rangle, \langle a,c,b,e\rangle, \langle a,b,c,b,e\rangle, \langle a,c,b,c,e\rangle, \langle a,b,d,b,e\rangle, \ldots\}$. Whenever the DFG has a cycle, the number of possible traces is unbounded.
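Definition 5 can be made concrete with a small sketch. The code below is only an illustration: it encodes the arcs of $M_1$ as a Python set of pairs (frequencies are ignored because they do not influence the language), and the helper names `in_lang` and `traces_up_to` are hypothetical. Since the language is infinite, the enumeration is bounded by a maximum trace length.

```python
START, END = "▶", "■"  # artificial start and end nodes

# Arcs of M1 from Fig. 2(b); frequencies are irrelevant for the language.
M1_ARCS = {
    (START, "a"), ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
    ("b", "e"), ("c", "b"), ("c", "e"), ("d", "e"), ("e", END),
}

def in_lang(trace, arcs):
    """True if the trace is a path from START to END in the DFG."""
    path = (START,) + tuple(trace) + (END,)
    return all(step in arcs for step in zip(path, path[1:]))

def traces_up_to(arcs, max_len):
    """Enumerate all traces of the DFG with at most max_len activities."""
    found, todo = [], [()]
    while todo:
        prefix = todo.pop()
        last = prefix[-1] if prefix else START
        for source, target in arcs:
            if source != last:
                continue
            if target == END:
                found.append(prefix)          # a complete trace
            elif len(prefix) < max_len:
                todo.append(prefix + (target,))
    return sorted(found, key=lambda t: (len(t), t))

print(in_lang(("a", "b", "c", "b", "c", "b", "c", "e"), M1_ARCS))  # True
print(traces_up_to(M1_ARCS, 4))
```

Raising the bound keeps producing new traces because of the cycle between $b$ and $c$.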

**Fig. 3.** Three process models learned from event log $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$.

#### **2.2 Baseline Discovery Algorithm**

Since the event log only contains example traces, it is natural that the discovery algorithm aims to generalize the observed behavior to avoid over-fitting. Therefore, we start with a *baseline discovery algorithm* that ensures that all observed behavior is possible according to the discovered process model. The algorithm used to discover the DFGs in Fig. 2(b) and Fig. 3(b) is defined as follows.

**Definition 6 (Baseline Discovery Algorithm).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ *be an event log.* $\mathit{disc}_{\mathit{DFG}}(L) = (A, F)$ *is the DFG based on* $L$ *with:*

- $A = \{a \in \sigma \mid \sigma \in L\}$ *and*
- $F = [(\sigma_i, \sigma_{i+1}) \mid \sigma \in L' \ \wedge\ 1 \leq i < |\sigma|]$ *with* $L' = [\langle\blacktriangleright\rangle \cdot \sigma \cdot \langle\blacksquare\rangle \mid \sigma \in L]$*.*

Note that $L$, $L'$, and $F$ in Definition 6 are multisets. Each trace in the event log $L$ is extended with the artificially added activities: $L'$ adds $\blacktriangleright$ at the start and $\blacksquare$ at the end of each trace in $L$. $M_1 = \mathit{disc}_{\mathit{DFG}}(L_1)$ is depicted in Fig. 2(b) and $M_4 = \mathit{disc}_{\mathit{DFG}}(L_2)$ is depicted in Fig. 3(b).
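The baseline algorithm of Definition 6 amounts to a few lines of counting. The following is a minimal sketch, assuming an event log is represented as a `Counter` that maps each trace variant (a tuple of activity names) to its frequency; the function name `disc_dfg` and this representation are choices made for the illustration, not part of the chapter.

```python
from collections import Counter

START, END = "▶", "■"  # artificial start and end nodes

def disc_dfg(log):
    """Baseline discovery: return (activities, arcs) for a multiset of traces."""
    activities = {a for trace in log for a in trace}
    arcs = Counter()
    for trace, freq in log.items():
        extended = (START,) + trace + (END,)   # a trace of L' in Definition 6
        for x, y in zip(extended, extended[1:]):
            arcs[(x, y)] += freq               # every case contributes its arcs
    return activities, arcs

L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})
A, F = disc_dfg(L1)
print(sorted(A))                                  # ['a', 'b', 'c', 'd', 'e']
print(F[(START, "a")], F[("a", "b")], F[("c", "e")])  # 16 10 10
```

Applied to $L_1$, the sketch reproduces the arc multiset of $M_1$ given earlier in this section.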

A DFG can be viewed as a first-order Markov model (i.e., the state is determined by the last activity executed). The baseline discovery algorithm (Definition 6) tends to lead to underfitting process models. Whenever two activities are not executed in a fixed order, a loop is introduced.

#### **2.3 Footprints**

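A DFG can also be represented as a matrix, as shown in Table 1. This is simply a tabular representation of the graph and the arc frequencies, e.g., $F((\blacktriangleright,\blacksquare)) = 0$, $F((\blacktriangleright,a)) = 16$, and $F((c,e)) = 10$. To capture the relations between activities, we can also create a so-called *footprint matrix* [1]. Table 2 shows the footprint matrix for the DFG in Fig. 2(b). Between two activities $a_1$ and $a_2$, precisely one of four possible relations holds:

- $a_1 \rightarrow a_2$: $a_1$ is directly followed by $a_2$ at least once, but $a_2$ is never directly followed by $a_1$,
- $a_1 \leftarrow a_2$: $a_2$ is directly followed by $a_1$ at least once, but $a_1$ is never directly followed by $a_2$,
- $a_1 \parallel a_2$: $a_1$ is directly followed by $a_2$ and $a_2$ is directly followed by $a_1$, and
- $a_1 \# a_2$: $a_1$ is never directly followed by $a_2$ and $a_2$ is never directly followed by $a_1$.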



**Table 1.** Matrix representation of the DFG in Fig. 2(b).

**Table 2.** The footprint of the DFG in Fig. 2(b).


Table 2 (based on Fig. 2(b)) shows, for example, that $a \rightarrow b$, $b \leftarrow a$, $b \parallel c$, and $c \# d$. The creation of the footprint can be formalized as follows.

**Definition 7 (Footprint).** *Let* $G = (A, F) \in \mathcal{U}_{G}$ *be a DFG.* $G$ *defines a footprint* $\mathit{fp}(G) \in (A' \times A') \rightarrow \{\rightarrow, \leftarrow, \parallel, \#\}$ *such that* $A' = A \cup \{\blacktriangleright, \blacksquare\}$ *and for any* $(a_1, a_2) \in A' \times A'$*:*

- $\mathit{fp}(G)((a_1, a_2)) = {\rightarrow}$ *if* $(a_1, a_2) \in F$ *and* $(a_2, a_1) \notin F$*,*
- $\mathit{fp}(G)((a_1, a_2)) = {\leftarrow}$ *if* $(a_1, a_2) \notin F$ *and* $(a_2, a_1) \in F$*,*
- $\mathit{fp}(G)((a_1, a_2)) = {\parallel}$ *if* $(a_1, a_2) \in F$ *and* $(a_2, a_1) \in F$*, and*
- $\mathit{fp}(G)((a_1, a_2)) = \#$ *if* $(a_1, a_2) \notin F$ *and* $(a_2, a_1) \notin F$*.*

*We write* $a_1 \rightarrow_{G} a_2$ *if* $\mathit{fp}(G)((a_1, a_2)) = {\rightarrow}$*,* $a_1 \#_{G}\, a_2$ *if* $\mathit{fp}(G)((a_1, a_2)) = \#$*, etc.*

We can also create the footprint of an event log by first applying the baseline discovery algorithm: $\mathit{fp}(L) = \mathit{fp}(\mathit{disc}_{\mathit{DFG}}(L))$. Hence, Table 2 also shows $\mathit{fp}(L_1) = \mathit{fp}(\mathit{disc}_{\mathit{DFG}}(L_1)) = \mathit{fp}(M_1)$. This allows us to write $b \rightarrow_{L_1} e$, $b \parallel_{L_1} c$, $b \#_{L_1}\, d$, etc.
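The four cases of Definition 7 translate directly into code. The sketch below is illustrative: the function name `footprint` and the dictionary-based representation of the matrix are assumptions, and the arcs of $M_1$ are given as a plain set because only the presence of an arc matters for the footprint.

```python
START, END = "▶", "■"  # artificial start and end nodes

def footprint(activities, arcs):
    """Map every pair of nodes to one of the relations →, ←, ∥, #."""
    nodes = set(activities) | {START, END}
    fp = {}
    for a1 in nodes:
        for a2 in nodes:
            forward, backward = (a1, a2) in arcs, (a2, a1) in arcs
            if forward and not backward:
                fp[(a1, a2)] = "→"
            elif backward and not forward:
                fp[(a1, a2)] = "←"
            elif forward and backward:
                fp[(a1, a2)] = "∥"
            else:
                fp[(a1, a2)] = "#"
    return fp

# Arcs of M1 (Fig. 2(b)).
M1_ARCS = {
    (START, "a"), ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
    ("b", "e"), ("c", "b"), ("c", "e"), ("d", "e"), ("e", END),
}
fp_m1 = footprint({"a", "b", "c", "d", "e"}, M1_ARCS)
print(fp_m1[("a", "b")], fp_m1[("b", "a")], fp_m1[("b", "c")], fp_m1[("c", "d")])
# → ← ∥ #
```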

#### **2.4 Filtering**

Using the baseline discovery algorithm, an activity $a$ appears in the discovered DFG when it occurs at least once, and two activities $a_1$ and $a_2$ are connected by a directed arc if $a_1$ is directly followed by $a_2$ at least once in the log. Often, we do not want to see the process model that captures all behavior. Instead, we would like to see the dominant behavior. For example, we are interested in the most frequent activities and paths. Therefore, we would like to *filter* the event log and model. Here, we consider three basic types of filtering:

- *activity-based filtering*: infrequent activities are removed by projecting the event log onto the remaining activities,
- *variant-based filtering*: infrequent trace variants are removed from the event log, and
- *arc-based filtering*: infrequent arcs are removed from the DFG.


To describe the different types of filtering, we introduce some notations for traces and event logs.

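**Definition 8 (Frequency and Projection Functions).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ *be an event log.*

- $\mathit{act}(L) = \{a \in \sigma \mid \sigma \in L\}$ *is the set of activities in* $L$*,*
- $\mathit{var}(L) = \{\sigma \in L\}$ *is the set of trace variants in* $L$*,*
- $\#^{\mathit{act}}_{L}(a) = \sum_{\sigma \in \mathit{var}(L)} L(\sigma) \cdot |\{i \in \{1, \ldots, |\sigma|\} \mid \sigma_i = a\}|$ *is the frequency of activity* $a$ *in* $L$*,*
- $\#^{\mathit{var}}_{L}(\sigma) = L(\sigma)$ *is the frequency of variant* $\sigma$ *in* $L$*,*
- $L{\uparrow}A$ *is the projection of* $L$ *onto a set of activities* $A$*: every trace is retained, but only its events referring to activities in* $A$ *are kept, and*
- $L{\Uparrow}V = [\sigma \in L \mid \sigma \in V]$ *is the projection of* $L$ *onto a set of trace variants* $V$*.*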


First, we define *activity-based filtering* using a threshold $\tau_{\mathit{act}} \in \mathbb{N} = \{1, 2, 3, \ldots\}$. All activities with a frequency lower than $\tau_{\mathit{act}}$ are removed from the event log, but all cases are retained.

**Definition 9 (Activity-Based Filtering).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ *be an event log and* $\tau_{\mathit{act}} \in \mathbb{N}$*.* $\mathit{filter}_{\mathit{act}}(L, \tau_{\mathit{act}}) = L{\uparrow}A$ *with* $A = \{a \in \mathit{act}(L) \mid \#^{\mathit{act}}_{L}(a) \geq \tau_{\mathit{act}}\}$*.*

Again we use $L_1 = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}, \langle a,d,e\rangle]$ and $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$ to illustrate the definition. If $\tau_{\mathit{act}} = 10$, then $\mathit{filter}_{\mathit{act}}(L_1, \tau_{\mathit{act}}) = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}, \langle a,e\rangle]$ (only activity $d$ is removed). If $\tau_{\mathit{act}} = 16$, then $\mathit{filter}_{\mathit{act}}(L_1, \tau_{\mathit{act}}) = [\langle a,e\rangle^{16}]$ (only activities $a$ and $e$ remain). If $\tau_{\mathit{act}} > 16$, then $\mathit{filter}_{\mathit{act}}(L_1, \tau_{\mathit{act}}) = [\langle\,\rangle^{16}]$. Note that the number of traces is not affected by activity-based filtering (even when all activities are removed). If $\tau_{\mathit{act}} = 200$, then $\mathit{filter}_{\mathit{act}}(L_2, \tau_{\mathit{act}}) = [\langle b,c\rangle^{50}, \langle c,b\rangle^{40}, \langle b,c,b,c\rangle^{30}, \langle c,b,b,c\rangle^{20}, \langle b,c,c,b\rangle^{10}, \langle c,b,c,b,b,c\rangle^{10}]$ (only activities $b$ and $c$ remain).
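Activity-based filtering can be sketched as follows, again assuming that an event log is a `Counter` from trace variants (tuples) to frequencies. The function name `filter_act` mirrors the notation of Definition 9, but the implementation is only an illustration.

```python
from collections import Counter

def filter_act(log, tau_act):
    """Project the log onto the activities that occur at least tau_act times."""
    freq = Counter()
    for trace, n in log.items():
        for activity in trace:
            freq[activity] += n                # every event counts once
    keep = {a for a, n in freq.items() if n >= tau_act}
    filtered = Counter()
    for trace, n in log.items():
        # Traces that become identical after projection are merged.
        filtered[tuple(a for a in trace if a in keep)] += n
    return filtered

L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})

for tau in (10, 16, 17):
    print(tau, dict(filter_act(L1, tau)))
```

For every threshold, the frequencies in the result still sum to 16, which reflects that no case is dropped.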

Next, we define *variant-based filtering* using a threshold $\tau_{\mathit{var}} \in \mathbb{N}$. All trace variants with a frequency lower than $\tau_{\mathit{var}}$ are removed from the event log.

**Definition 10 (Variant-Based Filtering).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ *be an event log and* $\tau_{\mathit{var}} \in \mathbb{N}$*.* $\mathit{filter}_{\mathit{var}}(L, \tau_{\mathit{var}}) = L{\Uparrow}V$ *with* $V = \{\sigma \in \mathit{var}(L) \mid \#^{\mathit{var}}_{L}(\sigma) \geq \tau_{\mathit{var}}\}$*.*

If $\tau_{\mathit{var}} = 5$, then $\mathit{filter}_{\mathit{var}}(L_1, \tau_{\mathit{var}}) = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}]$. If $\tau_{\mathit{var}} = 10$, then $\mathit{filter}_{\mathit{var}}(L_1, \tau_{\mathit{var}}) = [\langle a,b,c,e\rangle^{10}]$. If $\tau_{\mathit{var}} > 10$, then $\mathit{filter}_{\mathit{var}}(L_1, \tau_{\mathit{var}}) = [\,]$. Note that (unlike activity-based filtering) the number of traces may decrease.
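With the same multiset representation (a `Counter` from trace variants to frequencies, an assumption of these sketches), variant-based filtering is a simple selection. The name `filter_var` follows Definition 10; the code itself is illustrative.

```python
from collections import Counter

def filter_var(log, tau_var):
    """Keep only the trace variants that occur at least tau_var times."""
    return Counter({trace: n for trace, n in log.items() if n >= tau_var})

L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})

for tau in (5, 10, 11):
    kept = filter_var(L1, tau)
    print(tau, dict(kept), "cases:", sum(kept.values()))
```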

Finally, we define *arc-based filtering* using a threshold $\tau_{\mathit{arc}} \in \mathbb{N}$. Whereas activity-based filtering and variant-based filtering operate on event logs, arc-based filtering modifies the DFG and not the event log used to generate it. All arcs with a frequency lower than $\tau_{\mathit{arc}}$ are removed from the graph.

**Definition 11 (Arc-Based Filtering).** *Let* $G = (A, F) \in \mathcal{U}_{G}$ *be a DFG and* $\tau_{\mathit{arc}} \in \mathbb{N}$*.* $\mathit{filter}_{\mathit{arc}}(G, \tau_{\mathit{arc}}) = (A, F')$ *with* $F' = [(x, y) \in F \mid F((x, y)) \geq \tau_{\mathit{arc}}]$*.*

In its basic form, arc-based filtering retains all nodes even when they become fully disconnected from the rest of the graph. Consider the DFG $M_1 = (A, F)$ in Fig. 2(b) with $A = \{a, b, c, d, e\}$ and $F = [(\blacktriangleright,a)^{16}, (a,b)^{10}, (a,c)^{5}, (a,d)^{1}, (b,c)^{10}, (b,e)^{5}, (c,b)^{5}, (c,e)^{10}, (d,e)^{1}, (e,\blacksquare)^{16}]$. If $\tau_{\mathit{arc}} = 10$, then $\mathit{filter}_{\mathit{arc}}(M_1, \tau_{\mathit{arc}}) = (A, F')$ with $F' = [(\blacktriangleright,a)^{16}, (a,b)^{10}, (b,c)^{10}, (c,e)^{10}, (e,\blacksquare)^{16}]$. If $\tau_{\mathit{arc}} = 15$, then $\mathit{filter}_{\mathit{arc}}(M_1, \tau_{\mathit{arc}}) = (A, F')$ with $F' = [(\blacktriangleright,a)^{16}, (e,\blacksquare)^{16}]$. Note that the DFG is no longer connected.
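The following sketch applies Definition 11 to $M_1$ and also reports which activities end up without any arc. The representation (a set of activities plus a `Counter` of arcs) and the helper `isolated` are assumptions made for the illustration.

```python
from collections import Counter

START, END = "▶", "■"  # artificial start and end nodes

def filter_arc(activities, arcs, tau_arc):
    """Drop arcs with a frequency below tau_arc; all nodes are retained."""
    kept = Counter({arc: n for arc, n in arcs.items() if n >= tau_arc})
    return set(activities), kept

def isolated(activities, arcs):
    """Activities that are not touched by any remaining arc."""
    touched = {node for arc in arcs for node in arc}
    return set(activities) - touched

A = {"a", "b", "c", "d", "e"}
F = Counter({
    (START, "a"): 16, ("a", "b"): 10, ("a", "c"): 5, ("a", "d"): 1,
    ("b", "c"): 10, ("b", "e"): 5, ("c", "b"): 5, ("c", "e"): 10,
    ("d", "e"): 1, ("e", END): 16,
})

for tau in (10, 15):
    nodes, arcs = filter_arc(A, F, tau)
    print(tau, dict(arcs), "isolated:", sorted(isolated(nodes, arcs)))
```

With a threshold of 15, activities $b$, $c$, and $d$ remain as nodes although no arc connects them, and there is no longer a path from start to end.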

The three types of filtering can be combined. Because arc-based filtering operates on the DFG, it should be done last. It is also better to conduct activity-based filtering before variant-based filtering. There are several reasons for this. The number of traces is affected by variant-based filtering. Moreover, activity-based filtering may lead to variants with a higher frequency. Consider $L_1$ with $\tau_{\mathit{act}} = 16$ and $\tau_{\mathit{var}} = 10$. If we first apply variant-based filtering, one variant remains after the first step and none of the activities is frequent enough to be retained in the second step: $\mathit{filter}_{\mathit{act}}(\mathit{filter}_{\mathit{var}}(L_1, \tau_{\mathit{var}}), \tau_{\mathit{act}}) = [\langle\,\rangle^{10}]$. If we first apply activity-based filtering, then the two most frequent activities are retained and all 16 traces are considered in the second step: $\mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L_1, \tau_{\mathit{act}}), \tau_{\mathit{var}}) = [\langle a,e\rangle^{16}]$. For $L_2$ with $\tau_{\mathit{act}} = 200$ and $\tau_{\mathit{var}} = 40$, we find that $\mathit{filter}_{\mathit{act}}(\mathit{filter}_{\mathit{var}}(L_2, \tau_{\mathit{var}}), \tau_{\mathit{act}}) = [\langle\,\rangle^{90}]$ and $\mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L_2, \tau_{\mathit{act}}), \tau_{\mathit{var}}) = [\langle b,c\rangle^{50}, \langle c,b\rangle^{40}]$.
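The effect of the filtering order can be checked with a short sketch. The two functions below are illustrative implementations of Definitions 9 and 10 over a `Counter`-based log representation (an assumption of this example).

```python
from collections import Counter

def filter_act(log, tau_act):
    """Project the log onto the activities that occur at least tau_act times."""
    freq = Counter()
    for trace, n in log.items():
        for activity in trace:
            freq[activity] += n
    keep = {a for a, n in freq.items() if n >= tau_act}
    filtered = Counter()
    for trace, n in log.items():
        filtered[tuple(a for a in trace if a in keep)] += n
    return filtered

def filter_var(log, tau_var):
    """Keep only the trace variants that occur at least tau_var times."""
    return Counter({trace: n for trace, n in log.items() if n >= tau_var})

L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})

var_then_act = filter_act(filter_var(L1, 10), 16)
act_then_var = filter_var(filter_act(L1, 16), 10)
print(dict(var_then_act))  # {(): 10}
print(dict(act_then_var))  # {('a', 'e'): 16}
```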

These examples show that the order of filtering matters. We propose a *refined baseline discovery algorithm using filtering*. The algorithm first applies activity-based filtering followed by variant-based filtering. Then the original baseline algorithm is applied to the resulting event log to get a DFG (see Definition 6). Finally, arc-based filtering is used to prune the DFG.

**Definition 12 (Baseline Discovery Algorithm Using Filtering).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ *be an event log. Given the thresholds* $\tau_{\mathit{act}} \in \mathbb{N}$*,* $\tau_{\mathit{var}} \in \mathbb{N}$*, and* $\tau_{\mathit{arc}} \in \mathbb{N}$*:* $\mathit{disc}^{\tau_{\mathit{act}},\tau_{\mathit{var}},\tau_{\mathit{arc}}}_{\mathit{DFG}}(L) = \mathit{filter}_{\mathit{arc}}(\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L, \tau_{\mathit{act}}), \tau_{\mathit{var}})), \tau_{\mathit{arc}})$*.*

$\mathit{disc}^{\tau_{\mathit{act}},\tau_{\mathit{var}},\tau_{\mathit{arc}}}_{\mathit{DFG}}(L)$ returns a DFG using the three filtering steps. Only the last filtering step is specific to DFGs. Activity-based filtering and variant-based filtering can be used in conjunction with any discovery technique, because they produce filtered event logs. The footprint notion can also be extended to include these two types of filtering: $\mathit{fp}^{\tau_{\mathit{act}},\tau_{\mathit{var}}}(L) = \mathit{fp}(\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L, \tau_{\mathit{act}}), \tau_{\mathit{var}})))$ is the footprint matrix considering only frequent activities and variants.
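Definition 12 is a plain composition of the previous steps. The sketch below chains illustrative implementations of the three filters and the baseline algorithm; the function names and the `Counter`-based representation are assumptions, and the arc threshold of 50 is chosen only for this example.

```python
from collections import Counter

START, END = "▶", "■"  # artificial start and end nodes

def filter_act(log, tau_act):
    freq = Counter()
    for trace, n in log.items():
        for activity in trace:
            freq[activity] += n
    keep = {a for a, n in freq.items() if n >= tau_act}
    filtered = Counter()
    for trace, n in log.items():
        filtered[tuple(a for a in trace if a in keep)] += n
    return filtered

def filter_var(log, tau_var):
    return Counter({trace: n for trace, n in log.items() if n >= tau_var})

def disc_dfg(log):
    activities = {a for trace in log for a in trace}
    arcs = Counter()
    for trace, n in log.items():
        extended = (START,) + trace + (END,)
        for x, y in zip(extended, extended[1:]):
            arcs[(x, y)] += n
    return activities, arcs

def filter_arc(activities, arcs, tau_arc):
    return activities, Counter({arc: n for arc, n in arcs.items() if n >= tau_arc})

def disc_dfg_filtered(log, tau_act, tau_var, tau_arc):
    """Activity filter, then variant filter, then discovery, then arc filter."""
    filtered_log = filter_var(filter_act(log, tau_act), tau_var)
    return filter_arc(*disc_dfg(filtered_log), tau_arc)

L2 = Counter({
    ("a", "b", "c", "e"): 50, ("a", "c", "b", "e"): 40,
    ("a", "b", "c", "d", "b", "c", "e"): 30, ("a", "c", "b", "d", "b", "c", "e"): 20,
    ("a", "b", "c", "d", "c", "b", "e"): 10,
    ("a", "c", "b", "d", "c", "b", "d", "b", "c", "e"): 10,
})
A, F = disc_dfg_filtered(L2, tau_act=200, tau_var=40, tau_arc=50)
print(sorted(A), dict(F))
```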

**Fig. 4.** Three DFGs learned from event log $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$: (b) the original DFG considering all activities, (c) the problematic DFG obtained by simply removing activity $d$ from the graph, and (d) the desired DFG obtained by removing activity $d$ from the event log first.

Most process mining tools provide sliders to interactively set one or more thresholds. This makes it easy to seamlessly simplify the discovered DFG. However, it is vital that the user understands the different filtering approaches. Therefore, we highlight the following risks.


#### **2.5 A Larger Example**

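To further illustrate the concepts, we now consider a slightly larger event log $L_3 = [\langle \mathit{ie}, \mathit{cu}, \mathit{lt}, \mathit{xr}, \mathit{fe}\rangle^{285}, \langle \mathit{ie}, \mathit{cu}, \mathit{lt}, \mathit{ct}, \mathit{fe}\rangle^{260}, \langle \mathit{ie}, \mathit{cu}, \mathit{ct}, \mathit{lt}, \mathit{fe}\rangle^{139}, \langle \mathit{ie}, \mathit{lt}, \mathit{cu}, \mathit{xr}, \mathit{fe}\rangle^{137}, \langle \mathit{ie}, \mathit{lt}, \mathit{cu}, \mathit{ct}, \mathit{fe}\rangle^{124}, \langle \mathit{ie}, \mathit{cu}, \mathit{xr}, \mathit{lt}, \mathit{fe}\rangle^{113}, \langle \mathit{ie}, \mathit{xr}, \mathit{cu}, \mathit{lt}, \mathit{fe}\rangle^{72}, \langle \mathit{ie}, \mathit{ct}, \mathit{cu}, \mathit{xr}, \mathit{fe}\rangle^{72}, \langle \mathit{ie}, \mathit{cu}, \mathit{om}, \mathit{am}, \mathit{cu}, \mathit{lt}, \mathit{xr}, \mathit{fe}\rangle^{29}, \langle \mathit{ie}, \mathit{cu}, \mathit{om}, \mathit{am}, \mathit{cu}, \mathit{lt}, \mathit{ct}, \mathit{fe}\rangle^{28}, \ldots]$. We use the following abbreviations: $\mathit{ie}$ = initial examination, $\mathit{xr}$ = X-ray, $\mathit{ct}$ = CT scan, $\mathit{cu}$ = checkup, $\mathit{om}$ = order medicine, $\mathit{am}$ = administer medicine, $\mathit{lt}$ = lab tests, and $\mathit{fe}$ = final examination. The event log contains 11761 events corresponding to 1856 cases. Each case represents the treatment of a patient. There are 197 trace variants and 8 unique activities. For example, $\langle \mathit{ie}, \mathit{cu}, \mathit{lt}, \mathit{xr}, \mathit{fe}\rangle$ is the most frequent variant, i.e., 285 patients first get an initial examination ($\mathit{ie}$), followed by a checkup ($\mathit{cu}$), lab tests ($\mathit{lt}$), X-ray ($\mathit{xr}$), and a final examination ($\mathit{fe}$).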

Figure 5 shows the DFG for $L_3$ using the baseline discovery algorithm described in Definition 6. The DFG was produced by ProM's "Mine with Directly Follows visual Miner". Using a slider, it is possible to remove infrequent activities. Figure 6 shows the DFG $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{act}}(L_3, \tau_{\mathit{act}}))$ with the activity threshold $\tau_{\mathit{act}}$ set to 1000, i.e., all activities with a frequency of less than 1000 are removed from the event log using projection. In the resulting DFG, four of the eight activities remain.

**Fig. 5.** The discovered DFG $\mathit{disc}_{\mathit{DFG}}(L_3)$ generated by ProM.

**Fig. 6.** The DFG $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{act}}(L_3, \tau_{\mathit{act}}))$ generated by ProM using $\tau_{\mathit{act}} = 1000$.

The discovery of DFGs (as defined in this section) is supported by almost all process mining tools. Figure 7 shows the DFGs discovered with the Celonis EMS using the same settings as in ProM. Although the layout is different, the Celonis-based DFG in Fig. 7 (left) is identical to the ProM-based DFG in Fig. 5. The DFG in Fig. 7 (right) is identical to the DFG in Fig. 6.

Figure 8 shows variant-based filtering using the Celonis "Variant Explorer". The six most frequent variants are selected. These are the variants that have a frequency above 100, i.e., the depicted DFG is $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(L_3, \tau_{\mathit{var}}))$ with $\tau_{\mathit{var}} = 100$. There are 1856 cases distributed over 197 variants. The top six variants (i.e., 3% of all variants) cover 1058 cases (i.e., 57%). We also computed the DFG $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(L_3, \tau_{\mathit{var}}))$ with $\tau_{\mathit{var}} = 10$. There are 22 variants meeting this lower threshold (i.e., 11% of all variants) covering 1483 cases (i.e., 80%). Most event logs follow such a *Pareto distribution*, i.e., a small fraction of variants explains most of the cases observed. This is also referred to as the "80/20 rule", although the numbers 80 and 20 are arbitrary. For our example event log $L_3$, we could state that it satisfies the "80/11 rule" (but also the "57/3 rule", "84/16 rule", etc.).

**Fig. 7.** The discovered DFG in Celonis before and after activity-based filtering, i.e., $\mathit{disc}_{\mathit{DFG}}(L_3)$ (left) and $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{act}}(L_3, \tau_{\mathit{act}}))$ with $\tau_{\mathit{act}} = 1000$ (right).

**Fig. 8.** A discovered DFG in Celonis using variant-based filtering: $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(L_3, \tau_{\mathit{var}}))$ with $\tau_{\mathit{var}} = 100$. There are six variants having a frequency above 100. These cover 57% of all cases, but only 3% of all variants.
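The Pareto-style statements above boil down to two ratios: the share of variants that meet a threshold and the share of cases these variants cover. The sketch below computes both for the small log $L_2$ (since $L_3$ is only partially listed in the text) and then rechecks the percentages reported for $L_3$ from the counts given in the text. The function name `coverage` is hypothetical.

```python
from collections import Counter

def coverage(log, tau_var):
    """Share of variants kept by the threshold and share of cases they cover."""
    kept = {trace: n for trace, n in log.items() if n >= tau_var}
    return len(kept) / len(log), sum(kept.values()) / sum(log.values())

L2 = Counter({
    ("a", "b", "c", "e"): 50, ("a", "c", "b", "e"): 40,
    ("a", "b", "c", "d", "b", "c", "e"): 30, ("a", "c", "b", "d", "b", "c", "e"): 20,
    ("a", "b", "c", "d", "c", "b", "e"): 10,
    ("a", "c", "b", "d", "c", "b", "d", "b", "c", "e"): 10,
})
variant_share, case_share = coverage(L2, 40)
print(f"{variant_share:.0%} of the variants cover {case_share:.0%} of the cases")

# Counts reported in the text for L3: 197 variants, 1856 cases.
print(round(100 * 6 / 197), round(100 * 1058 / 1856))    # 3 57
print(round(100 * 22 / 197), round(100 * 1483 / 1856))   # 11 80
```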

If the distribution of cases over variants does not follow a Pareto distribution, then it is best to first apply activity-based filtering. If we project $L_3$ onto the four most frequent activities, only 20 variants remain. The most frequent variant already explains 51% of all cases. The DFG $\mathit{disc}_{\mathit{DFG}}(\mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L_3, \tau_{\mathit{act}}), \tau_{\mathit{var}}))$ with $\tau_{\mathit{act}} = 1000$ and $\tau_{\mathit{var}} = 100$ combines the activity-based filter used in Fig. 7 and the variant-based filter used in Fig. 8. The resulting DFG (not shown) explains 1672 of the 1856 cases (90%) and 7065 of 11761 events (60%) using only five variants.

The above examples show that, using filtering, it is possible to separate the normal (i.e., frequent) from the exceptional (i.e., infrequent) behavior. This is vital in the context of process discovery and can be combined with the later bottom-up and top-down discovery approaches.

## **3 Challenges**

After introducing a baseline discovery algorithm and various filtering approaches, it is possible to better explain why process discovery is so challenging. In Definition 3, we stated that a process discovery algorithm is a function $\mathit{disc} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast}) \rightarrow \mathcal{U}_{M}$, i.e., based on a multiset of traces $L$, a process model $M = \mathit{disc}(L)$ allowing for $\mathit{lang}(M) \subseteq \mathcal{U}_{\mathit{act}}^{\ast}$ is produced.

The first challenge is that *the discovered process model may serve different goals*. Should the model summarize past behavior, or is the model used for predictions and recommendations? Also, should the process model be easy to read and understand by end-users? Answers to these questions are needed to address the trade-offs in process discovery. We already mentioned that most event logs follow a Pareto distribution. Hence, the process model can focus on the dominant behavior or also include exceptional behavior.

The second challenge is that different process model representations can be used. These may or may not be able to capture certain behaviors. This is the so-called *representational bias* of process discovery. Consider, for example, event log $L = [\langle a,b,c,d\rangle^{1000}, \langle a,c,b,d\rangle^{1000}]$. There is no DFG that is able to adequately describe this behavior. The DFG will always need to introduce a loop involving $b$ and $c$. Another example is $L = [\langle a,b,c\rangle^{1000}, \langle a,c\rangle^{1000}]$. It is easy to create a DFG describing this behavior. However, when representing this as a Petri net or process tree, it is vital that one can use so-called silent activities (to skip $b$) or duplicate activities (to have a $c$ activity following $a$ and another $c$ activity following $b$).

Another challenge is that the event log contains just *example behavior*. Most event logs have a Pareto distribution. Typically, a few trace variants are frequent and many trace variants are infrequent. Actually, there are often trace variants that are unique (i.e., occur only once). If one observes the process longer, new variants will appear. Conversely, if one observes the process in a different period, some variants may no longer appear. An event log is a *sample* and should be treated as such. Just like in statistics, the goal is to use the sample to say something about the whole population (here, the process). For example, when throwing a die ten times, one may observe the sequence $\sigma = \langle 4, 5, 2, 3, 6, 5, 4, 1, 2, 3\rangle$. If we do not know that two subsequent throws are independent, that the expected value is 3.5, that the minimum is 1 and the maximum is 6, and that the probabilities of all six values are equal, then what can be concluded from the sample $\sigma$? We could conclude that even numbers are always followed by odd numbers. Real-life processes have many more behaviors, and the observed sample rarely covers all possibilities.
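The die example can be replayed in a few lines. The sketch below checks the (misleading) rule that the sample suggests and counts how many of the 36 possible directly-follows pairs were actually observed; the variable names are chosen for this illustration only.

```python
sigma = [4, 5, 2, 3, 6, 5, 4, 1, 2, 3]           # the ten observed throws
pairs = list(zip(sigma, sigma[1:]))              # directly-follows pairs

# In this sample, every even number is followed by an odd number.
even_then_odd = all(y % 2 == 1 for x, y in pairs if x % 2 == 0)
print(even_then_odd)                             # True

# Only a small part of the 6 * 6 possible pairs shows up in the sample.
print(len(set(pairs)), "of", 6 * 6, "possible pairs observed")
```

A discovery algorithm that takes the sample at face value would forbid most pairs, although all of them are possible for a fair die.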

Although processes are *stochastic*, most process discovery techniques aim to discover process models that are "binary", i.e., a trace is possible or not. This complicates analysis. Another challenge is that event logs *do not contain negative examples*. Process discovery can be seen as a classification problem: a trace $\sigma$ is possible ($\sigma \in \mathit{lang}(M)$) or not ($\sigma \notin \mathit{lang}(M)$). In real applications, we never witness traces that are impossible. The event log only contains positive examples. If we also want to incorporate infrequent behavior in the discovered model, we may require $\mathit{var}(L) \subseteq \mathit{lang}(M)$. However, we cannot assume the reverse, $\mathit{lang}(M) \subseteq \mathit{var}(L)$. Otherwise, loops in models would be impossible, and for concurrent processes we would need a factorial number of cases.

Related to the above are the challenges imposed by *concept drift*. The behavior of the process that we are trying to discover may change over time in unforeseen ways. Certain traces may increase or decrease in likelihood. New trace variants may emerge while other variants no longer occur. Since process models already describe dynamic behavior, concept drift introduces second-order dynamics. Various techniques for concept-drift detection have been developed. Nevertheless, concept drift clearly complicates process discovery. If we cannot assume that the process itself is in steady-state, then what is the process we are trying to discover? Do we want to have a process model describing the past week or the past year?

Next to concept drift, there are the usual *data quality problems* [1]. Events may have been logged incorrectly and attributes may be missing or imprecise. In some applications, it may be difficult to *correlate events* and group them into cases. There may be different identifiers used for the same case and events may be shared by different cases. Since process discovery depends on the ordering of events in the event log, *high-quality timestamps* are important. However, the timestamp resolution may be too low (e.g., just a date) and different source systems may use different timestamp granularities or formats. Often the day and the month are swapped, e.g., 8/7/2022 is entered as 7/8/2022.

It is important to distinguish the *evaluation of a process discovery algorithm* $\mathit{disc} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast}) \rightarrow \mathcal{U}_{M}$ from the *evaluation of a specific process model* $M$ in the context of a *specific event log* $L$. To evaluate a process discovery algorithm $\mathit{disc}$, one can use cross-validation, i.e., split an event log into a training part and an evaluation part. The process model is trained using the *training log* and evaluated using the *evaluation log*. Ideally, the evaluation log has both positive and negative examples. This is unrealistic in real settings. However, it is possible to create synthetic event data with positive and negative cases using, for example, simulation. If we assume that the *evaluation log* consists of a multiset of positive traces $L^{+}_{\mathit{eval}} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$ and a multiset of negative traces $L^{-}_{\mathit{eval}} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{\ast})$, then evaluation is simple. Let $M = \mathit{disc}(L^{+}_{\mathit{train}})$ be the discovered process model using only positive training examples. Now, we can use standard notions such as

$$\mathit{recall} = \frac{|[\sigma \in L^{+}_{\mathit{eval}} \mid \sigma \in \mathit{lang}(M)]|}{|L^{+}_{\mathit{eval}}|} \quad \text{and} \quad \mathit{precision} = \frac{|[\sigma \in L^{-}_{\mathit{eval}} \mid \sigma \notin \mathit{lang}(M)]|}{|L^{-}_{\mathit{eval}}|}$$

using the evaluation log. Recall is high when most of the positive traces in the evaluation log are indeed possible according to the process model. Precision is high when most of the negative traces in the evaluation log are indeed not possible according to the process model.
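The two measures can be sketched as follows. The evaluation logs `pos_eval` and `neg_eval` are hypothetical data invented for this illustration (they do not appear in the chapter). The DFG $M_1$ is compared with a model whose language is the finite set $\mathit{lang}(M_2)$ given earlier, and each model is represented only by an acceptance function.

```python
from collections import Counter

START, END = "▶", "■"  # artificial start and end nodes
M1_ARCS = {
    (START, "a"), ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
    ("b", "e"), ("c", "b"), ("c", "e"), ("d", "e"), ("e", END),
}
LANG_M2 = {("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")}

def accepts_m1(trace):
    """Membership test for the language of the DFG M1."""
    path = (START,) + tuple(trace) + (END,)
    return all(step in M1_ARCS for step in zip(path, path[1:]))

def accepts_m2(trace):
    """Membership test for the finite language of M2."""
    return tuple(trace) in LANG_M2

def recall(pos_eval, accepts):
    """Fraction of positive evaluation traces that the model allows."""
    return sum(n for t, n in pos_eval.items() if accepts(t)) / sum(pos_eval.values())

def precision(neg_eval, accepts):
    """Fraction of negative evaluation traces that the model rejects."""
    return sum(n for t, n in neg_eval.items() if not accepts(t)) / sum(neg_eval.values())

# Hypothetical evaluation logs (not taken from the chapter).
pos_eval = Counter({("a", "b", "c", "e"): 6, ("a", "c", "b", "e"): 3, ("a", "d", "e"): 1})
neg_eval = Counter({("a", "b", "e"): 5, ("a", "e"): 5})

print(recall(pos_eval, accepts_m1), precision(neg_eval, accepts_m1))  # 1.0 0.5
print(recall(pos_eval, accepts_m2), precision(neg_eval, accepts_m2))  # 1.0 1.0
```

Under these assumed logs, both models replay all positive traces, but the DFG also accepts the negative trace $\langle a,b,e\rangle$, which lowers its precision.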

Unfortunately, the above view is very naïve considering process discovery in practical settings. We *cannot* assume negative examples when evaluating a *specific* model $M$ in the context of a *specific* event log $L$ observed in *reality*. Splitting $L$ into a training log and an evaluation log does not make any sense since the model is given and we want to use the whole event log.

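In spite of these problems, there is consensus in the process mining community that there are the following four *quality dimensions* to evaluate a process model $M$ in the context of an event log $L$ with observed behavior [1].

- *Recall* (also called replay fitness): the model should allow for the behavior observed in the event log.
- *Precision*: the model should not allow for behavior unrelated to what was observed (avoiding "underfitting").
- *Generalization*: the model should generalize the example behavior observed in the event log (avoiding "overfitting").
- *Simplicity*: the model should be as simple and understandable as possible.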


There exist various measures for recall. The simplest one computes the fraction of traces in event log L possible according to the process model M. It is also possible to define such a notion at the level of events. There are many simplicity notions. These do not depend on the behavior of the model, but measure its understandability and complexity. Most challenging are the notions of precision and generalization. Also, these notions can be quantified, but there is less consensus on what they should measure. The goal is to strike a balance between precision (avoiding "underfitting" the sample event data) and generalization (avoiding "overfitting" the sample event data). A detailed discussion is outside the scope of this chapter. Therefore, we refer to [1,4,15,31] for further information.

## **4 Process Modeling Notations**

We have formalized the notion of an event log and the behavior represented by a DFG. Now we focus on higher-level process models able to model sequences, choices, loops, and concurrency. We formalize Petri nets and process trees and provide an informal introduction to a relevant subset of BPMN.

#### **4.1 Labeled Accepting Petri Nets**

Figures 2(c) and 3(c) already showed example Petri nets. Since their inception in 1962 [28], Petri nets have been used in a wide variety of application domains. Petri nets were the first formalism to capture concurrency in a systematic manner. See [17,18] for a more extensive introduction. Other notations such as Business Process Model and Notation (BPMN), Event-driven Process Chains (EPCs), and UML activity diagrams all build on Petri nets and have semantics involving "playing the token game". For process mining, we need so-called *labeled accepting Petri nets*. These are standard Petri nets where transitions are labeled to refer to activities in the event log and which, next to an initial marking, also have a final marking. The behavior described by such a net consists of all the "paths" leading from the initial state to the final state. We explain these concepts step-by-step.

**Fig. 9.** Four accepting Petri nets: (a) $\mathit{AN}_1 = (N_1, [\mathit{p1}], [\mathit{p6}])$, (b) $\mathit{AN}_2 = (N_2, [\mathit{p1}], [\mathit{p6}])$, (c) $\mathit{AN}_3 = (N_3, [\mathit{p1}, \mathit{p2}], [\mathit{p4}, \mathit{p5}])$, and (d) $\mathit{AN}_4 = (N_4, [\mathit{p1}], [\mathit{p6}])$. $\mathit{AN}_1$ was discovered for $L_1$ (see Fig. 2(c)) and $\mathit{AN}_2$ was discovered for $L_2$ (see Fig. 3(c)).

States in Petri nets are called *markings* that mark certain *places* (represented by circles) with *tokens* (represented by black dots). *Transitions* (represented by squares) are the active components able to move the Petri net from one marking to another marking. Transitions may have a label referring to the corresponding activity. There may be multiple transitions that refer to the same activity and there may be transitions without an activity label. The former is needed if the same activity can occur at multiple stages in the process. The latter is needed if activities can be skipped. Later we will give examples illustrating the importance of the labeling function in the context of process mining.

**Definition 13 (Labeled Petri Net).** *A labeled Petri net is a tuple* $N = (P, T, F, l)$ *with* $P$ *the set of places,* $T$ *the set of transitions,* $P \cap T = \emptyset$*,* $F \subseteq (P \times T) \cup (T \times P)$ *the flow relation, and* $l \in T \to \mathcal{U}_{\mathit{act}}$ *a labeling function. We write* $l(t) = \tau$ *if* $t \in T \setminus \mathit{dom}(l)$ *(i.e.,* $t$ *is a silent transition that cannot be observed).*

Figure 9 shows four accepting Petri nets. The first two were discovered for the event logs $L_1$ and $L_2$ used to introduce DFGs. Figure 9(a) shows the labeled Petri net $N_1 = (P_1, T_1, F_1, l_1)$ with $P_1 = \{p1, p2, p3, p4, p5, p6\}$ (six places), $T_1 = \{t1, t2, t3, t4, t5\}$ (five transitions), $F_1 = \{(p1, t1), (t1, p2), (t1, p3), \ldots, (t5, p6)\}$ (fourteen arcs), and $l_1 = \{(t1, a), (t2, b), (t3, c), (t4, d), (t5, e)\}$ (labeling function).

As mentioned, there may be multiple transitions with the same label and there may be transitions that have no label (called "silent transitions"). This is illustrated by $N_4 = (P_4, T_4, F_4, l_4)$ in Fig. 9(d) with $l_4 = \{(t1, a), (t2, b), (t3, a)\}$. Note that $\mathit{dom}(l_4) = \{t1, t2, t3\}$ does not include $t4$ and $t5$, which are silent. This is denoted by the two black rectangles in Fig. 9(d). Also note that $l_4(t1) = l_4(t3) = a$, i.e., $t1$ and $t3$ refer to the same activity.

Since a place may have multiple tokens, markings are represented by multisets. Transitions may have input and output places. For example, t1 in Fig. 9(a) has one input place and two output places. A transition is called *enabled* if each of the input places has a token. An enabled transition may *fire* (i.e., occur), thereby consuming a token from each input place and producing a token for each output place.
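The firing rule can be illustrated with a minimal Python sketch. It models markings as multisets via `collections.Counter` and encodes only transition t1 of Fig. 9(a), using the one input place and two output places stated above. The names `PRE`, `POST`, `enabled`, and `fire` are illustrative choices, not notation from this chapter.

```python
from collections import Counter

# Input and output places of t1 in Fig. 9(a): one input place (p1)
# and two output places (p2 and p3).
PRE = {"t1": Counter({"p1": 1})}
POST = {"t1": Counter({"p2": 1, "p3": 1})}

def enabled(marking, t):
    """A transition is enabled if each of its input places holds a token."""
    return all(marking[p] >= n for p, n in PRE[t].items())

def fire(marking, t):
    """Consume a token from each input place, produce one for each output place."""
    if not enabled(marking, t):
        raise ValueError(f"{t} is not enabled in {dict(marking)}")
    return (marking - PRE[t]) + POST[t]

m0 = Counter({"p1": 1})   # marking [p1]
m1 = fire(m0, "t1")       # marking [p2, p3]
```

After firing, `m1` holds one token in p2 and one in p3, and t1 is no longer enabled because p1 is empty.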

An *accepting Petri net* has an initial marking $M_{\mathit{init}} \in \mathcal{B}(P)$ and a final marking $M_{\mathit{final}} \in \mathcal{B}(P)$. The accepting Petri nets $\mathit{AN}_1 = (N_1, [p1], [p6])$, $\mathit{AN}_2 = (N_2, [p1], [p6])$, and $\mathit{AN}_4 = (N_4, [p1], [p6])$ in Fig. 9 have the same initial and final marking. $\mathit{AN}_3 = (N_3, [p1, p2], [p4, p5])$ in Fig. 9(c) has an initial marking $M_{\mathit{init}} = [p1, p2]$ (denoted by the black tokens) and a final marking $M_{\mathit{final}} = [p4, p5]$ (denoted by the double-bordered places).

**Definition 14 (Accepting Petri Net).** *An accepting Petri net is a triplet* $\mathit{AN} = (N, M_{\mathit{init}}, M_{\mathit{final}})$ *where* $N = (P, T, F, l)$ *is a labeled Petri net,* $M_{\mathit{init}} \in \mathcal{B}(P)$ *is the initial marking, and* $M_{\mathit{final}} \in \mathcal{B}(P)$ *is the final marking.* $\mathcal{U}_{\mathit{AN}} \subseteq \mathcal{U}_M$ *is the set of accepting Petri nets.*

An accepting Petri net starts in the initial marking and may move from one marking to the next by firing enabled transitions. Consider, for example, $\mathit{AN}_3 = (N_3, [p1, p2], [p4, p5])$ in Fig. 9(c). Initially, three transitions are enabled in $[p1, p2]$: $t1$, $t2$, and $t3$. Firing $t1$ results in marking $[p2, p4]$, firing $t2$ results in marking $[p1, p3]$, and firing $t3$ results in marking $[p3, p4]$. If $t1$ fires (i.e., activity $a$ occurs), then $t1$ and $t3$ are no longer enabled and only $t2$ remains enabled. If $t2$ fires in $[p2, p4]$, we reach the marking $[p3, p4]$. In this marking, only $t4$ is enabled. Firing $t4$ results in the marking $[p4, p5]$. This is also the final marking of $\mathit{AN}_3$. A *firing sequence* is a sequence of transition occurrences obtained by firing enabled transitions and moving from one marking to the next. A *complete* firing sequence starts in the initial marking and ends in the final marking. $\mathit{AN}_3$ has four possible complete firing sequences: $\langle t1, t2, t4 \rangle$, $\langle t2, t1, t4 \rangle$, $\langle t2, t4, t1 \rangle$, and $\langle t3, t4 \rangle$.

**Definition 15 (Complete Firing Sequences).** *Let* $\mathit{AN} = (N, M_{\mathit{init}}, M_{\mathit{final}}) \in \mathcal{U}_{\mathit{AN}}$ *be an accepting Petri net with* $N = (P, T, F, l)$*.* $\mathit{cfs}(\mathit{AN}) \subseteq T^*$ *is the set of complete firing sequences of* $\mathit{AN}$*, i.e., all firing sequences starting in the initial marking* $M_{\mathit{init}}$ *and ending in the final marking* $M_{\mathit{final}}$*.*

$\mathit{cfs}(\mathit{AN}_1) = \{\langle t1, t2, t3, t5 \rangle, \langle t1, t3, t2, t5 \rangle, \langle t1, t4, t5 \rangle\}$ and $\mathit{cfs}(\mathit{AN}_3) = \{\langle t1, t2, t4 \rangle, \langle t2, t1, t4 \rangle, \langle t2, t4, t1 \rangle, \langle t3, t4 \rangle\}$. Note that $\mathit{cfs}(\mathit{AN}_2)$ and $\mathit{cfs}(\mathit{AN}_4)$ contain an infinite number of complete firing sequences due to the loop involving $t4$.
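As a hedged illustration, the following Python sketch enumerates the complete firing sequences of $\mathit{AN}_3$ by a depth-first search over markings. The arc structure in `PRE` and `POST` is an assumption reconstructed from the firing walkthrough in the text (Fig. 9(c) is not reproduced here). The bound `max_len` is a hypothetical safeguard so that the search would also terminate on nets with loops.

```python
from collections import Counter

# Arc structure of N3, reconstructed from the textual walkthrough
# (an assumption, not read off the figure).
PRE = {"t1": ["p1"], "t2": ["p2"], "t3": ["p1", "p2"], "t4": ["p3"]}
POST = {"t1": ["p4"], "t2": ["p3"], "t3": ["p3", "p4"], "t4": ["p5"]}

def complete_firing_sequences(m_init, m_final, max_len=10):
    """Collect all firing sequences from m_init to m_final up to max_len steps."""
    final = Counter(m_final)
    found = set()

    def visit(marking, prefix):
        if marking == final:
            found.add(tuple(prefix))
        if len(prefix) >= max_len:
            return
        for t in PRE:
            need = Counter(PRE[t])
            if all(marking[p] >= n for p, n in need.items()):
                visit(marking - need + Counter(POST[t]), prefix + [t])

    visit(Counter(m_init), [])
    return found

cfs_an3 = complete_firing_sequences(["p1", "p2"], ["p4", "p5"])
```

Under these assumptions, `cfs_an3` contains exactly the four sequences listed for $\mathit{AN}_3$.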

As stated in Definition 2, a process model defines a set of traces. Earlier, we defined $\mathit{lang}(G) \subseteq \mathcal{U}_{\mathit{act}}^*$ for a DFG $G = (A, F)$. Now we need to define $\mathit{lang}(\mathit{AN}) \subseteq \mathcal{U}_{\mathit{act}}^*$ for an accepting Petri net $\mathit{AN} = (N, M_{\mathit{init}}, M_{\mathit{final}})$. For this purpose, we need to be able to apply the labeling function $l$ to firing sequences. Let $\sigma \in T^*$ be a firing sequence and $l \in T \to \mathcal{U}_{\mathit{act}}$ a labeling function. Function $l$ is generalized to sequences, i.e., transitions are replaced by their labels and are dropped if they do not have a label. Formally, $l(\langle\,\rangle) = \langle\,\rangle$, $l(\sigma \cdot \langle t \rangle) = l(\sigma) \cdot \langle l(t) \rangle$ if $t \in \mathit{dom}(l)$, and $l(\sigma \cdot \langle t \rangle) = l(\sigma)$ if $t \notin \mathit{dom}(l)$. Consider, for example, the complete firing sequence $\sigma = \langle t1, t2, t3, t4, t3, t2, t5 \rangle \in \mathit{cfs}(\mathit{AN}_4)$ of the accepting Petri net in Fig. 9(d). $l(\sigma) = \langle a, b, a, a, b \rangle$, i.e., $t1$, $t2$, and $t3$ are mapped to the corresponding labels, and $t4$ and $t5$ are dropped.
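The generalization of the labeling function to sequences can be sketched in a few lines of Python. The dictionary `l4` mirrors the labeling of Fig. 9(d) given in the text. Modeling silent transitions as missing dictionary keys and the helper name `project` are illustrative choices.

```python
# Labeling function l4 of Fig. 9(d). The silent transitions t4 and t5
# are simply absent, i.e., they are not in dom(l4).
l4 = {"t1": "a", "t2": "b", "t3": "a"}

def project(labeling, sigma):
    """Replace transitions by their labels and drop unlabeled (silent) ones."""
    return tuple(labeling[t] for t in sigma if t in labeling)

sigma = ("t1", "t2", "t3", "t4", "t3", "t2", "t5")
trace = project(l4, sigma)  # ('a', 'b', 'a', 'a', 'b')
```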

**Definition 16 (Traces of an Accepting Petri Net).** *Let* $\mathit{AN} = (N, M_{\mathit{init}}, M_{\mathit{final}}) \in \mathcal{U}_{\mathit{AN}}$ *be an accepting Petri net.* $\mathit{lang}(\mathit{AN}) = \{l(\sigma) \mid \sigma \in \mathit{cfs}(\mathit{AN})\}$ *are the traces possible according to* $\mathit{AN}$*.*

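Now we can reason about the traces of the four accepting Petri nets in Fig. 9. $\mathit{lang}(\mathit{AN}_1) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle, \langle a, d, e \rangle\}$. $\mathit{lang}(\mathit{AN}_2) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle, \langle a, b, c, d, b, c, e \rangle, \langle a, c, b, d, b, c, e \rangle, \ldots, \langle a, c, b, d, b, c, d, c, b, d, c, b, e \rangle, \ldots\}$. $\mathit{lang}(\mathit{AN}_3) = \{\langle a, b, d \rangle, \langle b, a, d \rangle, \langle b, d, a \rangle, \langle c, d \rangle\}$. $\mathit{lang}(\mathit{AN}_4) = \{\langle a, b, a \rangle, \langle a, a, b \rangle, \langle a, b, a, b, a \rangle, \langle a, a, b, b, a \rangle, \ldots, \langle a, a, b, b, a, a, b, a, b \rangle, \ldots\}$.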

It is important to note the consequences of restricting $\mathit{lang}(\mathit{AN})$ to the behavior of complete firing sequences. If $\mathit{AN}$ has *livelocks* or *deadlocks*, then these are *not* considered to be part of the language. If we remove the arc from $p4$ to $t4$ in $\mathit{AN}_2$, then $\mathit{lang}(\mathit{AN}_2) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle\}$, because there are no complete firing sequences involving $t4$.

In the literature, Petri nets are normally not equipped with a *labeling function* and a *final marking*. However, both the labeling function $l$ and a defined final marking $M_{\mathit{final}}$ are vital in the context of process mining. The final marking allows us to reason about complete firing sequences, just like traces in an event log have a clear ending. If we considered ordinary Petri nets rather than accepting Petri nets, the language would also include all prefixes. This would make it impossible to describe the behavior found in an event log such as $L = [\langle a, b, c \rangle^{1000}]$, because the corresponding Petri net would also allow for the traces $\langle a, b \rangle$, $\langle a \rangle$, and $\langle\,\rangle$.

The labeling function $l \in T \to \mathcal{U}_{\mathit{act}}$ also greatly improves expressiveness. The alternative would be that transitions are uniquely identified by activities, i.e., $T \subseteq \mathcal{U}_{\mathit{act}}$. However, this would make it impossible to describe many behaviors seen in event logs. Consider, for example, an event log such as $L = [\langle a, b, c \rangle^{1000}, \langle a, c \rangle^{1000}]$ where $b$ can be skipped. It is easy to model this behavior using a silent transition to skip $b$ or by using two transitions with a $c$ label. Although it is trivial to create a DFG $G$ such that $\mathit{lang}(G) = \{\langle a, b, c \rangle, \langle a, c \rangle\}$ (simply apply the baseline algorithm described in Definition 6), it is impossible to create an accepting Petri net $\mathit{AN}$ with $\mathit{lang}(\mathit{AN}) = \{\langle a, b, c \rangle, \langle a, c \rangle\}$ without using a labeling function allowing for silent or duplicate transitions.

#### **4.2 Process Trees**

The two process trees discovered for event logs $L_1$ and $L_2$ (see Fig. 2(c) and Fig. 3(c)) are depicted as $Q_1 = \to(a, \times(\wedge(b, c), d), e)$ and $Q_2 = \to(a, \circlearrowleft(\wedge(b, c), d), e)$ in Fig. 10. Their language is the same as that of $\mathit{AN}_1$ and $\mathit{AN}_2$ in Fig. 9.

Process trees are not commonly used as a modeling language. However, state-of-the-art process discovery techniques use process trees as an internal representation. The behavior of process trees can be visualized using Petri nets, BPMN, UML activity diagrams, EPCs, etc. However, they also have their own graphical representation, as shown in Fig. 10.

The main reason for using process trees is that they have a *hierarchical structure* and are *sound by construction*. This does not hold for other notations such as Petri nets and BPMN. For example, if we remove the arc $(t4, p2)$ in $\mathit{AN}_2$ shown in Fig. 9(b), then the process may *deadlock*. The process gets stuck in marking $[p5]$, making it impossible to reach the final marking. If we remove the arc $(p4, t4)$ in $\mathit{AN}_2$, then the process may *livelock*. It is possible to put an arbitrary number of tokens in $p2$ and $p4$, but after the occurrence of $d$ it is impossible to reach the final marking. If both arcs are removed, the accepting Petri net is again sound (i.e., free of anomalies such as deadlocks and livelocks). When discovering process model constructs locally, these potential soundness problems are difficult to handle (see [6] for more details on analyzing soundness of process models). Therefore, a range of inductive mining techniques has been developed using process trees that are sound by construction [22–24].

**Fig. 10.** Three process trees: (a) $Q_1 = \to(a, \times(\wedge(b, c), d), e)$, (b) $Q_2 = \to(a, \circlearrowleft(\wedge(b, c), d), e)$, and (c) $Q_3 = \to(a, \circlearrowleft(\wedge(b, a), \tau))$.

A process tree is a tree-like structure with one root node. The leaf nodes correspond to activities (including the silent activity $\tau$, which is similar to a silent transition in Petri nets). Four types of operators can be used in a process tree: $\to$ (sequential composition), $\times$ (exclusive choice), $\wedge$ (parallel composition), and $\circlearrowleft$ (redo loop). This way it is possible to construct process trees such as the ones shown in Fig. 10.

**Definition 17 (Process Tree).** *Let* $\mathit{PTO} = \{\to, \times, \wedge, \circlearrowleft\}$ *be the set of process tree operators and let* $\tau \notin \mathcal{U}_{\mathit{act}}$ *be the so-called silent activity. Process trees are defined as follows.*


$\mathcal{U}_Q \subseteq \mathcal{U}_M$ *is the set of all process trees.*

Consider the process tree $Q_1 = \to(a, \times(\wedge(b, c), d), e)$ shown in Fig. 10(a). The leaf nodes correspond to the activities $a$, $b$, $c$, $d$, and $e$. The root node is a sequence operator ($\to$) having three children: $a$, $\times(\wedge(b, c), d)$, and $e$. The root node of the subtree $\times(\wedge(b, c), d)$ is a choice operator ($\times$) having two children: $\wedge(b, c)$ and $d$. The root node of the subtree $\wedge(b, c)$ is a parallel operator ($\wedge$) having two children: $b$ and $c$.

**Fig. 11.** The semantics of the four process tree operators, i.e., $\to$ (sequential composition), $\times$ (exclusive choice), $\wedge$ (parallel composition), and $\circlearrowleft$ (redo loop), expressed in terms of Petri nets.

Although it is fairly straightforward to define the semantics of process trees directly in terms of traces, we can also use the mapping onto accepting labeled Petri nets shown in Fig. 11. A silent activity, i.e., a leaf node labeled $\tau$, is mapped onto a silent transition. A normal activity $a$ is mapped onto a transition $t$ with label $l(t) = a$. Sequential composition $\to(a, b, c, \ldots, z)$ corresponds to the Petri net structure shown in Fig. 11, i.e., first $a$ occurs and only if $a$ has finished, $b$ may start, after $b$ completes, $c$ can start, etc. The sequential composition ends when the last element completes. Note that $a, b, c, \ldots, z$ do not need to be atomic activities. These elements may correspond to large subprocesses, each represented by a subtree of arbitrary complexity. Exclusive choice $\times(a, b, c, \ldots, z)$ and parallel composition $\wedge(a, b, c, \ldots, z)$ can be mapped onto Petri nets as shown in Fig. 11. Here, too, the elements do not need to be atomic and may correspond to subtrees of arbitrary complexity. Figure 11 also shows the semantics of the redo loop operator $\circlearrowleft$. In $\circlearrowleft(a, b, c, \ldots, z)$, first $a$ is executed. This is called the "do" part (again $a$ may be a subprocess). Then there is the option to stop (fire the silent transition to go to the end place) or one of the "redo elements" is executed. For example, $b$ is executed. After the completion of $b$, we again execute the "do" part $a$ after which there is again the choice to stop or pick one of the "redo elements", etc. Note that semantically $\circlearrowleft(a, b, c, \ldots, z)$ and $\circlearrowleft(a, \times(b, c, \ldots, z))$ are the same.

**Definition 18 (Traces of a Process Tree).** *Let* $Q \in \mathcal{U}_Q$ *be a process tree and* $\mathit{AN}_Q \in \mathcal{U}_{\mathit{AN}}$ *the corresponding accepting Petri net constructed by recursively applying the patterns depicted in Fig. 11.* $\mathit{lang}(Q) = \mathit{lang}(\mathit{AN}_Q)$ *are the traces possible according to* $Q$*.*

Using the above definition, we can compute the set of traces for the three process trees in Fig. 10: $Q_1 = \to(a, \times(\wedge(b, c), d), e)$, $Q_2 = \to(a, \circlearrowleft(\wedge(b, c), d), e)$, and $Q_3 = \to(a, \circlearrowleft(\wedge(b, a), \tau))$. $\mathit{lang}(Q_1) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle, \langle a, d, e \rangle\}$, $\mathit{lang}(Q_2) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle, \langle a, b, c, d, b, c, e \rangle, \langle a, c, b, d, b, c, e \rangle, \ldots, \langle a, c, b, d, b, c, d, c, b, d, c, b, e \rangle, \ldots\}$, and $\mathit{lang}(Q_3) = \{\langle a, b, a \rangle, \langle a, a, b \rangle, \langle a, b, a, b, a \rangle, \langle a, a, b, b, a \rangle, \ldots, \langle a, a, b, b, a, a, b, a, b \rangle, \ldots\}$.
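For illustration only, the sketch below computes process tree languages directly in terms of traces rather than via the Petri net mapping of Definition 18. This direct semantics is an alternative the text mentions but does not spell out, so the details are assumptions. Trees are encoded as nested tuples with hypothetical operator names `"seq"`, `"xor"`, `"and"`, and `"loop"`. Because loops yield infinitely many traces, the hypothetical parameter `max_redo` bounds the number of redo iterations.

```python
from itertools import product

TAU = "tau"  # illustrative name for the silent activity

def shuffle(u, v):
    """All interleavings of two traces (tuples of activities)."""
    if not u or not v:
        return {u + v}
    return ({(u[0],) + w for w in shuffle(u[1:], v)} |
            {(v[0],) + w for w in shuffle(u, v[1:])})

def lang(q, max_redo=1):
    """Traces of process tree q; each loop is redone at most max_redo times."""
    if isinstance(q, str):
        return {()} if q == TAU else {(q,)}
    op, *children = q
    langs = [lang(c, max_redo) for c in children]
    if op == "xor":
        return set().union(*langs)
    if op == "seq":
        return {sum(parts, ()) for parts in product(*langs)}
    if op == "and":
        result = {()}
        for child in langs:
            result = {w for u in result for v in child for w in shuffle(u, v)}
        return result
    if op == "loop":
        do, redo = langs[0], set().union(*langs[1:])
        result = layer = set(do)
        for _ in range(max_redo):
            layer = {u + r + d for u in layer for r in redo for d in do}
            result = result | layer
        return result
    raise ValueError(f"unknown operator: {op}")

Q1 = ("seq", "a", ("xor", ("and", "b", "c"), "d"), "e")
Q2 = ("seq", "a", ("loop", ("and", "b", "c"), "d"), "e")
Q3 = ("seq", "a", ("loop", ("and", "b", "a"), TAU))
```

With these encodings, `lang(Q1)` yields the three traces listed above, while `lang(Q2)` and `lang(Q3)` return finite prefixes of the infinite languages, growing with `max_redo`.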

Some additional examples to illustrate the expressiveness of process trees:


There are also behaviors that are difficult to express in terms of a process tree. For example, it is difficult to synchronize between subtrees. Consider, for example, the process tree $Q = \wedge(\to(a, b, c), \to(d, e, f))$ with the additional requirement that $b$ should be executed before $e$. This can only be handled by duplicating activities, e.g., $Q' = \times(\to(\wedge(\to(a, b), d), \wedge(c, \to(e, f))), \to(a, b, c, d, e, f))$. Trying to capture arbitrary synchronizations between subprocesses leads to incomprehensible process trees whose behavior is still easy to express in terms of a BPMN model or a labeled accepting Petri net. Figure 12(a) shows how this can be expressed in terms of a labeled accepting Petri net. Similarly, process trees cannot capture long-term dependencies (e.g., a choice at the beginning of the process influences a choice later in the process). Figure 12(b) shows an example where the second choice depends on the first choice. This simple example can be modeled using the process tree $Q = \times(\to(a, c, d, e), \to(b, c, d, f))$, which enumerates the two traces and duplicates activities $c$ and $d$. In general, process-tree-based discovery techniques are unable to create such models. Nevertheless, process trees provide a powerful representational bias that can be exploited by process discovery techniques.

(a) A labeled accepting Petri net synchronizing two parallel flows using place p6.

(b) A labeled accepting Petri net with long-term dependencies (p4 and p5).

**Fig. 12.** Two labeled accepting Petri nets with behaviors that are difficult to discover in terms of a process tree. The top model (a) corresponds to the process tree $Q = \wedge(\to(a, b, c), \to(d, e, f))$ with the additional requirement that $b$ should be executed before $e$. The bottom model (b) corresponds to the process tree $Q = \to(\times(a, b), c, d, \times(e, f))$ with the additional requirement that $a$ should be followed by $e$ and $b$ should be followed by $f$.
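The claim that the synchronization "b before e" can be captured by duplicating activities can be checked by brute force. The following sketch, with illustrative helper names `shuffle` and `concat`, compares the intended behavior (all interleavings of the two sequences in which b precedes e) with the traces of the duplicated tree, spelled out by hand.

```python
def shuffle(u, v):
    """All interleavings of two traces (tuples of activities)."""
    if not u or not v:
        return {u + v}
    return ({(u[0],) + w for w in shuffle(u[1:], v)} |
            {(v[0],) + w for w in shuffle(u, v[1:])})

def concat(xs, ys):
    """Sequential composition of two sets of traces."""
    return {x + y for x in xs for y in ys}

# Intended behavior: interleavings of <a,b,c> and <d,e,f> with b before e.
intended = {w for w in shuffle(tuple("abc"), tuple("def"))
            if w.index("b") < w.index("e")}

# Duplicated tree: a choice between
#   (1) (<a,b> in parallel with d) followed by (c in parallel with <e,f>)
#   (2) the plain sequence <a,b,c,d,e,f>.
first = concat(shuffle(tuple("ab"), ("d",)), shuffle(("c",), tuple("ef")))
second = {tuple("abcdef")}
tree_lang = first | second
```

Both sets contain the same ten traces, which supports the equivalence for this small example. It also hints at why such encodings scale poorly: every additional synchronization needs further case distinctions.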

#### **4.3 Business Process Model and Notation (BPMN)**

Business Process Model and Notation (BPMN) is the de facto representation for business process modeling in industry [19,36]. The BPMN standard is maintained by the Object Management Group (OMG) [27], is supported by a wide range of vendors, and is used by numerous organizations. The OMG specification is 532 pages [27]. Given our focus on process discovery, the constructs for control-flow are most relevant. Moreover, most tools only support a small subset of the BPMN standard and an even smaller subset is actually used on a larger scale. When using the more advanced constructs like inclusive/complex gateways and multiple instance activities, the execution semantics are also not so clear (see Chapter 13 of [27]). Therefore, we only cover start and end events, activities, exclusive gateways, parallel gateways, and sequence flows. Constructs such as pools, lanes, data objects, messages, subprocesses, and inclusive gateways are relevant for more advanced forms of process mining, but outside the scope of this chapter.

Figure 13 shows three BPMN models ($B_1$, $B_2$, and $B_3$) and a limited set of BPMN notations. We (informally) refer to the class of BPMN models constructed using these building blocks as $\mathcal{U}_{\mathit{BPMN}}$. The behavior represented by the BPMN model $B_1 \in \mathcal{U}_{\mathit{BPMN}}$ is the same as the accepting Petri net $\mathit{AN}_1 = (N_1, [p1], [p6])$ in Fig. 9(a) and the process tree $Q_1 = \to(a, \times(\wedge(b, c), d), e)$ in Fig. 10(a). Hence, $\mathit{lang}(B_1) = \{\langle a, b, c, e \rangle, \langle a, c, b, e \rangle, \langle a, d, e \rangle\}$. BPMN model $B_2 \in \mathcal{U}_{\mathit{BPMN}}$ corresponds to $\mathit{AN}_2$ in Fig. 9(b) and the process tree $Q_2$ in Fig. 10(b). BPMN model $B_3 \in \mathcal{U}_{\mathit{BPMN}}$ corresponds to $\mathit{AN}_4$ in Fig. 9(d) and the process tree $Q_3$ in Fig. 10(c). We do not provide formal semantics for these BPMN constructs. However, the examples should be self-explanatory and demonstrate that a BPMN model $B \in \mathcal{U}_{\mathit{BPMN}}$ indeed defines a set of traces $\mathit{lang}(B)$.

**Fig. 13.** Three BPMN models corresponding to the accepting Petri nets $\mathit{AN}_1$, $\mathit{AN}_2$, and $\mathit{AN}_4$, and the process trees $Q_1$, $Q_2$, and $Q_3$ used before.

In this chapter, we have introduced four types of models: DFGs $\mathcal{U}_G \subseteq \mathcal{U}_M$, accepting Petri nets $\mathcal{U}_{\mathit{AN}} \subseteq \mathcal{U}_M$, process trees $\mathcal{U}_Q \subseteq \mathcal{U}_M$, and BPMN models $\mathcal{U}_{\mathit{BPMN}} \subseteq \mathcal{U}_M$. There exist discovery approaches for all of them. Since they all specify sets of possible complete traces, automated translations are often possible. For example, a discovery technique may use process trees internally, but use Petri nets or BPMN models to visualize the result.

## **5 Bottom-Up Process Discovery**

In Sect. 2, we presented a baseline discovery approach to learn a DFG from an event log. As stated in Definition 3, a process discovery algorithm is a function $\mathit{disc} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*) \to \mathcal{U}_M$ that, given an event log $L$, produces a model $M = \mathit{disc}(L)$ that allows for the traces in $\mathit{lang}(M)$. The DFG-based baseline approach has many limitations. One of the main limitations is the inability to represent concurrency. The DFG produced tends to have an excessive number of cycles leading to Spaghetti-like underfitting models. Therefore, we introduced higher-level process model notations such as accepting Petri nets (Sect. 4.1), process trees (Sect. 4.2), and a subset of the BPMN notation (Sect. 4.3).

In this chapter, we group the more advanced approaches into two groups: "bottom-up" process discovery and "top-down" process discovery. The first group aims to uncover local patterns involving a few activities. The second group aims to find a global structure that can be used to decompose the discovery problem into smaller problems. In this section, we introduce "bottom-up" process discovery using the Alpha algorithm [1,9] as an example. In Sect. 6, we introduce "top-down" process discovery using the basic inductive mining algorithm [22–24] as an example.

Both "bottom-up" and "top-down" process discovery can be combined with the filtering approaches presented in Sect. 2.4, in particular activity-based and variant-based filtering. Without filtering, the basic Alpha algorithm and basic inductive mining algorithm will not be very usable in real-life settings. Therefore, we assume that the event logs have been preprocessed before applying "bottom-up" or "top-down" discovery algorithms.

**Definition 19 (Basic Log Preprocessing).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *be an event log. Given the thresholds* $\tau_{\mathit{act}} \in \mathbb{N}$ *and* $\tau_{\mathit{var}} \in \mathbb{N}$*:* $L^{\tau_{\mathit{act}}, \tau_{\mathit{var}}} = \mathit{filter}_{\mathit{var}}(\mathit{filter}_{\mathit{act}}(L, \tau_{\mathit{act}}), \tau_{\mathit{var}})$*.*
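The composition in Definition 19 can be sketched as follows. The filters themselves are defined in Sect. 2.4, which is not reproduced here, so the two functions below rest on an assumption about their semantics: `filter_act` drops activities that occur fewer than `tau_act` times and projects the traces accordingly, and `filter_var` drops trace variants that occur fewer than `tau_var` times. The event log is modeled as a `Counter` mapping variants to frequencies, using the log $L_1$ from this chapter as input.

```python
from collections import Counter

def filter_act(log, tau_act):
    """Assumed semantics: keep activities with frequency >= tau_act."""
    freq = Counter()
    for trace, n in log.items():
        for activity in trace:
            freq[activity] += n
    keep = {a for a, f in freq.items() if f >= tau_act}
    filtered = Counter()
    for trace, n in log.items():
        filtered[tuple(a for a in trace if a in keep)] += n
    return filtered

def filter_var(log, tau_var):
    """Assumed semantics: keep variants with frequency >= tau_var."""
    return Counter({t: n for t, n in log.items() if n >= tau_var})

def preprocess(log, tau_act, tau_var):
    """Activity-based filtering first, then variant-based filtering."""
    return filter_var(filter_act(log, tau_act), tau_var)

L1 = Counter({("a", "b", "c", "e"): 10,
              ("a", "c", "b", "e"): 5,
              ("a", "d", "e"): 1})
filtered = preprocess(L1, tau_act=2, tau_var=2)
```

Under these assumptions, activity d (frequency 1) is removed first, which turns the third variant into ⟨a, e⟩. That variant is then removed by the variant filter, leaving only the two frequent variants.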

In the remainder, we assume that the event log was preprocessed and that we want to discover a process model describing the filtered event log.

#### **5.1 The Essence of Bottom-Up Process Discovery: Admissible Places**

To explain "bottom-up" process discovery, we first introduce the notion of a "flower model" for an event log. This is the accepting Petri net without places. We use this as a basis and then add places one-by-one.

**Definition 20 (Flower Model).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *be an event log with activities* $A = \mathit{act}(L)$*. The flower model of* $L$ *is the accepting Petri net* $\mathit{disc}_{\mathit{flower}}(L) = (N, [\,], [\,])$ *with* $N = (\emptyset, A, \emptyset, \{(a, a) \mid a \in A\})$*.*

Note that $\mathit{disc}_{\mathit{flower}}(L)$ contains no places and one transition per activity. The flower model of $L_1$ is shown in Fig. 14(a). In a Petri net, a transition is enabled if all of its input places contain a token. Hence, a transition without an input place is always enabled. Moreover, the Petri net is always in the final marking $[\,]$. Therefore, $\mathit{lang}(\mathit{disc}_{\mathit{flower}}(L)) = A^*$, i.e., all traces over activities seen in the event log. Such a flower model can also be represented as a process tree. If $A = \{a_1, a_2, \ldots, a_n\} = \mathit{act}(L)$, then $Q = \circlearrowleft(\tau, a_1, a_2, \ldots, a_n)$ is the process tree that allows for any behavior over $A$, i.e., $\mathit{lang}(Q) = A^*$. Although it is easy to create such a process tree, it is not so clear how to add constraints to it. As mentioned earlier, it is impossible to synchronize activities in different subtrees. However, when looking at the flower Petri net $\mathit{disc}_{\mathit{flower}}(L)$, it is obvious that places can be added to constrain the behavior. Therefore, we use Petri nets to illustrate "bottom-up" process discovery.

Next, we consider a Petri net having a *single place* constraining the behavior of the flower model. The place $p = (A_1, A_2)$ is characterized by a set of input activities $A_1$ and a set of output activities $A_2$. We would like to add places that allow for the behavior seen in the event log. Such a place is called an *admissible place*.

**Fig. 14.** Four accepting Petri nets: (a) a flower model, (b) $\mathit{AN}_{p_2}$ with just one place $p_2 = (\{a\}, \{b, d\})$, (c) an accepting Petri net with three additional redundant places $p_7 = (\emptyset, \{e\})$, $p_8 = (\{a\}, \{e\})$, and $p_9 = (\{a\}, \emptyset)$, and (d) the accepting Petri net $\mathit{AN}_1 = (N_1, [p1], [p6])$ already shown in Fig. 9(a) (discovered by applying the original Alpha algorithm [1,9] to event log $L_1$).

**Definition 21 (Admissible Place).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^*)$ *be an event log with activities* $A = \mathit{act}(L)$*.* $p = (A_1, A_2)$ *is a candidate place if* $A_1 \subseteq A$ *and* $A_2 \subseteq A$*. The corresponding single place accepting Petri net is* $\mathit{AN}_p = (N, M_{\mathit{init}}, M_{\mathit{final}})$ *with* $N = (P, T, F, l)$*,* $P = \{p\}$*,* $T = A$*,* $F = \{(a, p) \mid a \in A_1\} \cup \{(p, a) \mid a \in A_2\}$*,* $l = \{(a, a) \mid a \in A\}$*,* $M_{\mathit{init}} = [p \mid A_1 = \emptyset]$*, and* $M_{\mathit{final}} = [p \mid A_2 = \emptyset]$*. Candidate place* $p = (A_1, A_2)$ *is admissible if* $\mathit{var}(L) \subseteq \mathit{lang}(\mathit{AN}_p)$*.* $P_{\mathit{adm}}(L)$ *is the set of all admissible places, given an event log* $L$*.*

Given a candidate place $p = (A_1, A_2)$, $\mathit{AN}_p$ is the accepting Petri net consisting of one transition per activity and a single place $p$. The transitions in $A_1$ produce tokens for $p$ and the transitions in $A_2$ consume tokens from $p$. If $p$ is a source place (i.e., $A_1 = \emptyset$), then it has to be initially marked to be meaningful (otherwise, it would remain empty by definition). If $p$ is a sink place (i.e., $A_2 = \emptyset$), then it has to be marked in the final marking to be meaningful (otherwise, it could never be marked on a path to the final marking). We also assume that all other places are empty both at the beginning and at the end. Hence, only source places are initially marked and only sink places are marked in the final marking. This explains why $M_{\mathit{init}} = [p \mid A_1 = \emptyset]$ ($p$ is initially marked if it is a source place) and $M_{\mathit{final}} = [p \mid A_2 = \emptyset]$ ($p$ is marked in the final marking if it is a sink place).

A candidate place $p = (A_1, A_2)$ is admissible if the corresponding $\mathit{AN}_p$ allows for all the traces seen in the event log, i.e., event log $L$ and single-place net $\mathit{AN}_p$ are perfectly fitting. Consider, for example, $L_1 = [\langle a, b, c, e \rangle^{10}, \langle a, c, b, e \rangle^{5}, \langle a, d, e \rangle]$. Examples of admissible candidate places are $p_1 = (\emptyset, \{a\})$, $p_2 = (\{a\}, \{b, d\})$, $p_3 = (\{a\}, \{c, d\})$, $p_4 = (\{b, d\}, \{e\})$, $p_5 = (\{c, d\}, \{e\})$, and $p_6 = (\{e\}, \emptyset)$. These are the places shown earlier in Fig. 9(a) (for convenience the accepting Petri net $\mathit{AN}_1$ is again shown in Fig. 14(d)). However, we now consider an accepting Petri net per place, i.e., $\mathit{AN}_{p_1}, \mathit{AN}_{p_2}, \mathit{AN}_{p_3}, \ldots, \mathit{AN}_{p_6}$. Figure 14(b) shows $\mathit{AN}_{p_2}$ with $p_2 = (\{a\}, \{b, d\})$. Other admissible places (not shown in Fig. 9(a)) are $p_7 = (\emptyset, \{e\})$, $p_8 = (\{a\}, \{e\})$, and $p_9 = (\{a\}, \emptyset)$. Examples of candidate places that are not admissible are $p_{10} = (\emptyset, \{b\})$ (the initial token in $p_{10}$ is not consumed when replaying $\langle a, d, e \rangle$), $p_{11} = (\{a\}, \{b\})$ (the token produced for $p_{11}$ by $a$ is not consumed when replaying $\langle a, d, e \rangle$), $p_{12} = (\{b\}, \{e\})$ (it is impossible to replay $\langle a, d, e \rangle$ because of a missing token in $p_{12}$), and $p_{13} = (\{b\}, \emptyset)$ (the sink place is not marked when replaying $\langle a, d, e \rangle$).
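Admissibility can be checked by replaying each trace on the single-place net. The following Python sketch follows Definition 21: a place is a pair of activity sets, a source place starts with one token, and a sink place must hold exactly one token at the end. The function names `fits` and `admissible` and the set-based encoding are illustrative choices.

```python
# Variants of L1 (frequencies are irrelevant for admissibility).
VARIANTS = [("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")]

def fits(place, trace):
    """Replay one trace on the single-place net of a candidate place."""
    inputs, outputs = place
    tokens = 1 if not inputs else 0        # source places are initially marked
    for activity in trace:
        if activity in outputs:            # the transition consumes from p
            if tokens == 0:
                return False               # missing token: not enabled
            tokens -= 1
        if activity in inputs:             # the transition produces for p
            tokens += 1
    return tokens == (1 if not outputs else 0)  # sink places end marked

def admissible(place, variants):
    """A place is admissible if every observed variant can be replayed."""
    return all(fits(place, trace) for trace in variants)

p2 = ({"a"}, {"b", "d"})
p9 = ({"a"}, set())
p11 = ({"a"}, {"b"})
p12 = ({"b"}, {"e"})
```

Here `admissible(p2, VARIANTS)` and `admissible(p9, VARIANTS)` hold, whereas `p11` and `p12` fail on ⟨a, d, e⟩ for the reasons given in the text (a remaining token and a missing token, respectively).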

Note that places correspond to *constraints*. Place $p_4 = (\{b, d\}, \{e\})$ allows for all the traces in $L_1$ but does not allow for traces such as $\langle a, e \rangle$, $\langle a, b, d, e \rangle$, $\langle a, b, e, e \rangle$, etc.

Assuming that we want to ensure perfect replay fitness (i.e., 100% recall), we *only add admissible places*. This is a reasonable premise if the event log was filtered (cf. Definition 19) before conducting discovery. This means that process discovery is reduced to finding a subset of $P_{\mathit{adm}}(L)$ (i.e., a selection of admissible places given event log $L$).

Why not simply add all places in $P_{\mathit{adm}}(L)$ to the discovered process model? There are two reasons not to do this: *redundancy* and *overfitting*. A place is *redundant* if its removal does not change the behavior. Consider, for example, Fig. 14(c) with two source places, two sink places, and an additional place connecting $a$ and $e$. The places $p_7 = (\emptyset, \{e\})$, $p_8 = (\{a\}, \{e\})$, and $p_9 = (\{a\}, \emptyset)$ are redundant, i.e., we can remove them without allowing for more behavior. Moreover, adding all possible places in $P_{\mathit{adm}}(L)$ may lead to overfitting. As explained in Sect. 3, the event log contains example behavior and it would be odd to assume that behaviors that have not been observed are not possible. Note that there are $2^n \times 2^n = 2^{2n}$ candidate places with $n = |\mathit{act}(L)|$. Hence, *for a log with just ten activities there are over one million candidate places* ($2^{2 \times 10} = 1048576$). Many of these will be admissible by accident. This problem is comparable to "multiple hypothesis testing" in statistics. If one tests enough hypotheses, then one will find seemingly significant results by accident (cf. Bonferroni correction).
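The size of the candidate space can be made tangible with a short sketch that enumerates all pairs of activity subsets. The helper names `powerset` and `candidate_places` are illustrative. Explicit enumeration is shown only for the five activities of $L_1$. For larger activity sets, the formula is evaluated directly.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of a collection, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]

def candidate_places(activities):
    """Every pair (A1, A2) of activity subsets is a candidate place."""
    subsets = powerset(activities)
    return [(a1, a2) for a1 in subsets for a2 in subsets]

candidates = candidate_places({"a", "b", "c", "d", "e"})
n_candidates = len(candidates)      # 2**(2*5) = 1024
ten_activities = 2 ** (2 * 10)      # 1048576
```

Even for this toy log there are 1024 candidate places, each of which a naive procedure would have to replay against the whole event log.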

There are many approaches to select a suitable subset of $P_{\mathit{adm}}(L)$. For example, it is easy to remove redundant places and only consider places with a limited number of input and output arcs [7,26]. However, there is the additional problem that the above procedure requires evaluating each candidate place with respect to the whole event log. This means that a naïve approach quickly becomes intractable for larger event logs and processes.

#### **5.2 The Alpha Algorithm**

In the remainder of this section, we present the first process discovery technique able to discover concurrent models (e.g., Petri nets) from event logs: the *Alpha algorithm* [9]. The Alpha algorithm is completely based on the footprint of the (filtered) event log $L$. This implies that one pass through the event log is sufficient. Hence, the algorithm is linear in the size of the log (a naïve implementation is exponential in the number of unique activities, but this number is typically low). One can implement the Alpha algorithm efficiently by combining $\to$ relations that meet certain constraints. These constraints are monotonic, allowing for an apriori-style algorithm [1].

We have adapted the original presentation used in [9] to leverage the notations and insights already provided in this chapter. We use a DFG as input and, as a result, also add a dummy start (▶) and end (■) activity. However, in essence, the algorithm did not change. We elaborate on the differences with [9] later. The Alpha algorithm discovers an accepting Petri net for any event log $L$.

**Definition 22 (Alpha Algorithm).** *The alpha algorithm* $\mathit{disc}_{\mathit{alpha}} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*}) \rightarrow \mathcal{U}_{\mathit{AN}}$ *returns an accepting Petri net* $\mathit{disc}_{\mathit{alpha}}(L)$ *for any event log* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*})$*. Let* $A = \mathit{act}(L)$ *and* $\mathit{fp}(L) = \mathit{fp}(\mathit{disc}_{\mathit{DFG}}(L))$ *the footprint of event log* $L$*. This allows us to write* $a_1 \rightarrow_L a_2$ *if* $\mathit{fp}(L)((a_1, a_2)) = \rightarrow$ *and* $a_1 \#_L a_2$ *if* $\mathit{fp}(L)((a_1, a_2)) = \#$ *for any* $a_1, a_2 \in A' = A \cup \{\blacktriangleright, \blacksquare\}$*.*


The complexity of the algorithm is in the first two steps building the sets *Cnd* and *Sel* that are used to create the places in Step 3. The rest builds on the ideas and notions introduced before. The Alpha algorithm creates a transition $t_a$ for each activity $a$ in the event log and also adds a start transition $t_{\blacktriangleright}$ and an end transition $t_{\blacksquare}$ (Step 4). Transitions are labeled with the corresponding activity (Step 6). Transitions $t_{\blacktriangleright}$ and $t_{\blacksquare}$ are silent, $t_{\blacktriangleright}$ has a source place $p_{\blacktriangleright}$ as input and $t_{\blacksquare}$ has a sink place $p_{\blacksquare}$ as output. The initial marking only marks the source place $p_{\blacktriangleright}$ and the final marking only marks the sink place $p_{\blacksquare}$ (Step 7). Steps 3–8 can be seen as "bookkeeping". The essence of the algorithm is in the first two steps.

Step 1 of the algorithm creates candidate places similar to the construction of candidate places used in Definition 21. $(A_1, A_2)$ corresponds to a candidate place $p$ such that activities in $A_1$ produce tokens for $p$ and activities in $A_2$ consume tokens from $p$. Note that technically $(A_1, A_2)$ is a pair of non-empty sets of activities (including start and end). The requirement $\forall_{a_1 \in A_1} \forall_{a_2 \in A_2}\ a_1 \rightarrow_L a_2$ states that any activity in $A_1$ can be directly followed by any activity in $A_2$, but no activity in $A_2$ can be directly followed by an activity in $A_1$. The requirements $\forall_{a_1, a_2 \in A_1}\ a_1 \#_L a_2$ and $\forall_{a_1, a_2 \in A_2}\ a_1 \#_L a_2$ state that activities in the sets $A_1$ and $A_2$ cannot directly follow any other member of the same activity set. As a consequence, an activity that can follow itself directly (i.e., $a \parallel_L a$) cannot be in $A_1$ or $A_2$. This also implies that $A_1$ and $A_2$ are disjoint. *Cnd* is the set of all pairs of activity sets meeting these requirements. *Sel* ⊆ *Cnd* retains the "maximal elements". Candidate $(A_1, A_2) \in \mathit{Cnd}$ is maximal if there is no other $(A_1', A_2') \in \mathit{Cnd}$ that is strictly larger, i.e., it cannot be that $A_1 \subseteq A_1'$, $A_2 \subseteq A_2'$, and $(A_1', A_2') \neq (A_1, A_2)$.
Each selected maximal element, i.e., $(A_1, A_2) \in \mathit{Sel}$, corresponds to a place $p_{(A_1, A_2)}$ connecting the transitions corresponding to $A_1$ (i.e., $\{t_a \mid a \in A_1\}$) to the transitions corresponding to $A_2$ (i.e., $\{t_a \mid a \in A_2\}$).
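The first two steps described above can be sketched in Python. This is a naïve, exponential illustration rather than the apriori-style implementation mentioned earlier. The function names (`dfg_arcs`, `alpha_candidates`) and the dictionary-based log representation (trace tuple mapped to frequency) are assumptions made for this example. The footprint relations are derived directly from the directly-follows arcs: $a \rightarrow_L b$ if $a$ is directly followed by $b$ but not vice versa, and $a \#_L b$ if neither follows the other.

```python
from itertools import combinations

START, END = "▶", "■"  # artificial start and end activities

def dfg_arcs(log):
    """Directly-follows arcs of the log, with ▶ and ■ wrapped around each trace."""
    arcs = set()
    for trace in log:
        padded = (START, *trace, END)
        arcs.update(zip(padded, padded[1:]))
    return arcs

def alpha_candidates(log):
    """Return (Cnd, Sel) as sets of (A1, A2) pairs of frozensets."""
    arcs = dfg_arcs(log)
    acts = {a for arc in arcs for a in arc}
    causal = lambda a, b: (a, b) in arcs and (b, a) not in arcs         # a ->_L b
    unrelated = lambda a, b: (a, b) not in arcs and (b, a) not in arcs  # a #_L b
    # Activities that can follow themselves can never be part of A1 or A2.
    usable = sorted(a for a in acts if unrelated(a, a))
    independent = [frozenset(c) for r in range(1, len(usable) + 1)
                   for c in combinations(usable, r)
                   if all(unrelated(a, b) for a in c for b in c)]
    # Step 1: all pairs of independent sets that are fully causally connected.
    cnd = {(a1, a2) for a1 in independent for a2 in independent
           if all(causal(a, b) for a in a1 for b in a2)}
    # Step 2: keep only the maximal candidates.
    sel = {(a1, a2) for (a1, a2) in cnd
           if not any((b1, b2) != (a1, a2) and a1 <= b1 and a2 <= b2
                      for (b1, b2) in cnd)}
    return cnd, sel

L1 = {("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1}
cnd, sel = alpha_candidates(L1)
for a1, a2 in sorted(sel, key=lambda p: (sorted(p[0]), sorted(p[1]))):
    print(sorted(a1), "->", sorted(a2))
```

Only the set of trace variants matters here; the frequencies stored in the dictionary are ignored, which reflects that the basic algorithm works on the footprint alone.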

**Fig. 15.** Four accepting Petri nets created using the Alpha algorithm from Definition 22. The place and transition names are as specified in Definition 22. The four event logs used are: $L_1 = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}, \langle a,d,e\rangle]$, $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$, $L_4 = [\langle a,b\rangle^{35}, \langle b,a\rangle^{15}]$, and $L_5 = [\langle a\rangle^{10}, \langle a,b\rangle^{8}, \langle a,c,b\rangle^{6}, \langle a,c,c,b\rangle^{3}, \langle a,c,c,c,b\rangle]$. Note that unlike in [9] invisible start and end transitions are added to be more general.

Figure 15 shows some examples where the Alpha algorithm is applied to a smaller event log. The place names reflect the elements of the set *Sel* created in Step 2 of the algorithm. For $L_1 = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}, \langle a,d,e\rangle]$, $\mathit{Sel} = \{(\{\blacktriangleright\}, \{a\}), (\{a\}, \{b,d\}), (\{a\}, \{c,d\}), (\{b,d\}, \{e\}), (\{c,d\}, \{e\}), (\{e\}, \{\blacksquare\})\}$. Note that $\mathit{Cnd} \setminus \mathit{Sel} = \{(\{a\}, \{b\}), (\{a\}, \{c\}), (\{a\}, \{d\}), (\{b\}, \{e\}), (\{c\}, \{e\}), (\{d\}, \{e\})\}$. These candidates were removed because they are not maximal. Figure 15(a) shows the resulting accepting Petri net $\mathit{disc}_{\mathit{alpha}}(L_1)$. Figure 15(b) shows $\mathit{disc}_{\mathit{alpha}}(L_2)$. Note that the Alpha algorithm is able to discover concurrency, choices, and loops. Comparing the process models for $L_1$ and $L_2$ with the accepting Petri nets in Fig. 2 (for $L_1$) and Fig. 3 (for $L_2$), we can see that $p_{\blacktriangleright}$, $t_{\blacktriangleright}$, $t_{\blacksquare}$, and $p_{\blacksquare}$ have been added. These can be removed if start and end activities happen only at the beginning or end. In $L_1$ and $L_2$, the only start activity is $a$ and $a$ can only happen in the first position. Also, the only end activity is $e$ and $e$ can only happen in the last position. If this is the case, we do not need to add an artificial start ▶ or end ■.

Figure 15(c) shows why it is sometimes necessary to add an artificial start or end. In $L_4 = [\langle a,b\rangle^{35}, \langle b,a\rangle^{15}]$, $a$ is a start activity in trace $\langle a,b\rangle$, but can also happen at the second position (cf. $\langle b,a\rangle$). The same holds for activity $b$. Therefore, we need to add an artificial start ▶. Activities $a$ and $b$ are also end activities, but do not appear just at the end, e.g., $b$ may also happen in the first position. Therefore, we need to add an artificial end ■. Note that Definition 22 is slightly different from the original algorithm in [9] due to the addition of the dummy start and end activities. For logs where the traditional algorithm already produces the correct result, one can simply remove $p_{\blacktriangleright}$, $t_{\blacktriangleright}$, $t_{\blacksquare}$, and $p_{\blacksquare}$. However, the algorithm in Definition 22 is able to handle start and end activities that can also appear in the middle of a trace. Hence, it is more general.
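The criterion for when the artificial start and end are actually needed can be checked mechanically. The following sketch is one possible reading of that criterion; the function names `needs_artificial_start` and `needs_artificial_end` are hypothetical, and logs are represented as collections of trace tuples.

```python
def needs_artificial_start(log):
    """True if some start activity also occurs at a later position in some trace."""
    starts = {trace[0] for trace in log if trace}
    return any(a in starts for trace in log for a in trace[1:])

def needs_artificial_end(log):
    """True if some end activity also occurs before the last position in some trace."""
    ends = {trace[-1] for trace in log if trace}
    return any(a in ends for trace in log for a in trace[:-1])

L1 = [("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")]
L4 = [("a", "b"), ("b", "a")]

for name, log in (("L1", L1), ("L4", L4)):
    print(name, needs_artificial_start(log), needs_artificial_end(log))
```

For $L_1$ both checks are negative, so the extra places and transitions can be dropped; for $L_4$ both are positive because $a$ and $b$ each occur in first and last position.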

Figure 16 shows the model discovered for the larger event log $L_3 = [\langle \mathit{ie}, \mathit{cu}, \mathit{lt}, \mathit{xr}, \mathit{fe}\rangle^{285}, \langle \mathit{ie}, \mathit{cu}, \mathit{lt}, \mathit{ct}, \mathit{fe}\rangle^{260}, \langle \mathit{ie}, \mathit{cu}, \mathit{ct}, \mathit{lt}, \mathit{fe}\rangle^{139}, \langle \mathit{ie}, \mathit{lt}, \mathit{cu}, \mathit{xr}, \mathit{fe}\rangle^{137}, \langle \mathit{ie}, \mathit{lt}, \mathit{cu}, \mathit{ct}, \mathit{fe}\rangle^{124}, \langle \mathit{ie}, \mathit{cu}, \mathit{xr}, \mathit{lt}, \mathit{fe}\rangle^{113}, \langle \mathit{ie}, \mathit{xr}, \mathit{cu}, \mathit{lt}, \mathit{fe}\rangle^{72}, \langle \mathit{ie}, \mathit{ct}, \mathit{cu}, \mathit{xr}, \mathit{fe}\rangle^{72}, \langle \mathit{ie}, \mathit{cu}, \mathit{om}, \mathit{am}, \mathit{cu}, \mathit{lt}, \mathit{xr}, \mathit{fe}\rangle^{29}, \langle \mathit{ie}, \mathit{cu}, \mathit{om}, \mathit{am}, \mathit{cu}, \mathit{lt}, \mathit{ct}, \mathit{fe}\rangle^{28}, \ldots]$ using the full activity names, i.e., *ie* = initial examination, *xr* = X-ray, *ct* = CT scan, *cu* = checkup, *om* = order medicine, *am* = administer medicine, *lt* = lab tests, and *fe* = final examination. The model was generated using the Alpha algorithm implemented in ProM. Note that there was no need to add artificial start or end activities because *ie* happens only at the beginning and *fe* happens only at the end.

**Fig. 16.** The accepting Petri net that was discovered by the Alpha algorithm implemented in ProM, based on the larger event log L<sup>3</sup> introduced in Sect. 2.5. Note that the artificial start and end activities have not been added, and the full activity names are used.

The Alpha algorithm should be seen as a baseline algorithm to discover concurrency. It has many limitations, as pointed out in the original paper presenting the algorithm [9]. Event log $L_5 = [\langle a\rangle^{10}, \langle a,b\rangle^{8}, \langle a,c,b\rangle^{6}, \langle a,c,c,b\rangle^{3}, \langle a,c,c,c,b\rangle]$ is used to illustrate two of these problems: *skipping* and *self-loops*. Figure 15(d) shows the discovered process model $\mathit{disc}_{\mathit{alpha}}(L_5)$. The selected maximal elements are $\mathit{Sel} = \{(\{\blacktriangleright\}, \{a\}), (\{a\}, \{b\}), (\{a\}, \{\blacksquare\}), (\{b\}, \{\blacksquare\})\}$. Note that $(\{a\}, \{b, \blacksquare\}) \notin \mathit{Sel}$, because $b \rightarrow_{L_5} \blacksquare$ and not $b \#_{L_5} \blacksquare$. Because $c \parallel_{L_5} c$ ($c$ can be directly followed by $c$) and not $c \#_{L_5} c$, activity $c$ does not appear in *Sel*, implying that $t_c$ remains disconnected from the rest of the model. Activity $b$ can be seen as a "skippable" activity and the Alpha algorithm cannot handle such activities, because these require silent transitions. The basic Alpha algorithm also cannot discover the self-loop involving $c$. The Alpha algorithm has been extended to address these problems, and there exist variants to deal with self-loops, skipping, long-term dependencies, etc. See [1] for more information on the limitations of the basic algorithm and pointers to extensions addressing these problems.
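To see where the two problems come from, it helps to inspect the footprint of $L_5$ directly. The sketch below computes the footprint relations from the directly-follows arcs; the function name `footprint` and the use of the symbols "→", "←", "‖" and "#" as dictionary values are choices made for this illustration.

```python
START, END = "▶", "■"  # artificial start and end activities

def footprint(log):
    """Map every ordered pair of nodes to one of '→', '←', '‖', '#'."""
    arcs = set()
    for trace in log:
        padded = (START, *trace, END)
        arcs.update(zip(padded, padded[1:]))
    nodes = sorted({n for arc in arcs for n in arc})

    def relation(a, b):
        forward, backward = (a, b) in arcs, (b, a) in arcs
        if forward and backward:
            return "‖"
        if forward:
            return "→"
        if backward:
            return "←"
        return "#"

    return {(a, b): relation(a, b) for a in nodes for b in nodes}

L5 = [("a",), ("a", "b"), ("a", "c", "b"),
      ("a", "c", "c", "b"), ("a", "c", "c", "c", "b")]
fp5 = footprint(L5)

print("c vs c:", fp5[("c", "c")])   # self-loop: c is excluded from every candidate
print("a vs ■:", fp5[("a", END)])   # a can end the case, i.e., b is skippable
print("b vs ■:", fp5[("b", END)])   # hence b and ■ cannot share an output set
```

Because the pair $(c, c)$ is not in the # relation, no candidate $(A_1, A_2)$ can contain $c$, and because $b$ and ■ are causally related they cannot be grouped into one set.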

### **6 Top-Down Process Discovery**

The Alpha algorithm is an example of a bottom-up discovery approach that tries to add places to the Petri net to locally constrain behavior. *Top-down discovery approaches try to recursively decompose the event log into smaller event logs until the problem gets trivial.* The whole event log $L$ is decomposed into smaller event logs $L_1, L_2, \ldots, L_n$ that have a clear relationship, e.g., $L_i$ may contain events that occur before $L_j$ if $i < j$, or $L_i$ and $L_j$ are fully disjoint for all $i \neq j$. Each event in $L$ ends up in *precisely one* of the sublogs. However, cases may be distributed over multiple sublogs. Each of the smaller event logs is analyzed and (if needed) decomposed into smaller event logs, e.g., $L_i$ is in turn decomposed into $L_{i,1}, L_{i,2}, \ldots, L_{i,m}$, etc. Again the events in $L_i$ are partitioned over $L_{i,1}, L_{i,2}, \ldots, L_{i,m}$. This is repeated until we encounter a so-called *base case*, i.e., a sublog containing just one activity, e.g., $[\langle a\rangle^{160}]$, $[\langle a\rangle^{80}, \langle\,\rangle^{80}]$, or $[\langle a\rangle^{80}, \langle a,a\rangle^{60}, \langle a,a,a\rangle^{20}]$.

Due to the recursive decomposition of logs into smaller event logs, we automatically get a tree-like structure where the root corresponds to the original event log and the leaves correspond to trivial event logs (the so-called base cases). This fits well with the process tree formalism introduced in Sect. 4.2.

Before introducing a particular approach, let's use a few simple event logs to illustrate the idea of splitting an event log.


In this section, we use the *basic inductive mining algorithm* to illustrate top-down discovery [22–24]. This algorithm uses DFGs to find so-called *cuts* partitioning the set of observed activities into subsets of activities. Set $A = \mathit{act}(L)$ is partitioned into pairwise disjoint sets of activities $A_1, A_2, \ldots, A_n$. These activity sets are used to distribute the events in $L$ over $L_1, L_2, \ldots, L_n$ such that $A_1 = \mathit{act}(L_1)$, $A_2 = \mathit{act}(L_2)$, etc. There are cuts for all four process tree operators, i.e., → (sequential composition), × (exclusive choice), ∧ (parallel composition), and ⟲ (redo loop).

**Definition 23 (Sequence, Exclusive-Choice, Parallel, and Redo-Loop Cuts).** *Let* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*})$ *be an event log having a DFG* $\mathit{disc}_{\mathit{DFG}}(L) = (A, F)$ *based on* $L$ *(note that* $A = \mathit{act}(L)$*) with start activities* $A^{\mathit{start}} = \{a \in A \mid (\blacktriangleright, a) \in F\}$ *and end activities* $A^{\mathit{end}} = \{a \in A \mid (a, \blacksquare) \in F\}$*. An* $n$*-ary* $\oplus$*-cut of* $L$ *is a partition of* $A$ *into* $n \geq 2$ *pairwise disjoint subsets* $A_1, A_2, \ldots, A_n$ *(i.e.,* $A = \bigcup_{i \in \{1, \ldots, n\}} A_i$ *and* $A_i \cap A_j = \emptyset$ *for* $i \neq j$*) with* $\oplus \in \{\rightarrow, \times, \wedge, \circlearrowleft\}$*. Such a* $\oplus$*-cut is denoted* $(\oplus, A_1, A_2, \ldots, A_n)$*. For each type of operator* $\oplus \in \{\rightarrow, \times, \wedge, \circlearrowleft\}$ *specific conditions apply:*

- *An* exclusive-choice cut *of* $L$ *is a cut* $(\times, A_1, A_2, \ldots, A_n)$ *such that*
	- $\forall_{i,j \in \{1, \ldots, n\}} \forall_{a \in A_i} \forall_{b \in A_j}\ i \neq j \Rightarrow (a, b) \notin F$*.*
- *A* sequence cut *of* $L$ *is a cut* $(\rightarrow, A_1, A_2, \ldots, A_n)$ *such that*
	- $\forall_{i,j \in \{1, \ldots, n\}} \forall_{a \in A_i} \forall_{b \in A_j}\ i < j \Rightarrow ((a, b) \in F^{+} \wedge (b, a) \notin F^{+})$*. (Note that* $F^{+}$ *is the non-reflexive transitive closure of* $F$*, i.e.,* $(a, b) \in F^{+}$ *means that there is a path from* $a$ *to* $b$ *in the DFG.)*
- *A* parallel cut *of* $L$ *is a cut* $(\wedge, A_1, A_2, \ldots, A_n)$ *such that*
	- $\forall_{i \in \{1, \ldots, n\}}\ A_i \cap A^{\mathit{start}} \neq \emptyset \wedge A_i \cap A^{\mathit{end}} \neq \emptyset$ *and*
	- $\forall_{i,j \in \{1, \ldots, n\}} \forall_{a \in A_i} \forall_{b \in A_j}\ i \neq j \Rightarrow (a, b) \in F$*.*
- *A* redo-loop cut *of* $L$ *is a cut* $(\circlearrowleft, A_1, A_2, \ldots, A_n)$ *such that*
	- $A^{\mathit{start}} \cup A^{\mathit{end}} \subseteq A_1$*,*
	- $\forall_{i,j \in \{2, \ldots, n\}} \forall_{a \in A_i} \forall_{b \in A_j}\ i \neq j \Rightarrow (a, b) \notin F$*,*
	- $\{a \in A_1 \mid (a, b) \in F \wedge b \notin A_1\} = A^{\mathit{end}}$*,*
	- $\{a \in A_1 \mid (b, a) \in F \wedge b \notin A_1\} = A^{\mathit{start}}$*,*
	- $\forall_{(a,b) \in F}\ a \in A_1 \wedge b \notin A_1 \Rightarrow \forall_{a' \in A^{\mathit{end}}}\ (a', b) \in F$*, and*
	- $\forall_{(b,a) \in F}\ a \in A_1 \wedge b \notin A_1 \Rightarrow \forall_{a' \in A^{\mathit{start}}}\ (b, a') \in F$*.*
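The conditions for the exclusive-choice, sequence, and parallel cuts translate almost literally into code. The sketch below only *verifies* a proposed partition; it does not search for one. The function names and the representation of a cut as a list of Python sets are assumptions for this example, and the conditions are read as: no arcs between parts (×), only forward reachability between parts (→), and all arcs between parts plus a start and end activity in each part (∧).

```python
def dfg(log):
    """Return (F, starts, ends): arcs between activities, start and end activities."""
    F, starts, ends = set(), set(), set()
    for trace in log:
        if trace:
            starts.add(trace[0])
            ends.add(trace[-1])
            F.update(zip(trace, trace[1:]))
    return F, starts, ends

def transitive_closure(F):
    """Non-reflexive transitive closure: (a, b) is included iff a path a -> b exists."""
    closure = set(F)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def cross_pairs(parts, keep):
    """All (a, b) with a in part i and b in part j for index pairs accepted by keep."""
    return [(a, b) for i, Ai in enumerate(parts) for j, Aj in enumerate(parts)
            if keep(i, j) for a in Ai for b in Aj]

def is_exclusive_choice_cut(F, parts):
    return all((a, b) not in F for a, b in cross_pairs(parts, lambda i, j: i != j))

def is_sequence_cut(F, parts):
    Fp = transitive_closure(F)
    return all((a, b) in Fp and (b, a) not in Fp
               for a, b in cross_pairs(parts, lambda i, j: i < j))

def is_parallel_cut(F, parts, starts, ends):
    return (all(Ai & starts and Ai & ends for Ai in parts)
            and all((a, b) in F for a, b in cross_pairs(parts, lambda i, j: i != j)))

L1 = [("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")]
F1, starts1, ends1 = dfg(L1)
print(is_exclusive_choice_cut(F1, [{"a"}, {"b", "c", "d"}, {"e"}]))
print(is_sequence_cut(F1, [{"a"}, {"b", "c", "d"}, {"e"}]))
```

A full implementation would additionally have to find the partition, for example via connected components of the DFG for the exclusive-choice cut; that search is deliberately left out here.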

**Fig. 17.** Four types of cuts: $(\oplus, A_1, A_2, \ldots, A_n)$ with $\oplus \in \{\times, \rightarrow, \wedge, \circlearrowleft\}$ (based on [1]).

Figure 17 illustrates the four types of cuts. There is an *exclusive-choice cut* when the DFG can be split into disconnected parts after leaving out the artificial start ▶ and end ■. (Recall that $\blacktriangleright \notin A$ and $\blacksquare \notin A$.) There is a *sequence cut* when the DFG can be split into sequential parts where only "forward connections" are possible. Note that we need to use the non-reflexive transitive closure of $F$. There is a *parallel cut* when the DFG can be split into concurrent parts where any activity in one part can be followed by any activity in another part. The *redo-loop cut* has the most complex definition. All start and end activities should be in $A_1$ (the "do part") and none of the "redo parts" can have start or end activities. Moreover, the "redo parts" ($A_2, A_3, \ldots, A_n$) are only connected through the "do part" ($A_1$). $B^{\mathit{start}} = \{b \mid (a, b) \in F \wedge a \in A_1 \wedge b \notin A_1\}$ are the start activities of the "redo parts" connected to end activities in the "do part" and $B^{\mathit{end}} = \{b \mid (b, a) \in F \wedge a \in A_1 \wedge b \notin A_1\}$ are the end activities of the "redo parts" connected to start activities in the "do part". The requirements in Definition 23 imply that $A^{\mathit{end}} \times B^{\mathit{start}} \subseteq F$ and $B^{\mathit{end}} \times A^{\mathit{start}} \subseteq F$. This implies that all end activities of the "do part" are connected to all start activities of the "redo parts" and all end activities of the "redo parts" are connected to all start activities of the "do part". For more explanations, see [1].
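The redo-loop conditions can also be checked for a given partition. The following is a sketch under stated assumptions: the function name `is_redo_loop_cut` is hypothetical, the first set in `parts` is taken to be the "do part", and the two conditions relating the exits and entries of the do part to the end and start activities are implemented as equalities. The final check uses the implied properties $A^{\mathit{end}} \times B^{\mathit{start}} \subseteq F$ and $B^{\mathit{end}} \times A^{\mathit{start}} \subseteq F$.

```python
def dfg(log):
    """Return (F, starts, ends): arcs between activities, start and end activities."""
    F, starts, ends = set(), set(), set()
    for trace in log:
        if trace:
            starts.add(trace[0])
            ends.add(trace[-1])
            F.update(zip(trace, trace[1:]))
    return F, starts, ends

def is_redo_loop_cut(F, parts, starts, ends):
    do, redo = parts[0], parts[1:]
    # All start and end activities belong to the do part.
    if not (starts | ends) <= do:
        return False
    # Different redo parts are never directly connected to each other.
    for i, Ai in enumerate(redo):
        for j, Aj in enumerate(redo):
            if i != j and any((a, b) in F for a in Ai for b in Aj):
                return False
    # Activities of the do part that leave to / are entered from a redo part.
    exits = {a for (a, b) in F if a in do and b not in do}
    entries = {a for (b, a) in F if a in do and b not in do}
    if exits != ends or entries != starts:  # assumption: read as equalities
        return False
    # B_start / B_end: redo activities entered from / returning to the do part.
    b_start = {b for (a, b) in F if a in do and b not in do}
    b_end = {b for (b, a) in F if a in do and b not in do}
    return (all((a, b) in F for a in ends for b in b_start)
            and all((b, a) in F for b in b_end for a in starts))

# Sublog over {b, c, d} obtained from L2 after removing a and e.
L_bcd = [("b", "c"), ("c", "b"), ("b", "c", "d", "b", "c"), ("c", "b", "d", "b", "c"),
         ("b", "c", "d", "c", "b"), ("c", "b", "d", "c", "b", "d", "b", "c")]
F, starts, ends = dfg(L_bcd)
print(is_redo_loop_cut(F, [{"b", "c"}, {"d"}], starts, ends))
```

Swapping the roles of the two parts, i.e., proposing $\{d\}$ as the do part, fails immediately because the start and end activities $b$ and $c$ would then lie outside $A_1$.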

How the event log $L$ is decomposed into $L_1, L_2, \ldots, L_n$ based on $\oplus$-cut $(\oplus, A_1, A_2, \ldots, A_n)$ depends on the type of cut $\oplus \in \{\rightarrow, \times, \wedge, \circlearrowleft\}$. In all log decompositions, each event ends up in precisely one event log, i.e., the number of events remains invariant through decomposition. We use the previously introduced event logs to illustrate this.

First, we consider $L_1 = [\langle a,b,c,e\rangle^{10}, \langle a,c,b,e\rangle^{5}, \langle a,d,e\rangle]$ and construct the corresponding DFG to find one of the four cuts. We check the presence of a cut using the order in Definition 23, i.e., (1) ×, (2) →, (3) ∧, (4) ⟲. There is no exclusive-choice cut for $L_1$, but there is a sequence cut $(\rightarrow, \{a\}, \{b,c,d\}, \{e\})$. Using this cut, $L_1$ is split into $L_a = [\langle a\rangle^{16}]$, $L_{b,c,d} = [\langle b,c\rangle^{10}, \langle c,b\rangle^{5}, \langle d\rangle]$, and $L_e = [\langle e\rangle^{16}]$. $L_a$ and $L_e$ correspond to base cases since there is just one activity left: $L_a$ is modeled by a single occurrence of activity $a$, and $L_e$ is modeled by a single occurrence of activity $e$. Hence, the process tree starts with $\rightarrow(a, ?, e)$, where ? corresponds to the subtree describing $L_{b,c,d}$. Next, we create a DFG for $L_{b,c,d}$ and see that we can apply an exclusive-choice cut $(\times, \{b,c\}, \{d\})$. Using this cut, $L_{b,c,d}$ is split into $L_{b,c} = [\langle b,c\rangle^{10}, \langle c,b\rangle^{5}]$ and $L_d = [\langle d\rangle]$. $L_d$ corresponds to a base case since there is just one activity left. Hence, the subtree for $L_{b,c,d}$ has the following structure $\times(?, d)$, where ? corresponds to the subtree describing $L_{b,c}$. The overall tree created thus far is $\rightarrow(a, \times(?, d), e)$. Next, we create a DFG for $L_{b,c}$ and see that we can apply a parallel cut $(\wedge, \{b\}, \{c\})$. It is not possible to apply an exclusive-choice cut or a sequence cut. Using cut $(\wedge, \{b\}, \{c\})$, sublog $L_{b,c}$ is split into $L_b = [\langle b\rangle^{15}]$ and $L_c = [\langle c\rangle^{15}]$. Both correspond to base cases. Hence, the subtree for $L_{b,c}$ is $\wedge(b, c)$.
The overall tree is $\rightarrow(a, \times(\wedge(b, c), d), e)$. This is process tree $Q_1$ in Fig. 10(a) shown before.
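The three decompositions used in this walkthrough can be reproduced with a small sketch. It assumes that sequence and parallel cuts split a log by projecting every trace onto each activity set, and that an exclusive-choice cut assigns every non-empty trace to the part containing its activities. The function names and the `Counter`-based log representation are hypothetical.

```python
from collections import Counter

def project(trace, activities):
    """Keep only the events of the trace that belong to the given activity set."""
    return tuple(a for a in trace if a in activities)

def split_by_projection(log, parts):
    """Sequence and parallel cuts: each trace contributes one subtrace per sublog."""
    sublogs = [Counter() for _ in parts]
    for trace, freq in log.items():
        for sublog, part in zip(sublogs, parts):
            sublog[project(trace, part)] += freq
    return sublogs

def split_exclusive_choice(log, parts):
    """Exclusive-choice cut: each non-empty trace goes to exactly one sublog."""
    sublogs = [Counter() for _ in parts]
    for trace, freq in log.items():
        for sublog, part in zip(sublogs, parts):
            if trace and set(trace) <= part:
                sublog[trace] += freq
    return sublogs

def event_count(log):
    return sum(len(trace) * freq for trace, freq in log.items())

L1 = Counter({("a", "b", "c", "e"): 10, ("a", "c", "b", "e"): 5, ("a", "d", "e"): 1})

L_a, L_bcd, L_e = split_by_projection(L1, [{"a"}, {"b", "c", "d"}, {"e"}])
L_bc, L_d = split_exclusive_choice(L_bcd, [{"b", "c"}, {"d"}])
L_b, L_c = split_by_projection(L_bc, [{"b"}, {"c"}])

print(dict(L_a), dict(L_bcd), dict(L_e))
print(dict(L_bc), dict(L_d))
print(dict(L_b), dict(L_c))
# The number of events is invariant under each decomposition.
print(event_count(L1), event_count(L_a) + event_count(L_bcd) + event_count(L_e))
```

Projection may produce empty traces in a sublog (for instance when an activity is skipped in some cases); these are kept on purpose, since they later determine whether a base case needs a silent activity.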

Next, we consider $L_2 = [\langle a,b,c,e\rangle^{50}, \langle a,c,b,e\rangle^{40}, \langle a,b,c,d,b,c,e\rangle^{30}, \langle a,c,b,d,b,c,e\rangle^{20}, \langle a,b,c,d,c,b,e\rangle^{10}, \langle a,c,b,d,c,b,d,b,c,e\rangle^{10}]$. Again, we construct the corresponding DFG to find one of the four cuts. The first cut we find is a sequence cut $(\rightarrow, \{a\}, \{b,c,d\}, \{e\})$. Using this cut, $L_2$ is split into $L_a = [\langle a\rangle^{160}]$, $L_{b,c,d} = [\langle b,c\rangle^{50}, \langle c,b\rangle^{40}, \langle b,c,d,b,c\rangle^{30}, \langle c,b,d,b,c\rangle^{20}, \langle b,c,d,c,b\rangle^{10}, \langle c,b,d,c,b,d,b,c\rangle^{10}]$, and $L_e = [\langle e\rangle^{160}]$. $L_a$ and $L_e$ correspond to base cases suggesting that the process has the following structure $\rightarrow(a, ?, e)$, with ? corresponding to the subtree describing $L_{b,c,d}$. Again we check the presence of a cut. The first cut we find is the redo-loop cut $(\circlearrowleft, \{b,c\}, \{d\})$. Using this cut, $L_{b,c,d}$ is split into $L_{b,c} = [\langle b,c\rangle^{150}, \langle c,b\rangle^{90}]$ and $L_d = [\langle d\rangle^{80}]$. Note that $L_{b,c}$ has 240 cases because the "do part" happened $50 + 40 + (2 \times 30) + (2 \times 20) + (2 \times 10) + (3 \times 10) = 240$ times. The "redo part" happened $30 + 20 + 10 + (2 \times 10) = 80$ times. The redo part is trivial since $d$ is always executed once. Hence, the subtree for $L_{b,c,d}$ has the following structure $\circlearrowleft(?, d)$, where ? corresponds to the subtree describing $L_{b,c}$. For $L_{b,c}$, we find the subtree $\wedge(b, c)$. The overall tree is, therefore, $\rightarrow(a, \circlearrowleft(\wedge(b, c), d), e)$. This is process tree $Q_2$ in Fig. 10(b) shown before.
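The counts 150, 90, and 80 can be reproduced by splitting every trace into maximal runs of events that belong to the same part of the cut. The sketch below assumes that every trace starts and ends in the do part and alternates between do and redo parts, which is the situation in this example; `split_redo_loop` is a hypothetical helper name.

```python
from collections import Counter
from itertools import groupby

def split_redo_loop(log, parts):
    """Cut each trace into maximal runs per part; every run becomes a sublog trace."""
    part_of = {a: i for i, part in enumerate(parts) for a in part}
    sublogs = [Counter() for _ in parts]
    for trace, freq in log.items():
        # groupby yields consecutive events that map to the same part index.
        for index, run in groupby(trace, key=part_of.__getitem__):
            sublogs[index][tuple(run)] += freq
    return sublogs

L_bcd = Counter({
    ("b", "c"): 50,
    ("c", "b"): 40,
    ("b", "c", "d", "b", "c"): 30,
    ("c", "b", "d", "b", "c"): 20,
    ("b", "c", "d", "c", "b"): 10,
    ("c", "b", "d", "c", "b", "d", "b", "c"): 10,
})

L_bc, L_d = split_redo_loop(L_bcd, [{"b", "c"}, {"d"}])
print(dict(L_bc))             # do part
print(dict(L_d))              # redo part
print(sum(L_bc.values()))     # number of do-part executions
```

Each of the 160 original cases contributes one trace to the do sublog per iteration of the loop, which is why the do sublog contains more cases than the log it was derived from.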

To explain the Alpha algorithm, we also used $L_4$ and $L_5$ in Fig. 15. Applying the basic inductive mining algorithm to $L_4 = [\langle a,b\rangle^{35}, \langle b,a\rangle^{15}]$ yields the process tree $\wedge(a, b)$. For $L_5 = [\langle a\rangle^{10}, \langle a,b\rangle^{8}, \langle a,c,b\rangle^{6}, \langle a,c,c,b\rangle^{3}, \langle a,c,c,c,b\rangle]$, we find the process tree $\rightarrow(a, \circlearrowleft(\tau, c), \times(b, \tau))$. Note that the subtree $\circlearrowleft(\tau, c)$ is created for the sublog involving just $c$, because $c$ happens 0, 1, 2, or 3 times. The subtree $\times(b, \tau)$ is created for the sublog involving just $b$, because $b$ happens at most once.
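How a single-activity sublog is turned into a subtree can be sketched as a small case distinction on how often the activity occurs per case. The mapping for "zero or more" and "at most once" follows the two subtrees just discussed; the mapping of "one or more occurrences" to $\circlearrowleft(a, \tau)$ is an assumption of this sketch and is not stated in the text above. The function name `base_case` is hypothetical, and the sublog is assumed to contain the activity at least once.

```python
def base_case(sublog, activity):
    """Return a process tree (as a string) for a sublog over a single activity."""
    occurrences = {len(trace) for trace in sublog}  # occurrences per case
    if occurrences == {1}:
        return activity                      # exactly once in every case
    if occurrences <= {0, 1}:
        return f"×({activity}, τ)"           # at most once: skippable
    if 0 not in occurrences:
        return f"⟲({activity}, τ)"           # assumption: one or more times
    return f"⟲(τ, {activity})"               # zero or more times

# Projections of L5 onto {c} and {b} (trace -> frequency).
L5_c = {(): 18, ("c",): 6, ("c", "c"): 3, ("c", "c", "c"): 1}
L5_b = {(): 10, ("b",): 19}

print(base_case({("a",): 160}, "a"))
print(base_case(L5_c, "c"))
print(base_case(L5_b, "b"))
```

Only the trace lengths matter because every event in such a sublog refers to the same activity.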

It is possible that none of the cuts in Definition 23 can be applied while the sublog still has multiple activities. In this case, one can always apply so-called *fallthroughs*, e.g., use $\circlearrowleft(\tau, a_1, a_2, \ldots, a_n)$ that allows for any behavior. Note that such fallthroughs are not needed when the original process was expressible in terms of a process tree (for the exact conditions, see [1,22]). Moreover, it is also possible to use smarter fallthroughs that separate the problematic activities or behavior from the rest. Suppose that there is a cut $(\oplus, A_1, A_2, \ldots, A_k)$ possible considering only activities $A^{\mathit{good}} = A_1 \cup A_2 \cup \ldots \cup A_k$ and leaving out $A^{\mathit{bad}} = A \setminus A^{\mathit{good}} = \{a_1, a_2, \ldots, a_n\}$. Then one can first apply the parallel cut $(\wedge, A^{\mathit{good}}, A^{\mathit{bad}})$ followed by cut $(\oplus, A_1, A_2, \ldots, A_k)$ and $\circlearrowleft(\tau, a_1, a_2, \ldots, a_n)$ applied to the two sublogs. There are many other fallthroughs, e.g., separating the empty traces from the rest.

**Definition 24 (Inductive Mining Algorithm).** *The basic inductive mining algorithm* $\mathit{disc}_{\mathit{IM}} \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*}) \rightarrow \mathcal{U}_{Q}$ *returns a process tree* $\mathit{disc}_{\mathit{IM}}(L)$ *for any event log* $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*})$ *using the four types of cuts, log decomposition, and fallthroughs described before.*

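**Fig. 18.** Process tree $\mathit{disc}_{\mathit{IM}}(L_3) = \rightarrow(\mathit{ie}, \wedge(\times(\mathit{xr}, \mathit{ct}), \circlearrowleft(\mathit{cu}, \rightarrow(\mathit{om}, \mathit{am})), \mathit{lt}), \mathit{fe})$ discovered and visualized using ProM's Inductive Visual Miner.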

Earlier, we introduced event log $L_3$, containing 11761 events corresponding to 1856 cases. Using the following abbreviations *ie* = initial examination, *xr* = X-ray, *ct* = CT scan, *cu* = checkup, *om* = order medicine, *am* = administer medicine, *lt* = lab tests, and *fe* = final examination, we find $\mathit{disc}_{\mathit{IM}}(L_3) = \rightarrow(\mathit{ie}, \wedge(\times(\mathit{xr}, \mathit{ct}), \circlearrowleft(\mathit{cu}, \rightarrow(\mathit{om}, \mathit{am})), \mathit{lt}), \mathit{fe})$. Figure 18 shows a screenshot of ProM's Inductive Visual Miner while analyzing $\mathit{disc}_{\mathit{IM}}(L_3)$ using a BPMN-like notation. No fallthroughs were needed. Note that the frequencies are also shown. It is also possible to show timing information, e.g., average waiting times.

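**Fig. 19.** Process tree $\mathit{disc}_{\mathit{IM}}(L_3) = \rightarrow(\mathit{ie}, \wedge(\times(\mathit{xr}, \mathit{ct}), \circlearrowleft(\mathit{cu}, \rightarrow(\mathit{om}, \mathit{am})), \mathit{lt}), \mathit{fe})$ discovered and visualized as a BPMN model using the Celonis EMS.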

Figure 19 shows $\mathit{disc}_{\mathit{IM}}(L_3)$ discovered using Celonis. Celonis also uses a BPMN-like visualization of the process tree. The translation of process trees to BPMN or Petri nets is rather straightforward, and the resulting models are easier to interpret by most users.

In this section, we only introduced the basic inductive mining algorithm. We assume that the event log was filtered in advance to remove infrequent behavior. However, there are also extended versions of the inductive mining algorithm dealing with infrequent behavior [23]. The basic inductive mining algorithm may become intractable for huge event logs, because sublogs need to be created repeatedly. There are also more scalable variants that make a single pass through the event log and use a single overall DFG [24]. These provide fewer formal guarantees. The basic inductive mining algorithm has strong guarantees. For example, $\mathit{disc}_{\mathit{IM}}(L)$ guarantees perfect replay fitness (i.e., 100% recall). Formally, $\mathit{var}(L) \subseteq \mathit{lang}(\mathit{disc}_{\mathit{IM}}(L))$. See [22–24] for additional formal guarantees provided by these top-down approaches.
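The fitness guarantee can be checked on a small example by enumerating the language of a discovered tree. The sketch below handles only loop-free trees built from →, ×, and ∧, since a redo loop has an infinite language; the nested-tuple encoding of process trees and the function names `shuffle` and `lang` are assumptions for this illustration.

```python
def shuffle(u, v):
    """All interleavings of two traces that preserve the order within each trace."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({(u[0],) + w for w in shuffle(u[1:], v)}
            | {(v[0],) + w for w in shuffle(u, v[1:])})

def lang(tree):
    """Set of traces of a loop-free process tree given as nested tuples."""
    if isinstance(tree, str):
        return {()} if tree == "τ" else {(tree,)}
    operator, *children = tree
    child_langs = [lang(child) for child in children]
    if operator == "×":
        return set().union(*child_langs)
    result = {()}
    for child_lang in child_langs:
        if operator == "→":
            result = {u + v for u in result for v in child_lang}
        elif operator == "∧":
            result = {w for u in result for v in child_lang for w in shuffle(u, v)}
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return result

Q1 = ("→", "a", ("×", ("∧", "b", "c"), "d"), "e")
variants_L1 = {("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")}

print(sorted(lang(Q1)))
print(variants_L1 <= lang(Q1))  # var(L1) ⊆ lang(Q1)
```

For $L_1$ the inclusion even holds with equality, i.e., the tree allows exactly the observed variants; in general the guarantee only states that every observed trace can be replayed.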

Next to the two process discovery techniques presented in this chapter, there are dozens of other techniques. In [12] additional techniques are presented.

#### **7 Conclusion**

The goal of this chapter is to introduce the foundations of process discovery without aiming to provide a complete survey or details on specific algorithms (see also [10]). After reading this chapter, it should be clear that process discovery is a challenging topic with many competing requirements. We started by introducing a baseline approach that produces a *Directly-Follows Graph* (DFG) for an event log converted into a multiset of traces. For real-life event logs, the DFG may have an excessive number of arcs making the model incomprehensible. Therefore, we discussed three *filtering* approaches that can also be combined to create simpler DFGs. We also showed that the interpretation of such process models highly depends on the log preprocessing [2].

After presenting the baseline DFG discovery approach, we focused on process representations able to capture *concurrency*: Petri nets, process trees, and BPMN models. This is needed because, if activities do not occur in a fixed order due to concurrency, then the discovered DFGs are underfitting and contain many loops. This allowed us to introduce more advanced process discovery approaches. We characterized these as (1) *bottom-up* approaches and (2) *top-down* approaches. Bottom-up approaches try to find local process patterns constraining the process model to better fit the event log. Top-down approaches tackle the problem differently and try to partition larger event logs into smaller ones that can be analyzed more easily. We described two approaches in more detail: the *Alpha algorithm* and the *inductive mining algorithm*. These should be seen as representative examples of both categories. However, there are dozens of process discovery techniques, and it is impossible to name them all.

For example, there exist many extensions of the Alpha algorithm, e.g., variants that can discover silent transitions (e.g., skipping) [34] and non-free choice constructs (e.g., long-term dependencies) [33]. The heuristic mining approach [32] can be seen as another bottom-up approach that incorporates frequency information. The approach can discover complex process structures, but often leads to models that are not sound. Region-based process-discovery approaches provide formal guarantees, but are often not very applicable (e.g., they may produce huge and overfitting process models or take too long to compute). There are two types of regions: *state-based regions* (which require the construction of a transition system) and *language-based regions* (that work on sets of traces). State-based regions were introduced by Ehrenfeucht and Rozenberg [20] in 1989 and generalized by Cortadella et al. [16]. In [8], it is shown how these state-based regions can be applied to process mining by first creating a log-based transition system using different abstractions. In [14,30], refinements are proposed to tailor state-based regions towards process discovery. In parallel, several authors applied language-based regions to process mining [13,35,37]. There are also numerous bottom-up approaches combining different ideas. An example is the so-called split-miner [11] which aims to balance recall and precision. This approach also starts from a filtered DFG, but identifies combinations of splits that capture the concurrency, conflict and causal relations between neighbors in the DFG. As mentioned, there also exist different variants of the inductive mining approach presented in this chapter [22–24].

In this chapter, we only considered a simple event log $L \in \mathcal{B}(\mathcal{U}_{\mathit{act}}^{*})$, ignoring additional event and case attributes (e.g., resources, data, transactional information). However, other logging formats may be considered. There are process discovery approaches that exploit timing information, data attributes, object references, partial order information (e.g., events happening on the same day), explicit uncertainty (e.g., imprecise timestamps or missing case identifiers), etc. We also only focused on mainstream representations such as DFGs, Petri nets, and BPMN. However, there are also discovery techniques that aim to discover stochastic process models [29], declarative process models (using Declare or DCR graphs) [25], or object/artifact-centric models (e.g., object-centric Petri nets) [5,21].

The above illustrates that the topic of process discovery has many facets, providing interesting scientific challenges. Moreover, there are several open-source tools (e.g., ProM, bupaR, PM4Py, and RapidProM) and over 40 commercial process mining tools (e.g., Celonis, Disco/Fluxicon, Lana/Appian, Minit, Apromore, myInvenio/IBM, PAFnow, Signavio/SAP, Timeline/ABBYY, and ProcessGold/UiPath) that already provide solid discovery approaches, and are sometimes applied to processes with billions of events. However, as applications of process mining become more demanding, new discovery approaches are needed that scale better and can deal with more complex processes and data structures. Therefore, process discovery is not just a great research topic, but also of great practical relevance.

**Acknowledgment.** Funded by the Alexander von Humboldt (AvH) Stiftung and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2023 Internet of Production – 390621612.

## **References**


BPM 2013. LNBIP, vol. 171, pp. 15–27. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06257-0_2



# **Advanced Process Discovery Techniques**

Adriano Augusto<sup>1</sup>, Josep Carmona<sup>2</sup>, and Eric Verbeek<sup>3</sup>

<sup>1</sup> The University of Melbourne, Melbourne, Australia

<sup>2</sup> Universitat Politècnica de Catalunya, Barcelona, Spain

<sup>3</sup> Eindhoven University of Technology, Eindhoven, The Netherlands

h.m.w.verbeek@tue.nl

**Abstract.** Given the challenges associated with the process discovery task, more than a hundred research studies have addressed the problem over the past two decades. Despite the richness of proposals, many state-of-the-art automated process discovery techniques, especially the oldest ones, struggle to systematically discover accurate and simple process models. In general, when the behavior recorded in the input event log is simple (e.g., exhibiting little parallelism, repetitions, or inclusive choices) or noise free, some basic algorithms such as the alpha miner can output accurate and simple process models. However, as the complexity of the input data increases, the quality of the discovered process models can worsen quickly. Given that real-life event logs often record very complex and unstructured process behavior containing many repetitions, infrequent traces, and incomplete data, some state-of-the-art techniques become unreliable and unfit for purpose. Specifically, they tend to discover process models that either have limited accuracy (i.e., low fitness and/or precision) or are syntactically incorrect. While currently there exists no perfect automated process discovery technique, some are better than others when discovering a process model from event logs recording complex process behavior. In this chapter, we introduce four such techniques, discussing their underlying approach and algorithmic ideas, reporting their benefits and limitations, and comparing their performance with the algorithms introduced in the previous chapter.

## **1 Introduction**

The previous chapter has introduced the alpha algorithm and the inductive mining algorithm as basic algorithms that discover an accepting Petri net from a (simplified) event log. It has also shown a number of example event logs for which these two basic algorithms work excellently. However, these two basic algorithms do not always perform well, often depending on the characteristics of the given event log.

In this chapter, we first introduce an example event log where the recorded process behavior features intertwined parallel compositions and exclusive choices. Second, we discuss the results of the alpha algorithm and the inductive mining algorithm on this example event log, showing that there is room for improvement. Third, we introduce four advanced process mining algorithms and discuss the results of using these algorithms on the example event log, highlighting their benefits and limitations. The first two advanced algorithms use region-based techniques to discover accepting Petri nets, where the first algorithm uses state-based regions and the second uses language-based regions. The third algorithm relies on sophisticated approaches to pre-process the DFG prior to identifying the behavioral semantics of splits and joins, and it natively outputs BPMN models. Whereas these three algorithms produce imperative process models, the fourth algorithm generates declarative process models (like Declare) called *log skeletons*. As we shall see in this chapter, thanks to their advanced approaches, these mining algorithms are capable of handling event logs recording very complex process behavior better than the basic mining algorithms do. At the same time, these algorithms should not be considered bullet-proof solutions for automated process discovery either, as their results generally vary depending on the input event log.

**Fig. 1.** The directly-follows graph of the event log *L*6.

## **2 Motivation**

To motivate the need for advanced process discovery algorithms, we introduce the event log $L_6 = [\langle a,b,c,g,e,h \rangle^{10}, \langle a,b,c,f,g,h \rangle^{10}, \langle a,b,d,g,e,h \rangle^{10}, \langle a,b,d,e,g,h \rangle^{10}, \langle a,b,e,c,g,h \rangle^{10}, \langle a,b,e,d,g,h \rangle^{10}, \langle a,c,b,e,g,h \rangle^{10}, \langle a,c,b,f,g,h \rangle^{10}, \langle a,d,b,e,g,h \rangle^{10}, \langle a,d,b,f,g,h \rangle^{10}]$. At first sight, there seems to be a choice between $c$ and $d$, followed by a choice between $e$ and $f$. However, it is more complicated than that, as traces like $\langle a,c,b,g,e,h \rangle$ and $\langle a,b,d,f,g,h \rangle$ are not included in $L_6$.

Figure 1 shows the DFG that results from the event log L6. Clearly, this DFG is not as symmetric as we would have thought after a first glance at L6. For example, e can be directly followed by c or d, but f is always directly followed by g.
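The asymmetry noted above can be checked by recomputing the directly-follows relation from $L_6$. The following is a minimal sketch, not the implementation of any particular tool; the names `VARIANTS`, `directly_follows`, and `successors` are illustrative assumptions.

```python
from collections import Counter

# Event log L6: ten trace variants, each observed 10 times.
# Every character is one activity, so "abcgeh" stands for <a, b, c, g, e, h>.
VARIANTS = ["abcgeh", "abcfgh", "abdgeh", "abdegh", "abecgh",
            "abedgh", "acbegh", "acbfgh", "adbegh", "adbfgh"]
L6 = {tuple(v): 10 for v in VARIANTS}


def directly_follows(log):
    """Count how often activity x is directly followed by activity y."""
    dfg = Counter()
    for trace, freq in log.items():
        for x, y in zip(trace, trace[1:]):
            dfg[(x, y)] += freq
    return dfg


def successors(dfg, x):
    """Activities that directly follow x at least once."""
    return sorted(y for (u, y) in dfg if u == x)


dfg = directly_follows(L6)
print(successors(dfg, "e"))  # includes both c and d
print(successors(dfg, "f"))  # only g
```

The arc weights in `dfg` are the frequencies that a DFG-based discovery algorithm would start from.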

Figure 2 shows the accepting Petri net that results from running the alpha algorithm on event log L6. The places with the > sign are places with a larger inflow than outflow, whereas the places with the < symbol are places with a smaller inflow than outflow. This is a clear indication that this net has quality problems, which is also confirmed by the fact that in this net the final marking is not reachable from the initial marking. It is possible to put a token in the final place, but then there would be other tokens in the net as well. Precisely, there would be tokens in the place that is the output of a and e and the input of c.

Figure 3 shows the process tree that results from running the inductive mining algorithm on event log $L_6$. Although the process tree guarantees that the final marking is always reachable from the initial marking, this process tree allows for too much behavior. For example, it is possible to do both e and f, or neither, even though in $L_6$ exactly one of these two activities is observed per trace. Also, the fact that f is always directly followed by g is not captured by this process tree.

**Fig. 2.** The accepting Petri net discovered by the alpha algorithm from the event log *L*6.

**Fig. 3.** The process tree discovered by the Inductive Mining Algorithm for event log *L*6.

This shows that, for more complex event logs, we need more advanced algorithms than the alpha algorithm and the inductive mining algorithm. This chapter introduces four such advanced algorithms, each having more success in discovering a process model from the event log $L_6$ than the basic algorithms from the previous chapter:


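1. The State-based Region Algorithm, which uses state-based regions to discover an accepting Petri net.
2. The Language-based Region Algorithm, which uses language-based regions to discover an accepting Petri net.
3. Split Miner, which pre-processes the DFG before identifying the behavioral semantics of splits and joins, and natively outputs BPMN models.
4. The Log Skeleton Miner, which produces declarative process models (like Declare [45]) called *log skeletons*.

**Fig. 4.** The accepting Petri net discovered by the State-based Region Algorithm for event log *L*6.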

These four advanced algorithms are discussed in the next sections. As the first two algorithms both use the theory of regions, they are discussed in a single section. We then continue with Split Miner and conclude with the Log Skeleton Miner.

## **3 The Theory of Regions**

The *theory of regions* [30] was proposed in the early nineties to define a formal correspondence between behavior and structure. In particular, several region-based algorithms have been proposed in the last decades to synthesize specifications into Petri nets using this powerful theory.

As mining is a form of synthesis, several approaches have appeared to mine process models from event logs. Regardless of the region-based technique applied, the approaches that rely on region theory search for a process model that is both fitting and precise [17]. This section presents two branches of region-based approaches for process discovery: state-based and language-based approaches.

#### **3.1 State-Based Region Approach for Process Discovery**

Figure 4 shows the accepting Petri net that results from running the State-based Region Algorithm on event log L6. Note that for all places the inflow equals the outflow. In the remainder of this section we will provide an overview of the main ingredients of state-based region discovery.

State-based region approaches for process discovery need to convert the event log into a state-based representation that will be used to discover the Petri net. This representation is formalized in the following definition.

**Definition 1 (Transition System).** *A* transition system *(*TS*) is a tuple* $(S, \Sigma, A, s_{in})$*, where* $S$ *is a set of* states*,* $\Sigma$ *is an alphabet of* activities*,* $A \subseteq S \times \Sigma \times S$ *is a set of* (labeled) arcs*, and* $s_{in} \in S$ *is the* initial state*. We will use* $s \xrightarrow{e} s'$ *as a shortcut for* $(s, e, s') \in A$*, and the transitive closure of this relation will be denoted by* $\xrightarrow{*}$*.*

Figure 5(a) presents an example of a transition system.

**Definition 2 (Multiset representation of traces).** *We denote by* $\#(\sigma, e)$ *the number of times that event* $e$ *occurs in* $\sigma$*, that is* $\#(e_1 \ldots e_n, e) = |\{i \mid e_i = e\}|$*. Given an alphabet* $\Sigma$*, the* Parikh vector of a sequence $\sigma$ *with respect to* $\Sigma$ *is a vector* $p_\sigma \in \mathbb{N}^{|\Sigma|}$ *such that* $p_\sigma(e) = \#(\sigma, e)$*.*

The techniques described in [62] present different variants for generating a transition system from an event log. For the most common variant, the basic idea to incorporate state information is to look at the multiset of events included in each prefix of a trace in the event log:

**Definition 3 (Multiset State Representation of an Event Log).** *Given an event log* $L \in \mathcal{B}(\mathcal{U}_{act}^{*})$*, the* TS *corresponding to the* multiset *conversion of* $L$*, denoted as* $TS_{mset}(L)$*, is* $\langle S, \Sigma, T, s_{p_\epsilon} \rangle$*, such that:* $S$ *contains one state* $s_{p_w}$ *for each Parikh vector* $p_w$ *of a prefix* $w$ *in* $L$*, with* $\epsilon$ *denoting the empty prefix, and* $T = \{ s_{p_w} \xrightarrow{e} s_{p_{we}} \mid we \text{ is a prefix of } L \}$*.*

In the *sequence* conversion, two traces lead to the same state if they fire the same events in exactly the same order.

*Example 1.* Throughout this section, we use an example taken from [61]. The event log contains the following activities: *r=register*, *s=ship*, *sb=send bill*, *p=payment*, *ac=accounting*, *ap=approved*, *c=close*, *em=express mail*, *rj=rejected*, and *rs=resolve*. Given the event log $L_7 = [\langle r, s, sb, p, ac, ap, c \rangle^{10}, \langle r, sb, em, p, ac, ap, c \rangle^{10}, \langle r, sb, p, em, ac, rj, rs, c \rangle^{10}, \langle r, em, sb, p, ac, ap, c \rangle^{10}, \langle r, sb, s, p, ac, rj, rs, c \rangle^{10}, \langle r, sb, p, s, ac, ap, c \rangle^{10}, \langle r, sb, p, em, ac, ap, c \rangle^{10}]$, Fig. 5(a) shows an example of a TS constructed according to Definition 3.
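To make the multiset conversion of Definition 3 concrete, the sketch below builds the states and arcs for $L_7$. It is a minimal illustration under stated assumptions: states are encoded as frozen sets of (activity, count) pairs, and the names `parikh` and `multiset_ts` are hypothetical. Trace frequencies are omitted because they do not influence the resulting transition system.

```python
from collections import Counter

# The seven trace variants of L7 (frequencies are irrelevant for the TS).
L7 = [
    ("r", "s", "sb", "p", "ac", "ap", "c"),
    ("r", "sb", "em", "p", "ac", "ap", "c"),
    ("r", "sb", "p", "em", "ac", "rj", "rs", "c"),
    ("r", "em", "sb", "p", "ac", "ap", "c"),
    ("r", "sb", "s", "p", "ac", "rj", "rs", "c"),
    ("r", "sb", "p", "s", "ac", "ap", "c"),
    ("r", "sb", "p", "em", "ac", "ap", "c"),
]


def parikh(prefix):
    """Hashable Parikh vector: a frozenset of (activity, count) pairs."""
    return frozenset(Counter(prefix).items())


def multiset_ts(traces):
    """Return (states, arcs, initial state) of the multiset conversion."""
    initial = parikh(())
    states, arcs = {initial}, set()
    for trace in traces:
        for i, event in enumerate(trace):
            source, target = parikh(trace[:i]), parikh(trace[:i + 1])
            states.add(target)
            arcs.add((source, event, target))
    return states, arcs, initial


states, arcs, initial = multiset_ts(L7)
print(len(states))  # number of distinct Parikh vectors over all prefixes
```

Two prefixes such as $\langle r, s, sb \rangle$ and $\langle r, sb, s \rangle$ map to the same state here, whereas a sequence conversion would keep them apart.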

A *region*<sup>1</sup> in a transition system is a set of states that satisfies a homogeneous relation with respect to the set of arcs. In the simplest case, this relation can be described by a predicate on the set of states considered. Formally:

**Definition 4 (Region).** *Let* $S'$ *be a subset of the states of a* TS*,* $S' \subseteq S$*. If* $s \notin S'$ *and* $s' \in S'$*, then we say that transition* $s \xrightarrow{a} s'$ enters $S'$*. If* $s \in S'$ *and* $s' \notin S'$*, then transition* $s \xrightarrow{a} s'$ exits $S'$*. Otherwise, transition* $s \xrightarrow{a} s'$ does not cross $S'$*: it is completely inside (*$s \in S'$ *and* $s' \in S'$*) or completely outside (*$s \notin S'$ *and* $s' \notin S'$*). A set of states* $r \subseteq S$ *is a region if for each event* $e \in \Sigma$*, exactly one of the three predicates (*enters*,* exits *or* does not cross*) holds for all of its arcs.*

<sup>1</sup> In this paper we will use region to denote a 1-bounded region. However, when needed we will use *k*-bounded region to extend the notion, necessary to account for *k*-bounded Petri nets.

**Fig. 5.** State-based region discovery: (a) transition system corresponding to *L*7, (b) derived Petri net.

An example of a region is presented in Fig. 6 on the TS of our running example. In the highlighted region, event r enters the region, s and em exit the region, and the rest of the labels do not cross the region.

A region corresponds to a place in the Petri net, and the role of the arcs determines the Petri net flow relation: when an event e enters the region, there is an arc from the corresponding transition for e to the place, and when e exits the region, there is an arc from the place to the transition for e. Events satisfying the does not cross relation are not connected to the corresponding place. For instance, the region shown in Fig. 6(a) corresponds to the shadowed place in Fig. 6(b), where event r belongs to the set of input transitions of the place whereas events em and s belong to the set of output transitions. Hence, the algorithm for Petri net derivation from a transition system consists of finding regions and constructing the Petri net as illustrated with the previous example. In [26] it was shown that only a minimal set of regions is necessary, whereas further relaxations to this restriction can be found in [17]. The Petri net obtained by this method is guaranteed to accept the language of the transition system and to satisfy the *minimal language containment property*, which implies that if all the minimal regions are used, the Petri net derived is the one whose language difference with respect to the log is minimal, hence being the most precise Petri net for the set of transitions considered.
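The region test of Definition 4 and the mapping from a region to a place can be sketched as follows. This is an illustrative check only, not a region-search algorithm. The candidate set is assumed to consist of the three states reached after $\langle r \rangle$, $\langle r, sb \rangle$, and $\langle r, sb, p \rangle$, which agrees with the predicates described for the highlighted region. The names `classify` and `as_place` are hypothetical.

```python
from collections import Counter

L7 = [
    ("r", "s", "sb", "p", "ac", "ap", "c"),
    ("r", "sb", "em", "p", "ac", "ap", "c"),
    ("r", "sb", "p", "em", "ac", "rj", "rs", "c"),
    ("r", "em", "sb", "p", "ac", "ap", "c"),
    ("r", "sb", "s", "p", "ac", "rj", "rs", "c"),
    ("r", "sb", "p", "s", "ac", "ap", "c"),
    ("r", "sb", "p", "em", "ac", "ap", "c"),
]


def parikh(prefix):
    """Hashable Parikh vector used as a state identifier."""
    return frozenset(Counter(prefix).items())


def arcs_of(traces):
    """Arcs of the multiset transition system (Definition 3)."""
    return {(parikh(t[:i]), e, parikh(t[:i + 1]))
            for t in traces for i, e in enumerate(t)}


def classify(arcs, candidate):
    """Map each event to the set of predicates satisfied by its arcs."""
    kinds = {}
    for source, event, target in arcs:
        if source not in candidate and target in candidate:
            kind = "enters"
        elif source in candidate and target not in candidate:
            kind = "exits"
        else:
            kind = "does not cross"
        kinds.setdefault(event, set()).add(kind)
    return kinds


def as_place(arcs, candidate):
    """Return (input transitions, output transitions), or None if not a region."""
    kinds = classify(arcs, candidate)
    if any(len(k) != 1 for k in kinds.values()):
        return None  # some event mixes predicates, so this is not a region
    preset = {e for e, k in kinds.items() if k == {"enters"}}
    postset = {e for e, k in kinds.items() if k == {"exits"}}
    return preset, postset


arcs = arcs_of(L7)
region = {parikh(("r",)), parikh(("r", "sb")), parikh(("r", "sb", "p"))}
print(as_place(arcs, region))              # r is an input; s and em are outputs
print(as_place(arcs, {parikh(("r",))}))    # not a region: s both exits and does not cross
```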

In any case, the algorithm that searches for regions in a transition system must explore the lattice of sets (or multisets, in the case of $k$-bounded regions), thus having a high complexity: for a transition system with $n$ states, the lattice for $k$-bounded regions is of size $O(k^n)$. For instance, the lattice of sets of states for the toy TS used in this article (which has 22 states) has $2^{22}$ possible sets to check for the region conditions. Although many simplification properties, efficient data structures and algorithms, and heuristics are used to prune this search space [17], they only help to alleviate the problem. Decomposition alternatives, which for instance use partitions of the state space to guide the search for regions, significantly alleviate the complexity of the state-based region algorithm, at the expense of not guaranteeing the derivation of precise models [15]. Other state-based region approaches for discovery have been proposed, which complement the approach described in this section [54–56].

**Fig. 6.** (a) Example of region (three shadowed states). The predicates are *r* enters, *s* and *em* exits, and the rest of events do not cross, (b) Corresponding place shadowed in the Petri net.

#### **3.2 Language-Based Region Approach for Process Discovery**

In language-based region theory [6,8,9,22,37,38] the goal is to construct the smallest Petri net such that the behaviour of the net is equal to the given input language (or minimally larger). [41] provides an overview for language-based region theory for different classes of languages: step languages, regular languages, and (infinite) partial languages.

Figure 7 shows the accepting Petri net that results from running the Language-based Region Algorithm on event log $L_6$. As with state-based regions, the inflow again equals the outflow for all places.

More formally, let $L \in \mathcal{B}(\mathcal{U}_{act}^{*})$ be an event log. Language-based region theory then constructs a Petri net in which the set of transitions is equal to $\Sigma$ and in which every trace of $L$ is a firing sequence. The Petri net should have only minimal firing sequences not in the language $L$ (and all prefixes in $L$). This is achieved by adding places to the Petri net that restrict unobserved behavior, while allowing for observed behavior. The theory of regions provides a method to identify these places, using *language regions*.

**Definition 5 (Prefix Closure).** *Let* $L \in \mathcal{B}(\mathcal{U}_{act}^{*})$ *be an event log. The prefix closed language* $\mathcal{L} \subseteq \Sigma^{*}$ *of* $L$ *is defined as:* $\mathcal{L} = \{\sigma \in \Sigma^{*} \mid \exists_{\sigma' \in \Sigma^{*}}\ \sigma \circ \sigma' \in L\}$*.*

The prefix closure of a log is simply the set of all prefixes in the log (including the empty prefix).

**Fig. 7.** The accepting Petri net discovered by the Language-based Region Algorithm for event log *L*6.

**Fig. 8.** Region for a language over four activities [63].

**Definition 6 (Language Region).** *Let* $\Sigma$ *be a set of activities. A region of a prefix-closed language* $\mathcal{L} \subseteq \Sigma^{*}$ *is a triple* $(\vec{x}, \vec{y}, c)$ *with* $\vec{x}, \vec{y} \in \{0, 1\}^{\Sigma}$ *and* $c \in \{0, 1\}$*, such that for each non-empty sequence* $w = w' \circ a \in \mathcal{L}$*,* $w' \in \mathcal{L}$*,* $a \in \Sigma$*:*

$$c + \sum_{t \in \Sigma} \left( \vec{w'}(t) \cdot \vec{x}(t) - \vec{w}(t) \cdot \vec{y}(t) \right) \ge 0$$

*This can be rewritten into the inequation system:*

$$c \cdot \vec{1} + M' \cdot \vec{x} - M \cdot \vec{y} \ge \vec{0}$$

*where* $M$ *and* $M'$ *are two* $|\mathcal{L}| \times |\Sigma|$ *matrices with* $M(w, t) = \vec{w}(t)$*, and* $M'(w, t) = \vec{w'}(t)$*, with* $w = w' \circ a$*. The set of all regions of a language is denoted by* $\Re(\mathcal{L})$ *and the region* $(\vec{0}, \vec{0}, 0)$ *is called the* trivial region*.*

Intuitively, vectors $\vec{x}$ and $\vec{y}$ denote the sets of incoming and outgoing arcs of the place corresponding to the region, respectively, and $c$ determines whether it is initially marked. Figure 8 shows a region for a language over four activities, i.e., each solution $(\vec{x}, \vec{y}, c)$ of the inequation system can be regarded in the context of a Petri net, where the region corresponds to a feasible place with preset $\{t \mid t \in \Sigma, \vec{x}(t) = 1\}$ and postset $\{t \mid t \in \Sigma, \vec{y}(t) = 1\}$, and initially marked with $c$ tokens. Note that we do not assume arc-weights here, while the authors of [6,7,22,38] do.
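The inequality of Definition 6 can be evaluated directly on the prefix closure of a small log. The sketch below does so for $L_6$; it only tests a given candidate triple and does not search for regions or solve the inequation system. The binary vectors are represented as Python sets of activities (an assumption made for brevity), and `prefix_closure` and `is_region` are hypothetical names.

```python
from collections import Counter

# Trace variants of L6; each character is one activity.
L6 = [tuple(v) for v in ["abcgeh", "abcfgh", "abdgeh", "abdegh", "abecgh",
                         "abedgh", "acbegh", "acbfgh", "adbegh", "adbfgh"]]


def prefix_closure(traces):
    """All prefixes of all traces, including the empty prefix (Definition 5)."""
    return {t[:i] for t in traces for i in range(len(t) + 1)}


def is_region(language, x, y, c):
    """Check Definition 6; x and y list the activities whose entry is 1."""
    for w in language:
        if not w:
            continue  # only non-empty sequences w = w' . a are constrained
        w_prev = Counter(w[:-1])  # Parikh vector of w'
        w_full = Counter(w)       # Parikh vector of w
        produced = sum(w_prev[t] for t in x)
        consumed = sum(w_full[t] for t in y)
        if c + produced - consumed < 0:
            return False
    return True


closure = prefix_closure(L6)
print(is_region(closure, x={"a"}, y={"h"}, c=0))  # a place from a to h is feasible
print(is_region(closure, x={"f"}, y={"g"}, c=0))  # infeasible: g also occurs without f
```

The second candidate fails because a prefix such as $\langle a, b, c, g \rangle$ would need a token that was never produced.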

Since the place represented by a region is a place which can be added to a Petri net, without disturbing the fact that the net can reproduce the language under consideration, such a place is called a *feasible* place.

**Definition 7 (Feasible place).** *Let* $\mathcal{L}$ *be a prefix-closed language over* $\Sigma$ *and let* $N = ((P, \Sigma, F), m)$ *be a marked Petri net. A place* $p \in P$ *is called* feasible *if and only if there exists a* corresponding *region* $(\vec{x}, \vec{y}, c) \in \Re(\mathcal{L})$ *such that* $m(p) = c$*, and* $\vec{x}(t) = 1$ *if and only if* $t \in {}^{\bullet}p$*, and* $\vec{y}(t) = 1$ *if and only if* $t \in p^{\bullet}$*.*

In general, there are many feasible places for any given event log (when considering arc-weights in the discovered Petri net, there are even infinitely many). Several methods exist for selecting an appropriate subset of these places. The authors of [7,38] present two ways of finitely representing these places, namely a *basis representation* and a *separating representation*. Both representations maximize precision, i.e. they select a set of places such that the behavior of the model outside of the log is minimal.

In contrast, the authors of [63,65,66,68] focus on those feasible places that express some causal dependency observed in the event log, and/or ensure that the entire model is a connected workflow net. They do so by introducing various cost functions that favor one solution of the inequation system over another and then selecting the top candidates.

#### **3.3 Strengths and Limitations of Region Theory**

The goal of region theory is to find a Petri net that perfectly describes the observed behavior (where this behavior is specified in terms of a language or a state space). As a result, the Petri nets are perfectly fitting and maximally precise. Consequently, the assumption on the input event log is that it records a *full behavioral specification*, i.e., that the input is *complete* and *noise free*, while the assumption on the output is that it is a *compact* and *exact* representation of the behavior recorded in the input event log. To this end, we note that, although in this section we have focused on safe nets, the theory of regions can represent general k-bounded Petri nets, a feature that is not yet provided by any other automated process discovery technique.

When applying region theory in the context of process mining, it is therefore very important to perform any required generalization of the behavior recorded in the input event log *before* calling region theory algorithms. For state-based regions, the challenge lies in the construction of the state space from the event log, while for language-based regions it lies in the selection of the appropriate prefixes to include in the final prefix-closed language in order to ensure some level of generalization.

In the next section, we will see that Split Miner relaxes the requirement of having the full behavioral specification recorded in the input event log, striving instead to discover BPMN process models that maximize the *balance* between fitness and precision.

## **4 Split Miner**

In the following, we describe how Split Miner (hereinafter, SM) discovers a BPMN model starting from an event log. SM operates in six steps (cf. Fig. 9). In the first step, it constructs the DFG and analyzes it to detect self-loops and short-loops. In the second step, it discovers concurrency relations between pairs of activities in the DFG. In the third step, the DFG is filtered by applying a filtering algorithm designed to balance the fitness and precision of the final BPMN model while maintaining a low control-flow complexity. The fourth and fifth steps focus (respectively) on the discovery of split and join gateways: activities having multiple outgoing edges are turned into a hierarchy of split gateways, while activities having multiple incoming edges are turned into a hierarchy of join gateways. Lastly, in the sixth step, if any OR-joins were discovered, they are analyzed and turned (whenever possible) into either XOR-gateways or AND-gateways.

Although some of the steps executed by SM are typical of basic automated process discovery techniques such as alpha miner and inductive miner (e.g., the filtering of the DFG), the steps of SM were designed to overcome the limitations of such techniques, most notably to increase precision without compromising fitness and/or structural complexity. Furthermore, in SM, each step can operate as a black box, allowing for future improvements by redesigning or enhancing one step at a time [5].

We now provide a brief overview of each step of SM in a tutorial-like fashion, by leveraging the example log $L_6 = [\langle a,b,c,g,e,h \rangle^{10}, \langle a,b,c,f,g,h \rangle^{10}, \langle a,b,d,g,e,h \rangle^{10}, \langle a,b,d,e,g,h \rangle^{10}, \langle a,b,e,c,g,h \rangle^{10}, \langle a,b,e,d,g,h \rangle^{10}, \langle a,c,b,e,g,h \rangle^{10}, \langle a,c,b,f,g,h \rangle^{10}, \langle a,d,b,e,g,h \rangle^{10}, \langle a,d,b,f,g,h \rangle^{10}]$ (introduced in Sect. 2). Given that an in-depth analysis of the algorithms behind SM would be out of the scope of this chapter and book, we refer the interested reader to the original work [3].

**Fig. 9.** Overview of the Split Miner algorithm.

#### **4.1 Step 1: DFG and Loops Discovery**

Given the input event log $L_6$, SM immediately builds its DFG, as shown in Fig. 10a. In this example, all the traces have the same start and end activity. Nevertheless, SM automatically adds artificial start and end activities (represented by dedicated start and end nodes in the figure).

Then, SM detects *self-loops* and *short-loops*, i.e., loops involving only one and two activities (respectively). Loops are known to cause problems when detecting concurrency [60], hence, we want to detect loops before detecting concurrency.

The simplest of the loops is the *self-loop*: a self-loop exists if a node is both source and target of one arc of the DFG, i.e., a → a. Short-loops and their frequencies are detected in the log as follows. Given two activities a and b, for SM, a short-loop between a and b exists if and only if (iff) the following two conditions hold:

**i.** both a and b are not self-loops;

**ii.** there exists at least one log trace containing the subtrace $\langle a, b, a \rangle$ or $\langle b, a, b \rangle$.

Condition (i) is necessary because otherwise the short-loop evaluation may not be reliable. In fact, if we consider a process that allows concurrency between a self-loop activity a and a normal activity b, we could observe log traces containing the subtrace $\langle a, b, a \rangle$, which can also characterize a short-loop between a and b. Condition (ii) guarantees that we have observed (in at least one trace of the log) a short-loop between the two activities. In fact, short-loops are characterized by subtraces of the type $\langle a, b, a \rangle$ or $\langle b, a, b \rangle$.

**Fig. 10.** Processing of the directly-follows graph.

The detected self-loops are trivially removed from the DFG and restored only in the output BPMN model, while the detected short-loops are saved and used in the next step. In our example (Fig. 10a), there are no self-loops or short-loops.
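Conditions (i) and (ii) translate into a simple scan over the traces. The sketch below is one possible reading of these conditions and not SM's actual implementation. It ignores loop frequencies, and the function names `self_loops` and `short_loops` are illustrative.

```python
# Trace variants of L6; each character is one activity.
L6 = [tuple(v) for v in ["abcgeh", "abcfgh", "abdgeh", "abdegh", "abecgh",
                         "abedgh", "acbegh", "acbfgh", "adbegh", "adbfgh"]]


def self_loops(traces):
    """Activities that directly follow themselves in at least one trace."""
    return {x for t in traces for x, y in zip(t, t[1:]) if x == y}


def short_loops(traces):
    """Unordered pairs {a, b} that satisfy conditions (i) and (ii)."""
    loops = self_loops(traces)
    found = set()
    for t in traces:
        # Slide a window of three consecutive events over the trace.
        for x, y, z in zip(t, t[1:], t[2:]):
            is_aba = x == z and x != y                   # condition (ii)
            no_self = x not in loops and y not in loops  # condition (i)
            if is_aba and no_self:
                found.add(frozenset((x, y)))
    return found


print(self_loops(L6), short_loops(L6))      # both empty for L6
print(short_loops([("a", "b", "a", "c")]))  # the pair {a, b} is reported
print(short_loops([("a", "a", "b", "a")]))  # empty: a is a self-loop
```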

#### **4.2 Step 2: Concurrency Discovery**

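Given a DFG and any two activities a and b, such that neither a nor b is a self-loop, for SM, a and b are considered concurrent (denoted as $a \| b$) iff three conditions hold:

**iii.** the DFG contains both arcs (a, b) and (b, a);

**iv.** a and b do not form a short-loop;

**v.** the arcs (a, b) and (b, a) are both observed frequently and with similar frequencies, where the tolerated difference is controlled by a threshold ε.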

These three conditions define the heuristic-based concurrency oracle of SM. The rationale behind the conditions is the following. Condition (iii) captures the basic requirement for $a \| b$: the existence of the arcs (a, b) and (b, a) entails that a and b can occur in any order. However, Condition (iii) is not sufficient to postulate concurrency because it may hold in three cases: a and b form a short-loop; (a, b) *or* (b, a) is an infrequent observation (e.g., noise in the data); a and b are concurrent. We are interested in identifying when the third case holds. To this end, we check Conditions (iv) and (v). When Condition (iv) holds, we can exclude the first case because a and b do not form a short-loop. When Condition (v) holds, we can exclude the second case because (a, b) and (b, a) are both observed frequently and have similar frequencies. At this point, we are left with the third case and we assume $a \| b$. The variable ε is a user input parameter: the smaller its value, the more similar the numbers of observations of (a, b) and (b, a) have to be. Conversely, with ε = 1, Condition (v) always holds.

Whenever we find ab, we remove the arcs (a, b) and (b, a) from the DFG, since we assume there is no causality but instead there is concurrency. On the other hand, if we find that either (a, b) or (b, a) represents an infrequent directly-follows relation, we remove the least frequent of the two edges. We call the output of this step a *Pruned DFG* (PDFG).

In the example in Fig. 10a, we identify four possible cases of concurrency: (b, c), (b, d), (d, e), (e, g). Setting ε = 0.25, we capture the following concurrency relations: $b \| c$, $b \| d$, $d \| e$, $e \| g$. The resulting PDFG is shown in Fig. 10b.
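The concurrency oracle and the pruning it triggers can be sketched on $L_6$ as follows. The similarity measure used for Condition (v), namely the relative difference $\frac{|\,|a \to b| - |b \to a|\,|}{|a \to b| + |b \to a|} < \varepsilon$, is an assumption of this sketch. It is consistent with the statement that ε = 1 always satisfies the condition and with the four pairs reported for ε = 0.25. The function names are hypothetical, and self-loops are assumed to have been removed in Step 1.

```python
from collections import Counter

VARIANTS = ["abcgeh", "abcfgh", "abdgeh", "abdegh", "abecgh",
            "abedgh", "acbegh", "acbfgh", "adbegh", "adbfgh"]
L6 = {tuple(v): 10 for v in VARIANTS}


def directly_follows(log):
    """Directly-follows frequencies of a log given as {trace: frequency}."""
    dfg = Counter()
    for trace, freq in log.items():
        for x, y in zip(trace, trace[1:]):
            dfg[(x, y)] += freq
    return dfg


def prune(dfg, epsilon, short_loops=frozenset()):
    """Return (concurrent pairs, pruned DFG) following Step 2."""
    concurrent, pdfg = set(), dict(dfg)
    for (a, b), ab in dfg.items():
        ba = dfg.get((b, a), 0)
        # Visit each unordered pair once; both arcs must exist (condition iii)
        # and the pair must not be a short-loop (condition iv).
        if a >= b or ba == 0 or frozenset((a, b)) in short_loops:
            continue
        if abs(ab - ba) / (ab + ba) < epsilon:   # condition v (assumed measure)
            concurrent.add((a, b))
            del pdfg[(a, b)], pdfg[(b, a)]       # concurrency, not causality
        else:
            # One direction is treated as infrequent: drop the rarer arc.
            del pdfg[(a, b) if ab < ba else (b, a)]
    return concurrent, pdfg


concurrent, pdfg = prune(directly_follows(L6), epsilon=0.25)
print(sorted(concurrent))
print(sorted(pdfg))
```

With a stricter threshold such as ε = 0.1, the pair (e, g) is no longer considered concurrent, because the two arcs are observed 30 and 20 times respectively. In that case only the less frequent arc (g, e) is removed.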

#### **4.3 Step 3: Filtering**

**Fig. 11.** Processing of the directly-follows graph.

The filtering algorithm applied by SM on the PDFG is based on three criteria. First, each node of the PDFG must be on a path from the single start node (source) to the single end node (sink). Second, for each node, (at least one of) its path(s) from source to sink must be the one having maximum capacity. In our context, the capacity of a path is the frequency of the least frequent arc of the path. Third, the number of edges of the PDFG must be minimal. The three criteria aim to guarantee that the discovered BPMN process model is accurate and simple at the same time.

The filtering algorithm performs a double breadth-first exploration: forward (source to sink) and backward (sink to source). During the forward exploration, for each node of the PDFG, we discover its maximum source-to-node capacity and the incoming edge granting such capacity (*best incoming edge*). During the backward exploration, for each node of the PDFG, we discover its maximum node-to-sink capacity and the *best outgoing edge*. Then, we remove from the PDFG all the edges that are not best incoming edges or best outgoing edges. In doing so, we may reduce the amount of behavior that the final model can replay, and consequently its fitness. Therefore, we introduce a frequency threshold that allows the user to strike a balance between fitness and precision. Precisely, we compute the η percentile over the frequencies of the best incoming and outgoing edges of each node, and we add back to the PDFG the edges with a frequency exceeding the threshold. It is important to note that the percentile is not taken over the frequencies of *all* the edges, otherwise we would simply retain η percent of all the edges. This also means that even by setting η = 0, SM will still apply a certain amount of filtering.

Figure 11b shows the output of the filtering algorithm when applied to the PDFG of our working example (Fig. 11a). As a consequence of retaining the best incoming and outgoing edges for each node, we would drop the arcs: (e, c) and (c, f); and they would not be retained regardless of the value assigned to η.
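The selection of best incoming and outgoing edges can be reproduced on the working example. The sketch below makes several assumptions. The arc frequencies are recomputed from $L_6$ after the pruning of Step 2. Activities a and h stand in for the artificial start and end nodes. A simple fixed-point iteration replaces the breadth-first exploration described above. The percentile-based re-insertion of edges is left out. The names `PDFG` and `best_edges` are illustrative.

```python
import math

# Pruned DFG of the working example (frequencies recomputed from L6).
PDFG = {("a", "b"): 60, ("a", "c"): 20, ("a", "d"): 20, ("b", "e"): 40,
        ("b", "f"): 20, ("c", "f"): 10, ("c", "g"): 20, ("d", "g"): 20,
        ("e", "c"): 10, ("e", "h"): 20, ("f", "g"): 30, ("g", "h"): 80}


def best_edges(arcs, root):
    """One edge per node that lies on a maximum-capacity path from root.

    The capacity of a path is the frequency of its least frequent arc.
    """
    capacity = {root: math.inf}
    best = {}
    changed = True
    while changed:  # capacities only grow, so the loop terminates
        changed = False
        for (u, v), freq in arcs.items():
            if u in capacity and min(capacity[u], freq) > capacity.get(v, 0):
                capacity[v] = min(capacity[u], freq)
                best[v] = (u, v)
                changed = True
    return set(best.values())


# Forward exploration from the source, backward exploration from the sink.
best_incoming = best_edges(PDFG, "a")
reversed_arcs = {(v, u): f for (u, v), f in PDFG.items()}
best_outgoing = {(u, v) for (v, u) in best_edges(reversed_arcs, "h")}

kept = best_incoming | best_outgoing
dropped = set(PDFG) - kept
print(sorted(dropped))  # the arcs that are neither best incoming nor best outgoing
```

Several nodes have ties between candidate edges (for instance, g can be reached with capacity 20 through c, d, or f). In this example the tied edges are retained anyway as best outgoing edges of their source nodes, so the set of dropped arcs does not depend on how ties are broken.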

#### **4.4 Step 4: Splits Discovery**

Before discovering the split gateways, the filtered PDFG is converted into a BPMN process model by turning the start and end nodes of the graph into the start and end events of the BPMN model, and each other node of the graph into a BPMN activity. Figure 12a shows the BPMN model<sup>2</sup> generated from the filtered PDFG of our working example (Fig. 11b). Now, let us focus on the discovery of the split gateways by considering the example in Fig. 13a. Given an activity with multiple outgoing edges (e.g., activity z), the splits discovery is based on the idea that all the activities directly following (successors of) the same split gateway must have the same *concurrency* and/or *mutually exclusive* relations with the activities that do not directly follow their preceding split gateway. With hindsight and reference to Fig. 13b, we see that since activities c and d are successors of gateway and1, both c and d are concurrent to e, f, g, due to gateway and3 (i.e., $c \| e$, $c \| f$, $c \| g$, and $d \| e$, $d \| f$, $d \| g$). At the same time, both c and d are mutually exclusive with a and b, due to gateway xor3. Considering activities by pairs, and analyzing which concurrency or mutually exclusive relations they have in common, we can generate the appropriate splits hierarchy.

**Fig. 12.** Processing of the BPMN model.

With this in mind, we continue our working example. Let us consider activity A (Fig. 12a), which has three successors: B, C, and D. From the outcome of *Step 2*, we know that both C and D are concurrent to B, while C and D are not concurrent (hence, mutually exclusive with each other). Since C and D share the same relations to other activities (both are concurrent to B), they can be selected as successors of the same gateway, which in this case would be an XOR-gateway because C and D are mutually exclusive. After we add the XOR-gateway, the successors of activity A will be two: B and the newly added XOR-gateway (see Fig. 12b). The algorithm becomes trivial when an activity with multiple outgoing edges has only two successors: it is enough to add a split gateway matching the relation between the two successors. Continuing the example of activity A, the successor B is in parallel with the newly added XOR-gateway or, more precisely, with all the activities following the XOR-gateway (activities C and D). Therefore, we can add an AND-gateway preceding B and the XOR-gateway. Similarly, if we consider activity B and its two successors, activities E and F, given that they are not concurrent, they must be mutually exclusive and therefore an XOR-gateway is placed before them. The result of the splits discovery is shown in Fig. 12c.

<sup>2</sup> Labels are capitalised to distinguish them from the DFG nodes.

**Fig. 13.** Splits discovery example.
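The grouping of the successors of A described above can be mimicked with a small bottom-up merge. The sketch below is a simplified reading of the idea and should not be taken as SM's actual splits discovery algorithm. Each item carries a *cover* (the activities it groups) and a *future* (the successors concurrent to it). Items with identical futures are joined by an XOR-split, and items whose cover and future together span the same activities are joined by an AND-split. The names `discover_splits` and `find_merge` are hypothetical.

```python
def find_merge(items):
    """Find two items that can be grouped under one gateway, if any."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (t1, c1, f1), (t2, c2, f2) = items[i], items[j]
            if f1 == f2:                 # same relations to all other successors
                return i, j, (("XOR", t1, t2), c1 | c2, f1)
            if c1 | f1 == c2 | f2:       # each item is concurrent to the other
                return i, j, (("AND", t1, t2), c1 | c2, f1 & f2)
    return None


def discover_splits(successors, concurrent):
    """Return a nested tuple such as ('AND', 'B', ('XOR', 'C', 'D'))."""
    def conc(a, b):
        return (a, b) in concurrent or (b, a) in concurrent

    # Each item is (tree, cover, future).
    items = [(s, frozenset([s]),
              frozenset(t for t in successors if conc(s, t)))
             for s in successors]
    while len(items) > 1:
        merge = find_merge(items)
        if merge is None:
            raise ValueError("no grouping found by this simplified sketch")
        i, j, new_item = merge
        items = [x for k, x in enumerate(items) if k not in (i, j)]
        items.append(new_item)
    return items[0][0]


# Successors of A, with B concurrent to both C and D (outcome of Step 2).
print(discover_splits(["B", "C", "D"], {("B", "C"), ("B", "D")}))
# Successors of B: E and F are not concurrent, hence mutually exclusive.
print(discover_splits(["E", "F"], set()))
```

The nested tuples returned for these two calls correspond to the gateway hierarchy discussed for activities A and B.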

#### **4.5 Step 5: Joins Discovery**

Once all the split gateways have been placed, we can discover the join gateways. To do so, we rely on the Refined Process Structure Tree (RPST) [46] of the current BPMN model. The RPST of a process model is a tree data structure where the tree nodes represent the single-entry single-exit (SESE) fragments of the process model, and the tree edges denote a containment relation between SESE fragments. Precisely, the children of a SESE fragment are its directly contained SESE fragments, whilst SESE fragments on different branches of the tree are disjoint. Each SESE fragment represents a *subgraph* of the process model, and the partition of the process model into SESE fragments is made in terms of edges. A SESE fragment can be of one of the following four types: a *trivial* fragment, which consists of a single edge; a *polygon*, which consists of a sequence of fragments; a *bond*, which is a fragment where all the children fragments share two common nodes, one being the entry and the other being the exit of the *bond*; and a *rigid*, which represents any other fragment. Each SESE fragment is classified as *homogeneous*, if the gateways it contains (and are not contained in any of its SESE children) are all of the same type (e.g., only XOR-gateways), or *heterogeneous* if its gateways have different types. Figure 14a and Fig. 14b show two examples of homogeneous SESE fragments: a *bond* and a *rigid*.

We note that, at this stage, in the BPMN model (Fig. 12c) all the SESE fragments' exits correspond to activities with multiple incoming edges, which we aim to turn into join gateways. Starting from the leaves of the RPST, i.e., the innermost SESE fragments of the process model, we explore the RPST bottom-up. For each SESE fragment we encounter in this exploration, we select the activities it contains that have multiple incoming edges (there is always at least one, the SESE fragment exit). For each of the selected activities, we add a join gateway preceding it. The join gateway type depends on whether the SESE fragment is homogeneous or heterogeneous. In the former case, the join gateway has the same type as the other gateways in the SESE fragment; in the latter case, the join gateway is an OR-gateway. Figure 14 shows in brief how our approach works for SESE bonds (Fig. 14a), for homogeneous SESE rigids (Fig. 14b), and for all other cases, i.e., heterogeneous SESE rigids (Fig. 14c).

Returning to our working example (Fig. 12c), we can discover three joins. The first one is the XOR-join in the SESE bond containing activities C, D and G, with G as the exit of the bond and the XOR-split as the entry. The bond is XOR-homogeneous, so that the type of the join is set to XOR. The remaining two joins are in the parent SESE fragment of the bond, which is a heterogeneous rigid, hence, we place two OR-joins. The resulting model is shown in Fig. 12d.
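The type-selection rule for a join can be sketched in a few lines. This is only an illustration, not Split Miner's implementation: the function name and the string labels are hypothetical, the RPST is assumed to have been computed elsewhere, and the fallback for a fragment without any direct gateways is an assumption that the text does not cover.

```python
from typing import Iterable


def join_gateway_type(fragment_gateway_types: Iterable[str]) -> str:
    """Pick the type of a join gateway added inside one SESE fragment.

    fragment_gateway_types holds the types ("AND", "XOR", "OR") of the
    gateways that the fragment contains directly, i.e. not the gateways
    that belong to one of its child fragments.
    """
    kinds = set(fragment_gateway_types)
    if len(kinds) == 1:
        # Homogeneous fragment: reuse the single gateway type it contains.
        return next(iter(kinds))
    # Heterogeneous fragment: fall back to an OR-join.
    # Assumption: a fragment without direct gateways is treated the same way.
    return "OR"


# The XOR-homogeneous bond of the working example yields an XOR-join,
# while a rigid mixing AND and XOR gateways yields an OR-join.
print(join_gateway_type(["XOR"]))
print(join_gateway_type(["AND", "XOR"]))
```

Applied bottom-up over the RPST, this rule is evaluated once per activity with multiple incoming edges.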

**Fig. 14.** Joins discovery examples.

#### **4.6 Step 6: OR-joins Minimization**

The previous step may leave several OR-join gateways in the discovered BPMN model. Since OR-gateways can be difficult to interpret [42], SM tries to remove them by analyzing the process behavior and turning OR-gateways into AND- or XOR-gateways whenever the behavior is interchangeable.

#### **4.7 Strengths and Limitations of Split Miner**

SM was designed to bring together the strengths of older and basic automated process discovery algorithms while addressing their limitations. An example of this design strategy is the filtering algorithm. Past filtering algorithms were either based on heuristics [73,79] that risk compromising the correctness of the output model, or driven by structural requirements [35]. While SM retains the idea of an integrated filtering algorithm, it focuses on balancing fitness, precision, and simplicity of the output process model.

Whereas past automated discovery algorithms favored either accuracy [73,79] or simplicity [11,35], SM aims to strike a trade-off between the two. The splits and joins discovery steps do not impose any structural constraint on the output process model, as opposed to inductive miner [35] and evolutionary tree miner [11], which enforce block-structuredness. This allows SM to pursue accuracy. Yet, the discovery of the split gateways is designed to produce hierarchies of gateways, which foster simplicity and structuredness, while the join discovery and the use of OR-gateways allow for simplicity without compromising accuracy.

However, SM also has its own limitations. First, SM was designed for real-life contexts, and it operates under the assumption that there is always some infrequent behavior to filter out. Second, SM may discover unsound processes. Indeed, hitherto soundness has been guaranteed only by enforcing block-structuredness, a trend that SM does not adhere to. While SM guarantees to discover deadlock-free process models [3], it does not guarantee that such process models respect the soundness property of *proper completion*, so that when a token reaches the end event of the process model, more tokens may be left behind. Nonetheless, the chances of SM discovering an unsound process model are very low [2], and in most cases it can discover accurate yet simple and sound process models.

#### **5 Log Skeletons**

The previous sections introduced three advanced mining algorithms that tackle the example event log L<sup>6</sup> with more success than the basic algorithms as introduced in Sect. 2. Like these basic algorithms, these advanced algorithms all result in an imperative process model, that is, a process model that indicates what the next possible steps are. However, next to these imperative models, we also have declarative models, like Declare [45]. Unlike an imperative model, a declarative model does not specify what the next possible steps are, instead it provides a collection of constraints that any process instance in the end should adhere to.

This section introduces an advanced mining algorithm that results in a declarative process model, called a *log skeleton* [75]. This algorithm has been implemented as the "Visualize Log as Log Skeleton" visualizer plugin in ProM 6 [76]. Provided an event log L, the algorithm first extends the provided event log with the artificial start activity and the artificial end activity. In accordance with Sect. 2, we use $L'$ to denote the event log L extended with these artificial activities. Second, the algorithm discovers from this extended event log $L'$ the collection of initial specific constraints it adheres to. Third, it reduces some of these constraints, keeping only those constraints that are considered to be relevant. Fourth, it shows the most-relevant constraints to the user as a graph. These last three steps are detailed in the next sections.

#### **5.1 Discovering the Log Skeleton**

The specific constraints in a log skeleton are the following three activity frequencies and six binary activity relations.

**Definition 8 (Log Skeleton Frequencies and Relations).** *Let* $L' \in B(U\_{act}^{\ast})$ *be an extended event log and let* a, b ∈ $act(L')$ *be two different activities.*

$$c\_{L'}(a) = \#\_{L'}^{act}(a)$$

*is the* frequency *of activity* a *in event log* $L'$*.*

$$l\_{L'}(a) = \min \{ |\sigma \restriction \{ a \}| \mid \sigma \in L' \}$$

*is the* lowest frequency *of activity* a *in any trace in event log* $L'$*.*

$$h\_{L'}(a) = \max \{ |\sigma \restriction \{ a \}| \mid \sigma \in L' \}$$

*is the* highest frequency *of activity* a *in any trace in event log* $L'$*.*

$$(a,b)\in E\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} |\sigma \restriction \{a\}| = |\sigma \restriction \{b\}|$$

*denotes that for every trace in event log* $L'$ *the frequencies of activities* a *and* b *are the same. Note that the relation* $E\_{L'}$ *induces an* equivalence relation *over the activities. We use* $r\_{L'}(a)$ *to denote the* representative activity *for the equivalence class of activity* a *(by definition,* $(r\_{L'}(a), a) \in E\_{L'}$*).*

$$(a, b) \in R\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \exists\_{j \in \{i+1, \dots, |\sigma|\}} \sigma\_j = b)$$

*denotes that for every trace in event log* $L'$ *an occurrence of activity* a *is* always followed *by an occurrence of activity* b*. This corresponds to the* response *relation in Declare.*

$$(a, b) \in P\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \exists\_{j \in \{1, \dots, i - 1\}} \sigma\_j = b)$$

*denotes that for every trace in event log* $L'$ *an occurrence of activity* a *is always preceded by an occurrence of activity* b*. This corresponds to the* precedence *relation in Declare.*

$$(a, b) \in \overline{R}\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \nexists\_{j \in \{i+1, \dots, |\sigma|\}} \sigma\_j = b)$$

*denotes that for every trace in event log* $L'$ *an occurrence of activity* a *is never followed by an occurrence of activity* b*.*

$$(a, b) \in \overline{P}\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \nexists\_{j \in \{1, \dots, i-1\}} \sigma\_j = b)$$

*denotes that for every trace in event log* $L'$ *an occurrence of activity* a *is never preceded by an occurrence of activity* b*.*

$$(a, b) \in \overline{C}\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \nexists\_{j \in \{1, \dots, |\sigma|\}} \sigma\_j = b)$$

*denotes that for every trace in event log* $L'$ *an occurrence of activity* a *never co-occurs with an occurrence of activity* b*.*

**Fig. 15.** The nodes of the log skeleton discovered from the event log *L*6.

Figure 15 shows that we can easily visualize the frequencies and the equivalence relation in the nodes of the log skeleton. The activity, the representative of the equivalence class and the frequencies are simply shown at the bottom of the node, whereas equivalent nodes also have the same background color. For example, Fig. 15 immediately shows that the activities a, b, g, h, and the two artificial activities are equivalent.

The remaining five activity relations will be visualized by edges between these nodes. However, there could be many such relations, which could very well result in a model that is often called a spaghetti model: A model that contains way too many edges to make any sense of it. Consider, for example, Table 1, which shows that for event log L<sup>6</sup> there are relations between 80 out of 90 possible pairs of different activities, like (f, b) ∈ P<sup>L</sup><sup>6</sup> ∩ R<sup>L</sup><sup>6</sup> . For this reason, the algorithm reduces the collection of these remaining five relations to a collection of *relevant* relations.
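As an illustration of Definition 8, the frequencies and the six binary relations can be computed by brute force from a small log. The sketch below is not the ProM plugin's code: the function name `discover`, the relation keys, and the labels `"|>"` and `"[]"` for the artificial start and end activities are hypothetical choices, and a log is assumed to be a list of tuples of activity names.

```python
from collections import Counter
from itertools import permutations

START, END = "|>", "[]"  # hypothetical labels for the artificial activities

# One predicate per relation; it must hold at every position i where a occurs.
PREDICATES = {
    "R": lambda t, i, b: b in t[i + 1:],         # always followed
    "P": lambda t, i, b: b in t[:i],             # always preceded
    "notR": lambda t, i, b: b not in t[i + 1:],  # never followed
    "notP": lambda t, i, b: b not in t[:i],      # never preceded
    "notC": lambda t, i, b: b not in t,          # never co-occurs
}


def discover(log):
    """Return (frequency, lowest, highest, relations) for the extended log."""
    ext = [(START, *trace, END) for trace in log]
    acts = sorted({x for t in ext for x in t})
    counts = [Counter(t) for t in ext]
    freq = {a: sum(c[a] for c in counts) for a in acts}
    low = {a: min(c[a] for c in counts) for a in acts}
    high = {a: max(c[a] for c in counts) for a in acts}
    rel = {name: set() for name in ("E", *PREDICATES)}
    for a, b in permutations(acts, 2):  # ordered pairs of different activities
        if all(c[a] == c[b] for c in counts):
            rel["E"].add((a, b))
        for name, pred in PREDICATES.items():
            if all(pred(t, i, b) for t in ext for i, x in enumerate(t) if x == a):
                rel[name].add((a, b))
    return freq, low, high, rel


# Toy log (not the event log L6 of the chapter).
freq, low, high, rel = discover([("a", "b", "c"), ("a", "c")])
print(freq["b"], low["b"], high["b"])
print(sorted(rel["R"]))
```

The quadratic loop over activity pairs is the simplest possible design and is meant only to mirror the definition term by term.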

**Table 1.** An overview of the initial non-Equivalence relations for event log *L*6.


**Definition 9 (Relevant Log Skeleton Relations).** *Let* $L' \in B(U\_{act}^{\ast})$ *be an extended event log and let* a, b ∈ $act(L')$ *be two different activities.*

$$\begin{aligned} (a,b) \in \mathcal{R}\_{L'} \Leftrightarrow{} & (a,b) \in R\_{L'} \\ & \land \nexists\_{c \in act(L')} ((a,c) \in R\_{L'} \land (c,b) \in R\_{L'}) \end{aligned}$$

*that is,* $\mathcal{R}\_{L'}$ *is the transitively reduced version of* $R\_{L'}$*. Clearly, if* a *is always followed by* c *and* c *is always followed by* b*, then* a *must be always followed by* b*.*

$$\begin{aligned} (a,b) \in \mathcal{P}\_{L'} \Leftrightarrow{} & (a,b) \in P\_{L'} \\ & \land \nexists\_{c \in act(L')} ((a,c) \in P\_{L'} \land (c,b) \in P\_{L'}) \end{aligned}$$

*that is,* $\mathcal{P}\_{L'}$ *is the transitively reduced version of* $P\_{L'}$*. Clearly, if* a *is always preceded by* c *and* c *is always preceded by* b*, then* a *must be always preceded by* b*.*

$$\begin{aligned} (a,b) \in \overline{\mathcal{R}}\_{L'} \Leftrightarrow{} & (a,b) \in \overline{R}\_{L'} \\ & \land (a,b) \notin \overline{C}\_{L'} \\ & \land \nexists\_{c \in act(L')} ((a,c) \in \overline{R}\_{L'} \land (c,b) \in \overline{R}\_{L'}) \end{aligned}$$

*that is,* $\overline{\mathcal{R}}\_{L'}$ *is the transitively reduced version of* $\overline{R}\_{L'}$*, on top of which the fact that* a *is never followed by* b *is also considered irrelevant if* a *and* b *do not co-occur. It is not true that if* a *is never followed by* c *and* c *is never followed by* b*, then* a *is never followed by* b*. Consider, for example, the event log containing the traces* ⟨a, b⟩*,* ⟨b, c⟩*, and* ⟨c, a⟩*. We are aware of this, but believe the benefits of doing the transitive reduction outweigh the fact that we may remove relevant relations.*

$$\begin{aligned} (a,b) \in \overline{\mathcal{P}}\_{L'} \Leftrightarrow{} & (a,b) \in \overline{P}\_{L'} \\ & \land (a,b) \notin \overline{C}\_{L'} \\ & \land \nexists\_{c \in act(L')} ((a,c) \in \overline{P}\_{L'} \land (c,b) \in \overline{P}\_{L'}) \end{aligned}$$

*that is,* $\overline{\mathcal{P}}\_{L'}$ *is the transitively reduced version of* $\overline{P}\_{L'}$*, on top of which the fact that* a *is never preceded by* b *is also considered irrelevant if* a *and* b *do not co-occur. As with* $\overline{R}\_{L'}$*, it is not true that if* a *is never preceded by* c *and* c *is never preceded by* b*, then* a *is never preceded by* b*.*

$$\begin{aligned} (a,b) \in \overline{\mathcal{C}}\_{L'} \Leftrightarrow{} & (a,b) \in \overline{C}\_{L'} \\ & \land \nexists\_{c \in act(L')} ((a,c) \in P\_{L'} \land (c,b) \in \overline{C}\_{L'}) \\ & \land \nexists\_{c \in act(L')} ((b,c) \in P\_{L'} \land (c,a) \in \overline{C}\_{L'}) \end{aligned}$$

*Clearly, if* b *is always preceded by* c *and* c *does not co-occur with* a*, then* b *cannot co-occur with* a*. Note that we could also have used the always-follows relation* $R\_{L'}$ *here instead of the always-precedes relation* $P\_{L'}$*, but using the latter relation results in the relevant never-co-occurs relations being more at the beginning of the process, that is, towards the point where the actual decision was made to choose one or the other.*

**Table 2.** An overview of the relevant non-Equivalence relations for event log *L*6.

Table 2 shows the results for the event log L6: Of the 80 initial relations, only 32 are considered to be relevant. Finally, the algorithm shows the log skeleton as a graph to the user, where this graph contains only edges for the relevant relations.
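The reductions of Definition 9 can be sketched as plain set comprehensions. This is a minimal illustration under the assumption that each relation is a Python set of ordered activity pairs; the function names are hypothetical and do not come from the LogSkeleton package.

```python
def reduce_relation(rel, acts):
    """Keep (a, b) only if no c exists with (a, c) and (c, b) both in rel."""
    return {(a, b) for (a, b) in rel
            if not any((a, c) in rel and (c, b) in rel for c in acts)}


def relevant_never(rel, not_c, acts):
    """Reduction for the never-followed and never-preceded relations.

    Pairs that never co-occur (not_c) are dropped in addition to the
    transitive reduction.
    """
    return reduce_relation(rel, acts) - not_c


def relevant_not_cooccur(not_c, p, acts):
    """Drop (a, b) if it is implied through the always-preceded relation p."""
    def implied(x, y):
        return any((x, c) in p and (c, y) in not_c for c in acts)
    return {(a, b) for (a, b) in not_c
            if not implied(a, b) and not implied(b, a)}


acts = ["a", "b", "c", "d"]
# (a, c) is implied by (a, b) and (b, c), so it is removed.
print(sorted(reduce_relation({("a", "b"), ("b", "c"), ("a", "c")}, acts)))
# d is always preceded by c, and c never co-occurs with a,
# so the pairs between a and d are not considered relevant.
not_c = {("a", "c"), ("c", "a"), ("a", "d"), ("d", "a")}
print(sorted(relevant_not_cooccur(not_c, {("d", "c")}, acts)))
```

Note that `reduce_relation` applies the same filter to the never-followed and never-preceded relations even though they are not transitive, mirroring the deliberate choice described in the definition.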

#### **5.2 Visualizing the Log Skeleton**

The discovered log skeleton is visualized using a log skeleton graph, which is a graph showing the relevant relations, the equivalence classes, and the frequencies as discovered from the event log.

**Definition 10 (Log Skeleton Graph).** *Let* $L' \in B(U\_{act}^{\ast})$ *be an extended event log and let* a, b ∈ $act(L')$*. The log skeleton graph for* $L'$ *is the graph* G = (V, E, t) *where:*

$$V = \{ (a, r\_{L'}(a), c\_{L'}(a), l\_{L'}(a), h\_{L'}(a)) | a \in act(L') \}$$

*is the set of nodes, where every node contains the activity, the representative of the activity within its equivalence class, the frequency of the activity in the log, and the minimal and maximal frequencies of the activity in any trace. If* l(a) = h(a) *then only* l(a) *is shown, otherwise* l(a)..h(a) *is shown.*

$$\begin{aligned} E = {} & (\mathcal{R}\_{L'} \cup \mathcal{P}\_{L'} \cup \overline{\mathcal{R}}\_{L'} \cup \overline{\mathcal{P}}\_{L'} \cup \overline{\mathcal{C}}\_{L'}) \\ & \cup (\mathcal{R}\_{L'} \cup \mathcal{P}\_{L'} \cup \overline{\mathcal{R}}\_{L'} \cup \overline{\mathcal{P}}\_{L'} \cup \overline{\mathcal{C}}\_{L'})^{-1} \end{aligned} \tag{1}$$

*is the set of edges, where we have an edge from one activity to another activity if we have a relevant relation between these activities (either way).*

$$t \in E \to \{\spadesuit, \spadesuit, \diamondsuit, \clubsuit, \bot\}$$

**Fig. 16.** The full log skeleton discovered from the event log *L*<sup>6</sup> (shown using a left-right orientation).

*denotes the decorator to be used to show the relation from the activity at the tail to the activity at the head:*


*These decorations are shown on the tail of the corresponding edge.*

Table 3 shows which decorators will be shown for the event log L6, and Fig. 16 shows the resulting log skeleton<sup>3</sup>. Note that the edges (a, b) and (b, a) are visualized by a single edge, with the decorator for (a, b) near a and the decorator for (b, a) near b.
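The node and edge sets of Definition 10 can be sketched as follows. The label layout with `|` separators and the function names are hypothetical; only the rule for showing `l(a)` versus `l(a)..h(a)` and the closure of the edge set under inversion are taken from the definition. The decorator function t is left out.

```python
def node_label(activity, representative, count, low, high):
    """Build the text of one node of the log skeleton graph."""
    # If lowest and highest frequency coincide, only one number is shown.
    span = str(low) if low == high else f"{low}..{high}"
    return f"{activity} | {representative} | {count} | {span}"


def skeleton_edges(*relevant_relations):
    """Union of the relevant relations together with the inverse of that union."""
    union = set().union(*relevant_relations)
    return union | {(b, a) for (a, b) in union}


print(node_label("b", "a", 20, 1, 1))
print(node_label("c", "c", 9, 0, 2))
print(sorted(skeleton_edges({("a", "b")}, {("c", "d")})))
```

Because the edge set is symmetric, a drawing routine only needs one line per unordered pair and can place one decorator at each end.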

**Table 3.** An overview of the decorators used for the non-Equivalence relations for event log *L*6.


<sup>3</sup> For the sake of completeness, we mention that we are using version 6.12.5 of the LogSkeleton package, which is available in the Nightly Build of ProM, see https://www.promtools.org/doku.php?id=nightly.

As example relations, activity b is never preceded by e (that is, if both b and e occur, then e occurs after b), e is always preceded by b, and e and f do not co-occur. Also note that although 32 relations were considered to be relevant, 34 are now shown: The relations (g, c) ∈ R and (g, d) ∈ R were not considered relevant as these relations can be induced using f. However, as (c, g) ∈ R and (d, g) ∈ R are considered relevant, the relations for (g, c) and (g, d) are shown as well.

Using the log skeleton shown in Fig. 16, we can deduce the following facts on the example event log:


#### **5.3 Handling Noise**

So far, we have assumed that the event log does not contain any noise. As a result, a constraint like $(a, b) \in R\_{L'}$ may be invalid because a single instance of a in the entire event log is not followed by a b. To be able to handle noisy logs, the log skeletons allow the user to set a percentage for which the constraint should hold. We recall here the definition of the Response constraint as provided earlier:

$$(a, b) \in R\_{L'} \Leftrightarrow \forall\_{\sigma \in L'} \forall\_{i \in \{1, \dots, |\sigma|\}} (\sigma\_i = a \Rightarrow \exists\_{j \in \{i+1, \dots, |\sigma|\}} \sigma\_j = b)$$

When dealing with noise, we are interested in the percentage of cases for which the left-hand side of the implication ($\sigma\_i = a$) holds, for which then also the right-hand side ($\exists\_{j \in \{i+1, \dots, |\sigma|\}} \sigma\_j = b$) holds. As such, we can divide the instances of the left-hand side into positive instances (for which the right-hand side holds) and negative instances (for which the right-hand side does not hold). If the user allows for a noise level of l (where 0 ≤ l ≤ 1), then the number of negative instances should be at most l times the number of total instances:

$$\left(\sum\_{\sigma \in L'} \left| \{ i \in \{ 1, \ldots, |\sigma| \} \mid \sigma\_i = a \land \nexists\_{j \in \{ i+1, \ldots, |\sigma| \}} \sigma\_j = b \} \right| \right) \le l \times \#\_{L'}^{act}(a)$$

This way of handling noise can also be used for the relations $P\_{L'}$, $\overline{R}\_{L'}$, $\overline{P}\_{L'}$, and $\overline{C}\_{L'}$, because these constraints are structured in a similar way. However, this way will not work for the equivalence relation $E\_{L'}$. To decide whether two different activities $a\_1$ and $a\_n$ (where n ≥ 2) are considered to be equivalent given a certain noise level l (where again 0 ≤ l ≤ 1), we use the following condition for equivalence:

$$\forall\_{i \in \{1, \ldots, n-1\}} \left( \left( \sum\_{\sigma \in L'} \big| |\sigma \restriction \{a\_i\}| - |\sigma \restriction \{a\_{i+1}\}| \big| \right) \le l \times |L'| \right)$$

That is, there is a series of activities $a\_1, a\_2, \ldots, a\_n$ such that for every subsequent pair $(a\_i, a\_{i+1})$ the *distance* between both activity counts over all traces should be at most l times the number of traces in the event log. Clearly, setting a noise level of l = 0 results in a condition that the activity counts should match perfectly, which is exactly what we want.

**Fig. 17.** The full log skeleton discovered from the event log *L*<sup>6</sup> allowing for 20% noise.

Figure 17 shows the log skeleton that results from event log L<sup>6</sup> when setting the noise level to 0.2. For example, this shows that 80% of the instances of activity c are never preceded by e, that 85% of the instances of e are never followed by c, and that 80% of the instances of activity d do not co-occur with f.
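The two noise conditions of this section can be sketched as follows, assuming that a log is a list of tuples of activity names and that the noise level is given as a fraction between 0 and 1. The function names are hypothetical. `counts_close` only tests a single pair of activities; chaining such pairs into equivalence classes, as the series $a\_1, \ldots, a\_n$ requires, is left out.

```python
from collections import Counter


def response_holds(log, a, b, noise=0.0):
    """Noise-tolerant 'a is always followed by b' check."""
    total = negative = 0
    for trace in log:
        for i, x in enumerate(trace):
            if x == a:
                total += 1
                if b not in trace[i + 1:]:
                    negative += 1  # this occurrence of a has no later b
    return negative <= noise * total


def counts_close(log, a, b, noise=0.0):
    """One link of the noise-tolerant equivalence condition."""
    distance = sum(abs(Counter(t)[a] - Counter(t)[b]) for t in log)
    return distance <= noise * len(log)


# Toy log: four traces contain both a and b, one trace misses b.
log = [("a", "b")] * 4 + [("a",)]
print(response_holds(log, "a", "b", noise=0.0))
print(response_holds(log, "a", "b", noise=0.2))
print(counts_close(log, "a", "b", noise=0.2))
```

With a noise level of 0, both functions reduce to the exact checks of Definition 8.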

#### **5.4 Strengths and Limitations**

Clearly, a log skeleton is not an imperative process model like a Petri net or a BPMN diagram. Instead, it is a declarative process model like Declare [45]. Some of the relations in the log skeletons exist in Declare as well, such as $R\_{L'}$ (Response) and $P\_{L'}$ (Precedence). But Declare contains many relations that are unknown in a log skeleton, while the Equivalence relation $E\_{L'}$ does not have a counterpart in Declare. As a result, a log skeleton can be considered as a Declare model restricted to only some relations but with an additional equivalence relation.

Of course, limitations also exist for log skeletons. Known process constructs that are hard for log skeletons are loops and duplicate tasks. Furthermore, noise in an event log may be a problem, as a single misplaced activity may prevent discovery of some relations. To alleviate the problems with these constructs and noise, the visualizer plugin allows the user to specify *boundary activities* (to tackle loops), to *split activities over activities* (to tackle duplicates), and to set various *noise levels* (to tackle noise). Although our experience with the noise levels is very positive, our experience with the boundary activities and splitting of activities shows that they can solve only some of the problems related to the hard process constructs. As a result, more research is needed in this direction.

#### **6 Related Work**

Discovering accurate and simple process models is extremely important to reduce the time spent to enhance them and avoid mistakes during process analysis [28].

While extensive research effort was spent in designing the perfect automated process discovery algorithm, in parallel, researchers have investigated the problem of improving the quality of the input data, proposing techniques for data filtering and data repairing [19,21,32,50–52,57,59,69,70,78]; as well as the problem of predicting what would be the process discovery algorithm yielding the best process model from a given event log [47–49]. A few research studies also explored divide-and-conquer strategies, designing approaches to divide the input data into smaller chunks and separately feed each chunk to a discovery algorithm – in order to facilitate the discovery task. The set of process models discovered from the data chunks would then be reassembled into a unique process model. Among these techniques we find Genet [15,16], C-net miner [55], Stage Miner [43], BPMN Miner [20], and Decomposed Process Mining [77].

It is also worth mentioning techniques that have the ability to deal with negative examples [23,24,33], i.e., to accept also traces that are known not to be part of the underlying process. Of course, this information is not often available, unless domain knowledge can be used, or some automated techniques can be applied for generating it [71,72]. These techniques seem to be better positioned to also consider generalization when searching for the best process model.

Optimization metaheuristics have also been extensively applied in the context of automated process discovery, aiming to incrementally discover and refine the process model to reach a trade-off between accuracy and simplicity. The best known among these approaches are those based on evolutionary (genetic) algorithms [11,25]. However, several other metaheuristics have been explored, such as the imperialist competitive algorithm [1], particle swarm optimization [18,29,44], and simulated annealing [31,58].

Nonetheless, the latest literature review and benchmark in automated process discovery [2] highlighted that many of the state-of-the-art automated process discovery algorithms [4,13,34,36,67,73,79] were affected by one (or more) of the following three limitations when discovering process models from real-life event logs: i) they achieve limited accuracy; ii) they are too computationally inefficient to be used in practice; iii) they discover syntactically incorrect process models. In practice, when the behavior of the process recorded in the input event log varies little, most of the state-of-the-art automated process discovery algorithms can output accurate and simple process models. However, as the behavioral complexity of the process increases, the quality of the discovered process models can worsen quickly. Given that real-life event logs are often highly complex (i.e., containing complex process behavior, noise, and incomplete data), discovering highly accurate and simple process models with traditional state-of-the-art algorithms can be challenging.

On the other hand, achieving in a robust and scalable manner the best trade-off between accuracy and simplicity, while ensuring behavioral correctness (i.e., process soundness), has proved elusive. In particular, it is possible to group automated process discovery algorithms in two categories: those focusing more on the *simplicity*, the *soundness* and either the *precision* [13] or the *fitness* [36] of the discovered process model, and those focusing more on its *fitness* and its *precision* at the cost of *simplicity* and/or *soundness* [4,73,79]. The first kind of algorithms strive for simplicity and soundness by enforcing block-structured behavior on the discovered process model. However, since real-life processes are not always block-structured, a direct consequence of doing that is an approximation of the behavior, which leads to a loss of accuracy (either fitness or precision). The second kind of algorithms do not adopt any strategy to deal with process simplicity and soundness, focusing only on capturing the process behavior in a process model, but in doing so they can produce unsound process models.

Alongside techniques that discover imperative process models, it is important to mention that there exist many discovery algorithms that produce *declarative models* [10,27,39,40,53,74]. Declarative models capture the processes' behavior through a set of rules, also known as *declarative* constraints. Even though each declarative constraint is precise, capturing the whole process behavior in a declarative model can be very difficult, especially because declarative models do not give any information about "undeclared" behavior, i.e., any behavior that does not break the declarative constraints is technically allowed behavior. Hence, imperative process models are usually preferred in practice.

#### **7 Challenges Ahead**

Process Mining started about 20 years ago with the development of control-flow miners like the Alpha Miner [64] and the Little Thumb Miner [80]. Although the field has advanced in these 20 years with many other control-flow miners, this does not mean that control-flow mining is already a done deal.

Consider, for example, the results of the latest Process Discovery Contest (PDC 2020) [14], which are shown in Fig. 18. The PDC 2020 was a contest for fully automated control-flow miners, and shows the then-current state of the field on these miners. In this contest, every miner was used to discover a control-flow model from a training event log, after which this model was used to classify every trace from a test event log. As the ground truth for this classification is known, we can compute both the average positive accuracy and the average negative accuracy for all of the algorithms on this data set. The results show that there is still some ground to cover for the imperative miners, as none of these miners was able to achieve both an average positive accuracy and an average negative accuracy exceeding 80.0%.

Table 4 shows the weaknesses of several algorithms submitted to the PDC 2020 contest. As an example, the weaknesses of the Inductive IMfa Miner included loops: It scored 59.2%<sup>4</sup> on the event logs in the PDC 2020 data set that do not contain loops, and only 19.3% on the event logs that do contain loops. This table indicates that noise and loops but also optional tasks and duplicate tasks can be considered as challenges for control-flow miners in the near future.
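The score used in this comparison (see footnote 4) is the harmonic mean of the positive and the negative accuracy. The sketch below illustrates it under an assumption that the text does not spell out: positive accuracy is taken to be the fraction of positive test traces that the model accepts, and negative accuracy the fraction of negative test traces that it rejects. The function names are hypothetical.

```python
def accuracies(truth, predicted):
    """Return (positive accuracy, negative accuracy) for one test log.

    truth and predicted are equally long lists of booleans, where True
    means that the trace fits the process (truth) or the model (predicted).
    Assumes the test log holds at least one positive and one negative trace.
    """
    pos = [p for t, p in zip(truth, predicted) if t]
    neg = [p for t, p in zip(truth, predicted) if not t]
    positive_accuracy = sum(1 for p in pos if p) / len(pos)
    negative_accuracy = sum(1 for p in neg if not p) / len(neg)
    return positive_accuracy, negative_accuracy


def score(p, n):
    """Harmonic mean 2*P*N / (P + N); defined as 0 when both accuracies are 0."""
    return 0.0 if p + n == 0 else 2 * p * n / (p + n)


# Toy classification of four test traces (not PDC 2020 data).
p, n = accuracies([True, True, False, False], [True, False, False, False])
print(p, n, round(score(p, n), 3))
```

The harmonic mean rewards miners that do well on both accuracies: a model that accepts every trace has a negative accuracy of 0 and therefore a score of 0.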

<sup>4</sup> This score is computed as the average over $\frac{2 \cdot P\_L \cdot N\_L}{P\_L + N\_L}$, where $P\_L$ is the positive accuracy and $N\_L$ is the negative accuracy for (1) the model discovered from a training log $L$ and (2) the corresponding test log.

**Fig. 18.** The results of the PDC 2020. The squares correspond to base miners, the circles to imperative miners (that result in an imperative model, like a Petri net or a BPMN diagram), and the triangles to declarative miners (that result in a declarative model, like a DCR graph or a log skeleton). The percentage mentioned with a miner is the score (see footnote 4) of that miner.

**Table 4.** Weaknesses of miners submitted to PDC 2020 and their scores on the event logs that do not contain the weakness (**No**) or that do contain it (**Yes**). Only weaknesses where the **No** and **Yes** scores differ by at least 10.0% are listed.


In these 20 years, algorithms have been developed that discover perspectives other than the control-flow perspective. However, many of these other perspectives are added on top of the discovered control-flow model, and hence depend on the discovery of a control-flow model of high-enough quality. Nevertheless, even assuming that the quality of the control-flow model is indeed high enough, challenges remain for these other perspectives as well.

As a first example, consider the data perspective, which would add expressions (guards) to the control-flow model that would guide the execution of the model: Certain parts of the control-flow model may be only valid if a certain guard evaluates to true. Challenges here include the discovery of sensible guards with sensible values. As an example, if based on some value the control-flow 'goes either left or right', then the data in the event log may not contain this precise value. As a result, this value needs to be guessed based on the data that is in the event log.

A second example is the organizational perspective, which would add organizational entities (like users, groups, or roles) to certain parts of the control-flow model: Only resources (like users and automated services) that qualify for these entities can be involved in these parts. Challenges here include the discovery of the correct organizational entities at the correct level. As an example, if some activity was performed by some user according to the event log, then what is the correct organizational level (like user, role, group) for this activity?

#### **8 Conclusion**

In this chapter, we have introduced four advanced process discovery techniques: the state-based region miner; the language-based region miner; the split miner; the log skeleton miner. Each of the four techniques aims to alleviate shortcomings of the more basic process discovery techniques as introduced in the previous chapter.

First, the region-based miners can lift the shortcoming of having to assume that activities only occur once in the model. When using regions, different contexts of an activity can be found, and the activity can then be divided over these contexts, leading to a model with an activity for every different context. This is a feature that is not shared by any of the other miners, and this feature can be very important in case we have an event log of a system where these "duplicate activities" occur. Where other miners need to assume there is only one activity, which may lead to discovered models that are incomprehensible, these region-based miners do not need to make this assumption, which may result in more precise models.

Second, the split miner aims to discover process models that simultaneously maximize and balance fitness and precision, while at the same time minimizing the control-flow complexity of the resulting model. This approach brings precision and complexity into the equation, something that previously could be done only by using genetic miners like the evolutionary tree miner [12]. However, unlike genetic miners, split miner typically takes seconds to discover a process model from the event log, as opposed to the hour-long execution times required by genetic miners [2].

Third, the log skeleton miner is not limited to using only the directly-follows relations, which are heavily leveraged by many existing discovery algorithms. This miner discovers a declarative model from the event log that contains facts like "95% of the instances of activity a are always followed by activity b", or "90% of the instances of activity a do not co-occur with an instance of activity b". As such, it can discover relations between activities that cannot be discovered if only considering the directly-follows relations.

It is clear that each of these advanced techniques can be used effectively on certain event logs, and may produce better models than those produced by basic techniques. However, ultimately, there is no technique yet that is effective on all (or even almost all) event logs regardless of the process behavior features. Such an ideal process discovery technique should be able to maximize accuracy and simplicity of the discovered process model while at the same time guaranteeing its soundness. While, hitherto, the design of such a technique has proved to be challenging and elusive, it has become clear that each process discovery technique can be useful on some event logs. Hence, while we hope that future research endeavors will lead to the ideal process discovery technique, until it materializes, we have to rely on educated choices based on the process data at hand (i.e., in the form of an event log), and select the most appropriate technique for discovering the best process model.

**Acknowledgements.** This work has been supported by MCIN/AEI funds under grant PID2020-112581GB-C21.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Declarative Process Specifications: Reasoning, Discovery, Monitoring

Claudio Di Ciccio<sup>1</sup> and Marco Montali<sup>2</sup>

<sup>1</sup> Sapienza University of Rome, Rome, Italy claudio.diciccio@uniroma1.it <sup>2</sup> Free University of Bozen-Bolzano, Bolzano, Italy montali@inf.unibz.it

Abstract. The declarative specification of business processes is based upon the elicitation of behavioural rules that constrain the legal executions of the process. The carry-out of the process is up to the actors, who can vary the execution dynamics as long as they do not violate the constraints imposed by the declarative model. The constraints specify the conditions that require, permit or forbid the execution of activities, possibly depending on the occurrence (or absence) of other ones. In this chapter, we review the main techniques for process mining using declarative process specifications, which we call *declarative process mining*. In particular, we focus on three fundamental tasks of (1) reasoning on declarative process specifications, which is in turn instrumental to their (2) discovery from event logs and their (3) monitoring against running process executions to promptly detect violations. We ground our review on Declare, one of the most widely studied declarative process specification languages. Thanks to the fact that Declare can be formalized using temporal logics over finite traces, we exploit the automata-theoretic characterization of such logics as the core, unified algorithmic basis to tackle reasoning, discovery, and monitoring. We conclude the chapter with a discussion on recent advancements in declarative process mining, considering in particular multi-perspective extensions of the original approach.

## 1 Introduction

Finding a suitable balance between flexibility and control is a long-standing problem in the management of work processes [83]. Among the different approaches striving to achieve this balance, *flexibility by design* suggests infusing flexibility in the process modeling language at hand. Declarative process modeling languages take this to the extreme: they support the specification of *what* are the relevant constraints on the temporal evolution of the process, without explicitly indicating *how* process instances should be routed to satisfy such constraints. In comparison with imperative approaches that produce "closed" representations (i.e., only those process executions explicitly foreseen in the model are allowed), declarative approaches yield "open" representations (i.e., every process execution is implicitly allowed, as long as it does not violate any constraint).

© The Author(s) 2022

W. M. P. van der Aalst and J. Carmona (Eds.): Process Mining Handbook, LNBIP 448, pp. 108–152, 2022. https://doi.org/10.1007/978-3-031-08848-3\_4

Fig. 1. Intuitive representation of the difference between imperative process models and declarative process specifications in the space of all execution traces. Diagram (a) represents a real process, which isolates the *allowed* (green, solid fill) behaviors from the *forbidden* (red, dotted fill) ones. Diagram (b) shows an *imperative process model* that stays within the boundaries of the process, but misses many allowed behaviors. Diagram (c) shows a *declarative process specification* that well approximates the boundaries of the process: it accepts only traces that are allowed by the process, and includes all the traces accepted by the imperative model in (b). (Color figure online)

Figure 1 depicts an intuitive representation of the difference between classical imperative process models and declarative process specifications, considering execution traces that are forbidden by the real process, allowed by the real process, and captured by the designed process specification. Imperative models (such as those based on Petri nets and related formalisms) are suited to explicitly capture control-flow patterns like sequences, choices, concurrent sections, and loops. Those patterns, in turn, lend themselves to characterize a subset of the allowed traces, but struggle in covering the whole space of execution paths in the case of loosely structured, flexible processes. In other words, they favor control over flexibility. Contrariwise, declarative specifications strive to balance flexibility and control by attempting to characterize constraints that well-separate the allowed behaviors from the forbidden ones. In other words, declarative process specifications allow us to capture not only *what is expected* to occur, but also *what should not happen*. This helps in better approximating the boundaries of the real process, containing (and extending) those captured via imperative process models.

The idea of adopting a constraint-based, declarative approach to regulate dynamic systems has been originally brought forward in different communities: in data management, to express cascaded transactional updates [26]; in multiagent systems, to regulate agent interaction protocols [88]; and in business process management, to capture subprocesses that foresee loosely-coupled control-flow conditions on their activities [85]. This idea was further developed within BPM in subsequent years, leading to a series of declarative, constraint-based process modeling languages, with two prominent exponents: Declare [76] and Dynamic Condition-Response Graphs [49]. Common to all such approaches is the usage of *linear temporal/dynamic logics* (i.e., temporal/dynamic logics for sequences of events) to formally describe specifications, and the exploitation of corresponding reasoning mechanisms to tackle a variety of concrete tasks along the entire process lifecycle, from design and model analysis to runtime execution and data analysis.

In this chapter, we focus on *declarative process mining*, that is, process mining where the input or output models are specified using declarative, constraint-based languages. Concretely, we employ the Declare language, but all the presented ideas seamlessly apply to any language that can be formalized using logics over *finite* traces [30], which are indeed at the core of Declare. Focusing on finite traces reflects the intuition that every process instance is expected to complete in a finite number of steps. This aspect has a significant impact on the corresponding operational techniques, as these logics admit an automata-theoretic characterization that is based on standard *finite-state automata* [27,30], instead of automata on infinite structures, which are needed when such logics are interpreted over infinite traces.

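Leveraging automata-based techniques paired with suitable measures relating traces, events and constraints, we review three interconnected fundamental declarative process mining tasks:

- reasoning on declarative process specifications;
- discovery of declarative process specifications from event logs;
- monitoring of declarative process specifications against running process executions, to promptly detect violations.
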
All the presented techniques are integrated in the MINERful process discovery technique<sup>1</sup> [40] and the RuM toolkit<sup>2</sup> [4].

The chapter is organized as follows. Section 2 introduces the declarative process specification language Declare alongside a running example to which we will refer throughout the remainder of the chapter. Section 3 provides the fundamental notions upon which the core techniques for reasoning, discovery and monitoring on declarative specifications are based. We define the formal semantics of Declare and discuss the core reasoning tasks for declarative specifications in Sect. 4. Section 5 explains the core notions of declarative process discovery and monitoring. Section 6 discusses the latest advances in the field of declarative process specification mining. Finally, Sect. 7 concludes this chapter with final remarks and a summary of the core concepts illustrated herein.

<sup>1</sup> https://github.com/cdc08x/MINERful.

<sup>2</sup> https://rulemining.org.


Table 1. A set of Declare constraints among those that are typically used for process mining, with their textual description, graphical notation, and examples fulfilling or violating them.

## 2 DECLARE: A Gentle Introduction

Declare is a language and graphical notation providing an extendible repertoire of templates to formulate constraints. The origin of the approach traces back to the PhD work by Pesic [75], and the parallel and subsequent study in the PhD work by Montali [67]. Notably, Declare actually stems from three initial lines of research, respectively focused on the declarative specification of business processes (cf. the ConDec language [78]), service choreographies (cf. the DecSerFlow language [70,94]), and clinical guidelines (cf. the CigDec language [72]). These lines were then unified into a single research thread. The term Declare was used for the first time in [76].

Table 1 shows a set of Declare constraints we use throughout this chapter. The whole, core set of Declare templates has been inspired by a catalogue of temporal logic patterns used in model checking for a variety of dynamic systems from different application domains [41].

Formally, we define a declarative process specification as follows.

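Definition 1 (Declarative process specification). *A* declarative process specification *is a tuple* DS = (Rep, Act, K) *where*

- Rep *is the repertoire of templates, each of which is a predicate* $k(x_1, \ldots, x_m)$ *over variables* $x_1, \ldots, x_m$*;*
- Act *is the set of activities;*
- K *is the set of constraints, that is, templates from* Rep *whose variables are assigned with activities in* Act*.*
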
Example 1 (A Declare process specification). Figure 2 portrays an example of declarative specification for the admission process of an international Bachelor's program. This example considers the Declare repertoire of templates. The process begins with the creation of an account in the university portal (henceforth, c). To specify that c is the initial task, we write Init(c), graphically depicted with the Init label in the tag on top of the activity box. Init is a unary template and Init(c) assigns its variable with activity c. Unary templates in Declare are also known as *existence* templates. We indicate that not more than one account can be created per process run with AtMostOne(c). In the diagram, it is indicated with the 0..1 label in the tag.

To register for a selection round (r), an account must have been created before (Precedence(c, r)). Precedence is a binary template and Precedence(c, r), graphically depicted in Fig. 2 as a connector between c and r, assigns c and r to its first and second variable, respectively. Binary templates in Declare are commonly named *relation* templates.

Every registration to a selection round (r) gives access to a uniquely corresponding evaluation phase (v). After r, v eventually follows and no other registrations are allowed until v completes. We write AlternateResponse(r, v), graphically depicted in Fig. 2 as a connector between r and v. The evaluation requires r to be completed before and v will not recur unless a new registration is issued: AlternatePrecedence(r, v). Typically, if both AlternateResponse(r, v) and AlternatePrecedence(r, v) hold true, we compactly represent them jointly with the *mutual* relation constraint AlternateSuccession(r, v). An admission test score has to be uploaded in the platform to access the evaluation phase: Precedence(t, v). Evaluation phases are necessary for the committee to return rejections (n) and notifications of admission (y), thus AlternatePrecedence(v, y) and AlternatePrecedence(v, n) hold.

Fig. 2. The Declare map of the admission process at a university.

After the admission has been notified, the candidate will not receive a rejection any longer: NotResponse(y, n), drawn in Fig. 2 as a connector between y and n. NotResponse(y, n) falls under the category of the *negative* relation constraints, as the occurrence of y *disables* n in the remainder of the process execution.

Only if candidates receive a notification of admission, they will be entitled to pre-enrol in the program (Precedence(y, p)). The candidates are considered as pre-enrolled immediately after they pay the subscription fee (ChainResponse(\$, p), shown in the diagram as a connector between \$ and p). Also, candidates cannot be considered as pre-enrolled if they have not paid the subscription fee: Precedence(\$, p). Not more than one pre-enrolment is allowed per candidate: AtMostOne(p). To enrol in the program (e), the candidate must have pre-enrolled, Precedence(p, e), and uploaded the necessary school and language certificates, Precedence(u, e).
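To make the running example easier to manipulate, the constraints named above can be collected into a structure mirroring the tuple DS = (Rep, Act, K). The following is only a minimal sketch: the `Constraint` class, the use of plain strings for activities, and the choice of deriving Rep and Act from the constraints mentioned in the text are assumptions of this illustration, not part of Declare or of any tool.

```python
from typing import NamedTuple, Tuple


class Constraint(NamedTuple):
    """A template name together with the activities assigned to its variables."""
    template: str
    activities: Tuple[str, ...]


# Constraints of the admission process, in the order they appear in Example 1.
K = [
    Constraint("Init", ("c",)),
    Constraint("AtMostOne", ("c",)),
    Constraint("Precedence", ("c", "r")),
    Constraint("AlternateResponse", ("r", "v")),
    Constraint("AlternatePrecedence", ("r", "v")),
    Constraint("Precedence", ("t", "v")),
    Constraint("AlternatePrecedence", ("v", "y")),
    Constraint("AlternatePrecedence", ("v", "n")),
    Constraint("NotResponse", ("y", "n")),
    Constraint("Precedence", ("y", "p")),
    Constraint("ChainResponse", ("$", "p")),
    Constraint("Precedence", ("$", "p")),
    Constraint("AtMostOne", ("p",)),
    Constraint("Precedence", ("p", "e")),
    Constraint("Precedence", ("u", "e")),
]

# Assumption: Rep and Act contain only what the constraints above mention.
Rep = sorted({k.template for k in K})
Act = sorted({a for k in K for a in k.activities})
DS = (Rep, Act, K)

# List all constraints that involve the pre-enrolment activity p.
for k in K:
    if "p" in k.activities:
        print(k.template, k.activities)
```

Keeping the template name separate from the assigned activities reflects the distinction between templates (predicates over variables) and constraints (templates whose variables are assigned with activities).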

So far, we have been attaching an informal semantics to Declare and its templates. In the next section, we provide a more systematic and formal characterization.

## 3 Formal Background

Considering that Declare templates have been originally defined starting from a catalogue of Linear Temporal Logic (LTL) patterns [41], it is not surprising that temporal logics have been used to characterize the semantics of Declare since the very beginning. However, the fact that Declare specifications are interpreted over finite-length executions calls for the use of Linear Temporal Logic on Finite Traces (LTL<sup>f</sup> ) [30]. This indeed leads to a setting that is radically different, both semantically and algorithmically, from the traditional one where formulae are interpreted using LTL over infinite, recurring behaviors [29].

A complete formalization of Declare templates, also including an alternative formalization using a logic programming-based approach, can be found in [68]. It was later refined in [29]. In his PhD thesis, Di Ciccio was the first to provide a semantics based on regular expressions [36]. These two themes were later unified in [28], leading to a richer framework that is able to declaratively capture constraints and metaconstraints, that is, constraints predicating over the possible/certain satisfaction and violation of other constraints.

In this section, we provide some necessary background on LTL<sup>f</sup> and its extension with past-tense temporal operators, as well as on the automata-theoretic characterization for this logic. We then use this framework to formalize Declare and reason automatically on Declare specifications. Thereupon, we reflect upon the most recent advances of research in attempting to capture not only the formal semantics of constraints, but also how they pragmatically interact with relevant events.

#### 3.1 Linear Temporal Logic on Finite Traces

LTL<sup>f</sup> has the same syntax as LTL [80], but is interpreted on finite traces. In this chapter, in particular, we consider the LTL dialect including past modalities [56] for declarative process specifications as in [18].

From now on, we fix a finite set $\Sigma$ representing an alphabet of propositional symbols describing (names of) activities available in the domain under study. A (finite) *trace* $t = \langle a_1, \ldots, a_n \rangle \in \Sigma^*$ of length $|t| = n$ is a finite sequence of activities, where the presence of activity $a_i$ at instant $i$ of the trace represents an *event* that witnesses the occurrence of $a_i$ at instant $i$, which we also write $t(i) = a_i$. Notice that *at each instant we assume that one and only one activity occurs*. Using standard notation from regular expressions, the set $\Sigma^*$ denotes the overall set of traces whose constitutive events refer to activities in $\Sigma$.

Definition 2 (Syntax of LTL<sup>f</sup>). *Well-formed formulae are built from* $\Sigma$*, the unary temporal operators* $\bigcirc$ *(next) and* $\ominus$ *(yesterday), and the binary temporal operators* $\mathbf{U}$ *(until) and* $\mathbf{S}$ *(since) as follows:*

$$\varphi ::= a \mid (\neg \varphi) \mid (\varphi_1 \land \varphi_2) \mid (\bigcirc \varphi) \mid (\varphi_1 \,\mathbf{U}\, \varphi_2) \mid (\ominus \varphi) \mid (\varphi_1 \,\mathbf{S}\, \varphi_2)$$

*where* $a \in \Sigma$*.*

Definition 3 (Semantics of LTL<sup>f</sup>, satisfaction, validity, entailment). *An* LTL<sup>f</sup> *formula* $\varphi$ *is inductively* satisfied *in some instant* $i$ *(*$1 \le i \le n$*) of a trace* $t$ *of length* $n \in \mathbb{N}$*, written* $t, i \models \varphi$*, if the following holds:*


*A formula* $\varphi$ is satisfied by *a trace* $t$ *(equivalently,* $t$ satisfies $\varphi$*), written* $t \models \varphi$*, iff* $t, 1 \models \varphi$*. A formula* $\varphi$ *is: (i)* satisfiable *if it has a satisfying trace from* $\Sigma^*$*; (ii)* valid *if every trace in* $\Sigma^*$ *satisfies it. A formula* $\varphi_1$ entails *formula* $\varphi_2$*, written* $\varphi_1 \models \varphi_2$*, if, for every trace* $t$ *of length* $n \in \mathbb{N}$ *and every* $i$ *s.t.* $1 \le i \le n$*, if* $t, i \models \varphi_1$ *then* $t, i \models \varphi_2$*.*
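The inductive notion of satisfaction lends itself to a direct recursive evaluation. The sketch below assumes the standard semantics of linear temporal logic with past over finite traces (strong next and yesterday, until and since ranging over the instants of the trace); the encoding of formulae as nested tuples and the function names are choices made for this illustration only.

```python
def holds(phi, t, i):
    """Return True iff formula phi is satisfied at instant i (1-based) of trace t."""
    n = len(t)
    op = phi[0]
    if op == "atom":          # t, i |= a  iff  t(i) = a
        return t[i - 1] == phi[1]
    if op == "not":
        return not holds(phi[1], t, i)
    if op == "and":
        return holds(phi[1], t, i) and holds(phi[2], t, i)
    if op == "next":          # strong next: instant i + 1 must exist
        return i < n and holds(phi[1], t, i + 1)
    if op == "yesterday":     # instant i - 1 must exist
        return i > 1 and holds(phi[1], t, i - 1)
    if op == "until":         # phi2 at some j >= i, phi1 at every instant in i..j-1
        return any(
            holds(phi[2], t, j) and all(holds(phi[1], t, k) for k in range(i, j))
            for j in range(i, n + 1)
        )
    if op == "since":         # phi2 at some j <= i, phi1 at every instant in j+1..i
        return any(
            holds(phi[2], t, j) and all(holds(phi[1], t, k) for k in range(j + 1, i + 1))
            for j in range(1, i + 1)
        )
    raise ValueError(f"unknown operator: {op}")


def satisfies(t, phi):
    """t |= phi iff t, 1 |= phi (t is assumed to be non-empty)."""
    return holds(phi, t, 1)


a_until_b = ("until", ("atom", "a"), ("atom", "b"))
print(satisfies(["a", "a", "b"], a_until_b))   # b is reached while a keeps holding
print(satisfies(["a", "c", "b"], a_until_b))   # c interrupts a before b is reached
```

The evaluation is exponential in the nesting depth of temporal operators in the worst case; it is meant to mirror the definition, not to be efficient.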

Since LTL<sup>f</sup> is closed under negation, it is easy to see that a formula $\varphi$ is valid if and only if $\neg\varphi$ is unsatisfiable.

It is worth noting that, in LTL<sup>f</sup>, the next operator is interpreted as the so-called *strong* next: $\bigcirc\varphi$ requires that the next instant exists within the trace, and that at such next instant $\varphi$ holds. This has an important consequence: differently from LTL, in LTL<sup>f</sup> formula $\neg\bigcirc\varphi$ is *not* equivalent to $\bigcirc\neg\varphi$. This is because $\neg\bigcirc\varphi$ is true in an instant of a finite trace either when that instant has no successor, or when the next instant exists and $\varphi$ does not hold in it, whereas $\bigcirc\neg\varphi$ is false whenever the instant has no successor. More on this can be found in [29].
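The difference between the two formulae shows up only at the last instant of a trace. The following sketch illustrates it with formulae encoded as Python predicates over a trace and a 1-based instant; the helper names `atom`, `not_` and `next_` are hypothetical and introduced here for illustration.

```python
def atom(a):
    """The activity a occurs at instant i."""
    return lambda t, i: t[i - 1] == a


def not_(phi):
    return lambda t, i: not phi(t, i)


def next_(phi):
    """Strong next: the next instant must exist and phi must hold there."""
    return lambda t, i: i < len(t) and phi(t, i + 1)


t = ["a", "b"]
not_next_b = not_(next_(atom("b")))   # negation of "next b"
next_not_b = next_(not_(atom("b")))   # "next (not b)"

# At the last instant (i = 2) there is no successor:
# the first formula holds, the second does not.
print(not_next_b(t, 2), next_not_b(t, 2))   # True False

# At instant 1 the successor exists and the two formulae agree.
print(not_next_b(t, 1), next_not_b(t, 1))   # False False
```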

From the basic operators above, the following can be derived:
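As a sketch following the usual conventions for linear temporal logics with past, the derived operators can be written as below. The symbols $\Diamond^{-}$ and $\square^{-}$ for the past-tense counterparts of $\Diamond$ and $\square$ are notational assumptions made here, and $a$ stands for an arbitrary activity in $\Sigma$.

$$
\begin{aligned}
\mathit{true} &\equiv \neg(a \land \neg a) & \mathit{false} &\equiv \neg\mathit{true} \\
\varphi_1 \lor \varphi_2 &\equiv \neg(\neg\varphi_1 \land \neg\varphi_2) & \varphi_1 \to \varphi_2 &\equiv \neg\varphi_1 \lor \varphi_2 \\
\Diamond\varphi &\equiv \mathit{true} \,\mathbf{U}\, \varphi & \square\varphi &\equiv \neg\Diamond\neg\varphi \\
\Diamond^{-}\varphi &\equiv \mathit{true} \,\mathbf{S}\, \varphi & \square^{-}\varphi &\equiv \neg\Diamond^{-}\neg\varphi \\
\mathbf{start} &\equiv \neg\ominus\mathit{true} & \mathbf{end} &\equiv \neg\bigcirc\mathit{true}
\end{aligned}
$$

Under these definitions, $\Diamond\varphi$ (eventually) holds if $\varphi$ holds at the current or at some later instant, $\square\varphi$ (always) if $\varphi$ holds from the current instant until the end of the trace, and $\mathbf{start}$ and $\mathbf{end}$ identify the first and last instant of the trace, respectively.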


Example 2. Let $t = \langle a, b, b, c, d, e \rangle$ be a trace and $\varphi_1$, $\varphi_2$ and $\varphi_3$ three LTL<sup>f</sup> formulae defined as follows: $\varphi_1 \doteq d$; $\varphi_2 \doteq \Diamond b$; $\varphi_3 \doteq \square(b \to \Diamond d)$. We have that $t, 1 \not\models \varphi_1$ whereas $t, 5 \models \varphi_1$, as $d$ occurs at instant 5 only; $t, 1 \models \varphi_2$ whereas $t, 5 \not\models \varphi_2$, as $b$ occurs at instants 2 and 3 but not from instant 5 onwards; $t, 1 \models \varphi_3$ and $t, 5 \models \varphi_3$ (in fact, $t, i \models \varphi_3$ for any instant $1 \le i \le n$).

#### 3.2 Finite-State Automata

One of the central features of LTL<sup>f</sup> is that a finite state automaton (FSA) [22] $\mathcal{A}(\varphi)$ can be computed such that for every trace $t$ we have that $t \models \varphi$ iff $t$ is in the language recognized by $\mathcal{A}(\varphi)$, as illustrated in [18,28,30,38]. We include the main notions next, recalling that focusing on deterministic FSAs is without loss of generality, as over finite traces every non-deterministic FSA can be determinized [50].

Fig. 3. Examples of constraint FSAs.

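Definition 4 (Finite state automaton (FSA)). *A (deterministic) finite state automaton (FSA) is a tuple* $A = (\Sigma, S, \delta, s_0, S_F)$*, where:*

- $\Sigma$ *is a finite alphabet of symbols;*
- $S$ *is a finite non-empty set of states;*
- $\delta : S \times \Sigma \to S$ *is the (possibly partial) transition function;*
- $s_0 \in S$ *is the initial state;*
- $S_F \subseteq S$ *is the set of final (accepting) states.*
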

In the remainder of the chapter, we assume that $\delta$ is left-total and surjective on $S \setminus \{s_0\}$, that is, the transition function is defined for every state and symbol, and every state is on a path from the initial one, with the possible exception of the initial state itself. An FSA that is left-total is called *untrimmed*. Notice that these two requirements are without loss of generality: every FSA can be converted into an equivalent FSA that is left-total and surjective. In particular, to make an FSA untrimmed, it is sufficient to: *(i)* introduce a non-final trap state $s_\bot$; *(ii)* for every state $s$ and symbol $a$ such that $\delta(s, a)$ is *not* defined, enforce $\delta(s, a) = s_\bot$; *(iii)* connect $s_\bot$ to itself for every symbol, setting $\delta(s_\bot, a) = s_\bot$ for every $a \in \Sigma$.
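Steps *(i)* to *(iii)* can be sketched as follows, assuming that the transition function is stored as a dictionary from (state, symbol) pairs to states. The function name, the default name of the trap state, and the example automaton are hypothetical. The trap state is added only if some transition is actually missing, so that no unreachable state is introduced.

```python
def make_left_total(alphabet, states, delta, trap="s_bot"):
    """Return (states, delta) with delta defined for every state and symbol."""
    if trap in states:
        raise ValueError("the name chosen for the trap state is already in use")
    total = dict(delta)
    missing = [(s, a) for s in states for a in alphabet if (s, a) not in delta]
    if not missing:
        return set(states), total        # already left-total: nothing to add
    for s, a in missing:                 # step (ii): undefined moves lead to the trap
        total[(s, a)] = trap
    for a in alphabet:                   # step (iii): the trap loops on every symbol
        total[(trap, a)] = trap
    # Step (i): the trap is a new state; it is non-final because the set of
    # final states is left untouched by this function.
    return set(states) | {trap}, total


alphabet = {"a", "b"}
states = {"s0", "s1"}
delta = {("s0", "a"): "s1", ("s1", "b"): "s0"}

new_states, new_delta = make_left_total(alphabet, states, delta)
print(sorted(new_states))
print(new_delta[("s0", "b")], new_delta[("s_bot", "a")])
```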

Example 3. Figure 3 depicts four FSAs. States are represented as circles and transitions as arrows. Accepting states are decorated with a double line. The initial state is indicated with a single, unlabeled incoming arc. For instance, Fig. 3(a) is such that $\Sigma \supseteq \{\sigma_1, \sigma_2\}$, $S = \{s_0, s_1, s_2\}$, $S_F = \{s_0\}$, $\delta(s_0, \sigma_1) = s_1$ and $\delta(s_1, \sigma_1) = s_2$.

Definition 5 (Runs and traces of an FSA). *Let* $A = (\Sigma, S, \delta, s_0, S_F)$ *be an FSA as per Definition 4. A* computation $\pi$ *of* $A$ *is a finite sequence alternating states and activities* $s_0 \xrightarrow{\sigma_0} \cdots \xrightarrow{\sigma_{n-1}} s_n$ *that starts from the initial state* $s_0$ *and is such that for every* $0 \le i < n$*, we have* $\delta(s_i, \sigma_i) = s_{i+1}$*. If* $\pi$ *terminates in a final state, that is,* $s_n \in S_F$*, then it is a* run*, and* induces *a corresponding trace* $\langle \sigma_0, \ldots, \sigma_{n-1} \rangle$ *in* $\Sigma^*$ *obtained from* $\pi$ *by only keeping the symbols that label the transitions.*

Example 4. In Fig. 3(a), $\pi_1 = s_0 \xrightarrow{\sigma_1} s_1$, $\pi_2 = s_0 \xrightarrow{\sigma_2} s_0 \xrightarrow{\sigma_1} s_1 \xrightarrow{\sigma_1} s_2$, and $\pi_3 = s_0 \xrightarrow{\sigma_1} s_1 \xrightarrow{\sigma_2} s_2 \xrightarrow{\sigma_1} s_0$ are three examples of computations. However, only $\pi_3$ is a run because $s_0 \in S_F$ whereas $s_1, s_2 \notin S_F$. Notice that, in Fig. 3, we additionally highlight with a grey background colour those states that cannot be in a step of a run, that is, from which accepting states cannot be reached (e.g., $s_2$ in Fig. 3(a)).

Definition 6 (Accepted trace, language of an FSA). *A trace* $t \in \Sigma^*$ *is* accepted *by FSA* $A = (\Sigma, S, \delta, s_0, S_F)$ *if there is a run of* $A$ *inducing* $t$*. The* language $\mathcal{L}(A)$ *of* $A$ *is the set of traces accepted by* $A$*.*

Example 5. For the FSA in Fig. 3(a), the language contains the trace $t_1 = \langle \sigma_1, \sigma_2, \sigma_1 \rangle$, since a run exists over this sequence of labels (i.e., $\pi_3$ above), whereas $t_2 = \langle \sigma_2, \sigma_1 \rangle$ is not part of the language.
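Checking whether a trace is accepted amounts to following the unique computation it induces on a deterministic FSA and testing whether the last state is final. The sketch below uses a small hypothetical automaton (not one of those in Fig. 3), with the transition function stored as a dictionary; all names are illustrative.

```python
def accepts(delta, s0, finals, trace):
    """Return True iff the computation of the FSA over `trace` is a run."""
    state = s0
    for symbol in trace:
        if (state, symbol) not in delta:   # undefined move: no computation exists
            return False
        state = delta[(state, symbol)]
    return state in finals                 # a run must end in a final state


# Hypothetical FSA over {a, b}: every a must be immediately followed by b.
delta = {("s0", "a"): "s1", ("s0", "b"): "s0", ("s1", "b"): "s0"}
s0, finals = "s0", {"s0"}

for trace in (["a", "b"], ["b", "a"], ["a", "a"], []):
    print(trace, accepts(delta, s0, finals, trace))
```

Here the trace `["b", "a"]` induces a computation that ends in the non-final state `s1`, while `["a", "a"]` induces no computation at all because the transition function is partial; in both cases the trace is not accepted.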

*Automata Product.* FSAs are closed under the (synchronous) product operation $\times$ [81]. The (cross-)product $A \times A'$ of two FSAs $A$ and $A'$ is an FSA that accepts the intersection of languages (sets of accepted traces) of each operand: $\mathcal{L}(A \times A') = \mathcal{L}(A) \cap \mathcal{L}(A')$. It is defined as follows.

Definition 7 (Automata product). *The* product FSA *of two FSAs* $A = (\Sigma, S, \delta, s_0, S_F)$ *and* $A' = (\Sigma, S', \delta', s'_0, S'_F)$ *over the same alphabet* $\Sigma$ *is the FSA* $A \times A' = (\Sigma, S^{\times}, \delta^{\times}, s^{\times}_0, S^{\times}_F)$*, where the set* $S^{\times} \subseteq S \times S'$ *of states (obtained from the cartesian product of the states in* $A$ *and* $A'$*), its initial state* $s^{\times}_0$*, its final states* $S^{\times}_F$*, and the transition function* $\delta^{\times}$*, are defined by simultaneous induction as follows:*


 Notice that the FSA constructed with Definition 7 can be manipulated using language-preserving automata operations, such as in particular *minimization* [50].

The product operation $\times$ is commutative and associative. The identity element for $\times$ over alphabet $\Sigma$ is $A^{I} = (\Sigma, \{s_0\}, \delta^{I}, s_0, \{s_0\})$, where $\delta^{I} = \{s_0\} \times \Sigma \times \{s_0\}$ (that is, $\delta^{I}(s_0, a) = s_0$ for every $a \in \Sigma$), depicted in Fig. 4(a). It accepts all traces over $\Sigma$: $\mathcal{L}(A^{I}) = \Sigma^*$, as any sequence of transitions labeled by symbols in $\Sigma$ corresponds to a run for $A^{I}$. The absorbing element is $A^{\emptyset} = (\Sigma, \{s_0\}, \delta^{I}, s_0, \emptyset)$ and is illustrated in Fig. 4(b). It does not accept any trace at all: $\mathcal{L}(A^{\emptyset}) = \emptyset$, as any sequence of transitions labeled by symbols in $\Sigma$ corresponds to a computation ending in a non-accepting state.
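A common way to realize the synchronous product is to explore, starting from the pair of initial states, only those pairs of states that are reachable by reading the same symbol in both operands. The sketch below follows this idea and checks the intersection property, as well as the behaviour of the identity and absorbing elements, on all traces up to a given length. The triple-based encoding of FSAs and the two operand automata are assumptions of this illustration.

```python
import itertools
from collections import deque


def product(alphabet, fsa1, fsa2):
    """Synchronous product of two FSAs given as (delta, initial, finals) triples."""
    (d1, i1, f1), (d2, i2, f2) = fsa1, fsa2
    start = (i1, i2)
    delta, finals, seen = {}, set(), {start}
    queue = deque([start])
    while queue:                               # explore reachable state pairs only
        s1, s2 = pair = queue.popleft()
        if s1 in f1 and s2 in f2:              # final iff both components are final
            finals.add(pair)
        for a in alphabet:
            if (s1, a) in d1 and (s2, a) in d2:
                target = (d1[(s1, a)], d2[(s2, a)])
                delta[(pair, a)] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return delta, start, finals


def accepts(fsa, trace):
    delta, state, finals = fsa
    for a in trace:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in finals


alphabet = ("a", "b")
# Hypothetical operands: "at most one a" and "at least one b".
at_most_one_a = ({("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "b"): "q1"},
                 "q0", {"q0", "q1"})
at_least_one_b = ({("p0", "a"): "p0", ("p0", "b"): "p1",
                   ("p1", "a"): "p1", ("p1", "b"): "p1"}, "p0", {"p1"})
loop = {("i", "a"): "i", ("i", "b"): "i"}
identity = (loop, "i", {"i"})      # accepts every trace
absorbing = (loop, "i", set())     # accepts no trace

both = product(alphabet, at_most_one_a, at_least_one_b)
with_identity = product(alphabet, at_most_one_a, identity)
with_absorbing = product(alphabet, at_most_one_a, absorbing)
for n in range(5):
    for t in itertools.product(alphabet, repeat=n):
        assert accepts(both, t) == (accepts(at_most_one_a, t) and accepts(at_least_one_b, t))
        assert accepts(with_identity, t) == accepts(at_most_one_a, t)
        assert not accepts(with_absorbing, t)
```

Exploring only reachable pairs keeps the resulting automaton within the subset of the cartesian product of the state sets that is actually needed.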

## 4 Reasoning

Equipped with the notions acquired thus far, we can now discuss the core reasoning tasks that are associated with declarative process specifications. To this end, we begin this section by describing the semantics of Declare in detail.

Fig. 4. Finite state automata acting as identity element and absorbing element for the automata cross-product operation.

#### 4.1 Semantics of DECLARE

The semantics of a Declare template $k(x_1, \ldots, x_m)$ is given as an LTL<sup>f</sup> formula $\varphi_{k(x_1, \ldots, x_m)}$ defined over variables $x_1, \ldots, x_m$ instead of activities. Given the free variables $x$ and $y$, e.g., Response(x, y) corresponds to $\square(x \to \Diamond y)$, witnessing that whenever $x$ occurs, then $y$ is expected to occur at some later instant. Table 2 shows the LTL<sup>f</sup> formulae of some templates of the Declare repertoire. The formalization of a constraint is then obtained by grounding the LTL<sup>f</sup> formula of its template.

Definition 8 (Constraint formula, satisfying trace). *The* formula *of constraint* $k(a_1, \ldots, a_m)$*, written* $\varphi_{k(a_1, \ldots, a_m)}$*, is the* LTL<sup>f</sup> *formula obtained from* $\varphi_{k(x_1, \ldots, x_m)}$ *by replacing* $x_i$ *with* $a_i$ *for each* $1 \le i \le m$*. A trace* $t$ satisfies $k(a_1, \ldots, a_m)$ *if* $t \models \varphi_{k(a_1, \ldots, a_m)}$*; otherwise, we say that* $t$ violates $k(a_1, \ldots, a_m)$*.*

Example 6. Considering Table 2, we have $\varphi_{\mathrm{Response}(a,b)} = \square(a \to \Diamond b)$, and $\varphi_{\mathrm{Response}(b,c)} = \square(b \to \Diamond c)$. Traces $\langle b \rangle$ and $\langle a, b, a, a, c, b \rangle$ satisfy Response(a, b), while $\langle a \rangle$ and $\langle a, b, a, a, c \rangle$ do not.
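The four traces of Example 6 can be checked with a few lines of code. The sketch below evaluates Response directly on a trace rather than through a generic formula evaluator; the function name is hypothetical, and the two activities are assumed to be distinct.

```python
def satisfies_response(trace, a, b):
    """Response(a, b): every occurrence of a is followed by b at a later instant.

    Assumes a != b, in line with the assumption that exactly one activity
    occurs at each instant.
    """
    return all(b in trace[i + 1:] for i, x in enumerate(trace) if x == a)


for trace in (["b"], ["a", "b", "a", "a", "c", "b"], ["a"], ["a", "b", "a", "a", "c"]):
    print(trace, satisfies_response(trace, "a", "b"))
```

The first two traces satisfy the constraint (the first one without ever containing a), while in the last two at least one occurrence of a is not followed by any b.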

A Declare specification is then formalized by conjoining all its constraint formulae, thus obtaining a direct, declarative notion of *model trace*, that is, a trace that is accepted by the specification.

Definition 9 (Specification formula, model trace). *The* formula *of* Declare *specification* DS = (Rep, Act, K)*, written* $\varphi_{\mathrm{DS}}$*, is the* LTL<sup>f</sup> *formula* $\bigwedge_{k \in K} \varphi_k$*. A trace* $t \in \mathrm{Act}^*$ *is a* model trace *of* DS *if* $t \models \varphi_{\mathrm{DS}}$*; in this case, we say that* $t$ *is* accepted by DS*, otherwise that* $t$ *is* rejected by DS*.*
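Since the specification formula is a conjunction, checking whether a trace is a model trace reduces to checking every constraint in turn. The sketch below does so for a fragment of the running example, with each template implemented as a plain Python predicate that follows its informal description (AtMostOne: the activity does not occur twice; Precedence: the second activity occurs only if the first occurred before; ChainResponse: the second activity occurs immediately after every occurrence of the first). The function names and the chosen fragment are assumptions of this illustration.

```python
def at_most_one(a):
    return lambda t: t.count(a) <= 1


def precedence(a, b):
    """b may occur only if a occurred at some earlier instant."""
    return lambda t: all(a in t[:i] for i, x in enumerate(t) if x == b)


def chain_response(a, b):
    """Every occurrence of a is immediately followed by b."""
    return lambda t: all(
        i + 1 < len(t) and t[i + 1] == b for i, x in enumerate(t) if x == a
    )


def is_model_trace(trace, constraints):
    """A trace is accepted iff it satisfies the conjunction of all constraints."""
    return all(check(trace) for check in constraints)


# A fragment of the admission process specification.
fragment = [
    at_most_one("c"),
    precedence("c", "r"),
    chain_response("$", "p"),
    precedence("$", "p"),
]

for trace in (["c", "r", "$", "p"], ["r", "c"], ["c", "$"]):
    print(trace, is_model_trace(trace, fragment))
```

The second trace is rejected because r occurs before any c, and the third because the payment is not immediately followed by the pre-enrolment.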

Constructing constraint and specification formulae is, however, not enough. When one reads $\square(a \to \Diamond b)$ following the textual description given above, the formula gets interpreted as "whenever a occurs, then b is expected to occur at some later instant". This formulation intuitively hints at the fact that the occurrence of a *activates* the Response(a, b) constraint, requiring the *target* b to occur. In turn, we get that a trace not containing any occurrence of a is *less interesting* than a trace containing occurrences of a, each followed by one or more occurrences of b: even though both traces satisfy Response(a, b), the first trace never "interacts" with Response(a, b), while the second does. This relates to the notion of vacuous satisfaction in LTL [51] and that of interestingness of satisfaction in LTL<sup>f</sup> [39].

Table 2. Semantics of some Declare constraints.

The point is, all such considerations are not captured by the formula $\square(a \to \Diamond b)$, but are related to the pragmatic interpretation of how it relates to traces. To see this aspect, let us consider that we can equivalently express the formula above as $\square\neg a \lor \Diamond(b \land \square\neg a)$, which now reads as follows: "Either a never happens at all, or there is some occurrence of b after which a never happens". This equivalent reformulation does not put into evidence the activation or the target.
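The equivalence between the two readings ("whenever a occurs, then b occurs at some later instant" and "either a never happens at all, or there is some occurrence of b after which a never happens") can be checked empirically on all traces up to a given length. The sketch below is a brute-force test over a three-letter alphabet, not a proof; function names and the bound on the trace length are arbitrary choices.

```python
from itertools import product


def response(t, a, b):
    # Whenever a occurs, b occurs at the same or at a later instant
    # (the same instant is ruled out, as only one activity occurs per instant).
    return all(b in t[i:] for i in range(len(t)) if t[i] == a)


def reformulated(t, a, b):
    # Either a never happens, or some occurrence of b has no a from then on.
    never_a = a not in t
    b_then_no_a = any(t[j] == b and a not in t[j:] for j in range(len(t)))
    return never_a or b_then_no_a


alphabet = ("a", "b", "c")
traces = [t for n in range(1, 7) for t in product(alphabet, repeat=n)]
assert all(response(t, "a", "b") == reformulated(t, "a", "b") for t in traces)
print(f"the two formulations agree on {len(traces)} traces")
```

Both functions return the same verdict on every trace, yet only the first one makes explicit which event triggers the constraint and which one is expected in response.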

This problem can be tackled in two possible ways. One option is to attempt an automated approach where activation, target, and interesting satisfaction are semantically, implicitly characterized once and for all at the logical level; this is the route followed in [39]. The main drawback of this approach is that the user cannot intervene at all in deciding how to fine-tune the activation and target conditions. An alternative possibility is instead to ask the user to explicitly indicate, together with the LTL<sup>f</sup> formula $\varphi$ of the template, also two related LTL<sup>f</sup> formulae expressing activation and target conditions for $\varphi$. This latter approach, implicitly adopted in [69] and then explicitly formalized in [18], gives more control to the user on how to *pragmatically* interpret constraints. We follow this latter approach.

Intuitively, the *activation* of a constraint is a triggering condition that, once made true, expects that the *target* condition is satisfied by the process execution. Contrariwise, if the constraint is not activated, the satisfaction of the target is not enforced. All in all, to properly constitute an activation-target pair for an LTL<sup>f</sup> formula ϕ, we need them to satisfy the condition that whenever the current instant is such that the activation is satisfied, ϕ must behave equivalently to the target (thus requiring its satisfaction). This is formally captured as follows.

Definition 10 (Activation and target of a constraint). *The* activation *and* target *of a constraint* $k$ *over activities* Act *are two* LTL<sup>f</sup> *formulae, written here as* $\mathit{act}_k$ *and* $\mathit{tgt}_k$*, such that for every trace* $t \in \mathrm{Act}^*$ *we have that:*

$$t \models \varphi_k \quad \text{iff} \quad t \models \square\,(\mathit{act}_k \rightarrow \mathit{tgt}_k)$$

Table 2 shows activations and targets for each constraint, inspired by the work of Cecconi et al. [18]. In the next example, we explain the rationale behind some of the constraint formulations in the table.

Example 7. Consider ChainResponse(\$, p), dictating that whenever \$ occurs, then p is the activity occurring next. We have $\varphi_{\mathrm{ChainResponse}(\$,p)} = \square(\$ \to \bigcirc p)$. Then, by Definition 10, we can directly fix the activation of ChainResponse(\$, p) to $\$$ and its target to $\bigcirc p$, respectively witnessing that every occurrence of \$ triggers the constraint, with a target requiring the consequent execution of p in the next instant. Similarly, for Precedence(\$, p) we have $\varphi_{\mathrm{Precedence}(\$,p)} = \square(p \to \Diamond^{-}\,\$)$, where $\Diamond^{-}$ denotes the past-tense counterpart of $\Diamond$ (that is, \$ occurred at some earlier instant), and in turn, by Definition 10, the activation of Precedence(\$, p) is $p$ and its target is $\Diamond^{-}\,\$$. The case of AtMostOne(p) is also similar. In this case, $\varphi_{\mathrm{AtMostOne}(p)}$ formalizes that p cannot occur twice, which in LTL<sup>f</sup> can be directly captured by $\neg\Diamond(p \land \bigcirc\Diamond p)$. This is logically equivalent to $\square(p \to \neg\bigcirc\Diamond p)$, which directly yields $p$ as the activation of AtMostOne(p) and $\neg\bigcirc\Diamond p$ as its target.
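The activation-target reading of the three constraints in Example 7 can be compared against a direct reading of their informal descriptions by brute force. In the sketch below, activations and targets are Python predicates over a trace and a 0-based instant; the encoding, the function names, and the extra activity x are assumptions of this illustration, and the check covers only traces up to a fixed length.

```python
from itertools import product


def always_activation_implies_target(t, activation, target):
    """At every instant of t where the activation holds, the target holds too."""
    return all(target(t, i) for i in range(len(t)) if activation(t, i))


# ChainResponse($, p): activation $, target "p occurs at the next instant".
chain_act = lambda t, i: t[i] == "$"
chain_tgt = lambda t, i: i + 1 < len(t) and t[i + 1] == "p"

# Precedence($, p): activation p, target "$ occurred at some earlier instant".
prec_act = lambda t, i: t[i] == "p"
prec_tgt = lambda t, i: "$" in t[:i + 1]

# AtMostOne(p): activation p, target "p does not occur at any later instant".
amo_act = lambda t, i: t[i] == "p"
amo_tgt = lambda t, i: "p" not in t[i + 1:]

traces = [t for n in range(1, 6) for t in product("$px", repeat=n)]
for t in traces:
    # Direct reading of ChainResponse: each $ is immediately followed by p.
    direct_chain = all(t[i + 1:i + 2] == ("p",) for i in range(len(t)) if t[i] == "$")
    assert always_activation_implies_target(t, chain_act, chain_tgt) == direct_chain
    # Direct reading of Precedence: the first p (if any) comes after some $.
    first_p = t.index("p") if "p" in t else len(t)
    direct_prec = "p" not in t or "$" in t[:first_p]
    assert always_activation_implies_target(t, prec_act, prec_tgt) == direct_prec
    # Direct reading of AtMostOne: p cannot occur twice.
    assert always_activation_implies_target(t, amo_act, amo_tgt) == (t.count("p") <= 1)
print(f"checked {len(traces)} traces")
```

Separating activation and target in this way also makes it possible to count how many times a constraint is triggered in a trace, which a plain satisfaction check does not reveal.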

A quite different situation holds instead for the other *existence* constraints. Take, for example, AtLeastOne(a), requiring that a occurs at least once in the execution. This can be directly encoded in LTL<sup>f</sup> as ♦a. This formulation, however, does not help to identify the activation and target of the constraint. Intuitively, we may disambiguate this by observing that, since the constraint requires the presence of a from the very beginning of the execution, the constraint is indeed activated at the beginning, i.e., when **start** holds, imposing the satisfaction of the target ♦a. This intuition is backed up by Definition 10, using the semantics of **start** and noticing the following logical equivalences:

$$
\Diamond \mathsf{a} \;\equiv\; \mathbf{start} \rightarrow \Diamond \mathsf{a} \;\equiv\; \Box (\mathbf{start} \rightarrow \Diamond \mathsf{a})
$$

This explains why the latter formulation is employed in Table 2.

Fig. 5. Example FSAs of Declare constraints.

*Declarative Constraints as FSAs.* Crucial for our techniques is that every LTL<sup>f</sup> formula ϕ can be encoded into a corresponding FSA (in the sense of Definition 4) A<sub>ϕ</sub> that recognizes all and only those traces that satisfy the formula. This can be done through different algorithmic techniques. A direct approach that transforms an input formula into a non-deterministic FSA is presented in [28,29]; notice that the so-obtained FSAs can then be determinized and minimized using standard techniques [50,99]. A fortiori, given a Declare specification DS = (Rep, Act, K), we proceed as follows:


Figure 5 shows four local automata for constraints taken from our running example: AlternateResponse(r, v), ChainResponse(\$, p), Precedence(u, e) and AtMostOne(p). Examples of global automata are instead given in Fig. 6.

In the remainder of this chapter, we will extensively use local and global automata for reasoning, discovery, and monitoring. Though out of scope for this chapter, it is also worth mentioning that the automata-based approach has also been used for simulation of Declare models and thereby the production of event logs from declarative specifications [37], and also to define enactment engines for Declare specifications [76,97].

#### 4.2 Reasoning on Declare Specifications

Reasoning on a Declare specification is necessary to understand which model traces are supported and, in turn, to ascertain its correctness. Reasoning is also key to unveil how constraints interact with each other, and check whether activations and targets are properly defined. As we will see, this is instrumental not only to analyze specifications, but it is also an integral part of declarative process mining.

In general, reasoning on declarative specifications is of particular importance: while they enjoy flexibility, they typically do not explicitly indicate how execution has to be controlled. We have seen how this phenomenon concretely manifests itself in the context of Declare: traces conforming to the specification (that is, *model traces*) are only implicitly described as those that satisfy all the given constraints. Constraints, in turn, may be quite diverse from each other (e.g., indicating what *is expected* to occur, but also what should *not* happen) and, even more importantly, may affect each other in subtle, difficult-to-detect ways. This phenomenon is known, in the literature that studies the cognitive impact of languages and notations, under the name of *hidden dependencies* [47]. Hidden dependencies in Declare have been studied in [32,70], and their impact on understandability and interpretability of declarative process models has spawned a dedicated line of research, started in [48].

Fig. 6. Global automata for the interplay of Declare constraints. (a) Alt.Resp.(r, v) and Chn.Resp.(\$, p), where Σ̄ is Σ \ {r, v, \$, p}. (b) Alt.Resp.(r, v), Chn.Resp.(\$, p) and Prec.(u, e), where Σ̄ is Σ \ {r, v, \$, p, u, e}. (c) Alt.Resp.(r, v), Chn.Resp.(\$, p), Prec.(u, e) and AtMostOne(p), where Σ̄ is Σ \ {r, v, \$, p, u, e} (for the sake of readability, a few transitions to s<sub>12</sub> are omitted).

We detail next key reasoning tasks in the context of Declare, substantiating how hidden dependencies enter into the picture. We show that all such reasoning tasks can be homogeneously tackled by a single check on the global automaton of the specification under study.

Fig. 7. Examples of incorrect Declare specifications.

*Specification Consistency.* This is the most fundamental task, defined as follows.

Definition 11 (Consistent specification). *A* Declare *specification* DS *is* consistent *if there exists at least one model trace for* DS*.*

Example 8. Consider the Declare specification in Fig. 7(a). The specification is inconsistent. This is not due to conflicting constraints insisting on the same activity, but due to hidden dependencies arising from the interplay of multiple constraints. To see why the specification is inconsistent, we can try to construct a trace that satisfies some of the constraints in the model, until we reach a contradiction (i.e., the "trace pattern" constructed so far violates a constraint of the specification). This is graphically shown next:

The picture clearly depicts that AtLeastOne(a) triggers:


Considering the interplay of the involved constraints, d is required to occur in different instants, hence twice, in turn violating AtMostOne(d).

By definition of model trace, it is immediate to see that DS *is consistent if and only if the* LTL<sup>f</sup> *specification formula* ϕDS *is satisfiable*. This, in turn, can be algorithmically verified by first constructing the global automaton ADS, and then checking whether such an automaton is empty (i.e., it does not recognize any trace). Specifically, ϕDS *is* satisfiable *if and only if* ADS *is non-empty*.

*Detection of Dead Activities.* This task amounts to checking whether a Declare specification is over-constrained, in the sense that it contains an activity that can never be executed (in that case, such an activity is called *dead*).

Definition 12 (Dead activity). *Let* DS = (Rep, Act, K) *be a* Declare *specification. An activity* a ∈ Act *is* dead *in* DS *if there is no model trace of* DS *where* a *occurs.*

Example 9. Consider the Declare specification in Fig. 7(b). The specification is consistent; as an example, trace ⟨c, d⟩ is a model trace. However, none of its model traces can contain an execution of b. This can be seen if one tries to construct a trace containing an occurrence of b. The result is the following:

It is apparent that the presence of b requires a previous occurrence of a and, indirectly, a future occurrence of d, violating NotResponse(a, d). This shows that b is a dead activity.

Consider now the specification in Fig. 7(c). The situation here is trickier. The specification is consistent, as it accepts the empty trace (where no activity is executed, and hence neither of the two response constraints present in the specification gets activated). However, neither of the two activities a and b present therein can occur. As soon as one of them occurs, the combination of the two response constraints cannot be *finitely* satisfied. In fact, an occurrence of a requires a later occurrence of b, which in turn requires a later occurrence of a, and so on and so forth, indefinitely. In other words, in every instant, one of Response(a, b) and Response(b, a) must be active and waiting for a later occurrence of its target, in a future instant. Since every instant must then have a next instant, it is not possible to construct a satisfying (finite) trace.

Dead activity detection can be directly reduced to (in)consistency of a specification. Specifically, activity a is dead in a Declare specification DS = (Rep, Act, K) if and only if the specification (Rep, Act, K ∪ {AtLeastOne(a)}), obtained from DS by forcing the existence of a, is inconsistent (i.e., its specification formula is not satisfiable).
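The two reductions above (consistency as non-emptiness of the global automaton, and dead-activity detection as inconsistency after adding AtLeastOne) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the specification is taken to be the one described for Fig. 7(c), i.e., Response(a, b) and Response(b, a) over Act = {a, b}; the local automata are written by hand rather than derived from LTL<sup>f</sup> formulae; and all function names are hypothetical. The product automaton is explored on the fly instead of being built explicitly.

```python
from collections import deque
from itertools import product


def response(x, y, alphabet):
    """Hand-written DFA for Response(x, y): state 1 means 'an x is waiting for a y'."""
    delta = {}
    for s, a in product((0, 1), alphabet):
        delta[s, a] = 1 if a == x else 0 if a == y else s
    return {"init": 0, "final": {0}, "delta": delta}


def at_least_one(x, alphabet):
    """Hand-written DFA for AtLeastOne(x): state 1 means 'x has occurred'."""
    delta = {}
    for s, a in product((0, 1), alphabet):
        delta[s, a] = 1 if a == x else s
    return {"init": 0, "final": {1}, "delta": delta}


def is_consistent(automata, alphabet):
    """True iff the product of the local automata accepts at least one trace."""
    start = tuple(A["init"] for A in automata)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        # A product state is accepting when every local automaton accepts.
        if all(s in A["final"] for s, A in zip(state, automata)):
            return True
        for a in alphabet:
            nxt = tuple(A["delta"][s, a] for s, A in zip(state, automata))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def is_dead(activity, automata, alphabet):
    """An activity is dead iff forcing its existence makes the specification inconsistent."""
    return not is_consistent(automata + [at_least_one(activity, alphabet)], alphabet)


ACT = ["a", "b"]
spec = [response("a", "b", ACT), response("b", "a", ACT)]
print(is_consistent(spec, ACT))                        # the empty trace is a model trace
print(is_dead("a", spec, ACT), is_dead("b", spec, ACT))
```

The breadth-first exploration visits only reachable product states, which mirrors the emptiness check on the global automaton without materializing it first.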

*Valid Activation and Target.* To ensure that a Declare constraint k comes with a valid activation act(k) and target tgt(k) for its formula ϕ<sub>k</sub>, we can directly apply Definition 10 and check whether the LTL<sup>f</sup> formula ϕ<sub>k</sub> ↔ □(act(k) → tgt(k)) is valid, that is, whether its negation is not satisfiable.

*Checking Relations Between Constraints/Specifications.* We establish two key relations between constraints/specifications. The first is that of *subsumption* between templates, leveraging the entailment relation between LTL<sup>f</sup> formulae to constraints. We formally define it as follows.

Definition 13 (Subsumption). *Let* k(x1,...,xm), k′(x1,...,xm) ∈ Rep *be two templates.* k(x1,...,xm) subsumes k′(x1,...,xm) *(in symbols,* k(x1,...,xm) ⊑ k′(x1,...,xm)*) if, given any mapping* κ *assigning* x1,...,xm *with activities* a1,...,am ∈ Act*,* ϕ<sub>k(a1,...,am)</sub> ⊨ ϕ<sub>k′(a1,...,am)</sub>*.*

This relation can be checked by verifying that ϕ<sub>k(a1,...,am)</sub> → ϕ<sub>k′(a1,...,am)</sub> is valid, that is, the negated formula ϕ<sub>k(a1,...,am)</sub> ∧ ¬ϕ<sub>k′(a1,...,am)</sub> is not satisfiable for any a1,...,am ∈ Act. For example, Alt.Prec.(x, y) ⊑ Precedence(x, y), as the former requires that y can occur only if preceded by x (just as the latter) *and* that y does not recur in between. Therefore, every trace that satisfies the former must satisfy the latter too. In the following, we shall lift this notion to constraints too (e.g., we say that AlternatePrecedence(y, p) subsumes Precedence(y, p)).
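As a quick sanity check of the example above, one can search for a counterexample trace by brute force. The sketch below is not a validity proof: it only enumerates traces up to a bounded length, and the two trace checkers are hand-written from the informal reading of Precedence and AlternatePrecedence given in the text (all names are hypothetical).

```python
from itertools import product


def precedence(x, y, trace):
    """y may occur only if x occurred before."""
    seen_x = False
    for a in trace:
        if a == x:
            seen_x = True
        elif a == y and not seen_x:
            return False
    return True


def alternate_precedence(x, y, trace):
    """y may occur only if preceded by x, with no other y in between."""
    armed = False  # True when an x occurred after the last y
    for a in trace:
        if a == x:
            armed = True
        elif a == y:
            if not armed:
                return False
            armed = False
    return True


def find_counterexample(stricter, looser, alphabet, max_len):
    """Return the first trace (shortest first) satisfying `stricter` but not `looser`, or None."""
    for n in range(max_len + 1):
        for trace in product(alphabet, repeat=n):
            if stricter(trace) and not looser(trace):
                return trace
    return None


def alt(t):
    return alternate_precedence("x", "y", t)


def prec(t):
    return precedence("x", "y", t)


# No trace up to length 6 satisfies Alt.Prec.(x, y) while violating Precedence(x, y) ...
print(find_counterexample(alt, prec, "xyz", 6))
# ... whereas the converse fails: y recurs without a new x in between.
print(find_counterexample(prec, alt, "xyz", 6))
```

An exact check would instead test the unsatisfiability of the conjunction of the first formula with the negation of the second, for instance through the automata-based emptiness check.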

By Definition 8 and Definition 9, since both Declare constraints and specifications correspond to LTL<sup>f</sup> formulae, we can use subsumption for a twofold purpose:


The second relation characterizes constraints that are the negated version of each other. Let k<sub>1</sub> and k<sub>2</sub> be two Declare constraints, coming with activation formulae act(k<sub>1</sub>) and act(k<sub>2</sub>) and target formulae tgt(k<sub>1</sub>) and tgt(k<sub>2</sub>), respectively. We say that k<sub>1</sub> and k<sub>2</sub> are the *negated versions* of one another if their activations are logically equivalent, that is, act(k<sub>1</sub>) ↔ act(k<sub>2</sub>), and their targets are incompatible, that is, tgt(k<sub>1</sub>) ∧ tgt(k<sub>2</sub>) is false. An example is that of Response vs NotResponse.

Consider now the situation where a decision must be taken concerning which of two candidate constraints k<sub>1</sub> and k<sub>2</sub> can be added to a Declare specification. Knowing that k<sub>1</sub> and k<sub>2</sub> are the negated versions of one another indicates that they should not *both* be added to the specification, as including them both would make the specification inconsistent as soon as the two constraints are activated.

As we will see in the next section, these notions become key when dealing with declarative process mining, and in particular the discovery of Declare specifications from event logs. Figure 8 graphically depicts how the main Declare constraint templates relate to each other in terms of subsumption and negated versions.


Fig. 8. The subsumption map of Declare templates. Templates are indicated with solid boxes. The subsumption relation is depicted as a line starting from the subsumed template and ending in the subsuming one, with an empty triangular arrow recalling the UML IS-A graphical notation. The negative templates are graphically linked to the corresponding relation templates by means of wavy grey arcs.

## 5 Declarative Process Mining

Declarative process constraints depict the interplay of every activity in the process with the rest of the activities. As a consequence, the behavioural relationships that hold among activities can be analysed with a local focus on each one [9], as a projection of the whole process behaviour on a single element thereof. The constraints pertaining to a single activity can thus be seen as its footprint in the global behaviour of the process. We shall interchangeably interpret Declare constraints as *(i)* behavioural relations between activities in a process specification or *(ii)* rules exerted on the occurrence of events in traces. Notice that the latter is a different approach than the former, which is typically used for process modelling as originally conceived by the seminal work of Pesic et al. [77]. The latter is instead the basis for declarative process mining. In the following, we describe how process specifications can be discovered and monitored.

#### 5.1 Declarative Process Discovery

Declarative process discovery refers to the inference of those constraints that significantly rule the behaviour of a process, based upon an input event log. The problem can be framed in two distinct ways:

• A *discriminative discovery problem*, reminiscent of a classification task. This requires splitting the input event log into two partitions, one containing "positive" examples and the other containing "negative" examples. Discovery amounts to finding a suitable Declare specification that correctly reconstructs the classification, that is, accepts all positive examples and rejects all negative ones.

• A *standard discovery problem* – also known as *specification mining* in the software engineering literature [53]. This calls for the identification of the Declare constraints that best describe the traces in the log, considering all of them as "positive" examples.

#### Algorithm 1: Overview of the discovery algorithm

```
Input:  L ∈ B(U_act*), the event log to be analyzed;
        Rep, a finite set of Declare templates to be considered to express the discovered specification;
        Act ⊆ U_act, a finite set of activities to be included in the discovered specification;
        conf_t^min, supp_t^min, conf_e^min, supp_e^min, the minimum thresholds for trace-based confidence
        and support, and event-based confidence and support, respectively (default for all four: 0.0);
Output: DS, a declarative process specification

K ← { k(a1,...,am) : k ∈ Rep, a1,...,am ∈ Act, ai ≠ aj for all 1 ≤ i < j ≤ m }
        /* candidate constraints: templates assigned with any pair of distinct activities */
foreach k ∈ K do                          /* compute measures */
    ct ← conf_t(k, L); st ← supp_t(k, L); ce ← conf_e(k, L); se ← supp_e(k, L)
    if ct ≤ conf_t^min or st ≤ supp_t^min or ce ≤ conf_e^min or se ≤ supp_e^min then
        K ← K \ {k}                       /* remove constraints with a measure below the threshold */
foreach k ∈ K do                          /* remove constraints as per subsumption hierarchy and negated versions */
    foreach k′ ∈ K s.t. k′ ⊑ k do         /* for every k′ that subsumes k in K */
        if allm(k′, L) ≤ allm(k, L)       /* if the measures of k′ are ≤ those of k */
        then K ← K \ {k}
        else K ← K \ {k′}
    foreach k′ ∈ K s.t. k′ is the negated version of k do
        if allm(k′, L) < allm(k, L) then K ← K \ {k}
        else K ← K \ {k′}
return DS = (Rep, Act, K)
```

The first discovery algorithm for Declare treated discovery as a discriminative problem, exploiting inductive logic programming to tackle it [20,52]. In parallel, Goedertier et al. [46] brought forward techniques to generate negative examples from positive ones. Interestingly, this line of investigation has recently regained the attention of the community [19,89].

Declarative process discovery framed as a standard discovery problem finds its two main exponents in *Declare Miner* [58] and *MINERful* [40], which have then been extended with an arsenal of techniques to improve the quality and correctness of the discovered specifications. We follow the second thread, summarizing the main ideas exploited therein, though reshaping the core concepts in an attempt to embrace the wider plethora of declarative process discovery techniques and the advancements they brought [8,18,59].

Process discovery in a declarative setting typically consists of the following phases:

1) The initial setup, i.e., the selection of *(i)* the templates to be sought for, *(ii)* the activities to be considered for the candidate constraints instantiating those templates, and *(iii)* the minimum thresholds for constraint interestingness measures to retain a candidate constraint;


Algorithm 1 gives a bird's-eye view of the approach in pseudocode. As we can observe, interestingness measures are crucial to determine the degree to which constraints are satisfied in the log. Originally devised in the field of association rule mining [3] and later adapted to the declarative process discovery context [17,65], they have been introduced to indicate the level of reliability and relevance of constraints discovered from event logs. Among them, we recall support and confidence. Intuitively, support is a normalized measure quantifying how often the constraint is satisfied in the event log. Confidence considers the number of satisfactions with respect to the occurrences of the activations. We define them formally as follows.

Definition 14 (Trace-based measures). *Let* L *be a non-empty simplified event log with at least one non-empty trace, and* k *a declarative constraint as per Definition 1, with activation* act(k) *and target* tgt(k)*. We define the trace-based support* supp<sub>t</sub> *and the trace-based confidence* conf<sub>t</sub> *as follows:*

$$\mathrm{supp}_t(k, L) = \frac{\sum_{t \in L \,:\, t \models \Diamond(\mathit{act}(k)) \wedge k} L(t)}{\sum_{t \in L} L(t)}; \tag{1}$$

$$\mathrm{conf}_t(k, L) = \frac{\sum_{t \in L \,:\, t \models \Diamond(\mathit{act}(k)) \wedge k} L(t)}{\max \left\{ 1, \sum_{t \in L \,:\, t \models \Diamond(\mathit{act}(k))} L(t) \right\}}. \tag{2}$$

We remark that the condition at the numerator, requiring the trace to satisfy not only the constraint k but also, eventually, its activation, i.e., t ⊨ ♦(act(k)) ∧ k, serves the purpose of not counting the "vacuous satisfactions" discussed in Sect. 4.1. For example, while trace ⟨b, c⟩ satisfies ChainResponse(a, b), it does so vacuously, in the sense that it never activates the constraint. This intuitively means that ChainResponse(a, b), albeit satisfied, cannot be interestingly used to describe the behaviour encoded in the trace. We recall that L(t) denotes the multiplicity of occurrences of t in the log L (see [1], Sect. 3.1). The max term at the denominator of the formulation of confidence serves the purpose of avoiding a division by zero in case no trace satisfies ♦(act(k)).

Declare Miner first introduced the trace-based measures to discover specifications from logs, counting traces that (non-vacuously) satisfy constraints as a whole. MINERful, instead, advocated also the adoption of measures that lie at the level of granularity of events. The similarities and differences between the two measuring schemes and the role of explicit activations and targets to tackle vacuity have been later systematized in [18]. The motivation behind the use of event-based measures is the ability to give a different weight to traces violating the constraints in more than one instant: with trace-based measures, e.g., both traces ⟨a, b, c, a, b, c, c, a, b, a, b, a, b, a, b, c, a, b, c, a, b, a, b, a, c⟩ and ⟨b, a, c, a, c, a, a, a, a, a, a, c⟩ would count as single violations for ChainResponse(a, b). However, only the last occurrence of a out of ten leads to violation in the first trace, whereas all eight occurrences of a lead to violation in the second trace. Next, we formally capture the notion of event-based measures.

Definition 15 (Event-based measures). *Let* L *be a non-empty simplified event log with at least one non-empty trace, and* k *a declarative constraint as per Definition 1, with activation* act(k) *and target* tgt(k)*. We define the event-based support* supp<sub>e</sub> *and the event-based confidence* conf<sub>e</sub> *as follows:*

$$\mathrm{supp}_e(k, L) = \frac{\sum_{t \in L} \left| \{ a_i \in t : t, i \models (\mathit{act}(k) \wedge \mathit{tgt}(k)) \} \right| \times L(t)}{\sum_{t \in L} |t| \times L(t)}; \tag{3}$$

$$\mathrm{conf}_e(k, L) = \frac{\sum_{t \in L} \left| \{ a_i \in t : t, i \models (\mathit{act}(k) \wedge \mathit{tgt}(k)) \} \right| \times L(t)}{\max \left\{ 1, \sum_{t \in L} \left| \{ a_i \in t : t, i \models \mathit{act}(k) \} \right| \times L(t) \right\}}. \tag{4}$$

Again, the condition at the numerator that events satisfy both activation and target of the constraint is intended to avoid including vacuous satisfactions in the sum. The max term at the denominator of confidence is intended to avoid a division by zero in case no event satisfies the activation of k.
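To make the difference between the two measuring schemes tangible, the following sketch computes the four measures for ChainResponse(a, b) on a small log made of the three traces mentioned in the text, each with multiplicity 1 (the multiplicities are an assumption made here for illustration). The checker is specific to ChainResponse, whose activation is an occurrence of a and whose target requires b at the next instant; function names are hypothetical.

```python
from collections import Counter


def chain_response_events(a, b, trace):
    """Count the events activating ChainResponse(a, b) and those also fulfilling its target."""
    acts = [i for i, e in enumerate(trace) if e == a]
    sats = [i for i in acts if i + 1 < len(trace) and trace[i + 1] == b]
    return len(acts), len(sats)


def measures(a, b, log):
    """log maps each trace (a tuple of activities) to its multiplicity L(t)."""
    n_traces = sum(log.values())
    n_events = sum(len(t) * m for t, m in log.items())
    activated = satisfied = ev_act = ev_sat = 0
    for t, m in log.items():
        acts, sats = chain_response_events(a, b, t)
        ev_act += acts * m
        ev_sat += sats * m
        if acts > 0:                 # the trace eventually activates the constraint
            activated += m
            if sats == acts:         # ... and every activation is fulfilled
                satisfied += m
    return {
        "supp_t": satisfied / n_traces,
        "conf_t": satisfied / max(1, activated),
        "supp_e": ev_sat / n_events,
        "conf_e": ev_sat / max(1, ev_act),
    }


# The two violating traces discussed in the text, plus the vacuously satisfying <b, c>.
t1 = tuple("abcabc" "c" "abababab" "c" "abc" "abab" "ac")
t2 = tuple("bacac" "aaaaaa" "c")
log = Counter({t1: 1, t2: 1, ("b", "c"): 1})

m = measures("a", "b", log)
print(m)
```

On this log both trace-based measures are 0, since no trace satisfies the constraint non-vacuously, while the event-based confidence is 9/18 = 0.5: nine of the ten activations in the first trace are fulfilled, and none of the eight in the second.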

For the sake of readability, we shall denote with allm(k, L) the tuple containing all computed measures for a constraint k on the event log L: allm(k, L) = (supp<sub>t</sub>(k, L), conf<sub>t</sub>(k, L), supp<sub>e</sub>(k, L), conf<sub>e</sub>(k, L)). Given two constraints k<sub>1</sub> and k<sub>2</sub>, we write allm(k<sub>1</sub>, L) ≤ allm(k<sub>2</sub>, L) if supp<sub>t</sub>(k<sub>1</sub>, L) ≤ supp<sub>t</sub>(k<sub>2</sub>, L), conf<sub>t</sub>(k<sub>1</sub>, L) ≤ conf<sub>t</sub>(k<sub>2</sub>, L), supp<sub>e</sub>(k<sub>1</sub>, L) ≤ supp<sub>e</sub>(k<sub>2</sub>, L), and conf<sub>e</sub>(k<sub>1</sub>, L) ≤ conf<sub>e</sub>(k<sub>2</sub>, L). We write allm(k<sub>1</sub>, L) < allm(k<sub>2</sub>, L) if allm(k<sub>1</sub>, L) ≤ allm(k<sub>2</sub>, L) and allm(k<sub>2</sub>, L) ≰ allm(k<sub>1</sub>, L).

Example 10 (An event log for the specification in Example 1). Let U<sub>act</sub> ≐ {c, r, v, t, n, y, \$, p, e, u} ∪ {@} be an alphabet of activities. We interpret @ as an email exchange, which can occur at any stage during the process. The other activities in U<sub>act</sub> are those that were considered in the process specification in Example 1. Let the following event log be built on U<sub>act</sub>:


Table 3. Measures computed for the relation constraints of Example 1 from the event log of Example 10.

$$L = [\, t_1^{200},\; t_2^{100},\; t_3^{100},\; t_4^{80},\; t_5^{80},\; t_6^{4},\; t_7^{2},\; t_8^{2} \,]$$

where


We observe that the log above does not fully comply with the specification. Indeed, *(i)* trace t<sub>8</sub> violates AlternateResponse(r, v), as the candidate managed to register twice before evaluation (notice the occurrence of two consecutive r's before v); *(ii)* t<sub>7</sub> violates Precedence(t, v) and Precedence(u, e), as the candidate must have sent the admission test score and the necessary enrolment documents via email rather than via the system (see the occurrence of @ in place of t in the second instant and in place of u later in the trace); finally, *(iii)* trace t<sub>6</sub> violates Precedence(u, e), as the candidate must have submitted the enrolment documents via email in that case too (notice the absence of task u and the presence of @ in its stead).

Example 11. With the example above, we have that both the trace-based support and trace-based confidence of Precedence(c, r), e.g., equate to 1.0: supp<sub>t</sub>(Precedence(c, r), L) = conf<sub>t</sub>(Precedence(c, r), L) = 1.0. This is because in all traces the activator (i.e., r) occurs and the constraint is not violated in any trace. Instead, supp<sub>t</sub>(Alt.Prec.(v, n), L) = (100 + 80 + 80 + 2)/568 ≈ 0.461 and conf<sub>t</sub>(Alt.Prec.(v, n), L) = 1.0. The trace-based support is lower than the trace-based confidence because the activator (n) occurs in 262 traces out of 568 (i.e., in the 100 instances of t<sub>2</sub>, the 80 instances of t<sub>4</sub>, the 80 instances of t<sub>5</sub>, and the 2 instances of t<sub>8</sub>). Similarly, conf<sub>e</sub>(Precedence(c, r), L) = 1.0 and conf<sub>e</sub>(Alt.Prec.(v, n), L) = 1.0. The measures do not change for event-based and trace-based confidence because every activation of the two constraints above leads to a satisfaction. In contrast,

$$\mathrm{supp}_e(\mathrm{Precedence}(c, r), L) = \frac{1 \times 200 + 2 \times 100 + 1 \times 100 + 2 \times 80 + 1 \times 80 + 1 \times 4 + 1 \times 2 + 2 \times 2}{9 \times 200 + 14 \times 100 + 10 \times 100 + 11 \times 80 + 8 \times 80 + 12 \times 4 + 9 \times 2 + 7 \times 2} = \frac{750}{5800} \approx 0.129.$$

It is worth noting that discovery approaches such as Declare Miner [58] and Janus [18] adopt (variations of) local constraint automata to count the satisfactions of constraints. MINERful [40] and DisCoveR [8] resort to occurrence statistics of activities gathered from the event log, in a way closer to the procedural discovery algorithms discussed in [2].

By definition of confidence and support (trace- or event-based), and as exemplified above, we observe that trace-based confidence is an upper bound for trace-based support and event-based confidence is an upper bound for event-based support. Next, we illustrate how the discovery algorithm operates with our running example.

Example 12. Table 3 shows the event-based and trace-based measures computed on the basis of our running example for every constraint in the original specification – phase (2) of the discovery procedure described above. They belong to the output of the discovery algorithm running on the event log of Example 10, set at phase (1) to seek *(i)* all templates from the Declare repertoire in Table 2 *(ii)* over activities {c, r, v, t, n, y, \$, p, e, u}, with *(iii)* minimum event-based confidence of 0.95. We remark that also AlternatePrecedence(y, p), ChainPrecedence(\$, p), AlternatePrecedence(p, e) and AlternatePrecedence(c, p), NotChainPrecedence(y, p) and NotChainResponse(y, p), among others, fulfil those criteria and thus are part of the returned set.

To increase the information brought by a discovered model, we not only prune the constraints whose measures lie below the given threshold values, but also take into account the subsumption hierarchy illustrated in Fig. 8. In addition, we retain in the constraint set only one among pairs that are a negated version of one another. If we kept both, the model would turn the activation in common into a dead activity (see Sect. 4.2).
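The subsumption-based part of this pruning can be sketched as follows. The code mirrors the corresponding loop of Algorithm 1 as it is written there and is not taken from any of the cited tools. The measure tuples are hypothetical placeholders (the actual values are those of Table 3), and the subsumption relation is supplied explicitly as a set of (subsuming, subsumed) pairs; all names are illustrative.

```python
def leq(m1, m2):
    """Component-wise comparison of two allm tuples."""
    return all(x <= y for x, y in zip(m1, m2))


def prune_by_subsumption(allm, subsumes):
    """allm: constraint -> (supp_t, conf_t, supp_e, conf_e);
    subsumes: set of (stricter, looser) pairs, i.e., stricter subsumes looser."""
    kept = set(allm)
    for looser in list(allm):
        for stricter in list(allm):
            if (stricter, looser) not in subsumes:
                continue
            if stricter not in kept or looser not in kept:
                continue  # one of the two has already been pruned
            if leq(allm[stricter], allm[looser]):
                kept.discard(looser)    # keep the more restrictive constraint
            else:
                kept.discard(stricter)
    return kept


# Hypothetical measure values, chosen only to illustrate the mechanics.
allm = {
    "AlternatePrecedence(y, p)": (0.4, 1.0, 0.05, 1.0),
    "Precedence(y, p)": (0.4, 1.0, 0.05, 1.0),
    "ChainPrecedence($, p)": (0.4, 1.0, 0.05, 1.0),
    "Precedence($, p)": (0.4, 1.0, 0.05, 1.0),
}
subsumes = {
    ("AlternatePrecedence(y, p)", "Precedence(y, p)"),
    ("ChainPrecedence($, p)", "Precedence($, p)"),
}
kept = prune_by_subsumption(allm, subsumes)
print(sorted(kept))
```

With equal measures on both sides of each pair, only the more restrictive constraints survive, which is the behaviour the text describes for constraints that are subsumed by others with equivalent measures.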

Example 13. Figure 9 illustrates the result of the pruning phase (3) based on subsumption and choice of constraints that are the negated version of one another, based on the event log of Example 10. We observe that AlternatePrecedence(y, p) has the same measures as Precedence(y, p), and we know that Precedence(y, p) is subsumed by AlternatePrecedence(y, p) (see Sect. 4.2); as we are interested in more restrictive constraints that reduce the space of possible process runs to more closely define its behaviour, we retain the former and discard the latter. Keeping both would introduce a redundancy, and retaining only the latter would omit detailed information as not only p must be preceded by y, but also p cannot recur unless y occurs again. By the same line of reasoning, we prefer retaining Init(c) to AtMostOne(c) in the result specification. The same concepts apply with ChainPrecedence(\$, p), to be preferred over Precedence(\$, p), and AlternatePrecedence(p, e) in place of Precedence(p, e), among others. Notice that Precedence(y, p), Precedence(\$, p) and Precedence(p, e) were in the given specification of our running example but, we conclude, are not the most restrictive constraints that could be used in the specification, as the discovery algorithm evidences.

Fig. 9. The subsumption map of relation Declare constraints in a discovery context. The graphical notation follows Fig. 8. Gray boxes denote constraints that have measures below the minimum thresholds. Light-gray boxes indicate constraints that are subsumed by others with equivalent measures.

To conclude, we remark that not all redundancies can be found with the sole subsumption-hierarchy based pruning. The subsumption hierarchy, indeed, checks constraints that are exerted on the same activities – e.g., AlternatePrecedence(y, p) and Precedence(y, p). Therefore, we need a more powerful redundancy checking mechanism, seeking constraints that are entailed by the remainder of the specification's constraint set (see Sect. 4.2).

Example 14. The confidence of AlternatePrecedence(v, p) is 1.0 in the event log of our running example. Yet, it does not add information to the discovered specification as it is redundant, logically entailed by the other constraints – in particular, AlternatePrecedence(r, v), AlternatePrecedence(v, y), Precedence(y, p) and AtMostOne(p).

To verify this, we can resort to language inclusion via automata product as in [38]: the language of the product of the four constraint automata is not smaller than the language accepted by the intersection of the second, third and fourth constraint automata. Here, we do not enter the details of the algorithms that detect redundancies at such a deeper level but provide an example of its rationale. The interested reader can find further details in [24,38].

Fig. 10. Example FSAs adapted for the monitoring of constraints. Non-final states indicating current violation (c⊥) are dashed and filled in orange; non-final states indicating permanent violation (p⊥) are dotted and filled in red; final states indicating current satisfaction (c⊤) are thin-solid and filled in blue; final states indicating permanent satisfaction (p⊤) are thick-solid and filled in green. (Color figure online)

#### 5.2 Declarative Process Monitoring

(Compliance) process monitoring aims at tracking running process executions to check their conformance to a reference process model, with the purpose of detecting and reporting deviations as soon as possible [57]. It constitutes one of the main tasks of operational decision support [92, Ch. 10], which characterizes process mining applied at runtime to running process executions.

Declarative process monitoring employs a declarative specification (in our case, described using Declare) as reference model for monitoring. The central fact in monitoring that process instances are running, that is, their generated traces evolve over time, calls for a finer-grained understanding of the state of constraints and of the whole specification. We illustrate this intuitively in the next example.

Example 15. Consider the excerpt in Fig. 11 of our admission process running example, and an evolving trace that, once completed, corresponds to the following sequence: ⟨\$, p, u, \$, p⟩. Let us replay the trace from the beginning.


Fig. 11. Excerpt of the Declare specification in Fig. 2.


As witnessed by the example, the state of each constraint can be described in a fine-grained way by considering, on the one hand, the trace accumulated so far (i.e., the prefix of the whole, still unknown, execution) and, on the other hand, its possible future continuations. To do so in a formal way, we appeal to the literature on runtime verification for linear temporal logics, and in particular to the RV-LTL semantics, originally introduced in [11] over infinite traces. This semantics was adopted for the first time in the context of LTL<sup>f</sup> over finite traces in [64,66], in order to define an operational technique for Declare monitoring. This led to deeper investigations on the usage of RV-LTL to characterize the relevance of a trace to a declarative specification [39], and finally to a formally grounded, comprehensive framework for monitoring [27,28].

We now define the RV-LTL semantics for LTL<sup>f</sup>. In the definition, we denote the concatenation of trace t<sub>1</sub> with t<sub>2</sub> as t<sub>1</sub> · t<sub>2</sub>.

Definition 16 (RV-LTL states). *Consider an* LTL<sup>f</sup> *formula* ϕ *over* Σ*, and a trace* t *over* Σ<sup>∗</sup>*. We say that* ϕ is in (RV-LTL) state v after t*, written* [t ⊨ ϕ]<sub>RV</sub> = v*, if:*


(Current violation) *(i)* v = c⊥*, (ii) the current trace violates* ϕ *(*t ⊭ ϕ*), and (iii) there exists a suffix that leads to satisfying* ϕ *(for some trace* t′ ∈ Σ<sup>∗</sup>*, we have* t · t′ ⊨ ϕ*).*

*We also say that* t conforms to ϕ *if* [t ⊨ ϕ]<sub>RV</sub> = p⊤ *or* [t ⊨ ϕ]<sub>RV</sub> = c⊤ *(i.e., stopping the execution in* t *satisfies the formula).*

By inspecting the definition, we can directly see that monitoring is at least as hard as LTL<sup>f</sup> satisfiability/validity checking. To see this, consider what happens at the *beginning of an execution*, where the current trace is empty. By applying Definition 16 to this special case, and by recalling the notion of satisfiability/validity of an LTL<sup>f</sup> formula, we in fact get that an LTL<sup>f</sup> formula ϕ is:


To perform monitoring according to the RV-LTL states from Definition 16, we can once again exploit the automata-theoretic characterization of LTL<sup>f</sup>. In particular, given an LTL<sup>f</sup> formula ϕ, we construct its FSA A<sub>ϕ</sub>, and *color* the automaton states according to the RV-LTL semantics. As introduced in [64] and then formally verified in [28], this can be simply done as follows. Consider a state s of A<sub>ϕ</sub>. We label it by:


Figure 10 shows some examples of colored constraint automata, obtained by considering the constraint formulae of some Declare constraints from our running example. To monitor the state evolution of a constraint, one simply has to replay the evolving trace on its colored local automaton, returning the updated RV-LTL label as soon as a new event is processed. Doing so on the local automata in Fig. 10 for trace ⟨\$, p, u, \$, p⟩ formally reconstructs what was discussed in Example 15.
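The coloring and replay just described can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the construction of [64] or [28]: the two automata are hand-built encodings of AtMostOne(p) and ChainResponse(\$, p) over the alphabet {\$, p, u}, and the label names (`perm_sat`, `curr_sat`, `curr_viol`, `perm_viol`) are ours. A state is colored by checking whether it is final and whether final or non-final states remain reachable from it, which is one direct reading of Definition 16.

```python
ALPHABET = ("$", "p", "u")


def make_dfa(states, initial, finals, step):
    """Tabulate a DFA from a step function (hand-built, illustrative encoding)."""
    delta = {(s, e): step(s, e) for s in states for e in ALPHABET}
    return {"states": tuple(states), "initial": initial,
            "finals": set(finals), "delta": delta}


def _chain_response(state, event):
    # 0 = idle, 1 = "p" must come next, 2 = violated sink
    if state == 2 or (state == 1 and event != "p"):
        return 2
    return 1 if event == "$" else 0


# AtMostOne(p): the state counts occurrences of "p", saturating at 2 (sink).
AT_MOST_ONE_P = make_dfa(range(3), 0, {0, 1},
                         lambda s, e: min(s + 1, 2) if e == "p" else s)
CHAIN_RESPONSE = make_dfa(range(3), 0, {0}, _chain_response)


def reachable(dfa, source):
    """All states reachable from source (including source itself)."""
    seen, stack = {source}, [source]
    while stack:
        current = stack.pop()
        for event in ALPHABET:
            nxt = dfa["delta"][(current, event)]
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def color(dfa):
    """Attach one of the four RV-LTL labels to every state."""
    labels = {}
    for state in dfa["states"]:
        reach = reachable(dfa, state)
        can_satisfy = bool(reach & dfa["finals"])
        can_violate = bool(reach - dfa["finals"])
        if state in dfa["finals"]:
            labels[state] = "curr_sat" if can_violate else "perm_sat"
        else:
            labels[state] = "curr_viol" if can_satisfy else "perm_viol"
    return labels


def monitor(dfa, trace):
    """Return the label before any event and after each event of the trace."""
    labels = color(dfa)
    state = dfa["initial"]
    verdicts = [labels[state]]
    for event in trace:
        state = dfa["delta"][(state, event)]
        verdicts.append(labels[state])
    return verdicts


trace = ("$", "p", "u", "$", "p")
print(monitor(AT_MOST_ONE_P, trace))
print(monitor(CHAIN_RESPONSE, trace))
```

With these encodings, AtMostOne(p) stays currently satisfied until the second p makes it permanently violated, while ChainResponse(\$, p) alternates between current violation (after each \$) and current satisfaction (after the following p).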

However, this is not enough to *promptly detect violations* as soon as they manifest in the traces. This has been extensively discussed in [28,66], and is at the very core of the power of temporal logic-based techniques for monitoring. We use again Example 15 to illustrate the problem.

Example 16. Consider Example 15 and the following question: is step 6 the earliest at which a violation can be detected? Clearly, if we focus on each constraint in isolation, the answer is affirmative. To see this formally, we play trace ⟨\$, p, u, \$, p⟩ on the four colored local automata of Fig. 10, obtaining the following runs:


The answer changes if we consider the whole Declare specification that contains all such constraints at once. In fact, by taking into account the interplay of constraints, we can detect a violation already at step 5, i.e., after the second occurrence of payment. This is because, after that step, the two constraints ChainResponse(\$, p) and AtMostOne(p) enter into a *conflict*, that is, no continuation of the current trace can satisfy them both. In fact, after trace ⟨\$, p, u, \$⟩, constraint ChainResponse(\$, p) is currently violated, waiting for a consequent occurrence of p; however, constraint AtMostOne(p), which is currently satisfied, becomes permanently violated upon a further occurrence of p.

As we have seen, violations cannot always be detected at the earliest possible moment by considering the colored local automata of constraints in isolation. However, they can be systematically detected by taking into account the colored global automaton of the whole specification.

Example 17. Figure 12 shows the colored global automaton of the Declare specification in Fig. 11. By playing the trace ⟨\$, p, u, \$, p⟩ therein, we obtain the following run:

$$s\_0 \xrightarrow{\$} s\_1 \xrightarrow{p} s\_4 \xrightarrow{u} s\_8 \xrightarrow{\$} s\_{12} \xrightarrow{p} s\_{12}$$

Clearly, the violation state s<sub>12</sub> is already reached in step 5, i.e., just after the second payment.

All in all, we can then monitor an evolving trace against a Declare specification as follows:


Figure 13 shows the result of applying this technique to our running example.
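The effect of the global automaton can be reproduced with a small Python sketch. It assumes only the two conflicting constraints of Example 16, encoded by hand as step functions; the full specification of Fig. 11 contains more constraints, and the state names of Fig. 12 are not reproduced. Global states are tuples of local states, a global state is final when all its components are final, and labels are computed by reachability on this product. The label names are ours.

```python
ALPHABET = ("$", "p", "u")


def at_most_one_p(state, event):
    """AtMostOne(p): counts occurrences of "p", saturating at 2 (violated sink)."""
    return min(state + 1, 2) if event == "p" else state


def chain_response(state, event):
    """ChainResponse($, p): 0 = idle, 1 = "p" must come next, 2 = violated sink."""
    if state == 2 or (state == 1 and event != "p"):
        return 2
    return 1 if event == "$" else 0


# Each local monitor: (step function, initial state, final states).
CONSTRAINTS = [(at_most_one_p, 0, {0, 1}), (chain_response, 0, {0})]


def global_step(state, event):
    """Move all local automata synchronously on the same event."""
    return tuple(step(s, event) for (step, _, _), s in zip(CONSTRAINTS, state))


def is_final(state):
    """A global state is final only if every local component is final."""
    return all(s in finals for (_, _, finals), s in zip(CONSTRAINTS, state))


def rv_label(state):
    """RV-LTL label of a global state, computed by reachability on the product."""
    seen, stack = {state}, [state]
    while stack:
        current = stack.pop()
        for event in ALPHABET:
            nxt = global_step(current, event)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    can_satisfy = any(is_final(s) for s in seen)
    can_violate = any(not is_final(s) for s in seen)
    if is_final(state):
        return "curr_sat" if can_violate else "perm_sat"
    return "curr_viol" if can_satisfy else "perm_viol"


def global_monitor(trace):
    """Label before any event, then after each event of the trace."""
    state = tuple(initial for _, initial, _ in CONSTRAINTS)
    verdicts = [rv_label(state)]
    for event in trace:
        state = global_step(state, event)
        verdicts.append(rv_label(state))
    return verdicts


print(global_monitor(("$", "p", "u", "$", "p")))
```

Replaying ⟨\$, p, u, \$, p⟩ in this sketch yields a permanent violation right after the second occurrence of \$, that is, one event before AtMostOne(p) becomes violated when monitored in isolation.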

An alternative approach, which is exploited in [64], is to compute, as done before, the global automaton as the cross-product of local automata, remembering, in each global state, the RV-LTL labels of all local states from which such a global state has been produced. In addition, no minimization step is applied on the resulting automaton. Once colored, this non-minimized, global colored automaton combines in a single device the contribution of all local monitors and that of the global monitor.

Fig. 12. The colored global automaton obtained as the (colored) cross-product of constraints in Fig. 10 as shown in Fig. 6(c), the states of which are decorated with the four RV-LTL truth values.

#### 5.3 A Note on Conformance Checking

In this section, we have focused on monitoring evolving traces against Declare specifications. This can be seen as a form of *online conformance checking*, aiming at detecting deviations at execution time. This technique can be seamlessly lifted to handle the standard conformance checking task, where conformance is evaluated on an event log containing full traces of already completed process executions (cf. [16]). In this setting, the global automaton is not needed anymore: a posteriori, it is not relevant to compute the earliest moment of a violation, but only to properly detect it at the trace level. The usage of local automata, one per constraint, is enough, and also has the advantage of producing informative feedback that indicates, trace by trace, how many (and which) constraints are satisfied or violated. Finer-grained feedback, like that based on the computation of trace alignments, has been extensively applied for procedural models (cf. [16]), and can also be recast in the declarative setting, aligning the log traces with the (closest) model traces accepted by the global automaton of the Declare specification of interest. This is an active line of research, which started from the seminal approach in [31].

Fig. 13. Monitoring with local and global colored automata, showing a case where the global automaton detects a violation before it actually manifests on a single constraint.
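For completed traces, the check with local automata reduces to running each automaton to the end of the trace and testing whether it stops in a final state. The following Python sketch assumes a small hypothetical log and hand-built encodings of AtMostOne(p) and ChainResponse(\$, p), used for illustration only; it returns, trace by trace, which constraints are satisfied.

```python
def at_most_one_p(state, event):
    """AtMostOne(p): counts occurrences of "p", saturating at 2 (violated sink)."""
    return min(state + 1, 2) if event == "p" else state


def chain_response(state, event):
    """ChainResponse($, p): 0 = idle, 1 = "p" must come next, 2 = violated sink."""
    if state == 2 or (state == 1 and event != "p"):
        return 2
    return 1 if event == "$" else 0


# One local automaton per constraint: (step function, initial state, final states).
CONSTRAINTS = {
    "AtMostOne(p)": (at_most_one_p, 0, {0, 1}),
    "ChainResponse($, p)": (chain_response, 0, {0}),
}


def check_trace(trace):
    """Map each constraint name to True (satisfied) or False (violated)."""
    report = {}
    for name, (step, state, finals) in CONSTRAINTS.items():
        for event in trace:
            state = step(state, event)
        report[name] = state in finals   # only the last state matters offline
    return report


def check_log(log):
    """Per-trace feedback for a log of completed traces."""
    return [check_trace(trace) for trace in log]


# Hypothetical log with three completed traces.
LOG = [("$", "p", "u"), ("$", "p", "u", "$", "p"), ("$", "u")]

for trace, report in zip(LOG, check_log(LOG)):
    violated = [name for name, ok in report.items() if not ok]
    print(trace, "violates", violated if violated else "nothing")
```

Counting the `False` entries per trace, or per constraint across the log, gives the kind of aggregated feedback mentioned above.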

## 6 Recent Advances and Outlook

We close this chapter by reporting about the most recent advances in the field of declarative process mining revolving around Declare, describing the current frontier of research, and highlighting open challenges.

#### 6.1 Beyond Declare Patterns

As we have seen in Sect. 3, a Declare specification consists of a repertoire of constraint templates grounded on specific activities. At the same time, such templates come with a logic-based semantics given in terms of LTL<sup>f</sup> . A natural question is then: can the techniques described in this chapter be used for the *entire* LTL<sup>f</sup> logic? This means, more precisely, considering the situation where each constraint corresponds to an arbitrary LTL<sup>f</sup> formula while, as usual, the specification formula is constructed by putting in conjunction the LTL<sup>f</sup> formulae of all its constituting constraints.

To answer this question, one has to separate the *logical* and *pragmatic* aspects involved in the different tasks we have been introducing. We do so focusing on reasoning, discovery, and monitoring.

*Reasoning.* As discussed in Sect. 4.2, all the reasoning tasks we have considered in this chapter can be lifted to the whole LTL<sup>f</sup> logic. Indeed, they are reduced to LTL<sup>f</sup> satisfiability/validity checking, which in turn can be tackled by checking (non-)emptiness of FSAs. The situation may change if one wants to provide more advanced debugging or diagnosis functionalities – for example, to return the most relevant conflicting set(s) of constraints that are causing inconsistencies or dead activities. While these types of problem can also be attacked at the level of the entire logic [25,79], focusing only on pre-defined patterns becomes necessary if one wants to involve humans in the loop or define preferences over constraints in the case where multiple explanations exist [25]. Considering specific patterns is also relevant when studying the computational complexity of reasoning on pattern combinations [44,45,91], or the scalability and effectiveness of reasoning tools [44,45,71,97].

*Discovery.* As pointed out in Sect. 5.1, two distinct process discovery problems are typically tackled in a declarative setting: discriminative discovery and specification mining.

The case of discriminative discovery is tightly related to classification and machine learning, allowing one to rely on general learning algorithms for declarative process mining. Such algorithms tackle general logical frameworks, such as Horn clauses in inductive logic programming or full temporal logics in model learning, and can thus go far beyond a pre-defined set of templates, either targeting full LTL<sup>f</sup> [15,82] or enriching the discoverable Declare templates with further key dimensions, such as metric temporal constraints, event attributes, and data conditions [21,23].

As shown in Sect. 5.1, standard discovery stands as a radically different problem, since the input event log provides a uniform set of (positive) examples, while no negative example is given. This calls for suitable metrics to measure *how well* a set of constraints characterizes the behaviour contained in the log. In the approach described in this chapter, such metrics are defined starting from the notions of constraint activation and target, which are template-specific. Attempts have been conducted to lift some of these notions (in particular that of activation and "relevant" satisfaction [39]) to full LTL<sup>f</sup> , but further research is needed to target the discovery of arbitrary LTL<sup>f</sup> formulae from event logs. Notice that while full LTL<sup>f</sup> discovery would enrich the expressiveness of the discovered specifications, it would on the other hand pose the issue of *understandability*: end users may struggle when confronted with arbitrary temporal formulae, while they are facilitated when pre-defined templates are used.

*Monitoring.* As we have discussed in Sect. 5.2, Declare monitoring is tackled using automata, and consequently works seamlessly for arbitrary LTL<sup>f</sup> formulae. As for advanced debugging techniques, the same considerations made for reasoning also hold for monitoring. For example, the detection of minimal conflicting sets of constraints in the case of early detection of violations caused by the interplay of multiple constraints can be tackled at the level of the full logic [66], but would require focusing on patterns if one wants to formulate preferences or incorporate human feedback [25].

Remarkably, working with FSAs allows us to define monitors for temporal formulae that go even beyond LTL<sup>f</sup> . In fact, LTL<sup>f</sup> is as expressive as star-free regular expressions, while automata are able to capture full regular expressions and, in turn, finite-trace temporal logics incorporating in a single formalism LTL<sup>f</sup> and regular expressions, such as Linear Dynamic Logic over finite traces (LDL<sup>f</sup> ) [30]. Working with LDL<sup>f</sup> in our setting has the specific advantage that we can express and monitor *metaconstraints*, that is, constraints that predicate on the RV-LTL truth values of other constraints [27,28].

#### 6.2 Dealing with Uncertainty

In the conventional definition of a Declare specification, constraints are interpreted as being *certain*: every model trace is expected to satisfy all constraints contained in the specification. Such an interpretation is too restrictive in scenarios where the specification should accommodate:


To deal with this form of *uncertainty*, Declare has been recently extended with *probabilistic constraints* [62]. In this framework, every probabilistic constraint comes with:


The interpretation of this constraint is that ϕ holds in a random trace generated by the process with a probability that is p. In frequentist terms, this can in turn be interpreted as follows: given a log of the process, the ratio of traces satisfying ϕ must be p.

Since a Declare specification contains multiple constraints, one has to consider how different probabilistic constraints interact with each other. In particular, n probabilistic constraints yield up to 2<sup>n</sup> possible so-called *scenarios*, each highlighting which probabilistic constraints hold and which are violated. Reasoning over such scenarios has to be conducted by suitably mixing their temporal and probabilistic dimensions. The former handles which combinations of constraints and their violations (i.e., which scenarios) are consistent, while the latter lifts the probability conditions attached to single constraints to discrete probability distributions over the possible scenarios.
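The frequentist reading and the notion of scenario can be made concrete with a short Python sketch. It is only an illustration: the two constraints, their encoding as Boolean functions over complete traces, and the four-trace log are hypothetical, and the sketch covers the counting side only, not the temporal reasoning that determines which scenarios are consistent.

```python
from collections import Counter
from itertools import product


def at_most_one_p(trace):
    """AtMostOne(p) on a complete trace."""
    return trace.count("p") <= 1


def response_pay_p(trace):
    """Response($, p): every "$" is eventually followed by a "p"."""
    pending = False
    for event in trace:
        if event == "$":
            pending = True
        elif event == "p":
            pending = False
    return not pending


CONSTRAINTS = [at_most_one_p, response_pay_p]


def scenario_of(trace):
    """A scenario records, per constraint, whether it holds on the trace."""
    return tuple(constraint(trace) for constraint in CONSTRAINTS)


def scenario_distribution(log):
    """Empirical distribution over all 2^n scenarios (unseen ones get 0.0)."""
    counts = Counter(scenario_of(trace) for trace in log)
    all_scenarios = product((True, False), repeat=len(CONSTRAINTS))
    return {scenario: counts[scenario] / len(log) for scenario in all_scenarios}


# Hypothetical log of completed traces.
LOG = [("$", "p"), ("$", "p", "$", "p"), ("$",), ("$", "p", "u")]

distribution = scenario_distribution(LOG)
for scenario, probability in distribution.items():
    print(scenario, probability)

# Ratio of traces satisfying AtMostOne(p): sum over scenarios where it holds.
print(sum(p for scenario, p in distribution.items() if scenario[0]))
```

Summing the scenario probabilities in which a given constraint holds recovers the ratio of traces satisfying that constraint, which is the quantity constrained by its probability condition.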

To carry out this form of combined reasoning, probabilistic constraints are formalized in a well-behaved fragment of the logic introduced in [61]. As it turns out, logical and probabilistic reasoning are loosely coupled in this fragment, and can be carried out resorting to standard finite-state automata and systems of linear inequalities. This approach has been used as the basis for defining a new family of *probabilistic declarative process mining* techniques [6].

#### 6.3 Mixed-Paradigm Models

In Fig. 1, we have intuitively contrasted declarative specifications and imperative models. The distinction between these two approaches is in reality not so crisp. In fact, a single process may contain parts that are more suitably captured using imperative languages, and parts that can be better described as declarative specifications. Take, for instance, a clinical guideline mixing administrative and therapeutic subprocesses [73].

To capture such *hybrid* processes, one needs a multi-paradigm approach that can combine imperative and declarative constructs in a single process model. One of the first proposals doing so is [85], where an imperative process can contain activities that are internally structured using so-called *pockets of flexibility* specified using declarative temporal constraints over a given set of tasks.

This layered approach has been further developed in [90], which brings forward a hierarchical model where each sub-process can be specified either as an imperative or declarative component. Discovery of hierarchical hybrid process models has been subsequently tackled in [87].

Multi-paradigm approaches providing a tighter integration between imperative and declarative components have also been studied. In [33], process models combining Petri nets and Declare constraints at the same modelling level are introduced and studied, singling out methodologies and techniques to handle the intertwined state space emerging from their interaction. Conformance checking for these mixed-paradigm models is extensively assessed in [95]. A different approach is brought forward in [5], where a Declare specification is used to express global constraints that "glue together" multiple imperative processes concurrently executed over the same instances. Automata-based techniques extending those illustrated in Sect. 5.2 are introduced to provide integrated monitoring functionalities dealing at once with the local processes and the global constraints.

At the current stage, further research is needed along the illustrated lines towards a solid theory and corresponding algorithmic techniques for *hybrid, mixed-paradigm process mining*.

#### 6.4 Multi-perspective Declare Specifications

Throughout the chapter, we have considered pure control-flow specifications, where a process is captured solely in terms of its constitutive activities and of behavioural constraints separating legal from undesired executions. While the control-flow provides the main process backbone, other equally important perspectives should also be taken into account as suggested already in [1]:


Several works have investigated the extension of Declare with additional perspectives. From the formal point of view, this requires extending the logic-based formalization of Declare with features that can capture resources, metric time, data, and conditions thereof, in turn resorting to variants of metric and/or first-order formalisms over finite traces [10,14,69,74]. It is important to stress that the boundaries between such features may be blurred, considering that data support (if equipped with suitable datatypes and conditions) may be used to predicate over resources and time as well.

Such multi-perspective features have been extensively embedded into Declare or related approaches (see, for example, [13,69,98] for constraints with metric time and [42] for constraints with metric time and resources). Next, we focus in more detail on the data dimension.

When it comes to data, two main lines of research can be identified. The first one deals with standard "case-centric" processes extended with event and case data. The second one focuses instead on "multi-case" processes, wherein constraints are expressed over multiple objects and their mutual relations. We briefly discuss each line separately.

*Declarative Process Specifications with Event/Case Data.* Within a process, activities may be equipped with data attributes that, at execution time, are grounded to actual data values by the involved resources. This means that events witnessing the occurrence of task instances come with a data payload. In addition, each process instance may evolve its own case data in response to the execution of activities.<sup>3</sup> Such case data may be stored in different ways, e.g., as key-value pairs or a full-fledged relational database. In this setting, it becomes crucial to extend Declare with so-called *data-aware constraints*, that is, constraints enriched with data-aware conditions over activities. The simple but illustrative example described next motivates why this is needed.

Example 18. We focus on a process where payments are issued by customers through a pay activity, which comes with an attribute indicating the paid amount, in Euros. Two consequent activities check and emit are executed to respectively inspect a payment and emit a receipt.

Let a log for this process contain multiple repetitions of the following traces:


<sup>3</sup> For conciseness of presentation, we will not distinguish between event and case data in our discussion, but technically they pose different, albeit tightly related, requirements.

(a) Conventional Declare specification (b) Object-centric Declare specification

Fig. 14. Comparison of conventional vs object-centric Declare.

One may wonder whether Response(pay, check) is a suitable constraint to explain (part of) the behaviour contained in the log. If considered unrestrictedly, this is not the case, as there are many traces where payment is not followed by any inspection. The situation changes completely if one restricts the scope of the constraint activation only to those payments that involve an amount of 100 or more.
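The effect of restricting the activation of Response(pay, check) can be illustrated with a minimal Python sketch. The two traces below are hypothetical stand-ins for the log of Example 18. Events are pairs of an activity name and a dictionary of attributes, and `response_holds` and `large_payment` are names of ours.

```python
def response_holds(trace, activation, target, condition=lambda attrs: True):
    """Response(activation, target), activated only when condition(attrs) holds."""
    pending = False
    for activity, attrs in trace:
        if activity == activation and condition(attrs):
            pending = True            # a relevant activation awaits its target
        elif activity == target:
            pending = False           # any later target discharges the obligation
    return not pending


def large_payment(attrs):
    """Activation condition: the paid amount is 100 Euros or more."""
    return attrs.get("amount", 0) >= 100


# Hypothetical log: a large payment that is checked, a small one that is not.
LOG = [
    [("pay", {"amount": 150}), ("check", {}), ("emit", {})],
    [("pay", {"amount": 20}), ("emit", {})],
]

print([response_holds(t, "pay", "check") for t in LOG])
print([response_holds(t, "pay", "check", large_payment) for t in LOG])
```

On this log, the unrestricted constraint is violated by the second trace, whereas the data-aware variant is satisfied by both, since the small payment never activates it.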

A number of works have brought forward combined techniques to discover Declare constraints equipped with various forms of data conditions [54,60,86], to check conformance for data-aware constraints [12,13], and to handle their monitoring [5,69]. This extension has to be carried out with extreme care, as combining event data and time quickly leads to undecidability of reasoning [14,34,35]. Therefore, such techniques have to operate in a limited fashion, or suitably control the expressiveness of data conditions and the way they interact with time.

*Object-Centric Declarative Process Specifications.* So far, we have discussed the extension of Declare with event or case data. In a more general setting, data may refer to more complex networks of objects and their mutual relations, simultaneously co-evolved by one or multiple processes. In this type of processes, known under the umbrella term of *object-centric processes*, there is no single, pre-defined notion of case, and process executions cannot consequently be represented as flat traces, but call for richer representations (cf. [43]). The following example illustrates why Declare, in its conventional version, cannot be used to capture *object-centric* processes.

Example 19. Consider the fragment of an order-to-cash process, containing three activities: sign (indicating the signature of a GDPR form by the customer), open (the opening of an order), and close (the closing of an order). Two constraints apply to close, defining under which conditions it becomes executable:


Figure 14(a) shows how these two constraints can be captured in conventional Declare. This specification is satisfactory only in the case where each trace refers to a single customer and a single order by that customer. For example, consider the following two traces, respectively referring to an order o<sub>1</sub> by Anne, and an order o<sub>2</sub> by Bob:

$$t\_1 = \langle \text{sign}, \text{open}, \text{close} \rangle \qquad \qquad t\_2 = \langle \text{open}, \text{close}, \text{sign} \rangle$$

Clearly, t<sub>1</sub> is a model trace, while t<sub>2</sub> is not, as the latter violates Precedence(sign, close).

However, one may need to consider multiple orders owned by the same or distinct customers, in the common situation where distinct orders may be later bundled together to handle their shipment. In our example, assuming that o<sub>1</sub> and o<sub>2</sub> are later bundled together in a shipment, this would require combining t<sub>1</sub> and t<sub>2</sub> in a single object-centric trace, suitably extending each event with a reference to the object(s) it operates on. Suppose this would result in:

$$t = \left\langle \begin{array}{l} \text{sign(customer=Anne)}, \text{open(order=o2)}, \text{open(order=o1)},\\ \text{close(order=o1)}, \text{close(order=o2)}, \text{sign(customer=Bob)} \end{array} \right\rangle$$

The Declare specification of Fig. 14(a) now becomes inadequate. In fact, it cannot distinguish which events actually *co-refer* to one another and which do not, so it cannot identify that the first signature by Anne refers to the first occurrence of close, but not to the second one. Hence, it wrongly uses the first occurrence of sign to satisfy Precedence(sign, close) for both orders.
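The difference between flat and scoped evaluation can be replayed on the trace t above. The sketch below is a simplified illustration in Python, not a formal object-centric semantics: it assumes that each order is owned by one customer (o<sub>1</sub> by Anne and o<sub>2</sub> by Bob, as stated in the example) and reads the scoped constraint as requiring that an order be closed only after its own customer has signed. The function names are ours.

```python
# The object-centric trace t: each event carries the object it operates on.
TRACE = [
    ("sign", {"customer": "Anne"}),
    ("open", {"order": "o2"}),
    ("open", {"order": "o1"}),
    ("close", {"order": "o1"}),
    ("close", {"order": "o2"}),
    ("sign", {"customer": "Bob"}),
]

# Ownership relation taken from the example text (order -> customer).
OWNER = {"o1": "Anne", "o2": "Bob"}


def flat_precedence(trace):
    """Precedence(sign, close) ignoring objects: any earlier sign is enough."""
    signed = False
    for activity, _ in trace:
        if activity == "sign":
            signed = True
        elif activity == "close" and not signed:
            return False
    return True


def scoped_violations(trace, owner):
    """Orders closed before the customer owning them has signed."""
    signed_customers, offending = set(), []
    for activity, objects in trace:
        if activity == "sign":
            signed_customers.add(objects["customer"])
        elif activity == "close" and owner[objects["order"]] not in signed_customers:
            offending.append(objects["order"])
    return offending


print(flat_precedence(TRACE))
print(scoped_violations(TRACE, OWNER))
```

The flat check accepts t because Anne's signature precedes both closings, while the scoped check singles out order o<sub>2</sub>, which is closed before Bob signs.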

Fixing the issue described in Example 19 requires explicitly extending Declare with the ability to express how events relate to objects and how objects relate to each other, and in turn to *scope* the application of constraints, expressing that they must be enforced over events that suitably co-refer to each other, either because they operate on the same object or because they operate on related objects. In our running example, this would call for the following actions:


*Object-centric behavioral constraints (OCBC)* [93] have been brought forward to handle this type of scoping through the integration of Declare specifications and UML class diagrams. Figure 14(b) shows the OCBC specification correctly capturing the constraints of Example 19. The approach is still in its infancy: some first seminal works have been conducted to handle discovery of OCBC specifications from object-centric event logs recording full database transactions [55], and to formalize and reason upon OCBC specifications through temporal description logics [7]. Further research is being carried out to improve the performance of discovery and frame it in the context of object-centric event logs of the form of [1], and to tackle conformance checking and monitoring. This is particularly challenging, as integrating temporal constraints with data models quickly leads to undecidability [7].

## 7 Conclusion

Throughout this chapter, we have thoroughly reviewed the declarative approach to process specification and mining. The declarative approach aims at limiting the process behavior by defining the boundaries within which its executions can unfold, yet leaving process executors free to explore at runtime which specific executions are generated. This is in contrast with the imperative approach, where process models compactly depict all and only those traces that are admissible. In fact, notice that different (imperative) process models can comply with the same declarative specification, just like different dynamic systems can model (|=) a set of temporal rules. In the chapter, we have grounded our discussion on the Declare language, but the introduced concepts are broad enough to be seamlessly applicable to other related approaches.

Specifically, we have first discussed how declarative process specifications can be formalized using Linear Temporal Logic on Finite Traces (LTL<sup>f</sup>), and in turn operationally characterized in terms of finite-state automata (FSAs) for their execution semantics. On this solid formal ground, we have examined the core reasoning tasks that relate to declarative specifications and then delved deeper into the discovery and monitoring of processes according to the declarative paradigm. Interestingly, we have observed that the reasoning tasks are pervasive in all stages of declarative process mining, such as within discovery to avoid producing redundant or inconsistent outputs, and within monitoring to speculatively consider the possible future continuations of the monitored execution. In the last part of the chapter, we have provided a summary of the most recent advances in declarative process mining, focusing in particular on: *(i)* the applicability of declarative process mining techniques and concepts to full temporal logics, going beyond predefined patterns; *(ii)* the incorporation of uncertainty within constraints; *(iii)* the analysis of hybrid models integrating imperative and declarative fragments; *(iv)* multi-perspective constraints incorporating additional dimensions beyond the control-flow, and supporting the declarative specification of object-centric (multi-case) processes. This bird's-eye view provides a fair account of the open research challenges in declarative process mining.

Acknowledgments. The authors want to thank Fabrizio Maria Maggi, Wil van der Aalst, Alessio Cecconi, Federico Chesani, Giuseppe De Giacomo, Riccardo De Masellis, Johannes De Smedt, Massimo Mecella, Paola Mello, Jan Mendling, Maja Pesic, and Johannes Prescher for the long-standing cooperation and years of joint work that led to this chapter. The work of the authors has received funding from the Italian Ministry of University and Research under the PRIN programme, grant B87G22000450001 (PINPOINT). The work of C. Di Ciccio was partly funded by the Italian Ministry of University and Research under grant "Dipartimenti di eccellenza 2018–2022" of the Department of Computer Science at the Sapienza University of Rome and the Sapienza research project SPECTRA. The work of M. Montali was partly funded by the UNIBZ projects WineID, SMART-APP, QUEST, and VERBA.

## References


Soffer, P., Völzer, H. (eds.) BPM 2014. LNCS, vol. 8659, pp. 1–17. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10172-9\_1



**Conformance Checking**

# **Conformance Checking: Foundations, Milestones and Challenges**

Josep Carmona<sup>1</sup>, Boudewijn van Dongen<sup>2(✉)</sup>, and Matthias Weidlich<sup>3</sup>

<sup>1</sup> Universitat Politècnica de Catalunya, Barcelona, Spain

<sup>2</sup> Eindhoven University of Technology, Eindhoven, The Netherlands

B.F.v.Dongen@tue.nl

<sup>3</sup> Humboldt-Universität zu Berlin, Berlin, Germany

**Abstract.** By relating observed and modelled behaviour, conformance checking unleashes the full power of process mining. Techniques from this discipline enable the analysis of the quality of a process model discovered from event data, the identification of potential deviations, and the projection of real traces onto process models. This way, the insights gained from the available event data can be transferred to a richer conceptual level, amenable to human interpretation. The aforementioned functionalities are grounded on the use of conformance checking artefacts that make explicit the relation between observed and modelled behaviour. This chapter describes these artefacts, and builds upon them to gain evidence-based insights on the processes of an organization. Moreover, we give an overview of the applications of conformance checking and propose a general framework that incorporates these applications. Finally, milestones and challenges of the field are outlined.

## **1 Introduction**

Organisations tend to define, by means of conceptual models, complex business processes that must be followed to achieve their objectives [22]. Sometimes the corresponding processes are distributed in different systems, and most of the cases include human tasks, enabling the occurrence of unexpected deviations with respect to the (normative) process model. This is aggravated by the appearance of more and more complex processes, where the observations are provided by heterogeneous sources, such as Internet-of-Things (IoT) devices involved in Cyber-physical Systems [46].

Conformance checking techniques provide mechanisms to relate modelled and observed behaviour, so that the frictions between the footprints left by process executions and the process models that formalise the expected behaviour can be revealed [14]. As discussed in the first chapters of this book, process executions are often materialised and stored by means of event logs. Table 1 shows an example of an event log for a loan application process.

© The Author(s) 2022

Conformance checking is expected to be the fastest growing segment in process mining for the next years<sup>1</sup>. The main reason for this forthcoming industrial interest is the promise of having event data and process models aligned, thus increasing the value of process models within organizations.

Given an event log and a process model, conformance checking techniques yield some explicit description of their consistent and deviating parts, here referred to as a *conformance artefact*. In the first part of this chapter, we focus on three main conformance artefacts that cover most of the spectrum of conformance checking: artefacts grounded in rule checking, token replay, and alignments.



**Fig. 1.** Example of conformance checking in Celonis.

Remarkably, a conformance artefact enables conclusions on the relation between the event log and the process model. By interpreting the conformance artefact, for instance, the fitness and precision of the model regarding the given log is quantified. Such an interpretation may further involve decisions on how to weight and how to attribute any encountered deviation (see the end of this chapter for a discussion on this topic). Since the log and the model are solely representations of the process, both of them may differ in how they abstract the process.

<sup>1</sup> https://www.marketsandmarkets.com/Market-Reports/process-analytics-market-254139591.html.

Differences in the representations of a process may, of course, be due to inaccuracies. For example, an event log may be recorded by an erroneous logging mechanism (see next chapter of this handbook for understanding this in depth), whereas a process model may be outdated. Yet, differences may also be due to different purposes and constraints that guide how the process is abstracted and therefore originate from the pragmatics of the respective representation of the process. Think of a logging mechanism that does not track the execution of a specific activity due to privacy considerations, or a model that outlines only the main flow of the process to clarify its high-level phases. Either way, the respective representations are not *wrong*, but differ because of their purpose and the constraints under which they have been derived.

By linking an event log and a process model through a conformance artefact, the understanding of the underlying process can be improved. That includes techniques for process enhancement (see [18]). For instance, traces of an event log can be replayed in the process model, while taking into account the deviations between the log and model as materialised in the conformance artefact. Commercial tools that include conformance checking nicely display these deviations on their dashboards, as can be seen in Fig. 1. Another example includes the inspection of the conditions that govern the decision points in a process. The conformance artefact can be used to derive a classification problem per decision point, which enables discovery of the respective branching conditions. Assuming that the model represents the desired behaviour of the process, the conformance artefact further enables conclusions on how the current realisation of the process needs to be adapted.

There exist different algorithmic perspectives to relate modelled and observed behaviour: rule checking, token-replay and alignments.

A process model defines a set of tasks along with causal dependencies for their execution. As such, a process model constrains the possible behaviour of a process in terms of its execution sequences. Instead of considering the set of possible execution sequences of a process model, however, the basic idea of rule-based conformance checking is to exploit rules that are satisfied by all these sequences as the basis for analysis. Such rules define a set of constraints that are imposed by the process model. Verification of these constraints with respect to the traces of an event log, therefore, enables the identification of conformance issues.

Unlike rule checking that is grounded in information derived from the process model, token replay takes the event log as the starting point for conformance analysis. As indicated already by its name, this technique replays each trace of the event log in the process model by executing tasks according to the order of the respective events. By observing the states of the process model during the replay, it can be determined whether, and to what extent, the trace indeed corresponds to a valid execution sequence of the model.

Despite the two aforementioned classes of techniques to relate modelled and observed behaviour, most conformance checking techniques rely on the notion of alignment [1]: given an observed trace σ, query the model to obtain the execution sequence γ that is most similar to σ. Computing alignments is challenging, since it requires exploring the state space of the model, an object that is worst-case exponential with respect to the size of the model or the trace.


**Table 1.** Example of a log of the loan application process.

Once conformance artefacts are computed, the next natural step is to use them. The main applications arising from these artefacts are listed in this chapter as well, as a gentle introduction to some of the chapters devoted to this in this book. We highlight performance analysis and decision point analysis as natural examples of the application of conformance checking.

Furthermore, depending on the trust we place in the two main elements (trust in the log, trust in the model), conformance checking can be generalized as a framework that unifies diverse analysis techniques in the field of process mining [48]. As such, this framework includes several instantiations already known to the reader.

We finish the chapter by listing important milestones and challenges, some of them being already under consideration by the research community, like the computational feasibility of the underlying techniques.

## **2 Relating Observed and Modelled Behaviour: The Basics**

In this section, we discuss the basic notions and techniques to relate observed and modelled behaviour. To this end, we first review generic quality dimensions on this relation (Sect. 2.1). Subsequently, we turn to three different types of conformance checking artefacts that capture the relation between a trace observed in the event log and a process model, namely artefacts grounded in rule checking (Sect. 2.2), token replay (Sect. 2.3), and alignments (Sect. 2.4), see also Fig. 2. A detailed explanation of the contents of this section can be found in [14].

**Fig. 2.** General approaches to conformance checking and resulting conformance artefacts (from [14]): rule checking, token replay, and alignments. All techniques take a trace of an event log and a process model as input. However, conceptually, rule checking starts from the behaviour of the process model, extracting constraints to check for a trace. Token replay, in turn, starts from the behaviour of a single trace, trying to replay the trace in a process model. Alignments, in turn, adopt an inherently symmetric view.

#### **2.1 Quality Dimensions to Relate Process Models and Event Logs**

**Fig. 3.** Example process model of a loan application process in BPMN.

By relating observed and modelled behaviour, an organization can get insights on the execution of its processes with respect to the expectations as described in the models. If both process model M and event log L are considered as languages, their relation can be used to measure how well a process model describes the behaviour recorded in an event log.

Hence, confronting M and L can help in understanding the complicated relation between modelled and recorded behaviour. We now provide two views on this relation that represent two alternative perspectives: *fitness* and *precision*. To illustrate this, in this chapter we will be using a process for a loan application. A process model illustrating this process is described in Fig. 3. According to this model, a submitted application is either accepted or rejected, depending on the applicant's data. An accepted application is finalised by a worker, in parallel with the offer process. For each application, an offer is selected and sent to the customer. The customer reviews the offer and sends it back. If the offer is accepted, the process continues with the approval of the application and the activation of the loan. If the customer declines the offer, the application is also declined and the process ends. However, the customer can also request a new offer, in which case the offer is cancelled and a new offer is sent to the customer.

Fitness measures the ability of a model to explain the execution of a process as recorded in an event log (see Fig. 4 for an example of fitting behaviour). It is the main measure to assess whether a model is well-suited to explain the recorded behaviour. To explain a certain trace, the process model is queried to assess its ability to replay the trace, taking into account the control flow logic expressed in the model.

In general, fitness is the fraction of the behaviour of the log that is also allowed by the model. It can be expressed as follows.

$$fitness = \frac{|L \cap M|}{|L|} \tag{1}$$

Let us have a look at this fraction in more detail by examining the extreme cases. Fitness is 1, if the entire behaviour that we see in the log L is covered by the model M. Conversely, fitness is 0, if no behaviour in the log L is captured by the model M. In the remainder of this section, we will describe three different algorithms deriving artefacts that can be used to evaluate fitness.

We define a trace to be either *fitting* (it corresponds to an execution sequence of the model) or *non-fitting* (there is some deviation with respect to all execution sequences of the model). For instance, the trace corresponding to case A5634 in our running example is fitting, since there is an execution sequence of the model that perfectly reproduces this case, as shown in Fig. 4. In contrast, Fig. 5 shows the information for a trace that does not contain the event to signal that the application has been finalised (*Fa*).

**Fig. 4.** Loan application process model with highlighted path corresponding to the fitting trace ⟨*As, Aa, Fa, Sso, Ro, Do, Da, Af*⟩ of case *A*5634 from the event log of Table 1.

**Fig. 5.** Loan application process model with highlighted path corresponding to a trace ⟨*As, Aa, Sso, Ro, Do, Da, Af*⟩, which does not include an event to signal that the application has been finalised (*Fa*). In magenta, we show that the task (*Fa*) has not been observed, but it is required to reach the final state of the process model.

Precision is the counterpart of fitness. It can be calculated by looking at the fraction of the model behaviour that is covered in the log.

$$precision = \frac{|L \cap M|}{|M|} \tag{2}$$

We see that precision shares the numerator in the fraction with fitness from (1). This implies that if we have a log and a model with no shared behaviour, fitness is zero, and by definition also precision is zero. However, the denominator is replaced with the amount of modelled behaviour.
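To make Eqs. (1) and (2) concrete, the following minimal sketch treats both the log and the model as finite sets of traces. This is an assumption made purely for illustration: the model of Fig. 3 contains a loop (a new offer can be requested repeatedly), so its language is infinite and Eq. (2) cannot be evaluated this way in practice. The traces listed under `model` are hypothetical and do not claim to be the language of Fig. 3.

```python
def fitness(log: set, model: set) -> float:
    """Equation (1): share of the log behaviour that the model allows."""
    return len(log & model) / len(log)


def precision(log: set, model: set) -> float:
    """Equation (2): share of the model behaviour that the log covers."""
    return len(log & model) / len(model)


# Hypothetical toy data: each trace is a tuple of activity abbreviations.
log = {
    ("As", "Aa", "Fa", "Sso", "Ro", "Do", "Da", "Af"),  # fitting trace of case A5634
    ("As", "Aa", "Sso", "Ro", "Do", "Da", "Af"),        # trace without Fa
}
model = {  # assumed finite language, NOT the real language of Fig. 3
    ("As", "Aa", "Fa", "Sso", "Ro", "Do", "Da", "Af"),
    ("As", "Aa", "Sso", "Fa", "Ro", "Do", "Da", "Af"),
    ("As", "Aa", "Fa", "Sso", "Ro", "Ao", "Aaa", "Af"),
    ("As", "Da", "Af"),
}

print(fitness(log, model))    # 0.5
print(precision(log, model))  # 0.25
```

Only one of the two logged traces is allowed by the assumed model, so fitness is 0.5, and only one of the four modelled traces was observed, so precision is 0.25.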

In summary, for the two main metrics reported above, algorithms that can assess the relation between log and model need to be considered. In the next section, we describe the three main algorithmic perspectives to accomplish this task. For an extensive analysis of metrics to assess the relation between observed and modelled behaviour, including metrics like *generalization* or *simplicity*, the reader is referred to [14]. Intuitively, generalization complements precision by quantifying the amount of behaviour that is modelled in a process model, but not observed in an event log. In practice, an event log cannot be expected to be complete, i.e., to contain all possible process behaviour (e.g., all possible interleavings of concurrent activities or all possible numbers of iterations of repetitive behaviour). Hence, a process model is typically assumed to generalize to some extent, i.e., not to show perfect precision, and generalization measures aim to quantify this amount of imprecision. Simplicity, in turn, refers to the structure and complexity of the model. Intuitively, simplicity measures induce some preference among process models that behave similarly in terms of the other dimensions, with the argument being that simple models are generally to be preferred.

#### **2.2 Rule Checking**

The basic idea of rule-based conformance checking is to exploit rules that are satisfied by all the execution sequences of a process model as the basis for analysis. Such rules define a set of constraints that are imposed by the process model. The verification of these constraints with respect to the traces of an event log, therefore, enables the identification of conformance issues.

Considering the running example of our loan application process as depicted in Fig. 3, rules derived from the process model include:


A careful inspection of each of the rules above reveals that they are different in nature: rule R1 is an example of a *cardinality rule*, which defines an upper and lower bound for the number of executions of an activity. Rule R2 contains a *precedence rule*, which establishes that the execution of a certain activity is preceded by at least one execution of another activity. Rule R3 establishes an *ordering rule*, whereas rule R4 represents an *exclusiveness rule*. Tables 2 and 3 show examples of precedence and exclusiveness rules, respectively, for the running example and two log traces.

**Table 2.** Precedence rules derived for the process model of the running example and their satisfaction (✓) and violation (✗) by the exemplary log trace ⟨*As, Sso, Fa, Ro, Co, Ro, Aaa, Af*⟩. Each non-empty cell refers to a precedence rule. For instance, the activity to finalize the application (*Fa*) is preceded by the submission of the application (*As*) and the acceptance of the application (*Aa*). Yet, only the former rule is satisfied, whereas the latter one is violated in the given trace.


By assessing to what extent the traces of a log satisfy the rules derived from a process model, rule-based conformance checking focuses on the fitness dimension, i.e., the ability of the model to explain the recorded behaviour. Traces are fitting, if they satisfy the rules, or non-fitting if that is not the case. Let R<sub>M</sub> be a predefined set of rules. Fitness can be defined according<sup>2</sup> to R<sub>M</sub>:

$$\text{fitness}(L, M) = \frac{|\{r \in R_M \mid r \text{ is satisfied by all } t \in L\}|}{|R_M|} \tag{3}$$

As the reader may already have grasped, the dimension of precision is not targeted by rule-checking.
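The rule types and Eq. (3) can be sketched in a few lines. The two precedence rules and the exclusiveness rule below are the ones discussed in the captions of Tables 2 and 3, and the two traces are the exemplary traces of those tables. The cardinality rule ("*As* occurs exactly once") and all function names are assumptions introduced only for this illustration.

```python
from typing import Callable, Dict, Sequence

Trace = Sequence[str]
Rule = Callable[[Trace], bool]


def cardinality(activity: str, low: int, high: int) -> Rule:
    """The activity occurs between `low` and `high` times (inclusive)."""
    return lambda trace: low <= list(trace).count(activity) <= high


def precedence(first: str, then: str) -> Rule:
    """Every occurrence of `then` is preceded by at least one `first`."""
    def check(trace: Trace) -> bool:
        seen_first = False
        for activity in trace:
            if activity == first:
                seen_first = True
            elif activity == then and not seen_first:
                return False
        return True
    return check


def exclusive(a: str, b: str) -> Rule:
    """The two activities never occur in the same case."""
    return lambda trace: not (a in trace and b in trace)


def rule_fitness(log: Sequence[Trace], rules: Dict[str, Rule]) -> float:
    """Equation (3): share of rules satisfied by all traces of the log."""
    satisfied = [name for name, rule in rules.items()
                 if all(rule(trace) for trace in log)]
    return len(satisfied) / len(rules)


rules = {
    "As exactly once (assumed rule)": cardinality("As", 1, 1),
    "As precedes Fa": precedence("As", "Fa"),
    "Aa precedes Fa": precedence("Aa", "Fa"),
    "Ao excludes Da": exclusive("Ao", "Da"),
}
log = [
    ("As", "Sso", "Fa", "Ro", "Co", "Ro", "Aaa", "Af"),       # trace of Table 2
    ("As", "Aa", "Sso", "Ro", "Fa", "Ao", "Do", "Da", "Af"),  # trace of Table 3
]
print(rule_fitness(log, rules))  # 0.5
```

The first trace violates "Aa precedes Fa" and the second violates "Ao excludes Da", so two of the four rules hold for the whole log. As footnote 2 points out, the value depends entirely on which rules are placed in R<sub>M</sub>.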

<sup>2</sup> Notice that this makes fitness to depend on a particular set of rules, which is a limitation of the rule-based fitness checking.

**Table 3.** Exclusiveness rules derived for the process model of the running example and their satisfaction (✓) and violation (✗) by the exemplary log trace ⟨*As*, *Aa*, *Sso*, *Ro*, *Fa*, *Ao*, *Do*, *Da*, *Af*⟩. Again, each non-empty cell denotes a rule, i.e., the absence of the execution of two activities for the same case. For instance, the acceptance of an offer (*Ao*) must not be executed for cases for which the application is declined (*Da*). Yet, in the given trace, respective events for both activities can be found, so that the rule is marked as being violated.


#### **2.3 Token Replay**

Intuitively, this technique replays each trace of the event log in the process model by executing tasks according to the order of the respective events. By observing the states<sup>3</sup> of the process model during the replay, one can determine whether, and to what extent, the trace indeed corresponds to a valid execution sequence of the process model.

In essence, token replay postulates that each trace in the event log corresponds to a valid execution sequence of the process model. This is verified by step-wise executing tasks of the process model, according to the order of the respective events in the trace. During this replay, we may observe two cases that hint at non-conformance (see Fig. 6):


<sup>3</sup> A state of a BPMN model is a distribution of tokens over the control flow arcs. A task is enabled in a state if its incoming control flow arc is assigned a token by the respective distribution. If it executes, this token is *consumed*, i.e., no longer assigned to the arc. Moreover, a token is *produced* on the outgoing control flow arc of the task.

**Fig. 6.** State reached after replaying the full trace ⟨*As, Aa, Sso, Ro, Ao, Aaa, Aaa*⟩. One can see that there are three remaining tokens (denoted by yellow background), and two missing tokens (denoted by dashed red lines). (Color figure online)

By exploring whether the replay of a trace yields missing or remaining tokens, replay-based conformance checking mainly focuses on the fitness dimension. That is, the ability of the model to explain the recorded behaviour is the primary concern. Traces are fitting if their replay does not yield any missing or remaining tokens, and non-fitting otherwise:

$$\text{fitness}(L, M) = \frac{1}{2} \left( 1 - \frac{\sum_{t \in L} \mathit{missing}(t, M)}{\sum_{t \in L} \mathit{consumed}(t, M)} \right) + \frac{1}{2} \left( 1 - \frac{\sum_{t \in L} \mathit{remaining}(t, M)}{\sum_{t \in L} \mathit{produced}(t, M)} \right) \tag{4}$$

In contrast to rule checking, precision can be estimated using token replay [34], but unfortunately, the corresponding technique strongly relies on the assumption that traces are fitting; if they are not, then the estimation of precision through token replay can be significantly degraded [2].
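The counting behind Eq. (4) can be illustrated with a small replay routine. The sketch below makes several assumptions that are not taken from the chapter: the model is a place/transition net rather than a BPMN diagram, every activity in the log has exactly one matching transition, and the net is a hypothetical three-step sequence A, B, C rather than the model of Fig. 3. It follows the common convention that the environment produces the initial token and consumes the final one.

```python
from collections import Counter


def replay(trace, net, initial, final):
    """Replay one trace; return (missing, consumed, remaining, produced)."""
    marking = Counter(initial)
    produced = sum(initial.values())  # tokens put in place by the environment
    consumed = missing = 0

    def consume(place):
        nonlocal consumed, missing
        if marking[place] == 0:       # token not available: create it artificially
            missing += 1
            marking[place] += 1
        marking[place] -= 1
        consumed += 1

    for activity in trace:
        inputs, outputs = net[activity]
        for place in inputs:
            consume(place)
        for place in outputs:
            marking[place] += 1
            produced += 1
    for place, tokens in final.items():  # environment removes the final marking
        for _ in range(tokens):
            consume(place)
    remaining = sum(marking.values())
    return missing, consumed, remaining, produced


def replay_fitness(log, net, initial, final):
    """Equation (4), aggregated over all traces of the log."""
    m = c = r = p = 0
    for trace in log:
        dm, dc, dr, dp = replay(trace, net, initial, final)
        m, c, r, p = m + dm, c + dc, r + dr, p + dp
    return 0.5 * (1 - m / c) + 0.5 * (1 - r / p)


# Hypothetical sequential net: start -> A -> p1 -> B -> p2 -> C -> end
net = {"A": (["start"], ["p1"]), "B": (["p1"], ["p2"]), "C": (["p2"], ["end"])}
initial, final = {"start": 1}, {"end": 1}

print(replay(("A", "B", "C"), net, initial, final))  # (0, 4, 0, 4)
print(replay(("A", "C"), net, initial, final))       # (1, 3, 1, 3)
print(replay_fitness([("A", "B", "C"), ("A", "C")], net, initial, final))
```

Skipping B leaves one token behind in `p1` (remaining) and forces one token to be created in `p2` (missing), so the two-trace log obtains a fitness of 6/7.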

#### **2.4 Alignments**

Alignments take a symmetric view on the relation between modelled and recorded behaviour. Specifically, they can be seen as an evolution of token replay. Instead of establishing a link between a trace and sequences of task executions in the model through replay, alignments directly connect a trace with an execution sequence of the model.

An alignment connects a trace of the event log with an execution sequence of the process model. It is represented by a two-row matrix, where the first row consists of activities as their execution is signalled by the events of the trace and a special symbol ≫ (jointly denoted by e<sub>i</sub> below), and the second row consists of the activities that are captured by task executions of an execution sequence of the process model and a special symbol ≫ (jointly denoted by a<sub>i</sub>):

$$\begin{array}{l|c|c|c|c} \text{log trace} & e_1 & e_2 & \cdots & e_n \\ \hline \text{execution sequence} & a_1 & a_2 & \cdots & a_n \end{array}$$


Each column in this matrix, a pair (e<sub>i</sub>, a<sub>i</sub>), is a *move* of the alignment, meaning that an alignment can also be understood as a sequence of moves. There are different types of such moves, each encoding a different situation that can be encountered when comparing modelled and recorded behaviour. We consider three types of moves:


Alignments are constructed only from these three types of moves (see an in-depth explanation of this in [14]). For instance, let us use the running example (see Fig. 3) and the trace ⟨*As*, *Aa*, *Sso*, *Ro*, *Ao*, *Aaa*, *Aaa*⟩. A possible alignment with this trace is:

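$$\begin{array}{l|c|c|c|c|c|c|c|c|c} \text{Log trace} & \mathit{As} & \mathit{Aa} & \mathit{Sso} & \mathit{Ro} & \gg & \mathit{Ao} & \mathit{Aaa} & \mathit{Aaa} & \gg \\ \hline \text{Execution sequence} & \mathit{As} & \mathit{Aa} & \mathit{Sso} & \mathit{Ro} & \mathit{Fa} & \mathit{Ao} & \mathit{Aaa} & \gg & \mathit{Af} \end{array}$$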

This alignment comprises six synchronous moves, one log move, (*Aaa*, ≫), and two model moves, (≫, *Fa*) and (≫, *Af*). The log move (*Aaa*, ≫) indicates that the application had been approved and activated, even though this was not expected in the current state of processing (as this had just been done). The model move (≫, *Fa*) is the situation of the process model requiring that the application be finalised, which has not been done according to the trace. Furthermore, one can easily extract the original trace by projecting away the special symbol for skipping from the top row. Applying the projection to the bottom row yields the execution sequence of the model ⟨*As*, *Aa*, *Sso*, *Ro*, *Fa*, *Ao*, *Aaa*, *Af*⟩.

In general, *optimal alignments*, i.e., alignments with a minimal number of model or log moves, are preferred. The alignment shown above is optimal since there is no other alignment with fewer deviations. Computing (optimal) alignments is a hot research topic, which has been addressed in many papers in recent years [1,9,19,31,39,44,45,56–58,60,66,68]. In this chapter, however, we will refrain from describing the state-of-the-art methods for alignment computation, and refer the interested reader to the aforementioned papers, or to [14].

Moreover, the optimality of alignments may also be generalized in terms of a cost function that assigns costs to particular model moves or log moves, thereby enabling the categorization of deviations in terms of their severity. Then, an alignment is optimal, if the sum of costs assigned to all its moves is minimal. Setting the cost for all model moves and log moves to one, and for synchronous moves to zero, yields the aforementioned notion of optimality, i.e., alignments with a minimal number of model or log moves.
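The notions of move type and cost function can be made tangible with the alignment of the running example. In the sketch below, an alignment is a list of (log, model) pairs and the string `">>"` stands for the skip symbol ≫; this encoding, the function names and the severity weights of the second cost function are assumptions chosen for illustration only.

```python
SKIP = ">>"

# The alignment of the trace <As, Aa, Sso, Ro, Ao, Aaa, Aaa> discussed above.
alignment = [
    ("As", "As"), ("Aa", "Aa"), ("Sso", "Sso"), ("Ro", "Ro"),
    (SKIP, "Fa"),   # model move: finalisation required but not observed
    ("Ao", "Ao"), ("Aaa", "Aaa"),
    ("Aaa", SKIP),  # log move: second approval not allowed by the model
    (SKIP, "Af"),   # model move: final activity not observed
]


def move_type(move):
    """Classify a move as 'sync', 'log' or 'model'."""
    log_part, model_part = move
    if log_part == SKIP:
        return "model"
    if model_part == SKIP:
        return "log"
    return "sync"


def unit_cost(move):
    """Standard cost function: deviations cost 1, synchronous moves cost 0."""
    return 0 if move_type(move) == "sync" else 1


def severity_cost(move):
    """Hypothetical weighting: skipping the finalisation is penalised more."""
    if move == (SKIP, "Fa"):
        return 5
    return unit_cost(move)


def alignment_cost(moves, cost=unit_cost):
    return sum(cost(move) for move in moves)


# Projecting away the skip symbol recovers the trace and the execution sequence.
trace = [l for l, _ in alignment if l != SKIP]
execution = [m for _, m in alignment if m != SKIP]

print(alignment_cost(alignment))                 # 3
print(alignment_cost(alignment, severity_cost))  # 7
```

Under the unit cost function, the cost equals the number of log and model moves (three here), which is the notion of optimality used in the text. A different cost function can change which alignment is optimal.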

Remarkably, alignments provide a simple means to quantify fitness. Again, this may be done at the level of an individual trace or of the event log as a whole. However, the aggregated cost of log moves and model moves may be a misleading measure, as it is not normalised. A common approach, therefore, is to normalise this cost by dividing it by the worst-case cost of aligning the trace with the given model. Under a uniform assignment of costs to log and model moves, such a worst-case cost originates from an alignment in which each event of the trace T<sub>i</sub> relates to a log move, whereas all task executions of a sequence σ of the model relate to a model move and σ is as short as possible. Since the cost induced by the model moves of an execution sequence depends on its length, the shortest possible execution sequence leading from the initial state to a final state in the model is considered for this purpose.

Realising the above idea, we obtain two ratios that denote the relative share of non-fitness in the alignments of a trace or an event log, respectively. Let M be a model and L an event log. Then, we denote by *cost*(t, M) the cost of an optimal alignment of a trace t ∈ L with respect to the model. Furthermore, let *cost*(t, ⟨⟩) and *cost*(⟨⟩, x) be the costs of aligning a trace t with an empty execution sequence, or some execution sequence x ∈ M of the model with an empty trace, respectively. Then, fitness based on alignments is quantified for a trace or an event log:

$$\text{fitness}(L, M) = 1 - \left(\frac{\sum_{t \in L} \mathit{cost}(t, M)}{\sum_{t \in L} \mathit{cost}(t, \langle \rangle) + |L| \times \min_{x \in M} \mathit{cost}(\langle \rangle, x)}\right) \tag{5}$$
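As a small worked instance of Eq. (5), the sketch below assumes unit costs, so that *cost*(t, ⟨⟩) equals the length of t and the minimum of *cost*(⟨⟩, x) equals the length of the shortest execution sequence of the model. The log consists of the fitting trace of case A5634 (length 8, cost 0) and the trace ⟨*As*, *Aa*, *Sso*, *Ro*, *Ao*, *Aaa*, *Aaa*⟩ (length 7, cost 3). The length of the shortest execution sequence is set to 3 purely as a placeholder assumption; it is not derived from Fig. 3.

```python
def alignment_fitness(optimal_costs, trace_lengths, shortest_run_length):
    """Equation (5) under unit costs.

    optimal_costs[i]    -- cost(t_i, M) of an optimal alignment of trace t_i
    trace_lengths[i]    -- |t_i|, i.e. cost(t_i, <>) when every log move costs 1
    shortest_run_length -- min over x in M of cost(<>, x) under unit costs
    """
    worst_case = sum(trace_lengths) + len(trace_lengths) * shortest_run_length
    return 1 - sum(optimal_costs) / worst_case


# Two traces: one fitting (cost 0), one with three deviations (cost 3).
fit = alignment_fitness(optimal_costs=[0, 3],
                        trace_lengths=[8, 7],
                        shortest_run_length=3)  # assumed placeholder value
print(round(fit, 3))  # 0.857
```

The worst-case cost is 8 + 7 + 2 × 3 = 21, so the three deviations yield a fitness of 1 − 3/21 = 6/7.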

A simple precision metric based on alignments is grounded in the general idea of *escaping edges* [34]. To give the intuition, we assume that (i) the event log fits the process model; and (ii) that the process model is deterministic. The former means that we simply exclude non-fitting traces, for which the optimal alignment contains log moves or model moves, from the assessment of the precision of the model. The latter refers to a process model not being able to reach a state, in which two tasks that capture the same activity of the process are enabled. The model of our running example (see Fig. 3) is deterministic.

For the activity of each event of a trace of the event log, we can determine a state of the process model right before the respective task would be executed. Under the above assumptions, this state is uniquely characterised. What is relevant when assessing precision is the number of tasks enabled in this state of the process model. Let M be a process model and L an event log, with t ∈ L as a trace and, overloading notation, e ∈ t as one of the events of the trace. Then, by *enabled*<sub>M</sub>(e), we denote the number of tasks and, due to determinism of the process model, also the number of activities that can be executed in the state right before executing the task corresponding to e.

Similarly, we consider all traces of the log that also contain events related to the activity of event e, say a, and have the same prefix, i.e., events that indicate that the same sequence of activities has been executed before an event signalling the execution of activity a. Then, we determine the number of activities for which events signal the execution directly after this prefix, i.e., the set of activities that have been executed in the same context as the activity a as indicated by event e. Let this number of activities be denoted by *enabled*<sub>L</sub>(e), which, under the above assumptions, is necessarily less than or equal to *enabled*<sub>M</sub>(e). Then, the ratio of both numbers captures the amount of 'escaping edges' that represent modelled behaviour that has not been recorded. As such, precision of log L and model M is quantified as follows:

$$\text{precision}(L, M) = \frac{\sum_{t \in L, e \in t} \mathit{enabled}_L(e)}{\sum_{t \in L, e \in t} \mathit{enabled}_M(e)} \tag{6}$$
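Equation (6) can be sketched as follows, under the two assumptions stated above (a fitting log and a deterministic model). As a further simplification that is not part of the chapter, the model is given as a deterministic transition system, i.e., a dictionary mapping each state to its enabled activities and their successor states. The toy model and log are hypothetical.

```python
from collections import defaultdict


def escaping_edges_precision(log, model, initial):
    """Equation (6) for a fitting log and a deterministic transition system."""
    # enabled_L: activities observed in the log directly after each prefix
    after_prefix = defaultdict(set)
    for trace in log:
        for i, activity in enumerate(trace):
            after_prefix[tuple(trace[:i])].add(activity)

    numerator = denominator = 0
    for trace in log:
        state = initial
        for i, activity in enumerate(trace):
            numerator += len(after_prefix[tuple(trace[:i])])  # enabled_L(e)
            denominator += len(model[state])                  # enabled_M(e)
            state = model[state][activity]  # the log fits, so this key exists
    return numerator / denominator


# Hypothetical model: after A, one of B, C or D is executed, followed by E.
model = {
    "s0": {"A": "s1"},
    "s1": {"B": "s2", "C": "s2", "D": "s2"},
    "s2": {"E": "s3"},
    "s3": {},
}
log = [("A", "B", "E"), ("A", "C", "E")]

print(escaping_edges_precision(log, model, "s0"))  # 0.8
```

Activity D is enabled after A in the model but never observed in that context, which is the single escaping edge. Each trace contributes 1 + 2 + 1 to the numerator and 1 + 3 + 1 to the denominator, giving 8/10.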

In summary, alignments are crucial to obtain accurate insights on fitness and precision. However, as already acknowledged, they are hard to compute in general. In the remainder of this section, we briefly review the challenge of computing alignments, together with some alternatives that have been proposed in recent years.

**Computing Alignments.** Computing an optimal alignment for an arbitrary combination of a process model and an event log is far from trivial. In terms of complexity, the task is as complex as reachability in Petri nets which, for general Petri nets, is decidable but has a prohibitively high worst-case complexity. Nonetheless, several techniques exist to compute alignments. The best-known technique uses the A\* algorithm to find the shortest path in the reachability graph of the so-called synchronous product net [64]. This synchronous product is a combination of the process model and a Petri-net representation of the trace. Figure 7 shows an example of a synchronous product net for the running example. The algorithm associates costs to every transition in the synchronous product and uses these costs to find the shortest path from the initial marking to the final marking by expanding a minimal portion of the search space [67,68].
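The shortest-path view of alignment computation can be sketched on a much simpler formalism. The code below is not the technique of [64]: instead of a synchronous product of Petri nets, it searches the product of the trace positions and the states of a deterministic transition system, and it uses Dijkstra's algorithm, i.e., A\* with a heuristic that is always zero. The toy model, the unit costs and all names are assumptions made for illustration.

```python
import heapq
from itertools import count

SKIP = ">>"


def align(trace, model, initial, finals):
    """Return (cost, moves) of an optimal alignment, or None if none exists."""
    tie = count()  # tie-breaker so the heap never has to compare states or paths
    queue = [(0, next(tie), 0, initial, ())]
    settled = set()
    while queue:
        cost, _, i, state, moves = heapq.heappop(queue)
        if (i, state) in settled:
            continue
        settled.add((i, state))
        if i == len(trace) and state in finals:
            return cost, list(moves)
        if i < len(trace):
            a = trace[i]
            if a in model[state]:  # synchronous move, cost 0
                heapq.heappush(queue, (cost, next(tie), i + 1,
                                       model[state][a], moves + ((a, a),)))
            # log move: the event is skipped in the model, cost 1
            heapq.heappush(queue, (cost + 1, next(tie), i + 1,
                                   state, moves + ((a, SKIP),)))
        for b, target in model[state].items():
            # model move: the model executes b without a matching event, cost 1
            heapq.heappush(queue, (cost + 1, next(tie), i,
                                   target, moves + ((SKIP, b),)))
    return None


# Hypothetical sequential model A -> B -> C with final state s3.
model = {"s0": {"A": "s1"}, "s1": {"B": "s2"}, "s2": {"C": "s3"}, "s3": {}}

print(align(("A", "C"), model, "s0", {"s3"}))
# (1, [('A', 'A'), ('>>', 'B'), ('C', 'C')])
```

Each search node pairs a position in the trace with a model state, which mirrors how a marking of the synchronous product records progress in both the trace net and the process model. A non-trivial heuristic, as used by A\*, would prune more of this search space without changing the result.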

When synchronous products become too large to handle for a monolithic algorithm, decomposition approaches can be used [29] to split the construction of an alignment into smaller problems whose solutions can be combined into a full optimal alignment. If optimality is not a requirement, sub-optimal alignments can be identified with a variety of techniques [51,52,60,62].

Another approach is the use of so-called satisfiability solvers [10]. The alignment problem is encoded as a SAT problem by translating the synchronous product to a set of boolean formulas. Because of this, the solution is limited to safe Petri nets. While strictly a limitation, this is hardly a problem as most process modelling languages found in industry belong to this class of models. A third approach for computing alignments, which is bound by the same limitation, uses job-shop schedulers to find the optimal set of moves [19].

**Fig. 7.** The synchronous product model for the running example and trace *T*<sub>1</sub> = ⟨*As*, *Aa*, *Sso*, *Ro*, *Ao*, *Aaa*, *Aaa*⟩.

Finally, symbolic techniques exist to compute alignments [43]. These techniques have the upside that they can compute alignments for large sets of traces at once, rather than trace by trace as all techniques above do. However, the downside is that they rely on the state space of the process model to be known. In models with many parallel constructs, this state space may be prohibitively large. An approach using an implicit representation of the state space by means of a *Binary Decision Diagram* was presented recently which alleviates the aforementioned explosion [9].

## **3 Relating Observed and Modelled Behaviour: Advanced Techniques**

In the previous sections, the focus of conformance checking was very much on control flow, i.e. the ordering of activities in the event log in relation to the specified order in which activities should be executed according to the process model. However, real-life processes are not only about activities. Instead, processes are executed by people within an organization to reach a certain business goal. This goal is expressed by data in the process and the process model serves as a guide to reach the goal as efficiently or as precisely as possible.

Consider, for example, the event log in Table 1. Next to the case identifier and the activity, we also see other data such as the amount of the application, the corresponding offer id sent to the customer and whether or not this offer is signed. Not shown in this log is the identity of the employee who executed each activity, but it is not hard to imagine that companies have many employees of different roles and with different authorisation levels.

When doing conformance checking, it is important to consider all these elements and for this, more advanced conformance checking techniques, based on alignments, exist.

**Data-Aware Alignments.** Data plays a pivotal role in processes. Decisions are typically based on data that is provided at the start of a process or generated by any of the activities in the process. In our example of Table 1, the amount column shows both types. Event e<sub>13</sub> refers to an application being submitted by a customer, requesting a loan of 2000 euro. Event e<sub>37</sub> subsequently shows that the bank offers the customer a loan of 1500 euro. In this process, the activity "Select and send offer" should not be executed with an amount higher than the requested amount. For application A5634 this is correct, but application A5636 shows a violation of this rule as the requested amount is only 200 euro, while the offered amount is 500.
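The data rule of this example can be checked directly on the event data, as the following sketch shows. It is a plain rule check and not a data-aware alignment; the dictionary layout and the field names `case`, `activity` and `amount` are assumptions, with *As* and *Sso* abbreviating the submission of the application and "Select and send offer". The amounts are the ones quoted above.

```python
def offer_amount_violations(events):
    """Cases in which an offer exceeds the amount requested at submission."""
    requested = {}
    violations = []
    for event in events:
        case, activity = event["case"], event["activity"]
        if activity == "As":
            requested[case] = event["amount"]
        elif activity == "Sso":
            limit = requested.get(case)
            if limit is not None and event["amount"] > limit:
                violations.append(case)
    return violations


events = [
    {"case": "A5634", "activity": "As", "amount": 2000},
    {"case": "A5634", "activity": "Sso", "amount": 1500},
    {"case": "A5636", "activity": "As", "amount": 200},
    {"case": "A5636", "activity": "Sso", "amount": 500},
]

print(offer_amount_violations(events))  # ['A5636']
```

Such a check only works if the control flow of the case is as expected, for instance if the submission event precedes the offer. This is why the approaches discussed next combine data rules with alignments.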

To identify such data issues, several approaches exist. In [20,21] the authors first align the control flow using any of the techniques described above and then check for deviations on the data level. This work is extended in [32], providing more control over the result and, especially, adopting a balanced view of control flow and rules referring to the data perspective. More recently, an approach based on SMT solvers has offered a fresh way to compute data-aware alignments [25].

**Resource-Aware Alignments.** Consider, for the sake of argument, that the offered amount in our example can be higher than the requested amount, but only if the activity "Select and send offer" is executed by a manager. In that case, the resource has a higher authorization level to actually deviate from the customer's request. However, if this happens, the final activity "Approve and activate application" also needs to be executed by a manager and this should not be the same person (four-eyes principle).

The relation between the roles and resource identities across different activities makes checking this more complex than data-aware alignments. The authors of [3] consider the resource perspective by looking at the various data operations in an event log and checking if these operations are performed by authorized resources.
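The resource rule sketched in the previous paragraph, including the four-eyes principle, can be illustrated as follows. Since Table 1 does not record who executed each activity, the resource identifiers, the role mapping and the field names are entirely hypothetical; the sketch is a direct rule check and not the technique of [3].

```python
def check_case(events, roles):
    """Return a list of resource-related issues for the events of one case."""
    requested = next(e["amount"] for e in events if e["activity"] == "As")
    # Resources that sent an offer above the requested amount
    raised_by = [e["resource"] for e in events
                 if e["activity"] == "Sso" and e["amount"] > requested]
    issues = []
    if not raised_by:
        return issues  # the rule only applies to offers above the request
    for resource in raised_by:
        if roles.get(resource) != "manager":
            issues.append(f"offer above request sent by non-manager {resource}")
    for event in events:
        if event["activity"] != "Aaa":
            continue
        approver = event["resource"]
        if roles.get(approver) != "manager":
            issues.append(f"approval by non-manager {approver}")
        elif approver in raised_by:
            issues.append(f"four-eyes principle violated by {approver}")
    return issues


roles = {"clerk1": "clerk", "mgr1": "manager", "mgr2": "manager"}  # hypothetical


def case(approver):
    """Build a hypothetical case in which mgr1 offers more than requested."""
    return [
        {"activity": "As", "amount": 200, "resource": "clerk1"},
        {"activity": "Sso", "amount": 500, "resource": "mgr1"},
        {"activity": "Aaa", "resource": approver},
    ]


print(check_case(case("mgr2"), roles))  # []
print(check_case(case("mgr1"), roles))  # ['four-eyes principle violated by mgr1']
```

The check has to relate the resources of two different activities of the same case, which illustrates why the resource perspective is harder to verify than a data rule on a single event.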

**Integrated Approaches.** The techniques presented above share a common feature: they first align the control flow and then use the control-flow alignment to check data and resource rules. An important downside of this approach is that certain deviations may not be detected. Consider, for example, a manager who decides to log in to the bank's system and read the application of his neighbour. As no activity is performed, the event log would not show any events and, when a data-access log is checked in isolation, the manager has the authority to read application data, hence no data-access violation is found. However, the manager read data outside of the context of a process, i.e., there was no business goal associated with the read action.

To comprehensively check the conformance of an event log from the viewpoint of the control flow, the data-access and resource authorization, a more recent approach has been developed by Mozafari et al. [5]. In their paper, the combination of an event log, a data access log and a resource model is used to construct a large synchronous product. This synchronous product is subsequently used to find optimal alignments with respect to deviations in all three perspectives combined without favouring one over the other. These deviations include, for example, spurious data access and authorization problems where otherwise authorized users access data outside of the context of the case they are working on.

**Compliance Checking.** The focus of conformance checking so far has been on the situation where end-to-end process models are available. However, in many companies, such process models do not exist. Instead, each process is only governed by a set of compliance rules, i.e., all activities can be performed, as long as these rules are not broken. Rule engines, as discussed earlier in the introduction, can typically raise flags as soon as business rules are violated. However, a rule engine typically only recognizes the moment when a rule is violated. Conformance checking using alignments can also be used to identify that specific business rules are not yet fulfilled, although no violation has occurred yet.

The work of Ramezani et al. [54] shows how typical compliance rules from the accountancy and control domain can be translated into small Petri nets which in turn can be aligned with event logs to identify violations against these rules as log- or model-moves.

**Realtime (or Righttime) Conformance Checking.** So far, conformance checking was discussed as a technique to identify deviations after processes have been concluded. However, in some cases, it may be interesting to detect deviations during the execution of a process [12,13,70]. Such techniques are often referred to as streaming techniques, i.e., data is being processed as it comes in and a realtime dashboard provides insights into the current conformance level of an entire process. This is particularly useful in environments where employees have a great deal of flexibility in executing activities within a process but where specific conditions have to be met at the end.
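As a rough illustration of the streaming setting, the sketch below processes events one at a time and maintains a current conformance level that a dashboard could display. It is a simplified, assumed design and not one of the techniques in [12,13,70]: the class name, the precedence-style rule format, and the definition of the conformance level as the share of cases without a deviation are all choices made for this example.

```python
from typing import Dict, List, Set


class StreamingPrecedenceChecker:
    """Checks, event by event, that required predecessors have been observed."""

    def __init__(self, required_before: Dict[str, Set[str]]):
        # Maps an activity to the activities that must have occurred earlier in the case.
        self.required_before = required_before
        self.seen: Dict[str, Set[str]] = {}   # case id -> activities observed so far
        self.deviating: Set[str] = set()      # case ids with at least one deviation

    def observe(self, case_id: str, activity: str) -> List[str]:
        """Process one incoming event and return the missing predecessors, if any."""
        seen = self.seen.setdefault(case_id, set())
        missing = self.required_before.get(activity, set()) - seen
        if missing:
            self.deviating.add(case_id)
        seen.add(activity)
        return sorted(missing)

    def conformance_level(self) -> float:
        """Share of the cases observed so far that show no deviation."""
        if not self.seen:
            return 1.0
        return 1 - len(self.deviating) / len(self.seen)
```

Only a small state per running case is kept, so the check can be updated whenever a new event arrives rather than after the case has been concluded.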

**Conformance Checking Without Process Models.** Finally, a specific type of conformance checking exists which does not rely on a traditional notion of a process model. Instead, the event log itself is used as a representative of both the correct and incorrect behaviour and deviations are detected between the mainstream behaviour prevalent in the event log and the 'outlier' cases [36,37]. Specifically, this approach employs recurrent neural networks (RNNs) that are trained for next activity prediction moving through a trace forward or backward. These predictions can be seen as an approximation of a process model against which the alignments of traces are computed.

## **4 Applications of Conformance Checking**

So far, we discussed essential techniques for conformance checking along with their generalization and extension to scenarios beyond the traditional, retrospective analysis of control-flow information. Next, we turn the focus to the broader field of applications of conformance checking.

We first note that an understanding of the link between the recorded and modelled behaviour of a process serves as a foundation for various model-based techniques for the analysis of qualitative and quantitative process properties. The importance of conformance checking for such analysis is detailed in Sect. 4.1, taking techniques for the analysis of performance characteristics and decision points as examples.

A second important observation relates to the fact that deviations between recorded and modelled behaviour, as revealed by conformance checking, can potentially be attributed to quality issues in the event log or the process model. Both a log and a model denote representations of the process at hand, which may be incomplete, outdated, imprecise, or simply wrong. This gives rise to a generalized notion of conformance checking, which aims at a separation of deviations that are due to quality issues in the event log or the process model. As discussed in Sect. 4.2, this generalized view on conformance checking enables us to describe common techniques for process mining as part of a unified framework.

#### **4.1 The Case of Model-Based Process Analysis**

Process models serve as the starting point for a plethora of process analysis techniques. Such analysis may be classified along various dimensions. That is, the point in time addressed by the analysis distinguishes retrospective, predictive, or even prescriptive analysis of a business process. The granularity of the analysis may be defined to be on individual instances of a process or a set thereof, thereby integrating potential interactions between different instances of a process. Moreover, analysis based on a process model may incorporate diverse process perspectives, starting with the traditional view only on the control-flow of the process, through the data produced and consumed during its execution, the impact of such data on the control-flow, the integration of events produced by the environment in which the process is executed, the utilization of resources, and the definition of organizational responsibilities, to name just a few examples.

Regardless of the specific type of model-based process analysis, conformance checking provides a means to ensure that the models provide a reasonable representation of the actual behaviour. Considering the behaviour as recorded in an event log as a representation of actual process execution, despite all potential issues related to data quality, such as accuracy and completeness of an event log, conformance checking establishes trust in the analysis results obtained from the models. In the following paragraphs, we reflect on this application of conformance checking for two types of model-based analysis techniques.

**Performance Analysis.** Performance properties are an important aspect of process analysis in various domains. Here, specific measures include information on the time needed by a process instance from start to end, also known as cycle time or sojourn time, which is captured in terms of simple statistics, such as the average or maximal sojourn time, or complete distributions. Moreover, understanding how much time is needed to reach a certain milestone in the execution of a process is valuable information for operational process management, e.g., related to the scheduling of resources.

To enable the respective analysis, a process model is enriched with performance information. Common notions include simple annotations such as the average execution time per task. Yet, one may also consider more elaborated annotations, such as the distributions of not only the execution time per task, but also the wait time between the execution of subsequent tasks. Based on these annotations, analytical techniques or simulation are used to compute performance measures.

Given an event log, conformance checking that links the events of traces to the tasks in a process model helps to extract such performance annotations. For instance, once an alignment is computed, the synchronous steps indicate which temporal information attached to events needs to be incorporated into the annotations of specific tasks. Note that this is particularly beneficial when a model contains several tasks with the same label, i.e., representing the same activity of the process. In that case, an alignment separates the events that shall serve as the basis for the performance annotation of the different tasks based on the behavioural context in which they are executed. For instance, the model for a loan application process in Fig. 3 contains two tasks for declining an application (*Da*). This way, the respective activity may be executed in different contexts, once directly after the submission of the application and once towards the end of the process, after an offer has been declined. Consequently, both tasks may have different performance characteristics. Alignments help to incorporate these differences by separating the events that are linked to either task.

However, conformance checking may not only be employed to extract performance annotations from an event log, but also enables their validation. For instance, performance annotations may have been defined manually, based on expectations. Then, temporal information of the event log may be utilized to validate these annotations, where, again, conformance checking indicates which events shall be considered for which of the tasks in the process model.
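The extraction of performance annotations from alignments can be sketched as follows. The representation is an assumption made for illustration: an alignment is a list of (event, task identifier) pairs with `None` as the skip symbol, events carry a hypothetical `duration` attribute, and the two *Da* tasks are distinguished by the invented identifiers `Da_early` and `Da_late`.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# One alignment step: (event, task identifier); None stands for the skip symbol.
Move = Tuple[Optional[dict], Optional[str]]


def average_duration_per_task(alignments: List[List[Move]]) -> Dict[str, float]:
    """Average the event durations per model task, using synchronous moves only."""
    durations = defaultdict(list)
    for alignment in alignments:
        for event, task in alignment:
            if event is not None and task is not None:  # synchronous move
                durations[task].append(event["duration"])
            # Moves in log have no task and moves in model have no event,
            # so neither contributes to a task annotation.
    return {task: mean(values) for task, values in durations.items()}
```

Because the grouping key is the task of the model rather than the activity label of the event, the two tasks that share the label *Da* receive separate annotations. Comparing the computed averages with manually defined annotations gives a simple form of the validation described above.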

**Decision Point Analysis.** Decision point analysis aims at insights on the conditions that govern decision points in a process. In process modelling, it is a common abstraction to neglect such conditions and simply assume that a nondeterministic choice is taken, as the conditions may not be relevant for some control-flow-oriented analysis. However, this abstraction may also be problematic, as it hides how the context of process execution influences the control-flow, e.g., that certain activities are executed solely for certain types of cases. Such insights are particularly relevant also for performance analysis as discussed above, since the conditions at a decision point may induce highly skewed distributions. In our running example, Fig. 3, there is a first decision point directly after the submission of an application, which may lead to an immediate rejection and, hence, completion of process execution. Understanding the properties of cases that govern this decision will, therefore, be very beneficial for any analysis of performance characteristics.

To understand the conditions at decision points of a process, decision mining may be employed. It takes traces of an event log, including the data attached to the events or the trace as a whole, as observations for particular decision outcomes. Then, a classification problem is derived, with the different outcomes being the classes, and common techniques for supervised classification enable the construction of a classifier. Assuming that the obtained classifier can be interpreted, e.g., is represented as a decision tree, the conditions for a decision point can be extracted and added to a process model.

In this context, conformance checking, again, helps to prepare a process model for analysis, as well as to validate existing annotations. In the former case, alignments that link events to tasks help to prepare the data needed for decision mining. Through an alignment, the data available at a specific decision point is characterised and may be used as input to the classification algorithm. In the latter case, the decision points in a process model have already been annotated with the respective conditions. Then, conformance checking reveals if these conditions are matched with the behaviour recorded in the event log, either by constructing an alignment solely based on control-flow information and checking the conditions at decision points separately, or by integrating the conditions directly in the computation of multi-perspective alignments as discussed in Sect. 3.
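The preparation of data for decision mining can be sketched in a few lines. The sketch assumes that an alignment is a list of (event, task) pairs with `None` as the skip symbol and that events carry an optional `data` dictionary; the attribute name `amount` used in the example is hypothetical. Instead of a full decision tree learner, a single-threshold rule (a decision stump) stands in for the classifier.

```python
from typing import Dict, List, Optional, Set, Tuple

Move = Tuple[Optional[dict], Optional[str]]
Observation = Tuple[Dict[str, float], str]


def decision_observations(
    alignments: List[List[Move]], decision_task: str, outcome_tasks: Set[str]
) -> List[Observation]:
    """Collect (data known at the decision point, chosen outcome) pairs."""
    observations: List[Observation] = []
    for alignment in alignments:
        data: Dict[str, float] = {}
        at_decision = False
        for event, task in alignment:
            if event is None or task is None:
                continue  # only synchronous moves link event data to a model position
            if at_decision and task in outcome_tasks:
                observations.append((dict(data), task))
                at_decision = False
            data.update(event.get("data", {}))
            if task == decision_task:
                at_decision = True
    return observations


def best_threshold(
    observations: List[Observation], attribute: str, positive: str
) -> Tuple[Optional[float], int]:
    """Find t such that 'attribute > t' best predicts the positive outcome."""
    best: Tuple[Optional[float], int] = (None, -1)
    for t in sorted({data[attribute] for data, _ in observations}):
        correct = sum(
            (data[attribute] > t) == (outcome == positive)
            for data, outcome in observations
        )
        if correct > best[1]:
            best = (t, correct)
    return best
```

For the first decision point of the running example, one would call `decision_observations(alignments, "As", {"Da", "Aa"})` and then search for a threshold on a case attribute. The resulting condition is interpretable and could be attached to the decision point, or compared with an existing annotation to validate it.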

#### **4.2 A General View on Conformance Checking**

An event log and a process model both denote representations of an abstract entity, the actual process as it is implemented in an organization. From this viewpoint, illustrated in Fig. 8, it becomes clear that any deviation detected between these representations may potentially be attributed to the way that the representations capture the actual process, i.e., the log and the model may show quality issues. For instance, logging mechanisms may be faulty and the integration of event data from different systems may be imprecise. Similarly, models may have been created based on an incomplete understanding of the process and may be biased towards the expected rather than the actual behaviour. Moreover, in many application contexts, processes are subject to change and evolve over time. Hence, process models created at some time point become outdated. Event logs that span a large time period, in turn, may contain information about different versions of a process, so that the log in its entirety appears to describe a process that was actually never implemented as such at any specific point in time.

**Fig. 8.** Both an event log and a process model are representations of a process.

From the above observation, it follows that a deviation between an event log and a process model may be interpreted as an issue to fix in either of the representations. That is, one of the representations is assumed to be correct, i.e., it is assumed to truthfully denote the actual process, whereas the other representation is updated with the goal to resolve the deviation. Specifically, techniques to repair a process model based on the event log and techniques to repair an event log based on the process model have been proposed in the literature, as discussed next.

**Model Repair.** Assuming that an event log constitutes a correct representation of a process' behaviour, deviations detected between the log and a process model are a starting point for model repair. To this end, existing techniques are mostly based on alignments computed between a trace of an event log and the process model. The reason is that alignments clearly separate behaviour that is only observed in the process model (i.e., a move in model) and behaviour that is present only in the trace (i.e., a move in log).

Intuitively, a move in model captures the situation that the execution of an activity is defined to be mandatory, while this execution is optional according to the supposedly correct event log. Therefore, a simple repair strategy is to relax the control-flow defined by the model and explicitly enable the continuation of a process instance without executing the respective activity. Note though that different syntactical changes may be considered to realize this change. For instance, in a BPMN process model, one may insert a decision point before the task to determine whether it is executed, whereas a similar effect may also be achieved by changing the semantics of existing routing constructs, such as transforming a parallel split into an inclusive choice.

A move in log, on the other hand, hints at a supposedly correct activity execution that is without counterpart in the model. A repair strategy, therefore, is to insert a corresponding task into the process model. The location for this insertion is also determined based on the conformance checking result. That is, the alignment up to the respective move in log induces a state in the process model. The task needs to be inserted, such that it is activated in this state and such that before and after its execution, all tasks that have been activated in the original state are still activated. In practice, such a repair operation may not only be conducted on the level of individual move in log steps, but for sequences thereof. In this case, a model fragment to capture the behaviour of this sequence is discovered and inserted into the original model.

As an example, consider the following alignment for the process model introduced earlier (Fig. 3).

$$\begin{array}{l|c|c|c|c|c|c|c|c|c} \text{log trace} & As & Aa & Sso & Ro & \gg & Ao & Aaa & Aaa & \gg \\ \hline \text{execution sequence} & As & Aa & Sso & Ro & Fa & Ao & Aaa & \gg & Af \end{array}$$

From the move in model (≫, *Fa*), one may derive a change in the process model that enables skipping of the respective task *Fa* in the process model. The move in log (*Aaa*, ≫), in turn, suggests a change in the model that supports an additional execution of the activity to approve and activate an application. Yet, we note that this activity execution directly succeeds a synchronous move for a task referring to the same activity. Hence, instead of adding a new task in the process model, it may be more desirable to generalise the process model and insert a loop around the existing task *Aaa*, so that it may be executed multiple times in an execution sequence of the model.
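The reasoning applied to this example can be written down as a small procedure that turns alignment steps into repair suggestions. This is a simplified sketch and not a published repair algorithm: the string `">>"` encodes the skip symbol, the hint names `skip`, `loop`, and `insert` are invented here, and task labels are assumed to coincide with activity names.

```python
from typing import List, Tuple

SKIP = ">>"  # stands for the skip symbol of an alignment step

# The example alignment, one (log step, model step) pair per column.
ALIGNMENT: List[Tuple[str, str]] = [
    ("As", "As"), ("Aa", "Aa"), ("Sso", "Sso"), ("Ro", "Ro"),
    (SKIP, "Fa"), ("Ao", "Ao"), ("Aaa", "Aaa"), ("Aaa", SKIP), (SKIP, "Af"),
]


def repair_hints(alignment: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Derive one model repair suggestion per non-synchronous alignment step."""
    hints: List[Tuple[str, str]] = []
    previous = None  # the alignment step processed just before the current one
    for step in alignment:
        log_step, model_step = step
        if log_step == SKIP:
            # Move in model: the task was mandatory but not observed.
            hints.append(("skip", model_step))
        elif model_step == SKIP:
            # Move in log: prefer a loop if the same activity was just executed
            # synchronously, otherwise suggest inserting a new task.
            repeated = previous is not None and previous[0] == previous[1] == log_step
            hints.append(("loop" if repeated else "insert", log_step))
        previous = step
    return hints
```

For `ALIGNMENT`, the function suggests making *Fa* and *Af* skippable and adding a loop around *Aaa*. Where exactly a new task or loop is placed in the model depends on the state induced by the alignment prefix, which this sketch does not compute.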

**Log Repair.** The idea of repairing a process representation based on the results of conformance checking may also be applied to event logs. Given an alignment of a trace and a process model, the actual changes to apply to the trace are derived from the types of the respective alignment steps. Under the assumption that the model is a correct representation of the process, a move in model would lead to the insertion of an event into the trace at the position of the alignment step. An event that is part of a move in log, in turn, would be deleted from the trace.

In practice, the insertion or deletion of events of a trace may be problematic. For instance, the creation of artificial events raises the question of how to define the values of an event's attributes, from generic ones such as an event's timestamp to domain-specific attributes (e.g., the state of a business object). Against this background, log repair may not focus on alignment steps in isolation, but aim at identifying high-level changes. An example would be the presence of two alignment steps, a move in model and a move in log, both related to the execution of the same activity. Instead of deleting and inserting an event, moving the event from the position of the move in log step to the position of the move in model step would enable repair without the need to generate an artificial event.

Taking up the aforementioned example, based on the alignment, log repair may suggest that the second event linked to the approval and activation of the application (*Aaa*) is erroneous (e.g., the activity execution was recorded twice due to a faulty logging mechanism) and, thus, shall be removed from the trace. At the same time, it may suggest inserting events for the activities to finalise the application (*Fa*) and finish the application (*Af*), for instance, assuming that these steps are manually recorded, so that some incompleteness of the trace is to be expected.
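The basic, step-wise form of log repair for this example can be sketched as follows. The encoding is an assumption for illustration: `">>"` stands for the skip symbol, and inserted events are flagged with a hypothetical `artificial` attribute, since their timestamps and other attributes cannot be derived from the alignment alone.

```python
from typing import Dict, List, Tuple, Union

SKIP = ">>"  # stands for the skip symbol of an alignment step

# The example alignment, one (log step, model step) pair per column.
ALIGNMENT: List[Tuple[str, str]] = [
    ("As", "As"), ("Aa", "Aa"), ("Sso", "Sso"), ("Ro", "Ro"),
    (SKIP, "Fa"), ("Ao", "Ao"), ("Aaa", "Aaa"), ("Aaa", SKIP), (SKIP, "Af"),
]


def repair_trace(alignment: List[Tuple[str, str]]) -> List[Dict[str, Union[str, bool]]]:
    """Repair a trace under the assumption that the model is correct."""
    repaired: List[Dict[str, Union[str, bool]]] = []
    for log_step, model_step in alignment:
        if model_step == SKIP:
            continue  # move in log: the event is deleted
        repaired.append({
            "activity": model_step,
            # Move in model: an event is inserted and marked as artificial.
            "artificial": log_step == SKIP,
        })
    return repaired
```

The flag keeps the inserted events for *Fa* and *Af* distinguishable from recorded ones, so that later analysis can treat their missing attribute values with care.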

**Generalized Conformance Checking.** Both model repair and log repair consider one process representation to be correct, which may therefore serve as a ground truth. In the general case, however, quality issues may be present in both representations. As a consequence, some of the conformance issues detected between a model and a log may stem from the model not adequately capturing the process, some of them may originate from low quality of the event log, while some are also inherent deviations that need to be analysed.

To balance the different reasons of conformance issues, it was suggested to incorporate a notion of *trust* in the process model, denoted by τ<sub>M</sub> ∈ [0, 1], as well as the event log, denoted by τ<sub>L</sub> ∈ [0, 1] [48]. These trust values capture the assumed correctness of either representation and may reflect how the representation has been derived. For instance, a process model created as part of a first brainstorming session may be less trustworthy in terms of correctness and completeness compared to a model created as a part of a rigorous process management initiative. Similarly, an event log created by a process-oriented information system can, in general, be expected to be more trustworthy than a manual documentation of activity executions by a diverse group of process stakeholders.

Once a trust level has been specified for both the model and the log, conformance checking may be phrased as an optimization problem that incorporates model and log repair. To this end, the following notions need to be defined: a function δ<sub>L<sup>2</sup></sub> to measure the distance of two event logs; a function δ<sub>M<sup>2</sup></sub> to measure the distance of two models; and a function δ<sub>L,M</sub> to measure the distance of an event log and a process model, such as alignment-based fitness or a combination of fitness and precision. Given an event log L and a process model M, generalized conformance checking [48] is then defined as the identification of some adapted log L<sup>∗</sup> and adapted model M<sup>∗</sup>, such that:

$$\begin{aligned} L^*, M^* &= \operatorname{argmin}_{L', M'} \left( \delta_{L^2}(L, L'),\ \delta_{M^2}(M, M'),\ \delta_{L, M}(L', M') \right) \\ &\text{subject to } \delta_{L^2}(L, L') \le 1 - \tau_L \text{ and } \delta_{M^2}(M, M') \le 1 - \tau_M. \end{aligned}$$

Intuitively, the above problem formulation considers that the given model M and log L may require to be adapted, if they are not fully trustworthy. However, the trust values induce a bound for the distance between any adapted model and log, and the original model and log, respectively, as illustrated in Fig. 9. Within the space set by these bounds, the distances between the adapted and original model, between the adapted and original logs, and between the adapted model and the adapted log shall be minimised. Here, a specific instantiation may require the minimisation of a linear combination of the three distances.
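A toy instantiation may help to see how the trust values steer the result. The sketch below is not the algorithm of [48]: it assumes that candidate logs and models are enumerated explicitly, represents both a log and a model as a set of traces, uses the Jaccard distance for the log-to-log and model-to-model distances, uses the share of non-fitting trace variants as the log-to-model distance, and minimises a weighted sum of the three distances with arbitrarily chosen weights.

```python
from itertools import product
from typing import FrozenSet, Iterable, Optional, Tuple

Trace = Tuple[str, ...]
Behaviour = FrozenSet[Trace]  # a log (observed variants) or a model (allowed traces)


def jaccard_distance(a: Behaviour, b: Behaviour) -> float:
    """Distance in [0, 1] between two sets of traces (assumed distance function)."""
    union = a | b
    return 0.0 if not union else 1 - len(a & b) / len(union)


def unfitness(log: Behaviour, model: Behaviour) -> float:
    """Share of trace variants of the log that the model does not allow."""
    return 0.0 if not log else len(log - model) / len(log)


def generalized_conformance(
    log: Behaviour,
    model: Behaviour,
    candidate_logs: Iterable[Behaviour],
    candidate_models: Iterable[Behaviour],
    trust_log: float,
    trust_model: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 2.0),
) -> Optional[Tuple[Behaviour, Behaviour]]:
    """Pick the adapted (log, model) pair of minimal weighted distance within the trust bounds."""
    w_log, w_model, w_fit = weights
    best, best_cost = None, float("inf")
    for new_log, new_model in product(candidate_logs, candidate_models):
        d_log = jaccard_distance(log, new_log)
        d_model = jaccard_distance(model, new_model)
        # The trust values bound how far the adapted artefacts may move.
        if d_log > 1 - trust_log or d_model > 1 - trust_model:
            continue
        cost = w_log * d_log + w_model * d_model + w_fit * unfitness(new_log, new_model)
        if cost < best_cost:
            best, best_cost = (new_log, new_model), cost
    return best
```

With full trust in the log and limited trust in the model, only the model can be adapted, which corresponds to model repair. Reversing the trust values yields log repair, and full trust in both leaves only the original pair, so that plain conformance checking remains.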

Generalized conformance checking unifies various tasks in the field of process mining [14,48]. Table 4 highlights how specific tasks can be seen as instances of the generalized conformance checking problem, depending on the trust in an event log or a process model. Specifically, tasks such as classical process discovery or process simulation fit into this picture when assuming that there is no trust in the model or the log, which can be interpreted as the setting that the respective artifacts are not available.

**Fig. 9.** The problem of generalized conformance checking (from [14]).

**Table 4.** Overview of process mining tasks listed as instances of the generalised conformance checking problem according to [14,48].

## **5 Further Reading**

Conformance checking has evolved significantly in the last decade, enabling industrial adoption and commercial software offerings. As shown in Sect. 3, techniques beyond control flow have already been proposed in recent years, since considering other perspectives brings significant value and triggers adoption.

Still, the core of the developed techniques focuses on the algorithmic aspects of computing conformance artefacts for the control-flow perspective. We now review further work on the three dimensions considered in this chapter, thereby providing pointers for further reading.

**Rule Checking.** The idea of rule-based conformance checking, brought forward in [76], is to rely on constraints which are then checked for the traces of an event log. The approach employs constraints derived from the (causal) behavioural profile of a process model [75, 77], which are sets of binary relations over activities derived from the order of potential occurrences of tasks in the execution sequences of the model.

This general idea, however, is not limited to a specific set of rules. Rather, other notions of constraints can be used in the very same manner, including transition adjacency relations [80] and the rules of the 4C spectrum [42]. Such sets of binary rules to capture behaviour are inherently limited in their expressiveness, though, as already for relatively simple classes of models, an exponentially growing number of rules would be needed to capture the complete behaviour [40]. Rule-based conformance checking, therefore, lends itself to scenarios, where only certain constraints need to be checked rather than the complete behaviour as specified by a process model.

While the results of rule checking enable insights on deviant traces, they may also be used for aggregated conformance measures. For instance, fitness measures may be derived based on the numbers of satisfied and violated rules [76]. Also, filtering of rule violations and discovery of associations between them may provide further insights into the context of non-conformance [76,78].
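A fitness measure of this kind can be sketched with two simplified binary rule types. The rule encoding and the exact ratio below are assumptions chosen for illustration; they only approximate the relations of behavioural profiles and are not claimed to match the measures defined in [76].

```python
from typing import List, Sequence, Tuple

Rule = Tuple[str, str, str]  # (rule kind, activity a, activity b)


def satisfies(trace: Sequence[str], rule: Rule) -> bool:
    """Check one binary rule against one trace."""
    kind, a, b = rule
    if kind == "exclusive":
        # a and b must never occur together in the same trace.
        return not (a in trace and b in trace)
    if kind == "strict_order":
        # Every occurrence of a must come before every occurrence of b.
        positions_a = [i for i, x in enumerate(trace) if x == a]
        positions_b = [i for i, x in enumerate(trace) if x == b]
        return all(i < j for i in positions_a for j in positions_b)
    raise ValueError(f"unknown rule kind: {kind}")


def rule_fitness(log: List[Sequence[str]], rules: List[Rule]) -> float:
    """Share of satisfied (trace, rule) checks over all checks."""
    checks = [satisfies(trace, rule) for trace in log for rule in rules]
    return sum(checks) / len(checks) if checks else 1.0
```

Each rule is evaluated independently per trace, so the list of violated (trace, rule) pairs is available for filtering or for mining associations between violations.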

Finally, conformance checking based on rules has the advantage that it can be lifted to online scenarios in a straight-forward manner. To this end, rules can be translated to queries over streams of events, see [17,78], which enables the use of algorithms and systems developed for complex event processing [16].

**Token Replay.** Techniques for token replay were first introduced in [49]. Alternative techniques were presented in [72], and later adapted to an online scenario in [71]. Recently, new heuristics have been proposed that make token replay a fast alternative to alignments [7,8].

**Alignments.** The seminal work in [1] proposed the notion of alignment and developed a technique based on A<sup>∗</sup> to compute optimal alignments for a particular class of process models. Improvements of this approach have been presented recently in different papers [66,68]. The approach represents the state-of-the-art technique for computing alignments, and can be adapted (at the expense of significantly increasing the memory footprint) to provide all optimal alignments. Alternatives to A<sup>∗</sup> have appeared in recent years: in the approach presented in [19], the alignment problem is mapped to an *automated planning* instance. Automata-based techniques have also appeared [31,44]. The techniques in [44] (recently extended in [45]) rely on state-space exploration and determination of the automata corresponding to both the event log and the process model, whilst the technique in [31] is based on computing several subsets of activities and projecting the alignment instances accordingly. We also highlight the recent approach that is grounded on the use of relaxation labelling combined with A<sup>∗</sup>, to provide a light alternative to compute alignments [39].

The work in [57] presented the notion of *approximate* alignment to alleviate the computational demands by proposing a recursive paradigm on the basis of the structural theory of Petri nets. In spite of its resource efficiency, the solution is not guaranteed to be executable. Alternatively, the technique in [59] presents a framework to reduce a process model and the event log accordingly, with the goal of alleviating the computation of alignments. The obtained alignment, called *macro-alignment* since some of the positions are high-level elements, is expanded based on the information gathered during the initial reduction. Techniques using local search have also been proposed very recently [61].

Against this background, the process mining community has focused on divide-and-conquer strategies for computing alignments, with the aim of alleviating the complexity of the problem without degrading the quality of the solutions found. We now turn our focus to decompositional approaches to compute alignments, which are more related to the research of this paper.

Decompositional techniques have been presented [35,63,73] that, instead of computing optimal alignments, focus on the *crucial problem* of whether or not a given trace fits a process model. These techniques vertically decompose the process model into pieces satisfying certain conditions (so only *valid* decompositions [63], which satisfy restrictive conditions on the labels and connections forming a decomposition, guarantee the derivation of a real alignment). Later on, the notion of *recomposition* has been proposed on top of decompositional techniques, in order to obtain optimal alignments whenever possible by iterating the decompositional methods when the required conditions do not hold [29]. In contrast to the aforementioned vertical decomposition techniques, this approach does not require this last consolidation step of partial solutions, and therefore can be a fast alternative to these methods at the expense of losing the guarantee of optimality.

There has also been related work on the use of partial order representations of process models for computing alignments. In [13], unfoldings were used to capture all possible transition relations of a model so that they can be used for online conformance checking. In contrast, unfoldings were used recently in a series of papers [38,60] to significantly speed up the computation of alignments. We believe these approaches, especially the last two, can be easily integrated in our framework.

The work of [45] can also be considered a decompositional approach, since it proposes decomposing the model into sequential elements (*S-components*) so that the state-space explosion of having concurrent activities is significantly alleviated. We believe that this work is quite compatible with the framework suggested in this paper, since the model restrictions assumed in [45] are satisfied by the partial models arising from our horizontal decomposition.

Finally, the MapReduce distributed programming model has already been considered for process mining. For instance, Evermann applies it to process discovery [23], whilst [15] applies it for monitoring declarative business processes. Recently, MapReduce techniques have been proposed to offer a horizontal decompositional alternative to computing alignments [62].

## **6 Milestones and Challenges**

Conformance checking is nowadays a mature field, demonstrated by its presence in some of the process mining commercial tools and process mining use cases. In spite of this, the available support for its adoption is far from complete. One example is the metrics available: whilst fitness or precision are considered well evaluated through current techniques, accurate generalization metrics that additionally can be evaluated efficiently are yet to come [41,69,72].

Alignments are a central pillar of current techniques for conformance checking. However, the complexity requirements of the state-of-the-art techniques hamper their application for large instances (see Sect. 2.4). Actually, process mining is facing the following paradox: whilst there exist techniques to discover arbitrarily large process models [4,30], most of the existing alignment computation techniques will not be able to handle such models. Alternative approaches, like the decomposition or structural techniques, only alleviate the problem, at the expense of losing the guarantee of important properties like optimality. Also, when incorporating other dimensions like data or resources, so that a multi-perspective view on conformance checking is enabled, the complexity of the problem increases significantly, making it difficult to apply to real-life problems; we envision new contributions also for multi-perspective conformance checking in the near future that can overcome this limitation.

Beyond computational or algorithmic challenges, there are other equally important challenges, more oriented towards the understanding of conformance checking results. One of them is the visualization of deviations. In industrial scenarios, thousands of deviations can easily pop up when assessing conformance, and it is not easy to rank the importance of each one with a criterion that really impacts the business of the organization. For instance, looking at Fig. 1, one can see the list of violations at the bottom of the figure, ordered by the percentage of the cases where these deviations occur. This may not necessarily be the most interesting ranking from a business perspective.

We now provide a list of particular challenges with the aim of triggering future research in the field. The list is by no means complete.

**Representing Uncertainty and Preventing Bias.** As mentioned above, conformance checking deals with the comparison of recorded behaviour against specified behaviour, typically represented as an event log and process model. Based on this comparison, conclusions can be drawn with respect to the recorded behaviour as well as the underlying process which produced the recorded behaviour.

This distinction becomes irrelevant only when the recorded behaviour contains all the process behaviour of interest. In all other situations, where the observed behaviour is only a sample of the complete process behaviour, a source of variation is introduced by the sample. Sampling variation will cause the outcome of the conformance checking activity, which is only an estimate of the true value, to vary over different samples. Initial work in this direction has recently been proposed [6,28].
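To make the effect of sampling variation concrete, the following minimal sketch resamples the cases of a log with replacement and reports how a simple conformance estimate varies across resamples. It is not the approach of [6,28]: the fitness function (the fraction of traces contained in a set of allowed traces), the toy log, and all names are assumptions made purely for illustration.

```python
import random

def trace_fitness(log, allowed):
    """Fraction of traces in the log that the (toy) model allows."""
    return sum(trace in allowed for trace in log) / len(log)

def bootstrap_interval(log, allowed, resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for trace_fitness, resampling whole cases."""
    rng = random.Random(seed)
    estimates = sorted(
        trace_fitness(rng.choices(log, k=len(log)), allowed)
        for _ in range(resamples)
    )
    lower = estimates[int((1 - level) / 2 * resamples)]
    upper = estimates[int((1 + level) / 2 * resamples) - 1]
    return lower, upper

# Hypothetical model behaviour and recorded log (one tuple per case).
allowed = {("a", "b", "c"), ("a", "c", "b")}
log = [("a", "b", "c")] * 70 + [("a", "c", "b")] * 20 + [("a", "c")] * 10

point = trace_fitness(log, allowed)
low, high = bootstrap_interval(log, allowed)
print(f"fitness estimate {point:.2f}, 95% bootstrap interval [{low:.2f}, {high:.2f}]")
```

The width of the interval gives a rough indication of how much the estimate could change if a different sample of cases had been recorded. It says nothing about bias, for instance when the recorded cases are not a representative sample.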

Information on the accuracy of a specific conformance estimate is important for a practitioner to make informed decisions. Unfortunately, representing uncertainty is typically ignored by existing conformance checking techniques and remains an important open challenge.

A second related challenge is that practitioners not only want an idea about the estimate's accuracy, but also want some guarantee that the estimate is unbiased. Various sources of errors exist which could lead to biased results, which receive too little attention in the existing work on conformance checking.

Some of the most relevant sources of errors are coverage error and construct validity. Coverage error occurs when the recorded behaviour is not a representative sample of the underlying process under analysis. This could be caused e.g. by non-random sampling or an incorrect definition of the underlying process. Construct validity refers to the question whether the conformance technique actually measures what it claims to be measuring.

This issue of estimate bias raises at least three important challenges. Firstly, there is a need for further development of conformance techniques which produce unbiased estimates, as recent research empirically challenged the claim that existing estimators are unbiased [27]. Secondly, with respect to construct validity, more attention should be given to making explicit what a measure actually represents. In particular the concept of generalisation suffers from an ambiguous and unclear definition, while other conformance measures are so complex that it is no longer clear what is measured and how it behaves. Thirdly, illustrating that a conformance estimator is unbiased should become a fundamental methodological part of any paper introducing and reviewing conformance checking techniques.

**Computational Feasibility.** As with many data analysis tasks, computational feasibility is a challenge. In the context of conformance checking, different elements contribute to this. One element lies in the current approach itself. As we highlighted before, alignment-based approaches are the state-of-the-art techniques for conformance checking due to their robustness and detailed diagnosis of deviations at the event level. However, computing alignments is also a computationally intensive operation that can take a long time to execute and can even be infeasible for industrial-sized processes.

Further, computational feasibility is challenged by the persistently growing size of event logs. In the industry, huge quantities of events are recorded. For example, Boeing jet engines can produce ten terabytes of operational information every thirty minutes and Walmart is logging one million customer transactions per hour. In these contexts, operational efficiency is typically of paramount importance and is ensured by having predefined operational protocols and guidelines. Consequently, aside from being capable of dealing with complex and large underlying processes, conformance checking techniques should also support large amounts of data.

Responding to these computational challenges, techniques tailored to large event logs and processes have been developed. For example, it is often not possible to store all the event data produced by large processes due to limited storage capacity. This has motivated techniques that allow conformance checking to be performed in an online setting on data streams that continuously produce event data related to ongoing cases. While a solution to one challenge, this response itself poses additional challenges.

**Online Conformance Checking.** Online conformance checking analyzes event streams to assess their conformance with respect to a given reference model (the reader is referred to [11]). The key aspect of this problem is that events must be analyzed immediately after they are generated (without storing them). The key benefit of this technique is the ability to detect deviations immediately, thus giving the process manager time to shift the trace back to the reference behaviour. More generally, the main benefit is the reduction in latency among the BPM lifecycle phases.

Event streams represent a specific type of data streams, where each data point is an event (as in standard event logs). General data stream mining techniques have been studied in the past and several stream operations models have been defined, including: insert-only streams; insert-delete streams; and additive streams. Respectively, events are only inserted; deleted; or "incremented" (this holds typically for numerical variables). Typically, event streams are assumed to be insert-only streams, where events are just added to the stream.

Since event streams are generally assumed to be unbounded and events are supposed to arrive at unpredictable pace, several constraints are imposed on the analysis. Specifically, once an element is received, it should be processed immediately: either it is analyzed or it is discarded. In case it is analyzed, it cannot be explicitly inspected again at a later stage: since the stream is unbounded, it is impossible to store it and its events have to be stored in an aggregated (or summarized) fashion. Additionally, the time scale plays a fundamental role in online conformance checking: a recent deviation is more important than older ones, as the process manager can immediately enact proper countermeasures.
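The constraints above (process each event once, keep only a summary, weigh recent deviations more heavily) can be sketched as follows. This is a deliberately simplified illustration and not the technique of [11]: the reference behaviour is reduced to a set of allowed directly-follows pairs, and the class name, the `max_cases` bound, and the `decay` factor are all hypothetical choices.

```python
from collections import OrderedDict

class StreamingChecker:
    """Toy online checker: constant work per event, bounded memory."""

    def __init__(self, allowed_pairs, start_activities, max_cases=1000, decay=0.9):
        self.allowed_pairs = allowed_pairs        # assumed reference behaviour
        self.start_activities = start_activities  # activities allowed to open a case
        self.max_cases = max_cases                # bound on remembered cases
        self.decay = decay                        # < 1: older deviations fade out
        self.last_activity = OrderedDict()        # case id -> last activity (summary only)
        self.deviation_score = 0.0

    def observe(self, case_id, activity):
        """Process one event immediately; return True if it conforms."""
        previous = self.last_activity.pop(case_id, None)
        if previous is None:
            conforms = activity in self.start_activities
        else:
            conforms = (previous, activity) in self.allowed_pairs
        self.last_activity[case_id] = activity      # re-insert as most recently seen
        if len(self.last_activity) > self.max_cases:
            self.last_activity.popitem(last=False)  # forget the least recently seen case
        self.deviation_score = self.decay * self.deviation_score + (0 if conforms else 1)
        return conforms

checker = StreamingChecker({("a", "b"), ("b", "c")}, {"a"}, max_cases=2)
stream = [("c1", "a"), ("c2", "a"), ("c1", "c"), ("c2", "b")]
results = [checker.observe(case_id, activity) for case_id, activity in stream]
print(results, checker.deviation_score)
```

Note the trade-off introduced by the memory bound: once a case has been forgotten, a later event of that case is judged as if it started a new case, which may itself produce a spurious deviation.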

Several problems are still open, for example how to find good conformance measures, which operate in efficient (i.e., constant) time. Other relevant and unsolved problems are handling streams where the arrival time of events does not coincide with the execution time (thus, events need to be "sorted" afterwards), or understanding when a process instance is really terminated (even if the termination state of the model has been reached).

**Desired Properties of Conformance Measures.** The objective of conformance checking is to provide insights on how well a model describes given event data or how well given event data describes the model. This is represented both quantitatively - for measuring conformance - and qualitatively - for providing diagnostic information. We discuss properties and challenges of measures and diagnostics information in conformance checking.

Similar to machine learning, conformance measures are used to assess how well a model describes the event data: A model should have a high fitness or recall to the log (be able to replay all observed traces) and a high precision to the log (show little additional behaviour). Models with high fitness and precision distinguish themselves further in terms of generalization (their ability to replay likely, but so far not observed traces of the process that generates the log) and simplicity (being structurally simple). In this sense, we use conformance measures to compare two different models M1 and M2 with each other in their ability to describe a given log L (in terms of fitness and precision), describe the unknown process P behind the log (in terms of generalization), and be easily understandable (through a simple model structure). A model M1 scoring higher than a model M2 in a measure is considered to be the "better" model. For most event logs, the quality measures define a Pareto front: a model scoring better on one measure scores worse on another measure, leading to a set of "best" models for which no model can be found scoring better on any measure without scoring worse on other measures. With these properties, conformance measures have two main applications: helping a user decide which among a set of possible models is a preferred description of the event data, and evaluating and benchmarking algorithms in process mining.
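The notion of a Pareto front over quality measures can be illustrated with a short sketch. The model names and the (fitness, precision) scores below are hypothetical values chosen for illustration; any number of measures could be used as long as higher means better.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every measure and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the names of the models that no other model dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(dominates(other, own) for o, other in scores.items() if o != name)
    ]

# Hypothetical (fitness, precision) scores of four candidate models.
scores = {
    "M1": (1.00, 0.40),
    "M2": (0.90, 0.80),
    "M3": (0.85, 0.75),
    "M4": (0.60, 0.95),
}
front = pareto_front(scores)
print(front)
```

Here M3 is dominated by M2, which scores higher on both measures, while M1, M2 and M4 each trade fitness against precision and therefore remain on the front.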

As has recently been suggested [55], establishing certain axioms is a safe way to determine the boundaries of a technique for measuring a particular quality dimension. These axioms are expected to clarify important aspects such as logical consistency, robustness, and confidence, to name a few.

## **7 Conclusions**

This chapter provided an overview on conformance checking, aiming at covering the basic techniques, pinpointing what are the natural applications of the field and looking into the future by listing challenges that we believe will be crucial to overcome in the years to come. The chapter may be seen as a gentle introduction to the reference book in the field, where most of the topics are extensively developed [14].

**Acknowledgements.** This work has been supported by MCIN/AEI funds under grant PID2020-112581GB-C21.

## **References**



**Data Preprocessing**

# **Foundations of Process Event Data**

Jochen De Weerdt<sup>1</sup> and Moe Thandar Wynn<sup>2</sup>

<sup>1</sup> KU Leuven, Leuven, Belgium jochen.deweerdt@kuleuven.be <sup>2</sup> Queensland University of Technology, Brisbane, Australia

**Abstract.** Process event data is a fundamental building block for process mining as event logs portray the execution trails of business processes from which knowledge and insights can be extracted. In this Chapter, we discuss the core structure of event logs, in particular the three main requirements in the form of the presence of case IDs, activity labels, and timestamps. Moreover, we introduce fundamental concepts of event log processing and preparation, including data sources, extraction, correlation and abstraction techniques. The chapter is concluded with an imperative section on data quality, arguably the most important determinant of process mining project success.

## **1 Introduction**

This chapter is devoted to a core building block of process mining, namely event data or event logs. The particularities of event logs, in comparison to traditional attribute-value data sets used for non-process mining data science and analytics applications, make dedicated analysis techniques worthwhile. To put it more concretely, classical data science analyses, e.g. learning a decision tree or running a clustering algorithm, when straightforwardly applied to an event log, will not give you workable results. This is because events in an event log, which can be considered as the observations (rows) in our dataset, are related to each other in terms of time and by means of an overarching case dimension, which, when not taken into account via dedicated analysis techniques, leads to useless or biased results. In this chapter, we will first explain and exemplify the fundamental structure of event logs. In addition, we will discuss the most common sources from which event logs can be obtained. Furthermore, we will dive into the data preprocessing pipeline, bringing in the perspectives of event extraction, correlation and abstraction. Finally, given the uphill battle in many organizations in terms of data availability and especially data quality, we close the chapter with a discussion of this theme.

## **2 The Fundamental Structure of Event Logs**

We refer to [3] for the conceptual definition of an event log. Here, we will complement the definition with a more practical view on the essential event log data requirements, an exploration on additional data attributes, an analysis of event types, as well as the link to the XES storage standard.

#### **2.1 Essential Event Log Data Requirements**

Figure 1 illustrates an excerpt of an example event log related to a fictitious Purchase-to-Pay (P2P) process. This small excerpt can help to understand the three essential data requirements for event logs to be analysis-ready for process mining technique application. First, each event should be linked to a case or process instance, typically by using a *Case ID*. This is "Requirement 1". In the simple example of Fig. 1, each case or process instance will refer to one procurement of a product or service by an organization with one of its suppliers. Events will be collected for every process instance and will pertain to activities or steps executed within the different stages of the P2P process (e.g. requisitioning, invoicing, reception of goods, etc.).

We thus argue that the presence of a Case ID is an essential requirement for an event log. However, it should be pointed out that Case IDs are not always straightforwardly available. This problem has been addressed both in process mining literature and in practice, and is often referred to as *event correlation*. This topic is addressed in Sect. 3. There also exists research on the direct application of process mining techniques on event data without Case IDs (e.g. [27]); however, this is a rather niche application. Nevertheless, it is important to point out that, in contrast to static event logs, an increasing number of process mining techniques are developed for streams of events. In such event streams, the notion of a Case ID is often even more complicated.

**Fig. 1.** Example event log from a fictitious P2P process, illustrating the three essential requirements: presence of a case ID, activity label, and timestamp per event.

The second key requirement ("Requirement 2") for event log data is the fact that each event should correspond to an activity executed within the process. More specifically, an assumption is made that there exists a restricted set of labels, reflecting the activities in the business process, to which each event is mapped. In Fig. 1, this is shown in the second column. Given that activity labels are simple strings, there is a lot of freedom to tailor the activity label for the right analysis viewpoint. However, oftentimes, natural log data is stored at lower levels of granularity than desired for analysis purposes. Typically, one would prefer that the granularity level of activities is such that they can be understood and interpreted by business experts. Nonetheless, a lot of event data exists for which the granularity level is much lower. In Sect. 3, we discuss the task of bringing lower level events to a better granularity level, which is referred to as *event abstraction*.

Finally, the last requirement ("Requirement 3") entails that there exists an ordering of the events pertaining to a case. As such, each case logically consists of a sequence of events. Most often, this ordering will be derived from a timestamp attribute. However, this is not strictly mandatory, given that the order could also be derived from the order in which events are recorded in a database or table, insofar as the order in which events were recorded matches their actual execution order within the process.
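The three requirements can be summarized in a small sketch that turns a table of events into one ordered trace per case. The rows below are hypothetical and only mimic the style of a P2P log; the column order (case ID, activity, timestamp) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (case ID, activity label, timestamp), in arbitrary order.
events = [
    ("PO-2", "Create Purchase Order", "2021-03-02T09:15:00"),
    ("PO-1", "Receive Goods", "2021-03-05T14:00:00"),
    ("PO-1", "Create Purchase Order", "2021-03-01T10:00:00"),
    ("PO-1", "Receive Invoice", "2021-03-04T11:30:00"),
    ("PO-2", "Receive Invoice", "2021-03-03T16:45:00"),
]

def build_traces(events):
    """Group events by case ID (Requirement 1) and order them by timestamp
    (Requirement 3), keeping only the activity labels (Requirement 2)."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((datetime.fromisoformat(timestamp), activity))
    # The sort is stable: events with equal timestamps keep their recorded order.
    return {
        case_id: [activity for _, activity in sorted(rows, key=lambda row: row[0])]
        for case_id, rows in by_case.items()
    }

traces = build_traces(events)
for case_id, trace in traces.items():
    print(case_id, trace)
```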

It should be pointed out that, while a Case ID, Activity and Timestamp column are essential requirements in order to be able to conduct process mining analyses, their definition might not always be as clear cut as is the case for the illustrative example. For instance, for many real-life datasets, different choices can be made in terms of using one single or multiple columns to create the activity label, and as such provide a different perspective on the process. A similar effect can also occur for Case IDs, where for instance, with an example from a clinical pathway perspective, the use of a patient ID instead of an admission ID as case identifier, can yield a very different analysis.
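How such choices change the resulting view can be sketched with a toy clinical example. The column names (`patient`, `admission`, `department`, `action`) and the rows are hypothetical; the rows are assumed to be listed in execution order already.

```python
from collections import defaultdict

# Hypothetical hospital records, assumed to be in execution order.
rows = [
    {"patient": "P1", "admission": "A1", "department": "ER", "action": "Triage"},
    {"patient": "P1", "admission": "A1", "department": "ER", "action": "Discharge"},
    {"patient": "P1", "admission": "A2", "department": "Cardiology", "action": "Triage"},
    {"patient": "P1", "admission": "A2", "department": "Cardiology", "action": "Discharge"},
]

def view(rows, case_column, activity_columns):
    """Build traces using the chosen case identifier and activity label columns."""
    traces = defaultdict(list)
    for row in rows:
        label = " / ".join(row[column] for column in activity_columns)
        traces[row[case_column]].append(label)
    return dict(traces)

per_admission = view(rows, "admission", ["action"])
per_patient = view(rows, "patient", ["department", "action"])
print(per_admission)
print(per_patient)
```

With the admission ID as case identifier, the log contains two short cases with identical behaviour; with the patient ID and a composite activity label, the same rows form a single longer case that shows the patient returning to the hospital.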

#### **2.2 Additional Data Attributes**

In addition to the mandatory elements of a Case ID, Activity, and Timestamp, event logs will usually contain several or often many additional attributes (columns). In Fig. 1, the event log contains additional attributes including Vendor, Plan, Country, City, Value and Order Quantity. In our example, the values for these attributes remain constant within a single case, and accordingly can be considered as process instance-level attributes. However, this is not mandatory, as attributes can pertain to events or activities, and might be updated throughout the execution of a process instance. For instance, an item number or item type that is recorded when a purchase order item is created is an example of such an event-level attribute.
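Whether an attribute behaves as a case-level or an event-level attribute in a given log can be checked with a simple heuristic: an attribute that takes exactly one value within every case is a candidate case-level attribute. The sketch below assumes events are dictionaries with a hypothetical `case_id` key; the attribute names and values are illustrative.

```python
from collections import defaultdict

def classify_attributes(events, case_key="case_id"):
    """Label each additional attribute as case-level or event-level, based
    only on the values observed in this log (a heuristic, not a guarantee)."""
    seen = defaultdict(lambda: defaultdict(set))  # attribute -> case -> values
    for event in events:
        for attribute, value in event.items():
            if attribute != case_key:
                seen[attribute][event[case_key]].add(value)
    return {
        attribute: "case-level"
        if all(len(values) == 1 for values in per_case.values())
        else "event-level"
        for attribute, per_case in seen.items()
    }

# Hypothetical events: Vendor is constant per case, Item Type is not.
events = [
    {"case_id": "PO-1", "Vendor": "V1", "Item Type": "Laptop"},
    {"case_id": "PO-1", "Vendor": "V1", "Item Type": "Monitor"},
    {"case_id": "PO-2", "Vendor": "V2", "Item Type": "Laptop"},
]
levels = classify_attributes(events)
print(levels)
```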

Additional data attributes can have many purposes, but typically the following three uses are most important. Foremost, these additional data attributes can help to filter cases and events in order to obtain a more focused analysis viewpoint or perform comparative analysis between subsets of process instances. Secondly, these additional data attributes might contain valuable context information, and can therefore be exploited to gain better insights into the process. For instance, a textual comment field in an incident management process could contain essential information regarding the problem at hand, which in turn might impact routing choices, timing, resource allocation, etc. Finally, the availability of additional data attributes, especially information on resources, costs, etc. opens up possibilities for the application of process mining techniques that go beyond process discovery and conformance checking. For instance, organizational mining techniques were developed to focus on resources employed within the process [53]. Moreover, these additional data attributes also play a fundamental role in decision mining [18,47] (see [17]) and predictive process monitoring [19] (see [20]).

#### **2.3 Storing Event Data**

Event data is intrinsically simple attribute-value data, easily visualized in a two-dimensional table. Nonetheless, unstructured data formats including Excel-files or plain text files, without any form of underlying schema, fail to serve as a proper storage format. This is mainly due to the complex interactions between events, cases, and their attributes. This observation drove the development of the eXtensible Event Stream (XES) standard [1], an IEEE Standards Association-approved language to transport, store, and exchange event data. Its metadata structure is represented in Fig. 2. XES uses the W3C XML Schema definition language, guaranteeing interoperability between various systems. An IEEE XES instance corresponds to a file-based event log or formatted event stream that can be used to transfer event data in a unified manner. In IEEE XES, events are considered as an observed atomic granule of activity. Next to events, IEEE XES specifies the concept of a log, a trace, and an attribute component. Event and/or trace classifiers are used to assign an *identity* to traces and events. The standard does not define a specific set of attributes for events, traces or logs. However, it does allow for *extensions*. An extension can be used to define a set of attributes for events, traces and/or logs. For instance, a common set of attributes can be defined for event logs within a particular application domain. An overview of currently available standard extensions is available on the XES website<sup>1</sup>.
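As an illustration of the log, trace, event and attribute components, the sketch below parses a heavily simplified XES-style fragment with the Python standard library. The fragment is an assumption made for illustration: real XES files typically also declare extensions, classifiers and global attributes, and may use an XML namespace, in which case the element lookups below would need to be namespace-qualified. The attribute keys `concept:name`, `time:timestamp` and `lifecycle:transition` are those of the standard concept, time and lifecycle extensions.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XES-style fragment (no extension declarations).
XES_FRAGMENT = """
<log xes.version="1.0">
  <trace>
    <string key="concept:name" value="PO-1"/>
    <event>
      <string key="concept:name" value="Create Purchase Order"/>
      <string key="lifecycle:transition" value="complete"/>
      <date key="time:timestamp" value="2021-03-01T10:00:00.000+01:00"/>
    </event>
    <event>
      <string key="concept:name" value="Receive Invoice"/>
      <string key="lifecycle:transition" value="complete"/>
      <date key="time:timestamp" value="2021-03-04T11:30:00.000+01:00"/>
    </event>
  </trace>
</log>
"""

def attribute(element, key):
    """Return the value of the direct child attribute with the given key."""
    for child in element:
        if child.get("key") == key:
            return child.get("value")
    return None

def read_traces(xes_text):
    """Map each trace name to the list of its event names, in document order."""
    root = ET.fromstring(xes_text.strip())
    return {
        attribute(trace, "concept:name"): [
            attribute(event, "concept:name") for event in trace.findall("event")
        ]
        for trace in root.findall("trace")
    }

xes_traces = read_traces(XES_FRAGMENT)
print(xes_traces)
```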

#### **2.4 Event Types**

To conclude the section on the fundamental structure of event logs, it is important to point to the concept of event types or lifecycle transitions of activities. When sourcing events from many process-aware information systems, events oftentimes relate to the transactional lifecycle that activities undergo. One example of such a transactional lifecycle model is shown in Fig. 3a. This is the transition lifecycle model of the BPMN 2.0 standard<sup>2</sup>. Such a transactional lifecycle model describes the states and state transitions which an activity might take in its execution. Also in IEEE XES, a *lifecycle* extension has been approved, which specifies a default activity lifecycle<sup>3</sup>. This state machine is shown in Fig. 3b.

<sup>1</sup> http://www.xes-standard.org/.

<sup>2</sup> https://www.omg.org/spec/BPMN/2.0/.

<sup>3</sup> http://www.xes-standard.org/.

**Fig. 2.** The IEEE XES metadata structure

(a) The lifecycle of an activity as defined in BPMN 2.0

(b) State machine illustrating the most typical transitions in an activity's lifecycle, according to XES

**Fig. 3.** Two different activity life cycle models

When retrieving data from process-aware information systems, especially from Business Process Management Systems (BPMS) [43], a large collection of event types might be readily available. This is oftentimes not the case in other environments, for instance for web data. In case there are no defined event types, one typically assumes that an event pertaining to the execution of an activity reflects the completion of the activity. In this case, every activity execution is represented by a single event. However, having only a single event per activity execution does not allow distinguishing between waiting time and execution time of activities. As such, for more fine-grained performance analysis, one would typically prefer two events per activity execution, indicating its start and completion time.
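The benefit of having both a start and a complete event per activity execution can be sketched as follows. The events and timestamps are hypothetical, and the function assumes a strictly sequential case in which every start event is eventually followed by the matching complete event; waiting time is measured from the completion of the previous activity.

```python
from datetime import datetime

# Hypothetical events of one case: (activity, lifecycle transition, timestamp).
events = [
    ("Create Purchase Order", "start", "2021-03-01T10:00:00"),
    ("Create Purchase Order", "complete", "2021-03-01T10:20:00"),
    ("Approve Purchase Order", "start", "2021-03-02T09:00:00"),
    ("Approve Purchase Order", "complete", "2021-03-02T09:05:00"),
]

def activity_times(events):
    """Return (activity, waiting minutes, execution minutes) per execution."""
    result, started, previous_complete = [], {}, None
    for activity, transition, timestamp in events:
        moment = datetime.fromisoformat(timestamp)
        if transition == "start":
            started[activity] = moment
        elif transition == "complete":
            start = started.pop(activity)
            waiting = 0.0
            if previous_complete is not None:
                waiting = (start - previous_complete).total_seconds() / 60
            execution = (moment - start).total_seconds() / 60
            result.append((activity, waiting, execution))
            previous_complete = moment
    return result

times = activity_times(events)
for activity, waiting, execution in times:
    print(f"{activity}: waited {waiting:.0f} min, executed {execution:.0f} min")
```

With only the complete events available, the 1360 minutes before the approval and the 5 minutes of actual approval work would be indistinguishable.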

## **3 Event Log Preprocessing**

Data preprocessing is a fundamental part of any data science project. While not as attractive as model building or deployment, the preprocessing stage of a project is often the most time- and effort-consuming. Estimates indicate that 80% of resources in typical data science projects are devoted to data preprocessing. One model illustrating the typical data analytics process is depicted in Fig. 4. This model, originally introduced in [25] as the Knowledge Discovery in Databases (KDD) process, reflects the main stages in the execution of a data analytics process. It should be pointed out that this model is an oversimplification of reality, given the frequent and unpredictable iterations that most often occur, rendering the management and completion of a typical data science project usually much more difficult. One notable complexity is the preprocessing of data, usually consisting of data selection, data cleaning, and data transformation.

**Fig. 4.** A representation of the typical stages in a data analytics project [25]

In this part, we want to zoom in on a couple of aspects related to the different stages of a process mining-based analytics project. Most importantly, we want to elaborate on event log data sources, as well as the differences in terms of pipelines between classical data analytics projects and process mining projects.

#### **3.1 Event Log Data Sources**

Event data is rapidly becoming an almost untameable beast, given the widespread and drastic increase in availability of such data in application domains ranging from typical service sector companies including banks and insurers, over manufacturing, to healthcare and education. At system level, we identify the following categorization of most common and important sources for event data:


for business value and competitive advantage creation, customer-centric process mining analysis has strong potential. As such, in addition to CRM data, process mining has a strong interest into event data produced on these online platforms. Please note that, in many cases, including for instance learning environments such as MOOCs, a default standard for web-based platforms to store data is JSON (JavaScript Object Notation).

– **Internet of Things (IoT):** Finally, IoT systems also hold high potential as a source of event data. Sensors and actuators have been deployed widely for all kinds of purposes. Although the granularity gap between typical IoT data (sensor readings) and event data is sometimes challenging to bridge, IoT is becoming a hugely important source of event data in areas such as security, manufacturing, healthcare, and transportation.

It should be pointed out that this is not a comprehensive list of all possible event log data sources. In an online survey with 289 participants spanning the roles of practitioners, researchers, software vendors, and end-users, SAP ECC (R/3), SAP S/4 HANA, and Salesforce were selected as the top three most analyzed source systems for process mining analysis [57].

#### **3.2 A Comparison with Classical Analytics Data Preprocessing**

While sourcing appropriate data is always the first step in any data preprocessing exercise, it seems reasonable to state that in many situations, analysts could rely on a vast amount of event data sources. This is in line with classical analytics tasks, for which a growth in available data has been observed as well. However, in comparison to classical data preprocessing stages within an analytics process, starker differences exist at the level of cleaning and transforming data.

With respect to data cleaning, where in classical setups, problems including missing values and outliers are a main focus, data cleaning of event logs has received much less scientific and practical attention. A more detailed discussion on data quality for process mining can be found below in Sect. 5. Other differences between a process mining project process and a classical data analytics process are even more notable.

First, at the selection stage, a typical procedure within classical data analytics is to divide the obtained data into training and test data early on in the process. Especially when considering predictive analytics, it is of crucial importance to evaluate the true predictive power of learned models by means of independent test data that was not used for training the model. This procedure is rarely seen in process mining, with the exception of some works on predictive process monitoring. One could claim that this is due to the more unsupervised nature of process discovery algorithms; nonetheless, the difference remains striking.

Another essential data preprocessing step for classical data analytics projects relates to transforming the feature space such that more valuable features are provided to algorithms for training models. Feature transformation includes techniques such as normalization, grouping and binning. Moreover, advanced feature engineering is also an important but often neglected step to improve model performance. Feature engineering aims at crafting new features based on the original data. The typical data format of event logs, consisting of events pertaining to cases, means that the "rows" in an event log are intrinsically correlated. This invalidates the assumption of data being independent and identically distributed (IID), a central assumption underpinning nearly every machine learning technique. For process mining, however, when considering events as the observation level, they are by definition not IID. As such, a large majority of techniques addressing data cleaning and feature transformation, including advanced feature engineering, serve little purpose when applied directly to event data.

When making an assessment of one of the most recently introduced process mining methodologies, i.e. PM<sup>2</sup> [56], four event data preprocessing tasks are defined: (1) creating views, (2) filtering logs, (3) enriching logs, and (4) aggregating events. All these tasks are tailored to the process mining context, and have no immediate corresponding task in a classical data analytics pipeline. For instance, in CRISP-DM [52], data preparation includes selection, cleaning, construction, integration and formatting of data. Several process mining case studies such as the one presented in [6] adapted CRISP-DM to work with healthcare datasets.

In the next section, we will dive deeper into the problem of event log preparation, which is often extensive and demanding, especially when data for process mining cannot be sourced from process-aware information systems.

## **4 Event Log Preparation**

While possibly not perfectly disjoint, event log preparation often includes three types of techniques: extraction, correlation and abstraction [21]. Figure 5 illustrates the relationship between these types of techniques and fundamental process mining concepts.

**Fig. 5.** Event log preparation techniques (extraction, correlation, and abstraction) and their relationship to key process mining concepts [21].

In what follows, we will provide a summary overview of reported tools and techniques for extraction, correlation and abstraction of event data.

#### **4.1 Extraction of Event Data**

Extraction refers to obtaining event data from source systems, most often databases underlying a variety of information systems. Generally, data stored in such databases is not recorded with a process perspective in mind, and therefore will not automatically reflect essential concepts such as events and traces. Accordingly, identification of relevant event data is a primordial challenge. It often requires strong domain knowledge, and despite standardization efforts, often remains prone to ad-hoc solutions.

Two perspectives should be separated when investigating solutions for event data extraction. On the one hand, there is commercial process mining software, where vendors have adopted a clear strategic focus to address the challenges that come with extraction of event logs. Accordingly, a majority of commercial process mining tools comes with software solutions (connectors) that have been developed to allow tapping into all kinds of source systems and databases. Such connectors define how to extract relevant event data from particular source systems and which additional transformations should be applied. As such, these tools promise the holy grail of automating data extraction, a problem addressed in the academic community for over a decade.

One of the first tools stemming from scientific research was the ProM Import Framework [31]. Already in these early days, the idea of an extensible plug-in architecture allowing to develop adapters to hook into a large variety of systems was proposed and partially implemented. With the uptake of XES, XESame was developed as a more flexible successor to the ProM Import Framework. Other researchers have focused on extraction from ERP systems, e.g. the EVS Model Builder [33] and XTract [41], or other operational systems, e.g. Eventifier [46].

Another important stream of research within the realm of event extraction addresses object or artifact centricity. Many source systems, including popular ERP systems, store data at the logical level of objects instead of providing a true process perspective. Oftentimes, assumptions in terms of a desired perspective (definition of case ID and activity) are required in order to flatten an object-centered database into a "flat" event log. One noteworthy scientific initiative in this context is ontology-based data access (OBDA) for event log extraction [13, 14]. The approach is based on an ontological view of the domain of interest, which is linked to a database schema, and has been implemented in the Onprom tool. Finally, the recently introduced OCEL standard<sup>4</sup> is another relevant piece of work, putting forward a general standard to interchange object-centric event data with multiple case notions.
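The effect of flattening can be illustrated with a toy example in which each event refers to several objects. The dictionary layout below is a hypothetical in-memory representation chosen for illustration, not the OCEL serialization format; object types, identifiers and activities are likewise invented.

```python
from collections import defaultdict

# Hypothetical object-centric events: each event may refer to several objects.
events = [
    {"activity": "Create Order", "time": 1, "objects": {"order": ["O1"], "item": ["I1", "I2"]}},
    {"activity": "Pick Item", "time": 2, "objects": {"item": ["I1"]}},
    {"activity": "Pick Item", "time": 3, "objects": {"item": ["I2"]}},
    {"activity": "Ship Order", "time": 4, "objects": {"order": ["O1"], "item": ["I1", "I2"]}},
]

def flatten(events, object_type):
    """Flatten to a classical log by using one object type as the case notion."""
    traces = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        for object_id in event["objects"].get(object_type, []):
            traces[object_id].append(event["activity"])
    return dict(traces)

by_order = flatten(events, "order")
by_item = flatten(events, "item")
print(by_order)
print(by_item)
```

Choosing the order as case notion drops the two Pick Item events, which do not refer to an order, whereas choosing the item replicates Create Order and Ship Order in every item trace. Both effects are consequences of the chosen perspective rather than properties of the underlying process.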

The XES survey also uncovered the top tools that are currently being used by the process mining community for the preparation of event logs [57]. There is also ongoing work by the IEEE Task Force on reinventing the IEEE XES standard to address several data-related challenges identified in the XES survey [57], in particular, to capture the semantics of event data and to support complex data structures.

<sup>4</sup> http://ocel-standard.org/.

#### **4.2 Correlation of Event Data**

Mapping event data extracted from source systems and databases to cases (instances of the business process under investigation) is denoted as correlation. In cases where event data is obtained but Case IDs are missing, a non-trivial process can be started to automatically or semi-automatically generate Case IDs. In a scientific context, several solutions have been proposed, most of them being focused on using additional event data attributes [12,15,42,44,48], sometimes aided by a conceptual model [9,40] or even a process model [8,37].
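As a rough illustration of attribute-based correlation, the sketch below groups events that share a value on any of a set of correlating attributes, using a small union-find structure. It is a simple heuristic under stated assumptions, not one of the cited techniques; the attribute names `order_no` and `invoice_no` and the function `correlate` are hypothetical.

```python
def correlate(events, keys):
    """Assign a case index to each event: events sharing a value on any
    correlation key end up in the same case (union-find over event indices)."""
    parent = list(range(len(events)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (key, value) -> index of the first event carrying it
    for idx, event in enumerate(events):
        for key in keys:
            value = event.get(key)
            if value is None:
                continue
            token = (key, value)
            if token in first_seen:
                parent[find(idx)] = find(first_seen[token])
            else:
                first_seen[token] = idx
    return [find(i) for i in range(len(events))]

events = [
    {"activity": "Create Order", "order_no": "A1"},
    {"activity": "Create Invoice", "order_no": "A1", "invoice_no": "X9"},
    {"activity": "Receive Payment", "invoice_no": "X9"},
    {"activity": "Create Order", "order_no": "B2"},
]
case_ids = correlate(events, keys=["order_no", "invoice_no"])
```

The payment event carries no order number, yet it is linked to order A1 transitively through the invoice. Chaining of this kind is what makes correlation non-trivial once identifiers are spread over several documents.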

In practical situations, the problem of correlating event data is probably more related to a variety of non-integrated data sources, which all capture or support part of a business process. As such, an integration of these different sources should be achieved. Hereto, especially when an organizational data warehousing architecture is present, Extract-Transform-Load (ETL) processing would be a default technology to resort to. ETL tools are perfectly equipped to derive and deploy matching schemes to integrate data from non-integrated data sources. Nevertheless, an ETL-approach leading to a data consolidation integration pattern is not the sole option. Increasingly, companies start to focus on the introduction of data virtualization layers in order to realize a more federation-oriented data integration. Data federation can prevent the creation of yet another duplicated database or data store, but instead provides flexible querying and analysis tools for information from multiple source systems as if all data resides within a single integrated database.

#### **4.3 Abstraction of Event Data**

Next to extraction and correlation, abstraction is considered the third prong of the process mining event data preparation trident. In many real-world scenarios, event data is stored at a much more fine-grained level of granularity than that of business-understandable process activities. As such, abstraction techniques can be considered as mapping techniques that translate one or more lower-level events into higher-level events pertaining to business process activities. For a detailed taxonomy of event abstraction, we refer the interested reader to [59].

**IoT.** One particular field of application in which event abstraction is becoming a crucial factor for success is IoT business processes [34]. In IoT, a wide variety of sensors and actuators record contextual observations of a physical environment. These sensor readings or measurements give rise to low-level events, which are intrinsically useful to derive activity-level events from. For instance, in [51], a technique for mapping location-based sensor data to process activities was proposed using so-called *interactions*. Another prominent work in this area is [23], which relies on clustering of segmented continuous sensor data to derive higher-level activities.

**Clustering.** Given that event abstraction is a largely unsupervised learning problem in most cases (i.e. unless domain knowledge is used, there is no natural target available), an intuitive way to map lower-level events to coarse-grained events is clustering. The earliest proposed event abstraction techniques took this perspective: by clustering sets or sequences of lower-level events, an abstraction into higher-level events can be obtained. For instance, in [32], coherent subsequences of events are learned via trace segmentation to create coarse-granular events. Clustering techniques for event abstraction have also been put forward in [29,45].

**Pattern-Based Approaches.** Another frequently used paradigm to perform abstraction is pattern matching. The work by Bose and van der Aalst [11] can be considered the origin of pattern-based abstraction. Repeated local subsequence patterns, e.g. maximal repeats or tandem arrays, are discovered and used as a basis for the creation of coarse-granular activities. In [38], a more advanced technique is proposed based on mining local process models.
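The idea of abstracting over a tandem array can be sketched as follows. The example assumes that the repeating pattern has already been discovered and is passed in explicitly; pattern discovery itself, as well as the activity names and the function `collapse_tandem`, are hypothetical simplifications rather than the method of [11].

```python
def collapse_tandem(trace, pattern, label):
    """Replace each maximal run of consecutive repetitions of `pattern`
    (a tandem array) by a single higher-level activity `label`."""
    k = len(pattern)
    out, i = [], 0
    while i < len(trace):
        if trace[i:i + k] == pattern:
            # Skip over all back-to-back repetitions of the pattern.
            while trace[i:i + k] == pattern:
                i += k
            out.append(label)
        else:
            out.append(trace[i])
            i += 1
    return out

trace = ["register", "scan", "check", "scan", "check",
         "scan", "check", "archive"]
abstracted = collapse_tandem(trace, ["scan", "check"], "Verify Documents")
```

Here three repetitions of the low-level pair are replaced by one coarse-granular activity, which shortens the trace from eight events to three.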

**Supervised Learning.** Despite the unsupervised nature of the problem, abstraction techniques will often leverage additional domain knowledge, a process model, or other information to turn the problem into a more supervised one. The technique in [7] relies on a predefined process model, an approach also followed by [26]. Other approaches expect supervision in the form of a set of annotated traces in which fine-granular event sets are matched with a higher-level activity [55], or in the form of timing information, e.g. for sessionization as in [36]. Another example of event abstraction, from the healthcare domain, was presented in [35]; it relies on multi-level semantic abstraction using a combination of ontologies and dynamic programming. Active learning is also a promising pathway, bringing the expert into the learning loop to solve the supervision problem.
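Timing information as a form of supervision can be illustrated with a simple gap-based sessionization. The sketch assumes that the events of a single case are given as `(timestamp, label)` pairs with timestamps in seconds, and that a domain expert supplies the threshold `max_gap`; these names and the threshold value are assumptions for illustration and do not reproduce the technique in [36].

```python
def sessionize(events, max_gap):
    """Group (timestamp, label) events of one case into sessions: a new
    session starts whenever the gap to the previous event exceeds max_gap."""
    sessions = []
    for ts, label in sorted(events):
        if sessions and ts - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((ts, label))
        else:
            sessions.append([(ts, label)])
    return sessions

low_level = [(0, "open form"), (5, "type name"), (9, "click save"),
             (300, "open form"), (304, "click submit")]
sessions = sessionize(low_level, max_gap=60)

# One higher-level event per session: (start, end, number of low-level events).
high_level = [(s[0][0], s[-1][0], len(s)) for s in sessions]
```

Each session can then be labeled as one higher-level activity, for instance by an expert or by a downstream classifier.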

## **5 Process Mining Data Quality Considerations**

"Garbage in, garbage out." It is by far the most mentioned quote in data science and far beyond. But it appears that the more the quote is used, the more relevant it becomes. In process mining, while the problem has been acknowledged in both scientific literature and in practice [57], there is still a need for further research into the development of a comprehensive framework to address the problem of bad quality data leading to incorrect analysis results [58]. We also need to have a better understanding of the root-causes of such data quality issues [5,24].

#### **5.1 Data Quality Dimensions**

Some typical data quality dimensions are shown in Fig. 6 [39]. Although there are some similarities between the data quality challenges encountered for event data and traditional data sets for data mining, a key distinguishing factor is our need for detailed correlated event data in their raw form, to capture the true behavior of processes.

In [10], four broad data quality dimensions are identified for event logs: missing data, incorrect data, imprecise data and irrelevant data. Among these four dimensions, incorrect data (where a data item is not recorded correctly) and imprecise data (where a recorded value is too coarse to be useful) for key event attributes such as activity labels and timestamps could have significant consequences for all forms of process mining techniques.


**Fig. 6.** An overview of some of the most common data quality dimensions, taken from [39].

#### **5.2 Detection and Repair**

The process mining manifesto [2] categorizes the quality of event data on a scale from one star to five stars; most real-life event logs are found to lie between these two extremes and exhibit many quality issues [58]. Some advocate repairing or fixing the erroneous data, while others argue that the data should be left alone as it is meant to reflect reality. Regardless of one's personal view, these data quality issues unavoidably have to be dealt with in one way or another. For process mining professionals, it is imperative to measure the quality of an event log with respect to the type of process mining analysis being considered [58]. Data pre-processing is recognized as one of the most time-consuming aspects of a process mining study, with many analysts spending 60–80% of their effort, and some up to 90%, on this step [57].

Suriadi et al. [54] identified eleven event log imperfection patterns based on their experience with over 20 Australian industry data sets. The eleven patterns include form-based event capture, inadvertent time travel, unanchored event, scattered event, elusive case, scattered case, collateral event, polluted label, distorted label, synonymous labels and homonymous labels. These event log patterns have been used as a starting point for detection and repair of quality issues in event logs.

There is a growing body of work focusing on the detection and repair of data quality issues associated with activity labels, timestamps, and event orderings. In [49], crowdsourcing and gamification approaches are proposed to solicit domain expert knowledge for the detection and repair of activity labels, while [50] proposes an automated context-aware approach to detecting synonymous and polluted activity labels in an event log. In [28], the authors describe a framework to detect timestamp quality issues in an event log and propose measures to quantify the extent of these issues as a way to measure the overall quality of an event log. In [16], an approach to automatically repairing same-timestamp errors in an event log is presented. In [22], an interactive approach to detect and repair event order imperfections in an event log is presented.
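A first detection step for same-timestamp issues can be sketched in a few lines. The example assumes a flat log of `(case_id, activity, timestamp)` tuples and only flags candidates; it implements neither the repair approach of [16] nor the quality measures of [28], and the function name and the ratio computed at the end are illustrative assumptions.

```python
from collections import defaultdict

def same_timestamp_groups(log):
    """Return {(case_id, timestamp): [activities]} for every timestamp that
    occurs more than once within the same case."""
    buckets = defaultdict(list)
    for case_id, activity, timestamp in log:
        buckets[(case_id, timestamp)].append(activity)
    return {key: acts for key, acts in buckets.items() if len(acts) > 1}

log = [
    ("c1", "Register", "2021-03-01T09:00:00"),
    ("c1", "Check", "2021-03-01T09:00:00"),
    ("c1", "Approve", "2021-03-01T10:30:00"),
    ("c2", "Register", "2021-03-01T09:00:00"),
]
suspects = same_timestamp_groups(log)

# Share of events whose order within their case cannot be derived from time.
ratio = sum(len(acts) for acts in suspects.values()) / len(log)
```

Whether a flagged group is an actual error (e.g. form-based event capture) or a legitimate batch of simultaneous events remains a judgment that requires domain knowledge.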

#### **5.3 Quality-Informed Process Mining**

Although data quality issues are by now well acknowledged in the process mining community, most existing process mining algorithms do not explicitly take the potential presence of data quality issues into account. A notable exception is the removal of infrequent behavior or noise from discovered process models. The algorithms also typically treat an event log as the "whole truth" without considering the potential effects of data pre-processing on the reliability of the results [58]. This could lead to misleading or inaccurate conclusions about the process under investigation. In [30], the authors proposed a range of quality annotations at the event, trace and log levels to keep track of the data quality issues found in an event log and to record the extent of the repairs made to the event log as a result. Such metadata about data quality can assist in undertaking quality-informed process mining. One such algorithm is provided by the 'Quality-Informed visual Miner' plug-in, which demonstrates the use of these data quality annotations for conformance checking and performance analysis purposes.

Alternatively, it is possible to determine whether certain data attributes are of high quality (i.e., fit for purpose) before incorporating them into an event log and then into the process mining analysis. In the Process Mining in Practice book<sup>5</sup>, checklists are provided to detect a range of data quality issues, together with suggestions on how to potentially correct them. The quality issues covered include formatting errors, missing data (events, attribute values, case IDs, activities, timestamps, attribute history, timestamps for activity repetition) as well as zero timestamps, wrong timestamps, same timestamps for multiple activities and different timestamp granularity. In [4], a data-quality informed approach is proposed where data attributes from a relational database are evaluated on their quality across a range of data quality measures before generating an event log.

<sup>5</sup> https://fluxicon.com/book/.

## **References**



# **A Practitioner's View on Process Mining Adoption, Event Log Engineering and Data Challenges**

Rafael Accorsi<sup>1(B)</sup> and Julian Lebherz<sup>2</sup>

<sup>1</sup> Accenture Switzerland, Zurich, Switzerland rafael.accorsi@accenture.com

<sup>2</sup> A.P. Møller-Mærsk, Copenhagen, Denmark julian.lebherz@maersk.com

**Abstract.** Process mining is, today, an essential analytical instrument for data-driven process improvement and steering. While practical literature on how to derive value from process mining exists, less attention has been paid to how it is being used in different industries, the effort involved in creating an event log and the best practices in doing so. Taking a practitioner's view on process mining, we report on process mining adoption and illustrate the challenges of log construction by means of the order to cash (i.e. sales) process in an SAP system. By doing so, we collect a set of best practices regarding data selection, extraction, transformation and data model engineering, which have proved useful in large-scale process mining projects.

**Keywords:** Process mining adoption · Event log engineering · SAP · Order to cash

## **1 Introduction**

Process mining is, today, an essential analytical instrument for data-driven process improvement and steering [8,10,21]. It helps to understand how a specific process contributes to the whole value chain, to identify different types of operational debts and to quantify improvement opportunities and, eventually, to measure the impact of transformation projects. Put another way, it is the instrument by means of which the business process management (BPM) lifecycle, as in [11, p. 21], can be effectively brought to life.

However, it was not always this way. Considering the state of the Process Mining discipline as of 2013 [2], the majority of work was still very academia-focused. Use cases and pilots run within research projects or by pioneering process mining technology providers, which at that time were spin-offs founded by PhD researchers in the area, substantiated the power of process mining. The practical evidence for the suitability of process mining as a scalable instrument for identifying process improvements was missing, though.

There were two main reasons for this. The first reason was the lack of market (and methodology) maturity. In fact, stakeholders could not clearly distinguish between process mining and business intelligence, and providers/consultants could not clearly articulate (and/or substantiate) its advantages. The second reason was the fact that process mining, like any other data analytical instrument, requires a specific data model. For process mining, this is an *event log*, the assembly of which requires a wide range of skills beyond pure data staging and aggregation. Experience in pulling together an event log for complex processes and heterogeneous systems was lacking.

Put another way, while "academic" process mining work mostly starts with a given log *L*, "practical" process mining work starts with a set of systems (or tables) and aims at creating the log file *L* for subsequent analysis. Admittedly, the latter is easier said than done. Depending on the complexity of the source data model and process to be discovered, up to 80% of a project timespan is used for data preprocessing and log creation, leaving 20% for the real process analytical work [9]. While reviewing the existing literature, we have seen a focus on use cases [14,23,25], on general approaches to (and techniques for) process analytics [5] and on strategies and frameworks for creating event logs for process mining [4,17]. Recently, data quality has also been receiving more attention [3]. However, we could not find previous work addressing all these elements *and* providing a hands-on data preprocessing example with corresponding best practices.

Given this scenario and our practitioner's view, the goal of this chapter is threefold:


Below, we will explicitly take a practical view of process mining. We thus refrain from formalizations and introduce the necessary technical concepts – especially in the context of SAP – on demand, including only the necessary aspects. While we focus on SAP for the hands-on example, the methodology we elaborate on can be equally applied to other processes within SAP, or to other ERP systems, such as Oracle, Navision and Salesforce. It is also agnostic to any data transformation approach and platform, and to the process mining technology, thereby decoupling data transformation from the specific analytical tool one intends to employ.

By focusing on data preprocessing, we deliberately leave out various other – equally relevant – phases of a process mining project. See [27] for a process mining project methodology. For example, although we explain the different angles that make up a process mining project scope in Sect. 4.1, we will not cover the *scoping phase* in detail (e.g. deciding which process or legal entities are to be analyzed). We also skip the *data maturity assessment phase*, whose goal is to ensure that the system's data provides a basis for process mining. This is typically required for less known, highly customized or legacy systems, and less so for standard ERP systems and their common satellite applications. We also do not cover *analytical and improvement phases* with methods and methodologies, e.g. to derive insights from process mining and calculate a business case for change. The improvement perspective is extensively covered in [26].

The remainder of this chapter is laid out as follows. Section 2 reports on process mining adoption in different industries and on the drivers for process mining. Section 3 introduces the SAP O2C process and the corresponding data foundation. Section 4 elaborates on how to construct a simple log file for SAP O2C. It does so by cutting through the complexities of data extraction, transformation and data model engineering, both in a general manner and in the specific context of SAP O2C. Section 5 summarizes the best practices in creating an event log. Section 6 takes stock and provides an outlook on the upcoming challenges for data preprocessing.

## **2 Process Mining Adoption**

Process Mining is widely used in a multitude of industries and businesses to create transparency on the key processes. This section first provides an overview of where process mining is being used and, subsequently, elaborates on the drivers for firms to deploy process mining as a basis for process understanding, monitoring and improvement. Although we illustrate, by means of real-world cases, how process mining has contributed to process improvements in those industries, this section will not dive deep into the specific case studies. For this, we refer to [14], a database with example process mining applications, and to [23], a book compiling a series of industry use cases for process mining.

#### **2.1 Business Usage**

We have seen Process Mining being used in several industries and processes. Still, the adoption focus differs depending on the underlying industry type and its characteristics. To better differentiate adoption across industry segments and map the corresponding processes to the industries, we split businesses into three types, namely (a) "Financial Products" (e.g. banks and insurance companies); (b) "Industrial Products" (e.g. pharmaceuticals and manufacturing); and (c) "Services" (e.g. telecommunication, healthcare, retail and government).

Overall, Financial and Industrial Products are, to date, the segments with the highest process mining penetration [10]. That is not to say that process mining is not being successfully adopted in Services: healthcare [19,24], telecommunication providers [23, Chap. 13 and 20] and municipalities [15] already profit substantially from process mining. However, according to technology providers and market research reports [12], they make up around 15% of the installed process mining base. Below we provide examples of how process mining is being adopted in the main industry segments; the driving factors are discussed in Sect. 2.2.

*Financial Products.* These are predominantly banks, e.g. retail, corporate and investment banking, and different types of insurance companies, e.g. health, life, composite and reinsurance. In banking, we have observed a focus on two processes: (a) loan and mortgage services and (b) account opening, in particular the KYC (know-your-client) process, which is closely related to anti-money laundering mechanisms. For the former, the main aim is to unleash operational efficiency by identifying automation potential or redesigning the process completely. For example, we have applied Process Mining to assess the loan process of a large bank based in the Benelux region. In doing so, we found that around 70% of the applications were rejected (by the bank) or canceled (by the applicant), which is well above the industry benchmark for this type of process and region. More importantly, rejections and cancellations happened at the activity "Final Application Check", which was the penultimate process step before completion. Put another way, the applications ran through (at least) ten process steps, including an "Initial Application Check" (second process step), only to be rejected or withdrawn near the end of the process. This insight paved the way to reengineer the process by creating a more thorough initial application check, eventually reducing effort by 19 full-time equivalents (FTEs) per year.

Moving on to insurance – irrespective of its kind –, the focus is on two areas: first, claim management and processing, and second, back and front office functions, such as master data changes and lead management. Because of its sheer volume<sup>1</sup> and business relevance, the primary focus is on the efficiency and effectiveness of claim management, specifically the level of fully automated claim processing and adherence to service level agreements (SLAs), that is, the time elapsed between the submission and settlement of a claim. In a Swiss-based health insurance company with around 15 million claims per year, process mining first helped measure the full automation rate over the year, namely 74% (the target being 80%). Second, it shed light on the root cause of manual work: a large bulk of claims were detoured to manual inspection just to set a final approval sign. While this activity took less than 10 s of processing time, it delayed the process by a median of 1.8 days (waiting time in work baskets) and reduced the automation level by 8.2%. By refining the rule set to cover only claims that *really* required the approval step, the automation rate immediately rose to 82.2%. As a side effect, this improved the SLA adherence by 8%.

*Industrial Products.* This type of industry is predominantly characterized by the manufacture of different types of products, such as cars, electronics, power plants or chemicals. Producing businesses, when transforming their operation towards bottom-line savings or top-line improvement, mainly focus on the so-called *operational support functions* including procurement, sales and general accounting, and *supply-chain* and *production*.

Because the operation of such industries is usually based upon a traditional ERP system whose data structure is widely understood, such as SAP or Oracle, this industry can be seen as the forerunner for the deployment of process mining "in the large". The main targets for process mining are procurement – "procure to pay" (P2P) or "source to pay" (S2P) – and sales – "order to cash" (O2C) or "lead to cash" (L2C). We address the sales process in the context of SAP in detail in Sect. 3. In fact, these two core processes – procurement and sales – often deliver a number of quick wins for rapid process improvements, both in terms of cost savings and increased revenue.

<sup>1</sup> In Switzerland, for example, the largest health insurance companies receive on average around 1.5 million claims per month. In Germany, this can be up to 17 million claims per month.

As an example, we have analyzed the procurement process of a mid-sized company manufacturing laser-cutting machines, focusing the analysis on three main European legal entities. With process mining we identified cyclic payment runs for invoices (every fourth working day). By overlaying the payment cycles with the payment terms associated with those invoices, we identified a negative offset. That is, discounts associated with paying an invoice within a specific period were not taken into account when prioritizing the payment runs. Over one year and considering only the three entities in scope, this amounted to EUR 0.83 million in unrealised discounts.

Turning to production, a very popular analysis concerns the interplay between the front office (in charge of taking leads and orders) and the production plants, in other words, the interplay between the sales and the production process. In this setting, we have used process mining to analyze the impact of late change order requests (coming from the front-office executives) to four production plants for a global fragrance and flavor producer. Late requests led to changes in the production planning, requiring, depending on the situation, a reschedule of production or stock transfers for products to ensure production. The former created idle production times worth 40.7 FTE per year. By preventing order changes in the so-called "frozen zone," i.e. orders already scheduled for production, the company was able to reduce the idle time by 47% and ensure a more reliable customer service.

## **2.2 Drivers for Process Mining Deployment**

The adoption of process mining as a technique for process understanding, monitoring and improvement is fueled by some characteristics of the leading industry segments. In this section we revisit some of these drivers and how they contribute to process mining adoption.<sup>2</sup>

*System Homogeneity.* Firms in the Industrial Products space are usually based upon one core ERP system, most predominantly Dynamics, Oracle, Navision and SAP, covering the main processes, with satellite systems for specific tasks, e.g. invoice processing with Basware or customs processing with SAP GTS. Because the underlying tables, data structures and operations for "standard" ERP systems are well known by experts, the preparation of data for process mining becomes easier. Generally, the more homogeneous the system landscape, the easier it is to implement and use process mining, be it by collecting and transforming the data, or by connecting directly to a process mining tool which performs the data transformation. The downside of system homogeneity is that, because of the systems' maturity, one oftentimes finds fewer low-hanging fruits in terms of process improvements.

<sup>2</sup> Note that these drivers are independent from each other. For example, while insurance companies' technical ecosystems are highly heterogeneous (thereby making the application of process mining more intricate), they profit from scale, i.e. number of claims, and from the existing solid data foundation necessary to run the business in the actuarial space.

*Transaction Volume.* Some processes are executed once a month (e.g. the consolidation of financial statements in general accounting), others millions of times a day (e.g. cab hailing rides at Uber). Both processes can undoubtedly lead to enterprise performance improvements when analyzed with process mining. However, the higher the number of transactions one has at hand, the higher the (at least potential) impact that can be achieved, and consequently the higher the return on investment (ROI) for process mining and improvement exercises. Just imagine one can identify, on average, USD 0.5 cost savings per claim with 15 million claims processed a year. In practice, this inevitably turns into a scoping question when analyzing processes: what is the "minimal" transaction volume to qualify for process mining? There is no magic formula for this, as processes are subject to different cycles and seasonality. So even the same process (e.g. procurement) in the same industry (e.g. manufacturing) might considerably differ from company to company depending on what is produced (e.g. power plants vs. chips). Our recommendation is to start with the end in mind and delineate the scope based on the business questions to be answered, operational debts to be bridged and process improvement ambition. See Sect. 4.1 for the different scoping elements.
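As a back-of-the-envelope check of the claims example in the paragraph above, and assuming that the average saving applies uniformly to every claim, the annual effect of the stated figures works out as follows:

$$
\underbrace{0.5\ \tfrac{\text{USD}}{\text{claim}}}_{\text{average saving}} \times \underbrace{15{,}000{,}000\ \tfrac{\text{claims}}{\text{year}}}_{\text{transaction volume}} = 7{,}500{,}000\ \tfrac{\text{USD}}{\text{year}}
$$

The same per-transaction saving applied to a process with only a few thousand executions per year would yield a correspondingly small total, which is why transaction volume weighs so heavily in scoping decisions.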

*Process Drivenness.* Some industries – and more specifically, companies in those industries, or even functions in specific companies – exhibit a high maturity level in terms of "process-drivenness" and, correspondingly, digitalization of processes. That is, processes are captured in a structured manner (e.g. by means of BPMN) and the underlying system landscape and data models responsible for the process execution are documented (e.g. ER diagrams to capture the relationship of entity sets stored in a database or ADL specifications for architecture description). Other companies (or some of their functions), be it because of their business model or niche of operation, are less "process-driven." For example, the Loan and Credit functions in banks are highly process-driven, while lead management in Asset and Wealth Management is less so. In fact, for the latter, technically speaking each process execution is a legitimate variant. Clearly, the higher the process-drivenness and volume of transactions, the better the chances of being able to run process mining. The downside is that, because plenty of thinking has been spent on process design and implementation, the quick wins in terms of improvement potential could already have been harvested by previous initiatives, irrespective of whether these were data-driven or not.

*Existing Data Foundation.* Irrespective of all the aforementioned drivers, some companies have invested heavily in building a cross-functional data foundation as part of their data strategy [21], either in the sense of a data mart (department-wide for the provision of some form of business intelligence) or a data warehouse/data lake (enterprise-wide for large data analytics), the latter being the focus of current projects tackling the transformation towards data-driven decision making. Process mining profits substantially from an existing data foundation outside of (and combining the different) core systems. The reasons are threefold: first, it avoids dedicated bulk data extractions, which are usually time consuming and require additional effort from IT or base teams; second, the platforms on which such foundations are deployed (e.g. Snowflake or Teradata) offer a transformation layer allowing (automatable, periodic) data transformation, thereby avoiding the setup of an additional transformation platform/layer; and third, they enforce some data homogeneity when standardizing data staging, for example, by making sure that timestamps are recorded to the precision of milliseconds.

Overall, these four key drivers put together factors favoring the use of process mining. Of course, transparency and analytics on their own do not lead to bottom-line savings or outperforming top-line. That is, process mining should be embedded in a broader context aiming at continuous improvement, and the identification and elimination of operational debts, measuring the impact of changes and recalibrating the performance goals according to a well-understood and well-established KPI framework [26]. The business process management lifecycle [11] provides a basis for data-driven process improvement based on process mining, in particular, and process analytics, in general [5].

## **3 Real-World Example: Order-to-Cash on SAP Systems**

In order to make the approach and considerations presented hereafter tangible and easily related to actual use cases relevant for both industry and academia, we introduce an exemplary Order-to-Cash (O2C) process run on an SAP ERP system. O2C is not only prevalent across all three industry types as laid out in Sect. 2.1, but also very much relatable to anyone running a business or even just buying goods online. The twist is to simply look at this buying process through the vendor's eyes, i.e. the firm selling for example electronics through a web shop. Irrespective of the firm's business type, region or size, the main process steps of any O2C process are fairly similar. Hence, it makes a perfect running example to showcase event data preprocessing in a real-world scenario.

Many large organizations run their core business processes on ERP software solutions from Oracle or SAP, imposing a minimum level of standardization on process steps and their sequential flow. Since these solutions are, however, designed to fit many different industries and business models, the predefined guardrails are not very strict, allowing for significant variation even in otherwise well-defined business processes like O2C. And while some companies even go to the length of modifying the underlying data structures in order to tailor the systems to their very needs, most modifications do not interfere with the core O2C process flow, but rather add complementary information. Paired with the fact that the adoption of SAP-based O2C process mining is far ahead of its Oracle-based counterpart, an SAP ERP has been chosen to exemplify event data preprocessing for O2C.

An end-to-end O2C process encompasses steps from the initial entry of a sales order and its items, all the way to the actual receipt of payment or another financial record clearing the open balance (e.g. a credit note). In practice we have encountered O2C process analyses with well over 100 different process steps. However, in favor of reducing this complexity to a manageable but representative set of events, the process flow is represented here by nine individual steps (or events).

The events as depicted in the first swim lane of Fig. 1 have been selected in order to (a) capture at least one instance of each event archetype<sup>3</sup>, while (b) reducing the number of events substantially and (c) retaining major milestones of a typical O2C process. We correlate the *Business Flow*, the underlying *Document Flow* as well as the corresponding *Data Flow* as follows.

<sup>3</sup> Technical event archetypes (immutable vs. mutable direct timestamp vs. log entry timestamp) further defined in Sect. 4.

**Fig. 1.** SAP O2C process description across the different flow types

*Business Flow.* The process starts with the creation of a sales order (SO) with at least one item (SO Item created), after which a confirmation can be sent to the customer (Order Conf. sent). As a next step the corresponding delivery document is created including details for all items (Delivery Item created), after which the warehouse operations (Picking completed, then Packing completed) follow, illustrating the application of O2C for sales of physical goods in store. The goods are eventually sent to the customer (Delivery Item dispatched) and a corresponding billing document including respective items gets created (Billing Item created), which typically interfaces with the financial accounting part of the process. In favor of simplicity, this part is omitted (i.e. all financial postings, such as the settlement of billing documents).

In this example, we include changes to the quantity ordered (Quantity changed), which can be triggered at any stage before the creation of the delivery note. This change event can be seen as a template and hence applied to a variety of other change attributes (e.g. price or requested delivery date). After each change, the corresponding marker on the sales order gets updated as well (SO Item last changed).

*Document Flow.* The second perspective focuses on the business documents and their flow, as if actual paper documents were being processed. It starts with the sales order (SO and SO Item), after which the customer is sent a confirmation (Order Confirmation). A delivery document (DD and DD Item) is created and dispatched before the billing document (BD and BD Item) opens a balance for the respective customer.

*Data Flow.* Next, we focus on the main corresponding data structures holding information about the events and/or documents. For SAP-based O2C processes, sales orders and their items are stored in a pair of tables distinguishing sales order header information (table VBAK) from their item level information (VBAP). The data recorded in these tables include the creation date of each record, as well as the date it was last changed. Order confirmations are persisted in a log table (NAST) comprising nearly all outgoing messages, while delivery document information, including creation and dispatch, can be found in another table pair (headers: LIKP; items: LIPS). Picking and packing are traced through changelogs (headers: CDHDR; items: CDPOS) on sales document status (headers: VBUK; items: VBUP), and billing documents are stored in a separate table pair (headers: VBRK; items: VBRP). Similar to picking and packing, all change events – including quantity changes – are tracked in a change audit log (headers: CDHDR; items: CDPOS).<sup>4</sup>

*Limitations.* Finally, it is important to point out that the presented O2C process constitutes a radical oversimplification. While the individual events are indeed representative, the process flow and set of events should be treated solely as an excerpt for demonstration purposes. Not only will real-life O2C processes be significantly more complex, but system customizations and other modifications to the SAP O2C standard configuration are also likely to require additional attention.

## **4 Event Log Engineering in Practice**

Data preparation for process mining in the form of event log engineering encompasses three main steps, namely:


This section addresses these three steps from two perspectives: first from a broad perspective by touching upon key aspects to be considered; and second, in a zoom into the specific setting introduced in Sect. 3. The following is not meant to be a complete cookbook for process mining preprocessing. Instead, it focuses on the predominant, recurring aspects and challenges – some specific to process mining, some applicable to a wider spectrum of data analysis initiatives.

## **4.1 Data Selection and Extraction**

From a general standpoint, this step focuses on answering the following questions:


<sup>4</sup> Table names in the SAP ECC data model are typically four to six character abbreviations of the context or document they capture. Because of the geographical origin of SAP, namely Germany, the abbreviations often stem from German. For example, VBAK and VBAP stand for, respectively, "Verkaufsbeleg: Kopfdaten" and "Verkaufsbeleg: Positionen," where "Verkaufsbeleg" means "sales order".

The answers to these questions can be clustered under the labels "scoping" and "sourcing." The *scoping phase* defines four angles: the *processual angle* (i.e. the subject of analysis), the *regional angle* (i.e. a specific country or set of legal entities), the *time angle* (i.e. time span of transactions to analyze), and the *analytical angle* (i.e. the "why" behind the analysis).

Once the scope is set, the *sourcing phase* establishes a mapping between the process steps and their attributes in a transaction and the events in the source systems, tables and objects. The overarching goal is to identify where – if at all – the necessary events are digitally represented and which attributes are natively available. In some situations, both events and attributes need to be derived by combining different characteristics. For example, the definition of an "automated event" in an SAP system depends on various factors, including user type and reference transaction. Hence capturing and interpreting them correctly is essential for the credibility of process mining (see Sect. 4.2 for details).

The final step in the sourcing phase is the "physical" data extraction from the relevant system and corresponding data objects to a destination outside the system. Assuming that all the data is based on a single ERP landscape, this usually happens by querying the corresponding tables and applying selection criteria to filter out, e.g., the transactions falling in the current time and regional angles. This could be done either by means of an ETL tool connecting to the system, by creating a dedicated extraction script (e.g. specialized ABAP code for SAP, or DART, SAP's embedded extraction tool), or by backing up the relevant tables and fields from the system (see Sect. 5 for best practices on data extraction). In companies with large data volumes or analyses considering a wide time angle (say, 10+ years), data extraction might need to consider so-called "archived transactional data". Whilst archived data can be seamlessly brought back to life, in practical settings, not the entire transaction is archived but rather its main attributes, for storage capacity reasons. This might restrict the analytical angle for archived transactions.

When extracting transactional and associated master and change data, two aspects are important: first, *data size*; and second, *data protection*. For the former, to estimate the final size of the extraction and simultaneously test the extraction method, one usually extracts, say, one month of data. By extrapolating this to the final time angle, one approximates the final number of cases and events to be dealt with, and consequently the size of the final extraction. For the latter, the advent of the General Data Protection Regulation (GDPR) specifically, and increased awareness for data protection generally, places additional requirements on data extraction and processing. Here, two strategies come in handy. First, *data minimization*, that is, extracting only the information strictly needed to cover the analytical angle. For example, if an analysis aims to measure the level of automation in a particular process, one can solely extract the user type, not necessarily the user name or ID. Second, for the necessary but sensitive fields, *data obfuscation* techniques generate – during the extraction – an irreversible value for a particular field. In practice, the most common method is hashing the values of the sensitive fields. This is typically applied to personally identifiable information, such as user IDs and customer names. Security and privacy have been an important topic in business process management and process mining [1,20]. The widespread adoption of process analytics and mining, paired with stricter legislation, has created a sense of urgency that is translating into cutting-edge, scalable data-protection approaches, such as [18,22].
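The two strategies can be illustrated with a minimal sketch. It assumes extracted rows are available as dictionaries; the field lists and the secret key are hypothetical. Note that the sketch uses a keyed hash (HMAC-SHA-256) rather than a plain hash, which is our own assumption: plain hashes of short identifiers such as user IDs can often be reversed by simply trying all candidates.

```python
import hashlib
import hmac

# Hypothetical secret; it would be stored separately from the extract.
SECRET_KEY = b"replace-with-a-project-specific-secret"

# Illustrative field selection for a change-log header extract (assumed).
KEEP_FIELDS = {"MANDANT", "CHANGENR", "UDATE", "UTIME", "USERNAME"}
SENSITIVE_FIELDS = {"USERNAME"}


def pseudonymize(value: str) -> str:
    """Return the hex-encoded HMAC-SHA-256 digest of a field value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_and_obfuscate(record: dict) -> dict:
    """Drop fields outside KEEP_FIELDS and hash the sensitive ones."""
    out = {}
    for field, value in record.items():
        if field not in KEEP_FIELDS:
            continue  # data minimization: the field never leaves the system
        out[field] = pseudonymize(value) if field in SENSITIVE_FIELDS else value
    return out


raw = {"MANDANT": "010", "CHANGENR": "0000012345", "UDATE": "20200315",
       "UTIME": "101500", "USERNAME": "JDOE", "TCODE": "VA02"}
clean = minimize_and_obfuscate(raw)
```

Because the same input always yields the same digest, hashed user IDs can still be counted and grouped in the analysis without revealing the underlying names.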

**Data Selection and Extraction in the SAP O2C Scenario.** In the following, we apply the general considerations discussed before to our SAP O2C running example. As a reminder, the scoping phase defines the rationale ("why") and derives the object of study ("what") using business terminology, while the sourcing phase translates this scope into technical delimitations and specifications, guiding the actual extraction of data. The exemplary scenario presented hereafter is fictitious, though resembling essential experiences and learnings from real process mining initiatives.

*Scoping Phase.* While the initial trigger for starting scoping discussions for process mining can originate from IT/analytics departments or solution vendors during presales, we choose to exemplify an arguably more value-driven context. The Global Head of Order Management aims to optimize the firmwide order management process and has been introduced to the general concept of process mining, which seems to be a perfect fit. She initiates a pilot project to evaluate the suitability of the approach, drive process transparency and distill tangible process improvement levers. During scoping discussions with process mining experts, three hypotheses are agreed to become the predominant analysis directions for moving ahead in an orchestrated manner:


While the first hypothesis looks at options to streamline the process, the second one bears potential to create additional value for customers. Lastly, hypothesis number three looks at more medium to long term objectives around process robustness and clarity of flow, which many times is a precursor for automation.

After rallying around the rationale for employing process mining, the scope (i.e. object of study) is being defined. As a largely business-driven exercise, process mining experts typically need to act in a (technical) counterbalance role, since the larger the business scope is set, the more complex all steps of the resulting process mining exercise will be. Hence, it must be the joint goal to aim at the smallest possible scope, while still retaining enough to be representative with regards to all shortlisted hypotheses.

The first delimitation is made with regards to the underlying business process. In the context of our SAP O2C example, all three hypotheses are related to the O2C process, more notably even, the non-financial part of O2C (sometimes referred to as order management). As the next level of detail, a minimum set of process steps or events is selected in line with the hypotheses (as described in the 'Business Flow' swim lane in Fig. 1). The second scoping task identifies corresponding business objects to be traced. Please refer to the 'Document Flow' swim lane in Fig. 1. The third delimitation challenges whether all organizational units (e.g. legal entities, regions or segments) and transaction types (e.g. consignment vs. standard sales) need to be included to retain validity and significance of analytical results. Oftentimes, the project participants are highly acquainted with one specific part of the business, making it a natural choice to ensure the right expertise is available when validating results later. With regards to transaction types, high volume types are typically scoped in when looking at efficiency hypotheses. In our example we focus on 'standard sales from stock' only, while the fictitious firm operates as one legal entity with one sales organization and one warehouse. The fourth delimitation looks at the timeframe to be analyzed. Depending on the underlying data volume, it might become necessary to further restrict the timeframe in scope later during the sourcing phase. In order to capture seasonality, it is generally recommendable to cover one full calendar or fiscal year. In our example we restrict the analysis to data from 2020. The fifth and last business-driven scoping discussion typically presents the biggest challenge. Here, one aims to delimit the number of different data points associated with each business object. For example, each SO item has more than 400 individual attributes in any given SAP system. 
Some of them are collocated, others require multiple data linkages, but most importantly, many do not naturally indicate whether they might become useful context around process execution during the downstream analysis. While the default reaction of business favors retaining everything, the resulting spike in technical effort and complexity renders this extreme inadvisable, sometimes even infeasible. In our simplified example we assume the process mining experts are seasoned enough to guide the team toward a narrow selection with necessary attributes only. Such a selection typically does not exceed 40 attributes in the case of the SO item example.

*Sourcing Phase.* With the scope being clearly defined from a business perspective, the first step in the sourcing phase is to translate all delimitations into technical terms, i.e. a selection of source systems, data sources (e.g. tables or log files) within these systems, corresponding parameters to filter data records and last but not least the selection of required attributes within the data sources.

In our SAP O2C scenario, we focus on one source system only (exemplarily 'P42', an SAP R/3 ERP, even though the characteristics described largely apply to SAP S/4 instances as well). Since SAP ERP systems are capable of multi-tenancy, it is important to select the correct tenant in addition, which in case of the running example falls on the only active tenant configured in the productive ERP instance (i.e. 'P42/010').

Next, the process steps and traced documents (please refer to the 'Business Flow' and 'Document Flow' swim lanes in Fig. 1) are translated into their respective data sources. Often, this exercise with its required deep expertise is indicative of whether multiple data scope refinements and, hence, data extractions will become necessary, thereby prolonging the project timeline. These translations applied to the running example are shown in the tables in Fig. 2.

**Fig. 2.** Data sources for the SAP O2C process

**Fig. 3.** Filtering criteria when extracting data for SAP O2C process

In order to restrict the extraction data volume for each data source, the delimitations on organizational scope and transaction type, as well as timeframe are translated into row filter criteria. Figure 3 shows exemplary filtering criteria. While the tenant filter represents an example for restricting the organizational scope, a timeframe filter is also applied to each data source. As shown for the data source CDPOS, timeframe restrictions sometimes require linkage to another data source, like in this case its header information in CDHDR. Setting fixed timeframe boundaries will, however, lead to cutoff artifacts in the resulting analysis. If a sales order is registered on December 31, 2020, the corresponding order confirmation will likely be created outside the selection window and thereby cut from the extraction. Preventing such artifacts would require substantial pre-extraction analysis and sophisticated extraction mechanisms catering to dependencies between data sources. This is typically deemed impractical, and analysts would rather deal with the resulting artifacts during analysis. Lastly, transaction type filters are exemplified through the document category for sales orders, the configured message type for order confirmations, which needs to be looked up in the system itself, and the list of tables affected by logged changes.
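The linkage-based timeframe filter for CDPOS can be sketched as follows. The rows are made-up toy data, the join key is simplified to client and change number (an assumption for brevity), and the list of tables in scope is illustrative.

```python
from datetime import date

START, END = date(2020, 1, 1), date(2020, 12, 31)  # time angle of the example
TABLES_IN_SCOPE = {"VBAP"}  # illustrative list of tables with logged changes

# Toy rows; values are invented for demonstration purposes.
cdhdr = [
    {"MANDANT": "010", "CHANGENR": "1", "UDATE": date(2020, 3, 1)},
    {"MANDANT": "010", "CHANGENR": "2", "UDATE": date(2021, 1, 4)},
]
cdpos = [
    {"MANDANT": "010", "CHANGENR": "1", "TABNAME": "VBAP", "FNAME": "KWMENG"},
    {"MANDANT": "010", "CHANGENR": "2", "TABNAME": "VBAP", "FNAME": "KWMENG"},
    {"MANDANT": "010", "CHANGENR": "1", "TABNAME": "LIPS", "FNAME": "LFIMG"},
]

# Step 1: apply the timeframe filter on the header table, which holds the date.
headers_in_scope = {
    (h["MANDANT"], h["CHANGENR"]) for h in cdhdr if START <= h["UDATE"] <= END
}

# Step 2: item rows inherit the timeframe through their header key and are
# additionally restricted to the tables of interest.
cdpos_in_scope = [
    p for p in cdpos
    if (p["MANDANT"], p["CHANGENR"]) in headers_in_scope
    and p["TABNAME"] in TABLES_IN_SCOPE
]
```

Only the first item row survives: the second belongs to a header outside the time window, and the third refers to a table that is not in scope.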

Equipped with a clear selection of data sources and filter criteria (i.e. restricting data records), selection criteria (i.e. restricting attributes/columns) are next. While data sources like sales order item or delivery item tables have over 350 columns, only a small number of them is required to evaluate specific hypotheses. Typically, practitioners home in on (a) the identifying primary key, (b) temporal, quantity, price, cost, volume, and weight information, (c) markers indicating a state of the object, (d) links to other relevant objects, and (e) links to supplementary information. Figure 4 showcases attribute selection for delivery items (system table LIPS). During this screening process it is natural to come across additional supplementary information not yet covered in the data source selection. In such cases, and if their usefulness gets validated, they need to go through the same delimitation procedure as other data sources.

**Fig. 4.** Attribute selection for delivery items in SAP

The final step of the technical translation is the screening of the complete resulting attribute selection for sensitive data. Some data types must not be transferred across country borders (even if it is solely for analysis), others fall into categories requiring additional safeguarding, pseudonymization or even anonymization. While the act of obfuscation itself is part of the extraction, it is recommended to identify all attributes requiring special attention upfront. In the SAP O2C scenario, these could entail usernames and details from customer master data.

Lastly, the actual extraction is configured and run accordingly. As all major considerations regarding the extraction of large volumes of data from ERP systems have already been presented in the general section of this chapter, our running example assumes a proven one-time extraction mechanism is used. Such setups have been utilized extensively by external auditors; however, they are usually limited to selective one-time extracts, storing the payload in individual files locally on the SAP application server. Since no sensitive data has been identified in the data scope of our SAP O2C example, no obfuscation mechanisms need to be configured.

## **4.2 Data Transformation**

After the extraction, data is typically loaded onto a preprocessing platform to generate the target data model, i.e. the log file and ancillary tables (see Sect. 4.3 for details). This platform can be a database management system (such as a Microsoft SQL Database Server) or part of the ETL tool applied during the extraction. Depending on the process mining technology applied, transformation might also happen inside the tool, such as in UiPath Process Mining and Celonis.

Data transformation is the most important preprocessing step in the journey towards process analytics. This is in particular relevant because an error in reconstructing the end-to-end process may cascade to a completely flawed process mining exercise, delivering misleading results, creating negative experience and, in the worst case, discrediting the whole approach. Consequently, a lot of attention needs to be put into mapping the correct process.

Specifically, we want to call out the following key aspects, namely:


Serving as unique transaction identifier, the *Case Identifier* (CaseID for short) is a primary point of concern when transforming data in order to achieve an end-to-end representation of the process. When considering one single system, such as SAP, the transaction identifier is typically given by the key document number being tracked (e.g. sales order number), to which other related documents refer. When the transaction spans different systems, the CaseID might – or might as well *not* – be consistent across them.

Assuming that CaseIDs are not consistent, two situations might occur: (1) there exists a mapping between the systems, that is, one can precisely link the transactions, even though the CaseIDs used in the systems for the same transaction differ; or (2) there is no link between the systems, or this link is not persistent (e.g. being deleted after 24 h). In (2), transactions can only be approximated by relating timestamps and transactional attributes on both systems, the so-called *linkage criteria*. That is, assume that transactions are passed on from an Application *A* to an Application *B*. The linkage criteria will initially define a time range (e.g. from 1 to 10 s) within which a transaction ending on Application *A* is connected with the transaction that commences on Application *B*. Ideally, the matching of timestamps on both ends will create a one-to-one linking of the transactions. The resultant CaseID could be the concatenation of the CaseIDs on Applications *A* and *B*, e.g. CaseID<sub>A</sub>–CaseID<sub>B</sub>. However, in practice linkage criteria based solely on time ranges can lead to one-to-many relationships between the transactions on both systems, for instance when the cadence of transactions is high. By refining the linkage criteria with non-temporal matching attributes (e.g. the vendor and/or material), one gains precision and reduces uncertainty. Still, in some settings it is impossible to achieve a perfect mapping across systems. In these situations we recommend adding a case attribute that flags those transactions which perfectly match and those which do not.
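A minimal sketch of such linkage criteria is given below. It assumes each transaction is a dictionary with the hypothetical keys `case_id`, `ts` and `vendor`, uses the 1 to 10 s window mentioned above, and flags a link as exact only if neither side takes part in any other link.

```python
from collections import Counter
from datetime import datetime, timedelta

MIN_GAP, MAX_GAP = timedelta(seconds=1), timedelta(seconds=10)


def link_cases(ends_a, starts_b):
    """Link transactions ending on A to transactions starting on B."""
    pairs = [
        (a, b) for a in ends_a for b in starts_b
        if MIN_GAP <= b["ts"] - a["ts"] <= MAX_GAP   # temporal criterion
        and a["vendor"] == b["vendor"]               # non-temporal refinement
    ]
    # Count how often each transaction takes part in a candidate link.
    count_a = Counter(a["case_id"] for a, _ in pairs)
    count_b = Counter(b["case_id"] for _, b in pairs)
    return [
        {
            "case_id": f'{a["case_id"]}-{b["case_id"]}',
            "exact_match": count_a[a["case_id"]] == 1 and count_b[b["case_id"]] == 1,
        }
        for a, b in pairs
    ]


t0 = datetime(2020, 5, 4, 9, 0, 0)
ends_a = [
    {"case_id": "A1", "ts": t0, "vendor": "V1"},
    {"case_id": "A2", "ts": t0 + timedelta(seconds=2), "vendor": "V2"},
]
starts_b = [
    {"case_id": "B7", "ts": t0 + timedelta(seconds=3), "vendor": "V1"},
    {"case_id": "B8", "ts": t0 + timedelta(seconds=5), "vendor": "V2"},
    {"case_id": "B9", "ts": t0 + timedelta(seconds=6), "vendor": "V2"},
]
linked = link_cases(ends_a, starts_b)
```

Here A1 links uniquely to B7, whereas A2 has two candidates (B8 and B9), so both of its links carry `exact_match = False` and can be filtered or inspected separately during analysis.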

*Timestamps* are essential in process mining, as they mainly characterize the partial ordering ≺ in which the events are sequenced by the underlying process mining algorithms. A typical problem, arising in particular when analyzing automated process steps in sequence but also in other contexts, is the precision of timestamps. Automated process steps happen within milliseconds, and this is the precision with which timestamps need to be captured; otherwise, process steps will have the same timestamp. In practical settings, this leads to an extremely high number of process variants, as the process mining engines will pick events with the same timestamp in a random order and artificially create variants. When precise timestamps are not available, hardwiring the process ordering by means of a dedicated field in the final log might come in handy. Tools will use this field to enforce the ordering, avoiding unnecessary variants. Another solution is to subsume all the sequenced steps into one, assuming that their ordering is not relevant for the analytical angle.
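The idea of a dedicated ordering field can be sketched as follows. The rank table is hypothetical and would have to reflect the designed order of the steps in the process at hand; the field name `sort_key` is likewise an assumption.

```python
from datetime import datetime

# Hypothetical rank reflecting the designed order of the process steps.
STEP_RANK = {
    "SO Item created": 1,
    "Order Conf. sent": 2,
    "Delivery Item created": 3,
}

same_second = datetime(2020, 6, 1, 8, 0, 0)
events = [
    {"case_id": "C1", "activity": "Delivery Item created", "ts": same_second},
    {"case_id": "C1", "activity": "SO Item created", "ts": same_second},
    {"case_id": "C1", "activity": "Order Conf. sent", "ts": same_second},
]

# Add the ordering field to every event of the log.
for e in events:
    e["sort_key"] = STEP_RANK[e["activity"]]

# Sort by case, then timestamp, then the ordering field as a tie-breaker.
ordered = sorted(events, key=lambda e: (e["case_id"], e["ts"], e["sort_key"]))
trace = [e["activity"] for e in ordered]
```

With the tie-breaker in place, events sharing a timestamp always appear in the same order, so they no longer give rise to artificial variants.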

Another aspect commonly overlooked in analysis is the fact that the timestamps might be captured in different time zones, especially when the regional angle spans various continents. (Summer and winter time shifts should not be forgotten either.) As an example, suppose one is looking for outlier transactions in which invoices were settled outside of the standard European working hours for a particular company, e.g. between 7PM and 6AM. Completely legitimate settlements happening in the US might then be considered illegitimate if one does not normalize the time zones. This can be done either by adding a supplementary timestamp field to the log (denoting the time according to the reference point, e.g. CET, defined in the analytical angle), or by adding an attribute field for the time zone offset according to the reference point (e.g. +2).
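Both options can be sketched with Python's standard `zoneinfo` module (Python 3.9 or later, with time zone data available on the machine). The choice of `Europe/Berlin` as the reference point and the output field names are assumptions made for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

REFERENCE_TZ = ZoneInfo("Europe/Berlin")  # assumed reference point (CET/CEST)


def normalize(local_ts: datetime, source_tz: str) -> dict:
    """Return the local timestamp, its reference-time equivalent and the offset."""
    aware = local_ts.replace(tzinfo=ZoneInfo(source_tz))
    ref = aware.astimezone(REFERENCE_TZ)
    offset = ref.utcoffset() - aware.utcoffset()
    return {
        "ts_local": local_ts,
        "ts_reference": ref.replace(tzinfo=None),   # supplementary timestamp
        "offset_hours": offset.total_seconds() / 3600,  # alternative: offset field
    }


# A settlement recorded at 14:30 local time in New York.
row = normalize(datetime(2020, 7, 1, 14, 30), "America/New_York")
```

Keeping both the local and the reference timestamp makes it explicit which clock a working-hours rule is evaluated against: the same event reads 14:30 locally but 20:30 in reference time.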

A final consideration regarding the timestamps is the fact that typical ERP systems, as well as most legacy systems, do not capture the begin *and* end timestamps of events. As events are therefore "atomic", it is impossible to measure the actual duration of a process step and, correspondingly, to quantify the working time per step. In fact, the lead times between events in a discovered process map encompass both processing and waiting times. To address this, approaches for so-called "effort mining" are being developed and tested in practical settings [28]. They use statistical methods to estimate the duration of tasks, thereby allowing the quantification of working time and productivity, as well as benchmarking.

Considering *amounts*, two frequent issues are: (1) *amount duplication*, and (2) *unharmonized currencies*. The duplication happens when loops exist in the process. For example, suppose the event "Issue Invoice" happens, with an associated event attribute "Invoice Amount." Furthermore, suppose that, because of a loop, this event happens twice in some cases. Naively adding up the amounts associated with the event "Issue Invoice" will include duplications because of the multiple occurrences of the event within a case. Similarly, amount corrections happening in a case must be taken into account when assessing the final amount related to a case. Ideally, to avoid duplications and other errors associated with amounts, one should parse the execution traces and create an ancillary case attribute table recording the amounts per case (see Sect. 4.3 for details on the data model), thereby avoiding calculations in the specific process mining tool.
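The following sketch contrasts the naive sum with a case-level amount table. The event names, the toy amounts and the rule that the latest amount-bearing event determines the final case amount are assumptions; the appropriate rule depends on the process being analyzed.

```python
events = [
    {"case_id": "C1", "activity": "Issue Invoice", "ts": 1, "amount": 100.0},
    {"case_id": "C1", "activity": "Issue Invoice", "ts": 3, "amount": 100.0},  # loop
    {"case_id": "C2", "activity": "Issue Invoice", "ts": 2, "amount": 250.0},
    {"case_id": "C2", "activity": "Correct Amount", "ts": 4, "amount": 240.0},
]

# Naive approach: sums every occurrence and ignores the correction.
naive_total = sum(e["amount"] for e in events if e["activity"] == "Issue Invoice")

# Case attribute table: one amount per case, taken from the latest
# amount-bearing event (assumed business rule).
AMOUNT_ACTIVITIES = {"Issue Invoice", "Correct Amount"}
case_amounts = {}
for e in sorted(events, key=lambda e: e["ts"]):
    if e["activity"] in AMOUNT_ACTIVITIES:
        case_amounts[e["case_id"]] = e["amount"]  # later events overwrite earlier ones

clean_total = sum(case_amounts.values())
```

The naive total counts the repeated invoice of case C1 twice and misses the correction in C2, while the case-level table yields one consolidated amount per case.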

Unharmonized currencies typically occur when the analytical angle spans different countries, e.g. Denmark and Brazil. As above, naively adding up amounts without taking into account the different local currencies will lead to a wrong financial assessment of the process. Therefore, as a preprocessing step, currencies should be harmonized to a reference currency set as part of the analytical angle, such as USD or EUR. This will be the *reporting currency*. The basis for such a harmonization might be system tables capturing the currency conversion rate history (e.g. TCURR on SAP), or dedicated APIs from which the historical foreign exchange rates can be retrieved (e.g. Fixer.io). For flexibility, the resulting data model stores the amounts in both local and reporting currencies, and optionally also the conversion rate.
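A minimal sketch of such a harmonization is shown below. The rate history is invented for illustration (it is not real market data), EUR is assumed as the reporting currency, and the rate valid on the transaction date is taken to be the most recent one on or before that date.

```python
from bisect import bisect_right
from datetime import date

# Made-up rate history: EUR per one unit of local currency, by valid-from date.
RATES_TO_EUR = {
    "DKK": [(date(2020, 1, 1), 0.1338), (date(2020, 7, 1), 0.1342)],
    "BRL": [(date(2020, 1, 1), 0.2215), (date(2020, 7, 1), 0.1640)],
    "EUR": [(date(2020, 1, 1), 1.0)],
}


def rate_on(currency: str, day: date) -> float:
    """Return the latest rate whose valid-from date is on or before the day."""
    history = RATES_TO_EUR[currency]
    idx = bisect_right([valid_from for valid_from, _ in history], day) - 1
    if idx < 0:
        raise ValueError(f"no {currency} rate available on or before {day}")
    return history[idx][1]


def harmonize(amount: float, currency: str, day: date) -> dict:
    """Keep local amount and rate next to the amount in reporting currency."""
    rate = rate_on(currency, day)
    return {
        "amount_local": amount,
        "currency_local": currency,
        "rate": rate,
        "amount_reporting": round(amount * rate, 2),
    }


row = harmonize(1000.0, "BRL", date(2020, 8, 15))
```

Storing the rate alongside both amounts keeps the conversion traceable and allows a later recalculation if the rate source changes.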

**Data Transformation in the SAP O2C Scenario.** In the following, we apply the general considerations discussed before to our SAP O2C running example. While first focusing on unit harmonization (i.e. timestamps and prices), the second part will describe different archetypes of events and exemplarily discuss data transformation steps to generate event log entries.

*Unit Harmonization.* As characterized above, there are several types of attributes that can occur in different base units across the data sources, sometimes even within one source system. Starting with the timestamps, SAP typically stores date (DATS) and time (TIMS) data in the configured time zone in the SAP installation. Some timestamps are, however, persisted in the time zone of the individual user interacting with the system (e.g. in SAP Warehouse Management, short WM). Luckily, all attributes relevant to our example are based on the same time zone and therefore, no adjustment for different time zones needs to be made. Since we analyze a full year of data, the switch between summer and winter time can – depending on the SAP system configuration – still require adjustments and a decision to treat one of them as dominant.

Before applying adjustments, we prepare our data sources by combining separated date and time information into timestamps (e.g. in VBAP: ERDAT & ERZET → tsCreation). If any data source contains multiple separated timestamps, each pair will result in an additional attribute. Moving to the actual adjustment and taking CET as our dominant base time zone, we adjust all timestamps in summertime by subtracting one hour. For traceability and testing purposes the adjusted timestamp should be added as an additional column (tsCreationCET). Once a project matures into an operational monitoring solution, such steps are typically collapsed to reduce overall data volume.
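These two preparation steps can be sketched as follows. The sketch assumes that dates and times arrive as `YYYYMMDD` and `HHMMSS` strings and that the system clock follows Central European time rules (`Europe/Berlin` in the time zone database); both are assumptions about the extract rather than properties of every SAP installation.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

SYSTEM_TZ = ZoneInfo("Europe/Berlin")  # assumed time zone of the SAP system


def to_timestamp(dats: str, tims: str) -> datetime:
    """Combine a date string (YYYYMMDD) and a time string (HHMMSS)."""
    return datetime.strptime(dats + tims, "%Y%m%d%H%M%S")


def to_cet(ts: datetime) -> datetime:
    """Subtract one hour from timestamps that fall into summer time."""
    in_summer_time = bool(ts.replace(tzinfo=SYSTEM_TZ).dst())
    return ts - timedelta(hours=1) if in_summer_time else ts


row = {"ERDAT": "20200715", "ERZET": "093000"}
row["tsCreation"] = to_timestamp(row["ERDAT"], row["ERZET"])
row["tsCreationCET"] = to_cet(row["tsCreation"])  # kept as an additional column

winter = to_cet(to_timestamp("20200115", "093000"))
```

The July timestamp is shifted from 09:30 to 08:30, whereas the January timestamp is left unchanged.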

Another major category for unit harmonization is currency-denominated attributes. Within SAP, some data sources provide figures in multiple currencies (often document, local and reporting currency), others hold the transaction or document currency only. In these cases, and especially when firms engage in international business relations, the respective metrics need to be harmonized before being compared or aggregated.

There is a substantial level of semantics captured in the way SAP ERP systems convert currencies.<sup>5</sup> However, in the context of our SAP O2C process at hand, we assume a currency conversion mechanism is available in the data transformation environment. Some of the currency attributes which need to be harmonized are static in terms of source and target currency (e.g. all records converted from USD to EUR), others need to dynamically capture the source currency per each individual record (e.g. sales order item price from document currency to EUR). Exemplarily, Fig. 5 shows the input to such a dynamic currency conversion function, whose output is then stored in an additional attribute.

<sup>5</sup> It goes beyond the scope of this running example to explain the inner workings of currency conversion in SAP, which is based on the tables TCURF, TCURN, TCURR, TCURV, and TCURX.

**Fig. 5.** Exemplary currency conversion.

*Event Data Transformation.* When transforming data into an event log capturing all relevant events as defined in Sect. 4.1, different event data archetypes should be distinguished. These types inform corresponding transformation recipes and, while they need to be tailored to individual events, their core structure remains largely intact. Figure 6 delimits these three archetypes and Fig. 7 maps them to the events which are part of our SAP O2C running example.

**Fig. 6.** Types of data archetypes.

Below, we detail the event Sales Order Item created to exemplify the immutable timestamp transformation archetype, the event Sales Order Item last changed to exemplify the mutable timestamp transformation archetype, and both events Order Confirmation sent as well as Picking completed to exemplify the log entry transformation archetype. For simplicity reasons we limit the transformations to the three basic elements for process mining: (a) the object ID/case ID candidate, (b) the event name, and (c) the timestamp.

*Timestamp – immutable.* In order to extract event records for the event type Sales Order Item created an object ID (caseID candidate) is crafted by concatenating VBAP.MANDT, VBAP.VBELN and VBAP.POSNR, the primary key of the respective data source table. In a preparation step we have already generated the corresponding timestamp VBAP.tsCreated from VBAP.ERDAT and VBAP.ERZET. Many event types can be extracted in this manner.

*Timestamp – mutable.* To extract event records for the event type Sales Order Item last changed we use the same object ID as for the immutable event. In a preparation step, we have also generated a corresponding timestamp VBAP.tsLastChanged from VBAP.AEDAT and 23:59:59, a dummy time to fill the missing precision in this timestamp. It is very important to clearly document usage of such dummy times, since they can lead to undesired analysis results due to misinterpretation of the event sequence. In general, such mutable event types are more valuable for operational process mining analyses, with shortened refresh cycles, and thus a greater chance of the data still being current at the time of analysis. We included it in the SAP O2C running example for completeness only.

**Fig. 7.** Mapping archetypes to the events.
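The two timestamp archetypes can be sketched as follows. This is an illustrative outline rather than the chapter's transformation script: rows are plain dictionaries, the function names are hypothetical, and treating an empty or `00000000` AEDAT as "never changed" is an assumption.

```python
from datetime import datetime

DUMMY_TIME = "235959"  # fills the missing time precision of AEDAT

def object_id(row):
    """Concatenate the primary key fields of VBAP into the object ID."""
    return row["MANDT"] + row["VBELN"] + row["POSNR"]

def item_created_events(vbap):
    """Immutable timestamp archetype: one event per record."""
    return [
        {"objectID": object_id(r),
         "event": "Sales Order Item created",
         "timestamp": r["tsCreated"]}
        for r in vbap
    ]

def item_last_changed_events(vbap):
    """Mutable timestamp archetype: only the latest change is visible."""
    events = []
    for r in vbap:
        if not r.get("AEDAT") or r["AEDAT"] == "00000000":
            continue  # assumption: record was never changed
        events.append({
            "objectID": object_id(r),
            "event": "Sales Order Item last changed",
            "timestamp": datetime.strptime(r["AEDAT"] + DUMMY_TIME, "%Y%m%d%H%M%S"),
            "dummyTime": True,  # documents that the time part is artificial
        })
    return events

# Hypothetical sample rows.
vbap = [
    {"MANDT": "100", "VBELN": "0000012345", "POSNR": "000010",
     "tsCreated": datetime(2021, 7, 15, 14, 30), "AEDAT": "20210716"},
    {"MANDT": "100", "VBELN": "0000012345", "POSNR": "000020",
     "tsCreated": datetime(2021, 7, 15, 14, 31), "AEDAT": "00000000"},
]
created = item_created_events(vbap)
changed = item_last_changed_events(vbap)
```

Flagging the dummy time in a separate attribute is one possible way to keep its usage documented within the data itself.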

*Log entry.* As the first example of the log entry archetype, the event records for Order Confirmation sent are retrieved. Assuming the data source NAST has already been filtered to solely include order confirmation message types, it is linked to VBAP based on the client (MANDT) and its object key (OBJKY) referencing the header primary key of VBAP (MANDT, VBELN). The same concatenated object ID is used as for the immutable event. Since the two sources are already linked, tsProcessed as derived from NAST.DATVR and NAST.UHRVR is used as the event timestamp.

The second example derives the event Picking completed from SAP's change documentation. Assuming the data source VBUP has already been filtered to solely include the item status information of standard sales order items, we also restrict the change logs based on the affected table (CDPOS.TABNAME = VBUP), on the affected field (CDPOS.FNAME = PKSTA), and on the change type (CDPOS.CHNGIND = U) to retain value updates only. Thereafter, VBUP is linked to CDPOS based on MANDT and CDPOS.TABKEY referencing the primary key of VBUP. Next, change log header information (CDHDR) is linked based on MANDT and CHANGENR. Lastly, we can extract the object ID from VBUP analogously to the immutable timestamp example, and the prepared timestamp tsUpdated, derived from CDHDR.UDATE and CDHDR.UTIME. Many changelog structures for other event types work similarly, even outside the SAP ecosystem.
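The change log recipe can be outlined as below. The sketch applies only the filters and links named in the text; the in-memory list representation, the function name, the sample values, and the assumption that TABKEY is the plain concatenation of MANDT, VBELN and POSNR are illustrative choices.

```python
from datetime import datetime

def picking_completed_events(vbup, cdpos, cdhdr):
    """Derive 'Picking completed' events from change documents."""
    # Index the change headers by (MANDT, CHANGENR) for the header link.
    headers = {(h["MANDT"], h["CHANGENR"]): h for h in cdhdr}
    # Assumed TABKEY layout: concatenated primary key of VBUP.
    item_keys = {r["MANDT"] + r["VBELN"] + r["POSNR"] for r in vbup}
    events = []
    for pos in cdpos:
        # Restrict to updates of VBUP.PKSTA.
        if (pos["TABNAME"], pos["FNAME"], pos["CHNGIND"]) != ("VBUP", "PKSTA", "U"):
            continue
        # Link CDPOS to the pre-filtered VBUP records.
        if pos["TABKEY"] not in item_keys:
            continue
        header = headers.get((pos["MANDT"], pos["CHANGENR"]))
        if header is None:
            continue
        events.append({
            "objectID": pos["TABKEY"],
            "event": "Picking completed",
            "timestamp": datetime.strptime(header["UDATE"] + header["UTIME"],
                                           "%Y%m%d%H%M%S"),
        })
    return events

# Hypothetical sample data.
KEY = "100" + "0000012345" + "000010"
vbup = [{"MANDT": "100", "VBELN": "0000012345", "POSNR": "000010"}]
cdpos = [
    {"MANDT": "100", "CHANGENR": "0000000001", "TABNAME": "VBUP",
     "TABKEY": KEY, "FNAME": "PKSTA", "CHNGIND": "U"},
    {"MANDT": "100", "CHANGENR": "0000000002", "TABNAME": "VBUP",
     "TABKEY": KEY, "FNAME": "LFSTA", "CHNGIND": "U"},
]
cdhdr = [
    {"MANDT": "100", "CHANGENR": "0000000001", "UDATE": "20210716", "UTIME": "101500"},
    {"MANDT": "100", "CHANGENR": "0000000002", "UDATE": "20210717", "UTIME": "090000"},
]
events = picking_completed_events(vbup, cdpos, cdhdr)
```

In a real project one would likely also inspect the old and new field values to decide which update actually marks completion; that refinement is left out here because the text does not specify it.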

The exemplarily described recipes can be applied beyond the events listed as part of our simplified SAP O2C process analysis. It is rarely a blind application, however, rather a tailoring exercise. Sometimes the name of the resulting events – mostly in the log entry archetype – is even meant to be dynamically derived from attributes on a record-by-record basis. This becomes particularly useful when analyzing workflow systems with potentially hundreds of different events, since all of them can be extracted with one transformation recipe.

#### **4.3 Data Model Engineering**

Generally speaking, the transformation creates an *event log* for the process in scope, as defined in the processual and regional angles. It further contains the necessary events and attributes needed to respond to the analytical angle happening in the time span prescribed by the time angle. This section focuses on considerations for building a data model fit for scalable process mining analytics.

The simplest target data model for an event log file is a table in which the columns capture the attributes and the rows capture the events. Although some tools still build upon a single event log table as their input format and although this format might be handy for small exercises, producing a single event log has several adverse practical implications, namely:


Practical process mining thus calls for more efficient and scalable logs. There are two complementary strategies for engineering event logs. The more general strategy focuses on splitting the log into at least two tables: the so-called *event table* containing the events and their attributes, and the *transaction table* containing the case attributes. The key linking these two tables is the CaseID. This strategy can be further refined, depending on the needs of the analysis. For example, other common structures seen in practice are the *change table* capturing updates in the main documents (e.g. Quantity changed in Sect. 3) and the *property table* capturing derived precomputed transaction attributes easing the analysis (e.g. the number of events in a particular transaction or precision of the linkage criteria, as of Sect. 4.2).
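The general splitting strategy can be illustrated with a small sketch. It assumes a flat log given as a list of dictionaries in which case attributes are repeated on every event row; the attribute names and the function name are hypothetical.

```python
def split_event_log(flat_log, case_attributes):
    """Split a single flat log into an event table and a transaction
    (case) table, linked by CaseID."""
    event_table = []
    case_table = {}
    for row in flat_log:
        case_id = row["CaseID"]
        # Event table: everything except the case-level attributes.
        event_table.append({k: v for k, v in row.items() if k not in case_attributes})
        # Transaction table: one record per case, stored once.
        case_table.setdefault(
            case_id, {"CaseID": case_id, **{a: row[a] for a in case_attributes}}
        )
    return event_table, list(case_table.values())

# Hypothetical flat log with redundant case attributes.
flat_log = [
    {"CaseID": "X", "Event": "Sales Order Item created", "Timestamp": "2021-07-15T13:30",
     "NetPrice": 100, "Material": "M-01"},
    {"CaseID": "X", "Event": "Order Confirmation sent", "Timestamp": "2021-07-15T15:00",
     "NetPrice": 100, "Material": "M-01"},
    {"CaseID": "Y", "Event": "Sales Order Item created", "Timestamp": "2021-07-16T09:00",
     "NetPrice": 250, "Material": "M-02"},
]
event_table, case_table = split_event_log(flat_log, ["NetPrice", "Material"])
```

The redundancy removed here grows with the number of events per case, which is one reason the single-table format scales poorly.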

The more specific strategy takes the scope and its different angles into account, as well as who is eventually consuming the analysis. Specifically, when the *regional angle* comprises multiple geographies (e.g. five hubs of a Global Business Services (GBS) topology), it is wise to create one data model – irrespective of its layout – for each hub. While this does not prevent having a global analysis, benchmarks and knowledge transfer of best-practices, it by default ensures controllability and need-to-know policies, i.e. that hubs focus on their area of concern. The *analytical angle* is also a strong driver for event log engineering. For example, an SAP O2C analysis might focus on improving client servicing and lead management. In this case, the focus is on transactions against external customers, and *not* on intercompany or intracompany transactions.<sup>6</sup> Therefore, the data model for this analysis can be built to comprise only the relevant transactions.

Generally, narrowing the event log according to the scope reduces the risk of adding noise to the analysis, and the risk of misinterpretation. This is because it requires the clear-cut specification (and transparent communication) of the filtering criteria used during log engineering and data transformation. It also reinforces that there is no "one size fits all", standard target log file and set of events and attributes to be reconstructed.

**Data Model Engineering in the SAP O2C Scenario.** In the following, we apply the general considerations discussed before to our SAP O2C running example. Starting with the selection of a common process instance identifier or case ID suitable for the analytical angle at hand, we define a dedicated data table for information on each process instance (i.e. the case table). Lastly, contextual data is added in a scalable way and linked to the core data model.

*Case Identifier.* When transforming source data into event records as described in Sect. 4.2, the resulting object identifier (object ID) typically references the underlying business object or document. For example, the events derived from sales order items (VBAP), e.g. Sales Order Item created, will have a concatenation of the table's primary key fields as their unique object ID reference. However, events derived from other data sources, like Delivery Item created, will correspondingly be assigned an object ID composed of the primary key fields of LIPS. This results in a need to relate the objects and documents involved in our O2C process, in order to retrieve the original process flow end-to-end.

We look at the document flow in Fig. 1 and use the link attributes we preserved in Sect. 4.1 to derive an object graph in accordance with the relationships between the corresponding data sources. The only exception is the data source NAST, whose corresponding events have already been linked to the respective sales order item object ID during event data transformation as described in Sect. 4.2. Such direct links are typically used when the business object or document has very few additional attributes of relevance, as is the case for the order confirmation in our example. Please refer to Fig. 8 for the resulting relationship model and exemplary graph.

<sup>6</sup> Intercompany transactions are between two or more related internal legal entities in the same enterprise; intracompany transactions are between two or more entities within the same legal entity.

**Fig. 8.** Relationship model relating to the case identifier.

As some of the relationships between the business object data sources are one-to-many (in some scenarios even many-to-many), the resulting graph/forest can become quite complex. Consider the example in Fig. 8, specifically the part of the graph in which two sales order items exist (X and Y), both belonging to the same sales order (R). While X has no link to any delivery item yet, Y references two distinct delivery items (U and W) with two different headers (T and Z). This means the sales order item was likely split into two deliveries. Now, one of the deliveries (W) is already billed with a billing document item (A) and header (B).

For most process analyses it is advisable to define one of the object ID types as the case identifier (caseID). Based on the underlying analytical angle and hypotheses, we select the sales order item as the identifying document type and create a mapping table, which lists all reachable objects within the forest as a function of the caseID (see Fig. 9). As a rule of thumb, when traversing the forest, the very same relationship shall not be traversed in both ways, i.e. after connecting X to R, we do not proceed to connect Y to the same set of reachable objects since it would take the same relation (VBAP - VBAK) that connects X to R in a backward direction. This approach prevents linking objects and thereby their associated events erroneously. The combination of our mapping table with the event record table from Sect. 4.2 results in a final event log table, which can already be used for process mining.
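The traversal rule can be made concrete with a small sketch built on the objects named in the Fig. 8 example. The edge list, the relation names and the function names are assumptions for illustration, and the traversal assumes the object graph is a forest (at most one path between two objects).

```python
from collections import defaultdict

# Hypothetical object graph: (referencing object, referenced object, relation).
EDGES = [
    ("X", "R", "VBAP-VBAK"),
    ("Y", "R", "VBAP-VBAK"),
    ("U", "Y", "LIPS-VBAP"),
    ("W", "Y", "LIPS-VBAP"),
    ("U", "T", "LIPS-LIKP"),
    ("W", "Z", "LIPS-LIKP"),
    ("A", "W", "VBRP-LIPS"),
    ("A", "B", "VBRP-VBRK"),
]

def build_adjacency(edges):
    """Store every edge in both directions, tagged with +1 / -1."""
    adj = defaultdict(list)
    for src, dst, rel in edges:
        adj[src].append((dst, rel, +1))
        adj[dst].append((src, rel, -1))
    return adj

def reachable_objects(case_id, adj):
    """Collect all objects reachable from the case object without
    traversing the same relationship in both directions."""
    reached = {case_id}
    stack = [(case_id, frozenset())]
    while stack:
        obj, used = stack.pop()
        for nxt, rel, direction in adj[obj]:
            if nxt in reached or (rel, -direction) in used:
                continue  # already linked, or would walk a relation backwards
            reached.add(nxt)
            stack.append((nxt, used | {(rel, direction)}))
    return reached

def mapping_table(case_ids, adj):
    """Rows of (caseID, objectID), sorted for readability."""
    return sorted((c, o) for c in case_ids for o in reachable_objects(c, adj))

adj = build_adjacency(EDGES)
mapping = mapping_table(["X", "Y"], adj)
```

With this rule, case X only reaches its own sales order R, whereas case Y additionally reaches its delivery and billing objects; X never inherits Y's deliveries via the shared header.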

*Case Table.* Most process mining analyses move beyond pure event traces quite quickly, resulting in the need to add contextual information. The most straightforward option is to create an additional table with exactly one record per process instance (i.e. per active caseID) and to add so-called case-level information to it. Since we have defined the sales order item as our case ID type, we can simply add the attributes preserved in Sect. 4.1 as case attributes (e.g. net price, material number).

**Fig. 9.** Mapping table listing reachable objects.

**Fig. 10.** Connecting additional contextual data sources to the case table.

*Contextual Data.* When using process mining in real business scenarios, the thirst for contextual information does not stop at the case table. Applied to our running example, selecting the sales order item as caseID type does not mean that additional information on the delivery document items and billing items is irrelevant. One approach is to add such information to the event log, with the tradeoff being an extremely detrimental impact on data volume. In practice, we rather opt to introduce additional tables, often one per objectID type (except the one selected as case ID type). As illustrated in Fig. 10, we connect two additional contextual data sources to the case table. In order to establish the link from these object tables to the case table, the previously generated mapping table (caseID, objectID) can be re-used.
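Re-using the mapping table for this link can be sketched as follows. The table contents and the function name are hypothetical; the object identifiers follow the Fig. 8 example.

```python
# Hypothetical mapping table rows (caseID, objectID).
mapping = [("X", "R"), ("Y", "R"), ("Y", "U"), ("Y", "W"), ("Y", "A")]

# Hypothetical contextual table for delivery items, keyed by objectID.
delivery_items = {
    "U": {"deliveredQty": 4},
    "W": {"deliveredQty": 6},
}

def context_for_case(case_id, mapping, object_table):
    """Return the contextual records linked to a case via the mapping table."""
    return [
        {"objectID": obj, **object_table[obj]}
        for cid, obj in mapping
        if cid == case_id and obj in object_table
    ]

deliveries_of_y = context_for_case("Y", mapping, delivery_items)
deliveries_of_x = context_for_case("X", mapping, delivery_items)
```

Because the contextual attributes stay in their own table, they are stored once per object rather than once per event.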

In practice, even more advanced data models, such as the one described above, do not support the testing of every hypothesis project stakeholders come up with. Sometimes, hypothesis-specific "helper" tables are created and linked into the process mining data model. In such cases, it is advisable to challenge the business value of such modifications before triggering substantial data model modifications.

## **5 Best Practices**

In this section we take stock of the above sections and distill best practices from our experience of rolling out process mining "in the large". Clearly this is a non-exhaustive list; its intent is to elaborate on the most relevant and recurring topics.

*Data Selection and Extraction.* This is the basis for process mining, and if not structured well, hiccups here can undermine the entire analytical effort. Four best-practices in this area:

**BP1** Explicitly formulate the four analytical angles and confirm them with all stakeholders.

**BP2** Find a sweet-spot between data minimization and extraction efficiency.

**BP3** Estimate the final size (and time) of extraction.

**BP4** Extract data from a QA environment or existing staging platform.

By following (BP1) one ensures common knowledge as to the analytical objectives and avoids getting lost in details. Turning to (BP2), as mentioned in Sect. 4.1, data extraction technically boils down to some form of select-statement on terabyte-sized tables. Data minimization criteria (formulated as where-constraints) add constraints to such a statement, slowing down the extraction. Therefore it is important to find the sweet spot between minimization and efficiency. One way to do so is to follow (BP3) and carry out a probe extraction with a drastically reduced scope and extrapolate the values to the full scope range. Finally, because a data extraction might have an impact on the performance of the system, (BP4) recommends the extraction of data from a QA (test) environment or staging platform, as opposed to a productive environment. Of course, for this, the extraction environment must fully cover the scope.
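The extrapolation behind (BP3) is simple arithmetic, sketched here under the strong assumption that volume and duration scale linearly with the extracted time span. The function name and the probe figures are hypothetical.

```python
def extrapolate_extraction(probe_rows, probe_seconds, probe_units, full_units):
    """Scale the measurements of a probe extraction to the full scope.
    Assumes linear scaling, which real systems may not follow."""
    factor = full_units / probe_units
    return {
        "rows": round(probe_rows * factor),
        "seconds": round(probe_seconds * factor),
    }

# Hypothetical probe: one week extracted, full scope covers 104 weeks.
estimate = extrapolate_extraction(probe_rows=50_000, probe_seconds=90,
                                  probe_units=1, full_units=104)
```

Repeating the probe with and without a given where-constraint gives a first indication of where the sweet spot of (BP2) lies.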

*Data Transformation.* When transforming data towards an event log, events are discovered according to the business logic and system-specific configurations. In doing so, precision is essential for analytical correctness. Five best-practices to emphasize in this area:

**BP5** Modularize event discovery.

**BP6** Harmonize timestamp format, currencies and other units.

**BP7** Take system customizations into account.

**BP8** Do not ignore the business logic and context.

**BP9** Use a meaningful event naming convention.

With (BP5) one creates modules to discover the different types of events (e.g. SO Item created). In doing so, adjustments in those events (e.g. naming convention or discovery logic) can be done locally without requiring the generation of a whole event log. With (BP6) one avoids misinterpretation of results and ensures a sound basis for analysis. Besides those harmonization efforts, ERP systems are highly customized to a particular business and operational mode. This can be at the level of fields in a table (e.g. a field capturing a specific company flag for a completed delivery) or the way attributes add up for an attribute (e.g. what is an automated vs. a manual step). Therefore, (BP7) recommends taking those customizations into account when transforming data. One approach is to carefully reuse and validate existing transformation scripts.<sup>7</sup> Building on that, different attributes carry aspects of the business logic, e.g. document types associated with sales orders indicating external or internal sales. In (BP8) we recommend taking this into account when generating the event log by creating different events or transaction identifiers. Finally, by (BP9) one facilitates the understanding of process maps. For example, instead of naming an event SO Item qty chg., use SO Item qty incr., already indicating how the change impacted the sales order quantity field. This creates more meaningful logs and a more effective basis for analysis. However, if overused, this leads to an inflation of distinct events, making any analysis a complex undertaking.

*Data Model Engineering.* The best-practices regarding the target data model have an impact on the scalability and ease of analysis. We emphasize the following:

**BP10** Add sanity checks.

**BP11** Modularize logs according to the analytical scope.

**BP12** Split the attributes according to attribute types.

**BP13** Consider ancillary analytics, e.g. machine learning, prediction and simulation.

In (BP10) we recommend the use of sanity check tests indicating, for example, the overall number of cases reconstructed or a summary of fields including NULL or empty values. This helps in the quality assurance phase, e.g. by matching the number of expected transactions with the number of cases or by detecting and tracing transformation bugs. By (BP11), one ensures that event logs fit their analytical angle but, at the same time, separate business concerns. Regardless of modularization, by (BP12) one separates the characteristics of events and those of transactions. This eliminates redundant data and is less error-prone during analysis. Finally, process data has been increasingly used as a subject of more advanced analytics. The data model required for such analytics differs substantially from a plain event log. In (BP13) we recommend to take this into account when deciding on the necessary attributes and their aggregation level. In some situations, it is worth creating a separate table allowing, e.g., regression or time series analytics.
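A sanity check in the spirit of (BP10) might look like the following sketch. The report layout, the attribute names and the function name are hypothetical; it only counts cases, events and NULL or empty values per field.

```python
def sanity_report(event_table, expected_cases=None):
    """Summarize an event table for quality assurance."""
    case_ids = {e["CaseID"] for e in event_table}
    empty_counts = {}
    for event in event_table:
        for field, value in event.items():
            if value is None or value == "":
                empty_counts[field] = empty_counts.get(field, 0) + 1
    report = {
        "cases": len(case_ids),
        "events": len(event_table),
        "empty_values": empty_counts,
    }
    if expected_cases is not None:
        # Compare against the number of transactions expected by the business.
        report["case_count_matches"] = len(case_ids) == expected_cases
    return report

# Hypothetical event table with one missing timestamp.
events = [
    {"CaseID": "X", "Event": "Sales Order Item created", "Timestamp": "2021-07-15T13:30"},
    {"CaseID": "Y", "Event": "Sales Order Item created", "Timestamp": ""},
    {"CaseID": "Y", "Event": "Picking completed", "Timestamp": None},
]
report = sanity_report(events, expected_cases=2)
```

Running such a report after every transformation run makes regressions in the discovery logic visible early.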

## **6 Outlook**

During the past ten years, with process mining finally finding its way from academia into market leading organizations, a lot of progress has been made in both simplifying the approach, including data preparation, for business use, as well as extending associated functionalities and proving to generate tangible value in a growing number of industries. With spreading awareness and substantial increases in capital allocation from venture firms, this development has only accelerated in the more recent past. From our perspective as practitioners, we expect the following to be some of the most substantial improvements:

<sup>7</sup> Also ready-to-use connectors provided in some process mining technologies offer an "aid" but not a "replacement" for tailored scripts.


Besides the overall maturing industry and the improvements listed above in particular, we anticipate some of the already prevalent challenges to intensify, while new ones emerge from macrotrends:


based process analytics (e.g. "What if we change A? Would there be another bottleneck?"), the lack of simulation capabilities becomes apparent. In isolation from process mining, there has been plenty of research [7] and tool support for simulation engines [13]. The challenge will be to seamlessly integrate simulation engines with process mining engines, without turning the corresponding configuration into a Customer Experience (CX) nightmare. It is expected that, as part of this convergence, additional requirements towards event log engineering emerge (e.g. statistical distribution information).

3. *Data Exchange Restrictions*. With regulations and restrictions around data sensitivity and data exchange tightening around the world, it becomes increasingly difficult to manage the compliance angle of holistic and often global analytics initiatives. This shift also leads to additional precaution whenever data is shared with third parties and especially when these are within the open domain. It will accordingly become more and more difficult for academia to work on relevant business scenarios with representative underlying data sets, which in turn results in slower and less targeted innovation in the field, including event log engineering.

In summary, the discipline of process mining and corresponding event log engineering is expected to thrive under the increased attention of academia, solution vendors, professional service firms and financiers. The most substantial impact, however, will continue to emanate from firms of all sizes adopting process mining to streamline operations and – at times – turn process excellence into their competitive advantage.

**Acknowledgment.** Rafael Accorsi would like to thank Nadja Walti, Peter Blank and Wolf-Dietrich Zabka for their valuable comments, suggestions and proof-reading.

## **References**



**Process Enhancement and Monitoring**

# **Foundations of Process Enhancement**

Massimiliano de Leoni

Department of Mathematics, University of Padua, Padua, Italy massimiliano.deleoni@unipd.it

**Abstract.** Process models are among the milestones for Business Process Management and Mining, and are used to describe a business process or to prescribe how its instances should be carried out. It follows that they need to fulfill certain properties to be useful. If they aim to represent how the process is currently being executed, they need to be precise and recall the behavior observed in reality. If the goal is to ensure that the process is executed according to laws and regulations, its model should only allow the behavior that is valid from a domain viewpoint and provide some guarantee to ensure a good performance level. Process enhancement is the type of Process Mining that aims at models that fulfill these properties, and the literature further splits it into two subfields: process extension and process improvement. *Process extension* aims to incorporate the process perspectives on data, decision, resources and time into the model: their inclusion in process models enables designers to fine-tune the model specifications, thus obtaining models with higher levels of precision. Process improvement passes through an "improved" process model. If the model contains portions of behavior that lead to unsatisfactory outcomes (high costs, low customer satisfaction, etc.) or that violate norms and regulations, one would like those portions to be disallowed by the model. In case some executions are observed in reality and are not allowed by the model, they should be incorporated into the model if they are observed to generally yield good performance. This chapter discusses these two types of process enhancement, and illustrates some basic and some advanced techniques to tackle them, highlighting the pros and cons, and the underlying assumptions.

**Keywords:** Process improvement · Process extension · Decision discovery · Role discovery · Bottleneck analyses · Model repair

A process model is one of the main milestones for Business Process Management and Mining, and may be of two natures. A first nature of process models is descriptive: they are used by process analysts to engage process stakeholders (e.g., actors, managers, chief officers) in discussions on how the instances of the process have typically been executed, or how they should be. A second nature is prescriptive, and that is the case when the models are used as input for Process-aware Information Systems to automate processes and enforce how they must be carried out [10]. In both scenarios, desirable models need to fulfill certain properties to be of fruitful use:

1. Models need to be precise and only allow legitimate behavior (high precision). This is especially relevant for models with a prescriptive nature: one wants to ensure that the information systems enforce how process instances must be executed, and also how they must not be.


This chapter introduces a number of techniques for process enhancement, which is the type of process mining that aims to create models that fulfill one or more of the properties mentioned above. Process enhancement starts with the provision of a process model for which these properties are relevant. This model can be mined from data, or designed by hand on the basis of process documentation and/or stakeholder input. The literature proposes two types of process enhancement [25]: *process extension* and *process improvement*.

*Process extension* focuses on the first property (high precision) and aims to incorporate different perspectives. The model often only defines the control-flow perspective, which is certainly the process-model backbone, but it is insufficient to precisely encode the behavior that a model must explicitly allow or disallow. Processes manipulate, read and produce data (objects), their activities are performed by resources within due deadlines, and they take time to be carried out. In the literature, these aspects are named perspectives: data, resource and time perspectives. Their inclusion in process models enables designers to fine-tune the model specifications, thus obtaining models with higher levels of precision.

*Process improvement* focuses on the other properties, and starts from the belief that, if a model has a prescriptive nature, process improvement passes through an "improved" process model. Improvement can be regarded as ensuring process models *(i)* to better reflect reality, and/or *(ii)* to only allow executions that are valid from a domain viewpoint and/or are correlated to better performances.

## **1 Process Extension: Basic Techniques**

The extension of process models to incorporate multiple perspectives relies on the presence of attributes associated with the events. The definition of simplified event log introduced in [1] can be extended accordingly:

**Definition 1 (Simplified Multi-Perspective Event Log).** *Let* $\mathcal{U}_{ev}$ *be the universe of events. A simplified event log* $L \subset 2^{\mathcal{U}_{ev}^{*}}$ *is a set of traces, sequences of events, with the constraint that an event can only belong to one trace:* $\forall \sigma', \sigma'' \in L.\ e \in \sigma', e \in \sigma'' \Rightarrow \sigma' = \sigma''$*.*

Sections 1.2, 1.3 and 1.4 illustrate basic techniques to extend the models to incorporate the data, resource and time perspectives, respectively.

## **1.1 Model-Aligned Event Logs**

Several techniques for process enhancement require that the event-log traces can be replayed on the process model (cf. [5]). This requires events to be univocally mapped onto process activities; in case of Petri-net models, events must be mapped onto transitions. However, multiple process activities (e.g., Petri-net transitions) can have the same label, and the choice of the activity to which to map each event is not necessarily local, but depends on the entire sequence of activities that are executed. Furthermore, when processes are modelled via Petri nets, the model may include invisible transitions, namely transitions with no associated labels that, by definition, leave no trail in event logs. To further complicate the matter, log traces might not be compliant with the process model: certain activities might have been executed when not expected, or not executed when expected.

**Fig. 1.** The Petri net of the working example used in this chapter. The letters inside the transitions identify the transition names, while the script underneath indicates the transition label, namely the activity name. The thicker, red-coloured place and arcs identify a decision point, namely places with outgoing arcs to multiple transitions. (Color figure online)

The situations above can be tackled by solving the following problem: given a log trace $\sigma_L = \langle e_1, \ldots, e_m \rangle$ and an accepting Petri net $AN = (N, M_{init}, M_{final})$ where $N = (P, T, F, l)$, we need to find the **model-aligned trace** $\sigma_P = \langle f_1, \ldots, f_n \rangle$ such that


The computation of a **closest model-aligned trace** can be achieved through alignments (cf. [5]), as explained through the following example: let us consider the log trace $\langle e_a, e_b, e_e \rangle$ where $e_a$, $e_b$, and $e_e$ are respectively the events for activities *Enter Loan Application*, *Retrieve Applicant Data*, and *Approve Simple*; the subscript indicates the event activity (e.g. $\#_{act}(e_a) = a$). The model is depicted in Fig. 1, where transitions $\tau_1$ and $\tau_2$ are invisible, and the label of each transition is shown under the respective transition. The alignment between the model and the trace is as follows:

$$\gamma = \begin{array}{|c|c|c|c|c|c|} \hline e_a & e_b & \gg & \gg & \gg & e_e \\ \hline a & b & c & \tau_1 & \tau_2 & e \\ \hline \end{array}$$

The top row and bottom row respectively identify the log component of the alignments (namely the events), and the process/model component (the Petri-net transitions). To create a model-aligned trace, we need to synthesize the sequence of events. For each synchronous move between an event $e$ and a transition $t$, we create an event $e'$ such that $\#_{act}(e') = t$, and for any other event attribute $a$, $\#_a(e') = \#_a(e)$. For each model move for a transition $t$, we create an event $e'_t$ such that only the activity attribute is populated: $\#_{act}(e'_t) = t$. These events are then ordered according to the order of the moves in the alignments. This means that, for the trace in question, the model-aligned trace is $\langle e'_a, e'_b, e'_c, e'_{\tau_1}, e'_{\tau_2}, e'_e \rangle$ where the subscript indicates the activity associated with the event, i.e. $\#_{act}(e'_x) = x$. Events $e'_a$, $e'_b$ and $e'_e$ are also populated with the additional attributes and values that are present for $e_a$, $e_b$, and $e_e$, respectively: for instance, for each attribute $v$ of $e_a$ different from $act$, $\#_v(e'_a) = \#_v(e_a)$.
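The synthesis of a model-aligned trace from an alignment can be sketched as follows. This is an illustrative outline, not the implementation of [5]: events are plain dictionaries, the string `">>"` stands for the no-move symbol, the resource values are made up, and dropping log moves (which the description above does not cover) is an assumption.

```python
SKIP = ">>"  # stands for the no-move symbol of the alignment

def model_aligned_trace(alignment):
    """alignment: list of (event or SKIP, transition name or SKIP) moves."""
    aligned = []
    for event, transition in alignment:
        if transition == SKIP:
            continue  # log move: assumed to be dropped
        if event == SKIP:
            # Model move: only the activity attribute is populated.
            aligned.append({"act": transition})
        else:
            # Synchronous move: copy all attributes, then set the
            # activity to the transition name.
            new_event = dict(event)
            new_event["act"] = transition
            aligned.append(new_event)
    return aligned

# Hypothetical events of the log trace; 'act' holds the activity label.
e_a = {"act": "Enter Loan Application", "resource": "Ann"}
e_b = {"act": "Retrieve Applicant Data", "resource": "Bob"}
e_e = {"act": "Approve Simple", "resource": "Carol"}

gamma = [(e_a, "a"), (e_b, "b"), (SKIP, "c"),
         (SKIP, "tau1"), (SKIP, "tau2"), (e_e, "e")]
trace = model_aligned_trace(gamma)
```

Copying the event before overwriting `act` keeps the original log untouched, so the event log and the model-aligned event log can coexist.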

Hereafter, the event log that originates from the information systems (i.e. with activity labels) is referred to as **event log**, while the event log defined over model transitions is named **model-aligned event log**. The events of model-aligned traces that stem from synchronous moves take on the attributes and their values from the mapped events of the real event log, including resource and timestamp. The events that come from model moves do not have any attribute but the activity. Note that, strictly speaking, a model-aligned trace is not a repaired trace as discussed in [5]: the activities of model-aligned traces are transition names, whereas log traces refer to transition labels.

#### **1.2 Data-Perspective Discovery**

The data perspective focuses on how data objects are manipulated by the activities during the execution of process instances. The study of this perspective is of high relevance because the process-instance execution routing is affected by the characteristics of the specific process instance, such as the amount requested for a loan or the profile of the loan requestor, and also by the outcomes of previous steps in the process, such as the results of a verification activity. As an example, let us consider golden and silver profiles of potential loan requestors: a financial institute might decide to treat golden customers via a different procedure than that for other customers. Since the data perspective affects how decisions are made in the process, this perspective is also often referred to as **decision perspective**.

Representing this perspective in an integrated model (e.g., by extending a BPMN model) or as a set of separate tables, also known as decision tables, is nowadays gaining momentum. This is also testified by the continuous refinement of the Decision Model and Notation (DMN), a standard by the Object Management Group to describe and model decision tables [21].

Historically, the discovery of the data perspective is called *Decision Mining*, a name that was introduced by the seminal work by Rozinat et al. [23]. However, this work could not be applied to Petri nets containing invisible transitions or multiple transitions associated with the same label. This limitation has been lifted in [8] through the use of alignments and the construction of model-aligned tables.


**Table 1.** A fragment of a model-compliant event log for the model in Fig. 1. The gray events have been introduced as a result of alignment model moves for invisible transitions. Their case identifier is inherited from the other trace events.

The simplest representation of the data perspective is to attach decision rules to decision points. When processes are modelled via Petri nets, decision points are places with arcs to multiple transitions (see, e.g., the red place with thick border and the outgoing arcs in Fig. 1). The rules explaining the choices are driven by additional process data, which are generated by activities/transitions preceding the split. In the remainder, this additional data is abstracted as a set of process attributes, where each attribute can take on a value within the respective attribute domain. For these Petri nets extended with data, a guard over these process attributes is attached to each transition; the guard can possibly be identically true. As for classical Petri nets, a transition is enabled only if every input place has a token; however, in this case, that is only a necessary condition: the associated guard also needs to evaluate to true with respect to the current value assignment to process attributes. Other process-modelling notations have equivalent constructs to represent this: for instance, BPMN models use XOR-split gateways, depicted as a diamond with an X symbol inside, and conditions are represented on the arcs going out of the gateway.

**Table 2.** The observation instances for the model in Fig. 1 and the event log in Table 1 to discover the guards at the decision point for place c5. The last column indicates the class feature.

The basic algorithm for guard discovery assumes that the decisions are mutually exclusive: when a process instance reaches a decision point, one and exactly one branch is enabled for any assignment of values to attributes.

In [8,23], the decision-mining problem is transformed into a classification problem: which transition is expected to fire (namely, is enabled) for each valid assignment of values to process attributes. This problem can be tackled through decision-tree learning: decision trees have the remarkable advantage of explicitly indicating the classification criteria, namely which transition is enabled for each assignment of values to attributes.

The intuition can be given through an example related to a process modelled via the Petri net in Fig. 1. A corresponding model-aligned event log is represented conveniently in tabular form in Table 1. Loan lengths are measured in years, attribute *Income* is the monthly salary, *InstalmentAmount* is the amount of each monthly instalment, and *InstalmentDivIncome* is the ratio between *InstalmentAmount* and *Income*. Last, *Verification* is a boolean process attribute to which a value is assigned as a result of executing the activity *Retrieve Applicant Data* (Petri-net transition *b*): if the retrieval of applicant data does not confirm the information provided by applicants through the first activity, attribute *Verification* takes on a false value, and the loan request is going to be rejected.

Let us focus on decision point c5, which is the input place of transitions d and τ<sub>2</sub>. It follows that d and τ<sub>2</sub> are mutually exclusive. We need to train a decision tree that defines the conditions that discriminate when d or τ<sub>2</sub> is expected to occur, which are going to become the guards of d and τ<sub>2</sub>. Table 2 shows the instances to be used to train the decision-tree model. The last column holds the class values of the learning instances, whereas the other columns refer to the independent variables. Since the model does not contain loops, exactly one token is produced in place c5 for each process instance. The first row refers to the case with identifier 1: in this case, the transition τ<sub>2</sub> is observed when the following values were assigned to the process variables through the execution of given activities (i.e. model transitions): LoanAmount = 400000, LoanLength = 30, Age = 30, Income = 2500, InstalmentAmount = 1225, Verification = TRUE and InstalmentDivIncome = 0.49. The second row refers to the second trace (id 2): in this case, d was observed with LoanAmount = 450000, LoanLength = 30, Age = 30, InstalmentAmount = 1380, Income = 3000, Verification = FALSE and InstalmentDivIncome = 0.46. In case values are assigned to process attributes by different transitions, the latest observed value is considered. Figure 2 shows a possible tree that can be learned from the observation instances in Table 2. The guards can be extracted from the decision tree by traversing the paths from the root to the leaves. The guard for any transition t is of the form expr<sub>1</sub> ∨ ... ∨ expr<sub>n</sub>, where expr<sub>i</sub> refers to the i-th path that leads to a leaf labeled as t, and is a conjunction of atoms *variable operator constant* (e.g., Age ≤ 60 or Verification = FALSE) derived from the nodes and arcs that are part of the path.

**Fig. 2.** A possible decision tree that is learned from the observation instances in Table 2.

As an example, the guard of transition d (activity *Notify Rejection*) is an expression with four sub-expressions: expr<sub>1</sub><sup>d</sup> ∨ ... ∨ expr<sub>4</sub><sup>d</sup>. Sub-expression expr<sub>1</sub><sup>d</sup> refers to the path to the left-most leaf, which includes the root node Verification and the edge associated with label FALSE, leading to the expression Verification = FALSE; expr<sub>2</sub><sup>d</sup> refers to the second left-most leaf with label d:

Verification = TRUE ∧ InstalmentDivIncome ≤ 0.5 ∧ 61 ≤ Age ≤ 70 ∧ LoanLength > 10

and so on. Considering the four paths in the decision tree, the guard for d is as follows:

Verification = FALSE ∨ (Verification = TRUE ∧ InstalmentDivIncome ≤ 0.5 ∧ 61 ≤ Age ≤ 70 ∧ LoanLength > 10) ∨ (Verification = TRUE ∧ InstalmentDivIncome ≤ 0.5 ∧ Age ≥ 70) ∨ (Verification = TRUE ∧ InstalmentDivIncome > 0.5)

That is a disjunction of conjunctions of terms related to the paths from the root to the leaves labeled with d. One can similarly obtain the guard of τ<sub>2</sub>, which is the disjunction of two expressions:

```
(Verification = TRUE ∧ InstalmentDivIncome ≤ 0.5 ∧ Age ≤ 60)
∨ (Verification = TRUE ∧ InstalmentDivIncome ≤ 0.5 ∧ 61 ≤ Age ≤ 70 ∧ LoanLength ≤ 10)
```
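The extraction of guards from root-to-leaf paths can be illustrated with a short Python sketch. The nested-list encoding of the tree, the function names, and the ASCII rendering of the operators are assumptions made for illustration; the tree itself is reconstructed from the guards listed above rather than copied from Fig. 2.

```python
def paths_to(tree, target, prefix=()):
    """Yield one tuple of atoms per root-to-leaf path ending in `target`."""
    if isinstance(tree, str):  # a leaf holds the class label (a transition)
        if tree == target:
            yield prefix
        return
    for atom, subtree in tree:  # an internal node is a list of (atom, child)
        yield from paths_to(subtree, target, prefix + (atom,))


def guard(tree, target):
    """Build the guard of `target` as a disjunction of conjunctions."""
    conjunctions = [" and ".join(path) for path in paths_to(tree, target)]
    return " or ".join(f"({c})" for c in conjunctions)


# Tree reconstructed from the guards of d and tau2 given in the text.
tree = [
    ("Verification = FALSE", "d"),
    ("Verification = TRUE", [
        ("InstalmentDivIncome <= 0.5", [
            ("Age <= 60", "tau2"),
            ("61 <= Age <= 70", [
                ("LoanLength <= 10", "tau2"),
                ("LoanLength > 10", "d"),
            ]),
            ("Age >= 70", "d"),
        ]),
        ("InstalmentDivIncome > 0.5", "d"),
    ]),
]

guard_d = guard(tree, "d")
guard_tau2 = guard(tree, "tau2")
```

Each path contributes one conjunction, so `guard_d` has four disjuncts and `guard_tau2` has two, mirroring the expressions above.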
#### **Missing Values**

Let us consider again the model-aligned event log in Table 1, and suppose that the event for transition b in the trace with case identifier 1 comes from a model move. This means that, in the real event log before computing the alignment, an event for b in the first trace was not observed. In such a case, the information that transition b assigns a TRUE value to *Verification* is lacking. As a consequence, the first decision-tree training instance has a missing value for attribute *Verification*. Several techniques exist for the management of missing values of a given variable f, such as:


Several implementations of decision-tree learning algorithms (e.g., of C4.5 [22]) are already equipped with missing-value management. However, it is important to think carefully about the meaning of missing values. It might be, as many schemes implicitly assume, that the value was produced but, for quality issues, was not recorded in the dataset. However, it could also mean that the transition did not produce that value, due to, e.g., some concept drift or the impossibility of finding a suitable value for the specific instance in question. When this is true, *the missing value conveys important information*, and the learning instance should carry the information that the value was missing via an additional boolean feature, instead of injecting random values. This additional feature can increase the discriminative power of the guards, differentiating the situations in which the information was provided from those in which the information was missing.
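As a minimal sketch of this idea, the following Python function adds one boolean feature per attribute to flag missing values. The dictionary representation of observation instances, the use of `None` for missing values, and the `_missing` suffix are assumptions chosen for illustration.

```python
def add_missing_indicators(instances, attributes):
    """Return copies of the instances with a boolean '<attr>_missing' feature."""
    augmented = []
    for instance in instances:
        row = dict(instance)
        for attr in attributes:
            # True when the attribute is absent or explicitly set to None.
            row[f"{attr}_missing"] = instance.get(attr) is None
        augmented.append(row)
    return augmented


# First instance: the event for b came from a model move, so Verification
# was never assigned. Values follow the first two rows described for Table 2.
instances = [
    {"Age": 30, "Income": 2500, "Verification": None, "class": "tau2"},
    {"Age": 30, "Income": 3000, "Verification": False, "class": "d"},
]
augmented = add_missing_indicators(instances, ["Verification"])
```

The original value is left untouched, so a learner with built-in missing-value handling can still be used on the augmented instances.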

#### **1.3 Organizational Mining**

Organizational Mining focuses on the resources, which refer to anyone or anything involved in performing activities, such as a human process participant, a software system (e.g., a server) or a piece of equipment (e.g., a production machine). The organization perspective, also referred to as *resource perspective*, aims to model how resources are grouped, and how they interact with each other.

Among different goals, organizational mining aims at understanding how resources collaborate to carry out individual process instances. Typically, the resource collaborations can be represented as *social networks*, which are graphs where nodes are the resources and arcs, directed or undirected, indicate some form of collaboration between pairs of resources [25,26]. Arcs can also be given weights, which are proportional to the frequency/intensity of the collaborations. One of the most studied social networks in Business Process Management relates to the hand-over of work between resources. A work hand-over between two resources a and b exists in a process instance p if a has executed an activity for p, which is directly followed by a second activity that is executed by b. This implies that a has handed over the progression of the execution of p to b, which, in turn, can later hand it over to another resource. Organizational mining aims to discover these social networks, and later to analyze them. Social network analysis is very interesting in Organizational Mining because it can unveil relevant information about resources. Notably, it can discover *cliques* of resources that tend to work together, or critical resources that are less "replaceable". Less replaceable resources are characterized by a large number of incoming and outgoing arcs, and the removal of the corresponding nodes in the graph may create longer paths between pairs of resources, or even yield disconnected components.

**Fig. 3.** An example of application of role discovery for the process referring to the log in Table 1. Table (a) is the resource-activity matrix, where colors are used to define a reasonable grouping of rows, i.e. resources. When no value is depicted for a cell, it should be read as zero. Table (b) details the discovered roles. Note that the role name cannot be automatically derived.
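The hand-over-of-work network described above can be computed by counting directly-follows pairs of resources. The sketch below assumes that a log is a list of traces, each a list of (activity, resource) pairs in execution order; skipping pairs in which a resource hands work over to itself is a design choice made here, and the toy log is hypothetical.

```python
from collections import Counter


def handover_network(log):
    """Count hand-overs of work; keys are (from_resource, to_resource) arcs."""
    arcs = Counter()
    for trace in log:
        # Pair every event with its direct successor in the same trace.
        for (_, giver), (_, receiver) in zip(trace, trace[1:]):
            if giver != receiver:  # assumption: self-hand-overs are ignored
                arcs[(giver, receiver)] += 1
    return arcs


# Hypothetical toy log with two traces.
log = [
    [("a", "Mark"), ("b", "John"), ("c", "Max"), ("d", "Sue")],
    [("a", "Mark"), ("b", "John"), ("c", "John"), ("d", "Sue")],
]
network = handover_network(log)
```

The resulting counts can be used directly as arc weights of the social network.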

Space considerations prevent us from further discussing social-network analysis, and force us to focus instead on analyzing the event logs to discover *roles*, i.e. groups of resources that work on the same activities. Clustering techniques are simple techniques to discover roles within organizations, especially under the assumption that a resource plays one single role. The starting point is to build a resource-activity matrix, such as that in Fig. 3(a). Rows refer to different resources and columns to different activities. The value for row r and column a indicates the average number of times that r executes a in a process execution. For instance, Mark executes activity a 0.6 times per case, on average. Note that, if an activity is executed exactly once per process instance, the sum of the values of the cells of the corresponding column is one. A sum lower than one indicates that the activity is optional, whereas a sum higher than one indicates that it is involved in a loop.

In a resource-activity matrix, each row is a different resource and can be regarded as a vector with as many dimensions as the number of process activities: the value of the dimension for a given activity is equal to that of the corresponding cell in the matrix. Rows are thus vectors, i.e. points of a Cartesian space, that can be clustered via well-known clustering algorithms, such as K-Means or DBSCAN [19]. The row colors in Fig. 3(a) illustrate a reasonable clustering for the matrix in question. As an example, Mark and Sue belong to the same cluster and, hence, play the same role: their role allows them to perform activities *a* and *d*. The same holds for John and Max, who can perform *b* and *c*. Anne and Jennifer form the third role, which enables them to perform *e* and *f*. Note that it would be equally reasonable to split the cluster with Anne and Jennifer into two, although a simpler solution with fewer roles is possibly preferable when equivalent.
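The following Python sketch builds a resource-activity matrix and groups its rows. To stay short, it does not run K-Means or DBSCAN: it uses a deliberately simpler stand-in that puts together resources performing exactly the same set of activities. The log format and the toy data are hypothetical, chosen so that Mark executes a 0.6 times per case as in the example above.

```python
from collections import defaultdict


def resource_activity_matrix(log):
    """Average number of executions of each activity per case, per resource."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in log:
        for activity, resource in trace:
            counts[resource][activity] += 1
    n_cases = len(log)
    return {r: {a: c / n_cases for a, c in row.items()}
            for r, row in counts.items()}


def roles_by_activity_set(matrix):
    """Simplified grouping: same set of performed activities, same role."""
    groups = defaultdict(list)
    for resource, row in matrix.items():
        groups[frozenset(row)].append(resource)
    return {acts: sorted(members) for acts, members in groups.items()}


# Hypothetical log: five cases, each a list of (activity, resource) pairs.
log = [
    [("a", "Mark"), ("b", "John"), ("c", "Max"), ("d", "Sue")],
    [("a", "Mark"), ("b", "Max"), ("c", "John"), ("d", "Mark")],
    [("a", "Sue"), ("b", "John"), ("c", "John"), ("d", "Mark")],
    [("a", "Mark"), ("b", "Max"), ("c", "Max"), ("d", "Sue")],
    [("a", "Sue"), ("b", "John"), ("c", "Max"), ("d", "Sue")],
]
matrix = resource_activity_matrix(log)
roles = roles_by_activity_set(matrix)
```

A distance-based clustering algorithm would additionally tolerate resources that only occasionally perform an activity outside their role, which this exact-match grouping does not.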

#### **1.4 Time Perspective**

A process-instance execution can take a considerable amount of time to be carried out. Depending on the domain, it might even take months or years to conclude: consider, e.g., a health-care process to follow up cancer diagnoses, or one to give monthly unemployment benefits, or even a process to reintegrate workers who have suffered physical issues that prevent them from going back to their original employment. It follows that process activities are not instantaneous, as we have so far considered, but take some time to be executed. In fact, certain activities require external inputs (e.g., the production of documents, the arrival of materials and other goods), and the availability of necessary machines and suitable human resources. If these requirements are not met at the moment when the activities are ready to be started, their execution is necessarily delayed. These delays can have a cascading effect on other activities that follow in the process.

Within the realm of Process Mining, the time perspective focuses on the timing of events that carry timestamp information. The time-perspective analysis can notably be used to discover process bottlenecks and monitor the service levels: this analysis enables verifying whether executions are carried out within a reasonable amount of time (e.g., a complaint is addressed within the same day in which it is filed), or whether the temporal process constraints are fulfilled (e.g., the second shot of the Pfizer COVID-19 vaccine is given within 21 days from the first). The analyses of the time and resource perspectives also partly overlap: thanks to the time information, process analysts can assess, for instance, whether resources are fairly, overly, or scantily utilized.

The verification of the satisfaction of time-related constraints is related to conformance checking (see [5]). As an example, Mannhardt and Blinde illustrate an interesting case study to check the conformance of the treatment of patients who suffered from sepsis [18]. The conformance checking of the time perspective is not covered in this chapter, which instead focuses on extending and annotating a process model to unveil potential time-related issues, especially process bottlenecks.

The presence or absence of bottlenecks can be related to *(i)* waiting time, namely the difference between the timestamp of the actual start of an activity instance and the earliest moment in which the instance could have started (cf. above discussion of delays caused by lack of resources), or *(ii)* service time, i.e. the duration of an activity-instance execution.

Several ways exist to analyse the service and waiting times of the activities of a process model, e.g. modelled via Petri nets. The performance at the different points of the model can be analyzed through *queue mining* [24]. For instance, queue mining can be employed to estimate how long a token typically remains unconsumed in a Petri-net place. This estimation is far from easy because it requires considering several factors: the average length of the token queues in places, the policy of consumption of tokens (FIFO or according to some priorities), the relationships between places (e.g., connected to the same transition), etc. Queue mining considers the process model as a queuing network, whose characteristics are determined after analysing the information stored in event logs. A queuing network is used to determine the activity execution policies. Once a queuing network is created, several off-the-shelf techniques can be employed for its analysis.

Space limitations force this chapter to focus only on a simple technique based on the Petri-net token-replay game: real-log traces are transformed into model-aligned traces that are replayed on the Petri-net model to collect waiting and service times. The transformation to model-aligned traces ensures that they are replayable on the model. However, the firings of Petri-net transitions are atomic by definition, and hence their execution takes no time. This is clearly not realistic, and requires explicitly modelling the start and completion of activity instances as two separate Petri-net transitions. This explicit modelling can be simply explained through our working example of the process modelled as in Fig. 1.

Each visible transition is split into a sequence of two transitions that model the start and completion of activity instances, yielding the Petri net in Fig. 4(a). For instance, activity *Enter Loan Application* is now represented through two transitions, named *a<sub>s</sub>* and *a<sub>c</sub>*, which respectively fire when instances of that activity start or complete. We aim to play the token game: this means that transition *a<sub>s</sub>* fires upon a start event for activity *Enter Loan Application*, and *a<sub>c</sub>* upon a complete event for the same activity. Consequently, a token in one of the places named a<sub>r</sub>, ..., f<sub>r</sub> indicates that the activity associated with transitions a<sub>s</sub>, ..., f<sub>s</sub>, respectively, is being executed. Note that there is no need to split invisible transitions: they are necessary for modelling purposes and do not represent an actual activity, and hence can be considered as instantaneous. As mentioned earlier, the real event log needs to be translated into a model-aligned event log that can be directly replayed on the Petri net of the process model, and alignment techniques are used for this purpose. In the scenario in which events refer to either the start or the completion of the activities, the two transitions that indicate the start and completion of any activity *x* both need to be mapped to events for *x*, but the first to events related to the start of *x* and the second to events related to the completion of *x*. For the model in Fig. 4(a), transition *a<sub>s</sub>* is mapped to events related to the start of *Enter Loan Application*, and *a<sub>c</sub>* to events related to the completion of *Enter Loan Application*.

After computing the alignments with this mapping, it is possible to synthesize the model-aligned event log in Fig. 4(b). Gray rows refer to the firing of invisible transitions: in that case, the timestamp of the associated events is set to the earliest moment in which the transition could have fired. Consider transition τ<sub>1</sub> and the first trace: τ<sub>1</sub> can fire when both transitions b<sub>c</sub> and c<sub>c</sub> have fired; b<sub>c</sub> fires at time 11 and c<sub>c</sub> at time 10 for the first trace, and consequently the earliest moment in which τ<sub>1</sub> can fire is time 11.
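The timestamp assignment for invisible transitions amounts to taking a maximum, as the short Python sketch below shows. The function name, the dictionary of firing times, and the identifiers `b_c` and `c_c` (standing for the completion transitions of b and c) are notational assumptions.

```python
def earliest_firing_time(fired, required):
    """Earliest time at which a transition can fire, given the firing times
    of the transitions that must have fired before it."""
    return max(fired[t] for t in required)


# First trace of the example: b completes at time 11, c at time 10.
fired = {"b_c": 11, "c_c": 10}
tau1_time = earliest_firing_time(fired, ["b_c", "c_c"])
```

Here `tau1_time` is 11, the later of the two completion times.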

Each trace of the model-aligned event log can be replayed on the Petri net. This allows computing the amount of time in which a token resides in a given place, i.e. the difference between the timestamp at which the token was consumed and the timestamp at which it was produced. For example, consider the place a<sub>r</sub>: tokens are produced in that place when transition a<sub>s</sub> fires and are consumed when transition a<sub>c</sub> fires. For the trace with case identifier 1, a<sub>c</sub> and a<sub>s</sub> respectively fired at time 4 and 1, thus the difference is 3. The residence time of each token in each place can be computed by replaying the model-aligned event log: these timestamp differences are shown within the clouds associated with the different places in Fig. 4(a). The average per place can subsequently be computed, and is shown next to the respective cloud. One can color each place in red with an intensity that is proportional to the mean value of time: white is associated with an average of zero, and the color becomes closer and closer to dark red as the average gets closer to the largest observed value.

**Fig. 4.** An example of extending the process model with the time perspective. The left-hand side picture shows how the process model in Fig. 1 can be annotated with temporal information with respect to the model-aligned log shown in the right-hand side table. The gray lines in the table are the events related to invisible transitions.

Considering place a<sub>r</sub> again, the average time is 3.4 for the event log in Fig. 4(b). This indicates that the average duration of instances of activity *Enter Loan Application* is 3.4 time units (e.g., hours). Consider place c1: tokens are produced in the place after the completion of the same activity and consumed when transition b<sub>s</sub> fires, namely when activity *Retrieve Applicant Data* starts. For the first trace, the amount of time a token is in c1 is 3 time units, namely the timestamp of the event for b<sub>s</sub>, which is 7 for the first trace, minus the timestamp of the event for a<sub>c</sub>, i.e. 4. After collecting the times for each token in c1 for all traces (see the cloud connected to the place) and computing the average, one obtains the average time between the completion of instances of the preceding activity *Enter Loan Application* and the start of the corresponding instances of *Retrieve Applicant Data*.
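The computation of residence times can be sketched in Python as follows. The sketch assumes a loop-free model in which each considered place has a single producing and a single consuming transition, and it writes transition names as `a_s`, `a_c` and `b_s`. The first trace uses the timestamps quoted above; the second trace is hypothetical.

```python
from statistics import mean

# place -> (producing transition, consuming transition)
PLACES = {"a_r": ("a_s", "a_c"), "c1": ("a_c", "b_s")}


def residence_times(log, places):
    """Collect, per place, the time each token stays in it across all traces."""
    times = {place: [] for place in places}
    for trace in log:
        fired = dict(trace)  # transition -> timestamp (no loops assumed)
        for place, (producer, consumer) in places.items():
            if producer in fired and consumer in fired:
                times[place].append(fired[consumer] - fired[producer])
    return times


log = [
    [("a_s", 1), ("a_c", 4), ("b_s", 7)],  # trace 1, timestamps from the text
    [("a_s", 2), ("a_c", 6), ("b_s", 8)],  # hypothetical second trace
]
times = residence_times(log, PLACES)
averages = {place: mean(values) for place, values in times.items()}
```

The average for `a_r` estimates the service time of *Enter Loan Application*, while the average for `c1` estimates the waiting time of *Retrieve Applicant Data*.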

#### **Dealing with Non-compliant Traces and Missing Timestamps**

So far, we have assumed that *(i)* activity executions leave trails in the log through both start and completion events, and *(ii)* every trace is compliant with the model. In particular, assumption *(ii)* means that the model-aligned traces only include additional events related to the firing of invisible transitions. These assumptions do not always hold in reality: event logs often only contain the events related to the completion of activity instances, and some traces are not fully compliant with the model (cf. the Conformance Checking field discussed in [5]).

*Assumption* (i) *is not met.* In this case, one can employ a naïve approach that assumes that the next activity in the process starts as soon as the previous one completes: the timestamp of the starting event is then the same as the timestamp of the completion event of the preceding activity. This is often unrealistic, as depicted in Fig. 5. In the timeline, *Completion of a* indicates the moment in which activity instance *a* completes and *Real start of a* is the actual moment in which *a* started, which has left no trail in the event log. Moment *Completion of the activity instance preceding a* is when the previous activity concluded. The time difference between *Completion of the activity instance preceding a* and *Real start of a* corresponds to the waiting time of *a*. If this time difference is set to 0, no waiting time is assumed. A better estimation can be obtained if the event log contains information about the resource perspective: one can look at the completion event of a given activity instance a and consider the resource r that performed the instance. The starting timestamp of the activity instance is then set to the earliest moment, after the completion of the activity instance that precedes a, in which r has completed any other activity instance and has become available [20].

**Fig. 5.** Representation of the scenario when the timestamp of the start event of an activity instance a is not present in the event log and needs to be estimated. This timestamp is located between the timestamp when a completes and the earliest timestamp when the resource r that is going to perform a is available to start a. This time interval is represented through a green area, and the real start of a, which is unknown, is located within the area. In case the resource information is missing, we do not even have the earliest timestamp of availability of r: this introduces further uncertainty, because we can only rely on the timestamp of completion of the activity instance that precedes a in the trace.

This corresponds to the moment in the figure labelled *Availability of the resource that performed a*: this introduces some waiting time, namely the time difference between *Completion of the activity instance preceding a* and *Availability of the resource that performed a*, and is thus more realistic. Even this estimate is often unrealistic in practice [11]: resources *(a)* work on multiple processes and continuously switch from one to the other, while event logs refer to one process, *(b)* take breaks during the working day (e.g., when tired), and *(c)* carry out additional duties that leave no trail in the event logs (e.g., answering the phone). Let us consider Fig. 5 again: the actual start is at a moment between when the resource has become available and when the activity instance has been completed. The choice of different estimated start moments leads to different estimated activity-instance durations. In [14], Fracca et al. propose a technique to estimate the starting event where different activity-duration configurations are simulated, and the resulting simulated event log is compared with the real event log to assess the similarity with respect to time-related aspects (activity-instance waiting times and process-instance durations): the more similar the real and simulated event logs are, the more realistic the estimation of activity-instance durations is. The simulation of different activity-duration configurations requires a simulation model, which consists of a process model that is extended with additional information related to the simulation aspects, such as the inter-arrival time, the routing probabilities at the XOR gateways, the roles and the resource-activity allocation, potential work calendars, and more. The simulation model can be constructed by combining different process mining techniques, as also discussed in [14].
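The resource-based estimate of the start timestamp can be sketched as below. This is one possible reading of the estimate, not the exact procedure of [20]: the start of an activity instance is taken as the later of the completion of the preceding activity instance in the trace and the last completion by the same resource before the instance itself completes. All names and timestamps are hypothetical.

```python
def estimate_start(completion, prev_completion, resource_completions):
    """Estimate the start of an activity instance that completed at
    `completion`, was preceded in its trace by an instance completed at
    `prev_completion`, and was performed by a resource whose completion
    timestamps (over the whole log) are `resource_completions`."""
    # Completions by the same resource that happened before this one.
    earlier = [t for t in resource_completions if t < completion]
    # If the resource did nothing before, it is assumed to be available.
    resource_available = max(earlier, default=prev_completion)
    return max(prev_completion, resource_available)


# The resource was busy until time 14, so the instance cannot start at 10.
start_busy = estimate_start(20, 10, [5, 14, 20, 30])
# The resource was already free at time 10: the naive estimate applies.
start_free = estimate_start(20, 10, [5, 20])
```

When the resource was idle, the function falls back to the naïve estimate, i.e. the completion time of the preceding activity instance.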

*Assumption* (ii) *is not met.* This can clearly be caused by not meeting assumption *(i)*: the starting events are missing, yielding model moves for every Petri-net transition linked to the start of activity instances. Hereafter, we consider the situation in which assumption *(i)* is met. In this case, the deviations are related to activities that have not been performed in accordance with the process model, and both the starting and completion events are missing. If the number of non-compliant traces is limited, these can be excluded from the analysis. Otherwise, the log traces are aligned to create model-aligned traces, without adding timestamps to the events that come from model moves for visible transitions: for the sake of reliability, statistics are then computed by only considering pairs of subsequent events that both have an associated timestamp.

## **2 Process Extension: Advanced Techniques**

This section introduces some advanced techniques to overcome the limitations of the basic algorithms for decision mining and for role discovery. In particular, the basic algorithm for decision mining introduced in Sect. 1.2 is only able to discover guards with atoms of the form *var-op-const*, where *var* is a variable, *op* is a comparison operator and *const* is a constant (e.g. Age ≤ 60 or Verification = FALSE), while the basic algorithm for role discovery in Sect. 1.3 assumes that a resource can play one single role only. Sections 2.1 and 2.2 discuss some advanced techniques that aim to overcome these limitations.

#### **2.1 Data-Perspective Discovery of Guards with Variable Comparison**

Let us consider a variant of the event log in Table 1 where *InstalmentAmount* is present but attribute *InstalmentDivIncome* is missing. As mentioned above, the basic guard-discovery algorithm will be unable to discover guards that include an atom InstalmentAmount/Income > 0.5, or its negation. The work by de Leoni et al. [7] reports on an extension to the basic algorithm that can discover atoms of the form *var-op-var*, such as InstalmentAmount > 0.5 · Income.

The algorithm builds on an oracle that discovers invariants in a set of observation instances, such as the Daikon system [6]. Analogously to the basic algorithm, the algorithm is applied for each place p of the Petri net modelling a process, and consists of five steps:


**Fig. 6.** The possible decision tree that is learned from the observation instances in Table 2 augmented with boolean features related to discovered invariants, such as *InstalmentAmount*/*Income* > 0.5.

5. A decision tree is trained using the set of augmented observation instances.

For the working example, a decision tree such as that in Fig. 6 is learnt: the invariant is now able to discriminate between the instances of d and of τ<sub>2</sub>.
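The augmentation of observation instances with boolean features derived from invariants can be illustrated with the Python sketch below. It only covers the augmentation that precedes the training of the decision tree: the invariant is hard-coded here, whereas in the approach above it would be returned by an oracle such as Daikon. The first two instances use the values reported for Table 2; the third one is hypothetical.

```python
# Candidate invariants, each a named predicate over an observation instance.
# In the approach above they would be discovered by an oracle; here the
# single invariant is hard-coded for illustration.
INVARIANTS = {
    "InstalmentAmount > 0.5 * Income":
        lambda inst: inst["InstalmentAmount"] > 0.5 * inst["Income"],
}


def augment(instances, invariants):
    """Add one boolean feature per invariant to every observation instance."""
    augmented = []
    for instance in instances:
        row = dict(instance)
        for name, holds in invariants.items():
            row[name] = holds(instance)
        augmented.append(row)
    return augmented


instances = [
    {"InstalmentAmount": 1225, "Income": 2500, "class": "tau2"},
    {"InstalmentAmount": 1380, "Income": 3000, "class": "d"},
    {"InstalmentAmount": 1600, "Income": 3000, "class": "d"},  # hypothetical
]
augmented = augment(instances, INVARIANTS)
```

A decision-tree learner that only supports *var-op-const* atoms can then split on the new boolean feature, which encodes the *var-op-var* comparison.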

#### **2.2 Discovery of Roles with Overlapping Resources**

The basic organization-mining technique discussed in Sect. 1.3 relies on clustering, and thus assumes each resource to play exactly one role. In many settings, this assumption does not hold: resources can be associated with multiple roles. Burattin et al. [4] lift this assumption by clustering activities instead of resources: the clustering puts together the activities that need to be executed by resources playing the same role.<sup>1</sup> The starting point is a process model and its dependencies of the form a → b, i.e. activity b can follow a but a cannot follow b. Clustering is obtained by removing all the dependencies a → b for which the handover is larger than a given threshold τ<sub>w</sub>:<sup>2</sup>

<sup>1</sup> The terminology and formalization used hereafter slightly differ from those in [4], to harmonize with the rest of the chapter.

<sup>2</sup> Given two multisets X and Y, the intersection X ∩ Y returns a multiset that contains every element z present in both X and Y, with the lowest cardinality for z between that in X and that in Y. Symbol ⊎ indicates the union of multisets: the cardinality of each element in the union of two multisets X and Y is equal to the sum of the cardinalities of the element in X and in Y. Given a sequence σ, a second sequence σ′ ∈ σ if σ′ is a sub-sequence of σ.

**Definition 2 (Resource Handover for a Model Dependency).** *Let* $a \to b$ *be the dependency between two activities* $a$ *and* $b$*. Let* $L$ *be an event log and*

$$\mathcal{R}_{a \to b} = \biguplus_{\sigma \in L} \ \biguplus_{\langle e_i, e_j \rangle \in \sigma.\ \#_{act}(e_i) = a \,\wedge\, \#_{act}(e_j) = b} \big(\#_{res}(e_i), \#_{res}(e_j)\big)$$

*be the multiset of pairs of resources in* $L$ *where the first resource executes* $a$ *and is immediately followed by the second resource executing* $b$*. Let* $\mathcal{R}^a_{a \to b}$ *and* $\mathcal{R}^b_{a \to b}$ *be the projections over the first and second component of* $\mathcal{R}_{a \to b}$*, respectively. Let* $\mathcal{R}^=_{a \to b}$ *be the pairs with the same resource value on both components. The resource handover for dependency* $a \to b$ *for* $L$ *is defined as follows:*

$$w_{ab}(L) = 1 - \frac{|\mathcal{R}_{a \to b}^a \cap \mathcal{R}_{a \to b}^b| + |\mathcal{R}_{a \to b}^=|}{|\mathcal{R}_{a \to b}^a| + |\mathcal{R}_{a \to b}^b|}$$

The definition states that w<sub>ab</sub>(L) approaches zero as it becomes more frequent that the two activities a and b are performed by the same resources. If activities belong to the same cluster, the resources that perform them can play the same role.

As an example, let us consider the dependency a → c for the model in Fig. 1 and the log in Table 1. It follows that ℛ<sub>a→c</sub> = [(Mark, Max)<sup>2</sup>, (Sue, Anne)<sup>1</sup>], where the superscript indicates the cardinality; hence, ℛ<sup>a</sup><sub>a→c</sub> = [Mark<sup>2</sup>, Sue<sup>1</sup>] and ℛ<sup>c</sup><sub>a→c</sub> = [Max<sup>2</sup>, Anne<sup>1</sup>]. Therefore, the resource handover for the dependency is w<sub>ac</sub>(L) = 1. Value 1 is obtained when the sets of resources are totally disjoint, as is the case for a and c: the dependency is hence removed, making a and c belong to different clusters. Repeating the reasoning on dependency a → b, we obtain w<sub>ab</sub> = 1, thus causing a and b to belong to different clusters.
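The computation in this example can be reproduced with a few lines of Python. The sketch below is only an illustration of Definition 2: the function name `resource_handover` and the encoding of the multiset of resource pairs as a `Counter` are assumptions made here, not part of [4].

```python
from collections import Counter


def resource_handover(pairs):
    """Resource handover w_ab(L) of Definition 2.

    `pairs` is a Counter that encodes the multiset R_{a->b}: it maps each
    (resource executing a, resource executing b) pair to its cardinality.
    """
    first, second = Counter(), Counter()  # projections R^a and R^b
    same = 0                              # |R^=|: pairs with identical resources
    for (res_a, res_b), count in pairs.items():
        first[res_a] += count
        second[res_b] += count
        if res_a == res_b:
            same += count
    total = sum(first.values()) + sum(second.values())
    if total == 0:
        raise ValueError("the dependency was never observed in the log")
    shared = sum((first & second).values())  # multiset intersection (minimum)
    return 1 - (shared + same) / total


# Multiset R_{a->c} from the example in the text.
pairs_ac = Counter({("Mark", "Max"): 2, ("Sue", "Anne"): 1})
print(resource_handover(pairs_ac))  # 1.0
```

`Counter`'s `&` operator keeps the minimum cardinality of each element, which matches the multiset intersection of footnote 2.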

Ultimately, this means that activity *a* forms a cluster on its own. However, Fig. 3(b) shows that activities a and d should belong to the same cluster, so that the resources performing them are assigned to the same role. This cannot happen if we only look at the dependencies, because there is no dependency a → d, or vice versa. Therefore, after partitioning the activities, some clusters can be merged. This occurs if the so-called merging degree is larger than a given threshold τ<sub>ρ</sub>:

**Definition 3 (Merging Degree).** *Let* A<sub>1</sub> = {a<sub>1,1</sub>, ..., a<sub>1,n</sub>} *and* A<sub>2</sub> = {a<sub>2,1</sub>, ..., a<sub>2,n</sub>} *be two activity clusters. Let* L *be an event log. For any set* A *of activities, let us denote the multiset of resources executing activities in* A *with*

$$\mathcal{R}\_{A} = \biguplus\_{\sigma \in L} \; \biguplus\_{e \in \sigma \,:\, \#\_{act}(e) \in A} \#\_{res}(e)$$

*The merging degree of* A<sub>1</sub> *and* A<sub>2</sub> *is defined as:*

$$\rho\_{A\_1, A\_2}(L) = 2 \frac{|\mathcal{R}\_{A\_1} \cap \mathcal{R}\_{A\_2}|}{|\mathcal{R}\_{A\_1}| + |\mathcal{R}\_{A\_2}|}$$

Similarly to Definition 2, the merging degree measures the share of resources that the executors of two activity sets A<sub>1</sub> and A<sub>2</sub> have in common. If ρ<sub>A1,A2</sub>(L) > τ<sub>ρ</sub>, A<sub>1</sub> and A<sub>2</sub> are merged.
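As a minimal sketch of Definition 3, the merging degree can be computed as follows. The log encoding (a list of traces, each a list of `(activity, resource)` pairs), the toy log, and the choice to return 0 when neither cluster occurs in the log are assumptions made for illustration only.

```python
from collections import Counter


def resources_of(log, activities):
    """Multiset R_A of the resources that execute activities in `activities`."""
    return Counter(res for trace in log for act, res in trace if act in activities)


def merging_degree(log, cluster_1, cluster_2):
    """Merging degree rho_{A1,A2}(L) of Definition 3."""
    r1 = resources_of(log, cluster_1)
    r2 = resources_of(log, cluster_2)
    total = sum(r1.values()) + sum(r2.values())
    if total == 0:
        return 0.0  # assumption: clusters that never occur are never merged
    return 2 * sum((r1 & r2).values()) / total


# Hypothetical log: each event is an (activity, resource) pair.
toy_log = [
    [("a", "Mark"), ("b", "Sue"), ("d", "Mark")],
    [("a", "Sue"), ("b", "Anne"), ("d", "Mark")],
]
print(merging_degree(toy_log, {"a"}, {"d"}))  # 0.5
```

With a hypothetical threshold of 0.4, the clusters {a} and {d} of this toy log would be merged.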

In conclusion, the algorithm to discover roles where resources can belong to multiple roles is as follows:


It is worth noting that the relevant values for the resource-handover threshold τ<sub>w</sub> are limited to the set of handover values w<sub>xy</sub> computed for each dependency x → y. Given that the number of dependencies is finite and usually small, it is possible to exhaustively apply the role discovery, setting τ<sub>w</sub> in turn to every value w<sub>xy</sub> with x → y. This enables process analysts to evaluate the different configurations and determine the most realistic role set, using business knowledge. Also, once a value is set for τ<sub>w</sub> and the clusters are created, one can reason similarly for τ<sub>ρ</sub>: the number of values to test is finite, i.e. the values ρ<sub>A′,A″</sub>(L) for each pair (A′, A″) of clusters at step 4.
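The enumeration of candidate thresholds can be sketched as follows. The code assumes that activity clusters are the connected components of the dependency graph once every dependency with a handover above τ<sub>w</sub> is removed; this reading, the function names, and all handover values except w<sub>ab</sub> = w<sub>ac</sub> = 1 are assumptions for illustration.

```python
def clusters_for_threshold(activities, handover, tau_w):
    """Connected components after dropping dependencies with handover > tau_w."""
    parent = {a: a for a in activities}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (x, y), w in handover.items():
        if w <= tau_w:                 # the dependency x -> y is kept
            parent[find(x)] = find(y)

    groups = {}
    for a in activities:
        groups.setdefault(find(a), set()).add(a)
    return sorted(sorted(g) for g in groups.values())


def candidate_clusterings(activities, handover):
    """One clustering per relevant threshold, i.e. per observed handover value."""
    return {tau: clusters_for_threshold(activities, handover, tau)
            for tau in sorted(set(handover.values()))}


# Handover values per dependency; only w_ab = w_ac = 1 come from the text.
handover = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "d"): 0.4, ("c", "e"): 0.2}
for tau, clusters in candidate_clusterings("abcde", handover).items():
    print(tau, clusters)
```

An analyst can then inspect the clustering obtained for each candidate value and pick the one that best matches business knowledge.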

## **3 Process Improvement**

Process analysts and certain stakeholders (e.g., CEOs) often have only a partial or helicopter-like view of the organizations in which processes are executed. As a consequence, the process models that they have in mind (also known as "to-be" models) may not summarize how processes are *really* executed by resources. In these cases, such "to-be" models are of limited use. Improvement can be regarded as altering the model so that it reflects reality (i.e., improvement on fitness) while ensuring that the other quality criteria (i.e., precision, generalization and simplicity) remain within a reasonable range. The result is an "as-is" model that shows how the process is *really* executed. Section 3.1 details how models can be improved on fitness.

However, if models are used to prescribe how processes ought to be executed, they should only represent the behavior with which the organization is satisfied. If the model contains portions of behavior that lead to unsatisfactory outcomes (high costs, low customer satisfaction, etc.) or that violate norms and regulations, one would like those portions to be disallowed by the model. Section 3.2 details how models can be improved to ensure that no regulations are violated and to incorporate behavior that has proven to yield good performance levels.

The classical problem of process discovery discussed in [2,3] and that of process improvement share some commonalities in that they both aim to come up with "as-is" models. The difference lies in the fact that the problem of process discovery is largely unsupervised (little or no knowledge is fed in), while process improvement is supervised: an original model is provided, which constitutes the initial "backbone" that is later altered to obtain an "as-is" model. It follows naturally that process improvement is generally able to produce better models because the original model encodes behavior that is deemed appropriate from a business viewpoint. This reasoning is especially valid when the original model is hand-designed by or in concert with process owners.

The remainder of this section uses the same working example as Sects. 1 and 2, namely the process modelled in Fig. 1. However, to keep the discussion simple, we hereafter assume that the activity names in the real event log coincide with the transition names a, ..., f. Also, since we discuss techniques for process improvement that only consider the control flow, traces are considered as sequences of activities, which coincide with transition names for the reason above.

#### **3.1 Model Repair to Reflect Reality**

The problem of repairing a process model M to reflect the reality recorded in a log L can be formulated as finding a process model M′ that is able to replay each trace in L and is as close as possible to M (i.e. with the minimum number of changes). Note that, if M can replay L, then M′ = M. This section focuses on the case in which M and M′ are accepting Petri nets, and the goal is to repair models wrt. the control-flow perspective, thereby ignoring the other perspectives. This formulation suggests that model repair primarily aims at perfect fitness, generating a set of models with optimal fitness. Within this set, the final choice is any model that best balances simplicity, precision and generalization (cf. the conformance-checking problem discussed in [5]).

The assumption here is that the repaired model must be able to replay every trace in event log L. However, event logs may record executions that are outliers or refer to process instances that were still running when the event log was extracted. Those traces should not be allowed by the repaired model M′. Hereafter, we assume that every trace that should not be replayed by M′ has already been filtered out from L before applying the model-repair algorithm on L.

This section reports on the repair technique discussed in [13], whose basic intuition can be given via the following example. Let us consider again the model in Fig. 1 and the event log L = [⟨a, g, w, b, c, d⟩, ⟨a, w, g, b, c, e⟩], where g and w are the shortcut names of two new activities, e.g. *fix application* and *add witnesses*, respectively. These two activities are not part of the model, and thus cannot be replayed on it. It follows that the model needs to be extended with transitions labelled g and w to make it compliant with L. The model-repair algorithm needs to determine where the two transitions should be inserted, namely which places are in the presets and postsets of these transitions. The technique discussed in [13] addresses this question by aligning the original model M with each of the traces in L. Optimal alignments for the traces in L wrt. the model in Fig. 1 are:

$$\gamma\_1 = \begin{array}{|c|c|c|c|c|c|c|} \hline a & g & w & b & c & \gg & d \\ \hline a & \gg & \gg & b & c & \tau\_1 & d \\ \hline [c1, c2] & & & [c3, c2] & [c3, c4] & [c5] & [end] \\ \hline \end{array}$$

$$\gamma\_2 = \begin{array}{|c|c|c|c|c|c|c|c|} \hline a & w & g & b & c & \gg & \gg & e \\ \hline a & \gg & \gg & b & c & \tau\_1 & \tau\_2 & e \\ \hline [c1, c2] & & & [c3, c2] & [c3, c4] & [c5] & [c6] & [end] \\ \hline \end{array}$$

Here, the third row of each alignment indicates the marking of the Petri net after each synchronous or model move. As usual, the model moves for τ<sub>1</sub> and τ<sub>2</sub> are not considered deviations, and hence do not need to be taken into account when repairing the model. In both alignments, the actual deviations consist of a sequence of two log moves for activities g and w, namely related to log sub-traces ⟨g, w⟩ and ⟨w, g⟩. These sequences of two log moves (and the corresponding sub-traces) both occurred when the Petri-net model was at marking [c1, c2]. The model needs to be repaired so that the log-move sequences are replaced in the respective alignments by sequences of synchronous moves.
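The grouping of log-move sequences by the marking at which they occur can be sketched in Python. The encoding is an assumption made for illustration: each move is a triple `(log label, model label, marking after the move)`, `None` stands for ≫, and the name of the initial place (`"start"`) is hypothetical.

```python
def log_move_subtraces(alignments, initial_marking):
    """Group maximal runs of log moves by the marking at which they occur."""
    sublogs = {}
    for alignment in alignments:
        marking = tuple(initial_marking)
        run = []                                  # current run of log moves
        for log_label, model_label, new_marking in alignment:
            if model_label is None:               # log move: marking unchanged
                run.append(log_label)
                continue
            if run:                               # the run ends here
                sublogs.setdefault(marking, []).append(tuple(run))
                run = []
            marking = tuple(new_marking)          # synchronous or model move
        if run:
            sublogs.setdefault(marking, []).append(tuple(run))
    return sublogs


def example_alignment(first, second, tail):
    """Alignment of the example, with two log moves after activity a."""
    return [("a", "a", ("c1", "c2")), (first, None, None), (second, None, None),
            ("b", "b", ("c3", "c2")), ("c", "c", ("c3", "c4"))] + tail


gamma_1 = example_alignment("g", "w", [(None, "tau1", ("c5",)), ("d", "d", ("end",))])
gamma_2 = example_alignment("w", "g", [(None, "tau1", ("c5",)), (None, "tau2", ("c6",)),
                                       ("e", "e", ("end",))])
print(log_move_subtraces([gamma_1, gamma_2], ("start",)))
# {('c1', 'c2'): [('g', 'w'), ('w', 'g')]}
```

Each entry of the result is a small event log attached to a marking, which is the input that a discovery technique needs in order to mine the sub-model to be added.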

**Fig. 7.** The model in Fig. 1 repaired by adding the parts in red and green to allow for executions ⟨a, g, w, b, c, d⟩ and ⟨a, w, g, b, c, e⟩.

The two sub-traces ⟨g, w⟩ and ⟨w, g⟩ can in fact be regarded as an event log, which can be given as input to a process-discovery technique to mine a model. If we employed the Inductive Miner, the model would be similar to the Petri net marked with a red border in Fig. 7: transitions *w* and *g* are modelled in parallel. Markings [p0] and [p6] are respectively the initial and final marking. Since ⟨g, w⟩ and ⟨w, g⟩ were observed at marking [c1, c2], marking [p0] needs to be reachable from [c1, c2] without firing any visible transition: this is modelled via the invisible transition τ<sub>5</sub>. Furthermore, when reaching marking [p6], the execution should be able to return to marking [c1, c2], motivating the introduction of invisible transition τ<sub>6</sub>.

The example above helps introduce the algorithm to repair an accepting Petri net AN with respect to a log L:


The algorithm above only considers the log moves, which are linked to sequences of activities that need to be allowed by the repaired model. Of course, the alignment may also point out sequences of model moves, namely sequences of activities that were expected in an observed process instance but not observed. The repaired model should make these expected but unobserved sequences optional. As an example, let us consider the model in Fig. 7, which has already been repaired wrt. the sequences of log moves in the alignments. Let us suppose that the event log contains traces related to applications that are desk-rejected because of their clear incorrectness: these correspond to traces consisting of two events, ⟨a, d⟩. The corresponding alignment would then be as follows:

$$\gamma\_1 = \begin{array}{|c|c|c|c|c|} \hline a & \gg & \gg & \gg & d \\ \hline a & b & c & \tau\_1 & d \\ \hline [c1, c2] & [c3, c2] & [c3, c4] & [c5] & [end] \\ \hline \end{array}$$

It contains a sequence of three model moves. The repaired model should be such that those three model moves are no longer necessary. This can easily be achieved by inserting an invisible transition that consumes one token in c1 and one in c2, i.e. the places containing tokens before the first model move (i.e., before b), and produces a token in c5, the place containing one token after the last model move (i.e. after τ<sub>1</sub>).
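The derivation of such skip transitions can be sketched as follows. The move encoding `(log label, model label, marking after the move)` with `None` for ≫ is an assumption of this sketch, as is the choice to disregard runs that consist of invisible transitions only, since those are not deviations.

```python
def skip_transitions(alignment, invisible=frozenset(), initial_marking=None):
    """Return (preset, postset) pairs, one per maximal run of model moves."""
    skips = []
    current = initial_marking   # marking reached so far
    before = None               # marking at the start of the current run
    run = []                    # transitions of the current run of model moves
    for log_label, model_label, marking in alignment:
        if log_label is None:                   # model move
            if not run:
                before = current
            run.append(model_label)
            current = marking
            continue
        if run:                                 # a run of model moves just ended
            if before is not None and any(t not in invisible for t in run):
                skips.append((before, current))
            run = []
        if model_label is not None:             # synchronous move
            current = marking
    if run and before is not None and any(t not in invisible for t in run):
        skips.append((before, current))
    return skips


# Alignment of the desk-rejected trace <a, d> from the text.
gamma = [("a", "a", ("c1", "c2")), (None, "b", ("c3", "c2")),
         (None, "c", ("c3", "c4")), (None, "tau1", ("c5",)),
         ("d", "d", ("end",))]
print(skip_transitions(gamma, invisible={"tau1", "tau2"}))
# [(('c1', 'c2'), ('c5',))]
```

The single pair returned for the example corresponds to an invisible transition with preset {c1, c2} and postset {c5}, as described above.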

#### **Advanced Repair for Higher Precision and Simplicity**

The repair algorithm discussed above focuses largely on fitness, thereby overlooking the other dimensions. In fact, the procedure above can have a negative influence on precision and simplicity, because it may allow additional behavior and increase the size of the model.

**Higher model precision** can be obtained by removing transitions that seldom appear. In a nutshell, the event-log traces are aligned with the model. For each transition t in the model, we count the number of occurrences of synchronous or model moves for t in all the alignments. If this number is smaller than a user-defined threshold, t is removed, along with every arc that enters or leaves t. The procedure can leave some places without incoming or outgoing arcs: these places are removed as well.
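A minimal sketch of this pruning step is given below. It assumes that the net is given as sets of places, transitions and arcs, that transitions are identified by the names used in the model component of the alignments, and that a place is removed only when it is left with no arc at all; the net and the alignments in the example are hypothetical.

```python
from collections import Counter


def prune_rare_transitions(places, transitions, arcs, alignments, threshold):
    """Drop transitions with fewer than `threshold` synchronous or model moves."""
    counts = Counter(model for alignment in alignments
                     for _log, model in alignment if model is not None)
    kept_transitions = {t for t in transitions if counts[t] >= threshold}
    kept_nodes = set(places) | kept_transitions
    # Keep only the arcs whose endpoints both survive.
    kept_arcs = {(src, dst) for src, dst in arcs
                 if src in kept_nodes and dst in kept_nodes}
    connected = {node for arc in kept_arcs for node in arc}
    kept_places = {p for p in places if p in connected}
    return kept_places, kept_transitions, kept_arcs


# Hypothetical net: after t1, either t2 (frequent) or x (rare) is executed.
places = {"p1", "p2", "p3", "p4"}
transitions = {"t1", "t2", "x"}
arcs = {("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p3"),
        ("p2", "x"), ("x", "p4")}
alignments = 3 * [[("t1", "t1"), ("t2", "t2")]] + [[("t1", "t1"), ("x", "x")]]
print(prune_rare_transitions(places, transitions, arcs, alignments, threshold=2))
```

In the example, transition x occurs only once and is removed together with its arcs, which in turn leaves place p4 disconnected.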

**Model simplification** can be achieved as an a-posteriori step, e.g., using the technique proposed by Fahland et al. [12], which aims to simplify the model while preserving the same behavior and balancing generalization and precision. However, simplification can partly be achieved during the repair, e.g. in the case of structured loops [13]. Let us consider the model in Fig. 1 and an event log consisting of two traces σ<sub>1</sub> = ⟨a, b, c, r, b, c, d⟩ and σ<sub>2</sub> = ⟨a, b, c, r, c, b, e⟩, where r is the shortcut for a new activity *Ask for Additional Documents* to, e.g., enable a more thorough assessment. The alignments of the two traces are as follows:

**Fig. 8.** Repair of the model in Fig. 1 to allow for ⟨a, b, c, r, b, c, d⟩ and ⟨a, b, c, r, c, b, e⟩, using the basic model-repair technique.

$$\gamma\_1 = \begin{array}{|c|c|c|c|c|c|c|c|} \hline a & b & c & r & c & b & \gg & d \\ \hline a & b & c & \gg & \gg & \gg & \tau\_1 & d \\ \hline [c1, c2] & [c3, c2] & [c3, c4] & & & & [c5] & [end] \\ \hline \end{array}$$

$$\gamma\_2 = \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline a & b & c & r & c & b & \gg & \gg & e \\ \hline a & b & c & \gg & \gg & \gg & \tau\_1 & \tau\_2 & e \\ \hline [c1, c2] & [c3, c2] & [c3, c4] & & & & [c5] & [c6] & [end] \\ \hline \end{array}$$

Using the repair technique discussed so far, we would obtain the model in Fig. 8, where the newly included part is shown in green. The model has multiple transitions with the same label (see *b* and *c*), which would be unnecessary if the technique could discover that the green part aims to model a structured loop that repeats *b* and *c*.

The basic repair algorithm can be extended to implement such structured loops as in the example above. We give an intuition on how the algorithm is extended via the above example: let us take σ<sub>1</sub> = ⟨a, b, c, r, b, c, d⟩, with the alignment γ<sub>1</sub> shown above. The marking before the first log move is [c3, c4], and the sequence of events that are associated with the maximal sequence of log moves is σ<sub>γ1</sub> = ⟨r, c, b⟩.

We search in the model to be repaired, namely the model in Fig. 1, for the smallest connected subnet that *(i)* ends with places c3 and c4, namely the places with a token at the marking before the first log move, and *(ii)* contains each transition t in σ<sub>γ1</sub> = ⟨r, c, b⟩, excluding r, which is not in the model to be repaired. This subnet corresponds to the gray area in Fig. 9. The trace fragment σ<sub>γ1</sub> is then projected on this subnet: only the events related to transitions of the subnet are retained, yielding a subtrace σ′<sub>1</sub> = ⟨c, b⟩. We create an accepting Petri net AN from the fragment, using the marking [c3, c4] before the first log move as final marking, and the marking with one token in each place with no incoming arcs as initial marking. Since σ′<sub>1</sub> is replayable on AN, the transition τ<sub>3</sub> can be introduced, which constructs a structured loop where transitions b and c can be repeated. Note that the algorithm above is applied on single log sequences of individual traces, and transition r in Fig. 9 has not been introduced yet. The algorithm needs to be iteratively applied to each sequence of events that comes from the projection of the log component of each alignment.

**Fig. 9.** Repair of the model in Fig. 1 to allow for ⟨a, b, c, r, b, c, d⟩ and ⟨a, b, c, r, c, b, e⟩, using the advanced model-repair algorithm that increases simplicity (cf. result of the basic algorithm in Fig. 8).

In our example, after repairing the model by adding transition τ<sub>3</sub>, the algorithm is applied on σ<sub>2</sub> = ⟨a, b, c, r, c, b, e⟩, but yields no changes. Indeed, after the first repair, the alignment of the model in Fig. 9 with σ<sub>2</sub> is now as follows:


Recall that transition r is not yet part of the model. It is added as a final step, which consists in reapplying the basic repair algorithm on the same traces σ<sub>1</sub> = ⟨a, b, c, r, b, c, d⟩ and σ<sub>2</sub> = ⟨a, b, c, r, c, b, e⟩ and on the model in which invisible transition τ<sub>3</sub> is included.

This section has focused on repairing the model to reflect the reality observed in the event log. However, model repair can also aim to ensure that the model is sound. In the domain of process models, soundness implies several properties, of which the most important is the absence of deadlocks or livelocks that prevent executions from being completed. Interesting approaches that focus on model repair for soundness are provided by Gambini et al. [15] and by Lohmann et al. [16,17], which are not discussed here due to space limitations.

**Fig. 10.** The basic idea of KPI-driven Model Improvement: the observed behavior (i.e., in the event log) that is satisfactory and compliant with rules should be incorporated in the model, while the observed behavior that is not satisfactory or not compliant should not be incorporated in, or should be disallowed by, the model.

#### **3.2 KPI-Driven Model Improvement**

If the model is used to prescribe how the corresponding process should be carried out, one does not want to incorporate the whole behavior observed in the event log, but only the portion that has been shown to usually lead to satisfactory values of a certain Key Performance Indicator (KPI) of interest. Furthermore, behavior can only be incorporated if it does not violate protocols, regulations, and norms. The definition of the KPI of interest varies depending on the domain, needs to be customized, and may be numerical or defined over an enumeration of values, including boolean. Examples are execution costs, customer satisfaction, execution time, or whether or not the corresponding loan was eventually approved. Similarly, one wants to disallow the behavior allowed by the model that typically yields unsatisfactory KPI values. Figure 10 graphically illustrates the idea. The rectangle shows the behavior allowed by the model, while the pie shows the behavior observed in the event log. The green pie slice is the portion of observed behavior that is associated with unsatisfactory KPI values or with violations of norms or protocols: the part in light green that intersects the modelled behavior should be disallowed by the model after repair. The red and the orange pie portions show the behavior with satisfactory KPI values: the part in dark red is not allowed by the model but should be incorporated, because it is associated with executions characterized by satisfactory KPI values.

The remainder of this section focuses on a methodology to extend the model to allow the portion in dark red, which has been introduced by Dees et al. [9]. The starting point is an existing process model, here represented as an accepting Petri net, an event log, and the definition of a Key Performance Indicator (KPI). A KPI is a pair consisting of *(i)* a function that, given a trace, returns the KPI value, and *(ii)* the set of satisfactory KPI values:

**Fig. 11.** The main steps of the methodology for KPI-driven Model Improvement (adapted from [9])

**Definition 4 (Key Performance Indicator).** *Let L be a simplified multi-perspective event log. Let* V *be the set of possible values for a key performance indicator. A* key performance indicator *is a pair* (κ, K) *consisting of a function* κ : L → V *that assigns a KPI value* κ(σ) *to each trace* σ *and of a set* K ⊂ V *that contains the KPI values that are satisfactory from a business viewpoint.*

Typically, the function κ in a KPI definition depends on the attributes present in the event log. However, this section remains general on how the KPI values of process executions (i.e., traces) are computed.

#### **Partially Model-Aligned Traces**

The technique described hereafter also relies on the concept of model-aligned event logs that was introduced in Sect. 1.1. However, we extend the concept to allow for traces that are partially model-aligned. It is indeed possible to ignore individual moves: ignoring a model move means that the corresponding event is not added to the trace, and ignoring a log move means that the corresponding event is not removed. To clarify, let us consider a trace ⟨a, b, b, d⟩ and the model in Fig. 1. The alignment is as follows:

$$\gamma = \begin{array}{|c|c|c|c|c|c|} \hline a & b & b & \gg & \gg & d \\ \hline a & b & \gg & c & \tau\_1 & d \\ \hline \end{array}$$

A fully model-aligned trace is ⟨a, b, c, τ<sub>1</sub>, d⟩. Ignoring the log move for b would generate ⟨a, b, b, c, τ<sub>1</sub>, d⟩, namely an alignment of the new trace would still contain the log move for b; ignoring the model move for c would generate ⟨a, b, τ<sub>1</sub>, d⟩, i.e. the model move for c would still be present. It is possible to ignore multiple moves at the same time: in our example, repairing neither the model move for c nor the log move for b would produce ⟨a, b, b, τ<sub>1</sub>, d⟩. Note that, hereafter, we always ignore all model moves for invisible transitions when model-aligning a trace, and we consider a trace to be fully model-aligned even when model moves for invisible transitions are ignored.
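The four variants of this example can be reproduced with a short Python sketch. The encoding of an alignment as a list of `(log label, model label)` pairs with `None` for ≫, and the function and parameter names, are assumptions made for illustration.

```python
def partially_model_align(alignment, ignore_log=frozenset(), ignore_model=frozenset()):
    """Build the (partially) model-aligned trace from an alignment."""
    trace = []
    for log_label, model_label in alignment:
        if log_label is not None and model_label is not None:
            trace.append(log_label)            # synchronous move: event kept
        elif model_label is None:
            if log_label in ignore_log:
                trace.append(log_label)        # ignored log move: event not removed
        elif model_label not in ignore_model:
            trace.append(model_label)          # repaired model move: event added
    return trace


# Alignment of trace <a, b, b, d> with the model in Fig. 1.
gamma = [("a", "a"), ("b", "b"), ("b", None), (None, "c"), (None, "tau1"), ("d", "d")]

print(partially_model_align(gamma))                       # ['a', 'b', 'c', 'tau1', 'd']
print(partially_model_align(gamma, ignore_log={"b"}))     # ['a', 'b', 'b', 'c', 'tau1', 'd']
print(partially_model_align(gamma, ignore_model={"c"}))   # ['a', 'b', 'tau1', 'd']
print(partially_model_align(gamma, {"b"}, {"c"}))         # ['a', 'b', 'b', 'tau1', 'd']
```

Passing the invisible transitions in `ignore_model` (e.g. `{"tau1"}`) yields traces without τ events, in line with the convention adopted for the rest of the section.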

#### **The Methodology in a Nutshell**

The methodology takes an event log and the original process model as input and returns an improved process model. It is composed of three main steps (cf. Fig. 11):

*Step 1. Deviation Analysis.* Deviations are detected and a set of rules is discovered that correlate deviations to a selected KPI. Rules are mutually exclusive, which makes it possible to split the event-log traces into clusters, such that a trace belongs to at most one cluster (in fact, outlier traces are filtered out).

*Step 2. Align and Merge Log Clusters.* Traces in the different sublogs are partially model-aligned to only keep the deviations in the original trace that have a positive impact on the value of the KPI. All sublogs are then merged to obtain a single partially model-aligned event log.

*Step 3. Repair Model.* Finally, the partially model-aligned event log is used as input to repair the model: the process model is modified in such a way that it can replay all the behavior of the partially model-aligned event log. In this log, all deviations corresponding to behavior that should not be incorporated in the model have been repaired. In this way, the model-repair technique only modifies the model to make the desired deviating behavior possible.

The remainder elaborates on the sub-steps within Steps 1 and 2, using the same case study as in Fig. 1. Step 3 does not require further details since it consists in applying any technique for model repair to reflect reality, such as the technique by Fahland et al. [13] discussed in Sect. 3.1.

#### **Step 1. Deviation Analysis**

The deviation-analysis step takes an event log L, an accepting Petri net AN, and a KPI definition (κ, K). The result is a decision tree that allows splitting L into as many sublogs as there are tree leaves. Each sublog is associated with a different KPI value. Note that certain traces are considered outliers and filtered out, namely the union of the sublogs does not necessarily coincide with L. To achieve this, the following sub-steps can be identified:

*Step 1.1: Conformance Checking.* The first step is checking conformance of the event log and the process model. This is done to determine all deviations that are observed between the log and the model. The result of conformance checking is an alignment for each log trace.

*Example: let us consider the model in Fig. 1 and three non-compliant traces:* σ<sub>1</sub> = ⟨a, b, c, w1, f⟩*,* σ<sub>2</sub> = ⟨a, b, c, w2, f⟩ *and* σ<sub>3</sub> = ⟨a, b, c, w3, f⟩*, where* w1*,* w2 *and* w3 *are shortcuts for the activities to ask for one, two or three witnesses, respectively. The alignments are of the following form, where* w<sub>X</sub> *stands for* w1*,* w2 *and* w3*, respectively:*

$$\gamma\_2 = \begin{array}{|c|c|c|c|c|c|c|} \hline a & b & c & w\_X & \gg & \gg & f \\ \hline a & b & c & \gg & \tau\_1 & \tau\_2 & f \\ \hline \end{array}$$

The KPI is here boolean: **true** and **false** respectively indicate whether the approval process has finally led to a loan that is eventually repaid in full or only in part. The latter case is undesired because it requires the involvement of a credit-collection agency. For the three executions in the example, σ<sub>1</sub> refers to a loan paid back in part, whereas σ<sub>2</sub> and σ<sub>3</sub> refer to loans paid back in full.

*Step 1.2: Moves' Correlation to KPI Values.* The number of model moves and log moves of activities is correlated with the chosen KPI. To ensure that the improved model complies with rules and regulations, the concepts of *disallowed activities* and *mandatory activities* have been introduced. The set G<sub>D</sub> of disallowed activities includes those that should never become part of the process model, whereas the set G<sub>M</sub> of mandatory activities includes those that cannot become optional or be removed from the model. In this step, we build a set of so-called *observation instances*, which are used to train a classification tree. Let T and l be the set of transitions and the labelling function of the labelled Petri net N of AN. Let A be the activities of N, i.e. the Petri-net labels: A = ∪<sub>t∈T</sub> {l(t)}. To keep it simple, we assume without loss of generality that the log activities coincide with A, too. We build one observation instance for each trace σ ∈ L with the following features:


From the set of observation instances, we learn a decision tree, using the KPI value as target feature, and the number of log and model moves as independent features. If the domain of the KPI values is finite (e.g., satisfactory vs unsatisfactory), a classification tree is used; otherwise, we employ a regression tree.
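How an observation instance might be assembled from an alignment is sketched below. The feature encoding is an assumption of this sketch: one counter per activity for log moves and one for model moves, where log moves of disallowed activities and model moves of mandatory activities or invisible transitions are left out. The resulting feature dictionaries can then be passed to any decision-tree learner.

```python
from collections import Counter


def observation_instance(alignment, kpi_value, disallowed=frozenset(),
                         mandatory=frozenset(), invisible=frozenset()):
    """Return (features, target) for one aligned trace.

    `alignment` is a list of (log label, model label) pairs; None encodes '>>'.
    """
    features = Counter()
    for log_label, model_label in alignment:
        if model_label is None:                      # log move
            if log_label not in disallowed:
                features[("log_move", log_label)] += 1
        elif log_label is None:                      # model move
            if model_label not in invisible and model_label not in mandatory:
                features[("model_move", model_label)] += 1
    return dict(features), kpi_value


def alignment_for(witness_activity):
    """Alignment of <a, b, c, wX, f> with the model in Fig. 1."""
    return [("a", "a"), ("b", "b"), ("c", "c"), (witness_activity, None),
            (None, "tau1"), (None, "tau2"), ("f", "f")]


settings = dict(disallowed={"w3"}, invisible={"tau1", "tau2"})
print(observation_instance(alignment_for("w2"), True, **settings))
# ({('log_move', 'w2'): 1}, True)
print(observation_instance(alignment_for("w3"), True, **settings))
# ({}, True)
```

The second call shows the effect of a disallowed activity: the log move for w3 leaves no feature behind, so the tree cannot split on it.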

*Example (cont.): log moves for* w2 *and* w3 *are correlated with full repay, whereas log moves for* w1 *are correlated with part repay. However, let us assume that* w3 *is within the set of disallowed activities (e.g., three witnesses require too much additional work). Thus, log moves for* w3 *are not allowed as an independent feature. The result could be a decision tree such as the one in Fig. 12: when there are log moves for* w2*, the KPI is fulfilled: the loan is eventually repaid in full.*

*Step 1.3: Splitting of the Event Log into Groups and Outlier Filtering.* The classification tree can be seen as a clustering of the traces of an event log. Each leaf is a different cluster and the path from the root to the leaf provides a rule that characterizes the traces that belong to a certain group. For reliability, the wrongly-classified traces are removed from the groups, namely the traces classified to have KPI values that differ from the actual values. The wrongly-classified traces might potentially affect the repair-model phase, and allow behavior in the model that would not be linked to actual, satisfactory KPI values.

*Example (cont.): The trace cluster associated with leaf* Part Repay *(left leaf) is* L<sub>1</sub> = [⟨a, b, c, w1, f⟩]*, whereas the cluster for leaf* Full Repay *is* L<sub>2</sub> = [⟨a, b, c, w2, f⟩]*. Note that trace* σ<sub>3</sub> = ⟨a, b, c, w3, f⟩ *would also fall under leaf* Part Repay*, but it would be wrongly classified and is consequently filtered out. In fact,* σ<sub>3</sub> *is associated with a loan that is repaid in full, but the decision tree in Fig. 12 would classify it as partly repaid because it contains no log moves for* w2*.*
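The splitting and filtering of Step 1.3 can be sketched as follows. The classifier is written by hand to mimic the decision tree of the example (an occurrence of w2 is always a log move because w2 is not in the model); the function names and the encoding of KPI values as booleans are assumptions.

```python
def split_and_filter(traces, classify, kpi):
    """Group traces by decision-tree leaf and drop the wrongly classified ones.

    classify(trace) returns (leaf name, predicted KPI value);
    kpi(trace) returns the actual KPI value of the trace.
    """
    clusters = {}
    for trace in traces:
        leaf, predicted = classify(trace)
        if predicted == kpi(trace):          # keep correctly classified traces only
            clusters.setdefault(leaf, []).append(trace)
    return clusters


sigma_1 = ("a", "b", "c", "w1", "f")
sigma_2 = ("a", "b", "c", "w2", "f")
sigma_3 = ("a", "b", "c", "w3", "f")
repaid_in_full = {sigma_1: False, sigma_2: True, sigma_3: True}


def classify(trace):
    # Hand-written stand-in for the tree: log moves for w2 > 0 means full repay.
    return ("Full Repay", True) if "w2" in trace else ("Part Repay", False)


print(split_and_filter([sigma_1, sigma_2, sigma_3], classify, repaid_in_full.get))
# {'Part Repay': [('a', 'b', 'c', 'w1', 'f')], 'Full Repay': [('a', 'b', 'c', 'w2', 'f')]}
```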

#### **Step 2. Model-Align and Merge Log Clusters**

Step 1 concluded with splitting L into n sublogs and filtering out those traces that are wrongly classified. Let {L<sub>1</sub>, ..., L<sub>n</sub>} be the sublogs obtained via splitting. Each L<sub>i</sub> refers to a different decision-tree leaf v<sub>i</sub>, associated with a KPI value C(v<sub>i</sub>).

*Step 2.1: Conformance Checking of the Sublogs.* Conformance Checking is done with the original process model and each log cluster. Note that Step 2.1 is a conceptual step: in practice, one does not need to recompute the alignments for the cluster logs as one can simply reuse the alignments obtained as result of Step 1.1.

*Step 2.2: Model-Align of the Sublogs.* This step is repeated for each cluster L<sub>i</sub>, associated with a leaf v<sub>i</sub>. If L<sub>i</sub> is associated with an unsatisfactory KPI value (i.e. C(v<sub>i</sub>) ∉ K), every deviation is repaired. Note that, even if the traces are fully model-aligned, they are kept in the log that is used for repairing the model at Step 3. Those traces provide support so that the corresponding behavior is not removed from the model: see the discussion on achieving higher model precision in subsection *Advanced Repair for Higher Precision and Simplicity* within Sect. 3.1.

If L<sub>i</sub> is associated with a satisfactory KPI value, every deviation is repaired, except those that appear in the conditions along the path from the decision-tree root to the leaf v<sub>i</sub>.

*Example (cont.): Trace* σ<sub>1</sub> *is model-aligned in full because it is related to an unsatisfactory KPI value, yielding a model-aligned trace* σ<sup>r</sup><sub>1</sub> = ⟨a, b, c, f⟩*. Trace* σ<sub>2</sub> *is related to satisfactory KPI values (see leaf Full Repay in the decision tree in Fig. 12), and is associated with a tree path that indicates that the number of log moves for* w2 *is larger than zero. This means that the log move for* w2 *is ignored when model-aligning* σ<sub>2</sub>*: thus, the partially model-aligned trace* σ<sup>r</sup><sub>2</sub> *coincides with the original trace* σ<sub>2</sub>*.*

**Fig. 12.** A decision tree that correlates alignment moves to KPI values.

**Fig. 13.** The model repaired to increase the chances for loans to be repaid in full (the KPI). The change consists in introducing the activity *Ask for two witnesses*, which is shown to be beneficial for a better risk assessment.

*Step 2.3: Merge the Sublogs.* We merge all model-aligned sublogs into a single event log. This is a requirement to apply the next step, namely repairing the process model.

*Example (cont.): This step generates the event log* L = [⟨a, b, c, f⟩, ⟨a, b, c, w2, f⟩]*, which is used for model repair.*
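Steps 2.2 and 2.3 can be put together in a short sketch that reproduces the merged log of the example. It is a simplification under stated assumptions: alignments are lists of `(log label, model label)` pairs with `None` for ≫, each cluster carries a flag for its KPI value and the set of activities whose log moves appear on its tree path, and model moves on tree paths are not handled.

```python
def model_align(alignment, keep_log_moves=frozenset(), invisible=frozenset()):
    """Repair every deviation except log moves of activities in keep_log_moves."""
    trace = []
    for log_label, model_label in alignment:
        if log_label is not None and model_label is not None:
            trace.append(log_label)              # synchronous move
        elif model_label is None:
            if log_label in keep_log_moves:
                trace.append(log_label)          # deviation kept on purpose
        elif model_label not in invisible:
            trace.append(model_label)            # missing visible event added
    return tuple(trace)


def align_and_merge(clusters, invisible=frozenset()):
    """clusters: list of (alignments, satisfactory KPI?, log moves on tree path)."""
    merged = []
    for alignments, satisfactory, path_log_moves in clusters:
        keep = frozenset(path_log_moves) if satisfactory else frozenset()
        merged.extend(model_align(al, keep, invisible) for al in alignments)
    return merged


def alignment_for(witness_activity):
    """Alignment of <a, b, c, wX, f> with the model in Fig. 1."""
    return [("a", "a"), ("b", "b"), ("c", "c"), (witness_activity, None),
            (None, "tau1"), (None, "tau2"), ("f", "f")]


clusters = [([alignment_for("w1")], False, set()),     # leaf Part Repay
            ([alignment_for("w2")], True, {"w2"})]     # leaf Full Repay
print(align_and_merge(clusters, invisible={"tau1", "tau2"}))
# [('a', 'b', 'c', 'f'), ('a', 'b', 'c', 'w2', 'f')]
```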

When the log is used with a model-repair technique (e.g., that in Sect. 3.1), the model in Fig. 2 is repaired as shown in Fig. 13: the transition w<sup>2</sup> is introduced.

**Acknowledgement.** Some of the ideas and techniques reported in this chapter are the result of the author's collaborations with various researchers. While it is not possible to name them all, the author would like to give a special mention to Wil van der Aalst, Marcus Dees, Marlon Dumas, Felix Mannhardt, and Hajo Reijers (in strict alphabetical order).

## **References**



# **Process Mining over Multiple Behavioral Dimensions with Event Knowledge Graphs**

Dirk Fahland

Eindhoven University of Technology, Eindhoven, The Netherlands d.fahland@tue.nl

**Abstract.** Classical process mining relies on the notion of a unique case identifier, which is used to partition event data into independent sequences of events. In this chapter, we study the shortcomings of this approach for event data over *multiple entities*. We introduce *event knowledge graphs* as a data structure that naturally models behavior over multiple entities as a network of events. We explore how to construct, query, and aggregate event knowledge graphs to get insights into complex behaviors. We will ultimately show that event knowledge graphs are a very versatile tool that opens the door to process mining analyses in multiple behavioral dimensions at once.

**Keywords:** Event knowledge graph · Process mining

## **1 Introduction—A Second Look at Processes**

Process mining aims at analyzing processes from recorded event data. Thereby, the actual processes are rather complex and emerge from the interplay of multiple inter-related *entities*: the various *objects* handled by the process as well as the *organizational entities* that execute the process. We best explain this kind of interplay by an example.

Consider a retailer who took two *Orders* for multiple *Items* from the same customer: the customer first places Order O1 for 2 items X and 1 item Y, and shortly afterwards Order O2 for 1 item X and 1 item Y. The retailer promises to ship every order within 6 days.

The retailer handles both orders as explained next and illustrated in Fig. 1.


© The Author(s) 2022

W. M. P. van der Aalst and J. Carmona (Eds.): Process Mining Handbook, LNBIP 448, pp. 274–319, 2022. https://doi.org/10.1007/978-3-031-08848-3\_9

**Fig. 1.** Illustration of a multi-entity process: a retailer handles two orders for multiple items by placing and receiving supplier orders for specific items.


This process relies on 7 different types of entities. *Actors* (human workers) and *machines* (an automated warehouse) together handle 5 types of objects: *Orders*, *Supplier Orders*, *Items*, *Invoices*, *Payments*.

**Challenges Due to Event Data over Multiple Entities.** A process mining analysis of the above process execution relies on recorded event data. Each event has to record in its attributes at least (1) which *action* (or activity) has been executed (2) at which *time*. To construct an event log, classical process mining also expects each event to record (3) in which process execution, typically called *case*, the event occurred (see [13], Sect. 2). Table 1 shows the events related to the above example.

**Table 1.** Event table of events underlying the event log of Table 2.


In contrast to classical event logs, Table 1 contains no typical case identifier attribute by which each event is related to one specific process execution. Instead, we see multiple sparsely filled attributes identifying *multiple entities* of various types: *Order* (O1, O2), *Supplier Order* (A, B), *Item* (X1, X2, X3, Y 1, Y 2), *Invoice* (I1, I2), and *Payment* (P1).

This makes it difficult to construct an *event log* which is the basis for process mining analysis. Recall that to obtain a classical event log we select one *case identifier* attribute. Then all events referring to the same case id and ordered by time form the *trace* of this case, that is, one process execution. In this way, classical event logs partition the recorded behavior into multiple process executions. Process mining techniques then identify frequent patterns shared by all process executions, or identify outliers and deviations of specific process executions.

However, what exactly *is* a process execution in our example? It is not just all events related to one particular entity. For instance, if we chose *Order* as case identifier, we would obtain traces $\langle e_1, e_{18}, e_{27}, e_{29} \rangle$ for O1 and $\langle e_2, e_5, e_7, e_{33}, e_{34} \rangle$ for O2. These traces do reveal that both orders were not shipped within 6 days as promised by the retailer. However, they do not allow us to understand the cause for this as they clearly do not describe the entire behavior shown in Fig. 1. We could try to group all events into traces using *multiple related case identifiers*. However, we will see in Sect. 3 that doing so introduces false behavioral information called *convergence* and *divergence* [41,45] in the resulting event log, leading to false analysis results (see [1], Sect. 3).
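To make the classical construction concrete, the following minimal sketch groups events by a single case identifier and orders each group by time, using *Order* as the case identifier as discussed above. The event identifiers follow the two traces quoted in the text; the integer timestamps are placeholders that simply follow the event numbering (an assumption, not the times recorded in Table 1), and the function name `to_event_log` is hypothetical.

```python
from collections import defaultdict

# Excerpt of an event table: the events carrying an Order identifier, plus one
# event (e3) without one. Timestamps are placeholder integers (assumption).
events = [
    {"id": "e1", "time": 1, "Order": "O1"},
    {"id": "e2", "time": 2, "Order": "O2"},
    {"id": "e3", "time": 3},                    # no Order attribute recorded
    {"id": "e5", "time": 5, "Order": "O2"},
    {"id": "e7", "time": 7, "Order": "O2"},
    {"id": "e18", "time": 18, "Order": "O1"},
    {"id": "e27", "time": 27, "Order": "O1"},
    {"id": "e29", "time": 29, "Order": "O1"},
    {"id": "e33", "time": 33, "Order": "O2"},
    {"id": "e34", "time": 34, "Order": "O2"},
]

def to_event_log(events, case_attr):
    """Group events by the chosen case identifier and order each group by time."""
    groups = defaultdict(list)
    for e in events:
        case = e.get(case_attr)
        if case is not None:                    # events without a case id are dropped
            groups[case].append(e)
    return {case: [e["id"] for e in sorted(es, key=lambda e: e["time"])]
            for case, es in groups.items()}

log = to_event_log(events, "Order")
print(log)
```

Note how `e3` silently disappears from the log: any event that does not carry the chosen case identifier cannot be assigned to a trace, which is one reason a single case identifier does not describe the entire behavior.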

False behavioral information arises when flattening Table 1 into sequential traces because we *cannot* partition the entities O1, O2, A, B, X, Y, I1, I2, P1 into disjoint sets, each belonging to one process execution that is independent of all others. Rather, the behavior itself is a larger "fabric" of multiple entities that are inter-related and inter-twined over time as shown in Fig. 1. This "fabric" is even more complex as individual *Actors* (R1,...,R5) are specialized in specific activities across multiple different entities, e.g., R2 specializes in receiving, updating, and unpacking *Supplier Orders* and handling *Items*. In the following, we explain how to analyze this very "fabric" of multiple inter-related entities as a whole from a simple event table over multiple entity identifiers such as Table 1.

**A Graph-Based Approach.** Our trick will be to slightly adapt the existing definitions for obtaining an event log from an event table: instead of constructing entire traces related to a single case identifier, we discuss in Sect. 3 a *local directly-follows* relation for each *individual* entity in the data. Each event can be part of multiple such directly-follows relations, depending on how many entities it is correlated to. We then use the model of *labeled property graphs* in Sect. 4 to create an *event knowledge graph* having events as nodes and the local directly-follows relations as edges between events. We obtain a graph similar to what is shown in Fig. 1, but with precise semantics for events and behavioral information.

A path of directly-follows edges over events related to the same entity is similar to a classical trace. However, in an event knowledge graph, such paths meet whenever an event is related to more than one entity, whereas in an event log each trace is disjoint from all others. We explain in Sect. 5 how to interpret and analyze behavioral information in event knowledge graphs. We show how basic *querying* on event knowledge graphs gives insights into complex behavioral properties. We show how *aggregation* on event knowledge graphs allows us to construct multi-entity process models that better describe such processes.

We finally explore the versatility of event knowledge graphs beyond the control-flow perspective in Sect. 6. We show how event knowledge graphs naturally integrate the control-flow perspective and the *actor perspective*. Querying for specific structures in the event knowledge graph reveals complex patterns of *task instances* not visible in either perspective alone. Further, we show how event knowledge graphs allow us to take a *system-perspective* (or queueing perspective) to analyze emergent behavior and performance problems across multiple entities. We conclude in Sect. 7 with an outlook on the various application areas of event knowledge graphs in process mining, and on open research challenges.

All concepts for constructing and analyzing event knowledge graphs presented in this chapter are implemented as Cypher queries on the graph database system Neo4j<sup>1</sup> at https://github.com/multi-dimensional-process-mining/event graph tutorial [28].

## **2 Multi-entity Event Data**

Before we discuss problems and solutions for analyzing event data over multiple entities, we first define what "event data over multiple entities" actually is.

#### **2.1 Events**

We assume all data to be given in a single event table. Data is recorded from a universe of values $\mathit{Val}$; timestamps $\mathit{Val}_{time} \subseteq \mathit{Val}$ are totally ordered by $\leq$.

**Definition 1 (Event Table).** *An* event table $T = (E, \mathit{Attr}, \#)$ *is a set* $E$ *of events and a set* $\mathit{Attr}$ *of* attribute names *with* $\mathit{act}, \mathit{time} \in \mathit{Attr}$*. The partial function* $\# : E \times \mathit{Attr} \nrightarrow \mathit{Val}$ *assigns to an event* $e \in E$ *and an attribute name* $a \in \mathit{Attr}$ *a value* $\#_a(e) = v$*;* $\#_a(e) = \bot$ *if* $a$ *is undefined for* $e$*.*

*Each event* $e \in E$ *records an activity and a timestamp, i.e.,* $\#_{act}(e) \neq \bot$ *and* $\#_{time}(e) \in \mathit{Val}_{time}$*.*

We write $e.a = v$ for $\#_a(e) = v$ as a shorthand. An event table specifically allows multi-valued attributes, e.g., sets of values $\#_a(e) = \{v_1, v_2, v_3\}$ or a list of values $\#_a(e) = \langle v_1, v_2, v_3, v_1 \rangle$.<sup>2</sup> Simplifying notation, we may also write $v \in e.a$ if $e.a = v$ or if $e.a = \langle \ldots, v, \ldots \rangle$.

An event table only defines the e.*act* and e.*time* attributes for each event. The special characteristic of event data over multiple entities is that it does not record a unique case identifier attribute, but identifiers of *multiple entity types*.
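As an illustration of Definition 1, the following sketch encodes an event table as a plain Python mapping. This encoding is an assumption made for illustration, not the chapter's implementation: `None` stands in for the undefined value, the helper names `value`, `contains`, and `is_valid` are hypothetical, and the activity names and timestamps are placeholders rather than the entries of Table 1.

```python
# Event table: event id -> {attribute name -> value}. A missing attribute
# plays the role of the undefined value. Activity names and timestamps are
# placeholders (assumption); the entity identifiers follow the running example.
T = {
    "e5":  {"act": "act_e5",  "time": 5,  "Order": "O2", "Invoice": "I2"},
    "e30": {"act": "act_e30", "time": 30, "Invoice": ["I1", "I2"], "Payment": "P1"},
}

def value(T, e, a):
    """Value of attribute a for event e, or None if a is undefined for e."""
    return T[e].get(a)

def contains(T, e, a, v):
    """True if e.a equals v, or if e.a is a multi-valued attribute containing v."""
    x = value(T, e, a)
    if x is None:
        return False
    if isinstance(x, (list, set, tuple)):
        return v in x
    return x == v

def is_valid(T):
    """Definition 1 requires every event to record an activity and a timestamp."""
    return all(value(T, e, "act") is not None and value(T, e, "time") is not None
               for e in T)

print(is_valid(T), contains(T, "e30", "Invoice", "I2"), value(T, "e5", "Payment"))
```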

**Definition 2 (Event table with entity types).** *An* event table with entity types $T = (E, \mathit{Attr}, \#, \mathit{ENT})$ *additionally designates one or more attributes* $\emptyset \neq \mathit{ENT} \subseteq \mathit{Attr}$ *as names of* entity types*.*

<sup>1</sup> neo4j.com.

<sup>2</sup> We assume the values in an event table to be consistent with some data model that is specified elsewhere. Our subsequent discussion does not rely on it.

A classical event log corresponds to an event table with a single entity type $\mathit{ENT} = \{\mathit{case}\}$. We can consider Table 1 as an event table with entity types $\mathit{ENT} = \{\mathit{Resource}, \mathit{Order}, \mathit{Supplier\ Order}, \mathit{Item}, \mathit{Invoice}, \mathit{Payment}\}$.

Event tables (Definition 1) are also called raw event logs and are – besides relational data – the most common form of input to process mining. The entity types of Definition 2 can be retrieved from an event table through schema recovery techniques [46]. Note that Definition 2 formalizes the object-centric event logs (OCEL) described in Sect. 3.4 of [1]; we here use the more general term "entity" instead of "object" as we will later study behavior over entities which are not tangible objects.

Event tables do not model the ordering of events with respect to a case or an entity which is needed for process mining. Before we study the ordering of events, we explain how events relate to entities.

## **2.2 Entities and Correlated Events**

Each entity type $\mathit{ent} \in \mathit{ENT}$ is a column in the event table $T$. Each value in that column refers to a specific entity.

**Definition 3 (Entities).** *Let* $T = (E, \mathit{Attr}, \#, \mathit{ENT})$ *be an event table with entities. Let* $\mathit{ent} \in \mathit{ENT}$ *be an entity type. The* set of entities *in* $T$ *of type* $\mathit{ent}$ *is* $\mathit{Entities}(\mathit{ent}, T) = \{n \mid \exists e \in E : n \in e.\mathit{ent}\}$*.*

From Table 1 we identify 6 entity types with corresponding entities (see Definition 3): (1) Order: $\{O1, O2\} = \mathit{Entities}(\mathit{Order}, T)$, (2) Supplier Order: $A, B$, (3) Item: $X1, X2, X3$ and $Y1, Y2$, (4) Invoice: $I1, I2$, (5) Payment: $P1$, (6) Resource: $R1, \ldots, R5$.
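Definition 3 can be sketched in a few lines. The excerpt below is an assumption made for illustration: it contains only the five events shown later in Fig. 4 and only those *Order*, *Invoice*, and *Payment* identifiers that the text states explicitly; activity, time, and all other attributes of Table 1 are omitted. The helper names `as_set` and `entities` are hypothetical.

```python
# Excerpt of Table 1 restricted to identifiers mentioned in the text (assumption).
T = {
    "e5":  {"Order": "O2", "Invoice": "I2"},
    "e9":  {"Invoice": "I2"},
    "e18": {"Order": "O1", "Invoice": "I1"},
    "e29": {"Payment": "P1"},
    "e30": {"Invoice": ["I1", "I2"], "Payment": "P1"},   # multi-valued attribute
}

def as_set(x):
    """Normalize an attribute value: undefined -> {}, single value -> {x}."""
    if x is None:
        return set()
    if isinstance(x, (list, set, tuple)):
        return set(x)
    return {x}

def entities(T, ent):
    """Entities(ent, T): all identifiers occurring in column ent of some event."""
    return {n for attrs in T.values() for n in as_set(attrs.get(ent))}

for ent in ("Order", "Invoice", "Payment"):
    print(ent, sorted(entities(T, ent)))
```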

An event $e \in E$ which has a value $n = e.\mathit{ent}$ or $n \in e.\mathit{ent}$ is *correlated* to entity $n$.

**Definition 4 (Correlation).** *Let* $T = (E, \mathit{Attr}, \#, \mathit{ENT})$ *be an event table with entities. Let* $n \in \mathit{Entities}(\mathit{ent}, T)$ *be an entity of type* $\mathit{ent} \in \mathit{ENT}$*.*

*Event* $e$ *is* correlated to *entity* $n$*, written* $(e, n) \in \mathit{corr}_{ent,T}$*, iff* $n = e.\mathit{ent} \lor n \in e.\mathit{ent}$*. We write* $\mathit{corr}(n, \mathit{ent}, T) = \{e \in E \mid (e, n) \in \mathit{corr}_{ent,T}\}$ *for the set of events correlated to entity* $n \in \mathit{Entities}(\mathit{ent}, T)$*.*

For example, for Table 1, event $e_{30}$ is correlated to I1, I2, and P1, i.e., $(e_{30}, I1), (e_{30}, I2) \in \mathit{corr}_{Invoice,T}$ and $(e_{30}, P1) \in \mathit{corr}_{Payment,T}$. The events correlated to I2 are $\mathit{corr}(I2, \mathit{Invoice}, T) = \{e_5, e_9, e_{30}\}$. In case the entity identifiers used by different entity types are disjoint, e.g., there is no Order O3 alongside an Item O3, we can omit entity types and just write $(e_{30}, I1), (e_{30}, I2), (e_{30}, P1) \in \mathit{corr}_T$ and $\mathit{corr}(I2, T) = \{e_5, e_9, e_{30}\}$.

Correlation lifts to a *set* $N$ of entities by union: $\mathit{corr}(N, T) = \bigcup_{n \in N} \mathit{corr}(n, T)$. We will later use this to collect events of (transitively) related entities, which we discuss next.
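The correlation of Definition 4 and its lifting to sets of entities can be sketched as follows. As before, the event table is an illustrative excerpt containing only identifiers named in the text (an assumption), and the function names `corr` and `corr_set` are hypothetical. The set variant omits entity types and therefore assumes that identifiers of different entity types are disjoint.

```python
T = {
    "e5":  {"Order": "O2", "Invoice": "I2"},
    "e9":  {"Invoice": "I2"},
    "e18": {"Order": "O1", "Invoice": "I1"},
    "e29": {"Payment": "P1"},
    "e30": {"Invoice": ["I1", "I2"], "Payment": "P1"},
}
ENT = ["Order", "Invoice", "Payment"]

def as_set(x):
    """Normalize an attribute value: undefined -> {}, single value -> {x}."""
    if x is None:
        return set()
    return set(x) if isinstance(x, (list, set, tuple)) else {x}

def corr(T, n, ent):
    """corr(n, ent, T): events whose attribute ent equals or contains n."""
    return {e for e, attrs in T.items() if n in as_set(attrs.get(ent))}

def corr_set(T, N, ENT):
    """corr(N, T): union of the correlated events over all entities in N."""
    return {e for n in N for ent in ENT for e in corr(T, n, ent)}

print(sorted(corr(T, "I2", "Invoice")))
print(sorted(corr_set(T, {"I1", "P1"}, ENT)))
```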

**Fig. 2.** Relations between entities derived from Table 1

#### **2.3 Relations Between Entities**

We now make a first important observation. Although our data only defines entity types explicitly, it *implicitly* defines *relations between entity types*. A record $e$ in Table 1 containing two identifiers $n_1, n_2$ of two different types implicitly relates $n_1$ and $n_2$. For example, event $e_5$ defines that $e_5.\mathit{Order} = O2$ is related to $e_5.\mathit{Invoice} = I2$ and event $e_{18}$ defines that $e_{18}.\mathit{Order} = O1$ is related to $e_{18}.\mathit{Invoice} = I1$. We can write this as a relation $R(\mathit{Invoice}, \mathit{Order}) = \{(O1, I1), (O2, I2)\}$.

**Definition 5 (Relation).** *Let* $T = (E, \mathit{Attr}, \#, \mathit{ENT})$ *be an event table with entities. Let* $\mathit{ent}_1, \mathit{ent}_2 \in \mathit{ENT}$ *be two entity types. The* relation between $\mathit{ent}_1$ and $\mathit{ent}_2$ in $T$ *is* $R(\mathit{ent}_1, \mathit{ent}_2) = \{(e.\mathit{ent}_1, e.\mathit{ent}_2) \mid e \in E, e.\mathit{ent}_1 \neq \bot, e.\mathit{ent}_2 \neq \bot\}$*.*

Note that Definition 5 does not impose the direction of a relation. Figure 2 visualizes the relations we can derive from Table 1.
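The following sketch derives relations in the sense of Definition 5 from an excerpt of the event table. The excerpt (only identifiers named in the text) and the handling of multi-valued attributes via a cross product are assumptions made for illustration; `relation` is a hypothetical helper name. Since Definition 5 does not impose a direction, the order of the two arguments only determines the order within each pair.

```python
from itertools import product

T = {
    "e5":  {"Order": "O2", "Invoice": "I2"},
    "e9":  {"Invoice": "I2"},
    "e18": {"Order": "O1", "Invoice": "I1"},
    "e29": {"Payment": "P1"},
    "e30": {"Invoice": ["I1", "I2"], "Payment": "P1"},
}

def as_set(x):
    """Normalize an attribute value: undefined -> {}, single value -> {x}."""
    if x is None:
        return set()
    return set(x) if isinstance(x, (list, set, tuple)) else {x}

def relation(T, ent1, ent2):
    """Pairs (n1, n2) recorded together in some event with both attributes defined."""
    return {pair
            for attrs in T.values()
            for pair in product(as_set(attrs.get(ent1)), as_set(attrs.get(ent2)))}

print(sorted(relation(T, "Order", "Invoice")))
print(sorted(relation(T, "Invoice", "Payment")))
```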

Recall that in relational data modeling, each relation R(*ent*1,*ent*2) has a *cardinality* describing how many entities of type *ent*<sup>1</sup> are related to each entity of type *ent*2, and vice versa. We can infer this cardinality from the tuples in R(*ent*1,*ent*2) if we assume that the data in the input event table is sufficiently complete. For example, for the relations in Fig. 2,


Entities are also transitively related by concatenating or joining the relations on a shared entity type (and then omitting this shared entity type). For example, $R(\mathit{Order}, \mathit{Payment}) = R(\mathit{Invoice}, \mathit{Order}) \bowtie R(\mathit{Invoice}, \mathit{Payment}) = \{(O1, P1), (O2, P1)\}$ is an n-to-1 relation, and $R(\mathit{Order}, \mathit{Supplier\ Order}) = R(\mathit{Item}, \mathit{Order}) \bowtie R(\mathit{Item}, \mathit{Supplier\ Order}) = \{(O1, A), (O1, B), (O2, A), (O2, B)\}$ is an n-to-m relation.
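A minimal sketch of joining two relations on a shared entity type and inferring the cardinality of the result is given below. The input pairs are the ones stated in the text; the function names `join` and `cardinality` and the way cardinality is classified (by counting distinct partners on either side) are assumptions for illustration, valid only if the observed data is sufficiently complete.

```python
from collections import defaultdict

order_invoice = {("O1", "I1"), ("O2", "I2")}        # pairs (Order, Invoice)
invoice_payment = {("I1", "P1"), ("I2", "P1")}      # pairs (Invoice, Payment)

def join(r1, r2):
    """Join pairs (a, b) and (b, c) on the shared middle entity b, then drop b."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

def cardinality(pairs):
    """Classify a relation of pairs (a, b) as 1-to-1, 1-to-n, n-to-1, or n-to-m."""
    lefts, rights = defaultdict(set), defaultdict(set)
    for a, b in pairs:
        lefts[b].add(a)      # distinct a related to each b
        rights[a].add(b)     # distinct b related to each a
    many_left = any(len(s) > 1 for s in lefts.values())
    many_right = any(len(s) > 1 for s in rights.values())
    if many_left and many_right:
        return "n-to-m"
    if many_left:
        return "n-to-1"
    if many_right:
        return "1-to-n"
    return "1-to-1"

order_payment = join(order_invoice, invoice_payment)
order_supplier_order = {("O1", "A"), ("O1", "B"), ("O2", "A"), ("O2", "B")}
print(sorted(order_payment), cardinality(order_payment))
print(cardinality(order_supplier_order))
```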

Entities, relations, and correlation of events can be automatically retrieved from event tables [46] and relational databases [41,43] through schema recovery techniques. However, we have to be aware that relations and their cardinalities recovered according to Definition 5 are a *static* view of the relations obtained by aggregating all observations over time while *a process updates relations dynamically*. For instance, *Order* O1 was not related to any *Item* until event e27. Modeling such dynamics requires additional concepts as defined in XOC event logs [39,40]. We have to ignore this aspect in the remainder.

## **3 Shortcomings of Event Logs over Multi-entity Event Data**

Having defined event data over multiple entities, we can now discuss ways of ordering events correlated to a case or an entity, which is the basis for process mining analysis. We first explain how transforming multi-entity data into a classical event log with a single case identifier (Sect. 3.1) introduces false behavioral information leading to false analysis results (Sect. 3.2). We then propose a different approach to ordering events with respect to individual entities (Sect. 3.3).

## **3.1 Classical Event Log Extraction**

We cannot directly turn the event data in Table 1 into a classical event log, because we lack a clear case identifier column that is defined for all events. While *Actor* is an entity identifier defined for all events, it does not group events into the process executions described in Sect. 1. The standard procedure to extract a classical event log from such data is the following (see also Def. 5 of [1] and [13]).

**Step 1. Determine relevant entities in the data.** An event table with entity identifiers already defines the set of entities in the process (see Definition 3). For extracting an event log for a process execution, we only consider entities that are also handled "along" or "within" a process execution. Thus, we now focus on *Order*, *Supplier Order*, *Item*, *Invoice*, and *Payment* and exclude *Actor*.<sup>3</sup>

**Step 2. Pick one entity as case identifier.** As the process goal is to complete an order, entity *Order* is our best candidate for a case identifier. This identifier defines two cases: O1 and O2. However, as most events in Table 1 are not directly correlated to an *Order*, we cannot simply group events by attribute *Order*.

**Step 3. Define the set of all entities related to a case.** The classical idea is to "enlarge" the scope of the case. We include all entities which are (transitively) related to the case entities O1 and O2 via the relations we can identify in the event table (see Definition 5 and Fig. 2).


<sup>3</sup> In later sections we will not have to make such a distinction and can consider behavior along any kind of entity.

**Step 4. Construct a trace from events of all entities in a case.** Each event $e$ correlated to an entity $n \in \mathit{caseEntities}(O1)$ is now also considered as correlated to case O1: $\mathit{corr}^*(O1, T) = \mathit{corr}(\mathit{caseEntities}(O1), T)$. For example, for O1 we extract from Table 1:


Taking their union yields $\mathit{corr}^*(O1, T) = \{e_1, e_3, e_4, e_6, e_{10}, e_{11}, e_{18}, e_{19}, e_{20}, e_{27}, e_{28}, e_{29}, e_{30}\}$. We store all events extracted for O1 in a new event table where we explicitly set the attribute *Case* to O1. In this way, we materialize that each $e_i \in \mathit{corr}^*(O1, T)$ is correlated to O1. We repeat this procedure for each case. Table 2 shows the extracted events for O1 and O2.

Note that this extraction approach can extract the same event *multiple times* for different cases but with a different value for the newly set *Case* attribute. For instance, $e_3$ and $e_{30}$ are extracted both for O1 and for O2. This is due to the n-to-m relation between *Order* and *Supplier Order* and the n-to-1 relation between *Payment* and *Order*.

Ordering the extracted events by time in each case results in the *traces* from the viewpoint of O1 and from the viewpoint of O2, respectively, as shown in Table 2.
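The extraction procedure, and the duplication of events it causes, can be sketched on a small excerpt. Everything specific in this sketch is an assumption: the excerpt covers only five events and the *Order*, *Invoice*, and *Payment* identifiers named in the text, timestamps are placeholder integers, entity types are omitted (identifiers are taken to be disjoint), and `case_entities` is a simplified, hypothetical stand-in that looks up entities related to an order in a given set of pairs.

```python
# Excerpt: event id -> placeholder time and the entities the event is correlated to.
T = {
    "e5":  {"time": 5,  "ents": {"O2", "I2"}},
    "e9":  {"time": 9,  "ents": {"I2"}},
    "e18": {"time": 18, "ents": {"O1", "I1"}},
    "e29": {"time": 29, "ents": {"P1"}},
    "e30": {"time": 30, "ents": {"I1", "I2", "P1"}},
}
# (Order, related entity) pairs from R(Order, Invoice) and R(Order, Payment).
related = {("O1", "I1"), ("O2", "I2"), ("O1", "P1"), ("O2", "P1")}

def case_entities(case, related):
    """The case entity itself plus every entity related to it (simplified)."""
    return {case} | {n for (c, n) in related if c == case}

def extract_trace(T, case, related):
    """Collect all events correlated to any case entity and order them by time."""
    ents = case_entities(case, related)
    evs = [e for e, rec in T.items() if rec["ents"] & ents]
    return sorted(evs, key=lambda e: T[e]["time"])

log = {case: extract_trace(T, case, related) for case in ("O1", "O2")}
duplicated = set(log["O1"]) & set(log["O2"])
print(log)
print(sorted(duplicated))
```

In this excerpt the events correlated to the shared payment P1 end up in both traces, which mirrors the duplication caused by the n-to-1 relation between orders and the payment.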

Event logs can be automatically extracted in this way from event tables with multiple entity identifiers [46]. Extraction from relational databases succeeds through SQL queries that extract and group events from different tables into traces [35]. These queries can be generated automatically using a variety of techniques [6,7,12,29,35,41]; see [2,13] for a detailed discussion.

#### **3.2 False Behavioral Information in Classical Event Logs**

Note that the event log in Table 2 contains a considerable amount of *false* behavioral information. Some events were duplicated and occur in both traces, e.g., $e_3, e_4, e_6, e_{19}, e_{29}$, suggesting that in total four *Supplier Orders* were placed and received (while there were only two) and that two *Payments* were received (while there was only one). This is also known as *divergence* [41,45,52].

Further, the order of events in both traces gives false behavioral information. For instance, in the trace for O2, *Update SO* ($e_7$) occurs after *Receive SO* ($e_6$), suggesting a supplier order was updated after it had been received (while this never happened for any Supplier Order). This is also known as *convergence* [41,45,52].

Where divergence falsifies frequencies of events, convergence falsifies the behavioral information in the directly-follows relation, which is the basis for most process discovery techniques. As a result, the discovered process models are also wrong. Figure 3 (left) shows the directly-follows graph (DFG) of the log in Table 2 and the corresponding process model discovered with the Inductive Miner (IM) annotated with the mean waiting times. Both models show false information suggesting that


**Fig. 3.** Directly-follows graph of event log of Table 2 (left) and Inductive Miner model (right) show false dependencies.

The performance information in the IM model suggests that

– the mean time for receiving a *Supplier Order* after placement is 2.2d while *A* was received within 3d after placement (e3-e6) and *B* was received within 5d after placement (e4-e19) and within 3d after the last update (e7-e19).

This false behavioral information makes it impossible to properly locate deviating behaviors and causes for delays, e.g., the reasons why both orders were not shipped within 6 days.

## **3.3 Correct Behavioral Information: Local Directly-Follows**

The reason why the event log in Table 2 contains false behavioral information is the following:


We can avoid both problems by simply *not* extracting all events towards a single case identifier, but keeping all events local to the entities they are *directly* correlated to. To analyze behavior, we only construct a temporal order between events that are related, e.g., correlated to the same entity.

In other words, instead of defining one global directly-follows relation for all events based on a global case identifier, we define a local directly-follows relation *per* entity [30, Def. 4.6].

**Definition 6 (Directly-Follows (per Entity)).** *Let* $T = (E, \mathit{Attr}, \#, \mathit{ENT})$ *be an event table with entities. Let* $n \in \mathit{Entities}(\mathit{ent}, T)$ *be an entity of type* $\mathit{ent} \in \mathit{ENT}$*.*

*Let* $e_1, e_2 \in E$ *be two events;* $e_2$ directly follows $e_1$ from the perspective of $n$*, written* $e_1 \lessdot_{n,T} e_2$*, iff*


For example, while $e_7$ directly follows $e_6$ globally for O2, they do not follow each other locally from the perspective of O2. Instead, from the perspective of B, $e_7$ directly follows $e_4$, i.e., $e_4 \lessdot_{B,T} e_7$. Interestingly, also $e_2 \lessdot_{O2,T} e_7$ and $e_6 \lessdot_{R2,T} e_7$ hold. That means $e_7$ directly follows *three* different events as seen from three different perspectives: the Supplier Order B, the Order O2, and resource R2.
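A local directly-follows relation per entity can be computed by sorting, for each entity, its correlated events by time and pairing consecutive events. The sketch below does this for a four-event excerpt covering only Supplier Orders A and B and resource R2; the excerpt, the placeholder integer timestamps, and the helper name `local_df` are assumptions, and the perspective of Order O2 is left out. The sketch also assumes that events of one entity have distinct timestamps.

```python
from collections import defaultdict

# Excerpt: only entities A, B, R2; timestamps are placeholders (assumption).
T = {
    "e3": {"time": 3, "ents": {"A"}},
    "e4": {"time": 4, "ents": {"B"}},
    "e6": {"time": 6, "ents": {"A", "R2"}},
    "e7": {"time": 7, "ents": {"B", "R2"}},
}

def local_df(T):
    """For each entity n, the list of pairs (e1, e2) where e2 directly follows e1."""
    by_entity = defaultdict(list)
    for e, rec in T.items():
        for n in rec["ents"]:
            by_entity[n].append(e)
    df = {}
    for n, evs in by_entity.items():
        evs.sort(key=lambda e: T[e]["time"])     # order the entity's events by time
        df[n] = list(zip(evs, evs[1:]))          # consecutive events: nothing in between
    return df

df = local_df(T)
# All events that e7 directly follows, together with the perspective.
preds_of_e7 = {(n, e1) for n, pairs in df.items() for (e1, e2) in pairs if e2 == "e7"}
print(df)
print(sorted(preds_of_e7))
```

Within this excerpt, `e7` has two different direct predecessors, one per entity it is correlated to, which a single sequential trace cannot represent.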

We cannot represent this information in a single table or a sequential event log. Extracting a *collection* of related sequential event logs from event tables [46] and relational databases [41] results in a collection of directly-follows relations per entity type. However, the behavioral information remains separated per entity type, hindering reasoning about the process as a whole [25]. We therefore turn to a graph-based data model.

## **4 Event Knowledge Graphs**

Our primary aim is to model multiple local directly-follows relations (see Definition 6) over events correlated to multiple entities. To construct these relations, we also have to model entities, relations between entities, and correlations of entities to events (see Sect. 2). A *typed* graph data model such as *labeled property graphs* [48] allows us to distinguish different types of nodes (events, entities) and relationships (directly-follows, correlated-to). We adopt labeled property graphs to construct a *knowledge graph* [33] of a process from event data, to augment this graph with further knowledge, and to even perform process mining analysis within a graph. Section 4.1 defines the generic data model of labeled property graphs which we use in Sect. 4.2 to define *event knowledge graphs* and "directly-follows" paths in an event knowledge graph. In Sect. 4.3 we discuss how to algorithmically construct an event knowledge graph from an event table.

## **4.1 Labeled Property Graphs**

A labeled property graph is a graph where each node and each directed edge (called relationship) has a type, called *label*. Further, each node and each relationship can carry attribute-value pairs as properties. For the remainder, we fix a set $\Lambda_N$ of node labels, a set $\Lambda_R$ of relationship labels, and a set $\mathit{Attr}$ of property names over a value domain $\mathit{Val}$.

**Definition 7 (Labeled Property Graph).** *A* labeled property graph *(LPG)* G = (N, R, λ, #) *is a graph with* nodes N*, and* relationships R *with the following properties:*


We write $x.a = v$ for $\#(x, a) = v$ and $x.a = \bot$ if $a$ is undefined for $x$. We write $N^{\ell} = \{n \in N \mid \lambda(n) = \ell\}$ and $R^{\ell} = \{r \in R \mid \lambda(r) = \ell\}$ for the nodes and relationships with label $\ell$, respectively. We also write $(n_1, n_2) \in R^{\ell}$ if there exists $r \in R^{\ell}$ with $\overrightarrow{r} = (n_1, n_2)$.

Figure 4 shows an example of a labeled property graph, defining 5 nodes with label *Event*, 3 nodes with label *Entity*, 7 relationships with label *corr*, and 4 relationships with label *df*.

We here also provide some notation for standard operations on LPGs. Let $G_1 = (N_1, R_1, \lambda_1, \#^1)$ and $G_2 = (N_2, R_2, \lambda_2, \#^2)$ be two LPGs.

$G_2$ is a *sub-graph* of $G_1$, written $G_2 \subseteq G_1$, iff $N_2 \subseteq N_1$, $R_2 \subseteq R_1$, $\lambda_2 = \lambda_1|_{N_2 \cup R_2}$, and $\#^2 = \#^1|_{N_2 \cup R_2}$. The *union* of $G_1$ and $G_2$ is $G_1 \cup G_2 = (N_1 \cup N_2, R_1 \cup R_2, \lambda_1 \cup \lambda_2, \#^1 \cup \#^2)$ under the assumption that $\lambda_1(x) = \lambda_2(x)$ and $\#^1_a(x) = \#^2_a(x)$ for all $a \in \mathit{Attr}$ for any $x \in (N_1 \cup R_1) \cap (N_2 \cup R_2)$. For a set $\mathbf{G} = \{G_1, \ldots, G_n\}$ of graphs, we write $\bigcup_{G \in \mathbf{G}} G = G_1 \cup \ldots \cup G_n$.

Labeled property graphs are a native data structure for knowledge graphs [33] and for a variety of *graph database systems* [48] that provide data management and query languages for reading and manipulating graphs [5].
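To illustrate the data model independently of any particular graph database, the following sketch represents a labeled property graph as two dictionaries and implements the label-based selection and the union described above. The class `LPG`, its method names, and the property values are hypothetical and chosen for illustration only; this is not the API of Neo4j or of any other system.

```python
from dataclasses import dataclass, field

@dataclass
class LPG:
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    rels: dict = field(default_factory=dict)   # rel id -> (source, target, label, properties)

    def add_node(self, nid, label, **props):
        self.nodes[nid] = (label, props)

    def add_rel(self, rid, src, tgt, label, **props):
        self.rels[rid] = (src, tgt, label, props)

    def with_label(self, label):
        """Ids of all nodes and of all relationships carrying the given label."""
        return ({n for n, (l, _) in self.nodes.items() if l == label},
                {r for r, (_, _, l, _) in self.rels.items() if l == label})

    def union(self, other):
        """Union of two graphs; shared nodes and relationships must agree exactly."""
        for mine, theirs in ((self.nodes, other.nodes), (self.rels, other.rels)):
            for x in mine.keys() & theirs.keys():
                if mine[x] != theirs[x]:
                    raise ValueError(f"conflicting definitions for {x}")
        return LPG({**self.nodes, **other.nodes}, {**self.rels, **other.rels})

g1 = LPG()
g1.add_node("e18", "Event", time=18)           # placeholder timestamp
g1.add_node("I1", "Entity", type="Invoice")
g1.add_rel("c1", "e18", "I1", "corr")

g2 = LPG()
g2.add_node("e30", "Event", time=30)           # placeholder timestamp
g2.add_node("I1", "Entity", type="Invoice")    # shared node, identical in both graphs
g2.add_rel("c2", "e30", "I1", "corr")

g = g1.union(g2)
print(g.with_label("Event"), g.with_label("corr"))
```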

#### **4.2 Formal Definition of an Event Knowledge Graph**

To precisely model event data in an LPG, we have to restrict ourselves to specific node labels for events and entities, and to specific relationship labels for correlation and directly-follows. Thereby, directly-follows relationships can only be defined between events that are correlated to the same entity and directly follow each other from the viewpoint of that entity (Definition 6). This is formalized in the model proposed by Esser [25] which we here call *event knowledge graph*.<sup>4</sup>

**Fig. 4.** Event knowledge graph of events $e_5, e_9, e_{18}, e_{29}, e_{30}$ of Table 2.

**Definition 8 (Event Knowledge Graph).** *An* event knowledge graph *(or just* graph*) is an LPG* $G = (N, R, \lambda, \#)$ *with node labels* $\{\mathit{Event}, \mathit{Entity}\} \subseteq \Lambda_N$ *and relationship labels* $\{\mathit{df}, \mathit{corr}\} \subseteq \Lambda_R$ *indicating "directly-follows" and "correlation" with the following properties.*

	- *(a)* $e_1$ *and* $e_2$ *are correlated to entity* $n$*:* $(e_1, n), (e_2, n) \in R^{corr}$*;*
	- *(b)* $e_1$ *occurs before* $e_2$*:* $e_1.\mathit{time} < e_2.\mathit{time}$*; and*
	- *(c) there is no other event* $e' \in N^{Event}$ *correlated to* $n$*,* $(e', n) \in R^{corr}$*, that occurs in between:* $e_1.\mathit{time} < e'.\mathit{time} < e_2.\mathit{time}$*.*

*We write* $\mathit{df}.\mathit{type} = \mathit{df}.\mathit{ent}.\mathit{type}$ *and* $(e_1, e_2) \in R^{df}_n$*.*

<sup>4</sup> The initially chosen term "event graph" [25,38] which seems natural and shorter has previously been coined for a model for discrete event simulation [49]. At the same time, we will see that the proposed event *knowledge* graph model allows to capture more than just events.

Figure 4 shows an event knowledge graph for entities I1, I2, P1 of Table 2 and their correlated events. Each *df* relationship is defined between any two subsequent events correlated to the same entity. In the following, we omit the labels and use dashed edges for *corr* relationships, square nodes for *Event* nodes, and ellipses for *Entity* nodes.
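The graph of Fig. 4 can be reproduced with a short sketch that derives entity nodes, *corr* relationships, and *df* relationships from an event table, following conditions (a) to (c) of Definition 8. The input excerpt, the placeholder integer timestamps, and the function name `build_graph` are assumptions for illustration; entity identifiers are assumed to be disjoint across entity types, and events of one entity are assumed to have distinct timestamps.

```python
# Excerpt with the five events of Fig. 4; timestamps are placeholders (assumption).
T = {
    "e5":  {"time": 5,  "Invoice": ["I2"]},
    "e9":  {"time": 9,  "Invoice": ["I2"]},
    "e18": {"time": 18, "Invoice": ["I1"]},
    "e29": {"time": 29, "Payment": ["P1"]},
    "e30": {"time": 30, "Invoice": ["I1", "I2"], "Payment": ["P1"]},
}
ENT = ["Invoice", "Payment"]

def build_graph(T, ENT):
    """Return (event nodes, entity nodes, corr relationships, df relationships)."""
    events = set(T)
    entities, corr, df = set(), set(), set()
    for ent in ENT:
        for e, rec in T.items():
            for n in rec.get(ent, []):
                entities.add((n, ent))           # entity node with its type
                corr.add((e, n))                 # event e is correlated to entity n
    for n, _ in entities:
        # (a) events correlated to n, (b) ordered by time, (c) consecutive pairs only
        evs = sorted((e for (e, m) in corr if m == n), key=lambda e: T[e]["time"])
        df.update((e1, e2, n) for e1, e2 in zip(evs, evs[1:]))
    return events, entities, corr, df

events, entities, corr, df = build_graph(T, ENT)
print(len(events), len(entities), len(corr), len(df))
print(sorted(df))
```

The resulting counts agree with the description of Fig. 4 given in Sect. 4.1: 5 event nodes, 3 entity nodes, 7 *corr* relationships, and 4 *df* relationships.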

A path along *df*-relationships corresponds to a trace in a classical event log. A *path* in a graph $G$ is a sequence $\mathbf{r} = \langle r_1, \ldots, r_k \rangle \in R^*$ of consecutive relationships, i.e., the target node of $\overrightarrow{r_i} = (n_{i-1}, n_i)$ is the start node of $\overrightarrow{r_{i+1}} = (n_i, n_{i+1})$, $1 \leq i < k$.

## **Definition 9 (df-path).** *Let* G = (N, R, λ, #) *be an graph.*

*A path* $\mathbf{r} = \langle r_1, \ldots, r_k \rangle \in (R^{df})^*$ *of df-relationships is a* directly-follows path (df-path) *iff all relationships are defined for the same entity, i.e., for all* $1 \le i < k$*,* $r_i.ent = r_{i+1}.ent = n$*; we also say* $\mathbf{r}$ *is a df-path for entity* $n$*.*

$\mathbf{r}$ *is* maximal *iff there is no other df-relationship* $r' \in R^{df}$ *so that* $\langle r', r_1, \ldots, r_k \rangle$ *or* $\langle r_1, \ldots, r_k, r' \rangle$ *is also a df-path.*

For a path $\mathbf{r} = \langle r_1, \ldots, r_k \rangle \in (R^{df})^*$, $\overrightarrow{r_i} = (e_{i-1}, e_i)$, we write just the sequence of its nodes $\langle e_0, \ldots, e_k \rangle$ in case the correlated entity is clear. The graph in Fig. 4 defines three df-paths: for I1: $\langle e_{18}, e_{30} \rangle$, for I2: $\langle e_5, e_9, e_{30} \rangle$, and for P1: $\langle e_{29}, e_{30} \rangle$.
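To make the notion of a maximal df-path concrete, the following minimal Python sketch chains df-relationships of the same entity into node sequences. The triple encoding `(source, target, entity)` and the function name `maximal_df_paths` are our own illustrative assumptions, not part of the formal model; the input data are the three df-paths read off Fig. 4 above.

```python
from collections import defaultdict

# df-relationships as (source event, target event, entity) triples;
# these encode the three df-paths the text reads off Fig. 4.
DF = [
    ("e18", "e30", "I1"),
    ("e5", "e9", "I2"),
    ("e9", "e30", "I2"),
    ("e29", "e30", "P1"),
]


def maximal_df_paths(df):
    """Return {entity: [e0, ..., ek]}, assuming one linear df-path per entity."""
    successor = defaultdict(dict)   # entity -> {source event: target event}
    targets = defaultdict(set)      # entity -> events with an incoming df-relationship
    for src, tgt, ent in df:
        successor[ent][src] = tgt
        targets[ent].add(tgt)

    paths = {}
    for ent, nxt in successor.items():
        # The maximal path starts at the one source that is never a target.
        start = next(e for e in nxt if e not in targets[ent])
        path = [start]
        while path[-1] in nxt:
            path.append(nxt[path[-1]])
        paths[ent] = path
    return paths


paths = maximal_df_paths(DF)
```

The sketch relies on each entity having exactly one linear df-path, which holds for graphs built by the df inference described below.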

Event knowledge graphs can be efficiently stored and queried using graph database systems [25]. This enables retrieving df-paths from graph databases using query languages such as Cypher [25,33]. While the nodes and relationships of Definition 8 can also be encoded in RDF [11], df-paths rely on attributes of relationships (Definition 9), which LPGs support but RDF does not.

Alternative formalizations of Definition 8 define just a partial order over events [4,30,55,56] describing the local directly-follows relation wrt. various entities (Definition 6). Such a partial order view is equivalent to a family of df-paths [30, Cor. 4.9]. This equivalence allows switching perspectives depending on the analysis task at hand.

#### **4.3 Obtaining an Event Knowledge Graph from an Event Table**

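Event data is (currently) not recorded in the form of a graph but, for example, in the form of an event table $T$ with multiple entities (Definition 2). We obtain an event knowledge graph from an event table $T$ in three steps:

1. Create event nodes.
2. Create entity nodes and correlation relationships.
3. Infer local directly-follows relations.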


We now explain and define each step along the running example of Table 1. We assume as input an event table $T = (E, Attr, \#^T, ENT)$ with multiple entities as stated in Definition 2. The central requirement is that each unique entity type $ent \in ENT \subseteq Attr$ is explicitly recorded as a dedicated attribute (column) of $T$, and that each value in column $ent$ is an entity identifier.

**Step 1: Create Event Nodes.** We start by translating each event record in event table T into an event node in graph G.

**Definition 10 (Event nodes from an event table).** *Let* $T = (E, Attr, \#^T, ENT)$ *be an event table with entities. The* event nodes of $T$ *are the graph* $G^{Event}_T = (N^{Event}, \emptyset, \lambda, \#^G)$ *with*

1. $N^{Event} = E$*, i.e., each event of* $T$ *becomes an event node, and*
2. $\#^G_a(e) = \#^T_a(e)$ *for all* $a \in Attr$*, i.e., each event keeps all attributes from* $T$ *as properties in* $G$*.*

The resulting graph G is a set of disconnected *Event* nodes only.
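Step 1 can be sketched in a few lines of Python. The dictionary-based table and graph representation, the column names, and the time values below are hypothetical choices for illustration only; the activity names for $e_1$ and $e_{18}$ follow the running example.

```python
# Hypothetical event table: one dict per event record. "Order" and "Invoice"
# play the role of entity-type columns (ENT); time values are made up.
TABLE = [
    {"id": "e1", "act": "Create Order", "time": 1, "Order": ["O1"], "Invoice": []},
    {"id": "e18", "act": "Create Invoice", "time": 5, "Order": ["O1"], "Invoice": ["I1"]},
]


def event_nodes(table):
    """Sketch of Definition 10: every record becomes an Event node, R stays empty."""
    nodes = {row["id"]: {"label": "Event", **row} for row in table}
    relationships = []  # no relationships are created in Step 1
    return nodes, relationships


nodes, rels = event_nodes(TABLE)
```

Every attribute of the record is copied onto the node unchanged, so the entity identifiers are still ordinary properties at this point.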

**Step 2: Create Entity Nodes and Correlation Relationships.** Each attribute of an event $e$ in $T$ that refers to an entity, e.g., $e.ent = \{n\}$, is now a property of the event node $e$ in $G$. The basic idea is to "push out" this property: we make each unique value $n$ an *Entity* node $n$ and link $e$ to $n$ by a *corr* relationship. The following definition constructs a small graph $G^{corr}(n)$ that does exactly this. We then use graph union $G \cup \bigcup_n G^{corr}(n)$ to add them to $G$. The reason for doing so is that we can later calculate with various subgraphs.

**Definition 11 (Entity and correlation inference).** *Let* $G = (N, R, \lambda, \#^G)$ *be a graph and* $ENT$ *be known entity types.*

*Given a property name* $ent \in ENT$*, each property value* $e.ent$ *we find on an event node* $e \in N^{Event}$ *is an* entity identifier of $ent$ in $G$*:* $Entities(ent, G) = \{n \mid \exists e \in N^{Event} : n \in e.ent\}$*, see Definition 3.*

*Let* $n \in Entities(ent, G)$ *be an identifier of type* $ent \in ENT$*. The* entity and correlation inferred for $n$ in $G$ *is the graph* $G^{corr}(n) = (N', R', \lambda', \#')$ *with:*


We can infer entities and correlation on *any* event knowledge graph, not just the graph produced by Definition 10. This allows us to apply Definition 11 multiple times in any order. We can infer entities and correlation for an entity type $ent$ by $G^{corr}(ent) = \bigcup_{n \in Entities(ent, G)} G^{corr}(n)$. We can add the inferred entities and correlation to graph $G$ for all entity types $ENT$ by graph union $G \cup \bigcup_{ent \in ENT} G^{corr}(ent)$. In the result, each value $n \in Entities(ent, T)$ becomes a new node $n$ with $n.type = ent$. Correspondingly, each pair $(e, n) \in corr_{ent,T}$ becomes a new relationship of type *corr* from $e$ to $n$.

**Fig. 5.** Event graph of events of Table 2 without directly-follows relationships.

For example, applying Definition 10 to the event table of Table 2 results in the event nodes $e_1, \ldots, e_{11}, e_{18}, \ldots, e_{21}, e_{27}, \ldots, e_{32}$ shown in Fig. 5. Inferring entities and correlation for entity types *Order*, *Supplier Order*, *Item*, *Invoice*, and *Payment* adds the entity nodes and correlation edges shown in Fig. 5. In this graph we see that events $e_1, e_{18}, e_{27}, e_{28}$ are the events correlated to entity O1 of type *Order*. Moreover, event $e_{18}$ is correlated to two entities, *Order* O1 and *Invoice* I1; event $e_{27}$ is correlated to four entities, *Order* O1, *Item* X1, *Item* X2, and *Item* Y1.
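The effect of Step 2 can be sketched as follows. This is not the formal construction of Definition 11 but a minimal illustration of its outcome as described above: every identifier found in an entity-type property becomes an *Entity* node, and every (event, identifier) pair becomes a *corr* relationship. The data layout, the time values, and the name `infer_entities_and_corr` are assumptions made for this sketch.

```python
ENT = ["Order", "Invoice"]

# Event nodes after Step 1 (hypothetical excerpt; time values are made up).
EVENTS = {
    "e1": {"time": 1, "Order": ["O1"], "Invoice": []},
    "e18": {"time": 5, "Order": ["O1"], "Invoice": ["I1"]},
    "e30": {"time": 9, "Order": [], "Invoice": ["I1"]},
}


def infer_entities_and_corr(events, ent_types):
    """Push entity identifiers out of event properties into Entity nodes."""
    entity_nodes = {}  # entity id -> node properties
    corr = set()       # (event id, entity id) pairs, i.e. corr relationships
    for event_id, props in events.items():
        for ent in ent_types:
            for n in props.get(ent, []):
                entity_nodes[n] = {"label": "Entity", "type": ent}
                corr.add((event_id, n))
    return entity_nodes, corr


entity_nodes, corr = infer_entities_and_corr(EVENTS, ENT)
```

Because the function only reads event properties, it can be re-run for additional entity types at any time, which mirrors the remark that Definition 11 can be applied multiple times in any order.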

**Step 3: Infer Local Directly-Follows Relations.** We can now infer the local directly-follows relation (Definition 6) and materialize it as *df*-relationships between event nodes. Again, the basic idea is simple: for each entity node $n$ we retrieve all events $e_1, \ldots, e_k$ with a *corr*-relationship from $e_i$ to $n$. We order $e_1, \ldots, e_k$ by time and define a new *df*-relationship $r$ from $e_i$ to $e_{i+1}$; to remember for which entity $r$ holds, we set $r.ent = n$.

As before, we do not add the *df*-relationships directly to $G$ but construct a separate graph $G^{df}(n)$. We then add it to $G$ by graph union $G \cup \bigcup_n G^{df}(n)$, which later allows us to calculate with graphs.

**Definition 12 (df inference).** *Let* $G = (N, R, \lambda, \#)$ *be a graph. Let* $n \in N^{Entity}$*. Let* $\langle e_0, \ldots, e_k \rangle$ *be the sequence of events* $\{e_0, \ldots, e_k\} = corr(n)$ *correlated to* $n$ *and sorted by time:* $e_{i-1}.time < e_i.time$, $1 \le i \le k$*.*

*The* df-relationships inferred for $n$ in $G$ *is the graph* $G^{df}(n) = (N^{Event}, R^{df}, \lambda', \#')$ *with*


We can only infer a df-relationship for entity $n$ if $|corr(n)| > 1$. Thus, for df-inference to have any effect, we must have inferred the entity $n$ and its correlation using Definition 11, and at least two events must be correlated to $n$. As for entity and correlation inference, we can add the inferred df-relationships to $G$ by graph union $G \cup \bigcup_{n \in N^{Entity}} G^{df}(n)$.

For example, if we infer the df-relationships for each entity in the graph of Fig. 5 and add them to that graph, we obtain the graph shown in Fig. 6. Note that we only show the *corr* relationships to the first event of each entity for readability. This graph explicitly models the events, entities, correlation, and local directly-follows relations of all events in Table 2.
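A minimal Python sketch of df inference: for each entity, sort its correlated events by time and connect consecutive events. The correlation pairs for I1, I2, and P1 follow the df-paths of Fig. 4; the integer time values and the single-event entity `Z` are hypothetical, and ties in timestamps are not handled (Definition 12 assumes a strict order).

```python
from collections import defaultdict

# Hypothetical time values, chosen only to respect the order of the df-paths in Fig. 4.
TIME = {"e5": 1, "e9": 2, "e18": 3, "e29": 4, "e30": 5, "e99": 6}

CORR = {
    ("e18", "I1"), ("e30", "I1"),
    ("e5", "I2"), ("e9", "I2"), ("e30", "I2"),
    ("e29", "P1"), ("e30", "P1"),
    ("e99", "Z"),  # hypothetical entity with a single event: yields no df-relationship
}


def infer_df(time, corr):
    """Return df-relationships as (source, target, entity) triples."""
    events_of = defaultdict(list)
    for event, entity in corr:
        events_of[entity].append(event)

    df = set()
    for entity, events in events_of.items():
        ordered = sorted(events, key=lambda e: time[e])
        for src, tgt in zip(ordered, ordered[1:]):
            df.add((src, tgt, entity))  # the entity plays the role of r.ent
    return df


df = infer_df(TIME, CORR)
```

Note that $e_{30}$ receives three incoming df-relationships, one per correlated entity, which is exactly the synchronization behavior discussed in Sect. 5.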

**Complete Procedure.** The following definition summarizes how to apply the above three definitions to obtain an event knowledge graph of an event table T.

**Definition 13 (Event knowledge graph of an event table).** *Let* $T = (E, Attr, \#^T, ENT)$ *be an event table with entities. The event table* $T$ *defines the* graph $G = (N, R, \lambda, \#^G)$ of $T$ *as follows:*


From Definitions 10–13 it follows that the df-relationships in graph $G$ materialize the local directly-follows relation of event table $T$ (Definition 6).

**Lemma 1.** *Let* $G = (N, R, \lambda, \#^G)$ *be the event knowledge graph of event table* $T = (E, Attr, \#^T, ENT)$ *with entities. For any entity* $n \in Entities(ent, T)$, $ent \in ENT$*,* $e_2$ *directly follows* $e_1$ *from the perspective of* $n$ *in* $T$ *(Definition 6) iff* $(e_1, e_2) \in R^{df}_n$*.*

**Fig. 6.** Event graph of events of Table 2 after inferring directly-follows relationships.

#### **4.4 Inferring Entity Interactions**

The procedure of Definition 13 infers the local directly-follows relation for each entity in the graph. However, there are also important behavioral dependencies in the process *between* related entities, such as *Orders* and *Payments*, that are not visible in the graph of Fig. 6.

We know from Fig. 1 that shipping O2 has to wait until the invoice of O1 has been cleared by the related payment P1, but the graph of Fig. 6 suggests that $e_{31}$ of O2 does not depend on $e_{30}$ of P1 or any event of O1. This is because there is no entity correlated to both $e_{31}$ and $e_{30}$ or any event of O1.

Our analysis in Sect. 2.3 found that *Orders* are related to *Payments*. We can materialize this information in an event knowledge graph. We apply Definition 5 on all *Event* nodes to obtain relation $R(ent_1, ent_2)$ between any two (interesting) entity types $ent_1$, $ent_2$. For each pair $(n_1, n_2) \in R(ent_1, ent_2)$ we add a new relationship with label *related* from entity node $n_1$ to entity node $n_2$. Figure 7 illustrates the result of this step for (*Order*, *Invoice*) and (*Invoice*, *Payment*). We can infer transitive relationships by materializing paths of *related*-relationships (ignoring their directions) as new *related*-relationships.

**Fig. 7.** Inferring relations between *Orders*, *Invoices*, and *Payments*.

For example, we materialize $\langle O1, I1, P1 \rangle \in (R^{related})^*$ and $\langle O2, I2, P1 \rangle \in (R^{related})^*$ as $(O1, P1), (O2, P1) \in R^{related}$ in Fig. 7. These steps require domain knowledge to decide which potential relations to materialize, especially when considering paths over n-to-1 and 1-to-n relationships [41].
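The materialization of two-step *related* paths can be sketched as follows. The input pairs are the (*Order*, *Invoice*) and (*Invoice*, *Payment*) relations named in the example; the function `materialize_via` and its explicit type parameters are hypothetical and stand in for the domain knowledge needed to choose which paths to materialize.

```python
# related-relationships of the example: (Order, Invoice) and (Invoice, Payment).
RELATED = {("O1", "I1"), ("O2", "I2"), ("I1", "P1"), ("I2", "P1")}
TYPE = {"O1": "Order", "O2": "Order", "I1": "Invoice", "I2": "Invoice", "P1": "Payment"}


def materialize_via(related, types, src_type, mid_type, tgt_type):
    """Derive (source, target) pairs from two-step related-paths, ignoring direction."""
    undirected = related | {(b, a) for a, b in related}
    return {
        (a, c)
        for a, b in undirected
        for b2, c in undirected
        if b == b2
        and types[a] == src_type
        and types[b] == mid_type
        and types[c] == tgt_type
    }


derived = materialize_via(RELATED, TYPE, "Order", "Invoice", "Payment")
```

Restricting the path to the type sequence *Order*, *Invoice*, *Payment* avoids materializing unintended pairs such as two orders related via a shared payment.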

We then can infer the behavior between two related entities by adapting entity and correlation inference (Definition 11) as follows [25]:


Figure 8 shows the result of reifying the relation between *Order* and *Payment* entities of Fig. 7 into derived entities (O1, P1) and (O2, P1) of type (*Order*, *Payment*) and inferring the df-relationships for this entity type. We have now inferred df-paths from *Create Invoice* in O1 ($e_{18}$) via *Clear Invoice* in P1 ($e_{30}$) to *Pack Shipment* in O2 ($e_{31}$).<sup>5</sup>

Not all df-relationships for (O1, P1) and for (O2, P1) provide new information. For example, in Fig. 8, $(e_2, e_5) \in R^{df}_{O2}$ and $(e_2, e_5) \in R^{df}_{(O2,P1)}$ run in parallel.

We say that a df-relationship $(e_1, e_2) \in R^{df}_{(n_1,n_2)}$ of a derived entity $(n_1, n_2)$ *provides new information* if there is not already an existing df-relationship $(e_1, e_2) \in R^{df}_{n_1}$ or $(e_1, e_2) \in R^{df}_{n_2}$ for one of the original entities $n_1$ or $n_2$. Thus, a df-relationship $(e_1, e_2)$ provides new information if it actually describes an interaction from $n_1$ to $n_2$ or vice versa. In Fig. 8, $(e_7, e_{29})$, $(e_{28}, e_{29})$, and $(e_{30}, e_{31})$ provide new information.

<sup>5</sup> Our example here exploits that both orders of the same customer have invoices cleared by the same payment. For the more general case, we would have to include the customer in the data and infer the dependency via the customer entity.

**Fig. 8.** Result of reifying the relation between *Order* and *Payment* entities of Fig. 7 into a derived entity of type (*Order*, *Payment*) and inferring the df-relationships for this entity type.

In principle, we should keep only those *df*-relationships of a derived entity $(n_1, n_2)$ that provide new information. However, we can best study the interaction between $n_1$ and $n_2$ when all *df*-relationships between $n_1$ and $n_2$ are part of a path related to $(n_1, n_2)$. We therefore keep all *df*-relationships of $(n_1, n_2)$ that either provide new information or are between two *df*-relationships of the *df*-path for $(n_1, n_2)$ that do provide new information. In Fig. 8, for (O2, P1), we keep $(e_7, e_{29})$ and $(e_{30}, e_{31})$ (provide new information) and also $(e_{29}, e_{30})$ (between df-relationships that provide new information); for (O1, P1), we only keep $(e_{28}, e_{29})$.
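The keep rule can be sketched as a small filter over the df-path of a derived entity: find the first and last df-relationship that provide new information and keep everything in between. The function name and the list-based encoding are our own assumptions. The df-path of O1 follows the events listed for O1 in Sect. 4.3; the df-path of O2, in particular the step from $e_5$ to $e_7$, is an assumption reconstructed from the examples in the text.

```python
# df-relationships of the original entities, as (source, target) pairs.
ORIGINAL_DF = {
    ("e1", "e18"), ("e18", "e27"), ("e27", "e28"),               # O1
    ("e2", "e5"), ("e5", "e7"), ("e7", "e31"), ("e31", "e32"),   # O2 (assumed path)
    ("e29", "e30"),                                              # P1
}

# Time-ordered events of the derived entities (events of both original entities).
PATH_O1_P1 = ["e1", "e18", "e27", "e28", "e29", "e30"]
PATH_O2_P1 = ["e2", "e5", "e7", "e29", "e30", "e31", "e32"]


def keep_interaction_edges(path, original_df):
    """Keep df-relationships from the first to the last one that provides new information."""
    edges = list(zip(path, path[1:]))
    new_idx = [i for i, edge in enumerate(edges) if edge not in original_df]
    if not new_idx:
        return []
    return edges[new_idx[0]: new_idx[-1] + 1]


kept_o1 = keep_interaction_edges(PATH_O1_P1, ORIGINAL_DF)
kept_o2 = keep_interaction_edges(PATH_O2_P1, ORIGINAL_DF)
```

Under these assumptions the filter reproduces the relationships kept in Fig. 8 according to the text.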

The complete graph for Table 1 after inferring the *df* -relationships between *Order* and *Payment* entities is shown in Fig. 9.

#### **4.5 Creating Event Knowledge Graphs from Real-Life Data**

This method for constructing event knowledge graphs uses basic principles of information inference: (1) construct entities and correlation based on the presence of an entity identifier or a relation; and (2) derive a local directly-follows relation from the viewpoint of *each* entity. Our definitions assume the data to be accurate wrt. the real process, for instance, that entity identifiers and time stamps are recorded correctly and precisely; otherwise further preprocessing is required [30,44,47].

All steps of the method can be implemented as a series of Cypher queries<sup>6</sup> to construct event knowledge graphs in a graph database, both for our running example [28] and for various real-life datasets comprising single and multiple event tables [24]; several event knowledge graphs of real-life processes are available [19–24]. A variant of event knowledge graphs called *causal event graph*, which models only events but not entities, can be extracted automatically from relational databases [56].

In the following, we exploit the flexibility of the LPGs that underlie event knowledge graphs to infer and materialize further behavioral information, going beyond what event tables or event logs can describe.

## **5 Understanding Behavior over Multiple Entities**

The event knowledge graph of Fig. 9 we obtain with the method of Sect. 4 explicitly models what we observed earlier in Sect. 1: the behavior of the different entities forms a complex *network* of synchronizing *df* -paths. This section first discusses how to interpret df-paths (Sect. 5.1) and how they synchronize (Sect. 5.2). We then discuss querying graphs through selection of entities and projection onto events in Sect. 5.3; we apply these operations to understand why the retailer of our example in Sect. 1 could not ship orders within the promised 6 days. We finally introduce aggregation in Sect. 5.4 which we use to discover basic process models directly within event knowledge graphs in Sect. 5.5.

#### **5.1 How to Read Df-Paths in an Event Knowledge Graph**

We discuss how to read *df*-paths over events based on the running example of Fig. 6.

In a classical event log, each trace has a unique initial event and a unique final event indicating the start and completion of a process execution. A graph has multiple initial and final events, one per entity. Event $e$ is a *starting* or *ending* event if it has no incoming or outgoing *df*-relationship at all, e.g., $e_1, \ldots, e_4$, and $e_{32}$. Event $e$ is a *starting* or *ending* event for entity $n$ if it has no incoming or outgoing *df*-relationship for $n$. For example, $e_{11}$ is the ending event of the *df*-path for A but it still has an outgoing *df*-relationship for X2. Some events are starting/ending events for *multiple df*-paths or entities. For example, $e_6$ is the starting event for X1, X2, X3 and $e_7$ is the starting event for Y1, Y2, while $e_{27}$ is the ending event for X1, X2, Y1 and $e_{31}$ is the ending event for X3, Y2.

<sup>6</sup> https://github.com/multi-dimensional-process-mining/eventgraph_tutorial.

**Fig. 9.** Complete event knowledge graph of the event table of Table 1.

We call an event *intermediate* in a df-path of an entity $n$ if it is not a starting or ending event in the df-path of $n$. For example, $e_6$ is an intermediate event of A.

In the graph in Fig. 9 we see that the df-paths of entities of the same type are rather similar to each other.


Note that the graph no longer shows any directly-follows relation from *Receive SO* to *Update SO* that was falsely observed in Sect. 3. We can also analyze time differences between events on the df-path. For example, in Sect. 1 we stated that each Supplier Order is to be received within 3 days of placing the order.


Thus, the graph now shows temporal information and delays for individual entities correctly, in contrast to the classical event log of Sect. 3.

#### **5.2 How to Read Synchronization in a Graph**

Analyzing the df-paths for O1 and O2 also shows that none of the orders were shipped within 6 days: $e_{20}.time - e_1.time > 7$ days and $e_{32}.time - e_2.time > 8$ days. As completing the orders depends on other entities, i.e., the items, we now analyze entity interactions through synchronization of df-paths.

A df-path $\mathbf{r} = \langle e_0, \ldots, e_k \rangle$ *goes through* an event $e$ iff $e = e_i$, $0 \le i \le k$. An event $e$ is *local* to an entity $n$ if there is only one df-path of entity $n$ that goes through $e$, e.g., $e_1$, $e_2$, $e_{32}$. Two or more entities $n_1, \ldots, n_k$ *synchronize* in a *shared* event $e$ if two or more df-paths of $n_1, \ldots, n_k$ go through $e$, e.g., $e_7$ synchronizes Supplier Order B and Order O2 whereas $e_{19}$ synchronizes Supplier Order B and Items Y1 and Y2.
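Local and shared events can be separated by counting how many entities go through each event. The sketch below approximates "a df-path of $n$ goes through $e$" by "$e$ is correlated to $n$", which assumes that every listed entity has a df-path covering all its correlated events; the correlation pairs are limited to the events named in the paragraph above.

```python
from collections import defaultdict

# Excerpt of corr relationships for the events named in the text.
CORR = {
    ("e1", "O1"), ("e2", "O2"), ("e32", "O2"),
    ("e7", "O2"), ("e7", "B"),
    ("e19", "B"), ("e19", "Y1"), ("e19", "Y2"),
}


def entities_per_event(corr):
    """Map each event to the set of entities whose df-path goes through it."""
    by_event = defaultdict(set)
    for event, entity in corr:
        by_event[event].add(entity)
    return dict(by_event)


sync = entities_per_event(CORR)
shared = {e: ns for e, ns in sync.items() if len(ns) > 1}   # synchronizing events
local = {e for e, ns in sync.items() if len(ns) == 1}       # events local to one entity
```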

**Reading Entity Creation and Updates.** We now discuss different interpretations of entities $n_1, \ldots, n_k$ *synchronizing* in a shared event.

Event $e$ *intermediately synchronizes* entities $n_1, \ldots, n_k$ when $e$ is an intermediate event for $n_1, \ldots, n_k$. We can interpret an intermediate synchronization as an update or state change of one or more entities that requires the involvement of the other entities. For example, event $e_7$ intermediately synchronizes Order O2 and Supplier Order B to update B based on the information in O2; event $e_8$ updates both Supplier Order A and Item X3. Which entity changes state in $e_8$ is not visible in the graph of Fig. 9.

An event $e$ that is intermediate for one entity $n$ but a starting event for entities $n_1, \ldots, n_k$ can be interpreted as *entity* $n$ *"created" or "initiated" entities* $n_1, \ldots, n_k$. For example, Supplier Order A created Items X1, X2, X3 in $e_6$, and Order O2 created Invoice I2 in $e_5$. Correspondingly, an event $e$ that is intermediate for entity $n$ and ending event for $n_1, \ldots, n_k$ is "closing" or "completing" entities $n_1, \ldots, n_k$. For example, Order O1 "completes" Items X1, X2, Y1 in $e_{27}$.

An event $e$ where multiple entities $n_1, \ldots, n_k$ of the same type synchronize is a *batching* event for $n_1, \ldots, n_k$ [36,42,55]. For example, $e_{27}$ batches X1, X2, Y1, $e_{30}$ batches I1, I2, and $e_{31}$ batches X3, Y2.

However, we have to be careful with those interpretations, as both the graph and the data from which it was created may be incomplete. Entities that are "created" or "closed" may exist both before and after the recorded data, e.g., all Items X1,...,Y2 certainly exist prior to this process and after it; thus $e_6$ and $e_{27}$ only show when these items entered the visibility or scope of our observations. Likewise, a starting event $e$ for an entity $n$ that is *not* an intermediate event for another entity $n_2$ does *not* describe how $n$ was created. For example, $e_1, \ldots, e_4$ do not explain how O1, O2, A, B were created. This is because our graph of Fig. 9 is incomplete: (a) we did *not* infer the *Resource* entity and the corresponding *df*-relationships from Table 1, and (b) we only recorded data in a limited time window. A helpful principle to check for incompleteness in distributed behavior is due to C.A. Petri [27]: most events happen due to a synchronous interaction of two or more entities, and most physical entities are never created from nothing and never disappear into nothing.

**Reading Entity Interactions.** Events and df-paths describe different modes of interaction. An event $e$ where the df-paths of $n_1$ and $n_2$ synchronize is a *synchronous interaction*. A df-path for entity $n$ describes an *asynchronous interaction* between $n_1$ and $n_2$ if $n$ synchronizes with both $n_1$ and $n_2$ in different events. If the df-path for $n$ has only 2 events $\langle e_1, e_2 \rangle$, then we can interpret entity $n$ as a *message* from $n_1$ to $n_2$. We can interpret an event $e$ that is the ending event of entity $n_1$ and the starting event of entity $n_2$ as a *handover* from $n_1$ to $n_2$. In Fig. 9, $e_7$ is a synchronous interaction of O2 and B, the df-path of Y1 describes an asynchronous interaction from B to O2, and $e_{28}$ is a handover from O1 to (O1, P1).

If two entities $n_1$ and $n_2$ never synchronize in a shared event but there is at least one asynchronous interaction between $n_1$ and $n_2$, then $n_1$ and $n_2$ *interact asynchronously*. If all asynchronous interactions, i.e., df-paths, only go from $n_1$ to $n_2$, then the interaction is *one-directional*, and it is *bi-directional* otherwise. In Fig. 9, A and O1 interact asynchronously and one-directionally (from A to O1 via X1), while O2 and P1 interact asynchronously and bi-directionally (via (O2, P1)).

$n_1$ and $n_2$ *interact indirectly* if for any two events $e_1$ of $n_1$ and $e_2$ of $n_2$ the shortest df-path from $e_1$ to $e_2$ involves df-relationships from multiple other entities. For example, O1 interacts indirectly with O2 via (O1, P1) and (O2, P1) (df-path $\langle e_{28}, e_{29}, e_{30}, e_{31} \rangle$).

Finally, $n_1$ and $n_2$ *do not interact* if there is no df-path from $n_1$ to $n_2$, or vice versa. For example, A and B do not interact. Note, however, that (indirect) interactions via other entities as well as non-interaction are subject to which entities have been included in the construction of the graph and which relations have been reified into derived entities.

**Reading Event Dependencies and Delays.** We observed in Sect. 5.1 that neither O1 nor O2 was shipped within 6 days as required in Sect. 1. We now want to analyze which of the entities that synchronized with O1 and O2 kept either order from being shipped on time.

Consider an event $e$ that synchronizes the df-paths of multiple entities $n_1, \ldots, n_k$. Event $e$ *directly depends on* any event $e_i$ that directly precedes $e$ via an incoming df-relationship $(e_i, e) \in R^{df}_{n_i}$, $1 \le i \le k$, along entity $n_i$. We call $e.time - e_i.time$ the *delay* between $e_i$ and $e$.

Suppose $e_1, \ldots, e_k$ are sorted by decreasing delay to $e$. Event $e_1$ was the first event that directly preceded $e$, i.e., $e$ could not have occurred earlier than $e_1$. The entity $n_1$, for which $(e_1, e) \in R^{df}_{n_1}$ was observed, was the first entity ready to synchronize in $e$. We can interpret that each later event $e_i$, $i > 1$, *delayed* the synchronization in $e$ as entity $n_i$ became ready to synchronize later than $n_1$ did, with $e_k$ and $n_k$ delaying $e$ the most.

For example, in Fig. 9, $e_{31}$ (*Pack Shipment* for O2) depends on $e_7$, $e_8$, $e_{21}$, $e_{30}$ along entities O2, X3, Y2, and (O2, P1) with delays of 3 days, 3 days, 2 days, and 3 h, respectively. While O2 was first ready to synchronize in $e_{31}$ after $e_7$ (*Update Order*), $e_{31}$ was delayed most by $e_{30}$ (*Clear Invoice* for I1, I2) along (O2, P1).

For a given event $e$, we can build the set $delay^*(e)$ of transitive predecessors that delayed $e$ the most, by first adding the event $e'$ that delayed $e$ most, then adding the event $e''$ that delayed $e'$ most, etc. For example, in Fig. 9, $delay^*(e_{32}) = \{e_{31}, e_{30}, e_{29}, e_{28}, e_{27}, e_{20}, e_{19}, e_7, e_5, e_2\}$.
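The computation of delays and of $delay^*(e)$ can be sketched as a walk along the most recent predecessor. The df-relationships below are an excerpt around $e_{31}$ and $e_{30}$; the time values (in hours) are hypothetical and chosen only so that the delays into $e_{31}$ match the 3 days, 3 days, 2 days, and 3 h quoted above. The function names are our own.

```python
from collections import defaultdict

# Hypothetical times in hours (72 h = 3 days, 48 h = 2 days before e31).
HOURS = {"e7": 28, "e8": 28, "e21": 52, "e9": 10, "e29": 96, "e30": 97, "e31": 100}

# Excerpt of df-relationships as (source, target, entity).
DF = [
    ("e7", "e31", "O2"),
    ("e8", "e31", "X3"),
    ("e21", "e31", "Y2"),
    ("e30", "e31", "(O2,P1)"),
    ("e29", "e30", "P1"),
    ("e9", "e30", "I2"),
]


def predecessors(df):
    """Map each event to its directly preceding (event, entity) pairs."""
    preds = defaultdict(list)
    for src, tgt, ent in df:
        preds[tgt].append((src, ent))
    return preds


def delays(event, df, time):
    """Delay between each direct predecessor and the given event."""
    return {src: time[event] - time[src] for src, _ in predecessors(df)[event]}


def delay_star(event, df, time):
    """Follow, step by step, the predecessor with the smallest delay (latest time)."""
    preds = predecessors(df)
    chain = []
    current = event
    while preds.get(current):
        current = max((src for src, _ in preds[current]), key=lambda e: time[e])
        chain.append(current)
    return chain


d31 = delays("e31", DF, HOURS)
chain31 = delay_star("e31", DF, HOURS)
```

The walk terminates because df-relationships always point forward in time, so no event can be its own transitive predecessor.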

Comprehending such subsets of events (and the dynamics they describe) is rather difficult. We use graph querying to reduce a graph to a subgraph of interesting events.

#### **5.3 Basic Querying Operations**

Similarly to classical event logs, we can also subset (or filter) event knowledge graphs for a more focused analysis. Recall that we have two basic operations for sub-setting classical event logs: selection (include only a subset of the cases with specific properties but keep all events in a case) and projection (keep all cases but keep only a subset of events with specific properties). The same operations can be applied to event knowledge graphs.

We *select* a subset of entities, but keep all event nodes correlated to the entities and all directly-follows relations between the events of these entities. Formally, given a graph $G$, we select entity nodes $N^{Entity}_{sel} \subseteq N^{Entity}$ from $G$ by (1) removing all entity nodes $N^{Entity} \setminus N^{Entity}_{sel}$ and all adjacent *corr* relationships, then (2) removing all event nodes $e \in N^{Event}$ which no longer have any *corr* relationships (because none of their entities was selected) and the adjacent *df* relationships.

We *project* on a subset of events by keeping all entity nodes but only the selected event nodes; as this may interrupt df-paths (if an intermediate event gets removed) we have to recompute all df-relationships. Formally, given a graph $G$, we project onto event nodes $N^{Event}_{proj} \subseteq N^{Event}$ from $G$ by (1) removing all *df*-relationships from $G$, (2) removing all event nodes $N^{Event} \setminus N^{Event}_{proj}$, and then (3) doing df-inference on the resulting graph (Definition 12).
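Both operations can be sketched over the same set-based encoding of *corr* and *df* relationships. The correlation pairs follow the df-paths of I2 and P1 from Fig. 4; the time values and the function names `select` and `project` are hypothetical.

```python
from collections import defaultdict

TIME = {"e5": 1, "e9": 2, "e29": 3, "e30": 4}  # hypothetical time values
CORR = {("e5", "I2"), ("e9", "I2"), ("e30", "I2"), ("e29", "P1"), ("e30", "P1")}


def infer_df(time, corr):
    """df inference: connect time-consecutive events of the same entity."""
    events_of = defaultdict(list)
    for event, entity in corr:
        events_of[entity].append(event)
    df = set()
    for entity, events in events_of.items():
        ordered = sorted(events, key=lambda e: time[e])
        df.update((a, b, entity) for a, b in zip(ordered, ordered[1:]))
    return df


def select(entities_sel, corr, df):
    """Selection: drop unselected entities, then events left without any corr."""
    corr_sel = {(e, n) for e, n in corr if n in entities_sel}
    events = {e for e, _ in corr_sel}
    df_sel = {(a, b, n) for a, b, n in df if a in events and b in events}
    return corr_sel, df_sel


def project(events_proj, corr, time):
    """Projection: drop all df, keep only selected events, then redo df inference."""
    corr_proj = {(e, n) for e, n in corr if e in events_proj}
    return corr_proj, infer_df(time, corr_proj)


DF = infer_df(TIME, CORR)
corr_sel, df_sel = select({"P1"}, CORR, DF)
corr_proj, df_proj = project({"e5", "e29", "e30"}, CORR, TIME)
```

Projecting away $e_9$ reconnects the df-path of I2, which is why projection has to recompute df-relationships while selection can simply filter them.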

The criteria by which we select events and entities can consider properties of events and entities but also relations to other event and entity nodes, and even more complex paths or sub-graphs. For example, to understand what caused delays in shipping order O1 ($e_{28}$) and O2 ($e_{32}$) while also removing unnecessary events, we can project the graph of Fig. 9 onto the events that (1) delayed either shipment the most (2) but are not *Unpack* events. Formally, we project onto $(delay^*(e_{32}) \cup delay^*(e_{28})) \setminus \{e \in N^{Event} \mid e.act = Unpack\}$. Figure 10 shows the resulting graph. Note the new df-relationships $(e_5, e_{30}) \in R^{df}_{I2}$, $(e_{19}, e_{27}) \in R^{df}_{Y1}$, $(e_{19}, e_{31}) \in R^{df}_{Y2}$, obtained after doing df-inference over the remaining events.

**Fig. 10.** Projection of Fig. 9 onto events that delayed $e_{28}$ and $e_{32}$ the most and are not *Unpack* events. Bold df-relationships indicate which preceding event delayed an event the most.

In Fig. 10, we observe the following: *Pack Shipment* for O1 ($e_{27}$) was delayed by Item Y1, which was only ready for $e_{27}$ after *Receive SO* ($e_{19}$). In turn, $e_{19}$ was delayed by Supplier Order B with *Update SO* ($e_7$), which we already identified as the cause for not receiving all items within 3 days in Sect. 5.1. *Pack Shipment* for O2 ($e_{31}$) was delayed by entity (O2, P1), that is, by *Clear Invoice* ($e_{30}$) for the Payment P1 related to O2. *Receive Payment* for P1 ($e_{29}$) was delayed by (O1, P1), that is, by *Ship* ($e_{28}$) for the related order O1.

Altogether, this allows us to pinpoint the bottlenecks in the process: *Update SO* delayed delivery of items Y1, Y2 needed for both O1 and O2, causing a delay in shipment for O1. Together with the retailer's policies, the fact that the customer only paid and cleared both invoices I1, I2 after O1 was shipped delayed shipping O2.

#### **5.4 Aggregating Events and Df-Relationships**

Selection and projection allow us to subset the data. Aggregation allows us to materialize new nodes and relationships in the data. While the aggregation principle we explain here can be applied for many purposes, we specifically discuss it for aggregating events into event classes (activities) and for aggregating the df-relationships between them.


The basic aggregation principle from sets of events to activities is formally identical to creating entity nodes from event properties as given in Definition 11.


The yellow rounded rectangles in Fig. 11 represent the *Class* nodes of the events for Orders O1, O2 and Supplier Orders A, B. The dashed edges represent the *observes* relationship, e.g., $e_2$ and $e_1$ both observe *Create Order*.

We can then aggregate the df-relationships in a straightforward way: for any two class nodes $c_1$ and $c_2$ we add a *df* relationship of type $ent$ from $c_1$ to $c_2$ if there are corresponding events $e_1$ and $e_2$ that directly follow each other for $ent$, i.e., if $(e_1, c_1), (e_2, c_2) \in R^{observes}$ and $(e_1, e_2) \in R^{df}_n$, $n.type = ent$. We can also count how many df-relationships occur between events of $c_1$ and $c_2$ and add this count as a property to this relationship.

For example, in Fig. 11, we observe two df-relationships from *Create Order* to *Create Invoice*: $(e_1, e_{18})$ and $(e_2, e_5)$. Note that this definition also creates self-loops around event classes, e.g., we observe three df-relationships from *Unpack* to *Unpack*. Also note that, as for event nodes, a class node can be part of *df* relationships for multiple different entity types, e.g., *Update SO* is an activity that occurs for *Order* and *Supplier Order*.
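The aggregation of df-relationships amounts to counting triples of (source class, target class, entity type). The following sketch uses the two df-relationships from *Create Order* to *Create Invoice* named above; the dictionary encoding of the *observes* relationship and the function name `aggregate_df` are assumptions made for illustration.

```python
from collections import Counter

# observes: event -> class (activity), for the events named in the text.
ACT = {
    "e1": "Create Order", "e2": "Create Order",
    "e18": "Create Invoice", "e5": "Create Invoice",
}
# df-relationships as (source, target, entity) and the type of each entity.
DF = [("e1", "e18", "O1"), ("e2", "e5", "O2")]
ENTITY_TYPE = {"O1": "Order", "O2": "Order"}


def aggregate_df(df, act, entity_type):
    """Count df-relationships per (source class, target class, entity type)."""
    counts = Counter()
    for src, tgt, entity in df:
        counts[(act[src], act[tgt], entity_type[entity])] += 1
    return counts


class_df = aggregate_df(DF, ACT, ENTITY_TYPE)
```

Keeping the entity type in the key is what distinguishes this aggregation from a classical directly-follows graph, where all df-relationships between two activities would be merged into one edge.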

#### **5.5 Discovering Multi-entity Process Models**

The aggregation operation of Sect. 5.4 essentially constructs a directly-follows graph. The key difference to the directly-follows graph of classical event logs is that each df-relationship between *Class* nodes is specific to one entity type. Thus, it respects the idea of the local directly-follows relation laid out in Definition 6. The resulting graph is a *multi-entity directly-follows graph*, also called *multi-viewpoint DFG* [4] or *artifact-centric model* [41].

**Fig. 11.** Aggregating events to event classes and lifting the directly-follows relationships

Applying the event and df-aggregation of Sect. 5.4 to the graph of Fig. 9 results in the multi-entity DFG shown in Fig. 12. While the graph as a whole is rather complex, each edge is grounded in temporal relations of a specific entity type. Moreover, we can see that the behavior for each entity type is rather simple.

Event and df-aggregation can be implemented as simple, scalable queries<sup>7</sup> over standard graph databases, enabling efficient in-database process discovery [25,34]; the queries can be extended to filter based on frequencies or properties of the event knowledge graph [28].

An alternative representation of the multi-entity DFG is the proclet model [26] shown in Fig. 13. It is constructed by not creating a global *Class* node per unique $e.Activity$ value in the data, but by creating a *Class* node per unique pair of activity name and entity type $(e.Activity, ent)$. As a result, we see, for example, two *Create Invoice* nodes, one for *Order* and one for *Invoice*. Two class nodes of the same name are linked by a *cardinality* relationship that indicates how many entities are involved in an event of this class. For example, in every *Create Invoice* event, one *Order* and one *Invoice* are involved, while in every *Receive SO* event one *Supplier Order* and 2-3 *Items* are involved.

<sup>7</sup> https://github.com/multi-dimensional-process-mining/eventgraph_tutorial.

**Fig. 12.** Multi-Entity Directly-Follows-Graph of the running example obtained by aggregating the graph of Fig. 9

**Fig. 13.** Synchronous proclet model of the running example obtained by aggregating the graph of Fig. 9

## **6 Beyond Control-Flow: Multi-dimensional Process Analysis**

So far, we analyzed the entities that are created and updated by the process based on the event data in Table 1. We now turn our attention to the *organizational entities* that actually make the process happen: the workers and supporting systems often called *resources*, and the work itself that is being carried out. Along the way, we showcase how flexible event knowledge graphs are. We integrate new events from a different data source in Sect. 6.1. We then enrich event knowledge graphs with df-paths over *activities* in Sect. 6.2, which reveals *queues*. Enriching event knowledge graphs with df-paths over *workers* in Sect. 6.3 reveals patterns of how individual workers perform larger scale *tasks*. Finally, we show how to infer new information from (enriched) event knowledge graphs in Sect. 6.4.

#### **6.1 Extending Event Knowledge Graphs with New Events**

The process is supported by an automated warehouse (see Fig. 1). Figure 14 shows events of how the *Items* were handled by the warehouse. To analyze how the warehouse influenced the process, we have to combine these events with the events from Table 1. Luckily, we can avoid combining both tables into one joint event table and repeating the entire procedure of Sect. 4.3. We can simply *locally update* an existing graph with new events as follows. We choose to start from the graph of Fig. 6.



**Fig. 14.** Warehouse events

3. For each entity node n inferred in step 2, remove every df-relationship r ∈ R*df* with r.*ent* = n, and then infer the df-relationships for n (Definition 12) now including the new imported events.

The resulting graph is shown in Fig. 15. Note that we can obtain the original graph of Fig. 6 again by selection of the original entities and projection onto the original events (see Sect. 5.3).
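The local update of df-relationships can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of the definitions: events are plain dictionaries, timestamps are abstract numbers, each df-relationship is a triple (source, target, entity), and the names `infer_df` and `add_events` are hypothetical.

```python
def infer_df(events, entity):
    """Link consecutive events of one entity, ordered by time."""
    correlated = sorted(
        (e for e in events if entity in e["entities"]), key=lambda e: e["time"]
    )
    return [(a["id"], b["id"], entity) for a, b in zip(correlated, correlated[1:])]


def add_events(events, df, new_events):
    """Add new events and re-infer df-relationships only for affected entities."""
    events = events + new_events
    affected = {ent for e in new_events for ent in e["entities"]}
    df = [r for r in df if r[2] not in affected]  # drop outdated relationships
    for ent in sorted(affected):
        df.extend(infer_df(events, ent))
    return events, df


# Hypothetical data: one item X1 and one order O1, then a new warehouse event.
events = [
    {"id": "e1", "time": 1, "entities": {"X1"}},
    {"id": "e3", "time": 3, "entities": {"X1"}},
    {"id": "e2", "time": 2, "entities": {"O1"}},
    {"id": "e4", "time": 4, "entities": {"O1"}},
]
df = [("e1", "e3", "X1"), ("e2", "e4", "O1")]
new_events = [{"id": "w1", "time": 2.5, "entities": {"X1"}}]
events2, df2 = add_events(events, df, new_events)
```

Only the df-path of X1 is recomputed; the df-relationship of O1 is left untouched, which is what makes the update local.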

**Fig. 15.** Event graph after extending Fig. 6 with Fig. 14 (new events highlighted).

#### **6.2 Adding Activities as Entities Reveals Queues**

We defined entity inference in Definition 10 for the entity type attributes of the source event table. However, Definition 10 can be applied to *any* property of an event node.

For example, if we pick the *Activity* property as "entity identifier", we infer entities such as *Receive SO*, *Unpack*, *Scan*, *Store*, *Retrieve*, *Pack Shipment*. These are not entities handled by the process. Rather, they are the actual building blocks of the process. For example, each *Item* handled has to "pass through" each of these entities to be completely processed. We can visualize how other entities "pass through" activities by inferring in the graph of Fig. 15 the entity nodes for *Activity* and their df-paths<sup>8</sup>. Figure 16 shows the resulting graph (limited to a subset of events for readability).

We can see that the (red) *Activity* df-paths "go across" all the existing df-paths while the (green) *Item* df-paths traverse the different *Activity* df-paths largely "in parallel". Whenever an *Item* df-path synchronizes with an *Activity*

<sup>8</sup> Note that the *Entity* nodes identified by the activity property are *semantically different* from the *Class* nodes identified by the activity property that we obtained in Sect. 5.4. The *Class* nodes semantically aggregate the existing *df* relationships between events observed for other entities to *df* relationships between *Class* nodes. *Entity* nodes of the entity type *Activity* instead derive new df relationships in addition to existing df relationships for other entities.

**Fig. 16.** Inferring *Activity* as entities in the graph of Fig. 15 reveals *Queues*. (Color figure online)

df-path in an event, the item is being worked on. Thus, we can interpret each *Activity* entity A as an abstract "work station" and its events as the work that is being performed there.

The space between two work stations A and B is a *queue* A : B, i.e., the space where *Items* after being worked on at A wait until being worked on at B. We can see in the graph in Fig. 16 that the *Items* do not always leave a queue in the same order they entered it: X1 entered *Unpack:Scan* after X3 (e10 follows e8 in the df-path for *Unpack*) but leaves before X3 (e12 precedes e16 in the df-path for *Scan*).

We can better understand this behavior by changing the layout of the graph in Fig. 16. We select from Fig. 16 only *Item* and *Activity* entities. Setting the x-coordinate of each event by its time property and the y-coordinate by its *Activity* entity results in the graph in Fig. 17, which is called the *Performance Spectrum* [16].

The Performance Spectrum shows us that batching happens at *Receive SO* and *Pack Shipment* (diverging/converging *Item* df-paths), that *Scan:Store* and *Store:Retrieve* are FIFO queues, and that *Unpack:Scan* is *not* a FIFO queue, e.g., X3 is overtaken by X1 and X2, and Y1 is overtaken by Y2.
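Whether a queue behaves as FIFO can be checked by comparing the order in which entities enter and leave it. The following Python sketch illustrates the idea; the timestamps are invented abstract numbers (not the values of the running example) and the function name `overtakes` is hypothetical.

```python
def overtakes(entered, left):
    """Return pairs (x, y) where x entered the queue after y but left before y."""
    items = list(entered)
    return sorted(
        (x, y)
        for x in items
        for y in items
        if entered[x] > entered[y] and left[x] < left[y]
    )


# Hypothetical times for a queue such as Unpack:Scan:
# X3 enters first, X1 enters later but leaves earlier.
entered = {"X3": 1, "X1": 2}
left = {"X1": 3, "X3": 4}

violations = overtakes(entered, left)
is_fifo = not violations
```

An empty result means that no entity overtook another one, i.e., the queue was FIFO for the observed entities.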

We already identified in Sect. 5 reasons why Order O2 was not shipped within the 6 days promised by the retailer (see Sect. 1). We can now also clarify the reasons for O1. Figure 17 shows that although the second supplier order B with the required item Y1 was received on *7-5* (the 6th day of O1), order O1 was only packed *after* the *15:00* pick-up time. The non-FIFO handling in *Unpack:Scan* seems to be at fault. We observe

**Fig. 17.** Sub-graph of Fig. 16 for *Activity* entities and *Item* entities, with event coordinates defined by *Activity* (y-axis) and *time* (x-axis), results in the *Performance Spectrum*.


Thus, if *Unpack:Scan* had followed a strict FIFO policy, Y1 could have completed its *Scan* activity at *7-5 12:45*; the subsequent *Pack Shipment* event over X1, X2, Y1 could have completed at *7-5 14:45* just before the scheduled pick-up at *7-5 15:00*.

The Performance Spectrum reveals further, far more involved patterns of process performance over time than just batching and FIFO [16]. It is also implemented as a visual analytics tool over event data [15] and in combination with process models [54]. Mining performance patterns from it [36] allows engineering so-called inter-case features for improving the accuracy of remaining time prediction [37].

#### **6.3 Adding Actors as Entities Reveals Complex Tasks**

We found in Sect. 6.2 that *Activity* entities describe the abstract "work stations" where other entities are being worked on. Workers are performing this actual work. Often called "resources" in process management literature [18], we prefer the term *Actor* used in organizations research [32], as each actor follows its own behavior. To study actor behavior in the graph of Fig. 6, we only have to (1) infer the *Actor* entities from the event nodes (see Table 1), and (2) infer each actor's df-path. Figure 18 shows the resulting graph.

We can see actors R1, R2, R3 working "intertwined" in the same part of the process. In contrast, R4 and R5 work more separated from the other actors. Also, the actor df-paths show very different characteristics. The df-path e1, e2, e3, e7 of R1 synchronizes with any other entity only in one event, and

**Fig. 18.** Adding actors (*Resource*) entities to the event knowledge graph reveals task execution patterns

then moves on to the next entity O1, O2, A, B, always performing just a single activity on each. In contrast, the df-path of R4 synchronizes over multiple subsequent events with the same entity, i.e., e27, e28 in O1 and e31, e32 in O2, meaning R4 always performs a "unit of work" that consists of two subsequent activities. Such a larger unit of work of multiple related activities is called a *task* [32,38].

A *task instance* of an actor R working on an entity X materializes in an event knowledge graph as a specific subgraph over event nodes e1,...,ek: (1) the df-paths of R and X both meet in e1, (2) diverge in ek, (3) synchronize in each event node e1,...,ek, and (4) at least one of their df-paths has no other event in between e1,...,ek [38]. The grey rectangles highlighted in Fig. 18 show several task instances. The task instances themselves and the way they are ordered in the graph reveal unique characteristics of performing work.
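To make the idea of a task instance more tangible, the following Python sketch groups an actor's df-path into maximal runs of events that also directly follow each other in the df-path of one entity. This is a deliberately stricter simplification of the conditions above (it requires *both* df-paths to have no other event in between, and it assumes that a pair of events is a df-relationship of at most one entity). The entity df-paths and the function name `task_instances` are hypothetical.

```python
def task_instances(actor_path, entity_paths):
    """Split an actor's df-path into maximal runs shared with one entity's df-path."""
    if not actor_path:
        return []
    # Map each entity-level df-relationship (a, b) to its entity.
    shared = {}
    for ent, path in entity_paths.items():
        for a, b in zip(path, path[1:]):
            shared[(a, b)] = ent
    instances = []
    current, current_ent = [actor_path[0]], None
    for a, b in zip(actor_path, actor_path[1:]):
        ent = shared.get((a, b))
        if ent is not None and current_ent in (None, ent):
            current.append(b)  # actor keeps working on the same entity
            current_ent = ent
        else:
            instances.append(current)  # actor moves on: close the run
            current, current_ent = [b], None
    instances.append(current)
    return instances


# Hypothetical df-paths of two orders (only the relevant events are listed).
entity_paths = {"O1": ["e1", "e27", "e28"], "O2": ["e2", "e31", "e32"]}
r4 = task_instances(["e27", "e28", "e31", "e32"], entity_paths)
r1 = task_instances(["e1", "e2"], entity_paths)
```

Under these assumptions, the path of R4 is split into two task instances of two events each, whereas every event of R1 forms a task instance of its own.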


Further, more complex types of task instances can be identified in event knowledge graphs [38]. The df-relationships between task instances also reveal patterns of how work is handed over between actors. For example, R1 hands work over to R2 in all *Supplier Orders*, to R3 in all *Orders*, and to R4 in O2; R2 hands work over to R4 in all *Items* and to R5 in I2. Such patterns are studied in the area of routines research [32].

We can clearly see some undesirable behavior in how actors collaborate over the different entities.


The process model shown in Fig. 21, and further explained in Sect. 7, describes for each actor behavioral routines that could avoid undesirable behavior.

#### **6.4 Inference in Event Knowledge Graphs with Multiple Layers**

Our discussions so far focused on constructing, understanding, and finding patterns in graphs over *Entity* and *Event* nodes and the *df* and *corr* relationships. As the model of event knowledge graphs (Definition 8) is based on labeled property graphs (Definition 7), we can extend an event knowledge graph with further node and relationship types to describe more knowledge about the process. We already did that in Sect. 5.4 when aggregating multiple *Event* nodes of the same activity to a new node with label *Class*. In the following, we expand on this idea by an example. We do so in the style of a process mining analyst applying all the concepts of the previous sections as data processing operations. In fact, all steps shown here can be realized through Cypher queries over a graph database.

Suppose we want to create a concise summary of how actors organize the work of handling *Supplier Orders*, based on the graph with actor df-paths shown in Fig. 18. The actors correlated to *Supplier Order* events are R1, R2, R3. We select entities A and B and R1, R2, R3 and then project onto events of a *Supplier Order* or between two *Supplier Order* events (to keep e9). The resulting graph is shown in Fig. 19 as "Event Entity Layer".

**Fig. 19.** An event knowledge graph extended with additional layers into a "process knowledge graph".

Next, we aggregate the event layer into a new "Task Instance Layer".

1. For each task instance, i.e., each subgraph of an *Actor* df-path and a *Supplier Order* df-path synchronizing on consecutive events as defined in Sect. 6.3, we extend the graph with a new node with label *TaskInstance*, resulting in the nodes ti3, ti4, ti6, ti7, ti9, ti21 shown in the "Task Instance Layer" of Fig. 19.


The resulting "Task Instance Layer" in Fig. 19 represents the "Event Entity Layer" at the aggregation level of task executions instead of activity executions.

5. To understand which tasks are performed and how often, we aggregate *TaskInstance* nodes into *Task* nodes by their *Task* property (see Sect. 5.4).

The resulting "Task Layer" in Fig. 19 shows four tasks *Place SO* (performed twice in ti3, ti4), *Update SO* (performed once in ti7), *Update Invoice* (performed once in ti9), and ⟨*Receive SO*, *Unpack*<sup>∗</sup>⟩ (performed twice in ti6, ti21).
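The aggregation of task instances into tasks is again a grouping by a property. The Python sketch below groups the task instances listed above by their task and then selects tasks by a frequency threshold (here: performed at least twice). The dictionary layout, the string used for the composite task, and the function name `aggregate_tasks` are assumptions for illustration.

```python
from collections import defaultdict

# Task property per task instance, as listed for the "Task Layer".
task_of = {
    "ti3": "Place SO",
    "ti4": "Place SO",
    "ti7": "Update SO",
    "ti9": "Update Invoice",
    "ti6": "<Receive SO, Unpack*>",
    "ti21": "<Receive SO, Unpack*>",
}


def aggregate_tasks(task_of):
    """Group task instances by the value of their task property."""
    tasks = defaultdict(list)
    for ti, task in task_of.items():
        tasks[task].append(ti)
    return dict(tasks)


tasks = aggregate_tasks(task_of)
# Keep only tasks observed in at least two task instances.
frequent = {task for task, tis in tasks.items() if len(tis) >= 2}
```

The set `frequent` can then serve as context for further aggregation steps that should only consider events belonging to frequent tasks.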

We now want to visualize the behavior of all actors regarding the *frequent tasks* in handling *Supplier Orders*, e.g., tasks performed at least twice. The visualization shall be on the abstraction level of the activities performed by actors, i.e., a multi-entity DFG. To achieve this, we aggregate the "Event Entity Layer" into a "Class Layer" using the "Task Layer" as context.


The resulting multi-entity DFG forms a new "Class Layer" in the graph, that is connected to the "Event Entity Layer" by *observes* relationships, as shown in Fig. 19. The multi-entity DFG shows that R1 and R2 work on disjoint sets of activities, and that R2 indeed follows a cyclic, structured behavior. The paths from *Class* nodes *Receive SO* and *Unpack* to the *Task* nodes show that all activities belong to the same task, i.e., one cycle is one "unit of work".

The multi-entity DFG is a filtered DFG: it lacks df-relationships for *Supplier Orders* and it omits *Update SO*. Thus, the multi-entity DFG *does not fit* or *deviates* from the "Event Entity Layer". We can identify the deviations in the multi-layered process knowledge graph in Fig. 19 similar to alignments [9]; see [8]. For instance, for df-relationship (*Unpack*, *Unpack*) ∈ R*df* in the "Class Layer", we see


# **7 Conclusion and Outlook**

The preceding sections studied different forms of process mining over multiple behavioral dimensions that are summarized in Fig. 20. We showed in Sect. 3 how classical process mining techniques fail when the assumption of a single entity handled by a single execution (bottom left quadrant in Fig. 20) is violated.

**Fig. 20.** Quadrants of process analysis over multiple behavioral dimensions

To overcome these assumptions, we introduced process mining with event knowledge graphs, which rests on three simple but fundamental principles:


Applying these principles, we constructed event knowledge graphs from standard event data through simple concepts in Sect. 4.3. We showed in Sect. 5 how to analyze processes where each execution involves *multiple related entities*, such as ERP systems and document-driven processes (bottom right quadrant in Fig. 20). We showed in Sect. 6 how event knowledge graphs also allow analyzing *multiple dynamics* together. We added actor and queue behavior to study how entities pass through queues or actors perform tasks across multiple entities, which are dynamics studied in call centers or in healthcare (top left quadrant in Fig. 20). Note that in Sect. 6, we always focused on a single entity processed in a queue or in a task. How to analyze the combination of *multiple dynamics* over *multiple entities* (top right quadrant in Fig. 20) is an open question.

Event knowledge graphs give rise to a number of novel research questions.

We have shown how to construct event knowledge graphs from event tables, even automatically [24,25]. We also need techniques to construct event knowledge graphs from relational databases while preserving the existing entities and relations. Existing automated conversion techniques from relational to graph databases [50] only convert records into entity nodes, while event knowledge graphs require constructing event nodes.

The quality of a process mining analysis on event knowledge graphs relies on having identified the relevant structural relations (between entities) and behavioral or cause-effect relations (between events) (see Sect. 4.4). We need automated techniques to infer relevant relations that take the temporal semantics of the df-relationship into account. Promising first steps are techniques that explicitly allow incorporating domain knowledge when inferring causal relationships from relational data [56], or that use ontologies [6,7] for extraction. Specifically, dynamically changing relationships and changes of object properties [39,40] still need to be considered.

We have sketched the possibility of structuring a complex process mining analysis by adding analysis layers to the graph, but limited ourselves to simple selection, projection, and aggregation queries. Adequate query languages that can also handle process-relevant phenomena such as frequency, noise, and performance in relation to multiple entities need to be considered. Also, more complex behavioral dynamics can be discovered. For example, enriching the event knowledge graph with the activity dimension to derive the performance spectrum (see Sect. 6.2) allows detecting subgraphs that indicate high workload (many events in a short interval) or a dynamic bottleneck (a short-term increase in waiting time) [51]. Aggregating these to "high-level events" and mining for cause-effect relations among them reveals how performance anomalies cascade through a process [51].

Finally, while we did discuss how to discover multi-entity directly-follows graphs through aggregation, true process discovery of models with precise semantics from event knowledge graphs still has to be addressed. In principle, such models can be discovered through principles of artifact-centric process mining [41,46]: First obtain a classical event log per entity type, e.g., by extracting the df-paths per entity type from the graph, and discover a classical process

**Fig. 21.** Synchronous proclet model for the graph of Fig. 9 extended with proclets describing the *intended* (not the observed) behavior for all actors.

model per entity type. Then compose the models of the different entity types to express their synchronization.
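The first step of this approach, obtaining a classical event log per entity type, can be sketched as follows in Python. The event records, the activity names assigned to the events, and the function name `logs_per_entity_type` are hypothetical; the discovery and composition steps are not shown.

```python
from collections import defaultdict


def logs_per_entity_type(events, entity_type_of):
    """Build one event log (entity id -> activity trace) per entity type."""
    paths = defaultdict(list)
    for ev in sorted(events, key=lambda ev: ev["time"]):
        for ent in ev["entities"]:
            paths[ent].append(ev["activity"])  # follow the entity's df-path
    logs = defaultdict(dict)
    for ent, trace in paths.items():
        logs[entity_type_of[ent]][ent] = trace
    return logs


# Hypothetical events: the second event is shared by an order and an invoice.
events = [
    {"time": 1, "activity": "Create Order", "entities": ["O1"]},
    {"time": 2, "activity": "Create Invoice", "entities": ["O1", "I1"]},
    {"time": 3, "activity": "Update Invoice", "entities": ["I1"]},
]
entity_type_of = {"O1": "Order", "I1": "Invoice"}
logs = logs_per_entity_type(events, entity_type_of)
```

The shared *Create Invoice* event appears in the logs of both entity types; such shared activities are the points at which the models of the different entity types are later synchronized.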

Figure 21 shows a possible process model that could be obtained in this way for our example, using a multi-entity extension for Petri nets, called *synchronous proclets* [26]. Each proclet is a Petri net that describes the behavior of one entity type. Note that the proclet model in Fig. 21 is a hybrid between a discovered and a manually created model. The proclets for *Order*, *Supplier Order*, *Invoice*, *Item*, *Payment*, and *(Order,Payment)* are each discovered from the entity type's df-paths of the graph in Fig. 9. The proclets for the *Actors*, however, are created manually<sup>9</sup>, describing the intended routine for each actor based on the insights in Sect. 6.3. Bold-bordered initial transitions describe the creation of a new entity; note that the proclets for actors do not have an initial transition but an initial marking, as actors are not created in the process. Dashed *synchronization edges* between transitions describe that the transitions have to occur together; the multiplicity annotations indicate how many entities of each type have to be involved. For instance, R1 creates 1 new *Order* in each occurrence of *Create Order*, but R4 always packs 2–3 *Items* into 1 *Shipment* in each occurrence of *Pack Shipment*.

An alternative formalization of this concept is *object-centric Petri nets* [53]. Object-centric Petri nets also first discover one Petri net per entity type, then annotate the places and arcs with entity identifiers, and then compose all entity nets along transitions for the same activity, resulting in a coloured Petri net model that is accessible for analysis [53] and measuring model quality [3]. However, synchronization by composition prevents explicitly modeling (and thus discovering) interactions between entities such as the relation from *Order* to *Payment* described by proclet *(Order,Payment)* in Fig. 21.

However, while proclets can describe entity interactions, the behavior of entity interactions tends to be rather unstructured, resulting in overly complex models [41]. Extensions of declarative models (see [10]) such as modular DCR graphs [14], which apply similar principles as synchronous proclets, could be more suitable. Alternatively, scenario-based models [31] that specify conditional partial orders of events over multiple entities could be applied. For instance, the conditional scenario in Fig. 22 specifies the interaction between *Orders* and *Payments* observed in the graph of Fig. 9.

**Fig. 22.** Conditional scenario describing an interaction of 2 *Orders* and 1 *Payment*.

Altogether, event knowledge graphs give rise to entirely novel forms of process mining that support novel forms of process management [17].

<sup>9</sup> We created one proclet per actor, as introducing a single proclet for all actors would result in a very complex proclet because different actors follow very different behavior. Further, the manually created model conveniently avoids the issue of having to lay out how *R*2 synchronizes both with *Supplier Order* and with *Invoice*.

## **References**



## **Predictive Process Monitoring**

Chiara Di Francescomarino and Chiara Ghidini

Fondazione Bruno Kessler, Trento, Italy {dfmchiara,ghidini}@fbk.eu

## **1 Introduction**

*Predictive Process Monitoring* [29] is a branch of process mining that aims at predicting the future of an ongoing (uncompleted) process execution. Typical examples of predictions of the future of an execution trace relate to the outcome of a process execution, to its completion time, or to the sequence of its future activities.

Being able to predict in advance the outcome of a process execution, the time that a process instance will require to complete, or the activities that will be executed next can be extremely valuable in several domains and scenarios, e.g., for production processes, allowing organizations to prevent undesired outcomes, issues, and delays. Indeed, unlike monitoring business processes in a *reactive* way [28], where a violation or delay is identified only after it has occurred, predicting the violation or issue before it occurs supports users and organizations in *preventing* it by taking appropriate countermeasures. Fueled also by the wave of technical developments in Data Science, Predictive Analytics, and data-driven Artificial Intelligence, the development of predictive techniques tailored to the field of Process Mining has rapidly established itself both as a vibrant research topic and as an impactful functionality with a direct application in innovative organizational contexts and process mining tools, which often go hand in hand. Examples are the development of new Predictive Process Monitoring pipelines for specific organizations (such as hospitals) [3] and the investigation of explainable Predictive Process Monitoring techniques performed together with leading Process Mining companies such as myInvenio<sup>1</sup> [18] with the aim of incorporating the features within their Process Mining tools (see also [36]).

Predictive Process Monitoring approaches usually leverage past historical complete executions in order to provide predictions about the future of an ongoing (incomplete) case. They usually have two phases: a *training or learning phase*, in which a predictive model is learned from historical (complete) execution traces and a *runtime or prediction phase*, in which the predictive model is queried for predicting the future of an ongoing case.
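The two phases can be illustrated with a deliberately naive Python sketch. It is not one of the techniques discussed in this chapter: the "model" is a plain frequency table over trace prefixes, and the activity names, outcome labels, and function names `train` and `predict` are invented for the example.

```python
from collections import Counter, defaultdict


def train(history):
    """Training phase: count outcomes for every prefix of every completed trace."""
    model = defaultdict(Counter)
    for trace, outcome in history:
        for k in range(1, len(trace) + 1):
            model[tuple(trace[:k])][outcome] += 1
    return model


def predict(model, prefix):
    """Prediction phase: return the most frequent outcome seen for this prefix."""
    counts = model.get(tuple(prefix))
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical historical (complete) traces with a known outcome label.
history = [
    (["A", "B", "C"], "ok"),
    (["A", "B", "D"], "ok"),
    (["A", "C"], "late"),
]
model = train(history)
ongoing_prediction = predict(model, ["A", "B"])
```

A prefix that was never observed during training yields no prediction here; realistic approaches instead generalize over prefixes through an encoding and a learned model.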

<sup>1</sup> Recently acquired by IBM as part of the IBM Process Mining suite. See www.ibm.com/cloud/cloud-pak-for-business-automation/process-mining/.

© The Author(s) 2022

W. M. P. van der Aalst and J. Carmona (Eds.): Process Mining Handbook, LNBIP 448, pp. 320–346, 2022. https://doi.org/10.1007/978-3-031-08848-3\_10

The chapter is structured as follows:<sup>2</sup> after introducing a simple explanatory example (Sect. 2) and the main dimensions characterizing the family of Predictive Process Monitoring approaches for business processes (Sect. 3), we describe the typical encodings and approaches used for the prediction of outcomes (Sect. 4), numeric values (Sect. 5), and sequences of activities and related payloads (Sect. 6). Finally, Sect. 7 presents new relevant trends in the context of operational support techniques based on Machine Learning and Sect. 8 introduces the main available open source tools supporting Predictive Process Monitoring tasks. We assume as a prerequisite for the next sections that the reader has some machine/deep learning knowledge, especially on classification and regression algorithms, as well as on recurrent neural networks. The interested reader can refer to [5,19,20].

## **2 Running Example**

During the execution of a business process, process participants cooperate to satisfy certain business constraints. At any stage of the process enactment, decisions are taken with the aim of satisfying these constraints. Being able to predict in advance certain aspects of a process execution allows organizations to take advantage of, or adapt to, a desirable future unfolding, or to prevent an undesirable scenario by taking the appropriate countermeasures.

In this chapter, we illustrate the potential and characteristics of Predictive Process Monitoring by means of a running example in a healthcare scenario.<sup>3</sup> The example describes the process of a patient going to a hospital to undergo a radiology exam and related medical checks. The process covers both the clinical aspects, such as the visit(s) and the radiology exam(s), and administrative issues, such as the admission to the radiology department, the computation of the medical bill, and its payment. During the process execution, the doctor has to make decisions on whether further exams are required and, if possible, issued. Depending on the examination, visits can precede and/or follow the radiology exam, which can range from Ultrasound to X-ray, PET, MRI, Breast Imaging, and so on. The process typically starts with the admission, followed by the execution of the medical activities (exams and visits) and the computation and payment of the bills. Different executions are nonetheless possible, such as a payment in advance, before the visit.

In this scenario, historical information about past executions of the process, and in particular data related to the clinical history of other patients with similar characteristics, could be used to support the hospital in predicting the unfolding of a certain execution. As an example, at a certain time during the process execution, one could predict whether a certain patient will require ultrasounds

<sup>2</sup> In this chapter we mostly focus on the main pipeline, omitting aspects mainly related to the preprocessing and evaluation phase.

<sup>3</sup> This example and its instantiations in the following sections are taken and inspired from the running example used in [25].

**Fig. 1.** Predictive Process Monitoring along three dimensions.

and/or at what time. This may be used by the hospital staff to improve or adapt the scheduling of their facilities.

## **3 The Family of Predictive Process Monitoring Approaches**

Although Predictive business Process Monitoring is a relatively young field, it has been growing fast in recent years, as witnessed by recent surveys on the topic [13,31]. As depicted in Fig. 1, the literature on predictive business process monitoring can be roughly classified along three main dimensions:


Concerning the type of prediction, according to the literature [13,46], we can classify the existing prediction types into three main categories:


**Fig. 2.** Types of predictions.

Figure 2 shows an example of an execution trace describing the activities carried out by John. Let us assume that it is 8:54 a.m. now. At 8:00 a.m. John registered at the hospital to undergo some health checks, at 8:10 he was taken to the radiology department where he was visited at 8:15, and he is now having X-rays. Predictive Process Monitoring would allow us to answer different types of questions on the future of John. For instance, we could predict whether John will undergo an ultrasound scan in the future. The answer to this specific question will be a boolean value (e.g., it is true that John will undergo an ultrasound in the future). This is a typical example of an outcome-based prediction. However, this class of predictions also includes predictions assuming categorical values, that is, values that range over a limited and fixed number of possible options. Examples are the class of discount that will be applied to a customer at the end of their shopping, the class of risk of a given execution, or, in our scenario, the specific exam out of a number of options. Another typical question Predictive Process Monitoring could allow us to answer about John's future is, once we know that he will undergo an ultrasound, in how much time he is going to have it. The answer to this question is generally provided in terms of a numeric value (e.g., John is going to have an ultrasound exam in 26 min) and is an example of a numeric-value prediction. Typical examples in this setting are predictions related to the remaining time of an ongoing execution and predictions related to the duration or to the cost of an ongoing case. Finally, we could even predict what John is going to do from now on. The answer to this question is a sequence of future activities (e.g., John will undergo an ultrasound, will ask for his bill and will pay it). Typical examples of predictions falling under this category refer to the prediction of the sequence of the future activities (and of their data payloads) of a process case upon its completion.

Predictive Process Monitoring approaches are usually characterized by two phases. In a first phase, the *training or learning* phase (see the light blue part in Fig. 3), one or more models are built or enriched by leveraging the information contained in the execution log. In the second phase, the *runtime or prediction phase* (see the light green part in Fig. 3), the learned model(s) is(are) exploited

**Fig. 3.** Types of PPM approaches.

in order to get predictions related to an ongoing execution trace. We can identify two main groups of approaches dealing with the prediction problem:


Finally, we can identify four different types of information that can be used as input to the Predictive Process Monitoring approaches, e.g., for building a model annotated with execution information or for building the features to be used by machine learning approaches:


**Fig. 4.** Information used for making predictions.

the timestamp associated to each event, the data payload of the event Visit patient also includes the doctor who has visited John, i.e., *Alice* (see the third row in Fig. 4).


In several approaches, more than one of these types of information is used in order to learn from the past.

After reporting a few more details on the model-based approaches and approaches leveraging machine learning in the next subsection, in the following sections, we will mainly focus on machine-learning approaches and on encodings taking into account event and data payload features. We will look in more detail at each of the three prediction type macro-categories, i.e., predicting outcomes, numeric values and sequences of activities (and related payloads), respectively.

**Fig. 5.** Overview of the typical phases of model-based approaches.

**Fig. 6.** Running example model-based approaches.

#### **3.1 Predictive Process Monitoring Approaches**

We report here an overview of the main phases related to the two main families of Predictive Process Monitoring approaches, i.e., model-based approaches and approaches based on machine learning.

Figure 5 shows the main phases characterizing the model-based approaches. At training time, an explicit and conformant model (see [7]) can either be already available or can be discovered from the historical traces (see [2,4]) using the optional *Model discovery* phase in Fig. 5. The model is then enriched with information related to the data (*Model enrichment*), as for instance the remaining time extracted from the historical traces. At runtime, the enriched model is used in order to return a prediction.

One of the main model-based approaches leverages a transition system as explicit control-flow model (see Definition 1 in [4]). The transition system is built based on a given abstraction of the representation of the events in the traces (e.g., the name of the activity), as well as of the representation of the state of the transition system, such as the *sequence* of activities executed so far or the *set* of activities that have occurred so far. For instance, let us consider the simple event log reported in Fig. 6 related to the example described in Sect. 2. Each case relates to a different patient and the corresponding sequence of events indicates the activities executed for a medical treatment of that patient. Given the variability of the process, different interplays are possible between the clinical and administrative activities. In particular, in sequence $\sigma_1$ the process starts directly with a visit (possibly due to urgency), while the administrative part is executed in the middle of the process; in sequence $\sigma_3$, instead, the process starts with a computation of the overall price (possibly due to the request of having a quote) before proceeding further. The timestamp of each event is reported in brackets next to the activity. For example, trace $\sigma_2$ refers to a process execution in which the activity Visit patient is executed at time 08:00, the activity Compute rate at time 10:00 and so on. Figure 7 shows the transition system computed using the name of the activity as event representation abstraction and the activity *set* as state representation.

**Fig. 7.** Annotated transition system obtained from the log reported in Fig. 6.

The transition system is then annotated, given a certain measurement function, such as the elapsed time or the remaining time, with the corresponding information extracted from the event log. For instance, information about the remaining time can be extracted from the traces and reported for each state of the transition system. This information is then used for making predictions, e.g., on the completion time of an ongoing trace, given a certain prediction function, such as the average remaining execution time. The transition system in Fig. 7, for instance, is annotated (in blue) with the remaining time of each trace in the event log of Fig. 6. Moreover, for each state, the average of these values is also computed and reported. For example, the state corresponding to the empty set of activities is annotated with the remaining time of each trace at the beginning of the execution, i.e., 11 h for $\sigma_1$, 8 h for $\sigma_2$ and so on.

When, at runtime, a prediction about the completion time of a new ongoing trace is required, the annotated transition system can be queried by looking at the state corresponding to the ongoing case and returning the value of the chosen prediction function. For instance, let us assume we want to predict the completion time of an ongoing case $\sigma_t$ = (Compute rate (CR) {*12:00*}, Visit patient (VP) {*13:00*}). Two measurements are associated to the corresponding state of the transition system in Fig. 7 (see the state in light green), i.e., 6 and 2 hours. Considering the average as prediction function, the average value of the measurements (4 hours) is added to the timestamp of the last observed event (13:00) to compute the predicted completion time, i.e., according to the prediction, the patient will complete the process at 17:00.
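The annotation and query steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the two-trace log below is hypothetical (it is not the log of Fig. 6), timestamps are simplified to whole hours, and the function names `annotate` and `predict_completion` are our own. The log is chosen so that the state {Compute rate, Visit patient} carries the measurements 6 and 2, as in the example above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log: each trace is a list of (activity, hour-of-day) events.
LOG = [
    [("Visit patient", 8), ("Compute rate", 10), ("Get payment", 16)],
    [("Compute rate", 9), ("Visit patient", 11), ("Get payment", 13)],
]

def annotate(log):
    """Map each state (set of activities seen so far) to the remaining times observed in it."""
    annotations = defaultdict(list)
    for trace in log:
        end = trace[-1][1]
        # Assumption: the empty-set state is annotated with the remaining time at the first event.
        annotations[frozenset()].append(end - trace[0][1])
        state = frozenset()
        for activity, timestamp in trace:
            state = state | {activity}
            annotations[state].append(end - timestamp)
    return annotations

def predict_completion(annotations, prefix):
    """Timestamp of the last observed event plus the average remaining time of the matching state."""
    state = frozenset(activity for activity, _ in prefix)
    measurements = annotations.get(state)
    if not measurements:
        return None  # the state was never observed in the log
    return prefix[-1][1] + mean(measurements)

annotations = annotate(LOG)
ongoing = [("Compute rate", 12), ("Visit patient", 13)]
predicted = predict_completion(annotations, ongoing)
```

With the set abstraction, the two historical traces reach the same state even though they execute the two activities in a different order, which is why both measurements contribute to the average.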

Several extensions of the original approach have been proposed, such as annotating the transition systems with machine learning models like Naïve Bayes and Support Vector Regression models [34], taking into account also data payloads [35], and combining the annotated transition systems with a context-driven predictive clustering approach [16,17]. Other model-based approaches consider instead *sequence trees* [9] or stochastic Petri nets [40,41] as explicit models to predict the remaining execution time of a process instance.

**Fig. 8.** Overview of the typical phases of approaches based on machine learning.

Figure 8 sketches the main phases of the typical approaches based on machine learning. These approaches usually require that trace prefixes are extracted from the historical execution traces (*Prefix extraction* phase). This is due to the fact that at runtime predictions are made on incomplete traces, so that correlations between incomplete traces and what we want to predict (*target variables* or *labels*) have to be learned in the training phase. After prefixes have been extracted, prefix traces and labels (i.e., the information that has to be predicted) are encoded in the form of feature vectors (*Encoding* phase). Encoded traces are then passed to the (supervised learning) techniques in charge of learning one (or more) predictive model(s) from the encoded data (*Supervised learning* phase). At runtime, the incomplete execution traces, i.e., the traces whose future is unknown, are also encoded as feature vectors and used to query the predictive model(s) so as to get the prediction (*Predicting* phase).
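As a minimal sketch of the *Prefix extraction* phase, the following Python function turns one completed, labelled trace into a set of training pairs. The function name `extract_prefixes`, the `max_len` parameter and the choice of keeping only proper prefixes are assumptions made for illustration; concrete approaches differ in which prefix lengths they retain.

```python
def extract_prefixes(trace, label, max_len=None):
    """Return (prefix, label) pairs for every proper prefix of a completed trace."""
    last = len(trace) - 1 if max_len is None else min(max_len, len(trace) - 1)
    return [(trace[:k], label) for k in range(1, last + 1)]

# Hypothetical completed trace with an outcome label.
trace = ["Visit patient", "Perform ultrasound", "Compute rate", "Get payment"]
pairs = extract_prefixes(trace, label=True)
```

Every prefix inherits the label of the completed trace it was cut from, so the learner sees the same kind of incomplete input at training time that it will be queried with at runtime.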

In this chapter we will mainly focus on approaches leveraging machine learning - and in particular supervised learning - techniques.

#### **4 Predicting Outcomes**

Outcome predictions are predictions related to (categorical) case outcomes [46]. Typical examples of outcome predictions in the Predictive Process Monitoring literature are predictions related to risks or to the fulfilment of a predicate [11,29].

Given an event log $L$ and a prefix execution trace $\sigma_i^m = \langle e_1, \ldots, e_m \rangle$ of length $m$, the overall idea is learning a function $f_c(L, \sigma_i^m)$ returning a categorical value $\widehat{label}_i$ that is as close as possible to $label_i$, i.e., the actual (categorical) value of the variable that we aim to predict (e.g., whether the predicate will be actually fulfilled).

**Fig. 9.** Running example with an outcome label

As described in the previous section, when dealing with approaches based on machine learning, one of the main steps to be carried out is encoding the information contained in (prefix) execution traces and corresponding labels in a format that is understandable by machine learning techniques. This allows the technique to train, and hence learn, a predictive model from the encoded data. In order to train a model, each (prefix) execution trace $\sigma_i$ (and its corresponding label) has to be represented through a feature vector $g_i = (g_{i1}, g_{i2}, \ldots, g_{ih}, label_i)$.

In this section (and in the next two sections) we will present first the typical encodings used with the corresponding type of predictions<sup>4</sup> and then the main (machine-learning) pipelines/approaches used to build the predictive model and query it.

#### **4.1 Typical Data Encodings**

To exemplify the different data encoding techniques, we consider the very simple log in Fig. 9 pertaining to our running example of Sect. 2. Similarly to the log used in Sect. 3.1, also in this log each case relates to a different patient and the corresponding sequence of events indicates the activities executed for a medical treatment of that patient. Visit patient is the first event of sequence σ1. Its data payload "{*33*, *radiology*}" corresponds to the data associated to attributes *age* and *department*<sup>5</sup>. Note that the value of *age* is static: it is the same for all the events in a case, while the value of *department* is different for every event. In the payload of an event, the entire set of attributes available in the log is considered as well. In case for some event the value for a specific attribute is not available, the value ⊥ (*unknown*) is specified for it.

Given a case prefix, we aim at predicting whether the patient will recover soon (*true*), or not (*false*). We report the corresponding value, i.e., the corresponding label, for each case after the semicolon in Fig. 9.

<sup>4</sup> Please note that some types of encodings can be used for different types of predictions. For instance, encodings related to outcome-based and numerical predictions are exactly the same, except for the type of the label.

<sup>5</sup> We omit here for simplicity the information related to timestamps.

**Table 1.** Typical outcome-based encodings for the example in Fig. 9.

(f) *complex index-based* encoding.

*Boolean Encoding.* In the *boolean encoding*, sequences of events are represented as feature vectors in such a way that each feature corresponds to an event class (an activity) from the log. In particular, the boolean encoding represents a sequence $\sigma_i$ through a feature vector $g_i = (g_{i1_A}, g_{i2_A}, \ldots, g_{ih_A}, label_i)$, where $h_A$ is the size of the event class alphabet $A = \{a_{1_A}, \ldots, a_{h_A}\}$ and, if $g_{ij_A}$ corresponds to the event class $a_{j_A} \in A$, then:

$$g_{ij_A} = \begin{cases} 1 & \text{if } a_{j_A} \text{ occurs in } \sigma_i \\ 0 & \text{if } a_{j_A} \text{ does not occur in } \sigma_i \end{cases}$$

For instance, the encoding of the example reported in Fig. 9 with the *boolean* encoding is shown in Table 1a.
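A minimal Python sketch of the boolean encoding is given below. The alphabet, the prefix and the label are hypothetical values chosen for illustration (they are not the content of Table 1a), and `boolean_encode` is our own helper name.

```python
def boolean_encode(trace, alphabet, label):
    """One 0/1 feature per event class, followed by the label."""
    seen = set(trace)
    return [1 if activity in seen else 0 for activity in alphabet] + [label]

ALPHABET = ["Visit patient", "Perform ultrasound", "Compute rate", "Get payment"]
prefix = ["Visit patient", "Compute rate", "Visit patient"]
vector = boolean_encode(prefix, ALPHABET, label=True)
```

Note that the repeated occurrence of Visit patient leaves no trace in the resulting vector, and neither does the order of the events.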

*Frequency-Based Encoding.* The *frequency-based* encoding, instead of boolean values, represents the control flow in a case with the frequency of each event class in the case. The frequency-based encoding $g_i = (g_{i1_A}, g_{i2_A}, \ldots, g_{ih_A}, label_i)$ of $\sigma_i$ is such that, if $g_{ij_A}$ corresponds to the event class $a_{j_A} \in A$, then:

$$g_{ij_A} = \begin{cases} n & \text{if } a_{j_A} \text{ occurs } n \text{ times in } \sigma_i \\ 0 & \text{if } a_{j_A} \text{ does not occur in } \sigma_i \end{cases}$$

Table 1b shows the *frequency-based* encoding for the example in Fig. 9, assuming that Visit patient occurs two times in $\sigma_i$ and Get Payment occurs four times in $\sigma_k$.
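The frequency-based encoding differs from the boolean one only in counting occurrences. The sketch below uses a hypothetical alphabet and prefix (not the content of Table 1b), and `frequency_encode` is a helper name introduced here.

```python
from collections import Counter

def frequency_encode(trace, alphabet, label):
    """One occurrence count per event class, followed by the label."""
    counts = Counter(trace)  # event classes that never occur count as 0
    return [counts[activity] for activity in alphabet] + [label]

ALPHABET = ["Visit patient", "Perform ultrasound", "Compute rate", "Get payment"]
prefix = ["Visit patient", "Compute rate", "Visit patient"]
vector = frequency_encode(prefix, ALPHABET, label=True)
```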

*Simple-Index Encoding.* Another way of encoding a sequence is by taking into account also information about the order in which events occur in the sequence, as in the *simple-index* encoding. Here, each feature corresponds to a position in the sequence and the possible values for each feature are the event classes. The resulting feature vector $g_i$ of the simple-index encoding of an execution trace $\sigma_i$ of length $m$ is $g_i = (a_{i1}, a_{i2}, \ldots, a_{im}, label_i)$, such that $a_{ik}$ corresponds to the event class of the event at position $k$ in $\sigma_i$. By using this type of encoding the example in Fig. 9 would be encoded as reported in Table 1c.
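A possible sketch of the simple-index encoding follows. Padding shorter prefixes to a fixed `length` with a placeholder value is an assumption we add so that all vectors have the same number of features; the text above only covers traces of the same length $m$. The activity names are hypothetical.

```python
def simple_index_encode(trace, label, length, pad=None):
    """Feature k holds the event class at position k; shorter traces are padded."""
    events = list(trace[:length])
    events += [pad] * (length - len(events))
    return events + [label]

vector = simple_index_encode(["Visit patient", "Compute rate"], label=False, length=3)
```

Since the feature values are event class names, a learner that only accepts numeric input still requires a further categorical-to-numeric step (e.g., one-hot encoding of each position).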

*Latest-Payload Encoding.* The *latest-payload* encoding takes into account both the static and the dynamic data attributes of the traces. The value of static attributes (trace attributes) is the same for all the events in the sequence, while the value of dynamic data attributes (event attributes) changes for different events. However, in this encoding, data attributes, also the dynamic ones, are all treated as static features without taking into consideration their evolution over time. Indeed, the latest-payload encoding encodes the static data attributes and the data of the latest payload. The latest-payload encoding $g_i$ of an execution trace $\sigma_i$ of length $m$ is $g_i = (s_i^1, \ldots, s_i^u, d_{im}^1, \ldots, d_{im}^r, label_i)$, where each $s_i$ is a static feature and each $d_{im}$ is a dynamic feature associated to the last event, i.e., the event at position $m$. Table 1d shows this encoding for the example in Fig. 9.
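The latest-payload encoding can be sketched as follows. We assume, for illustration only, that a trace is represented by a dictionary of static attributes plus a list of (activity, dynamic-attribute dictionary) pairs; the attribute values other than the first payload {33, radiology} are hypothetical.

```python
def latest_payload_encode(static, events, label):
    """Static attributes, then the dynamic attributes of the last event, then the label."""
    _, last_payload = events[-1]
    return list(static.values()) + list(last_payload.values()) + [label]

static = {"age": 33}
events = [
    ("Visit patient", {"department": "radiology"}),
    ("Perform ultrasound", {"department": "radiology"}),
    ("Compute rate", {"department": "administration"}),
]
vector = latest_payload_encode(static, events, label=True)
```

The control flow and the earlier values of *department* are discarded; only the snapshot at the last observed event is kept.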

*Index Latest-Payload Encoding.* The *index latest-payload* encoding adds the latest-payload encoding to the simple-index encoding. The resulting feature vector $g_i$, for a sequence $\sigma_i$, is $g_i = (s_i^1, \ldots, s_i^u, a_{i1}, a_{i2}, \ldots, a_{im}, d_{im}^1, \ldots, d_{im}^r, label_i)$, where each $s_i$ is a static feature, each $a_{ij}$ is the event class at position $j$ and each $d_{im}$ is a dynamic feature associated to the event at position $m$. Table 1e reports this encoding for the example in Fig. 9.

*Complex Index-Based Encoding.* In the *complex index-based* encoding, the dynamic nature of the dynamic information is considered and its evolution over time is taken into account. The resulting feature vector $g_i$, for a sequence $\sigma_i$, is $g_i = (s_i^1, \ldots, s_i^u, a_{i1}, a_{i2}, \ldots, a_{im}, d_{i1}^1, d_{i2}^1, \ldots, d_{im}^1, \ldots, d_{i1}^r, d_{i2}^r, \ldots, d_{im}^r, label_i)$, where each $s_i$ is a static feature, each $a_{ij}$ is the event class at position $j$ and each $d_{ij}$ is a dynamic feature associated to the event at position $j$. The example in Fig. 9 is transformed into the encoding shown in Table 1f.
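The ordering of the features in the complex index-based encoding is easy to get wrong, so a small sketch may help. As before, the trace representation (a static dictionary plus a list of (activity, payload) pairs), the helper name `complex_index_encode` and the data are assumptions made for illustration; missing attribute values are rendered as `None`, playing the role of ⊥.

```python
def complex_index_encode(static, events, dynamic_keys, label):
    """Static attributes, event classes by position, then each dynamic attribute by position."""
    activities = [activity for activity, _ in events]
    # Outer loop over attributes, inner loop over positions: d^1_{i1..im}, ..., d^r_{i1..im}.
    dynamics = [payload.get(key) for key in dynamic_keys for _, payload in events]
    return list(static.values()) + activities + dynamics + [label]

static = {"age": 20}
events = [
    ("Visit patient", {"department": "clinic"}),
    ("Perform X-Ray", {"department": "radiology"}),
    ("Perform ultrasound", {}),  # department unknown for this event
]
vector = complex_index_encode(static, events, ["department"], label=True)
```

The length of the vector grows with both the prefix length and the number of dynamic attributes, which is one reason why this encoding is often combined with bucketing by prefix length.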

#### **4.2 Mostly Used Approaches: Classification-Based Approaches**

Different pipelines and frameworks have been proposed for providing outcome predictions. Most of them rely on classification techniques<sup>6</sup> (e.g., Decision Tree, Random Forest, Support Vector Machine) for the *supervised learning* phase [12,23,25,29]. Moreover, most of these pipelines have been enriched with a *Bucketing* phase [46] (see the orange blocks in Fig. 10). The idea is that at training time multiple predictive models are trained. Specifically, the log of prefix traces is divided into multiple buckets and each bucket is used to train a different classifier. At runtime, the most suitable bucket is identified and the corresponding classifier is used for predicting the outcome.

<sup>6</sup> Note that deep learning techniques can also be used for predicting outcomes [52]; however, we focus here on the most used approaches.

**Fig. 10.** Typical outcome-based pipeline

The *Bucketing* phase has been instantiated in different ways in the Predictive Process Monitoring literature. For instance, in [12] trace clustering has been used to group prefix traces. Specifically, at training time, a clustering algorithm has been leveraged to cluster together prefix traces sharing a similar control flow. For each cluster, the data payload of the prefix traces in the cluster, once encoded in the proper format, has then been used to train a classifier. At runtime, the cluster of the incomplete ongoing trace is identified, i.e., the cluster containing the trace prefixes closest to the current incomplete trace, and the corresponding classifier is queried in order to get the prediction. In [25], instead, a bucket consists of a set of prefix traces of the same length. Also in this case, at training time, a classifier for each prefix length $k$ is built by learning from all prefix traces of length $k$. At runtime, the classifier built for the length of the ongoing trace is identified and used to return the prediction.
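The prefix-length bucketing can be sketched as follows. To keep the example self-contained, the per-bucket "classifier" is a majority-class baseline that merely stands in for a real classification technique such as a Decision Tree or a Random Forest; the function names and the tiny prefix log are hypothetical.

```python
from collections import Counter, defaultdict

def train_majority(examples):
    """Stand-in 'classifier': always predicts the most frequent label of its bucket."""
    most_common = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda prefix: most_common

def train_buckets(prefix_log, train=train_majority):
    """Group (prefix, label) pairs by prefix length and train one model per bucket."""
    buckets = defaultdict(list)
    for prefix, label in prefix_log:
        buckets[len(prefix)].append((prefix, label))
    return {length: train(examples) for length, examples in buckets.items()}

def predict(models, prefix):
    """Query the model trained on prefixes of the same length as the ongoing trace."""
    model = models.get(len(prefix))
    return None if model is None else model(prefix)

prefix_log = [
    (["VP"], True), (["VP"], True), (["CR"], False),
    (["VP", "CR"], False), (["CR", "VP"], False), (["VP", "PU"], True),
]
models = train_buckets(prefix_log)
```

Replacing `train_majority` with a function that encodes the prefixes of a bucket and fits an actual classifier leaves the bucketing logic unchanged.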

#### **5 Predicting Numeric Values**

Numeric value predictions are predictions related to quantitative measures of interest of business process executions. Typical examples of numeric predictions in the Predictive Process Monitoring literature are predictions related to time, cost or generic process performance [1,8,48].

Given an event log $L$ and a prefix execution trace $\sigma_i^m = \langle e_1, \ldots, e_m \rangle$ of length $m$, the overall idea is learning a function $f_n(L, \sigma_i^m)$ returning a numerical value $\widehat{label}_i$ that is as close as possible to $label_i$, i.e., the actual (numerical) value of the variable that we aim to predict (e.g., the remaining cycle time until the completion of the execution).

**Fig. 11.** Running example with a numeric label

#### **5.1 Typical Data Encodings**

Let us consider the running example of Fig. 9 and let us assume that this time we would like to predict the time required for completing the execution (reported in Fig. 11 after the semicolon).

Encodings typically used for numeric predictions are the same as the ones used for categorical predictions, except for the label, which is a numerical value rather than a boolean or a categorical value. Table 2 summarizes the boolean, frequency, simple-index, latest-payload, index latest-payload and complex-index encodings for numeric-based predictions.

#### **5.2 Mostly Used Approaches: Regression-Based Approaches**

Pipelines and frameworks proposed for numeric predictions are quite similar to the ones for outcome predictions. Most of them rely on regression techniques<sup>7</sup> (e.g., Regression Trees, Random Forest, XGBoost) for the *supervised learning* phase [23,29].

#### **6 Predicting Next Events**

Next event predictions are predictions related to the unfolding of the future events - until the end - of an incomplete ongoing trace [45]. Next event predictions can be related to the sequence of next event classes, but also to the next data payloads associated to the events, as for instance, the timestamps or the resources associated to the next event(s).

In case of activity predictions, given an event log $L$ and a prefix execution trace $\sigma_i^m = \langle e_1, \ldots, e_m \rangle$ of length $m$, the overall idea is learning a function $f_{sa}(L, \sigma_i^m)$ returning a sequence of next event classes that is as close as possible to $a_{m+1}, \ldots, \omega$, i.e., to the activity suffix of the current ongoing trace.

Most of the approaches for next activity predictions typically first learn a function $f_{1a}$ that, given the first $m$ events of a trace $\sigma_i^m$, predicts the next event class, i.e., the event class that will occur at time step $m + 1$. The suffix of the ongoing trace $\sigma_i^m$ is then predicted until the last event $\omega$ by predicting the next event iteratively, that is, by learning the function $f_{sa}$:

$$f_{sa}(L, \sigma_i^m) = \begin{cases} f_{1a}(L, \sigma_i^m) & \text{if } f_{1a}(L, \sigma_i^m) = \omega \\ f_{sa}(L, \langle e_1, e_2, \ldots, e_m, e \rangle) & \text{otherwise, with } f_{1a}(L, \sigma_i^m) \text{ as } e\text{'s event class} \end{cases} \tag{1}$$

Similarly, when predicting the values of the next events' data attribute x, e.g., the next timestamps, the idea is learning a function f*sx*(L, σ*<sup>m</sup>*) returning a sequence of values of the data attribute x that is as close as possible to the sequence of values actually held by the attribute x in the next events of the ongoing trace.

In the next subsection describing the typical data encodings, we mainly focus on the encoding for the next event class prediction. The results can then be extended to the prediction of other data attributes related to the next event, as well as to predictions related to next events, as described in (1).
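The iterative scheme of (1) can be sketched in Python as follows. The next-activity predictor used here is a simple most-frequent-successor (bigram) model, which is only a stand-in for a learned $f_{1a}$; the `END` marker (playing the role of $\omega$), the `max_len` guard against non-terminating predictions, the function names and the abbreviated activity names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

END = "END"  # plays the role of the last event ω

def learn_next_activity(log):
    """Stand-in for f_1a: predict the most frequent successor of the last activity."""
    successors = defaultdict(Counter)
    for trace in log:
        for current, following in zip(trace, trace[1:] + [END]):
            successors[current][following] += 1

    def next_activity(prefix):
        counts = successors.get(prefix[-1])
        return counts.most_common(1)[0][0] if counts else END

    return next_activity

def predict_suffix(prefix, next_activity, max_len=20):
    """Append the predicted next activity to the prefix until END is predicted."""
    suffix, current = [], list(prefix)
    while len(suffix) < max_len:
        activity = next_activity(current)
        if activity == END:
            break
        suffix.append(activity)
        current.append(activity)
    return suffix

log = [["VP", "PU", "CR", "GP"], ["VP", "PU", "CR", "GP"], ["VP", "CR", "GP"]]
suffix = predict_suffix(["VP"], learn_next_activity(log))
```

The same loop applies when the stand-in is replaced by a trained model, as long as it maps a prefix to a next event class.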

#### **6.1 Typical Data Encodings**

Let us consider the running example described in Fig. 9 enriched with timestamp information and let us assume that we want to predict the next activity related to the next time step (i.e., the activity at time step m+1). The actual activity at time step m + 1 is reported after the semicolon for the training traces in Fig. 12.

<sup>7</sup> Note that deep learning techniques can also be used for numeric predictions [45]; however, we focus here on the most used approaches.

(b) *one-hot with temporal features* encoding

**Fig. 12.** Running example with next activity as label

**Table 3.** Typical sequence-based encodings for the example in Fig. 12.


*One-Hot Encoding.* The *one-hot encoding* allows categorical data to be transformed into a numeric format. It relies on the existence of an alphabet of activities. Given the set $A = \{a_{1_A}, \ldots, a_{h_A}\}$ of all possible activities, an ordering function $idx: A \to \{1, \ldots, |A|\} \subseteq \mathbb{N}$ is defined on it, such that $a_{i_A} \neq a_{j_A}$ if and only if $i_A \neq j_A$, i.e., two activities have the same $A$-index if and only if they are the same activity.

For instance, in the example in Fig. 12, if the activity alphabet is $A$ = {Visit patient, Perform ultrasound, Compute rate, Get Payment, Check X-ray, Emit receipt}, the function $idx: A \to \{1, 2, 3, 4, 5, 6\}$ can be defined such that idx(Visit patient) = 1, idx(Perform ultrasound) = 2, idx(Compute rate) = 3 and so on. Each event $e_{ij} \in \sigma_i$ is then encoded as a vector $(A_{ij})$ where the features are all set to 0, except the one occurring at the index of its event class, which is set to 1. In the training phase, the event class of the next event $e_{m+1}$, which represents the target variable or label, is also encoded in the corresponding vector $(A_{i,m+1})$. The trace is finally encoded by composing the vectors obtained from all activities in the trace and the next activity into a matrix. The encoding of the trace $\sigma_i$ is hence given by $g_i = ((A_{i1}), \ldots, (A_{im}), (A_{i,m+1}))$. The one-hot encoding related to the example in Fig. 12 is reported in Table 3a.
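A minimal sketch of this encoding in Python is shown below. The alphabet is the one listed above, while the prefix and the next activity are hypothetical; note that Python list positions are 0-based, whereas the function idx above is 1-based.

```python
def one_hot(activity, alphabet):
    """Vector of zeros with a single 1 at the position of the event class."""
    vector = [0] * len(alphabet)
    vector[alphabet.index(activity)] = 1
    return vector

def one_hot_encode(prefix, next_activity, alphabet):
    """One row per event of the prefix, plus a final row for the label."""
    return [one_hot(activity, alphabet) for activity in prefix + [next_activity]]

ALPHABET = ["Visit patient", "Perform ultrasound", "Compute rate",
            "Get Payment", "Check X-ray", "Emit receipt"]
matrix = one_hot_encode(["Visit patient", "Compute rate"], "Get Payment", ALPHABET)
```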

*One-Hot Encoding with Temporal Features.* The one-hot encoding, which takes into account only the activities, can be enriched with other information. For instance, another encoding used with activity sequences combines the one-hot encoding of features related to event classes and features related to time [45]. In the *one-hot encoding with temporal features*, given the set $A = \{a_{1_A}, \ldots, a_{h_A}\}$ of all possible activities, each event $e_{ij} \in \sigma_i$ is encoded as the one-hot encoding of its event class enriched with three additional features pertaining to time. The first one relates to the time difference between the considered event and the previous event ($\delta_i$), the second one reports the time since midnight ($h_i$), thus allowing for distinguishing between working and night time, and the last one refers to the time since the beginning of the week ($w_i$), thus allowing for distinguishing between business and non-working days. Also in this case, in the training phase, the label, i.e., the event class of the next event $e_{m+1}$, is encoded with the one-hot encoding. The one-hot encoding with temporal features related to the example in Fig. 12 is reported in Table 3b.
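The three temporal features can be computed from event timestamps as in the following sketch. Expressing all features in seconds, setting the time difference of the first event to 0 and taking Monday 00:00 as the beginning of the week are assumptions made here; the timestamps are hypothetical, and in practice the values are typically normalized before being appended to the one-hot vector of each event.

```python
from datetime import datetime

def temporal_features(timestamps):
    """Per event: (time since previous event, time since midnight, time since Monday 00:00)."""
    features = []
    previous = None
    for ts in timestamps:
        delta = 0.0 if previous is None else (ts - previous).total_seconds()
        since_midnight = ts.hour * 3600 + ts.minute * 60 + ts.second
        since_week_start = ts.weekday() * 86400 + since_midnight
        features.append((delta, since_midnight, since_week_start))
        previous = ts
    return features

# Two events on Tuesday, 1 March 2022, at 08:00 and 10:00.
stamps = [datetime(2022, 3, 1, 8, 0), datetime(2022, 3, 1, 10, 0)]
features = temporal_features(stamps)
```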

*Embedding-Based Encoding.* The *embedding-based encoding* is typically used when the number of possible values of one or more categorical variables is high, so that the one-hot encoding would produce very high-dimensional and sparse feature vectors. In the embedding-based encoding, categorical data with an alphabet of possible values of size $m$ is mapped into an $n$-dimensional embedding space (where $n$ is the chosen dimensionality of the embedding space) that encodes the values of the categorical attribute so that values that are closer in the vector space are expected to be similar.

#### **6.2 Mostly Used Approaches: LSTM-Based Approaches**

Most of the approaches for next event predictions rely on Recurrent Neural Networks and, more specifically, on LSTM (Long Short-Term Memory) architectures [6,26,45].<sup>8</sup> This type of deep learning approach, by using recurrent connections in a single block (LSTM cell), is indeed particularly suitable to deal with sequence problems. Different types of LSTM architectures have been proposed in the literature for predicting the label associated to the next event and its data attributes.

For instance, in [45] three types of architectures have been proposed in order to predict both the next activity and the timestamp of the next event and then, iteratively, the suffix and the remaining cycle time: a first type with separate layers for activity and timestamp prediction, a second type with shared LSTM layers for both activity and timestamp prediction and finally a third one with some shared and some separate layers. The architecture proposed in [6] for predicting the next activity and its timestamp, as well as the remaining cycle time and the suffix of a running case, is a composition of LSTMs and feedforward layers. In [26] an encoder-decoder framework based on LSTMs is proposed to predict the next activity and the suffix of an ongoing case. The encoder maps an input sequence into a set of high-dimensional vectors and the decoder maps them back into a new sequence that can be used for prediction tasks.

#### **7 New Trends in ML-Driven Operational Support**

Besides the mainstream works in the field of Predictive Process Monitoring, new research trends and directions focusing on ML-driven operational support have recently started being investigated and developed. Some of these new trends are summarised in the following subsections.

<sup>8</sup> Note that the usage of LSTM architectures is not limited to next event predictions: they are also used for outcome and numerical predictions. Nevertheless, they have been most widely used in the literature for this type of predictions.


**Table 4.** Simple-index encoding enriched with some inter-case features for the example in Fig. 9.

#### **7.1 Intercase Predictions**

In classical works, Predictive Process Monitoring methods assume that the predicted value of interest of an ongoing case only depends on intra-case information, such as the execution history of that specific case. This assumption results in encodings that include past events, inter-event durations, and other case-related attributes. However, the intra-case-only assumption does not hold in many real-life scenarios. For example, in situations where cases share limited resources, the completion time of a case heavily depends on other cases that are running at the same time [42,43].

Inter-case information can be encoded in different ways, for instance by aggregating data related to traces running simultaneously. Examples of aggregated inter-case information that can be encoded together with the intra-case features are the number of traces and the average duration of traces being executed in the same time window in which the considered trace (prefix) is being executed, e.g., the number of traces and the average duration of traces executed on the same day as the current prefix trace. Table 4 shows an example of a simple-index encoding enriched with these two simple inter-case features related to the example reported in Fig. 9, where we assume that 10 other traces are running on the same day in which $\sigma_1^m$ is being executed and that their average duration is 6 hours, while 18 traces are running simultaneously to $\sigma_k^m$ with an average duration of 8 hours.
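The two aggregated inter-case features can be computed and appended to an intra-case encoding as in the sketch below. The representation of the other cases as (execution day, duration in hours) pairs, the function names and the data are hypothetical and are not the values of Table 4.

```python
from datetime import date

def inter_case_features(target_day, other_traces):
    """Number and average duration (hours) of the other traces executed on target_day."""
    same_day = [duration for day, duration in other_traces if day == target_day]
    count = len(same_day)
    average = sum(same_day) / count if count else 0.0
    return count, average

def enrich(simple_index_vector, target_day, other_traces):
    """Insert the two inter-case features just before the label."""
    *features, label = simple_index_vector
    count, average = inter_case_features(target_day, other_traces)
    return features + [count, average, label]

others = [(date(2022, 3, 1), 5), (date(2022, 3, 1), 7), (date(2022, 3, 2), 9)]
vector = enrich(["Visit patient", "Compute rate", True], date(2022, 3, 1), others)
```

Aggregating over a time window keeps the number of added features constant, regardless of how many cases run concurrently.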

Taking into account the inter-case dimension is a challenging problem, since, on the one hand, we would like to take into account as much inter-case information as possible as the levels of dependencies among cases can greatly vary in different scenarios and, on the other hand, encoding several features for a large number of simultaneously running cases may lead to a feature space explosion.

#### **7.2 Explainable Predictions**

In many applications of Predictive Process Monitoring techniques, users are asked to trust a model helping them make decisions. However, users need a certain level of trust towards the predictive model: a doctor will not operate on a patient simply because the operation has been predicted or recommended by the model. Understanding the rationale behind predictions would certainly help users decide when to trust or not to trust them.

Explainability techniques are a way to implement responsible process decision making (see [30]) and can help towards this aim. Different explainability techniques have been proposed in the XAI (Explainable Artificial Intelligence) literature. Some of these techniques have already been applied in the field of Predictive Process Monitoring in order to support users in understanding the overall predictive model [33] or the specific predictions it provides [18,44,51], either with model-agnostic techniques, i.e., techniques that can be applied to any predictive model, as in the case of [18], or with techniques specific to the predictive model used, as in the case of XNAP [51] and the attention layer [44] for neural networks.

**Fig. 13.** Example of an explanation plot related to the prediction for $\sigma_j$.

As an example of prediction explanation related to a trace instance,<sup>9</sup> let us assume that we have trained our predictive model by encoding the training set of the example reported in Fig. 9 with the complex-index encoding (see Table 1f) and that, for our current ongoing trace $\sigma_j$ = (Visit patient {*20*, *clinic*}, Perform X-Ray {*20*, *radiology*}, Perform ultrasound {*20*, *radiology*}), which we have observed up to event 3, the prediction of our predictive model is that the patient will recover soon. In order to understand whether or not we can trust the prediction, we would need to understand why our predictive model has returned such a prediction. Figure 13 shows an example of a possible explanation returned by a prediction explainer such as LIME [37] or SHAP [27] applied to our specific Predictive Process Monitoring problem. The plot shows the impact of each feature (and related value) towards (in case of positive values) or against (in case of negative values) the fast recovery of the patient.<sup>10</sup> In the example, the feature that has impacted most on the prediction of the fast recovery of the patient is her young age.<sup>11</sup>

<sup>9</sup> Note that we provide here the idea of prediction explanations focusing on those related to a trace instance. However, aggregated trace prediction explanations (event log explanations) [18], as well as prediction model explanations [44] have also been investigated in the literature.

<sup>10</sup> Note that the semantics of the values on the x axis changes according to the explanation technique used for the plot. For instance, in the case of SHAP, the values on the x axis represent the SHAP values of the feature (and the related value) for the specific instance, that is, the contribution of the feature towards the prediction with respect to the average value.

Furthermore, the explanations used for making predictions more trustworthy for the users can also be used for understanding the reasons why a predictive model is wrong and hence for improving the model accuracy [38].

#### **7.3 Predictions with A-Priori Knowledge**

Past event logs, or more generally knowledge about the past, are not the only important source of knowledge that can be leveraged to make predictions. In many real-life situations, some case-specific additional knowledge (*a-priori knowledge*) about the future is available together with past execution data, and can be leveraged for improving the predictive power of a Predictive Process Monitoring technique. Indeed, this additional a-priori knowledge is what characterizes the future context of the process executions that will affect the development of the currently running cases.

We can think for instance to the occurrence of a strike, which may cause the delay or the cancellation of a flight in the travel process of a passenger, or to the temporary unavailability of a surgery room, which may delay or even rule out the possibility of executing certain activities in a patient treatment process. In this kind of scenarios, the information about the strike or about the unavailability of the surgery room is often available in advance. However, traditional Predictive Process Monitoring approaches, which only learn from the most frequent observed behaviours, are not able to take into account this knowledge. They will predict that the next activities of the passenger will be the usual ones, as if there is no strike, e.g., having the security check, moving to the boarding gate 3, boarding, ... . While it is impractical to retrain the predictive algorithms to take into consideration this additional knowledge every time it becomes available, it is also reasonable to assume that considering it in some way would allow the Predictive Process Monitoring algorithm to predict for instance that the passenger will be moved to gate 2 and that there will be no boarding, and hence to improve the accuracy of the predictions on an ongoing case.

<sup>11</sup> Note that different types of explanations can be returned depending on the type of encoding that has been used.

**Fig. 14.** Beam search in the a-priori approach.

One way to deal with a-priori knowledge is to take this knowledge K into account at prediction time by guiding the Predictive Process Monitoring algorithm towards a solution that is compliant with the a-priori knowledge [14]. In [14], for instance, an approach using LSTM for predicting the next activities has been enriched with a mechanism able to take into account background knowledge K, expressed in terms of LTL formulae, in order to guide the LSTM algorithm to make predictions compliant with the a-priori knowledge. The LSTM approach keeps returning likely predictions on the suffix of the current ongoing trace (up to the last event ω) until it finds a suffix that is compliant with K. In more detail, the LSTM network uses a beam search algorithm that considers, at each time step, the top beam-width bw most likely next events. Figure 14 shows the idea of the beam-search approach with bw = 2. σ<sub>*m*</sub> = ⟨Take shuttle, Enter via door 3, Check in⟩ is the current ongoing trace at time step m. At time step m + 1, among the three possible next events we take the bw most likely next events (the green nodes in Fig. 14) and keep exploring those future paths. At time step m + 2, we again select the 2 most likely next events and keep exploring the next events of these sequences. Whenever we find a sequence that is not compliant with K, as at time step m + 3, we discard that path and keep on exploring bw compliant paths. We stop whenever we predict the last event ω (see the circle with the thicker border) and the considered trace is still compliant with K.
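The beam-search idea can be illustrated with a minimal sketch. It is not the implementation of [14]: the LSTM is replaced by a hypothetical toy function `toy_next_probs` that returns next-event probabilities, the knowledge K is modelled as a plain Python predicate instead of LTL formulae, and all activity names beyond those mentioned in the text are assumptions made for the example. The prefix is assumed to be non-empty.

```python
from typing import Callable, Dict, List, Optional, Tuple

END = "ω"  # hypothetical marker for the last event of a trace


def predict_compliant_suffix(
    prefix: List[str],
    next_probs: Callable[[List[str]], Dict[str, float]],
    compliant: Callable[[List[str]], bool],
    bw: int = 2,
    max_steps: int = 20,
) -> Optional[List[str]]:
    """Beam search over suffixes of a non-empty prefix, pruning paths that violate K."""
    beam: List[Tuple[float, List[str]]] = [(1.0, list(prefix))]
    for _ in range(max_steps):
        candidates: List[Tuple[float, List[str]]] = []
        for prob, trace in beam:
            if trace[-1] == END:  # finished traces are carried over unchanged
                candidates.append((prob, trace))
                continue
            for activity, p in next_probs(trace).items():
                extended = trace + [activity]
                if compliant(extended):  # discard paths that are not compliant with K
                    candidates.append((prob * p, extended))
        # Keep only the bw most likely compliant paths.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:bw]
        if not beam:
            return None
        if beam[0][1][-1] == END:  # the most likely path is complete and compliant
            return beam[0][1][len(prefix):]
    return None


# Toy stand-in for the LSTM: probabilities depend on the last event only (assumption).
TOY_MODEL = {
    "Check in": {"Security check": 1.0},
    "Security check": {"Move to gate 3": 0.7, "Move to gate 2": 0.3},
    "Move to gate 3": {"Boarding": 0.9, END: 0.1},
    "Move to gate 2": {"Boarding": 0.8, END: 0.2},
    "Boarding": {END: 1.0},
}


def toy_next_probs(trace: List[str]) -> Dict[str, float]:
    return TOY_MODEL[trace[-1]]


def strike(trace: List[str]) -> bool:
    """A-priori knowledge K for the strike: no move to gate 3 and no boarding."""
    return not {"Move to gate 3", "Boarding"} & set(trace)


prefix = ["Take shuttle", "Enter via door 3", "Check in"]
usual = predict_compliant_suffix(prefix, toy_next_probs, lambda t: True)
with_knowledge = predict_compliant_suffix(prefix, toy_next_probs, strike)
```

Without K, the sketch returns the usual continuation (security check, gate 3, boarding); with K, the non-compliant paths are pruned and the predicted suffix moves the passenger to gate 2 with no boarding.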

#### **7.4 Prescriptive Process Monitoring**

Predictive Process Monitoring techniques are able to predict the likelihood of a positive outcome, the time required for completing an execution, or the next activities that will be executed. However, all these techniques are limited to prediction. They do not further support stakeholders in deciding whether it is worth intervening to avoid undesired outcomes, or what to do next to optimize a given Key Process Performance Indicator (KPI) [24,32,47,50].

*Prescriptive Process Monitoring* aims to overcome this limitation of Predictive Process Monitoring by recommending to stakeholders whether to take actions in order to prevent or mitigate the occurrence of an undesired outcome [32,47], or which activities to perform in order to optimize a certain measure of interest [24,50].

In the first scenario [32,47], predictions are used to evaluate, through a cost model, the tradeoffs between the cost of intervening to mitigate undesired outcomes and the cost of compensating unnecessary interventions. For instance, in the example related to the patient recovery described in Sect. 4, if the prediction related to an ongoing trace is that the patient will not recover soon, a surgery may increase the likelihood that the patient will recover soon and hence reduce the overall cost for the hospital. However, the surgery has a cost, so if it has been planned because of a wrong prediction, that cost is unnecessary and should have been avoided.

In the second scenario [24,50], predictions are used to uncover the future of different continuations of the current trace, so as to identify and hence recommend the one(s) leading to the best value for the KPI of interest. For instance, we can consider the example of the patient recovery described in Sect. 5. If the aim is to recommend the next activities that minimize the remaining cycle time until the completion of the execution of an ongoing trace σ<sub>*m*</sub> of length m, the possible next activities at step m + 1 can be considered. For each possible continuation σ<sub>*m*+1</sub> of σ<sub>*m*</sub>, the remaining time until the end of the execution can be predicted, and the next activity corresponding to the minimum predicted remaining time is recommended.
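A minimal sketch of such a tradeoff is given below. It is an illustrative expected-cost comparison, not the cost model of [32,47]: the function name, the parameters (cost of the undesired outcome, cost of the intervention, effectiveness of the intervention, cost of compensating an unnecessary intervention), and the numbers are all assumptions.

```python
def should_intervene(
    p_undesired: float,        # predicted probability of the undesired outcome
    cost_outcome: float,       # cost paid if the undesired outcome occurs
    cost_intervention: float,  # cost of intervening (e.g., the surgery)
    effectiveness: float,      # fraction of undesired outcomes the intervention avoids
    cost_compensation: float = 0.0,  # extra cost of an unnecessary intervention
) -> bool:
    """Intervene only if the expected cost with intervention is lower than without."""
    expected_without = p_undesired * cost_outcome
    expected_with = (
        cost_intervention
        + p_undesired * (1 - effectiveness) * cost_outcome
        + (1 - p_undesired) * cost_compensation
    )
    return expected_with < expected_without


# A likely slow recovery justifies the intervention, an unlikely one does not.
likely = should_intervene(0.8, cost_outcome=100, cost_intervention=30,
                          effectiveness=0.5, cost_compensation=10)
unlikely = should_intervene(0.2, cost_outcome=100, cost_intervention=30,
                            effectiveness=0.5, cost_compensation=10)
```

With these illustrative values, the expected cost with intervention is 72 against 80 in the first case, and 48 against 20 in the second, so only the first prediction triggers the intervention.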
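The recommendation step of this second scenario can be sketched as follows, assuming a hypothetical predictor `predict_remaining_time` that maps a trace prefix to a predicted remaining time; the toy predictor and its values (in days) are invented for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def recommend_next_activity(
    trace: List[str],
    candidates: List[str],
    predict_remaining_time: Callable[[List[str]], float],
) -> Tuple[str, Dict[str, float]]:
    """Predict the remaining time of each continuation and pick the minimum."""
    predictions = {a: predict_remaining_time(trace + [a]) for a in candidates}
    best = min(predictions, key=predictions.get)
    return best, predictions


# Toy predictor (assumption): remaining days depend only on the last activity.
TOY_REMAINING_DAYS = {
    "Perform X-Ray": 5.0,
    "Perform ultrasound": 3.5,
    "Perform surgery": 9.0,
}


def toy_predictor(trace: List[str]) -> float:
    return TOY_REMAINING_DAYS[trace[-1]]


best, predictions = recommend_next_activity(
    ["Visit patient"], list(TOY_REMAINING_DAYS), toy_predictor
)
```

In a realistic setting the toy predictor would be replaced by a trained remaining-time model, and the candidate activities would be restricted to those that are actually allowed after the current prefix.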

## **8 Tool Support**

The research related to Predictive Process Monitoring has been paired with the development of non-commercial plugins and tools intended to be used and improved by the research community. In the following, we briefly illustrate three of the main open-source tools supporting Predictive Process Monitoring.

#### **8.1 Predictive Process Monitoring in ProM**

ProM [15] is one of the most widely used and best-known tools in Process Mining. It is a framework collecting a number of plugins, which work independently of one another and each focus on implementing a specific task. Among its variety of plugins, ProM also collects several plugins implementing techniques for the prediction of outcomes (e.g., [8,29]), for the prediction of numerical values (e.g., [1,10,16,23]), as well as for the prediction of next activity sequences (e.g., [35]). Some of them leverage model-based approaches (e.g., [1]), while others rely on machine-learning solutions (e.g., [10]).

#### **8.2 Predictive Process Monitoring in Apromore**

Apromore [22] is a well-known and established tool. It is an advanced process model repository that allows users to store, analyse, and re-use large sets of process models. The tool is web-based and therefore allows the easy integration of new plug-ins in a service-oriented manner. It aims both at helping practitioners deal with the challenges faced by process stakeholders, and at allowing researchers to develop and benchmark their own techniques, with a strong emphasis on the separation of concerns. The only plug-in addressing Predictive Process Monitoring challenges in Apromore is the one described in [49]. This plug-in performs outcome-based and numeric predictions, as well as next event predictions.

#### **8.3 Predictive Process Monitoring in Nirdizati**

*Nirdizati* [21,39] is a web-based application for supporting users in building, comparing, and analyzing predictive models that can then be used to perform predictions on the future development of an ongoing case. Differently from the other tools, Nirdizati specifically addresses Predictive Process Monitoring problems. Nirdizati, which collects a rich set of state-of-the-art approaches based on machine learning algorithms, supports users in dealing with different predictive monitoring tasks: outcome-based, numeric, and next activity predictions. Moreover, it provides services for supporting the users in tuning the hyperparameters of the specific technique, the possibility of adding some simple intercase features in the encodings, as well as some incremental algorithms, so that the predictive model can be incrementally updated as soon as new execution traces are available. Finally, it also offers several plots for visualising the results, thus supporting the users in comparing predictive models.

**Acknowledgements.** The work described in this chapter reflects the effort of a number of people. We would like to thank Marlon Dumas, Marcello La Rosa, Anna Leontjeva, Fabrizio Maria Maggi, Williams Rizzi, Arik Senderovich, Irene Teinemaa, Ilya Verenich, and Anton Yeshchenko for their precious cooperation and all the (also joint) work that led to this chapter.

## **References**



# **Assorted Process Mining Topics**

# **Streaming Process Mining**

Andrea Burattin

Technical University of Denmark, 2800 Kgs. Lyngby, Denmark andbur@dtu.dk

**Abstract.** Streaming process mining refers to the set of techniques and tools which have the goal of processing a stream of data (as opposed to a finite event log). The goal of these techniques, similarly to their counterparts described in the previous chapters, is to extract relevant information concerning the running processes. This chapter presents an overview of the problems related to the processing of streams, as well as a categorization of the existing solutions. Details about control-flow discovery and conformance checking techniques are also presented together with a brief overview of the state of the art.

**Keywords:** Streaming process mining · Event stream

## **1 Introduction**

Process mining techniques are typically classified according to the task they are meant to accomplish (e.g., control-flow discovery, conformance checking). This classification, though very meaningful, may fall short when it is necessary to decide which algorithm, technique, or tool to apply to solve a problem in a given domain, where nonfunctional requirements impose specific constraints (e.g., when the results should be provided or when the events are recorded).

Most algorithms, so far, have focused on a static event log file, i.e., a finite set of observations referring to data collected during a certain time frame in the past (cf. Definition 1 [1]). In many settings, however, it is necessary to process and analyze the events as they happen, thus reducing (or, potentially, removing) the delay between the time when the event has happened in the real world and the time when useful information is distilled out of it. In addition, the events being produced are becoming so numerous and complex [22] that storing them for further processing is becoming less and less appealing. To cope with these issues, event-based systems [4] and event processing systems [19] can become extremely valuable tools: instead of being stored for later processing, the events are processed as they arrive, so that corresponding reactions can be taken immediately. In addition, event-based systems are also responsive systems: this means they are capable of reacting autonomously when deemed necessary (i.e., only when new events are observed).

Coping with the above-mentioned requirements in the context of data analysis led to the development of techniques to analyze streams of data [3,21]. A data stream is, essentially, an unbounded sequence of observations (e.g., events), whose data points are created as soon as the event happens (i.e., in real-time). Many techniques have been developed, over the years, to tackle different problems, including frequency counting, classification, clustering, approximation, time series analysis and change diagnosis (also known as novelty detection or concept drift detection) [20,46]. Process mining techniques applied to the analysis of data streams go under the name of *streaming process mining* [8], and both control-flow discovery and conformance checking techniques will be discussed later in the chapter.

The rest of this chapter is structured as follows: this section presents typical use cases for streaming process mining and the background terminology used throughout the chapter. Section 2 presents a possible taxonomy of the different approaches for streaming process mining, which can also be used to drive the construction and the definition of new ones. Section 3 introduces the problem of streaming process discovery, by presenting a general overview of the state of the art and the details of one algorithm. Section 4 sketches the basic principles of streaming conformance checking. As in the previous case, this section also starts with a state-of-the-art summary and then dives into the details of one algorithm. Section 5 mentions other research endeavors of streaming process mining and then concludes the chapter.

#### **1.1 Use Cases**

This subsection aims at giving an intuition of potential use cases for streaming process mining. In general, every setting that requires drawing conclusions before the completion of a running process instance is a good candidate for the application of streaming process mining. In other words, streaming process mining is useful whenever it is important to understand running processes rather than improving *future* ones or "forensically" investigating those *from the past*.

Process discovery on event streams is useful in domains that require a clear and timely understanding of the behavior and usage of a system. For example, let's consider a web application to self-report the annual tax statement for the citizens of a country. Such a system, typically, requires a lot of data to be inserted over many forms and, usually, the majority of its users have to interact with help pages, FAQs, and support information. In this case, it might be useful to understand and reconstruct the flow of a user (i.e., one process instance) to understand if they are getting lost in a specific section, or in specific cycles and, if necessary, provide tailored help and guidance support. Since the ultimate goal is to improve the running process instances (i.e., helping the users currently online), it is important that the analyses process the events immediately and that corresponding reactions are implemented straight away.

Conformance checking on event streams is useful whenever it is important to immediately detect deviations from reference behavior to enact proper countermeasures. For example, let's consider operational healthcare processes [31]. In most of these cases (in particular in the case of non-elective care processes, such as urgent or emergency ones) it is critically important to have a sharp understanding of each individual patient (i.e., one process instance), their treatment evolution, as well as how the clinic is functioning. For example, when treating a patient having acute myeloid leukemia, it is vital to know that the treatment is running according to the protocol and, if deviations were to occur, it is necessary to initiate compensation strategies immediately.

Another relevant example of streaming conformance checking could derive from the investigation of the system calls of the kernel of an operating system when used by some services or applications. These calls should be combined in some specific ways (e.g., a file should be opened with open(), then either write() or read() or both appear, and eventually the file should be closed with close()), which represent the reference behavior. If an application is observed strongly violating such behavior, it might be an indication of suspicious activities going on, for example an attempt to bypass some limitations or privileges. Exactly the same line of reasoning could be applied when consuming RESTful services.
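As a minimal sketch of this use case, the reference behavior can be encoded as a small state machine and each system call checked as soon as it is observed. The sketch assumes that the case identifier is a hypothetical file handle, and that at least one read() or write() must occur between open() and close(); the state names and the class `SyscallConformanceChecker` are illustrative and not part of any technique discussed in this chapter.

```python
from collections import Counter
from typing import Dict

# Reference behavior (assumption): open, then one or more reads/writes, then close.
REFERENCE = {
    "closed": {"open": "opened"},
    "opened": {"read": "in_use", "write": "in_use"},
    "in_use": {"read": "in_use", "write": "in_use", "close": "closed"},
}


class SyscallConformanceChecker:
    """Checks every observed call against the reference state machine."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}      # file handle -> current state
        self.deviations: Counter = Counter()  # file handle -> number of deviations

    def observe(self, case: str, call: str) -> bool:
        current = self.state.get(case, "closed")
        target = REFERENCE[current].get(call)
        if target is None:
            # The call is not allowed in the current state: report a deviation
            # immediately and keep the state unchanged.
            self.deviations[case] += 1
            return False
        if target == "closed":
            self.state.pop(case, None)  # forget completed cases to save memory
        else:
            self.state[case] = target
        return True


checker = SyscallConformanceChecker()
stream = [("f1", "open"), ("f2", "read"), ("f1", "write"),
          ("f2", "open"), ("f1", "close"), ("f2", "close")]
results = [checker.observe(case, call) for case, call in stream]
```

Each call to `observe` processes one event and returns a verdict straight away, so deviations (here, reading from a file that was never opened and closing a file that was never used) are detected while the process instance is still running.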

Additional use cases and real scenarios are depicted in the research note [5], where real-time computing<sup>1</sup>, to which streaming process mining belongs, is identified as one of the impactful information technology enablers in the BPM field.

#### **1.2 Background and Terminology**

This section provides the basic background on streams as well as the terminology needed in the rest of the chapter.

A stream is a sequence of observable units which evolves over time by including new observations, thus becoming unbounded<sup>2</sup>. An *event stream* is a stream where each observable unit contains information related to the execution of an event and the corresponding process instance. In the context of this chapter, we assume that each event is inserted into the stream when the event itself happens in the real world. The universe of observable units O can refer to the activities executed in a case (thus having O ⊆ U<sub>*act*</sub> × U<sub>*case*</sub>, cf. Definition 1 [1], as discussed in Sect. 3) or to other properties (in Sect. 4, the observable units refer to relations B between pairs of activities, i.e., O ⊆ (B × U<sub>*act*</sub> × U<sub>*act*</sub>) × U<sub>*case*</sub>).

**Definition 1 (Event stream).** *Given a universe of observable units* O*, an event stream is an infinite sequence of observable units:* $S : \mathbb{N}_{\geq 0} \to O$.

We define an operator *observe* that, given a stream S, returns the latest observation available on the stream (i.e., *observe*(S) ∈ O is the latest observable unit put on S).

<sup>1</sup> Please note that the paper explicitly mentions that, in that context, "*real-time computing refers to the so-called near real-time, in which the goal is to minimize latency between the event and its processing so that the user gets up-to-date information and can access the information whenever required*", thus perfectly matching our notion of streaming process mining.

<sup>2</sup> Please note that, in the literature, it is possible to distinguish other streaming models, where elements are also deleted or updated [21]. However, in this chapter we will assume an "insert-only model".

Due to the nature of streams, algorithms designed for their analyses are required to fulfill several constraints [6,7], independently from the actual goal of the analyses. These constraints are:


As detailed in [21, Table 2.1], it is possible to elaborate on the differences between systems consuming data streams and systems consuming static data (from now on, we will call these "offline"): in the streaming setting, the data elements arrive incrementally (i.e., one at a time), and this requires the analysis to be incremental as well. These events are transient, meaning that they are available for a short amount of time (during which they need to be processed). In addition, elements can be analyzed at most once (i.e., no unbounded backtracking), which means that the information should be aggregated and summarized. Finally, to cope with concept drifts, old observations should be replaced by new ones: while in offline systems all data in the log is equally important, when analyzing an event stream the "importance" of events decreases over time.

In the literature, algorithms and techniques handling data streams are classified into different categories, including "online", "incremental", and "real-time". While *real-time systems* are required to perform the computation within a given deadline (and, based on the consequences of not meeting the deadline, they are divided into hard/soft/firm), *incremental systems* just focus on processing the input one element at a time, with the solution being updated accordingly (no emphasis/limit on the time). An *online system* is similar to an incremental one, except for the fact that the extent of the input is not known in advance [37]. Please note that, in principle, both real-time and online techniques can be used to handle data streams, thus we prefer the more general term *streaming techniques*. In the context of this chapter, the streaming techniques are in between the family of "online" and "soft real-time": though we want to process each event *fast*, the notion of deadline is not always available, and, when it is, missing it is not going to cause a system failure but just a degradation of the usefulness of the information.

When instantiating the streaming requirements in the process mining context, some of the constraints bring important conceptual implications, which change the typical way process mining is studied. For example, considering that each event is added to the stream when it happens in the real world means that the traces we look at will be incomplete most of the time. Consider the graphical representation reported in Fig. 1a, where the red area represents the portion of time during which the streaming process mining is active. Only in the first case (i.e., instance i) are events referring to a complete trace seen. In all other situations, just incomplete observations are available, either because events happened before we started observing the stream (i.e., instance l, suffix trace), because the events have still to happen (i.e., instance k, prefix trace), or because of both (i.e., instance j, subsequence trace). Figure 1b is a graphical representation of what it means to give results at any time in the case of conformance: after each event, the system needs to be able to communicate the conformity value as new events come in. Also, the result might change over time, thus adapting the computation to the new observations.

## **2 Taxonomy of Approaches**

Different artists are often quoted as saying: "Good artists copy, great artists steal". Regardless of who actually said this first<sup>3</sup>, the key idea is the importance of understanding the state of the art to incorporate the key elements into newly designed techniques. Streaming process mining techniques have been around for some years now, so it becomes relevant to understand and categorize them in order to relate them to each other and derive new ones.

<sup>3</sup> Many people, including Pablo Picasso, William Faulkner, Igor Stravinsky, and several others are often referred to as the "first author" of some version of the quote. Actually, investigating the history of this quote on the Internet represents a formative yet very procrastination-prone activity (see also https://xkcd.com/214/).


**Fig. 2.** Taxonomy of the different approaches to solve the different stream process mining problems. For each technique, corresponding general steps are sketched [10].

It is possible to divide the currently available techniques for streaming process mining into four categories. A graphical representation of such taxonomy is available in Fig. 2, where three main categories are identified, plus a fourth one, which represents possible mixes of the others. In the remainder of this section, each category will be briefly presented.

*Window Models.* The simplest approach to cope with infinite streams consists of storing only a set of the most recent events and, periodically, analyzing them. These approaches store a "window" of information that can be converted into a static log. Then, standard (i.e., offline) analyses can be applied to the log generated from such a window. Different types of windowing models can be used and classified based on how the events are removed [34]. These models can be characterized along several dimensions, including the unit of measurement (i.e., whether events are kept according to some logical or physical units such as the time of the events or the number of events); the edge shift (i.e., whether either of the two bounds of a window is fixed to a specific time instant or whether they change over time); and the progression step (i.e., how the window evolves over time, assuming that either of the bounds advances one observation at a time or several data points are incorporated at once). These profiles can create different window models, such as:


**Algorithm 1:** Count-based window model process mining algorithm

    Input: S: event stream
           M: memory
           max_M: number of observations to keep
           A: additional information (e.g., a reference model), can be ∅
    forever do
        // Observe a new event
        e ← observe(S)
        // Memory update
        if size(M) ≥ max_M then
            dequeue(M)              // Forgetting
        end
        insert(M, e)
        // Mining update
        if perform mining then
            // Memory into event log
            L ← convert(M)
            ProcessMining(L, A)
        end
    end

Algorithm 1 reports a possible representation of an algorithm for process mining on a count-based window model. The algorithm uses a FIFO queue as its memory model and consists of a never-ending loop whose first step is the observation of a new event. After that, the memory is checked for maximum capacity and, if this has been reached, the oldest observed event is removed before the new one is inserted. Then, the mining can take place: the memory is first converted into an event log suitable for process mining, and then the actual mining algorithm is run on that log.
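The steps of Algorithm 1 can be sketched in Python as follows. The sketch makes several assumptions: events are (case, activity) pairs, the stream is a finite list standing in for an unbounded source, mining is triggered every `mine_every` events, and the "mining algorithm" simply counts directly-follows relations. The helper names `convert` and `directly_follows` are illustrative.

```python
from collections import Counter, deque
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, str]  # (case identifier, activity)


def convert(memory: Iterable[Event]) -> Dict[str, List[str]]:
    """Turn the window into a static log: one trace per case, in arrival order."""
    log: Dict[str, List[str]] = {}
    for case, activity in memory:
        log.setdefault(case, []).append(activity)
    return log


def directly_follows(log: Dict[str, List[str]]) -> Counter:
    """Stand-in for an offline mining algorithm: count directly-follows pairs."""
    counts: Counter = Counter()
    for trace in log.values():
        counts.update(zip(trace, trace[1:]))
    return counts


def window_mining(
    stream: Iterable[Event],
    max_m: int,
    mine_every: int,
    process_mining: Callable[[Dict[str, List[str]]], Counter] = directly_follows,
) -> Iterator[Counter]:
    memory: deque = deque()  # FIFO queue
    for n, event in enumerate(stream, start=1):
        if len(memory) >= max_m:
            memory.popleft()  # forgetting: drop the oldest event
        memory.append(event)
        if n % mine_every == 0:  # "perform mining" condition (assumption)
            yield process_mining(convert(memory))


stream = [("c1", "a"), ("c2", "a"), ("c1", "b"), ("c2", "c"), ("c1", "d")]
small_window = list(window_mining(stream, max_m=3, mine_every=5))
large_window = list(window_mining(stream, max_m=10, mine_every=5))
```

With a window of three events, only the relation (b, d) survives, whereas a window large enough to hold the whole stream also yields (a, b) and (a, c): the log extracted from a small window contains only trace fragments.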

Window-based models come with many advantages such as the capability of reusing any offline process mining algorithm already available for static log files. The drawback, however, comes from the inefficient handling of the memory: window-based models are not very efficient for summarizing the stream, i.e., the logs generated from a window suffer from strong biases due to the rigidity of the model.

*Problem Reduction.* To mitigate the issues associated with window models, one option consists of employing a problem reduction technique (cf. Fig. 2). In this case, the idea is to *reduce* the process mining problem at hand to a simpler yet well-studied problem in the general stream processing field in order to leverage existing solutions and derive new ones. An example of a very well studied problem is *frequency counting*: counting the frequencies of variables over a stream (another example of a relevant and well-studied problem is sampling). To properly reduce a process mining problem to a frequency counting one, it is important to understand what a variable is in the process mining context and if it is indeed possible to extract information by counting how often a variable occurs.

#### **Algorithm 2:** Lossy Counting

```
Input: S: data stream
       ε: maximal approximation error
1  T ← ∅                  // Initially empty set
2  N ← 1                  // Number of observed events
3  w ← ⌈1/ε⌉              // Bucket width
4  forever do
5      e ← observe(S)
6      bcurr ← ⌈N/w⌉
       // Is there a tuple in T with e as first component?
7      if e is already in T then
8          Increment the frequency of e in T
9      else
10         Insert (e, 1, bcurr − 1) in T
11     end
12     if N mod w = 0 then
13         forall the (a, f, Δ) ∈ T s.t. f + Δ ≤ bcurr do
14             Remove (a, f, Δ) from T
15         end
16     end
17     N ← N + 1
18 end
```

An algorithm to tackle the frequency counting problem is called Lossy Counting [30], described in Algorithm 2 and graphically depicted in Fig. 3. Conceptually, the algorithm divides the stream into "buckets", each of them with a fixed width (line 3), and keeps track of the index of the current bucket (line 6). The width of the bucket is derived from one of the inputs of the algorithm (ε ∈ [0, 1]), which indicates the maximal acceptable approximation error in the counting. Lossy Counting keeps track of the counting by means of a data structure T, where each component (e, f, Δ) refers to the element e of the stream (the variable to count), its estimated frequency f, and the maximum number of occurrences Δ that might have been missed before the entry was created (i.e., the maximum error). Whenever a new event is observed (line 5), if the value is already in the memory T, then the counter f is incremented by one (line 8). If, instead, there is no such value, a new entry is created in T with the value e corresponding to the observed variable, frequency f = 1, and maximum error equal to the number of the current bucket minus one (line 10). From here it is possible to understand the role of ε: since the bucket width is the inverse of the maximum allowed error, the higher the allowed error, the narrower the buckets, the faster the bucket number grows, and hence the higher the approximation error. With a fixed periodicity (i.e., every time a conceptual bucket ends) the algorithm cleans the memory, by removing elements not frequently observed (lines 12–16). Please note that this specific algorithm has no memory bound: the size of its data structure T depends on the stream and on the maximal approximation error (e.g., if the error is set close to 0 and the observations never repeat, the size of T keeps growing). Variants of the algorithm enforcing a fixed memory bound are available as well [18] but are not described in detail here. In the rest of this chapter, a set whose entries are structured and updated using the Lossy Counting algorithm (i.e., T in Algorithm 2) will be called Lossy Counting Set.

**Fig. 3.** Graphical representation of the Lossy Counting algorithm.

**Fig. 4.** Demonstration of the evolution of the internal data structure constructed by the Lossy Counting on a simple stream. Each color refers to a variable.

Figure 4 shows a demonstration of the evolution of the Lossy Counting Sets over time (at the end and at the beginning of each virtual bucket) for the given stream. In this case, for simplicity, the background color of each box represents the variable that we are counting. The counting is represented as the stacking of the blocks and, below, in red, the maximum error for each variable is reported.
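Algorithm 2 translates almost line by line into Python. The following sketch assumes a finite iterable in place of the unbounded stream and returns the Lossy Counting Set once the input is exhausted; the function name and the example stream are illustrative and do not correspond to the stream of Fig. 4.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_counting(
    stream: Iterable[Hashable], epsilon: float
) -> Dict[Hashable, Tuple[int, int]]:
    """Return the Lossy Counting Set as {element: (f, delta)}."""
    T: Dict[Hashable, List[int]] = {}  # element -> [frequency f, maximum error delta]
    w = math.ceil(1 / epsilon)         # bucket width
    for n, e in enumerate(stream, start=1):
        b_curr = math.ceil(n / w)      # index of the current bucket
        if e in T:
            T[e][0] += 1               # increment the frequency of e
        else:
            T[e] = [1, b_curr - 1]     # new entry with maximum error b_curr - 1
        if n % w == 0:                 # end of a bucket: clean the memory
            infrequent = [a for a, (f, delta) in T.items() if f + delta <= b_curr]
            for a in infrequent:
                del T[a]
    return {a: (f, delta) for a, (f, delta) in T.items()}


# epsilon = 0.25 gives buckets of width 4.
result = lossy_counting(["a", "a", "b", "a", "c", "a", "b", "d", "a", "b"], 0.25)
# result == {"a": (5, 0), "b": (1, 2)}
```

In this example, `a` is counted exactly (5 occurrences), while `b` actually occurs 3 times: its earlier occurrences were dropped during the cleanups, so the stored frequency is 1 and the true count lies between f = 1 and f + Δ = 3.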

The most relevant benefit of reducing the problem to a known one is the ability to employ existing as well as new solutions in order to improve the efficiency of the final process mining solution. Clearly, this desirable feature comes at the cost of the complexity of the steps (both conceptual and computational) required for the translation.

*Offline Computation.* Due to some of the constraints imposed by the streaming paradigm, one option consists of moving parts of the computation offline (cf. Fig. 2) so that performance requirements are met when the system goes online. This idea implies decomposing the problem into sub-problems and reflecting on whether some of them can be solved without the actual streaming data. If that is the case, these sub-problems will be solved beforehand and the corresponding pre-computed solutions will be available as the events are coming in.

**Fig. 5.** Conceptualization of the streaming process discovery. Figure from [16].

Such an approach comes with the advantage of caching results that would otherwise be extremely expensive to compute online. It still suffers from several limitations, since it cannot be applied to all streaming process mining problems. Additionally, by computing everything in advance, we lose the possibility of adapting the pre-computed solutions to the actual context, which might be uniquely specific to the running process instance.

*Hybrid Approaches.* As a final note, we should not rule out the option of defining *ensemble* methods that combine different approaches together (see Fig. 2).

## **3 Streaming Process Discovery**

After introducing the general principles and taxonomy of techniques to tackle streaming process mining, in this section, we will specifically analyze the problem of streaming process discovery.

A graphical conceptualization of the problem is reported in Fig. 5: the basic idea is to have a source of events that generates an event stream. Such an event stream is consumed by a miner which keeps a representation of the underlying process model updated as the new events are coming in.
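A minimal sketch of such a miner is shown below, assuming that events are (case, activity) pairs and that the "representation of the underlying process model" is simply the count of the direct following relations. The class name `DirectlyFollowsMiner` is illustrative; both dictionaries grow without bound, which is exactly the issue that the approximation strategies of Sect. 2 are meant to address.

```python
from collections import Counter
from typing import Dict, Tuple


class DirectlyFollowsMiner:
    """Keeps directly-follows counts updated as events are observed."""

    def __init__(self) -> None:
        self.last_activity: Dict[str, str] = {}  # case -> latest activity seen
        self.relations: Counter = Counter()       # (a, b) -> number of times b follows a

    def observe(self, case: str, activity: str) -> None:
        previous = self.last_activity.get(case)
        if previous is not None:
            self.relations[(previous, activity)] += 1
        self.last_activity[case] = activity

    def model(self) -> Dict[Tuple[str, str], int]:
        """The current model can be requested at any time."""
        return dict(self.relations)


miner = DirectlyFollowsMiner()
for case, activity in [("c1", "a"), ("c2", "a"), ("c1", "b"), ("c2", "b"), ("c1", "c")]:
    miner.observe(case, activity)
current_model = miner.model()
```

Each event is inspected once and processed in constant time, and `model()` can be called after any event, which mirrors the requirement of providing results at any time.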

#### **3.1 State of the Art**

In this section, the main milestones of the streaming process discovery will be presented. The first available approach to tackle the streaming process discovery problem is reported in [15,16]. This technique employs a "problem reduction" approach (cf. Fig. 2) rephrasing the Heuristics Miner [38] as a frequency counting problem. The details of this approach will be presented in Sect. 3.2. In [23], authors present StrProM which, similarly to the previous case, tracks the direct following relationship by keeping a prefix tree updated with Lossy Counting with Budget [18].

More recently, an architecture called S-BAR was proposed [44]: it keeps an updated abstract representation of the stream (e.g., direct follow relationships), which is used as the starting point to infer an actual process model. Different algorithms (including α [39], Heuristics Miner [38], and Inductive Miner [26]) have been incorporated into this approach. Also in this case, the authors reduced their problem to frequency counting, thus using Lossy Counting, Space Saving [32], and Frequent [24].

Declarative processes (cf. Chapter 4) have also been investigated as the target of the discovery. In [12,14,27], authors used the notion of "replayers" (one for each Declare [35] template to mine) to discover which ones are fulfilled. Also in this case, Lossy Counting strategies have been employed to achieve the goal. A newer version of the approach [33] is also capable of discovering data conditions associated with the different constraints.

#### **3.2 Heuristics Miner with Lossy Counting (HM-LC)**

This section describes in more detail one algorithm for streaming process discovery: Heuristics Miner with Lossy Counting (HM-LC) [16].

The Heuristics Miner algorithm [38] is a discovery algorithm which, given the frequency of the direct following relations observed (reported as |a>b| and indicating the number of times b is observed directly after a), calculates the *dependency measure*, a measure of the strength of the causal relation between two activities a and b:

$$a \Rightarrow b = \frac{|a > b| - |b > a|}{|a > b| + |b > a| + 1} \in [-1, 1]. \tag{1}$$

The closer the value of such metric is to 1, the stronger the causal dependency from a to b. Based on a given threshold (parameter asked as input), the algorithm considers only those dependencies with values exceeding the threshold, deeming the remaining as noise. By considering all dependencies which are strong enough, it is possible to build a dependency graph, considering one node per activity and one edge for each dependency. In such a graph, however, when splits or joins are observed (i.e., activities with more than one outgoing or incoming connection) it is not possible to distinguish the type of the splits. In the case of a dependency from a to b and also from a to c, Heuristics Miner disambiguates between an AND and an XOR split by calculating the following metric (also based on the frequency of direct following relations):

$$a \Rightarrow (b \land c) = \frac{|b > c| + |c > b|}{|a > b| + |a > c| + 1} \in [0, 1]. \tag{2}$$

When the value of this measure is high (i.e. close to 1), it is likely that b and c can be executed in parallel, otherwise, these will be mutually exclusive. As for the previous case, a threshold (parameter asked as input) is used to make the distinction.
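As a minimal sketch, the two measures can be computed directly from a table of directly-follows frequencies. The function names and the example counts below are hypothetical and chosen only for illustration; they are not taken from [38].

```python
from collections import Counter


def dependency(df: Counter, a: str, b: str) -> float:
    """Dependency measure a => b (Eq. 1), in [-1, 1]."""
    ab, ba = df[(a, b)], df[(b, a)]
    return (ab - ba) / (ab + ba + 1)


def and_measure(df: Counter, a: str, b: str, c: str) -> float:
    """AND measure a => (b and c) (Eq. 2) for a split from a to b and c."""
    return (df[(b, c)] + df[(c, b)]) / (df[(a, b)] + df[(a, c)] + 1)


# Hypothetical directly-follows counts |x > y| (illustration only).
df = Counter({
    ("a", "b"): 40, ("a", "c"): 38,
    ("b", "c"): 35, ("c", "b"): 36,
    ("b", "a"): 1,
})

dep_ab = dependency(df, "a", "b")         # (40 - 1) / (40 + 1 + 1)
and_bc = and_measure(df, "a", "b", "c")   # (35 + 36) / (40 + 38 + 1)
```

Here both values are close to 1, so with typical thresholds a would be considered a cause of b, and b and c would be considered parallel. A `Counter` is used because it returns 0 for relations that were never observed.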

It is important to note that the two fundamental measures employed by the Heuristics Miner rely on the frequency of the directly-follows measure (e.g. |a>b|), and so the basic idea of Heuristics Miner with Lossy Counting is to consider such values as "variables" to be observed in a stream, thus reducing the problem to frequency counting.

As previously mentioned, Lossy Counting is an algorithm for frequency counting. In particular, the estimated frequencies can be characterized both in terms of lower and upper bounds as follows: given the estimated frequency f (i.e., the frequency calculated by the algorithm), the true frequency F (i.e., the actual frequency of the observed variable), the maximum approximation error ε, and the number of events observed N, these inequalities hold:

$$f \le F \le f + \epsilon N.$$

To calculate the frequencies, Lossy Counting uses a set data structure where each element refers to the variable being counted, its current (and approximated) frequency, and the maximum approximation error in the counting of that variable.
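The following sketch illustrates a Lossy Counting set for generic items, assuming the usual formulation with bucket size ⌈1/ε⌉ (the same one used later in Algorithm 3). The class name `LossyCounter` and the example stream are hypothetical.

```python
import math
from collections import Counter


class LossyCounter:
    """Lossy Counting over a stream of hashable items (illustrative sketch)."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket size w
        self.n = 0                            # number of items observed (N)
        self.entries = {}                     # item -> (f, delta)

    def add(self, item) -> None:
        self.n += 1
        bucket = math.ceil(self.n / self.width)   # id of the current bucket
        if item in self.entries:
            f, delta = self.entries[item]
            self.entries[item] = (f + 1, delta)
        else:
            # delta is the maximum number of occurrences possibly missed so far
            self.entries[item] = (1, bucket - 1)
        if self.n % self.width == 0:
            # End of a bucket: drop entries that are not frequent enough.
            self.entries = {k: (f, d) for k, (f, d) in self.entries.items()
                            if f + d > bucket}

    def estimate(self, item) -> int:
        """Estimated frequency f (0 if the item is not being tracked)."""
        return self.entries.get(item, (0, 0))[0]


stream = list("abacabadabaeabaf")      # hypothetical stream of 16 items
lc = LossyCounter(epsilon=0.25)
for x in stream:
    lc.add(x)
true = Counter(stream)
```

In this example only the frequent item `a` survives the periodic cleanups. The infrequent items are dropped, yet their true frequency still lies within the bound f + εN.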

For the sake of simplicity, in Heuristics Miner with Lossy Counting, each event observed from the stream comprises just two attributes: the activity name and the case id (cf. Definition 1 [1], Sect. 3.2 [1]). An event e is therefore a tuple e = (c, a), where #*case*(e) = c and #*act*(e) = a.

The pseudocode of HM-LC is reported in Algorithm 3. The fundamental idea of the approach is to count the frequency of the direct following relations observed. In order to achieve this goal, however, it is necessary to *identify* the direct following pairs in the first place. As depicted in Fig. 6, to identify direct following relations it is first necessary to disentangle the different traces that are intertwined in an event stream. To this end, the HM-LC instantiates two Lossy Counting Sets: D*<sup>C</sup>* and D*<sup>R</sup>*. The first keeps track of the latest activity observed in each process instance whereas the second counts the actual frequency of the directly follow relations. These data structures are initialized at the first line of the algorithm, which is followed by the initialization of the counter of observed events (line 2) and the calculation of the size of the buckets (line 3).

After the initial setup of the data structure, a never-ending loop starts by observing events from the stream (line 5, cf. Definition 1), where each event is the pair (c*<sup>N</sup>*, a*<sup>N</sup>*), indicating that the case id observed as event N is c*<sup>N</sup>* (resp., the activity is a*<sup>N</sup>*). The id of the current bucket is calculated right afterwards (line 6). The whole algorithm is then divided into three conceptual steps: in the first the data structures are updated; in the second periodic cleanup takes place; in the third the data structures are used to construct and update the actual model.

**Fig. 6.** Conceptualization of the need to isolate different traces based on a single stream. Boxes represent events: their background colors represent the case id, and the letters inside are the activity names. First line reports the stream, following lines are the single cases. Figure from [16].

**Algorithm 3:** Heuristics Miner with Lossy Counting (simplified)

**Input**: S: event stream; ε: approximation error
**1** Initialize Lossy Counting Sets D<sup>C</sup> and D<sup>R</sup>
**2** N ← 1 // Counter of observed events
**3** w ← ⌈1/ε⌉ // Bucket size
**4** **forever do**
**5**     (c<sup>N</sup>, a<sup>N</sup>) ← *observe*(S)
**6**     b*curr* = ⌈N/w⌉ // Calculate the current bucket id
    // Step 1: Update the Lossy Counting Sets
**7**     **if** ∃((c, a*last*), f, Δ) ∈ D<sup>C</sup> *such that* c = c<sup>N</sup> **then**
**8**         Remove the entry ((c, a*last*), f, Δ) from D<sup>C</sup>
**9**         D<sup>C</sup> ← D<sup>C</sup> ∪ {((c, a<sup>N</sup>), f + 1, Δ)}
        // Update the D<sup>R</sup> data structure
**10**         r<sup>N</sup> ← (a*last*, a<sup>N</sup>) // Build relation r<sup>N</sup> as a*last* → a<sup>N</sup>
**11**         **if** ∃(r, f, Δ) ∈ D<sup>R</sup> *such that* r = r<sup>N</sup> **then**
**12**             Remove the entry (r, f, Δ) from D<sup>R</sup>
**13**             D<sup>R</sup> ← D<sup>R</sup> ∪ {(r, f + 1, Δ)}
**14**         **else**
**15**             D<sup>R</sup> ← D<sup>R</sup> ∪ {(r<sup>N</sup>, 1, b*curr* − 1)}
**16**         **end**
**17**     **else**
**18**         D<sup>C</sup> ← D<sup>C</sup> ∪ {((c<sup>N</sup>, a<sup>N</sup>), 1, b*curr* − 1)}
**19**     **end**
    // Step 2: Periodic cleanup
**20**     **if** N ≡ 0 mod w **then**
**21**         **forall the** ((c, a), f, Δ) ∈ D<sup>C</sup> *such that* f + Δ ≤ b*curr* **do**
**22**             Remove ((c, a), f, Δ) from D<sup>C</sup>
**23**         **end**
**24**         **forall the** (r, f, Δ) ∈ D<sup>R</sup> *such that* f + Δ ≤ b*curr* **do**
**25**             Remove (r, f, Δ) from D<sup>R</sup>
**26**         **end**
**27**     **end**
**28**     N ← N + 1
    // Step 3: Consumption of the data structure to update the model
**29**     Update the model using D<sup>R</sup>
**30** **end**

*Step 1: Updating the Data Structure.* The Lossy Counting Set D*<sup>C</sup>* has been defined in order not only to keep a count of the frequency of each case id observed in the events but also to keep track of the latest activity observed in the given trace. To achieve this goal, the entries of the data structure are tuples themselves, comprising the case id as well as the name of the latest activity observed in the case. Therefore, the first operation within the step consists of checking for the presence of an entry in D*<sup>C</sup>* matching the case id of the observed event (but not the activity name), as reported in line 7. If this is the case, the data structure D*<sup>C</sup>* is updated, not only by updating the frequency but also by updating the latest activity observed in the given case (lines 8 and 9). In addition, having already an entry in D*<sup>C</sup>* means that a previous event within the same trace has already been seen and, therefore, it is possible to construct a direct following relation (cf. line 10 of Algorithm 3). This relation is then treated as a normal variable to be counted and the corresponding Lossy Counting Set is updated accordingly (lines 11–16). In case D*<sup>C</sup>* did not contain an entry referring to case id c*<sup>N</sup>* , it means that the observed event is the first event of its process instance (up to the approximation error) and hence just a new entry in D*<sup>C</sup>* is inserted and no direct following relation is involved (line 18).

*Step 2: Periodic Cleanup.* With a periodicity imposed by the maximum approximation error (ε), i.e., at the end of each bucket (line 20), the two Lossy Counting Sets are updated by removing entries that are not frequent or recent enough (lines 21–26). Please note that observing an event belonging to a process instance that has been removed from D*<sup>C</sup>* results in losing one direct following relation from the counting in D*<sup>R</sup>*. From this point of view, the error on the counting of the relations is not only affected by D*<sup>R</sup>* but, indirectly, also by the removal of instances from D*<sup>C</sup>*, which causes a relation not to be seen at all (and therefore, it cannot be counted).
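Steps 1 and 2 can be sketched in Python as follows. This is an illustration of the pseudocode under simplifying assumptions (plain dictionaries instead of sets of tuples, and a hypothetical class name `HMLC`), not the reference implementation of [16]. The comments refer to the line numbers of Algorithm 3.

```python
import math


class HMLC:
    """Steps 1 and 2 of Algorithm 3 (illustrative sketch)."""

    def __init__(self, epsilon: float):
        self.w = math.ceil(1 / epsilon)   # bucket size (line 3)
        self.n = 1                        # counter of observed events (line 2)
        self.d_c = {}                     # case id -> (last activity, f, delta)
        self.d_r = {}                     # (a_last, a) -> (f, delta)

    def observe(self, case: str, activity: str) -> None:
        b_curr = math.ceil(self.n / self.w)                   # line 6
        if case in self.d_c:                                  # line 7
            a_last, f, delta = self.d_c[case]
            self.d_c[case] = (activity, f + 1, delta)         # lines 8-9
            rel = (a_last, activity)                          # line 10
            if rel in self.d_r:                               # lines 11-13
                f_r, delta_r = self.d_r[rel]
                self.d_r[rel] = (f_r + 1, delta_r)
            else:                                             # line 15
                self.d_r[rel] = (1, b_curr - 1)
        else:                                                 # line 18
            self.d_c[case] = (activity, 1, b_curr - 1)
        if self.n % self.w == 0:                              # line 20
            # Keep only entries with f + delta > b_curr (lines 21-26).
            self.d_c = {c: v for c, v in self.d_c.items() if v[1] + v[2] > b_curr}
            self.d_r = {r: v for r, v in self.d_r.items() if v[0] + v[1] > b_curr}
        self.n += 1                                           # line 28


# Two interleaved cases; the bucket is large enough that nothing is removed.
miner = HMLC(epsilon=0.125)
for event in [("c1", "A"), ("c2", "A"), ("c1", "B"),
              ("c2", "B"), ("c1", "C"), ("c2", "C")]:
    miner.observe(*event)

# A very small bucket: case "x" is forgotten before its second event arrives,
# so the relation A -> B is never counted.
lossy = HMLC(epsilon=0.5)
for event in [("x", "A"), ("y", "A"), ("x", "B")]:
    lossy.observe(*event)
```

The second run shows the indirect error discussed above: once the entry for a case is removed from D*<sup>C</sup>*, the next event of that case is treated as the first one of a new instance.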

*Step 3: Consumption of the Data Structures.* The very final step of the algorithm (line 29) consists of triggering a periodic update of the model. The update procedure (not specified in the algorithm) extracts all the activities involved in direct following relations from D*<sup>R</sup>* and uses the dependency measure (cf. Eq. 1) to build a dependency graph, by keeping the relations with dependency measure above a threshold. To disambiguate AND/XOR splits, Eq. 2 is used. Both these measures need to be adapted in order to retrieve the frequency of the relations from D*<sup>R</sup>*.
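Since the update procedure is not specified in the algorithm, the following is only one possible sketch of it. It assumes that the approximate frequencies have been read from D*<sup>R</sup>* into a plain dictionary `freq`, and that a measure equal to the threshold is accepted. The function name, parameter names, and example counts are hypothetical.

```python
from itertools import combinations


def build_dependency_graph(freq, tau_dep, tau_and):
    """Build edges and split types from directly-follows frequencies.

    freq maps a relation (a, b) to its (approximate) frequency |a > b|.
    """
    def n(a, b):
        return freq.get((a, b), 0)

    # Keep the relations whose dependency measure (Eq. 1) reaches tau_dep.
    edges = {(a, b) for (a, b) in freq
             if (n(a, b) - n(b, a)) / (n(a, b) + n(b, a) + 1) >= tau_dep}

    # For each activity with several successors, classify every pair of
    # successors as an AND or XOR split using Eq. 2.
    splits = {}
    for a in {src for src, _ in edges}:
        targets = sorted(b for (src, b) in edges if src == a)
        for b, c in combinations(targets, 2):
            score = (n(b, c) + n(c, b)) / (n(a, b) + n(a, c) + 1)
            splits[(a, b, c)] = "AND" if score >= tau_and else "XOR"
    return edges, splits


# Hypothetical frequencies: b and c interleave after a, then both lead to d.
freq = {("a", "b"): 40, ("a", "c"): 38, ("b", "c"): 35,
        ("c", "b"): 36, ("b", "d"): 39, ("c", "d"): 37}
edges, splits = build_dependency_graph(freq, tau_dep=0.8, tau_and=0.1)

# Hypothetical frequencies: b and c never follow each other.
edges2, splits2 = build_dependency_graph(
    {("a", "b"): 20, ("a", "c"): 20}, tau_dep=0.8, tau_and=0.1)
```

In the first example the relations between b and c cancel each other out in Eq. 1 and are discarded as edges, but they are exactly what makes the split after a an AND split according to Eq. 2.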

The procedure just mentioned recomputes the whole model from scratch. However, observing a new event will cause only local changes to a model. Hence a complete computation of the whole model is not necessary. In particular, it is possible to rearrange Eqs. 1 and 2 in order to signal when a dependency has changed. Specifically, given a dependency threshold τ*dep*, we know that a dependency should be present if these inequalities hold:

$$|a > b| \ge \frac{|b > a|(1 + \tau_{dep}) + \tau_{dep}}{1 - \tau_{dep}} \quad \text{or} \quad |b > a| \le \frac{|a > b|(1 - \tau_{dep}) - \tau_{dep}}{1 + \tau_{dep}}.$$
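For reference, the first inequality follows from Eq. 1 in a few steps. The sketch below assumes 0 ≤ τ*dep* < 1 (so that dividing by 1 − τ*dep* preserves the direction of the inequality) and requires the dependency measure to be at least τ*dep*:

$$\begin{aligned} \frac{|a > b| - |b > a|}{|a > b| + |b > a| + 1} \ge \tau_{dep} &\iff |a > b| - |b > a| \ge \tau_{dep}\,(|a > b| + |b > a| + 1)\\ &\iff |a > b|\,(1 - \tau_{dep}) \ge |b > a|\,(1 + \tau_{dep}) + \tau_{dep}\\ &\iff |a > b| \ge \frac{|b > a|\,(1 + \tau_{dep}) + \tau_{dep}}{1 - \tau_{dep}}. \end{aligned}$$

Solving the second-to-last line for |b > a| instead yields the second inequality, so the two conditions are equivalent formulations of the same test: one is convenient when |a > b| changes, the other when |b > a| changes.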

**Fig. 7.** Reference process model used to calculate the conformance of traces in Table 1, from [17].

**Table 1.** Example traces with corresponding offline conformance.


In a similar fashion, we can rewrite Eq. 2 so that, given an AND threshold parameter τ*and*, a split (i.e., from activity a to both activities b and c) has type AND if all these inequalities hold:

$$\begin{aligned} |b>c| &\ge \tau_{and} \left( |a>b| + |a>c| + 1 \right) - |c>b|\\ |c>b| &\ge \tau_{and} \left( |a>b| + |a>c| + 1 \right) - |b>c|\\ |a>b| &\le \frac{|b>c| + |c>b|}{\tau_{and}} - |a>c| - 1\\ |a>c| &\le \frac{|b>c| + |c>b|}{\tau_{and}} - |a>b| - 1 \end{aligned}$$

If this is not the case, the type of the split will be XOR. Therefore, by monitoring how the frequencies of some of the relations in D*<sup>R</sup>* (which should be used as an approximation of the direct following frequencies) are evolving, it is possible to pinpoint the changes appearing in a model, with no need for rebuilding it from scratch every time.

In this section, we did not exhaustively cover the reduction of the Heuristics Miner to Lossy Counting (for example, we did not consider the absolute number of observations for an activity or parameters such as the relative-to-best) but we focused on the core aspects of the reduction. The goal of the section was to present the core ideas behind a streaming process discovery algorithm while, at the same time, showing an example of an algorithm based on the problem reduction approach (cf. Fig. 2).

## **4 Streaming Conformance Checking**

Computing the conformity of running instances starting from events observed in a stream is the main goal of streaming conformance checking.

Consider, for example, the process model reported in Fig. 7 as a reference process, and let us investigate the offline conformance (calculated according to the alignment technique reported in [2]) for the traces reported in Table 1. Trace t<sup>1</sup> is indeed conforming with respect to the model as it represents a possible complete execution of the process. This information is already properly captured by the offline analysis. Trace t<sup>2</sup>, on the other hand, is compliant with the process but just up to activity B, as reported by the conformance value 0.8: offline systems assume that the executions are complete, and therefore observing an incomplete trace represents a problem. However, as previously discussed and as shown in Fig. 1a, in online settings it could happen that parts of the executions are missing due to the fact that the execution has not yet arrived at this part of the computation. This *could* be the case with trace t<sup>2</sup> (i.e., t<sup>2</sup> is the prefix of a compliant trace). Trace t<sup>3</sup> suffers from the opposite problem: the execution is conforming to the model, but just from activity B onward. While offline conformance, in this case, is 0.68, as for the previous case, we cannot rule out the option that the trace is actually compliant but, since the trace started before the streaming conformance checker was online, it was incapable of analyzing the beginning of it (i.e., t<sup>3</sup> is the suffix of a compliant trace). Trace t<sup>4</sup>, finally, seems compliant just between activities B and D. Though offline conformance, in this case, is 0.62, as for the previous two cases, in a streaming setting, we cannot exclude that the issue actually derives from the combination of the trace starting before the streaming conformance checker was online and the trace not being complete (i.e., t<sup>4</sup> is a subsequence of a compliant trace).

Hopefully, discussing the previous examples helped to point out the limit of calculating the extent of the conformance using only one numerical value in a streaming setting. Indeed, when the assumption that executions are complete is dropped, the behavior shown in the traces of Table 1 could become 100% compliant since the actual issue does not lie in the conformity but in the amount of observed behavior.

#### **4.1 State of the Art**

Computing the conformity of a stream with respect to a reference model has received a fairly large amount of attention, in particular in the case of declarative processes. Under the name "*operational support*", research has been focusing [28, 29] on understanding whether, and which, constraints are violated or satisfied as new events are coming in. In particular, each constraint is associated with one of four possible truth values (permanently violated, temporarily violated, temporarily fulfilled, or permanently fulfilled), which are computed by representing the behavior as an automaton on top of which all executions are replayed.

Streaming conformance checking on imperative models has also received attention, though more recently. Optimal alignments can be computed for the prefix (i.e., prefix-alignments) of the trace seen up to a given point in time [41], resulting in a very reliable approach which, however, fits the streaming scenario only to some extent (cf. Sect. 1.2). A more recent approach [36] improves the performance of calculating a prefix-alignment by rephrasing the problem as a shortest-path problem, incrementally expanding the search space, and reusing previously computed intermediate results.

A different line of research focused on calculating streaming conformance for all scenarios (cf. Fig. 1a). In this case, techniques employed "offline computation" approaches [11,17,25] to construct data structures capable of simplifying the computation when the system goes online. These approaches not only compute the conformity of a running instance but also try to quantify the amount of behavior observed or still to come.

In addition to these, one of the first approaches [45] focused on a RESTful service capable of performing the token replay on a BPMN model (via a token pull mechanism). No explicit guarantees, however, are reported concerning the memory usage, the computational complexity, or the reliability of the results, suggesting that the effort was mostly on the interface type (i.e., *online* as in *RESTful*).

#### **4.2 Conformance Checking with Behavioral Patterns**

This section presents in more detail one algorithm for streaming conformance checking using behavioral patterns [17]. The algorithm belongs to the category of offline computation (cf. Fig. 2), where the heaviest computation is moved before the system goes online, thus meeting the performance requirement of streaming settings.

The fundamental idea of the approach is that using just one metric to express conformity could lead to misleading results, i.e., cases that already started and/or that are not yet finished get falsely penalized. To solve these issues, the approach proposes to break the conformity into three values: *conformance*, *completeness*, and *confidence*.


A graphical representation of these concepts is reported in Fig. 8. In addition, the approach does not assume any specific modeling language for the reference process. Instead, the approach takes the reference process as a set of constraints on the relative order of its activities. Such constraints are defined in terms of behavioral patterns, such as weak ordering, parallelism, causality, and conflict. Such behavioral patterns (with the corresponding activities involved) also represent what the conformance checking algorithm observes. In the context of this chapter, we will consider the directly follow relation as a pattern.

Please note that the input of the algorithm is not a stream of events, but a stream of observed behavioral patterns, which could require some processing of the raw events. This, however, does not represent a problem for the behavioral pattern considered (i.e., directly follow relation), since these can be extracted using the technique described in Sect. 3.2.
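As an illustration of this pre-processing, the sketch below turns a stream of raw events into a stream of directly follow patterns by remembering the latest activity of each case. It assumes events are (case id, activity) pairs and, for brevity, keeps an unbounded dictionary; in a real streaming setting this memory would have to be bounded, for example with a Lossy Counting Set as in Sect. 3.2. The function name is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str]                    # (case id, activity)
Observation = Tuple[str, Tuple[str, str]]  # (case id, (previous, current))


def directly_follows(events: Iterable[Event]) -> Iterator[Observation]:
    """Turn a stream of events into a stream of directly follow patterns."""
    last: Dict[str, str] = {}              # latest activity seen for each case
    for case, activity in events:
        if case in last:
            yield case, (last[case], activity)
        last[case] = activity


# Two interleaved cases (hypothetical events).
events = [("c1", "A"), ("c2", "A"), ("c1", "B"), ("c2", "C"), ("c1", "C")]
patterns = list(directly_follows(events))
```

The first event of each case produces no pattern, since there is no previous activity to pair it with.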

**Fig. 8.** General idea of the 3 conformance measures computed based on a partially observed process instance: *conformance*, *completeness*, and *confidence*. Figure from [17].

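As previously mentioned, the technique offloads the computation to a pre-processing stage which takes place offline, before the actual conformance is computed. During such a step, the model is converted into another representation, better suited for the online phase. Specifically, the new model contains: the set of behavioral patterns prescribed by the original model; for each pattern, the minimum and maximum number of distinct prescribed patterns to be seen before it; and, for each pattern, the minimum number of distinct patterns still required to reach the end of the process.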


These requirements drive the definition of the formal representation called "Process Model for Online Conformance" (PMOC). A *process model for online conformance* M = (B, P, F) is defined as a triplet in which B is the set of prescribed behavioural patterns. Each pattern b(a<sup>1</sup>, a<sup>2</sup>) is defined as a relation b (e.g., the directly follow relation) between activities a<sup>1</sup>, a<sup>2</sup> ∈ U*act* (cf. Definition 1 [1]). P contains, for each behavioral pattern b ∈ B, the pair of minimum and maximum number of distinct prescribed patterns (i.e., patterns in B) to be seen before b. We refer to these values as Pmin(b) and Pmax(b). For each pattern b ∈ B, F(b) refers to the minimum number of distinct patterns (i.e., patterns in B) required to reach the end of the process from b.

Once such a model is available, the conformance values can be calculated according to Algorithm 4, which executes three steps for each event: updating the data structures, calculating the conformance values, and housekeeping cleanup. After two maps are initialized (lines 1, 2), the never-ending loop starts and each observation from the stream (which refers to a behavioral pattern b for case id c, cf. Definition 1) triggers an update of the two maps: if the pattern refers to a prescribed relation, then it is added to the obs(c) set (line 6)<sup>4</sup>; otherwise, the number of incorrect observations for the process instance, inc(c), is incremented (line 8)<sup>5</sup>. In the second step, the algorithm calculates the new conformance values.

<sup>4</sup> If obs has no key c, obs(c) returns the empty set.

<sup>5</sup> If inc has no key c, then inc(c) returns 0.

**Algorithm 4:** Conformance Checking with Behavioral Patterns

**Input**: S: stream of behavioural patterns; M = (B, P, F): process model for online conformance
**1** Init map obs // Maps case ids to set of observed patterns from M
**2** Init map inc // Maps case ids to integers
**3** **forever do**
**4**     (c, b) ← *observe*(S) // New observation of pattern b for case c
    // Step 1: update internal data structures
**5**     **if** b ∈ B **then**
**6**         obs(c) ← obs(c) ∪ {b} // If b already in obs(c), then no effect
**7**     **else**
**8**         inc(c) ← inc(c) + 1
**9**     **end**
    // Step 2: compute online conformance values
**10**     conformance(c) ← |obs(c)| / (|obs(c)| + inc(c))
**11**     Notify new value of conformance(c)
**12**     **if** b ∈ B **then**
**13**         **if** Pmin(b) ≤ |obs(c)| ≤ Pmax(b) **then**
**14**             completeness(c) ← 1
**15**         **else**
**16**             completeness(c) ← min{1, |obs(c)| / (Pmin(b) + 1)}
**17**         **end**
**18**         confidence(c) ← 1 − F(b) / max<sub>b′∈B</sub> F(b′)
**19**         Notify new values of completeness(c) and confidence(c)
**20**     **end**
    // Step 3: cleanup
**21**     **if** *size of* obs *and* inc *is close to max capacity* **then**
**22**         Remove oldest entries from obs and inc
**23**     **end**
**24** **end**

The actual conformance, which resembles the concept of *precision*, is calculated (lines 10, 11) as the number of distinct observed prescribed patterns in c (i.e., |obs(c)|) divided by the sum of the number of prescribed observed patterns and the incorrect patterns (i.e., |obs(c)| + inc(c)): 1 indicates full conformance (i.e., only correct behaviour) and 0 indicates no conformance at all (i.e., only incorrect behaviour). Completeness and confidence are updated only when a prescribed behavioral pattern is observed (line 12) since they require locating the pattern itself in the process. Concerning completeness, we have a perfect value if the number of distinct behavioral patterns observed so far is within the expected interval for the current pattern (lines 13, 14). If this is not the case, we might have seen fewer or more patterns than expected (line 16). If we have seen fewer patterns, the completeness is the ratio of observed patterns over the minimum expected; otherwise, it is just 1 (i.e., we observed more patterns than needed, so the completeness is not an issue). Please bear in mind that these values compare only the *number* of distinct patterns, not their type, thus potentially leading to false positives. The confidence is calculated (line 18) as 1 minus the ratio between the number of patterns still to observe (i.e., F(b)) and the overall maximum number of future patterns (i.e., max<sub>b′∈B</sub> F(b′)): a confidence level of 1 indicates strong confidence (i.e., the execution reached the end of the process), whereas 0 means low confidence (i.e., the execution is still far from completion, therefore there is room for change). The final step performs some cleanup operations on obs and inc (lines 21–23). The algorithm does not specify how old entries should be identified and removed but, as seen in the previous section, existing approaches can easily handle this problem (e.g., by using a Lossy Counting Set).
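The three steps can be sketched in Python as follows. The toy PMOC describes a purely sequential process A → B → C → D and its P and F values were written by hand for illustration; they are not derived from the model of Fig. 7. The class names, the `max_cases` parameter, and the eviction policy (drop the least recently updated case) are assumptions, since the algorithm leaves the cleanup unspecified. The sketch also assumes that at least one F value is greater than 0.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Pattern = Tuple[str, str]                 # directly follow pattern (a1, a2)


@dataclass
class PMOC:
    B: Set[Pattern]                       # prescribed patterns
    P: Dict[Pattern, Tuple[int, int]]     # pattern -> (P_min, P_max)
    F: Dict[Pattern, int]                 # pattern -> min patterns to the end


class OnlineConformance:
    def __init__(self, model: PMOC, max_cases: int = 1000):
        self.m = model
        self.max_cases = max_cases        # assumed capacity bound
        self.state = OrderedDict()        # case id -> (obs set, inc counter)
        self.max_f = max(model.F.values())

    def observe(self, case: str, b: Pattern) -> dict:
        # Step 1: update obs(c) or inc(c); missing cases start empty.
        obs, inc = self.state.pop(case, (set(), 0))
        if b in self.m.B:
            obs = obs | {b}
        else:
            inc += 1
        self.state[case] = (obs, inc)     # re-inserted as most recent
        # Step 2: compute the online conformance values.
        result = {"conformance": len(obs) / (len(obs) + inc)}
        if b in self.m.B:
            p_min, p_max = self.m.P[b]
            if p_min <= len(obs) <= p_max:
                result["completeness"] = 1.0
            else:
                result["completeness"] = min(1.0, len(obs) / (p_min + 1))
            result["confidence"] = 1 - self.m.F[b] / self.max_f
        # Step 3: cleanup (assumed policy: evict least recently updated).
        while len(self.state) > self.max_cases:
            self.state.popitem(last=False)
        return result


# Hypothetical PMOC for the sequence A -> B -> C -> D.
toy = PMOC(
    B={("A", "B"), ("B", "C"), ("C", "D")},
    P={("A", "B"): (0, 0), ("B", "C"): (1, 1), ("C", "D"): (2, 2)},
    F={("A", "B"): 2, ("B", "C"): 1, ("C", "D"): 0},
)
checker = OnlineConformance(toy)
r1 = checker.observe("c1", ("C", "D"))    # case first seen near its end
r2 = checker.observe("c2", ("A", "D"))    # pattern not prescribed
r3 = checker.observe("c2", ("A", "B"))    # prescribed pattern, same case
```

For case `c1`, the conformance stays at 1 while the low completeness signals that the beginning of the trace was not observed, and the confidence of 1 signals that the end of the process was reached. For `c2`, the incorrect pattern lowers the conformance only.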

It is important to note once again that the actual algorithm relies on a data structure (the PMOC) that is tailored to the purpose and that might be computationally very expensive to obtain. However, since this operation is done only once and before any streaming processing, this represents a viable solution. The construction of the PMOC is not analyzed in detail here but is available in [17]. Briefly, considering the directly follow relation as the behavioral pattern, the idea is to start from a Petri net and calculate its reverse (i.e., the Petri net where all edges have opposite directions). Both these models are then unfolded according to a specific stop criterion and, once the corresponding reachability graphs are computed, the PMOC can be easily derived from the reachability graph of the unfolded original Petri net and the reachability graph of the unfolded reverse net.

Considering again the traces reported in Table 1 and the reference model in Fig. 7, all traces have a streaming conformance value of 1 (when calculated using the approach just described). The completeness is 1 for t<sup>1</sup> and t<sup>2</sup>, 0.6 for t<sup>3</sup>, and 0.5 for t<sup>4</sup>. The confidence is 1 for t<sup>1</sup> and t<sup>3</sup>, 0.5 for t<sup>2</sup>, and 0.75 for t<sup>4</sup>. These values indeed capture the goals mentioned at the beginning of this section: do not penalize the conformance but highlight the actual issues concerning the missing beginning or end of the trace.

As for the streaming process discovery case, in this section, we did not exhaustively cover the algorithm for streaming conformance checking presented. Instead, we focused on the most important aspects of the approach, hopefully also giving an intuition of how an offline computation approach could work (cf. Fig. 2).

## **5 Other Applications and Outlook**

It is worth mentioning that the concepts related to streaming process mining have been applied not only to the problem of discovery and conformance but, to a limited extent, to other challenges.

Examples of such applications are the discovery of cooperative structures out of event streams, as tackled in [43], where authors process an event stream and update the set of relationships of a cooperative resource network. In [40], several additional aspects of online process mining are investigated too.

Supporting the research in streaming process mining has also been a topic of research. Simulation techniques have been defined as standalone applications [9], as ProM plugins [42], or just as communication protocols [13].

Finally, from the industrial point of view, it might be interesting to observe that while some companies are starting to consider some aspects related to the topics discussed in this chapter (e.g., Celonis' Execution Management Platform supports real-time data ingestion, though not the analysis), none of them offers actual solutions for streaming process mining. A report from Everest Group<sup>6</sup> explicitly refers to real-time monitoring of processes as an important process intelligence capability not yet commercially available.

This chapter presented the topic of streaming process mining. While the field is relatively young, several techniques are already available both for discovery and conformance checking.

We presented a taxonomy of the existing approaches which, hopefully, can be used proactively, when new algorithms need to be constructed, to identify how a problem can be tackled. Then two approaches, one for control-flow discovery and one for conformance checking, were presented in detail; the two belong to different categories of the taxonomy. Alongside these two approaches, window models can also be employed, yet their efficacy is typically extremely low compared to algorithms specifically designed for the streaming context.

It is important to mention that streaming process mining has very important challenges still to be solved. One example is dealing with a stream where the arrival time of events does not coincide with their actual execution time. In this case, it would be necessary to reorder the list of events belonging to the same process instance before processing them. Another relevant issue might be the inference of the termination of process instances. Finally, so far, we always considered an insert-only stream model, where events can only be added in a monotonic fashion. Scenarios where observed events can be changed or removed (i.e., insert-delete models) are yet to be considered.

## **References**


<sup>6</sup> https://www2.everestgrp.com/reportaction/EGR-2020-38-R-3808/Marketing.



# **Responsible Process Mining**

Felix Mannhardt

Eindhoven University of Technology, Eindhoven, The Netherlands f.mannhardt@tue.nl

**Abstract.** The prospect of data misuse negatively affecting our life has led to the concept of responsible data science. It advocates for responsibility to be built, by design, into data management, data analysis, and algorithmic decision making techniques such that it is made difficult or even impossible to intentionally or unintentionally cause harm. Process mining techniques are no exception to this and may be misused and lead to harm. Decisions based on process mining may be *unfair*, causing harm to people by amplifying the biases encoded in the data and by disregarding infrequently observed or minority cases. Insights obtained may lead to *inaccurate* conclusions due to failing to consider the quality of the input event data. *Confidential* or personal information on process stakeholders may be leaked as the precise work behavior of an employee can be revealed. Process mining models are usually white-box but may still be difficult to interpret correctly without expert knowledge, hampering the *transparency* of the analysis. This chapter structures the topic of responsible process mining based on the FACT criteria: Fairness, Accuracy, Confidentiality, and Transparency. For each criterion, challenges specific to process mining are provided and the current state of the art is briefly summarized.

**Keywords:** Fairness · Accuracy · Confidentiality · Transparency

## **1 Introduction**

Data-based decisions affect our society and our daily life. Organizations leverage data to obtain *objective insights* that are based on *facts* rather than on guesswork. Being data-driven to guide decisions is in itself hardly new and, certainly, decisions should be based on data rather than being based on arbitrary factors. In fact, the scientific method itself is based on meticulously analysing data to derive trustworthy conclusions.

What changed in recent years, and is increasingly changing every aspect of our lives, is the abundance of data and compute power available to most people and organizations. The capability of collecting and analysing large amounts of data is now within reach of most organizations. What used to be a costly and time-consuming operation, involving a great deal of planning about what data to collect and what methods to build, can now be done ad hoc on large amounts of *stockpiled* data.

This abundance of data together with the emergence of a wide variety of analysis techniques has led to the formation of the *data science* field. Data science techniques are not limited to giving decision support to human decision makers; increasingly, *Artificial Intelligence* (AI) is used to automate decisions based on predictive models. *Process mining* is a *data science* method that focuses on improving an organization's processes by leveraging event logs. At the core of event logs are timestamped data about all kinds of events that occur in the context of work or business processes [1]. Process mining techniques have been very successfully deployed in numerous organizations and have helped to remove inefficiencies and improve the quality of processes [2].

However, this increased use of data leads to an increased risk of negative effects through accidental or intentional *irresponsible usage* of data [3]. Irresponsible usage of data ranges from invading the privacy of individuals, through flawed analysis of data with poor quality or inappropriate methods, to unfair automated decisions of systems trained on data biased towards majority groups. The potential misuse of this power gives rise to calls for the *responsible* use of data by creating knowledge and awareness about possible negative consequences and researching technical and socio-technical solutions to prevent these negative consequences.

#### **1.1 Responsible Data Science and AI**

Many initiatives have called for research and development on methods that can be broadly categorized under the umbrella term *responsible data science*, which includes sub-themes such as *responsible AI* [4]. Depending on the individual perspective, different criteria or principles that are relevant to obtain *responsible* methods have been proposed.


Several other organizations developing or using AI technology have published manifestos or best practices that include similar principles such as *fairness*, *privacy* or *confidentiality*, *accountability*, and often also *interpretability* and

<sup>1</sup> https://redasci.org.

<sup>2</sup> https://facctconference.org.

<sup>3</sup> https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai.

**Fig. 1.** Example challenges for responsible process mining in context of the 360 degree overview on process mining [7]

*safety*. Although they originate from different perspectives and follow slightly different definitions, there is great overlap on the major principles that are deemed relevant for leveraging data in a responsible manner. Naturally, the importance of the criteria differs depending on the application area. Considering *fairness* is crucial when designing AI systems based on machine learning that may possibly discriminate against individuals, whereas *safety* would be important when using such a system for controlling an industrial process. At the core of these "calls for action" is the realization that methods from the standard tool set of data science rarely follow all the desired criteria or principles by themselves. Additional effort is required, either by the analyst or system designer, to ensure their responsible use. This often requires ethical considerations since perfect technological solutions commonly do not exist.

#### **1.2 Responsible Process Mining**

This chapter instantiates the responsible data science challenges for process mining and summarises the state-of-the-art research on *responsible process mining*. Some of the challenges are specific to process mining and the event log data format whereas others are comparable to any other data science or AI approach. The context in which process mining operates means that many of the responsible data science principles and challenges are highly relevant. Figure 1 provides a non-comprehensive overview of some of the major challenges for responsible process mining in the context of the different process mining tasks. We discuss and, at least partially, answer some of these questions.

The subject of investigation in process mining is a business process, e.g., the handling of loan applications. So, the process mining analysis is not directly

**Fig. 2.** FACT principles for responsible process mining adapted from [5]

focused on individuals. Rather it looks at the manner in which the work is organized and performed. When analysing the loan application process, event logs are commonly not used for deciding the outcome of the loan application but for deciding how to improve the handling of applications to create a better process. Here, better may refer to being more efficient, less costly, more transparent, or any other indicator of process performance. At first glance process mining seems to not have the same impact on individuals as, e.g., deploying face recognition, predictive policing, or automatically scoring applicants for a job using AI methods. However, the manner in which business processes are performed can have an effect on various stakeholders (customers, employees, etc.).

Like any other data science method, process mining relies on data to reconstruct how processes were performed and how processes can be improved. Thus, the results are highly dependent on the quality of the event data used and the biases it may contain. Some additional quality and *confidentiality* challenges arise from the required sequential ordering of events, the grouping of events into specific process cases, and events being related to activities. In principle, process mining aims to discover *human-interpretable* models that are supposed to be accurate and transparent. However, for *complex* process behaviour process mining techniques often attempt to generalise from incomplete and noisy data. This creates *accuracy* and *transparency* challenges even in the process mining setting.

We follow the definitions of the FACT principles brought forward in [3,5] and illustrated in Fig. 2 to structure the discussion of process mining related challenges. First, we discuss *fairness* and its relevance to process mining in Sect. 2. Then, in Sect. 3, we briefly illustrate aspects of *accuracy* including data quality and model quality. Section 4 is a major part of this chapter and is devoted to *confidentiality*, which is about protecting and respecting sensitive data in event logs including the privacy of individuals. We close the chapter in Sect. 5 with a look at *transparency* focusing on generalization and the interpretability of process mining results.

## **2 Fairness**

Algorithmic fairness or fairness of automated systems [8] has been an increasingly prominent topic [9] when it comes to the development and usage of AI systems that are based on black-box machine learning models. Statistical biases embedded in training data may lead to systems making unfair decisions or clearly discriminating against certain groups of people. Prominent examples of such bias are the COMPAS system for predicting the risk that criminals re-offend, which seems to exhibit racial bias by having a higher false positive rate among Black defendants<sup>4</sup>, or gender stereotypes exhibited by automated translation systems such as Google Translate, which applies male gender when translating typically male-dominated job names from gender-neutral Turkish to English [10]. There are many more examples and we refer to the first chapter of the Fair ML book [10] for a comprehensive introduction.

An important realization regarding bias in data and their usage in any kind of data-based system is that: "Data and data sets are not objective; they are creations of human design" [11]. Data may be incomplete for a certain context leading to *representation bias* that is reflected in the learned model or the data analysis. Even when not being incomplete, data can reinforce existing discrimination that is embodied in the available data (*historical bias*). This cannot be avoided by simply discarding "problematic" attributes from the datasets since bias may be hidden in highly correlated attributes [10]. Many more data biases can be defined depending on the context [9]. A notable one is Simpson's paradox, which describes the situation in which a statistic may be very different, or even reversed, for subgroups of a dataset compared to the same statistic on the aggregate dataset including all those subgroups.
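Simpson's paradox can be made concrete with a small sketch. The counts below are hypothetical and not taken from any real event log; the worker names and the split into simple and complex cases are assumptions chosen only to produce the effect.

```python
# Hypothetical counts: (cases handled on time, total cases) per worker and case type.
counts = {
    "worker_A": {"simple": (81, 87), "complex": (192, 263)},
    "worker_B": {"simple": (234, 270), "complex": (55, 80)},
}


def subgroup_rates(worker):
    """On-time rate of a worker, computed separately per case type."""
    return {kind: on_time / total for kind, (on_time, total) in counts[worker].items()}


def aggregate_rate(worker):
    """On-time rate of a worker over all case types taken together."""
    on_time = sum(pair[0] for pair in counts[worker].values())
    total = sum(pair[1] for pair in counts[worker].values())
    return on_time / total


for worker in counts:
    print(worker, subgroup_rates(worker), round(aggregate_rate(worker), 3))
```

In this made-up data, worker A has the higher on-time rate for simple cases and for complex cases, yet the lower rate overall, because worker A handles mostly complex cases. An analysis that only looks at the aggregate would rank the two workers the wrong way round.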

#### **2.1 Process Mining Perspective**

It may seem that the discussion on algorithmic fairness is not directly relevant to process mining. The impact of process mining on individuals is usually indirect, so direct discrimination by a process mining analysis seems unlikely to occur. However, decisions made based on process mining may still have an impact on individuals. Employees working in an analysed process may be subject to unfair decisions, customers may be rejected based on predictive process mining techniques, or processes may be redesigned in a way that discriminates against minorities. These are unfair results that are hidden behind the scenes and may not make headlines in the newspaper, unless discovered. Based on the process illustrated in Fig. 3, we give two examples of how fairness challenges can be part of a process mining project.

Automated decision making can be part of process mining as it may result in redesigned processes with changed decision making. As shown in Fig. 3 additional extensive checks may be added to a loan application process for certain

<sup>4</sup> https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

**Fig. 3.** Loan application process in BPMN adapted from the process used in [12]. Additional activities that indicate the kind of checks performed on the loan application before considering it have been added. Based on some criteria either a simple check or a more extensive check of the application is performed and in some cases the check is repeated.

cases leading to *fairness* challenges. This process redesign may be the outcome of a process mining analysis with the goal of minimizing the cost of background checks. To further minimize the cost, methods for predictive process mining, action-oriented process mining, or the integration with robotic process automation may be used to decide whether additional or extensive checks are necessary. Thus, process mining has directly affected the outcome of some process cases. Whereas the final decision is still made by a human, some applicants need to endure much more extensive background checks. This decision is based on machine learning techniques and, thus, inherits all the fairness issues associated with algorithmic decision making.

A second example of a *fairness* challenge that may arise in a process mining context would be affecting the employees working in the process. For example, it may be detected that when certain workers are involved in the processing of the loan application the throughput time is much longer. However, care must be taken not to draw unfair conclusions as those workers may simply handle more difficult cases [3], which leads to biased event data. If the nature of the loan application request is not included in the event log, e.g., due to confidentiality concerns, these confounding factors are difficult to detect and require careful human interpretation.

Besides the obvious ethical concerns that make it relevant to investigate fairness in the context of process mining, there are also upcoming regulations such as the EU Artificial Intelligence Act [13] that may create legal obligations to consider fairness in any kind of automated data analysis. In the remainder of Sect. 2, we summarise the relevance of fairness for process mining along the main definitions that attempt to formalise fairness for algorithms. We instantiate each definition in the context of process mining and summarise existing work if available.

#### **2.2 Algorithmic Discrimination**

In the literature on algorithmic fairness, several types of discrimination that can arise from unfair algorithms have been defined. Similarly, a wide variety of definitions of how fair algorithmic systems can be designed have been researched. We sketch the main fairness definitions and types of discrimination in the light of how they are relevant to the different process mining tasks as illustrated in Fig. 1.

Many types of discrimination are possible. It is important to realize that discrimination or unfairness does not always need to be caused by direct discrimination [9]. Direct discrimination occurs when a decision that negatively affects a person is based solely on a sensitive or protected attribute. An example would be a predictive process monitoring model that is trained on a biased dataset and learns that female applicants for a loan should always receive an extra background check, causing worse service quality or an increased rate of rejection. Clearly, this type of discrimination would be easily detected and mitigated. However, discrimination is often *indirect discrimination* or *statistical discrimination* [9]. In these cases, some negative effect is applied but it is not directly based on a sensitive attribute. Rather, the attribute used or some statistical distribution is strongly correlated with a sensitive attribute. For example, when analysing the performance of a process with process mining methods, one may identify a group of workers as being slower than others because they are assigned more difficult cases [3] or receive less support than others. Similarly, when improving a process design, one may focus on the 80% most frequent variants and, thereby, discriminate against minority groups with special needs that trigger infrequent activities and, thus, are often not visible in the standard process mining visualization.

#### **2.3 Algorithmic Fairness**

To counter and detect discrimination, there are attempts to formalize the notion of fairness of an algorithmic decision based on data. Again, there are many definitions that formalize different kinds of fairness that can be provided by algorithms [14,15]. It is important to realize that none of them is universally applicable and that it depends on the context which one is suitable. Often fairness definitions are introduced using the example of a simple binary classification task. The main types of fairness notions based on [15] are: (1) notions based solely on the predicted outcome, (2) notions based on the predicted outcome compared with the actual outcome (ground truth), (3) notions that additionally take into account the probability of the predictions, (4) notions based on similarity of the non-sensitive attributes, and (5) notions based on causal reasoning.

We introduce a selection of these notions in the context of process mining and, for concise presentation, assume a simple binary classification model with the protected attribute *gender* as done in [15].


As this is only a small selection of the possible fairness notions, we refer to [15] for a comprehensive overview. None of the provided definitions is universally accepted or provides fairness in every sense of the concept.

The only work so far that directly addresses the challenge of fairness from a process mining viewpoint is by Qafari et al. [16]. It investigates the problem of creating a fair classifier for data extracted from an event log that is enriched with process performance information. The approach first advocates excluding the sensitive attribute or feature when building the classifier and then builds a C4.5 decision tree based on a discrimination-aware decision tree learning method. As the fairness definition, *predictive parity* is employed. An interesting problem raised is that *relabeling* may not always be desirable, in which case the fairness guarantees cannot be achieved. This is left as future work.
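To illustrate how such notions differ for a binary classifier with the protected attribute *gender*, the following minimal sketch compares two common definitions: statistical parity (equal rate of positive predictions per group) and predictive parity (equal share of correct positive predictions per group). The records are hypothetical, and the code is not the method of [16]; it only evaluates the two measures on made-up predictions.

```python
def make_records(gender, predicted, actual, n):
    """Create n identical hypothetical records (1 = positive, 0 = negative)."""
    return [{"gender": gender, "predicted": predicted, "actual": actual} for _ in range(n)]


# Hypothetical classifier output for ten female and ten male applicants.
records = (
    make_records("female", 1, 1, 3) + make_records("female", 1, 0, 1)
    + make_records("female", 0, 0, 6)
    + make_records("male", 1, 1, 6) + make_records("male", 1, 0, 2)
    + make_records("male", 0, 0, 2)
)


def positive_rate(records, gender):
    """Share of the group that receives a positive prediction (statistical parity)."""
    group = [r for r in records if r["gender"] == gender]
    return sum(r["predicted"] for r in group) / len(group)


def positive_predictive_value(records, gender):
    """Share of positive predictions in the group that are correct (predictive parity)."""
    flagged = [r for r in records if r["gender"] == gender and r["predicted"] == 1]
    return sum(r["actual"] for r in flagged) / len(flagged)


for gender in ("female", "male"):
    print(gender, positive_rate(records, gender), positive_predictive_value(records, gender))
```

On this data the positive predictive value is identical for both groups, so predictive parity holds, while the rate of positive predictions differs, so statistical parity is violated. The same classifier can therefore be called fair or unfair depending on the chosen notion.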

Though not explicitly addressing fairness, several proposals for applying causal machine learning techniques in the context of process mining have been made. For example, Bozorgi et al. [17,18] looked at discovering causal rules from event logs as well as taking some form of cost into account when making suggestions for intervention in running cases as part of a prescriptive process mining approach. By making the causalities explicit, it may be feasible to include fairness constraints in decisions.

#### **2.4 Open Challenges**

Many open research challenges for considering fairness in process mining exist. So far, there is hardly any research on fairness that is specific to process mining, either from a technological or from an organizational perspective, with the notable exception of [16]. A clear research challenge is to develop specific notions for fairness in process mining from the more generic fairness definitions. Whereas one could take the stance that the existing definitions from the wider machine learning field are sufficient, we motivated the need to consider fairness explicitly also regarding process mining techniques.

## **3 Accuracy**

Models need to be *accurate* to be useful in the real world. An analyst relying on a statistical analysis or an engineer developing a machine learning model for classification needs to have confidence that the analysis or the model captures the real-world phenomenon correctly. Unlike a model based on, e.g., physical laws or logic, which can be shown to be correct in any application setting, the kind of statistical models often used in data science can rarely be proven to be correct. Thus, the level of accuracy with which a real-world phenomenon is captured or the level of confidence that a user can have when using that model are important aspects of any such model. The accuracy of models depends on many factors and it is often not straightforward to measure it properly. A classification model may, on average, distinguish near perfectly between pictures showing different breeds of dogs on an independent test set, but if the relevant breed is highly underrepresented the classifier may still be unusable in the real world due to the class imbalance. It may also be that the classifier provides very good accuracy but makes its decision based on the wrong features, picking up on spurious correlations introduced when preparing the training data: a data quality problem.
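The effect of class imbalance on a single accuracy figure can be sketched in a few lines. The labels and the trivial "classifier" below are hypothetical; the point is only that overall accuracy and per-class recall can diverge sharply.

```python
# Hypothetical test set: 95 pictures of a common breed, 5 of a rare breed.
labels = ["common"] * 95 + ["rare"] * 5

# A trivial classifier that always predicts the majority class.
predictions = ["common"] * len(labels)


def accuracy(predictions, labels):
    """Share of all pictures that are classified correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall(predictions, labels, cls):
    """Share of the pictures of class `cls` that are classified as `cls`."""
    relevant = [p for p, y in zip(predictions, labels) if y == cls]
    return sum(p == cls for p in relevant) / len(relevant)


print("accuracy:", accuracy(predictions, labels))
print("recall on rare breed:", recall(predictions, labels, "rare"))
```

The overall accuracy is high although not a single picture of the rare breed is recognized, which is exactly the situation in which the average figure gives false confidence.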

#### **3.1 Process Mining Perspective**

Understanding and being able to measure the *accuracy* of a process mining analysis is an integral part of responsible process mining. Whereas it may seem obvious to only use results that accurately reflect the process reality, this is frequently impaired in practice by the need to abstract from that reality.

Process discovery techniques are often unable to create a perfectly accurate model but are forced to balance between several quality dimensions [1] that compete with each other. For example, to obtain a process model that is understandable by a human analyst, some observed behavior may need to be omitted. In some cases, the process behavior is too complex to be captured by a single case notion and multi-dimensional or multi-entity representations are required to avoid drawing inaccurate conclusions [19]. Conformance checking techniques such as alignments [20] often face the challenge that there are multiple possible explanations for a non-conformance between observed and prescribed process behavior. However, it may be infeasible to show all of them due to the large number of possibilities. Finally, the quality of the input data is often a substantial issue when applying process mining in a real-world scenario [12,21].

This brief look at possible challenges for *accuracy* indicates that the topic is very broad and difficult to discuss comprehensively in the scope of this chapter. Thus, we limit ourselves to briefly describe several challenges and selected solution proposals. We categorize them into solutions for *data quality* and *model quality*.

#### **3.2 Data Quality**

Data quality is known to be often poor [22] and this may lead to a non-factual or misleading representation of the real business process. "Garbage in, garbage out" is an often-used phrase to illustrate this issue. Whereas the data quality issue is not particular to process mining, there are some peculiarities of event logs that call for specific solutions.

Often data quality problems in process mining are related to the strict data requirements on timestamps (R1), case identifiers (R2), and event labels (R3) [23]. Wrong or coarse-grained timestamps lead to discovering wrong causalities in process models or parallelism where none exists. Inconsistent event labels make it difficult to assign clear semantics to the activities of a discovered process model. These are just two examples of how data quality issues impair process mining. Automated repair approaches to combat some of the data quality problems exist. For example, in [24] autoencoders are used to add missing values. However, any such method may affect *transparency* [25] as it is unclear what part of the data was inferred and what part of the data can be considered truthful beyond doubt. A discovered process model may be perfectly *accurate*, but when it is based on data with poor quality any conclusions become disputable. Notions of data quality and remedies are already introduced and discussed in [12]; therefore, we do not go into further detail on the data quality challenge.
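The effect of coarse timestamps can be sketched with the directly-follows relation, a common basis of discovery algorithms. The rows below are hypothetical, and the assumption is that events with equal timestamps keep the order in which they happen to be stored. When both orders of two activities are observed, many discovery techniques interpret them as concurrent.

```python
from datetime import datetime

# Hypothetical event rows (case id, activity, timestamp) in storage order.
# With minute precision, B always happens before C.
rows = [
    ("c1", "A", "2022-03-01 09:00"),
    ("c1", "B", "2022-03-02 10:00"),
    ("c1", "C", "2022-03-02 15:00"),
    ("c2", "A", "2022-03-03 09:00"),
    ("c2", "C", "2022-03-04 16:00"),  # stored before B, although it happened later
    ("c2", "B", "2022-03-04 11:00"),
]


def directly_follows(rows, day_granularity=False):
    """Return the set of activity pairs (x, y) where y directly follows x in some case."""
    traces = {}
    for case_id, activity, timestamp in rows:
        stamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        key = stamp.date() if day_granularity else stamp
        traces.setdefault(case_id, []).append((key, activity))
    pairs = set()
    for events in traces.values():
        # sorted() is stable: events with equal keys keep their storage order.
        ordered = [activity for _, activity in sorted(events, key=lambda e: e[0])]
        pairs.update(zip(ordered, ordered[1:]))
    return pairs


print(sorted(directly_follows(rows)))
print(sorted(directly_follows(rows, day_granularity=True)))
```

With precise timestamps only the pair (B, C) is observed. After truncating to the day, both (B, C) and (C, B) appear, suggesting parallelism between B and C that does not exist in the underlying process.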

One noteworthy topic connected to data quality is *uncertainty* at the level of the event log data [26], which can be captured, e.g., by adding metadata to express the uncertainty [27]. Pegoraro et al. [26] advocate explicitly encoding the uncertainty about events and traces in order to leverage it in a transparent manner during the analysis. Based on such an event log with explicit uncertainty representation, conformance checking techniques can be adapted [28] to obtain more trustworthy diagnostics that also provide more transparency about the different possible scenarios compatible with the (uncertain) observations.

#### **3.3 Model Quality**

How can one decide whether a process model is of good quality? In fact, even when it comes to the question of how to measure *accuracy* there is hardly any agreement in process mining. Classically, process mining quality dimensions consist of fitness, precision, generalization, and simplicity as introduced in [1]. For most of these quality dimensions, measures have been proposed that are based on conformance checking, e.g., through alignments as indicated in [20]. However, this common practice of measuring model quality has been challenged, at least for precision, with Tax et al. [29] proposing several axioms that the prevalent measures do not fulfil.
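The tension between fitness and precision can be illustrated with a deliberately simplified sketch. It assumes that a model is given as a finite set of allowed traces and uses crude trace-level ratios; these are illustrative stand-ins chosen for this example, not the alignment-based measures referred to above, and the log and models are hypothetical.

```python
# Hypothetical event log, one tuple of activities per case.
log = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("A", "C", "B"),
    ("A", "B", "B", "C"),
]

# Two hypothetical models, each given as the finite set of traces it allows.
model_strict = {("A", "B", "C")}
model_loose = {
    ("A", "B", "C"), ("A", "C", "B"), ("A", "B", "B", "C"),
    ("A", "C", "C", "B"), ("A", "B"), ("A", "C"),
}


def trace_fitness(log, language):
    """Share of log traces that the model allows."""
    return sum(trace in language for trace in log) / len(log)


def trace_precision(log, language):
    """Share of traces allowed by the model that were actually observed."""
    return len(language & set(log)) / len(language)


for name, language in (("strict", model_strict), ("loose", model_loose)):
    print(name, trace_fitness(log, language), trace_precision(log, language))
```

The strict model is perfectly precise but replays only half of the log, whereas the loose model replays everything at the cost of allowing behaviour that was never observed. The sketch also shows why such simple ratios break down once a model allows infinitely many traces, for example through loops.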

The issues with the initially proposed quality measures led to several new methods and definitions for measuring various model quality dimensions being proposed [30–34]. The main complications for model quality in process mining are that process models commonly exhibit infinite behaviour (through loops) and the absence of negative examples, i.e., behaviour that the model should not contain [1].

Recently, there have been several proposals that aim to extend process discovery and the model quality measures to the stochastic setting, in which process models include probabilities and the likelihood of observing a certain trace is taken into account [35,36]. This allows a better estimate of the relevant subset of the modelled behavior and may help to truly quantify the confidence that an analyst can have in a model.

A somewhat related issue concerning the confidence an analyst can put in the performance of a process discovery algorithm was brought up by Van der Werf et al. [37]. They observed that process discovery techniques do not always discover better process models when provided with a better sample of the process behavior, i.e., a larger event log with observations of process behavior.

#### **3.4 Outlook and Challenges**

The extensive discussion around how to measure quality shows that even defining *accuracy* for process discovery is not straightforward. In practice, this creates the challenge of choosing which measure should be used in which context and deciding when a model can be considered good enough for an analysis purpose. Another very relevant perspective for responsible process mining regarding model quality is how the discovered process model representation is understood by the user of such a model. We will come back to this issue when considering *transparency*.

## **4 Confidentiality**

Confidentiality generally refers to the protection of certain sensitive data or information from disclosure. In the context of an organization, many different kinds of information are usually confidential. Intellectual property such as the design of machines or software may be confidential to protect it from competitors, but general information on the business, such as the amount of sales in a certain area, is usually kept confidential as well. A subset of the confidential information in the sphere of an organization relates to personal data. Here, the concern is the right to *privacy* of individuals whose personal data is processed by the organization. Personal data may relate to customers, employees, suppliers, or other people that interact with an organization's processes. Privacy rights have received a lot of attention following several high-profile data breaches and increased regulation such as Europe's General Data Protection Regulation (GDPR) [38].

#### **4.1 Process Mining Perspective**

In the context of process mining, the information contained in event logs may be sensitive for several reasons. Event logs contain data providing detailed *information on the operations of an organization*, e.g., the order volume or the production capacity. Uncontrolled disclosure of such information may be undesired as it could negatively affect the organization. Event logs also contain *information on individuals*, e.g., customers, which may be subject to privacy regulations.

Assume a hospital process is analyzed. Case data is related to the individual patient, and the *confidentiality* challenges of protecting sensitive data and *privacy* are obvious [39]. However, the employees that work in processes are often also directly affected by process mining results and may be directly represented in the event logs, e.g., via the resource attribute in XES. This creates an additional *confidentiality* challenge: preventing work surveillance [40,41].

Protecting the privacy of individuals in event logs is difficult, as sequential event data is highly vulnerable to re-identification [42]. In fact, when assuming some background information, privacy leakages exist in the vast majority of presumably anonymous event logs that are used in the process mining community [42]. As events are linked together through a case, and the traces in an event log are often highly unique, even very limited background knowledge on some attributes or events can reveal the identity of an individual.
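The vulnerability of sequential data can be sketched by counting how many cases have a trace that no other case shares and by checking how many cases remain compatible with a small piece of background knowledge. The traces and activity names below are hypothetical, and the attack is a simplification chosen for illustration.

```python
from collections import Counter

# Hypothetical, pseudonymized traces: case identifier -> sequence of activities.
traces = {
    "case_1": ("register", "simple_check", "accept"),
    "case_2": ("register", "simple_check", "accept"),
    "case_3": ("register", "extensive_check", "decline"),
    "case_4": ("register", "simple_check", "extensive_check", "accept"),
}

# Cases whose activity sequence occurs exactly once in the log.
variant_counts = Counter(traces.values())
unique_cases = sorted(c for c, t in traces.items() if variant_counts[t] == 1)
unique_share = len(unique_cases) / len(traces)


def contains_in_order(trace, known_activities):
    """True if the known activities occur in the trace in the given order."""
    remaining = iter(trace)
    return all(activity in remaining for activity in known_activities)


def candidate_cases(known_activities):
    """Cases that are compatible with the adversary's background knowledge."""
    return sorted(c for c, t in traces.items() if contains_in_order(t, known_activities))


print("share of cases with a unique trace:", unique_share)
print("candidates:", candidate_cases(("extensive_check", "decline")))
```

Knowing only that an application was declined after an extensive check narrows this small log down to a single case, even though no direct identifier is present.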

This "privacy problem" creates challenges in the practical application of process mining. Data gathering is more difficult or impossible when privacy concerns are raised. For example, the hospital may fear that privacy regulations (GDPR, HIPAA [43]) are violated when analysing patient trajectories [39,44] or a works council may object to the usage of process mining technology due to fear of worker surveillance [41]. Regulations threaten organizations with high fines when personal data is used without legitimate purpose or consent. The fines in GDPR may be as high as 4% of the organization's worldwide annual revenue [38]. Thus, there is a clear need for privacy-preserving or protecting techniques for process mining. Such approaches aim to retain the utility of the data without the risk of accidental disclosure of personal data. Please note that we use *protection* here in the sense of *anonymity* and *unlinkability* requirements. Next to those, other requirements such as *notice*, *transparency*, and *accountability* are often imposed by regulations [45]. Note that most privacy-preserving techniques differ from the wide variety of best-effort pseudonymization, perturbation, and generalization

**Fig. 4.** The main aspects of any confidentiality scenario for process mining: What is the sensitive information contained in the event log that needs to be protected? Which background knowledge can be assumed (including provided by external sources)? What are the attacks used by the adversary and which threats are posed?

methods that are used by commercial tools<sup>5</sup>. Unfortunately, it has been shown that such naïve replacement of identifiers is not sufficient to keep information secure in many scenarios.

For each confidentiality scenario we need to characterize at least the *sensitive information* (Sect. 4.2) and the *background knowledge* (Sect. 4.3) of the attacker or adversary as illustrated in Fig. 4. Then, we can identify the *confidentiality attacks* (Sect. 4.4) that are assumed to be employed and the resulting *threats* that should be mitigated. Based on the analysis of these threats, protection techniques have been proposed to mitigate them under certain assumptions (Sect. 4.5).

#### **4.2 Sensitive Information**

Several kinds of sensitive information may be derived from event logs. We consider both the scenario in which an event log contains some business information that needs to be secured as well as the scenario in which personal data of individuals that took part in the process should not be revealed. These individuals could be customers that are the *subject* of the process or workers that perform activities within the process.

We assume that the **sensitive information** is contained in a given event log as shown in Fig. 4. Sensitive information may be obtained *directly* from the attribute values of individual events or it may be *derived* by performing some computation over several events. Often, in the scenario in which personal data of individuals is at risk, the sensitive information in the event log is assumed to be connected to the individual through the process cases, each of which is about a single individual. We now illustrate several types of sensitive information with

<sup>5</sup> Most commercial tools provide some kind of pseudonymization technique to replace sensitive data by hashing or replacement. An example is given here: https://fluxicon.com/blog/2017/11/privacy-security-and-ethics-in-process-mining-part-3-anonymization/.


**Table 1.** Example of a loan application event log that contains several types of sensitive information and may be subject to confidentiality attacks revealing this information to an adversary possessing suitable background knowledge.

the event log in Table 1 that was obtained from the previously introduced loan application process.

An example of sensitive information related to an individual that can be directly obtained is the social security number of the applicant stored in the column *SSN*, which also acts as case identifier here. Obviously, using such direct identifiers of individuals poses a privacy risk as it would allow all the remaining information contained in the event log to be linked directly to individuals. Analogously, the *Resource* column contains the full name of the employee responsible for handling the process activities. This information would enable direct profiling of the work performance of individual employees, which may be against company policies or forbidden by work regulations. It is easy to remove directly personally identifiable information such as names or identifying numbers of customers or workers as they are not necessary for process mining. For example, it would be trivial to replace the values of both the *SSN* column and the *Resource* column in Table 1 with surrogate identifiers based on a mapping obtained through a one-way hash function or a simple lookup table.
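Such a replacement of direct identifiers can be sketched as follows. The events, the secret key, and the helper name `pseudonym` are hypothetical; the sketch uses a keyed hash (HMAC with SHA-256) from the Python standard library as one possible one-way mapping, which is an assumption of this example rather than something prescribed by the text.

```python
import hashlib
import hmac

# Hypothetical secret; without a key, identifiers from a small value space
# (such as social security numbers) could be recovered by hashing all candidates.
SECRET_KEY = b"hypothetical-secret-key"


def pseudonym(value, prefix):
    """Map an identifier to a stable surrogate using a keyed one-way hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12]}"


# Hypothetical events with direct identifiers in the SSN and Resource columns.
events = [
    {"SSN": "123-45-6789", "Resource": "Jane Doe", "Activity": "EC"},
    {"SSN": "123-45-6789", "Resource": "John Roe", "Activity": "DA"},
]

pseudonymized = [
    {
        "Case": pseudonym(event["SSN"], "case"),
        "Resource": pseudonym(event["Resource"], "res"),
        "Activity": event["Activity"],
    }
    for event in events
]

for event in pseudonymized:
    print(event)
```

Events of the same applicant still share one surrogate case identifier, so the log remains usable for process mining. Note that this step only replaces the direct identifiers and leaves all other attributes and the sequence of activities unchanged.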

However, it has been shown that obscuring the direct identifiers is not sufficient, as attributes that are not directly identifying can also be problematic [42,46]. *Quasi-identifiers* are values that do not directly reveal the identity of a person but may be used to do so in combination with other attributes. Common quasi-identifiers are attributes such as gender, birth dates, or postcodes that taken together are often unique for an individual. For example, in Table 1 the combination of the columns *Age*, *Type*, and *Postcode* would very likely uniquely identify a single customer, leading to the disclosure of other sensitive information contained in the event log such as the yearly *Income* of the applicant.

So far, we gave examples of sensitive information that is directly stored in the event attributes. However, the presence of a certain activity in the event log or derived information such as the sequence of events that occurred for a certain case may also be considered sensitive. Take for instance the third case in Table 1, in which the loan application is declined (*DA*) after an extensive check (*EC*). Knowledge of such details on how the loan application process was carried out may be used against the individual. Thus, even the sequence of activities performed for an individual case may be regarded as sensitive information. At the same time, the sequence of activities performed may also act as a quasi-identifier, as it is often unique and identifies an individual such as an applicant or a patient [42,47].

When it comes to sensitive business information one may think about attributes encoding the cost of a certain activity or information on prices paid by different customer segments (e.g., the interest offered on the loan). Similarly to the case of personal information, the sensitive information may not only reside in the attribute values but also be derived from the sequence of events that occurred or their timestamps, e.g., the throughput times computed for different organizational units may be considered sensitive.

It is important to realize that these computations may also be based on the artefacts that are returned by the classical process mining tasks: intermediate data structures, process models, and conformance checking results. Thus, direct access to the original event log may not be required to gain access to sensitive information. For example, the utilisation of a certain department or group may be determined by considering the number of traces in a certain time period and could be considered sensitive. The cross-organisational process mining scenario is also commonly used to motivate the need to protect sensitive information in process mining. Here, two organizations want to compare their processes to learn from each other or analyse a process that is jointly performed (e.g., supplier and integrator). However, certain sensitive data should not be shared.

#### **4.3 Background Knowledge**

Apart from the trivial case in which an individual or an organizational entity can be directly identified, attackers often need to possess certain limited background knowledge about the individual, i.e., the process case, the entity, or about remaining parts of the dataset. This is reflected in Fig. 4 by assuming the adversary to use some knowledge to facilitate attacks on sensitive information. Some protection models assume the worst-case scenario in which no restriction on the background knowledge of an attacker is assumed and still some kind of privacy guarantee should be given. However, in many cases it is reasonable to assume only limited background knowledge to be available.

**Background knowledge** may be fully derived from the event log or it may also contain information that is not present in the event log but related to specific cases or events. Thus, it can be any kind of knowledge that gives an attacker information that can be used to identify sensitive information. We keep the definition of the background knowledge deliberately vague as it may be defined in various ways and include arbitrary external data sources. Two more precise definitions for event logs have been introduced in the literature.

Rafei et al. [48] provide several definitions for possible background information in a process mining context. They assume that background knowledge is defined over a simple view of process traces as sequences of event labels, e.g., the third trace in Table 1 would be seen as the sequence ⟨SA, EC, DA⟩. Three categories are defined: *Set knowledge*, *Multiset knowledge*, and *Sequence knowledge*. The knowledge refers to the occurrence of activity labels in the process case under attack at one of the three abstraction levels. Thus, an attacker can either know only about the *presence* of activities (set abstraction), their *frequency* (multiset abstraction), or have in-depth knowledge about a certain *ordering* of activities (sequence abstraction). In Table 1, the third trace ⟨SA, EC, DA⟩ would already be uniquely identified when having the set background knowledge {DA}, since that is the only trace in which an application is declined. As another example, the multiset background knowledge of [SO<sup>2</sup>] would uniquely identify the fourth case. In many settings such knowledge of process events may be easy to obtain, e.g., one may know that their neighbours received two loan offers in a specific time period.
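The three abstraction levels can be illustrated with a small sketch. It is not the formalization of [48]: the log below is invented so that it mirrors the two examples in the text (the labels `SC` and `AO` and all case names are hypothetical), and sequence knowledge is assumed here to mean an order-preserving but not necessarily contiguous subsequence.

```python
from collections import Counter

# Hypothetical simplified log: case id -> sequence of activity labels.
log = {
    "case1": ["SA", "SC", "SO", "AO"],
    "case2": ["SA", "SC", "SO", "AO"],
    "case3": ["SA", "EC", "DA"],
    "case4": ["SA", "SC", "SO", "SO", "AO"],
}


def matches_set(trace, known):
    """Set knowledge: every known activity is present in the trace."""
    return set(known) <= set(trace)


def matches_multiset(trace, known):
    """Multiset knowledge: every known activity occurs at least as often."""
    counts = Counter(trace)
    return all(counts[a] >= n for a, n in Counter(known).items())


def matches_sequence(trace, known):
    """Sequence knowledge: known activities occur in this order."""
    remaining = iter(trace)
    return all(a in remaining for a in known)


def candidates(log, known, matcher):
    """Cases that are consistent with the attacker's background knowledge."""
    return [case for case, trace in log.items() if matcher(trace, known)]


declined = candidates(log, {"DA"}, matches_set)
two_offers = candidates(log, ["SO", "SO"], matches_multiset)
offer_after_submit = candidates(log, ["SA", "SO"], matches_sequence)
```

A candidate list with a single entry, as for `declined` and `two_offers` in this toy log, means that the background knowledge singles out one case.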

Von Voigt et al. [42] quantify the re-identification risk of individual cases by assuming different kinds of background knowledge. In addition to knowledge of activity labels as in [48], case-level attributes are also considered as candidates for background knowledge. For example, in the well-known BPI Challenge 2018 dataset [49], case attributes have been generalized to provide some level of privacy protection. Nevertheless, when considering the combination of all case attributes, 84.5% of all cases are still unique.
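A simple proxy for this kind of risk is the share of cases whose combination of attribute values occurs only once. The sketch below computes that share; it is an illustration under assumptions and not the measure defined in [42]. The helper `unique_fraction` is hypothetical, the attribute names follow the columns mentioned for Table 1, and the values are invented.

```python
from collections import Counter


def unique_fraction(cases, attributes):
    """Fraction of cases whose value combination on `attributes` is unique."""
    keys = [tuple(case[a] for a in attributes) for case in cases]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Invented case-level attributes (not taken from Table 1 or from [49]).
cases = [
    {"Age": 34, "Type": "business", "Postcode": "7491"},
    {"Age": 34, "Type": "business", "Postcode": "7491"},
    {"Age": 51, "Type": "private", "Postcode": "7030"},
    {"Age": 29, "Type": "private", "Postcode": "7030"},
]

risk_all = unique_fraction(cases, ["Age", "Type", "Postcode"])
risk_without_age = unique_fraction(cases, ["Type", "Postcode"])
```

In this toy data, half of the cases are unique when all three attributes are known, and none are unique once *Age* is left out, which shows how strongly the result depends on the assumed background knowledge.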

Many other similar abstractions and definitions of background knowledge are possible but have not yet been investigated, for example, partial orders of activities or knowledge about the time or the resources involved. An adversary may know that two medical diagnostic tests have been performed on the same day and that two days later the patient was re-invited for a discussion by the same doctor. Knowledge of the absence of a certain activity in the case to be attacked could also be informative. As Fig. 4 illustrates, external data sources may also provide complementary background knowledge. A famous example that involved using external background knowledge is the successful attack on a Netflix dataset by using information from the public IMDb movie ratings [50], which included full names for some users, and comparing them to the ratings in the Netflix dataset, thereby identifying users in the supposedly anonymized Netflix dataset.

In summary, a precise analysis of background knowledge assumed is important to provide meaningful guarantees against uncovering sensitive information.

#### **4.4 Threats and Attacks**

Several attacks on confidential data in event logs are possible. We follow Elkoumy et al. [45] and focus on an honest-but-curious attacker scenario. An adversary has access to data or results and tries to identify some sensitive information without trying to break into systems. So, we do not consider scenarios in which access control or similar security measures are broken.

We structure confidentiality attacks according to the threat that they pose, i.e., the kind of sensitive information that an attacker or adversary tries to reveal. As already motivated, it is important to consider the kind of background knowledge that is assumed in the analysis of a specific threat or attack to find reliable mitigation strategies. Attacks on confidentiality use this background knowledge to reveal sensitive information that is contained in the event log as shown in Fig. 4.

So, a very general definition of a confidentiality attack on an event log can be given as follows. Given an event log and some sensitive information that is related to that log, a **confidentiality attack** uses some background knowledge, which may be derived from the log or from other available sources, to reveal some subset of sensitive information that is part of the log. We distinguish four general types of threats based on the goal of an attacker and the employed attack method following the categorization in [45].

*Membership Disclosure Threats.* A basic threat is that an adversary could establish that an individual took part in the process that is described by the event log. A *membership inference* attack combines background knowledge about the individual with the information released by an event log or a process mining analysis. So, the sensitive information obtained from the event log would consist of the identifiers for a subset of individuals that took part in the process. Whereas this does not reveal the exact case in which an individual took part, it often still allows conclusions to be drawn about which activities and events an individual was involved in. Let us assume that the event log obtained in our example loan application process scenario only contains loans for starting a business. The information that an individual is part of that event log, i.e., that they applied for such a specific loan type, can already be sensitive.

*Re-identification Threats.* Threats that cause the disclosure of the identity of an individual to which some data belongs are called re-identification threats. So, the sensitive information is the subject of a certain case, e.g., the patient identity, or the subject of a certain event, e.g., the identity of the resource or worker that performed the activity recorded by the event. Example attacks are linkage attacks and intersection attacks [45]. *Linkage attacks* use background knowledge to reveal the identity, e.g., a certain combination of attribute values or a certain sequence of events is known to be connected to an individual. In Table 1, knowing that an individual received two offers, i.e., multiset background knowledge of [SO<sup>2</sup>], and that their data is part of the event log uniquely re-identifies the applicant in the fourth case. *Intersection attacks* try to establish a mapping between two separately released event logs, revealing the identity of an individual. Here, the information revealed in a second, separately released dataset is assumed to be directly linkable to an individual without containing any sensitive information. However, in combination this information can be used as background knowledge and reveal the sensitive information in the first event log.

*Reconstruction Threats.* In some cases it may be possible to partially or fully infer the original event log from seemingly protected data. Here, the sensitive information to be retrieved would be the entire event log. The two main attack methods for reconstruction are *difference attacks* and *model-inversion attacks*. The basic idea for both is to repeatedly consult a model or a statistic with slightly different queries, thereby uncovering sensitive data.
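The principle of a difference attack can be shown with two aggregate queries that differ only in whether they include the targeted individual. The sketch below is a toy illustration: the data, the query function `total_income`, and the attacker's background knowledge (the target is the only 41-year-old applicant in postcode 7491) are all assumptions made for the example.

```python
# Invented case-level data held by the data owner (not taken from Table 1).
cases = [
    {"Age": 34, "Postcode": "7491", "Income": 52000},
    {"Age": 41, "Postcode": "7491", "Income": 61000},
    {"Age": 58, "Postcode": "7491", "Income": 47000},
    {"Age": 29, "Postcode": "7030", "Income": 39000},
]


def total_income(predicate):
    """Aggregate query that is assumed to be answered exactly."""
    return sum(c["Income"] for c in cases if predicate(c))


# Query 1 covers a group that contains the target.
with_target = total_income(lambda c: c["Postcode"] == "7491")

# Query 2 covers the same group but excludes the target.
without_target = total_income(
    lambda c: c["Postcode"] == "7491" and c["Age"] != 41
)

# Neither answer reveals an individual value, but their difference does.
inferred_income = with_target - without_target
```

Both queries look harmless on their own because each aggregates over several individuals, which is why protection has to consider combinations of queries rather than single answers.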

*Cryptanalysis Threats.* Data may have been pseudonymized, as often done by commercial tools, or encrypted in an attempt to provide confidentiality. However, naïvely pseudonymized or even fully encrypted event logs are vulnerable to attacks based on frequency analysis [51]. Please note that this may in turn lead to re-identification, membership disclosure, or reconstruction, but may also simply leak sensitive business information such as the number of certain activity executions. The main attack method is a *frequency analysis* based on background knowledge of the activities of the process and their prevalence.
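A frequency analysis of this kind can be sketched as a rank-matching step: pseudonyms are sorted by how often they occur and aligned with activities sorted by their expected prevalence. The function `frequency_attack`, the pseudonyms, and the assumed relative frequencies below are hypothetical and only serve to illustrate the idea; the attack discussed in [51] may differ in detail.

```python
from collections import Counter


def frequency_attack(pseudonymized_events, known_frequencies):
    """Guess the pseudonym -> activity mapping by matching frequency ranks."""
    observed = Counter(pseudonymized_events)
    ranked_pseudonyms = [p for p, _ in observed.most_common()]
    ranked_activities = sorted(
        known_frequencies, key=known_frequencies.get, reverse=True
    )
    return dict(zip(ranked_pseudonyms, ranked_activities))


# Pseudonymized activity column of a released log (invented values).
released = ["x1"] * 10 + ["x2"] * 7 + ["x3"] * 2 + ["x4"] * 1

# Attacker's background knowledge: rough share of each activity (assumed).
background = {"SA": 0.50, "SO": 0.35, "EC": 0.10, "DA": 0.05}

guess = frequency_attack(released, background)
```

The sketch works because deterministic pseudonymization preserves the frequency distribution of the labels. It becomes unreliable when several activities have similar frequencies, in which case an attacker would need additional knowledge such as typical orderings.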

#### **4.5 Protection Approaches**

Although still at an early stage, the research on privacy and confidentiality has received increased attention in the past years, and several protection techniques with diverse assumptions and guarantees that protect against the mentioned threats have been proposed. However, none of the proposed methods is generally applicable to all possible confidentiality and privacy problems. Each of them makes certain assumptions regarding the attack scenario, including the background knowledge of the assumed attacker. Moreover, depending on the input log, the methods result in some loss of utility. Thus, the goal of the process mining analysis (discovery, conformance, etc.), its data requirements, and the characteristics of the process that generated the event log need to be considered.

**Fig. 5.** Different protection models have been proposed that protect the data contained in an event log by transforming it into protected representations: a protected event log, a protected abstraction of the event log, or a protected analysis result.

Protection models can work at different levels of a process mining analysis as shown in Fig. 5. Following [48], we differentiate between several tasks for protection models. Some models protect the event log itself and provide a protected copy of the original event log. Other techniques provide protected abstractions over the original event log, e.g., a directly-follows graph representation, which can only be used for certain process mining activities. In [48], these two tasks are denoted as Privacy-Preserving Data Publishing (PPDP) and Privacy-Preserving Process Mining (PPPM), respectively. We add a third possible task, which is to protect process mining results, e.g., a process model or a conformance checking result, without an intermediate representation.

Regardless of the task at hand, techniques can also be distinguished into roughly three categories of protection models [45]: *group-based privacy models*, *indistinguishability-based models*, and *confidentiality frameworks* including encryption. We now introduce the main properties of the three different protection model categories and briefly introduce exemplary techniques.

**Group-Based Privacy.** The prototypical group-based privacy protection models are those that provide *k-anonymity* [52]. The basic idea is that a tabular dataset containing rows with information about individuals is *k-anonymous* when each combination of values of the identifying attributes or columns (quasi-identifiers) appears at least k times. So, data similarity is used as a criterion here. The intuitive idea is that the individual cannot be distinguished from the k − 1 other individuals in the same group and, thus, with sufficiently large k it protects against re-identification. Usually, this is achieved by data *suppression* or *generalization* until the k-anonymity property is achieved. Whereas this model is interpretable and easy to understand, unfortunately, it has been shown to be susceptible to certain attacks based on background knowledge [53]. Several extensions have been proposed that mitigate some of those, including l-diversity [53] and t-closeness [54].
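The effect of generalization on k-anonymity can be sketched for a small table. The rows, the helper names `k_of` and `generalize`, and the chosen generalization (age in decades, postcode truncated to two digits) are assumptions made for this illustration and do not correspond to TLKC or PRETSA.

```python
from collections import Counter


def k_of(rows, quasi_identifiers):
    """Largest k for which the rows are k-anonymous on the given columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def generalize(row):
    """Coarsen quasi-identifiers: age to a decade, postcode to two digits."""
    decade = (row["Age"] // 10) * 10
    return {
        **row,
        "Age": f"{decade}-{decade + 9}",
        "Postcode": row["Postcode"][:2] + "**",
    }


# Invented rows; Income plays the role of the sensitive attribute.
rows = [
    {"Age": 34, "Postcode": "7491", "Income": 52000},
    {"Age": 38, "Postcode": "7465", "Income": 61000},
    {"Age": 51, "Postcode": "7030", "Income": 48000},
    {"Age": 56, "Postcode": "7052", "Income": 75000},
]

k_before = k_of(rows, ["Age", "Postcode"])
k_after = k_of([generalize(r) for r in rows], ["Age", "Postcode"])
```

Here every row is unique before generalization, and afterwards each row shares its quasi-identifier values with one other row. The price is a loss of detail in *Age* and *Postcode*, which is the utility trade-off that group-based models have to manage.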

For process mining, two methods provide group-based privacy protection models: the TLKC model by Rafei et al. [47] and the PRETSA approach by [55]. Both aim at the release of a protected event log, option (A) in Fig. 5. PRETSA utilizes generalisation based on a prefix tree that is built on top of the activity sequences in the event log and provides *k-anonymity* and *t-closeness* guarantees to prevent the disclosure (membership and re-identification) of resources or workers that performed certain activities. The TLKC model protects the identity of cases, e.g., a customer, and provides a relaxed variant of *k-anonymity*. Additionally, it supports protecting information in the time and organizational perspectives. Both approaches make assumptions about the background knowledge. Maintaining data utility is challenging for both methods when many unique traces exist.

**Indistinguishability-Based Privacy.** In contrast to the group-based models, which provide guarantees such as k-anonymity for a given dataset, indistinguishability-based privacy models guarantee that two versions of a dataset are indistinguishable to a certain degree. A central model is *Differential Privacy* (DP). The idea is that there is one dataset A without an individual's information and another dataset A′ that includes the individual's information. A mechanism provides DP with a parameter ε when the results of a (randomized) query mechanism statistically differ between A and A′ only by a small factor that is controlled by the ε-parameter [56]. This provides a strong guarantee that is independent of the background knowledge, as the guarantee needs to hold for any such pair of datasets A and A′. There have been many variants of the differential privacy concept [57]. Examples include adding a relaxation parameter δ to better tune the utility while losing some of its strength ((ε, δ)-DP), only requiring the values of the datasets to not differ too much and ignoring the addition and removal of items (bounded DP), or extending the guarantees to the case in which individuals appear multiple times in the dataset (group DP), which is a possible scenario for event logs.

For process mining, several adaptations have been proposed. The first one was given by Mannhardt et al. in [58], who assume a protected event log to be queried through a privacy engine. Laplacian noise is added to the counts returned by each query, thereby guaranteeing DP with regard to the individual cases. Queries are defined for both directly-follows relations [1] and complete activity sequences. The method was later extended in [59] to also protect contextual information that is encoded in the attributes of the event log. Furthermore, the guarantee was extended to *local DP*, which means that a perturbed event log itself can be released. One major issue of these methods is that obviously invalid process behavior may be added. Recently, the approach was improved to consider the semantics of the added noise [60]. Contextual information, in particular process performance indicators, is also protected in the work by Kabierski et al. [61]. Finally, there is a very recent proposal by Elkoumy et al. [62] that provides only a bounded DP guarantee but improves the utility of the protected data by using an oversampling approach instead of adding noise.
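Adding Laplacian noise to counts can be sketched for a query over complete activity sequences. The sketch is not the implementation of [58]; it rests on the following assumptions: each case contributes exactly one trace, so adding or removing a case changes exactly one count by one; the list of candidate variants is fixed in advance and does not depend on the data; and the whole privacy budget ε is spent on this single query. The names `laplace_noise` and `noisy_variant_counts` are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_variant_counts(traces, candidate_variants, epsilon, rng=None):
    """Return a noisy count for every candidate variant.

    With one trace per case the counts have sensitivity 1, so the noise
    scale is 1 / epsilon. Candidates that do not occur in the log are
    treated like all others and start from a true count of 0.
    """
    if rng is None:
        rng = random.Random()
    true_counts = Counter(tuple(t) for t in traces)
    scale = 1.0 / epsilon
    return {
        variant: true_counts[variant] + laplace_noise(scale, rng)
        for variant in candidate_variants
    }


# Invented traces and candidate variants (the last one is not in the log).
traces = [["SA", "EC", "DA"], ["SA", "SO"], ["SA", "SO"]]
variants = [("SA", "EC", "DA"), ("SA", "SO"), ("SA", "DA")]

released = noisy_variant_counts(traces, variants, epsilon=1.0)
```

A smaller ε means more noise and stronger protection. The noisy counts can be negative or positive for variants that never occurred, which is one way in which the invalid behaviour mentioned above can enter the result; any rounding or clipping would be a post-processing step chosen by the analyst.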

**Confidentiality Frameworks.** The third type that we distinguish are protection models that are not directly targeted at protecting individuals but at protecting any kind of sensitive information in event logs. Here, mainly encryption schemes have been proposed. A major family of techniques are those based on homomorphic encryption [63] schemes. The goal is to enable certain computations on an encrypted version of the data. For process mining, this idea is taken up by Rafei et al. in [51] and embedded in a framework that aims to protect against frequency or background knowledge-based attacks by disassociating events from their respective cases. It could be used to outsource computations on secured data or in a cross-organizational setting. However, it does not protect the resulting outputs, options (B) and (C) in Fig. 5, from an internal process analyst. The cross-organizational setting is also targeted by Elkoumy et al. in [64]. A secure multi-party computation [65] method is proposed that avoids leaking sensitive information in the cross-organisational process mining scenario.

The above categorization and list of techniques cover the major share of the work in the process mining field so far.

#### **4.6 Outlook and Challenges**

Protecting the privacy and confidentiality of data while keeping it useful for analysis is a difficult problem. Information needs to be hidden while the objective is to get as much signal from data as possible. Unsurprisingly, many open challenges exist for *confidentiality* in process mining and, apart from academic prototypes [66,67], none of the proposed techniques has seen uptake in commercial solutions. Seven main challenges for research in the field of privacy and confidentiality in process mining are identified by [45]:


Many of these challenges are geared towards improving technological solutions that provide some form of privacy guarantee in various settings. However, as already reported in [40], many aspects of privacy and confidentiality as well as compliance with regulations such as GDPR cannot be solved by technological measures alone. Yet there is little research from the organizational side apart from anecdotal discussion on the role of privacy in real-life process mining projects [41]. To conclude, it is notable that process mining has also been used to check conformance to privacy regulations [70]. Thus, process mining can also help in uncovering confidentiality issues that are present in an organization's processes.

## **5 Transparency**

Transparency has been a widely discussed topic for AI systems that are based on machine learning. Often, a key concern is the explainability of black-box classifiers such as Deep Learning models: Why is a certain classification or prediction made and what features are important in the decision of the model?

The core process mining tasks of process discovery and conformance checking aim to provide *white-box* process models that can be interpreted by process stakeholders. Explainability of the discovered models and, thus, transparency is a key objective of process mining. Still, there are several aspects of process mining in which transparency is at risk. In the next two sections, we focus on two exemplary transparency challenges for process mining: achieving *generalization* without hampering transparency and the *interpretability* of the discovered process model representations.

Besides these two transparency challenges, all the common transparency issues of predictive models are inherited when building predictive process mining models. We do not discuss these in detail, since many resources on explainable machine learning are available and [71] gives a brief overview of how to obtain explainable predictions in the context of process mining.

#### **5.1 Generalization**

Process discovery aims to abstract from the exact behaviour observed in the event log and return a *concise* model of the underlying process. This often requires disregarding *infrequent behaviour* to obtain simpler process models. At the same time, process discovery techniques often attempt to generalize beyond the observed behaviour since they cannot be assumed to have observed all possible incarnations of the process, particularly in the presence of parallel process behavior. Both aspirations create *transparency* challenges.

Disregarding infrequent behaviour may hide important parts of the observed data. In particular, infrequent patterns may be of high interest [44]. Very few techniques have been focusing on retaining infrequent data, e.g., in [72] certain infrequent dependencies are not filtered if they can be reliably predicted from data attributes and in [73] it is explored how to selectively include infrequent behaviour by filtering over multiple ranges of parameter values.

In an orthogonal direction, the frequency and probability with which behaviour is observed gets more attention in approaches that can be labeled as *stochastic process mining*. In [35], Leemans et al. proposed a new conformance checking method with the goal of taking into account routing probabilities, which improves the accuracy of the diagnostics.

#### **5.2 Interpretation of Results**

Interpretation of results based on process model notations or visualizations can be difficult for stakeholders, leading to *transparency* challenges. For example, the presence of loops together with optional activities may enable non-obvious process behaviour, and the filtering of edges in a directly-follows graph may lead to invalid statistics, as is illustrated for many commercial tools in [74].

However, misinterpretations are also possible for discovery approaches based on clear semantics. As an example, the models discovered by the Inductive Miner often contain silent transitions that allow certain behaviour to be skipped and that, in combination with loops, allow any behaviour. This may be difficult to spot for a non-expert. Whereas there exists research on the comprehension of process models [75], little work has yet been done in the context of automated process discovery.

Recently, the question of interpretability of process mining results has been touched upon by Mendling et al. [76], who raise the issue that the quality of process mining results needs to be judged in light of the tasks of a process analyst using the models. A first technical contribution for process discovery in this direction was provided by Fahland et al. [77] with a new variant of the Inductive Miner that was evaluated in a user study in which an analyst's trust in the model was considered. Overall, there has been surprisingly little research on this topic given the claim of process mining to provide white-box models.

## **6 Conclusion**

This chapter defined the concept of Responsible Process Mining under the umbrella of Responsible Data Science. Based on the FACT criteria put forward in [5] (Fairness, Accuracy, Confidentiality, and Transparency), we gave an overview of challenges related to these criteria and introduced state-of-the-art approaches for addressing each of them. Due to the broad scope of the FACT criteria, we can provide only a high-level introduction and discussion for each of them. We refer to the individual work or relevant surveys for further details.

In some areas the research on responsible process mining is already much further developed than in others. Little attention has been devoted to *fairness* in the context of process mining, at least compared to its prominence in the machine learning field. The trend towards more automated decision making in process mining may change this in the future. In contrast, the *confidentiality* challenge has been recognized in the process mining research community and has recently received much attention in research. However, adoption by commercial process mining tools has not yet started even though the problem has also been recognized by industry [41].

Criteria such as *accuracy* and *transparency* are very broad and many approaches touch these issues; however, with the notable exception of the work on data quality [12] they are rarely addressed explicitly under the umbrella of responsible process mining. More work is required to develop and address these criteria more explicitly in future process mining research.

## **References**



**Industrial Perspective and Applications**

# **Status and Future of Process Mining: From Process Discovery to Process Execution**

Lars Reinkemeyer

VP Customer Transformation, Celonis SE, Munich, Germany lars@reinkemeyer.de, l.reinkemeyer@celonis.com

**Abstract.** During the last two decades Process Mining has seen a rapid global adoption: first in academia and then in corporate business. It has evolved into a foundational technology, allowing users to discover actual process flows with unprecedented transparency, speed, and detail. In a business environment Process Mining has no purpose of its own, but companies leverage it to identify process inefficiencies, improve process execution and ultimately drive value. Process discovery and transparency do not provide immediate business value, but require specific use cases combined with human intelligence to identify and deploy levers for process improvement. In this article we argue that the future evolution of Process Mining should not focus on lateral expansion, i.e., further processes and discoveries, but on vertical expansion, i.e., enhancing the depth of added value for business users with artificial intelligence, proactive and predictive enablement and other levers which boost process execution. In essence, the focus should be on deploying smarter technologies for driving business value in process areas where Process Mining has shown impact.

## **1 Setting the Stage**

#### **1.1 The Evolution of Process Mining in Operational Business**

Process Mining was invented at the end of the last millennium by Wil van der Aalst and saw strong adoption by academics in the first decade of this millennium. In the second decade of this millennium companies started to use Process Mining for transparency, to discover, understand and improve actual processes. In this respect, numerous use cases, e.g., in horizontal support functions such as Procurement and Order Management, have been defined and deployed by companies like BMW, Siemens, Uber and many more around the world, across all industries, in organizations of any size and for processes along the whole value chain, as the following selected examples show:


All these samples have in common that Process Mining is used to discover actual process flows. Then human intelligence is applied to interpret the achieved transparency, identify root causes for process inefficiencies, and turn these into business value. However, the evolution of Process Mining should progress, similar to the evolution of imaging methods in healthcare.

As an analogy, the evolutions in Healthcare and Process Mining show many similarities: prior to the invention of the X-ray at the end of the 19th century, a Medicus needed to guess what was happening in the human body and what the root causes of a particular disease were. Similarly, before using Process Mining, process experts had to use process models and subjective assumptions to guess actual process flows and define process improvements. The invention of the X-ray made it possible to discover the root causes of diseases and thus decide on appropriate remediation. Similarly, Process Mining enables users to discover process gaps and decide on improvements. Imaging methods have become smarter and capable of interpreting the images, identifying diseases and proposing curative measures. Furthermore, medical devices are trained not only to "read" the images, but also to propose and conduct treatments. In a similar form, we expect Process Mining to develop more "intelligence": in a first step by automatically identifying process gaps and proposing measures for remediation to the users, who will then execute the action, and in the future even by "learning" to execute process activities autonomously based on defined criteria, with only exception-based human interference.

#### **1.2 Achievements in the Decade Starting 2010**

Process Mining has seen some impressive developments in the last decade and has moved many companies towards a data- and fact-based culture, using single sources of truth for process assessment and optimization. The evolution from Business Process Modeling (BPM, the design of how processes should happen) to Process Mining (full transparency on how processes actually happen) allows organizations to understand and improve processes.

Many organizations started using Process Mining in single functional silos (e.g. audit, procurement) and then expanded the usage across different functions. While there is an amazing variety of use cases, experience shows that the biggest impact is achieved in the core processes Accounts Payable (A/P), Accounts Receivable (A/R), Purchase-to-Pay (P2P) and Order-to-Cash (O2C). These horizontal support processes are critical to any company and are typically executed in transactional systems, providing sufficient digital event logs with a high degree of standardization, and requiring a high degree of automation. Many companies that started their Process Mining journey with a focus on these core processes achieved short-term transparency and operational impact, e.g., by eliminating duplicate payments, identifying payment term deviations, and reducing rework [1].

#### **1.3 Hurdles and Challenges**

While the concept of Process Mining has seen a rapid expansion, operational adoption has faced key challenges:


#### **1.4 The Power of Processes**

Processes are the lifeblood of any organization, and efficient process execution is a critical success factor for staying competitive. Amazon is probably one of the most efficient companies when it comes to process execution. The following quote from Jeff Bezos shows his reluctance to adopt rigid process frameworks and his preference for continuously adjusting and improving processes to maintain Day 1 efficiency and agility:

"You stop looking at outcomes and just make sure you're doing the process right. Gulp. It's not that rare to hear a junior leader defend a bad outcome with something like, "Well, we followed the process." A more experienced leader will use it as an opportunity to investigate and improve the process. The process is not the thing. It's always worth asking, do we own the process or does the process own us? In a Day 2 company, you might find it's the second." [2].

## **2 The Future of Process Execution**

Process Mining has enabled thousands of organizations around the world to better understand their actual processes, to fuel data- and fact-based discussions and thus to derive process improvements. However, the ultimate goal for any organization must be to minimize transactional cost, i.e. the cost induced by executing business processes. The digital age has seen the rise of an increasing number of digital native companies, which are built on highly efficient, automated processes, e.g. for sales order or purchase order processing. Think of Amazon's Marketplace, where a maximum degree of automation in order processing leads to a minimum of transactional cost. Traditional companies, with a grown legacy IT infrastructure, are challenged to compete against these digital native companies. While Process Mining focuses on process discovery, process execution focuses on enabling companies to execute processes more efficiently and thus reduce transactional cost by leveraging smart technologies.

#### **2.1 Process- and Organizational Transformation**

Leveraging Process Mining and process execution implies a significant transformation in the way a company operates. Transformation is a popular buzzword that comes in multiple flavors, such as digital transformation or transformation of the business model, which shall not be discussed in further detail here. Our focus is rather on process transformation and transformation of the operating model, i.e. on the way a company executes its processes and how it sets up operations to drive change. Processes are essential for the value generation of any company and typically show a high degree of inertia, as employees have become used to a certain way of doing things and are reluctant to change. While Process Mining can discover process inefficiencies and process execution can provide solutions for increasing efficiency, operational transformation is a crucial factor for success, which needs to be managed proactively. Experience shows that many Process Mining projects fail due to organizational and human reluctance to change. Tools and technology represent only one side of the equation for an organizational transformation, as the following statement shows: "While cutting edge technology and talent are certainly needed, it's equally important to align a company's culture, structure, and ways of working to support broad AI adoption. In most firms that aren't born digital, mindsets run counter to those needed for AI" [3].

Driving process- and organizational transformation implies a range of different success factors:


#### **2.2 Trends**

The following overview of trends draws on numerous discussions with process owners and experts, other Process Mining evangelists and market players.

#### **2.2.1 Intelligent Process Execution**

While Process Mining has enabled users to discover process inefficiencies with human intelligence, the concept of intelligent process execution builds on this discovery and supports users by providing only the information that is relevant. While Process Mining can screen millions of purchase orders, intelligent process execution presents individual users with only those purchase orders which require attention or call for immediate action. While Process Mining can discover millions of manual activities, intelligent process execution enables the user to execute multiple activities in one step in a suitable user interface. In essence, intelligent process execution takes Process Mining to the next level by leveraging AI as well as proactive and predictive solutions, so that users receive only the relevant information and smart forms of process execution.

#### **2.2.2 Proactive Solutions**

While big data and new tools allow unprecedented transparency, most software merely provides insights in which users have to search for relevant issues. Process Mining provides insights with which users can identify e.g. late deliveries, rework effort, process delays and much more. But should users apply human intelligence to search for relevant issues, spending high effort and wasting precious time in the process? We don't think so. Virtual assistants should provide proactive, customized and individual support. Intelligent process execution is capable of "learning" current operations and developing the skills to propose relevant exceptions proactively to the users. The software is evolving into a smart companion, which is capable of discovering the operational process, understanding exceptional issues and presenting these proactively to the user. For example, overdue payments can be presented to the user via push mail or pop-up message, delayed customer deliveries can be flagged, and potential for automation can be proposed. Dedicated execution Apps condense execution gaps or exceptions so that the user can decide how to proceed in these cases.

#### **2.2.3 Predictive Solutions**

Upcoming events can be predicted to enable users to take preventive measures. To take an example from procurement, it might be helpful to get a prediction of which purchase orders will not be delivered on time. Equally, in logistics it is helpful to get predictions of which shipments will not be delivered on time. Predictions are calculated from historical data and presented with probability thresholds: as one example, predictions can identify all supplies which will not be received on the expected date with a defined probability threshold. Solutions of this kind have been developed for several years, e.g. based on algorithms programmed with Python on R-server, analyzing open and closed orders including times for process execution, and have led to a vast number of operational execution support cases.
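The idea of flagging open orders against a probability threshold can be sketched in a few lines. The following is a minimal illustration only, not the algorithms referred to above: it assumes a deliberately simple frequency-based estimate of the late-delivery probability per supplier, and all field names (`supplier`, `due_day`, `delivered_day`, `order_id`), the smoothing parameters and the threshold value are hypothetical.

```python
from collections import Counter


def late_rate_by_supplier(closed_orders, prior_late=1, prior_total=2):
    """Estimate P(late) per supplier from closed orders.

    Uses a smoothed relative frequency (an assumption made for this sketch)
    so that suppliers with few orders do not get extreme estimates.
    """
    late = Counter()
    total = Counter()
    for order in closed_orders:
        total[order["supplier"]] += 1
        if order["delivered_day"] > order["due_day"]:
            late[order["supplier"]] += 1
    return {s: (late[s] + prior_late) / (total[s] + prior_total) for s in total}


def flag_open_orders(open_orders, rates, threshold=0.55, default_rate=0.5):
    """Return (order_id, probability) for open orders at or above the threshold."""
    flagged = []
    for order in open_orders:
        # Suppliers without history fall back to a neutral default estimate.
        p = rates.get(order["supplier"], default_rate)
        if p >= threshold:
            flagged.append((order["order_id"], round(p, 2)))
    return flagged


# Hypothetical historical (closed) and open purchase orders.
closed_orders = [
    {"supplier": "A", "due_day": 10, "delivered_day": 12},
    {"supplier": "A", "due_day": 20, "delivered_day": 25},
    {"supplier": "A", "due_day": 30, "delivered_day": 29},
    {"supplier": "B", "due_day": 10, "delivered_day": 9},
    {"supplier": "B", "due_day": 15, "delivered_day": 15},
]
open_orders = [
    {"order_id": "PO-1", "supplier": "A"},
    {"order_id": "PO-2", "supplier": "B"},
    {"order_id": "PO-3", "supplier": "C"},
]

rates = late_rate_by_supplier(closed_orders)
flagged = flag_open_orders(open_orders, rates)
print(flagged)
```

In practice the estimate would come from a trained model using many more order attributes; the thresholding step that decides which orders are shown to the user stays the same.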

#### **2.2.4 Usability**

Application development shows a strong focus on the consumer, with a requirement to provide intuitive user interfaces (UIs) which are fun to use and quick in interaction. Usability is equally relevant for standard Apps as well as for individual data analytics:


#### **2.2.5 Impact on Digital Workforce and Data Democratization**

With the trends for process transformation and organizational change, methodologies such as Process Execution and AI gain increasing relevance. New digital tools and data democratization have a major influence on the digital workforce, with changing requirements and roles. Process Execution enables and supports data analysts to drive execution efficiency, but at the same time requires new skills, roles and responsibilities. A new generation of experts has been educated, with a thorough understanding of computer science, data and IT.

The mindset of a digital workforce differs significantly from the traditional mindset, e.g. regarding access to data: while the traditional approach was extremely restrictive with respect to data access, typically following "eyes only" principles, the democratization of data is a trend which drives major change towards open data access. Access to data, as well as the preparation and analysis of data, was traditionally a task conducted by specialized experts in organizational silos. As the general perception shifts towards the understanding that data is essential for today's business, data ubiquity, accessibility, and usability for everybody become standard requirements.

#### **2.2.6 Data Collection and Preparation**

While projects in the past required high effort to identify, collect and prepare event logs, there is a trend towards the use of standard connectors. In particular, structured data from homogeneous systems (e.g. SAP ERP) can easily be identified and read by standard extractors. Discovering e.g. P2P processes across multiple systems has become possible with much less effort thanks to standard connectors, which require little customization. Automated discovery of event logs is expected to become possible, building on the growing experience gained from data preparation and on technical innovations. Machine learning algorithms will understand the format and structure of data in similar source systems, facilitating the automation of data collection and preparation. In addition, transactional ERP systems as well as workflow platforms such as Pegasystems and ServiceNow play an increasing role in process automation and execution. Data collection and preparation across different types of platforms allows seamless execution, e.g. for financial and customer data.

#### **2.2.7 Task Mining**

While Process Mining is based on event logs from backend systems, Task Mining provides process insights based on recorded activities of individual users, typically from front office systems. Examples of captured activities are mouse clicks, keystrokes, application inputs and field entries, which provide a much deeper understanding of an individual's working behavior. Task Mining makes it possible to discover actual human activities with the purpose of identifying potential for improvement. Any activity can be recorded, including phone calls, email or Excel documentation, where no log files are available and data is stored in unstructured format. While Task Mining provides a micro-picture of individual behavior and thus allows optimization of individual tasks, it does not provide insights into overarching operational processes, which can only be visualized with Process Mining. Task Mining typically complements Process Execution as a "magnified" analysis of actual user behavior, e.g. in call centers. Solutions have matured quickly and have become a valuable support for operational experts.

#### **2.2.8 Cloud Technology**

Storing digital traces in a public cloud has become commonly accepted. It makes it easier to use proven algorithms for the extraction and customization of data, to deploy standard use cases and to benefit from analytics available in the cloud. Hosted AI is expected to become attractive and available in the form of Software as a Service (SaaS), accessible through standardized Application Programming Interfaces (APIs), to provide applications, technology, and best practices to a large number of users.

#### **2.2.9 IIoT Platforms**

The Industrial Internet of Things (IIoT) has set the technical foundation for extensive access to event logs, as devices become connected to an internet-hosted platform, allowing easier access to the digital footprints generated by these devices. IIoT platforms such as MindSphere already receive data from millions of individual devices, including relevant event logs. Value can be generated, for example, by understanding manufacturing processes based on the event logs from multiple machines, even across machines at different sites. The collection of event logs from different machines, sites, and companies on one common IIoT platform will allow new use cases such as the visualization of cross-company supply chain processes or inter-company benchmarking. As a crucial benefit, IIoT platforms provide a standardized and secured environment and protocol, which has been adapted to industrial requirements.

#### **2.3 Midterm Future**

#### **2.3.1 Self-learning and -Optimizing Systems**

With AI becoming more mature and suitable to assist even in environments where profound domain knowledge is required, technology will evolve towards self-learning and self-optimization. Imagine a process execution system which is capable of learning autonomously, i.e. of detecting and resolving process inefficiencies. Like self-driving cars, there will be "self-driving" Process Execution tools which are capable of learning the factors that determine efficient process flows and of autonomously suggesting or even initiating measures to optimize process efficiency, including the optimization of variants and the reduction of process exceptions.

#### **2.3.2 Artificial Intelligence**

While the impact of AI experienced in operational use cases to date has been limited, it will grow to live up to its promise. Some innovative providers show exciting use cases with virtual process analysts discovering and documenting actual processes by imitation learning. A virtual digital companion learns from actual and optimal process handling and is thus trained to become an accepted artificial co-worker that also understands complex domain know-how, which is the big challenge in the B2B environment. Virtual companions are trained to identify and remediate process flaws, which can start with simple, repeatable process tasks such as the removal of delivery blocks. For all the excitement about AI, it must remain explainable in order to ensure ethical data usage, with clear transparency about what AI is applied and how. AI governance will play an increasing role and will have a significant impact on the acceptance of these new technologies, in particular in a corporate environment.

#### **2.3.3 Benchmarking**

Process Mining makes process efficiency measurable and transparent. As it is based on big data and facts, it is well suited for benchmarking purposes. Standard processes such as P2P and O2C will be benchmarked across different organizations on operational performance indicators such as automation rate, throughput time or rework. With digital traces available on standard platforms and in the cloud, this will also become available as a self-service, where companies can access benchmark data, based on appropriate data anonymization, to assess their own performance against other market players. Consulting companies will be able to lift cross-company benchmarking analysis to a new level of data foundation, as benchmarking can be conducted based on the full set of all relevant events from different players.
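To make the three indicators named above concrete, the following minimal sketch computes them from a small event log. The definitions are assumptions chosen for illustration (throughput time as the span between the first and last event of a case, automation rate as the share of events executed without a human, rework rate as the share of cases in which an activity repeats); the tuple layout and the sample data are hypothetical.

```python
from collections import defaultdict


def process_kpis(event_log):
    """Compute simple benchmark indicators from an event log.

    event_log: iterable of (case_id, activity, hour, automated) tuples,
    where `hour` is a numeric timestamp and `automated` is a bool.
    """
    event_log = list(event_log)
    cases = defaultdict(list)
    for case_id, activity, hour, automated in event_log:
        cases[case_id].append((hour, activity, automated))

    throughput = []
    rework_cases = 0
    for events in cases.values():
        events.sort()  # order the events of a case by timestamp
        throughput.append(events[-1][0] - events[0][0])
        activities = [activity for _, activity, _ in events]
        # A repeated activity within a case is counted as rework.
        if len(activities) != len(set(activities)):
            rework_cases += 1

    automated_events = sum(1 for _, _, _, automated in event_log if automated)
    return {
        "avg_throughput_hours": sum(throughput) / len(throughput),
        "automation_rate": automated_events / len(event_log),
        "rework_rate": rework_cases / len(cases),
    }


# Hypothetical P2P event log with two purchase orders.
log = [
    ("PO-1", "Create PO", 0, True),
    ("PO-1", "Approve PO", 4, False),
    ("PO-1", "Receive goods", 52, True),
    ("PO-1", "Pay invoice", 100, True),
    ("PO-2", "Create PO", 0, True),
    ("PO-2", "Change price", 10, False),
    ("PO-2", "Change price", 30, False),
    ("PO-2", "Pay invoice", 60, True),
]

kpis = process_kpis(log)
print(kpis)
```

For a cross-company benchmark, each participant would compute such indicators on its own data with identical definitions, so that only the anonymized figures need to be shared.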

#### **2.4 Longterm Future**

#### **2.4.1 Inter-Company**

The long-term perspective promises significant economic and ecological benefits through the optimization of cross-company supply chains, based on data from different companies and sources. Process optimization will become possible for inter-company value chains, including suppliers, manufacturers, freight forwarders and customers. Companies like Slync already offer multi-party supply chain interaction with a high degree of automation, across different organizations and multiple data sources. The value proposition is logistics orchestration across manufacturers, suppliers, freight forwarders and customers. With Process Execution, this could be taken to a new level by understanding the extended end-to-end process chains. On-time delivery, integrated manufacturing and optimization of stock and working capital are just a few benefits of transparent supply chain processes, which can be monitored and managed with the support of Process Execution. Empowering business partners with access to their own process data will allow all parties to benefit. Besides economic benefits, this will lead to sustainable ecological optimization thanks to the holistic approach, which will make it possible to reduce e.g. the number of empty deliveries, reduce waste, and allow for better resource management and more sustainable business.

#### **2.4.2 Sustainability**

The sustainability revolution should be supported by technological innovations such as Process Execution. Think about process inefficiencies in your immediate environment and how better process efficiency could support sustainability: from traffic congestion to waiting times in hospitals, from wasted time in call center queues to waiting times for bureaucratic decisions, from delayed goods deliveries to delayed flight arrivals. Process inefficiencies are omnipresent, producing friction, waste and avoidable emissions. Understanding the end-to-end processes makes it possible to track down inefficiencies and reduce waste of time and resources. While Supply Chain Management is probably the primary field in which Process Mining can support a sustainability revolution, CRM and other functions can equally act as ecological drivers. The management of resources in ERP systems (financials, materials, assets and HR) will become more efficient with Process Mining, making it possible to optimize the use of scarce resources. In a world with more than 7.9 billion people and increasing issues due to limited resources, this will become a strong purpose.

#### **2.4.3 B2C**

While the primary focus of Process Execution to date has been on business-to-business (B2B) processes, there is huge potential for process optimization in the business-to-consumer (B2C) field. Understanding consumer interactions with the additional dimensions of time and activity sequences makes it possible to better interpret and predict e.g. consumer behavior. B2C use cases could include, for example, activity tracking for the timing and sequence of user clicks on shopping pages or in social media platforms. Understanding the strategies with which users approach challenges such as searching for restaurants or music appears valuable and might allow for trail prediction. As another example, insight into the search sequence for web offerings could, based on large amounts of activities, be interesting not only for psychometric analysis, but also for product management and sales.

#### **2.5 Vision of a Digital Enabled Organization**

Imagine an organization in which most standard processes have been automated, such as procurement of indirect material, financial transactions, order deliveries and customer order processing. Standard tasks are conducted automatically, supported by an AI which is capable of learning not only how to execute standard cases, but also how to handle minor exceptions, conducting immediate actions and corrections. This "intelligent system" processes most activities with zero human touch, and humans only intervene by exception, which provides high process reliability at minimum transactional cost.

Data ingestion from diverse source systems is supported by AI, which makes it possible to identify and customize structured and unstructured data from various sources such as ERP or workflow systems. Cloud technology is commonly established as the basis for data hosting, collaboration, and data mining, with the application providers applying continuous monitoring and optimization. Streamed event data allows real-time process analytics for immediate reaction, e.g. in customer interaction. Platforms offer standard Apps for process execution in a secure environment and share best practices for process handling and monitoring.

As most operational processes have been fully automated, the focus of Process Execution changes. Based on this vision, there will be less demand for transparency and discovery in today's focus areas. Standard support processes such as P2P and O2C provide decreasing marginal benefits, as they are mostly optimized, and the focus shifts towards more challenging processes such as customer interaction, manufacturing, HR and legal proceedings. Besides inter-company automation, process optimization takes place across organizations in integrated supply chain process flows. Exception-based activities remain in focus, as they require optimization with appropriate digital tools. Similar to telemedicine, remote diagnosis and optimization of processes based on smart automation will be available through dedicated Process Mining Analysts, who are alerted by intelligent virtual assistants that conduct continuous real-time monitoring and provide predictive and proactive alerting.

The role of humans has changed significantly: mundane tasks have been completely automated, and new tasks and roles have emerged instead. The focus of human responsibility has shifted towards data analytics and steering, using tools which are provided by the digital enabled organization. Process analysts use digital tools such as virtual assistants, which collect data from Process and Task Mining, thus empowering the digital enabled organization. Value generation shifts towards service innovation. In their book "Dreams and Details", Snabe and Trolle describe how to reinvent business from a position of strength and with a compelling vision. An innovative "Digital Enabled Organization" could provide the dream that sets the mindset and framework to unleash the human and digital potential.

As a positive ecological contribution, process optimization has yielded a significant reduction in carbon footprint, e.g. through the reduction of empty trips and the optimization of routings. Transactional costs have been reduced to a minimum.

## **3 Conclusion**

While the first two decades of Process Mining have been focused on transparency and discovery, the real impact in a corporate environment is driven through intelligent execution management. Process Mining provides an excellent foundation, which will be enhanced with standard process execution Apps, common extractors, process transformation capabilities and artificial intelligence in order to execute business processes in an easier, smarter and more efficient manner. Thus Process Mining is the base for a much wider field which is still to be developed.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Using Process Mining in Healthcare**

Niels Martin<sup>1,2</sup>(✉), Nils Wittig<sup>3</sup>, and Jorge Munoz-Gama<sup>4</sup>

<sup>1</sup> Hasselt University, Martelarenlaan 42, 3500 Hasselt, Belgium
niels.martin@uhasselt.be

<sup>2</sup> Research Foundation Flanders (FWO), Egmontstraat 5, 1000 Brussels, Belgium

<sup>3</sup> KMS Vertrieb und Services AG, Inselkammerstraße 1, 82008 Unterhaching, Germany
nils.wittig@kms.ag

<sup>4</sup> Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860, 7820436 Macul, Chile
jmun@uc.cl

**Abstract.** This chapter introduces a specific application domain of process mining: healthcare. Healthcare is a very promising domain for process mining given the significant societal value that can be generated by supporting process improvement in a data-driven way. Within a healthcare organisation, a wide variety of processes is being executed, many of them being highly complex due to their loosely-structured and knowledge-intensive nature. Consequently, performing process mining in healthcare is challenging, but can generate significant societal impact. To provide more insights in process mining in healthcare, this chapter first provides an overview of healthcare processes and healthcare process data, as well as their particularities compared to other domains. Afterwards, an overview of common use cases in process mining in healthcare research is presented, as well as insights from a real-life case study. Subsequently, an overview of open challenges to ensure a widespread adoption of process mining in healthcare is provided. By tackling these challenges, process mining will become able to fully play its role to support evidence-based process improvement in healthcare and, hence, contribute to shaping the best possible care for patients in a way that is sustainable in the long run.

**Keywords:** Process mining *·* Healthcare *·* Evidence-based process improvement

## **1 Introduction**

The prior chapters of this book introduced various process mining topics. In contrast to these preceding chapters, this chapter focuses on introducing a specific application domain of process mining. In particular, this chapter focuses on *healthcare*. In process mining research, healthcare illustrations are often used to demonstrate new techniques, or a healthcare problem is the starting point of the research project altogether [55]. This can be, at least partly, explained by the great societal value related to efforts to improve the healthcare system. In many countries, the long-term sustainability of the healthcare system is an important societal issue due to trends such as the increasing life expectancy and the rising prevalence of chronic diseases [29]. Improvements to healthcare processes are an indispensable piece of the puzzle to sustain the healthcare system, while continuously improving the quality of care delivered to the patient.

Within the healthcare domain, many different processes are being performed in a wide variety of healthcare organisations. Many processes in healthcare are complex as they are loosely-framed and knowledge-intensive [20,55,58]. While the former indicates that healthcare processes can typically be executed in a large number of distinct ways [58], the latter indicates that the trajectory that is followed strongly depends upon complex decisions made by knowledge workers such as physicians and nurses [20]. These healthcare processes are increasingly being supported by health information systems [53], which capture data about the real-life execution of a process in their databases. This data can be leveraged to compose an event log, the key input for process mining [55].
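The step from database records to an event log can be illustrated with a minimal sketch. It assumes that records from different health information system tables already share a case identifier and carry an activity label and an ISO 8601 timestamp; the table names, field names and sample values are hypothetical, and real extractions typically require considerably more cleaning and identifier matching.

```python
from collections import defaultdict
from datetime import datetime


def build_event_log(*sources):
    """Merge records from several source tables into one trace per case.

    Each record is a dict with the (assumed) keys 'case_id', 'activity' and
    'timestamp', plus an optional 'resource'. Events are ordered by time.
    """
    traces = defaultdict(list)
    for records in sources:
        for rec in records:
            ts = datetime.fromisoformat(rec["timestamp"])
            traces[rec["case_id"]].append((ts, rec["activity"], rec.get("resource")))
    return {case: sorted(events, key=lambda e: e[0]) for case, events in traces.items()}


# Hypothetical extracts from an admission table and a laboratory table.
admissions = [
    {"case_id": "P1", "activity": "Registration", "timestamp": "2022-03-01T08:15:00", "resource": "clerk"},
    {"case_id": "P1", "activity": "Discharge", "timestamp": "2022-03-02T11:00:00", "resource": "nurse"},
    {"case_id": "P2", "activity": "Registration", "timestamp": "2022-03-01T09:30:00", "resource": "clerk"},
]
lab = [
    {"case_id": "P1", "activity": "Blood test", "timestamp": "2022-03-01T10:05:00"},
]

event_log = build_event_log(admissions, lab)
p1_activities = [activity for _, activity, _ in event_log["P1"]]
print(p1_activities)
```

Each case (here, a patient episode) thus becomes a time-ordered sequence of events, which is the structure that process mining techniques take as input.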

There has been a steady growth in research interest on process mining in healthcare in recent years [17]. Despite the great potential of process mining to support process improvement in healthcare and the increasing number of methods specifically designed for the healthcare context, the systematic uptake of process mining in healthcare organisations outside the research context is still fairly limited [55]. Hence, there are still challenges ahead that need to be overcome, which is consistent with the fact that process mining in healthcare is a rather young research area. Moreover, healthcare is a highly dynamic field as processes change due to advances in, for instance, medicine and technology [29,55]. For instance, the increasing presence of wearable devices and mobile health applications provides opportunities to collect richer data about a particular process, but also presents new challenges, e.g. in terms of merging all data sources [37,55]. Even though it will require continued efforts, it is worthwhile to benefit from opportunities and tackle challenges as it will enable process mining to fully play its pivotal role to instigate evidence-based process improvement in healthcare [55].

The goal of this chapter is to introduce the reader to healthcare as an application domain for process mining. To this end, the remainder of this chapter is structured as follows. Section 2 provides a primer on healthcare processes and healthcare process data, with an emphasis on their particularities. Section 3 introduces the reader to the common use cases of process mining in healthcare from a research point of view. Section 4 discusses a case study, which illustrates the potential of process mining in the context of a specific hospital. Section 5 outlines the key open challenges that the community is confronted with when it aspires to a broad uptake of process mining in healthcare. The chapter ends with a brief conclusion in Sect. 6.

## **2 A Primer on Healthcare Processes and Process Data**

Before providing an overview of common use cases in the process mining in healthcare literature, this section sets the stage by providing an overview of healthcare organisations and healthcare processes (Sect. 2.1). Moreover, the particularities of healthcare processes and healthcare process data are introduced (Sect. 2.2).

#### **2.1 Healthcare Organisations and Healthcare Processes**

Some readers might implicitly equate healthcare with the care that patients receive in a hospital. Hospitals, either general hospitals or specialised hospitals [57], play an important role in the provision of healthcare services. As will become apparent in Sect. 3, many process mining applications are also situated within the hospital context. However, it should be noted that curative care, i.e. care focused on the treatment of diseases to increase life expectancy [88], is organised in various types of healthcare organisations [57]. For instance, long-term care facilities provide care to patients suffering from a chronic disease or patients needing long-term rehabilitation after a hospital discharge. Psychiatric care organisations, in their turn, provide therapy for patients with mental problems. Home-based care organisations, another category of healthcare organisations, deliver care services in the comfort of the patient's home [57].

Within a particular healthcare organisation, a wide variety of healthcare processes is being performed. A basic distinction between medical treatment processes and organisational processes is introduced by Lenz and Reichert [46]. *Medical treatment processes*, also commonly referred to as clinical processes, have a direct link to the patient and are connected to the therapeutic-diagnostic cycle. This implies that, in these processes, healthcare professionals take informed decisions regarding the patient's diagnosis or therapy based on medical knowledge and the available patient-related information. *Organisational processes*, in their turn, cover all processes that support medical treatment processes by coordinating actions between different healthcare professionals and supporting staff, potentially even belonging to various departments. Examples include appointment or procedure scheduling processes, as well as logistical processes of patients or goods [46,67].

An alternative categorisation of healthcare processes is provided by Mans et al. [52]. Their classification solely takes processes that are directly related to the patients into account, but considers both medical activities and the preparation of these activities (such as booking the appointment) as being part of the same process. Against this background, Mans et al. [52] make a distinction between elective care processes and non-elective care processes. The execution of *elective care processes* can responsibly be postponed for several days or weeks. Within this subcategory, a further distinction is made between standard, routine, and non-routine care processes. For *standard care processes*, a structured treatment trajectory is available, containing information about the activities that need to be performed, as well as the timing that needs to be respected. In a *routine care process*, various treatment trajectories can be followed to obtain an outcome that is typically known. The latter does not hold for *non-routine care processes*, as a physician will need to determine the next step in the treatment trajectory based on the patient's reaction to the current process step. While elective care can be postponed for several days or weeks, *non-elective care processes* refer to unexpected medical treatments that need to be performed promptly. Here, a distinction is made between *emergency care processes*, which should be executed immediately, and *urgent care processes*, which can be postponed for a limited period of time (e.g. a few days) [52].

From the previous, it follows that healthcare is a highly versatile domain, with a large variety of healthcare organisations and a mix of different processes being executed at these organisations. These processes can be fairly structured (e.g. standard care processes) or highly unstructured (e.g. non-routine care processes) [52]. The close interconnection between processes, even across different healthcare organisations, adds to the complexity of the healthcare domain. For instance: the trajectory of a patient suffering from a chronic disease might consist of surgery at a specialised hospital, several check-ups at a local general hospital, as well as multiple therapies taken at home under the supervision of a home nurse [55]. Even within a single healthcare organisation, processes are closely intertwined as, e.g., efficiently carrying out surgical processes depends on the timely execution of logistical processes, both regarding patient transportation and the material flow.

#### **2.2 Particularities of Healthcare Processes and Process Data**

To really grasp the challenging nature of healthcare as an application domain for process mining, it is important to understand the particularities of healthcare processes and healthcare process data. Munoz-Gama et al. [59] defined ten distinguishing characteristics of healthcare processes, which also impact the process data that will be recorded. While some of these characteristics might also be relevant for other sectors, their combined occurrence in the healthcare context needs to be reckoned with and will generate challenges when conducting process mining analyses. The ten key particularities of healthcare processes and healthcare process data, as defined in Munoz-Gama et al. [59], are discussed in the remainder of this subsection.

*Exhibit Significant Variability.* An important contributing factor to the complexity of healthcare processes is their significant variability [63,67]. Variability is caused, amongst others, by the diversity of activities that can be performed (e.g. a wide variety of examinations and treatments) in various orders, and the different characteristics of patients (e.g. they can suffer from various combinations of co-morbidities, influencing the way the process is executed) [67]. As a consequence, in many healthcare contexts, almost every case will have a unique trajectory through the process, leading to challenges within the context of, e.g., control-flow discovery [59].
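The degree of variability in an event log can be made tangible by counting trace variants, i.e. distinct activity sequences. The following minimal sketch assumes a toy event log with hypothetical case identifiers and activity names; it is meant as an illustration only and does not stem from any of the cited studies.

```python
from collections import Counter

# Toy event log (hypothetical): case id -> ordered list of activity names.
event_log = {
    "p1": ["Registration", "Triage", "Blood test", "Discharge"],
    "p2": ["Registration", "Triage", "X-ray", "Discharge"],
    "p3": ["Registration", "Triage", "Blood test", "Discharge"],
    "p4": ["Registration", "Triage", "X-ray", "Blood test", "Discharge"],
}

def variant_counts(log):
    """Count how many cases follow each distinct activity sequence (variant)."""
    return Counter(tuple(trace) for trace in log.values())

def unique_case_share(log):
    """Share of cases whose variant is followed by no other case."""
    counts = variant_counts(log)
    unique_cases = sum(1 for n in counts.values() if n == 1)
    return unique_cases / len(log)

variants = variant_counts(event_log)
print(len(variants), "variants for", len(event_log), "cases")
print("share of cases with a unique trajectory:", unique_case_share(event_log))
```

In a real healthcare log, a share close to one would signal that almost every patient follows a unique trajectory, which is the situation described above.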

*Value the Infrequent Behaviour.* In many domains, process mining is used to better understand the typical behaviour of a process. Hence, as infrequent behaviour would complicate, e.g., the discovered control-flow model, it is often removed in the pre-processing stage of a process mining project [15]. However, in healthcare, infrequent behaviour can be a source of valuable knowledge about the process. It might, for instance, highlight infrequent treatment paths that result in the same clinical outcome, unveiling knowledge about alternative treatment options for a particular disease [22,59]. Understanding infrequent behaviour is important as solely focusing on models representing the typical behaviour could generate blind spots, which constitute missed innovation opportunities for healthcare processes [59].

*Use Guidelines and Protocols.* Within the field of medicine, various clinical practice guidelines and protocols are available, which build upon evidence-based information on a certain topic [79,87]. This implies that, for clinical processes, reference processes are often available, which does not hold in many other domains [35]. This opens opportunities for process mining to, e.g., analyse the adherence to these guidelines and protocols [34,59].

*Break the Glass.* While clinical practice guidelines and protocols aim to achieve standardisation in clinical processes, medical doctors and healthcare professionals might need to deviate from guidelines and protocols when confronted with specific situations. For example: the discovery of specific co-morbidities of a patient might require an alternative course of action [62,72]. Another situation that might require a deviation from protocols is an unexpected surge in the number of arriving patients that a department has to cope with [59]. The occurrence of such *'break the glass'* situations will also be reflected in the data, highlighting the crucial importance of taking context information into account when using process mining in healthcare to fully understand the process behaviour [59,80].

*Consider Data at Multiple Abstraction Levels.* In a healthcare context, data about the execution of a process can originate from various data sources, both for clinical processes and organisational processes [45,55]. These data sources will capture data at multiple levels of abstraction. Medical equipment such as surgical robots or wearable devices will often generate large volumes of very fine-grained data, which should be aggregated to retrieve meaningful patterns [59,85]. High-level data, typically recorded in administrative systems, tends to be directly interpretable, but might provide an insufficiently detailed view on the process. Hence, when performing process mining in healthcare, it might be required to integrate data from various sources, potentially bridging clinical and administrative systems, as well as different data abstraction levels [59].

*Involve a Multidisciplinary Team.* Healthcare processes typically have a multidisciplinary character, with healthcare professionals (physicians from various disciplines, nurses, etc.) and supporting staff with various backgrounds being involved [55,67]. Given the critical importance of expertise from the healthcare domain, a multidisciplinary team needs to be involved during all stages of a process mining initiative, ranging from the specification of the problem to the translation of process mining insights to practical actions. This implies that attention needs to be attributed to the use of the appropriate medical terminology and customs to assure mutual understanding [59].

*Focus on the Patient.* When considering healthcare processes, the key role of the patient should be emphasised. Patients are, directly or indirectly, at the core of nearly all healthcare processes. Hence, when performing process mining in healthcare, specific attention should be attributed to support the provision of patient-centred care, a key care quality indicator [11]. When focusing on the patient journey, i.e. the trajectory of a patient over the course of a disease or treatment [49], it is important to note that (s)he typically receives services from various healthcare organisations (e.g. the hospital, the general practitioner, and the physiotherapist). This also causes the patient journey data to be spread over several organisations, with its associated challenges [55,59].

*Think About White-Box Approaches.* Recent advances in artificial intelligence and machine learning have provided techniques to support physicians in taking complex clinical decisions. One of the biggest hurdles for the adoption of such techniques is the physician's reluctance to use systems that they do not fully understand, i.e. to use *black-box* approaches [65]. Hence, to support decisions in a healthcare context, there is a need for *white-box* approaches, enabling healthcare professionals to understand where recommendations originate from. Process mining is perceived as such a white-box approach [39]. Nevertheless, the understandability of process mining outcomes for healthcare professionals should remain a permanent point of attention [55,59].

*Generate Sensitive and Low Quality Data.* Healthcare processes, especially clinical processes, generate sensitive data as they typically contain information regarding a patient's health condition, co-morbidities, ongoing treatments, etc. Consequently, ethics in general and *data privacy* in particular need to be first-class citizens when working with healthcare processes [74]. Moreover, strict regulations are typically in place regarding the use, storage and transfer of sensitive healthcare data [64]. Besides data privacy, *poor data quality* also characterises data collection regarding healthcare processes [54,86]. Data quality, a topic which has been discussed in Chapter 6 [18], is highly relevant in the healthcare domain, where data might suffer from various quality issues such as missing events, incorrect timestamps and imprecise timestamps [52,86]. One of the key reasons for data quality issues in healthcare is the fact that many events are recorded after a manual interaction between a healthcare professional and an information system. This might cause inaccuracies in the recorded data as some actions might not be recorded in the system, other actions might be recorded in the system well after they have been executed, etc. Data quality issues have to be handled with great care when conducting process mining in healthcare [59].

*Handle Rapid Evolutions and New Paradigms.* As the healthcare domain is rapidly and continuously evolving, this also holds for processes in healthcare. Changes are induced both by advances in clinical research, leading to changes in diagnostic or treatment processes [24], and by advances in technology, e.g. the rise of remote monitoring due to the development of robust mobile health solutions [76]. New healthcare paradigms also surface, which in turn have an impact on healthcare processes. For instance: patient-centred care has become a core paradigm in healthcare, implying that care should attribute significant attention to the needs and preferences of the individual patient [66]. When working on process mining in healthcare, researchers and practitioners should be aware of these rapid evolutions and emerging new paradigms, as well as be able to cope with them [59].

## **3 Use Cases in Process Mining in Healthcare Research**

Against the background of the previous section, this section aims to highlight some typical use cases for process mining in healthcare as reported in published research articles. While many of the papers that will be referenced below make important methodological contributions, the focus of the discussion in this section is mainly on how process mining techniques were applied in a particular healthcare context. To structure the outline, the six process mining types introduced in Chapter 1 [1] are used: process discovery (Sect. 3.1), conformance checking (Sect. 3.2), performance analysis (Sect. 3.3), comparative process mining (Sect. 3.4), predictive process mining (Sect. 3.5), and action-oriented process mining (Sect. 3.6). At the end of the section, some recommendations for further reading are provided (Sect. 3.7).

#### **3.1 Process Discovery**

Process discovery focuses on the discovery of a process model from an event log. As holds for process mining in general, process discovery is also, by far, the most prominent use case of process mining in healthcare [17,37]. Papers on process discovery in healthcare typically center around the discovery of the control-flow, i.e. the order of activities, from an event log [17].

When focusing on control-flow discovery, various algorithms have been used to *automatically* retrieve a visualisation of the activity order from an event log. Based on a literature review, Guzzo et al. [37] conclude that Heuristics Miner is the most commonly used algorithm, followed by Fuzzy Miner and Inductive Miner. Control-flow discovery has been applied in various healthcare contexts. For instance: Caron et al. [14] use the Heuristics Miner to retrieve a process model for the radiotherapy department within the context of gynaecologic oncology. Duma and Aringhieri [25] use both Heuristics Miner and 'Inductive Miner - Infrequent' to study the patient trajectory at the emergency department of an Italian hospital. To limit the complexity of the data, they preprocess the event log by merging consecutive events referring to the same activity in the process. Despite these pre-processing efforts, the Heuristics Miner discovers a spaghetti model, which is not understandable. The 'Inductive Miner - Infrequent', in its turn, generates a very simple, but imprecise model, meaning that the discovered model allows for a lot of behaviour that is not observed in the event log [25]. Using, amongst others, Heuristics Miner and Fuzzy Miner, Kim et al. [40] focus on the patient trajectory in an outpatient clinic in Korea. They explicitly compare the process models discovered from data to a process model that has been developed solely based on a discussion with domain experts. The process mining insights surface some important trajectories that are not included in the domain experts' model, highlighting the added value of process mining [40].
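To make the pre-processing step of merging consecutive events more concrete, the sketch below collapses repeated consecutive activities in each trace and then counts directly-follows relations, a basic ingredient of many control-flow discovery algorithms. It is a simplified illustration on hypothetical data; it does not implement Heuristics Miner or Inductive Miner, nor the exact pre-processing of the cited study.

```python
from collections import Counter
from itertools import groupby

def merge_consecutive(trace):
    """Collapse runs of consecutive events that refer to the same activity."""
    return [activity for activity, _ in groupby(trace)]

def directly_follows(log):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical emergency department traces (activity names are assumptions).
raw_log = [
    ["Triage", "Visit", "Visit", "Lab", "Lab", "Lab", "Discharge"],
    ["Triage", "Visit", "Lab", "Visit", "Discharge"],
]
merged_log = [merge_consecutive(t) for t in raw_log]
dfg = directly_follows(merged_log)

for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

After merging, self-loops such as Visit followed by Visit disappear from the directly-follows counts, which is one way in which this kind of pre-processing limits the complexity of the discovered model.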

Besides automated control-flow discovery, *interactive control-flow discovery* also receives some attention in literature. A distinguishing characteristic of interactive control-flow discovery is that a domain expert is interactively involved while the model is being discovered from the event log [10]. In this way, domain knowledge is embedded in the discovery process, instead of being used to interpret the output of an automated algorithm. Using a case study of the patient trajectory of lung cancer patients, Benevento et al. [10] show that the interactive process discovery approach of Dixit et al. [23] generates control-flow models which are both accurate and understandable. In contrast, automated control-flow discovery algorithms might have difficulty generating such an accurate and understandable model. Even though the advanced algorithms discussed in Chapter 3 [7] will prove helpful, it might still be difficult to discover accurate and understandable control-flow models automatically. This can be, at least partly, explained by the fact that the order of tasks in healthcare processes often depends on highly specialised background knowledge, which is not embedded in the event log [10]. While interactive control-flow discovery has received fairly little attention so far, it is highly promising for domains in which processes are highly knowledge-intensive and loosely-structured, which holds for many healthcare processes [55]. For a more extensive introduction on interactive process mining in healthcare, the reader is referred to Fernandez-Llatas [29].

An important challenge in control-flow discovery in healthcare, especially for medical treatment processes, is the great variability [59]. As many different paths through the process tend to occur, applying a control-flow discovery algorithm often results in a spaghetti model, which is very complex or even impossible to understand [51]. To handle this problem, trace clustering techniques can be used to create more homogeneous patient subgroups, which can be studied separately in an effort to reduce complexity. For instance: Mans et al. [51] use trace clustering on an event log of gynaecological oncology patients from a Dutch hospital to generate patient groups that follow a similar trajectory. Despite the potential of trace clustering, Lu et al. [48] also recognise some challenges. These include the fact that individual clusters might still contain thousands of distinct activities performed for patients, which would still be highly problematic for control-flow discovery purposes. Moreover, if clusters are created based on the medical condition of patients, each cluster might still contain a wide variety of patient trajectories as the same condition might be handled in a variety of ways. Against this background and with the ambition to generate clusters that are meaningful to domain experts, Lu et al. [48] develop a novel trace clustering method. Their method starts from a small sample set of patients, based on input from domain experts, to generate clusters. An evaluation of the method at a Dutch hospital highlights that the resulting control-flow models presented meaningful behavioural patterns for medical experts [48].
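The general idea of trace clustering can be sketched as follows. The example groups traces whose sets of activities are sufficiently similar, using the Jaccard similarity and a greedy assignment to the first matching cluster. This is a deliberately simplistic stand-in: it is neither the technique used by Mans et al. [51] nor the method of Lu et al. [48], and the threshold value and activity names are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of activities."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def cluster_traces(log, threshold=0.6):
    """Greedy clustering: a trace joins the first cluster whose seed trace
    has a Jaccard similarity of at least `threshold` on activity sets."""
    clusters = []  # list of (seed activity set, [case ids])
    for case_id, trace in log.items():
        profile = set(trace)
        for seed, members in clusters:
            if jaccard(profile, seed) >= threshold:
                members.append(case_id)
                break
        else:
            # No sufficiently similar cluster exists: start a new one.
            clusters.append((profile, [case_id]))
    return [members for _, members in clusters]

# Hypothetical oncology traces.
log = {
    "p1": ["Consult", "Surgery", "Follow-up"],
    "p2": ["Consult", "Chemo", "Chemo", "Follow-up"],
    "p3": ["Consult", "Surgery", "Pathology", "Follow-up"],
    "p4": ["Consult", "Chemo", "Radiotherapy", "Follow-up"],
}
clusters = cluster_traces(log)
print(clusters)
```

Each resulting cluster can then be passed to a control-flow discovery algorithm separately. Note that the outcome of such a greedy scheme depends on the order of the cases and on the threshold, which illustrates why obtaining clusters that are meaningful to domain experts is not trivial.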

While the majority of control-flow discovery contributions take data from the hospital information system as a starting point, other types of input data are also occasionally taken into consideration [37]. For example: Fernandez-Llatas et al. [31] use real-time indoor location systems data, which track the movement of patients throughout the surgery area of a Spanish hospital. Using this data, they apply PALIA to discover a process model that represents the order of locations that a patient has visited [31]. Another illustration is the work of Lira et al. [47], where video recordings of a surgical procedure, i.e. the ultrasound-guided central venous catheter placement, are used as input data. These video recordings are tagged to generate an event log, which is used as an input for control-flow discovery [47].

All of the aforementioned papers focus on the discovery of control-flow models. However, as highlighted in Chapter 1 [1], process discovery can also relate to other perspectives of the process, such as the resource perspective. For instance, Alvarez et al. [3] identify collaboration patterns between healthcare professionals within the emergency department of a hospital. The resulting process model provides valuable insights into the interactions between physicians, nurses, medical assistants and technicians [3]. Similarly, one of the analyses conducted by Agostinelli et al. [2] centers around the identification of interactions between different subdepartments in an Italian outpatient clinic. These examples highlight the potential of process mining to discover valuable process models in healthcare, also beyond the control-flow perspective.

#### **3.2 Conformance Checking**

As highlighted in Sect. 2.2, a multitude of clinical practice guidelines and protocols are available in the healthcare domain, which can act as reference processes [59]. Conformance checking, the topic of Chapter 5 [13] and a second common use case for process mining in healthcare, enables assessing the adherence of the real-life healthcare process (as captured by the event log) to clinical guidelines and protocols, as well as studying where reality deviates from an already existing process model [55]. For instance: Mannhardt and Blinde [50] use the public sepsis event log and aim to assess the conformance of the real-life process with two rules put forward by the sepsis guidelines at that time: (i) the time difference between the moment at which the triage document is completed and the administration of intravenous antibiotics should be less than one hour, and (ii) the time difference between the moment at which the triage document is completed and the measurement of lactic acid should be less than three hours. Through the use of multi-perspective conformance checking, the authors conclude that the first rule is violated for 58.5% of the patients, while the second rule is only violated for 0.7% of patients. This observation constitutes a basis to look into the adherence to medical guidelines in more detail [50]. Another example is the work by Rinner et al. [68], who use alignment-based conformance checking to assess the compliance between the European guideline on melanoma treatment and an event log from an Austrian medical university. This analysis is highly relevant as the authors indicate that patients who comply with the guidelines have a significantly better prognosis than deviating patients [68]. Also focusing on clinical guidelines, Huang et al. [38] propose an approach to detect both global and local anomalies between a clinical pathway and an event log. While the former refers to patient trajectories that significantly deviate from the clinical pathway, the latter represents a deviation in a particular part of the trajectory. This approach is applied to an event log containing trajectories of unstable angina patients at a Chinese hospital [38].
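Time-based guideline rules of the kind described for the sepsis case can be checked directly on timestamped traces. The sketch below is a plain rule check on a hypothetical toy log, not the multi-perspective conformance checking technique applied by Mannhardt and Blinde [50]. The activity labels, the timestamps, and the choice to count a missing target activity as a violation are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical activity labels; a real log uses its own naming.
TRIAGE, ANTIBIOTICS, LACTATE = "Triage", "IV Antibiotics", "Lactic Acid"
# Maximum allowed time between triage and each target activity.
RULES = {ANTIBIOTICS: timedelta(hours=1), LACTATE: timedelta(hours=3)}

def first_time(trace, activity):
    """Timestamp of the first occurrence of `activity`, or None if absent."""
    return next((ts for act, ts in trace if act == activity), None)

def violated_rules(trace):
    """Return the target activities whose time rule is violated for one case.
    A missing triage or target activity counts as a violation (assumption)."""
    start = first_time(trace, TRIAGE)
    violations = []
    for target, limit in RULES.items():
        ts = first_time(trace, target)
        if start is None or ts is None or ts - start >= limit:
            violations.append(target)
    return violations

def t(hhmm):
    """Helper: build a timestamp on an arbitrary, fictitious day."""
    return datetime.strptime("2024-01-01 " + hhmm, "%Y-%m-%d %H:%M")

log = {
    "c1": [(TRIAGE, t("10:00")), (LACTATE, t("10:20")), (ANTIBIOTICS, t("10:45"))],
    "c2": [(TRIAGE, t("11:00")), (LACTATE, t("11:30")), (ANTIBIOTICS, t("13:10"))],
    "c3": [(TRIAGE, t("12:00")), (ANTIBIOTICS, t("12:30"))],
}
report = {case: violated_rules(trace) for case, trace in log.items()}
antibiotics_violation_rate = sum(ANTIBIOTICS in v for v in report.values()) / len(log)
print(report)
print(antibiotics_violation_rate)
```

Aggregating such per-case results over all patients yields violation percentages comparable in spirit to those reported above, although the handling of missing or out-of-order events would need to be agreed upon with domain experts.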

While conformance checking offers great potential, Sato et al. [75] highlight the challenge that clinical guidelines and protocols are often defined at a different level of aggregation than the events in the event log. To tackle this problem and using the pre-operative phase of bariatric surgery as an illustration, the high-level activities in the reference model are explicitly mapped to the events included in the event log. Besides the potential discrepancy in terms of the level of aggregation, Bottrighi et al. [12] also highlight that clinical guidelines typically focus on patients in general, while clinical practice often requires adapting general guidelines to the specificities of individual patients and contexts. For instance: patients might have several co-morbidities and certain equipment might not be available in a particular situation. As a consequence, physicians add what is called basic medical knowledge in order to alter clinical guidelines to the specific patient and contextual characteristics. This adds a dimension to conformance checking: besides checking the adherence to the clinical guideline, the basic medical knowledge that the physician adds also needs to be taken into consideration [12].

The aforementioned examples use clinical guidelines and protocols as the reference model. While this is a common situation in the healthcare domain, it should be noted that conformance checking techniques can also generate valuable insights when the reference model originates from a different source. For instance: Kirchner et al. [41] perform conformance checking within the context of the liver transplantation process. To create the process model to compare the event log with, an interdisciplinary team consisting of physicians and modelling experts was brought together [41]. This example highlights that conformance checking is a versatile toolkit to assess whether hospital processes are performed in reality as intended according to any form of reference model.

#### **3.3 Performance Analysis**

Regarding the evaluation of healthcare process performance, various types of performance measures can be used. A basic distinction can be made between clinical, financial and operational key performance indicators. A *clinical* key performance indicator relates to a measure of the patient's medical condition, a *financial* key performance indicator reflects the financial effect of the execution of the process, and an *operational* key performance indicator represents a measure regarding the operational execution of the process. The category of operational key performance indicators can be further subdivided in *time-related* and *resource-related* key performance indicators. The former can, for example, be the waiting time of a patient or the length of stay, while the latter can relate to the bed occupancy rate or staff utilisation at a particular department [17].

Based on a systematic literature review, De Roock and Martin [17] conclude that fewer than half of the reviewed papers report on a specific key performance indicator for their process mining analysis. When a key performance indicator is used, time-related key performance indicators are used the most frequently, followed by clinical key performance indicators. Financial and resource-related key performance indicators are rarely used in literature [17]. A commonly used time-related key performance indicator is the length of stay of a patient, which represents the time between the arrival of a patient and his/her departure [89].

Rojas et al. [70] use the length of stay when conducting a performance analysis of processes at the emergency department of a Chilean hospital. Based on their analysis, they identified two key factors in the emergency department process that contribute to higher length of stay values for patients. The first is the number of examination-treatment loops that the patient goes through, indicating the amount of time that is needed to uncover the true problem. The second is the need for a validation examination, which is an examination by a physician to ensure that the patient is ready to be discharged from the emergency department. In the same context and with the same key performance indicator, the length of stay at the emergency department of a hospital, Andrews et al. [5] conduct a process performance analysis at the St. Andrew's War Memorial Hospital in Australia. They conclude that a key contributor to high length of stay values is the time that elapses between the moment at which it is decided that a patient should be admitted and the moment at which the patient can actually move to the relevant ward [5].
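As an illustration of how such a time-related key performance indicator can be derived from an event log, the sketch below computes the length of stay per case and counts examination-treatment loops. It approximates arrival and departure by the first and last recorded event of a case, which is an assumption; the activity names and timestamps are hypothetical and the code is not taken from the cited studies.

```python
from datetime import datetime

def length_of_stay_hours(trace):
    """Hours between the first and last recorded event of one case."""
    times = [ts for _, ts in trace]
    return (max(times) - min(times)).total_seconds() / 3600

def count_loops(trace, first="Examination", second="Treatment"):
    """Number of times `first` is directly followed by `second`."""
    acts = [act for act, _ in trace]
    return sum(1 for a, b in zip(acts, acts[1:]) if a == first and b == second)

def t(hhmm):
    """Helper: build a timestamp on an arbitrary, fictitious day."""
    return datetime.strptime("2024-01-01 " + hhmm, "%Y-%m-%d %H:%M")

log = {
    "v1": [("Arrival", t("08:00")), ("Examination", t("08:30")),
           ("Treatment", t("09:00")), ("Discharge", t("10:00"))],
    "v2": [("Arrival", t("08:00")), ("Examination", t("09:00")),
           ("Treatment", t("10:00")), ("Examination", t("11:00")),
           ("Treatment", t("12:00")), ("Discharge", t("14:00"))],
}
for case, trace in log.items():
    print(case, length_of_stay_hours(trace), count_loops(trace))
```

Relating the loop count to the length of stay across many cases is one simple way to explore whether repeated examination-treatment cycles coincide with longer stays.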

#### **3.4 Comparative Process Mining**

Comparative process mining, e.g. the comparison of various patient groups, time periods or healthcare organisations, has also been used in the healthcare domain. With respect to the comparison of *patient groups*, Rojas and Capurro [69] study the medication use process for patients suffering from sepsis in the MIMIC-II database. To this end, three patient groups are distinguished, based on whether vasodilators, vasopressors, or systemic antibacterial antibiotics were used. Another example is Pebesma et al. [61], where three patient groups are separated to model the trajectory of cardiovascular risks for patients with type 2 diabetes: a high-risk, medium-risk and low-risk group. After modelling the evolution of the risk level for each group, the gender distribution within each group is determined, suggesting that female patients tend to be in lower risk states compared to their male counterparts. A final example is the research by Andrews et al. [6], who study the pre-hospital care process for victims of road traffic accidents. In this respect, they consider three groups: (i) persons who do not require ambulance transportation, (ii) persons who are transported to e.g. local medical practices or elderly care facilities, and (iii) persons who are transported to a hospital [6].

Other papers *compare different time periods*, which is another type of comparative process mining. For instance, Yoo et al. [92] use process mining to assess the impact of commissioning new buildings of a hospital, where, e.g., the cancer centre and clinical neuroscience centre have moved to the same floor and additional administrative counters have been added. To determine the impact of the move to the new building, as well as the associated new facilities that became available, the results of a process mining analysis before the move are compared to the results using an event log of a period after the move. Their findings highlight that processes run more efficiently in the new facilities, both for the cancer centre and the clinical neuroscience centre. Moreover, the consultation waiting time decreased [92]. A different example is situated within the context of an emergency department. Within that context, Stefanini et al. [77] compare the summer period to the winter period. In their comparison, they incorporate both the patients' trajectory and a variety of key performance indicators. One finding is that urgent patients, on average, have to wait longer before their first consultation in summer than in winter [77].

Regarding the *comparison of healthcare organisations*, a prime example is the work by Partington et al. [60]. They compare four Australian hospitals in terms of the pathway of patients who presented themselves at the emergency department and are suspected to suffer from acute coronary syndrome. The comparison focuses on the control-flow and time perspectives of the process. Regarding the time perspective, measures such as waiting times, throughput time and length of stay are taken into consideration. Various valuable insights were retrieved from the comparative analysis, e.g. some hospitals use an angiography (i.e. an X-ray of a patient's blood vessels) significantly more often than other hospitals. Moreover, significant differences in the length of stay of patients were discovered [60]. The work of Partington et al. [60] highlights the great potential of comparative process mining to compare local practices and process performance values. This can constitute a fruitful basis for mutual learning and, hence, the improvement of healthcare processes. However, it requires a culture of transparency, which has been highlighted as a challenge for process mining adoption within the broader process mining field [56].

#### **3.5 Predictive Process Mining**

While the aforementioned process mining types are backward-looking, process mining in healthcare research has also focused on forward-looking approaches, i.e. predictive process mining (see also Chapter 10 [21]). Two key research topics are data-driven prediction models and data-driven process simulation. An example of the former category, *data-driven prediction models*, is Benevento et al. [9], who focus on predicting the waiting time of patients at the emergency department. To this end, various predictor variables are taken into consideration, such as patient variables (e.g. their age or the assigned triage code), temporal variables (e.g. the hour of the day), and staff-based variables (e.g. the nurses' schedules, the physicians' schedules). They also consider queue-related variables in the prediction model (e.g. the number of patients who received a triage code, but were not yet treated), which were identified in an event log. The empirical evidence suggests that adding the queue-related variables improves the performance of the waiting time prediction model. In a very different context, van der Spoel et al. [82] use a combination of data mining and process mining techniques to predict the cashflow of a Dutch hospital. In this respect, they focus on predicting the treatment trajectory based on the diagnosis and the start of the trajectory, as well as on predicting the duration of this trajectory [82].
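A queue-related predictor variable such as the number of patients who have been triaged but not yet treated can be derived from an event log along the following lines. This is a minimal sketch under stated assumptions: each case is reduced to a hypothetical pair of triage time and treatment start time, expressed in minutes, and the feature definition is an illustration rather than the one used by Benevento et al. [9].

```python
def queue_length_at_triage(cases):
    """For each case, count the other patients who were already triaged but
    whose treatment had not yet started at this case's triage time."""
    features = {}
    for cid, (triage, _) in cases.items():
        features[cid] = sum(
            1
            for other, (o_triage, o_treat) in cases.items()
            if other != cid and o_triage <= triage < o_treat
        )
    return features

# case id -> (triage time, treatment start), in minutes since the start of
# the shift (hypothetical values).
cases = {
    "p1": (0, 30),
    "p2": (10, 50),
    "p3": (20, 60),
    "p4": (55, 70),
}
queue = queue_length_at_triage(cases)
print(queue)
```

The resulting value per case could be added as an extra column next to patient, temporal and staff-based variables before training a waiting time prediction model.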

Several papers have investigated the potential of process mining within the context of process simulation in healthcare. These efforts belong to the domain of *data-driven process simulation*, which refers to the extensive use of an event log during the development of a simulation model [19]. For example: Tamburis and Esposito [78] investigate how process mining could be used to support the development of a simulation model of the cataract treatment process at an ophthalmology department. Kovalchuck et al. [42], in their turn, simulate the process that patients suffering from acute coronary syndrome follow, using process mining to support the model development process. To demonstrate the developed simulation model, they focus on the effect of the availability of angiography equipment, which is important to quickly detect the presence of acute coronary syndrome. In particular, the influence of varying the number of angiography instruments on output measures such as the length of stay and the average waiting time is predicted [42]. Franck et al. [33] use a simulation-based analysis of the process of stroke patients at the emergency department. Process mining is used to determine the order of activities from an event log. Using the simulation model, various scenarios are defined in terms of the number of neurovascular intensive care unit beds required to provide patients with care according to the optimal clinical pathway.
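The type of what-if question addressed in these simulation studies, e.g. the effect of the number of devices on waiting times, can be illustrated with a very small queueing simulation. The sketch below assumes first-come-first-served handling, identical devices and a fixed service time; in a data-driven setting, arrival times and service durations would be estimated from the event log. All values are hypothetical, and the code is far simpler than the simulation models developed in the cited papers.

```python
import heapq

def average_wait(arrivals, service_time, n_devices):
    """Average waiting time in a first-come-first-served queue with
    `n_devices` identical devices and a fixed service time."""
    free_at = [0.0] * n_devices  # time at which each device becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival in sorted(arrivals):
        device_free = heapq.heappop(free_at)  # earliest available device
        start = max(arrival, device_free)
        total_wait += start - arrival
        heapq.heappush(free_at, start + service_time)
    return total_wait / len(arrivals)

# Hypothetical patient arrival times (minutes) and a 25-minute procedure.
arrivals = [0, 10, 20, 30, 40, 50]
for n in (1, 2, 3):
    print(n, "device(s):", average_wait(arrivals, 25, n), "min average wait")
```

Running the scenario for different numbers of devices shows how quickly the average wait drops once capacity matches the arrival pattern, which is the kind of insight such models aim to provide to department management.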

van Hulzen et al. [84] use data-driven process simulation to explore potential future scenarios to support capacity management decisions for the radiology department of a Belgian hospital. Within the context of the construction of new facilities, which involves a centralisation of different geographically separated campuses, department management needs to provide input regarding the required number of radiological devices (X-ray, CT scanner, etc.), the size of the waiting area for ambulatory patients, and the required number of receptionists. In particular, the study centers around three key questions formulated by the department management: (i) what is the effect of the centralisation of services on the required resource capacities?, (ii) what is the impact of abolishing the need for patients to drink contrast fluid on the throughput time and required waiting area size?, and (iii) what would be the effect of an online registration system for ambulatory patients on the reception staff requirements and the size of the waiting area? To develop a simulation model to answer these questions, an event log originating from the radiology information system is intensively used. While the case study clearly demonstrates the potential of data-driven process simulation in healthcare, van Hulzen et al. [84] also highlight challenges such as data quality issues, as well as the lack of support to interactively involve domain experts during the development of a simulation model.

#### **3.6 Action-Oriented Process Mining**

As highlighted in Chapter 1 [1], action-oriented process mining focuses on translating process mining insights into actions. This is also a crucial step within the healthcare domain, as only then will process mining reach its full potential as a catalyst of evidence-based process improvement [55]. Despite its great importance, research efforts focusing on the translation of process mining insights into actions are scarce in the healthcare domain. This is confirmed by the review of De Roock and Martin [17], where the need for more research on the translation of process mining outcomes to actionable process improvement ideas is indicated as one of the key recommendations for the future development of the research field.

A first step in the direction of action-oriented process mining is ensuring that process mining endeavors start from specific questions put forward by healthcare professionals [55]. Several research papers explicitly report on this matter, such as the work by van Hulzen et al. [84] on data-driven process simulation for capacity management at the radiology department. In a similar vein, Agostinelli et al. [2] explicitly devote attention to defining the questions of healthcare professionals in a process mining project in cooperation with the San Carlo di Nancy hospital. Better understanding three key processes was the central objective of the process mining analysis, including the hospitalisation process of patients. However, Agostinelli et al. [2] claimed that it was difficult to elicit specific questions from healthcare professionals because they had no background knowledge on process mining. The knowledge gap between process mining experts and domain experts is an important consideration to take into account when moving towards action-oriented process mining.

#### **3.7 Further Reading**

This section had the ambition to provide an intuitive overview of common use cases in process mining in healthcare literature. Hence, it does not constitute a full overview of all scientific contributions in the field. For a more detailed outline of the state of the art in literature, the reader is referred to one of the literature reviews on process mining in healthcare that have been published. Some reviews focus on a particular subdomain in healthcare: Kurniati et al. [43] on oncology, Kusuma et al. [44] on cardiology, Williams et al. [90] on primary care, and Farid et al. [28] on frail elderly care. Other reviews take a more generic perspective and consider process mining in healthcare as a whole: Ghasemi and Amyot [36], Rojas et al. [71], Batista and Solanas [8], Erdogan and Tarhan [27], Rule et al. [73], Dallagassa et al. [16], Guzzo et al. [37], and De Roock and Martin [17]. The review papers differ significantly in terms of the review dimensions they consider and whether time trends are taken into account [17]. De Roock and Martin [17] provide an overview of the similarities and differences amongst 11 published literature reviews.

## **4 Case Study**

The previous sections introduced healthcare processes, their particularities, and common use cases in process mining in healthcare literature. This section presents a real-life case study of conducting a process mining analysis in a hospital. The case study is situated in the *Superfluid Hospital* project conducted at the hospital of Braunschweig, led by Dr. Andreas Goepfert and Lars Anwand together with Nils Wittig. The project has the overarching ambition of ensuring that processes run smoothly within the hospital in order to improve the well-being of patients and employees, the quality of care, as well as the hospital's financial performance. To outline the case study, the project goal and IT-infrastructure are discussed (Sect. 4.1), followed by the outcomes of the process mining analysis (Sect. 4.2).

#### **4.1 Project Goal and IT-Infrastructure**

The specific goal of the *Superfluid Hospital* project is discovering medical treatment processes within the hospital. To this end, readily available process execution data and process mining have been used in order to avoid any additional documentation work for healthcare professionals. The fact that no additional data needs to be recorded could play an important role in nurturing acceptance of process mining and in stimulating its use on a continuous basis (e.g. also to track and evaluate the effect of process changes).

Hospitals typically use a variety of IT systems, implying that process execution data will also be scattered over various systems. In order to be able to analyse all relevant data centrally, the Braunschweig hospital uses data warehouse infrastructure as a starting point for process mining. This data warehouse already gathers the relevant data from various underlying information systems in the hospital. In particular, this case study uses the data warehouse infrastructure and business intelligence solution *eisTIK* from *KMS Vertrieb und Services AG*, which combines process execution data from different data sources such as the Hospital Information System, the Laboratory Information System, the Radiology Information System, etc. For the process mining analysis, an integrated version of the tool *Celonis* has been used within the data warehouse. Hence, process mining is no longer a standalone tool, which lowers the efforts for healthcare professionals to perform process mining.

#### **4.2 Outcomes of the Process Mining Analysis**

This subsection illustrates the outcomes of conducting process discovery at the case study hospital in Braunschweig. In particular, the focus will be on the medical treatment process of cardiology patients, which is a cohort of 1566 patients in the data warehouse. It was the ambition of the project team, consisting of process analysts and healthcare professionals, to gain a deep understanding of the treatment of cardiology patients in order to identify areas for improvement towards the future.

**Fig. 1.** Detailed view of the trajectories of patients receiving cardiology services, showing only a cut-out of the whole process.

Figure 1 provides an overview of the trajectories of patients receiving cardiology services, in particular a coronary angiography, containing all activities that have been conducted. As becomes apparent from the visualisation, this level of detail is unsuitable to gain insights into potential problems in the process. As a consequence, the number of activities represented in the process model is reduced by means of filtering. Visualising only the most important activities, as shown in Fig. 2, leads to a less complex process model. The key difference between Figs. 1 and 2 is that the percentage of included activities is reduced from 100% in Fig. 1 to 53% in Fig. 2. Moreover, the number of connections between activities is also significantly reduced to about 40% in Fig. 2.
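
To make the effect of such filtering concrete, the following sketch keeps only the most frequent activities in a toy event log and compares the number of directly-follows connections before and after. It is an illustration under stated assumptions, not the filter applied in the case study: the frequency-based criterion, the function names, and the traces are hypothetical (only the labels 'Aufnahme', 'EKG', and 'Entlassung' appear in the case study).

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

def keep_top_activities(traces, share):
    """Keep the most frequent activities, up to `share` of all distinct activities."""
    freq = Counter(act for trace in traces for act in trace)
    n_keep = max(1, round(share * len(freq)))
    kept = {act for act, _ in freq.most_common(n_keep)}
    # Dropping an activity connects its predecessor directly to its successor
    return [[act for act in trace if act in kept] for trace in traces]

# Hypothetical traces; one list of activity labels per patient
traces = [
    ["Aufnahme", "EKG", "Lab test", "Angiography", "Entlassung"],
    ["Aufnahme", "EKG", "Angiography", "Entlassung"],
    ["Aufnahme", "X-ray", "EKG", "Angiography", "Ward round", "Entlassung"],
]

filtered = keep_top_activities(traces, share=0.53)
print(len(directly_follows(traces)), "connections before filtering")
print(len(directly_follows(filtered)), "connections after filtering")
```

In this toy log, keeping 4 of the 7 distinct activities reduces the number of distinct connections from 9 to 3, which mirrors how a modest reduction in activities can simplify the model considerably.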

When studying Fig. 2 in more detail, it follows that particular diagnostics have already been performed for some patients before they actually go to the hospital. In particular, for 593 patients, an electrocardiogram and other checkups (*'Vorstationäre Leistungen'*) have already been executed before they were admitted to the hospital. Note that all results that patients bring with them will still be checked to ensure that the patient is eligible for the procedure. Patients that do not have prior check-up results generally take one of the following paths from hospital admission (*'Aufnahme'*, blue hexagon) onwards:

**Fig. 2.** Filtered view of the process for cardiology patients receiving a coronary angiography, all grouped by DRG F49G (which is a diagnosis-related grouping that is used as a billing system in Germany). (Color figure online)


Note that Fig. 2 also contains a connection between the execution of an electrocardiogram (*'EKG'*, yellow hexagon) and hospital admission (*'Aufnahme'*, blue hexagon). This connection represents patients who are temporarily discharged from the hospital, but return the following day to continue the process. Another interesting connection was revealed by analysing the data, i.e. the direct connection from admission (*'Aufnahme'*, blue hexagon) to discharge (*'Entlassung'*, blue hexagon) in Fig. 2 within 21 h. This connection can be explained by the existence of a specific group of patients for whom the treatment has been recorded in a different logic. These patients have previously not been included in the internal performance measurement. This shows that process mining can also highlight relevant deviations in the documentation. In this way, important areas of action for the improvement of data quality have been identified, generating additional added value for the hospital.
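
A check for such a direct admission-to-discharge connection could look as follows. This is a minimal sketch, not the query used in the project: the log structure, the case identifiers, and the timestamps are hypothetical, and the 21 h value is interpreted here as an upper bound on the time between the two events, which is an assumption.

```python
from datetime import datetime, timedelta

def direct_admission_to_discharge(log, max_gap=timedelta(hours=21)):
    """Return ids of cases where 'Aufnahme' is directly followed by 'Entlassung' within max_gap."""
    flagged = []
    for case_id, events in log.items():
        ordered = sorted(events, key=lambda e: e[1])  # order events by timestamp
        for (act_a, t_a), (act_b, t_b) in zip(ordered, ordered[1:]):
            if act_a == "Aufnahme" and act_b == "Entlassung" and t_b - t_a <= max_gap:
                flagged.append(case_id)
                break
    return flagged

# Hypothetical log: case id -> list of (activity, timestamp)
log = {
    "case-1": [("Aufnahme", datetime(2021, 3, 1, 8, 0)),
               ("EKG", datetime(2021, 3, 1, 9, 0)),
               ("Entlassung", datetime(2021, 3, 2, 10, 0))],
    "case-2": [("Aufnahme", datetime(2021, 3, 1, 8, 0)),
               ("Entlassung", datetime(2021, 3, 1, 20, 0))],
    "case-3": [("Aufnahme", datetime(2021, 3, 1, 8, 0)),
               ("Entlassung", datetime(2021, 3, 3, 8, 0))],
}

print(direct_admission_to_discharge(log))  # ['case-2']
```

Cases flagged in this way are candidates for a documentation review rather than proof of a process problem, in line with the interpretation given above.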

As mentioned in Sect. 3, it is important that process mining insights are also translated to actions. Based on the analysis, of which some highlights have been presented above, several actions have been specified in the process, as will be exemplified here. Firstly, patients will be encouraged to bring all relevant radiological imaging and recent electrocardiogram reports with them. This will enable them to get treated much faster by following the first path described above. Secondly, measures have been taken to accelerate the second path to make sure that patients receive the intervention during their first day of hospitalisation. Due to organisational adjustments, patients now receive the ECG with higher priority. This makes it possible that, after a faster diagnosis, they often receive the actual intervention in the afternoon of the day of admission. Finally, the third path outlined above should be combined with the second path by registering patients for both the radiological and cardiological diagnostic services at the moment of admission. The relevant preliminary examinations can be carried out and evaluated over the course of a day. In this way, the procedure can take place the day after admission, provided that there are no medical reasons for not doing so.

Healthcare professionals provided positive feedback on the conducted process mining analysis, both with respect to the analysis procedure and with regard to the insights that have been gathered. The conducted analysis made healthcare professionals aware of the improvement potential in their processes, which will result in shorter hospitalisations and improved care quality for patients. Especially changes that resulted in a reduction of unnecessary waiting times in the patient's trajectory are considered highly useful. While the insights and improvement actions presented in this section are based on an analysis of historical data, it should be noted that the use of the data warehouse with integrated process mining functions also enables real-time analyses. As a consequence, it is possible to create a live view of the process, which opens options to take action in the process while the process for a patient is still running.

## **5 Open Challenges**

Section 3 and Sect. 4 demonstrate the great potential of process mining in healthcare, as well as the research that has been conducted in the research field. However, it has been reported that the uptake of process mining in healthcare, beyond case studies in a research context, is fairly limited [55]. Hence, there are still significant challenges ahead to ensure a widespread adoption of process mining in healthcare. The remainder of this section provides an overview of ten key challenges for the field, based upon the recent work by Martin et al. [55] and Munoz-Gama et al. [59].

*Create a Standardised Terminology.* In the healthcare domain, there is a tradition of using standardised terminologies to ensure a common understanding of concepts [26]. An illustration is the *International Classification of Diseases* (ICD), which defines about 55000 codes to label injuries, diseases, and causes of death in a standardised way [91]. In the process mining field, standardisation often focuses on the data structure level (e.g. the XES and OCEL standards), but less on the terminology level. Terms such as event, case, activity, and trace might be used in an ambiguous way based on the working definitions of individuals or research groups. This is especially troublesome when working in an interdisciplinary context as it can lead to problematic communication. Hence, there is a need to develop a standardised terminology to support process mining in healthcare, which should (i) provide a clear definition of process mining concepts in a healthcare context, and (ii) link to existing terminologies in the healthcare domain whenever possible [55].

*Tackle Real-World Healthcare Problems.* To support the uptake of process mining, it is important that process mining methods help to solve real-world problems of healthcare professionals. In order to capture and thoroughly understand these problems, close and ongoing interaction between the process mining community and healthcare professionals is needed. Only then can methods be developed that actually support healthcare professionals to solve these problems [55,59]. Progress still needs to be made as, based on a systematic literature review, De Roock and Martin [17] conclude that only 12.5% of the reviewed papers reported that healthcare professionals were actively involved during the problem definition stage of a process mining project. Besides eliciting problems from healthcare professionals instead of assuming that a particular issue is relevant, it is also key to evaluate process mining methods using real-life data from an authentic healthcare context. Besides enabling the researcher to fine-tune the developed method based on the complexity of real-life data, a real-life demonstration will also build confidence among healthcare professionals in process mining's ability to tackle real-world problems [55,59].

*Deal with Low Quality Data.* The healthcare domain has been shown to suffer from low quality process execution data, the key input for process mining. As applying process mining techniques to low quality data can lead to counterintuitive and even misleading results [4], data quality is an important challenge for process mining in healthcare (see also Chapter 6 [18]). Data quality issues include missing events (i.e. events that took place, but which were not registered in the system), incorrect timestamps (i.e. timestamps that do not correspond to the time at which the event actually took place), and imprecise resource information (i.e. resource information that does not refer to a specific healthcare professional) [52,54]. While approaches have recently been developed to assess the event log quality or to handle specific event log quality issues using targeted heuristics [54], data quality remains a challenge for process mining in healthcare. In this respect, it is also important that healthcare organisations are made aware of the need to improve data registration at the source in order to fully leverage the potential of process mining. Potential initiatives include raising awareness among healthcare professionals and facilitating data registration when designing user interfaces [55,59].
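
The three types of issues mentioned above can be illustrated with simple rule-based checks on the events of a single case. The sketch below is a set of heuristics under stated assumptions, not one of the published assessment approaches: the field names, the list of mandatory activities, and the set of non-specific resource labels are all hypothetical, and the timestamp check assumes that the recorded order reflects the true order of events.

```python
from datetime import datetime

REQUIRED = {"Admission", "Discharge"}        # assumed mandatory activities
GENERIC_RESOURCES = {"", "UNKNOWN", "WARD"}  # assumed non-specific resource labels

def quality_issues(case_events):
    """case_events: dicts with 'activity', 'timestamp', 'resource', in recorded order."""
    issues = []
    # Missing events: mandatory activities that never appear in the case
    seen = {e["activity"] for e in case_events}
    for missing in sorted(REQUIRED - seen):
        issues.append(f"missing event: {missing}")
    # Suspicious timestamps: an event recorded later but stamped earlier
    for prev, curr in zip(case_events, case_events[1:]):
        if curr["timestamp"] < prev["timestamp"]:
            issues.append(f"timestamp out of order: {curr['activity']}")
    # Imprecise resources: labels that do not identify a specific professional
    for e in case_events:
        if e["resource"].strip().upper() in GENERIC_RESOURCES:
            issues.append(f"imprecise resource: {e['activity']}")
    return issues

# Hypothetical case with one instance of each issue type
case = [
    {"activity": "Admission", "timestamp": datetime(2021, 3, 1, 8, 0), "resource": "nurse_17"},
    {"activity": "ECG", "timestamp": datetime(2021, 3, 1, 7, 30), "resource": "Ward"},
]
for issue in quality_issues(case):
    print(issue)
```

Such checks can only flag candidates for review. Whether a flagged event is truly erroneous typically has to be judged together with domain experts.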

*Identify the Most Suitable Process Modelling Language.* Within the context of control-flow discovery, process mining enables retrieving a visual representation of how a healthcare process is performed in reality. In order to effectively use a process model as a communication instrument and, hence, as a basis for process improvement, it is important to determine the most suitable process modelling language within a healthcare context. Within the business process management domain, a wide variety of process modelling languages have been developed such as BPMN, Petri nets and Declare. At the same time, modelling languages to represent clinical guidelines such as GLIF3 have been proposed in the healthcare domain. Given the plethora of available languages and as it has been shown that the modelling language impacts model understandability [32], thorough benchmarking research is required. Such research should focus on both the expressive power of the considered modelling language, as well as the understandability of the resulting control-flow model for healthcare professionals. Regarding the latter, a wide range of healthcare contexts and healthcare professionals should be taken into account. By carefully understanding the strengths and weaknesses of existing process modelling languages, both from the business process management and the healthcare domain, valuable lessons can be drawn on the visualisation of process mining outcomes in healthcare [55].

*Move Beyond Control-Flow Discovery.* While Sect. 3 aimed at providing a broad view on process mining in healthcare, it should be recognised that control-flow discovery remains the most dominant use case of process mining in healthcare [17,37]. While there is a clear need for control-flow discovery algorithms that are designed with the particularities of healthcare processes in mind, it is important that targeted methods are also developed for other process mining types such as conformance checking, predictive process mining or to discover insights from the time or resource perspective [59]. Moreover, as follows from Sect. 3, more research on action-oriented process mining in healthcare is needed as this is the key for process mining to actually contribute to the generation of societal value in healthcare. With respect to the various perspectives of a process, analyses that span several process perspectives, e.g. which combine the control-flow perspective with the time or resource perspective, also have the potential to generate great value for healthcare. Such multi-perspective analyses can provide healthcare professionals with rich insights, e.g. about how the control-flow of the process gives rise to particular resource behaviour [55,59].

*Look Beyond the Hospital Walls.* As highlighted in Sect. 2.2, patients are at the core of healthcare processes. Patients, especially patients with a chronic disease, often have a therapeutic relationship with various healthcare organisations. However, the great majority of the research on process mining in healthcare is still focused on what happens with patients in the context of a hospital visit or admission. Exceptions such as Fernandez-Llatas et al. [30], who focus on supporting nursing home design using process mining, are scarce. Even when a part of the patient's diagnosis and treatment process takes place in a hospital, it is important to note that a significant portion of the process might also be executed outside the hospital's walls. For instance: an oncological patient might have surgery at a specialised hospital, (s)he might have regular check-ups scheduled at a local general hospital and might receive specific treatments at home, supported by a home healthcare organisation. When process mining has the ambition to provide healthcare professionals with valuable insights in the patient journey, it will probably not be sufficient to only study the process fragment that takes place in the hospital. As process execution data will be spread over the information systems of several healthcare organisations, this will pose challenges in terms of obtaining data and connecting all data sources. Moreover, careful consideration has to be given to data privacy and security. While privacy and security are relevant for all process mining endeavours, involving several healthcare organisations will add an additional layer of complexity [55,59].

*Give Control to Healthcare Professionals.* Currently, process mining initiatives in healthcare are often carried out by a multidisciplinary team, consisting of both healthcare professionals and process mining experts. Process mining experts play an important role given the technical skills which are required to prepare an event log and perform the appropriate analyses. In the long run, it should be the ambition of the process mining community to develop tools which are so intuitive that healthcare professionals can autonomously use them, instead of depending on (potentially external) process mining experts. While this is far from trivial given the high complexity of many healthcare processes, as well as due to complicating factors such as data quality issues, efforts to give control to healthcare professionals are highly valuable. A first step would, for instance, be to ensure that healthcare professionals are actively involved in the specification of analysis targets. In order to make informed judgements and clearly delineate their questions, it would be highly valuable if healthcare professionals have a minimal level of data and process literacy [2,17]. Moreover, enhanced training might also nurture a mindset in which process execution data is considered as a strategic asset that the healthcare organisation wishes to leverage to the largest extent possible. Additional efforts to gradually give control to healthcare professionals involve specific attention to elements such as the use of unambiguous terminology and the clear visualisation of outcomes when developing tools to perform process mining in healthcare [55,59].

*Integrate Process Mining Functionalities in Existing Systems.* The positioning of process mining as a standalone tool constitutes a major barrier for the systematic use of process mining in healthcare practice. Nowadays, in order to use process mining, data often need to be extracted from the health information system, reformatted to the required event log structure, and imported in a process mining tool. While this is feasible for a one-off research project, this is impractical in the daily work setting of healthcare professionals. Hence, to support the use of process mining in healthcare, process mining functionalities need to be integrated in the information systems that are used by healthcare professionals. To this end, a strong partnership between the process mining community and health information system vendors needs to be established. Moreover, healthcare organisations can include the need for data-driven process analysis functions when formulating update requests to their vendors [55]. The case study presented in Sect. 4 presents a first step towards tackling this challenge as process mining functionalities were integrated with the data warehouse solution used by the hospital under consideration.
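
The reformatting step mentioned above often amounts to turning one record per patient, with one timestamp column per step, into one row per event. The following sketch shows this transformation under stated assumptions: the column names, the mapping to activity labels, and the sample records are hypothetical and do not correspond to any particular health information system.

```python
from datetime import datetime

# Hypothetical mapping from source columns to activity labels
COLUMN_TO_ACTIVITY = {
    "admitted_at": "Admission",
    "ecg_at": "ECG",
    "discharged_at": "Discharge",
}

def to_event_log(rows):
    """Turn one-row-per-patient records into (case id, activity, timestamp) events."""
    events = []
    for row in rows:
        for column, activity in COLUMN_TO_ACTIVITY.items():
            value = row.get(column)
            if value:  # skip steps that were not recorded for this patient
                events.append((row["patient_id"], activity, datetime.fromisoformat(value)))
    events.sort(key=lambda e: (e[0], e[2]))  # order by case, then by time
    return events

# Hypothetical extract from a source system
rows = [
    {"patient_id": "p2", "admitted_at": "2021-03-02T09:00:00", "ecg_at": "",
     "discharged_at": "2021-03-02T17:00:00"},
    {"patient_id": "p1", "admitted_at": "2021-03-01T08:00:00", "ecg_at": "2021-03-01T08:30:00",
     "discharged_at": "2021-03-01T16:00:00"},
]

for case_id, activity, timestamp in to_event_log(rows):
    print(case_id, activity, timestamp.isoformat())
```

When process mining functionality is integrated in the information system itself, a mapping of this kind would be maintained once by the vendor or the IT department instead of being rebuilt for every analysis.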

*Develop Tailored Methodologies for Process Mining in Healthcare.* The particularities of healthcare show the need for the development of tailored methodologies for process mining in healthcare. Such methodologies should provide specific guidelines for the various phases of a typical process mining initiative in a healthcare context, ranging from the specification of the research problem, over the composition of the event log, the execution of the analysis, to the interpretation of the final results, and the actions that will be linked to the findings. When establishing methodologies, inspiration should evidently be drawn from efforts in the broader process mining field such as the L\*-methodology [81], and the PM<sup>2</sup>-methodology [83]. However, it is key to also take the particularities of the healthcare domain into consideration, as well as the wide variety of contexts in which process mining can be used in the domain. The presence of solid methodological support might also persuade healthcare organisations that are considering the adoption of process mining, but still have concerns regarding the rigour of a relatively young research domain, as well as regarding how the process mining effort should exactly be approached [55,59].

*Evolve in Symbiosis with Evolutions in the Healthcare Domain.* As highlighted in Sect. 2.2, the healthcare domain is in constant evolution due to advances in various fields such as medicine and technology. Moreover, new paradigms such as patient-centred care give rise to new care approaches. Against this background, it will be an ongoing challenge for process mining to follow up on these evolutions and to ensure that the provided support matches the expectations of healthcare organisations. To appreciate the latter statement, it is important to realise that process mining will always be a means to an end, rather than a goal in itself. Consequently, the impact of process mining in the healthcare field will depend on its ability to add value within a constantly changing context. From that perspective, process mining in healthcare should evolve in symbiosis with evolutions in the healthcare domain. While the foregoing represents a more reactive perspective, it is important to note that process mining can also actively contribute to evolutions in healthcare. For instance: process mining techniques can be used to efficiently compare various treatment processes with respect to the clinical and patient experience outcomes they generate and, hence, can contribute to shaping the clinical pathways of the future. Similarly, by providing profound insights in the usage patterns of mobile health applications, process mining can help to optimise the user-friendliness and, hence, patient satisfaction with respect to telemonitoring instruments [55,59].

## **6 Conclusion**

This chapter introduced a specific application domain of process mining: healthcare. Healthcare is a promising domain in which process mining can create significant societal value by helping healthcare organisations to better understand and improve their processes. Besides highlighting and illustrating the potential of various types of process mining in healthcare, the complex nature of many of its processes was also discussed. The specific characteristics of healthcare processes, such as the high level of variance and the widespread presence of guidelines and protocols, necessitate the development of dedicated process mining methods. In this respect, it is important to note that process mining in healthcare can build upon an active and committed research community, who are keen to develop novel methods that start from real-world problems experienced in healthcare. This will definitely be needed as the systematic uptake of process mining in healthcare, beyond the research context, is still fairly limited. A multitude of challenges is still ahead.

While current literature still predominantly focuses on the hospital setting, as was clearly reflected in the examples used in this chapter, it is important to also consider other types of healthcare organisations such as elderly care organisations, psychiatric care organisations and home-based care organisations. These organisations are also confronted with immense challenges and are likely to have even fewer resources available for advanced analytics than hospitals. Even though these other types of healthcare organisations might be even more challenging for process mining than the hospital context, e.g. because of their lower maturity in terms of data registration, they would greatly benefit from open access and user-friendly instruments from the research community to gain data-driven insights in their processes.

As a final reflection, we would like to make a message explicit that might have already become apparent while reading through this chapter: process mining in healthcare is not merely about technology and algorithms, but also about people. Actionable insights to improve healthcare processes will always emerge from the interplay between the process mining outcomes and the profound domain knowledge of healthcare professionals. Hence, it is crucial that healthcare professionals build trust in the potential of process mining and the results it generates. While healthcare professionals are a crucial actor in process mining in healthcare, another stakeholder should always remain at the center of attention: the patient. In the end, healthcare organisations, healthcare professionals, process miners and many others join forces for a single goal: to provide the best possible care to patients in a way that is sustainable in the long run. Without disregarding the numerous challenges that are still ahead, this chapter demonstrated that process mining can (and should) play an important role in achieving that goal.

## **References**


(eds.) CAiSE 2015. LNCS, vol. 9097, pp. 297–313. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19069-3_19



# **Process Mining for Financial Auditing**

Mieke Jans<sup>1,2(✉)</sup> and Marc Eulerich<sup>3</sup>

<sup>1</sup> Hasselt University, Martelarenlaan 42, 3500 Hasselt, Belgium mieke.jans@uhasselt.be

<sup>2</sup> Maastricht University, Minderbroedersberg 4-6, 6211 LK Maastricht, Netherlands

<sup>3</sup> University Duisburg-Essen, Lotharstr. 65, 47057 Duisburg, Germany marc.eulerich@uni-due.de

**Abstract.** Over the last years, process mining has increasingly demonstrated its potential as a valuable tool for internal and external auditors. The possible use cases in the field of auditing are manifold. This chapter focuses especially on the use of process mining in the context of financial audits, which are relevant for both internal and external auditors. Besides a short explanation of the different types of auditors, this chapter aims to connect process mining to the different process steps of an internal (and later also external) audit and discusses the similarities and differences between both areas.

**Keywords:** Financial auditing *·* Internal auditing *·* External auditing *·* Process mining

## **1 Introduction**

Financial auditing refers to an external independent party that examines the financial statements of an organization and formulates an opinion on how well those statements present a true and fair view of its financial performance and position. Apart from hiring external auditors to conduct such investigations, larger companies also have an internal department that conducts comparable audits, albeit through a wider lens. Where external auditing is only concerned with assuring the quality of financial reporting, internal auditing extends this with an efficiency perspective on the entire functioning of an organisation. Independent from the business units, the internal audit department examines the organisation's governance mechanisms. A key aspect for both external and internal audits is to assess whether processes are in control, whether prominent risks are mitigated (partly by their process design), and whether the input data for the financial statements are complete, accurate, and valid. Consequently, both internal and external audits can benefit from process mining, since it provides the auditor with a realistic view on how processes that indirectly impact the financial reporting are being executed. Not surprisingly, process mining has in recent years increasingly demonstrated its potential as a valuable tool for financial auditing.


Running a process mining analysis in the context of an audit, internal or external, requires a specific approach that takes into account the preliminaries of audit engagements. This chapter will take the reader through these audit-specific concerns. The chapter starts with a short introduction into financial auditing. Both internal and external audit will be introduced, along with the connection between the two audits. Readers who are familiar with this topic can proceed immediately to the next section, which discusses process mining in the internal audit function. All phases of the internal audit are explained first, and then revisited while integrating process mining in it. Section 4 brings the external audit into the picture. How does a process mining approach differ between the external and the internal auditor? Sections 5 and 6 deal with the practical organisation of bringing the right expertise in-house and how to move from data to audit evidence. We end the chapter with open challenges in Sect. 7 and conclude in Sect. 8.

## **2 Financial Auditing**

Financial statements are key when a stakeholder wishes to inform him- or herself about an organization. Investors, banks, employees, customers, vendors, etc. are all parties that might be interested in the financial situation of an organization before partnering up. To this end, the officially published financial statements of an organization are the primary documents to consult. These statements are prepared by the organization, adhering to (national or international) accounting standards. At a minimum, the statements include a *balance sheet* and an *income statement*. Depending on the legislation under which the organization reports, a cash-flow statement is also included. The balance sheet presents an overview of the assets, liabilities, and capital that the organization possesses at a particular moment in time. The income statement provides an overview of the revenues and the incurred expenses over a period of time, mostly one year. The combination of the revenues and expenses presents the monetary gain or loss that the organization realized over that accounting year.

It goes without saying that it is important that the statements are reliable, given the numerous decisions that are taken based on this information: investors start, continue, or quit investing, banks offer loans or not, customers churn or not. The guiding principle is that the statements need to present a 'true and fair view' of the financial situation of the organization. It is the key responsibility of the auditor to safeguard this principle: they provide *reasonable assurance* that the statements indeed present such a view. This assurance is primarily given by the external auditor (legal requirement for companies from a certain size onward), but this can also be assisted by the internal auditor.

This section will provide a general overview on the governance mechanism that financial auditing holds for companies. It will explain the goals and characteristics of both external and internal auditing and the interaction between these two. The internal audit department has the latitude to fully implement process mining at the core of the business in a continuous fashion. The findings of these continuous monitoring efforts can be passed on to the external auditor who can use these findings as input for their own investigation. Alternatively, the external auditor can run their own 'one shot'-process mining analysis during the annual audit engagement. The biggest traction of process mining in the auditing field is achieved through the internal audit, due to its possible embedding in the core of the organization. The interplay between these two audit settings is elaborated on further in this section.

#### **2.1 Purpose of the External Financial Audit**

The external auditor typically conducts an annual audit [1]. The auditor audits and reports on the procedures and the recorded transactions relied upon to prepare the financial statements. When the auditor reports a 'clean opinion,' the financial statements are presumed to be free of material misstatements and hence reliable to share- and stakeholders for decision making [2].

As mentioned, the objective of an external audit is to obtain reasonable assurance about whether the financial statements are free of material misstatement. It is intended to increase the reliability of the information contained in the annual financial statement. Nevertheless, an audit must be carried out efficiently, which might create tension with the goal of providing assurance. To meet the two requirements, efficiency and reasonable assurance, in the context of the audit, the so-called risk-based audit approach is applied. Following this approach, the external auditor first assesses the risks of the organisation in general, but also per department or business process. Based on this risk assessment, resources are allocated to the riskiest parts of the organization. If, for example, the sales process is assessed as a key process to have under control, the auditor will put more emphasis on this process. Stated differently, more resources are allocated to auditing this element than to other processes that are assessed as less risky.

The concept of risk-based auditing is also regulated through the relevant standard setters. For example the International Auditing and Assurance Standards Board (IAASB) issued the revised auditing standard ISA 315 (Revised 2019) "Identifying and Assessing the Risks of Material Misstatement". This standard establishes the risk identification and assessment procedures that form the basis for a risk-based financial statement audit. The risk assessment procedures are described *"to obtain an understanding of the entity and its environment, including the entity's internal control, to identify and assess the risks of material misstatement..."*. It is clear that the auditor is expected to understand how the organisation (the 'entity') is organized and how they mitigate risks by their internal control system. Precisely this *internal control system* is also a responsibility of the internal audit department, tying the goals of the internal audit and the external audit to each other. A *control* refers to a measure that is implemented to mitigate a certain risk. An example is the design of proper access rights to the financial accounting module to mitigate the risk of having unauthorized bookings in the financial ledger.

#### **2.2 Purpose of the Internal Financial Audit**

Internal auditing is a support unit of the company's management that is embedded in the organization and supports the company on two levels. On the one hand, it aims to detect and manage potential misstatement risks; on the other hand, it guards the operational performance [3]. The officially established definition of internal auditing by the global Institute of Internal Auditors (IIA) is as follows:

*"Internal auditing is an independent, objective assurance and consulting activity designed to add value and improve an organization's operations. It helps an organization accomplish its objectives by bringing a systematic, disciplined approach to evaluate and improve the effectiveness of risk management, control, and governance processes."* [4]

Furthermore, the IIA defines a mission of internal auditing, which states that the value of an organization is to be increased and protected through risk-oriented and objective auditing, consulting and insights.

As for external auditing, similar risk assessment standards exist for the internal audit. The IIA addresses the risk-based audit planning in their Standards 2010 - Planning, 2010.A1, 2010.A2, and 2010.C1. These standards stipulate how the Chief Audit Executive (CAE) has the responsibility to develop a plan of all upcoming internal audit engagements based on a risk assessment that is performed at least annually.

#### **2.3 Internal and External Audit: Interplay and Common Challenges**

In many respects, the practical procedure of conducting an audit is similar for both internal and external auditors, especially since both audits include the investigation of recorded financial transactions in the light of the prepared financial statements. In the course of all audits of a material<sup>1</sup> and formal nature, the regularity and reliability of the generated data must be assessed. Hence, the overall aim is to ensure quality control of all published financial information, taking into account the processes that precede the reporting.

<sup>1</sup> Meaning 'significant' in an audit context.

**Fig. 1.** Interplay between internal and external audit

Figure 1 provides a simplified overview of the primary responsibilities of the external and the internal audit. The external auditor is ultimately concerned with the accuracy of the financial statements, which are basically a summary of the recorded business transactions that are encapsulated in business processes. These processes typically integrate one or more recording steps, since executing a business transaction alters the financial situation of the organisation and this change needs to be recorded. Figure 2 visualizes an example business process (purchase-to-pay) and its relationship to the financial statements. The process envisions an efficient execution of the purchase, but it also incorporates controls such as approving a purchase order once or twice before the purchase is placed. These types of control measures increase the level of assurance that the reported financial information is accurate and valid. In the designed process, three activities trigger the reporting of a financial impact: entering a Goods Receipt document in the system should be reflected in the books by increasing the assets, booking an invoice increases the liability (the organisation owes money to a vendor), and paying the invoice clears that liability again. Hence, during the execution of the procurement process, the impact on the financial situation of the company is tracked in the general ledger in parallel to the business transactions. All these bookings together form the basis for preparing the financial statements that are issued and audited once a year.

**Fig. 2.** Example relationship between the purchase process and the financial statements

The external audit traditionally focuses on the bookings in the general ledger and the financial statements, whereas the internal audit typically starts from the designed procedures and how they translate into financial bookings. In a world without resource limitations, the external auditor could trace back every recorded transaction to its origin and double-check whether the recorded transaction is backed up by a real transaction (is there, for example, evidence of delivering goods or incurring certain expenses?). In reality, however, the external auditor examines the organization's controls to ensure that only legitimate transactions get recorded. The auditor investigates the design of the controls and tests their effectiveness. In our procurement process, the auditor might test whether it is indeed not possible to enter an invoice and have it paid without creating a purchase order and having it approved by someone other than the auditor. These checks form part of 'understanding the entity's control environment', an essential aspect of risk assessment as stipulated in ISA 315 and mentioned before. The working assumption is that if the processes, equipped with sufficient controls, are under control, the generated financial data is accurate.

The above described examination of the control environment is not only the responsibility of the external auditor, but is also part of the internal auditor's function. Although the internal auditor includes an additional efficiency point of view, auditing the control structure and installed control measures is a core responsibility of the internal audit department. Consequently, the external auditor may rely on the internal auditor's findings. Of course, additional checks are always required.

## **3 Process Mining in the Internal Audit Function**

Given the increasing complexity and availability of information in accounting, digital data analysis has emerged as an innovative audit approach to perform financial audits by internal and external auditors [5]. Given the strong connection between the internal audit function and the organization, we start from the perspective of internal auditing. How can process mining support the internal audit? Subsequently, this perspective will be expanded by looking at the application of process mining by external auditors over the course of their audit engagements.

#### **3.1 Internal Auditing Background**

The range of internal auditing tasks is subject to constant change, which is reflected in not only a shift of focus within the individual audit areas but also a varying understanding of the role of internal auditing. Within the traditional range of auditing activities, a distinction is made between audits of financial processes, operational processes, and management processes [6]:


A major trend can be noted in the field of internal auditing activities. The solely past-oriented audit is increasingly complemented by future-oriented auditing activities. This expansion is accompanied by a further development of auditing activities. Namely, internal auditing is more and more intended to initiate approaches to solve organizational problems. Providing improvement recommendations can therefore be referred to as the overall mission of all internal audit activities. Consequently, the internal audit shifts from a purely control-oriented view towards an enterprise-wide view.

#### **3.2 The Internal Audit Process**

The internal audit process generally pertains to the structure and standardized procedure of auditing activities of the internal audit function and can be structured following the so-called *phase model* (see Fig. 3). The phase model of the audit activity is organized according to a sequence of audit phases. These audit phases are inherently separate units in terms of both content and methodology, yet there is a predetermined order for their execution. In fact, they are connected in such a manner that the start of the respective phase is directly linked to the completion of the preceding phase. As a result, the phase model is in effect the process model of the internal audit.


Figure 3 visualizes the phase model of the internal audit, along with possible ways to integrate process mining activities in the different phases. The following paragraphs will describe these starting points in greater depth. We present a running example to further explain the connection between internal auditing and process mining.

**Fig. 3.** Integration of process mining activities in the internal audit process

#### **3.3 Planning the Audit Schedule**

Imagine an exemplary audit engagement that should assess the functioning and the exposure to risk of a manufacturing company. Through an internal audit, the auditor needs to determine whether the internal rules and guidelines (controls) are fulfilled, if the processes are efficient, if there are specific risks that are not mitigated by appropriate controls and whether there is room for improvement.

The first phase of this audit entails planning the audit, that is, determining the allocation of resources. In order to do so effectively, the auditor could visualize the organization's core business processes and then analyze them in terms of conformity and process efficiency. This would require the use of process discovery algorithms (as described in [7,8]). Any variances, weaknesses, and risks identified throughout this phase can subsequently serve as indicators to guide the allocation of resources. In other words, this phase involves an attempt to "explore" the processes, and process discovery is well-positioned to support this phase. In a more mature setting, where the core business processes are efficiently logged and event logs can be extracted automatically, a quick discovery step can yield insights into which processes are highly structured and which are not, simply by looking at the discovered process models and their level of 'spaghettiness'<sup>2</sup> (see [9]).

<sup>2</sup> 'Highly structured' is directly associated with 'less risky'.

Measures of 'structuredness' are necessary to make this step objective and to select which process to give priority. For example, one could identify processes with a very high number of variants. In the running example, the purchase-to-pay (P2P) process might show a high number of variants, whereas the hiring process only exhibits a low number of process variants. This is an indication of a myriad of possible execution variants in the P2P process, accompanied by higher risk exposure. However, indicators such as '20 variants per 100 cases' are very generic. This can relate to two extremes (and everything in between): one variant representing 81 cases and 19 variants each representing a single case is one extreme, versus all 20 variants each representing five cases as the other extreme. Consequently, the distribution of variants might be more insightful. Possible measures for structuredness are variance, self-loops, repetition and batch-processing [10]. This enables the identification of potential audit objects (risky processes) and, preferably, a simultaneous assessment of the risks inherent to these objects. In this phase of the audit process, process mining is consequently used to support the creation of the risk-based audit plan.
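The difference between the two extremes of '20 variants per 100 cases' can be made concrete with a small sketch. The two summary figures used here, the share of cases covered by the most frequent variant and the Shannon entropy of the variant distribution, are illustrative choices and not the measures proposed in [10]; the function name and the toy logs are hypothetical.

```python
from collections import Counter
from math import log2

def variant_profile(traces):
    """Summarise how cases are distributed over variants.

    `traces` holds one activity sequence (a tuple) per case.
    """
    counts = Counter(traces)
    n_cases = sum(counts.values())
    shares = [c / n_cases for c in counts.values()]
    # Shannon entropy of the variant distribution: 0 when every case follows
    # the same variant, log2(number of variants) when cases are spread evenly.
    entropy = -sum(p * log2(p) for p in shares)
    return {
        "cases": n_cases,
        "variants": len(counts),
        "top_variant_share": max(shares),
        "entropy_bits": entropy,
    }

# Two hypothetical logs, each with 20 variants over 100 cases.
skewed = [("v0",)] * 81 + [(f"v{i}",) for i in range(1, 20)]
even = [(f"v{i}",) for i in range(20) for _ in range(5)]

for name, log in (("skewed", skewed), ("even", even)):
    print(name, variant_profile(log))
```

Both logs report the same number of variants and cases, yet the skewed log is dominated by a single variant while the even log is not, which is exactly the information a plain variant count hides.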

**For Example**, Table 1 presents a set of different structuredness measures to gain an insight in which process is more or less structured than other processes. The structuredness measures are calculated for the P2P process of different plants, helping the auditor to classify the individual risk level of each business unit. Based on this exploratory phase, the audit schedule would reserve resources to an audit of the P2P process in the Norway facility in *year n*, and leave the audit of the P2P process of the USA for *year n+1*, and the audit of Germany for *year n+2*, perhaps together with the Belgium plant. Also the other processes would be integrated in the audit schedule, typically covering a cycle of four to six years.


**Table 1.** Measures of process structuredness, used to plan the Audit Schedule

#### **3.4 Planning the Audit**

Once the audit object of a specific audit is identified (the P2P process in the Norwegian facility in our example), the audit is scheduled and the audit engagement needs to be planned in more detail. Starting with process discovery, the individual steps of the audited unit can be visualized and analyzed before the actual on-site audit. This helps the auditor gain a better understanding of the area to be audited and become familiar with the unique features of the process environment. Deriving the process model from the available transactions facilitates identifying parallel process steps, loops, and undesired process skips [11,12]. This approach bears the advantage of verifying the assurance of the process flows on the one hand and revealing process steps that require further examination on the other hand. A first scan of the discovered process model involves a critical look at the discovered edges. Even when not looking at complete process executions from start to end, examining the most frequent direct flows yields interesting information.

**For example**, when the default discovered process shows an edge between 'book invoice' and 'first approval order', it is clear that unexpected sequences are present in a significant part of the transactions. This information is valuable when planning the audit, since it provides indicators of which directions should be investigated more thoroughly.
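A first scan of the discovered edges can be sketched by counting how often one activity directly follows another and listing the edges that the designed process does not foresee. This is a minimal illustration, not the output of a particular discovery algorithm; the activity names, the toy traces and the set of expected edges are assumptions made for the example.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    edges = Counter()
    for trace in traces:
        edges.update(zip(trace, trace[1:]))
    return edges

# Hypothetical P2P traces; activity names are illustrative.
traces = [
    ("create order", "first approval order", "goods receipt", "book invoice", "pay invoice"),
    ("create order", "first approval order", "goods receipt", "book invoice", "pay invoice"),
    ("create order", "book invoice", "first approval order", "goods receipt", "pay invoice"),
]

# Edges the auditor expects from the designed process (an assumption here).
expected = {
    ("create order", "first approval order"),
    ("first approval order", "goods receipt"),
    ("goods receipt", "book invoice"),
    ("book invoice", "pay invoice"),
}

unexpected = {e: n for e, n in directly_follows(traces).items() if e not in expected}
for (a, b), n in sorted(unexpected.items(), key=lambda kv: -kv[1]):
    print(f"{a} -> {b}: {n}")
```

In this toy log the edge from 'book invoice' to 'first approval order' shows up among the unexpected edges, which is the kind of indicator that would steer the audit planning.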

After process discovery, the auditor can perform a first conformance check against the normative model to verify whether the individual process steps of the examined transactions comply with the previously defined process. In contrast with the exploratory process discovery step, the focus now shifts towards complete process executions. Since the auditor is still in the planning phase, it is recommended to compare the logged transactions with a procedural normative process model (such as a BPMN model that represents the 'to be' model). The idea is to get a first impression of the level of business alignment: "Are the real process and the process model properly aligned?" [13]. This approach enables identifying variants that are not in line with the (often overly simplified) normative process model. Further investigation during the audit will reveal the real, associated risks. However, during the phase of planning the audit, the auditor can already have a first look at the variants that represent a majority of the non-conforming cases.

**For example**, Table 2 presents a set of variants in our P2P process that deviate from the normative model in Fig. 2 and that could be skimmed during this audit phase (see how to analyze deviations between observed and modeled behavior in [14]).



**Table 2.** Example output of non-conforming variants

So within the scope of planning the process audit, both process discovery and conformance checking can be used to determine the focus of the audit. The insights that are gained during this phase provide guidance on which special features or possible deviations from the prescribed process warrant further investigation. Sometimes, these analyses already allow for the identification of potential findings before the actual audit takes place. It should be noted that although process deviations might be identified, they do not necessarily represent a financial statement risk. Further tests are required to uncover potential "false positives" [15]. This will be elaborated on in the next phase.
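The high-level conformance check of this planning phase can be sketched as a comparison of observed variants with the sequences that the normative model allows. The sketch below is a deliberate simplification: it enumerates the allowed sequences of a hypothetical to-be model and uses exact matching, whereas conformance checking techniques such as alignments also quantify how far a trace deviates. All activity names and case counts are made up.

```python
from collections import Counter

# Hypothetical to-be model, written out as the set of allowed activity
# sequences (one or two approvals before the goods are received).
ALLOWED = {
    ("create PO", "approve PO", "receive goods", "book invoice", "pay invoice"),
    ("create PO", "approve PO", "approve PO", "receive goods", "book invoice", "pay invoice"),
}

def non_conforming_variants(traces, allowed=ALLOWED):
    """Return (variant, case count) pairs not allowed by the model, most frequent first."""
    counts = Counter(t for t in traces if t not in allowed)
    return counts.most_common()

traces = (
    [("create PO", "approve PO", "receive goods", "book invoice", "pay invoice")] * 6
    + [("create PO", "approve PO", "book invoice", "pay invoice")] * 3    # no goods receipt
    + [("create PO", "receive goods", "book invoice", "pay invoice")] * 1  # no approval
)

for variant, n in non_conforming_variants(traces):
    print(n, " > ".join(variant))
```

Sorting by case count puts the variants that cover most non-conforming cases first, which matches the skimming described for this phase.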

#### **3.5 Conducting the Audit**

When conducting the audit, the auditor takes a deep dive into the control and operational environment. Specific analyses are conducted during this audit phase. Whereas the previous phases were preparatory, this phase is targeted to identifying risks and weaknesses in the reviewed process. As mentioned before, the opinion that auditors issue at the end of the engagement is partly based on an evaluation of the existing controls in terms of their effectiveness and efficiency [16].

The rationale of how internal controls are evaluated by making use of transactional data is presented as a process in Fig. 4 [17]. During the previous two audit phases, potential violations of internal controls have been identified. Also, a preliminary deviation analysis is started in these phases (see the example in Table 2). When conducting the audit, this preliminary analysis is extended. The purpose is to classify all deviations that stem from a procedural conformance check as either an exception or an anomaly. To date, these conformance checks solely use a control-flow perspective in practice. In theory, however, this can be extended to a multi-paradigm conformance check that, for example, includes a Segregation of Duties control between two activities.
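A Segregation of Duties control between two activities can be sketched as a check on the resource perspective of the event log: a case violates the control when the same resource executed both activities. The event layout (case id, activity, resource) and all names below are assumptions for illustration.

```python
from collections import defaultdict

def sod_violations(events, first, second):
    """Return the ids of cases in which one resource performed both activities.

    `events` is an iterable of (case_id, activity, resource) tuples.
    """
    performers = defaultdict(lambda: defaultdict(set))
    for case_id, activity, resource in events:
        performers[case_id][activity].add(resource)
    return sorted(
        case_id
        for case_id, by_activity in performers.items()
        if by_activity[first] & by_activity[second]
    )

# Hypothetical events: in PO-2 the creator also approves the order.
events = [
    ("PO-1", "create PO", "alice"),
    ("PO-1", "approve PO", "bob"),
    ("PO-2", "create PO", "carol"),
    ("PO-2", "approve PO", "carol"),
    ("PO-3", "create PO", "dave"),
]

print(sod_violations(events, "create PO", "approve PO"))
```

Cases in which one of the two activities is missing altogether (PO-3 above) are not reported here; they are control-flow deviations and would surface in the procedural conformance check instead.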

Starting from deviations, an iterative cycle presents itself, until all deviations are classified as either an anomaly or an exception. When the deviation is classified as a potential compliance issue, a follow-up investigation is triggered. Returning to our example of a missing receipt of goods, this might, or might not, be a compliance issue. It was not classified as an anomaly, because a possible explanation could be formulated: perhaps the purchase related to services, making the receipt of goods an illogical activity. The auditor should test this potential explanation in order to reach a conclusion on whether this deviation is an anomaly or an exception to the normative process. To do so, the auditor collects all cases where the receipt of goods is missing, and subjects this (filtered) log to a conformance check where the formulated hypothesis is tested. Hence, a declarative constraint is checked on this set of transactions: 'if the receipt of goods is missing, the purchase relates to services'. The cases that follow this rule can be cleared and listed as exceptions. If there are still cases that are not cleared by this possible explanation, the same approach is repeated.
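One iteration of this cycle can be sketched as follows: filter the cases without a receipt of goods, test the rule 'if the receipt of goods is missing, the purchase relates to services', and keep the cases that the rule does not clear for the next iteration. The case attributes `activities` and `purchase_type`, the activity name and the toy data are assumptions about how the event log was enriched.

```python
def classify_missing_goods_receipt(cases):
    """Split cases lacking a goods receipt into cleared and still-open deviations.

    `cases` maps a case id to a dict with the keys "activities" (sequence of
    activity names) and "purchase_type".
    """
    exceptions, open_deviations = [], []
    for case_id, case in cases.items():
        if "receive goods" in case["activities"]:
            continue  # not part of the filtered log
        # Hypothesis: a missing goods receipt is acceptable for services.
        if case["purchase_type"] == "service":
            exceptions.append(case_id)
        else:
            open_deviations.append(case_id)
    return exceptions, open_deviations

# Hypothetical cases.
cases = {
    "PO-1": {"activities": ["create PO", "approve PO", "receive goods", "book invoice"],
             "purchase_type": "goods"},
    "PO-2": {"activities": ["create PO", "approve PO", "book invoice"],
             "purchase_type": "service"},
    "PO-3": {"activities": ["create PO", "approve PO", "book invoice"],
             "purchase_type": "goods"},
}

cleared, still_open = classify_missing_goods_receipt(cases)
print("exceptions:", cleared)
print("to investigate further:", still_open)
```

The cases that remain open would be subjected to the next hypothesis, or be classified as anomalies if no acceptable explanation can be found.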

The cycle, as described above, presents the theory. In practice, too many deviations are presented to inspect all of them, and auditors fall back on a sampling approach. Current research is looking into weak supervision and active learning to support the auditor in the iterative cycle such that full-population testing can be reached [18]. The goal is to present deviations in an intelligent way to the auditor, who provides a classifier with the labels *anomaly*, *exception*, or *uncertain*. In case of uncertainty, the auditor provides a possible rule to check (the hypothesis). Based on the iterative human input, the classifier can support a classification of all identified deviations, preferably with minimal expert knowledge input [17].

By identifying deviations that can be tracked to the document level, the auditor can also direct the auditing activities in a target-oriented manner. More specifically, in-depth variant analysis and case analysis can be used to evaluate both the functioning of the internal controls and the process performance. The majority of process mining solutions offer various metrics in this regard, such as duration of the process, number of process steps, number of loops, number of variants, etc. Internal auditors can create additional value through this third dimension by auditing, analyzing, and identifying improvement opportunities, and can utilize process mining for consulting activities in addition to their conventional auditing activities.
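Case-level metrics of this kind are straightforward to derive from timestamped events. The sketch below computes the duration, the number of steps and the number of repeated activity executions for a single case; it is not the implementation of any specific process mining tool, it assumes a non-empty list of events, and the sample case is hypothetical.

```python
from datetime import datetime

def case_metrics(events):
    """Compute duration, step count and repetition count for one case.

    `events` is a non-empty list of (activity, timestamp) pairs.
    """
    ordered = sorted(events, key=lambda e: e[1])
    activities = [activity for activity, _ in ordered]
    return {
        "duration": ordered[-1][1] - ordered[0][1],
        "steps": len(activities),
        # Repetitions: executions beyond the first one of each activity.
        "repetitions": len(activities) - len(set(activities)),
    }

# Hypothetical case in which the invoice is booked twice.
po_17 = [
    ("create PO", datetime(2022, 3, 1, 9, 0)),
    ("approve PO", datetime(2022, 3, 2, 14, 30)),
    ("book invoice", datetime(2022, 3, 8, 10, 0)),
    ("book invoice", datetime(2022, 3, 9, 10, 0)),
    ("pay invoice", datetime(2022, 3, 10, 9, 0)),
]

print(case_metrics(po_17))
```

Aggregating such figures per variant or per plant gives the auditor a performance view next to the control view.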

**Fig. 4.** Internal control testing rationale, including process mining (Source: [17])

#### **3.6 Communicating the Result**

The advantages of visualizing processes and existing process deviations as well as conformance checks and identification of control weaknesses can also be used by the auditor as part of the audit report and the presentation of audit results. The visualization generally leads to the audited entity or report addressees being even more receptive to the potential findings or recommendations for improvement. Clearly, this area is of limited relevance compared to those mentioned previously, yet it provides an advantage that should not be overlooked.

#### **3.7 Follow-up**

During the follow-up of an audit, the auditor normally tries to check if the problems and negative findings of the initial audit were solved and whether the recommendations were implemented. Thus, the auditor can actually use all of the prior process mining analysis to double-check if the expected improvements were realized.

#### **3.8 Maturity Levels**

Although a match between the different internal audit phases and separate process mining activities is presented in the previous sections, sometimes an internal auditor only relies on process mining during the core of the audit (the phase 'Conducting the audit'), as visually presented in Fig. 5, or another single phase. After building the event log, process discovery can help the auditor to understand the existing process variants and identify potential risk areas. Furthermore, the check of the existing process structure compared to the process model allows the internal auditor to clarify existing deviations or identify additional risks. Checking the process executions against business rules offers an additional way to gather evidence of, for example, internal control weaknesses or potential fraud cases. The variant analysis allows the auditor to compare different variants to identify additional risks, but also opportunities, since the as-is process might be a more efficient or effective way to organize the process than the intended process model. Finally, the case analysis allows the auditor to take a deep dive at the transaction level to analyze the identified cases. It goes without saying that integrating process mining throughout the entire internal audit is next-level in terms of maturity, compared to the integration in only one phase.

**Fig. 5.** Example of how process mining can be integrated in a single phase of the internal audit

## **4 The Symbiosis Between Internal and External Auditing When Using Process Mining**

As mentioned before, the external and internal audit have partially overlapping goals. Although internal audits include the investigation of process improvement from an efficiency point of view, both audits have the goal to assure the validity of the reported financial statements. As a result, the audit phases of the external and internal audit are not that different in nature. However, differences exist; some in nature and some in terminology.

#### **4.1 External Auditing and Process Mining**

Where the internal audit starts with planning the audit schedule for a period of one or several years, the external audit does not have such a long-term setting. The external audit standards also describe a phase of planning the audit, but this is in the light of a running audit engagement and relates to the audit of that specific accounting year. In order to plan the audit approach, a risk assessment phase takes place. Based on the risk assessment, the auditor decides where to allocate most resources. This is comparable with the internal audit phase 'planning the process audit'. Similarly, this phase would rely mostly on process discovery and a rather high-level conformance check against a procedural model (Fig. 6).

**Fig. 6.** Integration of process mining activities in the external audit process

Based on the assessed risks, targeted business processes will be investigated through 'tests of controls' and 'tests of details'. The tests of controls are related to examining the design and implementation of the organisation's control environment. The tests of details are checks that take place at the transaction level. This distinction stems from before the digital era, when tests of controls were not executed at a detailed level. Nowadays, however, the distinction is less clear. For example, checking whether all documents in a system have been approved is a test of details that also checks the effectiveness of a control. Given the weaker delineation between these two concepts, these tests are often intertwined and form, together with other analytical procedures<sup>3</sup>, the core of the external audit. Similar to the internal audit phase of conducting the audit, these tests would heavily rely on checking rules, a more in-depth variant analysis and case analysis [11,20]. The tests are currently still driven by the traditional audit. The check-lists that were used before are now automated and extended with additional dimensions. Still, the core of the audit has not yet changed drastically. This can be attributed to the standards that remain unchanged, creating some reluctance among auditors to turn to new techniques.

<sup>3</sup> Analytical procedures in the context of an external audit are defined as "... evaluations of financial information through analysis of plausible relationships among both financial and non-financial data. Analytical procedures also encompass such investigation as is necessary of identified fluctuations or relationships that are inconsistent with other relevant information or that differ from expected values by a significant amount." [19].

After the tests of details and controls, the results are communicated. As with the internal audit, the visual aspect of process mining is an important characteristic. A graphical presentation of the phases of the internal and external audit, along with the process mining analysis phases that can be used, is given in Fig. 7.

**Fig. 7.** Parallels between internal and external auditing and the process mining analysis phases that support the audits.

#### **4.2 Relying on Internal Audit's Process Mining Efforts**

If the external auditor can start from the process mining efforts of the internal auditor, obviously, more can be achieved than if this is not the case. In this setting, the external auditor would have to examine the process that was followed by the internal auditor when conducting the process analysis. It will be important to have a clear answer to (and hence clear documentation on) the following questions:


So although the external auditor does not have to start from scratch, a lot of effort will still be devoted to having assurance over the process of process mining. Only with a reasonable level of assurance of this internal process can the external auditor rely on the outcome of the internal auditor.

The alternative is that the external auditor takes full control of the process mining analysis. They can still start from the internal auditor's expertise, which will speed up the process of building the event log, for instance. But the external auditor would extract the data from the information systems, build the event log and run the analysis themselves. This trade-off between control and depth of analysis is related to the personal preference of the external auditor.

## **5 Organizational Integration of Process Mining in the Auditing Function**

Aside from the theoretical integration of process mining in the auditing process, organizational integration is equally important. There are different ways in which internal and external auditors can implement process mining in the auditing process. The following options present potential approaches and should be examined individually for the respective auditor. Of course, the best-fitting solution always depends on the financial, technical and human resources available. Also, the time needed to implement process mining and to train the employees should not be neglected.

#### **5.1 Individual Process Mining Experts**

Solutions with one or a few process mining experts are conceivable, especially in small and medium-sized audit departments. The required profile is comparable to that of the members of a specialized team described in Sect. 5.2. Aside from profound process mining knowledge, the expert should have an excellent command of data analysis tools and, consequently, be able to design and manipulate queries and data easily. This enables the expert to create the analyses necessary for successful use in the auditing process. As a direct contact person, the expert is a shorter link to the auditor than a process mining team would be. On the one hand, the advantage is the lower financial investment and the possibility to easily scale up the team. On the other hand, a challenge could be that the expert's capacity is too low when there are frequent requests.

#### **5.2 Specialized Process Mining Team**

It is a good idea to set up an independent process mining team, especially for sizeable internal audit functions or functions with solid data analysis activities. This team specializes in data analysis and process mining and prepares reports to support each audit. Consequently, this team belongs to the core of the audit function and is heavily involved in audit preparations. Such a team should especially involve auditors with profound expertise in ERP systems and processes. This know-how enables the team to develop and prepare target-oriented analyses of data and processes, such that the auditors outside the team reap significant benefits. Since the experts can reuse some procedures to prepare different dashboards and process analyses, learning effects can be expected. While the high degree of specialization of the team brings numerous advantages, the disadvantages of this approach should not be neglected either. The team members primarily work "remotely", potentially leading to isolation from the actual audit process. Such an approach is also associated with high personnel costs, so it must be examined per company to what extent it can realistically be implemented. If individual teams for data analysis already exist in the auditing function, they would, of course, be a sensible starting point for this approach.

#### **5.3 Training of All Staff**

When process mining becomes an integral part of the auditing function, comprehensive training of all auditors working with it is unavoidable. Depending on the selected process mining software, specific levels of training are required. The training should include the connection to the ERP system: understanding the ERP data structure is a prerequisite to building a suitable event log. The connection to the corresponding process specifications of the company is also of central importance against this background. What do the processes in our company look like? What controls have been implemented? Where are the critical points? All these questions need to be translated into clear questions to feed the process analysis of the auditor. These sample questions show how demanding training on this topic is. The high demands on training must also be understood against the background of the local or global orientation of the internal audit function of the respective company. If the auditing function has several locations, possibly even in different countries, a corresponding global training concept that can be rolled out to the different countries is recommended. Overall, training should help standardize procedures, develop appropriate competencies, and build a shared knowledge base across the audit function. Only with a soundly thought-out training concept can the auditing department successfully implement process mining in its processes.

#### **5.4 Process Mining Competencies of Other Departments or Outsourcing**

In addition to building up process mining competencies within the auditing team, the auditing function can also draw on support from outside. Numerous companies have now implemented process mining solutions in various areas of the company, which is why cooperation with other functions in the context of process mining analyses is conceivable. This approach is particularly useful if the auditing department has not yet been able to think through the process mining approach or if initial trial analyses are to be carried out. The great advantage of this approach is that the auditing team has to expend relatively limited resources in order to fundamentally evaluate the possible applications. An important caveat, however, is that audit-specific aspects that should be included during the log building phase might not be included. To mitigate this risk, it is important to team up with a partner that has process mining expertise in an auditing context. Another important characteristic, which holds for every form of outsourcing, is that the auditor does not develop deep in-house expertise. Alternatively, outsourcing to an external party is possible in order to maintain the flexibility of the auditing function. However, outsourced analyses will probably produce comparable costs due to the enormous implementation effort later on.

## **6 From Data to Audit Evidence**

When the auditor decides to apply a process mining approach, there are two preparatory steps that are key to success: identify the scope of your process under investigation and formulate (upfront!) the most important questions that need to be answered. The process can be either a core business process, like purchase or sale, or a supporting process like an incident management or change management process. It is paramount to identify which activities are subject (and which are not) to the audit. Aside from the process scope, the formulated key questions are of paramount importance. Based on these questions, the event log can be built. Although research on object-centric process mining is on the rise, in practice, process mining for audit is still document-centric: a case is either an order, or a journal posting, or an invoice, or another document of relevance.

Based on the audit questions, the relevant information can be extracted from the information system. Unfortunately, the selection of a case identifier potentially creates noise in the analyses that follow. When many-to-many relationships exist between a document at the beginning of the process and another document that is created near the end, choosing one or the other document always has its upsides and downsides. In general, it seems that the auditor prefers to select a document that leads to a financial transaction and at the same time is earlier in the process [21]. If a choice needs to be made on whether to follow that document at its header level or at a more detailed level, the auditor tends to examine the case at the lower level.
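The effect of the case-identifier choice can be illustrated with a minimal Python sketch. The records, activity names, and the `build_event_log` helper are hypothetical and only serve to show how the same extracted events yield different traces at header level and at line-item level; this is not a prescribed extraction procedure.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical extracted records: (document id, item number, activity, timestamp).
RAW_EVENTS = [
    ("PO-1", 10, "Create purchase order item", "2022-01-03T09:00:00"),
    ("PO-1", 20, "Create purchase order item", "2022-01-03T09:00:00"),
    ("PO-1", 10, "Receive goods", "2022-01-10T14:30:00"),
    ("PO-1", 10, "Receive invoice", "2022-01-12T08:15:00"),
    ("PO-1", 20, "Receive invoice", "2022-01-05T11:00:00"),
    ("PO-1", 20, "Receive goods", "2022-01-11T16:45:00"),
]


def build_event_log(raw_events, level="item"):
    """Group events into traces, using the header or the line item as the case."""
    traces = defaultdict(list)
    for doc_id, item_no, activity, ts in raw_events:
        case_id = doc_id if level == "header" else f"{doc_id}/{item_no}"
        traces[case_id].append((datetime.fromisoformat(ts), activity))
    # Order each trace by timestamp and keep only the activity labels.
    return {case: [a for _, a in sorted(events)] for case, events in traces.items()}


item_log = build_event_log(RAW_EVENTS, level="item")
header_log = build_event_log(RAW_EVENTS, level="header")
```

In this toy data, the item-level log shows that item 20 was invoiced before the goods were received, whereas the header-level trace interleaves the events of both items and makes that deviation harder to see.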

When selecting activities to include in the event log, most ERP systems offer a plethora of options to enrich the log. Many recorded events could be included on top of the most straightforward (process) activities. It is, for example, possible to include both the moment an invoice was entered in the system and the date on which it was posted, where the former is timestamped by the system and the latter is entered manually. These additional insights are still to be integrated in regular audits. To date, auditors do not yet fully grasp these opportunities. In a future audit, perhaps when standards have been updated, these types of activities should find their way into typical standard analyses.
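As a small illustration of the invoice example, the following Python sketch turns one invoice record into two events and measures the gap between the two dates. The record layout and field names (`entered_at`, `posting_date`) are assumptions made for this sketch and do not correspond to any specific ERP system.

```python
from datetime import date, datetime

# Hypothetical invoice record; the field names are illustrative only.
invoice = {
    "invoice_id": "INV-7",
    "entered_at": datetime(2022, 1, 4, 16, 20),  # set by the system clock
    "posting_date": date(2021, 12, 31),          # typed in by the user
}


def invoice_events(record):
    """Emit one event per timestamp so that both moments appear in the event log."""
    return [
        {"case": record["invoice_id"], "activity": "Enter invoice",
         "time": record["entered_at"], "source": "system"},
        {"case": record["invoice_id"], "activity": "Post invoice",
         "time": datetime.combine(record["posting_date"], datetime.min.time()),
         "source": "manual"},
    ]


def backdated_days(record):
    """Days by which the manual posting date precedes the system entry date."""
    return (record["entered_at"].date() - record["posting_date"]).days


events = invoice_events(invoice)
gap = backdated_days(invoice)
```

Keeping the `source` of each timestamp in the event payload lets a later analysis distinguish system-generated times from manually entered ones.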

As the event log serves as 'audit evidence'<sup>4</sup> in the context of an external audit, it is tied to audit regulations. To assemble the audit evidence, and hence build the event log, the auditor is expected to *obtain sufficient appropriate audit evidence*. This leaves room for discussion on whether, according to this stipulation, an auditor can ask for full access to a client's information system or not. If not, the auditor will need to submit a detailed data request. This is the standard procedure for regular audits, where the auditor can request 'information'. However, when the auditor is interested in the data underneath the information, this is more difficult. Submitting a data request that is sufficient to build an event log later is only possible if the auditor is acquainted with the company's information system.

Once the auditor has access to information, the auditor has to decide on whether it is suited to use as audit evidence and base an opinion on this information. Particularly relevant to the case of process mining in auditing are the stipulations on *audit evidence that has been prepared using the work of a management's expert*. In such cases, the auditor is expected to "evaluate the competence, capabilities and objectivity of that expert; obtain an understanding of the work of that expert; and evaluate the appropriateness of that expert's work as audit evidence for the relevant assertion." [22]. *When using information produced by the entity*, the auditor is expected to have a view on the accuracy and completeness of the information, and whether the information is sufficiently precise and detailed [22]. Taking these regulations into account makes it clear that it is not straightforward for an external auditor to rely on other parties to conduct process mining analyses and build further on this.

## **7 Open Challenges**

Although the preceding discussion demonstrates the numerous benefits of applying process mining within internal and external auditing, fundamental challenges also exist. These challenges can be divided into four areas: data quality, auditor skill-set, stakeholders, and full-population testing.


All four areas have varying degrees of relevance to their respective organizations and include numerous broader challenges and components.

<sup>4</sup> Audit evidence is defined in ISA 500 as "Information used by the auditor in arriving at the conclusions on which the auditor's opinion is based. Audit evidence includes both information contained in the accounting records underlying the financial statements and other information." [19].

**Data Quality.** With regard to data, the application of process mining within the audit is jeopardized if the available data is either not usable or irrelevant. Also, data integrity and compatibility play a key role against this backdrop. For example, having different source systems implies that the creation of a usable data model for process mining is not straightforward. Consequently, all known problems and challenges of data analyses and IT systems are also valid for process mining in the context of auditing.

**Auditor Skill-Set.** Both internal and external auditors must have the required know-how and appropriate training for the use of process mining in the context of the audit. However, information systems, data analytics, and process mining are not standard topics in auditing education curricula. Although programs are increasingly including these topics in courses and providing good starting points, this momentum must be sustained during career development. Therefore, it is equally important that audit companies (or departments) sharpen these skills in their new hires. Only then can this know-how flow through the entire audit firm.

**Stakeholder.** It is important for the auditor to clarify the compliance with auditing standards during the financial audit with process mining and get the commitment from the audit committee or the audited entity while using process mining. The regulator or the respective professional association must also support the use of new technologies for obtaining an audit result accordingly.

**Full-Population Testing.** Auditing traditionally relies on sampling, where a sample is taken from a population as audit evidence. Process mining, on the other hand, generally uses the entire population of transactions, so that the potential alpha and beta errors of a sample no longer exist. However, this consideration becomes difficult if, for example, several hundred or even thousands of deviations ('red flags') are identified during the audit. In such a case, it is necessary to decide how the auditor arrives at the intended reasonable assurance. Does the auditor draw a sample from the identified red flags, because of resource constraints? Or does the auditor use additional resources, because the risk of not evaluating all red flags impairs the audit judgment? In practice, the latter is often not feasible, sometimes pushing auditors back to traditional sampling approaches. This is of course not the way forward when new techniques are around to move closer to full-population testing. The answer should be sought in providing support for dealing with all these deviations (see Sect. 3.5).
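To make the red-flag problem concrete, the sketch below tests a single illustrative control over a whole (toy) population and then groups the deviating cases by identical trace, so that an auditor could review a few deviation patterns instead of every individual case. The control, the activity names, and the grouping step are assumptions chosen for illustration; they are one possible form of support, not the approach prescribed in this chapter.

```python
from collections import Counter

# Hypothetical full population: case id -> ordered list of activities.
POPULATION = {
    "c1": ["Create order", "Approve order", "Pay invoice"],
    "c2": ["Create order", "Pay invoice"],
    "c3": ["Create order", "Pay invoice", "Approve order"],
    "c4": ["Create order", "Pay invoice"],
    "c5": ["Create order", "Approve order", "Pay invoice"],
}


def is_red_flag(trace):
    """Illustrative control: a payment must be preceded by an approval."""
    if "Pay invoice" not in trace:
        return False
    return "Approve order" not in trace[: trace.index("Pay invoice")]


def red_flags_by_variant(population):
    """Test every case, then group the deviations by identical trace."""
    flagged = {cid: tuple(t) for cid, t in population.items() if is_red_flag(t)}
    return flagged, Counter(flagged.values())


flagged, variants = red_flags_by_variant(POPULATION)
```

Here three of the five cases are flagged, but they collapse into two variants: payment without any approval, and approval recorded only after payment.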

In addition to these general areas, further challenges can be identified, such as the high entry barriers and costs for smaller audit functions or audit firms.

## **8 Conclusion and Outlook**

The use of process mining in internal and external auditing offers a multitude of potential benefits. Through the visualization and analysis of process steps, especially in combination with an in-depth data analysis, auditors can use numerous new ways and approaches to generate unique insights. For example, process mining can support compliance with (global) process governance and thereby improve the process landscape in national and international organizations. Moreover, by combining data from different areas of the company, completely new contexts can be mapped and a true added value for auditing can be created. Process mining can support all areas of auditing work. This includes the identification of risk areas or compliance violations, the audit of the internal control system, and the compliance with governance requirements.

Of course, process mining is a tool on the process level, which is why the link to real business processes and data is of decisive importance for a successful implementation in auditing. Once this process connection is established in the audit, process mining offers numerous approaches and applications for both experienced and novice auditors. Depending on the focus and experience of the auditor, the field of application can be completely different. Successfully applying process mining in auditing means striking a balanced combination of the right process mining techniques and skills with the right level of audit expertise. Merely running an analysis, without the interpretation of a domain expert, is meaningless, as with any other data analysis. The added value is found in the powerful combination of techniques and domain expertise. This is where future investigations of this topic should focus.

## **References**



# **Robotic Process Mining**

Marlon Dumas<sup>1,3</sup>, Marcello La Rosa<sup>2,3</sup>, Volodymyr Leno<sup>1,2,3</sup>, Artem Polyvyanyy<sup>2</sup>, and Fabrizio Maria Maggi<sup>4</sup>

<sup>1</sup> University of Tartu, Tartu, Estonia, marlon.dumas@ut.ee

<sup>2</sup> University of Melbourne, Melbourne, Australia, {marcello.larosa,artem.polyvyanyy}@unimelb.edu.au

<sup>3</sup> Apromore, Melbourne, Australia, volodymyr.leno@apromore.com

<sup>4</sup> University of Bozen-Bolzano, Bolzano, Italy, maggi@inf.unibz.it

**Abstract.** User interaction logs allow us to analyze the execution of tasks in a business process at a finer level of granularity than event logs extracted from enterprise systems. The fine-grained nature of user interaction logs opens up a number of use cases. For example, by analyzing such logs, we can identify best practices for executing a given task in a process, or we can elicit differences in performance between workers or between teams. Furthermore, user interaction logs allow us to discover repetitive and automatable routines that occur during the execution of one or more tasks in a process. Along this line, this chapter introduces a family of techniques, called Robotic Process Mining (RPM), which allow us to discover repetitive routines that can be automated using robotic process automation technology. The chapter presents a structured landscape of concepts and techniques for RPM, including techniques for user interaction log preprocessing, techniques for discovering frequent routines, notions of routine automatability, as well as techniques for synthesizing executable routine specifications for robotic process automation.

## **1 Introduction**

The rigidity and complexity of legacy applications, particularly in large organizations, engender situations in which workers are required to perform repetitive routines to transfer data from one application to another via their user interfaces. Examples of such repetitive routines include:


© The Author(s) 2022

W. M. P. van der Aalst and J. Carmona (Eds.): Process Mining Handbook, LNBIP 448, pp. 468–491, 2022. https://doi.org/10.1007/978-3-031-08848-3\_16

The automation of such routines can eliminate tedious and demotivating manual work, reduce cycle times, and enhance data quality. Advances in Robotic Process Automation (RPA) technology [1,41] make it possible to automate routines like the above ones. However, building and maintaining RPA bots requires a significant investment and hence, it is important for organizations to make the right decisions as to which bots they should build. In a typical organization, there may be tens of thousands of types of tasks, and any of them may involve one or more repetitive routines. Some routines are sufficiently frequent and widespread across the organization that they can be identified and scoped via interviews, focus groups, and workshops with workers. Other routines, however, may be less widespread or performed sporadically, but still sufficiently often that it is beneficial to automate them.

*Robotic Process Mining (RPM)* is a family of techniques to discover repetitive routines that can be automated using RPA technology, by analyzing interactions between one or more workers and one or more software applications, during the performance of one or more tasks in a business process. In general, RPM techniques take as input User Interaction logs (*UI logs*).<sup>1</sup> These UI logs are recorded while workers interact with one or more applications, typically desktop applications. Based on these logs, RPM techniques produce specifications of one or more routines that can be automated using RPA or related tools.

Depending on the type of technique, the discovered routine specifications may be conceptual (i.e. non-executable) or *executable*. A conceptual routine specification provides guidance to analysts and developers to help them scope a repetitive routine and to build an executable script to fully or partially automate the routine. For example, a non-executable specification of a routine could take the form of a textual description (in natural language), or a sequence of screenshots corresponding to repetitive sequences of interactions, or a sequence of user interactions (e.g. ["open sheet","select cell", "edit cell", "copy cell contents", ...]). An executable routine specification is a specification that contains all the information required to fully reproduce the routine via a dedicated execution engine or to synthesize a script that can be executed using an RPA tool or a similar type of automation tool.

This chapter reviews the state of the art in the field of RPM and provides a structured overview of the steps of a typical RPM pipeline, the techniques that may be employed in each of these steps, as well as open research challenges on the way to realizing mature RPM tool sets.

The chapter is partially based on a previous journal article [32]. The chapter extends this journal article by positioning the vision of RPM within the broader context of task mining and process mining, and by providing an updated review of related work in the field.

The rest of the chapter is structured as follows. Section 2 gives an overview of techniques related to robotic process mining, including task mining and process mining, and gives an overview of existing work on identification of task automation opportunities. Section 3 presents a framework for robotic process mining and introduces techniques covering each component of the framework. Finally, Sect. 4 discusses open challenges in the field of robotic process mining.

<sup>1</sup> In this chapter, we use the acronym *UI* to refer to a *user interaction*, not to be confused with a *user interface* which is another common use of this acronym.

## **2 Background**

#### **2.1 Robotic Process Automation**

RPA is a class of tools to automatically execute sequences of steps (herein called *routines*) involving interactions between a user and a software application, or interactions between multiple applications via Application Programming Interfaces (APIs). In an RPA tool, the execution of a routine is driven by a pre-specified script, which consists of atomic steps corresponding to individual interactions, assembled together via control-flow structures (if-then-else statements, repeat-until loops, etc.) [43]. A common characteristic of RPA tools is that they are able to "operate on the user interfaces of computer systems in the way a human would do" [1]. For example, an RPA tool may perform clicks or keystrokes on the user interface of a desktop application to mimic a sequence of steps that would normally be performed by a human operator. Examples of RPA tools, as of the time of writing this chapter, include Automation Anywhere RPA Workspace<sup>2</sup>, Blue Prism Intelligent Automation Platform<sup>3</sup>, Microsoft Power Automate Desktop<sup>4</sup>, RocketBot<sup>5</sup>, and UiPath Platform.<sup>6</sup>
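The idea of a script built from atomic steps and control-flow structures can be sketched in plain Python. The step names (`select_row`, `copy`, `paste`, `skip_row`) and the `run_copy_routine` function are invented for this illustration; they do not correspond to the scripting language or API of any of the RPA tools named above.

```python
# Toy model of an RPA-style script: atomic steps combined with control flow.
# Nothing here corresponds to a real RPA tool's API.

def run_copy_routine(source_rows, target_form):
    """Copy each row into the target, skipping rows without an e-mail address."""
    executed = []
    for row in source_rows:                      # loop over the input records
        executed.append(("select_row", row["id"]))
        if not row.get("email"):                 # if-then-else branch
            executed.append(("skip_row", row["id"]))
            continue
        executed.append(("copy", "email"))
        target_form.append({"id": row["id"], "email": row["email"]})
        executed.append(("paste", "email"))
    return executed


rows = [{"id": 1, "email": "a@example.org"}, {"id": 2, "email": ""}]
target = []
steps = run_copy_routine(rows, target)
```

The returned list of steps plays the role of an execution trace, while the loop and the branch stand in for the control-flow structures that assemble the atomic steps.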

Typically, RPA tools include a design environment, where different types of users, ranging from software developers to business users, may specify and test scripts to automate one or more routines. Each such script is then embedded into a so-called *software bot*. A bot is a unit of execution in an RPA tool. A bot is responsible for executing a given script whenever a given type of trigger occurs. Bots are operated via so-called control dashboards, which allow human operators to oversee the work performed by a collection of bots.

Depending on how the control dashboard is used, we can distinguish two RPA use cases: *attended* and *unattended* [43]. In attended use cases, the bot is triggered by a user. During its execution, an attended bot may provide data to a user and take in data from a user. In these use cases, the user may run the bot's script step-by-step, pause or stop the bot, or otherwise intervene during the script's execution. Attended bots are suitable for routines where dynamic inputs are required (i.e. inputs gathered during a routine execution), where some decisions or checks require human judgment, or when the routine is likely to have unforeseen exceptions. For example, entering data from an invoice in a spreadsheet format into a financial system is an example of a routine suitable for attended RPA, given that in this setting, some types of errors may have financial consequences. Unattended RPA bots, on the other hand, execute scripts without human involvement and do not take inputs during their execution. Unattended RPA bots are suitable for executing deterministic routines where all execution paths (including exceptions) are well understood and can be codified. Copying records from one system into another via their user interfaces through a series of copy-paste operations is an example of a routine that an unattended bot could execute. In this chapter, we focus on unattended RPA bots.

<sup>2</sup> https://www.automationanywhere.com/.

<sup>3</sup> https://www.blueprism.com/.

<sup>4</sup> https://powerautomate.microsoft.com/.

<sup>5</sup> https://www.rocketbot.com/.

<sup>6</sup> https://www.uipath.com/.

Figure 1 presents a simple lifecycle model of RPA bots, which we use below to position the role of robotic process mining.<sup>7</sup> According to this lifecycle model, an RPA bot goes through four phases:

**Fig. 1.** Simple RPA bot lifecycle [23]


<sup>7</sup> For the sake of conciseness, the RPA bot lifecycle model discussed here consists of four coarse-grained phases. A finer-grained RPA bot lifecycle can be found, for example, in [16].


In this chapter, we focus on techniques that leverage UI logs to support the analysis and development phases of RPA bots.

#### **2.2 Task Mining**

Task mining is a collection of techniques for analyzing the execution of tasks performed by human workers, based on records of interactions between these workers and one or more software applications. Depending on the goal of the analysis, we can distinguish between three use cases of task mining [26]: (i) task discovery and optimization; (ii) resource and workforce optimization; and (iii) task automation.

**Task Discovery and Optimization.** In this use case, the goal is to discover how a task is performed by one or more workers, to identify deviations with respect to policies or work instructions related to that task, and/or to uncover ways of improving the performance of the task. By applying task mining techniques to a task, we may discover that different workers perform the task in different ways. For example, one worker might open all the desktop windows required to perform a task upfront (e.g. an email client, a spreadsheet application, and a browser window connected to a CRM system), and only once all windows are open, they start navigating across these windows to complete the task. Another worker might start performing the task in one desktop window (e.g. the email client's window) and then open the other windows incrementally. Similarly, one worker might usually execute a task in a single go, without interruptions, while another might interleave the execution of the task with other work, or might multitask.

Having identified how a task is performed by one or more resources, task mining can help us to identify steps in a task that are responsible for delays (bottlenecks), as well as common rework loops or workarounds with respect to normative work instructions. Task mining also allows us to relate the sequences of steps that different workers perform with performance measures, such as the mean cycle time of a task or the defect rate of a task. For example, task mining may help us to identify that when a given step, such as clicking on a given cell number in a sheet, is repeated multiple times, the mean cycle time of a task is significantly higher than when this cell is visited only once.
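The kind of comparison described above, relating the repetition of a step to the mean cycle time, can be sketched as follows. The task executions, step labels, and cycle times are made up for illustration, and the helper assumes that both groups are non-empty (`statistics.mean` raises an error on empty input).

```python
from statistics import mean

# Hypothetical task executions: (steps performed, cycle time in seconds).
EXECUTIONS = [
    (["open sheet", "click cell B2", "copy", "paste"], 40),
    (["open sheet", "click cell B2", "copy", "click cell B2", "paste"], 75),
    (["open sheet", "click cell B2", "copy", "paste"], 50),
    (["open sheet", "click cell B2", "click cell B2", "click cell B2", "paste"], 95),
]


def mean_cycle_time_by_repetition(executions, step):
    """Compare the mean cycle time when `step` occurs once versus several times."""
    once = [t for steps, t in executions if steps.count(step) == 1]
    repeated = [t for steps, t in executions if steps.count(step) > 1]
    return {"once": mean(once), "repeated": mean(repeated)}


result = mean_cycle_time_by_repetition(EXECUTIONS, "click cell B2")
```

A real analysis would also need to check whether such a difference is statistically meaningful; the sketch only shows the grouping.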

**Resource and Workforce Optimization.** In this use case, the goal is to identify inefficiencies in the way tasks are assigned to resources, or conversely, to uncover ways to improve the assignment of tasks. For example, by analyzing UI logs, we may find that when an invoice entry task relates to an invoice from a company in country X, it takes more time for worker A to perform the task (rather than another worker B) whereas the opposite holds for invoices coming from country Y. We might also find that when worker A performs an invoice data entry task after 4:30pm, the task gets completed faster, but when this happens, some fields in the invoice are left unfilled, which might then be causing issues downstream.

**Task Automation and Robotic Process Mining.** In this use case, the goal is to discover opportunities to automate a task or part of a task. The automation of a task can be achieved using a variety of technologies. For example, if a task involves information flows between multiple applications, one could use middleware technology to programmatically connect these applications, thus replacing the manual information flow with an automated (programmatic) flow. Another approach is to develop an RPA bot to transfer data from one application to another by replicating the user interactions that a human worker would perform to achieve this. Robotic Process Mining (RPM) refers to the use case of task mining where UI logs are analyzed in order to identify frequent routines that can be automated by means of one or more RPA bots. The rest of this chapter focuses on this latter use case of task mining.

#### **2.3 Relations Between Task Mining and Process Mining**

Task mining is in many ways related to process mining, particularly to techniques for automated process discovery (cf. Sects. 2 and 3). However, task mining and process mining differ in several respects. These differences stem from the differences in the inputs of these techniques. Process mining techniques take as input event logs extracted from enterprise systems that support the execution of one or more business processes in an organization, e.g. Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM) systems, as discussed in [2]. Meanwhile, task mining techniques take as input UI logs, consisting of records of micro-steps performed by workers while they interact with software applications to perform individual tasks in a process. Both types of logs consist of timestamped records, such that each record refers to the execution of an action (or task) by a user. Also, each record may contain a payload consisting of one or more attribute-value pairs. However, UI logs and event logs differ in at least four ways.

First, event logs intended for process mining consist of events at a coarser level of granularity than UI logs. An event in an event log typically refers to the start, completion or other significant state change in the execution of a task within a business process, such as *Check purchase order* or *Transfer student records*. Such tasks can be seen as a composition of lower-level (micro-)steps, which may be recorded in a UI log. For example, task *Transfer student records* may involve multiple actions to copy the records associated with a student (name, surname, address, course details) from one application to another. In other words, a UI log may contain dozens or even hundreds of entries per task execution, whereas an event log would typically only contain one or a handful of entries per task execution. Also, the payload of the events in a UI log may contain low-level information such as the specific cell or the pixel coordinates involved in a user interaction, or it may be associated with a screenshot taken during a user interaction. In contrast, event logs contain business-relevant attributes, such as the amount of a loan offer, the interest rate, the repayment term, etc.

Second, UI logs do not come with a notion of *case identifier* (or process instance identifier), whereas event logs typically do. In other words, events in a UI log are not explicitly correlated. A typical UI log consists of thousands of user interactions recorded during a period of several hours on the workstation(s) of one or more workers. Prior to being used, such UI logs need to be segmented into logical units corresponding to task executions, as discussed later in this chapter.
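To give a feel for what segmentation means, the sketch below splits a time-ordered UI log wherever the pause between two consecutive interactions exceeds a threshold. This idle-gap rule is a deliberately naive heuristic chosen for illustration; it is an assumption of this sketch and not the segmentation technique discussed later in the chapter. The two-minute threshold and the action labels are likewise hypothetical.

```python
from datetime import datetime, timedelta


def segment_by_idle_gap(ui_log, max_gap=timedelta(minutes=2)):
    """Split a time-ordered UI log wherever the pause between two
    consecutive interactions exceeds `max_gap` (a naive heuristic)."""
    segments, current, previous = [], [], None
    for timestamp, action in ui_log:
        if previous is not None and timestamp - previous > max_gap:
            segments.append(current)
            current = []
        current.append(action)
        previous = timestamp
    if current:
        segments.append(current)
    return segments


t0 = datetime(2022, 1, 3, 9, 0, 0)
ui_log = [
    (t0, "open sheet"),
    (t0 + timedelta(seconds=5), "copy cell"),
    (t0 + timedelta(seconds=9), "paste"),
    (t0 + timedelta(minutes=10), "open sheet"),
    (t0 + timedelta(minutes=10, seconds=4), "copy cell"),
]
segments = segment_by_idle_gap(ui_log)
```

Such a rule breaks down as soon as workers interleave tasks or pause in the middle of a task, which is one reason why segmentation needs dedicated techniques.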

Third, a record in an event log often does not contain all input or output data used or produced during the execution of the corresponding task. For example, a record in an event log corresponding to an execution of task *Transfer student records* is likely not to contain all attributes of the corresponding student (e.g. address). Meanwhile, a UI log typically collects all the data observed during the execution of a task, particularly when the UI log is intended to be used for RPM purposes. Indeed, if some input or output attributes are missing in the UI log, the resulting routine specification would be incomplete, and hence the resulting RPA bot would not perform the routine correctly.

A fourth difference is that event logs are typically obtained as a by-product of transactions executed in an information system, rather than being explicitly recorded for analysis purposes. The latter characteristic entails that event logs are more likely to suffer from incompleteness, including missing attributes as discussed above, but also missing events. For example, in a patient treatment process in a hospital, it may be that the actual arrival of the patient to the emergency room is not recorded when a patient arrives by themselves, but it is recorded when a patient arrives via an ambulance. In other words, the presence or absence of an event in an event log depends on whether or not the information system is designed to record it, and whether or not the workers actually record it. Meanwhile, an UI log is recorded specifically for analysis purposes, which allows all relevant events to be collected subject to the capabilities of the UI recording tool.

The above differences in the input entail that it is often neither possible nor desirable to use the same techniques for process mining as for task mining. In the field of process mining, a typical visualization consists of a graph with one node per activity. The emphasis of these techniques is to show the most frequent control-flow dependencies between the activities of the process. This approach is not feasible in the context of task mining because the steps are fine-grained and therefore too numerous to be displayed in their entirety. Besides, only certain steps are relevant for a given use-case, specifically those that are part of a frequent routine. Accordingly, a task mining technique typically starts by pre-processing the UI log in order to extract only the most frequent sequences of steps (i.e. the most frequent routines) using sequence pattern mining techniques, or using event abstraction techniques such as those developed in the field of process mining [44].

Notwithstanding these differences, several commercial process mining vendors, such as Apromore<sup>8</sup>, Celonis<sup>9</sup>, and Minit<sup>10</sup>, take advantage of the commonalities between UI logs and business process event logs to offer task mining features. Typically, these tools discover directly-follows graphs (cf. [3]) from UI logs or from combinations of event logs and UI logs. For example, these tools may discover directly-follows graphs to visualize the sequences of screens visited by a user during the performance of one or more tasks, or to visualize the most frequent or the slowest steps during the performance of a task.
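To make the notion of a directly-follows graph over a UI log concrete, the following minimal sketch counts how often one screen is immediately followed by another. The screen names and the list-of-lists input format are illustrative assumptions and do not reflect the data model of any of the tools mentioned above.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often step a is immediately followed by step b."""
    counts = Counter()
    for trace in traces:
        # Pair every step with its direct successor in the same trace.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


# Hypothetical screen names, one list per recorded session.
sessions = [
    ["Login", "Search", "StudentForm", "Confirm"],
    ["Login", "StudentForm", "Confirm"],
]
dfg = directly_follows(sessions)
```

Each key of `dfg` is an arc of the graph and each value is its frequency, which is the information such a visualization would typically use to highlight the most frequent paths.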

These visualizations are suitable when analyzing tasks for the purpose of task optimization and workflow optimization (cf. the first two use-cases above). They can also help users to visually detect candidate routines for automation, when those routines have a simple structure (e.g. perfect sequences of steps). However, beyond simple scenarios, these visualizations do not allow users to determine if a given task contains routines that can be automated by means of an RPA bot. In this respect, RPM techniques complement task mining techniques by explicitly addressing the questions of: (1) how to identify candidate routines for automation? and (2) how to derive an executable specification of a routine that has been identified as a candidate for automation?

## **3 Robotic Process Mining: A Framework**

RPA tools are able to automate a wide range of routines, raising the question *how to identify routines in an organization that may be beneficially automated using RPA?* [41] To address this question, we envision a new class of tools, namely Robotic Process Mining (RPM) tools.

We define RPM as *a class of techniques and tools to analyze data collected during the execution of user-driven tasks to support identifying and assessing candidate routines for automation and discovering routine specifications that RPA bots can execute*. In this context, a *user-driven task* is a task that involves interactions between a user (e.g. a worker in a business process) and one or more software applications.

Accordingly, the primary source of data for RPM tools consists of user interaction (UI) logs. RPM aims at assisting analysts in drawing up a systematic inventory of candidate routines for automation and at helping them produce executable specifications that can be used as a starting point for their automation.

<sup>8</sup> https://apromore.com.

<sup>9</sup> https://celonis.com.

<sup>10</sup> https://minit.io.

#### **3.1 UI Logs and Routines**

Figure 2 presents a class diagram capturing the core concepts of RPM and their relations. In this class diagram, the two main concepts are User Interaction log (*UI log*) and Routine. UI logs are the input of RPM, while routines (represented as routine specifications or as RPA scripts) are the output of RPM.

**Fig. 2.** Class diagram of RPM concepts

A UI log is a chronologically ordered sequence of user interactions, or UIs in short, performed by a single user on a single workstation and involving interactions across one or more applications (including web and desktop applications). An example of a UI log, which we use herein as a running example, is given in Table 1.

Each row in this example corresponds to one UI (e.g. clicking a button or copying the content of a cell). Each UI is characterized by a *timestamp*, a *type*, and a set of *parameters*, or *payload* (e.g. application, button's label or value of a field). To be useful in the context of RPA, the payload should contain sufficient information for a software bot to reproduce the performed activity. For example, for a UI that refers to clicking a button, it is important to store a unique identifier of this button (e.g. either the element identifier, or its name if this is unique in the page). Likewise, for an event that refers to editing a field, an identifier of the field as well as a new value assigned to that field are required attributes. The payload of a UI is not standardized and depends on the UI type and application.


**Table 1.** Fragment of a user interaction log

Consequently, the UIs recorded in the same log may have different payloads. For example, the payload of UIs performed within a spreadsheet contains information regarding the spreadsheet name and the location of the target cell (e.g. the cell's row and column). In contrast, the payload of the UIs performed in a web browser contains information regarding the webpage URL, the name and identifier of the UI's target HTML element, and its value (if any).
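The structure of a UI record described above can be sketched as a small data class. The type labels (`clickButton`, `editField`), the payload keys, and the `REQUIRED_KEYS` table are hypothetical names chosen for illustration. They are not the schema of any particular recording tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class UI:
    timestamp: datetime
    ui_type: str  # hypothetical type label, e.g. "clickButton"
    payload: dict = field(default_factory=dict)  # application-specific parameters


# Assumed minimum payload a bot would need in order to replay each UI type.
REQUIRED_KEYS = {
    "clickButton": {"element_id"},
    "editField": {"element_id", "value"},
}


def missing_keys(ui):
    """Return the payload keys that are required for replay but absent."""
    return REQUIRED_KEYS.get(ui.ui_type, set()) - set(ui.payload)


click = UI(datetime(2022, 3, 1, 9, 0, 5), "clickButton",
           {"application": "Chrome", "url": "https://example.org/form",
            "element_id": "submit"})
edit_without_value = UI(datetime(2022, 3, 1, 9, 0, 9), "editField",
                        {"application": "Chrome", "url": "https://example.org/form",
                         "element_id": "lastName"})
```

Here `missing_keys(click)` is empty, whereas `missing_keys(edit_without_value)` reports that the new field value was not captured, so the edit could not be reproduced by a bot.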

An UI log consists of interactions of different types. To illustrate the types of interactions that may be exploited in the context of robotic process mining, Table 2 provides the concrete list of UI types (and associated parameters) supported by the Action Logger tool [33]. Action Logger is an open-source UI recording tool designed to record events generated by browsers and desktop applications, in a way that enables the discovery of automatable routines.

Note that in Table 2, the UI types are grouped into three groups: navigation, read, and write UIs. Navigation UIs correspond to actions that affect the state of the user interface, but without reading or writing any data. This includes, for example, moving from one tab to another in a browser, or selecting a cell in an Excel spreadsheet. Read actions are those where some data item is accessed, for example in order to copy it into the clipboard. Meanwhile, write actions are those where data is written into an element of the UI, for example, pasting the contents of the clipboard into the currently selected cell of an Excel spreadsheet.


**Table 2.** User interaction types and their parameters

To obtain an UI log suitable for RPM, all UIs related to a particular task have to be recorded. This recording procedure can be long-running, covering a session of several hours of work if the user performs multiple instances of this task one after the other. During such a session, a worker is expected to perform a number of tasks of the same or different types. The UI log shown in the example above describes the execution of a task corresponding to transferring student data from a spreadsheet into the web form of a study information system. The web form requires information such as the student's first name, last name, date of birth, and country of residence. If the country of residence is not Australia, the worker needs to perform one more step, indicating that the student will be registered as an international student.

Each execution of a task (herein also called a *task instance*) is represented by a *task trace*. In our running example, there are two traces belonging to a "new record creation" task. From the log, we can see that the worker performed this task in two different ways. In the first case, she manually filled in the form (UIs 1 to 18), while in the second case, she copied the data from a worksheet and pasted it into the corresponding fields (UIs 19 to 41).

Given a collection of task traces, the goal of RPM is to identify a repetitive sequence of UIs that can be observed in multiple task traces, herein called a *routine*, and to identify routines amenable for automation. For each such routine, RPM then aims at discovering an executable specification (herein called a *routine specification*). This routine specification may be initially captured in a platform-independent manner and then compiled into a platform-dependent *RPA script* to be executed in a given RPA tool.

#### **3.2 RPM Phases**

We distinguish three main phases in RPM: (1) collecting and pre-processing UI logs corresponding to the executions of one or more tasks; (2) discovering candidate routines for RPA; and (3) discovering executable RPA routines.<sup>11</sup>

**Collecting and Pre-processing UI Logs.** We decompose the first phase into the recording step itself and two preprocessing steps, namely the segmentation of the log into task traces and the simplification of the resulting task traces. We map the second phase into a single step. Then, we decompose the third phase into three steps: the discovery of platform-independent routine specifications, the aggregation of routines with the same effects, and the compilation of the discovered specifications into platform-specific executable scripts. This decomposition of the three phases into steps is summarized in the RPM pipeline depicted in Fig. 3. Below we discuss each step of this pipeline.

<sup>11</sup> Once an RPA routine has been automated via an RPA bot, a fourth phase is to monitor this bot to detect anomalies or performance degradation events that may signal that the bot may need to be adjusted and re-implemented or retired. While relevant from a practical perspective, this phase is orthogonal to the three previous phases since it is relevant both for bots developed manually and bots developed using RPM techniques. Furthermore, previous work has shown that existing process mining tools are suitable for analyzing logs produced by RPA bots for monitoring purposes [20].

**Fig. 3.** RPM pipeline

The recording of an UI log involves capturing low-level UIs, such as selecting a field in a form, editing a field, opening a desktop application, or opening a web page. UI log recording may be achieved by instrumenting the software applications (including web browsers) used by the workers via plug-in or extension mechanisms. Logs collected by such plug-ins or extensions may be merged to produce a raw UI log corresponding to the execution of one or more tasks by a user during a period of time. This raw log usually needs to be preprocessed to be suitable for RPM.

The main challenge in this step is to identify what UIs must be recorded. The same UI (e.g. mouse click) can either be important or irrelevant in a given context. For example, a mouse click on a button is an important UI, but a mouse click on a web page's background is an irrelevant UI. Also, when a worker selects a web form, we need to record UIs at the level of the web page (the Document Object Model – DOM) in order to learn routines at the level of logical input elements (e.g. fields) and not at the level of pixel coordinates, which are dependent on screen resolution and window sizes. Existing UI recording tools, such as JitBit Macro Recorder<sup>12</sup>, TinyTask<sup>13</sup>, and WinParrot<sup>14</sup>, save all the UIs performed by the user at too low a level of granularity, with reference to pixel coordinates (e.g. click the mouse at coordinates 748,365). As a result, the UI logs generated by these tools are not suitable for extracting useful routines. The RPA tools mentioned in Sect. 2.1 (e.g. UiPath and Automation Anywhere) provide recording functionality. However, this functionality is intended to record RPA scripts. These tools do not capture details about different fields' values, as these values are not relevant for RPA script generation. For example, an RPA script must know which cell in a spreadsheet has to be copied, and it is agnostic to the value stored in that cell. Hence, a new family of recording tools is needed to record UI logs required for RPM.

<sup>12</sup> https://www.jitbit.com/macro-recorder/.

<sup>13</sup> https://www.tinytask.net/.

<sup>14</sup> http://www.winparrot.com/.

In [33], we introduced a tool to record UI logs in a format that is suitable for RPM. The tool records not only the UI actions (selecting a field, editing a field, copying into or pasting from the clipboard) but also the values associated with these actions (e.g. the value of a field after an editing event). The tool supports MS Excel and Google Chrome. The tool also simplifies the recorded UI logs by removing redundant events (e.g. double-copying without pasting, navigation between cells in Excel without modifying or copying their content). The applicability of such a tool, however, is limited to desktop applications that provide APIs for listening to UI events and accessing the data consumed and produced by these events. To achieve a more general solution, it may be necessary to combine this latter approach with OCR technology in order to detect UI events and associated data from application screenshots, as outlined in [35,38].

In its raw form, an UI log consists of one single sequence of UIs recorded during a session. During this session, a user may have performed several executions of one or multiple tasks, that may be mixed up in the log. Moreover, in case of multi-tasking, UIs of multiple concurrent task executions may be mixed together. Before identifying candidate routines for automation, an UI log has to be segmented into task traces, such that each trace corresponds to the execution of one task instance. This involves the identification of the boundaries of the tasks and the assignment of UIs to specific task traces. Given the fragment of the UI log demonstrated in the running example, we can extract two segments, each corresponding to the processing of a specific entry in the spreadsheet containing students' data (UIs 1 to 18 and 19 to 41 in Table 1).

The problem of extracting segments from a UI log corresponding to task instances is similar to that of web session reconstruction, where the goal is to identify the beginning and the end of web navigation sessions in server log data (e.g. streams of clicks and web page navigation) [40]. Methods for session reconstruction are usually based on heuristics that rely on the structural organization of websites or time intervals between events. The former approach covers only the cases where all the user interactions are performed in the web applications. In contrast, the latter approach assumes that users make breaks in-between two consecutive segments – in our case, two routine instances.
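The time-interval heuristic can be illustrated with a minimal sketch that starts a new segment whenever the pause between two consecutive UIs exceeds a threshold. The `(timestamp, label)` input format and the one-minute threshold are assumptions made for this example only.

```python
from datetime import datetime, timedelta


def segment_by_gap(uis, max_gap):
    """Split chronologically ordered (timestamp, label) pairs at long pauses."""
    segments = []
    current = []
    previous_ts = None
    for ts, label in uis:
        # A pause longer than max_gap is taken as a boundary between segments.
        if previous_ts is not None and ts - previous_ts > max_gap:
            segments.append(current)
            current = []
        current.append((ts, label))
        previous_ts = ts
    if current:
        segments.append(current)
    return segments


t0 = datetime(2022, 3, 1, 9, 0, 0)
log = [
    (t0, "select field"),
    (t0 + timedelta(seconds=10), "edit field"),
    (t0 + timedelta(minutes=5), "select field"),
    (t0 + timedelta(minutes=5, seconds=5), "edit field"),
]
segments = segment_by_gap(log, max_gap=timedelta(minutes=1))
```

The sketch also shows why this heuristic is fragile for UI logs: a worker who pauses in the middle of a task instance would cause that instance to be split into two segments.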

The problem of segmentation is also related to that of preprocessing so-called *uncorrelated event logs* in process mining. As discussed in [2,3] each event in a log should include, as a minimum, a case identifier, a timestamp, and an activity label. When the events of an event log do not have a case identifier, the log is said to be uncorrelated. Various methods have been proposed to extract correlated (i.e. regular) event logs from uncorrelated ones. However, existing methods in this field address the problem in restrictive settings. Specifically, some approaches [17] assume that the underlying process is acyclic, while others [10,11] assume that an explicit process model is given as input (in addition to the uncorrelated event log). These assumptions do not hold in the context of RPM, where no explicit process model is available, and a routine may contain repetitions. Also, the above approaches sometimes produce inaccurate results, whereas in the context of RPM, we need to identify routines with high levels of confidence (preferably 100% confidence), since an inaccurate replication of a routine by an unattended RPA bot may lead to costly errors.

In some scenarios, segmentation may be accomplished by combining transactional data recorded by enterprise information systems and user interaction logs, as proposed in [35]. However, a shortcoming of this approach is that such transactional data often provides only limited information about the process context, which is not enough to identify the boundaries of tasks captured in the user interaction logs.

Recent work on UI log segmentation [5,7] proposes to use trace alignment between the logs and the corresponding interaction models to identify the segments. In practice, however, such interaction models are not available beforehand.

Another related work [30] proposes to discover segments in the log by identifying cycles in the graph constructed from this log. These cycles represent repetitive behavior in the log and thus potentially correspond to task instances recorded in the log. However, this approach assumes that the task instances recorded in the log do not overlap and occur consecutively, one after the other.

In the context of desktop assistants, research proposals such as TaskTracer and TaskPredictor have tackled the problem of analyzing UI logs generated by desktop applications to identify the current task performed by a user and to detect switches between one task and another [15,39]. These approaches can potentially be used to split the UI logs into segments corresponding to different tasks. However, such approaches are not able to distinguish different instances of the same task.

Ideally, UIs recorded in a log should only relate to the execution of the task(s) of interest. However, in practice, a log often also contains UIs that do not contribute to completing the recorded task(s). We can consider such UIs to be *noise*. Examples of noise UIs include a worker browsing the web (e.g. social networking) while executing a task that does not require doing that, or a worker committing mistakes (e.g. filling a text field with an incorrect value or copying a wrong cell of a spreadsheet). UIs 6, 7, 8, 9, 10, and 11 are noise in our running example. During the creation of the student record, the worker decided to make a small pause, switched to a new tab in the web browser (6–7), and navigated to Facebook (8), where she spent almost 4 min browsing the news feed, before going back to the tab with the active student form (9). All these UIs do not have any relation to the task being recorded; thus, they constitute noise. When performing the task, the worker selected a surname field in the form (10) and made a mistake by accidentally misspelling the surname of the student (11). She then had to select the same field again (12) and fill it in with the correct value (13). Although the UIs 10 and 11 belong to the performed task, their effects are overwritten by successive UIs (e.g. UI 11 is overwritten by UI 13) and, therefore, they do not affect the outcome of the routine and are considered to be noise. The presence of the noise may negatively affect the subsequent steps of the RPM pipeline (e.g. the discovery of the candidate routines). Accordingly, the next step in the RPM pipeline is *simplification*, which aims at noise identification and removal. The UIs in the log are removed so that the resulting log captures the same effects as the original one while being simpler (i.e. having fewer UIs).
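One narrow part of simplification, removing edits whose effect is overwritten by a later edit of the same field, can be sketched as follows. The dictionary keys and the type labels `select`, `edit`, and `copy` are hypothetical, and the rule is a deliberate simplification: it drops only overwritten edits and leaves the now redundant field selection in place.

```python
def drop_overwritten_edits(trace):
    """Remove edits whose value is replaced by a later edit of the same target
    before that value is read."""
    kept_reversed = []
    overwritten_later = set()  # targets that are edited again further on
    for ui in reversed(trace):
        target = ui["target"]
        if ui["type"] == "edit":
            if target in overwritten_later:
                continue  # a later edit replaces this value, so drop it
            overwritten_later.add(target)
        elif ui["type"] == "copy":
            # The value is read here, so the edit preceding it still matters.
            overwritten_later.discard(target)
        kept_reversed.append(ui)
    return list(reversed(kept_reversed))


# Hypothetical encoding of UIs 10 to 13 of the running example.
trace = [
    {"id": 10, "type": "select", "target": "lastName"},
    {"id": 11, "type": "edit", "target": "lastName", "value": "Smiht"},
    {"id": 12, "type": "select", "target": "lastName"},
    {"id": 13, "type": "edit", "target": "lastName", "value": "Smith"},
]
simplified = drop_overwritten_edits(trace)
```

Scanning the trace backwards keeps the rule simple: the last edit of a field is always kept, and an earlier edit survives only if its value is read before being replaced.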

One of the challenges that arises during the pre-processing step of the RPM pipeline is to separate irrelevant UIs (i.e. noise) from those UIs that do contribute to the completion of a task. A possible approach is to assume that noise takes the form of chaotic events that may happen anywhere during process execution. One technique for filtering out such chaotic events is described in [42]. However, if noise gravitates towards one particular state or set of states in the task (e.g. towards the start or the end of the task), techniques such as the one mentioned above may not discover it and consequently not filter it out. Moreover, some UIs can be mistakenly removed because the same task can be performed in different ways, which may induce what mistakenly appear to be chaotic sequences of UIs. Thus, it is important to consider the data perspective, i.e. values of data objects that are manipulated by the UIs. In this way, one can identify the UIs that share the same parameter values (e.g. copying a value from a worksheet and then pasting it in a web form), or have the same source/origin (e.g. all the UIs are performed on the same website). The UIs that do not share any data parameters and/or values or originate from different sources most likely constitute noise.

**Discovering Candidate Routines for Automation.** Given a set of simplified task traces, the next phase is to identify candidate routines for automation. This phase aims at extracting repetitive sequences of UIs that occur across multiple task traces, a.k.a. routines, and at identifying which of those routines are amenable for automation. The output of this step is a set of candidate routines for automation.

Even though an automated RPM tool can considerably reduce the effort required to automate a routine, there is still a lot of development, quality assurance, and maintenance effort required to automate a routine in a real-life setting. Also, the automation of a routine may require re-training and re-allocation of human workers involved in the process. And if the routine is only partially automated (as opposed to fully automated), some handoffs will have to be put in place between the manual and the automated parts of a routine. As a result, the costs of automating a routine may sometimes (or even often) outweigh the benefits. Thus, the cost-benefit analysis of routine automation is an important step in an end-to-end RPM method. To perform this analysis, a first step is to assess whether a routine is suitable for automation.

Mindful of this requirement, Lacity and Willcocks [27] propose high-level guidelines for determining if a task is a candidate for automation in the context of a case study at Telefonica. The guidelines, however, do not provide a formal and precise definition of what makes a routine suitable for automation.

In a recent systematic review of the RPA literature, Syed et al. [41] conclude that "there is a need for formal, systematic and evidence-based techniques to determine the suitability of tasks for RPA." In other words, a major challenge in the field of RPM is how to formally characterize what makes a routine amenable for automation via RPA or other automation technologies.

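Two necessary criteria for a routine to be amenable for automation are: (1) frequency, i.e. the routine is observed frequently in the UI log; and (2) determinism, i.e. the input(s) of each step in the routine (except possibly the first one) can be derived from the data observed in previous steps.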


Considering the running example provided in Table 1 and assuming that the identified task traces frequently occur in the log, we would discover two candidate routines, handling the domestic and international students, respectively. Note that the routine in the first task trace is only partially automatable. The worker manually filled in the form by looking at the corresponding entry values in the spreadsheet. Since she did not read the data values explicitly (e.g. by copying the values to the clipboard), these values are unknown for the recording tool. Hence, it is not possible to understand how the values used for editing the form's fields were obtained. On the other hand, the routine from the second task trace is fully automatable, as it is clear how to compute the values for the fields of the web form in the target application (i.e. by copying them from the spreadsheet).

Several techniques proposed in the field of UI log mining address the problem of identifying routines that fulfill the "frequency" criterion. Dev and Liu [14] have noted that the problem of frequent routine identification from (segmented) UI logs can be mapped to that of frequent pattern mining, a well-known problem in the field of data mining [22]. In the literature, several algorithms are available to mine frequent patterns from sequences of symbols. Depending on their output, we can distinguish two types of frequent pattern mining algorithms: those that discover only exact patterns [28,37] (hence vulnerable to noise), and those that allow frequent patterns to have gaps within the sequence of symbols [18,45] (hence noise-resilient).
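As an illustration of the first family (exact patterns), the sketch below counts contiguous subsequences of a fixed length and keeps those that occur in a minimum number of task traces. It is a naive enumeration written for this example, not one of the cited algorithms, and the action labels are hypothetical.

```python
from collections import Counter


def frequent_subsequences(traces, length, min_support):
    """Return contiguous subsequences of the given length that occur in at
    least min_support traces, mapped to their support."""
    support = Counter()
    for trace in traces:
        # Collect each distinct window once per trace, so that support
        # counts traces rather than occurrences.
        windows = {tuple(trace[i:i + length])
                   for i in range(len(trace) - length + 1)}
        support.update(windows)
    return {p: s for p, s in support.items() if s >= min_support}


traces = [
    ["copy:A", "paste:first", "copy:B", "paste:last"],
    ["copy:A", "paste:first", "copy:B", "paste:last"],
    ["copy:A", "paste:first", "open:news", "copy:B", "paste:last"],
]
pairs = frequent_subsequences(traces, length=2, min_support=3)
full = frequent_subsequences(traces, length=4, min_support=3)
```

With these traces, the two copy-paste pairs are found in all three traces, but the full four-step routine is not, because the single noise action in the third trace breaks the exact match. This is the vulnerability to noise mentioned above, which gap-tolerant algorithms are meant to address.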

Bosco et al. [12] address the problem of discovering routines that fulfill the "determinism" requirement. Specifically, this technique discovers sequences of actions such that the input(s) of each action in the sequence (except the first one) can be derived from the data observed in previous actions. However, this technique can only discover perfectly sequential routines and is hence not resilient to noise and variability in the order of the actions.
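The idea behind this requirement can be sketched with a much simpler check than the cited technique: a write action is treated as determined only if the value it writes was read earlier in the same trace. This sketch assumes identity as the only possible derivation and ignores data transformations. The `kind` and `value` keys are hypothetical.

```python
def undetermined_writes(trace):
    """Return the positions of write actions whose value was not read by any
    earlier action in the trace."""
    observed = set()
    missing = []
    for position, ui in enumerate(trace):
        if ui["kind"] == "write" and ui["value"] not in observed:
            missing.append(position)
        if ui["kind"] == "read":
            observed.add(ui["value"])  # value becomes known to the recorder
    return missing


# The worker types the value by hand: its origin is unknown to the recorder.
typed = [
    {"kind": "navigation", "value": None},
    {"kind": "write", "value": "Albert"},
]
# The worker copies the value first: the write can be derived from the read.
copied = [
    {"kind": "read", "value": "Albert"},
    {"kind": "navigation", "value": None},
    {"kind": "write", "value": "Albert"},
]
```

Under these assumptions, `undetermined_writes(typed)` flags the write at position 1, while `undetermined_writes(copied)` returns an empty list, mirroring the difference between the two task traces of the running example.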

Leno et al. [29,31] combine techniques for discovering frequent routines, with techniques for discovering deterministic routines, thus addressing both of the above requirements. This latter proposal also addresses the problem of synthesizing an executable routine specification and that of detecting semantically equivalent routines, as discussed later in this chapter.

The discovery of automatable routines from sequences of actions is related to the problem of automated process discovery, discussed in [3,8] of this handbook. This relation is explored by Jiménez-Ramírez et al. [24], who apply process discovery techniques to extract process models from segmented UI logs. Importantly though, while it is possible to use automated process discovery algorithms to extract process models from segmented UI logs, the resulting process models cannot readily be used for automation (via RPA or other automation technology) for two reasons.

First, the process models discovered by process discovery techniques, such as those presented in [3,8], are control-flow models. They capture the occurrence and order of steps (tasks) in a process, but not the data taken as input and produced as output by each step in the process. Yet, in order to automate a routine, we need to know which data are used by each step in the routine and where these data come from. We note that a subset of process discovery approaches can discover process models with data-driven branching conditions [13], or process models where some control-flow relations only hold under certain data-driven conditions [36], but they do not discover process models with data manipulation logic.

Second, the process models produced by automated process discovery techniques typically contain traces that have not been observed (cf. the *generalization* property discussed in Chap. 2). However, when the purpose of a model is to serve as a blueprint for RPA, the generalization property is not desirable. Indeed, if a software bot executes such a model, it will sometimes produce sequences of actions that might not correspond to a sequence of actions that a human worker would have performed. This, in turn, may lead to errors and these errors may later require time-consuming and costly corrective actions. Instead, routines for RPA must be 100% precise (cf. the definition of precision in Chap. 2), as a lack of precision may lead to potential errors when the routines are executed by an unattended RPA bot.

**Discovering Executable Routine Specifications.** Having identified a set of candidate routines for automation, the next step is that of *executable (sub-) routine discovery*. For each candidate routine, this step identifies the *activation condition* (UIs 2 and 19 in Table 1), which indicates when an instance of the routine should be triggered, and the *routine specification*, which specifies what UIs should be performed within that routine, what data is used by each UI in the routine, and how these data should be obtained.

The discovery of a routine specification involves identifying and synthesizing the transformation functions that have to be applied to the input data to convert it to the required format in the target application. In the running example, we can see that the web form requires a different date format than the one used in the spreadsheet (UIs 29 to 34). Hence, transferring the date of birth via simple copy and paste operations is insufficient, and the transformation function must be applied to achieve the desired result.
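As a hedged illustration of synthesizing such a transformation from examples, the sketch below searches a small, fixed list of candidate date formats for a source and target format that are consistent with all observed input/output pairs. The candidate list and the example dates are assumptions; the actual formats used in Table 1 are not reproduced here, and the cited approaches are considerably more general.

```python
from datetime import datetime

# Assumed search space of date formats (strptime/strftime directives).
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y"]


def infer_date_transformation(examples):
    """Return the first (source_format, target_format) pair that maps every
    example input to its observed output, or None if no pair does."""
    for src in CANDIDATE_FORMATS:
        for dst in CANDIDATE_FORMATS:
            try:
                if all(datetime.strptime(value, src).strftime(dst) == expected
                       for value, expected in examples):
                    return src, dst
            except ValueError:
                continue  # some input does not parse with this source format
    return None


def convert(value, transformation):
    """Apply an inferred transformation to a new input value."""
    src, dst = transformation
    return datetime.strptime(value, src).strftime(dst)


# Hypothetical (copied value, pasted value) pairs observed in task traces.
examples = [("31/03/1999", "1999-03-31"), ("05/12/2001", "2001-12-05")]
transformation = infer_date_transformation(examples)
```

Once a consistent pair of formats has been found, `convert` applies it to values that were not part of the examples, which is what a bot would need in order to fill in the date field for a new record.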

The problem of discovering executable routine specifications has been widely studied in the context of table auto-completion and data wrangling. For example, Excel's Flash Fill feature detects string patterns in the values of the cells in a spreadsheet and uses these patterns for auto-completion [21]. Similarly, the authors in [9] propose an approach to extract structured relational data from semi-structured spreadsheets. However, such approaches can discover only the executable routines performed in one application and have a limited area of usage. In practice, RPA routines often involve multiple applications.

Bosco et al. [12] suggest that the discovery of executable routine specifications can be tackled by applying methods for automated discovery of data transformations from examples [4,25]. However, these methods suffer from scalability issues when applied naively. Leno et al. [29] explore this approach and propose a series of optimizations to improve performance of the data transformation discovery techniques in the context of synthesis of routine specifications for RPA. This approach is further elaborated by the same authors in [31].

Gao et al. [19] extract rules from segmented UI logs to automatically fill in (web) forms. However, this approach only discovers branching conditions that specify whether a given activity has to be performed or not (e.g. check a box in a form) and only focuses on copy-paste operations without identifying more complex manipulations.

Agostinelli et al. [6] present an approach to discover routines from segmented UI logs and automate these routines via scripts. This approach, however, assumes that all the actions within a routine are automatable. In practice, it is possible that some actions have to be performed manually and cannot be automated.

The output of the *executable (sub)routine discovery* step is a set of executable routine specifications of each automatable candidate routine. However, some of these specifications may produce identical effects, as they describe different variants of the same routine (e.g. filling in a web form in different orders). These variants are considered as duplicates and should be ignored, as their automation will not bring any benefits to the organization. Therefore, the next step in the RPM pipeline is *aggregation*. During this step, the discovered routine specifications leading to the same effects are replaced with one specification that captures the optimal way of performing the underlying routine. Several routine specifications may also be combined into a more complex specification that contains instructions on how to deal with different cases.
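The aggregation step can be sketched under two simplifying assumptions: the effect of a specification is the set of (target field, data source) assignments it performs, regardless of order, and the "optimal" variant is the one with the fewest steps. Both assumptions, as well as the `target` and `source` keys, are illustrative and are not taken from the cited work.

```python
def aggregate(specs):
    """Keep one specification per distinct effect, preferring the variant
    with the fewest steps."""
    best = {}
    for spec in specs:
        # Order-insensitive summary of what the specification writes where.
        effect = frozenset((step["target"], step["source"]) for step in spec)
        if effect not in best or len(spec) < len(best[effect]):
            best[effect] = spec
    return list(best.values())


first_then_last = [
    {"target": "firstName", "source": "A"},
    {"target": "lastName", "source": "B"},
]
last_then_first = [
    {"target": "lastName", "source": "B"},
    {"target": "firstName", "source": "A"},
]
with_repeated_step = [
    {"target": "firstName", "source": "A"},
    {"target": "firstName", "source": "A"},
    {"target": "lastName", "source": "B"},
]
first_name_only = [{"target": "firstName", "source": "A"}]

aggregated = aggregate(
    [with_repeated_step, first_then_last, last_then_first, first_name_only]
)
```

The first three specifications fill in the same fields from the same sources and therefore collapse into a single representative, while `first_name_only` has a different effect and is kept separately.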

Once the script has been generated, it may be manually refined by an RPA developer, tested, and deployed into a production environment. The bot can be executed in *attended* or *unattended* settings. In attended settings, given an activation condition extracted from the routine specification, it can notify the user about its "readiness" to perform the routine when the condition is met and can be paused during execution, so that the user can make small corrections if needed and then resume the work. In unattended settings, the bot works independently without human involvement.

## **4 Outlook**

There are a number of research challenges that need to be overcome to realize the vision of RPM, particularly in the areas of candidate routine discovery, extraction of automatable routines, and aggregation of equivalent routines (cf. Fig. 3).

In the area of candidate routine identification (and the related area of UI log segmentation), existing techniques assume that the routine instances are strictly separated in the UI log, i.e. there is no interleaving of user interactions belonging to one instance of one routine, and user interactions belonging to another instance of the same or of another routine. In practice, such interleaving may occur, for example, when a user is multi-tasking and thus alternating their attention between multiple routines.

In the area of automatable routine discovery, existing techniques are based on data transformation discovery, and as such they are limited to data transfer routines, where the goal is to take data from one system and transfer them to another system. Furthermore, these techniques are limited in scope to discovering routines where one record in one application (e.g. one row of a spreadsheet) is copied into one or more fields of another application (e.g. a web form). In reality, a single routine may involve complex iterations. For example, a routine may involve copying an invoice containing multiple invoice line items from one application to another. In this case, the top-level routine (copying an invoice) contains a nested iterated sub-routine (copying multiple line items). These kinds of structures cannot be discovered via existing data transformation discovery techniques: they can discover that there is a routine consisting of copying an invoice line item, but they cannot reason holistically about the higher-level routine where the entire invoice is copied.

The area of routine aggregation is still a green field of research. A fundamental open problem in this space is the definition of notions of routine equivalence that would allow us to detect, for example, that a routine performed by one worker is the same as the one performed by another worker, even though these two workers perform the steps in their respective routines in completely different ways.

The RPM techniques discussed in this chapter focus on the discovery of routines that can be executed in an end-to-end manner by an RPA bot. This assumption is constraining. In reality, routines may be automated for a certain subset of cases, but not for all cases (i.e. automation may only be partially achievable). A key challenge, which goes beyond the scope of the proposed RPM pipeline, is how to discover partially deterministic routines. While a fully deterministic routine can be executed end-to-end in all cases, a partially deterministic routine can be stopped if the bot reaches a point where the routine cannot be deterministically continued given the input data and other data that the bot collects during the routine's execution. For example, while copying records of purchase orders from a spreadsheet or an enterprise system, a bot may detect that an order comes from China and stop because it does not know how to handle such orders. Similarly, a bot may find that a PO number is missing (the corresponding cell is empty), and hence it cannot proceed. Discovering conditions under which a routine cannot be deterministically continued (or started) is an open challenge in the field of RPM. Yet, this capability is a precondition to ensure that bots synthesized via RPM techniques can gracefully degrade and stop in order to hand off to human operators.
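The following minimal sketch shows what graceful degradation could look like once such stop conditions are known. It does not address the open problem itself: the conditions in `handoff_reason` are written by hand here, whereas the challenge described above is to discover them from UI logs. All names (`PurchaseOrder`, `KNOWN_COUNTRIES`, `run_bot`) and the country codes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: countries for which the routine has a known, deterministic handling.
KNOWN_COUNTRIES = {"DE", "ES", "AU"}

@dataclass
class PurchaseOrder:
    po_number: Optional[str]
    country: str

def handoff_reason(order):
    """Return why the bot must stop, or None if it can continue deterministically."""
    if not order.po_number:
        return "PO number is missing"
    if order.country not in KNOWN_COUNTRIES:
        return f"no known handling for country {order.country}"
    return None

def run_bot(orders):
    """Process the orders the bot can handle; queue the rest for a human operator."""
    copied, handed_off = [], []
    for order in orders:
        reason = handoff_reason(order)
        if reason is None:
            copied.append(order)  # stand-in for the actual copy actions
        else:
            handed_off.append((order, reason))
    return copied, handed_off

orders = [
    PurchaseOrder("PO-1", "DE"),
    PurchaseOrder(None, "ES"),    # empty PO cell
    PurchaseOrder("PO-3", "CN"),  # country without known handling
]
copied, handed_off = run_bot(orders)
```

Recording the reason alongside each handed-off order gives the human operator the context needed to take over the case.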

Finally, the vision of RPM presented in this chapter focuses on the problem of discovering automatable routines. Besides this problem, we envision that the field of RPM will encompass complementary problems and questions such as performance mining of RPA bots. This includes answering questions such as: "What is the success or defect rate of a bot when performing a given routine?", "What patterns are correlated with or are causal factors of bot failures?", and "Are there cases where the effects of a bot's actions are abnormal and warrant manual inspection?" In other words, over time, we envision that the scope of RPM will expand to cover the entire RPA lifecycle (cf. Fig. 1), rather than being purely focused on the development of RPA bots.

**Acknowledgments.** Work supported by the European Research Council (PIX project) and by the Australian Research Council (DP180102839).

## **References**



# **Closing**

# **Scaling Process Mining to Turn Insights into Actions**

Wil M. P. van der Aalst<sup>1(B)</sup> and Josep Carmona<sup>2</sup>

<sup>1</sup> Process and Data Science (PADS), RWTH Aachen University, Aachen, Germany, wvdaalst@pads.rwth-aachen.de

<sup>2</sup> Universitat Politècnica de Catalunya, Barcelona, Spain, jcarmona@cs.upc.edu

**Abstract.** This final chapter reflects on the current status of the process mining discipline and provides an outlook on upcoming developments and challenges. The broader adoption of process mining will be a gradual process. Process mining is already used for high-volume processes in large organizations, but over time process mining will also become the "new normal" for smaller organizations and processes with fewer cases. To get the highest return on investment, organizations need to "scale" their process mining activities. Also, from a research point of view, there are many exciting challenges. On the one hand, many of the original problems (e.g., discovering high-quality process models and scaling conformance checking) remain (partly) unsolved, still allowing for significant improvements. On the other hand, the large-scale use of process mining provides many research opportunities and generates novel scientific questions.

**Keywords:** Process mining *·* Execution management *·* Process management

## **1 Process Mining: Overview and Summary**

The chapters in this book illustrate the broadness of the process mining discipline. The interplay between data science and process science provides many challenges and opportunities [1]. In this book, we aim to provide a comprehensive overview. There are many dimensions to characterize the 16 earlier chapters.


**Fig. 1.** Process mining uses event data extracted from information systems to provide insights and transparency that, ultimately, should lead to process improvements (i.e., process redesign, improved control, and automation).

In the first chapter of this book [3], we started with Fig. 1 showing a 360-degree overview of process mining. The subsequent chapters focused on different parts of the pipeline depicted in Fig. 1. The initial chapters focused on *process discovery*, starting with creating a simple Directly-Follows Graph (DFG) followed by a range of alternative, more sophisticated, techniques. As shown, process discovery is an important topic, but also very difficult [1]. Event data do not contain negative examples and the positive examples typically only cover a fraction of all possible behaviors. Mixtures of choice, concurrency, and loops make process discovery a *notoriously difficult task* with many *trade-offs*. Also, process models may be used for different purposes.
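As a reminder of how simple the starting point is, the sketch below builds a Directly-Follows Graph from a toy event log. It assumes the log is already given as a list of traces, each a list of activity names. The activity names and the function name `directly_follows_graph` are illustrative only.

```python
from collections import Counter

def directly_follows_graph(log):
    """Count how often activity a is directly followed by activity b in any trace."""
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

log = [
    ["register", "check", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "pay"],
]
dfg = directly_follows_graph(log)

# An edge that was never observed has count 0. Because event data contain no
# negative examples, this does not prove that the behavior is forbidden.
never_seen = dfg[("pay", "check")]
```

The edge counts are only a summary of observed behavior, which is why more sophisticated discovery techniques are needed to handle choice, concurrency, and loops.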

After discovery, the focus shifted to *conformance checking* [1,5]. Here the input consists of both modeled and observed behavior. For example, a multiset of traces is compared with an accepting Petri net. Surprisingly, state-of-the-art conformance checking techniques tend to be more demanding than discovery techniques (from a computational point of view). Computing alignments corresponds to solving optimization problems that grow exponentially in the size of the model and the length of traces in the event log.
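To give a feel for the optimization problem, the sketch below computes the cost of an optimal alignment under strong simplifying assumptions: the model is represented by a finite, explicitly enumerated set of accepted traces (hypothetical, instead of an accepting Petri net), synchronous moves cost 0, and each log move or model move costs 1. Real alignment techniques search the state space of the model instead of enumerating its language, which is where the exponential growth comes from.

```python
from functools import lru_cache

def alignment_cost(trace, model_trace):
    """Minimal number of log moves and model moves needed to align two sequences."""
    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == len(trace):
            return len(model_trace) - j   # only model moves remain
        if j == len(model_trace):
            return len(trace) - i         # only log moves remain
        best = 1 + min(cost(i + 1, j), cost(i, j + 1))  # log move or model move
        if trace[i] == model_trace[j]:
            best = min(best, cost(i + 1, j + 1))        # synchronous move, cost 0
        return best
    return cost(0, 0)

def best_alignment_cost(trace, model_language):
    """Cost of the cheapest alignment against any trace accepted by the model."""
    return min(alignment_cost(trace, m) for m in model_language)

model_language = [
    ("register", "check", "pay"),
    ("register", "check", "reject"),
]
observed = ("register", "pay")  # the "check" activity was skipped
deviation = best_alignment_cost(observed, model_language)
```

In this example the cheapest alignment needs a single model move for the skipped `check` activity.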

Several chapters discussed the importance and complexity of data extraction and preprocessing. Later chapters focused on practical applications and more advanced topics such as model enhancement, streaming process mining, distributed process mining, and privacy-preserving process-mining techniques.

Figure 2 shows another overview of the building blocks of a successful process mining solution. The top of Fig. 2 shows examples of application domains where process mining can be used. In this book, we elaborated on applications in healthcare, auditing, sales, procurement, and IT services. However, process mining is a generic technology that can be used in any domain.

In the remainder of this concluding chapter, we take a step back and reflect on the developments in our discipline. Section 2 discusses the inevitability of process mining, but also stresses that concepts such as a Digital Twin of an Organization (DTO) are still far away from being a reality. Section 3 explains that it is important to scale process mining. Finally, Sect. 4 provides an outlook and lists research challenges.

**Fig. 2.** Process mining can be used in any application domain. However, it may be non-trivial to extract accurate event data and turn process mining results into actions. Change management and automation play a key role in realizing sustained improvements (as indicated by the two arcs closing the loop).

## **2 Process Mining as the New Normal**

Although process mining has proven its value in many organizations, it is not so easy to create a convincing *business case* to justify investments [1]. The reason is that process mining will most likely reveal performance and compliance problems, but this does not imply that these are automatically solved [8]. Financial and technical debts are well-known concepts. However, most organizations tend to ignore their *Operational Process Debts* (OPDs). OPDs cause operational friction, but are difficult to identify and address. Although process mining results are often surprising, they typically reveal OPDs that were known to some, but not addressed adequately. Making these OPDs visible and transparent helps to address them.

In [2], the first author coined the term *Process Hygiene* (PH) to stress that process mining should be the "new normal" not requiring a business case. Just like personal hygiene, one should not expect an immediate return on investment. We know that activities such as brushing our teeth, washing our hands after going to the toilet, and changing clothes are the right thing to do. The same applies to process mining activities, i.e., process hygiene serves a similar role as personal hygiene. People responsible for operational processes need to be willing to look at possible problems. Objectively monitoring and analyzing key processes is important for the overall health and well-being of an organization. Process mining helps to ensure process hygiene. Not using process mining reflects the inability or unwillingness to manage processes properly. Fortunately, an increasing number of organizations is aware of this.

Although process mining is slowly becoming the "new normal", most organizations will *not* be able to use the forward-looking forms of process mining. As long as the extraction of event data, process discovery, and conformance checking are challenging for an organization, it is unlikely that machine learning and other forward-looking techniques (including artificial intelligence and simulation) will be of help. Terms such as the *Digital Twin of an Organization* (DTO) illustrate the desire to autonomously manage, adapt, and improve processes. Gartner defines a DTO as "a dynamic software model of any organization that relies on operational and/or other data to understand how an organization operationalizes its business model, connects with its current state, responds to changes, deploys resources and delivers exceptional customer value". Creating a DTO can be seen as one of the grand challenges in information systems, just like autonomous driving in mobility. However, just like the development of self-driving cars, the process will be slow with many minor incremental improvements.

## **3 Scaling Process Mining**

One of the main conclusions in [6] is that process mining needs *scale* to be most cost effective. Organizations need to aim for the *continuous* usage of process mining for *many processes* by *many people*. Initially, process mining was primarily used in process improvement projects. In such projects, a problematic process is analyzed to provide recommendations for improvement. Since data extraction is often the most problematic step, such projects often struggle to get results quickly. Moreover, the "end product" of such a project is often just a report. To improve the process, change management and automation efforts are still needed. Therefore, traditional process mining projects struggle to realize a good Return on Investment (ROI).

**Fig. 3.** Scaling process mining to maximize the benefits.

Therefore, process mining should not be seen as a project, but as a *continuous company-wide* activity as shown in Fig. 3. There are several reasons for this.


Compare process mining for an organization to creating weather forecasts for a country. It does not make any sense to create a weather forecast for just one city on a particular day. Investments only make sense if one is able to create a weather forecast for any city on any day. Similarly, process mining is most effective when applied to many processes continuously.

**Fig. 4.** Turning insights into actions.

As part of scaling process mining, it is essential that insights are turned into concrete improvement actions. This is illustrated in Fig. 4. Process discovery and conformance checking can be seen as creating detailed X-ray images to detect problems and find root causes [1]. However, the value of an X-ray image is limited if it is not followed by interventions and treatment, e.g., surgery, chemotherapy, diet, and radiation therapy. Therefore, commercial process mining vendors are combining process mining with automation, e.g., Robotic Process Automation (RPA) and low-code automation platforms like Make.

## **4 Outlook**

How will the process mining discipline and market evolve? Most analysts expect the usage of process mining to grow exponentially in the coming years. Given the growing availability of event data and mature tools, there is no reason to doubt this. To predict the evolution of methods, techniques, and software capabilities, it is good to take another look at the *process mining manifesto* [7] written by the *IEEE Task Force on Process Mining* in 2011. The process mining manifesto lists the following eleven challenges.


There has been substantial progress in the areas covered by these challenges posed over a decade ago. For example, we now have comprehensive sets of publicly available benchmarks (C3) and we much better understand the different quality criteria (C6). Thanks to the over 40 commercial process mining tools, it is now much easier to apply process mining (C10) and understand the diagnostics (C11). Due to the many approaches combining process mining and machine learning, there has been major progress with respect to C8 and C9. Nevertheless, most of the challenges are still relevant and even basic problems such as process discovery and conformance checking have not been completely solved.

**Fig. 5.** Process mining challenges in focus in the next five years.

Figure 5 annotates the overview diagram with some of the most relevant challenges for the coming years. There is quite some overlap with the eleven challenges in [7]. For example, finding, extracting and transforming input data is still one of the main challenges when applying process mining in practice. Approaches such as object-centric process mining [3,4] try to make this easier by storing information about multiple objects in a consistent manner and allowing for process models that are not limited to a single case notion. Figure 5 also shows that there are still many open problems when it comes to basic capabilities such as process discovery and conformance checking. In addition, Figure 5 lists challenges that were not discussed in [7]. One example is the question of how to better combine algorithms and domain knowledge to create better process models (*user-guided discovery*) and suggest improvements. There is also an increased emphasis on using process mining results to automatically trigger improvements (*action-oriented process mining*).

We hope that this chapter and book will inspire both academics and practitioners to work on these important challenges. The process mining discipline is rapidly developing and there is still room for original and significant contributions.

**Acknowledgments.** Funded by the Alexander von Humboldt (AvH) Stiftung and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2023 Internet of Production – 390621612.

## **References**



## **Author Index**

- Accorsi, Rafael 212
- Augusto, Adriano 76
- Burattin, Andrea 349
- Carmona, Josep 76, 155, 495
- de Leoni, Massimiliano 243
- De Weerdt, Jochen 193
- Di Ciccio, Claudio 108
- Di Francescomarino, Chiara 320
- Dumas, Marlon 468
- Eulerich, Marc 445
- Fahland, Dirk 274
- Ghidini, Chiara 320
- Jans, Mieke 445
- Lebherz, Julian 212
- Leno, Volodymyr 468
- Maggi, Fabrizio Maria 468
- Mannhardt, Felix 373
- Martin, Niels 416
- Montali, Marco 108
- Munoz-Gama, Jorge 416
- Polyvyanyy, Artem 468
- Reinkemeyer, Lars 405
- Rosa, Marcello La 468
- van der Aalst, Wil M. P. 3, 37, 495
- van Dongen, Boudewijn 155
- Verbeek, Eric 76
- Weidlich, Matthias 155
- Wittig, Nils 416
- Wynn, Moe Thandar 193