**Dana Fisman Grigore Rosu (Eds.)**

# **Tools and Algorithms for the Construction and Analysis of Systems**

**28th International Conference, TACAS 2022 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022 Munich, Germany, April 2–7, 2022 Proceedings, Part II**

# Lecture Notes in Computer Science 13244

Founding Editors

Gerhard Goos, Germany
Juris Hartmanis, USA

# Editorial Board Members

Elisa Bertino, USA
Wen Gao, China
Bernhard Steffen, Germany
Gerhard Woeginger, Germany
Moti Yung, USA

# Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy
Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany
Benjamin C. Pierce, University of Pennsylvania, USA
Bernhard Steffen, University of Dortmund, Germany
Deng Xiaotie, Peking University, Beijing, China
Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this series at https://link.springer.com/bookseries/558

# Tools and Algorithms for the Construction and Analysis of Systems

28th International Conference, TACAS 2022 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022 Munich, Germany, April 2–7, 2022 Proceedings, Part II

Editors

Dana Fisman, Ben-Gurion University of the Negev, Be'er Sheva, Israel

Grigore Rosu, University of Illinois Urbana-Champaign, Urbana, IL, USA

ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-99526-3 ISBN 978-3-030-99527-0 (eBook)
https://doi.org/10.1007/978-3-030-99527-0

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

# ETAPS Foreword

Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital of Bavaria, in Germany.

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organizing these conferences in a coherent, highly synchronized conference program enables researchers to participate in an exciting event, with the opportunity to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops took place that attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University College London, UK, and Cornell University, USA) and Tomáš Vojnar (Brno University of Technology, Czech Republic) and the conference-specific invited speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck (University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated learning.

As this event was the 25th edition of ETAPS, part of the program was a special celebration where we looked back on the achievements of ETAPS and its constituting conferences in the past, but we also looked into the future, and discussed the challenges ahead for research in software science. This edition also reinstated the ETAPS mentoring workshop for PhD students.

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the Technical University of Munich (TUM) and the LMU Munich. The former was founded in 1868, and the latter in 1472 as the 6th oldest German university still running today. Together, they have 100,000 enrolled students, regularly rank among the top 100 universities worldwide (with TUM's computer-science department ranked #1 in the European Union), and their researchers and alumni include 60 Nobel laureates. The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer (general, financial, and workshop chair), Julia Eisentraut (organization chair), and Alexandros Evangelidis (local proceedings chair).

ETAPS 2022 was further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König (Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch Johnsen (Oslo), Dana Fisman (Be'er Sheva), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Roşu (Illinois), Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina (Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastián Uchitel (London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz).

I'd like to take this opportunity to thank all authors, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2022.

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their enormous efforts to make ETAPS a fantastic event.

February 2022

Marieke Huisman
ETAPS SC Chair
ETAPS e.V. President

# Preface

TACAS 2022 was the 28th edition of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems. TACAS 2022 was part of the 25th European Joint Conferences on Theory and Practice of Software (ETAPS 2022), which was held from April 2 to April 7 in Munich, Germany, as well as online due to the COVID-19 pandemic. TACAS is a forum for researchers, developers, and users interested in rigorous tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.

There were four submission categories for TACAS 2022:

1. Research papers
2. Case study papers
3. Regular tool papers
4. Tool demonstration papers

Papers in categories 1–3 were restricted to 16 pages, and papers in category 4 to six pages.

This year 159 papers were submitted to TACAS, consisting of 112 research papers, five case study papers, 33 regular tool papers, and nine tool demo papers. Authors were allowed to submit up to four papers. Each paper was reviewed by three Program Committee (PC) members, who made use of subreviewers. Similarly to previous years, it was possible to submit an artifact alongside a paper, which was mandatory for regular tool and tool demo papers.

An artifact might consist of a tool, models, proofs, or other data required for validation of the results of the paper. The Artifact Evaluation Committee (AEC) was tasked with reviewing the artifacts based on their documentation, ease of use, and, most importantly, whether the results presented in the corresponding paper could be accurately reproduced. Most of the evaluation was carried out using a standardized virtual machine to ensure consistency of the results, except for those artifacts that had special hardware or software requirements. The evaluation consisted of two rounds. The first round was carried out in parallel with the work of the PC. The judgment of the AEC was communicated to the PC and weighed in their discussion. The second round took place after paper acceptance notifications were sent out; authors of accepted research papers who did not submit an artifact in the first round could submit their artifact at this time. In total, 86 artifacts were submitted (79 in the first round and seven in the second) and evaluated by the AEC regarding their availability, functionality, and/or reusability. Papers with an artifact that was successfully evaluated include one or more badges on the first page, certifying the respective properties.

Selected authors were requested to provide a rebuttal for both papers and artifacts in case a review gave rise to questions. Using the review reports and rebuttals, the Program and the Artifact Evaluation Committees extensively discussed the papers and artifacts and ultimately decided to accept 33 research papers, one case study, 12 tool papers, and four tool demos.

This corresponds to an acceptance rate of 29.46% for research papers and an overall acceptance rate of 31.44%.

Besides the regular conference papers, this two-volume proceedings also contains 16 short papers that describe the participating verification systems and a competition report presenting the results of the 11th SV-COMP, the competition on automatic software verifiers for C and Java programs. These papers were reviewed by a separate Program Committee (PC); each of the papers was assessed by at least three reviewers. A total of 47 verification systems with developers from 11 countries entered the systematic comparative evaluation, including four submissions from industry. Two sessions in the TACAS program were reserved for the presentation of the results: (1) a summary by the competition chair and of the participating tools by the developer teams in the first session, and (2) an open community meeting in the second session.

We would like to thank all the people who helped to make TACAS 2022 successful. First, we would like to thank the authors for submitting their papers to TACAS 2022. The PC members and additional reviewers did a great job in reviewing papers: they contributed informed and detailed reports and engaged in the PC discussions. We also thank the steering committee, and especially its chair, Joost-Pieter Katoen, for his valuable advice. Lastly, we would like to thank the overall organization team of ETAPS 2022.

April 2022

Dana Fisman
Grigore Rosu
PC Chairs

Swen Jacobs
Andrew Reynolds
AEC, Tools, and Case-Study Chairs

Dirk Beyer
Competition Chair

# Organization

### Program Committee

Parosh Aziz Abdulla, Uppsala University, Sweden
Luca Aceto, Reykjavik University, Iceland
Timos Antonopoulos, Yale University, USA
Saddek Bensalem, Verimag, France
Dirk Beyer, LMU Munich, Germany
Nikolaj Bjorner, Microsoft, USA
Jasmin Blanchette, Vrije Universiteit Amsterdam, The Netherlands
Udi Boker, Interdisciplinary Center Herzliya, Israel
Hana Chockler, King's College London, UK
Rance Cleaveland, University of Maryland, USA
Alessandro Coglio, Kestrel Institute, USA
Pedro R. D'Argenio, Universidad Nacional de Córdoba, Argentina
Javier Esparza, Technical University of Munich, Germany
Bernd Finkbeiner, CISPA Helmholtz Center for Information Security, Germany
Dana Fisman (Chair), Ben-Gurion University, Israel
Martin Fränzle, University of Oldenburg, Germany
Felipe Gorostiaga, IMDEA Software Institute, Spain
Susanne Graf, Université Joseph Fourier, France
Radu Grosu, Stony Brook University, USA
Arie Gurfinkel, University of Waterloo, Canada
Klaus Havelund, Jet Propulsion Laboratory, USA
Holger Hermanns, Saarland University, Germany
Falk Howar, TU Clausthal / IPSSE, Germany
Swen Jacobs, CISPA Helmholtz Center for Information Security, Germany
Ranjit Jhala, University of California, San Diego, USA
Jan Kretinsky, Technical University of Munich, Germany
Viktor Kuncak, Ecole Polytechnique Fédérale de Lausanne, Switzerland
Kim Larsen, Aalborg University, Denmark
Konstantinos Mamouras, Rice University, USA
Daniel Neider, Max Planck Institute for Software Systems, Germany
Dejan Nickovic, AIT Austrian Institute of Technology, Austria
Corina Pasareanu, Carnegie Mellon University, NASA, and KBR, USA
Doron Peled, Bar Ilan University, Israel
Anna Philippou, University of Cyprus, Cyprus
Andrew Reynolds, University of Iowa, USA


# Artifact Evaluation Committee


Fabian Meyer, RWTH Aachen University, Germany
Stefanie Mohr, Technical University of Munich, Germany
Malte Mues, TU Dortmund, Germany
Yuki Nishida, Kyoto University, Japan
Philip Offtermatt, Université de Sherbrooke, Canada
Muhammad Osama, Eindhoven University of Technology, The Netherlands
Jiří Pavela, Brno University of Technology, Czech Republic
Adrien Pommellet, LRDE, France
Mathias Preiner, Stanford University, USA
José Proença, CISTER-ISEP and HASLab-INESC TEC, Portugal
Tim Quatmann, RWTH Aachen University, Germany
Etienne Renault, LRDE, France
Andrew Reynolds (Chair), University of Iowa, USA
Mouhammad Sakr, University of Luxembourg, Luxembourg
Morten Konggaard Schou, Aalborg University, Denmark
Philipp Schlehuber-Caissier, LRDE, France
Hans-Jörg Schurr, Inria Nancy - Grand Est, France
Michael Schwarz, Technische Universität München, Germany
Joseph Scott, University of Waterloo, Canada
Ali Shamakhi, Tehran Institute for Advanced Studies, Iran
Lei Shi, University of Pennsylvania, USA
Matthew Sotoudeh, University of California, Davis, USA
Jip Spel, RWTH Aachen University, Germany
Veronika Šoková, Brno University of Technology, Czech Republic

# Program Committee and Jury — SV-COMP

Fatimah Aljaafari, University of Manchester, UK
Lei Bu, Nanjing University, China
Thomas Bunk, LMU Munich, Germany
Marek Chalupa, Masaryk University, Czech Republic
Priyanka Darke, Tata Consultancy Services, India
Daniel Dietsch, University of Freiburg, Germany
Gidon Ernst, LMU Munich, Germany
Fei He, Tsinghua University, China
Matthias Heizmann, University of Freiburg, Germany
Jera Hensel, RWTH Aachen University, Germany
Falk Howar, TU Dortmund, Germany
Soha Hussein, University of Minnesota, USA
Dominik Klumpp, University of Freiburg, Germany
Henrich Lauko, Masaryk University, Czech Republic
Will Leeson, University of Virginia, USA
Xie Li, Chinese Academy of Sciences, China
Viktor Malík, Brno University of Technology, Czech Republic
Raveendra Kumar Medicherla, Tata Consultancy Services, India


# Steering Committee


# Additional Reviewers

Abraham, Erika Aguilar, Edgar Akshay, S. Asadi, Sepideh Attard, Duncan Avni, Guy Azeem, Muqsit Bacci, Giorgio Balasubramanian, A. R. Barbanera, Franco Bard, Joachim Basset, Nicolas Bendík, Jaroslav Berani Abdelwahab, Erzana Beutner, Raven Bhandary, Shrajan Biewer, Sebastian

Blicha, Martin Brandstätter, Andreas Bright, Curtis Britikov, Konstantin Brunnbauer, Axel Capretto, Margarita Castiglioni, Valentina Castro, Pablo Ceska, Milan Chadha, Rohit Chalupa, Marek Changshun, Wu Chen, Xiaohong Cruciani, Emilio Dahmen, Sander Dang, Thao Danielsson, Luis Miguel Degiovanni, Renzo Dell'Erba, Daniele Demasi, Ramiro Desharnais, Martin Dierl, Simon Dubslaff, Clemens Egolf, Derek Evangelidis, Alexandros Fedyukovich, Grigory Fiedor, Jan Fitzpatrick, Stephen Fleury, Mathias Frenkel, Hadar Gamboa Guzman, Laura P. Garcia-Contreras, Isabel Gianola, Alessandro Goorden, Martijn Gorostiaga, Felipe Gorrieri, Roberto Grahn, Samuel Grastien, Alban Grover, Kush Grünbacher, Sophie Guha, Shibashis Gutiérrez Brida, Simón Emmanuel Havlena, Vojtěch He, Jie Helfrich, Martin Henkel, Elisabeth Hicks, Michael Hirschkoff, Daniel Hofmann, Jana Hojjat, Hossein Holík, Lukáš Hospodár, Michal Huang, Chao Hyvärinen, Antti Inverso, Omar Itzhaky, Shachar Jaksic, Stefan Jansen, David N. Jin, Xiangyu Jonas, Martin Kanav, Sudeep Karra, Shyam Lal Katsaros, Panagiotis

Kempa, Brian Klauck, Michaela Kreitz, Christoph Kröger, Paul Köhl, Maximilian Alexander König, Barbara Lahijanian, Morteza Larraz, Daniel Le, Nham Lemberger, Thomas Lengal, Ondrej Li, Chunxiao Li, Jianlin Lorber, Florian Lung, David Luppen, Zachary Lybech, Stian Major, Juraj Manganini, Giorgio McCarthy, Eric Mediouni, Braham Lotfi Meggendorfer, Tobias Meira-Goes, Romulo Melcer, Daniel Metzger, Niklas Milovancevic, Dragana Mohr, Stefanie Najib, Muhammad Noetzli, Andres Nouri, Ayoub Offtermatt, Philip Otoni, Rodrigo Paoletti, Nicola Parizek, Pavel Parker, Dave Parys, Paweł Passing, Noemi Perez Dominguez, Ivan Perez, Guillermo Pinna, G. Michele Pous, Damien Priya, Siddharth Putruele, Luciano Pérez, Jorge A. Qu, Meixun Raskin, Mikhail

Rauh, Andreas Reger, Giles Reynouard, Raphaël Riener, Heinz Rogalewicz, Adam Roy, Rajarshi Ruemmer, Philipp Ruijters, Enno Schilling, Christian Schmitt, Frederik Schneider, Tibor Scholl, Christoph Schultz, William Schupp, Stefan Schurr, Hans-Jörg Schwammberger, Maike Shafiei, Nastaran Siber, Julian Sickert, Salomon Singh, Gagandeep Smith, Douglas Somenzi, Fabio

Stewing, Richard Stock, Gregory Su, Yusen Tang, Qiyi Tibo, Alessandro Trefler, Richard Trtík, Marek Turrini, Andrea Vaezipoor, Pashootan van Dijk, Tom Vašíček, Ondřej Vediramana Krishnan, Hari Govind Wang, Wenxi Wendler, Philipp Westfold, Stephen Winter, Stefan Wolovick, Nicolás Yakusheva, Sophia Yang, Pengfei Zeljić, Aleksandar Zhou, Yuhao Zimmermann, Martin

# Contents – Part II

#### Probabilistic Systems





# Contents – Part I

#### Synthesis


David Dill, Wolfgang Grieskamp, Junkil Park, Shaz Qadeer, Meng Xu, and Emma Zhong


#### Grammatical Inference


#### Verification Inference





### Constraint Solving


#### Model Checking and Verification


# Probabilistic Systems

# A Probabilistic Logic for Verifying Continuous-time Markov Chains

Ji Guan<sup>1</sup> and Nengkun Yu<sup>2</sup>

<sup>1</sup> State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China. guanji1992@gmail.com

<sup>2</sup> Centre for Quantum Software and Information, University of Technology Sydney, Sydney, Australia. nengkunyu@gmail.com

Abstract. A continuous-time Markov chain (CTMC) execution is a continuous class of probability distributions over states. This paper proposes a probabilistic linear-time temporal logic, namely continuous-time linear logic (CLL), to reason about the probability distribution execution of CTMCs. We define the syntax of CLL on the space of probability distributions. The syntax of CLL includes multiphase timed until formulas, and the semantics of CLL allows time reset to study relative temporal properties. We derive a corresponding model-checking algorithm for CLL formulas. The correctness of the model-checking algorithm depends on Schanuel's conjecture, a central open problem in transcendental number theory. Furthermore, we provide a running example of CTMCs to illustrate our method.

# 1 Introduction

As a popular model of probabilistic continuous-time systems, continuous-time Markov chains (CTMCs) have been extensively studied since Kolmogorov [25]. In the last 20 years, probabilistic continuous-time model checking has received much attention. Adapting probabilistic computational tree logic (PCTL) [22] to this context with extra multiphase timed until formulas $\Phi_1 U^{T_1} \Phi_2 \cdots U^{T_K} \Phi_{K+1}$, for state formulas $\Phi$ and time intervals $T$, Aziz et al. proposed continuous stochastic logic (CSL) to specify the branching-time properties of CTMCs and showed that the model-checking problem for CSL is decidable [8]. After that, efficient model-checking algorithms were developed by transient analysis of CTMCs using uniformization [9] and stratification [41] for a restricted version (path formulas are restricted to single until formulas $\Phi_1 U^{I} \Phi_2$) and the full version of CSL, respectively. These algorithms have been practically implemented in the model checkers PRISM [26], MRMC [24], and STORM [18]. Further details can be found in an excellent survey [23].

There are also different ways to specify the linear-time properties of CTMCs. Timed automata were first used for this task [11,13,14,15,19], and then metric temporal logic (MTL) [12] was also considered in this context. Consequently, the probability of "the system being in state $s_0$ within five time units after having continuously remained in state $s_1$" can be computed. However, some statements cannot be specified and verified because of the lack of a probabilistic linear-time temporal logic, for instance "the system being in state $s_0$ with high probability ($\geq 0.9$) within five time units after having continuously remained in state $s_1$ with low probability ($\leq 0.1$)". Furthermore, this probabilistic property cannot be expressed by CSL, because CSL cannot express properties that are defined across several state transitions of the same time length in the execution of a CTMC.

In this paper, aiming to express the above probabilistic linear-time properties, we introduce continuous-time linear logic (CLL). In particular, we adopt the viewpoint of [2] by regarding CTMCs as transformers of probability distributions over states. CLL studies the properties of the probability distribution execution generated by a given initial probability distribution over time. By the fundamental difference between the views of state executions and probability distribution executions of CTMCs, CLL and CSL are incomparable and complementary, analogous to the relation between probabilistic linear-time temporal logic (PLTL) and PCTL in model checking discrete-time Markov chains [2, Section 3.3].
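The distribution-transformer view can be made concrete with a small numerical sketch: for a generator matrix $Q$, the distribution at time $t$ is $\mu_t = \mu_0 e^{Qt}$. The 2-state chain, its rates, and the initial distribution below are purely illustrative, and the matrix exponential is approximated by a truncated Taylor series rather than the uniformization used by practical model checkers.

```python
# Sketch of the distribution-transformer view of a CTMC (hypothetical
# 2-state chain): the distribution at time t is mu_t = mu_0 * exp(Q*t),
# where Q is the generator matrix (rows sum to 0). The matrix exponential
# is approximated here by a truncated Taylor series.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(Q, t, terms=60):
    """Approximate exp(Q*t) by the Taylor series sum_k (Q*t)^k / k!."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

Q = [[-3.0, 3.0],      # leave s0 at rate 3
     [2.0, -2.0]]      # leave s1 at rate 2
mu0 = [1.0, 0.0]       # start surely in s0
E = mat_exp(Q, 1.0)
mu_t = [sum(mu0[i] * E[i][j] for i in range(2)) for j in range(2)]
print(mu_t)  # converges toward the stationary distribution (0.4, 0.6)
```

Tracking `mu_t` over a range of times yields exactly the "probability distribution execution" that CLL formulas are evaluated on.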

The atomic propositions of CLL are interpreted on the space of probability distributions over states of CTMCs. We apply the method of symbolic dynamics to the probability distributions of CTMCs. To be specific, we symbolize the probability value space $[0, 1]$ into a finite set of intervals $\mathcal{I} = \{I_k \subseteq [0,1]\}_{k=1}^{m}$. A probability distribution $\mu$ over the set of states $S = \{s_0, s_1, \ldots, s_{d-1}\}$ is then represented symbolically as a set of symbols

$$\mathbb{S}(\mu) = \{ \langle s, I \rangle \in S \times \mathcal{I} \,:\, \mu(s) \in I \}$$

where each symbol $\langle s, I \rangle$ asserts $\mu(s) \in I$, i.e., the probability of state $s$ in distribution $\mu$ falls in interval $I$. For example, $\langle s_0, [0.9, 1] \rangle$ means the system is in state $s_0$ with a probability between 0.9 and 1. The symbolization idea for distributions has been considered in [2], by choosing a disjoint cover of $[0, 1]$:

$$\mathcal{I} = \{ [0, p_1), [p_1, p_2), \dots, [p_n, 1] \}.$$

Here, we remove this restriction and enrich the expressiveness of $\mathcal{I}$. A crucial fact about this symbolization is that the set $S \times \mathcal{I}$ is finite. Consequently, the (probability distribution) execution path generated by an initial probability distribution $\mu$ induces a sequence of symbols in $S \times \mathcal{I}$ over time. Therefore, the dynamics of CTMCs can be studied in terms of a (real-time) language over the alphabet $S \times \mathcal{I}$, which is the set of atomic propositions of CLL.
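The symbolization $\mathbb{S}(\mu)$ can be sketched directly from its definition. The states, intervals, and distribution below are illustrative; the paper's intervals may be open, half-open, or closed, while this sketch treats them all as closed for simplicity.

```python
# Sketch: computing the symbolic representation S(mu) of a distribution mu,
# i.e., the set of pairs <s, I> with mu(s) in I. States, intervals, and the
# distribution are illustrative; intervals are treated as closed.

states = ["s0", "s1", "s2"]
intervals = [(0.0, 0.1), (0.1, 0.9), (0.9, 1.0)]  # finite symbol set I

def symbolize(mu):
    """Return S(mu) = { <s, I> in S x I : mu(s) in I }."""
    return {(s, I) for s in states for I in intervals
            if I[0] <= mu[s] <= I[1]}

mu = {"s0": 0.95, "s1": 0.05, "s2": 0.0}
print(sorted(symbolize(mu)))
# <s0, [0.9, 1]> asserts "s0 with high probability";
# <s1, [0, 0.1]> asserts "s1 with low probability".
```

Since the intervals above are not required to be disjoint, a single $\mu(s)$ may fall into several intervals, reflecting the paper's relaxation of the disjoint-cover restriction of [2].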

Unlike the non-probabilistic linear-time temporal logics LTL and MTL, CLL has two types of formulas: state formulas and path formulas. The state formulas are constructed using propositional connectives. The path formulas are obtained by propositional connectives and a temporal modal operator, timed until $U^T$ for a bounded time interval $T$, as in MTL and CSL. The standard next-step temporal operator of LTL is meaningless in continuous-time systems, since the time domain (the real numbers) is uncountable. As a result, CLL can express the above-mentioned probabilistic property "the system is in state $s_0$ with high probability ($\geq 0.9$) within 5 time units after having continuously remained in state $s_1$ with low probability ($\leq 0.1$)" as a path formula:

$$
\varphi = \langle s_1, [0, 0.1] \rangle \, U^{[0,5]} \, \langle s_0, [0.9, 1] \rangle.
$$

In this single until formula, there is a time instant $0 \leq t \leq 5$ at which state $s_1$ with low probability transitions to state $s_0$ with high probability. We illustrate this on the following timeline.

(Timeline: $\langle s_1, [0, 0.1] \rangle$ holds from time 0 until some instant $t \leq 5$, at which $\langle s_0, [0.9, 1] \rangle$ holds.)

Furthermore, CLL allows multiphase timed until formulas. The semantics of these formulas uses relative time intervals, i.e., time can be reset as in timed automata [5,6], while those of CSL [8] use absolute time intervals. Consequently, CLL can express not only relative but also absolute temporal properties of CTMCs.

We illustrate the significant difference between relative and absolute temporal properties of CTMCs. For instance, "before the probability distribution transition $\varphi$ happens within 3 to 7 time units, the system always stays in state $s_0$ with a high probability ($\geq 0.9$)" can be formalized in the path formula

$$
\varphi' = \langle s_0, [0.9, 1] \rangle \, U^{[3,7]} \, (\langle s_1, [0, 0.1] \rangle \, U^{[0,5]} \, \langle s_0, [0.9, 1] \rangle).
$$

As we can see, there are two time instants, namely $t_1$ and $t_2$, at which distribution transitions happen. Time is reset to 0 after the first distribution transition happens, and thus $t_2$ is relative to $t_1$. More clearly, we depict this on the following timeline.

(Timeline: $\langle s_0, [0.9, 1] \rangle$ holds from time 0 until $t_1$, with $3 \leq t_1 \leq 7$; time is then reset, and $\langle s_1, [0, 0.1] \rangle$ holds until the relative instant $t_2 \leq 5$, i.e., $t_1 + t_2 \leq 12$ in absolute time, at which $\langle s_0, [0.9, 1] \rangle$ holds.)

An absolute version is "the probability distribution transition $\varphi$ happens, and the system always stays in state $s_0$ with a high probability ($\geq 0.9$) in 3 to 7 time units":

$$
\varphi'' = \Box^{[3,7]} \langle s_0, [0.9, 1] \rangle \wedge (\langle s_1, [0, 0.1] \rangle \, U^{[0,5]} \, \langle s_0, [0.9, 1] \rangle).
$$

We can get a clear timeline representation by simply adding $\Box^{[3,7]} \langle s_0, [0.9, 1] \rangle$ to that of $\varphi$. Assume that $t < 3$:

(Timeline: $\langle s_1, [0, 0.1] \rangle$ holds from time 0 until $t < 3$, at which $\langle s_0, [0.9, 1] \rangle$ holds; in addition, $\langle s_0, [0.9, 1] \rangle$ holds throughout the absolute interval $[3, 7]$.)

Time reset enriches the expressiveness of CLL, but it makes model checking CLL more difficult than model checking CSL. We overcome this by translating relative time to absolute time. As a result, we develop an algorithm to model check CTMCs against CLL formulas. More precisely, we reduce the model-checking problem to a reachability problem over absolute time intervals. The reachability problem corresponds to the real root isolation problem for real polynomial-exponential functions (PEFs) over the field of algebraic numbers, an extensively studied question in the symbolic and algebraic computation community (e.g., [1,20,28]). By developing a state-of-the-art real root isolation algorithm, we resolve the latter problem under the assumption of the validity of Schanuel's conjecture, a central open question in transcendental number theory [27]. This conjecture has also been the cornerstone of the correctness of many recent model-checking algorithms, including the decidability of continuous-time Markov decision processes [30], the synthesis of inductive invariants for continuous linear dynamical systems [4], termination analysis for probabilistic programs with delays [39], and reachability analysis for dynamical systems [20].
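As rough intuition for the reachability step (this is emphatically not the paper's algorithm, which works symbolically and exactly over algebraic numbers): the times at which an atomic proposition changes truth value are real roots of a polynomial-exponential function, and root isolation finds small intervals each containing exactly one such root. The function `f` below is a hypothetical PEF, and the sketch isolates its sign changes purely numerically by scanning and bisection.

```python
import math

# Numerical sketch of real root isolation for a polynomial-exponential
# function (PEF) on a bounded interval. NOTE: the paper's algorithm is
# symbolic and exact (over algebraic numbers, assuming Schanuel's
# conjecture); this sketch only scans for sign changes and bisects.

def f(t):
    # hypothetical PEF: f(t) = t*e^{-t} - 0.2, with two real roots on [0, 10]
    return t * math.exp(-t) - 0.2

def isolate_roots(f, lo, hi, steps=1000, tol=1e-10):
    """Return small disjoint intervals, each containing one sign change of f."""
    roots, step = [], (hi - lo) / steps
    for i in range(steps):
        a, b = lo + i * step, lo + (i + 1) * step
        if f(a) * f(b) < 0:              # sign change => a root in (a, b)
            x, y = a, b
            while y - x > tol:           # refine by bisection
                m = (x + y) / 2
                if f(x) * f(m) <= 0:
                    y = m
                else:
                    x = m
            roots.append((x, y))
    return roots

print(isolate_roots(f, 0.0, 10.0))  # two isolating intervals, near 0.26 and 2.54
```

A purely numerical scan can miss roots between grid points or double roots without a sign change, which is exactly why the paper needs an exact symbolic treatment for correctness.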

In summary, the main contributions of this paper are as follows.


Organization of this paper. In the next section, we give the mathematical preliminaries used in this paper. In Section 3, we recall the view of CTMCs as distribution transformers. After that, the symbolic dynamics of CTMCs are introduced by symbolizing distributions over states of CTMCs in Section 4. In the subsequent section, we present our continuous-time probabilistic temporal logic CLL. In Section 6, we develop an algorithm to solve the CLL model-checking problem. A case study and related work are presented in Sections 7 and 8, respectively. We summarize our results and point out future research directions in the final section.

#### 2 Preliminaries

For the convenience of the reader, we review basic definitions and notations of number theory, particularly Schanuel's conjecture.

Throughout this paper, we write $\mathbb{C}$, $\mathbb{R}$, $\mathbb{Q}$ and $\mathbb{A}$ for the fields of all complex, real, rational and algebraic numbers, respectively. In addition, $\mathbb{Z}$ denotes the set of all integers. For $\mathbb{F} \in \{\mathbb{C}, \mathbb{R}, \mathbb{Q}, \mathbb{Z}, \mathbb{A}\}$, we use $\mathbb{F}[t]$ and $\mathbb{F}^{n \times m}$ to denote the set of polynomials in $t$ with coefficients in $\mathbb{F}$ and the set of $n$-by-$m$ matrices with every entry in $\mathbb{F}$, respectively. Furthermore, for $\mathbb{F} \in \{\mathbb{R}, \mathbb{Q}, \mathbb{Z}\}$, we use $\mathbb{F}^+$ to denote the set of nonnegative elements (i.e., including 0) of $\mathbb{F}$.

A bounded (time) interval $T$ is a subset of $\mathbb{R}^+$, which may be open, half-open or closed, with one of the following forms:

$$[t_1, t_2], \quad [t_1, t_2), \quad (t_1, t_2], \quad (t_1, t_2),$$

where $t_1, t_2 \in \mathbb{R}^+$ and $t_2 \geq t_1$ ($t_1 = t_2$ is only allowed in the case of $[t_1, t_2]$). Here, $t_1$ and $t_2$ are called the left and right endpoints of $T$, respectively. Conveniently, we use $\inf T$ and $\sup T$ to denote $t_1$ and $t_2$, respectively. In this paper, we only consider bounded intervals.

For reasoning about temporal properties, we further define the addition and subtraction of (time) intervals. The expression $\mathcal{T} + t$ or $t + \mathcal{T}$, for $t \in \mathbb{R}^{+}$, denotes the interval $\{t + t' : t' \in \mathcal{T}\}$. Similarly, $\mathcal{T} - t$ stands for the interval $\{-t + t' : t' \in \mathcal{T}\}$ if $t \leq \inf \mathcal{T}$. Furthermore, for two intervals $\mathcal{T}_1$ and $\mathcal{T}_2$,

$$
\mathcal{T}\_1 + \mathcal{T}\_2 = \{ t \in (t' + \mathcal{T}\_2) : t' \in \mathcal{T}\_1 \} = \{ t\_1 + t\_2 : t\_1 \in \mathcal{T}\_1 \text{ and } t\_2 \in \mathcal{T}\_2 \}.
$$

Two intervals $\mathcal{T}_1$ and $\mathcal{T}_2$ are disjoint if their intersection is empty, i.e., $\mathcal{T}_1 \cap \mathcal{T}_2 = \emptyset$. Some concrete examples: $1 + (2, 3) = (3, 4)$, $(2, 3) - 1 = (1, 2)$, $(2, 3) + [3, 4] = (5, 7)$, and $(2, 3)$ and $[3, 4]$ are disjoint. All such interval calculations are easy to compute.
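The interval calculations above can be sketched directly. The `(lo, hi, lo_open, hi_open)` encoding below is a hypothetical representation for illustration, not notation from the paper.

```python
# Hypothetical encoding of a bounded time interval as (lo, hi, lo_open, hi_open).
def shift(T, t):
    """T + t; for negative t this is T - (-t), assuming -t <= inf T."""
    lo, hi, lo_open, hi_open = T
    return (lo + t, hi + t, lo_open, hi_open)

def add(T1, T2):
    """T1 + T2 = {t1 + t2 : t1 in T1, t2 in T2}; an endpoint of the sum is
    open iff either contributing endpoint is open."""
    l1, h1, o1, p1 = T1
    l2, h2, o2, p2 = T2
    return (l1 + l2, h1 + h2, o1 or o2, p1 or p2)

def contains(T, p):
    lo, hi, lo_open, hi_open = T
    return ((lo < p or (lo == p and not lo_open))
            and (p < hi or (p == hi and not hi_open)))

def disjoint(T1, T2):
    """True iff T1 and T2 have an empty intersection."""
    lo, hi = max(T1[0], T2[0]), min(T1[1], T2[1])
    if lo > hi:
        return True
    if lo < hi:
        return False
    # The intersection is at most the single point lo == hi.
    return not (contains(T1, lo) and contains(T2, lo))

# The examples from the text:
assert shift((2, 3, True, True), 1) == (3, 4, True, True)          # 1 + (2,3)
assert shift((2, 3, True, True), -1) == (1, 2, True, True)         # (2,3) - 1
assert add((2, 3, True, True), (3, 4, False, False)) == (5, 7, True, True)
assert disjoint((2, 3, True, True), (3, 4, False, False))          # (2,3), [3,4]
```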

An algebraic number is a complex number that is a root of a non-zero polynomial in one variable with rational coefficients (or, equivalently, integer coefficients, by clearing denominators). An algebraic number α is represented by a triple $(P, (a, b), \varepsilon)$, where $P$ is the minimal polynomial of α, $a, b \in \mathbb{Q}$, and $a + bi$ is an approximation of α such that $|\alpha - (a + bi)| < \varepsilon$ and α is the only root of $P$ in the open ball $B(a + bi, \varepsilon)$. The minimal polynomial of α is the polynomial of smallest degree in $\mathbb{Q}[t]$ that has α as a root and whose highest-degree coefficient is 1. Any root of $f(t) \in \mathbb{A}[t]$ is algebraic. Moreover, given the representations of $a, b \in \mathbb{A}$, the representations of $a \pm b$, $\frac{a}{b}$ and $a \cdot b$ can be computed in polynomial time, as can equality checking [17].
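As an illustration of the representation $(P, (a, b), \varepsilon)$, the sketch below refines a rational approximation of the real algebraic number $\sqrt{2}$ inside an isolating interval by bisection. The routine `refine` is hypothetical and only covers real roots across which the polynomial changes sign (for real α, the representation has $b = 0$).

```python
def refine(P, lo, hi, eps):
    """Shrink [lo, hi], assumed to contain exactly one root of P across which
    P changes sign, until its width is below eps; return the midpoint."""
    assert P(lo) * P(hi) < 0
    while hi - lo >= eps:
        mid = (lo + hi) / 2
        if P(mid) == 0:
            return mid
        if P(lo) * P(mid) < 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# sqrt(2) has minimal polynomial P(t) = t^2 - 2; its representation is
# (t^2 - 2, (a, 0), eps) with |sqrt(2) - a| < eps.
P = lambda t: t * t - 2
a = refine(P, 1, 2, 1e-9)
assert abs(a - 2 ** 0.5) < 1e-9
```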

Furthermore, a complex number is called transcendental if it is not an algebraic number. In general, it is challenging to verify relationships between transcendental numbers [33]. On the other hand, one can use the Lindemann-Weierstrass theorem to compare some transcendental numbers. The transcendence of e and π are direct corollaries of this theorem.

Theorem 1 (Lindemann-Weierstrass theorem). Let $\eta_1, \cdots, \eta_n$ be pairwise distinct algebraic complex numbers. Then $\sum_k \lambda_k e^{\eta_k} \neq 0$ for non-zero algebraic numbers $\lambda_1, \cdots, \lambda_n$.

The following concepts are introduced to study the general relation between transcendental numbers.

Definition 1 (Algebraic independence). A set of complex numbers $S = \{a_1, \cdots, a_n\}$ is algebraically independent over Q if the elements of $S$ do not satisfy any nontrivial (non-constant) polynomial equation with coefficients in Q.

By the above definition, for any transcendental number $u$, $\{u\}$ is algebraically independent over Q, while $\{a\}$ is not for any algebraic number $a \in \mathbb{A}$. Thus, a set of complex numbers that is algebraically independent over Q must consist of transcendental numbers. Furthermore, $\{\pi, e^{\pi\sqrt{n}}\}$ is algebraically independent over Q for any positive integer $n$ [31]. Checking algebraic independence is challenging; for example, it is still wide open whether $\{e, \pi\}$ is algebraically independent over Q.

Definition 2 (Extension field). Given two fields E ⊆ F, F is an extension field of E, denoted by F/E, if the operations of E are those of F restricted to E.

For example, under the usual notions of addition and multiplication, the field of complex numbers is an extension field of the real numbers.

Definition 3 (Transcendence degree). Let L be an extension field of Q. The transcendence degree of L over Q is defined as the largest cardinality of a subset of L that is algebraically independent over Q.

For instance, let $\mathbb{Q}(e)/\mathbb{Q} = \{a + be \mid a, b \in \mathbb{Q}\}$ and $\mathbb{Q}(\sqrt{2})/\mathbb{Q} = \{a + b\sqrt{2} \mid a, b \in \mathbb{Q}\}$ be two extension fields of Q. Their transcendence degrees over Q are 1 and 0, respectively, noting that $e$ is a transcendental number and $\sqrt{2}$ is an algebraic number.

Now, Schanuel's conjecture is ready to be presented.

Conjecture 1 (Schanuel's conjecture). Given any complex numbers $z_1, \cdots, z_n$ that are linearly independent over Q, the extension field $\mathbb{Q}(z_1, \ldots, z_n, e^{z_1}, \ldots, e^{z_n})$ has transcendence degree at least $n$ over Q.

Stephen Schanuel proposed this conjecture during a course given by Serge Lang at Columbia in the 1960s [27]. Schanuel's conjecture concerns the transcendence degree of certain field extensions of the rational numbers. The conjecture, if proven, would significantly generalize the best-known results in transcendental number theory [29,37]. For example, the algebraic independence of $\{e, \pi\}$ would follow by setting $z_1 = 1$ and $z_2 = \pi i$, and using Euler's identity $e^{\pi i} + 1 = 0$.
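Spelled out, the implication for $\{e, \pi\}$ reads as follows (a sketch; the steps are stated in the comments):

```latex
% z_1 = 1 and z_2 = \pi i are linearly independent over Q, since \pi i is not rational,
% so Schanuel's conjecture applies and yields:
\operatorname{trdeg}_{\mathbb{Q}}\, \mathbb{Q}\bigl(1,\ \pi i,\ e^{1},\ e^{\pi i}\bigr) \;\geq\; 2.
% By Euler's identity e^{\pi i} = -1, this field is just Q(\pi i, e); hence
% {\pi i, e}, and therefore {e, \pi}, is algebraically independent over Q.
```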

# 3 Continuous-time Markov Chains as Distributions Transformers

We begin with the definition of continuous-time Markov chains (CTMCs). A CTMC is a Markovian (memoryless) stochastic process that takes values in a finite state set $S$ ($|S| = d < \infty$) and evolves in continuous time $t \in \mathbb{R}^{+}$. Formally,

Definition 4. A CTMC is a pair $M = (S, Q)$, where $S$ ($|S| = d$) is a finite state set and $Q \in \mathbb{Q}^{d \times d}$ is a transition rate matrix.

A transition rate matrix $Q$ is a matrix whose off-diagonal entries $\{Q_{i,j}\}_{i \neq j}$ are non-negative rational numbers, representing the transition rate from state $i$ to state $j$, while the diagonal entries are constrained to be $Q_{j,j} = -\sum_{i \neq j} Q_{i,j}$ for all $1 \leq j \leq d$. Consequently, every column of $Q$ sums to zero.

The evolution of a CTMC can be regarded as a distribution transformer. Given an initial distribution $\mu \in \mathbb{Q}^{d \times 1} \cap \mathcal{D}(S)$, the distribution at time $t \in \mathbb{R}^{+}$ is:

$$
\mu\_t = e^{Qt} \mu,
$$

where $\mathcal{D}(S)$ denotes the set of all probability distributions over $S$. We call $\mathcal{D}(S)$ the probability distribution space of CTMCs. An execution path of a CTMC is a continuous function indexed by the initial distribution $\mu \in \mathcal{D}(S)$:

$$
\sigma\_{\mu} \colon \mathbb{R}^+ \to \mathcal{D}(\mathcal{S}), \qquad \sigma\_{\mu}(t) = e^{Qt}\mu. \tag{1}
$$

Example 1. We recall the illustrative CTMC $M = (S, Q)$ of [8, Figure 1] as the running example of our work. In particular, $M$ is a 5-dimensional CTMC with $S = \{s_0, s_1, s_2, s_3, s_4\}$; the transition rate matrix $Q$ is the one given in [8, Figure 1], and the initial distribution is $\mu = (0.1, 0.2, 0.3, 0.4, 0)^T$ (used again in Section 7).
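The distribution-transformer view can be sketched numerically. The rate matrix below is recovered from the Jordan decomposition $Q = SJS^{-1}$ given in Section 7 (an assumption of this sketch, since the example's matrix itself is cited from [8]); `expm_times` is a plain truncated Taylor series, not a production-grade matrix exponential.

```python
# Sketch: mu_t = e^{Qt} mu computed via the truncated Taylor series
# sum_k (Qt)^k mu / k!, adequate here because ||Qt|| is small.
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def expm_times(Q, t, mu, terms=80):
    result, term = list(mu), list(mu)
    for k in range(1, terms):
        term = [x * t / k for x in mat_vec(Q, term)]
        result = [r + x for r, x in zip(result, term)]
    return result

Q = [[-3, 0, 0, 0, 0],     # off-diagonal entries non-negative,
     [1, 0, 0, 0, 0],      # every column sums to zero
     [2, 0, -7, 0, 0],
     [0, 0, 3, 0, 0],
     [0, 0, 4, 0, 0]]
mu = [0.1, 0.2, 0.3, 0.4, 0.0]

mu_1 = expm_times(Q, 1.0, mu)
# e^{Qt} preserves total probability, and the first entry matches the
# PEF (1/10)e^{-3t} of Section 7 at t = 1.
assert abs(sum(mu_1) - 1.0) < 1e-9
assert abs(mu_1[0] - 0.1 * math.exp(-3)) < 1e-9
```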


#### 4 Symbolic Dynamics of CTMCs

In this section, we introduce symbolic dynamics to characterize the properties of the probability distribution space of CTMCs.

First, we fix a finite set of intervals $\mathcal{I} = \{I_k \subseteq [0, 1]\}_{k \in K}$, where the endpoints of each $I_k$ are rational numbers. With the states $S = \{s_0, s_1, \cdots, s_{d-1}\}$, we define the symbolization of distributions as a function:

$$\mathbb{S}: \mathcal{D}(\mathcal{S}) \to 2^{S \times \mathcal{I}} \qquad \mathbb{S}(\mu) = \{ \langle s, I \rangle \in S \times \mathcal{I} \,:\, \mu(s) \in I \},\tag{2}$$

where × denotes the Cartesian product and $2^{S \times \mathcal{I}}$ is the power set of $S \times \mathcal{I}$. Here, $\langle s, I \rangle \in \mathbb{S}(\mu)$ asserts that the probability of state $s$ in distribution µ lies in the interval $I$. The symbolization of distributions generalizes the discretization of distributions studied in [2], which requires $I_k \cap I_m = \emptyset$ for all $k \neq m$. This generalization increases the expressiveness of the continuous linear-time logic introduced in the next section. Now, we can represent any given probability distribution by finitely many symbols from $S \times \mathcal{I}$. For example, suppose

$$\mathcal{I} = \{ [0, 0.1], (0.1, 0.9), [0.9, 1], [1, 1], [0.4, 0.4] \}, \tag{3}$$

and then the initial distribution µ in Example 1 is symbolized as

$$\begin{aligned} \mathbb{S}(\mu) &= \{ \langle s\_0, [0, 0.1] \rangle, \langle s\_1, (0.1, 0.9) \rangle, \langle s\_2, (0.1, 0.9) \rangle, \\ &\quad \langle s\_3, (0.1, 0.9) \rangle, \langle s\_3, [0.4, 0.4] \rangle, \langle s\_4, [0, 0.1] \rangle \}. \end{aligned} \tag{4}$$

As the above example shows, the symbolization of a distribution over states captures both the exact probabilities (singleton intervals) of the states and the ranges of their probabilities.
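A direct sketch of the symbolization function of Eq. (2), with intervals encoded as hypothetical `(lo, hi, lo_open, hi_open)` tuples:

```python
# S(mu) = { (s, I) : mu(s) in I }, for mu a dict from states to probabilities.
def contains(I, p):
    lo, hi, lo_open, hi_open = I
    left = p > lo or (p == lo and not lo_open)
    right = p < hi or (p == hi and not hi_open)
    return left and right

def symbolize(mu, intervals):
    return {(s, I) for s, p in mu.items() for I in intervals if contains(I, p)}

# The interval set of Eq. (3) and the initial distribution of Example 1.
I_set = [(0, 0.1, False, False), (0.1, 0.9, True, True),
         (0.9, 1, False, False), (1, 1, False, False),
         (0.4, 0.4, False, False)]
mu = {"s0": 0.1, "s1": 0.2, "s2": 0.3, "s3": 0.4, "s4": 0.0}

# Reproduces Eq. (4): six pairs, with s3 matching both (0.1, 0.9) and [0.4, 0.4].
result = symbolize(mu, I_set)
assert len(result) == 6
```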

Next, we introduce the symbolization of CTMCs.

Definition 5. A symbolized CTMC is a tuple $SM = (S, Q, \mathcal{I})$, where $M = (S, Q)$ is a CTMC and $\mathcal{I}$ is a finite set of intervals in [0, 1].

As we can see, the set of intervals is picked depending on the CTMC at hand. Then, we extend this symbolization to the path $\sigma_\mu$:

$$\mathbb{S} \circ \sigma_{\mu} : \mathbb{R}^{+} \to 2^{S \times \mathcal{I}} . \tag{5}$$

Definition 6. Given a symbolized CTMC $SM = (S, Q, \mathcal{I})$, $\mathbb{S} \circ \sigma_\mu$ is a symbolic execution path of $M = (S, Q)$.

Given a symbolized CTMC $SM = (S, Q, \mathcal{I})$, the path $\sigma_\mu$ of CTMC $M = (S, Q)$ over the real numbers $\mathbb{R}^{+}$ generated by the probability distribution µ induces a symbolic execution path $\mathbb{S} \circ \sigma_\mu$ over the finite symbols $S \times \mathcal{I}$. Subsequently, the dynamics of CTMCs can be studied in terms of a language over $S \times \mathcal{I}$. In other words, we can study the temporal properties of CTMCs in the context of symbolized CTMCs.

# 5 Continuous Linear-time Logic

In this section, we introduce continuous linear-time logic (CLL), a probabilistic linear-time temporal logic, to specify the temporal properties of a symbolized CTMC SM = (S, Q, I ).

CLL has two types of formulas: state formulas and path formulas. The state formulas are constructed using propositional connectives. The path formulas are obtained from propositional connectives and a temporal modal operator, the timed until $U^{\mathcal{T}}$ for a bounded time interval $\mathcal{T}$, as in MTL and CSL. Furthermore, multiphase timed until formulas $\Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n$ are allowed to enrich the expressiveness of CLL. More importantly, these multiphase formulas involve time reset, so both absolute and relative temporal properties of CTMCs can be studied.

Definition 7. The state formulas of CLL are described according to the following syntax:

$$\Phi := \mathbf{true} \mid a \in AP \mid \neg \Phi \mid \Phi\_1 \land \Phi\_2$$

where $AP = S \times \mathcal{I}$ is the set of atomic propositions.

The path formulas of CLL are constructed by the following syntax:

$$\varphi := \mathbf{true} \mid \Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2$$

where $n \in \mathbb{Z}^{+}$ is a positive integer, $\Phi_k$ is a state formula for all $0 \leq k \leq n$, and the $\mathcal{T}_k$ are time intervals with endpoints in $\mathbb{Q}^{+}$, i.e., each $\mathcal{T}_k$ has one of the following forms:

$$(a, b), \; [a, b], \; (a, b], \; [a, b) \qquad \forall a, b \in \mathbb{Q}^{+}.$$

The semantics of CLL state formulas is defined on the set $\mathcal{D}(S)$ of probability distributions over $S$ via the symbolization function $\mathbb{S}$ in Eq. (2) of Section 4.


The semantics of CLL path formulas is defned on execution paths {σµ}µ∈D(S) of CTMC M = (S, Q).


Not surprisingly, other Boolean connectives are derived in the standard way, i.e., $\mathbf{false} = \neg\mathbf{true}$, $\Phi_1 \vee \Phi_2 = \neg(\neg\Phi_1 \wedge \neg\Phi_2)$ and $\Phi_1 \to \Phi_2 = \neg\Phi_1 \vee \Phi_2$; path formulas follow the same pattern. Furthermore, we generalize the temporal operators ♢ ("eventually") and □ ("always") of discrete-time systems to their timed variants $\Diamond^{\mathcal{T}}$ and $\Box^{\mathcal{T}}$, respectively, as follows:

$$
\Diamond^{\mathcal{T}} \Phi = \mathbf{true}\; U^{\mathcal{T}} \Phi \qquad \Box^{\mathcal{T}} \Phi = \neg \Diamond^{\mathcal{T}} \neg \Phi.
$$

For $n = 1$ in multiphase timed until formulas, the until operator $U^{\mathcal{T}_1}$ is a timed variant of the until operator of LTL; the path formula $\Phi_0 U^{\mathcal{T}_1} \Phi_1$ asserts that $\Phi_1$ is satisfied at some time instant in the interval $\mathcal{T}_1$ and that $\Phi_0$ holds at all preceding time instants in $\mathcal{T}_1$. For example,

$$
\varphi = \langle s\_1, [0, 0.1] \rangle U^{[0,5]} \langle s\_0, [0.9, 1] \rangle,
$$

as mentioned in the introduction.

For general $n$, the CLL path formula $\Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n$ is explained by induction on $n$. We first note that $U^{\mathcal{T}}$ is right-associative, e.g., $\Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2$ stands for $\Phi_0 U^{\mathcal{T}_1} (\Phi_1 U^{\mathcal{T}_2} \Phi_2)$. This enables time reset: $\mathcal{T}_1$ and $\mathcal{T}_2$ need not be disjoint, and the starting time point of $\mathcal{T}_2$ is relative to some time instant in $\mathcal{T}_1$. Recall the multiphase timed until formula from the introduction, which expresses a relative time property:

$$
\varphi' = \langle s\_0, [0.9, 1] \rangle U^{[3,7]} (\langle s\_1, [0, 0.1] \rangle U^{[0,5]} \langle s\_0, [0.9, 1] \rangle),
$$

which differs from the following CLL path formula, representing an absolute temporal property of CTMCs:

$$
\varphi'' = \Box^{[3,7]} \langle s\_0, [0.9, 1] \rangle \wedge (\langle s\_1, [0, 0.1] \rangle U^{[0,5]} \langle s\_0, [0.9, 1] \rangle).
$$

As an example, we clarify the semantics of CLL by comparing the above two path formulas in general forms:

$$
\Phi\_0 U^{\mathcal{T}\_1} \Phi\_1 U^{\mathcal{T}\_2} \Phi\_2 \quad \text{and} \quad \Phi\_0 U^{\mathcal{T}\_1} \Phi\_1 \wedge \Phi\_1 U^{\mathcal{T}\_2} \Phi\_2 .
$$

(1) $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2$ asserts that there are time instants $t_1 \in \mathcal{T}_1$, $t_2 \in \mathcal{T}_2$ such that $\mu_{t_1+t_2} \models \Phi_2$, and for any $t_1' \in \mathcal{T}_1 \cap [0, t_1)$ and $t_2' \in \mathcal{T}_2 \cap [0, t_2)$, $\mu_{t_1'} \models \Phi_0$ and $\mu_{t_1+t_2'} \models \Phi_1$, where $\mu_t = e^{Qt}\mu$ for all $t \in \mathbb{R}^{+}$. This is made clearer by the following timeline.

(Timeline: starting from time 0, $\Phi_0$ holds up to $t_1$, with $\inf \mathcal{T}_1 \leq t_1 \leq \sup \mathcal{T}_1$; from $t_1$, $\Phi_1$ holds up to $t_1 + t_2$, with $t_2 \in \mathcal{T}_2$; and $\Phi_2$ holds at $t_1 + t_2$.)

(2) $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1} \Phi_1 \wedge \Phi_1 U^{\mathcal{T}_2} \Phi_2$ asserts that there are time instants $t_1 \in \mathcal{T}_1$, $t_2 \in \mathcal{T}_2$ such that $\mu_{t_1} \models \Phi_1$ and $\mu_{t_2} \models \Phi_2$, and for any $t_1' \in \mathcal{T}_1 \cap [0, t_1)$ and $t_2' \in \mathcal{T}_2 \cap [0, t_2)$, $\mu_{t_1'} \models \Phi_0$ and $\mu_{t_2'} \models \Phi_1$, where $\mu_t = e^{Qt}\mu$ for all $t \in \mathbb{R}^{+}$.
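The single-phase case of this semantics can be illustrated by a naive numerical check on a dense time grid. This is an approximation only (the paper's algorithm in Section 6 is exact and symbolic), and the 2-state chain below is hypothetical, chosen so that $\mu_t$ has a closed form.

```python
# Approximate check of sigma_mu |= phi0 U^[a,b] phi1 by grid sampling.
import math

def mu_t(t):
    """Hypothetical 2-state CTMC with rates s0->s1 = 2, s1->s0 = 1, started
    in s0; closed form: mu(s0)(t) = 1/3 + (2/3) e^{-3t}."""
    p0 = 1 / 3 + (2 / 3) * math.exp(-3 * t)
    return {"s0": p0, "s1": 1 - p0}

def holds(ap, mu):
    s, (lo, hi) = ap                 # atomic proposition <s, [lo, hi]>
    return lo <= mu[s] <= hi

def until(phi0, phi1, a, b, step=1e-4):
    """Exists t1 in [a, b] with mu_{t1} |= phi1 and mu_{t'} |= phi0
    for all preceding t' in [a, t1)."""
    t = a
    while t <= b:
        if holds(phi1, mu_t(t)):
            return True
        if not holds(phi0, mu_t(t)):
            return False
        t += step
    return False

# <s0, [0.3, 1]> U^[0,5] <s0, [0, 0.5]>: mu(s0) decays from 1 toward 1/3,
# staying above 0.3 throughout, and first enters [0, 0.5] within [0, 5].
assert until(("s0", (0.3, 1)), ("s0", (0, 0.5)), 0, 5)
```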

Before solving the model-checking problem of CTMCs against CLL formulas in the next section, we shall frst discuss what can be specifed in our logic CLL.

Given a CTMC $(S, Q)$, the CLL path formula $\Diamond^{[0,1000]}\langle s, [1, 1]\rangle$ expresses a liveness property: state $s \in S$ is eventually reached with probability one before time instant 1000. In terms of safety properties, the formula $\Box^{[100,1000]}\langle s, [0, 0]\rangle$ represents that state $s \in S$ is never reached (reached with probability zero) between time instants 100 and 1000. Furthermore, taking nontrivial intervals (neither [0, 0] nor [1, 1]), liveness and safety properties can be asserted with probabilities, such as $\Diamond^{[0,1000]}\langle s, [0.5, 1]\rangle$ and $\Box^{[100,1000]}\langle s, [0, 0.5]\rangle$. The multiphase timed until formula $\langle s, [0.7, 1]\rangle U^{[2,3]}\langle s, [0.7, 1]\rangle \ldots U^{[2,3]}\langle s, [0.7, 1]\rangle$, with 100 occurrences of $U^{[2,3]}$, asserts that the probability of state $s$ is beyond 0.7 at some time instant in each relative interval [2, 3], and that this happens at least 100 times.

Next, we can classify members of $\mathcal{I}$ as representing "low" and "high" probabilities. For example, if $\mathcal{I}$ contains the 3 intervals $\{[0, 0.1], (0.1, 0.9), [0.9, 1]\}$, we can declare the first interval "low" and the last interval "high". In this case, $\Box^{[10,1000)}(\langle s_0, [0, 0.1]\rangle \to \langle s_1, [0.9, 1]\rangle)$ says that, in the time interval [10, 1000), whenever the probability of state $s_0$ is low, the probability of state $s_1$ is high.

#### 6 CLL Model Checking

In this section, we provide an algorithm to model check CTMCs against CLL formulas, i.e., we show that the following CLL model-checking problem (Problem 1) is decidable.

Problem 1 (CLL Model-checking Problem). Given a symbolized CTMC $SM = (S, Q, \mathcal{I})$ with an initial distribution µ and a CLL path formula φ over $AP = S \times \mathcal{I}$, the goal is to decide whether $\sigma_\mu \models \varphi$, where $\sigma_\mu(t) = e^{Qt}\mu$ is the execution path defined in Eq. (1).

In particular, we show that

Theorem 2. Under the condition that Schanuel's conjecture holds, the CLL model-checking problem in Problem 1 is decidable.

In the following, we prove the above theorem, proceeding from the basic formulas (atomic propositions) to the most complex ones (nontrivial multiphase timed until formulas). For readability, we put the proofs of all results in Appendix A of the extended version [21] of this paper.

We start with the simplest case, an atomic proposition $\langle s, I \rangle$. By the semantics of CLL, $\mu_t \models \langle s, I \rangle$ if and only if $\mu_t(s) = e^{Qt}\mu(s) \in I$. To check this, we first observe that the execution path $e^{Qt}\mu$ of a CTMC is a system of polynomial-exponential functions (PEFs).

Definition 8. A function $f : \mathbb{R} \to \mathbb{R}$ is a polynomial-exponential function (PEF) if $f$ has the following form:

$$f(t) = \sum\_{k=0}^{K} f\_k(t)e^{\lambda\_k t} \tag{6}$$

where for all $0 \leq k \leq K < \infty$, $f_k(t) \in \mathbb{F}_1[t]$, $f_k(t) \neq 0$, $\lambda_k \in \mathbb{F}_2$, and $\mathbb{F}_1, \mathbb{F}_2$ are fields. Without loss of generality, we assume that the $\lambda_k$ are distinct.

Generally, for a PEF $f(t)$ with range in the complex numbers C, $g(t) = f(t) + f^*(t)$ is a PEF with range in the real numbers R, where $f^*(t)$ is the complex conjugate of $f(t)$. The argument $t$ is omitted whenever convenient, i.e., $f = f(t)$, and $t$ is called a root of a function $f$ if $f(t) = 0$. PEFs often appear in transcendental number theory as auxiliary functions in proofs involving the exponential function [10].

Lemma 1. Given a CTMC $M = (S, Q)$ with $S = \{s_0, \ldots, s_{d-1}\}$, $Q \in \mathbb{Q}^{d \times d}$, and an initial distribution $\mu \in \mathbb{Q}^{d \times 1}$, for any $0 \leq i \leq d-1$, $e^{Qt}\mu(s_i)$, the $i$-th entry of $e^{Qt}\mu$, can be expressed as a PEF $f : \mathbb{R}^{+} \to [0, 1]$ as in Eq. (6) with $\mathbb{F}_1 = \mathbb{F}_2 = \mathbb{A}$.

By the above lemma, for $t$ in a given bounded time interval $\mathcal{T}$ (made specific in the later discussion), whether $e^{Qt}\mu(s) \in I$ is determined by the algebraic structure of the PEF $g(t) = e^{Qt}\mu(s)$ on $\mathcal{T}$, that is, by all maximum intervals $\mathcal{T}_{max} \subseteq \mathcal{T}$ such that $g(t) \in I$ for all $t \in \mathcal{T}_{max}$; here a non-empty interval $\mathcal{T}_{max}$ is called maximum for $g(t) \in I$ if there is no strictly larger interval $\mathcal{T}' \supsetneq \mathcal{T}_{max}$ within $\mathcal{T}$ with the same property, i.e., $g(t) \in I$ for all $t \in \mathcal{T}'$. Then $e^{Qt}\mu(s) \in I$ if and only if $t \in \mathcal{T}_{max}$ for some maximum interval $\mathcal{T}_{max}$. So, we aim to compute the set $\mathbb{T}$ of all maximum intervals. By the continuity of the PEF $g(t)$, this can be done by finding a real root isolation, over $\mathcal{T}$, of the PEF $f(t) = (g(t) - \inf I)(g(t) - \sup I)$.

A (real) root isolation of a function $f(t)$ in an interval $\mathcal{T}$ is a set of mutually disjoint intervals, denoted by $\text{Iso}(f)_{\mathcal{T}} = \{(a_j, b_j) \subseteq \mathcal{T}\}$ with $a_j, b_j \in \mathbb{Q}$, such that

– for any $j$, there is one and only one root of $f(t)$ in $(a_j, b_j)$;

– for any root $t^*$ of $f(t)$ in $\mathcal{T}$, $t^* \in (a_j, b_j)$ for some $j$.

Furthermore, if $f$ has no root in $\mathcal{T}$, then $\text{Iso}(f)_{\mathcal{T}} = \emptyset$.

Although $f(t)$ admits infinitely many different real root isolations in $\mathcal{T}$, the number of isolation intervals always equals the number of distinct roots of $f(t)$ in $\mathcal{T}$.
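For intuition, here is a floating-point sketch of root isolation for a concrete PEF with simple roots. The paper's algorithm is exact over the algebraic numbers; this grid scan is only an illustration and can miss roots without a sign change.

```python
# Numerical root isolation of a PEF on a bounded interval by scanning a
# uniform grid for sign changes.
import math

def pef(t):
    # Example PEF: f(t) = e^{-t} - 1/2, with a single root at t = ln 2.
    return math.exp(-t) - 0.5

def isolate_roots(f, a, b, grid=10000):
    """Return disjoint subintervals of [a, b], each containing exactly one
    sign-change root of f."""
    isolation = []
    step = (b - a) / grid
    for i in range(grid):
        lo, hi = a + i * step, a + (i + 1) * step
        if f(lo) == 0:
            isolation.append((lo - step / 2, lo + step / 2))
        elif f(lo) * f(hi) < 0:
            isolation.append((lo, hi))
    return isolation

iso = isolate_roots(pef, 0, 5)
# Exactly one isolating interval, around t = ln 2.
assert len(iso) == 1 and iso[0][0] < math.log(2) < iso[0][1]
```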

Finding real root isolations of PEFs is a long-standing problem that can be traced back at least to Ritt's paper [34] in 1929, with further results obtained over the last century (e.g., [7,38]). The problem is essential in the reachability analysis of dynamical systems, an active field of symbolic and algebraic computation. For the case $\mathbb{F}_1 = \mathbb{Q}$ and $\mathbb{F}_2 = \mathbb{N}^{+}$, an algorithm named ISOL was proposed in [1] to isolate all real roots of $f(t)$. Later, this algorithm was extended to the case $\mathbb{F}_1 = \mathbb{Q}$ and $\mathbb{F}_2 = \mathbb{R}$ [20]. A variant of the problem has also been studied in [28]. The correctness of these algorithms is based on Schanuel's conjecture. Other works use Schanuel's conjecture for root isolation of other classes of functions, such as exp-log functions [35] and tame elementary functions [36].

By Lemma 1, we pursue this problem in the context of CTMCs. The distinctive feature of real root isolation of PEFs in our paper is that we deal with complex numbers C, more specifically algebraic numbers A, i.e., $\mathbb{F}_1 = \mathbb{F}_2 = \mathbb{A}$, whereas, to the best of our knowledge, all previous works can only handle the case over R. Here, we develop a state-of-the-art real root isolation algorithm for PEFs over algebraic numbers. From now on, we always assume that PEFs are over A, i.e., $\mathbb{F}_1 = \mathbb{F}_2 = \mathbb{A}$ in Eq. (6). In this case, it is worth noting that whether a PEF has a root in a given interval $\mathcal{T} \subseteq \mathbb{R}^{+}$ is decidable subject to Schanuel's conjecture if $\mathcal{T}$ is bounded [16], which is the situation we consider in this paper.

Theorem 3 ([16]). Under the condition that Schanuel's conjecture holds, there is an algorithm to check whether a PEF $f(t)$ has a root in an interval $\mathcal{T}$, i.e., whether $\text{Iso}(f)_{\mathcal{T}} = \emptyset$.

In this paper, we extend the above check of $\text{Iso}(f)_{\mathcal{T}} = \emptyset$ to the computation of $\text{Iso}(f)_{\mathcal{T}}$ for a PEF $f(t)$.

Theorem 4. Under the condition that Schanuel's conjecture holds, there is an algorithm to find a real root isolation $\text{Iso}(f)_{\mathcal{T}}$ for any PEF $f(t)$ and interval $\mathcal{T}$. Furthermore, the number of real roots is finite, i.e., $|\text{Iso}(f)_{\mathcal{T}}| < \infty$.

With the above theorem, we can compute the set $\mathbb{T}$ of all maximum intervals and thereby check atomic propositions. Furthermore, we can compare the values of any real roots of PEFs, which is important for model checking general multiphase timed until formulas at the end of this section.

Lemma 2. Let $f_1(t)$ and $f_2(t)$ be two PEFs with domains $\mathcal{T}_1$ and $\mathcal{T}_2$, and let $t_1 \in \mathcal{T}_1$ and $t_2 \in \mathcal{T}_2$ be roots of them, respectively. Under the condition that Schanuel's conjecture holds, there is an efficient way to check whether or not $t_1 - t_2 < g$ for any given rational number $g \in \mathbb{Q}$.

For model checking a general state formula Φ, we can also use the real root isolation of some PEF to obtain the set of all maximum intervals $\mathcal{T}_{max}$ such that $\mu_t \models \Phi$ for all $t \in \mathcal{T}_{max}$; the reason is that Φ admits a conjunctive normal form over atomic propositions. See the proof of the following lemma in Appendix A of the extended version [21] of this paper for the details.

Lemma 3. Under the condition that Schanuel's conjecture holds, given a time interval $\mathcal{T}$, the set $\mathbb{T}$ of all maximum intervals in $\mathcal{T}$ satisfying $\mu_t \models \Phi$ can be computed, where Φ is a state formula of CLL. Furthermore, the number of intervals in $\mathbb{T}$ is finite, and the left and right endpoints of each interval in $\mathbb{T}$ are roots of PEFs.

Finally, we characterize the multiphase timed until formulas by the reachability analysis of time intervals (instants).

Lemma 4. $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \cdots U^{\mathcal{T}_n} \Phi_n$ if and only if there exist time intervals $\{I_k \subseteq \mathbb{R}^{+}\}_{k=0}^{n}$ with $I_0 = [0, 0]$ such that


By the above lemma, the problem of checking multiphase timed until formulas reduces to verifying the existence of a sequence of time intervals.

Now we can show the proof of Theorem 2.

Proof. Recall that the nontrivial step is to model check a multiphase timed until formula $\Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \cdots U^{\mathcal{T}_n} \Phi_n$, where $\{\mathcal{T}_j\}_{j=1}^{n}$ is a set of bounded rational intervals in $\mathbb{R}^{+}$, and $\Phi_k$ is a state formula for $0 \leq k \leq n$.

By Lemma 4, to model check the above formula we only need to check the existence of the time intervals $\{I_k\}_{k=0}^{n}$ described in the lemma. The following procedure constructs such a set of intervals if it exists:


$$\mathcal{I}_k = \{ \mathcal{T} \cap (\mathcal{T}' + \mathcal{T}_k) : \mathcal{T} \in \mathcal{I}_k \text{ and } \mathcal{T}' \in \mathcal{I}_{k-1} \}. \tag{7}$$

The above updates can be carried out by Lemma 2. If $\mathcal{I}_k = \emptyset$, then the formula is not satisfied;

– (4) Updating $\mathcal{I}_n$: for each $I \in \mathcal{I}_n$, we replace $I$ with $[s - \varepsilon, s)$ for some constant $\varepsilon > 0$ if there is an $s \in I$ with $s - \varepsilon \in I$ such that $\mu_s \models \Phi_n$, where $\mu_s = e^{Qs}\mu$; otherwise, we remove this element from $\mathcal{I}_n$. Again, this can be done by Lemma 3. If $\mathcal{I}_n = \emptyset$, then the formula is not satisfied;

– (5) Finally, for $k$ from $n - 1$ down to 1, update $\mathcal{I}_k$:

$$\mathcal{I}_k = \{ [s - \inf \mathcal{T}_k, \, s - \inf \mathcal{T}_k] : [s - \varepsilon, s) \in \mathcal{I}_{k+1} \}.$$

Thus, after the above procedure, we have non-empty sets $\{\mathcal{I}_k\}_{k=0}^{n}$ with the following properties.


Therefore, the above procedure produces a set of intervals satisfying the two conditions in Lemma 4 if one exists. On the other hand, it is easy to check that any such intervals $\{I_k\}_{k=0}^{n}$ must lie within $\{\mathcal{I}_k\}_{k=0}^{n}$, i.e., for each $k$, $I_k \subseteq I$ for some $I \in \mathcal{I}_k$. This ensures the correctness of the above procedure.

The above constructive analysis yields an algorithm for model checking CTMCs against CLL formulas. Since we focus on the decidability problem, we do not provide pseudocode for the algorithm; instead, we implement a numerical experiment to illustrate the checking procedure in the next section.

#### 7 Numerical Implementation

In this section, we implement a case study of checking CTMCs against CLL formulas. We consider a symbolized CTMC $SM = (S, Q, \mathcal{I})$, where $M = (S, Q)$ is the CTMC in Example 1 and the finite set $\mathcal{I}$ is the one given in Eq. (3). We check the properties of $M$ expressed by the following two CLL path formulas, mentioned in the introduction, for different initial distributions.

$$\begin{aligned} \varphi &= \langle s\_1, [0, 0.1] \rangle U^{[0,5]} \langle s\_0, [0.9, 1] \rangle . \\ \varphi' &= \langle s\_0, [0.9, 1] \rangle U^{[3,7]} \langle s\_1, [0, 0.1] \rangle U^{[0,5]} \langle s\_0, [0.9, 1] \rangle . \end{aligned}$$

By Jordan decomposition, we have $Q = SJS^{-1}$, where

$$S = \begin{pmatrix} 0 & -6 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 1 \\ -7 & -3 & 0 & 0 & 0 \\ 3 & 3 & 0 & 1 & 0 \\ 4 & 4 & 1 & 0 & 0 \end{pmatrix} \qquad J = \begin{pmatrix} -7 & 0 & 0 & 0 & 0 \\ 0 & -3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \qquad S^{-1} = \begin{pmatrix} \frac{1}{14} & 0 & -\frac{1}{7} & 0 & 0 \\ -\frac{1}{6} & 0 & 0 & 0 & 0 \\ \frac{8}{21} & 0 & \frac{4}{7} & 0 & 1 \\ \frac{2}{7} & 0 & \frac{3}{7} & 1 & 0 \\ \frac{1}{3} & 1 & 0 & 0 & 0 \end{pmatrix}.$$
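The decomposition can be checked, and $Q$ recovered, in exact rational arithmetic. One caveat: the zero entry completing the second row of $S^{-1}$ is an assumption here, forced by $S^{-1}S = I$.

```python
# Verify S * S^{-1} = I and recover Q = S J S^{-1} with exact fractions.
from fractions import Fraction as Fr

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

S = [[0, -6, 0, 0, 0],
     [0, 2, 0, 0, 1],
     [-7, -3, 0, 0, 0],
     [3, 3, 0, 1, 0],
     [4, 4, 1, 0, 0]]
J = [[-7, 0, 0, 0, 0],
     [0, -3, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]]
S_inv = [[Fr(1, 14), 0, Fr(-1, 7), 0, 0],
         [Fr(-1, 6), 0, 0, 0, 0],
         [Fr(8, 21), 0, Fr(4, 7), 0, 1],
         [Fr(2, 7), 0, Fr(3, 7), 1, 0],
         [Fr(1, 3), 1, 0, 0, 0]]

I5 = mat_mul(S, S_inv)
assert all(I5[i][j] == (1 if i == j else 0) for i in range(5) for j in range(5))

Q = mat_mul(mat_mul(S, J), S_inv)
# Q is a valid transition rate matrix: off-diagonal entries non-negative,
# every column sums to zero.
assert all(Q[i][j] >= 0 for i in range(5) for j in range(5) if i != j)
assert all(sum(Q[i][j] for i in range(5)) == 0 for j in range(5))
```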

Then, we consider the initial distribution µ from Example 1 and obtain the value of $e^{Qt}\mu$ as follows:

$$
\begin{pmatrix} e^{-3t} & 0 & 0 & 0 & 0 \\ -\frac{1}{3}(e^{-3t}-1) & 1 & 0 & 0 & 0 \\ \frac{1}{2}(e^{-3t}-e^{-7t}) & 0 & e^{-7t} & 0 & 0 \\ \frac{3}{14}e^{-7t}-\frac{1}{2}e^{-3t}+\frac{2}{7} & 0 & -\frac{3}{7}e^{-7t}+\frac{3}{7} & 1 & 0 \\ \frac{2}{7}e^{-7t}-\frac{2}{3}e^{-3t}+\frac{8}{21} & 0 & -\frac{4}{7}e^{-7t}+\frac{4}{7} & 0 & 1 \end{pmatrix}
\begin{pmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0 \end{pmatrix} = \begin{pmatrix} \frac{1}{10}e^{-3t} \\ -\frac{1}{30}e^{-3t}+\frac{7}{30} \\ \frac{1}{20}e^{-3t}+\frac{1}{4}e^{-7t} \\ -\frac{1}{20}e^{-3t}-\frac{3}{28}e^{-7t}+\frac{39}{70} \\ -\frac{1}{15}e^{-3t}-\frac{1}{7}e^{-7t}+\frac{22}{105} \end{pmatrix}.
$$

As we only consider states $s_0$ and $s_1$ in formulas φ and φ′, we focus on the following PEFs: $f_0(t) = \frac{1}{10}e^{-3t}$ and $f_1(t) = -\frac{1}{30}e^{-3t} + \frac{7}{30}$.

Next, we initiate the model checking procedure introduced in the proof of Theorem 2. First, we compute the set $\mathbb{T}$ of all maximum intervals $\mathcal{T} \subseteq [0, 5]$ such that $e^{Qt}\mu \models \langle s_0, [0.9, 1]\rangle$ for $t \in \mathcal{T}$, i.e., $f_0(t) \in [0.9, 1]$ for $t \in \mathcal{T}$. We obtain $\mathbb{T} = \emptyset$ by the real root isolation algorithm mentioned in Theorem 4, and this indicates that $\sigma_\mu \not\models \varphi$, where $\sigma_\mu(t) = e^{Qt}\mu$ is the path induced by µ and defined in Eq. (1).
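This emptiness is easy to confirm numerically (a sanity check, not the exact algorithm), since $f_0$ is strictly decreasing from $f_0(0) = 0.1 < 0.9$:

```python
# f0(t) = (1/10) e^{-3t} never enters [0.9, 1] on [0, 5]: its maximum on the
# interval is attained at t = 0 and equals 0.1.
import math

def f0(t):
    return (1 / 10) * math.exp(-3 * t)

assert max(f0(k / 1000 * 5) for k in range(1001)) == f0(0) == 0.1
assert f0(0) < 0.9   # hence no t in [0, 5] satisfies f0(t) in [0.9, 1]
```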

To check whether $\sigma_\mu \models \varphi'$, we compute the set $\mathbb{T}$ of all maximum intervals $\mathcal{T} \subseteq [0, 12]$ such that $e^{Qt}\mu \models \langle s_0, [0.9, 1]\rangle$ for $t \in \mathcal{T}$, i.e., $f_0(t) \in [0.9, 1]$ for $t \in \mathcal{T}$. Again, we obtain $\mathbb{T} = \emptyset$ by the real root isolation algorithm in Theorem 4. Therefore, $\sigma_\mu \not\models \varphi'$.

In the following, we consider a different initial distribution $\mu_1$ as follows:

$$e^{Qt}\mu_1 = e^{Qt} \begin{pmatrix} 0.9\\0\\0.1\\0\\0 \end{pmatrix} = \begin{pmatrix} \frac{9}{10}e^{-3t} \\ -\frac{3}{10}(e^{-3t}-1) \\ \frac{9}{20}e^{-3t}-\frac{7}{20}e^{-7t} \\ -\frac{9}{20}e^{-3t}+\frac{3}{20}e^{-7t}+\frac{3}{10} \\ -\frac{3}{5}e^{-3t}+\frac{1}{5}e^{-7t}+\frac{2}{5} \end{pmatrix}.$$

The key PEFs are $g_0(t) = \frac{9}{10}e^{-3t}$ and $g_1(t) = -\frac{3}{10}(e^{-3t} - 1)$.

Again, we initiate the model checking procedure introduced in the proof of Theorem 2. We first compute the set $\mathbb{T}$ of all maximum intervals $\mathcal{T} \subseteq [0, 5]$ such that $e^{Qt}\mu_1 \models \langle s_1, [0, 0.1]\rangle$ for $t \in \mathcal{T}$, i.e., $g_1(t) \in [0, 0.1]$ for $t \in \mathcal{T}$. This can be done by finding a real root isolation of the following PEF: $g_1^0(t) = -\frac{3}{10}(e^{-3t} - 1) - \frac{1}{10}$.

By implementing the real root isolation algorithm in Theorem 4, we have

$$\text{Iso}(g_1^0)_{[0,5]} = \{ (0.13, 0.14) \} \quad \text{and then} \quad \mathbb{T} = \{ [0, t^*] \} \text{ for } t^* \in (0.13, 0.14).$$
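The isolating interval can be cross-checked against the closed-form root of $g_1^0$: solving $-\frac{3}{10}(e^{-3t}-1) = \frac{1}{10}$ gives $e^{-3t} = \frac{2}{3}$, i.e., $t^* = \ln(3/2)/3$. This is a numerical check, not the exact algorithm.

```python
# The unique root of g1^0 on [0, 5] lies in the isolating interval (0.13, 0.14).
import math

def g1_0(t):
    return -(3 / 10) * (math.exp(-3 * t) - 1) - 1 / 10

t_star = math.log(3 / 2) / 3
assert abs(g1_0(t_star)) < 1e-12
assert 0.13 < t_star < 0.14          # consistent with Iso(g1^0)_[0,5]
```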

In the same way, we compute $\mathbb{T}$ for $e^{Qt}\mu_1 \models \langle s_0, [0.9, 1]\rangle$. We then complete the model checking procedure in the proof of Theorem 2 and conclude that $\sigma_{\mu_1} \models \varphi$. Repeating these steps for the second formula φ′ yields $\sigma_{\mu_1} \not\models \varphi'$.

#### 8 Related Works

Agrawal et al. [2] introduced probabilistic linear-time temporal logic (PLTL) to reason about discrete-time Markov chains as distribution transformers, as we do for CTMCs in this paper. Interestingly, the Skolem Problem can be reduced to the model checking problem for the logic PLTL [3]. The Skolem Problem asks whether a given linear recurrence sequence has a zero term and plays a vital role in the reachability analysis of linear dynamical systems; unfortunately, its decidability remains open [32]. Recently, the Continuous Skolem Problem has been proposed as a better-behaved counterpart (the problem is decidable) and forms a fundamental decision problem concerning reachability

in continuous-time linear dynamical systems [16]. Not surprisingly, the Continuous Skolem Problem can be reduced to model checking CLL. The primary step in verifying CLL formulas is to find a real root isolation of a PEF in a given interval. Chonev, Ouaknine and Worrell reformulated the Continuous Skolem Problem in terms of whether a PEF has a root in a given interval, which is decidable subject to Schanuel's conjecture [16]. An algorithm for finding a root isolation can also answer the problem of checking the existence of roots of a PEF; however, the reverse does not hold in general. Therefore, the decidability of the Continuous Skolem Problem cannot be applied to establish that of our CLL model checking.

Remark 1. By adopting the method in this paper, we established the decidability of model checking quantum CTMCs against signal temporal logic [40]; again, Schanuel's conjecture is needed to guarantee correctness. A quantum CTMC is governed by a Lindblad master equation and is a more general real-time probabilistic Markov model than a CTMC, i.e., a CTMC is an instance of a quantum CTMC. We converted the evolution of the Lindblad master equation into a distribution transformer that preserves the laws of quantum mechanics, and we reduced the model-checking problem of quantum CTMCs to the real root isolation problem considered in this paper, so our method applies to it as well.

# 9 Conclusion

This paper revisited the study of temporal properties of finite-state CTMCs by symbolizing the probability value space [0, 1] into a finite set of intervals. To specify relative and absolute temporal properties, we proposed a probabilistic logic for CTMCs, namely continuous linear-time logic (CLL), and considered the model checking problem in this setting. Our main result is a state-of-the-art real root isolation algorithm over the field of algebraic numbers, which establishes the decidability of the model checking problem under the condition that Schanuel's conjecture holds.

This paper aims to show decidability in as simple a fashion as possible without paying much attention to complexity issues. Faster algorithms for our current constructions would be a significant improvement from a practical standpoint.

# Acknowledgments

We want to thank Professor Joost-Pieter Katoen for his invaluable feedback and for pointing out the references [14,15,30]. This work is supported by the National Key R&D Program of China (Grant No: 2018YFA0306701), the National Natural Science Foundation of China (Grant No: 61832015), ARC Discovery Program (#DP210102449) and ARC DECRA (#DE180100156).

#### References



# Under-Approximating Expected Total Rewards in POMDPs⋆

Alexander Bork1() , Joost-Pieter Katoen<sup>1</sup> , and Tim Quatmann<sup>1</sup>

RWTH Aachen University, Aachen, Germany
alexander.bork@cs.rwth-aachen.de

Abstract. We consider the problem: is the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) below a given threshold? We tackle this—generally undecidable—problem by computing under-approximations of these expected total rewards. This is done by abstracting finite unfoldings of the infinite belief MDP of the POMDP. The key issue is to find a suitable under-approximation of the value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs. We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well while providing tight lower bounds on the expected total reward.

# 1 Introduction

The relevance of POMDPs. Partially observable Markov decision processes (POMDPs) originated in operations research and nowadays are a pivotal model for planning in AI [40]. They inherit all features of classical MDPs: each state has a set of discrete probability distributions over the states and rewards are earned when taking transitions. However, states are not fully observable. Intuitively, certain aspects of the states can be identified, such as a state's colour, but states themselves cannot be observed. This partial observability reflects, for example, a robot's view of its environment while only having the limited perspective of its sensors at its disposal. The main goal is to obtain a policy—a plan for how to resolve the non-determinism in the model—for a given objective. The key problem here is that POMDP policies must base their decisions only on the observable aspects (e.g. colours) of states. This stands in contrast to policies for MDPs which can make decisions dependent on the entire history of full state information.

Analysing POMDPs. Typical POMDP planning problems consider either finite-horizon objectives or infinite-horizon objectives under discounting. Finite-horizon objectives focus on reaching a certain goal state (such as "the robot has collected all items") within a given number of steps. For infinite horizons, no step bound

<sup>⋆</sup> This work is funded by the DFG RTG 2236 "UnRAVeL".

is provided and typically rewards along a run are weighted by a discounting factor that indicates how much immediate rewards are favoured over more distant ones. Existing techniques to treat these objectives include variations of value iteration [46,36,20,18,52,53] and policy trees [29]. Point-based techniques [38,42] approximate a POMDP's value function using a finite subset of beliefs which is iteratively updated. Algorithms include PBVI [38], Perseus [48], SARSOP [30] and HSVI [45]. Point-based methods can treat large POMDPs for both finite- and discounted infinite-horizon objectives [42].

Problem statement. In this paper we consider the problem: is the maximal expected total reward to reach a given goal state in a POMDP below a given threshold? We thus consider an infinite-horizon objective without discounting—also called an indefinite-horizon objective. A specific instance of the considered problem is the reachability probability to eventually reach a given goal state in a POMDP. This problem is undecidable [33,34] in general. Intuitively, this is due to the fact that POMDP policies need to consider the entire (infinite) observation history to make optimal decisions. For a POMDP, this notion is captured by an infinite, fully observable MDP, its belief MDP. This MDP is obtained from observation sequences inducing probabilities of being in certain states of the POMDP.

Previously proposed methods to solve the problem are e.g. to use approximate value iteration [22], optimisation and search techniques [1,12], dynamic programming [6], Monte Carlo simulation [43], game-based abstraction [51], and machine learning [13,14,19]. Other approaches restrict the memory size of the policies [35]. The synthesis of (possibly randomised) finite-memory policies is ETR-complete<sup>1</sup> [28]. Techniques to obtain finite-memory policies use e.g. parameter synthesis [28] or satisfiability checking and SMT solving [15,50].

Our approach. We tackle the aforementioned problem by computing under-approximations on maximal total expected rewards. This is done by considering finite unfoldings of the infinite belief MDP of the POMDP, and then applying abstraction. The key issue here is to find a suitable under-approximation of the POMDP's value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs and can be applied on top of the simple approach. We use mixed-integer linear programming (MILP) to find such minimal probability shifts. Cut-off techniques for indefinite-horizon objectives have been used on computation trees—rather than on the belief MDP as used here—in Goal-HSVI [24]. Belief clipping amends the probabilities in a belief—to be in a given state of the POMDP—so that they take discretised values, i.e. an abstraction of the probability range [0, 1] is applied. Such grid-based approximations are inspired by Lovejoy's grid-based belief MDP discretisation method [32]. They have also been used in [7] in the context of dynamic programming for POMDPs, and to over-approximate the value function in model checking of POMDPs [8]. In fact, this paper on determining lower bounds for

<sup>1</sup> A decision problem is ETR-complete if it can be reduced to a polynomial-length sentence in the Existential Theory of the Reals (for which the satisfiability problem is decidable) in polynomial time, and there is such a reduction in the reverse direction.

indefinite-horizon objectives can be seen as the dual counterpart of [8]. Our key challenge—compared to the approach of [8]—is that the value at a certain belief cannot easily be under-approximated with a convex combination of values of nearby beliefs. On the other hand, an under-approximation can benefit from a "good" guess of some initial POMDP policy. In the context of [8], such a guessed policy is of limited use for over-approximating values in the POMDP induced by an optimal policy. Although our approach is applicable to all thresholds, the focus of our work is on determining under-approximations for quantitative objectives. Dedicated verification techniques for the qualitative setting—almost-sure reachability—are presented in [17,16,27].

Experimental results. We have implemented our cut-off and belief clipping approaches on top of the probabilistic model checker Storm [23] and applied them to a range of benchmarks. We provide a comparison with the model checking approach in [37], and determine the tightness of our under-approximations by comparing them to over-approximations obtained using the algorithm from [8]. Our main findings from the experimental validation are:


# 2 Preliminaries and Problem Statement

Let $Dist(A) := \{\mu : A \to [0, 1] \mid \sum_{a \in A} \mu(a) = 1\}$ denote the set of probability distributions over a finite set A. The set $supp(\mu) := \{a \in A \mid \mu(a) > 0\}$ is the support of µ ∈ Dist(A). Let $\mathbb{R}^\infty := \mathbb{R} \cup \{\infty, -\infty\}$. We use Iverson bracket notation, where [x] = 1 if the Boolean expression x is true and [x] = 0 otherwise.
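These conventions can be made concrete in a small sketch, assuming distributions are encoded as dicts mapping elements to probabilities (the helper names are ours, not the paper's):

```python
# Minimal sketch of the preliminaries; distributions are dicts element -> probability.

def is_distribution(mu, tol=1e-9):
    # mu : A -> [0, 1] with sum_a mu(a) = 1
    return all(0.0 <= p <= 1.0 for p in mu.values()) and abs(sum(mu.values()) - 1.0) <= tol

def supp(mu):
    # supp(mu) := { a in A | mu(a) > 0 }
    return {a for a, p in mu.items() if p > 0}

def iverson(x):
    # Iverson bracket: [x] = 1 if the Boolean expression x is true, else 0
    return 1 if x else 0
```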

#### 2.1 Partially Observable MDPs

Definition 1 (MDP). A Markov decision process (MDP) is a tuple $M = \langle S, Act, \mathbf{P}, s_{init}\rangle$ with a (finite or infinite) set of states S, a finite set of actions Act, a transition function $\mathbf{P} : S \times Act \times S \to [0, 1]$ with $\sum_{s' \in S} \mathbf{P}(s, \alpha, s') \in \{0, 1\}$ for all s ∈ S and α ∈ Act, and an initial state $s_{init}$.

We fix an MDP $M := \langle S, Act, \mathbf{P}, s_{init}\rangle$. For s ∈ S and α ∈ Act, let $post^M(s, \alpha) := \{s' \in S \mid \mathbf{P}(s, \alpha, s') > 0\}$ denote the set of α-successors of s in M. The set of enabled actions in s ∈ S is given by $Act(s) := \{\alpha \in Act \mid post^M(s, \alpha) \neq \emptyset\}$.

Definition 2 (POMDP). A partially observable MDP (POMDP) is a tuple $\mathcal{M} = \langle M, Z, O\rangle$, where M is the underlying MDP with |S| ∈ ℕ, i.e. S is finite, Z is a finite set of observations, and $O : S \to Z$ is an observation function such that $O(s) = O(s')$ implies $Act(s) = Act(s')$ for all s, s′ ∈ S.

We fix a POMDP $\mathcal{M} := \langle M, Z, O\rangle$ with underlying MDP M. We lift the notion of enabled actions to observations z ∈ Z by setting $Act(z) := Act(s)$ for some s ∈ S with O(s) = z, which is valid since states with the same observation are required to have the same enabled actions. The notions defined for MDPs below also straightforwardly apply to POMDPs.

Remark 1. More general observation functions of the form O : S×Act → Dist(Z) can be encoded in this formalism by using a polynomially larger state space [16].

An infinite path through an MDP (and a POMDP) is a sequence $\tilde\pi = s_0\alpha_1 s_1\alpha_2\ldots$ such that $\alpha_{i+1} \in Act(s_i)$ and $s_{i+1} \in post^M(s_i, \alpha_{i+1})$ for all i ∈ ℕ. A finite path is a finite prefix $\hat\pi = s_0\alpha_1\ldots\alpha_n s_n$ of an infinite path $\tilde\pi$. For finite $\hat\pi$ let $last(\hat\pi) := s_n$ and $|\hat\pi| := n$. For infinite $\tilde\pi$ set $|\tilde\pi| := \infty$ and let $\tilde\pi[i]$ denote the finite prefix of length i ∈ ℕ. We denote the sets of finite and infinite paths in M by $Paths^M_{fin}$ and $Paths^M_{inf}$, respectively. Let $Paths^M := Paths^M_{fin} \cup Paths^M_{inf}$. Paths are lifted to the observation level by observation traces. The observation trace of a (finite or infinite) path $\pi = s_0\alpha_1 s_1\alpha_2\ldots \in Paths^M$ is $O(\pi) := O(s_0)\alpha_1 O(s_1)\alpha_2\ldots$. Two paths $\pi, \pi' \in Paths^M$ are observation-equivalent if $O(\pi) = O(\pi')$.

Policies resolve the non-determinism present in MDPs (and POMDPs). Given a finite path πˆ, a policy determines the action to take at last(ˆπ).

Definition 3 (Policy). A policy for M is a function $\sigma : Paths^M_{fin} \to Dist(Act)$ such that for each path $\hat\pi \in Paths^M_{fin}$, $supp(\sigma(\hat\pi)) \subseteq Act(last(\hat\pi))$.

A policy σ is deterministic if $|supp(\sigma(\hat\pi))| = 1$ for all $\hat\pi \in Paths^M_{fin}$. Otherwise it is randomised. σ is memoryless if for all $\hat\pi, \hat\pi' \in Paths^M_{fin}$ we have $last(\hat\pi) = last(\hat\pi') \implies \sigma(\hat\pi) = \sigma(\hat\pi')$. σ is observation-based if for all $\hat\pi, \hat\pi' \in Paths^M_{fin}$ it holds that $O(\hat\pi) = O(\hat\pi') \implies \sigma(\hat\pi) = \sigma(\hat\pi')$. We denote the set of policies for M by $\Sigma^M$ and the set of observation-based policies for $\mathcal{M}$ by $\Sigma^{\mathcal{M}}_{obs}$. A finite-memory policy (fm-policy) can be represented by a finite automaton where the current memory state and the state of the MDP determine the actions to take [4].

The probability measure $\mu^{\sigma,s}_M$ for paths in M under policy σ and initial state s is the probability measure of the Markov chain induced by M, σ, and s [4].

We use reward structures to model quantities like time, or energy consumption.

Definition 4 (Reward Structure). A reward structure for M is a function $\mathbf{R} : S \times Act \times S \to \mathbb{R}$ such that either $\mathbf{R}(s, \alpha, s') \ge 0$ for all s, s′ ∈ S, α ∈ Act, or $\mathbf{R}(s, \alpha, s') \le 0$ for all s, s′ ∈ S, α ∈ Act. In the former case, we call R positive, otherwise negative.

We fix a reward structure R for M. The total reward along a path π is defined as $\mathsf{rew}_{M,\mathbf{R}}(\pi) := \sum_{i=1}^{|\pi|} \mathbf{R}(s_{i-1}, \alpha_i, s_i)$. The total reward is always well-defined—even if π is infinite—since all rewards are assumed to be either non-negative or non-positive. For an infinite path $\tilde\pi$ we define the total reward until reaching a set of goal states G ⊆ S by

$$\mathsf{rew}\_{M,\mathbf{R},G}(\tilde{\pi}) := \begin{cases} \mathsf{rew}\_{M,\mathbf{R}}(\hat{\pi}) & \text{if } \exists i \in \mathbb{N} : \hat{\pi} = \tilde{\pi}[i] \wedge last(\hat{\pi}) \in G \wedge \\ & \forall j < i : last(\tilde{\pi}[j]) \notin G, \\ \mathsf{rew}\_{M,\mathbf{R}}(\tilde{\pi}) & \text{otherwise}. \end{cases}$$

Intuitively, rewM,R,G(π˜) accumulates reward along π˜ until the first visit of a goal state s ∈ G. If no goal state is reached, reward is accumulated along the infinite path. The expected total reward until reaching G for policy σ and state s is

$$\mathsf{ER}\_{M,\mathbf{R}}^{\sigma}(s \models \Diamond G) := \int\_{\tilde{\pi} \in Paths\_{\mathrm{inf}}^{M}} \mathsf{rew}\_{M,\mathbf{R},G}(\tilde{\pi}) \cdot \mu\_{M}^{\sigma,s}(\mathrm{d}\tilde{\pi}).$$
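The case distinction in the definition of $\mathsf{rew}_{M,\mathbf{R},G}$—accumulate until the first visit of G, otherwise along the whole path—can be sketched for finite paths as follows (plain Python; the list encoding of paths and the dict encoding of R are assumptions of this sketch):

```python
def reward_until_goal(path, R, G):
    # path = [s0, a1, s1, a2, s2, ...]; R maps triples (s, a, s') to rewards;
    # G is the set of goal states. Accumulation stops at the first visit of G.
    total = 0.0
    if path[0] in G:
        return total
    for i in range(1, len(path), 2):
        s, a, s_next = path[i - 1], path[i], path[i + 1]
        total += R.get((s, a, s_next), 0.0)
        if s_next in G:
            break
    return total
```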

Observation-based policies capture the notion that a decision procedure for a POMDP only accesses the observations and their history and not the entire state of the system. We are interested in reasoning about minimal and maximal values over all observation-based policies. For our explanations we focus on maximising (non-negative or non-positive) expected rewards. Minimisation can be achieved by negating all rewards.

Definition 5 (Maximal Expected Total Reward). The maximal expected total reward until reaching G from s in POMDP M is

$$\mathsf{ER}\_{\mathcal{M},\mathbf{R}}^{\max}(s \models \Diamond G) := \sup\_{\sigma \in \Sigma\_{\mathrm{obs}}^{\mathcal{M}}} \mathsf{ER}\_{\mathcal{M},\mathbf{R}}^{\sigma}(s \models \Diamond G).$$

We define $\mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(\Diamond G) := \mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(s_{init} \models \Diamond G)$.

The central problem of our work, the indefinite-horizon total reward problem, asks whether the maximal expected total reward until reaching a goal is below a given threshold.

Problem 1. Given a POMDP M, reward structure R, set of goal states G ⊆ S, and threshold λ ∈ ℝ, decide whether $\mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(\Diamond G) \le \lambda$.

Example 1. Fig. 1 shows a POMDP M with three states and two observations: $s_0$ and $s_1$ share one observation, while $s_2$ carries the other. A reward of 1 is collected when transitioning from $s_1$ to $s_2$ via the β-action. All other rewards are zero.

The policy that always selects α at $s_0$ and β at $s_1$ maximises the expected total reward to reach $G = \{s_2\}$ but is not observation-based. The observation-based policy that selects α for the first n ∈ ℕ transition steps and β afterwards yields an expected total reward of $1 - (1/2)^n$. With n → ∞ we obtain $\mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(\Diamond \{s_2\}) = 1$.

As computing maximal expected rewards exactly in POMDPs is undecidable [34], we aim at under-approximating the actual value $\mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(\Diamond G)$. This allows us to answer our problem negatively if the computed lower bound exceeds λ.

Remark 2. Expected rewards can be used to describe reachability probabilities by assigning reward 1 to all transitions entering G and assigning reward 0 to all other transitions. Our approach can thus be used to obtain lower bounds on reachability probabilities in POMDPs. This also holds for almost-sure reachability (i.e. "is the reachability probability one?"), though dedicated methods like those presented in [17,16,27] are better suited for that setting.

#### 2.2 Beliefs

The semantics of a POMDP M are captured by its (fully observable) belief MDP. The infinite state space of this MDP consists of beliefs [3,44]. A belief is a distribution over the states of the POMDP where each component describes the likelihood to be in a POMDP state given a history of observations. We denote the set of all beliefs for $\mathcal{M}$ by $\mathcal{B}_{\mathcal{M}} := \{b \in Dist(S) \mid \forall s, s' \in supp(b) : O(s) = O(s')\}$ and write O(b) ∈ Z for the unique observation O(s) of all s ∈ supp(b).

The belief MDP of M is constructed by starting in the belief corresponding to the initial state and computing successor beliefs to unfold the MDP. Let $\mathbf{P}(s, \alpha, z) := \sum_{s' \in S} [O(s') = z] \cdot \mathbf{P}(s, \alpha, s')$ be the probability to observe z ∈ Z after taking action α in POMDP state s. Then, the probability to observe z after taking action α in belief b is $\mathbf{P}(b, \alpha, z) := \sum_{s \in S} b(s) \cdot \mathbf{P}(s, \alpha, z)$. We refer to $[b|\alpha, z] \in \mathcal{B}_{\mathcal{M}}$—the belief after taking α in b, conditioned on observing z—as the α-z-successor of b. If $\mathbf{P}(b, \alpha, z) > 0$, it is defined component-wise as

$$[b|\alpha, z](s) := \frac{[O(s) = z] \cdot \sum\_{s' \in S} b(s') \cdot \mathbf{P}(s', \alpha, s)}{\mathbf{P}(b, \alpha, z)}$$

for all s ∈ S. Otherwise $[b|\alpha, z]$ is undefined.
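The two formulas above translate directly into a small sketch (plain Python; the dictionary encodings of P and O are assumptions of this sketch, not the paper's):

```python
def successor_belief(b, alpha, z, P, O):
    # b: dict state -> probability; P[(s, alpha)]: dict s' -> P(s, alpha, s');
    # O: dict state -> observation.
    # Returns the alpha-z-successor [b|alpha, z], or None if P(b, alpha, z) = 0.
    norm = sum(p * q for s, p in b.items()
                     for s2, q in P[(s, alpha)].items() if O[s2] == z)
    if norm == 0:
        return None  # successor undefined
    new_b = {}
    for s, p in b.items():
        for s2, q in P[(s, alpha)].items():
            if O[s2] == z:  # Iverson bracket [O(s') = z]
                new_b[s2] = new_b.get(s2, 0.0) + p * q / norm
    return new_b
```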

Definition 6 (Belief MDP). The belief MDP of $\mathcal{M}$ is the MDP $bel(\mathcal{M}) = \langle\mathcal{B}_{\mathcal{M}}, Act, \mathbf{P}^B, b_{init}\rangle$, where $\mathcal{B}_{\mathcal{M}}$ is the set of all beliefs in $\mathcal{M}$, Act is as for $\mathcal{M}$, $b_{init} := \{s_{init} \mapsto 1\}$ is the initial belief, and $\mathbf{P}^B : \mathcal{B}_{\mathcal{M}} \times Act \times \mathcal{B}_{\mathcal{M}} \to [0, 1]$ is the belief transition function with

$$\mathbf{P}^B(b,\alpha,b') := \begin{cases} \mathbf{P}(b,\alpha,z) & \text{if } b' = [b|\alpha, z],\\ 0 & \text{otherwise.} \end{cases}$$

We lift a POMDP reward structure R to the belief MDP [25].

Definition 7 (Belief Reward Structure). For beliefs $b, b' \in \mathcal{B}_{\mathcal{M}}$ and action α ∈ Act, the belief reward structure $\mathbf{R}^B$ based on R associated with $bel(\mathcal{M})$ is given by

$$\mathbf{R}^B(b,\alpha,b') := \frac{\sum\_{s \in S} b(s) \cdot \sum\_{s' \in S} [O(s') = O(b')] \cdot \mathbf{R}(s,\alpha,s') \cdot \mathbf{P}(s,\alpha,s')}{\mathbf{P}(b,\alpha,O(b'))}.$$

Given a set of goal states G ⊆ S, we assume—for simplicity—that there is a set of observations Z′ ⊆ Z such that s ∈ G iff O(s) ∈ Z′. This assumption can always be ensured by transforming the POMDP M; see the full technical report [10] for details. The set of goal beliefs for G is given by $G_{\mathcal{B}} := \{b \in \mathcal{B}_{\mathcal{M}} \mid supp(b) \subseteq G\}$.

We now lift the computation of expected rewards to the belief level. Based on the well-known Bellman equations [5], the belief MDP induces a function that maps every belief to the expected total reward accumulated from that belief.

Definition 8 (POMDP Value Function). For $b \in \mathcal{B}_{\mathcal{M}}$, the n-step value function $V_n : \mathcal{B}_{\mathcal{M}} \to \mathbb{R}$ of $\mathcal{M}$ is defined recursively as $V_0(b) := 0$ and

$$V\_n(b) := [b \notin G\_{\mathcal{B}}] \cdot \max\_{\alpha \in Act} \sum\_{b' \in post^{bel(\mathcal{M})}(b,\alpha)} \mathbf{P}^B(b,\alpha,b') \cdot \left(\mathbf{R}^B(b,\alpha,b') + V\_{n-1}(b')\right).$$

Figure 2. Belief MDP bel(M) of POMDP M from Fig. 1

The (optimal) value function $V^* : \mathcal{B}_{\mathcal{M}} \to \mathbb{R}^\infty$ is given by $V^*(b) := \lim_{n \to \infty} V_n(b)$.
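On an explicitly given finite fragment of a belief MDP, the Bellman recursion of Def. 8 can be sketched directly (plain Python; the nested-dict transition encoding is our assumption):

```python
def n_step_values(trans, goals, n):
    # trans[b][alpha] = list of (b', probability, reward); goals = set of goal beliefs.
    # Returns V_n for every belief that is a key of trans, starting from V_0 = 0.
    V = {b: 0.0 for b in trans}
    for _ in range(n):
        V = {b: 0.0 if b in goals else
                max(sum(p * (r + V[b2]) for b2, p, r in succs)
                    for succs in trans[b].values())
             for b in trans}
    return V
```

On a fragment mimicking Example 1 (stay with probability 1/2 under α, collect reward 1 under β from the second belief), ten iterations reproduce the $1 - (1/2)^n$ pattern.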

The n-step value function is piecewise linear and convex [44]. Thus, the optimal value function can be approximated arbitrarily closely by a piecewise linear convex function [47]. The value function yields expected total rewards in $\mathcal{M}$ and $bel(\mathcal{M})$:

$$\mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(s \models \Diamond G) = \mathsf{ER}^{\max}_{bel(\mathcal{M}),\mathbf{R}^B}(\{s \mapsto 1\} \models \Diamond G_{\mathcal{B}}) = V^*(\{s \mapsto 1\}).$$

Example 2. Fig. 2 shows a fragment of the belief MDP of the POMDP from Fig. 1. Observe that $\mathsf{ER}^{\max}_{bel(\mathcal{M}),\mathbf{R}^B}(\Diamond\, \{s_2 \mapsto 1\}) = 1$.

We reformulate our problem statement to focus on the belief MDP.

Problem 2 (equivalent to Problem 1). For a POMDP M, reward structure R, goal states G ⊆ S, and threshold λ ∈ ℝ, decide whether $V^*(\{s_{init} \mapsto 1\}) \le \lambda$.

As the belief MDP is fully observable, standard results for MDPs apply. However, an exhaustive analysis of $bel(\mathcal{M})$ is intractable since the belief MDP is—in general—infinitely large<sup>2</sup>.

# 3 Finite Exploration Under-Approximation

Instead of approximating values directly on the POMDP, we consider approximations of the corresponding belief MDP. The basic idea is to construct a finite abstraction of the belief MDP by unfolding parts of it and approximate values at beliefs where we decide not to explore. In the resulting finite MDP, under-approximative expected reward values can be computed by standard model checking techniques. We present two approaches for abstraction: belief cut-offs and belief clipping. We incorporate those techniques into an algorithmic framework that yields arbitrarily tight under-approximations.

The technical report [10] contains formal proofs of our claims.

<sup>2</sup> The set of all beliefs—i.e. the state space of bel(M)—is uncountable. The reachable fragment is countable, though, since each belief has at most |Z| many successors.

Figure 3. Applying belief cut-offs to the belief MDP from Fig. 2

#### 3.1 Belief Cut-Offs

The general idea of belief cut-offs is to stop exploring the belief MDP at certain beliefs—the cut-off beliefs—and assume that a goal state is immediately reached while sub-optimal reward is collected. Similar techniques have been discussed in the context of fully observable MDPs and other model types [11,26,49,2]. Our work adapts the idea of cut-offs for POMDP over-approximations described in [8] to under-approximations. The main idea of belief cut-offs shares similarities with the SARSOP [30] and Goal-HSVI [24] approaches. While they apply cut-offs on the level of the computation tree, our approach directly manipulates the belief MDP to yield a finite model.

Let $\mathcal{V} : \mathcal{B}_{\mathcal{M}} \to \mathbb{R}^\infty$ with $\mathcal{V}(b) \le V^*(b)$ for all $b \in \mathcal{B}_{\mathcal{M}}$. We call $\mathcal{V}$ an under-approximative value function and $\mathcal{V}(b)$ the cut-off value of b. In each of the cut-off beliefs b, instead of adding the regular transitions to its successors, we add a transition with probability 1 to a dedicated goal state $b_{cut}$. In the modified reward structure R′, this cut-off transition is assigned a reward<sup>3</sup> of $\mathcal{V}(b)$, causing the value for a cut-off belief b in the modified MDP to coincide with $\mathcal{V}(b)$. Hence, the exact value of the cut-off belief—and thus the value of all other explored beliefs—is under-approximated.

Example 3. Fig. 3 shows the resulting finite MDP obtained when considering the belief MDP from Fig. 2 with the single cut-off belief $b = \{s_0 \mapsto 1/4,\ s_1 \mapsto 3/4\}$.

Computing cut-off values. The question of finding a suitable under-approximative value function $\mathcal{V}$ is central to the cut-off approach. For an effective approximation, such a function should be easy to compute while still providing values close to the optimum. If we assume a positive reward structure, the constant value 0 is always a valid under-approximation. A more sophisticated approach is to compute suboptimal expected reward values for the states of the POMDP using some arbitrary, fixed observation-based policy $\sigma \in \Sigma^{\mathcal{M}}_{obs}$. Let $U^\sigma : S \to \mathbb{R}^\infty$ such that for all s ∈ S, $U^\sigma(s) = \mathsf{ER}^{\sigma}_{\mathcal{M},\mathbf{R}}(s \models \Diamond G)$. Then, we define the function $\mathfrak{U}^\sigma : \mathcal{B}_{\mathcal{M}} \to \mathbb{R}^\infty$ as $\mathfrak{U}^\sigma(b) := \sum_{s \in supp(b)} b(s) \cdot U^\sigma(s)$.
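The induced cut-off value is just a b-weighted average of the state values, e.g. (plain Python; the dict encoding and function name are ours):

```python
def cutoff_value(b, U_sigma):
    # U_sigma[s] = expected total reward from s under a fixed observation-based
    # policy sigma; the b-weighted average under-approximates V*(b) (cf. Lemma 1).
    return sum(p * U_sigma[s] for s, p in b.items() if p > 0)
```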

<sup>3</sup> We slightly deviate from Def. 4 by allowing transition rewards to be −∞ or +∞. Alternatively, we could introduce new sink states with a non-zero self-loop reward.

Lemma 1. $\mathfrak{U}^\sigma$ is an under-approximative value function, i.e. for all $b \in \mathcal{B}_{\mathcal{M}}$:

$$\mathfrak{U}^{\sigma}(b) := \sum\_{s \in \operatorname{supp}(b)} b(s) \cdot U^{\sigma}(s) \le V^\*(b).$$

Thus, finding a suitable under-approximative value function reduces to finding "good" policies for M, e.g. by using randomly guessed fm-policies, machine learning methods [13], or a transformation to a parametric model [28].

#### 3.2 Belief Clipping

The cut-off approach provides a universal way to construct an MDP which underapproximates the expected total reward value for a given POMDP. The quality of the approximation, however, is highly dependent on the under-approximative value function used. Furthermore, regions where the belief MDP slowly converges towards a belief may pose problems in practice.

As a potential remedy for these problems, we propose a different concept called belief clipping. Intuitively, the procedure shifts some of the probability mass of a belief b in order to transform b into another belief $\tilde b$. We then connect b to $\tilde b$ in a way that the accuracy of our approximation of the value $V^*(b)$ depends only on the approximation of $V^*(\tilde b)$ and the so-called clipping value—some notion of distance between b and $\tilde b$ that we discuss below. We can thus focus on exploring the successors of $\tilde b$ to obtain good approximations for both beliefs b and $\tilde b$.

Definition 9 (Belief Clip). For $b \in \mathcal{B}_{\mathcal{M}}$, we call $\mu : supp(b) \to [0, 1]$ a belief clip if ∀s ∈ supp(b): µ(s) ≤ b(s) and $\Sigma(\mu) := \sum_{s \in supp(b)} \mu(s) < 1$. The belief $(b \ominus \mu) \in \mathcal{B}_{\mathcal{M}}$ induced by µ is defined by

$$\forall s \in \operatorname{supp}(b) \colon \ (b \ominus \mu)(s) \ := \ \frac{b(s) - \mu(s)}{1 - \sum(\mu)}.$$

Intuitively, a belief clip µ for b describes for each s ∈ supp(b) the probability mass that is removed ("clipped away") from b(s). The induced belief is obtained when normalising the resulting values so that they sum up to one.
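The induced belief of Def. 9 is straightforward to compute (plain Python; the dict encoding is our assumption):

```python
def clip_belief(b, mu):
    # Induced belief (b ⊖ mu): subtract the clipped mass mu(s) from each b(s)
    # and renormalise. Requires mu(s) <= b(s) for all s and sum(mu) < 1 (Def. 9).
    clipped = sum(mu.values())
    assert clipped < 1 and all(mu.get(s, 0.0) <= b[s] for s in b)
    return {s: (b[s] - mu.get(s, 0.0)) / (1 - clipped) for s in b}
```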

Example 4. For belief $b = \{s_0 \mapsto 1/4,\ s_1 \mapsto 3/4\}$, consider the two belief clips $\mu_1 = \{s_0 \mapsto 1/4,\ s_1 \mapsto 1/4\}$ and $\mu_2 = \{s_0 \mapsto 1/4,\ s_1 \mapsto 0\}$. Both induce the same belief: $(b \ominus \mu_1) = (b \ominus \mu_2) = \{s_0 \mapsto 0,\ s_1 \mapsto 1\}$.

We have $supp((b \ominus \mu)) \subseteq supp(b)$, which also implies $O((b \ominus \mu)) = O(b)$. Given some candidate belief $\tilde b$, consider the set of inducing belief clips:

$$\mathcal{C}(b,\tilde{b}) := \left\{ \mu \colon supp(b) \to [0,1] \mid \mu \text{ is a belief clip for } b \text{ with } \tilde{b} = (b \ominus \mu) \right\}.$$

Belief $\tilde b$ is called an adequate clipping candidate for b iff $\mathcal{C}(b, \tilde{b}) \neq \emptyset$.

Definition 10 (Clipping Value). For $b \in \mathcal{B}_{\mathcal{M}}$ and adequate clipping candidate $\tilde b$, the clipping value is $\Delta_{b \to \tilde{b}} := \Sigma(\delta_{b \to \tilde{b}})$, where $\delta_{b \to \tilde{b}} := \arg\min_{\mu \in \mathcal{C}(b,\tilde{b})} \Sigma(\mu)$. The values $\delta_{b \to \tilde{b}}(s)$ for s ∈ supp(b) are the state clipping values.
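For a fixed candidate $\tilde b$, the minimising clip can be derived in closed form: each state keeps mass $(1-\Delta)\cdot\tilde b(s)$, and the largest feasible factor $1-\Delta$ keeping every $\delta(s)$ non-negative is $\min_s b(s)/\tilde b(s)$. A sketch under this derivation (plain Python; the derivation and names are ours, not the paper's):

```python
def clipping_value(b, b_tilde):
    # Minimal clipping value Delta and state clipping values delta(s), using
    # delta(s) = b(s) - (1 - Delta) * b_tilde(s) >= 0 for all s in supp(b).
    scale = min(b.get(s, 0.0) / q for s, q in b_tilde.items() if q > 0)
    delta = {s: b[s] - scale * b_tilde.get(s, 0.0) for s in b}
    return 1 - scale, delta
```

On Example 5 below this yields the expected clipping value 1/4.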

Figure 4. Applying belief clipping to the belief MDP from Fig. 2

Given a belief b and an adequate clipping candidate $\tilde b$, we outline how the notion of belief clipping is used to obtain valid under-approximations. We assume $b \neq \tilde b$, implying $0 < \Delta_{b \to \tilde{b}} < 1$. Instead of exploring all successors of b in $bel(\mathcal{M})$, the approach is to add a transition from b to $\tilde b$. The newly added transition has probability $1 - \Delta_{b \to \tilde{b}}$ and gets assigned a reward of 0. The remaining probability mass (i.e. $\Delta_{b \to \tilde{b}}$) leads to a designated goal state $b_{cut}$. To guarantee that—in general—the clipping procedure yields a valid under-approximation, we need to add a corrective reward value to the transition from b to $b_{cut}$. Let $L : S \to \mathbb{R}^\infty$ map each POMDP state to its minimum expected reward in the underlying, fully observable MDP M of $\mathcal{M}$<sup>4</sup>, i.e. $L(s) = \mathsf{ER}^{\min}_{M,\mathbf{R}}(s \models \Diamond G)$. This function soundly under-approximates the state values which can be achieved by any observation-based policy. It can be generated using standard MDP analysis. Given state clipping values $\delta_{b \to \tilde{b}}(s)$ for s ∈ supp(b), the reward for the transition from b to $b_{cut}$ is $\sum_{s \in supp(b)} (\delta_{b \to \tilde{b}}(s)/\Delta_{b \to \tilde{b}}) \cdot L(s)$.

Example 5. For the belief MDP from Fig. 2, belief $b = \{s_0 \mapsto 1/4,\ s_1 \mapsto 3/4\}$, and clipping candidate $\tilde b = \{s_0 \mapsto 0,\ s_1 \mapsto 1\}$ we get $\Delta_{b \to \tilde{b}} = 1/4$, as $\delta_{b \to \tilde{b}} = \mu_2 = \{s_0 \mapsto 1/4,\ s_1 \mapsto 0\}$ with the belief clip $\mu_2$ as in Example 4. Furthermore, $L(s_0) = 0$. The resulting MDP following our construction above is given in Fig. 4.

The following lemma shows that the construction yields an under-approximation.

$$\mathbf{Lemma\ 2.}\ \ (1 - \Delta\_{b \to \tilde{b}}) \cdot V^\*(\tilde{b}) \, + \, \Delta\_{b \to \tilde{b}} \cdot \sum\_{s \in supp(b)} \frac{\delta\_{b \to \tilde{b}}(s)}{\Delta\_{b \to \tilde{b}}} \cdot L(s) \, \le \, V^\*(b).$$

Proof (sketch). To gain some intuition, consider the special case, where ∆b→˜<sup>b</sup> = δb→˜<sup>b</sup> (s) = b(s) for some s ∈ supp(b). The clipping candidate ˜b can be interpreted as the conditional probability distribution arising from distribution b given that s is not the current state. The value V ∗ (b) can be split into the sum of (i) the probability that s is not the current state times the reward accumulated from belief ˜b and (ii) the probability that s is the current state times the reward accumulated from s, i.e. from the belief {s 7→ 1}. However, for the two summands

<sup>4</sup> When rewards are negative, we might have L(s) = −∞ for many s ∈ S \ G in which case the applicability of the clipping approach is very limited.


we must consider a policy that does not distinguish between the beliefs b, ˜b, and {s 7→ 1} as well as their observation-equivalent successors. In other words, the same sequence of actions must be executed when the same observations are made.

We consider such a policy that in addition is optimal at $\tilde b$, i.e. the reward accumulated from $\tilde b$ is equal to $V^*(\tilde b)$. For the reward accumulated from {s ↦ 1}, L(s) provides a lower bound. Hence, $(1 - b(s)) \cdot V^*(\tilde b) + b(s) \cdot L(s)$ is a lower bound for the reward accumulated from b. A formal proof is given in [10]. □

To find a suitable clipping candidate for a given belief b, we consider a finite candidate set B ⊆ B<sup>M</sup> consisting of beliefs with observation O(b). These beliefs do not need to be reachable in the belief MDP. The set can be constructed, e.g. by taking already explored beliefs or by using a fixed, discretised set of beliefs.

We are interested in minimising the clipping value $\Delta_{b \to b'}$ over all candidate beliefs $b' \in \mathfrak{B}$. A naive approach is to explicitly compute the clipping values for all candidates. Instead, we use mixed-integer linear programming (MILP) [41]. An MILP is a system of linear inequalities (constraints) and a linear objective function over real-valued and integer variables. A feasible solution of the MILP is a variable assignment that satisfies all constraints. An optimal solution is a feasible solution that minimises the objective function.
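For intuition, the naive enumeration that the MILP replaces can be sketched as follows (plain Python; the closed form for the minimal Δ under a fixed candidate is our own derivation, not taken from the paper):

```python
def best_clipping_candidate(b, candidates):
    # Naive enumeration: for each candidate b', the minimal Delta follows from
    # delta(s) = b(s) - (1 - Delta) * b'(s) >= 0, i.e. 1 - Delta = min_s b(s)/b'(s).
    best, best_delta = None, 1.0
    for bt in candidates:
        scale = min(b.get(s, 0.0) / q for s, q in bt.items() if q > 0)
        if 1 - scale < best_delta:
            best, best_delta = bt, 1 - scale
    return best, best_delta
```

This explicit loop is exponential in nothing but linear in the candidate set; the MILP below achieves the same minimisation declaratively and scales to larger candidate sets via off-the-shelf solvers.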

Definition 11 (Belief Clipping MILP). The belief clipping MILP for belief b ∈ B_M and a finite set of candidates B ⊆ { b' ∈ B_M | O(b') = O(b) } is given by:

$$\begin{array}{llr}
\text{minimise } \Delta \text{ such that:} & & \\[2pt]
\displaystyle\sum_{b' \in B} a_{b'} = 1 & \rhd\ \text{select exactly one candidate } b' & (1)\\[2pt]
\forall b' \in B\colon\ a_{b'} \in \{0, 1\} & & (2)\\[2pt]
\displaystyle\sum_{s \in supp(b)} \delta_s = \Delta & \rhd\ \text{compute clipping value for selected } b' & (3)\\[2pt]
\forall s \in supp(b)\colon\ \delta_s \in [0, b(s)] & & (4)\\[2pt]
\forall s \in supp(b),\ \forall b' \in B\colon\ \delta_s \geq b(s) - (1 - \Delta) \cdot b'(s) - (1 - a_{b'}) & & (5)
\end{array}$$

The MILP consists of O(|supp(b)| + |B|) variables and O(|supp(b)| · |B|) constraints. For b' ∈ B, the binary variable a_{b'} indicates whether b' has been chosen as the clipping candidate. Moreover, we have variables δ_s for s ∈ supp(b) and a variable Δ representing the (state) clipping values for b and the chosen candidate b'. Constraints 1 and 2 enforce that exactly one of the a_{b'} variables is one, i.e. exactly one belief is chosen. Constraint 3 forces Δ to be the sum of all state clipping values. The δ_s variables take values between zero and b(s) (Constraint 4). Constraint 5 only affects δ_s if the corresponding belief is chosen; otherwise, a_{b'} is set to 0 and the value on the right-hand side becomes negative. If a belief b' is chosen, the minimisation forces Constraint 5 to hold with equality as the right-hand side is greater than or equal to 0. Assuming Δ is set to a value below 1, we obtain valid clipping values as

$$\forall s \in supp(b)\colon \quad \delta_s = b(s) - (1 - \Delta) \cdot b'(s) \quad \Longleftrightarrow \quad b'(s) = \frac{b(s) - \delta_s}{1 - \Delta}.$$
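The equivalence can be checked numerically: given b and state clipping values δ_s summing to Δ < 1, the induced candidate is obtained by renormalising the remaining probability mass. A small sketch, illustrative only:

```python
def induced_candidate(b, delta_s):
    """Reconstruct the clipping candidate b' from belief b and state
    clipping values delta_s via b'(s) = (b(s) - delta_s) / (1 - Delta)."""
    Delta = sum(delta_s.values())
    assert Delta < 1, "Delta = 1 corresponds to an invalid belief clip"
    return {s: (p - delta_s.get(s, 0.0)) / (1 - Delta) for s, p in b.items()}

b = {"s0": 0.6, "s1": 0.4}
deltas = {"s0": 0.1, "s1": 0.1}       # Delta = 0.2
print(induced_candidate(b, deltas))   # ≈ {'s0': 0.625, 's1': 0.375}
```

The result is again a probability distribution, as the clipped mass Δ is exactly the mass removed from b.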

```
Input:  POMDP M = ⟨M, Z, O⟩ with M = ⟨S, Act, P, s_init⟩, reward structure R,
        goal states G ⊆ S, under-approximative value function V,
        function L : S → R^∞ with L(s) = ER^min_{M,R}(s ⊨ ♦G)
Output: Clipping belief MDP K_M and reward structure R^K
 1  S^K ← {b_init, b_cut} with b_init = {s_init ↦ 1} and a new belief state b_cut
 2  P^K(b_cut, cut, b_cut) ← 1, R^K(b_cut, cut, b_cut) ← 0       // add self-loop
 3  Q ← {b_init}                                    // initialise exploration set
 4  while Q ≠ ∅ do
 5    b ← chooseBelief(Q), Q ← Q \ {b}       // pop next belief to explore from Q
 6    if supp(b) ⊆ G then P^K(b, goal, b) ← 1, R^K(b, goal, b) ← 0  // add self-loop
 7    else if exploreBelief(b) then                                 // expand b
 8      foreach α ∈ Act(b) do            // using bel(M) and R^B as in Defs. 6 and 7
 9        foreach b' ∈ post_bel(M)(b, α) do
10          P^K(b, α, b') ← P^B(b, α, b'), R^K(b, α, b') ← R^B(b, α, b')
11          if b' ∉ S^K then S^K ← S^K ∪ {b'}, Q ← Q ∪ {b'}
12    else                                     // apply cut-off and clipping to b
13      P^K(b, cut, b_cut) ← 1, R^K(b, cut, b_cut) ← V(b)  // add cut-off transition
14      choose a finite set B ⊆ B_M of clipping candidates for b
15      b̃, Δ_{b→b̃}, δ_{b→b̃} ← solveClippingMILP(b, B)
16      if b̃ ≠ b and b̃ is adequate then                    // clip b using b̃
17        P^K(b, clip, b̃) ← 1 − Δ_{b→b̃}, P^K(b, clip, b_cut) ← Δ_{b→b̃}
18        R^K(b, clip, b̃) ← 0, R^K(b, clip, b_cut) ← Σ_{s∈supp(b)} (δ_{b→b̃}(s) / Δ_{b→b̃}) · L(s)
19        if b̃ ∉ S^K then S^K ← S^K ∪ {b̃}, Q ← Q ∪ {b̃}
20  return K_M = ⟨S^K, Act ⊎ {goal, cut, clip}, P^K, b_init⟩ and R^K
```

Algorithm 1: Belief exploration algorithm with cut-offs and clipping

A trivial solution of the MILP is always obtained by setting a_{b'} and Δ to 1 and δ_s to b(s) for all s, for an arbitrary b' ∈ B. This corresponds to an invalid belief clip. However, as we minimise the value of Δ, we can conclude that no belief in the candidate set is adequate for clipping if Δ is 1 in an optimal solution.

Theorem 1. An optimal solution to the belief clipping MILP for belief b and candidate set B sets a_{b̃} to 1 and Δ to a value below 1 iff b̃ ∈ B is an adequate clipping candidate for b with minimal clipping value.

#### 3.3 Algorithm

We incorporate belief cut-offs and belief clipping into an algorithmic framework outlined in Algorithm 1. As input, the algorithm takes an instance of Problems 1 and 2, i.e. a POMDP M with reward structure R and goal states G. In addition, the algorithm considers an under-approximative value function V (Sect. 3.1) and a function L for the computation of corrective reward values (Sect. 3.2).

Lines 1 and 2 initialise the state set S^K of the under-approximative MDP K_M with the initial belief b_init and the designated goal state b_cut, which has only one transition to itself with reward 0. Furthermore, we initialise the exploration set Q by adding b_init (Line 3). During the computation, Q keeps track of all beliefs we still need to process. We then execute the exploration loop (Lines 4 to 19) until Q becomes empty. In each exploration step, a belief b is selected<sup>5</sup> and removed from Q. There are three cases for the currently processed belief b.

If supp(b) ⊆ G, i.e. b is a goal belief, we add a self-loop with reward 0 to b and continue with the next belief (Line 6). The belief b is not expanded, as successors of goal beliefs will not influence the result of the computation.

If b is not a goal belief, we use a heuristic function<sup>6</sup> exploreBelief to decide whether b is expanded in Line 7. Lines 8 to 11 outline the expansion step. The transitions from b to its successor beliefs and the corresponding rewards as in the original belief MDP (see Sect. 2.2) are added. Furthermore, the successor beliefs that have not been encountered before are added to the set of states S^K and to the exploration set Q.

If b is not expanded, we apply the cut-off approach and the clipping approach to b in Lines 12 to 19. In Line 13 we add a cut-off transition from b to b_cut with a new action cut. We use the given under-approximative value function V to compute the cut-off reward. Towards the clipping approach, a set of candidate beliefs is chosen and the belief clipping MILP for b and the candidate set is constructed as described in Def. 11 (Lines 14 and 15). If an adequate candidate b̃ with clipping values Δ_{b→b̃} and δ_{b→b̃}(s) for s ∈ supp(b) has been found, we add the transitions from b to b_cut and to b̃ using a new action clip, with probabilities Δ_{b→b̃} and 1 − Δ_{b→b̃}, respectively. Furthermore, we equip the transitions with reward values as described in Sect. 3.2 using the given function L (Lines 16 to 18). If the clipping candidate b̃ has not been encountered before, we add it to the state space of the MDP and to the exploration set in Line 19.

The result of the algorithm is an MDP K_M with reward structure R^K. The set of states S^K of K_M contains all encountered beliefs. To guarantee termination of the algorithm, the decision heuristic exploreBelief has to stop exploring further beliefs at some point. Moreover, the handling of clipping candidates in Line 19 must not add new beliefs to Q infinitely often. We therefore fix a finite set of candidate beliefs B# ⊆ B_M and make sure that the candidate sets B in Line 14 satisfy (B \ S^K) ⊆ B#. To ensure a certain progress in the exploration, "clip-cycles", i.e. paths of the form b_1 clip ... clip b_n clip b_1, are avoided in K_M. This can be done, e.g. by always expanding the candidate beliefs b ∈ B#.

Expected total rewards until reaching the extended set of goal beliefs G_cut := G_B ∪ {b_cut} in K_M under-approximate the values in the belief MDP:

Theorem 2. For all beliefs b ∈ S^K \ {b_cut} it holds that

$$\mathsf{ER}^{\max}_{\mathcal{K}_\mathcal{M},\mathbf{R}^{\mathcal{K}}}(b \models \Diamond G_{cut}) \leq V^*(b) = \mathsf{ER}^{\max}_{bel(\mathcal{M}),\mathbf{R}^{\mathcal{B}}}(b \models \Diamond G_{\mathcal{B}}).$$

Corollary 1. $$\mathsf{ER}^{\max}_{\mathcal{K}_\mathcal{M},\mathbf{R}^{\mathcal{K}}}(\Diamond G_{cut}) \leq \mathsf{ER}^{\max}_{\mathcal{M},\mathbf{R}}(\Diamond G).$$

<sup>5</sup> For example, Q can be implemented as a FIFO queue.

<sup>6</sup> The decision can be made for example by considering the size of the already explored state space such that the expansion is stopped if a size threshold has been reached. More involved decision heuristics are subject to further research.


Table 1. Results for benchmark POMDPs with maximisation objective

#### 4 Experimental Evaluation

Implementation details. We integrated Algorithm 1 in the probabilistic model checker Storm [23] as an extension of the POMDP verification framework described in [8]. Inputs are a POMDP, encoded either explicitly or using an extension of the Prism language [37], and a property specification. Internally, POMDPs and MDPs are represented using sparse matrices. The implementation supports minimisation<sup>7</sup> and maximisation of reachability probabilities, reach-avoid probabilities (i.e. the probability to avoid a set of bad states until a set of goal states is reached), and expected total rewards. In a preprocessing step, the functions V and L considered in Algorithm 1 are generated. For V, we consider the function U^σ as in Lemma 1, where σ is a memoryless observation-based policy given by a heuristic<sup>8</sup>. For the function L, we apply standard MDP analysis on the underlying MDP. When exploring the abstraction MDP K_M, our heuristic expands a belief iff |S^K| ≤ |S| · max_{z∈Z} |O^{-1}(z)|, where |S^K| is the number of already explored beliefs and |O^{-1}(z)| is the number of POMDP states with observation z. Belief clipping can either be disabled entirely, or we consider candidate sets B ⊆ B#_η, where B#_η := { b ∈ B_M | ∀s ∈ S : b(s) ∈ { i/η | i ∈ N, 0 ≤ i ≤ η } } forms a finite, regular grid of beliefs with resolution η ∈ N \ {0}. Grid beliefs b ∈ B#_η are always expanded.
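To make the grid B#_η concrete, the following sketch (our own illustration, not Storm's implementation) enumerates all beliefs over a given support whose probabilities are multiples of 1/η:

```python
def grid_beliefs(states, eta):
    """Enumerate the finite grid of beliefs over `states` whose entries are
    multiples of 1/eta and sum to 1 (only positive entries are stored)."""
    result = []
    def assign(rest, left, acc):
        if not rest:
            if left == 0:
                result.append(dict(acc))
            return
        s, *tail = rest
        for i in range(left + 1):  # give probability mass i/eta to state s
            assign(tail, left - i, acc + [(s, i / eta)] if i else acc)
    assign(list(states), eta, [])
    return result

beliefs = grid_beliefs(["s0", "s1", "s2"], 2)
print(len(beliefs))  # 6 grid beliefs over 3 states with resolution eta = 2
```

The grid over k states with resolution η has C(η + k − 1, k − 1) elements, which is why larger candidate sets quickly become expensive.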

<sup>7</sup> For minimisation, the under-approximation yields upper bounds.

<sup>8</sup> The heuristic uses optimal values obtained on the fully observable underlying MDP.


Table 2. Results for benchmark POMDPs with minimisation objective

Furthermore, we exclude clipping candidates b̃ with δ_{b→b̃}(s) > 0 for some s with L(s) = −∞; clipping with such candidates is not useful as it induces a value of −∞. Expected total rewards on fully observable MDPs are computed using Sound Value Iteration [39] with relative precision 10^{-6}. MILPs are solved using Gurobi [21].

Set-up. We evaluate our under-approximation approach with cut-offs only and with the belief clipping procedure enabled using grid resolutions η = 2, 3, 4, 6. We consider the same POMDP benchmarks<sup>9</sup> as in [37,8]. The POMDPs are scalable versions of case studies stemming from various application domains. To establish an external baseline, we compare with the approach of [37] implemented in Prism [31]. Prism generates an under-approximation based on an optimal policy for an over-approximative MDP, which, in contrast to Storm, means that both under- and over-approximations always have to be computed. We ran Prism with resolutions η = 2, 3, 4, 6, 8, 10 and report on the best approximation obtained. To provide a further reference for the tightness of our under-approximation, we compute over-approximative bounds as in [8] using the implementation in Storm with a resolution of η = 8. All experiments were run on an Intel® Xeon® Platinum 8160 CPU using 4 threads<sup>10</sup>, 64GB RAM and a time limit of 2 hours.

Results. Tables 1 and 2 show our results for maximising and minimising properties, respectively. The first columns contain for each POMDP the benchmark name,

<sup>9</sup> Instances with a finite belief MDP that would be fully explored by our algorithm are omitted since the exact value can be obtained without approximation techniques.

<sup>10</sup> For our implementation, only Gurobi runs multi-threaded. Prism uses multiple threads for garbage collection.

Figure 5. Accuracy for Drone 4-2 with different sizes of approximation MDP K_M

model parameters, property type (probabilities (P) or rewards (R)), and the numbers of states, state-action pairs, and observations. Column Prism gives the result with the smallest gap between over- and under-approximation computed with the approach of [37]. For maximising (minimising) properties, our approach competes with the lower (upper) bound of the provided interval. The relevant value is marked in bold. We also provide the computation time and the considered resolution η. For our implementation, we give results for the configuration with disabled clipping and for clipping with different resolutions η. In each cell, we give the obtained value, the computation time, and the number of states in the abstraction MDP K_M. Time- and memory-outs are indicated by TO and MO. The right-most column indicates the over-approximation value computed via [8].

Discussion. The pure cut-off approach yields valid under-approximations in all benchmark instances, often exceeding the accuracy of the approach of [37] while being consistently faster. In some cases, the resulting values improve when clipping is enabled. However, larger candidate sets significantly increase the computation time, which stems from the fact that many clipping MILPs have to be solved.

For Drone 4-2, Fig. 5 plots the resulting under-approximation values (y-axis) for varying sizes of the explored MDP K_M (x-axis). The horizontal, dashed line indicates the computed over-approximation value. The quality of the approximation further improves with an increased number of explored beliefs.

#### 5 Conclusion

We presented techniques to safely under-approximate expected total rewards in POMDPs. The approach scales to large POMDPs and often produces tight lower bounds. Belief clipping generally does not improve on the simpler cut-off approach in terms of results and performance. However, considering—and optimising—the approach for particular classes of POMDPs might prove beneficial. Future work includes integrating the algorithm into a refinement loop that also considers over-approximation techniques from [8]. Furthermore, lifting our approach to partially observable stochastic games is promising.

Data Availability. The artifact [9] accompanying this paper contains source code, benchmark files, and replication scripts for our experiments.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Correct Probabilistic Model Checking with Floating-Point Arithmetic<sup>⋆</sup>

Arnd Hartmanns

University of Twente, Enschede, The Netherlands a.hartmanns@utwente.nl

Abstract. Probabilistic model checking computes probabilities and expected values related to designated behaviours of interest in Markov models. As a formal verification approach, it is applied to critical systems; thus we trust that probabilistic model checkers deliver correct results. To achieve scalability and performance, however, these tools use finite-precision floating-point numbers to represent and calculate probabilities and other values. As a consequence, their results are affected by rounding errors that may accumulate and interact in hard-to-predict ways. In this paper, we show how to implement fast and correct probabilistic model checking by exploiting the ability of current hardware to control the direction of rounding in floating-point calculations. We outline the complications in achieving correct rounding from higher-level programming languages, describe our implementation as part of the Modest Toolset's mcsta model checker, and exemplify the tradeoffs between performance and correctness in an extensive experimental evaluation across different operating systems and CPU architectures.

# 1 Introduction

Given a Markov chain or Markov decision process (MDP [25]) model of a safety- or performance-critical system, probabilistic model checking (PMC) calculates quantitative properties of interest: the probability of (rare or catastrophic) failures, the expected recovery time after service interruption, or the long-run average throughput. These properties involve probabilities or expected costs/rewards of sets of model behaviours, and are often specified in a temporal logic like PCTL [16]. As a formal verification approach, users place great trust in the results delivered by a PMC tool such as Prism [22], Storm [9], ePMC [15], or the Modest Toolset's [18] mcsta. In contrast to classical model checkers for functional, Boolean-valued properties specified in e.g. LTL or CTL [2], a probabilistic model checker is inherently quantitative: the input model contains real-valued probabilities and costs/rewards; PCTL makes comparisons between real-valued constants and probabilities; the most efficient algorithms numerically iterate towards a fixpoint; and the final result itself may well be a real number.

<sup>⋆</sup> This work was supported by NWO VENI grant no. 639.021.754 and the EU's Horizon 2020 research and innovation programme under MSCA grant agreement 101008233.

Often, we can restrict to rationals, which simplifies the theory and facilitates "exact" algorithms using arbitrary-precision rational number datatypes. However, these algorithms only work for small models (as shown in the most recent QComp 2020 competition of quantitative verification tools [6]). In this paper, we thus focus on the PMC techniques that scale to large problems: those building upon iterative numerical algorithms, in particular value iteration (VI) [8]. We restrict to probabilistic reachability, i.e. calculating the probability to eventually reach a goal state, as this is the core problem in PMC for MDP. Embedded in the usual recursive CTL algorithm, it allows us to check any (unbounded) PCTL formula.

Starting from a trivial underapproximation of the reachability probability for each state of the model, VI iteratively improves the value of each state based on its successors' values. The true reachability probabilities are the least fixpoint of this procedure, towards which the algorithm converges. For roughly a decade, PMC tools implemented VI by stopping once the relative or absolute difference between subsequent iterations was below a threshold ε. Haddad and Monmege [12] showed in 2014<sup>1</sup> that this does not guarantee a difference of ≤ ε between the reported and the true probability, putting in question the trust placed in PMC tools. Then variants of VI were developed that provide sound, i.e. ε-correct, results: interval iteration (II) [3,5,13], sound value iteration (SVI) [26], and optimistic value iteration (OVI) [19]. We focus on II as the prototypical sound algorithm. It additionally iterates on an overapproximation; its stopping criterion is the difference between over- and underapproximation being ≤ ε.

If all probabilities in an MDP are rational numbers, then the true reachability probability as well as all intermediate values in II are rational, too. Yet implementing II with arbitrary-precision rationals is impractical since the smaller-and-smaller differences between intermediate values end up using excessive computation time and memory. II is thus implemented with fixed-precision (usually 64-bit IEEE 754 double-precision) floating-point numbers. These, however, cannot represent all rationals, so operations must round to nearby representable values. Although II is numerically benign, consisting only of multiplications and additions within [0, 1], the default round to nearest, ties to even policy can cause II to deliver incorrect results. Wimmer et al. [29] show an example where PMC tools incorrectly state that a simple PCTL property is satisfied by a small Markov chain due to the underlying numeric difference having disappeared in rounding. We confirmed with current versions of Prism, Storm, and mcsta that the problem persists to today, even when requesting a "sound" algorithm like II. Wimmer et al. propose interval arithmetic to avoid such problems, cautioning that

[...] the memory consumption will roughly double, since two numbers for the interval bounds have to be stored [...]. The runtime will be higher by a small factor, because we need to derive lower and upper bounds for the intervals, requiring two model checking runs per sub-formula. [29, p. 5]

They did not provide an implementation, and we are not aware of any to date.

<sup>1</sup> Wimmer et al. [29] already in 2008 mention this problem in a more general setting, but neither give a concrete counterexample nor propose a solution tailored to PMC.

Our contribution. We present the first PMC implementation that computes correct lower and upper bounds on reachability probabilities despite using floating-point arithmetic. We benefit from two developments since Wimmer et al.'s paper of 2008: First, II (published 2014) already uses intervals (though not as Wimmer et al. envisioned), necessarily doubling memory consumption compared to VI (as do SVI and OVI, so it appears to be an unavoidable cost of soundness). In place of "two model checking runs per sub-formula", we can make the two interleaved computations inside II safe w.r.t. rounding. Second, hardware and programming language support for controlling the rounding direction in floating-point operations has improved, in particular with the AVX-512 instruction set in the newest x86-64 CPUs and widespread compiler support for C99's "floating-point environment" header fenv.h. Nevertheless, it is nontrivial to achieve a runtime that is only "higher by a small factor". For the analysis of probabilistic systems, the only related use of safe rounding we are aware of is in the SSMT tool SiSAT [27].

Structure. We recap PMC and II (Sect. 2) as well as problems and solutions related to rounding in floating-point arithmetic in Sect. 3. We then present our new approach in Sect. 4, including important implementation aspects. The performance of our approach is crucial to its adoption in tools; thus in Sect. 5 we report on extensive experiments across different software and hardware configurations on models from the Quantitative Verification Benchmark Set (QVBS) [20].

#### 2 Probabilistic Model Checking

We write { x₁ ↦ y₁, ... } to denote the function that maps all x_i to y_i. Given a set S, its powerset is 2^S. A (discrete) probability distribution over S is a function µ ∈ S → [0, 1] with countable support spt(µ) ≝ { s ∈ S | µ(s) > 0 } and Σ_{s∈spt(µ)} µ(s) = 1. Dist(S) is the set of all probability distributions over S. If µ(s) ∈ Q for all s ∈ S, we call µ a rational probability distribution, in Dist_Q(S).

Markov decision processes (MDP) [25] combine the nondeterminism of Kripke structures with the finite random choices of discrete-time Markov chains (DTMC).

Definition 1. A Markov decision process (MDP) is a triple M = ⟨S, s_I, T⟩ where S is a finite set of states with initial state s_I ∈ S and T : S → 2^{Dist_Q(S)} is the transition function. T(s) must be finite and non-empty for all s ∈ S.

For s ∈ S, an element µ of T(s) is a transition, and if s' ∈ spt(µ), then the transition has a branch to successor state s' with probability µ(s'). If |T(s)| = 1 for all s ∈ S, then M is a DTMC.
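Definition 1 translates directly into code. A minimal sketch (with representation choices of our own) that models T(s) as a list of rational distributions, so that every probability is stored exactly:

```python
from fractions import Fraction

# M = (S, s_I, T) with T(s) a finite, non-empty set of rational distributions;
# here every state has exactly one transition, so this M is in fact a DTMC.
mdp = {
    "S": {0, 1, 2},
    "s_I": 0,
    "T": {
        0: [{1: Fraction(1, 2), 2: Fraction(1, 2)}],
        1: [{1: Fraction(1)}],  # absorbing
        2: [{2: Fraction(1)}],  # absorbing
    },
}

# every transition mu in T(s) must be a probability distribution:
ok = all(sum(mu.values()) == 1 for ts in mdp["T"].values() for mu in ts)
print(ok)  # True
```

Using `Fraction` here mirrors the restriction to Dist_Q(S); the floating-point representation discussed in Sect. 3 replaces these exact values with rounded doubles.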

Example 1. Fig. 1 shows our example MDP M_n^γ, which is actually a DTMC. It is a simplified and parametrised version of the counterexample of Wimmer et al. [29, Fig. 2]. It is parametrised in terms of n ∈ N (determining the number of chained states with transitions labelled b) and γ ∈ (0, 0.5) (changing some probabilities). We draw transitions as lines to an intermediate node from which

Fig. 1. Example parametrised MDP M_n^γ

probability-labelled branches lead to successor states. We omit the intermediate node for transitions with a single branch, and label some transitions to easily refer to them. M_n^γ has 4 + n states and transitions, and 7 + 2n branches.

In practice, higher-level modelling languages like Modest [14] are used to specify MDP. The semantics of an MDP is captured by its paths. A path represents a concrete resolution of all nondeterministic and probabilistic choices. Formally:

Definition 2. A finite path is a sequence π_fin = s₀ µ₀ s₁ µ₁ ... µ_{n−1} s_n where s_i ∈ S for all i ∈ { 0, ..., n } and µ_i ∈ T(s_i) ∧ µ_i(s_{i+1}) > 0 for all i ∈ { 0, ..., n − 1 }. Let |π_fin| ≝ n and last(π_fin) ≝ s_n. Π_fin(s) is the set of all finite paths starting in s. A path is an analogous infinite sequence π, and Π(s) is the set of all paths starting in s. We write s ∈ π if ∃ i: s = s_i.

A scheduler (or adversary, policy or strategy) only resolves the nondeterministic choices of M. For this paper, memoryless deterministic schedulers suffice [4].

Definition 3. A function 𝔰 : S → Dist(S) is a scheduler if, for all s ∈ S, we have 𝔰(s) ∈ T(s). The set of all schedulers of M is 𝔖(M).

We are interested in reachability probabilities. Let M|_𝔰 = ⟨S, s_I, T|_𝔰⟩ with T|_𝔰(s) = { 𝔰(s) } be the DTMC induced by 𝔰 on M. Via the standard cylinder set construction [10, Sect. 2.2] on M|_𝔰, a scheduler induces probability measures P_s^{M,𝔰} on measurable sets of paths starting in s ∈ S.

Definition 4. For state s and goal state g ∈ S, the maximum and minimum probability of reaching g from s is defined as P_max^{M,s}(♦ g) = sup_{𝔰∈𝔖(M)} P_s^{M,𝔰}({ π ∈ Π(s) | g ∈ π }) and P_min^{M,s}(♦ g) = inf_{𝔰∈𝔖(M)} P_s^{M,𝔰}({ π ∈ Π(s) | g ∈ π }), respectively.

The definition extends to sets G of goal states. We omit the superscript for M when it is clear from the context, and if we omit that for s, then s = s_I. From now on, whenever we have an MDP with a set of goal states G, we assume w.l.o.g. that all g ∈ G are absorbing, i.e. every g only has one self-loop transition.

Definition 5. A maximal end component (MEC) of M is a maximal (sub-)MDP ⟨S', T', s'_I⟩ where S' ⊆ S, T'(s) ⊆ T(s) for all s ∈ S', and the directed graph with vertex set S' and edge set { ⟨s, s'⟩ | ∃ µ ∈ T'(s): µ(s') > 0 } is strongly connected.

```
 1 function II(M = ⟨S, s_I, T⟩, G, opt, ε)
     // Preprocessing
 2   if opt = max then M := CollapseMECs(M, G)          // collapse MECs
 3   S0 := Prob0(M, G, opt), S1 := Prob1(M, G, opt)     // identify 0/1 states
 4   l := { s ↦ 0 | s ∈ S \ S1 } ∪ { s ↦ 1 | s ∈ S1 }   // initialise lower vector
 5   u := { s ↦ 0 | s ∈ S0 } ∪ { s ↦ 1 | s ∈ S \ S0 }   // initialise upper vector
     // Iteration
 6   while (u(s_I) − l(s_I)) / l(s_I) > ε do            // while relative error > ε:
 7     foreach s ∈ S \ (S0 ∪ S1) do                     // update non-0/1 states:
 8       l(s) := opt_{µ∈T(s)} Σ_{s'∈spt(µ)} µ(s') · l(s')   // iterate lower vector
 9       u(s) := opt_{µ∈T(s)} Σ_{s'∈spt(µ)} µ(s') · u(s')   // iterate upper vector
10   return ½ · (u(s_I) + l(s_I))
```
Alg. 1: Interval iteration for probabilistic reachability

#### 2.1 Algorithms

Interval iteration [3,5,12,13] computes reachability probabilities p(s) = P_opt^s(♦ G), opt ∈ { max, min }. We show the basic algorithm as Alg. 1. It iteratively refines vectors l and u that map each state to a value in Q such that, at all times, we have l(s) ≤ p(s) ≤ u(s). In each iteration, the values in l and u are updated for all relevant states (line 7) via the classic Bellman equations of value iteration (lines 8-9). Their least fixpoint is p, towards which l converges from below. Some preprocessing is needed to ensure that the fixpoint is unique and u also converges towards p: for maximisation, we need to collapse MECs into single states (line 2). This can be done via graph-based algorithms (see e.g. [7]) that only consider the graph structure of the MDP as in Definition 1 but do not perform calculations with the concrete probability values. For both maximisation and minimisation, we need to identify the sets S0 and S1 such that ∀s ∈ S0: p(s) = 0 and ∀s ∈ S1: p(s) = 1 (line 3). This can equally be done via graph-based algorithms [10, Algs. 1-4]. We then initialise l and u to trivial under-/overapproximations of p (lines 4-5). Iteration stops when the relative difference between l and u at s_I is at most ε (which is often chosen as 10^{-3} or 10^{-6}). The corresponding check in line 6 assumes that division by zero results in +∞, as is the default in IEEE 754. By convergence of l and u towards the fixpoint, II terminates, and we eventually return a value p̂ with the guarantee that p(s_I) ∈ [(1 − ε) · p̂, (1 + ε) · p̂]. This makes II sound.
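A runnable sketch of Alg. 1 for the DTMC case (so opt and MEC collapsing are not needed), assuming S0 and S1 have already been determined by graph analysis; this is our own illustration, not any tool's implementation:

```python
def interval_iteration(trans, s_init, S0, S1, eps=1e-6):
    """Interval iteration for a DTMC: trans[s] maps successor states to
    probabilities; returns the midpoint of [l(s_init), u(s_init)]."""
    n = len(trans)
    l = [1.0 if s in S1 else 0.0 for s in range(n)]
    u = [0.0 if s in S0 else 1.0 for s in range(n)]
    def rel_err():
        d = u[s_init] - l[s_init]
        return d / l[s_init] if l[s_init] > 0 else float('inf')  # x/0 = +inf
    while rel_err() > eps:
        for s in range(n):
            if s in S0 or s in S1:
                continue  # 0/1 states stay fixed
            l[s] = sum(p * l[t] for t, p in trans[s].items())
            u[s] = sum(p * u[t] for t, p in trans[s].items())
    return 0.5 * (u[s_init] + l[s_init])

# state 0 loops w.p. 0.5, reaches goal 1 w.p. 0.3, sink 2 w.p. 0.2;
# the true reachability probability is 0.3 / (1 - 0.5) = 0.6
trans = {0: {0: 0.5, 1: 0.3, 2: 0.2}, 1: {1: 1.0}, 2: {2: 1.0}}
print(interval_iteration(trans, 0, S0={2}, S1={1}))  # ≈ 0.6
```

With default rounding, the sums in the inner loop may round up in l or down in u, which is exactly the effect that safe directed rounding (Sect. 4) eliminates.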

PCTL. The temporal logic PCTL [16] allows us to construct complex branching-time properties. It takes standard CTL [2] and replaces the A(ψ) ("for all paths ψ holds") and E(ψ) ("there exists a path for which ψ holds") operators by the probabilistic operator P_{∼c}(ψ) for "under all schedulers, the probability of the measurable set of paths for which ψ holds is ∼ c", where ∼ ∈ { <, ≤, >, ≥ } and c ∈ [0, 1]. To model-check a PCTL formula on MDP M, we follow the standard recursive CTL model checking algorithm [2, Sect. 6.4] except for the P operator, which can be reduced to computing reachability probabilities. For the "finally"/"eventually" case P_{∼c}(F φ), we can directly use interval iteration: Let S_φ be the set of states recursively determined to satisfy φ. Call II(M, S_φ, opt_∼, ε) of Alg. 1 with opt_∼ = max if ∼ ∈ { <, ≤ } and opt_∼ = min otherwise, with two modifications: Change the stopping criterion of line 6 to check the difference for all states, and in line 10, return the set S_P ≝ { s ∈ S | ∀x ∈ [l(s), u(s)]: x ∼ c }. If for some s ∈ S the interval [l(s), u(s)] contains both values x with x ∼ c and values with x ≁ c, however, we would need to either abort and report an "unknown" situation, or continue with a reduced ε until we can (hopefully eventually) decide the comparison. None of Prism, Storm, and mcsta appear to perform this extra check, though. In this paper, we only use PCTL for non-nested top-level P_∗(F ...) operators; the results are then true if s_I ∈ S_P, should be unknown in case the "unknown" situation applies to s_I, and are false otherwise.

# 3 Floating-Point Arithmetic

The current implementations of II (in Prism, Storm, and mcsta) use IEEE 754 double-precision floating-point arithmetic to represent (a) the probabilities of the MDP's branches and (b) the values in l and u. A floating-point number is stored as a significand d and an exponent e w.r.t. an agreed-upon base b such that it represents the value d · b^e. We fix b = 2. IEEE 754 double precision uses 64 bits in total, of which 1 is a sign bit, 52 are for d, and 11 are for e. Standard alternatives are 32-bit single precision (1 sign bit, 23 bits for d, and 8 for e) and the 80-bit x87 extended precision format (with 1 sign bit, 64 for d, and 15 for e). The subset of Q that can be represented in such a representation is determined by the numbers of bits for d and e. For example, 1/2 or 7/8 can be represented exactly in all formats, but 1/10 cannot. IEEE 754 prescribes that all basic operations (addition, multiplication, etc.) are performed at "infinite precision" with the result rounded to a representable number. The default rounding mode is to round to the nearest such number, choosing an even value in case of ties (round to nearest, ties to even). In single precision, 1/10 is thus by default rounded to

> 13421773 · 2^−27 = 0.100000001490116119384765625.

A single rounded operation leads to an error of at most the distance between the two nearest representable numbers. In iterative computations, however, rounding may happen at every step. A striking example of the consequences is the failure of an American Patriot missile battery to intercept an incoming Iraqi Scud missile in February 1992 in Dhahran, Saudi Arabia [28], which resulted in 28 fatalities. The Patriot system calculated time in seconds by multiplying its internal clock's value by a rounded binary representation of 1/10. After 100 hours of continuous operation, this led to a cumulative rounding error large enough to miscalculate the incoming missile's position by more than half a kilometre [1].

#### 3.1 Errors in Probabilistic Model Checking

II accumulates and multiplies rounded floating-point values in the l and u vectors with potentially already-rounded values representing the rational probabilities of the model. Using the default rounding mode, how can we be sure that the final result does not miss the true probability by more than half a kilometre, too?

Following Wimmer et al. [29], let us consider MDP M^γ_n of Fig. 1 again, and determine whether P_{≤ 1/2}(◊ { s_+ }) holds. The model is acyclic, so it is easy to see that

$$p \stackrel{\text{def}}{=} \mathcal{P}\_{\text{max}}(\diamond \{ \, \, s\_+ \}) = \frac{1}{2} + \gamma^{n+2} > \frac{1}{2}.$$

Let us fix n = 1 and γ = 10^−6. Then p = 1/2 + 10^−18. This value cannot be represented in double precision, and is by default rounded to 0.5.

We have encoded M^γ_n in the Modest and Prism languages, and checked the answers returned by Prism 4.7, Storm 1.6.4, and mcsta 3.1 for the property. The correct result would be false. Prism returns true in its default configuration, which uses an unsound algorithm, and false when requesting an algorithm with exact rational arithmetic, for which M^γ_n is small enough. If we explicitly request Prism to use II, then the result depends on the specified ε: for ε ≥ 10^−11, we get the correct result of false; for smaller ε ≤ 10^−12, i.e. higher precision, however, we incorrectly get true. Storm incorrectly returns true in its default configuration as well as when we request a sound algorithm via the --sound parameter. Only when using an exact rational algorithm via the --exact parameter does Storm correctly return false. mcsta, when using II (--alg IntervalIteration), incorrectly returns true, and additionally reports that it computed [l(s_I), u(s_I)] as [0.5, 0.5], thus not including the true value of p. Other algorithms are not immune to the problem, either; for example, mcsta also answers true when using SVI, OVI, and when solving the MDP as a linear programming problem via the Google OR Tools' GLOP LP solver.

This example shows that using a sound algorithm does not guarantee correct results. The problem is not specific to cases of small probabilities like γ = 10^−6 in the MDP; we can achieve the same effect using arbitrarily higher values of γ if we just increase n a little. Such bounded try-and-retry chains, where "normal" probabilities in the model result in very small values during iteration and in the final result, are not uncommon in the systems often modelled as MDPs, e.g. backoff schemes in communication protocols and randomised algorithms. In general, tiny differences in probabilities in one place may result in significant changes of the overall reachability probability; for example, in two-dimensional random walks, the long-run behaviour when the probabilities to move forward or backward are both 1/2 is vastly different from when they are 1/2 + δ and 1/2 − δ, respectively, for any δ > 0.

#### 3.2 On Precision and Rounding Modes

In our concrete example, we may be able to avoid the problem by increasing precision: in the 80-bit extended format supported by all x86-64 CPUs, 1/2 + 10^−18 is by default rounded to 5.000000000000000009... · 10^−1, so there is a chance of obtaining false unless other rounding during iterations loses all the difference. Extended precision is used for C's long double type by e.g. the GCC compiler; it is thus readily accessible to programmers. It is, however, the most precise format supported in common CPUs today; if we need more precision, we would have to resort to much slower software implementations using e.g. the GNU MPFR library. Any a priori fixed precision, however, just shifts the problem to smaller differences, but does not eliminate it.

The more general solution that we propose in this paper is to control the rounding mode of the floating-point operations performed in the II algorithm. In addition to the default round to nearest, ties to even mode, the IEEE 754 standard defines three directed rounding modes: round towards zero (i.e. truncation), round towards +∞ (i.e. always round up), and round towards −∞ (i.e. always round down). As we will explain in Sect. 4, using the latter two gives us an easy way to make the computations inside II safe, i.e. guarantee the under- and overapproximation invariants for l and u, respectively. Control of the floating-point rounding mode, however, appears to be a rarely used feature of IEEE 754 implementations; consequently, the level and style of support for it in CPUs and high-level programming languages is diverse.

#### 3.3 CPU Support for Rounding Modes

Storm and mcsta run exclusively on x86-64 systems (with the upcoming ARM-based systems so far only supported via their x86-64 emulation layers), while Prism additionally supports several other platforms via manual compilation. We thus focus on x86-64 in this paper as the platform probabilistic model checkers overwhelmingly run on today.

X87 and SSE. All x86-64 CPUs support two instruction sets to perform floating-point operations in double precision: the x87 instruction set, originating from the 8087 floating-point coprocessor, and the SSE instruction set, which includes support for double precision since the Pentium 4's SSE2 extension. Both implement operations according to the IEEE 754 standard. Aside from architectural particularities such as its stack-based approach to managing registers, the x87 instruction set notably includes support for 80-bit extended precision. In fact, by default, it performs all calculations in that extended precision, only rounding to double or single precision when storing values back to 64- or 32-bit memory locations. This has the advantage of reducing the error across sequences of operations, but for high-level languages it makes the results depend on the compiler's choices of when to load/store intermediate values in memory vs. keeping them in x87 registers. The SSE instructions only support single and double precision.

Both the x87 and SSE instruction sets support all four rounding modes mentioned above. The rounding mode of operations for x87 and SSE is determined by the current value of the x87 FPU control word stored in the x87 FPU control register or the current value of the SSE MXCSR control register, respectively. That is, to change rounding mode, we need to obtain the current control register value, change the two bits determining rounding mode (with the other bits controlling other aspects of floating-point operations such as the treatment of NaNs), and apply the new value. This is done via the FNSTCW/FLDCW instruction pair on x87, and VSTMXCSR/VLDMXCSR for SSE. Rounding mode is thus part of the global (per-thread) state, and we must be careful to restore its original configuration when returning to code that does not expect rounding mode changes. Frequent changes of rounding mode thus incur a performance overhead due to the extra instructions that must be executed for every change and their effects on e.g. pipelining.

AVX-512. AVX-512 is the extension to 512 bits of the sequence of single instruction, multiple data (SIMD) instruction sets in x86-64 processors that started with SSE. It became available for general-purpose systems in high-end desktop (Skylake-X) and server (Xeon) CPUs in 2017, but it took until the 10th generation of Intel's Core mobile CPUs in 2019 before it was more widely available in end-user systems. It is supposed to appear in AMD CPUs with the upcoming Zen 4 architecture. Aside from its 512-bit SIMD instructions, AVX-512 crucially also includes new instructions for single floating-point values where the operation's rounding mode is specified as part of the instruction itself via the new "EVEX" encoding. Of particular note for implementing II are the new VFMADD(r1r2r3)SD fused multiply-add instructions (the r_i determining how the operand registers are used) that can directly be used for the sums of products in the Bellman equations in lines 8-9 of Alg. 1. Overall, AVX-512 thus makes rounding mode independent of global state, and may improve performance by removing the need for extra instruction sequences to change rounding mode.

#### 3.4 Rounding Modes in Programming Languages

Support for non-default rounding modes is lacking in most high-level programming languages. Java, C#, and Python, for example, do not support them at all. If II is implemented in such a language, there is consequently no hope for a high-performance solution to the rounding problems described earlier.

For C and C++, the C99 and C++11 standards introduced access to the floating-point environment. The fenv.h/cfenv headers include the fegetround and fesetround functions to query the current rounding mode and change it, respectively. Implementations of these functions on x86-64 read/change both the x87 and SSE control registers accordingly. In the remainder of this paper, we focus on a C implementation, but most statements hold for C++ analogously. The level of support for the C99 floating-point features varies significantly between compilers; it is in particular still incomplete in Clang<sup>2</sup> and GCC [11, Further notes]. Still, both compilers provide access to the fegetround/fesetround functions (via the associated standard libraries), but GCC in particular is not rounding mode-aware in optimisations. This means that, for example, subexpressions that are evaluated twice, with a change in rounding mode in between, may be compiled by GCC into a single evaluation before the change, with the resulting value stored in a register and reused after the rounding mode change. This can

<sup>2</sup> The documentation as of October 2021 states that C99 support in Clang "is feature-complete except for the C99 floating-point pragmas".

even happen when using the -frounding-math option<sup>3</sup> . Programmers thus need to inspect the generated assembly to ensure that no problematic transformations have been made, or try to make them impossible by declaring values volatile or inserting inline assembly "barriers".

Overall, C thus provides a standardised way to change x87/SSE rounding mode, but programmers need to be aware of compiler quirks when using these facilities. Support for AVX-512 instructions that include rounding mode bits in C, on the other hand, is only slightly more convenient than programming in assembly as we can use the intrinsics in the immintrin.h header; there is no standard higher-level abstraction of this feature in either C or C++.

# 4 Correctly Rounding Interval Iteration

Let us now change II as in Alg. 1 to consistently round in safe directions at every numeric operation. Given that we can change or specify the rounding mode of all basic floating-point operations on current hardware, we expect that a high-performance implementation can be achieved. First, the preprocessing steps require no changes as they are purely graph-based. The changes to the iteration part of the algorithm are straightforward: In line 6,

$$\text{while } (u(s\_I) - l(s\_I))/l(s\_I) > \epsilon \text{ do} \dots,$$

we round the results of the subtraction and of the division towards +∞ to avoid stopping too early. In line 8,

$$l(s) := \operatorname{opt}\_{\mu \in T(s)} \sum\_{s' \in \operatorname{spt}(\mu)} \mu(s') \cdot l(s'),$$

the multiplications and additions round towards −∞, while the corresponding operations on the upper bound in line 9 round towards +∞. Recall that all probabilities in the MDP are rational numbers, i.e. representable as num/den with num, den ∈ N. We assume that num and den can be represented exactly in the implementation. Then, in line 8, we calculate the floating-point values for the µ(s′) = num/den by rounding towards −∞. In line 9, we round the result of the corresponding division towards +∞. Finally, instead of returning the middle of the interval in line 10, we return [l(s_I), u(s_I)] so as not to lose any information (e.g. in case the result is compared to a constant as in the example of Sect. 3.1).

With these changes, we obtain an interval guaranteed to contain the true reachability probability if the algorithm terminates. However, rounding away from the theoretical fixpoint in the updates of l and u means that we may reach an effective fixpoint—where l and u no longer change because all newly computed values round down/up to the values from the previous iteration—at a point where the relative difference of l(s_I) and u(s_I) is still above ε. This will happen in practice: in QComp 2020 [6], mcsta participated in the floating-point correct track by letting VI run until it reached a fixpoint under the default rounding mode with double precision. In 9 of the 44 benchmark instances that mcsta attempted to solve in this way, the difference between this fixpoint and

<sup>3</sup> The documentation as of Oct. 2021 states that -frounding-math "does not currently guarantee to disable all GCC optimizations that are affected by rounding mode."

```
 1 function SR-SII(M = ⟨S, s_I, T⟩, G, opt, ε)
 2   ...(preprocessing as in Alg. 1)...
 3   repeat
 4     chg := false
 5     fesetround(towards −∞)
 6     foreach s ∈ S \ (S0 ∪ S1) do
 7       lnew := opt_{µ ∈ T(s)} Σ_{s′ ∈ spt(µ)} µ(s′) · l(s′)  // iterate lower vector
 8       if lnew ≠ l(s) then chg := true
 9       l(s) := lnew
10     fesetround(towards +∞)
11     foreach s ∈ S \ (S0 ∪ S1) do
12       unew := opt_{µ ∈ T(s)} Σ_{s′ ∈ spt(µ)} µ(s′) · u(s′)  // iterate upper vector
13       if unew ≠ u(s) then chg := true
14       u(s) := unew
15   until ¬chg ∨ (u(s_I) − l(s_I))/l(s_I) ≤ ε
16   return [l(s_I), u(s_I)]
```
the true value was more than the specified ε. With safe rounding away from the true fixpoint, this would likely have happened in even more cases.

To ensure termination, we thus need to make one further change to the II of Alg. 1: in each iteration of the while loop, we additionally keep track of whether any of the updates to l and u changes the previous value. If not, we end the loop and return the current interval, which will be wider than the requested relative difference. We refer to II with all of these modifications as safely rounding interleaved II (SR-III) in the remainder of this paper.

#### 4.1 Sequential Interval Iteration

When using the x87 or SSE instruction sets to implement SR-III, we need to insert a call to fesetround just before line 8, and another just before line 9. If, for an MDP with n states, we need m iterations of the while loop, we will make 2 · n · m calls to fesetround. This might significantly impact performance for models with many states, or that need many iterations (such as the haddadmonmege model of the QVBS, which requires 7 million iterations with ε = 10^−6 despite only having 41 states). As an alternative, we can rearrange the iteration phase of II as shown in Alg. 2: We first update l for all states (lines 6-9), then u for all states (lines 11-14), with the rounding mode changes in between (lines 5 and 10). We call this variant of II safely rounding sequential II (SR-SII). It only needs 2 · m calls to fesetround, which should improve its performance. However, it also changes the memory access pattern of II with an a priori unknown effect on performance. We write III for II to stress that it is interleaved, and SII for Alg. 2 without the safe rounding, in the remainder of this paper.

#### 4.2 Implementation Aspects

We have implemented III, SII, SR-III, and SR-SII in mcsta. While mcsta is written in C#, the new algorithms are (necessarily) written in C, called from the main tool via the P/Invoke mechanism. We used GCC 10.3.0 to compile our implementations on both 64-bit Linux and Windows 10. We manually inspected the disassembly of the generated code to ensure that GCC's optimisations did not interfere with rounding mode changes as described in Sect. 3.4. In a significant architectural change, we modified mcsta's state space exploration and representation code to preserve the exact rational values for the probabilities specified in the model, so that safely-rounded floating-point representations for the µ(s′) can be computed during iteration as described above.

Of each algorithm, we implemented four variants: a default one that leaves the choice of instruction set to the compiler and uses fesetround to change rounding mode; an x87 variant that forces floating-point operations to use the x87 instructions by attributing the relevant functions with target("fpmath=387") and that changes rounding mode via inline assembly using FNSTCW/FLDCW; an SSE variant that forces the SSE instruction set via target("fpmath=sse") and uses VSTMXCSR/VLDMXCSR in inline assembly for rounding mode changes; and an AVX-512 variant that implements all floating-point operations requiring non-default rounding modes via AVX-512 intrinsics, in particular using \_mm\_fmadd\_round\_sd in the Bellman equations. All variants use double precision; default and SSE additionally have a single-precision version (which we omit for x87 since the reduced precision does not speed up the operations we use); and x87 also provides an 80-bit extended-precision version (however we currently return its results as safely-rounded double-precision values due to the unavailability of a long double equivalent in C#, which limits its use outside of performance testing for now). All in all, we thus provide 28 variants of interval iteration for comparison, out of which 14 provide guaranteed correct results.

In particular, the safe rounding makes PMC feasible at 32-bit single precision, which would otherwise be too likely to produce incorrect results. While we expect that this may deliver many results with low precision (but which are correct) due to a rounded fixpoint being reached long before the relative width reaches ε, it also halves the memory needed to store l and u, and may speed up computations. At the opposite end, mcsta is now also the first PMC tool that can use 80-bit extended precision, which however doubles the memory needed for l and u since 80-bit long double values occupy 16 bytes in memory (with GCC).

# 5 Experiments

Using our implementation in mcsta, we first tested all variants of the algorithms on M^γ_n in the setting of Sect. 3.1. As expected, and validating the correctness of the approach and its implementation, all SR variants return unknown.

We then assembled a set of 31 benchmark instances—combinations of a model, values for its configurable parameters, and a property to check—from the QVBS covering DTMC, MDP, and probabilistic timed automata (PTA) [24] transformed to MDP by mcsta using the digital clocks approach [23]. These are all the models and probabilistic reachability properties from the QVBS supported by mcsta for which the result was not 0 or 1 (then it can be computed via graph-based algorithms) and for which a parameter configuration was available where PMC terminated within our timeout of 120 s but II needed enough time for it to be measured reliably (≳ 0.2 s). We checked each of these benchmarks with all 28 variants of our algorithms using ε = 10^−6 on different x86-64 systems: I11w: an Intel Core i5-1135G7 (up to 4.2 GHz) laptop running Windows 10, this being the only system we had access to with AVX-512 support; AMDw: an AMD Ryzen 9 5900X (3.7-4.8 GHz) workstation running Windows 10, representing current AMD CPUs in our evaluation; I4x: an Intel Core i7-4790 (3.6-4.0 GHz) workstation running Ubuntu Linux 18.04, representing older-generation Intel desktop hardware; and IPx: an Intel Pentium Silver J5005 (1.5-2.8 GHz) compact PC running Ubuntu Linux 18.04, representing a non-Core low-power Intel system. We show a selection of our experimental results in the remainder of this section, mainly from I11w and AMDw. We remark on cases where the other systems (all with Intel CPUs) showed different patterns from I11w.

We present results graphically as scatter plots like in Fig. 2. Each such plot compares two algorithm variants in terms of runtime for the iteration phase of the algorithm only (i.e. we exclude the time for state space exploration and preprocessing). Every point ⟨x, y⟩ corresponds to a benchmark instance and indicates that the variant noted on the x-axis took x seconds to solve this instance while the one noted on the y-axis took y seconds. Thus points above the solid diagonal line correspond to instances where the x-axis method was faster; points above (below) the upper (lower) dotted diagonal line are where the x-axis method took less than half (more than twice) as long.

Fig. 2 first shows the performance impact of enabling safe rounding for the standard interleaved algorithm using double precision. The top row shows the behaviour on I11w. We see that runtime is drastically longer in the default variant that uses fesetround, but only increases by a factor of around 2 if we use the specific inline assembly instructions. We note that GCC includes the code for fesetround in the generated .dll file on Windows, but in contrast to the assembly methods does not inline it into the callers. Some of the difference may thus be function call overhead. The middle row shows the behaviour on AMDw. Here, default is affected just as badly, but the effect on SSE is worse while that on x87 is much lower than on the Intel I11w system. In the bottom row, we show the impact on default on the Linux systems (bottom left and bottom middle), which is much lower than on Windows. This is despite GCC implementing fesetround as an external library call here. The overhead still markedly differs between the two Intel CPUs, though. Finally, as expected, we see on the bottom right that safe rounding has almost no performance impact when using the AVX-512 instructions.

Seeing the significant impact enabling safe rounding can have, we next show what the sequential algorithm brings to the table, in Fig. 3. On the top left, we

Fig. 2. Performance impact of safe rounding across instruction sets and systems

compare the base algorithms without safe rounding, where SII takes up to twice as long in the worst case. This is likely due to the more cache-friendly memory access pattern of III: we store l and u interleaved for III, so it always operates on two adjacent values at a time. The bottom-left plot confirms that reducing the number of rounding mode changes reduces the overhead of safe rounding to essentially zero. The remaining four plots show the differences between SR-III and SR-SII. In all cases except x87 on AMDw, SR-III is slower. We thus have that III is fastest but unsafe, SII and SR-SII are equally fast but the latter is safe, and SR-III is safe but tends to be slower on the Intel systems. On the AMD system, SR-III surprisingly wins over SR-SII with x87, highlighting that the x87 instruction set in Zen 3 must be implemented very differently from SSE.

Fig. 3. Performance of interleaved compared to sequential II

We further investigate the impact of the instruction set in Fig. 4. Confirming the patterns we saw so far, SSE is slightly faster than x87 on I11w (and we see similar behaviour on the other Intel systems) but slower by a factor of more than 2 on the AMD CPU. The rightmost plot highlights that AVX-512 is the fastest alternative on the most recent Intel CPUs, which may in part be due to the availability of the fused multiply-add instruction that fits II so well.

All results so far were for double-precision computations. To conclude our evaluation, we show in Fig. 5 that reducing to single precision does not bring the expected performance benefits. We see in the leftmost plot that the overhead

Fig. 4. Performance with different instruction sets

Fig. 5. Performance with different precision settings (on I11w)

of safe rounding has a much higher variance compared to Fig. 2. The detailed tool outputs hint at the reason being that rounding away from the fixpoint occurs in much larger steps with single precision, which significantly slows down or stops the convergence in several instances. The middle plot shows that, aside from the slowly converging outliers, using single precision does not provide a speedup over using doubles. Finally, on the right, we show that the impact of enabling 80-bit extended precision on x87 is minimal.

# 6 Conclusion

There has been ample research into sound PMC algorithms over the past years, but the problem of errors introduced by naive implementations using default floating-point rounding has been all but ignored. We showed that a solution exists that, while perhaps conceptually simple, faces a number of implementation and performance obstacles. In particular, hardware support for rounding modes is arguably essential to achieve acceptable performance, but difficult to use from C/C++ and impossible to access from most other programming languages. We extensively explored the space of implementation variants, highlighting that performance crucially depends on the combination of the variant, the CPU, and the operating system. Nevertheless, our results show that truly correct PMC is possible today at a small cost in performance, which should all but disappear as AVX-512 is more widely adopted. With our implementation in mcsta, we provide the first PMC tool that combines speed, scalability, and correctness.

Acknowledgments. This work was triggered by Masahide Kashiwagi's excellent overview of the different ways to change rounding mode as used by his kv library for verified numerical computations [21]. The author thanks Anke and Ursula Hartmanns for contributing to the diversity of hardware on which the experiments were performed by providing access to the AMDw and I11w systems.

Data availability. A dataset to replicate the experimental evaluation, including the exact versions of the tools and models used, is archived and available at DOI 10.4121/19074047 [17].

#### References


on Reachability Problems (RP). Lecture Notes in Computer Science, vol. 8762, pp. 125–137. Springer (2014). https://doi.org/10.1007/978-3-319-11439-2\_10


tion (CAV). Lecture Notes in Computer Science, vol. 10981, pp. 643–661. Springer (2018). https://doi.org/10.1007/978-3-319-96145-3\_37


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Correlated Equilibria and Fairness in Concurrent Stochastic Games**

Marta Kwiatkowska<sup>1</sup>, Gethin Norman<sup>2</sup>, David Parker<sup>3</sup>, and Gabriel Santos<sup>1</sup>

<sup>1</sup> Department of Computer Science, University of Oxford, Oxford, UK {marta.kwiatkowska,gabriel.santos}@cs.ox.ac.uk

<sup>2</sup> School of Computing Science, University of Glasgow, Glasgow, UK gethin.norman@glasgow.ac.uk

<sup>3</sup> School of Computer Science, University of Birmingham, Birmingham, UK d.a.parker@cs.bham.ac.uk

**Abstract.** Game-theoretic techniques and equilibria analysis facilitate the design and verification of competitive systems. While the algorithmic complexity of equilibria computation has been extensively studied, practical implementation and application of game-theoretic methods is more recent. Tools such as PRISM-games support automated verification and synthesis of zero-sum and (*ε*-optimal subgame-perfect) social welfare Nash equilibria properties for concurrent stochastic games. However, these methods become inefficient as the number of agents grows and may also generate equilibria that yield significant variations in the outcomes for individual agents. We extend the functionality of PRISM-games to support *correlated equilibria*, in which players can coordinate through public signals, and introduce a novel optimality criterion of *social fairness*, which can be applied to both Nash and correlated equilibria. We show that correlated equilibria are easier to compute, are more equitable, and can also improve joint outcomes. We implement algorithms for both normal form games and the more complex case of multi-player concurrent stochastic games with temporal logic specifications. On a range of case studies, we demonstrate the benefits of our methods.

#### **1 Introduction**

Game-theoretic verification techniques can support the modelling and design of systems that comprise multiple agents operating in either a cooperative or competitive manner. In many cases, to effectively analyse these systems we also need to adopt a probabilistic approach to modelling, for example because agents operate in uncertain environments, use faulty hardware or unreliable communication mechanisms, or explicitly employ randomisation for coordination.

In these cases, *probabilistic model checking* provides a convenient unified framework for both formally modelling probabilistic multi-agent systems and specifying their required behaviour. In recent years, progress has been made in this direction for several models, including turn-based and concurrent stochastic games (TSGs and CSGs), and for multiple temporal logics, such as rPATL [10] and its extensions [24]. Tool support has been developed, in the form of PRISM-games [22], and successfully applied to case studies across a broad range of areas.

Initially, the focus was on *zero-sum* specifications [24], which can be natural for systems whose participants have directly opposing goals, such as the defender and attacker in a security protocol minimising or maximising the probability of a successful attack, respectively. However, agents often have objectives that are distinct but not directly opposing, and may also want to cooperate to achieve these objectives. Examples include network protocols and multi-robot systems.

For these purposes, *Nash equilibria* (NE) have also been integrated into probabilistic model checking of CSGs [24], together with the *social welfare* (SW) optimality criterion, resulting in social welfare Nash equilibria (SWNE). An SWNE comprises a strategy for each player in the game where no player has an incentive to deviate unilaterally from their strategy and the sum of the individual objectives over all players is maximised.

One key limitation of SWNE, however, is that, as these techniques are extended to support larger numbers of players [21], the efficiency and scalability of synthesising SWNE is significantly reduced. In addition, simply aiming to maximise the sum of individual objectives may not produce the best performing equilibrium, either collectively or individually; for example, such equilibria can offer higher gains for specific players, reducing the incentive of the other players to collaborate and instead motivating them to deviate from the equilibrium.

In this paper, we adopt a different approach and introduce, for the first time within formal verification, both *social fairness* as an optimality criterion and *correlated equilibria*, and the insights required to make these usable in practical applications. Social fairness (SF) is particularly novel, as it is inspired by similar concepts used in economics and distinct from the fairness notions employed in verification. Correlated equilibria (CE) [3], in which players are able to coordinate through public *signals*, are easier to compute than NE and can yield better outcomes. Social fairness, which minimises the differences between the objectives of individual players, can be considered for both CE and NE.

We first investigate these concepts for the simpler case of normal form games, illustrating their differences and benefits. We then extend the approach to the more powerful modelling formalism of CSGs and extend the temporal logic rPATL to formally specify agent objectives. We present algorithms to synthesise equilibria, using linear programming to find CE and a combination of backwards induction or value iteration for CSGs. We implement our approach in the PRISM-games tool [22] and demonstrate significant gains in computation time and that quantifiably more fair and useful strategies can be synthesised for a range of application domains. An extended version of this paper, with the complete model checking algorithm, is available [23].

**Related work.** Nash equilibria have been considered for concurrent systems in [18], where a temporal logic is proposed whose key operator is a novel path quantifier which asserts that a property holds on all Nash equilibrium computations of the system. There is no stochasticity and correlated equilibria are not considered. In [2], a probabilistic logic that can express equilibria is formulated, along with complexity results, but no implementation has been provided.

The notion of fairness studied here is inspired by fairness of equilibria from economics [33,34] and aims to minimise the difference between the payoffs, as opposed to maximising the lowest payoff among the players in an NE [25]. Our notion of fairness can be thought of as a constraint applied to equilibria strategies, similar in style to social welfare, and used to select certain equilibria based on optimality. This is distinct from fairness used in verification of concurrent processes, where (strong) fairness refers to a property stating that, whenever a process is enabled infinitely often, it is executed infinitely often. This notion is typically defined as a constraint on infinite execution paths, expressible in the logics LTL and CTL\*, and needed to prove liveness properties. For probabilistic models, verification under fairness constraints has been formulated for Markov decision processes and the logic PCTL\* [5,4]. For games on graphs, fairness conditions expressed as *ω*-regular winning conditions can be used to synthesise reactive processes [8]. Algorithms for strong transition fairness for *ω*-regular games have been recently studied in [6]. Both qualitative and quantitative approaches have been considered for verification under fairness constraints, but no equilibria.

#### **2 Normal Form Games**

We start by considering normal form games (NFGs), then define our equilibria concepts for these games, present algorithms and an implementation for computing them, and finally summarise some experimental results.

We first require the following notation. Let $Dist(X)$ denote the set of probability distributions over a set $X$. For any vector $v \in \mathbb{R}^n$, we use $v(i)$ to refer to the $i$th entry of the vector. For any tuple $x = (x_1, \ldots, x_n) \in X^n$, element $x' \in X$ and $i \leqslant n$, we define the tuples $x_{-i} \stackrel{\text{def}}{=} (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ and $x_{-i}[x'] \stackrel{\text{def}}{=} (x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_n)$.

**Definition 1 (Normal form game).** *A (finite, n-person)* normal form game *(NFG) is a tuple* $\mathsf{N} = (N, A, u)$ *where:* $N = \{1, \ldots, n\}$ *is a finite set of players;* $A = A_1 \times \cdots \times A_n$ *and* $A_i$ *is a finite set of actions available to player* $i \in N$*;* $u = (u_1, \ldots, u_n)$ *and* $u_i : A \to \mathbb{R}$ *is a utility function for player* $i \in N$*.*

We fix an NFG $\mathsf{N} = (N, A, u)$ for the remainder of this section. In a play of $\mathsf{N}$, each player $i \in N$ chooses an action from the set $A_i$ at the same time. If each player $i$ chooses $a_i$, then the utility received by player $j$ equals $u_j(a_1, \ldots, a_n)$. We next define the *strategies* for players of $\mathsf{N}$ and *strategy profiles* comprising a strategy for each player. We also define *correlated profiles*, which allow the players to coordinate their choices through a (probabilistic) *public signal*.

**Definition 2 (Strategy and profile).** *A* strategy $\sigma_i$ *for player* $i$ *is an element of* $\Sigma_i = Dist(A_i)$ *and a* strategy profile $\sigma$ *is an element of* $\Sigma^N = \Sigma_1 \times \cdots \times \Sigma_n$*.*

For a strategy $\sigma_i$ of player $i$, the *support* is the set of actions $\{a_i \in A_i \mid \sigma_i(a_i) > 0\}$, and the support of a profile is the product of the supports of its strategies.

**Definition 3 (Correlated profile).** *A* correlated profile *is a tuple* $(\tau, \varsigma)$ *comprising* $\tau \in Dist(D)$*, where* $D = D_1 \times \cdots \times D_n$*,* $D_i$ *is a finite set of* signals *for player* $i$*, and* $\varsigma = (\varsigma_1, \ldots, \varsigma_n)$*, where* $\varsigma_i : D_i \to A_i$*.*

For a correlated profile $(\tau, \varsigma)$, the public signal $\tau$ is a joint distribution over the signals $D_i$ of each player $i$ such that, if player $i$ receives the signal $d_i \in D_i$, then it chooses the action $\varsigma_i(d_i)$. We can consider any correlated profile $(\tau, \varsigma)$ as a *joint strategy*, i.e., a distribution over $A_1 \times \cdots \times A_n$ where:

$$(\tau, \varsigma)(a\_1, \ldots, a\_n) = \sum \{ \tau(d\_1, \ldots, d\_n) \mid d\_i \in D\_i \land \varsigma\_i(d\_i) = a\_i \text{ for all } i \in N \}\ .$$

Conversely, any joint strategy $\tau \in Dist(A_1 \times \cdots \times A_n)$ can be considered as a correlated profile $(\tau, \varsigma)$ where $D_i = A_i$ and $\varsigma_i$ is the identity function for $i \in N$.

Any strategy profile $\sigma$ can be mapped to an equivalent correlated profile (in which $\tau$ is the joint distribution $\sigma_1 \times \cdots \times \sigma_n$ and $\varsigma_i$ is the identity function). On the other hand, there are correlated profiles with no equivalent strategy profile. Under a profile $\sigma$ and a correlated profile $(\tau, \varsigma)$, the expected utilities of player $i$ are:

$$\begin{array}{l} u\_{i}(\sigma) \stackrel{\text{def}}{=} \sum\_{(a\_{1}, \ldots, a\_{n}) \in A} u\_{i}(a\_{1}, \ldots, a\_{n}) \cdot \left(\prod\_{j=1}^{n} \sigma\_{j}(a\_{j})\right) \\ u\_{i}(\tau, \varsigma) \stackrel{\text{def}}{=} \sum\_{(d\_{1}, \ldots, d\_{n}) \in D} \tau(d\_{1}, \ldots, d\_{n}) \cdot u\_{i}(\varsigma\_{1}(d\_{1}), \ldots, \varsigma\_{n}(d\_{n})) \;. \end{array}$$

**Example 1.** Consider the two-player NFG where $A_i = \{a^i_1, a^i_2\}$ and a correlated profile corresponding to the joint distribution $\tau \in Dist(A_1 \times A_2)$ where $\tau(a^1_1, a^2_1) = \tau(a^1_2, a^2_2) = 0.5$. Under this correlated profile the players share a fair coin and both choose their first action if the coin is heads and their second action otherwise. This has no equivalent strategy profile. ■
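The two expected-utility formulas above can be sketched directly in code. The following hedged example uses a hypothetical two-action coordination game (utility 1 when the players' actions match, 0 otherwise; these utilities are our own illustrative assumption, not taken from the paper) to contrast the shared-coin joint strategy of Example 1 with independent mixing:

```python
import itertools

# Hypothetical 2-player NFG: actions {0, 1} for each player, with
# coordination utilities u_i(a) = 1 if the actions match, 0 otherwise.
A = [(0, 1), (0, 1)]

def u(i, a):
    return 1.0 if a[0] == a[1] else 0.0

def utility_profile(i, sigma):
    """u_i(sigma) for a product strategy profile sigma = (Dist(A_1), Dist(A_2))."""
    total = 0.0
    for a in itertools.product(*A):
        prob = 1.0
        for j, aj in enumerate(a):
            prob *= sigma[j][aj]  # product of the independent strategies
        total += u(i, a) * prob
    return total

def utility_joint(i, tau):
    """u_i for a joint strategy tau in Dist(A_1 x A_2)."""
    return sum(p * u(i, a) for a, p in tau.items())

# Shared fair coin (Example 1): both play action 0 on heads, action 1 on tails.
tau = {(0, 0): 0.5, (1, 1): 0.5}
uniform = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
print(utility_joint(0, tau))        # 1.0: actions are perfectly correlated
print(utility_profile(0, uniform))  # 0.5: independent mixing cannot correlate
```

No product distribution over the two action sets equals the joint distribution `tau`, which is the sense in which this correlated profile has no equivalent strategy profile.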

**Optimal equilibria of NFGs.** We now introduce the notions of *Nash equilibrium* [27] and *correlated equilibrium* [3], as well as different definitions of optimality for these equilibria: *social welfare* and *social fairness*. Using the notation introduced above for tuples, for any profile $\sigma$ and strategy $\sigma^\star_i$, the strategy tuple $\sigma_{-i}$ corresponds to $\sigma$ with the strategy of player $i$ removed and $\sigma_{-i}[\sigma^\star_i]$ to the profile $\sigma$ after replacing player $i$'s strategy with $\sigma^\star_i$.

**Definition 4 (Best response).** *For a profile* $\sigma$ *and a correlated profile* $(\tau, \varsigma)$*, a* best response *for player* $i$ *to* $\sigma_{-i}$ *and to* $(\tau, \varsigma_{-i})$ *are, respectively, a strategy* $\sigma^\star_i$ *and a function* $\varsigma^\star_i$ *such that:*

$$u\_i(\sigma\_{-i}[\sigma^\star\_i]) \geqslant u\_i(\sigma\_{-i}[\sigma\_i]) \ \text{for all } \sigma\_i \in \Sigma\_i \qquad u\_i(\tau, \varsigma\_{-i}[\varsigma^\star\_i]) \geqslant u\_i(\tau, \varsigma\_{-i}[\varsigma\_i]) \ \text{for all } \varsigma\_i : D\_i \to A\_i \;.$$

**Definition 5 (NE and CE).** *A strategy profile* $\sigma^\star$ *is a* Nash equilibrium *(NE) and a correlated profile* $(\tau, \varsigma^\star)$ *is a* correlated equilibrium *(CE) if:*

$$\sigma^\star\_i \ \text{is a best response to} \ \sigma^\star\_{-i} \qquad \text{and} \qquad \varsigma^\star\_i \ \text{is a best response to} \ (\tau, \varsigma^\star\_{-i}) \ \text{for all } i \in N,$$

*respectively. We denote by* $\Sigma_N$ *and* $\Sigma_C$ *the sets of NE and CE, respectively.*


Fig. 1: Example: Cars at an intersection and the corresponding NFG.

Any NE of $\mathsf{N}$ is also a CE, while there can exist CEs that cannot be represented by a strategy profile and therefore are not NEs. For each class of equilibria, NE and CE, we introduce two optimality criteria, the first maximising *social welfare* (SW), defined as the *sum* of the utilities, and the second maximising *social fairness* (SF), which minimises the *difference* between the players' utilities. Other variants of fairness have been considered for NE, such as in [25], where the authors seek to maximise the lowest utility among the players.

**Definition 6 (SW and SF).** *An equilibrium* $\sigma^\star$ *is a* social welfare *(SW) equilibrium if the sum of the utilities of the players under* $\sigma^\star$ *is maximal over all equilibria, while* $\sigma^\star$ *is a* socially fair *(SF) equilibrium if the difference between the players' utilities under* $\sigma^\star$ *is minimised over all equilibria.*

We can also define the dual concept of *cost equilibria* [24], where players try to minimise, rather than maximise, their expected utilities, by considering equilibria of the game $\mathsf{N}^- = (N, A, -u)$ in which the utilities of $\mathsf{N}$ are negated.

**Example 2.** Consider the scenario, based on an example from [32], where three cars meet at an intersection and want to proceed as indicated by the arrows in Figure 1. Each car can either *proceed* or *yield*. If two cars with intersecting paths proceed, then there is an accident. If an accident occurs, the car that has right of way (i.e., the other car is to its right) has a utility of $-100$ and the car that should have yielded has a utility of $-1000$. If a car proceeds without causing an accident, then its utility is $5$ and the cars that yield have a utility of $-5$. If all cars yield, then, since this delays all cars, all have utility $-10$. The 3-player NFG is given in Figure 1. Considering the different optimal equilibria of the NFG:


Modifying $u_2$ such that $u_2(pro_1, pro_2, pro_3) = -4.5$ to, e.g., represent a reckless driver, the SWNE becomes for $c_1$ and $c_3$ to yield and $c_2$ to proceed, with the expected utilities $(-5, 5, -5)$, while the SWCE is still for $c_2$ to yield and $c_1$ and $c_3$ to proceed. The SFNE and SFCE also do not change. ■

**Algorithms for computing equilibria.** Before we give our algorithm for computing correlated equilibria, we briefly describe the approach of [21,24] for Nash equilibria computation that this paper builds upon. Finding NE in two-player NFGs is in the class of *linear complementarity* problems (LCPs) and we follow the algorithm presented in [24], which reduces the problem to SMT via labelled polytopes [28] by considering the regions of the strategy profile space, iteratively reducing the search space as positive probability assignments are found and added as restrictions on this space. To find SWNE and SFNE, we can enumerate all NE and then select the optimal NE.

When there are more than two players, computing NE values becomes a more complex task, as finding NE within a given support no longer reduces to a linear programming (LP) problem. In [21] we presented an algorithm using support enumeration [31], which exhaustively examines all sub-regions, i.e., supports, of the strategy profile space, one at a time, checking whether that sub-region contains NEs. For each support, finding SWNE can be reduced to a *nonlinear programming problem* [21]. This nonlinear programming problem can be modified to find SFNE in each support, similarly to how the LP problem for SWCEs is modified to find SFCEs below.

In the case of CE, we can first find a joint strategy for the players, i.e., a distribution over the action tuples, which, as explained above, can then be mapped to a correlated profile. An SWCE can be found by solving the following LP problem. Maximise $\sum_{i \in N} \sum_{\alpha \in A} u_i(\alpha) \cdot p_\alpha$ subject to:

$$\sum\_{\alpha\_{-i}\in A\_{-i}} (u\_i(\alpha\_{-i}[a\_i]) - u\_i(\alpha\_{-i}[a\_i'])) \cdot p\_{\alpha\_{-i}[a\_i]} \geqslant 0 \tag{1}$$

$$0 \le p\_{\alpha} \le 1 \tag{2}$$

$$\sum\_{\alpha \in A} p\_{\alpha} = 1 \tag{3}$$

for all $i \in N$, $\alpha \in A$, $a_i, a'_i \in A_i$, $\alpha_{-i} \in A_{-i}$, where $A_{-i} \stackrel{\text{def}}{=} \{\alpha_{-i} \mid \alpha \in A\}$. The variables $p_\alpha$ represent the probability of the joint strategy corresponding to the correlated profile selecting the action-tuple $\alpha$. The above LP has $|A|$ variables, one for each action-tuple, and $\sum_{i \in N}(|A_i|^2 - |A_i|) + |A| + 1$ constraints. Computation of SFCE can be reduced to the following optimisation problem. Minimise $p^{\max} - p^{\min}$ subject to: (1), (2) and (3), together with:


$$p^i = \sum\_{\alpha \in A} p\_\alpha \cdot u\_i(\alpha) \tag{4}$$

$$(\land\_{m \in N} p^i \geqslant p^m) \to (p^{\max} = p^i) \tag{5}$$

$$(\land\_{m \in N} p^i \leqslant p^m) \to (p^{\min} = p^i) \tag{6}$$

for all $i \in N$, $m \neq i$, $\alpha \in A$, $a_j, a_l \in A_i$, $\alpha_{-i} \in A_{-i}$. Again, the variables $p_\alpha$ in the program represent the probability of the players playing the joint action $\alpha$. The constraint (4) requires $p^i$ to equal the utility of player $i$. The constraints (5) and (6) set $p^{\max}$ and $p^{\min}$ to the maximum and minimum values among the utilities of the players, respectively. Given we use the constraints (1), (2) and (3), we start with the same number of variables and constraints as needed to compute SWCEs and incur an additional $|N|+2$ variables and $3 \cdot |N|$ constraints.
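As a concrete illustration of the SWCE linear program (1)–(3), the following sketch solves it with `scipy.optimize.linprog` for a hypothetical two-player game of chicken; the utility values are our own illustrative choice, not from the paper, and the maximisation is expressed as minimising the negated social welfare:

```python
import itertools
from scipy.optimize import linprog

# Hypothetical 2-player "chicken" game; action 0 = dare, 1 = yield.
A = [(0, 1), (0, 1)]
u = {(0, 0): (0, 0), (0, 1): (7, 2),   # u[alpha] = (u_1(alpha), u_2(alpha))
     (1, 0): (2, 7), (1, 1): (6, 6)}
joint = list(itertools.product(*A))
idx = {a: k for k, a in enumerate(joint)}

# Objective: maximise sum_i sum_alpha u_i(alpha) * p_alpha
# (linprog minimises, so the coefficients are negated).
c = [-(u[a][0] + u[a][1]) for a in joint]

# Incentive constraints (1), rewritten in <= 0 form for linprog: for each
# player i and ordered pair of actions (a_i, a_i'),
#   -sum_{alpha_{-i}} (u_i(alpha_{-i}[a_i]) - u_i(alpha_{-i}[a_i'])) * p_{alpha_{-i}[a_i]} <= 0.
A_ub, b_ub = [], []
for i in (0, 1):
    for ai, ai2 in itertools.permutations(A[i], 2):
        row = [0.0] * len(joint)
        for ao in A[1 - i]:
            rec = (ai, ao) if i == 0 else (ao, ai)    # alpha_{-i}[a_i]
            dev = (ai2, ao) if i == 0 else (ao, ai2)  # alpha_{-i}[a_i']
            row[idx[rec]] -= u[rec][i] - u[dev][i]
        A_ub.append(row)
        b_ub.append(0.0)

# Constraints (2)-(3): p is a probability distribution over A.
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              A_eq=[[1.0] * len(joint)], b_eq=[1.0],
              bounds=[(0, 1)] * len(joint))
print(-res.fun)  # optimal social welfare
print({a: round(res.x[idx[a]], 6) for a in joint})
```

For these utilities the optimal social welfare is 10.5, attained by placing probability 0.25 on each of (dare, yield) and (yield, dare) and 0.5 on (yield, yield), a joint distribution achievable by no Nash equilibrium of the same game.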


Table 1: Times (s) for synthesis of equilibria in NFGs (timeout 30 mins).

**Implementation.** To find SWNE or SFNE of two-player NFGs, we adopt a similar approach to [24], using labelled polytopes to characterise and find NE values through a reduction to SMT, in both Z3 [13] and Yices [14]. As an optimised precomputation step, when possible we also search for and filter out *dominated strategies*, which speeds up the computation and reduces solver calls.

For NFGs with more than two players, solving the nonlinear programming problem based on support enumeration has been implemented in [21] using a combination of the SMT solver Z3 [13] and the nonlinear optimisation suite Ipopt [38]. To mitigate the inefficiency of an SMT solver for such problems, we first use Z3 (with a timeout) to filter out unsatisfiable support assignments, and then call Ipopt to find SWNE values using an interior-point filter line-search algorithm [39]. To speed up the overall computation, the support assignments are analysed in parallel. Computing SFNE increases the complexity of the nonlinear program and, due to the inefficiency of this approach [21], we have not extended the implementation to compute SFNE.

As shown above, computing SWCE for NFGs reduces to solving an LP, which we implement using either the optimisation solver Gurobi [17] or the SMT solver Z3 [13]. In the case of SFCE, the constraints (5) and (6) include implications, and therefore the problem does not reduce directly to an LP. When using Z3, we can encode these constraints directly, as it supports assertions that combine inequalities with logical implications, a feature that linear solvers such as Gurobi lack. Section 5 discusses implementing SFCE computation in Gurobi. Both solvers support the specification of *lower priority* or *soft* objectives, which makes it possible to have a consistent ordering for the players' payoffs in cases where multiple equilibria exist.

**Efficiency and scalability.** Table 1 presents experimental results for solving a selection of NFGs randomly generated with GAMUT [29], using Gurobi for SWCE and NE of two-player NFGs, Z3 for SFCE, and both Ipopt and Z3 for NFGs with more than two players, running on a 2.10GHz Intel Xeon Gold with 32GB of JVM memory. For each instance, Table 1 lists the number of players, actions for each player, joint actions and supports that need to be enumerated when finding NE, as well as the time to find SWNEs, SWCEs and SFCEs (the time for finding SFNEs of two-player games is the same as for SWNEs). As the results demonstrate, due to a simpler problem being solved and the fact that we do not need to enumerate the solutions, computing CEs scales far better than computing NEs as the number of players and actions increases. Finding NEs in games with more than two players is particularly hard, as the constraints are nonlinear. We also see that SFCE computation is slower than SWCE, which is caused by the additional variables and constraints required when finding SFCE and by using Z3 rather than Gurobi as the solver.

#### **3 Concurrent Stochastic Games**

We now further develop our approach to support concurrent stochastic games (CSGs) [36], in which players repeatedly make simultaneous action choices that cause the game's state to be updated probabilistically. We extend the previously introduced definitions of optimal equilibria to such games, focusing on subgame-perfect equilibria, which are equilibria in every state of a CSG. We then present algorithms to reason about and synthesise such equilibria.

**Definition 7 (Concurrent stochastic game).** *A* concurrent stochastic multi-player game *(CSG) is a tuple* $\mathsf{G} = (N, S, \bar{S}, A, \Delta, \delta, AP, L)$ *where:*


For the remainder of this section we fix a CSG $\mathsf{G}$ as in Definition 7. The game $\mathsf{G}$ starts in one of its initial states $\bar{s} \in \bar{S}$ and, supposing $\mathsf{G}$ is in a state $s$, each player $i$ of $\mathsf{G}$ chooses an action from its set of available actions, defined as $A_i(s) \stackrel{\text{def}}{=} \Delta(s) \cap A_i$ if $\Delta(s) \cap A_i$ is non-empty and $A_i(s) \stackrel{\text{def}}{=} \{\perp\}$ otherwise. Supposing each player chooses $a_i$, the game transitions to state $s'$ with probability $\delta(s, (a_1, \ldots, a_n))(s')$. To enable quantitative analysis of $\mathsf{G}$, we augment it with *reward structures*, which are tuples $r = (r_A, r_S)$ comprising an action reward function $r_A : S \times A \to \mathbb{R}$ and a state reward function $r_S : S \to \mathbb{R}$.

A *path* of $\mathsf{G}$ is a sequence $\pi = s_0 \xrightarrow{\alpha_0} s_1 \xrightarrow{\alpha_1} \cdots$ where $s_k \in S$, $\alpha_k = (a^k_1, \ldots, a^k_n) \in A$, $a^k_i \in A_i(s_k)$ for $i \in N$ and $\delta(s_k, \alpha_k)(s_{k+1}) > 0$ for all $k \geqslant 0$. We denote by $FPaths_{\mathsf{G},s}$ and $IPaths_{\mathsf{G},s}$ the sets of finite and infinite paths starting in state $s$ of $\mathsf{G}$, respectively, and drop the subscript $s$ when considering all finite and infinite paths of $\mathsf{G}$. As for NFGs, we can define *strategies* of $\mathsf{G}$ that resolve the choices of the players. Here, a strategy for player $i$ is a function $\sigma_i : FPaths_{\mathsf{G}} \to Dist(A_i \cup \{\perp\})$ such that, if $\sigma_i(\pi)(a_i) > 0$, then $a_i \in A_i(last(\pi))$, where $last(\pi)$ is the final state of $\pi$. Furthermore, we can define strategy profiles, correlated profiles and joint strategies analogously to Definitions 2 and 3.

The utility of a player $i$ of $\mathsf{G}$ is defined by a random variable $X_i : IPaths_{\mathsf{G}} \to \mathbb{R}$ over infinite paths. For a profile<sup>4</sup> $\sigma$ and state $s$, using standard techniques [20], we can construct a probability measure $Prob^\sigma_{\mathsf{G},s}$ over the paths with initial state $s$ corresponding to $\sigma$, denoted $IPaths^\sigma_{\mathsf{G},s}$, and the expected value $\mathbb{E}^\sigma_{\mathsf{G},s}(X_i)$ of player $i$'s utility from $s$ under $\sigma$. Given utilities $X_1, \ldots, X_n$ for all the players of $\mathsf{G}$, we can then define NE and CE (see Definition 5) as well as the restricted classes of SW and SF equilibria as for NFGs (see Definition 6). Following [24,21], we focus on *subgame-perfect* equilibria [30], which are equilibria in *every state* of $\mathsf{G}$.

**Nonzero-sum properties.** As in [24] (for two-player CSGs) and [21] (for $n$-player CSGs), we can specify equilibria-based properties using temporal logic. For simplicity, we restrict attention to nonzero-sum properties without nesting, allowing for the specification of NE and CE against either SW or SF optimality.

**Definition 8 (Nonzero-sum specifications).** *The syntax of nonzero-sum specifications θ for CSGs is given by the grammar:*

$$\begin{array}{lcl}\phi & := & \langle\!\langle \mathbb{C}\rangle\!\rangle(\star\_{1},\star\_{2})\_{\mathrm{opt}\sim x}(\theta) \\ \theta & := & \mathsf{P}\lbrack\,\psi\,\rbrack+\cdots+\mathsf{P}\lbrack\,\psi\,\rbrack \ \mid\ \mathsf{R}^{r}\lbrack\,\rho\,\rbrack+\cdots+\mathsf{R}^{r}\lbrack\,\rho\,\rbrack \\ \psi & := & \mathsf{X}\ \mathsf{a} \ \mid\ \mathsf{a}\ \mathsf{U}^{\leqslant k}\ \mathsf{a} \ \mid\ \mathsf{a}\ \mathsf{U}\ \mathsf{a} \\ \rho & := & \mathsf{I}^{=k} \ \mid\ \mathsf{C}^{\leqslant k} \ \mid\ \mathsf{F}\ \mathsf{a} \end{array}$$

*where* $\mathbb{C} = C_1{:}\cdots{:}C_m$*,* $C_1, \ldots, C_m$ *are coalitions of players such that* $C_i \cap C_j = \emptyset$ *for all* $1 \leqslant i \neq j \leqslant m$ *and* $\cup_{i=1}^{m} C_i = N$*,* $(\star_1, \star_2) \in \{\mathrm{ne}, \mathrm{ce}\} \times \{\mathrm{sw}, \mathrm{sf}\}$*,* $\mathrm{opt} \in \{\min, \max\}$*,* $\sim \in \{<, \leqslant, \geqslant, >\}$*,* $x \in \mathbb{Q}$*,* $r$ *is a reward structure,* $k \in \mathbb{N}$ *and* a *is an atomic proposition.*

The nonzero-sum formulae of Definition 8 extend the logic of [24,21] in that we can now specify the type of equilibria, NE or CE, and the optimality criterion, SW or SF. A probabilistic formula $\langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\max\sim x}(\mathsf{P}[\,\psi_1\,]{+}\cdots{+}\mathsf{P}[\,\psi_m\,])$ is true in a state if, when the players form the coalitions $C_1, \ldots, C_m$, there is a subgame-perfect equilibrium of type $\star_1$ meeting the optimality criterion $\star_2$ for which the *sum* of the values of the objectives $\mathsf{P}[\,\psi_1\,], \ldots, \mathsf{P}[\,\psi_m\,]$ for the coalitions $C_1, \ldots, C_m$ satisfies $\sim x$. The objective $\psi_i$ of coalition $C_i$ is either a next ($\mathsf{X}\ \mathsf{a}$), bounded until ($\mathsf{a}_1\ \mathsf{U}^{\leqslant k}\ \mathsf{a}_2$) or until ($\mathsf{a}_1\ \mathsf{U}\ \mathsf{a}_2$) formula, with the usual equivalences, e.g., $\mathsf{F}\ \mathsf{a} \equiv \mathsf{true}\ \mathsf{U}\ \mathsf{a}$.

For a reward formula $\langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\mathrm{opt}\sim x}(\mathsf{R}^{r_1}[\,\rho_1\,]{+}\cdots{+}\mathsf{R}^{r_m}[\,\rho_m\,])$ the meaning is similar; however, here the objective of coalition $C_i$ refers to a reward formula $\rho_i$ with respect to reward structure $r_i$, and this formula is either a bounded instantaneous reward ($\mathsf{I}^{=k}$), bounded accumulated reward ($\mathsf{C}^{\leqslant k}$) or reachability reward ($\mathsf{F}\ \mathsf{a}$).

For formulae of the form $\langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\min\sim x}(\theta)$, the dual notions of cost equilibria are considered. We also allow *numerical* queries of the form $\langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\mathrm{opt}=?}(\theta)$, which return the sum of the optimal subgame-perfect equilibrium's values.

<sup>4</sup> We can also construct such a probability measure and expected value given a correlated profile or joint strategy.

**Model checking nonzero-sum specifications.** Similarly to [24,21], to allow model checking of nonzero-sum properties, we consider a restricted class of CSGs. We make the following assumption, which can be checked using graph algorithms with time complexity quadratic in the size of the state space [1].

**Assumption 1.** *For each subformula* $\mathsf{P}[\,\mathsf{a}_1\ \mathsf{U}\ \mathsf{a}_2\,]$*, a state satisfying* $\neg\mathsf{a}_1 \vee \mathsf{a}_2$ *is reached with probability 1 from all states under all strategy profiles and correlated profiles. For each subformula* $\mathsf{R}^r[\,\mathsf{F}\ \mathsf{a}\,]$*, a state labelled* a *is reached with probability 1 from all states under all strategy profiles and correlated profiles.*

We now show how to compute the optimal values of a nonzero-sum formula $\phi = \langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\mathrm{opt}\sim x}(\theta)$ when $\mathrm{opt} = \max$. The case when $\mathrm{opt} = \min$ can be computed by negating all utilities and maximising.

The model checking algorithm broadly follows those presented in [24,21], with the differences described below. The problem is reduced to solving an $m$-player *coalition game* $\mathsf{G}^{\mathcal{C}}$ where $\mathcal{C} = \{C_1, \ldots, C_m\}$ and the choices of each player $i$ in $\mathsf{G}^{\mathcal{C}}$ correspond to the choices of the players in coalition $C_i$ in $\mathsf{G}$. Formally, we have the following definition in which, without loss of generality, we assume $\mathcal{C}$ is of the form $\{\{1, \ldots, n_1\}, \{n_1{+}1, \ldots, n_2\}, \ldots, \{n_{m-1}{+}1, \ldots, n_m\}\}$ and let $j_{\mathcal{C}}$ denote player $j$'s position in its coalition.

**Definition 9 (Coalition game).** *For a CSG* $\mathsf{G} = (N, S, \bar{S}, A, \Delta, \delta, AP, L)$ *and a partition* $\mathcal{C} = \{C_1, \ldots, C_m\}$ *of the players into* $m$ *coalitions, we define the* coalition game $\mathsf{G}^{\mathcal{C}} = (\{1, \ldots, m\}, S, \bar{S}, A^{\mathcal{C}}, \Delta^{\mathcal{C}}, \delta^{\mathcal{C}}, AP, L)$ *as an* $m$*-player CSG where:*


If all the objectives in $\theta$ are finite-horizon, *backward induction* [35,27] can be applied to compute (precise) optimal equilibria values with respect to the criterion $\star_2$ and equilibria type $\star_1$. On the other hand, if all the objectives are infinite-horizon, *value iteration* [9] can be used to approximate optimal equilibria values and, when there is a combination of objectives, the game under study is modified in a standard manner to make all objectives infinite-horizon.

Backward induction and value iteration over the CSG $\mathsf{G}^{\mathcal{C}}$ both work by iteratively computing new values for each state $s$ of $\mathsf{G}^{\mathcal{C}}$. The values for each state, in each iteration, are found by computing optimal equilibria values of an NFG $\mathsf{N}$ whose utility function is derived from the outgoing transition probabilities from $s$ in the CSG and the values computed for the successor states of $s$ in the previous iteration. The difference here, with respect to [21], is that the NFGs are solved for the additional equilibria and optimality conditions considered in this paper, which we compute using the algorithms presented in Section 2.

**Algorithm for probabilistic until.** Because of space limitations, we only present here the details of value iteration for (unbounded) probabilistic until, i.e., for $\phi = \langle\!\langle C_1{:}\cdots{:}C_m\rangle\!\rangle(\star_1, \star_2)_{\max\sim x}(\theta)$ where $\theta = \mathsf{P}[\,\mathsf{a}^1_1\ \mathsf{U}\ \mathsf{a}^1_2\,] + \cdots + \mathsf{P}[\,\mathsf{a}^m_1\ \mathsf{U}\ \mathsf{a}^m_2\,]$. The complete model checking algorithm can be found in [23].

Following [21], we use $\mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s, \star_1, \star_2, \theta, n)$ to denote the vector of computed values, at iteration $n$, in state $s$ of $\mathsf{G}^{\mathcal{C}}$ for optimality criterion $\star_2$ (SW or SF), equilibria type $\star_1$ (NE or CE) and (until) objectives $\theta$. We also use $\mathbf{1}_m$ and $\mathbf{0}_m$ to denote vectors of size $m$ whose entries all equal 1 or 0, respectively. For any set $S'$ we let $\eta_{S'}(x)$ equal 1 if $x \in S'$ and 0 otherwise, and for an atomic proposition a and state $s$ we let $\eta_{\mathsf{a}}(s)$ equal 1 if $\mathsf{a} \in L(s)$ and 0 otherwise.

Each step of value iteration also keeps track of two sets $D, E \subseteq M$, where $M = \{1, \ldots, m\}$ is the set of players of $\mathsf{G}^{\mathcal{C}}$. We use $D$ for the subset of players that have already reached their goal (by satisfying $\mathsf{a}^i_2$) and $E$ for the players that can no longer satisfy their goal (having reached a state that fails to satisfy $\mathsf{a}^i_1$). It can then be ensured that their payoffs no longer change and are set to 1 or 0, respectively. In these cases, we effectively consider a modified game where, although the payoffs for these players are fixed, we still need to take their strategies into account in order to guarantee an optimal equilibrium.

Optimal values for all states $s$ in the CSG $\mathsf{G}^{\mathcal{C}}$ can be computed as the limit $\mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s, \star_1, \star_2, \theta) = \lim_{n \to \infty} \mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s, \star_1, \star_2, \theta, n)$, where $\mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s, \star_1, \star_2, \theta, n) = \mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s, \star_1, \star_2, \emptyset, \emptyset, \theta, n)$ and, for any $D, E \subseteq M$ such that $D \cap E = \emptyset$:

$$\mathrm{V}\_{\mathsf{G}^{\mathcal{C}}}(s,\star\_{1},\star\_{2},D,E,\theta,n) = \begin{cases} (\eta\_{D}(1),\ldots,\eta\_{D}(m)) & \text{if } D\cup E = M\\ (\eta\_{\mathsf{a}\_{2}^{1}}(s),\ldots,\eta\_{\mathsf{a}\_{2}^{m}}(s)) & \text{else if } n = 0\\ \mathrm{V}\_{\mathsf{G}^{\mathcal{C}}}(s,\star\_{1},\star\_{2},D\cup D',E,\theta,n) & \text{else if } D' \neq \emptyset\\ \mathrm{V}\_{\mathsf{G}^{\mathcal{C}}}(s,\star\_{1},\star\_{2},D,E\cup E',\theta,n) & \text{else if } E' \neq \emptyset\\ \mathit{val}(\mathsf{N},\star\_{1},\star\_{2}) & \text{otherwise} \end{cases}$$

where $D' = \{l \in M \setminus (D \cup E) \mid \mathsf{a}^l_2 \in L(s)\}$, $E' = \{l \in M \setminus (D \cup E) \mid \mathsf{a}^l_1 \notin L(s) \text{ and } \mathsf{a}^l_2 \notin L(s)\}$ and $\mathit{val}(\mathsf{N}, \star_1, \star_2)$ equals the optimal values of the NFG $\mathsf{N} = (M, A^{\mathcal{C}}, u)$ with respect to the criterion $\star_2$ and equilibria type $\star_1$, in which for any $1 \leqslant l \leqslant m$ and $\alpha \in A^{\mathcal{C}}$:

$$u\_l(\alpha) = \begin{cases} 1 & \text{if } l \in D \\ 0 & \text{else if } l \in E \\ \sum\_{s' \in S} \delta^{\mathcal{C}}(s, \alpha)(s') \cdot v\_{n-1}^{s', l} & \text{otherwise} \end{cases}$$

and $(v^{s',1}_{n-1}, v^{s',2}_{n-1}, \ldots, v^{s',m}_{n-1}) = \mathrm{V}_{\mathsf{G}^{\mathcal{C}}}(s', \star_1, \star_2, D, E, \theta, n{-}1)$ for all $s' \in S$.
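The $D'/E'$ bookkeeping in the recursion above can be sketched as a small helper. Here `labels` stands for the label set $L(s)$, with strings like `"a1_2"` encoding $\mathsf{a}^1_2$; this encoding is purely illustrative and not the tool's internal representation:

```python
def update_goal_sets(labels, D, E, m):
    """One application of the D' and E' updates for a state s with labels L(s):
    players l not yet settled join D if their goal a^l_2 holds in s, and join E
    if neither a^l_1 nor a^l_2 holds (the until objective can no longer be met)."""
    remaining = set(range(1, m + 1)) - D - E
    D_new = {l for l in remaining if f"a{l}_2" in labels}
    E_new = {l for l in remaining - D_new
             if f"a{l}_1" not in labels and f"a{l}_2" not in labels}
    return D | D_new, E | E_new

# Three coalitions: coalition 1 reaches its goal, coalition 2 gets stuck,
# coalition 3 (whose a^3_1 still holds) remains undecided.
D, E = update_goal_sets({"a1_2", "a3_1"}, set(), set(), 3)
print(D, E)  # {1} {2}
```

Players placed in $D$ or $E$ then have their entries fixed to 1 or 0 in the NFG utilities, as in the case distinction for $u_l(\alpha)$ above.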

Since this paper considers equilibria for any number of coalitions (in particular, for more than two), the above follows the algorithm of [21] in the way that it keeps track of the coalitions that have satisfied their objective ($D$) or can no longer do so ($E$). By contrast, the CSG algorithm of [24] was limited to two coalitions, which enabled the exploitation of efficient MDP analysis techniques for such coalitions. As explained in [21], in our setting we cannot reduce the analysis from an $n$-coalition game to an $(n-1)$-coalition game: doing so would give one of the remaining coalitions additional power (the action choices of the coalition that has satisfied its objective or can no longer do so), and therefore an advantage over the other coalitions.

**Strategy synthesis.** As in [24,21], we can extend the model checking algorithm to perform *strategy synthesis*, generating a witness (i.e., a profile or joint strategy) representing the corresponding optimal equilibrium. This is achieved by storing the profile or joint strategy for the NFG solved in each state. Both the profiles and joint strategies require finite memory and are probabilistic. Memory is required because choices change after a path formula becomes true or a target is reached, and to keep track of the step bound in finite-horizon properties. Randomisation is required for both NE and CE of NFGs.

**Correctness and complexity.** The correctness of the algorithm follows directly from [24,21], as changing the class of equilibria or the optimality criterion does not change the proof. The complexity of the algorithm is linear in the formula size, and value iteration requires finding optimal NE or CE for an NFG in each state of the model. Computing NEs of an NFG with two (or more) players is PPAD-complete [12,11], while finding optimal CEs of an NFG is in P [15].

#### **4 Case Studies and Experimental Results**

We have developed an implementation of our techniques for equilibria synthesis on CSGs, described above, building on top of the PRISM-games [22] model checker. Our implementation extends the tool's existing support for construction and analysis of CSGs, which is contained within its sparse-matrix-based "explicit" engine written in Java. We have considered a range of CSG case studies (supplementary material can be found at [40]). Below, we summarise the efficiency and scalability of our approach, again running on a 2.10GHz Intel Xeon Gold with 32GB JVM memory, and then describe our findings on individual case studies.

**Efficiency and scalability.** Table 2 summarises the performance of our implementation on the case studies that we have considered. It shows the statistics for each CSG, and the time taken to build it and perform equilibria synthesis, for several different variants (NE vs. CE, SW vs. SF). Comparing the efficiency of synthesising SWNE and SWCE, we see that the latter is typically much faster. For two-player NE, the social fairness variant is no more expensive to compute, since we enumerate all NEs. For CE, where Z3 rather than Gurobi is used for finding SF equilibria, we note that, although Z3 is able to find optimal equilibria, it is not primarily developed as an optimisation suite, and therefore generally performs poorly in comparison with Gurobi. The benefits of socially fair equilibria, in terms of the values yielded for individual players, are discussed in the in-depth coverage of the different case studies below.

**Aloha.** In this case study, introduced in [24], a number of users try to send packets using the slotted Aloha protocol. We suppose that each user has one packet to send and, in a time slot, if *k* users try to send their packets, then the probability that each packet is successfully sent is *q/k*, where *q ∈* [0*,* 1]. If a user fails to send a packet, then the number of slots it waits before resending the packet is set according to Aloha's exponential backoff scheme. The scheme requires that each user maintains a backoff counter, which it increases each time


Table 2: Statistics for a set of CSG verification instances (timeout 2 hours).

there is a packet failure (up to *b*<sub>max</sub>) and, if the counter equals *k* and a failure occurs, randomly chooses the number of slots to wait from $\{0, 1, \ldots, 2^{k}{-}1\}$.
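A minimal sketch of the backoff behaviour just described; the exact handling of the cap *b*<sub>max</sub> here is a simplifying assumption, not the PRISM model.

```python
import random

def on_failure(counter, bmax):
    """On each packet failure, increase the backoff counter up to bmax."""
    return min(counter + 1, bmax)

def backoff_slots(counter, rng=random):
    """After a failure with backoff counter k, wait a number of slots
    drawn uniformly from {0, 1, ..., 2**k - 1}."""
    return rng.randrange(2 ** counter)
```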

We suppose that the objective of each user is to minimise the expected time to send their packet, which is represented by the nonzero-sum formula $\langle\langle \mathit{usr}\_1{:}\cdots{:}\mathit{usr}\_m\rangle\rangle(\star\_1,\star\_2)\_{\min=?}(\mathsf{R}^{\mathit{time}}[\mathsf{F}\, s\_1]{+}\cdots{+}\mathsf{R}^{\mathit{time}}[\mathsf{F}\, s\_m])$. Synthesising optimal strategies for this specification, we find that the values for SWNE and SWCE coincide (although SWCE returns a joint strategy for the players, this joint strategy can be separated to form a strategy profile). This profile requires one user to try to send first, and then for the remaining users to take turns trying to send afterwards. If a user fails to send, then they enter backoff and allow all remaining users to try to send before trying again. There is no gain for a user in trying to send at the same time as another, as this will increase the probability of a sending failure, and therefore the time the user has to spend in backoff before getting to try again. For SFNE, which has only been implemented for the two-player case, the two users follow identical strategies, which involve randomly deciding whether to wait or transmit, unless they are the only user that has not transmitted, in which case they always try to send when not in backoff. In the case of SFCE, users can employ a shared probabilistic signal to coordinate which user sends next. Initially, this is a uniform choice over the users, but as time progresses the signal favours the users with lower backoff counters, as these users have had fewer opportunities to send their packet previously.

In Figure 2 we have plotted the optimal values for the players, where SW<sub>*i*</sub> corresponds to the optimal value (expected time to send their packet) of player *i* for both SWNE and SWCE, for the cases of two, three and four users. We see that the optimal values for the different users under SFNE and SFCE coincide, while under SWNE and SWCE they are different for each user (with the user sending first having the lowest and the user sending last the highest). Comparing the sum of the SWNE (and SWCE) values with that of the SFCE values, we see a small decrease in the sum of less than 2% of the total, while for SFNE there is a greater difference as the players cannot coordinate, and hence try to send at the same time.

**Power control.** This case study is based on a model of power control in cellular networks from [7]. In the network there are a number of users that each have a mobile phone. The phones emit signals that the users can strengthen by increasing the phone's power level up to a bound (*pow*<sub>max</sub>). A stronger signal can improve transmission quality, but uses more energy and lowers the quality of the transmissions of other phones due to interference. We use the extended model from [22], which adds a probability of failure (*q*<sub>fail</sub>) when a power level is increased and assumes each phone has a limited battery capacity (*e*<sub>max</sub>). There is a reward structure associated with each phone representing transmission quality, which is dependent on both the phone's power level and the power levels of other phones due to interference. We consider the nonzero-sum property $\langle\langle p\_1{:}\cdots{:}p\_m\rangle\rangle(\star\_1,\star\_2)\_{\max=?}(\mathsf{R}^{r\_1}[\mathsf{F}\, e\_1]{+}\cdots{+}\mathsf{R}^{r\_m}[\mathsf{F}\, e\_m])$, where each user tries to maximise their expected reward before their phone's battery is depleted.

In Figure 3 we have presented the expected rewards of the players under the synthesised SWCE and SFCE joint strategies. When performing strategy synthesis, in the case of two users the SWNE and SWCE yield the same profile in which, when the users' batteries are almost depleted, one user tries to increase their phone's power level and, if successful, in the next step, the second user then tries to increase their phone's power level. Since the first user's phone battery is depleted when the second tries to increase, this increase does not cause any interference. On the other hand, if the first user fails to increase their power level, then both users increase their battery levels. For the SFCE, the users can coordinate and flip a coin as to which user goes first: as demonstrated by Figure 3, this yields equal rewards for the users, unlike the SWCE. In the case of three users, the SWNE and SWCE differ (we were only able to synthesise SWNE for *pow*<sub>max</sub> = 2, as for larger values the computation had not completed within the timeout); again users take turns to try to increase their phone's power level. However, here if the users are unsuccessful the SWCE can coordinate as to which user goes next in trying to increase their phone's battery level. Through this coordination, the users' rewards can be increased as the battery level of at most one phone increases at a time, which limits interference. On the other hand, for the SWNE users must decide independently whether to increase their phone's battery level, and they each randomly decide whether to do so or not.

**Public good.** We next consider a variant of a *public good* game [19], based on the one presented in [22] for the two-player case. In this game a number of players each receive an initial amount of capital (*e*<sub>init</sub>) and, in each of *r*<sub>max</sub> months, can invest none, half or all of their current capital. The total invested by the players in a month is multiplied by a factor *f* and distributed equally among the players before the start of the next month. The aim of the players is to maximise their expected capital, which is represented by the formula $\langle\langle p\_1{:}\cdots{:}p\_m\rangle\rangle(\star\_1,\star\_2)\_{\max=?}(\mathsf{R}^{c\_1}[\mathsf{I}^{=r\_{max}}]{+}\cdots{+}\mathsf{R}^{c\_m}[\mathsf{I}^{=r\_{max}}])$.
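As an illustration of the capital dynamics (not the PRISM model itself), one month of the game can be sketched as follows, with the investments fixed deterministically:

```python
def play_month(capitals, fractions, f):
    """Each player invests a fraction (0, 0.5 or 1) of their capital; the
    pooled investment, multiplied by f, is split equally among all players."""
    invested = [c * x for c, x in zip(capitals, fractions)]
    share = f * sum(invested) / len(capitals)
    return [c - i + share for c, i in zip(capitals, invested)]
```

For instance, with `f = 1.5` and capitals `[4, 4, 4]`, the profile where the players invest all, half and none of their capital pools 6, which grows to 9, so each player receives 3 back, yielding `[3.0, 5.0, 7.0]`.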

Figure 4 plots, for the three-player model, both the expected capital of individual players and the total expected capital after three months for the SWNE, SWCE and SFNE as the parameter *f* varies. As the results demonstrate, the players benefit, both as individuals and as a population, by coordinating through a correlated strategy. In addition, under the SFCE, all players receive the same expected capital with only a small decrease in the sum relative to that of the SWCE.

**Investors.** The final case study concerns a concurrent multi-player version of the futures market investor model of [26], in which a number of investors (the players) interact with a probabilistic stock market. In successive months, the investors choose whether to invest, wait or cash in their shares, while at the same time the market decides, with probability *p*<sub>bar</sub>, to bar each investor, with the restriction that an investor cannot be barred two months in a row or in the first month; the values of shares and the cap on values are then updated probabilistically.

We consider both two- and three-player models, where each investor tries to maximise its individual profit, represented by the nonzero-sum property $\langle\langle \mathit{inv}\_1{:}\cdots{:}\mathit{inv}\_m\rangle\rangle(\star\_1,\star\_2)\_{\max=?}(\mathsf{R}^{\mathit{pf}\_1}[\mathsf{F}\, \mathit{cin}\_1]{+}\cdots{+}\mathsf{R}^{\mathit{pf}\_m}[\mathsf{F}\, \mathit{cin}\_m])$. In Figure 5 we have plotted the different optimal values for NE and CE of the two-player game, and the different optimal values for CE of the three-player game (the computation of NE values timed out for the three-player case). As the results demonstrate, again we see that the coordination that CEs offer can improve the returns of the players and that, although considering social fairness does decrease the returns of some players, this decrease is limited, particularly for CEs.

#### **5 Conclusions**

We have presented novel techniques for game-theoretic verification of probabilistic multi-agent systems, focusing on correlated equilibria and a notion of social fairness. We began with the simpler case of normal form games and then extended this to concurrent stochastic games, using temporal logic to formally specify equilibria. We proposed algorithms for equilibrium synthesis, implemented them, and illustrated their benefits, in terms of efficiency and fairness, on case studies from a range of application domains.

Future work includes exploring the use of further game-theoretic topics within this area, such as techniques for mechanism design or other concepts such as Stackelberg equilibria. We plan to implement SFCE computation in Gurobi using the *big-M method* [16] to encode implications and techniques from [37] to encode conjunctions, which should yield a significant speed-up in their computation.

**Acknowledgements.** This project was funded by the ERC under the European Union's Horizon 2020 research and innovation programme (FUN2MODEL, grant agreement No. 834115).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Omega Automata

# A Direct Symbolic Algorithm for Solving Stochastic Rabin Games

Tamajit Banerjee<sup>1</sup>, Rupak Majumdar<sup>2</sup>, Kaushik Mallik<sup>2</sup>, Anne-Kathrin Schmuck<sup>2</sup>, and Sadegh Soudjani<sup>3</sup>

<sup>1</sup> IIT Delhi, New Delhi, India
<sup>2</sup> MPI-SWS, Kaiserslautern, Germany
<sup>3</sup> Newcastle University, Newcastle upon Tyne, UK

Abstract. We consider turn-based stochastic 2-player games on graphs with ω-regular winning conditions. We provide a direct symbolic algorithm for solving such games when the winning condition is formulated as a Rabin condition. For a stochastic Rabin game with k pairs over a game graph with n vertices, our algorithm runs in O(n<sup>k+2</sup>k!) symbolic steps, which improves the state of the art.

We have implemented our symbolic algorithm, along with performance optimizations including parallelization and acceleration, in a BDD-based synthesis tool called Fairsyn. We demonstrate the superiority of Fairsyn compared to the state of the art on a set of synthetic benchmarks derived from the VLTS benchmark suite and on a control system benchmark from the literature. In our experiments, Fairsyn performed significantly faster, with up to two orders of magnitude improvement in computation time.

### 1 Introduction

Symbolic algorithms for 2-player graph games are at the heart of many problems in the automatic synthesis of correct-by-construction hardware, software, and cyber-physical systems from logical specifications. The problem has a rich pedigree, going back to Church [10] and a sequence of seminal results [6,31,17,30,13,14,34,21]. A chain of reductions can be used to reduce the synthesis problem for ω-regular specifications to finding winning strategies in 2-player games on graphs, for which (symbolic) algorithms are known (see, e.g., [29,14,34,27]). These algorithms form the basis for algorithmic reactive synthesis.

For systems under uncertainty, it is also essential to capture non-determinism quantitatively using probability distributions [5,18,22,25]. Turn-based stochastic 2-player games [3,9], also known as 2½-player games, generalize 2-player graph games with an additional category of "random" vertices: whenever the game reaches a random vertex, a random process picks one of the outgoing edges according to a probability distribution. The qualitative winning problem asks whether a vertex of the game graph is almost surely winning for Player 0. Stochastic Rabin games were studied by Chatterjee et al. [7], who showed that the problem is NP-complete and that winning strategies can be restricted to

be pure (non-randomized) and memoryless. Moreover, they showed a reduction from qualitative winning in an n-vertex k-pair stochastic Rabin game to an O(n(k+1))-vertex (k+1)-pair (deterministic) Rabin game, resulting in an O((n(k+1))<sup>k+2</sup>(k+1)!) algorithm. In contrast, we provide a direct O(n<sup>k+2</sup>k!) symbolic algorithm for the problem.

Our new direct symbolic algorithm is obtained in the following way. We replace the probabilistic transitions with transitions of the environment constrained by extreme fairness as described by Pnueli [28]. Extreme fairness is specified via a special set of Player 1 vertices, called live vertices. A run is extremely fair if whenever a live vertex is visited infinitely often, every outgoing edge from this vertex is taken infinitely often. As our first contribution, we show that to solve a qualitative stochastic Rabin game, we can equivalently solve a (deterministic) Rabin game over the same game graph by interpreting random vertices of the stochastic game as live vertices.

As our second contribution, we present a direct symbolic algorithm to solve (deterministic) Rabin games with live vertices, which we call extremely fair adversarial Rabin games. In particular, we show a surprisingly simple syntactic transformation that modifies the well-known symbolic fixpoint algorithm for solving 2-player Rabin games on graphs (without live vertices), such that the modified fixpoint solves the extremely fair adversarial version of the game.

To appreciate the simplicity of our modification, let us consider the well-known fixpoint algorithms for Büchi and co-Büchi games—particular classes of Rabin games—given by the following µ-calculus formulas:

$$\begin{array}{ll} \text{Büchi:} & \nu Y. \ \mu X. \ (G \cap \text{Cpre}(Y)) \cup (\text{Cpre}(X)) \ ,\\ \text{Co-Büchi:} & \mu X. \ \nu Y. \ (G \cup \text{Cpre}(X)) \cap (\text{Cpre}(Y)) \ . \end{array}$$

where Cpre(·) denotes the controllable predecessor operator and G denotes the set of goal states that should be visited recurrently. In the presence of strong transition fairness, the new algorithm becomes

$$\begin{array}{ll} \textbf{Büchi:} & \nu Y. \,\mu X. \,\left(G \cap \text{Cpre}(Y)\right) \cup \left(\text{Apre}(Y,X)\right),\\ \textbf{Co-Büchi:} & \nu W. \,\mu X. \,\nu Y. \,\left(G \cup \text{Apre}(W,X)\right) \cap \left(\text{Cpre}(Y)\right). \end{array}$$

The only syntactic change (highlighted in blue) we make is to replace the controllable predecessor operator applied to the µ variable X with a new almost sure predecessor operator Apre(Y, X), which also incorporates the preceding ν variable Y; if the fixpoint starts with a µ variable (with no preceding ν variable), as for co-Büchi games, we introduce one additional ν variable in front. For the general class of Rabin specifications, with a more involved fixpoint and with arbitrarily high nesting depth depending on the number of Rabin pairs, we need to perform this substitution for every such Cpre(·) operator for every µ variable.
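For intuition, the unmodified Büchi fixpoint νY. µX. (G ∩ Cpre(Y)) ∪ Cpre(X) can be evaluated over a small explicit-state game as in the sketch below; the set-based encoding is purely illustrative and unrelated to the BDD representation used by Fairsyn.

```python
# Explicit-state sketch of the classical (fairness-free) Büchi fixpoint.
def cpre(U, vertices, edges, owner):
    """Controllable predecessor: Player 0 can force the next step into U."""
    return {v for v in vertices
            if (owner[v] == 0 and any(w in U for w in edges[v]))
            or (owner[v] == 1 and all(w in U for w in edges[v]))}

def buchi_win(vertices, edges, owner, goal):
    """Compute nu Y. mu X. (goal & Cpre(Y)) | Cpre(X) by iteration."""
    Y = set(vertices)                       # greatest fixpoint variable
    while True:
        X = set()                           # least fixpoint variable
        while True:
            newX = (goal & cpre(Y, vertices, edges, owner)) \
                   | cpre(X, vertices, edges, owner)
            if newX == X:
                break
            X = newX
        if X == Y:                          # outer fixpoint reached
            return Y
        Y = X
```

On a two-vertex Player 0 cycle a → b → a with goal {a}, both vertices win, while a Player 1 vertex with only a self-loop outside the goal loses.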

We prove the correctness of this syntactic fixpoint transformation for solving Rabin games [31,27] in this paper. It can be shown that the same syntactic transformation may be used to obtain fixpoint algorithms for the qualitative solution of stochastic games with other popular ω-regular objectives, namely Reachability, Safety, (generalized) Büchi, (generalized) co-Büchi, Rabin-chain, parity, and GR(1). Owing to page constraints, these additional fixpoints are only discussed in the extended version [4] of this paper, where we also generalize all results presented here to a weaker notion of fairness, called transition fairness. In a nutshell, these results show that one can solve games with live vertices while retaining the algorithmic characteristics and implementability of known symbolic fixpoint algorithms that do not consider fairness assumptions.

We have implemented our symbolic algorithm for solving stochastic Rabin games in a symbolic BDD-based reactive synthesis tool called Fairsyn. Fairsyn additionally uses parallelization and a fixpoint acceleration technique [23] to boost performance. We evaluate our tool on two case studies, one using synthetic benchmarks derived from the VLTS benchmark suite [15] and the other from controller synthesis for stochastic control systems [12]. We show that Fairsyn scales well on these case studies, and outperforms the state-of-the-art methods by up to two orders of magnitude.

All the technical proofs, the fixpoints for various other specifications, and an additional benchmark taken from the software engineering literature [8] can be found in the extended version of this paper, under a slightly more relaxed setting of the problem (transition fairness instead of extreme fairness) [4].

### 2 Preliminaries

Notation: We write N<sub>0</sub> to denote the set of natural numbers including zero. Given a, b ∈ N<sub>0</sub>, we write [a; b] to denote the set {n ∈ N<sub>0</sub> | a ≤ n ≤ b}. By definition, [a; b] is an empty set if a > b. For any set A ⊆ U defined on the universe U, we write A̅ to denote the complement of A. Given an alphabet A, we use the notation A<sup>∗</sup> and A<sup>ω</sup> to denote respectively the set of all finite words and the set of all infinite words formed using the letters of the alphabet A. Let A and B be two sets and R ⊆ A × B be a relation. For any element a ∈ A, we use the notation R(a) to denote the set {b ∈ B | (a, b) ∈ R}.

2½-player game graph: We consider usual turn-based stochastic games, also known as 2½-player games, played between Player 0, Player 1, and a third player representing environmental randomness which is treated as a "half player." Formally, a 2½-player game graph is a tuple G = ⟨V, V<sub>0</sub>, V<sub>1</sub>, V<sub>r</sub>, E⟩ where (i) V is a finite set of vertices, (ii) V<sub>0</sub>, V<sub>1</sub>, and V<sub>r</sub> are subsets of V which form a partition of V, and (iii) E ⊆ V × V is the set of directed edges. The vertices in V<sub>r</sub> are called random vertices, and the edges originating in a random vertex are called random edges, denoted as E<sub>r</sub>. A 2½-player game graph with no random vertices (i.e., V<sub>r</sub> = ∅) is called a 2-player game graph. A 2½-player game graph with V<sub>1</sub> = ∅ is called a 1½-player game graph (also known as a Markov Decision Process or MDP). A 2½-player game graph with V = V<sub>r</sub> is known as a Markov chain.

Strategies: A (deterministic) strategy of Player 0 is a function ρ<sub>0</sub> : V<sup>∗</sup>V<sub>0</sub> → V with ρ<sub>0</sub>(wv) ∈ E(v) for every wv ∈ V<sup>∗</sup>V<sub>0</sub>. Likewise, a strategy of Player 1 is a function ρ<sub>1</sub> : V<sup>∗</sup>V<sub>1</sub> → V with ρ<sub>1</sub>(wv) ∈ E(v) for every wv ∈ V<sup>∗</sup>V<sub>1</sub>. We denote the set of strategies of Player i by Π<sub>i</sub>. A strategy ρ<sub>i</sub> of Player i (i ∈ {0, 1}) is memoryless if for every w<sub>1</sub>v, w<sub>2</sub>v ∈ V<sup>∗</sup>V<sub>i</sub>, we have ρ<sub>i</sub>(w<sub>1</sub>v) = ρ<sub>i</sub>(w<sub>2</sub>v). In this paper we restrict attention to deterministic strategies, as randomized strategies are no more powerful than deterministic ones for 2½-player Rabin games [7].

Plays: Consider an infinite sequence of vertices<sup>4</sup> π = v<sup>0</sup>v<sup>1</sup>v<sup>2</sup>… ∈ V<sup>ω</sup>. The sequence π is called a play over G starting at the vertex v<sup>0</sup> if for every i ∈ N<sub>0</sub>, we have v<sup>i</sup> ∈ V and (v<sup>i</sup>, v<sup>i+1</sup>) ∈ E. A play is finite if it is of the form v<sup>0</sup>v<sup>1</sup>…v<sup>n</sup> for some finite n ∈ N<sub>0</sub>. Let ρ<sub>0</sub> ∈ Π<sub>0</sub> and ρ<sub>1</sub> ∈ Π<sub>1</sub> be a pair of strategies for the two players, and v<sup>0</sup> ∈ V be a given initial vertex. For every finite play π = v<sup>0</sup>v<sup>1</sup>…v<sup>n</sup>, the next vertex v<sup>n+1</sup> is obtained as follows: if v<sup>n</sup> ∈ V<sub>0</sub> then v<sup>n+1</sup> = ρ<sub>0</sub>(v<sup>0</sup>…v<sup>n</sup>); if v<sup>n</sup> ∈ V<sub>1</sub> then v<sup>n+1</sup> = ρ<sub>1</sub>(v<sup>0</sup>…v<sup>n</sup>); and if v<sup>n</sup> ∈ V<sub>r</sub> then v<sup>n+1</sup> is chosen uniformly at random from the set E<sub>r</sub>(v<sup>n</sup>). The uniform probability distribution over the random edges is without loss of generality for the problem considered in this paper; we will come back to this after setting up the problem statement. Every play generated in this way by fixing ρ<sub>0</sub>, ρ<sub>1</sub>, and v<sup>0</sup> is called a play compliant with ρ<sub>0</sub> and ρ<sub>1</sub> that starts at vertex v<sup>0</sup>. The random choice in the random vertices induces a probability measure P<sup>ρ<sub>0</sub>,ρ<sub>1</sub></sup><sub>v<sup>0</sup></sub> on the sample space of plays.<sup>5</sup> This is in contrast to 2-player games, where for any choice of ρ<sub>0</sub> ∈ Π<sub>0</sub>, ρ<sub>1</sub> ∈ Π<sub>1</sub>, and v<sup>0</sup> ∈ V, the resulting compliant play is unique.
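A toy sketch of how a compliant play is generated, restricted to memoryless strategies given as successor functions (sufficient for the games considered here, per [7]); the dictionary-based graph encoding is an illustrative assumption.

```python
import random

def generate_play(v0, edges, owner, rho0, rho1, steps, rng=random):
    """Generate a finite prefix of a play compliant with rho0 and rho1.
    owner[v] is 0, 1, or 'r'; rho0/rho1 map a vertex to a successor;
    at a random vertex a successor is chosen uniformly at random."""
    play = [v0]
    for _ in range(steps):
        v = play[-1]
        if owner[v] == 0:
            play.append(rho0(v))
        elif owner[v] == 1:
            play.append(rho1(v))
        else:                       # random vertex: uniform over outgoing edges
            play.append(rng.choice(sorted(edges[v])))
    return play
```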

Winning Conditions: A winning condition ϕ is a set of infinite plays over G, i.e., ϕ ⊆ V<sup>ω</sup>, where the game graph G will always be clear from the context. We adopt Linear Temporal Logic (LTL) notation for describing winning conditions. The atomic propositions for the LTL formulas are sets of vertices, i.e., elements of the set 2<sup>V</sup>. We use the standard symbols for the Boolean and the temporal operators: "¬" for negation, "∧" for conjunction, "∨" for disjunction, "→" for implication, "U" for until (A U B means "the play remains inside the set A until it moves to the set B"), "◯" for next (◯A means "the next vertex is in the set A"), "♦" for eventually (♦A means "the play will eventually visit a vertex from the set A"), and "□" for always (□A means "the play will only visit vertices from the set A"). The syntax and semantics of LTL can be found in standard textbooks [3]. By slightly abusing notation, we use ϕ interchangeably to denote both the LTL formula and the set of plays satisfying ϕ. Hence, we write π ∈ ϕ to denote the satisfaction of the formula ϕ by the play π.

Rabin Winning Conditions: A Rabin winning condition is expressed using a set of k Rabin pairs R = {⟨G<sub>1</sub>, R<sub>1</sub>⟩, …, ⟨G<sub>k</sub>, R<sub>k</sub>⟩}, where k is any positive integer and G<sub>i</sub>, R<sub>i</sub> ⊆ V for all i ∈ [1; k]. We say that R has the index set P = [1; k]. A play π satisfies the Rabin condition R if π satisfies the LTL formula

$$\varphi \coloneqq \bigvee\_{i \in P} \left( \Diamond \Box \overline{R}\_i \wedge \Box \Diamond G\_i \right) . \tag{2}$$
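For an ultimately periodic (lasso-shaped) play, the set of vertices visited infinitely often is exactly the set of vertices on the repeating cycle, so condition (2) can be checked directly; the stem/cycle representation is an assumption of this toy checker.

```python
def satisfies_rabin(cycle, pairs):
    """pairs is a list of (G_i, R_i) vertex sets. An ultimately periodic
    play with this repeating cycle satisfies the Rabin condition iff, for
    some pair, no R_i vertex recurs and some G_i vertex recurs."""
    inf = set(cycle)                 # vertices visited infinitely often
    return any(not (inf & R) and bool(inf & G) for G, R in pairs)
```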

Almost Sure Winning: Let G be a 2½-player game graph, ρ<sub>0</sub> ∈ Π<sub>0</sub> and ρ<sub>1</sub> ∈ Π<sub>1</sub> be a pair of strategies, v<sup>0</sup> ∈ V be an initial vertex, and ϕ be an ω-regular specification over the vertices of G. Then P<sup>ρ<sub>0</sub>,ρ<sub>1</sub></sup><sub>v<sup>0</sup></sub>(ϕ) denotes the probability of satisfaction of ϕ by the plays compliant with ρ<sub>0</sub> and ρ<sub>1</sub> and starting at v<sup>0</sup>. The set of almost sure winning states of Player 0 for the specification ϕ is defined as the set W<sub>a.s.</sub> ⊆ V such that for every v<sup>0</sup> ∈ W<sub>a.s.</sub> the following holds: sup<sub>ρ<sub>0</sub>∈Π<sub>0</sub></sub> inf<sub>ρ<sub>1</sub>∈Π<sub>1</sub></sub> P<sup>ρ<sub>0</sub>,ρ<sub>1</sub></sup><sub>v<sup>0</sup></sub>(ϕ) = 1. It is known [7, Thm. 4] that there is an optimal (deterministic) memoryless strategy ρ<sub>0</sub><sup>∗</sup> ∈ Π<sub>0</sub>—called the optimal almost sure winning strategy—such that for every v<sup>0</sup> ∈ W<sub>a.s.</sub> it holds that inf<sub>ρ<sub>1</sub>∈Π<sub>1</sub></sub> P<sup>ρ<sub>0</sub><sup>∗</sup>,ρ<sub>1</sub></sup><sub>v<sup>0</sup></sub>(ϕ) = 1.

<sup>4</sup> In our convention for denoting vertices, superscripts (ranging over N<sub>0</sub>) denote the position of a vertex within a given sequence/play, whereas subscripts, either 0, 1, or r, denote the membership of a vertex in the sets V<sub>0</sub>, V<sub>1</sub>, or V<sub>r</sub> respectively.

<sup>5</sup> The unique measure P<sup>ρ<sub>0</sub>,ρ<sub>1</sub></sup><sub>v<sup>0</sup></sub> is obtained through Carathéodory's extension theorem by extending the pre-measure on every infinite extension—called the cylinder set—of every finite play; see [3, pp. 757] for details.

We extend the notion of winning to 2-player games as follows. Fix a 2-player game graph G = ⟨V, V<sub>0</sub>, V<sub>1</sub>, ∅, E⟩ and an ω-regular specification ϕ over V. Player 0 wins the game from a vertex v<sup>0</sup> ∈ V if Player 0 has a strategy ρ<sub>0</sub> such that for every Player 1 strategy ρ<sub>1</sub>, the unique resulting play starting at v<sup>0</sup> is in ϕ. The winning region W ⊆ V is the set of vertices from which Player 0 wins the game. It is known that Player 0 has a memoryless strategy ρ<sub>0</sub><sup>∗</sup>—called the optimal winning strategy—such that for every Player 1 strategy ρ<sub>1</sub> ∈ Π<sub>1</sub> and for every initial vertex v<sup>0</sup> ∈ W, the resulting unique compliant play is in ϕ [19].

#### 3 Problem Statement and Outline

Given a 2½-player game graph G and a Rabin specification ϕ as in (2), we consider the problem of solving the induced qualitative reactive synthesis problem. That is, we want to compute the set of almost sure winning states W<sub>a.s.</sub> of G w.r.t. ϕ and the corresponding optimal memoryless winning strategy ρ<sub>0</sub><sup>∗</sup> of Player 0. This problem was solved by Chatterjee et al. [7] via a reduction from qualitative winning in the original 2½-player Rabin game to winning in a larger (deterministic) 2-player Rabin game with an additional Rabin pair.

Instead of inflating the game graph and introducing an extra Rabin pair at the cost of more expensive computation, we propose a direct and computationally more efficient symbolic algorithm over the original game graph G. We get this algorithm by interpreting the random vertices of G as special Player 1 vertices, called live vertices, which are subject to an extreme fairness assumption: along every play, if a live vertex v is visited infinitely often, then all outgoing transitions of v are also taken infinitely often. This re-interpretation results in a 2-player Rabin game with special live Player 1 vertices that are subject to extreme fairness assumptions on Player 1's behavior. We call such games extremely fair adversarial (2-player) Rabin games. The correctness of our symbolic algorithm then follows from the two main results of our paper.

(I) We show that qualitative winning in a 2½-player Rabin game G is equivalent to winning in the extremely fair adversarial (2-player) Rabin game G<sup>ℓ</sup> obtained from G. Moreover, the winning strategy ρ<sub>0</sub> of Player 0 in G<sup>ℓ</sup> is also the optimal almost sure winning strategy in G for ϕ (see Thm. 1 in Sec. 4).

(II) We give a direct symbolic algorithm to compute the set of winning states, along with the Player 0 winning strategy for extremely fair adversarial (2-player) Rabin games (see Thm. 2 in Sec. 5).

Both contributions are discussed in detail in Sec. 4 and Sec. 5, respectively. Even though, for convenience, we have assumed a uniform probability distribution over the random edges, our contributions are valid for any arbitrary probability distribution. This follows from the established fact that the qualitative analysis of 2½-player games does not depend on the precise probability values but only on the supports of the distributions [7].

We conclude the paper with an experimental evaluation in Sec. 6.

# 4 From Randomness to Extreme Fairness

In this section, we show that qualitative winning in 2½-player Rabin games is equivalent to winning in extremely fair adversarial (2-player) Rabin games over the same underlying game graph. While it is known [16, Thm. 11.1] that the reduction of random vertices to extreme fairness is sound and complete for liveness winning conditions<sup>6</sup>, we extend this connection to arbitrary Rabin winning conditions, and therefore to the entire class of ω-regular specifications. We start with a formal definition of extremely fair adversarial games and the connection between randomness and extreme fairness, before stating our main result in Thm. 1.

Extremely Fair Adversarial Games: Let G = ⟨V, V<sub>0</sub>, V<sub>1</sub>, ∅, E⟩ be a 2-player game graph with live vertices V<sup>ℓ</sup> ⊆ V<sub>1</sub>, denoted using the tuple G<sup>ℓ</sup> = ⟨G, V<sup>ℓ</sup>⟩. The set of edges originating from the live vertices are called the live edges, denoted as E<sup>ℓ</sup> := (V<sup>ℓ</sup> × V) ∩ E. A play π over G<sup>ℓ</sup> is extremely fair with respect to V<sup>ℓ</sup> if it satisfies the following LTL formula:

$$\alpha := \bigwedge\_{(v,v') \in E^{\ell}} \left( \Box \Diamond v \to \Box \Diamond (v \wedge \bigcirc v') \right). \tag{3}$$

Given G<sup>ℓ</sup> and an ω-regular winning condition ϕ over V, Player 0 wins the extremely fair adversarial game over G<sup>ℓ</sup> for ϕ from a vertex v<sup>0</sup> ∈ V if Player 0 wins the game over G<sup>ℓ</sup> for the winning condition α → ϕ from v<sup>0</sup>.

Randomness as Extreme Fairness: Let G = ⟨V, V<sub>0</sub>, V<sub>1</sub>, V<sub>r</sub>, E⟩ be a 2½-player game graph. Then we say that G induces the 2-player game graph with live vertices G<sup>ℓ</sup> := ⟨⟨V, V<sub>0</sub>, V<sub>1</sub> ∪ V<sub>r</sub>, ∅, E⟩, V<sub>r</sub>⟩. Intuitively, we interpret every random vertex of G as a live Player 1 vertex in G<sup>ℓ</sup>. Obviously, this reinterpretation does not change the structure of the underlying graph specified by V and E.

Soundness of the Reduction: It remains to show that the almost sure winning set and the optimal almost sure winning strategy of Player 0 in G for ϕ are the same as the winning state set and the winning strategy of Player 0 in G^ℓ for ϕ. This is formalized in the following theorem when ϕ is given as a Rabin condition. The proof essentially shows that the random vertices of G simulate the live vertices of G^ℓ, and vice versa; details are in the extended version [4, App. B.6, pp. 61].

<sup>6</sup> An LTL formula ϕ over V describes a liveness property if every finite play π over G allows for a continuation π′ s.t. ππ′ ∈ ϕ.

Theorem 1. Let G be a 2½-player game graph with vertex set V, ϕ ⊆ V^ω be a Rabin winning condition as in (2), and G^ℓ be the 2-player game graph with live edges induced by G. Let W ⊆ V be the set of vertices from which Player 0 wins the extremely fair adversarial game over G^ℓ with respect to ϕ, and W_{a.s.} be the almost sure winning set of Player 0 in the 2½-player game G with respect to ϕ. Then, W = W_{a.s.}. Moreover, an optimal almost sure winning strategy in G^ℓ is also an optimal winning strategy in G, and vice versa.

#### 5 Extremely Fair Adversarial Rabin Games

This section presents our main result: a symbolic fixpoint algorithm that computes the winning region of Player 0 in the extremely fair adversarial game over G^ℓ with respect to any ω-regular property formalized as a Rabin winning condition. This new symbolic fixpoint algorithm has multiple unique features.

(I) It works directly over G^ℓ, without requiring any pre-processing step to reduce G^ℓ to a "normal" 2-player game with a larger set of vertices.

(II) Our new fixpoint algorithm is obtained from the algorithm of Piterman et al. [27] by a simple syntactic change. We simply replace all controllable predecessor operators over least fixpoint variables by a new almost sure predecessor operator invoking the preceding maximal fixpoint variable. This makes the proof of our new fixpoint algorithm conceptually simple (see Sec. 5.3).

At a higher level, we make a simple yet efficient syntactic transformation of the fixpoint to incorporate the fairness assumption on the live vertices, without introducing any extra computational complexity. Most remarkably, this transformation also works directly for fixpoint algorithms for reachability, safety, Büchi, (generalized) co-Büchi, Rabin-chain, and parity games, as these can be formalized as particular instances of a Rabin game. Moreover, it also works for generalized Rabin, generalized Büchi, and GR(1) games. Owing to page constraints, these additional cases are described in the extended version [4].

#### 5.1 Preliminaries on Symbolic Computations over Game Graphs

Set Transformers: Our goal is to develop symbolic fixpoint algorithms to characterize the winning region of an extremely fair adversarial game over a game graph with live edges. As a first step, given G^ℓ, we define the required symbolic transformers of sets of states. We define the existential, universal, and controllable predecessor operators as follows. For S ⊆ V, we have

$$\operatorname{Pre}\_0^{\exists}(S) := \{ v \in V\_0 \mid E(v) \cap S \neq \emptyset \},\tag{4a}$$

$$\text{Pre}\_1^{\forall}(S) \coloneqq \{ v \in V\_1 \mid E(v) \subseteq S \}, \text{ and} \tag{4b}$$

$$\text{Cpre}(S) := \text{Pre}\_0^{\exists}(S) \cup \text{Pre}\_1^{\forall}(S). \tag{4c}$$

Intuitively, the controllable predecessor operator Cpre(S) computes the set of all states that can be controlled by Player 0 to stay in S after one step regardless of the strategy of Player 1. Additionally, we define two operators which take advantage of the fairness assumption on the live vertices. Given two sets S, T ⊆ V , we define the live-existential and almost sure predecessor operators:

$$\text{Lpre}^{\exists}(S) := \{ v \in V^{\ell} \mid E(v) \cap S \neq \emptyset \}, \quad \text{and} \tag{5a}$$

$$\text{Apre}(S, T) := \text{Cpre}(T) \cup \left(\text{Lpre}^{\exists}(T) \cap \text{Pre}\_1^{\forall}(S)\right). \tag{5b}$$

Intuitively, the almost sure predecessor operator<sup>7</sup> Apre(S, T) computes the set of all states that can be controlled by Player 0 to stay in T (via Cpre(T)) as well as all Player 1 states in V^ℓ that (a) will eventually make progress towards T if Player 1 obeys its fairness assumptions encoded in α (via Lpre^∃(T)) and (b) will never leave S in the "meantime" (via Pre₁^∀(S)). All the used set transformers are monotonic with respect to set inclusion. Further, Cpre(T) ⊆ Apre(S, T) always holds, Cpre(T) = Apre(S, T) if V^ℓ = ∅, and Apre(S, T) ⊆ Cpre(S) if T ⊆ S.
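On an explicit (non-symbolic) graph representation, the operators (4a)-(4c) and (5a)-(5b) can be sketched as follows; the encoding of E as a set of pairs and all function names are our own illustrative choices, not the BDD-based implementation used later:

```python
# V0/V1 partition the vertices, Vl <= V1 are the live vertices,
# and E is a set of (source, target) pairs.

def succ(E, v):
    """Successors E(v) of vertex v."""
    return {t for (s, t) in E if s == v}

def pre0_exists(V0, E, S):  # (4a): Player 0 vertex with SOME edge into S
    return {v for v in V0 if succ(E, v) & S}

def pre1_forall(V1, E, S):  # (4b): Player 1 vertex with ALL edges into S
    return {v for v in V1 if succ(E, v) <= S}

def cpre(V0, V1, E, S):     # (4c): controllable predecessor
    return pre0_exists(V0, E, S) | pre1_forall(V1, E, S)

def lpre_exists(Vl, E, S):  # (5a): live vertex with SOME edge into S
    return {v for v in Vl if succ(E, v) & S}

def apre(V0, V1, Vl, E, S, T):  # (5b): almost sure predecessor
    return cpre(V0, V1, E, T) | (lpre_exists(Vl, E, T) &
                                 pre1_forall(V1, E, S))
```

The stated inclusion Cpre(T) ⊆ Apre(S, T) is visible directly in the code: `apre` is a union whose first operand is `cpre(..., T)`.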

Fixpoint Algorithms in the µ-calculus: We use µ-calculus [20] as a convenient logical notation to define a symbolic algorithm (i.e., an algorithm that manipulates sets of states rather than individual states) for computing a set of states with a particular property over a given game graph G. The formulas of the µ-calculus, interpreted over a 2-player game graph G, are given by the grammar

$$\varphi ::= p \mid X \mid \varphi \cup \varphi \mid \varphi \cap \varphi \mid pre(\varphi) \mid \mu X.\, \varphi \mid \nu X.\, \varphi$$

where p ranges over subsets of V, X ranges over a set of formal variables, pre ranges over monotone set transformers in {Pre₀^∃, Pre₁^∀, Cpre, Lpre^∃, Apre}, and µ and ν denote, respectively, the least and the greatest fixed point of the functional defined as X ↦ ϕ(X). Since the operations ∪, ∩, and the set transformers pre are all monotonic, the fixed points are guaranteed to exist. A µ-calculus formula evaluates to a set of states over G, and this set can be computed by induction over the structure of the formula, where the fixed points are evaluated by iteration. We omit the (standard) semantics of formulas (see [20]).
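The iterative evaluation of µ and ν can be sketched as a standard Kleene iteration, assuming a monotone functional `f` over a finite vertex set (a generic illustration, not tied to the paper's implementation):

```python
def lfp(f):
    """Least fixed point mu X. f(X): iterate f from the empty set upward."""
    X = set()
    while True:
        Xn = f(X)
        if Xn == X:
            return X
        X = Xn

def gfp(f, V):
    """Greatest fixed point nu X. f(X): iterate f from the full set V downward."""
    X = set(V)
    while True:
        Xn = f(X)
        if Xn == X:
            return X
        X = Xn
```

For a monotone `f` over a finite set, both loops terminate after at most |V| iterations, which is what makes the symbolic-step counts in the complexity statements below meaningful.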

#### 5.2 The Symbolic Algorithm

We now present our new symbolic fixpoint algorithm to compute the winning region of Player 0 in the extremely fair adversarial game over G^ℓ with respect to a Rabin winning condition R. A detailed correctness proof can be found in the extended version [4, App. B.3, pp. 40].

Theorem 2. Let G^ℓ = ⟨G, V^ℓ⟩ be a game graph with live edges and R be a Rabin condition over G with index set P = [1; k]. Further, let Z^∗ denote the fixed point of the following µ-calculus expression:

$$
\nu Y\_{p\_0}.\, \mu X\_{p\_0}. \bigcup\_{p\_1 \in P} \nu Y\_{p\_1}.\, \mu X\_{p\_1}. \bigcup\_{p\_2 \in P\_{\setminus 1}} \nu Y\_{p\_2}.\, \mu X\_{p\_2}. \;\cdots \bigcup\_{p\_k \in P\_{\setminus k-1}} \nu Y\_{p\_k}.\, \mu X\_{p\_k}. \left[ \bigcup\_{j=0}^{k} \mathcal{C}\_{p\_j} \right], \tag{6a}
$$

<sup>7</sup> We will justify the naming of this operator later in Rem. 1.

$$where \quad \mathcal{C}\_{p\_j} := \left(\bigcap\_{i=0}^j \overline{R}\_{p\_i}\right) \cap \left[ \left(G\_{p\_j} \cap \text{Cpre}(Y\_{p\_j})\right) \cup \left(\text{Apre}(Y\_{p\_j}, X\_{p\_j})\right) \right], \tag{6b}$$

with<sup>8</sup> p₀ = 0, G_{p₀} := ∅ and R_{p₀} := ∅, as well as P_{\i} := P \ {p₁, . . . , p_i}. Then Z^∗ is equivalent to the winning region W of Player 0 in the extremely fair adversarial game over G^ℓ for the winning condition ϕ in (2). Moreover, the fixpoint algorithm runs in O(n^{k+2} k!) symbolic steps, and a memoryless winning strategy for Player 0 can be extracted from it.

#### 5.3 Proof Outline

Given a Rabin winning condition over a "normal" 2-player game, Piterman et al. [27] provided a symbolic fixpoint algorithm which computes the winning region for Player 0. The fixpoint algorithm in their paper is almost identical to our fixpoint algorithm in (6): it only differs in the last term of the constructed C-terms in (6b). [27] defines the term C_{p_j} as

$$\left(\bigcap\_{i=0}^{j} \overline{R}\_{p\_i}\right) \cap \left[\left(G\_{p\_j} \cap \mathrm{Cpre}(Y\_{p\_j})\right) \cup \left(\mathrm{Cpre}(X\_{p\_j})\right)\right].$$

Intuitively, a single term C_{p_j} computes the set of states that always remain within Q_{p_j} := ⋂_{i=0}^{j} R̄_{p_i} while always re-visiting G_{p_j}. That is, given the simpler (local) winning condition

$$
\psi := \Box Q \land \Box \Diamond G \tag{7}
$$

for two sets Q, G ⊆ V , the set

$$
\nu Y. \; \mu X. \; Q \cap [(G \cap \text{Cpre}(Y)) \cup (\text{Cpre}(X))] \tag{8}
$$

is known to define exactly the states of a "normal" 2-player game G from which Player 0 has a strategy to win the game with winning condition ψ [26]. Such games are typically called Safe Büchi Games. The key insight in the proof of Thm. 2 is to show that the new definition of the C-terms in (6b) via the new almost sure predecessor operator Apre actually computes the winning state sets of extremely fair adversarial safe Büchi games. Subsequently, we generalize this intuition to the fixpoint for Rabin games.

Extremely Fair Adversarial Safe Büchi Games: The following theorem characterizes the winning states in an extremely fair adversarial safe Büchi game.

Theorem 3. Let G^ℓ = ⟨G, V^ℓ⟩ be a game graph with live vertices and Q, G ⊆ V be two state sets over G. Further, let

$$Z^\* := \nu Y. \ \mu X. \ Q \cap \left[ (G \cap \text{Cpre}(Y)) \cup (\text{Apre}(Y, X)) \right]. \tag{9}$$

Then Z^∗ is equivalent to the winning region of Player 0 in the extremely fair adversarial game over G^ℓ for the winning condition ψ in (7). Moreover, the fixpoint algorithm runs in O(n²) symbolic steps, and a memoryless winning strategy for Player 0 can be extracted from it.
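A direct explicit-state evaluation of the fixpoint (9) might look as follows. This is a sketch under our own graph encoding (edge pairs, a Player 0 set, a live set), not Fairsyn's BDD-based implementation:

```python
def solve_safe_buechi(V, V0, Vl, E, Q, G):
    """Evaluate nu Y. mu X. Q & ((G & Cpre(Y)) | Apre(Y, X)) on an
    explicit graph; Player 1 owns all vertices outside V0."""
    V1 = V - V0
    def succ(v):
        return {t for (s, t) in E if s == v}
    def cpre(S):
        return ({v for v in V0 if succ(v) & S} |
                {v for v in V1 if succ(v) <= S})
    def apre(S, T):
        return cpre(T) | ({v for v in Vl if succ(v) & T} &
                          {v for v in V1 if succ(v) <= S})
    Y = set(V)                       # outer nu: start from V
    while True:
        X = set()                    # inner mu: start from the empty set
        while True:
            Xn = Q & ((G & cpre(Y)) | apre(Y, X))
            if Xn == X:
                break
            X = Xn
        if X == Y:
            return Y
        Y = X
```

Setting `Vl = set()` recovers the classical fixpoint (8), since Apre(S, T) then degenerates to Cpre(T).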

<sup>8</sup> The Rabin pair ⟨G_{p₀}, R_{p₀}⟩ = ⟨∅, ∅⟩ in (6) is artificially introduced to make the fixpoint representation more compact. It is not part of R.

Intuitively, the fixpoints in (8) and (9) consist of two parts: (a) a minimal fixpoint over X which computes (for any fixed value of Y) the set of states that can reach the "target state set" T := Q ∩ G ∩ Cpre(Y) while staying inside the safe set Q, and (b) a maximal fixpoint over Y which ensures that the only states considered in the target T are those that allow re-visiting a state in T while staying in Q.

By comparing (8) and (9) we see that our syntactic transformation only changes part (a). Hence, in order to prove Thm. 3 it essentially remains to show that this transformation works for the even simpler safe reachability games.

Extremely Fair Adversarial Safe Reachability Games: A safe reachability condition is a tuple ⟨T, Q⟩ with T, Q ⊆ V, and a play π satisfies the safe reachability condition ⟨T, Q⟩ if π satisfies the LTL formula

$$
\psi := Q \mathcal{U} T. \tag{10}
$$

A safe reachability game is often called a reach-while-avoid game, where the safe set is specified by an unsafe set R := Q̄ that needs to be avoided. Their extremely fair adversarial version is formalized in the following theorem and proved in the extended version [4, Thm. 3.3].

Theorem 4. Let G^ℓ = ⟨G, V^ℓ⟩ be a game graph with live edges and ⟨T, Q⟩ be a safe reachability winning condition. Further, let

$$Z^\* \coloneqq \nu Y. \ \mu X. \ T \cup (Q \cap \text{Apre}(Y, X)). \tag{11}$$

Then Z^∗ is equivalent to the winning region of Player 0 in the extremely fair adversarial game over G^ℓ for the winning condition ψ in (10). Moreover, the fixpoint algorithm runs in O(n²) symbolic steps, and a memoryless winning strategy for Player 0 can be extracted from it.

To gain some intuition on the correctness of Thm. 4, let us recall that the fixpoint for safe reachability games without live edges is given by:

$$
\mu X. \ T \cup (Q \cap \text{Cpre}(X)). \tag{12}
$$

Intuitively, the fixpoint computation in (12) is initialized with X^0 = ∅ and computes a sequence X^0, X^1, . . . , X^k of increasing sets until X^k = X^{k+1}. We say that v has rank r if v ∈ X^r \ X^{r−1}. All states contained in X^r allow Player 0 to force the play to reach T in at most r − 1 steps while staying in Q. The corresponding Player 0 strategy ρ₀ is known to be winning w.r.t. (10), and along every play π compliant with ρ₀, the path π remains in Q and the rank is always decreasing.
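On an explicit graph, the fixpoint (12) together with the rank bookkeeping described above can be sketched as follows (our own encoding, not the paper's symbolic implementation):

```python
def reach_while_avoid(V, V0, E, T, Q):
    """Evaluate mu X. T | (Q & Cpre(X)) and record each vertex's rank:
    rank r means Player 0 forces T within r - 1 steps while staying in Q."""
    V1 = V - V0
    def succ(v):
        return {t for (s, t) in E if s == v}
    def cpre(S):
        return ({v for v in V0 if succ(v) & S} |
                {v for v in V1 if succ(v) <= S})
    rank, X, r = {}, set(), 0
    while True:
        Xn = T | (Q & cpre(X))
        if Xn == X:
            return X, rank           # X is the winning region
        for v in Xn - X:
            rank[v] = r + 1          # v first appears in iteration r + 1
        X, r = Xn, r + 1
```

The extracted winning strategy simply moves from a rank-r vertex to any successor of rank < r, which is exactly the rank-decreasing behavior used in the soundness argument below.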

To see why the same strategy is also sound in the extremely fair adversarial safe reachability game G^ℓ, first recall that for vertices v ∉ V^ℓ of G^ℓ, the operator Apre(Y, X) simplifies to Cpre(X). With this, we see that for every v ∉ V^ℓ a Player 0 winning strategy ρ̃₀ in G^ℓ can always force plays to stay in Q and to decrease their rank, similar to ρ₀. Then every play π compliant with such a strategy ρ̃₀ and visiting a vertex in V^ℓ only finitely often satisfies (10).

Fig. 1. Fair adversarial game graph discussed in Ex. 1 and Ex. 2 with Player 0 and Player 1 vertices being indicated by circles and squares, respectively. The live vertices are V^ℓ = {2, 3, 5} (double square, blue), the target vertices are G = {6, 9} (double circle, green), and the unsafe vertices are Q̄ = {1} (red, dotted).

The only interesting case for soundness of Thm. 4 is therefore a play π that visits states in V^ℓ infinitely often. However, as the number of vertices is finite, we only have a finite number of ranks, and hence a certain vertex v ∈ V^ℓ with a finite rank r needs to get visited by π infinitely often. From the definition of Apre, we know that a state v ∈ V^ℓ is only contained in X^r if v has an outgoing edge reaching X^k with k < r. Because of the extreme fairness condition, reaching v infinitely often implies that a state with rank k < r will also get visited infinitely often. As X^1 = T, we can show by induction that T is eventually visited along π while π always remains in Q until then.

In order to prove completeness of Thm. 4 we need to show that all states in V \ Z^∗ are losing for Player 0. Here, again, the reasoning is equivalent to the "normal" safe reachability game for v ∉ V^ℓ. For live vertices v ∈ V^ℓ, we see that v is not added to Z^∗ via Apre if v ∉ T and either (i) none of its outgoing edges make progress towards T or (ii) some of its outgoing edges leave Z^∗. One can therefore construct a Player 1 strategy that for (i)-vertices always chooses an arbitrary transition and thereby never makes progress towards T (also if v is visited infinitely often), and for (ii)-vertices ensures that they are only visited once on plays which remain in Q. This ensures that (ii)-vertices never make progress towards T via their possibly existing rank-decreasing edges.

In the extended version [4], we have provided a detailed soundness and completeness proof of Thm. 4 along with the respective Player 0 and Player 1 strategy construction. In addition, there we also proved Thm. 3 using a reduction to Thm. 4 for every iteration over Y .

Example 1 (Extremely fair adversarial safe reachability game). We consider an extremely fair adversarial safe reachability game over the game graph depicted in Fig. 1 with target vertex set T = G = {6, 9} and safe vertex set Q = V \ {1}.

We denote by Y^m the m-th iteration over the fixpoint variable Y in (11), where Y^0 = V. Further, we denote by X^{mi} the set computed in the i-th iteration over the fixpoint variable X in (11) during the computation of Y^m, where X^{m0} = ∅. We further have X^{m1} = T = {6, 9} as Apre(·, ∅) = ∅. Now we compute

$$\begin{aligned} X^{12} &= T \cup (Q \cap \text{Apre}(Y^0, X^{11})) \\ &= \{6, 9\} \cup \Big( (V \setminus \{1\}) \cap \big[ \underbrace{\text{Cpre}(X^{11})}\_{\{7,8\}} \cup \underbrace{\big(\text{Lpre}^{\exists}(X^{11}) \cap \text{Pre}\_1^{\forall}(V)\big)}\_{\{3,5\}} \big] \Big) = \{3, 5, 6, 7, 8, 9\}. \end{aligned} \tag{13}$$

We observe that the only vertices added to X via the Cpre term are 7 and 8. The live vertices 3 and 5 are added due to their outgoing edges leading to the target vertex 6. The additional requirement Pre₁^∀(V) in Apre(Y^0, X^{11}) is trivially satisfied for all vertices at this point, as Y^0 = V, and can therefore be ignored. Doing one more iteration over X, we see that now vertex 4 gets added via the Cpre term (as it is a Player 0 vertex that allows progress towards 5) and vertex 2 is added via the Apre term (as it is live and allows progress to 3). The iteration over X terminates with Y^1 = X^{1∗} = V \ {1}.

Re-iterating over X for Y^1 gives X^{22} = X^{12} = {3, 5, 6, 7, 8, 9} as before. However, now vertex 2 does not get added to X^{23}, because vertex 2 has an edge leading to V \ Y^1 = {1}. Therefore the iteration over X terminates with Y^2 = X^{2∗} = V \ {1, 2}. When we now re-iterate over X for Y^2, we see that vertex 3 is not added to X^{32} any more, as vertex 3 has a transition to V \ Y^2 = {1, 2}. Therefore the iteration over X now terminates with Y^3 = X^{3∗} = V \ {1, 2, 3}. Now re-iterating over X does not change the vertex set anymore and the fixpoint terminates with Y^∗ = Y^3 = V \ {1, 2, 3}.

We note that the fixpoint expression (12) for "normal" safe reachability games terminates after two iterations over X with X^∗ = {6, 7, 8, 9}, as vertices 7 and 8 are the only vertices added via the Cpre operator in (13). Due to the stricter notion of Cpre, requiring that all outgoing edges of Player 1 vertices make progress towards the target, (12) does not require an outer maximal fixpoint over Y to "trap" the play in a set of vertices which allow progress when "waiting long enough". This "trapping" required in (11) via the outer fixpoint over Y actually fails for vertices 2 and 3 (as they are excluded from the winning set of (11)). Here, Player 1 can enforce an "escape" to the unsafe vertex 1 in two steps before 2 and 3 are visited infinitely often (which would imply progress towards 6 via the existing live edges).

We see that the winning region in the "normal" game is much smaller than the winning region for the extremely fair adversarial game, as adding live transitions restricts the strategy choices of Player 1, making it easier for Player 0 to win.

Example 2 (Extremely fair adversarial safe Büchi game). We now consider an extremely fair adversarial safe Büchi game over the game graph depicted in Fig. 1 with target set G = {6, 9} and safe set Q = V \ {1}.

We first observe that we can rewrite the fixpoint in (9) as

$$
\nu Y. \,\mu X. \,\left[Q \cap G \cap \mathrm{Cpre}(Y)\right] \cup \left[Q \cap \left(\mathrm{Apre}(Y, X)\right)\right]. \tag{14}
$$

Using (14) we see that for Y^0 = V we can define T^0 := Q ∩ G ∩ Cpre(V) = G = {6, 9}. Therefore the first iteration over X is equivalent to (13) and terminates with Y^1 = X^{1∗} = V \ {1}.

Now, however, we need to re-compute T for the next iteration over X and obtain T^1 = Q ∩ G ∩ Cpre(Y^1) = (V \ {1}) ∩ {6, 9} ∩ (V \ {1, 2, 9}) = {6}. This re-computation of T^1 checks which target vertices are repeatedly reachable, as required by the Büchi condition. As vertex 9 has no outgoing edge, it trivially cannot be reached repeatedly.

With this, we see that for the next iteration over X we only have one target vertex, T^1 = {6}. Unlike the safe reachability case in Ex. 1, vertex 7 cannot be added to X^{22}, since Player 1 can always decide to take the edge towards 9 from 7, and thereby prevents a repeated visit to a target state. Vertices 2 and 3 get eliminated for the same reason as in the safe reachability game within the second and third iterations over Y. The overall fixpoint computation therefore terminates with Y^∗ = Y^3 = {4, 5, 6, 8}.

Proof of Thm. 2: The proof of Thm. 2 essentially follows from the same arguments as in the soundness proof of the Rabin fixpoint for 2-player games by Piterman et al. [27], utilizing Thm. 4 and Thm. 3 at all suitable places. In [4, App. A, pp. 29], we illustrate the steps of the Rabin fixpoint in (6) using a simple extremely fair adversarial Rabin game with two Rabin pairs.

Remark 1. We remark that the fixpoint (11), as well as the Apre operator, are similar in structure to the characterization of almost surely winning states in concurrent reachability games [1]. In concurrent games, the fixpoint captures the largest set of states in which the game can be trapped while maintaining a positive probability of reaching the target. In our case, the fixpoint captures the largest set of states in which Player 0 can keep the game while ensuring a visit to the target either directly or through some of the edges from the live vertices. The commonality justifies our notation and terminology for Apre.

Remark 2. [2] studied fair CTL and LTL model checking where the fairness condition is given by extreme fairness with all vertices of the transition system being live. They show that CTL model checking under this all-live fairness condition can be syntactically transformed to non-fair CTL model checking. A similar transformation is possible for fair model checking of Büchi, Rabin, and Streett formulas. The correctness of their transformation is based on reasoning similar to our Apre operator. For example, a state satisfies the CTL formula ∀♦p under fairness iff all paths starting from the state either eventually visit p or always visit states from which a visit to p is possible.

Complexity Analysis of (6): For Rabin games with k Rabin pairs, Piterman et al. [27] proposed a fixpoint formula with alternation depth 2k + 1. Using the accelerated fixpoint computation technique of Long et al. [23], they deduce a bound of O(n^{k+1} k!) symbolic steps. We can apply the same acceleration technique to our fixpoint (6), yielding a complexity upper bound of O(n^{k+2} k!) symbolic steps. (The additional complexity is due to an additional outermost ν-fixpoint.)

### 6 Experimental Evaluation

We developed a C++-based tool Fairsyn<sup>9</sup>, which implements the symbolic fair adversarial Rabin fixpoint from Eq. (6) using Binary Decision Diagrams (BDDs).

<sup>9</sup> Repository URL: https://gitlab.mpi-sws.org/kmallik/synthesis-with-edge-fairness

Fairsyn has a single-threaded and a multi-threaded version, which respectively use the CUDD BDD library [32] and the Sylvan BDD library [11]. In both, we used a fixpoint acceleration procedure that "warm-starts" the inner fixpoints by exploiting a monotonicity property (detailed in the extended version [4]).

We demonstrate the effectiveness of our proposed symbolic algorithm for 2½-player Rabin games using a set of synthetic benchmark experiments derived from the VLTS benchmark suite (Sec. 6.1) and a controller synthesis experiment for a stochastic dynamical system (Sec. 6.2); in the extended version [4], we include an additional software engineering benchmark example from the literature. In all of these examples, Fairsyn significantly outperformed the state of the art.

The experiments in Sec. 6.1 were performed using the multi-threaded Fairsyn on a computer equipped with a 3 GHz Intel Xeon E7 v2 processor with 48 CPU cores and 1.5 TiB RAM. The experiments in Sec. 6.2 were performed using the single-threaded Fairsyn on a Macbook Pro (2015) laptop equipped with a 2.7 GHz Dual-Core Intel Core i5 processor with 16 GiB RAM.

### 6.1 The VLTS Benchmark Experiments

We present a collection of synthetic benchmarks for empirical evaluation of the merits of our direct symbolic algorithm compared to the one using the reduction to 2-player games [7]; in the following, we refer to the latter as the indirect approach. Like our direct algorithm, the indirect approach has been implemented in Fairsyn and benefits from the same Sylvan-based parallel BDD library and accelerated fixpoint solution technique. We collect the first 20 transition systems from the Very Large Transition Systems (VLTS) benchmark suite [15]; their descriptions can be found on the VLTS benchmark website. For each of them, we randomly generated instances of 2½-player Rabin games with up to 3 Rabin pairs using the following procedure: (i) we labeled a given fraction of the vertices as random vertices, (ii) we equally partitioned the remaining vertices into system and environment vertices, and (iii) for every set in R = {⟨G₁, R₁⟩, . . . , ⟨G_k, R_k⟩}, we randomly selected up to 5% of all vertices to be contained in the set. All the vertices in (i), (ii), and (iii) were selected randomly. In these examples, the number of vertices ranged from 289 to 164,865, the number of BDD variables from 9 to 18, and the number of transitions from 1,224 to 2,621,480.
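The generation procedure (i)-(iii) can be sketched as follows; the fraction parameter, the seed handling, and all names are illustrative assumptions rather than Fairsyn's actual generator:

```python
import random

def randomize_game(vertices, frac_random=0.3, k=3, seed=0):
    """Turn a plain vertex set into a random 2 1/2-player Rabin game instance."""
    rng = random.Random(seed)
    vs = list(vertices)
    rng.shuffle(vs)
    n_rand = int(frac_random * len(vs))        # (i) label random vertices
    Vr = set(vs[:n_rand])
    rest = vs[n_rand:]                         # (ii) split the rest equally
    half = len(rest) // 2
    V0, V1 = set(rest[:half]), set(rest[half:])
    cap = max(1, len(vs) // 20)                # (iii) each set <= 5% of V
    pairs = []
    for _ in range(k):
        G = set(rng.sample(vs, rng.randint(1, cap)))
        R = set(rng.sample(vs, rng.randint(1, cap)))
        pairs.append((G, R))
    return V0, V1, Vr, pairs
```

Varying `frac_random` from 0 to 1 reproduces the sweep of the right plot in Fig. 2, from a 2-player game (0%) to a Markov chain (100%).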

In Fig. 2, we compare the running times of Fairsyn and the indirect approach. On the left scatter plot, every point corresponds to one instance of the randomly generated benchmarks, where the X and the Y coordinates represent the running time for Fairsyn and the indirect approach respectively. The solid red line indicates the exact same performance for both methods, whereas the dashed red line indicates an order of magnitude performance improvement for Fairsyn compared to the indirect approach. Observe that Fairsyn was faster by up to two orders of magnitude for the majority of the cases. In the experiments, the memory footprint of Fairsyn and the indirect approach was similar.

In the right plot, the X-axis corresponds to the proportion of random vertices within the set of vertices, in percent: 0% corresponds to a 2-player game and 100% corresponds to a Markov chain. The Y-axis corresponds to the running time normalized with respect to the running time for the 0% case. We observe that Fairsyn was insensitive to the change of proportion of the random vertices. On the other hand, the indirect approach took longer for a larger proportion of random vertices, because for every random vertex it adds 3k + 2 additional vertices, thus causing a linear blowup in the size of the game graph. The big variations in the time differences of the two approaches are due to the varying size of the experiments: the larger a game graph is, the larger the difference. Interestingly, for both Fairsyn and the indirect method, there is a dip in the running time when all the vertices are random (i.e., the 100% case), which is possibly due to faster computation of the Cpre and Apre operators and faster convergence of the fixpoint algorithm, owing to the absence of Player 0 and Player 1 vertices.

Fig. 2. LEFT: Comparison of running time of Fairsyn and the indirect approach on the VLTS benchmarks. All axes are in log-scale. RIGHT: Sensitivity of normalized running time w.r.t. variation of the proportion of random vertices. The blue and the red lines correspond to different instances of Fairsyn and the indirect approach respectively.

#### 6.2 Synthesis for Stochastically Perturbed Dynamical Systems

Synthesizing verified symbolic controllers for continuous dynamical systems is an active area in cyber-physical systems research [33]. We consider a stochastically perturbed dynamical system model, called the bistable switch [12], which is an important model studied in molecular biology. The system model, call it Σ, has a continuous and compact two-dimensional state space X = [0, 4] × [0, 4] ⊂ ℝ² and a finite input space U = {−0.5, 0, 0.5} × {−0.5, 0, 0.5}. Suppose for any given time k ∈ ℕ, x₁(k), x₂(k) are the two states, u₁(k), u₂(k) are the two inputs, and w₁(k), w₂(k) are a pair of statistically independent noise samples drawn from a pair of distributions with bounded supports W₁ = [−0.4, −0.2] and W₂ = [−0.4, −0.2], respectively. Then the states of Σ at the next time instant are:

$$x\_1(k+1) = x\_1(k) + 0.05\left(-1.3x\_1(k) + x\_2(k)\right) + u\_1(k) + w\_1(k),\tag{15}$$

$$x\_2(k+1) = x\_2(k) + 0.05 \left( \frac{(x\_1(k))^2}{(x\_1(k))^2 + 1} - 0.25x\_2(k) \right) + u\_2(k) + w\_2(k).$$
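A one-step simulation of these dynamics can be sketched as follows. The uniform noise distribution is an assumption on our part (the paper only fixes the bounded supports, not the distributions), and the clamping to the compact state space X is our illustrative choice:

```python
import random

def step(x1, x2, u1, u2, rng):
    """One step of the bistable-switch update (15) under sampled noise."""
    w1 = rng.uniform(-0.4, -0.2)   # noise supports W1, W2 as stated;
    w2 = rng.uniform(-0.4, -0.2)   # uniform sampling is our assumption
    x1n = x1 + 0.05 * (-1.3 * x1 + x2) + u1 + w1
    x2n = x2 + 0.05 * (x1**2 / (x1**2 + 1) - 0.25 * x2) + u2 + w2
    # clamp to the compact state space X = [0, 4] x [0, 4] (our choice)
    clamp = lambda z: min(max(z, 0.0), 4.0)
    return clamp(x1n), clamp(x2n)
```

Iterating `step` with inputs chosen by a controller C produces exactly the paths over which the probability measure below is defined.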

A controller C for Σ is a function C : X → U mapping the state x(k) at any time instant k to a suitable control input u(k). Then applying (15) repeatedly with u(k) = C(x(k)), starting with an initial state (x₁(0), x₂(0)) = x(0) = x_init, gives us an infinite sequence of states (x(0), x(1), x(2), . . .) called a path. For a fixed controller C and a given initial state x_init, we obtain a probability measure P^C_{x_init} on the sample space of paths of Σ, in a way similar to how we obtained the probability measure P^{ρ₀,ρ₁}_{v₀} over infinite plays of 2½-player games.

Table 1. Performance comparison between Fairsyn and StochasticSynthesis (abbreviated as SS) [12] on a comparable implementation of the abstraction (uniform grid-based abstraction). Col. 1 shows the size of the resulting 2½-player game graph (computed using the algorithm given in [24]), Col. 2 and 3 compare the total synthesis times, and Col. 4 and 5 compare the peak memory footprint (as measured using the "time" command) for Fairsyn and SS respectively. "OoM" stands for out-of-memory.

Let ϕ ⊆ X^ω be a Rabin specification, defined using a finite predicate over X. We extend the notion of almost sure winning to control systems in the obvious way: a state x ∈ X of Σ is almost sure winning if there is a controller C such that P^C_x(ϕ) = 1. The controller synthesis problem asks to compute an optimal controller C^∗ such that for every almost sure winning state x, P^{C∗}_x(ϕ) = 1.

Fig. 3. Predicates over X.

Majumdar et al. [24] show that this synthesis problem can be approximately solved by lifting the system Σ to a finite 2½-player game. We used Fairsyn to solve the resulting 2½-player Rabin games obtained for the controller synthesis problem for Σ in (15) and for the following specification given in LTL using the predicates A, B, C, D as shown in Fig. 3: ϕ := (♦B → ♦C) ∧ (♦A → ¬C).

In Table 1, we compare the performance of Fairsyn against the state-of-the-art algorithm for solving this problem, which is implemented in the tool StochasticSynthesis (SS) [12]. It can be observed that Fairsyn significantly outperforms SS for every abstraction coarseness considered here.

#### Acknowledgments:

R. Majumdar and K. Mallik are funded through the DFG project 389792660 TRR 248–CPEC, A.-K. Schmuck is funded through the DFG project (SCHM 3541/1-1), and S. Soudjani is funded through the EPSRC New Investigator Award CodeCPS (EP/V043676/1).

### References

1. de Alfaro, L., Henzinger, T.A., Kupferman, O.: Concurrent reachability games. In: 39th Annual Symposium on Foundations of Computer Science, FOCS. pp. 564–575. IEEE Computer Society (1998)



# Practical Applications of the Alternating Cycle Decomposition

Antonio Casares<sup>1</sup>, Alexandre Duret-Lutz<sup>2</sup>, Klara J. Meyer<sup>3</sup>, Florian Renkin<sup>2</sup>, and Salomon Sickert<sup>4</sup><sup>⋆</sup>

<sup>1</sup> LaBRI, Université de Bordeaux, France, antonio.casares-santos@labri.fr
<sup>2</sup> LRDE, EPITA, France, adl@lrde.epita.fr, frenkin@lrde.epita.fr
<sup>3</sup> Independent Researcher, email@klarameyer.de
<sup>4</sup> School of Computer Science and Engineering, The Hebrew University, Israel, salomon.sickert@mail.huji.ac.il

Abstract. In 2021, Casares, Colcombet, and Fijalkow introduced the Alternating Cycle Decomposition (ACD) to study properties and transformations of Muller automata. We present the first practical implementation of the ACD in two different tools, Owl and Spot, and adapt it to the framework of Emerson-Lei automata, i.e., ω-automata whose acceptance conditions are defined by Boolean formulas. The ACD provides a transformation of Emerson-Lei automata into parity automata with strong optimality guarantees: the resulting parity automaton is minimal among those automata that can be obtained by duplication of states. Our empirical results show that this transformation is usable in practice. Further, we show how the ACD can generalize many other specialized constructions such as deciding typeness of automata and degeneralization of generalized Büchi automata, providing a framework of practical algorithms for ω-automata.

# 1 Introduction

Automata over infinite words have many applications, including verification and synthesis of reactive systems with specifications given in formalisms such as Linear Temporal Logic (LTL) [27, 23, 11, 12, 2, 29]. The synthesis problem from LTL specifications asks, given an LTL formula φ, to build a controller that processes an input word letter by letter, producing an output word, such that the combined input-output word satisfies φ. The automata-theoretic approach to this problem (first introduced by Pnueli and Rosner [27]) consists of building a deterministic ω-automaton A equivalent to the LTL specification φ, then constructing a game from A in which the opponent chooses the input letters for the automaton, and finally solving this game and obtaining a controller from a winning strategy (whenever such a strategy exists). The automaton A can use different kinds of acceptance conditions (Rabin, Emerson-Lei, Muller, parity...) and

<sup>⋆</sup> Salomon Sickert is supported in part by the Deutsche Forschungsgemeinschaft (DFG) under project number 436811179, and in part funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No. 787367 (PaVeS).

thus we obtain games with different winning conditions. Among these games, parity games are the easiest to solve and highly developed techniques for parity game solvers exist. Thus it is common practice to transform the automaton A into a parity one (for which we might need to augment the state space of the automaton). The top-ranked tools in the SyntComp competitions [17], Strix [23] (winner in editions 2018, 2019, 2020 and 2021) and ltlsynt [26], use this approach, producing a transition-based Emerson-Lei automaton (TELA) as an intermediate step before constructing the parity automaton. For this reason, optimal and efficient procedures to transform Emerson-Lei automata into parity automata are of great importance.

Emerson-Lei (EL) acceptance conditions (first defined by Emerson and Lei [10], and reinvented in the HOA format [3]) are arbitrary positive Boolean formulas over the primitives Inf(c) and Fin(c), where the c's are colors from a set Γ. A run is accepting if the set of colors F ⊆ Γ seen infinitely often is a satisfying assignment of the EL acceptance condition (see Section 2 for a formal definition). Note that an explicit representation of all satisfying assignments is comparable to the Muller condition [15, Section 1.3.2]. Since the Boolean structure of LTL formulas can be mimicked by Emerson-Lei acceptance conditions, a translation of LTL formulas to Emerson-Lei automata is particularly convenient.

Many algorithms to transform Emerson-Lei and Muller automata to parity have been proposed. In essence they all transform an automaton by turning each original state q into multiple states of the form (q, r), where r records some information about the current run, and transitions leaving (q, r) otherwise have a one-to-one mapping with those leaving q. Definition 3 calls this a locally bijective morphism, and we like to refer to those as algorithms that duplicate states. For instance, in the Latest Appearance Record (LAR) [16], r is a list of all colors ordered by most recent appearance, producing therefore a blow-up of |Γ|! in the state space of the automaton. The State Appearance Record (SAR) [24, 22] is a variation of this idea for state-based conditions, and the Color Appearance Record (CAR) [28] is a variation for the Emerson-Lei condition. The Index Appearance Record (IAR) [24, 22, 20] is a specialized construction for Rabin and Streett conditions, where r is now an ordering of pair indices. These algorithms have no particular insights about the input acceptance condition, such as inclusion or redundancies between colors (or pairs). In the Zielonka-tree transformation [31], r is a reference to a branch in a tree representation of a Muller condition. That tree representation is tailored to the condition and allows such simplifications compared to previous methods (it can be proven to be always better [6, 25]). While none of these algorithms use the structure of the input automaton to optimize the produced automata, some heuristics have been proposed [28, 25, 21].

In 2021, inspired by the Zielonka tree, Casares et al. introduced the Alternating Cycle Decomposition (ACD) of a Muller automaton [6]. Simply put, the ACD is a forest, i.e., a list of trees, that captures how accepting and rejecting cycles interleave in the automaton. They use the ACD to transform Muller automata into parity automata, and they prove a strong optimality result: the resulting automaton uses an optimal number of colors and has a minimal number of states among those parity automata that can be obtained by duplicating states of the original one (see Theorem 1 for a formal statement). The main novelty of this transformation is that it not only takes into account the structure of both the acceptance condition and the automaton, but exactly captures how they interact with each other. Moreover, Casares et al. [6] show that we can obtain some other valuable information about a Muller automaton from its ACD: for example, the ACD can be used to decide typeness, i.e., whether we can relabel it with another acceptance condition (parity, Rabin, Streett...). Their approach is primarily theoretical and puts the emphasis on how the ACD can be useful to obtain new results concerning Muller automata, but little is said about the cost of computing the ACD or the applicability of the transformation in practice.

Contributions. In this paper, we show that the ACD is practical. We adapt the definition of the ACD to Emerson-Lei automata and the HOA format [3]. We implement the ACD and the associated transformation in two tools: Owl [18] and Spot [9], providing baselines for efficient implementations of these structures. We show that the ACD gives a usable and useful method to transform Emerson-Lei automata into parity ones, improving upon any previous transformation in terms of the size of the output parity automaton. We extend the ACD to produce state-based automata, and show that the ACD generally beats traditional degeneralization-based procedures. Our implementation can also use the ACD to check typeness of deterministic automata.

Structure of the paper. We begin by providing some common definitions in Section 2. In Section 3, we define the Alternating Cycle Decomposition, adapting the definition of Casares et al. [6] to Emerson-Lei automata, and in Section 4 we provide an algorithm to compute it. In Section 5, we study the transformation of Emerson-Lei automata into parity ones using the ACD and we show experimental results obtained by comparing the ACD-transform implemented in Spot and Owl with other commonly used transformations. In Section 6 we show experimental results in the particular case of degeneralization of generalized Büchi automata. In Section 7 we discuss the utility of the ACD to decide typeness of automata.

#### 2 Preliminaries

We denote by |A| the cardinality of a set A and by 2<sup>A</sup> its power set. For a finite alphabet Σ, we write Σ<sup>∗</sup> and Σ<sup>ω</sup> for the sets of finite and infinite words, respectively, over Σ. The empty word is denoted by ε. Given v ∈ Σ<sup>∗</sup>, w ∈ Σ<sup>ω</sup>, we denote their concatenation by v · w and we write v ⊑ w if v is a prefix of w. We denote by inf(w) the set of letters that occur infinitely often in w. Given a map σ : A → B and a subset A′ ⊆ A, we denote by σ|<sub>A′</sub> the restriction of σ to A′. We extend σ to A<sup>∗</sup> and A<sup>ω</sup> component-wise and we denote these extensions by σ whenever no confusion arises.

Table 1: Encoding of common acceptance conditions into Emerson-Lei conditions. The variables c, c0, c1, . . . stand for arbitrary colors from the set Γ.

A (directed, edge-colored) graph is a pair G = (V, E) where V is a finite set of vertices and E ⊆ V × Γ × V is a finite set of Γ-colored edges. Note that with this definition one can have multiple differently colored edges from a vertex v to a vertex u. A graph G′ = (V′, E′) is a subgraph of G (written G′ ⊆ G) if V′ ⊆ V and E′ ⊆ E. A graph G = (V, E) is strongly connected if for every pair of vertices (v, u) ∈ V<sup>2</sup> there is a path from v to u. A strongly connected component (SCC) of a graph G is a maximal strongly connected subgraph of G.
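The SCC decomposition of such edge-colored graphs is a basic building block used later for the ACD. As a self-contained illustration of the definitions above (our own sketch, not the implementation used by Owl or Spot), here is Kosaraju's algorithm in Python, where edges are (v, colors, u) triples and the colors are carried along but ignored for connectivity:

```python
from collections import defaultdict

def sccs(vertices, edges):
    """Strongly connected components of a directed, edge-colored graph.
    `edges` is a set of (v, colors, u) triples; colors do not affect
    connectivity.  Returns a list of frozensets of vertices."""
    succ, pred = defaultdict(list), defaultdict(list)
    for v, _, u in edges:
        succ[v].append(u)
        pred[u].append(v)

    seen = set()

    def dfs(v, adj, out):
        # Iterative DFS appending vertices to `out` in post-order.
        stack = [(v, iter(adj[v]))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj[w])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)

    # First pass: compute finishing order on the graph itself.
    order = []
    for v in vertices:
        if v not in seen:
            dfs(v, succ, order)

    # Second pass: explore the transposed graph in reverse finishing order.
    seen.clear()
    components = []
    for v in reversed(order):
        if v not in seen:
            comp = []
            dfs(v, pred, comp)
            components.append(frozenset(comp))
    return components
```

Each returned frozenset is the vertex set of one SCC, matching the decomposition that labels the roots of the ACD trees in Section 4.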

Emerson-Lei acceptance conditions. Let Γ = {0, . . . , n − 1} be a finite set of n integers called colors, from now on also written Γ = { 0 , 1 , . . .} in our examples. We define the set EL(Γ) of acceptance conditions according to the following grammar, where c stands for any color in Γ:

$$\alpha ::= \top \mid \bot \mid \mathsf{Inf}(c) \mid \mathsf{Fin}(c) \mid (\alpha \land \alpha) \mid (\alpha \lor \alpha)$$

Acceptance conditions are interpreted over subsets of Γ. For C ⊆ Γ we define the satisfaction relation C ⊨ α inductively according to the following semantics:

$$\begin{aligned} &C \models \top & &C \models \mathsf{Inf}(c) \text{ iff } c \in C & &C \models \alpha\_1 \land \alpha\_2 \text{ iff } C \models \alpha\_1 \text{ and } C \models \alpha\_2\\ &C \not\models \bot & &C \models \mathsf{Fin}(c) \text{ iff } c \notin C & &C \models \alpha\_1 \lor \alpha\_2 \text{ iff } C \models \alpha\_1 \text{ or } C \models \alpha\_2 \end{aligned}$$

We denote by ¬α the negation of the acceptance condition α, i.e., Fin(c) becomes Inf(c) and vice versa, ∧ becomes ∨ and vice versa, etc. We assume that constants are propagated, i.e., a formula is either ⊤, ⊥, or contains neither ⊤ nor ⊥.

Table 1 shows how common acceptance conditions can be encoded into Emerson-Lei conditions. Note that colors may appear multiple times; for instance (Fin( 0 ) ∧ Inf( 1 )) ∨ (Fin( 1 ) ∧ Inf( 0 )) is a Rabin condition.
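To make the semantics concrete, the following Python sketch (our own illustration, not code from Owl or Spot) represents EL conditions as nested tuples, implements the satisfaction relation C ⊨ α and the negation ¬α described above, and builds the color-reusing Rabin condition from the text:

```python
# EL conditions as nested tuples: ('inf', c), ('fin', c),
# ('and', a, b), ('or', a, b), plus the constants True and False.

def sat(C, alpha):
    """Does the set of colors C (seen infinitely often) satisfy alpha?"""
    if alpha is True or alpha is False:
        return alpha
    op = alpha[0]
    if op == 'inf':
        return alpha[1] in C
    if op == 'fin':
        return alpha[1] not in C
    if op == 'and':
        return sat(C, alpha[1]) and sat(C, alpha[2])
    if op == 'or':
        return sat(C, alpha[1]) or sat(C, alpha[2])
    raise ValueError(f"not an EL condition: {alpha!r}")

def neg(alpha):
    """Negation of an EL condition: swap Inf/Fin and and/or."""
    if alpha is True:
        return False
    if alpha is False:
        return True
    op = alpha[0]
    if op == 'inf':
        return ('fin', alpha[1])
    if op == 'fin':
        return ('inf', alpha[1])
    if op == 'and':
        return ('or', neg(alpha[1]), neg(alpha[2]))
    return ('and', neg(alpha[1]), neg(alpha[2]))

# The Rabin condition (Fin(0) ∧ Inf(1)) ∨ (Fin(1) ∧ Inf(0)) from the text:
rabin = ('or', ('and', ('fin', 0), ('inf', 1)),
               ('and', ('fin', 1), ('inf', 0)))
```

For example, `sat({1}, rabin)` holds (color 0 is avoided and color 1 seen), while `sat({0, 1}, rabin)` does not.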

Emerson-Lei automata. A transition-based Emerson-Lei automaton (TELA) is a tuple A = (Q, Σ, Q<sub>0</sub>, ∆, Γ, α), where Q is a finite set of states, Σ is a finite input alphabet, Q<sub>0</sub> ⊆ Q is a non-empty set of initial states, Γ is a set of colors, ∆ ⊆ Q × Σ × 2<sup>Γ</sup> × Q is a finite set of transitions, and α ∈ EL(Γ) is an Emerson-Lei condition. The graph of A is the directed edge-colored graph G<sub>A</sub> = (Q, E) where the edges E = {(q, C, q′) : ∃a ∈ Σ. (q, a, C, q′) ∈ ∆} are obtained from ∆ by removing Σ. We denote the transition (q, a, C, q′) ∈ ∆ by q −a:C→ q′ and the edge (q, C, q′) ∈ E by q −C→ q′; we may omit a or C when they are clear from the context. We denote by γ the projection of ∆ or E to the set of colors Γ. Given a word w = a<sub>0</sub> · a<sub>1</sub> · a<sub>2</sub> · · · ∈ Σ<sup>ω</sup>, a run over w in A is a sequence ϱ = (q<sub>0</sub>, a<sub>0</sub>, C<sub>0</sub>, q<sub>1</sub>) · (q<sub>1</sub>, a<sub>1</sub>, C<sub>1</sub>, q<sub>2</sub>) · · · ∈ ∆<sup>ω</sup> such that q<sub>0</sub> ∈ Q<sub>0</sub>. The output of the run ϱ is the word γ(ϱ) ∈ (2<sup>Γ</sup>)<sup>ω</sup>. A run ϱ is accepting if inf(γ(ϱ)) ⊨ α. A word w ∈ Σ<sup>ω</sup> is accepted (or recognized) by A if there exists an accepting run over w in A. We denote by L(A) the set of words accepted by A. Two automata A, A′ are equivalent if L(A) = L(A′). The size of an automaton, written |A|, is the cardinality of its set of states. A state q ∈ Q is reachable if there is a path from some state in Q<sub>0</sub> to q in G<sub>A</sub>.

An automaton A is deterministic if Q<sub>0</sub> is a singleton and for every q ∈ Q and a ∈ Σ there is at most one transition q −a:C→ q′ ∈ ∆ labeled with a.

We will use automata with acceptance defined over transitions (instead of state-based acceptance) by default. However, in Sections 5 and 6 we will also discuss transformations towards automata with state-based acceptance.

If the acceptance condition of an automaton is represented as a condition of kind X (cf. Table 1), we call it an X-automaton. We assume that each transition of a parity automaton is colored with exactly one color; this can be achieved by substituting the set C in a transition q −a:C→ q′ by {min C} if C ̸= ∅, or by {|Γ| + 1} if C = ∅. (If C is a singleton we omit the brackets in the notation.)
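This recoloring step translates directly into code. A hypothetical helper (the name and transition encoding are ours) might look like:

```python
def parity_normalize(edges, n_colors):
    """Ensure every transition of a parity automaton carries exactly one
    color: replace the color set C of a transition (q, a, C, q2) by
    min(C), or by the fresh color n_colors + 1 when C is empty (which is
    harmless for a "parity min" condition over colors 0..n_colors-1)."""
    return [(q, a, min(C) if C else n_colors + 1, q2)
            for q, a, C, q2 in edges]
```

For instance, a transition colored {2, 0} keeps only color 0, and an uncolored transition receives the fresh maximal color.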

Labeled trees. A tree is a non-empty prefix-closed set T ⊆ N<sup>∗</sup> whose elements are called nodes. It is partially ordered by the prefix relation; if x ⊑ y we say that x is an ancestor of y and y is a descendant of x (we add the adjective "strict" if moreover x ̸= y). The empty string ε is the root of the tree. The set of children of a node x ∈ T is Children<sub>T</sub>(x) = {x · i ∈ T : i ∈ N}. The set of leaves of T is Leaves(T) = {x ∈ T : Children<sub>T</sub>(x) = ∅}. Nodes belonging to the same set Children<sub>T</sub>(x) are called siblings, and they are ordered from left to right by increasing value of their last component. If A is a set of labels, an A-labeled tree is a pair ⟨T, η⟩ of a tree T and a map η : T → A. The depth of a node x is Depth(x) = |x|. The height of T is Height(T) = max<sub>x∈T</sub> Depth(x).
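These tree definitions translate almost literally into code. A minimal sketch (ours, with nodes represented as tuples of integers and the root as the empty tuple):

```python
ROOT = ()  # the empty string ε, represented as the empty tuple

def children(tree, x):
    """Children of node x in a prefix-closed set of tuples,
    ordered left to right (i.e., by their last component)."""
    return sorted(y for y in tree if len(y) == len(x) + 1 and y[:len(x)] == x)

def leaves(tree):
    """Nodes without children."""
    return {x for x in tree if not children(tree, x)}

def depth(x):
    return len(x)

def height(tree):
    return max(depth(x) for x in tree)
```

A tree like {ε, 0, 1, 00, 01} then has leaves {1, 00, 01} and height 2.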

#### 3 The Alternating Cycle Decomposition

The Alternating Cycle Decomposition (ACD), proposed by Casares et al. [6], is a generalization of the Zielonka tree. The ACD of an automaton A is a forest, i.e., a collection of trees, labeled with accepting and rejecting cycles of the automaton. For each SCC of A we have a unique tree, and the labeling of each tree alternates between accepting and rejecting cycles. Thus the ACD captures the complexity of the cycle structure of each SCC. We now present the definition of the ACD adapted to TELA.

For the rest of this section, let A = (Q, Σ, Q<sub>0</sub>, ∆, Γ, α) be a TELA and let G<sub>A</sub> = (Q, E) be the associated graph with edges colored by γ : E → 2<sup>Γ</sup>. We lift γ to sets of edges and define γ(E′) = ⋃<sub>e∈E′</sub> γ(e) for every subset E′ ⊆ E.

Definition 1. A cycle of A is a subset of edges ℓ ⊆ E forming a closed path in G<sub>A</sub>. A cycle ℓ is accepting (resp. rejecting) if γ(ℓ) ⊨ α (resp. γ(ℓ) ⊭ α). The set of states of a cycle ℓ is States(ℓ) = {q ∈ Q : some e ∈ ℓ passes through q}. The set of cycles of A is denoted *Cycles*(A). It is (partially) ordered by set inclusion.

Definition 2 ([6]). Let S<sub>1</sub>, . . . , S<sub>k</sub> be an enumeration of the strongly connected components of G<sub>A</sub>. The Alternating Cycle Decomposition of A, denoted ACD(A), is a collection of k *Cycles*(A)-labeled trees ⟨T<sub>1</sub>, . . . , T<sub>k</sub>⟩ with T<sub>i</sub> = ⟨T<sub>i</sub>, η<sub>i</sub>⟩ such that:


If q ∈ Q is a state belonging to the SCC S<sub>i</sub> in A, we define the tree associated to q as the subtree T<sub>q</sub> = ⟨T<sub>q</sub>, η<sub>q</sub>⟩ given by:

$$T\_q = \{ \varepsilon \} \cup \{ x \in T\_i \; : \; q \in States(\eta\_i(x)) \} \; , \quad \eta\_q = \eta\_i|\_{T\_q}.$$

Remark 1. We provide examples online at https://spot.lrde.epita.fr/ipynb/zlktree.html and an executable copy of this notebook is included in the artifact [8].

#### 4 An Efficient Computation of the ACD

In this section we give an algorithm to compute the Alternating Cycle Decomposition of an Emerson-Lei automaton A, implemented in Owl [18] and Spot [9]. This can be done by first computing an SCC decomposition of G<sub>A</sub>, which gives us the labels of the roots of the trees ⟨T<sub>1</sub>, . . . , T<sub>k</sub>⟩, and then recursively computing the children of the nodes of each tree, following the definition of ACD(A). Algorithm 1 shows how to compute the children of a given node and uses notation we introduce now.

Let C ⊆ Γ be a subset of colors and let S = (Q<sub>S</sub>, E<sub>S</sub>) ⊆ G<sub>A</sub> be a subgraph. We define the projection of S on C, denoted S↓<sub>C</sub> = (Q<sub>S</sub>, E′<sub>S</sub>), as the subgraph of S obtained by removing the edges e ∈ E<sub>S</sub> such that γ(e) ⊈ C, that is, E′<sub>S</sub> = {(q, D, q′) ∈ E<sub>S</sub> : D ⊆ C}. We write Colors(S) = ⋃<sub>e∈E<sub>S</sub></sub> γ(e). We say that S′ ⊆ S is a C-strongly connected component in S (C-SCC) if it is an SCC of S and Colors(S′) = C. Further, max<sub>⊆</sub> denotes the set of all maximal elements according to the partial order defined by ⊆.

Note that Algorithm 1 uses Algorithm 2, which simplifies the Emerson-Lei condition before passing the formula to a Max-SAT function (a SAT solver that computes maximal satisfying assignments, e.g., by clause blocking) [4]. This preprocessing ensures that the ACD for Rabin or Streett acceptance conditions can be constructed without making use of the general-purpose algorithm for computing maximal satisfying assignments.

Algorithm 1 Computing the children of a node.

1: Input: A cycle S = η<sub>i</sub>(x) corresponding to the label of a node x of ACD(A).
2: Output: The set of labels for the children of x, (S<sub>1</sub>, . . . , S<sub>k</sub>).
3: function Compute-Children(S)
4:  children ← ∅, C ← Colors(S)
5:  if C ⊨ α then  ▷ Maximal subsets D ⊆ C such that D ⊨ α ⇔ C ⊭ α
6:   {C<sub>1</sub>, . . . , C<sub>k</sub>} ← Max-Satisfying-Subsets(C, ¬α)
7:  else
8:   {C<sub>1</sub>, . . . , C<sub>k</sub>} ← Max-Satisfying-Subsets(C, α)
9:  for D ∈ {C<sub>1</sub>, . . . , C<sub>k</sub>} do
10:   for S′ ∈ SCCs of S↓<sub>D</sub> do  ▷ These might not be D-SCCs in S
11:    if Colors(S′) ⊨ α ⇔ D ⊨ α then
12:     children ← children ∪ {S′}
13:    else
14:     children ← children ∪ Compute-Children(S′)
15:  return max<sub>⊆</sub> children  ▷ Remove non-maximal cycles from children

Algorithm 2 The subprocedure Max-Satisfying-Subsets.

1: Input: A subset of colors C ⊆ Γ and an EL condition α ∈ EL(Γ).
2: Output: max<sub>⊆</sub> {D ⊆ C : D ⊨ α}.
3: function Max-Satisfying-Subsets(C, α)
4:  if C ⊨ α then
5:   return {C}
6:  α ← α[if c ∈ C then c else ⊥]  ▷ Replace colors not in C by false
7:  L ← {c ∈ C : ¬c does not occur in α}
8:  if L ̸= ∅ then
9:   α ← α[if c ∈ L then ⊤ else c]  ▷ Replace colors in L by true
10:   {C<sub>1</sub>, . . . , C<sub>k</sub>} ← Max-Satisfying-Subsets(C \ L, α)
11:   return {C<sub>1</sub> ∪ L, . . . , C<sub>k</sub> ∪ L}
12:  if α = ¬c<sub>1</sub> ∨ · · · ∨ ¬c<sub>n</sub> then
13:   return {{c<sub>1</sub>, . . . , c<sub>n</sub>} \ {c<sub>i</sub>} : 1 ≤ i ≤ n}
14:  return Max-SAT(α)

Memoization. To optimize the construction of the ACD and to avoid duplicated recursive calls, we perform two kinds of memoization: First, we memoize the results of calling Algorithm 2 from Algorithm 1. (Thus we implicitly construct a Zielonka DAG for α.) Second, we memoize the recursive calls to Algorithm 1: this is useful, as distinct nodes in the ACD can be labeled by the same cycles.
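For intuition, the specification max<sub>⊆</sub>{D ⊆ C : D ⊨ α} computed by Algorithm 2 can also be met by brute-force enumeration. The sketch below is our own naive illustration (exponential in |C|, unlike the simplification-based Algorithm 2); it takes the condition as a predicate `models` standing for D ⊨ α:

```python
from itertools import combinations

def max_satisfying_subsets(C, models):
    """Return max⊆ {D ⊆ C : models(D)} by enumerating subsets of C from
    largest to smallest, keeping every satisfying subset that is not
    strictly contained in an already-kept (hence larger) one."""
    result = []
    for k in range(len(C), -1, -1):
        for D in map(frozenset, combinations(sorted(C), k)):
            if models(D) and not any(D < M for M in result):
                result.append(D)
    return result
```

For the Rabin condition (Fin( 0 ) ∧ Inf( 1 )) ∨ (Fin( 1 ) ∧ Inf( 0 )) and C = {0, 1}, this yields the two maximal satisfying subsets {0} and {1}.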

### 5 From Emerson-Lei to Parity Automata

In this section we describe the transformation from TELA to parity automata using the Alternating Cycle Decomposition [6]. This transformation provides strong optimality guarantees: the resulting parity automaton has minimal size

among those that can be produced without merging states of the TELA, and it uses an optimal number of colors (Theorem 1). We also show that this transformation can be adapted to produce state-based automata. Note that in this case we lose the first optimality guarantee.

#### 5.1 The ACD Transformation

Let A = (Q, Σ, Q0, ∆, Γ, α) be a TELA and let ACD(A) = ⟨T1, . . . , Tk⟩. We introduce the following notation that will allow us to move in the ACD.

Given a transition e = q −a:C→ q′ such that both q and q′ belong to the i-th SCC of A, and a node x ∈ T<sub>i</sub>, we define Support(x, e) to be the least ancestor z of x in T<sub>i</sub> such that e ∈ η<sub>i</sub>(z). If Support(x, e) ̸= x and it is not a leaf in T<sub>q′</sub>, let z′ be the only child of Support(x, e) that is an ancestor of x, and let y<sub>1</sub>, . . . , y<sub>s</sub> be an enumeration from left to right of the nodes in Children<sub>T<sub>q′</sub></sub>(Support(x, e)). We define NextBranch(x, e) as:

$$NextBranch(x, e) = \begin{cases}Support(x, e), & \text{if } Support(x, e) = x \text{ or } Support(x, e) \text{ is a leaf in } \mathcal{T}\_{q'},\\ y\_1, & \text{if } z' = y\_s, \\ y\_{j+1}, & \text{if } z' = y\_j, \ 1 \le j < s. \end{cases}$$

We define a parity automaton P<sub>ACD(A)</sub> = (P, Σ, P<sub>0</sub>, ∆<sub>P</sub>, Γ<sub>P</sub>, β) (the ACD transform of A) equivalent to A as follows:

States. The states of P<sub>ACD(A)</sub> are of the form (q, x), for q ∈ Q and x a leaf of the tree associated to q. Initial states are of the form (q<sub>0</sub>, x), where q<sub>0</sub> ∈ Q<sub>0</sub> is an initial state of A and x is the leftmost leaf of its corresponding tree.

$$P = \bigcup\_{q \in Q} \{q\} \times Leaves(\mathcal{T}\_q), \; P\_0 = \{ (q\_0, x) : q\_0 \in Q\_0, \; x \text{ the leftmost leaf in } \mathcal{T}\_{q\_0} \}.$$

Transitions. For each transition e = q −a:C→ q′ in ∆ and each state (q, x) ∈ P, we define a transition (q, x) −a:p→ (q′, y) in ∆<sub>P</sub> as follows: first, q′ is the destination state of the original transition. If q and q′ are not in the same SCC, then y is defined as the leftmost leaf in T<sub>q′</sub> and p = 1 (except if all T<sub>i</sub> have height 1 and a round root: in that case p = 0). Otherwise, if both q and q′ belong to the i-th SCC of A, then the destination leaf y is the leftmost descendant of NextBranch(x, e) in T<sub>q′</sub>.

We define the color p of the transition as Depth(Support(x, e)) if the root of T<sub>i</sub> is a round node (η<sub>i</sub>(ε) ⊨ α), or as Depth(Support(x, e)) + 1 otherwise. We remark that in this way, p is even if and only if η<sub>i</sub>(z) ⊨ α, for z = Support(x, e).

Parity condition. The condition β is a parity min even condition (cf. Table 1).

Remark 2. If the color 0 does not appear on any transition then we shift all colors by −1 and replace β by a parity min odd condition.

Proposition 1 ([6]). The automaton PACD(A) recognizes L(A).

Remark 3. The ACD transformation preserves many properties (determinism, completeness, good-for-gameness, unambiguity...) of the automaton A, see [6].

Remark 4. Since the number of colors used by PACD(A) is at most the height of a tree in ACD(A), we obtain that PACD(A) never uses more colors than |Γ| + 1. Furthermore, since the TELA does not require all transitions to have a color, we can omit the maximal one and produce an automaton with at most |Γ| colors.

In order to state the optimality of this transformation we introduce the notion of locally bijective morphisms of automata. Given an automaton A = (Q, Σ, Q<sub>0</sub>, ∆, Γ, α) and q ∈ Q, we denote by Out<sub>A</sub>(q) the set of outgoing transitions of q, i.e., Out<sub>A</sub>(q) = {q −a:C→ q′ ∈ ∆ : a ∈ Σ, C ⊆ Γ, q′ ∈ Q}.

Definition 3 ([6]). Let A = (Q, Σ, Q<sub>0</sub>, ∆, Γ, α) and A′ = (Q′, Σ, Q′<sub>0</sub>, ∆′, Γ′, α′) be two EL automata over Σ. A locally bijective morphism from A to A′ (denoted φ : A → A′) is a pair of maps φ<sub>Q</sub> : Q → Q′, φ<sub>∆</sub> : ∆ → ∆′ such that:


Theorem 1 ([6]). Let A be an Emerson-Lei automaton, and let PACD(A) be the parity automaton obtained by applying the ACD transformation. Then,


Note that all state-duplicating constructions mentioned in the introduction create locally bijective morphisms. Thus the above theorem shows that the ACD transformation duplicates the least number of states.

#### 5.2 Experimental Results

Figures 1 and 2 compare four different paritization procedures applied to 1065 TELA generated<sup>5</sup> from LTL formulas from the Synthesis Competition. These automata have between 2 and 55 colors (mean 5.92, median 5) and between 1 and 245761 states (mean 2023.20, median 20). Automata with fewer than 2 colors have been ignored since they are trivial to paritize.

The procedures are Owl's and Spot's implementations of ACD transform, as well as Spot's implementation of the Zielonka Tree transform [6], and Spot's previous paritization function (called to_parity) [28]. We refer the reader to Section 8 for information about the versions used. Two dotted lines on the sides

<sup>5</sup> We used ltl2tgba -G -D from Spot, and ltl2dela from Owl.

Fig. 1: Comparison of the output size of the four paritization procedures.

Fig. 2: Time spent performing these four paritization procedures.

of the plots hold cases that did not finish within 500 seconds (red, inner line), or where the tool reported an error<sup>6</sup> (orange, outer line). Pink dots represent input automata that already have parity acceptance: for those, running the ACD transform still makes sense as it will produce an output with a minimal number of colors. However, Owl's implementation, which mostly cares about reducing the number of states, uses a shortcut and returns the input automaton unmodified in this case: this explains the pink cloud on the left of Figure 2.

Owl's and Spot's implementations of the ACD transform produce automata of the same size, as expected. The cases that are not on the diagonal all correspond to timeouts or tool errors. The Zielonka Tree transform, which does not take the automaton structure into consideration, produces automata that are on average 2.11 times bigger (median 1.60), while its runtime is on average 6.55 times slower (median 0.97). Lastly, Spot's to_parity function is not far from the optimal size given by ACD transform: on average its output is 3.28 times larger, but the median of that size ratio is 1.00. Similarly, it is on average 15.94 times slower, but with a median of 1.04.

<sup>6</sup> Either "out-of-memory", or "too many colors" as Spot is restricted to 32 colors.

#### 5.3 ACD Transformation Towards State-Based Parity Automata

Sometimes it is desired to obtain an automaton with the acceptance defined over states. A state-based parity automaton is a tuple A = (Q, Σ, Q<sub>0</sub>, ∆, ϕ : Q → N) where (Q, Σ, Q<sub>0</sub>, ∆) is the underlying structure defined as for transition-based automata in Section 2 (with the only difference that ∆ ⊆ Q × Σ × Q now), and ϕ : Q → N is a map associating colors to states. A run of A is accepting if the minimal color visited infinitely often is even.

Let A be a TELA with ACD(A) = ⟨T<sub>1</sub>, . . . , T<sub>k</sub>⟩. We define an equivalent state-based parity automaton P<sub>sb-ACD(A)</sub> = (P, Σ, P<sub>0</sub>, ∆<sub>P</sub>, ϕ : P → N) as follows:

States. States are of the form (q, x), for q ∈ Q and x ∈ T<sub>q</sub> (now the second component corresponds to a node of the ACD that is not necessarily a leaf). The set of initial states is the same as for P<sub>ACD(A)</sub>:

$$P = \bigcup\_{q \in Q} \{q\} \times T\_q, \quad P\_0 = \{ (q\_0, x) \; : \; q\_0 \in Q\_0, \; x \text{ the leftmost leaf in } \mathcal{T}\_{q\_0} \}.$$

Transitions. For each transition e = q −a:C→ q′ ∈ ∆ and (q, x) ∈ P we define one transition (q, x) −a→ (q′, y) ∈ ∆<sub>P</sub>. To specify the destination node y, we distinguish two cases:

Suppose that x is a leaf in T<sub>q</sub>. If NextBranch(x, e) is not the leftmost child of Support(x, e) in T<sub>q′</sub>, then y is the leftmost leaf below NextBranch(x, e) in T<sub>q′</sub> (as in the transition-based case). If NextBranch(x, e) is the leftmost child (a "lap" around Support(x, e) is finished), then we set y = Support(x, e).

If x is not a leaf in T<sub>q</sub>, the destination y is determined exactly as if the transition started in (q, x′), for x′ the leftmost leaf in T<sub>q</sub> under x.

Parity condition. ϕ((q, x)) = Depth(x) if the root of T<sub>q</sub> is a round node, and ϕ((q, x)) = Depth(x) + 1 otherwise.

Note that we do not have the same optimality guarantee as in the transition-based case: if x is not a leaf in its corresponding tree, then the states of the form (q, x) ∈ P are not necessarily reachable in P<sub>sb-ACD(A)</sub>. We only need to add those that can be reached from the initial states. However, the set of reachable states does depend on the ordering of the children in the trees of the ACD, and therefore the size of the final automaton depends on this ordering.

We propose a heuristic to order the children of the nodes in ACD(A). Let T<sub>i</sub> be a tree in ACD(A) and x ∈ T<sub>i</sub>. We define:

$$D\_i(x) = \{ q' \in Q \; : \; q \xrightarrow{a} q' \notin \eta\_i(x), \text{ for some } q \in States(\eta\_i(x)), \ a \in \Sigma \}.$$

The heuristic consists in ordering the children x of a node in T<sub>i</sub> by decreasing |D<sub>i</sub>(x)|. Experiments involving transformations towards state-based automata and testing this heuristic can be found in Section 6.2.

#### 6 Degeneralization of Generalized Büchi Automata

The transformation of generalized Büchi automata with n colors into Büchi automata (with a single color) is known as "degeneralization" and has been a very common processing step between algorithms that translate temporal-logic formulas into generalized Büchi automata and model-checking algorithms that (used to) only work with Büchi automata. While it initially consisted in making 2<sup>n</sup> copies of the GBA [30, Appendix B] to remember the set of colors that had yet to be seen, degeneralization to state-based Büchi acceptance can be done using only n + 1 copies once an arbitrary order of colors has been selected [13]. A similar construction to transition-based Büchi acceptance requires only n copies of the original automaton. Different orders of colors may lead to different numbers of reachable states in the Büchi automaton. Some tools even attempted to start the degeneralization in different copies to reduce the number of reachable states [14]. Nowadays, an implementation such as the degeneralization of Spot implements several SCC-based optimizations [2] to reduce the number of output states, but is still sensitive to the arbitrary order selected for colors.
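As a reference point, the classical n-copy transition-based degeneralization can be sketched as follows. This is our own naive illustration under an assumed transition encoding, not Spot's optimized implementation: states are pairs (q, level), where the level counts how many colors, in the fixed order 0 < 1 < ... < n-1, have been seen since the last accepting transition.

```python
def degeneralize(edges, n, init):
    """Naive transition-based degeneralization: turn a generalized Büchi
    automaton over colors {0, ..., n-1} (acceptance Inf(0) & ... & Inf(n-1))
    into a Büchi automaton (acceptance Inf(0)) with states (q, level).
    `edges` is a list of transitions (q, a, colors, q2); once all n
    colors have been seen in order, the transition is marked accepting
    (color set {0}) and the level resets to 0."""
    succ = {}
    for q, a, cols, q2 in edges:
        succ.setdefault(q, []).append((a, cols, q2))
    out_edges, seen, todo = [], set(), [(init, 0)]
    while todo:
        q, lvl = todo.pop()
        if (q, lvl) in seen:
            continue
        seen.add((q, lvl))
        for a, cols, q2 in succ.get(q, []):
            new = lvl
            while new < n and new in cols:   # advance past expected colors
                new += 1
            if new == n:                     # all colors seen: accepting edge
                out_edges.append(((q, lvl), a, {0}, (q2, 0)))
                todo.append((q2, 0))
            else:
                out_edges.append(((q, lvl), a, set(), (q2, new)))
                todo.append((q2, new))
    return seen, out_edges
```

The arbitrary color order 0 < 1 < ... < n-1 baked into the level counter is exactly the order-sensitivity discussed below, which the ACD-transform avoids.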

#### 6.1 Transition-based Degeneralization

This order-sensitivity of the degeneralization, even in its transition-based variant, makes a striking difference with the ACD. When applied to a generalized Büchi automaton that has some accepting and rejecting paths, the ACD-transform produces an automaton with acceptance Inf( 0 ) ∨ Fin( 1 ). Since all transitions are labeled by either 0 or 1 , color 1 is superfluous<sup>7</sup> and the condition can be reduced to Inf( 0 ). In this context, ACD-transform therefore gives us a transition-based Büchi automaton by duplicating the fewest states (Theorem 1(2)).

It can be seen that the cycling around the different children of the ACD (whose ordering is arbitrary) performed during ACD-transform is similar to the process used in traditional degeneralization. What makes the latter sensitive to color ordering is that it only "sees" one transition at a time, while the ACD provides a view of the cycles. For instance, a degeneralization would process the sequence x 0 y 1 z differently from the sequence x 1 y 0 z depending on the order in which colors are expected to be encountered. However, if there is no other transition reaching or leaving y, the two colors will always be seen together, so their order should not matter: the two transitions belong to the same node of the ACD. The propagation of colors [28] is a related preprocessing step that can improve the degeneralization by propagating all colors common to the incoming transitions of a state to its outgoing transitions and vice versa. It would turn the previous situation into x 0 1 y 0 1 z, making the color order selected by the degeneralization irrelevant (in this case).

A comparison of the output size of the traditional degeneralization implemented in Spot (which includes several optimizations learned over the years)

<sup>7</sup> In an automaton with "parity min" acceptance where all transitions are colored, the maximal color can always be omitted and replaced by the empty set.

Fig. 3: Two-dimensional histogram of the sizes of 1000 automata, degeneralized to transition-based Büchi automata, using Spot's degeneralization function (with or without propagation of colors), or using ACD-transform.

against that of ACD-transform is given in the left plot of Figure 3. Unsurprisingly, because of ACD-transform's optimality, there are no cases where ACD loses to Spot's transition-based degeneralization. Using the propagation of colors (right plot) is an improvement (the number of non-optimal cases dropped from 419 to 235) but not a cure.

Remark 5. The input automata used in this section and the next one are a set of 1000 randomly generated, minimal, deterministic, transition-based generalized Büchi automata with 3 or 4 states and 2 or 3 colors. The reason for using such small minimal automata is to be able to apply SAT-based minimization [1] to the degeneralized state-based output in the next section, in order to estimate how large the gap between an optimal procedure and ours is.

#### 6.2 State-based degeneralization

If the ACD is used to produce a state-based output, as explained in Subsection 5.3, the obtained automaton is not guaranteed to be minimal with respect to locally bijective morphisms. In this case we can obtain a weaker optimality result:

Proposition 2. Let A be a generalized Büchi automaton, and let B_sb-ACD(A) be the state-based Büchi automaton obtained by applying the ACD state-based transformation. If B′ is a state-based Büchi automaton admitting a locally bijective morphism to A, then |B_sb-ACD(A)| ≤ |B′| + |A|.

Proof. Let B′ be a state-based Büchi automaton admitting a locally bijective morphism to A. We can transform it into a transition-based Büchi automaton B′_trans by setting the transitions leaving accepting states to be accepting. This automaton has the same size as B′ and it also admits a locally bijective morphism to A. Therefore, by Theorem 1, we have that |B_ACD(A)| ≤ |B′_trans| = |B′|. Since the state-based transformation adds at most |A| states to the transition-based one, the claimed bound follows.

Fig. 4: Comparison of three ways to degeneralize to state-based Büchi: (acd, acd.heuristic) using the state-based version of ACD-transform with or without heuristic, and (degen) classical degeneralization.

Fig. 5: Effect of the heuristic for ordering children of the ACD, and comparison to the minimal degeneralized automata (when known).


Figure 4 compares three ways to perform state-based degeneralization. The ACD comes in two variants, with or without the heuristic of Section 5.3, and it is compared against the state-based degeneralization of Spot.

Figure 5 shows how the heuristic variant compares to the one without, and how it compares with the size of a minimal DBA, when that size could be computed in reasonable time (in 649 cases). Note that there might not be a locally bijective morphism between the input automaton and the minimal DBA computed this way; nonetheless, these minimal automata can serve as a reference point to estimate the quality of a degeneralization. Compared to this subset of minimal DBAs, the average number of additional states produced by the state-based ACD is 0.17 with the heuristic, and 0.33 without. By comparison, Spot's degeneralization has an average of 1.21 extra states.

#### 7 Deciding Typeness

We now highlight how the ACD can be used to decide typeness of deterministic TELA. This problem, first introduced by Krishnan and Brayton [19], consists of deciding whether we can replace the acceptance condition of a given automaton by another (hopefully simpler) one without changing the transition structure, while preserving the language (see Table 1 for a list of common acceptance conditions).

Let A = (Q, Σ, Q0, ∆, Γ, α) be a TELA. We say that A is X-type, for X ∈ {B, C, GB, GC, P, R, S}, if there is an X-automaton over the same structure, A′ = (Q, Σ, Q0, ∆′, Γ′, β) (where ∆ and ∆′ differ only in the coloring of the transitions), such that L(A) = L(A′) and β belongs to X. We emphasize that a different set of colors Γ′ may be used in A′. Some conditions can always be rewritten as conditions of other kinds (for example, Büchi conditions can be expressed as parity ones, so being B-type implies being P-type). This notion should not be confused with the expressive power of deterministic automata using these conditions. For example, both deterministic parity automata and Rabin automata recognize all ω-regular languages, but there are Rabin automata that are not parity-type. Further, we say that an automaton A is weak if for every SCC S of A, all cycles in S are accepting or all of them are rejecting.

The following result shows that the ACD is a sufficient data structure for deciding typeness for many common acceptance conditions. We remark that the second item adds to the results of Casares et al. [7] (their statement only holds if transitions of automata are labeled with subsets of colors, which is not allowed in their model).

Proposition 3 ([7, Section 5.2]). Let A be a deterministic TELA such that all its states q ∈ Q are reachable and let ACD(A) = ⟨T1, . . . , Tk⟩ be its Alternating Cycle Decomposition. Then the following statements hold:


Also, the least number of colors used by a deterministic parity automaton recognizing L(A) is max_{1≤i≤k} Height(T_i) + ν, where ν = 0 if the roots of all trees of maximal height have the same shape (round or square), and ν = 1 otherwise.

If one of the previous conditions holds, then ACD(A) also provides an effective procedure to relabel A with the corresponding acceptance condition.

Remark 6. The ACD gives a typeness result for each SCC of the automaton, which allows the acceptance condition of each of them to be simplified independently. Further, the implications from right to left in Proposition 3 also hold for non-deterministic automata.

Proposition 3 provides an effective procedure to check typeness of TELA: we just have to build the ACD and verify that it has the appropriate shape. Spot's implementation of the ACD has options to abort the construction as soon as it detects that the shape is wrong. Moreover, if an automaton is parity-type, the ACD provides a method to relabel the automaton with a minimal number of colors. Finally, if the automaton already has parity acceptance, the ACD transformation boils down to the algorithm of Carton and Maceiras [5].

#### 8 Availability

The ACD and the transformations based on it are currently implemented in two open-source tools: Spot 2.10 [9] and Owl 21.0 [18]. (The original developments were independent before the authors met and worked on this joint paper.)

In Spot 2.10, the ACD can be explored using the Python bindings. The acd class implements the decomposition, and will render it as an interactive forest of nodes that can be clicked to highlight the relevant cycles in the input automaton. The acd_transform() and acd_transform_sbacc() functions implement the transition-based and state-based variants of the paritization procedure. Additionally, the acd class has options to heuristically order the children to favor the state-based construction, or to abort the construction as soon as it is clear that the ACD does not have Rabin or Streett shape (in case one wants to use it to establish typeness of automata). All these features are illustrated at https://spot.lrde.epita.fr/ipynb/zlktree.html. In the future, the ACD will be used more by the rest of Spot, and will be one option of the ltlsynt tool (for LTL synthesis).

In Owl, the ACD transformation is available through the aut2parity command. This command reads an automaton in the HOA format [3] using arbitrary acceptance, and produces a parity automaton in the same format. The tool Strix [23], which builds upon Owl, gained in version 21.0.0 the option to use the ACD-construction as an intermediate step.

Instructions to reproduce all experiments are included in the artifact [8].

#### 9 Conclusion

We have shown that the ACD is more than a theoretically appealing construction: our two implementations show that it is very usable in practice, and they provide a baseline for further improvements. We have also shown that the ACD is a Swiss-army knife for ω-automata, in the sense that it can generalize and replace several specific constructions (paritization, degeneralization, typeness checks).


# **Sky Is Not the Limit: Tighter Rank Bounds for Elevator Automata in Büchi Automata Complementation**

Vojtěch Havlena , Ondřej Lengál , and Barbora Šmahlíková ihavlena@fit.vut.cz, lengal@vut.cz, xsmahl00@vut.cz

Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic

**Abstract.** We propose several heuristics for mitigating one of the main causes of combinatorial explosion in rank-based complementation of Büchi automata (BAs): unnecessarily high bounds on the ranks of states. First, we identify *elevator automata*, a large class of BAs (generalizing semi-deterministic BAs) occurring often in practice, where ranks of states are bounded according to the structure of strongly connected components. The bounds for elevator automata also carry over to general BAs that contain elevator automata as a sub-structure. Second, we introduce two techniques for refining bounds on the ranks of BA states using data-flow analysis of the automaton. We implement our techniques as an extension of the tool Ranker for BA complementation and show that they indeed greatly prune the generated state space, obtaining significantly better results and outperforming other state-of-the-art tools on a large set of benchmarks.

#### **1 Introduction**

*Büchi automata* (BA) complementation has been a fundamental problem underlying many applications since it was introduced in 1962 by Büchi [8,17] as an essential part of a decision procedure for a fragment of second-order arithmetic. BA complementation has been used as a crucial part of, e.g., termination analysis of programs [13,20,10] or decision procedures for various logics, such as S1S [8], the first-order logic of Sturmian words [33], or the temporal logics ETL and QPTL [38]. Moreover, BA complementation also underlies BA inclusion and equivalence testing, which are essential instruments in the BA toolbox. Optimal algorithms, whose output asymptotically matches the lower bound of (0.76n)ⁿ [43] (potentially modulo a polynomial factor), have been developed [37,1]. For successful real-world use, asymptotic optimality is, however, not enough, and these algorithms need to be equipped with a range of optimizations to make them behave better than the worst case on BAs occurring in practice.

In this paper, we focus on the so-called *rank-based* approach to complementation, introduced by Kupferman and Vardi [24], further improved with the help of Friedgut [14], and finally made optimal by Schewe [37]. The construction stores in a macrostate partial information about all runs of a BA A over some word α. In addition to tracking the states that A can be in (which is sufficient, e.g., in the determinization of NFAs), a macrostate also stores a guess of the rank of each of the tracked states in the *run DAG* that captures all these runs. The guessed ranks impose restrictions on what the future of a state might look like (i.e., when A may accept). The number of macrostates in the complement depends combinatorially on the maximum rank that occurs in the macrostates. The constructions in [24,14,37] provide only coarse bounds on the maximum ranks.

A way of decreasing the maximum rank was suggested in [15], using a PSpace algorithm (and, therefore, not really practically applicable; the problem of finding the optimal rank is PSpace-complete). In our previous paper [19], we identified several basic optimizations of the construction that can be used to refine the *tight-rank upper bound* (TRUB) on the maximum ranks of states. In this paper, we push the applicability of rank-based techniques much further by introducing two novel lightweight techniques for refining the TRUB, thus significantly reducing the generated state space.

Firstly, we introduce a new class of the so-called *elevator automata*, which occur quite often in practice (e.g., as outputs of natural algorithms for translating LTL to BAs). Intuitively, an elevator automaton is a BA whose strongly connected components (SCCs) are all either inherently weak<sup>1</sup> or deterministic. Clearly, the class substantially generalizes the popular inherently weak [6] and semi-deterministic [11,3,4] BAs. The structure of elevator automata allows us to provide tighter estimates of the TRUBs, not only for elevator automata *per se*, but also for BAs where elevator automata occur as a sub-structure (which is even more common). Secondly, we propose a lightweight technique, inspired by data flow analysis, allowing rank restrictions to be propagated along the skeleton of the complemented automaton, obtaining even tighter TRUBs. We also extended the optimal rank-based algorithm to transition-based BAs (TBAs).

We implemented our optimizations within the Ranker tool [18] and evaluated our approach on thousands of hard automata from the literature (15 % of them were elevator automata that were not semi-deterministic, and many more contained an elevator substructure). Our techniques drastically reduce the generated state space; in many cases we even achieved an exponential improvement compared to the optimal procedure of Schewe and our previous heuristics. On the majority of hard automata, the new version of Ranker produces a smaller complement than other state-of-the-art tools.

### **2 Preliminaries**

*Words, functions.* We fix a finite nonempty alphabet Σ and the first infinite ordinal ω = {0, 1, . . .}. For n ∈ ω, by [n] we denote the set {0, . . . , n}. For i ∈ ω, we use ⌊⌊i⌋⌋ to denote the largest even number smaller or equal to i, e.g., ⌊⌊42⌋⌋ = ⌊⌊43⌋⌋ = 42. An (infinite) word α is represented as a function α : ω → Σ where the i-th symbol is denoted as α_i. We abuse notation and sometimes also represent α as an infinite sequence α = α_0 α_1 . . . We use Σ^ω to denote the set of all infinite words over Σ. For a (partial) function f : X → Y and a set S ⊆ X, we define f(S) = {f(x) | x ∈ S}. Moreover, for x ∈ X and y ∈ Y, we use f ⊳ {x ↦ y} to denote the function (f \ {x ↦ f(x)}) ∪ {x ↦ y}.
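The two small operations above can be transcribed directly; this is just an illustrative sketch using Python dicts for (partial) functions:

```python
def even_floor(i):
    """The largest even number smaller than or equal to i (written with
    the double-floor brackets in the text)."""
    return i - i % 2

def update(f, x, y):
    """The function f with x remapped to y, i.e., f with {x -> y}."""
    g = dict(f)
    g[x] = y
    return g

assert even_floor(42) == even_floor(43) == 42
assert update({'q': 1, 'r': 0}, 'q', 2) == {'q': 2, 'r': 0}
```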

*Büchi automata.* A (nondeterministic transition/state-based) *Büchi automaton* (BA) over Σ is a quadruple A = (Q, δ, I, Q_F ∪ δ_F) where Q is a finite set of *states*, δ : Q × Σ → 2^Q is a *transition function*, I ⊆ Q is the set of *initial* states, and Q_F ⊆ Q and δ_F ⊆ δ are the sets of *accepting states* and *accepting transitions* respectively. We sometimes treat δ as a set of transitions p →ᵃ q; for instance, we use p →ᵃ q ∈ δ to denote that q ∈ δ(p, a).

<sup>1</sup> An SCC is inherently weak if it either contains no accepting states or, on the other hand, all cycles of the SCC contain an accepting state.

Moreover, we extend δ to sets of states S ⊆ Q as δ(S, a) = ⋃_{q∈S} δ(q, a), and to sets of symbols Γ ⊆ Σ as δ(S, Γ) = ⋃_{a∈Γ} δ(S, a). We define the inverse transition function as δ⁻¹ = {q →ᵃ p | p →ᵃ q ∈ δ}. The notation δ|_S for S ⊆ Q is used to denote the restriction of the transition function δ ∩ (S × Σ × S). Moreover, for q ∈ Q, we use A[q] to denote the BA (Q, δ, {q}, Q_F ∪ δ_F).

A *run* of A from q ∈ Q on an input word α is an infinite sequence ρ : ω → Q that starts in q and respects δ, i.e., ρ_0 = q and ∀i ≥ 0: ρ_i →^{α_i} ρ_{i+1} ∈ δ. Let inf_Q(ρ) denote the set of states occurring in ρ infinitely often and inf_δ(ρ) denote the set of transitions occurring in ρ infinitely often. The run ρ is called *accepting* iff inf_Q(ρ) ∩ Q_F ≠ ∅ or inf_δ(ρ) ∩ δ_F ≠ ∅.

A word α is accepted by A from a state q ∈ Q if there is an accepting run ρ of A from q on α, i.e., with ρ_0 = q. The set L_A(q) = {α ∈ Σ^ω | A accepts α from q} is called the *language* of q (in A). Given a set of states R ⊆ Q, we define the language of R as L_A(R) = ⋃_{q∈R} L_A(q), and the language of A as L(A) = L_A(I). We say that a state q ∈ Q is *useless* iff L_A(q) = ∅. If δ_F = ∅, we call A *state-based*, and if Q_F = ∅, we call A *transition-based*. In this paper, we fix a BA A = (Q, δ, I, Q_F ∪ δ_F).

# **3 Complementing Büchi automata**

In this section, we describe a generalization of the rank-based complementation of state-based BAs presented by Schewe in [37] to our notion of transition/state-based BAs. Proofs can be found in [16].

#### **3.1 Run DAGs**

First, we recall the terminology from [37] (which is a minor modification of the one in [24]), which we use in the paper. Let the *run DAG* of A over a word α be a DAG (directed acyclic graph) G_α = (V, E) containing vertices V and edges E such that

**–** V ⊆ Q × ω s.t. (q, i) ∈ V iff there is a run ρ of A from I over α with ρ_i = q,
**–** E ⊆ V × V s.t. ((q, i), (q′, i′)) ∈ E iff i′ = i + 1 and q′ ∈ δ(q, α_i).

Given G_α as above, we will write (q, i) ∈ G_α to denote that (q, i) ∈ V. A vertex (q, i) ∈ V is called *accepting* if q is an accepting state, and an edge ((q, i), (q′, i′)) ∈ E is called *accepting* if q →^{α_i} q′ is an accepting transition. A vertex v ∈ G_α is *finite* if the set of vertices reachable from v is finite, *infinite* if it is not finite, and *endangered* if it cannot reach an accepting vertex or an accepting edge.

We assign ranks to vertices of run DAGs as follows: Let G⁰_α = G_α and i = 0. Repeat the following steps until the fixpoint or for at most 2n + 1 steps, where n = |Q|:

**–** Set rank_α(v) ≔ i for all vertices v of Gⁱ_α that are finite (if i is even) or endangered (if i is odd).
**–** Set Gⁱ⁺¹_α = Gⁱ_α \ {v | rank_α(v) = i} and increase i.

For all vertices v that have not been assigned a rank yet, we assign rank_α(v) ≔ ω.

We define the *rank of α*, denoted as rank(α), as max{rank_α(v) | v ∈ G_α}, and the *rank of A*, denoted as rank(A), as max{rank(α) | α ∈ Σ^ω \ L(A)}.

**Lemma 1.** *If α ∉ L(A), then rank(α) ≤ 2|Q|.*

#### **3.2 Rank-Based Complementation**

In this section, we describe a construction for complementing BAs developed in the work of Kupferman and Vardi [24]—later improved by Friedgut, Kupferman, and Vardi [14], and by Schewe [37]—extended to our definition of BAs with accepting states and transitions (see [19] for a step-by-step introduction). The construction is based on the notion of tight level rankings storing information about levels in run DAGs. For a BA A and n = |Q|, a *(level) ranking* is a function f : Q → [2n] such that f(Q_F) ⊆ {0, 2, . . . , 2n}, i.e., f assigns even ranks to accepting states of A. For two rankings f and f′ we define f ⊲ₐ f′ iff for each q ∈ Q and q′ ∈ δ(q, a) we have f′(q′) ≤ f(q), and for each q′′ ∈ δ_F(q, a) it holds that f′(q′′) ≤ ⌊⌊f(q)⌋⌋. The set of all rankings is denoted by R. For a ranking f, the *rank* of f is defined as rank(f) = max{f(q) | q ∈ Q}. We write f ≤ f′ iff for every state q ∈ Q we have f(q) ≤ f′(q), and f < f′ iff f ≤ f′ and there is a state q ∈ Q with f(q) < f′(q). For a set of states S ⊆ Q, we call f S-*tight* if (i) it has an odd rank r, (ii) f(S) ⊇ {1, 3, . . . , r}, and (iii) f(Q \ S) = {0}. A ranking is *tight* if it is Q-tight; we use T to denote the set of all tight rankings.
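The S-tightness conditions translate directly into a check; the following Python sketch (with rankings encoded as dicts, our own representation) mirrors conditions (i)–(iii):

```python
def rank_of(f):
    """rank(f): the maximal rank assigned by the ranking f."""
    return max(f.values())

def is_tight(f, S, states):
    """Check that the ranking f (a dict state -> rank) is S-tight:
    (i) odd rank r, (ii) all odd ranks up to r occur inside S,
    (iii) rank 0 everywhere outside S."""
    r = rank_of(f)
    return (r % 2 == 1
            and set(range(1, r + 1, 2)) <= {f[q] for q in S}
            and all(f[q] == 0 for q in states - S))

assert is_tight({'p': 1, 'q': 0}, {'p'}, {'p', 'q'})
assert not is_tight({'p': 2, 'q': 0}, {'p'}, {'p', 'q'})   # even rank
assert not is_tight({'p': 3, 'q': 0}, {'p'}, {'p', 'q'})   # rank 1 unused
```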

The original rank-based construction [24] uses macrostates of the form (S, O, f) to track all runs of A over α. The f-component contains guesses of the ranks of the states in S (which is obtained by the classical subset construction) in the run DAG, and the O-set is used to check whether all runs contain only a finite number of accepting states. Friedgut, Kupferman, and Vardi [14] improved the construction by having f consider only tight rankings. Schewe's construction [37] extends the macrostates to (S, O, f, i) with i ∈ ω representing a particular even rank such that O tracks states with rank i. At the cut-point (a macrostate with O = ∅) the value of i is changed to i + 2 modulo the rank of f. Macrostates in an accepting run hence iterate over all possible values of i. Formally, the complement of A = (Q, δ, I, Q_F ∪ δ_F) is given as the (state-based) BA Schewe(A) = (Q′, δ′, I′, Q′_F ∪ ∅), whose components are defined as follows:

$$
\begin{array}{l}
- \ Q' = Q_1 \cup Q_2 \text{ where} \\
\quad \bullet \ Q_1 = 2^Q \text{ and} \\
\quad \bullet \ Q_2 = \{ (S, O, f, i) \in 2^Q \times 2^Q \times \mathcal{T} \times \{0, 2, \dots, 2n - 2\} \mid f \text{ is } S\text{-tight},\ O \subseteq S \cap f^{-1}(i) \},
\end{array}
$$

$$- \ I' = \{I\},$$

- δ′ = δ₁ ∪ δ₂ ∪ δ₃ where
  • δ₁ : Q₁ × Σ → 2^{Q₁} such that δ₁(S, a) = {δ(S, a)},
  • δ₂ : Q₁ × Σ → 2^{Q₂} such that (S′, ∅, f, 0) ∈ δ₂(S, a) iff S′ = δ(S, a) and f is S′-tight,
  • δ₃ : Q₂ × Σ → 2^{Q₂} such that (S′, O′, f′, i′) ∈ δ₃((S, O, f, i), a) iff
    ∗ S′ = δ(S, a),
    ∗ f ⊲ₐ f′,
    ∗ rank(f) = rank(f′), and
    ◦ if O = ∅ then i′ = (i + 2) mod (rank(f′) + 1) and O′ = S′ ∩ f′⁻¹(i′), and
    ◦ if O ≠ ∅ then i′ = i and O′ = δ(O, a) ∩ f′⁻¹(i); and

$$- \ Q'_F = \{ \emptyset \} \cup ( (2^Q \times \{ \emptyset \} \times \mathcal{T} \times \omega) \cap Q_2 ).$$

We call the part of the automaton with states from Q₁ the *waiting* part (denoted as Waiting), and the part corresponding to Q₂ the *tight* part (denoted as Tight).

**Theorem 2.** *Let* A *be a BA. Then* L (*Schewe*(A)) = Σ \ L (A)*.*

The space complexity of Schewe's construction for BAs matches the theoretical lower bound O((0.76n)ⁿ) given by Yan [43] modulo a quadratic factor O(n²). Note that our extension to BAs with accepting transitions does not increase the space complexity of the construction.

*Example 3*. Consider the BA A over {, } given in Fig. 1a. A part of Schewe(A) is shown in Fig. 1b (we use ({:0, :1}, ∅) to denote the macrostate ({, }, ∅, { ↦ 0, ↦ 1}, 0)). We omit the i-part of each macrostate, since the corresponding values are 0 for all macrostates in the figure. Useless states are covered by grey stripes. The full automaton contains even more transitions from {} to useless macrostates of the form ({:·, :·, :·}, ∅). ⊓⊔

From the construction of Schewe(A), we can see that the number of states is affected mainly by the sizes of macrostates and by the maximum rank of A. In particular, the upper bound on the number of states of the complement with the maximum rank is given in the following lemma.

**Lemma 4.** *For a BA* A *with sufficiently many states such that* rank (A) = *the number of states of the complemented automaton is bounded by* 2 + (+) (+)! *where* = max{0, 3 − d 2 e}*.*

From Lemma 1 we have that the rank of A is bounded by 2|Q|. Such a bound is often too coarse, and hence Schewe(A) may contain many redundant states. Decreasing the bound on the ranks is essential for a practical algorithm, but finding an optimal solution is PSpace-complete [15]. The rest of this paper therefore proposes a framework of lightweight techniques for decreasing the maximum rank bound and, in this way, significantly reducing the size of the complemented BA.

#### **3.3 Tight Rank Upper Bounds**

Let α ∉ L(A). For ℓ ∈ ω, we define the ℓ-th *level* of G_α as level_α(ℓ) = {q | (q, ℓ) ∈ G_α}. Furthermore, we use f^α_ℓ to denote the ranking of level ℓ of G_α. Formally,

$$f\_{\ell}^{\alpha}(q) = \begin{cases} rank\_{\alpha}((q,\ell)) & \text{if } q \in level\_{\alpha}(\ell), \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

We say that the ℓ-th level of G_α is *tight* if for all k ≥ ℓ it holds that (i) f^α_k is tight, and (ii) rank(f^α_k) = rank(f^α_ℓ). Let ρ = S_0 S_1 . . . S_{ℓ−1} (S_ℓ, O_ℓ, f_ℓ, i_ℓ) . . . be a run on a word α in Schewe(A). We say that ρ is a *super-tight run* [19] if f_k = f^α_k for each k ≥ ℓ. Finally, we say that a mapping μ : 2^Q → R is a *tight rank upper bound (TRUB) wrt α* iff

$$\exists \ell \in \omega \colon level\_{\alpha}(\ell) \text{ is tight} \land (\forall k \ge \ell \colon \mu(level\_{\alpha}(k)) \ge f\_k^{\alpha}). \tag{2}$$

Informally, a TRUB is a ranking that gives a conservative (i.e., larger) estimate on the necessary ranks of states in a super-tight run. We say that μ is a TRUB iff μ is a TRUB wrt all α ∉ L(A). We abuse notation and use the term TRUB also for a mapping μ′ : 2^Q → ω if the mapping inner(μ′) is a TRUB, where inner(μ′)(S) = {q ↦ m | m = μ′(S) ∸ 1 if q ∈ Q_F else m = μ′(S)} for all S ∈ 2^Q. (∸ is the *monus* operator, i.e., minus with negative results saturated to zero.) Note that the mappings μ = {S ↦ (2|S \ Q_F| ∸ 1)}_{S∈2^Q} and inner(μ) are trivial TRUBs.
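The monus operator, inner(·), and the trivial TRUB can be transcribed directly; the following is a sketch under our own encoding (μ′ as a Python function on frozensets, inner(μ′)(S) restricted to the states of S for brevity):

```python
def monus(a, b):
    """Monus: subtraction with negative results saturated to zero."""
    return max(a - b, 0)

def inner(mu, QF):
    """Turn a macrostate-to-rank bound mu : 2^Q -> omega into a ranking
    bound: accepting states get mu(S) monus 1 (they need even ranks),
    the other states get mu(S)."""
    def ranking(S):
        bound = mu(S)
        return {q: monus(bound, 1) if q in QF else bound for q in S}
    return ranking

def trivial_trub(QF):
    """The trivial TRUB mu(S) = 2*|S \\ QF| monus 1."""
    return lambda S: monus(2 * len(S - QF), 1)

QF = {'q'}
mu = trivial_trub(QF)
S = frozenset({'p', 'q', 'r'})
assert mu(S) == 3                                   # 2*|{p,r}| - 1
assert inner(mu, QF)(S) == {'p': 3, 'q': 2, 'r': 3}
```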

The following lemma shows that we can remove from Schewe(A) macrostates whose ranking is not covered by a TRUB (in particular, we show that the reduced automaton preserves super-tight runs).

**Lemma 5.** *Let μ be a TRUB and B be the BA obtained from Schewe(A) by replacing all occurrences of Q₂ by Q′₂ = {(S, O, f, i) ∈ Q₂ | f ≤ μ(S)}. Then, L(B) = Σ^ω \ L(A).*

#### **4 Elevator Automata**

In this section, we introduce *elevator automata*, which are BAs having a particular structure that can be exploited for complementation and semi-determinization; elevator automata can be complemented in O(16ⁿ) space (cf. Lemma 10) instead of 2^{O(n log n)}, which is the lower bound for unrestricted BAs, and semi-determinized in O(2ⁿ) instead of O(4ⁿ) (cf. [16]). The class of elevator automata is quite general: it can be seen as a substantial generalization of semi-deterministic BAs (SDBAs) [11,5]. Intuitively, an elevator automaton is a BA whose strongly connected components are all either deterministic or inherently weak.

Let A = (Q, δ, I, Q_F ∪ δ_F). A set C ⊆ Q is a *strongly connected component* (SCC) of A if for any pair of states q, q′ ∈ C it holds that q is reachable from q′ and q′ is reachable from q. C is *maximal* (MSCC) if it is not a proper subset of another SCC. An MSCC C is *trivial* iff |C| = 1 and δ|_C = ∅. The *condensation* of A is the DAG cond(A) = (M, E) where M is the set of A's MSCCs and E = {(C₁, C₂) | ∃q₁ ∈ C₁, ∃q₂ ∈ C₂, ∃a ∈ Σ: q₁ →ᵃ q₂ ∈ δ}. An MSCC is *non-accepting* if it contains no accepting state and no accepting transition, i.e., C ∩ Q_F = ∅ and δ|_C ∩ δ_F = ∅. The *depth* of (M, E) is defined as the number of MSCCs on the longest path in (M, E).

We say that an SCC C is *inherently weak accepting* (IWA) iff *every cycle* in the transition diagram of A restricted to C contains an accepting state or an accepting transition. C is *inherently weak* if it is either non-accepting or IWA, and A is inherently weak if all of its MSCCs are inherently weak. A is *deterministic* iff |I| ≤ 1 and |δ(q, a)| ≤ 1 for all q ∈ Q and a ∈ Σ. An SCC C ⊆ Q is *deterministic* iff the BA (C, δ|_C, ∅, ∅) is deterministic. A is a *semi-deterministic BA* (SDBA) if A[q] is deterministic for every q ∈ Q_F ∪ {q ∈ Q | p →ᵃ q ∈ δ_F, p ∈ Q, a ∈ Σ}, i.e., whenever a run in A reaches an accepting state or an accepting transition, it can only continue deterministically.

A is an *elevator (Büchi) automaton* iff for every MSCC C of A it holds that C is (i) deterministic, (ii) IWA, or (iii) non-accepting. In other words, a BA is an elevator automaton iff every nondeterministic SCC of A that contains an accepting state or transition is inherently weak. An example of an elevator automaton obtained from the LTL formula GF( ∨ GF( ∨ GF)) is shown in Fig. 2. The BA consists of three connected deterministic components. Note that the automaton is neither semi-deterministic nor unambiguous.
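The definition can be checked mechanically: compute the SCCs, and verify that every SCC containing an accepting state or transition is either deterministic or inherently weak (i.e., dropping accepting states and transitions breaks all of its cycles). A sketch with our own automaton encoding (`delta` maps a state to a list of `(symbol, successor)` pairs):

```python
from collections import defaultdict

def sccs(states, delta):
    """Tarjan's SCC algorithm (recursive, fine for small automata)."""
    index, low, onstk, stack, out = {}, {}, set(), [], []
    def dfs(v):
        index[v] = low[v] = len(index)
        stack.append(v); onstk.add(v)
        for _, w in delta.get(v, []):
            if w not in index:
                dfs(w); low[v] = min(low[v], low[w])
            elif w in onstk:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while not comp or w != v:
                w = stack.pop(); onstk.discard(w); comp.add(w)
            out.append(comp)
    for v in states:
        if v not in index:
            dfs(v)
    return out

def has_cycle(adj):
    """DFS cycle detection; adj maps a node to its successor list."""
    state = dict.fromkeys(adj, 0)          # 0 new, 1 open, 2 done
    def dfs(v):
        state[v] = 1
        for w in adj.get(v, []):
            if state.get(w) == 1 or (state.get(w) == 0 and dfs(w)):
                return True
        state[v] = 2
        return False
    return any(state[v] == 0 and dfs(v) for v in adj)

def is_elevator(states, delta, QF, deltaF):
    """Every SCC with an accepting state or transition must be
    deterministic or inherently weak accepting."""
    for C in sccs(states, delta):
        inside = [(q, a, r) for q in C for a, r in delta.get(q, [])
                  if r in C]
        accepting = any(q in QF for q in C) or any(t in deltaF
                                                   for t in inside)
        if not inside or not accepting:
            continue                       # trivial or non-accepting
        out = defaultdict(int)             # deterministic inside C?
        for q, a, _ in inside:
            out[(q, a)] += 1
        if all(n <= 1 for n in out.values()):
            continue
        # IWA: removing accepting states and accepting transitions
        # must leave the component acyclic
        rest = {q: [r for (p, a, r) in inside if p == q
                    and r not in QF and (p, a, r) not in deltaF]
                for q in C if q not in QF}
        if has_cycle(rest):
            return False
    return True

# {p, q} is nondeterministic, accepting, and has a cycle avoiding q:
bad = {'p': [('a', 'p'), ('a', 'q')], 'q': [('a', 'p')]}
assert not is_elevator({'p', 'q'}, bad, {'q'}, set())
# a nondeterministic non-accepting SCC feeding a deterministic one:
good = {'1': [('a', '1'), ('a', '2')], '2': [('a', '2')]}
assert is_elevator({'1', '2'}, good, {'2'}, set())
```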

Fig. 2: The BA for the LTL formula GF( ∨ GF( ∨ GF)) is an elevator automaton

The rank of an elevator automaton is bounded by the depth of cond(A). In the worst case, A consists of a chain of deterministic components, yielding the upper bound on the rank of elevator automata given in the following lemma.

**Lemma 6.** *Let A be an elevator automaton such that its condensation has the depth d. Then rank(A) ≤ 2d.*
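The depth of the condensation, and with it the bound of Lemma 6, is a longest-path computation on a DAG; a sketch under our own encoding of the condensation as a node list plus edge pairs:

```python
from collections import defaultdict

def condensation_depth(msccs, edges):
    """Depth of a condensation: the number of MSCCs on the longest
    path of the (acyclic) DAG, computed by memoized DFS."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    memo = {}
    def depth(u):
        if u not in memo:
            memo[u] = 1 + max((depth(v) for v in succ[u]), default=0)
        return memo[u]
    return max(map(depth, msccs))

# A chain of three MSCCs has depth 3, so Lemma 6 bounds the rank by 6.
assert 2 * condensation_depth(['C1', 'C2', 'C3'],
                              [('C1', 'C2'), ('C2', 'C3')]) == 6
```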

#### **4.1 Refined Ranks for Elevator Automata**

Notice that the upper bound on ranks provided by Lemma 6 can still be too coarse. For instance, for an SDBA with three linearly ordered MSCCs such that the first two are non-accepting and the last one is deterministic accepting, the lemma gives an upper bound of 6 on the rank, while it is known that every SDBA has rank at most 3 (cf. [5]). Another example is two deterministic non-trivial MSCCs connected by a path of trivial MSCCs, which can be assigned the same rank.

Instead of refining the definition of elevator automata into some quite complex list of constraints, we rather provide an algorithm that traverses cond(A) and assigns each MSCC a label of the form *type*:*rank* containing (i) a type and (ii) a bound on the maximum rank of states in the component. The types of MSCCs that we consider are the following:

T**:** *trivial* components,

IWA**:** *inherently weak accepting* components,

D**:** *deterministic* (potentially accepting) components, and

N**:** *non-accepting* components.

Note that the type of an MSCC is not given *a priori* but is determined by the algorithm (this is because for deterministic non-accepting components, it is sometimes better to treat them as D and sometimes as N, depending on their neighbourhood). In the following, we assume that A is an elevator automaton without useless states and, moreover, that all accepting conditions on states and transitions not inside non-trivial MSCCs have been removed (any BA can easily be transformed into this form).

We start with terminal MSCCs, i.e., MSCCs that cannot reach any other MSCC:

**T1**: If C is IWA, then we label it with IWA:0.

**T2**: Else, if C is deterministic accepting, we label it with D:2.

Fig. 3: Rules for assigning types and rank bounds to MSCCs. The adjustment terms are interpreted as 0 if all the corresponding edges from the components having the given ranks are deterministic; otherwise they are interpreted as 2. Transitions between two components C₁ and C₂ are deterministic if the BA (C, δ|_C, ∅, ∅) is deterministic for C = δ(C₁, Σ) ∩ (C₁ ∪ C₂).

(Note that the previous two options are complete due to our requirements on the structure of A.) When all terminal MSCCs are labelled, we proceed through cond (A), inductively on its structure, and label nonterminal components based on the rules defined below.

Fig. 4: Structure of elevator ranking rules

The rules use the structure depicted in Fig. 4, where child nodes denote already processed MSCCs. In particular, a child node of the form D:ℓ denotes an aggregate node of *all* siblings of the type D, with ℓ being the maximum rank of these siblings. Moreover, we use typemax{·, ·, ·} to denote the type in {D, N, IWA} whose corresponding expression (an expression containing the rank ℓ of that type) is maximal (if there are several such types, one is chosen arbitrarily). The rules for assigning a type and a rank ℓ to C are the following:


Then, for every MSCC C of A, we assign each of its states the rank of C. This yields a mapping Q → ω of rank bounds computed by the procedure above.

**Lemma 7.** *The rank-bound mapping computed above is a TRUB.*

Using Lemma 5, we can now use this mapping to prune states during the construction of Schewe(A), as shown in the following example.

*Example 8*. As an example, consider the BA A in Fig. 1a. The set of MSCCs with their types is given as

Fig. 5: A part of Schewe(A). The TRUB computed by elevator rules is used to prune states outside the yellow area.

{{}:N, {, }:IWA}, showing that A is an elevator automaton. Using the rules **T1** and **I4** we get the TRUB {:1, :0, :0}, which can be used to prune the generated states as shown in Fig. 5. ⊓⊔

#### **4.2 Efficient Complementation of Elevator Automata**

In Section 4.1 we proposed an algorithm for assigning ranks to MSCCs of an elevator automaton A. The drawback of the algorithm is that the maximum obtained rank is not bounded by a constant but by the depth of the condensation of A. We will, however, show that it is possible to modify A, at most doubling the number of states, and obtain an elevator BA with rank at most 3.

Intuitively, the construction copies every non-trivial MSCC C with an accepting state or transition into a component C•, copies all transitions going into states in C to also go into the corresponding states in C•, and, finally, removes all accepting conditions from C. Formally, let A = (Q, δ, I, QF ∪ δF) be a BA. For S ⊆ Q, we use S• to denote a unique copy of S, i.e., S• = {q• | q ∈ S} s.t. S• ∩ Q = ∅. Let M be the set of MSCCs of A. Then, the *deelevated* BA DeElev(A) = (Q′, δ′, I′, Q′F ∪ δ′F) is given as follows:

$$\begin{array}{l} \mathsf{-}\ \mathcal{Q}' = \mathcal{Q} \cup \mathcal{Q}^{\bullet}, \\ \mathsf{-}\ \delta': \mathcal{Q}' \times \Sigma \to 2^{\mathcal{Q}'} \text{ where for } q \in \mathcal{Q} \\ \quad \mathsf{-}\ \delta'(q, a) = \delta(q, a) \cup (\delta(q, a))^{\bullet} \text{ and} \\ \quad \mathsf{-}\ \delta'(q^{\bullet}, a) = (\delta(q, a) \cap C)^{\bullet} \text{ for } q \in C \in \mathcal{M}; \\ \mathsf{-}\ I' = I, \text{ and} \\ \mathsf{-}\ \mathcal{Q}'\_F = \mathcal{Q}\_F^{\bullet} \text{ and } \delta'\_F = \{q^{\bullet} \xrightarrow{a} r^{\bullet} \mid q \xrightarrow{a} r \in \delta\_F\} \cap \delta'. \end{array}$$

It is easy to see that the number of states of the deelevated automaton is bounded by 2|Q|. Moreover, if A is an elevator automaton, so is DeElev(A). The construction preserves the language of A, as shown by the following lemma.

**Lemma 9.** *Let* A *be a BA. Then,* L (A) = L (*DeElev*(A))*.*
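The deelevation construction above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the encoding (transition dicts, accepting transitions as triples) and all names are our assumptions, and for brevity it copies every MSCC, whereas the paper only needs to copy non-trivial accepting ones.

```python
def de_elevate(states, delta, initial, qf, delta_f, msccs):
    """Return the deelevated BA (Q', delta', I', QF', deltaF').

    delta: dict (q, a) -> set of successor states
    delta_f: set of accepting transitions as (q, a, r) triples
    msccs: list of frozensets covering all states (e.g., via Tarjan)
    """
    bullet = {q: (q, "bullet") for q in states}      # fresh copies q -> q•
    scc_of = {q: c for c in msccs for q in c}        # state -> its MSCC

    states2 = set(states) | set(bullet.values())
    delta2, delta_f2 = {}, set()
    for (q, a), succs in delta.items():
        # original part: keep old successors and add their copies
        delta2[(q, a)] = set(succs) | {bullet[r] for r in succs}
        # copied part: stay inside the copy of q's MSCC
        inside = succs & scc_of[q]
        delta2[(bullet[q], a)] = {bullet[r] for r in inside}
        # accepting transitions survive only between copies
        for r in inside:
            if (q, a, r) in delta_f:
                delta_f2.add((bullet[q], a, bullet[r]))
    qf2 = {bullet[q] for q in qf}                    # QF' = QF•
    return states2, delta2, set(initial), qf2, delta_f2
```

Since every state gets exactly one copy, the sketch makes the 2|Q| state bound of the construction immediate.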

Moreover, for an elevator automaton A, the structure of DeElev(A) consists (after trimming useless states) of several non-accepting MSCCs together with copied terminal deterministic or IWA MSCCs. Therefore, if we apply the algorithm from Section 4.1 to DeElev(A), we get that its rank is bounded by 3, which gives the following upper bound for complementation of elevator automata.

**Lemma 10.** *Let* A *be an elevator automaton with sufficiently many states* n*. Then the language* Σ^ω \ L(A) *can be represented by a BA with at most* O(16^n) *states.*

The complementation through DeElev(A) gives a better upper bound than the rank refinement from Section 4.1 applied directly to A; based on our experience, however, complementation through DeElev(A) behaves worse on many real-world instances. This poor behaviour is caused by the fact that the complement of DeElev(A) can have a larger Waiting and macrostates in Tight can have larger S-components, which can yield more generated states (despite the rank bound 3). It seems that the most promising approach would be a combination of the two approaches, which we leave for future work.

Fig. 6: Rules assigning types and rank bounds for non-elevator automata.

#### **4.3 Refined Ranks for Non-Elevator Automata**

The algorithm from Section 4.1 computing a TRUB for elevator automata can be extended to compute TRUBs even for general non-elevator automata (i.e., BAs with nondeterministic accepting components that are not inherently weak). To achieve this generalization, we extend the rules for assigning types and ranks to MSCCs of elevator automata from Section 4.1 to take into account general non-deterministic components. For this, we add into our collection of MSCC types *general* components (denoted as G). Further, we need to extend the rules for terminal components with the following rule:

**T3**: Otherwise, we label C with G:2|C \ QF|.

Moreover, we adjust the rules for assigning a type and a rank ℓC to C as follows (the rule **I1** is the same as for elevator automata):

Fig. 7: C is G

**I2**–**I5**: *(We replace the corresponding rules by their counterparts including general components from Fig. 6.)*

**I6**: Otherwise, we use the rule in Fig. 7.

Then, for every MSCC C of a BA A, we assign each of its states the rank ℓC of C. Again, we use χ : Q → ω to denote the rank bounds computed by the adjusted procedure above.

**Lemma 11.** *χ is a TRUB.*

#### **5 Rank Propagation**

In the previous section, we proposed a way to obtain a TRUB for elevator automata (with a generalization to arbitrary automata). In this section, we propose a way of using the structure of A to refine a TRUB by propagating values, and thus reduce the size of Tight. Our approach uses *data*

Fig. 8: Rank propagation flow

*flow analysis* [32] to reason about how ranks and rankings of macrostates of Schewe(A) can be decreased based on the ranks and rankings of the *local neighbourhood* of the macrostates. We, in particular, use a special case of *forward analysis* working on the *skeleton* of Schewe(A), which is defined as the BA KA = (2^Q, δ′, {I}, ∅, ∅) where δ′ = {S →a T | T = δ(S, a)} (note that we are only interested in the structure of KA and not its language; also notice the similarity of KA with Waiting). Our analysis refines a rank/ranking estimate μ(S) for a macrostate S of KA based on the estimates for its predecessors R1, . . . , Rm (see Fig. 8). The new estimate is denoted as μ′(S).

More precisely, μ : 2^Q → V is a function giving each macrostate of KA a value from the domain V. We will use the following two value domains: (i) V = ω, which is used for estimating *ranks* of macrostates (in the *outer macrostate analysis*), and (ii) V = R, which is used for estimating *rankings* within macrostates (in the *inner macrostate analysis*). For each of the analyses, we will give the *update function* up : (2^Q → V) × (2^Q)^{m+1} → V, which defines how the value of μ(S) is updated based on the values of μ(R1), . . . , μ(Rm). We then construct a system with the following equation for every S ∈ 2^Q:

$$\mu(\mathcal{S}) = \mathrm{up}(\mu, \mathcal{S}, R\_1, \dots, R\_m) \quad \text{where } \{R\_1, \dots, R\_m\} = \delta^{\prime - 1}(\mathcal{S}, \Sigma). \tag{3}$$

We then solve the system of equations using standard algorithms for data flow analysis (see, e.g., [32, Chapter 2]) to obtain the fixpoint μ*. Our analyses have the important property that if they start with μ0 being a TRUB, then μ* will also be a TRUB.

As the initial TRUB, we can use a trivial TRUB or any other TRUB (e.g., the output of elevator state analysis from Section 4).

#### **5.1 Outer Macrostate Analysis**

We start with the simpler of the two analyses, the *outer macrostate analysis*, which only looks at the sizes of macrostates. Recall that the rank of every super-tight run in Schewe(A) does not change, i.e., a super-tight run stays in Waiting as long as needed so that when it jumps to Tight, it takes its final rank and never needs to decrease it. We can use this fact to decrease the maximum rank of a macrostate S in KA. In particular, let us consider all cycles going through S. For each such cycle c, we can bound the maximum rank of a super-tight run going through c by 2m − 1, where m is the smallest number of non-accepting states occurring in any macrostate on c (by definition, the rank of a tight ranking does not depend on accepting states). We can then infer that the maximum rank of any super-tight run going through S is bounded by the maximum of the ranks of the cycles going through S (since S can never assume a higher rank in any super-tight run). Moreover, the rank of each cycle can also be estimated in a more precise way, e.g., using our elevator analysis.

Since the number of cycles in KA can be large², instead of enumerating them, we employ data flow analysis with the value domain V = ω (i.e., for every macrostate S of KA, we remember a bound on the maximum rank of S) and the following update function:

$$\mathrm{up}\_{out}\left( \mu, \mathcal{S}, R\_1, \dots, R\_m \right) = \min \{ \mu(\mathcal{S}), \max \{ \mu(R\_1), \dots, \mu(R\_m) \} \}. \tag{4}$$

Intuitively, the new bound on the maximum rank of S is the smaller of the previous bound μ(S) and the largest of the bounds of all predecessors of S; the new value is then propagated forward by the data flow analysis.
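The fixpoint computation of Eq. (3) with the update function up_out can be sketched as a chaotic iteration. This is a minimal sketch under assumed encodings, not the Ranker implementation: `edges` maps each macrostate of the skeleton to its successor macrostates, and `mu` is the initial TRUB as a dict from macrostates to rank bounds.

```python
def outer_analysis(edges, mu):
    """Iterate up_out over all macrostates until a fixpoint is reached."""
    preds = {}                                # predecessor map of the skeleton
    for s, succs in edges.items():
        for t in succs:
            preds.setdefault(t, set()).add(s)
    changed = True
    while changed:                            # chaotic iteration to a fixpoint
        changed = False
        for s in mu:
            ps = preds.get(s, ())
            if not ps:
                continue                      # no predecessors: keep estimate
            new = min(mu[s], max(mu[p] for p in ps))
            if new < mu[s]:
                mu[s], changed = new, True
    return mu
```

On a three-macrostate chain with bounds 1, 3, 7 and a self-loop on the last macrostate (mirroring the shape of Example 12, with hypothetical labels A, B, C), the middle bound drops to 1 while the self-looping one stays at 7.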

<sup>2</sup> KA can be exponentially larger than A and the number of cycles in KA can be exponential in the size of KA, so the total number of cycles can be double-exponential.

*Example 12*. Consider the BA Aex in Fig. 9a. When started from the initial TRUB μ0 = {{} ↦ 1, {, } ↦ 3, {, , , } ↦ 7} (Fig. 9b), outer macrostate analysis decreases the maximum rank estimate for {, } to 1, since min{μ0({, }), max{μ0({})}} = min{3, 1} = 1. The estimate for {, , , } is not affected, because min{7, max{1, 7}} = 7 (Fig. 9c). ⊓⊔

Fig. 9: Example of outer macrostate analysis. (a) Aex (• denotes accepting transitions). The initial TRUB μ0 in (b) is refined to μ∗out in (c).

**Lemma 13.** *If μ is a TRUB, then* μ ◁ {S ↦ up_out(μ, S, R1, . . . , Rm)} *is a TRUB.*

**Corollary 14.** *When started with a TRUB* μ0*, the outer macrostate analysis terminates and returns a TRUB* μ∗out*.*

#### **5.2 Inner Macrostate Analysis**

Our second analysis, called *inner macrostate analysis*, looks deeper into super-tight runs in Schewe(A). In particular, compared with the outer macrostate analysis from the previous section—which only looks at the *ranks*, i.e., the bounds on the numbers in the rankings—inner macrostate analysis looks at how the *rankings* assign concrete values to the *states* of A *inside the macrostates*.

Inner macrostate analysis is based on the following. Let ρ be a super-tight run of Schewe(A) on w ∉ L(A) and (S, O, f, i) be a macrostate from Tight. Because ρ is super-tight, we know that the rank f(q) of a state q ∈ S is bounded by the ranks of the predecessors of q. This holds because in super-tight runs, the ranks are only *as high as necessary*; if the rank of q were higher than the ranks of its predecessors, this would mean that we might wait in Waiting longer and only jump to q with a lower rank later.

Let us introduce some necessary notation. Let θ, θ′ ∈ R be rankings (i.e., θ, θ′ : Q → ω). We use θ ⊔ θ′ to denote the ranking {q ↦ max{θ(q), θ′(q)} | q ∈ Q}, and θ ⊓ θ′ to denote the ranking {q ↦ min{θ(q), θ′(q)} | q ∈ Q}. Moreover, we define *max-succ-rank*_a^S(θ) = max≤{θ′ ∈ R | θ′ is a successor ranking of θ over a} and a function dec : R → R such that dec(θ) is the ranking θ′ for which

$$\theta'(q) = \begin{cases} \theta(q) \mathbin{\dot{-}} 1 & \text{if } \theta(q) = rank(\theta) \text{ and } q \notin \mathcal{Q}\_F, \\ \lfloor \theta(q) \mathbin{\dot{-}} 1 \rfloor & \text{if } \theta(q) = rank(\theta) \text{ and } q \in \mathcal{Q}\_F, \\ \theta(q) & \text{otherwise.} \end{cases} \tag{5}$$

Intuitively, *max-succ-rank*_a^S(θ) is the (pointwise) maximum ranking that can be reached from macrostate S with ranking θ over a (it is easy to see that there is a unique such maximum ranking), and dec(θ) decreases the maximum ranks in a ranking by one (or by two for even maximum ranks and accepting states).
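The function dec of Eq. (5) is easy to sketch, assuming a ranking is encoded as a dict from states to ranks and `accepting` is the set QF; the names and encoding are ours, not the paper's.

```python
def dec(theta, accepting):
    """Decrease the maximal ranks of a ranking as in Eq. (5)."""
    r = max(theta.values())                   # rank(theta)
    out = {}
    for q, k in theta.items():
        if k == r and q not in accepting:
            out[q] = max(k - 1, 0)            # decrease by one (truncated)
        elif k == r and q in accepting:
            # accepting states keep even ranks: round the decrement down
            d = max(k - 1, 0)
            out[q] = d if d % 2 == 0 else max(d - 1, 0)
        else:
            out[q] = k                        # non-maximal ranks are kept
    return out
```

For instance, with maximum rank 6, a non-accepting state at the maximum drops to 5 while an accepting state at the maximum drops to 4, matching the intuition above.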

The analysis uses the value domain V = R (i.e., each macrostate of KA is assigned a ranking giving an upper bound on the rank of each state in the macrostate) and the update function up_in given in the listing on the right-hand side of the page. Intuitively, up_in

updates the rank bound of every q ∈ S to hold the maximum rank compatible with the ranks of its predecessors. We highlight Line 6, which makes use of the fact that we can only consider tight rankings (whose rank is odd), so we can decrease the estimate using the function dec defined above.

**1** up_in(μ, S, R1, . . . , Rm)**:**
**2** **foreach** 1 ≤ i ≤ m *and* a ∈ Σ **do**
**3**   **if** δ(Ri, a) = S **then**
**4**     θi,a ← *max-succ-rank*_a^{Ri}(μ(Ri))
**5** θ ← μ(S) ⊓ ⊔{θi,a | θi,a is defined};
**6** **if** rank(θ) *is even* **then** θ ← dec(θ);
**7** **return** θ;

*Example 15*. Let us continue with the example from Section 5.1 and perform inner macrostate analysis starting with the TRUB {{:1}, {:1, :1}, {:7, :7, :7, :7}} obtained from μ∗out. We show three iterations of the algorithm for {, , , } on the right-hand side (we do not show {, } beyond the first iteration, since it does not change in the intermediate steps). We can notice that over the three iterations, we could decrease the maximum rank estimate to {:6, :6, :6, :6} due to the accepting transitions from and . In the last of the three iterations, when all states have the even rank 6, the condition on Line 6 would become true and the rank of all states would be decremented

to 5 using dec. Then, again, the accepting transitions from and would decrease the rank of to 4, which would be propagated to and so on. Eventually, we would arrive at the TRUB {:1, :1, :1, :1}, which cannot be decreased any more, since {:1, :1} forces the ranks of and to stay at 1. ⊓⊔

**Lemma 16.** *If μ is a TRUB, then* μ ◁ {S ↦ up_in(μ, S, R1, . . . , Rm)} *is a TRUB.*

**Corollary 17.** *When started with a TRUB* μ0*, the inner macrostate analysis terminates and returns a TRUB* μ∗in*.*

# **6 Experimental Evaluation**

*Used tools and evaluation environment.* We implemented the techniques described in the previous sections as an extension of the tool Ranker [18] (written in C++). In the terms of [19], the heuristics were implemented on top of the RankerMaxR configuration (we refer to this previous version as RankerOld). We tested the correctness of our implementation using Spot's autcross on all BAs in our benchmark. We compared the modified Ranker with other state-of-the-art tools, namely, Goal [41] (implementing Piterman [34], Schewe [37], Safra [36], and Fribourg [1]), Spot 2.9.3 [12] (implementing Redziejowski's algorithm [35]), Seminator 2 [4], LTL2dstar 0.5.4 [23], and Roll [26]. All tools were set to the mode in which they output an automaton with the standard state-based Büchi acceptance condition. The experimental evaluation was performed on a 64-bit GNU/Linux Debian workstation with an Intel(R) Xeon(R) CPU E5-2620 running at 2.40 GHz with 32 GiB of RAM, using a timeout of 5 minutes.

*Datasets.* As the source of our benchmark, we use the following two datasets: (i) random, containing 11,000 BAs over a two-letter alphabet used in [40], which were randomly

Fig. 10: Comparison of the state space generated by our optimizations and other rankbased procedures (horizontal and vertical dashed lines represent timeouts). Blue data points are from random and red data points are from LTL. Axes are logarithmic.

generated via the Tabakov-Vardi approach [39], starting from 15 states and with various parameter settings; (ii) LTL, with 1,721 BAs over larger alphabets (up to 128 symbols) used in [4], which were obtained from LTL formulae from the literature (221) or randomly generated (1,500). We preprocessed the automata using Rabit [30] and Spot's autfilt (using the --high simplification level), transformed them to state-based acceptance BAs (if they were not already), and converted them to the HOA format [2]. From this set, we removed automata that were (i) semi-deterministic, (ii) inherently weak, (iii) unambiguous, or (iv) had an empty language, since for these automata types there exist more efficient complementation procedures than for unrestricted BAs [5,4,6,28]. In the end, we were left with **2,592** (random) and **414** (LTL) *hard* automata. We use all to denote their union (**3,006** BAs). Of these hard automata, 458 were elevator automata.

#### **6.1 Generated State Space**

In our first experiment, we evaluated the effectiveness of our heuristics for pruning the generated state space by comparing the sizes of complemented BAs without postprocessing. This use case is directed towards applications where postprocessing is irrelevant, such as inclusion or equivalence checking of BAs.

We focused on a comparison with two less optimized versions of the rank-based complementation procedure: Schewe (the version "Reduced Average Outdegree" from [37] implemented in Goal under -m rank -tr -ro) and its optimization RankerOld. The scatter plots in Fig. 10 compare the numbers of states of automata generated by Ranker and the other algorithms, and the upper part of Table 1 gives summary statistics. Observe that our optimizations from this paper drastically reduced the generated search space compared with both Schewe and RankerOld (the mean for Schewe is lower than for RankerOld due to its much higher number of timeouts); from Fig. 10b we can see that the improvement was in many cases *exponential*, even when compared with our previous optimizations in RankerOld. The median (which is a more meaningful indicator in the presence of timeouts) decreased by 44 % w.r.t. RankerOld, and we also reduced the number of timeouts by 23 %. Notice that the numbers for the LTL dataset do not differ as much as for random, witnessing the easier structure of the BAs in LTL.

Table 1: Statistics for our experiments. The upper part compares various optimizations of the rank-based procedure (no postprocessing). The lower part compares Ranker to other approaches (with postprocessing). The left-hand side compares sizes of complement BAs and the right-hand side runtimes of the tools. The **wins** and **losses** columns give the number of times when Ranker was strictly better and worse. The values are given for the three datasets as "all (random : LTL)". Approaches in Goal are labelled with <sup>G</sup>.


#### **6.2 Comparison with Other Complementation Techniques**

In our second experiment, we compared the improved Ranker with other state-of-the-art tools. We compared sizes of output BAs, therefore, we postprocessed each output automaton with autfilt (simplification level --high). Scatter plots are given in Fig. 11, where we compare Ranker with Spot (which had the best results on average from the other tools except Roll) and Roll, and summary statistics are in the lower part of Table 1. Observe that Ranker has by far the lowest mean (except Roll) and the third lowest median (after Seminator 2 and Roll, but with fewer timeouts). Moreover, comparing the numbers in columns **wins** and **losses**, we can see that Ranker gives strictly better results than other tools (**wins**) more often than the other way round (**losses**).

In Fig. 11a we see that indeed in the majority of cases Ranker gives a smaller BA than Spot, especially for harder BAs (Spot, however, behaves slightly better on the simpler BAs from LTL). The results in Fig. 11b are less clear-cut. Roll uses a learning-based approach—more heavyweight and completely orthogonal to any of the other tools—and can in some cases output a tiny automaton, but it does not scale, as witnessed by a number of timeouts much higher than that of any other tool. It is, therefore, positively surprising that Ranker could in most of the cases still obtain a much smaller automaton than Roll.

Regarding runtimes, the prototype implementation in Ranker is comparable to Seminator 2, but slower than Spot and LTL2dstar (Spot is the fastest tool). Implementations of the other approaches clearly do not target speed. We note that the number of timeouts of Ranker is still higher than that of some other tools (in particular Piterman, Spot, and Fribourg); further state space reduction targeting this particular issue is our future work.

### **7 Related Work**

BA complementation has remained in the interest of researchers since the introduction of BAs by Büchi [8]. Together with the hunt for efficient complementation techniques, effort has been put into establishing the lower bound. First, Michel showed that the lower bound is n! (approx. (0.36n)^n) [31] and later Yan refined the result to (0.76n)^n [43].

Fig. 11: Comparison of the complement size obtained by Ranker and other state-of-the-art tools (horizontal and vertical dashed lines represent timeouts). Axes are logarithmic.

The complementation approaches can be roughly divided into several branches. *Ramsey-based complementation*, the very first complementation construction, where the language of an input automaton is decomposed into a finite number of equivalence classes, was proposed by Büchi and was further enhanced in [7]. *Determinization-based complementation* was presented by Safra in [36] and later improved by Piterman in [34] and Redziejowski in [35]. Various optimizations for determinization of BAs were further proposed in [29]. The main idea of this approach is to convert an input BA into an equivalent deterministic automaton with a different acceptance condition that can be easily complemented (e.g., a Rabin automaton). The complemented automaton is then converted back into a BA (often at the price of some blow-up). *Slice-based complementation* tracks the acceptance condition using a reduced abstraction on a run tree [42,21]. *A learning-based approach* was introduced in [27,26]. Allred and Ultes-Nitsche then presented a novel optimal complementation algorithm in [1]. For some special types of BAs, e.g., deterministic [25], semi-deterministic [5], or unambiguous [28], there exist specific complementation algorithms. *Semi-determinization-based complementation* converts an input BA into a semi-deterministic BA [11], which is then complemented [4].

*Rank-based complementation*, studied in [24,15,14,37,22], extends the subset construction for determinization of finite automata by storing additional information in each macrostate to track the acceptance condition of all runs of the input automaton. Optimizations of an alternative (sub-optimal) rank-based construction from [24] going through *alternating Büchi automata* were presented in [15]. Furthermore, the work in [22] introduces an optimization of Schewe, in some cases producing smaller automata (this construction is not compatible with our optimizations). As shown in [9], the rank-based construction can be optimized using simulation relations. We identified several heuristics that help reduce the size of the complement in [19], which are compatible with the heuristics in this paper.

*Acknowledgements.* We thank anonymous reviewers for their useful remarks that helped us improve the quality of the paper. This work was supported by the Czech Science Foundation project 20-07487S and the FIT BUT internal project FIT-S-20-6427.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# On-The-Fly Solving for Symbolic Parity Games

Maurice Laveaux<sup>1</sup> () , Wieger Wesselink<sup>1</sup> , and Tim A.C. Willemse1,<sup>2</sup>

<sup>1</sup> Eindhoven University of Technology, Eindhoven, The Netherlands <sup>2</sup> ESI (TNO), Eindhoven, The Netherlands {m.laveaux, j.w.wesselink, t.a.c.willemse}@tue.nl

Abstract. Parity games can be used to represent many different kinds of decision problems. In practice, tools that use parity games often rely on a specification in a higher-order logic from which the actual game can be obtained by means of an exploration. For many of these decision problems we are only interested in the solution for a designated vertex in the game. We formalise how to use on-the-fly solving techniques during the exploration process, and show that this can help to decide the winner of such a designated vertex in an incomplete game. Furthermore, we define partial solving techniques for incomplete parity games and show how these can be made resilient to work directly on the incomplete game, rather than on a set of safe vertices. We implement our techniques for symbolic parity games and study their effectiveness in practice, showing that speed-ups of several orders of magnitude are feasible and overhead (if unavoidable) is typically low.

### 1 Introduction

A parity game is a two-player game with an ω-regular winning condition, played by players ♢ ('even') and □ ('odd') on a directed graph. The true complexity of solving parity games is still a major open problem, with the most recent breakthroughs yielding algorithms running in quasi-polynomial time, see, e.g., [18,7]. Apart from their intriguing status, parity games pop up in various fundamental results in computer science (e.g., in the proof of decidability of a monadic second-order theory). In practice, parity games provide an elegant, uniform framework to encode many relevant decision problems, which include model checking problems, synthesis problems and behavioural equivalence checking problems.

Often, a decision problem that is encoded as a parity game can be answered by determining which of the two players wins a designated vertex in the game graph. Depending on the characteristics of the game, it may be the case that only a fraction of the game is relevant for deciding which player wins a vertex. For instance, deciding whether a transition system satisfies an invariant can be encoded by a simple, solitaire (i.e., single player) parity game. In such a game, player □ wins all vertices that are sinks (i.e., have no successors), and all states leading to such sinks, so checking whether sinks are reachable from a designated vertex suffices to determine whether this vertex is won by □, too. Clearly, as soon as a sink is detected, any further inspection of the game becomes irrelevant.
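The reachability check described above amounts to a plain breadth-first search. The sketch below is purely illustrative (names and the encoding are our assumptions): `succ` maps each vertex to its successors and `target_sinks` is the set of sinks that, in the encoding at hand, are won by player □.

```python
from collections import deque

def sink_reachable(succ, target_sinks, start):
    """Return True iff some vertex in target_sinks is reachable from start."""
    seen, todo = {start}, deque([start])
    while todo:                               # standard BFS exploration
        v = todo.popleft()
        if v in target_sinks:
            return True                       # sink found: stop exploring
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                todo.append(w)
    return False
```

The early `return True` reflects the observation that once a sink is detected, any further inspection of the game is irrelevant.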

A complicating factor is that in practice, the parity games that encode decision problems are not given explicitly. Rather, they are specified in some higher-order logic such as a parameterised Boolean equation system, see, e.g. [11]. Exploring the parity game from such a higher-order specification is, in general, time- and memory-consuming. To counter this, symbolic exploration techniques have been proposed, see e.g. [19]. These explore the game graph on-the-fly and exploit efficient symbolic data structures such as LDDs [13] to represent sets of vertices and edges. Many parity game solving algorithms can be implemented quite effectively using such data structures [20,28,29], so that in the end, exploring the game graph often remains the bottleneck.

In this paper, we study how to combine the exploration of a parity game and the on-the-fly solving of the explored part, with the aim to speed up the overall solving process. The central problem when performing on-the-fly solving during the exploration phase is that we have to deal with incomplete information when determining the winner for a designated vertex. Moreover, in the symbolic setting, the exploration order may be unpredictable when advanced strategies such as chaining and saturation [9] are used.

To formally reason about all possible exploration strategies and the artefacts they generate, we introduce the concept of an incomplete parity game, and an ordering on these. Incomplete parity games are parity games where for some vertices not all outgoing edges are necessarily known. In practice, these could be identified by, e.g., the todo queue in a classical breadth-first search. The extra information captured by an incomplete parity game allows us to characterise the safe set for a given player α. This is a set of vertices for which it can be established that if player α wins the vertex, then she cannot lose the vertex if more information becomes available. We prove an optimality result for safe sets, which, informally, states that a safe set for player α is also the largest set with this property (see Theorem 1).

The vertices won by player α in an α-safe set can be determined using a standard parity game solving algorithm such as, e.g., Zielonka's recursive algorithm [31] or Priority Promotion [2]. However, these algorithms may be less efficient as on-the-fly solvers. For this reason, we study three symbolic partial solvers: solitaire winning cycle detection, forced winning cycle detection and fatal attractors [17]. In particular cases, first determining the safe set for a player and only subsequently solving the game using one of these partial solvers will incur an additional overhead. As a final result, we therefore prove that all these solvers can be (modified to) run on the incomplete game as a whole, rather than on the safe set of a player (see Propositions 1-3).

As a proof of concept, we have implemented an (open source) symbolic tool for the mCRL2 toolset [6], that explores a parity game specified by a parameterised Boolean equation system and solves these games on-the-fly. We report on the effectiveness of our implementation on typical parity games stemming from, e.g., model checking and equivalence checking problems, showing that it can speed up the process by several orders of magnitude, while adding low overhead if the entire game is needed for solving.

Related Work. Our work is related to existing techniques for solving symbolic parity games such as [20,19], as we extend these existing methods with on-the-fly solving. Naturally, our work is also related to existing work for on-the-fly model checking. This includes work for on-the-fly (explicit) model checking of regular alternation-free modal mu-calculus formulas [23] and work for on-the-fly symbolic model checking of RCTL [1]. Compared to these, our method is more general as it can be applied to the full modal mu-calculus (with data), which subsumes RCTL and the alternation-free subset. Optimisations such as the observation that checking LTL formulas of type AG reduces to reachability checks [14] are a special case of our methods and partial solvers. Furthermore, our methods are not restricted to model checking problems only and can be applied to any parity game, including decision problems such as equivalence checking [8]. Finally, our method is agnostic to the exploration strategy employed.

Structure of the paper. In Section 2 we recall parity games. In Section 3 we introduce incomplete parity games and show how partial solving can be applied correctly. In Section 4 we present several partial solvers that we employ for on-the-fly solving. Finally, in Section 5 we discuss the implementation of these techniques and apply them to several practical examples. The omitted proofs for the supporting lemmas can be found in [22].

### 2 Preliminaries

A parity game is an infinite-duration, two-player game that is played on a finite directed graph. The objective of the two players, called even (denoted by ♢) and odd (denoted by □), is to win vertices in the graph.

Definition 1. A parity game is a directed graph G = (V, E, p,(V♢, V□)), where


Henceforth, let G = (V, E, p,(V♢, V□)) be an arbitrary parity game. Throughout this paper, we use α to denote an arbitrary player and ¯α to denote the opponent. We write vE to denote the set of successors {w ∈ V | (v, w) ∈ E} of vertex v. The set sinks(G) is defined as the largest set U ⊆ V satisfying for all v ∈ U that vE = ∅; i.e., sinks(G) is the set of all sinks: vertices without successors. If we are only concerned with the sinks of player α, we write sinksα(G); i.e., sinksα(G) = Vα ∩ sinks(G). We write G ∩ U, for U ⊆ V, to denote the subgame (U,(U × U) ∩ E, p↾U,(V♢ ∩ U, V□ ∩ U)), where p↾U(v) = p(v) for all vertices v ∈ U.
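Under an assumed dictionary encoding of games (the names and representation are ours, for illustration only), the derived notions sinks(G) and the subgame G ∩ U look as follows:

```python
def sinks(succ):
    """All vertices without successors, given succ: vertex -> set of successors."""
    return {v for v, ws in succ.items() if not ws}

def subgame(succ, priority, owner, u):
    """Restrict the game to the vertex set u (the paper's G ∩ U):
    keep only vertices in u, edges within u, and the restricted maps."""
    return ({v: (succ[v] & u) for v in u},
            {v: priority[v] for v in u},
            {v: owner[v] for v in u})
```

Note that restricting to U may create new sinks: a vertex whose successors all lie outside U has no successors in the subgame.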

Example 1. Consider the graph depicted in Figure 1, representing a parity game. Diamond-shaped vertices are owned by player ♢, whereas box-shaped vertices are owned by player □. The priority of a vertex is written inside the vertex. Vertex u1 is a sink owned by player □. ⊓⊔

Fig. 1. An example parity game

Plays and strategies. The game is played as follows. Initially, a token is placed on a vertex of the graph. The owner of the vertex on which the token resides gets to decide the successor vertex (if any) that the token is moved to next. A maximal sequence of vertices (i.e., an infinite sequence or a finite sequence ending in a sink) visited by the token by following this simple rule is called a play. A finite play π is won by player ♢ if the sink in which it ends is owned by player □, and it is won by player □ if the sink is owned by player ♢. An infinite play π is won by player ♢ if the minimal priority that occurs infinitely often along π is even, and it is won by player □ otherwise.
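The winning conditions above are easy to make concrete for finite plays and for infinite plays given in lasso form (a finite prefix followed by a cycle that repeats forever). The following sketch is our own illustration; the encoding of players as 0 and 1 and the helper names are hypothetical, not from the paper.

```python
# Hypothetical helpers illustrating the winning conditions of Section 2.
# Players are encoded as EVEN (diamond) = 0 and ODD (box) = 1.
EVEN, ODD = 0, 1

def winner_of_finite_play(owner, play):
    """A finite play ends in a sink; the owner of that sink loses."""
    return ODD if owner[play[-1]] == EVEN else EVEN

def winner_of_lasso(priority, prefix, cycle):
    """An infinite play in lasso form: the priorities occurring
    infinitely often are exactly those on the cycle, so the winner is
    determined by the minimal priority on the cycle."""
    m = min(priority[v] for v in cycle)
    return EVEN if m % 2 == 0 else ODD
```

For instance, a play that eventually loops on vertices whose minimal priority is odd is won by player □, regardless of the finite prefix.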

A strategy σα : V∗Vα → V for player α is a partial function that prescribes where player α moves the token next, given a sequence of vertices visited by the token. A play v0 v1 . . . is consistent with a strategy σ if and only if σ(v0 . . . vi) = vi+1 for all i for which σ(v0 . . . vi) is defined. Strategy σα is winning for player α in vertex v if all plays consistent with σα and starting in v are won by α. Player α wins vertex v if and only if she has a winning strategy σα for vertex v. The parity game solving problem asks to compute the set of vertices W♢ won by player ♢ and the set W□ won by player □. Note that since parity games are determined [31,24], every vertex is won by exactly one of the two players; that is, the sets W♢ and W□ partition the set V.

Example 2. Consider the parity game depicted in Figure 1. In this game, the strategy σ♢, partially defined as σ♢(πu0) = u2 and σ♢(πu2) = u0, for arbitrary π, is winning for player ♢ in u0 and u2. Player □ wins vertex u3 using strategy σ□(πu3) = u4, for arbitrary π. Note that player ♢ is always forced to move the token from u4 to u3. Vertex u1 is a sink, owned by player □, and hence won by player ♢. ⊓⊔

Dominions. A strategy σα is said to be closed on a set of vertices U ⊆ V if every play consistent with σα and starting in a vertex v ∈ U remains in U. If player α has a strategy that is closed on U, we say that the set U is α-closed. A dominion for player α is a set of vertices U ⊆ V such that player α has a strategy σα that is closed on U and winning for α. Note that the sets W♢ and W□ are dominions for player ♢ and player □, respectively, and hence every vertex won by player α must belong to an α-dominion.

Example 3. Reconsider the parity game of Figure 1. Observe that player □ has a closed strategy on {u3, u4}, which is also winning for player □. Hence, the set {u3, u4} is a □-dominion. Furthermore, the set {u2, u3, u4} is ♢-closed. However, none of the strategies for which {u2, u3, u4} is closed for player ♢ is winning for her; therefore {u2, u3, u4} is not a ♢-dominion. ⊓⊔

Predecessors, control predecessors and attractors. Let U ⊆ V be a set of vertices. We write pre(G, U) to denote the set of predecessors {v ∈ V | ∃u ∈ U : u ∈ vE} of U in G. The control predecessor set of U for player α in G, denoted cpreα(G, U), contains those vertices from which α is able to force entering U in one step. It is defined as follows:

$$\mathsf{cpre}\_{\alpha}(G, U) = (V\_{\alpha} \cap \mathsf{pre}(G, U)) \cup (V\_{\bar{\alpha}} \setminus (\mathsf{pre}(G, V \setminus U) \cup \mathsf{sinks}(G)))$$

Note that both pre and cpre are monotone operators on the complete lattice (2<sup>V</sup>, ⊆). The α-attractor to U in G, denoted Attrα(G, U), is the set of vertices from which player α can force play to reach a vertex in U:

$$\mathsf{Attr}\_{\alpha}(G, U) = \mu Z.(U \cup \mathsf{cpre}\_{\alpha}(G, Z))$$

The α-attractor to U can be computed by means of a fixed point iteration, starting at U and adding α-control predecessors in each iteration until a stable set is reached. We note that the α-attractor to an α-dominion D is again an α-dominion.
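On an explicitly represented game, cpre and the attractor fixed point can be sketched as follows. This is our own set-based illustration, not the paper's symbolic (LDD-based) implementation; the encoding and the example game in the test are hypothetical.

```python
# Sketch of cpre_alpha and Attr_alpha on an explicitly represented game.
# A game is a vertex set, a successor map `edges`, and an `owner` map;
# players are EVEN = 0 and ODD = 1.
EVEN, ODD = 0, 1

def cpre(vertices, edges, owner, alpha, U):
    """Vertices from which player alpha can force the play into U in one step."""
    pre_U = {v for v in vertices if edges[v] & U}        # pre(G, U)
    pre_rest = {v for v in vertices if edges[v] - U}     # pre(G, V \ U)
    sinks = {v for v in vertices if not edges[v]}
    own = {v for v in vertices if owner[v] == alpha}
    return (own & pre_U) | ((vertices - own) - (pre_rest | sinks))

def attractor(vertices, edges, owner, alpha, U):
    """Least fixed point  mu Z. (U ∪ cpre_alpha(G, Z))."""
    Z = set(U)
    while True:
        Z_next = Z | cpre(vertices, edges, owner, alpha, Z)
        if Z_next == Z:
            return Z
        Z = Z_next
```

For example, if an EVEN self-loop at a vertex is only reachable through an ODD vertex whose sole edge leads there, the ODD vertex (and its predecessors) are pulled into the EVEN attractor, since the opponent has no alternative move.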

Example 4. Consider the parity game G of Figure 1 once again. The set of ♢-control predecessors of {u2} is {u0}. Note that since player □ can avoid moving to u2 from vertex u3 by moving to vertex u4, vertex u3 is not among the ♢-control predecessors of {u2}. The ♢-attractor to {u2} is the set {u0, u2}, which is the largest set of vertices for which player ♢ has a strategy to force play to the set of vertices {u2}. ⊓⊔

### 3 Incomplete Parity Games

In many practical applications that rely on parity game solving, the parity game is gradually constructed by means of an exploration, often starting from an 'initial' vertex. This is, for instance, the case when using parity games in the context of model checking or when deciding behavioural preorders or equivalences. For such applications, it may be profitable to combine exploration and solving, so that the costly exploration can be terminated when the winner of a particular vertex of interest (often the initial vertex) has been determined. The example below, however, illustrates that one cannot naively solve the parity game constructed so far.

Example 5. Consider the parity game G in Figure 2, consisting of all vertices and only the solid edges. This game could, for example, be the result of an exploration starting from u4. Then G ∩ {u0, u1, u2, u3, u4, u5} is a subgame for which we can conclude that all vertices form a ♢-dominion. However, after exploring the dotted edges, player □ can escape to vertex u4 from vertex u5. Consequently, vertices u4 and u5 are no longer won by player ♢ in the extended game. Furthermore, observe that the additional edge from u3 to u5 does not affect the previously established fact that player ♢ wins this vertex. ⊓⊔

Fig. 2. A parity game where the dotted edges are not yet known.

To facilitate reasoning about games with incomplete information, we first introduce the notion of an incomplete parity game.

Definition 2. An incomplete parity game is a structure ⅁ = (G, I), where G is a parity game (V, E, p,(V♢, V□)) and I ⊆ V is a set of vertices with potentially unexplored successors. We refer to the set I as the set of incomplete vertices; the set V \ I is the set of complete vertices.

Observe that (G, ∅) is a 'standard' parity game. We permit ourselves to use the notation for parity game notions such as plays, strategies, dominions, etcetera also in the context of incomplete parity games. In particular, for ⅁ = (G, I), we write pre(⅁, U) and Attrα(⅁, U) to mean pre(G, U) and Attrα(G, U), respectively. Furthermore, we define ⅁ ∩ U as the structure (G ∩ U, I ∩ U).

Intuitively, while exploring a parity game, we extend the set of vertices and edges by exploring the incomplete vertices. Doing so gives rise to potentially new incomplete vertices. At each stage in the exploration, the incomplete parity game extends incomplete parity games explored in earlier stages. We formalise the relation between incomplete parity games, abstracting from any particular order in which vertices and edges are explored.

Definition 3. Let ⅁ = ((V, E, p,(V♢, V□)), I) and ⅁′ = ((V′, E′, p′,(V′♢, V′□)), I′) be incomplete parity games. We write ⅁ ⊑ ⅁′ if the following conditions hold:

(1) V ⊆ V′, V♢ ⊆ V′♢ and V□ ⊆ V′□;
(2) E ⊆ E′ and ((V \ I) × V′) ∩ E′ ⊆ E;
(3) p = p′↾V;
(4) I′ ∩ V ⊆ I.

Conditions (1) and (3) are self-explanatory. Condition (2) states that, on the one hand, no edges are lost and, on the other hand, E′ can only add edges from vertices that are incomplete: for complete vertices, E′ specifies no new successors. Finally, condition (4) captures that the set of incomplete vertices I′ cannot contain vertices that were previously complete. We note that the ordering ⊑ is reflexive, anti-symmetric and transitive.

Example 6. Suppose that ⅁ = (G, I) is the incomplete parity game depicted in Figure 2, where G is the game with all vertices and only the solid edges, and I = {u3, u5}. Then ⅁ ⊑ ⅁′, where ⅁′ = (G′, I′) is the incomplete parity game in which G′ is the depicted game with all vertices and both the solid and dotted edges, and I′ = ∅. ⊓⊔

Let us briefly return to Example 5. We concluded that the winner of vertex u4 (and also u5) changed when adding new information. The reason is that player □ has a strategy to reach an incomplete vertex owned by her. Such an incomplete vertex may present an opportunity to escape from plays that would be non-winning otherwise. On the other hand, the incomplete vertex u3 has already been sufficiently explored to allow for concluding that this vertex is won by player ♢, even if more successors are added to u3. This suggests that for some subset of vertices, we can decide their winner in an incomplete parity game and preserve that winner in all future extensions of the game. We formally characterise this set of vertices in the definition below.

Definition 4. Let ⅁ = (G, I), with G = (V, E, p,(V♢, V□)), be an incomplete parity game. The set of α-safe vertices for ⅁, denoted by safeα(⅁), is the set V \ Attrᾱ(G, Vᾱ ∩ I).

Example 7. Consider the incomplete parity game ⅁ of Example 6 once more. We have safe♢(⅁) = {u0, u1, u2, u3} and safe□(⅁) = {u0, u1, u2, u4, u5}. ⊓⊔
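Definition 4 is straightforward to evaluate on an explicitly represented incomplete game: subtract the opponent's attractor to the opponent's incomplete vertices. The sketch below uses our own hypothetical encoding (players EVEN = 0 and ODD = 1), not the paper's symbolic implementation; the game in the test is likewise ours, not Figure 2.

```python
# Sketch of safe_alpha(G) = V \ Attr_opp(G, V_opp ∩ I) on an explicit
# incomplete game.
EVEN, ODD = 0, 1

def cpre(vertices, edges, owner, alpha, U):
    """One-step control predecessors of U for player alpha."""
    pre_U = {v for v in vertices if edges[v] & U}
    pre_rest = {v for v in vertices if edges[v] - U}
    sinks = {v for v in vertices if not edges[v]}
    own = {v for v in vertices if owner[v] == alpha}
    return (own & pre_U) | ((vertices - own) - (pre_rest | sinks))

def attractor(vertices, edges, owner, alpha, U):
    """Least fixed point  mu Z. (U ∪ cpre_alpha(G, Z))."""
    Z = set(U)
    while True:
        Z_next = Z | cpre(vertices, edges, owner, alpha, Z)
        if Z_next == Z:
            return Z
        Z = Z_next

def safe(vertices, edges, owner, incomplete, alpha):
    """alpha-safe vertices: those the opponent cannot force into one of
    her own incomplete vertices."""
    opp = 1 - alpha
    seed = {v for v in incomplete if owner[v] == opp}
    return vertices - attractor(vertices, edges, owner, opp, seed)
```

Note that when none of the incomplete vertices belongs to the opponent, the opponent's attractor seed is empty and the entire game is α-safe.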

In the remainder of this section, we show that while exploring a parity game, one can indeed only safely determine the winners within the sets safe□(⅁) and safe♢(⅁), respectively. More specifically, we claim (Lemma 1) that all α-dominions found in safeα(⅁) are preserved in extensions of the game, and (Lemma 2) that vertices outside safeα(⅁) are not necessarily won by the same player in extensions of the game.

Lemma 1. Let ⅁ and ⅁′ be incomplete parity games such that ⅁ ⊑ ⅁′. Any α-dominion in ⅁ ∩ safeα(⅁) is also an α-dominion in ⅁′.

Example 8. Recall that in Example 7, we found that safe♢(⅁) = {u0, u1, u2, u3}. Observe that in the incomplete parity game ⅁ of Example 6, restricted to the vertices {u0, u1, u2, u3}, all vertices are won by player ♢; hence, {u0, u1, u2, u3} is a ♢-dominion. Following Lemma 1, we can indeed conclude that this remains a ♢-dominion in all extensions of ⅁ and, in particular, in the (complete) parity game ⅁′ of Example 6. ⊓⊔

Lemma 2. Let ⅁ be an incomplete parity game. Suppose that W is an α-dominion in ⅁. If W ̸⊆ safeα(⅁), then there is an (incomplete) parity game ⅁′ such that ⅁ ⊑ ⅁′ and all vertices in W \ safeα(⅁) are won by ᾱ.

As a corollary of the above lemma, we find that α-dominions that contain vertices outside of the α-safe set are not guaranteed to be dominions in all extensions of the incomplete parity game.

Corollary 1. Let ⅁ be an incomplete parity game. Suppose that W is an α-dominion in ⅁. If W ̸⊆ safeα(⅁), then there is an (incomplete) parity game ⅁′ such that ⅁ ⊑ ⅁′ and W is not an α-dominion in ⅁′.

The theorem below summarises the two previous results, claiming that the sets safe♢(⅁) and safe□(⅁) are the optimal subsets that can be used safely when combining solving and the exploration of a parity game.

Theorem 1. Let ⅁ = (G, I), with G = (V, E, p,(V♢, V□)), be an incomplete parity game. Define Wα as the union of all α-dominions in ⅁ ∩ safeα(⅁), and let W? = V \ (W♢ ∪ W□). Then W? is the largest set of vertices v for which there are incomplete parity games ⅁α and ⅁ᾱ such that ⅁ ⊑ ⅁α and ⅁ ⊑ ⅁ᾱ, and v is won by α in ⅁α and by ᾱ in ⅁ᾱ.

Proof. Let ⅁, with G = (V, E, p,(V♢, V□)), be an incomplete parity game. Pick a vertex v ∈ W? and suppose that in G, vertex v is won by player α. Let ⅁α = ⅁. Then ⅁ ⊑ ⅁α and v is also won by α in ⅁α.

Next, we argue that there must be a game ⅁ᾱ such that ⅁ ⊑ ⅁ᾱ and v is won by ᾱ in ⅁ᾱ. Since v ∈ W? is won by player α in G, v must belong to an α-dominion in G. Towards a contradiction, assume that v ∈ safeα(⅁). Then there must also be an α-dominion containing v in G ∩ safeα(⅁), since ᾱ cannot escape the set safeα(⅁). But then v ∈ Wα; contradiction, so v ∉ safeα(⅁). Hence, v must be part of an α-dominion D in G such that D ̸⊆ safeα(⅁). By Lemma 2, we find that there is an incomplete parity game ⅁ᾱ such that ⅁ ⊑ ⅁ᾱ and all vertices in D \ safeα(⅁), and vertex v ∈ D in particular, are won by ᾱ in ⅁ᾱ.

Finally, we argue that W? cannot be larger. Pick a vertex v ∉ W?. Then there must be some player α such that v ∈ Wα and, consequently, there must be an α-dominion D ⊆ ⅁ ∩ safeα(⅁) such that v ∈ D. But then, by Lemma 1, we find that v is won by α in all incomplete parity games ⅁′ such that ⅁ ⊑ ⅁′. ⊓⊔

### 4 On-the-fly Solving

In the previous section we saw that for any solver solveα that accepts a parity game as input and returns an α-dominion Wα, a correct on-the-fly solving algorithm can be obtained by computing Wα = solveα(⅁ ∩ safeα(⅁)) while exploring an (incomplete) parity game ⅁. While this approach is clearly sound, computing the set of safe vertices can be expensive for large state spaces and potentially wasteful when no dominions are found afterwards. We next introduce safe attractors which, as we show, can be used to search for specific dominions without first computing the α-safe set of vertices.

#### 4.1 Safe Attractors

We start by observing that the α-attractor to a set U in an incomplete parity game ⅁ makes no distinction between complete and incomplete vertices. Consequently, it may wrongly conclude that α has a strategy to force play to U when the attractor strategy involves incomplete vertices owned by ᾱ. We thus need to make sure that such vertices are excluded from consideration. This can be achieved by treating the set of unsafe vertices Vᾱ ∩ I as potential vertices that the other player can use to escape. We define the safe α-attractor as the least fixed point of the safe control predecessor, which is defined as follows:

$$\mathsf{spre}\_{\alpha}(\mathbb{G}, U) = (V\_{\alpha} \cap \mathsf{pre}(\mathbb{G}, U)) \cup (V\_{\bar{\alpha}} \setminus (\mathsf{pre}(\mathbb{G}, V \setminus U) \cup \mathsf{sinks}(\mathbb{G}) \cup I))$$

Lemma 3. Let ⅁ be an incomplete parity game. For all vertex sets X ⊆ safeα(⅁) it holds that cpreα(⅁ ∩ safeα(⅁), X) = spreα(⅁, X).

The safe α-attractor to U, denoted SAttrα(⅁, U), is the set of vertices from which player α can force the play to safely reach U in ⅁:

$$\mathsf{SAttr}\_{\alpha}(\mathbb{G}, U) = \mu Z.(U \cup \mathsf{spre}\_{\alpha}(\mathbb{G}, Z))$$

Lemma 4. Let ⅁ be an incomplete parity game, and X ⊆ safeα(⅁). Then Attrα(⅁ ∩ safeα(⅁), X) = SAttrα(⅁, X).

In particular, we can conclude the following:

Corollary 2. Let ⅁ be an incomplete parity game, and X ⊆ safeα(⅁) be an α-dominion. Then SAttrα(⅁, X) is an α-dominion for all ⅁ ′ satisfying ⅁ ⊑ ⅁ ′ .

One application of the above corollary is the following: since on-the-fly solving is typically performed repeatedly, previously found dominions can be expanded by computing the safe α-attractor towards these already solved vertices. A second corollary is that one can safely attract towards complete sinks.

Corollary 3. Let ⅁ = (G, I) be an incomplete parity game and let ⅁′ be such that ⅁ ⊑ ⅁′. Then SAttrα(⅁, sinksᾱ(⅁) \ I) is an α-dominion in ⅁′.
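The safe control predecessor and safe attractor admit the same kind of explicit-game sketch as the ordinary attractor; the encoding below is our own hypothetical illustration, not the paper's symbolic implementation. The only difference from cpre is that opponent-owned incomplete vertices are additionally excluded, since they may later gain escape edges.

```python
# Sketch of spre_alpha and SAttr_alpha on an explicit incomplete game;
# players are EVEN = 0 and ODD = 1.
EVEN, ODD = 0, 1

def spre(vertices, edges, owner, incomplete, alpha, U):
    """Like cpre, but opponent vertices that are incomplete count as
    potential escapes and are therefore excluded."""
    pre_U = {v for v in vertices if edges[v] & U}
    pre_rest = {v for v in vertices if edges[v] - U}
    sinks = {v for v in vertices if not edges[v]}
    own = {v for v in vertices if owner[v] == alpha}
    return (own & pre_U) | ((vertices - own) - (pre_rest | sinks | incomplete))

def safe_attractor(vertices, edges, owner, incomplete, alpha, U):
    """Least fixed point  mu Z. (U ∪ spre_alpha(G, Z))."""
    Z = set(U)
    while True:
        Z_next = Z | spre(vertices, edges, owner, incomplete, alpha, Z)
        if Z_next == Z:
            return Z
        Z = Z_next
```

For a target set inside the α-safe region, this coincides with the ordinary attractor computed within the safe subgame, as Lemma 4 states.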

#### 4.2 Partial Solvers

In practice, a full-fledged solver, such as Zielonka's algorithm [31] or one of the Priority Promotion variants [2], may be too costly to run often while exploring a parity game. Instead, cheaper partial solvers may be used that search for a dominion of a particular shape. We study three such partial solvers in this section, with a particular focus on solvers that lend themselves to parity games that are represented symbolically using, e.g., BDDs [5], MDDs [25] or LDDs [13]. For the remainder of this section, we fix an arbitrary incomplete parity game ⅁ = ((V, E, p,(V♢, V□)), I).

Winning solitaire cycles. A simple cycle in ⅁ can be represented by a finite sequence of distinct vertices v0 v1 . . . vn satisfying v0 ∈ vnE. Such a cycle is an α-solitaire cycle whenever all vertices on the cycle are owned by player α.

Observe that if all vertices on an α-solitaire cycle have a priority of the same parity as the owner α, then all vertices on that cycle are won by player α. Formally, these are cycles through vertices in the set Pα ∩ Vα, where P♢ = {v ∈ V \ sinks(⅁) | p(v) mod 2 = 0} and P□ = {v ∈ V \ sinks(⅁) | p(v) mod 2 = 1}. Let Cαsol(⅁) denote the largest set of α-solitaire winning cycles. Then Cαsol(⅁) = νZ.(Pα ∩ Vα ∩ pre(⅁, Z)).
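The greatest fixed point for Cαsol can be computed by starting from the candidate set Pα ∩ Vα and repeatedly discarding vertices without a successor inside the remaining set. The explicit-game sketch below is our own hypothetical encoding (players EVEN = 0, ODD = 1, so a priority has player α's parity exactly when p(v) mod 2 = α), not the paper's symbolic implementation.

```python
# Sketch of C_sol^alpha = nu Z. (P_alpha ∩ V_alpha ∩ pre(G, Z)):
# start from the alpha-owned non-sink vertices whose priority has
# alpha's parity, then prune until every remaining vertex still has a
# successor that remains.
EVEN, ODD = 0, 1

def solitaire_cycles(vertices, edges, owner, priority, alpha):
    Z = {v for v in vertices
         if edges[v] and owner[v] == alpha and priority[v] % 2 == alpha}
    while True:
        Z_next = {v for v in Z if edges[v] & Z}  # Z ∩ pre(G, Z)
        if Z_next == Z:
            return Z
        Z = Z_next
```

In the test, two EVEN-owned vertices of even priority form a cycle and are detected, whereas an EVEN vertex whose only edge leaves the candidate set is pruned away.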

Proposition 1. The set Cαsol(⅁) is an α-dominion and we have Cαsol(⅁) ⊆ safeα(⅁).

Proof. We first prove that Cαsol(⅁) ⊆ safeα(⅁). We show, by means of an induction on the fixed point approximants Ai of the attractor, that Cαsol(⅁) ∩ Attrᾱ(⅁, Vᾱ ∩ I) = ∅. The base case follows immediately, as Cαsol(⅁) ∩ A0 = Cαsol(⅁) ∩ ∅ = ∅. For the induction, we assume that Cαsol(⅁) ∩ Ai = ∅; we show that also Cαsol(⅁) ∩ ((Vᾱ ∩ I) ∪ cpreᾱ(⅁, Ai)) = ∅. First, observe that Cαsol(⅁) ⊆ Vα; hence, it suffices to prove that Cαsol(⅁) ∩ (Vα \ (pre(⅁, V \ Ai) ∪ sinks(⅁))) = ∅. But this follows immediately from the fact that for every vertex v ∈ Cαsol(⅁), we have v ∈ Pα ∩ Vα ∩ pre(⅁, Cαsol(⅁)); more specifically, we have vE ∩ Cαsol(⅁) ≠ ∅ for all v ∈ Cαsol(⅁).

The fact that Cαsol(⅁) is an α-dominion follows from the fact that for every vertex v ∈ Cαsol(⅁), there is some w ∈ vE ∩ Cαsol(⅁). This means that player α has a strategy that is closed on Cαsol(⅁). Since all vertices in Cαsol(⅁) have a priority that is beneficial to α, this closed strategy is also winning for α. ⊓⊔

Observe that winning solitaire cycles can be computed without first computing the α-safe set. Parity games that stand to profit from detecting winning solitaire cycles are those originating from verifying safety properties.

Winning forced cycles. In general, a cycle in safeα(⅁) through vertices in P♢ can contain vertices of both players, providing player □ an opportunity to break the cycle if that is beneficial to her. Nevertheless, if breaking a cycle inadvertently always leads to another cycle through P♢, then we may conclude that all vertices on these cycles are won by player ♢. We call such cycles winning forced cycles for player ♢. A dual argument applies to cycles through P□. Let Cαfor(⅁) denote the largest set of vertices that lie on winning forced cycles for player α. More formally, we define Cαfor(⅁) = νZ.(Pα ∩ safeα(⅁) ∩ cpreα(⅁, Z)).

Lemma 5. The set Cαfor(⅁) is an α-dominion and we have Cαfor(⅁) ⊆ safeα(⅁).

A possible downside of the above construction is that it again requires first computing safeα(⅁), which, in particular cases, may incur additional overhead. Instead, we can compute the same set using the safe control predecessor. We define Cαs-for(⅁) = νZ.(Pα ∩ spreα(⅁, Z)).
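The spre-based formulation avoids the safe-set precomputation entirely. A sketch on an explicit game follows, using the same hypothetical encoding as our earlier sketches (players EVEN = 0 and ODD = 1); it is an illustration, not the paper's symbolic implementation.

```python
# Sketch of C_{s-for}^alpha = nu Z. (P_alpha ∩ spre_alpha(G, Z)).
EVEN, ODD = 0, 1

def spre(vertices, edges, owner, incomplete, alpha, U):
    """Safe control predecessor: opponent-owned incomplete vertices
    count as potential escapes."""
    pre_U = {v for v in vertices if edges[v] & U}
    pre_rest = {v for v in vertices if edges[v] - U}
    sinks = {v for v in vertices if not edges[v]}
    own = {v for v in vertices if owner[v] == alpha}
    return (own & pre_U) | ((vertices - own) - (pre_rest | sinks | incomplete))

def forced_cycles(vertices, edges, owner, priority, incomplete, alpha):
    # P_alpha: non-sink vertices whose priority has alpha's parity
    P = {v for v in vertices if edges[v] and priority[v] % 2 == alpha}
    Z = set(P)
    while True:
        Z_next = P & spre(vertices, edges, owner, incomplete, alpha, Z)
        if Z_next == Z:
            return Z
        Z = Z_next
```

On a two-vertex cycle with even priorities whose odd-owned vertex is incomplete, the result is empty, since the opponent may later gain an escape edge; once that vertex is complete, both vertices are detected.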

Proposition 2. We have Cαfor(⅁) = Cαs-for(⅁).

Proof. Let τ(Z) = Pα ∩ spreα(⅁, Z). We use set inclusion to show that Cαfor(⅁) is indeed a fixed point of τ.


We show next that for any Z = τ(Z), we have Z ⊆ Cαfor(⅁). Let Z be such a set. We first show that for every v ∈ Z ∩ Vα there is some w ∈ vE ∩ Z, and that for every v ∈ Z ∩ Vᾱ we have v ∉ sinks(⅁), v ∉ I and vE ⊆ Z. Pick v ∈ Z ∩ Vα. Then v ∈ τ(Z) ∩ Vα = Pα ∩ Vα ∩ spreα(⅁, Z) ⊆ pre(⅁, Z). But then vE ∩ Z ≠ ∅. Next, let v ∈ Z ∩ Vᾱ. Then v ∈ τ(Z) ∩ Vᾱ = Pα ∩ Vᾱ ∩ spreα(⅁, Z) ⊆ Vᾱ \ (pre(⅁, V \ Z) ∪ sinks(⅁) ∪ I). So v ∉ pre(⅁, V \ Z) ∪ sinks(⅁) ∪ I. Consequently, vE ⊆ Z, v ∉ sinks(⅁) and v ∉ I.

Since for every v ∈ Z ∩ Vα we have vE ∩ Z ≠ ∅, there must be a strategy for player α to move to another vertex in Z. Let σ be this strategy. Moreover, since for all v ∈ Z ∩ Vᾱ we have vE ⊆ Z, we find that σ is closed on Z, and since Z ∩ sinks(⅁) = ∅, strategy σ induces forced cycles. Moreover, since Z ⊆ Pα, we can conclude that all vertices in Z are on winning forced cycles.

Finally, we must argue that Z ⊆ safeα(⅁). But this follows from the fact that Z ∩ Vᾱ ∩ I = ∅ and, hence, also Z ∩ Attrᾱ(⅁, Vᾱ ∩ I) = ∅. Since Z is contained within Pα ∩ safeα(⅁), we find that Z ⊆ Cαfor(⅁). ⊓⊔

Fatal attractors. Both solitaire cycles and forced cycles exploit the fact that the parity winning condition becomes trivial if the only priorities that occur on a play are of the parity of a single player. Fatal attractors [17] were originally conceived to solve parts of a game using algorithms that have an appealing worst-case running time; for a detailed account, we refer to [17]. While ibid. investigates several variants, the main idea behind a fatal attractor is that it identifies cycles in which the priorities are non-decreasing until the dominating priority of the attractor is (re)visited. We focus on a simplified (and cheaper) variant of the psolB algorithm of [17], which is based on the concept of a monotone attractor; this, in turn, relies on the monotone control predecessor defined below, where P≥c = {v ∈ V | p(v) ≥ c}:

$$\mathsf{Mcpre}\_{\alpha}(\mathbb{G}, Z, U, c) = P^{\geq c} \cap \mathsf{cpre}\_{\alpha}(\mathbb{G}, Z \cup U)$$

The monotone attractor for a given priority is then defined as the least fixed point of the monotone control predecessor for that priority; formally, MAttrα(⅁, U, c) = µZ.Mcpreα(⅁, Z, U, c). A fatal attractor for priority c is the largest set of vertices closed under the monotone attractor for priority c; i.e., Fα(⅁, c) = νZ.(P=c ∩ safeα(⅁) ∩ MAttrα(⅁ ∩ safeα(⅁), Z, c)), where P=c = P≥c \ P≥c+1.

Lemma 6 (See [17], Theorem 2). For even c, we have MAttr♢(⅁ ∩ safe♢(⅁), F♢(⅁, c), c) ⊆ safe♢(⅁) and MAttr♢(⅁ ∩ safe♢(⅁), F♢(⅁, c), c) is a ♢-dominion. For odd c, we have MAttr□(⅁ ∩ safe□(⅁), F□(⅁, c), c) ⊆ safe□(⅁) and MAttr□(⅁ ∩ safe□(⅁), F□(⅁, c), c) is a □-dominion.

Our simplified version of the psolB algorithm, here dubbed solB−, computes fatal attractors for all priorities in descending order, accumulating ♢- and □-dominions and extending these dominions using a standard ♢- or □-attractor. This can be implemented using a simple loop over these priorities.

In line with the previous solvers, we can also modify this solver to employ a safe monotone control predecessor, which uses a construction similar in spirit to that of the safe control predecessor. Formally, we define the safe monotone control predecessor as follows:

$$\mathsf{sMcpre}\_{\alpha}(\mathbb{G}, Z, U, c) = P^{\geq c} \cap \mathsf{spre}\_{\alpha}(\mathbb{G}, Z \cup U)$$

The corresponding safe monotone α-attractor, denoted sMAttrα(⅁, U, c), is defined as sMAttrα(⅁, U, c) = µZ.sMcpreα(⅁, Z, U, c). We define the safe fatal attractor for priority c as the set Fαs(⅁, c) = νZ.(P=c ∩ sMAttrα(⅁, Z, c)).

Proposition 3. Let ⅁ be an incomplete parity game. We have F♢s(⅁, c) = F♢(⅁, c) for even c, and F□s(⅁, c) = F□(⅁, c) for odd c.

Similar to algorithm solB−, the algorithm solB−s computes safe fatal attractors for priorities in descending order and collects the safe-α-attractor-extended dominions obtained this way.
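The safe fatal attractor is a least fixed point (sMAttr) nested inside a greatest fixed point; solB−s then simply iterates this over the occurring priorities in descending order. The explicit-game sketch below is our own hypothetical encoding, not the paper's symbolic implementation.

```python
# Sketch of sMAttr and the safe fatal attractor F_s^alpha for priority c,
# with players EVEN = 0 and ODD = 1 (so the player of c's parity is c % 2).
EVEN, ODD = 0, 1

def spre(vertices, edges, owner, incomplete, alpha, U):
    pre_U = {v for v in vertices if edges[v] & U}
    pre_rest = {v for v in vertices if edges[v] - U}
    sinks = {v for v in vertices if not edges[v]}
    own = {v for v in vertices if owner[v] == alpha}
    return (own & pre_U) | ((vertices - own) - (pre_rest | sinks | incomplete))

def smattr(vertices, edges, owner, priority, incomplete, alpha, U, c):
    """mu Z. (P>=c ∩ spre_alpha(G, Z ∪ U))."""
    geq = {v for v in vertices if priority[v] >= c}
    Z = set()
    while True:
        Z_next = geq & spre(vertices, edges, owner, incomplete, alpha, Z | U)
        if Z_next == Z:
            return Z
        Z = Z_next

def safe_fatal(vertices, edges, owner, priority, incomplete, c):
    """nu Z. (P=c ∩ sMAttr_alpha(G, Z, c)); solB-_s would call this for
    each priority c in descending order and extend the results with the
    safe attractor."""
    alpha = c % 2
    eq = {v for v in vertices if priority[v] == c}
    Z = set(eq)
    while True:
        Z_next = eq & smattr(vertices, edges, owner, priority, incomplete,
                             alpha, Z, c)
        if Z_next == Z:
            return Z
        Z = Z_next
```

In the test, a two-vertex cycle in which every vertex has priority 0 is fatal for the even player when the game is complete, but not while the odd-owned vertex is still incomplete.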

### 5 Experimental Results

We experimentally evaluate the techniques of Section 4. For this, we use games stemming from practical model checking and equivalence checking problems. Our experiments are run, single-threaded, on an Intel Xeon 6136 CPU running at 3 GHz. The sources for these experiments can be obtained from the downloadable artefact [21].

#### 5.1 Implementation

We have implemented a symbolic exploration technique for parity games in the mCRL2 toolset [6]. Our tool exploits techniques such as read and write dependencies [20,4] and uses sophisticated exploration strategies such as chaining and saturation [9]. We use MDD-like data structures [25] called List Decision Diagrams (LDDs), and the corresponding Sylvan implementation [13], to represent parity games symbolically. Sylvan also offers efficient implementations of set operations and relational operations, such as predecessors, facilitating the implementation of attractor computations, the described (partial) solvers, and a full solver based on Zielonka's recursive algorithm [31], which remains one of the most competitive algorithms in practice, both explicitly and symbolically [28,12]. For the attractor set computation we have also implemented chaining to determine (multi-)step α-predecessors more efficiently.

For all three on-the-fly solving techniques of Section 4, we have implemented 1) a variant that runs the standard (partial) solver on the α-safe subgame and removes the found dominion using the standard attractor (within that subgame), and 2) a variant that uses (partial) solvers with the safe attractors. Moreover, we also conduct experiments using the full solver running on an α-safe subgame. An important design aspect is deciding how the exploration and the on-the-fly solving should interleave. For this we have implemented a time-based heuristic that keeps track of the time spent on solving and exploration steps. The time measurements are used to ensure that (approximately) ten percent of the total time is spent on solving, by delaying the next call to the solver. We do not terminate a partial solver that exceeds its allotted time, so the fraction is only approximate. As a result of this heuristic, cheap solvers are called more frequently than more expensive (and more powerful) ones, which may cause the latter to explore larger parts of the game graph.
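The time-based interleaving heuristic can be sketched as follows. The scheduler below is a hypothetical reconstruction of the described behaviour, not the mCRL2 code, and the callback names (`explore_step`, `partial_solve`, `exhausted`) are ours.

```python
# Sketch of the interleaving heuristic: spend roughly `budget` (here 10%)
# of the total time on solving by delaying the next solver call.
import time

def explore_and_solve(explore_step, partial_solve, exhausted, budget=0.10):
    """Alternate exploration steps with partial solving; return True as
    soon as a partial solver decides the vertex of interest, or False
    once the game has been fully explored."""
    explore_time = solve_time = 0.0
    while not exhausted():
        t0 = time.monotonic()
        explore_step()                    # explore a few vertices/edges
        explore_time += time.monotonic() - t0
        # only call the solver if solving has stayed within its budget
        if solve_time <= budget * (explore_time + solve_time):
            t0 = time.monotonic()
            decided = partial_solve()     # True if the winner is known
            solve_time += time.monotonic() - t0
            if decided:
                return True
    return False
```

Because the solver is never interrupted mid-run, an expensive partial solver automatically delays its own next invocation, matching the described behaviour that cheap solvers run more often.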

#### 5.2 Cases

Table 1 provides an overview of the models and a description of the property being checked. The properties are written in the modal µ-calculus with data [15]. For the equivalence checking case we have mutated the original model to introduce a defect. For each property, we indicate the nesting depth (ND), the alternation depth (AD) [10] and whether the parity game is solitaire (Yes/No). The nesting depth indicates how many different priorities occur in the resulting game; for our encoding this is at most ND+2 (the additional ones encode the constants 'true' and 'false'). The alternation depth is an indication of a game's complexity due to alternating priorities.

Table 1. Models and formulas.


(Table columns: Model, Ref., Prop., Result, ND, AD, Sol., Description.)

We use MODEL-i to indicate the parity game belonging to model MODEL and property i. Models SWP, BKE and CCP are protocol specifications. The model PDI is a specification of an EULYNX SCI-LX SysML interface model that is used for a train interlocking system. Finally, WMS is the specification of a workload management system used at CERN. Using tools in mCRL2 [6], we have converted each model and property combination into a so-called parameterised Boolean equation system [16], a higher-level logic that can be used to represent the underlying parity game.

Parity games SWP-1, WMS-1, WMS-2 and BKE-1 encode typical safety properties stating that some action should not be possible. In terms of the alternation-free modal mu-calculus with regular expressions, such properties are of the shape [true∗.a]false. These properties are violated exactly when the vertex encoding 'false' can be reached. Parity games SWP-2, WMS-3 and WMS-4 are more complex properties with alternating priorities, where WMS-4 encodes branching bisimulation using the theory presented in [8]. The parity games BKE-2 and CCP-1 encode a 'no deadlock' property given by a formula stating that along every path there is at least one outgoing transition. Finally, CCP-2 and all PDI cases contain formulas with multiple fixed points that yield games with multiple priorities but no (dependent) alternation.

Table 2. Experiments with parity games where on-the-fy solving cannot terminate early. All run times are in seconds. The number of vertices is given in millions. Memory is given in gigabytes. Bold-faced numbers indicate the lowest value.


#### 5.3 Results

In Tables 2 and 3 we compare the on-the-fly solving strategies presented in Section 4. In the 'Strategy' column we indicate the on-the-fly solving strategy that is used. Here, full refers to a complete exploration followed by solving with Zielonka's recursive algorithm. We use solitaire to refer to solitaire winning cycle detection, cycles for forced winning cycle detection, fatal for fatal attractors and, finally, partial for on-the-fly solving with a Zielonka solver on safe regions. For solvers with both a standard variant and a variant that utilises the safe attractors, the first number indicates the result of applying the (standard) solver on safe vertices, and the second number (following the slash '/') indicates the result when using the solver that utilises safe attractors.

The column 'Vertices' indicates the number of vertices explored in the game. The next columns indicate the time spent specifically on exploring and on solving, and the total time in seconds. We exclude the initialisation time that is common to all experiments. Finally, the last column indicates the memory used by the tool in gigabytes. We report the average of 5 runs and have set a timeout (indicated by ‡) at 1200 seconds per run. Table 2 contains all benchmarks that require a full exploration of the game graph, providing an indication of the overhead in cases where this is unavoidable; Table 3 contains all benchmarks where at least one of the partial solvers allows exploration to terminate early.

Table 3. Experiments with parity games in which at least one partial solver terminates early. All run times are in seconds. The number of vertices is given in millions. For solvers with two variants the first number indicates the result of applying the solver on safe vertices, and following the slash '/' the result when using the solver that uses safe attractors. Memory is given in gigabytes. Bold-faced numbers indicate the lowest value.


For the games SWP-1, WMS-1 and WMS-2 in Table 3 we find that solitaire, and in particular its safe attractor variant, determines the solution the fastest. Also, for all entries in Table 2 this is the solver with the least overhead. Next, we observe that for cases such as WMS-1 and PDI-3, using the safe attractor variants of the solvers can be detrimental. Our observation is that first computing safe sets (especially using chaining) can be quick when most vertices are owned by one player and carry one priority, whereas the computation of the safe attractor, which uses the more involved safe control predecessor, is more expensive in such cases. There are also cases, WMS-3, WMS-4, CCP-1 and CCP-2, where the safe attractor variants are faster; these cases all have multiple priorities. In cases where these solvers are slow (for example PDI-3) we also observe that more states are explored before termination, because the earlier mentioned time-based heuristic results in calling the solver significantly less frequently.

For parity games SWP-2 and WMS-3, only fatal and partial are able to find a solution early, which shows that more powerful partial solvers can be useful. From Table 2 and the cases in which the safe attractor variants perform poorly, we learn that the partial solvers can, as expected, cause overhead. This overhead is on average 30 percent in our benchmarks, but when a partial solver terminates early it can be very beneficial, achieving speed-ups of up to several orders of magnitude.

#### 6 Conclusion

In this work we have developed the theory to reason about on-the-fly solving of parity games, independent of the strategy that is used to explore games. We have introduced the notion of safe vertices, shown their correctness, and proven an optimality result; we have also studied partial solvers and shown that these can be made to run without determining the safe vertices first, which can be useful for on-the-fly solving. Finally, we have demonstrated the practical purpose of our method and observed that solitaire winning cycle detection with safe attractors is almost always beneficial with minimal overhead, but also that more powerful partial solvers can be useful.

Based on our experiments, one can make an educated guess which partial solver to select in particular cases; we believe that this selection could even be steered by analysing the parameterised Boolean equation system representing the parity game. It would furthermore be interesting to study (practical) improvements for the safe attractors, and their use in Zielonka's recursive algorithm.

Acknowledgements We would like to thank Jeroen Meijer and Tom van Dijk for their help regarding the Sylvan library when implementing our prototype. This work was supported by the TOP Grants research programme with project number 612.001.751 (AVVA), which is (partly) financed by the Dutch Research Council (NWO).

#### References



Equivalence Checking

# **Distributed Coalgebraic Partition Refinement**

### Fabian Birkmann, Hans-Peter Deifel⋆, and Stefan Milius⋆⋆

Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany {fabian.birkmann,hans-peter.deifel,stefan.milius}@fau.de

**Abstract.** Partition refinement is a method for minimizing automata and transition systems of various types. Recently we have developed a partition refinement algorithm and the tool CoPaR that is generic in the transition type of the input system and matches the theoretical run time of the best known algorithms for many concrete system types. Genericity is achieved by modelling transition types as functors on sets and systems as coalgebras. Experimentation has shown that memory consumption is a bottleneck for handling systems with a large state space, while running times are fast. We have therefore extended an algorithm due to Blom and Orzan, which is suitable for a distributed implementation, to the coalgebraic level of genericity and implemented it in CoPaR. Experiments show that this allows us to handle much larger state spaces. Running times are low in most experiments, but there is a significant penalty for some.

# **1 Introduction**

Minimization is an important and basic algorithmic task on state-based systems, concerned with reducing the state space as much as possible while retaining the system's behaviour. It is used for equivalence checking of systems and as a subtask in model checking tools in order to handle larger state spaces and thus mitigate the state-explosion problem.

We focus on the task of identifying behaviourally equivalent states modulo bisimilarity. For classic labelled transition systems this notion obeys the principle 'states *s* and *t* are bisimilar if for every transition $s \xrightarrow{a} s'$ there exists a transition $t \xrightarrow{a} t'$ with $s'$ and $t'$ bisimilar', and symmetrically for transitions from *t*. Bisimilarity is a rather fine-grained branching-time notion of equivalence (cf. [17]); it is widely used and preserves all properties expressible as *µ*-calculus formulas. Moreover, it has been generalized to yield equivalence notions for many other types of state-based systems and automata.

Due to the above principle, bisimilarity is defined by a fixed point, to be understood as a greatest fixed point, and is hence approximable from above. This is exploited by *partition refinement* algorithms: the initial partition, which tentatively considers all states equivalent, is then iteratively refined using observations

⋆ Supported by the Deutsche Forschungsgemeinschaft (DFG) within the Research and Training Group 2475 "Cybercrime and Forensic Computing" (393541319/GRK2475/1-2019)

⋆⋆ Supported by Deutsche Forschungsgemeinschaft (DFG) under project MI 717/7-1.

about the states until a fixed point is reached. Consequently, such procedures run in polynomial time and can also be efficiently implemented, in contrast to coarser system equivalences such as trace equivalence and language equivalence of nondeterministic systems which are PSPACE-complete [23]. This makes minimization under bisimilarity interesting even in cases where the main equivalence is linear-time, such as for automata.

Efficient partition refinement algorithms exist for various systems: Kanellakis and Smolka provide a minimization algorithm with run time O(*m*·*n*) for labelled transition systems with *n* states and *m* transitions. Even faster algorithms have been developed over the past 50 years for many types of systems. For example, Hopcroft's algorithm for minimizing deterministic automata has run time in O(*n*·log *n*) [21]; it was later generalized to variable input alphabets, with run time O(*n*·|*A*|·log *n*) [18,24]. The Paige-Tarjan algorithm minimizes transition systems in time O((*m* + *n*) · log *n*) [31], and generalizations to labelled transition systems have the same time complexity [13, 22, 36]. For the minimization of weighted systems (a.k.a. *lumping*), Valmari and Franceschinis [38] have developed a simple O((*m*+*n*)·log *n*) algorithm for systems with rational weights. Buchholz [10] gave an algorithm for weighted automata, and Högberg et al. [20] one for (bottom-up) weighted tree automata, both with run time in O(*m* · *n*).

In previous work [16,42], we have provided an efficient partition refinement algorithm, which is generic in the system type, captures all the above system types, and matches or, in some cases, even improves on the run time complexity of the respective specialized algorithms. Subsequently, we have shown how to extend the generic complexity analysis to weighted tree automata and implemented the algorithm in the tool CoPaR [11, 41], again matching the previous best run time complexity and improving it in the case of weighted tree automata with weights from a non-cancellative monoid. The algorithm is based on ideas of Paige and Tarjan, which leads to its efficiency. Genericity is achieved by modelling state-based systems as coalgebras, following the paradigm of universal coalgebra [34], in which the transition structure of systems is encapsulated by a set functor. The algorithm and tool are *modular* in the sense that functors can be built from a preimplemented set of basic functors by standard set constructions such as cartesian product, disjoint union and functor composition. The tool then automatically derives a parser for input coalgebras of the composed type and provides a corresponding partition refinement implementation off the shelf. In addition, new basic functors *F* may easily be added to the set of basic functors by implementing a simple refinement interface for them plus a parser for encoded *F*-coalgebras. Our experiments with the tool have shown that run time scales well with the size of systems. However, memory usage becomes a bottleneck with growing system size, a problem that has previously also been observed by Valmari [37] for partition refinement. One strategy to address this is to distribute the algorithm across multiple computers, which store and process only a part of the state space and communicate via message passing.
For ordinary labelled transition systems and Markov systems this has been investigated in a series of papers by Blom and Orzan [4–9] who were also motivated to mitigate the memory bottleneck of sequential partition refinement algorithms.

Our contribution in this paper is an extension of CoPaR by an efficient distributed partition refinement algorithm in coalgebraic generality. As in Blom and Orzan's work, our algorithm is a distributed version of a simple but effective algorithm called "the naive method" [23], or "the final chain algorithm" in coalgebraic generality [25, 42]. We first generalize the signature refinement introduced by Blom and Orzan to the level of coalgebras. We also combine generalized signatures (Section 3) with the previous encodings of set functors and their coalgebras [11, 41] via the new notion of a signature interface (Definition 3.1). This is a key idea to make coalgebraic signature refinement and the final chain algorithm implementable in a tool like CoPaR. In addition, we demonstrate how signature interfaces of functors can be combined (Construction 3.3 and Proposition 3.4) along standard functor constructions. This yields a modularity principle similar to that of the previous sequential algorithm. However, this is a new feature for signature refinement and also, to our knowledge, for the final chain algorithm. Consequently, our distributed, modular and generic implementation of the final chain algorithm is new (already as a sequential algorithm).

We also provide experiments demonstrating its scalability and show that much larger state spaces can indeed be handled. Our benchmarks include weighted tree automata over non-cancellative monoids, a type of system for which our previous sequential implementation is heavily limited by its memory requirements. For those systems the running times of the distributed algorithm are even faster than those of the sequential algorithm. In a second set of benchmarks stemming from the PRISM benchmark suite [27] we again show that larger systems can now be handled; however, for some of these there is a penalty in run time.

**Related work.** Balcazar et al. [1] have proved that the problem of bisimilarity checking for labelled transition systems is *P*-complete, which implies that it is hard to parallelize efficiently. Nevertheless, parallel algorithms have been proposed by Rajasekaran and Lee [33]. These are designed for shared memory machines and hence do not distribute RAM requirements over multiple machines.

Symbolic techniques are an orthogonal approach to reduce memory usage of partition refinement algorithms and have been explored e.g. by Wimmer et al. [40] and van Dijk and de Pol [15].

Two other orthogonal extensions of the generic coalgebraic minimization algorithm and CoPaR have been presented in recent work. First, a non-trivial extension computes (1) the reachable states and (2) the transition structure of the minimized system [12]. Second, Wißmann et al. [43] have shown how to compute distinguishing formulas in a Hennessy-Milner style logic for a pair of behaviourally inequivalent states.

### **2 Preliminaries**

Our algorithmic framework and the tool CoPaR [41,42] are based on modelling state-based systems abstractly as *coalgebras* for a (set) *functor* that encapsulates the transition type, following the paradigm of *universal coalgebra* [34]. We now recall some standard notations for sets and maps and basic notions and examples in coalgebra. We fix a singleton set 1 = {∗}; for every set *X* we have a unique map !: *X* → 1 and the identity map $\mathrm{id}_X \colon X \to X$. We denote composition of maps by (−) · (−), in applicative order. Given maps *f* : *X* → *A* and *g* : *X* → *B* we define $\langle f, g\rangle \colon X \to A \times B$ by $\langle f, g\rangle(x) = (f(x), g(x))$. The type of transitions of states in a system is modelled by a set functor *F*. Informally, *F* assigns to every set *X* a set *F X* of structured collections of elements of *X*, and an *F*-coalgebra is a map *c* : *S* → *F S* which assigns to every state *s* ∈ *S* in a system a structured collection *c*(*s*) ∈ *F S* of successor states of *s*. The functor *F* also determines a canonical notion of behavioural equivalence of states of a coalgebra; this arises by stipulating that morphisms of coalgebras are behaviour-preserving maps.

**Definition 2.1.** A *functor* *F* : Set → Set assigns to each set *X* a set *F X* and to each map *f* : *X* → *Y* a map *F f* : *F X* → *F Y*, preserving identities and composition ($F\mathrm{id}_X = \mathrm{id}_{FX}$ and $F(g \cdot f) = Fg \cdot Ff$). An *F-coalgebra* (*S, c*) consists of a set *S* of *states* and a *transition structure* *c* : *S* → *F S*. A *morphism* $h \colon (S, c) \to (S', c')$ of *F*-coalgebras is a map $h \colon S \to S'$ that preserves the transition structure, i.e. $Fh \cdot c = c' \cdot h$. Two states *s, t* ∈ *S* of a coalgebra *c* : *S* → *F S* are *behaviourally equivalent* (*s* ∼ *t*) if there exists a coalgebra morphism *h* with *h*(*s*) = *h*(*t*).

**Example 2.2.** We mention several types of systems which are instances of the general notion of coalgebra and the ensuing notion of behavioural equivalence. All these are possible input systems for our tool CoPaR.

(1) Transition systems. The *finite powerset* functor P*<sup>ω</sup>* maps a set *X* to the set P*ωX* of all *finite* subsets of *X*, and a map *f* : *X* → *Y* to the map P*ωf* = *f*[−]: P*ωX* → P*ωY* taking direct images. Coalgebras for P*<sup>ω</sup>* are finitely branching (unlabelled) transition systems. Two states are behaviourally equivalent iff they are (strongly) bisimilar in the sense of Milner [29,30] and Park [32]. Similarly, finitely branching labelled transition systems with label alphabet *A* are coalgebras for the functor *F X* = P*ω*(*A* × *X*).

(2) Deterministic automata. For an input alphabet *A*, the functor given by *F X* = 2 × *X<sup>A</sup>*, where 2 = {0*,* 1}, sends a set *X* to the set of pairs of boolean values and functions *A* → *X*. An *F*-coalgebra (*S, c*) is a deterministic automaton (without an initial state). For each state *s* ∈ *S*, the first component of *c*(*s*) determines whether *s* is a final state, and the second component is the successor function *A* → *S* mapping each input letter *a* ∈ *A* to the successor state of *s* under input letter *a*. States *s, t* ∈ *S* are behaviourally equivalent iff they accept the same language in the usual sense.
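To make this concrete, here is a small, purely illustrative Python sketch (not CoPaR code) of a deterministic automaton as a coalgebra with $c(s) \in 2 \times S^A$, together with a naive equivalence check that explores reachable pairs of states; all names and the example automaton are made up for the illustration.

```python
# Illustrative sketch (not CoPaR code): a deterministic automaton over
# A = {'a', 'b'} as a coalgebra c : S -> 2 x S^A, where c(s) is a pair
# (final?, successor map).  States q0 and q1 accept the same language
# (words ending in 'a'), so they should be behaviourally equivalent.
A = ['a', 'b']
c = {
    'q0': (False, {'a': 'q2', 'b': 'q1'}),
    'q1': (False, {'a': 'q2', 'b': 'q0'}),
    'q2': (True,  {'a': 'q2', 'b': 'q0'}),
}

def behaviourally_equivalent(c, s, t, alphabet):
    """Naive check for deterministic automata: explore reachable state
    pairs and fail as soon as one is final and the other is not."""
    seen, todo = set(), [(s, t)]
    while todo:
        u, v = todo.pop()
        if (u, v) in seen:
            continue
        seen.add((u, v))
        (final_u, next_u), (final_v, next_v) = c[u], c[v]
        if final_u != final_v:
            return False
        todo.extend((next_u[a], next_v[a]) for a in alphabet)
    return True
```

Here `behaviourally_equivalent(c, 'q0', 'q1', A)` holds, matching the language-equivalence characterisation above.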

(3) Weighted tree automata simultaneously generalize tree automata and weighted (word) automata. Inputs of such automata stem from a finite *signature* *Σ*, i.e. a finite set of input symbols, each with a prescribed natural number, its *arity*. *Weights* are taken from a commutative monoid (*M,* +*,* 0). A (bottom-up) *weighted tree automaton* (WTA) (over *M* with inputs from *Σ*) consists of a finite set *S* of states, an output map *f* : *S* → *M*, and for each *k* ≥ 0 a transition map $\mu_k \colon \Sigma_k \to M^{S^k \times S}$, where $\Sigma_k$ denotes the set of *k*-ary input symbols in *Σ*; the maximum arity of symbols in *Σ* is called the *rank*.

Every signature *Σ* gives rise to its associated *polynomial functor*, also denoted *Σ*, which assigns to a set *X* the set $\coprod_{n \in \mathbb{N}} \Sigma_n \times X^n$, where $\coprod$ denotes disjoint union (coproduct). Further, for a given monoid (*M,* +*,* 0) the *monoid-valued functor* $M^{(-)}$ sends a set *X* to the set $M^{(X)}$ of maps *f* : *X* → *M* that are finitely supported, i.e. *f*(*x*) = 0 for almost all *x* ∈ *X*. Given a map *f* : *X* → *Y*, $M^{(f)} \colon M^{(X)} \to M^{(Y)}$ sends a map *v* : *X* → *M* in $M^{(X)}$ to the map $y \mapsto \sum_{x \in X,\, f(x) = y} v(x)$, corresponding to the standard image-measure construction.
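As a small illustrative sketch (not taken from CoPaR), the action $M^{(f)}$ can be computed on finitely supported maps represented as dicts, here for an additive numeric monoid:

```python
# Illustrative sketch (not from CoPaR): the action M^(f) of the
# monoid-valued functor for an additive numeric monoid.  A finitely
# supported map v : X -> M is a dict with zero entries omitted; pushing
# it forward along f sums v over the fibres of f (image measure).
def monoid_valued_map(f, v):
    w = {}
    for x, m in v.items():
        y = f(x)
        w[y] = w.get(y, 0) + m
    # drop entries that became 0, keeping the result finitely supported
    return {y: m for y, m in w.items() if m != 0}
```

Note that in a non-cancellative setting weights never cancel, whereas over (ℤ, +, 0) entries may vanish in the image, which is why the final filtering step is needed.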

Weighted tree automata are coalgebras for the composite functor $FX = M \times M^{(\Sigma X)}$; indeed, given a coalgebra $c = \langle c_1, c_2\rangle \colon S \to M \times M^{(\Sigma S)}$, its first component $c_1$ is the output map, and the second component $c_2$ is equivalent to the family of transition maps $\mu_k$ described above.

As proven by Wißmann et al. [41, Prop. 6.6], the coalgebraic behavioural equivalence is precisely backward bisimulation of weighted tree automata as introduced by Högberg et al. [20, Def. 16].

(4) The *bag functor* B: Set → Set sends a set *X* to the set of all finite multisets (or *bags*) over *X*. This is the special case of the monoid-valued functor for the monoid (N*,* +*,* 0). Accordingly, B-coalgebras are weighted transition systems with positive integers as weights, or they may be regarded as finitely branching transition systems where multiple transitions between a pair of states are allowed. Behavioural equivalence coincides with weighted (or strong) bisimilarity.

(5) Markov chains. The *finite distribution functor* $\mathcal{D}_\omega$ is a subfunctor of the monoid-valued functor $\mathbb{R}^{(-)}$ for the usual monoid of addition on the real numbers. It maps a set *X* to the set of all finite probability distributions on *X*. That means that $\mathcal{D}_\omega X$ is the set of all finitely supported maps *d*: *X* → [0, 1] such that $\sum_{x \in X} d(x) = 1$. The action of $\mathcal{D}_\omega$ on maps is the same as that of $\mathbb{R}^{(-)}$.

As shown by Rutten and de Vink [35], coalgebras $c \colon S \to (\mathcal{D}_\omega S + 1)^A$ are precisely Larsen and Skou's probabilistic transition systems [28] (aka. labelled Markov chains [14]) with the label alphabet *A*. In fact, for each state *s* ∈ *S* and action label *a* ∈ *A*, that state either cannot perform an *a*-action (when *c*(*s*)(*a*) ∈ 1) or the distribution *c*(*s*)(*a*) determines for every state *t* ∈ *S* the probability with which *s* transitions to *t* with an *a*-action.

Coalgebraic behavioural equivalence is precisely probabilistic bisimilarity in the sense of Larsen and Skou, see Rutten and de Vink [35, Cor. 4.7].

(6) Markov decision processes are systems which feature both non-deterministic and probabilistic branching. They are coalgebras for composite functors such as $\mathcal{P}_\omega(A \times \mathcal{D}_\omega(-))$ or $\mathcal{P}_\omega(\mathcal{D}_\omega(A \times (-)))$ (simple/general Segala systems); Bartels et al. [2] list further functors for various species of probabilistic systems.

**Encodings.** To supply coalgebras as inputs to CoPaR and in order to speak about the size of a coalgebra in terms of states and transitions, we need

**Definition 2.3 [12, Def. 3.1].** An *encoding* of a set functor *F* consists of a set *A* of *labels* and a family of maps $\flat_X \colon FX \to \mathcal{B}(A \times X)$, one for every set *X*, such that the map $\langle F!, \flat_X\rangle \colon FX \to F1 \times \mathcal{B}(A \times X)$ is injective.

The *encoding* of a coalgebra *c* : *S* → *F S* is $\langle F!, \flat_S\rangle \cdot c \colon S \to F1 \times \mathcal{B}(A \times S)$. For *s* ∈ *S* we write $s \xrightarrow{a} t$ whenever (*a, t*) is contained in the bag $\flat_S(c(s))$. The *number of states* and *edges* of a given encoded input coalgebra are *n* = |*S*| and $m = \sum_{s \in S} |\flat_S(c(s))|$, respectively, where $|b| = \sum_{x \in X} b(x)$ for a bag $b \colon X \to \mathbb{N}$.
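Under this representation, the sizes *n* and *m* are immediate to compute; a minimal sketch, assuming the encoded coalgebra is given as a dict from states to bags (dicts from (label, successor) pairs to positive multiplicities):

```python
# Minimal sketch: sizes of an encoded coalgebra per Definition 2.3,
# assuming it is a dict from states to bags, where a bag is a dict
# from (label, successor) pairs to positive multiplicities.
def size_of_encoding(encoded):
    n = len(encoded)                                        # n = |S|
    m = sum(sum(bag.values()) for bag in encoded.values())  # sum of |bag|
    return n, m
```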

An encoding of a set functor *F* specifies how *F*-coalgebras are represented as directed graphs, and the required injectivity ensures that different coalgebras have different encodings.

**Example 2.4.** We recall a few key examples of encodings used by CoPaR [42]; for the required injectivity, see [12, Prop. 3.3].

(1) For the finite powerset functor $\mathcal{P}_\omega$ one takes a singleton label set *A* = 1, and $\flat_X \colon \mathcal{P}_\omega X \to \mathcal{B}(1 \times X)$ is the obvious inclusion: $\flat_X(U)(*, x) = 1$ iff *x* ∈ *U* ⊆ *X*.

(2) For the monoid-valued functor $M^{(-)}$ we take labels *A* = *M*, and the map $\flat_X \colon M^{(X)} \to \mathcal{B}(M \times X)$ is given by $\flat_X(t)(m, x) = 1$ if *t*(*x*) = *m* ≠ 0 and 0 otherwise.

(3) As a special case, the bag functor $\mathcal{B}$ has labels *A* = ℕ, and the map $\flat_X \colon \mathcal{B}X \to \mathcal{B}(\mathbb{N} \times X)$ is given by $\flat_X(t)(n, x) = 1$ if *t*(*x*) = *n* and 0 otherwise.
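Purely as an illustration of these definitions (these are not CoPaR's data structures), the first two encodings could be written as:

```python
# Purely illustrative (not CoPaR's data structures): the encodings of
# Example 2.4 with bags as dicts from elements to multiplicities.
def flat_powerset(U):
    """Encoding of P_omega: label set A = 1, written here as '*'."""
    return {('*', x): 1 for x in U}

def flat_monoid_valued(t):
    """Encoding of M^(-): the non-zero weights themselves are labels."""
    return {(m, x): 1 for x, m in t.items() if m != 0}
```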

**Remark 2.5.** (1) Readers familiar with category theory may wonder about the *naturality* of the encodings $\flat_X$. It turns out [12] that in almost all instances, our encodings are not natural transformations, except for polynomial functors. As shown in *op. cit.*, all our encodings satisfy a property called *uniformity*, which implies that they are subnatural transformations [12, Prop. 3.15].

(2) Having an encoding of a set functor *F* does not imply a reduction of the problem of minimizing *F*-coalgebras to that of coalgebras for B(*A* × −). In fact, the behavioural equivalences of *F*-coalgebras and of coalgebras for B(*A* × −) may be very different unless $\flat_X$ is natural, which is not the case for most encodings.

Functors in CoPaR can be combined by product, coproduct or composition, leading to modularity. But in order to automatically handle combined functors, our tool crucially depends on the ability to form products and coproducts of encodings [41, 42]. We refrain from going into technical details, but note for further use that given a pair of functors $F_1, F_2$ with encodings $A_i, \flat_{X,i}$ one obtains encodings for the functors $F_1 \times F_2$ (cartesian product) and $F_1 + F_2$ (disjoint union) with the label set $A = A_1 + A_2$.

**Input syntax and processing.** We briefly recall the input format of CoPaR and how inputs are processed; for more details see [41, Sec. 3.1]. CoPaR accepts input files representing a finite *F*-coalgebra. The first line of an input file specifies the functor *F* which is written as a term according to the following grammar:

$$\begin{aligned} T &::= \mathbf{X} \mid \mathcal{P}_{\omega} T \mid \mathcal{B}T \mid \mathcal{D}_{\omega} T \mid M^{(T)} \mid \Sigma \\ \Sigma &::= C \mid T + T \mid T \times T \mid T^{A} \qquad C ::= \mathbb{R} \mid A \qquad A ::= \{s_{1}, \ldots, s_{n}\} \mid n, \end{aligned} \tag{1}$$

where *n* ∈ N denotes the set {0*, . . . , n*−1}, the *s<sup>k</sup>* are strings subject to the usual conventions for variable names (a letter or an underscore character followed by alphanumeric characters or underscore), exponents *F <sup>A</sup>* are written F^A, and *M* is one of the monoids (Z*,* +*,* 0), (R*,* +*,* 0), (C*,* +*,* 0), (P*ω*(64)*,* ∪*,* ∅) (the monoid of 64-bit words with bitwise or), and (N*,* max*,* 0) (the additive monoid of the tropical semiring). Note that *C* effectively ranges over at most countable sets, and *A* over finite sets. A term *T* determines a functor *F* : Set → Set in the evident way, with X interpreted as the argument.

The remaining lines of an input file specify a finite coalgebra *c* : *S* → *F S*. Each line has the form *s*:␣*t* for a state *s* ∈ *S*, and *t* represents the element *c*(*s*) ∈ *F S*. The syntax for *t* depends on the specified functor *F* and follows the structure of

Fig. 1: Examples of input files with encoded coalgebras [41]

the term *T* defining *F*; the details are explained in [41, Sec. 3.1.2]. Fig. 1 from *op. cit.* shows two coalgebras and the corresponding input files.

After reading the functor term *T*, CoPaR builds a parser for the functor-specific input format and then parses the input coalgebra given in that format into an intermediate format which internally represents the encoding of the input coalgebra (Definition 2.3). For composite functors the parsed coalgebra then undergoes a substantial amount of preprocessing, which also affects how transitions are counted; see [41, Sec. 3.5] for more details.

### **3 Coalgebraic Partition Refinement**

As mentioned in the introduction, the sequential partition refinement algorithm previously implemented in CoPaR is based on ideas used in the Paige-Tarjan algorithm [31] for transition systems. However, as has been mentioned by Blom and Orzan [8], the Paige-Tarjan algorithm carefully selects the block of states to split in each iteration, and the data structures used for this selection take a lot of memory and require modification to allow a distributed implementation. Hence, Blom and Orzan have built their distributed algorithm from a rather simple sequential partition refinement algorithm based on what Kanellakis and Smolka refer to as the *naive method* [23]. We now recall this algorithm and subsequently show how it can be adapted to the coalgebraic level of generality.

**Signature Refinement.** Given a finite labelled transition system with the state set *S*, a partition on *S* may be presented by a function *π* : *S* → N, i.e. two states *s, t* ∈ *S* lie in the same block of the partition iff *π*(*s*) = *π*(*t*). The *signature* of a state *s* ∈ *S* is the set of outgoing transitions to blocks of *π*:

$$\mathrm{sig}_{\pi}(s) = \{(a, \pi(t)) \mid s \xrightarrow{a} t\} \subseteq \mathcal{P}_{\omega}(A \times \mathbb{N}).\tag{2}$$

A *signature refinement step* then refines *π* by putting *s, t* ∈ *S* into different blocks iff $\mathrm{sig}_\pi(s) \neq \mathrm{sig}_\pi(t)$. Concretely, we put $\pi_{\mathrm{new}}(s) = \mathrm{hash}(\mathrm{sig}_\pi(s))$ using a perfect, deterministic hash function hash. The signature refinement algorithm (Fig. 2) starts with a trivial initial partition on *S* and repeats the refinement step until the partition stabilizes, i.e. until two subsequent partitions have the same size.

**Coalgebraic Signature Refinement.** Regarding a labelled transition system as a coalgebra *c* : *S* → P*ω*(*A* × *S*) (Example 2.2(1)), signatures are obtained by postcomposing the transition structure with the partition under the functor:

$$\mathrm{sig}_{\pi} = \big(S \xrightarrow{c} \mathcal{P}_{\omega}(A \times S) \xrightarrow{\mathcal{P}_{\omega}(A \times \pi)} \mathcal{P}_{\omega}(A \times \mathbb{N})\big).\tag{3}$$

**Variables :** old and new partitions represented by *π, π*new : *S* → N with sizes *l, l*new, resp.; set *H* for counting block numbers;

```
 1 foreach s ∈ S do
 2     πnew(s) ← 0;
 3 end
 4 lnew ← 1;
 5 while l ≠ lnew do
 6     π ← πnew, H ← ∅;
 7     foreach s ∈ S do
 8         πnew(s) ← hash(sigπ(s));
 9         H ← H ∪ {πnew(s)};
10     end
11     l ← lnew;
12     lnew ← |H|;
13 end
```
Fig. 2: Signature refinement for labelled transition systems
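As an illustration, the loop of Fig. 2 can be rendered as a short runnable Python sketch for labelled transition systems (this is not CoPaR code); numbering distinct signatures in order of first appearance plays the role of the perfect, deterministic hash function:

```python
# Runnable sketch of Fig. 2 for labelled transition systems (not CoPaR
# code).  Numbering distinct signatures in order of first appearance
# plays the role of the perfect, deterministic hash function.
def signature_refinement(states, edges):
    """edges: list of (s, a, t) triples; returns the stable partition."""
    pi = {s: 0 for s in states}   # trivial initial partition
    size = 1
    while True:
        numbering, pi_new = {}, {}
        for s in states:
            sig = frozenset((a, pi[t]) for (u, a, t) in edges if u == s)
            pi_new[s] = numbering.setdefault(sig, len(numbering))
        if len(numbering) == size:  # two subsequent partitions same size
            return pi_new
        pi, size = pi_new, len(numbering)
```

For example, two states with identical outgoing transitions into the same blocks end up in the same block of the result, while a deadlock state is separated from them.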

The generalisation to coalgebras for an arbitrary functor *F* is immediate: the *signature of a state* of an *F*-coalgebra *c* : *S* → *F S* w.r.t. a partition *π* is given by the function $\mathrm{sig}_\pi = F\pi \cdot c$. In the refinement step of the above algorithm, two states are identified by the next partition iff they have the same signature w.r.t. the current one:

$$
\pi_{\text{new}}(s) = \pi_{\text{new}}(t) \iff \mathrm{sig}_{\pi}(s) = \mathrm{sig}_{\pi}(t) \iff (F\pi)(c(s)) = (F\pi)(c(t)). \tag{4}
$$

Hence, the algorithm in fact simply applies $F(-) \cdot c$ to the initial partition corresponding to the trivial quotient $! \colon S \to 1$ until stability is reached. Note that this is precisely the *Final Chain Algorithm* by König and Küpper [25, Alg. 3.2] for computing behavioural equivalence of a given *F*-coalgebra. Its correctness thus proves correctness of *coalgebraic signature refinement*, i.e. the algorithm in Fig. 2 with $\mathrm{sig}_\pi = F\pi \cdot c$. Since we represent functors and their coalgebras by encodings, we use an interface to *F* to compute signatures based on encodings.
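To illustrate this generality, the same refinement loop can be parameterized by the functor action: only the function computing $(F\pi)(c(s))$ as a hashable value changes between system types. A hypothetical sketch (all names are ours, not CoPaR's):

```python
# Hypothetical sketch of the coalgebraic generalisation: the loop is the
# same for every functor; only F_map, computing (F pi)(c(s)) as a
# hashable value, changes with the system type.
def final_chain_refinement(c, F_map):
    """c: dict mapping each state to its F-structure c(s)."""
    pi = {s: 0 for s in c}
    size = 1
    while True:
        numbering, pi_new = {}, {}
        for s, fs in c.items():
            sig = F_map(pi, fs)           # sig_pi(s) = (F pi)(c(s))
            pi_new[s] = numbering.setdefault(sig, len(numbering))
        if len(numbering) == size:
            return pi_new
        pi, size = pi_new, len(numbering)

# Instance for F X = P_omega(A x X), i.e. labelled transition systems,
# with c(s) given as a set of (label, successor) pairs:
def lts_F_map(pi, fs):
    return frozenset((a, pi[t]) for (a, t) in fs)
```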

**Definition 3.1.** Given a functor *F* with encoding *A*, $\flat_X$, a *signature interface* consists of a function $\mathrm{sig} \colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F\mathbb{N}$ such that for every finite set *S* and every partition *π* : *S* → ℕ we have

$$F\pi = \left(FS \xrightarrow{\langle F!,\, \flat_S\rangle} F1 \times \mathcal{B}(A \times S) \xrightarrow{F1 \times \mathcal{B}(A \times \pi)} F1 \times \mathcal{B}(A \times \mathbb{N}) \xrightarrow{\mathrm{sig}} F\mathbb{N}\right). \tag{5}$$

Given a coalgebra *c* : *S* → *F S*, a state *s* ∈ *S* and a partition *π* : *S* → ℕ, the two arguments of sig should be understood as follows. The first argument is the value *F*!(*c*(*s*)) ∈ *F*1, which intuitively provides an observable output of the state *s*. The second argument is the bag $\mathcal{B}(A \times \pi)(\flat_S(c(s)))$ formed by those pairs (*a, n*) of labels *a* and numbers *n* of blocks of the partition *π* to which *s* has an edge; that is, that bag contains one pair (*a, n*) for each edge $s \xrightarrow{a} s'$ with $\pi(s') = n$. Thus, when supplied with these inputs, sig correctly computes the signature of *s*; indeed, to see this, precompose equation (5) with the coalgebra structure *c*.

**Example 3.2.** (1) The constant functor *C* has the label set *A* = ∅, so we have $\mathcal{B}(\emptyset \times \mathbb{N}) \cong 1$, and we define the function $\mathrm{sig} \colon C \times \mathcal{B}(\emptyset \times \mathbb{N}) \to C$ by $\mathrm{sig}(c, *) = c$.

(2) The powerset functor $\mathcal{P}_\omega$ has the label set *A* = 1, and we define the function $\mathrm{sig} \colon \mathcal{P}_\omega 1 \times \mathcal{B}(1 \times \mathbb{N}) \to \mathcal{P}_\omega\mathbb{N}$ by $\mathrm{sig}(z, b) = \{n : b(*, n) \neq 0\}$.

(3) The monoid-valued functor $\mathbb{R}^{(-)}$ has the label set *A* = ℝ, and we define the function $\mathrm{sig} \colon \mathbb{R} \times \mathcal{B}(\mathbb{R} \times \mathbb{N}) \to \mathbb{R}^{(\mathbb{N})}$ by $\mathrm{sig}(z, b)(n) = \sum\{r \mid b(r, n) \neq 0\}$.
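For illustration only, the second and third of these interfaces might be sketched in Python, with bags as dicts from (label, block) pairs to multiplicities; a frozenset stands in for an element of $\mathcal{P}_\omega\mathbb{N}$, and a frozenset of (block, weight) pairs for a finitely supported map ℕ → ℝ (as in the encoding, all multiplicities are at most 1):

```python
# Illustration only: the signature interfaces of Example 3.2(2) and (3),
# with bags as dicts from (label, block) pairs to multiplicities.
def sig_powerset(z, b):
    """P_omega: A = 1; collect the blocks that occur in the bag."""
    return frozenset(n for (_star, n), k in b.items() if k != 0)

def sig_real_valued(z, b):
    """R^(-): A = R; sum the weights pointing into each block."""
    out = {}
    for (r, n), k in b.items():
        if k != 0:
            out[n] = out.get(n, 0) + r
    return frozenset(out.items())
```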

Next we show how signature interfaces can be combined by products (×) and coproducts (+). This is the key to the modularity of the implementation (be it distributed or sequential) of the coalgebraic signature refinement in CoPaR.

**Construction 3.3.** Given a pair of functors $F_1, F_2$ with encodings $A_i, \flat_{X,i}$ and signature interfaces $\mathrm{sig}_i$, we put $A = A_1 + A_2$ and define the following functions:

(1) for the product functor $F = F_1 \times F_2$ we take $\mathrm{sig} \colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F_1\mathbb{N} \times F_2\mathbb{N}$,

$$\mathrm{sig}(t, b) = \big(\mathrm{sig}_1(\mathrm{pr}_1(t), \mathrm{filter}_1(b)),\, \mathrm{sig}_2(\mathrm{pr}_2(t), \mathrm{filter}_2(b))\big).$$

Here, $\mathrm{pr}_i \colon F1 \to F_i1$ is the projection map and $\mathrm{filter}_i \colon \mathcal{B}(A \times \mathbb{N}) \to \mathcal{B}(A_i \times \mathbb{N})$ is given by $\mathrm{filter}_i(b)(a, n) = b(\mathrm{in}_i\, a, n)$, where $\mathrm{in}_i \colon A_i \to A$ is the injection map.

(2) for the coproduct functor $F = F_1 + F_2$ we take

$$\mathrm{sig} \colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F_1\mathbb{N} + F_2\mathbb{N}, \qquad \mathrm{sig}(\mathrm{in}_i\, t, b) = \mathrm{in}_i(\mathrm{sig}_i(t, \mathrm{filter}_i(b))).$$

**Proposition 3.4.** *The functions* sig *defined in Construction 3.3 yield signature interfaces for the functors F*<sup>1</sup> × *F*<sup>2</sup> *and F*<sup>1</sup> + *F*2*, respectively.*
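A hedged Python sketch of the product case of Construction 3.3, assuming labels from the coproduct $A_1 + A_2$ are represented as tagged pairs (i, a) and each $\mathrm{sig}_i$ is a Python function; the names are ours, not CoPaR's:

```python
# Hedged sketch of the product case of Construction 3.3 (names ours, not
# CoPaR's), with coproduct labels represented as tagged pairs (i, a).
def filter_component(i, b):
    """filter_i: keep bag entries whose label carries tag i, untagged."""
    return {(a, n): k for ((j, a), n), k in b.items() if j == i}

def sig_product(sig1, sig2):
    """Combine two signature interfaces into one for F1 x F2."""
    def sig(t, b):
        t1, t2 = t  # pr_1 and pr_2 on F1(1) x F2(1)
        return (sig1(t1, filter_component(1, b)),
                sig2(t2, filter_component(2, b)))
    return sig
```

The combined function simply splits the bag by label tag and delegates each half to the corresponding component interface, mirroring the modularity principle stated above.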

As a consequence of this result, it suffices to implement signature interfaces only for *basic* functors according to the grammar in (1), i.e. the trivial identity and constant functors as well as the functors P*ω*, B, D*<sup>ω</sup>* and the supported monoid-valued functors *M*(−) . Signature interfaces of products, coproducts and exponents, being a special form of product, are derived using Construction 3.3.

Functor composition can be reduced to these constructions by a technique called *desorting* [42, Sec. 8.2], which transforms a coalgebra for a composite functor into a coalgebra for a coproduct of basic functors, whose signature interfaces can then be combined by + (see also [41, Sec. 3.5]). As for the previous Paige-Tarjan style algorithm, this makes the coalgebraic signature refinement algorithm modular in the functor: signature interfaces for composed functors are automatically derived in CoPaR. Moreover, a new basic functor *F* may be added by implementing a signature interface for *F*, effectively extending the grammar of supported functors in (1) by a clause *F T*.

#### **4 The Distributed Algorithm**

Our distributed algorithm for coalgebraic signature refinement is a generalization of Blom and Orzan's original algorithm [8] to coalgebras. We highlight differences to *op. cit.* at the end of this section.

We assume a distributed high-bandwidth cluster of *W* workers *w*<sub>1</sub>, …, *w<sub>W</sub>* that is failure-free, i.e. nodes do not crash, messages do not get lost, and the order of messages between any two nodes is preserved. Communication is based on non-blocking *send* operations and blocking *receive* operations. Messages are triples of the form (*from*, *to*, *data*), where the *data* field may be structured and will often contain a tag to simplify interpretation.

**Description.** The distributed algorithm is based on the sequential algorithm presented in Fig. 2, using a distributed hashtable to keep track of the partition. As for the sequential algorithm, the input consists of an *F*-coalgebra (*S*, *c*) with |*S*| = *n* states. We split the state space evenly among the workers as a preprocessing step. We write *S<sub>i</sub>* with |*S<sub>i</sub>*| = *n/W* for the set of states of worker *w<sub>i</sub>*. The input for worker *w<sub>i</sub>* is the encoding of that part of the transition structure of the input coalgebra which is needed to compute the signatures of the states in *S<sub>i</sub>*. This information is presented to *w<sub>i</sub>* as the list of all outgoing edges of states of *S<sub>i</sub>* in the encoding of the coalgebra (*S*, *c*), i.e. the list of all *s* −*a*→ *t* with *s* ∈ *S<sub>i</sub>* (cf. Definition 2.3). We refer to the block number *π*(*s*) of a state *s* ∈ *S* as its ID.

After processing the input, the algorithm runs in two phases. In the *Initialization Phase* (Fig. 3) the workers exchange update demands about the IDs stored in the distributed hashtable. If *w<sub>i</sub>* has an edge *s* −*a*→ *s′* into some state *s′* of *w<sub>j</sub>*, then during refinement *w<sub>i</sub>* needs to be kept up to date about the ID of *s′* and thus instructs *w<sub>j</sub>* to do so. Worker *w<sub>j</sub>* remembers this information by storing *w<sub>i</sub>* in the set In<sub>*s′*</sub> = {*w<sub>i</sub>* | ∃*s* ∈ *S<sub>i</sub>*, *a* ∈ *A*. *s* −*a*→ *s′*} of workers with an edge into *s′* (lines 14–16). Hence, for each edge *s* −*a*→ *s′* with *s* ∈ *S<sub>i</sub>* and *s′* ∈ *S<sub>j</sub>*, worker *w<sub>i</sub>* sends a message to *w<sub>j</sub>*, informing *w<sub>j</sub>* to add *w<sub>i</sub>* to In<sub>*s′*</sub> (lines 5–8).

**Variables:** Set *V* of visited states; process count *d*; for each *s* ∈ *S<sub>i</sub>* a list In<sub>*s*</sub> of workers with an edge into *s*

```
 1 V ← ∅, d ← 0;
 2 foreach s ∈ Si do
 3     Ins ← [];
 4 end
 5 foreach edge s → s′ of wi with s′ ∉ V do
 6     V ← V ∪ {s′};
 7     send(wi, wj, s′);
 8 end
 9 foreach 1 ≤ j ≤ W do
10     send(wi, wj, DONE);
11 end
12 waitFor(d = W);
13 return([Ins | s ∈ Si]);

14 on receive (wk, wi, s) do
15     Ins ← (wk :: Ins);
16 end
17 on receive (_, _, DONE) do
18     d ← d + 1;
19 end
```
Fig. 3: Initialization Phase of worker *w<sup>i</sup>*
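The effect of the Initialization Phase can be simulated sequentially: every worker scans its outgoing edges once, deduplicates targets via its visited set *V*, and registers itself at the owner of each target state. A hedged Python sketch (names such as `initialization_phase` and `worker_states` are illustrative, not from the paper):

```python
def initialization_phase(worker_states, edges):
    """Sequential simulation of Fig. 3: worker_states[i] is the set S_i,
    edges is a list of (s, a, t) triples. Returns In, mapping each state
    t to the set of workers owning some edge into t."""
    owner = {s: i for i, states in worker_states.items() for s in states}
    In = {s: set() for s in owner}
    for i, states in worker_states.items():
        visited = set()            # the worker's set V of seen targets
        for (s, a, t) in edges:
            if s in states and t not in visited:
                visited.add(t)
                In[t].add(i)       # message (w_i, w_owner(t), t)
    return In

workers = {1: {"s1", "s2"}, 2: {"s3"}}
es = [("s1", "a", "s3"), ("s2", "b", "s3"), ("s3", "a", "s1")]
print(initialization_phase(workers, es))
```

Note that deduplicating per worker is harmless here precisely because In<sub>*s*</sub> is a set of workers, not a multiset of edges.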

The main phase is the *Refinement Phase* (Fig. 4), mimicking the refinement loop of the undistributed algorithm. In each iteration all workers compute their part of the new partition, i.e. the IDs *h<sub>s</sub>* = hash(sig<sub>*π*</sub>(*s*)) for each of their states *s* ∈ *S<sub>i</sub>* (line 5). In addition, every worker *w<sub>i</sub>* is responsible for sending the computed ID of *s* ∈ *S<sub>i</sub>* to the workers in In<sub>*s*</sub> that need it for the computation of their own signatures in the next iteration (lines 6–9). The IDs are also sent to a designated worker counterOf(*h<sub>s</sub>*) (lines 10–12). This ensures that IDs are counted precisely once at the end of the round, when the partition size is computed after all messages have been received (lines 14–17). The actual counting (line 19) is a

**Variables:** Old and new partitions *π*, *π*<sub>new</sub> with sizes *l*, *l*<sub>new</sub>; finished workers *d*; ID-counting set *H*;

```
 1 πnew ← 0!, l ← −1, lnew ← 0, H ← ∅;
 2 while l ≠ lnew do
 3     l ← lnew, π ← πnew;
 4     foreach s ∈ Si do
 5         πnew(s) ← hash(sigπ(s));
 6         foreach wj ∈ Ins do
 7             send(wi, wj,
 8                  ⟨UPD, s, πnew(s)⟩);
 9         end
10         send(wi,
11              counterOf(πnew(s)),
12              ⟨COUNT, πnew(s)⟩);
13     end
14     foreach 1 ≤ j ≤ W do
15         send(wi, wj, DONE);
16     end
17     waitFor(d = W);
18     l ← lnew;
19     lnew ← distribSum(sizeOf(H));
20     synchronize;
21 end

22 on receive (wk, wi, ⟨UPD, s, hs⟩) do
23     πnew(s) ← hs;
24 end
25 on receive (wk, wi, ⟨COUNT, hs⟩) do
26     H ← H ∪ {hs};
27 end
28 on receive (_, wi, DONE) do
29     d ← d + 1;
30 end
```
Fig. 4: Refinement Phase of worker *w<sup>i</sup>*

primitive operation in the MPI library; for an explicit O(log *W*) algorithm using messages see e.g. Blom and Orzan [8, Fig. 6]. Finally, the workers synchronize before starting the next iteration (line 20). The refinement phase stops as soon as two consecutive partitions have the same size (line 2).
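The refinement loop itself can be emulated sequentially. A hedged Python sketch (illustrative names; an injective block numbering stands in for the hash function, so hash collisions cannot occur):

```python
def refine(states, edges, sig):
    """Sequential emulation of the loop of Fig. 4: recompute pi_new(s)
    from the signature of s w.r.t. pi until two consecutive partitions
    have the same size."""
    pi = {s: 0 for s in states}           # initial partition: one block
    size = 1
    while True:
        # signature of s: its bag of (label, block of target) pairs
        sigs = {s: sig([(a, pi[t]) for (u, a, t) in edges if u == s])
                for s in states}
        ids = {}                          # signature -> fresh block ID
        pi = {s: ids.setdefault(sigs[s], len(ids)) for s in states}
        if len(ids) == size:              # partition size stabilised
            return pi
        size = len(ids)

# Signature for the finite powerset functor: the set of successor blocks.
pow_sig = lambda bag: frozenset(n for _, n in bag)

states = ["s", "t", "u", "v"]
edges = [("s", "a", "t"), ("t", "a", "s"), ("u", "a", "u")]
pi = refine(states, edges, pow_sig)
print(pi["s"] == pi["t"] == pi["u"], pi["v"] != pi["s"])  # True True
```

Here s, t and u (each with exactly one successor inside their own block) end up identified, while the deadlocked state v is separated, as bisimilarity demands.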

**Correctness.** The Initialization Phase (Fig. 3) terminates since every worker reaches line 10, sends DONE to all workers and thus also receives DONE a total of *W* times (lines 17–19), allowing it to progress past line 12. An analogous argument proves termination of every iteration of the Refinement Phase (Fig. 4). Since the sequential algorithm is correct, the loop of the refinement phase terminates provided that all IDs are computed and counted correctly, for then the distributed and the sequential algorithm compute precisely the same partitions.

To show that the signatures are computed correctly, we note that if all DONE messages have been received in a round, then, by order-preservation of messages, all messages sent previously in this round have also been received. This ensures that no workers are missing from the lists In*<sup>s</sup>* computed in the Initialization Phase and that during the Refinement Phase new IDs are sent to all concerned workers (Fig. 4, lines 6–8). This establishes correctness of the signature computation, and the signatures coincide on all workers since we assume that the hash function is deterministic. Finally, the use of the counterOf function (line 11) ensures that each ID is included in the counting set of exactly one worker. Thus, the distributed sum of the sizes of all counting sets is equal to the size of the partition.

**Complexity.** Let us assume that not only states but also outgoing transitions are distributed evenly among the workers, i.e. every worker has about *m/W* outgoing transitions. In the Initialization Phase, the loop sending messages runs in O(*m/W*) and receiving takes O(*W* · *n/W*) = O(*n*), since for worker *w<sub>i</sub>* every other worker *w<sub>j</sub>* might have an edge into every state in *S<sub>i</sub>*. Both are executed in parallel, so in total the phase runs in O(max(*m/W*, *n*)) = O(*m/W* + *n*). In the Refinement Phase, we assume that the run time of computing signatures and their hashes is linear in the number of edges. Then the loop for computing and hashing (O(*m/W*)) and counting (O(*n/W*)) signatures runs in total in O((*m* + *n*)/*W*), since it is performed by all workers independently. Each worker receives at most *m/W* ID-updates each round and the partition size is computable in O(*W*), giving a complexity of O((*m* + *n*)/*W*) for one refinement step. As many as *n* iterations might be needed, for a total complexity of O(*m/W* + *n*) + *n* · O((*n* + *m*)/*W*) = O((*mn* + *n*<sup>2</sup>)/*W* + *n*).

**Remark 4.1.** The above analysis assumes that signature interfaces are implemented with run time linear in the size of their input bag. In theory this could be realized for all basic functors (and hence also for their combinations) currently implemented in CoPaR; it would involve using bucket sort for grouping bag elements by the target block (the second component), e.g. for monoid-valued functors. However, since the table used in bucket sort would be very large (the size of the last partition) and memory efficiency is our main motivation, we opted for an implementation using a standard O(*n* log *n*) sorting algorithm instead.
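The sort-then-group variant mentioned in the remark can be sketched in a few lines of Python (illustrative names; the monoid (N, +, 0) is used as an example of combining weights per block):

```python
from itertools import groupby
from operator import itemgetter

def group_by_block(bag):
    """Group a bag of (weight, block-ID) pairs by target block and
    combine the weights per block with +: the O(n log n) sort-then-group
    alternative to bucket sort."""
    key = itemgetter(1)
    return [(block, sum(w for w, _ in grp))
            for block, grp in groupby(sorted(bag, key=key), key=key)]

print(group_by_block([(2, 1), (5, 0), (3, 1)]))  # [(0, 5), (1, 5)]
```

A bucket-sort variant would replace the sort by an array indexed by block IDs, which is exactly the large table the remark chooses to avoid.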

**Implementation details.** CoPaR is implemented in Haskell. We were able to reuse, with only minor adjustments, major parts of the CoPaR code base dedicated to the representation and processing of coalgebras. This includes the implemented functors and their encodings, together with the corresponding parser and preprocessing algorithms (see Section 2). As explained in Section 3, the sequential Paige-Tarjan-style algorithm of CoPaR was not used; instead, we implemented an additional "algorithmic frontend" to our "coalgebraic backend". To compute signatures during the Refinement Phase, each functor implements the signature interface (Definition 3.1), which is written in Haskell as follows:

```
class Hashable (Signature f) => SignatureInterface f where
  type Signature f :: Type
  sig :: F1 f -> [(Label f, Int)] -> Signature f
```
The second line requires a type Signature f, which serves as an implementation-specific datatype representation of *F*N. In the type of sig, the types f, Label f and F1 f correspond to the name of *F*, its label type and the set *F*1, respectively.

**Example 4.2.** The Haskell-implementation of the signature interface for the finite power set functor P*<sup>ω</sup>* from Example 3.2(2) is as follows:

```
data P x = P x                  -- already defined in CoPaR
type instance Label P = ()      -- also already defined
instance SignatureInterface P where
  type Signature P = Set Int
  sig :: F1 P -> [((), Int)] -> Set Int
  sig _ = setFromList . map snd
```
Signature interfaces for the other basic functors according to the grammar in (1) are implemented similarly. For combined functors CoPaR automatically derives their signature interface based on Construction 3.3.

In the algorithm itself, each worker runs three threads in parallel: the first thread computes signatures, the second sends them, and the third receives them. This keeps calls to the MPI interface separated from (pure) signature computation, simplifying the logic and allowing a worker to scatter the ID of one state while simultaneously computing the signature of the next one, so that neither signature computation nor network traffic becomes a bottleneck. For inter-thread communication and synchronization we rely on Haskell's *software transactional memory* [19] to ease concurrent programming, e.g. to avoid race conditions.

**Comparison to Blom and Orzan's algorithm.** We now discuss a few differences of our algorithm to Blom and Orzan's original one [8].

In Blom and Orzan's algorithm for LTSs the sets In<sub>*s*</sub> for *s* ∈ *S<sub>i</sub>* are in fact *lists* and contain worker *w<sub>k</sub>* a total of *r* times if there exist *r* edges from states in *S<sub>k</sub>* to *s*. This induces redundant ID-update messages, since *w<sub>i</sub>* sends *r* (instead of one) messages with the ID of *s* to *w<sub>k</sub>*. If the LTS has an average fanout of *f*, then each worker has *t* = *n/W* · *f* outgoing transitions; this is the number of ID updates received every round. Since there are only *n* states, at most a fraction *n/t* = *W/f* of those messages is necessary. In our scenario we have *W* ≪ *f* for large coalgebras, hence the overhead becomes massive; e.g. for *W* = 10, *f* = 100 already 90% of all ID messages are redundant. We use sets instead of lists for In<sub>*s*</sub> to avoid this redundancy.
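The redundancy figure can be recomputed directly: of the *t* = (*n/W*) · *f* update messages a worker receives per round, at most *n* are distinct, so at least a fraction 1 − *W/f* is redundant, independently of *n*. A quick check of the numbers from the text (function name illustrative):

```python
def redundant_fraction(W, f):
    """Lower bound on the fraction of redundant ID messages per round
    when In_s is a list: t = (n/W)*f messages received, at most n of
    them needed, so at least 1 - n/t = 1 - W/f are redundant."""
    return 1 - W / f

# The numbers from the text: W = 10 workers, average fanout f = 100.
print(redundant_fraction(10, 100))  # 0.9, i.e. 90% redundant
```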

Signature computation and communication do not proceed simultaneously in Blom and Orzan's original algorithm. However, in their optimized version [9] and in Blom et al.'s algorithm for state labelled continuous-time Markov chains [4] they do.

Another difference in our implementation is that we hash the signatures directly on the workers owning the respective states, while Blom and Orzan first send the signatures to a dedicated hashing worker which is then (uniquely) responsible for hashing, i.e. computing a new ID. Their method allows new IDs to be computed in constant time. However, for the more complex functors supported by CoPaR, sending signatures could result in very large messages, so we opted to minimize network traffic at the cost of slower signature computation.

### **5 Evaluation**

To illustrate the practical utility and scalability of the algorithm and its implementation in CoPaR, we report on a number of benchmarks performed on a selection of randomly generated and real-world data. In previous evaluations of sequential CoPaR [41], we were limited by the 16GB RAM of a standard workstation. Here we demonstrate that our distributed implementation fulfills its main objective of handling larger systems without lifting the memory restriction per process. All benchmarks were run on a high-performance computing cluster consisting of nodes with two Xeon 2660v2 "Ivy Bridge" chips (10 cores per chip + SMT) with 2.2GHz clock rate and 64GB RAM. The nodes are connected by a fat-tree InfiniBand interconnect fabric with 40 GBit/s bandwidth. Most execution runs were performed using 32 workers on 8 nodes, resulting in 4 worker processes per node. No process used more than 16GB RAM. Execution times of the sequential algorithm were taken using one node of the cluster. No times are given for executions that ran out of 16GB memory previously [41]; those were not run on the cluster.

**Weighted Tree Automata.** In previous work [41], we have determined the size of the largest weighted tree automata for different parameters that the sequential version of CoPaR could handle in 16GB of RAM. Here, we demonstrate that the distributed version can indeed overcome these memory constraints and process much larger inputs.

Recall from Example 2.2 that weighted tree automata are coalgebras for the functor *F X* = *M* × *M*<sup>(*ΣX*)</sup>. For these benchmarks, we use *ΣX* = 4 × *X<sup>r</sup>* with rank *r* ∈ {1, …, 5} and the monoids (2, ∨, 0) (available as the finite powerset functor in CoPaR), (N, max, 0) and (P<sub>*ω*</sub>(64), ∪, ∅). To generate a random automaton with *n* states, we uniformly chose *k* = 50 · *n* transitions from the set of all possible transitions (using an efficient sampling algorithm by Vitter [39]), resulting in a coalgebra encoding with *n′* = 51 · *n* states and *m* = (*r* + 1) · *k* edges. We took care to restrict the state and transition weights to at most 50 different monoid elements in each example, to avoid the situation where all states are already distinguished in the first iteration of the algorithm.

Table 1 lists results for both the sequential and the distributed implementation when run on the same input. These are the largest WTAs for their respective rank and monoid that sequential CoPaR could handle using at most 16GB of RAM [41]. In contrast, the distributed implementation uses less than 1GB per worker for these examples and is thus able to handle much larger inputs. Incidentally, the distributed implementation is also faster, despite the overhead incurred by network communication. This can partly be attributed to the input-parsing stage, which does not need inter-worker synchronization and is thus perfectly parallelizable.

To test the scaling properties of the distributed algorithm, we ran CoPaR with the same input WTA but a varying number of worker processes. For this we chose the WTA for the monoid (2, ∨, 0) with *ΣX* = 4 × *X*<sup>5</sup>, having 86852 states with 4342600 transitions and a file size of 186MB. The figure on the right above depicts the maximum memory usage per worker and the overall running time. The results show that both data points scale nicely with up to 32 workers, but while the running time even increases when using up to 128 workers, the memory usage per worker (the main motivation for this work) continues to decrease significantly.

Table 1: Maximally manageable WTAs for sequential CoPaR; "Mem." and "Time" are the memory and time required for the distributed algorithm and are the maximum over all workers. "Seq. Time" is the time needed by sequential CoPaR.

**PRISM Models.** Finally, we show how our distributed partition refinement implementation performs on models from the benchmark suite [27] of the PRISM model checker [26]. These model (aspects of) real-world protocols and are thus a good fit to evaluate how CoPaR performs on inputs that arise in practice. Specifically, we use the *fms* and *wlan\_time\_bounded* families of systems. These are continuous-time Markov chains, regarded as coalgebras for *F X* = R<sup>(*X*)</sup>, and Markov decision processes, regarded as coalgebras for *F X* = N × P<sub>*ω*</sub>(N × D<sub>*ω*</sub>*X*), respectively. Again, our translation to coalgebras took care to force a coarse initial partition in the algorithm.

The results in Table 2 show that the distributed implementation is again able to handle larger systems than sequential CoPaR in 16GB of RAM per process. For the *fms* benchmarks, the distributed implementation is again faster than the sequential one. However, this is not the case for the *wlan* examples. The larger run times might be explained by the much higher number of iterations of the refinement phase (*i*-column of the table). This means that only few states are distinguished in each phase, and thus signatures are re-computed more often and more network traffic is incurred.


Table 2: Benchmarks on PRISM models: *n* and *m* are the numbers of states and edges of the input coalgebra; *i* is the number of refinement steps (iterations). The other columns are analogous to Table 1.

#### **6 Conclusions and Future Work**

We have presented a new and simple generic coalgebraic partition refinement algorithm that easily lends itself to a distributed implementation. Our algorithm is based on König and Küpper's final chain algorithm [25] and Blom and Orzan's signature refinement algorithm for labelled transition systems [8]. We have provided a distributed implementation in the tool CoPaR. Like the previous sequential Paige-Tarjan-style partition refinement algorithm, our new algorithm is modular in the system type. This is made possible by combining signature interfaces by product and coproduct, which CoPaR uses for handling combined type functors. Experiments have shown that the distributed algorithm enables CoPaR to handle larger state spaces in general. Run times stay low for weighted tree automata, whereas we observed severe penalties on some models from the PRISM benchmark suite.

An additional optimization of the coalgebraic signature refinement algorithm should be possible using Blom and Orzan's idea [9] to mark in each iteration those states whose signatures can change in the next iteration and only recompute signatures for those states in the next round. This might mitigate the run time penalties we have seen in some of the PRISM benchmarks.

Further work on CoPaR concerns symbolic techniques: we have a prototype sequential implementation of the coalgebraic signature refinement algorithm where state spaces are represented using BDDs. In a subsequent step it could be investigated whether this can be distributed. In another direction the distributed algorithm might be extended to compute distinguishing formulas, as recently achieved for the sequential algorithm [43], for which there is also an implemented prototype. Finally, there is still work required to integrate all these new features, i.e. distribution, distinguishing formulas, reachability and computation of minimized systems, into one version of CoPaR.

**Data Availability Statement** The software CoPaR and the input files that were used to produce the results in this paper are available for download [3]. The latest version of CoPaR can be obtained at https://git8.cs.fau.de/software/copar.

# **References**


on Concurrency Theory (CONCUR). LIPIcs, vol. 85, pp. 28:1–28:16. Schloss Dagstuhl (2017)


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# From Bounded Checking to Verification of Equivalence via Symbolic Up-to Techniques <sup>⋆</sup>

Vasileios Koutavas<sup>1</sup> , Yu-Yang Lin<sup>1</sup> () , and Nikos Tzevelekos<sup>2</sup>

<sup>1</sup> Trinity College Dublin, Dublin, Ireland {Vasileios.Koutavas,linhouy}@tcd.ie

<sup>2</sup> Queen Mary University of London, London, UK nikos.tzevelekos@qmul.ac.uk

Abstract. We present a bounded equivalence verification technique for higher-order programs with local state. This technique combines fully abstract symbolic environmental bisimulations similar to symbolic game semantics, novel up-to techniques, and lightweight state invariant annotations. This yields an equivalence verification technique with no false positives or negatives. The technique is bounded-complete, in that all inequivalences are automatically detected given large enough bounds. Moreover, several hard equivalences are proved automatically or after being annotated with state invariants. We realise the technique in a tool prototype called Hobbit and benchmark it with an extensive set of new and existing examples. Hobbit can prove many classical equivalences including all Meyer and Sieber examples.

Keywords: Contextual equivalence · bounded model checking · symbolic bisimulation · up-to techniques · operational game semantics.

# 1 Introduction

Contextual equivalence is a relation over program expressions which guarantees that related expressions are interchangeable in any program context. It encompasses verification properties like safety and termination. It has attracted considerable attention from the semantics community (cf. the 2017 Alonzo Church Award), and has found its main applications in the verification of cryptographic protocols [4], compiler correctness [26] and regression verification [10,11,9,17].

In its full generality, contextual equivalence is hard as it requires reasoning about the behaviour of all program contexts, and becomes even more difficult in languages with higher-order features (e.g. callbacks) and local state. Advances in bisimulations [16,29,3], logical relations [1,13,15] and game semantics [18,25,8,20] have offered powerful theoretical techniques for hand-written proofs of contextual equivalence in higher-order languages with state. However, these advancements have yet to be fully integrated in verification tools for contextual equivalence in programming languages, especially in the case of bisimulation techniques. Existing tools [12,24,14] only tackle carefully delineated language fragments.

<sup>⋆</sup> This publication has emanated from research supported in part by a grant from Science Foundation Ireland under Grant number 13/RC/2094\_2.

In this paper we aim to push the frontier further by proposing a bounded model checking technique for contextual equivalence for the entirety of a higher-order language with local state (Sec. 3). This technique, realised in a tool called Hobbit,<sup>3</sup> automatically detects inequivalent program expressions given sufficient bounds, and proves hard equivalences automatically or semi-automatically.

Our technique uses a labelled transition system (LTS) for open expressions in order to express equivalence as a bisimulation. The LTS is symbolic both for higher-order arguments (Sec. 4), similarly to symbolic game models [8,20] and derived proof techniques [3,15], and first-order ones (Sec. 6), adopting established techniques (e.g. [6]) and tools such as Z3 [23]. This enables the definition of a fully abstract symbolic environmental bisimulation, the bounded exploration of which is the task of the Hobbit tool. Full abstraction guarantees that our tool finds all inequivalences given sufficient bounds, and only reports true inequivalences. As is corroborated by our experiments, this makes Hobbit a practical inequivalence detector, similar to traditional bounded model checking [2] which has proved an effective bug detection technique in industrial-scale C code [6,7,30].

However, while proficient in bug finding, bounded model checking can rarely prove the absence of errors, and in our setting prove an equivalence: a bound is usually reached before all—potentially infinite—program runs are explored. Inspired by hand-written equivalence proofs, we address this challenge by proposing two key technologies: new bisimulation up-to techniques, and lightweight user guidance in the form of state invariant annotations. Hence we increase significantly the number of equivalences proven by Hobbit, including for example all classical equivalences due to Meyer and Sieber [21].

Up-to techniques [28] are specific to bisimulation and concern reducing the size of bisimulation relations, oftentimes turning infinite transition systems into finite ones by focusing on a core part of the relation. Although extensively studied in the theory of bisimulation, up-to techniques have not been used in practice in an equivalence checker. We specifically propose three novel up-to techniques: up to separation and up to re-entry (Sec. 5), dealing with infinity in the LTS due to the higher-order nature of the language, and up to state invariants (Sec. 7), dealing with infinity due to state updates. Up to separation allows us to reduce the knowledge of the context the examined program expressions are running in, similar to a frame rule in separation logic. Up to re-entry removes the need to explore unbounded nestings of higher-order function calls under specific conditions. Up to state invariants allows us to abstract parts of the state and make finite the number of explored configurations by introducing state invariant predicates in configurations.

State invariants are common in equivalence proofs of stateful programs, both in hand-written (e.g. [16]) and tool-based proofs. In the latter they are expressed manually in annotations (e.g. [9]) or automatically inferred (e.g. [14]). In Hobbit we follow the manual approach, leaving heuristics for automatic invariant inference to future work. An important feature of our annotations is the ability to express relations between the states of the two compared terms, enabled by the up to

<sup>3</sup> Higher Order Bounded BIsimulation Tool (Hobbit), https://github.com/LaifsV1/Hobbit.

state invariants technique. This leads to finite bisimulation transition systems in examples where concrete value semantics are infinite-state.

The above technologies, combined with standard up-to techniques, transform Hobbit from a bounded checker into an equivalence prover able to reason about infinite behaviour in a finite manner in a range of examples, including classical example equivalences (e.g. all in [21]) and some that previous work on up-to techniques cannot algorithmically decide [3] (cf. Ex. 22). We have benchmarked Hobbit on examples from the literature and newly designed ones (Sec. 8). Due to the undecidable nature of contextual equivalence, up-to techniques are not exhaustive: no set of up-to techniques is guaranteed to finitise all examples. Indeed there are a number of examples where the bisimulation transition system is still infinite and Hobbit reaches the exploration bound. For instance, Hobbit is not able to prove examples with inner recursion and well-bracketing properties, which we leave to future work. Nevertheless, our approach provides a contextual equivalence tool for a higher-order language with state that can prove many equivalences and inequivalences which previous work could not handle due to syntactic restrictions and other limitations (Sec. 9).

Related work Our paper marries techniques from environmental bisimulations up-to [16,29,28,3] with the work on fully abstract game models for higher-order languages with state [18,8,20]. The closest to our technique is that of Biernacki et al. [3], which introduces up-to techniques for a symbolic LTS similar to ours, albeit with symbolic values restricted to higher-order types, resulting in infinite LTSs in examples such as Ex. 21, and with inequivalence decided outside the bisimulation by (non-)termination, precluding the use of up-to techniques in examples such as Ex. 22. Close in spirit is the line of research on logical relations [1,13,15], which provides a powerful tool for hand-written proofs of contextual equivalence. Also related are the tools Hector [12] and Coneqct [24], and SyTeCi [14], based on game semantics and step-indexed logical relations respectively (cf. Sec. 9).

# 2 High-Level Intuitions

Contextual equivalence requires that two program expressions lead to the same observable result in any program context they may be placed in. Instead of working directly with this definition, we can translate programs into a semantic model that is fully abstract, reducing contextual equivalence to semantic equality.

The semantic model we use is that of Game Semantics [18]. We model programs as formal interactions between two players: a Proponent (corresponding to the program) and an Opponent (standing for any program context). Concretely, these interactions are sets of traces produced by a Labelled Transition System (LTS), whose nodes and labels are called configurations and moves respectively. The LTS captures the interaction of the program with its environment, which is realised via function applications and returns: moves can be questions (i.e. function applications) or answers (returns), and belong to the proponent or the opponent. E.g. a program calling an external function will issue a proponent question, while the return of the external function will be an opponent answer. In the examples that follow, moves that correspond to the opponent are underlined.

Fig. 1. Sample LTSs modelling the expressions in Section 2.

Example 1. Consider the expression N = (**fun** f -> f (); 0) of type (unit → unit) → int. Evaluating N leads to a function g being returned (i.e. g is λf.f(); 0). When g is called with some input f<sub>1</sub>, it will always return 0, but in the process it may call the external function f<sub>1</sub>. The call to f<sub>1</sub> may immediately return or it may call g again (i.e. reenter), and so on. The LTS for N is as in Fig. 1 (top).

Given two expressions M, N, checking their equivalence amounts to checking bisimulation equivalence of their (generally infinite) LTSs. Our checking routine performs a bounded analysis that aims to either find a finite counterexample and thus prove inequivalence, or build a bisimulation relation that shows the equivalence of the expressions. The former case is easier, as it is relatively rapid to explore a bisimulation graph up to a given depth. The latter is harder, as the target bisimulation can be infinite. To tackle part of this infinity, we use three novel up-to techniques for environmental bisimulation.

Up-to techniques roughly assert that if a core set of configurations in the explored bisimulation graph can be proven to be part of a relation satisfying a definition that is more permissive than standard bisimulation, then a superset of those configurations forms a proper bisimulation relation. The implication is that a bounded analysis can explore a finite part of the bisimulation graph to verify potentially infinitely many configurations. As there can be no complete set of up-to techniques, the pertinent question is how useful they are in practice. In the remainder of this section we present the first of our up-to techniques, called up to separation, via an example equivalence. The intuition behind this technique comes from Separation Logic and amounts to saying that functions that access separate regions of the state can be explored independently. As a corollary, a function that manipulates only its own local references may be explored independently of itself, i.e. it suffices to call it once.


Fig. 2. Syntax and reduction semantics of the language λimp.

Example 2. Consider M = (**fun** f -> **ref** x = 0 **in** f (); !x) and N from Ex. 1. The LTSs corresponding to M and N are shown in Fig. 1 (middle and top). Regarding M, we can see that the opponent is always allowed to reenter the proponent function g, which creates a new reference x_n each time. This makes each configuration unique, which prevents us from finding cycles and thus finitising the bisimulation graph. Moreover, the LTSs for both M and N are infinite because of the stack discipline they need to adhere to when O issues reentrant calls.

With separation, however, we can prune the two LTSs as in Fig. 1 (bottom). We denote the configuration after the first opponent call as C1. Any opponent call after C1 leads to a configuration which differs from C1 either by a state component that is not accessible anymore and can thus be separated, or by a stack component that can be similarly separated. Hence, the LTSs that we need to consider are finite and thus the expressions are proven equivalent.

# 3 Language and Semantics

We develop our technique for the language λimp, a simply typed lambda calculus with local state whose syntax and reduction semantics are shown in Fig. 2. Expressions (Exp) include the standard lambda expressions with recursive functions (fix f(x).e), together with location creation (ref l = v in e), dereferencing (!l), and assignment (l := e), as well as standard base-type constants (c) and operations (op(⃗e)). Locations are mapped to values, including function values, in a store (St). We write · for the empty store and let fl(χ) denote the set of free locations in χ.

The language λimp is simply typed, with typing judgements of the form ∆; Σ ⊢ e : T, where ∆ is a type environment (omitted when empty), Σ a store typing and T a value type (Type); Σ_s is the typing of store s. The rules of the type system are standard and omitted here. Values consist of boolean, integer, and unit constants, functions, and arbitrary-length tuples of values. To keep the presentation of our technique simple we do not include reference types as value types, effectively keeping all locations local. Exchange of locations between expressions can be encoded using get and set functions. In Ex. 22 we show the encoding of a classic equivalence with location exchange between expressions and their context. Future extensions of our technique to handle location types can be informed by previous work [18,14].
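The get/set encoding of a location mentioned above can be sketched as a pair of closures over hidden state (`make_loc` is an illustrative name of ours):

```python
# A location encoded as a pair of get/set closures over hidden state;
# the underlying cell itself never escapes.
def make_loc(init):
    cell = [init]
    return (lambda: cell[0]), (lambda v: cell.__setitem__(0, v))

get, set_ = make_loc(0)
set_(41)
set_(get() + 1)
assert get() == 42
```

A context receiving only `get` and `set_` can read and write the location but can never compare or leak the location itself, which is exactly the discipline the restriction to local locations enforces.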

The reduction semantics is given by small-step transitions between configurations containing a store and an expression, ⟨s ; e⟩ → ⟨s′ ; e′⟩, defined using single-hole evaluation contexts (ECxt) over a base relation ↪. Holes [·]_T are annotated with the type T of closed values they accept, which we may omit to lighten notation. Beta substitution of x with v in e is written e[v/x]. We write ⟨s ; e⟩ ⇓ to denote ⟨s ; e⟩ →* ⟨t ; v⟩ for some t, v. We write ⃗χ for a syntactic sequence, and assume standard syntactic sugar from the lambda calculus. In our examples we assume an ML-like syntax and implementation of the type system, which is also the concrete syntax of Hobbit.

We consider environments Γ ∈ ℕ ⇀fin Val which map natural numbers to closed values. The concatenation of two such environments Γ1 and Γ2, written Γ1, Γ2, is defined when dom(Γ1) ∩ dom(Γ2) = ∅. We write (^{i1}v1, . . . , ^{in}vn) for a concrete environment mapping i1, . . . , in to v1, . . . , vn, respectively. When indices are unimportant we omit them and treat Γ environments as lists.

General contexts D contain multiple, non-uniquely indexed holes [·]_{i,T}, where T is the type of value that can replace the hole. Notation D[Γ] denotes the context D with each hole [·]_{i,T} replaced with Γ(i), provided that i ∈ dom(Γ) and Σ ⊢ Γ(i) : T, for some Σ. We omit hole types where possible, and indices when all holes in D are annotated with the same i. In the latter case we write D[v] instead of D[(^i v)] and allow all holes of D to be replaced with a closed expression e, written D[e]. We assume the Barendregt convention for locations, so replacing context holes avoids location capture. Standard contextual equivalence [22] follows.

Definition 3 (Contextual Equivalence). Expressions ⊢ e1 : T and ⊢ e2 : T are contextually equivalent, written e1 ≡ e2, when for all contexts D such that ⊢ D[e1] : unit and ⊢ D[e2] : unit we have ⟨· ; D[e1]⟩ ⇓ iff ⟨· ; D[e2]⟩ ⇓.
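As a concrete illustration of Definition 3, a context can only tell two expressions apart through termination behaviour or effects it wires in itself. The Python sketch below (our own example, not taken from the formal development) shows a context distinguishing `fun f -> f (); 0` from `fun f -> 0` by supplying an argument with a visible side effect:

```python
# e1 = fun f -> f (); 0  versus  e2 = fun f -> 0: a context can
# distinguish them by passing an f with a visible side effect.
def e1(f):
    f()
    return 0

def e2(f):
    return 0

def distinguishing_context(e):
    called = []
    e(lambda: called.append(True))
    return bool(called)

assert distinguishing_context(e1) is True
assert distinguishing_context(e2) is False
```

In the formal setting the observation is termination of D[e] rather than a boolean result, but the mechanism is the same: the context observes the program only through the calls crossing the program/context boundary.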

### 4 LTS with Symbolic Higher-Order Transitions

Our Labelled Transition System (LTS) has symbolic transitions for both higher-order and first-order interactions. For simplicity we first present our LTS with symbolic higher-order and concrete first-order transitions. We develop our theory and most up-to techniques on this simpler LTS. We then show its extension with symbolic first-order transitions and develop up to state invariants, which relies on this extension. We extend the syntax with abstract function names α:

$$\mathsf{Val}: \quad u, v, w ::= c \mid \mathsf{fix}\, f(x).e \mid (\vec{v}) \mid \alpha_T$$

Abstract function names α_T are annotated with the type T of function they represent, omitted where possible; an(χ) is the set of abstract names in χ.


Fig. 3. The Labelled Transition System.

We define our LTS (shown in Fig. 3) by opponent and proponent call and return transitions, based on Game Semantics [18]. Proponent transitions are the moves of an expression interacting with its context. Opponent transitions are the moves of the context surrounding this expression. These transitions are over proponent and opponent configurations ⟨A ; Γ ; K ; s ; e⟩ and ⟨A ; Γ ; K ; s ; ·⟩, respectively. In these configurations, A is the set of abstract function names introduced so far; Γ is an environment holding the functions produced by the proponent;<sup>4</sup> K is a stack of evaluation contexts recording pending proponent calls to the opponent; s is the store; and e is the expression under evaluation (absent, ·, in opponent configurations).


In addition, we introduce a special configuration ⟨⊥⟩ which is used to represent expressions that cannot perform given transitions (cf. Remark 6). We let a trace be a sequence of app and ret moves (i.e. labels), as defined in Fig. 3.

For the LTS to provide a fully abstract model of the language, it is necessary that functions passed as arguments or return values from proponent to opponent be abstracted away, as the actual syntax of functions is not directly observable in λimp. This is achieved by deconstructing such values v to:


We let ulpatt(v) contain all such pairs (D, Γ) for v; e.g. ulpatt((λx.e1, 5)) = {(([·]_i , 5), (^i λx.e1)) | for any i}. We extend ulpatt to types through the use of symbolic function names: ulpatt(T) is the largest set of pairs (D, Γ) such that ⊢ D[Γ] : T, where rng(Γ) = ⃗α_{⃗T}, and D does not contain functions.

<sup>4</sup> Thus, Γ encodes the environment of Environmental Bisimulations (e.g. [16]).

In Fig. 3, proponent application and return transitions (PropApp, PropRet) use ultimate pattern matching on values and accumulate the functions generated by the proponent in the Γ environment of the configuration, leaving only their indices on the label of the transition itself. Opponent application and return transitions (OpApp, OpRet) use ultimate pattern matching on types to generate opponent-provided values, which can only contain abstract functions. This eliminates the need to quantify over all functions in opponent transitions, but still involves infinite quantification over all base values. Symbolic first-order values in Sec. 6 will obviate the latter.

At opponent application, the following preorder performs a beta reduction when the opponent applies a concrete function. This technicality is needed for soundness.

Definition 4 (≻). For an application v u we write v u ≻ e to mean e = α u, when v = α; and e = e′[u/x][fix f(x).e′/f], when v = fix f(x).e′.

In our LTS, C ranges over configurations and η over transition labels; =η⇒ means −τ→*, when η = τ, and =τ⇒ −η→ =τ⇒ otherwise. Standard weak (bi-)simulation follows.

Definition 5 (Weak Bisimulation). A binary relation R is a weak simulation when for all C1 R C2 and C1 −η→ C1′, there exists C2′ such that C2 =η⇒ C2′ and C1′ R C2′. If R and R−1 are weak simulations then R is a weak bisimulation. Similarity (≾) and bisimilarity (≈) are the largest weak simulation and bisimulation, respectively.
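For intuition, on a finite LTS the transfer property of Definition 5 can be decided by iterated refinement of a candidate relation. The Python sketch below implements the strong variant (no τ-closure, every label treated as visible); it is a toy illustration of the definition, not Hobbit's algorithm:

```python
# A naive check of (strong) bisimilarity on a finite LTS: start from
# the full relation on states and repeatedly remove pairs that violate
# the transfer property in either direction; the fixpoint is bisimilarity.
def bisimilar(trans, s0, t0):
    states = {s0, t0} | {s for s, _, _ in trans} | {t for _, _, t in trans}
    def steps(s):
        return [(l, t) for (a, l, t) in trans if a == s]
    def ok(rel, s, t):
        # every move of s must be matched by an equally-labelled move
        # of t leading into rel
        return all(any(l == l2 and (s2, t2) in rel for (l2, t2) in steps(t))
                   for (l, s2) in steps(s))
    rel = {(s, t) for s in states for t in states}
    while True:
        rel2 = {(s, t) for (s, t) in rel if ok(rel, s, t) and ok(rel, t, s)}
        if rel2 == rel:
            return (s0, t0) in rel
        rel = rel2

assert bisimilar([(0, "a", 1), (2, "a", 3)], 0, 2)
assert not bisimilar([(0, "a", 1)], 0, 2)
```

The LTSs of Sec. 2 are generally infinite, which is precisely why such a direct fixpoint computation does not apply and up-to techniques are needed.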

Remark 6. Any proponent configuration that cannot match a standard bisimulation transition challenge can trivially respond to the challenge by transitioning into ⟨⊥⟩ by the Response rule in Fig. 3. By the same rule, this configuration can trivially perform all transitions except a special termination transition, labelled with ↓. However, regular configurations that have no pending proponent calls (K = ·) can perform the special termination transition (Term rule), signalling the end of a complete trace, i.e. a completed computation. This mechanism allows us to encode complete trace equivalence, which coincides with contextual equivalence [18], as bisimulation equivalence. In a bisimulation proof, if a proponent configuration is unable to match a bisimulation transition with a regular transition, it can still transition to ⟨⊥⟩, where it can simulate every transition of the other expression apart from −↓→, which leads to a complete trace. Our mechanism for treating unmatched transitions has the benefit of enabling us to use the standard definition of bisimulation over our LTS. This is in contrast to previous work [3,15], where termination/non-termination needed to be proven independently or baked into the simulation conditions. More importantly, our approach allows us to use bisimulation up-to techniques even when one of the related configurations diverges, which is not possible in previous symbolic LTSs [18,15,3], and is necessary in examples such as Ex. 22.

Definition 7 (Bisimilar Expressions). Expressions ⊢ e1 : T and ⊢ e2 : T are bisimilar, written e1 ≈ e2, when ⟨· ; · ; · ; · ; e1⟩ ≈ ⟨· ; · ; · ; · ; e2⟩.

Theorem 8 (Soundness and Completeness). e1 ≈ e2 iff e1 ≡ e2.

As a final remark, the LTS presented in this section is finite-state only for a small number of trivial equivalence examples. The following section addresses sources of infinity in the transition systems through bisimulation up-to techniques.

# 5 Up-to Techniques

We start with the definition of a sound up-to technique.

Definition 9 (Weak Bisimulation up to f). R is a weak simulation up to f when for all C1 R C2 and C1 −η→ C1′, there is C2′ with C2 =η⇒ C2′ and C1′ f(R) C2′. If R and R−1 are weak simulations up to f then R is a weak bisimulation up to f.

Definition 10 (Sound up-to technique). A function f is a sound up-to technique when for any R which is a simulation up to f we have R ⊆ (≾).

Hobbit employs the standard techniques: up to identity, up to garbage collection, up to beta reductions and up to name permutations. Here we present two novel up-to techniques: up to separation and up to reentry.

Up to Separation. Our experience with Hobbit has shown that one of the most effective up-to techniques for finitising bisimulation transition systems is the novel up to separation which we propose here. The intuition behind this technique is that if different functions operate on disjoint parts of the store, they can be explored in disjoint parts of the bisimulation transition system. Taken to the extreme, a function that contains no free locations need be applied only once in a bisimulation test, as two copies of the function will not interfere with each other, even if they allocate new locations after application. To define up to separation we need a separating conjunction for configurations.

Definition 11 (Stack Interleaving). Let K1, K2 be lists of evaluation contexts from ECxt (Fig. 2); we define the interleaving operation K1 #_{⃗k} K2 inductively, and write K1 # K2 to mean K1 #_{⃗k} K2 for unspecified ⃗k. We let · #_· · = · and:

$$E_1, K_1 \mathbin{\#}_{(1,\vec{k})} K_2 = E_1, (K_1 \mathbin{\#}_{\vec{k}} K_2) \qquad K_1 \mathbin{\#}_{(2,\vec{k})} E_2, K_2 = E_2, (K_1 \mathbin{\#}_{\vec{k}} K_2)$$

Definition 12 (Separating Conjunction). Let C1 = ⟨A1 ; Γ1 ; K1 ; s1 ; ê1⟩ and C2 = ⟨A2 ; Γ2 ; K2 ; s2 ; ê2⟩ be well-formed configurations. We define:

– C1 ⊕¹_{⃗k} C2 ≝ ⟨A1 ∪ A2 ; Γ1, Γ2 ; K1 #_{⃗k} K2 ; s1, s2 ; ê1⟩ when ê2 = ·

– C1 ⊕²_{⃗k} C2 ≝ ⟨A1 ∪ A2 ; Γ1, Γ2 ; K1 #_{⃗k} K2 ; s1, s2 ; ê2⟩ when ê1 = ·

provided dom(s1) ∩ dom(s2) = ∅. We let C1 ⊕ C2 denote ∃i, ⃗k. C1 ⊕ⁱ_{⃗k} C2.

The function sep provides the up to separation technique; it is defined as:

$$\frac{C_1 \mathrel{R} C_2 \quad C_3 \mathrel{R} C_4}{C_1 \oplus^i_{\vec{k}} C_3 \ \mathsf{sep}(R)\ C_2 \oplus^i_{\vec{k}} C_4}\,\textsc{UpTo}\oplus \qquad \frac{C_1 \mathrel{R} \langle\bot\rangle \quad C_3 \mathrel{R} C_4}{C_1 \oplus C_3 \ \mathsf{sep}(R)\ \langle\bot\rangle}\,\textsc{UpTo}\oplus\bot_L \qquad \frac{C_1 \mathrel{R} C_2 \quad C_3 \mathrel{R} \langle\bot\rangle}{C_1 \oplus C_3 \ \mathsf{sep}(R)\ \langle\bot\rangle}\,\textsc{UpTo}\oplus\bot_R$$

Soundness follows by extending [28,27] with a weaker, sufficient proof obligation.

Lemma 13. Function sep is a sound up-to technique.

Many example equivalences have a finite transition system when using up to separation in conjunction with the simple techniques listed earlier in this section.

Example 14. The following classic equivalence is from Meyer and Sieber [21]; the expressions are equivalent at type (unit → unit) → unit.

$$M = \texttt{fun f -> ref x = 0 in f ()} \qquad\qquad N = \texttt{fun f -> f ()}$$

For both functions, after the initial application of the function by the opponent, the proponent calls f, growing the stack K in the two configurations. At that point the opponent can apply the same functions again. The LTS of both M and N is thus infinite because K can grow indefinitely, and so is a bisimulation proving this equivalence. It is additionally infinite because the opponent can keep applying the initial functions even after these applications return. However, if we apply the up to separation technique immediately after the first opponent application, the Γ environments become empty, and thus no second application of the same functions can happen. The LTS thus becomes trivially small. Note that no other up-to technique is needed here. Hobbit applies up to separation after every opponent application transition and explores the configuration containing the application expression and the smallest possible Γ; this does not lead to false-negative (or false-positive) results.
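The two Meyer-Sieber expressions can be transcribed into Python for a quick sanity check (this only exercises particular opponents, whereas the bisimulation argument covers all contexts):

```python
# M allocates a reference it never reveals; N does not. Any opponent
# observes the same sequence of calls from both.
def m(f):
    x = [0]   # local reference, never exposed
    f()

def n(f):
    f()

calls = []
m(lambda: calls.append("m"))
n(lambda: calls.append("n"))
assert calls == ["m", "n"]
```

The point of the example is not this single run but that no context, however it reenters m and n, can observe the hidden reference x.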

Example 15. This example is due to Bohr and Birkedal [5] and includes a nonsynchronised divergence.

```
M = fun f ->
       ref l1 = false in ref l2 = false in
       f (fun () -> if !l1 then _bot_ else l2 := true);
       if !l2 then _bot_ else l1 := true
N = fun f -> f (fun () -> _bot_)
```
Note that \_bot\_ is a diverging computation. This is a hard example to prove using environmental bisimulation even with up-to techniques, requiring quantification over contexts within the proof. However, with up to separation, after the opponent applies the initial functions the Γ environments are emptied, leaving only one application of M and N to be explored by the bisimulation. Applications of the inner function provided as argument to f lead only to a small number of reachable configurations. Hobbit can indeed prove this equivalence.

Up to Proponent Function Re-entry. The higher-order nature of λimp and its LTS allows infinite nesting of opponent and proponent calls. Although up to separation avoids this in a number of examples, here we present a second novel up-to technique, which we call up to proponent function re-entry (or simply, up to re-entry). This technique has connections to the induction hypothesis in the definition of environmental bisimulations in [16]. However, up to re-entry is specifically aimed at avoiding nested calls to proponent functions, and it is designed to work with our symbolic LTS. In combination with other techniques this eliminates the need to consider configurations with unbounded stacks K in many classical equivalences, including those in [21].

$$\dfrac{\begin{array}{c}
\forall \vec{\eta}, D, A', \Gamma_1', \Gamma_2', s_1', s_2'.\ \big(\ \mathsf{app}(i,\_) \notin \{\vec{\eta}\}\ \text{and} \\[2pt]
\langle A ;\, \Gamma_1 ;\, \cdot ;\, s_1 ;\, \cdot\rangle \xRightarrow{\underline{\mathsf{app}}(i,D)} \xRightarrow{\vec{\eta}} \langle A' ;\, \Gamma_1' ;\, \cdot ;\, s_1' ;\, \cdot\rangle\ \text{and} \\[2pt]
\langle A ;\, \Gamma_2 ;\, \cdot ;\, s_2 ;\, \cdot\rangle \xRightarrow{\underline{\mathsf{app}}(i,D)} \xRightarrow{\vec{\eta}} \langle A' ;\, \Gamma_2' ;\, \cdot ;\, s_2' ;\, \cdot\rangle \\[2pt]
\text{implies}\ \Gamma_1' = \Gamma_1\ \text{and}\ \Gamma_2' = \Gamma_2\ \text{and}\ s_1 \asymp s_1'\ \text{and}\ s_2 \asymp s_2'\ \big) \\[4pt]
C_1 \xRightarrow{\underline{\mathsf{app}}(i,D)} \xRightarrow{\vec{\eta}'} \xRightarrow{\underline{\mathsf{app}}(i,D')} \langle A' ;\, \Gamma_1 ;\, K_1', K_1 ;\, s_1 ;\, e_1'\rangle \\[2pt]
C_2 \xRightarrow{\underline{\mathsf{app}}(i,D)} \xRightarrow{\vec{\eta}'} \xRightarrow{\underline{\mathsf{app}}(i,D')} \langle A' ;\, \Gamma_2 ;\, K_2', K_2 ;\, s_2 ;\, e_2'\rangle \qquad C_1 \mathrel{R} C_2
\end{array}}{\langle A' ;\, \Gamma_1 ;\, K_1', K_1 ;\, s_1 ;\, e_1'\rangle \ \mathsf{reent}(R)\ \langle A' ;\, \Gamma_2 ;\, K_2', K_2 ;\, s_2 ;\, e_2'\rangle}\ \textsc{UpToReentry}$$

Fig. 4. Up to Proponent Function Re-entry (omitting rules for ⊥-configurations).

Up to re-entry is realised by the function reent in Fig. 4. The intuition behind this up-to technique is that if the application of the related functions at index i in the Γ environments has no potential to change the local stores (up to garbage collection, encoded by (≍)) or to grow the Γ environments, then there are no additional observations to be made by nested calls to the i-functions; thus configurations reached by such nested calls are added to the relation by this up-to technique. Soundness follows similarly to up to separation.

In Hobbit we require the user to flag the functions to be considered for the up to re-entry technique. This annotation is later combined with state invariant annotations, as they are often used together. Inequivalences found while using the up to re-entry and state invariant annotations could be false negatives due to incorrect user annotations. Hobbit ensures that no such false negatives are reported by re-running discovered inequivalences with these two techniques off.

Below is an example where the state invariant needed is trivial, and up to separation together with up to re-entry are sufficient to prove the equivalence.

Example 16. M = **ref** x = 0 **in fun** f -> f (); !x N = **fun** f -> f (); 0

This is like Ex. 2 except that the reference in M is created outside of the function body. The LTS for this is as follows. Labels ⟨•; !x1⟩ are continuations.

Again, the opponent is allowed to reenter g as before. With up-to reentry, however, the opponent skips nested calls to g as these do not modify the state.

N mirrors the above LTS without the x1 reference and with continuation ⟨•; 0⟩.

### 6 Symbolic First-Order Transitions

We extend λimp constants (Const) with a countable set of symbolic constants ranged over by κ. We define symbolic environments σ ::= · | (κ ⌢ e), σ, where ⌢ is either = or ≠, and e is an arithmetic expression over constants, and interpret them as conjunctions of (in)equalities, with the empty environment interpreted as ⊤.

Definition 17 (Satisfiability). A symbolic environment σ is satisfiable if there exists an assignment δ, mapping the symbolic constants of σ to actual constants, such that δσ is a tautology; we then write δ ⊨ σ.
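Definition 17 can be illustrated with a toy satisfiability check for integer symbolic constants: represent σ as a list of atoms relating a constant κ to an integer (a simplification of the general arithmetic expressions), and brute-force assignments over a small range. This sketch is ours for illustration only; Hobbit instead delegates such queries to Z3:

```python
# A toy version of Def. 17: sigma is a list of (kappa, rel, value)
# atoms with rel in {"=", "!="}; we enumerate assignments delta over a
# small finite range and test whether delta makes every atom true.
from itertools import product

def satisfiable(sigma, lo=-10, hi=10):
    names = sorted({k for (k, _, _) in sigma})
    rng = range(lo, hi + 1)
    for vals in product(rng, repeat=len(names)):
        asn = dict(zip(names, vals))  # a candidate assignment delta
        if all((asn[k] == v) if r == "=" else (asn[k] != v)
               for (k, r, v) in sigma):
            return True
    return False

assert satisfiable([("k", "=", 3)])
assert not satisfiable([("k", "=", 3), ("k", "!=", 3)])
```

Note that the empty environment is reported satisfiable, matching its interpretation as ⊤.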

We extend reduction configurations with a symbolic environment σ, written σ ⊢ ⟨s ; e⟩. Symbolic constants are implicitly annotated with their type. We modify the reduction semantics from Fig. 2 to consider symbolic constants:


All other reduction semantics rules simply carry the σ. The LTS from Sec. 4 is modified to operate over configurations of the form σ ⊢ C or · ⊢ ⟨⊥⟩. We let C̃ range over both forms of configurations. All LTS rules for proponent transitions simply carry the σ; rule Tau may grow σ due to the inner reduction. Opponent transitions generate fresh symbolic constants instead of actual constants: labels app(i, D[⃗α]) and ret(D[⃗α]) in rules OpApp and OpRet of Fig. 3, respectively, contain D with symbolic, instead of concrete, constants. We adapt (bi-)simulation as follows.

Definition 18. A binary relation R on symbolic configurations is a weak simulation when for all C̃1 R C̃2 and C̃1 −η1→ C̃1′, there exists C̃2′ such that C̃2 =η2⇒ C̃2′ and:

– C̃1′ R C̃2′;
– (C̃1′.σ, C̃2′.σ) is satisfiable;
– ∀δ. δ ⊨ (C̃1′.σ, C̃2′.σ) implies δη1 = δη2.

Lemma 19. (σ1 ⊢ C1) ≾ (σ2 ⊢ C2) if for all δ ⊨ σ1, σ2 we have δC1 ≾ δC2.

Corollary 20 (Soundness, Completeness). (· ⊢ C1) ≾ (· ⊢ C2) iff C1 ≾ C2.

The up-to techniques we developed in previous sections apply unmodified to the extended LTS, as they do not involve symbolic constants, with the exception of up to beta, which requires adapting the definition of a beta move to consider all possible δ. The introduction of symbolic first-order transitions allows us to prove many interesting first-order examples, such as the equivalence of bubble sort and insertion sort, an example borrowed from Hector [12] (omitted here; see the Hobbit distribution). Below is a simpler example showing the equivalence of two integer swap functions which, by leveraging Z3 [23], Hobbit is able to prove.

Example 21.
```
M = let swap xy =
      let (x,y) = xy
      in (y, x)
    in swap

N = fun xy -> let (x,y) = xy in
    ref x = x in ref y = y in
    x := !x - !y; y := !x + !y;
    x := !y - !x; (!x, !y)
```
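The equivalence in Ex. 21 can be sanity-checked by running both versions on sample inputs (the names `swap_pure` and `swap_arith` are ours); the arithmetic version recovers the original values because each subtraction is undone by the matching addition:

```python
# Sanity check of Ex. 21: the arithmetic swap agrees with the pure swap.
def swap_pure(xy):
    x, y = xy
    return (y, x)

def swap_arith(xy):
    x, y = xy
    x = x - y        # x := !x - !y
    y = x + y        # y := !x + !y  (now y holds the original x)
    x = y - x        # x := !y - !x  (now x holds the original y)
    return (x, y)

assert swap_arith((3, 7)) == swap_pure((3, 7)) == (7, 3)
assert swap_arith((-5, 2)) == (2, -5)
```

What Hobbit proves, with Z3 discharging the arithmetic, is the same agreement for all integer inputs rather than for sampled ones.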
### 7 Up to State Invariants

The addition of symbolic constants to λimp and the LTS not only allows us to consider all possible opponent-generated constants simultaneously in a symbolic execution of proponent expressions, but also allows us to define an additional powerful up-to technique: up to state invariants. We define this technique in two parts, up to abstraction and up to tautology, realised by abs and taut.<sup>5</sup>

$$\frac{(\sigma_1 \vdash C_1) \mathrel{R} (\sigma_2 \vdash C_2)}{(\sigma_1 \vdash C_1)[\vec{c}/\vec{\kappa}]\ \mathsf{abs}(R)\ (\sigma_2 \vdash C_2)[\vec{c}/\vec{\kappa}]}\,\textsc{UpToAbs} \qquad \frac{(\sigma_1' \vdash C_1) \mathrel{R} (\sigma_2' \vdash C_2) \quad (\sigma_1, \sigma_2, \sigma_1', \sigma_2') \text{ is sat.} \quad \sigma_1, \sigma_2 \wedge \neg(\sigma_1', \sigma_2') \text{ is not sat.}}{(\sigma_1 \vdash C_1)\ \mathsf{taut}(R)\ (\sigma_2 \vdash C_2)}\,\textsc{UpToTaut}$$

The first function, abs, allows us to derive the equivalence of configurations by abstracting constants with fresh symbolic constants (of the same type) and instead proving equivalent the more abstract configurations. The second function, taut, allows us to introduce tautologies into the symbolic environments; these are predicates which are valid, i.e., they hold for all instantiations of the abstract variables. Combining the two functions we can introduce a tautology I(⃗c) into the symbolic environments, and then abstract the constants ⃗c, both in the predicate and in the configurations, with symbolic ones, obtaining I(⃗κ), which encodes an invariant that always holds.

Currently in Hobbit, up to abstraction and up to tautology are combined and applied in a principled way. Functions can be annotated with the following syntax:

$$F = \texttt{fun x } \{\vec{\kappa} \mid l_1\ \texttt{as}\ C_1[\vec{\kappa}],\ \ldots,\ l_n\ \texttt{as}\ C_n[\vec{\kappa}] \mid \phi\}\ \texttt{->}\ e$$

The annotation instructs Hobbit to use the two techniques when opponent applies related functions where at least one of them has such an annotation. If both functions contain annotations, then they are combined and the same ⃗κ are used in both annotations. The techniques are used again when proponent returns from the functions, and proponent calls opponent from within the functions.<sup>6</sup> As discussed in Sec. 5, the same annotation enables up to reentry in Hobbit.

When Hobbit uses the above two up-to techniques it 1) pattern-matches the values currently in each location l_i against the value context C_i with fresh symbolic constants ⃗κ in its holes, obtaining a substitution [⃗c/⃗κ]; 2) applies the up to tautology technique to the formula ϕ[⃗c/⃗κ]; and 3) applies the up to abstraction technique by replacing ϕ[⃗c/⃗κ] in the symbolic environment with ϕ, and the contents of locations l_i with C_i[⃗κ].

<sup>5</sup> Hobbit also implements an up to σ-normalisation and garbage collection technique.

<sup>6</sup> Finer-grained control over the application of these up-to techniques is left to future work.

Example 22. The following example is by Meyer and Sieber [21], featuring location passing, adapted to λimp where locations are local.

```
M = let loc_eq loc1loc2 = [. . . ] in
     fun q -> ref x = 0 in
              let locx = (fun () -> !x) , (fun v -> x := v) in
              let almostadd_2 locz {w | x as w | w mod 2 == 0} =
                if loc_eq (locx,locz) then x := 1 else x := !x + 2
              in q almostadd_2; if !x mod 2 = 0 then _bot_ else ()
```

```
N = fun q -> _bot_
```
In this example we simulate general references as a pair of read-write functions. Function loc_eq implements a standard location equality test. The two higher-order expressions are equivalent because the opponent can only increase the contents of x through the function almostadd_2. As the number of times the opponent can call this function is unbounded, the LTS is infinite. However, the annotation of function almostadd_2 applies the up to state invariants technique when the function is called (and, less crucially, when it returns), replacing the concrete value of x with a symbolic integer constant w satisfying the invariant w mod 2 == 0. This makes the LTS finite, up to permutations of symbolic constants. Moreover, up to separation removes the outer functions from the Γ environments, thus preventing re-entrant calls to these functions. Note that the up-to techniques are applied even though one of the configurations is diverging (_bot_). This would not be possible with the LTS and bisimulation of [3].

# 8 Implementation and Evaluation

We implemented the LTS and up-to techniques for λimp in a prototype tool called Hobbit, which we ran on a test suite of 105 equivalences and 68 inequivalences (3338 and 2263 lines of code for equivalences and inequivalences, respectively).

Hobbit is bounded in the total number of function calls it explores per path. We ran Hobbit with a default bound of 6 calls, except where a larger bound was needed to prove or disprove equivalence—46 examples required a larger bound, and the largest bound used was 348. To illustrate the impact of the up-to techniques, we checked all files (pairs of expressions to be checked for equivalence) in five configurations: default (all up-to techniques on), up to separation off, annotations (up to state invariants and re-entry) off, up to re-entry off, and everything off. The tool stops at the first trace that disproves equivalence, after enumerating all traces up to the bound, or after timing out at 150 seconds. Time taken and exit status (equivalent, inequivalent, inconclusive) were recorded for each file; an overview of the experiment can be seen in the following table. All experiments ran on an Ubuntu 18.04 machine with 32GB RAM and an Intel Core i7 1.90GHz CPU, with intermediate calls to Z3 4.8.10 to prune invalid internal symbolic branching and decide symbolic bisimulation conditions. All constraints passed to Z3 are propositional satisfiability problems in conjunctive normal form (CNF).


We can observe that Hobbit was sound and bounded-complete for our examples: there were no false reports and all inequivalences were identified. The up-to techniques also had a significant impact on proving equivalence. With all techniques on, Hobbit proved 68.6% of our equivalences, a dramatic improvement over the 2.9% proven with none on. The most significant technique was up to separation—necessary for 55.6% of the equivalences proven and reducing time taken by 99.99%—which was useful when functions could be independently explored by the context. Next were annotations—necessary for 34.7% of equivalences and decreasing time by 96.9%—and up to re-entry—necessary for 20.8% of files and decreasing time by 96.8%. Although the latter two require manual annotation, they enabled equivalences where our annotation language was able to capture the proof conditions. Note that, since turning off invariant annotations also turns off re-entry, only 10 files needed up to re-entry on top of invariant annotations. In contrast, inequivalences did not benefit as much. This was expected: even without up-to techniques Hobbit is still based on bounded model checking, which is theoretically sound and complete for inequivalences, and finds the shortest counterexample traces using breadth-first search. Nonetheless, with up-to techniques turned off, inequivalences were discovered in 515.7s (vs. 20s with the techniques on) and three files timed out, the difference being due to the techniques reducing the size and branching factor of configurations. This suggests that the reduction in state space is still relevant when searching for counterexamples.

### 9 Comparison with Existing Tools

There are two main classes of tools for contextual equivalence checking. The first includes semantics-driven tools that tackle higher-order languages with state like ours. To this class belong the game-based tools Hector [12] and Coneqct [24], which can only address carefully crafted fragments of the language, delineated by type restrictions and bounded data types. The most advanced tool in this class is SyTeCi [14], which is based on logical relations and removes a good part of the language restrictions needed by the previous tools. The second class concerns tools that focus on first-order languages, typically variants of C, with main tools including Rêve [9], SymDiff [17] and RVT [11]. These are highly optimised for handling internal loops, a problem orthogonal to handling the interactions between higher-order functions and their environment, which is addressed by Hobbit and related tools. We believe the techniques used in these tools may be useful when adapted to Hobbit, which we leave for future work.

In the higher-order contextual equivalence setting, the most relevant tool to compare with Hobbit is SyTeCi, because SyTeCi supersedes previous tools by proving examples with fewer syntactical limitations. We ran the tools on

examples from both SyTeCi's and our own benchmarks—7 and 15 equivalences, and 2 and 7 inequivalences from SyTeCi and Hobbit respectively—with a timeout of 150s and using Z3. Unfortunately, due to differences in parsing and SyTeCi's syntactical restrictions, the input languages were not entirely compatible and only a few manually translated programs were chosen.


We were unable to translate many of our examples because of restrictions in the input syntax supported by SyTeCi. Some of these restrictions were inessential (e.g. the absence of tuples) while others were substantial: the tool does not support programs where references are allocated both inside and outside functions (e.g. Ex. 15), or with non-synchronisable recursive calls. Moreover, SyTeCi relies on Constrained Horn Clause satisfiability, which is undecidable. In our testing SyTeCi sometimes timed out on examples; in private correspondence with its creator this was attributed to Z3's ability to solve Constrained Horn Clauses. Finally, SyTeCi was sound for equivalences, but not always for inequivalences, as can be seen in the table above; the reason is unclear and may be due to bugs. On the other hand, SyTeCi was able to solve equivalences we are not able to handle, e.g. synchronisable recursive calls and examples with well-bracketing properties.

# 10 Conclusion

Our experience with Hobbit suggests that our technique provides a significant contribution to the verification of contextual equivalence. In the higher-order case, Hobbit does not impose language restrictions such as those present in other tools. Our tool is able to solve several examples that cannot be solved by SyTeCi, the most advanced tool in this family. In the first-order case, the problem of contextual equivalence differs significantly, as the interactions that a first-order expression can have with its context are limited; e.g. equivalence analyses do not need to consider callbacks or re-entrant calls. Moreover, the distinction between global and local state is only meaningful in higher-order languages, where a program phrase can invoke different calls of the same function, each with its own state. Therefore, tools for first-order languages focus on what in our setting are internal transitions and the complexities arising from e.g. unbounded datatypes and recursion, whereas we focus on external interactions with the context.

As for limitations, Hobbit does not handle synchronised internal recursion and well-bracketed state, which SyTeCi can often solve. More generally, Hobbit is not optimised for internal recursion in the way first-order tools are. In this work we have also disallowed reference types in λimp to simplify the technical development; location exchange is encoded via function exchange (cf. Ex. 22). We intend to address these limitations in future work and to explore applications of Hobbit to real-world examples.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Equivalence Checking for Orthocomplemented Bisemilattices in Log-Linear Time**⋆

Simon Guilloud() and Viktor Kunčak

EPFL IC LARA, Station 14, CH-1015 Lausanne, Switzerland {Simon.Guilloud,Viktor.Kuncak}@epfl.ch

**Abstract.** Motivated by proof checking, we consider the problem of efficiently establishing equivalence of propositional formulas by relaxing the completeness requirements while still providing certain guarantees. We present a quasilinear time algorithm to decide the word problem for a natural algebraic structure we call orthocomplemented bisemilattices, a subtheory of Boolean algebra. The starting point for our procedure is a variation of the Aho, Hopcroft, and Ullman algorithm for isomorphism of trees, which we generalize to directed acyclic graphs. We combine this algorithm with a term rewriting system we introduce to decide equivalence of terms. We prove that our rewriting system is terminating and confluent, implying the existence of a normal form. We then show that our algorithm computes this normal form in log-linear (and thus sub-quadratic) time. We provide pseudocode and a minimal working implementation in Scala.

# **1 Introduction**

Reasoning about propositional logic and its extensions is a basis of many verification algorithms [19]. Propositional variables may correspond to, for example, sub-formulas in first-order logic theories of SMT solvers [2,5,26], hypotheses and lemmas inside proof assistants [13,27,32], or abstractions of sets of states. In particular, it is often of interest to establish that *two propositional formulas are equivalent*. The equivalence problem for propositional logic is coNP-complete, as the negation of propositional satisfiability [8]. From a proof complexity point of view [18], many known proof systems, including (non-extended) resolution [31] and cutting planes [29], have exponential-sized shortest proofs for certain propositional formulas. SAT and SMT solvers rely on DPLL-style algorithms [9,10] and do not have polynomial run-time guarantees for equivalence checking, even if the formulas are syntactically close. Proof assistants implement such algorithms as tactics, so they have similar difficulties. A consequence of this is that implemented systems may take a very long time to acknowledge (or may fail to acknowledge) that a large formula is equivalent to a minor variant of itself differing in, for example, the reordering of internal conjuncts or disjuncts. Similar situations also arise in program verifiers [12,21,30,34,35], where assertions act as lemmas in a proof.

⋆ We acknowledge the financial support of the Swiss National Science Foundation project 200021\_197288 "A Foundational Verifier".

It is thus natural to ask for an approximation of the propositional equivalence problem: *can we find an expressive theory supporting many of the algebraic laws of Boolean algebra but for which we can still have a complete and efficient algorithm for formula equivalence?* By efficient, we mean about as fast, up to logarithmic factors, as the simple linear-time syntactic comparison of formula trees.

We can use such an efficient equivalence algorithm to construct more flexible proof systems. Consider any sound proof system for propositional logic and replace the notion of *identical* sub-formulas with our notion of fast equivalence. For example, the axiom schema *𝑝* → (*𝑞* → *𝑝*) becomes *𝑝* → (*𝑞* → *𝑝* ′ ) for all equivalent *𝑝* and *𝑝* ′ . The new system remains sound. It accepts all the previously admissible inference steps, but also some new ones, which makes it more flexible.

$$\begin{array}{ll}
\text{L1: } x \sqcup y = y \sqcup x & \text{L1': } x \wedge y = y \wedge x\\
\text{L2: } x \sqcup (y \sqcup z) = (x \sqcup y) \sqcup z & \text{L2': } x \wedge (y \wedge z) = (x \wedge y) \wedge z\\
\text{L3: } x \sqcup x = x & \text{L3': } x \wedge x = x\\
\text{L4: } x \sqcup 1 = 1 & \text{L4': } x \wedge 0 = 0\\
\text{L5: } x \sqcup 0 = x & \text{L5': } x \wedge 1 = x\\
\text{L6: } \neg\neg x = x & \text{L6': } \neg\neg x = x\\
\text{L7: } x \sqcup \neg x = 1 & \text{L7': } x \wedge \neg x = 0\\
\text{L8: } \neg(x \sqcup y) = \neg x \wedge \neg y & \text{L8': } \neg(x \wedge y) = \neg x \sqcup \neg y
\end{array}$$

**Table 1.** Laws of the algebraic structure (*𝑆,* ∧*, ⊔,* 0*,* 1*,* ¬). Our algorithm is complete (and log-linear time) for structures that satisfy laws L1-L8 and L1'-L8'. We call these structures orthocomplemented bisemilattices (OCBSL).

$$\begin{array}{llll}
\text{L9: } & x \sqcup (x \wedge y) = x & \text{L9': } & x \wedge (x \sqcup y) = x\\
\text{L10: } & x \sqcup (y \wedge z) = (x \sqcup y) \wedge (x \sqcup z) & \text{L10': } & x \wedge (y \sqcup z) = (x \wedge y) \sqcup (x \wedge z)
\end{array}$$

**Table 2.** Neither the absorption laws L9, L9' nor the distributivity laws L10, L10' hold in OCBSL. Without L9, L9', the operations ∧ and *⊔* induce different partial orders. If an OCBSL satisfies L10, L10', then it also satisfies L9, L9' and is precisely a Boolean algebra.

#### **1.1 Problem Statement**

This paper proposes to approximate propositional formula equivalence using a new algorithm that solves exactly the word problem for structures we call orthocomplemented bisemilattices (axiomatized in Table 1), in only log-linear time. In general, the word problem for an algebraic theory with signature *𝑆* and axioms *𝐴* is the problem of determining, given two terms 𝑡₁ and 𝑡₂ in the language of *𝑆* with free variables, whether 𝑡₁ = 𝑡₂ is a consequence of the axioms. Our main interest in the problem is that orthocomplemented bisemilattices (OCBSL) are a generalisation of Boolean algebras. This structure satisfies a weaker set of axioms that omits the distributivity law as well as its weaker variant, the absorption law (Table 2). Hence, this problem is a relaxation "up to distributivity" of propositional formula equivalence. A positive answer implies that the formulas are equivalent in all Boolean algebras, hence also in propositional logic.

**Definition 1 (Word Problem for Orthocomplemented Bisemilattices).** *Consider the signature with two binary operations* ∧*, ⊔, unary operation* ¬ *and constants* 0*,* 1*. The OCBSL word problem is the problem of determining, given two terms 𝑡₁ and 𝑡₂ in this signature, possibly containing free variables, whether 𝑡₁* = *𝑡₂ is a consequence (in the sense of first-order logic with equality) of the universally quantified axioms L1-L8, L1'-L8' in Table 1.*
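For concreteness, terms over this signature can be represented as a small algebraic datatype. The sketch below is illustrative Python (the paper's implementation is in Scala), with names of our own choosing:

```python
from dataclasses import dataclass
from typing import Union

# Illustrative term representation for the OCBSL signature (S, ∧, ⊔, ¬, 0, 1).
# These class names are ours, not the paper's.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    value: int          # 0 or 1

@dataclass(frozen=True)
class Neg:
    arg: "Term"

@dataclass(frozen=True)
class Meet:             # binary ∧
    left: "Term"
    right: "Term"

@dataclass(frozen=True)
class Join:             # binary ⊔
    left: "Term"
    right: "Term"

Term = Union[Var, Const, Neg, Meet, Join]

# Syntactic tree equality is sufficient but far from necessary for equality
# in the theory: t1 and t2 below are distinct trees, yet equal by L1'.
t1 = Meet(Var("x"), Var("y"))
t2 = Meet(Var("y"), Var("x"))
```

The rest of the paper develops the machinery to decide such non-syntactic equalities efficiently.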

**Contribution.** We present an 𝒪(𝑛 log²(𝑛)) algorithm for the word problem of orthocomplemented bisemilattices. In the process, we introduce a confluent and terminating rewriting system for OCBSL on terms modulo commutativity. We analyze the algorithm to show its correctness and complexity. We present its executable description and a Scala implementation at https://github.com/epfl-lara/OCBSL.

#### **1.2 Related Work**

The word problem on *lattices* has been studied in the past. The structure we consider is, in general, *not* a lattice. Whitman [33] showed decidability of the word problem on free lattices, essentially by showing that the natural order relation on lattices between two words can be decided by an exhaustive search. The word problem on *orthocomplemented lattices* has typically been solved by defining a suitable sequent calculus for the order relation with a cut rule for transitivity [4,17]. Because a cut elimination theorem can be proved similarly to Gentzen's original [11], the proof space is finite and a proof search procedure can decide validity of the implication in the logic, which translates to the original word problem.

The word problem for free lattices was shown to be in PTIME by Hunt et al. [15], and the word problem for orthocomplemented lattices was shown to be in PTIME by Meinander [25]. Those algorithms rely on proof-search methods similar to the previous ones, but bound the search space. These results make no mention of a specific degree of the polynomial; our analysis suggests that, as described, these algorithms run in 𝒪(𝑛⁴). Related techniques of locality have been applied more broadly and also yield polynomial bounds, with the specific exponents depending on the local Horn clauses that axiomatize the theory [3, 24].

Aside from its use in equivalence checking, the problem is of independent interest because OCBSL are a natural weakening of Boolean algebras and of orthocomplemented lattices. They are dual to complemented lattices in the sense illustrated by Figure 1. A slight weakening of OCBSL, called de Morgan bisemilattices, has been used to simulate electronic circuits [6, 22]; OCBSL may be applicable in this scenario as well. Moreover, our algorithm can also be adapted to decide, in log-linear time, the word problem for this weaker theory.

To the best of our knowledge, no solution was presented in the past for the word problem for orthocomplemented bisemilattices (OCBSL). Moreover, we are not aware of previous log-linear algorithms for the related previously studied theories either.

#### **1.3 Overview of the Algorithm**

It is common to represent a term, like a Boolean formula, as an abstract syntax tree. In such a tree, a node corresponds to either a function symbol, a constant symbol or a variable, and the children of a function node represent the arguments of the function. In general, for a function symbol *𝑓*, the trees *𝑓*(*𝑥, 𝑦*) and *𝑓*(*𝑦, 𝑥*) are distinct; the children of a node are stored in a specific order. Commutativity of a function symbol *𝑓* corresponds to the children of a node labelled by *𝑓* being unordered instead. Our algorithm thus uses as its starting point a variation of the algorithm of Aho, Hopcroft, and Ullman [14] for tree isomorphism, as it corresponds to deciding equality of two terms modulo commutativity. However, the theory we consider contains many more axioms than merely commutativity. Our approach is to find an equivalent set of reduction rules, themselves understood modulo commutativity, that is suitable to compute a normal form of a given formula with respect to those axioms using the ideas of term rewriting [1]. The interest of tree isomorphism in our approach is two-fold: first, it helps to find application cases of our reduction rules, and second, it compares the two terms of our word problem. In the final algorithm, both aspects are realized simultaneously.

(a) Complemented lattice (b) Orthocomplemented bisemilattice (c) Orthocomplemented lattice

**Fig. 1.** Bisemilattices satisfying absorption or de Morgan laws.

# **2 Preliminaries**

#### **2.1 Lattices and Bisemilattices**

To define and situate our problem, we present a collection of algebraic structures satisfying certain subsets of the laws in Tables 1 and 2.

A structure (*𝑆,* ∧) that is associative (L2'), commutative (L1') and idempotent (L3') is a **semilattice**. A semilattice induces a partial order relation on *𝑆* defined by *𝑎* ≤ *𝑏* ⟺ (*𝑎*∧*𝑏*) = *𝑎*. Indeed, one can verify that ∃*𝑐.*(*𝑏*∧*𝑐*) = *𝑎* ⟺ (*𝑏*∧*𝑎*) = *𝑎*, from which transitivity follows; antisymmetry is immediate. In such a partially ordered set (poset) *𝑆*, two elements *𝑎* and *𝑏* always have a *greatest lower bound* (*𝑔𝑙𝑏*), namely *𝑎* ∧ *𝑏*. Conversely, a poset in which any two elements have a *𝑔𝑙𝑏* is always a semilattice. A structure (*𝑆,* ∧*,* 0*,* 1) that satisfies L1' to L5' is a bounded **upper-semilattice**. Equivalently, 1 is the maximum element and 0 the minimum element in the corresponding poset. Similarly, a structure (*𝑆, ⊔,* 0*,* 1) that satisfies L1 to L5 is a bounded **lower-semilattice**. In that case, we write the corresponding ordering relation *⊒*. Note that it points in the direction opposite to ≤, so that 1 is still the "maximum" element and 0 the "minimum" element. A structure (*𝑆,* ∧*, ⊔*) is a **bisemilattice** if (*𝑆,* ∧) is an upper semilattice and (*𝑆, ⊔*) a lower semilattice. In general, no specific laws relate the two semilattices of a bisemilattice: they can be the same semilattice or completely different. If the bisemilattice satisfies the absorption laws (L9, L9'), then the two semilattices are related in such a way that *𝑎* ≤ *𝑏* ⟺ *𝑎 ⊒ 𝑏*, i.e. the two orders ≤ and *⊒* coincide, and the structure is called a lattice. A bisemilattice is **consistently bounded** if both semilattices are bounded and 0∧ = 0*⊔* = 0 and 1∧ = 1*⊔* = 1, which will be the case in this paper. A structure (*𝑆,* ∧*, ⊔,* ¬*,* 0*,* 1) that satisfies L1 to L7 and L1' to L7' is called a **complemented bisemilattice**, with complement operation ¬.
A complemented bisemilattice satisfying the de Morgan laws (L8 and L8') is an **orthocomplemented bisemilattice**; in such a structure, ¬0 = ¬(¬1∧ 0) = ¬¬1 *⊔* ¬0 = 1 *⊔* ¬0 = 1. A structure satisfying L1-L9 and L1'-L9' is an **orthocomplemented lattice**. Both the de Morgan laws (L8, L8') and the absorption laws (L9, L9') relate the two semilattices, in the way summarised in Figure 1. In bisemilattices, orthocomplementation is (merely) equivalent to *𝑎* ≤ *𝑏* ⟺ ¬*𝑏 ⊒* ¬*𝑎*. Indeed, we have:

$$a \le b \overset{\text{def}}{\iff} a \land b = a \overset{\text{L8'}}{\iff} \neg a \sqcup \neg b = \neg a \overset{\text{def}}{\iff} \neg b \sqsupseteq \neg a$$

In the presence of L1-L8, L1'-L8', the absorption laws (L9 and L9') are implied by distributivity. In fact, an orthocomplemented bisemilattice with distributivity is a lattice, and even a Boolean algebra. In this sense, we can consider orthocomplemented bisemilattices as "Boolean algebras without distributivity".
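As a concrete sanity check (our illustration, in Python rather than the paper's Scala): the powerset of a finite set, with union as ⊔, intersection as ∧ and set complement as ¬, is a Boolean algebra, hence in particular an OCBSL, so a few of the laws can be verified exhaustively on it.

```python
from itertools import chain, combinations

# The powerset of {0, 1, 2}: a (very small) Boolean algebra, hence an OCBSL.
U = frozenset({0, 1, 2})
elems = [frozenset(s) for s in
         chain.from_iterable(combinations(U, r) for r in range(len(U) + 1))]

def neg(a):
    return U - a            # set complement plays the role of ¬

assert all(a | b == b | a for a in elems for b in elems)     # commutativity
assert all(a | neg(a) == U for a in elems)                   # complementation
assert all(neg(a | b) == neg(a) & neg(b)
           for a in elems for b in elems)                    # de Morgan
assert all(a | (a & b) == a for a in elems for b in elems)   # absorption
```

The absorption check succeeds here only because powersets are Boolean algebras; a general OCBSL need not satisfy it (Table 2).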

#### **2.2 Term Rewriting Systems**

We next review basics of term rewriting systems. For a more complete treatment, see [1].

**Definition 2.** *A term rewriting system is a list of rewriting rules of the form 𝑒ₗ* = *𝑒ᵣ, with the meaning that an occurrence of 𝑒ₗ in a term 𝑡 can be replaced by 𝑒ᵣ. Both 𝑒ₗ and 𝑒ᵣ can contain free variables. To apply the rule, 𝑒ₗ is unified with a subterm of 𝑡, and that subterm is replaced by 𝑒ᵣ under the same unifier. If applying a rewriting rule to 𝑡₁ yields 𝑡₂, we say that 𝑡₁ reduces to 𝑡₂ and write 𝑡₁* → *𝑡₂. We denote by* →∗ *the transitive closure of* → *and by* ↔∗ *its transitive symmetric closure.*

An axiomatic system such as L1-L9, L1'-L9' induces a term rewriting system, interpreting the equalities from left to right. In that case 𝑡₁ ↔∗ 𝑡₂ coincides with the validity of the equality 𝑡₁ = 𝑡₂ in the theory given by the axioms [1, Theorem 3.1.12].

**Definition 3.** *A term rewriting system is terminating if there exists no infinite chain of reducing terms 𝑡₁* → *𝑡₂* → *𝑡₃* → ⋯

**Fact 1** *If there is a well-founded order < (or, in particular, a measure 𝑚) on terms such that 𝑡₁* → *𝑡₂* ⟹ *𝑡₂ < 𝑡₁ (or, in particular, 𝑚*(*𝑡₂*) *< 𝑚*(*𝑡₁*)*), then the term rewriting system is terminating.*

**Definition 4.** *A term rewriting system is confluent iff for all 𝑡₁, 𝑡₂, 𝑡₃ with 𝑡₁* →∗ *𝑡₂ and 𝑡₁* →∗ *𝑡₃, there exists 𝑡₄ such that 𝑡₂* →∗ *𝑡₄ and 𝑡₃* →∗ *𝑡₄.*

**Theorem 1 (Church-Rosser Property).** *[1, Chapter 2] A term rewriting system is confluent if and only if* ∀*𝑡₁, 𝑡₂.* (*𝑡₁* ↔∗ *𝑡₂*) ⟹ (∃*𝑡₃. 𝑡₁* →∗ *𝑡₃* ∧ *𝑡₂* →∗ *𝑡₃*)*.*

A terminating and confluent term rewriting system directly implies decidability of the word problem for the underlying structure, as it makes it possible to compute the normal forms of two terms and check whether they are equal. Note that commutativity is not a terminating rewriting rule, but similar results hold if we consider the set of all terms, as well as the rewrite rules, modulo commutativity [1, Chapter 11], [28]. To efficiently manipulate terms modulo commutativity and achieve log-linear time, we will employ an algorithm for comparing trees with unordered children.

#### **3 Directed Acyclic Graph Equivalence**

The structure of formulas with commutative nodes corresponds to the usual mathematical definition of a labelled rooted tree, i.e. an acyclic graph with one distinguished vertex (the root) and no order on the children of a node. For this reason, we use as our starting point the algorithm of Aho, Hopcroft, and Ullman for tree isomorphism [14, Page 84, Example 3.2], which has also been studied subsequently [7, 23].

To account for structure sharing, we further generalize this representation to singly-rooted, labelled, directed acyclic graphs, which we simply call DAGs. Our DAGs generalize rooted directed trees. Any DAG can be transformed into a rooted tree by duplicating the subgraphs rooted at nodes with multiple parents, as in Figure 2. This transformation in general results in an exponential blowup in the number of nodes. Dually, using DAGs instead of trees can exponentially shrink the space needed to represent certain terms.

**Fig. 2.** A DAG and the corresponding Tree

**Fig. 3.** Two equivalent DAGs with different number of nodes.

Checking for equality between *ordered* trees or DAGs is easy in linear time: we simply recursively check equality between the children of two nodes.

**Definition 5.** *Two ordered nodes 𝜏 and 𝜋 with children 𝜏₀, ..., 𝜏ₘ and 𝜋₀, ..., 𝜋ₙ are equivalent (denoted 𝜏* ∼ *𝜋) iff*

$$label(\tau) = label(\pi), \; m = n \; \text{ and } \; \forall i \le n, \; \tau_i \sim \pi_i$$
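As a sketch of the ordered case (illustrative Python, with an assumed `(label, children)` node representation of our own), the check is a direct linear-time recursion:

```python
# Linear-time equivalence of *ordered* trees in the sense of Definition 5.
# Nodes are (label, children) pairs; this representation is ours, not the paper's.
def ordered_equiv(t, p):
    lt, ct = t
    lp, cp = p
    return (lt == lp
            and len(ct) == len(cp)
            and all(ordered_equiv(a, b) for a, b in zip(ct, cp)))
```

For example, `ordered_equiv` distinguishes `f(x, y)` from `f(y, x)`, which is exactly what the unordered notion of Definition 6 must not do.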

For unordered trees or DAGs, equivalence checking is less trivial, as the naive algorithm has exponential complexity due to the need to find an adequate permutation.

**Definition 6.** *Two unordered nodes 𝜏 and 𝜋 with children 𝜏₀, ..., 𝜏ₘ and 𝜋₀, ..., 𝜋ₙ are equivalent (denoted 𝜏* ∼ *𝜋) iff*

*𝑙𝑎𝑏𝑒𝑙*(*𝜏*) = *𝑙𝑎𝑏𝑒𝑙*(*𝜋*)*, 𝑚* = *𝑛 and there exists a permutation 𝑝 s.t.* ∀*𝑖* ≤ *𝑛, 𝜏𝑝*(*𝑖*) ∼ *𝜋𝑖*

For trees, note that this definition of equivalence corresponds exactly to isomorphism. DAG isomorphism, however, is known to be GI-complete, so it is conjectured not to be in PTIME. Fortunately, this does not preclude our solution, because our notion of equivalence on DAGs is not the same as isomorphism on DAGs. In particular, two DAGs can be equivalent without having the same number of nodes, i.e. without being isomorphic, as Figure 3 illustrates.


Algorithm 1 is our generalization of the algorithm of Aho, Hopcroft, and Ullman. It decides in log-linear time whether two labelled (unordered) DAGs are equivalent according to Definition 6. The algorithm generalizes straightforwardly to DAGs with a mix of ordered and unordered nodes: if a node is ordered, we skip the sorting operation in line 7.

The algorithm works bottom-up. We first sort the DAG in reverse topological order using, for example, Kahn's algorithm [16]. This way, we explore the DAG starting from the leaves and finishing with the root, guaranteeing that when we treat a node, all its children have already been treated.

The algorithm assigns codes to the nodes of both DAGs bottom-up, such that two nodes receive the same code if and only if they are equivalent.
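The code-assignment loop can be sketched as follows (illustrative Python, not the paper's Algorithm 1 or its Scala implementation; the node layout is an assumption of ours):

```python
def assign_codes(nodes):
    """nodes: list of (label, child_indices) pairs in reverse topological
    order, i.e. every child appears before its parents. Returns one code per
    node such that equal codes mean equivalent nodes (Definition 6)."""
    table = {}          # signature -> code, shared by all coded nodes
    codes = []
    for label, children in nodes:
        # Sorting the children's codes makes the signature insensitive to
        # the order of children, i.e. it builds in commutativity.
        sig = (label, tuple(sorted(codes[c] for c in children)))
        if sig not in table:
            table[sig] = len(table)     # fresh code for a new signature
        codes.append(table[sig])
    return codes

def dag_equiv(nodes1, nodes2):
    """Code both DAGs against a shared table (roots assumed to be the last
    entries) and compare the codes of the two roots."""
    shifted = [(l, [c + len(nodes1) for c in cs]) for l, cs in nodes2]
    codes = assign_codes(nodes1 + shifted)
    return codes[len(nodes1) - 1] == codes[-1]
```

Sorting at each node costs O(k log k) for k children, which summed over the whole DAG gives the log-linear bound discussed below.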



**Lemma 1 (Algorithm 1 Correctness).** *The codes assigned to any two nodes 𝑛 and 𝑚 of 𝑠𝜏*++*𝑠<sup>𝜋</sup> are equal if and only if 𝑛* ∼ *𝑚.*

*Proof.* Let *𝑛* and *𝑚* denote any two DAG nodes. By induction on the height of *𝑛*, the codes assigned to *𝑛* and *𝑚* are equal if and only if:

	- 1. their labels are equal, and
	- 2. there exists a permutation *𝑝* s.t. *𝑛𝑝*(*𝑖*) ∼ *𝑚𝑖* for all *𝑖*,

i.e. *𝑛* and *𝑚* have the same code if and only if *𝑛* ∼ *𝑚*.

**Corollary 1.** *The algorithm returns True if and only if 𝜏* ∼ *𝜋.*

*Time Complexity.* Using Kahn's algorithm, sorting *𝜏* and *𝜋* is done in linear time. The loop then touches every node a single time. Inside the loop, the first line takes time linear in the number of children of the node, and the second line takes log-linear time in the number of children. Since we use HashMaps, the last instructions take effectively constant time (the hash code is computed from the address of a node, not its content).

So for a general DAG, the algorithm runs in time at most log-quadratic in the number of nodes. Note, however, that for DAGs with a bounded number of children per node, as well as for DAGs with a bounded number of parents per node, the algorithm is log-linear. In fact, the algorithm is log-linear in the total number of edges in the graph, and hence still at most log-linear in the input size. It also follows that the algorithm is always at most log-linear in the size of the tree or formula underlying the DAG, which may be much larger than the DAG itself. Moreover, there exist cases where the algorithm is log-linear in the number of nodes while the underlying tree is exponentially larger; the full binary symmetric graph is such an example.

#### **4 Word Problem on Orthocomplemented Bisemilattices**

We will use the previous algorithm for DAG equivalence, applied to a formula in the language of bisemilattices (*𝑆,* ∧*, ⊔*), to account for commutativity (axioms L1, L1'), but we need to combine it with the remaining axioms. From now on we work with axioms L1-L8, L1'-L8' of Table 1. The plan is to express those axioms as reduction rules. Of rules L2-L8 and L2'-L8', all but L8 and L8' reduce the size of the term when applied from left to right, and hence are suitable as rewrite rules.

It may seem that the simplest way to deal with the de Morgan laws is to use them (along with double negation elimination) to transform all terms into negation normal form. Doing so, however, causes trouble when trying to detect application cases of rule L7 (complementation). Indeed, consider the following term:

$$f = (a \land b) \sqcup \neg(a \land b)$$

Using complementation it clearly reduces to 1, but when pushed into negation normal form it would first be transformed into (*𝑎* ∧ *𝑏*) *⊔* (¬*𝑎 ⊔* ¬*𝑏*). To detect that these two disjuncts are actually opposite then requires recursively verifying that ¬(*𝑎* ∧ *𝑏*) = ¬*𝑎 ⊔* ¬*𝑏*.

It is actually simpler to apply the de Morgan law in the following way:

$$x \land y = \neg(\neg x \sqcup \neg y)$$

Instead of removing negations from the formula, we remove one of the two binary semilattice operators. (Which one we keep is arbitrary; we chose to keep *⊔*.) Now, when we check whether rule L7 applies to a disjunction node (i.e. whether it has two children *𝑦* and *𝑧* such that *𝑦* = ¬*𝑧*), there are two cases for each child *𝑥*: if *𝑥* is not itself a negation, i.e. it starts with *⊔*, we compute the code of ¬*𝑥* from the code of *𝑥* in constant time. If *𝑥* = ¬*𝑥*′, then ¬*𝑥* ∼ *𝑥*′, so the code of ¬*𝑥* is simply the code of *𝑥*′, again obtained in constant time. Hence we obtain the codes of all children and of their negations, and we can sort those codes to look for collisions, all in time linear in the number of children.
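The collision search just described can be sketched as follows (illustrative Python; the map from a code to the code of its negation is assumed to have been computed as above, and a hash set stands in for the sorted-codes scan):

```python
def has_complement_pair(child_codes, neg_code):
    """child_codes: codes of the children of a ⊔-node.
    neg_code: dict mapping a code to the code of its negation, each entry
    obtainable in constant time as described in the text.
    Returns True iff some child is the negation of another (rule L7)."""
    present = set(child_codes)
    return any(neg_code[c] in present for c in child_codes)
```

The scan is linear in the number of children, matching the bound claimed above.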

We now restate the axioms L1-L8 ,L1'-L8' in this updated language in Table 3.

$$\begin{array}{ll}
\text{A1: } & \bigsqcup(\ldots, x_i, \ldots, x_j, \ldots) = \bigsqcup(\ldots, x_j, \ldots, x_i, \ldots)\\
\text{A2: } & \bigsqcup(\vec{x}, \bigsqcup(\vec{y})) = \bigsqcup(\vec{x}, \vec{y}) \qquad \bigsqcup(x) = x\\
\text{A3: } & \bigsqcup(x, x, \vec{y}) = \bigsqcup(x, \vec{y})\\
\text{A4: } & \bigsqcup(1, \vec{x}) = 1\\
\text{A5: } & \bigsqcup(0, \vec{x}) = \bigsqcup(\vec{x})\\
\text{A6: } & \neg\neg x = x\\
\text{A7: } & \bigsqcup(x, \neg x, \vec{y}) = 1\\
\text{A8: } & \neg\bigsqcup(x_1, \ldots, x_n) = \neg\bigsqcup(\neg\neg x_1, \ldots, \neg\neg x_n)
\end{array}$$

**Table 3.** Laws of the algebraic structure (*𝑆, ⊔,* 0*,* 1*,* ¬), equivalent to L1-L8, L1'-L8' under the de Morgan transformation.

It is straightforward to verify that axiom A8, as well as A1'-A8', all follow from axioms A1-A7, so A1-A7 are actually complete for our theory.

#### **4.1 Confluence of the Rewriting System**

In our equivalence algorithm, A1 is taken care of by the arbitrary but consistent ordering of the nodes. Axioms A2-A7 form a term rewriting system. Since all these rules reduce the size of the term, the system terminates in a number of steps linear in the size of the term. We next show that it is confluent. We thus obtain the existence of a normal form for every term, and finally show how our algorithm computes that normal form.

**Definition 7.** *Consider a pair of reduction rules 𝑙₀* → *𝑟₀ and 𝑙₁* → *𝑟₁ with disjoint sets of free variables, such that 𝑙₀* = *𝐷*[*𝑠*]*, 𝑠 is not a variable, and 𝜎 is the most general unifier of 𝑠 and 𝑙₁. Then* (*𝜎𝑟₀,* (*𝜎𝐷*)[*𝜎𝑟₁*]) *is called a* critical pair*.*

Informally, a critical pair is a most general pair of terms (*𝑡₁, 𝑡₂*) (with respect to unification) such that for some *𝑡₀*, *𝑡₀* → *𝑡₁* and *𝑡₀* → *𝑡₂* via two "overlapping" rules. Critical pairs are found by matching the left-hand side of a rule with a non-variable subterm of the same or another rule.

*Example 1 (Critical Pairs).*

1. Matching left-hand side of A6 with the subterm ¬*𝑥* of rule A7, we obtain the pair

$$(1, \bigsqcup (\neg x, x, \vec{y}))$$

which arises from reducing the term *𝑡*<sup>0</sup> = ⨆ (¬*𝑥,* ¬¬*𝑥, ⃗𝑦*) in two different ways.

2. Matching left-hand sides of A2 and A7 gives

$$(\bigsqcup (\vec{x}, \vec{y}, \neg \bigsqcup (\vec{y})), 1)$$

which arises from reducing ⨆(x⃗, ⨆(y⃗), ¬⨆(y⃗)) using A2 or A7.

3. Matching left-hand sides of A5 and A7 gives

(¬0*,* 1)

which arises from reducing 0 ⊔ ¬0 in two different ways.

**Proposition 1 ([1, Chapter 6]).** *A terminating term rewriting system is confluent if and only if all critical pairs* (t₁, t₂) *are joinable, i.e.,* ∃t₃. t₁ →* t₃ ∧ t₂ →* t₃*.*

In the first of the previous examples, the pair is clearly joinable by commutativity and a single application of rule A7 itself. The second example is more interesting. Observe that ⨆(x⃗, y⃗, ¬⨆(y⃗)) = 1 is a consequence of our axioms, but the left part cannot, in general, be reduced to 1 in our system. To solve this problem we need to add the rule A9: ⨆(x⃗, y⃗, ¬⨆(y⃗)) = 1. Similarly, the third example forces us to add A10: ¬0 = 1 to our set of rules. From A10 and A6 we then obtain the expected critical pair, which yields A11: ¬1 = 0.

$$\begin{array}{ll}
A1 & \bigsqcup(\ldots, x_i, x_j, \ldots) = \bigsqcup(\ldots, x_j, x_i, \ldots)\\
A2 & \bigsqcup(\vec{x}, \bigsqcup(\vec{y})) = \bigsqcup(\vec{x}, \vec{y}) \qquad \bigsqcup(x) = x\\
A3 & \bigsqcup(x, x, \vec{y}) = \bigsqcup(x, \vec{y})\\
A4 & \bigsqcup(1, \vec{x}) = 1\\
A5 & \bigsqcup(0, \vec{x}) = \bigsqcup(\vec{x})\\
A6 & \neg\neg x = x\\
A7 & \bigsqcup(x, \neg x, \vec{y}) = 1\\
A9 & \bigsqcup(\vec{x}, \vec{y}, \neg\bigsqcup(\vec{y})) = 1\\
A10 & \neg 0 = 1\\
A11 & \neg 1 = 0
\end{array}$$

**Table 4.** Terminating and confluent set of rewrite rules equivalent to L1-L8, L1'-L8'

#### **4.2 Complete Terminating Confluent Rewrite System**

The analysis of all possible pairs of rules to find all critical pairs is straightforward. It turns out that A9, A10, and A11 are the only rules we need to add to our system to obtain confluence. We have checked the complete list of critical pairs for rules A2-A11 (we omit the details due to lack of space). All those pairs are joinable, i.e., reduce to the same term, which implies, by Proposition 1, that the system is confluent. Table 4 shows the complete set of reduction rules (as well as commutativity).

Since the system A2-A11, considered over the language (S, ⨆, ¬, 0, 1) modulo commutativity of ⨆, is terminating and confluent, every term has a normal form. For any term t, we write its normal form as t↓. In particular, for any two terms t₁ and t₂, we have t₁ = t₂ in our theory iff t₁ ↔* t₂ iff t₁↓ and t₂↓ are equivalent terms modulo commutativity. We finally reach our conclusion: an algorithm that computes the normal form (modulo commutativity) of any term gives a decision procedure for the word problem for orthocomplemented bisemilattices.

# **5 Algorithm and Complexity**

The rewriting system readily gives us a quadratic algorithm. Indeed, using our base algorithm for DAG equivalence, we can check in linear time for application cases of any of the rewriting rules A2-A11 of Table 4, modulo commutativity. Since a term can only be reduced up to n times, the total time spent before finding the normal form of a term is at most quadratic. It is, however, possible to find the normal form of a term in a single pass of our equivalence algorithm, resulting in a more efficient algorithm.
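The quadratic procedure just described can be sketched in a few lines of Python (our illustration, not the paper's implementation, which is the Scala-like code of Figure 7). Terms are nested tuples: `("or", c1, ..., ck)`, `("not", t)`, the constants `"0"` and `"1"`, or a variable name. Rules A2-A7, A10, and A11 are applied until no rule fires; A9 and the DAG structure sharing are omitted for brevity.

```python
# Naive normalization by exhaustive rewriting (our sketch, not the paper's
# implementation).  Rules A2-A7, A10, A11 of Table 4 are applied until no
# rule fires; A9 and the DAG structure sharing are omitted for brevity.

def step(t):
    """Apply one rewrite rule somewhere in t; return (new_term, changed)."""
    if isinstance(t, str):                    # variable, "0", or "1"
        return t, False
    if t[0] == "not":
        c, changed = step(t[1])
        if changed:
            return ("not", c), True
        if c == "0":                          # A10: not 0 = 1
            return "1", True
        if c == "1":                          # A11: not 1 = 0
            return "0", True
        if not isinstance(c, str) and c[0] == "not":
            return c[1], True                 # A6: not not x = x
        return ("not", c), False
    # t is a disjunction ("or", ...)
    kids = list(t[1:])
    for i, k in enumerate(kids):              # first rewrite inside children
        k2, changed = step(k)
        if changed:
            return ("or", *kids[:i], k2, *kids[i + 1:]), True
    new = []
    for k in kids:
        if k == "1":                          # A4: or(1, xs) = 1
            return "1", True
        if k == "0":                          # A5: or(0, xs) = or(xs)
            continue
        if not isinstance(k, str) and k[0] == "or":
            new.extend(k[1:])                 # A2: flatten nested or
            continue
        if k in new:                          # A3: idempotence (modulo A1)
            continue
        if ("not", k) in new or (not isinstance(k, str)
                                 and k[0] == "not" and k[1] in new):
            return "1", True                  # A7: x together with not x
        new.append(k)
    if not new:                               # only zeros left: collapse to 0
        return "0", True
    if len(new) == 1:                         # A2b: or(x) = x
        return new[0], True
    nt = ("or", *new)
    return nt, nt != t

def normalize(t):
    changed = True
    while changed:
        t, changed = step(t)
    return t
```

For example, `normalize(("or", "0", ("or", "x", "x")))` yields `"x"`. Each `step` takes time linear in the term and at most a linear number of steps is needed, matching the quadratic bound.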

#### **5.1 Combining Rewrite Rules and Tree Isomorphism**

We give an overview of how to combine rules A2-A7, A9, A10, and A11 within the tree isomorphism algorithm, which we present using Scala-like<sup>1</sup> pseudocode in Figure 7.

<sup>1</sup> https://www.scala-lang.org/

For conciseness, we omit the dynamic programming optimizations allowed by structure sharing in DAGs (which would store the normal form and additionally check whether a node was already processed). For each rule, we indicate the most relevant lines of the algorithm in Figure 7.

*A2* (Associativity, Lines 10, 20, 32, 42) When analysing a ⨆ node, after the recursive call, find all children that are ⨆ nodes themselves and replace them by their own children. This is simple enough to implement, but there is a caveat in terms of complexity. We come back to it in Section 5.2.

*A3* (Idempotence, Lines 8, 31, 35) This corresponds to the fact that we eliminate duplicate children in disjunctions. When reaching a ⨆ node, after having sorted the codes of its children, remove all duplicates before computing its own code.

*A4, A5* (Bounds, Lines 8, 31, 35, 11, 36) To account for those axioms, we reserve special codes for the nodes 1 and 0. For A4, when we reach some ⨆ node that has 1 as one of its children, we replace the whole node by 1. For A5, we simply remove children with the same code as 0 from the parent node before computing its own code.
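The effect of A3-A5 on the sorted child codes (cf. lines 8 and 11 of Figure 7) can be illustrated with a tiny Python snippet (ours); the reserved codes 0 and 1 for the constants are carried over from the figure:

```python
# Child-code preprocessing for A3-A5 (our illustration of Figure 7, lines 8
# and 11).  Codes 0 and 1 are reserved for the constants 0 and 1.
child_codes = [7, 3, 7, 0, 5, 3]
L = sorted(set(c for c in child_codes if c != 0))  # drop 0s (A5), dedup (A3)
assert L == [3, 5, 7]
assert 1 not in L   # otherwise A4 would replace the whole node by 1
```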

*A6* (Involution, Lines 17, 22) When reaching a negation node whose child is itself a negation node, replace the parent node by its grandchild before assigning it a code.

*A7* (Complement, Lines 11, 36) As explained earlier, our representation of nodes lets us detect cases of A7 as follows. First, recall that we have already applied double negation elimination, so two "opposite" nodes cannot both start with a negation. We can then simply separate the children into negated and non-negated ones (after the recursive call), sort them by their assigned codes, and look for collisions.

*A9* (Also Complement, Lines 11, 36) This rule is slightly trickier to apply. When analysing a ⨆ node x, after computing the codes of all children of x, find all children of the form ¬⨆. For every such node, take the set of its own children and check whether it is a subset of the set of all children of x. If so, rule A9 applies. In other words, we look for collisions between grandchildren (through a negation) and children of every ⨆ node.
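The combined collision check for A7 and A9 can be sketched as follows (our Python illustration of the role of checkForContradiction in Figure 7; the input decomposition is an assumption, and for clarity we use sets, whereas the figure achieves the stated log-linear bound with sorted code lists):

```python
def check_for_contradiction(child_codes, neg_inner, neg_disj_inner):
    """Sketch of the A7/A9 collision check on a disjunction node.
    child_codes:    codes of all children of the node.
    neg_inner:      codes x such that some child is the negation of x (A7).
    neg_disj_inner: for each child of shape not-or(ys), the codes of ys (A9).
    """
    cs = set(child_codes)
    if cs & set(neg_inner):          # A7: some child x collides with child ¬x
        return True
    # A9: the children of some ¬⨆(y⃗) child are a subset of this node's children
    return any(set(ys) <= cs for ys in neg_disj_inner)
```

For example, a node with children coded {2, 3, 9}, where code 9 stands for a ¬⨆ child whose own children are coded {3}, triggers the A9 case.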

*A10, A11* (Identities, Lines 17, 26) These rules are simple. In a ¬ node, if the child has the same code as 0 (resp. 1), assign code 1 (resp. 0) to the negated node.

#### **5.2 Case of Quadratic Runtime for the Basic Algorithm**

All the rules we introduced in the previous section into Algorithm 1 take time (log-)linear in the number of children of a node, which is no more than the time already spent in the DAG/tree isomorphism algorithm. For A3, checking for duplicates is done in linear time in an ordered data structure. A4 and A5 (Bounds) amount to searching for specific values, which takes logarithmic time in the size of the list. A6 (Involution) takes constant time. A7 (Complement) is detected by finding a collision between two separate ordered lists, also easily done in (log-)linear time. A9 (Also Complement) consists in verifying whether grandchildren of a node are also children, and since children are sorted this takes log-linear time in the number of grandchildren. Since a node is the grandchild of only one other node, the same computation as in the original algorithm holds. A10 and A11 take constant time. Hence, the total time complexity is 𝒪(n log(n)), as in the algorithm for tree isomorphism.

As stated in Section 3 regarding the algorithm for DAG equivalence whose complexity we aim to preserve, the time complexity analysis crucially relies on the fact that in a tree, a node is never the child (or grandchild) of more than one node during the execution. However, this is generally not true in the presence of associativity. Indeed, consider the term represented in Figure 4. The 5th ⨆ node has 2 children, but after applying

**Fig. 4.** A term with quadratic runtime

A2, the 4th has 3 children, the 3rd has 4 children, and so on. Generalizing such an example, since each xᵢ is the child of all higher ⨆ nodes, our key property does not hold and the algorithm's runtime becomes quadratic. Of course, such a simple counterexample is easily handled by a leading pass of associativity reduction before running the whole algorithm. It turns out, however, that this is not sufficient, since cases of associativity can appear after the application of the other A-rules.

In fact, there is only one rule that can create cases of rule A2, namely A6 (Involution). The remaining rules whose right-hand side can start with a ⨆ have their left-hand side already starting with ⨆. It may seem simple enough to also apply double negation elimination in a leading pass, but unfortunately, cases of A6 can in turn be created by other rules. It is easy to see, for similar reasons, that only the application of A2b (⨆(x) = x) can create such cases. And unfortunately, such cases of A2b can arise from rules A3 and A5, which can only be detected using the full algorithm. To summarize, the typical problematic case is depicted in Figure 5. This term is clearly equivalent to ⨆(x₁, x₂, x₃, x₄), but to detect this we must first find that z₁ and z₂ are equivalent to 0, so we cannot simply solve it with an early pass.

#### **5.3 Final Log-Linear Time Algorithm**

Fortunately, we can solve this problem at only a logarithmic price. Observe that if we were able to detect early the nodes that reduce to 0, the problem would not exist: when analysing a node, we would first call the algorithm on all subnodes equivalent to

**Fig. 5.** A non-trivial term with quadratic runtime

**Fig. 6.** The term of Figure 5 during the algorithm's execution

0, remove them, and then, when there is a single child left, remove the trivial disjunct, the double negation, and the successive disjunction (as in Figure 5) before doing the recursive call on the unique nontrivial child. However, we of course cannot know in advance which child will be equivalent to 0.

Moreover, note (still using Figure 5) that if the z-child is as large as the non-trivial node, then even if we do the "useless" work, we at least obtain that the size of the tree is divided by two, and hence the potential depth of the tree as well. By standard complexity analysis, the time penalty is then only a logarithmic factor.

The previous analysis suggests the following solution, reflected in Figure 7, lines 28-29. When analysing a node, make recursive calls on the children in order of their size, starting with the smallest and going up to the second biggest. If any of those children is non-zero, proceed as normal. If all children (but possibly the last) are equivalent to zero, then replace the current node by its biggest (and at this point non-analyzed) child, i.e., apply the second half of rule A2 (associativity). If applicable, apply double negation elimination and associativity as well before continuing the recursive call.

We illustrate this on the example of Figure 5. Consider the algorithm when reaching the second ⨆ node. There are two cases:


indeed computed the code of the disjunction that contains x₂ when it was unnecessary, since we apply associativity anyway. This "useless" work consists in sorting and applying axioms to the true children of the node (in this case x₂, x₃, and x₄) and takes time quasilinear in the number of such children. In particular, it is bounded by the size of the subtree itself, which we know is the smaller of the two.

An analogous situation can arise from the use of rule A3 (idempotence), but there the two subtrees trivially must have the same number of (real) subnodes, so the same reasoning holds.

Denote by |n| the size of a node n, i.e., the number of descendants of n. We compute the penalty of useless work incurred by computing the children of a node n in the wrong order, i.e., by computing a non-0 child n_w when all others are 0. n_w cannot be the largest child of n, for otherwise we would have found that all other children are 0 before needing to compute n_w. Hence |n_w| ≤ |n|/2. It follows that the total amount of useless work is bounded by log(|n|) ⋅ W(n), where

$$W(n) \le |n|/2 + \sum\_{i} W(n\_i) \quad \text{for } \sum\_{i} |n\_i| < |n|.$$

It is clear that *𝑊* (*𝑛*) is maximized when *𝑛* has exactly two children of equal size:

$$W(n) \le |n|/2 + 2 \cdot W(n/2)$$

By observing that we can divide *𝑛* by 2 only log(*𝑛*) times,

$$W(n) \le \sum\_{m=1}^{\log(n)} 2^m \cdot |n|/2^m$$

so we obtain W(n) = 𝒪(|n| log(|n|)) and hence the total runtime is 𝒪(n (log n)²).

### **6 Conclusion**

We have described a decision procedure with log-linear time complexity for the word problem on orthocomplemented bisemilattices. This algorithm can also be simplified to apply to weaker theories. Dually, we believe it can be generalized to decide some stronger theories (still weaker than Boolean algebras) efficiently. While the word problem for orthocomplemented *lattices* was known to be in PTIME [15] and as such the membership of orthocomplemented *bisemilattices* in PTIME may not come as a surprise, this is, to the best of our knowledge, the first time that this result has been explicitly stated, and the first time that an algorithm with such low log-linear complexity was proposed for this or a related problem. The algorithm has not only low complexity but, according to our experience, is easy to implement. It can be used as an approximation for Boolean algebra equivalence, and we plan to use it as the basis of a kernel for a proof assistant. We also envision possible uses of the algorithm in SMT and SAT solvers. The algorithm is able to detect many natural and non-trivial cases of equivalence even on formulas that may be too large for existing solvers to deal with, so it may also complement an existing repertoire of subroutines used in more complex reasoning tasks. For a minimal working implementation in Scala closely following Figure 7, see https://github.com/epfl-lara/OCBSL.

```
1 def equivalentTrees(tau: Term, pi: Term): Boolean =
2 val codesSig: HashMap[(String, List[Int]), Int] = Empty
3 codesSig.update(("zero", Nil), 0); codesSig.update(("one", Nil), 1)
4 val codesNodes: HashMap[Term, Int] = Empty
5 def updateCodes(sig: (String, List[Int]), n: Node): Unit = ... // codesSig, codesNodes
6 def bool2const(b:Boolean): String = if b then "one" else "zero"
7 def rootCode(n: Term): Int =
8 val L = pDisj(n, Nil).map(codesNodes).sorted.filter(_≠ 0).distinct
9 if L.isEmpty then updateCodes(("zero", Nil), n)
10 else if L.length == 1 then codesNodes.update(n, L.head)
11 else if L.contains(1) or checkForContradiction(L) then updateCodes(("one", Nil), n)
12 else updateCodes(("or", L), n)
13 codesNodes(n)
14 def pDisj(n:Node, acc:List[Node]): List[Node] = n match
15 case Variable(id) ⇒ updateCodes((id.toString, Nil), n); return n :: acc
16 case Literal(b) ⇒ updateCodes((bool2const(b), Nil), n); return n :: acc
17 case Negation(child) ⇒ pNeg(child, n, acc)
18 case Disjunction(children) ⇒ children.foldLeft(acc)((a, c) ⇒ pDisj(c, a))
19 def pNeg(n:Node, parent:Node, acc:List[Node]): List[Node] = n match // under negation
20 case Negation(child) ⇒ pDisj(child, acc)
21 case Variable(id) ⇒ updateCodes((id.toString, Nil), n)
22 updateCodes(("neg", List(codesNodes(n))), parent)
23 parent :: acc
24 case Literal(b) ⇒ updateCodes((bool2const(b), Nil), n)
25 updateCodes((bool2const(!b), Nil), parent)
26 parent :: acc
27 case Disjunction(children) ⇒
28 val r0 = orderBySize(children)
29 val r1 = r0.tail.foldLeft(Nil: List[Node])((a, c) ⇒ pDisj(c, a))
30 val r2 = r1.map(codesNodes).sorted.filter(_≠ 0).distinct
31 if isEmpty(r2) then pNeg(r0.head, parent, acc)
32 else val s1 = pDisj(r0.head, r1)
33 val s2 = s1 zip (s1 map codesNodes)
34 val s3 = s2.sorted.filter(_≠ 0).distinct // all wrt. 2nd element
35 if s3.contains(1) or checkForContradiction(s3)
36 then updateCodes(("one", Nil), n); updateCodes(("zero", Nil), parent)
37 parent :: acc
38 else if isEmpty(s3) then updateCodes(("zero", Nil), n)
39 updateCodes(("one", Nil), parent)
40 parent :: acc
41 else if s3.length == 1 then pNeg(s3.head._1, parent, acc)
42 else updateCodes(("or", s3 map (_._2)), n)
43 updateCodes(("neg", List(codesNodes(n))), parent)
44 parent :: acc
45 return rootCode(tau) == rootCode(pi)
```
**Fig. 7.** Final algorithm. distinctBy runs in log-linear time. checkForContradiction detects application cases of A7 and A9 (Complement). Maintenance of size field used by orderBySize elided.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Monitoring and Analysis

# A Theoretical Analysis of Random Regression Test Prioritization

Pu Yi<sup>1</sup> , Hao Wang<sup>1</sup> , Tao Xie<sup>1</sup> () , Darko Marinov<sup>2</sup> , and Wing Lam<sup>3</sup>

<sup>1</sup> Peking University, Beijing, China
lukeyi@pku.edu.cn, tony.wanghao@stu.pku.edu.cn, taoxie@pku.edu.cn
<sup>2</sup> University of Illinois Urbana-Champaign, Urbana, IL, USA
marinov@illinois.edu
<sup>3</sup> George Mason University, Fairfax, VA, USA
winglam@gmu.edu

Abstract. Regression testing is an important activity to check software changes by running the tests in a test suite to inform the developers whether the changes lead to test failures. Regression test prioritization (RTP) aims to inform the developers faster by ordering the test suite so that tests likely to fail are run earlier. Many RTP techniques have been proposed and are often compared with the random RTP baseline by sampling some of the n! different test-suite orders for a test suite with n tests. However, there is no theoretical analysis of random RTP. We present such an analysis, deriving probability mass functions and expected values for metrics and scenarios commonly used in RTP research. Using our analysis, we revisit some of the most highly cited RTP papers and find that some presented results may be due to insufficient sampling. Future RTP research can leverage our analysis and need not use random sampling but can use our simple formulas or algorithms to more precisely compare with random RTP.

Keywords: Regression Test Prioritization · Random · Analysis

# 1 Introduction

Software developers commonly check their code by running tests. Regression testing [48] runs tests after code changes, to check whether the changes break the existing functionality. A test that passes before the changes but fails after indicates that the changes should be debugged (unless the test is flaky [25]). Finding test failures faster enables the developers to start debugging earlier.

A popular regression testing approach is regression test prioritization (RTP) [12, 19,21,23,38,39,48], which runs the tests from a test suite in an order that aims to find test failures sooner. For example, Google [14] and Microsoft [42] report on using RTP in industry. More formally, a test suite T is a set (unordered) of tests, and RTP techniques produce a test-suite order—a permutation of the tests in the test suite—in which to run the tests. Various RTP techniques have been proposed in the literature since the seminal papers from 20+ years ago [12,36,38,47] that have garnered thousands of citations.

RTP techniques are often compared with random RTP. Our inspection [44] of the 100 most cited papers on RTP shows that 56 papers use random RTP as a comparison baseline. Although random RTP often performs worse than advanced techniques, recent papers still use random RTP, because it has a small overhead and may perform well in certain scenarios. We additionally check papers published in the latest testing conferences (ICST and ISSTA 2020/2021) and find that 50% (2/4) of the RTP papers [6,15,30,34] use random RTP. While random RTP has been used as a baseline for 20+ years, all evaluations have been empirical, performed by randomly sampling some of the n! orders for a test suite with n tests. The selected sample size varies (20, 50, 100, 200, 1000), with no clear correlation with n; some papers do not even report the sample size [44]. However, no prior work has presented a theoretical analysis of random RTP.

Before we summarize our analysis, we describe some metrics and scenarios most commonly used in RTP research. We first introduce some terms: a failure is simply a failing test, a fault is the root cause (bug in the code) of the failure, and we say that a failure detects a fault if the failure is caused by the fault [36]. In general, many failures may detect the same fault, and one failure may detect many faults. We capture the relationship between failures and faults by a failure-to-fault matrix. To compare RTP techniques, researchers quantify how fast (test-suite) orders find all faults (not failures, because having many failures that detect the same fault is not as valuable as having a few failures that detect many faults).

RTP evaluations involve three aspects: RTP metric, failure-to-fault matrix, and allowed orders. The most widely used metric is Average Percentage of Faults Detected (APFD) [38], denoted as α for short. Another popular metric is Cost-Cognizant APFD (APFDc) [11], denoted as γ for short. Section 2 formally defines these metrics based on the failure-to-fault matrix; each metric assigns to an order a value between 0 and 1, with higher values indicating better orders. Traditional RTP research used seeded faults, which allow fairly precisely deriving the failure-to-fault matrix [10, 22, 37] that can arbitrarily map failures and faults. Recent RTP research mostly uses real failures, e.g., analyzing real regression testing runs from continuous integration systems [14,15,23,24,27,34], making it rather difficult to precisely derive the failure-to-fault matrix. As a result, the increasingly popular failure-to-fault matrices are all-to-one, where all failures map to the same one fault, and one-to-one, where each failure maps to a distinct fault.

To describe allowed orders, we note that real test suites often partition tests, e.g., in JUnit [20], each test method belongs to a test class. Traditional research ignores this partitioning and allows all n! orders (Ωa(T) for short) of n tests. We introduced compatible<sup>4</sup> orders [46] (Ωc(T) for short) that consider the partitioning and allow only orders that do not interleave tests from different classes.

We present the first theoretical analysis for the cases most commonly used in RTP research. We introduce an algorithm for efficiently computing the exact probability mass functions (PMFs) of α for all failure-to-fault matrices and Ωa(T). We demonstrate the efficiency of our algorithm on the benchmarks from

<sup>4</sup> Our original term was class-compatible [46] because we considered as tests only test methods in test classes, but the concept easily generalizes to other kinds of tests.

Fig. 1: Example metrics for two orders (Com. is compatible) for n = 5, m = 3; class C1 has 3 tests with costs ⟨40, 20, 60⟩, class C2 has 2 with ⟨100, 80⟩; C1.t1 detects fault F1; C1.t3 detects F2; C2.t1 detects F2 and F3; C2.t2 detects F3.

the largest RTP dataset for Java projects [34]. For the common all-to-one and one-to-one cases, we further derive a closed-form formula and a good approximation, respectively. We also derive closed-form formulas for the expected values for both α and γ for the general failure-to-fault matrix, for both Ωa(T) and Ωc(T), and we compare these values in various scenarios. Interestingly, on average, Ωa(T) can perform much better (up to 1/2) than Ωc(T) for certain scenarios, but cannot perform much worse (only up to 1/6) for any scenario; Section 5.1 presents this comparison, including two scenarios near the limits (1/2 and 1/6).

We finally derive two interesting properties of the α and γ metrics. Using these properties, we revisit some of the highly cited papers on RTP and find that some presented results may be biased due to insufficient sampling. Overall, our theoretical analysis provides new insights into the random RTP widely used in prior work but only via empirical sampling. Our results show that in many cases researchers need not run sampling but can use simple formulas or algorithms to obtain more precise statistics for the random RTP metrics.

#### 2 Preliminaries

Our notation largely follows the prior work that introduced APFD (α) [38] and APFD<sup>c</sup> (γ) [11], but we make explicit the failure-to-fault matrix. Let n be the number of tests and m the number of faults detected by (some of) these tests. Let M be a failure-to-fault matrix, i.e., an n × m Boolean matrix such that Mj,i = true if (the failure of) test j detects fault i, and each fault has at least one failure (i.e., ∀i.∃j.Mj,i). Let T be the set of tests in the test suite. We denote the set of tests that detect fault i as Ti = {j|Mj,i}. In general, Ti and Ti′ for i ̸= i′ need not be disjoint, because one failing test can detect multiple faults. The total number of failures is k = |{j|∃i.Mj,i}|, and we use ki = |Ti|.
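As a concrete instance of these definitions, the failure-to-fault matrix of the Fig. 1 scenario can be written down directly (our Python encoding, not from the paper):

```python
# Failure-to-fault matrix M for Fig. 1 (rows: C1.t1, C1.t2, C1.t3, C2.t1,
# C2.t2; columns: F1, F2, F3).  C1.t2 passes, so its row is all False.
M = [
    [True,  False, False],   # C1.t1 detects F1
    [False, False, False],   # C1.t2 detects no fault
    [False, True,  False],   # C1.t3 detects F2
    [False, True,  True],    # C2.t1 detects F2 and F3
    [False, False, True],    # C2.t2 detects F3
]
n, m = 5, 3
k = sum(1 for row in M if any(row))                   # number of failing tests
k_i = [sum(row[i] for row in M) for i in range(m)]    # |T_i| for each fault
```

Here k = 4 and (k₁, k₂, k₃) = (1, 2, 2).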

For an order o (a permutation of T), we use <o to compare the positions of two tests t and t′ in the order: t <o t′ denotes that t precedes t′ in o, and t ≤o t′ denotes that t = t′ or t <o t′. We denote the jth test in an order o as tj(o). Let τi(o) = min{j | Mtj(o),i} be the position of the first test to detect fault i in o. Prior work [11,38] defined the metrics α and γ (using the notation TF instead of τ). We use α(o) and γ(o) to indicate α and γ, respectively, for a given order o. We drop o from <o, ≤o, tj(o), τi(o), α(o), and γ(o) when clear from the context.

The most popular RTP metric is α [38], defined for an order o as follows.

Definition 1 (α). APFD is defined as

$$\alpha = 1 - \frac{\sum\_{i=1}^{m} \tau\_i}{nm} + \frac{1}{2n} \tag{1}$$

Plotting the percentage of faults detected against the percentage of executed tests, α represents the area under the curve, as shown in two examples in Fig. 1. The diagonal lines interpolate the percentage of faults detected and lead to nice properties of mean/median α values and symmetry (Section 6). α ranges between 0 and 1, more precisely between 1/(2n) and 1−1/(2n). A larger α indicates that an order detects faults earlier, on average.
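Definition 1 can be evaluated on the example of Fig. 1. The Python sketch below (ours, not from the paper) computes α for the order ⟨C2.t2, C1.t1, C1.t3, C1.t2, C2.t1⟩ (o1 of Fig. 1, as reconstructed from the text) using the detection data from the caption:

```python
# APFD (Definition 1), our Python sketch.  `detects` maps each test to the
# set of faults its failure detects (data from the caption of Fig. 1).
def apfd(order, detects, faults):
    n, m = len(order), len(faults)
    # tau_i: 1-based position of the first test that detects fault i
    tau = {f: min(j + 1 for j, t in enumerate(order) if f in detects[t])
           for f in faults}
    return 1 - sum(tau.values()) / (n * m) + 1 / (2 * n)

detects = {"C1.t1": {"F1"}, "C1.t2": set(), "C1.t3": {"F2"},
           "C2.t1": {"F2", "F3"}, "C2.t2": {"F3"}}
o1 = ["C2.t2", "C1.t1", "C1.t3", "C1.t2", "C2.t1"]
alpha = apfd(o1, detects, ["F1", "F2", "F3"])   # 1 - 6/15 + 1/10 = 0.7
```

Here τ = (2, 3, 1), so α = 1 − 6/15 + 1/10 = 0.7.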

While α effectively considers the number of tests, the "cost-cognizant" metric γ considers the cost of tests [11]. The cost can be measured in various ways, but most work uses the test runtime. We use σ(t) to denote the cost (runtime) of a test t; the total cost of a set of tests T is σ(T) = Σ_{t∈T} σ(t).

Definition 2 (γ). APFD<sup>c</sup> is defined as

$$\gamma = \frac{\sum\_{i=1}^{m} \left( \sum\_{j=\tau\_i}^{n} \sigma(t\_j) - \frac{1}{2} \sigma(t\_{\tau\_i}) \right)}{m \cdot \sigma(T)} \tag{2}$$

Plotting the percentage of faults detected against the percentage of total test-suite cost, γ represents the area under the curve, as shown in Fig. 1. Note that α can be viewed as a special case of γ where ∀t, t′ ∈ T.σ(t) = σ(t ′ ).
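Definition 2 can likewise be evaluated on the same order, using the test costs from the caption of Fig. 1; applying the formula gives γ = 610/900 ≈ 0.678 for this order (our Python sketch, not from the paper):

```python
# APFDc (Definition 2), our Python sketch, on the order o1 of Fig. 1
# reconstructed from the text; costs come from the caption of Fig. 1.
def apfdc(order, detects, faults, cost):
    n, m = len(order), len(faults)
    total = sum(cost[t] for t in order)
    num = 0.0
    for f in faults:
        tau = min(j for j, t in enumerate(order) if f in detects[t])  # 0-based
        num += sum(cost[order[j]] for j in range(tau, n)) - cost[order[tau]] / 2
    return num / (m * total)

detects = {"C1.t1": {"F1"}, "C1.t2": set(), "C1.t3": {"F2"},
           "C2.t1": {"F2", "F3"}, "C2.t2": {"F3"}}
cost = {"C1.t1": 40, "C1.t2": 20, "C1.t3": 60, "C2.t1": 100, "C2.t2": 80}
o1 = ["C2.t2", "C1.t1", "C1.t3", "C1.t2", "C2.t1"]
gamma = apfdc(o1, detects, ["F1", "F2", "F3"], cost)   # 610/900
```

Setting all costs equal reduces γ to α, illustrating the special-case remark above.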

In practice, tests often belong to classes<sup>5</sup>—e.g., JUnit [20] test methods belong to test classes, Maven [28] test classes belong to modules, and pytest [35] test functions belong to test files—and tests from each class run together. Our prior work [46] defined compatible orders as those where all tests from each class are consecutive. We use TC to denote the set of tests in a class C. An order o is compatible if ∀C, j ≤ j′ ≤ j′′. tj(o) ∈ TC ∧ tj′′(o) ∈ TC ⇒ tj′(o) ∈ TC. For example, o2 in Fig. 1 is compatible, while o1 is not. To distinguish the cases for all orders from the cases for only compatible orders, we use the subscripts a and c, respectively; e.g., Ea[x] and Ec[x] represent the expected value of x for the uniform selection of all orders and compatible orders, respectively, and Pa(A) and Pc(A) represent the probability of event A for the uniform selection of all orders and compatible orders, respectively. We denote the set of all orders and of all compatible orders for T as Ωa(T) and Ωc(T), respectively [46].

We analyze RTP techniques in scenarios, each of which consists of a test suite with n tests, m faults, the failure-to-fault matrix, the cost of each test, and for Ωc(T) the class of each test. To analyze compatible orders, we introduce some new notation to indicate the class of tests. We use Ti,C = T<sup>i</sup> ∩ T<sup>C</sup> to denote the

<sup>5</sup> The term class for a set of tests that run together need not represent a test class.

set of tests in class C that detect fault i. Let C be the set of all classes, and Ci the set of classes that contain at least one test that detects fault i, i.e., Ci = {C ∈ C | Ti,C ̸= ∅}. Let C(t) be the class that t belongs to, i.e., t ∈ TC(t). The number of compatible orders is |Ωc(T)| = |C|! · ∏_{C∈C} |TC|!.
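For the test suite of Fig. 1 this gives |Ωc(T)| = 2! · 3! · 2! = 24 out of the 5! = 120 orders in Ωa(T), which can be cross-checked by brute force (our Python sketch):

```python
from itertools import permutations

def is_compatible(order, cls):
    """True iff the tests of each class appear consecutively in `order`."""
    finished, current = set(), None
    for t in order:
        c = cls[t]
        if c != current:
            if c in finished:        # class resumes after being interrupted
                return False
            if current is not None:
                finished.add(current)
            current = c
    return True

cls = {"C1.t1": "C1", "C1.t2": "C1", "C1.t3": "C1",
       "C2.t1": "C2", "C2.t2": "C2"}
compat = [o for o in permutations(cls) if is_compatible(o, cls)]
# |C|! * prod over C of |T_C|! = 2! * 3! * 2! = 24
```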

For a set of orders S, be it Ωa(T) or Ωc(T), the probability mass function (PMF) of a metric, α or γ, is a function p from the metric value to its probability: p(x) = P(metric = x) = |{o ∈ S|metric(o) = x}|/|S|. We next derive some PMFs as all prior RTP work shows only sampled distributions of random RTP.

#### 3 PMF of α

To analyze the PMF of the metric α, we first propose an algorithm to calculate the PMF of α for the general case of M. We then discuss two special cases, i.e., all-to-one and one-to-one, which are the most common in recent RTP research.

#### 3.1 Algorithm to Calculate PMF of α for the General Case

To calculate the PMF of α, a naïve algorithm would enumerate all n! orders and compute α for each order. In theory, α can take O(n!) different values, e.g., when m = Σ_{i=1}^{n} nⁱ and all n tests fail and detect n, n², …, nⁿ different faults; then each of the n! orders has a different α. In practice, however, the number of faults m and the number of failing tests k are usually small, e.g., in our evaluation dataset [34], 2906 out of 2980 (98%) scenarios have k ≤ 10. We present an algorithm that computes the exact PMF with O(n²mk · k!) time complexity. Despite the k! factor, the algorithm runs in reasonable time in practice, under 30 seconds for any of the 2906 scenarios. When k > 10, one can resort to sampling.

We next describe the intuition for our algorithm. Σ_{i=1}^{m} τi is the only part of α that depends on the (test-suite) order, so we first calculate the PMF of this sum and then convert it to the PMF of α. Iterating over the faults does not lead to a nice recursive formulation. Our key insight is to instead iterate over the positions of all k failing tests. We view Σ_{i=1}^{m} τi as a weighted sum

$$\sum\_{i=1}^{m} \tau\_i = \sum\_{j=1}^{k} w\_j \phi\_j \tag{3}$$

where ϕ_j is the position of the j-th failing test in the order, and w_j ≥ 0 is the weight, calculated as the number of faults detected first by the j-th failing test (line 11 of Algorithm 1). For example, consider the order o_1 in Fig. 1. The relative order of the k = 4 failing tests is ρ = ⟨C2.t2, C1.t1, C1.t3, C2.t1⟩; we use the metavariable ρ to distinguish this notation from o, the order of all n tests. For this relative order, w = ⟨1, 1, 1, 0⟩ because the m = 3 faults are detected first by C2.t2, C1.t1, and C1.t3. The positions for this relative order ρ are ϕ = ⟨1, 2, 3, 5⟩ because the 4 failing tests in ρ appear in these positions in the order o_1.

We call ϕ = ⟨ϕ_1, ..., ϕ_k⟩ valid if 1 ≤ ϕ_1 < ... < ϕ_k ≤ n. Both sequences ϕ and w = ⟨w_1, ..., w_k⟩ can vary for different orders. While ϕ has $\binom{n}{k}$ valid

#### Algorithm 1: Calculate the PMF of α

1  Input: n, m, M                 // the number of tests and faults, and the failure-to-fault matrix
2  Output: p                      // the PMF of α: p(x) = P(α = x)
3  Function PMF()                 // main function; return the PMF of α for all orders
4    k = |{j | ∃i. M_{j,i}}|      // the number of failing tests in M; in practice k ≪ n
5    q = PMF_sum()                // compute the PMF of Σ_{i=1}^m τ_i
6    return λx. q(mn − mnx + m/2) // convert that PMF to the PMF of α
7  Function PMF_sum()             // return the PMF of Σ_{i=1}^m τ_i for all orders
8    P = ⟨PMF_rorder(ρ), ∀ρ ∈ perms({j | ∃i. M_{j,i}})⟩  // enumerate all relative orders
9    return λx. Σ_{p∈P} p(x)/|P|  // average the PMFs of Σ_{i=1}^m τ_i over the relative orders
10 Function PMF_rorder(ρ)         // return the PMF of Σ_{i=1}^m τ_i for a relative order ρ
11   w = ⟨|{i | M_{ρ_j,i} ∧ ∄j′ < j. M_{ρ_{j′},i}}|, ∀j ∈ 1..k⟩  // w are the weights in formula (3)
12   return λs. f(w, k, n)(s)/$\binom{n}{k}$  // the total number of valid ϕ is $\binom{n}{k}$
13 Function f(w, g, h)            // return f_{g,h} given weights w, calculated with formula (4);
                                  // the function should be memoized to reuse results for repeated w, g, h
14   if g > h then
15     return λs. 0
16   if g = 0 then
17     return λs. 1_{s=0}
18   return λs. f(w, g, h−1)(s) + f(w, g−1, h−1)(s − w_g·h)

possibilities, we note that w has at most k! possibilities (with k! ≪ $\binom{n}{k}$ as k ≪ n in practice) because w depends only on ρ. Therefore, we first fix w by enumerating the k! relative orders of the k failing tests. Then, for each relative order, the problem of calculating the PMF of Σ_{i=1}^m τ_i = Σ_{j=1}^k w_j ϕ_j becomes "given w, count the number of valid ϕ such that Σ_{j=1}^k w_j ϕ_j = s, for each s", which can be solved recursively as follows.

Let f_{g,h}(s) be the number of assignments of the values ϕ_1, ..., ϕ_g such that 1 ≤ ϕ_1 < ... < ϕ_g ≤ h and Σ_{j=1}^g w_j ϕ_j = s. The problem is to find f_{k,n}(s). As base cases, (1) f_{g,h}(s) = 0 for g > h because g ≤ ϕ_g ≤ h cannot hold; (2) f_{0,h}(s) = 1_{s=0}, where 1 is the indicator function, because only the empty sequence ⟨⟩ is valid and Σ_{j=1}^0 w_j ϕ_j = 0. For all h ≥ g > 0, the number of assignments for f_{g,h}(s) has two cases: (1) if ϕ_g ≤ h − 1, the number equals f_{g,h−1}(s) by definition; (2) if ϕ_g = h, the number equals the number of assignments of ϕ_1, ..., ϕ_{g−1} such that ϕ_{g−1} ≤ ϕ_g − 1 = h − 1 and Σ_{j=1}^{g−1} w_j ϕ_j = (Σ_{j=1}^g w_j ϕ_j) − w_g ϕ_g = s − w_g h, which is f_{g−1,h−1}(s − w_g h). In total,

$$f\_{g,h}(s) = \begin{cases} 0 & g > h \\ \mathbf{1}\_{s=0} & g = 0 \\ f\_{g,h-1}(s) + f\_{g-1,h-1}(s - w\_g h) & \text{otherwise} \end{cases} \tag{4}$$

After solving f_{k,n}, we get the PMF of Σ_{i=1}^m τ_i for each relative order of the k failing tests. Because each of the k! relative orders has the same probability by symmetry, we simply average their PMFs to get the PMF of Σ_{i=1}^m τ_i for all orders. Finally, we convert the PMF of Σ_{i=1}^m τ_i to the PMF of α.
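To make the recursion concrete, here is a small Python sketch of Algorithm 1 using exact rational arithmetic. The function and variable names are ours, and memoization here is per relative order, whereas the paper's implementation (117 lines of bottom-up C++) also reuses results across repeated w:

```python
from fractions import Fraction
from functools import lru_cache
from itertools import permutations
from math import comb

def pmf_alpha(n, m, M):
    """Exact PMF of alpha over all n! orders, following Algorithm 1.
    M[j][i] is True iff test j detects fault i."""
    failing = [j for j in range(n) if any(M[j])]
    k = len(failing)

    def f(w):
        # f_{g,h} from formula (4), as {s: count} over valid phi_1 < ... < phi_g <= h
        @lru_cache(maxsize=None)
        def rec(g, h):
            if g > h:
                return {}
            if g == 0:
                return {0: 1}
            out = dict(rec(g, h - 1))               # case phi_g <= h - 1
            for s, c in rec(g - 1, h - 1).items():  # case phi_g = h
                out[s + w[g - 1] * h] = out.get(s + w[g - 1] * h, 0) + c
            return out
        return rec(k, n)

    rhos = list(permutations(failing))  # all k! relative orders of the failing tests
    total = {}                          # PMF of sum_i tau_i, averaged over rhos
    for rho in rhos:
        # w_j = number of faults detected first by the j-th failing test in rho
        w = tuple(sum(1 for i in range(m)
                      if M[rho[j]][i] and not any(M[rho[jj]][i] for jj in range(j)))
                  for j in range(k))
        for s, c in f(w).items():
            total[s] = total.get(s, 0) + Fraction(c, comb(n, k) * len(rhos))
    # alpha = 1 - s/(n*m) + 1/(2n); this map is injective, so no alpha values collide
    return {1 - Fraction(s, n * m) + Fraction(1, 2 * n): p for s, p in total.items()}
```

For an all-to-one matrix this sketch reproduces the closed formula of Theorem 3 below, and for a one-to-one matrix its expected value is exactly 1/2, matching Section 4.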


Table 1: Number of tests, failures, runtime (in ms), and Jensen-Shannon (JS) distance for 10 largest scenarios [34] and one synthetic scenario (TSmax)

We next describe Algorithm 1 in more detail. The input is the number of tests n, the number of faults m, and the failure-to-fault matrix M. The main function PMF invokes PMF_sum to get the PMF of Σ_{i=1}^m τ_i and converts it to the PMF of α. The function PMF_sum enumerates all relative orders ρ of the k failing tests, invokes PMF_rorder(ρ) to get the PMF of Σ_{i=1}^m τ_i for each relative order, and averages these PMFs to get the PMF of Σ_{i=1}^m τ_i for all (relative) orders. Function PMF_rorder(ρ) computes the weights w from formula (3), invokes f(w, k, n) to get f_{k,n} for w, and converts it to the PMF of Σ_{i=1}^m τ_i.

We finally discuss the time complexity and the empirical performance of Algorithm 1. The major cost comes from computing the function f. Because there are O(k!) different w and 0 ≤ g ≤ k, g ≤ h ≤ n, there are O(nk · k!) different inputs for which to compute f. With memoization, f is computed only once per input. Each computation takes O(nm) because |support(f_{g,h})| = O(nm), as 1 ≤ τ_i ≤ n for 1 ≤ i ≤ m. Therefore, the cost of computing f for all inputs is O(n²mk · k!). The other costs in the algorithm are lower than the cost of f; hence, the overall time complexity of Algorithm 1 is O(n²mk · k!).

Implementation: While top-down recursion makes the algorithm easier to present, for better performance our implementation uses bottom-up dynamic programming to compute f. Our implementation fits in only 117 lines of C++. Dataset: We use the RTP dataset with the most Java projects [34] for our evaluation. In this dataset, each test is a test class and each "class" is a Maven module [28]. The dataset has 2980 scenarios, and 2906 (98%) have k ≤ 10. For each k ≤ 10, we select the scenario with the maximum number of tests (n) from the dataset. We also create a synthetic scenario with 2118 tests (the largest number of tests in the dataset) and 10 failures. We use both all-to-one and one-to-one failure-to-fault matrices on the selected scenarios.

Evaluation: As Table 1 shows, the code finishes in under 30 seconds (on a common laptop) for all real scenarios; it takes more time on the synthetic one, but the runtime is still only 33 seconds and 4 minutes for all-to-one and one-to-one, respectively.

#### 3.2 PMFs of α for Special Cases

As mentioned in Section 1, recent RTP research uses real failures and faults, with two kinds of failure-to-fault matrices: all-to-one and one-to-one. We discuss the PMFs of α for these two commonly used cases.

3.2.1 All-to-One: We first derive the PMF of α for all-to-one. In this case, m = 1, k ≥ 1, and in formula (3), w_1 = 1 and w_j = 0 for all j > 1. Therefore, the recursive formula (4) becomes f_{g,h}(s) = f_{g,h−1}(s) + f_{g−1,h−1}(s) for g > 1, which is similar to Pascal's triangle. This observation hints that the PMF of α for all-to-one may have a closed formula with binomial coefficients.

Theorem 3 (The PMF of α for all-to-one failure-to-fault matrix).

$$P(\alpha = 1 - \frac{s}{n} + \frac{1}{2n}) = \frac{\binom{n-s}{k-1}}{\binom{n}{k}}, s \in \{1, 2, \dots, n-k+1\} \tag{5}$$

Proof. For all-to-one, the α value depends solely on τ_1, which is essentially ϕ_1 in formula (3). For 1 ≤ s ≤ n − k + 1, τ_1 = s holds as long as s = ϕ_1 < ... < ϕ_k ≤ n. To satisfy this condition, we just need to choose the remaining k − 1 positions after position s. Therefore, $\binom{n-s}{k-1}$ out of the $\binom{n}{k}$ ways to choose k positions among n satisfy the condition, so P(τ_1 = s) = $\binom{n-s}{k-1}/\binom{n}{k}$, and formula (5) directly follows.
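Formula (5) is straightforward to tabulate. A small Python sketch (the function name is ours) that returns the full PMF as exact fractions:

```python
from fractions import Fraction
from math import comb

def pmf_alpha_all_to_one(n, k):
    """PMF of alpha per formula (5): one fault (m = 1), detected by all k failing tests.
    Maps each alpha value 1 - s/n + 1/(2n) to its probability C(n-s, k-1)/C(n, k)."""
    return {1 - Fraction(s, n) + Fraction(1, 2 * n):
            Fraction(comb(n - s, k - 1), comb(n, k))
            for s in range(1, n - k + 2)}
```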

With (5), we can compute the PMF of α for all-to-one in O(n) time. We can compute the needed binomial coefficients iteratively, starting from $\binom{k-1}{k-1} = 1$, with the recurrence $\binom{n'+1}{k-1} = \frac{n'+1}{n'-k+2}\binom{n'}{k-1}$ for n′ ≥ k − 1, and get $\binom{n}{k} = \frac{n}{k}\binom{n-1}{k-1}$.

3.2.2 One-to-One: We next consider the PMF of α for one-to-one. In this case, m = k and each failing test finds a distinct fault, so for every relative order of the k failing tests, w_j = 1 for all j in formula (3). Therefore, running Algorithm 1 and memoizing on w, the complexity becomes O(n²k² + k!); the k! term remains because we still iterate through all the relative orders. We can avoid the k! term by checking in advance that the failure-to-fault matrix is one-to-one, so the complexity is O(n²k²).

Moreover, considering formula (4) when w_j = 1 for all j, f_{k,n} essentially models the problem of counting the partitions of s into k distinct summands from {1, 2, ..., n}. Specifically, f_{g,h}(s) can be viewed as the number of partitions of s into g distinct summands from {1, 2, ..., h}, and f_{g,h}(s) = f_{g,h−1}(s) + f_{g−1,h−1}(s − h) holds because the largest summand can be either less than h or exactly h, corresponding to f_{g,h−1}(s) and f_{g−1,h−1}(s − h), respectively. To the best of our knowledge, no closed formula is known for this problem. Considering that in our evaluation dataset 99.8% (2975/2980) of the scenarios have n²k² < 10⁹, the O(n²k²) algorithm is efficient enough for almost all practical cases.
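The one-to-one specialization of formula (4), with all w_j = 1, can be computed bottom-up in the style of a 0/1 knapsack. A small Python sketch (names are ours):

```python
def distinct_partitions(n, k, s):
    """f_{k,n}(s) for the one-to-one case (all w_j = 1): the number of ways to
    write s as a sum of k distinct values from 1..n, via the recurrence
    f_{g,h}(s) = f_{g,h-1}(s) + f_{g-1,h-1}(s - h), computed bottom-up."""
    # f[g][t] = number of g-element subsets of the values processed so far summing to t
    f = [[0] * (s + 1) for _ in range(k + 1)]
    f[0][0] = 1
    for h in range(1, n + 1):              # process values 1..h
        for g in range(min(h, k), 0, -1):  # descending g: each value h used at most once
            for t in range(s, h - 1, -1):
                f[g][t] += f[g - 1][t - h]
    return f[k][s]
```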

Approximation: Furthermore, we can approximate the PMF by ignoring the distinctness constraint, i.e., counting the partitions of s into k summands from {1, 2, ..., n}. This problem has a nice generating function (x + x² + ... + x^n)^k, where the coefficient of x^s is the number of such partitions [43]:

$$\sum\_{i=0}^{\lfloor \frac{s-k}{n} \rfloor} \binom{k}{i} (-1)^i \binom{s-ni-1}{k-1} \tag{6}$$

We can calculate these coefficients using two algorithms with different trade-offs. The first algorithm pre-calculates the binomial coefficients with Pascal's triangle and then calculates all the coefficients with formula (6). The first step takes O(nk²) because s − ni − 1 ≤ nk and i ≤ k. The second step also takes O(nk²) because each of the O(nk) coefficients takes O(k) to compute, as ⌊(s−k)/n⌋ ≤ k. Thus, the overall time complexity of the first algorithm is O(nk²). The second algorithm calculates the generating function directly with the fast Fourier transform [4]: it first converts x + x² + ... + x^n to the point-value representation, raises each point value to the k-th power, and interpolates to get the coefficients. The second algorithm takes O(nk log(nk)) because the length of the polynomial is O(nk). Comparing the complexities, the first algorithm is better when k is small compared to n (i.e., k − log k < log n), and the second is better otherwise.
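A minimal sketch of the first algorithm's core step (names are ours; for brevity we call math.comb directly instead of pre-tabulating Pascal's triangle), computing one coefficient of (x + x² + ... + x^n)^k via formula (6):

```python
from math import comb

def coefficient(n, k, s):
    """Coefficient of x^s in (x + x^2 + ... + x^n)^k, per formula (6): the number
    of ordered k-tuples of values from 1..n that sum to s (repeats allowed)."""
    # for i <= (s - k) // n we have s - n*i - 1 >= k - 1 >= 0, so comb() is safe
    return sum((-1) ** i * comb(k, i) * comb(s - n * i - 1, k - 1)
               for i in range((s - k) // n + 1))
```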

To evaluate the approximation, we use the Jensen–Shannon (JS) distance [16] between the exact and the approximated PMFs. We check our approximation on the same real scenarios as in Section 3.1. As Table 1 shows, the approximation yields PMFs with a small JS distance, the largest being only 0.0442 (for n = 52, k = 9).

#### 3.3 PMF of γ

The PMF of γ is more complex than that of α because, even for the simplest all-to-one failure-to-fault matrix, the number of possible values of γ can be Ω(2^n). For example, consider n tests with costs 1, 2, 4, ..., 2^{n−1}, where only one test fails and detects the only fault. The γ value depends on the sum of the costs of the tests that precede the failure. 2^{n−1} different sets of tests can precede the failure, and every set has a distinct sum of costs. Even for the example in Fig. 1, the support of the PMF of γ (33 values) is much bigger than that of α (8 values).

# 4 Expected Values for All Orders Ωa(T)

While some comparisons of RTP techniques use full samples of PMFs, many use just the arithmetic mean of the samples. We next derive formulas for expected values, to obtain the mean faster and without the imprecision of sampling.

In this section, we consider the case where the order o is uniformly selected from Ω_a(T), allowing all n! orders of the n tests. Because α is a special case of γ where σ(t) = σ(t′) for all t, t′ ∈ T, we first derive results for γ.

To start with a simple example, consider a test suite with only one failing test (k = 1). For a random order, the test can be at any position with equal probability. Intuitively, the expected position across all orders is the middle of the sequence, hence α and γ should be about 1/2. In fact, we will show that they are exactly 1/2. Moreover, the expected values of both α and γ are 1/2 as long as each fault is detected by only one failing test (∀i. k_i = |T_i| = 1, which includes one-to-one). In general, the failure-to-fault matrix can be more complex: many tests could detect the same fault, and a test could detect many faults. To compute the expected values of α and γ, we first prove a useful lemma.

Lemma 4. For every fault i,

$$\forall t \notin T\_i. P\_a(t < t\_{\tau\_i}) = P\_a(\forall t' \in T\_i. t < t') = \frac{1}{k\_i + 1} \tag{7}$$

Proof. Since τ_i is the position of the first test from T_i in the order, t precedes t_{τ_i} exactly when t precedes every t′ ∈ T_i. Consider the relative position of each t ∉ T_i with respect to all the tests from T_i in a random order. By symmetry, t is equally likely to be in any of the k_i + 1 relative positions created by the relative order of the k_i tests from T_i. Therefore, the probability that t is in the relative position preceding all k_i tests from T_i is 1/(k_i + 1).

We first use this lemma to compute E_a[γ].

Theorem 5 (The expected value of γ for Ωa(T)).

$$\mathcal{E}\_{\mathbf{a}}[\gamma] = 1 - \frac{\sum\_{i=1}^{m} \left( \frac{\sigma(T \backslash T\_i)}{k\_i + 1} + \frac{\sigma(T\_i)}{2k\_i} \right)}{m \cdot \sigma(T)} \tag{8}$$

Proof. From (2), the two key terms in γ are σ(t_{τ_i}) and Σ_{j=τ_i}^n σ(t_j). By symmetry, any test t ∈ T_i can be the first test from T_i in the order, or equivalently t = t_{τ_i}, with probability 1/k_i. Thus

$$\mathbb{E}\_{\mathbf{a}}[\sigma(t\_{\tau\_i})] = \sum\_{t \in T\_i} P(t = t\_{\tau\_i})\sigma(t) = \frac{\sigma(T\_i)}{k\_i} \tag{9}$$

Next, consider that Σ_{j=τ_i}^n σ(t_j) = Σ_{t∈T} σ(t)·1_{t_{τ_i} ≤ t} can also be calculated as Σ_{t∈T_i} σ(t)·1_{t_{τ_i} ≤ t} + Σ_{t∉T_i} σ(t)·1_{t_{τ_i} ≤ t}. For every test t ∈ T_i, t_{τ_i} ≤ t by definition, so E_a[1_{t_{τ_i} ≤ t}] = 1 for t ∈ T_i. For every test t ∉ T_i, E_a[1_{t_{τ_i} ≤ t}] = P_a(t_{τ_i} ≤ t) = 1 − P_a(t < t_{τ_i}) = k_i/(k_i + 1), where the last equality stems from Lemma 4. Therefore, by the linearity of expectation, we get

$$\mathrm{E}\_{\mathrm{a}}[\sum\_{j=\tau\_{i}}^{n}\sigma(t\_{j})] = \sigma(T\_{i}) + \frac{k\_{i}}{k\_{i}+1}\sigma(T \backslash T\_{i}) \tag{10}$$

From (2), (9), and (10), we get (8).
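Formula (8) can be cross-checked by brute-force enumeration on a tiny suite. In the sketch below (names are ours), gamma computes, for each fault, the suffix cost from the first detecting test minus half that test's cost, normalized by m·σ(T), matching the terms E_a[σ(t_{τ_i})] and E_a[Σ_{j=τ_i}^n σ(t_j)] used in the proof:

```python
from fractions import Fraction
from itertools import permutations

def expected_gamma_all_orders(costs, faults):
    """E_a[gamma] via formula (8); costs[t] = sigma(t), faults = list of sets T_i."""
    n, m = len(costs), len(faults)
    sigma_T = sum(costs)
    acc = Fraction(0)
    for Ti in faults:
        ki = len(Ti)
        sigma_Ti = sum(costs[t] for t in Ti)
        acc += Fraction(sigma_T - sigma_Ti, ki + 1) + Fraction(sigma_Ti, 2 * ki)
    return 1 - acc / (m * sigma_T)

def gamma(order, costs, faults):
    """gamma of one order: for each fault, the suffix cost from the first detecting
    test minus half that test's cost, summed and normalized by m * sigma(T)."""
    sigma_T, m = sum(costs), len(faults)
    acc = Fraction(0)
    for Ti in faults:
        tau = min(j for j, t in enumerate(order) if t in Ti)
        acc += sum(costs[t] for t in order[tau:]) - Fraction(costs[order[tau]], 2)
    return acc / (m * sigma_T)
```

Averaging gamma over all n! permutations of a small suite agrees exactly with the closed formula, and suites where every fault is detected by exactly one test give exactly 1/2.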

Corollary 5.1 (The expected value of α for Ωa(T)).

$$\mathcal{E}\_{\mathbf{a}}[\alpha] = 1 - \frac{(n+1)\sum\_{i=1}^{m}\frac{1}{k\_i+1}}{nm} + \frac{1}{2n} \tag{11}$$

Revisiting the case where each fault is detected by only one failing test: setting ∀i. k_i = 1 in (8) or (11) gives exactly E_a[α] = E_a[γ] = 1/2. In fact, even in the general case of an arbitrary failure-to-fault matrix, we find that the two expected values are similar if not the same, inspiring us to derive the following bound:

Theorem 6 (The expected diference of α and γ for Ωa(T)).

$$-\frac{1}{12} < \mathcal{E}\_\mathbf{a}[\alpha] - \mathcal{E}\_\mathbf{a}[\gamma] < \frac{1}{2n} \tag{12}$$

Proof. From formulas (8) and (11), we have E_a[α] − E_a[γ] = ∆_γ − ∆_α + 1/(2n), where

$$\Delta\_\gamma = \frac{\sum\_{i=1}^{m}\left(\frac{1}{2k\_i} - \frac{1}{k\_i+1}\right)\sigma(T\_i)}{m \cdot \sigma(T)} \quad \text{and} \quad \Delta\_\alpha = \frac{\sum\_{i=1}^{m}\frac{1}{k\_i+1}}{nm}$$

Since k_i ≥ 1, we have −1/12 ≤ 1/(2k_i) − 1/(k_i+1) ≤ 0 (by basic calculus, the minimum is attained at k_i = 2 or k_i = 3), which, combined with σ(T_i) ≤ σ(T), gives −1/12 ≤ ∆_γ ≤ 0. Since k_i ≥ 1, we also have 0 < 1/(k_i+1) ≤ 1/2, which gives 0 < ∆_α ≤ 1/(2n). Thus, −1/12 ≤ ∆_γ − ∆_α + 1/(2n) < 1/(2n). However, ∆_γ − ∆_α + 1/(2n) = −1/12 would require ∆_α = 1/(2n) and thus ∀i. k_i = 1, in which case ∆_γ = 0 and ∆_γ − ∆_α + 1/(2n) = 0 ≠ −1/12. Therefore, the lower bound cannot be attained, and −1/12 < E_a[α] − E_a[γ] < 1/(2n).

# 5 Expected Values for Compatible Orders Ωc(T)

In this section, we consider the expected values of α and γ for Ω_c(T). Compatible orders do not interleave tests from different classes, as defined in Section 2. Similar to Ω_a(T), we first prove a useful lemma for Ω_c(T).

Lemma 7. For every fault i (note that even if t ∉ T_i, the class C(t) may contain another test t′ ∈ T_i),

$$\forall t \notin T\_i. P\_c(t < t\_{\tau\_i}) = P\_c(\forall t' \in T\_i. t < t') = \begin{cases} \frac{1}{|\mathcal{C}\_i|(|T\_{i,C(t)}| + 1)} & C(t) \in \mathcal{C}\_i\\ \frac{1}{|\mathcal{C}\_i| + 1} & C(t) \notin \mathcal{C}\_i \end{cases} \tag{13}$$

Proof. For the case C(t) ∈ C_i, two conditions must hold for t ∉ T_{i,C(t)} to precede all tests that detect fault i. First, among all classes in C_i, C(t) must be the first in the order; by symmetry, each class in C_i can be first with the same probability 1/|C_i|. Second, t must precede all tests from T_{i,C(t)}, which (similar to Lemma 4) holds with probability 1/(|T_{i,C(t)}| + 1). The two conditions are independent because they concern the class order and the test order inside the class, respectively, and these orders are independent of each other. Therefore, the probability that t precedes the first test that detects fault i is 1/(|C_i|(|T_{i,C(t)}| + 1)).

For the case C(t) ∉ C_i, only one condition must hold for t to precede the first test that detects fault i, namely that C(t) precedes all classes in C_i; similar to Lemma 4, this happens with probability 1/(|C_i| + 1).

Theorem 8 (The expected value of γ for Ωc(T)).

$$\mathrm{E}\_{\mathrm{c}}[\gamma] = 1 - \frac{1}{m \cdot \sigma(T)} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_i} \sigma(T\_C)}{|\mathcal{C}\_i| + 1} + \frac{1}{|\mathcal{C}\_i|} \sum\_{C \in \mathcal{C}\_i} \left( \frac{\sigma(T\_C \backslash T\_{i,C})}{|T\_{i,C}| + 1} + \frac{\sigma(T\_{i,C})}{2|T\_{i,C}|} \right) \right) \tag{14}$$

Proof. We first compute the two key terms of γ, σ(t_{τ_i}) and Σ_{j=τ_i}^n σ(t_j). For each test t ∈ T_i to be the first detecting test, its class C(t) ∈ C_i must be the first among all classes in C_i, which happens with probability 1/|C_i|, and t must be the first among all tests in T_{i,C(t)}, which happens with probability 1/|T_{i,C(t)}|. These two events are independent, so the joint probability is 1/(|C_i||T_{i,C(t)}|). By σ(t_{τ_i}) = Σ_{t∈T_i} σ(t)·1_{t=t_{τ_i}}, we have

$$\mathrm{E}\_{\mathrm{c}}[\sigma(t\_{\tau\_i})] = \sum\_{t \in T\_i} \frac{\sigma(t)}{|\mathcal{C}\_i||T\_{i,C(t)}|} = \frac{1}{|\mathcal{C}\_i|} \sum\_{C \in \mathcal{C}\_i} \frac{\sigma(T\_{i,C})}{|T\_{i,C}|} \tag{15}$$

Next, consider Σ_{j=τ_i}^n σ(t_j) = Σ_{t∈T} σ(t)·1_{t_{τ_i} ≤ t}. Each t satisfies either (1) t ∈ T_i, where 1_{t_{τ_i} ≤ t} = 1 by the definition of τ_i; or (2) t ∉ T_i, where E_c[1_{t_{τ_i} ≤ t}] = E_c[1_{t_{τ_i} < t}] = P_c(t_{τ_i} < t) = 1 − P_c(t < t_{τ_i}) can be obtained from Lemma 7. Combining these cases, we have

$$\begin{aligned} \mathrm{E}\_{\mathrm{c}}[\sum\_{j=\tau\_{i}}^{n}\sigma(t\_{j})] &= \sigma(T\_{i}) + \frac{|\mathcal{C}\_{i}|}{|\mathcal{C}\_{i}|+1} \sum\_{C \notin \mathcal{C}\_{i}} \sigma(T\_{C}) + \\ &\quad \sum\_{C \in \mathcal{C}\_{i}} \left(1 - \frac{1}{|\mathcal{C}\_{i}|(|T\_{i,C}|+1)}\right) \sigma(T\_{C} \backslash T\_{i,C}) \end{aligned} \tag{16}$$

From (2), (15), and (16), we get (14).

Corollary 8.1 (The expected value of α for Ωc(T)).

$$\mathrm{E}\_{\mathrm{c}}[\alpha] = 1 - \frac{1}{nm} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_{i}} |T\_{C}|}{|\mathcal{C}\_{i}| + 1} + \frac{1}{|\mathcal{C}\_{i}|} \sum\_{C \in \mathcal{C}\_{i}} \frac{|T\_{C}| + 1}{|T\_{i,C}| + 1} \right) + \frac{1}{2n} \tag{17}$$
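Formula (17) can likewise be cross-checked by enumerating all compatible orders of a tiny two-class suite (a sketch with our own naming):

```python
from fractions import Fraction
from itertools import permutations

def expected_alpha_compatible(classes, faults):
    """E_c[alpha] via formula (17); classes = list of lists of test ids,
    faults = list of sets T_i of test ids."""
    n = sum(len(c) for c in classes)
    m = len(faults)
    acc = Fraction(0)
    for Ti in faults:
        Ci = [c for c in classes if Ti & set(c)]  # classes with a detecting test
        rest = sum(len(c) for c in classes if not (Ti & set(c)))
        acc += Fraction(rest, len(Ci) + 1)
        acc += Fraction(1, len(Ci)) * sum(
            Fraction(len(c) + 1, len(Ti & set(c)) + 1) for c in Ci)
    return 1 - acc / (n * m) + Fraction(1, 2 * n)

def alpha(order, faults):
    """APFD of one order, using 1-based first-detection positions tau_i."""
    n, m = len(order), len(faults)
    s = sum(min(j + 1 for j, t in enumerate(order) if t in Ti) for Ti in faults)
    return 1 - Fraction(s, n * m) + Fraction(1, 2 * n)
```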

We next discuss the expected difference between E_c[α] and E_c[γ]. Unlike the case of Ω_a(T), where the difference has a rather small bound, we find that the difference can be rather large for Ω_c(T).

Theorem 9 (The expected diference of α and γ for Ωc(T)).

$$-\frac{1}{2} < \mathcal{E}\_{\mathbf{c}}[\alpha] - \mathcal{E}\_{\mathbf{c}}[\gamma] \le \frac{1}{2} - \frac{1}{2n} \tag{18}$$

Proof. From (14) and (17), we get E_c[α] − E_c[γ] = ∆_γ − ∆_α + 1/(2n), where

$$\Delta\_\gamma = \frac{1}{m \cdot \sigma(T)} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_i} \sigma(T\_C)}{|\mathcal{C}\_i| + 1} + \frac{1}{|\mathcal{C}\_i|} \sum\_{C \in \mathcal{C}\_i} \left( \frac{\sigma(T\_C \backslash T\_{i,C})}{|T\_{i,C}| + 1} + \frac{\sigma(T\_{i,C})}{2|T\_{i,C}|} \right) \right)$$

$$\Delta\_\alpha = \frac{1}{nm} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_i} |T\_C|}{|\mathcal{C}\_i| + 1} + \frac{1}{|\mathcal{C}\_i|} \sum\_{C \in \mathcal{C}\_i} \frac{|T\_C| + 1}{|T\_{i,C}| + 1} \right)$$

∆_γ > 0 because all the terms in ∆_γ are positive. From |C_i| ≥ 1 and |T_{i,C}| ≥ 1 for all i and C ∈ C_i, we have

$$\begin{split} \Delta\_{\gamma} &\leq \frac{1}{m \cdot \sigma(T)} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_{i}} \sigma(T\_{C})}{1+1} + \frac{1}{1} \sum\_{C \in \mathcal{C}\_{i}} \left( \frac{\sigma(T\_{C} \backslash T\_{i,C})}{1+1} + \frac{\sigma(T\_{i,C})}{2 \cdot 1} \right) \right) \\ &= \frac{1}{m \cdot \sigma(T)} \cdot \frac{1}{2} \sum\_{i=1}^{m} \sigma(T) = \frac{1}{2} \end{split}$$

Similarly,

$$\begin{array}{l} \Delta\_{\alpha} \leq \frac{1}{nm} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_{i}} |T\_{C}|}{|\mathcal{C}\_{i}| + 1} + \frac{1}{|\mathcal{C}\_{i}|} \sum\_{C \in \mathcal{C}\_{i}} \frac{|T\_{C}| + 1}{1 + 1} \right) \\ \leq \frac{1}{nm} \sum\_{i=1}^{m} \left( \frac{\sum\_{C \notin \mathcal{C}\_{i}} |T\_{C}|}{2|\mathcal{C}\_{i}|} + \frac{\sum\_{C \in \mathcal{C}\_{i}} (|T\_{C}| + 1)}{2|\mathcal{C}\_{i}|} \right) = \frac{1}{nm} \sum\_{i=1}^{m} \left( \frac{n}{2|\mathcal{C}\_{i}|} + \frac{1}{2} \right) \leq \frac{n+1}{2n} \end{array}$$

From 0 ≤ |T_{i,C}| ≤ |T_C|, we also have ∆_α ≥ 1/n. Combining 0 < ∆_γ ≤ 1/2 and 1/n ≤ ∆_α ≤ (n+1)/(2n), we get −1/2 < ∆_γ − ∆_α + 1/(2n) ≤ 1/2 − 1/(2n).

Considering the many inequalities in the preceding proof, one may expect the bounds to be loose, but we show two scenarios where the bounds are close to tight. Both scenarios have only one fault. Scenario one has two classes: C_1 has only one passing test t with cost qN (q > 0 is arbitrary), and C_2 has N failing tests, each with cost q/N. We assume N ≫ 1. t must be the first or last test in any compatible order, each with probability 1/2 (when C_1 is first or second). E_c[α] is close to 1, but E_c[γ] is only about 1/2. Precisely, E_c[α] − E_c[γ] = (N² − 2N + 2)/(2N² + 2N) ≈ 1/2 when N ≫ 1. Scenario two has two classes: C_2 has N failing tests, each with cost q/N, and C_3 has N² passing tests, each with cost q/N³. The two classes have only two orders, each with probability 1/2. E_c[γ] is close to 1, but E_c[α] is only about 1/2. Precisely, E_c[α] − E_c[γ] = 1/(N+1) − (N² + 2)/(2N² + 2N) + 1/(2N) ≈ −1/2 when N ≫ 1.

#### 5.1 Comparison of Ωa(T) and Ωc(T)

Compatible orders impose more constraints, which could either increase or decrease the average α or γ values. To compare how orders in Ω_a(T) and Ω_c(T) perform on average, we compare E_a[α] with E_c[α] and E_a[γ] with E_c[γ].

Theorem 10 (Diference of Ec[γ] and Ea[γ]).

$$\frac{1}{2n} - \frac{1}{2} \le \mathcal{E}\_{\mathbf{c}}[\gamma] - \mathcal{E}\_{\mathbf{a}}[\gamma] \le \frac{1}{6} \tag{19}$$

Proof. From (8) and (14), we have

$$\begin{split} \mathrm{E}\_{\mathrm{c}}[\gamma] - \mathrm{E}\_{\mathrm{a}}[\gamma] &= \frac{1}{m \cdot \sigma(T)} \sum\_{i=1}^{m} \left( \frac{\sigma(T\_{i})}{2k\_{i}} + \frac{\sigma(T \backslash T\_{i})}{k\_{i}+1} - \frac{\sum\_{C \notin \mathcal{C}\_{i}} \sigma(T\_{C})}{|\mathcal{C}\_{i}| + 1} - \right. \\ &\quad \left. \frac{1}{|\mathcal{C}\_{i}|} \sum\_{C \in \mathcal{C}\_{i}} \left( \frac{\sigma(T\_{C} \backslash T\_{i,C})}{|T\_{i,C}| + 1} + \frac{\sigma(T\_{i,C})}{2|T\_{i,C}|} \right) \right) \end{split} \tag{20}$$

Because 1 ≤ k_i ≤ n, |C_i| ≥ 1, and |T_{i,C}| ≥ 1 for all i, we have

$$\mathrm{E}\_{\mathrm{c}}[\gamma] - \mathrm{E}\_{\mathrm{a}}[\gamma] \ge \frac{1}{m \cdot \sigma(T)} \sum\_{i=1}^{m} \left(\frac{1}{2n} - \frac{1}{2}\right)\sigma(T) = \frac{1}{2n} - \frac{1}{2}$$

For the other direction, because |C_i| ≤ k_i and |T_{i,C}| ≤ k_i for all i, we have

$$\begin{aligned} \mathrm{E}\_{\mathrm{c}}[\gamma]-\mathrm{E}\_{\mathrm{a}}[\gamma] &\leq \frac{1}{m\cdot\sigma(T)}\sum\_{i=1}^{m} \left(\frac{\sigma(T\_{i})}{2k\_{i}} + \frac{\sigma(T \backslash T\_{i})}{k\_{i}+1} - \frac{\sum\_{C\notin\mathcal{C}\_{i}}\sigma(T\_{C})}{k\_{i}+1} - \frac{\left(\sum\_{C\in\mathcal{C}\_{i}}\sigma(T\_{C})\right)-\sigma(T\_{i})}{|\mathcal{C}\_{i}|(k\_{i}+1)} - \frac{\sigma(T\_{i})}{2|\mathcal{C}\_{i}|k\_{i}}\right) \\ &= \frac{1}{m\cdot\sigma(T)}\sum\_{i=1}^{m} \left(1 - \frac{1}{|\mathcal{C}\_{i}|}\right) \left(\frac{\sum\_{C\in\mathcal{C}\_{i}}\sigma(T\_{C})}{k\_{i}+1} - \sigma(T\_{i})\left(\frac{1}{k\_{i}+1} - \frac{1}{2k\_{i}}\right)\right) \\ &\leq \frac{1}{m\cdot\sigma(T)}\sum\_{i=1}^{m} \left(1 - \frac{1}{|\mathcal{C}\_{i}|}\right) \frac{\sum\_{C\in\mathcal{C}\_{i}}\sigma(T\_{C})}{k\_{i}+1} \\ &\leq \frac{1}{m\cdot\sigma(T)}\sum\_{i=1}^{m} \frac{|\mathcal{C}\_{i}| - 1}{|\mathcal{C}\_{i}|(|\mathcal{C}\_{i}| + 1)} \sum\_{C\in\mathcal{C}\_{i}}\sigma(T\_{C}) \\ &\leq \frac{1}{m\cdot\sigma(T)}\sum\_{i=1}^{m} \frac{\sigma(T)}{6} = \frac{1}{6} \end{aligned}$$

The third-to-last inequality holds because 1/(k_i+1) − 1/(2k_i) ≥ 0 for all k_i ≥ 1. The last inequality holds because (|C_i|−1)/(|C_i|(|C_i|+1)) ≤ 1/6 for all |C_i| ≥ 1, which can be shown with simple calculus, and Σ_{C∈C_i} σ(T_C) ≤ σ(T).

Corollary 10.1 (Diference of Ec[α] and Ea[α]).

$$\frac{1}{2n} - \frac{1}{2} \le \mathcal{E}\_{\mathbf{c}}[\alpha] - \mathcal{E}\_{\mathbf{a}}[\alpha] \le \frac{1}{6} \tag{21}$$

We give two scenarios where the preceding bounds are close to tight. In both scenarios, we set σ(t) = σ(t′) for all t, t′ ∈ T, so that α = γ and E_c[α] − E_a[α] = E_c[γ] − E_a[γ]. The first scenario has one fault F; each of the |C| classes contains n/|C| tests, and only one class contains tests that detect F, but all tests in that class detect F. In this scenario, E_a[α] = 1 − |C|(n+1)/(n(n+|C|)) + 1/(2n), and E_c[α] = 1 − (|C|−1)/(2|C|) − 1/(2n). If we take |C| = √n, then when n ≫ 1, E_a[α] ≈ 1 but E_c[α] ≈ 1/2, hence E_c[α] − E_a[α] ≈ −1/2. The second scenario has one fault F and two classes with 1 and n − 1 tests, and each class contains exactly one test that detects F. In this scenario, E_a[α] = 2/3 − 1/(2n) and E_c[α] = 3/4. When n ≫ 1, E_c[α] − E_a[α] ≈ 1/12, close to the upper bound of 1/6.

In brief, measured by α or γ, compatible orders can on average be much worse than all orders (by up to 1/2) but cannot be much better (by at most 1/6).

# 6 Properties of Metrics and Checking Prior RTP Work

Prior work on random RTP uses sampling and often visualizes α and γ values as boxplots that may show the median, mean, quartiles (25% and 75%), and "whiskers" (1.5 times the interquartile range) of the sampled distribution. For papers that show these boxplots, we identify two properties: Mean/Median at Least Half (Section 6.1) and Symmetric PMF (Section 6.2). We focus on Ω_a(T) because almost all prior work uses it instead of Ω_c(T) [46].


To check the boxplots from prior work, we search Google Scholar for papers related to "test prioritization" and keep only the papers whose titles contain both "test" and "prioriti". We sort these papers by citation count and check the top 100 most cited papers [44].

#### 6.1 Mean/Median at Least Half

Lemma 11. For every o ∈ Ω_a(T) and its reverse order ō ∈ Ω_a(T),

$$
\gamma(o) + \gamma(\overline{o}) \ge 1 \tag{22}
$$

The equality holds if ∀i. k_i = 1.

Proof sketch. To give some intuition: when ∀i. k_i = 1, the test that first detects fault i does not change when the order is reversed, so the "prefixes" of that test in o and ō complement each other and form the entire test suite. In this case, γ(o) + γ(ō) = 1. If ∃i. k_i ≥ 2, the test that first detects fault i in o is not the same test as in ō, and the "prefixes" of these two tests in o and ō do not form the entire test suite, so γ(o) + γ(ō) > 1. We omit the details due to the space limit.
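Lemma 11 can be checked exhaustively on a tiny suite; the sketch below (names are ours) reuses the same γ definition as in Section 4:

```python
from fractions import Fraction
from itertools import permutations

def gamma(order, costs, faults):
    """gamma of one order, as in Section 4: suffix cost from each fault's first
    detecting test minus half that test's cost, normalized by m * sigma(T)."""
    sigma_T, m = sum(costs), len(faults)
    acc = Fraction(0)
    for Ti in faults:
        tau = min(j for j, t in enumerate(order) if t in Ti)
        acc += sum(costs[t] for t in order[tau:]) - Fraction(costs[order[tau]], 2)
    return acc / (m * sigma_T)

costs = [1, 2, 3]
pairs = [(o, o[::-1]) for o in permutations(range(3))]
# k_1 = 2: the sum gamma(o) + gamma(reverse(o)) is at least 1, and can exceed it
assert all(gamma(o, costs, [{0, 1}]) + gamma(r, costs, [{0, 1}]) >= 1 for o, r in pairs)
assert any(gamma(o, costs, [{0, 1}]) + gamma(r, costs, [{0, 1}]) > 1 for o, r in pairs)
# every k_i = 1: the sum is exactly 1 for every order
assert all(gamma(o, costs, [{0}, {2}]) + gamma(r, costs, [{0}, {2}]) == 1 for o, r in pairs)
```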

Theorem 12 (Measures of central tendency are at least half).

$$\min\{\mathcal{E}\_\mathbf{a}[\alpha], \text{Med}\_\mathbf{a}(\alpha), \mathcal{E}\_\mathbf{a}[\gamma], \text{Med}\_\mathbf{a}(\gamma)\} \ge 1/2\tag{23}$$

The equality holds if ∀i. k_i = 1.

Proof sketch. From (22), we get E_a[γ] = (1/2) · Σ_{o∈Ω_a(T)} (γ(o) + γ(ō))/n! ≥ 1/2, and the equality holds if ∀i. k_i = 1. Because α can be viewed as a special case of γ, the same result holds for E_a[α]. The same result for Med_a(α) and Med_a(γ) can also be derived from (22). We omit the details due to the space limit.

When we inspect the top 100 most cited RTP papers, we find at least five papers with boxplots clearly showing a mean or median below 1/2. These papers range from seminal papers [12, Figs. 2b, 2c, 2e] (year 2000) and [13, Fig. 3: schedule, tcas] (2002) to more recent ones [29, Fig. 4] (2007), [5, Fig. 2] (2016; a coauthor of that paper is also a coauthor of this one), and [41, Fig. 5] (2017). Instead of sampling random orders an arbitrary number of times, future RTP research could use our formulas or algorithm to obtain exact mean and median values.

#### 6.2 Symmetric PMF

We also prove that α and γ PMFs are symmetric when (23)'s equality holds.

Theorem 13 (Symmetry of the α and γ PMFs). If E_a[α] = 1/2 ∨ Med_a(α) = 1/2 ∨ E_a[γ] = 1/2 ∨ Med_a(γ) = 1/2 ∨ ∀i. k_i = 1, then

$$\forall \delta. P(\alpha = 1/2 - \delta) = P(\alpha = 1/2 + \delta) \land P(\gamma = 1/2 - \delta) = P(\gamma = 1/2 + \delta) \tag{24}$$

Proof. By Theorem 12, each of the conditions E_a[α] = 1/2, Med_a(α) = 1/2, E_a[γ] = 1/2, and Med_a(γ) = 1/2 implies ∀i. k_i = 1, which in turn implies ∀o. α(o) + α(ō) = 1 ∧ γ(o) + γ(ō) = 1. Each order has exactly one reverse order, so the PMFs of α and γ are symmetric around 1/2.

When we inspect the top 100 most cited RTP papers again, we find at least three papers relevant to this property. Based on the information in these papers, we believe that ∀i. k_i = 1 holds for them. Ideally, we would confirm each paper's failure-to-fault matrix, but papers often omit such details. On a positive note, the authors of one paper [38] released their dataset, which we analyze to confirm that ∀i. k_i = 1. The papers that violate this property include the most widely cited paper on RTP [38, Fig. 5: schedule, schedule2, tcas] (year 2001; 1563 citations per Google Scholar) as well as older [36, Fig. 4: schedule, schedule2, tcas] (1999) and newer [40, Fig. 2] (2015) papers.

Instead of randomly sampling orders to approximate PMFs, future RTP papers could use our algorithm to compute exact PMFs. While we find only five and three papers that definitely violate Mean/Median at Least Half and Symmetric PMF, respectively, we suspect that many others may violate these or similar properties. However, due to the lack of data in many papers (e.g., no boxplot for random RTP), we cannot easily identify all violations.

# 7 Related Work

Some prior work [45, 49] considers expected values of α and γ, but in different contexts from ours. Random testing (but not random RTP) has been studied for a while [7–9, 17, 18, 31–33, 50]. The most related are theoretical analyses of random test generation. Böhme and Paul [2, 3] analyze how random sampling of test inputs compares to systematic generation: random sampling can be more efficient when the cost to systematically generate a test input exceeds the cost to randomly sample an input by some factor. Böhme et al. [1] analyze the connection between Shannon's entropy and the discovery rate of a fuzzer that randomly generates inputs. They provide the foundation for identifying random seeds for the fuzzer to improve the overall efficiency. Their analysis also enables future systematic approaches for test generation to be compared more efficiently with random generation. Similarly, our analysis can help future RTP work compare more efficiently against random RTP and avoid insufficient sampling. Beyond random test generation, Majumdar and Niksic [26] present a theoretical analysis of the effectiveness of randomly inserted partition faults for finding bugs in distributed systems. In contrast, our analysis is on test-suite orders for random RTP.

# 8 Conclusion

Regression test prioritization (RTP) is a popular regression testing approach. The majority of highly cited RTP papers have compared RTP techniques with random RTP. However, all such evaluations have been empirical, with no prior theoretical analysis of random RTP. This paper has presented such an analysis, by introducing an algorithm for efficiently computing the exact probability mass function of APFD, deriving closed-form formulas and approximations for various metrics and scenarios, and deriving two interesting properties for APFD and APFDc. Overall, our analysis provides new insights into random RTP, and our results show that future RTP work often need not use random sampling but can use our simple formulas or algorithms to evaluate random RTP more precisely.

Acknowledgments. We thank Anjiang Wei, Dezhi Ran, and Sasa Misailovic for their help. This work was partially supported by US NSF grants CCF-1763788 and CCF-1956374, NSFC grant No. 62161146003, the Tencent Foundation, and the XPLORER PRIZE. We acknowledge support for research on regression testing from Dragon Testing, Microsoft, and Qualcomm. Tao Xie is the corresponding author, and is also affiliated with the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, China.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Verified First-Order Monitoring with Recursive Rules

Sheila Zingg<sup>1</sup>, Srđan Krstić<sup>1</sup>, Martin Raszyk<sup>1</sup>, Joshua Schneider<sup>1</sup>, and Dmitriy Traytel<sup>2</sup>

1 Institute of Information Security, Department of Computer Science, ETH Zürich, Zurich, Switzerland, {srdan.krstic,martin.raszyk,joshua.schneider}@inf.ethz.ch <sup>2</sup> Department of Computer Science, University of Copenhagen, Copenhagen, Denmark, traytel@di.ku.dk

Abstract. First-order temporal logics and rule-based formalisms are two popular families of specification languages for monitoring. Each family has its advantages and only few monitoring tools support their combination. We extend metric first-order temporal logic (MFOTL) with a recursive let construct, which enables interleaving rules with temporal logic formulas. We also extend VeriMon, an MFOTL monitor whose correctness has been formally verified using the Isabelle proof assistant, to support the new construct. The extended correctness proof covers the interaction of the new construct with the existing verified algorithm, which is subtle due to the presence of the bounded future temporal operators. We demonstrate the recursive let's usefulness on several example specifications and evaluate our verified algorithm's performance against the DejaVu monitoring tool.

Keywords: Rule-based specifications · Monitoring · Formal verification.

### 1 Introduction

In runtime verification, a monitor observes events generated by a running system and analyzes the event streams for compliance with a given specification. Temporal specification languages for monitoring are often classified as operational or declarative [10]. Operational languages explicitly describe how the monitor's input should be transformed to obtain an output. Two important subclasses of operational languages are rule-based formalisms [2,13] and stream runtime verification (SRV) languages [6,8,11,20]. Both formulate the transformations as recursive equations. In contrast, declarative languages, such as first-order temporal logics [4,15], describe the output by composing high-level operators.

Operational and declarative languages have complementary advantages: declarative languages let specification authors focus on the "what" and not the "how", whereas operational languages offer the authors more control over the evaluation. Most runtime verification tools do not support mixing the paradigms, especially when it comes to parametric, i.e., first-order, specification languages. A notable exception is the recent addition of recursive rules to past-time first-order temporal logic (PFLTL), implemented in the DejaVu monitoring tool [14]. As another important benefit, recursive rules can express operations like transitive closure that are not expressible in first-order logics.

In this paper, we introduce recursion in metric first-order temporal logic (MFOTL) [4] in the form of a recursive let construct. We develop and implement an evaluation algorithm for MFOTL with recursion in VeriMon [3, 21], an MFOTL monitor whose correctness has been formally verified in the Isabelle proof assistant. To this end, we extend the formal correctness proof to cover the recursive let construct.

Unlike PFLTL, MFOTL supports bounded future temporal operators and aggregations (Section 2). The interaction of recursion with bounded future operators is subtle. To avoid non-termination, DejaVu requires all recursive occurrences to be guarded by a previous operator. We similarly require the recursive occurrences to be guarded in our monitor, but we relax the requirement on the guard to other past-time operators which ensure that their subformulas are evaluated strictly in the past. Moreover, we allow future operators in the recursive let construct, as long as no recursion takes place in the future operator's arguments. These restrictions ensure that the fixpoint given by the recursive let operator is well-defined. At the same time, they are permissive and allow us to formulate interesting examples, several of which are beyond what PFLTL with recursion can express.

Consider a specification that aims to secure hosts in a network that communicate with each other and with the outside world. A host is *tainted* by an address range iff there is a chain of communication from the address to the host and all hosts on the chain trigger an intrusion detection alert within one hour after communicating with the previous host. This specification can be expressed directly using our recursive let construct (to model chains of communication) and future temporal operators (to specify "within one hour after").
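The following sketch gives one possible operational reading of this specification in Python. It is our own simplified single-pass model (with a hypothetical bad address and flat event tuples), not the MFOTL encoding; it only illustrates how taint propagates along communication chains subject to the one-hour alert window:

```python
HOUR = 3600
BAD = {"evil.example"}   # hypothetical suspicious address (our assumption)

def tainted_hosts(events):
    """Single-pass operational reading of the tainted-host rule.

    events: ("comm", src, dst, ts) and ("alert", host, ts) tuples.
    A host becomes tainted if a tainted source contacted it and it
    raised an alert within one hour after that communication."""
    tainted = set(BAD)
    last_comm = {}                    # dst -> ts of latest comm from a tainted src
    for ev in sorted(events, key=lambda e: e[-1]):
        if ev[0] == "comm":
            _, src, dst, ts = ev
            if src in tainted:
                last_comm[dst] = ts
        else:                         # alert event
            _, host, ts = ev
            if host in last_comm and ts - last_comm[host] <= HOUR:
                tainted.add(host)
    return tainted - BAD

events = [("comm", "evil.example", "h1", 0), ("alert", "h1", 100),
          ("comm", "h1", "h2", 200), ("alert", "h2", 4000)]
```

Here h1 alerts 100 seconds after being contacted by the bad address and becomes tainted; h2's alert arrives 3800 seconds after h1's message, outside the one-hour window, so the chain stops.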

We start by extending MFOTL with a non-recursive let operator (Section 3). This special case is mainly of pedagogical value: aspects common to both let operators are easier to explain on the simpler non-recursive variant. Yet, this construct is useful in practice to structure complex formulas and improve monitoring performance by sharing common subformulas. Thus we extend VeriMon's algorithms and proofs with the non-recursive let.

We then introduce the recursive let operator (Section 4.1), exemplify its semantics with several specifications (Section 4.2), and develop the monitoring algorithm and sketch its correctness (Section 4.3). VeriMon's repository [24] contains complete formal proofs.

This work is part of a long-term effort to develop a trustworthy monitor that surpasses other non-verified tools in expressiveness and efficiency. In this work, our focus is on expressiveness (and trustworthiness). Nonetheless, we evaluate our algorithmic additions to VeriMon on a micro-benchmark and observe that, even without further optimizations, its performance is incomparable to that of DejaVu (Section 5). Moreover, we detected a problem in DejaVu's handling of variable names in recursive subformulas.

In summary, our main contribution is the extension of MFOTL with a recursive let operator and the design of an evaluation algorithm for it. Along the way, we introduce a non-recursive let operator, which proved essential when writing complex specifications. Our contributions are implemented as part of VeriMon and proved correct using Isabelle.

*Related Work.* Our work adds rule-based specification features [13] to a first-order specification language [16]. Above we describe our contribution's relationship to DejaVu and VeriMon, two monitors for first-order temporal specifications. VeriMon's algorithm [21], which we extend, is based on the algorithm used in the MonPoly monitor [5], although VeriMon has optimizations that are not present in MonPoly and vice versa [3]. VeriMon supports a more expressive specification language than MonPoly, and our introduction of the recursive let has increased the gap between the two. VeriMon's and MonPoly's algorithms work with finite relations. These tools are thus restricted to MFOTL's *monitorable fragment* [4], which ensures that all subformulas evaluate to finite results. In contrast, DejaVu finitely represents infinite relations using BDDs and thus supports the full PFLTL (but only closed formulas). Both DejaVu and our work restrict the recursive let syntactically.

datatype *data* = Int *int* | Flt *double* | Str *string*
type\_synonym *db* = *string* ⇒ *data list set*
datatype *trm* = V *nat* | C *data* | *trm* + *trm* | ...
type\_synonym *ts* = *nat*
typedef *trace* = {*s* :: (*db* × *ts*) *stream*. trace *s*}
typedef I = {(*a* :: *nat*, *b* :: *enat*). *a* ≤ *b*}
datatype *frm* = *string*(*trm list*) | *trm* ◦ *trm* | ¬ *frm* | ∃ *frm* | *frm* ∨ *frm* | *frm* ∧ *frm* | ●<sub>I</sub> *frm* | #<sub>I</sub> *frm* | *frm* S<sub>I</sub> *frm* | *frm* U<sub>I</sub> *frm* | *nat* ← *agg*\_*op*(*trm*;*nat*) *frm*

fun etrm :: *data list* ⇒ *trm* ⇒ *data* where
etrm *v* (V *x*) = *v* ! *x* | etrm *v* (C *x*) = *x* | etrm *v* (*t*<sub>1</sub> + *t*<sub>2</sub>) = etrm *v t*<sub>1</sub> + etrm *v t*<sub>2</sub> | ...

fun sat :: *trace* ⇒ *data list* ⇒ *nat* ⇒ *frm* ⇒ *bool* where
sat σ *v i* (*p*(*as*)) = (map (etrm *v*) *as* ∈ Γ σ *i p*)
| sat σ *v i* (*t*<sub>1</sub> ◦ *t*<sub>2</sub>) = (etrm *v t*<sub>1</sub> ◦ etrm *v t*<sub>2</sub>)
| sat σ *v i* (¬ϕ) = (¬ sat σ *v i* ϕ)
| sat σ *v i* (∃ϕ) = (∃*z*. sat σ (*z* # *v*) *i* ϕ)
| sat σ *v i* (α∨β) = (sat σ *v i* α ∨ sat σ *v i* β)
| sat σ *v i* (α∧β) = (sat σ *v i* α ∧ sat σ *v i* β)
| sat σ *v i* (●<sub>I</sub> ϕ) = (case *i* of 0 ⇒ False | *j*+1 ⇒ T σ *i* − T σ *j* ∈<sub>I</sub> *I* ∧ sat σ *v j* ϕ)
| sat σ *v i* (#<sub>I</sub> ϕ) = (T σ (*i*+1) − T σ *i* ∈<sub>I</sub> *I* ∧ sat σ *v* (*i*+1) ϕ)
| sat σ *v i* (α S<sub>I</sub> β) = (∃*j* ≤ *i*. T σ *i* − T σ *j* ∈<sub>I</sub> *I* ∧ sat σ *v j* β ∧ (∀*k* ∈ {*j* <.. *i*}. sat σ *v k* α))
| sat σ *v i* (α U<sub>I</sub> β) = (∃*j* ≥ *i*. T σ *j* − T σ *i* ∈<sub>I</sub> *I* ∧ sat σ *v j* β ∧ (∀*k* ∈ {*i* ..< *j*}. sat σ *v k* α))
| sat σ *v i* (*y* ← Ω(*t*;*b*) ϕ) = (let *M* = {(*x*, card<sup>∞</sup> *Z*) | *x Z*. *Z* = {*z*. length *z* = *b* ∧ sat σ (*z*@*v*) *i* ϕ ∧ etrm (*z*@*v*) *t* = *x*} ∧ *Z* ≠ {}} in (*M* = {} −→ fv ϕ ⊆ {0 ..< *b*}) ∧ *v* ! *y* = eval\_agg\_op Ω *M*)

Fig. 1. Formal syntax and semantics of MFOTL with aggregations, where ◦ ∈ {=,<,≤}

Other rule-based [2,13] and SRV-based monitors [6,8,11,20] can express the temporal operators present in LTL, but struggle with extensions that introduce parameters. Even for the operators they can express, specialized algorithms that are carefully tuned for the operators tend to exhibit a better performance. Instead of encoding temporal operators, we take the opposite approach and enrich a monitor that uses specialized algorithms for temporal operators with general-purpose recursion.

Datalog [1] adds recursion to first-order logic, similarly to our addition of recursion to temporal logic. However, Datalog has no built-in notion of time and hence other measures must be taken to ensure that the fixpoints are well-defined, e.g., by restricting negation. Restricting the recursive occurrences to be strictly in the past is a natural and expressive alternative for monitoring, as we do not restrict negation beyond of what the monitorable fragment requires. Works on Datalog extensions with metric temporal operators [7,19,22] mostly study the decidability and complexity of computational problems related to these extensions, whereas we design, implement, and formally verify an executable algorithm.

# 2 Metric First-Order Temporal Logic

MFOTL extends linear temporal logic with first-order quantification, past-time operators, and interval bounds on the temporal operators [4]. The VeriMon monitor [3] supports a fragment of this logic. It also adds new features, specifically regular matching operators as in linear dynamic logic [9], which results in metric first-order dynamic logic (MFODL), as well as aggregations. Our extension of VeriMon with recursive rules retains the additional features of MFODL. However, the additional features are orthogonal to our extension and hence we base our presentation in this paper on MFOTL with aggregations.

We summarize MFOTL's syntax and semantics, as well as the monitorable fragment. The presentation generally follows the Isabelle formalization; however, we sometimes deviate from Isabelle's concrete syntax for simplicity. We begin by defining some auxiliary types (top of Fig. 1). The logic's universe (type *data*) is fixed and infinite: it is a disjoint sum of integers, 64-bit IEEE floats, and strings of 8-bit characters. Databases (type *db*) encode first-order structures as functions from predicate names to relations over *data*. Relations are represented as sets of lists. A *trace* is a *stream* (an infinite sequence) of time-stamped databases. Time-stamps (type *ts*) are modeled as natural numbers (type *nat*). We write Γ σ *i* for the *i*-th database in σ, and T σ *i* for its time-stamp. The predicate trace enforces monotone and eventually increasing time-stamps, i.e., ∀*i* ≤ *j*. T σ *i* ≤ T σ *j* and ∀*x*. ∃*i*. *x* < T σ *i*. Non-empty intervals (type I) are represented by their end-points. We write [*a*,*b*] for the unique interval satisfying *n* ∈<sub>I</sub> [*a*,*b*] iff *a* ≤ *n* ≤ *b*, where *n* ∈<sub>I</sub> *I* denotes that *I* contains the natural number *n*. The interval is unbounded from above if *b* = ∞, which the type *enat* adds to the natural numbers.

Terms (type *trm*) are constructed recursively from variables (represented by De Bruijn indices), constants, and arithmetic operators. We use named variables in examples and omit the V and C constructors. There are two kinds of atomic formulas (type *frm*): flexible predicates of the form *p*(*as*), where *as* is a list of terms, and rigid predicates *<sup>t</sup>*<sup>1</sup> ◦ *<sup>t</sup>*<sup>2</sup> for ◦ ∈ {=,<,≤}, which have a fixed interpretation. Formally, the existential quantifier ∃ does not carry a variable name because of the De Bruijn encoding. We use fv α to denote the set of De Bruijn indices of α's free variables.

The semantics is given by the functions etrm and sat (Fig. 1). Both depend on a valuation, which is a *data list* assigning a value to each variable. The satisfaction function sat for formulas additionally depends on a trace σ and a time-point *<sup>i</sup>*, which is an index into the trace. Indexing into lists is denoted by *v* ! *x*, the operation *z* # *v* prepends the value *z* to the list *<sup>v</sup>*, and @ concatenates two lists. The notation {*<sup>x</sup>* ..< *<sup>y</sup>*} and {*<sup>x</sup>* <.. *<sup>y</sup>*} is shorthand for the sets {*x*, *<sup>x</sup>*+1,..., *<sup>y</sup>*−1} and {*x*+1, *<sup>x</sup>*+2,..., *<sup>y</sup>*} of natural numbers, respectively.
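To make the semantics concrete, here is a toy Python rendering of a propositional, past-only fragment of sat, with finite trace prefixes instead of infinite streams. This is our own simplification (0-ary predicates, formulas as nested tuples), not VeriMon's representation:

```python
def sat(trace, i, f):
    """Toy satisfaction check; trace is a list of (db, ts) pairs where
    db is the set of 0-ary predicate names holding at that time-point."""
    db, ts = trace[i]
    kind = f[0]
    if kind == "pred":
        return f[1] in db
    if kind == "neg":
        return not sat(trace, i, f[1])
    if kind == "or":
        return sat(trace, i, f[1]) or sat(trace, i, f[2])
    if kind == "prev":                       # previous with interval [a, b]
        _, (a, b), g = f
        return i > 0 and a <= ts - trace[i - 1][1] <= b and sat(trace, i - 1, g)
    if kind == "since":                      # alpha S_[a,b] beta
        _, (a, b), alpha, beta = f
        return any(a <= ts - trace[j][1] <= b
                   and sat(trace, j, beta)
                   and all(sat(trace, k, alpha) for k in range(j + 1, i + 1))
                   for j in range(i + 1))
    raise ValueError(kind)

trace = [({"q"}, 0), ({"p"}, 1), ({"p"}, 2)]
since_pq = ("since", (0, 5), ("pred", "p"), ("pred", "q"))
```

For instance, `since_pq` holds at time-point 2 because q held at time-point 0 (within the interval) and p held at every point since, and "prev" fails at time-point 0 just as the case split on *i* in Fig. 1 prescribes.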

An aggregation formula *<sup>y</sup>* <sup>←</sup> Ω(*t*;*b*) ϕ binds *<sup>b</sup>* variables in the subformula ϕ; the remaining free variables of ϕ are used for grouping. Each group is assigned an aggregate value *y*, which is computed by first evaluating the term *t* on each valuation that matches the group and that satisfies ϕ, then aggregating the results using the operator Ω (e.g., MIN for minimum). To this end, eval\_agg\_op Ω *<sup>M</sup>* (not shown) applies Ω to a set *<sup>M</sup>* of value–multiplicity pairs [3]; card<sup>∞</sup> *Z* is the cardinality of *Z*, or ∞ if *Z* is infinite. The conjunct *<sup>M</sup>* <sup>=</sup> {} −→ fv ϕ ⊆ {<sup>0</sup> ..< *<sup>b</sup>*} ensures that the formula is satisfied by the aggregate value of an empty *M* only if there are no grouping variables. Otherwise, infinitely many groups would be labeled with that value, rendering such aggregations non-monitorable.
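The grouping behavior can be illustrated in isolation. The sketch below is our own simplification, with ϕ's satisfying valuations given as an explicit finite set; it mimics a MIN aggregation where the first *b* positions of each valuation are the bound variables and the remaining positions form the group key:

```python
def eval_min_agg(valuations, term, b):
    """valuations: finite set of tuples satisfying phi, bound vars first.
    term: function from a valuation to the value being aggregated.
    b: number of bound variables; the remaining positions are the group key."""
    groups = {}
    for v in valuations:
        groups.setdefault(v[b:], []).append(term(v))
    # apply the aggregation operator (here: minimum) per group
    return {key: min(vals) for key, vals in groups.items()}

# valuations (z, host): latency z was measured for host
vals = {(120, "h1"), (80, "h1"), (300, "h2")}
mins = eval_min_agg(vals, term=lambda v: v[0], b=1)
```

Each group ("h1" and "h2") is assigned the minimum latency observed for it, mirroring how eval\_agg\_op collapses the multiset *M* per group.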

The decidable predicate mon :: *frm* ⇒ *bool* specifies the monitorable fragment. We omit its formal definition and refer to the earlier descriptions of VeriMon [3,21] for details. Intuitively, mon places restrictions on the formula's structure to ensure that all subformulas have finitely many satisfying valuations. Also, the interval *I* of every U*<sup>I</sup>* operator must be bounded. A monitor for a monitorable formula can thus compute a finite set of satisfying valuations for every time-point after observing a sufficiently long trace prefix.

#### 3 Non-Recursive Let Operator

We first introduce a non-recursive let operator Let *string* := *frm* in *frm* to the *frm* datatype. The formula Let *p* := α in β associates the formula α with the predicate named *p*, which may be used in the formula β. We call such a predicate *let-bound*. The operator is non-recursive: *p* has the same meaning within α as in the surrounding context (unless it is bound by a nested let in α). Although the non-recursive let operator does not enhance MFOTL's expressiveness, it improves readability (by using descriptive let-bound predicate names), as well as modularity and evaluation efficiency (by sharing subformulas).

Intuitively, the meaning of Let *<sup>p</sup>* :<sup>=</sup> α in β is the same as that of β after replacing all its predicates of the form *<sup>p</sup>*(*as*) with the formula α, whose free variables have been replaced with the terms *as* in a capture-avoiding way. The formal syntax does not specify explicitly how α's free variables map to *<sup>p</sup>*'s arguments. The mapping is induced by the De Bruijn indices: the variable with index 0 becomes the first argument, and so forth. We list the arguments explicitly in examples that use named variables. For instance, the formula Let *<sup>p</sup>*(*x*) :<sup>=</sup> *<sup>p</sup>*(*x*)∧ ∃*y*. *<sup>q</sup>*(*x*, *<sup>y</sup>*) in [0,2] *<sup>p</sup>*(*y*) should be equivalent to [0,2] (*p*(*y*)∧ ∃*z*. *<sup>q</sup>*(*y*,*z*)). We achieve this by defining Let's semantics as follows.

$$\text{sat } \sigma\; v\; i\; (\text{Let } p := \alpha \text{ in } \beta) = \text{sat } (\sigma[p \mapsto \lambda j.\; \text{satrel } \sigma\; j\; \alpha])\; v\; i\; \beta$$

We write satrel σ *j* α as an abbreviation for {*v*. sat σ *v j* α ∧ length *v* = nfv α}, i.e., the relation containing the valuations that satisfy α. The function nfv α returns the minimum length of *v* needed to cover all of α's free variables, i.e., 0 if α is closed and Max (fv α) + 1 otherwise. The trace σ[*p* ↦ *R*] is the same as the trace σ except that for every time-point *i*, the database at *i* maps the predicate name *p* to *R i*, where *R* has type *nat* ⇒ *data list set* and is called a *temporal relation*. Note that the subformula α is not necessarily evaluated at time-point *i*. Instead, the choice of the time-point is deferred until the predicate *p* is used within β, which we achieve by updating the entire trace. This supports the intuition behind *unfolding* the let operator Let *p* := α in β described above, especially as subformulas *p*(*as*) may occur under temporal operators in β.
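The unfolding intuition can be replayed on a toy propositional model (our own simplification to 0-ary predicates): the modified trace contains *p* at exactly the time-points where α holds, and β is then evaluated on that trace:

```python
def sat(trace, i, f):
    """Toy fragment: 0-ary predicates, disjunction, and Let."""
    db, _ = trace[i]
    if f[0] == "pred":
        return f[1] in db
    if f[0] == "or":
        return sat(trace, i, f[1]) or sat(trace, i, f[2])
    if f[0] == "let":                        # Let p := alpha in beta
        _, p, alpha, beta = f
        # alpha is evaluated on the ORIGINAL trace (the let is non-recursive);
        # p is added at exactly the time-points where alpha holds
        new = [(d | {p} if sat(trace, j, alpha) else d - {p}, t)
               for j, (d, t) in enumerate(trace)]
        return sat(new, i, beta)
    raise ValueError(f[0])

trace = [({"q"}, 0), (set(), 1)]
f = ("let", "p", ("or", ("pred", "q"), ("pred", "r")), ("pred", "p"))
```

Evaluating `f` at time-point 0 succeeds (q holds there, so the modified trace contains p), while at time-point 1 it fails, matching the substitution reading of Let.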

*Implementation.* To evaluate an MFOTL formula on a trace, VeriMon computes a finite set of satisfying valuations (represented by the type *table*) recursively for each subformula. It applies standard table operations such as the natural join (./) and union. Tables are sets of tuples, which are lists of optional *data* values (with missing values denoted by ⊥) and thus refine valuations. This representation allows us to use lists of the same length for subformulas with different free variables. As with valuations, the variables' De Bruijn indices are used to look up their value in a tuple.

VeriMon processes an unbounded trace incrementally. Its interface consists of two functions init :: *frm* ⇒ *state* and step :: *dbs* × *ts list* ⇒ *state* ⇒ (*nat* × *table*) *list* × *state*. The function init initializes the monitor's state (type *state*), and step updates it with a batch of new time-stamped databases to produce a list of new satisfactions. Instead of *db list*, step uses the type *dbs* = (*string* ⇀ *table list*) (a partial mapping from *string* to *table list*) to efficiently retrieve all relations (encoded as tables) associated with a predicate name at once. Besides some auxiliary data, *state* stores an *inductive state* of type *sfrm* that mirrors the inductive representation of formulas, augmented with data structures for evaluating temporal operators and buffering intermediate results. Internally, step (*dbs*, *tss*) *st* calls eval *j n tss dbs s*<sub>ϕ</sub>, where *j* is the combined length of the trace prefix including the new batch, *n* = nfv ϕ for the monitored formula ϕ, and *s*<sub>ϕ</sub> is the inductive state, all stored in *st*. The function eval returns a list of tables with new satisfactions, as well as the updated inductive state. Satisfactions are reported for every time-point in order. They may be delayed if the formula contains future operators.

To evaluate Let *<sup>p</sup>* :<sup>=</sup> α in β, we use the tables with α's satisfactions to evaluate *<sup>p</sup>* within β, which requires that the tuples in these tables do not have missing values. Therefore, we require that let operators satisfy mon (Let *<sup>p</sup>* :<sup>=</sup> α in β) = ({<sup>0</sup> ..< nfv α} ⊆ fv α∧mon α∧mon β). Specifically, the (indices of) α's free variables must not have gaps. We add the constructor SLet *p m s*α*s*β to the inductive state, which stores *<sup>p</sup>*, the number *<sup>m</sup>* <sup>=</sup> nfv α of free variables in α, and the states for subformulas α and β. It is initialized by initializing *<sup>s</sup>*α and *<sup>s</sup>*β recursively. The function eval evaluates it as follows.

$$\begin{array}{l} \text{eval } j\; n\; tss\; dbs\; (\text{SLet } p\; m\; s_\alpha\; s_\beta) = \\ \quad \text{let } (xs,\, s'_\alpha) = \text{eval } j\; m\; tss\; dbs\; s_\alpha;\;\; (ys,\, s'_\beta) = \text{eval } j\; n\; tss\; (dbs[p \mapsto xs])\; s_\beta \\ \quad \text{in } (ys,\, \text{SLet } p\; m\; s'_\alpha\; s'_\beta) \end{array}$$

We write *dbs*[*p* ↦ *xs*] for the partial mapping *dbs* updated at *p* with *xs*. The recursive call of eval on *s*<sub>α</sub> may return multiple tables in the list *xs*. Note that step generalizes the original VeriMon interface [3] as it consumes multiple time-stamped databases at once. The generalized interface of eval allows us to pass all tables at once to the recursive call for *s*<sub>β</sub>.

*Correctness.* We relate the outputs of step and sat to prove our monitor correct. As mentioned earlier, the monitor may delay its output. We precisely characterize its *progress* for a given formula and trace prefix. Intuitively, the progress is the number of time-points that the monitor is able to evaluate given a trace prefix. Progress is a useful tool in the correctness proof as it helps us describe the output *at every time-point*. Moreover, we show below that progress can be made arbitrarily large, which is important for completeness.

Formally, prog <sup>σ</sup> *<sup>P</sup>* <sup>ϕ</sup> *<sup>j</sup>* is <sup>ϕ</sup>'s progress *<sup>i</sup>*ϕ after reading the first *<sup>j</sup>* databases of trace <sup>σ</sup>. We added the partial mapping *P* that assigns to every let-bound predicate its own progress, i.e., the progress of the formula defining the predicate. For example, the progress of a predicate *p* that is not let-bound is *j*. Otherwise, it is equal to the progress of the formula it is bound to (stored in *P p*). The progress of <sup>α</sup>U[*a*,*b*] <sup>β</sup> is the smallest *<sup>i</sup>* such that τ σ *<sup>i</sup>* <sup>≥</sup> τ σ (Min {*i*α,*i*β, *<sup>j</sup>*−1})−*b*. The progress of both <sup>α</sup>∧<sup>β</sup> and <sup>α</sup>∨<sup>β</sup> is Min {*i*α,*i*β}.
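A simplified rendering of the progress computation may help (propositional skeleton with our own names: `tau` is the list of time-stamps seen so far and `P` maps let-bound predicate names to their progress, as in the text):

```python
def prog(tau, P, f, j):
    """tau: time-stamps of the first j databases; P: progress of let-bound
    predicates. Returns how many time-points of f are fully evaluable."""
    kind = f[0]
    if kind == "pred":
        return P.get(f[1], j)                # plain trace predicates: j
    if kind in ("and", "or"):
        return min(prog(tau, P, f[1], j), prog(tau, P, f[2], j))
    if kind == "until":                      # alpha U_[a,b] beta
        _, (a, b), alpha, beta = f
        bound = min(prog(tau, P, alpha, j), prog(tau, P, beta, j), j - 1)
        # smallest i with tau[i] >= tau[bound] - b (i = j if none exists)
        return min(i for i in range(j + 1)
                   if i >= len(tau) or tau[i] >= tau[bound] - b)
    raise ValueError(kind)

until = ("until", (0, 1), ("pred", "p"), ("pred", "q"))
```

With time-stamps 0..4 the until formula's progress is 3: time-points 0–2 can be fully evaluated, since only for them are all time-stamps within the bound *b* = 1 already known. A let-bound predicate's progress in `P` caps the progress of every formula using it.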

The invariant invar <sup>σ</sup> *j P n s*ϕ <sup>ϕ</sup> relates an inductive state *<sup>s</sup>*ϕ to the formula <sup>ϕ</sup>. The inductive state must reflect the monitor's state after processing the first *j* databases in the trace σ, assuming that *<sup>P</sup>* specifies the let-bound predicates' progress. The parameter *<sup>n</sup>* is the length of the tuples stored within *<sup>s</sup>*ϕ. The invariant is defined inductively over *<sup>s</sup>*ϕ; we reuse VeriMon's definition for the MFOTL operators and add a case for Let:

$$\dfrac{\begin{array}{c} \text{invar } \sigma\; j\; P\; m\; s_\alpha\; \alpha \qquad \text{invar } (\sigma[p \mapsto \lambda i.\; \text{satrel } \sigma\; i\; \alpha])\; j\; (P[p \mapsto \text{prog } \sigma\; P\; \alpha\; j])\; n\; s_\beta\; \beta \\ m = \text{nfv } \alpha \qquad \{0 ..< m\} \subseteq \text{fv } \alpha \end{array}}{\text{invar } \sigma\; j\; P\; n\; (\text{SLet } p\; m\; s_\alpha\; s_\beta)\; (\text{Let } p := \alpha \text{ in } \beta)}$$

The first two premises restrict the subformula states *<sup>s</sup>*α and *<sup>s</sup>*β, where *<sup>s</sup>*β reflects the evaluation of β on the modified trace, and *<sup>p</sup>*'s progress is that of α. The premise *<sup>m</sup>* <sup>=</sup> nfv α enforces that *<sup>m</sup>* is equal to *<sup>p</sup>*'s arity, and {<sup>0</sup> ..< *<sup>m</sup>*} ⊆ fv α is the constraint from mon.

Our extensions preserve the monitor's correctness: we formally proved the theorem below, which characterizes the monitor's eval function. The theorem is stated here for the empty progress mapping ∅, which must be generalized in the proof (as *P* changes in the above rule). Let δ be a natural number and ϕ be a monitorable formula with *n* = nfv ϕ. The function the maps the optional value ⟨*x*⟩ to *x* and ⊥ to some unspecified value.

Theorem 1. *(a)* invar σ 0 ∅ *n s*<sup>0</sup><sub>ϕ</sub> ϕ *for the initial state s*<sup>0</sup><sub>ϕ</sub>*. (b) Suppose that s*<sub>ϕ</sub> *satisfies* invar σ *j* ∅ *n s*<sub>ϕ</sub> ϕ *and that dbs contains all relations from* σ *for the indices in the list js* = [*j* ..< *j* + δ]*. Then* (*xs*, *s*′<sub>ϕ</sub>) = eval (*j* + δ) *n* (map (τ σ) *js*) *dbs s*<sub>ϕ</sub> *satisfies* invar σ (*j* + δ) ∅ *n s*′<sub>ϕ</sub> ϕ*, and the i-th table in the list xs, for* prog σ ∅ ϕ *j* ≤ *i* < prog σ ∅ ϕ (*j* + δ)*, contains (only) all tuples v of length n satisfying* sat σ (map the *v*) *i* ϕ*.*

Soundness follows immediately from Thm. 1, whereas completeness additionally requires the aforementioned property that any progress can be reached by making the trace prefix long enough, which we also proved for our modified progress function:

Theorem 2. *If* mon ϕ*, then for all* *i* *there exists a* *j* *such that* prog σ ∅ ϕ *j* ≥ *i*.

# 4 Past-Recursive Let Operator

It is well-known that first-order logic (FOL) cannot express certain queries, notably the transitive closure of a binary relation. This remains true when restricted to finite structures [18]. Although MFOTL is rather different from ordinary FOL, we conjecture that it cannot express transitive closure either. This hampers its ability to model hierarchies of unbounded depth. Moreover, recursive patterns are sometimes the most natural way to express certain specifications. We describe an extension of MFOTL that can encode a "temporally directed" form of transitive closure and other recursive patterns.

Specifically, we introduce another let operator in which the predicate may refer to itself recursively. The intended semantics is that of a fixpoint, i.e., the predicate *p* defined by a formula α should be interpreted by a temporal relation that is equal to the evaluation of α under that interpretation of *<sup>p</sup>*. The fixpoint might not always exist or it might not be unique. Therefore, different fixpoint operators have been studied in the context of nontemporal logics and query languages [1]. For instance, it is common to require that all recursive occurrences of *p* in its defining formula are positive, i.e., under an even number of negations. This ensures monotonicity and hence the existence of a least fixpoint.

MFOTL's future operators are interpreted over infinite traces. This poses a new challenge for monitoring recursively defined predicates, even if we restrict our attention to positive formulas. Consider the recursive definition of *p* by *q* ∨ #[0,∞] *p*, where *q* is a predicate from the trace. Although *q* ∨ #[0,∞] *p* is monitorable (at most one additional time-point must be known to evaluate it), the recursive definition of *p* is equivalent to ♦[0,∞] *q* under the least fixpoint semantics. However, ♦[0,∞] *q* is not monitorable, as one might need the entire, infinite trace to evaluate it. Therefore, we focus on a fragment where every recursive occurrence of *p* must be *strictly in the past*. This guarantees a unique fixpoint even if the defining formula is not monotone, so the predicate may also occur negatively.

The syntax of our past-recursive let operator is similar to that of Let: we add the constructor LetPast *string* := *frm* in *frm* to the *frm* datatype. However, the semantics is different (Section 4.1). The restriction to strictly past recursion is enforced by a syntactic monitorability condition that is checked by mon. Consider the formula LetPast *p* := α in β. Intuitively, every recursive occurrence of *p* in α must be *guarded* by at least one strictly past operator, and there must be no future operator on the path from the occurrence to α's root. We *do* allow future operators in the other parts of α, though.

We give examples of LetPast in Section 4.2. The evaluation of LetPast requires an extension of VeriMon's algorithm (Section 4.3), which we also formally prove correct.


Fig. 2. Auxiliary definitions for the syntactic restriction on LetPast

#### 4.1 Semantics

The semantics of the past-recursive let operator is defined by the equation

$$\mathsf{sat}\ \sigma\ v\ i\ (\mathsf{LetPast}\ p := \alpha\ \mathsf{in}\ \beta) = \mathsf{sat}\ (\sigma[p \mapsto \mathsf{recp}\ (\lambda R\ j.\ \mathsf{satrel}\ (\sigma[p \mapsto R])\ j\ \alpha)])\ v\ i\ \beta$$

We evaluate β at the same time-point *i* as the recursive let operator, using an appropriately updated trace. The temporal relation assigned to *p* is computed by the combinator recp:

$$\begin{array}{l} \textbf{fun}\ \mathsf{recp} :: ((\mathit{nat} \Rightarrow \mathit{data\ list\ set}) \Rightarrow \mathit{nat} \Rightarrow \mathit{data\ list\ set}) \Rightarrow \mathit{nat} \Rightarrow \mathit{data\ list\ set}\ \textbf{where} \\ \quad \mathsf{recp}\ f\ i = f\ (\lambda j.\ \text{if } j < i \text{ then } \mathsf{recp}\ f\ j \text{ else } \{\})\ i \end{array}$$

The argument *f* is a function that transforms temporal relations, and recp *f* is again a temporal relation. Intuitively, recp *f* evaluates to the fixpoint *f* (recp *f*), except that *f R i* can only access time-points of *R* before *i*. For all other time-points *j* ≥ *i*, the relation *R j* is empty. The combinator recp is well-defined because *i* is a natural number; the recursive call recp *f j* affects the result only if *j* < *i*, and hence we can prove termination using *i* as a variant. For the semantics of LetPast, we choose *f R i* = satrel (σ[*p* ↦ *R*]) *i* α, i.e., the satisfactions of α with *p* mapped to *f*'s argument *R*, to which recp supplies the result of the recursive evaluation (up to but excluding *i*).
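To make the combinator concrete, the following Python sketch models temporal relations as functions from time-points to frozensets of tuples. The names and the example definition are illustrative, not VeriMon's actual code.

```python
from functools import lru_cache

def recp(f):
    """Well-founded recursion combinator for strictly past-guarded definitions.

    f maps a temporal relation (time-point -> frozenset of tuples) and a
    time-point to a frozenset; recp(f) is the fixpoint in which f's access
    to the relation is truncated to strictly earlier time-points.
    """
    @lru_cache(maxsize=None)
    def rel(i):
        # f may only look at time-points j < i; all later ones are empty,
        # so the recursion terminates with i as the variant.
        truncated = lambda j: rel(j) if j < i else frozenset()
        return f(truncated, i)
    return rel

# Example: p recursively defined by "q or (previous p)", i.e. once-in-the-past q.
q_holds_at = {2}  # q holds exactly at time-point 2
f = lambda R, i: (frozenset([()]) if i in q_holds_at or (i > 0 and R(i - 1))
                  else frozenset())
p = recp(f)
print([bool(p(i)) for i in range(5)])  # [False, False, True, True, True]
```

The truncation to `j < i` is exactly what makes the definition well-founded even though `f` itself is not required to be monotone.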

Our definition of sat is total: it gives meaning to every formula. This includes formulas LetPast *p* := α in β where *p* occurs in α without a past guard or under a future operator. However, the semantics behaves unexpectedly in such cases. For example, LetPast *p* := (*q* ∨ #[0,∞] *p*) in *p* is equivalent to *q*. Our monitor therefore requires properly guarded formulas. Not only does this avoid confusion about the semantics, it also simplifies the implementation because the monitor need not eliminate unguarded occurrences.

Next, we describe the formalization of the syntactic restriction. The idea is to determine for every predicate whether it is used strictly in the past by analyzing the formula recursively. The datatype *recSafety* (Fig. 2) represents the possible outcomes. U(nused) means that a predicate does not occur in the formula. P(ast) means that it is evaluated at strictly earlier time-points, whereas NF (Non-Future) additionally allows the current time-point. A(ny) covers all remaining cases. The linear order < on *recSafety* is induced by U < P < NF < A. Its reflexive closure ≤ corresponds to implication. For example, if the predicate *p* is unused (U), it is clearly evaluated at earlier time-points only (P). The least upper bound *x* ⊔ *y* with respect to ≤ corresponds to logical disjunction.

The function slp *p* ϕ (Fig. 2) analyzes the past-guardedness of a predicate *p* in a formula ϕ. It uses a composition operator *y* ∗ *x* on *recSafety*. The patterns in the definition of ∗ are matched sequentially from top to bottom; e.g., A ∗ U is equal to U. Intuitively, *y* ∗ *x* describes the guardedness of a predicate that is *x*-used in some subformula, which is itself *y*-used. For example, slp *p* (●*I* ϕ) = P ∗ slp *p* ϕ because ϕ and all occurrences of *p* therein are evaluated at time-points strictly in the past relative to ●*I* ϕ. Note that we make a case distinction for α S*I* β: if the interval *I* excludes zero, β is always evaluated strictly in the past. Future operators always result in A if *p* is used in an operand.
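To illustrate, here is a small Python sketch of the lattice and of slp on a toy formula datatype. Since Fig. 2's equations are not reproduced in the text, the composition `comp` below is a plausible reconstruction from the stated properties (A ∗ U = U, a previous operator contributes P, future operators yield A); it is not VeriMon's definition.

```python
# recSafety lattice: U < P < NF < A; the join is max on this order.
U, P, NF, A = range(4)

def comp(y, x):
    # Sequential pattern matching, top to bottom: an unused operand stays
    # unused, any future use poisons the result, else the stricter guard wins.
    if x == U or y == U:
        return U
    if x == A or y == A:
        return A
    return min(x, y)

def slp(p, phi):
    """Past-guardedness of predicate p in phi (tiny illustrative AST).

    Formulas are tuples: ('pred', name), ('or', a, b), ('and', a, b),
    ('prev', a), ('next', a), ('since', zero_in_interval, a, b).
    """
    kind = phi[0]
    if kind == 'pred':
        return NF if phi[1] == p else U
    if kind in ('or', 'and'):
        return max(slp(p, phi[1]), slp(p, phi[2]))
    if kind == 'prev':
        return comp(P, slp(p, phi[1]))
    if kind == 'next':  # future operator: any use of p becomes A
        return comp(A, slp(p, phi[1]))
    if kind == 'since':
        zero, a, b = phi[1], phi[2], phi[3]
        # a is evaluated at current and past points; b strictly in the past
        # iff the interval excludes zero.
        return max(comp(NF, slp(p, a)), comp(NF if zero else P, slp(p, b)))
    raise ValueError(kind)

# p guarded by "previous" is strictly past:
assert slp('p', ('or', ('pred', 'q'), ('prev', ('pred', 'p')))) == P
# a future operator on the path yields A, even under two "previous":
assert slp('p', ('prev', ('prev', ('next', ('pred', 'p'))))) == A
```

The second assertion reproduces the over-approximation discussed below: the net guardedness of the formula is strictly past, but the analysis reports A.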

Finally, we define the mon predicate for the recursive let operator:

$$\mathsf{mon}\ (\mathsf{LetPast}\ p := \alpha\ \mathsf{in}\ \beta) = \big(\mathsf{slp}\ p\ \alpha \leq \mathsf{P} \wedge \{0 \mathbin{..<} \mathsf{nfv}\ \alpha\} \subseteq \mathsf{fv}\ \alpha \wedge \mathsf{mon}\ \alpha \wedge \mathsf{mon}\ \beta\big)$$

The only difference to Let is the restriction of *p*'s occurrences in α via slp, which is generally an over-approximation. For example, slp *p* (●*I* ●*I* #*I* *p*) = A even though *p* is evaluated at strictly earlier time-points. Therefore, some instances of LetPast that our algorithm could evaluate correctly are not considered to satisfy mon. We plan to replace *recSafety* with a more precise lattice in future work.

#### 4.2 Examples

*Temporal Operators.* We first show that the non-metric S operator can be reduced to LetPast and ●. (We omit the interval subscripts if the interval is [0,∞].) Using the special ts(*t*) predicate, which is true iff *t* is the current time-stamp, we can also express the metric version. This example serves to gently illustrate the semantics of LetPast. In general, formulas are more readable if they are expressed directly in terms of S, and monitoring them can be more efficient. Below we give further examples in which LetPast adds expressiveness.

Let α and β be two monitorable MFOTL formulas with free variables fv α and fv β, respectively. The formula α S β is monitorable only if fv α ⊆ fv β, so let us assume that, too. The following unfolding of S's semantics is well-known:

$$\mathsf{sat}\ \sigma\ v\ i\ (\alpha\,\mathsf{S}\,\beta) \iff \mathsf{sat}\ \sigma\ v\ i\ \beta \vee \big(\mathsf{sat}\ \sigma\ v\ i\ \alpha \wedge i > 0 \wedge \mathsf{sat}\ \sigma\ v\ (i-1)\ (\alpha\,\mathsf{S}\,\beta)\big) \quad (1)$$

As the unfolding recursively evaluates the formula at the previous time-point, we can directly translate it into a recursive let operator: ϕS ≡ LetPast *s*(*x̄*) := ψ in *s*(*x̄*), where ψ ≡ β ∨ (α ∧ ● *s*(*x̄*)). The predicate name *s* must be fresh, i.e., it must occur in neither α nor β. The variable list *x̄* enumerates fv β. The formula ϕS is monitorable because *s*(*x̄*) is clearly past-guarded, and hence slp *s* ψ = P. (We also need fv β = {0 ..< nfv β}, which can be achieved by renaming variables in α and β.) Let us analyze the semantics of ϕS:

$$\begin{array}{ll} & \mathsf{sat}\ \sigma\ v\ i\ \varphi_{\mathsf{S}} \\ \iff & \mathsf{sat}\ (\sigma[s \mapsto \mathsf{recp}\ \underbrace{(\lambda R\ j.\ \mathsf{satrel}\ (\sigma[s \mapsto R])\ j\ \psi)}_{=\,f_\psi}])\ v\ i\ (s(\bar{x})) \\ \iff & v \in \mathsf{recp}\ f_\psi\ i \\ \iff & \mathsf{sat}\ (\sigma[s \mapsto \lambda j.\ \text{if } j < i \text{ then } \mathsf{recp}\ f_\psi\ j \text{ else } \{\}])\ v\ i\ \psi \quad (*) \\ \iff & \mathsf{sat}\ \sigma\ v\ i\ \beta \vee \big(\mathsf{sat}\ \sigma\ v\ i\ \alpha \wedge i > 0 \wedge v \in (\text{if } i-1 < i \text{ then } \mathsf{recp}\ f_\psi\ (i-1) \text{ else } \{\})\big) \\ \iff & \mathsf{sat}\ \sigma\ v\ i\ \beta \vee \big(\mathsf{sat}\ \sigma\ v\ i\ \alpha \wedge i > 0 \wedge \mathsf{sat}\ \sigma\ v\ (i-1)\ \varphi_{\mathsf{S}}\big) \end{array}$$

These equations hold for all valuations *v* of length nfv β, provided the variables *x̄* are ordered by their De Bruijn indices. Step (∗) exploits the freshness of *s* with respect to α and β, which allows us to replace σ[*s* ↦ ...] by σ. The equations result in the same unfolding as (1). Hence, we can prove the semantic equivalence of ϕS and α S β by induction on *i*.
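For the propositional case, the agreement between the unfolding (1) and the direct semantics of the non-metric S can be checked mechanically; the following self-contained Python snippet is illustrative.

```python
# Cross-check unfolding (1): sat i (a S b) = b(i) or (a(i) and i>0 and sat (i-1) (a S b)),
# against the direct definition of (non-metric, propositional) Since.
def since_direct(a, b, i):
    # exists j <= i with b at j and a at all k in (j, i]
    return any(b[j] and all(a[k] for k in range(j + 1, i + 1))
               for j in range(i + 1))

def since_unfolded(a, b, i):
    return b[i] or (i > 0 and a[i] and since_unfolded(a, b, i - 1))

a = [True, True, True, False, True]
b = [False, True, False, False, False]
results = [since_unfolded(a, b, i) for i in range(5)]
assert results == [since_direct(a, b, i) for i in range(5)]
print(results)  # [False, True, True, False, False]
```

The recursion in `since_unfolded` only ever looks at the strictly earlier time-point `i - 1`, which is exactly the shape that the past guard ● enforces in ψ.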

The following *SinceLet* formula encodes α S[*a*,*b*] β. Other encodings exist, however.

$$\mathsf{LetPast}\ s(\bar{x}, t) := (\beta \wedge \mathsf{ts}(t)) \vee (\alpha \wedge \bullet\, s(\bar{x}, t))\ \mathsf{in}\ \exists t, u.\ s(\bar{x}, t) \wedge \mathsf{ts}(u) \wedge a \le u - t \wedge u - t \le b$$

Here, *t* and *u* are fresh variables: *t* records the time-stamp of the past satisfaction of β, whereas *u* is the time-stamp at which we evaluate *SinceLet*. The subformula *a* ≤ *u*−*t* ∧ *u*−*t* ≤ *b* corresponds to τ σ *j* − τ σ *i* ∈ [*a*,*b*], which is part of S[*a*,*b*]'s semantics (Fig. 1).

*Temporally-Directed Transitive Closure.* We proceed by showing that LetPast can compute a temporally-directed transitive closure over events observed at a sequence of distinct time-points. Hence, we assume that the trace contains a single event at every time-point. The closure is directed in the sense that the transitive chains can only be extended by *newer* events. We consider the following two types of events from [14]: *r*(*y*, *x*, *d*) denotes that process *y* reports some data *d* to another process *x*, and *s*(*x*, *y*) denotes that process *x* spawns process *y*. The *Spawn* formula

$$\mathsf{LetPast}\ p(u, v) := s(u, v) \vee (\bullet\, p(u, v)) \vee (\exists t.\ (\bullet\, p(u, t)) \wedge s(t, v))\ \mathsf{in}\ r(y, x, d) \wedge \neg p(x, y)$$

encodes violations of the property that whenever process *y* sends some data *d* to a process *x*, denoted as *r*(*y*, *x*, *d*), then there was a chain of process spawns *s*(*x*, *x*₁), *s*(*x*₁, *x*₂), ..., *s*(*x*ₖ, *y*), occurring in this order in the trace. In other words, a process may only send data to its "ancestors". To check this property, a monitor needs to compute the (temporally directed) transitive closure *p*(*u*, *v*) of the relation *s*. The definition of the closure has two recursive predicate instances with different arguments. The *Spawn* formula is inspired by a similar formula used to evaluate the DejaVu monitor [14]. Unlike DejaVu, we do not require the formula to be closed and thus leave the variables *x*, *y*, and *d* free.
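The monitoring task behind *Spawn* can be mimicked imperatively: process one event per time-point and extend the closure only with the newer edge, mirroring the three disjuncts of the recursion. This sketch (with an ad-hoc event encoding) is illustrative and not VeriMon's table-based evaluation.

```python
def spawn_violations(trace):
    """trace: one event per time-point, ('s', x, y) = x spawns y,
    ('r', y, x, d) = y reports data d to x."""
    closure, violations = set(), []
    for i, ev in enumerate(trace):
        if ev[0] == 's':
            _, x, y = ev
            # extend only *older* chains ending at x with the new edge (x, y);
            # this is the temporally directed part of the closure
            closure |= {(x, y)} | {(u, y) for (u, v) in closure if v == x}
        else:
            _, y, x, d = ev
            # violation formula: r(y, x, d) and not p(x, y)
            if (x, y) not in closure:
                violations.append((i, y, x, d))
    return violations

trace = [('s', 'a', 'b'), ('s', 'b', 'c'),
         ('r', 'c', 'a', 'd1'),   # ok: s(a,b), s(b,c) occur in this order
         ('r', 'b', 'c', 'd2')]   # violation: no spawn chain from c to b
print(spawn_violations(trace))  # [(3, 'b', 'c', 'd2')]
```

Handling out-of-order spawns, as *Trans* below does, would additionally require extending chains on both sides of a new edge.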

The *Trans* formula

$$\begin{array}{l} \mathsf{LetPast}\ p(u, v) := s(u, v) \vee (\bullet\, p(u, v)) \vee {} \\ \quad (\exists t.\ (\bullet\, p(u, t)) \wedge s(t, v)) \vee (\exists t.\ s(u, t) \wedge (\bullet\, p(t, v))) \vee {} \\ \quad (\exists t, t'.\ (\bullet\, p(u, t)) \wedge s(t, t') \wedge (\bullet\, p(t', v))) \quad \mathsf{in} \quad r(y, x, d) \wedge \neg p(x, y) \end{array}$$

encodes violations of the same property as *Spawn* even if *s*(*x*, *x*₁), *s*(*x*₁, *x*₂), ..., *s*(*x*ₖ, *y*) are received by the monitor out of order, i.e., even if they do not occur in this order in the trace.

We can interpret the events *s*(*x*, *y*) as edges in a directed graph and the predicate *p*(*x*, *y*) in *Trans* as computing reachability between the vertices of this graph. We also extend the directed edges *s*(*x*, *y*) with a weight *w* to *s*⁺(*x*, *y*, *w*). Then the *Trans*⁺ formula

$$\begin{array}{l} \mathsf{LetPast}\ p(u, v, w) := s^{+}(u, v, w) \vee (\bullet\, p(u, v, w)) \vee {} \\ \quad (\exists t, w_1, w_2.\ (\bullet\, p(u, t, w_1)) \wedge s^{+}(t, v, w_2) \wedge w = w_1 + w_2) \vee {} \\ \quad (\exists t, w_1, w_2.\ s^{+}(u, t, w_1) \wedge (\bullet\, p(t, v, w_2)) \wedge w = w_1 + w_2) \vee {} \\ \quad (\exists t, t', w_1, w_2, w_3.\ (\bullet\, p(u, t, w_1)) \wedge s^{+}(t, t', w_2) \wedge (\bullet\, p(t', v, w_3)) \wedge w = w_1 + w_2 + w_3) \\ \mathsf{in}\ \mathsf{Let}\ m(u, v, w) := w \leftarrow \mathsf{MIN}(w; u, v).\ p(u, v, w)\ \mathsf{in}\ m(x, y, w) \wedge \neg(\bullet\, m(x, y, w)) \end{array}$$

yields all pairs of vertices *x*, *y* and the length *w* of the shortest path from *x* to *y* whenever *y* becomes reachable from *x* or the length of the shortest path changes. The relation *s*⁺(*x*, *y*, *w*) can itself be obtained by evaluating a more complex temporal formula, e.g., *s*⁺(*x*, *y*, *w*) ≡ *e*(*x*, *y*, *w*) ∧ ¬♦[0,10] *d*(*x*, *y*) with the following two types of events: *e*(*x*, *y*, *w*) denotes an edge from *x* to *y* with weight *w*; *d*(*x*, *y*) denotes deletion of the edge from *x* to *y*. The *eventually* operator ♦*I* ϕ abbreviates (∃*x*. *x* = *x*) U*I* ϕ. Such a relation *s*⁺(*x*, *y*, *w*) contains all edges that are not revoked within 10 time units after receiving *e*(*x*, *y*, *w*). We could use the non-recursive let operator Let *s*⁺(*x*, *y*, *w*) := *e*(*x*, *y*, *w*) ∧ ¬♦[0,10] *d*(*x*, *y*) to precompute the relation and use it when evaluating the recursive let operator in *Trans*⁺.

As another application of future operators under LetPast, recall our introductory example. Suppose that hosts in a network communicate with each other and with the outside world: *comm*(*src*, *dest*) indicates that host *src* sends a message to host *dest*; *in*(*r*, *h*) and *out*(*h*, *r*) indicate that the host *h* receives or sends traffic from or to an IP address in the range *r*, respectively. The hosts are equipped with an intrusion detection system (IDS), whose alerts are denoted by *ids*(*h*). We say that a host *h* is *tainted* by an address range *r* iff there is a chain of communication from *r* to *h* and all hosts on the chain (including *h*) trigger an IDS alert within one hour after communicating with the previous host. The formula

$$\mathsf{LetPast}\ \mathit{tainted}(r, h) := \big((\mathit{in}(r, h) \vee \exists h'.\ (\bullet\, \mathit{tainted}(r, h')) \wedge \mathit{comm}(h', h)) \wedge \Diamond_{[0,1\mathrm{h}]}\, \mathit{ids}(h)\big)\ \mathsf{in}\ \mathit{tainted}(r, h) \wedge \mathit{out}(h, r)$$

is true whenever a host communicates back to the IP range by which it was tainted.

*Periodic Behavior.* Suppose that we monitor a boolean signal *b*(*x*), parametrized by an integer parameter *x*, between the user's *start*(*x*) and *stop*(*x*) commands. An arbitrary amount of time may pass between these two commands. Our task is to detect periodic activations of *b*(*x*), with a fixed period *t* > 0 and error tolerance 0 ≤ ε < *t*. We shall ignore positive noise in *b*(*x*), i.e., additional activations besides the periodic ones.

Let us make the task more precise. An alarm must be raised at time-point *i*ₙ iff there exist time-points *i*₀ < *i*₁ < ··· < *i*ₙ such that *start*(*x*) holds at *i*₀, *stop*(*x*) holds at *i*ₙ, and *b*(*x*) holds at all *i*ₖ for 1 ≤ *k* ≤ *n*−1. Moreover, the difference of time-stamps for adjacent time-points *i*ₖ and *i*ₖ₊₁, where 1 ≤ *k* ≤ *n*−2, must be in the interval [*t*−ε, *t*+ε]; the differences for the pairs *i*₀, *i*₁ and *i*ₙ₋₁, *i*ₙ must each be at most *t*+ε.

Our first attempt *PB* to formalize the alarm condition without recursion is

$$\mathit{stop}(x) \wedge \big(\blacklozenge_I\, (\mathit{start}(x) \vee b(x))\big) \wedge \Big(\big(b(x) \longrightarrow (\blacklozenge_I\, \mathit{start}(x)) \vee (\blacklozenge_J\, b(x))\big)\ \mathsf{S}\ \mathit{start}(x)\Big)$$

where *I* = [0, *t*+ε], *J* = [*t*−ε, *t*+ε], and ⧫*K* ϕ abbreviates (∃*x*. *x* = *x*) S*K* ϕ. This formula follows an inductive approach: every *b*(*x*) between *start*(*x*) and *stop*(*x*) must be preceded by *b*(*x*) or *start*(*x*), with the appropriate time difference. However, *PB* does not ignore noise, as adding *b*(*x*) events to the trace may silence an alarm. For example, let *t* = 10, ε = 0, and σ be a trace starting with ({*start*(1)}, 0), ({*b*(1)}, 10), ({*stop*(1)}, 20). We write {*p*(1), *p*(2)} for the database in which the predicate *p* holds for 1 and 2. On σ, *PB* is true at the third time-point. Inserting a database {*b*(1)} with time-stamp 15 falsifies *PB* at the (now fourth) time-point, although the trace still satisfies the natural-language description.
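The precise alarm condition can also be checked directly, by dynamic programming over candidate chains. The sketch below (with an ad-hoc trace encoding) implements the natural-language specification, not the *PB* formula, and therefore tolerates the inserted noise.

```python
def alarm(trace, n, x, t, eps):
    """trace: list of (timestamp, set of events); events like ('start', x).
    Checks the chain condition i0 < i1 < ... < i_n = n from the text."""
    ts = [s for s, _ in trace]
    has = lambda i, ev: ev in trace[i][1]
    if not has(n, ('stop', x)):
        return False
    # ok[i]: some chain start(x), b(x), ..., b(x) with valid gaps ends at i
    ok = [False] * n
    for i in range(n):
        if has(i, ('b', x)):
            ok[i] = any((has(j, ('start', x)) and ts[i] - ts[j] <= t + eps) or
                        (ok[j] and t - eps <= ts[i] - ts[j] <= t + eps)
                        for j in range(i))
    # final gap to the stop command must be at most t + eps
    return any((has(j, ('start', x)) or ok[j]) and ts[n] - ts[j] <= t + eps
               for j in range(n))

sigma = [(0, {('start', 1)}), (10, {('b', 1)}), (20, {('stop', 1)})]
noisy = [(0, {('start', 1)}), (10, {('b', 1)}), (15, {('b', 1)}),
         (20, {('stop', 1)})]
print(alarm(sigma, 2, 1, 10, 0), alarm(noisy, 3, 1, 10, 0))  # True True
```

On the noisy trace, `alarm` still returns True via the chain at time-stamps 0, 10, 20, illustrating that the specification ignores the extra *b*(1) while *PB* does not.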

The following *PBLet* formula expresses the intended condition using LetPast:

$$\begin{array}{l} \mathsf{LetPast}\ \mathit{periodic}(x) := \mathit{start}(x) \vee \big(b(x) \wedge ((\blacklozenge_I\, \mathit{start}(x)) \vee (\blacklozenge_J\, \mathit{periodic}(x)))\big)\ \mathsf{in} \\ \quad \mathit{stop}(x) \wedge \blacklozenge_I\, \mathit{periodic}(x) \end{array}$$

This example depends crucially on the flexible past guards we support: here, the recursion is guarded by an interval constraint. Note that 0 ∉ *J* because we assumed ε < *t*.

As another example of periodic behavior, we analyze an integer-valued *signal*(*y*) between the (now non-parametric) commands *start* and *stop*. We aim to discover whether *signal*(*y*) is piecewise constant, with the constant segments being exactly *t* time units long. Moreover, the signal's values for subsequent segments must differ by at most δ. The next formula uses the general S operator as the recursion guard to capture this property.

$$\begin{array}{l} \mathsf{LetPast}\ \mathit{segment}(y) := \exists z.\ \mathit{signal}(y) \wedge \big(((\bullet\, \mathit{signal}(z))\ \mathsf{S}_{[0,t]}\ (\mathit{signal}(z) \wedge \bullet\, \mathit{start})) \vee {} \\ \quad ((\bullet\, \mathit{signal}(z))\ \mathsf{S}_{[t,t]}\ \mathit{segment}(z))\big) \wedge -\delta \leq y - z \wedge y - z \leq \delta\ \mathsf{in} \\ \quad \mathit{stop} \wedge \exists y.\ ((\bullet\, \mathit{signal}(y))\ \mathsf{S}_{[0,t]}\ \mathit{segment}(y)) \end{array}$$

*Turing Machines.* Every MFOTL formula can be viewed as a function on traces, where the function's output is the set of satisfying valuations, either at a fixed time-point or at all time-points. VeriMon's monitorable fragment guarantees that one can compute the valuations at every time-point. Thus, monitorable formulas correspond to computable functions. If we give up the requirement that the function's output must be available at a fixed time-point, the past-recursive let operator is expressive enough to simulate arbitrary Turing machines (TMs). This is not a contradiction: we simulate a single TM step at every time-point, and there is an infinite supply of time-points. Running the monitor on a configuration that does not halt will never produce an output, i.e., a nonempty set of satisfying valuations.

Let *M* = ⟨Σ, *b*, *Q*, *q*₀, *q*f, δ⟩ be a deterministic TM with tape alphabet Σ, blank symbol *b* ∈ Σ, control states *Q*, initial state *q*₀ ∈ *Q*, final state *q*f ∈ *Q*, and transition function δ ∈ (*Q*×Σ → *Q*×Σ×{−1, 0, 1}). Whenever the machine is in state *q*₁ and reads the symbol *s*₁, it enters state *q*₂, writes the symbol *s*₂, and moves the head by *m* tape cells to the right, where δ(*q*₁, *s*₁) = ⟨*q*₂, *s*₂, *m*⟩. Without loss of generality, we assume that Σ and *Q* are finite subsets of the integers. We simulate *M* using the formula ϕ*M* shown below.

$$\begin{array}{l} \mathsf{LetPast}\ \mathit{cfg}(q, i, s) := \\ \quad \mathsf{Let}\ \mathit{cfg}(q, i, s) := \bullet\, \mathit{cfg}(q, i, s)\ \mathsf{in} \\ \quad \mathsf{Let}\ \mathit{head}(q, s) := \mathit{cfg}(q, 0, s) \vee \big(\neg(\exists x, z.\ \mathit{cfg}(x, 0, z)) \wedge (\exists y, z.\ \mathit{cfg}(q, y, z)) \wedge s = b\big)\ \mathsf{in} \\ \quad (\mathit{input}(i, s) \wedge q = q_0)\ \vee \\ \quad \bigvee_{q_1, s_1,\ \delta(q_1, s_1) = \langle q_2, s_2, m\rangle} \big(\mathit{head}(q_1, s_1) \wedge q = q_2 \wedge ((i = -m \wedge s = s_2) \vee (\exists j.\ \mathit{cfg}(q_1, j, s) \wedge j \neq 0 \wedge i = j - m))\big) \\ \mathsf{in}\ \mathit{cfg}(q_f, i, s) \end{array}$$

The idea is that *cfg* represents the current configuration of the TM. Specifically, *cfg*(*q*, *i*, *s*) holds if the machine is in control state *q* and the tape contains the symbol *s* in the *i*-th cell to the right of the head (*i* may be negative). Note that we use nested, non-recursive let operators to abbreviate repeated subformulas. In the body of Let *cfg*(*q*, *i*, *s*) := ● *cfg*(*q*, *i*, *s*) in ..., the predicate *cfg* refers to the previous configuration. The predicate *head* provides the current state and the symbol under the head. Its definition extends the tape by a blank symbol if necessary. The simulation is started at time-point 0 by providing the tape's initial content in the predicate *input*, which must include the cell *input*(0, *s*₀) with the symbol *s*₀ under the head's initial position. If and only if *M* halts on this input, there exists a time-point *i* at which ϕ*M* is satisfied by at least one valuation (*i*, *s*). Moreover, the satisfying valuations at *i* represent the final state of the tape.
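The configuration update encoded by ϕ*M* can be replayed in Python, one step per time-point, with a configuration as a set of (*q*, *i*, *s*) triples (offsets relative to the head). The machine below, a unary incrementer, is an illustrative example, not part of the formalization.

```python
def tm_step(cfg, delta, blank):
    """cfg: set of (q, i, s) -- control state q, symbol s at offset i from head."""
    (q1,) = {q for (q, _, _) in cfg}       # deterministic: a single state
    head = {(q, s) for (q, i, s) in cfg if i == 0}
    if not head:  # no cell under the head yet: extend the tape with a blank
        head = {(q1, blank)}
        cfg = cfg | {(q1, 0, blank)}
    ((_, s1),) = head
    q2, s2, m = delta[(q1, s1)]
    # the old head cell now holds s2 at offset -m; all other cells shift by -m
    return {(q2, -m, s2)} | {(q2, j - m, s) for (_, j, s) in cfg if j != 0}

# Unary increment: write 1 and move right; halt after writing over the blank.
delta = {('go', 1): ('go', 1, 1), ('go', 'B'): ('halt', 1, 1)}
cfg = {('go', 0, 1), ('go', 1, 1)}  # tape "11", head on the first 1, state go
while not any(q == 'halt' for (q, _, _) in cfg):
    cfg = tm_step(cfg, delta, 'B')
print(sorted(cfg, key=lambda c: c[1]))  # three 1s: the increment of "11"
```

Each call of `tm_step` corresponds to one time-point of the simulated trace; the tape-extension branch plays the role of the second disjunct in *head*'s definition.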

#### 4.3 Algorithm

The restriction to past-guarded recursion allows for an efficient evaluation algorithm for LetPast formulas. It is efficient because no fixpoint iteration is required at individual time-points. To evaluate LetPast *p* := α in β, we first try to evaluate α at as many time-points as possible and then use the results to interpret *p* in β. This part is the same as for the non-recursive Let, but the evaluation of α itself differs. The syntactic monitorability condition guarantees that α at time-point *i* depends on the predicate *p* only at time-points strictly less than *i*. Specifically, we have defined mon (LetPast *p* := α in β) such that the progress of α's evaluation does not depend on *p*'s progress beyond time-point *i*−1. Therefore, we can evaluate α at time-point 0 without providing any table for *p*, then use the result to evaluate α at time-point 1, and so forth.

There are two cases that require care. First, if α contains future operators, multiple time-points may be evaluated at once. The above process must then be repeated within a single monitor step. Second, if α contains no future operators, α is evaluated at all time-points *i* < *j*, where *j* is the current trace prefix length. We could then attempt to evaluate α once more at time-point *j* using the table computed at *j*−1 for *p*. However, this would not yield any further tables because all occurrences of *p* are below at least one past operator that tries to access the time-stamp at time-point *j*, which is not yet known. Therefore, this last evaluation attempt would needlessly traverse the formula state. We optimize this case and buffer α's result at time-point *j*−1 until the next input database arrives.
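For the special case in which α contains no future operators and refers to *p* only through the guard ●, the left-to-right pass can be sketched as follows; the interfaces (`alpha`, `beta` as per-time-point functions) are hypothetical simplifications of the monitor's state-passing style.

```python
# Minimal sketch of evaluating LetPast p := alpha in beta when alpha contains
# no future operators: alpha at time-point i may read p only at earlier
# time-points, so one pass suffices -- no fixpoint iteration per time-point.
def eval_letpast(alpha, beta, trace):
    """alpha(db, p_prev) -> table; beta(db, p_now) -> table.

    p_prev is p's table at the previous time-point (None at time-point 0).
    Returns beta's table at every time-point of the finite trace prefix.
    """
    p_tables, prev = [], None
    for db in trace:
        cur = alpha(db, prev)      # uses only strictly past values of p
        p_tables.append(cur)
        prev = cur
    return [beta(db, p) for db, p in zip(trace, p_tables)]

# Once-in-the-past q, as in LetPast p := q or (previous p) in p:
alpha = lambda db, prev: db['q'] | (prev or set())
beta = lambda db, p: p
trace = [{'q': set()}, {'q': {(1,)}}, {'q': set()}]
print(eval_letpast(alpha, beta, trace))  # [set(), {(1,)}, {(1,)}]
```

In the real algorithm, α may need several past time-points of *p* and may itself lag behind the input; the buffering described next handles exactly the results that have been produced but not yet fed back.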

It is crucial that the evaluation of a recursive let does not get stuck waiting for tables that it needs to produce itself. Therefore, all operators that are strictly past-guarding as defined by slp (Fig. 2) must be well-behaved: the evaluation algorithm must compute a result at time-point *<sup>i</sup>* < *<sup>j</sup>* even if the operands' results are available only for time-points *i* 0 <sup>&</sup>lt; *<sup>i</sup>*. In particular, <sup>S</sup>*<sup>I</sup>* without <sup>0</sup> in the interval is considered strictly past-guarding. We have modified VeriMon's evaluation algorithm for <sup>α</sup>S*<sup>I</sup>* <sup>β</sup> to achieve this behavior.

The inductive state SLetPast *p m s*α *s*β *i buf* for a recursive let operator extends SLet with a counter *i* :: *nat*, which tracks the progress of *p* as observed by *s*α, and an optional buffer *buf* :: *table option*. The meaning of the other arguments is the same as for SLet. In the initial state, *i* is zero and *buf* is ⊥. Let the function list_opt map ⊥ to [] and ⟨*x*⟩ to [*x*], where ⟨*x*⟩ is the embedding of *x* into the *option* type. A single monitor step updates the state as follows (see Section 3 for a description of eval's interface):

$$\begin{array}{l} \mathsf{eval}\ j\ n\ \mathit{tss}\ \mathit{dbs}\ (\mathsf{SLetPast}\ p\ m\ s_\alpha\ s_\beta\ i\ \mathit{buf}) = \\ \quad (\mathsf{let}\ (\mathit{xs}, s'_\alpha, i', \mathit{buf}') = \mathsf{eval}_{\mathsf{LP}}\ j\ m\ \mathit{tss}\ \mathit{dbs}\ p\ []\ s_\alpha\ i\ (\mathsf{list\_opt}\ \mathit{buf}); \\ \quad\phantom{(\mathsf{let}\ } (\mathit{ys}, s'_\beta) = \mathsf{eval}\ j\ n\ \mathit{tss}\ (\mathit{dbs}[p \mapsto \mathit{xs}])\ s_\beta \\ \quad \mathsf{in}\ (\mathit{ys}, \mathsf{SLetPast}\ p\ m\ s'_\alpha\ s'_\beta\ i'\ \mathit{buf}')) \end{array}$$

The heavy lifting is performed by evalLP, which is mutually recursive with eval. We forward relevant variables from eval. The accumulator *xs* :: *table list* collects *s*α's results.

$$\begin{array}{l}
\mathsf{eval_{LP}}\;j\;m\;tss\;dbs\;p\;xs\;s_{\alpha}\;i\;buf = \\
\quad\mathsf{let}\;(xs', s'_{\alpha}) = \mathsf{eval}\;j\;m\;tss\;(dbs[p \mapsto buf])\;s_{\alpha};\;\; i' = i + \mathsf{length}\;buf \\
\quad\mathsf{in\;case}\;xs'\;\mathsf{of} \\
\quad\quad[\,] \Rightarrow (xs, s'_{\alpha}, i', \bot) \\
\quad\;\mid x\;\#\;\_ \Rightarrow (\mathsf{if}\;i' + 1 \ge j\;\mathsf{then}\;(xs \mathbin{@} xs', s'_{\alpha}, i', \langle x\rangle) \\
\quad\quad\quad\mathsf{else}\;\mathsf{eval_{LP}}\;j\;m\;[\,]\;(\mathsf{clear\_dbs}\;dbs)\;p\;(xs \mathbin{@} xs')\;s'_{\alpha}\;i'\;xs')
\end{array}$$

First, evalLP evaluates *s*α with *dbs* updated at *p* using the current buffer, which may be empty. Since *i* tracks *p*'s progress, its new value *i*′ is *i* increased by the length of *buf*. The evaluation results in a list *xs*′ of tables and a new state *s*′α. We continue to iterate evalLP only if two conditions are met: *xs*′ must be nonempty, as otherwise there is no new data to evaluate *s*′α on, and *i*′+1 must be less than the current input prefix length *j*. The latter condition serves as an obvious termination criterion, although it is stricter than necessary. We could perform an additional iteration in the case that *i*′+1 = *j*. However, such an iteration would never produce new results because the past operators guarding *p* can only be evaluated further if there are new time-stamps. Therefore, we optimize this case by choosing the stricter condition. If we continue the iteration, we append *xs*′ to the accumulator *xs*. Moreover, we clear *tss* and *dbs* because all tables from the new input database have already been processed by the first call to eval. Specifically, the function clear\_dbs *dbs* updates *dbs* at all points at which it is defined to an empty list.

We illustrate our algorithm with an example, tracing the computations of eval and evalLP. We evaluate LetPast *p*(*x*) := *q*(*x*) ∨ ●*p*(*x*) in *p*(*x*), which has the same semantics as ♦[0,∞] *q*(*x*), on a prefix with two time-points at time-stamps 0 and 3. We omit details about the subformulas' states, as well as brackets around singleton lists, i.e., [1] is displayed as 1. Let *dbs*₀ = {*q* ↦ [{1},{2}]} be the content of the trace prefix.

```
eval j:2 n:1 tss:[0,3] dbs:dbs₀ sϕ:(SLetPast p 1 α₀ β₀ 0 ⊥)
| evalLP j:2 m:1 tss:[0,3] dbs:dbs₀ p:p xs:[] sα:α₀ i:0 buf:[]
| | eval j:2 n:1 tss:[0,3] dbs:(dbs₀[p ↦ []]) sϕ:α₀ = ([{1}], α₁)
| | evalLP j:2 m:1 tss:[] dbs:{q ↦ []} p:p xs:[{1}] sα:α₁ i:0 buf:[{1}]
| | | eval j:2 n:1 tss:[] dbs:{p ↦ [{1}], q ↦ []} sϕ:α₁ = ([{1,2}], α₂)
| | | iteration stops because i′ = 1 and hence i′+1 = 2 ≥ j = 2
| | = ([{1},{1,2}], α₂, 1, ⟨{1,2}⟩)
| = ([{1},{1,2}], α₂, 1, ⟨{1,2}⟩)
| eval j:2 n:1 tss:[0,3] dbs:(dbs₀[p ↦ [{1},{1,2}]]) sϕ:β₀ = ([{1},{1,2}], β₂)
= ([{1},{1,2}], SLetPast p 1 α₂ β₂ 1 ⟨{1,2}⟩)
```
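For this particular example formula, the table sequence computed for *p* can be reproduced with a small standalone Python model of the feedback loop (a hypothetical helper for illustration, not VeriMon's generated OCaml code): at each time-point, α's result is *q*'s table joined with the previous result, which the monitor feeds back through its buffer.

```python
def let_past_once(q_dbs):
    """Model of LetPast p(x) := q(x) ∨ ●p(x) in p(x): the table for p
    at time-point i is q's table at i united with p's table at i-1,
    which the feedback loop supplies via the buffer."""
    tables, prev = [], set()
    for q in q_dbs:
        cur = q | prev       # q(x) ∨ ●p(x)
        tables.append(cur)
        prev = cur           # buffered result for the next time-point
    return tables

# The trace prefix from the example: q maps to [{1}, {2}].
print(let_past_once([{1}, {2}]))  # [{1}, {1, 2}]
```

The output matches the tables [{1},{1,2}] produced for *p* in the trace above.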

*Correctness.* We extended the correctness proof of eval (Thm. 1) to cover the new state constructor SLetPast. The added case differs from the one for the non-recursive let in that evalLP is used to evaluate the first subformula. The proof also required additional invariants for the *i* and *buf* arguments of SLetPast, as well as a characterization of LetPast's progress. Recall that progress describes the number of time-points that the monitor is able to evaluate given a trace prefix of length *j*. We express the progress of the let-bound predicate *p*, which is defined in terms of α, as a least fixpoint:

$$\begin{array}{l}
\mathsf{prog_{LP}}\;\sigma\;P\;p\;\alpha\;j = \bigsqcap\,\{i.\;\; i = \mathsf{prog}\;\sigma\;(P[p \mapsto i])\;\alpha\;j\} \\
\mathsf{prog}\;\sigma\;P\;(\mathsf{LetPast}\;p := \alpha\;\mathsf{in}\;\beta)\;j = \mathsf{prog}\;\sigma\;(P[p \mapsto \mathsf{prog_{LP}}\;\sigma\;P\;p\;\alpha\;j])\;\beta\;j
\end{array}$$

(We do not update σ in these definitions as progress depends only on the time-stamp sequence but not on the databases in σ.) The above characterization follows the iteration in evalLP: since prog is pointwise monotone in *P* and at most *j* (we prove both facts in the formalization), the fixpoint can be reached by iteratively computing prog σ (*P*[*p* ↦ *i*]) α *j* starting with *i* = 0. Similarly, evalLP starts by evaluating α with no data for *p* and feeds the results back into the evaluation until no further results can be obtained. Theorem 2 remains true after adding the above equation to prog.
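This iteration can be sketched as a generic Kleene fixpoint computation (an illustrative Python model; `prog_alpha` is an assumed monotone function mapping *p*'s tentative progress to α's resulting progress, bounded by the prefix length *j*):

```python
def prog_lp(prog_alpha, j):
    """Least fixpoint of i = prog_alpha(i), computed by iteration from 0.
    Terminates because prog_alpha is monotone and bounded by j."""
    i = 0
    while True:
        i_next = prog_alpha(i)
        if i_next == i:
            return i
        i = i_next

# If every occurrence of p in α is guarded by one past operator, α's
# progress is one ahead of p's (capped at the prefix length j), and the
# fixpoint is the whole prefix:
assert prog_lp(lambda i: min(i + 1, 5), 5) == 5
# Without a past guard, α's progress equals p's tentative progress, so
# the least fixpoint is 0: no time-point could be evaluated.
assert prog_lp(lambda i: i, 5) == 0
```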

The state invariant for SLetPast is given by the rule

$$\frac{\begin{array}{c}
\mathsf{invar}\;(\sigma[p \Rrightarrow \mathsf{rec}_p\;(\lambda R\,k.\;\mathsf{sat_{rel}}\;(\sigma[p \Rrightarrow R])\;k\;\alpha)])\;j\;(P[p \mapsto i])\;m\;s_{\alpha}\;\alpha \\
\mathsf{invar}\;(\sigma[p \Rrightarrow \mathsf{rec}_p\;(\lambda R\,k.\;\mathsf{sat_{rel}}\;(\sigma[p \Rrightarrow R])\;k\;\alpha)])\;j\;(P[p \mapsto \mathsf{prog_{LP}}\;\sigma\;P\;p\;\alpha\;j])\;n\;s_{\beta}\;\beta \\
buf = \bot \longrightarrow i = \mathsf{prog_{LP}}\;\sigma\;P\;p\;\alpha\;j \\
\forall Z.\;buf = \langle Z\rangle \longrightarrow i + 1 = \mathsf{prog_{LP}}\;\sigma\;P\;p\;\alpha\;j \;\land\; \mathsf{table}\;m\;(\mathsf{fv}\;\alpha)\;(\mathsf{rec}_p\;(\lambda R\,k.\;\mathsf{sat_{rel}}\;(\sigma[p \Rrightarrow R])\;k\;\alpha))\;Z \\
m = \mathsf{nfv}\;\alpha \qquad \mathsf{slp}\;p\;\alpha \le \mathsf{PastRec} \qquad \{0\,..<m\} \subseteq \mathsf{fv}\;\alpha
\end{array}}{\mathsf{invar}\;\sigma\;j\;P\;n\;(\mathsf{SLetPast}\;p\;m\;s_{\alpha}\;s_{\beta}\;i\;buf)\;(\mathsf{LetPast}\;p := \alpha\;\mathsf{in}\;\beta)}$$

The first two premises use the same updated trace as in the semantics of LetPast (Section 4.1). The updated progress for *p* differs slightly between the premise for *s*α and that for *s*β. For the latter it is given by progLP, as expected. The predicate *p*'s progress within *s*α is equal to the state variable *i*, which is one less than progLP σ *P p* α *j* if the buffer *buf* is nonempty. This reflects the optimization discussed in Section 4.3. The predicate table *n A R Z* is true iff the table *Z* contains tuples of length *n* that assign values to the variables in *A*, and they are exactly the tuples of this kind whose valuations belong to *R*.

# 5 Evaluation

We have used Isabelle/HOL's code generator [12] to export a certified implementation of VeriMon's core init and step functions and every function these depend on (e.g., operations on red-black trees), which amounts to about 10 000 lines of OCaml code. VeriMon augments this generated code with unverified parsers and pretty-printers. We evaluate this implementation to answer the following research questions: (1) How does VeriMon perform when monitoring formulas with the recursive let operator? (2) How does it compare to existing monitors for temporal first-order specifications with recursive rules?

To answer these questions, we run VeriMon and DejaVu and benchmark some of the example formulas introduced in Section 4.2. Instead of *SinceLet*, we opt for the simpler *OnceLet* = LetPast *o*(*u*, *v*) := *s*(*u*, *v*) ∨ ●*o*(*u*, *v*) in *filter*(*x*, *y*) ∧ *o*(*x*, *y*), encoding the non-metric once operator. We also include *Once* = *filter*(*x*, *y*) ∧ ♦ *s*(*x*, *y*) for comparison. The predicate *filter*(*x*, *y*) keeps the output size small. The *OnceLet* formula uses only one recursive predicate instance, whose variable order matches the one in the predicate's definition. Other formulas have more than one instance with different variable orders.

For the *PBLet* formula, we use an existing random trace generator [17] configured to pick parameters from a small integer domain, which increases the probability of producing satisfactions. For the other formulas, we generate traces using a strategy similar to the one used in DejaVu's benchmarks on the *Spawn* formula [14]. Namely, the edges of a tree of spawned processes with a configurable branching factor are linearized into a trace, level by level. In the final level, all edges converge to a single node for the formulas *Trans* and *Trans*⁺. We define the edges by Let *s*⁺(*x*, *y*, *w*) := *e*(*x*, *y*, *w*) ∧ ¬♦[0,10] *d*(*x*, *y*) in the *Trans*⁺ formula, and revoke one half of the edges on the second level of the branching.

We have executed our experiments on an Intel Core i5-4200U CPU with 8 GB RAM. Initially, DejaVu crashed on the *OnceLet* and *Spawn* formulas. We investigated the issue and found that the formula's abstract syntax tree was disconnected in these cases. We assume that this is caused by naming variables in the recursive rules' definitions


Fig. 3. Execution times of the monitors in seconds (TO = timeout of 120 seconds)

differently from those in the rules' usages. After renaming the variables in the let-bound predicates of these two formulas, the issue was fixed and we restarted the experiments.

The evaluation results (Figure 3) show that DejaVu's performance is incomparable to VeriMon's. VeriMon outperforms DejaVu on the formulas *Once* and *OnceLet* and scales well on *PBLet*, which, together with the *Trans*⁺ formula, we could not express in PFLTL with recursion. DejaVu outperforms VeriMon on the *Spawn* and *Trans* formulas, for which VeriMon's time complexity of processing one event is linear in the trace length: the number *N* of valuations satisfying the recursive predicates grows linearly in the trace length, and the time complexity of updating the recursive predicate is linear in *N*. We conjecture, based on some preliminary experiments, that VeriMon's performance can be significantly improved by optimizing the representation of sets of tuples in two ways: (a) using tuples of a fixed length with a fixed assignment of variables to positions in a tuple (i.e., no De Bruijn indices); (b) using a collection of indices to optimize the computation of joins on various sets of shared columns. Nevertheless, processing one event is unlikely to become trace-length independent: *Trans* encodes the *incremental dynamic transitive closure* graph problem, for which the best known algorithm processes every new edge in amortized linear time (in the graph's maximum out-degree) [23].
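The cost argument can be made concrete with the textbook incremental transitive-closure update (a simplified Python sketch, not VeriMon's data structures or the algorithm of [23]): inserting a single edge may extend the reachability sets of many existing vertices, so per-edge work necessarily grows with the graph.

```python
def add_edge(reach, u, v):
    """Incrementally maintain reach[x], the set of vertices reachable
    from x by a nonempty path, after inserting edge (u, v)."""
    reach.setdefault(u, set())
    reach.setdefault(v, set())
    new = {v} | reach[v]          # everything now reachable through u→v
    for x, r in reach.items():
        if x == u or u in r:      # x already reaches u ...
            r |= new              # ... so x now also reaches v and beyond

reach = {}
for edge in [(1, 2), (2, 3), (3, 4)]:
    add_edge(reach, *edge)
print(reach[1])  # {2, 3, 4}
```

Each insertion scans all vertices that reach the edge's source, which mirrors why the per-event cost of *Trans* grows with the trace.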

### 6 Conclusion

We have presented the extension of a monitor for MFOTL with non-recursive and past-recursive let operators. The presence of bounded future temporal operators complicates both the semantics and the evaluation algorithms for the new constructs, compared to earlier unverified extensions of past-only monitors [14]. Yet, the formal correctness proofs that we have carried out ensure the trustworthiness of our development.

As future work we plan to improve the performance of evaluating expensive joins by introducing indices, as used in database management systems. Expressiveness-wise we will consider further relaxing the requirements on the recursive let. We can omit the past guard if we define a Datalog-style fragment for which the fixpoint is well-defined. Beyond relaxing guards, we may want to allow recursion through future operators in certain situations. The main challenge is that this would make the progress notion data-dependent (unlike currently, where it only depends on the time-stamps).

*Acknowledgments* We thank David Basin for supporting this work and the anonymous TACAS reviewers for their helpful comments. Dmitriy Traytel is supported by a Novo Nordisk Fonden Start Package Grant (NNF20OC0063462).

# References



# Maximizing Branch Coverage with Constrained Horn Clauses

Ilia Zlatkin, Grigory Fedyukovich

Florida State University, Tallahassee, FL, USA, iz20e@fsu.edu, grigory@cs.fsu.edu

Abstract. State-of-the-art solvers for constrained Horn clauses (CHC) are successfully used to generate reachability facts from symbolic encodings of programs. In this paper, we present a new application to test-case generation: if a block of code is provably unreachable, no test case can be generated for it, and the effort can be directed at exploring other blocks of code. Our new approach uses CHC to incrementally construct different program unrollings and extract test cases from models of satisfiable formulas. At the same time, a CHC solver keeps track of CHCs that represent unreachable blocks of code, which makes the unrolling process more efficient. In practice, this lets our approach terminate early while guaranteeing maximal coverage. Our implementation called Horntinuum exhibits promising performance: it generates high coverage in the majority of cases and spends less time on average than state-of-the-art tools.

# 1 Introduction

Branch coverage is a testing method that aims to maximize the number of program branches collectively visited by a set of test cases. Branches in the code commonly correspond to conditional statements or loops. For testing a loop-free program, possible test cases for all the branches can be identified by symbolic execution, powered by efficient solvers for Boolean Satisfiability (SAT) or Satisfiability Modulo Theories (SMT). If a conditional is placed inside or after a loop, test-case generation immediately becomes challenging because the cost of exploring each subsequent iteration grows exponentially in the worst case.

Many verification problems can be reduced to synthesizing interpretations of predicates in systems of SMT formulas, also known as constrained Horn clauses (CHC), which provide a modular encoding for programs with arbitrary control flow. In this paper, we propose to use CHC also for test-case generation. Solutions to CHC, also called inductive invariants, carry reachability information and are useful for pruning the search space explored by test-case generators. If an invariant shows that a branch can never be taken, then it is guaranteed that no test can ever reach the branch, and thus a test-case generator can safely proceed to discovering the next test case.

We contribute a new approach to test-case generation that aims at maximizing branch coverage using inductive invariants. In essence, our approach gradually enumerates different unrollings and uses an off-the-shelf SMT solver to obtain values of program variables that represent test cases. Unrollings are constructed on the fly by exploring the CHC encoding of programs. Concurrently, an incremental CHC solver determines a subset of unreachable CHCs, which allows the algorithm to explore fewer unrollings in subsequent iterations. The algorithm terminates when test cases have been generated for all reachable branches and all the remaining branches are provably unreachable.

These features distinguish our approach from other white-box test generators [1, 8, 9] that consider reachability information only in a bounded context. That is, in the presence of unreachable branches and loops, they may continue iterating forever, even if all possible test cases have already been generated. Reliance on invariants lets our tool terminate early while still guaranteeing the maximal possible coverage.

The approach has been implemented on top of the FreqHorn CHC solver [14] and the Z3 SMT solver [27]. It enables test-case generation for C programs, converted to CHCs by the SeaHorn [21] tool. Experiments conducted on a range of public benchmarks demonstrate the strengths of our approach compared to the state of the art: SMT-based incremental test-case generation is able to detect high-quality solutions in the majority of cases and is on average less expensive.

#### 2 Related Work

Automated test generation has two main approaches: fuzzing (e.g., [7,20,25,26,29,31,33,34]) and symbolic/concolic execution (e.g., [3,8,11,22,23,28,32]). The former group uses user-given seed inputs and further mutates them based on various heuristics (sometimes using the source code as well). The latter group, which also includes ours, proceeds by enumerating paths and generating test cases, often using constraint solvers. Recent algorithms, including FuSeBMC [1] and VeriFuzz [9], follow both approaches: they begin with symbolic execution (namely, some bounded model checking [10, 19]) and then proceed to fuzzing.

The closest related work [22] suggests accelerating testing using interpolation. While aiming at the same goal as ours, i.e., pruning unreachable paths, it does not generate inductive invariants, which limits the generality of the method.

Earlier attempts to combine static analysis techniques and testing [11] were tailored to particular frameworks and languages. With the rise of SMT solvers, approaches became more scalable, goal-oriented [3], and at the same time more agnostic to programming languages. Recent works, e.g., [33], offer great flexibility in applying static analyzers to test-case generation, e.g., to direct fuzzers to specific blocks of code. Following this trend, our approach continues bridging the gap between the state of the art in automated reasoning and testing.

While we are not aware of any specific applications of CHC solvers to test-case generation, we are largely inspired by work in model checking, e.g., [6, 21], that can both discover invariants and find counterexamples (from which a test case can be extracted). The main difference is in the application: model checkers often focus on a single property/bug, while our goal is to cover the maximal number of branches. Furthermore, many practical approaches, including [1, 9],

```
1 int x = 0;
2 int y = nondet ();
3 int z = nondet ();
4 while (1) {
5 if (x >= 5)
6 y ++; // needs at least 6 iterations to reach
7 else
8 x ++; // x ∈ [0, 5] always holds
9 if (y <= 5)
10 z ++;
11 else
12 if (x > y)
13 y ++; // this is unreachable
14 else
15 x = 0;
16 if (z == 0)
17 break ;
18 }
```
Fig. 1: Loopy program with control-flow divergence and unreachable branches.

are based on existing model checkers (which typically use constraint solvers as a black box). In contrast, the CHC formulation allows tools to be built modularly and directly on top of an SMT solver, thus enabling its incremental use for both counterexample finding and invariant generation.

#### 3 Motivating Example

Fig. 1 gives a program with a single loop. It has three variables: x is assigned zero before the loop, which we cannot change, and the remaining y and z could be taken from the user. The loop has four if-then-elses (including one nested), and it terminates when the value of z at the end of an iteration equals zero. To completely cover all the branches, we need to consider seven cases, in particular:


All these make the program quite interesting and its analysis challenging.

#### 4 Background

This paper approaches the problem of automated test-case generation by reduction to the Satisfiability Modulo Theories (SMT) problem. Automated SMT solvers determine the existence of a satisfying assignment to the variables (also called a model) of a first-order logic formula. Formula φ is logically stronger than formula ψ (denoted φ =⇒ ψ) if every model of φ also satisfies ψ. The unsatisfiability of formula φ is denoted φ =⇒ ⊥, and we also write *M* ∈ ∅ to indicate that no model *M* of the formula (which is clear from the context) exists. By writing φ(x⃗), we denote a predicate φ over free variables x⃗.

Constrained Horn clauses (CHC) are used as an intermediate verification language by both verification frontends and backend SMT solvers. This allows the effort of designing a verification tool for a new language to be split: while focusing on encoding programs to CHCs, researchers rely on advances in CHC solvers that will solve these CHCs. Thus, by demonstrating our algorithms at the level of CHCs, we allow for many particular instantiations of them for various programming languages (that support CHC encoding).

Definition 1. A linear constrained Horn clause (CHC) over a set of uninterpreted relation symbols *R* is a first-order-logic formula having the form of either:

$$\begin{aligned} \varphi(\vec{x}_1) &\Longrightarrow inv_1(\vec{x}_1)\\ inv_i(\vec{x}_i) \land \varphi(\vec{x}_i, \vec{x}_j) &\Longrightarrow inv_j(\vec{x}_j)\\ inv_n(\vec{x}_n) \land \varphi(\vec{x}_n) &\Longrightarrow \bot \end{aligned}$$

where all *inv* ∈ *R* are uninterpreted symbols, all x⃗ are vectors of variables, and φ is a fully interpreted formula called the constraint.

These types of implications are called, respectively, a fact, an inductive clause, and a query. Note that the constraint φ of each CHC does not contain applications of any predicates from *R*. Further, by body(*C*), we denote the premise of a CHC *C*, and by src(*C*) the application of a predicate from *R* in body(*C*) (but if *C* is a fact, we write src(*C*) ≝ ⊤). Similarly, by head(*C*), we denote the conclusion of *C*, and by dst(*C*) the application of a predicate from *R* in head(*C*) (and if *C* is a query, we write dst(*C*) ≝ ⊥).
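These accessors can be modeled with a small illustrative data structure (the class and field names are hypothetical, not taken from any tool): a clause stores an optional source predicate, a constraint, and an optional destination predicate; facts have src = ⊤ and queries have dst = ⊥.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CHC:
    src_rel: Optional[str]           # None encodes ⊤, i.e., a fact
    constraint: Callable[..., bool]  # fully interpreted formula φ
    dst_rel: Optional[str]           # None encodes ⊥, i.e., a query

    def is_fact(self):
        return self.src_rel is None

    def is_query(self):
        return self.dst_rel is None

# CHC "x = 0 =⇒ A(x, y, z)" from the upcoming motivating encoding:
c1 = CHC(None, lambda x, y, z: x == 0, "A")
assert c1.is_fact() and not c1.is_query()
```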

Intuitively, CHCs make it possible to generate program encodings with "holes" that represent unrollings of unknown lengths. Possible instantiations of these holes can then be used to discover meaningful information about the program, such as loop invariants or function summaries.

Definition 2. Given a set *R* of uninterpreted predicates and a set *S* of CHCs over *R*, we say that *S* is satisfiable if there exists an interpretation for every *inv* ∈ *R* that makes all implications in *S* valid.

CHCs are also useful when there is a need to access various pieces of the program encoding and pose reachability queries. In particular, it is straightforward to design a Bounded Model Checking (BMC) [5] tool on top of CHCs and use it for test-case generation. Specifically, by traversing the graph structure imposed on the CHCs, we can access all possible program traces and create the corresponding unrollings.

Definition 3. Given a system *S* of CHCs over *R*, an unrolling of *S* of length *k* is a conjunction ⟨*C*₀,...,*C*ₖ⟩ ≝ ⋀₀≤ᵢ≤ₖ φᵢ(x⃗ᵢ, x⃗ᵢ₊₁), where φᵢ is the constraint of *C*ᵢ, such that 1) *C*₀ is a fact, 2) each *C*ᵢ ∈ *S*, 3) for each pair *C*ᵢ and *C*ᵢ₊₁, rel(dst(*C*ᵢ)) = rel(src(*C*ᵢ₊₁)), and the variables of each x⃗ᵢ are shared only between φᵢ₋₁(x⃗ᵢ₋₁, x⃗ᵢ) and φᵢ(x⃗ᵢ, x⃗ᵢ₊₁).

For bug finding, it is essential to enumerate various unrollings and check their satisfiability. Once a satisfiable unrolling ⟨*C*₀,...,*C*ₖ⟩ is found in which *C*ₖ is a query, a bug is found (and its counterexample can be obtained from the model), and thus no interpretation for the predicates in *R* exists.

Lemma 1. Given a system *S* of CHCs, let ⟨*C*₀,...,*C*ₖ⟩ be one of its unrollings such that *C*₀ is a fact and *C*ₖ is a query. Then, if ⟨*C*₀,...,*C*ₖ⟩ is satisfiable, *S* is unsatisfiable.

In the next section, we expand on the notions of CHCs and unrollings, give examples, and present the application to test-case generation.

# 5 Test-case Generation for Branch Coverage

The concept of constrained Horn clauses is convenient for formulating the problem of constructing a maximal branch coverage (MBC) of a given program. At the highest level, the MBC problem is concerned with finding a set of program executions that visit all reachable program branches. Given the CHC encoding of the program, this can be reduced to the problem of finding a set of satisfiable unrollings that involve the maximal number of CHCs. However, to guarantee maximality, this requires a special property of the CHC encoding: the constraint in each CHC should represent a straight-line code sequence with no branches (a.k.a. a basic block). Technically, this can be formulated as the requirement for each CHC to have a conjunction of literals (a.k.a. a cube), i.e., no disjunctions, in its body.

Example 1. Fig. 2 gives a CHC encoding of the program in Fig. 1. There are eight CHCs over four uninterpreted predicates *A*, *B*, *C*, and *D*. The program entry is encoded in the first CHC (i.e., the only fact, with the dst-predicate *A*), and its exit in the last CHC (i.e., with the dst-predicate *D*). All other CHCs encode

$$\begin{array}{ll} (1) & x = 0 \implies A(x, y, z) \\ (2) & A(x, y, z) \land x \ge 5 \land x' = x \land y' = y + 1 \land z' = z \implies B(x', y', z') \\ (3) & A(x, y, z) \land x < 5 \land x' = x + 1 \land y' = y \land z' = z \implies B(x', y', z') \\ (4) & B(x, y, z) \land y \le 5 \land x' = x \land y' = y \land z' = z + 1 \implies C(x', y', z') \\ (5) & B(x, y, z) \land y > 5 \land x > y \land x' = x \land y' = y + 1 \land z' = z \implies C(x', y', z') \\ (6) & B(x, y, z) \land y > 5 \land x \le y \land x' = 0 \land y' = y \land z' = z \implies C(x', y', z') \\ (7) & C(x, y, z) \land z \ne 0 \implies A(x, y, z) \\ (8) & C(x, y, z) \land z = 0 \implies D(x, y, z) \end{array}$$

Fig. 2: CHCs of the motivating example (left) and src/dst-dependency graph (right).

the loop, with a total of six symbolic paths following *A* → *B* → *C* → *A* (as can be seen from the graphic representation) but involving different CHCs. Each CHC has no disjunctions in its body: it has the conjunction of the (negation of the) guard and the encoding of the program instructions following the corresponding branch until either the next conditional or the join occurs. Note that there are no queries in this system since there are no assertions in the program.

To formulate the MBC problem at the level of CHCs, it is convenient to introduce the concept of a src/dst-dependency graph for a system of CHCs.

Definition 4. Given a system *S* of CHCs over a set of uninterpreted predicate symbols *R*, its src/dst-dependency graph ⟨*R*, *E*⟩ is a directed graph with edges labeled by CHCs from *S*:

> *E* ≝ {⟨rel(src(*C*)), *C*, rel(dst(*C*))⟩ | *C* ∈ *S*}.
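For the CHC system of Fig. 2, this graph and its vertices with more than one outgoing edge can be computed directly (an illustrative Python sketch; "T" stands for ⊤, the source of the fact):

```python
from collections import defaultdict

# Edges of the src/dst-dependency graph of Fig. 2, written as
# triples (source predicate, CHC number, destination predicate).
edges = [("T", 1, "A"), ("A", 2, "B"), ("A", 3, "B"),
         ("B", 4, "C"), ("B", 5, "C"), ("B", 6, "C"),
         ("C", 7, "A"), ("C", 8, "D")]

out = defaultdict(list)          # vertex -> labels of outgoing edges
for src, chc, dst in edges:
    out[src].append(chc)

# Vertices with more than one outgoing edge capture the program's
# points of control-flow divergence.
divergence = {v for v, cs in out.items() if len(cs) > 1}
print(sorted(divergence))  # ['A', 'B', 'C']
```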

Because we are bound in this paper to use only disjunction-free CHCs, the points of control-flow divergence in a program encoded in these CHCs are captured by vertices in the src/dst-dependency graph that have more than one outgoing edge<sup>1</sup>. To generate a test case visiting the block of code encoded in a CHC *C*, it is enough to find an unrolling ⟨*C*₀,...,*C*ₖ⟩ with *C*ₖ = *C* and show that this unrolling is satisfiable. In this case, the CHC *C* is called reachable: i.e., the satisfying assignment would naturally correspond to a program trace beginning at

<sup>1</sup> Thus, in this case, the src/dst-dependency graph can be seen as a control-flow graph (CFG) of the encoded program. In practice, many verification tools based on CHC do not generate CHCs in such a form but apply some generalization and compression to the CFG during preprocessing. This results in CHCs with disjunctive bodies that are unsuitable for our approach. In these cases, we explicitly convert the body of each CHC to disjunctive normal form (DNF) and clone the CHC for each cube in the DNF. The CHC system after this transformation is still a correct encoding of the original program, and its src/dst-dependency graph is suitable for our approach, but it may not exactly match the CFG of the original program.

the program entry point and reaching the code in that branch. Furthermore, if the execution depends on some input values, these values can also be extracted from the satisfying assignment.

Example 2. According to Fig. 2, the first point of control-flow divergence is predicate *A*. To show that CHC 3 is reachable, we create the following unrolling from the bodies of CHCs 1 and 3:

> *x* = 0 ∧ *x* < 5 ∧ *x*′ = *x* + 1 ∧ *y*′ = *y* ∧ *z*′ = *z*.

This formula is satisfiable, and there exists a model *M* = {*x* ↦ 0, *y* ↦ 0, *z* ↦ 0, ...}, thus giving us values for the two input variables *y* and *z* (both zeroes).
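The satisfiability check behind this example can be reproduced even without an SMT solver by brute force over a small domain (an illustrative Python sketch; the domain bound and search order are arbitrary choices, whereas the implementation relies on Z3):

```python
from itertools import product

def unrolling_1_3(x, y, z, x1, y1, z1):
    # body(CHC 1) ∧ body(CHC 3), with primed variables renamed to x1, y1, z1
    return x == 0 and x < 5 and x1 == x + 1 and y1 == y and z1 == z

# Search a tiny domain {0, 1} for a satisfying assignment (a model).
model = next(v for v in product(range(2), repeat=6) if unrolling_1_3(*v))
x, y, z, x1, y1, z1 = model
print((y, z))  # (0, 0): test inputs for the two user-controlled variables
```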

It can also be seen that some CHCs cannot be visited by any trace. To find them, we can pose additional safety verification queries and aim at generating an appropriate invariant.

Lemma 2. Given a system *S* of CHCs over some *R*, let *C* be some CHC from *S* with constraint φ. If the extended CHC system *S* ∪ {src(*C*) ∧ φ =⇒ ⊥} is satisfiable, then *C* is unreachable.

The proof of the lemma follows directly from Lemma 1.

Example 3. In the CHC system in Fig. 2, CHC 5 is never reachable. We introduce a new query CHC as follows:

(, , ) ∧ > 5 ∧ > ∧ ′ = ∧ ′ = + 1 ∧ ′ = =⇒ ⊥

The extended system is satisfiable, with the following interpretation *M*:

$$\mathcal{M}(A) = \mathcal{M}(B) = \mathcal{M}(C) = \lambda x, y, z.\; x \le 5$$

Because *x* ≤ 5 ∧ *y* > 5 ∧ *x* > *y* is unsatisfiable, CHC 5 is unreachable.
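The interpretation *M* can be sanity-checked by brute force over a bounded domain (an illustrative Python sketch; a CHC solver proves validity over unbounded integers). We check the fact, one inductive clause, and the new query; the remaining clauses check analogously:

```python
from itertools import product

inv = lambda x, y, z: x <= 5          # M(A) = M(B) = M(C)

def valid(clause):
    """Check a clause on all tuples over a small integer domain."""
    return all(clause(*v) for v in product(range(-1, 8), repeat=6))

# CHC 1: x = 0 =⇒ A(x, y, z)
chc1 = lambda x, y, z, x1, y1, z1: (not (x == 0)) or inv(x, y, z)
# CHC 3: A(x, y, z) ∧ x < 5 ∧ x' = x+1 ∧ y' = y ∧ z' = z =⇒ B(x', y', z')
chc3 = lambda x, y, z, x1, y1, z1: (not (inv(x, y, z) and x < 5 and
        x1 == x + 1 and y1 == y and z1 == z)) or inv(x1, y1, z1)
# New query: B(x, y, z) ∧ y > 5 ∧ x > y =⇒ ⊥ (primed updates omitted,
# as they do not affect validity)
query = lambda x, y, z, x1, y1, z1: not (inv(x, y, z) and y > 5 and x > y)

assert valid(chc1) and valid(chc3) and valid(query)
```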

These ingredients let us state the MBC problem formally.

Definition 5 (MBC). Given a system *S* of CHCs over some *R*, the problem of maximizing branch coverage of *S* is concerned with 1) determining a subset *U* ⊆ *S* of CHCs which are provably unreachable (i.e., Lemma 2 applies), and 2) finding satisfiable unrollings for all CHCs from *S* ∖ *U*.

The practical significance of the MBC problem consists in allowing test-generation tools based on bounded model checking, e.g., [1], to terminate earlier. The invariants discovered while iteratively applying Lemma 2 can serve as annotations of various nodes of the program CFG, which further enables pruning the search space of test cases. In particular, for our running example in Fig. 1, line 13 is provably unreachable, so it makes no sense to search for a test case for it.

Furthermore, with the invariant that blocks a branch at hand, the tools can explore fewer unrollings leading to other branches in the next iterations of the loop. Specifically, to reach line 6, five iterations of the loop will provably skip line 13, so instead of (2 · 3)⁵ = 7776 unrollings, the tool should only explore (2 · 2)⁵ = 1024 unrollings.

#### 6 Solving the MBC problem

In this section, we introduce our novel approach to constructing the maximal branch coverage using a system of disjunction-free CHCs. We begin by outlining our key ideas, which can be implemented on top of existing test-case and invariant generators, and then proceed to describe our efficient implementation.

#### 6.1 Key Insights

The approach has a simple high-level structure. Because the number of CHCs in a program encoding is always finite, we can pose a safety verification query for each of them.

Existing CHC solvers are equipped with the functionality to generate both counterexamples and safety invariants. However, a recent evaluation [17] shows that bounded-model-checking implementations often outperform general-purpose solvers on unsatisfiable CHC instances (likely because they do not invest effort in generating invariants). This suggests that for performance reasons, it makes sense to alternate between separate runs of a counterexample generator (via enumerating the unrollings) and an invariant generator. This allows for two main benefits, outlined in the next two paragraphs.

A counterexample generator, in the MBC setting, has to handle a large number of unrollings. Many of the unrollings are unsatisfiable, since some sequentially aligned branches might be incompatible, and other branches might be waiting for a certain loop iteration. It is thus essential to share information about conflicting path segments (e.g., unsatisfiable prefixes, as in our implementation) to accelerate the search. Dually, satisfiable unrollings can often be extended to unrollings for other reachable CHCs, and this information can be exploited in the enumerative search for the remaining branches.

An invariant generator, invoked multiple times throughout the process, deals with many largely similar safety verification instances (since all CHCs are the same, and only the queries differ). Thus, a lot of information can be reused between verification runs, opening opportunities for incremental verification [13]. Formally, all invariants discovered while proving the unreachability of one CHC remain valid after switching to another CHC. Moreover, solvers that target conjunctive invariant generation, e.g., [15,24], can output "partial" invariants (i.e., some lemmas) even for unsatisfiable CHC instances, which can then be reused and completed in subsequent runs of the solver.

These observations lead us to conclude that although it is possible to use off-the-shelf tools for bounded model checking and invariant generation, an MBC solver will likely perform better with new algorithms designed to incorporate the aforementioned insights.

#### 6.2 General Driver

The pseudocode of our approach is given in Alg. 1. The algorithm begins by identifying a subset cur of CHCs that need to be considered in its iterations.

#### Algorithm 1: CHC-based test-case generator.

```
Input: : a CHC system over R
  Output: : a set of satisfying assignments to variables in 
  Data: invs: mapping from R to invariants,  = ⟨R , ⟩: an edge-labeled
        graph, cur ⊆ : a subset of CHCs to consider, length: counter
        representing the length of the current unrollings, traces: a (global) set
        of traces to consider
1 ⟨R , ⟩ ← src/dst-dependency graph of ;
2 cur ← { | ⟨, , 1⟩ ∈  and ∃⟨, , 2⟩ ∈  where 1 ̸= 2};
3 if cur = ∅ then cur ← { | src() = ⊤};
4 length ← 1;
5 while  ̸= ∅ do
6 for chc ∈ cur do
7 ⟨res, invs, cex ⟩ ← solveCHCs( ∪ {body(chc) =⇒ ⊥}, invs);
8 if res = sat then
9 cur ← cur ∖ {chc};
10  ← {⟨, , ⟩ | ⟨, , ⟩ ∈  and  ̸= chc};
11 else if res = unsat then
12 cur ← cur ∖ {chc};
13  ←  ∪ {cex};
14 else
15 traces ← ∅;
16 GetTraces(, ⊤, chc, length, nil, prefixes, traces);
17 for  ∈ traces do
18 ⟨res, M ⟩ ← checkSAT(unroll(, ));
19 if res = sat then
20  ←  ∪ {M };
21 cur ← cur ∖ {chc};
22 break;
23 else
24 prefixes ← prefixes ∪ {};
25 length ← length + 1;
```
We say that a CHC opens a branch if the outdegree of rel(src()) in the src/dst-dependency graph is greater than one (line 2). Thus, to generate a test case visiting a branch, it is enough to find an unrolling ⟨0,...,⟩ whose last CHC opens that branch and to show that this unrolling is satisfiable. If, however, there are no branches in the given program at all, then cur gets all facts of the CHC system (line 3), and the remaining coverage generation is straightforward.
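To illustrate, here is a minimal Python sketch (ours, with a hypothetical ⟨src, chc, dst⟩ edge encoding) of how the branch-opening CHCs of line 2 can be computed from the src/dst-dependency graph:

```python
from collections import defaultdict

def branch_opening_chcs(edges):
    """A CHC opens a branch if the outdegree of its source relation in
    the src/dst-dependency graph is greater than one (cf. line 2 of Alg. 1).
    Each edge is a (source relation, CHC label, destination relation) triple."""
    out = defaultdict(set)
    for src, chc, dst in edges:
        out[src].add(chc)
    return {chc for src, chc, dst in edges if len(out[src]) > 1}

# Hypothetical graph: relation A has two outgoing CHCs, so both open a branch.
edges = [('true', 'chc1', 'A'), ('A', 'chc2', 'A'), ('A', 'chc6', 'B')]
print(branch_opening_chcs(edges))  # {'chc2', 'chc6'} (in some order)
```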

The rest of the algorithm is organized as a big loop that decides whether the CHCs from cur are (un)reachable and terminates when cur is empty. At each iteration of the loop, all CHCs from cur are enumerated, and the algorithm seeks to apply Lemma 2, i.e., extends the system with one query and solves the resulting CHCs (line 7). The algorithm can use any CHC solving algorithm that decides the satisfiability


of CHCs and returns inductive invariants (line 8) or (optionally<sup>2</sup>) a counterexample (line 11). In both cases, the CHC is excluded from cur. Additionally, if the query is satisfiable, this CHC cannot be used in any unrolling, and it is also excluded from the auxiliary graph (line 10), to prune the search space of the remaining test cases. If a counterexample is returned, the branch is reachable, and the test case is extracted from this counterexample (line 13).

It is also possible (and in practice, very likely) that the CHC solver returns unknown (because the problem is undecidable, and invariant generators are often limited to either a fixed shape of invariants or a certain timeout). In this case (lines 16-22), the algorithm proceeds with an explicit enumeration of unrollings of a predetermined length (line 16). Each trace ⟨0, 1, . . . , length−1⟩ has an associated unrolling which is checked for satisfiability (line 18) with an off-the-shelf SMT solver. If it is satisfiable (line 19), the branch opened by the current CHC is reachable, the test case is generated from the model, and the CHC is excluded from cur. If it is unsatisfiable (line 23), the algorithm registers the trace as an unsatisfiable prefix to be avoided during trace generation in the next iterations (see Alg. 2).

Theorem 1. When Alg. 1 terminates, the resulting set contains all the variable assignments needed for maximal coverage.

In the next two paragraphs we discuss two important design choices that do not affect the correctness of our implementation, but optimize it.

<sup>2</sup> In fact, the counterexample detection in some CHC solvers, e.g., [24], proceeds in a similar fashion to the one described in our algorithm, but if invoked multiple times throughout the algorithm, the CHC solver is likely to perform many redundant actions. We thus do not use this functionality in our experiments (and in our Alg. 3) but leave it in the pseudocode for completeness of presentation.

#### Algorithm 3: solveCHCs.

```
  Input: : a CHC system over R, invs: mapping from R to invariants
  Output: res ∈ ⟨sat, unsat⟩, invs: updated mapping, [cex: counterexample]
1 ′ ← ∅;
2 for chc ∈ do
3     ′ ← ′ ∪ {src(chc) ∧ (body(chc))[R ↦ invs] =⇒ dst(chc)};
4 if .⊤ is a solution for ′ then
5     return invs;
6 return ⟨res, invs, ⟩ ← FreqHorn(′);
```

#### 6.3 Incremental Trace Enumeration

Our algorithm allows for sharing the information obtained during its iterations using two global data structures: the set of unsatisfiable prefixes discovered during the trace enumeration and the graph structure ⟨*R*, ⟩ representing potentially reachable CHCs (line 10 of Alg. 1). Intuitively, the latter is constructed by an iterative removal of edges from the src/dst-dependency graph, thus allowing for a more focused search of suitable traces. Both data structures are used in Alg. 2, which is called at the next algorithm iteration.

Conceptually, Alg. 2 is a dynamic-programming implementation of a path finder in an arbitrary directed graph. Given a path length, a starting point, and an ending point, the algorithm recursively visits the graph edges and stores them in vectors<sup>3</sup>. In our setting, the algorithm is optimized in two ways. First, at line 1, it skips paths with unsatisfiable prefixes (because the corresponding unrollings will be unsatisfiable too). Second, at lines 4 and 7, it excludes all the unreachable CHCs that have previously been excluded from the graph.
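The following Python sketch (our illustration, not the actual Alg. 2) captures this path-finding idea, including the two optimizations: skipping extensions of unsatisfiable prefixes and following only edges still present in the graph. All edge labels are invented for the example.

```python
def get_traces(edges, node, target_chc, length, prefix, unsat_prefixes, traces):
    """Enumerate all paths of exactly `length` edges from `node` that end
    with `target_chc`, skipping extensions of known-unsatisfiable prefixes."""
    if tuple(prefix) in unsat_prefixes:   # the corresponding unrolling
        return                            # would be unsatisfiable too
    if len(prefix) == length:
        if prefix and prefix[-1] == target_chc:
            traces.append(list(prefix))
        return
    for src, chc, dst in edges:           # unreachable CHCs were already
        if src == node:                   # removed from `edges` by Alg. 1
            get_traces(edges, dst, target_chc, length,
                       prefix + [chc], unsat_prefixes, traces)

# Toy graph; with prefix ('c1', 'c2') marked unsatisfiable, every trace
# extending it is pruned before the SMT solver is ever called.
edges = [('true', 'c1', 'A'), ('A', 'c2', 'A'), ('A', 'c4', 'A'), ('A', 'c6', 'B')]
traces = []
get_traces(edges, 'true', 'c4', 3, [], {('c1', 'c2')}, traces)
print(traces)  # [['c1', 'c4', 'c4']]
```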

Example 4. Recall our running example program encoded as CHCs in Fig. 2. For length = 2 and CHC (2), Alg. 2 constructs a single trace ⟨(1),(2)⟩ that corresponds to an unsatisfiable unrolling, found by Alg. 1 and thus added to prefixes. Consequently, for length = 3, the traces ⟨(1),(2),(4)⟩ and ⟨(1),(2),(6)⟩ are not generated. Furthermore, because (5) is never reachable, the edge ⟨,(5), ⟩ is permanently excluded from the graph.

#### 6.4 Incremental Invariant Discovery

Alg. 3 gives the main idea of our CHC solver, which relies on the FreqHorn [15] algorithm to synthesize invariants (any other CHC solver could be used as well). In addition, however, it recycles the invariants invs generated in all previous runs. Specifically, it substitutes their interpretations for each ∈ *R* in the body of each CHC (line 3). Because each such formula represents an over-approximation

<sup>3</sup> We use the notation @ to represent the "push back" operation over a vector and an element .

of the set of reachable states at a particular program location, this substitution is sound.

If, after the substitution, all the remaining invariants are simply true formulas (line 4), then invs is already a solution, and the CHC solver is not needed. On the other hand, invariants could also be generated by an external tool.

While the pseudocode of FreqHorn is omitted from Alg. 3 for simplicity, we list its distinguishing features here. The approach is driven by Syntax-Guided Synthesis (SyGuS) [2], and it supports (possibly non-linear) arithmetic and arrays [16]. It automatically constructs formal grammars () for each ∈ *R* based on either source code [14] or program behaviors [15,30]. Importantly, these grammars are conjunction-free, and they allow for only a finite number of candidates. FreqHorn iteratively attempts to apply production rules of each () to sample a candidate and checks it with an SMT solver (a successfully checked candidate is then called a lemma). The process continues either until a conjunction of lemmas is sufficient or until the search space is exhausted. To make the process less dependent on the order in which candidates are considered, FreqHorn uses batching [12] (i.e., it checks several candidates at the same time) and effectively filters them using the well-known Houdini algorithm [18].
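As a concrete (if heavily simplified) illustration of the Houdini-style filtering step, the sketch below replaces SMT queries by an explicit-state toy transition system: it starts from all candidates that hold initially and repeatedly drops every candidate that is not preserved by one step from the states satisfying the current conjunction. The toy system and all names are ours.

```python
def houdini(candidates, states, init, step):
    """Return the largest subset of candidate predicates whose conjunction
    is inductive (explicit-state stand-in for the Houdini algorithm)."""
    # Start from the candidates that at least hold in all initial states.
    lemmas = {n for n, c in candidates.items() if all(c(s) for s in init)}
    changed = True
    while changed:
        changed = False
        # One-step successors of the states satisfying the current conjunction.
        reach = [step(s) for s in states
                 if all(candidates[n](s) for n in lemmas)]
        for n in list(lemmas):
            if not all(candidates[n](s) for s in reach):
                lemmas.discard(n)      # not preserved: drop and re-check
                changed = True
    return lemmas

# Toy system: start at 0, step s -> min(s + 2, 10) over states 0..10.
candidates = {'le10': lambda s: s <= 10,
              'even': lambda s: s % 2 == 0,
              'le4':  lambda s: s <= 4}          # le4 is not inductive
inv = houdini(candidates, range(11), [0], lambda s: min(s + 2, 10))
print(sorted(inv))  # ['even', 'le10']
```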

These features make FreqHorn especially useful for application to test-case generation. Behaviors and counterexamples can be obtained from traces as outlined in Sect. 6.3. Each new counterexample potentially contributes a new data candidate to be considered in the next invocations of the algorithm. Then, following our incremental schema, new candidates are used in conjunction with previously generated invariants and are either added to invs or dropped. Note that even if FreqHorn returns unknown, indicating that it is unable to find a strong enough invariant, it almost always finds some lemmas that might be useful for the next iterations of our main algorithm.

#### 7 Evaluation

We have implemented the approach in a tool called Horntinuum<sup>4</sup> . The backend of Horntinuum is developed on top of FreqHorn [14] and uses it for CHC solving. All the symbolic reasoning in our backend is performed by the Z3 [27] SMT solver, v4.8.10. For encoding C benchmarks to CHCs in our frontend, we use the SeaHorn [21] verification framework, v10.0.0-rc0, via its Docker image<sup>5</sup> .

Implementation details. The success of our approach largely depends on the preprocessing performed by SeaHorn while producing the CHC encoding. Since our algorithm works on disjunction-free CHCs (recall Sect. 6.1), we configure SeaHorn to perform a small-step encoding, i.e., introducing a CHC per

<sup>4</sup> The source code of the tool is publicly available at https://github.com/izlatkin/ HornLauncher with the CHC-based backend at https://github.com/izlatkin/aeval/ tree/tg.

<sup>5</sup> https://hub.docker.com/r/seahorn/seahorn-llvm10.

each basic block (via the --step=small option). However, the encoder, which is based on LLVM, additionally performs several LLVM transformations<sup>6</sup> and auxiliary SeaHorn passes that may introduce disjunctions into CHCs. Since this recipe is not yet configurable in SeaHorn, we additionally get rid of disjunctions by performing a DNF-ization of the CHCs received from SeaHorn.
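The DNF-ization step can be pictured as the usual distribution of conjunctions over disjunctions; the mini-AST below (nested ('and', …)/('or', …) tuples over atom strings) is our own simplification of the real formula representation:

```python
from itertools import product

def to_dnf(formula):
    """Return a formula as a list of cubes (each cube a list of atoms);
    every cube then yields one disjunction-free CHC."""
    if isinstance(formula, str):                 # atom
        return [[formula]]
    op, *args = formula
    arg_dnfs = [to_dnf(a) for a in args]
    if op == 'or':                               # union of the cubes
        return [cube for dnf in arg_dnfs for cube in dnf]
    assert op == 'and'                           # cross-product of the cubes
    return [sum(combo, []) for combo in product(*arg_dnfs)]

# A body with one disjunction splits into two disjunction-free cubes.
body = ('and', 'x > 0', ('or', 'y = x', 'y = x + 1'))
print(to_dnf(body))  # [['x > 0', 'y = x'], ['x > 0', 'y = x + 1']]
```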

We also had to overcome a relatively minor engineering obstacle to allow recognizing multiple nondet() function calls (see an example in Fig. 1). The CHC representation is in some sense declarative, i.e., it is not always possible to detect the order of function calls from the formulas that represent program unrollings. Thus, we rename each invocation of nondet() in each input C file, which lets Horntinuum associate each function invocation with a sequence of static-single-assignment (SSA) variables that encode the (possibly many, if nondet() is called in a loop) outputs of nondet() occurring in an unrolling. Further, it gives a sequence of concrete values obtained for each of the SSA variables by the SMT solver. In a generated test case, the sequence of SSA values of each nondet is stored in a separate array (to capture values in each loop iteration) and accessed by an automatically generated body of the corresponding unique nondet() function.
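The replay mechanism can be pictured with the following Python stand-in for the generated C header (the class and all values are invented for illustration): each renamed call site owns the array of SSA values extracted from the SMT model and returns the next one on every call.

```python
class NondetReplay:
    """Replays the recorded sequence of values for one renamed nondet
    call site, one value per invocation (e.g., per loop iteration)."""
    def __init__(self, recorded_values):
        self.values = list(recorded_values)   # SSA values from the SMT model
        self.next_call = 0

    def __call__(self):
        v = self.values[self.next_call]
        self.next_call += 1
        return v

# A test case binds each renamed call site to its own recorded array.
nondet_a = NondetReplay([3, 7, 2])            # called three times in a loop
assert [nondet_a() for _ in range(3)] == [3, 7, 2]
```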

In a sense, the final output of our tool is a set of context-specific implementations of the nondet() functions written in different header files. The initial C file should include a header from this set and be compiled and run in order to reproduce the detected test case.<sup>7</sup>

Experimental setup. To evaluate Horntinuum, we configured the gcov tool, v9.3.0, a code coverage analysis and profiling tool that tracks all statements visited in a single run of the program. Running gcov for each of our generated test cases and merging the statistics gives the final coverage: we ultimately aim to maximize the amount of code visited by at least one test case.<sup>8</sup>
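Merging the per-test statistics amounts to taking the union of the visited lines; a minimal sketch (our toy model of gcov's per-line hit counts):

```python
def merge_coverage(runs):
    """A line counts as covered if at least one test case visits it;
    the final metric is the size of the union over all runs."""
    covered = set()
    for hits in runs:            # hits: {line_number: execution_count}
        covered |= {line for line, count in hits.items() if count > 0}
    return covered

runs = [{1: 2, 2: 0, 3: 1},      # gcov counts for test case 1
        {1: 1, 2: 0, 4: 5}]      # gcov counts for test case 2
print(sorted(merge_coverage(runs)))  # [1, 3, 4]
```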

We compared Horntinuum with the state-of-the-art tools FuSeBMC [1], Verifuzz [9], and KLEE [8]<sup>9</sup> that exhibited a decent performance in TestComp 2021. Our experiments were run on a "Dell OptiPlex 7090 Tower" desktop computer with a 2.5 GHz Intel Core i7 8-Core (11th Gen) CPU, 16 GB of 3200 MHz DDR4 RAM, and Ubuntu 20.04.1 LTS installed on it.

For the experimentation, we considered 316 benchmarks from TestComp (from the loop-\* tracks, excluding the programs with floating points that our CHC solver

<sup>6</sup> One transformation, for instance, removes redundant branches from the code, e.g., replaces if (nondet()) foo(); else foo(); by just foo();. Technically, the CHC encoding received by our tool then does not represent all branches of the original program, which thus leads to a smaller detected coverage. We have not seen many such examples in our benchmark set, however.

<sup>7</sup> Note the difference with the TestComp format [4] that keeps all values in the same XML file. Our proposed format is more general and easily convertible to TestComp.

<sup>8</sup> The full logs and tables are available at https://www.cs.fsu.edu/∼grigory/ horntinuum.zip.

<sup>9</sup> All the binaries were downloaded from https://test-comp.sosy-lab.org/2021/ systems.php.

Fig. 3: Coverage comparison: each point in a plot represents a pair of the coverages (% × %) of Horntinuum (x-axis) and a competitor (y-axis) for the same benchmarks.

does not support yet). The largest considered benchmark has >5K LoC. The performance of all three competitors (using a timeout of 15 minutes) on our machine was consistent with that exhibited in TestComp 2021: Verifuzz slightly outperforms FuSeBMC, and both outperform KLEE.

#### Expectations and results. We aim to answer two main questions:


Plots in Fig. 3 and Fig. 4 attempt to answer these questions, respectively.

We first give a pairwise comparison of the coverage (%) reported by the tools (Fig. 3). If a tool was unable to analyze a program, the corresponding

<sup>10</sup> We believe the ability to successfully terminate the test-case generation early is of great interest to software engineers. However, unfortunately, it is not the main determining factor in testing competitions.

Fig. 4: Runtime needed to get 1% of coverage (sec × sec) of Horntinuum (x-axis) and a competitor (y-axis). Solid triangles represent runs (green: Horntinuum, orange: the competitor) in which the corresponding tool detected larger coverage and took less time. Blank triangles are the remaining (non-representative) runs. Triangles on the boundaries represent runs in which one of the tools detected zero coverage.

point is placed on the boundary. The experiments revealed that, given the same timeout, Horntinuum generates test cases with coverage larger than or equal to that of KLEE on 241 programs, of FuSeBMC on 178 programs, and of Verifuzz on 177 programs. These numbers include the cases when the competitor crashed or did not return any coverage, but exclude the cases when Horntinuum did so.

A pairwise comparison of the "runtime/coverage" ratio of the tools is shown in Fig. 4. For this experiment, for every plot, we only considered benchmarks on which one of the tools generated test cases with larger coverage and terminated before the competitor. Specifically:


These numbers let us conclude that Horntinuum is likely to return a larger coverage in a shorter amount of time than a competitor.

Fig. 5: Impact of invariants: pairs of the runtimes (sec × sec) of Horntinuum with and without invariants.

The remaining benchmarks (e.g., those on which Horntinuum generates more test cases but takes more time than a competitor) are still shown in the plot but are excluded from the statistics: in these cases it is impossible to draw a consistent conclusion on the tools' performance.

Controlled experiment. Lastly, we present an interesting statistic on the effect of invariant generation on the runtime of test-case generation (Fig. 5). For the sake of the experiment, we modified Alg. 1 such that it skips invariant generation but still enumerates traces and exploits the unsatisfiable prefixes. It turns out that this negatively affects 184 benchmarks, on which the modified version takes more time. These include 12 benchmarks on which Horntinuum with invariants terminates before the timeout, but Horntinuum without invariants does not terminate (represented as points on the right boundary). These benchmarks demonstrate a possible scenario in which programs under test have unreachable branches that can be identified by a CHC solver, allowing the test-case generator to terminate earlier.

#### 8 Conclusion

We have shown that CHCs are a promising vehicle that test-case generators can use to improve both the quality of solutions and the runtime. Specifically, using CHC encodings of programs, various program unrollings are enumerated, and test cases are extracted from models of satisfiable formulas. Our novel CHC-based approach and its implementation in Horntinuum use SMT solvers incrementally. In the future, we are going to extend our support for data types and optimize the algorithm for searching for deep counterexamples à la [6].

Acknowledgments The work is supported in part by a gift from Amazon Web Services and a grant from FSU's Council on Research & Creativity.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Efficient Analysis of Cyclic Redundancy Architectures via Boolean Fault Propagation

Marco Bozzano, Alessandro Cimatti, Alberto Griggio, and Martin Jonáš

Fondazione Bruno Kessler, Trento, Italy {cimatti,bozzano,griggio,mjonas}@fbk.eu

Abstract. Many safety-critical systems guarantee fault-tolerance by using several redundant copies of their components. When designing such redundancy architectures, it is crucial to analyze their fault trees, which describe combinations of faults of individual components that may cause malfunction of the system. State-of-the-art techniques for fault tree computation use first-order formulas with uninterpreted functions to model the transformations of signals performed by the redundancy system and an AllSMT query for computation of the fault tree from this encoding. Scalability of the analysis can be further improved by techniques such as predicate abstraction, which reduces the problem to the Boolean case.

In this paper, we show that as far as fault trees of redundancy architectures are concerned, signal transformation can be equivalently viewed in a purely Boolean way as fault propagation. This alternative view has important practical consequences. First, it applies also to general redundancy architectures with cyclic dependencies among components, to which the current state-of-the-art methods based on AllSMT are not applicable and which currently require expensive sequential reasoning. Second, it allows for a simpler encoding of the problem and the usage of efficient algorithms for the analysis of fault propagation, which can significantly improve the runtime of the analyses. A thorough experimental evaluation demonstrates the superiority of the proposed techniques.

# 1 Introduction

Fault-tolerance is a fundamental property of safety-critical systems that enables their safe operation even in the presence of faults. There are many ways to ensure fault-tolerance, often based on redundancy: spare parts are available for backup and are ready to take over with different degrees of promptness (e.g., hot/warm/cold standby), or multiple replicas run in parallel. The latter is a common approach to fault-tolerance in computer-based control systems, where the results computed by the independent replicas are combined by means of voters. The idea dates back to the pioneering space application in the Saturn Launch Vehicle [12] and has since been adopted in the Primary Flight Computer [19] of the Boeing 777. The idea is becoming prominent with the advent of modern Integrated Modular Avionics [16], a cost-effective solution for the management of highly intensive software control systems.

(a) Reference non-redundant system.

(b) TMR redundant system with three replicas of modules M1, M2, whose results are combined by a voter.

Fig. 1: Network of computational modules with cyclic dependencies, extended by triple modular redundancy.

Fig. 2: Selected ways of extending a single reference module M with triple modular redundancy (using 1, 2, and 3 voters) [6].

One of the most widely used instances of the replica-based approach to redundancy is the triple modular redundancy (tmr) schema, in which the computational modules are replaced by three redundant copies, whose results can be combined by one to three voters. An example of using tmr to add redundancy to a reference non-redundant architecture is shown in Figure 1. Note that there are multiple ways of combining the results of a single triplicated computational module by voters, some of which are shown in Figure 2 [6].

Assessing the actual degree of fault-tolerance of a redundant architecture is directly related to the construction and analysis of the corresponding fault tree [17]. A fault tree describes the combinations of failures of individual components that may cause higher-level malfunction, e.g., bring the system into a dangerous state. Such combinations are traditionally called cut sets. Given the set of all cut sets of the system, a fault tree can be reconstructed. Subsequently, from the fault tree expressed as a Binary Decision Diagram, it is possible to compute the reliability of the system from the reliability measures of the components, and to synthesize the analytical form of the reliability function [6].

In this paper, we tackle the problem of automatically analyzing the reliability of redundancy architectures with parallel replicas and voting. We propose a general framework that also encompasses redundancy architectures with cyclic dependencies among components, such as the system from Figure 1, to which current state-of-the-art approaches [6] are not applicable. The modeling is based on symbolic transition systems over the quantifier-free theory of linear real arithmetic and uninterpreted functions (UFLRA). In particular, real numbers are used to represent the signals of the architecture, and multiple instances of the same uninterpreted function symbol are used to represent component replicas. The modeling framework is a strict generalization of the combinational approach proposed in [4,5], which only allows for acyclic architectures.

As the main contribution, we propose an analysis technique based on the reduction to fault propagation graphs over Boolean structures [7]. We prove that the reduction is correct: the signal transformation performed by a redundancy architecture can be equivalently viewed in a Boolean way as fault propagation.

We carry out a systematic experimental evaluation on a set of redundancy architectures with cyclic dependencies to evaluate the scalability of the proposed solution. Moreover, we perform an evaluation on acyclic redundancy architectures to compare the performance against the state-of-the-art approach based on predicate abstraction [5,6], which can be applied only to redundancy architectures without cycles. The proposed approach proves to be very scalable, being able to analyze cyclic architectures with thousands of nodes, and is dramatically more efficient than a direct reduction to model checking of symbolic transition systems over UFLRA. On the restricted set of acyclic benchmarks, the proposed approach provides better performance even over the optimized method proposed in [5] and extended in [6] that adopts a structural form of predicate abstraction to improve over basic AllSMT [14].

The paper is structured as follows. In Section 2, we present logical preliminaries and basic notions of fault propagation graphs. In Section 3, we describe the framework of redundancy architectures with cycles. In Section 4, we present the reduction to fault propagation and prove its correctness. In Section 5, we discuss the related work. The experiments are presented in Section 6. In Section 7, we draw some conclusions and discuss some directions for future work.

# 2 Preliminaries

#### 2.1 General Background

In this section, we explain the basic mathematical conventions that are used in the paper. We assume that the reader is familiar with standard first-order logic and the basic ideas of Satisfiability Modulo Theories (smt), as presented e.g. in [1]. A theory in the smt sense is a pair (Σ, C), where Σ is a first-order signature and C is a class of models over Σ. We use the standard notions of interpretation, assignment, model, satisfiability, validity, and logical consequence. We refer to 0-arity predicates as Boolean variables, and to 0-arity uninterpreted functions as (theory) variables. We denote variables with x, y, . . . , formulas with φ, ψ, . . . , and uninterpreted functions with f, g, . . . , possibly with subscripts. We denote vectors with an overbar (e.g. x̄), and individual components with subscripts (e.g. x<sub>j</sub>). We denote the domain of Booleans with B = {⊤, ⊥}. If x<sub>1</sub>, . . . , x<sub>n</sub> are variables and φ is a formula, we write φ(x<sub>1</sub>, . . . , x<sub>n</sub>) to indicate that all the variables occurring free in φ are in x<sub>1</sub>, . . . , x<sub>n</sub>. If φ is a formula without uninterpreted functions and µ is a function that maps each free variable of φ to a value of the corresponding sort, [[φ]]<sup>µ</sup> denotes the result of the evaluation of φ under this assignment. A Boolean formula is called positive if it does not use logical connectives other than conjunctions and disjunctions.

In this paper, we use the theory of linear real arithmetic (LRA), in which the numeric constants and the arithmetic and relational operators have their standard meaning, extended with uninterpreted functions (UF), whose interpretation is not fixed in C, and with voters (V), which are k-ary functions whose interpretation is the majority function defined below. For simplicity, we consider only voters with odd arity, as even-arity voters are rarely used in practice. However, our approach can be extended to support even-arity voters.

Definition 1. The k-ary majority function majority : ℝ<sup>k</sup> → ℝ for an odd k > 0 is defined by majority(x̄) = y if there is y such that y = x<sub>j</sub> for at least ⌈k/2⌉ distinct j, and majority(x̄) = x<sub>1</sub> otherwise.
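Definition 1 translates directly into code; the following Python version (ours) makes the two cases explicit:

```python
from collections import Counter

def majority(xs):
    """k-ary majority for odd k: the value occurring at least ceil(k/2)
    times if one exists, and x_1 otherwise (cf. Definition 1)."""
    assert len(xs) % 2 == 1, "only odd arities are considered"
    value, count = Counter(xs).most_common(1)[0]
    return value if count >= (len(xs) + 1) // 2 else xs[0]

print(majority([5.0, 3.0, 5.0]))  # 5.0 (two of three replicas agree)
print(majority([1.0, 2.0, 3.0]))  # 1.0 (no majority, fall back to x_1)
```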

Given a set of variables x̄, we denote with x̄′ the set {x′ | x ∈ x̄}. A symbolic transition system S is a triple (x̄, I(x̄), T(x̄, x̄′)), where x̄ is a set of variables, and I(x̄), T(x̄, x̄′) are formulae over some signature. An assignment to the variables in x̄ is a state of S. A state s is initial if it is a model of I(x̄), i.e., s |= I(x̄). The states s, s′ denote a transition if s ∪ s′ |= T(x̄, x̄′), also written T(s, s′). A trace is a sequence of states s<sub>0</sub>, s<sub>1</sub>, . . . such that s<sub>0</sub> is initial and T(s<sub>i</sub>, s<sub>i+1</sub>) for all i. We denote traces with π, and with π<sub>j</sub> the j-th element of π. A state s is reachable in S if there exists a trace π such that π<sub>i</sub> = s for some i.

#### 2.2 Fault Propagation Graphs

In this section we briefly introduce the necessary notions of fault propagation, in particular the formalism of symbolic fault propagation graphs. Intuitively, fault propagation graphs describe how failures of some components of a given system can cause the failure of other components. In an explicit (hyper)graph representation, components can be represented by nodes and dependencies by edges among them, with the meaning that an edge from component c<sub>1</sub> to component c<sub>2</sub> states that the failure of c<sub>1</sub> can cause the failure (propagation) of c<sub>2</sub>. In the symbolic representation adopted here, we model components as Boolean variables (where ⊥ means "not failed" and ⊤ means "failed") and express the dependencies as Boolean formulae encoding the conditions that can lead to the failure of each component. The basic concepts are formalized in the following definitions. For more information, we refer to [7].

Definition 2 (Fault propagation graph). A symbolic fault propagation graph (fpg) is a pair (C, canFail), where C is a finite set of system components and canFail is a function that assigns to each component c a Boolean formula canFail(c) over the set of variables C.

Definition 3 (Trace of FPG). Let G be a fault propagation graph (C, canFail). A state of G is a function from C to B. A trace of G is a sequence of states π = π<sub>0</sub>π<sub>1</sub> . . . ∈ (B<sup>C</sup>)<sup>ω</sup> such that all i > 0 and c ∈ C satisfy (i) π<sub>i</sub>(c) = π<sub>i−1</sub>(c) or (ii) π<sub>i−1</sub>(c) = ⊥ and π<sub>i</sub>(c) = [[canFail(c)]]<sup>π<sub>i−1</sub></sup>.

Example 1 ([7]). Consider a system with components control on ground (g), hydraulic control (h), and electric control (e) such that g can fail if both h and e have failed, h can fail if e has failed, and e can fail if h has failed. This system can be modeled by a fault propagation graph ({g, e, h}, canFail), where canFail(g) = h ∧ e, canFail(h) = e, and canFail(e) = h.

One of the traces of this system is {g ↦ ⊥, h ↦ ⊤, e ↦ ⊥}{g ↦ ⊥, h ↦ ⊤, e ↦ ⊤}{g ↦ ⊤, h ↦ ⊤, e ↦ ⊤}<sup>ω</sup>, where h has failed initially, which causes the failure of e in the second step, and the failures of h and e together cause a failure of g in the third step.
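The trace conditions of Definition 3 can be checked mechanically; the snippet below (ours, with ⊥/⊤ rendered as Python False/True) validates the stem of the trace from Example 1:

```python
def is_trace(can_fail, trace):
    """Check the two conditions of Definition 3 on a finite trace stem:
    each component either keeps its value, or flips from False (not failed)
    to the value of its canFail formula evaluated in the previous state."""
    for prev, cur in zip(trace, trace[1:]):
        for c in can_fail:
            ok_keep = cur[c] == prev[c]
            ok_fail = prev[c] is False and cur[c] == can_fail[c](prev)
            if not (ok_keep or ok_fail):
                return False
    return True

# Example 1: canFail(g) = h ∧ e, canFail(h) = e, canFail(e) = h.
can_fail = {'g': lambda s: s['h'] and s['e'],
            'h': lambda s: s['e'],
            'e': lambda s: s['h']}
trace = [{'g': False, 'h': True,  'e': False},
         {'g': False, 'h': True,  'e': True},
         {'g': True,  'h': True,  'e': True}]
assert is_trace(can_fail, trace)
```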

Fault propagation graphs are often used to identify sets of initial faults that can lead the system to a dangerous or unwanted state (usually called a top level event). Such sets of initial faults are called cut sets.

Definition 4 (Cut set). Let $G = (C, \mathit{canFail})$ be a fault propagation graph and $\varphi$ a positive Boolean formula, called the top level event. The assignment $cs\colon C \to \mathbb{B}$ is called a cut set of $G$ for $\varphi$ if there is a trace $\pi$ of $G$ that starts in the state $cs$ and there is some $k \ge 0$ such that $\pi_k \models \varphi$. A cut set $cs$ is called a minimal cut set if it is minimal with respect to the pointwise ordering of functions $\mathbb{B}^C$, i.e., there is no other cut set $cs'$ such that $\{c \in C \mid cs'(c) = \top\} \subsetneq \{c \in C \mid cs(c) = \top\}$.

For brevity, when talking about cut sets, we often mention only the components that are set to ⊤ by the cut set.

Example 2 ([7]). The minimal cut sets of the fpg from Example 1 for the top level event φ = g are {g}, {h}, and {e}. These three cut sets are witnessed by the following traces:

1. $\{g \mapsto \top, h \mapsto \bot, e \mapsto \bot\}^\omega$,
2. $\{g \mapsto \bot, h \mapsto \top, e \mapsto \bot\}\{g \mapsto \bot, h \mapsto \top, e \mapsto \top\}\{g \mapsto \top, h \mapsto \top, e \mapsto \top\}^\omega$,
3. $\{g \mapsto \bot, h \mapsto \bot, e \mapsto \top\}\{g \mapsto \bot, h \mapsto \top, e \mapsto \top\}\{g \mapsto \top, h \mapsto \top, e \mapsto \top\}^\omega$.

Note that the fpg also has other cut sets, such as {g, e}, {h, e}, and {g, h, e}, which are not minimal.
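For an fpg this small, the minimal cut sets of Example 2 can be checked by brute force: run each candidate initial state to its fixed point (propagating failures maximally, which suffices for monotone fpgs) and test the top level event. This self-contained sketch is ours and is only illustrative; it is not the LRA-based algorithm of [7].

```python
from itertools import combinations

# The fpg of Examples 1 and 2, with canFail formulas as predicates.
can_fail = {
    "g": lambda s: s["h"] and s["e"],
    "h": lambda s: s["e"],
    "e": lambda s: s["h"],
}
components = sorted(can_fail)

def fixed_point(initial_faults):
    """Propagate failures maximally until the state stabilizes; for a
    monotone fpg this happens within |C| steps."""
    state = {c: c in initial_faults for c in components}
    while True:
        nxt = {c: state[c] or can_fail[c](state) for c in components}
        if nxt == state:
            return state
        state = nxt

def is_cut_set(faults, top_event):
    return top_event(fixed_point(faults))

top = lambda s: s["g"]  # top level event: the failure of g

# Enumerate candidates by increasing size; keep only the minimal ones.
mcs = []
for size in range(len(components) + 1):
    for cand in combinations(components, size):
        if is_cut_set(set(cand), top) and \
           not any(m <= set(cand) for m in mcs):
            mcs.append(set(cand))

assert mcs == [{"e"}, {"g"}, {"h"}]  # matches Example 2
```

Enumerating candidates by increasing size guarantees that every superset of an already-found cut set is filtered out, so only minimal cut sets remain; this is exponential in |C| and only viable for toy examples.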

In the following, we work with fault propagation graphs all of whose canFail formulas are positive. Such fault propagation graphs are called monotone. Note that the definition of trace ensures that in each trace, if a component $c$ is set to $\top$ in a state $\pi_i$, it is $\top$ in all subsequent states $\pi_j$ for $j > i$. This ensures that each trace eventually reaches a fixed point. Moreover, before reaching this fixed point, the trace can contain at most $|C|$ distinct states.

For monotone fpgs, there is an efficient algorithm for minimal cut set enumeration [7]. This approach consists in enumerating the minimal models of a specific LRA formula, in which theory constraints are used only if the input fpg contains cycles (and which is therefore purely Boolean for acyclic fpgs).

# 3 Cyclic Redundancy Architectures

In this section, we describe the framework adopted to model redundancy architectures, in the form of a restricted class of symbolic transition systems modulo UFLRA. We call this restricted class transition systems with uninterpreted functions and voters (UF+V TS).<sup>1</sup> This modeling framework is more expressive than mere smt formulas modulo UFLRA, which were used in the previous works on the analysis of redundancy architectures [6], as it can express architectures that contain cyclic dependencies among the modules.

Definition 5 (UF+V transition system). A transition system with uninterpreted functions and voters is a tuple $(V_S, V_{in}, V_{init}, T_{next}, T_{init})$, where


A UF+V transition system is called well formed if it does not contain cyclic dependencies among voters, i.e., there is no sequence $v_1 \ldots v_n$ of signal variables such that $v_1 = v_n$ and each $v_i$ with $i > 1$ satisfies $T_{next}(v_i) = \mathit{voter}_k(x_1, \ldots, x_k)$ with $x_j = v_{i-1}$ for some $1 \le j \le k$. For well formed UF+V TS, we can define the voter depth $\mathit{vd}\colon V_S \cup V_{in} \to \mathbb{N}$ as the unique solution to the following set of equations: $\mathit{vd}(\mathit{in}) = 0$ for each $\mathit{in} \in V_{in}$, $\mathit{vd}(v) = 0$ for each $v \in V_S$ such that $T_{next}(v) = f(x_1, x_2, \ldots, x_k)$, and $\mathit{vd}(v) = \max\{\mathit{vd}(x_i) \mid 1 \le i \le k\} + 1$ for each $v \in V_S$ such that $T_{next}(v) = \mathit{voter}_k(x_1, x_2, \ldots, x_k)$.

In the rest of the paper, we assume that all UF+V TS are well formed. In the rest of this section, let us fix an arbitrary well formed UF+V transition system $S = (V_S, V_{in}, V_{init}, T_{next}, T_{init})$.

We now give a formal definition of the behavior of the UF+V system in the presence of faults. Intuitively, we are given the set Faults of faulty signal-producing components of the system, which do not have to behave correctly: a faulty component neither has to start in its specified initial value nor has to respect its transition function.

Definition 6 (Trace of UF+V TS). A state of a UF+V transition system $S$ is an arbitrary assignment of real numbers to signal and input variables $s\colon (V_S \cup V_{in}) \to \mathbb{R}$.

<sup>1</sup> Note that although UF+V TS and the related concepts can be defined directly in terms of UFLRA symbolic transition systems, we chose to make the definition explicit to simplify the presentation and proofs.

The sequence of states $\pi = \pi_0\pi_1\ldots \in (\mathbb{R}^{V_S \cup V_{in}})^\omega$ is called a trace of the system $S$ for the fault set $\mathit{Faults} \subseteq V_S$, input stream $\iota = \iota_0\iota_1\ldots \in (\mathbb{R}^{V_{in}})^\omega$, initial value assignment $\mathit{Init}\colon V_{init} \to \mathbb{R}$, and interpretation $[\![\,]\!]$, which to each uninterpreted function symbol of arity $k$ assigns a function $[\![f]\!]\colon \mathbb{R}^k \to \mathbb{R}$, if:


Traces for the fault set Faults = ∅ are called nominal.

Note that each uninterpreted module needs one time step to compute its result, while the results of voters are instantaneous. The time delay for modules allows cyclic dependencies among modules, while no delay for voters gives the expected semantics to architectures where some replicas of a module are guarded by a voter and others are not, such as in schemas from Figures 2b and 2c.
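The two-speed semantics just described (modules delayed by one step, voters instantaneous) can be illustrated by a one-step update sketch. The system layout, names, and fault model below are our own hypothetical choices: we assume a majority voter and model a faulty replica simply by giving it a wrong transition function, which is one of the deviations that Definition 6 allows.

```python
from collections import Counter

def majority(values):
    """Return the most common value among the voter inputs."""
    return Counter(values).most_common(1)[0][0]

def step(prev, modules, voters):
    """modules: var -> (function, input vars), evaluated on the *previous*
    state (one step of delay); voters: list of (var, input vars) sorted by
    voter depth, evaluated *instantaneously* on the state being built."""
    cur = {}
    for v, (f, xs) in modules.items():          # delayed module outputs
        cur[v] = f(*(prev[x] for x in xs))
    for v, xs in voters:                        # instantaneous voter outputs
        cur[v] = majority([cur.get(x, prev.get(x)) for x in xs])
    return cur

# Hypothetical example: three replicas of one module feed a voter;
# the second replica is faulty and produces an arbitrary wrong value.
f = lambda x: x + 1.0
modules = {"x1": (f, ["inp"]),
           "x2": (lambda x: 99.0, ["inp"]),    # faulty replica
           "x3": (f, ["inp"])}
voters = [("xv", ["x1", "x2", "x3"])]

prev = {"inp": 1.0, "x1": 0.0, "x2": 0.0, "x3": 0.0, "xv": 0.0}
cur = step(prev, modules, voters)
assert cur["xv"] == 2.0  # the voter masks the single faulty replica
```

Because voters are evaluated in order of increasing voter depth on the current state, a chain of voters settles within a single time step, matching the instantaneous semantics of the definition.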

Example 3. Consider the example from Figure 1, where the reference system with 3 modules $M_1$, $M_2$, and $M_3$ is extended with tmr such that the modules $M_1$ and $M_2$ are replaced by three replicas each, whose results are combined by a voter.

We can represent the redundant version of the system as a UF+V TS as follows. The nominal behavior of the modules $M_1$, $M_2$, and $M_3$ is represented by binary uninterpreted functions $f_1$, $f_2$, and $f_3$, respectively. Further, we represent the initial values of $M_1$, $M_2$, and $M_3$ by variables $\mathit{init}_{m_1}$, $\mathit{init}_{m_2}$, and $\mathit{init}_{m_3}$, respectively. Finally, we represent the output of the $i$-th replica of each module $M_j$ by a signal variable $x^i_j$ and the output of the voter corresponding to the module $M_j$ by a signal variable $x^v_j$.

This gives the UF+V transition system $S = (V_S, \{in_1, in_2\}, V_{init}, T_{next}, T_{init})$, with $V_S = \{x^1_1, x^2_1, x^3_1, x^v_1, x^1_2, x^2_2, x^3_2, x^v_2, x^1_3\}$, $V_{init} = \{\mathit{init}_{m_j} \mid j \in \{1, 2, 3\}\}$, and

$$\begin{aligned}
T_{next}(x_1^i) &= f_1(in_1, x_2^v) \ \text{for } 1 \le i \le 3, & T_{init}(x_1^i) &= \mathit{init}_{m_1} \ \text{for } 1 \le i \le 3,\\
T_{next}(x_2^i) &= f_2(in_2, x_1^v) \ \text{for } 1 \le i \le 3, & T_{init}(x_2^i) &= \mathit{init}_{m_2} \ \text{for } 1 \le i \le 3,\\
T_{next}(x_3^1) &= f_3(x_1^v, x_2^v), & T_{init}(x_3^1) &= \mathit{init}_{m_3},\\
T_{next}(x_j^v) &= \mathit{voter}_3(x_j^1, x_j^2, x_j^3) \ \text{for } j \in \{1, 2\}.
\end{aligned}$$

We define the class of redundancy transition systems, where the only purpose of all voters is to recognize and repair the outputs of failed components; more specifically, if all components behave correctly, the voters are not necessary.

Definition 7 (Redundancy UF+V TS). We call the system $S$ a redundancy UF+V transition system if in all its nominal traces, all inputs of each voter are always identical. Formally, if $\pi$ is any nominal trace of $S$ and if $v$ is a variable for which $T_{next}(v) = \mathit{voter}_k(\overline{x})$, then $|\{\pi_i(x_j) \mid 1 \le j \le k\}| = 1$ for all $i \ge 0$.

Similarly to fpgs, a cut set is a set of faults that leads to the undesired behavior of the system. In particular, given a set of signals that are considered as output signals (or outputs) of the system, a cut set of the given UF+V TS is a set of faults that can cause an incorrect value of at least one output.

Definition 8 ((Minimal) cut set). A fault set $\mathit{Faults} \subseteq V_S$ is called a cut set of $S$ for a set of output signals $V_{out} \subseteq V_S$ if there exist an input stream, initial value assignment, and interpretation such that the values of output signals of some trace $\pi$ for the fault set Faults differ from the outputs of the nominal trace $\pi^{nom}$ with the same input stream, initial values, and interpretation, i.e., there is $c \ge 0$ and $o \in V_{out}$ for which $\pi_c(o) \ne \pi^{nom}_c(o)$. A cut set is called minimal (mcs) if it is minimal in terms of set inclusion.

Since the redundancy UF+V TS form a subclass of UFLRA transition systems, there is a straightforward procedure for minimal cut set enumeration. As in the case of combinational systems [6], one can construct a miter system, which consists of two copies of the architecture: the first is allowed to fail and the second is constrained to behave nominally. Minimal cut sets can then be obtained by using a technique based on symbolic model checking [3] to enumerate all minimal assignments to the fault variables under which it is possible to reach some state in which the outputs of the two copies differ.

# 4 Reducing Redundancy UF+V TS to Fault Propagation Graphs

In this section, we show the main result of the paper: minimal cut set enumeration for redundancy UF+V transition systems can be reduced to minimal cut set enumeration for Boolean fault propagation graphs, which is more efficient than mcs enumeration based on miter construction and model checking.

#### 4.1 Reduction

For each UF+V system $S$, we define a corresponding fpg $S^B$. The components of $S^B$ correspond to the signal variables of the original system $S$. With a slight abuse of notation, we use the same names for the original real-valued signal variables of $S$ and the components of $S^B$, although they have different types. Intuitively, the reduction ensures that each component $v$ of $S^B$ can fail if and only if there is a trace of $S$ in which the value of the signal variable $v$ deviates from its nominal value.

Definition 9. Let $S = (V_S, V_{in}, V_{init}, T_{next}, T_{init})$ be a UF+V TS. We define a corresponding fpg $S^B = (V_S, \mathit{canFail})$, where $\mathit{canFail}(v) = \bigvee_{v' \in \overline{x} \cap V_S} v'$ if $T_{next}(v) = f(\overline{x})$ and $\mathit{canFail}(v) = \mathit{atLeast}_{\lceil k/2 \rceil}(\overline{x} \cap V_S)$ if $T_{next}(v) = \mathit{voter}_k(\overline{x})$, using the definition $\mathit{atLeast}_m(X) = \bigvee_{Y \subseteq X, |Y| = m} \bigwedge_{y \in Y} y$.<sup>2</sup>

<sup>2</sup> Note that there are more efficient and compact encodings for the atLeast constraint [18]; we use the simplest one for presentation purposes.

Example 4. Consider the transition system $S$ from Example 3. The corresponding fault propagation graph is $S^B = (\{x^1_1, x^2_1, x^3_1, x^v_1, x^1_2, x^2_2, x^3_2, x^v_2, x^1_3\}, \mathit{canFail})$, where

$$\begin{aligned}
\mathit{canFail}(x_1^i) &= x_2^v \quad \text{for all } 1 \le i \le 3, & \mathit{canFail}(x_2^i) &= x_1^v \quad \text{for all } 1 \le i \le 3,\\
\mathit{canFail}(x_3^1) &= x_1^v \lor x_2^v,\\
\mathit{canFail}(x_1^v) &= \mathit{atLeast}_2(x_1^1, x_1^2, x_1^3), & \mathit{canFail}(x_2^v) &= \mathit{atLeast}_2(x_2^1, x_2^2, x_2^3).
\end{aligned}$$
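The reduction of Definition 9 can be sketched as a small construction. The representation below is our own (formulas as `("or", [...])` / `("and", [...])` trees, transition functions tagged `"uf"` or `"voter"`); the variable names echo Examples 3 and 4.

```python
from itertools import combinations
from math import ceil

# Sketch of the reduction of Definition 9: signal variables become fpg
# components; a module output can fail if any of its *signal* inputs has
# failed, a voter output only if at least ceil(k/2) of its inputs have.

def at_least(m, xs):
    """Naive atLeast_m encoding: disjunction over all m-subsets."""
    return ("or", [("and", list(y)) for y in combinations(xs, m)])

def to_fpg(t_next, signal_vars):
    can_fail = {}
    for v, (kind, inputs) in t_next.items():
        xs = [x for x in inputs if x in signal_vars]  # drop input variables
        if kind == "uf":        # uninterpreted module
            can_fail[v] = ("or", xs)
        elif kind == "voter":   # majority voter over k inputs
            can_fail[v] = at_least(ceil(len(inputs) / 2), xs)
    return can_fail

# A fragment of Example 3/4: one replica of module 1 and its voter.
t_next = {
    "x1_1": ("uf", ["in1", "xv_2"]),
    "xv_1": ("voter", ["x1_1", "x2_1", "x3_1"]),
}
signals = {"x1_1", "x2_1", "x3_1", "xv_1", "xv_2"}
fpg = to_fpg(t_next, signals)

assert fpg["x1_1"] == ("or", ["xv_2"])  # in1 is an input variable, dropped
assert len(fpg["xv_1"][1]) == 3         # atLeast_2 over 3 replicas: 3 pairs
```

As in the footnote above, this subset-based atLeast encoding is the simplest, not the most compact, choice.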

#### 4.2 Correctness

We show that the reduction preserves the cut sets. In the rest of the section, let $S = (V_S, V_{in}, V_{init}, T_{next}, T_{init})$ be an arbitrary redundancy UF+V TS, $\mathit{Faults} \subseteq V_S$ an arbitrary fault set, and $V_{out} \subseteq V_S$ an arbitrary set of output signals. First, we show that each cut set of $S$ corresponds to a cut set of $S^B$.

Lemma 1. If Faults is a cut set of $S$ for the set of outputs $V_{out}$, then $cs$ defined as $cs(v) = \top$ if $v \in \mathit{Faults}$ is a cut set of $S^B$ for the top level event $\bigvee_{o \in V_{out}} o$.

Proof. Let Faults be a cut set of $S$ for some trace $\pi$ for some $\iota$, Init, and $[\![\,]\!]$. Let $\pi^{nom}$ be the corresponding nominal trace. Define the trace $\pi^B$ of $S^B$ by $\pi^B_0 = cs$ and for all $i > 0$ define $\pi^B_i$ by $\pi^B_i(v) = \top$ if $\pi^B_{i-1}(v) = \top$ and $\pi^B_i(v) = [\![\mathit{canFail}(v)]\!]_{\pi^B_{i-1}}$ if $\pi^B_{i-1}(v) = \bot$. In other words, $\pi^B$ is the unique trace starting in $cs$ in which all the components fail as soon as possible. By monotonicity, the trace $\pi^B$ has a fixed point, i.e., there is $n$ such that $\pi^B_n = \pi^B_{n'}$ for all $n' > n$.

We show that $\pi^B$ satisfies $\pi^B_n(o) = \top$ for some $o \in V_{out}$ and thus $cs$ is a cut set for the top level event $\bigvee_{o \in V_{out}} o$. To do this, we prove by induction on $i$ and on the voter depth $\mathit{vd}(v)$<sup>3</sup> that for all $v \in V_S$ and $i \ge 0$, $\pi_i(v) \ne \pi^{nom}_i(v)$ implies $\pi^B_n(v) = \top$. For $v \in \mathit{Faults}$ the claim is immediate, since $\pi^B_0(v) = cs(v) = \top$; so assume $v \notin \mathit{Faults}$. We distinguish three cases:

	- If $i = 0$: since $\pi_0(v) \ne \pi^{nom}_0(v)$, it must be the case that $\pi_0(v) \ne \mathit{Init}(T_{init}(v))$, and therefore $v \in \mathit{Faults}$. This is a contradiction.
	- If $i > 0$ and $T_{next}(v) = f(x_1, \ldots, x_k)$: then $\pi_i(v) \ne \pi^{nom}_i(v)$ by definition implies

$$[\![f]\!](\pi_{i-1}(x_1), \ldots, \pi_{i-1}(x_k)) \ne [\![f]\!](\pi_{i-1}^{nom}(x_1), \ldots, \pi_{i-1}^{nom}(x_k))$$

and hence $\pi_{i-1}(x_j) \ne \pi^{nom}_{i-1}(x_j)$ for some $1 \le j \le k$ because $[\![f]\!]$ is a function. Since $\pi_{i-1}(\mathit{in}) = \pi^{nom}_{i-1}(\mathit{in})$ holds for all $\mathit{in} \in V_{in}$, we know that $x_j \in V_S$. Therefore the induction hypothesis implies $\pi^B_n(x_j) = \top$ and thus $\pi^B_{n+1}(v) = \top$ because $\pi^B_n$ satisfies $\mathit{canFail}(v)$. Since $\pi^B_n$ was chosen as the fixed point of $\pi^B$, this implies $\pi^B_n(v) = \pi^B_{n+1}(v) = \top$.

<sup>3</sup> Induction on the voter depth is employed because UF+V transition systems propagate results of voters instantaneously.

– If $v \notin \mathit{Faults}$ and $T_{next}(v) = \mathit{voter}_k(x_1, \ldots, x_k)$: then $\pi_i(v) \ne \pi^{nom}_i(v)$ for any $i \ge 0$ by definition implies

$$\mathit{majority}(\pi_i(x_1), \ldots, \pi_i(x_k)) \ne \mathit{majority}(\pi_i^{nom}(x_1), \ldots, \pi_i^{nom}(x_k)). \tag{1}$$

Since $S$ is a redundancy TS, all $\pi^{nom}_i(x_j)$ are equal, and the disequality (1) implies that $\pi_i(x_j) \ne \pi^{nom}_i(x_j)$ for at least $\lceil k/2 \rceil$ of the $x_j$. All these $x_j$ are not in $V_{in}$ and must therefore be in $V_S$. By definition of voter depth, $\mathit{vd}(x_j) < \mathit{vd}(v)$ for all these $x_j$. Therefore, by the induction hypothesis, $\pi^B_n(x_j) = \top$ for at least $\lceil k/2 \rceil$ of the $x_j$ and thus $\pi^B_{n+1}(v) = \top$ because $\pi^B_n$ satisfies $\mathit{canFail}(v)$. This again implies $\pi^B_n(v) = \pi^B_{n+1}(v) = \top$ because $\pi^B_n$ is the fixed point of $\pi^B$.

This finishes the proof: if Faults is a cut set, $\pi_c(o) \ne \pi^{nom}_c(o)$ for some $c \ge 0$ and $o \in V_{out}$, and thus $\pi^B_n(o) = \top$. Therefore we know that $\pi^B_n \models \bigvee_{o \in V_{out}} o$ and thus $cs$ is a cut set of $S^B$. ⊓⊔

For the converse direction, we devise for each fault set a trace of the UF+V TS $S$ that propagates all possible deviations from the nominal values. We call this trace maximally fault-propagating. In this trace, all signal values are from the set {0, 1}; all nominal signal values are 0 and become 1 only as a result of a fault. Moreover, if there is a trace for the given fault set in which a signal deviates from its nominal value, the value of the corresponding signal in the maximally fault-propagating trace will be 1.

Definition 10 (Maximally fault-propagating trace). Let $S$ be a UF+V TS. Define


The maximally fault-propagating trace of $S$ for a fault set Faults, denoted as $\pi^{fp}$, is the unique trace of $S$ for the above input stream, initial values, interpretation, and the given fault set that for all $i \ge 0$ and $v$ satisfies $\pi^{fp}_i(v) = 1$ whenever $v \in \mathit{Faults}$.

Observe that the trace $\pi^{fp}$ is monotone, i.e., once a signal gets set to 1, it stays set to 1 for the rest of the trace. This is formalized by the following lemma, which can be proven by induction on $i$, $j - i$, and the voter depth of $v$.

Lemma 2. Let $S$ be a UF+V TS, Faults a fault set, and $\pi^{fp}$ the corresponding maximally fault-propagating trace. Then, for each $i \ge 0$ and $v \in V_S$, $\pi^{fp}_i(v) = 1$ implies $\pi^{fp}_j(v) = 1$ for all $j > i$.

We can now show that if a trace of the fpg version $S^B$ of a UF+V TS $S$ triggers the top level event for some initial fault assignment, then there is a trace in the original system $S$ for the corresponding fault set whose output deviates from the nominal one; namely, the trace $\pi^{fp}$.

Lemma 3. If $cs$ defined as $cs(v) = \top$ if $v \in \mathit{Faults}$ is a cut set of $S^B$ for the top level event $\bigvee_{o \in V_{out}} o$, then Faults is a cut set of $S$ for the set of outputs $V_{out}$.

Proof. Suppose that the trace $\pi^B$ of $S^B$ with the initial state $cs$ satisfies $\pi^B_c(o) = \top$ for some $c \ge 0$ and $o \in V_{out}$. We show that Faults is a cut set of $S$ for the set of output signals $V_{out}$. Let $\pi^{fp}$ be the maximally fault-propagating trace of $S$ for Faults and $\pi^{nom}$ the corresponding nominal trace.

We show that for each $i \ge 0$ and $v \in V_S$, the condition $\pi^B_i(v) = \top$ implies $\pi^{fp}_i(v) \ne \pi^{nom}_i(v)$. We proceed by induction on $i$:

– For $i = 0$: If $cs(v) = \pi^B_0(v) = \top$, then $v \in \mathit{Faults}$ and thus $\pi^{fp}_0(v) \ne \pi^{nom}_0(v)$, because $\pi^{fp}_0(v) = 1$ and $\pi^{nom}_0(v) = 0$.

– For $i > 0$: Assume that $\pi^B_i(v) = \top$. We distinguish four cases:


Therefore $\pi^B_c(o) = \top$ implies $\pi^{fp}_c(o) \ne \pi^{nom}_c(o)$ and Faults is a cut set of $S$. ⊓⊔

Theorem 1. For each fault set Faults, the following two claims are equivalent:


# 5 Related Work

Approaches to the analysis of redundant architectures include [6], which addresses the generation of the reliability function for a class of generic architectures including tree- and dag-like structures. The computation of the reliability is based on predicate abstraction and bdds. Our work extends and improves the approach of [6] in several directions. First, it supports cyclic architectures, to which predicate abstraction as defined in [6] cannot be applied. Second, it does not require that the redundancy be localized within small blocks (manually defined by the user or in a library) to which the predicate abstraction can be applied. In contrast, our approach applies the abstraction directly at the level of individual modules and voters. Moreover, the approach of [6] needs to compute the abstracted versions of the specified blocks upfront by quantifier elimination. Finally, our approach outperforms the approach of [6].

Other works on redundant architecture analysis are either based on ad-hoc algorithms [13], which are not fully automated and require discretization and additional input data from the user, or use simulation techniques such as Monte Carlo analysis [15], which do not examine the system behaviors exhaustively.

A classification of fault tolerant architectures is presented in [10]. The classification is based on three different patterns, namely comparison, voting, and sparing, that can be composed to define generic and possibly cyclic architectures. A follow-up work [11] builds upon these patterns and introduces strategies to evaluate several architectures at once (family-based analysis of redundant architectures) by reduction to Discrete Time Markov Chains. Our techniques are orthogonal and could be applied on top of the approach proposed in [11].

The concept of maximally fault-propagating trace used to prove Lemma 3 is similar to the concept of maximally diverse interpretations [8], which can be used to efficiently reduce a formula in the positive fragment of EUF logic to a sat formula. Both concepts restrict the interpretations of uninterpreted functions to a specific subclass, which exhibits all the relevant behaviors.

# 6 Experimental Evaluation

We have performed an experimental evaluation of the proposed approach for minimal cut set enumeration in order to answer the following research questions:


#### 6.1 Benchmarks and Setup

To answer these research questions, we used four sets of redundancy systems:

**Scalable cyclic systems.** This benchmark set contains two kinds of benchmarks. For evaluation on redundancy architectures with a linear number of cycles, we have generated ladder-shaped (Figure 3a) architectures of all lengths between 1 and 100. For evaluation on redundancy architectures with a large number of cycles, we have generated radiator-shaped (Figure 3b) architectures of all lengths between 1 and 50. For each of the architectures, we have generated three redundant versions by replacing each module by a tmr block with one to three voters, using the schemas from Figures 2b, 2d, and 2e. This yields systems with 2 · length · (3 + numVoters) signals.

Fig. 3: Scalable architectures used in the experimental evaluation.


We have evaluated the following approaches for minimal cut set enumeration:


Fig. 4: Solving time on ladder-shaped benchmarks. Divided according to the number of voters per one reference module.

Fig. 5: Solving time on radiator-shaped benchmarks. Divided according to the number of voters per one reference module.

backend (sat-based in SMT-PGFDS and bdd-based in xSAP). To answer RQ4, we have thus performed a more fine-grained analysis as follows.

From each fpg, we generated the corresponding Boolean formula, which is possible since the graph is acyclic [7]. We also generated the Boolean formula obtained by predicate abstraction from each OCRA encoding. We thus obtained two Boolean formulas for each system: one by reduction to fault propagation (fp) and one by reduction by predicate abstraction (pa). We have then used the sat-based enumeration algorithm of SMT-PGFDS and the bdd-based enumeration algorithm of xSAP on both of these Boolean formulas. This gives 4 combinations: fp-sat, fp-bdd, pa-sat, pa-bdd.

All experiments were executed on a cluster of 9 computational nodes, each with an Intel Xeon CPU X5650 @ 2.67GHz, 12 CPUs, and 96 GiB of RAM. We have used a time limit of 1 hour of wall-clock time and a memory limit of 16 GB for each benchmark-solver pair. The detailed experimental results can be found at https://es-static.fbk.eu/people/mjonas/papers/tacas22_redarchs/.

#### 6.2 Results for Cyclic Benchmarks

The comparison of the running times of the fpg-based and the model-checking-based approaches on the scalable cyclic benchmarks is shown in Figures 4 and 5. Figure 4 shows a significant benefit of the technique based on fault propagation on the ladder-shaped benchmarks: not only can it enumerate the cut sets of all the used benchmarks, but its run-times are dramatically better. However, as can be seen in Figure 5, the situation is different on the radiator-shaped benchmarks, which contain a large number of cycles. Although the performance of the technique based on fault propagation is still superior to the model-checking-based technique, it scales poorly on the systems with 2 and 3 voters per tmr block. The answer to RQ1 is thus that the proposed approach scales well if the number of cycles in the system is not too large; if the number of cycles is large, the technique scales worse, but nevertheless significantly better than the state-of-the-art technique based on miter construction and model checking [3].

Fig. 7: Solving time on scalable acyclic benchmarks. Divided by the architecture and number of voters per one reference module.

The run-times on the random cyclic benchmarks are shown in Figure 6. The figure shows that the performance of the proposed technique is better by several orders of magnitude; it can enumerate the minimal cut sets of 59 random systems that are out of reach for the technique based on model checking. Note that some of the systems are hard for both approaches: both timed out on 66 of the 250 benchmarks. Together with the results for the ladder-shaped and radiator-shaped systems, this answers RQ2: the technique proposed in this paper has significantly better performance than the state-of-the-art technique based on model checking.

There are two reasons for the observed performance difference. The first is the reduction of the UFLRA transition system to a Boolean one, which has also been observed to bring a significant benefit on acyclic systems in the case of predicate abstraction [6]. The second is the underlying mcs-enumeration technique applied to the resulting fpg. This technique reduces the expensive sequential reasoning to an enumeration of minimal models of a single smt formula, which can significantly improve performance [7].

Fig. 6: Solving time on random cyclic benchmarks.

#### 6.3 Results for Acyclic Benchmarks

The comparison of the performance on acyclic scalable benchmarks is shown in Figure 7. The results are divided according to the method used to reduce the problem to the Boolean case (fp vs. pa) and the technique used to enumerate the minimal cut sets of the Boolean system (sat vs. bdd). Scatter plots of solving times on random acyclic benchmarks can be seen in Figure 8.

Fig. 8: Solving time on random acyclic benchmarks.

The results show that the reduction of the problem to fault propagation combined with an off-the-shelf solver for enumerating the minimal cut sets of the resulting Boolean system (i.e., fp-sat) is clearly superior to the state-of-the-art approach based on predicate abstraction and bdd-based mcs enumeration (i.e., pa-bdd). The difference between these two approaches reaches several orders of magnitude on the scalable benchmarks and grows with the size of the system and its complexity. The performance is also significantly better on the random benchmarks. This answers RQ3 in favor of the technique proposed in this paper.

As for RQ4, Figures 7 and 8 show that both the reduction technique (fp vs. pa) and the solving technique (sat vs. bdd) play a role in this difference. However, the larger part of the runtime difference between the proposed approach (fp-sat) and the state-of-the-art approach (pa-bdd) [6] is due to the better performance of sat-based enumeration. This insight is an additional interesting outcome of our experiments. Nevertheless, for both enumeration approaches, the proposed reduction based on fault propagation provides better performance than the state-of-the-art reduction by predicate abstraction.

# 7 Conclusions and Future Work

We have presented a framework for modeling redundancy architectures with possibly cyclic dependencies among the computational modules, and we have developed an efficient approach for the enumeration of minimal cut sets of such architectures. The experimental evaluation has shown that this approach dramatically outperforms the state-of-the-art approach based on model checking on cyclic redundancy architectures and has better performance than the state-of-the-art approach based on predicate abstraction on acyclic architectures.

In the future, we plan to extend the approach to a more general class of voters than majority voters. We also plan to extend the approach to support common cause analysis for different component faults, and possibly to synthesize an optimal distribution of the modules of the architecture among the computational nodes of a system such as Integrated Modular Avionics.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Tools | Optimizations, Repair and Explainability

# **Adiar**

# **Binary Decision Diagrams in External Memory**

Steffan Christ Sølvsten, Jaco van de Pol, Anna Blume Jakobsen, and Mathias Weller Berg Thomasen

Aarhus University, Denmark {soelvsten,jaco}@cs.au.dk

**Abstract.** We follow up on the idea of Lars Arge to rephrase the Reduce and Apply operations of Binary Decision Diagrams (BDDs) as iterative I/O-efficient algorithms. We identify multiple avenues to simplify and improve the performance of his proposed algorithms. Furthermore, we extend the technique to other common BDD operations, many of which are not derivable using Apply operations alone. We provide asymptotic improvements to the few procedures that can be derived using Apply. Our work has culminated in a BDD package named Adiar that is able to efficiently manipulate BDDs that outgrow main memory. This makes Adiar surpass the limits of conventional BDD packages that use recursive depth-first algorithms. It is able to do so while still achieving a satisfactory performance compared to other BDD packages: Adiar, in parts using the disk, is on instances larger than 9.5 GiB only 1.47 to 3.69 times slower compared to CUDD and Sylvan, exclusively using main memory. Yet, Adiar is able to obtain this performance at a fraction of the main memory needed by conventional BDD packages to function.

**Keywords:** Time-forward Processing · External Memory Algorithms · Binary Decision Diagrams

#### **1 Introduction**

A Binary Decision Diagram (BDD) provides a canonical and concise representation of a boolean function as an acyclic rooted graph. This turns manipulation of boolean functions into manipulation of graphs [10, 11].

Their ability to compress the representation of a boolean function has made them widely used within the field of verification. BDDs have especially found use in model checking, since they can efficiently represent both the set of states and the state-transition function [11]. Examples are the symbolic model checkers NuSMV [14, 15], MCK [17], LTSmin [19], and MCMAS [24] and the recently envisioned symbolic model checking algorithms for CTL\* in [3] and for CTLK in [18]. Hence, continuous research effort is devoted to improve the performance of this data structure. For example, despite the fact that BDDs were initially envisioned back in 1986, BDD manipulation was first parallelised in 2014 by Velev and Gao [35] for the GPU and in 2016 by Van Dijk and Van de Pol [16] for multi-core processors [12].

The most widely used implementations of decision diagrams make use of recursive depth-first algorithms and a unique node table [16, 23, 34]. Lookup of nodes in this table and following pointers in the data structure during recursion both pause the entire computation while missing data is fetched [21, 26]. For large enough instances, data has to reside on disk and the I/O-operations that ensue become the bottleneck. So in practice, the limit of the computer's main memory becomes the upper limit on the size of the BDDs.

**Related Work.** Prior work has been done to overcome the I/Os spent while computing on BDDs. David Long [25] achieved a performance increase of a factor of two by blocking all nodes in the unique node table based on their time of creation, i.e. with a depth-first blocking. But, in [6] this was shown to only improve the worst-case behaviour by a constant. In 1993, Ochi, Yasuoka, and Yajima [28] designed breadth-first BDD algorithms that exploit a levelwise locality on disk. Their technique was improved by Ashar and Cheong [8] in 1994 and by Sanghavi et al. [31] in 1996. The fruit of their labour was the BDD library CAL, capable of manipulating BDDs larger than the available main memory. In 2010, Kunkle, Slavici, and Cooperman [22] extended the breadth-first approach to distributed BDD manipulation.

The breadth-first algorithms in [8, 28, 31] are not optimal in the I/O-model, since they still use a single hash table for each level. This works well in practice, as long as a single level of the BDD can fit into main memory. If not, they still exhibit the same worst-case I/O behaviour as other algorithms [6].

In 1995, Arge [5, 6] proposed optimal I/O algorithms for the basic BDD operations Apply and Reduce. To this end, he dropped all use of hash tables. Instead, he exploited a total and topological ordering of all nodes within the graph. This ordering is used to store all recursion requests in priority queues, so that they get synchronized with the iteration through the sorted input stream of nodes. Martin Šmérek implemented these algorithms in 2009 as they were described, but the performance was disappointing, since the intermediate unreduced BDD grew too large to handle in practice [personal communication, Sep 2021].

**Contributions.** Our work directly follows up on the theoretical contributions of Arge in [5, 6]. We have simplified and improved his I/O-optimal Apply and Reduce algorithms. In particular, we modified and pruned the intermediate representation to prevent data duplication and to save on the number of sorting operations. We also provide I/O-efficient versions of several other standard BDD operations, obtaining asymptotic improvements for the operations that are derivable from Apply.

Our proposed algorithms and data structures have been implemented to create a new easy-to-use and open-source BDD package, named Adiar. Our experimental evaluation shows that Adiar is able to manipulate BDDs larger than the available main memory, with only an acceptable slowdown compared to a conventional BDD library running exclusively in main memory.

#### **1.1 Overview**

The rest of the paper is organised as follows. Section 2 covers preliminaries on the I/O-model and Binary Decision Diagrams. We present our algorithms for I/O-efficient BDD manipulation in Section 3. Section 4 provides an overview of the resulting BDD package, Adiar, and Section 5 contains an experimental evaluation of it. Our conclusions and future work are in Section 6.

#### **2 Preliminaries**

#### **2.1 The I/O-Model**

The I/O-model [1] allows one to reason about the number of data transfers between two levels of the memory hierarchy, while abstracting away from technical details of the hardware, to make a theoretical analysis manageable.

An I/O-algorithm takes inputs of size *N*, residing on the higher level of the two, i.e. in *external storage* (e.g. on a disk). The algorithm can only do computations on data that reside on the lower level, i.e. in *internal storage* (e.g. main memory). This internal storage can only hold a smaller and finite number of *M* elements. Data is transferred between these two levels in blocks of *B* consecutive elements [1]. Here, *B* is a constant size not only encapsulating the page size or the size of a cache-line but more generally how expensive it is to transfer information between the two levels. The cost of an algorithm is the number of data transfers, i.e. the number of *I/O-operations*, or just *I/Os*, it uses.

For all realistic values of *N*, *M*, and *B* we have that *N/B* < sort(*N*) < *N*, where sort(*N*) ≜ *N/B* · log<sub>*M/B*</sub>(*N/B*) [1, 7] is the sorting lower bound, i.e. it takes *Ω*(sort(*N*)) I/Os in the worst case to sort a list of *N* elements [1]. With an *M/B*-way merge sort algorithm, one can obtain an optimal *O*(sort(*N*)) I/O sorting algorithm [1], and with the addition of buffers to lazily update a tree structure, one can obtain an I/O-efficient priority queue capable of inserting and extracting *N* elements in *O*(sort(*N*)) I/Os [4].
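As a sanity check on these bounds, sort(*N*) can be evaluated directly. The machine parameters below are illustrative assumptions, not measurements from the paper:

```python
import math

def sort_bound(N, M, B):
    """sort(N) = N/B * log_{M/B}(N/B): the I/O-model sorting bound."""
    return (N / B) * math.log(N / B, M / B)

# Assumed, illustrative parameters (all measured in elements):
N = 2**40   # input size
M = 2**30   # internal memory capacity
B = 2**17   # block size

# For such realistic values, N/B < sort(N) < N:
print(N / B, "<", sort_bound(N, M, B), "<", N)
```

For these values sort(*N*) is only a small constant factor above *N/B*, which is why I/O-efficient algorithms treat a full sort as comparably cheap to a scan, and both as far cheaper than *N* unstructured I/Os.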

**TPIE.** The TPIE library [36] provides an implementation of I/O-efficient algorithms and data structures such that the use of *B*-sized buffers is completely transparent to the programmer. Elements can be stored in files that act like lists. One can **push** new elements to the end of a file and read the **next** elements from the file in either direction, provided **has next** returns true. One can also **peek** the next element without moving the read head. TPIE provides an optimal *O*(sort(*N*)) external memory merge sort algorithm for its files. Furthermore, it provides an implementation of the I/O-efficient priority queue of [30] as developed in [29], which supports the **push**, **top** and **pop** operations.
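The stream interface described above can be mimicked with a small in-memory sketch; the class below is our own illustrative stand-in, not TPIE's actual API, and it elides the *B*-sized buffering that TPIE handles transparently:

```python
class FileStream:
    """In-memory stand-in for a TPIE-like file of elements.

    Real TPIE streams move B-sized blocks between disk and main
    memory; here only the interface from the text is mirrored."""

    def __init__(self, elements=()):
        self._xs = list(elements)
        self._i = 0

    def push(self, x):
        """Append a new element to the end of the file."""
        self._xs.append(x)

    def has_next(self):
        return self._i < len(self._xs)

    def peek(self):
        """Read the next element without moving the read head."""
        return self._xs[self._i]

    def next(self):
        """Read the next element and advance the read head."""
        x = self._xs[self._i]
        self._i += 1
        return x

s = FileStream([1, 2])
s.push(3)
assert s.peek() == 1 and s.next() == 1 and s.next() == 2
assert s.has_next() and s.next() == 3 and not s.has_next()
```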

#### **2.2 Binary Decision Diagrams**

A Binary Decision Diagram (BDD) [10], as depicted in Fig. 1, is a rooted directed acyclic graph (DAG) that concisely represents a boolean function B<sup>*n*</sup> → B,

Fig. 1: Examples of Reduced Ordered Binary Decision Diagrams. Leaves are drawn as boxes with the boolean value and internal nodes as circles with the decision variable. *Low* edges are drawn dashed while *high* edges are solid.

where B = {⊤, ⊥}. The leaves contain the boolean values ⊥ and ⊤ that define the output of the function. Each internal node contains the *label i* of the input variable *x<sub>i</sub>* it represents, together with two outgoing arcs: a *low* arc for when *x<sub>i</sub>* = ⊥ and a *high* arc for when *x<sub>i</sub>* = ⊤. We only consider Ordered Binary Decision Diagrams (OBDD), where each label may occur at most once on a path and the labels must occur in sorted order on all paths. The set of all nodes with label *j* is said to belong to the *j*th *level* in the DAG.

If one exhaustively (1) skips all nodes with identical children and (2) removes any duplicate nodes, then one obtains the *Reduced Ordered Binary Decision Diagram* (ROBDD) of the given OBDD. If the variable order is fixed, this reduced OBDD is a unique canonical form of the function it represents [10].

The two primary algorithms for BDD manipulation are called Apply and Reduce. The Apply computes the OBDD *h* = *f* ⊙ *g*, where *f* and *g* are OBDDs and ⊙ is a function B × B → B. This is essentially done by recursively computing the product construction of the two BDDs *f* and *g* and applying ⊙ when recursing to pairs of leaves. The Reduce applies the two reduction rules on an OBDD bottom-up to obtain the corresponding ROBDD [10].

Common implementations of BDDs use recursive depth-first procedures that traverse the BDD, and the unique nodes are managed through a hash table [9, 16, 20, 23, 34]. The latter allows one to directly incorporate the Reduce algorithm of [10] within each node lookup [9, 27]. They also use a memoisation table to minimise the number of duplicate computations [16, 23, 34]. If the sizes *N<sub>f</sub>* and *N<sub>g</sub>* of two BDDs are considerably larger than the memory *M* available, each recursion request of the Apply algorithm will in the worst case result in an I/O, caused by looking up a node within the memoisation table and following the low and high arcs [6, 21]. Since there are up to *N<sub>f</sub>* · *N<sub>g</sub>* recursion requests, this results in up to *O*(*N<sub>f</sub>* · *N<sub>g</sub>*) I/Os in the worst case. The Reduce operation transparently built into the unique node table with a *find-or-insert* function can also cause an I/O for each lookup within this table [21]. This adds yet another *O*(*N*) I/Os, where *N* is the number of nodes in the unreduced BDD.
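The conventional depth-first scheme described above can be sketched as follows. The tuple-based node representation is a hypothetical stand-in (not the data structures of any of the cited packages); the memoisation table and the find-or-insert unique table are exactly the two lookups that each threaten an I/O on large instances:

```python
# Hypothetical representation: a leaf is a bool, an internal node is a
# tuple ('var', i, low, high) with i the variable label.

def make_node(i, low, high, table):
    """Find-or-insert into the unique node table, applying the
    reduction rules on the fly as hash-table based packages do."""
    if low == high:                       # rule 1: redundant test
        return low
    return table.setdefault((i, low, high), ('var', i, low, high))

def apply(f, g, op, table, memo):
    """Depth-first Apply with memoisation; each memo and table lookup
    is a potential I/O once the BDDs outgrow internal memory."""
    if isinstance(f, bool) and isinstance(g, bool):
        return op(f, g)
    if (f, g) in memo:
        return memo[(f, g)]
    fi = f[1] if not isinstance(f, bool) else float('inf')
    gi = g[1] if not isinstance(g, bool) else float('inf')
    i = min(fi, gi)                       # recurse on the smallest label
    f0, f1 = (f[2], f[3]) if fi == i else (f, f)
    g0, g1 = (g[2], g[3]) if gi == i else (g, g)
    res = make_node(i, apply(f0, g0, op, table, memo),
                       apply(f1, g1, op, table, memo), table)
    memo[(f, g)] = res
    return res

# x0 AND x1:
x0 = ('var', 0, False, True)
x1 = ('var', 1, False, True)
h = apply(x0, x1, lambda a, b: a and b, {}, {})
assert h == ('var', 0, False, ('var', 1, False, True))
```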

Lars Arge provided in [5, 6] an Apply algorithm that uses only *O*(sort(*N<sub>f</sub>* · *N<sub>g</sub>*)) I/Os and a Reduce that uses *O*(sort(*N*)) I/Os (see [6] for a detailed description). He also proved both algorithms to be optimal, assuming a levelwise ordering of nodes on disk [6]. Our algorithms, implemented in Adiar, differ from Arge's in subtle non-trivial ways. We will not elaborate further on his original proposal, since our algorithms are simpler and better at conveying the *time-forward processing* technique he used. Instead, we will point out where our Reduce and Apply algorithms differ from his.

# **3 BDD Manipulation by Time-forward Processing**

Our algorithms exploit the total and topological ordering of the internal nodes in the BDD depicted in (1) below, where parents precede their children. The ordering is topological because nodes are ordered primarily by their *label*, *i* : N, and it is total because ties are broken on a node's *identifier*, *id* : N. This identifier only needs to be unique on each level, as nodes are still uniquely identifiable by the combination of their label and identifier.

$$(i_1, id_1) < (i_2, id_2) \equiv i_1 < i_2 \lor (i_1 = i_2 \land id_1 < id_2) \tag{1}$$

We write the *unique identifier* (*i*, *id*) : N × N of a node as *x<sub>i,id</sub>*.

BDD nodes do not contain explicit pointers to their children, but instead the children's unique identifiers. Following the same notion, leaf values are stored directly in the leaf's parents. This makes a node a triple (*uid*, *low*, *high*), where *uid* : N × N is its unique identifier and *low* and *high* : (N × N) + B are its children. The ordering in (1) is lifted to compare the *uid*s of two nodes, and so a BDD is represented by a file with BDD nodes in sorted order. For example, the BDDs in Fig. 1 would be represented as the lists depicted in Fig. 2.
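A minimal sketch of this representation and of the ordering in (1), with assumed tuple encodings (a *uid* as a pair, a leaf as a boolean):

```python
# A node is (uid, low, high) with uid = (label, id); children are
# either a uid or a boolean leaf value stored directly in the parent.

def uid_lt(u, v):
    """The total, topological order of (1): by label, then by id."""
    return u[0] < v[0] or (u[0] == v[0] and u[1] < v[1])

# Fig. 1d as a sorted list of nodes (x_{1,0} precedes x_{2,0}):
bdd_1d = [((1, 0), (2, 0), True), ((2, 0), False, True)]

assert uid_lt((1, 0), (2, 0)) and uid_lt((1, 0), (1, 1))
assert all(uid_lt(a[0], b[0]) for a, b in zip(bdd_1d, bdd_1d[1:]))
```

Since Python compares tuples lexicographically, `u < v` would behave identically to `uid_lt(u, v)`; the explicit form simply mirrors equation (1) term by term.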

The Apply algorithm in [6] produces an unreduced OBDD, which is turned into an ROBDD with Reduce. The original algorithms of Arge solely work on a node-based representation. Arge briefly notes that with an arc-based representation, the Apply algorithm is able to output its arcs in the order needed by the following Reduce, and vice versa. Here, an arc is a triple (*source*, *is_high*, *target*), written as *source* →<sub>*is_high*</sub> *target*, where *source* : N × N, *is_high* : B, and *target* : (N × N) + B, i.e. *source* and *target* contain the level and identifier of internal nodes. We have further pursued this idea of an arc-based representation and can conclude that the algorithms indeed become simpler and more efficient with an arc-based output from Apply. On the other hand, we see no such benefit over the more compact node-based representation in the case of Reduce. Hence, as depicted in Fig. 3, our algorithms work in tandem by cycling between the node-based and the arc-based representation.

$$\begin{array}{ll} \text{1a: } [\ (x_{2,0}, \bot, \top) \qquad ] \\ \text{1b: } [\ (x_{0,0}, \bot, x_{1,0}) \quad , (x_{1,0}, \bot, \top) \, ] \\ \text{1c: } [\ (x_{0,0}, x_{1,0}, x_{1,1}) \, , (x_{1,0}, \bot, \top) \, , (x_{1,1}, \top, \bot) \, ] \\ \text{1d: } [\ (x_{1,0}, x_{2,0}, \top) \quad , (x_{2,0}, \bot, \top) \, ] \end{array}$$

Fig. 2: In-order representation of BDDs of Fig. 1

Fig. 3: The Apply–Reduce pipeline of our proposed algorithms

Fig. 4: Unreduced output of Apply when computing *x*<sub>2</sub> ⇒ (*x*<sub>0</sub> ∧ *x*<sub>1</sub>). (a) Semi-transposed graph (pairs indicate nodes in Fig. 1a and 1b, respectively). (b) In-order arc-based representation.

Notice that our Apply outputs two files containing arcs: arcs to internal nodes (blue) and arcs to leaves (red). Internal arcs are output at the time their targets are processed, and since nodes are processed in ascending order, internal arcs end up being sorted with respect to the unique identifier of their target. This groups all in-going arcs to the same node together and effectively reverses internal arcs. Arcs to leaves, on the other hand, are output when their source is processed, which groups all out-going arcs to leaves together. These two outputs of Apply represent a semi-transposed graph, which is exactly of the form needed by the following Reduce. For example, the Apply on the node-based ROBDDs in Fig. 1a and 1b with logical implication as the operator will yield the arc-based unreduced OBDD depicted in Fig. 4.

For simplicity, we will ignore any cases of leaf-only BDDs in our presentation of the algorithms. They are easily extended to also deal with those cases.

#### **3.1 Apply**

Our Apply algorithm works by a single top-down sweep through the input DAGs. Internal arcs are reversed due to this top-down nature, since an arc between two internal nodes can only be resolved and output once the arc's target is processed. These arcs are placed in the file *F<sub>internal</sub>*. Arcs from nodes to leaves are placed in the file *F<sub>leaf</sub>*.

The algorithm itself essentially works like the standard Apply algorithm. Given a recursion request for a pair of input nodes *v<sub>f</sub>* from *f* and *v<sub>g</sub>* from *g*, a single node is created with label min(*v<sub>f</sub>.uid.label*, *v<sub>g</sub>.uid.label*), and recursion requests *r<sub>low</sub>* and *r<sub>high</sub>* are created for its two children. If the labels of *v<sub>f</sub>.uid* and

```
 1  Apply(f, g, ⊙)
 2    F_internal ← [] ; F_leaf ← [] ; Q_app:1 ← ∅ ; Q_app:2 ← ∅
 3    v_f ← f.next() ; v_g ← g.next() ; id ← 0 ; label ← undefined
 4
 5    /* Insert request for root (v_f, v_g) */
 6    Q_app:1.push(NIL --undefined--> (v_f.uid, v_g.uid))
 7
 8    /* Process requests in topological order */
 9    while Q_app:1 ≠ ∅ ∨ Q_app:2 ≠ ∅ do
10      (s --is_high--> (t_f, t_g), low, high) ← TopOf(Q_app:1, Q_app:2)
11
12      t_seek ← if low, high = NIL then min(t_f, t_g) else max(t_f, t_g)
13      while v_f.uid < t_seek ∧ f.has_next() do v_f ← f.next() od
14      while v_g.uid < t_seek ∧ g.has_next() do v_g ← g.next() od
15
16      if low = NIL ∧ high = NIL ∧ t_f ∉ {⊥, ⊤} ∧ t_g ∉ {⊥, ⊤}
17           ∧ t_f.label = t_g.label ∧ t_f.id ≠ t_g.id
18      then /* Forward information of min(t_f, t_g) to max(t_f, t_g) */
19        v ← if t_seek = v_f then v_f else v_g
20        while Q_app:1.top() matches _ --> (t_f, t_g) do
21          (s --is_high--> (t_f, t_g)) ← Q_app:1.pop()
22          Q_app:2.push(s --is_high--> (t_f, t_g), v.low, v.high)
23        od
24      else /* Process request (t_f, t_g) */
25        id ← if label ≠ t_seek.label then 0 else id + 1
26        label ← t_seek.label
27
28        /* Forward or output out-going arcs */
29        r_low, r_high ← RequestsFor((t_f, t_g), v_f, v_g, low, high, ⊙)
30        (if r_low ∈ {⊥, ⊤} then F_leaf else Q_app:1).push(x_{label,id} --⊥--> r_low)
31        (if r_high ∈ {⊥, ⊤} then F_leaf else Q_app:1).push(x_{label,id} --⊤--> r_high)
32
33        /* Output in-going arcs */
34        while Q_app:1 ≠ ∅ ∧ Q_app:1.top() matches (_ --> (t_f, t_g)) do
35          (s --is_high--> (t_f, t_g)) ← Q_app:1.pop()
36          if s ≠ NIL then F_internal.push(s --is_high--> x_{label,id})
37        od
38        while Q_app:2 ≠ ∅ ∧ Q_app:2.top() matches (_ --> (t_f, t_g), _, _) do
39          (s --is_high--> (t_f, t_g), _, _) ← Q_app:2.pop()
40          if s ≠ NIL then F_internal.push(s --is_high--> x_{label,id})
41        od
42    od
43    return F_internal, F_leaf
```
Fig. 5: The Apply algorithm

*v<sub>g</sub>.uid* are equal, then *r<sub>low</sub>* = (*v<sub>f</sub>.low*, *v<sub>g</sub>.low*) and *r<sub>high</sub>* = (*v<sub>f</sub>.high*, *v<sub>g</sub>.high*). Otherwise, *r<sub>low</sub>*, resp. *r<sub>high</sub>*, contains the *uid* of the low child, resp. the high child, of min(*v<sub>f</sub>*, *v<sub>g</sub>*), whereas max(*v<sub>f</sub>.uid*, *v<sub>g</sub>.uid*) is kept as is.

The pseudocode for the Apply procedure is shown in Fig. 5, where the **RequestsFor** function computes *r<sub>low</sub>* and *r<sub>high</sub>* for the pair of nodes (*t<sub>f</sub>*, *t<sub>g</sub>*). The goal of the rest of the algorithm is to obtain the information that **RequestsFor** needs in an I/O-efficient way. To this end, the two priority queues *Q<sub>app:1</sub>* and *Q<sub>app:2</sub>* are used to synchronise recursion requests for a pair of nodes (*t<sub>f</sub>*, *t<sub>g</sub>*) with the sequential order of reading nodes in *f* and *g*. *Q<sub>app:1</sub>* has elements of the form (*s* →<sub>*is_high*</sub> (*t<sub>f</sub>*, *t<sub>g</sub>*)) and *Q<sub>app:2</sub>* has elements (*s* →<sub>*is_high*</sub> (*t<sub>f</sub>*, *t<sub>g</sub>*), *low*, *high*). The boolean *is_high* and the unique identifier *s*, being the request's origin, are used on lines 33 – 41 to output all in-going arcs when the request is resolved.

Elements in *Q<sub>app:1</sub>* are sorted in ascending order by min(*t<sub>f</sub>*, *t<sub>g</sub>*), i.e. the node encountered first from *f* and *g*. Requests to the same (*t<sub>f</sub>*, *t<sub>g</sub>*) are grouped together by secondarily sorting the tuple lexicographically. *Q<sub>app:2</sub>* is sorted in ascending order by max(*t<sub>f</sub>*, *t<sub>g</sub>*), i.e. the second of the two to be visited, and ties are again broken lexicographically. This second priority queue is used in the case where *t<sub>f</sub>.label* = *t<sub>g</sub>.label* but *t<sub>f</sub>.id* ≠ *t<sub>g</sub>.id*, i.e. when both nodes are needed to resolve the request but they are not necessarily available at the same time. To this end, the given request is moved from *Q<sub>app:1</sub>* into *Q<sub>app:2</sub>* on lines 19 – 23. Here, the request is extended with the unique identifiers *low* and *high* of min(*v<sub>f</sub>*, *v<sub>g</sub>*), which makes the children of min(*v<sub>f</sub>*, *v<sub>g</sub>*) available at max(*v<sub>f</sub>*, *v<sub>g</sub>*).

The next request to process from *Q<sub>app:1</sub>* or *Q<sub>app:2</sub>* is dictated by the **TopOf** function on line 10. In the case that both *Q<sub>app:1</sub>* and *Q<sub>app:2</sub>* are non-empty, let *r*<sub>1</sub> = (*s*<sub>1</sub> →<sub>*is_high*:1</sub> (*t<sub>f:1</sub>*, *t<sub>g:1</sub>*)) be the top element of *Q<sub>app:1</sub>* and let the top element of *Q<sub>app:2</sub>* be *r*<sub>2</sub> = (*s*<sub>2</sub> →<sub>*is_high*:2</sub> (*t<sub>f:2</sub>*, *t<sub>g:2</sub>*), *low*, *high*). **TopOf**(*Q<sub>app:1</sub>*, *Q<sub>app:2</sub>*) returns (*r*<sub>1</sub>, Nil, Nil) if min(*t<sub>f:1</sub>*, *t<sub>g:1</sub>*) < max(*t<sub>f:2</sub>*, *t<sub>g:2</sub>*) and *r*<sub>2</sub> otherwise. If either priority queue is empty, it simply returns the top request of the other.

The arc-based output greatly simplifies the algorithm compared to the original proposal of Arge in [6]. Our algorithm only uses two priority queues rather than four. Arge's algorithm, like ours, resolves a node before its children, but due to the node-based output it has to output this entire node before its children. Hence, it has to identify all nodes by the tuple (*t<sub>f</sub>*, *t<sub>g</sub>*), doubling the space used. The arc-based output, in contrast, allows us to output the information at the time of the children, and hence we are able to generate the label and its new identifier for both parent and child. Arge's algorithm also did not forward a request's source *s*, so repeated requests to the same pair of nodes were merely discarded upon retrieval from the priority queue, since they carried no relevant information. Our arc-based output, on the other hand, makes every element placed in the priority queue forward the source *s*, which is vital for the creation of the semi-transposed graph.

**Proposition 1 (Following Arge 1996 [6]).** *The Apply algorithm in Fig. 5 has I/O complexity O*(*sort*(*N<sub>f</sub>* · *N<sub>g</sub>*)) *and O*((*N<sub>f</sub>* · *N<sub>g</sub>*) · log(*N<sub>f</sub>* · *N<sub>g</sub>*)) *time complexity, where N<sub>f</sub> and N<sub>g</sub> are the respective sizes of the BDDs for f and g.*

See the full paper [33] for the proof.

**Pruning by shortcutting the operator.** The Apply procedure above, like Arge's original algorithm, follows recursion requests until a pair of leaves is met. Yet, for example, in Fig. 4 the node for the request (*x*<sub>2,0</sub>, ⊤) is unnecessary to resolve, since all leaves of this subgraph will trivially be ⊤ due to the implication operator. The subsequent Reduce will remove this node and its children in favour of the ⊤ leaf. Hence, the **RequestsFor** function can instead immediately create a request for the leaf. We implemented this in Adiar, since it considerably decreases the size of *Q<sub>app:1</sub>*, *Q<sub>app:2</sub>*, and of the output.
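The shortcutting test itself is generic: a request can be pruned to a leaf whenever one operand's leaf value already fixes the operator's result. The helper below is an illustrative sketch, not Adiar's **RequestsFor**:

```python
def can_shortcut(op, which, value):
    """Does knowing that one operand is the leaf `value` fix op's
    result regardless of the other operand? `which` is 0 for the
    left operand and 1 for the right."""
    if which == 0:
        results = {op(value, False), op(value, True)}
    else:
        results = {op(False, value), op(True, value)}
    return len(results) == 1          # one possible result: shortcut

IMPLIES = lambda a, b: (not a) or b

# x2 => T is T regardless of x2: the request (x_{2,0}, T) from Fig. 4
# can be replaced by an immediate request for the T leaf.
assert can_shortcut(IMPLIES, 1, True)       # right operand T fixes it
assert can_shortcut(IMPLIES, 0, False)      # left operand F fixes it
assert not can_shortcut(IMPLIES, 0, True)   # T => b still depends on b
```

The same test prunes, e.g., conjunction on a ⊥ operand and disjunction on a ⊤ operand.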

#### **3.2 Reduce**

Our Reduce algorithm in Fig. 6 works like other explicit variants, with a single bottom-up sweep through the OBDD. Since the nodes are resolved and output in descending order during this sweep, the output is in exactly the reverse of the order needed by any following Apply. We have so far ignored this detail, but the only change necessary to the Apply algorithm in Section 3.1 is for it to read the lists of nodes of *f* and *g* in reverse.

The priority queue *Q<sub>red</sub>* is used to forward the reduction result of a node *v* to its parents in an I/O-efficient way. *Q<sub>red</sub>* contains arcs from unresolved sources *s* in the given unreduced OBDD to already resolved targets *t*′ in the ROBDD under construction. The bottom-up traversal corresponds to resolving all nodes in descending order. Hence, arcs *s* →<sub>*is_high*</sub> *t*′ in *Q<sub>red</sub>* are sorted first on *s* and secondly on *is_high*; the latter simplifies retrieving the low and high arcs on lines 8 and 9. The base cases for the Reduce algorithm are the arcs to leaves in *F<sub>leaf</sub>*, which follow the exact same ordering. Hence, on lines 8 and 9, arcs in *Q<sub>red</sub>* and *F<sub>leaf</sub>* are merged using the **PopMax** function that retrieves the arc that is maximal with respect to this ordering.

Since nodes are resolved in descending order, *F<sub>internal</sub>* follows this ordering on the arcs' targets when its elements are read in reverse. The reversal of arcs in *F<sub>internal</sub>* makes the parents of a node *v*, to which the reduction result is to be forwarded, readily available on lines 26 – 32.

The algorithm otherwise proceeds similarly to the standard Reduce algorithm [10]. For each level *j*, all nodes *v* of that level are created from their high and low arcs, *e<sub>high</sub>* and *e<sub>low</sub>*, taken out of *Q<sub>red</sub>* and *F<sub>leaf</sub>*. The nodes are split into the two temporary files *F<sub>j:1</sub>* and *F<sub>j:2</sub>* that contain the mapping [*uid* ↦ *uid*′] from a node in the given unreduced OBDD to its equivalent node in the output. *F<sub>j:1</sub>* contains the nodes *v* removed due to the first reduction rule and is populated on lines 7 – 12: if both children of *v* are the same, then [*v.uid* ↦ *v.low*] is pushed to this file. *F<sub>j:2</sub>* contains the mappings for the second rule and is populated on lines 15 – 24. Nodes not placed in *F<sub>j:1</sub>* are placed in an intermediate file *F<sub>j</sub>* and sorted by their children. This makes duplicate nodes immediate successors. Every unique node encountered in *F<sub>j</sub>* is output to *F<sub>out</sub>* before mapping itself and all its duplicates to it in *F<sub>j:2</sub>*. Since nodes are output out of order compared to the input and it is unknown how many will be output for said level, they are given new decreasing identifiers starting from the maximal possible value MAX_ID. Finally, *F<sub>j:2</sub>* is sorted back in the order of *F<sub>internal</sub>* to forward the results

```
 1  Reduce(F_internal, F_leaf)
 2    F_out ← [] ; Q_red ← ∅
 3    while Q_red ≠ ∅ ∨ F_leaf.has_next() do
 4      j ← Q_red.top().source.label ; id ← MAX_ID
 5      F_j ← [] ; F_j:1 ← [] ; F_j:2 ← []
 6
 7      while Q_red.top().source.label = j do
 8        e_high ← PopMax(Q_red, F_leaf)
 9        e_low ← PopMax(Q_red, F_leaf)
10        if e_high.target = e_low.target
11        then F_j:1.push([e_low.source ↦ e_low.target])
12        else F_j.push((e_low.source, e_low.target, e_high.target))
13      od
14
15      sort v ∈ F_j by v.low and secondly by v.high
16      v′ ← undefined
17      for each v ∈ F_j do
18        if v′ is undefined or v.low ≠ v′.low or v.high ≠ v′.high
19        then
20          id ← id − 1
21          v′ ← (x_{j,id}, v.low, v.high)
22          F_out.push(v′)
23        F_j:2.push([v.uid ↦ v′.uid])
24      od
25
26      sort [uid ↦ uid′] ∈ F_j:2 by uid in descending order
27      for each [uid ↦ uid′] ∈ MergeMaxUid(F_j:1, F_j:2) do
28        while F_internal.peek() matches _ --> uid do
29          (s --is_high--> uid) ← F_internal.next()
30          Q_red.push(s --is_high--> uid′)
31        od
32      od
33    od
34    return F_out
```
Fig. 6: The Reduce algorithm

in both *F<sub>j:1</sub>* and *F<sub>j:2</sub>* to their parents on lines 26 – 32. Here, **MergeMaxUid** merges the mappings [*uid* ↦ *uid*′] in *F<sub>j:1</sub>* and *F<sub>j:2</sub>* by always taking the mapping with the largest *uid* from either file.
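The per-level application of the second reduction rule (lines 15 – 24 of Fig. 6) can be sketched as follows. This is an illustrative stand-in with an assumed MAX_ID value, and the example restricts itself to leaf-valued children so that the sorting keys stay comparable:

```python
MAX_ID = 2**16 - 1   # assumed maximal identifier, for the sketch only

def reduce_level(j, nodes):
    """Rule-2 pass over one level: `nodes` are (uid, low, high)
    triples that survived rule 1. Returns the output nodes and the
    mapping [uid -> uid'] into the reduced BDD."""
    out, mapping = [], {}
    new_uid, prev = None, None
    # Sorting by children makes duplicate nodes immediate successors:
    for uid, low, high in sorted(nodes, key=lambda n: (n[1], n[2])):
        if (low, high) != prev:                # a not-yet-seen node
            new_uid = (j, MAX_ID - len(out))   # decreasing identifiers
            out.append((new_uid, low, high))
            prev = (low, high)
        mapping[uid] = new_uid   # duplicates map to the same new uid
    return out, mapping

# Two duplicate nodes on level 1 collapse into a single output node:
out, m = reduce_level(1, [((1, 0), False, True), ((1, 1), False, True)])
assert len(out) == 1 and m[(1, 0)] == m[(1, 1)] == (1, MAX_ID)
```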

Since the original algorithm of Arge in [6] takes a node-based OBDD as input and internally uses node-based auxiliary data structures, his Reduce algorithm had to create two copies of the input to reverse all internal arcs: one copy sorted by the nodes' low children and one sorted by their high children. Since *F<sub>internal</sub>* already has its arcs reversed, our design eliminates two expensive sorting steps and more than halves the memory used.

Another consequence of Arge's node-based representation is that his algorithm had to move all arcs to leaves into *Q<sub>red</sub>* rather than merging requests from *Q<sub>red</sub>* with the base cases from *F<sub>leaf</sub>*. The semi-transposed input allows us to decrease the number of I/Os due to *Q<sub>red</sub>* by *Θ*(sort(*N<sub>ℓ</sub>*)), where *N<sub>ℓ</sub>* is the number of arcs to leaves (see [33] for the proof). In practice, together with pruning the recursion during Apply, this can provide up to a factor 2 speedup [33].

**Proposition 2 (Following Arge 1996 [6]).** *The Reduce algorithm in Fig. 6 has an O*(*sort*(*N*)) *I/O complexity and an O*(*N* log *N*) *time complexity.*

See the full paper [33] for the proof. Arge proved in [6] that this *O*(sort(*N*)) I/O complexity is optimal for the input, assuming a levelwise ordering of nodes.

#### **3.3 Other BDD Algorithms**

By applying the above algorithmic techniques, one can obtain all other singly-recursive BDD algorithms; see [33] for the details. We now design asymptotically better variants of Negation and Equality Checking than what is possible by deriving them using Apply.

**Negation.** A BDD is negated by inverting the leaf values stored in its nodes. This is an *O*(1) I/O-operation if a *negation flag* is used to mark whether the nodes should be negated on-the-fly as they are read from the stream.

**Proposition 3.** *Negation has I/O, space, and time complexity O*(1)*.*

This is an improvement over the *O*(sort(*N*)) I/Os spent by Apply to compute *f* ⊕ ⊤, where ⊕ is exclusive or. Furthermore, disk space is shared between BDDs.
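A sketch of the negation flag, with an assumed minimal BDD wrapper (not Adiar's actual classes):

```python
class Bdd:
    """A BDD as a (shared) sorted node list plus a negation flag.
    Negating flips the flag: O(1), no copying of the node file."""

    def __init__(self, nodes, negated=False):
        self.nodes = nodes        # shared between a BDD and its negation
        self.negated = negated

    def __invert__(self):
        return Bdd(self.nodes, not self.negated)

    def leaf(self, value):
        """A stored leaf value as seen through the negation flag."""
        return (not value) if self.negated else value

f = Bdd([((2, 0), False, True)])
g = ~f
assert g.nodes is f.nodes                  # disk space is shared
assert f.leaf(True) and not g.leaf(True)   # leaves flip on-the-fly
```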

**Equality Checking.** To check whether *f* ≡ *g*, one has to check that the DAG of *f* is isomorphic to that of *g* [10]. This makes *f* and *g* trivially inequivalent when the number of nodes, the number of levels, or the label or size of each of the *L* levels do not match. This can be checked in *O*(1) and *O*(*L/B*) I/Os, respectively, if the Reduce algorithm in Fig. 6 is made to also output the relevant meta-information.

If *f* ≡ *g*, the isomorphism relates the roots of the BDDs for *f* and *g*. For any node *v<sub>f</sub>* of *f* and *v<sub>g</sub>* of *g*, if (*v<sub>f</sub>*, *v<sub>g</sub>*) is uniquely related by the isomorphism, then so should be (*v<sub>f</sub>.low*, *v<sub>g</sub>.low*) and (*v<sub>f</sub>.high*, *v<sub>g</sub>.high*). Hence, one can check for equality by traversing the product of both BDDs (as in Apply) and checking whether one of the following two conditions is violated.


If the first condition is never violated, it is guaranteed that *f* ≡ *g*, and so ⊤ is output. The second ensures that the algorithm terminates earlier on negative cases and lowers the provable complexity bound; see [33] for the proof.

**Proposition 4.** *Equality Checking has I/O complexity O*(*sort*(*N*)) *and time complexity O*(*N* log *N*)*, where N* = min(*N<sub>f</sub>*, *N<sub>g</sub>*) *is the minimum of the respective sizes of the BDDs for f and g.*

If (1) on page 5 is extended such that ⊥ and ⊤ succeed all unique identifiers and ⊥ < ⊤, then Fig. 6 actually enforces a much stricter ordering; it outputs nodes in an order based purely on their label and the unique identifiers of their children.

**Proposition 5.** *If G<sub>f</sub> and G<sub>g</sub> are outputs of Reduce in Fig. 6, then f* ≡ *g if and only if the ith nodes of G<sub>f</sub> and G<sub>g</sub> match numerically.*

See the full paper [33] for the proof. The negation operation breaks this property by changing the leaf values without changing their order. So, in the case where *f* or *g*, but not both, has its negation flag set, one still has to use the *O*(sort(*N*)) algorithm above; otherwise a simple linear scan of both BDDs suffices.

**Corollary 1.** *If the negation flags of the BDDs for f and g are equal, then Equality Checking can be done in* 2 · *N/B I/Os and O*(*N*) *time, where N* = min(*N<sub>f</sub>*, *N<sub>g</sub>*) *is the minimum of the respective sizes of the BDDs for f and g.*

Both Proposition 4 and Corollary 1 are asymptotic improvements on the *O*(sort(*N*<sup>2</sup>)) equality checking obtained by computing *f* ↔ *g* with Apply and Reduce and then testing whether the output is the ⊤ leaf.
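The linear-scan check of Corollary 1 amounts to a single synchronized sweep over the two canonical node lists; a sketch under the assumed list representation, with matching negation flags taken for granted:

```python
def equal_linear(f_nodes, g_nodes):
    """Streaming equality for two canonical (Reduce-output) node
    lists with equal negation flags: one synchronized linear scan,
    i.e. 2 * N/B I/Os in the I/O-model."""
    if len(f_nodes) != len(g_nodes):   # cheap metadata mismatch
        return False
    return all(u == v for u, v in zip(f_nodes, g_nodes))

a = [((0, 0), (1, 0), (1, 1)), ((1, 0), False, True), ((1, 1), True, False)]
b = [((0, 0), (1, 0), (1, 1)), ((1, 0), False, True), ((1, 1), True, False)]
c = [((0, 0), (1, 0), (1, 1)), ((1, 0), False, True), ((1, 1), False, True)]
assert equal_linear(a, b) and not equal_linear(a, c)
```

This only works because Reduce's output is canonical (Proposition 5); comparing two arbitrary OBDD node lists entry by entry would not decide equivalence.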

#### **4 Adiar: An Implementation**

The algorithms and data structures described in Section 3 have been implemented in a new BDD package, named Adiar<sup>1,2</sup>. The most important operations are shown in Table 1. Interaction with the BDD package is done through C++ programs that include the <adiar/adiar.h> header file and are built and linked with CMake. Its two dependencies are the Boost library and the TPIE library; the latter is included as a submodule of the Adiar repository, leaving it to CMake to build TPIE and link it to Adiar.

Adiar is initialised with the adiar_init(memory, temp_dir) function, where memory is the memory (in bytes) dedicated to Adiar and temp_dir is the directory where temporary files will be placed, e.g. on a dedicated hard disk. The BDD package is deinitialised by calling the adiar_deinit() function.

The bdd object in Adiar is a container for the underlying files of each BDD, while a \_\_bdd object is used for possibly unreduced, arc-based OBDDs. Reference counting on the underlying files is used to reuse the same files and to immediately delete them when the reference count reaches 0. Files are deleted as early as possible by use of implicit conversions between the bdd and \_\_bdd objects and an overloaded assignment operator, keeping the concurrently occupied space on disk minimal.

<sup>1</sup> **adiar** ⟨portuguese⟩ (*verb*): to defer, to postpone

<sup>2</sup> Source code is publicly available at github.com/ssoelvsten/adiar


Table 1: Some of the operations supported by Adiar and their I/O-complexity.

### **5 Experimental Evaluation**

While time-forward processing may be an asymptotic improvement over the recursive approach in the I/O-model, its usability in practice is another question entirely. We have compared *Adiar* 1.0.1 to the recursive BDD packages *CUDD* 3.0.0 [34] and *Sylvan* 1.5.0 [16] (in single-core mode). We constructed BDDs for some benchmarks in all tools in a similar manner, ensuring the same variable ordering.

The experimental results<sup>3</sup> were obtained on server nodes of the *Grendel* cluster at the Centre for Scientific Computing Aarhus. Each node has two 48-core 3.0 GHz Intel Xeon Gold 6248R processors, 384 GiB of RAM, and 3.5 TiB of available SSD disk; the nodes run CentOS Linux and compile code with GCC 10.1.0. We report the *minimum* measured running time, since it minimises any error caused by the CPU, memory, and disk [13]; using the average or median does not significantly change any of our results. For comparability, all compute nodes are set to use 350 GiB of the available RAM, while each BDD package is given 300 GiB of it. Sylvan was set not to use any parallelisation, was given a ratio between the node table and the cache of 64:1, and was set to start its data structures 2<sup>12</sup> times smaller than the final 262 GiB they may occupy, i.e. at first with a table and cache that occupy 66 MiB. The size of the CUDD cache was set such that it would have the same node table to cache ratio when reaching 300 GiB.

#### **5.1 Queens**

The solution to the Queens problem is the number of arrangements of *N* queens on an *N* × *N* board such that no queen is threatened by another. Our benchmark follows the description in [22]: the variable *x<sub>ij</sub>* represents whether a queen is placed on the *i*th row and the *j*th column, and the solution to the problem then corresponds to the number of satisfying assignments of the formula $\bigwedge_{i=0}^{N-1} \bigvee_{j=0}^{N-1} (x_{ij} \wedge \neg \mathit{has\_threat}(i, j))$, where $\mathit{has\_threat}(i, j)$ is true if a queen placed on some tile (*k*, *l*) would be in conflict with a queen placed on (*i*, *j*).

<sup>3</sup> Available at Zenodo [32] and at github.com/ssoelvsten/bdd-benchmark

Fig. 7: Running time solving *N*-Queens (lower is better).

The ROBDD of the innermost conjunction can be directly constructed, without any BDD operations.
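The solution counts that the satisfying-assignment count should reproduce can be cross-checked with a small brute-force search. The `has_threat` helper below is our own illustrative reading of the predicate, not the benchmark's implementation:

```python
def has_threat(i, j, queens):
    """Is a queen on (i, j) in conflict with an already placed queen?
    `queens` maps row k to the column of that row's queen."""
    return any(l == j or abs(k - i) == abs(l - j)
               for k, l in queens.items())

def count_queens(N, row=0, queens=None):
    """Brute-force count of N-Queens solutions: one unthreatened
    queen per row, mirroring the conjunction over rows of the
    disjunction over columns."""
    if queens is None:
        queens = {}
    if row == N:
        return 1
    total = 0
    for col in range(N):
        if not has_threat(row, col, queens):
            queens[row] = col
            total += count_queens(N, row + 1, queens)
            del queens[row]
    return total

assert [count_queens(N) for N in (4, 5, 6)] == [2, 10, 4]
```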

The current version of Adiar is implemented purely using external memory algorithms. These perform poorly when given small amounts of data. Hence, it is not meaningful to compare performance for *N* < 12, where the BDDs involved are 23.5 MiB or smaller. For *N* ≥ 12, Fig. 7 shows how the gap in running time between Adiar and the other BDD packages shrinks as instances grow. At *N* = 15, which is the largest instance solved by Sylvan and CUDD, Adiar is 1.47 times slower than CUDD and 2.15 times slower than Sylvan.

The largest instance solved by Adiar is *N* = 17, where the largest BDD constructed is 719 GiB in size. In contrast, Sylvan only constructed a 12.9 GiB sized BDD for *N* = 15. Even though Adiar has to use the disk, it only becomes 1.8 times slower per processed node compared to its highest performance at *N* = 13. Conversely, Adiar is able to solve the *N* = 15 problem with much less main memory than both Sylvan and CUDD. Fig. 8 shows the running time on the same machine with its memory, including its file system cache, limited with *cgroups* to 1 GiB more than given to the BDD package. Yet, Adiar is only 1.39 times slower when its memory is decreased down to 2 GiB, while Sylvan cannot function with less than 56 GiB of memory available.

Fig. 8: Running time of 15-Queens with variable memory (lower is better).

We also ran experiments on counting the number of draw positions in a 3D version of Tic-Tac-Toe, derived from [22]. Our results [33] paint a similar picture: Adiar is only 2.50 times slower than Sylvan on Sylvan's largest solved instance; Sylvan only creates BDDs of up to 34.4 GiB in size, whereas Adiar constructs a 902 GiB sized BDD; and Adiar only slows down by a factor of 2.49 per processed node when using the disk extensively to solve the larger instances.

#### **5.2 Combinatorial Circuit Verification**

The *EPFL* Combinational Benchmark Suite [2] consists of 23 combinational circuits designed for logic optimisation and synthesis. 20 of these are split into the two categories *random/control* and *arithmetic*, and each of these original circuits *C<sup>o</sup>* is distributed together with one circuit optimised for size, *C<sup>s</sup>*, and one optimised for depth, *C<sup>d</sup>*. The last three are the *More than a Million Gates* benchmarks, which we ignore as they come without optimised versions.

Based on the approach of the *Nanotrav* program distributed with CUDD, we verify the functional equivalence between each output gate of *C<sup>o</sup>* and the corresponding gate in each of the optimised circuits *C<sup>d</sup>* and *C<sup>s</sup>*. The BDDs are computed by representing every input gate by a decision variable and computing the BDD of every other gate from the BDDs of its input wires. Finally, the BDDs for every pair of corresponding output gates are tested for equality. Memoisation ensures that the same gate is not computed twice, while a reference counter is maintained for each gate such that dead BDDs in the memoisation table may be garbage collected. Recall that Adiar stores each BDD in a separate file, while Sylvan and CUDD share nodes between different BDDs in a forest.
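The construction loop can be sketched as follows, with truth-table bitmasks standing in for BDDs (bit *a* of a wire's mask is its value under input assignment *a*); the two-input gate format, the operator set, and all wire names are illustrative, not Nanotrav's actual interface:

```python
def gate_masks(circuit, inputs, outputs):
    """Evaluate every needed gate of a combinational circuit once.

    `circuit` maps a wire name to (op, in1, in2).  `memo` is the
    memoisation table; `refs` counts remaining uses of each wire so that
    dead entries can be dropped, mimicking garbage collection of dead
    BDDs in the memoisation table.
    """
    n = len(inputs)
    full = (1 << (1 << n)) - 1
    # Input wire i is true in exactly the assignments whose bit i is set.
    memo = {x: sum(1 << a for a in range(1 << n) if a >> i & 1)
            for i, x in enumerate(inputs)}
    refs = {}
    for op, a, b in circuit.values():
        for w in (a, b):
            refs[w] = refs.get(w, 0) + 1
    for w in outputs:
        refs[w] = refs.get(w, 0) + 1

    def get(w):
        if w not in memo:
            op, a, b = circuit[w]
            ma, mb = get(a), get(b)
            memo[w] = {"and": ma & mb, "or": ma | mb,
                       "xor": ma ^ mb}[op] & full
            for u in (a, b):
                refs[u] -= 1
                if refs[u] == 0:          # dead intermediate result
                    del memo[u]
        return memo[w]

    return {w: get(w) for w in outputs}
```

Equality of the output masks of two circuits then corresponds to the BDD equality test between corresponding output gates of *C<sup>o</sup>* and an optimised circuit.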

Table 2 shows the number of instances verified with each BDD package within a 15-day time limit. Adiar is able to verify three more benchmarks than both of the other BDD packages. This is despite the fact that most instances involve hundreds of concurrent BDDs, while the disk is only 12 times larger than main memory. For example, the largest verified benchmark, *mem ctrl*, has up to 1231 BDDs existing at the same time.

Table 3 shows the time it took Adiar to verify equality between the original and each of the optimised circuits, for the three largest cases verified. The table also shows the sum of the sizes of the output BDDs that represent each circuit. Throughout all solved benchmarks, equality checking took less than 1.47% of the total construction time, and the *O*(*N/B*) algorithm could be used in 71.6% of all BDD comparisons. The *voter* benchmark with its single output shows that


Table 2: Number of verified *arithmetic* and *random/control* circuits from [2]


Table 3: Running time for equivalence testing. *O*(sort(*N*)) and *O*(*N/B*) is the number of times the respective algorithm in Section 3.3 was used.

the *O*(*N/B*) algorithm is about 10 times faster than the *O*(sort(*N*)) algorithm and can compare at least 2 · 5.75 MiB / 0.006 s = 1.86 GiB/s.

### **6 Conclusions and Future Work**

Adiar provides an I/O-efficient implementation of BDDs. The iterative BDD algorithms exploit a topological ordering of the BDD nodes in external memory, by use of priority queues and sorting algorithms. All recursion requests for a single node are processed together, eliminating the need for a memoisation table.
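A minimal in-memory sketch of this idea for conjunction (deliberately omitting Adiar's file-based node streams, levelised cuts, and reduction phase) might look as follows; the BDD encoding and all names are assumptions of this illustration:

```python
import heapq
from itertools import count

def bdd_and(f, g, rf, rg):
    """Conjoin two BDDs by time-forward processing: recursion requests
    live in a priority queue ordered by variable level, so all requests
    reaching the same pair of nodes are handled together and no
    memoisation table is needed.  A BDD maps node name ->
    (level, low, high); low/high are node names or the terminals
    True/False.  Returns an evaluator for the conjunction.
    """
    INF = float("inf")
    tie = count()                                   # heap tie-breaker
    lv = lambda n, b: b[n][0] if n in b else INF

    prod = {}                                       # (u, v) -> (level, lo, hi)
    pq = [(min(lv(rf, f), lv(rg, g)), next(tie), (rf, rg))]
    while pq:
        lvl, _, (u, v) = heapq.heappop(pq)
        if lvl == INF or (u, v) in prod:            # terminal pair / duplicate
            continue
        kids = []
        for i in (1, 2):                            # low branch, high branch
            cu = f[u][i] if lv(u, f) == lvl else u  # copy non-branching node down
            cv = g[v][i] if lv(v, g) == lvl else v
            kids.append((cu, cv))
            heapq.heappush(pq, (min(lv(cu, f), lv(cv, g)), next(tie), (cu, cv)))
        prod[(u, v)] = (lvl, kids[0], kids[1])

    def evaluate(assignment):                       # follow one path downwards
        pair = (rf, rg)
        while pair in prod:
            lvl, lo, hi = prod[pair]
            pair = hi if assignment[lvl] else lo
        u, v = pair                                 # both terminals here
        return u and v
    return evaluate
```

Because requests are popped in level order, the product is built one level at a time; Adiar exploits exactly this ordering to stream nodes sequentially from disk.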

The performance of Adiar is very promising in practice for instances larger than a few hundred MiB. As the size of the BDDs increases, the performance of Adiar gets closer to conventional recursive BDD implementations: for BDDs larger than a few GiB, the use of Adiar has resulted in at most a 3.69 factor slowdown. Simultaneously, the design of our algorithms allows us to compute on BDDs that outgrow main memory with only a 2.49 factor slowdown, which is negligible compared to the slowdown of using swap memory with conventional BDD packages.

This performance comes at the cost of Adiar not being able to share nodes between BDDs. Yet, this increase in space usage is not a problem in practice, and it makes garbage collection a trivial and cheap deletion of files on disk. On the other hand, the lack of sharing makes it impossible to check for functional equivalence with a mere pointer comparison. Instead, one has to explicitly check whether the two DAGs are isomorphic. We have improved the asymptotic and practical performance of equality checking such that it is negligible in practice.

This lays the foundation on which we intend to develop the external memory versions of the BDD algorithms that are still missing for symbolic model checking. Specifically, we intend to improve the performance of quantifying multiple variables and to design a relational product operation. Furthermore, we will improve performance for small instances that fit entirely into internal memory.

#### **Acknowledgements**

Thanks to the late Lars Arge, to Gerth S. Brodal, and to Mathias Rav for their inputs. Furthermore, thanks to the Centre for Scientific Computing Aarhus (phys.au.dk/forskning/cscaa/) for running our experiments on their cluster.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Forest GUMP: A Tool for Explanation

#### Alnis Murtovi(), Alexander Bainczyk, and Bernhard Steffen

Chair for Programming Systems, TU Dortmund University, Dortmund, Germany {alnis.murtovi,alexander.bainczyk,bernhard.steffen}@tu-dortmund.de

Abstract. In this paper, we present Forest GUMP (for Generalized, Unifying Merge Process), a tool for providing tangible experience with three concepts of explanation. Besides the well-known model explanation and outcome explanation, Forest GUMP also supports class characterization, i.e., the precise characterization of all samples with the same classification. The key technology to achieve these results is algebraic aggregation, i.e., the transformation of a Random Forest into a semantically equivalent, concise white-box representation in terms of Algebraic Decision Diagrams (ADDs). The paper sketches the method and illustrates the use of Forest GUMP along an illustrative example taken from the literature. This way, readers should acquire an intuition about the tool and the way it should be used to increase their understanding not only of the considered dataset, but also of the character of Random Forests and the ADD technology, here enriched to comprise infeasible path elimination.

Keywords: Random Forest, Binary/Algebraic Decision Diagram, Aggregation, Infeasible Paths, Explainability, Random Seed

### 1 Introduction

Random Forests are one of the most widely known classifiers in machine learning [3,17]. The method is easy to understand and implement, and at the same time achieves impressive classification accuracies in many applications. Compared to other methods, Random Forests are fast to train and they are clearly more suitable for smaller datasets. In contrast to a single decision tree, Random Forests, a collection of many trees, do not overfit as easily on a dataset and their variance decreases with their size. On the other hand, Random Forests are considered black-box models because of their highly parallel nature: following the execution of Random Forests means, in particular, following the execution in all the involved trees. Such black-box executions are hard to explain to a human user even for very small examples.

In contrast, decision trees are considered white-box models because of their sequential evaluation nature. Even if a tree is large in size, a human can easily follow its computation step by step by evaluating (simple) decisions at each node from the root to a leaf. Indeed, the set of decisions along such an execution path precisely explains why a certain choice has been taken.

Popular methods towards explainability try to establish some user intuition. For example, they may hint at the most influential input data, like highlighting or framing the area of a picture where a face has been identified. Such information is very useful, and it helps in particular to reveal some of the "popular" drastic mismatches incurred by neural networks: if the framed area of the image does not contain the "tagged" object, the identification is clearly questionable. However, even for a correct classification, the tag by itself gives no reason why the identification is indeed correct.

More ambitious are methods that try to turn black-box models into white-box models, ideally preserving the semantics of the classification function. For Random Forests this has been achieved for the first time in [10,14] using the 'aggregating power' of Algebraic Decision Diagrams (ADDs) and Binary Decision Diagrams (BDDs). ADDs are essentially decision trees whose leaves are labelled with elements of some algebra, whereas BDDs are the special case for the algebra of Boolean values. Lifting the algebraic operations from the leaves to the entire ADDs/BDDs allows one to aggregate entire Random Forests into single, semantically equivalent ADDs, the precondition for solving three explainability problems:


In this paper, we present Forest GUMP (for Generalized, Unifying Merge Process), a tool for providing a tangible experience with the described concepts of explanation. Experimentation with Forest GUMP not only yields semantically equivalent, concise white-box representations for a given Random Forest, which reveal characteristics of the underlying datasets, but also allows one to experience, e.g., the impact of random seeds on both the quality of prediction and the size of the explaining models (cf. Section 6). Our implementation relies on the standard Random Forest implementation in Weka [28] and on the ADD implementation of the ADD-Lib [9,12,26]. For a more detailed description of the transformations and a quantitative analysis, we refer the reader to [10,11,14].

Related Work: Various methods for making Random Forests interpretable exist, such as extracting decision rules from the considered black-box model [6], methods that are agnostic to the black-box model under consideration [20,24], or deriving a single decision tree from the black-box model [5,7,16,27,29]. In this context, single decision trees are considered key to a solution of both the model explanation and the outcome explanation problems. State-of-the-art solutions that derive a single decision tree from a Random Forest are approximative [5,7,16,27,29]. Thus, their derived explanations are not fully faithful to the original semantics of the considered Random Forest. This is in contrast to our ADD-based aggregation, which precisely reflects the semantics of the original Random Forest.

After a short introduction to Random Forests in Section 2, we present our approach to their aggregation in Section 3, which is followed by an elimination of redundant predicates from the decision diagrams in Section 4 and a non-compositional abstraction in Section 5. Section 6 introduces Forest GUMP and solutions to the three explainability problems. In the end, we summarize the lessons we have learned using Forest GUMP in Section 7, which is followed by a conclusion and directions for future work in Section 8.

# 2 Random Forests

Learning Random Forests is a quite popular and algorithmically relatively simple classification technique that yields good results for many real-world applications. Its decision model generalises a training dataset that holds examples of input data labelled with the desired output, also called the class. As its name suggests, an ensemble of decision trees constitutes a Random Forest. Each of these trees is itself a classifier that was learned from a random sample of the training dataset. Consequently, all trees differ in structure, represent different decision functions, and can yield different decisions for the very same input data.

To apply a Random Forest to previously unseen input data, every decision tree is evaluated separately: Tracing the trees from their root down to one of the leaves yields one decision per tree, i.e. the predicted class. The overall decision of the Random Forest is then derived as the most frequently chosen class, an aggregation commonly referred to as majority vote. The key advantage of this approach is, compared to single decision trees, the reduced variance. A detailed introduction to Random Forests, decision trees, and their learning procedures can be found in [3,17,23].
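Under the stated scheme (one vote per tree, then a majority vote), forest evaluation can be sketched as follows; the tuple-based tree encoding and the Iris-style feature names are purely illustrative:

```python
from collections import Counter

def classify(forest, sample):
    """Evaluate every tree separately, then take the majority vote.

    A tree is either a class label (a leaf) or a tuple
    (feature, threshold, if_true, if_false) branching on the predicate
    sample[feature] < threshold.
    """
    def vote(tree):
        while isinstance(tree, tuple):
            feature, threshold, if_true, if_false = tree
            tree = if_true if sample[feature] < threshold else if_false
        return tree

    return Counter(vote(t) for t in forest).most_common(1)[0][0]
```

The linear pass over all trees is exactly the per-classification effort that the aggregation of Section 3 removes.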

In this paper, we use Weka [28] as our reference implementation of Random Forests. However, our approach does not depend on implementation details and can be easily adapted to other implementations.

Figure 1 shows a small Random Forest that was learned from the popular Iris dataset [8]. The dataset lists the dimensions of Iris flowers' sepals and petals for three different species. Using this forest to decide the species on the basis of given measurements requires first evaluating the three trees individually and subsequently determining the majority vote. This effort clearly grows linearly with the size of the forest. In the following, we use this example to illustrate our approach of forest aggregation for explainability.

Fig. 1. Random Forest learned from the Iris dataset [8] (39 nodes).

The key idea behind our approach is to partially evaluate the Random Forest at construction time, which, in particular, eliminates redundancies between the individual trees of a Random Forest. E.g., in our accompanying Iris flower example (cf. Fig. 1), the predicate petalwidth < 1.65 is used in all three trees. This can easily lead to cases where the same predicate is evaluated many times in the classification process. The partial evaluation proposed in this paper transforms Random Forests into decision structures where such redundancies are totally eliminated.

An adequate data structure to achieve this goal for binary decisions are Binary Decision Diagrams [1,4,19] (BDDs): For a given predicate ordering, they constitute a normal form where each predicate is evaluated at most once, and only if required to determine the final outcome.

Algebraic Decision Diagrams (ADDs) [2] generalise BDDs to capture functions of the type B<sup>P</sup> → C, which is exactly what we need to specify the semantics of Random Forests for a classification domain C. Moreover, in analogy to BDDs, which inherit the algebraic structure of their co-domain B, ADDs also inherit the algebraic structure of their co-domains, if available.

We exploit this property during the partial evaluation of Random Forests by considering the class vector co-domain (cf. Sect. 3). The aggregation to achieve the corresponding optimised decision structures is then a straightforward consequence of the used ADD technology.

#### 3 Class Vector Aggregation

Class vectors faithfully represent the information about how many trees of the original Random Forest voted for a certain outcome. Obviously, this information is sufficient to obtain the precise results of a corresponding majority vote. Formally, the domain of class vectors forms a monoid

$$V := (\mathbb{N}^{|\mathcal{C}|}, +, \mathbf{0})$$

where addition + is defined component-wise and 0 is the neutral element.
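A small sketch of this monoid, with an assumed class order for the Iris example (the `CLASSES` list and the vote labels are illustrative):

```python
CLASSES = ["setosa", "versicolor", "virginica"]      # assumed class order

def unit(label):
    """Class vector recording a single tree's vote for `label`."""
    return tuple(int(c == label) for c in CLASSES)

def vadd(u, v):
    """The monoid operation +: component-wise addition."""
    return tuple(a + b for a, b in zip(u, v))

ZERO = (0,) * len(CLASSES)                           # neutral element 0

# Aggregating three tree votes, one tree after the other, starting
# from the empty forest (the neutral element):
vector = ZERO
for label in ["setosa", "versicolor", "setosa"]:
    vector = vadd(vector, unit(label))
# vector is now (2, 1, 0): two votes for setosa, one for versicolor
```

Lifting `vadd` from leaves to whole diagrams is what turns this monoid into the ADD aggregation of the forest.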

Fig. 2. Class vector aggregation of the Random Forest (83 nodes).

Fig. 3. Class vector aggregation of the Random Forest without semantically redundant nodes (43 nodes).

With the compositionality of the algebraic structure V and the corresponding ADDs D<sup>V</sup> , we can transform any Random Forest incrementally into a semantically equivalent ADD. Starting with the empty Random Forest, i.e. the neutral element 0, we consider one tree after the other, aggregating a growing sequence of decision trees until the entire forest is entailed in the new decision diagram. The details of this transformation are described in [14]. Figure 2 shows the result of this transformation for our running example.

### 4 Infeasible Path Elimination

When aggregating the trees of a Random Forest, one has to deal with their varying sets of predicates. In contrast to simple Boolean variables, predicates are not independent of one another, i.e. the evaluation of one predicate may yield some degree of knowledge about other predicates. E.g., the predicate petallength < 2.45 induces knowledge about other predicates that reason about petallength: when the petal length is smaller than 2.45, it cannot possibly be greater than or equal to 2.7 at the same time. This is not taken care of by the symbolic treatment of predicates we have followed until now. In fact, predicates are typically considered independent in the ADD/BDD community.

Infeasible path elimination, as illustrated by the difference between Figure 2 and Figure 3 for our running example, leverages the potential of a semantic

treatment of predicates with significant effect on the size of the resulting ADDs. In fact, the experiments with thousands of trees reported in [14] would not have been successful without infeasible path elimination.

Please note that infeasible path elimination


Infeasible path elimination is a hard problem in general.<sup>1</sup> Our corresponding implementation uses SMT-solving [21] to eliminate all infeasible paths. An in-depth discussion of infeasible path elimination is a topic in its own right and beyond the scope of this paper.
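For the pure threshold predicates of the running example, path feasibility reduces to bookkeeping of bounds per variable. The sketch below covers only this special case; the SMT-based implementation also handles richer predicate languages (all names are illustrative):

```python
def feasible(path):
    """Check whether a path's predicate decisions are jointly satisfiable.

    `path` is a list of ((variable, threshold), decision) entries, where
    decision True means the predicate `variable < threshold` held on the
    path, and False means its negation held.
    """
    lower, upper = {}, {}                 # per-variable bounds
    for (var, c), holds in path:
        if holds:                         # var < c tightens the upper bound
            upper[var] = min(upper.get(var, float("inf")), c)
        else:                             # var >= c tightens the lower bound
            lower[var] = max(lower.get(var, float("-inf")), c)
    return all(lower.get(v, float("-inf")) < upper.get(v, float("inf"))
               for v in set(lower) | set(upper))
```

E.g. a path asserting both petallength < 2.45 and ¬(petallength < 2.7) is reported infeasible, matching the example from the text.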

Class vector aggregation and infeasible path elimination are both compositional and can therefore be applied in arbitrary order without changing the semantics. The majority vote at compile time described in the next section is not compositional and must therefore be applied at the very end.

#### 5 Majority Vote at Compile Time

As mentioned above, maintaining the information about the result of the majority vote is not compositional. In fact, knowing the result of the majority votes for two Random Forests gives no clue about the majority vote of the combined forest. Thus, the majority vote abstraction can only be applied at the very end, after the entire aggregation has been computed compositionally.

The result of the compositional aggregation process, including infeasible path elimination, is a decision diagram d ∈ D<sup>V</sup> with class vectors in its terminal nodes. The majority vote abstraction ∆<sup>C</sup> : D<sup>V</sup> → D<sup>C</sup> can now be defined as the lifted version of the majority vote abstraction on class vectors v ∈ N<sup>|C|</sup> (cf. [14]):

$$\delta\_C(\mathbf{v}) := \operatorname\*{arg\,max}\_{c \in \mathcal{C}} \mathbf{v}\_c.$$

Note that δ<sup>C</sup> does not project into the same carrier set but rather maps from one algebraic structure V into another, C. However, such transformations can be applied to the corresponding decision diagrams in the very same way. Fig. 4 shows the result of the most frequent class abstraction for our running example.
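A sketch of δ<sub>C</sub> and its lifting over a (tree-shaped, for simplicity) diagram; the dict-based node format and the tie-breaking rule are assumptions of this illustration, not of the paper:

```python
def delta_C(vector, classes):
    """Majority vote on one class vector: arg max over the classes.
    (Ties fall to the lexicographically larger class name here -- an
    assumption of this sketch.)"""
    return max(zip(vector, classes))[1]

def lift(node, classes):
    """The lifted abstraction Delta_C: apply delta_C to every terminal.

    Inner nodes are dicts {"pred": ..., "lo": ..., "hi": ...} (hi taken
    when the predicate holds); leaves are class vectors.  A real ADD
    implementation additionally merges the now-equal terminals and
    re-reduces the diagram.
    """
    if isinstance(node, dict):
        return {"pred": node["pred"],
                "lo": lift(node["lo"], classes),
                "hi": lift(node["hi"], classes)}
    return delta_C(node, classes)
```

Applied to the diagram of Fig. 3, this leaf-wise relabelling followed by reduction is what produces the 18-node diagram of Fig. 4.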

<sup>1</sup> For the cases considered here it is polynomial, but there are of course theories for which it becomes exponentially hard or even undecidable.

Fig. 4. Most frequent label abstraction of the aggregated Random Forest (majority vote) without semantically redundant nodes (18 nodes).

# 6 Forest GUMP and Three Problems of Explainability

Forest GUMP<sup>2</sup> (Generalized Unifying Merge Process) is a tool we developed to illustrate the power of algebraic aggregation for the optimization and explanation of Random Forests. It is designed to allow everyone, in particular people without IT or machine learning knowledge, to experience the nature of Random Forests. To avoid unnecessary entry hurdles, we decided to implement Forest GUMP as a simple-to-use web application. It allows the user to experience the methods described in the previous sections and the proposed solutions to the explainability problems. We will first give a brief overview of Forest GUMP and then showcase its potential in the following sections.

Forest GUMP's user interface (see Figure 5) is essentially divided into two parts. On the left side, the user can input the data necessary to learn a Random Forest and subsequently visualize it, while the currently chosen representation is visualized on the right side. First, the user has to upload a dataset, or choose one of six datasets that we provide (cf. (1) in Fig. 5), on which the Random Forest will be learned. Next, the hyperparameters necessary for the learning procedure have to be selected, such as the number of trees to be learned (cf. (2) in Fig. 5). Then, one can choose different aggregation methods, i.e. the ones

<sup>2</sup> A link to a running instance of Forest GUMP is available at https://gitlab.com/scce/forest-gump.

Fig. 5. Overview of Forest GUMP. The visualized ADD is our solution to the class characterization problem (cf. Sect. 6.3) for the class Iris-Setosa.


Fig. 6. The execution history in Forest GUMP.

mentioned in the previous sections and further ones that will be explained in the following sections (cf. (3) in Fig. 5). It is also possible to input a sample, classify it with the ADD, and highlight the path from the root to the leaf (satisfied predicates are highlighted in green, unsatisfied predicates in red). In the end, the currently visualized ADD can be exported, as Forest GUMP provides code generators for Java, C++, Python, and GraphViz's dot format (cf. (4) in Fig. 5). Additionally, the currently visualized ADD can be exported as an SVG to be viewed locally (cf. (4) in Fig. 5).

The grey rectangle (cf. (5) in Fig. 5) points to the root of the currently visualized ADD. One can zoom in and out, which can be helpful when the ADDs are rather large (cf. (6) in Fig. 5). On the top left, the number of nodes and the length of the currently highlighted path are displayed (cf. (7) in Fig. 5). On the bottom right, one can open a history of all the representations one chose to visualize (cf. (8) in Fig. 5).

Figure 6 shows the expanded execution history. For each visualized ADD, the execution history lists the aggregation variant, the hyperparameters used to learn


Fig. 7. The user can either choose to upload their own dataset or select one of six exemplary datasets.

the Random Forest, the size (i.e. the number of nodes), and the maximum depth, i.e. the longest path from root to leaf. The execution history also allows one to replay an experiment by clicking the button on the right side of a row, which makes it possible to compare different ADD variants. One can also delete individual entries or the whole history, and export the history as a CSV file.

### 6.1 A Walkthrough of Forest GUMP

In the following, we will see how hard it is to understand how a Random Forest comes to its decision, and we will provide methods for solving the three explainability problems with absolute precision.

Learning a Random Forest To begin, we need a Random Forest, which requires a dataset on which it will be learned. In Forest GUMP, the user can upload their own dataset in the Attribute-Relation File Format (ARFF) [28]. Alternatively, we provide six exemplary datasets from which a user can select one to directly start using the tool. Figure 7 illustrates what this looks like in Forest GUMP. Having chosen a dataset, the hyperparameters necessary for the learning procedure of the Random Forest have to be specified next (see Figure 8). The inputs are the following:


Additionally, the user can decide to eliminate infeasible paths, as this can strongly reduce the size of the ADDs (see Section 4). While the predicate order is fixed by default, the user can decide to let Forest GUMP optimize the predicate order, as the order can also greatly impact the size of the ADDs. A more in-depth discussion of the interplay between infeasible path elimination and the predicate order will follow.

<sup>3</sup> One can generate a random seed by clicking on the button next to the input field.


Fig. 8. The user has to specify the necessary hyperparameters to be able to learn a Random Forest. While the first three hyperparameters are needed for the learning procedure, the elimination of the infeasible paths and the optimization of the predicate order are specific to our aggregation method.

Figure 9 shows a Random Forest that was learned on the Iris dataset, consisting of 20 trees<sup>4</sup>, a bagging size of 100%, and 58 as the seed. If we now want to classify a given input, we would have to traverse each tree from the root to a leaf and receive one predicted class per tree. The class that is predicted most often is the final result. Trying to understand why the Random Forest predicted this specific class is seemingly impossible. In the following we will show how we can do better.

#### 6.2 Model Explanation Problem

The canonical white-box model corresponding to the Random Forest of Figure 9 can be constructed through the most frequent label abstraction (see Sect. 5) of the aggregated Random Forest (see Sect. 3), whose infeasible paths are eliminated (see Sect. 4). This solves the Model Explanation Problem.

Figure 10 sketches the result of this construction: A canonical white-box model with 310 nodes. Admittedly, this model is still frightening, but given a sample, it allows one to easily follow the corresponding classification process, and in this case it may require at most 19 individual decisions based on the petal

<sup>4</sup> Note that each decision tree is represented as an ADD.

Fig. 9. A Random Forest consisting of 20 individual decision trees (191 nodes, longest path consists of 9 nodes). Note that each decision tree is represented as an ADD and that all ADDs share common subfunctions, i.e. it is essentially a shared ADD forest. The actual Random Forest, where nothing is shared, contains 284 nodes.

Fig. 10. An extract of the model explanation. The ADD is constructed from the most frequent label abstraction of the aggregated Random Forest following an elimination of all infeasible paths (310 nodes, longest path of length 19; the highlighted path has a length of 9).

and sepal characteristics. This decision set is our set of predicates. The conjunction of these predicates is a solution to the Outcome Explanation Problem. However, more concise explanations are derived from the class characterization BDD discussed in the following section.

Given the sample petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5, the outcome explanation given by the model explanation consists of the following 9 predicates (in Figure 10 satisfied predicates are highlighted in green, unsatisfied predicates are highlighted in red):

```
¬(petalwidth < 0.75) ∧ ¬(petalwidth < 1.7) ∧ (petallength < 4.95) ∧
 (sepalwidth < 2.65) ∧ (petallength < 4.85) ∧ (sepallength < 5.95) ∧
¬(petalwidth < 1.75) ∧ (petallength < 2.6) ∧ (petallength < 2.45)
```
Fig. 11. The class characterization for the class Iris-Setosa (10 nodes, the highlighted path is also the longest path with length 5). The leaf corresponding to Iris-Setosa is highlighted in green, the leaf representing all other classes (i.e. Iris-Virginica and Iris-Versicolor) is highlighted in red.

While this is already an improvement compared to the Random Forest, where one would have to traverse all 20 decision trees, we will see in the following how we can improve even further.

#### 6.3 Class Characterization Problem

The class characterization problem is particularly interesting because it allows one to 'reverse' the classification process. While the direct problem is 'given a sample, provide its classification', the reverse problem sounds 'given a class, what are the characteristics of all the samples belonging to this class?'

BDD-based Class Characterisation can be defined via the following simple transformation function: Given a class c ∈ C, we define a corresponding projection function δB(c) : C → B on the co-domain as

$$\delta\_B(c)(c') := \begin{cases} 1 & \text{if } c' = c \\ 0 & \text{otherwise.} \end{cases}$$

for c <sup>0</sup> ∈ C. Again, the function δB(c) can be lifted to operate on ADDs, yielding ∆B(c) : D<sup>C</sup> → DB.

The BDD shown in Figure 11 is a minimal characterization of the set of all the samples that are guaranteed to be classified as Iris-Setosa.

Fig. 12. The outcome explanation for the input petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5 (10 nodes, hightlighted path of length 5).

Being able to reverse a learned classification function has a major practical importance. Think, e.g., of a marketing research scenario where data have been collected with the aim to propose bestfitting product offers to customers according to their user profile. This scenario can be considered as a classification problem where the offered product plays the role of the class. Now, being able to reverse the customer → product classification function provides the marketing team with a tailored product → customer promotion process: for a given product, it addresses all customers considered to favor this very product as in the corresponding patent [18].

The path highlighted in Figure 11 is the path from the root to the leaf for the same sample petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5. Compared to the path with length 9 in the model explanation, we now have a path of length 5 with the following predicates:

¬(petalwidth < 0.75) ∧ (petallength < 4.95) ∧ (petallength < 4.85) ∧ (petallength < 2.6) ∧ (petallength < 2.45)

#### 6.4 Outcome Explanation Problem

The previous classification formula expresses the collection of 'conditions' that this sample satisfies, and it provides therefore a precise justification why it is classified in this class. Despite the fact that the class characterization BDD is canonical, it is easy to see that there are some redundancies in the formula. For example, a petallength < 2.45 is also inherently smaller than 2.6, 4.85 and 4.95; therefore, for this specific sample those three predicates are redundant. This is the result of the imposed predicate ordering in BDDs: all the BDD predicates are listed, and they are listed in a fixed order. After eliminating these redundancies, we are left with the following precise minimal outcome explanation: this sample is recognized as belonging to the class Iris-Setosa because it has the properties ¬(petalwidth < 0.75) ∧ (petallength < 2.45).

In Forest GUMP we make these redundant predicates explicit by highlighting them in blue (see Figure 12). From 9 predicates in the model explanation to 5 predicates in the class characterization, we have now arrived at an explanation that only consists of 2 predicates.

### 7 Lessons Learned

Playing with Forest GUMP led to interesting observations not only concerning the analyzed data domains but also concerning Random Forest Learning and the applied ADD technology.

Random Forest Learning. Changing the random seed for the learning process had a significant impact on the size of the explanation models and the class characterizations. The observed sizes of the explanation models ranged from 138 to 519. Interesting was that the larger sizes did not necessarily imply a better prediction quality. The same also applied to the class characterizations. In fact, we observed a 100% prediction quality for a class characterization of only 3 nodes, while a class characterization for the same species with 40 nodes only scored 33% prediction.

Analyzed Data Domain. The class characterizations for the three iris species differed quite a bit. For two species the observed sizes were much bigger than the sizes of the third species, independently of the chosen random seed and bagging size. In fact, for Iris-Setosa we observed a class characterization with only 3 nodes implying an outcome explanation for our chosen sample with only one predicate. Figure 13 serves for the corresponding explanation. Put it differently, class characterizations seem to be good indications for 'tightness': The closer the samples lie the more criteria are required for separation.

ADD Technology. ADDs are canonical as soon as one has chosen a predicate/variable ordering. Although we could observe the effect of corresponding optimization heuristics<sup>5</sup> , the impact was moderate and helpful mainly for model explanation and class characterization. Figure 14 shows the the outcome explanation for the same problem but where the ADD, representing the class characterization for the class Iris-Setosa, is reordered.<sup>6</sup> While the reordering

<sup>5</sup> CUDD [25] provides a number of heuristics for optimizing variable orders.

<sup>6</sup> The used reordering method is named CUDD REORDER GROUP SIFT CONV as it was both, fast and effective, in our experiments.

Fig. 13. Visualization of the iris dataset using only the petal length and petal width.

reduces the class characterization size from 10 to 8 nodes, the length of the outcome explanation is unchanged. For the model explanation of Figure 10, the size can be reduced from 310 nodes to 196 nodes while the path for the sample petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5 actually increased by 1 (from 9 to 10). Thus the outcome explanation may even be impaired. This is not too surprising as these optimizations aim a size reduction and not depth reduction of the considered ADDs. We are currently investigating good heuristics for depth reduction.

More striking was the impact of infeasible path elimination. In fact, this optimization can be regarded key for scalability when increasing the forest size. [14] reports results about forests with 10.000 trees. Without infeasible path reduction already 100 trees are problematic.

Standard ADD frameworks work on Boolean variables rather than predicates. Thus in their setting infeasible paths do not occur. The problem of infeasible path reduction in ADDs was first discussed in [13,14]. Our current corresponding solution is still basic. We are currently generalizing our solution using more involved SMT technology.

Of course, these observations where made on rather small datasets and it has to be seen how well they tranfer to more complex scenarios. We believe, however, that they indicate general phenomena whose essence remains true in larger setting.

#### 8 Conclusion and Perspectives

We have presented Forest GUMP (for Generalized, Unifying Merge Process) a tool for providing tangible experience with three concepts of explanation: model

Fig. 14. The outcome explanation for the input petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5 (8 nodes, highlighted path of length 5) where the class characterization from Figure 11 is reordered.

explanation, outcome explanation, and class characterization. Key technology to achieve model explanation is algebraic aggregation, i.e. the transformation of a Random Forest into a semantically equivalent, concise white-box representation in terms of Algebraic Decision Diagrams. Class characterization is then achieved in terms of BDDs where the structure unnecessary to distinguish the considered class is collapsed. This abstraction is not only interesting in itself to better understand how easily the classes can be separated, but it also leads to highly optimized outcome explanations. Together with infeasible path elimination and the suppression of redundant predicates on a path, we observe reductions of outcome explanations by more than an order of magnitude. Forest GUMP allows even newcomers to easily experience these phenomena without much training.

Of course, these are first steps in a very ambitious new direction and it has to be seen how far the approach carries. Scalability will probably require decomposition methods, perhaps in a similar fashion as illustrated by the difference between model explanation and the considerably smaller class characterization. More work is needed also on techniques that aim at limiting the number of involved predicates.

Data Availability Statement: The artifact is available in the Zenodo repository [22].

#### References


national Conference on Computer Aided Design (ICCAD). pp. 188–191 (1993). https://doi.org/10.1109/ICCAD.1993.580054


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Alpinist: an Annotation-Aware GPU Program Optimizer?

Omer S¸akar ¨ <sup>1</sup> () , Mohsen Safari<sup>1</sup> , Marieke Huisman<sup>1</sup> , and Anton Wijs<sup>2</sup>

<sup>1</sup> Formal Methods and Tools, University of Twente, Enschede, The Netherlands {o.f.o.sakar,m.safari,m.huisman}@utwente.nl

<sup>2</sup> Software Engineering & Technology, Eindhoven University of Technology, Eindhoven, The Netherlands

a.j.wijs@tue.nl

Abstract. GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when optimizing GPU programs is labor-intensive and error-prone.

This paper introduces Alpinist, an annotation-aware GPU program optimizer. It applies frequently-used GPU optimizations, but besides transforming code, it also transforms the annotations. We evaluate Alpinist, in combination with the VerCors program verifier, to automatically optimize a collection of verified programs and reverify them.

Keywords: GPU · Optimization · Deductive verification · Annotationaware · Program transformation

### 1 Introduction

Over the course of roughly a decade, graphics processing units (GPUs) have been pushing the computational limits in fields as diverse as computational biology [64], statistics [35], physics [7], astronomy [24], deep learning [29], and formal methods [17,43,44,65,67]. Dedicated programming languages such as CUDA [34] and OpenCL [42] can be used to write GPU source code. To achieve the most performance out of GPUs, developer should apply incremental optimizations, tailored to the GPU architecture. Unfortunately, this is to a large extent a manual activity. The fact that for different GPU devices, the same code tends to require a different sequence of transformations [21] makes this procedure even more time consuming and error-prone. Recently, automating this has received some attention, for instance by applying machine learning [3].

<sup>?</sup> This work is supported by NWO grant 639.023.710 for the Mercedes project and by NWO TTW grant 17249 for the ChEOPS project

Fig. 1: Annotation-Aware Program Transformation.

Reasoning about the correctness of GPU software is hard, but necessary. Multiple verification techniques and tools have been developed to aid in this task aimed at detecting data races, see [8, 10, 14, 32, 33], and for a recent overview, see [22]. Some of these techniques apply deductive program verification, which requires a program to be manually augmented with pre- and postcondition annotations. However, annotating a program is time consuming. The more complex a program is, the more challenging it becomes to annotate it. In particular, as a program is being optimized repeatedly, its annotations tend to change frequently.

This paper presents Alpinist, a tool that can apply annotation-aware transformations [26] on annotated GPU programs. It can be used with the deductive program verifier VerCors [9]. VerCors can verify the functional correctness of GPU programs [10]. It allows the verification of many typical GPU computations, see e.g., [48,50,51]. The purpose of Alpinist is twofold (see Fig. 1): First, it automates the optimization of GPU code, to the extent that the developer needs to indicate which optimization needs to be applied where, and the tool performs the transformation. Interestingly, the presence of annotations is exploited by Alpinist to determine whether an optimization is actually applicable, and in doing so, can sometimes apply an optimization where a compiler cannot. Second, as it applies a code transformation, it also transforms the related annotations, which means that once the developer has annotated the unoptimized, simpler code, any further optimized version of that code is automatically annotated with updated pre- and postconditions, making it reverifiable. This avoids having to re-annotate the program every time it is optimized for a specific GPU device.

Alpinist supports GPU code optimizations that are used frequently in practice, namely loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching. In the current paper, we discuss how Alpinist has been implemented, how it can be applied on annotated GPU code, and how some of the more complex optimizations work. In addition, we evaluate the effect of applying several of these optimizations, both in terms of annotation size and time needed to verify a program, to a collection of examples including the verified case studies in [48, 49, 51].

Outline. Section 2 demonstrates how Alpinist optimizes a verified GPU program while preserving its provability. Section 3 discusses the architecture of Alpinist. Section 4 discusses the most complex optimizations supported by

```
1 /*@ context_everywhere N > 0 && N < a.length;
 2 req (\forall* int i; 0 <= i < a.length; Perm(a[i], 1));
 3 ens (\forall* int i; 0 <= i < a.length; i != a.length-1 ==> Perm(a[i+1], 1));
 4 ens (\forall* int i; 0 <= i < a.length; i == a.length-1 ==> Perm(a[0], 1));
 5 ens (\forall int i; 0 <= i < a.length-1; a[i+1] == N*i);
 6 ens a[0] == N*(a.length-1); @*/
 7 void Host(int[] a, int size, int N) {
 8 par Kernel1 (int tid = 0 .. a.length)
 9 /*@ context Perm(a[tid], 1);
10 ens a[tid] == 0; @*/
11 { a[tid] = 0; }
12 par Kernel2 (int tid = 0 .. a.length)
13 /*@ context tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
14 req tid != a.length-1 ? a[tid+1] == 0 : a[0] == 0;
15 ens tid != a.length-1 ? a[tid+1] == N*tid : a[0] == N*tid; @*/
16 { /*@ inv k >= 0 && k <= N;
17 inv tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
18 inv tid != a.length-1 ? a[tid+1] == k*tid : a[0] == k*tid;@*/
19 for(int k = 0; k < N; k++) {
20 if (tid != a.length-1) { a[tid+1] = a[tid+1] + tid; }
21 else { a[0] = a[0] + tid; }
22 } } }
```
Fig. 2: A verified GPU-style program

Alpinist in detail, namely loop unrolling, tiling and kernel fusion, and briefly discusses the remaining three. Section 5 presents the results of experiments in which the tool has been applied on a collection of programs. Section 6 discusses related work and Section 7 concludes the paper, and discusses future work.

# 2 Annotation-Aware Optimization using Alpinist

This section illustrates how Alpinist can optimize a verified GPU program while preserving its provability. Fig. 2 shows a GPU program with annotations [10] that is verified by VerCors. The example is written in a simplified version of VerCors' own language PVL. The program initializes an array a, and subsequently updates the values in a, N times. The workflow of a GPU program in general is that the host (i.e., CPU) invokes a kernel, i.e., a GPU function, executed by a specified number of GPU threads. These threads are organized in one or more thread blocks. In this program, there are two kernels, both executed by one thread block of a.length threads (lines 8 and 12 (l.8, l.12))<sup>3</sup> . Each thread has a unique identifier, in the example called tid. In the first kernel (l.8-l.11), each thread initializes a[tid] to 0. In the second kernel (l.12-l.22), each thread updates a[tid+1] (modulo a.length) N times, by adding tid to it. In the main Host function, Kernel1 is called, followed by Kernel2.

The kernels, the for-loop and the host function are annotated for verification (in blue), using permission-based separation logic [6,11,12]. Permissions capture which memory locations may be accessed by which threads; they are fractional values in the interval (0, 1] (cf. Boyland [12]): any fraction in the interval (0,

<sup>3</sup> In practice, the size of a block cannot exceed a specific upper-bound, but for this example, we assume that a.length is sufficiently small.

```
1 /*@ context_everywhere N > 0 && N < a.length;
2 req (\forall* int i; 0 <= i < a.length; Perm(a[i], 1));
3 ens (\forall* int i; 0 <= i < a.length; i != a.length-1 ==> Perm(a[i+1], 1));
4 ens (\forall* int i; 0 <= i < a.length; i == a.length-1 ==> Perm(a[0], 1));
5 ens (\forall int i; 0 <= i < a.length-1; a[i+1] == N*i);
6 ens a[0] == N*(a.length-1); @*/
7 void Host(int[] a,int size,int N){
8 par Fused_Kernel(int tid = 0 .. a.length)
9 /*@ req Perm(a[tid], 1);
10 ens tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
11 ens tid != a.length-1 ? a[tid+1] == N*tid : a[0] == N*tid; @*/
12 {
13 a[tid] = 0;
14 /*@ req Perm(a[tid], 1);
15 req a[tid] == 0;
16 ens tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
17 ens tid != a.length-1 ? a[tid+1] == 0 : a[0] == 0; @*/
18 barrier(Fused_Kernel)
19
20 int a_reg_0, a_reg_1;
21 if (tid != a.length-1) { a_reg_1 = a[tid+1] } else { a_reg_0 = a[0] }
22 int k = 0;
23 if (tid != a.length-1) { a_reg_1 = a_reg_1 + tid; }
24 else { a_reg_0 = a_reg_0 + tid; }
25 k ++;
26 /*@ inv k >= 0 + 1 && k <= N;
27 inv tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
28 inv tid != a.length-1 ? a reg 1 == k*tid : a reg 0 == k*tid; @*/
29 for(k; k < N; k++) {
30 if (tid != a.length-1) { a_reg_1 = a_reg_1 + tid; }
31 else { a_reg_0 = a_reg_0 + tid; }
32 }
33 if (tid != a.length-1) { a[tid+1] = a_reg_1 } else { a[0] = a_reg_0 };
34 } }
```
Fig. 3: An optimized GPU-style program, annotated for verification

1) indicates a read permission, while 1 indicates a write permission. A write permission can be split into multiple read permissions and read permissions can be added up, and transformed into a write permission if they add up to 1. The soundness of the logic ensures that for each memory location, the total number of permissions among all threads does not exceed 1.

To specify permissions, predicates are used of the form Perm(L, π) where L is a heap location and π a fractional value in the interval (0, 1] (e.g., 1\3). Preand postconditions, denoted by keywords req and ens, should hold at the beginning and the end of an annotated function, respectively. The keyword context abbreviates both req and ens (l.9, l.13). The keyword context everywhere is used to specify a property that must hold throughout the function (l.1). Note that \forall\* is used to express a universal separating conjunction over permission predicates (l.2-l.4) and \forall is used as standard universal conjunction over logical predicates (l.5). For logical conjunction, && is used and ∗∗ is used as separating conjunction in separation logic.

In the example, write permissions are required for all locations in a (l.2). The pre- and postconditions of the first kernel specify that each thread needs write permission for a[tid] (l.9). The postcondition states that a[tid] is set to 0 (l.10). In the second kernel, all threads have write permission for a[tid+1], except thread a.length-1 which has write permission for a[0] (l.13). Moreover, it is required that a[tid+1] (modulo a.length) is 0 (l.14). For the for-loop (l.19 l.22), loop invariants are specified: k is in the range [0, N] (l.16), each thread has write permission for a[tid+1] (modulo a.length) (l.17) and this location always has the value k\*tid (l.18). The postconditions of the second kernel and the host function are similar to this latter invariant.

Fig. 3 shows an optimized version of the program, with updated annotations to make it verifiable. Alpinist has applied three optimizations:


To preserve provability of the optimized program, Alpinist changed the annotations, in particular the pre- and postcondition of the fused kernel and the loop invariants (highlighted in Fig. 3). Moreover, Alpinist introduced an annotated barrier (l.14-l.18). Since threads synchronize at a barrier, it is possible to redistribute the permissions. In the rest of the paper, we discuss how Alpinist performs these annotation-aware transformations.

# 3 The Design of Alpinist

This section gives a high-level overview of the design of Alpinist. The optimizations supported by Alpinist are discussed in Section 4. To understand the design of Alpinist, we first explain the architecture of the VerCors verifier.

#### 3.1 VerCors' Architecture

VerCors is a deductive program verifier, which is designed to work for different input languages (e.g., Java and OpenCL). It takes as input an annotated program, which is then transformed in several steps into an annotated Silver program. Silver is an intermediate verification language, used as input for Viper [37, 60]. Viper then generates proof obligations, which can be discharged by an automated theorem prover, such as Z3 [36].

The internal transformations in VerCors are defined over our internal AST representation (written in the Common Object Language or COL [52]), which captures the features of all input languages. Some of the transformations are generic (e.g., splitting composite variable declarations) and others are specific to verification (e.g., transforming contracts). The transformations implemented as part of Alpinist are also applied on the COL AST, but they are developed with a different goal in mind, and in particular several of the transformation are specific to the supported optimizations.

Using VerCors and its architecture to implement Alpinist gives us some benefits. First, existing helper functions can be reused, which simplifies tasks such as gathering information regarding specific AST nodes. Second, some generic transformations of VerCors can be reused, such as splitting composite variable declarations or simplifying expressions. This helps to simplify the implementation of the optimizations. Third, using the architecture of VerCors allows us to prove assertions that we generate relatively easily by invoking VerCors internally.

#### 3.2 Alpinist's Architecture

Alpinist takes a verified file as its input, annotated with special optimization annotations that indicate where specific optimizations should be applied. Alpinist is written in Java and Scala and runs on Windows, Linux and macOS. Fig. 4 gives a high-level overview of the internal design of Alpinist. The input program goes through four phases: the parsing phase, the applicability checking phase, the transformation phase and the output phase.

The parsing phase transforms the input file into a COL AST, after which the applicability checking phase checks if the optimization can be applied. Some optimizations, such as tiling (see Section 4.2), are always applicable, hence their applicability check always passes. For other optimizations, prerequisites must be established. Sometimes, a syntactical analysis of the AST suffices, e.g., kernel fusion (see Section 4.3). For this optimization, it must be determined whether there is any data dependency between two selected kernels. When analysis of the AST is not enough, VerCors can be used to perform more complex reasoning. An example of this is loop unrolling (see Section 4.1). Its prerequisite is that for the loop to be unrollable k times, it is guaranteed that the loop executes at least k times. This prerequisite is encoded as an assertion to be proven by VerCors.

The applicability checking phase is one of the strengths of Alpinist. It exploits the fact that the input program is annotated to determine whether an optimization is applicable, and relies on the fact that VerCors can perform complex reasoning. Moreover, this approach allows to distinguish failure due to unsatisfied prerequisites and due to mistakes in the transformation procedure.

Fig. 4: The internal design of Alpinist.

If the applicability check passes (i.e., the optimization is applicable), the transformation phase is next, otherwise a message is generated that the prerequisites could not be proven.

The transformation phase applies the optimizations to the input AST. The output phase either prints the optimized program in the same language as the input program, or a message is printed, signifying either a failure in optimizing or a verification failure in the applicability checking phase.

# 4 GPU Optimizations

Alpinist supports six frequently-used GPU optimizations, namely loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching. This section discusses loop unrolling, tiling, and kernel fusion in detail. The other optimizations follow the same approach in spirit and are discussed briefly, which can be found in the Alpinist implementation [16]. Each optimization is introduced in the context of GPU programs. Then, we discuss how to apply them. Interesting insights are discussed where relevant.

#### 4.1 Loop Unrolling

Loop unrolling is a frequently-used optimization technique that is applicable to both GPU and CPU programs. It unrolls some iterations of a loop, which increases the code size, but can have a positive impact on program performance; e.g., see [21, 38, 46, 59, 63] for its impact, specifically on GPU programs. Fig. 5 shows an example of unrolling an (annotated) loop twice: the body of the loop is duplicated twice before the loop. This has the following effect on the annotations: the loop invariant bounding the loop variable (l.5) changes in the optimized program (l.14). Note that the other loop invariants (i.e., Inv(i)) remain the same. Moreover, after each unrolling part, we add all invariants as assertions (l.8-l.10) except after the last unroll. This captures that the code produced by unrolling the loop should still satisfy the original loop invariants.

Our approach to loop unrolling is more general than optimization techniques during compilation. For instance, the unroll pragma in CUDA [55] and the unroll function in Halide [56] unroll loops by calculating the number of iterations to see if unrolling is possible, i.e., it should be computable at compile time. This difference is illustrated in Fig. 5 where N (i.e., the number of iterations) is unknown at compile time. Their approach cannot automatically handle this

Fig. 5: An example of unrolling a loop 2 times.

} }

```
1 void Host(int[] array, int size){
2 par Kernel(tid=0..size){
3 int i = init; // The loop variable
4
       .
       .
       .
5 //@ assert (i == a) || (i == b); // Depending on initialization of i only one
6 // of the conditions is specified
7 /*@ inv i >= a && i <= b; // The lowerbound of i (a), The upperbound of i (b)
8 inv Inv(i); @*/ // Additional loop invariants
9 loop (cond(i)) { // The loop condition
10 body(i); // The loop body, a sequence of statements in the ith iteration.
11 i = upd(i); } // The update function of i, restricted to (i + c), (i − c),
12 } } // (i × c) or (i/c) where c is a positive integer constant4
                                                                      .
```
Fig. 6: A general template of a loop inside a kernel.

case, while our approach can automatically unroll the loop, since annotations (l.1, l.6) specify the lower-bound of N (provided by the programmer, who knows that this is a valid lower-bound). VerCors verifies that the unrolling is valid.

Fig. 6 shows a loop template in a verified GPU program. We would like to automatically unroll the loop k times and preserve the provability of the program. To accomplish this, we follow a procedure consisting of three parts: the main, checking and updating part. In the main part, an annotated (verified) GPU program and positive k are given as input. Next we go to the checking part, to see if it is possible to unroll the loop k times. This part corresponds with the applicability checking phase. Thus, we statically calculate the number of loop iterations, by counting how many times the condition (cond(i)) holds starting from either a (as the lowerbound of i) or b (as the upperbound of i), depending on the operation of upd(i). If k is greater than the total number of loop iterations at the end of the checking part, then we report an error. Otherwise

 If c was negative, for the multiplication and division, i would oscillate between positive and negative values and hence would not always be useful as array index. Hence we consider c to be positive.

Fig. 7: Inter- and intra-tiling of an array as T = 12, N = 4 and dT/Ne = 3.

```
1 void Host(int[] a, int T){
2 par Kernel(tid = 0..T)
3 /*@ // Preconditions related to permissions and functional correctness
4 req prePerm(a[tid]) ** preFunc(a[tid]);
5 // Postconditions related to permissions and functional correctness
6 ens postPerm(a[tid]) ** postFunc(a[tid]); @*/
7 { body(a[tid]); } }
```
Fig. 8: A general unoptimized GPU program to apply for tiling.

we go to the updating part, in which we update either a or b according to the operation in upd(i). If the operation is addition or multiplication, then the loop variable i (in the unoptimized program) goes from a to b. That means, after unrolling, a should be updated according to the constant c from the update expression and k. If the operation is subtraction or division, i goes from b to a. Thus, after unrolling, b should be updated. After the updating part, we return to the main part to unroll the loop k times.

#### 4.2 Tiling

Tiling is another well-known optimization technique for GPU programs. It increases the workload of the threads to fully utilize GPU resources by assigning more data to each thread. Concretely, we assume there are T threads and a onedimensional array of size T in the unoptimized GPU program where each thread is responsible for one location in that array (Fig. 8). To apply the optimization, we first divide the array into dT/Ne chunks, each of size N (1 ≤ N ≤ T) 5 . There are two different ways to create and assign threads to array cells (as in Fig. 7):


Both forms of tiling can have a positive impact on GPU program performance; e.g., see [25, 28, 47, 69] for the impact of this optimization.

Fig. 9 shows the optimized version of Fig. 8 by applying inter-tiling. Regarding program optimization, two major changes happen: 1) the total number of threads has reduced (l.2), and 2) the body is encapsulated inside a loop (l.16 l.18). As mentioned, in inter-tiling, we define N threads instead of T. The number

<sup>5</sup> Since N is in the range 1 ≤ N ≤ T, the last chunk might have fewer cells.

```
1 void Host(int[] a, int T){
2 par Kernel(tid = 0..N)
3 /*@ req (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
4 pre(a[tid+i×N]));
5 ens (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
6 post(a[tid+i×N])); @*/
7 {
8 int j = 0;
9 /*@ inv j >= 0 && j <= ceiling(T, N);
10 inv (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
11 prePerm(a[tid+i×N]));
12 inv (\forall int i; j <= i && i < ceiling(T, N) && tid+i×N < T;
13 preFunc(a[tid+i×N]));
14 inv (\forall* int i; 0 <= i && i < j && tid+i×N < T;
15 postFunc(a[tid+i×N])); @*/
16 loop (tid+j×N < T){
17 body(a[tid+j×N]);
18 j = j + 1; }
19 } }
```
Fig. 9: Optimized version of the GPU program of Fig. 8 after applying inter-tiling.

of chunks is indicated by the function ceiling(T, N). Each thread in the newly added loop iterates over all chunks (in the range 0 to ceiling(T, N)-1) and is responsible for a specific location in each of them. This is achieved by the loop variable j and the loop condition tid+j×N < T: each thread tid accesses its own location at index tid in each chunk. To preserve verifiability, we add invariants to the loop (l.9–l.15). In particular, we specify:


Moreover, we modify the specification of the kernel (l.3–l.6). Note that we have the condition tid+i×N < T in all universally quantified invariants, because the last chunk might have fewer cells than N. We quantified the pre- and postconditions of the kernel over the chunks in the same way as the invariants.

Intra-tiling is in essence similar to inter-tiling with two major differences: 1) the total number of threads is ceiling(T, N), and 2) each thread in the loop iterates over cells within its own chunk. Therefore, we have different conditions in the loop and the quantified invariants. Alpinist also supports this.

Above, each thread is assigned to one cell. This can easily be generalized to have each thread assigned to one or more consecutive cells (i.e., a task). A similar procedure can be applied as long as the tasks do not overlap, i.e., each cell is assigned to at most one thread.

#### 4.3 Kernel Fusion

Kernel fusion is a GPU optimization in which we merge two or more consecutive kernels into one. It increases the potential to use thread-local registers to store intermediate results (see Section 2) and can lead to lower power consumption. See [2, 19, 61, 62, 68] for the impact of kernel fusion on GPU programs. We provide a generalized procedure to fuse an arbitrary number of consecutive kernels while considering data dependencies between them. The idea is to fuse them by repeatedly fusing the first two kernels (i.e., kernel reduction). In each iteration, if there is no data dependency between the two kernels, we safely fuse them. Otherwise, if there is only one thread block, we fuse the two kernels by inserting a barrier between their bodies; if there are multiple thread blocks, fusion fails.
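The reduction loop can be sketched as follows (a hypothetical simplification, not Alpinist's implementation: kernels are represented by their body strings, and deps[i] records whether kernel i+1 depends on data written before it):

```java
import java.util.ArrayList;
import java.util.List;

public class KernelFusion {
    // Repeatedly fuse the first two kernels; stop at the first dependency
    // that cannot be resolved by a barrier (i.e., multiple thread blocks).
    static List<String> fuse(List<String> kernels, boolean[] deps, boolean singleBlock) {
        StringBuilder fused = new StringBuilder(kernels.get(0));
        int i = 1;
        for (; i < kernels.size(); i++) {
            if (!deps[i - 1]) {
                fused.append("; ").append(kernels.get(i));            // independent: plain fusion
            } else if (singleBlock) {
                fused.append("; barrier(); ").append(kernels.get(i)); // dependent, one block
            } else {
                break;                                                // dependent, multi-block: stop
            }
        }
        List<String> out = new ArrayList<>();
        out.add(fused.toString());
        out.addAll(kernels.subList(i, kernels.size()));               // remaining unfused kernels
        return out;
    }

    public static void main(String[] args) {
        // k3 depends on k1/k2, and there are multiple blocks: the output is
        // the fused prefix "k1; k2" followed by the unfused k3.
        List<String> r = fuse(List.of("k1", "k2", "k3"), new boolean[]{false, true}, false);
        System.out.println(r);  // [k1; k2, k3]
    }
}
```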

A benefit of this approach is that it only considers two kernels at a time. In this way, it can be determined whether a barrier is necessary between two specific kernels, and we do not miss any possible fusion optimization. Another benefit of this approach is that when a data dependency between two kernels P and P + 1 (1 < P < #kernels−1) is detected, the output of the approach is the fusion of the first P kernels, and the remaining unfused kernels after P. This allows the user to not only find out that there is a data dependency between P and P + 1, but also to obtain fused kernels where possible.

There are multiple challenges in this transformation: (1) how to detect data dependency between two kernels? (2) how to collect the pre- and postconditions for the fused kernel? and (3) how to deal with permissions so that in the fused kernel the permission for a location does not exceed 1? The main difficulty in addressing these challenges is that we have to consider many different possible scenarios. Fortunately, we can use the information from the contract of the two kernels. The permission patterns in the contract indicate for each thread which locations it reads from and writes to. We provide procedures to separately collect pre- and postconditions related to permissions and to functional correctness. Due to space limitations, we only discuss the essential steps to collect the precondition related to permissions for array accesses of the fused kernel in Alg. 1. Collecting the rest of the contract uses a similar procedure.

Alg. 1 requires kernels k1 and k2 to not lose any permissions, only possibly redistribute them (using a barrier). Furthermore, for ease of presentation, we assume that in both k1 and k2, each thread accesses at most one cell of array a, and that the expressions used to compute array indices only combine constants and thread ID variables, using standard arithmetic operators.

We compare the postcondition of k1 and the precondition of k2 (l.2) to understand how to add permissions of the preconditions of k1 and k2 to the precondition of the fused kernel. Note that prePerm and postPerm correspond to a permission-related pre- and postcondition, respectively. We use the postcondition of k1 for this comparison since the permission at the end of k1 needs to be sufficient to satisfy the precondition of k2. If the index expressions e1 and e2 to access an array a are syntactically the same, then they refer to the same array cell. In that case, we first add to the precondition of the fused kernel the original permission from the precondition of k1 that corresponds to the permission for a[e1] in the postcondition of k1 (remember that the latter permission may have been obtained in k1 after permission redistribution). Second, if p1 is not sufficient for the precondition of k2 (l.5), we add additional permission to the precondition of the fused kernel to satisfy the precondition of k2 (l.6).

#### Algorithm 1 Kernel fusion procedure for collecting precondition permissions.

The remaining different cases in the algorithm correspond to the different edge cases that we should consider when e1 and e2 are not syntactically the same. In particular, data dependency happens when the accumulated permission (in both kernels) for one location is greater than 1, and there is at least one write permission. Therefore, we have to distinguish multiple cases: 1) p1+p2 does not exceed 1 (l.8), 2) p1 + p2 exceeds 1, but no write permission is involved (l.10), or 3) and 4) at least one write is involved (l.13 and l.15). In the latter two cases, a barrier must be introduced to take care of distributing permissions from the access in k1 to the access in k2, and possibly additional permission for the latter must be added to the precondition of the fused kernel (l.17). After constructing the contract of the fused kernel, we check for data dependency.
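The case split can be sketched as follows (a hypothetical simplification of Alg. 1's cases, not its actual code: permissions are fractions in (0, 1], with 1 denoting write permission, as is standard for fractional permissions):

```java
public class PermissionCases {
    // Returns which of the enumerated cases applies to the accumulated
    // permissions p1 (from k1) and p2 (from k2) for one location.
    static int caseOf(double p1, double p2) {
        boolean anyWrite = p1 == 1.0 || p2 == 1.0;
        if (p1 + p2 <= 1.0) return 1;  // case 1 (l.8): combined permission fits
        if (!anyWrite) return 2;       // case 2 (l.10): exceeds 1, but reads only
        return 3;                      // cases 3/4 (l.13, l.15): a write is involved,
                                       // so a barrier must redistribute permissions
    }

    public static void main(String[] args) {
        System.out.println(caseOf(0.5, 0.5));   // 1
        System.out.println(caseOf(0.75, 0.75)); // 2
        System.out.println(caseOf(1.0, 0.5));   // 3
    }
}
```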

Fig. 10 shows an example of fusing two kernels. We only present the permission precondition expressions which are collected with Alg. 1. There are two shared arrays a and b. To collect permission preconditions in the fused kernel, we follow steps {l.2→l.3→l.4} for array a and steps {l.2→l.3→l.4→l.5→l.6} for array b. As there is no data dependency, we can safely fuse the two kernels.

Implementing Data Dependency Detection. One of the implementation challenges of kernel fusion is to check data dependency in the applicability checking phase. Our idea of detecting kernel dependencies is similar to detecting loop iteration dependencies, see [1]. To detect data dependency for a specific shared array, the function SV is used. Fig. 11 shows an example of the output of SV. The kernel has 1\2 permission for a[tid+1] and 1\3 permission for a[0] if tid+1 is out of bounds. SV takes an array name and the pre- and postconditions of a kernel (of the form cond(tid) => Perm(a[patt(tid)], p)) on l.3-l.6, and returns a mapping from indices patt(tid) to the permissions p (in Fig. 11: right).

Fig. 10: An example of collecting preconditions in fusing two kernels.

Fig. 11: Example output of the SV function for array a.

If the function SV is executed for two kernels to fuse with the same shared array a, the results SV1(a) and SV2(a) can be compared to determine whether there is data dependency between the two kernels. This comparison is described generally at l.8-l.16 in Algorithm 1. For each corresponding location in SV1(a) and SV2(a), we can determine, for example, whether both permissions combined do not exceed 1 (l.8) or whether the location in k1 has write permission (l.12).
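A minimal sketch of this comparison, assuming a simplified rendering of SV's output as a map from index pattern (e.g. "tid+1") to a fractional permission, with 1 denoting write permission (the representation and method names are hypothetical):

```java
import java.util.Map;

public class DependencyCheck {
    // Data dependency: some location carries combined permission above 1
    // with at least one write permission involved.
    static boolean dataDependent(Map<String, Double> sv1, Map<String, Double> sv2) {
        for (Map.Entry<String, Double> e : sv1.entrySet()) {
            Double p2 = sv2.get(e.getKey());
            if (p2 == null) continue;                    // kernels touch different cells
            double p1 = e.getValue();
            boolean anyWrite = p1 == 1.0 || p2 == 1.0;
            if (p1 + p2 > 1.0 && anyWrite) return true;  // combined > 1 with a write
        }
        return false;
    }

    public static void main(String[] args) {
        // k1 writes a[tid], k2 reads a[tid]: a dependency is detected.
        System.out.println(dataDependent(Map.of("tid", 1.0), Map.of("tid", 0.5)));  // true
    }
}
```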

#### 4.4 Other Optimizations

We briefly discuss the three remaining optimizations supported by Alpinist. Iteration merging is an optimization technique related to loop unrolling that is applicable to both GPU and CPU programs<sup>6</sup> . Iteration merging reduces the number of loop iterations by extending the loop body with multiple copies of it, as opposed to creating copies of it outside the loop, as is done in loop unrolling. Iteration merging can have a positive performance impact; see [38,46,53] for the effectiveness of this optimization on GPU programs.

Matrix linearization is an optimization where we transform two-dimensional arrays into one-dimensional ones. This optimization can result in better memory access patterns, thereby improving caching. See [5,13,54] for the impact of matrix linearization on GPU programs.
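The standard row-major index mapping behind this transformation can be illustrated as follows (a sketch with hypothetical helper names):

```java
import java.util.Arrays;

public class MatrixLinearization {
    // Row-major linearization: element (i, j) of an R×C matrix is stored
    // at flat index i*C + j in the one-dimensional array.
    static int linearize(int i, int j, int cols) { return i * cols + j; }

    static int[] flatten(int[][] m) {
        int rows = m.length, cols = m[0].length;
        int[] flat = new int[rows * cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                flat[linearize(i, j, cols)] = m[i][j];
        return flat;
    }

    public static void main(String[] args) {
        int[][] m = {{1, 2, 3}, {4, 5, 6}};
        System.out.println(Arrays.toString(flatten(m)));  // [1, 2, 3, 4, 5, 6]
    }
}
```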

The last optimization implemented in Alpinist is data prefetching. Suppose there is a verified GPU program where each thread accesses an array location in global memory multiple times. In this optimization, we prefetch the values of those locations in global memory into registers, which are local to each thread. A similar optimization, in which intermediate results are stored in register memory, is applied in Section 2. Therefore, instead of multiple accesses to high-latency global memory, we benefit from low-latency registers. Data prefetching can have a positive performance impact; see [4, 58, 70].
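The effect can be sketched in plain Java by counting simulated global-memory reads (all names here are hypothetical illustrations, not Alpinist code):

```java
public class Prefetch {
    static int reads = 0;  // counts simulated global-memory accesses

    static int globalRead(int[] global, int idx) { reads++; return global[idx]; }

    // Without prefetching: every use of the value pays a global-memory access.
    static int noPrefetch(int[] global, int idx) {
        return globalRead(global, idx) * globalRead(global, idx) + globalRead(global, idx);
    }

    // With prefetching: one global read into a register-like local variable;
    // all further uses hit the low-latency copy.
    static int withPrefetch(int[] global, int idx) {
        int reg = globalRead(global, idx);
        return reg * reg + reg;
    }

    public static void main(String[] args) {
        int[] g = {7};
        reads = 0; noPrefetch(g, 0);   System.out.println(reads);  // 3
        reads = 0; withPrefetch(g, 0); System.out.println(reads);  // 1
    }
}
```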

<sup>6</sup> Iteration merging is also referred to as loop unrolling/vectorization in the literature.


Table 1: A summary of the optimization and verification times for all optimizations.

# 5 Evaluation

This section describes the evaluation of Alpinist. The goal is to


#### 5.1 Experiment Setup

Alpinist is evaluated on examples from three different sources. The first source consists of hand-made examples that cover different scenarios for each optimization. The second source is a collection of verified programs from VerCors' example repository<sup>7</sup>. The third source consists of complex case studies that are already verified in VerCors: two parallel prefix sum algorithms [51], parallel stream compaction and summed-area table algorithms [48], a variety of sorting algorithms [49], a solution [27] to the VerifyThis 2019 challenge 1 [18] and a Tic-Tac-Toe example [57] based on [23]. In total, we applied the optimizations 30 times in the first category, 23 times in the second category and 17 times in the third category (70 experiments in total). All the examples are annotated with special optimization annotations such that Alpinist can apply those optimizations automatically. All these examples are publicly available at [15]. All the experiments were conducted on a MacBook Pro 2020 (macOS 11.3.1) with a 2.0GHz Intel Core i5 CPU. Each experiment was performed ten times, and the average optimization and verification times over those runs were recorded.

#### 5.2 Results & Discussion

Q1 To test whether Alpinist works on GPU programs, we applied the six optimizations in all 70 experiments and used VerCors to reverify all the resulting programs. All these tests were successful.

Q2 To investigate how long it takes for Alpinist to transform GPU programs, we recorded the transformation time for each optimization applied to all the examples. Table 1 summarizes the best and worst optimization times for the six optimizations (as reported by Alpinist). To investigate the impact on the verification time, the table also shows the (best and worst) verification times of the original and optimized programs (as reported by VerCors). The table shows the minimum, maximum, average and median times over all examples. It can be observed that Alpinist takes negligible time to apply each optimization to all the examples. Moreover, the verification time after optimizing generally increases. For loop unrolling, tiling and iteration merging, the verification time increases; this can be attributed to the additional code that is generated. For kernel fusion, the verification time decreases, due to verifying fewer kernels. For matrix linearization and data prefetching, the verification time slightly increases; this can be attributed to the linear expressions in matrix linearization and the extra statements to read from/write to the registers in data prefetching.

Q3 To investigate the usability of Alpinist on real-world examples, we successfully applied it to the third category with the complex case studies. Table 2 shows the optimization and verification times of applying loop unrolling, iteration merging, matrix linearization and data prefetching to these case studies. Note that only these four optimizations could be applied in the case studies. In the table, N/A indicates that the optimization is not applicable to the example.

<sup>7</sup> The example repository of VerCors is available at https://github.com/utwente-fmt/vercors/tree/dev/examples.

### 6 Related Work

To the best of our knowledge, this is the first paper to showcase a tool that implements annotation-aware transformations. We categorize the related work into three parts, covering both tools and optimizations.

Automatic Optimizations without Correctness. There is a large body of related work, see e.g., [2, 4, 19, 25, 28, 47, 61, 62, 68–70], that shows the impact of automated optimizations on GPU programs, but does not consider correctness, or the preservation of it. Our tool can potentially complement these approaches by preserving the provability of the optimized programs.

Correctness Proofs for Transformations. Another body of related work focuses on different approaches to preserve provability that are not specific to GPU programs. CompCert [30, 31] is a formally verified C compiler which preserves semantic equivalence of the source and compiled program, by proving correctness of each transformation in the compilation process. Wijs and Engelen [66] and De Putter and Wijs [45] prove the preservation of functional properties over transformations on models of concurrent systems. They prove preservation of model-independent properties. This approach differs from ours as they work on models instead of concrete programs.

Compiler Optimization Correctness. Finally, there is related work that focuses on the compilation of sequential programs, performing transformations from high-level source code to lower-level machine code while preserving the semantics. These approaches neither consider parallelization nor target different architectures. In GPU programming, the optimizations often need to be applied manually rather than during the compilation process.

Namjoshi and Xu [41] use a proof checker to show equivalence between an original WebAssembly program and its optimized version. An equivalence proof is generated based on the transformations. Namjoshi and Singhania [40] created a semi-automatic loop optimizer with user directives. The loops are verified during compilation. For each transformation, semantics are defined to guarantee semantic equivalence to the original program. Namjoshi and Pavlinovic [39] focus on recovering from precision loss due to semantics-preserving program transformations and propose systematic approaches to simplify analysis of the transformed program. Finally, Gjomemo et al. [20] help compiler optimizations by supplying high-level information gathered by external static analysis (e.g., Frama-C). This information is used by the compiler for better reasoning.

# 7 Conclusion

In this paper, we presented Alpinist, the annotation-aware GPU program optimizer. Given an unoptimized, annotated GPU program, we showed how Alpinist transforms both the code and the annotations, with the goal to preserve the provability of the optimized GPU program. Alpinist supports loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching, of which the first three are discussed in detail. We discussed the design and implementation of Alpinist, and we validated it by verifying a set of examples and reverifying their optimized counterparts.

For future work, there are other optimizations that could be supported, such as data prefetching for all memory patterns as mentioned by Ayers et al. [4]. Another open question is if and how this approach can be used in program compilation. We also plan to extend this approach to preserve the provability of transpiled code, e.g., CUDA to OpenCL conversions. Moreover, we plan to investigate how Alpinist can be combined with techniques such as autotuning that automatically detect the potential for applying specific optimizations and identify optimal parameter configurations [3, 63].

# References


352 O. Şakar et al.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Automatic Repair for Network Programs

Lei Shi<sup>1</sup>, Yuepeng Wang<sup>2</sup> , Rajeev Alur<sup>1</sup> , and Boon Thau Loo<sup>1</sup>

<sup>1</sup> University of Pennsylvania, Philadelphia, USA <sup>2</sup> Simon Fraser University, Burnaby, Canada {shilei,alur,boonloo}@seas.upenn.edu yuepeng@sfu.ca

Abstract. Debugging imperative network programs is a difficult task for operators as it requires understanding various network modules and complicated data structures. For this purpose, this paper presents an automated technique for repairing network programs with respect to unit tests. Given as input a faulty network program and a set of unit tests, our approach localizes the fault through symbolic reasoning, and synthesizes a patch ensuring that the repaired program passes all unit tests. It applies domain-specific abstraction to simplify network data structures and exploits function summary reuse for modular symbolic analysis. We have implemented the proposed techniques in a tool called NetRep and evaluated it on 10 benchmarks adapted from real-world software-defined network controllers. The evaluation results demonstrate the effectiveness and efficiency of NetRep for repairing network programs.

# 1 Introduction

Emerging tools for program synthesis and repair facilitate automation of programming tasks in various domains. For example, in the domain of end-user programming, synthesis techniques allow users without any programming experience to generate scripts from examples for extracting, wrangling, and manipulating data in spreadsheets [13,40]. In computer-aided education, repair techniques are capable of providing feedback on programming assignments to novice programmers, helping them improve their programming skills [49,14]. In software development, synthesis and repair techniques aim to reduce the manual efforts in various tasks, including code completion [43,10], application refactoring [42], program parallelization [8], bug detection [11,41], and patch generation [11,32].

As an emerging domain, Software-Defined Networking (SDN) offers the infrastructure for monitoring network status and managing network resources based on programmable software, replacing traditional specialized hardware in communication devices. Since SDN provides an opportunity to dynamically modify the traffic handling policies on programmable routers, this technology has witnessed growing industrial adoption. However, using SDNs involves many programming tasks that are inevitably susceptible to programmer errors leading to bugs [3,23]. For example, a device with incorrect routing policies could forward a packet to undesired destinations, and a buggy firewall rule may make the entire network system vulnerable to security threats.

In the SDN framework, a logically centralized control plane generates rules that are installed into data planes, which in turn decide the routing of packets throughout the network. While network verification is a well-studied field in which operators can be alerted to incorrectly installed rules [3,4,22], little prior work has explored the problem of automatically repairing the corresponding bug in the control plane, especially for control planes written in widely used general-purpose languages such as Java or Python. Existing work mostly restricts the target to control plane programs written in domain-specific languages such as Datalog [51,17].

Since networks cannot tolerate even small mistakes, and most network operators are not trained in programming skills, debugging and repair tools in this domain should prioritize accuracy and automation. This means that many existing techniques for general program repair are not suitable for this domain, as they trade off accuracy, relying on heuristics to scale with the size of analyzed programs and the number of discovered potential bugs.

Motivated by the demand for automated repair and the limitations of existing techniques, we develop a precise and scalable program repair technique for network programs. Specifically, our repair technique takes as input a network program and a set of unit tests, reveals the program location that causes the test failure, and automatically generates a patch to fix the program. In the setting of SDN, a unit test corresponds to an incorrectly installed routing rule generated by the control plane from a reported packet. Such unit tests can be discovered by a separate network verification procedure [3,4,22].

Our main idea is to use symbolic reasoning, with constraints capturing the semantics of the program, for accurate repair, and modular analysis to improve efficiency. We extended the encoding techniques from prior work [21,12] to support object-oriented features in Java. We also developed a new approach that focuses the analysis on one function at a time and gradually narrows down the range of faulty statements along with the specification for the expected behavior.

The proposed technique is implemented in an automatic network program repair tool called NetRep. To evaluate NetRep, we adapt 10 benchmarks from real-world faulty network programs in Floodlight that require changing up to 3 lines of code to fix, and apply NetRep to repair the benchmarks automatically. The experimental results show that NetRep is able to find a repair that passes all unit tests for faulty programs of up to 738 lines of code for 8 benchmarks using 2 or 3 test cases, outperforming a state-of-the-art repair tool for general Java programs. Furthermore, NetRep is efficient in terms of repair time, requiring only an average running time of 744 seconds across all benchmarks.

Contributions. We make the following main contributions in this paper:


```
1 @network public class MacAddr {
2   private long value;
3   private MacAddr(long v) { value = v; }
4   public static MacAddr NONE = new MacAddr(0);
5   public static MacAddr of(long v) { return new MacAddr(v); }
6   ... }
7 public class FirewallRule {
8   public MacAddr dl_dst; public boolean any_dl_dst;
9   public FirewallRule() {
10    dl_dst = MacAddr.NONE; any_dl_dst = true; ... }
11  public boolean isSameAs(FirewallRule r) {
12    if (... || any_dl_dst != r.any_dl_dst
13        || (any_dl_dst == false &&
14            dl_dst != r.dl_dst)) {
15      return false; }
16    return true; }
17  ... }
```
Fig. 1: Code snippet about a bug in Floodlight.

```
1 public boolean test(long mac1, long mac2) {
2   FirewallRule r1 = new FirewallRule();
3   r1.dl_dst = MacAddr.of(mac1); r1.any_dl_dst = false;
4   FirewallRule r2 = new FirewallRule();
5   r2.dl_dst = MacAddr.of(mac2); r2.any_dl_dst = false;
6   return r1.isSameAs(r2); }
```
Fig. 2: Unit test that reveals the bug in FirewallRule.

– We develop a tool called NetRep based on the proposed techniques and evaluate it using 10 benchmarks adapted from real-world network programs. The evaluation results demonstrate that NetRep is efective for bug localization and able to generate correct patches for realistic network programs.

#### 2 Overview

In this section, we give a high-level overview of our repair techniques and walk through the NetRep tool using an example adapted from the Floodlight SDN controller [9].

Figure 1 shows a simplified code snippet about firewall rules in Floodlight. Specifically, the program consists of two classes – FirewallRule and MacAddr. The FirewallRule class describes rules enforced by the firewall, including information about source and destination mac addresses. The MacAddr class is an auxiliary data structure that stores the raw value of mac addresses<sup>3</sup>.

The network program shown in Figure 1 is problematic because the isSameAs function compares two mac addresses using the != operator rather than a negation of the equals function. The != operator only compares two objects based on their memory addresses, whereas the intent of the developer is to check whether two mac addresses have the same raw value. The bug is revealed by the unit test in Figure 2, and was confirmed and fixed by the Floodlight developers<sup>4</sup>. Next, let us illustrate how NetRep localizes this bug based on unit tests test(1, 2) = false and test(1, 1) = true and automatically synthesizes a patch to fix it.

<sup>3</sup> A unique 48-bit number that identifies each network device.

<sup>4</sup> https://github.com/floodlight/floodlight/commit/4d528e4bf5f02c59347bb9c0beb1b875ba2c821e

At a high level, NetRep enters a loop that iteratively attempts to find the fault location and synthesize the patch. Since our repair technique works in a modular fashion, NetRep first selects a function F in the program and tries to repair each possible fault location, one at a time. If NetRep cannot synthesize a patch consistent with the provided unit tests for any potential fault location in F, it backtracks, selects the next function, and repeats the same process until all possible functions are checked. We now describe the experience of running NetRep on our illustrative example.

Iteration 1. NetRep selects the constructor of FirewallRule as the target function. Fault localization determines that the fault is located at the dl_dst = MacAddr.NONE part of Line 10, because it is related to the equality checking in the unit test. However, it is not the fault location. NetRep tries to synthesize a patch that passes all unit tests to replace this statement, but fails.

Iteration 2. NetRep selects the same function – the constructor of FirewallRule – but the fault localization switches to a different statement, any_dl_dst = true at Line 10. Similar to Iteration 1, the synthesizer cannot generate a correct patch by replacing this statement.

Iteration 3. Since none of the statements in the constructor is the fault location, NetRep now selects a different function: isSameAs. The fault localization determines that any_dl_dst == false at Line 13 may be the fault location as it may affect the testing results. However, having tried to replace the statement with many other candidate statements, e.g., r.any_dl_dst == false, any_dl_dst == true, the synthesizer still fails to generate the correct patch.

Last iteration. Finally, after several attempts to localize the fault, NetRep identifies that the fault lies in dl_dst != r.dl_dst at Line 14, which is indeed the reported bug location. This time, the synthesizer manages to generate a correct patch: !dl_dst.equals(r.dl_dst). Replacing the original condition at Line 14 with this patch results in a program that passes all the provided test cases, so NetRep has successfully repaired the original faulty program.
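The reference-equality pitfall behind this bug can be reproduced in a few lines (a simplified stand-in for Floodlight's MacAddr, reusing the names from Fig. 1; the class body is a sketch, not Floodlight's actual code):

```java
public class EqualityDemo {
    static final class MacAddr {
        private final long value;
        private MacAddr(long v) { value = v; }
        static MacAddr of(long v) { return new MacAddr(v); }
        @Override public boolean equals(Object o) {
            return o instanceof MacAddr && ((MacAddr) o).value == value;
        }
        @Override public int hashCode() { return Long.hashCode(value); }
    }

    public static void main(String[] args) {
        MacAddr a = MacAddr.of(1), b = MacAddr.of(1);
        // Buggy check: != compares object identity, so two addresses with
        // the same raw value are still considered different.
        System.out.println(a != b);        // true — the bug
        // Patched check: equals compares the raw values.
        System.out.println(!a.equals(b));  // false — the fix
    }
}
```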

# 3 Preliminaries

In this section, we present the language of network programs and describe the program formalism used in the rest of the paper. We also define the program repair problem that we want to solve.

### 3.1 Language of Network Programs

The language of network programs considered in this paper is summarized in Figure 3. A network program consists of a set of classes, where each class has an optional annotation @network to denote that the class can benefit from network domain-specific abstraction.

Prog P ::= C<sup>+</sup>

Class C ::= @network? class C {a<sup>+</sup> F<sup>+</sup>}

Func F ::= function f(x1, . . . , xn) (L : s)<sup>+</sup>

Stmt s ::= l := e | jmp (e) L | ret v | x := new C | x := C.f(v1, . . . , vn) | x := y.f(v1, . . . , vn)

Expr e ::= l | c | op(e1, . . . , en)

LValue l ::= x | x.a | x[v]

Imm v ::= x | c

x, y ∈ Variable c ∈ Constant L ∈ LineID C ∈ ClassName f, f′ ∈ FuncName a ∈ FieldName

Fig. 3: Syntax of network programs.

Each class in the program consists of a list of fields and functions. Each function has a name, a parameter list, and a function body. The function body is a list of statements, where each statement is labeled with its line number. Various kinds of statements are included in our language of network programs. Specifically, the assignment statement l := e assigns expression e to left value l. The conditional jump statement jmp (e) L first evaluates predicate e. If the result is true, then the control flow jumps to line L; otherwise, it performs no operation. Note that our language does not have traditional if statements or loop statements, but those statements can be expressed using conditional jumps.<sup>5</sup>

The return statement ret v exits the current function with return value v. The new statement x := new C creates an object of class C and assigns the object address to variable x. A static call x := C.f(v1, . . . , vn) invokes the static function f in class C with arguments v1, . . . , vn and assigns the return value to variable x. Similarly, a virtual call x := y.f(v1, . . . , vn) invokes the virtual function f on receiver object y with arguments v1, . . . , vn and assigns the return value to variable x. Different kinds of expressions are supported, including constants, variable accesses, field accesses, array accesses, arithmetic operations, and logical operations. Since the semantics of network programs is similar to that of traditional programs written in object-oriented languages, we omit the formal description of the semantics.

In addition, we assume each statement in the program is labeled with a globally unique line number, and line numbers are consecutive within a function.
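To illustrate how conditional jumps encode structured control flow, a loop such as `while (x < n) { x := x + 1 }` can be expressed in this statement language roughly as follows (a sketch; the line labels and the use of a constant-true predicate as an unconditional jump are illustrative):

```
L1 : jmp (!(x < n)) L4
L2 : x := x + 1
L3 : jmp (true) L1
L4 : ret x
```

The jump at L1 exits the loop once the guard fails, and the jump at L3 returns to the loop header.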

#### 3.2 Problem Statement

We assume a unit test t is written in the form of a pair (I, O), where I is the input and O is the expected output. Given a network program P and a unit test t = (I, O), we say P passes the test t if executing P on input I yields the expected output O, denoted by ⟦P⟧(I) = O. Otherwise, if ⟦P⟧(I) ≠ O, we say P fails the test t. In general, given a network program P and a set of unit tests E, program P is faulty modulo E if there exists a test t ∈ E such that P fails on t.

Now let us turn the attention to the meaning of fault locations and patches.

Defnition 1 (Fault location and patch). Let P be a program that is faulty modulo tests E. Line L is called the fault location of P, if there exists a statement

<sup>5</sup> Our repair techniques only handle bounded loops. If there are unbounded loops in the network program, we need to perform loop unrolling.

#### Algorithm 1 Modular Program Repair

```
1: procedure Repair(P, E)
   Input: Program P, examples E
   Output: Repaired program P′ or ⊥ to indicate failure
2:   P ← Abstraction(P);
3:   V ← {L ↦ false | L ∈ Lines(P)}; P′ ← ⊥;
4:   while P′ = ⊥ do
5:     F ← SelectFunction(P, V);
6:     if F = ⊥ then return ⊥;
7:     V, P′ ← RepairFunction(P, F, E, V);
8:   return P′;

9: procedure RepairFunction(P, F, E, V)
   Input: Program P, function F, examples E, visited map V
   Output: Updated visited map V, repaired program P′
10:  P′ ← ⊥;
11:  while P′ = ⊥ do
12:    L ← LocalizeFault(P, F, E, V);
13:    if L ≠ ⊥ then
14:      V ← V[L ↦ true];
15:    else
16:      V ← V[L′ ↦ true | TransInFunc(L′, P, F)];
17:    if L = ⊥ or IsCallStmt(P, L) then return V, ⊥;
18:    P′ ← SynthesizePatch(P, E, F, L);
19:  return V, P′;
```
s such that replacing line L of P with s yields a new program that can pass all tests in E. Here, the statement s is called a patch to P.

Problem statement. Given a network program P that is faulty modulo tests E, our goal is to find a fault location L in P and generate a corresponding patch s such that the patched program P′ passes every unit test t ∈ E.

# 4 Modular Program Repair

In this section, we present our algorithm for automatically repairing network programs from a set of unit tests.

#### 4.1 Algorithm Overview

The top-level repair algorithm is described in Algorithm 1. The Repair procedure takes as input a faulty network program P and unit tests E, and produces as output a repaired program P′ or ⊥ to indicate repair failure.

At a high level, the Repair procedure maintains a visited map V from line numbers to boolean values, representing whether each line of P has been checked. The Repair procedure first applies the domain-specific abstraction to program P (Line 2) and initializes the visited map V by marking every line in P as not checked (Line 3). Next, it iteratively repairs P in a modular way until it finds a program P′ that is not faulty modulo tests E (Lines 4–8). In particular, the Repair procedure invokes SelectFunction to choose a function F as the target of repair (Line 5). If none of the functions in P can be repaired, it returns ⊥ to indicate that the repair procedure failed (Line 6). Otherwise, it invokes the RepairFunction procedure (Line 7) to enter the localization-synthesis loop inside the target function F.

In addition to the program P and tests E, the RepairFunction procedure takes as input a target function F and the current visited map V. It produces as output the updated visited map V, as well as a repaired program P′ or ⊥ to indicate that the function F cannot be repaired. As shown in Lines 11–18 of Algorithm 1, RepairFunction alternately invokes the sub-procedures LocalizeFault and SynthesizePatch to repair the target function. In particular, the goal of LocalizeFault is to identify a fault location in function F. If LocalizeFault manages to find a fault location L in F, then line L is marked as visited (Line 14). Otherwise, if LocalizeFault returns ⊥, function F and all functions transitively invoked by F are correct or not repairable; in this case, all lines in F and its transitive callees are marked as checked (Line 16). Furthermore, if the identified fault location L corresponds to a statement that invokes a function F′, the fault location is inside F′. Thus, RepairFunction directly returns ⊥ (Line 17) and SelectFunction will choose F′ as the target function in the next iteration. On the other hand, the goal of the sub-procedure SynthesizePatch is to generate a patch for function F given the fault location L. If SynthesizePatch successfully synthesizes a patch and produces a non-faulty program P′, then the entire procedure succeeds with repaired program P′. Otherwise, RepairFunction backtracks with a new program location and repeats the same process.

In the rest of this section, we explain fault localization, modular analysis, and patch synthesis in more detail.

#### 4.2 Fault Localization

Next, we give a high-level description of our fault localization technique, which aims to find the fault location in a given program. This corresponds to the LocalizeFault procedure in Algorithm 1. We first show how to encode the problem over an entire program, and then explain how the analysis can be made modular to boost performance.

At a high level, our fault localization technique uses a symbolic approach that reduces the fault localization problem to a constraint solving problem. In particular, we introduce a boolean variable B[L] for each line L and encode the fault localization problem as an SMT formula such that the value of B[L] indicates whether line L is correct.

Checking faulty programs. To understand how to encode the fault localization problem, let us first explain how to encode the consistency check given a program P and a test case t = (I, O). Specifically, the encoded SMT formula Φ(t) consists of three components:


The satisfiability of formula Φ(t) indicates the result of the consistency check. If Φ(t) is satisfiable, the solver generates a feasible execution trace and an assignment of all intermediate states along this trace. In this case, program P can pass the test t because there exists a valid trace following the control flow in which every pair of adjacent states is consistent with the semantics of the corresponding statement. Otherwise, if Φ(t) is unsatisfiable, P fails the test t.

Now, to check P against a set of unit tests E, we can conjoin the formula Φ(t<sub>j</sub>) for each unit test t<sub>j</sub> ∈ E and obtain the conjunction Φ = ⋀<sub>tj∈E</sub> Φ(t<sub>j</sub>). The satisfiability of formula Φ indicates whether P is faulty modulo tests E.<sup>6</sup>

Methodology of fault localization. Let P be a program that is faulty modulo E; we know the corresponding consistency check formula Φ is unsatisfiable. Suppose the fault location is line L<sub>i</sub>. One key insight is that replacing the semantic constraint Φ<sub>i</sub>(S, S′) with true yields a satisfiable formula. This is because true does not enforce any constraint between the pre-state S and post-state S′, so a previously invalid trace caused by the bug at L<sub>i</sub> becomes valid.

Based on this insight, we develop a methodology to find the fault location using symbolic reasoning. Specifically, given a consistency check formula Φ, we obtain a fault localization formula Φ′ by replacing the semantic constraint Φ<sub>i</sub>(S, S′) with B[L<sub>i</sub>] → Φ<sub>i</sub>(S, S′) for every line L<sub>i</sub>, i ∈ [1, n]. Here, the variable B[L<sub>i</sub>] decides whether the semantic constraint of L<sub>i</sub> is turned into true. Thus, B[L<sub>i</sub>] = false indicates that L<sub>i</sub> is a fault location.

<sup>6</sup> The encoding is described in more detail in the extended version [46].

One hiccup here is that formula Φ′ is always satisfiable: a model of Φ′ can simply assign B[L<sub>i</sub>] = false for all L<sub>i</sub>, which marks every line of the program as a fault location and is not useful. To address this issue, we add a cardinality constraint stating that exactly K variables in map B can be assigned false, which forces the constraint solver to find exactly K fault locations in program P.

Modular analysis. The method above can precisely compute a potential fault location, but it is hard to scale. Encoding a long program involves 1) a large number of semantic constraints, 2) many fault location choices, and 3) many intermediate states to be assigned.

Notice that although a program can be arbitrarily long, developers usually follow the design practice that every function is of limited size. Focusing on one function at a time and recursively searching for the final fault location can be far more efficient than solving one NP-hard problem at the scale of the entire program.

To facilitate modular analysis of a function, we need to summarize the behavior of its sub-modules (callee functions) and infer an external specification from its higher-level module (caller function).

The encoding method introduced above treats one line of code as a constraint on its pre-state and post-state. To summarize the behavior of a callee function, we turn it into a similar constraint on the pre-state and post-state of the calling statement; the inner states of the callee function are skipped in the encoding. We compute such summaries of the target function's callees by symbolic execution: we start with a symbolic representation of the pre-state, execute the callee function until it returns, and assert that the output state equals the post-state. In this way, we entirely eliminate all fault location choices and inner state assignments in the callee function, and greatly simplify the semantic constraint.

There are two ways to infer the specification of the target function. The first is to encode only the call stack of the target function up to the top-level function, where we can use the test case as the specification. All function calls made by the target's caller and transitive callers that are not on the stack can be replaced by the automatically computed summary. We can also disable all fault location choices except for lines in the target function. The second is to infer a possible pre-condition and post-condition of the target function. From the perspective of the caller, the target function is a line of code that puts an incorrect constraint on its pre-state and post-state. The constraint solver then infers a feasible pre-state and post-state assuming this incorrect constraint is removed; this assignment can be used as the pre-condition and post-condition, which eliminates the need to encode any caller function. Since the second approach may introduce incompleteness into the analysis, we use it only to infer a specification for synthesizing the final patch, and use the first for every function's analysis.

Domain-specific abstraction. A domain-specific abstraction is essentially a function summary as discussed above. But for frequently used network classes (identified by the @network annotation), we can pre-define more succinct abstractions based on domain knowledge to make the analysis easier. The abstraction A[F] of a function F is an over-approximation of F that is precise enough to characterize the behavior of F.

The abstraction is useful due to two observations. First, source code for network programs may only be partially available due to the use of high-level interfaces and native implementations. For example, when comparing two network addresses for equality, the getClass function is frequently used, but its implementation depends on the runtime and is not available. To make the analysis easier, we can instead use the following abstraction for such comparisons:

A[equals] : λx. λy.(x.dtype = y.dtype ∧ x.value = y.value ),

where x.dtype denotes the dynamic type of the object x.
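The abstraction A[equals] above can be rendered as a simple predicate. The following Python sketch uses dict-based object records for illustration; this is not NetRep's actual representation, and the address values are made up.

```python
# A[equals]: two objects are equal iff their dynamic types and values agree
A_equals = lambda x, y: x['dtype'] == y['dtype'] and x['value'] == y['value']

ip1 = {'dtype': 'IPv4', 'value': 0x0A000001}   # 10.0.0.1
ip2 = {'dtype': 'IPv4', 'value': 0x0A000001}
mac = {'dtype': 'MAC',  'value': 0x0A000001}   # same bits, different type

print(A_equals(ip1, ip2))  # True
print(A_equals(ip1, mac))  # False: dynamic types differ
```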

Second, network programs contain complex operations that are challenging for symbolic reasoning. For instance, bit manipulations are heavily used in network data structures. While bit manipulations can improve the performance of network programs, they present significant challenges for symbolic analysis due to the encoding in the theory of bitvectors. We can provide an abstraction that is equivalent for correctness purposes but simpler in behavior, e.g., using the identity function instead of a hash code computation.

#### 4.3 Patch Synthesis

The last step of our repair algorithm is to generate a patch that fixes the faulty program. This corresponds to the SynthesizePatch procedure in Algorithm 1. It can be reduced to a sketch finishing problem in program synthesis, where we replace the existing faulty line with a hole.

Our general idea is to use plain enumerative search with a depth bound over the space of candidate patches, but with two significant optimizations.

First, we reduce the search space with heuristics. On one hand, we only replace the core expression in the faulty statement with a hole, to focus on the most expressive part. Specifically, we consider changing the right-hand-side expressions of assignments, the conditional expressions of jump statements, the return values of return statements, and the functions and arguments of function invocations. On the other hand, we use a limited grammar to guide the search. We parameterize all constants, variables, fields, functions, and operators over the sketch and only instantiate constructs that are in scope. For example, given a particular sketch with a hole, we only populate the variable set with the local and global variables that are in scope at the hole. Also, if the hole corresponds to the conditional expression of an if statement, we only add logical operators to the grammar.

Second, we use the local specification to guide the synthesis. Sketch completion differs from synthesizing a complete program in that the specification is defined for the entire program, so verifying a candidate patch would repeatedly waste time executing the correct parts of the program. We use the technique described in the modular analysis section to generate a pre-condition and post-condition for only the faulty line. In this way, only the generated patch needs to be executed to verify against the specification, which saves substantial time as the program grows larger.
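The grammar-bounded enumeration guided by a local specification can be sketched as below. The grammar, depth bound, and pre-state/expected-value pairs are illustrative stand-ins, not NetRep's actual search; only the candidate expression for the hole is executed against the local specification, never the rest of the program.

```python
VARS = ['x', 'y']   # only the variables in scope at the hole
OPS = ['+', '-']    # operators allowed by the limited grammar

def exprs(depth):
    """Enumerate candidate expressions up to the given depth bound."""
    if depth == 0:
        return list(VARS)
    smaller = exprs(depth - 1)
    out = list(smaller)
    for op in OPS:
        for l in smaller:
            for r in smaller:
                out.append(f'({l} {op} {r})')
    return out

# Local specification for the faulty line: inferred pre-states paired
# with the expected value of the hole under each unit test.
spec = [({'x': 5, 'y': 2}, 7), ({'x': 1, 'y': 3}, 4)]

def synthesize(depth):
    for cand in exprs(depth):
        if all(eval(cand, {}, dict(env)) == want for env, want in spec):
            return cand
    return None

print(synthesize(1))  # '(x + y)'
```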

# 5 Implementation

We have implemented the proposed repair technique in a tool called NetRep. NetRep leverages the Soot static analysis framework [26] to convert Java programs into Jimple code, which provides a succinct yet expressive set of instructions for analysis. In addition, NetRep utilizes the Rosette tool [48] to perform symbolic reasoning for fault localization and patch synthesis. While our implementation closely follows the algorithm presented in Section 4, we also apply several optimizations that are important for NetRep's performance.

Memories for different types. Since conversion between bitvectors and integers imposes significant runtime overhead, NetRep divides the memory into one part for integers and another for bitvectors. In this design, NetRep automatically selects the memory chunk based on variable types; type checking guarantees that no such conversions occur.

Stack and heap. To reduce the number of memory operations, NetRep also divides the memory into a stack and a heap. As is standard, the stack only stores static data and its layout is deterministic. Therefore, stacks are implemented using fixed-size vectors and can be efficiently read and written. On the other hand, the heap stores dynamic data that is usually not known at compile time, such as allocated objects. Since the heap size cannot be determined beforehand, NetRep uses an uninterpreted function f(x) to represent the heap, where x is an address and f(x) is the value stored at x.

String values. Since reasoning over string values is challenging and not always necessary for repairing network programs, we simplify the representation of strings using integer values. Specifically, NetRep maps each string literal to a unique integer and represents all string operations (e.g., concatenation) with uninterpreted functions.

Bounded program analysis. To improve repair time, NetRep only performs bounded program analysis for fault localization and patch synthesis. Namely, we unroll loops and inline functions up to K times, where K is a predefined hyper-parameter. In this way, function summaries can be easily and efficiently computed using symbolic execution.

# 6 Evaluation

To evaluate the proposed techniques, we perform experiments that are designed to answer the following research questions:



Table 1: Experimental results of NetRep.

RQ1 Can NetRep effectively repair realistic network programs?

RQ2 How efficiently does NetRep localize faults and synthesize patches?

RQ3 How much do modular analysis and domain-specific abstraction contribute to NetRep's performance?

RQ4 How does NetRep compare to other repair tools for Java programs?

Benchmark collection. To obtain realistic benchmarks, we crawl the commit history of Floodlight [9], a representative open-source SDN controller in Java that supports the OpenFlow protocol and a rich set of network functions. To distinguish commits caused by bug repairs from those generated for non-repair reasons, we identify commits based on the following criteria: 1) the commit message contains keywords about repairing bugs, e.g., "bug", "error", "fix"; 2) the commit changes no more than three lines of code.

Following these criteria, we collected 10 commits from the Floodlight repository and adapted them into our benchmarks. Specifically, given a commit in the repository, we take the code before the commit as the faulty network program and the version after the commit as the ground-truth repaired program. The code is post-processed and the parts irrelevant to the bug of interest are removed. We also identify the corresponding unit tests and modify them to directly reveal the bug as appropriate. Each benchmark in our evaluation consists of a faulty network program and its corresponding unit tests.

Experimental setup. All experiments are conducted on a computer with a 4-core 2.80 GHz CPU and 16 GB of physical memory, running the Arch Linux operating system. We use Racket v7.7 as the compiler and runtime system of NetRep and set a time limit of 1 hour for each benchmark.

#### 6.1 Main Results

Our main experimental results are summarized in Table 1. The column labeled "Module" describes the network module to which the benchmark belongs. The next two columns, "LOC" and "# Funcs", show the number of lines of source code (in Jimple) and the number of functions, respectively. The "# Tests" column presents the number of unit tests used for fault localization and patch synthesis. Next, the "Succ" and "Exp" columns show whether NetRep can successfully repair the program and whether the generated patch is exactly the same as the ground truth. Since NetRep returns the first fix that passes all provided test cases, the repaired programs are not necessarily the same as those expected in the ground truth; in this case, the table shows "Yes" in the "Succ" column and "No" in the "Exp" column. Finally, the last three columns in Table 1 report the fault localization time, patch synthesis time, and total running time of NetRep.

As shown in Table 1, each benchmark has 13 to 65 functions, with an average of 34 across all benchmarks, and 212–809 lines of Jimple code, with an average of 496. NetRep succeeds in repairing 8 out of 10 benchmarks. Furthermore, for 5 of the successfully repaired benchmarks, NetRep generates exactly the same fix as the ground truth. Given that our benchmarks cover programs from a variety of Floodlight modules, such as the DHCP server and firewall, we believe that NetRep is effective at repairing realistic network programs (RQ1).

We inspected why NetRep fails to repair benchmarks 2 and 5. NetRep is not able to localize the fault in benchmark 2 due to its incomplete support for unbounded data structures with dynamic allocation, such as hash maps. For benchmark 5, NetRep is able to localize the fault but not to synthesize the correct patch, because the expected function to be invoked shares side effects with another function, which would require improvements in the specification checking to verify.

Regarding efficiency, NetRep can repair 8 benchmarks in an average of 744 seconds with only 2 to 3 test cases. The fault localization time ranges from 39 seconds to 893 seconds, with 50% of the benchmarks finishing within five minutes. The patch synthesis time ranges from 39 seconds to 2139 seconds, with 60% of the benchmarks finishing within five minutes. In summary, the evaluation results show that NetRep takes only minutes to localize bugs in a faulty program and synthesize a correct patch based on two to three unit tests (RQ2).

#### 6.2 Ablation Study

To explore the impact of modular analysis and domain-specific abstraction on the proposed repair technique, we develop three variants of NetRep:

- NetRep-NoAbs: NetRep without domain-specific abstraction;
- NetRep-NoMod: NetRep without modular analysis;
- NetRep-NoModAbs: NetRep without either modular analysis or domain-specific abstraction.
To understand the impact of modular analysis and domain-specific abstraction, we run all variants on the 10 collected benchmarks. For each variant, we

Fig. 4: Comparing NetRep against three variants.

measure the total running time (including fault localization and patch synthesis time) on each benchmark and order the results by increasing running time. The results for all variants are depicted in Figure 4. Each line stops at the last benchmark that the corresponding variant can solve within the 1-hour time limit.

As shown in Figure 4, both NetRep-NoAbs and NetRep-NoMod can only solve 4 out of 10 benchmarks, with average running times of 569 seconds and 610 seconds, respectively. NetRep-NoModAbs solves the fewest benchmarks: 3 out of 10, with an average running time of 1165 seconds on those it can solve. This experiment shows that modular analysis and domain-specific abstraction both greatly boost NetRep's efficiency in repairing network programs (RQ3).

#### 6.3 Comparison with the Baseline

To understand how NetRep performs compared to other Java program repair tools, we compare NetRep against a state-of-the-art tool called Jaid [5] on our benchmarks. Jaid takes as input a faulty Java program, a set of unit tests, and a function signature for fault localization and patch synthesis, a setting closest to NetRep's among a variety of tools. Note that Jaid solves a simpler repair problem than NetRep, because it requires the user to specify a function that is potentially incorrect in the program, whereas NetRep needs no input other than the faulty program and unit tests. To run Jaid on our benchmarks, we adjust their format to fit Jaid's and provide the faulty function (known from the ground truth) as input to Jaid.

Jaid enumerates possible patches indefinitely rather than recommending a single most likely one. We consider it successful if the expected patch can be found among the results; in practice, human assistance is needed to pick out this patch from thousands of candidates.

As a result, Jaid is able to finish on 8 out of 10 benchmarks, and the expected patches are found for 2 of them, whereas NetRep gives the expected result for 5 benchmarks on the first recommendation. Of the remaining two benchmarks, Jaid is unable to fix one and runs out of memory on the other.

We argue that NetRep is better suited for automatically repairing network programs than Jaid. First, it only requires network operators to provide unit test cases, which, as discussed above, can be automatically discovered by another verification or testing procedure. In comparison, Jaid requires users skilled in programming network controllers to identify the buggy function and pick the correct patch from the results, which is beyond the ability of most network operators and would require an expert team. Second, NetRep has higher repair accuracy. As discussed above, networks are sensitive to small mistakes, and high accuracy is crucial for a network to function correctly.

In summary, NetRep is more effective at automatically fixing bugs in network programs than state-of-the-art repair tools for Java programs, especially with respect to repair accuracy and automation (RQ4).

# 7 Related Work

Automated program repair. Automated program repair is an active research area that aims to automatically fix mistakes in programs based on specifications of correctness criteria [11,28,39,18], with a variety of applications such as aiding software development [34], finding security vulnerabilities [37], and teaching novice programmers [49,14]. Different techniques have been proposed to solve the automated program repair problem, including heuristics-based techniques [16,31], semantics-based techniques [37,27], and learning-based techniques [45,30,32,47]. NetRep is a semantics-based automated repair tool. Different from prior work, NetRep is specialized to repair network programs based on modular analysis and abstractions of network data structures.

Fault localization. Researchers have developed various approaches to fault localization, including spectrum-based, learning-based, and constraint-based techniques. Specifically, spectrum-based techniques [27,1,2,7,44,6,19] perform fault localization by identifying which parts of the program are active during a run through execution profiles (called program spectra). Learning-based techniques [29,53,54] typically train machine learning models to predict and rank possible fault locations. By contrast, constraint-based techniques [21,20,12] encode the semantics of programs as logical constraints and reduce the fault localization problem to a constraint satisfaction problem. In spirit, NetRep uses a similar idea for fault localization. However, NetRep performs modular analysis and enables debugging programs involving object-oriented features, whereas prior work only analyzes entire programs in a C-like language. Besides, NetRep reuses the fault localization result to speed up patch synthesis, while prior work mainly focuses on the fault localization step.

Patch synthesis. Many synthesis algorithms have been developed for generating patches, including enumerative search [27], constraint-based techniques [37], statistical models [52], machine learning [15], hints from existing code [25], and so on. For patch synthesis, NetRep generates a context-free grammar from the context of the fault location and performs enumerative search based on the grammar. It does not require a machine learning model or statistical information for ranking possible patches; however, it is conceivable that NetRep would benefit from the guidance of such ranking techniques.

Verification and synthesis for SDN. In the networking domain, several verification tools [3,33,23,24] have been proposed based on either model checking or theorem proving. For example, VeriCon [3] performs deductive verification to verify the correctness of SDN programs specified by network-wide invariants on all admissible topologies. In addition to verification, synthesis techniques [36,35,38] have also been proposed to aid software-defined networking. NetRep aims to repair network programs automatically, which is a different problem from SDN verification or synthesis.

Repair for network programs. Our work is most related to automated repair of network programs in the SDN domain [50,51,17]. Prior work on automated repair [50,51] relies on Datalog to capture the operational semantics of the target language to be repaired; these techniques work for domain-specific languages (e.g., Datalog or Ruby on Rails) with simple structure. Similarly, Hojjat et al. [17] propose a framework based on the Horn-clause repair problem to help network operators fix faulty configurations. In contrast, NetRep targets Java network programs with object-oriented features and more complex constructs, which cannot be handled by existing techniques.

# 8 Limitations and Future Work

We discuss several limitations of NetRep that we plan to address in future work. First, NetRep repairs the faulty network program with the first correct patch that passes all tests. User interaction that resumes the synthesis could be introduced for cases where the first patch is not the one intended by the user, or a more formal specification could be provided.

Second, patches that require complicated changes, e.g., those involving control flow structures, are beyond NetRep's ability; they make up 44% of our collection of bug-fixing commits. We envision that this challenge can be addressed by introducing more sophisticated patch synthesis techniques, such as searching over a domain-specific language of edits.

Third, to force symbolic execution to terminate in finite time, NetRep currently unrolls all loops in the network program, which may result in missing a potential bug. Loop invariant inference techniques could be leveraged to overcome this challenge and still guarantee termination.

# 9 Conclusion

In this paper, we have proposed an automated repair technique for network controller programs that uses unit tests as specifications. Our technique internally performs symbolic reasoning for bug localization and patch synthesis, optimized by network domain-specific abstractions and modular analysis to reduce the encoding size. We have implemented a tool called NetRep and evaluated it on 10 benchmarks adapted from the Floodlight framework. The experimental results demonstrate that NetRep is effective at repairing realistic network programs with moderate change sizes.

### References


the Thirty-First AAAI Conference on Artificial Intelligence. pp. 1345–1351. AAAI Press (2017)


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# 11th Competition on Software Verification: SV-COMP 2022

# Progress on Software Verification: SV-COMP 2022

Dirk Beyer (✉)

LMU Munich, Munich, Germany

Abstract. The 11th edition of the Competition on Software Verification (SV-COMP 2022) provides the largest ever overview of tools for software verification. The competition is an annual comparative evaluation of fully automatic software verifiers for C and Java programs. The objective is to provide an overview of the state of the art in terms of effectiveness and efficiency of software verification, establish standards, provide a platform for exchange to developers of such tools, educate PhD students on reproducibility approaches and benchmarking, and provide computing resources to developers that do not have access to compute clusters. The competition consisted of 15 648 verification tasks for C programs and 586 verification tasks for Java programs. Each verification task consisted of a program and a property (reachability, memory safety, overflows, termination). A new category on data-race detection was introduced as a demonstration category. SV-COMP 2022 had 47 participating verification systems from 33 teams from 11 countries.

Keywords: Formal Verification · Program Analysis · Competition · Software Verification · Verification Tasks · Benchmark · C Language · Java Language · SV-Benchmarks · BenchExec · CoVeriTeam

# 1 Introduction

This report is the 2022 edition of the series of competition reports (see footnote) that accompanies the competition, explaining the process and rules, giving insights into some aspects of the competition (this time the focus is on troubleshooting and reproducing results on a small scale), and, most importantly, reporting the results of the comparative evaluation. The 11th Competition on Software Verification (SV-COMP, https://sv-comp.sosy-lab.org/2022) is the largest comparative evaluation ever in this area. The objectives of the competition were discussed earlier (objectives 1–4 [16]) and extended over the years (objectives 5–6 [17]):

1. provide an overview of the state of the art in software-verification technology and increase visibility of the most recent software verifiers,

This report extends previous reports on SV-COMP [10, 11, 12, 13, 14, 15, 16, 17, 18].

Reproduction packages are available on Zenodo (see Table 4).

<sup>B</sup> dirk.beyer@sosy-lab.org


The SV-COMP 2020 report [17] discusses the achievements of the SV-COMP competition so far with respect to these objectives.

Related Competitions. There are many competitions in the area of formal methods [9], because it is well-understood that competitions are a fair and accurate means to execute a comparative evaluation with involvement of the developing teams. We refer to a previous report [17] for a more detailed discussion and give here only the references to the most related competitions [20, 53, 67].

Quick Summary of Changes. While we try to keep the setup of the competition stable, there are always improvements and developments. For the 2022 edition, the following changes were made:


# 2 Organization, Definitions, Formats, and Rules

Procedure. The overall organization of the competition did not change in comparison to the earlier editions [10, 11, 12, 13, 14, 15, 16, 17, 18]. SV-COMP is an open competition (also known as comparative evaluation), where all verification tasks are known before the submission of the participating verifiers, which is necessary due to the complexity of the C language. The procedure is partitioned into the benchmark submission phase, the training phase, and the evaluation phase. The participants received the results of their verifier continuously via e-mail (for preruns and the final competition run), and the results were publicly announced on the competition web site after the teams inspected them.

Competition Jury. Traditionally, the competition jury consists of the chair and one member of each participating team; the team-representing members circulate every year after the candidate-submission deadline. This committee reviews the competition contribution papers and helps the organizer with resolving any disputes that might occur (from competition report of SV-COMP 2013 [11]).

In more detail, the tasks of the jury consist of the following:


The team representatives of the competition jury are listed in Table 5.

License Requirements. Starting in 2018, SV-COMP has required that the verifier be publicly available for download and have a license that


Two exceptions were made to allow minor incompatibilities for commercial participants. The jury felt that the rule "allows any kind of (re-)distribution of the unmodified verifier archive" is too broad. The idea of the rule was to maximize the possibilities for reproduction. Starting with SV-COMP 2023, this license requirement shall be changed to "allows (re-)distribution of the unmodified verifier archive via SV-COMP repositories and archives".

Validation of Results. The validation of the verification results was done by eleven validation tools, which are listed in Table 1, including references to literature. Four new validators support the competition:



Table 1: Tools for witness-based result validation (validators) and witness linter

Table 2: Scoring schema for SV-COMP 2022 (unchanged from 2021 [18])


Task-Definition Format 2.0. SV-COMP 2022 used the task-definition format in version 2.0. More details can be found in the report for Test-Comp 2021 [19].

Properties. Please see the 2015 competition report [13] for the definition of the properties and the property format. All specifications used in SV-COMP 2022 are available in the directory c/properties/ of the benchmark repository.

Categories. The (updated) category structure of SV-COMP 2022 is illustrated by Fig. 1. The categories are also listed in Tables 8, 9, and 10, and described in detail on the competition web site (https://sv-comp.sosy-lab.org/2022/benchmarks.php). Compared to the category structure for SV-COMP 2021, we added the subcategory Termination-BitVectors to category Termination and the sub-category SoftwareSystems-BusyBox-ReachSafety to category SoftwareSystems.

Scoring Schema and Ranking. The scoring schema of SV-COMP 2022 was the same as for SV-COMP 2021. Table 2 provides an overview and Fig. 2 visually illustrates the score assignment for the reachability property as an example. As before, the rank of a verifier was decided based on the sum of points (normalized for meta categories). In case of a tie, the rank was decided based on success run time, which is the total CPU time over all verification tasks for which the verifier reported

Fig. 1: Category structure for SV-COMP 2022; category C-FalsificationOverall contains all verification tasks of C-Overall without Termination; Java-Overall contains all Java verification tasks; compared to SV-COMP 2021, there is one new sub-category in Termination and one new sub-category in SoftwareSystems

Fig. 2: Visualization of the scoring schema for the reachability property (unchanged from 2021 [18])

Fig. 3: Benchmarking components of SV-COMP and competition's execution flow (same as for SV-COMP 2020)

a correct verification result. Opt-out from Categories and Score Normalization for Meta Categories was done as described previously [11] (page 597).

Reproducibility. SV-COMP results must be reproducible, and consequently, all major components are maintained in public version-control repositories. The overview of the components is provided in Fig. 3, and the details are given in Table 3. We refer to the SV-COMP 2016 report [14] for a description of all components of the SV-COMP organization. There are competition artifacts at Zenodo (see Table 4) to guarantee their long-term availability and immutability.

Competition Workflow. The workflow of the competition is described in the report for Test-Comp 2021 [19] (SV-COMP and Test-Comp use a similar workflow).


Table 3: Publicly available components for reproducing SV-COMP 2022



# 3 Reproducing a Verification Run and Trouble-Shooting Guide

In the following, we explain a few steps that are useful for reproducing individual results and for troubleshooting. The section is written from the perspective of a participant.

Step 1: Make Verifier Archive Available. The first action item for a participant is to submit a merge request to the repository that contains all the verifier archives (see list of merge requests at GitLab). Typical problems include:


Step 2: Ensure That Verifier Works on Competition Machines. Once the CI checks passed and the archive is merged into the official competition repository, the verifier can be executed on the competition machines on a few verification

<sup>1</sup> https://github.com/sosy-lab/benchexec/blob/3.10/doc/tool-integration.md

tasks. The competition uses the infrastructure VerifierCloud, and remote execution in this compute cloud is possible using CoVeriTeam [29]. CoVeriTeam is a tool for constructing cooperative verification tools from existing components, and this project has supported the competition since SV-COMP 2021. Among its many capabilities, it enables remote execution of verification runs directly on the competition machines, which was found to be a valuable service for troubleshooting. A description and example invocation for each participating verifier is available in the CoVeriTeam documentation (see file doc/competitionhelp.md in the CoVeriTeam repository). Competition participants are asked to execute their tool locally using CoVeriTeam and then remotely on the competition machines. Typical problems include:


Step 3: Check Prerun Results. So far, we considered executing individual verification runs in the Docker container or remotely on the competition machines. As a service to the participating teams, the competition offers training runs and provides the results to the teams. Typical checks that teams perform on the prerun results include:


<sup>2</sup> https://gitlab.com/sosy-lab/benchmarking/competition-scripts/-/tree/svcomp22


Table 5: Competition candidates with tool references and representing jury members; new for first-time participants, <sup>∅</sup> for hors-concours participation


Table 6: Algorithms and techniques that the participating verification systems used; new for first-time participants, <sup>∅</sup> for hors-concours participation

(continues on next page)


# 4 Participating Verifiers

The participating verification systems are listed in Table 5. The table contains the verifier name (with hyperlink), references to papers that describe the systems, the representing jury member and the affiliation. The listing is also available on the competition web site at https://sv-comp.sosy-lab.org/2022/systems.php. Table 6 lists the algorithms and techniques that are used by the verification tools, and Table 7 gives an overview of commonly used solver libraries and frameworks.

Hors-Concours Participation. There are verification tools that participated in the comparative evaluation, but did not participate in the rankings. We call this kind of participation hors concours as these participants cannot participate in rankings and cannot "win" the competition. Those are either passive or active participants. Passive participation means that the tools are taken from previous years of the competition, in order to show progress and compare new tools against them (Coastal<sup>∅</sup>, CPA-BAM-BnB<sup>∅</sup>, CPALockator<sup>∅</sup>, Divine<sup>∅</sup>, Esbmc-incr<sup>∅</sup>, Gazer-Theta<sup>∅</sup>, Lazy-CSeq<sup>∅</sup>, Pinaka<sup>∅</sup>, PredatorHP<sup>∅</sup>, Smack<sup>∅</sup>, Spf<sup>∅</sup>). Active participation means that there are teams actively developing the tools, but there are reasons why those tools should not occur in the rankings. For example, a

Table 7: Solver libraries and frameworks that are used as components in the participating verification systems (a component is mentioned if used more than three times; new for first-time participants, <sup>∅</sup> for hors-concours participation)


(continues on next page)


tool might use other tools that participate in the competition on their own, and comparing such a tool in the ranking could be considered unfair (CVT-AlgoSel new<sup>∅</sup>, CVT-ParPort new<sup>∅</sup>). Also, a tool might produce uncertain results and the team was not sure if the full potential of the tool can be shown in the SV-COMP experiments (Infer new<sup>∅</sup>). Those participations are marked as 'hors concours' in Table 5 and others, and the names are annotated with a symbol (<sup>∅</sup>).

# 5 Results and Discussion

The results of the competition represent the state of the art of what can be achieved with fully automatic software-verification tools on the given benchmark set. We report the effectiveness (number of verification tasks that can be solved and correctness of the results, as accumulated in the score) and the efficiency (resource consumption in terms of CPU time and CPU energy). The results are presented in the same way as in previous years, so that the improvements compared to last year are easy to identify, except that due to the number of tools, we have to split the table and put the hors-concours verifiers into a second results table. The results presented in this report were inspected and approved by the participating teams.

Computing Resources. The resource limits were the same as in the previous competitions [14]: Each verification run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. Witness validation was limited to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation witnesses and 15 min of CPU time for correctness witnesses. The machines


Table 8: Quantitative overview over all regular results; empty cells are used for opt-outs, new for first-time participants


Table 9: Quantitative overview over all hors-concours results; empty cells represent opt-outs, new for first-time participants, <sup>∅</sup> for hors-concours participation

for running the experiments are part of a compute cluster that consists of 167 machines; each verification run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86\_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BenchExec [32] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloud to distribute, install, run, and clean-up verification runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we used CPU Energy Meter [35] (integrated in BenchExec [32]).

One complete verification execution of the competition consisted of 309 081 verification runs (each verifier on each verification task of the selected categories according to the opt-outs), consuming 937 days of CPU time and 249 kWh of CPU energy (without validation). Witness-based result validation required 1.43 million validation runs (each validator on each verification task for categories with witness validation, and for each verifier), consuming 708 days of CPU time. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a

Table 10: Overview of the top-three verifiers for each category; new for first-time participants, measurements for CPU time and energy rounded to two significant digits ('–' indicates a missing energy value due to a configuration bug)


Fig. 4: Quantile functions for category C-Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by correct verification runs below a certain run time (y-coordinate). More details were given previously [11]. A logarithmic scale is used for the time range from 1 s to 1000 s, and a linear scale is used for the time range between 0 s and 1 s.

total of 2.85 million verification runs consuming 19 years of CPU time, and 16.3 million validation runs consuming 11 years of CPU time.

Quantitative Results. Tables 8 and 9 present the quantitative overview of all tools and all categories. Due to the large number of tools, we need to split the presentation into two tables, one for the verifiers that participate in the rankings (Table 8), and one for the hors-concours verifiers (Table 9). The head row mentions the category, the maximal score for the category, and the number of verification tasks. The tools are listed in alphabetical order; every table row lists the scores of one verifier. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the verifier opted-out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site (https://sv-comp.sosy-lab.org/2022/results) and in the results artifact (see Table 4).

Table 10 reports the top three verifiers for each category. The run time (column 'CPU Time') and energy (column 'CPU Energy') refer to successfully solved verification tasks (column 'Solved Tasks'). We also report the number of tasks for which no witness validator was able to confirm the result (column 'Unconf. Tasks'). The columns 'False Alarms' and 'Wrong Proofs' report the number of verification tasks for which the verifier reported wrong results, i.e., reporting a counterexample when the property holds (incorrect False) and claiming that the program fulfills the property although it actually contains a bug (incorrect True), respectively.


Table 11: Results of verifiers in demonstration category NoDataRace

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [11, 32] because these visualizations make it easier to understand the results of the comparative evaluation. The results archive (see Table 4) and the web site (https://sv-comp.sosy-lab.org/2022/results) include such a plot for each (sub-)category. As an example, we show the plot for category C-Overall (all verification tasks) in Fig. 4. A total of 13 verifiers participated in category C-Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [11]). A more detailed discussion of score-based quantile plots, including examples of what insights one can obtain from the plots, is provided in previous competition reports [11, 14].

The winner of the competition, Symbiotic, not only achieves the best cumulative score (the graph for Symbiotic has the longest width from x = 0 to its right end), but is also extremely efficient (the area below the graph is very small). Verifiers whose graphs start with a negative cumulative score produced wrong results. Several verifiers whose graphs start with a minimal CPU time larger than 3 s are based on Java, and the time is consumed by starting the JVM.

Demo Category NoDataRace. SV-COMP 2022 had a new category on data-race detection, and we report the results in Table 11. The benchmark set contained a total of 162 verification tasks. The category was defined as a demonstration category because it was not clear how many verifiers would participate. Eight verifiers specified the execution for this sub-category in their benchmark definition<sup>3</sup> and participated in this demonstration. A detailed table was generated by BenchExec's table-generator together with all other results and is available on the competition web site and in the artifact (see Table 4).

The results are presented as a means to show that such a category is useful; the results do not represent the full potential of the verifiers, as they were not fully tuned by their developers but handed in for demonstrating abilities only.

Alternative Rankings. The community suggested reporting a couple of alternative rankings that honor different aspects of the verification process, as a complement to the official SV-COMP ranking. Table 12 is similar to Table 10, but contains

<sup>3</sup> https://gitlab.com/sosy-lab/sv-comp/bench-defs/-/tree/svcomp22/benchmark-defs

Table 12: Alternative rankings for category Overall; quality is given in score points (sp), CPU time in hours (h), CPU energy in kilowatt-hours (kWh), wrong results in errors (E), rank measures in errors per score point (E/sp), joule per score point (J/sp), and score points (sp)


the alternative ranking categories Correct and Green Verifiers. Column 'Quality' gives the score in score points, column 'CPU Time' the CPU usage of successful runs in hours, column 'CPU Energy' the CPU usage of successful runs in kWh, column 'Solved Tasks' the number of correct results, column 'Wrong Results' the sum of false alarms and wrong proofs in number of errors, and column 'Rank Measure' gives the measure to determine the alternative rank.

Correct Verifiers — Low Failure Rate. The right-most columns of Table 10 report that the verifiers achieve a high degree of correctness (all top three verifiers in C-Overall have less than 2‰ wrong results). The winners of category Java-Overall did not produce a single wrong answer. The first category in Table 12 uses the failure rate as rank measure: $\frac{\text{number of incorrect results}}{\max(\text{total score},\, 1)}$, the number of errors per score point (E/sp). We use E as the unit for the number of incorrect results and sp as the unit for the total score. The worst result was 0.023 E/sp in SV-COMP 2021 and is now at 0.042 E/sp. Goblint is the best verifier regarding this measure.

Green Verifiers — Low Energy Consumption. Since a large part of the cost of verification is given by the energy consumption, it might be important to also consider the energy efficiency. The second category in Table 12 uses the energy consumption per score point as rank measure: $\frac{\text{total CPU energy}}{\max(\text{total score},\, 1)}$, with the unit J/sp. The worst result from SV-COMP 2021 was 630 J/sp and is now at 690 J/sp. Also here, Goblint is the best verifier regarding this measure.

New Verifiers. To acknowledge the verification systems that participate for the first or second time in SV-COMP, Table 13 lists the new verifiers (in SV-COMP 2021 or SV-COMP 2022).


Table 13: New verifiers in SV-COMP 2021 and SV-COMP 2022; column 'Subcategories' gives the number of executed categories (including demo category NoDataRace), new for first-time participants, <sup>∅</sup> for hors-concours participation

Table 14: Confirmation rate of verification witnesses during the evaluation in SV-COMP 2022; new for first-time participants, <sup>∅</sup> for hors-concours participation


Verifiable Witnesses. Results validation is of primary importance in the competition. All SV-COMP verifiers are required to justify the result (True or False) by producing a verification witness (except for those categories for which no result validator is available). We used ten independently developed witness-based result validators and one witness linter (see Table 1).

Fig. 5: Number of evaluated verifiers for each year (first-time participants on top)

Table 14 shows the confirmed versus unconfirmed results: the first column lists the verifiers of category C-Overall, the three columns for result True report the total, confirmed, and unconfirmed number of verification tasks for which the verifier answered with True, respectively, and the three columns for result False report the total, confirmed, and unconfirmed number of verification tasks for which the verifier answered with False, respectively. More information (for all verifiers) is given in the detailed tables on the competition web site and in the results artifact; all verification witnesses are also contained in the witnesses artifact (see Table 4). The verifiers 2ls and UKojak are the winners in terms of confirmed results for expected results True and False, respectively. The overall interpretation is similar to SV-COMP 2020 and 2021 [17, 18].

### 6 Conclusion

The 11th edition of the Competition on Software Verification (SV-COMP 2022) was the largest ever, with 47 participating verification systems (incl. 14 hors-concours and 14 new verifiers) (see Fig. 5 for the participation numbers and Table 5 for the details). The number of result validators was increased from 6 in 2021 to 11 in 2022, to validate the results (Table 1). The number of verification tasks was increased to 15 648 in the C category and to 586 in the Java category, and a new category on data-race detection was demonstrated. A new section in this report (Sect. 3) explains steps to reproduce verification results and to investigate problems during execution, and a new table gives an overview of the usage of common solver libraries and frameworks. The high quality standards of the TACAS conference, in particular with respect to the important principles of fairness, community support, and transparency, are ensured by a competition jury in which each participating team had a member. We hope that the broad overview of verification tools stimulates their further application by an ever-growing user community of formal methods.

Data-Availability Statement. The verification tasks and results of the competition are published at Zenodo, as described in Table 4. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 3. For easy access, the results are also presented online on the competition web site https://sv-comp.sosy-lab.org/2022/results.

Funding Statement. This project was funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 418257054 (Coop).

# References


Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# AProVE: Non-Termination Witnesses for C Programs<sup>⋆</sup> (Competition Contribution)

Jera Hensel , Constantin Mensendiek , and Jürgen Giesl

LuFG Informatik 2, RWTH Aachen University, Germany

Abstract. To (dis)prove termination of C programs, AProVE uses symbolic execution to transform the program's LLVM code into an integer transition system, which is then analyzed by several backends. The transformation steps in AProVE and the tools in the backend only produce sub-proofs in their domains. Hence, we have now developed new techniques to automatically combine the essence of these proofs. If non-termination is proved, then they yield an overall witness, which identifies a non-terminating path in the original C program.

# 1 Verification Approach and Software Architecture

To prove (non-)termination of a C program, AProVE uses the Clang compiler [7] to translate it to the intermediate representation of the LLVM framework [15]. Then AProVE symbolically executes the LLVM program and uses abstraction to obtain a finite symbolic execution graph (SEG) containing all possible program runs. We refer to [14,17] for further details on our approach to prove termination.

To prove non-termination, AProVE runs three approaches in parallel, see Fig. 1. The first two approaches transform the lassos of the SEG to integer transition systems (ITSs), which are then passed to the tools T2 [6] and LoAT [11]. If one of the tools returns a proof of non-termination, AProVE uses it to construct a non-terminating path through the C program. The path of the first succeeding approach is returned to the user while all other computations are stopped. T2's proof consists of a recurrent set characterizing those variable assignments that lead to a non-terminating ITS run. Here, AProVE uses an SMT solver to identify a corresponding concrete assignment of the variables in the ITS (which correspond to the variables in the (abstract) program states of the SEG). The third approach transforms the lassos of the SEG directly to SMT formulas which are only satisfiable if there is a non-terminating path, and in this case, we can deduce a variable assignment from the model of the formulas returned by the solver. While the first and the third approach were already available in AProVE before [13], we now extended them by the generation of non-termination witnesses. To this end, the variable assignment obtained from these approaches

<sup>⋆</sup> funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 235950644 (Project GI 274/6-2)

Fig. 1: AProVE's Workflow for Non-Termination Analysis

is used by AProVE to step through the corresponding lasso of the SEG in order to obtain a concrete execution path which witnesses non-termination. To ensure that the generation of the path terminates, AProVE stops as soon as a program state of the SEG is visited twice. Thus, this approach only succeeds if the first loop on the path whose body is executed several times is already the non-terminating loop. However, it does not find non-termination witnesses for programs with several loops, where the non-terminating path first leads through several iterations of other loops before it ends in a non-terminating loop.

To handle such programs as well, we have now developed a novel second approach for proving non-termination, which uses our tool LoAT in the backend. To understand how LoAT finds non-termination proofs, consider the function f in Fig. 2. The first loop decrements x as long as x is positive and increments y by the same amount. Afterwards, the second loop does not terminate if y is greater than 1. Hence, the function f does not terminate if the initial value of the parameter x is greater than 1. LoAT can detect such dependencies in the corresponding ITS (Fig. 3a) generated by AProVE. To this end, LoAT uses different forms of loop acceleration: Finite acceleration combines several iterations of a looping rule into a new rule. LoAT applies this simplification to the rule r<sub>1</sub> representing the first loop, resulting in the new rule r<sub>4</sub> in Fig. 3b. In the second looping rule r<sub>3</sub>, the guard is invariant w.r.t. the update of the variables in this rule. In such a case, LoAT applies non-terminating acceleration, transforming r<sub>3</sub> to r<sub>5</sub>. Finally, chaining represents the successive execution of two rules. For example, the rule r<sub>6</sub> is the result of chaining r<sub>0</sub> and r<sub>4</sub>. The exact simplification steps performed by LoAT in this example are shown in Fig. 3c. Note that the final rule r<sub>8</sub> starts from the initial function


#### Fig. 2: Example C Function

$$\begin{aligned} r_0 &\colon f(x, y) &\to \ell_1(x, 0) \\ r_1 &\colon \ell_1(x, y) &\to \ell_1(x - 1, y + 1) & \quad [x > 0] \\ r_2 &\colon \ell_1(x, y) &\to \ell_2(x, y) & \quad [x \le 0] \\ r_3 &\colon \ell_2(x, y) &\to \ell_2(x, y) & \quad [y > 1] \end{aligned}$$

Fig. 3a: Corresponding ITS


Fig. 3b: Simplified Rules

symbol and directly goes to non-termination. Every variable assignment satisfying the respective final guard x > 1 results in a non-terminating run.

The simplification tree in Fig. 3c is also the starting point for our new technique to generate non-termination witnesses. AProVE constructs this tree from LoAT's proof output. Then, by processing the leaves of the simplification tree from left to right, a path through the SEG can be derived. To determine how often one has to traverse earlier loops on the path to the non-terminating loop, AProVE uses an SMT solver

Fig. 3c: Simplification Tree

to find a concrete variable assignment that satisfies the final guard. In our example, the final guard x > 1 would be satisfied by {x = 2, y = 0}, for example. Consequently, the corresponding concrete execution path includes two iterations of the first loop before reaching the non-terminating second loop.

Once the path is constructed, AProVE extracts the LLVM program positions from the states, obtaining a non-terminating path through the LLVM program in form of a lasso. Using the Clang debug information output, AProVE then matches the LLVM lines to the lines in the C program. The resulting C witness can be validated by the tools CPAchecker [5] and Ultimate Automizer [12].

#### 2 Discussion of Strengths and Weaknesses

In general, AProVE is especially powerful on programs where a precise modeling of the values of program variables and memory contents is needed to (dis)prove termination. However, on large programs containing many variables which are not relevant for termination, tools with CEGAR-based approaches are often faster. The reason is that AProVE does not implement any techniques to decide which variables are relevant for (non-)termination.

Furthermore, one of AProVE's most crucial weaknesses when proving non-termination in past editions of SV-COMP was to produce a meaningful witness. Therefore, in the two approaches for proving non-termination in AProVE that are based on T2 or on the direct analysis of lassos of the SEG, we added the novel techniques presented in the current paper to generate non-termination witnesses from the obtained variable assignments. Here, the problem is that when computing a concrete execution path, we cannot be sure when to stop the computation: Whenever we visit a program position repeatedly, we do not know if this position is part of the non-terminating loop of the lasso, or if it is still part of the finite path to the non-terminating loop.

In contrast, in our new approach based on LoAT, the simplification tree allows us to infer the order in which the loops of the program are traversed, and this tree also contains the information about which loop is the non-terminating one. Thus, this approach extends AProVE's power substantially, since it can find non-termination witnesses for programs where all non-terminating paths lead through several iterations of more than one loop. On the other hand, there are also examples where the other two approaches outperform the approach based on LoAT, e.g., if T2 finds a non-termination proof and LoAT does not. Our observation is that especially for small programs containing only a single loop, the other approaches are often faster. This is also confirmed by our results in the Termination category of SV-COMP 2022: while in the sub-categories MainControlFlow and MainHeap, 83% of the non-termination proofs are found using T2 or the direct SMT approach, in Termination-Other, 95% of the non-termination proofs result from the LoAT approach. This set consists of especially large programs, which often contain more than one loop.

More information about SV-COMP 2022 including the competition results can be found in the competition report [3].

### 3 Setup and Configuration

AProVE is developed in the "Programming Languages and Verification" group headed by J. Giesl at RWTH Aachen University. On the web site [2], AProVE can be downloaded or accessed via a web interface. Moreover, [2] also contains a list of external tools used by AProVE and a list of present and past contributors.

In SV-COMP 2022, AProVE only participates in the category "Termination". All files from the submitted archive must be extracted into one folder. AProVE is implemented in Java and needs a Java 11 Runtime Environment. Moreover, AProVE requires the Clang compiler [7] to translate C to LLVM. To analyze the resulting ITSs in the backend, AProVE uses LoAT [11] and T2 [6]. Furthermore, it applies the satisfiability checkers Z3 [8], Yices [9], and MiniSAT [10] in parallel (our archive contains all these tools). As a dependency of T2, Mono [16] (version ≥ 4.0) needs to be installed. The PATH environment variable must be extended so that AProVE can find these programs. Using the wrapper script aprove.py in the BenchExec repository, AProVE can be invoked, e.g., on the benchmarks defined in aprove.xml in the SV-COMP repository. The most recent version of AProVE with the improved witness generation can be downloaded at [1].

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [3] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [4].

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# BRICK: Path Enumeration Based Bounded Reachability Checking of C Program (Competition Contribution)<sup>?</sup>

Lei Bu() , Zhunyi Xie, Lecheng Lyu, Yichao Li, Xiao Guo, Jianhua Zhao, and Xuandong Li

State Key Laboratory for Novel Software Technology, Nanjing University, China bulei@nju.edu.cn

Abstract. BRICK is a bounded reachability checker for embedded C programs. BRICK performs path-oriented checking of the bounded state space of the program, enumerating and checking all possible paths within the threshold one by one. To alleviate the path explosion problem, BRICK locates and records unsatisfiable-core path segments during the checking of each path and uses them to prune the search space. Furthermore, derivative-free-optimization-based falsification and loop induction are introduced to handle complex program features like nonlinear path conditions and loops efficiently.

# 1 Verification Approach

Existing bounded software checkers usually encode the bounded state space of the program directly into one constraint solving problem. However, in this manner, when the size of the program or the bound of the checking increases, the corresponding constraint solving problem explodes quickly and becomes difficult for existing SAT/SMT solvers to solve.

To solve this problem, BRICK performs path-oriented checking of the bounded state space of the program, enumerating and checking all possible paths within the threshold one by one [1,2]. The main merit of this approach is that the size of the problem that needs to be solved by the constraint solver is well controlled and can be handled easily. The main features of BRICK are described below:

#### 1.1 Flexible Path Enumeration

BRICK enumerates potential paths from the control flow graph (CFG) of the given program to the user-defined step bound. Two path enumeration strategies are applied in BRICK, each with its own advantages.

<sup>?</sup> This work is supported in part by the National Key Research and Development Plan (No. 2017YFA0700604), the Leading-Edge Technology Program of Jiangsu Natural Science Foundation (No. BK20202001), and the National Natural Science Foundation of China (No.62172200, No.61632015).

First, we can simply conduct classical depth-first search (DFS) to enumerate program paths. The benefit of this approach is that, if the DFS stops without touching the given bound, we can conclude that the target state is unreachable in general, not only in the bounded state space.
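The DFS strategy can be sketched on a toy CFG given as an adjacency map (an illustrative Python model, not BRICK's C++ implementation):

```python
def enumerate_paths(cfg, entry, target, bound):
    # Bounded DFS: yield every path from entry to target whose length
    # stays within the step bound.
    stack = [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) > bound:
            continue
        for succ in cfg.get(node, []):
            stack.append((succ, path + [succ]))

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
paths = sorted(enumerate_paths(cfg, "A", "D", bound=4))
# paths == [["A", "B", "D"], ["A", "C", "D"]]
```

If such a search terminates without ever cutting off a path at the bound, the result carries over to the unbounded program, as noted above.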

We have also implemented a special method that encodes the jump-to relation between different code blocks into a SAT formula and obtains a potential path by SAT solving. The benefit is that if the potential path is found to be infeasible by the subsequent path condition solving, the infeasible path segment in the path can be located and encoded back into the SAT formula to prune all future paths containing such an infeasible segment.

#### 1.2 Infeasible Path Segment Pool Guided State Space Pruning

BRICK performs lazy solving of the path by encoding the path condition of the potential path into a feasibility problem. BRICK asks a constraint solver, e.g., an SMT solver (Z3 [6]), interval analysis (dReal [4]), or derivative-free-optimization-based solving (Section 1.3), to solve the problem. If the path is decided to be infeasible by the solver, BRICK tries to extract the unsatisfiable core (UC) of the feasibility problem of this path, and maps the UC constraints to an infeasible path segment in the path, which is added to the infeasible path pool. After that, all paths that contain any segment in the infeasible path pool are reported as unreachable directly during the following path enumeration.
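The pruning step can be sketched as follows (a simplified model in which paths and segments are lists of block names and the pool holds segments learned from UNSAT cores; the names are made up for the example):

```python
def contains_segment(path, segment):
    # True iff segment occurs as a contiguous subsequence of path.
    m = len(segment)
    return any(path[i:i + m] == segment for i in range(len(path) - m + 1))

infeasible_pool = [["B", "D"]]  # learned from a previous UNSAT core

def is_pruned(path):
    # A path containing any recorded infeasible segment is unreachable
    # and can be skipped without calling the constraint solver.
    return any(contains_segment(path, seg) for seg in infeasible_pool)

print(is_pruned(["A", "B", "D", "E"]))  # True: pruned outright
print(is_pruned(["A", "C", "D"]))       # False: must still be solved
```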

#### 1.3 Derivative-free Optimization Based Constraint Falsification

We can see that constraint solving plays an important role in BRICK. However, complex path conditions, like nonlinear constraints, which appear widely in programs, are hard for existing solvers to handle efficiently. In BRICK, a classification-model-based derivative-free-optimization (DFO) approach is used to alleviate this difficulty by conducting sample-feedback-learn style DFO solving [8].

More specifically, the underlying solver guesses a sample solution for the feasibility problem. Then, we evaluate whether the sampled solution satisfies the path constraint, and if not, calculate the distance between the sampled solution and a correct one. This distance is used as feedback in the classification-based DFO learning, to guide the solver toward a value that satisfies the path constraint. In practice, this approach works very well for nonlinear problem solving. However, this DFO-based approach cannot conclude that the target is unreachable if it fails to find a solution.
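The sample-feedback-learn loop can be sketched as a heavily simplified random local search (not BRICK's actual RACOS implementation; the nonlinear condition x² > 50 and the distance function are illustrative):

```python
import random

def dfo_falsify(constraint, distance, dim, iters=2000, seed=0):
    # Sample near the best point so far; keep candidates that reduce
    # the distance to satisfying the constraint (the feedback step).
    rng = random.Random(seed)
    best = [rng.uniform(-10, 10) for _ in range(dim)]
    for _ in range(iters):
        cand = [b + rng.uniform(-1, 1) for b in best]
        if constraint(cand):
            return cand          # witness: path condition satisfiable
        if distance(cand) < distance(best):
            best = cand          # feedback: move toward the constraint
    return None                  # inconclusive: cannot prove unreachability

# Nonlinear path condition x*x > 50, distance max(0, 50 - x*x).
sol = dfo_falsify(lambda p: p[0] * p[0] > 50,
                  lambda p: max(0.0, 50 - p[0] * p[0]),
                  dim=1)
```

The `return None` branch mirrors the limitation stated above: failure to find a solution proves nothing.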

#### 1.4 Induction-based Loop Handling

If the target program contains loops, the number of potential paths may explode. To alleviate this problem, we conduct an induction-based proof to try to handle the loops before starting the BMC.

First of all, we collect the constraints from the assertions and generate the corresponding weakest preconditions. Then, we conduct a standard induction-based proof to check whether these constraints are satisfied in every iteration. If no counterexamples are returned, we know that the assertions won't be violated in the loop. Furthermore, we are also working on the integration of loop invariant generation to further refine the CFG under checking.

# 2 Software Architecture

The architecture of BRICK is shown in Fig. 1. It consists of a loop processing module, a path enumerating module, and a constraint solving module, all implemented in C++.

In the loop processing module, if the program contains an assertion-related loop, BRICK first conducts loop-induction-based verification. If the induction succeeds, BRICK reports unreachable; otherwise, it builds the program CFG and performs the subsequent path-enumeration-based checking.

In the path enumerating module, BRICK employs SAT-based and DFS-based path enumeration methods to extract a program path and its corresponding path condition. The constraint solving module accepts the path condition and performs constraint solving accordingly. All the techniques used have been described in Section 1. The solvers used in BRICK include the SAT solver MiniSAT [3], the SMT solver Z3 [6], the interval analysis solver dReal [4], and our implementation of the DFO method RACOS [9].

Fig. 1. Architecture of BRICK

### 3 Strengths and Weaknesses

Most bounded reachability checkers, e.g., CBMC [5], encode the bounded state space into one huge SMT formula consisting of both conjunctions and disjunctions of different kinds of formulas, which is difficult for existing solvers to handle and can easily cause memory explosion. Instead, BRICK conducts the verification in a path-oriented way:


BRICK participated in the ReachSafety/Floats category of SV-COMP 2022 [10]. It successfully verified 439 of the 469 tasks, ranking 1st in this sub-category. Furthermore, for these 439 solved cases, BRICK used only 1000 seconds in total. In comparison, CoveriTeam and VeriAbs [7], which won 2nd and 3rd place in this category, spent 9300 and 18000 seconds respectively, about 9 and 18 times more than BRICK.

As for weaknesses, like all other bounded checkers, BRICK may not be able to give a proof of correctness of a program if it cannot finish the search within the given step bound. In this case, BRICK can only report bounded true. For example, on the cases of SV-COMP 2022, besides the 439 cases proved by BRICK, there are also several programs for which BRICK can only give a bounded result or simply times out. Therefore, as future work, we are implementing techniques such as loop summarization and k-induction to abstract the loops and give a proof of correctness in certain cases.

# 4 Tool Setup and Configuration

The binary file of BRICK for Ubuntu 20.04 is available at https://github.com/brick-tool-dev/BRICK-2.0. To install the tool, please clone this repository and follow the instructions in README.md. A tailored version of BRICK took part in the ReachSafety/Floats category in SV-COMP 2022 [10]. This version [11] supports checking the reachability of the error function. The BenchExec wrapper script for the tool is brick.py, and brick.xml is the benchmark description file.

# 5 Software Project and Contributors

BRICK is available under the MIT License. The team of BRICK is from the Software Engineering Group, Nanjing University. We would like to thank Sicun Gao for his kind help with the usage of dReal.

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [10] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [11].

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# A Prototype for Data Race Detection in CSeq 3<sup>⋆</sup> (Competition Contribution)

Alex Coto, Omar Inverso, Emerson Sales, and Emilio Tuosto

Gran Sasso Science Institute, L'Aquila, Italy {alex.coto,omar.inverso,emerson.sales,emilio.tuosto}@gssi.it

Abstract. We sketch a sequentialization-based technique for bounded detection of data races under sequential consistency, and summarise the major improvements to our verification framework over the last years.

Keywords: Bounded model checking · Context-bounded analysis · Sequentialization · Data races · Reachability · Concurrency · Threads

# 1 Verification Approach

Our approach is based on lazy sequentialization [7]. The idea is to convert the concurrent program P of interest into a non-deterministic sequential program Qu,k that preserves all feasible executions of P up to unwinding bound u and k rounds (or execution contexts [8]). Among different techniques [6], we choose bounded model checking [3] to analyse Qu,k. In this section, we briefly overview lazy sequentialisation and sketch a novel extension to detect data races. Further elements of novelty w.r.t. the engineering of our tool are discussed in the next section.

Lazy Sequentialization. We unwind all loops and inline all functions in P, except the main function and those from which a thread is spawned, obtaining a bounded program P<sup>u</sup> that preserves all feasible executions of P up to the unwinding bound u. We then transform each function of P<sup>u</sup> into a thread simulation function where each visible statement is assigned a numerical label and a guard, and each call to a concurrency-specific function is replaced by a call to a function that models the same intended semantics; for each simulation function, we add a global variable to represent the program counter, initially set to zero.

A thread's execution context of P<sup>u</sup> is simulated by invoking the corresponding thread simulation function of Qu,k, which executes from the first statement to a non-deterministically selected label, updates the program counter, and returns. Further execution contexts are simulated by re-invoking the simulation function, where the guards ensure that control is repositioned to the correct numerical label via a sequence of jumps, and so on. To retain consistency of the local state of the thread across different invocations of the simulation functions, static storage

<sup>⋆</sup> This work has been partially funded by MIUR project PRIN 2017FTXR7S IT-MATTERS and MUR project FISR2020IP 05310 MVM-Adapt.

is enforced for all local variables. We drive the overall simulation of P<sup>u</sup> from the main function of Qu,k by invoking the thread simulation functions appropriately.
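The simulation scheme can be sketched in a toy interpreter (a Python model of the idea, not the C code CSeq actually generates; labels are statement indices and `run_context` plays the role of a thread simulation function):

```python
def make_thread_sim(stmts):
    # Static storage: the program counter survives across invocations,
    # so the thread resumes where the previous execution context stopped.
    state = {"pc": 0}
    def run_context(stop_label, shared):
        for label in range(state["pc"], stop_label):
            stmts[label](shared)       # execute one visible statement
        state["pc"] = stop_label       # remember the context-switch point
    return run_context

shared = {"v": 0}
thread = make_thread_sim([
    lambda s: s.update(v=s["v"] + 1),   # label 0
    lambda s: s.update(v=s["v"] * 10),  # label 1
])
thread(1, shared)  # first execution context: run up to label 1
thread(2, shared)  # a later invocation resumes at label 1
# shared["v"] == 10
```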

Data Race Detection. A program contains a data race if it can execute two conflicting actions (i.e., one thread modifies a memory location and another one reads or modifies the same location), at least one of which is not atomic, and neither happens before the other [9]. Consider two threads performing the operation v = v + 1 on a shared variable initialised to zero. Both threads try to modify the data at the memory location reserved for v, but the necessary sequences of memory accesses are not synchronised, and thus may interleave. If a context switch happens between the memory read and write operations in the thread that runs first, both threads will read 0, and at the end of the execution the value of v will be 1. To detect such a situation, we alter the encoding from P<sup>u</sup>

```
k:
  void *w_addr = &v;
  assert(w_addrs[1] != w_addr);
  w_addrs[0] = w_addr;
  v = v + 1;
k+1:
  w_addrs[0] = 0;
```
to Qu,k by (i) adding a shared array w_addrs that stores a pointer to the memory location targeted by a write operation for each thread, (ii) injecting additional control code at each visible statement, and (iii) splitting the modified sequentialised encoding of the visible statement into two separate

sequentialised statements to allow in-between context switching. The code fragment shows the modified sequentialised encoding (no guards for simplicity, injected code greyed out) for the statement v = v + 1 of the first thread of the program described above. We store in w_addr the address of the variable being written, and then assert that the other thread is not writing to the same location; in the same (simulated) execution context, we store w_addr in w_addrs, so that the assertion can be checked within the other thread too. We reset w_addrs right after the statement under consideration. Note the label k+1 that allows thread pre-emption. Now, one of the threads can execute the simulated statement at label k and context-switch at label k+1 while w_addrs still points to v; this makes it possible to schedule the other thread and fail the assertion there.

In the general case, handling multiple memory write accesses within a single statement requires a slightly different tracking mechanism for write addresses, or decomposition into simpler statements. Statements with read-only shared memory access are handled without updating w_addrs. Programs with more than two threads require multiple assertions.
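The lost-update scenario above can be reproduced in a toy interleaving model where v = v + 1 is split into an explicit read and write (illustrative only, not the sequentialised encoding itself):

```python
def simulate(schedule):
    # Execute an explicit interleaving of per-thread read/write halves
    # of "v = v + 1"; regs holds each thread's local copy of v.
    v = 0
    regs = {}
    for tid, op in schedule:
        if op == "read":
            regs[tid] = v
        else:  # "write"
            v = regs[tid] + 1
    return v

safe = simulate([(0, "read"), (0, "write"), (1, "read"), (1, "write")])
racy = simulate([(0, "read"), (1, "read"), (0, "write"), (1, "write")])
# safe == 2, but the racy interleaving loses an update: racy == 1
```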

# 2 Software Architecture

CSeq is a framework for quick development of static analysis and program transformation prototypes. For parsing the input program, CSeq relies on pycparserext (pypi.org/project/pycparserext), an extension of pycparser (github.com/eliben/pycparser), which in turn is built on top of PLY (www.dabeaz.com/ply), a Python implementation of Lex and Yacc. All the mentioned components, as well as CSeq, are entirely written in Python.

We combined several groups of modules in CSeq, namely (i) program simplification, (ii) program unfolding, (iii) sequentialization, (iv) instrumentation, and (v) backend invocation and counterexample generation. For the analysis of the sequentialised program we rely on CBMC (www.cprover.org/cbmc), which in turn embeds the DPLL-style MiniSat SAT solver (minisat.se).

CSeq 3.0 incorporates a significant number of enhancements. At an architectural level, the main element of novelty is the modularity between the general-purpose functionalities of the framework and the specific lazy sequentialization, which opens up the possibility of prototyping different static analysers for other applications (e.g., [11,10]) as well as improving older sequentialization-based prototypes (e.g., [4,12,13] and variations thereof). The enhancements to the framework include: Python 3 support, support for GNU C compiler extensions, a fully re-implemented symbol table, revised general-purpose modules such as constant propagation, function inlining, and loop unrolling, and a custom-built version of CBMC (not used in the competition) for SAT solving under assumptions. For the competition we include (experimental) enhanced constant propagation and simplified function inlining. Besides the data race checking extension, the sequentialization modules include improvements from earlier implementations [5,8,6] and from different editions of SV-COMP up to date, in particular: extended pthread API support (conditional waiting, barriers, and thread-specific data management), context-bounded analysis, and a major code overhaul.

#### 3 Strengths and Weaknesses

The table below summarises the performance of our tool on the 764 cases of the Concurrency category and the 162 cases of the data race demo category.


Our technique excels at hunting bugs, as shown by the number of correct unsafe results (incl. 17 malformed witnesses and 50 unconfirmed witnesses), but it quickly gets expensive with larger bounds, hitting the resource limits. The additional context-switch points and the use of pointers for data race detection introduce further overhead. The other failures are due to limiting assumptions or glitches in the implementation. All the false positives are due to corner cases in the encoding.

#### 4 Setup and Configuration

We competed in the ConcurrencySafety category and in the data race detection demo category. CSeq 3.0 is available at https://github.com/omainv/cseq/releases.

Installation instructions are in the README file within the package. A wrapper script (lazy-cseq.py) invokes CSeq up to three times, with the options -l lazy for lazy sequentialisation, --sv-comp to enable the required violation witness format, --atomic-parameters to assume atomic passing of function arguments, --nondet-condvar-wakeups for non-deterministic spurious condition-variable wake-up calls, --deep-propagation for experimental constant folding and propagation, --32 for 32-bit architectures, --threads 100 to limit the overall number of threads, --data-race-check when required, and --backend cbmc to use CBMC 5.4 for sequential analysis.

For reachability checking, on different invocations the script adds different parameters: -r2 -w2 -f2, -r4 -w3 -f5, and -r20 -w1 -f11, where r is the number of rounds, and f and w are the unwind bounds for for (i.e., potentially bounded) and while (i.e., potentially unbounded) loops, respectively; on the last invocation --softunwindbound and --unwind-for-max 10000 are also added to fully unfold for loops if a static bound can be found, up to the given hard bound. For data race detection, the above parameters are replaced with -c4 -u2, -c10 -u10, and -c50 -w20 -f20 with --unwind-for-max 100. Note that in this case the bound is on the number of execution contexts rather than rounds (-c vs. -r), and -u is used as a shorthand for -f and -w.

We let the analysis run to completion every time. When the result is TRUE, the script restarts the analysis with the next set of parameters. As soon as the script gets FALSE, it returns FALSE. Only if the analysis using the last set of parameters finishes and the result is TRUE does the script return TRUE.

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [1] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [2].

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Dartagnan: SMT-based Violation Witness Validation (Competition Contribution)

Hernán Ponce-de-León<sup>1</sup>()<sup>?</sup>, Thomas Haas<sup>2</sup>, and Roland Meyer<sup>2</sup>

<sup>1</sup>Bundeswehr University Munich, Munich, Germany <sup>2</sup>TU Braunschweig, Braunschweig, Germany hernan.ponce@unibw.de, t.haas@tu-braunschweig.de, roland.meyer@tu-bs.de

Abstract. The validation of violation witnesses is an important step during software verification. It hides false alarms raised by verifiers from engineers, which in turn helps them concentrate on critical issues and improves the verification experience. Until the 2021 edition of the Competition on Software Verification (SV-COMP), CPAchecker was the only witness validator for the ConcurrencySafety category. This article describes how we extended the Dartagnan verifier to support the validation of violation witnesses. The results of the 2022 edition of the competition show that, for witnesses generated by different verifiers, Dartagnan succeeds in the validation of witnesses where CPAchecker does not. Our extension thus improves the validation possibilities for the overall competition. We discuss Dartagnan's strengths and weaknesses as a validation tool and describe possible ways to improve it in the future.

# 1 Introduction

Most software verification tools report witnesses to property violations. Since SV-COMP 2015, there has been a common format in which witnesses are represented by automata [4]. Each edge of such an automaton is annotated with data that can be used to match program executions. A data annotation can be, e.g., "assumption", specifying constraints on values of variables in a given state, "control", specifying the outcome of a branch condition, or "startline", specifying a concrete line in the source code. More details about data annotations and their semantics can be found in the exchange format documentation [1].

A witness validator checks that a violation can be reproduced using the information provided by the witness. Automata-based verifiers can easily be converted into validators by analyzing the synchronized product of the program with the witness automaton. In this setting, the witness automaton guides the verifier. If none of the outgoing edges on the program state match the next edge of the witness automaton, then the verifier cannot explore the current path further. If the edge on the program state matches, then the witness automaton and the program proceed to the next state, eventually leading to a violation.
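For an automata-based verifier, this guided exploration amounts to a reachability check on the synchronized product, which can be sketched as follows (a toy model with explicit edge labels; real witness edges match on richer data annotations):

```python
def product_reaches(prog, wit, p0, w0, bad):
    # Explore the synchronized product of a program automaton and a
    # witness automaton (both: state -> list of (label, next_state)).
    # A program edge may only be taken if the witness has a matching
    # labeled edge; return True iff a `bad` program state is reachable.
    seen, stack = set(), [(p0, w0)]
    while stack:
        p, w = stack.pop()
        if p == bad:
            return True
        if (p, w) in seen:
            continue
        seen.add((p, w))
        for lbl, p2 in prog.get(p, []):
            for wl, w2 in wit.get(w, []):
                if lbl == wl:
                    stack.append((p2, w2))
    return False

prog = {"a": [("x=1", "b"), ("x=2", "c")], "c": [("fail", "err")]}
wit = {"q0": [("x=2", "q1")], "q1": [("fail", "q2")]}
reached = product_reaches(prog, wit, "a", "q0", "err")  # True
# The witness rules out the x=1 branch and steers the search to "err".
```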

<sup>?</sup> Jury member.

While this idea allows one to easily convert any automata-based verifier into a validator, not all verifiers are automata-based.

Dartagnan is an SMT-based verifier. In the next section, we explain how to convert it into a validator. The idea is to extract information from the witness and use it to reduce the search space explored by the backend SMT solver.

#### 2 Validation Approach

Given a concurrent program and a specification in the form of assertions, Dartagnan generates an SMT formula ϕ<sub>Ver</sub> = ϕ<sub>Cf</sub> ∧ ϕ<sub>Df</sub> ∧ ϕ<sub>Sc</sub> ∧ ϕ which is satisfiable if and only if some assertion fails [17,16]. The formulas ϕ<sub>Cf</sub> and ϕ<sub>Df</sub> encode (respectively) the control flow and the data flow of the program. Formula ϕ<sub>Sc</sub> encodes scheduling constraints. Finally, ϕ expresses that at least one assertion must fail. If the formula is satisfiable, then a violation exists. The goal of Dartagnan (as a verifier) is to find such a violation. This amounts to finding an appropriate scheduling among the threads. Such a scheduling is encoded as a happens-before relation between the instructions. Dartagnan thus searches the space of all viable happens-before relations to find a violation or prove that none exists.

We now explain how to extend Dartagnan into a violation witness validator. The idea is to extract from the violation witness a formula ϕ<sup>Y</sup> that we conjoin with the rest of Dartagnan's encoding, resulting in ϕ<sub>Val</sub> = ϕ<sub>Ver</sub> ∧ ϕ<sup>Y</sup>. The extra constraints in ϕ<sup>Y</sup> reduce the search space for the SMT solver. For the verification of concurrent programs taking inputs from the environment, there are two sources of non-determinism: the data coming from the input (which might influence the control flow) and the scheduling. The purpose of ϕ<sup>Y</sup> is to reduce this non-determinism. Extending the SMT encoding as described for ϕ<sub>Val</sub> is conceptually easy. The interesting question is "what information from the witness shall we use?" The less information we use, the more we move from pure validation to full verification.
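The effect of ϕ<sup>Y</sup> can be illustrated by enumerating happens-before orders explicitly (a brute-force stand-in for the SMT search; the instruction names and constraints are made up for the example):

```python
from itertools import permutations

def viable_schedules(instrs, constraints):
    # Enumerate total orders over the instructions and keep those
    # satisfying all constraints; each pair (a, b) means "a happens
    # before b". Witness constraints simply add more pairs.
    out = []
    for order in permutations(instrs):
        pos = {i: k for k, i in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in constraints):
            out.append(order)
    return out

instrs = ["w1", "w2", "r1"]
phi_ver = [("w1", "r1")]           # program-derived constraints
phi_y = phi_ver + [("w2", "w1")]   # witness adds scheduling information
# Adding the witness constraint shrinks the search space from 3
# viable orders down to 1.
```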

While automata-based validators can use some information in a straightforward manner, this is not the case for Dartagnan.


The exchange format for violation witnesses allows for expressing information about state assumptions, the control flow, and the scheduling. We abstract away from the former two and use only scheduling information. We assume that witness automata represent a single path and that the edges contain "startline" data corresponding to read or write instructions<sup>1</sup>. Those are the only instructions

<sup>1</sup> Our validator accepts witnesses that do not satisfy the second assumption, but it filters out the corresponding edges.

that can affect our happens-before relation. While we do not explicitly encode the outcome of control-flow instructions, certain control-flow information is implicitly encoded based on which instructions are executed. We explain the reasons behind these design decisions and assumptions, discuss their limitations, and describe how we plan to improve on them in Section 3. Despite these limitations, and as we show in Section 4, our validator performs well in practice.

Let (S, E) be a witness automaton with states S and edges E. For each e ∈ E, the function e2i(e) returns the set of read or write instructions at the "startline" in the C file that corresponds to the given edge. Since witnesses represent single paths, they can be seen as words over S. Let w ∈ S<sup>∗</sup> be a witness; we define the witness-to-formula function which constructs ϕ<sup>Y</sup> as follows:

w2f(w) = true, if w = ε
w2f(w) = w2f(w′) ∧ ⋁ { happens-before(i1, i2) : i1 ∈ e2i((·, s)), i2 ∈ e2i((s, ·)) }, if w = s · w′

Here (·, s) and (s, ·) denote the incoming and outgoing edge of state s on the witness path.

#### 3 Strengths and Weaknesses

The main strengths of our validation approach are simplicity and modularity. The approach just requires adding a new sub-formula to the SMT encoding used for verification. The validator is modular in the sense that using more or different information from the witness does not change the validation approach. For example, adding information from the witness about the control flow just requires adding more constraints to ϕ<sup>Y</sup>.

Our validation approach assumes that witness automata represent single paths. This is a limitation not imposed by the exchange format. However, verifiers tend to stop as soon as they find one violation and thus generate witnesses representing a single violation path. A second limitation is that we do not explicitly consider control-flow information. This might impact the performance of the validation since not all non-determinism is removed and the search space might still be large. Converting such control-flow information into SMT is simple in principle. However, since Dartagnan internally converts the C program into Boogie [15], matching conditionals with the corresponding assembly-like jumps requires some work. A second consequence of not extracting control-flow information from the witness is that we might validate witnesses that do not lead to a violation. This is because we over-approximate the paths of the program represented by the witness and thus our approximation might include the path leading to the violation even if the witness did not.

### 4 Validation Results

We inspected the results of SV-COMP 2022 [5] to answer the following questions:

RQ1: What percentage of the witnesses can Dartagnan validate?
RQ2: What percentage can Dartagnan not validate, and why?
RQ3: Can Dartagnan validate witnesses that CPAchecker cannot?
RQ4: Can CPAchecker validate witnesses that Dartagnan cannot?

From the 20 verifiers in ConcurrencySafety, we selected five tools implementing different verification approaches. We consider them good representatives of the whole category: (i) CBMC [13] (used as a backend by Deagle [9] and Lazy-CSeq [11]), (ii) CPAchecker [7] (used as a backend by CPA-Lockator [3] and Graves [14]), (iii) EBF [2] (combines BMC with fuzzing, a very effective technique to find bugs), (iv) Dartagnan [17] (only tool where the memory model, here sequential consistency, is taken as an input), and (v) Gem-Cutter [12] (shares the codebase with UTaipan [8] and UAutomizer [10]).

Table 1 presents the results of the validation in SV-COMP 2022. We report the number of witnesses generated by each verifier ("Witnesses"). For each of the validators (columns "Dartagnan" and "CPAchecker"), we report the number of cases where the validation conclusively finished (i.e., it returned True or False), whether the violation was confirmed (left of "/") or not (right of "/"), and the number of correct validations by one tool where the other did not report a result (columns "Dart \ CPA" and "CPA \ Dart", respectively).


Table 1. Results of the validation in SV-COMP 2022.

For the SMT-based verifiers CBMC and EBF, Dartagnan achieves success rates of 63.28% and 75.52%, respectively, in the validation (against 31.15% and 19.66% for CPAchecker). Unfortunately, it did not validate any of the witnesses generated by CPAchecker. This was due to a bug in the witness parser, which was identified and fixed after the competition. CPAchecker validated all the witnesses that it generated as a verifier. Dartagnan validated 89.74% of its own witnesses, while CPAchecker validated only 12.82%. For GemCutter, the validation success of Dartagnan is only 6.02%. This is because, due to another bug, it wrongly marked 237 witnesses as not validated; the fixed version of Dartagnan is able to validate all such cases. Despite this, of the 18 witnesses that Dartagnan validated, 15 were not validated by CPAchecker, thus improving the validation possibilities for the overall competition.

### 5 Software Project and Configuration

The project home page is https://github.com/hernanponcedeleon/Dat3M. To run Dartagnan as a validator, use the following command:

\$ Dartagnan-SVCOMP.sh -witness <witness> <property> <program>

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [5] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [6].

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Deagle: An SMT-based Verifier for Multi-threaded Programs (Competition Contribution) ?

Fei He<sup>1,2,3</sup>(B), Zhihang Sun<sup>1,2,3</sup>, and Hongyu Fan<sup>1,2,3</sup>

<sup>1</sup> School of Software, Tsinghua University, Beijing, China <sup>2</sup> Key Laboratory for Information System Security, MoE, Beijing, China <sup>3</sup> Beijing National Research Center for Information Science and Technology, Beijing, China

Abstract. Deagle is an SMT-based multi-threaded program verification tool. It is built on top of CBMC (front-end) and MiniSAT (back-end). The basic idea of Deagle is to integrate into the SMT solver an ordering consistency theory that handles ordering relations over the shared variable accesses in the program. The front-end encodes the input program into an extended propositional formula that contains ordering constraints. The back-end is reinforced with a solver for the ordering consistency theory. This paper presents the basic idea, architecture, installation, and usage of Deagle.

Keywords: Program verification · Satisfiability modulo theories · Concurrency.

# 1 Verification Approach

Given a multi-threaded program, the thread communication behaviors can be modeled using the happens-before relations over memory access (read/write) events [1]. There are various kinds of happens-before relations: program order (PO), read-from order (RF), write serialization order (WS), and from-read order (FR). A happens-before ordering formula (abbreviated as ordering formula) is a logical formula that involves only memory access events and happens-before relations.
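To give an intuition for why ordering formulas deserve dedicated treatment: a conjunction of happens-before facts is consistent exactly when the graph it induces over events is acyclic, which can be decided with a topological sort. The sketch below illustrates only this intuition; it is not Deagle's actual decision procedure:

```python
from collections import defaultdict

def hb_consistent(events, hb_edges):
    """Happens-before facts are consistent iff the event graph
    they induce is acyclic (checked with Kahn's algorithm)."""
    indegree = {e: 0 for e in events}
    successors = defaultdict(list)
    for a, b in hb_edges:              # "a happens before b"
        successors[a].append(b)
        indegree[b] += 1
    worklist = [e for e in events if indegree[e] == 0]
    visited = 0
    while worklist:
        e = worklist.pop()
        visited += 1
        for f in successors[e]:
            indegree[f] -= 1
            if indegree[f] == 0:
                worklist.append(f)
    return visited == len(events)      # all events ordered => no cycle
```

A theory solver additionally has to produce explanations for inconsistencies and propagate implied orderings, which is where the real algorithmic work lies.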

Deagle is an SMT-based multi-threaded program verifier. It consists of a front-end (built on CBMC) that encodes the input program into an extended propositional formula containing ordering constraints, and a back-end (built on MiniSAT) reinforced with a solver for the ordering consistency theory.


<sup>?</sup> This work was supported in part by the National Key Research and Development Program of China (No. 2018YFB1308601) and the National Natural Science Foundation of China (No. 62072267 and No. 62021002).

Compared with [8]: The theory solver in [8] uses a from-read axiom to derive FR orders. Besides the from-read axiom, Deagle also implements a writeserialization axiom [11], with which WS orders can also be derived. In return, the front-end of Deagle need not encode both FR and WS orders explicitly.

# 2 Software Architecture

Deagle is developed on top of CBMC [9] and MiniSAT [6] using C++. Additionally, for ease of usage and debugging, Deagle reuses some modules developed in Yogar-CBMC [10,11]. Deagle is not a strategy selection-based verifier. Deagle runs the following procedures successively to verify a given C program:

Preprocessing (from Yogar-CBMC) For each global structure variable in the C program, the preprocessing procedure unfolds it by creating a fresh variable for each member. Note that arrays need no preprocessing; CBMC is able to handle each array as an entity.
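A minimal sketch of this unfolding, with globals modeled as a dictionary and an assumed `name__member` naming scheme (not necessarily the one Yogar-CBMC uses):

```python
def unfold_structs(globals_):
    """Unfold each global struct (here: a dict of members) into
    one fresh scalar per member; scalars and arrays stay as-is.
    The name__member scheme is illustrative."""
    unfolded = {}
    for name, value in globals_.items():
        if isinstance(value, dict):          # a struct variable
            for member, init in value.items():
                unfolded[f"{name}__{member}"] = init
        else:
            unfolded[name] = value
    return unfolded
```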

Parsing and Goto-Program Generation (originally in CBMC) CBMC employs Flex and Bison to transform the preprocessed C program into an abstract syntax tree (AST). Then CBMC builds a goto program, where all branching statements and loop statements are represented with (conditional) goto statements.

Library Function Modeling (extended from CBMC) CBMC models each multithreading-related library function (e.g., pthread\_cond\_wait). For example, mutex m contains a Boolean variable m\_locked indicating whether m is locked; pthread\_mutex\_lock(&m) assumes m\_locked to be false initially and sets m\_locked to true. Based on CBMC, we extend Deagle to support the modeling of more library functions.

Unwinding We employ bounded model checking (BMC) [3,4,5] to handle loops. If the program contains loops, we determine an unwinding limit and unwind the program into a loop-free bounded program:

– If the maximal loop time of the program can be determined through static analysis, e.g.,

$$for \ (i = 0; i < 10; i++)$$

we set the unwinding limit to this maximal loop time;

– If the maximal loop time depends on non-determinism, e.g.,

$$for \ (i = 0; i < n; i++)$$

where n is obtained from the function \_\_VERIFIER\_nondet\_int, we report UNKNOWN since such loops cannot be fully unwound.

– Otherwise, we set the unwinding limit to 2.
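The unwinding step itself can be pictured as executing at most `limit` guarded copies of the loop body over an explicit state; this is a generic BMC-style sketch, not Deagle's implementation:

```python
def unwind(cond, body, state, limit):
    """Unwind `while cond(state): state = body(state)` into at
    most `limit` guarded copies of the body (BMC-style)."""
    for _ in range(limit):
        if not cond(state):
            return dict(state, bound_hit=False)
        state = body(state)
    # In BMC, cond still holding here would trigger an unwinding
    # assertion; we merely record whether the bound was exhausted.
    return dict(state, bound_hit=cond(state))
```

For the `for (i = 0; i < 10; i++)` example, calling `unwind` with limit 10 fully exhausts the loop, so no unwinding assertion would fire.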

Formula Generation (extended from CBMC) After unwinding, the loop-free program is represented in the static single assignment (SSA) form, where each thread is a chain of assignments. These assignments can be directly modeled into first-order logic formulas (for ease of solving, we further convert them into propositional logic formulas). Additionally, an assignment may contain global memory access events; we model program orders and read-from orders (please refer to [8] for more information) of these events into the formulas.
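The SSA renaming of such an assignment chain can be sketched as follows, with each assignment given as a left-hand side and the list of variables its right-hand side reads:

```python
from collections import defaultdict

def to_ssa(assignments):
    """Rename a straight-line assignment chain into SSA form:
    every definition of x gets a fresh version x#k, and every
    use refers to the version that is current at that point."""
    version = defaultdict(int)
    ssa = []
    for lhs, rhs_vars in assignments:   # e.g. ("x", ["x", "y"])
        uses = [f"{v}#{version[v]}" for v in rhs_vars]
        version[lhs] += 1
        ssa.append((f"{lhs}#{version[lhs]}", uses))
    return ssa
```

After this renaming every variable is assigned exactly once, so each assignment can be read directly as a logical equality.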

Constraint Solving (extended from MiniSAT) We develop an ordering consistency theory solver and integrate it into the DPLL(T) framework [8]. For efficiency, we extend MiniSAT, a SAT solver, to run our theory solver exclusively. Please refer to [8] for the detailed algorithms of our decision procedure.

Witness Generation (adapted from Yogar-CBMC) If the back-end solver returns satisfiable (i.e., finds a counterexample violating the property), our ordering consistency theory solver reports a sequence (total order) of these events, which can be used for generating the witness of the counterexample.

# 3 Strengths and Weaknesses

Compared to the traditional method [1] which explicitly converts ordering formulas into propositional formulas, Deagle employs a dedicated theory solver to handle ordering formulas, which improves both time and space efficiency. Ignoring some tasks in goblint-regression that require unwinding 10000 times, Deagle reports TIMEOUT in only 9 tasks and OUT OF MEMORY in only 7 tasks – fewer than most ConcurrencySafety competitors.

In most weaver tasks (117 out of 169), the number of loop iterations is nondeterministic. As mentioned in the previous section, Deagle reports UNKNOWN in these cases. Since such tasks are common in real-world programs, we are exploring an approach to dealing with such programs in future work.

# 4 Tool Setup and Configuration

The source code of Deagle 1.3 (the version submitted to SV-COMP 2022 [2]) is publicly accessible<sup>4</sup>. Please refer to the README for installation instructions. In SV-COMP 2022, Deagle participates in the ConcurrencySafety category and only checks the property Unreach-Call<sup>5</sup>. By setting the parameters

`--32 --no-unwinding-assertions --closure`

one can reproduce Deagle's results of SV-COMP 2022.

<sup>4</sup> Deagle repository: https://github.com/thufv/Deagle

<sup>5</sup> The benchmark definition of Deagle: https://gitlab.com/sosy-lab/sv-comp/bench-defs/-/blob/main/benchmark-defs/deagle.xml

#### 4.1 Parameter Definition

Deagle inherits lots of parameters from CBMC. Due to the page limit, we only describe parameters related to the competition or newly added in Deagle:


# 5 Software Project

Deagle is developed by Fei He, Zhihang Sun, and Hongyu Fan from the Formal Verification Lab<sup>6</sup> at Tsinghua University. Deagle is licensed under GPLv3. Since Deagle is developed on top of CBMC and MiniSAT, and reuses some modules from Yogar-CBMC, it also contains code covered by the copyrights of those tools.

# 6 Acknowledgements

We thank the SV-COMP organizers for holding the competition and for their advice on participation. We are also grateful to the developers, maintainers, and contributors of CBMC, MiniSAT, and Yogar-CBMC, on which Deagle is based.

# References


<sup>6</sup> homepage: https://thufv.github.io/team


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# The Static Analyzer Frama-C in SV-COMP (Competition Contribution)

Dirk Beyer <sup>B</sup> and Martin Spiessl

LMU Munich, Munich, Germany

Abstract. Frama-C is a well-known platform for source-code analysis of programs written in C. It can be extended via its plug-in architecture by various analysis backends and features an extensive annotation language called ACSL. So far it was hard to compare Frama-C to other software verifiers. Our competition participation contributes an adapter named Frama-C-SV, which makes it possible to evaluate Frama-C against other software verifiers. The adapter transforms standard verification tasks (from the well-known SV-Benchmarks collection) in a way that can be understood by Frama-C and produces a verification witness as output. While Frama-C provides many different analyses, we focus on the Evolved Value Analysis (EVA), which uses a combination of different domains to over-approximate the behavior of the analyzed program.

Keywords: Software verification · Program analysis · Formal methods · Competition on Software Verification · Comparative Evaluation · SV-COMP · Frama-C

# 1 Approach

This competition contribution is based on Frama-C [12], a program-analysis platform for C programs. The purpose of the participation in the comparative evaluation SV-COMP is to show the strengths of Frama-C when applied to the problem of verifying C programs from the SV-Benchmarks [4] collection of verification tasks.

# 2 Architecture

Although Frama-C has a large configuration space, it supports neither the standard specifications used in SV-COMP nor the production of verification witnesses by default. To overcome this obstacle, we implemented an adapter for Frama-C using input and output transformers; the adapter architecture is illustrated in Fig. 1. In the following, we describe the artifacts and actors of the participating verifier: Sect. 2.1 describes the components developed as part of the adapter, while Sect. 2.2 describes in more detail how the EVA analysis of Frama-C works.

Fig. 1: Architecture of Frama-C-SV: the inputs and outputs of Frama-C are translated to interface with the established standards as used by SV-COMP; the components that are necessary to adapt Frama-C for comparison with other verifiers amount to 678 lines of code mostly written in Python

### 2.1 Frama-C-SV

Input Transformer. The input transformer takes the program p and specification s and creates a new program p′ in which the specification s is expressed as Frama-C-specific annotations. Frama-C uses ACSL [1] as the language for such annotations. The input transformer also selects configuration parameters for Frama-C that are best suited for the verification task. Currently, we encode reachability tasks into signed integer overflows by adding an artificial overflow to the body of the function reach\_error. This works well in practice and is also sound: if there were any other overflows, the task would contain undefined behavior and would not be a valid reachability task in the first place.
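This transformation can be sketched as a textual rewrite; the matched signature and the concrete overflow statement are assumptions for illustration, not necessarily what Frama-C-SV emits:

```python
def encode_reach_error(c_source):
    """Rewrite the body of reach_error() so that reaching it
    triggers a signed-integer overflow, which EVA reports as an
    alarm. The matched text and the overflow statement are
    illustrative assumptions."""
    overflow = "{ int x = 2147483647; x = x + 1; }"
    return c_source.replace("void reach_error() {",
                            "void reach_error() { " + overflow)
```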

Configuration Options. Depending on the input program and specification, we can choose different options that are passed to Frama-C. In essence, this acts like an algorithm selection [14] and, e.g., allows us to choose a different configuration of Frama-C depending on the specified property.

Harness. Some programs in the SV-Benchmarks collection use specific functions to model non-determinism. We provide implementations for those functions (\_\_VERIFIER\_\*) in a separate C program such that the semantics of those functions can be understood by Frama-C. This separate C program is passed to Frama-C together with the transformed program p′.

Output Transformer. The output of Frama-C needs to be interpreted regarding the original specification, and depending on the outcome, a verification witness needs to be generated. Thus, we need an output transformer for (a) providing a verdict for the verification task and (b) providing a verification witness. Regarding (a), the output transformer interprets the CSV report that can be generated by Frama-C to determine whether the program was proven to be safe (verdict TRUE), whether a specification violation occurred (verdict FALSE), or whether no such statement can be made (verdict UNKNOWN). We also generate a minimal correctness or violation witness for the verdicts TRUE and FALSE, respectively. The witness automata consist of only one node, which for violation witnesses is marked as violation node. In the future we plan to augment these witnesses with information such as invariants that have been found by Frama-C.
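The verdict logic of step (a) can be sketched as follows; the alarm statuses ("definite" vs. possible) are an illustrative simplification of Frama-C's report format:

```python
def verdict(alarms, analysis_complete):
    """Map a (simplified) Frama-C alarm report to an SV-COMP
    verdict. EVA over-approximates: no alarms on a completed
    analysis proves safety; only a definite alarm justifies
    FALSE; everything else stays UNKNOWN."""
    if analysis_complete and not alarms:
        return "TRUE"
    if any(status == "definite" for status in alarms):
        return "FALSE"
    return "UNKNOWN"
```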

#### 2.2 Frama-C

One of the strengths of Frama-C is its modular architecture [10], which allows a configuration of the best possible analysis backends for a certain verification problem. We choose the plug-in EVA [9], which is well suited for an automatic analysis. Other plug-ins such as the Weakest-Preconditions (WP) plug-in require hints from the user in order to be effective. In the following we will briefly describe the most important aspects of the EVA analysis configuration that we use. For a more detailed description, we refer the reader to the relevant literature [7, 8, 9].

Frama-C provides a meta-option called -eva-precision for the EVA plug-in with possible values ranging from 0 to 11. With higher values for this option more precise domains and thresholds are used, at the cost of increased computation time. We currently use the maximum value of 11 in order to make the best use of the 900 s CPU time limit. In the future we might want to iteratively increase this value starting at lower precisions.

Domains. The EVA analysis always uses the domain cvalue, which tracks values of variables either as constant values, sets, or intervals of possible values (including modular congruence constraints). For pointer addresses, these are either tracked as addresses with offsets or as so-called garbled mix, which overapproximates the set of possible memory locations. In addition, depending on the precision level, various other domains are used that we describe in the following. The domain symbolic-locations tracks a map of symbolic locations to values, which is, e.g., helpful for analyzing expressions containing array accesses such as a[i]<a[j]. The equality domain tracks equalities of C expressions found in the code, whereas the gauges domain tracks relations between variables in a loop with the goal to discover linear inequality invariants [16]. Lastly the octagon domain tracks certain linear constraints between pairs of variables [13]. As we use the highest precision level, all of these domains are used in our contribution.
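The flavor of such a set-or-interval abstraction can be sketched with a join that keeps small sets of constants exact and widens larger unions to an interval (a generic abstract-interpretation sketch mirroring the role of the ilevel option described below, not Frama-C's implementation):

```python
def join(a, b, limit=4):
    """Join two abstract values; each is either a frozenset of
    constants or an ('iv', lo, hi) interval. Small unions stay
    exact; larger ones are widened to an interval. `limit` plays
    the role of an ilevel-style threshold."""
    def bounds(v):
        return (v[1], v[2]) if isinstance(v, tuple) else (min(v), max(v))
    if isinstance(a, frozenset) and isinstance(b, frozenset):
        union = a | b
        if len(union) <= limit:
            return union
    lo_a, hi_a = bounds(a)
    lo_b, hi_b = bounds(b)
    return ("iv", min(lo_a, lo_b), max(hi_a, hi_b))
```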

Precision of the State-Space Exploration. Apart from the domains, the precision of state-space exploration in Frama-C is affected by various options. We will describe some of these in the following; a complete list of affected settings and values is always printed by Frama-C when the option eva-precision is specified by the user. Option slevel (set to 5 000) determines how many separate states are kept before new states will be joined into existing ones. Option ilevel (set to 256) determines how many different values are tracked per variable before overapproximating the value range. Option plevel (set to 2 000) affects the size up to which arrays are tracked. The option auto-loop-unroll (set to 1 024) will determine up to which bound a loop is considered for unrolling.

#### 3 Strengths and Weaknesses

The competition contribution shows the strengths of Frama-C in checking C programs for overflows and also —in the currently supported sub-categories<sup>1</sup>— for reachability. Here we are able to show that our results are comparable to, and often surpass, those of other tools based on abstract interpretation [11], such as Goblint [15]. While the EVA analysis of Frama-C that we use is based on abstract interpretation, the precision options described in Sect. 2.2 allow for a more precise state-space exploration, which behaves more like model checking. More details about the results can be found in the competition report [2] and artifact [3].

The approach that we describe in this paper creates a compatibility layer between the abilities of Frama-C and the standards used in the SV-Benchmarks collection. While this is still work in progress, we have shown that it is possible to bridge the gap while preserving overall soundness. It is also interesting to see results on verification tasks from the SV-Benchmarks collection for a tool that had not participated before.

Although our approach is sound in general, we are likely not showcasing the full potential of Frama-C. One aspect to consider here is the large configuration space, which means there might be ways to verify more tasks with a better heuristic for selecting the configuration options. The other aspect is that Frama-C also provides different plug-ins such as the WP plug-in, which requires more (manual) annotations, but can also potentially solve more tasks than the more automatic EVA plug-in.

#### 4 Software Project and Contributors

The software project Frama-C is developed at https://git.frama-c.com/pub/frama-c/ and our adapter Frama-C-SV is developed at https://gitlab.com/sosy-lab/software/frama-c-sv, both being released under open-source licenses. The exact version of the adapter that participated in SV-COMP 2022 is also archived in the competition's tool-archive repository<sup>2</sup> [6]. Frama-C was funded by the European Commission in program Horizon 2020. The adapter Frama-C-SV was funded by the DFG. We thank the Frama-C authors<sup>3</sup> for their contribution to the software-verification community.

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [2] and available on the competition web site. This includes the verification tasks [4], competition results [3], verification witnesses [5], scripts, and instructions for reproduction. The version of Frama-C-SV as used in the competition is archived together with other participating tools [6].

Funding Statement. This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 378803395 (ConVeY).

<sup>1</sup> We opted out of subcategories with unsound results caused by Frama-C making assumptions that are different from the conventions of SV-COMP.

<sup>2</sup> https://gitlab.com/sosy-lab/sv-comp/archives-2022/blob/svcomp22/2022/frama-c-sv.zip

<sup>3</sup> https://frama-c.com/html/authors.html

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **GDart: An Ensemble of Tools for Dynamic Symbolic Execution on the Java Virtual Machine (Competition Contribution)**?

Malte Mues (B) <sup>1</sup> and Falk Howar <sup>1</sup>,<sup>2</sup>

<sup>1</sup> TU Dortmund University, Dortmund, Germany {malte.mues, falk.howar}@tu-dortmund.de <sup>2</sup> Fraunhofer ISST, Dortmund, Germany

**Abstract.** GDart is an ensemble of tools for dynamic symbolic execution of JVM programs. The dynamic symbolic execution engine is decomposed into three components: a symbolic decision engine (DSE), a concolic executor (SPouT), and an SMT solver backend that allows meta-strategy solving of SMT problems (JConstraints). The symbolic decision component is loosely coupled with the executor by a newly introduced communication protocol. At SV-COMP 2022, GDart solved 471 of 586 tasks, finding more correct false results (302) than correct true results (169). It scored fourth place.

**Keywords:** Dynamic Symbolic Execution · Software Verification

# **1 Verification Approach**

This paper presents the GDart ensemble tool, a dynamic symbolic execution engine for the JVM. Dynamic symbolic execution is a well-established technique for software testing (cf. DART [6]), and two contestants at SV-COMP 2021 already used this technique (cf. JDart [7, 9] and COASTAL<sup>3</sup>). It is a search algorithm that systematically explores a program's state space for a property violation and stops after exhausting the resource limits, exploring the complete symbolic state space, or encountering an error. The end of the search is fully configurable in GDart.

In SV-COMP 2022 [3], a dynamic symbolic execution tool, JDart (714 points), wins the Java track for the first time, beating JBMC (700 points) [4], a bounded model checker for Java, and Java Ranger (670 points) [11], a symbolic execution engine extended by veritesting [1] for Java. JDart's result underlines the potential of dynamic symbolic execution for the verification of Java programs in general. The concrete implementation of JDart is closely coupled to the Java PathFinder VM (JPF-VM) [12], running the complete analysis within one virtual machine. The advantage of the JPF-VM is that it runs

<sup>?</sup> This work has been partially funded by an Amazon Research Award.

<sup>3</sup> https://github.com/DeepseaPlatform/coastal

as a guest JVM on top of a host JVM. The analysis might mock parts of the guest JVM and use the host JVM for running side computations required to compute results used in the mock. The downside of the JPF-VM is its research-tool status and that it is costly to maintain, given Java's fast pace in releasing new features.

COASTAL demonstrated for the first time what a loosely coupled architecture between the symbolic exploration engine and a concolic execution engine might look like. It instruments the bytecode with ASM<sup>4</sup>, a Java bytecode manipulation framework, to obtain symbolic traces. This makes the analysis independent of the JPF-VM. The downside is that bytecode manipulation offers less flexibility than hooking directly into the JVM.

# **2 Software Architecture**

Fig. 1: GDart's ensemble architecture and the interplay between the components.

GDart combines the strength of JDart's mocking flexibility with COASTAL's modular design. Figure 1 shows the architecture of the GDart ensemble tool. The main analysis component is the symbolic explorer. It orchestrates the concolic executor and requests solutions for SMT problems from the constraint solvers powering the symbolic exploration.

*Symbolic Exploration.* We name the symbolic explorer the "DSE" component, as it performs the two main steps of dynamic symbolic execution: it manages the constraint tree and guides its exploration, and it starts the concolic executor. To explore a path, it computes a set of concrete values that drives the concolic executor down the path of interest and seeds the executor with these values. After the executor terminates, it parses the obtained symbolic trace and integrates it into the symbolic tree. Next, it constructs from the symbolic tree an SMT problem that describes the next path to explore and starts a constraint solver to get either a model suitable to drive the execution down this path or an unsatisfiable verdict implying that the path is unreachable. The

<sup>4</sup> https://asm.ow2.io

search behavior of GDart is configured in the DSE. Once the search terminates, DSE generates a verification witness from the constraint tree.
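The seed-execute-integrate loop can be sketched with a brute-force stand-in for the constraint solver and a toy subject program that records its own branch trace (illustrative only; GDart uses SPouT as executor and real SMT solvers):

```python
def program(x):
    """Toy subject: records each branch decision, returns the
    trace and whether the (synthetic) error location was hit."""
    trace = [("x>5", x > 5)]
    if x > 5:
        trace.append(("x<10", x < 10))
        if x < 10:
            return trace, True
    return trace, False

def dse(max_val=20):
    """Explore the constraint tree breadth-first: execute with a
    seed, record the path, then flip each decision on the path
    and ask a brute-force 'solver' for a seed covering the flip."""
    seen, frontier, bugs = set(), [0], []
    while frontier:
        seed = frontier.pop(0)
        trace, err = program(seed)
        path = tuple(taken for _, taken in trace)
        if path in seen:
            continue
        seen.add(path)
        if err:
            bugs.append(seed)
        for k in range(len(path)):
            target = path[:k] + (not path[k],)
            for cand in range(max_val):   # stand-in for an SMT query
                ct, _ = program(cand)
                if tuple(t for _, t in ct)[:k + 1] == target:
                    frontier.append(cand)
                    break
    return bugs
```

In the real tool, the flipped path prefix is handed to a constraint solver as an SMT problem rather than searched by enumeration, and the subject program is traced by the concolic executor instead of instrumenting it by hand.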

*Concolic Executor.* One of the core contributions of GDart is the new concolic executor SPouT, implemented as part of the Espresso guest language running on top of the GraalVM [13]<sup>5</sup>. The GraalVM is an industrial-grade JVM maintained by Oracle that offers most of the architectural benefits of the JPF-VM, apart from state tracking; concolic execution, however, does not require JPF-VM's state-tracking feature. SPouT can be seeded with concrete values to drive the execution along a concrete path. In addition, it can introduce new symbolic variables for previously unknown inputs. During execution, it records manipulations of and constraint checks on symbolic variables, and on termination of the path exploration it reports a symbolic execution trace together with the concrete execution result. Decisions on the symbolic variables are encoded in the SMT-LIB format. As SPouT maintains the two VM layers, it allows mocking behavior in the Espresso VM running the analysis and, if needed, implements a substitute executed on the host GraalVM during concolic execution, the same way JDart mocks the environment. This feature is also used for intercepting invocations of the Java string library and encoding them symbolically.

*Constraint Solving.* The third component is constraint solving. DSE uses the JConstraints library to model SMT-Lib constraints internally and interact with the solver. GDart is backed by CVC4 [2] and Z3 [5]. We combine these two SMT solvers in a portfolio approach according to the CvcSeqEval strategy presented in our previous work [8].
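A sequential portfolio of this kind can be sketched as follows; the even budget split and the solver stubs are assumptions for illustration, while the real strategy (CvcSeqEval) comes from our previous work [8]:

```python
def portfolio(problem, solvers, budget):
    """Try solvers one after another, each with a slice of the
    time budget, and return the first conclusive answer."""
    slice_ = budget / len(solvers)
    for name, solve in solvers:
        result = solve(problem, timeout=slice_)
        if result in ("sat", "unsat"):
            return name, result
    return None, "unknown"
```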

#### **3 Strengths and Weaknesses**

GDart finished in fourth place with 640 points, behind JDart (714 points), JBMC (700 points), and Java Ranger (670 points). Dynamic symbolic execution tools tend to be stronger at finding property violations than at confirming their absence on the SV-COMP benchmark. This is partially by design, as some of the problems (e.g., those in the jayhorn-recursive subgroup) aim at testing the handling of tremendously large and hard-to-explore state spaces. GDart disproves the property in 302 cases and confirms it in 169 cases. In total, GDart answered 471 of 586 tasks correctly and none incorrectly. These are 40 more correct false results than Java Ranger found (262 correct false tasks out of 466 solved tasks). In total, GDart solved five more tasks than Java Ranger and 35 fewer than JBMC.

In direct comparison with GDart, JDart solved 192 (+23) correct true tasks and 330 (+28) correct false tasks. Three factors contribute to the gap between GDart and JDart: the performance overhead of spinning up one JVM per executor run (we do not have exact numbers, but spinning up a JVM

<sup>5</sup> https://www.graalvm.org

costs at least 500 ms per JVM, affecting especially tasks with huge exploration trees), the technical maturity of the implementation, as JDart has been around longer, and a value-tracing heuristic built into JDart (but not into GDart) for tracking numerical values that originate from a serialized string representation. The performance overhead of spinning up multiple JVMs is the only drawback caused by GDart's modular design and will not go away in the future. In the score-based quantile plots for CPU time, JDart's time per task after achieving 600 points is close to five seconds of CPU time, while GDart's time per task reaches close to 50 seconds of CPU time for the same score.

The weakness of dynamic symbolic execution is state-space explosion, which also affects GDart. Slowing down each executor run by spinning up a new VM is a disadvantage given the resource constraints of SV-COMP. On the bright side, with more relaxed resource limits it would be possible to run executor runs in parallel with the symbolic exploration of the constraint tree, allowing parallel breadth-first search on multi-core machines; this is future work for the DSE component. At the moment, all paths are explored sequentially.

# **4 Tool Setup**

GDart is run with various configuration options hard-coded into the SV-COMP run scripts. More precisely, we enabled witness generation, used the described solver strategy in the constraint backend, chose a breadth-first search on the constraints tree, and used the same bounded solving as JDart. The search is configured to terminate on the first assertion error that is hit.

# **5 Software Project**

The components are currently all developed at TU Dortmund by the group led by Falk Howar. DSE<sup>6</sup> is available under the Apache 2.0 license, JConstraints<sup>7</sup> as well, and SPouT<sup>8</sup> is available under the GPL v2 license. We also provide the run scripts for SV-COMP on GitHub<sup>9</sup> .

# **6 Data Availability Statement**

The GDart archive used for SV-COMP 2022 is available at Zenodo [10].

# **References**

1. Avgerinos, T., Rebert, A., Cha, S.K., Brumley, D.: Enhancing symbolic execution with veritesting. In: Proc. ICSE. pp. 1083–1094 (2014). https:// doi.org/10.1145/2568225.2568293

<sup>6</sup> https://github.com/tudo-aqua/dse

<sup>7</sup> https://github.com/tudo-aqua/jconstraints

<sup>8</sup> https://github.com/tudo-aqua/spout

<sup>9</sup> https://github.com/tudo-aqua/gdart-svcomp


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Graves-CPA: A Graph-Attention Verifier Selector (Competition Contribution)

Will Leeson() and Matthew B. Dwyer

University of Virginia, Charlottesville VA 22903, USA {will-leeson,matthewbdwyer}@virginia.edu

Abstract. Graves-CPA is a verification tool which uses algorithm selection to decide an ordering of underlying verifiers to most effectively verify a given program. Graves-CPA represents programs using an amalgam of traditional program graph representations and uses state-of-the-art graph neural network techniques to dynamically decide how to run a set of verification techniques. The Graves technique is implementation agnostic, but its competition submission, Graves-CPA, is built using several CPAchecker configurations as its underlying verifiers.

Keywords: Software Verification · Graph Attention Networks · Graph Neural Networks · Algorithm Selection

# 1 Verification Approach

Graves-CPA is an algorithm selector for software verification based on graph neural network techniques. As the tool PeSCo [14] has shown, dynamic ordering of verification techniques can result in faster and more accurate verification. Computing an ordering on techniques dynamically will incur some runtime, but an effective ordering will oftentimes make this overhead insignificant in comparison to the time saved by using a more appropriate technique. Like most algorithm selectors, Graves-CPA uses machine learning to make its selections. However, it uses graph neural networks (GNNs) so it can represent programs using traditional program abstractions, such as abstract syntax trees (ASTs). Graves-CPA uses a variant of GNNs called Graph Attention Networks (GATs) [16]. GATs use a learned attention mechanism which is trained to learn the importance of edges in a given graph.

GNNs are an emerging field in machine learning. Traditional neural networks accept input vectors, which have a fixed size and a natural ordering on elements, but graphs, in general, have neither. GNNs avoid these issues by operating on individual nodes in the graph, instead of the graph as a whole [15]. Typically, the input to a GNN is the current representation of a node v and a collation of the representations of its neighboring nodes. The output is then a new representation for v. This process is repeated independently for all nodes in the graph. Thus, the number of nodes in the graph and order in which they are processed is irrelevant.
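The node-update scheme described above can be sketched in a few lines of plain Python. This is an illustrative mean-aggregation step only; real GNN layers such as the GATs used here apply learned linear transformations and attention weights instead of plain averaging, and the function and dictionary names below are hypothetical.

```python
def gnn_step(node_feats, neighbors):
    """One round of message passing: each node's new representation is
    the element-wise mean of its own features and its neighbors' features.

    node_feats: dict node -> list[float]
    neighbors:  dict node -> list of neighbor nodes
    """
    new_feats = {}
    for v, feats in node_feats.items():
        msgs = [node_feats[u] for u in neighbors.get(v, [])]
        pooled = [feats] + msgs
        n = len(pooled)
        # mean over the node itself and its neighborhood, per dimension
        new_feats[v] = [sum(col) / n for col in zip(*pooled)]
    return new_feats
```

Because each node is updated only from its own neighborhood, the result is independent of the number of nodes and the order in which they are processed, as the paragraph above notes.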

The Graves technique is tool agnostic [11], meaning it can be trained to select from any set of verifiers. Our competition contribution selects an ordering from the techniques utilized by CPAchecker [3], similar to PeSCo in previous competitions.

To form its selection, Graves-CPA produces a graph representation of a given program, G, which is based on its AST with control flow, data flow, and function call and return edges added between the tree's nodes. The AST's nodes and edges ensure the semantics of the statements in the program are maintained. Control flow edges maintain the branching and order of execution between these statements. Data flow edges explicitly relate the definitions, uses, and interactions of values in the program. G is passed to a GNN, consisting of a series of GATs, which outputs a graph feature vector. This feature vector is finally passed to a fully connected neural network which decides the sequence in which Graves-CPA's suite of verification techniques is run.

# 2 System Architecture

#### 2.1 Graph Generation

To generate a graph from a program, Graves-CPA relies on the AST produced by the C compiler Clang [10]. Using a visitor pattern [9], Graves-CPA walks the AST to generate data flow edges and the edges of the program's Interprocedural Control Flow Graph (ICFG). Function call and return edges in the ICFG are those which can be determined purely syntactically. Using the ICFG and data flow edges, Graves-CPA produces additional data flow edges using the worklist reaching definition algorithm [1]. We limit the number of iterations of the reaching definition algorithm, making our data edges an under-approximation of possible data flow edges. Once this graph is generated, it is parsed into a list of nodes and several edge sets. Nodes represent the AST token which corresponds to them using a one-hot encoding. These nodes and edges are used as input to the GNN.
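The worklist reaching-definitions computation mentioned above can be sketched as follows. This is a generic textbook version over assumed block/`gen`/`kill` inputs, not Graves-CPA's actual implementation (which additionally caps the number of iterations, making the resulting data flow edges an under-approximation).

```python
def reaching_definitions(blocks, preds, gen, kill):
    """Worklist reaching-definitions analysis.

    blocks: iterable of block ids
    preds:  dict block -> list of predecessor blocks
    gen/kill: dict block -> set of definition ids generated/killed
    Returns (in_sets, out_sets) mapping each block to its reaching defs.
    """
    in_sets = {b: set() for b in blocks}
    out_sets = {b: set() for b in blocks}
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        # IN[b] is the union of OUT over all predecessors
        in_b = set()
        for p in preds.get(b, []):
            in_b |= out_sets[p]
        in_sets[b] = in_b
        new_out = gen[b] | (in_b - kill[b])
        if new_out != out_sets[b]:
            out_sets[b] = new_out
            # re-process all successors of b
            worklist.extend(s for s in blocks if b in preds.get(s, []))
    return in_sets, out_sets
```

On a linear CFG where block 2 kills block 1's definition, only block 2's definition reaches block 3, which is the kind of def-use information the data flow edges encode.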

#### 2.2 Prediction

To form a prediction, Graves-CPA uses a GNN, visualized in Figure 1, which consists of two GAT layers, a jumping knowledge layer [17], and an attention-based pooling layer [12]. The GAT layers are crucial to our technique. When propagating data through the graph, the attention mechanism in each layer weights edges so that information important to predictions is more prominent than superfluous data.

The jumping knowledge layer concatenates intermediate graph representations, denoted by A, B, and C, allowing the model to learn from each representation. The attention-based pooling layer calculates an attention value for each node in the graph. All nodes are weighted by their respective attention values and then summed together to form a graph feature vector. The combination of

Fig. 1. Graves uses a GNN comprised of two GAT layers, a jumping knowledge layer, and an attention-based pooling layer. These layers produce a graph feature vector which a three-layer prediction network uses to order verifiers for sequential execution. An in-depth description of this architecture can be found in Leeson et al. [11].

GAT layers and the attention-based pooling allows the network to weigh the importance of both edges and nodes when forming the graph feature vector. This feature vector is fed to a three-layer neural network which decides the sequence of tool execution.
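The attention-based pooling step (score each node, normalize the scores into weights, and sum the weighted node vectors into one graph feature vector) can be sketched as below; the fixed dot product against a `query` vector is a stand-in for the learned attention mechanism.

```python
import math

def attention_pool(node_vecs, query):
    """Pool per-node vectors into a single graph feature vector using
    softmax-normalized attention scores.

    node_vecs: list of node feature vectors (equal length)
    query:     scoring vector (stand-in for a learned attention head)
    """
    scores = [sum(q * x for q, x in zip(query, v)) for v in node_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vecs[0])
    # weighted sum of node vectors -> graph feature vector
    return [sum(w * v[i] for w, v in zip(weights, node_vecs))
            for i in range(dim)]
```

With a zero query every node gets equal weight and the pool degenerates to a mean; a trained scorer instead emphasizes the nodes most predictive of verifier performance.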

Graves-CPA was trained using data collected from running 5 configurations of the CPAchecker framework on the verification tasks from SV-COMP 2021. Labels for each configuration come from the SV-COMP score the configuration would receive for a given program minus a time penalty. Similar to CPAchecker's competition contribution, these configurations are symbolic execution [6], value analysis [7], value analysis with CEGAR [7], predicate analysis [5], and bounded model checking with k-induction [4]. To prevent Graves-CPA from overfitting to the SV-COMP benchmarks, we train on a subset of the dataset, only utilizing 20% of it. Like previous iterations of PeSCo, the network is trained to rank the configurations in the order in which they should be executed.
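The label and ranking setup described above might be sketched as follows; the penalty factor, margin, and function names are illustrative assumptions, not the values used to train Graves-CPA.

```python
def label(svcomp_score, cpu_time_s, penalty_per_sec=0.001):
    """Hypothetical training label: SV-COMP score minus a time penalty."""
    return svcomp_score - penalty_per_sec * cpu_time_s

def pairwise_hinge_loss(pred, target, margin=1.0):
    """Pairwise ranking loss: for every pair where the target ranks
    configuration i above j, penalize predictions that do not keep
    pred[i] ahead of pred[j] by at least the margin."""
    loss, pairs = 0.0, 0
    for i in range(len(target)):
        for j in range(len(target)):
            if target[i] > target[j]:
                loss += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

Sorting configurations by the predicted scores then yields the execution order, matching the ranking objective PeSCo-style training uses.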

Graves-CPA uses the machine learning libraries PyTorch [13] and PyTorch-Geometric [8], an extension of PyTorch for graphs and other irregularly shaped data, to implement its machine learning components. Graves-CPA is implemented using a combination of Python, C++, and Java.

#### 2.3 Execution

Using the ordering produced by the previous step, CPAchecker is run in a sequential fashion with each verification configuration. If a technique exceeds a given time limit or fails to produce a result, the next technique is executed.
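A minimal sketch of this sequential execution loop, with a hypothetical `run_config` callback standing in for an invocation of CPAchecker:

```python
def run_portfolio(ordering, run_config, budget_per_config):
    """Run the ranked configurations one after another, moving on when a
    configuration exhausts its budget or returns no verdict.

    ordering:   list of configuration names, best-ranked first
    run_config: callback(config, timeout=...) -> "true"/"false"/"unknown"
    """
    for config in ordering:
        verdict = run_config(config, timeout=budget_per_config)
        if verdict in ("true", "false"):
            return config, verdict
    return None, "unknown"
```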

# 3 Strengths and Weaknesses

Graves-CPA operates on program graphs which are an abstraction of the program. Its underlying model uses this abstraction to learn what software patterns a particular verification technique excels at handling. This allows Graves-CPA to produce a dynamic ordering which runs techniques better suited to the given problem first, reducing run time. In [11], the authors perform a qualitative study which suggests the network learns to rank verification techniques using program features an expert would use to decide between techniques.

In SV-COMP 2022 [2], there were 4,548 problems for which both Graves-CPA and CPAchecker reported the correct result. Graves-CPA's dynamic selection over CPAchecker's static configuration ordering allowed it to solve these problems 37 hours faster. Further, Graves-CPA was able to solve 142 problems that CPAchecker could not, due to resource constraints or other issues.

Machine learning relies on the training data being representative of the real world. If this is not the case, the model can easily make poor predictions. These poor decisions can be seen in the competition in the 559 instances where Graves-CPA chose an ordering that did not produce the correct result, but CPAchecker did. In most of these instances, Graves-CPA runs out of resources or incorrectly predicts that the remaining techniques will not produce a correct result.

# 4 Tool Setup and Configuration

Graves-CPA is built on the PeSCo codebase, which in turn is built on the CPAchecker codebase, and participates in the ReachSafety and Overall categories. It can be downloaded as a fork: https://github.com/will-leeson/cpachecker. Graves-CPA requires cmake, LLVM, either make or ninja, and ant (a CPAchecker dependency) to be built, and the Python libraries PyTorch and PyTorch-Geometric to be executed. To build the project, simply run the shell script setup.sh and add our graph generation tool, graph-builder, to your path. Now, you may verify a program with Graves-CPA using the command:

```
scripts/cpa.sh -svcomp22-graves -spec [prop.prp] [file.c]
```
# 5 Software Project and Contributions

Graves-CPA is an open source project developed by the authors at the University of Virginia. We would like to thank the team behind the PeSCo and CPAchecker tools for allowing us to build on their work.

# Acknowledgements

We would like to thank Hongning Wang for his advice on graph neural networks and prediction systems. This material is based in part upon work supported by the U.S. Army Research Office under grant number W911NF-19-1-0054 and by the DARPA ARCOS program under contract FA8750-20-C-0507.

# References



# **GWIT: A Witness Validator for Java based on GraalVM (Competition Contribution)**?

Falk Howar (B) <sup>1</sup>,<sup>2</sup> and Malte Mues <sup>1</sup>

**Abstract.** GWIT is a validator for violation witnesses produced by Java verifiers in the SV-COMP software verification competition. GWIT weaves assumptions documented in a witness into the source code of a program, effectively restricting the part of the program that is explored by a program analysis. It then uses the GDart tool (dynamic symbolic execution) to search for reachable errors in the modified program.

# **1 Introduction**

Software verification tools, like any other software, can contain bugs. Given their intended use, i.e., proving the absence of errors in programs, however, bugs in verification tools are particularly problematic. On the other hand, verification tools can generate certificates for computed verdicts (e.g., counterexamples) that can be used to validate verification results. In the SV-COMP competition on software verification, *violation witnesses* and *correctness witnesses*, based on annotated abstract control-flow automata, have been established as a standardized representation of such certificates [1, 2]. Participating verifiers are expected to produce witnesses for verdicts, and *witness validators* are used for confirming verdicts based on these witnesses.

In this paper, we present GWIT (as in "**G**uess **W**hat **I**'m **T**hinking" or as in **G**Dart-based **wit**ness validator), a validator of violation witnesses for Java programs, based on the GDart tool ensemble [6]. GWIT validates violation witnesses by weaving the assumptions documented in a witness into the original program under analysis and checks the restricted program with dynamic symbolic execution.

# **2 Witness Validation in GWIT**

We illustrate the operation of GWIT for the small example shown in Figure 1: In the program, a String value is created nondeterministically before asserting that the value of this String value should not be "whoopsy". This program contains a reachable error: in case the value "whoopsy" is returned by the call to Verifier.nondetString(), an assertion violation will be triggered.

<sup>1</sup> TU Dortmund University, Dortmund, Germany {falk.howar, malte.mues}@tu-dortmund.de <sup>2</sup> Fraunhofer ISST, Dortmund, Germany

<sup>?</sup> This work has been partially funded by an Amazon Research Award

```
1 public static void main(String[] args) {
2 String s = Verifier.nondetString();
3 assert !s.equals("whoopsy");
4 }
```
Fig. 1: Small program with reachable error.

Java verifiers will generate a violation witness in such a case. In SV-COMP, witnesses are produced in a standardized format, conceptually based on controlflow automata and technically realized as models in the *GraphML* format [2]. Figure 2 shows an excerpt of such a witness for the above example. The witness makes an assumption on the state of the program when executing line 2 of the example program, namely that variable s has value "whoopsy". As discussed, execution paths on which this assumption holds, will lead to an error.

GWIT weaves the assumptions from the witness into the original program, restricting the number of program paths that have to be explored for finding the error. Figure 3 shows the result for our example: a call to Witness.assume(...) is generated from the assumption in the witness in Figure 2. The assume method wraps potentially many calls to the Verifier.assume(...) method, enabling multiple assumptions on the same line of code (e.g., due to execution of that line in a loop). The counters array keeps statistics on assumptions per line. The Verifier.assume(...) method is used by GDart to stop the analysis on paths that violate the corresponding assumption.
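The weaving step can be sketched in Python as follows, assuming the (startline, assumption) pairs have already been extracted from the GraphML witness. GWIT's actual implementation works on Java sources and groups assumptions per call site into a single varargs call; this sketch inserts one `Witness.assume(...)` call per assumption for brevity.

```python
def weave(source_lines, assumptions):
    """Insert a Witness.assume(...) call after each source line that a
    witness assumption refers to.

    source_lines: list of source lines (1-indexed by position)
    assumptions:  list of (startline, expression) pairs from the witness
    """
    by_line = {}
    for line_no, expr in assumptions:
        by_line.setdefault(line_no, []).append(expr)
    woven = []
    for no, line in enumerate(source_lines, start=1):
        woven.append(line)
        indent = line[:len(line) - len(line.lstrip())]
        for call_id, expr in enumerate(by_line.get(no, [])):
            # call_id distinguishes multiple assumptions at one call site
            woven.append(f"{indent}Witness.assume({call_id}, {expr});")
    return woven
```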

Figure 4, finally, shows the effect of weaving the witness into the code on the obtained constraints-trees. In the left of the figure, the tree computed by GDart for the original program is shown. The tree has two satisfiable paths, branching on the condition of the assert statement. The right of the figure shows the tree for the modified program. This tree contains a node for the assumption, one path that is not executed after the violation of the assumption, one path that is not feasible after the assumption for the assert statement, and one path leading to an error (i.e., assertion violation). In this small example, the tree for the modified program is more complex than the tree for the original program, but it has fewer complete execution paths. In more complex programs, assumptions will typically remove multiple execution paths, making the validation task significantly easier than the original verification task.

```
<edge source="n0" target="n1">
  <data key="originfile">Main.java</data>
  <data key="startline">2</data>
  <data key="threadId">0</data>
  <data key="assumption">s.equals("whoopsy")</data>
  <data key="assumption.scope">...</data>
</edge>
```
Fig. 2: Excerpt of a violation witness (GraphML) for the example program.

```
1 static int[] counters = new int[] { 0 };
2 public static void assume(int id, boolean ... assumptions) {
3 int idx = counters[id];
4 counters[id]++;
5 Verifier.assume(assumptions[idx]);
6 }
7
8 public static void main(String[] args) {
9 String s = Verifier.nondetString();
10 Witness.assume(0, s.equals("whoopsy"));
11 assert !s.equals("whoopsy");
12 }
```
Fig. 3: Example program with the witness assumption woven in.

Fig. 4: Constraints-tree for original program (left) and modified program (right).

# **3 Performance and Limitations**

While the approach of GWIT is sound for violation witnesses, the current implementation still has limitations, validating roughly half of the witnesses provided by verifiers.

*Soundness.* GWIT is sound: weaving a witness into the code adds additional decision nodes to the constraints-tree. In the sub-tree rooted at such a new node, some paths become unsatisfiable and will not be explored. Every complete path ψ in the modified tree has an equivalent path φ in the original constraints-tree such that ψ ⇒ φ. If an error is reached in the modified tree, it is also reachable in the original program.

*Performance.* For programs with few decisions, the modified program may actually be more complex than the original program, but GDart only explores more paths than in the original program when the initial value along some path does not satisfy an assumption. Comparing the CPU times of GDart used as a verifier and used through GWIT, with almost identical configuration options (the only difference: GWIT does not produce witnesses), complexity is reduced for most benchmark instances that do not fail due to syntactic errors during weaving (see below).

Two extreme examples are BellmanFord-FunSat02, for which weaving a witness with 13 assumptions more than doubles the CPU time, leading to a timeout during validation, and the nanoxml\_eqchk/prop2 instance, for which the CPU time required for validation is less than 14% of the CPU time needed for the original verification task.

Overall, GWIT successfully validates 301 of 614 witnesses provided by GDart and JBMC [3] (the only Java verifiers that currently produce witnesses). In 286 cases, validation failed with inconclusive verdicts due to currently unsupported witness features. In 15 cases, incorrect weaving (see below) prevented validation of witnesses. For 12 witnesses, validation attempts exhausted the resource limits.

*Limitations.* First, GWIT currently only supports violation witnesses. In principle, it should be possible to validate correctness witnesses by weaving assertions into the program code, but it is not obvious that such an approach makes the validation of witnesses a simpler problem than the original verification task. Second, since weaving witnesses is done on the source code, it only works correctly on proper blocks, delimited with braces, and with one statement per line. While this does not affect soundness, it makes the validation of witnesses impossible in some cases.

# **4 Tool Setup**

GWIT is shipped as a git repository with sub-projects delivering all required components. Checking out the repository and initializing all sub-projects pulls in all required source code. For building the SPouT component, the mx build system<sup>3</sup> maintained by the GraalVM [7] team is required. Other components are built with Maven. Once all build systems are available, the ./compile-all.sh script builds GWIT. The ./run-gwit.sh script is used to validate witnesses, taking the witness file and the source folders of a benchmark instance as parameters. GWIT currently does not expose any other configuration parameters.

# **5 Software Project**

The GWIT tool is available on GitHub<sup>4</sup> . GWIT's scripts are licensed under the Apache 2.0 license. The sub-projects bring their own licenses as follows: DSE<sup>5</sup> is available under the Apache 2.0 license, JConstraints<sup>6</sup> [4] as well, and SPouT<sup>7</sup> is available under the GPL v2 license. The components of GWIT and GWIT itself are currently developed at TU Dortmund by the group led by Falk Howar.

<sup>3</sup> https://github.com/graalvm/mx

<sup>4</sup> https://github.com/tudo-aqua/gwit

<sup>5</sup> https://github.com/tudo-aqua/dse

<sup>6</sup> https://github.com/tudo-aqua/jconstraints

<sup>7</sup> https://github.com/tudo-aqua/spout

### **6 Data Availability Statement**

The GWIT archive used for SV-COMP 2022 is available at Zenodo [5].

# **References**



# The Static Analyzer Infer in SV-COMP (Competition Contribution)

Matthias Kettl() and Thomas Lemberger

LMU Munich, Germany

Abstract. We present Infer-sv, a wrapper that adapts Infer for SV-COMP. Infer is a static-analysis tool for C and other languages, developed by Facebook and used by multiple large companies. It is strongly aimed at industry and the internal use at Facebook. Despite its popularity, there are no reported numbers on its precision and efficiency. With Infer-sv, we take a first step towards an objective comparison of Infer with other SV-COMP participants from academia and industry.

# 1 Facebook Infer

Infer [6] is a compositional and incremental static-analysis tool developed at Facebook. Infer supports a wide array of analyses; this includes memory safety, buffer overruns, performance constraints and different reachability analyses for C, C++, Objective C, Java, C#, and .Net. For memory analysis, Infer uses bi-abduction [7] with separation logic [14]. Infer supports the integration of new abstract domains through the abstract-interpretation framework Infer:AI. Infer analyzes programs compositionally (building method summaries) and incrementally (only analyzing changed program parts). In contrast to most other tools that participate in SV-COMP, Infer is not an academic verifier. Instead, it is aimed at practical use during software development. This has direct implications on the development focus: When Infer is told to incrementally analyze software, it outputs only newly discovered bugs and does not re-report bugs found in previous analyses. This allows developers to ignore warnings not deemed relevant and reduces the cognitive burden on developers due to false alarms. Multiple large companies use Infer—among others: Amazon Web Services, Facebook, Microsoft, Mozilla, and Spotify. At the time of this writing, Infer has more than 12 000 stars on GitHub and was forked over 1 500 times. Despite its popularity, there are no reported numbers on Infer's precision and soundness. With the participation of Infer in the C language track of SV-COMP '22, we hope to take a first step towards an objective comparison of Infer with other verifiers.

The following other commercial verifiers participate in SV-COMP '22: 2ls [16], Cbmc [10], Crux <sup>1</sup> , Frama-C [5], VeriAbs [12], and VeriFuzz [9].

<sup>1</sup> https://crux.galois.com/

# 2 Infer in SV-COMP

### 2.1 Infer-SV

Verification. We provide the wrapper Infer-sv to adapt Infer to the SV-COMP specification format for program properties. Infer-sv parses the property to analyze, adjusts the program under analysis for Infer, runs Infer with fitting analyses, and reports a verification verdict based on the feedback produced by Infer. Infer-sv supports the following SV-COMP program properties:

no-overflow. The aim is to check for arithmetic overflows on signed-integer types. Infer-sv runs Infer's buffer-overrun analysis <sup>2</sup> to detect these.

unreach-call. The aim is to check for reachable calls to function reach\_error. Infer provides a function-call reachability analysis <sup>3</sup> , but this analysis proved very imprecise. To mitigate this, Infer-sv performs a program transformation <sup>4</sup> : It replaces each call to function reach\_error with an overflow-provoking statement int \_\_reach\_error\_x = 0x7fffffff + 1. No task with property unreach-call contains a signed-integer overflow, so a call to reach\_error is reachable in the original program if and only if one of the introduced overflows is reachable in the transformed program. Infer-sv runs Infer's buffer-overrun analysis on the transformed program to check this.
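The described source transformation can be sketched as a simple textual rewrite; the regex below is an illustrative assumption (it would, e.g., also match a prototype such as `void reach_error();`), not Infer-sv's actual implementation.

```python
import re

# Statement whose signed-integer overflow Infer's buffer-overrun
# analysis can detect, as described in the text above.
OVERFLOW_STMT = "int __reach_error_x = 0x7fffffff + 1;"

def transform(program_text):
    """Replace each plain call of the form reach_error(); with an
    overflow-provoking statement."""
    return re.sub(r"reach_error\s*\(\s*\)\s*;", OVERFLOW_STMT, program_text)
```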

valid-memsafety. The aim is to check for invalid pointer dereferences, invalid frees of memory, and memory leaks. To analyze memory safety, Infer-sv uses two analyses: bi-abduction <sup>5</sup> and Infer:Pulse <sup>6</sup> . SV-COMP requires verifiers to report the concrete type of violation detected: valid-deref, valid-memtrack, or valid-free. Infer-sv analyzes the error codes reported by Infer to determine the exact violation found. If Infer reports multiple fitting warnings, we take the first.

Witnesses. SV-COMP requires participants to report GraphML verification-result witnesses [3, 4] in tandem with each result, and these witnesses must be successfully validated by at least one participating witness validator. Natively, Infer does not support the generation of GraphML witnesses. To mitigate this, Infer-sv creates generic witnesses: When reporting a violation, it generates a violation witness [4] that represents all possible program paths. When reporting a program safe, it generates a correctness witness [3] that only contains the trivial invariant 'true'. These witnesses do not helpfully guide towards a violation or proof, but are valid according to the SV-COMP rules.
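A generic correctness witness of the kind described can be sketched as below; the GraphML keys shown are abbreviated from the SV-COMP witness format (a real witness carries further mandatory header data), and the function itself is hypothetical.

```python
def trivial_correctness_witness():
    """Return a skeletal GraphML correctness witness: a single entry
    node and no further structure, i.e., the trivial invariant 'true'."""
    return """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <data key="witness-type">correctness_witness</data>
    <node id="entry">
      <data key="entry">true</data>
    </node>
  </graph>
</graphml>"""
```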

Participation. Infer-sv participates hors concours in the categories ReachSafety, ConcurrencySafety, NoOverflows, and SoftwareSystems. Because of missing support, we exclude Infer-sv from categories aimed at float handling, as well as category MemSafety-MemCleanup.

<sup>2</sup> https://fbinfer.com/docs/checker-bufferoverrun

<sup>3</sup> https://fbinfer.com/docs/checker-annotation-reachability

<sup>4</sup> https://github.com/facebook/infer/issues/763

<sup>5</sup> https://fbinfer.com/docs/checker-biabduction

<sup>6</sup> https://fbinfer.com/docs/checker-pulse

Fig. 1: Comparison of the run time (in CPU time seconds) of three SV-COMP '22 medalists and Infer, across all tasks correctly solved by the respective pair

```
1 int main () {
2   if (0) {
3     int x = 0x7fffffff + 1;
4   }
5 }
```
(a) Infer correctly reports safety

```
1 void reach_error () {
2   int x = 0x7fffffff + 1;
3 }
4 int main () {
5   if (0) {
6     reach_error ();
7   }
8 }
```
(b) Infer incorrectly reports an alarm

```
1 int main () {
2   int x = 0x7fffffff;
3   int y = -1;
4   while (x > 0) {
5     x = x - 2*y;
6   }
7 }
```
(c) Infer correctly reports an alarm

```
1 int main () {
2   int x = 0x7fffffff;
3   int y = -1;
4   while (x > 0) {
5     x = x - 2*y;
6     y = y + 2;
7   }
8 }
```
(d) Infer incorrectly reports safety

Fig. 2: Examples of Infer's inconsistent results

#### 2.2 Strengths of Infer

Infer scales well [6]. This shows in the SV-COMP results: For 6 000 out of 8 000 tasks with a verification verdict, Infer finishes the analysis in less than one second of CPU time. The remaining 2 000 tasks each take less than 100 s of CPU time. This means that Infer stays significantly below the time limit of 900 s per task. Figure 1 compares the run time of Infer (in CPU-time seconds) to the best SV-COMP '22 tools in the categories that Infer participated in: CPAchecker [11], Symbiotic [8], and VeriAbs [12]. Each plot shows the run time for all tasks that are correctly solved by both Infer and the respective other verifier (independent of result validation). It is visible that Infer (y-axis) is significantly faster than the other tools (x-axis) for almost all tasks. This speed makes Infer integrate well in continuous-integration development systems [13, 15].

### 2.3 Weaknesses of Infer

Infer demonstrates low analysis precision. Figures 2a and 2b illustrate low precision across function calls: both programs contain an unreachable signed-integer overflow, and the only difference is the indirection in Fig. 2b due to the additional function call. Infer correctly reports Fig. 2a safe, but incorrectly reports an alarm for Fig. 2b. We assume that Infer's intraprocedural analysis does not check whether reach\_error is reachable from the program entry. Infer-sv mitigates this issue for property unreach-call through the mentioned program transformation, but this imprecision still leads Infer to report wrong alarms across all program properties.

Infer can also show imprecision within a single function. Consider Figs. 2c and 2d: The only change between Fig. 2c and Fig. 2d is the addition of a statement in line 6, y = y + 2. This has no influence on the integer overflow in line 5, so both programs contain an overflow. Infer correctly reports the overflow for Fig. 2c, but wrongly reports Fig. 2d safe.

These imprecisions strongly reflect in the SV-COMP results of Infer, leading to many incorrect proofs and alarms.

# 3 Usage

Infer-sv requires Python 3.6 or later. Script setup.sh downloads and extracts version 1.1.0 of Infer. From the tool's directory, Infer-sv can be run with the following command:

```
./infer-wrapper.py \
    --data-model {ILP32|LP64} \
    --property path/to/property.prp \
    --program path/to/program.c
```
Setting the data model is optional. Infer-sv prints the recognized property and the command line it uses to call Infer. Infer-sv then prints the full output of Infer, including all warnings, and the final verification verdict on the last line. The verification verdict is one of true, false, unknown, or error.
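When scripting around Infer-sv, the verdict can be read off the last output line; the helper below is a hypothetical sketch (parse_verdict is not part of Infer-sv), relying only on the stated output format:

```python
# Hypothetical helper: extract the verdict that Infer-sv prints
# on the last line of its output.
def parse_verdict(output: str) -> str:
    verdict = output.strip().splitlines()[-1].strip().lower()
    if verdict not in {"true", "false", "unknown", "error"}:
        raise ValueError(f"unexpected verdict: {verdict}")
    return verdict
```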

# 4 Conclusion

The participation of Infer in SV-COMP allows an objective comparison with other verifiers for C. This shows that the selected analyses of Infer are very efficient, but suffer from strong imprecision on the considered benchmark tasks.

Contributors. Infer <sup>7</sup> is developed by Facebook and the open-source community under the MIT license, and Infer-sv <sup>8</sup> is developed under the Apache 2.0 license at the Software and Computational Systems Lab at LMU Munich, led by Dirk Beyer.

<sup>7</sup> https://github.com/facebook/infer

<sup>8</sup> https://gitlab.com/sosy-lab/software/infer-sv

Funding Statement. This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [1] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [2].

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# LART: Compiled Abstract Execution<sup>⋆</sup> (Competition Contribution)

Henrich Lauko<sup>⋆⋆</sup> and Petr Ročkai

Faculty of Informatics, Masaryk University, Brno, Czech Republic xlauko@mail.muni.cz

Abstract. lart – llvm abstraction and refinement tool – originates from the divine model-checker [5,7], in which it was employed as an abstraction toolchain for the llvm interpreter. In this contribution, we present a stand-alone tool that does not need a verification backend but performs the verification natively. The core idea is to instrument abstract semantics directly into the program and compile it into a native binary that performs program analysis. This approach provides the performance gain of native execution over interpreted analysis and allows compiler optimizations to be employed on abstracted code, further extending the analysis efficiency. Compilation-based abstraction introduces new challenges solved by lart, like domain interaction of concrete and abstract values, simulation of a nondeterministic runtime, or constraint propagation.

Keywords: Abstract interpretation · Compilation-based abstraction · llvm · lart · divine · Formal verification · Symbolic execution.

# 1 Verification Approach and Software Architecture

As with many tasks in computer science, verification can be approached in multiple ways. In general, tools approach program analysis using interpretation, which gives them complete control over the program state and execution but comes at a cost in performance. Our tool lart tackles the task from the opposite side of the spectrum – compilation – using a technique of so-called compilation-based abstraction. The main idea of this approach is to compile nondeterministic execution directly into the executable and perform reachability analysis by its native execution. This approach is most similar to the one presented in symcc [6], which compiles symbolic execution into the native binary. In contrast, we present a more general approach that allows arbitrary abstraction. The Spin model checker [4] also provides a mode where the model is compiled together with a verifier into a single executable.

During the compilation, lart performs an llvm-to-llvm transformation to augment instructions that can manipulate nondeterministic values. This is

<sup>⋆</sup> This work has been partially supported by Red Hat, Inc.

<sup>⋆⋆</sup> Jury member representing lart at sv-comp 2022.

a purely syntactic abstraction of a program; e.g., an add instruction is replaced by a call to lart's add routine. Additionally, lart provides a set of semantic libraries (abstract domains) to give meaning to abstract instructions. Each abstract domain defines the native representation of abstract values and implements abstract instructions and transformations to and from concrete values and other domains. The tool provides multiple domains that allow analyses with various precisions, e.g., interval analysis, nullity analysis, or symbolic analysis. Finally, to allow native execution, the domains are provided as static libraries linked to the instrumented programs under test.

In comparison to concrete programs, abstracted programs also exhibit nondeterministic control flow. To explore all possible execution paths, lart provides a configurable runtime library. The overall architecture of compilation-based abstraction is depicted in Figure 1.

The configuration used in the competition contribution employs an iterative deepening search of program paths. At each branching point of a program, the execution forks to explore all possibilities. Finally, the main process of the analysis gathers results from explored paths and notifies the user if an error is reachable. This approach eventually suffers from potential infinite loops and the path explosion problem. However, it is sufficient for bug hunting, or even for verification in the case of an employed overapproximative abstraction, which widens the effect of infinite loops. Also, in many simple cases, a compiler can summarize the effects of program loops, minimizing the impact of path explosion.

Fig. 1. lart architecture overview.
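The iterative deepening search described above can be modeled with a toy sketch. lart forks operating-system processes; the single-process Python model below (with an invented "program" over boolean branch choices) only illustrates the search order, not the tool's implementation:

```python
def run(path):
    """Toy 'program': reaches an error on one specific branch
    sequence, asks for more choices on shorter prefixes."""
    target = (True, False, True)
    if tuple(path) == target:
        return "error"
    if len(path) < len(target):
        return "more"
    return "ok"

def iterative_deepening(max_depth):
    """Explore all branch sequences, deepening the bound step by
    step, and report the first error path found."""
    for depth in range(1, max_depth + 1):
        stack = [[]]
        while stack:
            path = stack.pop()
            verdict = run(path)
            if verdict == "error":
                return path
            if verdict == "more" and len(path) < depth:
                # "fork" into both branch outcomes
                stack.append(path + [False])
                stack.append(path + [True])
    return None
```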

In order to obtain a performant result, we strive to minimize the amount of syntactic abstraction. Instrumentation achieves this by combining a forward dataflow analysis with Andersen's alias analysis [1], tainting only those instructions that might encounter nondeterministic values, and abstracting only the tainted instructions. This analysis is entirely overapproximative and quickly detects all possible candidates for abstraction. The actual abstract computation is resolved later, during execution.

However, we do not want to perform expensive abstract computation when tainted instructions do not receive nondeterministic operands. This might occur when a C function receives concrete arguments at one call site and abstract arguments at another. In the former case, we would like to perform the computation fully concretely, while in the latter we want to execute only the necessary tainted instructions abstractly. Therefore, lart synthesizes simple dispatch routines that pick a concrete or abstract instruction depending on the operands. The dispatch routine also handles the possibility of mixing concrete and abstract operands, lifting concrete values to an abstract domain if necessary. We require that all operands of abstract instructions are in the same domain. See an example of dispatch in Figure 2.

```
__lart_value __lart_dispatch_add(__lart_value a, __lart_value b) {
    if (is_abstract(a) || is_abstract(b)) {
        if (!is_abstract(a))
            a.abstract = lift(a.concrete);
        else if (!is_abstract(b))
            b.abstract = lift(b.concrete);
        return domain::add(a.abstract, b.abstract);
    }
    return a.concrete + b.concrete;
}
```
Fig. 2. Syntactically abstracted values in lart are represented as a union of an abstract and a concrete type (__lart_value). The dispatch routine lifts operands to an abstract domain and resolves in which domain the instruction should be executed. Since the abstraction dispatch is purely syntactic, it can be inlined into the abstracted source code and further optimized. This gives the compiler the possibility to optimize repeated checks in dispatch routines.

The runtime for native execution takes care of multiple responsibilities. First of all, it implements an execution fork when a branch is conditioned on an abstract value for which both outcomes are possible, e.g., a branch conditioned on the symbolic term x < 5. Furthermore, the runtime takes care of the memory management of the abstraction. To avoid disrupting the original program's memory layout, lart keeps all abstract data in a shadow memory. Therefore, the union values presented in Figure 2 are split into two separately addressed regions – concrete program memory and abstract shadow memory. The information on whether variables hold an abstract value is also kept in the shadow memory.
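The concrete/shadow split can be pictured with a toy model (not lart's implementation; the class and its dict-based encoding are invented for illustration): concrete bytes live in the program's own memory, abstract values in a side table keyed by address.

```python
class ShadowMemory:
    """Toy model: concrete storage plus an abstract side table."""

    def __init__(self):
        self.concrete = {}   # address -> concrete value
        self.shadow = {}     # address -> abstract value, if any

    def store_concrete(self, addr, value):
        self.concrete[addr] = value
        self.shadow.pop(addr, None)  # address is no longer abstract

    def store_abstract(self, addr, abstract_value):
        self.shadow[addr] = abstract_value

    def is_abstract(self, addr):
        return addr in self.shadow

    def load(self, addr):
        # the shadow entry, when present, shadows the concrete value
        return self.shadow.get(addr, self.concrete.get(addr))
```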

#### 2 Strengths and Weaknesses

The main strength of compilation-based abstraction is the utilization of the native runtime and of compiler optimizations on abstracted code. In theory, native execution should consistently outperform the same analysis performed by an interpreter. However, it comes at the cost of a more complex source transformation that is harder to relate to its origin. Furthermore, the overapproximative nature of the syntactic analysis produces unnecessary executions of dispatch functions when they are not needed; in contrast, an interpreter can compute in a specific domain without additional dispatches. Another advantage of the approach is that the result of the syntactic abstraction is reusable: it can be linked with various domains to perform analyses concurrently without repeated llvm instrumentation.

The best comparison for lart is with the divine model-checker, which uses lart's transformation and domain libraries internally but, instead of compiling to a native executable, interprets the abstracted llvm ir. Results from the competition support the hypothesis that the compilation-based approach pays off: lart outperforms divine in all reachability subcategories except one, where the longer times are caused by a different state-space exploration order.

Given the simplistic runtime, abstracted binaries produced by lart lack further analysis optimizations and verification capabilities. Presently, the exploration algorithm only supports reachability analysis of single-threaded programs. However, we plan to support memory-safety and overflow checking using a sanitizer-like approach.

Another goal of lart's compilation-based approach is to provide a reusable abstraction component for verification tools. This concept has been proven with divine, and now with the native mode, whose output can be analyzed by the standard programmer toolset, like debuggers or sanitizers.

### 3 Tool Setup and Configuration

The verifier archive can be found on the sv-comp 2022 [2] page under the name lart. In case the binary distribution does not work on your system, we also provide a source distribution and build instructions at https://github.com/xlauko/lart/tree/svcomp-2022. It is sufficient to run lart using the compiler wrapper script as follows: lartcc <domain> testcase.c -o abstract and then execute the abstract binary to perform the analysis.

For the sv-comp contribution, the lart wrapper handles additional settings and the setup of the workflow presented in Figure 1. The wrapper sets lart options based on the property file and the benchmark. In particular, lart enables symbolic mode if any nondeterminism is found, and it sets which errors should be reported based on the property file. It also generates witness files. More details can be found on the aforementioned distribution page. Due to support limitations, lart participates only in the ReachSafety and DeviceDrivers categories.

### 4 Software Project and Contributors

The project home page is https://github.com/xlauko/lart. lart is open-source software distributed under the MIT license. Active contributors to the tool are listed as authors of this paper.

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [2] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of our verifier as used in the competition is archived together with other participating tools [3].

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Symbiotic 9: String Analysis and Backward Symbolic Execution with Loop Folding<sup>∗</sup> (Competition Contribution)

Marek Chalupa<sup>B</sup>, Vincent Mihalkovič, Anna Rechtáčková, Lukáš Zaoral, and Jan Strejček

Masaryk University, Brno, Czech Republic

Abstract. The development of Symbiotic 9 focused mainly on two components. One is the symbolic executor Slowbeast, which newly supports backward symbolic execution, including its extension called loop folding. This technique can infer inductive invariants from backward symbolic execution states. Thanks to these invariants, Symbiotic 9 is able to produce non-trivial correctness witnesses, a feature missing in previous versions of Symbiotic. We have also extended forward symbolic execution in Slowbeast with basic support for parallel programs. The second component with significant improvements is the instrumentation module. In particular, we have extended the static analysis of accesses to arrays with features designed for programs that manipulate C strings.

Symbiotic 9 is the Overall winner of SV-COMP 2022. Moreover, it also won the categories MemSafety and SoftwareSystems, and placed third in FalsificationOverall.

# 1 Verification Approach

Symbiotic 9 combines fast static analyses with code instrumentation and program slicing [13] to speed up the code verification. In the SV-COMP configuration of Symbiotic 9, the code verification is performed by symbolic executors, namely by Slowbeast [8] and our fork of Klee [4].

As Symbiotic works internally with llvm [10], it first compiles the given C program into llvm bitcode. The following steps depend on the verified property.

Verification of the Property unreach-call For this property, Symbiotic 9 directly slices the llvm bitcode to remove instructions that have no influence on the reachability of error calls and then runs Klee with a time limit of 333 seconds. Klee is very efficient and often decides the task within this time limit. If Klee fails to decide, we parse its output and proceed according to the cause of the failure. If Klee failed because the program contains threads, we

<sup>∗</sup> This work has been supported by the Czech Science Foundation grant GA19-24397S.

<sup>B</sup> Jury member and the corresponding author: chalupa@fi.muni.cz


Table 1. The comparison of supported features of Klee (our fork and the upstream) and Slowbeast (SV-COMP 2022 and SV-COMP 2021 versions). The marks ✓/(✓)/✗ mean supported/partially supported/unsupported.

run Slowbeast with forward symbolic execution (SE) and the threads support turned on. If Klee failed for any other reason, we run Slowbeast with backward symbolic execution with loop folding (BSELF) [8], described later. If BSELF also fails (the current implementation supports only selected program features), we run Slowbeast with forward symbolic execution.
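The cascade for unreach-call can be summarized schematically. The sketch below is illustrative only: the runner callbacks and the Result record are invented, and the real control scripts also parse Klee's output to classify the failure.

```python
from dataclasses import dataclass

@dataclass
class Result:
    decided: bool
    verdict: str = "unknown"
    failed_due_to_threads: bool = False

def verify_unreach_call(run_klee, run_se_threads, run_bself, run_se):
    """Klee first (the text gives it 333 s); on a thread-related
    failure, forward SE with threads; otherwise try BSELF; if BSELF
    also fails, fall back to plain forward SE."""
    r = run_klee()
    if r.decided:
        return r
    if r.failed_due_to_threads:
        return run_se_threads()
    r = run_bself()
    return r if r.decided else run_se()
```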

Note that running forward symbolic execution first with Klee and then with Slowbeast if Klee fails makes good sense, as Klee and Slowbeast support different sets of features. The main differences between these tools (and the upstream Klee and the version of Slowbeast used in Symbiotic 8) are summarized in Table 1. Row symbolic addresses indicates whether the tools model the non-determinism in the placement of allocated objects (this is useful, e.g., when comparing addresses of such objects). Row incremental solving indicates whether the tools can associate the state of an SMT solver to every symbolic execution state and incrementally add constraints instead of always solving formulas from scratch. Row caching solver calls indicates whether the tools can remember results of solver calls and use them later to quickly decide some other solver calls. Finally, row lazy memory indicates whether the tool can create memory objects on demand when they are first accessed, without their previous allocation (it assumes that the accesses to memory are valid). This feature is crucial when we want to execute a program by parts, without starting from the entry point. The meaning of the remaining rows should be clear or is explained later.
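The lazy memory row can be illustrated with a toy model (not Slowbeast's implementation; the class and its naming of symbolic values are invented): memory objects are materialized with fresh symbolic contents on first access.

```python
class LazyMemory:
    """Toy model of lazy memory: an object springs into existence,
    holding a fresh symbolic value, the first time it is accessed."""

    def __init__(self):
        self.objects = {}
        self.counter = 0

    def load(self, addr):
        if addr not in self.objects:
            # first access: materialize with a fresh symbolic value
            self.objects[addr] = f"sym_{self.counter}"
            self.counter += 1
        return self.objects[addr]

    def store(self, addr, value):
        self.objects[addr] = value
```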

If an error is found by either tool, it is replayed on the unsliced code. If the replay succeeds, we generate a violation witness. If no error is found and the analysis was complete, we generate a correctness witness. If the program correctness was proved by Slowbeast with BSELF, we generate a witness containing the computed invariants, otherwise we generate a trivial correctness witness as we have no invariants at hand. In all other cases, Symbiotic 9 answers unknown.

Verification of Other Properties For verification of properties other than unreach-call, Symbiotic 9 uses the same workflow as Symbiotic 8 [7]. In brief, the instrumentation module marks program instructions that can potentially violate the considered property. The module employs suitable fast static analyses to identify these instructions (e.g., when checking the property no-overflow, it uses a range analysis to discover the instructions that may perform a signed integer overflow). The bitcode with marked instructions is sliced such that the arguments and the reachability of these instructions are preserved. The sliced bitcode is passed to Klee. If it discovers a property violation that is then successfully replayed on the unsliced code, we produce a violation witness. If Klee completes its analysis without finding any property violation, we produce a trivial correctness witness. In all other cases, Symbiotic 9 returns unknown.

Backward Symbolic Execution with Loop Folding (BSELF) [8] Slowbeast newly implements backward symbolic execution (BSE) [9], which explores the program backward from target locations towards the initial location and incrementally computes weakest preconditions for the explored program paths. BSE is a valuable technique on its own as it precisely corresponds to k-induction on control-flow paths [8]. Loop folding is a technique that aims to infer inductive invariants during BSE. Roughly speaking, when BSE starts from an error location and reaches a loop header, loop folding creates an initial invariant candidate that is disjoint with the current weakest precondition (i.e., the states that can reach the error location). If the invariant candidate is actually an invariant, we know that the error location is not reachable via the explored path. Otherwise, a pre-image of the invariant candidate along a loop path is computed, over-approximated, and added to the candidate. This process is repeated until an invariant is found or until it fails for some reason, e.g., when it discovers that the error location is actually reachable. Loop folding can infer complex disjunctive invariants and since it uses the error states, it is also property-driven.
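Slowbeast's implementation is not shown here; as an illustration of the backward direction only, the following sketch computes a weakest precondition over a straight-line path by substitution (predicates and expressions are modeled as plain Python functions; loop folding itself is beyond this sketch):

```python
def wp_assign(var, expr, pred):
    """wp of `var = expr` w.r.t. pred: pred holds after the
    assignment iff pred with var replaced by expr holds before."""
    return lambda env: pred({**env, var: expr(env)})

# Path: x = x + 1; y = x * 2  -- target condition: y > 10.
# BSE walks the path backward, so the last assignment is handled first.
target = lambda env: env["y"] > 10
pred = wp_assign("y", lambda e: e["x"] * 2, target)   # (x * 2) > 10
pred = wp_assign("x", lambda e: e["x"] + 1, pred)     # ((x + 1) * 2) > 10
# pred now expresses the weakest precondition: x > 4
```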

String Analysis and Other Improvements The second major improvement in Symbiotic 9 is in the instrumentation for the property valid-memsafety. We have improved the analysis for the identification of out-of-bounds array accesses.

In Symbiotic 8, this analysis only determined whether an array access done via the index variable is in bounds [14]. The analysis in Symbiotic 9 also handles more general patterns where the array contains a concrete value (0 in the case of C strings) and the index pointer is incremented by one until it points to this concrete value, and where the pointer is incremented a fixed number of times.

Further, we have extended the forward symbolic execution in Slowbeast to handle parallel programs. For now, the symbolic execution is highly inefficient as it examines each interleaving of globally visible events. We plan to implement some reductions in the future. Slowbeast has also been extended to generate witnesses, as this functionality was missing. Notably, it can generate non-trivial correctness witnesses using the invariants computed by BSELF. Previous versions of Symbiotic generate only trivial correctness witnesses.

Slicing has also been improved. It now applies a fast and coarse slicing before the main slicing. The coarse slicing detects all basic blocks from which no slicing criterion (i.e., an instruction whose reachability and arguments should be preserved) is syntactically reachable and replaces them by calls to abort.
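One possible reading of the coarse slicing step, sketched with an invented CFG encoding (block names and the dict representation are illustrative, not Symbiotic's data structures):

```python
from collections import deque

def blocks_to_abort(cfg, criteria):
    """cfg: block -> set of successor blocks; criteria: blocks that
    contain a slicing criterion. Returns the blocks from which no
    criterion is reachable, i.e., the candidates for a call to abort."""
    # reverse the CFG, then BFS backward from the criterion blocks
    rev = {b: set() for b in cfg}
    for block, succs in cfg.items():
        for s in succs:
            rev[s].add(block)
    can_reach = set(criteria)
    work = deque(criteria)
    while work:
        for pred in rev[work.popleft()]:
            if pred not in can_reach:
                can_reach.add(pred)
                work.append(pred)
    return set(cfg) - can_reach
```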

#### 2 Strengths and Weaknesses

Forward symbolic execution is unable to fully analyze unbounded loops or infinite execution paths. Hence, unless program slicing removes the unbounded computation from the program, forward symbolic execution cannot verify it. However, backward symbolic execution and BSELF can fully analyze at least some unbounded programs [8]. Still, both these methods are computationally complex as the number of paths they must search may be enormous and their exploration may involve many non-trivial calls to the SMT solver. Therefore, these methods do not scale to real-world programs.

A strong aspect of Symbiotic is the very interplay of fast static analyses in the instrumentation, program slicing, and forward and backward symbolic execution. Fast static analyses are able to deem correct many parts of the code (with respect to the verified property). These parts of the code are then usually removed by slicing and only the possibly unsafe parts of the program (and their dependencies) get into a symbolic executor. In this sense, Symbiotic does incremental or conditional [3] verification.

Results of Symbiotic 9 in SV-COMP 2022 In SV-COMP 2022 [1], Symbiotic 9 won categories MemSafety, SoftwareSystems, and Overall, and got the 3rd place in FalsificationOverall. Moreover, it produced 1529 correct answers that were not confirmed, which is the highest number in SV-COMP 2022. 1073 unconfirmed answers are in MemSafety-Juliet, where we produced some incorrect witnesses due to a bug. Another 258 unconfirmed answers are in Termination. Symbiotic 9 produced only 3 incorrect answers caused by a bug in the replay mode of Slowbeast.

#### 3 Software Project and Contributors

All components of Symbiotic 9 use llvm 10 [10]. The slicer and the instrumentation module are written in C++ and extensively use the library DG [5]. Klee is implemented in C++, and Slowbeast [12] is written in Python. Both symbolic executors use Z3 [11] as the SMT solver. Control scripts are written in Python.

Symbiotic 9 and all its components and external libraries are available under open-source licenses that comply with SV-COMP's policy for the reproduction of results. Symbiotic 9 participated in all categories of SV-COMP 2022 except the categories with Java programs.

Symbiotic 9 has been developed by Marek Chalupa, Vincent Mihalkovič, Anna Rechtáčková, and Lukáš Zaoral under the supervision of Jan Strejček.

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [1] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of Symbiotic used in the competition is archived together with other participating tools [2] and also in its own artifact [6] at Zenodo.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Symbiotic-Witch: A Klee-Based Violation Witness Checker<sup>⋆</sup> (Competition Contribution)

Paulína Ayaziová, Marek Chalupa, and Jan Strejček

Faculty of Informatics, Masaryk University, Brno, Czech Republic {xayaziov,chalupa,strejcek}@fi.muni.cz

Abstract. Symbiotic-Witch is a new tool for checking violation witnesses in the GraphML-based format used at SV-COMP since 2015. Roughly speaking, Symbiotic-Witch symbolically executes a given program with Klee and simultaneously tracks the set of nodes the witness automaton can be in. Moreover, it reads the return values of nondeterministic functions specified in the witness and uses them to prune the symbolic execution. The violation witness is confirmed if the symbolic execution reaches an error and the current set of witness nodes contains a matching violation node.

Symbiotic-Witch currently supports violation witnesses of reachability safety, memory safety, memory cleanup, and overflow properties.

# 1 Verification Approach

We present a new checker of violation witnesses called Symbiotic-Witch. The checker first loads a given violation witness in the GraphML format [5] and a given program. Then it performs symbolic execution [11] of the program and simultaneously tracks the progress of the execution in the witness automaton. More precisely, every state of symbolic execution is accompanied by the set of witness automaton nodes that can be reached under the executed program path. If the symbolic execution detects a violation of the considered property and the tracked set of witness automata nodes contains a violation node, the witness is confirmed.

Note that the original description of the witness format [5] does not provide any formal semantics of the format. We interpret it in the way that if an edge in a witness automaton matches an executed program instruction, then we can follow the edge, but we can also stay in its starting node. Hence, if we have the set of witness automaton nodes reached under a certain program path, then a prolongation of this path can add some nodes to this set, but it never removes any node from the set. A brief reading of an upcoming detailed description of the format [4] reveals that it may be the case that an edge matching an executed program instruction has to be taken. If this is indeed the case, we will adjust

<sup>⋆</sup> This work has been supported by the Czech Science Foundation grant GA19-24397S.

our tool, but the current implementation and the following text consider the former semantics.

Before Symbiotic-Witch starts the symbolic execution, we remove from the witness automaton all nodes that are not on any path from the entry node to a violation node. In general, witness automata are related to program executions using node and edge attributes. Symbiotic-Witch currently supports only some attributes of witness edges to map a program execution to a given witness automaton. Namely, it uses the line numbers of executed instructions, the information whether the true or the false branch is taken, and the information about entering a function or returning from a function. Additionally, if the witness automaton contains a single path from the entry node to a violation node and there is some information about return values of the \_\_VERIFIER\_nondet\_\* functions on this path, then we use these values in the symbolic execution of the program. Return values not provided in the witness are treated as symbolic values.
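The node-set tracking described above (follow a matching edge or stay in the starting node, so the set only grows along a path) can be modeled as follows. The edge/observation encoding is invented for illustration; the real tool matches on line numbers, branching, and function entry/return.

```python
def step(nodes, edges, observation):
    """One tracking step of the witness automaton.

    nodes: current set of automaton nodes for this symbolic state;
    edges: iterable of (src, guard, dst), where guard is a predicate
           on the observation from the executed instruction;
    returns the new node set: staying is always allowed, and every
    matching edge may additionally be taken."""
    reached = set(nodes)  # staying in a node is always permitted
    for src, guard, dst in edges:
        if src in nodes and guard(observation):
            reached.add(dst)
    return reached
```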

A more precise description of the approach can be found in the bachelor's thesis of P. Ayaziová [1].

#### 2 Software Architecture

The approach has been implemented in a tool called Symbiotic-Witch, which is basically a modification of the symbolic executor Klee [8]. More precisely, it is derived from the clone of Klee used in Symbiotic, which employs the SMT solver Z3 [13] and supports symbolic pointers, memory blocks of symbolic sizes, etc. For parsing witnesses in the GraphML format, we use the library RapidXML.

As Klee executes programs in llvm [12], a given C program has to be translated to llvm first. We use Clang for this translation as explained in Section 4.

The current version of Symbiotic-Witch runs on llvm version 10.0.0.

#### 3 Strengths and Weaknesses

Existing violation witness checkers (excluding Dartagnan [10] designed for concurrent programs) can be roughly divided into two categories.

– CPA-witness2test [6], FShell-witness2test [6], and Nitwit [14] perform one program execution based on the information in the witness. If this execution violates the specification, the witness is confirmed. This approach is very efficient for witnesses fully describing one program execution that violates the property. However, if a witness describes more program executions and only some of them violate the property, these tools can easily miss the violating executions. In particular, if a witness does not specify some return value of a \_\_VERIFIER\_nondet\_\* function, FShell-witness2test uses the default value 0, Nitwit picks a random value, and CPA-witness2test fails the witness confirmation.

– CPAchecker [5], UltimateAutomizer [5], and MetaVal [7] create a product of a given witness automaton and the original program and analyze it. As a result, some execution paths of the original program can be analyzed repeatedly for different paths in the witness automaton. To suppress this effect, these checkers usually ignore the possibility of staying in a witness automaton node whenever there is a matching transition leaving the node. Unfortunately, a valid witness can be unconfirmed due to this strategy.

We believe that our approach to checking violation witnesses removes all mentioned disadvantages. Symbolic execution allows us to efficiently examine many program executions corresponding to a given witness automaton, and program executions are not analyzed repeatedly. The approach can easily handle witnesses based on return values from the \_\_VERIFIER\_nondet\_\* functions as well as those based on description of branching.

There is only one principal case when a valid witness is not confirmed by Symbiotic-Witch (ignoring the cases when Symbiotic-Witch simply runs out of resources). This case can arise when Symbiotic-Witch uses the information about return values of \_\_VERIFIER\_nondet\_\* functions stored in the witness. Symbiotic-Witch uses the information immediately when the symbolic execution calls such a function and there is a matching edge in the witness with a return value that has not been used yet (i.e., the starting node of the edge is in the set of tracked witness nodes and the target node is not). This "eager approach" usually works very well, especially for witnesses containing return values for all calls of \_\_VERIFIER\_nondet\_\* functions. However, there can be witnesses where some return values are missing and a particular contained return value should not be used for the first matching call of the \_\_VERIFIER\_nondet\_\* function. Such witnesses can be valid, but Symbiotic-Witch can fail to confirm them. As far as we know, such witnesses do not appear in SV-COMP and other witness checkers would probably fail to confirm them as well.
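The "eager" selection described above might be sketched as follows (the edge encoding is hypothetical; the real tool works on the witness automaton structure during symbolic execution):

```python
def eager_nondet(tracked, edges):
    """Pick the return value for a __VERIFIER_nondet_* call eagerly:
    use the value of the first edge whose source node is tracked and
    whose target node has not been reached yet.

    edges: iterable of (src, dst, return_value or None)."""
    for src, dst, value in edges:
        if value is not None and src in tracked and dst not in tracked:
            return value
    return None  # caller falls back to a fresh symbolic value
```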

On the negative side, our approach inherits the disadvantages and limitations of symbolic execution and Klee. In particular, it can suffer from the path-explosion problem on witnesses that do not provide return values of \_\_VERIFIER\_nondet\_\* functions. Further, Symbiotic-Witch does not support parallel programs, as Klee does not support them.

Our current approach is suitable for cases when a witness can be checked based on a finite program execution. That is why our tool supports violation witnesses of safety properties. Table 1 shows the numbers of violation witnesses confirmed in SV-COMP 2022 [2] by individual witness checkers in the categories supported by Symbiotic-Witch.

We believe that symbolic execution can also be used for checking termination violation witnesses and for checking correctness witnesses. We plan to extend Symbiotic-Witch in these directions. We also plan to add a witness refinement mode [5] already provided by CPAchecker and UltimateAutomizer. In this mode, when a witness is confirmed, Symbiotic-Witch would produce another witness describing a single program execution (by specifying return values for all calls of \_\_VERIFIER\_nondet\_\* functions) that exhibits the property violation.


Table 1. The numbers of confirmed witnesses in relevant SV-COMP 2022 categories

# 4 Tool Setup and Configuration

For the use in SV-COMP 2022, we have integrated our witness checker (originally called Witch-Klee) with Symbiotic [9], which takes care of translating a given C program into LLVM bitcode using Clang and then slightly modifies the bitcode to improve the efficiency of witness checking.

The archive with Symbiotic-Witch can be downloaded from SV-COMP archives. The witness checking process is invoked by

./symbiotic [--prp <prop>] [--32] --witness-check <wit.graphml> <prog.c>

where <wit.graphml> is the violation witness to be checked and <prog.c> is the corresponding program. By default, the tool considers the reachability safety property and a 64-bit architecture. The considered property can be changed with the --prp option, where <prop> is instantiated to memsafety, memcleanup, or no-overflow. The 32-bit architecture is selected by --32.

Our witness checker can also be downloaded directly from its repository mentioned below. The version used in SV-COMP 2022 is marked with the tag SV-COMP22. It can be executed without Symbiotic via a shell script as

./witch.sh <prog.c> <wit.graphml>

which calls Clang to translate <prog.c> to LLVM bitcode and then passes the bitcode and the witness <wit.graphml> to the witness checker.

## 5 Software Project and Contributors

Symbiotic-Witch has been developed at the Faculty of Informatics, Masaryk University by Paulína Ayaziová under the guidance of Marek Chalupa and Jan Strejček. The tool is available under the MIT license, and all used tools and libraries (LLVM, Klee, Z3, RapidXML, Symbiotic) are also available under open-source licenses that comply with SV-COMP's policy for the reproduction of results. The source code of our witness checker can be found at:

https://github.com/ayazip/witch-klee

Data Availability Statement. All data of SV-COMP 2022 are archived as described in the competition report [2] and available on the competition web site. This includes the verification tasks, results, witnesses, scripts, and instructions for reproduction. The version of Symbiotic-Witch used in the competition is archived together with other participating tools [3].

# References


14. Švejda, J., Berger, P., Katoen, J.: Interpretation-based violation witness validation for C: NITWIT. In: Biere, A., Parker, D. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 26th International Conference, TACAS 2020, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2020, Dublin, Ireland, April 25-30, 2020, Proceedings, Part I. Lecture Notes in Computer Science, vol. 12078, pp. 40–57. Springer (2020), https://doi.org/10.1007/978-3-030-45190-5\_3

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Theta: portfolio of CEGAR-based analyses with dynamic algorithm selection (Competition Contribution)

Zsófia Ádám<sup>1</sup>, Levente Bajczi<sup>1</sup>, Mihály Dobos-Kovács<sup>1</sup>, Ákos Hajdu<sup>2</sup>, and Vince Molnár<sup>1</sup><sup>⋆</sup>(B)

<sup>1</sup> Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary, molnarv@mit.bme.hu <sup>2</sup> Meta Platforms Inc., London, United Kingdom

Abstract. Theta is a model checking framework based on abstraction refinement algorithms. In SV-COMP 2022, we introduce: 1) reasoning at the source level via a direct translation from C programs; 2) support for concurrent programs with interleaving semantics; 3) mitigation for non-progressing refinement loops; 4) support for SMT-LIB-compliant solvers. We combine all of the aforementioned techniques into a portfolio with dynamic algorithm selection.

# 1 Verification Approach and Software Architecture

Theta [10] is a generic and configurable model checking framework written in Java 11. A simplified version of the architecture (focusing on software verification aspects) can be seen in Figure 1.

Fig. 1. Architecture of Theta.

The input is a C program that is first translated to extended control-flow automata (XCFA). Previously, Theta used LLVM [3], which had various advantages, but its static single assignment (SSA) form proved overall disadvantageous for abstraction-based algorithms. This year we use a new, direct translation (no

<sup>⋆</sup> Jury member representing Theta at SV-COMP 2022.

intermediate language and SSA form) via an ANTLR parser. Furthermore, the CFA is "extended" in the sense that, as of this year, we support concurrent programs through an analysis with interleaving semantics. After parsing, we apply various passes to the XCFA (e.g., large-block encoding or partial order reduction). The core of Theta is a CEGAR-based analysis framework targeting reachability properties via predicate and explicit analyses [8], along with interpolation- and Newton-based refinements [7]. This year, Theta added generic support for SMT solvers (including interpolation) via the SMT-LIB interface. At SV-COMP'22 we use CVC4 [4], MathSAT [6], and Z3 [9], where the latter is used via its Java API as before. Finally, a verdict (safe, unsafe, unknown) and a witness are produced for the C program (using metadata from the translation).
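The CEGAR scheme at Theta's core can be summarized in a tool-agnostic skeleton (a sketch, not Theta's actual API; the toy instance below is invented for illustration):

```python
# Generic CEGAR skeleton: check the abstraction, return "unsafe" on a feasible
# counterexample, refine the precision on a spurious one, and detect
# non-progressing refinement (as Theta's dynamic check does).
def cegar(check_abstract, is_feasible, refine, precision):
    while True:
        cex = check_abstract(precision)          # counterexample in the abstraction?
        if cex is None:
            return "safe"
        if is_feasible(cex):
            return "unsafe"                      # the counterexample is real
        new_precision = refine(precision, cex)
        if new_precision == precision:
            return "unknown"                     # non-progressing refinement loop
        precision = new_precision

# Toy instance: concrete reachable states are the even numbers 0..10, and the
# property is "state 7 is unreachable"; precision is a set of tracked predicates.
reachable = {x for x in range(11) if x % 2 == 0}
overapprox = lambda prec: {x for x in range(11) if all(p(x) for p in prec)}
result = cegar(
    check_abstract=lambda prec: 7 if 7 in overapprox(prec) else None,
    is_feasible=lambda cex: cex in reachable,
    refine=lambda prec, cex: prec | {lambda x: x % 2 == 0},  # learn "x is even"
    precision=frozenset(),
)
print(result)  # "safe"
```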

Fig. 2. Overview of the dynamic portfolio of Theta.

Verification portfolio. Based on preliminary experiments and domain knowledge, we manually constructed a dynamic algorithm selection portfolio [1] for SV-COMP'22, illustrated by Figure 2. Rounded white boxes correspond to decision points. We start by branching on the arithmetic (floats, bitvectors, integers). Under integers, there are further decision points based on the cyclomatic complexity and the number of havocs and variables. Grey boxes represent configurations, defining the solver/domain/refinement (in this order). Lighter and darker grey represent explicit and predicate domains, respectively. Internal timeouts are written below the boxes. An unspecified timeout means that the configuration can use all the remaining time. The solver can be CVC4 (C) [4], MathSAT (M), MathSAT with floats (Mf) [6], or Z3 (Z) [9]. Abstract domains are explicit values (E), explicit values with all variables tracked (EA), Cartesian predicate abstraction (PC), or Boolean predicate abstraction (PB) [8]. Finally, refinement can be Newton with weakest preconditions (N) [7], sequence interpolation (S), or backward binary interpolation (B) [8]. Arrows marked with a question mark (?) indicate an inconclusive result, which can happen due to timeouts or unknown results. Furthermore, this year's portfolio also includes a novel dynamic (run-time) check for refinement progress between iterations that can shut down potential infinite loops (by treating them as an unknown result) [1]. Note also that for solver issues (e.g., exceptions from the solver) we have different paths in some cases.
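The decision-tree structure of such a portfolio can be sketched as follows. Note that the branching thresholds and configuration sequences below are invented for illustration; the actual tree in Figure 2 is more detailed:

```python
# Illustrative sketch of dynamic algorithm selection. Each configuration is a
# (solver, domain, refinement) triple using the paper's abbreviations; the
# caller tries them in order until one is conclusive.

def select_configurations(arithmetic, cyclomatic_complexity=0, num_havocs=0):
    """Return an ordered list of (solver, domain, refinement) configurations."""
    if arithmetic == "float":
        return [("Mf", "E", "N"), ("C", "E", "N"), ("Mf", "PC", "S")]
    if arithmetic == "bitvector":
        return [("M", "E", "N"), ("Z", "E", "B"), ("M", "PB", "S")]
    # integers: branch further on structural metrics of the program
    if cyclomatic_complexity > 30 or num_havocs > 10:
        return [("Z", "EA", "S"), ("Z", "PB", "B")]
    return [("Z", "E", "S"), ("Z", "PC", "B"), ("Z", "PB", "B")]

print(select_configurations("bitvector")[0])  # ('M', 'E', 'N')
```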

## 2 Strengths and Weaknesses

Theta currently targets ReachSafety and ConcurrencySafety with limited support for structs, arrays, and pointers, and no support for dynamic memory allocation, mutexes, and recursion. Due to this, Theta fails for most tasks in ProductLines, Recursive, Heap, and Arrays. Out of the 6163 tasks, roughly 2/3 can be translated and there are 888 confirmed correct (541 safe, 347 unsafe), 116 unconfirmed correct, and only 15 incorrect (11 false positive, 4 false negative) results [5]. Note that almost all unsupported cases are detected and reported as an error, and we only have a few incorrect results due to subtle issues.

The main strength of the tool is the combination of algorithm selection (picking an algorithm based on the input) and portfolios (trying multiple algorithms until one succeeds). Out of the 1004 correct results, 315 could not be solved by the first configuration that the portfolio tries: dynamic checks intervened for 181 internal timeouts, 72 solver issues (e.g., wrong models), 19 non-progressing refinements, and 74 other (unknown) faults before the eventual success.

Having a diverse portfolio also paid off. Bitvector and float arithmetic tasks were either solved by explicit analyses (with a mixture of interpolation- and Newton-based refinements) before even trying predicate configurations, or, if explicit analyses failed, predicate configurations were unsuccessful too. Integer arithmetic required a more diverse configuration set: predicate abstraction solved roughly 48% of the tasks (45% Cartesian, 3% Boolean) and explicit analysis solved 52% (33% with empty precision, 19% with all variables tracked).

The SMT-LIB support provided a great improvement: previously we only had Z3, which still dominates the integer cases. However, all of the bitvector tasks were solved by MathSAT, making Z3 an unused backup. With floats, roughly half of the tasks were solved by MathSAT, while the other half needed CVC4 as backup. Since floats are reduced to bitvectors, we did not rely on Z3, owing to its poor performance in our preliminary experiments.

The most successful subcategories are BitVectors, ControlFlow, Loops, and XCSP (38–45% correct), mostly because they use features of C that our frontend supports well. We plan to mitigate the high number of timeouts in the future with approximations (e.g., mixing integers and bitvectors) and further analyses (e.g., inferring loop invariants). We also have a significant number of unconfirmed results: we believe this can be improved by generating more compact witnesses.

This year Theta added support for sequential concurrency via a preprocessing step: it yields an encoding where exploring all interleavings preserves inter-thread behaviors. The analyses treat consecutive non-global memory accesses as one atomic block, reducing the exploration of unnecessary total orders. A drawback of using preprocessing for partial order reduction instead of an on-line algorithm is the superfluous exploration of certain total orders; e.g., all interleavings of independent global memory accesses will also be explored. This is because such accesses might overlap with non-independent memory accesses at other times, and the preprocessing step is not aware of such details.

Using a wrapper, Theta integrates concurrency seamlessly with the existing framework (abstract domains, refinements), except for the error-location-based search [8] (used in non-concurrent cases), because the required distance metric is not well defined for concurrent programs. Instead, we opted for a breadth-first search, which outperformed depth-first strategies in preliminary tests. We theorize that this is because bugs are most of the time reachable within the first few instructions, but only via a specific total order. The performance for concurrent programs is still limited, though, and we plan to integrate a declarative approach in the future, which could be used for weakly-ordered programs as well.

# 3 Tool Setup and Configuration

The competition contribution is based on Theta 3.0.0-svcomp22-v1.<sup>3</sup> Additionally, Theta uses CVC4 v1.9, MathSAT v5.6.6, and Z3 v4.5.0. The project's repository contains build instructions, but an archive with pre-built binaries for Ubuntu 20.04 (LTS) can be found at the SV-COMP repository<sup>4</sup> and on Zenodo [2]. The toolchain requires the packages openjdk-11-jre-headless, libgomp1, and libmpfr-dev to be installed. The entry point of the toolchain is the script theta/theta-start.sh, which takes the verification task (C program) as its only mandatory input and runs the portfolio. As additional arguments we use --portfolio COMPLEX --witness-only --loglevel RESULT. Further arguments are described in the readme included with the binaries.

# 4 Software Project

Theta is maintained by the Critical Systems Research Group<sup>5</sup> of the Budapest University of Technology and Economics with various contributors. The project is available open-source on GitHub<sup>3</sup> under an Apache 2.0 license.

Data Availability. The version of Theta used in this paper is available at [2].

Acknowledgment and Funding. The authors would like to thank Tamás Tóth, Milán Mondok, István Majzik, Zoltán Micskei, and András Vörös for their contributions to the project; and the competition organizers, especially Dirk Beyer, for their help during the preparation for SV-COMP. The research contributions of the authors from the Budapest Univ. of Tech. and Econ. were funded by the EC and NKFIH through the Arrowhead Tools project (EU grant No. 826452, NKFIH grant 2019-2.1.3-NEMZ ECSEL-2019-00003), and by the ÚNKP-21-2 New National Excellence Program of ITM from the NRDI Fund.

<sup>3</sup> https://github.com/ftsrg/theta/releases/tag/svcomp22-v1

<sup>4</sup> https://gitlab.com/sosy-lab/sv-comp/archives-2022/-/blob/main/2022/theta.zip

<sup>5</sup> https://ftsrg.mit.bme.hu

# References



# Ultimate GemCutter and the Axes of Generalization (Competition Contribution)

Dominik Klumpp<sup>⋆1</sup>, Daniel Dietsch<sup>1</sup>, Matthias Heizmann<sup>1</sup>, Frank Schüssele<sup>1</sup>, Marcel Ebbinghaus<sup>1</sup>, Azadeh Farzan<sup>2</sup>, and Andreas Podelski<sup>1</sup>

<sup>1</sup> University of Freiburg, Freiburg im Breisgau, Germany klumpp@informatik.uni-freiburg.de <sup>2</sup> University of Toronto, Toronto, Canada

Abstract. Ultimate GemCutter verifies concurrent programs using the CEGAR paradigm, by generalizing from spurious counterexample traces to larger sets of correct traces. We integrate classical CEGAR generalization with orthogonal generalization across interleavings. Thereby, we are able to prove correctness of programs otherwise out-of-reach for interpolation-based verification. The competition results show significant advantages over other concurrency approaches in the Ultimate family.

# 1 Verification Approach

Ultimate GemCutter is a verification tool for concurrent programs based on the CEGAR paradigm: It (1) picks a trace from the set of all program interleavings (a possible "counterexample"), (2) proves correctness of this trace (the counterexample is "spurious"), and (3) generalizes the proof to conclude that a larger (usually infinite) set of traces is correct. Classically, CEGAR focuses on generalization across traces with varying numbers of loop iterations, by finding inductive loop invariants. GemCutter proposes additional generalization along an orthogonal axis: across interleavings. Concurrent programs contain many redundant interleavings of actions from different threads, i.e., interleavings with the same (input/output) behaviour. A naïve application of CEGAR requires explicit proofs of correctness for all these interleavings. Intermediate states during execution of redundant interleavings differ, and different interleavings often require different correctness proofs. GemCutter addresses this as illustrated in the figure on the right: We prove correctness of a trace τ, here τ = a₁a₂b, where a₁, a₂ are actions of the first thread,

<sup>⋆</sup> Jury Member: Dominik Klumpp

and b is an action of the second thread. The proof of correctness is generated using Craig interpolation or similar techniques. We generalize this proof into a Floyd-Hoare automaton [8] to show that a regular language L (green area in the figure above) of traces is correct. The new contribution is the subsequent generalization step: If a trace τ₁ differs from a (correct) trace τ₂ in L only by the ordering of independent statements, these traces are (Mazurkiewicz-) equivalent [3]. We conclude that τ₁ is also correct. Hence, the set of all such traces, denoted cl(L) (pink area), contains only correct traces. If the set of all program interleavings P is a subset of cl(L), we conclude that the program is correct.

To soundly make this conclusion, we need a suitable notion of independence between statements, which guarantees that the order of execution of two independent statements does not matter for program correctness. An intuitive sufficient condition is that neither statement writes to a memory location read or written by the other statement. If we cannot establish this condition syntactically, we use an SMT solver to check if executing the statements in either order is guaranteed to give the same result. We use information from the Floyd-Hoare automaton to refine this check in the style of conditional independence [5]. Such information can for instance express (but is not limited to) non-aliasing of pointers.
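The syntactic sufficient condition for independence can be sketched directly over read/write sets (a simplified sketch of the first check described above; GemCutter falls back to an SMT query when this check fails, and the statement encoding here is invented for illustration):

```python
# A statement is modeled as a pair (reads, writes) of sets of memory locations.
# Two statements are syntactically independent if neither writes to a location
# that the other reads or writes.

def syntactically_independent(stmt1, stmt2):
    r1, w1 = stmt1
    r2, w2 = stmt2
    return not (w1 & (r2 | w2)) and not (w2 & (r1 | w1))

# "x += A[i]" reads {x, A, i} and writes {x}; "y += A[j]" reads {y, A, j}
# and writes {y}: neither writes what the other touches, so they commute.
print(syntactically_independent(({"x", "A", "i"}, {"x"}),
                                ({"y", "A", "j"}, {"y"})))  # True
# "x++" and "z = x" conflict: the first writes x, which the second reads.
print(syntactically_independent(({"x"}, {"x"}), ({"x"}, {"z"})))  # False
```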

However, the inclusion P ⊆ cl(L) is in general undecidable [3]; cl(L) may not be regular. We reverse our viewpoint to provide a sufficient condition that can be effectively checked: Rather than adding all equivalent traces to L – thus obtaining cl(L) –, we instead remove all but one trace of each equivalence class from P – yielding a reduction P′ of P (formally, cl(P′) = P). We use the sleep set technique [5] to remove transitions from an automaton for P to get an automaton that recognizes one such reduction P′. We then check whether the (regular) reduction P′ is included in the (regular) language L. If this inclusion P′ ⊆ L holds, it implies that P ⊆ cl(L) also holds, and the program is correct. If the inclusion does not (yet) hold, GemCutter picks another program trace and repeats the process, iteratively building up the language L of correct traces by taking the union of the Floyd-Hoare automata computed in all iterations.
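The sleep set technique can be illustrated on a toy system where each action executes exactly once (a sketch of the classic algorithm, not GemCutter's automaton-based implementation):

```python
# Sleep set reduction sketch: when exploring action `a`, previously explored
# siblings that are independent of `a` are put to "sleep" in the subtree, so
# only one representative of each Mazurkiewicz equivalence class is enumerated.

def sleep_set_traces(remaining, sleep, independent, trace=""):
    traces = []
    current_sleep = set(sleep)
    for a in sorted(remaining):
        if a in current_sleep:
            continue
        child_sleep = {b for b in current_sleep if independent(a, b)}
        traces += sleep_set_traces(remaining - {a}, child_sleep, independent, trace + a)
        current_sleep.add(a)               # later siblings may sleep on `a`
    if not remaining:
        traces.append(trace)               # complete interleaving reached
    return traces

indep = lambda x, y: x != y and {x, y} != {"a", "c"}   # only a and c conflict
print(sleep_set_traces(frozenset("ab"), set(), indep))  # ['ab']  (one of two orders)
print(sleep_set_traces(frozenset("ac"), set(), indep))  # ['ac', 'ca']  (both kept)
```

With two independent actions only one interleaving survives; with two dependent ones, both orders are (correctly) explored.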

A key feature of the reduction-based approach is that the generalization along the iteration and interleaving axes is combined not just additively, but multiplicatively: In the geometrical intuition of the figure above, we do not just take the union of L (green area) with the equivalence class [τ] of τ (blue area), but consider all traces in cl(L) (the pink area which is spanned by both). Further, we heuristically try to pick a set of representatives in a way that harmonizes with CEGAR generalization, i.e., a reduction P′ with simple loop invariants. To this end, we prefer representatives with context-switches at all loop boundaries. Ideally, each thread performs one complete loop iteration and then hands control over to the next thread (the last thread hands control back to the first thread). Consider the example program on the right, with the postcondition x = y. Here, a proof for the

```
// Thread 1:
int x = 0;
for (int i = 0; i < N; ++i) {
  x += A[i];
}
```

```
// Thread 2:
int y = 0;
for (int j = 0; j < N; ++j) {
  y += A[j];
}
```
set of all interleavings P, or some inopportunely chosen reduction, needs invariants that capture the fact that x = ∑<sub>k=0</sub><sup>i</sup> A[k], and similarly for y. Such invariants are usually not found by Craig interpolation. However, the loop invariant i = j ∧ x = y suffices for the reduction that places context-switches at all loop boundaries. The general idea is that for this kind of reduction, the proof often needs to summarize only the effect of a single loop iteration rather than unboundedly many iterations (which may require quantifiers or non-linear arithmetic). Similar observations were first made by Farzan and Vandikas [4].
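Mazurkiewicz equivalence itself can be made concrete with a small check: two traces are equivalent iff one can be transformed into the other by repeatedly swapping adjacent independent actions. A brute-force sketch (only feasible for short traces; the independence relation below is invented for the example):

```python
from collections import deque

def equivalent(t1, t2, independent):
    """BFS over adjacent swaps of independent actions, starting from t1."""
    seen, queue = {t1}, deque([t1])
    while queue:
        t = queue.popleft()
        if t == t2:
            return True
        for i in range(len(t) - 1):
            if independent(t[i], t[i + 1]):
                swapped = t[:i] + (t[i + 1], t[i]) + t[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    queue.append(swapped)
    return False

indep = lambda x, y: {x, y} == {"a", "b"}   # only a and b commute
print(equivalent(("a", "b", "c"), ("b", "a", "c"), indep))  # True
print(equivalent(("a", "c", "b"), ("b", "c", "a"), indep))  # False
```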

GemCutter furthermore aims to improve the efficiency of the proof check, i.e., the check whether a reduction P′ is a subset of the set of proven traces L. The state explosion problem of concurrent programs makes the computation of an automaton recognizing a reduction P′, as well as the subsequent inclusion check, prohibitively expensive. To address this, we implemented a form of persistent set reduction [5], which allows us to compute a more compact automaton recognizing P′. This results in a more time- and memory-efficient inclusion check.

Reductions that interact harmoniously with CEGAR generalization do not always allow for an efficient proof check, nor vice versa. In the ConcurrencySafety category, where correctness proofs may become complicated, we prioritize generalization by computing reductions that typically allow for simpler proofs (described above), even though proof checking for such reductions is often more expensive. By contrast, in the NoDataRace category we found proof assertions to be usually quite simple (often only expressing non-aliasing of pointers), so we prioritize faster proof checks (and postpone context-switches as far as possible).

Implementation. GemCutter uses the libraries and the front-end of the Ultimate framework, and extends Ultimate with a new CEGAR loop implementation and new algorithms operating on finite automata. We represent programs P, reductions P′, and sets of proven traces L as finite automata. Ultimate constructs Floyd-Hoare automata (for L) only on demand [7]. Due to the state explosion problem, GemCutter extends this approach to the program and the reduction. The necessary parts of the automata are constructed just-in-time during traversal by automata algorithms. Various techniques are implemented as instances of a few generic interfaces (on-demand automata, and visitors that monitor and guide automaton traversal) for flexibility: Radically different algorithms can be created by configuring, exchanging, and stacking interface implementations. The following techniques and optimizations (all used in SV-COMP) can be combined with each other independently: (i) sleep set reduction; (ii) persistent set reduction; (iii) discovery and pruning of states that cannot reach accepting states; (iv) guidance towards representatives of a specific form, e.g., with context-switches at loop boundaries; and (v) inclusion check between automata.

# 2 Strengths and Weaknesses

The main advantage over other concurrency approaches in Ultimate (in Automizer and Taipan) lies in the generalization across interleavings: Automizer and Taipan typically require more complex proofs possibly out-of-reach for Craig interpolation and similar techniques. GemCutter performs significantly better, winning 3rd place in the ConcurrencySafety category (behind the bounded model checkers Deagle [6] and CSeq [10]) and 1st place in the NoDataRace demo category. For details, refer to the competition report [1].

Since our proof check decides a stronger condition (P′ ⊆ L), it might miss some cases in which the proof is actually sufficient, i.e., P ⊆ cl(L) holds. This is because P′ and L might contain different representatives for the same equivalence class of interleavings. This weakness cannot be resolved completely due to the undecidability of the inclusion P ⊆ cl(L). It can however be attenuated by considering other choices of representatives (other than preferring context-switches at loop boundaries) and exploring the effect. This choice is currently given as an input parameter; an approach that heuristically chooses a reduction based on the program structure might perform better. Our notion of independence between statements is currently ignorant of the specification being verified. We hope to extend our approach to take this into account. Finally, our approach (and implementation) can be easily extended with other reduction methods that correspond to more aggressive generalization along the interleaving axis.

Our approach only verifies programs with a bounded number of threads. GemCutter runs out of time or memory if it is unable to establish such an upper bound, e.g. for many benchmarks in pthread-ext/ or goblint-regression/.

# 3 Architecture, Setup, Configuration, and Project

GemCutter is part of the program analysis framework Ultimate<sup>3</sup> , written in Java and licensed under LGPLv3<sup>4</sup> . GemCutter version 0.2.2-839c364b requires Java 11 and Python 3.6. Its Linux version, binaries of the required SMT solvers<sup>5</sup> , and a Python wrapper script were submitted as a .zip archive. GemCutter is invoked with

./Ultimate.py --spec <p> --file <f> --architecture <a> --full-output

where <p> is an SV-COMP property file, <f> is an input C file, <a> is the architecture (32bit or 64bit), and --full-output enables verbose output to stdout. A violation witness may be written to the file witness.graphml. The benchmarking tool BenchExec [2] supports GemCutter through the tool-info module ultimategemcutter.py<sup>6</sup>. GemCutter participates in the ConcurrencySafety and NoDataRace categories, as declared in its SV-COMP benchmark definition file ugemcutter.xml<sup>7</sup>.

Data Availability. Our .zip archive is available online<sup>8</sup> and on Zenodo [9].

<sup>3</sup> ultimate.informatik.uni-freiburg.de and github.com/ultimate-pa/ultimate

<sup>4</sup> www.gnu.org/licenses/lgpl-3.0.en.html

<sup>5</sup> Z3 (github.com/Z3Prover/z3), CVC4 (cvc4.github.io) and Mathsat (mathsat.fbk.eu)

<sup>6</sup> github.com/sosy-lab/benchexec/blob/main/benchexec/tools/ultimategemcutter.py

<sup>7</sup> gitlab.com/sosy-lab/sv-comp/bench-defs/-/blob/main/benchmark-defs/ugemcutter.xml

<sup>8</sup> gitlab.com/sosy-lab/sv-comp/archives-2022/-/blob/main/2022/ugemcutter.zip and git.io/JM69B

# References



# Wit4Java: A Violation-Witness Validator for Java Verifiers (Competition Contribution)

Tong Wu<sup>1</sup>, Peter Schrammel<sup>2</sup>, and Lucas C. Cordeiro<sup>1</sup> (B)

<sup>1</sup> University of Manchester, Manchester, United Kingdom <sup>2</sup> University of Sussex, Brighton, and Diffblue Ltd, Oxford, United Kingdom lucas.cordeiro@manchester.ac.uk

Abstract. We describe and evaluate a violation-witness validator for Java verifiers called Wit4Java. It takes a Java program with a safety property and the respective violation-witness output by a Java verifier to generate a new Java program whose execution deterministically violates the property. We extract the values of the program variables from the counterexample represented by the violation-witness and feed this information back into the original program. In addition, we provide two implementations for instantiating source programs by injecting counterexamples. Experimental results show that Wit4Java can correctly validate the violation-witnesses produced by JBMC and GDart in a few seconds.

Keywords: Witness Validation · Software Verification · Java Bytecode.

# 1 Overview

Witness validation is the process of checking whether the same results can be reproduced independently according to the given program, specification, verification result, and the generated witness, improving the trust level of software verifiers [2].

Here, we describe and evaluate a new violation-witness validator for Java programs called Wit4Java. We take an approach similar to Rocha et al. [5] and Beyer et al. [1] for C programs and apply it to Java programs. As a result, we implement Wit4Java as a Python script that creates a new Java program or a unit test case using Mockito with the program variable values extracted from the counterexample. As input, Wit4Java uses the violation-witness in the GraphML format to extract the values of the non-deterministic variables in Java programs. Lastly, Wit4Java runs the newly created program using the Java Virtual Machine (JVM) to check the assert statements.

There are some validators for C programs in the literature [6,12]. For example, NitWit is an interpretation-based witness validator that can execute each statement step-by-step without compiling the entire program [12]. The concept of MetaVal is to generate a new program based on the input and then use any checker to check the specification [6]. CPA-witness2test and FShell-witness2test are execution-based validators for C programs that can process the witness in GraphML format and generate a test harness that drives the program to the specification violation [1]. Rocha et al. focus on the counterexample produced by ESBMC [4], while CPA-witness2test and FShell-witness2test can process GraphML files. However, witness validation for SV-COMP's Java track [7] is still at an early stage. GWIT is another validator that uses assumptions to prune the search space for dynamic symbolic execution, limiting the analysis to paths where a given assumption holds [10,11].

Fig. 1. Wit4Java Architecture. The grey boxes represent the inputs and outputs, and the white boxes represent the validation process.

# 2 Validation Approach

The architecture of Wit4Java is illustrated in Fig. 1. First, Wit4Java takes the Java program and the witness as input. Then, it uses the Python package NetworkX to read the graph content of the witness, extracts from the violation-witness the counterexample values of the variables corresponding to the source program, and saves them. After that, it generates new programs that contain the witness's assumptions. Finally, the validation process is performed by the JVM (using the -ea option) to check whether the execution of the generated program exhibits the detected assertion failure.
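The extraction step can be sketched in a few lines. Wit4Java itself uses NetworkX; the plain-ElementTree version below is a simplified illustration (the embedded GraphML snippet mirrors Listing 1.3 and omits the namespace and key declarations of real witnesses):

```python
# Extract (startline, assumption) pairs from a simplified violation-witness.

import xml.etree.ElementTree as ET

witness = """
<graphml>
  <graph>
    <edge source="203.167" target="207.186">
      <data key="startline">13</data>
      <data key="assumption">v1 = 1;</data>
    </edge>
    <edge source="207.186" target="252.201">
      <data key="startline">14</data>
      <data key="assumption">v2 = 0;</data>
    </edge>
  </graph>
</graphml>
"""

def extract_assumptions(graphml_text):
    root = ET.fromstring(graphml_text)
    pairs = []
    for edge in root.iter("edge"):
        data = {d.get("key"): (d.text or "").strip() for d in edge.findall("data")}
        if "assumption" in data:
            pairs.append((int(data["startline"]), data["assumption"]))
    return pairs

print(extract_assumptions(witness))  # [(13, 'v1 = 1;'), (14, 'v2 = 0;')]
```

These (line, value) pairs are exactly what the two implementations described next consume in different ways.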

There are two implementations (Wit4Java 1.0 and Wit4Java 2.0) to extract and use counterexamples. The first version saves them as tuples (linenum, counterexample). It then reads the source program and replaces the variables in the program statements with counterexamples whenever the line number and variable in the program match a tuple, thus generating a new Java program. In comparison, the second version records the data types and values of the counterexamples and saves them sequentially into two lists. Moreover, only the assumptions made in the witness for the non-deterministic variables (as determined by Verifier.nondet) are recorded. It then builds a unit test case and employs the Mockito framework to mock the Verifier.nondet calls in the source program, making them return deterministic counterexample values from the lists. This makes the execution of the source program follow the path described in the witness and eventually reach the violated property.
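The line-based replacement of version 1.0 can be sketched as follows. `instrument` is a hypothetical helper for illustration; the actual tool operates on whole Java source files rather than a list of lines.

```python
import re

def instrument(source_lines, assumptions):
    """Naive (version 1.0 style) sketch: for each (linenum, assumption)
    tuple, replace the right-hand side of the matching assignment with
    the concrete counterexample value from the witness."""
    by_line = dict(assumptions)
    out = []
    for no, text in enumerate(source_lines, start=1):
        if no in by_line:
            var, value = (s.strip() for s in by_line[no].split("=", 1))
            # e.g. "int v1 = Verifier.nondetInt();" -> "int v1 = 1;"
            text = re.sub(rf"{re.escape(var)}\s*=.*", f"{var} = {value}", text)
        out.append(text)
    return out

src = ["int v1 = Verifier.nondetInt();", "int v2 = Verifier.nondetInt();"]
print(instrument(src, [(1, "v1 = 1;"), (2, "v2 = 0;")]))
# ['int v1 = 1;', 'int v2 = 0;']
```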


We show examples for both implementations in Listings 1.1 to 1.4. Wit4Java 1.0 (the naive version) saves the counterexamples in the witness in line-number order. It directly replaces the variable values in the source program, thus generating a new program (cf. Listing 1.2). Wit4Java 2.0 (the Mockito version) generates a test case that returns the counterexample value when the mocked function is called (cf. Listing 1.4).

Listing 1.3. Violation witness

```
<edge source="203.167" target="207.186">
    <data key="originfile">Main.java</data>
    <data key="startline">13</data>
    <data key="assumption">v1 = 1;</data>
</edge>
<edge source="207.186" target="252.201">
    <data key="originfile">Main.java</data>
    <data key="startline">14</data>
    <data key="assumption">v2 = 0;</data>
</edge>
```

Listing 1.4. Output of Wit4Java 2.0

```
String[] List_type = { "int", "int" };
String[] List_value = { "1", "0" };
Mockito.mockStatic(Verifier.class);
int n = List_type.length;
OngoingStubbing<Integer> stubbing_int =
    Mockito.when(Verifier.nondetInt());
for (int i = 0; i < n; i++) {
    if ("int".equals(List_type[i])) {
        stubbing_int = stubbing_int
            .thenReturn(Integer.parseInt(List_value[i]));
    }
}
Main.main(new String[0]);
```
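The core of the generated test case is the chain of thenReturn stubs, one per counterexample value in witness order. Sketched in Python (the language Wit4Java is written in), a generator for this chain might look as follows; `make_stub_chain` is a hypothetical helper, not the tool's actual function.

```python
def make_stub_chain(values):
    """Emit the Mockito stubbing expression for Verifier.nondetInt(),
    chaining one thenReturn per counterexample value (sketch only)."""
    chain = "Mockito.when(Verifier.nondetInt())"
    for v in values:
        chain += f".thenReturn({v})"
    return chain + ";"

print(make_stub_chain([1, 0]))
# Mockito.when(Verifier.nondetInt()).thenReturn(1).thenReturn(0);
```

Chaining the stubs this way makes the mocked call return 1 on its first invocation and 0 on the second, which is how version 2.0 reproduces loop iterations with different counterexample values.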
#### 3 Discussion of Strengths and Weaknesses

Fig. 2 on the left compares the validation results of the two validation tools Wit4Java and GWIT. The former is based on version 1.0 (the naive version); the latter is based on violation witnesses produced by GDart. The results indicate that Wit4Java successfully validated 140 out of 302 witnesses, while GWIT correctly validates 150. Version 2.0 handles counterexamples with different values for each iteration within a loop better than version 1.0: version 1.0 skips the counterexamples before the last iteration, whereas version 2.0 can fully use the counterexamples generated by each iteration. Fig. 2 on the right compares the validation results of the two versions of Wit4Java and shows that version 2.0 (the Mockito version) has a better validation ability (168 out of 302), thereby outperforming both version 1.0 and GWIT. However, the tool can only handle witnesses with concrete counterexamples. There are two main reasons why Wit4Java reports the result unknown: JBMC [3,8] produces an empty witness, or the witness does not contain a counterexample for a non-deterministic value. In addition, validation for strings is not supported yet; strings occur in almost half of the witnesses, but JBMC does not yet output counterexample values for them, so we were not able to test this case. Generally, there are not enough high-quality witnesses for testing the witness validator yet, because JBMC sometimes correctly terminates without producing a witness in SV-COMP. The witness support in the Java verifiers requires further development work so that they are able to produce complete violation witnesses whenever they terminate with verdict false.

Fig. 2. Validation results based on 302 witnesses. The x-axis represents the names of the two tools, and the y-axis represents the number of witnesses. A green "false" indicates a confirmed correct result.

#### 4 Tool Setup and Configuration

The competition submission is based on Wit4Java version 1.0 (the naive version).<sup>3</sup> For the competition [9], Wit4Java is called by executing the script wit4java.py. It reads .java source files and the corresponding witnesses in the given benchmark directories. The answer is false if the assertion failure is found. As an example, we can validate a witness by executing the following command:

./wit4java.py -witness <path-to-sv-witnesses>/witness.graphml <path-to-sv-benchmarks>/java/jbmc-regression/return2

where witness.graphml indicates the witness to be validated, and return2 indicates the benchmark name. The BenchExec tool-info module is called wit4java.py, and the benchmark definition file is wit4java-validate-violation-witnesses.xml. NetworkX should be installed separately on the SV-COMP machines. If a validation task does not find a property violation, it returns unknown.

#### 5 Software Project and Contributors

Tong Wu maintains Wit4Java. It is publicly available under a BSD-style license. The source code is available at https://github.com/Anthonysdu/wit4java, and instructions for running the tool are given in the README fle.

<sup>3</sup> https://github.com/Anthonysdu/wit4java

# Acknowledgment

The work in this paper is partially funded by EPSRC grants EP/T026995/1 and EP/V000497/1, EU H2020 ELEGANT 957286, and the Soteria project awarded by UK Research and Innovation under the Digital Security by Design (DSbD) Programme.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Author Index**

Ádám, Zsófia II-474 Aiken, Alex I-338 Aizawa, Akiko I-87 Albert, Elvira I-201 Alur, Rajeev II-353 Amat, Nicolas I-505 Amendola, Arturo I-125 Asgaonkar, Aditya I-167 Ayaziová, Paulína II-468 Bainczyk, Alexander II-314 Bajczi, Levente II-474 Banerjee, Tamajit II-81 Barbosa, Haniel I-415 Barrett, Clark I-143, I-415 Becchi, Anna I-125 Beyer, Dirk I-561, II-375, II-429 Biere, Armin I-443 Birkmann, Fabian II-159 Blatter, Lionel I-303 Blicha, Martin I-524 Bork, Alexander II-22 Bortolussi, Luca I-281 Bozzano, Marco I-543, II-273 Brain, Martin I-415 Bromberger, Martin I-480 Bruyère, Véronique I-244 Bryant, Randal E. I-443, I-462 Bu, Lei II-408

Casares, Antonio II-99 Cassez, Franck I-167 Castro, Pablo F. I-396 Cavada, Roberto I-125 Chakarov, Aleksandar I-404 Chalupa, Marek II-462, II-468 Cimatti, Alessandro I-125, I-543, II-273 Cohl, Howard S. I-87 Cordeiro, Lucas C. II-484 Coto, Alex II-413

D'Argenio, Pedro R. I-396 Darulova, Eva I-303 de Pol, Jaco van II-295

Deifel, Hans-Peter II-159 Demasi, Ramiro I-396 Dey, Rajen I-87 Dietsch, Daniel II-479 Dill, David I-183 Dobos-Kovács, Mihály II-474 Dragoste, Irina I-480 Duret-Lutz, Alexandre II-99 Dwyer, Matthew B. II-440

Ebbinghaus, Marcel II-479

Fan, Hongyu II-424 Faqeh, Rasha I-480 Farzan, Azadeh II-479 Fedchin, Aleksandr I-404 Fedyukovich, Grigory I-524, II-254 Ferrando, Andrea I-125 Fetzer, Christof I-480 Fijalkow, Nathanaël I-263 Fuller, Joanne I-167

Gallo, Giuseppe Maria I-281 Garhewal, Bharat I-223 Giannakopoulou, Dimitra I-387 Giesl, Jürgen II-403 Gipp, Bela I-87 González, Larry I-480 Goodloe, Alwyn I-387 Gordillo, Pablo I-201 Greiner-Petter, André I-87 Grieskamp, Wolfgang I-183 Griggio, Alberto II-273 Guan, Ji II-3 Guilloud, Simon II-196 Guo, Xiao II-408

Haas, Thomas II-418 Hajdu, Ákos II-474 Hartmanns, Arnd II-41 Havlena, Vojtěch II-118 He, Fei II-424 Heizmann, Matthias II-479 Hensel, Jera II-403

Hernández-Cerezo, Alejandro I-201 Heule, Marijn J. H. I-443, I-462 Hovland, Paul D. I-106 Howar, Falk II-435, II-446 Hückelheim, Jan I-106 Huisman, Marieke II-332 Hujsa, Thomas I-505 Hyvärinen, Antti E. J. I-524 Imai, Keigo I-379 Inverso, Omar II-413 Jakobsen, Anna Blume II-295 Jonáš, Martin II-273 Kanav, Sudeep I-561 Karri, Ramesh I-3 Katoen, Joost-Pieter II-22 Katz, Guy I-143 Kettl, Matthias II-451 Klumpp, Dominik II-479 Koenig, Jason R. I-338 Koutavas, Vasileios II-178 Krämer, Jonas I-303 Kremer, Gereon I-415 Křetínský, Jan I-281 Krötzsch, Markus I-480 Krstić, Srđan II-236 Kunčak, Viktor II-196 Kupferman, Orna I-25 Kwiatkowska, Marta II-60 Lachnitt, Hanna I-415 Lam, Wing II-217 Lange, Julien I-379 Lauko, Henrich II-457 Laveaux, Maurice II-137 Leeson, Will II-440 Lemberger, Thomas II-451 Lengál, Ondřej II-118 Li, Xuandong II-408 Li, Yichao II-408 Lin, Yi I-64 Lin, Yu-Yang II-178 Loo, Boon Thau II-353 Lyu, Lecheng II-408 Majumdar, Rupak II-81 Mallik, Kaushik II-81 Mann, Makai I-415

Marinov, Darko II-217 Marx, Maximilian I-480 Mavridou, Anastasia I-387 Mensendiek, Constantin II-403 Meyer, Klara J. II-99 Meyer, Roland II-418 Mihalkovič, Vincent II-462 Milius, Stefan II-159 Mitra, Sayan I-322 Mohamed, Abdalrhman I-415 Mohamed, Mudathir I-415 Molnár, Vince II-474 Mues, Malte II-435, II-446 Murali, Harish K I-480 Murtovi, Alnis II-314

Namjoshi, Kedar S. I-46 Neider, Daniel I-263 Nenzi, Laura I-281 Neykova, Rumyana I-379 Niemetz, Aina I-415 Norman, Gethin II-60 Nötzli, Andres I-415

Ozdemir, Alex I-415

Padon, Oded I-338 Park, Junkil I-183 Parker, David II-60 Patel, Nisarg I-46 Paulsen, Brandon I-357 Pérez, Guillermo A. I-244 Perez, Ivan I-387 Pilati, Lorenzo I-125 Pilato, Christian I-3 Podelski, Andreas II-479 Ponce-de-León, Hernán II-418 Preiner, Mathias I-415 Pressburger, Tom I-387 Putruele, Luciano I-396

Qadeer, Shaz I-183 Quatmann, Tim II-22

Raha, Ritam I-263 Rakamarić, Zvonimir I-404 Raszyk, Martin II-236 Rechtáčková, Anna II-462 Reeves, Joseph E. I-462 Renkin, Florian II-99

Reynolds, Andrew I-415 Ročkai, Petr II-457 Rot, Jurriaan I-223 Roy, Rajarshi I-263 Roy, Subhajit I-3 Rubio, Albert I-201 Rungta, Neha I-404 Safari, Mohsen II-332 Şakar, Ömer II-332 Sales, Emerson II-413 Santos, Gabriel II-60 Scaglione, Giuseppe I-125 Schmuck, Anne-Kathrin II-81 Schneider, Joshua II-236 Schrammel, Peter II-484 Schubotz, Moritz I-87 Schüssele, Frank II-479 Sharygina, Natasha I-524 Sheng, Ying I-415 Shenwald, Noam I-25 Shi, Lei II-353 Shoham, Sharon I-338 Sickert, Salomon II-99 Siegel, Stephen F. I-106 Šmahlíková, Barbora II-118 Sølvsten, Steffan Christ II-295 Soudjani, Sadegh II-81 Spiessl, Martin II-429 Staquet, Gaëtan I-244 Steffen, Bernhard II-314 Strejček, Jan II-462, II-468 Sun, Dawei I-322 Sun, Zhihang II-424

Tabajara, Lucas M. I-64 Tacchella, Alberto I-125 Takhar, Gourav I-3 Thomasen, Mathias Weller Berg II-295 Tinelli, Cesare I-415

Tonetta, Stefano I-543 Traytel, Dmitriy II-236 Trost, Avi I-87 Tuosto, Emilio II-413 Tzevelekos, Nikos II-178 Ulbrich, Mattias I-303 Vaandrager, Frits I-223 Vardi, Moshe Y. I-64 Vozarova, Viktoria I-543 Wang, Chao I-357 Wang, Hao II-217 Wang, Yuepeng II-353 Weidenbach, Christoph I-480 Wesselink, Wieger II-137 Wijs, Anton II-332 Willemse, Tim A. C. II-137 Wißmann, Thorsten I-223 Wu, Haoze I-143 Wu, Tong II-484 Wu, Wenhao I-106 Xie, Tao II-217 Xie, Zhunyi II-408 Xu, Meng I-183 Yi, Pu II-217 Youssef, Abdou I-87 Yu, Nengkun II-3 Zamboni, Marco I-125

Zaoral, Lukáš II-462 Zeljić, Aleksandar I-143 Zhao, Jianhua II-408 Zhong, Emma I-183 Zilio, Silvano Dal I-505 Zingg, Sheila II-236 Zlatkin, Ilia II-254 Zohar, Yoni I-415