**Sharon Shoham Yakir Vizel (Eds.)**

# LNCS 13371

# **Computer Aided Verification**

**34th International Conference, CAV 2022 Haifa, Israel, August 7–10, 2022 Proceedings, Part I**


#### Founding Editors

Gerhard Goos *Karlsruhe Institute of Technology, Karlsruhe, Germany*

Juris Hartmanis *Cornell University, Ithaca, NY, USA*

#### Editorial Board Members

Elisa Bertino *Purdue University, West Lafayette, IN, USA*

Wen Gao *Peking University, Beijing, China*

Bernhard Steffen *TU Dortmund University, Dortmund, Germany*

Moti Yung *Columbia University, New York, NY, USA*

More information about this series at https://link.springer.com/bookseries/558


*Editors*

Sharon Shoham
*Tel Aviv University, Tel Aviv, Israel*

Yakir Vizel
*Technion – Israel Institute of Technology, Haifa, Israel*

ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-13184-4  ISBN 978-3-031-13185-1 (eBook)
https://doi.org/10.1007/978-3-031-13185-1

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

# **Preface**

It was our privilege to serve as the program chairs for CAV 2022, the 34th International Conference on Computer-Aided Verification. CAV 2022 was held during August 7–10, 2022, with CAV-affiliated workshops held during July 31 – August 1 and August 11–12. This year, CAV was held as part of the Federated Logic Conference (FLoC) and was co-located with many other conferences in software/hardware verification and logic for computer science. Due to the easing of COVID-19 travel restrictions, CAV 2022 and the rest of FLoC were in-person events.

CAV is an annual conference dedicated to the advancement of the theory and practice of computer-aided formal analysis methods for hardware and software systems. The primary focus of CAV is to extend the frontiers of verification techniques by expanding to new domains such as security, quantum computing, and machine learning. This puts CAV at the cutting edge of formal methods research, and this year's program is a reflection of this commitment.

CAV 2022 received a high number of submissions (209). We accepted nine tool papers, two case studies, and 40 regular papers, which amounts to an acceptance rate of roughly 24%. The accepted papers cover a wide spectrum of topics, from theoretical results to applications of formal methods. These papers apply or extend formal methods to a wide range of domains such as smart contracts, concurrency, machine learning, probabilistic techniques, and industrially deployed systems. The program featured a keynote talk by Ziyad Hanna (Cadence Design Systems and University of Oxford), a plenary talk by Aarti Gupta (Princeton University), and invited talks by Arie Gurfinkel (University of Waterloo) and Neha Rungta (Amazon Web Services). Furthermore, we continued the tradition of Logic Lounge, a series of discussions on computer science topics targeting a general audience. In addition to all talks at CAV, the attendees got access to talks at other conferences held as part of FLoC.

In addition to the main conference, CAV 2022 hosted the following workshops: Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), On the Not So Unusual Effectiveness of Logic, Formal Methods Education Online, Democratizing Software Verification (DSV), Verification of Probabilistic Programs (VeriProP), Program Equivalence and Relational Reasoning (PERR), Parallel and Distributed Automated Reasoning, Numerical Software Verification (NSV-XV), Formal Reasoning in Distributed Algorithms (FRIDA), Formal Methods for Blockchains (FMBC), Synthesis (Synt), and Workshop on Open Problems in Learning and Verification of Neural Networks (WOLVERINE).

Organizing a flagship conference like CAV requires a great deal of effort from the community. The Program Committee (PC) for CAV 2022 consisted of 86 members – a committee of this size ensures that each member has a reasonable number of papers to review in the allotted time. In all, the committee members wrote over 800 reviews while investing significant effort to maintain and ensure the high quality of the conference program. We are grateful to the CAV 2022 PC for their outstanding efforts in evaluating the submissions and making sure that each paper got a fair chance. As in recent years, we made artifact evaluation mandatory for tool paper submissions and optional but encouraged for the rest of the accepted papers. The Artifact Evaluation Committee consisted of 79 reviewers who put in significant effort to evaluate each artifact. The goal of this process was to provide constructive feedback to tool developers and help make the research published in CAV more reproducible. The Artifact Evaluation Committee was generally quite impressed by the quality of the artifacts. Among the accepted regular papers, 77% of the authors submitted an artifact, and 58% of these artifacts passed the evaluation. We are very grateful to the Artifact Evaluation Committee for their hard work and dedication in evaluating the submitted artifacts.

CAV 2022 would not have been possible without the tremendous help we received from several individuals, and we would like to thank everyone who helped make CAV 2022 a success. First, we would like to thank Maria A Schett and Daniel Dietsch for chairing the Artifact Evaluation Committee and Hari Govind V K for putting together the proceedings. We also thank Grigory Fedyukovich for chairing the workshop organization and Shachar Itzhaky for managing publicity. We would like to thank the FLoC organizing committee for organizing the Logic Lounge and the Mentoring Workshop, and for arranging student volunteers. We also thank Hana Chockler for handling sponsorship for all conferences in FLoC. We would also like to thank FLoC chair Alexandra Silva and co-chairs Orna Grumberg and Eran Yahav for the support provided. Last but not least, we would like to thank the members of the CAV Steering Committee (Aarti Gupta, Daniel Kroening, Kenneth McMillan, and Orna Grumberg) for helping us with several important aspects of organizing CAV 2022.

We hope that you will find the proceedings of CAV 2022 scientifically interesting and thought-provoking!

June 2022

Sharon Shoham
Yakir Vizel

# **Organization**

### **Steering Committee**

Aarti Gupta, Daniel Kroening, Kenneth McMillan, and Orna Grumberg

# **Conference Co-chairs**

Sharon Shoham and Yakir Vizel

# **Artifact Evaluation Co-chairs**

Maria A Schett and Daniel Dietsch

# **Publicity Chair**

Shachar Itzhaky

# **Workshop Chair**

Grigory Fedyukovich

### **Proceedings and Talks Chair**

Hari Govind V K
### **Program Committee**


- Andrey Rybalchenko, Microsoft Research, UK
- Anna Lukina, TU Delft, The Netherlands
- Anna Slobodova, Intel, USA
- Caterina Urban, INRIA, France
- Dana Drachsler Cohen, Technion, Israel
- Daniel Kroening, Amazon, USA
- Derek Dreyer, MPI-SWS, Germany
- Guy Avni, University of Haifa, Israel
- Hadar Frenkel, CISPA, Germany
- Hillel Kugler, Bar Ilan University, Israel
- Josh Berdine, Meta, UK
- Krishna Shankaranarayanan, IIT Bombay, India
- Ana Sokolova, University of Salzburg, Austria
- Anastasia Mavridou, KBR Inc., NASA Ames Research Center, USA
- Andreas Podelski, University of Freiburg, Germany
- Andrei Popescu, University of Sheffield, UK
- Antoine Miné, Sorbonne Université, France
- Armin Biere, University of Freiburg, Germany
- Azalea Raad, Imperial College London, UK
- Bor-Yuh Evan Chang, University of Colorado Boulder and Amazon, USA
- Corina Pasareanu, CMU/NASA Ames Research Center, USA
- David Jansen, Institute of Software, Chinese Academy of Sciences, China
- Deepak D'Souza, Indian Institute of Science, India
- Dejan Jovanović, Amazon Web Services, USA
- Elizabeth Polgreen, University of Edinburgh, UK
- Elvira Albert, Complutense University of Madrid, Spain
- Erika Abraham, RWTH Aachen University, Germany
- Grigory Fedyukovich, Florida State University, USA
- Guy Katz, Hebrew University of Jerusalem, Israel
- Hiroshi Unno, University of Tsukuba, Japan
- Isabel Garcia-Contreras, University of Waterloo, Canada
- Ivana Cerna, Masaryk University, Czech Republic
- Jade Alglave, University College London and Arm, UK
- Jean-Baptiste Jeannin, University of Michigan, USA
- Joost-Pieter Katoen, RWTH Aachen University, Germany
- Joxan Jaffar, National University of Singapore, Singapore
- Kenneth L. McMillan, University of Texas at Austin, USA
- Klaus V. Gleissenthall, Vrije Universiteit Amsterdam, The Netherlands
- Konstantin Korovin, University of Manchester, UK
- Kuldeep Meel, National University of Singapore, Singapore
- Laura Titolo, NIA/NASA LaRC, USA
- Marijana Lazić, TU Munich, Germany
- Marijn J. H. Heule, CMU, USA
- Markus Rabe, Google, USA
- Miriam García Soto, IST Austria, Austria
- Oded Padon, VMware Research, USA
- Parasara Sridhar Duggirala, UNC Chapel Hill, USA
- Rajeev Joshi, Amazon, USA
- Rayna Dimitrova, CISPA, Germany
- Ruben Martins, CMU, USA
- Rupak Majumdar, MPI-SWS, Germany
- Ruzica Piskac, Yale University, USA
- Subodh Sharma, IIT Delhi, India
- Supratik Chakraborty, IIT Bombay, India
- Swarat Chaudhuri, UT Austin, USA
- Swen Jacobs, CISPA, Germany
- Temesghen Kahsai, Amazon, USA
- Liana Hadarean, Amazon Web Services, USA
- Marieke Huisman, University of Twente, The Netherlands
- Martina Seidl, Johannes Kepler University Linz, Austria
- Mingsheng Ying, Chinese Academy of Sciences and Tsinghua University, China
- Natasha Sharygina, University of Lugano, Switzerland
- Nikolaj Bjørner, Microsoft Research, USA
- Nir Piterman, University of Gothenburg, Sweden
- Parosh Aziz Abdulla, Uppsala University, Sweden
- Pavithra Prabhakar, Kansas State University, USA
- Philippa Gardner, Imperial College London, UK
- Pierre Ganty, IMDEA Software Institute, Spain
- Roderick Bloem, Graz University of Technology, Austria
- Sébastien Bardin, CEA List, Université Paris-Saclay, France
- Shuvendu Lahiri, Microsoft Research, USA
- Sorav Bansal, IIT Delhi and CompilerAI Labs, India
- Taylor T. Johnson, Vanderbilt University, USA
- Wei-Ngan Chin, National University of Singapore, Singapore
- Xavier Rival, INRIA Paris/ENS/PSL, France
- Xujie Si, McGill University, Canada
- Yang Liu, Nanyang Technological University, Singapore
- Yu-Fang Chen, Institute of Information Science, Academia Sinica, Taiwan
- Yuxin Deng, East China Normal University, China

### **Artifact Evaluation Committee**

- Guy Amir, Hebrew University of Jerusalem, Israel
- Muqsit Azeem, Technical University of Munich, Germany
- Kshitij Bansal, Meta, USA
- Tianshu Bao, Vanderbilt University, USA
- Fabian Bauer-Marquart, University of Konstanz, Germany
- Anna Becchi, Fondazione Bruno Kessler, Italy
- Ranadeep Biswas, Informal Systems, France
- Christopher Brix, RWTH Aachen University, Germany
- Marek Chalupa, IST Austria, Austria
- Kevin Cheang, University of California, Berkeley, USA
- Lesly-Ann Daniel, KU Leuven, Belgium
- Kinnari Dave, Certified Kernel Tech, LLC, USA
- Simon Dierl, TU Dortmund, Germany
- Florian Dorfhuber, Technical University of Munich, Germany
- Benjamin Farinier, Technische Universität Wien, Austria
- Parisa Fathololumi, Stevens Institute of Technology, USA
- Mathias Fleury, University of Freiburg, Germany
- Luke Geeson, University College London, UK
- Pablo Gordillo, Complutense University of Madrid, Spain
- Manish Goyal, University of North Carolina at Chapel Hill, USA
- Akos Hajdu, Meta Platforms Inc., UK
- Alejandro Hernández-Cerezo, Complutense University of Madrid, Spain
- Jana Hofmann, CISPA Helmholtz Center for Information Security, Germany
- Miguel Isabel, Universidad Complutense de Madrid, Spain
- Martin Jonáš, Fondazione Bruno Kessler, Italy
- Samuel Judson, Yale University, USA
- Sudeep Kanav, LMU Munich, Germany
- Daniela Kaufmann, Johannes Kepler University Linz, Austria
- Brian Kempa, Iowa State University, USA
- Ayrat Khalimov, Université libre de Bruxelles, Belgium
- Nishant Kheterpal, University of Michigan, USA
- Edward Kim, University of North Carolina at Chapel Hill, USA
- John Kolesar, Yale University, USA
- Bettina Könighofer, Graz University of Technology, Austria
- Mitja Kulczynski, Kiel University, Germany
- Thomas Lemberger, LMU Munich, Germany
- Julien Lepiller, Yale University, USA
- Sven Linker, Lancaster University in Leipzig, Germany
- Kirby Linvill, University of Colorado Boulder, USA
- Y. Cyrus Liu, Stevens Institute of Technology, USA
- Tianhan Lu, University of Colorado Boulder, USA
- Enrico Magnago, Fondazione Bruno Kessler and University of Trento, Italy
- Tobias Meggendorfer, IST Austria, Austria
- Fabian Meyer, RWTH Aachen University, Germany
- Stefanie Mohr, Technical University of Munich, Germany
- Raphaël Monat, LIP6, Sorbonne Université and CNRS, France
- Felipe R. Monteiro, Amazon, USA
- Marcel Moosbrugger, Technische Universität Wien, Austria
- Marco Muniz, Aalborg University, Denmark
- Neelanjana Pal, Vanderbilt University, USA
- Francesco Parolini, Sorbonne University, France
- Mário Pereira, Universidade NOVA de Lisboa, Portugal
- João Pereira, ETH Zurich, Switzerland
- Sumanth Prabhu, Indian Institute of Science and TCS Research, India
- Cedric Richter, University of Oldenburg, Germany
- Clara Rodríguez, Complutense University of Madrid, Spain
- Bob Rubbens, University of Twente, The Netherlands
- Rubén Rubio, Universidad Complutense de Madrid, Spain
- Stanly Samuel, Indian Institute of Science, India
- Daniel Schoepe, Amazon, UK
- Philipp Schröer, RWTH Aachen University, Germany
- Joseph Scott, University of Waterloo, Canada
- Arnab Sharma, University of Oldenburg, Germany
- Salomon Sickert, Hebrew University of Jerusalem, Israel
- Yahui Song, National University of Singapore, Singapore
- Michael Starzinger, University of Salzburg, Austria
- Martin Tappler, Graz University of Technology, Austria
- Michael Tautschnig, Queen Mary University of London, UK
- Mertcan Temel, Intel Corporation, USA
- Saeid Tizpaz-Niari, University of Texas at El Paso, USA
- Deivid Vale, Radboud University Nijmegen, The Netherlands
- Vesal Vojdani, University of Tartu, Estonia
- Masaki Waga, Kyoto University, Japan
- Peixin Wang, University of Oxford, UK
- Tobias Winkler, RWTH Aachen University, Germany
- Stefan Zetzsche, University College London, UK
- Xiao-Yi Zhang, National Institute of Informatics, Japan
- Linpeng Zhang, University College London, UK
- Đorđe Žikelić, IST Austria, Austria

# **Additional Reviewers**

A. R. Balasubramanian Aaron Gember-Jacobson Abhishek Rose Aditya Akella Alberto Larrauri Alexander Bork Alvin George Ameer Hamza Andres Noetzli Anna Becchi Anna Latour Antti Hyvarinen Benedikt Maderbacher Benno Stein Bettina Koenighofer Bruno Blanchet Chana Weil-Kennedy Christoph Welzel Christophe Chareton Christopher Brix Chun Tian Constantin Enea Daniel Hausmann Daniel Kocher Daniel Schoepe Darius Foo David Delmas David MacIver Dmitriy Traytel Enrique Martin-Martin Filip Cano Francesco Parolini Frederik Schmitt Frédéric Recoules Fu Song Gadi Aleksandrowicz Gerco van Heerdt Grégoire Menguy Gustavo Petri Guy Amir

Haggai Landa Hammad Ahmad Hanna Lachnitt Hongfei Fu Ichiro Hasuo Ilina Stoilkovska Ira Fesefeldt Jens Gutsfeld Ji Guan Jiawei Chen Jip Spel Jochen Hoenicke Joshua Moerman Kedar Namjoshi Kevin Batz Konstantin Britikov Koos van der Linden Lennard Gäher Lesly-Ann Daniel Li Wenhua Li Zhou Luca Laurenti Lukas Armborst Lukáš Holík Malte Schledjewski Martin Blicha Martin Helfrich Martin Lange Masoud Ebrahimi Mathieu Lehaut Michael Sammler Michael Starzinger Miguel Gómez-Zamalloa Miguel Isabel Ming Xu Noemi Passing Niklas Metzger Nishant Kheterpal Noam Zilberstein Norine Coenen

Omer Rappoport Ömer Şakar Peter Lammich Prabhat Kumar Jha Raghu Rajan Rodrigo Otoni Romain Demangeon Ron Rothblum Roope Kaivola Sadegh Soudjani Sepideh Asadi Shachar Itzhaky Shaun Azzopardi Shawn Meier Shelly Garion Shubhani Shyam Lal Karra Simon Spies Soline Ducousso Song Yahui Spandan Das Sumanth Prabhu Teodora Baluta Thomas Noll Tim King Tim Quatmann Timothy Bourke Tobias Winkler Vasileios Klimis Vedad Hadzic Vishnu Bondalakunta Wang Fang Xing Hong Yangjia Li Yash Pote Yehia Abd Alrahman Yuan Feng Zachary Kincaid Ziv Nevo Zurab Khasidashvili

# **Contents – Part I**




# **Contents – Part II**

#### **Probabilistic Techniques**





# **Invited Papers**

# **A Billion SMT Queries a Day (Invited Paper)**

Neha Rungta

Amazon Web Services, Seattle, USA rungta@amazon.com

**Abstract.** Amazon Web Services (AWS) is a cloud computing services provider that has made significant investments in applying formal methods to proving correctness of its internal systems and providing assurance of correctness to its end users. In this paper, we focus on how we built abstractions and eliminated specifications to scale a verification engine for AWS access policies, Zelkova, to be usable by all AWS users. We present milestones from our journey from a thousand SMT invocations daily to an unprecedented billion SMT calls in a span of five years. We discuss how the cloud is enabling the application of formal methods, share key insights into what made this scale of a billion SMT queries daily possible, and present some open scientific challenges for the formal methods community.

**Keywords:** Cloud Computing · Formal Verification · SMT Solving

### **1 Introduction**

Amazon Web Services (AWS) has made significant investments in developing and applying formal tools and techniques to prove the correctness of critical internal systems and provide services to AWS users to prove correctness of their own systems [24]. We use and apply a varied set of automated reasoning techniques at AWS. For example, we use (i) bounded model checking [35] to verify memory safety properties of boot code running in AWS data centers and of the real-time operating system used in IoT devices [22,25,26], (ii) proof assistants such as EasyCrypt [12] and domain-specific languages such as Cryptol [38] to verify cryptographic protocols [3,4,23], (iii) HOL Light [33] to verify the BigNum implementation [2], (iv) P [28] to test key storage components in Amazon S3 [18], and (v) Dafny [37] to verify key authorization and crypto libraries [1]. Automated reasoning capabilities for external AWS users leverage (i) data-flow analysis [17] to prove correct usage of cloud APIs [29,40], (ii) monotonic SAT theories [14] to check properties of network configurations [5,13], and (iii) theories for strings and automata in SMT solvers [16,39,46] to provide security for access controls [6,19].

This paper describes key milestones in our journey to generating a billion SMT queries a day in the context of AWS Identity and Access Management (IAM). IAM is a system for controlling access to resources such as applications, data, and workloads in AWS. Resource owners can configure access by writing *policies* that describe when to allow and deny user requests that access the resource. These configurations are expressed in the IAM policy language. For example, Amazon Simple Storage Service (S3) is an object storage service that offers data durability, availability, security, and performance. S3 is used widely to store and protect data for a range of applications. A *bucket* is a fundamental container in S3 into which users can upload unlimited amounts of data in the form of objects. Amazon S3 supports fine-grained access control to the data based on the needs of the user. Ensuring that only intended users have access to their resources is important for the security of those resources. While the policy language allows for compact specifications of expressive policies, reasoning about the interaction between the semantics of different policy statements can be challenging to evaluate manually, especially in large policies with multiple operators and conditions.

To help AWS users secure their resources, we built Zelkova, a policy analysis tool designed to reason about the semantics of AWS access control policies. Zelkova translates policies and properties into Satisfiability Modulo Theories (SMT) formulas and uses SMT solvers to prove a variety of security properties such as "*Does the policy grant broad public access?*" [6]. The SMT encoding uses the theory of strings, regular expressions, bit vectors, and integer comparisons. The use of the wildcards ∗ (any number of characters) and ? (exactly one character) in the string constraints makes the decision problem PSPACE-complete. Zelkova uses a portfolio solver: it invokes multiple solvers in the backend and uses the results from the solver that returns first, in a winner-takes-all strategy. This allows us to leverage the diversity among solvers and quickly solve queries—a couple hundred milliseconds to tens of seconds. A sample of AWS services that integrate Zelkova includes Amazon S3 (object storage), AWS Config (change-based resource auditor), Amazon Macie (security service), AWS Trusted Advisor (compliance to AWS best practices), and Amazon GuardDuty (intelligent threat detection). Zelkova drives preventative control features such as Amazon S3 Block Public Access and visibility into who outside an account has access to its resources [19].
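The wildcard semantics described above can be illustrated with a small sketch. This is not Zelkova's actual encoding (which targets SMT string and regular-expression theories); it is a toy translation of a policy pattern into an anchored regular expression, with hypothetical helper names:

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate an IAM-style pattern into an anchored regex:
    '*' matches any number of characters, '?' exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # treat everything else literally
    return re.compile("^" + "".join(parts) + "$")

def matches(pattern: str, value: str) -> bool:
    return wildcard_to_regex(pattern).match(value) is not None

# '*' spans multiple path segments; '?' matches exactly one character.
print(matches("arn:aws:s3:::my-bucket/*", "arn:aws:s3:::my-bucket/logs/a.txt"))  # True
print(matches("vpc-?", "vpc-a"))   # True
print(matches("vpc-?", "vpc-ab"))  # False
```

In the real system these patterns become constraints over symbolic strings, so the solver reasons about all matching requests at once rather than testing concrete values as this sketch does.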

Zelkova is an automated reasoning tool developed by formal methods experts, and it requires some degree of expertise in formal methods to use. We cannot expect all AWS users to be experts in formal methods, to have the time to be trained in the use of formal methods tools, or even to be experts in the cloud domain. In this paper, we present the three pillars of our solution that enable Zelkova to be used by *all AWS users*. Using a combination of techniques, namely eliminating specifications, domain-specific abstractions, and advances in SMT solvers, we make the power of Zelkova available to all AWS users.

### **2 Eliminate Writing Specifications**

#### **End users will not write a specification**

Zelkova follows a traditional verification approach: it takes as input a policy and a specification, and produces a yes or no answer. Developers and cloud administrators author policies to govern access to cloud resources. Someone else, a security engineer, writes a specification of what is considered acceptable. The automated reasoning engine Zelkova does the verification and returns a yes or no answer. This approach is effective for a limited number of use cases, but it is hard to scale to all AWS users. The bottleneck to scaling the verification effort is the *human effort required to specify what is acceptable behavior*. The SLAM work made a similar observation about specifications: for Static Driver Verifier to be usable, they needed to provide the specifications along with the tool [7]. A person has to put in a lot of work upfront to define acceptable behavior, and only at the end of the process do they get back an answer—a boolean. It's a single bit of information for all the work they've put in. They have no information about whether they had the right specification or whether they wrote the specification correctly.

To scale our approach to all AWS users, we had to fundamentally rethink our approach and completely remove the bottleneck of having people write a specification. To achieve that, we flipped the rules of the game and made the automated reasoning engine responsible for specification. We had the machine put in the upfront cost. Now it takes as input a policy and returns a detailed set of findings (declarative statements about what is true of the system). These findings are presented to a user, the security engineer, who reviews these findings and makes decisions about whether these findings represent valid risks in the system that should be fixed or are acceptable behaviors of the system. Users are now taking the output of the machine and saying "yes" or "no".

#### **2.1 Generating Possible Specifications (Findings)**

To remove the bottleneck of specification, we changed the question from *Is this policy correct?* to *Who has access?*. The response to the former is a boolean, while the response to the latter is a set of findings. AWS access control policies specify *who* has access to a given resource, via a set of Allow and Deny statements that grant and prohibit access, respectively. Figure 1 shows a simplified policy specifying access to an AWS resource. This policy specifies conditions on the cloud-based network (known as a VPC) from which the request originated and on the organizational Amazon customer (referred to by an Org ID) who made the request. The first statement *allows* access to any request whose SrcVpc is either vpc-a *or* vpc-b. The second statement *allows* access to any request whose OrgId is o-2. However, the third statement *denies* access from vpc-b *unless* the OrgId is o-1.

For each request, access is granted only if: (a) *some* Allow statement matches the request, and (b) *none* of the Deny statements match the request. Consequently, it can be quite tricky to determine what accesses are allowed by a given policy. First, individual statements can use regular expressions, negation, and conditionals. Second, to know the effect of an allow statement, one must consider all possible deny statements that can *overlap* with it, *i.e.*, can refer to the same request as the allow. Thus, policy verification is not *compositional*, in that we cannot determine if a policy is "correct" simply by *locally* checking that each statement is "correct." Instead, we require a *global* verification mechanism, that simultaneously considers all the statements and their subtle interactions, to determine if a policy grants only the intended access.
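The allow/deny semantics above can be sketched as a toy evaluator. The policy encoded below is a hand-written stand-in for the Fig. 1 policy sketch; the field names `SrcVpc` and `OrgId` follow the text, while the representation of statements as (effect, predicate) pairs is purely illustrative:

```python
def allowed(request, statements):
    """A request is granted iff some Allow statement matches it
    and none of the Deny statements match it."""
    some_allow = any(eff == "Allow" and pred(request) for eff, pred in statements)
    no_deny = not any(eff == "Deny" and pred(request) for eff, pred in statements)
    return some_allow and no_deny

# Stand-in for the Fig. 1 policy sketch described in the text.
policy = [
    ("Allow", lambda r: r.get("SrcVpc") in ("vpc-a", "vpc-b")),
    ("Allow", lambda r: r.get("OrgId") == "o-2"),
    # Deny vpc-b unless OrgId is o-1:
    ("Deny",  lambda r: r.get("SrcVpc") == "vpc-b" and r.get("OrgId") != "o-1"),
]

print(allowed({"SrcVpc": "vpc-a", "OrgId": "o-9"}, policy))  # True
print(allowed({"SrcVpc": "vpc-b", "OrgId": "o-1"}, policy))  # True
print(allowed({"SrcVpc": "vpc-b", "OrgId": "o-2"}, policy))  # False (Deny wins)
```

The last example shows why verification is not compositional: the second Allow matches, yet the request is rejected because a Deny overlaps with it.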

For the example policy sketch shown in Fig. 1, access can be summarized through a set of three findings, which say that access is granted to a request iff:

- SrcVpc is vpc-a, or
- SrcVpc is vpc-b and OrgId is o-1, or
- OrgId is o-2.


The findings are sound: no other requests are granted access. They are also mostly precise: most requests matching the conditions are indeed granted access. The finding "OrgId is o-2" also covers some requests that are not allowed, *e.g.*, when SrcVpc is vpc-b. We sacrifice this precision to keep the findings understandable; precise findings would need to include negation, which would add complexity for the users making decisions. Finally, the findings compactly summarize the policy in three positive statements declaring *who* has access. In principle, the notion of compact findings is similar to abstract counterexamples or minimizing counterexamples [21,30,32]. Since the findings are produced by the machine and already verified to be true, a person only has to decide whether they *should be* true. The human is making a judgment call and expressing intent.

We use stratified predicate abstraction for computing the findings. Enumerating all possible requests is computationally *intractable*, and even if it were not, the resulting set of findings is far too large and hence *useless*. We tackle the problem of summarizing the super-astronomical request-space by using *predicate abstraction*. Specifically, we make a syntactic pass over the policy to extract the set of constants that are used to constrain access, and we use those constants to generate a family of predicates whose conjunctions compactly describe partitions of the space of all requests. For example, from the policy in Fig. 1 we would extract the following predicates

$$\begin{array}{lll} p_a \doteq \mathsf{SrcVpc} = \textsf{vpc-a}, & p_b \doteq \mathsf{SrcVpc} = \textsf{vpc-b}, & p_\star \doteq \mathsf{SrcVpc} = \star,\\ q_1 \doteq \mathsf{OrgId} = \textsf{o-1}, & q_2 \doteq \mathsf{OrgId} = \textsf{o-2}, & q_\star \doteq \mathsf{OrgId} = \star. \end{array}$$

The first row has three predicates describing the possible value of the SrcVpc of the request: that it equals vpc-a or vpc-b or some value other than vpc-a and vpc-b.


**Fig. 3.** Cubes generated by the predicates $p_a, p_b, p_\star, q_1, q_2, q_\star$ extracted from the policy in Fig. 1, and the result of querying Zelkova to check whether the requests corresponding to each cube are granted access by the policy.

Similarly, the second row has three predicates describing the value of the OrgId of the request: that it equals o-1 or o-2 or some value other than o-1 and o-2.

We can compute findings by enumerating all the *cubes* generated by the above predicates and querying Zelkova to determine if the policy allows access to the requests described by each cube. The enumeration of cubes is common in SAT solvers and other predicate-abstraction-based approaches [8,15,36]. The set of all the cubes is shown in Fig. 3. The chief difficulty with enumerating all the cubes *greedily* is that we end up eagerly splitting cases on the values of fields when that may not be required. For example, in Fig. 3, we split cases on the possible value of OrgId even though it is irrelevant when SrcVpc is vpc-a. This observation points the way to a new algorithm in which we *lazily* generate the cubes as follows. Our algorithm maintains a *worklist* of minimally refined cubes. At each step, we (1) ask Zelkova if the cube allows an access that is not covered by any of its refinements; (2) if so, we add it to the set of findings; and (3) if not, we refine the cube "point-wise" along the values of each field individually and add the results to the worklist. The above process is illustrated in Fig. 2.
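The eager cube enumeration of Fig. 3 can be sketched in a few lines. Here, direct evaluation of the Fig. 1 policy stands in for the SMT queries that Zelkova would answer, so this illustrates the cube structure rather than the real system; the symbolic value `"*other*"` plays the role of the $\star$ predicates:

```python
from itertools import product

# Each field's value space is partitioned by the extracted predicates.
SRC = ["vpc-a", "vpc-b", "*other*"]   # p_a, p_b, p_star
ORG = ["o-1", "o-2", "*other*"]       # q_1, q_2, q_star

def allowed(src, org):
    """Stand-in oracle for Zelkova: semantics of the Fig. 1 policy sketch."""
    allow = src in ("vpc-a", "vpc-b") or org == "o-2"
    deny = src == "vpc-b" and org != "o-1"
    return allow and not deny

# Greedy (eager) enumeration: one query per conjunction of predicates.
for src, org in product(SRC, ORG):
    verdict = "Allow" if allowed(src, org) else "Deny"
    print(f"SrcVpc={src:8}  OrgId={org:8} -> {verdict}")
```

The lazy worklist algorithm described above would avoid most of these nine queries: once the unsplit cube for SrcVpc = vpc-a is known to allow access regardless of OrgId, it is reported as a finding and never refined.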

The specifications, or findings, generated by the machine are presented in the context of the access control domain. Developers do not have to learn a new formalism to specify correctness, think about what they want to be true of the system, or check the completeness of their specifications. This is a very important lesson that we need to apply across many other applications for formal methods to be successful at scale. The challenge is that the specifics depend on the domain.

#### **3 Domain-Specific Abstractions**

#### **It's all about the end user**

Zelkova was developed by formal methods subject-matter experts who learned the domain of AWS access control policies. Once we had the analysis engine, we faced the same challenges all other formal methods tool developers had before us. How do we make it accessible to all users? One hard-earned lesson was "eliminating the need for specifications," as discussed in the previous section. But that was only part of the answer. There was a lot more to do. Many more questions to answer—How do we get users to use it? How do we present the results to the users? How do the results stay updated? The answer was to design and build domain-specific abstractions. Do one thing and do it really well.

**Fig. 4.** Interface that presents Access Analyzer findings to users.

We created a higher level service on top of Zelkova called IAM Access Analyzer. We provide a one-click way to enable Access Analyzer for an AWS account or AWS Organization. An account in AWS is a fundamental construct that serves as a container for the user's resources, workloads, and data. Users can create policies to grant access to resources in their account to other users. In Access Analyzer, we use the account as a *zone of trust*. This abstraction lets us say that access to resources by users within their zone of trust is considered safe. But access to resources outside their zone of trust is potentially unsafe.

Once a user enables Access Analyzer, we use stratified predicate abstraction to analyze the policies and generate findings showing which users outside the zone of trust have access to resources. We had to shift from a mode where Zelkova can answer "any access query" to one where Zelkova can enumerate "who has access to what". This brings to attention the permissions that could lead to unintended access of data. While this idea seems simple in hindsight, it took us a couple of years to figure out the right abstraction for the domain. It can be used by all AWS users, who do not need to be experts in formal methods or even have a deep understanding of how access control in the cloud works.

Each finding includes details about the resource, the external entity with access to it, and the permissions granted so that the user can take appropriate action. We present example findings in Fig. 4. Note these findings are not presented as SMT-LIB formulas but rather in the domain that the user expects: AWS access control constructs. These map to the findings presented in the previous section for Fig. 1. Users can view the details included in the finding to determine whether the access is intentional or a potential risk that the user should resolve.

Most automated reasoning tools are run as a one-off: prove something, and then move on to the next challenge. In the cloud environment this was not the case; doing the analysis once was not sufficient in our domain. We had to design a means to continuously monitor the environment and changes to access control policies within the zone of trust, and to update the findings accordingly. To that end, whenever a user adds a new policy or changes an existing one, Access Analyzer analyzes the affected policies and either generates new findings, removes findings, or updates the existing findings. Access Analyzer also analyzes all policies periodically, so that in the rare case where a change event to a policy is missed by the system, it is still able to keep the findings updated. The ease of enablement, just-in-time analysis on updates, and periodic analysis across all policies are the key factors in getting us to a billion queries daily.
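The continuous-monitoring loop described above can be sketched as follows. The class and method names are illustrative, not Access Analyzer's actual design, and `analyze` stands in for the Zelkova-backed analysis.

```python
class FindingsMonitor:
    """Keep findings fresh: re-analyze on every policy change event,
    and periodically re-analyze everything in case an event was missed."""

    def __init__(self, analyze):
        self.analyze = analyze   # policy -> list of findings (the engine)
        self.policies = {}       # policy id -> policy document
        self.findings = {}       # policy id -> current findings

    def on_policy_event(self, policy_id, policy):
        """Called when a policy is added, changed, or deleted."""
        if policy is None:                       # deletion removes findings
            self.policies.pop(policy_id, None)
            self.findings.pop(policy_id, None)
        else:                                    # add/update regenerates them
            self.policies[policy_id] = policy
            self.findings[policy_id] = self.analyze(policy)

    def periodic_sweep(self):
        """Safety net: recompute findings for all known policies."""
        for policy_id, policy in self.policies.items():
            self.findings[policy_id] = self.analyze(policy)
```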

# **4 SMT Solving at Cloud Scale**

#### **Every query matters**

The use of SMT solving in AWS features and services means that millions of users are relying on the correctness and timeliness of the underlying solvers for the security of their cloud infrastructure. The challenges around correctness and timeliness in solver queries have been well studied in the automated reasoning community, but they have been treated as independent features. Today, we are generating a billion SMT queries every day to support various use cases across a wide variety of AWS services. We have discovered an intricate dependency between correctness and timeliness that manifests at this scale.

#### **4.1 Monotonicity in Runtimes Across Solver Versions**

Zelkova uses a portfolio solver to discharge its queries. When given a query, Zelkova invokes multiple solvers in the backend and uses the results from the solver that returns first, in a winner-takes-all strategy [6]. The portfolio approach allows us to leverage the diversity amongst solvers. One of our goals is to leverage the latest advancements in the SMT solver community. SMT solver researchers and developers are fixing issues, improving existing features, adding new theories, adding features such as proof generation, and making other performance improvements. Before deploying a new version of a solver within the production environment, we perform extensive offline testing and benchmarking to gain confidence in the correctness of the answers and the performance of the queries, and to ensure there are no regressions.
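The winner-takes-all strategy can be sketched with a thread pool. This is a simplified model under assumed names; the production portfolio runs separate solver processes and cancels the losers, and the solver callables here are placeholders.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def portfolio_solve(query, solvers, timeout=30.0):
    """Run every solver on the same query in parallel and take the
    answer of whichever returns first (winner takes all)."""
    pool = ThreadPoolExecutor(max_workers=len(solvers))
    try:
        futures = {pool.submit(fn, query): name for name, fn in solvers.items()}
        done, _ = wait(futures, timeout=timeout, return_when=FIRST_COMPLETED)
        if not done:
            return ("timeout", None)   # caller must treat a timeout conservatively
        winner = next(iter(done))
        return (futures[winner], winner.result())
    finally:
        pool.shutdown(wait=False)      # do not wait for the losing solvers
```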

While striving for correctness and timeliness, one of the challenges we face is that new solver versions are not monotonically better in performance than their previous versions. A solution that works well in the cloud setting is a massive portfolio, sometimes even containing older versions of the same solver. This presents two issues. One, when we discover a bug in an older version of a solver, we need to patch that old version. This creates an operational burden of maintaining many different versions of the different solvers. Two, as the number of solvers increases, we need to ensure that each solver provides a correct result. Checking the correctness of queries that result in SAT is straightforward, but SMT solvers need to provide proofs for the UNSAT queries. The proof generation and checking need to be timely as well.

**Fig. 5.** Comparing the runtime for solving SMT queries generated by Zelkova by CVC4 and the different cvc5 versions (a) CVC4 vs. cvc5 version 0.0.4, (b) CVC4 vs. cvc5 version 0.0.7. Comparing the runtimes of winner take all in the portfolio solver of Zelkova with: (c) a portfolio solver consisting of Z3 sequence string solver, Z3 automata solver, and cvc5 version 0.0.4 (d) a portfolio solver consisting of Z3 sequence string solver, Z3 automata solver, and cvc5 version 0.0.7. Evaluating the performance of the latest cvc5 version 1.0.0 with its older versions (e) cvc5 version 0.0.4 and (f) cvc5 version 0.0.7

In the Zelkova portfolio solver [6], we use CVC4, and our original goal was to replace CVC4 with the then latest version of cvc5 (version 0.0.4)<sup>1</sup>. We wanted to leverage the proof checking capabilities of cvc5 to ensure the correctness of UNSAT queries [11]. To check the timeliness requirements, we ran experiments across our benchmarks, comparing the results of CVC4 to those of cvc5 (version 0.0.4). The results across a representative set of queries are shown in Fig. 5(a). In the graph we have approximately 15,000 SMT queries that are generated by Zelkova; we select a distribution of queries that are solved between 1 s and 30 s, after which the solver process is killed and a timeout is reported. Some queries that are not solved by CVC4 within the time bound of 30 s are now being solved by cvc5 (version 0.0.4), as seen by the points in the graph along the y-axis on the extreme right. However, cvc5 (version 0.0.4) times out on some queries that are solved by CVC4, as seen by the points on the top of the graph.

The results presented in Fig. 5(b) are not surprising given that the problem space is computationally hard, and there is an inherent randomness in search heuristics within SMT solvers. In an evaluation of cvc5, the authors discuss examples where CVC4 outperforms cvc5 [10]. But this poses a challenge for us when we are using the results of these solvers in security controls and services that millions of users rely on. The changes did not meet the timeliness requirement of continuing to solve the queries within 30 s. When a query times out, the analysis, to be sound, marks the bucket as public. A timeout on a query that was previously being solved therefore leads to the user no longer being able to access the resource. This is unexpected for the user because there was no change in their configuration.

For example, consider the security checks in the Amazon S3 Block Public Access that block requests based on the results of the analysis. In this context, suppose that there was a bucket marked as "not public" based on the results of a query, and now that same query times out; the bucket will be marked as "public". This will lock down access to the bucket and the intended users will not be able to access it. Even a single regression that leads to loss of access for the user is not an acceptable change. As another example, these security checks are also used by IoT devices. In the case of a smart lock, a time out in the query that was previously being solved could lead to a loss of access to the user's home. The criticality of these use cases combined with the end user expectation is a key challenge in our domain.

We debugged and fixed the issue in cvc5 that was causing certain queries to time out. But even with this fix, CVC4 was 2x faster than cvc5 for many easier problems that took 1 s to solve originally. This slowdown was significant for us because Zelkova is called in the request path of security controls such as Amazon S3 Block Public Access. When a user attempts to attach a new access control policy or update an existing one, a synchronous call is made to Zelkova and the corresponding portfolio solvers to determine if the access control policy

<sup>1</sup> Note that while this section talks in detail about the CVC solver, the observations are common across all solvers. We select the results of the CVC solver as a representative because it is a mature solver with an active community.

being attached grants unrestricted public access or not. The bulk of the analysis time is spent in the SMT solvers, so doubling the analysis time for queries can lead to a degraded user experience. Where and how the analysis results are used plays an important role in how we track changes to the timeliness of the solver queries.

Our solution was to *add a new solver to the portfolio rather than replace an existing solver*. We added cvc5 (version 0.0.7) to the existing portfolio of solvers consisting of CVC4, Z3 with the sequence string solver, and a custom Z3-based automata solver. When we started the evaluation of cvc5, we did not plan to add a new version of the CVC solver to the portfolio. We had expected the latest version of cvc5 to be comparable in timeliness to CVC4. We worked closely with the CVC developers, and cvc5 was better on many queries, but it did not meet our timeliness requirements on all queries. This led to our decision to add cvc5 (version 0.0.7) to the Zelkova portfolio solver.

Figure 5(c) compares the winner-takes-all runtimes of the portfolio consisting of the two Z3 solvers, CVC4, and cvc5 (version 0.0.4) with those of the portfolio *without* cvc5 (version 0.0.4). The same comparison, now with cvc5 (version 0.0.7), is shown in Fig. 5(d). The results show that the portfolio solving approach that Zelkova takes in the cloud is an effective one.

The cycle now repeats with cvc5 (version 1.0.0), and the same question comes up again: "do we replace the existing cvc5 version with the latest, or add yet another version of CVC to the portfolio solver?" Some early experiments show that there is no clear answer yet. The results so far comparing the different versions of cvc5, shown in Fig. 5(e) and (f), indicate that the latest version of cvc5 is not monotonically better in performance than either of its previous versions. We do want to leverage the better proof generating capabilities of cvc5 (version 1.0.0) in order to gain more assurance in the correctness of the UNSAT queries.

#### **4.2 Stability of the Solvers**

We have spent quite a bit of time defining and implementing the encoding of the AWS access control policies into SMT. We update the encoding as we expand to more use cases or when we support new features in AWS. This is a slow and careful process that requires expertise in understanding AWS and how SMT solvers work. There is a lot of trial and error to figure out what encoding is correct and performant.

To illustrate the importance of the encoding, we present an experiment on solver runtimes with different orderings of clauses for our encoding (Fig. 6). For the same set of problem instances used in Fig. 5, we use the standard SMT competition scrambler<sup>2</sup> to reorder assertions and terms and to rename variables, and study the effect of clause ordering on our default encoding. In Fig. 6, each point on the x-axis corresponds to a single problem instance. For each problem instance, we run it in its original form (default encoding), which gives the "base time", and

<sup>2</sup> https://github.com/SMT-COMP/scrambler.

**Fig. 6.** Variance in runtimes after shuffling terms in the problem instances.

five shuffled versions. This gives us a total of six versions of the problem; for each problem instance x, we record the min, max, and mean times over these versions.


The instances are sorted by base time, so the line looks smooth in base time while the other points look more scattered. The comparison between CVC4 in Fig. 6(a) and cvc5 in Fig. 6(b) shows that cvc5 can solve more problems with the default encoding, as shown by the smooth base line. However, when we shuffle the assertions, terms, and other constructs in the problem instance, the performance of cvc5 varies more dramatically than that of CVC4. The points for the maximal time are spread wider across the graph, and there are now several timeouts in Fig. 6(b).

### **4.3 Concluding Remarks**

Based on our experience from generating a billion SMT queries a day, we propose some general areas of research for the community. We believe these are key to enabling the use of solvers to evaluate security controls, and to enable applications in emerging technologies such as quantum computing, blockchains, and bio-engineering.

**Monotonicity and Stability in Runtimes.** One of the main challenges we encountered is the lack of monotonicity and stability in runtimes within a given solver version and across different versions. Providing this stability is a fundamentally hard problem due to the inherent randomness in SMT solver heuristics, search strategies, and configuration flags. One approach would be to incorporate the algorithm portfolio approach [31,34,42] within mainstream SMT solvers. One way to enable algorithm portfolios is to leverage serverless and cloud computing environments, and to develop parallel SMT solving and distributed search strategies. At AWS, this is an area that we are investing in as well. There has been some work in parallel and distributed SMT solving [41,45], but we need more. Another avenue of research would be to develop specialized solvers that focus on a specific class of problems. SMT-COMP could devise categories that make room for specific types of problem instances as an incentive for developing such solvers.

**Reduce the Barrier to Entry.** Generating a billion SMT queries a day is a result of the exceptional work and innovation of the entire SMT community over the past 20 years. A question we are thinking about is how to replicate the success described here for other domains in Amazon and elsewhere. There is a natural tendency in the formal methods community to target tools at the expert user. This limits their broader use and applicability. If we can find ways to lower the barrier to adoption, we can gain greater traction and improve the security, correctness, availability, and robustness of more systems.

**More Abstractions.** SMT solvers are powerful engines. One potential research direction for the broader community is to provide one or more higher-level languages that allow people to specify their problems. We could create different languages based on the domain and take into account the expectations of developers. This would make interacting with a solver a more black-box exercise. The success we have had with SMT in Amazon can be recreated in other domains if we provide developers the ability to easily encode their problems in a higher-level language and use SMT solvers to solve them. This scales more easily because it does not require a formal methods expert as an intermediary. Developing new abstractions or intermediate representations could be one approach to unlock billions of other SMT queries.

**Proof Generation.** All SMT solvers should be generating proofs to help the end user gain confidence in the results. There has been some initial work in this area [9,20,27,43,44], but SMT has a long way to go to catch up with SAT solvers, and for good reason. Proof production is important for us to gain greater confidence in the correctness of our answers, though it creates a tension with timeliness. We need proof production to be performant and the tools that check the generated proofs to be correct themselves. Work on different testing approaches, including fuzzing and property-based testing of SMT solvers, should continue with the same rigor and enthusiasm. Using these fuzz testing and mutation testing based techniques in the development workflow of SMT solvers should become mainstream.

We are working to provide a set of benchmarks that can be leveraged by SMT developers to help further their work, are funding research grants in these areas, and are willing to evaluate new solvers.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Program Verification with Constrained Horn Clauses (Invited Paper)**

Arie Gurfinkel(B)

University of Waterloo, Waterloo, Canada arie.gurfinkel@uwaterloo.ca

**Abstract.** Many problems in program verification, Model Checking, and type inference are naturally expressed as satisfiability of a verification condition expressed in a fragment of First-Order Logic called Constrained Horn Clauses (CHC). This moves program analysis and verification tasks into the realm of first-order satisfiability, and thus into the realm of SMT solvers. In this paper, we give a brief overview of how CHCs capture verification problems for sequential imperative programs, and discuss the CHC solving algorithm underlying the Spacer engine of the SMT solver Z3.

### **1 Introduction**

First-Order Logic (FOL) is a powerful formalism that naturally captures many interesting decision (and optimization) problems. In recent years, there has been tremendous progress in automated logic reasoning tools, such as Boolean SATisfiability (SAT) and Satisfiability Modulo Theories (SMT) solvers. This has enabled the use of logic and logic satisfiability solvers as a universal solution to many problems in Computer Science, in general, and in Program Analysis, in particular. Most new program analysis techniques formalize the desired analysis task in a fragment of FOL, and delegate the analysis to a SAT or an SMT solver. Examples include deductive verification tools such as Dafny [30] and Why3 [13], symbolic execution engines such as KLEE [7], Bounded Model Checking engines such as CBMC [10] and SMACK [9], and many others.

In this paper, we focus on a fragment of FOL called Constrained Horn Clauses (CHC). CHCs arise in many applications of automated verification. They naturally capture such problems as discovery and verification of inductive invariants [4,18]; Model Checking of safety properties of finite- and infinite-state systems [2,23]; safety verification of push-down systems (and their extensions) [4,28]; modular verification of distributed and parameterized systems [17,19,33]; and type inference [35,36]; among many others.

Using CHC, developers of program analysis tools can separate the process of developing a proof methodology (also known as generation of a Verification Condition (VC)) from the algorithmic details of deciding whether the VC is correct. Such a flexible design simplifies supporting multiple proof methodologies, multiple languages, and multiple verification tasks within a single framework. Today, there are multiple effective program verification tools based on the CHC methodology, including the C/C++ verification framework SeaHorn [18], the Java verification framework JayHorn [25], the Android information flow verification tool HornDroid [8], the Rust verification framework RustHorn [31], and the Solidity verification tools SmartACE [37] and Solidity Compiler Model Checker [1]. Many more approaches utilize CHC as part of a more general verification solution.

The idea of reducing program verification (and model checking) to FOL satisfiability is well researched. A great example is the use of *Constraint Logic Programming* (CLP) [24] in program verification, or the use of Datalog for pointer analysis [34]. What is unique here is the application of SMT solvers in the decision procedure, and the lifting of techniques developed in the Model Checking and Program Verification communities to the uniform setting of satisfiability of CHC formulas. In the rest of this paper, we show how verification problems can be represented in CHCs (Sect. 2), and describe the key algorithms behind Spacer [27], a CHC engine of the SMT solver Z3 [32] that is used to solve them (Sect. 3).

### **2 Logic of Constrained Horn Clauses**

In this section, we give a brief overview of Constrained Horn Clauses (CHC). We illustrate an application of CHC to verification of a simple imperative program with a loop.

The logic of Constrained Horn Clauses is a fragment of FOL. We assume that the reader is familiar with the basic concepts of FOL, including signatures, theories, and models. For the purpose of this presentation, let Σ be some fixed FOL signature and A an FOL theory over Σ. For example, Σ is a signature for arithmetic, including constants 0 and 1 and a binary function · + ·, and A is the theory of Presburger arithmetic. A *Constrained Horn Clause (CHC)* is an FOL sentence of the form:

$$\forall V \cdot (\varphi \land p\_1(X\_1) \land \dots \land p\_k(X\_k) \implies h(X)) \tag{1}$$

where V is the set of all free variables in the body of the sentence, p<sub>1</sub>, ..., p<sub>k</sub> and h are uninterpreted predicate symbols (in the signature), X<sub>1</sub>, ..., X<sub>k</sub> and X are lists of first-order terms, and p(X) stands for the application of predicate p to the list of terms X.

A CHC in Eq. (1) can be equivalently written as the following clause:

$$(\neg \varphi \lor \neg p\_1(X\_1) \lor \dots \lor \neg p\_k(X\_k) \lor h(X))\tag{2}$$

where all free variables are implicitly universally quantified. Note that in this case only h appears positively, which explains why these are called *Horn* clauses. We write CHC(A) to denote the set of all sentences in FOL modulo theory A that can be written as a set of Constrained Horn Clauses. A sentence Φ is in CHC(A) if it can be written as a conjunction of clauses of the form of Eq. (1).
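The defining syntactic restriction, at most one positive occurrence of an uninterpreted predicate per clause, is easy to check mechanically. A minimal sketch over a toy clause representation, with literals as `(sign, predicate)` pairs and `None` standing for the constraint ϕ:

```python
def is_horn(clause):
    """A clause is Horn if at most one literal over an uninterpreted
    predicate occurs positively; the theory constraint is unrestricted."""
    positive_preds = [p for sign, p in clause if sign and p is not None]
    return len(positive_preds) <= 1

# Eq. (2): (neg phi) or (neg p1(X1)) or ... or (neg pk(Xk)) or h(X) -- Horn
chc = [(False, None), (False, "p1"), (False, "p2"), (True, "h")]
# Two positive predicate literals -- not a Horn clause
not_chc = [(True, "p1"), (True, "h")]
```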

**Fig. 1.** A program and its verification conditions in CHC.

A CHC(A) sentence Φ is satisfiable if there exists a model M of A extended with interpretation for all of the uninterpreted predicates in Φ such that M satisfies Φ, written M |= Φ. In practice, we are often interested not in an arbitrary model, but a model that can be described concisely in some target fragment of FOL. We call such models *solutions*. Given an FOL fragment F, an F-solution to a CHC(A) formula Φ is a model M such that M |= Φ and interpretation of every uninterpreted predicate in M is definable in F. Most commonly, F is taken to be either a quantifier free or universally quantified fragment of arithmetic A, often further extended with arrays.

*Example 1.* To illustrate the definitions above, consider the C program of a simple counter shown in Fig. 1. The goal is to verify that the assertion at the end of the program holds on every execution. To verify the assertion using the principle of inductive invariants, we need to show that there exists a formula Inv(x) over the program variable x such that (a) it is true before the loop, (b) it is preserved by every iteration of the loop, and (c) it guarantees the assertion when the loop terminates. Since we are interested in partial correctness, we are not concerned with the case when the loop does not terminate. This principle is naturally encoded as the three Constrained Horn Clauses shown in Fig. 1. The uninterpreted predicate *Inv* represents the inductive invariant. The program is correct, hence the CHCs are satisfiable. The satisfying model extends the theory of arithmetic with the following definition of *Inv*:

$$Inv^{\mathcal{M}} = \{ z \mid z \le 5 \} \tag{3}$$

The CHCs also have a *solution* in the quantifier free theory of Linear Integer Arithmetic. In particular, *Inv* can be defined as follows:

$$Inv = \lambda z \cdot z \le 5\tag{4}$$

where the notation λx · ϕ denotes an anonymous function with argument x and body ϕ.
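The solution can be sanity-checked by brute force over a finite range of integers. The sketch below assumes, hypothetically (Fig. 1 is not reproduced here), that the counter initializes x to 0, increments it while x < 5, and asserts x == 5 on exit; the three checks mirror the three CHCs.

```python
def check_inductive_invariant(inv, init, guard, step, post, domain):
    """Check the three CHCs by exhaustive enumeration over a finite
    domain: initiation, consecution, and safety at loop exit."""
    initiation  = all(inv(x) for x in domain if init(x))
    consecution = all(inv(step(x)) for x in domain if inv(x) and guard(x))
    safety      = all(post(x) for x in domain if inv(x) and not guard(x))
    return initiation and consecution and safety

# Hypothetical reading of the counter program of Fig. 1.
ok = check_inductive_invariant(
    inv=lambda z: z <= 5,       # the solution Inv = (lambda z . z <= 5)
    init=lambda x: x == 0,
    guard=lambda x: x < 5,
    step=lambda x: x + 1,
    post=lambda x: x == 5,
    domain=range(-10, 11),
)
```

A weaker candidate such as λz · z ≤ 4 fails consecution, so the same check rejects it.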

The CHCs in this example can be expressed as an SMT-LIB script, shown in Fig. 2, and solved by the Spacer engine of Z3. Note that the script uses some Z3-specific extensions, including the logic HORN and several options that disable preprocessing (which is not necessary for such a simple example).

*Example 2.* Figure 3 shows a similar program, however, with a function inc that abstracts away the increment operation. The corresponding CHCs are also shown

**Fig. 2.** CHCs from Fig. 1 in SMT-LIB format.

in Fig. 3. There are two unknowns: *Inv*, which represents the desired inductive invariant, and *Inc*, which represents the summary (i.e., pre- and post-conditions, or an over-approximation) of the function inc. Since the program still satisfies the assertion, the CHCs are satisfiable and have the following solution:

$$Inv^{\mathcal{M}} = \{z \mid z \le 5\} = \lambda z \cdot z \le 5\tag{5}$$

$$\operatorname{Inc}^{\mathcal{M}} = \{(z, r) \mid r = z + 1\} = \lambda z, r \cdot r = z + 1 \tag{6}$$

The corresponding SMT-LIB script is shown in Fig. 4.

*Example 3.* In this last example, consider the set of CHCs shown in Fig. 5. They are similar to the CHCs in Fig. 1, with one exception: these CHCs are unsatisfiable. There is no interpretation of *Inv* that satisfies them. This is witnessed by a refutation (a resolution proof) shown in Fig. 6. The corresponding SMT-LIB script is shown in Fig. 7.

# **3 Solving CHC Modulo Theories**

The logic of CHC can be seen as a convenient modelling language. That is, it does not restrict or impose a preference on a decision procedure used to solve the problem. In fact, a variety of solvers and techniques are widely available, including Spacer [28] (that is available as part of Z3), FreqHorn [12], and ELDARICA [22]. There is also an annual competition, CHC-COMP<sup>1</sup>, to evaluate state-of-the-art solvers. In the rest of this section, we give a brief overview of the algorithm underlying Spacer.

<sup>1</sup> https://chc-comp.github.io/.

**Fig. 4.** CHCs from Fig. 3 in SMT-LIB format.

**Fig. 5.** An example of unsatisfiable CHCs.

Spacer is an extension and generalization of SAT-based Model Checking algorithms to CHC modulo SMT-supported theories. On propositional transition systems, Spacer behaves similarly to IC3 [6] and PDR [11], and can be seen as an adaptation of these algorithms. For other first-order theories, Spacer extends Generalized PDR of Hoder and Bjørner [21].

Given a CHC system Φ, Spacer works by iteratively looking for a bounded derivation of false from Φ. It explores Φ in a top-down (or backwards) direction. Each time Spacer fails to find a derivation of a fixed bound N, the reasons for failure are analyzed to derive consequences of Φ that explain why a derivation of false must have at least N + 1 steps. This process is repeated until either (a) false is derived and Φ is shown to be unsatisfiable, (b) the consequences form a solution to Φ, thus showing that Φ is satisfiable, or (c) the process continues indefinitely, ruling out longer and longer refutations. Thus, even though the problem is in general undecidable, Spacer always makes progress, either toward showing that Φ is unsatisfiable or toward showing that there is no short proof of unsatisfiability.

Spacer is a procedure for solving linear and non-linear CHCs. For convenience of the presentation, we restrict ourselves to a special case of non-linear CHCs that consists of the following three clauses:

$$Init(X) \Rightarrow P(X) \tag{7}$$

$$P(X) \land Bad(X) \Rightarrow \bot\tag{8}$$

$$P(X) \land P(X^o) \land \text{Tr}(X, X^o, X') \Rightarrow P(X') \tag{9}$$

$$\frac{\forall x \cdot x = 0 \implies Inv(x)}{Inv(0)} \qquad \frac{Inv(0) \quad \forall x \cdot Inv(x) \land x < 5 \implies Inv(x+1)}{Inv(1)} \qquad \frac{Inv(1) \quad \forall x \cdot Inv(x) \land x \geq 1 \implies false}{false}$$

**Fig. 6.** Refutation proof for CHCs in Fig. 5.

**Fig. 7.** CHCs from Fig. 5 in SMT-LIB format.

where X is a set of free variables, X′ = {x′ | x ∈ X} and X<sup>o</sup> = {x<sup>o</sup> | x ∈ X} are auxiliary free variables, *Init*, *Bad*, and *Tr* are FOL formulas over the free variables (as indicated), and P is an uninterpreted predicate. Recall that all free variables in each clause are implicitly universally quantified. Thus, the only unknown to solve for is the uninterpreted predicate P. We call these three clauses a *safety problem*, and write ⟨*Init*(X), *Tr*(X, X<sup>o</sup>, X′), *Bad*(X)⟩ as a shorthand to represent them. It is not hard to show that satisfiability of arbitrary CHCs is reducible to a safety problem. Thus, this simplification does not lose generality. In practice, Spacer directly supports more complex CHCs with multiple unknown uninterpreted predicates.
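For intuition, when the state space is finite a safety problem can be solved by explicit forward reachability: compute the least interpretation of P closed under the initiation and transition clauses, and then check that no P-state satisfies *Bad*. The sketch below is only a toy model; Spacer works symbolically over infinite domains.

```python
from itertools import product

def solve_safety(init, tr, bad, universe):
    """Explicit-state solver for a safety problem <Init, Tr, Bad> over a
    finite universe. Computes the least P satisfying the initiation and
    (non-linear) transition clauses, then checks it against Bad."""
    p = {x for x in universe if init(x)}
    while True:
        # non-linear body P(X) and P(X^o): both premises range over p
        new = {x2 for x, xo, x2 in product(p, p, universe)
               if tr(x, xo, x2)} - p
        if not new:
            break
        p |= new
    return all(not bad(x) for x in p), p
```

For the counter example (a linear system, so the x<sup>o</sup> argument is ignored), the least P is {0, ..., 5} and the problem is safe.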

Before presenting the algorithm, we need to introduce two concepts from logic: *Craig Interpolation* and *Model Based Projection*.

*Craig Interpolation.* Given two formulas A[x, z] and B[y, z] such that A ∧ B is unsatisfiable, a *Craig interpolant* I[z] = Itp(A[x, z], B[y, z]) is a formula I[z] such that A[x, z] ⇒ I[z] and I[z] ⇒ ¬B[y, z]. We further require that the interpolant is a clause. Intuitively, the interpolant I captures the consequences of A that are inconsistent with B. If A is a conjunction of literals, the interpolant can be seen as a semantic variant of an UNSAT core.
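In the propositional case, the two defining properties of an interpolant can be checked by enumeration. A toy sketch with a single Boolean variable in each of x, y, and z:

```python
from itertools import product

def is_interpolant(i, a, b):
    """Check the two defining properties over Boolean variables:
    A(x,z) implies I(z), and I(z) implies not B(y,z)."""
    bools = (False, True)
    a_implies_i = all(i(z) for x, z in product(bools, bools) if a(x, z))
    i_refutes_b = all(not b(y, z) for y, z in product(bools, bools) if i(z))
    return a_implies_i and i_refutes_b

# A = x and z, B = y and not z: A and B is unsatisfiable, and I = z
# separates them while mentioning only the shared variable z.
```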

*Model Based Projection.* Let ϕ be a formula, U ⊆ *Vars*(ϕ) a subset of the variables of ϕ, and P a model of ϕ. Then, ψ = Mbp(U, P, ϕ) is a model based projection if (a) ψ is a monomial, (b) *Vars*(ψ) ⊆ *Vars*(ϕ) \ U, (c) P |= ψ, and (d) ψ ⇒ ∃U · ϕ. Intuitively, an MBP is an under-approximation of existential quantifier elimination, where the choice of the under-approximation is guided by the model.

**Input:** A safety problem ⟨*Init*(X), *Tr*(X, X<sup>o</sup>, X′), *Bad*(X)⟩.
**Output:** *Unreachable* or *Reachable*
**Data:** A cex queue *Q*, where a cex c ∈ *Q* is a pair ⟨m, i⟩, m is a cube over state variables, and i ∈ ℕ. A level N. A set of reachable states Reach. A trace F<sub>0</sub>, F<sub>1</sub>, ...
**Notation:** F(A, B) = *Init*(X′) ∨ (A(X) ∧ B(X<sup>o</sup>) ∧ *Tr*), and F(A) = F(A, A)
**Initially:** *Q* = ∅, N = 0, F<sub>0</sub> = *Init*, ∀i > 0 · F<sub>i</sub> = ∅, Reach = *Init*
**Require:** *Init* → ¬*Bad*

**repeat**

**Unreachable** If there is an i < N s.t. F<sub>i</sub> ⊆ F<sub>i+1</sub>, **return** *Unreachable*.

**Reachable** If Reach ∧ *Bad* is satisfiable, **return** *Reachable*.

**Unfold** If F<sub>N</sub> → ¬*Bad*, then set N ← N + 1 and *Q* ← ∅.

**Candidate** If for some m, m → F<sub>N</sub> ∧ *Bad*, then add ⟨m, N⟩ to *Q*.

**Successor** If there is ⟨m, i + 1⟩ ∈ *Q* and a model M s.t. M |= ψ, where ψ = F(∨Reach) ∧ m′, then add s to Reach, where s′ ∈ Mbp({X, X<sup>o</sup>}, ψ).

**MustPredecessor** If there is ⟨m, i + 1⟩ ∈ *Q* and a model M s.t. M |= ψ, where ψ = F(F<sub>i</sub>, ∨Reach) ∧ m′, then add ⟨s, i⟩ to *Q*, where s ∈ Mbp({X<sup>o</sup>, X′}, ψ).

**MayPredecessor** If there is ⟨m, i + 1⟩ ∈ *Q* and a model M s.t. M |= ψ, where ψ = F(F<sub>i</sub>) ∧ m′, then add ⟨s, i⟩ to *Q*, where s<sup>o</sup> ∈ Mbp({X, X′}, ψ).

**NewLemma** If there is ⟨m, i + 1⟩ ∈ *Q* s.t. F(F<sub>i</sub>) ∧ m′ is unsatisfiable, then add ϕ = Itp(F(F<sub>i</sub>), m′) to F<sub>j</sub>, for all 0 ≤ j ≤ i + 1.

**ReQueue** If ⟨m, i⟩ ∈ *Q*, 0 < i < N, and F(F<sub>i−1</sub>) ∧ m′ is unsatisfiable, then add ⟨m, i + 1⟩ to *Q*.

**Push** For 0 ≤ i < N and a clause (ϕ ∨ ψ) ∈ F<sub>i</sub>, if ϕ ∉ F<sub>i+1</sub> and F(ϕ ∧ F<sub>i</sub>) → ϕ′, then add ϕ to F<sub>j</sub>, for all j ≤ i + 1.

**until** <sup>∞</sup>;

$$\text{Algorithm 1: Rule-based description of } \text{SpaceR.}$$

We present Spacer [27] as a set of rules shown in Algorithm 1. While the algorithm is sound under any order of application of the rules, it is easy to see that only some orders lead to progress. Since solving CHCs even over LIA is undecidable, we are only concerned with soundness and progress, and do not discuss termination. The algorithm is based on the core principles of IC3 [5]; however, it differs significantly in the details. The rules **Unreachable** and **Reachable** detect termination, either by discovering an inductive solution, or by discovering the existence of a refutation, respectively. **Unfold** increases the exploration depth, and **Candidate** constructs a new *proof obligation* based on the current depth and the set *Bad* of *bad states*. **Successor** computes additional *reachable states*, that is, an under-approximation of the model of the implicit predicate P. Note that it uses Model Based Projection to under-approximate the forward predicate transformer. The rules **MustPredecessor** and **MayPredecessor** compute a new proof obligation that precedes an existing one. **MustPredecessor** does the computation based on existing reachable states, while **MayPredecessor** makes a guess based on the existing over-approximation of P. In this case, MBP is used again, but now to under-approximate a backward predicate transformer. The rule **NewLemma** computes a new over-approximation, called a *lemma*, of what is derivable about P at level i + 1 by blocking a proof obligation. This is very similar to the corresponding step in IC3. Note, however, that interpolation is used to generalize the learned lemma beyond the literals of the proof obligation. **ReQueue** allows pushing blocked proof obligations to a higher level, and **Push** allows pushing and inductively generalizing lemmas.
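The frame bookkeeping behind the **Unfold**, **NewLemma**, and **Unreachable** rules can be sketched in a few lines of Python. This is a toy illustration only: lemmas are opaque strings rather than clauses, and the `Trace` class and its method names are our own invention, not Spacer's API.

```python
class Trace:
    """Toy sketch of Spacer's trace F0, F1, ... (lemmas as opaque strings)."""

    def __init__(self, init_lemmas):
        self.frames = [set(init_lemmas)]  # F0 = Init

    def unfold(self):
        # Unfold: open a new, unconstrained frame F_{N+1}.
        self.frames.append(set())

    def add_lemma(self, lemma, level):
        # NewLemma: a lemma learned at level i strengthens every F_j, j <= i.
        for j in range(level + 1):
            self.frames[j].add(lemma)

    def unreachable(self):
        # Unreachable: if F_i ⊆ F_{i+1} for some i < N, the frames have
        # converged and F_{i+1} is an inductive invariant.
        return any(self.frames[i] <= self.frames[i + 1]
                   for i in range(len(self.frames) - 1))
```

In a real implementation each lemma is a clause checked against the transition relation by an SMT solver; here the set inclusion only mirrors the syntactic convergence test.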

Spacer was introduced in [27]. An extension for convex linear arithmetic (i.e., discovering convex and co-convex solutions) is described in [3]. Support for quantifier-free solutions for CHC over the combined theories of arrays and arithmetic is described in [26]. An extension to quantified solutions, which are necessary for establishing interesting properties when arrays are involved, is described in [20]. More recently, interpolation for lemma generalization has been replaced by more global guidance [14]. This made Spacer competitive with other data-driven approaches that infer new lemmas based on numerical values of blocked counterexamples. Machine learning-based inductive generalization has been suggested in [29]. The solver has also been extended to support Algebraic Data Types and Recursive Functions [16]. Work on improving support for bit-vectors [15] and experimenting with support for uninterpreted functions is ongoing.

### **References**



# **Formal Methods for Probabilistic Programs**

# **Data-Driven Invariant Learning for Probabilistic Programs**

Jialu Bao1(B) , Nitesh Trivedi<sup>2</sup>, Drashti Pathak<sup>3</sup>, Justin Hsu<sup>1</sup> , and Subhajit Roy<sup>2</sup>

<sup>1</sup> Cornell University, Ithaca, NY, USA
jb965@cornell.edu, email@justinh.su
<sup>2</sup> Indian Institute of Technology (IIT) Kanpur, Kanpur, India
{nitesht,subhajit}@iitk.ac.in
<sup>3</sup> Amazon, Bengaluru, India

**Abstract.** Morgan and McIver's *weakest pre-expectation* framework is one of the most well-established methods for deductive verification of probabilistic programs. Roughly, the idea is to generalize binary state assertions to real-valued *expectations*, which can measure expected values of probabilistic program quantities. While loop-free programs can be analyzed by mechanically transforming expectations, verifying loops usually requires finding an *invariant expectation*, a difficult task.

We propose a new view of invariant expectation synthesis as a *regression* problem: given an input state, predict the *average* value of the post-expectation in the output distribution. Guided by this perspective, we develop the first *data-driven* invariant synthesis method for probabilistic programs. Unlike prior work on probabilistic invariant inference, our approach can learn piecewise continuous invariants without relying on template expectations. We also develop a data-driven approach to learn *sub-invariants* from data, which can be used to upper- or lower-bound expected values. We implement our approaches and demonstrate their effectiveness on a variety of benchmarks from the probabilistic programming literature.

**Keywords:** Probabilistic programs · Data-driven invariant learning · Weakest pre-expectations

### **1 Introduction**

*Probabilistic programs*—standard imperative programs augmented with a sampling command—are a common way to express randomized computations. While the mathematical semantics of such programs is fairly well-understood [25], verification methods remain an active area of research. Existing automated techniques are either limited to specific properties (e.g., [3,9,35,37]), or target simpler computational models [4,15,28].

*Reasoning About Expectations.* One of the earliest methods for reasoning about probabilistic programs is through *expectations*. Originally proposed by Kozen [26], expectations generalize standard, binary assertions to quantitative, real-valued functions on program states. Morgan and McIver further developed this idea into a powerful framework for reasoning about probabilistic imperative programs, called the *weakest pre-expectation calculus* [30,33].

Concretely, Morgan and McIver defined an operator called the *weakest preexpectation* (wpe), which takes an expectation E and a program P and produces an expectation E′ such that E′(σ) is the expected value of E in the output distribution ⟦P⟧σ. In this way, the wpe operator can be viewed as a generalization of Dijkstra's weakest pre-conditions calculus [16] to probabilistic programs. For verification purposes, the wpe operator has two key strengths. First, it enables reasoning about probabilities and expected values. Second, when P is a loop-free program, it is possible to transform wpe(P, E) into a form that does not mention the program P via simple, mechanical manipulations, essentially analyzing the effect of the program on the expectation by syntactically transforming E.

However, there is a caveat: the wpe of a loop is defined as a least fixed point, and it is generally difficult to simplify this quantity into a more tractable form. Fortunately, the wpe operator satisfies a *loop rule* that simplifies reasoning about loops: if we can find an expectation I satisfying an *invariant* condition, then we can easily bound the wpe of a loop. Checking the invariant condition involves analyzing just the body of the loop, rather than the entire loop. Thus, finding invariants is a primary bottleneck towards automated reasoning about probabilistic programs.

*Discovering Invariants.* Two recent works have considered how to automatically infer invariant expectations for probabilistic loops. The first is Prinsys [21]. Using a template with one hole, Prinsys produces a first-order logical formula describing possible substitutions satisfying the invariant condition. While effective for their benchmark programs, the method's reliance on templates is limiting; furthermore, the user must manually solve a system of logical formulas to find the invariant.

The second work, by Chen et al. [14], focuses on inferring polynomial invariants. By restricting to this class, their method can avoid templates and can apply the Lagrange interpolation theorem to find a polynomial invariant. However, many invariants are not polynomials: for instance, an invariant may combine two polynomials piecewise by branching on a Boolean condition.

*Our Approach: Invariant Learning.* We take a different approach inspired by data-driven invariant learning [17,19]. In these methods, the program is executed with a variety of inputs to produce a set of execution traces. This data is viewed as a training set, and a machine learning algorithm is used to find a classifier describing the invariant. Data-driven techniques reduce the reliance on templates, and can treat the program as a black box—the precise implementation of the program need not be known, as long as the learner can execute the program to gather input and output data. But to extend the data-driven method to the probabilistic setting, there are a few key challenges:


*Outline.* After covering preliminaries (Sect. 2), we present our contributions.


We discuss related work in Sect. 7.

# **2 Preliminaries**

*Probabilistic Programs.* We will consider programs written in **pWhile**, a basic probabilistic imperative language with the following grammar:

P := **skip** | x ← e | x ←$ d | P ; P | **if** e **then** P **else** P | **while** e : P,

where e is a boolean or numerical expression. All commands P map memories to distributions over memories [25]; the semantics is entirely standard and can be found in the extended version. We write ⟦P⟧σ for the output distribution of program P from initial state σ. Since we will be interested in running programs on concrete inputs, *we will assume throughout that all loops are almost surely terminating*; this property can often be established by other methods (e.g., [12,13,31]).

*Weakest Pre-expectation Calculus.* Morgan and McIver's *weakest pre-expectation calculus* reasons about probabilistic programs by manipulating *expectations*.

**Definition 1.** *Denote the set of program states by* Σ*. Define the set of expectations* $\mathcal{E}$ *to be* $\{E \mid E : \Sigma \to \mathbb{R}^{\infty}_{\geq 0}\}$*. Define* $E_1 \leq E_2$ *iff* $\forall \sigma \in \Sigma : E_1(\sigma) \leq E_2(\sigma)$*. The set* $\mathcal{E}$ *is a complete lattice.*

While expectations are technically mathematical functions from Σ to the nonnegative extended reals, for formal reasoning it is convenient to work with a more restricted syntax of expectations (see, e.g., [8]). We will often view numeric expressions as expectations. Boolean expressions b can also be converted to expectations; we let [b] be the expectation that maps states where b holds to 1, and other states to 0. As an example of our notation, [flip = 0] · (x + 1) and x + 1 are two expectations, and we have [flip = 0] · (x + 1) ≤ x + 1.

$$\begin{aligned} \mathsf{wpe}(\mathsf{skip}, E) &:= E\\ \mathsf{wpe}(x \leftarrow e, E) &:= E[e/x] \\ \mathsf{wpe}(x \leftarrow\!\$\; d, E) &:= \lambda \sigma . \sum\_{v \in \mathcal{V}} [\![d]\!]\_{\sigma}(v) \cdot E[v/x] \\ \mathsf{wpe}(P \mathrel{;} Q, E) &:= \mathsf{wpe}(P, \mathsf{wpe}(Q, E)) \\ \mathsf{wpe}(\mathsf{if} \ e \ \mathsf{then} \ P \ \mathsf{else} \ Q, E) &:= [e] \cdot \mathsf{wpe}(P, E) + [\neg e] \cdot \mathsf{wpe}(Q, E) \\ \mathsf{wpe}(\mathsf{while} \ e : P, E) &:= \mathsf{lfp}(\lambda X. \ [e] \cdot \mathsf{wpe}(P, X) + [\neg e] \cdot E) \end{aligned}$$

**Fig. 1.** Morgan and McIver's weakest pre-expectation operator

Now, we are ready to introduce Morgan and McIver's *weakest pre-expectation transformer* wpe. In a nutshell, this operator takes a program P and an expectation E to another expectation E′, sometimes called the *pre-expectation*. Formally, wpe is defined in Fig. 1. The case for loops involves the least fixed point (lfp) of $\Phi^{\mathsf{wpe}}_{E} := \lambda X.\,([e] \cdot \mathsf{wpe}(P, X) + [\neg e] \cdot E)$, the *characteristic function* of the loop with respect to wpe [23]. The characteristic function is monotone on the complete lattice $\mathcal{E}$, so the least fixed point exists by the Kleene fixed-point theorem.

The key property of the wpe transformer is that for any program P, wpe(P, E)(σ) is the expected value of E over the output distribution ⟦P⟧σ.

**Theorem 1 (See, e.g.,** [23]**).** *For any program* P *and expectation* $E \in \mathcal{E}$*,*

$$\mathsf{wpe}(P, E) = \lambda \sigma. \sum\_{\sigma' \in \Sigma} E(\sigma') \cdot [\![P]\!]\sigma(\sigma').$$

Intuitively, the weakest pre-expectation calculus provides a syntactic way to compute the expected value of an expression E after running a program P, except when the program is a loop. For a loop, the least fixed point definition of wpe(**while** e : P, E) is hard to compute.

#### **3 Algorithm Overview**

In this section, we introduce the two related problems we aim to solve, and a meta-algorithm to tackle both of them. We will see how to instantiate the meta-algorithm's subroutines in Sect. 4 and Sect. 5.

*Problem Statement.* As with weakest pre-conditions of a loop, knowing an *invariant* or *sub-invariant* expectation makes it easy to bound the loop's weakest pre-expectation, but a (sub)invariant expectation can be difficult to find. Thus, we aim to develop an algorithm that automatically synthesizes invariants and sub-invariants of probabilistic loops. More specifically, our algorithm tackles the following two problems:

1. **Finding exact invariants:** Given a loop **while** G : P and an expectation postE as input, we want to find an expectation I such that

$$I = \Phi^{\mathsf{wpe}}\_{\mathsf{postE}}(I) := [G] \cdot \mathsf{wpe}(P, I) + [\neg G] \cdot \mathsf{postE}.\tag{1}$$

Such an expectation I is an *exact invariant* of the loop with respect to postE. Since wpe(**while** G : P, postE) is a fixed point of $\Phi^{\mathsf{wpe}}_{\mathsf{postE}}$, it is itself an exact invariant of the loop. Furthermore, when **while** G : P is almost surely terminating and postE is upper-bounded, the existence of an exact invariant I implies I = wpe(**while** G : P, postE). (We defer the proof to the extended version.)

2. **Finding sub-invariants:** Given a loop **while** G : P and expectations preE, postE, we aim to learn an expectation I such that

$$I \le \Phi\_{\mathsf{postE}}^{\mathsf{wpe}}(I) := [G] \cdot \mathsf{wpe}(P, I) + [\neg G] \cdot \mathsf{postE} \tag{2}$$
$$\mathsf{preE} \le I. \tag{3}$$

The first inequality says that I is a sub-invariant: on states that satisfy G, the value of I lower-bounds the expected value of I itself after running one loop iteration from that state, and on states that violate G, the value of I lower-bounds the value of postE. Any sub-invariant lower-bounds the weakest pre-expectation of the loop, i.e., I ≤ wpe(**while** G : P, postE) [22]. Together with the second inequality preE ≤ I, the existence of a sub-invariant I ensures that preE lower-bounds the weakest pre-expectation.

Note that an exact invariant is a sub-invariant, so one indirect way to solve the second problem is to solve the first problem, and then check preE ≤ I. However, we aim to find a more direct approach to solve the second problem because often exact invariants can be complicated and hard to find, while sub-invariants can be simpler and easier to find.

**Fig. 2.** Algorithm Exist

*Methods.* We solve both problems with one algorithm, Exist (short for EXpectation Invariant SynThesis). Our data-driven method resembles Counterexample Guided Inductive Synthesis (CEGIS), but differs in two ways. First, candidates are synthesized by fitting a machine learning model to data consisting of program traces starting from random input states. Our target programs are also probabilistic, introducing a second source of randomness to program traces. Second, our approach seeks high-quality counterexamples—violating the target constraints as much as possible—in order to improve synthesis. For synthesizing invariants and sub-invariants, such counterexamples can be generated by using a computer algebra system to solve an optimization problem.

We present the pseudocode in Fig. 2. Exist takes a probabilistic program geo, a post-expectation or a pair of pre/post-expectations *pexp*, and hyper-parameters N*runs* and N*states*. Exist starts by generating a list of features *feat*, which are numerical expressions formed from program variables used in geo. Next, Exist samples N*states* initial states *states* and runs geo from each of those states for N*runs* trials, recording the values of *feat* on program traces as *data*. Then, Exist enters a CEGIS loop. In each iteration of the loop, first the learner learnInv trains models to minimize their violation of the required inequalities (e.g., Eqs. (2) and (3) for learning sub-invariants) on *data*. Next, extractInv translates the learned models into a set *candidates* of expectations. For each candidate inv, the verifier verifyInv looks for program states that *maximize* inv's violation of the required inequalities. If it cannot find any program state where inv violates the inequalities, the verifier returns inv as a valid invariant or sub-invariant. Otherwise, it produces a set *cex* of counterexample program states, which are added to the set of initial states. Finally, before entering the next iteration, the algorithm augments *states* with a new batch of N*states* initial states, generates trace data from running geo on each of these states for N*runs* trials, and augments the dataset *data*. This data augmentation ensures that the synthesis algorithm collects more and more initial states, some randomly generated (sampleStates) and some from prior counterexamples (*cex*), guiding the learner towards better candidates. Like other CEGIS-based tools, our method is sound but not complete, i.e., if the algorithm returns an expectation then it is guaranteed to be an exact invariant or sub-invariant, but the algorithm might never return an answer; in practice, we set a timeout.
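The control flow of Fig. 2 can be sketched as the following Python skeleton. All subroutines are passed in as parameters (stand-ins for the paper's getFeatures, sampleStates, sampleTraces, learnInv, extractInv, and verifyInv), and a bounded iteration count plays the role of the timeout; this is a hedged reconstruction, not the actual implementation.

```python
def exist(prog, pexp, n_states, n_runs, get_features, sample_states,
          sample_traces, learn_inv, extract_inv, verify_inv, max_iters=50):
    feats = get_features(prog)            # numeric expressions over program vars
    states = sample_states(n_states)      # random initial states
    data = sample_traces(prog, pexp, feats, states, n_runs)
    for _ in range(max_iters):            # CEGIS loop (a timeout in practice)
        models = learn_inv(data)          # fit models minimizing violations
        cex = []
        for inv in extract_inv(models, feats):   # models -> candidate expectations
            ok, cex = verify_inv(prog, pexp, inv)
            if ok:
                return inv                # sound: a verified (sub-)invariant
        # augment: counterexamples plus a fresh batch of random initial states
        states = cex + sample_states(n_states)
        data = data + sample_traces(prog, pexp, feats, states, n_runs)
    return None                           # no answer within the budget
```

Because verification gates every return, a returned expectation is correct by construction even if the learned models were noisy.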

#### **4 Learning Exact Invariants**

In this section, we detail how we instantiate Exist's subroutines to learn an exact invariant I satisfying $I = \Phi^{\mathsf{wpe}}_{\mathsf{postE}}(I)$, given a loop geo and an expectation *pexp* = postE.

At a high level, we first sample a set of program states *states* using sampleStates. From each program state s ∈ *states*, sampleTraces executes geo and estimates wpe(geo, postE)(s). Next, learnInv trains regression models M to predict the estimated wpe(geo, postE)(s) given the value of features evaluated on s. Then, extractInv translates the learned models M to an expectation I. In an ideal scenario, this I would be equal to wpe(geo, postE), which is also always an exact invariant. But since I is learned from stochastic data, it may be noisy. So, we use verifyInv to check whether I satisfies the invariant condition $I = \Phi^{\mathsf{wpe}}_{\mathsf{postE}}(I)$.

The reader may wonder why we took this complicated approach, first estimating the weakest pre-expectation of the loop, and then computing the invariant: if we are able to learn an expression for wpe(geo, postE) directly, then why are we interested in the invariant I? The answer is that with an invariant I, we can also *verify* that our computed value of wpe(geo, postE) is correct by checking the invariant condition and applying the loop rule. Since our learning process is inherently noisy, this verification step is crucial and motivates why we want to find an invariant.

**Fig. 3.** Running example: program and model tree

*A Running Example.* We will illustrate our approach using Fig. 3. The simple program geo repeatedly loops: whenever x becomes non-zero we exit the loop; otherwise we increment n by 1 and draw x from a biased coin-flip distribution (x gets 1 with probability p, and 0 otherwise). We aim to learn wpe(geo, n), which is [x ≠ 0] · n + [x = 0] · (n + 1/p).
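The program geo can be run directly as a Python sketch, which also lets us sanity-check the claimed pre-expectation: starting from x = 0 the expected final value of n is n + 1/p. (The function below is our own transcription of the loop, not code from the paper.)

```python
import random

def geo(x, n, p, rng=random):
    # while x = 0: increment n, then draw x from a p-biased coin flip
    while x == 0:
        n += 1
        x = 1 if rng.random() < p else 0
    return x, n

# Monte Carlo check of wpe(geo, n)(x=0, n=0) = 0 + 1/p: with p = 0.5,
# the empirical mean of the final n should approach 2.
```

When the loop guard is false initially (x ≠ 0), the state is returned unchanged, matching the [x ≠ 0] · n summand of the pre-expectation.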

*Our Regression Model.* Before getting into how Exist collects data and trains models, we introduce the class of regression models it uses – *model trees*, a generalization of decision trees to regression tasks [34]. Model trees are naturally suited to expressing piecewise functions of inputs, and are straightforward to train. While our method can in theory generalize to other regression models, our implementation focuses on model trees.

More formally, a model tree $T \in \mathcal{T}$ over features $\mathcal{F}$ is a full binary tree where each internal node is labeled with a predicate φ over variables from $\mathcal{F}$, and each leaf is labeled with a real-valued model $M \in \mathcal{M} : \mathbb{R}^{\mathcal{F}} \to \mathbb{R}$. Given a feature vector $x \in \mathbb{R}^{\mathcal{F}}$, a model tree $T$ over $\mathcal{F}$ produces a numerical output $T(x) \in \mathbb{R}$ as follows:


Throughout this paper, we consider model trees of the following form as our regression model. First, node predicates φ are of the form f ⊲⊳ c, where f ∈ F is a feature, ⊲⊳ ∈ {<, ≤, =, >, ≥} is a comparison, and c is a numeric constant. Second, leaf models on a model tree are either all *linear models* or all products of constant powers of features, which we call *multiplication models*. For example, assuming n and 1/p are both features, Fig. 3b and c are two model trees with linear leaf models, and Fig. 3b expresses the weakest pre-expectation wpe(geo, n). Formally, the leaf model M on a feature vector f is either

$$M\_l(f) = \sum\_{i=1}^{|\mathcal{F}|} \alpha\_i \cdot f\_i \qquad \text{or} \qquad M\_m(f) = \prod\_{i=1}^{|\mathcal{F}|} f\_i^{\alpha\_i}$$

with constants $\{\alpha_i\}_i$. Note that multiplication models can also be viewed as linear models on logarithmic values of features because $\log M_m(f) = \sum_{i=1}^{|\mathcal{F}|} \alpha_i \cdot \log(f_i)$. While it is also straightforward to adapt our method to other leaf models, we focus on linear models and multiplication models because of their simplicity and expressiveness. Linear models and multiplication models also complement each other in their expressiveness: encoding expressions like x + y uses simpler features with linear models (it suffices if $\mathcal{F} \ni x, y$, as opposed to needing $\mathcal{F} \ni x + y$ if using multiplication models), while encoding p/(1 − p) uses simpler features with multiplication models (it suffices if $\mathcal{F} \ni p, 1 - p$, as opposed to needing $\mathcal{F} \ni p/(1 - p)$ if using linear models).
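To make the model-tree definition concrete, here is a minimal evaluator (our own sketch, not the paper's implementation): internal nodes test a predicate over the features and leaves evaluate a linear model. The tree `fig3b` encodes wpe(geo, n) = [x = 0] · (n + 1/p) + [x ≠ 0] · n, using n and 1/p as features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass
class Leaf:
    alphas: Dict[str, float]                   # linear model: sum_i alpha_i * f_i

    def __call__(self, feats):
        return sum(a * feats[f] for f, a in self.alphas.items())

@dataclass
class Node:
    pred: Callable[[Dict[str, float]], bool]   # predicate of the form f <op> c
    left: "Tree"                               # branch taken when pred holds
    right: "Tree"

    def __call__(self, feats):
        return self.left(feats) if self.pred(feats) else self.right(feats)

Tree = Union[Leaf, Node]

# Fig. 3b as a tree: split on [x = 0], linear leaves over features n and 1/p.
fig3b = Node(lambda f: f["x"] == 0,
             Leaf({"n": 1.0, "1/p": 1.0}),
             Leaf({"n": 1.0}))
```

For example, evaluating `fig3b` at x = 0, n = 3, p = 0.5 yields n + 1/p = 5, and at x = 1 it yields n = 3.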

# **4.1 Generate Features (getFeatures)**

Given a program, the algorithm first generates a set of features F that model trees can use to express unknown invariants of the given loop. For example, for geo, I = [x ≠ 0] · n + [x = 0] · (n + 1/p) is an invariant, and to have a model tree (with linear/multiplication leaf models) express I, we want F to include both n and 1/p, or n + 1/p as one feature. F should include the program variables at a minimum, but it is often useful to have more complex features too. While generating more features increases the expressivity of the models and the richness of the invariants, there is a cost: the more features in F, the more data is needed to train a model.

Starting from the program variables, getFeatures generates two lists of features, F_l for linear leaf models and F_m for multiplication leaf models. Intuitively, linear models are more expressive if the feature set F includes some products of terms, e.g., n · p^{−1}, and multiplication models are more expressive if F includes some sums of terms, e.g., n + 1.
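A possible getFeatures along these lines, built symbolically from the program variables, might look as follows. This is purely illustrative; the paper does not spell out its feature templates, so the particular products, reciprocals, and sums below are our assumptions.

```python
from itertools import combinations

def get_features(variables):
    # F_l: for linear leaves, augment variables with products and reciprocals
    feats_l = list(variables)
    feats_l += [f"{a}*{b}" for a, b in combinations(variables, 2)]
    feats_l += [f"1/{v}" for v in variables]
    # F_m: for multiplication leaves, augment variables with sums
    feats_m = list(variables)
    feats_m += [f"{a}+{b}" for a, b in combinations(variables, 2)]
    feats_m += [f"{v}+1" for v in variables]
    return feats_l, feats_m
```

For geo's variables n and p, the linear list would include 1/p (enough to express the invariant above), while the multiplication list would include n + 1.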

# **4.2 Sample Initial States (sampleStates)**

Recall that Exist aims to learn an expectation I that is equal to the weakest pre-expectation wpe(**while** G : P, postE). A natural idea for sampleTraces is to run the program from all possible initializations multiple times, and record the average value of postE from each initialization. This would give a map close to wpe(**while** G : P, postE) if we run enough trials so that the empirical mean is approximately the actual mean. However, this strategy is clearly impractical: many of the programs we consider have infinitely many possible initial states (e.g., programs with integer variables). Thus, sampleStates needs to choose a manageable number of initial states for sampleTraces to use.

In principle, a good choice of initializations should exercise as many parts of the program as possible. For instance, for geo in Fig. 3, if we only try initial states satisfying x ≠ 0, then it is impossible to learn the term [x = 0] · (n + 1/p) in wpe(geo, n) from data. However, covering the control flow graph may not be enough. Ideally, to learn how the expected value of postE depends on the initial state, we also want data from multiple initial states along each path.

While it is unclear how to choose initializations to ensure optimal coverage, our implementation uses a simpler strategy: sampleStates generates N*states* states in total, each by sampling the value of every program variable uniformly at random from a space. We assume program variables are typed as booleans, integers, probabilities, or floating point numbers and sample variables of some type from the corresponding space. For boolean variables, the sampling space is simply {0, 1}; for probability variables, the space includes reals in some interval bounded away from 0 and 1, because probabilities too close to 0 or 1 tend to increase the variance of programs (e.g., making some loops iterate for a very long time); for floating point number and integer variables, the spaces are respectively reals and integers in some bounded range. This strategy, while simple, is already very effective in nearly all of our benchmarks (see Sect. 6), though other strategies are certainly possible (e.g., performing a grid search of initial states from some space).
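This per-type sampling strategy can be sketched as follows; the concrete ranges are our assumptions (the text only requires probabilities to be bounded away from 0 and 1, and bounded ranges for the numeric types).

```python
import random

def sample_states(var_types, n_states, rng=random):
    """Draw n_states initial states, sampling each variable uniformly from
    a space determined by its type."""
    spaces = {
        "bool":  lambda: rng.randint(0, 1),
        "prob":  lambda: rng.uniform(0.1, 0.9),   # bounded away from 0 and 1
        "int":   lambda: rng.randint(-10, 10),
        "float": lambda: rng.uniform(-10.0, 10.0),
    }
    return [{v: spaces[t]() for v, t in var_types.items()}
            for _ in range(n_states)]
```

For geo, `sample_states({"x": "bool", "n": "int", "p": "prob"}, n_states)` produces initial states covering both the x = 0 and x ≠ 0 branches.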

# **4.3 Sample Training Data (sampleTraces)**

We gather training data by running the given program geo on the set of initializations generated by sampleStates. From each program state s ∈ *states*, the subroutine sampleTraces runs geo N*runs* times to get output states {s_1, ..., s_{N*runs*}} and produces the following training example:

$$(s, v\_s) = \left( s, \frac{1}{N\_{runs}} \sum\_{j=1}^{N\_{runs}} \mathsf{postE}(s\_j) \right).$$

Above, the value v_s is the empirical mean of postE over the output states of running geo from initial state s; as N*runs* grows large, this average approaches the true expected value wpe(geo, postE)(s).

# **4.4 Learning a Model Tree (learnInv)**

Now that we have the training set *data* = {(s_1, v_1), ..., (s_K, v_K)} (where K = N*states*), we want to fit a model tree T to the data. We aim to apply off-the-shelf tools that can learn model trees with customizable leaf models and loss. For each data entry, v_i approximates wpe(geo, postE)(s_i), so a natural idea is to train a model tree T that takes the value of features on s_i as input and predicts v_i. To achieve that, we want to define the loss to measure the error between the predicted values T(F_l(s_i)) (or T(F_m(s_i))) and the target value v_i. Without loss of generality, we can assume our invariant I is of the form

$$I = \mathsf{postE} + [G] \cdot I' \tag{4}$$

because I being an invariant means

$$I = [\neg G] \cdot \mathsf{postE} + [G] \cdot \mathsf{wpe}(P, I) = \mathsf{postE} + [G] \cdot (\mathsf{wpe}(P, I) - \mathsf{postE}).$$

In many cases, the expectation I′ := wpe(P, I) − postE is simpler than I: for example, the weakest pre-expectation of geo can be expressed as n + [x = 0] · (1/p); while I is represented by a tree that splits on the predicate [x = 0] and needs both n and 1/p as features, the expectation I′ = 1/p is represented by a single-leaf model tree that only needs p as a feature.

Aiming to learn weakest pre-expectations I in the form of Eq. (4), Exist trains model trees T to fit I′. More precisely, learnInv trains a model tree T_l with linear leaf models over features F_l by minimizing the loss

$$err\_l(T\_l, data) = \left(\sum\_{i=1}^{K} \left(\mathsf{postE}(s\_i) + G(s\_i) \cdot T\_l(\mathcal{F}\_l(s\_i)) - v\_i\right)^2\right)^{1/2},\tag{5}$$

where postE(s_i) and G(s_i) represent the values of the expectations postE and [G] evaluated on the state s_i. This loss measures the total error between the prediction postE(s_i) + G(s_i) · T_l(F_l(s_i)) and the target v_i. Note that when the guard G is false on an initial state s_i, the example contributes zero to the loss because postE(s_i) + G(s_i) · T_l(F_l(s_i)) = postE(s_i) = v_i; thus, we only need to generate and collect trace data for initial states where the guard G is true.

Analogously, learnInv trains a model tree T_m with multiplication leaf models over features F_m to minimize the loss err_m(T_m, data), which is the same as err_l(T_l, data) except that T_l(F_l(s_i)) is replaced by T_m(F_m(s_i)) for each i.
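The loss of Eq. (5), written out for an arbitrary tree, can be sketched as follows (an illustrative transcription; `guard` and `post_e` evaluate [G] and postE on a state, and `feat_vec` evaluates the features):

```python
import math

def err(tree, data, post_e, guard, feat_vec):
    """Eq. (5): root of the summed squared error between the prediction
    postE(s) + [G](s) * T(F(s)) and the empirical target v."""
    total = 0.0
    for s, v in data:
        pred = post_e(s) + (1.0 if guard(s) else 0.0) * tree(feat_vec(s))
        total += (pred - v) ** 2
    return math.sqrt(total)
```

A standard model-tree learner can minimize this objective leaf by leaf, since for a fixed tree structure it is an ordinary least-squares problem in the leaf coefficients.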

# **4.5 Extracting Expectations from Models (extractInv)**

Given the learned model trees T_l and T_m, we extract expectations that approximate wpe(geo, postE) in three steps:


# **4.6 Verify Extracted Expectations (verifyInv)**

Recall that geo is a loop **while** G : P, and given a set of candidate invariants *invs*, we want to check if any *inv* ∈ *invs* is a loop invariant, i.e., if *inv* satisfies

$$inv = [\neg G] \cdot \mathbf{postE} + [G] \cdot \mathbf{wpe}(P, inv). \tag{6}$$

Since the learned model might not predict the expected value for every data point exactly, we must verify whether inv satisfies this equality using verifyInv. If not, verifyInv looks for counterexamples that maximize the violation in order to drive the learning process forward in the next iteration. Formally, for every *inv* ∈ *invs*, verifyInv queries computer algebra systems to find a set of program states S such that S includes states maximizing the absolute difference of two sides in Eq. (6):

$$S \ni \mathbf{argmax}\_s \left| inv(s) - \left( [\neg G] \cdot \mathsf{postE} + [G] \cdot \mathsf{wpe}(P, inv) \right)(s) \right|.$$

If there is no program state where the absolute difference is non-zero, verifyInv returns *inv* as a true invariant. Otherwise, the maximizing states in S are added to the list of counterexamples cex; if no candidate in *invs* is verified, verifyInv returns False and the accumulated list of counterexamples cex. The next iteration of the CEGIS loop will sample program traces starting from these counterexample initial states, hopefully leading to a learned model with less error.
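The paper's verifier poses this maximization to a computer algebra system; purely as an illustration, the same search can be mimicked numerically by scanning candidate states for the largest violations of Eq. (6). The function below is a hypothetical stand-in, not the tool's implementation:

```python
def find_counterexamples(inv, phi, grid, tol=1e-6, top_k=3):
    """Scan candidate states for the largest violations |inv(s) - phi(s)|
    of Eq. (6), where phi evaluates the right-hand side.  Returns up to
    top_k maximizing states, or [] when no violation exceeds tol."""
    scored = sorted(((abs(inv(s) - phi(s)), s) for s in grid), key=lambda t: -t[0])
    return [s for gap, s in scored[:top_k] if gap > tol]
```

An empty result plays the role of verifyInv accepting inv; a non-empty result seeds cex for the next CEGIS iteration.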

# **5 Learning Sub-invariants**

Next, we instantiate Exist for our second problem: learning sub-invariants. Given a program geo = **while** G : P and a pair of pre- and post-expectations (preE, postE), we want to find an expectation I such that preE ≤ I, and

$$I \le \Phi^{\mathsf{wpe}}\_{\mathsf{postE}}(I) := [\neg G] \cdot \mathsf{postE} + [G] \cdot \mathsf{wpe}(P, I).$$

Intuitively, Φ^wpe_postE(I) computes the expected value of the expectation I after one iteration of the loop. We want to train a model M that translates to an expectation I whose expected value does not decrease across iterations, and preE ≤ I.

The high-level plan is the same as for learning exact invariants: we train a model to minimize a loss defined to capture the sub-invariant requirements. We generate features F and sample initializations *states* as before. Then, from each s ∈ *states*, we repeatedly run just the loop body P and record the set of output states in data; this departs from our method for exact invariants, which repeatedly runs the entire loop to completion. Given this trace data, for any program state s ∈ *states* and expectation I, we can compute the empirical mean of I's value after running the loop body P on state s. Thus, we can approximate wpe(P, I)(s) for s ∈ *states* and use this estimate to approximate Φ^wpe_postE(I)(s). We then define a loss that sums up the violations of I ≤ Φ^wpe_postE(I) and preE ≤ I on states s ∈ *states*, estimated from the collected data.

The main challenge for our approach is that existing model tree learning algorithms do not support our loss function. Roughly speaking, model tree learners typically assume that a node's two child subtrees can be learned separately; this holds for the loss we used for exact invariants, but *not* for the sub-invariant loss.

To solve this challenge, we first broaden the class of models to neural networks. To produce sub-invariants that can be verified, we still want to learn simple classes of models, such as piecewise functions of numerical expressions. Accordingly, we work with a class of neural architectures that can be translated into model trees, *neural model trees*, adapted from neural decision trees developed by Yang et al. [41]. We defer the technical details of neural model trees to the extended version, but for now, we can treat them as differentiable approximations of standard model trees; since they are differentiable they can be learned with gradient descent, which can support the sub-invariant loss function.
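As a sketch of the idea (our own simplification, loosely following the soft-gating construction behind neural decision trees), a depth-1 neural model tree replaces a hard split with a sigmoid gate that blends two linear leaf models; lowering the gate's temperature recovers an ordinary model tree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_model_tree(x, params, temperature=0.1):
    """Differentiable depth-1 'model tree': a sigmoid gate stands in for the
    hard split w . x <= c, smoothly blending two linear leaf models.  As
    temperature -> 0 the gate sharpens and the function approaches an
    ordinary model tree."""
    w_split, c, wL, bL, wR, bR = params  # split weights/threshold, leaf coefficients
    gate = sigmoid((np.dot(w_split, x) - c) / temperature)  # ~0 left leaf, ~1 right leaf
    return (1.0 - gate) * (np.dot(wL, x) + bL) + gate * (np.dot(wR, x) + bR)
```

Because the output is smooth in all parameters, it can be trained by gradient descent on losses that a recursive tree learner cannot split across subtrees.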

*Outline.* We will discuss changes in sampleTraces, learnInv and verifyInv for learning sub-invariants but omit descriptions of getFeatures, sampleStates, extractInv because Exist generates features, samples initial states and extracts expectations in the same way as in Sect. 4. To simplify the exposition, we will assume getFeatures generates the same set of features F = F_l = F_m for model trees with linear models and model trees with multiplication models.

# **5.1 Sample Training Data (sampleTraces)**

Unlike when sampling data for learning exact invariants, here, sampleTraces runs only one iteration of the given program geo = **while** G : P, that is, just P, instead of running the whole loop. Intuitively, this difference in data collection is because we aim to directly handle the sub-invariant condition, which encodes a single iteration of the loop. For exact invariants, our approach proceeded indirectly by learning the expected value of postE after running the loop to termination.

From any initialization s_i ∈ *states* such that G holds on s_i, sampleTraces runs the loop body P for N_runs trials, each time restarting from s_i, and records the set of output states reached. If executing P from s_i leads to output states {s_i1, ..., s_iN_runs}, then sampleTraces produces the training example:

$$(s\_i, S\_i) = \left(s\_i, \{s\_{i1}, \dots, s\_{iN\_{runs}}\} \right).$$

For initializations s_i ∈ *states* such that G is false on s_i, sampleTraces simply produces (s_i, S_i) = (s_i, ∅) since the loop body is not executed.
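A minimal sketch of this data-collection step, with `loop_body` and `guard` as hypothetical callables supplied by the harness:

```python
import random

def sample_traces_subinv(loop_body, guard, states, n_runs=500, seed=0):
    """From each initial state satisfying the guard, run the loop body once
    per trial, n_runs times, and record the resulting output states; initial
    states violating the guard are paired with the empty collection."""
    rng = random.Random(seed)
    data = []
    for s in states:
        outs = [loop_body(s, rng) for _ in range(n_runs)] if guard(s) else []
        data.append((s, outs))
    return data
```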

# **5.2 Learning a Neural Model Tree (learnInv)**

Given the dataset data = {(s_1, S_1), ..., (s_K, S_K)} (with K = N_states), we want to learn an expectation I such that preE ≤ I and I ≤ Φ^wpe_postE(I). By case analysis on the guard G, the requirement I ≤ Φ^wpe_postE(I) can be split into two constraints:

$$[G] \cdot I \le [G] \cdot \mathsf{wpe}(P, I) \qquad \text{and} \qquad [\neg G] \cdot I \le [\neg G] \cdot \mathsf{postE}.$$

If I = postE + [G] · I′, then the second requirement reduces to [¬G] · postE ≤ [¬G] · postE and is always satisfied. So to simplify the loss and training process, we again aim to learn an expectation I of the form postE + [G] · I′. Thus, we want to train a model tree T such that T translates into an expectation I′, and

$$\mathsf{preE} \le \mathsf{postE} + [G] \cdot I' \tag{7}$$

$$[G] \cdot (\mathsf{postE} + [G] \cdot I') \le [G] \cdot \mathsf{wpe}(P, \mathsf{postE} + [G] \cdot I') \tag{8}$$

Then, we define the loss of model tree T on data to be

$$err(T, data) := err\_1(T, data) + err\_2(T, data),$$

where err_1(T, data) captures Eq. (7) and err_2(T, data) captures Eq. (8).

Defining err_1 is relatively simple: we sum up the one-sided difference between preE(s) and postE(s) + G(s) · T(F(s)) across s ∈ *states*, where T is the model tree being trained and F(s) is the feature vector F evaluated on s. That is,

$$err\_1(T, data) := \sum\_{i=1}^{K} \max\left(0, \mathsf{preE}(s\_i) - \mathsf{postE}(s\_i) - G(s\_i) \cdot T(\mathcal{F}(s\_i))\right). \tag{9}$$

Above, preE(s_i), postE(s_i), and G(s_i) are the values of the expectations preE, postE, and G evaluated on the program state s_i.
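Eq. (9) is a hinge-style penalty; a direct numpy transcription, with hypothetical callables standing in for preE, postE, G, and the feature map, looks like:

```python
import numpy as np

def err_1(T, feats, pre_e, post_e, guard, states):
    """Eq. (9): one-sided (hinge) penalty on states where the candidate
    postE(s) + G(s) * T(F(s)) falls below the pre-expectation preE(s)."""
    gaps = [pre_e(s) - post_e(s) - guard(s) * T(feats(s)) for s in states]
    return float(np.sum(np.maximum(0.0, gaps)))
```

States where the candidate already dominates preE contribute nothing, so only violations of Eq. (7) are penalized.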

The term err_2 is more involved. Similar to err_1, we aim to sum up the one-sided difference between the two sides of Eq. (8) across states s ∈ *states*. On a program state s that does not satisfy G, both sides are 0; for s that satisfies G, we want to evaluate wpe(P, postE + [G] · I′) on s, but we do not have exact access to wpe(P, postE + [G] · I′) and need to approximate its value on s from sampled program traces. Recall that wpe(P, I)(s) is the expected value of I after running program P from s, and our dataset contains training examples (s_i, S_i) where S_i is a set of states reached after running P on an initial state s_i satisfying G. Thus, we can approximate [G] · wpe(P, postE + [G] · I′)(s_i) by

$$G(s\_i) \cdot \frac{1}{|S\_i|} \cdot \sum\_{s \in S\_i} \left( \mathsf{postE}(s) + G(s) \cdot I'(s) \right).$$

To avoid division by zero when s_i does not satisfy G and S_i is empty, we evaluate the expression in a short-circuit manner: when G(s_i) = 0, the whole expression immediately evaluates to zero.

Therefore, we define

$$\begin{aligned} \operatorname{err}\_2(T, data) &= \sum\_{i=1}^K \max\left(0, G(s\_i) \cdot \mathsf{postE}(s\_i) + G(s\_i) \cdot T(\mathcal{F}(s\_i)) \\ &\quad - G(s\_i) \cdot \frac{1}{|S\_i|} \cdot \sum\_{s \in S\_i} \left(\mathsf{postE}(s) + G(s) \cdot T(\mathcal{F}(s))\right)\right). \end{aligned}$$

Standard model tree learning algorithms do not support this kind of loss function, and since our overall loss err(T, data) is the sum of err_1(T, data) and err_2(T, data), we cannot use standard model tree learning algorithms to optimize err(T, data) either. Fortunately, gradient descent does support this loss function. While gradient descent cannot directly learn model trees, we can use it to train a *neural* model tree T to minimize err(T, data). The learned neural networks can be converted to model trees, and then to expectations as before. (See discussion in the extended version.)
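To illustrate why gradient descent suffices where recursive tree learners do not, the sketch below minimizes err_1 + err_2 for a linear stand-in I′(s) = θ · F(s), using finite-difference gradients; the actual tool instead trains a neural model tree with tensorflow, so everything here is a simplified assumption:

```python
import numpy as np

def total_loss(theta, data, feats, pre_e, post_e, guard):
    """err_1 + err_2 for a linear stand-in I'(s) = theta . F(s); data holds
    the (s_i, S_i) pairs produced by sampleTraces."""
    T = lambda s: float(np.dot(theta, feats(s)))
    loss = 0.0
    for s, outs in data:
        g = guard(s)
        loss += max(0.0, pre_e(s) - post_e(s) - g * T(s))  # err_1 term
        if g and outs:  # err_2 term, short-circuited when the guard is false
            emp = np.mean([post_e(t) + guard(t) * T(t) for t in outs])
            loss += max(0.0, g * (post_e(s) + T(s)) - g * emp)
    return loss

def train(data, feats, dim, pre_e, post_e, guard, lr=0.05, steps=200, eps=1e-4):
    """Plain gradient descent with finite-difference gradients."""
    theta = np.zeros(dim)
    for _ in range(steps):
        base = total_loss(theta, data, feats, pre_e, post_e, guard)
        grad = np.zeros(dim)
        for j in range(dim):
            bumped = theta.copy()
            bumped[j] += eps
            grad[j] = (total_loss(bumped, data, feats, pre_e, post_e, guard) - base) / eps
        theta -= lr * grad
    return theta
```

The loss couples every state's penalty through the shared parameters θ, which is exactly the coupling that prevents splitting the objective across subtrees.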

# **5.3 Verify Extracted Expectations (verifyInv)**

The verifier verifyInv is very similar to the one in Sect. 4 except here it solves a different optimization problem. For each candidate inv in the given list invs, it looks for a set S of program states such that S includes

$$\mathbf{argmax}\_s \left( \mathsf{preE}(s) - inv(s) \right) \qquad \text{and} \qquad \mathbf{argmax}\_s \left( G(s) \cdot inv(s) - [G] \cdot \mathsf{wpe}(P, inv)(s) \right).$$

As in our approach for exact invariant learning, the verifier aims to find counterexample states s that violate at least one of these constraints by as large a margin as possible; these high-quality counterexamples guide data collection in the following iteration of the CEGIS loop. Concretely, the verifier accepts inv if it cannot find any program state s where preE(s) − inv(s) or G(s) · inv(s) − [G] · wpe(P, inv)(s) is positive. Otherwise, it adds all states s ∈ S with strictly positive margin to the set of counterexamples cex.

# **6 Evaluations**

We implemented our prototype in Python, using sklearn and tensorflow to fit model trees and neural model trees, and Wolfram Alpha to verify candidates and generate counterexamples. We have evaluated our tool on a set of 18 benchmarks drawn from different sources in prior work [14,21,24]. Our experiments were designed to address the following research questions:

**R1.** Can Exist synthesize exact invariants for a variety of programs?

**R2.** Can Exist synthesize sub-invariants for a variety of programs?

We summarize our findings as follows:


We present in the extended version the tables of complete experimental results. Because the training data we collect are inherently stochastic, the results produced by our tool are not deterministic.<sup>1</sup> As expected, sometimes different trials on the same benchmarks generate different sub-invariants; while the exact invariant for each benchmark is unique, Exist may also generate semantically equivalent but syntactically different expectations in different trials (e.g. it happens for BiasDir).

<sup>1</sup> The code and data sampled in the trial that produced the tables in this paper can be found at https://github.com/JialuJialu/Exist.


**Table 1.** Exact Invariants generated by Exist

**Table 2.** Sub-invariants generated by Exist


*Implementation Details.* For input parameters to Exist, we use N_runs = 500 and N_states = 500. Besides the input parameters listed in Fig. 2, we allow the user to supply a list of features as an optional input. In feature generation, getFeatures enumerates expressions made up of program variables and user-supplied features according to a grammar. Also, when incorporating counterexamples cex, we make 30 copies of each counterexample to give them more weight in training. All experiments were conducted on a MacBook Pro 2020 with M1 chip running macOS Monterey Version 12.1.

#### **6.1 R1: Evaluation of the Exact Invariant Method**

*Efficacy of Invariant Inference.* Exist was able to infer provably correct invariants for 14/18 benchmarks. Out of the 14 successful benchmarks, only 2 need user-supplied features (n · p for Bin2 and Sum0). Table 1 shows the post-expectation (postE), the inferred invariant (Learned Invariant), sampling time (ST), learning time (LT), verification time (VT), and total time (TT) for a few benchmarks. For generating exact invariants, the running time of Exist is dominated by the sampling time. However, this phase can be parallelized easily.

*Failure Analysis.* Exist failed to generate invariants for 4/18 benchmarks. For two of them (DepRV and LinExp), Exist was able to generate expectations that are very close to an invariant; for the third failing benchmark (Duel), the ground truth invariant is very complicated. For LinExp, while a correct invariant is z + [n > 0] · 2.625 · n, Exist generates candidates like z + [n > 0] · (2.63 · n − 0.02). For DepRV, a correct invariant is x · y + [n > 0] · (0.25 · n² + 0.5 · n · x + 0.5 · n · y − 0.25 · n), and in our experiment Exist generates 0.25 · n² + 0.5 · n · x + 0.5 · n · y − 0.27 · n − 0.01 · x + 0.12. In both cases, the ground truth invariants use coefficients with several digits, and since learning from data is inherently stochastic, Exist cannot generate them consistently. In our experiments, we observe that our CEGIS loop does guide the learner closer to the correct invariant in general, but progress made over multiple iterations can sometimes be offset by noise in a single iteration. For GeoAr, we observe that the verifier incorrectly accepted the complicated candidate invariants generated by the learner because Wolfram Alpha was not able to find valid counterexamples for our queries.

*Comparison with Previous Work.* There are few existing tools that can automatically compute expected values after probabilistic loops. We experimented with one such tool, called Mora [7]. (See high-level comparison in Sect. 7.) We managed to encode our benchmarks Geo0, Bin0, Bin2, Geo1, GeoAr, and Mart in their syntax. Among them, Mora fails to infer an invariant for Geo1, GeoAr, and Mart. We also tried to encode our benchmarks Fair, Gambler, Bin1, and RevBin but found Mora's syntax was too restrictive to encode them.

#### **6.2 R2: Evaluation of the Sub-invariant Method**

*Efficacy of Invariant Inference.* Exist is able to synthesize sub-invariants for 27/34 benchmarks. As before, Table 2 reports the results for a few benchmarks. Two of the 27 successful benchmarks use user-supplied features: Gambler with pre-expectation x · (y − x) uses (y − x), and Sum0 with pre-expectation x + [x > 0] · (p · n/2) uses p · n. Contrary to the case for exact invariants, the learning time dominates. This is not surprising: the sampling time is shorter because we only run one iteration of the loop, but the learning time is longer as we are optimizing a more complicated loss function.

One interesting observation from gathering benchmarks is that for many loops, the pre-expectations used by prior work, or other natural choices of pre-expectation, are themselves sub-invariants. Thus, for some instances, the sub-invariant generated by Exist is the same as the pre-expectation preE given to it as input. However, Exist does not simply check whether the given preE is a sub-invariant: the learner in Exist knows nothing about preE beyond its value evaluated on program states. We also designed benchmarks where the pre-expectations are *not* sub-invariants (BiasDir with preE = [x = y] · x, DepRV with preE = x · y + [n > 0] · 1/4 · n², Gambler with preE = x · (y − x), Geo0 with preE = [flip == 0] · (1 − p_1)), and Exist is able to generate sub-invariants for 3/4 of these benchmarks.

*Failure Analysis.* On program instances where Exist fails to generate a sub-invariant, we observe two common causes. First, gradient descent seems to get stuck in local minima, as the learner returns suboptimal models with relatively low loss. The loss we train on is very complicated and likely highly non-convex, so this is not surprising. Second, we observed inconsistent behavior due to noise in data collection and learning. For instance, for GeoAr with preE = x + [z = 0] · y · (1 − p)/p, Exist could sometimes find a sub-invariant with the supplied feature (1 − p), but we could not achieve this result consistently.

*Comparison with Learning Exact Invariants.* The performance of Exist on learning sub-invariants is less sensitive to the complexity of the ground truth invariants. For example, Exist is not able to generate an exact invariant for LinExp, as its exact invariant is complicated, but it is able to generate sub-invariants for LinExp. However, we also observe that when learning sub-invariants, Exist returns complicated expectations with high loss more often.

### **7 Related Work**

*Invariant Generation for Probabilistic Programs.* There has been a steady line of work on probabilistic invariant generation over the last few years. The Prinsys system [21] employs a template-based approach to guide the search for probabilistic invariants. Prinsys is able to encode invariants with guard expressions, but the system does not produce invariants directly; instead, Prinsys produces logical formulas encoding the invariant conditions, which must be solved manually.

Chen et al. [14] proposed a counterexample-guided approach to find polynomial invariants by applying Lagrange interpolation. Unlike Prinsys, this approach does not need templates; however, invariants involving guard expressions (common in our examples) cannot be found, since they are not polynomials. Additionally, Chen et al. [14] use a weaker notion of invariant, which only needs to be correct on certain initial states; our tool generates invariants that are correct on all initial states. Feng et al. [18] improve on Chen et al. [14] by using *Stengle's Positivstellensatz* to encode invariant constraints as a semidefinite programming problem. Their method can find polynomial sub-invariants that are correct on all initial states. However, their approach cannot synthesize piecewise linear invariants, and their implementation has additional limitations and could not be run on our benchmarks.

There is also a line of work on abstract interpretation for analyzing probabilistic programs; Chakarov and Sankaranarayanan [11] search for linear expectation invariants using a "pre-expectation closed cone domain", while recent work by Wang et al. [40] employs a sophisticated algebraic program analysis approach.

Another line of work applies *martingales* to derive insights about probabilistic programs. Chakarov and Sankaranarayanan [10] showed several applications of martingales in program analysis, and Barthe et al. [5] gave a procedure to generate candidate martingales for a probabilistic program; however, their tool gives no control over which expected value is analyzed—the user can only guess initial expressions and the tool generates valid bounds, which may not be interesting. Our tool allows the user to pick which expected value they want to bound.

Another line of work for automated reasoning uses *moment-based analysis*. Bartocci et al. [6,7] develop the Mora tool, which can find the moments of variables as functions of the loop iteration, for loops that run forever, by using ideas from computational algebraic geometry and dynamical systems. This method is highly efficient and is guaranteed to compute moments exactly. However, there are two limitations. First, the moments can give useful insights about the distribution of variables' values after each iteration, but they are fundamentally different from our notion of invariants, which allows us to compute the expected value of any given expression *after termination* of a loop. Second, there are important restrictions on the probabilistic programs: for instance, conditional statements are not allowed and the use of symbolic inputs is limited. As a result, most of our benchmarks cannot be handled by Mora.

In a similar vein, Kura et al. [27,39] bound higher *central moments* for running time and other monotonically increasing quantities. Like our work, these works consider probabilistic loops that terminate. However, unlike our work, they are limited to programs with constant size increments.

*Data-Driven Invariant Synthesis.* We are not aware of other data-driven methods for learning probabilistic invariants, but recent work by Abate et al. [1] proves probabilistic termination by learning ranking supermartingales from trace data. Our method for learning sub-invariants (Sect. 5) can be seen as a natural generalization of their approach. However, there are also important differences. First, we are able to learn general sub-invariants, not just ranking supermartingales for proving termination. Second, our approach aims to learn model trees, which lead to simpler and more interpretable sub-invariants. In contrast, Abate et al. [1] learn ranking functions encoded as two-layer neural networks.

Data-driven inference of invariants for deterministic programs has drawn a lot of attention, starting from Daikon [17]. ICE learning with decision trees [20] modifies the decision tree learning algorithm to capture *implication counterexamples* to handle inductiveness. Hanoi [32] uses counterexample-based inductive synthesis (CEGIS) [38] to build a data-driven invariant inference engine that alternates between weakening and strengthening candidates for synthesis. Recent work uses neural networks to learn invariants [36]. These systems perform classification, while our work uses regression. Data from fuzzing has been used for *almost correct* inductive invariants [29] for programs with closed-box operations.

*Probabilistic Reasoning with Pre-expectations.* Following Morgan and McIver, there are now pre-expectation calculi for domain-specific properties, like expected runtime [23] and probabilistic sensitivity [2]. All of these systems define the pre-expectation for loops as a least fixed-point, and practical reasoning about loops requires finding an invariant of some kind.

**Acknowledgements.** This work is in part supported by National Science Foundation grant #1943130 and #2152831. We thank Ugo Dal Lago, I¸sil Dillig, IITK PRAISE group, Cornell PL group, and all reviewers for helpful feedback. We also thank Anmol Gupta in IITK for building a prototype verifier using Mathematica.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Sound and Complete Certificates for Quantitative Termination Analysis of Probabilistic Programs

Krishnendu Chatterjee<sup>1</sup>, Amir Kafshdar Goharshady2(B) , Tobias Meggendorfer<sup>1</sup>, and Ðorđe Žikelić<sup>1</sup>

<sup>1</sup> Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria {krishnendu.chatterjee,tobias.meggendorfer,djordje.zikelic}@ist.ac.at <sup>2</sup> The Hong Kong University of Science and Technology (HKUST), Hong Kong, China goharshady@cse.ust.hk

Abstract. We consider the quantitative problem of obtaining lower bounds on the probability of termination of a given non-deterministic probabilistic program. Specifically, given a non-termination threshold *p* ∈ [0, 1], we aim for certificates proving that the program terminates with probability at least 1 − *p*. The basic idea of our approach is to find a terminating stochastic invariant, i.e. a subset *SI* of program states such that (i) the probability of the program ever leaving *SI* is no more than *p*, and (ii) almost-surely, the program either leaves *SI* or terminates.

While stochastic invariants are already well-known, we provide the first proof that the idea above is not only sound, but also complete for quantitative termination analysis. We then introduce a novel sound and complete characterization of stochastic invariants that enables template-based approaches for easy synthesis of quantitative termination certificates, especially in affine or polynomial forms. Finally, by combining this idea with the existing martingale-based methods that are relatively complete for *qualitative* termination analysis, we obtain the first automated, sound, and relatively complete algorithm for *quantitative* termination analysis. Notably, our completeness guarantees for quantitative termination analysis are as strong as the best-known methods for the qualitative variant.

Our prototype implementation demonstrates the effectiveness of our approach on various probabilistic programs. We also demonstrate that our algorithm certifies lower bounds on termination probability for probabilistic programs that are beyond the reach of previous methods.

### 1 Introduction

Probabilistic Programs. Probabilistic programs extend classical imperative programs with randomization. They provide an expressive framework for specifying probabilistic models and have been used in machine learning [22,39], network

A longer version, including appendices, is available at [12]. Authors are ordered alphabetically.

c The Author(s) 2022 S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 55–78, 2022. https://doi.org/10.1007/978-3-031-13185-1\_4

analysis [20], robotics [41] and security [4]. Recent years have seen the development of many probabilistic programming languages such as Church [23] and Pyro [6], and their formal analysis is an active topic of research. Probabilistic programs are often extended with non-determinism to allow for either unknown user inputs and interactions with environment or abstraction of parts that are too complex for formal analysis [31].

Termination. Termination has attracted the most attention in the literature on formal analysis of probabilistic programs. In non-probabilistic programs, it is a purely qualitative property. In probabilistic programs, it has various extensions:


Previous Qualitative Works. There are many approaches to prove a.s. termination based on weakest pre-expectation calculus [27,31,37], abstract interpretation [34], type systems [5] and martingales [7,9,11,14,25,26,32,35]. This work is closest in spirit to martingale-based approaches. The central concept in these approaches is that of a *ranking supermartingale (RSM)* [7], which is a probabilistic extension of ranking functions. RSMs are a sound and complete proof rule for finite termination [21], which is a stricter notion than a.s. termination. The work of [32] proposed a variant of RSMs that can prove a.s. termination even for programs whose expected runtime is infinite, and lexicographic RSMs were studied in [1,13]. A main advantage of martingale-based approaches is that they can be fully automated for programs with affine/polynomial arithmetic [9,11].

Previous Quantitative Works. Quantitative analyses of probabilistic programs are often more challenging. There are only a few works that study the quantitative termination problem: [5,14,40]. The works [14,40] propose martingale-based proof rules for computing lower-bounds on termination probability, while [5] considers functional probabilistic programs and proposes a type system that allows incrementally searching for type derivations to accumulate a lower-bound on termination probability. See Sect. 8 for a detailed comparison.

Lack of Completeness. While [5,14,40] all propose sound methods to compute lower-bounds on termination probability, none of them is theoretically complete, nor do their algorithms provide relative completeness guarantees. This naturally leaves open whether one can define a complete certificate for proving termination with probability at least 1 − p ∈ [0, 1], i.e. a certificate that a probabilistic program admits if and only if it terminates with probability at least 1 − p, and which allows for automated synthesis. Ideally, such a certificate should also be synthesized automatically by an algorithm with relative completeness guarantees, i.e. an algorithm which is guaranteed to compute such a certificate for a sufficiently general subclass of programs. Note that since the problem of deciding whether a probabilistic program terminates with probability at least 1 − p is undecidable, one cannot hope for a general complete algorithm, so the best one can hope for is relative completeness.

Our Approach. We present the first method for the probabilistic termination problem that is complete. Our approach builds on that of [14] and uses stochastic invariants in combination with a.s. reachability certificates to compute lower-bounds on the termination probability. A *stochastic invariant* [14] is a tuple (*SI*, p) consisting of a set *SI* of program states and an upper-bound p on the probability of a random program run ever leaving *SI*. If one computes a stochastic invariant (*SI*, p) with the additional property that a random program run would, with probability 1, either terminate or leave *SI*, then since *SI* is left with probability at most p, the program must terminate with probability at least 1 − p. Hence, the combination of stochastic invariants and a.s. reachability certificates provides a sound approach to the probabilistic termination problem.

While this idea was originally proposed in [14], our method for computing stochastic invariants is fundamentally different and leads to completeness. In [14], a stochastic invariant is computed indirectly by computing the set *SI* together with a *repulsing supermartingale (RepSM)*, which can then be used to compute a probability threshold p for which (*SI*, p) is a stochastic invariant. It was shown in [40, Section 3] that RepSMs are incomplete for computing stochastic invariants. Moreover, even if a RepSM exists, the resulting probability bound need not be tight, and the method of [14] does not allow optimizing the computed bound or guiding computation towards a bound that exceeds some specified probability threshold.

In this work, we propose a novel and orthogonal approach that computes the stochastic invariant and the a.s. termination certificate at the same time and is provably complete for certifying a specified lower bound on termination probability. First, we show that stochastic invariants can be characterized through the novel notion of *stochastic invariant indicators (SI-indicators)*. The characterization is both sound and complete. Furthermore, it allows fully automated computation of stochastic invariants for programs using affine or polynomial arithmetic via a template-based approach that reduces quantitative termination analysis to constraint solving. Second, we prove that stochastic invariants together with an a.s. reachability certificate, when synthesized in tandem, are not only *sound* for probabilistic termination, but also *complete*. Finally, we present the first *relatively complete algorithm* for probabilistic termination. Our algorithm considers polynomial probabilistic programs and *simultaneously* computes a stochastic invariant and an a.s. reachability certificate in the form of an RSM using a template-based approach. Our algorithmic approach is relatively complete.

While we focus on the probabilistic termination problem in which the goal is to *verify* a given lower bound 1 − p on the termination probability, we note that our method may be straightforwardly adapted to *compute* a lower bound on the termination probability. In particular, we may perform a binary search on p, searching for the smallest value of p for which 1 − p can be verified to be a lower bound on the termination probability.
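As a sketch, the binary search can be driven by any verification oracle for the LBPT problem. Here `verify_lbpt` is a hypothetical stand-in (not part of the paper's toolchain), assumed monotone in p: if 1 − p is verifiable, so is 1 − p′ for every p′ ≥ p.

```python
def tightest_p(verify_lbpt, tol=1e-3):
    """Binary search for the smallest p (up to tol) such that 1 - p
    is a verifiable lower bound on the termination probability.
    verify_lbpt(p) is assumed monotone in p."""
    lo, hi = 0.0, 1.0          # p = 1 is always verifiable (trivial bound 0)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if verify_lbpt(mid):
            hi = mid           # 1 - mid is verified; try a smaller p
        else:
            lo = mid           # 1 - mid is not verifiable; increase p
    return hi

# Toy oracle: pretend the method can certify any bound up to 0.99,
# i.e. verify_lbpt(p) holds iff p >= 0.01.
print(tightest_p(lambda p: p >= 0.01))  # converges to roughly 0.01
```

Each oracle call is a full synthesis run, so in practice one would keep `tol` coarse.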

Contributions. Our specific contributions in this work are as follows:

- A sound and complete characterization of stochastic invariants through the novel notion of SI-indicators, which enables their automated template-based synthesis.
- A proof that stochastic invariants, synthesized in tandem with an a.s. reachability certificate, are sound and complete for certifying lower bounds on the termination probability.
- The first relatively complete algorithm for the probabilistic termination problem, which simultaneously computes a stochastic invariant and a ranking-supermartingale certificate for polynomial probabilistic programs.

# 2 Overview

Before presenting general theorems and algorithms, we first illustrate our method on the probabilistic program in Fig. 1. The program models a 1-dimensional discrete-time random walk over the real line that starts at x = 0 and terminates once a point with x < 0 is reached. In every time step, x is incremented by a random value sampled according to the uniform distribution *Uniform*([−1, 0.5]). However, if the stochastic process is at a point with x ≥ 100, then the value of x might also be incremented by a second random value, independently sampled from *Uniform*([−1, 2]). The choice of whether the second increment happens is nondeterministic. By a standard random walk argument, the program does not terminate almost-surely.

Outline of Our Method. Let p = 0.01. To prove that this program terminates with probability at least 1 − p = 0.99, our method computes the following two objects:

- a stochastic invariant (*SI*, p), witnessed by an SI-indicator, and
- a certificate that a random run, with probability 1, either terminates or leaves *SI*, in the form of a ranking supermartingale (RSM).

$$\begin{array}{ll} & x = 0 \\ \ell\_{init}: & \textbf{while} \quad x \ge 0 \quad \textbf{do} \\ \ell\_1: & r\_1 := Uniform([-1, 0.5]) \\ \ell\_2: & x := x + r\_1 \\ \ell\_3: & \textbf{if} \quad x \ge 100 \quad \textbf{then} \\ \ell\_4: & \textbf{if} \quad \star \text{ then} \\ \ell\_5: & r\_2 := Uniform([-1, 2]) \\ \ell\_6: & x := x + r\_2 \\ \ell\_{out}: \end{array}$$

Fig. 1. Our running example.
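The claims about Fig. 1 can be checked empirically. The following simulation sketch (not part of the paper's method) resolves ⋆ adversarially, always taking the second increment once x ≥ 100; the step cap is purely a simulation artifact:

```python
import random

def run_fig1(rng, max_steps=100_000):
    """Simulate one run of Fig. 1 under a demonic scheduler.
    Returns True iff the run reaches x < 0 within max_steps."""
    x = 0.0
    for _ in range(max_steps):
        if x < 0:
            return True
        x += rng.uniform(-1.0, 0.5)      # r1 ~ Uniform([-1, 0.5])
        if x >= 100:
            # demonic resolution of '*': always take the second increment
            x += rng.uniform(-1.0, 2.0)  # r2 ~ Uniform([-1, 2])
    return x < 0

rng = random.Random(0)
runs = 1000
terminated = sum(run_fig1(rng) for _ in range(runs))
print(f"{terminated}/{runs} runs terminated")
```

In practice nearly every simulated run terminates: climbing from x = 0 to x = 100 against the per-step drift of −0.25 is extremely unlikely, so the certified bound of 0.99 is conservative.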

Synthesizing SI. To find a stochastic invariant, our method computes a state function f which assigns a non-negative real value to each reachable program state. We call this function a *stochastic invariant indicator (SI-indicator)*, and it serves the following two purposes. First, exactly those states which are assigned a value strictly less than 1 are considered part of the stochastic invariant *SI*. Second, the value assigned to each state is an upper bound on the probability of leaving *SI* if the program starts from that state. Finally, by requiring that the value of the SI-indicator at the initial state of the program is at most p, we ensure that a random program run leaves the stochastic invariant with probability at most p.

In Sect. 4, we will define SI-indicators in terms of conditions that ensure the properties above and facilitate automated computation. We also show that SI-indicators serve as a *sound and complete* characterization of stochastic invariants, which is one of the core contributions of this work. The significance of the completeness of the characterization is that, in order to search for a stochastic invariant with a given probability threshold p, one may equivalently search for an SI-indicator with the same probability threshold, whose computation can be automated. As we will discuss in Sect. 8, previous approaches to the synthesis of stochastic invariants were neither complete nor provided tight probability bounds. For Fig. 1, we have the following set *SI*, which will be left with probability at most p = 0.01:

$$SI(\ell) = \begin{cases} (x < 99) & \text{if } \ell \in \{\ell\_{init}, \ell\_1, \ell\_2, \ell\_3, \ell\_{out}\} \\ \text{false} & \text{otherwise.} \end{cases} \tag{1}$$

An SI-indicator for this stochastic invariant is:

$$f(\ell, x, r\_1, r\_2) = \begin{cases} \frac{x+1}{100} & \text{if } \ell \in \{\ell\_{init}, \ell\_1, \ell\_3, \ell\_{out}\} \text{ and } x < 99\\ \frac{x+1+r\_1}{100} & \text{if } \ell = \ell\_2 \text{ and } x < 99\\ 1 & \text{otherwise.} \end{cases} \tag{2}$$

It is easy to check that (*SI*, 0.01) is a stochastic invariant and that for every state s = (ℓ, x, r1, r2), the value f(s) is an upper bound on the probability of eventually leaving *SI* if program execution starts at s. Also, s ∈ *SI* ⇔ f(s) < 1.

Synthesizing a Termination Proof. To prove that a probabilistic program terminates with probability at least 1 − p, our method searches for a stochastic invariant (*SI*, p) for which, additionally, a random program run with probability 1 either leaves *SI* or terminates. This idea is formalized in Theorem 2, which shows that stochastic invariants provide a *sound and complete* certificate for proving that a given probabilistic program terminates with probability at least 1 − p. In order to impose this additional condition, our method simultaneously computes an RSM for the set of states ¬*SI* ∪ *Stateterm*, where *Stateterm* is the set of all terminal states. RSMs are a classical certificate for proving almost-sure termination or reachability in probabilistic programs. A state function η is an RSM for ¬*SI* ∪ *Stateterm* if it satisfies the following two conditions:

	- *Non-negativity:* η(s) ≥ 0 at every reachable state s;
	- *ε-decrease in expectation:* for some ε > 0, at every reachable state s ∉ ¬*SI* ∪ *Stateterm*, the expected value of η after a one-step execution of the program from s is at most η(s) − ε, under every scheduler.
The existence of an RSM for ¬*SI* ∪ *Stateterm* implies that the program will, with probability 1, either terminate or leave *SI*. As (*SI*, p) is a stochastic invariant, we can readily conclude that the program terminates with probability at least 1 − p = 0.99. An example RSM with ε = 0.05 for our example above is:

$$\eta(\ell, x, r\_1, r\_2) = \begin{cases} x + 1.1 & \text{if } \ell = \ell\_{init} \\ x + 1.05 & \text{if } \ell = \ell\_1 \\ x + 1.2 + r\_1 & \text{if } \ell = \ell\_2 \\ x + 1.15 & \text{if } \ell = \ell\_3 \\ x + 1 & \text{if } \ell = \ell\_{out} \\ 100 & \text{otherwise.} \end{cases} \tag{3}$$
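The defining inequalities of (2) and (3) can be sanity-checked numerically. The sketch below assumes the natural pCFG reading of Fig. 1 (loop guard between ℓ_init and ℓ1/ℓ_out, sampling of r1 between ℓ1 and ℓ2, the assignment between ℓ2 and ℓ3, and the branch back to ℓ_init from ℓ3 when x < 100; the exact pCFG is in [12, Appendix B]). On a grid of states inside *SI*, it verifies that f never increases in expectation and that η decreases by at least ε = 0.05 in expectation:

```python
EPS = 0.05      # RSM decrease constant from the text
E_R1 = -0.25    # mean of Uniform([-1, 0.5])
TOL = 1e-9      # tolerance for floating-point comparisons

def f(loc, x, r1=0.0):    # SI-indicator from Eq. (2)
    if loc in ("init", "l1", "l3", "out") and x < 99:
        return (x + 1) / 100
    if loc == "l2" and x < 99:
        return (x + 1 + r1) / 100
    return 1.0

def eta(loc, x, r1=0.0):  # RSM from Eq. (3)
    return {"init": x + 1.1, "l1": x + 1.05, "l2": x + 1.2 + r1,
            "l3": x + 1.15, "out": x + 1.0}.get(loc, 100.0)

def check():
    assert f("init", 0.0) <= 0.01 + TOL          # initial condition, p = 0.01
    xs = [i * 0.1 for i in range(-10, 990)]      # grid of states with x < 99
    r1s = [-1 + i * 0.05 for i in range(31)]     # grid over [-1, 0.5]
    for x in xs:
        assert f("init", x) >= 0 and eta("init", x) >= 0
        # l_init -> l1 (loop guard holds) or l_init -> l_out (loop exits)
        succ = "l1" if x >= 0 else "out"
        assert f("init", x) >= f(succ, x) - TOL
        assert eta(succ, x) <= eta("init", x) - EPS + TOL
        # l1 -> l2: r1 is sampled; f and eta are linear in r1,
        # so expectations are obtained by plugging in E[r1]
        assert f("l1", x) >= (x + 1 + E_R1) / 100 - TOL
        assert x + 1.2 + E_R1 <= eta("l1", x) - EPS + TOL
        # l2 -> l3 (x := x + r1), then l3 -> l_init (here x + r1 < 100)
        for r1 in r1s:
            assert f("l2", x, r1) >= f("l3", x + r1) - TOL
            assert eta("l3", x + r1) <= eta("l2", x, r1) - EPS + TOL
            assert eta("init", x + r1) <= eta("l3", x + r1) - EPS + TOL
    return True

print(check())
```

A grid check of this kind is of course no proof; the paper discharges the same inequalities symbolically, over all states satisfying the invariant.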

Simultaneous Synthesis. Our method employs a template-based approach and synthesizes the SI and the RSM simultaneously. We assume that our method is provided with an affine/polynomial invariant I which over-approximates the set of all reachable states in the program; this is necessary since the defining conditions of SI-indicators and RSMs are required to hold at all reachable program states. Note that invariant generation is an orthogonal and well-studied problem and can be automated using [10]. For both the SI-indicator and the RSM, our method first fixes a symbolic template affine/polynomial expression for each location in the program. Then, all the defining conditions of SI-indicators and RSMs are encoded as a system of constraints over the symbolic template variables, where reachability of program states is encoded using the invariant I, and the synthesis proceeds by solving this system of constraints. We describe our algorithm in Sect. 6 and show that it is *relatively complete* with respect to the provided invariant I and the probability threshold 1 − p. We also note that our algorithm can be adapted to *compute* lower bounds on the termination probability by combining it with a binary search on p.
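The template idea can be illustrated on a stripped-down instance (this is not the paper's algorithm): synthesizing a linear RSM η(x) = a·x + b for the loop `while x >= 0: x := x + Uniform([-1, 0.5])`, with ε = 0.05 and invariant x ≥ −1. A real template-based synthesizer discharges the universally quantified constraints exactly, e.g. via Farkas' lemma and an LP/SMT solver; the toy below merely checks them on sampled states over a small grid of template coefficients.

```python
# RSM conditions for eta(x) = a*x + b (E[Uniform([-1, 0.5])] = -0.25):
#   (non-negativity)  a*x + b >= 0                               for all x >= -1
#   (eps-decrease)    a*(x - 0.25) + b <= (a*x + b) - EPS        for all x >= 0
EPS = 0.05
SLACK = 1e-9                              # guard against float rounding
XS = [i * 0.5 for i in range(-2, 201)]    # sampled states in [-1, 100]

def is_rsm(a, b):
    nonneg = all(a * x + b >= -SLACK for x in XS)
    decrease = all(a * (x - 0.25) + b <= a * x + b - EPS + SLACK
                   for x in XS if x >= 0)
    return nonneg and decrease

candidates = [(a / 10, b / 10)
              for a in range(0, 21) for b in range(0, 21)
              if is_rsm(a / 10, b / 10)]
print(candidates[0])   # first feasible template instance: a = b = 0.2
```

The decrease condition forces a ≥ 4ε = 0.2, and non-negativity at x = −1 forces b ≥ a, which is exactly what the grid search recovers.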

Completeness vs. Relative Completeness. Our characterization of stochastic invariants using indicator functions is complete. So is our reduction from quantitative termination analysis to the problem of synthesizing an SI-indicator function and a certificate for almost-sure reachability. These are our core theoretical contributions in this work. Nevertheless, as mentioned above, RSMs are complete only for finite termination, not a.s. termination. Moreover, template-based approaches provide completeness guarantees only for solutions that match the template, e.g. polynomial termination certificates of a bounded degree. Therefore, our end-to-end approach is only relatively complete. These losses of completeness are due to Rice's undecidability theorem and are inevitable even in *qualitative* termination analysis. In this work, we provide approaches for *quantitative* termination analysis that are as complete as the best known methods for the qualitative case.

#### 3 Preliminaries

We consider imperative arithmetic probabilistic programs with non-determinism. Our programs allow standard programming constructs such as conditional branching, while-loops and variable assignments. They also allow two probabilistic constructs: probabilistic branching, indicated in the syntax by a command 'if prob(p) then ...' with p ∈ [0, 1] a real constant, and sampling instructions of the form x := d, where d is a probability distribution. Sampling instructions may involve both discrete (e.g. Bernoulli, geometric or Poisson) and continuous (e.g. uniform, normal or exponential) distributions. We also allow constructs for (demonic) non-determinism: non-deterministic branching, indicated in the syntax by 'if ⋆ then ...', and non-deterministic assignments, represented by an instruction of the form x := ndet([a, b]), where a, b ∈ R ∪ {±∞} and [a, b] is a (possibly unbounded) real interval from which the new variable value is chosen non-deterministically. We also allow one or both sides of the interval to be open. The complete syntax of our programs is presented in [12, Appendix A].

Notation. We use boldface symbols to denote vectors. For a vector **x** of dimension n and 1 ≤ i ≤ n, **x**[i] denotes the i-th component of **x**. We write **x**[i ← a] to denote the n-dimensional vector **y** with **y**[i] = a and **y**[j] = **x**[j] for j ≠ i.
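For illustration only, the update notation **x**[i ← a] corresponds to a simple copy-and-assign helper (1-indexed, as in the text):

```python
def subst(x, i, a):
    """Return a copy of valuation x with component i (1-indexed)
    replaced by a; all other components are unchanged."""
    y = list(x)
    y[i - 1] = a
    return y

print(subst([3.0, 5.0, 7.0], 2, 0.0))  # -> [3.0, 0.0, 7.0]
```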

Program Variables. Variables in our programs are real-valued. Given a finite set of variables V, a *variable valuation* of V is a vector **x** ∈ R^{|V|}.

Probabilistic Control-Flow Graphs (pCFGs). We model our programs via probabilistic control-flow graphs (pCFGs) [11,14]. A *probabilistic control-flow graph (pCFG)* is a tuple C = (*L*, V, ℓ_init, **x**_init, →, G, *Pr*, *Up*), where:

	- the bottom element u = ⊥, denoting no update;
	- a Borel-measurable expression u : R^{|V|} → R, denoting a deterministic variable assignment;
	- a probability distribution u = d, denoting that the new variable value is sampled according to d;
	- an interval u = [a, b] ⊆ R ∪ {±∞}, denoting a non-deterministic update. We also allow one or both sides of the interval to be open.

We assume the existence of a special *terminal location* denoted by ℓ_out. We also require that each location has at least one outgoing transition, and that each ℓ ∈ L_A has a unique outgoing transition. For each location ℓ ∈ L_C, we assume that the disjunction of the guards of all transitions outgoing from ℓ is equivalent to *true*, i.e. ⋁_{τ=(ℓ,\_)} G(τ) ≡ *true*. The translation of probabilistic programs to pCFGs that model them is standard, so we omit the details and refer the reader to [11]. The pCFG for the program in Fig. 1 is provided in [12, Appendix B].

States, Paths and Runs. A *state* in a pCFG C is a tuple (ℓ, **x**), where ℓ is a location in C and **x** ∈ R^{|V|} is a variable valuation of V. We say that a transition τ = (ℓ, ℓ′) is *enabled* at a state (ℓ, **x**) if ℓ ∉ L_C, or if ℓ ∈ L_C and **x** ⊨ G(τ). We say that a state (ℓ′, **x**′) is a *successor* of (ℓ, **x**) if there exists an enabled transition τ = (ℓ, ℓ′) in C such that (ℓ′, **x**′) can be reached from (ℓ, **x**) by executing τ, i.e. we can obtain **x**′ by applying the updates of τ to **x**, if any. A *finite path* in C is a sequence (ℓ_0, **x**_0), (ℓ_1, **x**_1), ..., (ℓ_k, **x**_k) of states with (ℓ_0, **x**_0) = (ℓ_init, **x**_init) and with (ℓ_{i+1}, **x**_{i+1}) a successor of (ℓ_i, **x**_i) for each 0 ≤ i ≤ k − 1. A state (ℓ, **x**) is *reachable* in C if there exists a finite path in C that ends in (ℓ, **x**). A *run* (or *execution*) in C is an infinite sequence of states in which each finite prefix is a finite path. We use *State*_C, *Fpath*_C, *Run*_C and *Reach*_C to denote the sets of all states, finite paths, runs and reachable states in C, respectively. Finally, we use *Stateterm* to denote the set {(ℓ_out, **x**) | **x** ∈ R^{|V|}} of terminal states.

Schedulers. The behavior of a pCFG may be captured by defining a probability space over the set of all runs in the pCFG. To do this, however, we need to resolve non-determinism, which is achieved via the standard notion of a scheduler. A *scheduler* in a pCFG C is a map σ that to each finite path ρ ∈ *Fpath*_C assigns a probability distribution σ(ρ) over successor states of the last state in ρ. Since we deal with programs operating over real-valued variables, the set *Fpath*_C may be uncountable. We therefore impose an additional *measurability* assumption on schedulers, in order to ensure that the semantics of probabilistic programs with non-determinism is defined in a mathematically sound way. The restriction to measurable schedulers is standard, hence we omit the formal definition.

Semantics of pCFGs. A pCFG C together with a scheduler σ defines a stochastic process taking values in the set of states of C, whose trajectories correspond to runs in C. The process starts in the initial state (ℓ_init, **x**_init) and inductively extends the run, where the next state along the run is either chosen deterministically or sampled from the probability distribution defined by the current location along the run and by the scheduler σ. These are the classical operational semantics of Markov decision processes (MDPs), see e.g. [1,27]. A pCFG C and a scheduler σ together determine a probability space (*Run*_C, F_C, P^σ) over the set of all runs in C. For details, see [12, Appendix C]. We denote by E^σ the expectation operator on (*Run*_C, F_C, P^σ). We may analogously define a probability space (*Run*_{C,(ℓ,**x**)}, F_{C,(ℓ,**x**)}, P^σ_{C,(ℓ,**x**)}) over the set of all runs in C that start in a specified state (ℓ, **x**).

Probabilistic Termination Problem. We now define the termination problem for probabilistic programs considered in this work. A state (ℓ, **x**) in a pCFG C is said to be a *terminal state* if ℓ = ℓ_out. A run ρ ∈ *Run*_C is said to be *terminating* if it reaches some terminal state in C. We use *Term* ⊆ *Run*_C to denote the set of all terminating runs in *Run*_C. The *termination probability* of a pCFG C is defined as inf_σ P^σ[*Term*], i.e. the smallest probability of the set of terminating runs in C with respect to any scheduler in C (for the proof that *Term* is measurable, see [40]). We say that C terminates *almost-surely (a.s.)* if its termination probability is 1. In this work, we consider the Lower Bound on the Probability of Termination (LBPT) problem: given p ∈ [0, 1], decide whether 1 − p is a lower bound on the termination probability of the given probabilistic program, i.e. whether inf_σ P^σ[*Term*] ≥ 1 − p.

#### 4 A Sound and Complete Characterization of SIs

In this section, we recall the notion of stochastic invariants and present our characterization of stochastic invariants through stochastic invariant indicators. We fix a pCFG C = (*L*, V, ℓ_init, **x**_init, →, G, *Pr*, *Up*). A *predicate function* in C is a map F that to every location ℓ ∈ *L* assigns a logical formula F(ℓ) over program variables. It naturally induces a set of states, which we require to be Borel-measurable for the semantics to be well-defined. By a slight abuse of notation, we identify a predicate function F with this set of states. Furthermore, we use ¬F to denote the negation of a predicate function, i.e. (¬F)(ℓ) = ¬F(ℓ). An *invariant* in C is a predicate function I which additionally over-approximates the set of reachable states in C, i.e. for every (ℓ, **x**) ∈ *Reach*_C we have **x** ⊨ I(ℓ). *Stochastic invariants* can be viewed as a probabilistic extension of invariants, which a random program run leaves only with a certain probability. See Sect. 2 for an example.

Definition 1 (Stochastic invariant [14]). *Let SI be a predicate function in* C *and* p ∈ [0, 1] *a probability. The tuple* (*SI*, p) *is a* stochastic invariant (SI) *if the probability of a run in* C *leaving the set of states defined by SI is at most* p *under any scheduler. Formally, we require that*

$$\sup\_{\sigma} \mathbb{P}^{\sigma} \left[ \rho \in \operatorname{Run}\_{\mathcal{C}} \mid \rho \text{ reaches some } (\ell, \mathbf{x}) \text{ with } \mathbf{x} \notin \operatorname{SI}(\ell) \right] \le p.$$

Key Challenge. If we find a stochastic invariant (*SI*, p) for which termination happens almost-surely on runs that do not leave *SI*, we can immediately conclude that the program terminates with probability at least 1 − p (this idea is formalized in Sect. 5). The key challenge in designing an efficient termination analysis based on this idea is the computation of appropriate stochastic invariants. We present a *sound and complete* characterization of stochastic invariants which allows for their effective automated synthesis through template-based methods.

We characterize stochastic invariants through the novel notion of *stochastic invariant indicators (SI-indicators)*. An SI-indicator is a function that to each state assigns an upper bound on the probability of violating the stochastic invariant if we start the program in that state. Since the definition of an SI-indicator imposes conditions on its value at reachable states, and since computing the exact set of reachable states is in general infeasible, we define SI-indicators with respect to a supporting invariant, with later automation in mind. In order to follow the ideas of this section, one may assume for simplicity that the invariant exactly equals the set of reachable states. A *state function* in C is a function f that to each location ℓ ∈ *L* assigns a Borel-measurable real-valued function over program variables f(ℓ) : R^{|V|} → R. We use f(ℓ, **x**) and f(ℓ)(**x**) interchangeably.

Definition 2 (Stochastic invariant indicator). *A tuple* (f_SI, p) *comprising a state function* f_SI *and a probability* p ∈ [0, 1] *is a* stochastic invariant indicator (SI-indicator) *with respect to an invariant* I *if it satisfies the following conditions:*

	- *(C1) Non-negativity: for every location* ℓ ∈ *L, we have* **x** ⊨ I(ℓ) ⇒ f_SI(ℓ, **x**) ≥ 0*.*
	- *(C2) Non-increasing expected value: for every transition* τ = (ℓ, ℓ′)*:*
		- *If Up*(τ) = (j, ⊥)*, we have* **x** ⊨ I(ℓ) ⇒ f_SI(ℓ, **x**) ≥ f_SI(ℓ′, **x**)*.*
		- *If Up*(τ) = (j, u) *with* u : R^{|V|} → R *an expression, we have* **x** ⊨ I(ℓ) ⇒ f_SI(ℓ, **x**) ≥ f_SI(ℓ′, **x**[x_j ← u(**x**)])*.*
		- *If Up*(τ) = (j, u) *with* u = d *a distribution, we have* **x** ⊨ I(ℓ) ⇒ f_SI(ℓ, **x**) ≥ E_{X∼d}[f_SI(ℓ′, **x**[x_j ← X])]*.*
		- *If Up*(τ) = (j, u) *with* u = [a, b] *an interval, we have* **x** ⊨ I(ℓ) ⇒ f_SI(ℓ, **x**) ≥ sup_{X∈[a,b]} f_SI(ℓ′, **x**[x_j ← X])*.*
	- *(C3) Initial condition:* f_SI(ℓ_init, **x**_init) ≤ p*.*

Intuition. (C1) imposes that f_SI is non-negative at any state contained in the invariant I. Next, for any state in I, (C2) imposes that the value of f_SI does not increase in expectation upon a one-step execution of the pCFG under any scheduler. Finally, (C3) imposes that the initial value of f_SI in C is at most p. Together, the indicator thus intuitively over-approximates the probability of violating *SI*. An example of an SI-indicator for our running example in Fig. 1 is given in (2). The following theorem formalizes the above intuition and is our main result of this section. In essence, we prove that (*SI*, p) is a stochastic invariant in C iff there exists an SI-indicator (f_SI, p) such that *SI* contains all states at which f_SI is strictly smaller than 1. This implies that, for every stochastic invariant (*SI*, p), there exists an SI-indicator such that (*SI*′, p) defined via *SI*′(ℓ) = (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1) is a stochastic invariant that is at least as tight as (*SI*, p).

Theorem 1 (Soundness and Completeness of SI-indicators). *Let* C *be a pCFG,* I *an invariant in* C *and* p ∈ [0, 1]*. For any SI-indicator* (f_SI, p) *with respect to* I*, the predicate map SI defined as SI*(ℓ) = (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1) *yields a stochastic invariant* (*SI*, p) *in* C*. Conversely, for every stochastic invariant* (*SI*, p) *in* C*, there exist an invariant* I_SI *and a state function* f_SI *such that* (f_SI, p) *is an SI-indicator with respect to* I_SI *and for each* ℓ ∈ *L we have SI*(ℓ) ⊇ (**x** ⊨ I_SI(ℓ) ∧ f_SI(ℓ, **x**) < 1)*.*

Proof Sketch. Since the proof is technically involved, we present the main ideas here and defer the details to [12, Appendix E]. First, suppose that I is an invariant in C and that (f_SI, p) is an SI-indicator with respect to I, and let SI(ℓ) = (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1) for each ℓ ∈ *L*. We need to show that (*SI*, p) is a stochastic invariant in C. Let sup_σ P^σ_{(ℓ,**x**)}[*Reach*(¬*SI*)] be the state function that maps each state (ℓ, **x**) to the probability of reaching ¬*SI* from (ℓ, **x**). We consider a lattice of non-negative semi-analytic state functions (L, ⊑) with the partial order defined via f ⊑ f′ if f(ℓ, **x**) ≤ f′(ℓ, **x**) holds for each state (ℓ, **x**) in I. See [12, Appendix D] for a review of lattice theory. It follows from a result in [40] that the probability of reaching ¬*SI* can be characterized as the least fixed point of the *next-time operator* X_{¬SI} : L → L. Away from ¬*SI*, the operator X_{¬SI} simulates a one-step execution of C and maps f ∈ L to its maximal expected value upon one-step execution of C, where the maximum is taken over all schedulers; at states contained in ¬*SI*, the operator X_{¬SI} is equal to 1. It was also shown in [40] that, if a state function f ∈ L is a pre-fixed point of X_{¬SI}, then it satisfies sup_σ P^σ_{(ℓ,**x**)}[*Reach*(¬*SI*)] ≤ f(ℓ, **x**) for each (ℓ, **x**) in I.

Now, by checking the defining properties of pre-fixed points and recalling that f_SI satisfies the non-negativity condition (C1) and the non-increasing expected value condition (C2) in Definition 2, we can show that f_SI is contained in the lattice L and is a pre-fixed point of X_{¬SI}. It follows that sup_σ P^σ_{(ℓ_init,**x**_init)}[*Reach*(¬*SI*)] ≤ f_SI(ℓ_init, **x**_init). On the other hand, by the initial condition (C3) in Definition 2 we know that f_SI(ℓ_init, **x**_init) ≤ p. Hence, we have sup_σ P^σ_{(ℓ_init,**x**_init)}[*Reach*(¬*SI*)] ≤ p, so (*SI*, p) is a stochastic invariant.

Conversely, suppose that (*SI*, p) is a stochastic invariant in C. We show in [12, Appendix E] that, if we define I_SI to be the trivial *true* invariant and define f_SI(ℓ, **x**) = sup_σ P^σ_{(ℓ,**x**)}[*Reach*(¬*SI*)], then (f_SI, p) forms an SI-indicator with respect to I_SI. The claim follows by again using the fact that f_SI is the least fixed point of the operator X_{¬SI}, from which we can conclude that (f_SI, p) satisfies conditions (C1) and (C2) in Definition 2. On the other hand, the fact that (*SI*, p) is a stochastic invariant and our choice of f_SI imply that (f_SI, p) satisfies the initial condition (C3) in Definition 2. Hence, (f_SI, p) forms an SI-indicator with respect to I_SI. Furthermore, *SI*(ℓ) ⊇ (**x** ⊨ I_SI(ℓ) ∧ f_SI(ℓ, **x**) < 1) follows since 1 > f_SI(ℓ, **x**) = sup_σ P^σ_{(ℓ,**x**)}[*Reach*(¬*SI*)] implies that (ℓ, **x**) cannot be contained in ¬*SI*, so **x** ⊨ *SI*(ℓ). This concludes the proof.

Based on the theorem above, in order to compute a stochastic invariant in C for a given probability threshold p, it suffices to synthesize a state function f_SI that together with p satisfies all the defining conditions in Definition 2 with respect to some supporting invariant I, and then consider the predicate function *SI* defined via *SI*(ℓ) = (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1) for each ℓ ∈ *L*. This will be the guiding principle of our algorithmic approach in Sect. 6.

Intuition on Characterization. Stochastic invariants can essentially be thought of as quantitative safety specifications in probabilistic programs: (*SI*, p) is a stochastic invariant if and only if a random probabilistic program run leaves *SI* with probability at most p. However, what makes their computation hard is that they do not merely ask for the probability of staying within a specified safe set. Rather, the computation of stochastic invariants requires computing *both* the safe set *and* the certificate that it is left with at most the given probability. Nevertheless, in order to reason about them, we may consider *SI* as an implicitly defined safe set. Hence, if we impose conditions on a state function f_SI to be an upper bound on the probability of reaching the target set ¬*SI*, i.e. the complement of (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1), and in addition impose that f_SI(ℓ_init, **x**_init) ≤ p, then these together entail that p is an upper bound on the probability of ever leaving *SI* when starting in the initial state. This is the intuitive idea behind our construction of SI-indicators, as well as our soundness and completeness proof. In the proof, we show that conditions (C1) and (C2) in Definition 2 indeed entail the necessary conditions for being an upper bound on the probability of leaving the set (**x** ⊨ I(ℓ) ∧ f_SI(ℓ, **x**) < 1).

#### 5 Stochastic Invariants for LBPT

In the previous section, we paved the way for automated synthesis of stochastic invariants by providing a sound and complete characterization in terms of SI-indicators. We now show how stochastic invariants, in combination with any a.s. termination certificate for probabilistic programs, can be used to compute lower bounds on the probability of termination. Theorem 2 below states a general result about termination probabilities that is agnostic to the termination certificate, and shows that stochastic invariants provide a *sound and complete* approach to quantitative termination analysis.

Theorem 2 (Soundness and Completeness of SIs for Quantitative Termination). *Let* C = (*L*, V, ℓ_init, **x**_init, →, G, *Pr*, *Up*) *be a pCFG and* (*SI*, p) *a stochastic invariant in* C*. Suppose that, with respect to every scheduler, a run in* C *almost-surely either terminates or reaches a state in* ¬*SI, i.e.*

$$\inf\_{\sigma} \mathbb{P}^{\sigma} \left[ \mathit{Term} \cup \mathit{Reach}(\neg SI) \right] = 1. \tag{4}$$

*Then* C *terminates with probability at least* 1 − p*. Conversely, if* C *terminates with probability at least* 1 − p*, then there exists a stochastic invariant* (*SI*, p) *in* C *such that, with respect to every scheduler, a run in* C *almost-surely either terminates or reaches a state in* ¬*SI.*

Proof Sketch. The first part (soundness) follows directly from the definition of *SI* and (4). The completeness proof is conceptually and technically involved and is presented in [12, Appendix H]. In short, the central idea is to construct, for every n greater than a specific threshold n_0, a stochastic invariant (*SI*_n, p + 1/n) such that a run almost-surely either terminates or exits *SI*_n. Then, we show that ∩_{n=n_0}^∞ *SI*_n is our desired *SI*. To construct each *SI*_n, we consider the infimum termination probability at every state (ℓ, **x**) and call it r(ℓ, **x**); the infimum is taken over all schedulers. We then let *SI*_n be the set of states (ℓ, **x**) for which r(ℓ, **x**) is greater than a specific threshold α. Intuitively, our stochastic invariant is the set of program states from which the probability of termination is at least α, no matter how the non-determinism is resolved. Let us call these states likely-terminating. The intuition is that a random run of the program will terminate or eventually leave the likely-terminating states with high probability.

Quantitative to Qualitative Termination. Theorem 2 provides a recipe for computing lower bounds on the probability of termination once we are able to compute stochastic invariants: if (*SI*, p) is a stochastic invariant in a pCFG C, it suffices to prove that the set of states *State*<sub>term</sub> ∪ ¬*SI* is reached almost-surely with respect to any scheduler in C, i.e. that the program terminates or violates *SI*. Note that this is simply a qualitative a.s. termination problem, except that the set of terminal states is augmented with ¬*SI*. Then, since (*SI*, p) is a stochastic invariant, it follows that a terminal state is reached with probability at least 1 − p. Moreover, the theorem shows that this approach is both sound and complete. In other words, proving quantitative termination, i.e. that we reach *State*<sub>term</sub> with probability at least 1 − p, is reduced to (i) finding a stochastic invariant (*SI*, p) and (ii) proving that the program obtained by adding ¬*SI* to the set of terminal states of C is a.s. terminating. Note that, to preserve completeness, (i) and (ii) should be achieved in tandem: an approach that first synthesizes and fixes *SI* and only then tries to prove a.s. termination for ¬*SI* is not complete.

Ranking Supermartingales. While our reduction above is agnostic to the type of proof/certificate used to establish a.s. termination, in this work we use Ranking Supermartingales (RSMs) [7], a standard and classical certificate for proving a.s. termination and reachability. Let C = (*L*, *V*, ℓ*init*, **x***init*, →, *G*, *Pr*, *Up*) be a pCFG and I an invariant in C. Note that, as in Definition 2, the main purpose of the invariant is to allow for automated synthesis; one can again simply assume it to equal the set of reachable states. An ε-RSM for a subset T of states is a state function that is non-negative in each state in I and whose expected value decreases by at least ε > 0 upon a one-step execution of C in any state not contained in the target set T. Thus, intuitively, a program run has an expected tendency to approach the target set T, where the distance to T is given by the value of the RSM, which is required to be non-negative in all states in I. The ε-ranked expected value condition is formally captured via the next-time operator X (see [12, Appendix E]). An example of an RSM for our running example in Fig. 1, with the target set of states ¬*SI* ∪ *State*<sub>term</sub> and *SI* the stochastic invariant in Eq. (1), is given in Eq. (3).

Definition 3 (Ranking supermartingales). *Let* T *be a predicate function defining a set of target states in* <sup>C</sup>*, and let* ε > 0*. A state function* <sup>η</sup> *is said to be an* ε-ranking supermartingale (ε-RSM) *for* T *with respect to the invariant* I *if it satisfies the following conditions:*


Note that the second condition can be expanded according to location types in exactly the same manner as condition C2 of Definition 2. The only difference is that in Definition 2 the expected value had to be non-increasing, whereas here it has to decrease by at least ε. It is well known that the two conditions above entail that T is reached with probability 1 with respect to any scheduler [7,11].
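To make the ε-ranked condition concrete, here is a toy illustration (our own example, not taken from the paper): for a biased random walk that, from any state x > 0, moves to x − 1 with probability 3/4 and to x + 1 with probability 1/4, the state function η(x) = x is a 1/2-RSM for the target set T = {x ≤ 0}:

```python
# Toy illustration (not from the paper): eta(x) = x is an epsilon-RSM with
# epsilon = 1/2 for the target set T = {x <= 0} of a biased random walk that,
# from any x > 0, moves to x - 1 w.p. 3/4 and to x + 1 w.p. 1/4.

def eta(x):
    return x  # candidate RSM, non-negative on the invariant x >= 0

def expected_next_eta(x):
    # exact one-step expected value of eta from a non-target state x > 0
    return 0.75 * eta(x - 1) + 0.25 * eta(x + 1)

EPS = 0.5
# The epsilon-ranked condition E[eta(next)] <= eta(x) - epsilon holds at
# every non-target state; we check it on a sample range.
assert all(expected_next_eta(x) <= eta(x) - EPS for x in range(1, 1000))
```

Since E[η(next)] = 0.75·(x − 1) + 0.25·(x + 1) = x − 1/2, the expected value decreases by exactly ε = 1/2 in every non-target state.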

Theorem 3 (Proof in [12, Appendix I]). *Let* C *be a pCFG,* I *an invariant in* C*, and* T *a predicate function defining a target set of states. If there exist* ε > 0 *and an* ε*-RSM for* T *with respect to* I*, then* T *is a.s. reached under any scheduler, i.e.*

$$\inf\_{\sigma} \mathbb{P}^{\sigma}\_{(\ell\_{init}, \mathbf{x}\_{init})} \left[ \operatorname{Reach}(T) \right] = 1.$$

The following theorem is an immediate corollary of Theorems 2 and 3.

Theorem 4. *Let* C *be a pCFG and* I *an invariant in* C*. Suppose that there exist a stochastic invariant* (*SI*, p)*, an* ε > 0*, and an* ε*-RSM* η *for State*<sub>term</sub> ∪ ¬*SI with respect to* I*. Then* C *terminates with probability at least* 1 − p*.*

Therefore, in order to prove that C terminates with probability at least 1 − p, it suffices to find (i) a stochastic invariant (*SI*, p) in C, and (ii) an ε-RSM η for *State*<sub>term</sub> ∪ ¬*SI* with respect to I and some ε > 0. Note that these two tasks are interdependent. We cannot simply choose any stochastic invariant: for instance, the trivial predicate function defined via *SI* = true always yields a valid stochastic invariant for any p ∈ [0, 1], but it does not help termination analysis. Instead, we need to compute a stochastic invariant and an RSM for it *simultaneously*.

Power of Completeness. We end this section by showing that our approach certifies a tight lower bound on the termination probability of a program that was proven in [40] not to admit any of the previously existing certificates for lower bounds on termination probability. This shows that our completeness pays off in practice: our approach can handle programs that were beyond the reach of previous methods. Consider the program in Fig. 2, annotated with an invariant I. We show that our approach certifies that this program terminates with probability at least 0.5. Indeed, consider the stochastic invariant (*SI*, 0.5) with *SI*(ℓ) = true for every ℓ ≠ ℓ<sub>3</sub> and *SI*(ℓ<sub>3</sub>) = false, and the state function defined via η(ℓ<sub>init</sub>, x) = −log(x) + log(2) + 3, η(ℓ<sub>1</sub>, x) = −log(x) + log(2) + 2, η(ℓ<sub>2</sub>, x) = 1, and η(ℓ<sub>3</sub>, x) = η(ℓ<sub>out</sub>, x) = 0 for each x. One can check by inspection that (*SI*, 0.5) is a stochastic invariant and that η is a (log(2) − 1)-RSM for *State*<sub>term</sub> ∪ ¬*SI* with respect to I. Therefore, it follows from Theorem 4 that the program in Fig. 2 terminates with probability at least 0.5.

#### 6 Automated Template-Based Synthesis Algorithm

We now provide template-based relatively complete algorithms for simultaneous and automated synthesis of SI-indicators and RSMs, in order to solve the quantitative termination problem over pCFGs with affine/polynomial arithmetic. Our approach builds upon the ideas of [2,9] for qualitative and non-probabilistic cases.


Fig. 2. A program that was shown in [40] not to admit a repulsing supermartingale [14] or a gamma-scaled supermartingale [40], but for which our method can certify the tight lower bound of 0.5 on the probability of termination.

Input and Assumptions. The input to our algorithms consists of a pCFG C together with a probability p ∈ [0, 1], an invariant I, and two technical parameters δ and M, which specify the polynomial template sizes used by the algorithm and will be discussed later. In this section, we limit our focus to affine/polynomial pCFGs, i.e. we assume that all guards G(τ) in C and all invariants I(ℓ) are conjunctions of affine/polynomial inequalities over program variables. Similarly, we assume that every update function u : ℝ<sup>|V|</sup> → ℝ used in deterministic variable assignments is an affine/polynomial expression in ℝ[V].

We assume an invariant is given as part of the input. Invariant generation is an orthogonal and well-studied problem and can be automated using [10,16].

Output. The goal of our algorithms is to synthesize a tuple (f, η, ε), where f is an SI-indicator function, η is a corresponding RSM, and ε > 0, such that:


As shown in Sects. 4 and 5, such a tuple w = (f, η, ε) serves as a certificate that the probabilistic program modeled by C terminates with probability at least 1 − p. We call w a quantitative termination certificate.

Overview. Our algorithm is a standard template-based approach similar to [2,9]. We encode the requirements of Definitions 2 and 3 as entailments between affine/polynomial inequalities with unknown coefficients and then apply the classical Farkas' Lemma [17] or Putinar's Positivstellensatz [38] to reduce the synthesis problem to Quadratic Programming (QP). Finally, we solve the resulting QP using a numerical optimizer or an SMT-solver. Our approach consists of the four steps below. Step 3 follows [2] exactly. Hence, we refer to [2] for more details on this step.

Step 1. Setting Up Templates. The algorithm sets up symbolic templates with unknown coefficients for f, η, and ε.
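As an illustrative sketch (the representation and names are our own choices, not the paper's implementation), an affine template per location over the running example's variables x, r1, r2 can be set up as a map from monomials to fresh unknown-coefficient names:

```python
# Hypothetical sketch of Step 1: one affine template per location,
#   f(l, x, r1, r2) = c_{l,0} + c_{l,1}*x + c_{l,2}*r1 + c_{l,3}*r2,
# stored as a map from monomials to fresh coefficient names ('1' = constant).

def make_affine_template(location, variables):
    template = {"1": f"c_{location},0"}  # constant term
    for i, var in enumerate(variables, start=1):
        template[var] = f"c_{location},{i}"
    return template

LOCATIONS = ["init", "l1", "l2", "out"]  # illustrative location names
templates = {loc: make_affine_template(loc, ["x", "r1", "r2"])
             for loc in LOCATIONS}
print(templates["l1"])
# {'1': 'c_l1,0', 'x': 'c_l1,1', 'r1': 'c_l1,2', 'r2': 'c_l1,3'}
```

The unknown coefficient names play the role of the variables c̄ that the later steps solve for.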


Step 2. Generating Entailment Constraints. In this step, the algorithm symbolically computes the requirements of Definition 2, i.e. C1–C3, and their analogues in Definition 3, using the templates generated in the previous step. Note that all of these requirements are entailments between affine/polynomial inequalities over program variables whose coefficients are unknown. In other words, they are of the form ∀**x** A(**x**) ⇒ b(**x**), where A is a set of affine/polynomial inequalities over program variables whose coefficients contain the unknown variables c and ε generated in the previous step, and b is a single such inequality. For example, for the program of Fig. 1, the algorithm symbolically computes condition C1 at location ℓ<sub>1</sub> as follows: ∀**x** I(ℓ<sub>1</sub>, **x**) ⇒ f(ℓ<sub>1</sub>, **x**) ≥ 0. Assuming that the given invariant is I(ℓ<sub>1</sub>, **x**) := (x ≤ 1) and that an affine (degree 1) template was generated in the previous step, the algorithm expands this to:

$$\forall \mathbf{x} \; \; 1 - \mathbf{x} \ge 0 \Rightarrow \overline{c\_{1,0}} + \overline{c\_{1,1}} \cdot x + \overline{c\_{1,2}} \cdot r\_1 + \overline{c\_{1,3}} \cdot r\_2 \ge 0. \tag{5}$$

The algorithm generates similar entailment constraints for every location and every requirement in Definitions 2 and 3.

Step 3. Quantifier Elimination. At the end of the previous step, we have a system of constraints of the form ⋀<sub>i</sub> (∀**x** A<sub>i</sub>(**x**) ⇒ b<sub>i</sub>(**x**)). In this step, the algorithm eliminates the universal quantification over **x** in every constraint. First, consider the affine case. If A<sub>i</sub> is a set of linear inequalities over program variables and b<sub>i</sub> is one such linear inequality, then the algorithm attempts to write b<sub>i</sub> as a linear combination, with non-negative coefficients, of the inequalities in A<sub>i</sub> and the trivial inequality 1 ≥ 0. For example, it rewrites (5) as λ<sub>1</sub> · (1 − x) + λ<sub>2</sub> = c<sub>1,0</sub> + c<sub>1,1</sub> · x + c<sub>1,2</sub> · r<sub>1</sub> + c<sub>1,3</sub> · r<sub>2</sub>, where the λ<sub>i</sub>'s are new *non-negative* unknown variables for which we need to synthesize non-negative real values. This equality should hold for all valuations of the program variables. Thus, we can equate the corresponding coefficients on both sides and obtain the equivalent system:

$$\begin{array}{ll} \lambda\_1 + \lambda\_2 = \overline{c\_{1,0}} & \text{(constant term)}\\ -\lambda\_1 = \overline{c\_{1,1}} & \text{(coefficient of } x) \\ 0 = \overline{c\_{1,2}} = \overline{c\_{1,3}} & \text{(coefficients of } r\_1 \text{ and } r\_2) \end{array} \tag{6}$$
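As a concrete instance (our illustration), the one-premise version of this system can be solved in closed form: the entailment 1 − x ≥ 0 ⇒ c0 + c1·x ≥ 0 holds iff λ1 = −c1 and λ2 = c0 − λ1 are both non-negative.

```python
# Our illustration (not the paper's implementation): decide whether
#   1 - x >= 0  ==>  c0 + c1*x >= 0
# holds by writing the conclusion as lambda1*(1 - x) + lambda2*1 with
# lambda1, lambda2 >= 0.  Equating coefficients gives
#   -lambda1 = c1   (coefficient of x)
#   lambda1 + lambda2 = c0   (constant term)

def farkas_witness(c0, c1):
    lam1 = -c1
    lam2 = c0 - lam1
    return (lam1, lam2) if lam1 >= 0 and lam2 >= 0 else None

# f(l1, x) = 3 - x is non-negative under the invariant x <= 1:
print(farkas_witness(3, -1))  # (1, 2)
# x >= 0 is NOT entailed by x <= 1, and indeed no witness exists:
print(farkas_witness(0, 1))   # None
```

In the full algorithm the c's are themselves unknowns, so the equations become (quadratic) constraints rather than direct computations.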

This transformation is clearly sound, and it is also complete due to the well-known Farkas' lemma [17]. Now consider the polynomial case. Again, we write b<sub>i</sub> as a combination of the polynomials in A<sub>i</sub>. The only difference is that, instead of non-negative real coefficients, we use sum-of-squares polynomials as our multiplicands. For example, suppose our constraint is

$$\forall \mathbf{x} \; \; \; g\_1(\mathbf{x}) \ge 0 \land g\_2(\mathbf{x}) \ge 0 \Rightarrow g\_3(\mathbf{x}) > 0,$$

where the g<sub>i</sub>'s are polynomials with unknown coefficients. The algorithm writes

$$g\_3(\mathbf{x}) = h\_0(\mathbf{x}) + h\_1(\mathbf{x}) \cdot g\_1(\mathbf{x}) + h\_2(\mathbf{x}) \cdot g\_2(\mathbf{x}),\tag{7}$$

where each h<sub>i</sub> is a sum-of-squares polynomial of degree at most M. The algorithm sets up a template of degree M for each h<sub>i</sub> and adds well-known quadratic constraints that force it to be a sum of squares; see [2, Page 22] for details. It then expands (7) and equates the corresponding coefficients of the LHS and RHS as in the linear case. The soundness of this transformation is immediate, since each h<sub>i</sub> is a sum of squares and hence always non-negative. Completeness follows from Putinar's Positivstellensatz [38]. Since the arguments for completeness of this method are exactly the same as for the method in [2], we refer the reader to [2] for more details and for an extension to entailments between strict polynomial inequalities.
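The expand-and-compare step can be illustrated on a concrete instance (our own example, with univariate polynomials stored as degree-to-coefficient dictionaries): over the set given by g1(x) = x ≥ 0 and g2(x) = 1 − x ≥ 0, the polynomial g3(x) = 1 + x − x² admits a decomposition of the form (7) with sum-of-squares multiplicands h0 = 1, h1 = (1 − x)², h2 = x²:

```python
# Our illustration of the coefficient-matching check in the polynomial case.
# Polynomials in x are dicts {degree: coefficient}.

def pmul(p, q):
    """Product of two polynomials."""
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, 0) + a * b
    return {d: c for d, c in out.items() if c != 0}

def padd(*ps):
    """Sum of polynomials."""
    out = {}
    for p in ps:
        for d, c in p.items():
            out[d] = out.get(d, 0) + c
    return {d: c for d, c in out.items() if c != 0}

g1 = {1: 1}               # x
g2 = {0: 1, 1: -1}        # 1 - x
g3 = {0: 1, 1: 1, 2: -1}  # 1 + x - x^2
h0 = {0: 1}               # 1 = 1^2           (sum of squares)
h1 = pmul(g2, g2)         # (1 - x)^2         (sum of squares)
h2 = pmul(g1, g1)         # x^2               (sum of squares)

rhs = padd(h0, pmul(h1, g1), pmul(h2, g2))  # h0 + h1*g1 + h2*g2
assert rhs == g3  # coefficients match, so g3 >= 1 > 0 on {x in [0, 1]}
```

Here the decomposition certifies positivity because each h<sub>i</sub>·g<sub>i</sub> is non-negative on the set, so g3 ≥ h0 = 1 there.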

Step 4. Quadratic Programming. All of our constraints are now converted to a Quadratic Programming (QP) instance over the template variables, e.g. see (6). Our algorithm passes this QP instance to an SMT solver or a numerical optimizer. If a solution is found, it plugs the values obtained for the c and ε variables back into the templates of Step 1 and outputs the resulting termination witness (f, η, ε).

We end this section by noting that our algorithm is sound and relatively complete for synthesizing affine/polynomial quantitative termination certificates.

Theorem 5 (Soundness and Completeness in the Affine Case). *Given an affine pCFG* C*, an affine invariant* I*, and a non-termination upper bound* p ∈ [0, 1]*, if* C *admits a quantitative termination certificate* w = (f, η, ε) *in which both* f *and* η *are affine expressions at every location, then* w *corresponds to a solution of the QP instance solved in Step 4 of the algorithm above. Conversely, every such solution, when plugged back into the template of Step 1, yields an affine quantitative termination certificate showing that* C *terminates with probability at least* 1 − p *under every scheduler.*

Theorem 6 (Soundness and Relative Completeness in the Polynomial Case). *Given a polynomial pCFG* C*, a polynomial invariant* I *which defines a compact subset of* ℝ<sup>|V|</sup> *at every location* ℓ*, and a non-termination upper bound* p ∈ [0, 1]*, if* C *admits a quantitative termination certificate* w = (f, η, ε) *in which both* f *and* η *are polynomial expressions of degree at most* δ *at every location, then there exists an* M ∈ ℕ *for which* w *corresponds to a solution of the QP instance solved in Step 4 of the algorithm above. Conversely, every such solution, when plugged back into the template of Step 1, yields a polynomial quantitative termination certificate of degree at most* δ *showing that* C *terminates with probability at least* 1 − p *under every scheduler.*

*Proof.* Step 2 encodes the conditions of an SI-indicator (Definition 2) and an RSM (Definition 3). Theorem 4 shows that an SI-indicator together with an RSM is a valid quantitative termination certificate. The transformation in Step 3 is sound and complete, as argued in [2, Theorems 4 and 10]. The affine version relies on Farkas' lemma [17] and is complete with no additional constraints. The polynomial version is based on Putinar's Positivstellensatz [38] and is only complete for large enough M, i.e. a high-enough degree for the sum-of-squares multiplicands. This is why we call our algorithm *relatively* complete. In practice, small values of M suffice to synthesize w, and we use M = 2 in all of our experiments.

We need a more involved transformation for *strict* inequalities; see [2, Theorem 8].

#### 7 Experimental Results

Implementation. We implemented a prototype of our approach in Python, using SymPy [33] for symbolic computations and the MathSAT5 SMT solver [15] for solving the final QP instances. We also applied basic optimizations, e.g. checking the validity of each entailment and removing tautological constraints.

Machine and Parameters. All results were obtained on an Intel Core i9-10885H machine (8 cores, 2.4 GHz, 16 MB cache) with 32 GB of RAM, running Ubuntu 20.04. We always synthesized quadratic termination certificates and set δ = M = 2.

Benchmarks. We generated a variety of random walks with complicated behavior, including nested combinations of probabilistic and non-deterministic branching and loops. We also took a number of benchmarks from [14]. Due to space limitations, in Table 1 we only present experimental results on a subset of our benchmark set, together with short descriptions of these benchmarks. Complete evaluation as well as details on all benchmarks are provided in [12, Appendix J].

Results and Discussion. Our experimental results are summarized in Table 1, with complete results provided in [12, Appendix J]. In every case, our approach was able to synthesize a certificate that the program terminates with probability at least 1 − p under any scheduler. Moreover, our runtimes are consistently small, below 6 s per benchmark. Our approach was able to handle programs that are beyond the reach of previous methods, including those with unbounded differences and unbounded non-deterministic assignments, to which approaches such as [14] and [40] are not applicable, as was demonstrated in [40]. This adds experimental confirmation to our theoretical power-of-completeness result at the end of Sect. 5, which showed the wider applicability of our method. Finally, it is noteworthy that the termination probability lower bounds reported in Table 1 are not tight, for two reasons. First, while our theoretical approach is sound and complete, our algorithm can only synthesize affine/polynomial certificates for quantitative termination, and the best polynomial certificate of a given degree might not be tight. Second, we rely on an SMT solver to solve our QP instances; the QP instances often become harder as we decrease p, leading to solver failures even when the constraints are satisfiable.

#### 8 Related Works

Supermartingale-Based Approaches. Beyond qualitative and quantitative termination analysis, supermartingales have also been used for the formal analysis of other properties of probabilistic programs, such as liveness and safety properties [3,8,14,42] and cost analysis [36,43]. While all these works demonstrate the effectiveness of supermartingale-based techniques, below we present a more detailed comparison with the works that consider automated computation of lower bounds on termination probability.


Table 1. Summary of our experimental results on a subset of our benchmark set. See [12, Appendix J] for benchmark details and for the results on all benchmarks.

Comparison to [14]. The work of [14] introduces stochastic invariants and demonstrates their effectiveness for computing lower bounds on termination probability. However, their approach to computing stochastic invariants is based on repulsing supermartingales (RepSMs), and is orthogonal to ours. RepSMs were shown to be incomplete for computing stochastic invariants [40, Section 3]. Moreover, a RepSM is required to have *bounded differences*, i.e. the absolute difference of its values in any two successive states needs to be bounded from above by some positive constant. Given that the algorithmic approach of [14] computes linear RepSMs, the applicability of RepSMs is compromised in practice as well: they are mostly suited to programs in which the quantity that behaves like a RepSM depends only on variables with bounded increments and on sampling instructions defined by distributions of bounded support. Our approach does not impose such a restriction, and is the first to provide completeness guarantees.

Comparison to [40]. The work of [40] introduces γ-scaled submartingales and proves their effectiveness for computing lower bounds on termination probability. Intuitively, for γ ∈ (0, 1), a state function f is a γ-scaled submartingale if it is a bounded non-negative function whose value in each non-terminal state decreases in expectation at least by a factor of γ upon a one-step execution of the pCFG. One may think of this second condition as a multiplicative decrease in expected value. However, the condition is too strict, and γ-scaled submartingales are not complete for lower bounds on termination probability [40, Example 6.6].

Comparison to [5]. The work of [5] proposes a type system for functional probabilistic programs that allows incrementally searching for type derivations and accumulating a lower bound on termination probability. In the limit, it finds arbitrarily tight lower bounds on termination probability; however, it does not provide any completeness or precision guarantees in finite time.

Other Approaches. Logical calculi for reasoning about properties of probabilistic programs (including termination) were studied in [18,19,29] and extended to programs with non-determinism in [27,28,31,37]. These works consider proof systems for probabilistic programs based on the weakest pre-expectation calculus. The expressiveness of this calculus allows reasoning about very complex programs, but the proofs typically require human input. In contrast, we aim for a fully automated approach for probabilistic programs with polynomial arithmetic. Connections between martingales and the weakest pre-expectation calculus were studied in [24]. A sound approach for proving almost-sure termination based on abstract interpretation is presented in [34].

Cores in MDPs. *Cores*, introduced in [30] for finite MDPs, are a notion conceptually equivalent to stochastic invariants; [30] also presents a sampling-based algorithm for their computation.

#### 9 Conclusion

We study the quantitative probabilistic termination problem in probabilistic programs with non-determinism and propose the first relatively complete algorithm for proving termination with at least a given threshold probability. Our approach is based on a sound and complete characterization of stochastic invariants via the novel notion of stochastic invariant indicators, which allows for an effective and relatively complete algorithm for their computation. We then show that stochastic invariants are sound and complete certificates for proving that a program terminates with at least a given threshold probability. Hence, by combining our relatively complete algorithm for stochastic invariant computation with the existing relatively complete algorithm for computing ranking supermartingales, we present the first relatively complete algorithm for probabilistic termination. We have implemented a prototype of our algorithm and demonstrate its effectiveness on a number of probabilistic programs collected from the literature.

Acknowledgements. This research was partially supported by the ERC CoG 863818 (ForM-SMArt), the HKUST-Kaisa Joint Research Institute Project Grant HKJRI3A-055, the HKUST Startup Grant R9272 and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 665385.

# References



# **Does a Program Yield the Right Distribution?**

# **Verifying Probabilistic Programs via Generating Functions**

Mingshuai Chen(B) , Joost-Pieter Katoen(B) , Lutz Klinkenberg(B) , and Tobias Winkler(B)

RWTH Aachen University, Aachen, Germany

{chenms,katoen,lutz.klinkenberg,tobias.winkler}@cs.rwth-aachen.de

**Abstract.** We study discrete probabilistic programs with potentially unbounded looping behaviors over an infinite state space. We present, to the best of our knowledge, *the first decidability result for the problem of determining whether such a program generates exactly a specified distribution over its outputs* (provided the program terminates almost-surely). The class of distributions that can be specified in our formalism consists of standard distributions (geometric, uniform, etc.) and finite convolutions thereof. Our method relies on representing these (possibly infinite-support) distributions as *probability generating functions* which admit effective arithmetic operations. We have automated our techniques in a tool called Prodigy, which supports automatic invariance checking, compositional reasoning of nested loops, and efficient queries to the output distribution, as demonstrated by experiments.

**Keywords:** Probabilistic programs · Quantitative verification · Program equivalence · Denotational semantics · Generating functions

# **1 Introduction**

Probabilistic programs [26,43,48] augment deterministic programs with stochastic behaviors, e.g., random sampling, probabilistic choice, and conditioning (via posterior observations). Probabilistic programs have undergone a recent surge of interest due to prominent applications in a wide range of domains: they steer autonomous robots and self-driving cars [20,54], are key to describing security [6] and quantum [61] mechanisms, intrinsically code up randomized algorithms for solving NP-hard or even deterministically unsolvable problems (in, e.g., distributed computing [2,53]), and are rapidly encroaching on AI as well

This research was funded by the ERC Advanced Project FRAPPANT under grant No. 787914, by the EU's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant No. 101008233, and by the DFG RTG 2236 UnRAVeL.

as approximate computing [13]. See [5] for recent advancements in probabilistic programming.

The crux of probabilistic programming, à la Hicks' interpretation [30], is to *treat normal-looking programs as if they were probability distributions*. A random-number generator, for instance, is a probabilistic program that produces a uniform distribution across numbers from a range of interest. Such a lift from deterministic program states to possibly infinite-support distributions (over states) renders the verification problem of probabilistic programs notoriously hard [39]. In particular, reasoning about probabilistic loops often amounts to computing quantitative fixed points, which are highly intractable in practice. As a consequence, existing techniques are mostly concerned with approximations, i.e., they strive to verify or obtain upper and/or lower bounds on various quantities like assertion-violation probabilities [59], preexpectations [9,28], moments [58], expected runtimes [40], and concentrations [15,16], which reveal only partial information about the probability distribution carried by the program.

In this paper, we address the problem of *how to determine whether a (possibly infinite-state) probabilistic program yields exactly the desired (possibly infinite-support) distribution under all possible inputs*. We highlight two scenarios where encoding the *exact* distribution – other than (bounds on) the above-mentioned quantities – is of particular interest: (I) In many safety- and/or security-critical domains, e.g., cryptography, a slightly perturbed distribution (while many of its probabilistic quantities remain unchanged) may lead to significant attack vulnerabilities or even complete compromise of the cryptographic system, see, e.g., Bleichenbacher's biased-nonces attack [29, Sect. 5.10] against the probabilistic Digital Signature Algorithm. Therefore, the system designer has to impose a complete specification of the anticipated distribution produced by the probabilistic component. (II) In the context of quantitative verification, the user may be interested in multiple properties (of different types, e.g., the aforementioned quantities) of the output distribution carried by a probabilistic program. In the absence of the exact distribution, multiple analysis techniques – tailored to different types of properties – have to be applied in order to answer all queries from the user. We further motivate our problem using a concrete example as follows.

*Example 1 (Photorealistic Rendering* [37]*).* Monte Carlo integration algorithms form a well-known class of probabilistic programs which approximate complex integral expressions by sampling [27]. One of its particular use-cases is the photorealistic rendering of virtual scenes by a technique called *Monte Carlo path tracing* (MCPT) [37].

MCPT works as follows: For every pixel of the output image, it shoots n sample rays into the scene and models the light transport behavior to approximate the incoming light at that particular point. Starting from a certain pixel position, MCPT randomly chooses a direction, traces it until a scene object is hit, and then proceeds by either (i) terminating the tracing and evaluating the overall ray, or (ii) continuing the tracing by computing a new direction. In the physical world, the light ray may be reflected arbitrarily often and thus stopping the tracing after a certain amount of bounces would introduce a bias in the

integral estimation. As a remedy, the decision when to stop the tracing is made in a *Russian roulette* manner by flipping a coin<sup>1</sup> at each intersection point [1].

The program in Fig. 1 is an implementation of a simplified MCPT path generator. The cumulative length of all n rays is stored in the (random) variable c, which is directly proportional to MCPT's expected runtime. The implementation is designed so that c *induces a distribution as the sum of* n *independent and identically distributed (i.i.d.) geometric random variables*, such that the resulting integral estimation is unbiased. In our framework, we view this exact output distribution of c as a *specification* and verify – fully automatically – that the implementation in Fig. 1 with nested loops indeed satisfies this specification.
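The specification in this example can be checked at the level of PGFs: the PGF of a sum of independent random variables is the product of their PGFs, so the sum of n i.i.d. geometric variables (here taken on {0, 1, 2, ...} with parameter p; the concrete values of p, n, and the truncation length are our arbitrary choices, not from the paper) must follow a negative binomial distribution. A small self-contained sketch:

```python
from math import comb, isclose

p, n, N = 0.5, 3, 12   # success probability, number of summands, truncation

# Coefficient list of the geometric PGF  G(z) = p / (1 - (1-p)z),
# i.e. P(X = k) = p * (1-p)^k for k = 0, 1, 2, ...
geo = [p * (1 - p) ** k for k in range(N)]

def convolve(a, b):
    """Truncated product of two PGFs given as coefficient lists."""
    out = [0.0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                out[i + j] += x * y
    return out

total = [1.0] + [0.0] * (N - 1)   # PGF of the constant 0
for _ in range(n):
    total = convolve(total, geo)  # sum of independent vars: multiply PGFs

# The sum of n i.i.d. geometrics is negative binomial:
for k in range(N):
    assert isclose(total[k], comb(k + n - 1, k) * p ** n * (1 - p) ** k)
```

The truncation is exact up to degree N − 1, since coefficient k of a product depends only on coefficients up to k of the factors.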

*Approach.* Given a probabilistic loop L = while (ϕ) {P} with guard ϕ and loop-free body P, we aim to determine whether L agrees with a specification S:

$$L = \mathtt{while}\left(\varphi\right)\left\{P\right\} \quad \stackrel{?}{\sim} \quad S, \tag{\*}$$

namely, whether L yields – upon termination – exactly the same distribution as encoded by S under all possible program inputs. This problem is non-trivial: (C1) L may induce an infinite state space and infinite-support distributions, thus making techniques like probabilistic bounded model checking [34] insufficient for verifying the property by means of unfolding the loop L. (C2) There is, to the best of our knowledge, a lack of non-trivial characterizations of L and S for which problem (∗) admits a decidability result. (C3) To decide problem (∗) – even for a loop-free program L – one has to account for infinitely or even uncountably many inputs, such that L yields the same distribution as encoded by S when deployed in all possible contexts.

We address challenge (C1) by exploiting the forward denotational semantics of probabilistic programs based on *probability generating function* (PGF) representations of (sub-)distributions [42], which benefits crucially from closed-form (i.e., finite) PGF representations of possibly infinite-support distributions. A probabilistic program L hence acts as a transformer ⟦L⟧(·) that transforms an input PGF g into an output PGF ⟦L⟧(g) (as an instantiation of Kozen's

<sup>1</sup> The bias of the coin depends on the material's *reflectivity*: a reflecting material such as a mirror requires more light bounces than an absorptive one, e.g., a black surface.

transformer semantics [43]). In particular, we *interpret the specification* S *as a loop-free probabilistic program* I. Such an identification of specifications with programs has two important advantages: (i) we only need a single language to encode programs as well as specifications, and (ii) it enables compositional reasoning in a straightforward manner, in particular the treatment of nested loops. The problem of checking L ∼ S then boils down to checking whether L and I transform every possible input PGF into the same output PGF:

$$\forall g \in \mathsf{PGF}\colon \quad \llbracket \mathtt{while}\,(\varphi)\,\{P\} \rrbracket(g) \;\overset{?}{=}\; \llbracket I \rrbracket(g)\,. \tag{$\dagger$}$$

As I is loop free, problem (†) can be reduced to checking the equivalence of two *loop-free* probabilistic programs (cf. Lemma 2):

$$\forall g \in \mathsf{PGF}\colon \quad \llbracket \mathtt{if}\,(\varphi)\,\{P \,⨟\, I\}\ \mathtt{else}\ \{\mathtt{skip}\} \rrbracket(g) \;\overset{?}{=}\; \llbracket I \rrbracket(g)\,. \tag{$\ddagger$}$$

Now challenge (C3) applies, since the universal quantification in problem (‡) requires determining equivalence against infinitely many – possibly infinite-support – distributions over program states. We facilitate such equivalence checking by developing a *second-order PGF* (SOP) semantics for probabilistic programs, which naturally extends the PGF semantics while allowing us to reason about infinitely many PGF transformations simultaneously (see Lemma 3).

Finally, to obtain a decidability result (cf. challenge (C2)), we develop the *rectangular discrete probabilistic programming language* (ReDiP) – a variant of pGCL [46] with a syntactic restriction to rectangular guards – featuring various nice properties: e.g., ReDiP programs inherently support i.i.d. sampling and, in particular, *preserve closed-form PGF* when acting as PGF transformers. We show that *problem* (‡) *is decidable for* ReDiP *programs* P *and* I *if all the distribution statements therein have rational closed-form PGF* (cf. Lemma 4). As a consequence, problem (†) *and thereby problem* (⋆) *of checking* L ∼ S *are decidable if* L *terminates almost-surely on all possible inputs* g (cf. Theorem 4).

*Demonstration.* We have automated our techniques in a tool called Prodigy. As an example, Prodigy was able to verify, fully automatically in 25 milliseconds, that the implementation of the MCPT path generator with nested loops (in Fig. 1) is indeed equivalent to the loop-free program

$$\mathtt{c} \mathrel{{+}{=}} \mathtt{iid}(\mathtt{geometric}(1/2),\,\mathtt{n}) \;⨟\; \mathtt{n} := 0$$

which encodes the specification that, upon termination, c is distributed as the sum of n i.i.d. geometric random variables. With such an output distribution, multiple queries can be efficiently answered by applying standard PGF operations. For example, the expected value and variance of the runtime are *E*[c] = n and *Var*[c] = 2n, respectively (assuming c = 0 initially).
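To illustrate how such queries follow from standard PGF operations: the sum of n i.i.d. geometric(1/2) variables has PGF G(x) = (2 − x)^(−n), and the moment identities E[c] = G′(1) and Var[c] = G″(1) + G′(1) − G′(1)² yield the values above. The following sympy sketch (our own illustration, not Prodigy output) checks this symbolically:

```python
import sympy as sp

x, n = sp.symbols('x n', positive=True)

# PGF of geometric(1/2) starting at 0: sum_k (1/2)^(k+1) x^k = 1/(2 - x).
# The sum of n i.i.d. copies therefore has PGF G(x) = (2 - x)**(-n).
G = (2 - x) ** (-n)

# Standard PGF identities: E[c] = G'(1), Var[c] = G''(1) + G'(1) - G'(1)**2.
dG = sp.diff(G, x)
d2G = sp.diff(G, x, 2)
mean = sp.simplify(dG.subs(x, 1))
var = sp.simplify((d2G + dG - dG**2).subs(x, 1))

print(mean)                      # n
print(sp.simplify(var - 2 * n))  # 0, i.e. Var[c] = 2n
```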

**Contributions.** The main contributions of this paper are:


*Organization.* We introduce generating functions in Sect. 2 and define the ReDiP language in Sect. 3. Section 4 presents the PGF semantics. Section 5 establishes our decidability result in reasoning about ReDiP loops, with case studies in Sect. 6. After discussing related work in Sect. 7, we conclude the paper in Sect. 8. Further details, e.g., proofs and additional examples, can be found in the full version [18].

#### **2 Generating Functions**

*"A generating function is a clothesline on which we hang up a sequence of numbers for display."* — H. S. Wilf, Generatingfunctionology [60]

The method of *generating functions* (GF) is a vital tool in many areas of mathematics. This includes in particular enumerative combinatorics [22,60] and – most relevant for this paper – probability theory [35]. In the latter, the sequences "hanging on the clotheslines" happen to describe probability distributions over the non-negative integers N, e.g., 1/2, 1/4, 1/8, ... (aka the geometric distribution).

The most common way to relate an (infinite) *sequence* of numbers to a generating *function* relies on the familiar Taylor series expansion: given a sequence, for example 1/2, 1/4, 1/8, ..., find a function x → f(x) whose Taylor series around x = 0 uses the numbers in the sequence as coefficients. In our example,

$$\frac{1}{2-x} = \frac{1}{2} + \frac{1}{4}x + \frac{1}{8}x^2 + \frac{1}{16}x^3 + \frac{1}{32}x^4 + \dots,\tag{1}$$

for all |x| < 2; hence the "clothesline" used for hanging up 1/2, 1/4, 1/8, ... is the function 1/(2 − x). Note that the GF is – from a purely syntactical point of view – a *finite* object, while the sequence it represents is *infinite*. A key strength of this technique is that many meaningful operations on infinite series can be performed by manipulating an encoding GF (see Table 1 for an overview and examples). In other words, GF provide an *interface* to perform operations on and extract information from infinite sequences in an effective manner.
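This coefficient extraction can be mechanized with a computer algebra system; the sketch below (using the sympy library, which is our choice and not part of the paper's toolchain) recovers the geometric sequence from its GF 1/(2 − x):

```python
import sympy as sp

x = sp.symbols('x')
f = 1 / (2 - x)

# Taylor coefficients around 0: should be 1/2, 1/4, 1/8, 1/16, 1/32, ...
coeffs = sp.Poly(sp.series(f, x, 0, 5).removeO(), x).all_coeffs()[::-1]
print(coeffs)  # [1/2, 1/4, 1/8, 1/16, 1/32]
```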

#### **2.1 The Ring of Formal Power Series**

Towards our goal of encoding distributions over *program states* (valuations of finitely many integer variables) as generating functions, we need to consider *multivariate* GF, i.e., GF with more than one variable. Such functions represent multidimensional sequences, or *arrays*. Since multidimensional Taylor series quickly become unwieldy, we follow a more *algebraic* approach that is also advocated in [60]: we treat sequences and arrays as elements of an algebraic structure, the *ring of Formal Power Series* (FPS). Recall that a (commutative) *ring* (A, +, ·, 0, 1) consists of a non-empty carrier set A, associative and commutative binary operations "+" (addition) and "·" (multiplication) such that multiplication distributes over addition, and neutral elements 0 and 1 w.r.t. addition and multiplication, respectively. Further, every a ∈ A has an additive inverse −a ∈ A. Multiplicative inverses a⁻¹ = 1/a need not always exist. Let k ∈ N = {0, 1, ...} be fixed in the remainder.



*<sup>a</sup>* Projections are not always well-defined, e.g., (1/(1 − X + Y))[X/1] = 1/Y is ill-defined because Y is not invertible. However, in all situations where we use projection it will be well-defined; in particular, projection is well-defined for PGF.

**Definition 1 (The Ring of FPS).** *A* k*-dimensional FPS is a* k*-dim. array* f : N^k → R*. We denote FPS as* formal sums *as follows: let* **X** = (X₁, ..., Xₖ) *be an ordered vector of symbols, called* indeterminates*. The FPS* f *is written as*

$$f := \sum\_{\sigma \in \mathbb{N}^k} f(\sigma) \mathbf{X}^{\sigma}$$

*where* **X**^σ *is the* monomial X₁^σ₁ X₂^σ₂ ⋯ Xₖ^σₖ*. The* ring of FPS *is denoted* R[[**X**]] *where the operations are defined as follows: for all* f, g ∈ R[[**X**]] *and* σ ∈ N^k*,* (f + g)(σ) = f(σ) + g(σ)*, and* (f · g)(σ) = Σ_{σ₁+σ₂=σ} f(σ₁) g(σ₂)*.*

The multiplication f · g is the usual *Cauchy product* of power series (aka discrete convolution); it is well defined because for all σ ∈ N^k there are just *finitely* many pairs σ₁, σ₂ ∈ N^k with σ₁ + σ₂ = σ. We write fg instead of f · g.
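The Cauchy product is effective on coefficient prefixes. A minimal Python sketch (our illustration, not the paper's implementation) convolves the coefficients of (2 − X) with those of the geometric series from (1) and obtains the constant series 1:

```python
from fractions import Fraction

def cauchy_product(f, g):
    """Discrete convolution of two univariate FPS coefficient prefixes."""
    n = min(len(f), len(g))
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(n)]

# (2 - X) times the geometric series 1/2 + 1/4 X + 1/8 X^2 + ... equals 1.
two_minus_x = [Fraction(2), Fraction(-1), 0, 0, 0, 0]
geo = [Fraction(1, 2**(k + 1)) for k in range(6)]
print(cauchy_product(two_minus_x, geo))  # [1, 0, 0, 0, 0, 0]
```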

The formal sum notation is standard in the literature and often useful because the arithmetic FPS operations closely resemble calculations with "real" sums. We stress that the indeterminates **X** are merely *labels* for the k dimensions of f and do not have any other particular meaning. In the context of this paper, however, it is natural to identify the indeterminates with the program variables (e.g., indeterminate X refers to variable x, see Sect. 3).

Equation (1) can be interpreted as follows in the ring of FPS: the "sequences" 2 − X + 0X² + ... and 1/2 + (1/4)X + (1/8)X² + ... are (multiplicative) *inverse* elements of each other in R[[X]], i.e., their product is 1. More generally, we say that an FPS f is *rational* if f = gh⁻¹ = g/h where g and h are polynomials, i.e., they have at most finitely many non-zero coefficients; we call such a representation a *rational closed form*.

A more extensive introduction to FPS can be found in [18, Appx. D].

#### **2.2 Probability Generating Functions**

We are especially interested in GF that describe probability distributions.

**Definition 2 (PGF).** *A* k*-dimensional FPS* g *is a* probability generating function *(PGF) if* (i) *for all* σ ∈ N^k *we have* g(σ) ≥ 0*, and* (ii) Σ_{σ∈N^k} g(σ) ≤ 1*.*

For example, (1) is the PGF of a 1/2-geometric distribution. The PGF of other standard distributions are given in Table 3 further below. Note that Definition 2 also includes *sub-PGF*, where the sum in (ii) is strictly less than 1.
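For a rational GF, both conditions of Definition 2 can be checked concretely: non-negativity on a coefficient prefix, and the total mass as the value g(1). A small sympy sketch (the helper name `pgf_mass` is ours):

```python
import sympy as sp

x = sp.symbols('x')

def pgf_mass(g, num_terms=30):
    """Check condition (i) on the first Taylor coefficients and return g(1),
    the total probability mass of condition (ii)."""
    coeffs = sp.Poly(sp.series(g, x, 0, num_terms).removeO(), x).all_coeffs()
    assert all(c >= 0 for c in coeffs), "negative coefficient: not a PGF"
    return g.subs(x, 1)

print(pgf_mass(1 / (2 - x)))                   # 1: a full PGF (geometric(1/2))
print(pgf_mass(sp.Rational(1, 2) / (2 - x)))   # 1/2: a proper sub-PGF
```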

### **3** ReDiP**: A Probabilistic Programming Language**

This section presents our *Rectangular Discrete Probabilistic Programming Language*, or ReDiP for short. The word "rectangular" refers to a restriction we impose on the guards of conditionals and loops, see Sect. 3.2. ReDiP is a variant of pGCL [46] with some extra syntax but also some syntactic restrictions.

#### **3.1 Program States and Variables**

Every ReDiP-program P operates on a finite set of N-valued *program variables Vars*(P) = {x₁, ..., xₖ}. We do not consider negative or non-integer variables. A *program state* of P is thus a mapping σ : *Vars*(P) → N. As explained in Sect. 1, the key idea is to represent distributions over such program states as PGF. Consequently, we identify a single program state σ with the *monomial* **X**^σ = X₁^σ(x₁) ⋯ Xₖ^σ(xₖ), where X₁, ..., Xₖ are indeterminates representing the program variables x₁, ..., xₖ. We will stick to this notation: throughout the whole paper, we typeset program variables as x and the corresponding FPS indeterminate as X. The initial program state on which a given ReDiP-program is supposed to operate must always be stated explicitly.

#### **3.2 Syntax of** ReDiP

The syntax of ReDiP is defined inductively, see the leftmost column of Table 2. Here, x and y are program variables, n ∈ N is a constant, D is a *distribution expression* (see Table 3), and P₁, P₂ are ReDiP-programs. The general idea of ReDiP is to provide a minimal core language to keep the theory simple. Many other common language constructs, such as linear arithmetic updates x := 2y + 3, are expressible in this core language. See [18, Appx. A] for a complete specification.


**Table 2.** Syntax and semantics of ReDiP. g is the input PGF.

**Table 3.** A non-exhaustive list of common discrete distributions with rational PGF. The parameters p, n, and λ are a probability, a natural, and a non-negative real number, respectively. T is a reserved placeholder indeterminate.


The word "rectangular" in ReDiP emphasizes that our if-guards can only identify *axis-aligned hyper-rectangles*<sup>2</sup> in N<sup>k</sup>, but no more general polyhedra. These *rectangular guards* <sup>x</sup> < n have the fundamental property that they preserve rational PGF. On the other hand, allowing more general guards like <sup>x</sup> < <sup>y</sup> breaks this property (see [21] and our comments in [18, Appx. B].

The most intricate feature of ReDiP is the – potentially unbounded – loop while (<sup>x</sup> < n) {P}. A program that does not contain loops is called *loop-free*.

<sup>2</sup> More precisely, we can simulate statements like if (R) *{*...*}* else *{*...*}*, where R is a finite Boolean combination of rectangular guards, through appropriate nesting of if statements; note that such an R is indeed a finite union of axis-aligned rectangles in N^k.

#### **3.3 The Statement** x **+=** iid**(***D,* y**)**

The novel iid statement is the heart of the loop-free fragment of ReDiP – it subsumes both x := D ("assign a D-distributed sample to x") and the standard assignment x := y. We include the assign-increment (+=) version of iid in the core fragment of ReDiP for technical reasons; the assignment x := iid(D, y) can be recovered as syntactic sugar by simply setting x := 0 beforehand.

Intuitively, the meaning of x += iid(D, y) is as follows: the right-hand side iid(D, y) can be seen as a function that takes the current value v of variable y, draws v i.i.d. samples from distribution D, computes the sum of all these samples, and finally increments x by the so-obtained value. For example, to perform x := y, we may just write x := iid(dirac(1), y), as this will draw y times the number 1, then sum up these y many 1's to obtain the result y and assign it to x. Similarly, to assign a random sample from, say, a uniform distribution to x, we can execute y := 1 ⨟ x := iid(unif(n), y).
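This intuitive reading is directly executable as a sampler. The sketch below is our own operational illustration (the helper names are ours, not ReDiP syntax); the dirac(1) case shows how iid recovers the plain assignment x := y:

```python
import random

def iid_increment(x, sample_d, y):
    """x += iid(D, y): draw y i.i.d. samples from D and add their sum to x."""
    return x + sum(sample_d() for _ in range(y))

# x := iid(dirac(1), y) recovers the plain assignment x := y:
# summing y copies of the constant 1 yields exactly y.
print(iid_increment(0, lambda: 1, 7))  # 7

# A random instance: y i.i.d. bernoulli(0.5) samples (the sum is binomial).
random.seed(0)
print(iid_increment(0, lambda: random.random() < 0.5, 10))
```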

But iid is not only useful for defining standard operations. In fact, taking sums of i.i.d. samples is common in probability theory. The *binomial distribution* with parameters p ∈ (0, 1) and n ∈ N, for example, is defined as the sum of n i.i.d. Bernoulli-p-distributed samples and thus

x := binomial(p, y) is equivalent to x := iid(bernoulli(p), y)

for all constants p ∈ (0, 1). Similarly, the *negative* (p, n)-binomial distribution is the sum of n i.i.d. geometric-p-distributed samples. Overall, iid renders the loop-free fragment of ReDiP *strictly more expressive* than it would be if we had included only x := D and x := y instead. As a consequence, since we use loop-free programs as a specification language (see Sect. 5), iid enables us to write more expressive program specifications while retaining decidability.
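The PGF view makes the binomial equivalence easy to check: the PGF of a sum of independent variables is the product of their PGF, so n i.i.d. bernoulli(p) draws have PGF ((1 − p) + pX)^n, which is exactly the binomial(p, n) PGF. A sympy sanity check of our own, instantiated at p = 1/2, n = 4:

```python
import sympy as sp

X = sp.symbols('X')
p = sp.Rational(1, 2)
n = 4

# PGF of bernoulli(p) is (1 - p) + p*X; the sum of n i.i.d. samples has
# PGF ((1 - p) + p*X)**n, which must match the binomial(p, n) PGF.
iid_pgf = sp.expand(((1 - p) + p * X) ** n)
binom_pgf = sum(sp.binomial(n, k) * p**k * (1 - p)**(n - k) * X**k
                for k in range(n + 1))
print(sp.simplify(iid_pgf - binom_pgf))  # 0
```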

#### **4 Interpreting** ReDiP **with PGF**

In this section, we explain the PGF-based semantics of our language, which is given in the second column of Table 2. The overall idea is to view a ReDiP-program P as a *distribution transformer* [44,46]. This means that the input to P is a *distribution* over initial program states (inputting a deterministic state is just the special case of a Dirac distribution), and the output is a distribution over final program states. With this interpretation, if one regards distributions as *generalized program states* [33], a probabilistic program is actually *deterministic*: the same input distribution always yields the same output distribution. The goal of our PGF-based semantics is to construct an *interpreter* that executes a ReDiP-program statement-by-statement in forward direction, transforming one generalized program state into the next. We stress that these generalized program states, or distributions, can in general have infinite support. For example, the program x := geometric(0.5) outputs a geometric distribution – which has infinite support – on x.

#### **4.1 A Domain for Distribution Transformation**

We now define a domain, i.e., an *ordered* structure, where our programs' in- and output distributions live. Following the general idea of this paper, we encode them as PGF. Let *Vars* be a fixed finite set of program variables x₁, ..., xₖ and let **X** = (X₁, ..., Xₖ) be corresponding formal indeterminates. We let PGF = {g ∈ R[[**X**]] | g is a PGF} denote the set of all PGF. Recall that this also includes sub-PGF (Definition 2). Further, we equip PGF with the pointwise order, i.e., we let g ⊑ f iff g(σ) ≤ f(σ) for all σ ∈ N^k. It is clear that (PGF, ⊑) is a partial order that is moreover ω*-complete*, i.e., there exists a least element 0 and all ascending chains Γ = {g₀ ⊑ g₁ ⊑ ...} in PGF have a least upper bound sup Γ ∈ PGF. The maxima in (PGF, ⊑) are precisely the PGF that are not sub-PGF, i.e., those with total mass 1.

#### **4.2 From Programs to PGF Transformers**

Next we explain how distribution transformation works using (P)GF (cf. Table 1). This is in contrast to the PGF semantics from [42] which operates on infinite sums in a non-constructive fashion.

**Definition 3 (The PGF Transformer** ⟦P⟧**).** *Let* P *be a* ReDiP*-program. The* PGF transformer ⟦P⟧ : PGF → PGF *is defined inductively on the structure of* P *through the second column in Table 2.*

We show in Theorem 2 below that ⟦P⟧ is well-defined. For now, we go over the statements in the language ReDiP and explain the semantics.

*Sequential Composition.* The semantics of P₁ ⨟ P₂ is straightforward and intuitive: first execute P₁ on g and then P₂ on ⟦P₁⟧(g), i.e., ⟦P₁ ⨟ P₂⟧(g) = ⟦P₂⟧(⟦P₁⟧(g)). The fact that our semantics transformer moves *forwards* through the program – as program interpreters usually do – is due to this definition.

*Conditional Branching.* To translate if (x < n) {P₁} else {P₂}, we follow the standard procedure which partitions the input distribution according to x < n and x ≥ n, processes the two parts independently, and finally recombines the results [44]. We realize the partitioning using the (formal) *Taylor series expansion*. This is feasible because we only allow *rectangular* guards of the form x < n, where n is a constant. Thus, for a given input PGF g, the *filtered PGF* g_{x<n} is obtained by expanding g in its first n terms. The else-part is obviously g_{x≥n} = g − g_{x<n}. We then evaluate ⟦P₁⟧(g_{x<n}) + ⟦P₂⟧(g_{x≥n}) recursively.

*Assigning a Constant.* Technically, our semantics realizes an assignment x := n in two steps: it first sets x to 0 and then increments it by n. The former is achieved by substituting 1 for X, which corresponds to computing the marginal distribution in all variables except X. For example,


where the rightmost four lines explain this annotation style [42]. Note that <sup>0</sup>.5<sup>Y</sup> <sup>2</sup> + 0.5Y <sup>3</sup> is indeed the marginal of the input distribution in Y .
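Both operations are easy to mechanize on rational closed forms. The sympy sketch below is our own illustration (the two-variable input h is hypothetical, but chosen so that its Y-marginal matches the one mentioned above): it truncates a geometric PGF at the guard x < 2 and marginalizes x away via X ↦ 1:

```python
import sympy as sp

X, Y = sp.symbols('X Y')

# Guard filtering: for g = 1/(2 - X) (geometric(1/2) in x) and guard x < 2,
# the filtered PGF keeps the first 2 Taylor terms in X.
g = 1 / (2 - X)
g_lt = sp.series(g, X, 0, 2).removeO()   # X/4 + 1/2
g_geq = sp.simplify(g - g_lt)            # remaining mass, where x >= 2

# Assigning a constant: x := n first sets x to 0 by substituting 1 for X,
# i.e. marginalizing x away. A hypothetical two-variable input:
h = sp.Rational(1, 2) * X * Y**2 + sp.Rational(1, 2) * X**3 * Y**3
print(g_lt)                      # X/4 + 1/2
print(sp.expand(h.subs(X, 1)))   # Y**3/2 + Y**2/2, the marginal in Y
```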

*Decrementing a Variable.* Since our program variables cannot take negative values, we define x−− as max(x − 1, 0), i.e., x *monus* (modified minus) 1. Technically, we realize this through if (x < 1) {skip} else {x−−}, i.e., we apply the decrement only to the portion of the input distribution where x ≥ 1. The decrement itself can then be carried out through "multiplication by X⁻¹". Note that X⁻¹ is not an element of R[[X]] because X has no inverse. Instead, the operation gX⁻¹ is an alias for *shift*←(g), which shifts g "to the left" in dimension X. To implement the semantics on top of existing computer algebra software, it is very handy to perform the multiplication by X⁻¹ instead. This is justified because for PGF g with g[X/0] = 0, *shift*←(g) and gX⁻¹ are equal.
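On coefficient prefixes, the left shift is simply dropping the constant coefficient; a minimal sketch of our own:

```python
from fractions import Fraction

def shift_left(coeffs):
    """shift_left(g): drop the constant coefficient, i.e. "multiply by X**-1".
    Agrees with true division by X whenever g[X/0] = 0."""
    return coeffs[1:] + [0]

# Portion of a geometric(1/2) PGF with x >= 1, then decremented:
g_geq1 = [0, Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)]
print(shift_left(g_geq1))  # [1/4, 1/8, 1/16, 0]
```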

*The* iid *Statement.* The semantics of x += iid(D, y) relies on the fact that

$$T_1 \sim \llbracket D \rrbracket,\;\ldots,\; T_n \sim \llbracket D \rrbracket \qquad \text{implies} \qquad \sum_{i=1}^n T_i \;\sim\; \llbracket D \rrbracket^n, \tag{2}$$

where X ∼ g means that the random variable X is distributed according to PGF g (see, e.g., [55, p. 450]). The iid statement generalizes this observation further: if n is not a constant but a random (program) variable y with PGF h(Y), then we perform the *substitution* h[Y/⟦D⟧] (i.e., replace Y by ⟦D⟧ in h) to obtain the PGF of the sum of y-many i.i.d. samples from D. We slightly modify this substitution to g[Y/Y·⟦D⟧[T/X]] in order to (i) not alter y, and (ii) account for the increment of x. For example,

$$\begin{aligned} &0.2 + 0.3Y + 0.5Y^2 \\ &\qquad \mathtt{x\ {+}{=}\ iid(bernoulli(0.5),\,y)} \\ &0.2 + 0.3Y(0.5 + 0.5X) + 0.5Y^2(0.5 + 0.5X)^2 \\ &= 0.2 + 0.15Y + 0.125Y^2 + 0.15XY + 0.25XY^2 + 0.125X^2Y^2 \end{aligned}$$
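This substitution is directly executable with a computer algebra system. The sympy sketch below (our own) reproduces the example, writing the probabilities as exact rationals:

```python
import sympy as sp

X, Y = sp.symbols('X Y')
g = sp.Rational(1, 5) + sp.Rational(3, 10) * Y + sp.Rational(1, 2) * Y**2
d = sp.Rational(1, 2) + sp.Rational(1, 2) * X   # PGF of bernoulli(0.5), in X

# x += iid(bernoulli(0.5), y): substitute Y -> Y * d in the input PGF.
out = sp.expand(g.subs(Y, Y * d))
print(out)

# Spot-check two coefficients against the hand computation above:
print(out.coeff(X, 1).coeff(Y, 1))   # 3/20  (= 0.15)
print(out.coeff(X, 2).coeff(Y, 2))   # 1/8   (= 0.125)
```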

*The* while*-Loop.* The fixed point semantics of the while loop is standard [42,44] and reflects the intuitive *unrolling rule*, namely that while (ϕ) {P} is equivalent to if (ϕ) {P while (ϕ) {P}} else {skip}. Indeed, the fixed point formula in Table 2 can be derived using the semantics of if discussed above. We revisit this fixed point characterization in Sect. 5.1.

*Properties of* ⟦P⟧*.* Our PGF semantics has the property that all programs – except while loops – are able to operate on the input PGF in (rational) *closed form*, i.e., they never have to expand the input as an infinite series (which is of course impossible in practice). More formally:

**Theorem 1 (Closed-Form Preservation).** *Let* P *be a* loop-free ReDiP*-program, and let* g = h/f ∈ PGF *be in rational closed form. Then we can compute a rational closed form of* ⟦P⟧(g) ∈ PGF *by applying the transformations in Table 2.*

The proof is by induction over the structure of P, noticing that all the necessary operations (substitution, differentiation, etc.) preserve rational closed forms, see [18, Appx. D]. A slight extension of our syntax, e.g., admitting non-rectangular guards, breaks closed-form preservation, see [18, Appx. B]. Moreover, ⟦P⟧ has the following *healthiness* [46] properties:

**Theorem 2 (Properties of** ⟦P⟧**).** *The PGF transformer* ⟦P⟧ *is*


#### **4.3 Probabilistic Termination**

Due to the presence of possibly unbounded while-loops, a ReDiP-program does not necessarily halt, or may do so only with a certain probability. Our semantics naturally captures the termination probability.

**Definition 4 (AST).** *A* ReDiP*-program* P *is called* almost-surely terminating *(AST) for PGF* g *if* ⟦P⟧(g)[**X**/**1**] = g[**X**/**1**]*, i.e., if it does not leak probability mass.* P *is called* universally *AST (UAST) if it is AST for all* g ∈ PGF*.*

Note that all loop-free ReDiP-programs are UAST. In this paper, (U)AST only plays a minor role. Nonetheless, the proof rule below yields a stronger result (cf. Lemma 2) if the program is UAST. There exist various techniques and tools for proving (U)AST [17,47,50].

# **5 Reasoning About Loops**

We now focus on loops L = while (ϕ) {P}. Recall from Table 2 that ⟦L⟧ : PGF → PGF is defined as the *least fixed point* of a higher-order functional

$$\Psi_{\varphi, P} \colon \; (\mathsf{PGF} \to \mathsf{PGF}) \;\to\; (\mathsf{PGF} \to \mathsf{PGF})\,.$$

Following [42], we show that <sup>Ψ</sup>ϕ,P is sufficiently well-behaved to allow reasoning about loops by *fixed point induction*.

#### **5.1 Fixed Point Induction**

To apply fixed point induction, we need to lift our domain PGF from Sect. 4.1 by one order to (PGF → PGF), the domain of *PGF transformers*. This is because the functional Ψϕ,P operates on PGF transformers and can thus be seen as a second-order function (this point of view regards PGF as first-order objects). Recall that, in contrast to this, the function ⟦P⟧ is first-order – it is just a PGF transformer. The order on (PGF → PGF) is obtained by lifting the order on PGF pointwise (we denote it with the same symbol ⊑). This implies that (PGF → PGF) is also an ω-complete partial order. We can then show that Ψϕ,P (see Table 2) is a continuous function. With these properties, we obtain the following induction rule for upper bounds on ⟦L⟧, cf. [42, Theorem 6]:

**Lemma 1 (Fixed Point Induction for Loops).** *Let* L = while (ϕ) {P} *be a* ReDiP*-loop. Further, let* ψ : PGF → PGF *be a PGF transformer. Then*

> Ψϕ,P(ψ) ⊑ ψ implies ⟦L⟧ ⊑ ψ.

The goal of the rest of the paper is to *apply the rule from* Lemma 1 *in practice*. To this end, we must somehow specify an *invariant* such as ψ by finite means. Since ψ is of type (PGF → PGF), we consider ψ as a program I – more specifically, a ReDiP-program – and identify ψ = ⟦I⟧. Further, by definition

$$\Psi_{\varphi, P}(\llbracket I \rrbracket) \;=\; \llbracket \mathtt{if}\,(\varphi)\,\{P \,⨟\, I\}\ \mathtt{else}\ \{\mathtt{skip}\} \rrbracket\,,$$

and thus the term Ψϕ,P(⟦I⟧) is also a PGF transformer expressible as a ReDiP-program. These observations and Lemma 1 imply the following:

**Lemma 2.** *Let* L = while (ϕ) {P} *and* I *be* ReDiP*-programs. Then*

$$\llbracket \mathtt{if}\,(\varphi)\,\{P \,⨟\, I\}\ \mathtt{else}\ \{\mathtt{skip}\} \rrbracket \;\sqsubseteq\; \llbracket I \rrbracket \quad \text{implies} \quad \llbracket L \rrbracket \;\sqsubseteq\; \llbracket I \rrbracket\,. \tag{3}$$

*Further, if* L *is UAST (Definition 4), then*

$$\llbracket \mathtt{if}\,(\varphi)\,\{P \,⨟\, I\}\ \mathtt{else}\ \{\mathtt{skip}\} \rrbracket \;=\; \llbracket I \rrbracket \quad \text{iff} \quad \llbracket L \rrbracket \;=\; \llbracket I \rrbracket\,. \tag{4}$$

Lemma 2 effectively reduces checking whether ψ, given as a ReDiP-program I, is an invariant of L to checking *equivalence* of if (ϕ) {P ⨟ I} else {skip} and I, provided L is UAST. If I is loop-free, then the latter two programs are both loop-free and we are left with the task of proving whether they yield the same output distribution for all inputs. We now present a solution to this problem.

#### **5.2 Deciding Equivalence of Loop-free Programs**

Even in the absence of loops, deciding if two given ReDiP-programs are equivalent is non-trivial, as it requires reasoning about infinitely many – possibly infinite-support – distributions on program variables. In this section, we first show that ⟦P₁⟧ = ⟦P₂⟧ is *decidable* for loop-free ReDiP programs P₁ and P₂, and then use this result together with Lemma 2 to obtain the main result of this paper.

**SOP: Second-Order PGF.** Our goal is to check if ⟦P₁⟧(g) = ⟦P₂⟧(g) for *all* g ∈ PGF. To tackle this, we encode whole *sets* of PGF into a single object – an FPS we call *second-order PGF* (SOP). To define SOP, we need a slightly more flexible view on FPS. Recall from Definition 1 that a k-dim. FPS is an array f : N^k → R. Such an f can be viewed equivalently as an l-dim. array with (k−l)-dim. arrays as entries. In the formal sum notation, this is reflected by partitioning **X** = (**Y**, **Z**) and viewing f as an FPS in **Y** *with coefficients that are FPS in the other indeterminates* **Z**. For example,

$$\begin{aligned} (1-Y)^{-1}(1-Z)^{-1} &= 1+Y+Z+Y^2+YZ+Z^2+\ldots \\ &= (1-Z)^{-1}+(1-Z)^{-1}Y+(1-Z)^{-1}Y^2+\ldots \end{aligned}$$

where in the lower line the coefficients (1−Z)⁻¹ are considered elements of R[[Z]].

**Definition 5 (SOP).** *Let* **U** *and* **X** *be disjoint sets of indeterminates. A formal power series* f <sup>∈</sup> <sup>R</sup>[[**U**, **<sup>X</sup>**]] *is a* second-order PGF (SOP) *if*

$$f = \sum\_{\tau \in \mathbb{N}^{|\mathbf{U}|}} f(\tau) \mathbf{U}^{\tau} \quad \text{(with } f(\tau) \in \mathbb{R}[[\mathbf{X}]] \text{)} \qquad \text{implies} \qquad \forall \tau \colon f(\tau) \in \mathsf{PGF}.$$

That is, an SOP is simply an FPS whose coefficients are PGF – instead of generating a sequence of probabilities as PGF do, it generates a *sequence of distributions*. An (important) example SOP is

$$f\_{dirac} = (1 - XU)^{-1} = 1 + XU + X^2U^2 + \dots \in \mathbb{R}[[U, X]],\tag{5}$$

i.e., for all i ≥ 0, f*dirac*(i) = X^i = dirac(i). As a second example, consider f*binom* = f*dirac*[X/0.5+0.5X]; it is clear that f*binom*(i) = (0.5+0.5X)^i = binomial(0.5, i) for all i ≥ 0. Note that if **U** = ∅, then SOP and PGF coincide. For fixed **X** and **U**, we denote the set of all second-order PGF by SOP.
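Both example SOP have rational closed forms that a computer algebra system can expand. The sympy sketch below (our own) confirms that the coefficient of U² in f*binom* is indeed the binomial(0.5, 2) PGF:

```python
import sympy as sp

X, U = sp.symbols('X U')

# f_dirac = 1/(1 - X*U): the coefficient of U^i is the point mass X^i.
f_dirac = 1 / (1 - X * U)

# f_binom = f_dirac[X / (1/2 + X/2)]: the coefficient of U^i is binomial(1/2, i).
f_binom = f_dirac.subs(X, sp.Rational(1, 2) + X / 2)

prefix = sp.expand(sp.series(f_binom, U, 0, 3).removeO())
c2 = prefix.coeff(U, 2)
print(sp.factor(c2))  # (X + 1)**2/4, i.e. (1/2 + X/2)**2
```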

**SOP Semantics of** ReDiP**.** The appeal of SOP is that, syntactically, they are still formal power series, and some can be represented in closed form just like PGF. Moreover, we can readily extend our PGF transformer ⟦P⟧ to an SOP transformer ⟦P⟧ : SOP → SOP. A key insight of this paper is that – without any changes to the rules in Table 2 – applying ⟦P⟧ to an SOP is the same as applying ⟦P⟧ *simultaneously* to all the PGF it subsumes:

**Theorem 3.** *Let* P *be a* ReDiP*-program. The transformer* ⟦P⟧ : SOP → SOP *is well-defined. Further, if* f = ∑<sub>τ∈N<sup>|**U**|</sup></sub> f(τ)**U**<sup>τ</sup> *is an SOP, then*

$$\llbracket P \rrbracket(f) \;=\; \sum\_{\tau \in \mathbb{N}^{|\mathbf{U}|}} \llbracket P \rrbracket\bigl(f(\tau)\bigr)\,\mathbf{U}^{\tau} \ . $$
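Theorem 3 can be illustrated concretely on a one-statement program. The standard PGF rule for the assignment x := x + 1 (shifting a distribution on x by one multiplies its PGF by X) applied to the SOP f*dirac* agrees with applying it coefficientwise to every dirac(i); a Sympy sketch, assuming only that shift rule:

```python
import sympy as sp

X, U = sp.symbols('X U')

def transform(g):
    # PGF transformer of the single statement x := x + 1:
    # shifting a distribution on x by one multiplies its PGF by X.
    return X * g

f_dirac = 1 / (1 - X * U)   # SOP whose U^i-coefficient is dirac(i) = X^i
out = sp.series(transform(f_dirac), U, 0, 4).removeO()

# Coefficientwise application, as claimed by Theorem 3:
same = all(
    sp.expand(out.coeff(U, i) - transform(X ** i)) == 0
    for i in range(4)
)
```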

**An SOP Transformation for Proving Equivalence.** We now show how to exploit Theorem 3 for equivalence checking. Let P<sub>1</sub> and P<sub>2</sub> be (loop-free) ReDiP-programs; we are interested in proving whether ⟦P<sub>1</sub>⟧ = ⟦P<sub>2</sub>⟧. By linearity, it holds that ⟦P<sub>1</sub>⟧ = ⟦P<sub>2</sub>⟧ iff ⟦P<sub>1</sub>⟧(**X**<sup>σ</sup>) = ⟦P<sub>2</sub>⟧(**X**<sup>σ</sup>) for all σ ∈ N<sup>k</sup>, i.e., to check equivalence it suffices to consider all (infinitely many) point-mass PGF as inputs.

**Lemma 3 (SOP-Characterisation of Equivalence).** *Let* P<sub>1</sub> *and* P<sub>2</sub> *be* ReDiP*-programs with Vars*(P<sub>i</sub>) ⊆ {x<sub>1</sub>,..., x<sub>k</sub>} *for* i ∈ {1, 2}*. Further, consider a vector* **U** = (U<sub>1</sub>,...,U<sub>k</sub>) *of meta indeterminates, and let* g<sub>**X**</sub> *be the SOP*

$$g\mathbf{x} \ = \ (1 - X\_1 U\_1)^{-1} (1 - X\_2 U\_2)^{-1} \cdots (1 - X\_k U\_k)^{-1} \ \in \ \mathbb{R}[[\mathbf{U}, \mathbf{X}]] \ \ .$$

*Then* ⟦P<sub>1</sub>⟧ = ⟦P<sub>2</sub>⟧ *if and only if* ⟦P<sub>1</sub>⟧(g<sub>**X**</sub>) = ⟦P<sub>2</sub>⟧(g<sub>**X**</sub>)*.*

The proof of Lemma 3 (see [18, Appx. F.5]) relies on Theorem 3 and the fact that the *rational* SOP g<sub>**X**</sub> generates all (multivariate) point-mass PGF; in fact it holds that g<sub>**X**</sub> = ∑<sub>σ∈N<sup>k</sup></sub> **X**<sup>σ</sup>**U**<sup>σ</sup>, i.e., g<sub>**X**</sub> generalizes f*dirac* from (5). It follows:

**Lemma 4.** ⟦P<sub>1</sub>⟧ = ⟦P<sub>2</sub>⟧ *is decidable for loop-free* ReDiP*-programs* P<sub>1</sub>, P<sub>2</sub>*.*

Our main theorem follows immediately from Lemmas 2 and 4:

**Theorem 4.** *Let* L = while (ϕ) {P} *be UAST with loop-free body* P *and* I *be a loop-free* ReDiP*-program. It is decidable whether* ⟦L⟧ = ⟦I⟧*.*

*Example 2.* In Fig. <sup>2</sup> we prove that the two UAST programs L and I

$$\mathtt{while}\,(\mathtt{n}>0)\,\left\{ \begin{array}{l} \mathtt{c} \mathrel{{+}{=}} \mathtt{iid}(\mathtt{geometric}(1/2),\mathtt{n})\,; \\ \mathtt{n} := \mathtt{n}-1 \end{array} \right\}$$

**Fig. 2.** Program equivalence follows from the equality of the resulting SOP (Lemma 3).

are equivalent (i.e., ⟦L⟧ = ⟦I⟧) by showing that ⟦if (n > 0) {P; I}⟧ = ⟦I⟧, as suggested by Lemma 2. The latter is achieved as in Lemma 3: we run both programs on the input SOP g<sub>N,C</sub> = (1 − NU)<sup>−1</sup>(1 − CV)<sup>−1</sup>, where U, V are meta indeterminates corresponding to N and C, respectively, and check if the results are equal. Note that I is the loop-free specification from Example 1; thus by transitivity, the loop L is equivalent to the loop in Fig. 1.
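Lemma 3 reduces equivalence of loop-free programs to a rational-function identity that a CAS can decide. As a small illustration (on a pair of toy programs, not those of Fig. 2), one can check ⟦x += bernoulli(1/2); x += bernoulli(1/2)⟧ = ⟦x += binomial(1/2, 2)⟧ by applying both transformers to g<sub>X</sub> = (1 − XU)<sup>−1</sup> and asking whether the difference simplifies to zero; the PGF rules used (adding an independent bernoulli(1/2) sample multiplies the PGF by 1/2 + X/2, and binomial(1/2, 2) by its square) are standard facts:

```python
import sympy as sp

X, U = sp.symbols('X U')
half = sp.Rational(1, 2)

def bernoulli_step(g):
    # x += bernoulli(1/2): multiply the PGF by (1/2 + 1/2*X)
    return (half + half * X) * g

def binomial2_step(g):
    # x += binomial(1/2, 2): multiply the PGF by (1/2 + 1/2*X)^2
    return (half + half * X) ** 2 * g

g_X = 1 / (1 - X * U)   # rational SOP generating all point-mass PGFs
lhs = bernoulli_step(bernoulli_step(g_X))
rhs = binomial2_step(g_X)
equivalent = sp.simplify(lhs - rhs) == 0   # True: the programs are equivalent
```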

# **6 Case Studies**

We have implemented our techniques in Python as a prototype called Prodigy<sup>3</sup>: PRObability DIstributions via GeneratingfunctionologY. By interfacing with different computer algebra systems (CAS), e.g., Sympy [49] and GiNaC [10,57] – as backends for symbolic computation of PGF and SOP semantics – Prodigy decides whether a given probabilistic loop agrees with an (invariant) specification encoded as a loop-free ReDiP program. Furthermore, it supports efficient queries on various quantities associated with the output distribution.
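Such queries can be answered directly on a closed-form PGF; for instance, the expected value of a univariate PGF g is g′(1). A hedged sketch in plain Sympy (independent of Prodigy's actual API), using the geometric(1/2) distribution on {0, 1, 2, ...} with PGF (1/2)/(1 − X/2):

```python
import sympy as sp

X = sp.symbols('X')

def expected_value(g):
    # For a PGF g(X) = sum_i Pr(x = i) * X^i, the mean is g'(1).
    return sp.diff(g, X).subs(X, 1)

# geometric(1/2) with support {0, 1, 2, ...}: PGF (1/2) / (1 - (1/2)*X)
g = sp.Rational(1, 2) / (1 - sp.Rational(1, 2) * X)
mean = expected_value(g)   # 1, i.e., (1 - p)/p for p = 1/2
```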

In what follows, we demonstrate in particular the applicability of our techniques to programs featuring stochastic dependency, parametrization, and nested loops. The examples are all presented in the same way: the iterative program on the left side and its corresponding specification on the right. The presented programs are all UAST, provided that the parameters are instantiated from a suitable value domain.<sup>4</sup> For each example, we report the time for performing the equivalence check on a 2.4 GHz Intel i5 Quad-Core processor with 16 GB RAM running macOS Monterey 12.0.1. Additional examples can be found in [18, Appx. E].


**Fig. 3.** Generating complementary binomial distributions (for n, m) by coin flips. binomial(<sup>1</sup>/<sup>2</sup>, c) is an alias for iid(bernoulli(<sup>1</sup>/<sup>2</sup>), c).


**Fig. 4.** A program modeling two dueling cowboys with parametric hit probabilities.

*Example 3 (Complementary Binomial Distributions).* We show that the program in Fig. <sup>3</sup> generates a joint distribution on <sup>n</sup>, <sup>m</sup> such that both <sup>n</sup> and <sup>m</sup> are binomially distributed with support c and are complementary in the sense that n + m = c holds certainly (if n = m = 0 initially, otherwise the variables

<sup>3</sup> https://github.com/LKlinke/Prodigy.

<sup>4</sup> Parameters of Example <sup>4</sup> have to be instantiated with a probability value in (0, 1).

are incremented by the corresponding amounts). Prodigy automatically checks that the loop agrees with the specification in 18.3 ms. The resulting distribution can then be analyzed for any given input PGF g by computing ⟦I⟧(g), where I is the loop-free program. For example, for input g = C<sup>10</sup>, the distribution as computed by Prodigy has the *factorized* closed form ((M+N)/2)<sup>10</sup>. The CAS backends exploit such factorized forms to perform algebraic manipulations more efficiently compared to fully expanded forms. For instance, we can evaluate the queries *E*[m<sup>3</sup>+2mn+n<sup>2</sup>] = 235 or Pr(m > 7 ∧ n < 3) = 7/128 almost instantly.
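The reported numbers can be reproduced independently by brute-force expansion of the output PGF — a sanity check, not Prodigy's factorized evaluation. The coefficient of M<sup>i</sup>N<sup>j</sup> in ((M+N)/2)<sup>10</sup> is Pr(m = i, n = j):

```python
import sympy as sp

M, N = sp.symbols('M N')

# Output PGF for input C^10: ((M + N) / 2)^10; fully expanded on purpose
g = sp.Poly(sp.expand(((M + N) / 2) ** 10), M, N)

E = sp.Rational(0)    # accumulates E[m^3 + 2mn + n^2]
pr = sp.Rational(0)   # accumulates Pr(m > 7 and n < 3)
for (i, j), prob in zip(g.monoms(), g.coeffs()):
    E += prob * (i ** 3 + 2 * i * j + j ** 2)
    if i > 7 and j < 3:
        pr += prob
# E == 235 and pr == 7/128, matching the queries above
```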

*Example 4 (Dueling Cowboys* [46]*).* The program in Fig. 4 models a duel of two cowboys with *parametric* hit probabilities a and b. Variable t indicates the cowboy who is currently taking his shot, and c monitors the state of the duel (c = 1: duel is still running, c = 0: duel is over). Prodigy automatically verifies the specification in 11.97 ms. We defer related problems – e.g., *synthesizing* parameter values to meet a parameter-free specification – to future work.

**Fig. 5.** Nested loops with invariants for the inner and outer loop.

*Example 5 (Nested Loops).* The inner loop of the program in Fig. 5 modifies x, which influences the termination behavior of the outer loop. Intuitively, the program models a random walk on N: in every step, the value of the current position x changes by some random δ ∈ {−1, 0, 1, 2,...} such that δ + 1 is geometrically distributed. The example demonstrates how our technique enables *compositional* reasoning. We first provide a loop-free specification for the inner loop, prove its correctness, and then simply *replace* the inner loop by its specification, yielding a program without nested loops. This feature is a key benefit of reusing the loop-free fragment of ReDiP as a specification language. Moreover, existing techniques that cannot handle nested loops can profit from it; in fact, we can prove the overall program to be UAST using the rule of [47]. Interestingly, the outer loop has *infinite expected runtime* (for any input distribution where the probability that x > 0 is positive). We can prove this by *querying the expected value* of the program variable c in the resulting output distribution. The automatically computed result is ∞, which indeed proves that the expected runtime of this program is not finite. This example furthermore shows that our technique can be generalized beyond rational functions, since the PGF of the catalan(p) distribution is (1 − √(1 − 4p(1−p)T)) / 2p, i.e., algebraic but not rational. We leave a formal generalization of the decidability result from Theorem 4 to algebraic functions for future work. Prodigy verifies this example in 29.17 ms.
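As an independent sanity check of the stated closed form at p = 1/2 (where 4p(1−p) = 1), the PGF evaluates to total mass 1 at T = 1, and its T<sup>k+1</sup>-coefficient is the k-th Catalan number scaled by (1/2)<sup>2k+1</sup> — a sketch assuming only the closed form as printed above:

```python
import sympy as sp
from math import comb

T = sp.symbols('T')
p = sp.Rational(1, 2)

# Stated closed form of the catalan(p) PGF, algebraic but not rational:
g = (1 - sp.sqrt(1 - 4 * p * (1 - p) * T)) / (2 * p)

mass = g.subs(T, 1)   # total probability mass at p = 1/2: expect 1
ser = sp.series(g, T, 0, 5).removeO()
# coefficient of T^(k+1) should be the Catalan number C_k times (1/2)^(2k+1)
coeffs_ok = all(
    ser.coeff(T, k + 1) == sp.Rational(comb(2 * k, k) // (k + 1), 2 ** (2 * k + 1))
    for k in range(4)
)
```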

*Scalability Issue.* It is not difficult to construct programs on which Prodigy scales poorly: its performance depends strongly on the number of consecutive probabilistic branches and on the size of the constant n in guards (which requires n-th order PGF derivatives, cf. Table 2).

# **7 Related Work**

This section surveys research efforts that are highly related to our approach in terms of semantics, inference, and equivalence checking of probabilistic programs.

*Forward Semantics of Probabilistic Programs.* Kozen established in his seminal work [43] a generic way of giving forward, denotational semantics to probabilistic programs as *distribution transformers*. Klinkenberg et al. [42] instantiated Kozen's semantics as PGF transformers. We refine the PGF semantics substantially such that it enjoys the following crucial properties: (i) our PGF transformers (when restricted to loop-free ReDiP programs) preserve closed-form PGF and thus are effectively constructible. In contrast, the existing PGF semantics in [42] operates on infinite sums in a non-constructive fashion; (ii) our PGF semantics naturally extends to SOP, which serves as the key to reason about the exact behavior of unbounded loops (under possibly uncountably many inputs) in a fully automatic manner. The PGF semantics in [42], however, supports only (over-)approximations of looping behaviors and can hardly be automated; and (iii) our PGF semantics is capable of interpreting program constructs such as i.i.d. sampling, which are of particular interest in practice.

*Backward Semantics of Probabilistic Programs.* Many verification systems for probabilistic programs make use of backward, denotational semantics – most pertinently, the *weakest preexpectation* (WP) calculi [38,46] as a quantitative extension of Dijkstra's weakest preconditions [19]. The WP of a probabilistic program C w.r.t. a postexpectation g, denoted by wp⟦C⟧(g)(·), maps every initial program state σ to the expected value of g evaluated in the final states reached after executing C on σ. In contrast to Dijkstra's predicate transformer semantics, which admits also strongest postconditions, the counterpart of "strongest postexpectations" does unfortunately not exist [36, Chap. 7], and is thus not amenable to forward reasoning. We remark, in particular, that checking program equivalence via WP is difficult, if not impossible, since it amounts to reasoning about uncountably many postexpectations g. We refer interested readers to [5, Chaps. 1–4] for more recent advancements in formal semantics of probabilistic programs.

*Probabilistic Inference.* There are a handful of probabilistic systems that employ an alternative forward semantics based on *probability density function* (PDF) representations of distributions, e.g., (λ)PSI [24,25], AQUA [32], Hakaru [14,52], and the density compiler in [11,12]. These systems are dedicated to probabilistic inference for programs encoding continuous distributions (or joint discretecontinuous distributions). Reasoning about the underlying PDF representations, however, amounts to resolving complex integral expressions in order to answer inference queries, thus confining these techniques either to (semi-)numerical methods [11,12,14,32,52] or exact methods yet limited to bounded looping behaviors [24,25]. Apart from these inference systems, a recently developed language called Dice [31] featuring exact inference for discrete probabilistic programs is also confined to statically bounded loops. The tool Mora [7,8] supports exact inference for various types of Bayesian networks, but relies on a restricted form of intermediate representation known as prob-solvable loops, whose behaviors can be expressed by a system of C-finite recurrences admitting closed-form solutions.

*Equivalence of Probabilistic Programs.* Murawski and Ouaknine [51] showed an Exptime decidability result for checking the equivalence of probabilistic programs over *finite* data types by recasting the problem in terms of probabilistic finite automata [23,41,56]. Their techniques have been automated in the equivalence checker APEX [45]. Barthe et al. [4] proved a 2-Exptime decidability result for checking equivalence of *straight-line* probabilistic programs (with deterministic inputs and no loops nor recursion) interpreted over all possible extensions of a finite field. Barthe et al. [3] developed a relational Hoare logic for probabilistic programs, which has been extensively used for, amongst others, proving program equivalence with applications in provable security and side-channel analysis.

The decidability result established in this paper is *orthogonal* to the aforementioned results: (i) our decidability for checking L <sup>∼</sup> S applies to discrete probabilistic programs L with *unbounded* looping behaviors over a possibly *infinite* state space; the specification S – though, admitting no loops – encodes a possibly *infinite-support* distribution; yet as a compromise, (ii) our decidability result is confined to ReDiP programs that necessarily terminate almost-surely on all inputs, and involve only distributions with rational closed-form PGF.

#### **8 Conclusion and Future Work**

We showed the decidability of – and presented a fully automated technique for verifying – whether a (possibly unbounded) probabilistic loop is equivalent to a loop-free specification program. Future directions include determining the complexity of our decision problem; amending the method to continuous distributions using, e.g., *characteristic functions*; extending the notion of probabilistic equivalence to probabilistic refinements; exploring PGF-based counterexample-guided synthesis of quantitative loop invariants (see [18, Appx. F.6] for generating counterexamples); and tackling Bayesian inference.

**Acknowledgments.** The authors thank Philipp Schröer for providing support for his tool Probably (https://github.com/Philipp15b/Probably), which forms the basis of our implementation.


# **Abstraction-Refinement for Hierarchical Probabilistic Models**

Sebastian Junges1(B) and Matthijs T. J. Spaan<sup>2</sup>

<sup>1</sup> Radboud University, Nijmegen, The Netherlands sjunges@cs.ru.nl <sup>2</sup> Delft University of Technology, Delft, The Netherlands

**Abstract.** Markov decision processes are a ubiquitous formalism for modelling systems with non-deterministic and probabilistic behavior. Verification of these models is subject to the famous state space explosion problem. We alleviate this problem by exploiting a hierarchical structure with repetitive parts. This structure not only occurs naturally in robotics, but also in probabilistic programs describing, e.g., network protocols. Such programs often repeatedly call a subroutine with similar behavior. In this paper, we focus on a local case, in which the subroutines have a limited effect on the overall system state. The key ideas to accelerate analysis of such programs are (1) to treat the behavior of the subroutine as uncertain and only remove this uncertainty by a detailed analysis if needed, and (2) to abstract similar subroutines into a parametric template, and then analyse this template. These two ideas are embedded into an abstraction-refinement loop that analyses hierarchical MDPs. A prototypical implementation shows the efficacy of the approach.

# **1 Introduction**

Markov Decision Processes (MDPs) are *the* model for sequential decision making under probabilistic uncertainty, and as such are central in modelling of randomized algorithms, distributed systems with lossy channels, or as the underlying formalism in reinforcement learning. A key question in the verification of MDPs is: *What is the maximal probability that some error state is reached?* In this question, one accounts for the probabilistic nature as well as the inherent (potentially adversarial) nondeterminism of the system. Various state-of-the-art probabilistic model checkers, such as Storm [20], Prism [27] and Modest [17], implement a variety of methods that automatically compute such maximal probabilities. Most widespread are variations of value iteration that iteratively apply a transition function to converge towards the requested probability.

*Hierarchical Structure.* Despite various successes, the state space explosion remains a significant challenge to the model-based analysis of MDPs. To overcome this challenge, some approaches exploit symmetries or the parallel composition of a system. Other approaches exploit that typically not all paths through a system are equally likely and thus aim to find the essential or critical subsystem.


(a) Repeated invocation of passToken(p). (b) passToken(p): pass must succeed twice.

**Fig. 1.** Simplified example for sending a token over an unreliable channel.

While we exploit related ideas—a detailed comparison is given in the related work, cf. Sect. 7—our approach is fundamentally different and instead exploits a *hierarchical decomposition* natural in many system models. This decomposition is captured naturally by probabilistic programs (over discrete bounded variables) with non-nested subroutines, where some subroutines are called repeatedly with *similar* arguments. Figure 1 shows an example, which we use in Sect. 2 to demonstrate our approach. More generally, we are interested in systems with an overall task that is achieved by a suitable combination of a limited number of sub-tasks. Such a setting occurs naturally, e.g., (i) in robotics, when multiple rooms in a floor need to be inspected, or (ii) in routing, when multiple packets need to be routed sequentially. The underlying problem structure is also exploited in hierarchical planning [5,19,30], where the goal is to find a good but not necessarily optimal policy (and induced value). *We combine insights from hierarchical planning with an abstraction-refinement perspective and then construct an anytime algorithm with strict guarantees on the result.*

*Local Model-Based Analysis.* An adequate operational model for the model-based analysis of hierarchical systems is given by a *hierarchical MDP*, where the state space of a hierarchical MDP can be partitioned into *subMDPs*. Abstractly, one can represent a hierarchical MDP by the collection of subMDPs and a *macro-level MDP* [19] where the probabilities of outgoing transitions at a state are described by a corresponding subMDP, cf. Sect. 3.2. In this paper, we focus on hierarchical MDPs where the policies that are optimal in (only) a subMDP are optimal (partial) policies in the hierarchical MDP. More intuitively, we can solve the subMDPs individually, i.e., the solution (w.r.t. the fixed measure) for the subMDP is part of the globally optimal solution. While this assumption is restrictive, it is satisfied in various interesting settings. The assumption allows us to analyse subMDPs out-of-context, i.e., we can first analyse the subMDPs and then construct the correct macro-MDP, i.e., extract transition probabilities and rewards from the subMDP analysis. This approach already improves the maximal memory consumption and allows for additional speed-ups if the *same* subMDP occurs multiple times.

*Epistemic Uncertainty During Computation.* The key insight to accelerate the outlined approach further is to avoid analysing all subMDPs precisely, while still providing sound guarantees on the obtained results. Therefore, consider that even before analysing the subMDPs we can analyse an uncertain variant of the macro-level MDP where we do not yet know the associated transition probabilities and rewards but instead only know intervals. We may then do two things: First, we can identify the subMDPs which are most critical, i.e., where replacing the interval by a concrete value yields most benefits. Second, and more importantly, we can analyse a set of subMDPs and refine the associated uncertainties, i.e., tighten the associated intervals. To support the analysis of sets of subMDPs, we observe that often these subMDPs are slight variations of each other. In this paper, we represent them as parameterised instances of a template that we define using parametric MDPs (pMDPs). The resulting intervals can be used to create an (interval-valued version of the) macro-level MDP. Analysing this gives bounds on the expected reward in the hierarchical MDP, and the bounds can be refined by analysing the subMDPs more precisely.

*Contributions.* In a nutshell, we explicitly allow for *uncertainty* during the solving process to speed up the analysis of hierarchical MDPs. Concretely, we contribute a scalable approach to solve hierarchical MDPs with many different sub-MDPs, in particular when these subMDPs are similar, but not the same. The approach resembles an abstraction-refinement loop where we abstract the hierarchical MDP in two layers and then refine the analysis of the lower layer to get a refined representation of the complete MDP. In every step, we can provide absolute error bounds. Our approach interprets the different subMDPs as a form of uncertainty. The efficient analysis originates from progress made in the analysis of uncertain (or parametric) MDPs, and brings that progress to a novel setting. The empirical evaluation with a prototype called level-up shows the efficacy of the approach.

# **2 Overview**

We clarify the approach and its applicability with a motivating example that drastically abstracts a token passing process where the channel quality varies [12].

*Setting.* Consider the protocol in Fig. 1a which sends a token N times via a channel. That channel successfully transmits packets with probability p, where p varies over time. The subroutine takes t amount of time, depending on p. Specifically, in the model, we alternate between accumulating the required time and updating the channel quality for N token transmissions and then return the accumulated time. We aim to compute the expected return value. For the subroutine, we assume that sending a token is repeated until an acknowledgement is received, which is abstractly modelled in Fig. 1b and corresponds to the small Markov chain in Fig. 2a. First, the file must successfully be sent (s<sub>0</sub> → s<sub>1</sub>), then we start sending acknowledgements. The process terminates (s<sub>1</sub> → s<sub>2</sub>) once an acknowledgement is received. The complete protocol from Fig. 1 including the subroutine is reflected by the large Markov chain in Fig. 2b that repeats the small Markov chain (with different probabilities). This model may be analysed with standard tools, but for large N (and larger subroutines), the state space explosion must be alleviated.
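For this small chain, the expected time until s<sub>2</sub> can be computed symbolically by solving the Bellman equations. A sketch, assuming retry loops at s<sub>0</sub> and s<sub>1</sub> with success probability p and a reward of 1 per step (as suggested by the caption of Fig. 2b):

```python
import sympy as sp

p = sp.symbols('p', positive=True)
E0, E1 = sp.symbols('E0 E1')

# Expected time until s2, reward 1 per step spent in s0 or s1:
#   E1 = 1 + (1 - p) * E1           -- keep resending the acknowledgement
#   E0 = 1 + (1 - p) * E0 + p * E1  -- keep resending the token, then await ack
sol = sp.solve([sp.Eq(E1, 1 + (1 - p) * E1),
                sp.Eq(E0, 1 + (1 - p) * E0 + p * E1)], [E0, E1])
expected_time = sp.simplify(sol[E0])   # 2/p: two geometric phases of mean 1/p
```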

(b) Hierarchical MDP, rewards of 1 at states with loops

**Fig. 2.** Ingredients for hierarchical MDPs with the Example from Fig. 1. Annotations reflect subMDPs within the macro-MDPs in Fig. 3.

*Macro-MDPs and Enumeration.* We thus suggest to abstract the hierarchical model into the macro-level MDP in Fig. 3a. Here, every state corresponds to an invocation of the subprocess. The reward at the states corresponds to the expected reward for the complete subprocess. Thus, naively, one may construct the macro-MDP, analyse all (reachable) subMDPs independently and annotate the macro-MDP states with the appropriate rewards, and finally analyse the macro-MDP to obtain a result of ≈12.3. This approach avoids representing the complete hMDP in the memory, but it is still restricted to analysing systems with a limited number of subMDPs.

*Our Approach.* We improve scalability by constructing a parameterized macro-MDP. Reconsider the rewards for Fig. 3a. The values can be computed via the graph in Fig. 3d, where for each value of p (x-axis) we compute the corresponding expected reward E (y-axis) obtained by analysing the subMDP in Fig. 2a. Intuitively, in our abstraction, we annotate the rewards with lower and upper bounds rather than exact values. Therefore, we compute bounds on the rewards by selecting an interval for the values p ∈ [8/25, 25/32], as shown in Fig. 3e. Conceptually, this means that we analyse a set of subMDPs at once, namely all subMDPs with p ∈ [8/25, 25/32]. Annotating the corresponding expected rewards, in this case [64/25, 25/4], then yields the macro-MDP in Fig. 3b. Analysis of this MDP yields that the overall expected time is in [7.68, 18.75]. We refine these bounds by analysing subsets of the subMDPs. We may split the values for p into two sets [8/25, 2/5] and [1/2, 25/32]. Then, we obtain two corresponding intervals on the expected time in the subMDP as shown in Fig. 3f. Model checking the associated macro-MDP in Fig. 3c bounds the expected time by [10.12, 14.25]. Technically, we realize this reasoning using parameter lifting [33].
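For this particular example the subMDP's expected time E(p) = 2/p is monotonically decreasing in p, so interval bounds follow directly from the interval endpoints; parameter lifting computes such sound bounds for general pMDPs without a closed form. A sketch reproducing the interval [64/25, 25/4] and the overall bounds [7.68, 18.75], assuming N = 3 sequential transmissions in the macro-MDP (an assumption matching the reported totals):

```python
def submdp_bounds(p_lo: float, p_hi: float):
    # E(p) = 2/p is decreasing in p, so on [p_lo, p_hi] the expected
    # time lies in [2/p_hi, 2/p_lo]; parameter lifting yields such
    # sound bounds for general parametric MDPs.
    return 2 / p_hi, 2 / p_lo

lo, hi = submdp_bounds(8 / 25, 25 / 32)   # [64/25, 25/4] = [2.56, 6.25]
N = 3          # assumed number of transmissions summed by the macro-MDP
total = (N * lo, N * hi)                  # approximately (7.68, 18.75)
```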

**Fig. 3.** Visualising the computation of expected rewards for the hMDP from Fig. 2b using a macro-MDP and interval-based abstractions.

*Supported Extensions.* For conciseness, this example is necessarily simple. Our approach allows nondeterminism, i.e., action-choices, in the macro-MDP *and* in the subMDPs. The subMDPs may have multiple outgoing transitions, but this must be combined with a restricted type of nondeterminism in the subMDP: If multiple outgoing transitions are present, the macro-MDP has transition probabilities that depend on the subMDPs. We present a useful extension for reachability probabilities, see the discussion at the bottom of Sect. 3.3.

*More Examples.* The key ingredient of models on which the approach excels is a repetitive task whose characteristics depend on some global state. Two variations are the expected energy consumption of a robot with slowly degrading components that, e.g., can be improved by maintenance, and job scheduling with a periodically changing distribution of tasks (e.g., day vs. night).

### **3 Formal Problem Statement**

We formalize MDPs and *hierarchical MDPs* (hMDPs) to pose the problem statement, then identify a subclass of hMDPs which we call *local-policy hMDPs* and restrict our problem to computing optimal expected rewards in local-policy hMDPs. Furthermore, we introduce parametric MDPs as they are key to the abstraction-refinement procedure later in the paper.

#### **3.1 Background**

**Definition 1 (Parametric MDP).** *A parametric MDP (pMDP) is a tuple* M = ⟨S<sub>M</sub>, A<sub>M</sub>, ι<sub>M</sub>, x, P<sub>M</sub>, r<sub>M</sub>, T<sub>M</sub>⟩ *where* S<sub>M</sub> *is a finite set of* states*,* A<sub>M</sub> *is a finite set of* actions*,* ι<sub>M</sub> ∈ S<sub>M</sub> *is the* initial state*,* x = ⟨x<sub>0</sub>,...,x<sub>n</sub>⟩ *is a vector of* parameters*,* P<sub>M</sub> : S<sub>M</sub> × A<sub>M</sub> × S<sub>M</sub> → Q[x] *are the* transition probabilities*,* r<sub>M</sub> : S<sub>M</sub> → Q[x] *the* state rewards*, and* T<sub>M</sub> *is a set of* target states*.*

We drop the subscripts whenever possible. MDPs are *parametric* if x is nonempty and *parameter-free* otherwise. We omit parameters for parameter-free MDPs. We recap some standard notions on pMDPs (and MDPs):

For a (parameter) *valuation* u ∈ R<sup>x</sup>, the *instantiation* M[u] globally substitutes P<sub>M</sub>(s, a, s′) with P<sub>M</sub>(s, a, s′)(u) and r<sub>M</sub>(s) with r<sub>M</sub>(s)(u). An assignment u is well-defined if M[u] constitutes an MDP, i.e., if ∑<sub>s′</sub> P<sub>M</sub>(s, α, s′)(u) ∈ {0, 1} and r<sub>M</sub>(s)(u) ≥ 0 for each s ∈ S, α ∈ A. We denote the set of all well-defined assignments with U<sub>M</sub>. The set Act(s) denotes the enabled actions at state s, Act(s) = {α | ∑<sub>s′</sub> P<sub>M</sub>(s, α, s′) ≠ 0}. If |Act(s)| = 1 for every s ∈ S, then the (parametric) MDP is a (parametric) *Markov chain* (MC). A path π is an (in)finite sequence of states s<sub>0</sub> −α<sub>0</sub>→ s<sub>1</sub> ..., with s<sub>i</sub> ∈ S, α<sub>i</sub> ∈ Act(s<sub>i</sub>), P(s<sub>i</sub>, α<sub>i</sub>, s<sub>i+1</sub>) ≠ 0. For finite π, last(π) denotes the last state of π. We use [s → ♦T] to denote the set of (finite) paths that start in s and reach T only at the end. The reward r(π) along a finite path π is the sum of the state rewards, r(π) := ∑<sub>i</sub> r(s<sub>i</sub>).

*Specifications.* We consider indefinite-horizon expected rewards, i.e., the expected accumulated reward until reaching the target states. We refer to [3,32] for a formal treatment and only introduce notation. The unique probability measure Pr for a set of paths in a parameter-free *Markov chain* M reaching T can be defined using the usual cylinder set construction. We define Pr<sub>M</sub>(s → ♦T) as the probability to reach a state in T, ∫<sub>π∈[s→♦T]</sub> Pr(π) dπ. We then define the expected reward until hitting T as ER<sub>M</sub>(s → ♦T) = ∫<sub>π∈[s→♦T]</sub> Pr(π) · r(π) dπ. In both definitions, if s is the initial state, we simply write ...(♦T). For technical conciseness, we make the standard assumption that target states are reached with probability 1, which ensures that the integral exists and is finite. (Arbitrary) reachability probabilities can nevertheless be modelled using rewards.

*Policies.* In pMDPs, we resolve nondeterminism with policies. In this paper, it suffices to consider *memoryless policies* σ : S → A. The set of such policies is denoted Σ(M). We omit M if it is clear from the context. It is helpful to also consider *partial* policies σ̂ : S ⇀ A. For a pMDP M and a (partial) policy σ̂, the induced dynamics are described by the *induced pMDP* M[σ̂], defined as ⟨S<sub>M</sub>, A<sub>M</sub>, ι<sub>M</sub>, x, P, r<sub>M</sub>, T<sub>M</sub>⟩, where the transition probabilities are given as

$$P(s,\alpha,s') = \begin{cases} P\_{\mathcal{M}}(s,\alpha,s') & \text{if } \hat{\sigma}(s) = \alpha, \\ 0 & \text{otherwise.} \end{cases}$$

If σ̂ is total (not partial), then M[σ̂] is a (parametric) MC. We define the maximal expected reward ER<sup>max</sup><sub>M</sub>(♦T) = max<sub>σ∈Σ</sub> ER<sub>M[σ]</sub>(♦T), and say that a policy σ is optimal if ER<sup>max</sup><sub>M</sub>(♦T) = ER<sub>M[σ]</sub>(♦T).
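The maximal expected reward can be computed by value iteration over the Bellman equations E(s) = r(s) + max<sub>α</sub> ∑<sub>s′</sub> P(s, α, s′) · E(s′). A generic sketch (not the implementation used later in the paper; the tiny MDP and its numbers are purely illustrative):

```python
def max_expected_reward(P, r, targets, iters=10_000, tol=1e-12):
    """Value iteration for the maximal expected reward until reaching targets.

    P[s][a] is a list of (next_state, probability) pairs, r[s] the state
    reward (collected on leaving s), targets a set of absorbing goal states.
    """
    states = list(P)
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        new = {}
        for s in states:
            if s in targets:
                new[s] = 0.0
            else:
                new[s] = r[s] + max(
                    sum(q * v[t] for t, q in P[s][a]) for a in P[s]
                )
        done = max(abs(new[s] - v[s]) for s in states) < tol
        v = new
        if done:
            break
    return v

# Tiny illustrative MDP: action 'a' reaches the goal with probability 1/2
# per attempt, action 'b' with probability 1/4; reward 1 per attempt, so
# the reward-maximizing policy picks the slower action 'b' (value 4).
P = {
    's0': {'a': [('goal', 0.5), ('s0', 0.5)],
           'b': [('goal', 0.25), ('s0', 0.75)]},
    'goal': {'stop': [('goal', 1.0)]},
}
r = {'s0': 1.0, 'goal': 0.0}
v = max_expected_reward(P, r, {'goal'})   # v['s0'] converges to 4.0
```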

*Regions and Parametric Model Checking.* A set R of valuations is called a (rectangular) *region* if R = {u | u<sup>−</sup> ≤ u ≤ u<sup>+</sup>} for adequate bounds u<sup>−</sup>, u<sup>+</sup> ∈ R<sup>x</sup>, using pointwise inequalities, i.e., R is a Cartesian product of intervals of parameter values. We denote this region also by [[u<sup>−</sup>, u<sup>+</sup>]]. For regions, we may compute a lower bound on min<sub>u∈R</sub> ER<sup>max</sup><sub>M[u]</sub>(♦T) and an upper bound on max<sub>u∈R</sub> ER<sup>max</sup><sub>M[u]</sub>(♦T) via *parameter lifting* [33,36].

#### **3.2 Hierarchical MDPs**

We concentrate on solving hierarchical MDPs (hMDPs). We assume that hMDPs are parameter-free and that their topology has some additional known structure.

**Definition 2 (Hierarchical MDPs).** *An MDP* M *with a partitioning of its states* S_M = ⊎_i **S**_i *is a hierarchical MDP if, for all* i*,*

*– there exists a unique* s^i_ι ∈ **S**_i *such that* s^i_ι = ι_M *or* pred_M(s^i_ι) ⊄ **S**_i*, and*
*– for all* s ∈ **S**_i \ {s^i_ι}*, it holds that* s ≠ ι_M *and* pred_M(s) ⊆ **S**_i*.*

The state s^i_ι is called the *entry state* of **S**_i, which we denote entry_i. States s with succ_M(s) ∩ **S**_i = ∅ are called *exit states*. The set succ(i) := succ_M(**S**_i) \ **S**_i contains the *successor states* of partition i. Let Y = max_i |succ(i)|. By adding auxiliary states, we can assume that |succ(i)| = Y for all i. We call partitions with |**S**_i| = 1 *trivial*. We use 𝕀 := {i | |**S**_i| > 1} to denote the indices of the nontrivial partitions. We remark that every MDP can be considered as an hMDP with only trivial partitions.

**Problem:** Given a (hierarchical) MDP M with target states T and η ∈ [0, 1], compute bounds lb, ub with lb ≤ ER^max_M(♦T) ≤ ub and η · ub ≤ lb.

The naive solution to this problem is to ignore the hierarchical structure and solve the MDP monolithically. In this paper, we contribute methods that actively exploit the structure of hierarchical MDPs with |𝕀| ≫ 1. We will make an additional assumption on the structure of the hierarchical MDP.

#### **3.3 Optimal Local Subpolicies and Beyond**

Intuitively, we want to ensure that the optimal policy within the partitions can be computed locally, i.e., on a partition, without taking into account the complete MDP. Each partition within the MDP can then be considered as an individual MDP. In particular, each **S**_i induces a subMDP as follows:

**Definition 3 (subMDP).** *Given a hierarchical MDP* M *and a partition* **S**_i*, the corresponding subMDP is the MDP* M_i := ⟨S_i := **S**_i ∪ succ_M(**S**_i) ∪ {⊥}, A_M ∪ {α_⊥}, ι_i := entry_i, P_i, r_i, G_i⟩ *with* P_i *defined by*

$$P\_i(s, \alpha, s') := \begin{cases} P\_{\mathcal{M}}(s, \alpha, s') & \text{if } s \in \mathcal{S}\_i \text{ and } \alpha \in A\_{\mathcal{M}}, \\ 1 & \text{else } \text{if } s \notin \mathcal{S}\_i, \alpha = \alpha\_\perp, \text{ and } s' = \bot \\ 0 & \text{otherwise.} \end{cases}$$

r_i *is defined by* r_i(s) = r_M(s) *if* s ∈ **S**_i*,* r_i(s) = 0 *otherwise, and* G_i := {⊥}*.*

Thus, for every partition of the hierarchical MDP, the corresponding subMDP additionally contains the successor states and a unique bottom state, which is a target state and simplifies our construction later.
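The construction of Definition 3 can be sketched in a few lines. This is a hypothetical dict-based encoding of our own; `alpha_bot` and `bot` stand for α_⊥ and ⊥:

```python
# Dict-based sketch of Definition 3 (hypothetical encoding): the subMDP keeps
# the transitions inside S_i, while every successor state is rerouted to a
# fresh bottom state "bot" (the target) via the extra action "alpha_bot".
def sub_mdp(P, r, S_i, succ_i):
    Pi = {s: dict(P[s]) for s in S_i}          # original actions inside S_i
    ri = {s: r[s] for s in S_i}
    for s in succ_i | {"bot"}:
        Pi[s] = {"alpha_bot": {"bot": 1.0}}    # successors lead to bottom
        ri[s] = 0.0                            # zero reward outside S_i
    return Pi, ri

P = {"e": {"a": {"u": 1.0}}, "u": {"a": {"v": 0.3, "w": 0.7}}}
r = {"e": 1.0, "u": 2.0}
Pi, ri = sub_mdp(P, r, S_i={"e", "u"}, succ_i={"v", "w"})
print(Pi["v"], ri["u"])  # -> {'alpha_bot': {'bot': 1.0}} 2.0
```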

Likewise, we can (de)compose memoryless policies for the hierarchical MDP as a union of policies on the individual subMDPs. We do this only for nontrivial partitions. Let σ_i : S_i → A denote memoryless policies for M_i and σ′_i the restriction of σ_i to **S**_i. Then (⊔_𝕀 σ_i) : S ⇀ A is the unique partial policy such that

$$(\bigsqcup\_{\mathbb{I}} \sigma\_i)(s) := \sigma'\_i(s) \text{ if } s \in \mathbf{S}\_i, i \in \mathbb{I} \quad \text{and} \quad (\bigsqcup\_{\mathbb{I}} \sigma\_i)(s) := \bot \text{ otherwise.}$$

Intuitively, we want that the union of locally optimal policies, a partial policy, can be completed to a total policy that is optimal.

**Definition 4 (Optimal local subpolicies).** *Given a hierarchical MDP* M *with target states* T *and optimal policies* σ_i ∈ Σ(M_i) *for all* i ∈ 𝕀*. The hierarchical MDP has* optimal local subpolicies *if, for* σ̂ = ⊔_𝕀 σ_i*, it holds that* ER^max_{M[σ̂]}(♦T) = ER^max_M(♦T)*.*

That is, if we collect (locally) optimal policies σ_i and apply them to M, we obtain the MDP M[⊔_𝕀 σ_i]. In that MDP, we can pick an optimal policy, which together with ⊔_𝕀 σ_i constitutes an optimal and total policy for M.

**Assumption:** The hierarchical MDP has optimal local subpolicies.

Roughly, the idea now becomes that rather than solving one large MDP with |S| states, we solve |𝕀| MDPs with |S|/|𝕀| states and one MDP with |𝕀| states (assuming equally-sized and only nontrivial partitions).

The assumption is restrictive, but not unreasonable: a subroutine may not contain any nondeterminism, or a finished task may have no influence on any future task. The following proposition, while obvious, formalizes this:

**Proposition 1 (Sufficient criterion).** *Let* M *be a hierarchical MDP. The MDP has optimal local subpolicies if for each* i ∈ 𝕀 *either*

*– |Σ(M_i)| = 1, i.e., the subMDP contains no nondeterminism, or*
*– |succ(i)| = 1, i.e., the partition has a single successor state.*
**Beyond Optimal Local Subpolicies.** The efficiency of our approach is partly due to the assumption in Definition 4. We observe that adapting this definition allows for a spectrum of specific yet useful cases. In particular, say that our system describes a protocol in which we must optimize the probability to satisfy N tasks, all of which may fail – the subMDPs will then have two successor states. Often, it is easy to see (and model) that a locally optimal policy will aim to satisfy each task, and that thus the locally optimal policy optimizes the probability to reach the corresponding successor state. Then, by adapting the target states in Definition 3 to be the successor state where the task is successful, the notion of an optimal policy—and thus of an optimal local subpolicy—changes. These changes are minimal, and everything that follows below is easily adapted to this setting, as demonstrated by the prototypical implementation.

# **4 Solving hMDPs with Abstraction-Refinement**

In this section, we consider hMDPs with optimal local subpolicies. We step-wise develop a sketch of an anytime algorithm that provides lower and upper bounds on the expected reward in this hMDP. In Sect. 4.1, we introduce an alternative representation of our problem that formalizes the idea of individually computing subMDPs. We then formalize the ideas that allow us to construct an anytime algorithm in Sect. 4.2. In Sect. 4.3, we integrate the abstract requirements for analysing sets of subMDPs into the algorithm, and finally, in Sect. 4.4, we introduce a method that realises this using pMDPs.

#### **4.1 The Macro-MDP Formulation**

We adapt macro-MDPs [5], which summarize the subMDPs by single states.

**Definition 5 (Macro-MDP).** *Let* M *be an hMDP with* n *nontrivial partitions* **S**_i *and* S_M *partitioned as* S_M = ⊎ **S**_i ∪ S′*. The* macro-MDP *is defined as* μ(M) := ⟨S′ ∪ {entry_i | 1 ≤ i ≤ n}, A_M, ι_M, ∅, P, r, T_M⟩ *with* P *and* r *given by*

$$P(s,\alpha,s') = \begin{cases} \mathsf{Pr}\_{\mathcal{M}\_i[\sigma\_i]}(\Diamond s') & \text{if } s \in \mathbf{S}\_i,\\ P\_{\mathcal{M}}(s,\alpha,s') & \text{otherwise,} \end{cases} \quad r(s) = \begin{cases} \mathsf{ER}^{\max}\_{\mathcal{M}\_i}(\Diamond\bot) & \text{if } s \in \mathbf{S}\_i, \\ r\_{\mathcal{M}}(s) & \text{otherwise.} \end{cases}$$

*where* M_i *is the corresponding subMDP (see Definition 3) and* σ_i *is an arbitrary but fixed optimal policy, i.e., a policy such that* ER_{M_i[σ_i]}(♦G_i) = ER^max_{M_i}(♦G_i)*.*

Intuitively, we replace the transitions within **S**_i by a 'big-step semantics' that aggregates the transitions within **S**_i into single transitions such that the probability to reach any successor matches the probability to do so within **S**_i under a specific – optimal – policy. Likewise, the expected reward matches the expected reward collected in **S**_i.<sup>1</sup>

*Remark 1.* To define a *unique* macro-MDP, we can take the lexicographically smallest policy σ_i among the optimal policies. Furthermore, we observe that for the cases covered by Proposition 1, it is not necessary to compute σ_i at all: either there is a single successor—implying Pr_{M_i[σ_i]}(♦{s′}) = 1 for any σ_i—or |Σ(M_i)| = 1.

The following theorem formalises that, given the assumptions, taking the bigstep semantics is adequate when optimizing for an expected reward.

<sup>1</sup> Due to the additive nature of expected rewards, we can annotate the state with the expected reward even though it may differ over the different paths to an exit of **S**i.

**Theorem 1.** *Let* M *be an hMDP with optimal local subpolicies and let* μ(M) *be the corresponding macro-MDP. Then:* ER^max_{μ(M)}(♦T) = ER^max_M(♦T)*.*

The important ingredient is the optimal local subpolicies, which ensure that we aggregate behavior within the partitions by behavior that agrees with a (globally) optimal policy. We give a proof in the appendix<sup>2</sup>.

*Naive Algorithm.* Algorithmically, we first compute ER^max_{M_i}(♦G_i) and an associated optimal policy σ_i, and then compute the reachability probabilities on the induced Markov chain. We collect these results in a vector res_i, which is helpful to construct the macro-MDP. To clarify further constructions in this paper, we make res_i explicit. Recall that |succ_M(**S**_i)| = Y for all i.

**Definition 6 (Results for subMDP).** *Let* M_i *be the subMDP for partition* **S**_i *of an hMDP* M*. Let* succ_M(**S**_i) *be ordered. We define* res_i ∈ ℝ^{Y+1} *s.t.*

$$\mathsf{res}\_{i}(j) := \mathsf{Pr}\_{\mathcal{M}\_{i}[\sigma\_{i}]}(\Diamond \{\mathsf{succ}\_{\mathcal{M}}(\mathcal{S}\_{i})\_{j}\}) \text{ for } 0 \le j < Y \text{ and } \mathsf{res}\_{i}(Y) := \mathsf{ER}\_{\mathcal{M}\_{i}}^{\mathsf{max}}(\Diamond G\_{i}),$$

*where* <sup>σ</sup><sup>i</sup> *is an arbitrary but fixed policy such that ER*<sup>M</sup>*i*[σ*i*](♦Gi) = *ERmax* <sup>M</sup>*<sup>i</sup>* (♦Gi)*.*

This allows us to reformulate the macro-MDP; in particular, the following two identities hold:

$$P(s,\alpha,s') = \begin{cases} \mathsf{res}\_i(j) & \text{if } s \in \mathbf{S}\_i \text{ and } s' = \mathsf{succ}\_{\mathcal{M}}(\mathbf{S}\_i)\_j, \\ P\_{\mathcal{M}}(s,\alpha,s') & \text{otherwise,} \end{cases} \quad r(s) = \begin{cases} \mathsf{res}\_i(Y) & \text{if } s \in \mathbf{S}\_i, \\ r\_{\mathcal{M}}(s) & \text{otherwise.} \end{cases} \tag{1}$$

These identities make explicit that the macro-MDP can be constructed by precomputing the necessary result vectors.
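The identity in Eq. (1) suggests a simple construction: each nontrivial partition contributes one macro state whose outgoing distribution and state reward are read off the result vector. A minimal sketch (hypothetical encoding of our own, not the tool's API):

```python
# Sketch of the big-step construction in Eq. (1) (hypothetical encoding): the
# entry state of partition i gets one transition per successor state, weighted
# by res_i(0..Y-1), and the expected reward res_i(Y) as its state reward.
def macro_state(res, successors):
    *probs, exp_rew = res                      # res_i(0..Y-1) and res_i(Y)
    transitions = {s: p for s, p in zip(successors, probs) if p > 0.0}
    return transitions, exp_rew

res_i = [0.25, 0.75, 3.5]                      # Y = 2 successors
trans, rew = macro_state(res_i, ["v", "w"])
print(trans, rew)  # -> {'v': 0.25, 'w': 0.75} 3.5
```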

**Enumeration baseline**: With macro-MDPs, we reduce the computation of ER^max_M(♦T) to (1) analysing all subMDPs M_i and (2) analysing μ(M).

This rather naive algorithm already limits memory consumption and may exploit similarities between subMDPs during the analysis, e.g., based on the structure discussed in Sect. 4.4. It performs well if the number |𝕀| of subMDPs is sufficiently small. We are interested in methods that allow for larger 𝕀 or larger subMDPs. In particular, we want to avoid analysing all subMDPs individually.

#### **4.2 The Uncertain Macro-MDP Formulation**

*Uncertainty Before Computation.* We start by introducing a method that provides bounds on the expected reward after individually analysing only a subset of the subMDPs. Before computing the individual probabilities in M_i, we are *uncertain* about the probabilities and rewards in the MDP μ(M). Under this

<sup>2</sup> See: https://doi.org/10.48550/arXiv.2206.02653.

uncertainty, we may not be able to compute ER^max_{μ(M)}(♦T) precisely. However, we may solve the problem statement by *bounding* the expected reward. Thus, the goal is to compute values lb, ub s.t.

$$\mathsf{lb} \le \mathsf{ER}^{\max}\_{\mathcal{M}}(\Diamond T) = \mathsf{ER}^{\max}\_{\mu(\mathcal{M})}(\Diamond T) \le \mathsf{ub}. \tag{2}$$

*Uncertain Macro-MDPs.* We capture the a-priori uncertainty about the sub-MDP results in an uncertain macro-MDP, a particularly shaped *parametric* MDP.

**Definition 7 (Uncertain macro-MDP).** *Let* M *be an hMDP with* n *nontrivial partitions* **S**_i *and* S_M *partitioned as* S_M = ⊎ **S**_i ∪ S′*. The* uncertain macro-MDP *is defined as* ν(M) := ⟨S′ ∪ {entry_i | 1 ≤ i ≤ n}, A_M, ι_M, x, P, r, T_M⟩ *with parameters* x := {p_{i,j}, q_i | 1 ≤ i ≤ n, 1 ≤ j ≤ Y} *where* Y = |succ_M(**S**_i)|*, and* P *and* r *given by*

$$P(s,\alpha,s') := \begin{cases} p\_{i,j} & \text{if } s \in \mathcal{S}\_i \text{ and} \\ & s' = \mathsf{succ}\_{\mathcal{M}}(\mathcal{S}\_i)\_j, \quad r(s) := \begin{cases} q\_i & \text{if } s \in \mathcal{S}\_i, \\ r\_{\mathcal{M}}(s) & \text{otherwise.} \end{cases} \end{cases}$$

*Remark 2.* Whenever M_i and M_{i′} are isomorphic, we may reduce the number of parameters and replace each occurrence of p_{i′,j} with p_{i,j} and each occurrence of q_{i′} with q_i.

The uncertain macro-MDP can be instantiated to coincide with the macro-MDP by setting the parameters accordingly.

**Theorem 2.** *Let* M *be an hMDP,* μ(M) *the associated unique macro-MDP, and* ν(M) *the associated uncertain macro-MDP with parameters* p_{i,j} *and* q_i*. Let* u∗ *be the parameter valuation with* u∗(p_{i,j}) = res_i(j) *and* u∗(q_i) = res_i(Y) *for all* i, j*. Then:*

$$\nu(\mathcal{M})[u^\*] = \mu(\mathcal{M})$$

*Proof sketch.* The constructions of the uncertain macro-MDP and the macro-MDP only differ in the assignment of probabilities. We set u∗ exactly as in the characterisation in (1), and thus the equality follows.

*Computing Bounds.* Assume for now that we can derive some (trivial) sound bounds on the result vector of any subMDP M_i.<sup>3</sup>

**Definition 8 (Sound bounds on results).** *For* Mi*, the vectors lbres*<sup>i</sup> *and ubres*<sup>i</sup> *are* sound bounds *if the following pointwise inequality holds*

$$\mathsf{lbres}\_i \le \mathsf{res}\_i \le \mathsf{ubres}\_i. \tag{3}$$

<sup>3</sup> We discuss our approach in Sect. 4.4, alternatively, one may use bounds from, e.g., [4].

These bounds on properties in the subMDP correspond to bounds on the parameters of the uncertain macro-level MDP <sup>ν</sup>(M). Let us formalize this idea.

**Definition 9 (Suitable parameter region).** *Given* u<sup>∗</sup> *from Theorem 2. The bounds* <sup>u</sup>−, u<sup>+</sup> *are* suitable *if* <sup>u</sup><sup>−</sup> <sup>≤</sup> <sup>u</sup><sup>∗</sup> <sup>≤</sup> <sup>u</sup><sup>+</sup>*. For suitable* <sup>u</sup>−, u<sup>+</sup>*, the region* [[u−, u<sup>+</sup>]] *is called* suitable*.*

Using this notion, sound bounds lbres_i and ubres_i thus yield suitable bounds u−(x), u+(x) for all x ∈ ⋃_j {p_{i,j}} ∪ {q_i}. Combined, the sound bounds for every i yield a suitable region. Formally:

**Fig. 4.** Analysing hMDPs via uncertain macro-MDPs via individual refinement.

**Lemma 1.** *Given sound bounds lbres*i, *ubres*<sup>i</sup> *for each* i*, there exists a trivial mapping Reg s.t. Reg*(*lbres*1,... *lbres*n, *ubres*1,... *ubres*n) *is a suitable region.*

With the suitable region we can apply verification on the parametric MDP.

**Lemma 2.** *Let* R *be a suitable region. Then:*

$$\min\_{u \in R} \mathsf{ER}\_{\nu(\mathcal{M})[u]}^{\max}(\Diamond T) \le \mathsf{ER}\_{\mathcal{M}}^{\max}(\Diamond T) \le \max\_{u \in R} \mathsf{ER}\_{\nu(M)[u]}^{\max}(\Diamond T).$$

*Proof sketch.* We observe that the inequalities follow from the fact that u∗ ∈ R, with u∗ as in Theorem 2. By that theorem, ER^max_{ν(M)[u∗]}(♦T) = ER^max_{μ(M)}(♦T). The statement then follows from Theorem 1.

From the bounds that we can compute using a suitable region, we then set lb and ub for Eq. (2):

$$\mathsf{lb} := \min\_{u \in R} \mathsf{ER}^{\max}\_{\nu(\mathcal{M})[u]}(\Diamond T) \le \mathsf{ER}^{\max}\_{\mathcal{M}}(\Diamond T) \le \max\_{u \in R} \mathsf{ER}^{\max}\_{\nu(\mathcal{M})[u]}(\Diamond T) =: \mathsf{ub}. \tag{4}$$

Computationally, we may use parameter lifting [33] to find these values.

*Refinement Loop.* The complete anytime algorithm is summarized in Fig. 4. We start with an hMDP M and extract the uncertain macro-MDP ν(M) and the subMDPs {M_i}<sup>4</sup>. Furthermore, we compute (trivial) sound bounds lbres_i ≤ res_i ≤ ubres_i. This yields a suitable region [[u−, u+]] = Reg(lbres_1, ubres_1, ...). Then, we may at any time compute the bounds lb, ub on the expected reward

<sup>4</sup> For efficiency, one must implement extraction without first computing an explicit representation of M.

in the hMDP M by analysing ν(M) on the region [[u−, u+]]. To tighten these bounds, we first refine the suitable region: we analyse individual subMDPs M_i and compute res_i and thus u∗(x) for x ∈ ⋃_j {p_{i,j}} ∪ {q_i}. This refines the suitable bounds such that u−(x) = u∗(x) = u+(x) for these x. We call this refinement *individual refinement*. The new region is suitable, and Theorem 2 ensures correctness of the refinement. As there are only finitely many subMDPs, we obtain lb = ub after finitely many steps.
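Individual refinement is conceptually simple: once res_i is known exactly, the parameter intervals of subMDP i collapse to points. A sketch with hypothetical parameter names of our own:

```python
# Sketch of individual refinement (hypothetical parameter names): computing
# res_i exactly collapses the suitable interval of every parameter of
# subMDP i to a point, u-(x) = u*(x) = u+(x).
lower = {"p_1_0": 0.0, "p_1_1": 0.0, "q_1": 0.0}    # trivial sound bounds
upper = {"p_1_0": 1.0, "p_1_1": 1.0, "q_1": 10.0}
res_1 = {"p_1_0": 0.4, "p_1_1": 0.6, "q_1": 2.5}    # exact results for subMDP 1
for x, val in res_1.items():
    lower[x] = upper[x] = val                        # interval becomes a point
print(lower == upper)  # -> True
```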

**Anytime version of the enumeration baseline.** Individually refine any subset of subMDPs, then analyse the uncertain macro-MDP <sup>ν</sup>(M).

#### **4.3 Set-Based SubMDP Analysis**

Next, we aim to provide an alternative refinement procedure that analyses a set of subMDPs at once, i.e., that refines the suitable bounds for a set of parameters at once. We denote the set of goal states for all subMDPs as G<sup>5</sup>.

*Adequate Abstractions.* We aim to compute sound bounds on the results for a set of subMDPs such that the bounds are sound for every individual subMDP in this set. We generalize Definition 8 as follows: The (lower and upper) bounds lbres<sup>I</sup> , ubres<sup>I</sup> are *sound*, if they are sound (lower and upper) bounds for every resi, <sup>i</sup> <sup>∈</sup> <sup>I</sup>.

**Lemma 3.** *Let* lbres_I *satisfy the following inequations for* 0 ≤ j < Y*:*

$$\mathsf{lbres}\_I(Y) \le \min\_i \mathsf{ER}^{\max}\_{\mathcal{M}\_i}(\Diamond G) \qquad \text{and} \qquad \mathsf{lbres}\_I(j) \le \min\_i \min\_\sigma \mathsf{Pr}\_{\mathcal{M}\_i[\sigma]}(\Diamond \{\mathsf{succ}\_{\mathcal{M}}(\mathbf{S}\_i)\_j\}). \tag{5}$$

*Then, lbres*<sup>I</sup> *is a sound lower bound.*

*Proof sketch.* We must show lbres_I ≤ res_i for each i ∈ I. By definition, for each 0 ≤ j ≤ Y, lbres_I(j) ≤ min_{i′∈I} res_{i′}(j), and trivially min_{i′∈I} res_{i′}(j) ≤ res_i(j).

We omit the analogous statement for ubres_I<sup>6</sup>. In Sect. 4.4, we discuss a particular approach to obtain these bounds, i.e., the right-hand sides of the inequations in Eq. (5). Here, we update the algorithm sketch to handle this alternative refinement.

*Remark 3.* We cannot compute the optimal policy σ_i for the subMDP M_i in this setting. Thus, we must compute probability bounds for all policies, which may make these bounds weak. Some optimizations are possible, as some actions can in fact be excluded. More importantly, however, for the cases within Proposition 1 the policy σ_i is irrelevant.

<sup>5</sup> Formally, we label the goal states and use G to denote those states.

<sup>6</sup> where min becomes max and inequalities flip.

*Updated Algorithm.* We update the loop from Fig. 4: rather than refining using a single i, we refine using a set I. Instead of res_i, we use Lemma 3 to compute sound bounds lbres_I, ubres_I and call this *set-based refinement*. We may set lbres_i = lbres_I for each i ∈ I. Then, we can compute a new suitable region via Lemma 1. With the suitable region, we can still utilise Eq. (4) to compute an approximation [lb, ub]. However, for completeness we must ensure that if |I| = 1, the upper and lower bounds coincide, i.e., lbres_{i} = ubres_{i} for every i. That can be ensured by using individual subMDP refinement whenever |I| = 1.

**Idea:** We may improve the anytime algorithm by iteratively considering sets of subMDPs and extract sound bounds.

We now first discuss the set-based analysis of multiple subMDPs Mi. We clarify the realization of the loop box in Sect. 5.

**Fig. 5.** Analysing hMDPs with set-based refinement on templated subMDPs.

#### **4.4 Templates for Set-Based subMDP Analysis**

We present an instance of set-based subMDP analysis in which the subMDPs can be described as instantiations of a parametric MDP.

*Parametric Templates.* We observe that the subMDPs are often similar, e.g., they describe sending a file over a channel or exploring a room, under different conditions. We capture this similarity as follows: let {T_1, ..., T_m} be a set of parametric MDPs, where we call each pMDP a *template*. In particular, for a hierarchical MDP M with partitioning **S**_1, ..., **S**_n and corresponding subMDPs M_1, ..., M_n, a subMDP M_i is an instantiation of template T_j with parameter instantiation v<sup>7</sup> if M_i = T_j[v]. For a concise description, this paper considers hMDPs over a single template T and, for any I ⊆ 𝕀, we denote by V_I := {v_i | i ∈ I} the finite (multi)set of parameter instantiations for the pMDP T such that T[v_i] = M_i.

*Abstractions from Templates.* In terms of the templates, Lemma 3 requires us to bound the expected rewards ERmax <sup>T</sup> [v](♦G) for all <sup>v</sup> <sup>∈</sup> <sup>V</sup><sup>I</sup> . We realize this by defining the smallest region toRegion(V<sup>I</sup> ) <sup>⊇</sup> <sup>V</sup><sup>I</sup> . For this region, we obtain expected rewards by computing the minimum maximal reward in toRegion(V<sup>I</sup> ). That is:

$$\mathsf{lbres}\_I(Y) := \min\_{v \in \mathsf{toRegion}(V\_I)} \mathsf{ER}^{\max}\_{\mathcal{T}[v]}(\Diamond G) \quad \le \quad \min\_i \mathsf{ER}^{\max}\_{\mathcal{M}\_i}(\Diamond G).$$

<sup>7</sup> We use <sup>v</sup> instead of <sup>u</sup> to avoid confusion with the instantiations for pMDP <sup>ν</sup>(M).

We handle the probabilities similarly, taking into account the quantification over the policies. Following Lemma 3, these bounds are sound. Upper bounds are handled analogously. Computationally, we again use parameter lifting [33] to find these bounds. We can easily refine: whenever we split I (or, equivalently, V_I), we can compute (potentially) smaller regions toRegion(V_I).
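Since regions are rectangular, toRegion(V_I) is simply the componentwise bounding box of the valuations in V_I. A minimal sketch (hypothetical parameter names of our own):

```python
# Sketch of toRegion(V_I): the smallest rectangular region containing a
# finite set of parameter valuations is the componentwise bounding box
# [[u-, u+]] (hypothetical parameter names "p" and "q").
def to_region(valuations):
    params = valuations[0].keys()
    lo = {x: min(v[x] for v in valuations) for x in params}
    hi = {x: max(v[x] for v in valuations) for x in params}
    return lo, hi

V_I = [{"p": 0.2, "q": 1.0}, {"p": 0.5, "q": 0.4}, {"p": 0.3, "q": 0.9}]
lo, hi = to_region(V_I)
print(lo, hi)  # -> {'p': 0.2, 'q': 0.4} {'p': 0.5, 'q': 1.0}
```

Splitting I then splits the valuation set, and recomputing the box gives the (potentially) smaller regions used for refinement.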

In Fig. 5, we depict our method. In contrast to Fig. 4, we pass the template T rather than the individual subMDPs. Furthermore, we now compute initial sound bounds via the analysis of the template (i.e., of V<sup>I</sup> ) and must pass the mapping from I to V<sup>I</sup> to clarify the shape of the subMDPs.

**Abstraction-Refinement** on the subMDPs provides increasingly tight suitable regions for the uncertain macro-MDP from the anytime baseline.

**Algorithm 1.** Algorithm for Abstraction-Refinement Procedure

1: Construct macro-MDP ν(M), class-MDP T, and V_𝕀 from high-level description.
2: Q ← {⟨I = 𝕀, bounds = [0, ∞), weightedvals = 𝕀 → {1}⟩}
3: lb ← 0; ub ← ∞; #iter ← 0; Res ← ∅
4: **while** η · ub > lb **do**
5:   R ← Q.pop() ▷ Use priority
6:   **if** R.I = {i} **then**
7:     Res[i] ← check_one(T[v_i]) ▷ Computes res_i
8:   **else**
9:     R.bounds ← check_set(T, toRegion(V_{R.I})) ▷ Computes lbres_{R.I}, ubres_{R.I}
10:    Q ← Q ∪ split(R) ▷ Split R.I, keep bounds and weights
11:  **end if**
12:  **if** #iter *mod* k = 1 **or** Q is empty **then**
13:    R′ ← Reg(extract(Q, Res)) ▷ Compute suitable region via Lemma 1
14:    lb, ub ← check_set(ν(M), R′)
15:  **end if**
16: **end while**

#### **5 Implementing the Abstraction-Refinement Loop**

Algorithm 1 outlines a basic implementation of the idea sketched in Fig. 5. We detail this implementation and then discuss an essential improvement.

We construct ν(M), T, and the (implicit) mapping V : 𝕀 → V_𝕀 that maps subMDPs to instantiations of T from a suitable high-level representation. We initialize a priority queue with a triple that represents the set of all template instantiations: I = 𝕀, such that V_I := {v_i := V(i) | i ∈ I} contains all valuations v for which T[v] is a subMDP of M. We initially store bounds reflecting lbres_I and ubres_I as well as weights for the computation of the priority (see below). Initially, lb = 0 and ub = ∞, and we count the number of iterations in #iter. Res is a map for storing result vectors. The algorithm now refines lb and ub until the gap between them is sufficiently small.

The main loop now iteratively refines lb, ub by first refining lbres_I and ubres_I through splitting I and model checking T w.r.t. subsequently smaller regions toRegion(V_I) (l. 5-11): We take a set R from the queue. If R.I = {i} is a singleton, we compute lbres_{R.I} = res_i = ubres_{R.I} and store this result. Otherwise, we apply model checking to the pMDP T w.r.t. the region representation of R.I. We then split R.I into (here) two subsets. For splitting I, we use the geometric interpretation of toRegion(V_I) as a subset of ℝ^{|y|}, where we split along one of the axes into two equally large subsets. Every k (we use k = 8) iterations, we analyse the macro-MDP (l. 12-15). From Q and Res, we extract the proper bounds lbres_i, ubres_i: from Res[i] if possible, and otherwise from Q using R.bounds for the R such that i ∈ R.I. Then, via Reg(lbres_1, ubres_1, ...) from Lemma 1, we compute a suitable region R′. We analyse the uncertain macro-MDP to obtain lb and ub in accordance with Eq. (4).

Finally, we discuss the priority function: if we a priori naively assume that each subMDP contributes an equal amount to the overall minimal expected reward in the hMDP (all weights are one), then the priority function |R.bounds| · Σ_{i∈R.I} R.weights(i) computes priorities that correlate with how much computing res_i for all i ∈ R.I would reduce the gap between lb and ub.
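The priority function can be sketched in a few lines (hypothetical encoding of a queue entry's `bounds` and `weights`, not the tool's data structures):

```python
# Sketch of the priority function: the width of a queue entry's bounds
# interval times the total weight of its subMDPs approximates how much
# refining the entry could shrink the gap ub - lb.
def priority(bounds, weights):
    lo, hi = bounds
    return (hi - lo) * sum(weights.values())

print(priority((1.0, 4.0), {"i1": 2.0, "i2": 1.0}))  # -> 9.0
```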

*Termination and Correctness Argument.* Algorithm 1 terminates: we split in such a way that max_{I∈Q} |I| monotonically decreases. Thus, eventually Q is empty and Res contains results for all subMDPs. Then, R′ is a point region, and checking ν(M) with this point region ensures that lb = ub. Correctness follows as R′ is always suitable, see Eq. (4).

*Computing Expected Visits.* Based on our empirical evaluation, we added one crucial improvement: while the algorithm above assumed that all subMDPs (or states in the macro-MDP) are equally important, that assumption is generally inadequate. Roughly, only states reached by the optimal policy contribute at all (provided the bounds are tight enough that we can identify these states). The reachable states are weighted by the expected number of visits of these states. We compute an approximation of this expected number of visits by computing the currently optimizing policy (a by-product of l. 13) and the center of R′; this yields an MC for which we can compute the expected number of visits by a standard equation system [32]. Additionally, we update the weights for the regions in the queue based on these new results. We remark that this also makes the priority function more useful.
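The expected number of visits solves the standard equation system visits(s) = [s = ι] + Σ_{s′} P(s′, s) · visits(s′). A fixed-point sketch on a hypothetical absorbing chain (states and probabilities are our own):

```python
# Sketch of the expected-visits computation on a hypothetical absorbing
# Markov chain: visits(s) = [s = initial] + sum_{s'} P(s', s) * visits(s'),
# solved by fixed-point iteration over the transient states.
P = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"t": 1.0}}
transient = ["s0", "s1"]

def expected_visits(P, transient, initial, iters=200):
    vis = {s: 0.0 for s in transient}
    for _ in range(iters):
        vis = {s: (1.0 if s == initial else 0.0)
                  + sum(vis[u] * P[u].get(s, 0.0) for u in transient)
               for s in transient}
    return vis

vis = expected_visits(P, transient, "s0")
print(round(vis["s0"], 6), round(vis["s1"], 6))  # -> 2.0 1.0
```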

*Interleaving Individual Refinement.* Furthermore, subMDPs for which the expected number of visits is large<sup>8</sup> are analysed individually (and the corresponding points are removed from the regions in the queue). This optimization reduces the need to split the corresponding regions until we obtain tight bounds.

<sup>8</sup> In our implementation, we define this as subMDPs where the expected number of visits is in the top 1 + <sup>1</sup>/<sup>16</sup> · #iter percent, but not more than 150 at a time.

#### **6 Experiments**

*Implementation.* We implemented level-up<sup>9</sup>, a prototype on top of the Python bindings of Storm [20]. level-up analyses hierarchical MDPs given as two MDPs, each provided as a probabilistic program description in the PRISM format: one MDP that encodes the (uncertain) macro-MDP and one that describes the parametric template for the subMDPs. The parameter instance of a subMDP can be deduced as a function of the high-level variable assignment of the macro-MDP states. For technical reasons, the prototype currently supports subMDPs with one or two successor states – arguably the setting in which we expect our prototype to perform best. For subMDPs with a single successor state, the uncertain macro-MDP may be represented as a (parameter-free) MDP with interval-valued rewards. For two successors, we include support for the extension of Sect. 3.3, where the subpolicy aims to optimize reaching a fixed successor state.



*Setup.* We investigate the scalability and the quality of the approximation over time. To that end, we run our prototype on a 2020 MacBook M1 with an 8 GB RAM limit. We compare the enumerative baseline from Sect. 4.1 with Algorithm 1. Both exploit the hierarchical nature of the MDP. We qualitatively compare to standard model checking on the flat MDP, see below. We use a collection of benchmarks reflecting networks, job schedulers, and robots.

*Results.* We consider instances that we summarize in Table 1. In particular, we give the benchmark name and instance for reference, the approximate number of states in the hierarchical MDP (computed from the macro-MDP and the

<sup>9</sup> The source code and executables, the benchmarks, logfiles and utilities are all available in an archived Docker container: https://doi.org/10.5281/zenodo.6524787.

subMDPs), the number of nontrivial partitions, and the number of states and actions in the (uncertain) macro-MDP and the subMDPs, respectively. Then, we give the time t<sup>init</sup> in seconds to set up the data structures from the high-level representation. We highlight that a flat representation of each of our benchmarks has at least 10<sup>7</sup> states, often more. As a reference, we present the performance of the enumerative baseline from Sect. 4.1. The performance of this approach is already positive in that it enables the verification of huge MDPs. A TO indicates >1200 s. To scale to either larger subMDPs or more subMDPs, we use the abstraction-refinement loop. To reflect its anytime nature, we list three run times, terminating when η · ub ≤ lb for η ∈ {0.5, 0.9, 0.95}, respectively. The largest time faster than the enumerative baseline is highlighted (further to the right is better for the abstraction-refinement). For η = 0.95, we give details: the number of iterations (iter), the number of individual refinements based on the improvement from Sect. 5, and the fractions of time spent on model checking the uncertain macro-MDPs (%um), the set-refinements (%sr), and the individual refinements (%ir), respectively.

*Discussion.* Before we discuss details of the results, let us clarify that *exploiting the hierarchical structure is essential*. MDPs with ≈10<sup>8</sup> states are at the limit of what fits in around 8 GB of memory<sup>10</sup>. Symbolic methods based on MTBDDs easily scale beyond these sizes, but—noting that the subMDPs are all slightly different—the models we consider lack the necessary symmetry that makes MTBDDs compact. Thus, support for hierarchical MDPs is a necessary step forward.
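The memory estimate can be checked with quick arithmetic, using the per-state size from the footnote (8 doubles and 16 32-bit ints):

```python
# Back-of-the-envelope memory estimate for a flat representation:
# 128 bytes per state, as assumed for Storm's explicit state layout.
bytes_per_state = 8 * 8 + 16 * 4        # 8 doubles + 16 32-bit ints = 128 B
states = 10**8
gib = bytes_per_state * states / 2**30  # total size in GiB
# roughly 12 GiB, i.e., already beyond an 8 GB memory limit
```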

Regarding the abstraction-refinement: while a larger study may be necessary, we can start with two standard observations: the abstraction-refinement loop is significantly faster for η ≤ 0.9; as η → 1, coarse abstractions are insufficient. Furthermore, the efficiency of the abstraction-refinement depends heavily on the particular structure. That being said, the approach outperforms the enumerative approach, especially for η = 0.9, by up to more than an order of magnitude. This happens even if I is rather small, or if, e.g., T is small. We furthermore observe that for large I, the bookkeeping in Python becomes a bottleneck. We think these observations are promising: we left many options for further optimizations and tweaking towards particular examples on the table. However, for models where most time is spent on model checking the macro-level MDP, the approach is less suitable. We furthermore conjecture that tailored algorithms may exploit some of these dimensions, e.g., when the macro-MDP or the subMDPs are indeed MCs or perhaps acyclic, depending on the number of parameters and their influence [36], or based on the relative weight of the uncertain rewards compared to the rewards in the macro-MDP.

#### **7 Related Work**

In the model-free reinforcement learning (RL) setting, hierarchical models are popular. An excellent, recent survey is given in [29]. Our work generalizes the

<sup>10</sup> Assuming 128 bytes per state, i.e., 8 doubles and 16 (32-bit) ints, as used in Storm.

solution techniques for hierarchical MDPs that assume that all subMDPs are the same. In RL, this assumption is treated liberally, and the methods provide only weak error bounds. In contrast, our model-based approach provides error bounds in every step, and the error disappears in finitely many steps.

Hierarchical abstractions are used to analyse large MDPs in [5]. There, the goal is to find a policy that almost optimizes the reward. Rather than imposing a hierarchy up front, the algorithm aims to find a hierarchy and define the goal states of the subMDPs such that the model admits local policies. Instead, our solution can find the optimal policy and, in particular, gives strict error bounds, at the cost of requiring a high-level model that induces the hierarchy. A symbolic approach for continuous MDPs, where the transition probabilities are the result of an associated LP, has recently been discussed in [24]. A hierarchical SCC-decomposition [1] aims to accelerate solving a (given, monolithic) Markov chain. The computation of reward-bounded properties [18] generalizes topological value iteration, and their notion of episodes mildly resembles a hierarchical approach, but no uncertainty is assumed or used in the approach. The probabilistic model checker PAT [35] analyses hierarchical probabilistic timed automata given as a process algebra; the hierarchy is not exploited in the solving process.

While symbolic approaches, often based on decision diagrams, exploit the transition system by compressing the data structures, abstractions aim to yield smaller systems whose analysis approximates the sought-for values. Abstraction-refinement without an imposed hierarchy is explored in [16,21,25]: refinement amounts to considering a better approximation of the state space. In contrast, we impose the hierarchy, the abstraction amounts to an imprecise analysis of this fixed state space, and we refine by analysing the state space more precisely (by means of analysing subMDPs at a greater level of detail). Contract-based abstractions (in probabilistic systems) are used to decompose the analysis of systems given by parallel running subsystems [14,28,38]. Partial exploration and bounded model checking approaches focus on the most critical paths, i.e., the paths where most of the probability mass lies [7,23,26], but these approaches generally do not exploit the hierarchical and repetitive structure. The observation that many parts of the system are not critical allows us to weigh the potential benefit of refining the intervals in various parts of the macro-MDP.

Parametric MDPs are commonly used to model and analyse the effects of uncertainty in the precise transitions [15,23,31]. The methods presented in [13,22] exploit a repetitive structure in parametric MCs to accelerate the construction of closed form solutions and are not applicable to MDPs. Parametric models have been used to support the design of systems [2,8] or their adaption [6,9], to find policies for partially observable systems [11], to analyse Bayesian networks [34], and to speed up the analysis of, e.g., software product lines [10,37]. On top of technical differences, none of these approaches uses a hierarchical decomposition of an MDP or uses the results of the analysis in the analysis of a larger MDP.

# **8 Conclusion**

This paper presents a first verification approach that exploits a specific hierarchical structure, natural in many models, to accelerate analysing the underlying MDP. An essential ingredient is to separate the two levels of the hierarchy. Then, when analysing the (top-level) macro-MDP, we may treat subMDPs that have not yet been analysed as epistemic uncertainty. Analysis techniques for uncertain (more precisely: parametric) MDPs then enable an online approximation loop that incrementally removes uncertainty in a targeted fashion by analysing more and more subMDPs (more) precisely. Three clear directions for future work are to (i) consider an approach that lifts the restriction to locally optimal policies, (ii) investigate the applicability to a richer set of temporal properties, and (iii) allow automatic detection of partitions in, e.g., the PRISM language.

# **References**



**Formal Methods for Neural Networks**

# **Shared Certificates for Neural Network Verification**

Marc Fischer1(B) , Christian Sprecher<sup>2</sup>, Dimitar Iliev Dimitrov<sup>1</sup> , Gagandeep Singh<sup>3</sup> , and Martin Vechev<sup>1</sup>

<sup>1</sup> ETH Zurich, Zürich, Switzerland {marc.fischer,dimitar.iliev.dimitrov,martin.vechev}@inf.ethz.ch <sup>2</sup> Nostic Solutions AG, Freienbach, Switzerland christian.sprecher@nostic.ch <sup>3</sup> University of Illinois at Urbana-Champaign & VMware Research, Champaign, USA

ggnds@illinois.edu

**Abstract.** Existing neural network verifiers compute a proof that each input is handled correctly under a given perturbation by propagating a symbolic abstraction of reachable values at each layer. This process is repeated from scratch, independently for each input (e.g., image) and perturbation (e.g., rotation), leading to an expensive overall proof effort when handling an entire dataset. In this work, we introduce a new method for reducing this verification cost without losing precision, based on a key insight: abstractions obtained at intermediate layers for different inputs and perturbations can overlap or contain each other. Leveraging this insight, we introduce the general concept of shared certificates, enabling proof effort reuse across multiple inputs to reduce overall verification costs. We perform an extensive experimental evaluation to demonstrate the effectiveness of shared certificates in reducing the verification cost on a range of datasets and attack specifications on image classifiers, including the popular patch and geometric perturbations. We release our implementation at https://github.com/eth-sri/proof-sharing.

**Keywords:** Neural Network Verification · Local Verification · Adversarial Robustness

# **1 Introduction**

The success of neural networks across a wide range of application domains [21,30] has led to their widespread application and study. Despite this success, neural networks remain vulnerable to adversarial attacks [8,23] which raises concerns over their trustworthiness in safety-critical settings such as autonomous driving and medical devices. To overcome this barrier, formal verification of neural networks has been proposed as a key technology in the literature [39]. As a result,

© The Author(s) 2022 S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 127–148, 2022. https://doi.org/10.1007/978-3-031-13185-1\_7

M. Fischer and C. Sprecher—Equal contribution.

C. Sprecher—Work performed while at ETH Zurich.

recent years have witnessed a growing interest in verifying critical safety properties of neural networks (e.g., fairness, robustness) [14,17,18,31,32,40,42], specified using pre- and postconditions over network inputs and outputs, respectively. Conceptually, existing verifiers propagate the set of inputs in the precondition, captured in symbolic form (e.g., as convex sets), through the network, an expensive process that produces over-approximations of all possible values at intermediate layers. The final abstraction of the output can then be used to check the postcondition. The key technical challenge all existing verifiers aim to address is speeding up and scaling the certification process, i.e., faster and more efficient propagation of symbolic shapes while reducing the overapproximation error.

*This Work: Accelerating Certification via Proof Sharing.* In this work, we propose a new, complementary method for accelerating neural network verification, based on the key observation that instead of treating each certification attempt in isolation as existing verifiers do, we can reuse proof effort among multiple such attempts, thus obtaining significant overall speed-ups without losing precision. Figure 1 illustrates both standard verification and the concept of proof sharing.

In standard verification, an input region I1(*x*) (orange square) is propagated from left to right, obtaining shapes at each intermediate layer (here the goal is to verify that all points in the input region are classified as "cat" by the neural network N). We observe that the abstraction obtained for a new region I2(*x*) (e.g., blue shapes) can be contained inside existing abstractions from I1(*x*), an effect we term *proof subsumption*. This effect can be observed both between abstractions obtained from different specifications (e.g., ℓ<sub>∞</sub>-balls and adversarial patches) for the same data point and between proofs for the same property but different, yet semantically similar, inputs. Building on this observation, we introduce the notion of proof sharing via templates. Proof sharing works in two steps: first, we leverage abstractions from existing proofs to create templates; second, we augment the verifier with these templates, stopping the expensive propagation at an intermediate layer as soon as the newly generated abstraction is included inside an existing template. Key technical ingredients for the effectiveness of our approach are fast template generation and inclusion checking techniques. We experimentally demonstrate that proof sharing can achieve significant speed-ups in challenging scenarios, including proving robustness to adversarial patches [10] and geometric perturbations [3] across different neural network architectures.

*Main Contributions.* Our key contributions are:


**Fig. 1.** Visualization of neural network verification. The input regions <sup>I</sup>1(*x*), <sup>I</sup>2(*x*) are propagated layer by layer through a neural network N. The high-dimensional convex shapes are visualized in 2d. While initially I1(*x*) and I2(*x*) only slightly overlap, at layer <sup>k</sup>, <sup>N</sup>1:k(I2(*x*)) is fully contained in <sup>N</sup>1:k(I1(*x*)). (Color figure online)

### **2 Background**

Here we formally introduce the necessary background for proof sharing.

*Neural Network.* A neural network <sup>N</sup> is a function <sup>N</sup> : <sup>R</sup><sup>d</sup>in <sup>→</sup> <sup>R</sup><sup>d</sup>out , commonly built from individual layers N = N<sup>L</sup> ◦ N<sup>L</sup>−<sup>1</sup> ◦···◦ N1. Throughout this text, we consider feed-forward neural networks, where each layer Ni(*x*) = max(*Ax*+*b*, 0) consists of an affine transformation (*Ax* + *b*) as well as a rectified linear unit (ReLU), which applies the max with 0 elementwise. A neural network classifying inputs into c classes outputs dout := c scores, one for each class, and assigns the class with the highest score as the prediction. While, as is common in the neural network verification literature, we use image classification as a proxy task, many other applications work analogously. Our approach also naturally extends to other types of neural networks, provided verifiers exist for these architectures. We discuss the challenges and limitations of such generalizations in Sect. 4.5. In the following, for k < L, we let N1:<sup>k</sup> denote the application of the first k layers and N<sup>k</sup>+1:<sup>L</sup> that of the last L − k layers, respectively.
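As a minimal sketch of this definition (with placeholder weights, not a trained model), a layer and its composition can be written as:

```python
import numpy as np

# Each layer computes N_i(x) = max(Ax + b, 0); the network is the
# composition N = N_L ∘ ... ∘ N_1, and the predicted class is the argmax
# over the d_out = c output scores.
def layer(A, b, x):
    return np.maximum(A @ x + b, 0.0)

def network(layers, x):
    for A, b in layers:
        x = layer(A, b, x)
    return x

# Placeholder two-layer network with d_in = d_out = 2 (c = 2 classes).
layers = [(np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -0.25])),
          (np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([0.0, 0.0]))]
out = network(layers, np.array([1.0, 0.0]))
pred = int(np.argmax(out))  # class with the highest score
```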

*(Local) Neural Network Verification.* Given a set of inputs and a postcondition ψ, the goal of neural network verification is to prove that ψ holds over the output of the neural network corresponding to the given set of inputs. In this work, we focus on local verification, proving that ψ holds for the network output for a given region <sup>I</sup>(*x*) <sup>⊆</sup> <sup>R</sup><sup>d</sup>in formed around the input *<sup>x</sup>*. Formally, we state this as:

*Problem 1 (Local neural network verification).* For a region <sup>I</sup>(*x*) <sup>⊆</sup> <sup>R</sup><sup>d</sup>in , neural network N, and postcondition ψ, verify that ∀*z* ∈ I(*x*). N(*z*) |= ψ. We write I(*x*) |= ψ if ∀*z* ∈ I(*x*). N(*z*) |= ψ.

Here, we restrict ourselves to verifiers based on abstract interpretation [11,14] as they achieve state-of-the-art precision and scalability [31,32]. Further, many other popular verifiers [38,42] can be formulated using abstract interpretation. These verifiers propagate I(*x*) symbolically through the network N layer-by-layer using abstract transformers, which overapproximate the effect of applying the transformations defined in the different layers on symbolic shapes. The propagation yields an abstraction of the exact shape at each layer. The verifiers finally check if the abstracted output implies ψ. This is showcased in Fig. 1, where the input regions I1(*x*) and I2(*x*) are propagated layer-by-layer through N.
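To illustrate the layer-by-layer propagation, the following sketch implements a verifier in the Box (interval) domain, the simplest of the abstract domains considered later; the network weights and the region are toy assumptions:

```python
import numpy as np

# Box-domain abstract transformers: propagate lower/upper bounds through
# affine and ReLU layers, soundly overapproximating N_{1:k}(I(x)).
def prop_affine(A, b, lo, hi):
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = A @ center + b
    r = np.abs(A) @ radius          # worst-case spread of the box
    return c - r, c + r

def prop_relu(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def prop_network(layers, lo, hi):
    for A, b in layers:
        lo, hi = prop_affine(A, b, lo, hi)
        lo, hi = prop_relu(lo, hi)
    return lo, hi

# Input region: l_inf-ball of radius eps around x (a box).
x, eps = np.array([1.0, 0.0]), 0.1
layers = [(np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, 0.0]))]
lo, hi = prop_network(layers, x - eps, x + eps)
verified = bool(lo[0] > hi[1])  # psi: class 0 provably scores above class 1
```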

For a verifier V , we let V (I(*x*), N) denote the abstraction obtained after the propagation of I(*x*) through the network N. We declutter notation by overloading N and writing N(I(*x*)) for the same if V is clear from context, i.e., V (I(*x*), N) = N(I(*x*)).

We consider robustness verification, where the goal is to prove that the network classification does not change within an input region. A common input region is the ℓ<sub>∞</sub>-bounded additive noise, defined as I<sub>ε</sub>(*x*) := {*z* | ∥*x*−*z*∥<sub>∞</sub> ≤ ε}. Here, ε defines the size of the maximal perturbation to *x*. The postcondition ψ denotes classification to the same class as *x*. Throughout this paper, we consider different instantiations of I(*x*) but assume that ψ denotes classification invariance (although other choices would work analogously). Due to this, we refer to I(*x*) as input region and specification interchangeably. For example, in Fig. 1, the goal is to verify that all points contained in N(I1(*x*)) are classified as "cat".

# **3 Proof Sharing with Templates**

Before introducing our framework for proof sharing, we further expand on the motivating example discussed in Fig. 1.

#### **3.1 Motivation: Proof Subsumption**

As stated earlier, we empirically observed that for many input regions I<sup>i</sup>(*x*) and I<sup>j</sup> (*x*), the abstraction corresponding to one region at some intermediate layer k contains that of another. Formally:

**Definition 1 (Proof Subsumption).** *For specifications* I<sup>i</sup>(*x*), I<sup>j</sup> (*x*)*, we say that the proof of* I<sup>i</sup>(*x*) *subsumes that of* I<sup>j</sup> (*x*) *if at some layer* k*,* N1:<sup>k</sup>(I<sup>j</sup> (*x*)) ⊆ N1:<sup>k</sup>(I<sup>i</sup>(*x*))*, which we denote as* I<sup>j</sup> (*x*) ⊆N,k I<sup>i</sup>(*x*)*.*
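When the abstractions at layer k are boxes, the subsumption check of Definition 1 reduces to an elementwise comparison of bounds; a sketch with made-up bounds:

```python
import numpy as np

# Check N_{1:k}(I_j(x)) ⊆ N_{1:k}(I_i(x)) for Box abstractions: the inner
# box's bounds must lie within the outer box's bounds in every dimension.
def box_subsumes(lo_i, hi_i, lo_j, hi_j):
    return bool(np.all(lo_i <= lo_j) and np.all(hi_j <= hi_i))

# Illustrative bounds at some layer k (not from a real network).
lo_i, hi_i = np.array([0.0, 0.0]), np.array([1.0, 1.0])
lo_j, hi_j = np.array([0.2, 0.1]), np.array([0.9, 0.8])
contained = box_subsumes(lo_i, hi_i, lo_j, hi_j)
```

For richer domains (e.g., zonotopes) inclusion checking is more involved, which is why the cost of this check matters for the overall approach.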

While not formally required, particularly interesting are cases where proof subsumption occurs despite I<sup>j</sup> (*x*) ⊄ I<sup>i</sup>(*x*). This form of proof subsumption is showcased in Fig. 1, where I1(*x*) and I2(*x*) have only a small overlap, yet I2(*x*) ⊆N,k I1(*x*). For another example, consider a neural network N trained as a hand-written digit classifier for the MNIST dataset [22] (example shown in Fig. 2) and the following two specifications:

**Fig. 2.** Example of an MNIST image. I<sup>18,21</sup><sub>5×5</sub>(*x*) signifies arbitrary change in the outlined area.

– ℓ<sub>∞</sub>-bounded perturbations: all pixels in an input image can be changed independently by a small amount ε: I<sub>ε</sub>(*x*) := {*z* | ∥*x* − *z*∥<sub>∞</sub> ≤ ε},



**Fig. 3.** The abstraction obtained for I<sub>ε</sub>(*x*) (blue) contains that for I<sup>i,j</sup><sub>2×2</sub>(*x*) (orange) (projected to d = 2). (Color figure online)

– adversarial patches [10]: a p × p patch, inside which the pixel intensity can vary arbitrarily, is placed on an image at coordinates (i, j), for which we write I<sup>i,j</sup><sub>p×p</sub>. We showcase a patch in Fig. 2 and formally define them in Sect. 4.3.

Clearly, I<sup>i,j</sup><sub>p×p</sub>(*x*) ⊄ I<sub>ε</sub>(*x*) (unless ε = 1). In Table 1, we show that for a classifier (5 layers with 100 neurons each) we indeed observe proof subsumption. We report the accuracy, i.e., the rate of correct predictions on the unperturbed test data, as well as the certified accuracy, i.e., the rate of samples *x* for which the prediction is correct and I(*x*) |= ψ is verified, for I<sub>ε</sub> with ε = 0.1 and 0.2 over the whole test set. We also show the percentage of I<sup>i,j</sup><sub>2×2</sub>(*x*) contained in I<sub>ε</sub>(*x*) at layer k. To this end, we pick 1000 random *x* for which I<sub>ε</sub>(*x*) is verifiable and sample 2 (i, j) pairs each. We utilize a Box domain verifier and a robustly trained network [24]. Figure 3 shows a patch specification I<sup>i,j</sup><sub>2×2</sub>(*x*) (in orange) contained in the ℓ<sub>∞</sub> specification I<sub>ε</sub>(*x*) (in blue), projected to 2 dimensions via PCA.

*Reasons for Proof Subsumption.* In Table 1, we observe that the rate of proof subsumption increases with larger ε and k. These observations give an intuition as to why we observe proof subsumption. First, as input regions pass through the neural network, the abstractions become more imprecise in each layer. While this fundamentally limits verification, it makes the subsumption of abstractions more probable. This effect increases when increasing ε for I<sub>ε</sub>. Second, and more fundamentally, semantically similar yet distinct image inputs, e.g., two similar-looking handwritten digits, have activation vectors that grow closer in ℓ<sub>2</sub> norm as they pass through the layers of the neural network [21,34]. This effect is a consequence of the neural network distilling low-level information (e.g., individual pixel values) into high-level concepts (e.g., the classes of digits). As specifications (and their proofs) correspond to sets of concrete inputs, a similar effect may apply. We conjecture that these two effects drive the observed proof subsumption.

#### **3.2 Proof Sharing with Templates**

Leveraging this insight, we introduce the idea of proof sharing via templates, showcased in Fig. 4. We use an abstraction obtained from a robustness proof

**Fig. 4.** Conceptualization of proof sharing with templates. In (a) we create a verifiable template T (black-dashed border) from the abstraction N1:<sup>k</sup>(I1(*x*)). When verifying new specifications I2,..., I5, shown in (b), we can shortcut the verification of all but I<sup>5</sup> by subsuming them in T.

N1:<sup>k</sup>(I1(*x*)) at layer k to create a template T. After ensuring that T is verifiable, it can be used to shortcut the verification of other regions, e.g., of I2(*x*),..., I5(*x*). Formally we decompose proof sharing into two sub-problems: (i) the generation of proof templates and (ii) the matching of abstractions corresponding to other properties to these templates. For simplicity, here we only consider templates at a single layer k of the neural network and we show an extension to multiple layers in Sect. 4.3.

Our goal is to construct a template T at layer k that implies the postcondition and captures abstractions at layer k obtained from propagating several I<sup>i</sup>(*x*). As it is challenging to find a single T that captures abstractions corresponding to many input regions, yet remains verifiable, we allow a set of templates T . We state this formally as:

*Problem 2 (Template Generation).* For a given neural network N, input *x* and set of specifications I1,..., I<sup>r</sup>, layer k and a postcondition ψ, find a set of templates T with |T | ≤ m such that:

$$\begin{aligned} \mathop{\arg\max}\_{\mathcal{T}} & \sum\_{i=1}^{r} \left[ \bigvee\_{T \in \mathcal{T}} N\_{1:k}(\mathcal{I}\_i(\mathbf{x})) \subseteq T \right] \\ \text{s.t. } & \forall \ T \in \mathcal{T}.\; N\_{k+1:L}(T) \models \psi. \end{aligned} \tag{1}$$

Intuitively, Eq. (1) aims to find a set T of templates T at layer k such that the maximal number (via the sum) of specifications I1,..., I<sup>r</sup> is contained in at least one template T (via the disjunction), while ensuring that the individual T are still verifiable (via the constraint in the second line). As neural network verification, required by the constraints of Eq. (1), is NP-complete [17], computing an exact solution to Problem 2 is computationally infeasible. Therefore, we compute an approximate solution to Eq. (1). In general, Problem 2 does not necessarily require that the templates T are created from previous proofs. However, building on proof subsumption as discussed in Sect. 3.1, in Sect. 4 we will infer the templates from previously obtained abstractions.

To leverage proof sharing once the templates T are obtained, we need to be able to match an abstraction S = N1:k(I(*x*)) verified using proof transfer to a template in T :

*Problem 3 (Template Matching).* Given a set of templates T at layer k of a neural network N, and a new input region I(*x*), determine whether there exists a T ∈ T such that S ⊆ T, where S = N1:k(I(*x*)).

Together, Problems 2 and 3 outline a general framework for proof sharing, permitting many instantiations. We note that Problems 2 and 3 present an inherent precision vs. speed trade-off: Problem 3 can be solved most efficiently for small values of m = |T | and simpler representations of T (allowing faster checking of S ⊆ T) at the cost of lower proof matching rates. Alternatively, Eq. (1) can be maximized by large m and T represented by complex abstractions, thus attaining high precision but expensive template generation and matching.

*Beyond Proof Sharing on the Same Input.* In this section, we focused on proof sharing for different specifications of the same input *x*. However, we observed that proof sharing is even possible between specifications defined on different inputs *x* and *x*′. To facilitate the use of templates in this setting, Eq. (1) in Problem 2 can be adapted to consider an input distribution.

### **4 Efficient Verification via Proof Sharing**

We now consider an instantiation of proof sharing where we are given an input *x* and properties I1,..., I<sup>r</sup> to verify. Our general approach, based on Problems 2 and 3, is shown in Algorithm 1. In this section, we first discuss Algorithm 1 in general. We then describe the possible choices of abstract domains and their implications on the algorithm, followed by a discussion on template generation for two different specific problems. Finally, we conclude the section with a discussion on the conditions for effective proof sharing verification.

In Algorithm 1, we first create the set of templates T (Line 1, discussed shortly) and subsequently verify I1,..., I<sup>r</sup> using T . Here, we consider two, potentially identical, verifiers V<sup>T</sup> and VS, where V<sup>T</sup> is used to create the templates T and V<sup>S</sup> is used to propagate input regions up to the template layer k. For each I<sup>i</sup> we propagate it up to layer k (Line 4) to obtain S = N1:<sup>k</sup>(I<sup>i</sup>(*x*)) and check if we can match it to a template T<sup>j</sup> ∈ T (Line 6) using an inclusion check. If a match is found, then we conclude that N(I<sup>i</sup>(*x*)) |= ψ and set the verification output v<sup>i</sup> to True. If this is not the case (Line 11) we verify N(I<sup>i</sup>(*x*)) |= ψ directly by checking VS(S, N<sup>k</sup>+1:<sup>L</sup>) |= ψ. If the template generation fails, we revert to verifying I<sup>i</sup> by applying V<sup>S</sup> in the usual way (omitted in Algorithm 1).
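This loop can be sketched as follows, instantiating V<sup>S</sup> = V<sup>T</sup> with a Box-domain verifier over a toy two-layer network; the template generation of Line 1 is stubbed out, i.e., the templates below are simply assumed to be pre-verified with respect to ψ:

```python
import numpy as np

# Sketch of Algorithm 1 with a Box verifier: propagate each region up to
# layer k, try to match the abstraction S against a template, and only on
# failure finish the propagation and check psi directly.
def prop(layers, lo, hi):
    for A, b in layers:
        c, r = (lo + hi) / 2, (hi - lo) / 2
        c, r = A @ c + b, np.abs(A) @ r
        lo, hi = np.maximum(c - r, 0.0), np.maximum(c + r, 0.0)  # affine + ReLU
    return lo, hi

def verify_with_templates(layers, k, regions, templates, psi):
    results = []
    for lo, hi in regions:
        S = prop(layers[:k], lo, hi)               # propagate up to layer k
        if any(np.all(tl <= S[0]) and np.all(S[1] <= th)
               for tl, th in templates):           # S ⊆ T_j: proof is shared
            results.append(True)
            continue
        out_lo, out_hi = prop(layers[k:], *S)      # fall back: full propagation
        results.append(psi(out_lo, out_hi))
    return results

layers = [(np.eye(2), np.zeros(2)),
          (np.array([[2.0, 0.0], [0.0, 1.0]]), np.zeros(2))]
templates = [(np.zeros(2), np.array([2.0, 0.5]))]  # assumed pre-verified w.r.t. psi
psi = lambda lo, hi: bool(lo[0] > hi[1])           # "class 0 provably wins"
regions = [(np.array([0.5, 0.2]), np.array([1.5, 0.4])),   # matches the template
           (np.array([3.0, 0.0]), np.array([3.5, 0.2]))]   # needs full propagation
results = verify_with_templates(layers, 1, regions, templates, psi)
```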

*Soundness.* As long as the templates T are sound, this procedure is sound, i.e. Algorithm 1 only returns v<sup>i</sup> = True if ∀*z* ∈ I<sup>i</sup>(*x*). N(z) |= ψ holds. Formally:

**Theorem 1.** *Algorithm 1 is sound if* ∀ T ∈ T, z ∈ T. N<sup>k</sup>+1:<sup>L</sup>(z) |= ψ *and* V<sup>S</sup> *is sound.*

This holds by the construction of the algorithm:

*Proof.* For a given *x* and Ii, Algorithm 1 only claims v<sup>i</sup> = True if either the check in (i) Line 6 or (ii) Line 11 succeeds. Since V<sup>S</sup> is sound, we know that ∀*z* ∈ I<sup>i</sup>(*x*). N1:<sup>k</sup>(z) ∈ S. Therefore in case (i) by our requirement on T as well as S ⊆ T it follows that ∀*z* ∈ I<sup>i</sup>(*x*). N(z) |= ψ. In case (ii) we execute Line 12 and the same property holds due to the soundness of VS.

Importantly, Theorem 1 shows that the generation process of T does not affect the overall soundness as long as the set of templates T fulfills the condition in Theorem 1. In particular, this means that when solving Problem 2, it suffices to show that the side condition (∀ T ∈ T . N<sup>k</sup>+1:<sup>L</sup>(T) |= ψ) holds, while heuristically approximating the actual optimization criteria. We let V<sup>T</sup> denote the verifier used to ensure this property in gen templates.

```
Algorithm 1: Neural Network Verification Utilizing Proof Templates
   Input: x, I1, ..., Ir, k, ψ, verifiers VS, VT
   Result: v1, ..., vr indicating vi := (N(Ii(x)) |= ψ)
 1 T ← gen templates(x, N, k, ψ, VS, VT)
 2 v1, ..., vr ← False
 3 for i ← 1 to r do
 4   S ← VS(Ii(x), N1:k)
 5   for Tj ∈ T do
 6     if S ⊆ Tj then
 7       vi ← True
 8       break
 9     end
10   end
11   if ¬vi then
12     vi ← (VS(S, Nk+1:L) |= ψ)
13   end
14 end
15 return v1, ..., vr
```

*Precision.* We say a verifier V<sup>1</sup> is more precise than another verifier V<sup>2</sup> on N if, out of a set of specifications, it can verify some that V<sup>2</sup> cannot.

**Theorem 2.** *If* VS(VS(I<sup>i</sup>(*x*), N1:<sup>k</sup>), N<sup>k</sup>+1:<sup>L</sup>) = VS(I<sup>i</sup>(*x*), N)*, then Algorithm 1 is at least as precise as* VS*.*

*Proof.* Even if the inclusion check in Line 6 fails, Line 12 outputs v<sup>i</sup> = VS(VS(I<sup>i</sup>(*x*), N1:<sup>k</sup>), N<sup>k</sup>+1:<sup>L</sup>) |= ψ, which by our requirement equals v<sup>i</sup> = VS(I<sup>i</sup>(*x*), N) |= ψ. Therefore we have at least the precision of VS.

The required property holds for any verifier V<sup>S</sup> for which the abstraction at each network layer depends only on the abstractions from previous layers; it is fulfilled for all verifiers considered in this paper. For verifiers V<sup>S</sup> that do not fulfill this property, potential losses in precision can be remedied (at the cost of runtime) by using VS(I<sup>i</sup>(*x*), N1:<sup>L</sup>) in Line 12. Interestingly, it is even possible to increase the precision of Algorithm 1 over V<sup>S</sup> by creating templates T that are verified with a more precise verifier V<sup>T</sup>. However, in this discussion, we restrict ourselves to speed gains. We believe that obtaining precision gains requires instantiating our framework with a significantly different approach than that taken for improving speed, which is the main focus of our work. We leave this as an interesting item for future work.

*Run-Time.* Here, we aim to characterize the run-time of Algorithm 1 as well as its speed-up over conventional verification. For an input *x*, (keeping the other parameters fixed), the expected run time is

$$t\_{PS} = t\_T + r(t\_S + t\_{\subseteq} + (1 - \rho)t\_{\psi}) \tag{2}$$

where $t_T$ is the expected time required to generate the templates in Line 1, $r$ is the number of specifications to be verified, $t_S$ is the expected time to compute $S$ (Line 4), $t_\subseteq$ is the time to check $S \subseteq T$ for $T \in \mathcal{T}$ until a match is found (Line 5 to Line 10), $\rho \in [0, 1]$ is the rate of specifications for which a template is found, and $t_\psi$ is the time required to check $\psi$ on the network output corresponding to $S$ (Line 12). This time is minimized if the individual expected run-times $t_T$, $t_S$, $t_\psi$ are minimal and $\rho$ is large (i.e., close to 1). Unfortunately, computing the template match rate $\rho$ analytically is challenging and requires global reasoning over the neural network for all valid inputs, which are not clearly defined. However, our empirical analysis (in Sect. 5) shows that $\rho$ is higher when templates are created at later layers (as in Sect. 3.1).

To determine the speed-up compared to a baseline standard verifier, we make the simplifying assumption that there is a single verifier $V = V_S = V_T$ with expected run-time $\nu$ per layer. Thus, the expected run-time of the conventional verifier is $t_{BL} = rL\nu$. We have $t_T = \lambda m L \nu$, $t_S = k\nu$, $t_\psi = (L-k)\nu$, $t_\subseteq = \eta m$ and ultimately $t_{PS} = (\lambda m + r(1-\rho))L\nu + r\rho k \nu + r \eta m$ for constants $\lambda \in \mathbb{R}_{>0}$, which indicates the overhead of generating one template over just verifying it, and $\eta \in \mathbb{R}_{>0}$, which denotes the time required to perform an inclusion check for one template. As this phrasing shows, Algorithm 1 has the same asymptotic run-time as the base verifier $V$. Further, this formulation allows us to write our expected speed-up as $\frac{t_{BL}}{t_{PS}} = \frac{r}{\lambda m + \eta r m / (L\nu) + r\rho k/L + r(1-\rho)}$. This speed-up is maximized when $k$ is small compared to $L$, i.e., templates are placed early in the neural network, the matching rate $\rho$ is close to 1, and $m$, $\lambda$, $\eta$ are small, i.e., generation and matching are fast. Unfortunately, these requirements are at odds with each other: as we show in Sect. 5, higher $m$ leads to a higher matching rate $\rho$, and $\rho$ is naturally higher for templates later in the neural network (higher $k$). Thus, high speed-ups require careful hyper-parameter choices.
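The run-time model above can be made concrete in a short sketch (the function is a direct transcription of Eq. (2); the parameter values in the usage below are illustrative, not measurements from the paper):

```python
def expected_speedup(r, L, k, m, rho, lam, eta, nu):
    """Expected speed-up t_BL / t_PS of proof sharing (Eq. 2) over the
    baseline verifier, under the single-verifier assumption V = V_S = V_T."""
    t_bl = r * L * nu                # baseline: r full propagations
    t_t = lam * m * L * nu          # template generation
    t_s = k * nu                    # propagate one input to layer k
    t_sub = eta * m                 # inclusion checks against m templates
    t_psi = (L - k) * nu            # finish the propagation on a mismatch
    t_ps = t_t + r * (t_s + t_sub + (1 - rho) * t_psi)
    return t_bl / t_ps
```

For example, with no templates ($m = 0$, $\rho = 0$) the speed-up degenerates to 1, while a high matching rate at an early layer yields a speed-up roughly approaching $L/k$ for large $r$.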

To showcase how we can achieve good templates as well as fast matching, we next discuss the choice of the abstract domain to be used in the propagation and the representation of the templates. Then we discuss the template generation procedure and instantiate it for the verification of robustness to adversarial patches and geometric perturbations.

#### **4.1 Choice of Abstract Domain**

To solve Problems 2 and 3 in a way that minimizes the expected run-time and maximizes the overall precision, the choice of abstract domain is crucial. Here, we briefly review common choices of abstract domains for neural network verification and how they are suited to our problem. Geometrically, these domains can be thought of as convex abstractions of the set of vectors representing reachable values at each layer of the neural network. We say that an abstraction $a_1$ is more precise than another abstraction $a_2$ if and only if $a_1 \subseteq a_2$, i.e., all points in $a_1$ occur in $a_2$. Similarly, we say that a domain is more precise than another if it can express all abstractions in the other domain.

The Box (or Interval) domain [14,16,24] abstracts sets in $d$ dimensions as $B = \{\boldsymbol{a} + \mathrm{diag}(\boldsymbol{d})\boldsymbol{e} \mid \boldsymbol{e} \in [-1, 1]^d\}$ with center $\boldsymbol{a} \in \mathbb{R}^d$ and width $\boldsymbol{d} \in \mathbb{R}^d_{\geq 0}$. The Zonotope domain [14,15,24,31,40] uses relaxations $Z$ of the form

$$Z = \{ \boldsymbol{a} + \boldsymbol{A}\boldsymbol{e} \mid \boldsymbol{e} \in [-1, 1]^q \}, \tag{3}$$

parametrized with $\boldsymbol{a} \in \mathbb{R}^d$ and $\boldsymbol{A} \in \mathbb{R}^{d \times q}$.

A third common choice is (restricted) convex Polyhedra $P$ [12,32,42]. Here, we consider $P$ to be in the DeepPoly (DP) domain [32,42]. Generally, Boxes are less precise, i.e., certify fewer properties, than Zonotopes or Polyhedra.
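Under the Zonotope definition of Eq. (3), an affine layer maps a zonotope to a zonotope by transforming the center and the generator matrix. A minimal pure-Python sketch of this standard transformer (illustrative only; full verifiers such as DeepZ additionally need transformers for non-linearities like ReLU):

```python
def affine_zonotope(a, A, W, b):
    """Propagate the zonotope Z = {a + A e | e in [-1,1]^q} of Eq. (3)
    through an affine layer x -> W x + b.  The center is mapped by the
    layer, the generator matrix is multiplied by W, and the result is
    again a zonotope (W A) e + (W a + b)."""
    d, q = len(A), len(A[0])
    a_out = [sum(W[i][j] * a[j] for j in range(d)) + b[i]
             for i in range(len(W))]
    A_out = [[sum(W[i][j] * A[j][k] for j in range(d)) for k in range(q)]
             for i in range(len(W))]
    return a_out, A_out
```

A Box is the special case where the generator matrix is diagonal, which is why the same propagation code covers both domains.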

For efficient proof sharing, we require a fast inclusion check $S \subseteq T$, which is challenging in our context due to the high dimensionality $d$ of the intermediate neural network layers. While we point the interested reader to [29] for a detailed discussion, we summarize the key results in Table 2.

**Table 2.** Feasibility of $S \subseteq T$ for Box $B$, Zonotope $Z$ (with order reduction) and DP Polyhedra $P$.

There, ✓ denotes feasibility, i.e., low polynomial run-time (usually $2d$ comparisons, sometimes with an additional matrix multiplication), and ✗ denotes infeasibility, e.g., exponential run-time. If $T$ is a Box, all checks are simple, as it suffices to compute the outer bounding box of $S$ and compare the $2d$ constraints. If $T$ is a DP Polyhedron, these checks require solving a linear program (LP). While the size of this LP permits a low theoretical time complexity in case $S$ is a Box or DP Polyhedron, in practice we consider calling an LP solver too expensive (denoted as (✓)). For Zonotopes, these checks are generally infeasible, as they require an enumeration of faces or corners, which is computationally expensive for large $d$. While Zonotopes can be encoded as Polyhedra (but not necessarily DP Polyhedra) and the same LP inclusion check as for $P$ could be used, the resulting LP would require exponentially many variables due to the previously mentioned enumeration. However, by placing constraints on the matrix $\boldsymbol{A}$ in Eq. (3), these inclusion checks can be performed efficiently. The mapping of a Zonotope to such a restricted Zonotope is called order reduction via outer-approximation [19,29].

In particular, for a Zonotope $Z$ we consider the order reduction $\alpha_{\mathrm{Box}}$ to its outer bounding box (where $\boldsymbol{A}$ is diagonal) and note that other choices of $\alpha$ are possible (e.g., the reduction to affine transformations of a hyperbox).

For a general Zonotope $Z$, its outer bounding box $\bar{Z} = \alpha_{\mathrm{Box}}(Z)$ can be easily obtained. The center of $\bar{Z}$ is $\boldsymbol{a}$, the center of $Z$. The width $\boldsymbol{d} \in \mathbb{R}^d_{\geq 0}$ is given by $d_i = \sum_{j=1}^{q} |A_{i,j}|$. $\bar{Z}$ can be represented as either a Box or a Zonotope (with $\boldsymbol{A} = \mathrm{diag}(\boldsymbol{d})$). To check $S \subseteq \bar{Z}$ for a general Zonotope $S$, it suffices to check $\alpha_{\mathrm{Box}}(S) \subseteq \bar{Z}$, which reduces to the simple inclusion check for boxes.
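The $\alpha_{\mathrm{Box}}$ reduction and the resulting box inclusion check can be sketched as follows (hypothetical helpers; boxes are represented by center and half-width vectors):

```python
def alpha_box(a, A):
    """Order reduction: outer bounding box of the zonotope {a + A e}.
    The half-width in dimension i is d_i = sum_j |A[i][j]|."""
    return a, [sum(abs(v) for v in row) for row in A]

def box_included(a_s, d_s, a_t, d_t):
    """Check the box inclusion [a_s - d_s, a_s + d_s] ⊆ [a_t - d_t, a_t + d_t]
    via 2d coordinate-wise comparisons."""
    return all(at - dt <= as_ - ds and as_ + ds <= at + dt
               for as_, ds, at, dt in zip(a_s, d_s, a_t, d_t))
```

To test $S \subseteq T$ for a zonotope $S$ and a box template $T$, one checks `box_included(*alpha_box(a_S, A_S), a_T, d_T)`; this outer-approximation may reject some true inclusions but never accepts a false one.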

Based on the above discussion, we will use the Zonotope domain to represent all abstractions, and use verifiers $V_S = V_T$ that propagate these Zonotopes using the state-of-the-art DeepZ transformers [31]. To permit efficient inclusion checks, we apply $\alpha_{\mathrm{Box}}$ to the resulting Zonotopes to obtain the Box templates $T$, which we treat as a special case of Zonotopes.

#### **4.2 Template Generation**

We now discuss instantiations of `gen_templates` in Algorithm 1. Recall from Sect. 3.1 the idea of proof subsumption, i.e., that abstractions for some specifications contain abstractions for other specifications. Building on this, we relax Problem 2 in order to create $m$ templates $T_j$ from intermediate abstractions $N_{1:k}(\hat{I}_i(\boldsymbol{x}))$ for some $\hat{I}_1, \ldots, \hat{I}_m$. Note that the $\hat{I}_j$ are not necessarily directly related to the specifications $I_1, \ldots, I_r$ that we want to verify. For a chosen layer $k$, input $\boldsymbol{x}$, number of templates $m$ and verifiers $V_S$ and $V_T$ we optimize

$$\begin{aligned} & \operatorname*{arg\,max}_{\hat{I}_1, \ldots, \hat{I}_m} \sum_{i=1}^r \left[ \bigvee_{j=1}^m V_S(I_i(\boldsymbol{x}), N_{1:k}) \subseteq T_j \right] \\ & \text{where } T_j = \alpha_{\mathrm{Box}}(V_T(\hat{I}_j(\boldsymbol{x}), N_{1:k})) \\ & \text{s.t. } V_T(T_j, N_{k+1:L}) \models \psi \text{ for } j \in 1, \ldots, m. \end{aligned} \tag{4}$$

As originally in Problem 2 (Eq. (1)), we aim to find a set of templates such that the intermediate shapes at layer $k$ for most of the $r$ specifications are covered by at least one template $T$. In contrast to Eq. (1), we tie $T_j$ to the specifications $\hat{I}_j$. This alone does not make the problem easier to tackle. However, next, we will discuss how to generate application-specific parametric $\hat{I}_j$ and solve Eq. (4) by optimizing over their parameters, allowing us to solve template generation much more efficiently than in Eq. (1).

#### **4.3 Robustness to Adversarial Patches**

We now instantiate the above scheme in order to verify the robustness of image classifiers against adversarial patches [10]. Consider an attacker that is allowed to arbitrarily change any $p \times p$ patch of the image, as showcased earlier in Fig. 2. For such a patch over pixel positions $[i, i+p-1] \times [j, j+p-1]$, the corresponding perturbation is

$$\begin{aligned} I^{i,j}_{p \times p}(\boldsymbol{x}) &:= \{ \boldsymbol{z} \in [0,1]^{h \times w} \mid \boldsymbol{z}_{\pi^C_{i,j}} = \boldsymbol{x}_{\pi^C_{i,j}} \} \\ \text{with } \pi_{i,j} &= \{ (k,l) \mid k \in i, \ldots, i+p-1,\; l \in j, \ldots, j+p-1 \} \end{aligned}$$

where $h$ and $w$ denote the height and width of the input $\boldsymbol{x}$. Here, $\pi_{i,j}$ denotes the parts of the image affected by the patch, and $\pi^C_{i,j}$ its complement, i.e., the


**Fig. 5.** Example splits <sup>μ</sup> for 10 <sup>×</sup> 10 pixels.

**Fig. 6.** Example Template. (Color figure online)

unaffected part of the image. To prove robustness for an arbitrarily placed $p \times p$ patch, however, one must consider the perturbation set $I_{p \times p}(\boldsymbol{x}) := \bigcup_{i,j} I^{i,j}_{p \times p}(\boldsymbol{x})$.

To prove robustness for $I_{p \times p}$, existing approaches [10] separately verify $I^{i,j}_{p \times p}(\boldsymbol{x})$ for all $i \in \{1, \ldots, h-p+1\}$, $j \in \{1, \ldots, w-p+1\}$. For example, with $p = 2$ and a $28 \times 28$ MNIST image, this approach requires 729 individual proofs. Because the different proofs for $I_{p \times p}$ share similarities, this is an ideal candidate for proof sharing. We utilize Algorithm 1 and check $\wedge_i v_i$ at the end to speed up this process. For template generation, we solve Eq. (4) for $m$ templates with one input perturbation $\hat{I}_i$ per template.
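The enumeration of patch specifications can be sketched as follows (hypothetical helper names; images are plain nested lists of pixel values in $[0,1]$):

```python
def patch_positions(h, w, p):
    """Top-left corners (i, j) of all p x p patches in an h x w image;
    each corner yields one specification I^{i,j}_{p x p}(x) to verify."""
    return [(i, j) for i in range(h - p + 1) for j in range(w - p + 1)]

def patch_input_bounds(x, i, j, p):
    """Per-pixel interval bounds for I^{i,j}_{p x p}(x): pixels inside the
    patch may take any value in [0, 1]; all others are fixed to x."""
    lo = [row[:] for row in x]
    hi = [row[:] for row in x]
    for k in range(i, i + p):
        for l in range(j, j + p):
            lo[k][l], hi[k][l] = 0.0, 1.0
    return lo, hi
```

`patch_positions(28, 28, 2)` yields the 729 specifications mentioned above, and `patch_positions(32, 32, 2)` the 961 for CIFAR.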

We empirically found (recall Table 1) that setting $\hat{I}_i$ to an $\ell_\infty$ region $I_{\epsilon_i}$ captures a majority of the patch perturbations $I^{i,j}_{p \times p}$ at intermediate layers particularly well. Specifically, we found that setting $\epsilon_i$ to the maximally verifiable value for the given input works best.

To further increase the number of specifications contained in a set of templates $\mathcal{T}$, we use $m$ template perturbations of the form

$$\hat{I}_i(\boldsymbol{x}) := \{ \boldsymbol{z} \mid \|\boldsymbol{z}_{\mu_i} - \boldsymbol{x}_{\mu_i}\|_{\infty} \le \epsilon_i \land \boldsymbol{z}_{\mu_i^C} = \boldsymbol{x}_{\mu_i^C} \},$$

where $\mu_i$ denotes a subset of the pixels of the input image and $\mu_i^C$ its complement, and we maximize $\epsilon_i$ in a best-effort manner. In particular, we consider $\mu_1, \ldots, \mu_m$ such that they partition the set of pixels in the image (e.g., as in Fig. 5).

As noted earlier, this generation procedure needs to be fast, yet obtain $\mathcal{T}$ to which many abstractions match in order to obtain speed-ups. Thus, we consider small $m$ and fixed patterns $\mu_1, \ldots, \mu_m$. For each $\hat{I}_i$, we aim to find the largest $\epsilon_i$ which can still be verified in order to maximize the number of matches. Note that for $m = 1$, this is equivalent to the $\ell_\infty$ input perturbation $I_\epsilon$ with the maximally verifiable $\epsilon$ for the given image.
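Fixed partition patterns can be constructed with a small sketch (a hypothetical helper; the $2 \times 2$ grid of Fig. 5 corresponds to `rows=cols=2`, the full-image mask to `rows=cols=1`):

```python
def grid_masks(h, w, rows, cols):
    """Partition the h x w pixel grid into rows*cols rectangular masks
    mu_1, ..., mu_m, each given as a set of (i, j) pixel coordinates."""
    masks = []
    for r in range(rows):
        for c in range(cols):
            masks.append({(i, j)
                          for i in range(r * h // rows, (r + 1) * h // rows)
                          for j in range(c * w // cols, (c + 1) * w // cols)})
    return masks
```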

Concretely, we can perform a binary search over $\epsilon_i$ in order to find a large $\epsilon_i$ still satisfying $N_{k+1:L}(\alpha_{\mathrm{Box}}(N_{1:k}(\hat{I}_i))) \models \psi$. Verification with our chosen DeepZ Zonotopes is not monotonic in $\epsilon_i$ due to the non-monotonic transformers used for non-linearities (e.g., ReLU). This renders the application of binary search a best-effort approximation. As we do not require a formal maximum but rather aim to solve a surrogate for Problem 2, this still works well in practice. Further note that applying $\alpha_{\mathrm{Box}}$ to templates introduces imprecision, i.e., $V_T$ might not be able to prove properties over templates that it could prove without the application of $\alpha_{\mathrm{Box}}$. However, Theorem 2 (which only requires properties of $V_S$) still applies.

*Templates at Multiple Layers.* We can extend this approach to obtain templates at multiple layers without a large increase in computational cost. With templates at multiple layers, we first try to match the propagated shape against the earliest template layer and, upon failure, propagate it further to the next, where we again attempt to match the template. In Algorithm 1, this means repeating the block from Line 4 to Line 10 for each template layer before going on to the check in Line 11.

The full template generation procedure is given in Algorithm 2. First, we perform a binary search over $\epsilon_i$ (Line 6) to find the largest $\epsilon_i$ for which the specification is verifiable. Then, for each layer $k$ in the set of layers $K$ at which we are creating templates, we create a box $T_k$ from the Zonotope. As this $T_k$ may not be verifiable, due to the imprecision added by $\alpha_{\mathrm{Box}}$, we perform another binary search for the largest scaling factor $\beta_k$ (Line 10), which is applied to the matrix $\boldsymbol{A}$ in Eq. (3). We denote this operation as $\beta_k T_k$. We show an example for a single layer $k$ in Fig. 6. The blue area outlines the Zonotope found via Line 6, which is verifiable as it lies fully on one side of the decision boundary (red, dashed). After applying $\alpha_{\mathrm{Box}}$ (orange), however, it is not (it crosses the decision boundary). After scaling by $\beta_k$, the shape is verifiable again (green) and is used as a template.
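The two binary searches in Algorithm 2 (over $\epsilon_i$ in Line 6 and over $\beta_k$ in Line 10) share the same skeleton, sketched below. Here `verifies` stands in for a call to the underlying verifier and is an assumed callback; as discussed above, the result is only a best-effort maximum since verification is not monotonic in the parameter:

```python
def binary_search_max(verifies, lo=0.0, hi=1.0, steps=20):
    """Best-effort binary search for the largest parameter value that the
    verifier accepts (epsilon_i in Line 6, beta_k in Line 10 of Algorithm 2).
    Returns None if no tried value verifies."""
    best = None
    for _ in range(steps):
        mid = (lo + hi) / 2
        if verifies(mid):
            best, lo = mid, mid   # mid verifies: try larger values
        else:
            hi = mid              # mid fails: try smaller values
    return best
```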

#### **4.4 Geometric Robustness**

Geometric robustness verification [3,13,28,32] aims to verify the robustness of neural networks against geometric transformations such as image rotations or translations. These transformations typically include an interpolation operation. For example, consider the rotation $R_\gamma$ of an image by $\gamma \in \Gamma$ degrees for an interval $\Gamma$ (e.g., $\gamma \in [-5, 5]$), for which we consider the specification $I_\Gamma(\boldsymbol{x}) := \{R_\gamma(\boldsymbol{x}) \mid \gamma \in \Gamma\}$. We note that, unlike $\ell_\infty$ and patch verification, the input regions for geometric transformations are non-linear and have no closed-form solutions. Thus, an overapproximation of the input region must be obtained [3]. For large $\Gamma$, the approximate input region $I_\Gamma(\boldsymbol{x})$ can be too coarse, resulting in imprecise verification. Hence, in order to assert $\psi$ on $I_\Gamma$, existing state-of-the-art approaches [3] split $\Gamma$ into $r$ smaller ranges $\Gamma_1, \ldots, \Gamma_r$ and then verify the resulting $r$ specifications $(I_{\Gamma_i}, \psi)$ for $i \in 1, \ldots, r$. These smaller perturbations share similarities, facilitating proof sharing. We instantiate our approach similarly to Sect. 4.3. A key difference to Sect. 4.3 is that while $\boldsymbol{x} \in I^{i,j}_{p \times p}(\boldsymbol{x})$ for all $i, j$ in the patch setting, here in general $\boldsymbol{x} \notin I_{\Gamma_i}(\boldsymbol{x})$ for most $i$. Therefore, the individual perturbations $I_{\Gamma_i}(\boldsymbol{x})$ do not overlap. To account for this, we consider $m$ templates and split $\Gamma$ into $m$ equally sized chunks (unrelated to the $r$ splits), obtaining the angles $\gamma_1, \ldots, \gamma_m$ at the centers of the chunks. For the $m$ templates we then consider the perturbations $\hat{I}_i := I_{\epsilon_i}(R_{\gamma_i}(\boldsymbol{x}))$, denoting the $\ell_\infty$ perturbation of size $\epsilon_i$ around the $\gamma_i$-degree rotated $\boldsymbol{x}$. To find the templates, we employ a procedure analogous to Algorithm 2.
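Splitting $\Gamma$ into $m$ equally sized chunks and picking their center angles is straightforward (illustrative helper):

```python
def chunk_centers(gamma_lo, gamma_hi, m):
    """Split the angle range [gamma_lo, gamma_hi] into m equally sized
    chunks and return the center angle gamma_i of each chunk, around which
    the l_inf template perturbations are placed."""
    width = (gamma_hi - gamma_lo) / m
    return [gamma_lo + (i + 0.5) * width for i in range(m)]
```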

#### **4.5 Requirements for Proof Sharing**

Now, we discuss the requirements on the neural network $N$ such that proof sharing via templates works well. For simplicity, we discuss simple per-dimension box bound propagation for $V_S$ and $V_T$. However, similar arguments can be made for more complex relational abstractions such as Zonotopes or Polyhedra.

In order for an abstraction $S$ to match a template $T$, we need to show interval inclusion in each dimension. For a particular dimension $i$, this can occur in two ways: (i) both $S$ and $T$ are just a point in that dimension and these points coincide, i.e., $a^S_i = a^T_i$, or (ii) $[a^S_i \pm d^S_i] \subseteq [a^T_i \pm d^T_i]$. While, particularly in ReLU networks, the first case can occur after a ReLU layer sets values to zero, we focus our analysis here on the second case as it is more common. In this case, the width $d^T_i$ of $T$ in that dimension must be sufficient to cover $S$. Ignoring case (i) and letting $\mathrm{supp}(T)$ denote the set of dimensions in which $d^T_i > 0$, we can pose $\mathrm{supp}(S) \subseteq \mathrm{supp}(T)$ as a necessary condition for inclusion. While it is in general hard to argue about the magnitudes of these values, this condition still provides an intuition. When starting from input specifications with $\mathrm{supp}(I) \subseteq \mathrm{supp}(\hat{I})$, $\mathrm{supp}(S) \subseteq \mathrm{supp}(T)$ can only occur if, during propagation through the neural network $N_{1:k}$, the mass in $\mathrm{supp}(\hat{I})$ can "spread out" sufficiently to cover $\mathrm{supp}(S)$. In the fully connected neural networks that we discuss here, the matrices of linear layers provide this possibility. However, in networks that read only part of the input at a time, such as recurrent neural networks, or convolutional neural networks, in which only locally neighboring inputs feed into the respective output of the next layer, these connections do not necessarily exist. This makes proof sharing hard until later layers of the neural network that regionally or globally pool information. As this increases the depth of the layer $k$ at which proof transfer can be applied, it also decreases the potential speed-up of proof transfer.
This could be alleviated by different ways of creating templates, which we plan to investigate in the future.

#### **5 Experimental Evaluation**

We now experimentally evaluate the effectiveness of our algorithms from Sect. 4.

#### **5.1 Experimental Setup**

We consider the verification of robustness to adversarial patch attacks and geometric transformations in Sect. 5.2 and Sect. 5.3, respectively. We define specifications on the first 100 test set images each from the MNIST [22] and the CIFAR-10 [20] ("CIFAR") datasets, as with repetitions and parameter variations the overall run-time becomes high. We use DeepZ [31] as the baseline verifier as well as for $V_S$ and $V_T$. Throughout this section, we evaluate proof sharing for two networks on two common datasets: a seven-layer neural network with 200 neurons per layer ("7 × 200") and a nine-layer network with 500 neurons per layer ("9 × 500"), for both MNIST and CIFAR, both utilizing ReLU activations. These architectures are similar to the fully connected ones used in the ERAN and Mnistfc VNN-Comp categories [2].

**Table 3.** Rate of $I^{i,j}_{2 \times 2}$ matched to templates $\mathcal{T}$ for $I_{2 \times 2}$ patch verification for different combinations of template layers $k$, $7 \times 200$ networks, using $m = 1$ template.

**Table 4.** Average verification time in seconds per image for $I_{2 \times 2}$ patches for different combinations of template layers $k$, $7 \times 200$ networks, using $m = 1$ template.

For MNIST, we train for 100 epochs, enumerating all patch locations for each sample; for CIFAR, we train for 600 epochs with 10 random patch locations, as outlined in [10], with interval training [16,24]. On MNIST, the 7 × 200 and 9 × 500 networks achieve a natural accuracy of 98.3% and 95.3%, respectively. For CIFAR, these values are 48.8% and 48.1%, respectively. Our implementation utilizes PyTorch [25] and is evaluated on Ubuntu 18.04 with an Intel Core i9-9900K CPU and 64 GB RAM. For all timing results, we provide the mean over three runs.

#### **5.2 Robustness Against Adversarial Patches**

For MNIST, containing 28 × 28 images, as outlined in Sect. 4.3, verifying that inputs are robust to 2 × 2 patch perturbations requires verifying 729 individual perturbations. Only if all of them are verified can the overall property be certified for a given image. Similarly, for CIFAR, containing 32 × 32 color images, there are 961 individual perturbations (the patch is applied over all color channels).

We now investigate the two main parameters of Algorithm 2: the masks $\mu_1, \ldots, \mu_m$ and the layers $k \in K$. We first study the impact of the layer $k$ used for creating the template. To this end, we consider the $7 \times 200$ networks and use $m = 1$ (covering the whole image; equivalent to $\hat{I}_\epsilon$). Table 3 shows the corresponding template matching rates and the overall percentage of individual patches that can be verified ("patches verif."). (The overall percentage of images for which $I_{2 \times 2}$ holds is reported as "verif." in Table 6.) Table 4 shows the corresponding verification times (including template generation). We observe that many template matches can already be made at the second or third layer. As creating templates simultaneously at the second and third layer works well for both datasets, we utilize templates at these layers in further experiments.

**Table 5.** $I_{2 \times 2}$ patch verification with templates at the 2nd & 3rd layer of the $7 \times 200$ networks for different masks.

**Table 6.** $I_{2 \times 2}$ patch verification with templates generated on the second and third layer using the $\ell_\infty$-mask. Verification times are given for the baseline $t_{BL}$ and for applying proof sharing $t_{PS}$, in seconds per image.

Next, we investigate the impact of the pixel masks $\mu_1, \ldots, \mu_m$. To this end, we consider three different settings, as showcased in Fig. 5 earlier: (i) the full image ($\ell_\infty$-mask as before; $m = 1$), (ii) "center + border" ($m = 2$), where we consider the $6 \times 6$ center pixels as one group and all others as another, and (iii) the $2 \times 2$ grid ($m = 4$), where we split the image into equally sized quarters.

As we can see in Table 5, for higher $m$ more patches can be matched to the templates, indicating that our optimization procedure is a good approximation of Problem 2, which only considers the number of templates matched. Yet, for $m > 1$, the increase in the matching rate $\rho$ does not offset the additional time for template generation and matching. Thus, $m = 1$ yields a better trade-off. This result highlights the trade-offs discussed throughout Sect. 3 and Sect. 4. Based on this investigation, we now, in Table 6, evaluate all networks and datasets using $m = 1$ and template generation at layers 2 and 3. In all cases, we obtain a speed-up of 1.2 to 2× over the baseline verifier. Going from 2 × 2 to 3 × 3 patches, speed-ups remain around 1.6× and 1.3× for the two datasets, respectively.


**Table 7.** Speed-ups achievable in the setting of Table 3, with $t_{BL}$ as the baseline.

*Comparison with the Theoretically Achievable Speed-Up.* Finally, we want to determine the maximal possible speed-up with proof sharing and see how much of this potential is realized by our method. To this end, we investigate the same setting and network as in Table 3. We let $t_{BL}$ and $t_{PS}$ denote the run-time of the base verifier without and with proof sharing, respectively. Similar to the discussion in Sect. 4, we can break down $t_{PS}$ into $t_T$ (template generation time), $t_S$ (time to propagate one input to layer $k$), $t_\subseteq$ (time to perform template matching) and $t_\psi$ (time to verify $S$ if no match is found). Table 7 shows different ratios of these quantities. For all, we assume a perfect matching rate at layer $k$ and calculate the achievable speed-up for patch verification on MNIST. Comparing the optimal and realized results, we see that at layers 3 and 4 our template generation algorithm, despite only approximately solving Problem 2, achieves near-optimal speed-up. By removing the time for template matching and template generation, we can see that, at deeper layers, speeding up $t_\subseteq$ and $t_T$ yields only diminishing returns.

#### **5.3 Robustness Against Geometric Perturbations**

For the verification of geometric perturbations, we take 100 images from the MNIST dataset and the $7 \times 200$ neural network from Sect. 5.2. In Table 8, we consider an input region with ±2° rotation, ±10% contrast and ±1% brightness change, inspired by [3]. To verify this region, similar to existing approaches [3], we split the rotation into $r$ regions, each yielding a Box specification over the input. Here we use $m = 1$, a single template, with the largest verifiable $\epsilon$ found via binary search. We observe that as we increase $r$, the verification rate increases, but so does the speed-up. Proof sharing enables a significant speed-up of 1.6 to 2.9×.

Finally, we investigate the impact of the number of templates $m$. To this end, we consider a setting with a large parameter space: an input region generated from ±40° rotation with $r = 200$. In Table 9, we evaluate this for $m$ templates obtained from the $\ell_\infty$ input perturbations around $m$ equally spaced rotations, where we apply binary search to find an $\epsilon_i$ tailored to each template. Again, we observe that $m > 1$ allows more template matches. However, in this setting, the relative increase is much larger than for patches, making $m = 3$ faster than $m = 1$.


**Table 8.** ±2° rotation, ±10% contrast and ±1% brightness change split into $r$ perturbations on 100 MNIST images. Verification rate and rate of splits matched and verified, along with the run-times of the Zonotope baseline $t_{BL}$ and proof sharing $t_{PS}$.

**Table 9.** ±40° rotation split into 200 perturbations, evaluated on MNIST. The verification rate is just 15%, but 82.1% of the individual splits can be verified.


#### **5.4 Discussion**

We have shown that proof sharing can achieve speed-ups over conventional execution. While the speed-up analysis (see Sect. 4 and Table 7) puts a ceiling on what is achievable in particular settings, we are optimistic that proof sharing can become an important tool for neural network robustness analysis. In particular, as the size of certifiable neural networks continues to grow, the potential for gains via proof sharing grows equally. Further, the idea of proof-effort reuse can enable efficient verification of larger disjunctive specifications, such as the patch or geometric examples considered here. Beyond the immediately useful speed-ups, the concept of proof sharing is interesting in its own right and can provide insights into the learning mechanisms of neural networks.

# **6 Related Work**

Here, we briefly discuss conceptually related work:

*Incremental Model Checking.* The field of model checking aims to show whether a formalized model, e.g., of software or hardware, adheres to a specification. As neural network verification can also be cast as model checking, we review incremental model checking techniques, which utilize an idea similar to proof sharing: reusing partial previous computations when checking new models or specifications. Proof sharing has been applied for discovering and reusing lemmas when proving theorems for satisfiability [6], Linear Temporal Logic [7], and the modal μ-calculus [33]. Similarly, caching solvers [35] for Satisfiability Modulo Theories cache obtained results or even the full models used to obtain the solution, with assignments for all variables, allowing for faster verification of subsequent queries. Program analysis tasks that deal with repeated, similar inputs (e.g., individual commits in a software project) can leverage partial results [41], constraints [36], and precision information [4,5] from previous runs.

*Proof Sharing Between Networks.* In neural network verification, some approaches abstract the network to achieve speed-ups in verification. These simplifications are constructed in a way that the proof can be adapted for the original neural network [1,43]. Similarly, another family of approaches analyzes the difference between two closely related neural networks by utilizing their structural similarity [26,27]. Such approaches can be used to reuse analysis results between neural network modifications, e.g. fine-tuning [9,37].

In contrast to these works, we do not modify the neural network, but rather achieve speed-ups by only considering the relaxations obtained in the proofs. [37] additionally considers small changes to the input; however, these are much smaller than the differences in specification we consider here.

#### **7 Conclusion**

We introduced the novel concept of proof sharing in the context of neural network verification. We showed how to instantiate this concept, achieving speed-ups of up to 2 to 3× for patch verification and geometric verification. We believe that the ideas introduced in this work can serve as a solid foundation for exploring methods that effectively share proofs in neural network verification.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Example Guided Synthesis of Linear Approximations for Neural Network Verification**

Brandon Paulsen(B) and Chao Wang

University of Southern California, Los Angeles, CA 90089, USA *{*bpaulsen,wang626*}*@usc.edu

**Abstract.** Linear approximations of nonlinear functions have a wide range of applications such as rigorous global optimization and, recently, verification problems involving neural networks. In the latter case, a linear approximation must be hand-crafted for the neural network's activation functions. This hand-crafting is tedious, potentially error-prone, and requires an expert to prove the soundness of the linear approximation. Such a limitation is at odds with the rapidly advancing deep learning field – current verification tools either lack the necessary linear approximation, or perform poorly on neural networks with state-of-the-art activation functions. In this work, we consider the problem of automatically synthesizing sound linear approximations for a given neural network activation function. Our approach is example-guided: we develop a procedure to generate examples, and then we leverage machine learning techniques to learn a (static) function that outputs linear approximations. However, since the machine learning techniques we employ do not come with formal guarantees, the resulting synthesized function may produce linear approximations with violations. To remedy this, we bound the maximum violation using rigorous global optimization techniques, and then adjust the synthesized linear approximation accordingly to ensure soundness. We evaluate our approach on several neural network verification tasks. Our evaluation shows that the automatically synthesized linear approximations greatly improve the accuracy (i.e., in terms of the number of verification problems solved) compared to hand-crafted linear approximations in state-of-the-art neural network verification tools. An artifact with our code and experimental scripts is available at: https://zenodo.org/record/6525186#.Yp51L9LMIzM.

# **1 Introduction**

Neural networks have become a popular model choice in machine learning due to their performance across a wide variety of tasks, ranging from image classification and natural language processing to control. However, they are also known

c The Author(s) 2022 S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 149–170, 2022. https://doi.org/10.1007/978-3-031-13185-1\_8

This work was partially funded by the U.S. National Science Foundation grants CNS-1813117 and CNS-1722710.

to misclassify inputs in the presence of both small amounts of input noise and seemingly insignificant perturbations to the inputs [37]. Indeed, many works have shown they are vulnerable to a variety of seemingly benign input transformations [1,9,17], which raises concerns about their deployment in safety-critical systems. As a result, a large number of works have proposed verification techniques to prove that a neural network is not vulnerable to these perturbations [35,43,44], or in general satisfies some specification [15,18,27,28].

Crucial to the precision and scalability of these verification techniques are *linear approximations* of the network's activation functions.

In essence, given some arbitrary activation function σ(x), a linear approximation is a *coefficient generator function* G(l, u) → (al, bl, au, bu), where l, u ∈ R are real values that correspond to the interval [l, u], and al, bl, au, bu ∈ R are real-valued coefficients of the linear lower and upper bounds such that the following condition holds:

$$\forall x \in [l, u]. \ a\_l \cdot x + b\_l \le \sigma(x) \le a\_u \cdot x + b\_u \tag{1}$$

Indeed, a key contribution in many seminal works on neural network verification was a hand-crafted G(l, u) [2,7,19,33–35,42–45,47] and follow-up work built off these hand-crafted approximations [36,38]. Furthermore, linear approximations have applications beyond neural network verification, such as rigorous global optimization and verification [21,40].

However, crafting G(l, u) is tedious, error-prone, and requires an expert. Unfortunately, in the case of neural network activation functions, experts have only crafted approximations for the most common functions, namely ReLU, sigmoid, tanh, max-pooling, and those in vanilla LSTMs. As a result, existing techniques cannot handle new and cutting-edge activation functions, such as Swish [31], GELU [14], Mish [24], and LiSHT [32].

In this work, we consider the problem of automatically synthesizing the coefficient generator function G(l, u), which can alternatively be viewed as four individual functions G_al(l, u), G_bl(l, u), G_au(l, u), and G_bu(l, u), one for each coefficient. However, synthesizing the generator functions is a challenging task because (1) the search space for each function is very large (in fact, technically infinite), (2) the optimal generator functions are highly nonlinear for all activation functions considered both in our work and prior work, and (3) to prove soundness of the synthesized generator functions, we must show:

$$\forall [l, u] \in \mathbb{IR}, \; x \in [l, u] \; . \; \left( \mathcal{G}\_{a\_l}(l, u) \cdot x + \mathcal{G}\_{b\_l}(l, u) \right) \le \sigma(x) \le \left( \mathcal{G}\_{a\_u}(l, u) \cdot x + \mathcal{G}\_{b\_u}(l, u) \right) \tag{2}$$

where IR = {[l, u] | l, u ∈ R, l ≤ u} is the set of all real intervals. The above equation has highly non-linear constraints, which cannot be directly handled by standard verification tools, such as the Z3 [6] SMT solver.

To solve the problem, we propose a novel example-guided synthesis and verification approach, which is applicable to any differentiable, Lipschitz-continuous activation function σ(x). (We note that activation functions are typically required to be differentiable and Lipschitz-continuous in order to be trained

**Fig. 1.** Overview of our method for synthesizing the coefficient generator function.

by gradient descent, thus our approach applies to any *practical* activation function). To tackle the potentially infinite search space of G(l, u), we first propose two *templates* for G(l, u), which are inspired by the hand-crafted coefficient functions of prior work. The "holes" in each template are filled by a machine learning model, in our case a small neural network or linear regression model. Then, the first step is to partition the input space of G(l, u), and then assign a single template to each subset in the partition. The second step is to fill in the holes of each template. Our approach leverages an example-generation procedure to produce a large number of training examples of the form ((l, u),(al, bl, au, bu)), which can then be used to train the machine learning component in the template. However, a template instantiated with a trained model may still violate Eq. 2, specifically the lower bound (resp. upper bound) may be above (resp. below) the activation function over some interval [l, u]. To ensure soundness, the final step is to bound the *maximum violation* of a particular template instance using a rigorous global optimization technique based on interval analysis, which is implemented by the tool IbexOpt [5]. We then use the computed maximum violation to adjust the template to ensure Eq. 2 always holds.

The overall flow of our method is shown in Fig. 1. It takes as input the activation function σ(x), and the set of input intervals Ix ⊆ IR for which G(l, u) will be valid. During *design time*, we follow the previously described approach, which outputs a set of sound, instantiated templates that make up G(l, u). Then the synthesized G(l, u) is integrated into an existing verification tool such as AutoLiRPA [46] or DeepPoly [35]. These tools take as input a neural network and a specification, and output the verification result (proved, counterexample, or unknown). At *application time* (i.e., when attempting to verify the input specification), when these tools need a linear approximation for σ(x) over the interval [l, u], we look up the appropriate template instance, use it to compute the linear approximation (al, bl, au, bu), and return it to the tool.

To the best of our knowledge, our method is the first to synthesize a linear approximation generator function G(l, u) for any given activation function σ(x). Our approach is fundamentally different from the ones used by state-of-the-art neural network verification tools such as AutoLiRPA and DeepPoly, which require an expert to hand-craft the approximations. We note that, while AutoLiRPA can handle activations that it does not explicitly support by *decomposing* σ(x) into elementary operations for which it has (hand-crafted) linear approximations, and then combining them, the resulting bounds are often not tight. In contrast, our method synthesizes linear approximations for σ(x) as a whole, and we show experimentally that our synthesized approximations significantly outperform AutoLiRPA.

We have implemented our approach and evaluated it on popular neural network verification problems (specifically, robustness verification problems in the presence of input perturbations). Compared against state-of-the-art linear approximation based verification tools, our synthesized linear approximations can drastically outperform these existing tools in terms of the number of problems verified on recently published activation functions such as Swish [31], GELU [14], Mish [24], and LiSHT [32].

To summarize, we make the following contributions:


# **2 Preliminaries**

In this section, we discuss background knowledge necessary to understand our work. Throughout the paper, we will use the following notations: for variables or scalars we use lower case letters (e.g., x ∈ R), for vectors we use bold lower case letters (e.g., **x** ∈ R^n), and for matrices we use bold upper case letters (e.g., **W** ∈ R^{n×m}). In addition, we use standard interval notation: we let [l, u] = {x ∈ R | l ≤ x ≤ u} be a real-valued interval, we denote the set of all real intervals as IR = {[l, u] | l, u ∈ R, l ≤ u}, and finally we define the set of n-dimensional intervals as IR^n = {×_{i=1}^n [li, ui] | [li, ui] ∈ IR}, where × is the Cartesian product.

#### **2.1 Neural Networks**

We consider a neural network to be a function f : X ⊆ R^n → Y ⊆ R^m, which has n inputs and m outputs. For ease of presentation, we focus the discussion on *feed-forward, fully-connected* neural networks (although the bounds synthesized by our method apply to all neural network architectures). For **x** ∈ X, such networks compute f(**x**) by performing an alternating series of matrix multiplications followed by the element-wise application of an activation function σ(x).

Formally, an l-layer neural network with ki neurons in each layer (letting k0 = n and kl = m) has l weight matrices **W**i ∈ R^{k_{i−1}×k_i} and bias vectors **b**i ∈ R^{k_i} for i ∈ {1..l}. The input of the network is f0 = **x**^T, and the output of layer i is given by the function fi = σ(fi−1 · **W**i + **b**i), which is applied recursively until the output layer of the network is reached.

Initially, common choices for the activation function σ(x) were ReLU(x) = max(0, x), sigmoid(x) = e^x / (e^x + 1), and tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}). However, the field has advanced rapidly in recent years and, as a result, automatically discovering novel activations has become a research subfield of its own [31]. Many recently proposed activations, such as Swish and GELU [14,31], have been shown to outperform the common choices in important machine learning tasks.
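As a concrete illustration of the layer recursion fi = σ(fi−1 · **W**i + **b**i), here is a minimal NumPy sketch (ours; the weights are illustrative, not those of the paper's example network, and we follow the common convention of omitting the activation on the output layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """Swish [31]: x * sigmoid(x)."""
    return x * sigmoid(x)

def forward(x, weights, biases, act):
    """Feed-forward network: f_i = act(f_{i-1} @ W_i + b_i),
    applied layer by layer, with no activation on the output layer."""
    f = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        f = f @ W + b
        if i < len(weights) - 1:
            f = act(f)
    return f

# A tiny 2-2-2 network with illustrative weights.
W1 = np.array([[-1.0, 1.0], [1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, -1.0], [1.0, 1.0]])
b2 = np.zeros(2)
y = forward(np.array([0.5, -0.5]), [W1, W2], [b1, b2], swish)
print(y.shape)  # (2,)
```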

#### **2.2 Existing Neural Network Verification Techniques and Limitations**

We consider neural network verification problems of the following form: given a neural network f : X → Y and an input set X ⊆ X, compute an over-approximation Y such that {f(**x**) | **x** ∈ X} ⊆ Y ⊆ Y. The most scalable approaches to neural network verification (where scale is measured by the number of neurons in the network) use linear bounding techniques to compute Y, which require a *linear approximation* of the network's activation function. These techniques extend *interval analysis* [26] (e.g., to intervals with linear lower/upper bounds [35,46]) to compute Y, and thus X and Y are represented as elements of IR^n and IR^m, respectively.

We use Fig. 2 to illustrate a typical neural network verification problem. The network has input neurons x1, x2, output neurons x7, x8, and a single hidden layer. We assume the activation function is swish(x) = x · sigmoid(x), which is shown by the blue line in Fig. 3. Our input space is X = [−1, 1] × [−1, 1] (i.e., x1, x2 ∈ [−1, 1]), and we want to prove x7 > x8, which can be accomplished by first computing the bounds x7 ∈ [l7, u7], x8 ∈ [l8, u8], and then showing l7 > u8. Following the prior work [35] and for simplicity, we split the affine transformation and application of activation function in the hidden layer into two steps, and we assume the neurons xi, where i ∈ {1..8}, are ordered such that i < j implies that xi is in either the same layer as xj, or a layer prior to xj.

Linear bounding based neural network verification techniques work as follows. For each neuron xi, they compute the concrete lower and upper bounds li and ui, together with symbolic lower and upper bounds. The symbolic lower and upper bounds are linear constraints Σ_{j=0}^{i−1} c_j^l · xj + c_i^l ≤ xi ≤ Σ_{j=0}^{i−1} c_j^u · xj + c_i^u, where each c_j^l, c_j^u is a constant. Both bounds are computed in a forward layer-by-layer fashion, using the result of the previous layers to compute bounds for the current layer.

We illustrate the computation in Fig. 2. In the beginning, we have x<sup>1</sup> ∈ [−1, 1] as the concrete bounds, and −1 ≤ x<sup>1</sup> ≤ 1 as the symbolic bounds, and similarly for x2. To obtain bounds for x3, x4, we multiply x1, x<sup>2</sup> by the edge weights, which for x<sup>3</sup> gives the linear bounds −x1+x<sup>2</sup> ≤ x<sup>3</sup> ≤ −x1+x2. Then, to compute l<sup>3</sup> and

**Fig. 2.** An example of linear bounding for neural network verification.

u3, we minimize and maximize the linear lower and upper bounds, respectively, over x1, x<sup>2</sup> ∈ [−1, 1]. Doing so results in l<sup>3</sup> = −2, u<sup>3</sup> = 2. We obtain the same result for x4.
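The concrete bounds l3 = −2, u3 = 2 come from minimizing/maximizing an affine expression over a box of input intervals, which is standard interval arithmetic. A small sketch of that step (ours; names are illustrative):

```python
def affine_interval(coeffs, bias, box):
    """Tightest concrete bounds of sum_j c_j * x_j + bias over a box
    of input intervals [l_j, u_j]: pick the interval endpoint that
    minimizes (resp. maximizes) each signed term."""
    lo = hi = bias
    for c, (l, u) in zip(coeffs, box):
        lo += c * (l if c >= 0 else u)
        hi += c * (u if c >= 0 else l)
    return lo, hi

# x3 = -x1 + x2 with x1, x2 in [-1, 1], as in the running example:
l3, u3 = affine_interval([-1.0, 1.0], 0.0, [(-1.0, 1.0), (-1.0, 1.0)])
print(l3, u3)  # -2.0 2.0
```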

However, we encounter a key challenge when attempting to bound x5: we need a linear approximation of σ(x3) over [l3, u3] when bounding x5, and similarly for x6. Here, a linear approximation for x5 can be regarded as a set of coefficients al, bl, au, bu such that the following *soundness* condition holds: ∀x3 ∈ [l3, u3] . al · x3 + bl ≤ σ(x3) ≤ au · x3 + bu. In addition, a subgoal for the bounds is *tightness*, which typically means the volume between the bounds and σ(x) is minimized. Crafting a function to generate these coefficients has been the subject of many prior works; indeed, many seminal papers on neural network verification have focused on solving this problem alone. Broadly speaking, they fall into the following categories.

*Hand-Crafted Approximation Techniques.* The first category of techniques uses hand-crafted functions for generating al, bl, au, bu. Hand-crafted functions are generally fast because they are static, and tight because an expert designed them. Unfortunately, current works in this category are not *general* – they only consider the most common activation functions, and thus cannot currently handle our motivating example or any recent, novel activation functions. For these works to apply to our motivating example, an expert would need to hand-craft an approximation for the activation function, which is both difficult and error-prone.

*Expensive Solver-Aided Techniques.* The second category uses expensive solvers and optimization tools to compute sound and tight bounds in a general way, but at the cost of runtime. Recent works include DiffRNN [25] and POPQORN [19]. The former uses (unsound) optimization to synthesize candidate coefficients and then uses an SMT solver to verify soundness of the bounds. The latter uses constrained gradient descent to compute coefficients. We note that, while these works do not explicitly target an arbitrary activation function σ(x), their techniques can be naturally extended. Their high runtime and computational cost are undesirable and, in general, make them less scalable than the first category.

*Decomposing Based Techniques.* The third category combines hand-crafted approximations with a decomposing based technique to obtain generality and efficiency, but at the cost of tightness. Interestingly, this is similar to the approach used by nonlinear SMT solvers and optimizers such as dReal [11] and Ibex [5]. To the best of our knowledge, only one work, AutoLiRPA [46], implements this approach for neural network verification. Illustrating on our example, AutoLiRPA does not have a static linear approximation for σ(x3) = x3 · sigmoid(x3), but it has

**Fig. 3.** Approximation of AutoLiRPA (red) and our approach (green). (Color figure online)

static approximations for sigmoid(x3) and x3 · y. Thus we can bound sigmoid(x3) over x3 ∈ [−2, 2], and then, letting y = sigmoid(x3), bound x3 · y. Doing so results in the approximation shown as red lines in Fig. 3. While useful, these bounds are suboptimal because they do not minimize the area between the two bounding lines. This suboptimality occurs due to the decomposition, i.e., the static approximations used here were not designed for swish(x) as a whole, but for the individual elementary operations.

*Our Work: Synthesizing Static Approximations.* Our work overcomes the limitations of prior work by automatically synthesizing a *static* function specifically for any given activation function σ(x), *without* decomposing. Since the synthesis is automatic and results in a bound generator function, we obtain generality and efficiency, and since the synthesis targets σ(x) specifically, we *usually* (demonstrated empirically) obtain tightness. In Fig. 3, for example, the bounds computed by our method are represented by the green lines. The synthesized bound generator function can then be integrated into state-of-the-art neural network verification tools, including AutoLiRPA.

*Wrapping Up the Example.* For our running example, using AutoLiRPA's linear approximation, we would add the linear bounds for x5 shown in Fig. 2. To compute l5, u5, we would substitute the linear bounds for x3 into x5's linear bounds, resulting in linear bounds with only x1, x2 terms that can be minimized/maximized to obtain l5 and u5, respectively. We do the same for x6, and then we repeat the entire process until the output layer is reached.

# **3 Problem Statement and Challenges**

In this section, we formally define the synthesis problem and then explain the technical challenges. During the discussion, we focus on synthesizing the generator functions for the upper bound, but in Sect. 3.1, we explain how we can obtain the lower bound functions.

#### **3.1 The Synthesis Problem**

Given an activation function σ(x) and an input universe x ∈ [lx, ux], we define the set of all intervals over x in this universe as Ix = {[l, u] | [l, u] ∈ IR, l, u ∈ [lx, ux]}. In our experiments, for instance, we use lx = −10 and ux = 10. Note that if we encounter an [l, u] ∉ Ix, we fall back to a decomposing-based technique.

Our goal is to synthesize a generator function G(l, u) → (au, bu), or equivalently, two generator functions G_au(l, u) and G_bu(l, u) such that for all [l, u] ∈ Ix and x ∈ R, the condition x ∈ [l, u] =⇒ σ(x) ≤ G_au(l, u) · x + G_bu(l, u) holds. This is the same as requiring that the following condition does **not** hold (i.e., the formula is unsatisfiable):

$$\exists [l, u] \in I\_x, x \in \mathbb{R} \text{ } . \, x \in [l, u] \land \sigma(x) > \mathcal{G}\_{a\_u}(l, u) \cdot x + \mathcal{G}\_{b\_u}(l, u) \tag{3}$$

The formula above expresses the search for a counterexample, i.e., an input interval [l, u] such that G_au(l, u) · x + G_bu(l, u) is not a sound upper bound of σ(x) over the interval [l, u]. Thus, if the above formula is unsatisfiable, the soundness of the coefficient functions G_au, G_bu is proved. We note that we can obtain the lower bound generator functions G_al(l, u), G_bl(l, u) by synthesizing upper bound functions G_au(l, u), G_bu(l, u) for −σ(x) (i.e., reflecting σ(x) across the x-axis), and then letting G_al = −G_au(l, u), G_bl = −G_bu(l, u).
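This reduction of lower bounds to upper bounds of −σ(x) can be sketched as follows (our illustration; `chord` is a stand-in upper-bound generator that is valid wherever its argument is convex, e.g., −sigmoid on [1, 4]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chord(f, l, u):
    """Line through (l, f(l)), (u, f(u)): an upper bound of f
    on [l, u] whenever f is convex there."""
    a = (f(u) - f(l)) / (u - l)
    return a, f(l) - a * l

def lower_from_upper(upper_gen, sigma, l, u):
    """Upper-bound -sigma, then negate the coefficients to obtain
    a lower bound of sigma (the reduction used in the paper)."""
    neg = lambda x: -sigma(x)
    a_u, b_u = upper_gen(neg, l, u)
    return -a_u, -b_u

l, u = 1.0, 4.0                      # sigmoid is concave here, so
a_l, b_l = lower_from_upper(chord, sigmoid, l, u)   # -sigmoid is convex
xs = np.linspace(l, u, 500)
print(bool(np.all(a_l * xs + b_l <= sigmoid(xs) + 1e-9)))  # True
```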

In addition to *soundness*, we want the bound to be *tight*, which in our context has two complementary goals. For a given [l, u] ∈ Ix we should have (1) σ(z) = G_au(l, u) · z + G_bu(l, u) for at least one z ∈ [l, u] (i.e., the bound touches σ(x) at some point z), and (2) the volume below G_au(l, u) · x + G_bu(l, u) should be minimized (which we note is equivalent to minimizing the volume between the upper bound and σ(x), since σ(x) is fixed). We illustrate the volume as the shaded green region below the dashed bounding line in Fig. 6.

The first goal is intuitive: if the bound does not touch σ(x), then it can be shifted downward by some constant. The second goal is a heuristic taken from prior work that has been shown to yield a precise approximation of the neural network's output set.

#### **3.2 Challenges and Our Solution**

We face three challenges in searching for the generator functions G_au and G_bu. First, we must restrict the search space so that a candidate can be found in a reasonable amount of time (i.e., the search is tractable). The second challenge, which is at odds with the first, is that we must have a large enough search space

**Fig. 4.** Illustration of the two-point form bound (upper dashed line) and tangent-line form bound (lower dashed line).

such that it permits candidates that represent tight bounds. Finally, the third challenge, which is at odds with the second, is that we must be able to formally verify G_au, G_bu to be sound. While more complex generator functions will likely produce tighter bounds, they will be more difficult (if not impractical) to verify.

We tackle these challenges by proposing two templates for G_au, G_bu and then developing an approach for selecting the appropriate template. We observe that prior work has always expressed the linear bound for σ(x) over an interval x ∈ [l, u] as either the line connecting the points (l, σ(l)), (u, σ(u)), referred to as the *two-point form*, or the line tangent to σ(x) at a point t, referred to as the *tangent-line form*. We illustrate both forms in Fig. 4. Assuming that σ′(x) is the derivative of σ(x), the two templates for G_au and G_bu are as follows:

$$\begin{aligned} \mathcal{G}\_{a\_u}(l, u) &= \frac{\sigma(u) - \sigma(l)}{u - l} \\ \mathcal{G}\_{b\_u}(l, u) &= -\mathcal{G}\_{a\_u}(l, u) \cdot l + \sigma(l) + \epsilon \end{aligned} \qquad \text{two-point form template} \tag{4}$$

$$\begin{aligned} \mathcal{G}\_{a\_u}(l, u) &= \sigma'(g(l, u)) \\ \mathcal{G}\_{b\_u}(l, u) &= -\mathcal{G}\_{a\_u}(l, u) \cdot g(l, u) + \sigma(g(l, u)) + \epsilon \end{aligned} \qquad \text{tangent-line form template} \tag{5}$$

In these templates, there are two *holes* to fill during synthesis: ε and g(l, u). Here, ε is a real-valued constant upward (positive) shift that ensures soundness of the linear bounds computed by both templates. We compute ε when we verify the soundness of the template (discussed in Sect. 4.3). In addition to ε, for the tangent-line template, we must synthesize a function g(l, u) = t, which takes the interval [l, u] as input and returns the tangent point t as output.
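The two templates can be sketched directly in code (ours; `swish_prime`, the midpoint stand-in for the learned g(l, u), and the shift ε = 0.05 are all illustrative, not verified values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_prime(x):
    """d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1 - s)

def two_point_template(sigma, l, u, eps):
    """Eq. (4): chord through (l, sigma(l)), (u, sigma(u)), shifted up by eps."""
    a_u = (sigma(u) - sigma(l)) / (u - l)
    b_u = -a_u * l + sigma(l) + eps
    return a_u, b_u

def tangent_template(sigma, sigma_prime, g, l, u, eps):
    """Eq. (5): tangent line at t = g(l, u), shifted up by eps."""
    t = g(l, u)
    a_u = sigma_prime(t)
    b_u = -a_u * t + sigma(t) + eps
    return a_u, b_u

# With the midpoint as a stand-in for the learned g(l, u):
a_u, b_u = tangent_template(swish, swish_prime, lambda l, u: (l + u) / 2,
                            -1.0, 1.0, eps=0.05)
print(a_u, b_u)  # 0.5 0.05
```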

These two templates, together, address the previously mentioned three challenges. For the first challenge, the two-point form actually does not have a search space, and thus can be computed efficiently; for the tangent-line form, we only need to synthesize the function g(l, u). In Sect. 4.2, we will show empirically that g(l, u) tends to be much easier to learn than a function that directly predicts the coefficients au, bu. For the second challenge, if the two-point form is sound, then it is also tight since the bound touches σ(x) by construction. Similarly, the tangent-line form touches σ(x) at t. For the third challenge, we will show empirically that these templates can be verified to be sound in a reasonable amount of time (on the order of an hour).

At a high level, our approach contains three steps. The first step is to partition Ix into subsets, and then for each subset we assign a fixed template – either the two-point form template or the tangent-line form template. The advantage of partitioning is two-fold. First, no single template is a good fit for the entire Ix, and thus partitioning results in overall tighter bounds. Second, if the final verified template for a particular subset has a large violation (which results in a large upward shift ε and thus less tight bounds), the effect is localized to that subset only. Once we have assigned a template to each subset of Ix, the second step is to learn a g(l, u) for each subset that was assigned a tangent-line template. We use an example-generation procedure to generate training examples, which are then used to train a machine learning model. After learning each g(l, u), the third step is to compute ε for all of the templates. We phrase the search for a sound ε as a nonlinear global optimization problem, and then use the interval-based solver IbexOpt [5] to bound ε.

# **4 Our Approach**

In this section, we first present our method for partitioning Ix, the input interval space, into disjoint subsets and then assigning a template to each subset. Then, we present the method for synthesizing the bounds-generating function for a subset in the partition of I<sup>x</sup> (see Sect. 3.1). Next, we present the method for making the bounds-generating functions sound. Finally, we present the method for efficiently looking up the appropriate template at runtime.

#### **4.1 Partitioning the Input Interval Space (***Ix***)**

A key consideration when partitioning Ix is how to represent each disjoint subset of input intervals. While we could use a highly expressive representation such as polytopes, or even non-linear constraints, for efficiency reasons we represent each subset (of input intervals) as a box. Since a subset uses either the two-point form template or the tangent-line form template, the input interval space can be divided into Ix = I2pt ∪ Itan. Each of I2pt and Itan is a set of boxes.

At a high level, our approach first partitions Ix into uniformly sized disjoint boxes, and then assigns each box to either I2pt or Itan. In Fig. 5, we illustrate the partition computed for swish(x) = x · sigmoid(x). The x-axis and y-axis represent the lower bound l and the upper bound u, respectively; thus a point (l, u) on this graph represents the interval [l, u], and a box on this graph denotes the set of intervals represented by the points contained within it. We give details on computing the partition below.

**Fig. 5.** Partition of I*<sup>x</sup>* for the Swish activation function, where the blue boxes belong to I*tan*, and the green boxes belong to I<sup>2</sup>*pt*. (Color figure online)

*Defining the Boxes.* We first define a constant parameter cs, which is the width and height of each box in the partition of Ix. In Fig. 5, cs = 1. The benefits of using a smaller cs value are two-fold. First, it allows us to more accurately choose the proper template (two-point or tangent) for a given interval [l, u]. Second, as mentioned previously, the negative impact of a template with a large violation (i.e., a large ε) is localized to a smaller set of input intervals.

Assuming that (ux − lx) is divisible by cs, we have ((ux − lx)/cs)² disjoint boxes in the partition of Ix, which we represent by Ii,j where i, j ∈ {1..(ux − lx)/cs}. Ii,j represents the box whose lower-left corner is located at (lx + i · cs, lx + j · cs); alternatively, Ii,j = {[l, u] | l ∈ [lx + i · cs, lx + i · cs + cs], u ∈ [lx + j · cs, lx + j · cs + cs]}.
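For illustration, the box lookup reduces to integer division (our sketch; we index boxes from 0 rather than 1, and all names are illustrative):

```python
def box_index(l, u, l_x, c_s):
    """Index (i, j) of the partition box containing interval [l, u],
    using 0-based indices and half-open boxes so every interval lands
    in exactly one box."""
    i = int((l - l_x) // c_s)
    j = int((u - l_x) // c_s)
    return i, j

def box_of(i, j, l_x, c_s):
    """The set of intervals in box (i, j): l in the first range,
    u in the second."""
    return ((l_x + i * c_s, l_x + (i + 1) * c_s),
            (l_x + j * c_s, l_x + (j + 1) * c_s))

# With l_x = -10 and c_s = 1 (the paper's experimental settings):
print(box_index(-0.3, 2.7, l_x=-10.0, c_s=1.0))  # (9, 12)
```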

To determine which boxes Ii,j belong to the subset I2pt, we uniformly sample intervals [l, u] ∈ Ii,j. Then, for each sampled interval [l, u], we compute the two-point form for [l, u], and attempt to search for a counter-example to the condition σ(x) ≤ G_au(l, u) · x + G_bu(l, u) by sampling x ∈ [l, u]. If a counter-example is not found for more than half of the sampled [l, u] ∈ Ii,j, we add the box Ii,j to I2pt; otherwise we add the box to Itan.

We note that more sophisticated (probably more expensive) strategies for assigning templates exist. We use this strategy simply because it is efficient. We also note that some boxes in the partition may contain invalid intervals (i.e., we have [l, u] ∈ Ii,j where u<l). These invalid intervals are filtered out during the final verification step described in Sect. 4.3, and thus do not affect the soundness of our algorithm.

#### **4.2 Learning the Function** *g***(***l, u***)**

In this step, for each box Ii,j ∈ Itan, we want to learn a function g(l, u) = t that returns the tangent point for any given interval [l, u] ∈ Ii,j , where t will be used to compute the tangent-line form upper bound as defined in Eq. 5. This process is done for all boxes in Itan, resulting in a separate g(l, u) for each box Ii,j . A sub-goal when learning g(l, u) is to maximize the tightness of the resulting upper bound, which in our case means minimizing the volume below the tangent line.

We leverage machine learning techniques (specifically linear regression or a small neural network with ReLU activation) to learn g(l, u), which means we need a procedure to generate training examples. The examples must have the form ((l, u), t). To generate the training examples, we (uniformly) sample [l, u] ∈ Ii,j , and for each sampled [l, u], we attempt to find a tangent point t whose tangent line represents a tight upper bound of σ(x). Then, given the training examples, we use standard machine learning techniques to learn g(l, u).
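As a minimal illustration of the learning step, the sketch below fits a linear-regression g(l, u) by least squares to synthetic training pairs ((l, u), t) in which the best tangent point is, by construction, the interval midpoint (our code; the paper's real training targets come from the LP-based procedure described next, not from this synthetic rule):

```python
import numpy as np

def fit_g_linear(examples):
    """Fit g(l, u) = w0 + w1*l + w2*u by least squares to
    training pairs ((l, u), t)."""
    X = np.array([[1.0, l, u] for (l, u), _ in examples])
    y = np.array([t for _, t in examples])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda l, u: w[0] + w[1] * l + w[2] * u

# Synthetic examples whose target tangent point is exactly the midpoint:
rng = np.random.default_rng(0)
examples = []
for _ in range(200):
    l = rng.uniform(-2, 0)
    u = l + rng.uniform(0.1, 2)
    examples.append(((l, u), (l + u) / 2))

g = fit_g_linear(examples)
print(round(g(-1.0, 1.0), 6))  # ~0.0, since the target is exactly linear
```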

The crux of our approach is generating the training examples. To generate a single example for a fixed [l, u], we follow two steps: (1) generate upper bound coefficients au, bu, and then (2) find a tangent point t whose tangent line is close to au, bu. In the following paragraphs, we describe the process for a fixed [l, u], and then discuss the machine learning procedure.

#### **Generating Example Coefficients *au*, *bu*.**

Given a fixed [l, u], we aim to generate upper bound coefficients au, bu. A good generation procedure has three criteria: (1) the coefficients should be tight for the input interval [l, u], (2) the coefficients should be sound, and (3) the generation should be fast. The first two criteria are intuitive: good training examples will result in a good learned model. The third is to ensure that we can generate a large number of examples in a reasonable amount of time. Unfortunately, the second and third criteria are at odds, because proving soundness is inherently expensive. To ensure a reasonable runtime, we relax the

**Fig. 6.** Illustration of the sampling and linear programming procedure for computing an upper bound. Shaded green region illustrates the volume below the upper bound. (Color figure online)

second criterion to *probably* sound. Thus our final goal is to minimize the volume below a_u, b_u such that σ(x) ≤ a_u · x + b_u *probably* holds for x ∈ [l, u].

Our approach is inspired by prior work [2,33], which reformulates a nonlinear optimization problem as a linear program that can be solved efficiently. Our approach samples points (s_i, σ(s_i)) on the activation function for s_i ∈ [l, u], uses them to convert the nonlinear constraint σ(x) ≤ a_u · x + b_u into a set of linear constraints, and uses the volume below the line (which is linear in a_u, b_u) as the objective. For a set S of sample points s_i ∈ [l, u], the linear program we solve is:

$$\text{minimize}: \text{volume below } a_u \cdot x + b_u$$

$$\text{subj. to}: \bigwedge_{s_i \in S} \sigma(s_i) \le a_u \cdot s_i + b_u$$

**Fig. 7.** Plots of the training examples, smoothed with linear interpolation. On the left: a plot of ((l, u), t), and on the right: a plot of ((l, u), a_u).

We illustrate this in Fig. 6. Solving the above problem results in a_u, b_u, and the prior work [2,33] proved that the solution (theoretically) approaches the optimal and sound a_u, b_u as the number of samples goes to infinity. We use Gurobi [13] to solve the linear program.
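The sampling-plus-LP procedure can be sketched as follows. This is an illustrative assumption on our part: it uses SciPy's `linprog` in place of Gurobi, the sigmoid as a stand-in for σ(x), and `upper_bound_coeffs` as a hypothetical helper name. The objective encodes the volume below the line, ∫_l^u (a_u·x + b_u) dx = a_u(u² − l²)/2 + b_u(u − l), which is linear in a_u and b_u:

```python
import numpy as np
from scipy.optimize import linprog

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def upper_bound_coeffs(sigma, l, u, n_samples=200):
    """LP for a (probably sound) upper bound a_u*x + b_u over [l, u].

    Objective (volume below the line, linear in a_u and b_u):
        integral of a_u*x + b_u over [l, u] = a_u*(u^2 - l^2)/2 + b_u*(u - l).
    Constraints: sigma(s_i) <= a_u*s_i + b_u for each sample s_i.
    """
    s = np.linspace(l, u, n_samples)
    c = [(u**2 - l**2) / 2.0, u - l]
    A_ub = np.column_stack([-s, -np.ones_like(s)])  # -a_u*s_i - b_u <= -sigma(s_i)
    b_ub = -sigma(s)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
    assert res.success
    return res.x  # (a_u, b_u)

a_u, b_u = upper_bound_coeffs(sigmoid, -1.0, 2.0)
```

At the optimum, at least one sampled constraint is active, so the line touches the sampled points from above; soundness between samples is only probable, as discussed above.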

**Converting a_u, b_u to a Tangent Line.** To use the generated a_u, b_u in the tangent-line form template, we must find a point t whose tangent line is close to a_u, b_u. That is, we require that the following condition (almost) holds:

$$(\sigma'(t) = a_u) \land (-\sigma'(t) \cdot t + \sigma(t) = b_u)$$

To solve the above problem, we use local optimization techniques (specifically, a modified Powell's method [29] implemented in SciPy [41], but most common techniques would work) to find a solution to σ'(t) = a_u.

We then check that the right conjunct of the above formula almost holds: specifically, we check |(−σ'(t) · t + σ(t)) − b_u| ≤ 0.01. If the local optimization does not converge (i.e., it does not find a t such that σ'(t) = a_u), or the check on b_u fails, we discard the example and do not use it in training.
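A minimal sketch of this conversion, assuming the sigmoid as σ(x) and SciPy's Powell method (the convergence tolerance and the helper name `tangent_point` are our own illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))  # sigma'(x)

def tangent_point(a_u, b_u, t0, tol=0.01):
    """Find t with sigma'(t) ~= a_u via Powell's method, then verify that the
    tangent intercept -sigma'(t)*t + sigma(t) is within tol of b_u."""
    res = minimize(lambda v: (dsigmoid(v[0]) - a_u) ** 2, x0=[t0], method="Powell")
    t = float(np.atleast_1d(res.x)[0])
    if abs(dsigmoid(t) - a_u) > 1e-4:
        return None  # local optimization did not converge to sigma'(t) = a_u
    if abs((-dsigmoid(t) * t + sigmoid(t)) - b_u) > tol:
        return None  # intercept check on b_u failed; discard this example
    return t
```

Returning `None` corresponds to throwing the example away, as described above.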

One may ask: could we simply train a model to directly predict the coefficients a_u and b_u, instead of predicting a tangent point and then converting it to the tangent line? The answer is yes; however, this approach has two caveats. The first is that we would lose the inherent tightness that we gain with the tangent-line form – we would no longer have a guarantee that the computed linear bound touches σ(x) at any point. The second is that the relationship between l, u and t tends to be close to linear, and thus easier to learn, whereas the relationship between l, u and a_u, or between l, u and b_u, is highly nonlinear. We illustrate these relationships as plots in Fig. 7. The left graph plots the generated training examples ((l, u), t), converted to a smooth function using linear interpolation. We can see most regions are linear, as shown by the flat sections. The right plot shows ((l, u), a_u), where we can see the center region is highly nonlinear.

**Training on the Examples.** Using the procedure presented so far, we sample [l, u] uniformly from I_{i,j} and generate the corresponding t for each of them. This results in a training dataset of r examples D_train = {((l_i, u_i), t_i) | i ∈ {1..r}}. We then choose between one of two models – a linear regression model or a 2-layer, 50-hidden-neuron ReLU network – to become the final function g(l, u). To decide, we train both model types and choose the one with the lowest error, where error is measured as the mean absolute error. We give details below.

A linear regression model is a function g(l, u) = c_1 · l + c_2 · u + c_3, where c_i ∈ R are coefficients learned by minimizing the *squared error*, which formally is:

$$\sum_{((l_i, u_i), t_i) \in D_{train}} (g(l_i, u_i) - t_i)^2 \tag{6}$$

Finding the coefficients c_i that minimize the above objective has a closed-form solution, thus convergence is guaranteed and the solution is optimal, which is desirable.
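For illustration, the closed-form fit can be computed with a least-squares solve (the data below is synthetic and `fit_linear` is a hypothetical helper name):

```python
import numpy as np

def fit_linear(ls, us, ts):
    """Closed-form least-squares fit of t = c1*l + c2*u + c3 (Eq. 6)."""
    X = np.column_stack([ls, us, np.ones(len(ls))])
    coeffs, *_ = np.linalg.lstsq(X, ts, rcond=None)
    return coeffs  # (c1, c2, c3)

# Synthetic sanity check: exactly linear data should be recovered exactly.
rng = np.random.default_rng(0)
ls = rng.uniform(-2.0, 0.0, 50)
us = rng.uniform(0.0, 2.0, 50)
ts = 0.3 * ls + 0.5 * us + 0.1
c1, c2, c3 = fit_linear(ls, us, ts)
```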

However, sometimes the relationship between (l, u) and t is nonlinear, and thus using a linear regression model may result in a poor-performing g(l, u), even though the solution is optimal. To capture more complex relationships, we also consider a 2-layer ReLU network with **W**_0 ∈ R^{2×50}, **W**_1 ∈ R^{50×1}, **b**_0 ∈ R^{50}, **b**_1 ∈ R, where g(l, u) = ReLU([l, u]^T · **W**_0 + **b**_0) · **W**_1 + **b**_1. The weights and biases are initialized randomly, and then we minimize the squared error (Eq. 6) using gradient descent. While convergence to the optimal weights is not guaranteed in theory, we find in practice it usually converges.
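A minimal numpy sketch of this model and its gradient-descent training loop (the data, initialization scale, learning rate, and iteration count are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 2))   # rows are (l, u) pairs
t = (0.3 * X[:, 0] + 0.5 * X[:, 1])[:, None]  # hypothetical targets

# 2-layer, 50-hidden-neuron ReLU network: g = ReLU(x W0 + b0) W1 + b1
W0 = 0.1 * rng.standard_normal((2, 50)); b0 = np.zeros(50)
W1 = 0.1 * rng.standard_normal((50, 1)); b1 = np.zeros(1)

def forward(X):
    H = np.maximum(X @ W0 + b0, 0.0)  # hidden activations
    return H, H @ W1 + b1

_, g0 = forward(X)
loss_before = float(np.mean((g0 - t) ** 2))

lr = 0.05
for _ in range(500):
    H, g = forward(X)
    d = 2.0 * (g - t) / len(X)        # d(loss)/d(g)
    gW1, gb1 = H.T @ d, d.sum(0)
    dH = (d @ W1.T) * (H > 0)         # backprop through ReLU
    gW0, gb0 = X.T @ dH, dH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W0 -= lr * gW0; b0 -= lr * gb0

_, g1 = forward(X)
loss_after = float(np.mean((g1 - t) ** 2))
```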

We choose these two models because they can capture a diverse set of g(l, u) functions. While we could use other prediction models, such as polynomial regression, a neural network will generally be equally (if not more) expressive. Nevertheless, we believe other model types and neural network architectures would be an interesting direction to explore.

#### **4.3 Ensuring Soundness of the Linear Approximations**

For a given I_{i,j}, we must ensure that its corresponding coefficient generator functions G_{a_u}(l, u) and G_{b_u}(l, u) are sound, or in other words, that the following condition does **not** hold:

$$\exists [l, u] \in I_{i,j},\ x \in [l, u] \ .\ \sigma(x) > \mathcal{G}_{a_u}(l, u) \cdot x + \mathcal{G}_{b_u}(l, u)$$

We ensure the above condition does not hold (the formula is unsatisfiable) by bounding the *maximum violation* of the clause σ(x) > G_{a_u}(l, u) · x + G_{b_u}(l, u), which we formally define as Δ(l, u, x) = σ(x) − (G_{a_u}(l, u) · x + G_{b_u}(l, u)). Δ is positive exactly when the clause holds. Thus, if we can compute an upper bound Δ_u on Δ, we can add Δ_u to G_{b_u}(l, u) to ensure the clause never holds, thus making the coefficient generator functions sound.

To compute Δ_u, we solve (i.e., bound) the following optimization problem:

$$\begin{aligned} \text{for } & l, u, x \in [l_{i,j}, u_{i,j}] \\ \text{maximize } & \Delta(l, u, x) \\ \text{subj. to } & l < u \land l \le x \land x \le u \end{aligned}$$

where l_{i,j}, u_{i,j} are the minimum lower bound and maximum upper bound, respectively, over all intervals in I_{i,j}. The above problem can be solved using the general framework of interval analysis [26] and branch-and-prune algorithms [4].

Letting Δ_search = {(l, u, x) | l, u, x ∈ [l_{i,j}, u_{i,j}]} be the domain over which we want to bound Δ, we can bound Δ over Δ_search using interval analysis. In addition, we can improve the bound in two ways: *branching* (i.e., partitioning Δ_search and bounding Δ on each subset separately) and *pruning* (i.e., removing from Δ_search values that violate the constraints l < u ∧ l ≤ x ∧ x ≤ u). The tool IbexOpt [5] implements such an algorithm, and we use it to solve the above optimization problem.
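To make the branch-and-prune idea concrete, the following self-contained sketch bounds Δ using naive interval arithmetic plus grid branching and infeasibility pruning. It is a stand-in for IbexOpt, and it assumes hypothetical linear generator functions (the coefficients `ca`, `cb` are invented for illustration) and the sigmoid as σ(x):

```python
import itertools, math
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

def imul(a, b):
    """Interval product [a]*[b]: min/max over endpoint products."""
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def iadd(a, b): return (a[0] + b[0], a[1] + b[1])

def iscale(c, a):  # scalar times interval
    return (c * a[0], c * a[1]) if c >= 0 else (c * a[1], c * a[0])

# Hypothetical learned generators: G_a(l,u) = ca[0]*l + ca[1]*u + ca[2], same for G_b.
ca = (0.05, 0.10, 0.02)
cb = (-0.10, 0.30, 0.60)

def delta_upper(lbox, ubox, xbox):
    """Interval upper bound of Delta(l,u,x) = sigma(x) - (G_a(l,u)*x + G_b(l,u))."""
    Ga = iadd(iadd(iscale(ca[0], lbox), iscale(ca[1], ubox)), (ca[2], ca[2]))
    Gb = iadd(iadd(iscale(cb[0], lbox), iscale(cb[1], ubox)), (cb[2], cb[2]))
    sig_hi = sigmoid(xbox[1])            # sigmoid is monotone increasing
    lin = iadd(imul(Ga, xbox), Gb)
    return sig_hi - lin[0]               # max sigma minus min of the line

def branch_and_bound(lo, hi, splits=8):
    """Max of delta_upper over a grid partition of the search box (branching),
    skipping boxes where l < u is infeasible (pruning)."""
    edges = np.linspace(lo, hi, splits + 1)
    boxes = list(zip(edges[:-1], edges[1:]))
    best = -math.inf
    for lb, ub, xb in itertools.product(boxes, repeat=3):
        if lb[0] >= ub[1]:               # prune: no l < u in this box
            continue
        best = max(best, delta_upper(lb, ub, xb))
    return best

delta_u = branch_and_bound(-2.0, 2.0)
```

By construction, `delta_u` upper-bounds Δ at every feasible point; finer splits tighten the bound, mirroring the branching step.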

One practical consideration when solving the above optimization problem is the possibility of division by zero. In the two-point template, we have G_{a_u}(l, u) = (σ(u) − σ(l)) / (u − l). While we have the constraint l < u, from an interval analysis perspective, G_{a_u}(l, u) goes to infinity as u − l goes to 0, and indeed, if we gave the above problem to IbexOpt, it would report that Δ is unbounded. To account for this, we enforce a minimum interval width of 0.01 by changing l < u to 0.01 < u − l.

#### **4.4 Efficient Lookup of the Linear Bounds**

Due to partitioning I_x, we must have a procedure for looking up the appropriate template instance for a given [l, u] at application time. Formally, we need to find the box I_{i,j}, which we denote [l_l, u_l] × [l_u, u_u], such that l ∈ [l_l, u_l] and u ∈ [l_u, u_u], and retrieve the corresponding template. Lookup can present a significant runtime overhead if not done with care. One approach is to use a data structure similar to an interval tree or a quadtree [10], the latter of which has O(log(n)) lookup complexity. While the quadtree would be the most efficient for an arbitrary partition of I_x into boxes, we can in fact obtain O(1) lookup for our partition strategy.

We first note that each box I_{i,j} can be uniquely identified by l_l and u_u. The point (l_l, u_u) corresponds to the top-left corner of a box in Fig. 5. Thus we build a lookup dictionary keyed by (l_l, u_u) that maps each box to the corresponding linear bound template. To perform lookup, we exploit the structure of the partition: specifically, each box in the partition is aligned to a multiple of c_s. Thus, to look up I_{i,j} for a given [l, u], we view (l, u) as a point on the graph of Fig. 5, and the lookup corresponds to moving leftward and upward from the point (l, u) to the nearest upper-left corner of a box. More formally, we perform lookup by rounding l down to the nearest multiple of c_s, and u up to the nearest multiple of c_s. The resulting top-left corner is then used to look up the appropriate template.
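The rounding-based O(1) lookup can be sketched as follows (`templates` is a hypothetical dictionary built once at synthesis time; c_s = 0.25 is the value used in the experiments):

```python
import math

cs = 0.25  # partition cell size

def box_key(l, u):
    """Round l down and u up to the nearest multiple of cs; the resulting
    (ll, uu) is the top-left corner identifying the box I_{i,j}."""
    ll = math.floor(l / cs) * cs
    uu = math.ceil(u / cs) * cs
    return (round(ll, 10), round(uu, 10))  # normalize float noise

templates = {}  # (ll, uu) -> linear bound template, filled at synthesis time

def lookup(l, u):
    return templates.get(box_key(l, u))  # O(1) dictionary access
```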

#### **5 Evaluation**

We have implemented our approach as a software tool that synthesizes a linear bound generator function G(l, u) for any given activation function σ(x) in the input universe x ∈ [l_x, u_x]. The output is a function that takes as input [l, u] and returns coefficients a_l, b_l, a_u, b_u as output. For all experiments, we use l_x = −10, u_x = 10, c_s = 0.25, and a minimum interval width of 0.01. If we encounter an [l, u] ⊈ [l_x, u_x], we fall back to the interval bound propagation of dReal [11]. After the generator function is synthesized, we integrate it into AutoLiRPA, a state-of-the-art neural network verification tool, which allows us to analyze neural networks with σ(x) as activation functions.

#### **5.1 Benchmarks**

**Neural Networks and Datasets.** Our benchmarks are eight deep neural networks trained on the following two datasets.

*MNIST*. MNIST [22] is a set of images of hand-written digits, each of which is labeled with the corresponding written digit. The images are 28 × 28 grayscale images depicting one of ten written digits. We use a convolutional network architecture with 1568, 784, and 256 neurons in its first, second, and third layer, respectively. We train a model for each of the activation functions described below.

*CIFAR*. CIFAR [20] is a set of images depicting one of 10 objects (a dog, a truck, etc.), which are hand labeled with the corresponding object. The images are 32 × 32 pixel RGB images. We use a convolutional architecture with 2048, 2048, 1024, and 256 neurons in the first, second, third, and fourth layers, respectively. We train a model for each of the activation functions described below.

**Activation Functions.** Our neural networks use one of the activation functions shown in Fig. 8 and defined in Table 1. They are Swish [14,31], GELU [14], Mish [24], LiSHT [32], and AtanSq [31]. The first two are used in language models such as GPT [30], and have been shown to achieve the best performance for some image classification tasks [31]. The third and fourth are variants of the first two, and have been shown to have desirable theoretical properties. The last was discovered using automatic search techniques [31], and was found to perform on par with the state of the art. We chose these activations because they are representative of recent developments in deep learning research.

**Robustness Verification.** We evaluate our approach on *robustness* verification problems. Given a neural network f : X ⊆ R^n → Y ⊆ R^m and an input **x** ∈ X, we verify robustness by proving that a small p-bounded perturbation (p ∈ R) of **x** does not change the classification. Letting **x**[i] ∈ R be the i-th element of **x**, we represent the set of all perturbations as X ∈ IR^n, where X = ×_{i=1}^n [**x**[i] − p, **x**[i] + p]. We then compute Y ∈ IR^m, where Y = ×_{i=1}^m [l_i, u_i], and, assuming the target class of **x** is j ∈ {1..m}, we prove robustness by checking l_j > u_i for all i ≠ j, i ∈ {1..m}.
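Given the output box Y, the final robustness check reduces to a few comparisons (a sketch with hypothetical bound vectors):

```python
def is_robust(lbs, ubs, j):
    """Certified robust iff the lower bound of the target class j exceeds
    the upper bound of every other class."""
    return all(lbs[j] > ubs[i] for i in range(len(ubs)) if i != j)

# Example: class 1 is the target; its lower bound 0.9 beats the other uppers.
robust = is_robust([0.1, 0.9, 0.2], [0.3, 1.0, 0.4], 1)
```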


**Fig. 8.** Activation functions used in our experiments.

For each network, we take 100 random test images, and following prior work [12], we filter out misclassified images. We then take the remaining images, and create a robustness verification problem for each one. Again following prior work, we use p = 8/255 for MNIST networks and p = 1/255 for CIFAR networks.

#### **5.2 Experimental Results**

Our experiments were designed to answer the following question: how do our synthesized linear approximations compare with other state-of-the-art, hand-crafted linear approximation techniques on novel activation functions? To the best of our knowledge, AutoLiRPA [46] is the only neural network verification tool capable of handling the activation functions considered here using static, hand-crafted approximations. We primarily focus on comparing the number of verification problems solved, and we caution against directly comparing runtimes, as AutoLiRPA is highly engineered for parallel computation, whereas our approach is not currently (although it could be). We conducted all experiments on an 8-core 2.7 GHz processor with 32 GB of RAM.

We present results on robustness verification problems in Table 2. The first column shows the dataset and architecture. The next two columns show the fraction of verification problems solved and the total runtime in seconds for AutoLiRPA. The next two columns show the same statistics for our approach. The final column compares the output set sizes of AutoLiRPA and our approach. We first define |Y| as the volume of the (hyper)box Y. Then, letting Y_auto and Y_ours be the output sets computed by AutoLiRPA and our approach, respectively, |Y_ours| / |Y_auto| measures the reduction in output set size. In general, |Y_ours| < |Y_auto| indicates our approach is better because it implies that our approach has more accurately approximated the true output set; thus |Y_ours| / |Y_auto| < 1 indicates our approach is more accurate.

We point out three trends in the results. First, our automatically synthesized linear approximations always result in more verification problems solved. This is because our approach synthesizes a linear approximation specifically for σ(x), which results in tighter bounds. Second, AutoLiRPA takes longer on more complex activations such as GELU and Mish, which have more elementary

**Table 1.** Definitions of activation functions used in our experiments.


**Table 2.** Comparison of the verification results of our approach and AutoLiRPA.

<sup>1</sup>AutoLiRPA does not have an approximation for tan<sup>−1</sup>.

operations than Swish and LiSHT. This occurs because AutoLiRPA has more linear approximations to compute (it must compute one for every elementary operation before composing the results together). On the other hand, our approach computes the linear approximation in one step, and thus does not incur additional overhead for the more complex activation functions. Third, our approach always computes a much smaller output set, in the range of 2–10× smaller, which again reflects the tighter linear bounds.

*Synthesis Results.* We also report some key metrics about the synthesis procedure. Results are shown in Table 3. The first three columns show the total CPU time for the three steps in our synthesis procedure. We note that all three steps can be heavily parallelized, thus the wall clock time is roughly 1/8 the reported times on our 8-core machine. The final column shows the percentage of boxes in the partition that were assigned a two-point template (we can take the complement to get the percentage of tangent-line templates).

# **6 Related Work**

Most closely related to our work are approaches that leverage interval-bounding techniques for neural network verification. Seminal works in this area can be thought of either as explicit linear bounding or as linear bounding with some type of restriction (usually for efficiency). Among the explicit linear bounding techniques are the ones used in DeepPoly [35], AutoLiRPA [46], Neurify [42], and similar tools [2,7,19,33,34,44,45,47]. On the other hand, techniques using zonotopes [12,23] and symbolic intervals [43] can be thought of as restricted linear bounding. Such approaches have an advantage in scalability, although they may sacrifice completeness and accuracy. In addition, recent


**Table 3.** Statistics of the synthesis step in our method.

work leverages semi-definite approximations [15], which allow for more expressive, nonlinear lower and upper bounds. In addition, linear approximations are used in nonlinear programming and optimization [5,40]. However, to the best of our knowledge, none of these prior works attempt to automate the process of crafting the bound generator function G(l, u).

Less closely related are neural network verification approaches based on solving systems of linear constraints [3,8,16,18,38]. Such approaches typically only apply to networks with piecewise-linear activations such as ReLU and max pooling, for which there is little need to automate any part of the verification algorithm's design (at least with respect to the activation functions). They do not handle novel activation functions such as the ones considered in our work. These approaches have the advantage of being complete, although they tend to be less scalable than interval-analysis-based approaches.

Finally, we note that there are many works built off the initial linear approximation approaches, thus highlighting the importance of designing tight and sound linear approximations in general [36,39,42].

#### **7 Conclusions**

We have presented the first method for statically synthesizing a function that can generate tight and sound linear approximations for neural network activation functions. Our approach is example-guided, in that we first generate example linear approximations, and then use these approximations to train a prediction model for linear approximations at run time. We leverage nonlinear global optimization techniques to ensure the soundness of the synthesized approximations. Our evaluation on popular neural network verification tasks shows that our approach significantly outperforms state-of-the-art verification tools.

#### **References**

1. Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M., Chang, K.W.: Generating natural language adversarial examples. arXiv:1804.07998 (2018)


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Verifying Neural Networks Against Backdoor Attacks**

Long H. Pham(B) and Jun Sun

Singapore Management University, Singapore, Singapore {hlpham,junsun}@smu.edu.sg

**Abstract.** Neural networks have achieved state-of-the-art performance in solving many problems, including many applications in safety/security-critical systems. Researchers also discovered multiple security issues associated with neural networks. One of them is backdoor attacks, i.e., a neural network may be embedded with a backdoor such that a target output is almost always generated in the presence of a trigger. Existing defense approaches mostly focus on detecting whether a neural network is 'backdoored' based on heuristics, e.g., activation patterns. To the best of our knowledge, the only line of work which certifies the absence of backdoor is based on randomized smoothing, which is known to significantly reduce neural network performance. In this work, we propose an approach to verify whether a given neural network is free of backdoor with a certain level of success rate. Our approach integrates statistical sampling as well as abstract interpretation. The experiment results show that our approach effectively verifies the absence of backdoor or generates backdoor triggers.

# **1 Introduction**

Neural networks have gradually become an essential component in many real-life systems, e.g., face recognition [25], medical diagnosis [16], as well as auto-driving cars [3]. Many of these systems are safety- and security-critical. In other words, it is expected that the neural networks used in these systems not only operate correctly but also satisfy security requirements, i.e., they must sustain attacks from malicious adversaries.

Researchers have identified multiple ways of attacking neural networks, including adversarial attacks [33], backdoor attacks [12], and so on. Adversarial attacks apply a small perturbation (e.g., modifying a few pixels in an image) to a given input, which is often unnoticeable under human inspection, and cause the neural network to generate a wrong output. To mitigate adversarial attacks, many approaches have been proposed, including robust training [7,22], run-time adversarial sample detection [39], and robustness certification [10]. The most relevant to this work is robustness certification, which aims to verify that a neural network satisfies local robustness, i.e., that perturbation within a region (e.g., an L<sup>∞</sup> norm ball) around an input does not change the output. The problem of local robustness certification has been extensively studied in recent years, and many methods and tools have been developed [10,14,15,29–32,40,41].

Backdoor attacks work by embedding a 'backdoor' in the neural network so that the neural network works as expected with normal inputs and outputs a specific target output in the presence of a backdoor trigger. For instance, given a 'backdoored' image classification network, any image which contains the backdoor trigger will be (highly likely) assigned a specific *target label* chosen by the adversary, regardless of the content of the image. The backdoor trigger can be embedded either through poisoning the training set [12] or modifying a trained neural network directly [19]. It is easy to see that backdoor attacks raise serious security concerns. For instance, the adversaries may use a trigger-containing (a.k.a. 'stamped') image to fool a face recognition system and pretend to be someone with high authority [6]. Similarly, a stamped image may be used to trick an auto-driving system to misidentify street signs and act hazardously [12].

There are multiple active lines of research related to backdoor attacks, e.g., on different ways of conducting backdoor attacks [12,20], and on detecting the existence of a backdoor [5,9,18,19,38] or mitigating backdoor attacks [17]. Existing approaches are, however, not capable of certifying the absence of a backdoor. To the best of our knowledge, the only work capable of certifying the absence of a backdoor is the one reported in [37], which is based on randomized smoothing during training. Their approach has a huge cost in terms of model accuracy, and even its authors call for alternative approaches for "certifying robustness against backdoor attacks".

In this work, we propose a method to verify the absence of a backdoor attack with a certain level of success rate (since backdoor attacks in practice are rarely perfect [12,20]). Given a neural network and a constraint on the backdoor trigger (e.g., its size), our method combines statistical sampling with deterministic neural network verification techniques (based on abstract interpretation). If we fail to verify the absence of a backdoor (due to over-approximation), an optimization-based method is used to generate concrete backdoor triggers.

We conduct experiments on multiple neural networks trained to classify images in the MNIST dataset. These networks are trained with different types of activation functions, including ReLU, Sigmoid, and Tanh. We verify the absence of backdoor with different settings. The experiment results show that we can verify most of the benign neural networks. Furthermore, we can successfully generate backdoor triggers for neural networks trained with backdoor attack. A slightly surprising result is that we successfully generate backdoor triggers for some of the supposedly benign networks with a reasonably high success rate.

The remainder of the paper is organized as follows. In Sect. 2, we define our problem. In Sect. 3, we present the details of our approach. We show the experiment results in Sect. 4. Section 5 reviews related work and finally, Sect. 6 concludes.

#### **2 Problem Definition**

In the following, our discussion focuses on the image domain, in particular, on image classification neural networks. It should be noted that our approach is not limited to the image domain. In general, an image can be represented as a three-dimensional array with shape (c, h, w), where c is the number of channels (i.e., 1 for grayscale images and 3 for color images), h is the height (i.e., the number of rows), and w is the width (i.e., the number of columns) of the image. Each element in the array is a byte value (i.e., from 0 to 255) representing a feature of the image. When an image is used in a classification task with a neural network, its feature values are typically normalized into floating-point

**Fig. 1.** An example of image classification with neural network

numbers (e.g., dividing the original values by 255 to get normalized values from 0 to 1). Moreover, the image is transformed into a vector of size m = c × h × w. In this work, we use the three-dimensional form and the vector form of an image interchangeably. The specific form should be clear from the context.

Given a tuple (c_i, h_i, w_i) representing an index in the three-dimensional form, it is easy to compute the corresponding index i in the vector form using the formula i = c_i × h × w + h_i × w + w_i. Similarly, given an index i in the vector form, we compute the tuple (c_i, h_i, w_i) representing the index in the three-dimensional form as follows.

$$\begin{aligned} c_i &= i \div (h \times w) \\ h_i &= (i - c_i \times h \times w) \div w \\ w_i &= i - c_i \times h \times w - h_i \times w \end{aligned}$$
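These index conversions translate directly into code (a straightforward sketch; `to_flat` and `to_chw` are our own names):

```python
def to_flat(ci, hi, wi, h, w):
    """(channel, row, col) index -> flat vector index."""
    return ci * h * w + hi * w + wi

def to_chw(i, h, w):
    """Flat vector index -> (channel, row, col) index."""
    ci = i // (h * w)
    hi = (i - ci * h * w) // w
    wi = i - ci * h * w - hi * w
    return ci, hi, wi
```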

An image classification task is to label a given image with one of the pre-defined labels automatically. Such tasks are often solved using neural networks. Figure 1 shows the typical workflow of an image classification neural network. The task is to assign a label (i.e., from 0 to 9) to a handwritten digit image. Each input is a grayscale image with 1 × 28 × 28 = 784 features.

In this work, we focus on fully connected neural networks and convolutional neural networks, which are composed of multiple layers of neurons. The layers include an input layer, a set of hidden layers, and an output layer. The number of neurons in the input layer equals the number of features in the input image. The number of neurons in the output layer equals the number of labels in the classification problem. The number of hidden layers as well as the number of neurons in these layers are flexible. For instance, the network in Fig. 1 has three hidden layers, each of which contains 10 neurons.

The input layer simply applies an identity transformation on the vector of the input image. Each hidden layer transforms its input vector (i.e., the output vector of the previous layer) and produces an output vector for the next layer. Each hidden layer applies two different types of transformations, i.e., the first is an affine transformation and the second is an activation function transformation. Formally, the two transformations of a hidden layer can be defined as y = σ(A ∗ x + B), where x is the input vector, A is the weight matrix, B is the bias vector of the affine transformation, ∗ is matrix multiplication, σ is the activation function, and y is the output vector of the layer. The most popular activation functions include ReLU, Sigmoid, and Tanh. The output layer applies a final affine transformation to its input vector and produces the output vector

**Fig. 2.** Some examples of original images and stamped images

of the network. A labelling function L(y) = arg max_i y[i] is then applied to the output vector to return the index of the label with the highest value in y.

The weights and biases used in the affine transformations are parameters of the neural network. In this work, we focus on pre-trained networks, i.e., the weights and biases of the networks are already fixed. Formally, a neural network is a function N : R^m → R^n = f_k ∘ ··· ∘ f_i ∘ ··· ∘ f_0, where m is the number of input features; n is the number of labels; each f_i with 0 < i < k is the composition of the affine function and the activation function of the i-th hidden layer; f_0 is the identity transformation of the input layer; and f_k is the final affine transformation of the output layer.
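A minimal sketch of this composition (hypothetical helper names; `hidden` holds the (A, B) pairs of the hidden layers and `out` the output layer's final affine transformation):

```python
import numpy as np

def forward(x, hidden, out, sigma):
    """Evaluate N = f_k ∘ ... ∘ f_0 on input vector x (f_0 is the identity)."""
    for A, B in hidden:
        x = sigma(A @ x + B)   # affine transformation, then activation
    A, B = out
    return A @ x + B           # output layer: final affine only

def label(y):
    return int(np.argmax(y))   # L(y) = arg max_i y[i]

relu = lambda z: np.maximum(z, 0.0)

# Tiny 2-input, 2-label example with one hidden layer.
A0 = np.array([[1.0, -1.0], [0.5, 0.5]]); B0 = np.array([0.0, 0.1])
A1 = np.eye(2);                            B1 = np.zeros(2)
y = forward(np.array([1.0, 2.0]), [(A0, B0)], (A1, B1), relu)
```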

*Backdoor Attacks.* In [12], Gu *et al.* show that neural networks are subject to backdoor attacks. Intuitively, the idea is that an adversary may introduce a backdoor into the network, for instance, by poisoning the training set. To do that, the adversary starts with choosing a pattern, i.e., a backdoor trigger, and stamps the trigger on a set of samples in the training set (e.g., 20%). Figure 2b shows some stamped images, which are obtained by stamping a trigger to the original images in Fig. 2a. Note that the trigger is a small white square at the top-left corner of the image. A pre-defined target label is the ground truth label for the stamped images. The poisoned training set is then used to train the neural network. The result is a backdoored network that performs normally on clean images (i.e., images without the trigger) but likely assigns the target label to any image which is stamped with the trigger. Besides poisoning the training set, a backdoor can also be introduced by modifying the parameters of a trained neural network directly [19].

**Definition 1 (Backdoor trigger).** *Given a neural network for classifying images with shape* (c, h, w)*, a backdoor trigger is any image* S *with shape* (c_s, h_s, w_s) *such that* c_s = c*,* h_s ≤ h*, and* w_s ≤ w*.*

Formally, a backdoor trigger is any stamp that has the same number of channels as the input images and fits within their height and width. Obviously, replacing an input image entirely with a backdoor image of the same size is hardly interesting in practice; thus, we often limit the size of the trigger. Note that the trigger can be stamped anywhere on the image. In this work, we assume the same trigger is used to attack all images, i.e., the same stamp is applied at the same position for any input. In other words, we do not consider input-specific triggers, i.e., triggers that differ from image to image. While some forms of input-specific triggers (e.g., adding a specific image filter or stamping the trigger at selected positions of a given image [6,20]) can be supported by modeling the trigger as a function of the original image, we do not regard general input-specific triggers to be within the scope of this work. Given that adversarial attacks can be regarded as a (restricted) form of generating input-specific triggers, the problem of verifying the absence of input-specific backdoor triggers subsumes the problem of verifying local robustness, and is thus expected to be much harder.

Given a trigger with shape (c_s, h_s, w_s), let (h_p, w_p) be the position of the top-left corner of the trigger such that h_p + h_s ≤ h and w_p + w_s ≤ w. Given an image I with shape (c, h, w), a backdoor trigger S with shape (c_s, h_s, w_s), and a trigger position (h_p, w_p), the stamped image, denoted I_s, is defined as follows.

$$I\_s[c\_i, h\_i, w\_i] = \begin{cases} S[c\_i, h\_i - h\_p, w\_i - w\_p] & \text{if } h\_p \le h\_i < h\_p + h\_s \land w\_p \le w\_i < w\_p + w\_s\\ I[c\_i, h\_i, w\_i] & \text{otherwise} \end{cases}$$

Intuitively, in the stamped image, the pixels of the stamp replace those corresponding pixels in the original image.
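The stamping operation is straightforward to implement; the sketch below (with a hypothetical helper name `stamp`) replaces the covered pixels and leaves the rest untouched:

```python
import numpy as np

def stamp(image, trigger, h_p, w_p):
    """Return the stamped image I_s: the trigger S (shape (c, h_s, w_s))
    overwrites the pixels of `image` (shape (c, h, w)) starting at
    top-left position (h_p, w_p); all other pixels are kept."""
    c, h, w = image.shape
    c_s, h_s, w_s = trigger.shape
    assert c_s == c and h_p + h_s <= h and w_p + w_s <= w
    stamped = image.copy()
    # Pixels covered by the trigger are replaced; the rest are unchanged.
    stamped[:, h_p:h_p + h_s, w_p:w_p + w_s] = trigger
    return stamped
```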

Given a backdoored network, an adversary can perform an attack by feeding an image stamped with the backdoor trigger to the network, expecting the network to classify the stamped image with the target label. Ideally, given any stamped image, an attack on a backdoored network should result in the target label. In practice, experimental results from existing backdoor attacks [6,12,20] show that this is not always the case, i.e., some stamped images may not be classified with the target label. Thus, given a neural network N, a backdoor trigger S, and a target label t_s, we say that S has a success rate of θ if and only if there exists a position (h_p, w_p) such that the probability of having L(N(I_s)) = t_s for an image I drawn from a chosen test set is θ.

We are now ready to define the problem. *Given a neural network* N*, a probability* θ*, and a trigger shape* (c_s, h_s, w_s)*, the problem of verifying the absence of a backdoor attack with a success rate of* θ *against* N *is to show that there does not exist a backdoor attack on* N *which has a success rate of at least* θ.

#### **3 Verifying Backdoor Absence**

#### **3.1 Overall Algorithm**

The overall approach is shown in Algorithm 1. The inputs include the network N, the required success rate θ, a parameter K representing the sampling size, the trigger shape (c_s, h_s, w_s), the target label t_s, as well as multiple parameters for hypothesis testing (i.e., a type I error α, a type II error β, and a half-width of the indifference region δ). The idea is to apply hypothesis testing, i.e., the SPRT algorithm [1], with the following two mutually exclusive hypotheses: H0 : p ≥ (1 − θ^K) + δ and H1 : p ≤ (1 − θ^K) − δ, where p is the probability that a set of K randomly selected images is free of a backdoor with a 100% success rate (i.e., that verifyX returns SAFE).


**Algorithm 1:** *verifyPr*(N, θ, K, (c_s, h_s, w_s), t_s, α, β, δ)

```
1  let n ← 0 be the number of times verifyX is called;
2  let z ← 0 be the number of times verifyX returns SAFE;
3  let p0 ← (1 − θ^K) + δ, p1 ← (1 − θ^K) − δ;
4  while true do
5      n ← n + 1;
6      randomly select a set of images X with size K;
7      if verifyX(N, X, (cs, hs, ws), ts) returns SAFE then
8          z ← z + 1;
9      else if verifyX(N, X, (cs, hs, ws), ts) returns UNSAFE then
10         if the generated trigger satisfies the success rate then
11             return UNSAFE;
12     if (p1^z (1−p1)^(n−z)) / (p0^z (1−p0)^(n−z)) ≤ β/(1−α) then
13         return SAFE;     // Accept H0
14     else if (p1^z (1−p1)^(n−z)) / (p0^z (1−p0)^(n−z)) ≥ (1−β)/α then
15         return UNKNOWN;  // Accept H1
```

In the algorithm, variables n and z record, respectively, the number of times a set of K random images is sampled and the number of times that set is shown to be free of a backdoor with a 100% success rate. Note that function verifyX returns SAFE only if there is no backdoor

attack on a set of given images X with a 100% success rate, i.e., no trigger such that L(N(I_s)) = t_s for all I ∈ X. It may also return a concrete trigger which successfully attacks every image in X. The details of algorithm verifyX are presented in Sect. 3.2.

The loop from lines 4 to 15 in Algorithm 1 keeps randomly selecting and verifying a set of K images using algorithm verifyX until one of the two hypotheses is accepted according to the criteria set by the parameters α and β based on the SPRT algorithm. Furthermore, whenever a trigger is returned by algorithm verifyX at line 9, we check whether the trigger reaches the required success rate on the test set, and return UNSAFE if it does. Note that when H0 is accepted, we return SAFE, i.e., we have successfully verified the absence of a backdoor attack with a success rate of at least θ. When H1 is accepted, we return UNKNOWN.
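The SPRT acceptance test at lines 12-15 can be sketched as follows, using log-likelihoods for numerical stability; the function name `sprt_step` is ours:

```python
import math

def sprt_step(n, z, p0, p1, alpha, beta):
    """One SPRT check after n samples, z of which were verified SAFE.

    p0/p1 are (1 - theta^K) +/- delta. Returns 'H0' (report SAFE),
    'H1' (report UNKNOWN), or None (keep sampling).
    """
    # Log of the likelihood ratio p1^z (1-p1)^(n-z) / (p0^z (1-p0)^(n-z))
    llr = (z * (math.log(p1) - math.log(p0))
           + (n - z) * (math.log(1.0 - p1) - math.log(1.0 - p0)))
    if llr <= math.log(beta / (1.0 - alpha)):
        return 'H0'
    if llr >= math.log((1.0 - beta) / alpha):
        return 'H1'
    return None
```

With θ = 0.9, K = 5, and α = β = δ = 0.01, and with every sampled set verified SAFE, this test accepts H0 after 95 rounds, consistent with the running example discussed below.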

Apart from the success rate θ and the parameters for hypothesis testing, Algorithm 1 has a particularly interesting parameter K, i.e., the number of images to draw at random each time. On the one hand, if K is set to be small, such as 1, it is very likely that algorithm verifyX invoked at line 9 will return UNSAFE, since it is often possible to attack a small set of images, as demonstrated by many adversarial attack methods [4,11,24], i.e., changing a few pixels of an image changes the output of a neural network. As a result, hypothesis H1 is accepted and nothing can be concluded. On the other hand, if K is set to be large, such as 10000, then due to the complexity of algorithm verifyX (see Sect. 3.2), it is likely to time out and thus return UNKNOWN, which leads to an inconclusive result as well. Furthermore, when K is large, 1 − θ^K is close to 1 and, as a result, many rounds are needed to accept H0 even if algorithm verifyX keeps returning SAFE. It is thus important to find an effective K value to balance the two aspects. We identify the value of K empirically in Sect. 4 and aim to study the problem further in the future.

Take as an example the network shown in Fig. 1, which is a feed-forward neural network built with the ReLU activation function and three hidden layers. We aim to verify the absence of a backdoor attack with a success rate of 0.9. We take 10000 images of the MNIST test set to evaluate the success rate of a trigger. We set the parameters in Algorithm 1 as follows: K = 5 and α = β = δ = 0.01. For the target label 0, after 95 rounds, we have enough evidence to accept the hypothesis H0, which means we have evidence that there is no backdoor attack on the network with the target label 0 and a success rate of at least 0.9. We have similar results for the other target labels, although more rounds of tests are required for labels 2, 3, 5, and 8 (i.e., 98 rounds for label 8, 100 rounds for label 3, 117 rounds for label 5, and 188 rounds for label 2).

#### **3.2 Verifying Backdoor Absence Against a Set of Images**

Next, we present the details of algorithm verifyX. The inputs include the neural network N, a set of images X with shape (c, h, w), a trigger shape (c_s, h_s, w_s), and a target label t_s. The goal is to check whether there exists a trigger which successfully attacks every image in X. Algorithm verifyX may have three outcomes. One is SAFE, i.e., there is no trigger such that the backdoor attack succeeds on all the images in X. Another is UNSAFE, i.e., a trigger that can be used to successfully attack all images in X is generated. The last one is UNKNOWN, i.e., we fail to establish either of the above results.

In the following, we describe one concrete realization of the algorithm based on abstract interpretation, as shown in Algorithm 2. At line 1, variable *hasUnknown* is declared as a flag which is true if and only if we cannot conclude whether there is a successful attack at some position. The loop from lines 2 to 15 tries every position for the trigger one by one. Intuitively, variable φ is the constraint that must be satisfied by a trigger to successfully attack every image in X. At line 3, we initialize φ to be φ_pre, which is defined as follows: φ_pre ≡ ⋀_{j ∈ P(h_p, w_p)} lw_j ≤ x_j ≤ up_j, where j ∈ P(h_p, w_p) denotes that j is an index (of an image pixel) in the trigger; x_j is a variable denoting the value of the j-th pixel; and lw_j and up_j are the (normalized) minimum (e.g., 0) and maximum (e.g., 1) values of feature j according to the input domain specified by the network N. Intuitively, φ_pre requires that the pixels in the trigger must be within their domain.

Given a position, the loop from lines 4 to 10 constructs one constraint φ_I for each image I, which is the constraint that must be satisfied by the trigger to attack I. In particular, at line 5, function attackCondition is called to construct the constraint. We present the details of this function in Sect. 3.3. If φ_I is UNSAT (line 6), attacking image I at position (h_p, w_p) is impossible, so we set φ to false and break the loop. Otherwise, we conjoin φ with φ_I.

After collecting one constraint from each image, we solve φ using a constraint solver. If it is not UNSAT (i.e., SAT or UNKNOWN), function opTrigger is called to generate a trigger which is successful on all images in X (if possible). Note that, due to over-approximation, the model returned by the solver might be spurious. The details of function opTrigger are presented in Sect. 3.4. If a trigger is successfully generated, we return UNSAFE (at line 13, together with the trigger); otherwise, we set *hasUnknown* to true and continue with the next trigger position. Note that we could return UNKNOWN at line 15 without missing any opportunity for verifying the backdoor absence. We instead continue with the next trigger location hoping a trigger may

**Algorithm 2:** *verifyX*(N, X, (c_s, h_s, w_s), t_s)

```
1 let hasUnknown ← false;
2 foreach trigger position (hp, wp) do
3 let φ ← φpre;
4 foreach image I ∈ X do
5 let φI ← attackCondition(N, I,φpre, (cs, hs, ws), (hp, wp), ts);
6 if φI is UNSAT then
7 φ ← false;
8 break;
9 else
10 φ ← φ ∧ φI ;
11 if solving φ results in SAT or UNKNOWN then
12 if opTrigger(N, X, φ, (cs, hs, ws), (hp, wp), ts) returns a trigger then
13 return UNSAFE;
14 else
15 hasUnknown ← true;
16 return hasUnknown ? UNKNOWN : SAFE;
```
be generated successfully. After analyzing all trigger positions (and not finding a successful trigger), if *hasUnknown* is true, we return UNKNOWN or otherwise SAFE.

#### **3.3 Abstract Interpretation**

Function attackCondition returns a constraint that must be satisfied such that the trigger with shape (c_s, h_s, w_s) is successful on the image I at position (h_p, w_p). In this work, for efficiency reasons, it is built on abstract interpretation techniques [32]. Multiple abstract domains have been proposed for analyzing neural networks, such as interval [41], Zonotope [30], and DeepPoly [32]. In this work, we adopt the DeepPoly abstract domain [32], which has been shown to strike a balance between precision and efficiency.

In the following, we assume each hidden layer in the network is expanded into two separate layers, one for the affine transformation and the other for the activation function. We use l to denote the number of layers in the expanded network, n_i to denote the number of neurons in layer i, and x^I_{i,j} to denote the variable representing the j-th neuron in layer i for the image I. The constraint φ_I returned by function attackCondition(N, I, φ_pre, (c_s, h_s, w_s), (h_p, w_p), t_s) is a conjunction of three parts.

$$\phi\_I \equiv pre\_I \land \mathcal{A}\_I \land post\_I$$

where pre_I is the constraint on the input features according to the image I, i.e., pre_I ≡ φ_pre ∧ ⋀_{j ∈ P(h_p, w_p)} x^I_{0,j} = x_j ∧ ⋀_{j ∉ P(h_p, w_p)} x^I_{0,j} = I[j], where j ∉ P(h_p, w_p) means that j is not an index (of a pixel) of the trigger; x^I_{0,j} is the variable that represents input feature j (a.k.a. neuron j at the input layer) for the image I; and I[j] is the (normalized) pixel value of the image at index j. Intuitively, the constraint pre_I "erases"

**Fig. 3.** An example of abstract interpretation

the pixels in the trigger, i.e., they can now take any value within their range, while the remaining pixels must take their values from the image. post_I represents the condition for a successful attack. That is, the value of the target label (i.e., x^I_{l−1,t_s}) must be greater than the value of every other label: post_I ≡ ⋀_{0 ≤ j < n_{l−1} ∧ j ≠ t_s} x^I_{l−1,t_s} > x^I_{l−1,j}.

More interestingly, A_I is a constraint that over-approximates the behavior of the neural network N according to the DeepPoly abstract domain. That is, given the constraint on the input layer pre_I, a set of abstract transformers is applied to compute a linear over-approximation of every neuron in the next layer, every neuron in the layer after that, and so on until the output layer. The constraint computed for each neuron x^I_{i,j} is of the form ge^I_{i,j} ≤ x^I_{i,j} ≤ le^I_{i,j} ∧ lw^I_{i,j} ≤ x^I_{i,j} ≤ up^I_{i,j}, where ge^I_{i,j} and le^I_{i,j} are two linear expressions over the variables representing neurons from the previous layer (i.e., layer i − 1), and lw^I_{i,j} and up^I_{i,j} are the concrete lower and upper bounds of the neuron. Note that the abstract transformers are different for the activation function layers and the affine layers. As the DeepPoly abstract transformers are not our contribution, we skip the details and refer the reader to [32], including for their soundness (i.e., they always over-approximate).
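To give a flavour of such bound propagation, the sketch below uses the much coarser interval domain, which keeps only the concrete bounds lw/up and drops the symbolic expressions ge/le that DeepPoly additionally tracks; it is an illustration of the general idea, not of DeepPoly itself:

```python
import numpy as np

def affine_bounds(W, b, lw, up):
    """Propagate concrete lower/upper bounds through an affine layer
    y = Wx + b: positive weights take the matching bound, negative
    weights take the opposite one."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    lower = W_pos @ lw + W_neg @ up + b
    upper = W_pos @ up + W_neg @ lw + b
    return lower, upper

def relu_bounds(lw, up):
    """Propagate concrete bounds through a ReLU layer."""
    return np.maximum(lw, 0.0), np.maximum(up, 0.0)
```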

*Example 1.* Since it is too complicated to show the details of applying abstract interpretation to the neural network shown in Fig. 1, we instead construct a simple example, shown in Fig. 3, to illustrate how it works. There are two features in this artificial image I: x^I_{0,1} has a constant value of 0.5 and x^I_{0,0} is the trigger, whose value ranges from 0 to 1. That is, pre_I ≡ 0 ≤ x^I_{0,0} ≤ 1 ∧ x^I_{0,1} = 0.5. After expanding the hidden layers, the network has 6 layers, each of which has 2 neurons. Applying the DeepPoly abstract transformers from the input layer all the way to the output layer, we obtain the abstract states for the last layer. Further, assume that the target label is 0. The constraint post_I is thus as follows: post_I ≡ x^I_{5,0} > x^I_{5,1}. Solving the constraints returns SAT with x^I_{0,0} = 0. Indeed, with the stamped image I_s = [0, 0.5], the output vector is [1, 0]. We have thus identified a successful attack on the target label 0.

*Optimization.* Note that at line 6 of Algorithm 2, for each constraint φ_I, we perform a quick check to see whether the constraint is satisfiable. If φ_I is UNSAT, we can ignore the remaining images and analyze the next trigger position, which speeds up the process. One naive approach is to call a solver on φ_I, which would incur significant overhead since it could happen many times. To reduce the overhead, we propose a simple procedure to quickly check whether φ_I is UNSAT based solely on its abstract states at the output layer. That is, we check the satisfiability of the following constraint instead: ⋀_{0 ≤ j < n_{l−1} ∧ j ≠ t_s} up^I_{l−1,t_s} > lw^I_{l−1,j}. Recall that up^I_{l−1,t_s} is the concrete upper bound of the neuron t_s and lw^I_{l−1,j} is the concrete lower bound of the neuron j at the output layer. Thus, intuitively, we check whether the concrete upper bound of the target label t_s is larger than the concrete lower bound of every other label. If this constraint is UNSAT, it is impossible to have the target label as the result and thus the attack would fail on the image I. We then only call the solver on φ_I if the above procedure does not return UNSAT. Furthermore, the loop in Algorithm 2 can be parallelized straightforwardly, i.e., by using a separate process to verify each trigger position. Whenever a trigger is found by any of the processes, the whole algorithm is interrupted.
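The quick check operates directly on the concrete output bounds; a minimal sketch (the helper name `may_reach_target` is ours):

```python
def may_reach_target(up_out, lw_out, t_s):
    """Necessary condition for a successful attack on image I: the
    concrete upper bound of the target label must exceed the concrete
    lower bound of every other output label. If this fails, phi_I is
    UNSAT and this trigger position can be skipped for image I."""
    return all(up_out[t_s] > lw_out[j]
               for j in range(len(lw_out)) if j != t_s)
```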

#### **3.4 Generating Backdoor Triggers**

In the following, we present the details of function opTrigger, which intuitively aims to generate a trigger S with shape (c_s, h_s, w_s) at position (h_p, w_p) that attacks every image I in X successfully. If the solver applied to φ at line 11 of Algorithm 2 returns a model that satisfies φ, we first check whether the model is indeed a trigger that successfully attacks every image in X. Due to the over-approximation of abstract interpretation, the model might be a spurious trigger. If it is a real trigger, we return the model. Otherwise, we employ an optimization-based approach to generate a trigger.

Given a network N, one image I, a target label t_s, and a position (h_p, w_p), let I_s be the stamped image generated from I by stamping the trigger at the position (h_p, w_p). We generate a backdoor trigger S by minimizing the following loss function.

$$\operatorname{loss}(N, I, S, (h\_p, w\_p), t\_s) = \begin{cases} 0 & \text{if } n\_s > n\_o \\ n\_o - n\_s + \epsilon & \text{otherwise} \end{cases}$$

where n_s = N(I_s)[t_s] is the output value of the target label; n_o = max_{j ≠ t_s} N(I_s)[j] is the maximum value of any label other than the target label; and ε is a small constant (e.g., 10^{−9}). Note that the trigger S is the only variable in the loss function. Intuitively, the loss function returns 0 if the attack on I by the trigger is successful. Otherwise, it returns a quantitative measure of how far the attack is from succeeding on I. Given a set of images X, the loss function is defined as the sum of the losses for the individual images: loss(N, X, S, (h_p, w_p), t_s) = Σ_{I ∈ X} loss(N, I, S, (h_p, w_p), t_s). The following optimization problem is then solved to find a trigger which successfully attacks all images in X: arg min_S loss(N, X, S, (h_p, w_p), t_s).
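A direct reading of the loss, summed over the images in X, is sketched below; the stamping logic and helper names are ours, and `forward` stands for any function computing the network's output vector on a stamped image:

```python
import numpy as np

EPS = 1e-9  # the small constant epsilon in the loss definition

def attack_loss(forward, images, trigger, h_p, w_p, t_s):
    """Sum over X of the per-image loss: 0 for each image whose stamped
    version is classified as t_s; otherwise the gap n_o - n_s + EPS by
    which the attack misses."""
    total = 0.0
    c_s, h_s, w_s = trigger.shape
    for img in images:
        stamped = img.copy()
        stamped[:, h_p:h_p + h_s, w_p:w_p + w_s] = trigger
        out = forward(stamped)
        n_s = out[t_s]  # output value of the target label
        n_o = max(out[j] for j in range(len(out)) if j != t_s)
        total += 0.0 if n_s > n_o else (n_o - n_s + EPS)
    return total
```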

#### **3.5 Correctness and Complexity**

**Lemma 1.** *Given a neural network* N*, a set of images* X*, a trigger shape* (c_s, h_s, w_s)*, and a target label* t_s*, Algorithm 2 (1) returns SAFE only if there is no backdoor attack which is successful on all images in* X *with the provided trigger shape and target label; and (2) returns UNSAFE only if there exists a backdoor attack which is successful on all images in* X *with the provided trigger shape and target label.*

*Proof.* By [32], function attackCondition always returns a constraint which is an over-approximation of the constraint that must be satisfied such that the trigger is successful on image I. Furthermore, Algorithm 2 returns SAFE only at line 16, i.e., only if constraints that must be satisfied to attack all images in X at each certain position are UNSAT. Thus, (1) is established. (2) is trivially established since we only return UNSAFE when a trigger that is successful on every provided image is generated. 

The following establishes the soundness of our approach.

**Theorem 1.** *Given a neural network* N*, a success rate* θ*, a target label* t_s*, a trigger shape* (c_s, h_s, w_s)*, a type I error* α*, a type II error* β*, and a half-width of the indifference region* δ*, Algorithm 1 returns SAFE only if there is sufficient evidence (subject to type I error* α *and type II error* β*) that there is no backdoor attack with a success rate of at least* θ *with the provided trigger shape and target label at the specified significance level.*

*Proof.* If there is a backdoor attack with a success rate no less than θ, then, given a set of K randomly selected images, the probability that an attack exists on all of them is no less than θ^K (since there is at least one backdoor attack with a success rate no less than θ, and maybe more). Thus, the probability of not having an attack is no more than 1 − θ^K. By the correctness of the SPRT algorithm, Algorithm 1 returns SAFE only if there is sufficient evidence that H0 is true, i.e., that the probability of not having an attack on a set of K randomly selected images is more than 1 − θ^K, which implies there is sufficient evidence that there is no backdoor attack with a success rate no less than θ. The theorem holds.

Furthermore, it is trivial to show that Algorithm 1 returns UNSAFE only if there exists a backdoor attack which has a success rate at least θ with the provided trigger shape and target label.

In the following, we briefly discuss the complexity of our approach. It is straightforward to see that Algorithm 2 always terminates if a timeout is imposed on solving the constraints and the optimization problems. Since we can always set a tight time limit on solving the constraints and the optimization problems, the complexity of the algorithm is determined mainly by the complexity of function attackCondition, which in turn is determined by the complexity of abstract interpretation. The complexity of applying abstract interpretation with the DeepPoly abstract domain is O(l² × n³_max), where l is the number of layers and n_max is the maximum number of neurons in any of the layers. Let K be the number of images in X. Note that the number of trigger positions is O(h × w), i.e., the size of an image. The best-case complexity of Algorithm 2 is O(l² × n³_max × h × w) and the worst-case complexity is O(l² × n³_max × K × h × w). We remark that in practice, l typically ranges from 1 to 20; n_max is often advised to be no more than the input size (e.g., from dozens to thousands); K ranges from a few to hundreds; and h × w depends on the image resolution (e.g., from hundreds to millions). Thus, in general, Algorithm 2 could be time-consuming in practice and we anticipate further optimization in future work.

The complexity of Algorithm 1 is the complexity of Algorithm 2 times the complexity of the SPRT algorithm. The complexity of the SPRT algorithm is in general hard to quantify and we refer the readers to [1] for a detailed discussion.

#### **3.6 Discussion**

Our approaches are designed to verify the absence of input-agnostic (i.e., not input-specific) backdoor attacks as presented in Sect. 2. In the following, we briefly review other backdoor attacks and discuss how to extend our approach to support them.

In [12], Gu *et al.* describe a backdoor attack which, instead of forcing the network to classify any stamped image with the target label, only alters the label if the original image has a specific ground truth label t_i (e.g., only Bob with the trigger will activate the backdoor and be classified as Alice the manager). Our verification approach can easily be adapted to verify the absence of this attack by focusing on images with label t_i in Algorithm 1 and Algorithm 2.

Another attack proposed in [12] works by reducing the performance (e.g., accuracy) of the neural network on images with a specific ground truth label t_i, i.e., given an image with ground truth label t_i, the network classifies the stamped image with some label t_s ≠ t_i. The attack can be similarly handled by focusing on images with ground truth label t_i, although, due to the disjunction introduced by t_s ≠ t_i, the constraints are likely to be harder to solve. That is, we can focus on images with ground truth label t_i in Algorithm 2 and define an attack to be successful if L(N(I_s)) ≠ t_i.

In [19], Liu *et al.* propose using backdoor triggers with different shapes (i.e., not just squares or rectangles). If the user is aware of the shape of the backdoor trigger, a trigger of that shape can be used as input for Algorithm 1 and Algorithm 2, and the algorithms would verify the absence of such a backdoor. Alternatively, the user can choose a square-shaped backdoor trigger that is large enough to cover the actual backdoor trigger, in which case our algorithms remain sound, although they might be inconclusive if the trigger is too big.

Multiple groups [2,20,28,35] propose poisoning only those samples in the training data which have the same ground truth label as the target label, to improve the stealthiness of the backdoor attack. This type of attack is designed to evade human inspection of the training data, and so does not affect our verification algorithms.

In this work, we consider a specific type of stamping, i.e., the backdoor trigger replaces part of the original clean image. Multiple groups [6,19] propose using a blending operation as the 'stamping', i.e., the features of the backdoor trigger are blended with the features of the original image with some coefficient α. This is a form of input-specific backdoor: the trigger is different for different images. One way to handle such backdoor attacks is to modify the constraint pre_I according to the blending operation (assuming that α is known). Since the blending operation proposed in [6,19] is linear, we expect this would not introduce additional complexity into our algorithms.

Input-specific triggers, in general, may pose a threat to our approach. First, some input-specific triggers [19,20] cover the whole image, which is likely to make our approach inconclusive due to false alarms resulting from over-approximation. Second, it may not be easy to model some input-specific triggers in our framework. For instance, Liu *et al.* [20] recently proposed using reflection to create stamped images that look natural. Modeling the 'stamping' operation for this kind of attack would require us to know where the reflection is in the image, which is highly non-trivial. However, it should also be noted that input-specific triggers are often not as effective as input-agnostic triggers, e.g., the reflection-based attack reported in [20] is hard to reproduce. Furthermore, as discussed in Sect. 2, a backdoor attack with input-specific triggers is a more powerful attack than adversarial attacks, and the problem of verifying the absence of backdoor attacks with input-specific triggers is not yet clearly defined.

#### **4 Implementation and Evaluation**

We have implemented our approach as a self-contained analysis engine in the Socrates framework [26]. We use Gurobi [13] to solve the constraints and use scipy [36] to solve the optimization problems.

We collect a set of 51 neural networks. 45 of them are fully connected networks trained on the MNIST training set (i.e., a standard dataset which contains black and white images of digits). These networks have between 3 and 5 hidden layers, and the number of neurons in each hidden layer is 10, 20, 30, 40, or 50. To evaluate our approach on neural networks built with different activation functions, each activation function (i.e., ReLU, Sigmoid, and Tanh) is used in 15 of the neural networks. Among the remaining six networks, three are larger fully connected networks adopted from the benchmarks reported in [32]; they are all built with the ReLU activation function. For convenience, we name the networks in the form *f k n*, where *f* is the name of the activation function, *k* is the number of hidden layers, and *n* is the number of neurons in each hidden layer. The remaining three networks are convolutional networks (which are often used in face recognition systems) adopted from [32]. Although they have the same structure, i.e., each has two convolutional hidden layers and one fully connected hidden layer, they are trained differently: one is trained in the normal way, one using DiffAI [22], and the last one using projected gradient descent [7]. These training methods are developed to improve the robustness of neural networks against adversarial attacks; our aim is thus also to evaluate whether they help to prevent backdoor attacks. We name these networks *conv*, *conv diffai*, and *conv pgd*.

We verify the networks against backdoor triggers of shape (1, 3, 3). All the networks are trained using clean data, since we focus on verifying the absence of backdoor attacks. They all have a precision of at least 90%, except *Sigmoid 4 10* and *Sigmoid 5 10*, which have precisions of 81% and 89% respectively. In the following, we answer multiple research questions. All the experiments are conducted on a machine with a 3.1 GHz 16-core CPU and 64 GB RAM. All models and experiment details are available at [27].

*RQ1: Is our realization of* verifyX *effective?* This question is meaningful as our approach relies on Algorithm verifyX. To answer it, for each network, we select the first 100 images in the test set (i.e., K = 100 for Algorithm 1, which is more than sufficient) and then apply Algorithm verifyX with these images and each of the labels 0 to 9. In total, we have 510 verification tasks. For each network, we run 10 processes in parallel, each of which verifies a separate target. The only exception is the network *ReLU 3 1024*: due to its complexity, we run only five parallel processes, since each process consumes a lot of resources. In each verification process, we filter out the images that are classified wrongly by the network, as well as the images already classified as the target label.
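The task enumeration and image filtering described above can be sketched as follows. This is an illustrative sketch, not the authors' code; `classify` is a hypothetical stand-in for a trained network's prediction function.

```python
def build_tasks(networks, labels=range(10)):
    """One verification task per (network, target label) pair."""
    return [(net, target) for net in networks for target in labels]

def filter_images(images, true_labels, target, classify):
    """Drop images misclassified by the network or already classified
    as the target label, as done in each verification process."""
    kept = []
    for img, lbl in zip(images, true_labels):
        pred = classify(img)
        if pred == lbl and pred != target:
            kept.append(img)
    return kept

# 51 networks x 10 target labels = 510 verification tasks in total.
tasks = build_tasks(["net%d" % i for i in range(51)])
assert len(tasks) == 510
```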

Figure 4 shows the results. The x-axis shows the groups of networks, e.g., *ReLU 3* means the five fully connected networks using the ReLU activation function with three hidden layers; *3 Full* and *3 Conv* mean the three fully connected and the three convolutional networks adapted from [32], respectively. The y-axis shows the number of (network, target) pairs. Note that each group may contain a different number of pairs, i.e., the

**Fig. 4.** The results of verifyX

maximum values for the small network groups are 50, while the maximum values for the last two groups are 30. First, we successfully complete 455 out of the 510 verification tasks (i.e., 89%), i.e., the corresponding neural network is safe with respect to the selected images. It is encouraging that the verified tasks include all models adopted from [32], which are considerably larger (e.g., with 1024 neurons in each layer) and more complex (i.e., convolutional networks). Second, some networks are not proved safe for some target labels: either there is indeed a backdoor trigger that we fail to identify (through optimization), or we fail to verify safety due to the over-approximation introduced by abstract interpretation. Lastly, given the same structure (i.e., the same number of hidden layers and the same number of neurons in each hidden layer), the networks using the ReLU and Sigmoid activation functions are more often verified safe than those using the Tanh activation function. This is most likely due to the difference in the precision of the abstract transformers for these functions.

*RQ2: Can we verify the absence of backdoor attacks with a certain level of success rate?* To answer this question, we evaluate our approach on six networks used in RQ1, i.e., *ReLU 3 10*, *ReLU 5 50*, *Sigmoid 3 10*, *Sigmoid 5 50*, *Tanh 3 10*, and *Tanh 5 50*. These networks are chosen to cover a wide range of numbers of hidden layers and neurons per layer, as well as the different activation functions. Note that due to the high complexity of Algorithm 1 (which potentially applies Algorithm 2 hundreds of times), running Algorithm 1 on all the networks evaluated in RQ1 would require an overwhelming amount of resources. *Furthermore, since there is no existing work on backdoor verification, we do not have any baseline to compare with.*

Recall that Algorithm 1 has two important parameters, K and θ, both of which potentially have a significant impact on the verification result. We thus run each network with four different settings, in which the number of images K is set to either 5 or 10, and the success rate θ to either 0.8 or 0.9. With 10 target labels, we thus have a total of 240 verification tasks for this experiment. Note that some preliminary experiments were conducted before we selected these two K values.

**Fig. 5.** Verification results

We use all 10000 images in the test set as the image population and randomly choose K images in each round of testing. When a trigger is generated, its success rate is validated on the images in the test set (after the above-mentioned filtering). As in RQ1, we run each network with 10 parallel processes, each of which verifies a separate target. As the SPRT algorithm may take a very long time to terminate, we set a timeout for each verification task: 2 h for the networks with three hidden layers, and 10 h for those with five hidden layers.
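The SPRT mentioned above can be illustrated with Wald's classic sequential probability ratio test over Bernoulli samples. The sketch below is a generic textbook version, not the authors' implementation; the indifference-region bounds `theta0`/`theta1` and the error rates `alpha`/`beta` are illustrative parameters.

```python
import math

def sprt(samples, theta0, theta1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test over Bernoulli samples.
    H0: success rate <= theta0; H1: success rate >= theta1 (theta0 < theta1).
    Returns 'H0', 'H1', or 'undecided' if the samples run out first."""
    accept_h0 = math.log(beta / (1 - alpha))   # lower stopping bound
    accept_h1 = math.log((1 - beta) / alpha)   # upper stopping bound
    llr = 0.0                                  # running log-likelihood ratio
    for success in samples:
        if success:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1 - theta1) / (1 - theta0))
        if llr <= accept_h0:
            return "H0"
        if llr >= accept_h1:
            return "H1"
    return "undecided"
```

The long running times observed in the experiments correspond to the `llr` statistic wandering between the two stopping bounds for many rounds before either bound is crossed.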

The results are shown in Fig. 5. The x-axis shows the networks; the y-axis shows the number of verified pairs of network and target label. We make multiple observations based on the experimental results. First, a quick glance shows that, given the same structure and hypothesis testing parameters, more networks built with the ReLU activation function are verified than those built with the Sigmoid and Tanh functions. Second, the best result is achieved with K = 5 and θ = 0.9. With these parameter values, we can verify that the three networks *ReLU 3 10*, *ReLU 5 50*, and *Sigmoid 3 10* are safe with respect to all target labels, and the network *Sigmoid 5 50* is safe with respect to nine out of 10 target labels. If we keep the success rate at 0.9 and increase the number of images K from 5 to 10, the number of verified cases for the network *Sigmoid 5 50* decreases. This is because when we increase the number of images that must all be attacked successfully, the probability that there is no attack increases, which means we need more rounds of testing to confirm the hypothesis H0, and so the verification process for *Sigmoid 5 50* times out before reaching a conclusion. We make a similar observation when we keep the number of images K at 5 but decrease the success rate from 0.9 to 0.8: when the success rate decreases, the probability of not having an attack increases, which requires more tests to confirm the hypothesis H0. As a result, for all four networks, multiple verification tasks time out before reaching a conclusion. However, there is an exception when we keep the success rate at 0.8 and increase the number of images from 5 to 10.
While the number of verified cases for the network *ReLU 5 50* decreases (which can be explained in the same way as above), the number of verified cases for the network *Sigmoid 3 10* increases (and the results for the other two

**Fig. 6.** The running time of the experiments in RQ1 with benchmark networks

networks do not change). Our explanation is that when we increase the number of images K to 10, it is easier for Algorithm 2 to conclude that there is no attack, and so Algorithm 1 still collects enough evidence to conclude H0. On the other hand, when the number of images is 5, Algorithm 2 may return a lot of UNKNOWN results (due to spurious triggers), and so the hypothesis testing in Algorithm 1 goes back and forth between the two hypotheses H0 and H1 and eventually times out.

A slightly surprising result is obtained for the network *Tanh 3 10*: our trigger generation process generates two triggers, for the target labels 2 and 5, when the success rate is set to 0.8. This is surprising as these networks are trained without backdoor attacks. The result can potentially be explained by the combination of the relatively low success rate (i.e., 0.8) and the phenomenon known as universal adversarial perturbations [23]. Given the returned triggers, users may want to investigate the network further and potentially improve it with techniques such as robust training [7,22].

*RQ3: Is our approach efficient time-wise?* To answer this question, we collect the wall-clock time of the experiments in RQ1 and RQ2. For each network, we record the average running time over the 10 target labels. The results for the 45 small networks are shown in Fig. 6. The x-axis shows the groups of 15 networks categorized by activation function, and the y-axis shows the running time on a logarithmic scale in the form of boxplots (where the box spans the 25th to the 75th percentile, the bottom and top lines are the minimum and maximum, and the orange line is the median). The execution time ranges from 14 s to less than 6 h for these networks. Furthermore, there is not much difference between the running times of the networks using the ReLU and Sigmoid activation functions. However, the running time of the networks using the Tanh function is one order of magnitude larger than that of the ReLU and Sigmoid networks. The reason is that the Tanh networks have many non-safe cases (as shown in Fig. 4) and, as a result, the verification process needs to check more images at more trigger positions. The running time of the networks adopted from [32] ranges from more than 5 min to less than 4 h, as shown in Table 1. Finally, the running time for each network in RQ2 (i.e., the time required to verify the networks against backdoor attacks) under the different settings is shown in Table 2.


**Table 1.** The running time of the experiments in RQ1 with networks adapted from [32]

**Table 2.** The running time of the experiments in RQ2


*RQ4: Can our approach generate backdoor triggers?* Being able to generate counterexamples is part of a useful verification method. We conduct another experiment to evaluate the effectiveness of our backdoor trigger generation approach. We train a new set of 45 networks that have the same structure as those used for answering RQ1. The difference is that this time each network is trained to contain a backdoor through data poisoning. In particular, for each network, we randomly extract 20% of the training data, stamp a white square of shape (1, 3, 3) in one corner of the images, assign a random target label, and then train the neural network from scratch on the poisoned training data. While such an attack has been shown to be effective [12], it is not guaranteed to always succeed on a randomly selected set of images. Thus, we do the following to make sure that there exists a trigger for a set of selected images. From the 10000 images in the test set, we first filter out the images that are classified wrongly or already classified with the target label; the remaining images form a set X0. Next, to make sure that the selected images have a high chance of being attacked successfully, we apply another filter on X0: we stamp each image in X0 with a white square at the same trigger position used to poison the training data, and keep the image if its stamped version is classified by the network with the target label. The images remaining after this second filter form another set X. We apply our approach, in particular the backdoor trigger generation, on X if |X|/|X0| ≥ 0.8, i.e., if the backdoor attack has a success rate of at least 80%.
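The two-stage filtering above can be sketched as follows. This is an illustrative sketch; `classify` and `stamp` are hypothetical helpers standing in for the network's prediction function and the stamping operation.

```python
def select_attackable_images(test_set, target, classify, stamp):
    """Two-stage filtering: `test_set` is a list of (image, true_label)
    pairs; returns (X, X0) as described in the text."""
    # Stage 1: drop images misclassified or already given the target label.
    x0 = [img for img, lbl in test_set
          if classify(img) == lbl and lbl != target]
    # Stage 2: keep an image only if its stamped version is classified
    # with the target label.
    x = [img for img in x0 if classify(stamp(img)) == target]
    return x, x0

def ready_for_generation(x, x0, rate=0.8):
    """Trigger generation runs only if |X| / |X0| >= rate."""
    return bool(x0) and len(x) / len(x0) >= rate
```

When the second condition fails (or X0 is empty), the generation process does not start, which mirrors the failure cases discussed below.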

The results are shown in Fig. 7 in which the y-axis shows the number of networks. The timeout is set to be 120 s. Among the 45 networks, we can see that a trigger is successfully generated for 33 (i.e., 73%) of the networks. A close investigation of these networks shows that the generated trigger is the exact white square that is used to stamp the training data. There are 12 networks for which the trigger is not generated. We

**Fig. 7.** The results of backdoor trigger generation

investigate these networks and find that they are either too biased (i.e., classifying every image with the target label and thus |X0| = 0) or the attack on them does not perform well (i.e., |X|/|X0| < 0.8). In other words, the backdoor attack on these networks failed and, as a result, the generation process does not even begin for them. In a nutshell, we successfully generate the trigger for every successful backdoor attack. Finally, note that the running time of the backdoor generation process is reasonable (on average, 50 s to generate a backdoor trigger for one network) and thus does not affect the overall performance of our verification algorithm.

# **5 Related Work**

The work closest to ours is [37], in which Wang *et al.* aim to certify neural networks' robustness against backdoor attacks using randomized smoothing. However, there are many noticeable differences between their approach and ours. First, while our work focuses on verifying the absence of a backdoor, their work aims to certify the robustness of individual images based on the provided training data and learning algorithm (which can be used to implicitly derive the network). Second, by using random noise to estimate the networks' behaviors, their approach can only obtain very loose results. As shown in their experiments, they can only certify robustness against backdoor attacks with triggers containing two pixels, and only on a "toy" network with two layers and two labels, after simplifying the input features by rounding them to 0 or 1. Compared to their approach, ours applies to networks used to solve real image classification problems, as shown in our experiments.

Our work is closely related to a line of work on verifying neural networks. Existing approaches mostly focus on the local robustness property and can be roughly classified into two categories: exact methods and approximation methods. The exact methods aim to model the networks precisely and solve the verification problem using techniques such as mixed-integer linear programming [34] or SMT solving [8,15]. On the one hand, these approaches guarantee sound and complete results in verifying neural networks. On the other hand, they often have limited scalability and are thus restricted to small neural networks. Moreover, these approaches have difficulty handling activation functions other than ReLU.

In comparison, the approximation approaches over-approximate neural network behavior to gain better scalability. AI<sup>2</sup> [10] is the first work pursuing this direction, using the classic abstract interpretation technique. Subsequent work explores different abstract domains for better precision without sacrificing too much scalability [29,30,32]. In general, the approximation approaches are more scalable than the exact methods and are capable of handling activation functions such as Sigmoid and Tanh. However, due to the over-approximation, these methods may fail to verify a valid property.

We also note that it is possible to incorporate abstraction refinement into the approximation methods to gain better precision, for instance, by splitting an abstraction into multiple parts to reduce the imprecision due to over-approximation. Many works [21,40,41] fall into this category. We remark that our approach is orthogonal to the development of such sophisticated verification techniques for neural networks.

Finally, our approach, especially the part on backdoor trigger generation, is related to the many approaches for generating adversarial samples for neural networks. Representative approaches in this category are FGSM [11], JSMA [24], and C&W [4], which aim to generate adversarial samples violating the local robustness property, and [42], which aims to violate a fairness property.

#### **6 Conclusion**

In this work, we propose the first approach to formally verify that a neural network is safe from backdoor attacks. We address the problem of how to verify the absence of a backdoor that reaches a certain level of success rate. Our approach is based on abstract interpretation, and we provide an implementation based on the DeepPoly abstract domain. The experimental results show the potential of our approach. In the future, we intend to extend our approach with more abstract domains as well as to improve its performance so as to verify more real-life networks. We also intend to apply our approach to networks designed for other tasks, such as sound or text classification.

**Acknowledgements.** This research is supported by the Ministry of Education, Singapore under its Academic Research Fund Tier 3 (Award ID: MOET32020-0004). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore. This research is also partly supported by the Starry Night Science Fund of Zhejiang University Shanghai Institute for Advanced Study, Grant No. SN-ZJU-SIAS-001.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Trainify**: A CEGAR-Driven Training and Verification Framework for Safe Deep Reinforcement Learning**

Peng Jin<sup>1</sup>, Jiaxu Tian<sup>1</sup>, Dapeng Zhi<sup>1</sup>, Xuejun Wen<sup>2</sup>, and Min Zhang1,3(B)

<sup>1</sup> Shanghai Key Laboratory of Trustworthy Computing, ECNU, Shanghai, China

zhangmin@sei.ecnu.edu.cn
<sup>2</sup> Huawei International, Singapore, Singapore
<sup>3</sup> Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China

**Abstract.** Deep Reinforcement Learning (DRL) has demonstrated its strength in developing intelligent systems. These systems shall be formally guaranteed to be trustworthy when applied to safety-critical domains, which is typically achieved by formal verification performed after training. This *train-then-verify* process has two limits: (i) trained systems are difficult to formally verify due to their continuous and infinite state space and inexplicable AI components (*i.e.*, deep neural networks), and (ii) the *ex post facto* detection of bugs increases both the time- and money-wise cost of training and deployment. In this paper, we propose a novel verification-in-the-loop training framework called Trainify for developing safe DRL systems, driven by counterexample-guided abstraction and refinement. Specifically, Trainify trains a DRL system on a finite set of coarsely abstracted but efficiently verifiable state spaces. When verification fails, we refine the abstraction based on the returned counterexamples and train again on the finer abstract states. The process is iterated until all predefined properties are verified against the trained system. We demonstrate the effectiveness of our framework on six classic control systems. The experimental results show that our framework yields more reliable DRL systems with provable guarantees than conventional DRL approaches, without sacrificing system performance such as cumulative reward and robustness.

**Keywords:** Deep reinforcement learning · Model checking · CEGAR · ACTL

# **1 Introduction**

Deep Reinforcement Learning (DRL) has shown its strength in developing intelligent systems for complex control tasks such as autonomous driving [37,40]. Verifiable safety and robustness guarantees are crucial to these safety-critical DRL systems before deployment [23,44]. A typical example is autonomous driving, which is arguably still a long way off due to safety concerns [21,39]. Recently, tremendous efforts have been made toward adapting existing formal methods and devising new ones for DRL systems in order to provide provable safety guarantees [18,25,45,46,51].

Formally verifying DRL systems is still a challenging problem. The challenge arises from DRL systems' three features. First, the state space of a DRL system is usually continuous and infinite [28]. Second, the behavior of a DRL system is non-linear and determined by high-order system dynamics [17]. Last but not least, the controllers, typically deep neural networks (DNN), are almost inexplicable because of their black-box development [20,52]. The three features make it unattainable to verify DRL systems using conventional formal methods, *i.e.*, modeling them as state transition systems and verifying temporal properties using dedicated decision procedures [4]. Most existing approaches have to simplify the problem by abstraction or over-approximation techniques and restrict to specific properties such as safety or reachability [46].

Another common problem with most existing formal verification approaches to DRL systems is that they are applied after training is concluded. These *train-then-verify* approaches have two limitations. First, verification results may be inconclusive due to abstraction or overestimation. The non-linearity of both the system dynamics and the deep neural networks makes it difficult to keep the overestimation in a reasonable range, resulting in false positives in verification results [50]. Second, the *ex post facto* detection of bugs increases both the time- and money-wise cost of training and deployment. No evidence shows that iterating training and verification helps improve system reliability, as tuning the parameters in neural networks may cause an unpredictable impact on the properties because of their inexplicability [24].

To address the challenges in training and verifying DRL systems, in this paper we propose a novel *verification-in-the-loop* framework for training safe and reliable DRL systems with verifiable guarantees. Provided that a set of properties is predefined for the target DRL system, our framework trains the system and verifies it against the properties in every iteration. To overcome the verification challenges in DRL systems, for the first time, we propose in our framework a novel approach to train the systems on a finite set of *abstract states*, based on the observation that *approximate abstractions can still preserve near-optimal behavior* [1]. These states are abstractions of the actual states. Training on the finite abstract states allows us to model the AI-embedded systems as finite-state transition systems, and we can thus leverage classic model checking techniques to verify temporal properties more complicated than safety and reachability.

As system performance may be affected by the abstraction granularity, we employ the idea of the counterexample-guided abstraction and refinement (CEGAR) [8] in model checking along the training process. We start with a coarsely abstracted but efficiently verifiable state space and train and verify DRL systems on the abstract state space. Once verification fails, we refine the abstract state space based on the returned counterexamples and retrain the system on the finer-grained refined state space. The process is repeated until all the properties are verified successfully. We, therefore, call the training and verification framework *CEGAR-driven*, by which we can reach an appropriate abstraction granularity that guarantees both system performance and verification scalability.

Our verification-in-the-loop training framework has four advantages over conventional DRL training and verification approaches. Firstly, our approach produces correct-by-construction DRL systems that are verifiably safe with respect to user-defined safety requirements. Secondly, more complicated properties such as safety and liveness can be verified thanks to the dedicated training approach on the abstracted state space; another advantage of this training approach is that it is orthogonal to state-of-the-art DRL algorithms such as Deep Q-Network (DQN) [34] and Deep Deterministic Policy Gradient (DDPG) [32]. Thirdly, our approach provides a flexible mechanism for fine-tuning an appropriate abstraction granularity to balance system performance and verification scalability. Lastly, training on abstract states makes DRL systems more robust against adversarial and environmental perturbations, because a small perturbation to an actual state may not alter the decision of the neural network on the same abstract state.

We implement a prototype tool called Trainify (abbreviated from Train and Verify, available at https://github.com/aptx4869tjx/RL verification). We perform extensive experiments on six classic control tasks from public benchmarks to evaluate the effectiveness of our framework. For each task, we train two DRL systems under the same settings, one with our approach and one with the corresponding conventional DRL algorithm. We compare the two systems in terms of the properties they shall satisfy and their performance in terms of cumulative reward and robustness. Experimental results show that the systems trained with our approach are more efficient to verify and more reliable than those trained with conventional methods, while their performance is competitive or even higher.

In summary, this paper makes the following three major contributions:


*Paper Organization.* Section 2 briefly introduces deep reinforcement learning. Section 3 presents the model-checking problem of DRL systems. Section 4 presents our training and verification framework. Section 5 shows six case studies and experimental results. Section 6 mentions some related work, and Sect. 7 concludes the paper.

# **2 Deep Reinforcement Learning (DRL)**

DRL is a technique for learning optimal control policies using deep neural networks according to evaluative feedback [31]. An agent in a DRL system interacts with the environment and records its state s<sub>t</sub> at each time step t. It feeds s<sub>t</sub> into a deep neural network to compute an action a<sub>t</sub> and transitions to the next state s<sub>t+1</sub> according to a<sub>t</sub> and the system dynamics. The system dynamics describe the non-linear behavior of the agent over time. The agent receives a scalar reward according to reward functions. Some algorithms estimate the distance between the action determined by the network and the expected action in the same state, and then update the parameters in the network according to the estimated distance so as to maximize the cumulative reward.

#### *A Running Example.*

Figure 1 shows a classic DRL task of learning a control policy to drive a car to the right hilltop. The car is initially positioned on a track between two mountains. The track is one-dimensional, and thus the car's position is represented as a real num-

**Fig. 1.** A DRL example of mountain car system.

ber. Velocity is another dimension in the car's state and is represented as a real number too. Thus, the car's state is a pair (p, v) of position p and velocity v. An action a is a real number representing the force imposed on the car. The action is computed by a neural network on both p and v.

The sign of a indicates the direction of the force, i.e., positive for the right and negative for the left. Given a state s<sub>t</sub> = (p<sub>t</sub>, v<sub>t</sub>) and an action a<sub>t</sub> at time step t, the system transitions to the next state s<sub>t+1</sub> = (p<sub>t+1</sub>, v<sub>t+1</sub>) following the given dynamics:

$$p_{t+1} = p_t + v_t \Delta_t,\tag{1}$$

$$v_{t+1} = v_t + (a_t - m_c \times g \times \cos(3p_t))\Delta_t,\tag{2}$$

where m<sub>c</sub> denotes the car's mass, g denotes gravity, and Δ<sub>t</sub> is the unit interval between two consecutive steps. In DRL, time is usually discretized to facilitate implementation. The car is assumed to move in uniform motion during a unit interval.

*Reward Setting.* The reward function R maps state s<sub>t</sub>, action a<sub>t</sub>, and successor state s<sub>t+1</sub> to a real number, representing the value rewarded for applying a<sub>t</sub> to s<sub>t</sub> and transitioning to s<sub>t+1</sub>. The purpose of R is to guide the agent toward the preset goals by making the cumulative reward as large as possible. R is defined based on prior knowledge or expert experience before training.

In the Mountain Car example, the controller receives the reward R(p<sub>t</sub>, v<sub>t</sub>, a<sub>t</sub>, p<sub>t+1</sub>, v<sub>t+1</sub>) = −1.0 at each time step where p<sub>t+1</sub> < 0.45. The reward is a negative constant because the goal in this example is to force the car to reach the right hilltop (p = 0.45) as quickly as possible: a larger cumulative reward upon reaching the destination means the car took fewer steps. A reward function can be a more complex formula than a constant when the reward strategy depends on states and actions.
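Equations (1)-(2) and the reward above can be combined into a single step function. This is an illustrative sketch: the text fixes only the goal position (0.45), so the mass, gravity, and time-step constants below are placeholders, not values from the paper.

```python
import math

# Illustrative constants (only GOAL = 0.45 comes from the text).
M_C, G, DT, GOAL = 0.2, 9.8, 0.01, 0.45

def step(p, v, a):
    """One discrete transition following Eqs. (1)-(2), together with
    the constant -1.0 reward received while the car is short of the
    right hilltop."""
    p_next = p + v * DT                                  # Eq. (1)
    v_next = v + (a - M_C * G * math.cos(3 * p)) * DT    # Eq. (2)
    reward = -1.0 if p_next < GOAL else 0.0
    return p_next, v_next, reward
```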

*Training.* The essence of DRL training is to update parameters in neural networks so that the networks can compute optimal actions for input states. A deep neural network is a directed graph comprised of an input layer, multiple hidden layers, and an output layer, as shown in Fig. 2. Each layer contains several nodes called *neurons*. They are connected to the neurons on the following layer. Each edge has a weight. The values passed on the edge are multiplied by the weight. A neuron on

**Fig. 2.** A simple neural network.

hidden layers takes the sum of all the incoming values, adds a bias, and feeds the result to its activation function σ. The output of σ is passed to the neurons on the following layer. There are several commonly used activation functions, e.g., ReLU (σ(x) = max(x, 0)), Sigmoid (σ(x) = 1/(1 + e<sup>−x</sup>)), and Tanh (σ(x) = (e<sup>x</sup> − e<sup>−x</sup>)/(e<sup>x</sup> + e<sup>−x</sup>)). In DRL, the inputs to a neural network are system states. The outputs are (possibly continuous) actions to be performed in the present state.
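The per-neuron computation just described (weighted sum, bias, activation) can be sketched directly; this is a minimal illustration, not the framework's code.

```python
import math

def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def layer(inputs, weights, biases, act):
    """One fully connected layer: each neuron takes the weighted sum
    of the incoming values, adds its bias, and applies the activation."""
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]
```

A full network is then just a chain of such layers, with the system state as the input to the first layer and the action as the output of the last.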

During training, agents continuously interact with the environment to obtain trajectories. A trajectory is a 4-tuple, consisting of a state s, the action a on s, the reward of executing a on s, and the successor state after the execution. A predefined loss function uses the collected trajectories to estimate an action value and compute the distance between the estimated value and the one computed by the neural network for the same state. Guided by the distance, the parameters in the network are updated using gradient descent algorithms [12]. The process is repeated until the system reaches a predefined maximal iteration limit or a preset cumulative reward threshold.

#### **Algorithm 1:** Training for the Mountain Car Task using DQN


There are several well-established training algorithms, such as Deep Q-Network (DQN) [35] and Deep Deterministic Policy Gradient (DDPG) [32]. Algorithm 1 depicts a high-level process of training the mountain car using DQN. We call the process of training the car to move from the initial position to the destination an *episode*. For each episode, the initial state is first determined (Line 2). Then, the controller determines the action to take based on the current state s<sub>t</sub> and the neural network N (Line 4). After performing the action, the controller receives a reward value (−1.0 in this case) and transitions to the next state based on the system dynamics (Line 5). A loss P is estimated by calling the loss function L with partially sampled trajectories (Line 6) and is used to update the parameters of the network N (Line 7). We omit the details of L, as it is not the emphasis of our paper.
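The overall shape of Algorithm 1 can be sketched as follows. The sketch is illustrative: `env`, `net`, `loss_fn`, and `update` are hypothetical callables standing in for the components described above, and the Line comments refer to the lines of Algorithm 1.

```python
def train(env, net, loss_fn, update, episodes, horizon):
    """High-level mirror of Algorithm 1's training loop."""
    trajectories = []
    for _ in range(episodes):
        s = env.reset()                   # Line 2: pick the initial state
        for _ in range(horizon):
            a = net(s)                    # Line 4: network chooses the action
            s_next, r = env.step(s, a)    # Line 5: reward and next state
            trajectories.append((s, a, r, s_next))
            p = loss_fn(trajectories)     # Line 6: estimate the loss P
            update(net, p)                # Line 7: update the parameters of N
            s = s_next
    return trajectories
```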

*The Target DRL Systems in this Work.* DRL systems differ along several dimensions, such as the availability of system dynamics [17] and the determinism of actions. In this work, we assume that the system dynamics are known prior to training and that the actions are deterministic. That is, a unique action is determined for the present state, and its successor state is also uniquely determined by the system dynamics.

# **3 Model Checking of DRL Systems**

#### **3.1 The Model Checking Problem**

A trained deterministic DRL system can be represented as a tuple M = (S, A, f, π, S_0, L), where S is the state space, which is usually infinite, S_0 ⊆ S is the initial state space, A is a set of actions, f : S × A → S is the system dynamics, π : S → A is a policy function, and L : S → 2^AP is a state labeling function. In this work, we use π to denote the policy that is implemented by the trained deep neural network in the system.

The model M of a DRL system is essentially a Kripke structure [10], which is a 4-tuple (S, R, S_0, L). Given two arbitrary states s, s′ in S, there is a transition from s to s′, denoted by (s, s′) ∈ R, if and only if there is an action a in A such that a = π(s) and s′ = f(s, a). Given that a property is formalized by a formula Φ in some logic, the model checking problem of the system is to decide whether M satisfies Φ, denoted by M |= Φ.

In this work, we formulate properties in ACTL [4], a fragment of CTL in which only universal path quantifiers are allowed and negation is restricted to atomic propositions [14,15]. ACTL consists of state formulas Φ and path formulas ϕ in the following syntax:

$$\begin{array}{lcl}
\Phi & ::= & true \mid false \mid a \mid \neg a \mid \Phi_1 \land \Phi_2 \mid \Phi_1 \lor \Phi_2 \mid A\,\varphi,\\
\varphi & ::= & X\,\Phi \mid \Phi_1\; U\,\Phi_2 \mid \Phi_1\; R\,\Phi_2.
\end{array}$$

The temporal operators fall into two main categories, *i.e.*, quantifiers over paths and path-specific operators. In ACTL, only the universal path quantifier A is considered. The path-specific operators are X (*next*), U (*until*) and R (*release*).


Using the above basic temporal operators, we can define two further important path-specific operators, G (*globally*) and F (*finally*), with G Φ = *false* R Φ and F Φ = *true* U Φ. Intuitively, G Φ means that Φ has to hold on the entire subsequent path, and F Φ means that Φ eventually has to hold somewhere on the subsequent path.

We choose ACTL to formulate system properties and requirements in our framework for two main reasons. Firstly, our framework relies on refining the abstract states where system properties are violated. Such states can be obtained as the counterexamples returned by model checkers when an ACTL property is shown not to hold. Secondly, the verification results of ACTL formulas are preserved by property-based abstraction [9,11]. Such preservation is vital for the correctness of our verification results, because abstraction is necessary for our framework to guarantee the scalability of the verification algorithm.

#### **3.2 Challenges in Model Checking DRL Systems**

Unlike model checking for finite-state systems, model checking M |= Φ for DRL systems is particularly challenging. The challenge arises from three features of DRL systems, *i.e.*, (i) the infinity and continuity of the state space S, (ii) the non-linearity of the system dynamics f, and (iii) the inexplicability of the policy π that is encoded as a deep neural network. Usually, the state space of a DRL system is continuous and infinite, and its behaviors are non-linear due to high-order system dynamics. Even worse, the actions on states are determined by inexplicable deep neural networks, which means that the transitions between states cannot be defined as straightforwardly as those of traditional software systems.

To build a model M for a DRL system, we have to compute the successor of each state s by applying the neural network π to s to compute the action a and then applying a to s according to the system dynamics f. Specifically, the successor of s can be represented as f(s, π(s)). The non-linearity of both f and π and the infinity of S make the verification problem difficult. Most existing approaches rely on over-approximations of f and π to simplify the problem [16,25,29,46]. However, over-approximation inevitably introduces overestimation and restricts these approaches to safety properties and reachability analysis in bounded steps.

# **4 The CEGAR-Driven DRL Approach**

#### **4.1 The Framework**

Figure 3 shows an overview of our framework. It consists of three parts, *i.e.*, training, verification and refinement. In the training part, a DRL system is trained on a finite set of abstract states. An actual state is first mapped to its corresponding abstract state, which is then fed into the neural network to compute a corresponding action. The action is applied to the actual state to drive the system to the next state. The reward is accumulated according to a predefined reward function, and the neural network is updated in the same way as in conventional DRL algorithms. In the verification part, we build a Kripke structure on the finite abstract state space based on the trained neural network. Then, we verify the desired properties, predefined as ACTL formulas Φ. If all the properties are verified valid, we stop training, and a DRL system is developed. If some property does not hold, we move to the refinement part. When verification fails, counterexamples are returned; they are the abstract states where the property is violated. We refine these states by subdividing them into fine-grained sub-states and substituting the sub-states for the *bad* states. We then resume training the system on the refined abstract state space and repeat the whole process.

**Fig. 3.** The training, verification and refinement framework for developing DRL systems.

The seamless integration of training, verification and refinement constitutes a *verification-in-the-loop* DRL approach, driven by counterexample-guided abstraction and refinement. We start with a coarse abstraction. After every training episode, we model check the system against all the predefined properties. If all the properties are verified, we stop training and obtain a verified system. Otherwise, counterexamples are returned and the abstract state space is refined for further training. After several iterations, a DRL system is trained with all the predefined properties rigorously verified.

#### **4.2 Training on Abstract States**

DRL is a process of learning optimal actions on all system states for specific objectives. A trained model partitions the state space into a family of sets such that the same action is taken in all the states of a set [38]. Continuous state spaces can be adaptively discretized into finite ones for learning without affecting learning performance [41,42]. Motivated by this observation, we discretize a continuous state space into a finite set of fragments. We call each fragment an abstract state and train the DRL system by feeding abstract states into the deep neural network for decision making.

**Fig. 4.** An example of encoding an abstract state space into an R-tree.

*System State Abstraction.* Given an n-dimensional DRL system, a concrete system state s is represented as a vector of n real numbers. Each number has a physical meaning for the system, such as the speed and position in the running example. Let L_i and U_i be the lower and upper bounds of the i-th dimension of S. Then, the state space S of the control system is Π_{i=1}^{n} [L_i, U_i].

Initially, we use interval boxes to discretize S. An interval box I is a vector of n intervals, denoted by (I_1, I_2, ..., I_n). Each interval I_i (1 ≤ i ≤ n) represents a set of system states, denoted by S_{I_i}, where a state s belongs to S_{I_i} if and only if the i-th value in s is in I_i. An interval box I represents the intersection of all the sets S_{I_i} (i = 1, ..., n).

Let d_i ∈ R (0 < d_i ≤ U_i − L_i) be the diameter by which we evenly subdivide the interval [L_i, U_i] in each dimension i into (U_i − L_i)/d_i unit intervals, and let I_i = [L_i, U_i]/d_i denote the set of all the unit intervals. Then, we obtain the abstract state space **S** = I_1 × ... × I_n, which is an abstraction of the infinite continuous state space S. We call the vector (d_1, d_2, ..., d_n) of the n diameters the *abstraction granularity* and denote it by δ.

Given a continuous state space S and its corresponding abstract state space **S**, we call the mapping function from the states in S to the corresponding abstract states in **S** a *transformer* A : S → **S**. The transformer can be encoded as an R-tree, a tree-like data structure devised for efficiently indexing multidimensional objects [22]. Figure 4 depicts an example of building an R-tree to index an abstract state space of the continuous space [v_0, v_4] × [p_0, p_5]. A rectangle on a leaf node represents an abstract state, and one on a non-leaf node represents the minimum bounding rectangle enclosing all the rectangles on its child nodes. There can be multiple rectangles on a single node. An R-tree supports intersection search, *i.e.*, searching for the abstract states that intersect with the interval being queried. Given a concrete state, an R-tree can quickly return its corresponding abstract state. Note that in Fig. 4, we assume the state space is discretized evenly for clarity. During training, the sizes of abstract states become diverse after iterative refinement, and the R-tree is updated correspondingly.
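For the common case of a uniform granularity δ, the transformer A reduces to a grid lookup; the sketch below assumes uniformly sized boxes (the R-tree generalizes this to unevenly refined states after refinement):

```python
def make_transformer(lows, highs, deltas):
    # Returns a transformer A : S -> abstract S for a uniform grid with
    # per-dimension granularity deltas.
    def transform(state):
        box = []
        for s, lo, hi, d in zip(state, lows, highs, deltas):
            # Index of the unit interval containing s, clamped to the last
            # cell so the upper boundary hi still maps to a valid box.
            k = min(int((s - lo) // d), int(round((hi - lo) / d)) - 1)
            box.append((lo + k * d, lo + (k + 1) * d))
        return tuple(box)
    return transform
```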

*The Training Algorithms.* Training on abstract states can be achieved by extending existing DRL algorithms such as DQN and DDPG. The extension simply adapts the neural networks and loss functions in DRL systems so that they admit abstract states as inputs.


For neural networks, we only need to modify the input layer of the network N by doubling the number of its neurons. Given an n-dimensional system, we declare 2n input neurons. Each pair of neurons reads the lower and upper bounds of an interval in an abstract state, respectively. This dedicated structure guarantees that a trained network produces the same action for all the states that correspond to the same abstract state.

Figure 5 shows an example of adapting the network in the Mountain Car for training on abstract states. For traditional DRL algorithms, two input neurons are needed in the neural network to take p and v as inputs, respectively. To train on abstract states, four input neurons are needed to take the lower and upper bounds of the position and velocity intervals in abstract states. For instance, let the interval box (I_p, I_v) be the abstract state of (p, v). Then, the lower bounds and the upper bounds of I_p and I_v

**Fig. 5.** Adapting neural networks for abstract states.

are input to the four neurons, respectively. This adaptation guarantees that the neural network always produces the same action on the states that are transformed into the same abstract state.
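The doubled input layer simply reads each abstract state as a flat vector of interval bounds; a minimal sketch:

```python
def to_network_input(abstract_state):
    # Flattens an abstract state ([l1,u1],...,[ln,un]) into the 2n-vector
    # (l1, u1, ..., ln, un) read by the doubled input layer. Every concrete
    # state inside the same box yields the identical network input, and
    # hence the identical action.
    vec = []
    for lo, hi in abstract_state:
        vec.extend([lo, hi])
    return vec
```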

As an illustrative example, we consider incorporating these two steps to extend Algorithm 1. Algorithm 2 depicts the main workflow, where the differences are highlighted. The main difference from the traditional training process lies in Line 6. Given a concrete state s = (s_1, ..., s_n), A returns the abstract state **s** = ([l_1, u_1], ..., [l_n, u_n]) such that l_i ≤ s_i ≤ u_i for i = 1, ..., n, which is the input fed into the neural network. Although the dimension of input states increases, the form of the corresponding output actions does not change. Therefore, the loss function naturally adapts to the change in input states.

#### **4.3 Model Checking Trained DRL Systems**

A DRL system can be naturally verified using abstract model checking [26]. The actual states of the system are first abstracted in the same way as in training, and then the transitions between abstract states are determined by the corresponding actions and dynamics. ACTL formulas are then model checked on the abstract state transition system.

*Building Kripke Structure.* During the training phase, the actual state space has already been abstracted into a finite set **S** of abstract states. Therefore, the main task for abstract model checking is to build a Kripke structure by defining the transition relation on **S**.

Algorithm 3 depicts the process of building a Kripke structure K for a trained DRL system. Firstly, K is initialized on the set **S** with R being empty. Starting from an initial abstract state **s**_0, we compute its successors and define the transitions from **s**_0 to them. We repeat the process until all reachable states are traversed.

Given an abstract state **s**, we compute its abstract successor states by applying the corresponding action a and the dynamics to **s**. Because the system is trained on abstract states, all the actual states in **s** have the same action, *i.e.*, a = N(**s**). Let f*(**s**, a) = {f(s, a) | s ∈ **s**} be the set of all the successors of the actual states in **s**. Due to the non-linearity of f and the infinity of **s**,

**Algorithm 3:** Building Kripke Structure

**Input:** Initial state **s**_0, state space **S**, system dynamics f, neural network N
**Output:** A Kripke structure K

1&nbsp;&nbsp;K = Initialize_Kripke_Structure()
2&nbsp;&nbsp;Queue ← {**s**_0}
3&nbsp;&nbsp;**while** Queue is not empty **do**
4&nbsp;&nbsp;&nbsp;&nbsp;Fetch **s** from Queue
5&nbsp;&nbsp;&nbsp;&nbsp;**for** i = 1, ..., n **do**
6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[l_i, u_i] ← g(f(**s**, N(**s**)), i)
7&nbsp;&nbsp;&nbsp;&nbsp;{**s**_1, ..., **s**_m} := h(([l_1, u_1], ..., [l_n, u_n]), **S**)
8&nbsp;&nbsp;&nbsp;&nbsp;**for** j = 1, ..., m **do**
9&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;K.add_edge(**s** → **s**_j)
10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**if** **s**_j is not traversed **then**
11&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Push **s**_j into Queue
12&nbsp;**return** K

we over-approximate the set f*(**s**, a) = {f(s, a) | s ∈ **s**} as an interval box. As shown in Fig. 6, the dashed box is an over-approximation of f*(**s**, a). The over-approximation may overlap one or more abstract states, e.g., **s**_1, ..., **s**_4 in the example. All the overlapped abstract states are successors of **s**. In Algorithm 3, function g calculates the interval box and function h determines the overlapped abstract states. Note that the shapes of abstract states may differ because they are refined during training, as detailed in Sect. 4.4.
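The worklist loop of Algorithm 3 can be sketched generically in Python, with a `successors` callback standing in for the composition of functions g and h:

```python
from collections import deque

def build_kripke(s0, successors):
    # Breadth-first construction of the abstract transition relation:
    # successors(s) must return every abstract state overlapping the
    # over-approximated image of s (the roles of g and h in Algorithm 3).
    edges, seen = [], {s0}
    queue = deque([s0])
    while queue:
        s = queue.popleft()
        for t in successors(s):
            edges.append((s, t))
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen, edges
```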

We use an interval to approximate the i-th dimension's values in all the successor states. Then, all the successor states are approximated as a vector of n intervals. We can compute the upper and lower bounds for each i by solving the following two optimization problems, respectively:

**Fig. 6.** Transitions between abstract states

$$\begin{aligned} \underset{s \in \mathbf{s}}{\arg\max} & \quad v_i \cdot f(s, \mathcal{N}(\mathbf{s})) \\ \underset{s \in \mathbf{s}}{\arg\min} & \quad v_i \cdot f(s, \mathcal{N}(\mathbf{s})) \end{aligned}$$

where v_i is a one-hot vector with the i-th element being 1. Because all the states in **s** have the same action according to the network, N(**s**) in the above optimization problems can be substituted by a constant, *i.e.*, the action taken by the system on all the states in **s**. The substitution significantly simplifies the optimization problems; no information about the networks is needed in the simplified problems. The simplified problems can be efficiently solved using off-the-shelf scientific computing tools such as SciPy [48].
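When the dynamics happen to be monotone on the box, as in the mountain-car example that follows, the two optimization problems can even be solved by evaluating the dynamics at the corners of the box; the sketch below is a simplification under that monotonicity assumption (the general case needs a numerical optimizer such as SciPy):

```python
from itertools import product

def interval_bounds(f, box):
    # Bounds each output dimension of {f(s) | s in box} by evaluating f at
    # the corners of the interval box. Exact for dimension-wise monotone f;
    # only an approximation otherwise.
    corners = [f(c) for c in product(*box)]
    dims = len(corners[0])
    return [(min(c[i] for c in corners), max(c[i] for c in corners))
            for i in range(dims)]
```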

We consider an example in the mountain car system. We assume that the current abstract state **s** is ([0, 0.2], [0, 0.02]) and the adopted action is 0.001, which says that the controller accelerates to the right for all states in **s**. Based on the dynamics defined by Eq. 1, we can compute the upper bounds of both position and velocity in the next step by solving the following two optimization problems:

$$\underset{p_t \in [0, 0.2],\; v_t \in [0, 0.02]}{\arg\max} \quad p_t + v_t \tag{$p_{t+1}$}$$

$$\underset{p_t \in [0, 0.2],\; v_t \in [0, 0.02]}{\arg\max} \quad v_t + 0.001 - 0.0025 \cos(3p_t) \tag{$v_{t+1}$}$$

The lower bounds of p_{t+1} and v_{t+1} are calculated similarly. Then, we obtain an abstract state **s**′ = ([0, 0.22], [−0.0035, 0.0165]), which is an overestimated set of all the actual successors of the states in **s**. There is a transition from **s** to any abstract state **s**′′ = ([p̲, p̄], [v̲, v̄]) in **S** if **s**′ and **s**′′ overlap, *i.e.*, if (0 < p̲ < 0.22 ∨ 0 < p̄ < 0.22) ∧ (−0.0035 < v̲ < 0.0165 ∨ −0.0035 < v̄ < 0.0165) is true. Note that the transition from **s** to **s**′′ includes all the transitions between the actual states in **s** and **s**′′, respectively. It may also include transitions that do not actually exist, due to the overestimation.
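The overlap test that induces transitions is a per-dimension interval check; a small sketch:

```python
def boxes_overlap(a, b):
    # Two interval boxes overlap iff their intervals intersect in every
    # dimension; a transition from an abstract state to another is added
    # exactly when the over-approximated successor box overlaps the latter.
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))
```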

There are other approaches for over-approximating the set f*(**s**, a), such as template polyhedra like rectangles and octagons [2]. Note that there is always a trade-off between the tightness of a polyhedron and the efficiency of computing it. For instance, an octagon can approximate the set more tightly than a rectangle, but it takes roughly twice the effort to compute its borders. The tighter an over-approximation is, the more accurate the set of computed successors is, but the more time it takes to compute the approximation.

*Property-Based Abstraction.* For high-dimensional DRL systems, the abstract state space may still be too large to model check directly once the abstraction granularity becomes small after refinement. To improve the scalability of model checking, we further abstract the constructed Kripke structure based on the ACTL formula Φ to be checked, using the abstraction approach of [9].

**Definition 1 (State Abstraction).** *Given an abstract state space* **S** = I_1 × ... × I_n *and an ACTL formula* Φ*, let* D_Φ *be the set of dimensions that occur in* Φ *and* S̃ = Π_{d ∈ D_Φ} I_d*. Function* α_Φ : **S** → S̃ *is an abstract transformer such that for every* **s** ∈ **S** *and* s̃ ∈ S̃*,* s̃ = α_Φ(**s**) *if and only if* s̃[d] = **s**[d] *for all* d ∈ D_Φ*.*

Given a Kripke structure K = (**S**, R, **S**_0, L) and an ACTL formula Φ, let α_Φ : **S** → S̃ be the abstract transformer, and ÃP ⊆ AP be the set of atomic propositions in Φ. We can construct the following abstract Kripke structure K̃ = (S̃, R̃, S̃_0, L̃) based on α_Φ, where:

– S̃ = Π_{d ∈ D_Φ} I_d;
– R̃ = {(α_Φ(**s**), α_Φ(**s**′)) | **s**, **s**′ ∈ **S**, (**s**, **s**′) ∈ R};
– S̃_0 = {α_Φ(**s**) | **s** ∈ **S**_0};
– L̃ : S̃ → 2^ÃP such that L̃(s̃) = L(**s**) ∩ ÃP, where **s** ∈ **S** and s̃ = α_Φ(**s**).

We call K̃ a simulation of K with respect to Φ. An important property of the simulation is that the property represented by Φ is preserved by the abstract model K̃.

**Theorem 1 (Soundness).** *Let* K̃ *be a simulation of* K *with respect to an ACTL formula* Φ*. Then* K̃ |= Φ *implies* K |= Φ*.*

The proof of Theorem 1 is straightforward, and we omit it due to space limits. According to the theorem, we can conclude that K |= Φ holds whenever we find a simulation K̃ of K and model check that K̃ |= Φ holds.
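The abstraction of Definition 1 is a projection onto the dimensions occurring in the formula; a minimal sketch of α_Φ and the induced abstract transition relation:

```python
def alpha(abstract_state, dims):
    # Projects an abstract state onto the dimensions D_Phi of the formula;
    # states that agree on those dimensions collapse into one abstract state.
    return tuple(abstract_state[d] for d in sorted(dims))

def abstract_edges(edges, dims):
    # Image of the transition relation under alpha: each concrete edge
    # (s, s') maps to (alpha(s), alpha(s')).
    return {(alpha(s, dims), alpha(t, dims)) for s, t in edges}
```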

**Fig. 7.** An example of refinements on abstract states where properties are violated.

#### **4.4 Counterexample-Guided Refinement**

If a formula Φ is found not to hold, our algorithm returns the corresponding counterexamples. A counterexample is an abstract state where Φ is violated. We refine such abstract states into finer ones and substitute them in the abstract state space for further training.

A naïve refinement approach subdivides each dimension of a state into two intervals. Assuming that a property is violated on an abstract state **s** = ([l_1, u_1], ..., [l_n, u_n]), we can simply divide each dimension evenly into the two intervals [l_i, (l_i + u_i)/2] and [(l_i + u_i)/2, u_i], and obtain 2^n finer abstract states. Apparently, this refinement may lead to state space explosion, particularly for high-dimensional systems.

In our approach, to avoid state explosion, we refine states only on the dimensions that are used to define the properties being verified. Considering the mountain car example, we assume that the formula is AF[p ≥ 0.45], saying that the car will eventually reach the hilltop where p = 0.45. Suppose that the property fails and counterexamples are returned, and assume **s** = ([0, 0.2], [0, 0.02]) is the state where the property is violated, as shown in Fig. 7 (a). We bisect the state into two fine-grained sub-states, **s**_1 = ([0, 0.1], [0, 0.02]) and **s**_2 = ([0.1, 0.2], [0, 0.02]). Then, we substitute the two fine-grained states for **s** in the R-tree for further training. Figure 7 (b) shows the new R-tree after the substitution.
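Both refinement strategies can be sketched compactly; `bisect_dims` splits only the dimensions occurring in the formula, avoiding the 2^n blow-up of `bisect_all`:

```python
from itertools import product

def bisect_all(box):
    # Naive refinement: split every dimension in half -> 2^n sub-boxes.
    halves = [((lo, (lo + hi) / 2), ((lo + hi) / 2, hi)) for lo, hi in box]
    return [tuple(c) for c in product(*halves)]

def bisect_dims(box, dims):
    # Property-guided refinement: split only the dimensions in dims.
    halves = [(((lo, (lo + hi) / 2), ((lo + hi) / 2, hi)) if i in dims
               else ((lo, hi),))
              for i, (lo, hi) in enumerate(box)]
    return [tuple(c) for c in product(*halves)]
```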

It is worth mentioning that counterexamples may be spurious. Abstract states may include actual states that are unreachable in the trained system because of the approximation of the system dynamics. Unfortunately, it is difficult to check which states are actually unreachable, because we would need the corresponding initial state to check the reachability of these bad states. However, that initial state is *enclosed* in an abstract state and cannot be identified due to the abstraction. In our approach, we therefore perform refinement without checking whether the counterexamples are real. After refinement, the abstract states become finer-grained, and spurious counterexamples are discarded by training and verifying on these finer-grained abstract states. The price of such extra refinements is that more iterations of training and verification are conducted; the benefit is that the performance of the trained systems is better.

# **5 Implementation and Evaluation**

#### **5.1 Implementation**

We implement our framework in a prototype Python toolkit called Trainify. The toolkit leverages the open-source library *pyModelChecking* [6] as the back-end model checker and the scientific computing tool SciPy [48] as an optimization solver.

#### **5.2 Benchmarks and Experimental Settings**

We evaluate the effectiveness of our approach on a wide range of classic control tasks from public benchmarks. For each control task, we train two DRL systems using our approach and the corresponding conventional DRL approach, respectively. We compare the two trained systems in terms of their reliability, verifiability and system performance.

*Benchmarks*. We choose six classic control problems. Three of them are from the DRL training platform Gym [5], including Mountain Car, Pendulum and Cartpole. The other three, *i.e.*, B1, B2 and Tora, are the problems that are widely used for evaluation by state-of-the-art tools [19,25,27,28].


*Training Configurations and Evaluation Metrics.* We adopt the same system configurations and training parameters for each task, including neural network architecture, system dynamics, time interval, DRL algorithms and the number of training episodes.

We choose three metrics, namely the satisfaction of predefined properties, cumulative reward and robustness, to evaluate and compare the reliability, verifiability and performance of the DRL systems trained in our approach and those trained in the conventional DRL approach for the same task. The first metric concerns reliability and verifiability; the other two concern performance. The cumulative reward is an important metric for evaluating a trained system's performance, because maximizing the cumulative reward is the objective of learning. Robustness is another essential criterion for DRL systems, because the systems are expected to be robust against perturbations from both the environment and adversarial attacks. Note that we classify robustness into the performance category instead of reliability because we restrict the reliability of DRL systems to safety and functional requirements.

*Experimental Settings.* All experiments are conducted on a workstation running Ubuntu 18.04 with a 32-core AMD Ryzen Threadripper CPU @ 3.7 GHz and 128 GB RAM.

#### **5.3 Reliability and Verifiability Comparison**

We first evaluate the reliability and verifiability of the DRL systems trained in our approach and in the conventional approach, respectively. For each task, we predefine system properties according to its safety and functional requirements. The functional requirement is usually the objective of the control task. For instance, in the mountain car example, the controller's objective is to drive the car to the hilltop. We define an atomic proposition p > 0.45 to indicate that the car reaches the hilltop. Then, we can define an ACTL formula Φ_1 = AF(p > 0.45) to represent this liveness property. Safety requirements in DRL systems usually specify important parameters of the systems that must always be kept within safe ranges. For instance, a safety requirement in the mountain car example is that the car's velocity must be greater than 0.02 when the car moves to a position around 0.2, within a 0.05 deviation. The property can be represented by the ACTL formula Φ_2 as defined in Table 1. The properties of the other tasks are formalized similarly. The formulas and the types of properties are shown in the table.


**Table 1.** Expected properties and their definitions in ACTL of the selected control tasks.

**Remarks.** *target* is an atomic proposition, *i.e.*, x_1 ∈ [−0.3, 0.1] ∧ x_2 ∈ [−0.35, 0.5] in B2.

We compare the reliability and verifiability of all the trained DRL systems with respect to their predefined properties using both verification and simulation. The DRL systems trained in our approach can be naturally verified in our framework. For those trained in the conventional DRL approaches, our verification approach is not applicable because we cannot construct abstract Kripke structures for them. The main reason is that we cannot abstract the system states such that there is a unique action on all the actual states represented by the same abstract state. We therefore resort to the state-of-the-art reachability analysis tool Verisig 2.0 [25] to verify them. We also simulate all the trained systems in a fixed number of rounds and detect the occurrences of property violations. The purposes of the simulation are twofold: (i) to partially reflect the reliability of systems; and (ii) to validate the verification results in a bounded number of steps.

Table 2 shows the comparison results. We can observe that all the systems trained in our approach are successfully verified, and the corresponding properties hold on them. No violations are detected by simulation. For the systems trained with conventional DRL algorithms, only 8 out of 16 are successfully verified by Verisig. There are two cases where Verisig 2.0 returns **Unknown** when verifying φ_7 for task B2. This means that the verification fails because Verisig 2.0 cannot determine whether the destination region (defined by x_1 ∈ [−0.3, 0.1] ∧ x_2 ∈ [−0.35, 0.5]) must always be reached, as it computes a larger region that overlaps the *target*. The extra part of the larger region may be an overestimation caused by the over-approximation. By simulation, we detect violations of φ_7; these violations can be considered counterexamples to the property. Other properties, such as φ_2, φ_3, φ_4 and φ_8, are not supported by Verisig 2.0. Among these unverified properties, we detect violations by simulation for three of them. The violations indicate that the systems trained in conventional DRL approaches may not satisfy the expected properties, and existing


**Table 2.** Comparison of the verification and simulation results between the DRL systems trained in our approach and conventional DRL algorithms, respectively.

**Remarks. A.F.**: activation function; **T.T.**: average training time per iteration; **V.R.**: verification result; **V.T.**: average verification time per iteration; **Vio.**: the number of violations in simulation; **N/A**: not applicable; **Unknown**: verification fails. Time is recorded in seconds.

state-of-the-art verification tools cannot always verify them or find the violations. Our approach guarantees that the trained systems satisfy the properties, and the simulation results confirm that there are indeed no violations.

As for efficiency, on average, our approach costs slightly more time in training because it takes extra time to look up the corresponding abstract state for an actual state at every training step. This small time overhead is worthwhile for the sake of being verifiable. Besides verifiability, another benefit of this extra cost is that the efficiency of verification in our approach is not affected by the size or type of the neural networks, because we treat them as a black box in the verification. On the contrary, the efficiency of verifying systems trained in conventional approaches is restricted by the neural networks, as the verification times of Verisig 2.0 show.

Based on the above analysis, we conclude that the DRL systems developed in our approach are more trustworthy, as their predefined properties are provably satisfied by the systems. Moreover, their verification is more amenable and scalable than that of the systems trained in conventional DRL approaches.

#### **5.4 Performance Comparison**

We compare the performance of the DRL systems trained in our approach and the conventional approaches in terms of cumulative reward and robustness, respectively.

**Fig. 8.** Robustness comparison of the systems trained in our approach (blue) and in conventional approaches (orange). The number in the parentheses is the base of σ. For example, in Mountain Car, when the abscissa is equal to 50, σ = 50 × 0.0005 = 0.025. (Color figure online)

*Cumulative Reward.* We record the cumulative reward by running each system for 100 episodes in the simulation environment and calculating the averages. A larger reward implies that a system has better performance. Table 3 shows the cumulative reward of the six DRL systems trained in our approach and in conventional approaches, respectively. All the trained

**Table 3.** Comparison of accumulated reward.


systems achieve almost optimal cumulative reward. Among the ten cases, the systems trained in our approach perform better in four cases, equivalently in four cases, and worse in the remaining two. The remaining differences, due to floating-point errors, are almost negligible. In this sense, we say that the performance of the systems trained in the two different approaches is comparable.

Another observation from the results is that a system with a bigger neural network produces a larger reward. This characteristic is shared by both our approach and the conventional approaches. Thus, in our approach we can increase the size of the networks and even modify their architectures for better performance. Such changes incur no extra verification cost because our approach is entirely black-box, using the network only to output actions for a given abstract state.

*Robustness*. We demonstrate that the systems trained in our approach can be more robust than those trained with conventional DRL algorithms when the perturbation is within a reasonable range. To examine robustness, we add Gaussian noise to the actual states of the systems and check the cumulative reward under different levels of perturbation. Given an actual state $s = (s\_1, \ldots, s\_n)$, we add noise $X\_1, \ldots, X\_n$ to $s$ and obtain a perturbed state $s' = (s\_1 + X\_1, \ldots, s\_n + X\_n)$, where $X\_i \sim \mathcal{N}(\mu, \sigma^2)$ for $1 \le i \le n$ with $\mu = 0$. We start with σ = 0 and increase it gradually.
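The evaluation protocol above can be sketched as follows; `env_step` and `policy` are hypothetical stand-ins for the simulator and the trained controller, not the paper's actual API:

```python
import numpy as np

def perturbed_reward(env_step, policy, s0, sigma, horizon=200, seed=0):
    """Run one episode, adding N(0, sigma^2) noise to each observed state
    before it is fed to the policy, and return the cumulative reward.

    `env_step(s, a) -> (s_next, r, done)` and `policy(s) -> a` are
    illustrative stand-ins for the simulator and the trained controller.
    """
    rng = np.random.default_rng(seed)
    s, total = np.asarray(s0, dtype=float), 0.0
    for _ in range(horizon):
        noisy = s + rng.normal(0.0, sigma, size=s.shape)  # perturb the actual state
        a = policy(noisy)                                  # action from the perturbed state
        s, r, done = env_step(s, a)
        total += r
        if done:
            break
    return total
```

Repeating such rollouts over increasing σ values yields the reward-vs-perturbation curves of Fig. 8.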

Figure 8 shows the trend of cumulative reward as perturbations increase. For each system, we evaluate 200 different levels of perturbation, and for each level we conduct 20 repetitions to obtain the average and standard deviation of the reward, represented by the solid lines and shadows in Fig. 8. The general trend is that the cumulative reward deteriorates for all the systems, trained in either of the approaches. This result is reasonable because the actions computed by the neural networks are optimal for non-perturbed states but may not be optimal for the perturbed ones, leading to lower reward at some steps. However, we can observe that the decline ratio of the systems trained in our approach (blue) is smaller than that of those trained in conventional approaches (orange). When σ = 0, the accumulated reward of the two systems for the same task is almost the same. As σ increases, the performance declines more slowly for the systems trained in our approach than for those trained in the conventional approaches, as long as σ stays in a reasonably small range. That is because a perturbed state may belong to the same abstract state as its original state, and thus receives the optimal action. In this sense, we say the perturbation is *absorbed* by the abstract state, and the neural networks become less sensitive to perturbations. Our additional experiments on these examples show that a larger abstraction granularity produces a more robust system.
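As a minimal sketch of this absorption effect (the bounds and granularity below are made-up illustrative values, not the paper's configuration):

```python
import numpy as np

def abstract_state(s, lower, diameter):
    """Map a concrete state to the index tuple of its interval box.
    `lower` is the lower bound of the state space and `diameter` the
    abstraction granularity per dimension (both illustrative here)."""
    return tuple(np.floor((np.asarray(s) - lower) / diameter).astype(int))

# Bounds roughly resembling Mountain Car's state space (illustrative).
lo, d = np.array([-1.2, -0.07]), np.array([0.1, 0.01])

s = np.array([-0.501, 0.013])
s_perturbed = s + np.array([0.0009, 0.00009])  # small, noise-like perturbation

# Both states fall into the same abstract box, so an abstract-state policy
# returns the identical action: the perturbation is "absorbed".
same_box = abstract_state(s, lo, d) == abstract_state(s_perturbed, lo, d)
```

A coarser `diameter` absorbs larger perturbations, which matches the observation that larger abstraction granularity yields more robust systems.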

### **6 Related Work**

Our work has been inspired by several related works, which attempted to integrate formal methods and DRL approaches. We classify them into the following three categories.

*Verification-in-the-Loop Training.* Verification-in-the-loop training has been proposed for developing reliable AI-powered systems. In pioneering work, Nilsson *et al.* proposed a correct-by-construction approach for developing Adaptive Cruise Control (ACC) by first formally defining safety properties in Linear Temporal Logic (LTL) and then computing the safe domain where the LTL specification can be enforced [36]. Wang *et al.* proposed a correct-by-construction control learning framework that leverages verification during training to formally guarantee that the learned controller satisfies the required reach-avoid property [49]. Lin *et al.* proposed an approach for training robust neural networks for general classification problems by fine-tuning the network parameters based on the verification result [33]. Our work is a sequel to these works, with the new features of training on abstract states, counterexample-guided abstraction and refinement, and support for more complex properties.

*Safe DRL via Formal Methods.* Most existing approaches for the formal verification of DRL systems follow the *train-then-verify* style. Bacci and Parker [3] proposed an approach that splits an abstract domain into fine-grained ones and computes their successor abstract states separately for probabilistic model checking of DRL systems. The approach reduces overestimation and meanwhile constructs a transition system over abstract states, which allows one to verify more complex liveness and probabilistic properties, beyond safety, using bounded model checking [29] and probabilistic model checking. A criterion for subdividing an abstract domain is to ensure that all the states in the same sub-domain have the same action. Identifying these sub-domains is computationally expensive because it relies on iterative branching and bounding [3]. Furthermore, these approaches need to compute the output range of the neural networks on the abstract domains, and are therefore restricted to specific types and scales of networks. Besides model checking, reachability analysis [13,16,25,46] has been well studied to ensure the safety of DRL systems. The basic idea is to over-approximate the system dynamics and the neural networks to compute over-approximated reachable regions and check whether they intersect unsafe regions. However, large overestimation, limited scalability, and requirements on specific network architectures are common restrictions of these approaches. Online verification [47] and runtime monitoring [18] are other lightweight but effective formal methods for detecting potential flaws in a timely manner during system execution. Another direction is to synthesize *safe shields* [7,54] and barrier functions [53] to prevent agents from adopting dangerous actions. A strong assumption of these methods is that the set of valid safe states is given in advance. However, computing this set may be computationally intensive, and the approach is restricted to safety properties.

*Abstraction and State Discretization in DRL.* Abstraction in DRL has gained more attention in recent years. Abel presented a theory of abstraction for DRL in his dissertation and concluded that learning on abstraction can be more efficient while preserving near-optimal behaviors [1]. Abel's abstraction theory is focused on the systems with finite state space for learning efficiency. Our work demonstrates another advantage of learning on abstraction, *i.e.*, *formal reliability guarantee* to trained systems even with infinite state space.

The state-space abstraction in our framework is also inspired by *state-space discretization*, a technique for discretizing continuous state space, by which a finer partition of the state-action space is maintained during training for higher payoff estimates [41,42]. Our work shows that, once integrated with formal verification, state-space discretization is also useful for developing highly reliable DRL systems without loss of performance. In addition, our CEGAR-driven approach provides a flexible mechanism for fine-tuning the granularity of discretization to reach an appropriate balance between system performance and the size of the state space for formal verification.

### **7 Discussion and Conclusion**

We have presented a novel verification-in-the-loop framework for training and verifying DRL systems, driven by counterexample-guided abstraction and refinement. The framework can be used to train reliable DRL systems whose desired safety and functional properties are formally verified, without compromising system performance. We have implemented a prototype, Trainify, and evaluated it on six classic control problems from public benchmarks. The experimental results show that the systems trained in our approach are more reliable and verifiable than those trained in conventional DRL approaches, while their performance is comparable to, or even better than, the latter's.

Our verification-in-the-loop training approach sheds light on a new research direction for developing reliable and verifiable AI-empowered systems. It follows the idea of correctness-by-construction from traditional trustworthy software development and makes it possible to take system properties (or requirements) into account during the training process. It also reveals that (i) it is not necessary to learn on actual data to build high-performance (e.g., high-reward and robust) DRL systems, and (ii) abstraction is an effective means of dealing with the challenges in verifying DRL systems and should be introduced early, during training, rather than as an *ex post facto* device in verification.

Our work may inspire more research in this direction. One important objective is to investigate appropriate abstractions for high-dimensional DRL systems. In our current framework, we adopt the simplest interval abstraction, which suffices for low-dimensional systems. It would be interesting to apply more sophisticated abstractions, such as floating-point polyhedra combined with intervals, designed mainly for neural networks [43], to high-dimensional DRL systems. Another direction is to extend our framework to non-deterministic DRL systems. In the non-deterministic case, a neural network returns both actions and their corresponding probabilities. We can associate probabilities with state transitions and obtain a probabilistic model, which can be naturally verified using existing probabilistic model checkers such as Prism [30]. Thus, we believe that our approach is also applicable to such systems after a slight extension; this is another piece of future work.

**Acknowledgments.** The authors thank all the anonymous reviewers and Guy Katz from the Hebrew University of Jerusalem for their valuable comments on this work. The work has been supported by National Key Research Program (2020AAA0107800), Shanghai Science and Technology Commission (20DZ1100300), Shanghai Trusted Industry Internet Software Collaborative Innovation Center, Shanghai AI Innovation and Development Fund (2020-RGZN-02026), Shenzhen Institute of AI and Robotics for Society (AC01202005020), NSFC-ISF Joint Program (62161146001, 3420/21) and NSFC project (61872146).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Neural Network Robustness as a Verification Property: A Principled Case Study**

Marco Casadio1(B) , Ekaterina Komendantskaya1, Matthew L. Daggitt1, Wen Kokke2, Guy Katz3, Guy Amir3, and Idan Refaeli3

> <sup>1</sup> Heriot-Watt University, Edinburgh, UK *{*mc248,ek19,md2006*}*@hw.ac.uk <sup>2</sup> University of Strathclyde, Glasgow, UK wen.kokke@strath.ac.uk <sup>3</sup> The Hebrew University of Jerusalem, Jerusalem, Israel *{*guykatz,guyam,idan0610*}*@cs.huji.ac.il

**Abstract.** Neural networks are very successful at detecting patterns in noisy data, and have become the technology of choice in many fields. However, their usefulness is hampered by their susceptibility to *adversarial attacks*. Recently, many methods for measuring and improving a network's robustness to adversarial perturbations have been proposed, and this growing body of research has given rise to numerous explicit or implicit notions of robustness. Connections between these notions are often subtle, and a systematic comparison between them is missing in the literature. In this paper we begin addressing this gap, by setting up general principles for the empirical analysis and evaluation of a network's robustness as a mathematical property—during the network's training phase, its verification, and after its deployment. We then apply these principles and conduct a case study that showcases the practical benefits of our general approach.

**Keywords:** Neural Networks · Adversarial Training · Robustness · Verification

### **1 Introduction**

Safety and security are critical for many complex systems that use deep neural networks (DNNs). Unfortunately, due to the opacity of DNNs, these properties are difficult to ensure. Perhaps the most famous instance of this problem is guaranteeing the robustness of DNN-based systems against *adversarial attacks* [5,17]. Intuitively, a neural network is *ε-ball robust* around a particular input if, when one moves no more than ε away from that input in the input space, the output does not change much; or, alternatively, the classification decision that the network gives does not change. Even highly accurate DNNs often display only low robustness, and so measuring and improving the adversarial robustness of DNNs has received significant attention from both the machine learning and verification communities [7,8,15].

As a result, neural network verification often follows a *continuous verification cycle* [9], which involves retraining neural networks with a given *verification property* in mind, as Fig. 1 shows. More generally, such training can be regarded as a way to impose a formal specification on a DNN; and so, apart from improving its robustness, it may also contribute to the network's explainability and facilitate its verification. Due to the high level of interest in adversarial robustness, numerous approaches have been proposed in recent years for performing such retraining, each with its own specific details. However, it is quite unclear what benefits each approach offers from a verification point of view.

The primary goal of this case-study paper is to introduce a more holistic methodology, which puts the verification property at the centre of the development cycle and, in turn, permits a principled analysis of how this property influences both training and verification practices.

**Fig. 1.** Continuous Verification Cycle

In particular, we analyse the verification properties that implicitly or explicitly arise from the most prominent families of training techniques: *data augmentation* [14], *adversarial training* [5,10], *Lipschitz robustness training* [1,12], and *training with logical constraints* [4,20]. We study the effect of each of these properties on verifying the DNN in question.

In Sect. 2, we start with the forward direction of the continuous verification cycle, and show how the above training methods give rise to the logical properties of *classification robustness* (CR), *strong classification robustness* (SCR), *standard robustness* (SR) and *Lipschitz robustness* (LR). In Sect. 4, we trace the opposite direction of the cycle, i.e. we show how and when the verifier's failure to prove these properties can be mitigated; Sect. 3 first provides the auxiliary logical link needed for this step. Given a robustness property as a logical formula, we can use it not just in verification, but also in attacks or property-accuracy measurements. We take property-driven attacks as a valuable tool in our study, both in training and in evaluation. Section 4 makes the underlying assumption that verification requires retraining: it shows that the verifier succeeds on only 0–1.5% of images for an accurate baseline network. We show how our logical understanding of robustness properties empowers us in property-driven training and in verification. We first give abstract arguments for why certain properties are stronger than, or incomparable with, others; we then use training, attacks and the verifier Marabou to confirm these arguments empirically. Sections 5 and 6 add other general considerations for setting up the continuous verification loop and conclude the paper.

### **2 Existing Training Techniques and Definitions of Robustness**

**Data Augmentation** is a straightforward method for improving robustness via training [14]. It is applicable to any transformation of the input (e.g. addition of noise, translation, rotation, scaling) that leaves the output label unchanged. To make the network robust against such a transformation, one augments the dataset with instances sampled via the transformation.

More formally, given a neural network $N : \mathbb{R}^n \to \mathbb{R}^m$, the goal of data augmentation is to ensure *classification robustness*, defined as follows. Given a training input-output pair $(\hat{\mathbf{x}}, \mathbf{y})$ and a distance metric $|\cdot - \cdot|$, we say that $N$ is *classification-robust* if, for all inputs $\mathbf{x}$ within the ε-ball around $\hat{\mathbf{x}}$, class $\mathbf{y}$ has the largest score in the output $N(\mathbf{x})$.

#### **Definition 1 (Classification robustness).**

$$CR(\epsilon, \hat{\mathbf{x}}) \stackrel{\Delta}{=} \forall \mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon \Rightarrow \arg\max N(\mathbf{x}) = \mathbf{y}$$

In order to apply data augmentation, an engineer needs to specify: **c1.** the value of ε, i.e. the admissible range of perturbations; **c2.** the distance metric, which is determined according to the admissible geometric perturbations; and **c3.** the sampling method used to produce the perturbed inputs (e.g., random sampling, adversarial attacks, generative algorithms, prior knowledge of images).

Classification robustness is straightforward, but does not account for the possibility of having "uncertain" images in the dataset, for which a small perturbation ideally should change the class. For datasets that contain a significant number of such images, attempting this kind of training could lead to a significant reduction in accuracy.

**Adversarial training** is a current state-of-the-art method for robustifying a neural network. Whereas standard training tries to minimise the loss between the predicted value $f(\hat{\mathbf{x}})$ and the true value $\mathbf{y}$ for each entry $(\hat{\mathbf{x}}, \mathbf{y})$ in the training dataset, adversarial training minimises the loss with respect to the worst-case perturbation of each sample. It therefore replaces the standard training objective $\mathcal{L}(\hat{\mathbf{x}}, \mathbf{y})$ with $\max\_{\mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon} \mathcal{L}(\mathbf{x}, \mathbf{y})$. Algorithmic solutions to the maximisation problem that find the worst-case perturbation have been the subject of several papers. The earliest suggestion was the Fast Gradient Sign Method (FGSM) algorithm introduced by [5]:

$$\text{FGSM}(\hat{\mathbf{x}}) = \hat{\mathbf{x}} + \epsilon \cdot \text{sign}(\nabla\_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \mathbf{y})),$$

However, modern adversarial training methods usually rely on some variant of the Projected Gradient Descent (PGD) algorithm [11], which iterates FGSM:

$$\text{PGD}\_0(\hat{\mathbf{x}}) = \hat{\mathbf{x}}; \quad \text{PGD}\_{t+1}(\hat{\mathbf{x}}) = \text{PGD}\_t(\text{FGSM}(\hat{\mathbf{x}}));$$
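The two update rules above can be sketched in NumPy; `loss_grad` stands in for the gradient of the network's loss with respect to the input, and the L∞ projection step is one common PGD variant (the equations above show the unprojected iteration):

```python
import numpy as np

def fgsm(x_hat, y, loss_grad, eps):
    """One FGSM step: move eps along the sign of the input-loss gradient.
    `loss_grad(x, y)` is a placeholder for dL/dx of the model's loss."""
    return x_hat + eps * np.sign(loss_grad(x_hat, y))

def pgd(x_hat, y, loss_grad, eps, steps, ball=None):
    """Iterate FGSM; optionally project back into an allowed ball around
    the original input (the projection step of PGD, here an L_inf clip)."""
    x = np.asarray(x_hat, dtype=float)
    for _ in range(steps):
        x = fgsm(x, y, loss_grad, eps)
        if ball is not None:
            x0, r = ball
            x = np.clip(x, x0 - r, x0 + r)  # stay within the perturbation budget
    return x
```

With a real network, `loss_grad` would be computed by automatic differentiation; here it is an assumed callable.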

It has been empirically observed that neural networks trained using this family of methods exhibit greater robustness at the expense of an increased generalisation error [10,18,21], which is frequently referred to as the *accuracy-robustness trade-off* for neural networks (although this effect has been observed to disappear as the size of the training dataset grows [13]).

In logical terms, what is this procedure trying to train for? Let us assume that there is some maximum distance δ by which it is acceptable for the output to be perturbed, given the size of perturbations in the input. This leads us to the following definition, where $||\cdot - \cdot||$ is a suitable distance function over the output space:

**Definition 2 (Standard robustness).**

$$SR(\epsilon, \delta, \hat{\mathbf{x}}) \stackrel{\Delta}{=} \forall \mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon \Rightarrow ||f(\mathbf{x}) - f(\hat{\mathbf{x}})|| \le \delta$$

We note that, just as with data augmentation, choices **c1**–**c3** are still there to be made, although the sampling methods are usually given by special-purpose FGSM/PGD heuristics based on computing the loss function gradients.

**Training for Lipschitz Robustness.** More recently, a third competing definition of robustness has been proposed: Lipschitz robustness [2]. Inspired by the well-established concept of Lipschitz continuity, Lipschitz robustness asserts that the distance between the original output and the perturbed output is at most a constant L times the distance between the inputs.

#### **Definition 3 (Lipschitz robustness).**

$$LR(\epsilon, L, \hat{\mathbf{x}}) \stackrel{\Delta}{=} \forall \mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon \Rightarrow ||f(\mathbf{x}) - f(\hat{\mathbf{x}})|| \le L|\mathbf{x} - \hat{\mathbf{x}}|$$

As will be discussed in Sect. 4, this is a stronger requirement than standard robustness. Techniques for training for Lipschitz robustness include formulating it as a semidefinite programming optimisation problem [12] or including a projection step that restricts the weight matrices to those with suitable Lipschitz constants [6].

**Training with Logical Constraints.** Logically, this discussion leads one to ask whether a more general approach to constraint formulation may exist, and several attempts in the literature addressed this research question [4,20], by proposing methods that can translate a first-order logical formula <sup>C</sup> into a *constraint loss function* <sup>L</sup><sup>C</sup> . The loss function penalises the network when outputs do not satisfy a given Boolean constraint, and universal quantification is handled by a choice of sampling method. Our standard loss function L is substituted with:

$$\mathcal{L}^\*(\hat{\mathbf{x}}, \mathbf{y}) = \alpha \mathcal{L}(\hat{\mathbf{x}}, \mathbf{y}) + \beta \mathcal{L}\_C(\hat{\mathbf{x}}, \mathbf{y}) \tag{1}$$

where weights α and β control the balance between the standard and constraint loss.
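A minimal sketch of the combined objective above, with cross-entropy standing in for the standard loss and all names illustrative rather than the cited frameworks' APIs:

```python
import numpy as np

def combined_loss(x_hat, y, f, constraint_loss, alpha=1.0, beta=1.0):
    """L* = alpha * L + beta * L_C.

    `f` maps an input to a vector of class scores, and
    `constraint_loss(x, y)` is a differentiable penalty for violating the
    logical constraint; cross-entropy over softmax scores stands in for L.
    """
    z = f(x_hat)
    p = np.exp(z) / np.exp(z).sum()       # softmax scores
    ce = -np.log(p[y])                    # standard loss L (cross-entropy)
    return alpha * ce + beta * constraint_loss(x_hat, y)
```

In practice both terms would be differentiated with respect to the network weights; the sketch only shows how α and β weigh the two objectives.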

This method deceptively looks like a generalisation of the previous approaches. However, even given suitable choices for **c1**–**c3**, classification robustness cannot be modelled via a constraint loss in the DL2 [4] framework, as argmax is not differentiable. Instead, [4] defines an alternative constraint, which we call *strong classification robustness*:

#### **Definition 4 (Strong classification robustness).**

$$SCR(\epsilon, \eta, \hat{\mathbf{x}}) \stackrel{\Delta}{=} \forall \mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon \Rightarrow f(\mathbf{x})\_{\mathbf{y}} \ge \eta$$

which looks only at the prediction of the true class and checks whether it is greater than some value η (chosen to be 0.52 in that work).

We note that sometimes the constraints (and therefore the derived loss functions) refer to the true label $\mathbf{y}$ rather than the current output of the network $f(\hat{\mathbf{x}})$, e.g. $\forall \mathbf{x} : |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon \Rightarrow ||f(\mathbf{x}) - \mathbf{y}|| \le \delta$. This leads to scenarios where a network that *is* robust around $\hat{\mathbf{x}}$ but gives the wrong prediction is penalised by $\mathcal{L}\_C$, which on paper is designed to maximise robustness. Essentially, $\mathcal{L}\_C$ is trying to maximise both accuracy and constraint adherence concurrently. Instead, we argue that to preserve the intended semantics of α and β it is important to compare against the current output of the network. Of course, this does not work for SCR, because deriving the most popular class from the output $f(\hat{\mathbf{x}})$ requires the arg max operator, the very function that SCR seeks to avoid using. This is another argument for avoiding (S)CR where possible.

### **3 Robustness in Evaluation, Attack and Verification**

Given a particular definition of robustness, a natural question is how to quantify how close a given network is to satisfying it. We argue that there are three different measures that one should be interested in: 1. Does the constraint hold? This is a binary measure, and the answer is either true or false. 2. If the constraint does not hold, how easy is it for an attacker to find a violation? 3. If the constraint does not hold, how often does the average user encounter a violation? Based on these measures, we define three concrete metrics: *constraint satisfaction*, *constraint security*, and *constraint accuracy*. 1

Let $\mathcal{X}$ be the training dataset, $\mathbb{B}(\hat{\mathbf{x}}, \epsilon) \stackrel{\Delta}{=} \{\mathbf{x} \in \mathbb{R}^n \mid |\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon\}$ be the ε-ball around $\hat{\mathbf{x}}$, and $P$ be the right-hand side of the implication in each of the definitions of robustness. Let $\mathbb{I}\_\varphi$ be the standard indicator function, which is 1 if constraint $\varphi(\mathbf{x})$ holds and 0 otherwise. The *constraint satisfaction* metric measures the proportion of the (finite) training dataset for which the constraint holds.

#### **Definition 5 (Constraint satisfaction).**

$$\text{CSat}(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum\_{\hat{\mathbf{x}} \in \mathcal{X}} \mathbb{I}\_{\forall \mathbf{x} \in \mathbb{B}(\hat{\mathbf{x}}, \epsilon) : P(\mathbf{x})}$$

In contrast, *constraint security* measures the proportion of inputs in the dataset such that an attack A is unable to find an adversarial example for constraint P. In our experiments we use the PGD attack for A, although in general any strong attack can be used.

#### **Definition 6 (Constraint security).**

$$\operatorname{CSec}(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum\_{\hat{\mathbf{x}} \in \mathcal{X}} \mathbb{I}\_P(A(\hat{\mathbf{x}})) $$

Finally, *constraint accuracy* estimates the probability of a random user coming across a counter-example to the constraint, usually referred to as *1 − success rate* in the robustness literature. Let $S(\hat{\mathbf{x}}, n)$ be a set of $n$ elements sampled uniformly at random from $\mathbb{B}(\hat{\mathbf{x}}, \epsilon)$. Then constraint accuracy is defined as:

#### **Definition 7 (Constraint accuracy).**

$$\text{CACC}(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum\_{\hat{\mathbf{x}} \in \mathcal{X}} \left( \frac{1}{n} \sum\_{\mathbf{x} \in S(\hat{\mathbf{x}}, n)} \mathbb{I}\_P(\mathbf{x}) \right),$$
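Definition 7 can be estimated directly by sampling; this sketch uses an L∞ ball for the sampling region and a hypothetical predicate `P`, both illustrative choices:

```python
import numpy as np

def constraint_accuracy(X, P, eps, n=100, seed=0):
    """Estimate CAcc: for each x_hat in the dataset X, take the fraction of
    n points sampled uniformly from the eps-ball (L_inf, for illustration)
    that satisfy the constraint predicate P, then average over X."""
    rng = np.random.default_rng(seed)
    accs = []
    for x_hat in X:
        x_hat = np.asarray(x_hat, dtype=float)
        samples = x_hat + rng.uniform(-eps, eps, size=(n,) + x_hat.shape)
        accs.append(np.mean([float(P(x)) for x in samples]))
    return float(np.mean(accs))
```

Constraint security would replace the uniform samples with the output of an attack such as PGD; constraint satisfaction cannot be estimated this way at all, since it quantifies over the whole ball.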

Note that there is no relationship between constraint accuracy and constraint security: an attacker may succeed in finding an adversarial example where random sampling fails and vice-versa. Also note the role of sampling in this discussion and compare it to the discussion of the choice **c3** in Sect. 2. Firstly, sampling procedures affect both training and evaluation of networks. But at the same time, their choice is orthogonal

<sup>1</sup> Our naming scheme differs from [4] who use the term *constraint accuracy* to refer to what we term *constraint security*. In our opinion, the term *constraint accuracy* is less appropriate here than the name *constraint security* given the use of an adversarial attack.

to choosing the verification constraint for which we optimise or evaluate. For example, we measure constraint security with respect to the PGD attack, and this determines the way we sample; but having made that choice, we must still decide which constraint (SCR, SR, LR, or other) we will be measuring as we sample. Constraint satisfaction differs from constraint security and accuracy in that it must evaluate constraints over infinite domains rather than merely sampling from them.

**Choosing an Evaluation Metric.** It is important to note that for all three evaluation metrics, one still has to make a choice for constraint P, namely SR, SCR or LR, as defined in Sect. 2. As constraint security always uses PGD to find input perturbations, the choice of SR, SCR and LR effectively amounts to us making a judgement of what an adversarial perturbation consists of: is it a class change as defined by SCR, or is it a violation of the more nuanced metrics defined by SR and LR? Therefore we will evaluate constraint security on the *SR/SCR/LR constraints* using a *PGD attack*.

For large search spaces in $n$ dimensions, the random sampling deployed in constraint accuracy fails to find the trickier adversarial examples and usually reports deceptively high performance: we found 100% and >98% constraint accuracy for SR and SCR, respectively. We will therefore not discuss these experiments in detail.

### **4 Relative Comparison of Definitions of Robustness**

We now compare the strength of the given definitions of robustness using the introduced metrics. For empirical evaluation, we train networks on *FASHION MNIST* (or just *FASHION*) [19] and a modified version of the *GTSRB* [16] dataset, consisting of 28 × 28 and 48 × 48 images, respectively, belonging to 10 classes. The networks consist of two fully connected layers: the first with 100 neurons and ReLU activation, and the last with 10 neurons, to which we apply a clamp function to [−100, 100], because the traditional softmax function is not compatible with constraint verification tools such as Marabou. Taking the four different robustness properties for which we optimise while training (Baseline, LR, SR, SCR) gives us 8 different networks to train, evaluate and attack. Generally, all trends we observed for the two datasets were the same, and we put the matching graphs in [3] whenever we report a result for only one of them. Marabou [8] was used for evaluating constraint satisfaction.

#### **4.1 Standard and Lipschitz Robustness**

Lipschitz robustness is a strictly stronger constraint than standard robustness, in the sense that when a network satisfies LR(ε, L) it also satisfies SR(ε, Lε). However, the converse does not hold, as standard robustness does not relate the distances between the inputs and the outputs. Consequently, there are SR(ε, δ)-robust models that are not LR(ε, L)-robust for any L: for any fixed L, one can always make the distance $|\mathbf{x} - \hat{\mathbf{x}}|$ arbitrarily small so as to violate the Lipschitz inequality.
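The first implication is a one-line calculation: if a network satisfies LR(ε, L), then for every $\mathbf{x}$ with $|\mathbf{x} - \hat{\mathbf{x}}| \le \epsilon$,

```latex
\|f(\mathbf{x}) - f(\hat{\mathbf{x}})\| \;\le\; L\,|\mathbf{x} - \hat{\mathbf{x}}| \;\le\; L\epsilon ,
```

so it satisfies SR(ε, δ) with δ = Lε.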


**Table 1.** Constraint satisfaction results for the Classification, Standard and Lipschitz constraints. These values are calculated over the test set and represented as %.

**Empirical Significance of the Conclusions for Constraint Security.** Figure 2 shows an empirical evaluation of this general result. If we train two neural networks, one with the SR, and the other with the LR constraint, then the latter always has higher constraint security against both SR and LR attacks than the former. It also confirms that generally, stronger constraints are harder to obtain: whether a network is trained with SR or LR constraints, it is less robust against an LR attack than against any other attack.

**Empirical Significance of the Conclusions for Constraint Satisfaction.** Table 1 shows that LR is very difficult to guarantee as a verification property: indeed, none of our networks satisfied this constraint for any image in the dataset. At the same time, networks trained with LR satisfy

**Fig. 2.** Experiments that show how the two networks trained with LR and SR constraints perform when evaluated against different definitions of robustness underlying the attack; ε measures the attack strength.

the weaker property SR for 100% and 97% of images, respectively, a huge improvement on the negligible percentage of robust images for the baseline network! Therefore, knowing the verification property or mode of attack, one can tailor the training accordingly; training with a stronger constraint gives better results.

#### **4.2 (Strong) Classification Robustness**

Strong classification robustness is designed to over-approximate classification robustness whilst providing a logical loss function with a meaningful gradient. We work under the assumption that the last layer of the classification network is a softmax layer, and therefore the output forms a probability distribution. When η > 0.5, any network that satisfies SCR(ε, η) also satisfies CR(ε). For η ≤ 0.5 this relationship breaks down, as the true class may be assigned a probability greater than η but still not be the class with the highest probability. We therefore recommend using only values of η > 0.5 with strong classification robustness (for example, η = 0.52 in [4]).
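For intuition, the implication and its failure for η ≤ 0.5 can be checked on concrete probability vectors (the helper names `scr_holds` and `cr_holds` are ours; this is a sketch, not the paper's implementation):

```python
import math

def softmax(logits):
    """Turn last-layer logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scr_holds(probs, true_class, eta):
    """SCR: the true class is assigned probability at least eta."""
    return probs[true_class] >= eta

def cr_holds(probs, true_class):
    """CR: the true class gets the (unique) highest probability."""
    return all(probs[true_class] > p
               for i, p in enumerate(probs) if i != true_class)

probs = softmax([2.0, 1.0, 0.0])   # roughly [0.665, 0.245, 0.090]

# With eta > 0.5, SCR forces CR: probability above 0.5 is necessarily the argmax.
assert scr_holds(probs, 0, 0.6) and cr_holds(probs, 0)

# With eta <= 0.5 the implication breaks: the true class can clear eta
# while another class is still more probable.
low_conf = [0.45, 0.40, 0.15]
assert scr_holds(low_conf, 1, 0.4) and not cr_holds(low_conf, 1)
```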

**Empirical Significance of the Conclusions for Constraint Security.** Because the CR constraint cannot be used within a loss function, we use data augmentation when training to emulate its effect. First, we confirm our assumptions about the relative inefficiency of using data augmentation compared to adversarial training or training with constraints, see Fig. 3. Surprisingly, neural networks trained with data augmentation give worse results than even the baseline network.

As previously discussed, random uniform sampling struggles to find adversarial inputs in large search spaces. It is logical to expect that using random uniform sampling when training will be less successful than training with sampling that uses FGSM or PGD as heuristics. Indeed, Fig. 3 shows this effect for data augmentation.

One may ask whether the trends just described would be replicated for more complex architectures of neural networks. In particular, data augmentation is known to require larger networks. By replicating the results with a large, 18-layer convolutional network from [4] (second graph of Fig. 3), we confirm that larger networks handle data augmentation better, and that data augmentation affords improved robustness compared to the baseline. Nevertheless, data augmentation still lags behind all other modes of constraint-driven training, and thus this major trend remains stable across network architectures. The same figure also illustrates our point about the relative strength of SCR compared to CR: a network trained with data augmentation (equivalent to CR) is more prone to SCR attacks than a network trained with the SCR constraint.

**Fig. 3.** Experiments that show how adversarial training, training with data augmentation, and training with constraint loss affect standard and classification robustness of networks; ε measures the attack strength.

**Empirical Significance of the Conclusions for Constraint Satisfaction.** Although Table 1 confirms that training with a stronger property (SCR) does improve the constraint satisfaction of a weaker property (CR), the effect is an order of magnitude smaller than what we observed for LR and SR. Indeed, the table suggests that training with the LR constraint gives better results for CR constraint satisfaction. This does not contradict our theoretical analysis, but neither does it follow from it.

#### **4.3 Standard vs Classification Robustness**

Given that LR is stronger than SR and SCR is stronger than CR, the obvious question is whether there is a relationship between these two groups. In short, the answer is no. In particular, although the two sets of definitions agree on whether a network is robust around images classified with high confidence, they disagree over whether a network is robust around images classified with low confidence. We illustrate this with an example comparing SR against CR; a similar analysis holds for any pairing from the two groups.

The key insight is that standard robustness bounds the drop in confidence that a neural network can exhibit after a perturbation, whereas classification robustness does not. Figure 4 shows two hypothetical images from the MNIST dataset. Our network predicts that Fig. 4a has an 85% chance of being a 7. Now consider adding a small perturbation to the image, and consider two different scenarios. In the first scenario the output of the network for class 7 decreases from 85% to 83%, and therefore the classification stays the same. In the second scenario the output of the network for class 7 decreases from 85% to 45%, and the classification changes from 7 to 9. Under both definitions, a small change in the output leads to no change in the classification, and a large change in the output leads to a change in classification, so standard robustness and classification robustness agree with each other.

**Fig. 4.** Images from the MNIST set

However, now consider Fig. 4b, an image with relatively high uncertainty. In this case the network is (correctly) less sure about the image, only narrowly deciding that it is a 7. Again consider adding a small perturbation. In the first scenario the prediction of the network changes dramatically, the probability of it being a 7 increasing from 51% to 91%, but the classification is left unchanged as 7. In the second scenario the output of the network changes only very slightly, decreasing from 51% to 49%, but this flips the classification from 7 to 9. Now the definitions of SR and CR disagree. In the first case, adding a small amount of noise has erroneously and massively increased the network's confidence, and the SR definition correctly identifies this as a problem. In contrast, CR has no problem with this massive increase in confidence, as the chosen output class remains unchanged. Thus, SR and CR agree on low-uncertainty examples, but CR breaks down and gives what we argue are both false positives and false negatives on examples with high uncertainty.
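The two scenarios can be replayed mechanically (a sketch with hypothetical helpers `sr_flags` and `cr_flags`; for simplicity we measure the output change on the true-class probability only):

```python
def sr_flags(conf_before, conf_after, delta):
    """SR objects when the output moves by more than delta
    (here measured on the true-class probability alone)."""
    return abs(conf_after - conf_before) > delta

def cr_flags(class_before, class_after):
    """CR objects only when the predicted class changes."""
    return class_before != class_after

delta = 0.2

# Scenario 1: confidence jumps 51% -> 91%, class stays 7.
# SR correctly flags the spurious confidence jump; CR sees no problem.
assert sr_flags(0.51, 0.91, delta) and not cr_flags(7, 7)

# Scenario 2: confidence drifts 51% -> 49%, class flips 7 -> 9.
# CR flags the tiny drift as a violation; SR does not.
assert not sr_flags(0.51, 0.49, delta) and cr_flags(7, 9)
```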

**Empirical Significance of the Conclusions for Constraint Security.** Our empirical study confirms these general conclusions. Figure 2 shows that depending on the properties of the dataset, SR may not guarantee SCR. The results in Fig. 5 tell us that using the SCR constraint for training does not help to increase defences against SR attacks. A similar picture, but in reverse, can be seen when we optimize for SR but attack with SCR. Table 1 confirms these trends for constraint satisfaction.

# **5 Other Properties of Robustness Definitions**

**Table 2.** A comparison of the different types of robustness studied in this paper. Top half: general properties. Bottom half: relation to existing machine-learning literature


We finish with a summary of further interesting properties of the four robustness definitions. Table 2 shows a summary of all comparison measures considered in the paper.

**Dataset assumptions** concern the distribution of the training data with respect to the data manifold of the true distribution of inputs, and influence the evaluation of robustness. For SR and LR it is, at minimum, desirable for the network to be robust over the entire data manifold. In most domains the shape of the manifold is unknown, and therefore it is necessary to approximate it by taking the union of the balls around the inputs in the training dataset. We are not particularly concerned with whether the network is robust in regions of the input space that lie off the data manifold, but there is no problem if the network is robust in these regions. Therefore these definitions make no assumptions about the distribution of the training dataset.

**Fig. 5.** Experiments that show how different choices of a constraint loss affect standard robustness of neural networks.

This is in contrast to CR and SCR. Rather than requiring that there is only a small change in the output, they require that there is no change to the classification. This is only a desirable constraint when the region being considered does not contain a decision boundary. Consequently when one is training for some form of classification robustness, one is implicitly making the assumption that the training data points lie away from any decision boundaries within the manifold. In practice, most datasets for classification problems assign a single label instead of an entire probability distribution to each input point, and so this assumption is usually valid. However, for datasets that contain input points that may lie close to the decision boundaries, CR and SCR may result in a logically inconsistent specification.

**Interpretability.** One of the key selling points of training with logical constraints is that, by ensuring that the network obeys understandable constraints, it improves the explainability of the neural network. Each of the robustness constraints encodes that "small changes to the input only result in small changes to the output", but the interpretability of each definition is also important.

All of the definitions share the relatively interpretable parameter ε, which measures how large a perturbation of the input is acceptable. Despite the other drawbacks discussed so far, CR is inherently the most interpretable, as it has no second parameter. In contrast, SR and SCR require extra parameters, δ and η respectively, which measure the allowable deviation in the output. Their addition makes these definitions less interpretable.

Finally, we argue that, although LR is the most desirable constraint, it is also the least interpretable. Its second parameter L measures the allowable change in the output as a proportion of the change in the input. It therefore requires one not only to have an interpretation of distance for both the input and output spaces, but to be able to relate them. In most domains, this relationship simply does not exist. Consider the MNIST dataset: the commonly used pixel-wise distance on the input space, although crude, is interpretable, and so is the distance between the output distributions; the relationship between them, however, is not. For example, what does allowing the distance between the output probability distributions to be no more than twice the distance between the images actually mean? This highlights a common trade-off between the complexity of a constraint and its interpretability.

### **6 Conclusions**

These case studies have demonstrated the importance of emancipating the study of desirable properties of neural networks from any concrete training method, and of studying these properties in an abstract mathematical way. For example, we have discovered that some robustness properties can be ordered by logical strength, while others are incomparable. Where ordering is possible, training for a stronger property helps in verifying a weaker property. Some of the stronger properties, such as Lipschitz robustness, are not yet feasible for modern DNN solvers, such as Marabou [8]. Moreover, we show that the logical strength of a property may not guarantee other desirable properties, such as interpretability. Some of these findings lead to very concrete recommendations, e.g.: it is best to avoid CR and SCR as they may lead to inconsistencies; when using LR and SR, one should use the stronger property (LR) for training in order to be successful in verifying the weaker one (SR). In other cases, the distinctions that we make do not give direct prescriptions, but merely discuss the design choices and trade-offs.

This paper also shows that constraint security, a measure intermediate between constraint accuracy and constraint satisfaction, is a useful tool in the context of tuning the continuous verification loop. It is more efficient to measure and can show more nuanced trends than constraint satisfaction. It can be used to tune training parameters and build hypotheses which we ultimately confirm with constraint satisfaction.

We hope that this study will contribute towards establishing a solid methodology for continuous verification, by setting up some common principles to unite verification and machine learning approaches to DNN robustness.

**Acknowledgement.** The authors acknowledge the support of EPSRC grant AISEC (EP/T026952/1) and NCSC grant "Neural Network Verification: in search of the missing spec".

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Software Verification and Model Checking**

# **The Lattice-Theoretic Essence of Property Directed Reachability Analysis**

Mayuko Kori1,2(B) , Natsuki Urabe<sup>2</sup> , Shin-ya Katsumata<sup>2</sup> , Kohei Suenaga<sup>3</sup> , and Ichiro Hasuo1,2

<sup>1</sup> The Graduate University for Advanced Studies (SOKENDAI), Hayama, Japan

<sup>2</sup> National Institute of Informatics, Tokyo, Japan {mkori,urabenatsuki,s-katsumata,hasuo}@nii.ac.jp <sup>3</sup> Kyoto University, Kyoto, Japan ksuenaga@fos.kuis.kyoto-u.ac.jp

**Abstract.** We present *LT-PDR*, a lattice-theoretic generalization of Bradley's property directed reachability analysis (PDR) algorithm. LT-PDR identifies the essence of PDR to be an ingenious combination of verification and refutation attempts based on the Knaster–Tarski and Kleene theorems. We introduce four concrete instances of LT-PDR, derive their implementation from a generic Haskell implementation of LT-PDR, and experimentally evaluate them. We also present a categorical structural theory that derives these instances.

**Keywords:** Property directed reachability analysis · Model checking · Lattice theory · Fixed point theory · Category theory

# **1 Introduction**

*Property directed reachability (PDR)* (also called *IC3* ) introduced in [9,13] is a model checking algorithm for proving/disproving safety problems. It has been successfully applied to software and hardware model checking, and later it has been extended in several directions, including *fbPDR* [25,26] that uses both forward and backward predicate transformers and *PrIC3* [6] for the quantitative safety problem for probabilistic systems. See [14] for a concise overview.

The original PDR assumes that systems are given by binary predicates representing transition relations. The PDR algorithm maintains data structures called *frames* and *proof obligations*—these are collections of predicates over states—and updates them. While this logic-based description immediately yields automated tools using SAT/SMT solvers, it limits target systems to qualitative and nondeterministic ones. This limitation was first overcome by PrIC3 [6] whose target is probabilistic systems. This suggests room for further generalization of PDR.

The authors are supported by ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603). MK is a JSPS DC fellow and supported by JSPS KAKENHI Grant (No. 22J21742). KS is supported by JST CREST Grant (No. JPMJCR2012) and JSPS KAKENHI Grant (No. 19H04084).

In this paper, we propose the first lattice theory-based generalization of the PDR algorithm; we call it *LT-PDR*. This makes the PDR algorithm applicable to a wider class of safety problems, both qualitative and quantitative. We also derive a new concrete extension of PDR, namely one for Markov reward models.

We implemented the general algorithm LT-PDR in Haskell, in a way that maintains the theoretical abstraction and clarity. Deriving concrete instances for various types of systems (Kripke structures, probabilistic systems, etc.) is easy. We conducted an experimental evaluation, which shows that these easily obtained instances have at least reasonable performance.

**Preview of the Theoretical Contribution.** We generalize the PDR algorithm so that it operates over an arbitrary complete lattice L. This generalization recasts the PDR algorithm to solve a general problem μF ≤? α of over-approximating the least fixed point of an ω-continuous function F : L → L by a safety property α. This lattice-theoretic generalization signifies the relationship between the PDR algorithm and the theory of fixed points. It also allows us to incorporate quantitative predicates suited for probabilistic verification.

More specifically, we reconstruct the original PDR algorithm as a combination of two constituent parts. They are called *positive LT-PDR* and *negative LT-PDR*. Positive LT-PDR comes from a witness-based proof method by the *Knaster–Tarski fixed point theorem*, and aims to *verify* μF ≤? α. In contrast, negative LT-PDR comes from the *Kleene fixed point theorem* and aims to *refute* μF ≤? α. The two algorithms build up witnesses in an iterative and nondeterministic manner, where nondeterminism accommodates guesses and heuristics. We identify the essence of PDR to be an ingenious combination of these two algorithms, in which intermediate results on one side (positive or negative) give informed guesses on the other side. This is how we formulate LT-PDR in Sect. 3.3.

We discuss several instances of our general theory of PDR. We discuss three concrete settings: Kripke structures (where we obtain two instances of LT-PDR), Markov decision processes (MDPs), and Markov reward models. The two in the first setting essentially subsume many existing PDR algorithms, such as the original PDR [9,13] and Reverse PDR [25,26], and the one for MDPs resembles PrIC3 [6]. The last one (Markov reward models) is a new algorithm that fully exploits the generality of our framework.

In fact, there is another dimension of theoretical generalization: the derivation of the above concrete instances follows a *structural theory of state-based dynamics and predicate transformers*. We formulate the structural theory in the language of *category theory* [3,23]—using especially *coalgebras* [17] and *fibrations* [18]—following works such as [8,15,21,28]. The structural theory tells us which safety problems arise under what conditions; it can therefore suggest that certain safety problems are unlikely to be formulatable, too. The structural theory is important because it builds a mathematical order in the PDR literature, in which theoretical developments tend to be closely tied to implementation and thus theoretical essences are often not very explicit. For example, the theory is useful in classifying a plethora of PDR-like algorithms for Kripke structures (the original, Reverse PDR, fbPDR, etc.). See Sect. 5.1.

We present the above structural theory in Sect. 4 and briefly discuss its use in the derivation of concrete instances in Sect. 5. We note, however, that this categorical theory is not needed for reading and using the other parts of the paper.

There are other works on generalization of PDR [16,24], but our identification of the interplay of Knaster–Tarski and Kleene is new. They do not accommodate probabilistic verification, either. See [22, Appendix A] for further discussions.

**Preliminaries.** Let (L, ≤) be a poset. (L, ≤)op denotes the opposite poset (L, ≥). Note that if (L, ≤) is a complete lattice then so is (L, ≤)op. An ω-chain (resp. ωop-chain) in L is an N-indexed family of increasing (resp. decreasing) elements in L. A monotone function F : L → L is ω*-continuous* (resp. ωop*-continuous*) if F preserves existing suprema of ω-chains (resp. infima of ωop-chains).

### **2 Fixed-points in Complete Lattices**

Let (L, ≤) be a complete lattice and F : L → L be a monotone function. When we analyze fixed points of F, pre/postfixed points play important roles.

**Definition 2.1.** *A* prefixed point *of* F *is an element* x ∈ L *satisfying* Fx ≤ x*. A* postfixed point *of* F *is an element* x ∈ L *satisfying* x ≤ Fx*. We write* **Pre**(F) *and* **Post**(F) *for the sets of prefixed points and postfixed points of* F*, respectively.*

The following results are central in fixed point theory. They allow us to under/over-approximate the least/greatest fixed points.

**Theorem 2.2.** *A monotone endofunction* F *on a complete lattice* (L, ≤) *has the least fixed point* μF *and the greatest fixed point* νF*. Moreover,*

1. *(Knaster–Tarski)* μF *is the least prefixed point:* μF = ⋀**Pre**(F)*; dually,* νF *is the greatest postfixed point:* νF = ⋁**Post**(F)*.*
2. *(Kleene) If* F *is* ω*-continuous, then* μF = ⋁_{n∈ω} F^n⊥*; dually, if* F *is* ωop*-continuous, then* νF = ⋀_{n∈ω} F^n⊤*.*
Theorem 2.2.2 is known to hold for arbitrary ω-cpos (complete lattices are a special case). A generalization of Theorem 2.2.2 is the Cousot–Cousot characterization [11], where F is assumed to be monotone (but not necessarily ω-continuous) and we have μF = F^κ⊥ for a sufficiently large, possibly transfinite, ordinal κ. In this paper, for the algorithmic study of PDR, we assume the ω-continuity of F. Note that an ω-continuous F on a complete lattice is necessarily monotone.

We call the ω-chain ⊥ ≤ F⊥ ≤ ··· the *initial chain of* F and the ωop-chain ⊤ ≥ F⊤ ≥ ··· the *final chain of* F. These appear in Theorem 2.2.2.

Theorems 2.2.1 and 2.2.2 yield the following witness notions for *proving* and *disproving* μF ≤ α, respectively.

**Corollary 2.3.** *Let* (L, ≤) *be a complete lattice and* F : L → L *be* ω*-continuous.*

1. μF ≤ α *if and only if there exists* x ∈ L *such that* Fx ≤ x ≤ α*.*
2. μF ≰ α *if and only if there exist* n ∈ N *and* x ∈ L *such that* x ≤ F^n⊥ *and* x ≰ α*.*
By Corollary 2.3.1, proving μF ≤ α is reduced to searching for x ∈ L such that Fx ≤ x ≤ α. We call such x a *KT (positive) witness*. In contrast, by Corollary 2.3.2, disproving μF ≤ α is reduced to searching for n ∈ N and x ∈ L such that x ≤ F^n⊥ and x ≰ α. We call such x a *Kleene (negative) witness*.
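On a finite powerset lattice, both witness checks are direct to code (a minimal sketch, assuming frozensets with the subset order playing the role of ≤, and a small F whose least fixed point is {0, 1, 2}):

```python
def is_kt_witness(F, x, alpha):
    """Corollary 2.3.1: F x <= x <= alpha proves muF <= alpha."""
    return F(x) <= x <= alpha          # <= is the subset order on frozensets

def is_kleene_witness(F, n, x, alpha):
    """Corollary 2.3.2: x <= F^n(bottom) and x not<= alpha disproves muF <= alpha."""
    fn = frozenset()                   # bottom of the powerset lattice
    for _ in range(n):
        fn = F(fn)
    return x <= fn and not (x <= alpha)

# F(X) = {0} ∪ {s+1 | s in X, s+1 <= 2}, so muF = {0, 1, 2}.
F = lambda X: frozenset({0}) | frozenset(s + 1 for s in X if s + 1 <= 2)

# {0,1,2} is a KT witness for alpha = {0,1,2,3}: F x <= x <= alpha.
assert is_kt_witness(F, frozenset({0, 1, 2}), frozenset({0, 1, 2, 3}))

# {2} <= F^3(bottom) = {0,1,2} and {2} is not below alpha = {0,1}:
# a Kleene witness disproving muF <= {0,1}.
assert is_kleene_witness(F, 3, frozenset({2}), frozenset({0, 1}))
```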

**Notation 2.4.** We shall use lowercase (Roman and Greek) letters for elements of L (such as α, x ∈ L), and uppercase letters for (finite or infinite) sequences in L (such as X ∈ L* or X ∈ Lω). The i-th (or (i − j)-th, when subscripts start from j) element of a sequence X is designated by a subscript: X_i ∈ L.

# **3 Lattice-Theoretic Reconstruction of PDR**

Towards the LT-PDR algorithm, we first introduce two simpler algorithms, called positive LT-PDR (Sect. 3.1) and negative LT-PDR (Sect. 3.2). The target problem of the LT-PDR algorithm is the following:

**Definition 3.1 (the LFP-OA problem** μF ≤? α**).** *Let* L *be a complete lattice,* F : L → L *be* ω*-continuous, and* α ∈ L*. The* lfp over-approximation (LFP-OA) problem *asks if* μF ≤ α *holds; the problem is denoted by* μF ≤? α*.*

*Example 3.2.* Consider a transition system, where S is the set of states, ι ⊆ S is the set of initial states, δ : S → PS is the transition relation, and α ⊆ S is the set of safe states. Then, letting L := PS and F := ι ∪ ⋃_{s∈(−)} δ(s), the lfp over-approximation problem μF ≤? α asks whether all reachable states are safe. It is equal to the problem studied by the conventional IC3/PDR [9,13].
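For this finite instance, the LFP-OA problem can be decided directly by running the initial chain ⊥ ≤ F⊥ ≤ ··· to its limit, i.e. Kleene iteration (a minimal sketch; the name `lfp_oa` and the example system are ours):

```python
def lfp_oa(init, delta, alpha):
    """Decide muF <= alpha for F(X) = init ∪ ⋃_{s in X} delta(s) on a finite
    powerset lattice, by iterating the initial chain to its limit.
    Answers: are all reachable states safe?"""
    X = frozenset()                    # bottom
    while True:
        nxt = frozenset(init) | frozenset(t for s in X for t in delta[s])
        if nxt == X:                   # fixed point reached: X = muF
            return X <= frozenset(alpha)
        X = nxt

# A four-state system: 0 -> 1 -> 2 -> 2; state 3 is unreachable.
delta = {0: {1}, 1: {2}, 2: {2}, 3: {3}}

assert lfp_oa({0}, delta, {0, 1, 2}) is True    # reachable states are safe
assert lfp_oa({0}, delta, {0, 1}) is False      # state 2 is reachable but unsafe
```

PDR answers the same question without ever computing μF exactly, which is what makes it scale; the brute-force iteration above is only the semantic reference point.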

Positive LT-PDR iteratively builds a KT witness in a bottom-up manner that positively answers the LFP-OA problem, while negative LT-PDR iteratively builds a Kleene witness for the same LFP-OA problem. We shall present these two algorithms as clear reflections of two proof principles (Corollary 2.3), each of which comes from the fundamental Knaster–Tarski and Kleene theorems.

The two algorithms build up witnesses in an iterative and nondeterministic manner. The nondeterminism is there for accommodating guesses and heuristics. We identify the essence of PDR to be an ingenious combination of these two algorithms, in which intermediate results on one side (positive or negative) give informed guesses on the other side. This way, each of the positive and negative algorithms provides heuristics in resolving the nondeterminism in the execution of the other. This is how we formulate the LT-PDR algorithm in Sect. 3.3.

The dual of the LFP-OA problem is called the *gfp-under-approximation problem* (GFP-UA): the GFP-UA problem for a complete lattice L, an ωop-continuous function F : L → L and α ∈ L asks whether the inequality α ≤ νF holds; it is denoted by α ≤? νF. It is evident that the GFP-UA problem for (L, F, α) is equivalent to the LFP-OA problem for (Lop, F, α). This suggests a dual algorithm, called LT-OpPDR, for the GFP-UA problem. See Remark 3.24 later.

#### **3.1 Positive LT-PDR: Sequential Positive Witnesses**

We introduce the notion of KTω witness, a KT witness (Corollary 2.3) constructed in a sequential manner. Positive LT-PDR searches for a KTω witness by growing its finitary approximations (called KT sequences).

Let L be a complete lattice. We regard each element x ∈ L as an abstract presentation of a predicate on states. The inequality x ≤ y means that the predicate x is stronger than the predicate y. We introduce the complete lattice [n, L] of increasing chains of length n ∈ N, whose elements are (X_0 ≤ ··· ≤ X_{n−1}) in L, equipped with the element-wise order. We similarly introduce the complete lattice [ω, L] of ω-chains in L. We lift F : L → L to F# : [ω, L] → [ω, L] and F#_n : [n, L] → [n, L] (for n ≥ 2) as follows. Note that the entries are shifted.

$$\begin{aligned} F^\#(X\_0 \le X\_1 \le \cdots) &:= (\bot \le FX\_0 \le FX\_1 \le \cdots) \\ F\_n^\#(X\_0 \le \cdots \le X\_{n-1}) &:= (\bot \le FX\_0 \le \cdots \le FX\_{n-2}) \end{aligned} \tag{1}$$

**Definition 3.3 (KT**ω **witness).** *Let* L, F, α *be as in Definition 3.1. Define* Δα := (α ≤ α ≤ ···)*. A* KTω witness *is* X ∈ [ω, L] *such that* F#X ≤ X ≤ Δα*.*

**Theorem 3.4.** *Let* L, F, α *be as in Definition 3.1. There exists a KT witness (Corollary 2.3) if and only if there exists a KTω witness.*

Concretely, a KT witness x yields a KTω witness x ≤ x ≤ ···; a KTω witness X yields a KT witness ⋁_{n∈ω} X_n. A full proof (via Galois connections) is in [22].

When μF ≤ α holds, the initial chain ⊥ ≤ F⊥ ≤ ··· is a KTω witness. There are other KTω witnesses, whose growth is accelerated by heuristic guesses; an extreme example is x ≤ x ≤ ··· for a KT witness x. KTω witnesses embrace the spectrum of such different sequential witnesses for μF ≤ α, those which mix routine constructions (i.e. applications of F) and heuristic guesses.

**Definition 3.5 (KT sequence).** *Let* L, F, α *be as in Definition 3.1. A* KT sequence *for* μF ≤? α *is a finite chain* (X_0 ≤ ··· ≤ X_{n−1})*, for* n ≥ 2*, satisfying*

1. X_i ≤ α *for each* 0 ≤ i ≤ n − 2*, and*
2. F#_n X ≤ X*, that is,* FX_i ≤ X_{i+1} *for each* 0 ≤ i ≤ n − 2*.*

*A KT sequence* (X_0 ≤ ··· ≤ X_{n−1}) *is* conclusive *if* X_{j+1} ≤ X_j *for some* j*.*

KT sequences are finite by definition. Note that the upper bound α is imposed on all X_i but X_{n−1}. This freedom in the choice of X_{n−1} offers room for heuristics, one that is exploited in the combination with negative LT-PDR (Sect. 3.3).

We take KT sequences as finite approximations of KTω witnesses. This view shall be justified by the partial order (⊑) between KT sequences defined below.

**Definition 3.6 (order between KT sequences).** *We define a partial order* ⊑ *on KT sequences as follows:* (X_0, ..., X_{n−1}) ⊑ (X′_0, ..., X′_{m−1}) *if* n ≤ m *and* X_j ≥ X′_j *for each* 0 ≤ j ≤ n − 1*.*

The order X_j ≥ X′_j represents that X′_j is a stronger predicate (on states) than X_j. Therefore X ⊑ X′ expresses that X′ is a longer and stronger/more determined chain than X. We obtain KTω witnesses as their ω-suprema.

**Theorem 3.7.** *Let* L, F, α *be as in Definition 3.1. The set of KT sequences, augmented with the set of KTω witnesses* {X ∈ [ω, L] | F#X ≤ X ≤ Δα} *and ordered by the natural extension of* ⊑*, is an* ω*-cpo. In this* ω*-cpo, each KTω witness* X *is the supremum of an* ω*-chain of KT sequences, namely* X = ⨆_{n≥2} X|_n *where* X|_n ∈ [n, L] *is the length-*n *prefix of* X*.*

**Proposition 3.8.** *Let* L, F, α *be as in Definition 3.1. There exists a KTω witness if and only if there exists a conclusive KT sequence.*

*Proof.* (⇒): If there exists a KTω witness, μF ≤ α holds by Corollary 2.3 and Theorem 3.4. Therefore, the "informed guess" (μF ≤ μF) gives a conclusive KT sequence. (⇐): When X is a conclusive KT sequence with X_j = X_{j+1}, the chain X_0 ≤ ··· ≤ X_j = X_{j+1} = ··· is a KTω witness. □

The proposition above yields the following partial algorithm that aims to answer positively to the LFP-OA problem. It searches for a conclusive KT sequence.

**Definition 3.9 (positive LT-PDR).** *Let* L, F, α *be as in Definition 3.1.* Positive LT-PDR *is the algorithm shown in Algorithm 1, which says 'True' to the LFP-OA problem* μF ≤? α *if successful.*

The rules are designed by the following principles.

**Valid** is applied when the current X is conclusive.

**Unfold** extends X with ⊤. In fact, we can use any element x satisfying X_{n−1} ≤ x and FX_{n−1} ≤ x in place of ⊤ (by a subsequent application of **Induction** with x). The condition X_{n−1} ≤ α is checked to ensure that the extended X satisfies the condition in Definition 3.5.1.

**Induction** strengthens X, replacing the j-th element with its meet with x. The first condition <sup>X</sup><sup>k</sup> <sup>≤</sup> <sup>x</sup> ensures that this rule indeed strengthens <sup>X</sup>, and the second condition <sup>F</sup>(X<sup>k</sup>−<sup>1</sup> <sup>∧</sup>x) <sup>≤</sup> <sup>x</sup> ensures that the strengthened <sup>X</sup> satisfies the condition in Definition 3.5.2, that is, F # <sup>n</sup> <sup>X</sup> <sup>≤</sup> <sup>X</sup> (see the proof in [22]).

**Theorem 3.10.** *Let* L, F, α *be as in Definition 3.1. Then positive LT-PDR is sound, i.e. if it outputs 'True' then* μF ≤ α *holds.*

*Moreover, assume* μF ≤ α *is true. Then positive LT-PDR is weakly terminating (meaning that suitable choices of* x *when applying Induction make the algorithm terminate).*

The last "optimistic termination" is realized by using the informed guess μF as x in **Induction**. To guarantee the termination of LT-PDR, it suffices to assume that the complete lattice L is well-founded (no infinite strictly decreasing chain exists in L) and that there is no strictly increasing ω-chain under α in L, although we cannot hope for this assumption in every instance (Sect. 5.2, 5.3).

**Lemma 3.11.** *Let* L, F, α *be as in Definition 3.1. If* μF ≤ α*, then for any KT sequence* X*, at least one of the three rules in Algorithm 1 is enabled.*

*Moreover, for any KT sequence* X*, let* X′ *be obtained by applying either Unfold or Induction. Then* X ⪯ X′ *and* X ≠ X′*.*

**Input:** An instance (μF ≤<sup>?</sup> α) of the LFP-OA problem in L

**Output:** 'True' with a conclusive KT sequence

**Data:** a KT sequence X = (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub>)

**Initially:** X := (⊥ ≤ F⊥)

**repeat (do one of the following)**

**Valid** If X<sub>j+1</sub> ≤ X<sub>j</sub> for some j < n − 1, return 'True' with the conclusive KT sequence X.

**Unfold** If X<sub>n−1</sub> ≤ α, let X := (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub> ≤ ⊤).

**Induction** If some k ≥ 2 and x ∈ L satisfy X<sub>k</sub> ≰ x and F(X<sub>k−1</sub> ∧ x) ≤ x, let X := X[X<sub>j</sub> := X<sub>j</sub> ∧ x]<sub>2≤j≤k</sub>.

**until** *any return value is obtained*;

**Algorithm 1:** positive LT-PDR
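To make the rules concrete, the following is a minimal Python sketch of positive LT-PDR on the finite powerset lattice L = 2<sup>S</sup> with F(A) = init ∪ post(A) (the setting of Sect. 5.1). The names `states`, `init`, `post`, `safe` are ours, and for simplicity each **Unfold** of ⊤ is eagerly combined with the **Induction** choice x := X<sub>n−1</sub> ∨ F X<sub>n−1</sub>, which the remark on **Unfold** above permits; this is an illustration, not the paper's pseudocode.

```python
def positive_lt_pdr(states, init, post, safe, max_frames=1000):
    """Search for a conclusive KT sequence certifying mu F <= safe,
    where F(A) = init | post(A) on the powerset lattice of `states`.

    Returns the conclusive KT sequence, or None when no positive
    conclusion can be drawn (in particular when mu F <= safe fails)."""
    F = lambda A: frozenset(init) | {t for s in A for t in post(s)}
    X = [frozenset(), F(frozenset())]           # Initially: X := (bot <= F bot)
    for _ in range(max_frames):
        for j in range(len(X) - 1):             # Valid: X_{j+1} <= X_j
            if X[j + 1] <= X[j]:
                return X
        if not X[-1] <= frozenset(safe):        # Unfold requires X_{n-1} <= safe
            return None
        X.append(X[-1] | F(X[-1]))              # Unfold + eager Induction choice
    return None
```

On the 4-state system with transitions 0→1→2 (2 looping) and initial state 0, the call with `safe = {0, 1, 2}` returns a conclusive KT sequence, while `safe = {0, 1}` yields None.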

**Input:** An instance (μF ≤<sup>?</sup> α) of the LFP-OA problem in L

**Output:** 'False' with a conclusive Kleene sequence

**Data:** a Kleene sequence C = (C<sub>0</sub>, ..., C<sub>n−1</sub>)

**Initially:** C := ()

**repeat (do one of the following)**

**Candidate** Choose x ∈ L such that x ≰ α, and let C := (x).

**Model** If C<sub>0</sub> = ⊥, return 'False' with the conclusive Kleene sequence C.

**Decide** If there exists x such that C<sub>0</sub> ≤ F x, then let C := (x, C<sub>0</sub>, ..., C<sub>n−1</sub>).

**until** *any return value is obtained*;

**Algorithm 2:** negative LT-PDR

**Input:** An instance (μF ≤<sup>?</sup> α) of the LFP-OA problem in L

**Output :** 'True' with a conclusive KT sequence, or 'False' with a conclusive Kleene sequence

**Data:** (X; C) where X is a KT sequence (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub>), and C is a Kleene sequence (C<sub>i</sub>, C<sub>i+1</sub>, ..., C<sub>n−1</sub>) (C is empty if n = i).

**Initially:** (X; C) := ((⊥ ≤ F⊥); ())

**repeat (do one of the following)**

**Valid** If X<sub>j+1</sub> ≤ X<sub>j</sub> for some j < n − 1, return 'True' with the conclusive KT sequence X.

**Unfold** If X<sub>n−1</sub> ≤ α, let (X; C) := ((X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub> ≤ ⊤); ()).

**Induction** If some k ≥ 2 and x ∈ L satisfy X<sub>k</sub> ≰ x and F(X<sub>k−1</sub> ∧ x) ≤ x, let (X; C) := (X[X<sub>j</sub> := X<sub>j</sub> ∧ x]<sub>2≤j≤k</sub>; C).

**Candidate** If C = () and X<sub>n−1</sub> ≰ α, choose x ∈ L such that x ≤ X<sub>n−1</sub> and x ≰ α, and let (X; C) := (X; (x)).

**Model** If C<sub>1</sub> is defined, return 'False' with the conclusive Kleene sequence (⊥, C<sub>1</sub>, ..., C<sub>n−1</sub>).

**Decide** If C<sub>i</sub> ≤ F X<sub>i−1</sub>, choose x ∈ L satisfying x ≤ X<sub>i−1</sub> and C<sub>i</sub> ≤ F x, and let (X; C) := (X; (x, C<sub>i</sub>, ..., C<sub>n−1</sub>)).

**Conflict** If C<sub>i</sub> ≰ F X<sub>i−1</sub>, choose x ∈ L satisfying C<sub>i</sub> ≰ x and F(X<sub>i−1</sub> ∧ x) ≤ x, and let (X; C) := (X[X<sub>j</sub> := X<sub>j</sub> ∧ x]<sub>2≤j≤i</sub>; (C<sub>i+1</sub>, ..., C<sub>n−1</sub>)).

**until** *any return value is obtained*;

**Algorithm 3:** LT-PDR

**Theorem 3.12.** *Let* L, F, α *be as in Definition 3.1. Assume that* ≤ *in* L *is well-founded and* μF ≤ α*. Then any non-terminating run of positive LT-PDR converges to a KT*<sup>ω</sup> *witness (meaning that it gives a KT*<sup>ω</sup> *witness in* ω *steps). Moreover, if there is no strictly increasing* ω*-chain bounded by* α *in* L*, then positive LT-PDR is strongly terminating.*

#### **3.2 Negative PDR: Sequential Negative Witnesses**

We next introduce *Kleene sequences* as a lattice-theoretic counterpart of *proof obligations* in the standard PDR. Kleene sequences represent a chain of sufficient conditions to conclude that certain unsafe states are reachable.

**Definition 3.13 (Kleene sequence).** *Let* L, F, α *be as in Definition 3.1. A* Kleene sequence *for the LFP-OA problem* μF ≤<sup>?</sup> α *is a finite sequence* (C<sub>0</sub>, ..., C<sub>n−1</sub>)*, for* n ≥ 0 *(*C *is empty if* n = 0*), satisfying*

*1.* C<sub>j</sub> ≤ F C<sub>j−1</sub> *for each* 1 ≤ j ≤ n − 1*; 2.* C<sub>n−1</sub> ≰ α*.*

*A Kleene sequence* (C<sub>0</sub>, ..., C<sub>n−1</sub>) *is* conclusive *if* C<sub>0</sub> = ⊥*. We may use* i (0 ≤ i ≤ n) *instead of* 0 *as the starting index of the Kleene sequence* C*.*

When we have a Kleene sequence C = (C<sub>0</sub>, ..., C<sub>n−1</sub>), the implications (C<sub>j</sub> ≤ F<sup>j</sup>⊥) ⇒ (C<sub>j+1</sub> ≤ F<sup>j+1</sup>⊥) hold for 0 ≤ j < n − 1. Therefore, when C is conclusive, C<sub>n−1</sub> is a Kleene witness (Corollary 2.3.2).

**Proposition 3.14.** *Let* L, F, α *be as in Definition 3.1. There exists a Kleene (negative) witness if and only if there exists a conclusive Kleene sequence.*

*Proof.* (⇒): If there exists a Kleene witness x such that x ≤ F<sup>n</sup>⊥ and x ≰ α, then (⊥, F⊥, ..., F<sup>n</sup>⊥) is a conclusive Kleene sequence. (⇐): Assume there exists a conclusive Kleene sequence C. Then C<sub>n−1</sub> satisfies C<sub>n−1</sub> ≤ F<sup>n−1</sup>⊥ and C<sub>n−1</sub> ≰ α, because C<sub>n−1</sub> ≤ F C<sub>n−2</sub> ≤ ··· ≤ F<sup>n−1</sup>C<sub>0</sub> = F<sup>n−1</sup>⊥ and Definition 3.13.2 applies.

This proposition suggests the following algorithm for answering the LFP-OA problem negatively: it searches for a conclusive Kleene sequence, updating a Kleene sequence until its first component becomes ⊥.

**Definition 3.15 (negative LT-PDR).** *Let* L, F, α *be as in Definition 3.1.* Negative LT-PDR *is the algorithm shown in Algorithm 2; it answers 'False' to the LFP-OA problem* μF ≤<sup>?</sup> α *if successful.*

The rules are designed by the following principles.

**Candidate** initializes C with only one element x. The element x has to be chosen such that x ≰ α to ensure Definition 3.13.2.

**Model** is applied when the current Kleene sequence C is conclusive.

**Decide** prepends x to C. The condition C<sub>0</sub> ≤ F x ensures Definition 3.13.1.
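On the powerset lattice with F(A) = init ∪ post(A), and with the restriction |C<sub>i</sub>| = 1 discussed in Sect. 5.1, negative LT-PDR specializes to a backward search for a counterexample trace. The sketch below (ours; `init`, `pre`, `unsafe` are assumed names) realizes **Candidate**, **Decide**, and **Model** in that reading.

```python
def negative_lt_pdr(init, pre, unsafe):
    """Search for a conclusive Kleene sequence with singleton components.

    Candidate starts from an unsafe state (x not below alpha); Decide
    prepends a predecessor ({c} <= F{p} holds since c is in post(p));
    Model fires once the front state is initial, so the sequence can be
    closed with C_0 = bot.  Returns the state trace of a conclusive
    Kleene sequence, or None when none exists."""
    stack = [[s] for s in unsafe]               # Candidate
    seen = set(unsafe)
    while stack:
        C = stack.pop()
        if C[0] in init:                        # Model: prepend bot and conclude
            return C
        for p in pre(C[0]):                     # Decide: prepend a predecessor
            if p not in seen:
                seen.add(p)
                stack.append([p] + C)
    return None
```

For transitions 0→1→2 with initial state 0, unsafe state 2 yields the trace [0, 1, 2], i.e. the conclusive Kleene sequence (⊥, {0}, {1}, {2}).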

**Theorem 3.16.** *Let* L, F, α *be as in Definition 3.1. Negative LT-PDR is sound, i.e. if it outputs 'False' then* μF ≰ α *holds. Moreover, if* μF ≰ α *is true, then negative LT-PDR is weakly terminating.*


#### **3.3 LT-PDR: Integrating Positive and Negative**

We have introduced two simple PDR algorithms, called positive LT-PDR (Sect. 3.1) and negative LT-PDR (Sect. 3.2). They are so simple that they have potential inefficiencies. Specifically, in positive LT-PDR it is unclear how to choose x ∈ L in **Induction**, while negative LT-PDR may easily diverge because the rules **Candidate** and **Decide** may choose an x ∈ L that can never lead to a conclusive Kleene sequence. We resolve these inefficiencies by combining positive and negative LT-PDR. The combined PDR algorithm is called LT-PDR, and it is a lattice-theoretic generalization of conventional PDR.

Note that negative LT-PDR is only weakly terminating. Even worse, it is easy to make it diverge—after a choice of x in **Candidate** or **Decide** such that x ≰ μF, no continued execution of the algorithm can lead to a conclusive Kleene sequence. For deciding μF ≤<sup>?</sup> α efficiently, therefore, it is crucial to detect such useless Kleene sequences.

The core fact that underlies the efficiency of PDR is the following proposition, which says that a KT sequence (in positive LT-PDR) can quickly tell that a Kleene sequence (in negative LT-PDR) is useless. This fact is crucially used for many rules in LT-PDR (Definition 3.20).

**Proposition 3.17.** *Let* C = (C<sub>i</sub>, ..., C<sub>n−1</sub>) *be a Kleene sequence* (2 ≤ n, 0 < i ≤ n − 1) *and* X = (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub>) *be a KT sequence. Then*

*1. If* C<sub>i</sub> ≰ X<sub>i</sub>*, then* C *cannot be extended to a conclusive Kleene sequence.*

*2. If* C<sub>i</sub> ≰ F X<sub>i−1</sub>*, then* C *cannot be extended to a conclusive Kleene sequence.*

*3. If* X<sub>n−1</sub> ≤ α *and* C<sub>n−1</sub> ≤ X<sub>n−1</sub>*, then* C *is not a Kleene sequence (it violates Definition 3.13.2).*

The proof relies on the following lemmas.

**Lemma 3.18.** *Any KT sequence* (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub>) *over-approximates the initial sequence:* F<sup>i</sup>⊥ ≤ X<sub>i</sub> *holds for any* i *such that* 0 ≤ i ≤ n − 1*.*

**Lemma 3.19.** *Let* C = (C<sub>i</sub>, ..., C<sub>n−1</sub>) *be a Kleene sequence* (0 < i ≤ n − 1) *and* (X<sub>0</sub> ≤ ··· ≤ X<sub>n−1</sub>) *be a KT sequence. The following satisfy* 1 ⇔ 2 ⇒ 3*.*

*1. The Kleene sequence* C *can be extended to a conclusive one.*

*2.* C<sub>i</sub> ≤ F<sup>i</sup>⊥*.*

*3.* C<sub>i</sub> ≤ F<sup>j</sup>X<sub>i−j</sub> *for each* j *with* 0 ≤ j ≤ i*.*

Using the above lattice-theoretic properties, we combine positive and negative LT-PDRs into the following *LT-PDR* algorithm. It is also a lattice-theoretic generalization of the original PDR algorithm. The combination exploits the mutual relationship between KT sequences and Kleene sequences, exhibited as Proposition 3.17, for narrowing down choices in positive and negative LT-PDRs.

**Definition 3.20 (LT-PDR).** *Given a complete lattice* L*, an* ω*-continuous function* F: L → L*, and an element* α ∈ L*,* LT-PDR *is the algorithm shown in Algorithm 3 for the LFP-OA problem* μF ≤<sup>?</sup> α*.*

The rules are designed by the following principles.

(**Valid**, **Unfold**, and **Induction**): These rules are almost the same as in positive LT-PDR. In **Unfold**, we reset the Kleene sequence because of Proposition 3.17.3. Occurrences of **Unfold** punctuate an execution of the algorithm: between two occurrences of **Unfold**, the main goal (towards a negative conclusion) is to construct a conclusive Kleene sequence of the same length as X.

(**Candidate**, **Model**, and **Decide**): These rules have many similarities to those in negative LT-PDR. Differences are as follows: the **Candidate** and **Decide** rules impose x ≤ X<sub>i</sub> on the new element x in (x, C<sub>i+1</sub>, ..., C<sub>n−1</sub>) because Proposition 3.17.1 tells us that other choices are useless. In **Model**, we only need to check whether C<sub>1</sub> is defined instead of C<sub>0</sub> = ⊥. Indeed, since C<sub>1</sub> was added by **Candidate** or **Decide**, C<sub>1</sub> ≤ X<sub>1</sub> = F⊥ always holds. Therefore, 2 ⇒ 1 in Lemma 3.19 shows that (⊥, C<sub>1</sub>, ..., C<sub>n−1</sub>) is conclusive.

(**Conflict**): This new rule emerges from the combination of positive and negative LT-PDRs. It is applied when C<sub>i</sub> ≰ F X<sub>i−1</sub>, which confirms that the current C cannot be extended to a conclusive one (Proposition 3.17.2). Therefore, we eliminate C<sub>i</sub> from C and strengthen X so that C<sub>i</sub> cannot be chosen again, that is, so that C<sub>i</sub> ≰ X<sub>i</sub> ∧ x. Let us explain how X is strengthened. The element x has to be chosen so that C<sub>i</sub> ≰ x and F(X<sub>i−1</sub> ∧ x) ≤ x. The former dis-inequality ensures that the strengthened X satisfies C<sub>i</sub> ≰ X<sub>i</sub> ∧ x, and the latter inequality ensures that the strengthened X is again a KT sequence, as for **Induction**. One can see that **Conflict** is **Induction** with the additional condition C<sub>i</sub> ≰ x, which narrows down the search space for x using the Kleene sequence C.

Canonical choices of x ∈ L in **Candidate**, **Decide**, and **Conflict** are x := X<sub>n−1</sub>, x := X<sub>i−1</sub>, and x := F X<sub>i−1</sub>, respectively. However, there can be cleverer choices; e.g. x := S \ (C<sub>i</sub> \ F X<sub>i−1</sub>) in **Conflict** when L = PS.
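Putting the rules and the canonical choices together, here is a runnable Python sketch of LT-PDR for the powerset instance F(A) = init ∪ post(A) of Sect. 5.1. The scheduling (**Valid** and **Model** first, then **Decide**/**Conflict**, then **Candidate**, then **Unfold**) and all names are our own illustration; the paper leaves the order of rule applications open.

```python
def lt_pdr(states, init, post, safe, max_iter=100000):
    """Decide mu F <=? alpha for F(A) = init | post(A) and alpha = safe.

    X is a KT sequence; C is a Kleene sequence aligned with the tail of X
    (C[0] sits at index i = len(X) - len(C)).  Canonical choices are used:
    Candidate x := X_{n-1}, Decide x := X_{i-1}, Conflict x := F(X_{i-1})."""
    F = lambda A: frozenset(init) | {t for s in A for t in post(s)}
    X = [frozenset(), F(frozenset())]            # Initially: (bot <= F bot)
    C = []
    safe = frozenset(safe)
    for _ in range(max_iter):
        n = len(X)
        if any(X[j + 1] <= X[j] for j in range(n - 1)):
            return True                          # Valid
        if C:
            i = n - len(C)
            if i == 1:
                return False                     # Model: C_1 is defined
            Fx = F(X[i - 1])
            if C[0] <= Fx:                       # Decide: prepend x := X_{i-1}
                C.insert(0, X[i - 1])
            else:                                # Conflict: x := F(X_{i-1})
                for j in range(2, i + 1):
                    X[j] &= Fx                   # strengthen X_j := X_j /\ x
                C.pop(0)                         # drop C_i
        elif not X[-1] <= safe:                  # Candidate: x := X_{n-1}
            C = [X[-1]]
        else:                                    # Unfold: append top, reset C
            X.append(frozenset(states))
            C = []
    return None                                  # step budget exhausted
```

On the transitions 0→1→2 (2 looping) with initial state 0, the run returns True for safe = {0, 1, 2} and False for safe = {0, 1}, with the **Conflict** strengthenings doing the work that makes **Valid** fire.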

**Lemma 3.21.** *Each rule of LT-PDR, when applied to a pair of a KT and a Kleene sequence, yields a pair of a KT and a Kleene sequence.* 

**Theorem 3.22 (correctness).** *LT-PDR is sound, i.e. if it outputs 'True' then* μF ≤ α *holds, and if it outputs 'False' then* μF ≰ α *holds.*

Many existing PDR algorithms ensure termination if the state space is finite. A general principle behind this is stated below. Note that it rarely applies to infinitary or quantitative settings, where we would need some abstraction for termination.

**Proposition 3.23 (termination).** *LT-PDR terminates regardless of the order of the rule-applications if the following conditions are satisfied.*

*1.* **Valid** *and* **Model** *are applied immediately whenever they are enabled.*

*2. The order* ≤ *in* L *is well-founded.*

*3. There is no strictly increasing* ω*-chain bounded by* α *in* L*.*
Cond. 1 is natural: it just requires LT-PDR to immediately conclude 'True' or 'False' if it can. Conds. 2–3 are always satisfied when L is finite.


**Table 1.** Categorical modeling of state-based dynamics and predicate transformers

Theorem 3.22 and Proposition 3.23 still hold if the **Induction** rule is dropped. However, the rule can accelerate the convergence of KT sequences and thus improve efficiency.

*Remark 3.24 (LT-OpPDR).* The GFP-UA problem α ≤<sup>?</sup> νF is the dual of LFP-OA, obtained by opposing the order ≤ in L. We can also dualize the LT-PDR algorithm (Algorithm 3), obtaining what we call the *LT-OpPDR* algorithm for GFP-UA. Moreover, we can express LT-OpPDR as LT-PDR if a suitable *involution* ¬: L → L<sup>op</sup> is present. See [22, Appendix B] for further details; see also Proposition 4.3.

#### **4 Structural Theory of PDR by Category Theory**

Before we discuss concrete instances of LT-PDR in Sect. 5, we develop a structural theory of transition systems and predicate transformers as a basis of LT-PDR. The theory is formulated in the language of *category theory* [3,17,18,23]. We use category theory because 1) categorical modeling of relevant notions is well established in the community (see e.g. [2,8,17,18,27]), and 2) it gives us the right level of abstraction that accommodates a variety of instances. In particular, qualitative and quantitative settings are described in a uniform manner.

Our structural theory (Sect. 4) serves as a backend, not a frontend. That is, it is used to derive concrete instances of LT-PDR systematically (Sect. 5); executing the derived instances does not require the categorical machinery.


#### **4.1 Categorical Modeling of Dynamics and Predicate Transformers**

Our interests are in instances of the LFP-OA problem μF ≤<sup>?</sup> α (Definition 3.1) that appear in *model checking*. In this context, 1) the underlying lattice L is that of *predicates* over a state space, and 2) the function F: L → L arises from the dynamic/transition structure, specifically as a *predicate transformer*. The categorical notions in Table 1 model these ideas (state-based dynamics, predicate transformers). This modeling is well established in the community.

Our introduction of Table 1 here is minimal, due to the limited space. See [22, Appendix C] and the references therein for more details.

A *category* consists of *objects* and *arrows* between them. In Table 1, categories occur twice: 1) a *base category* B, where objects are typically sets and arrows are typically functions; and 2) *fiber categories* E<sub>S</sub>, defined for each object S of B, that are identified with the lattices of *predicates*. Specifically, objects P, Q, ... of E<sub>S</sub> are predicates over S, and an arrow P → Q represents logical implication. A general fact behind the latter is that every preorder is a category—see e.g. [3].

**Transition Systems as Coalgebras.** State-based transition systems are modeled as *coalgebras* in the base category B [17]. We use a *functor* G: B → B to represent a transition type. A *G-coalgebra* is an arrow δ: S → GS, where S is a state space and δ describes the dynamics. For example, a Kripke structure can be identified with a pair (S, δ) of a set S and a function δ: S → PS, where PS denotes the powerset of S. The powerset construction P is known to be a functor P: **Set** → **Set**; therefore Kripke structures are P-coalgebras. For other choices of G, G-coalgebras become different types of transition systems, such as MDPs (Sect. 5.2) and Markov reward models (Sect. 5.3).

**Predicates Form a Fibration.** Fibrations are powerful categorical constructs that can model various indexed entities; see e.g. [18] for their general theory. Our use of them is for organizing the lattices E<sub>S</sub> of *predicates* over a set S, indexed by the choice of S. For example, E<sub>S</sub> = 2<sup>S</sup>—the lattice of subsets of S—for modeling qualitative predicates. For quantitative reasoning (e.g. for MDPs), we use E<sub>S</sub> = [0, 1]<sup>S</sup>, where [0, 1] is the unit interval. This way, qualitative and quantitative reasoning are mathematically unified in the language of fibrations.

A *fibration* is a functor p: E → B with suitable properties; it can be thought of as a collection (E<sub>S</sub>)<sub>S∈B</sub> of *fiber categories* E<sub>S</sub>—indexed by objects S of B—suitably organized as a single category E. Notable in this organization is that we obtain the *pullback* functor l<sup>∗</sup>: E<sub>Y</sub> → E<sub>X</sub> for each arrow l: X → Y in B. In our examples, l<sup>∗</sup> is a *substitution* along l in predicates: l<sup>∗</sup> is the monotone map that carries a predicate P(y) over Y to the predicate P(l(x)) over X.

In this paper, we restrict to a subclass of fibrations (called *CLat*<sub>∧</sub>*-fibrations*) in which every fiber category E<sub>S</sub> is a complete lattice and each pullback functor preserves all meets. We therefore write P ≤ Q for arrows in E<sub>S</sub>; this represents logical implication, as announced above. Notice that each f<sup>∗</sup> has a left adjoint (a lower adjoint in terms of Galois connections), which exists by Freyd's adjoint functor theorem. The left adjoint is denoted by f<sub>∗</sub>.

We also consider a *lifting* Ġ: E → E of G along p, that is, a functor Ġ such that pĠ = Gp. It specifies the *logical interpretation* of the transition type G. For example, for G = P (the powerset functor) from the above, two choices of Ġ are for the *may* and *must* modalities. See e.g. [2,15,20,21].

**Categorical Predicate Transformer.** The above constructs allow us to model predicate transformers—the F in our examples of the LFP-OA problem μF ≤<sup>?</sup> α—in categorical terms. A *predicate transformer* along a coalgebra δ: S → GS with respect to the lifting Ġ is simply the composite δ<sup>∗</sup> ∘ Ġ: E<sub>S</sub> → E<sub>GS</sub> → E<sub>S</sub>, where the first component Ġ is the restriction of Ġ: E → E to E<sub>S</sub>. Intuitively, 1) given a *postcondition* P in E<sub>S</sub>, 2) it is first interpreted as the predicate ĠP over GS, and then 3) it is pulled back along the dynamics δ to yield a *precondition* δ<sup>∗</sup>ĠP. Such (backward) predicate transformers are fundamental in a variety of model checking problems.
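For the powerset fibration (E<sub>S</sub> = 2<sup>S</sup>) with G = P and the must-modality lifting, the composite δ<sup>∗</sup> ∘ Ġ can be spelled out directly; the following small Python check (our own concretization, with assumed names) confirms that it agrees with the familiar description "all successors satisfy P".

```python
from itertools import combinations

S = {0, 1, 2}
delta = {0: {0, 1}, 1: {2}, 2: set()}        # a P-coalgebra delta: S -> P(S)

def powerset(xs):
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

# Gdot sends a predicate P over S to the "must" predicate {A | A <= P} over P(S);
# pullback substitutes along delta, turning a predicate over P(S) into one over S.
Gdot = lambda P: {A for A in powerset(S) if A <= P}
pullback = lambda Q: {s for s in S if frozenset(delta[s]) in Q}
transform = lambda P: pullback(Gdot(P))      # the precondition delta* Gdot P

# sanity check against the direct formula {s | delta(s) <= P}:
assert transform({2}) == {s for s in S if delta[s] <= {2}}
```

Here `transform({2})` contains state 1 (its only successor is 2) and state 2 (no successors, so the must modality holds vacuously), but not state 0.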

#### **4.2 Structural Theory of PDR from Transition Systems**

We formulate a few general *safety* problems and show how they are amenable to the LT-PDR (Definition 3.20) and LT-OpPDR (Remark 3.24) algorithms.

**Definition 4.1 (backward safety problem, BSP).** *Let* p *be a CLat*<sub>∧</sub>*-fibration,* δ: S → GS *be a coalgebra in* B*, and* Ġ: E → E *be a lifting of* G *along* p *such that* Ġ<sub>X</sub>: E<sub>X</sub> → E<sub>GX</sub> *is* ω<sup>op</sup>*-continuous for each* X ∈ B*. The* backward safety problem *for* (ι ∈ E<sub>S</sub>, δ, α ∈ E<sub>S</sub>) *in* (p, G, Ġ) *is the GFP-UA problem for* (E<sub>S</sub>, α ∧ δ<sup>∗</sup>Ġ(−), ι)*, that is,*

$$
\iota \le^? \nu x.\, \alpha \land \delta^* \dot{G} x. \tag{2}
$$


Here, ι represents the initial states and α represents the safe states. The predicate transformer x ↦ α ∧ δ<sup>∗</sup>Ġx in (2) is the standard one for modeling safety: currently safe (α), and safe at the next step (δ<sup>∗</sup>Ġx). Its gfp is the safety property; (2) asks whether all initial states (ι) satisfy the safety property. Since the backward safety problem is a GFP-UA problem, we can solve it by LT-OpPDR (Remark 3.24).

Additional assumptions allow us to reduce the backward safety problem to LFP-OA problems, which are solvable by LT-PDR: the BSP as-is is a GFP-UA problem and is solved by LT-OpPDR, while suitable adjoints (Proposition 4.2) or an involution (Proposition 4.3) turn it into an LFP-OA problem solved by LT-PDR.

The first case requires the existence of the *left adjoint* to the predicate transformer δ<sup>∗</sup>Ġ<sub>S</sub>: E<sub>S</sub> → E<sub>S</sub>. Then we can translate the BSP to the following LFP-OA problem, which directly asks whether all reachable states are safe.

**Proposition 4.2 (forward safety problem, FSP).** *In the setting of Definition 4.1, assume that each* Ġ<sub>X</sub>: E<sub>X</sub> → E<sub>GX</sub> *preserves all meets. Then, letting* Ḣ<sub>S</sub>: E<sub>GS</sub> → E<sub>S</sub> *be the left adjoint of* Ġ<sub>S</sub>*, the BSP* (2) *is equivalent to the LFP-OA problem for* (E<sub>S</sub>, ι ∨ Ḣ<sub>S</sub>δ<sub>∗</sub>(−), α)*:*

$$
\mu x.\, \iota \vee \dot{H}_S \delta_* x \le^? \alpha. \tag{3}
$$

*This problem is called the* forward safety problem *for* (ι, δ, α) *in* (p, G, Ġ)*.*

The second case assumes that the complete lattice E<sub>S</sub> of predicates admits an involution operator ¬: E<sub>S</sub> → E<sub>S</sub><sup>op</sup> (cf. [22, Appendix B]).

**Proposition 4.3 (inverse backward safety problem, IBSP).** *In the setting of Definition 4.1, assume further that there is a monotone function* ¬: E<sub>S</sub> → E<sub>S</sub><sup>op</sup> *satisfying* ¬ ∘ ¬ = id*. Then the backward safety problem* (2) *is equivalent to the LFP-OA problem for* (E<sub>S</sub>, (¬α) ∨ (¬ ∘ δ<sup>∗</sup>Ġ ∘ ¬), ¬ι)*, that is,*

$$
\mu x.\, (\neg \alpha) \lor \bigl( \neg \circ \delta^* \dot{G} \circ \neg \bigr)(x) \leq^? \neg \iota. \tag{4}
$$

*We call* (4) *the* inverse backward safety problem *for* (ι, δ, α) *in* (p, G, Ġ)*. Here* (¬α) ∨ (¬ ∘ δ<sup>∗</sup>Ġ ∘ ¬(−)) *is the* inverse backward predicate transformer*.*

When both additional assumptions are fulfilled (those of Propositions 4.2 and 4.3), we obtain two LT-PDR algorithms to solve the BSP. One can even run these two algorithms simultaneously—this is done in fbPDR [25,26]. See also Sect. 5.1.

# **5 Known and New PDR Algorithms as Instances**

We present several concrete instances of our LT-PDR algorithms. The one for Markov reward models is new (Sect. 5.3). We also sketch how those instances can be systematically derived by the theory in Sect. 4; details are in [22, Appendix D].

### **5.1 LT-PDRs for Kripke Structures: PDR<sup>F</sup>-Kr and PDR<sup>IB</sup>-Kr**

In most of the PDR literature, the target system is a Kripke structure that arises from a program's operational semantics. A *Kripke structure* consists of a set S of states and a transition relation δ ⊆ S × S (here we ignore initial states and atomic propositions). The basic problem formulation is as follows.

**Definition 5.1 (backward safety problem (BSP) for Kripke structures).** *The* BSP *for a Kripke structure* (S, δ)*, a set* ι ∈ 2<sup>S</sup> *of initial states, and a set* α ∈ 2<sup>S</sup> *of safe states, is the GFP-UA problem* ι ≤<sup>?</sup> νx. α ∧ F x*, where* F: 2<sup>S</sup> → 2<sup>S</sup> *is defined by* F(A) := {s | ∀s′. ((s, s′) ∈ δ ⇒ s′ ∈ A)}*.*

It is clear that the GFP in Definition 5.1 represents the set of states from which all reachable states are in α. Therefore the BSP is the usual safety problem.

The above BSP is easily seen to be equivalent to the following problems.

**Proposition 5.2 (forward safety problem (FSP) for Kripke structures).** *The BSP in Definition 5.1 is equivalent to the LFP-OA problem* μx. ι ∨ Fx ≤<sup>?</sup> α*, where* F: 2<sup>S</sup> → 2<sup>S</sup> *is defined by* F(A) := ⋃<sub>s∈A</sub> {s′ | (s, s′) ∈ δ}*.*

**Proposition 5.3 (inverse backward safety problem (IBSP) for Kripke structures).** *The BSP in Definition 5.1 is equivalent to the LFP-OA problem* μx. ¬α ∨ ¬F(¬x) ≤<sup>?</sup> ¬ι*, where* ¬: 2<sup>S</sup> → 2<sup>S</sup> *is the complement function* A ↦ S \ A *and* F *is as in Definition 5.1.*
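The three equivalent formulations can be cross-checked by brute-force fixed-point iteration on a small example; the Kripke structure, the helper names, and the name `Fsucc` for the forward transformer of Proposition 5.2 are ours.

```python
S = {0, 1, 2, 3}
delta = {0: {1}, 1: {2}, 2: {2}, 3: {0}}     # transition relation as a successor map
iota, alpha = {0}, {0, 1, 2}                 # initial states and safe states

F = lambda A: {s for s in S if delta[s] <= A}        # the F of Definition 5.1
Fsucc = lambda A: {t for s in A for t in delta[s]}   # the F of Proposition 5.2
neg = lambda A: S - A                                # the complement of Proposition 5.3

def fixpoint(f, x):        # Kleene iteration; converges since the lattice is finite
    while f(x) != x:
        x = f(x)
    return x

bsp = iota <= fixpoint(lambda x: alpha & F(x), set(S))                      # BSP (GFP-UA)
fsp = fixpoint(lambda x: iota | Fsucc(x), set()) <= alpha                   # FSP (LFP-OA)
ibsp = fixpoint(lambda x: neg(alpha) | neg(F(neg(x))), set()) <= neg(iota)  # IBSP
assert bsp == fsp == ibsp  # the three problems give the same answer
```

Here all three evaluate to True (the reachable states {0, 1, 2} are safe); shrinking `alpha` to {0, 1} makes all three False.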

**Instances of LT-PDR.** The FSP and IBSP (Propositions 5.2–5.3), being LFP-OA problems, are amenable to the LT-PDR algorithm (Definition 3.20). Thus we obtain two instances of LT-PDR; we call them *PDR<sup>F</sup>-Kr* and *PDR<sup>IB</sup>-Kr*. **PDR<sup>IB</sup>-Kr** is a step-by-step dual to the application of LT-OpPDR to the BSP (Definition 5.1)—see Remark 3.24.

We compare these two instances of LT-PDR with algorithms in the literature. If we impose |C<sub>i</sub>| = 1 on each element C<sub>i</sub> of Kleene sequences, the **PDR<sup>F</sup>-Kr** instance of LT-PDR coincides with the conventional IC3/PDR [9,13]. In contrast, **PDR<sup>IB</sup>-Kr** coincides with *Reverse PDR* in [25,26]. The parallel execution of **PDR<sup>F</sup>-Kr** and **PDR<sup>IB</sup>-Kr** roughly corresponds to fbPDR [25,26].

**Structural Derivation.** The equivalent problems (Propositions 5.2–5.3) are derived systematically from the categorical theory in Sect. 4.2. Indeed, using the lifting Ṗ: 2<sup>S</sup> → 2<sup>PS</sup> such that A ↦ {A′ | A′ ⊆ A} (the *must modality* □), the F in Definition 5.1 coincides with δ<sup>∗</sup>Ṗ in (2). This Ṗ preserves meets (cf. the modal axiom □(ϕ ∧ ψ) ≅ □ϕ ∧ □ψ, see e.g. [7]); thus Proposition 4.2 derives the FSP. Finally, ¬ in Proposition 5.3 allows the use of Proposition 4.3. More details are in [22, Appendix D].

### **5.2 LT-PDR for MDPs: PDR<sup>IB</sup>-MDP**

The only known PDR-like algorithm for *quantitative* verification is *PrIC3* [6] for Markov decision processes (MDPs). Here we instantiate LT-PDR for MDPs and compare it with PrIC3.

An *MDP* consists of a set S of states, a set Act of actions, and a transition function δ mapping s ∈ S and a ∈ Act to either ∗ ("the action a is unavailable at s") or a probability distribution δ(s)(a) over S.

**Definition 5.4 (IBSP for MDPs).** *The* inverse backward safety problem (IBSP) *for an MDP* (S, δ)*, an initial state* s<sub>ι</sub> ∈ S*, a real number* λ ∈ [0, 1]*, and a set* α ⊆ S *of safe states, is the LFP-OA problem* μx. F(x) ≤<sup>?</sup> d<sub>ι,λ</sub>*. Here* d<sub>ι,λ</sub>: S → [0, 1] *is the predicate such that* d<sub>ι,λ</sub>(s<sub>ι</sub>) = λ *and* d<sub>ι,λ</sub>(s) = 1 *otherwise.* F: [0, 1]<sup>S</sup> → [0, 1]<sup>S</sup> *is defined by* F(d)(s) = 1 *if* s ∉ α*, and* F(d)(s) = max{ ∑<sub>s′∈S</sub> d(s′) · δ(s)(a)(s′) | a ∈ Act, δ(s)(a) ≠ ∗ } *if* s ∈ α*.*

The function F in Definition 5.4 is a *Bellman operator* for MDPs: it takes the average of d over δ(s)(a) and then the maximum over a. Therefore the lfp in Definition 5.4 is the maximum probability of reaching S \ α; the problem asks if it is ≤ λ. In other words, it asks whether the *safety* probability—of staying in α henceforth, under any choice of actions—is ≥ 1 − λ. This problem is the same as the one in [6].
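This reading can be checked by brute force on a 3-state toy MDP (ours, not an example from [6]): Kleene iteration of F from ⊥ converges to the maximum probability of reaching the unsafe state.

```python
S = [0, 1, 2]
alpha = {0, 1}                       # safe states; state 2 is unsafe
# delta[s][a] is the distribution delta(s)(a); absent actions are unavailable (*)
delta = {
    0: {"a": {0: 0.5, 1: 0.5}, "b": {1: 0.9, 2: 0.1}},
    1: {"a": {1: 1.0}},
    2: {"a": {2: 1.0}},
}

def F(d):                            # the Bellman operator of Definition 5.4
    return {s: 1.0 if s not in alpha else
               max(sum(p * d[t] for t, p in dist.items())
                   for dist in delta[s].values())
            for s in S}

d = {s: 0.0 for s in S}              # Kleene iteration from bot
for _ in range(100):
    d = F(d)

# The max probability of reaching S \ alpha from state 0 is 0.1 (via action "b"),
# so mu F <= d_{iota,lambda} holds for s_iota = 0 exactly when lambda >= 0.1.
assert abs(d[0] - 0.1) < 1e-9 and d[1] == 0.0 and d[2] == 1.0
```

Note how the maximum over actions makes the answer adversarial: action "a" from state 0 never leaves the safe region, but the lfp accounts for the worst choice "b".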

**Instance of PDR.** The IBSP (Definition 5.4) is LFP-OA and thus amenable to LT-PDR. We call this instance *PDR<sup>IB</sup>-MDP*; see [22, Appendix E] for details.

**PDRIB-MDP** shares much of its essence with PrIC3 [6]. It uses the operator F in Definition 5.4, which coincides with the one in [6, Def. 2]. PrIC3 maintains *frames*; they coincide with KT sequences in **PDRIB-MDP**.

Our Kleene sequences correspond to *obligations* in PrIC3, modulo the following difference. Kleene sequences aim at a negative witness (Sect. 3.2) but happen to aid the positive proof efforts too (Sect. 3.3); obligations in PrIC3 serve solely to accelerate the positive proof efforts. Thus, if PrIC3's proof efforts fail, one must additionally check whether the obligations yield a negative witness.

**Structural Derivation.** One can derive the IBSP (Definition 5.4) from the categorical theory in Sect. 4.2. Specifically, we first formulate the *BSP* ¬dι,λ ≤? νx. dα ∧ δ∗G˙x, where G˙ is a suitable lifting (of G for MDPs, Table 1) that combines average and minimum, ¬ : [0, 1]<sup>S</sup> → [0, 1]<sup>S</sup> is defined by (¬d)(s) := 1 − d(s), and dα is such that dα(s) = 1 if s ∈ α and dα(s) = 0 otherwise. Using ¬ : [0, 1]<sup>S</sup> → [0, 1]<sup>S</sup> as an involution, we apply Proposition 4.3 and obtain the IBSP (Definition 5.4).

Another benefit of the categorical theory is that it tells us that a forward instance of LT-PDR (much like **PDRF-Kr** in Sect. 5.1) is unlikely to exist for MDPs. Indeed, we showed in Proposition 4.2 that G˙'s preservation of meets is essential (the existence of a left adjoint is equivalent to meet preservation), and one can easily show that our G˙ for MDPs does not preserve meets. See [22, Appendix G].

#### **5.3 LT-PDR for Markov Reward Models: PDRMRM**

We present a PDR-like algorithm for *Markov reward models (MRMs)*, which seems to be new, as an instance of LT-PDR. An MRM consists of a set S of states and a transition function δ that maps s ∈ S (the current state) and c ∈ ℕ (the reward) to a function δ(s)(c) : S → [0, 1]; the latter represents the probability distribution of next states.

We solve the following problem. We use [0,∞]-valued predicates representing accumulated rewards—where [0,∞] is the set of extended nonnegative reals.

**Definition 5.5 (SP for MRMs).** *The* safety problem (SP) *for an MRM* (S, δ)*, an initial state* sι ∈ S*,* λ ∈ [0, ∞]*, and a set* α ⊆ S *of safe states is the LFP-OA problem* μx. F(x) ≤? dι,λ*. Here* dι,λ : S → [0, ∞] *maps* sι *to* λ *and all other states to* ∞*, and* F : [0, ∞]<sup>S</sup> → [0, ∞]<sup>S</sup> *is defined by* F(d)(s) = 0 *if* s ∉ α*, and* F(d)(s) = ∑<sub>s′∈S, c∈ℕ</sub> (c + d(s′)) · δ(s)(c)(s′) *if* s ∈ α*.*

The function F accumulates the expected reward while in α. Thus the problem asks if the expected reward accumulated from sι until leaving α is ≤ λ.
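As with the MDP case, a small Python sketch (a hypothetical 3-state MRM, not from the paper) illustrates F and its lfp via Kleene iteration:

```python
# Sketch of the operator F of Definition 5.5 on a made-up 3-state MRM.
# States 0 and 1 are safe (in alpha); state 2 is outside alpha.
STATES = [0, 1, 2]
ALPHA = {0, 1}
# delta[s][c] is the distribution of next states when reward c is received in s
delta = {
    0: {1: {1: 1.0}},            # in state 0: reward 1, then move to state 1
    1: {1: {1: 0.5, 2: 0.5}},    # in state 1: reward 1, stay or leave alpha
    2: {0: {2: 1.0}},
}

def F(d):
    """One application of the operator F of Definition 5.5."""
    out = {}
    for s in STATES:
        if s not in ALPHA:
            out[s] = 0.0         # no further reward once alpha has been left
        else:
            out[s] = sum((c + d[t]) * p
                         for c, dist in delta[s].items()
                         for t, p in dist.items())
    return out

# Kleene iteration from the bottom element (constantly 0) towards the lfp
d = {s: 0.0 for s in STATES}
for _ in range(200):
    d = F(d)

# d[0] approximates the expected reward accumulated until leaving alpha;
# the SP asks whether d[s_iota] <= lambda.
```

In this example the iteration converges to d[1] = 2 (one reward per step, two steps in expectation before leaving α) and hence d[0] = 3.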

**Instance of PDR.** The SP (Definition 5.5) is LFP-OA and thus amenable to LT-PDR. We call this instance **PDRMRM**; it appears to be new. See [22, Appendix F] for details.

**Structural Derivation.** The function F in Definition 5.5 can be expressed categorically as F(x) = dα ∧ δ∗G˙(x), where dα : S → [0, ∞] carries s ∈ α to ∞ and s ∉ α to 0, and G˙ is a suitable lifting that accumulates expected reward. However, the SP (Definition 5.5) is *not* an instance of the three general safety problems in Sect. 4.2. Consequently, we expect that instances of LT-PDR other than **PDRMRM** (such as analogues of **PDRF-Kr** and **PDRIB-Kr** in Sect. 5.1) are hard to obtain for MRMs.

### **6 Implementation and Evaluation**

**Implementation.** We implemented LT-PDR in Haskell. Exploiting Haskell's language features, the implementation is succinct (∼50 lines) and almost a literal translation of Algorithm 3 into Haskell. Its main part is presented in [22, Appendix K]. In particular, using suitable type classes, the code is as abstract and generic as Algorithm 3.

Specifically, our implementation is a Haskell module named LTPDR. It has two interfaces, namely the type class CLat τ (the lattice of predicates) and the type Heuristics τ (the definitions of **Candidate**, **Decide**, and **Conflict**). The main function for LT-PDR is ltPDR :: CLat τ ⇒ Heuristics τ → (τ → τ) → τ → IO (PDRAnswer τ), where the second argument is the monotone function F of type τ → τ and the last is the safety predicate α.

Obtaining concrete instances is easy: one fixes τ and Heuristics τ. A simple implementation of **PDRF-Kr** takes 15 lines; a more serious SAT-based one for **PDRF-Kr** takes ∼130 lines; **PDRIB-MDP** and **PDRMRM** take ∼80 lines each.

**Heuristics.** We briefly discuss the heuristics used in our experiments, i.e., how to choose x ∈ L in **Candidate**, **Decide**, and **Conflict**. The heuristics of **PDRF-Kr** is based on the conventional PDR [9]. The heuristics of **PDRIB-MDP** is based on representing the smallest possible x greater than some real number v ∈ [0, 1] (e.g. the x taken in **Candidate**) as x = v + ε, where ε is a symbolic variable. This ensures that **Unfold** (or **Valid**, **Model**) is always applied within finitely many steps, which further guarantees finite-step termination for invalid cases and ω-step termination for valid cases (see [22, Appendix H] for more detail). The heuristics of **PDRMRM** is similar to that of **PDRIB-MDP**.
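The "v + ε" representation can be sketched as follows (a hypothetical Python transliteration; the actual heuristics are described in [22, Appendix H]). Values of the form v + k·ε are stored as pairs (v, k) and compared lexicographically, so that ε acts as a positive infinitesimal:

```python
from functools import total_ordering

@total_ordering
class Eps:
    """A value v + k*eps, eps an infinitesimal: compare lexicographically."""
    def __init__(self, v, k=0):
        self.v, self.k = v, k
    def __eq__(self, other):
        return (self.v, self.k) == (other.v, other.k)
    def __lt__(self, other):
        return (self.v, self.k) < (other.v, other.k)
    def __add__(self, other):
        return Eps(self.v + other.v, self.k + other.k)

eps = Eps(0.0, 1)
x = Eps(0.3) + eps   # "the smallest value strictly greater than 0.3"
# x lies strictly above 0.3 but below every real number greater than 0.3
```

This makes "the smallest x strictly above v" an exact, representable object rather than a floating-point approximation.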

**Experiment Setting.** We experimentally assessed the performance of instances of LTPDR. The settings are as follows: a 1.2 GHz Quad-Core Intel Core i7 with 10 GB memory using Docker for **PDRIB-MDP**, and an Apple M1 chip with 16 GB memory for the others. The settings differ because we needed Docker to run PrIC3 [6].

**Experiments with PDRMRM.** Table 2a shows the results. We observe that **PDRMRM** answered correctly and that the execution time is reasonable. Further performance analysis (e.g. comparison with [19]) and improvement are future work; the point here, nevertheless, is that we obtained a reasonable MRM model checker by adding ∼80 lines to the generic solver LTPDR.

**Experiments with PDRIB-MDP.** Table 2c shows the results. Both PrIC3 and our **PDRIB-MDP** solve a linear programming (LP) problem in **Decide**. PrIC3 uses Z3 for this; **PDRIB-MDP** uses GLPK. PrIC3 represents an MDP symbolically, while **PDRIB-MDP** does so concretely. A symbolic representation in **PDRIB-MDP** is possible; it is future work. PrIC3 can use four different *interpolation generalization* methods, leading to different performance (Table 2c).

We observe that **PDRIB-MDP** outperforms PrIC3 on some benchmarks with smaller state spaces. We believe that the failures of **PDRIB-MDP** on many instances can be attributed to our current choice of generalization method (it is closest to PrIC3's linear one). Table 2c suggests that using the *polynomial* or *hybrid* method could enhance performance.

**Experiments with PDR<sup>F</sup>**-**Kr.** Table 2b shows the results. The benchmarks are mostly from the HWMCC'15 competition [1], except for latch0.smv<sup>1</sup> and counter.smv (our own).

IC3ref vastly outperforms **PDRF-Kr** in many instances. This is hardly a surprise: IC3ref was developed with performance as a primary goal, while **PDRF-Kr**'s emphasis is on theoretical simplicity and genericity. We nevertheless see that **PDRF-Kr** solves some benchmarks of substantial size, such as power2bit8.smv. This demonstrates the practical potential of LT-PDR, especially in view of the following improvement opportunities (which we will pursue as future work): 1) use of well-developed SAT solvers (we currently use toysolver<sup>2</sup> for its good interface, but we could use Z3); 2) allowing |Ci| > 1, a technique discussed in Sect. 5.1 and implemented in IC3ref but not in **PDRF-Kr**; and 3) other small improvements, e.g. in our CNF-based handling of propositional formulas.

**Ablation Study.** To assess the value of the key concept of PDR (namely the *positive-negative interplay* between the Knaster–Tarski and Kleene theorems (Sect. 3.3)), we compared **PDRF-Kr** with the instances of positive and negative LT-PDR (Sects. 3.1–3.2) for Kripke structures.

Table 2d shows the results. Note that the value of the positive-negative interplay is already theoretically established; see e.g. Proposition 3.17 (the interplay detects executions that lead to nowhere). This value was also experimentally witnessed: see power2bit8.smv and simpleTrans.smv, where the one-sided methods made wrong choices and timed out. One-sided methods can be efficient when they get lucky (e.g. in counter.smv). LT-PDR may be slower because of the overhead of running two sides, but that is a trade-off for the increased chance of termination.

**Discussion.** We observe that all of the studied instances exhibited at least reasonable performance. We note again that detailed performance analysis and improvement is out of our current scope. Being able to derive these model checkers, with such a small effort as ∼100 lines of Haskell code each, demonstrates the value of our abstract theory and its generic Haskell implementation LTPDR.

<sup>1</sup> https://github.com/arminbiere/aiger.

<sup>2</sup> https://github.com/msakai/toysolver.

#### **Table 2.** Experimental results for our **PDRF-Kr**, **PDRIB-MDP**, and **PDRMRM**

(a) Results with **PDRMRM**. The MRM is from [4, Example 10.72], whose ground-truth expected reward is 4/3. The benchmarks ask if the expected reward (not known to the solver) is ≤ 1.5 or ≤ 1.3.


(b) Results with **PDRF-Kr** in comparison with IC3ref, a reference implementation of [9] (https://github.com/arbrad/ IC3ref). Both solvers answered correctly. Timeout (TO) is 600 sec.


(c) Results with **PDRIB-MDP** (an excerpt of [22, Table 3]). Comparison is against PrIC3 [6] with four different interpolation generalization methods (none, linear, polynomial, hybrid). The benchmarks are from [6]. |S| is the number of states of the benchmark MDP. "GT pr." is the *ground-truth probability*, that is, the reachability probability Pr<sup>max</sup>(sι |= ♦(S \ α)) computed outside the solvers under experiment. The solvers were asked whether the GT pr. (which they do not know) is ≤ λ or not; they all answered correctly. The last five columns show the average execution time in seconds; – means "did not finish," due to out-of-memory or timeout (600 sec.)


(d) Ablation experiments: LT-PDR (**PDRF-Kr**) vs. positive and negative LT-PDRs, implemented for the FSP for Kripke structures. The benchmarks are as in Table 2b, except for a new micro benchmark simpleTrans.smv. Timeout (TO) is 600 sec.


#### **7 Conclusions and Future Work**

We have presented a lattice-theoretic generalization of the PDR algorithm called LT-PDR. This involves the decomposition of the PDR algorithm into positive and negative ones, which are tightly connected to the Knaster–Tarski and Kleene fixed point theorems, respectively. We then combined it with the coalgebraic and fibrational theory for modeling transition systems with predicates. We instantiated it with several transition systems, deriving existing PDR algorithms as well as a new one over Markov reward models. We leave instantiating our LT-PDR and categorical safety problems to derive other PDR-like algorithms, such as PDR for hybrid systems [29], for future work.

We will also work on the combination of our work and the theory of *abstract interpretation* [10,12]. Our current framework axiomatizes what is needed of heuristics, but it does not tell how to realize such heuristics (that differ a lot in different concrete settings). We expect abstract interpretation to provide some general recipes for realizing such heuristics.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Affine Loop Invariant Generation via Matrix Algebra**

Yucheng Ji1,2 , Hongfei Fu2(B) , Bin Fang<sup>1</sup> , and Haibo Chen1,2

<sup>1</sup> OS Kernel Lab, Huawei Technologies, Shanghai, China

{jiyucheng,fangbin11,hb.chen}@huawei.com <sup>2</sup> Shanghai Jiao Tong University, Shanghai, China fuhf@cs.sjtu.edu.cn

**Abstract.** Loop invariant generation, which automates the generation of assertions that always hold at the entry of a while loop, has many important applications in program analysis and formal verification. In this work, we target an important category of while loops, namely affine while loops: unnested while loops with affine loop guards and variable updates. This class of loops is widespread in programs, yet still lacks a general but efficient approach to invariant generation. We propose a novel matrix-algebra approach to automatically synthesizing affine inductive invariants in the form of an affine inequality. The main novelty of our approach is that (i) it is general, in the sense that it theoretically addresses all cases of affine invariant generation over an affine while loop, and (ii) it can be efficiently automated through matrix-algebra methods (such as eigenvalues and matrix inverses).

The details of our approach are as follows. First, for the case where the loop guard is a tautology (i.e., '**true**'), we show that the eigenvalues and their eigenvectors of the matrices derived from the variable updates of the loop body encompass all meaningful affine inductive invariants. Second, for the more general case where the loop guard is a conjunction of affine inequalities, our approach completely addresses the invariant-generation problem by first establishing through matrix inverse the relationship between the invariants and a key parameter in the application of Farkas' lemma, then solving the feasible domain of the key parameter from the inductive conditions, and finally illustrating that a finite number of values suffices for the key parameter w.r.t. a tightness condition for the invariants to be generated.

Experimental results show that compared with previous approaches, our approach generates much more accurate affine inductive invariants over affine while loops from existing and new benchmarks within a few seconds, demonstrating the generality and efficiency of our approach.

#### **1 Introduction**

An *invariant* is a logical assertion at a certain program location that always holds whenever the program executes across that location. Invariants are indispensable parts of program analysis and formal verification, and thus the generation of invariants has been key to the proof and analysis of crucial properties like reachability [3,6,15], time complexity [9] and safety [2,32]. To ease program analysis and formal verification, there has been a long thread of research on approaches to automatic generation of invariants, including constraint solving [10,12,27], recurrence analysis [17,24,29,31], abstract interpretation [13,14], logical inference [18,19,38], dynamic analysis [33,39], and machine learning [20,23,44]. To guarantee that an assertion is indeed an invariant, the widely-adopted paradigm is to generate an *inductive invariant* that holds for the first execution and for every periodic execution to the particular program location [12,32]. In this work, we consider an important subclass of invariants called *numerical invariants* which are assertions over the numerical values taken by the program variables, and are closely related to many common vulnerabilities like integer overflow, buffer overflow, division by zero and array out-of-bound. More specifically, we consider *affine* inductive invariants in the form of an affine inequality over program variables, and focus on affine while loops that have affine loop guards (as a conjunction of affine inequalities) and affine updates for the program variables but do not have nested loops.

To automate the generation of affine inductive invariants, we adopt the *constraint-solving* based approach with three steps. First, it establishes a template with unknown parameters for the target invariants. Second, it collects constraints derived from the inductive conditions. Finally, it solves the unknown parameters to get the desired invariants. Prior work in this space [12,37] leverages Farkas' lemma to provide a sound and complete characterization for the inductive conditions and then generates the affine inductive invariants either by the complete approach of quantifier elimination [12] or through several heuristics [37]. Specifically, the StInG invariant generator [40] implements the approach in [37], and the InvGen invariant generator [22] integrates abstract interpretation as well as the approach in [37]. Furthermore, a recent effort [34] leverages eigenvalues and eigenvectors for inferring a restricted class of invariants. Finally, some recent work considers decidable logic fragments that directly verify properties of loops [4,11,28,30]. Compared with other approaches such as machine learning and dynamic analysis, constraint solving has a theoretical guarantee on the correctness and accuracy of the generated invariants, yet typically at the cost of higher runtime complexity.

The novelty of our approach is that it completely addresses the constraints derived from Farkas' lemma by matrix methods, thus ensuring both generality and efficiency. In detail, this paper makes the following contributions (due to the page limit, the current paper is abridged; the full version is available at [25]):


– For the case of a tautological loop guard, we show that the eigenvalues and eigenvectors of the matrices derived from the variable updates encompass all meaningful affine inductive invariants.
– For the general case of a conjunctive loop guard, we completely address invariant generation by first establishing through matrix inverse the relationship between the invariants and a key parameter in the application of Farkas' lemma, then solving the feasible domain of the key parameter from the inductive conditions, and finally showing that it suffices to choose a finite number of values for the key parameter if one imposes a tightness condition on the invariants.

– We generalize our results to affine while loops with non-deterministic updates and to bidirectional affine invariants. A continuity property on the invariants w.r.t. the key parameter is also proved for tackling the numerical issue arising from the computation of eigenvectors. Experimental results on existing benchmarks and new benchmarks arising from linear dynamical systems demonstrate the generality and efficiency of our approach.

#### **1.1 Related Work**

*Constraint Solving.* Several prior approaches [12,37] use constraint solving for invariant generation based on Farkas' lemma. Compared to the approach in [12], which uses quantifier elimination to solve the constraints from Farkas' lemma, our approach is more efficient since it only involves matrix computation. Compared with [37], which uses several heuristics, our approach is more general and complete in addressing all cases of affine invariant generation. While the approach in [34] also uses eigenvectors, it is restricted to the subclass of equality and convergent invariants; in contrast, our approach targets general affine inductive invariants over affine while loops. Other prior work [4,11,28,30] develops decidable logics for unnested affine while loops with tautological guards and no conditional branches. Compared with them, our approach handles general affine while loops and targets invariant generation.

*Abstract Interpretation.* A long thread of research infers inductive invariants using the *abstract interpretation* framework [1,7,22,35], which constructs sound approximations of program semantics. In a nutshell, it first establishes an abstract domain for the specific form of properties to be generated, and then performs fixed-point computation in the abstract domain. The precision of the invariants generated by abstract interpretation depends on the abstract domain and abstract operators, except in rare special cases [21,37].

*Recurrence Analysis.* Another closely related technique is *recurrence analysis* [8,17,24,29,31]. The main idea is to transform the problem of invariant generation into a recurrence relation problem and then solve the latter. The main limitation of recurrence analysis is that it requires the underlying recurrence relation to have a closed-form solution. This requirement, unfortunately, does not hold in the general case of affine inductive invariants over affine while loops.

*Logical Inference.* Invariants can also be obtained through *logical inference*, such as abductive inference [16], Craig interpolation [18], ICE learning [19,43], and random search [38]. These approaches, however, cannot provide any theoretical guarantee on the accuracy of the generated numerical invariants. In contrast, our approach essentially addresses this issue.

*Dynamic Analysis. Dynamic analysis* [33,39] has also been exploited for invariant generation. The process is to first collect the execution traces of a program by running it multiple times, and then guess invariants based on these traces. As this process indicates, dynamic analysis provides no guarantee on the correctness or accuracy of the inferred invariants, yet still pays the price of running the program a large number of times.

*Machine Learning.* There is a recent trend of applying *machine learning* [20,23,44] to the invariant-generation problem. Such approaches first establish a (typically large) training set and then use techniques such as neural networks to generate invariants. Compared to our approach, these approaches require a large training set while still offering no theoretical guarantee on correctness or accuracy. In particular, such approaches cannot produce the specific numerical values (e.g., eigenvalues) required to handle some examples in this work.

# **2 Preliminaries**

In this section, we specify the class of affine while loops and define the affine-invariant-generation problem over such loops. Throughout the paper, we use V = {x1, ..., xn} to denote the set of program variables in an affine while loop; we abuse the notation V so that it also represents the current values (before the execution of the loop body) of the original variables in V, and use the primed variables V′ := {x′ | x ∈ V} for the next values (after the execution of the loop body). Furthermore, we denote by **x** = [x1, ..., xn]<sup>T</sup> the vector variable that represents the current values of the program variables, and by **x**′ = [x′1, ..., x′n]<sup>T</sup> the vector variable for the next values.

An *affine while loop* is a while loop without nested loops that has affine updates in each assignment statement and possibly multiple conditional branches in the loop body. To formally specify its syntax, we first define affine inequalities and assertions, program states, and the satisfaction relation between them.

*Affine Inequalities and Assertions.* An *affine inequality* φ is an inequality of the form **c**<sup>T</sup> · **y** + d ≤ 0, where **c** is a real vector, **y** is a vector of real-valued variables and d is a real scalar. An *affine assertion* is a finite conjunction of affine inequalities. An affine assertion is *satisfiable* if it is true under some assignment of real values to its variables. Given an affine assertion ψ over the vector variable **x**, we denote by ψ′ the affine assertion obtained by substituting **x** in ψ with its next-value variable **x**′.

*Program States.* A *program state* **v** is a real vector **v** = [v1, ..., vn]<sup>T</sup> such that each vi is a concrete value for the variable xi (in the vector variable **x**). We say that a program state **v** satisfies an affine inequality φ = **c**<sup>T</sup> · **x** + d ≤ 0, written **v** |= φ, if **c**<sup>T</sup> · **v** + d ≤ 0. Likewise, **v** satisfies an affine assertion ψ if it satisfies every conjunct of ψ. Furthermore, given an affine assertion ψ over both **x** and **x**′, we say that two program states **v**, **v**′ satisfy ψ, written **v**, **v**′ |= ψ, if ψ is true when one substitutes **x** by **v** and **x**′ by **v**′. We then illustrate the syntax of (unnested) affine while loops as follows.
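The satisfaction relation is straightforward to compute; for instance (a toy Python check, with made-up numbers):

```python
def satisfies(v, c, d):
    """Check v |= (c^T . x + d <= 0) for a concrete program state v."""
    return sum(ci * vi for ci, vi in zip(c, v)) + d <= 0

# state v = [1, 2] against the affine inequality x1 + x2 - 5 <= 0
print(satisfies([1.0, 2.0], [1.0, 1.0], -5.0))   # 1 + 2 - 5 <= 0, prints True
```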

*Affine While Loops.* We consider affine while loops that take the form:

$$\begin{array}{l}\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \\ \text{while } G: \mathbf{P} \cdot \mathbf{x} + \mathbf{q} \le \mathbf{0} \text{ do} \\ \mathbf{case } \psi\_1: \mathbf{T}\_1 \cdot \mathbf{x} - \mathbf{T}\_1' \cdot \mathbf{x}' + \mathbf{b}\_1 \le \mathbf{0} \text{ (}\tau\_1\text{)}; \\ \vdots \\ \mathbf{case } \psi\_k: \mathbf{T}\_k \cdot \mathbf{x} - \mathbf{T}\_k' \cdot \mathbf{x}' + \mathbf{b}\_k \le \mathbf{0} \text{ (}\tau\_k\text{)}; \\ \text{end} \end{array} \tag{\dagger}$$

where (i) θ is an affine assertion that specifies the initial condition for inputs and is given by the real matrix **R** and vector **f**, (ii) G is an affine assertion serving as the loop guard, given by the real matrix **P** and vector **q**, and (iii) each ψj is an affine assertion that represents a conditional branch, with the relationship between the current-state vector **x** and the next-state vector **x**′ given by the affine assertion τj := **T**j · **x** − **T**′j · **x**′ + **b**j ≤ **0** with transition matrices **T**j, **T**′j and vector **b**j. In this work, we always assume that the rows of **R** are linearly independent (this condition means that every variable xi has one independent initial condition attached to it, which holds in most situations, such as a fixed initial program state), so that **R**<sup>T</sup> is left invertible; we denote its left inverse by (**R**<sup>T</sup>)<sup>−1</sup><sub>L</sub>.
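When the rows of **R** are linearly independent, a left inverse of **R**<sup>T</sup> can be computed explicitly as (**R** **R**<sup>T</sup>)<sup>−1</sup>**R** (a standard linear-algebra fact). A one-row sketch with made-up numbers:

```python
# R has a single row [1, 2], so R R^T is the scalar 5 and the left inverse of
# R^T is (R R^T)^{-1} R = [0.2, 0.4]; we check that it left-inverts R^T.
R = [1.0, 2.0]
RRt = sum(x * x for x in R)                        # R R^T (a scalar here)
L = [x / RRt for x in R]                           # left inverse of R^T
check = sum(L[j] * R[j] for j in range(len(R)))   # L . R^T, should equal 1
```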

The execution of an affine while loop proceeds as follows. First, the loop starts with an arbitrary initial program state **v**∗ that satisfies the initial condition θ. Then, in each loop iteration, the current program state **v** is checked against the loop guard G. If **v** |= G, the loop arbitrarily chooses a conditional branch ψj satisfying **v** |= ψj and sets the next program state **v**′ non-deterministically such that **v**, **v**′ |= τj; the next program state **v**′ then becomes the current program state. Otherwise (i.e., **v** ⊭ G), the loop halts immediately.
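For intuition, here is a tiny deterministic special case in Python (single branch with update x′ = 2x; all numbers made up) that follows the execution scheme just described:

```python
# Execute an affine loop with initial condition x = 1, guard x - 100 <= 0,
# and one deterministic branch x' = 2x; record the traversed program states.
def guard(x):
    return x - 100.0 <= 0.0

x = 1.0            # the (unique) initial state satisfying theta
trace = [x]
while guard(x):
    x = 2.0 * x    # the single branch tau: x' - 2x = 0
    trace.append(x)

# trace = [1, 2, 4, ..., 128]; the loop halts once the guard fails at 128
```

Note that the final state 128 is also traversed (it is reached by the last iteration, and only then does the guard check fail), which matters for the invariant definition below.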

Now we define affine inductive invariants over affine while loops. Informally, an affine inductive invariant is an affine inequality satisfying the initiation and consecution conditions which mean that the inequality holds at the start of the loop (initiation) and is preserved under every iteration of the loop body (consecution).

*Affine Inductive Invariants.* An *affine inductive invariant* for an affine while loop (†) is an affine inequality Φ that satisfies the following initiation and consecution conditions:

– *Initiation:* every program state **v** with **v** |= θ satisfies **v** |= Φ.
– *Consecution:* for every branch j ∈ {1, ..., k} and all program states **v**, **v**′, if **v** |= Φ ∧ G and **v**, **v**′ |= τj, then **v**′ |= Φ.

From the definition above, it can be observed that an affine inductive invariant is indeed an invariant: every program state traversed (as the current state at the start of the loop or after any loop iteration) in some execution of the underlying affine while loop satisfies it.
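As a sanity check, the two conditions can be tested empirically on a toy loop (a Python sketch; the loop and the candidate invariant are made up): take θ: x = 0, guard G: x − 10 ≤ 0, a single branch x′ = x + 1, and candidate invariant Φ: −x ≤ 0.

```python
def Phi(x):                 # candidate invariant  Phi: -x <= 0
    return -x <= 0

def G(x):                   # loop guard  G: x - 10 <= 0
    return x - 10 <= 0

# initiation: the initial state x = 0 satisfies Phi
initiation = Phi(0)

# consecution (sampled): whenever Phi and G hold, Phi holds after x' = x + 1
consecution = all(Phi(x + 1) for x in range(-50, 51) if Phi(x) and G(x))
```

Such sampling can only refute a candidate; the constraint-based machinery of Sect. 3 is what proves the conditions for all real-valued states.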

From now on, we abbreviate affine while loops as affine loops and affine inductive invariants as affine invariants.

*Problem Statement.* In this work, we study the problem of automatically generating affine invariants over affine loops. Our aim is to have a complete mathematical characterization on all such invariants and develop efficient algorithms for generating these invariants.

### **3 Affine Invariants via Farkas' Lemma**

Affine invariant generation through Farkas' lemma was originally proposed in [12,37]. Farkas' lemma is a fundamental result in the theory of linear inequalities that leads to a complete characterization of affine invariants. Since our approach is based on Farkas' lemma, we give a detailed account of the approaches in [12,37] and point out the weakness of each.

**Theorem 1 (Farkas' Lemma).** *Consider the following affine assertion* S *over real-valued variables* y1*, ...,* y*n:*

$$S: \begin{cases} a\_{11}y\_1 + \ldots + a\_{1n}y\_n + b\_1 \le 0 \\ \quad\vdots \\ a\_{k1}y\_1 + \ldots + a\_{kn}y\_n + b\_k \le 0 \end{cases}$$

*When* S *is satisfiable, it entails a given affine inequality*

$$
\phi: c\_1 y\_1 + \ldots + c\_n y\_n + d \le 0
$$

*if and only if there exist non-negative real numbers* λ0, λ1, ..., λk *such that (i)* cj = λ1a1j + ... + λkakj *for* 1 ≤ j ≤ n *and (ii)* d = (λ1b1 + ... + λkbk) − λ0*.*

The application of Farkas' lemma can be visualized by a table form as follows:

$$\begin{array}{c|c} \lambda\_0 & -1 \le 0 \\ \lambda\_1 & a\_{11}y\_1 + \ldots + a\_{1n}y\_n + b\_1 \le 0 \\ \vdots & \vdots \\ \lambda\_k & a\_{k1}y\_1 + \ldots + a\_{kn}y\_n + b\_k \le 0 \\ \hline & c\_1y\_1 + \ldots + c\_ny\_n + d \le 0 \end{array} \tag{\ddagger}$$

The intuition of the table form above is that one first multiplies each λi by its corresponding affine inequality (in the same row), and then sums the resulting inequalities to obtain the affine inequality at the bottom. In this paper, we will refer to this table form as a *Farkas table*.
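For a concrete instance (toy numbers, not from the paper): S = {−y ≤ 0, y − 2 ≤ 0} entails φ: y − 3 ≤ 0 with multipliers (λ0, λ1, λ2) = (1, 0, 1). The two conditions of Theorem 1 can be checked mechanically:

```python
# Farkas certificate check for S: {-y <= 0, y - 2 <= 0} entailing y - 3 <= 0.
a = [[-1.0], [1.0]]          # coefficient rows a_ij of S
b = [0.0, -2.0]              # constants b_i of S
c, d = [1.0], -3.0           # phi: 1*y - 3 <= 0
lam0, lam = 1.0, [0.0, 1.0]  # the non-negative multipliers

# condition (i): c_j = sum_i lam_i * a_ij
cond_i = all(abs(c[j] - sum(lam[i] * a[i][j] for i in range(len(lam)))) < 1e-12
             for j in range(len(c)))
# condition (ii): d = (sum_i lam_i * b_i) - lam0
cond_ii = abs(d - (sum(li * bi for li, bi in zip(lam, b)) - lam0)) < 1e-12
```

Intuitively, 1·(y − 2 ≤ 0) plus 1·(−1 ≤ 0) yields y − 3 ≤ 0, which is exactly the summation reading of the Farkas table.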

Given an affine loop as (†), the approaches in [12,37] first establish a template Φ : c1x1 + ... + cnxn + d ≤ 0 for an affine invariant, where c1, ..., cn, d are the unknown coefficients. Second, they establish constraints on the unknown coefficients from the initiation and consecution conditions, as follows.

*Initiation.* By Farkas' lemma, the initiation condition can be solved from the Farkas table (‡) with $S := \theta$ and $\phi := \Phi$:

$$\begin{array}{c|l} \lambda_0^{\mathrm{I}} & -1 \le 0 \\ \boldsymbol{\lambda} & \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \ (\theta) \\ \hline & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0 \ (\Phi) \end{array} \tag{\#}$$

Here we rephrase the affine inequalities in $\theta$ and $\Phi$ in the condensed matrix forms $\mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0}$ and $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$; we also use $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_k]^{\mathrm{T}}$ to denote the non-negative parameters in the leftmost column of (‡).

*Consecution.* The consecution condition can be solved by handling each conditional branch (specified by $\tau_j$, $\psi_j$ in (†)) separately. By Farkas' lemma, we treat each conditional branch by the Farkas table (‡) with $S := \Phi \wedge G \wedge \tau_j$ and $\phi := \Phi'$:

$$\begin{array}{c|l} \mu & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0 \ (\Phi) \\ \lambda_0^{\mathrm{C}} & -1 \le 0 \\ \boldsymbol{\xi} & \mathbf{P} \cdot \mathbf{x} + \mathbf{q} \le \mathbf{0} \ (G) \\ \boldsymbol{\eta} & \mathbf{T}_j \cdot \mathbf{x} - \mathbf{T}'_j \cdot \mathbf{x}' + \mathbf{b}_j \le \mathbf{0} \ (\tau_j) \\ \hline & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x}' + d \le 0 \ (\Phi') \end{array} \tag{*}$$

Note that the Farkas table above contains quadratic constraints, as we multiply the unknown non-negative parameter $\mu$ with the unknown invariant $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ in the table. The Farkas tables for all conditional branches are grouped conjunctively to represent the whole consecution condition.

The weakness of the approaches in [12,37] lies in the treatment of the quadratic constraints arising from the consecution condition. The approach in [12] addresses the quadratic constraints by quantifier elimination, which guarantees theoretical completeness but typically has high runtime complexity. The approach in [37] solves the quadratic constraints by several heuristics that guess possible values for the key parameter $\mu$ in (∗) which causes the non-linearity, hence losing completeness. Our approach addresses the parameter $\mu$ through matrix-based methods (eigenvalues and eigenvectors, matrix inverses, etc.), which efficiently generates affine invariants (as compared with quantifier elimination in [12]) while still ensuring theoretical completeness (as compared with the heuristics in [37]).

#### **4 Single-Branch Affine Loops with Deterministic Updates**

For the sake of simplicity, we first consider affine invariant generation for a simple class of affine loops where the loop body has no conditional branches and the update of the next-value vector $\mathbf{x}'$ is deterministic.

Formally, an affine loop with deterministic updates and a single branch takes the following form:

> **initial condition** $\theta : \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0}$
> **while** $G$ **do** $\mathbf{x}' = \mathbf{T} \cdot \mathbf{x} + \mathbf{b}$; **end**

For the loop above, we aim at *non-trivial* affine invariants, i.e., affine invariants $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ with $\mathbf{c} \ne \mathbf{0}$. We summarize our results below.


In Sect. 4.1, we first derive the constraints from the initiation (#) and consecution (∗) conditions satisfied by the invariants. Then we solve these constraints for the tautological loop guard case in Sect. 4.2 and the single-constraint loop guard case in Sect. 4.3. Finally we generalize the results to the multi-constraint loop guard case in Sect. 4.4.

#### **4.1 Derived Constraints from the Farkas Tables**

We first derive the constraints from the Farkas tables as follows:

*Initiation.* Recall the Farkas table (#) for initiation. We first compare the coefficients of **x** above and below the horizontal line in (#), and obtain

$$
\lambda^T \cdot \mathbf{R} = \mathbf{c}^T \implies \mathbf{R}^T \cdot \lambda = \mathbf{c}.\tag{1}
$$

Then by comparing the constant terms in (#), we have:

$$-\lambda\_0^\mathrm{I} + \lambda^\mathrm{T} \cdot \mathbf{f} = d \implies \mathbf{f}^\mathrm{T} \cdot \lambda - d = \lambda\_0^\mathrm{I} \ge 0. \tag{2}$$

Note that $\mathbf{R}^{\mathrm{T}}$ has a left inverse $(\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}}$, thus constraint (1) is equivalent to $\boldsymbol{\lambda} = (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} \cdot \mathbf{c}$. Plugging it into (2) yields

$$\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} - d = \lambda^{\mathrm{I}}\_{0} \geq 0. \tag{3}$$
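Numerically, the left inverse $(\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}}$ can be realized by the Moore-Penrose pseudoinverse when $\mathbf{R}^{\mathrm{T}}$ has full column rank. The following sketch (illustrative matrices, numpy assumed; not from the paper) recovers $\boldsymbol{\lambda}$ from $\mathbf{c}$ as in constraint (1):

```python
import numpy as np

# Illustrative data: k = 2 initial constraints over n = 3 variables.
R = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
RT = R.T                                 # n x k, full column rank
left_inv = np.linalg.pinv(RT)            # acts as the left inverse (R^T)_L^{-1}
assert np.allclose(left_inv @ RT, np.eye(2))

lam = np.array([0.5, 2.0])               # some non-negative multipliers
c = RT @ lam                             # the c induced by constraint (1): R^T . lambda = c
assert np.allclose(left_inv @ c, lam)    # lambda = (R^T)_L^{-1} . c recovered
```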

*Consecution.* The Farkas table (∗) for consecution in the case of single-branch affine loops with deterministic updates is as follows:

$$\begin{array}{c|l} \mu & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0 \ (\Phi) \\ \lambda_0^{\mathrm{C}} & -1 \le 0 \\ \boldsymbol{\xi} & \mathbf{P} \cdot \mathbf{x} + \mathbf{q} \le \mathbf{0} \ (G) \\ \boldsymbol{\eta} & \mathbf{T} \cdot \mathbf{x} - \mathbf{x}' + \mathbf{b} = \mathbf{0} \ (\tau) \\ \hline & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x}' + d \le 0 \ (\Phi') \end{array}$$

Here the transition matrix $\mathbf{T}$ is an $n \times n$ square matrix, and $\mathbf{b}$ is an $n$-dimensional vector. Since $\tau$ contains only equalities, the components $\eta_1, \ldots, \eta_n$ of the vector parameter $\boldsymbol{\eta}$ need not be non-negative (while the components of $\boldsymbol{\xi}$ and the scalar $\mu$ must be non-negative). In this table, by comparing the coefficients of $\mathbf{x}'$ above and below the horizontal line, we easily get $-\boldsymbol{\eta} = \mathbf{c}$. Then we substitute $\boldsymbol{\eta}$ by $-\mathbf{c}$ and compare the coefficients of $\mathbf{x}$ above and below the horizontal line (below the line the coefficient of $\mathbf{x}$ is $\mathbf{0}$). We get

$$
\mu \cdot \mathbf{c}^{\mathrm{T}} + \boldsymbol{\xi}^{\mathrm{T}} \cdot \mathbf{P} - \mathbf{c}^{\mathrm{T}} \cdot \mathbf{T} = \mathbf{0}^{\mathrm{T}} \Rightarrow \mu \cdot \mathbf{c} - \mathbf{T}^{\mathrm{T}} \cdot \mathbf{c} + \mathbf{P}^{\mathrm{T}} \cdot \boldsymbol{\xi} = \mathbf{0}.\tag{4}
$$

We also compare the constant terms and get

$$\mu \cdot d - \lambda\_0^{\rm C} + \boldsymbol{\xi}^{\rm T} \cdot \mathbf{q} - \mathbf{c}^{\rm T} \cdot \mathbf{b} = d \Rightarrow (\mu - 1)d - \mathbf{b}^{\rm T} \cdot \mathbf{c} + \mathbf{q}^{\rm T} \cdot \boldsymbol{\xi} = \lambda\_0^{\rm C} \ge 0. \tag{5}$$

The rest of this section is devoted to solving for the invariants $\Phi : \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ which satisfy all the constraints (1)–(5).

#### **4.2 Loops with Tautological Guard**

We first consider the simplest case where the loop guard is '**true**':

$$\begin{array}{ll}\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \\\text{while true do } \mathbf{x'} = \mathbf{T} \cdot \mathbf{x} + \mathbf{b}; \text{ end} \end{array} \tag{\diamond}$$

To solve the non-linear constraints completely, we take three steps:


*Step 1 and Step 2.* We determine the values of $\mu$ and $\mathbf{c}$ via eigenvalues and eigenvectors in the following proposition:

**Proposition 1.** *For any non-trivial invariant* $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ *of the loop ($\diamond$), we have that* $\mathbf{c}$ *must be an eigenvector of* $\mathbf{T}^{\mathrm{T}}$ *with a non-negative eigenvalue* $\mu$*.*

*Proof.* Since the loop guard is a tautology, we take the parameter *ξ* to be **0** in (4):

$$
\mu \cdot \mathbf{c} - \mathbf{T}^{\mathrm{T}} \cdot \mathbf{c} = \mathbf{0}.
$$

Since $\mathbf{c} \ne \mathbf{0}$, it follows that $\mu$ must be a non-negative eigenvalue of $\mathbf{T}^{\mathrm{T}}$ and $\mathbf{c}$ a corresponding eigenvector. $\square$

*Example 1 (Fibonacci numbers).* Consider the sequence $\{s_n\}$ defined by the initial condition $s_1 = s_2 = 1$ and the recursive formula $s_{n+2} = s_{n+1} + s_n$ for $n \ge 1$. If we use variables $(x_1, x_2)$ to represent $(s_n, s_{n+1})$, then the sequence can be written as a loop:

$$\begin{aligned} &\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \mathbf{0} \\ &\text{while } \mathbf{true} \text{ do } \begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix} = \mathbf{T} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \mathbf{b} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \mathbf{0}; \text{ end} \end{aligned}$$

The eigenvalues of the matrix $\mathbf{T}^{\mathrm{T}}$ are $\frac{1-\sqrt{5}}{2}$ and $\frac{1+\sqrt{5}}{2}$; only the second one is non-negative. This eigenvalue $\mu = \frac{1+\sqrt{5}}{2}$ yields the eigenvector $\mathbf{c} = [c_1, \frac{1+\sqrt{5}}{2} c_1]^{\mathrm{T}}$, where $c_1$ is a free variable which will be fixed in the final form of the invariant.
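This eigen-computation can be reproduced numerically (numpy assumed; an illustration, not part of the paper):

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 1.0]])                            # Fibonacci transition matrix
eigvals, eigvecs = np.linalg.eig(T.T)

golden = (1 + np.sqrt(5)) / 2
assert np.isclose(max(eigvals), golden)               # the non-negative eigenvalue
assert np.isclose(min(eigvals), (1 - np.sqrt(5)) / 2)

# The eigenvector for mu = (1+sqrt(5))/2 is proportional to [1, (1+sqrt(5))/2]^T.
v = eigvecs[:, int(np.argmax(eigvals))]
assert np.isclose(v[1] / v[0], golden)
```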

*Step 3.* After solving for $\mu$ and $\mathbf{c}$, we characterize the feasible domain of $d$ and its optimal value in the following proposition:

**Proposition 2.** *For any* μ *and* **c** *given by Proposition 1, the feasible domain of* d *is an interval determined by the two conditions below:*

$$d \le \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} \quad \text{and} \quad (\mu - 1)d \ge \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c}.$$

*If the above conditions have an empty solution set, then no affine invariant is available from such* $\mu$ *and* $\mathbf{c}$*; otherwise, the optimal value of* $d$ *falls in one of the two choices:*

$$d = \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} \quad or \quad (\mu - 1)d = \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c}.$$

*Proof.* Constraint (3) provides one condition for d:

$$\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} - d = \lambda^{\mathrm{I}}\_{0} \ge 0 \implies \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} \ge d;$$

while constraint (5) with *ξ* = **0** provides the other condition:

$$(\mu - 1)d - \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} = \lambda_0^{\mathrm{C}} \ge 0 \implies (\mu - 1)d \ge \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c}.$$

To obtain the strongest inequality $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$, we need to take $d$ to be either its minimal or maximal value, i.e., a boundary point of its interval; the invariant with this $d$ then implies all invariants with the same $\mathbf{c}$ and other $d$'s in this interval. The boundary is achieved when one of the two conditions becomes an equality. $\square$

*Example 2 (Fibonacci, Part 2).* We continue with Example 1. Recall that $\mu = \frac{1+\sqrt{5}}{2}$ and $\mathbf{c} = [c_1, \frac{1+\sqrt{5}}{2} c_1]^{\mathrm{T}}$; in this case, constraints (3) and (5) (with $\boldsymbol{\xi} = \mathbf{0}$) read $-\frac{3+\sqrt{5}}{2} c_1 \ge d$ and $\frac{\sqrt{5}-1}{2} d \ge 0$, hence yield $0 \le d \le -\frac{3+\sqrt{5}}{2} c_1$. The free variable $c_1$ must be negative here, so we choose $c_1 = -2$ and thus $\mathbf{c} = [-2, -1-\sqrt{5}]^{\mathrm{T}}$ and $0 \le d \le 3+\sqrt{5}$; there are two boundary values $d = 0$ and $d = 3+\sqrt{5}$, where $d = 3+\sqrt{5}$ leads to the strongest invariant:

$$\mu = (1+\sqrt{5})/2 : \ -2x_1 - (1+\sqrt{5})x_2 + 3 + \sqrt{5} \le 0.$$
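As a quick empirical check (plain Python; a sanity test along concrete iterates, not a proof), the invariant can be evaluated on the Fibonacci trajectory:

```python
import math

phi = 1 + math.sqrt(5)
x1, x2 = 1.0, 1.0                        # initial condition: s_1 = s_2 = 1
for _ in range(30):
    # invariant: -2*x1 - (1+sqrt(5))*x2 + 3 + sqrt(5) <= 0 (up to rounding)
    assert -2 * x1 - phi * x2 + 3 + math.sqrt(5) <= 1e-9
    x1, x2 = x2, x1 + x2                 # deterministic update x' = T.x
```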

#### **4.3 Loops with Guard: Single-Constraint Case**

Here we study loops with a non-tautological guard. First of all, the eigenvalue method of Sect. 4.2 applies to this case as well; thus for the rest of Sect. 4, we always assume that $\mu$ is not an eigenvalue of $\mathbf{T}$ (and $\mathbf{c}$ is not an eigenvector of $\mathbf{T}^{\mathrm{T}}$), and aim for invariants other than the ones obtained from the eigenvectors.

Let us start with the case that the loop guard consists of only one affine inequality:

$$\begin{array}{ll}\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \\ \text{while } \mathbf{p}^{\mathrm{T}} \cdot \mathbf{x} + q \le 0 \text{ do } \mathbf{x}' = \mathbf{T} \cdot \mathbf{x} + \mathbf{b}; \text{ end} \end{array} \tag{\diamond'}$$

where $\mathbf{p}$ is an $n$-dimensional real vector and $q$ is a real number.

We again take three steps to compute the invariants; these steps differ from the previous case:


*Step 1.* We first establish the relationship between $\mu$ and $\mathbf{c}$ through the constraints. The initiation constraints are still (1) (2) (3), while the consecution constraints (4) (5) become:

$$
\mu \cdot \mathbf{c} - \mathbf{T}^{\mathrm{T}} \cdot \mathbf{c} + \xi \cdot \mathbf{p} = \mathbf{0} \tag{4'}
$$

$$(\mu - 1)d - \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} + \xi \cdot q = \lambda_0^{\mathrm{C}} \ge 0 \tag{5'}$$

where the matrix $\mathbf{P}$ in (4) degenerates to the row vector $\mathbf{p}^{\mathrm{T}}$, and the vectors $\mathbf{q}$, $\boldsymbol{\xi}$ in (5) each have just one component $q$, $\xi$ here. Note that $\xi$ is a non-negative parameter.

In contrast to Sect. 4.2, we assume that $\mu$ is not an eigenvalue of $\mathbf{T}$, and that $\xi \ne 0$. For such $\mu$, we have a new formula to compute $\mathbf{c}$:

**Proposition 3.** *For any non-trivial invariant* $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ *of the loop ($\diamond'$), we have that* $\mathbf{c}$ *is given by*

$$\mathbf{c} = \xi \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p} \quad \text{with} \quad \xi \ge 0 \tag{6}$$

*When* $\mu$ *is fixed, the* $\mathbf{c}$*'s with different* $\xi$*'s are proportional to each other and yield equivalent invariants.*

*Proof.* Since $\mu$ is not an eigenvalue of $\mathbf{T}$, the matrix $\mu \cdot \mathbf{I} - \mathbf{T}^{\mathrm{T}}$ is invertible; thus (4') is equivalent to

$$(\mu \cdot \mathbf{I} - \mathbf{T}^{\mathrm{T}}) \cdot \mathbf{c} = -\xi \cdot \mathbf{p} \Rightarrow \ \mathbf{c} = \xi \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p}. \qquad \square$$

*Example 3 (Fibonacci, Part 3).* We add a loop guard $x_1 \le 10$ to Example 1:

$$\begin{aligned} &\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \mathbf{0} \\ &\text{while } \mathbf{p}^{\mathrm{T}} \cdot \mathbf{x} + q = [1, 0] \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 10 \le 0 \text{ do} \\ &\qquad \begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix} = \mathbf{T} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \mathbf{b} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \mathbf{0}; \text{ end} \end{aligned}$$

and search for more invariants. The formula (6) here reads

$$
\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \frac{\xi}{\mu^2 - \mu - 1} \begin{bmatrix} 1-\mu & -1 \\ -1 & -\mu \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \frac{\xi}{\mu^2 - \mu - 1} \begin{bmatrix} 1 - \mu \\ -1 \end{bmatrix}.
$$
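This instantiation of formula (6) can be spot-checked numerically (numpy assumed; sample values of $\mu$, an illustration rather than part of the paper):

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 1.0]])
p = np.array([1.0, 0.0])
xi = 1.0

# For mu's that are not eigenvalues of T, compare c = xi*(T^T - mu*I)^{-1}.p
# against the closed form xi/(mu^2 - mu - 1) * [1 - mu, -1]^T.
for mu in (2.0, 3.0, 0.5):
    c = xi * np.linalg.inv(T.T - mu * np.eye(2)) @ p
    expected = xi / (mu**2 - mu - 1) * np.array([1 - mu, -1.0])
    assert np.allclose(c, expected)
```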

*Step 2.* With formula (6) in hand, every non-negative value of $\mu$ gives us a vector $\mathbf{c}$; the next step is to find the $\mu$'s for which (1) (2) (3) (5') are all satisfied. We call this set the *feasible domain* of $\mu$.

Notice that (3) and (5') are two inequalities that both contain $d$. When the value of $\mu$ changes, (3) and (5') may conflict with each other, making no invariant available. So the feasible domain consists of the $\mu$'s that make the two inequalities compatible with each other:

**Proposition 4.** *For the loop ($\diamond'$), any feasible* $\mu$ *falls in* $[0, 1) \cup \big(K \cap [1, +\infty)\big)$*, where* $K$ *is the solution set of the following rational inequality in* $\mu$ *(which we call the 'compatibility condition'):*

$$\mathbf{b}^{\mathrm{T}} \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p} - q \le (\mu - 1) \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p}. \tag{7}$$

*Proof.* We multiply both sides of (3) by $(\mu - 1)$ and get

$$(\mu - 1)\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} \cdot \mathbf{c} \le (\mu - 1)d \quad \text{when} \quad 0 \le \mu < 1 \tag{3'}$$

$$(\mu - 1)\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} \cdot \mathbf{c} \ge (\mu - 1)d \quad \text{when} \quad \mu \ge 1 \tag{3''}$$

Comparing them with (5'), we see that (3') and (5') cannot conflict with each other, because both state that $(\mu - 1)d$ is 'larger' than something. However, (3'') and (5') are two inequalities of opposite directions; together they require

$$\mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} - \boldsymbol{\xi} \cdot \boldsymbol{q} \le (\mu - 1)d \le (\mu - 1)\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c}$$

to be compatible. Substituting $\mathbf{c}$ by (6) in the above inequality and cancelling out $\xi > 0$, we obtain the desired inequality:

$$\mathbf{b}^{\mathrm{T}} \cdot (\mathbf{T}^{\mathrm{T}} - \boldsymbol{\mu} \cdot \mathbf{I})^{-1} \cdot \mathbf{p} - q \le (\boldsymbol{\mu} - 1) \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} (\mathbf{T}^{\mathrm{T}} - \boldsymbol{\mu} \cdot \mathbf{I})^{-1} \cdot \mathbf{p}.$$

Every $\mu$ from $[0, 1)$ and $K \cap [1, +\infty)$ leads to a non-trivial invariant satisfying all the constraints (1) (2) (3) (4') (5'). $\square$

*Example 4 (Fibonacci, Part 4).* Let us compute the feasible domain of $\mu$ for Example 3. Inequality (5') is $(\mu - 1)d \ge 10\xi$; inequality (3) is

$$(\mu - 1)[-1, -1] \cdot \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \cdot \mathbf{c} = \frac{\xi(\mu - 1)\mu}{\mu^2 - \mu - 1} \ge (\mu - 1)d \quad (\text{when } \mu \ge 1).$$

We combine them to form the compatibility condition (7) as

$$10 \le \frac{(\mu - 1)\mu}{\mu^2 - \mu - 1} \implies 0 \le -\frac{9(\mu - \frac{5}{3})(\mu + \frac{2}{3})}{(\mu - \frac{1 - \sqrt{5}}{2})(\mu - \frac{1 + \sqrt{5}}{2})} \text{ (when } \mu \ge 1).$$

Its solution domain is $\left(\frac{1+\sqrt{5}}{2}, \frac{5}{3}\right]$. Thus by Proposition 4, the feasible domain of $\mu$ is $[0, 1) \cup \left(\frac{1+\sqrt{5}}{2}, \frac{5}{3}\right]$.
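The endpoint behaviour of this feasible domain can be spot-checked with exact rational arithmetic (plain Python `fractions`; an empirical check, not a proof):

```python
from fractions import Fraction

def compatible(mu):
    # Compatibility condition of Example 4 for mu >= 1:
    # 10 <= (mu - 1)*mu / (mu^2 - mu - 1)
    return 10 <= (mu - 1) * mu / (mu * mu - mu - 1)

assert compatible(Fraction(5, 3))                         # right endpoint 5/3 included
assert compatible(Fraction(81, 50))                       # 1.62, just above (1+sqrt(5))/2
assert not compatible(Fraction(5, 3) + Fraction(1, 100))  # fails beyond 5/3
assert not compatible(Fraction(1))                        # mu = 1 gives ratio 0
```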

*Step 3.* Proposition 4 provides us with a continuum of candidates for $\mu$, and thus produces infinitely many legitimate invariants. We would like to find a basis consisting of finitely many invariants, such that all invariants are non-negative linear combinations of the basis; however, this idea does not work out, for reasons explained thoroughly in the full version of this paper [25, Appendix A.1 and A.2]. Instead, we impose a weaker form of optimality called *tightness*, coming from the equality cases of constraints (3) and (5'):

$$\begin{aligned} \mathbf{f}^T \cdot (\mathbf{R}^T)\_{\mathbf{L}}^{-1} \cdot \mathbf{c} - d &= \lambda\_0^\mathrm{I} = 0 \\ (\mu - 1)d - \mathbf{b}^T \cdot \mathbf{c} + \xi \cdot q &= \lambda\_0^\mathrm{C} = 0 \end{aligned}$$

We call an invariant *tight*, and the corresponding $\mu$ a *tight choice*, when both equalities are achieved.


The non-tight choices can be kept as backup for invariant generation. The tight choices are characterized by the following proposition:

**Proposition 5.** *For the loop ($\diamond'$), the tight choices of* $\mu$ *consist of* $0$ *and the positive real roots of the following rational equation:*

$$\mathbf{b}^{\mathrm{T}} \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p} - q = (\mu - 1) \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p}. \tag{8}$$

*Note that these roots are also the boundary points of the intervals in* K *defined in Proposition 4.*

*Proof.* Recall from Proposition 2 that constraints (3) and (5) form the two boundaries of the domain of $d$, which cannot be achieved simultaneously in the case of loops with a tautological guard. Nevertheless, in the case of loops with a guard, we have extra freedom in $\mu$ which allows us to set $\lambda_0^{\mathrm{I}} = \lambda_0^{\mathrm{C}} = 0$:

$$\begin{aligned} \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} &= d \wedge (\mu - 1)d = \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} - \xi \cdot q \\ \Rightarrow \quad \mathbf{b}^{\mathrm{T}} \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p} - q &= (\mu - 1)\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{p}. \end{aligned}$$

Equation (8) is just the case where (7) achieves equality, hence is a rational equation in $\mu$ with a finite number of roots. These roots are also the boundary points of $K$, since $K$ is the solution domain of (7). Besides the roots of (8), $\mu = 0$ is also a boundary point of the feasible domain; its corresponding invariant reflects the feature of the loop guard itself. Thus we add it to the list of tight choices. $\square$

With $\mu$ determined and $\mathbf{c}$ fixed up to a scaling factor, the last remaining task is to determine the optimal $d$. The strategy here is similar to Proposition 2:

**Proposition 6.** *Suppose* μ *is from the feasible domain and* **c** *is given by Proposition 3. Then the optimal value of* d *is determined by one of the two choices below:*

$$\mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} - \xi \cdot q = (\mu - 1)d \quad or \quad \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c} = d.$$

The proof is omitted here and can be found in our full version [25].

*Example 5 (Fibonacci, Part 5).* Remember that

$$
\begin{bmatrix} c\_1 \\ c\_2 \end{bmatrix} = \frac{\xi}{\mu^2 - \mu - 1} \begin{bmatrix} 1 - \mu \\ -1 \end{bmatrix} \text{ and the feasible domain of } \mu \text{ is } [0, 1) \cup (\frac{1 + \sqrt{5}}{2}, \frac{5}{3}].
$$

We compute the tight choices of μ and tight invariants. The equation (8) here is

$$0 = \frac{-9\mu^2 + 9\mu + 10}{\mu^2 - \mu - 1} = -\frac{9(\mu - \frac{5}{3})(\mu + \frac{2}{3})}{(\mu - \frac{1 - \sqrt{5}}{2})(\mu - \frac{1 + \sqrt{5}}{2})}$$

which has only one positive root $\mu = \frac{5}{3}$. By Propositions 5 and 6, we get two invariants:

$$\begin{aligned} \mu &= 0: & -x\_1 + x\_2 - 10 \le 0; \\ \mu &= 5/3: & -2x\_1 - 3x\_2 + 5 \le 0. \end{aligned}$$
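Both invariants can be checked empirically on every reachable state of the guarded loop (plain Python; a sanity check, not a proof):

```python
x1, x2 = 1, 1                            # initial condition s_1 = s_2 = 1
states = [(x1, x2)]
while x1 - 10 <= 0:                      # loop guard p^T.x + q <= 0
    x1, x2 = x2, x1 + x2                 # Fibonacci update
    states.append((x1, x2))

for a, b in states:
    assert -a + b - 10 <= 0              # mu = 0 invariant
    assert -2 * a - 3 * b + 5 <= 0       # mu = 5/3 invariant
```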

#### **4.4 Loops with Guard: Multi-constraint Case**

After settling the single-constraint loop guard case, we consider the more general loop guard which contains the conjunction of multiple affine constraints:

$$\begin{array}{ll}\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \\ \text{while } \mathbf{P} \cdot \mathbf{x} + \mathbf{q} \le \mathbf{0} \text{ do } \mathbf{x}' = \mathbf{T} \cdot \mathbf{x} + \mathbf{b}; \text{ end} \end{array} \tag{\diamond''}$$

where the loop guard **<sup>P</sup>** · **<sup>x</sup>** <sup>+</sup> **<sup>q</sup>** <sup>≤</sup> **<sup>0</sup>** contains <sup>m</sup> affine inequalities.

We can easily generalize the results of Sect. 4.3 to this case. First of all, we generalize Proposition 3: one simply needs to modify the formula (6) into

$$\mathbf{c} = (\mathbf{T}^{\mathrm{T}} - \boldsymbol{\mu} \cdot \mathbf{I})^{-1} \mathbf{P}^{\mathrm{T}} \cdot \boldsymbol{\xi} \quad \text{with} \quad \boldsymbol{\xi} \ge \mathbf{0} \tag{6'}$$

Here $\boldsymbol{\xi}$ is a free non-negative $m$-dimensional vector parameter. With a fixed $\mu$, we take $\boldsymbol{\xi}$ to traverse all vectors of the standard basis $\{\mathbf{e}_1, \ldots, \mathbf{e}_m\}$ to get $m$ conjunctive invariants.

Next, we generalize Proposition 4 which describes the feasible domain of μ:

**Proposition 7.** *For the loop ($\diamond''$), the feasible domain of* $\mu$ *is* $[0, 1) \cup \big(K \cap [1, +\infty)\big)$*, where* $K$ *is the solution set of the following generalized compatibility condition:*

$$\mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} - \mathbf{q}^{\mathrm{T}} \cdot \boldsymbol{\xi} \le (\mu - 1)d \le (\mu - 1)\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}\_{\mathrm{L}} \cdot \mathbf{c}$$

*Substituting* $\mathbf{c}$ *by (6') and taking* $\boldsymbol{\xi}$ *to traverse all vectors of the standard basis (in order for all constraints in the loop guard to be satisfied by the invariant), the above condition is completely decoded as* $m$ *conjunctive inequalities:*

$$\begin{aligned} \mathbf{u}(\mu) &:= \mathbf{b}^{\mathrm{T}} \cdot (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \mathbf{P}^{\mathrm{T}} - \mathbf{q}^{\mathrm{T}} \\ &\le \mathbf{w}(\mu) := (\mu - 1) \mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})^{-1}_{\mathrm{L}} (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \mathbf{P}^{\mathrm{T}} \end{aligned} \tag{7'}$$

*where* $\mathbf{u}(\mu)$, $\mathbf{w}(\mu)$ *are two* $m$*-dimensional vector functions in* $\mu$*. The meaning of (7') is that the* $i$*-th component of* $\mathbf{u}(\mu)$ *is no larger than the* $i$*-th component of* $\mathbf{w}(\mu)$ *for all* $1 \le i \le m$*; when* $m = 1$*, it reduces to (7).*

At last, we consider the tight choices of $\mu$. The first idea that comes to mind is to repeat Proposition 5: set $\lambda_0^{\mathrm{I}} = \lambda_0^{\mathrm{C}} = 0$ for arbitrary $\boldsymbol{\xi}$ such that the generalized compatibility condition achieves equality, i.e., $\mathbf{u}(\mu) = \mathbf{w}(\mu)$; however, this is the conjunction of $m$ rational equations and probably has no solution.

Thus we use a different idea: recall that in the single-constraint case, the tight choices are also the (positive) boundary points of $K$, along with $0$; we adopt this property as the definition in the multi-constraint case:

**Definition 1.** *For the loop ($\diamond''$), the tight choices of* $\mu$ *consist of* $0$ *and the (positive) boundary points of the domain* $K$ *defined in Proposition 7.*

The generalized compatibility condition (7') contains $m$ inequalities; at each (positive) boundary point of $K$, at least one inequality achieves equality and all other inequalities are satisfied (equivalently, $\lambda_0^{\mathrm{I}} = \lambda_0^{\mathrm{C}} = 0$ is achieved for at least one non-trivial valuation of the free vector parameter $\boldsymbol{\xi}$). This is indeed a natural generalization of Proposition 5.

*Example 6.* We consider the loop:

$$\begin{aligned} &\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \mathbf{0} \\ &\text{while } \mathbf{P} \cdot \mathbf{x} + \mathbf{q} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -10 \\ -5 \end{bmatrix} \le \mathbf{0} \text{ do} \\ &\qquad \begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix} = \mathbf{T} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \mathbf{b} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \end{bmatrix}; \text{ end} \end{aligned}$$

There is one eigenvalue μ = 1 with geometric multiplicity 2; we solve three independent invariants from it:

$$x\_1 + x\_2 - 2 \le 0, \ x\_1 + x\_2 - 2 \ge 0; \ -x\_1 + x\_2 \le 0.$$

Next we find the other invariants from the tight $\mu$'s. In this case (7') reads $\frac{11-10\mu}{1-\mu} \le 1 \wedge \frac{6-5\mu}{1-\mu} \le -1$ (when $\mu > 1$). Then $K = (1, \frac{10}{9}] \cap (1, \frac{7}{6}] = (1, \frac{10}{9}]$ and the feasible domain of $\mu$ is $[0, 1) \cup (1, \frac{10}{9}]$. The tight choices are $0$ and $\frac{10}{9}$ (taking $\boldsymbol{\xi}$ to be $[1, 0]^{\mathrm{T}}$, $[0, 1]^{\mathrm{T}}$ respectively yields the two conjunctive invariants for each $\mu$):

$$\begin{aligned} \mu = 0 &: x_1 - 11 \le 0 \land -x_2 - 6 \le 0; \\ \mu = 10/9 &: -x_1 + 1 \le 0 \land x_2 - 1 \le 0. \end{aligned}$$
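The eigenvalue-based invariants and the $\mu = 10/9$ invariants can likewise be checked on every reachable state of this loop (plain Python; an empirical sanity check, not a proof):

```python
x1, x2 = 1, 1                            # initial condition
states = [(x1, x2)]
while x1 - 10 <= 0 and -x2 - 5 <= 0:     # loop guard P.x + q <= 0
    x1, x2 = x1 + 1, x2 - 1              # update x' = x + b
    states.append((x1, x2))

for a, b in states:
    assert a + b - 2 == 0                # x1 + x2 - 2 <= 0 and x1 + x2 - 2 >= 0
    assert -a + b <= 0
    assert -a + 1 <= 0 and b - 1 <= 0    # mu = 10/9 invariants
```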

#### **5 Generalizations**

In this section, we extend the theory developed in Sect. 4 in two directions. For one direction, we consider invariants $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ for affine loops in the general form (†): we derive the relationship between $\mu$ and $\mathbf{c}$, as well as the feasible domain and the tight choices of $\mu$. For the other direction, we stick to single-branch affine loops with deterministic updates and a tautological guard ($\diamond$), but generalize the invariants to the bidirectional-inequality form $d_1 \le \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} \le d_2$; we apply the eigenvalue method to solve for such invariants. At the end of the section, we also briefly discuss some other possible generalizations.

#### **5.1 Affine Loops with Non-deterministic Updates**

In Sect. 4, we handled loops with deterministic updates; here we generalize the results to the non-deterministic case in the form of (†). We focus on single-branch loops here, because multi-branch ones can be handled similarly by taking the conjunction of all branches, as illustrated in the full version of this paper [25, Appendix A.3].

$$\begin{array}{ll}\text{initial condition } \theta: \mathbf{R} \cdot \mathbf{x} + \mathbf{f} \le \mathbf{0} \\ \text{while } \mathbf{P} \cdot \mathbf{x} + \mathbf{q} \le \mathbf{0} \text{ do } \mathbf{T} \cdot \mathbf{x} - \mathbf{T}' \cdot \mathbf{x}' + \mathbf{b} \le \mathbf{0}; \text{ end} \end{array} \tag{\dagger'}$$

For this general form, the initiation constraints are still (1) (2) (3), while the consecution constraints from Farkas table (∗) are

$$
\mu \cdot \mathbf{c} + \mathbf{P}^{\mathrm{T}} \cdot \xi + \mathbf{T}^{\mathrm{T}} \cdot \eta = \mathbf{0} \tag{9}
$$

$$-(\mathbf{T}')^{\mathrm{T}} \cdot \boldsymbol{\eta} = \mathbf{c} \tag{10}$$

$$(\mu - 1)d + \mathbf{q}^{\mathrm{T}} \cdot \boldsymbol{\xi} + \mathbf{b}^{\mathrm{T}} \cdot \boldsymbol{\eta} = \lambda_0^{\mathrm{C}} \ge 0 \tag{11}$$

with $\boldsymbol{\xi}, \boldsymbol{\eta} \ge \mathbf{0}$. The relationship between $\mathbf{c}$ and $\boldsymbol{\eta}$ is given by (10); plugging it into (9) yields

$$\left(\mathbf{T}^{\mathrm{T}} - \boldsymbol{\mu} \cdot (\mathbf{T}')^{\mathrm{T}}\right) \cdot \boldsymbol{\eta} + \mathbf{P}^{\mathrm{T}} \cdot \boldsymbol{\xi} = \mathbf{0}.\tag{9'}$$

Hence for any non-trivial invariant $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ of the loop (†'), we have $\mathbf{c} = -(\mathbf{T}')^{\mathrm{T}} \cdot \boldsymbol{\eta}$, where $\boldsymbol{\eta}$ is characterized differently in the following three cases:


For Case 2 and Case 3, we have a continuum of candidates for $\mu$. The feasible domain of $\mu$ is given by $[0, 1) \cup K \cap [1, +\infty) \cap J$, where $K$ is the solution set of the following compatibility condition (obtained by combining constraints (3) and (11)):

$$\mathbf{b}^{\mathrm{T}} \cdot \boldsymbol{\eta}(\mu) + \mathbf{q}^{\mathrm{T}} \cdot \boldsymbol{\xi} \ge (\mu - 1)\,\mathbf{f}^{\mathrm{T}} \cdot (\mathbf{R}^{\mathrm{T}})\_{\mathrm{L}}^{-1} \cdot (\mathbf{T}')^{\mathrm{T}} \cdot \boldsymbol{\eta}(\mu)$$

and $J$ is the solution set of the constraints $\boldsymbol{\eta}(\mu) \ge \mathbf{0}$. Here both $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$, as free non-negative vector parameters, are taken to traverse all standard basis vectors, in the same way as in Proposition 7. The tight choices of $\mu$ consist of 0 and the positive boundary points of $K \cap J$, in the same sense as in Definition 1.

#### **5.2 An Extension to Bidirectional Affine Invariants**

Here we restrict ourselves to single-branch affine loops with deterministic updates and a tautological loop guard ( ), but aim for invariants of the bidirectional-inequality form $d_1 \le \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} \le d_2$. This is simply the conjunction of two affine inequalities, $\Phi_1 : -\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d_1 \le 0$ and $\Phi_2 : \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} - d_2 \le 0$. We have the following proposition:

**Proposition 8.** *For any bidirectional invariant* $d_1 \le \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} \le d_2$ *of the loop ( ), the vector* $\mathbf{c}$ *must be an eigenvector of* $\mathbf{T}^{\mathrm{T}}$ *with a negative eigenvalue.*

*Proof.* We can easily write down the initiation condition $\theta \models (\Phi_1 \wedge \Phi_2)$ and the corresponding constraints (with $\boldsymbol{\lambda}$, $\tilde{\boldsymbol{\lambda}}$ being two different vector parameters):

$$\mathbf{R}^{\mathrm{T}} \cdot \boldsymbol{\lambda} = \mathbf{c}, \quad \mathbf{f}^{\mathrm{T}} \cdot \boldsymbol{\lambda} + d\_2 = \lambda\_0^{\mathrm{I}} \ge 0; \quad \mathbf{R}^{\mathrm{T}} \cdot \tilde{\boldsymbol{\lambda}} = -\mathbf{c}, \quad \mathbf{f}^{\mathrm{T}} \cdot \tilde{\boldsymbol{\lambda}} - d\_1 = \tilde{\lambda}\_0^{\mathrm{I}} \ge 0.$$

However, there are two possible ways to propose the consecution condition:

$$(\Phi\_1 \wedge \tau \models \Phi\_1' \ \text{ and } \ \Phi\_2 \wedge \tau \models \Phi\_2') \quad \text{or} \quad (\Phi\_1 \wedge \tau \models \Phi\_2' \ \text{ and } \ \Phi\_2 \wedge \tau \models \Phi\_1')$$

If we choose the first, nothing differs from what we did in Sect. 4.2. We therefore choose the second, letting the two inequalities establish each other inductively. Hence the Farkas tables are

$$\begin{array}{c|l} \mu & -\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d\_1 \le 0 \ (\Phi\_1) \\ \lambda\_0^{\mathrm{C}} & -1 \le 0 \\ -\mathbf{c} & \mathbf{T} \cdot \mathbf{x} - \mathbf{x}' + \mathbf{b} = \mathbf{0} \ (\tau) \\ \hline & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x}' - d\_2 \le 0 \ (\Phi\_2') \end{array} \qquad \begin{array}{c|l} \widetilde{\mu} & \mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} - d\_2 \le 0 \ (\Phi\_2) \\ \widetilde{\lambda}\_0^{\mathrm{C}} & -1 \le 0 \\ \mathbf{c} & \mathbf{T} \cdot \mathbf{x} - \mathbf{x}' + \mathbf{b} = \mathbf{0} \ (\tau) \\ \hline & -\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x}' + d\_1 \le 0 \ (\Phi\_1') \end{array}$$

We write out the constraints of consecution:

$$-\mu \cdot \mathbf{c} = \mathbf{T}^{\mathrm{T}} \cdot \mathbf{c} = -\widetilde{\mu} \cdot \mathbf{c} \tag{12}$$

$$\mu \cdot d\_1 + d\_2 - \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} = \lambda\_0^{\mathrm{C}} \ge 0, \quad -\widetilde{\mu} \cdot d\_2 - d\_1 + \mathbf{b}^{\mathrm{T}} \cdot \mathbf{c} = \widetilde{\lambda}\_0^{\mathrm{C}} \ge 0$$

The proposition is then verified by (12), since $\mu, \widetilde{\mu} \ge 0$. $\square$

*Example 7 (Fibonacci, Part 6).* Recall that in this example we have the negative eigenvalue $\frac{1-\sqrt{5}}{2}$. It yields the eigenvector $\mathbf{c} = [c_1, \frac{1-\sqrt{5}}{2}c_1]^{\mathrm{T}}$. The other constraints are computed as:

$$\begin{aligned} -(3-\sqrt{5})c\_1/2 + d\_2 &= \lambda\_0^{\mathrm{I}} \ge 0, & (3-\sqrt{5})c\_1/2 - d\_1 &= \widetilde{\lambda}\_0^{\mathrm{I}} \ge 0, \\ -(1-\sqrt{5})d\_1/2 + d\_2 &= \lambda\_0^{\mathrm{C}} \ge 0, & (1-\sqrt{5})d\_2/2 - d\_1 &= \widetilde{\lambda}\_0^{\mathrm{C}} \ge 0. \end{aligned}$$

If we choose $c_1 = 2$, $\lambda_0^{\mathrm{I}} = 0 = \widetilde{\lambda}_0^{\mathrm{C}}$ (or $c_1 = -2$, $\widetilde{\lambda}_0^{\mathrm{I}} = 0 = \lambda_0^{\mathrm{C}}$), we get an invariant

$$\mu = \left|\frac{1 - \sqrt{5}}{2}\right| : \quad 2(2 - \sqrt{5}) \le 2x\_1 + (1 - \sqrt{5})x\_2 \le 3 - \sqrt{5}$$

which reflects the 'golden ratio' property of the Fibonacci numbers.
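The invariant can be checked numerically. The sketch below iterates the Fibonacci update (assuming, as an illustration, the initial state $(x_1, x_2) = (1, 1)$ and the update $x_1' = x_2$, $x_2' = x_1 + x_2$) and asserts that the linear form stays within the two bounds.

```python
import math

d1 = 2 * (2 - math.sqrt(5))        # lower bound, approx. -0.472
d2 = 3 - math.sqrt(5)              # upper bound, approx.  0.764

def value(x1, x2):
    """The linear form c^T x with c = [2, 1 - sqrt(5)]."""
    return 2 * x1 + (1 - math.sqrt(5)) * x2

x1, x2 = 1, 1                      # assumed initial state of the loop
for _ in range(50):
    assert d1 - 1e-6 <= value(x1, x2) <= d2 + 1e-6
    x1, x2 = x2, x1 + x2           # deterministic Fibonacci update
```

In fact the form contracts by the eigenvalue at each step, value(x2, x1 + x2) = ((1 − √5)/2) · value(x1, x2), which is why the two bounds are never left.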

*Remark 1.* Generalizing bidirectional affine invariants to loops with a non-tautological guard or multiple branches is practicable, but with some restrictions. The main restriction is that we need the affine loop guard to also be bidirectional for our approach to bidirectional affine invariants to work. The issue of multiple branches is not critical, since bidirectional invariants can be derived in almost the same way as single-inequality invariants (illustrated in the full version [25, Appendix A.3]), the only difference being the adaptation to bidirectional inequalities.

#### **5.3 Other Possible Generalizations**

*Integer-valued Variables.* One direction is to transfer some of the results for affine loops over real-valued variables to loops over integer-valued variables. Our approach is based on Farkas' lemma, which is dedicated to real-valued variables; thus it can only provide a sound, but not exact, treatment of integer-valued variables. An exact treatment of integer-valued variables would require Presburger arithmetic [16] rather than Farkas' lemma.

*Strict-inequality Invariants.* In this work we handle affine invariants with non-strict inequalities. It is natural to also consider affine invariants of the strict-inequality form. For strict inequalities, we could utilize an extended version of Farkas' lemma [6, Corollary 1], so that strict inequalities can be generated either by relaxing the non-strict ones obtained from our method or by restricting the μ value to be positive. Since the Motzkin transposition theorem is the standard theorem for handling strict inequalities, we believe it could achieve similar results, though it may require more tedious manipulations.

### **6 Approximation of Eigenvectors through Continuity**

In Sect. 4.2 and Sect. 5.2, we need to solve the characteristic polynomial of the transition matrix to obtain eigenvalues; however, general polynomials of degree ≥ 5 have no algebraic solution formula, by the Abel–Ruffini theorem. We can construct a number sequence $\{\lambda_i\}$ that approximates an eigenvalue $\lambda$ through root-finding algorithms; however, we cannot approximate the eigenvector of $\lambda$ by solving the kernel of $\mathbf{T}^{\mathrm{T}} - \lambda_i \cdot \mathbf{I}$, since that matrix has trivial kernel. For dimensions ≥ 5, i.e., when an explicit formula for the eigenvalues is unavailable, we introduce a method for approximating the eigenvectors through a continuity property of the invariants:

*Continuity of Invariants w.r.t.* μ*.* In Sect. 4, we have shown that for any invariant $\mathbf{c}^{\mathrm{T}} \cdot \mathbf{x} + d \le 0$ of a single-branch affine loop with deterministic updates, the relationship between $\mathbf{c}$ and $\mu$ is given in two ways:

$$\mathbf{c} = \begin{cases} \text{kernel vector of } \mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I} & \text{when } \det(\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I}) = 0 \\ (\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I})^{-1} \cdot \mathbf{z} & \text{when } \det(\mathbf{T}^{\mathrm{T}} - \mu \cdot \mathbf{I}) \neq 0 \end{cases}$$

with $\mathbf{z} = \mathbf{P}^{\mathrm{T}} \cdot \boldsymbol{\xi}$. Thus $\mathbf{c} = \mathbf{c}(\mu)$ can be viewed as a vector function of $\mu$, expressed differently at eigenvalues than at other points. $\mathbf{c}(\mu)$ is clearly continuous at points other than eigenvalues, while the following proposition describes the continuity of $\mathbf{c}(\mu)$ at the eigenvalues:

**Proposition 9.** *Suppose* $\lambda$ *is a real eigenvalue of* $\mathbf{T}^{\mathrm{T}}$ *with eigenvector* $\mathbf{c}(\lambda)$*, and* $\{\lambda_i\}$ *is a sequence lying in the feasible domain of* $\mu$ *that converges to* $\lambda$*. If* $\lambda$ *has geometric multiplicity* 1*, then the sequence* $\{\mathbf{c}(\lambda_i)\}$ *converges to* $\mathbf{c}(\lambda)$ *as well; otherwise,* $\{\mathbf{c}(\lambda_i)\}$ *converges to* $\mathbf{0}$*.*

Due to lack of space, the proof of Proposition 9 is omitted here; it is available in the full version [25].

*An Algorithmic Approach to the Eigenvalue Method in Dimensions* ≥ 5*.* By Proposition 9, if $\lambda$ has geometric multiplicity 1, we can compute $\mathbf{c}(\lambda_i) = (\mathbf{T}^{\mathrm{T}} - \lambda_i \cdot \mathbf{I})^{-1} \cdot \mathbf{z}$ (in the case of a tautological loop guard, we simply replace $\mathbf{z}$ by any non-zero $n$-dimensional real vector) to approximate the eigenvector $\mathbf{c}(\lambda)$. On the other hand, if $\lambda$ has geometric multiplicity > 1, one can adopt the least-squares approximation presented in [5, Section 8.9]. Although the least-squares approximation applies to eigenvalues of arbitrary geometric multiplicity, our method is much easier to implement and more efficient.
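On the 2×2 Fibonacci example the eigenvector is known in closed form, so Proposition 9 can be illustrated directly: normalizing $\mathbf{c}(\lambda_i) = (\mathbf{T}^{\mathrm{T}} - \lambda_i \cdot \mathbf{I})^{-1} \cdot \mathbf{z}$ as $\lambda_i \to \lambda$ recovers the eigenvector direction. The sketch below is ours (pure Python, explicit 2×2 inverse); $\mathbf{z}$ is an arbitrary non-zero vector, as in the tautological-guard case.

```python
import math

# T^T for the Fibonacci loop (the matrix happens to be symmetric here)
TT = [[0.0, 1.0], [1.0, 1.0]]
lam = (1 - math.sqrt(5)) / 2      # the negative eigenvalue of T^T
z = [1.0, 1.0]                    # arbitrary non-zero vector

def solve2(mu):
    """Normalized c(mu) = (T^T - mu*I)^{-1} z, via the explicit 2x2 inverse."""
    a, b = TT[0][0] - mu, TT[0][1]
    c, d = TT[1][0], TT[1][1] - mu
    det = a * d - b * c
    x = (d * z[0] - b * z[1]) / det
    y = (-c * z[0] + a * z[1]) / det
    n = math.hypot(x, y)
    s = 1.0 if x >= 0 else -1.0   # fix the sign so directions are comparable
    return (s * x / n, s * y / n)

# closed-form eigenvector direction for lam: [1, lam], normalized
n = math.hypot(1.0, lam)
target = (1.0 / n, lam / n)

# approximants lam + 10^-k stay off the eigenvalue, so the matrix is
# invertible, and the normalized solutions converge to the eigenvector
for k in range(2, 9):
    cx, cy = solve2(lam + 10.0 ** (-k))
    assert abs(cx - target[0]) < 10.0 ** (-k + 1)
    assert abs(cy - target[1]) < 10.0 ** (-k + 1)
```

The convergence rate is linear in the distance $|\lambda_i - \lambda|$, which matches the multiplicity-1 case of Proposition 9.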

# **7 Experimental Results**

*Experiment.* We implemented our automatic invariant-generation algorithm with eigenvalues and tight choices in Python 3.8, using Sage [42] for matrix manipulation. All results were obtained on an Intel Core i7 (2.00 GHz) machine with 64 GB of memory, running Ubuntu 18.04. Our benchmarks are affine loops chosen from benchmarks of the StInG invariant generator [40], linear dynamical systems in [30], loop programs in [41], and other linear dynamical systems arising from well-known linear recurrences such as the Fibonacci and Tribonacci numbers.

*Complexity.* The main bottleneck of our algorithm lies in exactly solving or approximating the real roots of univariate polynomials (for computing eigenvalues and boundary points in our algorithmic approach). The rest includes Gaussian elimination with a single parameter (whose polynomial-time solvability is guaranteed by [26]), matrix inversion, and solving for eigenvectors with fixed eigenvalues, all of which can easily be done in polynomial time. Exact solutions for degrees less than 5 can be obtained by directly applying the solution formulas. The approximation of real roots can be carried out through real root isolation followed by divide-and-conquer (or Newton's method) in each obtained interval, which can be completed in polynomial time (see, e.g., [36] for the polynomial-time solvability of real root isolation). Thus, our approach runs in polynomial time and is much more efficient than the quantifier elimination in [12].
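For illustration, once an isolating interval with a sign change is known, plain bisection (a simple instance of the divide-and-conquer step; the names below are ours) approximates a real root to any fixed precision. Here it recovers the negative eigenvalue of the Fibonacci example from its characteristic polynomial λ² − λ − 1.

```python
def bisect_root(p, lo, hi, tol=1e-12):
    """Bisection on an isolating interval [lo, hi] with p(lo)*p(hi) < 0."""
    assert p(lo) * p(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p(lo) * p(mid) <= 0:   # sign change in the left half
            hi = mid
        else:                     # sign change in the right half
            lo = mid
    return (lo + hi) / 2

# characteristic polynomial of the Fibonacci matrix: x^2 - x - 1
charpoly = lambda x: x * x - x - 1
root = bisect_root(charpoly, -1.0, 0.0)   # isolating interval of (1 - sqrt(5))/2
```

Each iteration halves the interval, so the number of steps is linear in the number of requested bits of precision.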

*Results.* The experimental results are presented in Table 1. In the table, the column 'Loop' gives the name of the benchmark, 'Dim(ension)' gives the number of program variables, 'μ' gives the values obtained through eigenvalues of the transition matrices (marked with e) or through boundary points of the intervals in the feasible domain, and 'Invariants' lists the affine invariants generated by our approach. We compare our approach with the existing generators StInG [40] and InvGen [22], where '=', '>', '' and '=' mean that the generated invariants are identical, more accurate, can only be generated by this work, and incomparable, respectively. Table 2 compares the runtimes of our approach, StInG, and InvGen, measured in seconds. Note that the runtimes of StInG and InvGen were obtained by executing their binaries on our platform.

*Analysis.* StInG [40] implements the constraint-solving method proposed in [12,37], InvGen [22] integrates both the constraint-solving method and abstract interpretation, while our approach uses matrix algebra to refine and upgrade the constraint-solving method. Based on the results in Table 1 and Table 2, we conclude that:


**Table 1.** Experimental Results of Invariants

<sup>1</sup> L stands for the variable LARGE INT in the original program [41]. Note that we modified the loop programs in [41] into affine loops before execution.


**Table 2.** Experimental Results of Execution Time (s)


Summarizing the above, the experimental results demonstrate the wider coverage of the μ value afforded by our approach, and show its generality and efficiency.

**Acknowledgements.** This research is partially funded by the National Natural Science Foundation of China (NSFC) under Grant No. 62172271. We sincerely thank the anonymous reviewers for their insightful comments, which helped improve this paper. We also thank Mr. Zhenxiang Huang and Dr. Xin Gao for their pioneering contributions in the experimental part of this work.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Data-driven Numerical Invariant Synthesis with Automatic Generation of Attributes**

Ahmed Bouajjani<sup>1</sup> , Wael-Amine Boutglay1,2(B) , and Peter Habermehl<sup>1</sup>

<sup>1</sup> Université Paris Cité, IRIF, Paris, France {abou,boutglay,haberm}@irif.fr <sup>2</sup> Mohammed VI Polytechnic University, Ben Guerir, Morocco

**Abstract.** We propose a data-driven algorithm for numerical invariant synthesis and verification. The algorithm is based on the ICE-DT schema for learning decision trees from samples of positive and negative states and implications corresponding to program transitions. The main issue we address is the discovery of relevant attributes to be used in the learning process of numerical invariants. We define a method for solving this problem guided by the data sample. It is based on the construction of a separator that covers positive states and excludes negative ones, consistent with the implications. The separator is constructed using an abstract domain representation of convex sets. The generalization mechanism of the decision tree learning from the constraints of the separator allows the inference of general invariants, accurate enough for proving the targeted property. We implemented our algorithm and showed its efficiency.

**Keywords:** Invariant synthesis · Data-driven program verification

### **1 Introduction**

Invariant synthesis for program safety verification is a highly challenging problem. Many approaches exist for tackling this problem, including abstract interpretation, CEGAR-based symbolic reachability, property-directed reachability (PDR), etc. [3,5,6,8,10,14,17,19]. While those approaches are applicable to large classes of programs, they may have scalability limitations and fail to infer certain types of invariants, such as disjunctive invariants. Emerging data-driven approaches, following the active learning paradigm with various machine learning techniques, have shown their ability to solve efficiently complex instances of the invariant synthesis problem [12,15,16,20,26,30,31]. These approaches are based on the iterative interaction between a *learner* inferring candidate invariants from a *data sample*, i.e., a set of data classified either as positive examples, known to be reachable from the initial states and that therefore must be included in any solution, or negative examples, known to be predecessors of states violating the safety property and that therefore cannot be included in any solution,

This work was supported in part by the French ANR project AdeCoDS.

S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 282–303, 2022. https://doi.org/10.1007/978-3-031-13185-1\_14

and a *teacher* checking the validity of the proposed solutions and providing counterexamples as feedback in case of non-validity. One such data-driven approach is ICE [15] which has shown promising results with its instantiation ICE-DT [16] that uses decision trees for the learning component. ICE is a learning approach tailored for invariant synthesis, where the feedback provided by the teacher can be, in addition to positive and negative examples, implications of the form p → q expressing the fact that if p is in a solution, then necessarily q should also be included in the solution since there is a transition in the program from p to q.

The strength of data-driven approaches is the generalization mechanisms of their learning components, allowing them to find relevant abstractions from a number of examples without exploring the whole state space of the program. In the case of ICE-DT, this is done by a sophisticated construction of decision trees classifying correctly the known positive and negative examples at some point, and taking into account the information provided by the implications. These decision trees, where the tested attributes are predicates on the variables of the program, are interpreted as formulas corresponding to candidate invariants.

However, to apply data-driven methods such as ICE-DT, one needs to have a pool of attributes that are potentially relevant for the construction of the invariant. This is actually a crucial issue. In ICE-DT, as well as in most datadriven methods, finding the predicates involved in the invariant construction is based on systematic enumeration of formulas according to some pre-defined templates or grammars. For instance, in the case of numerical programs, the considered patterns are some special types of linear constraints, and candidate attributes are generated by enumerating all possible values for the coefficients under some fixed bound. While such a brute-force enumeration can be effective in many cases, it represents, in general, an obstacle for both scalability and finding sufficiently accurate inductive invariants in complex cases.

In this paper, we provide an algorithmic method for efficient generation of attributes for data-driven invariant synthesis for numerical programs manipulating integer variables. While enumerative approaches are purely syntactic and do not take into account the data sample, our method is guided by it. We show that this method, when integrated in the ICE-DT schema, leads to a new invariant synthesis algorithm outperforming state-of-the-art methods and tools.

Our method for attributes discovery is based on, given an ICE data sample, computing a *separator* of it as a union of convex sets, i.e., (1) it covers all the positive examples, (2) it does not contain any negative example, and (3) it is consistent with the implications (for every p → q in the sample, if the separator contains p, then it should also contain q). Then, the set of attributes generated is the set of all constraints defining the separator. However, as for a given sample there might be several possible separators, a question is which separators to consider. Our approach is guided by two requirements: (1) we need to avoid big pools of attributes in order to reduce the complexity of the invariant construction process, and (2) we need to avoid having in the pool constraints that are (visibly) unnecessary, e.g., constraints separating positive examples in a region without any negative ones. Therefore, we consider separators satisfying the property that, whenever they contain two convex sets, it is impossible to take their convex union (the smallest convex set containing the union) without including a negative example.

To represent and manipulate algorithmically convex sets, we consider abstract domains, e.g., intervals, octagons, and polyhedra, as they are defined in the abstract interpretation framework and implemented in tools such as APRON [18]. These domains correspond to particular classes of convex sets, defined by specific types of linear constraints. In these domains, the union operation is naturally over-approximated by the *join* operation that computes the best overapproximation of the union in the considered class of convex sets. Then, constructing separators as explained above can be done by iterative application of the join operation while it does not include negative examples.
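The join-until-blocked construction can be sketched on the simplest abstract domain, intervals (boxes): start from point boxes around the positive examples and greedily join two boxes whenever the joined box contains no negative example. The helper names below are ours, the implication-consistency check is omitted for brevity, and a real implementation would use APRON's domains instead.

```python
def join(b1, b2):
    """Interval-domain join: the smallest box containing both boxes."""
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(b1, b2))

def contains(box, pt):
    return all(l <= x <= h for (l, h), x in zip(box, pt))

def separator(positives, negatives):
    """Union of boxes covering all positives and excluding all negatives."""
    boxes = [tuple((x, x) for x in p) for p in positives]  # one point-box each
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                cand = join(boxes[i], boxes[j])
                if not any(contains(cand, neg) for neg in negatives):
                    boxes[j] = cand   # keep the joined box ...
                    del boxes[i]      # ... and drop one of the operands
                    merged = True
                    break
            if merged:
                break
    return boxes

# two clusters of positive states separated by a negative state
pos = [(0, 0), (1, 2), (6, 6), (7, 8)]
neg = [(3, 4)]
boxes = separator(pos, neg)  # the two clusters cannot be joined across neg
```

On this sample the greedy joins stop at two boxes, since joining them would cover the negative point (3, 4).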

Then, this method for generating candidate attributes can be integrated into the ICE-DT schema: in each iteration of ICE loop, given a sample, the learner (1) generates a set of candidate attributes from a separator of the sample, (2) builds a decision tree from these attributes and proposes it as a candidate invariant to the teacher. Then, the teacher (1) checks that the proposed solution is an inductive invariant, and if it is not (2) provides a counterexample to the learner, extending the sample that will be used in the next iteration.

Here a question might be asked: why do we need to construct a decision tree from the constraints of the separator instead of directly proposing the formula defining the separator as a candidate invariant to the teacher? The answer is that the decision tree construction is crucial for generalization. Indeed, given a sample, the constructed separator might be too specialized to that sample and not provide a useful inductive invariant (except in some simple cases). For instance, the constructed separator is a union of *bounded* convex sets (polytopes), while invariants are very often unbounded convex sets (polyhedra). The effect of using decision trees, in this case, is to select the relevant constraints and discard the unnecessary bounds, leading very quickly to an unbounded solution that is general enough to be an inductive invariant. Without this generalization mechanism, the ICE loop would not terminate in such (quite common) cases.

The integration of our method can be made tighter and more efficient by making the process of building separators incremental along the ICE iterations: at each step, after the extension of the sample by the teacher, instead of constructing a separator of the new sample from scratch, the parts of previously computed separators not affected by the last extension of the sample are reused.

We have implemented our algorithm and carried out experiments on the SyGuS-Comp'19 benchmarks. Our method solves significantly more cases than the tools LoopInvGen [25,26], CVC4 [1,27], and Spacer [19], as well as our implementation of the original ICE-DT [16] algorithm (with template-based enumeration of attributes), with very competitive time performances.

**Related Work.** Many learning-based approaches for the verification of numerical programs have been developed recently. One of the earliest is Daikon [11]: given a pool of formulas, it computes likely invariants from program executions. Later approaches were developed for the synthesis of sound invariants; for example, [30] iteratively generates a set of reachable and bad states and classifies them with a combination of half-spaces computed using SVM. In [29], the problem is reformulated as learning geometric concepts in machine learning. The first instantiation of the ICE framework was based on a constraint solver [15]. Later on, it was instantiated using the decision-tree learning algorithm [16]. Both these instantiations require a fixed template for the invariants or the formulas appearing in them. LoopInvGen enumerates predicates on demand using the approach introduced in [26]. This is extended to a mechanism with hybrid enumeration of several domains or grammars [25]. Continuous logic networks were also used to tackle the problem in CLN2INV [28]. Code2Inv [31], the first approach to introduce general deep learning methods to program verification, uses a graph neural network to capture the program structure and reinforcement learning to guide the search heuristic of a particular domain.

The learning approach of ICE and ICE-DT has been generalized to solve problems given as constrained Horn clauses (CHC) in Horn-ICE [12] and HoICE [4]. Outside the ICE framework, [33] proposed a learning approach for solving CHC using decision trees and SVM for the synthesis of candidate predicates from a set of reachable and bad states of the program. The limitation of the non-ICE-based approach is that when the invariant is not inductive, the program has to be rerun, forward and backward, to generate more reachable and bad states.

In more theoretical work, an abstract learning framework for synthesis, introduced in [21], incorporates the principle of CEGIS (counterexample-guided inductive synthesis). A study of overfitting in invariant synthesis was conducted in [25]. ICE was compared with IC3/PDR in terms of complexity in [13]. A generalization of ICE with relative inductiveness [32] can implement IC3/PDR following the paradigm of active learning with a learner and a teacher.

Automatic invariant synthesis and verification has been addressed by many other techniques based on exploring and computing various types of abstract representations of reachable states (e.g., [3,5,6,8,10,14,17,19]). Notice that, although we use abstract domains for representation and manipulation of convex sets, our strategy for exploring the set of potential invariants is different from the ones used typically in abstract interpretation analysis algorithms [8].

#### **2 Safety Verification Using Learning of Invariants**

This section presents the approach we use for solving the safety verification problem. It is built upon the ICE framework [15] and in particular its instantiation with the learning of decision trees [16]. We first define the verification problem.

#### **2.1 Linear Constraints and Safety Verification**

Let $X$ be a set of variables. Linear formulas over $X$ are boolean combinations of linear constraints of the form $\sum_{i=1}^{n} a_i x_i \le b$, where the $x_i$'s are variables in $X$, the $a_i$'s are integer constants, and $b \in \mathbb{Z} \cup \{+\infty\}$. We use linear formulas to reason symbolically about programs with integer variables. Assume we have a program with a set of variables $V$ and let $n = |V|$. A state of the program is a vector of integers in $\mathbb{Z}^n$. Primed versions of these variables are used to encode the transition relation $T$ of the program: for each $v \in V$, we consider a variable $v'$ representing the value of $v$ after the transition. Let $V'$ be the set of primed variables, and consider linear formulas over $V \cup V'$ to define the relation $T$.

The *safety verification problem* consists in, given a set of safe states *Good*, deciding whether, starting from a set of initial states *Init*, all the states reachable by iterative application of $T$ are in *Good*. Dually, this is equivalent to deciding whether, starting from *Init*, it is possible to reach a state in *Bad*, the set of unsafe states (the complement of *Good*). Assuming that the sets *Init* and *Good* can be defined by linear formulas, the safety verification problem amounts to finding an adequate *inductive invariant* $I$ such that the three following formulas are valid:

$$\operatorname{Init}(V) \quad \Rightarrow \ I(V) \tag{1}$$

$$I(V) \quad \Rightarrow \; Good(V) \tag{2}$$

$$I(V) \land T(V, V') \quad \Rightarrow \quad I(V') \tag{3}$$

We are looking for inductive invariants which can be expressed as a linear formula. In that case, the validity of the three formulas is decidable and can be checked with a standard SMT solver.
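On a toy loop, conditions (1)–(3) can be checked by brute force over a bounded state space. This is an illustrative sketch only (the program and all names are ours); in practice the three validity checks are discharged by an SMT solver.

```python
# toy program: x starts at 0 and is incremented while x < 10
init = lambda x: x == 0
trans = lambda x, xp: x < 10 and xp == x + 1   # transition relation T(x, x')
good = lambda x: x <= 10
inv = lambda x: 0 <= x <= 10                   # candidate inductive invariant

states = range(-50, 51)                        # bounded universe for the check

ok_init = all(inv(x) for x in states if init(x))                    # (1)
ok_good = all(good(x) for x in states if inv(x))                    # (2)
ok_ind = all(inv(xp) for x in states if inv(x)
             for xp in states if trans(x, xp))                      # (3)
assert ok_init and ok_good and ok_ind
```

The same three checks, phrased as validity of the implications (1)–(3), are exactly what the teacher asks the SMT solver in the ICE loop below.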

#### **2.2 The ICE Learning Framework**

ICE [15] follows the active learning paradigm to learn adequate inductive invariants of a given program and a given safety property. It consists of an iteratively communicating *learner* and a *teacher* (see Algorithm 1).

```
Input : A transition system and a property: (Init, T, Good)
  Output: An adequate invariant or error
1 initialize ICE-sample S = (S+, S−, S→);
2 while true do
3 J ← Learn(S);
4 (success, counterexample) ← is inductive(J);
5 if success then return J ;
6 else
7 S ← update(S, counterexample);
8 if contradictory(S) then return error;
```
**Algorithm 1:** The main loop of ICE.

In each iteration, in line 3, the *learner*, which does not know anything about the program, synthesizes a candidate invariant (as a formula over the program variables) from a *sample* S (containing information about program states) which is enriched during the learning process. Contrary to other learning methods, the sample S not only contains a set of *positive* states S<sup>+</sup> which should satisfy the invariant, and a set of *negative* states S<sup>−</sup> which should not satisfy the invariant, but also a set of *implications* S<sup>→</sup> of the form s → s', meaning that if s satisfies the invariant, then s' should satisfy it as well (because there is a transition from s to s' in the transition relation of the program). Therefore, an ICE-sample S is a triple (S<sup>+</sup>, S<sup>−</sup>, S<sup>→</sup>), where, to account for the information contained in implications, it is additionally imposed that

$$\forall s \to s' \in S^{\rightarrow}: \text{ if } s \in S^{+}, \text{ then } s' \in S^{+}, \text{ and if } s' \in S^{-}, \text{ then } s \in S^{-} \tag{4}$$

The sample is initially empty (or contains some states whose status, positive or negative, is known). It is assumed that a candidate invariant J proposed by the learner is *consistent* with the sample, i.e., states in S<sup>+</sup> satisfy the invariant J, states in S<sup>−</sup> falsify it, and for implications s → s' ∈ S<sup>→</sup> it is not the case that s satisfies J but s' does not. Given a candidate invariant J provided by the *learner* in line 3, the *teacher*, who knows the transition relation T, checks in line 4 whether J is an inductive invariant; if yes, the process stops: an invariant has been found; otherwise a counterexample is provided and used in line 7 to update the sample for the next iteration. The teacher checks the three conditions an inductive invariant must satisfy (see Sect. 2.1). If (1) is violated, the counterexample is a state s which should be in the invariant because it is in *Init*; therefore s is added to S<sup>+</sup>. If (2) is violated, the counterexample is a state s which should not be in the invariant because it is not in *Good*, and s is added to S<sup>−</sup>. If (3) is violated, the counterexample is an implication s → s' where, if s is in the invariant, s' should also be in it; therefore s → s' is added to S<sup>→</sup>. In all three cases, the sample is updated to satisfy property (4). If this leads to a contradictory sample, i.e., S<sup>+</sup> ∩ S<sup>−</sup> ≠ ∅, the program is incorrect and an error is returned. Notice that, obviously, the loop is in general not guaranteed to terminate.
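The sample update and property (4) can be sketched as follows (class and method names are ours, not from [15]): adding a positive state propagates forward along implications, adding a negative state propagates backward, and S<sup>+</sup> ∩ S<sup>−</sup> ≠ ∅ signals a contradictory sample.

```python
class ICESample:
    def __init__(self):
        self.pos, self.neg, self.imp = set(), set(), set()

    def add_implication(self, s, t):
        self.imp.add((s, t))
        self._close()

    def add_positive(self, s):
        self.pos.add(s)
        self._close()

    def add_negative(self, s):
        self.neg.add(s)
        self._close()

    def _close(self):
        # enforce property (4): S+ is closed forward along implications,
        # S- is closed backward
        changed = True
        while changed:
            changed = False
            for s, t in self.imp:
                if s in self.pos and t not in self.pos:
                    self.pos.add(t); changed = True
                if t in self.neg and s not in self.neg:
                    self.neg.add(s); changed = True

    def contradictory(self):
        return bool(self.pos & self.neg)

S = ICESample()
S.add_implication(1, 2)
S.add_implication(2, 3)
S.add_positive(1)            # forward closure forces 2 and 3 positive
assert S.pos == {1, 2, 3} and not S.contradictory()
S.add_negative(3)            # backward closure makes 2 and 1 negative too
assert S.contradictory()     # the teacher would report an incorrect program
```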

#### **2.3 ICE-DT: Invariant Learning Using Decision Trees**

In [16], the ICE learning framework is instantiated with a learn method, which extends classical decision tree learning algorithms with the handling of implications. In the context of invariant synthesis, *decision trees* are used to classify points from a universe, which is the set of program states. They are binary trees whose inner nodes are labeled by predicates from a set of attributes and whose leaves are either + or −. Attributes are (atomic) formulas over the variables of the program. They can be seen as boolean functions that the decision tree learning algorithm will compose to construct a classifier of the given ICE sample. In our case of numerical programs manipulating integer variables, attributes are linear inequalities. Then, a decision tree can be seen naturally as a quantifier-free formula over program variables.
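Reading a decision tree as a quantifier-free formula can be done by collecting, for every path to a + leaf, the conjunction of its branch conditions, and taking the disjunction over these paths. A small sketch (our encoding; attributes are represented as strings):

```python
# Toy sketch (our encoding) of converting a decision tree into a
# quantifier-free formula: the disjunction, over all paths to '+'
# leaves, of the conjunction of the branch conditions along the path.

def tree_to_formula(tree):
    # tree is ('leaf', '+'/'-') or ('node', attribute, left, right),
    # where left is the subtree for states satisfying the attribute
    paths = []
    def walk(t, conds):
        if t[0] == 'leaf':
            if t[1] == '+':
                paths.append(' ∧ '.join(conds) if conds else 'true')
            return
        _, a, left, right = t
        walk(left, conds + [a])
        walk(right, conds + ['¬(' + a + ')'])
    walk(tree, [])
    return ' ∨ '.join(paths) if paths else 'false'

t = ('node', 'x >= 5', ('leaf', '+'),
     ('node', 'x <= 3', ('leaf', '+'), ('leaf', '-')))
# → 'x >= 5 ∨ ¬(x >= 5) ∧ x <= 3'
```

This is the same shape as the formula obtained in Example 1 below, with the negated branch conditions written explicitly.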

The main idea of the ICE-DT learner (see Algorithm 2) is as follows. Initially, the learner fixes a set of attributes (possibly empty) which is kept in a global variable and updated in successive executions of Learn(S). In line 2, given a sample, the learner checks whether the current set of attributes is sufficient to produce a decision tree corresponding to a formula consistent with the sample. If the check is successful the sample S is changed to S*Attr* taking

```
Input  : An ICE sample S = (S+, S−, S→)
Output : A formula
Global : Attributes, initialized with InitialAttributes
1 Proc Learn(S)
2   (success, SAttr) ← sufficient(Attributes, S);
3   while ¬success do
4     Attributes ← generateAttributes(Attributes, S);
5     (success, SAttr) ← sufficient(Attributes, S);
6   return tree_to_formula(Construct-Tree(SAttr, Attributes))
```
**Algorithm 2:** The ICE-DT learner Learn(S) procedure.

into account information gathered during the check (see below for the details of sufficient(Attributes, S)). If the check fails new attributes are generated with generateAttributes(Attributes, S) until success. Then, a decision tree is constructed in line 6 from the sample S*Attr* by Construct-Tree(S*Attr*, Attributes) which we present below (Algorithm 3). It is transformed into a formula and returned as a potential invariant. Notice that in the main ICE loop of Algorithm 1 the teacher then checks if this invariant is inductive or not. If not, the original sample S is updated and in the next iteration the learner checks if the attributes are still sufficient for the updated sample. If not, the learner generates new attributes and proceeds with constructing another decision tree and so on.

An important question is how to choose InitialAttributes and how to generate new attributes when needed. In [16], the set InitialAttributes is for example the set of octagons over program variables with absolute values of constants bounded by some c ∈ ℕ. If these attributes are not sufficient to classify the sample, then new attributes are generated simply by increasing the bound c by 1. We use a different method, described in detail in Sect. 4. We now describe how a decision tree can be constructed from an ICE sample and a set of attributes.

**Decision Tree Learning Algorithms.** The well-known standard decision tree learning algorithms like ID3 [23] take as input a sample containing points of some universe marked as positive or negative, and a fixed set Attributes. They construct a decision tree by choosing an attribute as the root, splitting the sample in two (one part with all points satisfying the attribute and one with the other points) and recursively constructing trees for the two subsamples. At each step, the attribute maximizing the information gain, computed using the entropy of the subsamples, is chosen. Intuitively this means that at each step, the attribute which best separates the positive and negative points is chosen. In the context of verification, exact classification is needed, and therefore all points in a leaf must be classified in a way consistent with the sample.
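As an illustration, the classical information gain for plain positive/negative samples (i.e., before the implication-aware extensions of [16]) can be computed as follows; a toy sketch with attributes as Python predicates:

```python
# Toy sketch of the entropy-based attribute choice of ID3-style
# learners (without the implication-aware gains of [16]); names are
# ours. Attributes are predicates over states.
import math

def entropy(p, n):
    """Binary entropy of a subsample with p positive and n negative points."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def gain(examples, attr):
    """examples: list of (state, is_positive); attr: predicate on states."""
    def h(labels):
        return entropy(sum(labels), len(labels) - sum(labels))
    sat = [l for s, l in examples if attr(s)]
    unsat = [l for s, l in examples if not attr(s)]
    before = h([l for _, l in examples])
    after = (len(sat) * h(sat) + len(unsat) * h(unsat)) / len(examples)
    return before - after

exs = [((0,), True), ((1,), True), ((4,), False), ((5,), False)]
# x <= 1 classifies the sample perfectly, so it maximizes the gain
best = max([lambda s: s[0] <= 1, lambda s: s[0] >= 0],
           key=lambda a: gain(exs, a))
```

The gain functions of [16] refine this by additionally accounting for implication end-points.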

In [16] this idea is extended to also handle implications, which is essential for an ICE learner. The basic algorithm to construct a tree (given as Algorithm 3 below) gets as input an ICE sample S = (S<sup>+</sup>, S<sup>−</sup>, S<sup>→</sup>) and a set of Attributes, and produces a decision tree *consistent* with the sample, which means that each point in S<sup>+</sup> (resp. S<sup>−</sup>) is classified as positive (resp. negative) and for each implication (s, s′) ∈ S<sup>→</sup> it is not the case that s is classified as positive and s′ as negative. The initial sample S is supposed to be consistent.

```
Input  : An ICE sample S = (S+, S−, S→) and a set of Attributes.
Output : A tree
 1 Proc Construct-Tree(S, Attributes)
 2   Set G (partial mapping of end-points of impl. to {Positive, Negative}) to empty;
 3   Let Unclass be the set of all end-points of implications in S→;
 4   Compute the implication closure of G w.r.t. S;
 5   return DecisionTreeICE(S+, S−, Unclass, Attributes);
 6 Proc DecisionTreeICE(Examples = (Pos, Neg, Unclass), Attributes)
 7   Move all points of Unclass classified as Positive (resp. Negative) to Pos (resp. Neg);
 8   if Neg = ∅ then
 9     Mark all points of Unclass in G as Positive;
10     Compute the implication closure of G w.r.t. S;
11     return Leaf(+);
12   else if Pos = ∅ then
13     Mark all points of Unclass in G as Negative;
14     Compute the implication closure of G w.r.t. S;
15     return Leaf(−);
16   else
17     a ← choose(Attributes, Examples);
18     Divide Examples into two: Examples_a with all points satisfying a,
         and Examples_¬a with the others;
19     T_left ← DecisionTreeICE(Examples_a, Attributes \ {a});
20     T_right ← DecisionTreeICE(Examples_¬a, Attributes \ {a});
21     return Tree(a, T_left, T_right);
```
**Algorithm 3:** The ICE-DT decision-tree learning procedures.

The learner is similar to the classical decision tree learning algorithms. However, it has to take care of implications. To this end, the learner also considers the set of points appearing as end-points of the implications but neither in S<sup>+</sup> nor in S<sup>−</sup>. These points are initially considered unclassified, and the learner will mark them either Positive or Negative during the construction, as follows: if in the construction of the tree a subsample is reached containing only positive (resp. negative) points and unclassified points (lines 8 and 12, resp.), *all* these points are classified as positive (resp. negative). To make sure that implications remain consistent, the *implication closure* with the newly classified points is computed and stored in the global variable G, a partial mapping of end-points in S<sup>→</sup> to {Positive, Negative}. The implication closure of G w.r.t. S is defined as: if G(s) = Positive or s ∈ S<sup>+</sup>, and (s, s′) ∈ S<sup>→</sup>, then also G(s′) = Positive; if G(s′) = Negative or s′ ∈ S<sup>−</sup>, and (s, s′) ∈ S<sup>→</sup>, then also G(s) = Negative.

The set Attributes is such that a consistent decision tree will always be found, i.e. the set Attributes in line 17 is never empty (see below). An attribute in a node is chosen with choose(Attributes, Examples) returning an attribute a ∈ Attributes with the highest gain according to Examples. We do not give the details of this function. In [16] several gain functions are defined extending the classical gain function based on entropy with the treatment of implications. We use the one which penalizes cutting implications (like ICE-DT-penalty).

**Checking if the Set of Attributes is Sufficient.** Here we show how the function sufficient(Attributes, S) of Algorithm 2 is implemented in [16]. Two states s and s′ are considered equivalent (denoted by s ≡<sub>Attributes</sub> s′) if they satisfy the same attributes of Attributes. One has to make sure that two equivalent states are never classified in different ways by the tree construction algorithm. This is achieved by the following procedure: for any two states s, s′ with s ≡<sub>Attributes</sub> s′ which appear in the sample (as positive or negative states, or as end-points of implications), two implications s → s′ and s′ → s are added to S<sup>→</sup> of S.

Then, the implication closure of the sample is computed starting from an empty mapping G (all end-points are initially unclassified). If during the computation of the implication closure some end-point is classified as both Positive and Negative, then sufficient(Attributes, S) returns (false, S); otherwise it returns (true, S<sub>Attr</sub>), where S<sub>Attr</sub> is obtained from S = (S<sup>+</sup>, S<sup>−</sup>, S<sup>→</sup>) by adding to S<sup>+</sup> the end-points of implications classified as Positive and to S<sup>−</sup> the end-points classified as Negative.
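This check can be sketched as follows (our encoding, not the implementation of [16]; attributes are Python predicates, and the implication component of the extended sample is left implicit):

```python
# Sketch (our encoding) of sufficient(Attributes, S): states with the
# same attribute signature are linked by mutual implications, then the
# implication closure is computed; a state forced to be both Positive
# and Negative means the attributes cannot separate the sample.
from itertools import combinations

def sufficient(attributes, pos, neg, impl):
    states = set(pos) | set(neg) | {s for e in impl for s in e}
    sig = lambda s: tuple(a(s) for a in attributes)
    impl = set(impl)
    for s, t in combinations(states, 2):
        if sig(s) == sig(t):          # equivalent w.r.t. the attributes
            impl |= {(s, t), (t, s)}
    p, n = set(pos), set(neg)
    changed = True
    while changed:                    # implication closure
        changed = False
        for s, t in impl:
            if s in p and t not in p: p.add(t); changed = True
            if t in n and s not in n: n.add(s); changed = True
    if p & n:
        return False, (set(pos), set(neg))
    return True, (p, n)               # sample extended with forced states

# (3, 0) and (2, 5) both satisfy x >= 2, so they are forced into the
# same class although one is positive and the other negative.
ok, _ = sufficient([lambda s: s[0] >= 2], {(3, 0)}, {(2, 5)}, set())
```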

In [16] it is shown that this guarantees that a tree consistent with the sample will always be constructed, regardless of the order in which attributes are chosen. We now illustrate the ICE-DT learner on a simple example.

*Example 1.* Let S = (S<sup>+</sup>, S<sup>−</sup>, S<sup>→</sup>) be a sample (illustrated in Fig. 1) with two-dimensional states (variables x and y): S<sup>+</sup> = {(1, 1), (1, 4), (3, 1), (5, 1), (5, 4), (6, 1), (6, 4)}, S<sup>−</sup> = {(4, 1), (4, 2), (4, 3), (4, 4)}, S<sup>→</sup> = {(2, 2) → (2, 3), (0, 2) → (4, 0)}. We suppose that Attributes = {x ≥ 1, x ≤ 3, y ≥ 1, y ≤ 4, x ≥ 5, x ≤ 6} is given. In Sect. 4 we show how to obtain this set from the sample. The learner first checks that the set Attributes is sufficient to construct a formula consistent with S. The check succeeds, and we have, among others, that (2, 2), (2, 3) and the surrounding positive states on the left are all equivalent w.r.t. ≡<sub>Attributes</sub>. Therefore, after adding implications (which we omit for clarity in the following) and computing the implication closure, both (2, 2) and (2, 3) are added to S<sup>+</sup>. Then, the construction of the tree starts with Examples containing 9 positive, 4 negative and 2 unclassified states. Depending on the gain function, an attribute is chosen. Here, it is x ≥ 5, since it separates all the positive states on the right from the rest and does not cut any implication. The set Examples is split into the states satisfying x ≥ 5 and those that do not: Examples<sub>x≥5</sub> and Examples<sub>x<5</sub>. Examples<sub>x≥5</sub> contains only positive states {(5, 1), (5, 4), (6, 1), (6, 4)} and the branch is finished, whereas Examples<sub>x<5</sub> contains the remaining positive, negative and unclassified states and the construction continues. The attribute x ≤ 3 is chosen and Examples<sub>x<5</sub> is split in two. Examples<sub>x<5∧x≤3</sub> contains the positive states {(1, 1), (1, 4), (3, 1), (2, 2), (2, 3)} and one unclassified state (0, 2). Therefore, the algorithm marks (0, 2) as positive and, as there is an implication (0, 2) → (4, 0), the state (4, 0) is marked positive as well and a leaf node is returned.
The other branch, Examples<sub>x<5∧x>3</sub>, now contains the negative states {(4, 1), (4, 2), (4, 3), (4, 4)} and the positive state (4, 0). Therefore, another attribute is needed. Finally, the algorithm returns a tree corresponding to the formula x ≥ 5 ∨ (x < 5 ∧ x ≤ 3) ∨ (x < 5 ∧ x > 3 ∧ y < 1).

#### **3 Linear Formulas as Abstract Objects**

Algorithm 2 requires a set of attributes as input. In Sect. 4, we show how to generate these attributes from the sample. For that purpose, we use numerical abstract domains to represent and algorithmically manipulate sets of integer vectors representing program states. We consider standard numerical domains defined in [7,9,22] and implemented in tools such as APRON [18]: Intervals, Octagons, and Polyhedra.

Given a set of n variables X and a linear formula ϕ over X, let [[ϕ]] ⊆ Z<sup>n</sup> be the set of all integer points satisfying the formula. A subset of Z<sup>n</sup> is then called an interval (resp. an octagon, a polyhedron) if it is [[ϕ]] for a conjunction ϕ of interval (resp. octagonal, arbitrary linear) constraints.


Now, we can define several abstract domains as complete lattices A<sup>type</sup><sub>X</sub> = ⟨D<sup>type</sup><sub>X</sub>, ⊑, ⊔, ⊓, ⊥, ⊤⟩, where type is either int, oct or poly, and D<sup>int</sup><sub>X</sub> is the set of intervals, D<sup>oct</sup><sub>X</sub> the set of octagons, and D<sup>poly</sup><sub>X</sub> the set of polyhedra.

The relation ⊑ is set inclusion. The binary operation ⊔ (resp. ⊓) is the *join* (resp. *meet*) operation, which yields the smallest (resp. greatest) element of D<sub>X</sub> containing (resp. contained in) the union (resp. the intersection) of the two operands. Finally, ⊥ (resp. ⊤) corresponds to the empty set (resp. Z<sup>n</sup>).
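For intuition, the interval (box) domain instantiates these operations very simply. A toy model (not APRON), where an element is a tuple of per-variable (lo, hi) bounds:

```python
# Toy model (not APRON) of the lattice operations on the interval
# domain over Z^n: an element is a tuple of per-variable (lo, hi)
# bounds, ⊔ is the smallest enclosing box and ⊓ the intersection
# (None plays the role of ⊥).

def join(a, b):
    return tuple((min(al, bl), max(ah, bh))
                 for (al, ah), (bl, bh) in zip(a, b))

def meet(a, b):
    m = tuple((max(al, bl), min(ah, bh))
              for (al, ah), (bl, bh) in zip(a, b))
    return None if any(lo > hi for lo, hi in m) else m

# smallest box containing the boxes around (1, 1) and (3..4, 2)
box = join(((1, 1), (1, 1)), ((3, 4), (2, 2)))
```

Note that the join overapproximates the union: `box` contains points, such as (2, 2), that belong to neither operand.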

We suppose that we have a function Form<sup>type</sup>(d) which, given an element d ⊆ Z<sup>n</sup> of the lattice, provides a formula ϕ of the corresponding type such that [[ϕ]] = d. There are many ways to describe the set d with a formula ϕ; therefore the function Form<sup>type</sup>(d) depends on the particular implementation of the abstract domains. We furthermore define Constr<sup>type</sup>(d) to be the set of linear constraints of Form<sup>type</sup>(d).

We drop the superscript type from all preceding definitions, when it is clear from the context or when we define notions for all types.

All singleton subsets of Z<sup>n</sup> are elements of the lattices. For example, if p = (x = 1, y = 2), then, for the domains of Intervals, Octagons, and Polyhedra as implemented in APRON, we have: Constr<sup>int</sup>({p}) = {x ≤ 1, x ≥ 1, y ≤ 2, y ≥ 2}, Constr<sup>oct</sup>({p}) = {x ≥ 1, x ≤ 1, y − x ≥ 1, x + y ≥ 3, y ≥ 2, y ≤ 2, x + y ≤ 3, x − y ≥ −1} and Constr<sup>poly</sup>({p}) = {x = 1, y = 2}.

Notice that in APRON, while equality constraints are used in the Polyhedra domain, such constraints are not explicit in the Interval and Octagon domains.

An important fact about the three domains mentioned above is that each element of the lattice is the intersection of a convex subset of Q<sup>n</sup> with Z<sup>n</sup>. To be able to reason about integer points of *nonconvex* sets, we will use sets of convex sets in the next section.

**Fig. 1.** An ICE sample and its separators using different abstract domains.

# **4 Generating Attributes from Sample Separators**

We define in this section algorithms for generating a set of attributes that can be used for constructing decision trees representing candidate invariants. Given an ICE sample, these algorithms are based on constructing separators of the sets of positive and negative states that are consistent with the implications in the sample. These separators are sets of intervals, octagons or polyhedra. All the constraints that define these sets are collected as a set of attributes.

#### **4.1 Abstract Sample Separators**

Let S = (S<sup>+</sup>, S<sup>−</sup>, S<sup>→</sup>) be an ICE sample, and let A<sub>X</sub> = ⟨D<sub>X</sub>, ⊑, ⊔, ⊓, ⊥, ⊤⟩ be an abstract domain. Intuitively, a separator is a set of abstract elements covering all positive states, containing no negative state, and consistent with the implications. Formally, an A<sub>X</sub>-separator of S is a set 𝕊 ∈ 2<sup>D<sub>X</sub></sup> such that ∀p ∈ S<sup>+</sup>. ∃d ∈ 𝕊. p ∈ d, and ∀p ∈ S<sup>−</sup>. ∀d ∈ 𝕊. p ∉ d, and ∀(p → q) ∈ S<sup>→</sup>. ∀d ∈ 𝕊. (p ∈ d ⟹ ∃d′ ∈ 𝕊. q ∈ d′). Given a set of positive states S<sup>+</sup>, we define the basic separator 𝕊<sub>basic</sub> as {{p} | p ∈ S<sup>+</sup>}, where each state is alone in its set. Our method for generating attributes for the learning process is based on computing a special type of separators, called *join-maximal*. An A<sub>X</sub>-separator 𝕊 is join-maximal if it is not possible to take the join of two of its elements without including a negative state: ∀d<sub>1</sub>, d<sub>2</sub> ∈ 𝕊. d<sub>1</sub> ≠ d<sub>2</sub> ⟹ (∃n ∈ S<sup>−</sup>. n ∈ d<sub>1</sub> ⊔ d<sub>2</sub>).
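The three defining conditions translate directly into code. A toy check (our encoding), in which abstract elements are modelled as plain finite sets of states and membership is Python's `in`:

```python
# Direct transcription (our encoding) of the separator conditions:
# every positive state is covered, no element contains a negative
# state, and whenever an element covers the source of an implication,
# some element covers its target.

def is_separator(sep, pos, neg, impl):
    covers_pos = all(any(p in d for d in sep) for p in pos)
    avoids_neg = all(all(n not in d for d in sep) for n in neg)
    respects_impl = all(any(q in d2 for d2 in sep)
                        for (p, q) in impl
                        for d in sep if p in d)
    return covers_pos and avoids_neg and respects_impl

sep = [{(1, 1), (2, 2)}, {(5, 5)}]
# covers both positives, avoids (4, 4), and (2, 2)'s successor is covered
ok = is_separator(sep, {(1, 1), (5, 5)}, {(4, 4)}, {((2, 2), (5, 5))})
```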

*Example 2.* Let us consider again the ICE sample S given in Example 1. Figure 1 shows the borders of join-maximal A*X*-separators for S for different abstract domains (Intervals int, Octagons oct, and Polyhedra poly).

*Remark 1.* An ICE sample may have multiple join-maximal separators as Fig. 2 shows for the polyhedra domain. The method presented in the next section computes one of them non-deterministically.

#### **4.2 Computing a Join-Maximal Abstract Separator**

We present in this section a basic algorithm for computing a join-maximal A<sub>X</sub>-separator for a given sample S. Computing such a separator can be done iteratively, starting from 𝕊<sub>basic</sub> and, at each step, choosing two elements d<sub>1</sub> and d<sub>2</sub>

**Fig. 2.** Different join-maximal separators for a same sample.

in the current separator such that d<sub>1</sub> ⊔ d<sub>2</sub> does not contain a negative state in S<sup>−</sup> (this can be checked using the meet operation ⊓), and replacing d<sub>1</sub> and d<sub>2</sub> by d<sub>1</sub> ⊔ d<sub>2</sub>. Then, if any element of the separator contains the source p of an implication p → q, which means that p is now considered a positive state, then, since q must also be considered positive, the element {q} must be added to the separator if q is not already in some element of the current separator. When no new join operations (without including negative states) can be done, the obtained set is necessarily a join-maximal A<sub>X</sub>-separator of S. This procedure corresponds to Algorithm 4.

```
Input  : An ICE sample S = (S+, S−, S→) and an abstract domain
         AX = ⟨DX, ⊑, ⊔, ⊓, ⊥, ⊤⟩.
Output : 𝕊, a join-maximal AX-separator of S.
1 Proc constructSeparator(S, AX)
2   𝕊 ← 𝕊basic (* = {{s} | s ∈ S+} *) ;
3   while true do
4     if ∃a, b ∈ 𝕊. a ≠ b ∧ ∀n ∈ S−. n ∉ a ⊔ b then
5       𝕊 ← (𝕊 \ {a, b}) ∪ {a ⊔ b} ;
6       while ∃p → q ∈ S→. ∃d ∈ 𝕊. p ∈ d ∧ ∀d′ ∈ 𝕊. q ∉ d′ do
7         𝕊 ← 𝕊 ∪ {{q}} ;
8     else break;
```
**Algorithm 4:** Computing a join-maximal A*X*-separator.

Notice that instead of starting with the basic separator 𝕊<sub>basic</sub> defined as above, one can start with any separator 𝕊<sub>init</sub> ⊇ 𝕊<sub>basic</sub> whose additional sets contain only states that are known to be positive (for example, the initial states).
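The following is a runnable sketch of Algorithm 4 over a toy box domain (bounds per dimension; join is the smallest enclosing box). It follows the same iteration structure but is our encoding, not the NIS implementation:

```python
# Runnable sketch of Algorithm 4 (our encoding, not NIS) over a toy
# box domain: an abstract element is a tuple of per-dimension (lo, hi)
# bounds and join is the smallest enclosing box.

def box(p):
    return tuple((x, x) for x in p)

def join(a, b):
    return tuple((min(al, bl), max(ah, bh))
                 for (al, ah), (bl, bh) in zip(a, b))

def contains(d, p):
    return all(lo <= x <= hi for (lo, hi), x in zip(d, p))

def construct_separator(pos, neg, impl):
    sep = {box(p) for p in pos}                       # the basic separator
    while True:
        # pick two elements whose join contains no negative state
        pair = next(((a, b) for a in sep for b in sep
                     if a != b and
                     not any(contains(join(a, b), n) for n in neg)), None)
        if pair is None:
            return sep                                # join-maximal
        a, b = pair
        sep = (sep - {a, b}) | {join(a, b)}
        # re-cover right end-points of implications whose source is covered
        changed = True
        while changed:
            changed = False
            for p, q in impl:
                if any(contains(d, p) for d in sep) and \
                        not any(contains(d, q) for d in sep):
                    sep.add(box(q)); changed = True

# the two positive points can be joined without touching the negative one
sep = construct_separator({(1, 1), (3, 1)}, {(4, 2)}, set())
```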

*Example 3.* Consider again the sample S of Example 2. We show how the separators of S in Fig. 1 are constructed using Algorithm 4. The algorithm starts from the basic separator 𝕊<sub>basic</sub> where every positive state of S is alone (Fig. 3(a)). It picks two elements of that separator, e.g. {d<sub>1</sub>} and {d<sub>2</sub>}. As their join does not include negative states, {d<sub>1</sub>} and {d<sub>2</sub>} are replaced by j<sub>1</sub> = {d<sub>1</sub>} ⊔ {d<sub>2</sub>} to get a new separator (Fig. 3(b)). Then, depending on the considered domain, different separators are obtained. For Intervals, the join of j<sub>1</sub> and {d<sub>3</sub>} leads to the separator in Fig. 1(a). Notice that both ends of the implication (2, 2) → (2, 3) are included in j<sub>1</sub> ⊔ {d<sub>3</sub>}. In the case of Octagons, the join of j<sub>1</sub> and {d<sub>3</sub>} is the set

**Fig. 3.** The first iterations of Algorithm 4 on the sample *S* of Fig. 1

on the left of Fig. 1(b). Again, both ends of the implication (2, 2) → (2, 3) are included in j<sub>1</sub> ⊔ {d<sub>3</sub>}. In the case of Polyhedra, j<sub>2</sub> = j<sub>1</sub> ⊔ {d<sub>3</sub>} is the triangle shown in Fig. 3(c). Since (2, 2) is included in j<sub>2</sub> but (2, 3) is not, the element {(2, 3)} is added to the separator, leading to the separator represented in Fig. 3(c). In the next iteration, j<sub>2</sub> is joined with {d<sub>8</sub>}, leading to the separator shown in Fig. 3(d). Finally, a similar iteration of join operations leads to the rectangle including the four points, and this leads to the join-maximal separator of Fig. 1.

*Remark 2.* In the best case, Algorithm 4 performs |S<sup>+</sup>| join and |S<sup>+</sup>|(|S<sup>−</sup>| + |S<sup>→</sup>|) meet operations (all pairs of points can be joined, and no left end-point of an implication lies in the newly joined convex sets). In the worst case, it performs O((|S<sup>+</sup>| + |S<sup>→</sup>|)<sup>2</sup>) join and O((|S<sup>+</sup>| + |S<sup>→</sup>|)<sup>2</sup>(|S<sup>−</sup>| + |S<sup>→</sup>|)) meet operations (at most |S<sup>−</sup>| + |S<sup>→</sup>| meets are needed to check whether two sets can be joined, and implications might add new points to S<sup>+</sup>). The cost of meet and join depends on the abstract domain used; in the number of variables, it is polynomial for intervals and octagons, and exponential for polyhedra. Algorithm 4 is not designed to compute a join-maximal separator with a minimal number of convex sets, as this would require a potentially exponential number of meet and join operations.

#### **4.3 Integrating Separator Computation in ICE-DT**

We use the computation of a join-maximal separator to provide an instance of the function generateAttributes of ICE-DT in Algorithm 2. Given a sample S, let 𝕊 be the A<sub>X</sub>-separator of S computed by constructSeparator(S, A<sub>X</sub>), defined by Algorithm 4. We consider the set *InitialAttributes* containing all the predicates that constitute the specification (*Init* and *Good*) and those that appear in the programs (as tests in the conditional statements and while loops). Then, we define: generateAttributes(S) = *InitialAttributes* ∪ ⋃<sub>d∈𝕊</sub> *Constr*(d)

*Remark 3.* Several convex sets of the separator S might generate the same constraint and the set of attributes generated in this way might contain attributes which partition the state space in the same way (e.g. x ≤ 0 and x ≥ 1, equivalent to x > 0 over the integers). We keep only one of them. The number of attributes generated is at most linear in the number of positive states in the sample S.
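The deduplication of Remark 3 can be done by comparing how attributes classify the sampled states. A sketch (our encoding; attributes are Python predicates):

```python
# Sketch (our encoding) of the deduplication in Remark 3: two
# attributes are redundant if they classify the sampled states in the
# same way or in complementary ways (e.g. x <= 0 vs. x >= 1 over Z).

def dedup(attrs, states):
    seen, kept = set(), []
    for a in attrs:
        sig = tuple(a(s) for s in states)
        neg_sig = tuple(not b for b in sig)
        if sig not in seen and neg_sig not in seen:
            seen.add(sig)
            kept.append(a)
    return kept

attrs = [lambda s: s[0] <= 0,   # x <= 0
         lambda s: s[0] >= 1,   # x >= 1, complementary to x <= 0 over Z
         lambda s: s[1] <= 2]   # y <= 2
kept = dedup(attrs, [(-1, 0), (2, 3), (1, 1)])
# x <= 0 and x >= 1 induce the same partition; only one is kept
```

Note that this compares attributes on the sampled states only; a semantic check over the whole state space would need an SMT query.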

**Fig. 4.** Example program

Notice that our function generateAttributes(S), contrary to the one used in the original ICE-DT (Algorithm 2), does not expand a set of existing attributes, and therefore it only needs the sample S as argument. In fact, with our method for computing attributes, the ICE-DT schema can be simplified: the while loop in Algorithm 2 can be replaced by a single initial test on the success condition. Indeed, each time the learner is called, it checks whether the set of attributes computed for the previous sample is sufficient to build a separator for the new sample. Only when it is not sufficient is the generation of a separator performed. Then, the subsequent call to the sufficient function is needed to extend the sample so that the construction of a decision tree can be done (see the explanation in Sect. 2.3), but it will necessarily succeed since, in our case, the set of attributes defines by construction a separator of the sample.

*Example 4.* Consider the program in Fig. 4, whose set of variables is X = {j, k, t}. We use Polyhedra. First, starting from an empty ICE sample, regardless of the attributes, the learner proposes *true* as an invariant and (5, 1, 0) is returned as a negative counterexample. Then, it proposes *false* and (2, 0, 0) is returned as a positive counterexample.

Now, Algorithm 4 is called to compute a separator for S = (S<sup>+</sup> = {(2, 0, 0)}, S<sup>−</sup> = {(5, 1, 0)}, S<sup>→</sup> = ∅). Here, we initially use a separator 𝕊<sub>init</sub> containing, in addition to d<sub>0</sub> = {(2, 0, 0)}, the set of states satisfying the initial condition j = 2 ∧ k = 0, denoted by d<sub>1</sub>. Since d<sub>0</sub> ⊆ d<sub>1</sub>, the algorithm returns the join-maximal separator 𝕊 = {d<sub>1</sub>} with Constr<sup>poly</sup>(d<sub>1</sub>) = {j = 2, k = 0}.

Using the constraints of 𝕊 as attributes, the learner constructs the candidate invariant k = 0. Then, the teacher provides an implication counterexample (0, 0, 1) → (2, 1, 1). Now, without computing another separator (as the one it has is sufficient for the new sample), the learner proposes j = 2 ∧ k = 0 as an invariant, and the implication counterexample (2, 0, 1) → (4, 1, 1) is returned (and since (2, 0, 1) is an initial state, (4, 1, 1) is also considered positive).

Then, Algorithm 4 is called again to construct a separator for the sample S = (S<sup>+</sup> = {(2, 0, 0), (4, 1, 1)}, S<sup>−</sup> = {(5, 1, 0)}, S<sup>→</sup> = {(0, 0, 1) → (2, 1, 1), (2, 0, 1) → (4, 1, 1)}). Starting from a separator 𝕊<sub>init</sub> = {d<sub>0</sub>, d<sub>1</sub>, d<sub>2</sub>} with d<sub>2</sub> = {(4, 1, 1)}, it returns the join-maximal separator

$$\mathbb{S} = \{d_3\} \qquad Constr^{poly}(d_3) = \{2k + 2 = j,\ j \le 4,\ j \ge 2\}$$

Based on this separator, the learner proposes 2k + 2 = j, and (2, 0, 0) → (6, 0, 0) is given as a counterexample (then, since (2, 0, 0) is in S<sup>+</sup>, (6, 0, 0) is considered positive). Then, from 𝕊<sub>init</sub> = {d<sub>0</sub>, d<sub>1</sub>, d<sub>2</sub>, d<sub>4</sub>} with d<sub>4</sub> = {(6, 0, 0)}, a new separator 𝕊 is constructed

$$\mathbb{S} = \{d_5\} \qquad Constr^{poly}(d_5) = \{j + 2k \le 6,\ k \ge 0,\ j \ge 2k + 2\}$$

leading to a new candidate invariant: j + 2k ≤ 6 ∧ j ≥ 2k + 2. The teacher returns at this point the negative state (0, −2, 0). The attributes of 𝕊 are still sufficient to construct a decision tree for the sample. Then, the learner proposes j + 2k ≤ 6 ∧ k ≥ 0 ∧ j ≥ 2k + 2, and the teacher returns the counterexample (3, 0, 1) → (5, 1, 1) (and since (5, 1, 1) is a negative state, (3, 0, 1) is considered negative). The current sample S is now (S<sup>+</sup> = {(2, 0, 0), (4, 1, 1), (6, 0, 0)}, S<sup>−</sup> = {(5, 1, 0), (5, 1, 1), (3, 0, 1), (0, −2, 0)}, S<sup>→</sup> = {(0, 0, 1) → (2, 1, 1), (2, 0, 1) → (4, 1, 1), (2, 0, 0) → (6, 0, 0), (3, 0, 1) → (5, 1, 1)}).

Then, from 𝕊<sub>init</sub> = {d<sub>0</sub>, d<sub>1</sub>, d<sub>2</sub>, d<sub>4</sub>}, a join-maximal separator is constructed

$$\mathbb{S} = \{d_3, d_4\} \qquad Constr^{poly}(d_4) = \{j = 6,\ t = 0,\ k = 0\}$$

Some iterations later, using only the attributes of the last 𝕊, the learner generates the inductive invariant (t = 0 ∧ 2 ≤ j ∧ k = 0) ∨ (t = 0 ∧ 2 ≤ j ∧ 2k + 2 = j).

#### **4.4 Computing Separators Incrementally**

Algorithm 4 of Sect. 4.2 always starts from the initial separator, regardless of what has been done in the previous iterations of the ICE learning process. Here, we present an incremental approach that exploits the fact that adding a counterexample to the sample may modify the separator only locally, allowing parts of separators computed in previous iterations to be reused. The basic idea is to store the history of the separator computation along the ICE iterations, and to update it according to the new counterexamples discovered at each step.

**The Algorithm**. We use an abstract stack data structure to represent the history of separators. Along the iterations of the ICE learning algorithm, an increasing sequence of samples S<sub>i</sub> is considered (at each iteration it is enriched with the new counterexample provided by the teacher). Then, at each step i, a join-maximal separator 𝕊<sub>i</sub> of the sample S<sub>i</sub> is computed and stored on the stack. Notice that at a given step i, separators of index j < i are not necessarily separators of S<sub>i</sub>, since they may not cover all positive points of S<sub>i</sub>. Therefore, we introduce the following notion: a *partial* A<sub>X</sub>-*separator* of a sample S is a set 𝕊 ∈ 2<sup>D<sub>X</sub></sup> such that ∀p ∈ S<sup>−</sup>. ∀d ∈ 𝕊. p ∉ d.

Now, to compute the separator 𝕊<sub>i</sub>, we start from one of the partial separators on the stack, namely the most recent one that is not affected by the last update of the sample. When the sample at step i is extended with positive states, 𝕊<sub>i</sub> can be computed directly from 𝕊<sub>i−1</sub>. However, when the sample is extended with negative states, this might require reconsidering several previous steps, since some of the elements (convex sets) of their separators might contain states that are (now discovered to be) negative. In that case, we must return to the step of the greatest index j < i (i.e., the last step before i) such that 𝕊<sub>j</sub> is a partial separator of S<sub>i</sub> (i.e., the new knowledge about the negative states does not affect the separation computed at step j). Since the sequence of samples is increasing, it is indeed correct to consider the greatest j < i satisfying the property above. Therefore, the separator 𝕊<sub>i</sub> is computed starting from 𝕊<sub>j</sub> augmented with all the positive states in S<sup>+</sup><sub>i</sub> \ S<sup>+</sup><sub>j</sub>.

This leads to Algorithm 5. We use in its description a stack P supplied with the usual operations: P.head() returns the top element of the stack, P.pop() removes and returns the top element of the stack, and P.push(e) inserts an element e at the top position of the stack. A refined version of Algorithm 5 is presented in the full paper [2] where the backtracking phase is made more effective: We attach information to each join-created object in order to track its join-predecessors (objects involved in its creation) in the stack.

```
Global : P = {∅}, a stack of partial separators.
 1 Proc constructSeparatorInc(Si = (S+i, S−i, S→i), AX)
      // backtracking
 2   while true do
 3     if ∃n ∈ S−i. ∃d ∈ P.head(). n ∈ d then
 4       P.pop();
 5     else break;
      // expansion
 6   𝕊 ← P.head();
 7   add ← {p ∈ S+i | ∀d ∈ 𝕊. p ∉ d} ∪
           {q | ∃p → q ∈ S→i. ∃d ∈ 𝕊. p ∈ d ∧ ∀d′ ∈ 𝕊. q ∉ d′};
 8   while ∃s ∈ add do
 9     add ← add \ {s};
10     if ∃d ∈ 𝕊. ∀n ∈ S−i. n ∉ d ⊔ {s} then
11       let o = d ⊔ {s};
12       𝕊 ← (𝕊 \ {d}) ∪ {o};
13       for p → q ∈ S→i s.t. p ∈ o ∧ ∀d′ ∈ 𝕊. q ∉ d′ do
14         add ← add ∪ {q}
15     else
16       𝕊 ← 𝕊 ∪ {{s}};
17       for p → q ∈ S→i s.t. p = s ∧ ∀d′ ∈ 𝕊. q ∉ d′ do
18         add ← add ∪ {q}
19   P.push(𝕊);
20   return 𝕊;
```
**Algorithm 5:** Incremental computation of an A*X*-separator of a sample S.
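The backtracking phase of Algorithm 5 can be sketched as follows (our encoding, with abstract elements as toy boxes of per-dimension bounds). The stack bottom holds the empty separator, so the loop always terminates:

```python
# Sketch (our encoding) of the backtracking phase of Algorithm 5:
# stored partial separators are popped until the top one contains none
# of the negative states; the expansion phase then restarts from it.

def contains(d, p):
    """Membership in a toy box: d is a tuple of (lo, hi) bounds."""
    return all(lo <= x <= hi for (lo, hi), x in zip(d, p))

def backtrack(stack, neg):
    """Pop separators invalidated by newly discovered negative states."""
    while any(contains(d, n) for d in stack[-1] for n in neg):
        stack.pop()            # top is no longer a partial separator
    return stack[-1]

# bottom of the stack is the empty separator, so backtracking terminates
stack = [set(), {((1, 3), (1, 1))}, {((1, 5), (1, 2))}]
# the top separator contains (4, 2); we restart from the earlier one
base = backtrack(stack, {(4, 2)})
```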

**Integration to ICE-DT.** The function constructSeparatorInc can be integrated into the ICE-DT algorithm just like the function constructSeparator in Sect. 4.3, by using it to implement the function generateAttributes of the learner. This time, however, the learner computes the separator from which the attributes are extracted more efficiently.

*Example 5.* Consider again the program in Fig. 4 of Example 4. The first two iterations are similar to the ones described in Example 4. Then, the obtained sample is S = (S<sup>+</sup> = {(2, 0, 0)}, S<sup>−</sup> = {(5, 1, 0)}, S<sup>→</sup> = ∅). Starting from the empty separator, Algorithm 5 computes the separator 𝕊<sub>1</sub> = {d<sub>1</sub>} where Constr<sup>poly</sup>(d<sub>1</sub>) = {j = 2, k = 0}. Then, the learner proceeds as in the previous example to get the sample S = (S<sup>+</sup> = {(2, 0, 0), (4, 1, 1)}, S<sup>−</sup> = {(5, 1, 0)}, S<sup>→</sup> = {(0, 0, 1) → (2, 1, 1), (2, 0, 1) → (4, 1, 1)}). To build a separator of S, Algorithm 5 starts from 𝕊<sub>1</sub> and produces 𝕊<sub>2</sub> = {d<sub>3</sub>} where d<sub>3</sub> = d<sub>1</sub> ⊔ {(4, 1, 1)}.



**Fig. 5.** Benchmark results and comparison of NIS wrt. different abstract domains.

Similarly, when the counterexample (2, 0, 0) → (6, 0, 0) is obtained, the algorithm starts directly from S₂ to produce S₃ = {d₅} where d₅ = d₃ ⊔ {(6, 0, 0)}.

After two more iterations, the sample is the same as S in Example 4. At this point, S₃ cannot be used to construct a separator for S since d₅ includes the negative state (3, 0, 1). The algorithm therefore removes S₃ from the stack and checks whether S₂ is a partial separator of S, which is indeed the case. It then constructs a new separator S₄ based on S₂ by expanding it with the counterexamples received after the construction of S₂ (the negative state (0, −2, 0) and the implications (2, 0, 0) → (6, 0, 0) and (3, 0, 1) → (5, 1, 1)): S₄ = {d₃, d₆} where Constr*poly*(d₆) = {t = 0, k = 0, j = 6}. The rest of the execution proceeds as with Algorithm 4. The advantages of the incremental method here are: (1) while positive examples are added, the separators are simply expanded, and (2) when a negative example is added at step 4, only one join operation has to be undone.

#### **5 Experiments**

We have implemented our method for attribute synthesis within the ICE-DT schema in a prototype tool, NIS (Numerical Invariant Synthesizer). NIS, written in C++, is configurable with an abstract domain for the manipulation of abstract objects. It uses Z3 [24] for SMT queries and APRON's [18] abstract domains.

We compare our implementation with ICE-DT<sup>1</sup>, LoopInvGen, CVC4, and Spacer<sup>2</sup>. LoopInvGen is a data-driven invariant inference tool based on a syntactic enumeration of candidate predicates [25,26]. It is written in OCaml and uses Z3 as an SMT solver. CVC4 uses an enumerative refutation-based approach [1,27]. It is written in C++ and it includes an SMT solver. Spacer is a PDR-based CHC solver [19], written in C++ and integrated in Z3.

<sup>1</sup> The original ICE-DT tool [16] does not support programs in the SyGuS format. Here we use our own implementation of ICE-DT. It shares with NIS all the components (teacher, decision tree learning algorithm with implications) except that attribute discovery is enumerative.

<sup>2</sup> Spacer does not support programs in the SyGuS format; a wrapper is written in C++ that converts a SyGuS program to a CHC problem and supplies it to Spacer via the Z3 FixedPoint API.

The evaluation was done on 164 linear integer arithmetic (LIA) programs<sup>3</sup> from SyGuS-Comp'19, with between 2 and 10 variables. The experiments were carried out with a timeout of 1800 s (30 min) per example, on a machine with 4 Intel(R) Xeon(R) 2.13 GHz CPUs, 16 cores, and 128 GB RAM running Linux CentOS 7.9.

Figure 5 shows the number of safe and unsafe programs solved by each tool. The instance of our approach using the polyhedral abstract domain solves 154 programs out of 164, and the virtual best of our approach over the three abstract domains (intervals, octagons, and polyhedra) solves 160 out of 164. Two of the remaining examples require handling quantifiers, which the current implementation cannot do. The other two were not solved by any of the four tools we considered.

These results show that, overall, our approach is powerful and able to solve a significant number of cases that other tools cannot. Interestingly, different abstract domains lead to incomparable performance: although more cases are solvable with polyhedra, some cases are solvable only with intervals or octagons. Also, while operations on intervals and octagons have lower complexity than on polyhedra, this is compensated by the greater expressiveness of polyhedra: in many cases it allows invariants to be found quickly where a less expressive domain requires many more iterations. Figure 5 also shows the number of programs that can be solved using one particular abstract domain but not another. Polyhedra are globally superior, but the three domains are complementary.

Compared to the other tools, the bottleneck of ICE-DT, and also of LoopInvGen, is the number of predicates generated by enumeration. Our approach avoids the explosion of the attribute pool by guiding attribute discovery with the data sample, and by reducing the size of the computed separators (replacing objects by their join) from which constraints are extracted. CVC4 uses enumerative refutation techniques, which are also subject to an explosion problem; moreover, CVC4 cannot solve the unsafe program cases. The performance of Spacer depends on its ability to generalize the set of predecessors computed using model-based projection, and on the interpolants used for separation from bad states in the context of IC3/PDR. While this is done efficiently in general, there are cases where this process leads to tedious computations, whereas our technique can be much faster, using a small number of join operations on positive states.

The scatter plots shown in Fig. 6 compare the execution times of our approach using the polyhedra abstract domain, NIS(poly), with LoopInvGen, CVC4, and Spacer (a timeout of 1800 s is used for each example). They show that NIS(poly) is in general faster than both LoopInvGen and CVC4, and that its execution times are comparable to Spacer's. We have

<sup>3</sup> Other programs from SyGuS-Comp'19 have not been taken into account in our evaluations as they are boolean programs with integer variables for encoding nondeterminism or artificial programs augmented with useless variables and statements.

**Fig. 6.** Runtime of NIS(poly) vs. LoopInvGen, CVC4, and Spacer, and NIS(oct) vs. ICE-DT.

also compared the original ICE-DT, based on enumerative attribute generation using octagonal templates (as in [16]), with NIS(oct). The comparison shows that our tool is significantly faster (see the bottom-right subfigure of Fig. 6).

### **6 Conclusion**

We have defined an efficient method for generating relevant predicates for the learning process of numerical invariants. The approach is guided by the data sample built during the process and is based on constructing a separator of the sample. The construction iteratively applies join operations in numerical abstract domains in order to cover positive states without including negative ones. Our method is tightly integrated into the ICE-DT schema, leading to an efficient data-driven invariant synthesis and verification algorithm.

Future work includes several directions. First, alternative methods for constructing separators should be investigated, in order to reduce the size of the attribute pool along the learning process while increasing the attributes' potential relevance. Another issue to investigate is control over the counterexamples provided by the teacher, since they play an important role in the learning process; in our current implementation, their choice depends entirely on the SMT solver used to implement the teacher. Finally, we intend to extend this approach to other classes of programs, in particular programs with other data types and programs with more general control structures, such as procedural programs.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Proof-Guided Underapproximation Widening for Bounded Model Checking**

Prantik Chatterjee1(B) , Jaydeepsinh Meda<sup>2</sup>, Akash Lal<sup>3</sup>, and Subhajit Roy<sup>1</sup>

<sup>1</sup> Indian Institute of Technology Kanpur, Kanpur, India prantik@cse.iitk.ac.in, subhajit@iitk.ac.in <sup>2</sup> Oracle, Bengaluru, India <sup>3</sup> Microsoft Research, Bengaluru, India akashl@microsoft.com

**Abstract.** Bounded Model Checking (BMC) is a popular strategy for program verification that has been explored extensively over the past decade. Despite this long history, BMC still faces scalability challenges as programs continue to grow larger and more complex. One approach that has proven effective in verifying large programs is Counterexample Guided Abstraction Refinement (CEGAR). In this work, we propose a complementary approach to CEGAR for bounded model checking of sequential programs: in contrast to CEGAR, our algorithm gradually widens underapproximations of a program, guided by proofs of unsatisfiability. We implemented our ideas in a tool called Legion and compare its performance against Corral, a state-of-the-art verifier from Microsoft that utilizes the CEGAR strategy. We conduct our experiments on 727 Windows and Linux device-driver benchmarks. We find that Legion solves 12% more instances than Corral and exhibits behavior complementary to Corral's. Motivated by this, we also build a portfolio verifier, Legion<sup>+</sup>, that attempts to draw the best of Legion and Corral. Our portfolio, Legion<sup>+</sup>, solves 15% more benchmarks than Corral with similar computational resource constraints (i.e., each verifier in the portfolio runs with half of Corral's time budget). Moreover, it is 2.9× faster than Corral on benchmarks solved by both Corral and Legion<sup>+</sup>.

**Keywords:** Verification · Bounded model checking · Underapproximation widening

### **1 Introduction**

Bounded Model Checking (BMC) [11,20,26,33] is a popular option for program verification, primarily due to its ability to side-step the need to synthesize complex invariants. BMC harnesses the power of modern SMT solvers to verify a bounded set of behaviors of a program. Once the program is proven correct with small bounds, the user may, if interested, re-attempt verification with larger bounds.

BMC operates by constructing a logical formula that symbolically captures all states reachable by a program under a user-provided bound. A *query*, referred to as the *verification condition (VC)*, is constructed as the conjunction of the program semantics and the negation of the property, both expressed as logical formulas. If the verification condition is satisfiable, some program execution violates the property of interest, and the program is faulty. If it is unsatisfiable, the program satisfies the property, i.e., the program is safe under the chosen bound.

However, for large programs, BMC faces scalability challenges as the verification condition for the program tends to grow large, posing difficulties for the SMT solver. Prior work has answered this challenge by using the popular *counterexample-guided abstraction refinement (CEGAR)* strategy: start off with the VC for an *abstraction* of the program, and incrementally refine the abstraction until the program is decided as safe or faulty. The *Stratified Inlining (SI)* [26] algorithm is an instance of this strategy. SI starts off with an abstraction of only the entry procedure of the program, and then incrementally inlines callees, guided by counterexamples. Not surprisingly, the dynamic inlining strategy of SI has been found to be significantly more scalable than algorithms that statically inline all procedures [25]. The SI algorithm is used in practice by the Corral [24] verifier that powers Microsoft's Static Driver Verifier (SDV) [4].

In this work, we propose a new algorithm that uses proofs of unsatisfiability to widen underapproximate models of the program en route to verification of sequential programs. Our algorithm starts off by constructing a partial verification condition for only the program entry procedure and blocks all paths that invoke calls to procedures that have not yet been inlined. This constructs an underapproximation of the original program (because paths are blocked). A satisfiable result on an underapproximation will indicate the presence of a bug. If the VC is unsatisfiable, we examine its *proof of unsatisfiability* in order to guide the inlining of called procedures. The program can be declared safe when the proof of unsatisfiability does not depend on any procedure call that has not been inlined yet. We implemented our ideas in a tool called Legion.

Further, we found that our underapproximation-widening algorithm and the abstraction-refinement strategy (used by Corral) demonstrate complementary behaviors: many programs that Corral struggles on yield to the underapproximation-based technique, and vice versa. This observation motivated us to build a portfolio verifier, Legion<sup>+</sup>, that runs both techniques in parallel. We found the portfolio to be more effective than either tool alone (with similar computational resources, i.e., each verifier in the portfolio runs with half of Corral's time budget). Both Legion and Legion<sup>+</sup> are available open-source on the *legion* branch of the *corral* repository<sup>1</sup>.

Our experiments are conducted on 727 Windows and Linux device driver benchmarks on which Corral struggles, i.e., Corral is unable to solve any of

<sup>1</sup> https://github.com/boogie-org/corral.git (branch: legion).

these benchmarks in less than 200 s. We find that Legion is able to solve 12% more instances than Corral with a time budget of 2 h per instance. Further, the portfolio verifier, Legion<sup>+</sup>, given half the time budget of Corral, solves 15% more benchmarks than Corral, and it is found to be 2.9<sup>×</sup> faster than Corral on benchmarks that are solved by both Corral and Legion<sup>+</sup>.

The primary contributions of this paper are as follows:


# **2 Background**

This section presents background material that we use in the rest of the paper.

A logical formula consists of literals. A literal is either a variable or the negation of a variable. A logical formula in *Conjunctive Normal Form* (CNF) is a *conjunction* of clauses, where each clause is a *disjunction* of literals. Given a logical formula, a *satisfiability* solver returns whether the formula is *satisfiable* (SAT) or *unsatisfiable* (UNSAT). If a formula is SAT, the solver provides a model in the form of a satisfying assignment of the variables. If a formula is UNSAT, the solver returns an *unsatisfiable core* (unsat core), which is a subset of clauses of the input formula whose conjunction is still UNSAT.
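To make these notions concrete, here is a toy sketch (ours, not any tool's code) that checks CNF satisfiability by brute force and shrinks an UNSAT clause set to a minimal unsat core; literals are nonzero integers, DIMACS-style.

```python
from itertools import product

def solve(clauses):
    """Return a satisfying assignment {var: bool}, or None if UNSAT."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product((False, True), repeat=len(variables)):
        model = dict(zip(variables, bits))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return model
    return None

def minimal_unsat_core(clauses):
    """Deletion-based minimization: drop any clause whose removal keeps
    the set UNSAT; what remains is a minimal unsat core."""
    core = list(clauses)
    for clause in list(core):
        rest = [c for c in core if c is not clause]
        if rest and solve(rest) is None:
            core = rest
    return core
```

For the clause set [[1], [-1, 2], [-2], [3]], the first three clauses already conflict, so minimization drops [3].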

### **2.1 Language Model**

We consider a programming language that represents a *passified* form of Boogie programs [8]. A program consists of multiple procedures (*Proc*). We assume an *entry-point* procedure called main where program execution starts. Each procedure can have any number of local variable declarations followed by a series of basic blocks (*BasicBlock*). We assume that local variables are initially unconstrained. A basic block is labeled by a unique identifier and consists of multiple statements (*Stmt*) followed by a single control statement (*ControlStmt*). A control statement is either a *goto*, which takes a sequence of basic block labels and non-deterministically picks one to jump to, or a *return*, which returns control back to the caller. Returning from main terminates the program execution. A statement is either an *assume* command or a procedure *call*. The statement (*assume* ϕ) allows execution to proceed only if ϕ holds.

**Fig. 1.** A passified program

We leave the set of variable types (*Type*) and expressions (*Expr* ) unspecified. In practice, we can use any expression language that can be directly encoded in SMT. Our implementation uses linear arithmetic, fixed-size bit-vectors, uninterpreted functions, and extensional arrays. This combination is sufficient to realistically translate C programs into our language representation [21,24].

Note that the programs we consider have no global variables, return parameters of procedures, or assignments. These restrictions are without loss of generality [23]: conversion of these additional features into our language representation is readily available in tools like Boogie. A passified program makes it easy to describe the verification-condition generation process.
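One possible in-memory representation of this language (purely illustrative, and ours; the actual tools operate on Boogie ASTs) is a handful of record types:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Assume:            # assume phi: execution proceeds only if phi holds
    expr: str

@dataclass
class Call:              # procedure call (no returns or globals needed)
    callee: str

@dataclass
class Goto:              # non-deterministic jump to one of the listed labels
    targets: List[str]

@dataclass
class Return:            # return control to the caller
    pass

@dataclass
class BasicBlock:
    label: str
    stmts: List[Union[Assume, Call]]
    control: Union[Goto, Return]

@dataclass
class Proc:
    name: str
    locals: List[str] = field(default_factory=list)
    blocks: List[BasicBlock] = field(default_factory=list)
```

A block like `L0: assume x == 0; goto L1, L2;` then becomes `BasicBlock("L0", [Assume("x == 0")], Goto(["L1", "L2"]))`.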

Given a program P, we consider the verification question of whether there exists a terminating execution of P. To be precise, we are interested in finding out whether there is any execution of main that reaches its *return* statement. If no such execution exists, then P is considered verified, or Safe. Otherwise, we say that P is Unsafe and return the execution trace with concrete variable values along the trace. Note that we consider a bounded version of the verification problem, i.e., we require that P does not contain any loops or recursive procedure calls. All such loops and recursive calls must be unrolled to a pre-determined depth before proceeding with verification; the verification problem thus becomes decidable (if the expression language of the program is decidable) [23].

**Fig. 2.** Call graph of the program in Fig. 1.

$$\begin{aligned} &\text{pVC(main)}: \\ &blk\_{L0} \\ \wedge & (blk\_{L0} \implies x == 0 \wedge y == 0 \\ & \qquad \wedge ((blk\_{L1} \wedge flow(0) == 1) \vee (blk\_{L2} \wedge flow(0) == 2))) \\ \wedge & (blk\_{L1} \implies c \wedge blk\_{L3} \wedge flow(1) == 3) \\ \wedge & (blk\_{L2} \implies \neg c \wedge blk\_{L3} \wedge flow(2) == 3) \\ \wedge & (blk\_{L3} \implies y \neq 0) \end{aligned}$$

**Fig. 3.** Partial VC of main()

#### **2.2 VC Generation for a Procedure**

Consider a procedure baz that does not contain any procedure calls. This section outlines one way of verifying baz, i.e., finding out if it has a terminating execution. We use a process called Verification Condition (VC) generation on baz to construct a logical formula Φ and feed it to an SMT solver. If Φ is UNSAT, then the *return* statement in baz is unreachable and baz is Safe. Otherwise, we extract the satisfiable model from the SMT solver, construct the execution trace and return Unsafe along with the trace. We now outline the VC-generation process.

Suppose that baz takes input arguments x. For each basic block j in baz, we define a Boolean variable blk*j*, termed the *control-flow* variable. Let st*j* denote the conjunction of all assume statements in basic block j. Let successor(j) denote the targets of the *goto* statement in j, i.e., all the successor basic blocks in baz to which control may jump non-deterministically from j. Let i*j* be a unique integer constant representing basic block j. We also define an uninterpreted function flow : Z → Z that records the non-deterministic choice of the successor basic block of j. Given the above, we construct a logical formula ψ*j* for each basic block j as follows:

$$blk\_j \Rightarrow (st\_j \land \bigvee\_{s \in successor(j)} (blk\_s \land (i\_s == flow(i\_j))))$$

If basic block j ends with a *return* statement instead of a *goto*, then ψ*<sup>j</sup>* is:

$$blk\_j \Rightarrow st\_j$$

Assuming the first basic block of baz, where procedure execution begins, is labeled s, the VC of baz is constructed as follows:

$$blk\_s \land \bigwedge\_{l \in \mathit{blocks}(\mathtt{baz})} \psi\_l$$

In Fig. 3, we show the VC of main of the program in Fig. 1 as an example, ignoring the procedure calls in main (i.e., treating them as (*assume true*)). We term such a VC, of a procedure whose calls are skipped, the *partial VC* (pVC) of the procedure.
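The ψ*j* construction can be mechanized as a small formula printer. The sketch below is ours: it emits SMT-ish text rather than real solver terms, and it skips procedure calls, so what it produces is exactly a partial VC.

```python
def psi(j, st, succ, ident):
    """psi_j for block j: blk_j => (st_j ∧ ⋁ (blk_s ∧ flow(i_j) == i_s))."""
    if not succ:                                  # block ends in `return`
        return f"(blk_{j} => {st})"
    branch = " | ".join(f"(blk_{s} & flow({ident[j]}) == {ident[s]})"
                        for s in succ)
    return f"(blk_{j} => ({st} & ({branch})))"

def pvc(entry, blocks):
    """pVC of a procedure: blk_entry ∧ conjunction of all psi_j.
    `blocks` maps label -> (assume-conjunction, successor labels)."""
    ident = {label: k for k, label in enumerate(blocks)}
    body = " & ".join(psi(j, st, succ, ident)
                      for j, (st, succ) in blocks.items())
    return f"blk_{entry} & {body}"
```

Feeding it the control skeleton of Fig. 3 (blocks L0 through L3) yields one conjunct per basic block, in the shape shown in that figure.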

#### **2.3 Static Versus Dynamic Inlining**

Given a program P with starting procedure main, one simple way to verify P is to construct the VC of main by inlining all procedure calls and check the satisfiability of VC(main) with an SMT solver. However, such a *static inlining* strategy can cause an exponential blowup in the size of the VC. Hence, we instead use a *dynamic inlining* algorithm, called Stratified Inlining (SI) [26], that employs a Counterexample Guided Abstraction Refinement (CEGAR) technique [14] to inline procedure VCs *dynamically*. Dynamic inlining has been shown to scale better than static inlining [25]: it produces more compact VCs during abstraction refinement, which leads to significantly faster program verification.

#### **2.4 Verification with Stratified Inlining**

The working of SI is shown in Algorithm 1. For simplicity, let us assume that each basic block in P contains at most one procedure call. Every program point from which a procedure is called is termed a *callsite*. A static instance of a callsite is denoted by a pair (l, c), where l is the basic block identifier from which a call to the procedure c is made; for example, main in Fig. 1 has two static callsites, (L1, foo) and (L2, bar). A dynamic callsite is a stack of static callsites, representing the runtime stack during a program's execution, with main at the bottom of the stack. For example, the dynamic callsite corresponding to the call to foo from L1 in main is [main, (L1, foo)]. The call graph of the program in Fig. 1 is shown in Fig. 2.

#### **Algorithm 1:** Stratified Inlining (SI) algorithm.

```
Input: program P with starting procedure main
Input: An SMT solver S
Output: Safe, or UnSafe(τ)
1 C ← {[main, s] | s ∈ callsites(main)}
2 S.Assert(pVC(main, [main]))
3 while true do
4     outcome ← OverRefStep(P, C, S)
5     if outcome == Safe ∨ outcome == UnSafe(τ) then
6         return outcome
7     else
8         let NoDecision(_, C′) = outcome
9         C ← C′
```
The SI algorithm takes as input a program P with a starting procedure main and an SMT solver S. Initially, we add the dynamic callsites in main to a list C (Line 1) and then inline main, i.e., assert the pVC of main (Line 2). The callsites in C are termed as *open* callsites because they have not yet been inlined. The above steps construct an abstraction of P. The SI algorithm then iteratively calls the OverRefStep routine on this abstraction (Line 4) to perform gradual refinement until we can reach a decision about whether P is Safe or not. Each invocation of OverRefStep can potentially inline more procedures by asserting their partial VC to the solver S. Thus, the state of the solver, as well as the set of open callsites C change across invocations of OverRefStep. We discuss the *Overapproximation Refinement Guided Stratified Inlining* (*OverRefSI*) strategy used by the OverRefStep routine in Sect. 2.5.
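The driver loop of Algorithm 1 can be mirrored in a few lines of Python; the step function below is a stand-in for OverRefStep, and verdicts are encoded as tagged tuples (an encoding we chose for this sketch, not the tools' actual interface).

```python
def stratified_inlining(step, initial_callsites):
    """SI driver: call the refinement step until it yields a verdict
    (cf. Algorithm 1, Lines 3-9)."""
    C = set(initial_callsites)
    while True:
        outcome = step(C)
        if outcome[0] in ("Safe", "UnSafe"):
            return outcome
        _tag, _tau, C = outcome          # NoDecision: adopt the new open set

def make_fake_step():
    """A fake OverRefStep needing two refinement rounds before Safe."""
    rounds = {"n": 0}
    def step(C):
        rounds["n"] += 1
        if rounds["n"] < 3:
            return ("NoDecision", set(), {f"cs{rounds['n']}"})
        return ("Safe",)
    return step
```

Running the driver with the fake step exercises exactly the loop structure: two NoDecision rounds that replace the open-callsite set, then a final verdict.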

#### **2.5 Overapproximation Refinement Guided Stratified Inlining**

The OverRefStep routine given in Algorithm 2 demonstrates the inner workings of the *OverRefSI* strategy at each verification step. The *OverRefSI* strategy [26] for verifying a program works by iteratively firing overapproximation queries and gradually refining the abstraction of P. If the query returns UNSAT, then we can conclude that P is safe with respect to the given property. Otherwise, we extract all the open callsites that appear on the counterexample trace and refine the abstraction of P by inlining these callsites. If the counterexample trace contains no open callsites, then P is unsafe and we return the verdict along with the counterexample trace.

The OverRefStep routine takes as input a program P, a set of open callsites <sup>C</sup> and an SMT solver <sup>S</sup>. The OverRefStep routine is called iteratively in order to verify the safety of P. We demonstrate the working of *OverRefSI* to verify the pVC of main of Fig. 1 in Table 1. At the beginning, the SI algorithm asserts the pVC of main to S and adds [main,(L1, foo)] and [main,(L2, bar)] to the list of open callsites C in step 0.

**Algorithm 2:** OverRefStep(P, <sup>C</sup>, <sup>S</sup>)

```
Input: procedure P, set of callsites C, SMT solver S
Output: Safe, UnSafe(trace), NoDecision(τ, C)
1  // Overapproximate check
2  if S.Check() == UNSAT then
3      return Safe
4  else
5      τ ← opencallsites(S.Model())
6      if τ == ∅ then
7          return UnSafe(S.Model())
8      else
9          C′ ← ∅
10         forall c ∈ τ do
11             C′ ← C′ ∪ Inline(P, c)
12         C ← (C − τ) ∪ C′
13         return NoDecision(τ, C)
```
Next, the SI algorithm calls OverRefStep with P, C, and S as arguments. OverRefStep fires an overapproximation query (Line 2). If the query is unsatisfiable, we return the Safe verdict. If the query is satisfiable, we take the counterexample trace and extract all the open callsites on the trace into τ (Line 5). If τ is empty, i.e., the counterexample trace contains no open callsites, then the trace is not spurious and we return an Unsafe verdict with the trace (Line 7). Otherwise, we inline all the callsites in τ, collecting the new callsites that open up due to the inlinings (Line 11). Inlining a callsite c consists of asserting the partial VC of the procedure invoked from c.

Subsequently, the inlined callsites are removed from the list of open callsites C and new callsites that opened up due to the inlinings are added to C (Line 12). For example, in step 1 of Table 1, OverRefStep fires an overapproximation query that returns SAT with a counterexample trace that contains the callsite of foo, i.e., [main, (L1,foo)]. This callsite is then inlined by asserting the pVC of foo to the solver. This opens up the callsites of foo1 and foo2. Since we have not been able to arrive at a decision regarding the safety of P at this step, a verdict of NoDecision is returned along with the list of inlined callsites τ and the new list of open callsites C (Line 13).
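The bookkeeping on Line 12 is ordinary set arithmetic. In Python, mirroring step 1 of Table 1 (the block labels of foo's inner callsites are our assumption, since Fig. 1's code is not reproduced here):

```python
# Open callsites at step 0 (Table 1), as (block label, callee) pairs.
C = {("L1", "foo"), ("L2", "bar")}
tau = {("L1", "foo")}                       # open callsites on the counterexample
C_prime = {("L4", "foo1"), ("L5", "foo2")}  # opened by inlining foo (labels assumed)

# Line 12: remove the inlined callsites, add the freshly opened ones.
C = (C - tau) | C_prime
```

After the update, (L1, foo) is closed while bar, foo1, and foo2 remain open.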

The SI algorithm then calls OverRefStep again. In step 2, the overapproximation query again returns SAT, with the counterexample trace containing the open callsite of foo1, which we inline by asserting the pVC of foo1. The verification process continues in this way, inlining the open callsites on the counterexample trace at every step, which gradually refines the pVC of main. Finally, in step 7, the overapproximation query returns UNSAT, from which we conclude that main is safe.


**Table 1.** Execution of *OverRefSI* on the program of Fig. 1

# **3 Overview**

#### **3.1 Underapproximation Widening**

We propose a novel algorithm, *Underapproximation Widening Guided Stratified Inlining* (*UnderWidenSI*), that uses proofs of unsatisfiability to guide stratified inlining. *UnderWidenSI* maintains an underapproximated model of the target program and *widens* it until either the program is verified as *safe* or a bug is found.

We illustrate the *UnderWidenSI* strategy in Figs. 4a to 4d. Assume that we are trying to verify whether some required property holds on a program. The *yellow* ovals contain the reachable program states, while the *red* ovals depict error states on which the required property does not hold. The objective of a verification algorithm is to construct a *model* of the program that is precise enough to show that the program can reach an error state, or to prove that the error states are unreachable. Figures 4a to 4c show a safe program, while Fig. 4d depicts an unsafe program.

Consider Fig. 4a: the *UnderWidenSI* algorithm starts off with the partial verification condition of the entry procedure and "blocks" executions through all its open callsites.

**Fig. 4.** How *UnderWidenSI* works

**Definition (Blocked callsites).** We use the term *blocking* a callsite C to mean that all paths that reach C are deemed infeasible. That is, blocking a callsite has the effect of replacing the callsite by (*assume false*).

Essentially, blocking callsites creates underapproximations of the set of feasible program paths. Such underapproximated VCs can be constructed by asserting additional *blocking* clauses on the control-flow variables of the open callsites. These blocks disallow reachability to certain program states. For example, in Fig. 4a, we construct an underapproximated model of the program by blocking the open callsites C<sup>1</sup> and C<sup>2</sup>. The inner *green* oval depicts the program states that are reachable in the underapproximated model, whereas the outer *gray* regions show the states that are unreachable due to the blocks on C<sup>1</sup> and C<sup>2</sup>.
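Blocking can also be seen as a purely syntactic transformation on a block's statement list. A minimal sketch of that view (statements as tagged tuples, an encoding of ours):

```python
def block_callsites(stmts, blocked):
    """Replace every call to a blocked callee by (assume false),
    which makes all paths through that callsite infeasible."""
    return [("assume", "false") if s[0] == "call" and s[1] in blocked else s
            for s in stmts]
```

For example, blocking foo turns `call foo; assume x > 0` into `assume false; assume x > 0`, killing every execution through that block.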

If the verification query on this model (the conjunction of the underapproximated model and the negation of the property) returns SAT, it implies that an error state is indeed reachable. On the other hand, if the query returns UNSAT (as shown in Fig. 4a), we need to *widen* the model to admit additional reachable executions. We guide this widening by extracting the reason for the unsatisfiability from a *minimal unsat core*<sup>2</sup> of the query, which yields the set of blocking clauses; the callsites corresponding to these blocking clauses constitute

<sup>2</sup> Although there may exist multiple minimal unsat cores, we found via some preliminary experiments that the choice of the unsat core does not have a significant impact on the overall runtime of our algorithm (on an average).

**Fig. 5.** How *OverRefSI* works

a *reason* why the current underapproximate model cannot reach any of the error states. Hence, we widen the model by unblocking exactly these callsites, leading to a wider model (see Fig. 4b). The widening by inlining C<sup>2</sup> constitutes a stratified-inlining step, and hence may open up new callsites, say C<sup>3</sup> and C<sup>4</sup>.

We proceed in the same manner, blocking these open callsites and repeating the query. Finally (Fig. 4c), we construct an underapproximated model that still does not intersect the error states. In this case, however, the unsat core does not contain any blocking clause, as none of the currently blocked callsites would have allowed widening in the direction of the error states.

The unsat core thus provides a *direction* for widening towards the error states. It also allows us to declare the program *safe* without having to widen the model to encompass the set of all reachable program states: if the verification query is UNSAT and the unsat core does not contain any blocking clause, this is a sufficient condition to declare the program *safe*.

Figure 4d shows how our algorithm proceeds for a faulty program: it incrementally widens the model in the direction of the error states until an error state R is reached. At this point, the *UnderWidenSI* algorithm declares the program *unsafe*.
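The overall widening loop can be caricatured on a call graph alone. In this toy model (entirely ours), each procedure either fails locally or not, blocked callsites are simply skipped, and the unsat core is approximated by the whole frontier of open callsites; so it illustrates only the control flow of *UnderWidenSI*, not its core-guided pruning.

```python
def under_widen_si(procs, entry):
    """procs: name -> (fails_locally, list of callees).
    Returns "UnSafe" if an error is reachable through inlined procedures,
    "Safe" once no blocked callsite remains to unblock."""
    inlined = {entry}
    frontier = set(procs[entry][1])          # open (blocked) callsites
    while True:
        if any(procs[p][0] for p in inlined):
            return "UnSafe"                  # underapprox query is SAT
        if not frontier:
            return "Safe"                    # core names no blocking clause
        to_unblock, frontier = frontier, set()
        for callee in to_unblock:            # widen: inline the named callsites
            if callee not in inlined:
                inlined.add(callee)
                frontier |= set(procs[callee][1])
        frontier -= inlined
```

On a graph where only a deeply nested procedure fails, the loop widens level by level until the failure is exposed; on a fully safe graph, it stops as soon as every callsite is unblocked.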

Let us now contrast the *UnderWidenSI* strategy with the *OverRefSI* strategy, popularly known as *counterexample-guided abstraction refinement (CEGAR)*, which currently drives the SI algorithm in Corral. *OverRefSI* starts off with an overapproximated model of the program: the pVC of the entry procedure with all callsites replaced by non-deterministic updates to their sets of modified variables. For example, in Fig. 5a, *OverRefSI* constructs an abstract program (overapproximated model) M<sup>1</sup> by overapproximating the open callsites. If the resulting verification condition is SAT, it examines the generated counterexample to check if it is spurious. If the counterexample is found to be a true bug, it declares the program *unsafe*. If the counterexample is spurious, the model is refined to eliminate it. For example, in Fig. 5a, there exists an error state (counterexample) P within M<sup>1</sup> where the property can be violated. Hence, *OverRefSI* refines M<sup>1</sup> by inlining the overapproximated callsites through which P is reachable; the refinement rules out P as a counterexample, i.e., P becomes unreachable after refinement. We observe in Fig. 5a that after the first round of refinement, P is no longer reachable in the overapproximated model M<sup>2</sup>; however, we can still find another counterexample Q, so the abstraction M<sup>2</sup> is refined again. The program is declared *safe* when the model cannot reach any error state. Note that the algorithm can prove safety without having to capture the exact set of reachable program states.

*OverRefSI* and *UnderWidenSI* are complementary: while *OverRefSI* maintains an overapproximated model and refines the model (shrinking the set of reachable states), *UnderWidenSI* maintains an underapproximated model and widens the model (expanding the set of reachable states) incrementally. In terms of the algorithmic details, the *OverRefSI* algorithm in Corral uses the models (the counterexamples) to drive refinements, whereas our *UnderWidenSI* algorithm uses the proof (the unsat core) to guide the widenings.

#### **4 Algorithms**

#### **4.1 Underapproximation Widening Guided Stratified Inlining (***UnderWidenSI* **)**

The UnderWidenStep routine in Algorithm 3 demonstrates how the *UnderWidenSI* strategy works in each verification step. It takes as input a procedure P, a set of open callsites C, and an SMT solver S. The UnderWidenStep routine is called by the SI algorithm (instead of OverRefStep in Line 4) iteratively in order to verify the safety of P.

In the beginning, we construct an underapproximated pVC of the input procedure P by blocking all calls through the open callsites in C (Line 4). Next, we issue an underapproximation query (Line 5). If the query returns SAT, we return the verdict unsafe with the counterexample trace (Line 6). Otherwise, we obtain the minimal unsatisfiable core *uc* and extract into μ all the blocked callsites that appear in *uc* (Line 8).

If μ does not contain any blocked callsite, we deduce that P is safe; the proof of safety is captured by *uc*, and we return the verdict Safe. Otherwise, each callsite in μ is inlined (Line 15), which widens the model of P. The inlined callsites are then removed from the list of open callsites C, and the new callsites that open up due to the inlinings are added to C (Line 16).

When the algorithm is unable to arrive at a decision regarding the safety of P, it returns a verdict of NoDecision along with the list of inlined callsites μ and the new list of open callsites C (Line 17).
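To make the control flow concrete, the iterated UnderWidenStep loop can be sketched as an executable toy in Python. The call graph `CALLS`, the `CAN_FAIL` predicate, and the stand-in "solver" are all invented for this sketch; a real implementation would issue the SMT queries of Algorithm 3 instead.

```python
# Toy sketch of the UnderWidenSI loop (Algorithm 3), with the SMT query
# replaced by simple reachability over a hypothetical call graph.
CALLS = {                      # callee lists per procedure (hypothetical)
    "main": ["foo", "bar"],
    "foo": ["foo1", "foo2"], "bar": ["bar1", "bar2"],
    "foo1": [], "foo2": [], "bar1": [], "bar2": [],
}
CAN_FAIL = set()               # procedures whose own body can violate an assertion

def subtree_fails(proc):
    """True iff some procedure reachable from `proc` (inclusive) can fail."""
    return proc in CAN_FAIL or any(subtree_fails(c) for c in CALLS[proc])

def under_widen_si(entry):
    inlined = {entry}                                  # procedures inlined so far
    open_cs = list(CALLS[entry])                       # open callsites C
    while True:
        # underapprox query: SAT iff an already-inlined body can fail
        if any(p in CAN_FAIL for p in inlined):
            return "UnSafe"
        # stand-in "unsat core": blocked callsites shielding a potential failure
        mu = [c for c in open_cs if subtree_fails(c)]
        if not mu:
            return "Safe"                              # proof uses no blocked clause
        for c in mu:                                   # widen: inline callsites in mu
            inlined.add(c)
            open_cs.extend(CALLS[c])                   # newly opened callsites
        open_cs = [c for c in open_cs if c not in mu]  # C <- (C - mu) U C'

print(under_widen_si("main"))      # Safe: no procedure can fail
CAN_FAIL.add("bar2")
print(under_widen_si("main"))      # UnSafe: widening reaches bar2's failure
```

Note that the toy's "core" takes every shielding callsite, whereas a real unsat core is typically smaller; the point is only that widening proceeds in the direction indicated by the proof.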

**Example.** We demonstrate the working of *UnderWidenSI* to verify the pVC of main of Fig. 1 in Table 2. Initially, we assert the pVC of main and add [main,(L1, foo)] and [main,(L2, bar)] to the list of open callsites in step 0.

# **Algorithm 3:** UnderWidenStep(P, C, S)

```
Input: procedure P, set of open callsites C, SMT solver S
Output: Safe, UnSafe(trace), NoDecision(μ, C)
1  // Underapproximate check
2  S.Push()
3  forall c ∈ C do
4      S.Assert(¬ControlVariable(c))
5  if S.Check() == SAT then
6      return UnSafe(S.Model())
7  else
8      μ ← BlockedCallsites(S.UnsatCore())
9  S.Pop()
10 if μ == ∅ then
11     return Safe
12 else
13     C′ ← ∅
14     forall c ∈ μ do
15         C′ ← C′ ∪ Inline(P, c)
16     C ← (C − μ) ∪ C′
17 return NoDecision(μ, C)
```
Replacing each of the open callsites with an (*assume false*) statement, i.e., blocking them, constructs an underapproximation of the program. If an SMT solver query on this underapproximation returns SAT, the program is surely unsafe, as the satisfying model can only represent an execution trace that passes through inlined callsites. In that case, we can return the verdict unsafe along with an error trace constructed from the model. On the other hand, if the underapproximation check returns UNSAT, we cannot immediately return a verdict on the safety of the program.

Following this, in step 1 (see Table 2), we push a new frame on the solver and assert (¬blk<sub>L1</sub> ∧ ¬blk<sub>L2</sub>) to block executions through the callsites of foo and bar, respectively, constructing the underapproximated pVC of main. We query the solver with these constraints. Figure 1 shows that if we block executions through basic blocks L1 and L2, the program cannot terminate, i.e., the *return* statement in L3 is not reachable. Hence, the solver returns UNSAT. The unsatisfiability arises from blocking executions through both L1 and L2.

To widen the underapproximated model of the program so that we may reach L3, we need to remove the block on at least one of them and inline the respective callsite. The unsat core, in this case, contains the callsite of bar in basic block L2. Therefore, we pop the earlier solver frame containing the blocked clauses and assert (blk<sub>L2</sub> =⇒ pVC(bar)) in the solver. Inlining bar opens up the callsites [main,(L2, bar),(L12, bar1)] and [main,(L2, bar),(L13, bar2)].
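For illustration, the step-1 query can be mimicked by a brute-force propositional check in Python. The two-variable formula below is a hypothetical stand-in for the real pVC and SMT solver: L1 and L2 stand for "the execution passes through that block", and blk1, blk2 are the blocking control literals.

```python
import itertools

def pvc_main(L1, L2, blk1, blk2):
    # hypothetical pVC: main reaches its return only via L1 or L2,
    # and a blocked callsite forbids passing through its block
    reach_return = L1 or L2
    respect_blocks = (not blk1 or not L1) and (not blk2 or not L2)
    return reach_return and respect_blocks

def query(blk1, blk2):
    """SAT iff some execution satisfies the blocked pVC (brute force)."""
    return any(pvc_main(L1, L2, blk1, blk2)
               for L1, L2 in itertools.product([False, True], repeat=2))

print(query(blk1=True, blk2=True))    # False: blocking both callsites is UNSAT
print(query(blk1=True, blk2=False))   # True: widening through L2 suffices
```

Both blocking clauses are needed for unsatisfiability, which is exactly why they appear in the core and direct the widening.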

**Table 2.** Execution of *UnderWidenSI* on the program of Fig. 1

Next, in step 2, we again construct the underapproximated pVC of main by blocking executions through the callsites of foo, bar1 and bar2. The solver query returns UNSAT with *uc* containing the callsites of foo, bar1 and bar2, which are then inlined.

In step 3, the callsites of foo1 and foo2 are now open. Blocking both of these callsites and making an underapproximation check returns UNSAT with *uc* containing the callsites of foo1 and foo2. These callsites are now inlined.

In step 4, the underapproximation query returns UNSAT and *uc* contains no blocked callsites. This means that *uc* contains only inlined callsites, i.e., if, starting from step 0, we inline only the callsites in *uc* and leave the remaining callsites overapproximated, the query will still return UNSAT. Therefore, *uc* is a proof of the safety of the program, and we return the verdict that the pVC of main is safe.

Note that when the underapproximation query returns SAT, the counterexample trace is constructed on the underapproximated program, which contains only blocked and inlined callsites; since blocked callsites admit no executions, the trace passes only through inlined callsites. The underapproximated program represents a subset of the paths of the original program; therefore, any counterexample trace present in the underapproximated program is sure to be present in the original program as well. Hence, if the underapproximated program is unsafe, the original program is unsafe too.

We have implemented the *UnderWidenSI* algorithm in Legion. We compare the performance of the *UnderWidenSI* algorithm in Legion against that of Corral which uses *OverRefSI*.

#### **4.2 Portfolio Technique**

The complementary behavior of the *OverRefSI* and *UnderWidenSI* algorithms motivates us to design a portfolio approach for verifying a program. The portfolio strategy incorporates both the *OverRefSI* algorithm used by Corral and the *UnderWidenSI* algorithm implemented in Legion. We refer to the portfolio verifier as Legion<sup>+</sup>. For each program, Legion<sup>+</sup> runs both Corral and Legion in parallel, terminates verification as soon as one of them finishes, and reports the outcome. We discuss the performance of Legion<sup>+</sup> against that of Corral and Legion in Sect. 5.
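The first-to-finish portfolio scheme can be sketched in a few lines of Python. The two "verifiers" below are hypothetical stand-ins that merely sleep; Legion<sup>+</sup> itself launches the actual Corral and Legion processes.

```python
# Minimal sketch of the portfolio idea: run two verification strategies
# concurrently and report whichever verdict arrives first.
import concurrent.futures
import time

def over_ref_si(program):          # stand-in for Corral's CEGAR loop
    time.sleep(0.2)
    return ("OverRefSI", "Safe")

def under_widen_si(program):       # stand-in for Legion's widening loop
    time.sleep(0.05)
    return ("UnderWidenSI", "Safe")

def portfolio(program, budget_s=1.0):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(v, program) for v in (over_ref_si, under_widen_si)]
        # the first verifier to finish within the (halved) budget decides
        done, pending = concurrent.futures.wait(
            futures, timeout=budget_s,
            return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()             # stop waiting on the slower engine
        return next(iter(done)).result() if done else ("portfolio", "Timeout")

print(portfolio("toy.bpl"))        # → ('UnderWidenSI', 'Safe')
```

In a real deployment the losing engine's process is killed rather than cancelled, and the per-engine time budget is halved, as described in Sect. 5.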

### **5 Experimental Results**

We have built a tool, Legion, that implements our *UnderWidenSI* algorithm. To compare against *OverRefSI*, we use Corral [26], a state-of-the-art verifier used at Microsoft [24]. We also build a portfolio solver, Legion<sup>+</sup>, that runs both Corral and Legion in parallel. Whenever one of the tools finishes verification, Legion<sup>+</sup> terminates the algorithms and reports the outcome.

We compare the performance of Corral against Legion and Legion<sup>+</sup> on a suite of Windows and Linux device driver benchmarks. The Windows device driver benchmarks are obtained by running Static Driver Verifier (SDV) [4] on real Windows device drivers that exercise many features of the C language, such as arrays, heaps, pointers, loops, and recursion. SDV compiles these drivers into a suite of BOOGIE [8] programs, each of which is a device driver paired with a property (the compilation is detailed in [24]). Note that, although the suite of Windows device drivers compiled into BOOGIE programs is available as the SDV benchmarks [31], the actual C programs are internal to Microsoft.

Along with this, we also use a set of Linux device drivers that are available as C programs in the SVCOMP benchmark suite [7]. We used SMACK [36] to compile the Linux device drivers into BOOGIE programs. Overall, we use a total of 727 hard programs from the SDV and SVCOMP benchmarks, on which Corral takes more than 200 s to verify or times out. We set the timeout for each verification task to 2 h for both Corral and Legion. For all verification tasks, we use an unrolling length of 3, as advised in the benchmarks [31] and used in other works [11].

As Legion<sup>+</sup> uses twice the computational resources compared to Corral and Legion, we halve its time budget to 1 h to make a fair comparison. We also report the performance of Legion<sup>+</sup> with a 2 h time budget (it can be seen as the *virtual best* of Corral and Legion).

The experiments were performed on a machine with an AMD EPYC 7452 processor (48 cores) and 384 GB of RAM. Both Corral and Legion use Z3 [15] as the underlying SMT solver. We used Z3's default setting of a fixed random seed for all our experiments, after verifying that the choice of random seed does not have any significant impact on our results.

#### **5.1 Corral Versus Legion**

Figure 6 depicts the number of instances solved within the time budget by Corral and Legion. In Fig. 6, a point (x, y) denotes that x instances were each solved within time y. As we can observe, Corral solves 262 out of 727 instances (36%) with a time budget of 2 h per instance, whereas Legion solves 351 instances (48%) with the same budget. Both fail to solve 330 instances (45%). Out of the 397 instances (55%) solved by either Corral or Legion, 46 instances (12%) are solved exclusively by Corral, whereas 135 instances (34%) are solved exclusively by Legion.

**Fig. 6.** Number of instances solved within time (in hours) for Corral vs Legion vs Legion<sup>+</sup>.

**Table 3.** Total time taken by each verifier to solve instances

The scatter plot of verification times across Legion and Corral is shown in Fig. 7. The spread in the scatter plot demonstrates that the two tools complement each other: the benchmarks on which Corral struggles are sometimes handled well by Legion, and vice versa. Picking the best of the two verifiers solves a total of 397 out of 727 instances (55%). This motivated the design of Legion<sup>+</sup>.

#### **5.2 Performance of Legion<sup>+</sup>**

As Legion<sup>+</sup> utilizes parallelism, in order to make a fair comparison we halve the time budget for Legion<sup>+</sup> on each verification instance to 1 h. This means that Legion<sup>+</sup> runs both the tools Corral and Legion in parallel but with a time budget of 1 h each.

Figure 6 shows that the portfolio verifier Legion<sup>+</sup> solves 369 out of 727 instances (51%) with a 1 h time budget, whereas Corral solves only 262 instances (36%) with a total time budget of 2 h. There are only 14 instances that Corral solves but Legion<sup>+</sup> does not. Similarly, there are only 17 instances that Legion solves but Legion<sup>+</sup> does not.

**Fig. 7.** Scatter plot of verification time of Corral vs Legion.

With a 2 h timeout, Legion<sup>+</sup> solves 397 instances in total (55%). This is essentially the *virtual best* of Corral and Legion with a 2 h timeout.

Figure 8 shows the total time taken (in hours) by Corral, Legion and Legion<sup>+</sup> to verify the instances that were solved by all three of them (213 instances in total). Legion<sup>+</sup> is 1.9× faster than Legion and 2.9× faster than Corral.

Across the benchmarks that each of the tools solves individually, Corral takes 109 h to solve 262 benchmarks, Legion takes 112 h to solve 351 benchmarks, whereas Legion<sup>+</sup> solves 369 benchmarks in only 71 h (see Table 3).

Note that the benchmarks used in our study are those on which Corral took more than 200 s. On the rest of the benchmarks, Legion<sup>+</sup> will clearly perform at least as well as Corral. We chose to leave them out to keep the experiments within a reasonable time: there were roughly 14,000 of these easy cases, and excluding them allowed us to focus on benchmarks where a speedup matters.

#### **6 Related Work**

The high-level idea of using proof-guided abstractions has long been known [3,30]. Proofs of unsatisfiability have been used to derive abstractions for unbounded model checking in the context of microprocessor verification [30]. Amla et al. have also demonstrated that counterexample-based abstraction is complementary to proof-based abstraction, and that the two can be combined judiciously to reap the benefits of both techniques for hardware verification tasks [3]. However, program verification has mostly been dominated by counterexample-guided abstraction refinement (CEGAR) based strategies. Of the few proposals that use proof-guided underapproximation widening strategies, most focus on verification of multi-threaded programs [18,35]. These techniques underapproximate the number of thread interleavings allowed, while eagerly inlining all procedures. One technique [18] constrains the number of interleavings to certain bounds, while the other [35] uses dynamically inferred invariants for constructing (potential) underapproximations on interleavings. Note that these techniques are orthogonal to our approach. Eager inlining is not feasible for our benchmarks, which is precisely the problem that we address. Our proposal shows that proof-guided widening strategies can be effectively employed for verifying large sequential programs. Proofs of unsatisfiability from underapproximated models have also been utilized to narrow down the search space for overapproximation refinement in order to decide finite-precision bit-vector arithmetic with arbitrary bit-vector operations [9]. The underapproximation is done on the bit-vector variables of a propositional logic formula, where some of the bit-vector variables are encoded with fewer Boolean variables than their width.

**Fig. 8.** Cumulative time taken (in hours) to verify 213 instances that were solved by all three verifiers.

Other than using proofs to guide widening heuristics, proof artifacts, like interpolants, have been used to construct annotations [1,2,27–29] that can be useful in constraining future search. Such techniques are orthogonal to underapproximation widening based techniques. However, they can be useful for Legion and we plan to investigate them in the future.

Underapproximation widening has also been used in program synthesis [37,39,40]. Instead of unleashing the search for the program on the whole search space, such techniques search for the desired program in an underapproximated search space. While prior approaches [37] used a pre-defined widening sequence, later approaches [39,40] use proofs of unsatisfiability to guide the widening sequence. Similar techniques have also been used in the synthesis of Boolean functions [16,17]. Manthan [16,17] constructs an initial guess of the Boolean function by sampling the specification and building a decision-tree classifier from the resulting data. It then uses a proof-guided technique to "repair" the learnt model into a desired function.

There have also been applications of the maximal satisfiable subset (MAXSAT) of an unsatisfiable formula for program debugging. BugAssist [19] attempts to infer a set of suspicious locations using a MAXSAT formulation over a failing program trace and the specifications. Bavishi et al. [6] extend the formulation to provide a ranking over the suspicious locations, such that locations higher up in the ranking are less likely to cause regressions.

Another line of work is to use fuzzers to sample concrete instances and gradually build approximations of program behavior for the purpose of deductive verification [22] and symbolic execution [34]. However, such approaches use test instances and do not apply a proof-guided strategy.

Legion is inspired by many of the above algorithms, and there is potential for incorporating more of these ideas into Legion in the future.

### **7 Conclusion**

Bounded model checking approaches for program verification predominantly focus on CEGAR-based strategies. In this work, we propose a proof-guided underapproximation widening strategy that behaves in a complementary manner to the CEGAR technique. This complementary nature allows us to build a portfolio strategy that takes advantage of both proof-guided underapproximation widening and CEGAR to deliver a significant speedup in verification time over both.

Our current approach only looks at the predicates corresponding to the callsites to figure out which are most relevant to the proof of unsatisfiability of the underapproximated model. In the future, we aim to extract additional information from the unsat core, which would allow us to explore more involved widening strategies. Furthermore, combining with the underapproximation techniques that work on the domain of thread interleavings, to deal with a large space of sequential behaviors (via many procedures) and concurrent behaviors (via many interleavings), would be another interesting direction to explore. We also believe that underapproximation widening may yield improved performance in our distributed bounded model checker, Hydra [11,12]. Another interesting direction we want to pursue is combining bounded model checking algorithms (both overapproximation refinement and underapproximation widening) with dynamic analysis [5,13,38] and statistical testing [10,32] based approaches.

**Acknowledgements.** We wish to express our gratitude towards *Microsoft Azure* and *Google Cloud Platform* for providing us with computational resources for the experiments. We are also indebted to the PRAISE group of CSE department, IIT Kanpur and the anonymous reviewers for their helpful suggestions.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **SolCMC: Solidity Compiler's Model Checker**

Leonardo Alt<sup>1(B)</sup>, Martin Blicha<sup>2,3</sup>, Antti E. J. Hyvärinen<sup>2</sup>, and Natasha Sharygina<sup>2</sup>

<sup>1</sup> Ethereum Foundation, Berlin, Germany

leo@ethereum.org
<sup>2</sup> Università della Svizzera italiana, Lugano, Switzerland
{martin.blicha,antti.hyvaerinen,natasha.sharygina}@usi.ch
<sup>3</sup> Charles University, Prague, Czech Republic

**Abstract.** Formally verifying smart contracts is important due to their immutable nature, usual open source licenses, and high financial incentives for exploits. Since 2019 the Ethereum Foundation's Solidity compiler ships with a model checker. The checker, called SolCMC, has two different reasoning engines and tracks closely the development of the Solidity language. We describe SolCMC's architecture and use from the perspective of developers of both smart contracts and tools for software verification, and show how to analyze nontrivial properties of real life contracts in a fully automated manner.

**Keywords:** Ethereum · Solidity · Symbolic model checking · Constrained Horn clauses · Satisfiability modulo theories

# **1 Introduction**

The Ethereum Foundation's compiler for Solidity, the Ethereum platform's most used language, had almost 4 million downloads (3,957,195) over the last 60 days (at the time of submission). Since 2019, this compiler ships with a robust, built-in, easy-to-use symbolic model checker, SolCMC [16], formerly called SMTChecker. SolCMC models a *smart contract*, that is, a program for the Ethereum platform, and its properties as a system of constrained Horn clauses (CHCs) amenable to IC3-style model checking [34]. Since its deployment, SolCMC has increasingly served a dual purpose. On the one hand, smart contract programmers have through it very visible and easy access to formal verification techniques. On the other hand, perhaps more subtly but no less importantly, the tool serves as a sounding board for developers of Horn solvers. Currently the system interfaces with Spacer [31] and Eldarica [30], making the related techniques available

This work was partially supported by Swiss National Science Foundation grant 200021 185031 and by Czech Science Foundation grant 20-07487S.

S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 325–338, 2022. https://doi.org/10.1007/978-3-031-13185-1\_16

to a large user base. We expect to integrate in SolCMC many other techniques through a similar mechanism. For instance, the tool has a bounded model checking engine for finding bugs by issuing SMT queries to solvers such as z3 [35] and cvc5 [23].

Smart contracts running on the Ethereum platform hold and control billions of dollars through their immutable logic, and therefore bugs can lead to massive losses. There are many recent sophisticated tools that increase the security of the Ethereum contract ecosystem by detecting smart contract bugs before they are deployed. However, new and emerging applications from the diverse user base are driving Solidity development at a fast pace and it is difficult to keep tools synchronized with the language. We believe that in the long run, the best way to ensure that a model checker for Solidity is sustainable is by integrating it directly into the compiler distribution, or the *main repository* of the related language tools, as we have done for SolCMC.

The direct integration of the model checker into the compiler has two main advantages. Firstly, we can model precisely and robustly features that are somewhat specific to Solidity and its applications, such as modeling reentrancy callbacks, and the handling of global storage. This makes the model checker capable of synthesizing new contracts that serve as counterexamples for correctness, and computing inductive invariants for the cases where properties hold. Secondly, the short pipeline between the source code and the model allows the presentation of both counterexamples and invariants as compiler warnings and annotations using a vocabulary that is meaningful for the developer.

The goal of SolCMC is to verify properties of programs with minimal user input. Our system supports writing properties as assert statements and can, in addition, automatically check other structural properties such as popping from an empty array, out-of-bounds array accesses, underflows, overflows, divisions by zero, and transfers with insufficient balance. Moreover, common Solidity vulnerabilities such as reentrancy, mutability, and selfdestruct reachability can be verified using test harnesses that make the assertion-based approach more expressive. Thus, the expressiveness of SolCMC allows efficiently obtaining meaningful results for real life contracts in a way that is in practice fully automated. To demonstrate this we analyze the Beacon Chain Deposit Contract, the basis for Ethereum's proof of stake consensus layer, and the OpenZeppelin implementation of the ERC777 token standard.

An extended version of this tool paper including appendices showing detailed experimental results and other analysis is available online in the accompanying artifact, at https://doi.org/10.5281/zenodo.6512173.

*Related Work.* Proving correctness and finding bugs in smart contracts is useful at different levels of abstraction. The technical details of how smart contracts are encoded by SolCMC are presented in [34]. In this tool paper the emphasis is on orthogonal topics: the usage of options, the generation of counterexamples in Solidity-like syntax, interfacing with different Horn solvers, and how contract invariants can be obtained. We also demonstrate the tool's capabilities by analysing two important and complex contracts: the Deposit contract and ERC777.

**Fig. 1.** The Solidity compiler stack with the integrated model checker (in green) (Color figure online)

Most current tools either analyse the Solidity high level language, similar to SolCMC, or work directly on Ethereum Virtual Machine (EVM) bytecode.

The tools Solc-verify [28] and Verisol [38] verify Solidity properties in an automated way, allowing models with an unbounded number of transactions, by translating Solidity to Boogie [33]. This gives the tools an advantage in engineering resources but, compared to SolCMC's direct encoding as CHCs, makes producing counterexamples for the user more difficult. Neither of the two tools produces counterexamples or inductive invariants, and the most recent language versions are not supported. SmartACE [39] relies on translation from Solidity to LLVM-IR. This allows for employing multiple analysis tools, but unlike in SolCMC, where we use a direct encoding and tight solver integration, the tools are mostly used as black boxes. EThor [37] also uses Horn clauses, but it encodes EVM bytecode and focuses on specific properties such as reentrancy. The Certora [24] tool relies on invariants to verify EVM bytecode. It is a commercial tool used for smart contract audits and is not publicly available. The K framework [10] is an assisted theorem prover that provides EVM semantics [29] to analyze EVM bytecode. It is generally able to prove more statements than automated tools, but requires considerable user interaction. HEVM [22] is an implementation of the EVM in Haskell that also has a symbolic executor for EVM bytecode. It can prove functional properties but, unlike SolCMC, does not support inductive properties over multiple transactions and loops. HEVM and Echidna [4] also provide fuzzing techniques that help determine whether a candidate assertion is a contract invariant. Slither [14] is a powerful static analyzer that does not provide formal guarantees but can detect many vulnerabilities and dangerous patterns. Act [1] is a declarative specification language for smart contracts that supports three backends: bytecode verification via HEVM, SMT theorems for contract invariants, and a Coq backend that exports Coq definitions of contract state transitions.
Finally, the Scribble specification language [13] allows annotating Solidity code and can generate runtime checks for given properties.


#### **Table 1.** SolCMC verification targets

#### **2 Solidity Model Checking**

The high-level overview of the compilation process is depicted in Fig. 1, with the model checker module emphasized. When enabled, Solidity model checking becomes another pass over the source code in the normal compilation process, starting after parsing and Abstract Syntax Tree (AST) generation. If there are no errors, the compiler produces the optimized bytecode together with any warnings, such as counterexamples found by the model checker.

This paper concentrates on SolCMC's unbounded model checker based on CHCs. The tool also has a BMC engine that generates SMT queries and links against cvc5 [23] and z3 [35].

#### **2.1 The CHC Verification Engine**

SolCMC encodes a smart contract as a system of constrained Horn clauses, based on [34]. The checker supports loops, multi-transaction computation paths, contract invariants, tracking contract balances throughout their lifetimes, and precise multi-contract calls. If the analyzed contract calls external functions unsafely, the model checker also synthesizes malicious external actors and represents them as reentrant calls.

The Horn queries are dispatched to a Horn solver. The encoding requires the solver to support nonlinear Horn clauses and at least the SMT theories for Linear Integer Arithmetic (LIA), Arrays, and the tuples subset of Algebraic Datatypes (ADT). Furthermore, nonlinear integer arithmetic and bitwise operations, if present, are encoded in the respective theories NIA and BV. To the best of our knowledge only Spacer [31] and Eldarica [30] satisfy those requirements. SolCMC has a tight integration with Spacer via its C++ API, whereas Eldarica is integrated using the compiler's SMT callback [21], and is currently accessible via solc-js [15], the JavaScript wrapper of the compiler's WebAssembly binary.

The model checker generates verification targets automatically for the conditions listed in Table 1. In particular a smart contract developer can combine *assertions* with *test harnesses* (see, e.g., Sect. 4) to specify complex behavior. The Solidity language has the statements require and assert, which SolCMC uses to capture developer intent: Conditions inside require statements are considered assumptions, and assert statements should be true for every execution. The model checker then treats every assert as a verification target and attempts to either prove it by finding an invariant, or give a counterexample for its correctness.
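The require/assert division of labor can be illustrated with a small Python sketch: `require` conditions prune executions (assumptions), while `assert` conditions are verification targets. The checker here is a naive exhaustive input enumeration over a hypothetical 8-bit function, not SolCMC's CHC encoding.

```python
# Toy "verifier": require() == assumption, assert == verification target.
class AssumeFailed(Exception):
    """Raised when a require() condition is false: the execution is pruned."""

def require(cond):
    if not cond:
        raise AssumeFailed

def check(fn, inputs):
    """Return a counterexample input violating an assert, or None if safe."""
    for x in inputs:
        try:
            fn(x)
        except AssumeFailed:        # assumption violated: path does not exist
            continue
        except AssertionError:      # target violated: counterexample found
            return x
    return None

def acc(y, x=200):                  # hypothetical function over uint8 values
    require(y <= 55)                # assumed: callers pass small increments
    assert x + y <= 255             # target: no 8-bit overflow
    return x + y

print(check(acc, range(256)))       # None: the assumption rules out overflow
```

Dropping the `require` line turns the assert into a violated target (input 56 overflows), mirroring how SolCMC reports a counterexample when no assumption excludes it.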

#### **2.2 Horn Encoding**

SolCMC's CHC encoding is based on the imperative encoding of [25], and is presented in detail in [34]. Horn logic is a popular formalism for expressing reachability problems. It is equivalent to the *existential positive fix-point logic* [26], and provides a convenient syntax for the use of existentially quantified predicates that, in our encoding, represent reachable states and effects of transactions. The Solidity AST first gets transformed into a Control Flow Graph (CFG). CFG nodes have corresponding CHC predicates, and edges are encoded as Horn rules with constraints created from the Static Single Assignment (SSA) form of the statements and expressions of the CFG block. Below we give an overview of the encoding that highlights the critical parts.

The encoding consists of three types of predicates that represent reachable states or possible transitions: *function bodies* (*B<sup>f</sup>*) and *summaries* (*S<sup>f</sup>*) represent the effect of function calls to *f*; *interfaces* (*I<sup>C</sup>*) represent the states a contract *C* can reach after initialization and each transaction; and *nondeterministic interfaces* (*N<sup>C</sup>*) encode the effects the environment may have on a contract *C*. We use the following variables in the encoding:

- *e*, an integer error flag. Each verification target has a positive unique error id; 0 is reserved for no errors.
- *a*, the contract address.
- **abi**, a tuple of Solidity's ABI functions.
- **cr**, a tuple of Solidity's cryptographic functions: keccak256, sha256, ripemd160, and ecrecover. Both **abi** and **cr** are constant in the encoding; they are passed through the rules to ensure consistency everywhere.
- **tx**, a tuple of the transaction data, e.g., message sender, data, block number, etc.
- **st**, the blockchain state: a tuple containing the balances and storage for every contract. Balances are represented by an array mapping addresses to their balances; each contract has a storage tuple that contains the state variables of that contract.
- **x**, the program state: input, output and local variables in the scope of that node. When necessary, we refer to the state variables as **s**.

For **x** and **st** we use primes to denote the effect of rules on these variables.

*Function bodies* encode constructors, deployment procedures, and function summaries. For example, the contract **contract** Acc { **uint8** x = 0; **function** acc(**uint8** y) **external** { x += y; } } gets encoded into the rules

$$\begin{aligned} &e = 0 \land \mathbf{st} = \mathbf{st}' \land x = x' \land y = y' \land 0 \le y' \le 255 \land 0 \le x' \le 255\\ &\implies \mathbf{B}_{acc}(e, a, \mathbf{abi}, \mathbf{cr}, \mathbf{tx}, \mathbf{st}, x, y, \mathbf{st}', x', y') \end{aligned}$$

stating that the function can always be called, its execution starts with no error, the initial variables have the current values, and the program variables' types are constrained;

$$\begin{aligned} &\mathbf{B}_{acc}(e, a, \mathbf{abi}, \mathbf{cr}, \mathbf{tx}, \mathbf{st}, x, y, \mathbf{st}', x', y') \land (x' + y' > 255) \\ &\quad\implies \mathbf{S}_{acc}(1, a, \mathbf{abi}, \mathbf{cr}, \mathbf{tx}, \mathbf{st}, x, y, \mathbf{st}', x', y') \end{aligned}$$

stating that an overflow in summation is an error, with label 1; and

$$\begin{aligned} &\mathbf{B}_{acc}(e, a, \mathbf{abi}, \mathbf{cr}, \mathbf{tx}, \mathbf{st}, x, y, \mathbf{st}', x', y') \land (x'' = x' + y') \land (x'' \le 255) \\ &\quad\implies \mathbf{S}_{acc}(e, a, \mathbf{abi}, \mathbf{cr}, \mathbf{tx}, \mathbf{st}, x, y, \mathbf{st}', x'', y'), \end{aligned}$$

which exits the function with no error and updates the contract state variable *x*.
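The three rules amount to a small transition relation for `acc`: execution starts with no error, 8-bit overflow raises error id 1, and otherwise the new value of `x` is committed. A minimal Python sketch of that semantics (the function name `acc_summary` and the `(error, x)` result tuple are our own illustration, not part of SolCMC):

```python
def acc_summary(x: int, y: int) -> tuple[int, int]:
    """Model the summary S_acc for Acc.acc: returns (error id, new x).

    Mirrors the Horn rules: error id 1 on uint8 overflow of x + y,
    otherwise error id 0 and the updated state variable x.
    """
    assert 0 <= x <= 255 and 0 <= y <= 255  # uint8 type constraints
    if x + y > 255:      # the overflow target, labelled 1 in the encoding
        return 1, x      # state change is not committed on error
    return 0, x + y      # no error: commit x'' = x' + y'
```

For example, `acc_summary(200, 100)` yields the error summary `(1, 200)`, while `acc_summary(100, 50)` commits the new state `(0, 150)`.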

*Interface Rules.* The *interface CFG node* is an artificial node that represents the idle state of a contract. This node is crucial to the encoding when modelling transactions, querying error flags, committing state changes, generating counterexamples, and translating inductive contract invariants. It is reachable at the beginning and at the end of every transaction. Transactions may revert due to invalid inputs or program logic, in which case all state changes are rolled back; the interface node contains state changes only if the transaction did not revert. Each contract C has a predicate **I**<sub>C</sub>, whose parameters are *a*, **abi**, **cr**, **st**, and the state variables **s** of the contract. The rules only change *e*, **st**, and **s**; for better readability we use an ellipsis (*...*) to denote the unchanged parameters. One rule is added per contract, linking the deployment procedure to the interface: **D**<sub>C</sub>(*...*) =⇒ **I**<sub>C</sub>(*...*). For each external function f of contract C, we add the *query rule* and the *update rule*

$$\begin{aligned} \mathbf{I}_{C}(\ldots, \mathbf{st}, \mathbf{s}, \ldots) \land \mathbf{S}_{f}(e, \ldots, \mathbf{st}, \mathbf{s}, \ldots, \mathbf{st}', \mathbf{s}', \ldots) \land e > 0 &\implies \mathbf{Err}_{f}(e) \\ \mathbf{I}_{C}(\ldots, \mathbf{st}, \mathbf{s}, \ldots) \land \mathbf{S}_{f}(e, \ldots, \mathbf{st}, \mathbf{s}, \ldots, \mathbf{st}', \mathbf{s}', \ldots) \land e = 0 &\implies \mathbf{I}_{C}(\ldots, \mathbf{st}', \mathbf{s}', \ldots). \end{aligned}$$

The Horn query given to the solver then asks, for each error label *e*, whether **Err**<sub>f</sub>(*e*) is reachable. In this modelling, if the property is safe, the inductive invariants chosen by the solver as interpretations of the predicates **I**<sub>C</sub> represent invariants of the contracts C.
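The query and update rules define, in effect, a transition system over contract states in which the solver asks whether some sequence of transactions reaches an error. For the `Acc` example, the same question can be answered by a brute-force Python sketch (an illustration of the semantics, not of how a Horn solver actually works; all names are ours):

```python
def explore_acc(max_depth: int = 4) -> bool:
    """Breadth-first search over Acc's states: is Err_acc reachable?

    The update rule commits the new value of x when e = 0; the query
    rule reports an error as soon as a summary yields e > 0
    (here: uint8 overflow of x + y).
    """
    frontier, seen = {0}, {0}          # deployment sets x = 0
    for _ in range(max_depth):
        nxt = set()
        for x in frontier:
            for y in range(256):       # any uint8 transaction input
                if x + y > 255:        # summary with error id 1
                    return True        # Err_acc(1) is reachable
                if x + y not in seen:  # summary with e = 0: commit x'
                    nxt.add(x + y)
                    seen.add(x + y)
        frontier = nxt
    return False
```

One transaction alone cannot overflow from `x = 0`, but two can (e.g., `acc(200)` followed by `acc(100)`), so the error is reachable at depth 2.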

*Nondeterministic Interface Rules.* The *nondeterministic interface CFG node* is an artificial node that represents every possible behavior of the contract from an external point of view, over an unbounded number of transactions. This node is essential to model calls that the contract makes to external unknown contracts, as well as reentrancy, if present. The predicate representing this node has the same parameters as the interface predicate, plus the error flag and an extra set of program variables and blockchain state, in order to model possible errors and state changes. For every contract C the encoding adds the base-case rule **N**<sub>C</sub>(0, *...*, **st**, **s**, **st**, **s**), which performs no state changes. Then, for every external function f of the contract, the encoding adds the inductive rule **N**<sub>C</sub>(0, *...*, **st**, **s**, **st**′, **s**′) ∧ **S**<sub>f</sub>(*e*, *...*, **st**′, **s**′, **st**″, **s**″) =⇒ **N**<sub>C</sub>(*e*, *...*, **st**, **s**, **st**″, **s**″). These rules allow us to encode an external call to unknown code with a single constraint **N**<sub>C</sub>(*e*, *...*, **st**, **s**, **st**′, **s**′), which models every reachable state change of the contract over an unbounded number of transactions. If a property is unsafe, these rules force the solver to synthesize the behavior of the adversarial contract. Otherwise, the interpretation of this predicate gives us inductive reentrancy properties that hold for every external call to unknown code in the contract.

#### **3 User Features**

As SolCMC ships inside the Solidity compiler, it is available to users whenever and wherever they interact with the compiler. There are currently three major ways the compiler is used:

1. interfacing with the WebAssembly release through the official JavaScript bindings;
2. interfacing with a binary release on the command line;
3. using web-based IDEs, such as Remix [12].

Option 3 is the most accessible, but currently allows only limited configuration of the model checker through pragma statements in the source code. Options 1 and 2 both allow extensive configuration; in addition, option 1 enables the *SMT callback* feature needed, e.g., for Eldarica. In option 2 the options can be provided either on the command line or in JSON [19], whereas option 1 accepts only JSON via the JavaScript wrapper [15].

In options 1 and 2, several parameters are available to the user for better control when trying to prove complex properties. We list some examples here, using the command line options (without the leading --); the JSON descriptions are named similarly. The model checking engine—BMC, CHC, or both—is selected with the option model-checker-engine. Individual verification targets can be chosen with model-checker-targets, and a per-target verification timeout (in ms) can be set with model-checker-timeout. By default, all unproved verification targets are reported in a single message after execution; more details are available via model-checker-show-unproved. Option model-checker-contracts provides a way to choose the contracts to verify. Typically the user specifies only the contract they wish to deploy. Inherited and library contracts are included automatically, which avoids verifying every contract as the main one. Some options affect the encoding. For example, integer division and modulo operations can be encoded either with the SMT function symbols div and mod, or by SolCMC's own encoding using linear arithmetic and slack variables. Depending on the backend, one is often preferable to the other. The default is the latter; the former is selected with model-checker-div-mod-no-slacks.
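These options map onto the compiler's standard-JSON input. As an illustration only — the exact `modelChecker` field names below are our assumption and should be checked against the documentation of the compiler version in use — a settings fragment enabling CHC with a one-minute per-target timeout could look like:

```
{
  "settings": {
    "modelChecker": {
      "engine": "chc",
      "targets": ["assert", "overflow"],
      "timeout": 60000,
      "showUnproved": true
    }
  }
}
```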

Solidity provides the NatSpec [20] format for rich documentation. An annotation /// @custom:smtchecker abstract-function-nondet instructs SolCMC to abstract a function nondeterministically. Abstracting functions as an Uninterpreted Function [32] is under development.

*Counterexamples and Inductive Invariants.* When a verification target is disproved, SolCMC provides a readable counterexample describing how to reach the bug. In addition to the line of code where the verification target is breached, the counterexample states the trace of transactions and function calls leading to the failure along with concrete values substituted for the arguments, and the values of the state variables at the point of failure. When necessary, the trace includes also synthesized reentrant calls that trigger the failure.

Similarly, when SolCMC proves a verification target, the user may instruct the checker to provide safe inductive invariants. The invariants can, for instance, be used as an additional proof that the verification target holds. Technically, the invariants are interpretations of the predicates in the CHC system, and they are presented in a human-readable, Solidity-like syntax. As with counterexamples, invariants are also given for the predicates guaranteeing correctness under reentrancy. The extended version of this paper contains a short example illustrating the counterexamples and inductive invariants, as well as more complex examples of both features obtained from our experiments with the ERC777 token standard.

# **4 Real World Experiments**

In this section we analyse two real-world smart contract systems using SolCMC. Both contracts are highly important and nontrivial for automated tools due to their use of complex features, loops, and the need for nontrivial inductive invariants. While only the main results are stated in this section, we want to emphasize that they were achieved after extensive, albeit mechanical, experimentation with the two backend solvers (Spacer and Eldarica) and a range of parameters. The fact that these contracts can be successfully analysed by an automatic method is, to us, strong evidence of the combined power of our encoding approach and the backend solvers.

### **4.1 CHC Solver Options**

The options we pass to the underlying CHC solvers Spacer and Eldarica can make the difference between quick solving and divergence. For Spacer, we use the options rewriter.pull_cheap_ite=true, which pulls if-then-else terms to the top level when this can be done cheaply; fp.spacer.q3.use_qgen=true, which enables the quantified lemma generalizer; fp.spacer.mbqi=false, which disables model-based quantifier instantiation; and fp.spacer.ground_pobs=false, which disables grounding proof obligations using values from a model. For Eldarica, we have found adjusting the predicate abstraction to be useful: -abstract:off disables abstraction, -abstract:term uses term abstraction, and -abstract:oct uses octagon abstraction.

#### **4.2 Deposit Contract**

The Ethereum 2.0 (Eth2) [9] Deposit Contract [2,3] is a smart contract that runs on Ethereum 1.0, collecting deposits from accounts that wish to be validators on Eth2. At the time of submission of this paper, more than 9,100,194 ETH were held by the Deposit Contract, the equivalent of tens of billions of USD at recent rates. Besides the financial incentive, the contract's functionality is essential to the progress of the protocol. The contract was formally verified before deployment [36] and further proved safe [27] with a considerable amount of manual work. Despite having relatively few lines of code (fewer than 200), the contract remains a challenge for automated tools because it uses many complex constructs at the same time, such as ABI encoding functions, loops, dynamic types, and hash functions.

As part of the logic of the deposit function, a new entry is created in a Merkle tree for the caller. The contract asserts that such an entry can always be found, expressed as an assert(false) at a program location that is reachable only if the entry is not found (line 162 in [2]). Using SolCMC, this problem can be encoded into a 1.4 MB Horn logic file containing 127 rules, which uses the SMT theories of Arrays, ADTs, NIA, and BV. After a syntactic change, Eldarica shows the property safe automatically in 22.4 s, while Spacer times out after 1 h (see the extended version for details). The change is necessary to avoid bitvector reasoning and consists of replacing the test if ((size & 1) == 1) with the semantically equivalent form if ((size % 2) == 1) on lines 88 and 153 of [2].
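The rewrite relies on the fact that, for non-negative integers, testing the lowest bit and testing the remainder modulo 2 coincide. A quick Python check of that equivalence (function names are ours):

```python
def low_bit_set(size: int) -> bool:
    # Original Deposit Contract test: forces bitvector reasoning.
    return (size & 1) == 1

def is_odd(size: int) -> bool:
    # Semantically equivalent rewrite: stays in integer arithmetic.
    return (size % 2) == 1

# Exhaustive check over a sample range of non-negative values.
assert all(low_bit_set(n) == is_odd(n) for n in range(10**5))
```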

#### **4.3 ERC777**

ERC777 [6] is a token standard that offers extra features compared to the ERC20 [5] standard. Besides the usual transfer and allowance features, ERC777 mainly adds account operators and transfer hooks, which allow smart contracts to react to sending and receiving tokens. This is similar to the native feature of reacting to receiving Ether. In this experiment we analyze the OpenZeppelin implementation [11] of ERC777. This contract is an interesting benchmark for automated tools not only because of its importance, but also because it is a rather large smart contract system, with 1200 lines of Solidity code in 8 files, and it uses complex high-level constructs such as assembly blocks, heavy inheritance, strings, arrays, nested mappings, loops, and hash functions, and it makes external calls to unknown code. The implementation follows the specification precisely, and does not guarantee a basic safety property related to tokens: *the total supply of tokens should not change during a transfer*.

Compared to the usual ERC20 token transfer, which simply decreases and increases the balances of the two accounts involved, the ERC777 transfer function may call unknown contracts to notify them that tokens are being sent or received. The logic in these external contracts is completely arbitrary and unknown to the token contract. For example, they could make a reentrant call to one of the nine mutable functions in ERC777's external interface.

Since the analyzed ERC777 implementation is agnostic about how tokens are initially allocated, no tokens are distributed at deployment in the base implementation. Therefore, to study the property, we write the following test harness [7] that uses the ERC777 token implemented by OpenZeppelin.

```
import "<path>/ERC777.sol";

contract Harness is ERC777 {
    constructor(
        address[] memory defOps_,
        uint amt_
    ) ERC777("ERC777", "E7", defOps_) {
        _mint(msg.sender, amt_, "", "");
    }

    function transfer(address r, uint a)
        public override returns (bool)
    {
        uint prev = totalSupply();
        bool res = ERC777.transfer(r, a);
        uint post = totalSupply();
        assert(prev == post);
        return res;
    }
}
```
**Fig. 2.** Transaction trace that violates the safety property in transfer

First, we allocate amt_ tokens to the creator of the contract in order to have tokens in circulation. Then we override the transfer function: our transfer simply wraps the one from the ERC777 contract, asserting that the property we want to verify holds after the original transfer.

The resulting Horn encoding is 15 MB in size and contains 545 rules. The property can be shown unsafe by Eldarica in all its configurations, the quickest taking slightly less than 3 min, including generating the counterexample (see the extended version for details). All of Spacer's configurations time out after 1 h. Since the property is unsafe, SolCMC also provides the full transaction trace required to reach the assertion failure. The trace is visualized in Fig. 2 in the form of a sequence diagram, where solid arrows represent function calls and dashed arrows represent the return of execution control. The full output of the tool can be found in the extended version.

The diagram shows the transaction trace starting from the call to transfer of ERC777 (after our wrapper contract has been created and its transfer was called). transfer performs three internal function calls (in orange): 1) callTokensToSend performs the external call to notify the sender; 2) move moves the tokens from the sender to the recipient; 3) callTokensReceived notifies the recipient. The external calls to unknown code are shown in red. The transaction trace also contains the synthesized behaviour of the recipient (in purple): a reentrant call to operatorBurn in the ERC777 token contract itself, in which some of the tokens of the recipient contract are burned. At the end of the execution of transfer, the assertion no longer holds: the total supply of tokens after the call differs from the total supply before it, as some tokens were burned during the transaction.
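The shape of this counterexample can be reproduced in a toy Python model of the token (all names here — `ToyERC777`, `on_receive` — are our own illustration; the real contract's hook mechanism is far richer): a receive hook that reentrantly burns tokens makes the total supply change across a transfer.

```python
class ToyERC777:
    """Minimal token model: balances, a total supply, and a receive hook."""

    def __init__(self, owner: str, supply: int):
        self.balances = {owner: supply}
        self.total = supply
        self.on_receive = None        # unknown external code, if any

    def burn(self, who: str, amount: int):
        self.balances[who] -= amount
        self.total -= amount

    def transfer(self, sender: str, recipient: str, amount: int):
        self.balances[sender] -= amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount
        if self.on_receive:           # callTokensReceived: external call
            self.on_receive(self, recipient)

token = ToyERC777("alice", 100)
before = token.total
# Synthesized adversarial recipient: reentrantly burns one of its tokens.
token.on_receive = lambda t, who: t.burn(who, 1)
token.transfer("alice", "bob", 10)
assert token.total != before          # the safety property is violated
```

The final assertion mirrors the harness's failed `assert(prev == post)`: the reentrant burn during the transfer changes the total supply.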

Given the number of mutable external functions of ERC777 and their complexity, we consider the discovery of the counterexample to be quite an achievement. We ascribe the success to the combined power of the CHC encoding and the Horn solver.

One way to guarantee that our property holds is to disallow reentrancy throughout the contract using a mutex. After changing the ERC777 library accordingly [8], we ran the tool again on our test harness. Spacer timed out, but Eldarica was able to prove the restricted system safe in all its configurations, the fastest finishing in 26.2 s, including the generation of inductive invariants for every predicate. SolCMC then reports back the reentrancy property <errorCode> = 0 given as part of the proof (presented here in a simplified manner; see the extended version for details). This inductive property states that no external call performed by the analyzed contract can lead to an error, which shows that the reentrant path can no longer be taken.

#### **4.4 Discussion**

While producing the above analysis of real-life contracts, we experimented with the two backend solvers, Spacer and Eldarica, and a range of parameters for them. This phase (documented in the extended version of this paper) was critical to producing the results, because Eldarica and Spacer excel in different domains, and parameter selection has a major impact on both verification success and run time. In both cases above, Eldarica performed clearly better than Spacer, seemingly because Eldarica handles abstract data types better. This conclusion is backed by experimental evidence: we ran SolCMC with both Spacer and Eldarica on the SolCMC regression test suite, consisting of 1098 Solidity files [17] and 3688 Horn queries [18]. The experiment shows that while the solvers give overall similar results, in two categories that make heavy use of ADTs Eldarica consistently solves more benchmarks than Spacer. Due to lack of space, the detailed analysis is given in the extended version.

Our encoding uses tuples to bundle data that naturally belongs together. Moreover, arrays of tuples are used to emulate uninterpreted functions (UFs) for abstracting injective functions such as cryptographic primitives; this is necessary because UFs are not syntactically allowed in the predicates of Horn instances. While this increases the complexity of the problem, we have chosen this path to keep the encoding simple, anticipating that a preprocessing step may become available in the future to flatten such tuples and arrays.
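The idea behind emulating an uninterpreted function can be sketched in a few lines of Python (our illustration, not SolCMC code): record input-output pairs and enforce only functional consistency, i.e., equal inputs map to equal outputs, with fresh values keeping distinct inputs apart (modeling injectivity).

```python
class UninterpretedFn:
    """Emulate an uninterpreted function symbol, e.g. keccak256.

    No concrete semantics is assumed; the table (playing the role of
    the array of tuples) only enforces that applying the symbol twice
    to the same input yields the same output.
    """

    def __init__(self):
        self.table = {}
        self.fresh = 0

    def apply(self, arg):
        if arg not in self.table:
            self.fresh += 1            # pick an arbitrary fresh value
            self.table[arg] = self.fresh
        return self.table[arg]

keccak = UninterpretedFn()
a = keccak.apply(b"hello")
b = keccak.apply(b"hello")
c = keccak.apply(b"world")
assert a == b          # functional consistency
assert a != c          # distinct inputs received distinct fresh values
```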

#### **5 Conclusions and Future Work**

This paper presents the model checker SolCMC, which ships with the Ethereum Foundation's compiler for the Solidity language. We believe that this automated and usable tool has the potential to connect the high volume of Solidity developers with the community working on formal verification tools. The tool is stable and, being integrated into the compiler, closely tracks the quickly evolving language.

We advocate a *direct encoding approach*, where the same AST is compiled both into EVM bytecode and into a verification model in SMT-LIB2 or the format used in the CHC competition. In our experience this makes it more natural to model features specific to Solidity and Ethereum smart contracts, as well as to generate usable counterexamples and inductive invariants, compared to first producing a language-agnostic intermediate verification representation that is then processed by reasoning engines.

We argue for the ease of use of the tool by showing nontrivial properties of real-life contracts. The experiments also identify interesting future development opportunities in the current CHC formalism. We show how the formalism's limitations can be worked around using abstract data types, and discuss their impact on tool efficiency.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

**Hyperproperties and Security**

# **Software Verification of Hyperproperties Beyond** *k***-Safety**

Raven Beutner(B) and Bernd Finkbeiner

CISPA Helmholtz Center for Information Security, Saarbrücken, Germany {raven.beutner,finkbeiner}@cispa.de

**Abstract.** Temporal hyperproperties are system properties that relate multiple execution traces. For (finite-state) hardware, temporal hyperproperties are supported by model checking algorithms, and tools for general temporal logics like HyperLTL exist. For (infinite-state) software, the analysis of temporal hyperproperties has, so far, been limited to k-safety properties, i.e., properties that stipulate the absence of a bad interaction between any k traces. In this paper, we present an automated method for the verification of ∀<sup>k</sup>∃<sup>l</sup>-safety properties in infinite-state systems. A ∀<sup>k</sup>∃<sup>l</sup>-safety property stipulates that for any k traces, there *exist* l traces such that the resulting k + l traces do not interact badly. This combination of universal and existential quantification enables us to express many properties beyond k-safety, including, for example, generalized non-interference or program refinement. Our method is based on a strategy-based instantiation of existential trace quantification combined with a program reduction, both in the context of a fixed predicate abstraction. Notably, our framework allows for mutual dependence of strategy and reduction.

**Keywords:** Hyperproperties · HyperLTL · Infinite-state systems · Predicate abstraction · Hyperliveness · Verification · Program reduction

# **1 Introduction**

Hyperproperties are system properties that relate multiple execution traces of a system [22] and commonly arise, e.g., in information-flow policies [35], the verification of code optimizations [6], and robustness of software [19]. Consequently, many methods for the automated verification of hyperproperties have been developed [27,39–41]. Almost all previous approaches verify a class of hyperproperties called k-safety, i.e., properties that stipulate the absence of a bad interaction between any k traces in the system. For example, we can express a simple form of non-interference as a 2-safety property by stating that any *two* traces that agree on the low-security inputs should produce the same observable output.

The vast landscape of hyperproperties does, however, stretch far beyond k-safety. The overarching limitation of k-safety (or, more generally, of hypersafety [22]) is an implicit *universal* quantification over all executions. By contrast, many

properties of interest, ranging from applications in information-flow control to robust cleanness, require a combination of universal and existential quantification. For example, consider the reactive program in Fig. 1, where ∗<sub>N</sub> denotes a nondeterministic choice of a natural number. We assume that h, l, and o are a high-security input, a low-security input, and a low-security output, respectively. This program violates the simple 2-safety non-interference property given above, as the non-determinism influences the output. Nevertheless, the program is "secure" in the sense that an attacker who observes low-security inputs and outputs cannot deduce information about the high-security input. To capture this formally, we use a relaxed notion of non-interference, often referred to in the literature as generalized non-interference (GNI) [35]. We can, informally, express GNI in a temporal logic as follows:

$$\forall \pi. \forall \pi'. \exists \pi''. \Box \left( o\_{\pi} = o\_{\pi''} \land l\_{\pi} = l\_{\pi''} \land h\_{\pi'} = h\_{\pi''} \right)$$

This property requires that for any two traces π, π′, there exists some trace π′′ that, globally, agrees with the low-security inputs and outputs on π but with the high-security inputs on π′. Phrased differently, any observation of the low-security input-output behavior is compatible with every possible high-security input. The program in Fig. 1 satisfies GNI. Crucially, GNI is no longer a hypersafety property (and, in particular, not a k-safety property for any k) as it requires a combination of universal and *existential* quantification.
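On a finite toy system, the ∀∀∃ structure of GNI can be checked by brute force. The sketch below (our own stand-in: it enumerates a hypothetical trace set rather than the program of Fig. 1) represents each trace as a sequence of (h, l, o) triples:

```python
from itertools import product

def satisfies_gni(traces):
    """Check ∀π.∀π'.∃π''. □(o, l of π'' match π and h of π'' matches π')."""
    def matches(pi, pip, pipp):
        return all(
            o2 == o and l2 == l and h2 == hp
            for (h, l, o), (hp, _, _), (h2, l2, o2) in zip(pi, pip, pipp)
        )
    return all(
        any(matches(pi, pip, pipp) for pipp in traces)
        for pi, pip in product(traces, repeat=2)
    )

# Toy system: output o is a nondeterministic value independent of h,
# so every (l, o) observation is compatible with every h.
traces = [((h, 0, o),) for h in (0, 1) for o in (0, 1)]
assert satisfies_gni(traces)

# Leaky system: o reveals h, so GNI fails.
leaky = [((h, 0, h),) for h in (0, 1)]
assert not satisfies_gni(leaky)
```

Of course, this enumeration only works for finite trace sets; the infinite domains of software are exactly why the method of this paper is needed.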

#### **1.1 Verification Beyond** *k***-Safety**

Instead, GNI falls into the general class of ∀<sup>∗</sup>∃<sup>∗</sup>-safety properties. Concretely, a ∀<sup>k</sup>∃<sup>l</sup>-safety property (using k universal and l existential quantifiers) stipulates that for any k traces, there exist l traces such that the resulting k + l traces do not interact badly. k-safety properties are the *special case* where l = 0. We study the verification of such properties in infinite-state systems arising, e.g., in software. In contrast to k-safety, for which a broad range of methods has been developed [10,27,39–41], no method for the automated verification of *temporal* ∀<sup>∗</sup>∃<sup>∗</sup> properties in infinite-state systems exists (we discuss related approaches in Sect. 8).

Our novel verification method is based on a game-based reading of existential quantification *combined* with the search for a program reduction.

**Fig. 1.** An example program is depicted.

The game-based reading of existential quantification instantiates existential trace quantification with an explicit strategy and constitutes the first practicable method for the verification of ∀<sup>∗</sup>∃<sup>∗</sup>-properties in finite-state systems [23]. Program reductions are a well-established technique to align executions of independent program fragments (such as the individual program copies in a self-composition) to obtain proofs with easier invariants [27,34,39].

So far, both techniques have been limited to their respective domains: the game-based approach has only been applied to finite-state systems and synchronous specifications, and reductions have (mostly) been used for the verification of k-safety. We combine both techniques, yielding an effective (and first) verification technique for hyperproperties beyond k-safety in infinite-state systems arising in software. Notably, our search for a reduction and the strategy-based instantiation of existential quantification are *mutually dependent*, i.e., a particular strategy might depend on a particular reduction and vice versa.

#### **1.2 Contributions and Structure**

The starting point of our work is a new temporal logic called *Observation-based HyperLTL* (OHyperLTL for short). Our logic extends the existing hyperlogic HyperLTL [21] with capabilities to reason about asynchronous properties (i.e., properties where the individual traces are traversed at different speeds), and to specify properties using assertions from arbitrary background theories (to reason about the infinite domains encountered in software) (Sect. 4).

To automatically verify ∀<sup>k</sup>∃<sup>l</sup> OHyperLTL properties, we combine program reductions with a strategy-based instantiation of existential quantification, both in the context of a fixed predicate abstraction. To facilitate this combination, we first present a game-based approach that automates the search for a reduction. Concretely, we construct an abstract game where a winning strategy for the verifier directly corresponds to a reduction with an accompanying proof. As a side product, our game-based interpretation simplifies the search for a reduction in a given predicate abstraction as, e.g., studied by Shemer et al. [39] (Sect. 5).

Our strategic (game-based) view on reductions allows us to combine them with a game-based instantiation of existential quantification. Here, we view the existentially quantified traces as being constructed by a strategy that, iteratively, reacts to the universally quantified traces. As we phrase both the search for a reduction and the search for existentially quantified traces as a game, we can frame the search for both as a combined abstract game. We prove the soundness of our approach, i.e., a winning strategy for the verifier constitutes both a strategy for the existentially quantified traces and accompanying (mutually dependent) reduction. Despite its finite nature, constructing the abstract game is expensive as it involves many SMT queries. We propose an inner refinement loop that determines the winner of the game (without constructing it explicitly) by computing iterative approximations (Sect. 6).

We have implemented our verification approach in a prototype tool called HyPA (short for **Hy**perproperty Verification with **P**redicate **A**bstraction) and evaluate HyPA on k-safety properties (that can already be handled by existing methods) and on ∀<sup>∗</sup>∃<sup>∗</sup>-safety benchmarks that cannot be handled by any existing tool (Sect. 7).

*Contributions.* In short, our contributions include the following:

– We propose a temporal hyperlogic that can specify asynchronous hyperproperties in infinite-state systems;


### **2 Overview: Reductions and Quantification as a Game**

Our verification approach hinges on the observation that we can express both a reduction and existential trace quantification as a game. In this section, we provide an overview of our game-based interpretations. We begin by outlining our game-based reading of a reduction (illustrating this in the simpler case of k-safety) in Sect. 2.1 and then extend this to include a game-based interpretation of existential quantification in Sect. 2.2.

#### **2.1 Reductions as a Game**

Consider the two programs in Fig. 2 and the specification that both programs produce the same output (on initially identical values for x). We can formalize this in our logic OHyperLTL (formally defined in Sect. 4) as follows:

$$\forall^{\mathsf{P1}}\pi\_1: (pc=2). \,\forall^{\mathsf{P2}}\pi\_2: (pc=2). \,(x\_{\pi\_1}=x\_{\pi\_2}) \to \square(x\_{\pi\_1}=x\_{\pi\_2})$$

The property states that for all traces π<sub>1</sub> in P1 and π<sub>2</sub> in P2 the LTL specification (x<sub>π1</sub> = x<sub>π2</sub>) → □(x<sub>π1</sub> = x<sub>π2</sub>) holds (where x<sub>π</sub> refers to the value of x on trace π). Additionally, the observation formula *pc* = 2 marks the positions at which the LTL property is evaluated: we only observe a trace at steps where *pc* = 2 (i.e., where the program counter is at the output position).
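Observation formulas act as a projection on traces: the LTL property is evaluated only at the steps satisfying them. A small Python sketch of this semantics (the trace representation and all names are our own illustration):

```python
def observe(trace, obs):
    """Keep only the steps of `trace` at which the observation holds."""
    return [state for state in trace if obs(state)]

def always_equal_x(t1, t2):
    """Check □(x_π1 = x_π2) over the observed, aligned steps."""
    return all(s1["x"] == s2["x"] for s1, s2 in zip(t1, t2))

# Two traces progressing at different speeds: pc = 2 marks output steps.
p1 = [{"pc": 1, "x": 0}, {"pc": 2, "x": 4}, {"pc": 1, "x": 4}, {"pc": 2, "x": 8}]
p2 = [{"pc": 2, "x": 4}, {"pc": 0, "x": 7}, {"pc": 0, "x": 9}, {"pc": 2, "x": 8}]

at_output = lambda s: s["pc"] == 2
assert always_equal_x(observe(p1, at_output), observe(p2, at_output))
```

Note that the two traces disagree on x at unobserved steps; only the projected steps matter, which is what makes the logic asynchronous.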

The verification of our property involves reasoning about two copies of our system (in this case, one of P1 and one of P2) on *disjoint* state spaces. Consequently, we can interleave the statements of both programs (between two observation points) without affecting the behavior of the individual copies. We refer to each interleaving of both copies as a *reduction*. The choice of a reduction drastically influences the complexity of the needed invariants [27,34,39]. Given an initial abstraction of the system [30,39], we aim to discover a suitable reduction *automatically*. Our first observation is that we can phrase the search for a reduction as a game as follows: In each step, the verifier decides on a *scheduling* (i.e., a non-empty subset M ⊆ {1, 2}) that indicates which of the copies should take a step (i.e., i ∈ M iff copy i should make a program step). Afterward, the refuter can choose an abstract successor state compatible with that scheduling, after which the process repeats. This naturally defines a finite-state two-player safety game that we can solve efficiently.<sup>1</sup> If the verifier wins, a winning strategy

<sup>1</sup> The LTL specification is translated to a symbolic safety automaton that moves alongside the game. For sake of readability, we omitted the automaton from the following discussion.

**Fig. 2.** Two output-equivalent programs P1 and P2 are depicted in Fig. 2a and 2b. In Fig. 2c a possible winning strategy for the verifier is given. Each abstract state contains the values of the program counters of both copies (given as the pair at the top) and the predicates that hold in that state. For the sake of readability we omit the trace variables and write, e.g., x<sub>1</sub> for x<sub>π1</sub>. We mark the initial state with an incoming arrow. The outer label at each state gives the scheduling M ⊆ {1, 2} chosen by the strategy in that state.

directly corresponds to a reduction and accompanying inductive invariant for the safety property within the given abstraction.

For our example, we give (parts of) a possible winning strategy in Fig. 2c. In each abstract state, the strategy chooses a scheduling (written next to the state), and all abstract states compatible with that scheduling are listed as successors. Note that whenever the program counter is (2, 2) (i.e., both programs are at their output position), it holds that x<sub>1</sub> = x<sub>2</sub> (as required). The example strategy schedules in lock-step for the most part (by choosing M = {1, 2}) but lets P1 take the inner loop *twice*, thereby maintaining the linear invariants x<sub>1</sub> = x<sub>2</sub> and y<sub>1</sub> = 2y<sub>2</sub>. In particular, the resulting reduction is property-based [39] as the scheduling is based on the current (abstract) state. Note that the program cannot be verified with only linear invariants in a sequential or parallel (lock-step) reduction.

#### **2.2 Beyond** *k***-Safety: Quantification as a Game**

We build upon this game-based interpretation of a reduction to move beyond k-safety. As a second example, consider the two programs Q1 and Q2 in Fig. 3, where ★<sub>τ</sub> denotes a nondeterministic choice of type τ ∈ {ℕ, 𝔹}. We wish to check that Q1 refines Q2, i.e., all output behavior of Q1 is also possible in Q2. We can express this in our logic as follows:

$$\forall^{Q1}\pi_1 : (pc = 2).\ \exists^{Q2}\pi_2 : (pc = 2).\ \Box(a_{\pi_1} = a_{\pi_2})$$

The property states that for every trace π<sub>1</sub> in Q1 there *exists* a trace π<sub>2</sub> in Q2 that outputs the same value. The quantifiers range over infinite traces of variable assignments (with infinite domains), making a direct verification of the

**Fig. 3.** Two programs Q1 and Q2 are given in Fig. 3a and 3b. In Fig. 3c a possible winning strategy for the verifier is depicted. The outer label gives the scheduling M ⊆ {1, 2} and, if applicable, the restriction chosen by the witness strategy.

quantifier alternation challenging. In contrast to alternation-free formulas, we cannot reduce the verification to verification on a self-composition [8,28]. Instead, we adopt (yet another) game-based interpretation by viewing the existentially quantified traces as being resolved by a *strategy* (called the witness strategy) [23]. That is, instead of trying to find a witness trace π<sub>2</sub> in Q2 when given the *entire* trace π<sub>1</sub>, we interpret the ∀∃ property as a game between verifier and refuter. The refuter moves through the state space of Q1 (thereby producing a trace π<sub>1</sub>), and the verifier reacts to each move by choosing a successor in the state space of Q2 (thereby producing a trace π<sub>2</sub>). If the verifier can ensure that the resulting traces π<sub>1</sub>, π<sub>2</sub> satisfy □(a<sub>π1</sub> = a<sub>π2</sub>), the ∀∃ property holds. However, this game-based interpretation fails in many instances. There might exist a witness trace π<sub>2</sub>, but the trace cannot be produced by a witness strategy as it requires knowledge of *future* moves of the refuter. Let us illustrate this with the example programs in Fig. 3. A simple (informal) way to construct a witness trace π<sub>2</sub> (when given the entire π<sub>1</sub>) would be to guarantee that at Q2:4 (i.e., location 4 of Q2) and Q1:6 the value of x in both programs agrees (i.e., x<sub>1</sub> = x<sub>2</sub> holds) and then simply resolve the nondeterminism at Q2:6 with 0. However, to follow this idea, the witness strategy for the verifier, when at Q2:3, would need to know the future value of x<sub>1</sub> when Q1 is at location Q1:6.

Our insight in this paper is that we can turn the strategy-based interpretation of the witness trace π<sub>2</sub> into a useful verification method by *combining* it with a program reduction. As we express both searches strategically, we can phrase the combined search as a combined game. In particular, both the reduction and the witness strategy are controlled by the verifier and can thus *collaborate*. In the resulting game, the verifier chooses a scheduling (as in Sect. 2.1) and, additionally, whenever the existentially quantified copy is scheduled, the verifier also decides on the successor state of that copy. We depict a possible winning strategy in Fig. 3c. This strategy formalizes the interplay of reduction and witness strategy. Initially, the verifier only schedules {1} until Q1 has reached program location Q1:6 (at which point the value of x is fixed). Only then does the verifier schedule {2}, at which point the witness strategy can decide on a successor state for Q2. In our case, the strategy chooses a value for x such that x<sub>1</sub> = x<sub>2</sub> holds. As we work in an abstraction of the actual system, we formalize this by restricting the abstract successor states. In particular, in state α<sub>7</sub> the verifier schedules {2} and simultaneously restricts the successors to {α<sub>8</sub>} (i.e., the abstract state where x<sub>1</sub> = x<sub>2</sub> holds), even though the abstract state [(6, 4), a<sub>1</sub> = a<sub>2</sub>, x<sub>1</sub> ≠ x<sub>2</sub>] is also a valid successor under scheduling {2}. We formalize when a restriction is valid in Sect. 6. The resulting strategy is winning and therefore denotes both a reduction *and* a witness strategy for the existentially quantified copy. Importantly, reduction and witness strategy are mutually dependent. Our tool HyPA is able to verify both properties (in Fig. 2 and Fig. 3) in a matter of a few seconds (cf. Sect. 7).

# **3 Preliminaries**

We begin by introducing basic preliminaries, including our model of computation and background on (finite-state) safety games.

*Symbolic Transition Systems.* We assume some fixed underlying first-order theory. A *symbolic transition system* (STS) is a tuple T = (X, *init*, *step*) where X is a finite set of variables (possibly sorted), *init* is a formula over X describing all initial states, and *step* is a formula over X ∪ X′ (where X′ := {x′ | x ∈ X} is the set of primed variables) describing the transitions of the system. A concrete state μ of T is an assignment to the variables in X. We write μ′ for the assignment over X′ given by μ′(x′) := μ(x). A trace of T is an infinite sequence of assignments μ<sub>0</sub>μ<sub>1</sub> ··· such that μ<sub>0</sub> |= *init* and, for every i ∈ ℕ, μ<sub>i</sub> ∪ μ′<sub>i+1</sub> |= *step*. We write *Traces*(T) for the set of all traces of T. We can naturally interpret programs as STSs by making the program counter explicit.
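To make the definition concrete, here is a minimal Python sketch (our own illustration, not part of the paper's tooling) in which *init* and *step* are modelled as Python predicates over variable assignments; the toy system and its encoding are assumptions for illustration only:

```python
# Hypothetical illustration: an STS T = (X, init, step) with init/step
# modelled as Python predicates over assignments (dicts). A "primed"
# assignment is simply passed as a second argument to `step`.

class STS:
    def __init__(self, variables, init, step):
        self.variables = variables  # the finite variable set X
        self.init = init            # init(mu) -> bool
        self.step = step            # step(mu, mu_next) -> bool

    def is_trace_prefix(self, prefix):
        """Check that mu_0 ... mu_n is a prefix of some trace of T."""
        if not prefix or not self.init(prefix[0]):
            return False
        return all(self.step(prefix[i], prefix[i + 1])
                   for i in range(len(prefix) - 1))

# A toy program with an explicit program counter: at pc = 1 it
# increments x, then jumps back to pc = 0.
toy = STS(
    variables={"pc", "x"},
    init=lambda m: m == {"pc": 0, "x": 0},
    step=lambda m, n: n == {"pc": 1 - m["pc"], "x": m["x"] + m["pc"]},
)
```

In a real implementation the predicates would be first-order formulas handed to an SMT solver; the dict-based encoding above only mirrors the shape of the definition.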

*Formula Transformations.* For the remainder of this paper, we fix the set of system variables X. We also fix a finite set of trace variables V = {π1,...,πk}. For a trace variable <sup>π</sup> ∈ V we define <sup>X</sup><sup>π</sup> := {x<sup>π</sup> <sup>|</sup> <sup>x</sup> <sup>∈</sup> <sup>X</sup>} and write X for X<sup>π</sup><sup>1</sup> ∪···∪ X<sup>π</sup>*<sup>k</sup>* . For a formula θ over X, we define θπ as the formula over X<sup>π</sup> obtained by replacing every variable x with xπ. Similarly, we define k fresh disjoint copies X = X <sup>π</sup><sup>1</sup> ∪···∪ X <sup>π</sup>*<sup>k</sup>* (where X <sup>π</sup> := {x <sup>π</sup> | x ∈ X}). For a formula θ over X , we define θ- as the formula over X obtained by replacing every variable x<sup>π</sup> with x π.

*Safety Games.* A *safety game* is a tuple G = (S<sub>SAFE</sub>, S<sub>REACH</sub>, S<sub>0</sub>, T, B) where S = S<sub>SAFE</sub> ⊎ S<sub>REACH</sub> is a set of game states, S<sub>0</sub> ⊆ S a set of initial states, T ⊆ S × S a transition relation, and B ⊆ S a set of bad states. We assume that for every s ∈ S there exists at least one s′ with (s, s′) ∈ T. States in S<sub>SAFE</sub> are controlled by player SAFE and those in S<sub>REACH</sub> by player REACH. A play is an infinite sequence of states s<sub>0</sub>s<sub>1</sub> ··· such that s<sub>0</sub> ∈ S<sub>0</sub> and (s<sub>i</sub>, s<sub>i+1</sub>) ∈ T for every i ∈ ℕ. A positional strategy σ for player p ∈ {SAFE, REACH} is a function σ : S<sub>p</sub> → S such that (s, σ(s)) ∈ T for every s ∈ S<sub>p</sub>. A play s<sub>0</sub>s<sub>1</sub> ··· is compatible with a strategy σ for player p if s<sub>i+1</sub> = σ(s<sub>i</sub>) whenever s<sub>i</sub> ∈ S<sub>p</sub>. Player SAFE wins G if there is a strategy σ for SAFE such that all σ-compatible plays never visit a state in B. In particular, SAFE needs to win from *all* initial states.
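The winner of such a finite-state safety game can be determined with the classic attractor construction. The following Python sketch (our own, using a naive quadratic fixpoint rather than the linear-time variant) computes the set of states from which REACH can force a visit to B and checks that no initial state belongs to it:

```python
# Sketch: SAFE wins a finite-state safety game iff no initial state lies
# in the REACH-attractor of the bad states. The game representation
# (state sets, successor dict) is our own choice.

def safe_wins(safe_states, reach_states, initial, edges, bad):
    """edges: dict mapping each state to a list of successors (total)."""
    attr = set(bad)               # states from which REACH forces reaching bad
    changed = True
    while changed:                # fixpoint iteration over all states
        changed = False
        for s in safe_states | reach_states:
            if s in attr:
                continue
            succs = edges[s]
            if s in reach_states:
                forced = any(t in attr for t in succs)  # REACH picks one successor
            else:
                forced = all(t in attr for t in succs)  # SAFE has no escape
            if forced:
                attr.add(s)
                changed = True
    return not any(s in attr for s in initial)
```

The same fixpoint underlies the complexity remark at the end of Sect. 5.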

# **4 Observation-Based HyperLTL**

In this section, we present OHyperLTL (short for observation-based HyperLTL). Our logic builds upon HyperLTL [21], which itself extends linear-time temporal logic (LTL) with explicit trace quantification. In OHyperLTL, we include predicates from the background theory (to reason about infinite variable domains) and explicit observations (to express asynchronous properties). Formulas in OHyperLTL are given by the following grammar:<sup>2</sup>

$$\begin{aligned} \varphi &:= \forall \pi : \xi.\, \varphi \mid \exists \pi : \xi.\, \varphi \mid \phi \\ \phi &:= \theta \mid \neg \phi \mid \phi_1 \land \phi_2 \mid \mathsf{O} \phi \mid \phi_1\, \mathcal{U}\, \phi_2 \end{aligned}$$

Here π ∈ V is a trace variable, θ is a formula over X̄, and ξ is a formula over X (called the observation formula). For ease of notation, we assume that all variables in V occur in the quantifier prefix *exactly* once. We use the standard Boolean connectives ∧, →, ↔, and constants ⊤, ⊥, as well as the derived LTL operators eventually ◇φ := ⊤ U φ, and globally □φ := ¬◇¬φ.

*Semantics.* A trace t is an infinite sequence μ<sub>0</sub>μ<sub>1</sub> ··· of assignments to X. For i ∈ ℕ, we write t(i) to denote the ith element of t. A trace assignment Π is a partial mapping of trace variables in V to traces. Given a trace assignment Π and i ∈ ℕ, we define Π(i) to be the assignment to X̄ given by Π(i)(x<sub>π</sub>) := Π(π)(i)(x), i.e., the value of x<sub>π</sub> is the value of x on the trace assigned to π. For the LTL body of an OHyperLTL formula, we define:
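The display defining the satisfaction relation appears to have been lost in extraction. We reconstruct the standard definition (following HyperLTL, with Π a trace assignment and i a position); the exact typesetting may differ from the original:

$$\begin{aligned}
\Pi, i &\models \theta &&\text{iff}\ \ \Pi(i) \models \theta\\
\Pi, i &\models \neg\phi &&\text{iff}\ \ \Pi, i \not\models \phi\\
\Pi, i &\models \phi_1 \land \phi_2 &&\text{iff}\ \ \Pi, i \models \phi_1 \text{ and } \Pi, i \models \phi_2\\
\Pi, i &\models \mathsf{O}\phi &&\text{iff}\ \ \Pi, i+1 \models \phi\\
\Pi, i &\models \phi_1\, \mathcal{U}\, \phi_2 &&\text{iff}\ \ \exists j \geq i.\ \Pi, j \models \phi_2 \text{ and } \forall i \leq j' < j.\ \Pi, j' \models \phi_1
\end{aligned}$$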


The distinctive feature of OHyperLTL over HyperLTL is its explicit observations. Given an observation formula ξ and a trace t, we say that ξ is a *valid*

<sup>2</sup> For the examples in Sect. 2, we additionally annotate quantifiers with an STS when we want to reason about different STSs within the same formula. In the following, we assume that all quantifiers range over traces in the same STS to simplify notation.

*observation on* t (written *valid*(t, ξ)) if there are infinitely many i ∈ ℕ such that t(i) |= ξ. If *valid*(t, ξ) holds, we write t<sub>ξ</sub> for the trace obtained by projecting t onto those positions i where t(i) |= ξ, i.e., t<sub>ξ</sub>(i) := t(j) where j is the ith index that satisfies ξ. Given a set of traces T we resolve trace quantification as follows:

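The quantifier clauses also seem to have been dropped in extraction; a reconstruction of the standard resolution of (observation-projected) trace quantification is:

$$\begin{aligned}
\Pi \models_T \forall \pi : \xi.\ \varphi &\ \text{iff}\ \ \Pi[\pi \mapsto t_\xi] \models_T \varphi \text{ for all } t \in T \text{ with } valid(t, \xi)\\
\Pi \models_T \exists \pi : \xi.\ \varphi &\ \text{iff}\ \ \Pi[\pi \mapsto t_\xi] \models_T \varphi \text{ for some } t \in T \text{ with } valid(t, \xi)\\
\Pi \models_T \phi &\ \text{iff}\ \ \Pi, 0 \models \phi
\end{aligned}$$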

The semantics mostly agrees with that of HyperLTL [21] but projects each trace onto the positions where the observation holds. Given an STS T and an OHyperLTL formula ϕ, we write T |= ϕ if ∅ |=<sub>*Traces*(T)</sub> ϕ, where ∅ is the empty trace assignment.
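Operationally, the projection t<sub>ξ</sub> is straightforward. A toy Python sketch (finite prefix instead of an infinite trace, and the observation formula as a Python predicate — both our own simplifications):

```python
# Keep exactly the positions of a (finite prefix of a) trace at which
# the observation predicate xi holds; this mirrors t_xi on prefixes.

def project(prefix, xi):
    return [mu for mu in prefix if xi(mu)]

# Observing only steps where the program counter is at the output
# position pc = 2, as in the example of Sect. 2.1.
trace = [{"pc": 0, "x": 1}, {"pc": 2, "x": 1},
         {"pc": 1, "x": 2}, {"pc": 2, "x": 2}]
observed = project(trace, lambda m: m["pc"] == 2)
```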

*The Power of Observations.* The explicit observations in OHyperLTL facilitate the specification of asynchronous hyperproperties, i.e., properties where traces are traversed at different speeds. For the example in Sect. 2.1, the explicit observations allow us to compare the output of both programs even though the actual step at which the output occurs (in a synchronous semantics) differs between the programs (as P1 takes the inner loop twice as often as P2). As the observations are part of the specification, we can model a broad spectrum of properties, ranging, e.g., from timing-insensitive properties (by placing observations only at output locations) to timing-sensitive specifications [29] (by placing observations at closer intervals). Functional (as opposed to temporal) k-safety properties specified by pre- and postconditions [10,39,41] can easily be encoded as ∀<sup>k</sup> OHyperLTL properties by placing observations at the start and end of each program. By setting ξ = ⊤, i.e., observing *every* step, we can express synchronous properties; OHyperLTL thus subsumes HyperLTL.

*Finite-State Model Checking.* Many mechanisms used to express asynchronous hyperproperties render finite-state model checking undecidable [9,17,31]. In contrast, the simple mechanism used in OHyperLTL maintains decidable finite-state model checking. Detailed proofs can be found in the full version [15].

**Theorem 1.** *Assume an STS* T *with finite variable domains and decidable background theory and an OHyperLTL formula* ϕ*. It is decidable if* T |= ϕ*.*

*Proof Sketch.* Under the assumptions, we can view T as an explicit (instead of symbolic) finite-state transition system. Given an observation formula ξ, we can effectively compute an explicit finite-state system T′ such that *Traces*(T′) = {t<sub>ξ</sub> | t ∈ *Traces*(T) ∧ *valid*(t, ξ)}. This reduces OHyperLTL model checking on T to HyperLTL model checking on T′, which is decidable [28].
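For finite-state systems, the construction of T′ in this proof sketch amounts to a reachability computation: each observation state of T′ steps to the next observation states reachable through non-observation states. A Python sketch under our own explicit-state encoding (states as hashable values, the observation as a predicate):

```python
# Sketch of the construction behind Theorem 1: build a system T' whose
# transitions jump from one observation state to the *next* reachable
# observation states of T, skipping non-observation states.

def project_system(states, edges, initial, xi):
    obs = {s for s in states if xi(s)}

    def next_obs(s):
        """All first observation states reachable after leaving s."""
        found, frontier, seen = set(), set(edges[s]), set()
        while frontier:
            t = frontier.pop()
            if t in seen:
                continue
            seen.add(t)
            if t in obs:
                found.add(t)          # stop here: this is the next observation
            else:
                frontier |= set(edges[t])
        return found

    new_edges = {s: next_obs(s) for s in obs}
    # Initial states of T': the first observation states reachable from
    # T's initial states (an initial observation state counts as-is).
    new_init = set()
    for s in initial:
        new_init |= {s} if s in obs else next_obs(s)
    return obs, new_edges, new_init
```

Traces of T that visit only finitely many observation states (i.e., where *valid*(t, ξ) fails) are implicitly dropped, matching the set comprehension in the proof sketch.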

Note that for infinite-state (symbolic) systems, we cannot effectively compute T′ as in the proof of Theorem 1. In fact, there may not even exist a system T′ with the desired property that is expressible in the same background theory.

The finite-state result in Theorem 1 is of little relevance for the present paper. Nevertheless, it indicates that our logic is well suited for verification of infinite-state (software) systems as the (inevitable) undecidability stems from the infinite domains in software programs and not already from the logic itself.

*Safety.* In this paper, we assume that the hyperproperty is temporally safe [12], i.e., the temporal body of any OHyperLTL formula denotes a *safety property*. Note that, as we support quantifier alternation, we can still express hyperliveness properties [22,23]. For example, GNI is both temporally safe and a hyperliveness property. We model the body of a formula by a symbolic safety automaton [24], which is a tuple A = (Q, q<sub>0</sub>, δ, B) where Q is a finite set of states, q<sub>0</sub> ∈ Q the initial state, B ⊆ Q a set of bad states, and δ a finite set of automaton edges of the form (q, θ, q′) where q, q′ ∈ Q are states and θ is a formula over X̄. Given a trace t of assignments to X̄, a run of A on t is an infinite sequence of states q<sub>0</sub>q<sub>1</sub> ··· (starting in q<sub>0</sub>) such that for every i, there exists an edge (q<sub>i</sub>, θ<sub>i</sub>, q<sub>i+1</sub>) ∈ δ with t(i) |= θ<sub>i</sub>. A word is accepted by A if it has *no* run that visits a state in B. The automaton is *deterministic* if for every q ∈ Q and every assignment μ to X̄, there exists exactly one edge (q, θ, q′) ∈ δ with μ |= θ.
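As an illustration, a deterministic symbolic safety automaton can be prototyped with Python predicates in place of first-order edge formulas (a sketch under our own encoding; the automaton below captures the body □(x<sub>π1</sub> = x<sub>π2</sub>)):

```python
# Sketch of a deterministic symbolic safety automaton: edges carry
# predicates over assignments. A finite prefix is rejected as soon as
# the (unique, by determinism) run reaches a bad state.

class SafetyAutomaton:
    def __init__(self, states, q0, edges, bad):
        self.states, self.q0, self.bad = states, q0, bad
        self.edges = edges  # dict: state -> list of (predicate, successor)

    def prefix_ok(self, prefix):
        q = self.q0
        for mu in prefix:
            # determinism: exactly one edge predicate holds on mu
            q = next(q2 for pred, q2 in self.edges[q] if pred(mu))
            if q in self.bad:
                return False
        return True

# Automaton for G(x_pi1 = x_pi2): stay in q0 while the values agree,
# fall into the bad sink once they differ.
eq = SafetyAutomaton(
    states={"q0", "bad"}, q0="q0",
    edges={
        "q0": [(lambda m: m["x1"] == m["x2"], "q0"),
               (lambda m: m["x1"] != m["x2"], "bad")],
        "bad": [(lambda m: True, "bad")],
    },
    bad={"bad"},
)
```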

### **5 Reductions as a Game**

After having defined our temporal logic, we turn our attention to the automatic verification of OHyperLTL formulas on STSs. In this section, we begin by formalizing our game-based interpretation of a reduction. To illustrate this, we consider ∀<sup>k</sup> OHyperLTL formulas which, as the body of the formula is a safety property, always denote k-safety properties.

*Predicate Abstraction.* Our search for a reduction takes place within the scope of a fixed predicate abstraction [30,33], i.e., we abstract our system by keeping track of the truth values of a few selected predicates that (ideally) identify properties relevant to proving the property in question. Let T = (X, *init*, *step*) be an STS and let ϕ = ∀π<sub>1</sub> : ξ<sub>1</sub> ... ∀π<sub>k</sub> : ξ<sub>k</sub>. φ be the (k-safety) OHyperLTL formula we wish to verify. Let A<sub>φ</sub> = (Q<sub>φ</sub>, q<sub>φ,0</sub>, δ<sub>φ</sub>, B<sub>φ</sub>) be a deterministic safety automaton for φ. A *relational* predicate p is a formula over X̄ that identifies a property of the combined state space of k system copies. Let P = {p<sub>1</sub>,...,p<sub>n</sub>} be a finite set of relational predicates. We say a formula over X̄ is *expressible in* P if it is equivalent to a Boolean combination of the predicates in P. We assume that all edge formulas in the automaton A<sub>φ</sub>, and the formulas *init*<sub>πi</sub> and (ξ<sub>i</sub>)<sub>πi</sub> for π<sub>i</sub> ∈ V, are expressible in P. Note that we can always add missing predicates to P.

Given the set of predicates P, the state space of the abstraction w.r.t. P is given by 𝔹<sup>n</sup>, where for each abstract state ŝ ∈ 𝔹<sup>n</sup>, the ith position ŝ[i] ∈ 𝔹 tracks whether or not predicate p<sub>i</sub> holds. To simplify notation, we write *ite*(b, θ, θ′) for the formula θ if b = ⊤, and θ′ otherwise. For each abstract state ŝ ∈ 𝔹<sup>n</sup>, we define ⟦ŝ⟧ := ⋀<sup>n</sup><sub>i=1</sub> *ite*(ŝ[i], p<sub>i</sub>, ¬p<sub>i</sub>), i.e., ⟦ŝ⟧ is a formula over X̄ that captures all concrete states that are abstracted to ŝ. To incorporate reductions in our abstraction, we parametrize the abstract transition relation by a *scheduling* M ⊆ {π<sub>1</sub>,...,π<sub>k</sub>}. We lift the *step* formula from T by defining

$$step_M := \bigwedge_{i=1}^k ite\Big( \pi_i \in M,\ step_{\pi_i},\ \bigwedge_{x \in X} x'_{\pi_i} = x_{\pi_i} \Big).$$

That is, all copies in M take a step while all other copies remain unchanged. Given two abstract states ŝ<sub>1</sub>, ŝ<sub>2</sub>, we say that ŝ<sub>2</sub> is an M*-successor* of ŝ<sub>1</sub>, written ŝ<sub>1</sub> →<sub>M</sub> ŝ<sub>2</sub>, if ⟦ŝ<sub>1</sub>⟧ ∧ ⟦ŝ<sub>2</sub>⟧′ ∧ *step*<sub>M</sub> is satisfiable (where ⟦ŝ<sub>2</sub>⟧′ denotes ⟦ŝ<sub>2</sub>⟧ over the primed variables), i.e., we can transition from ŝ<sub>1</sub> to ŝ<sub>2</sub> by only progressing the copies in M.
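The M-successor check is a single satisfiability query. Over a small finite toy domain we can mimic the SMT query by brute-force enumeration; everything below (one variable per copy, the step relation, the predicate set) is our own illustrative assumption:

```python
# Sketch of the abstract M-successor check: s1 --M--> s2 holds iff
# [[s1]] ∧ [[s2]]' ∧ step_M is satisfiable. Here satisfiability is
# decided by enumerating a tiny finite domain instead of calling SMT.

from itertools import product

def abstract(mu, preds):
    """Abstraction of a concrete joint state: one bit per predicate."""
    return tuple(p(mu) for p in preds)

def is_m_successor(s1, s2, preds, step, scheduled, copies, domain):
    keys = [f"x{i}" for i in copies]        # one variable x per copy (assumption)
    for vals in product(domain, repeat=len(keys)):
        mu = dict(zip(keys, vals))
        if abstract(mu, preds) != s1:
            continue
        for vals2 in product(domain, repeat=len(keys)):
            nu = dict(zip(keys, vals2))
            if abstract(nu, preds) != s2:
                continue
            # scheduled copies take a proper step, the rest stutter
            if all(step(mu[f"x{i}"], nu[f"x{i}"]) if i in scheduled
                   else mu[f"x{i}"] == nu[f"x{i}"] for i in copies):
                return True
    return False

# Two copies of a counter x' = x + 1 with the relational predicate x1 = x2.
preds = [lambda m: m["x1"] == m["x2"]]
step = lambda x, xp: xp == x + 1
```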

For an abstract state ŝ, we define *obs*(ŝ) ∈ 𝔹<sup>k</sup> as the Boolean vector that indicates which copies (of π<sub>1</sub>,...,π<sub>k</sub>) are currently at an observation point, i.e., *obs*(ŝ)[i] = ⊤ iff ⟦ŝ⟧ ∧ (ξ<sub>i</sub>)<sub>πi</sub> is satisfiable. Note that as (ξ<sub>i</sub>)<sub>πi</sub> is, by assumption, expressible in P, either all or none of the concrete states in ⟦ŝ⟧ satisfy (ξ<sub>i</sub>)<sub>πi</sub>.

*Game Construction.* Building on the parametrized abstract transition relation, we construct a (finite-state) safety game in which winning strategies for the verifier correspond to valid reductions with accompanying proofs. The nodes of our game have two forms: either they are of the form (ŝ, q, b), where ŝ ∈ 𝔹<sup>n</sup> is an abstract state, q ∈ Q<sub>φ</sub> a state of the safety automaton, and b ∈ 𝔹<sup>k</sup> a Boolean vector indicating which copies have moved since the last automaton step; or of the form (ŝ, q, b, M), where ŝ, q, and b are as before and ∅ ≠ M ⊆ {π<sub>1</sub>,...,π<sub>k</sub>} is a scheduling. The initial states are all states (ŝ, q<sub>φ,0</sub>, ⊤<sup>k</sup>) where ⟦ŝ⟧ ∧ ⋀<sup>k</sup><sub>i=1</sub> *init*<sub>πi</sub> is satisfiable (recall that *init*<sub>πi</sub> is expressible in P). We mark a state (ŝ, q, b) or (ŝ, q, b, M) as losing iff q ∈ B<sub>φ</sub>. For an automaton state q ∈ Q<sub>φ</sub> and abstract state ŝ, we define δ<sub>φ</sub>(q, ŝ) as the *unique* state q′ such that there is an edge (q, θ, q′) ∈ δ<sub>φ</sub> with ⟦ŝ⟧ ∧ θ satisfiable. Uniqueness follows from the assumption that A<sub>φ</sub> is deterministic and all edge formulas are expressible in P. The transition relation of our game is given by the following rules:

$$\frac{\forall \pi_i \in M.\ \neg b[i] \lor \neg obs(\hat{s})[i]}{(\hat{s},q,b) \rightsquigarrow (\hat{s},q,b,M)}\ \textbf{(1)} \qquad \frac{obs(\hat{s}) = \top^k \quad q' = \delta_\phi(q,\hat{s})}{(\hat{s},q,\top^k) \rightsquigarrow (\hat{s},q',\bot^k)}\ \textbf{(2)}$$

$$\frac{\hat{s} \xrightarrow{M} \hat{s}' \quad b' = b[i \mapsto \top]_{\pi_i \in M}}{(\hat{s},q,b,M) \rightsquigarrow (\hat{s}',q,b')}\ \textbf{(3)}$$

In rule **(1)**, we select any scheduling that schedules only copies that have not yet reached an observation point or have not moved since the last automaton step. In particular, we cannot schedule any copy that has moved and already reached an observation point. In rule **(2)**, all copies have reached an observation point and have moved since the last update (i.e., b = ⊤<sup>k</sup>), so we progress the automaton and reset b. Lastly, in rule **(3)**, we select an M-successor of ŝ and update b for all copies that take part in the step. In our game, player SAFE takes the role of the verifier, and player REACH that of the refuter. It is the safety player's responsibility to select a scheduling in each step, so we assign nodes of the form (ŝ, q, b) to SAFE. Nodes of the form (ŝ, q, b, M) are controlled by REACH, who can choose an abstract M-successor. Let G<sup>∀</sup>(T, ϕ, P) be the resulting (finite-state) safety game. A winning strategy for SAFE in G<sup>∀</sup>(T, ϕ, P) picks, in each abstract state, a valid scheduling that prevents a visit to a losing state. We can thus show:

**Theorem 2.** *If player SAFE wins* G<sup>∀</sup>(T, ϕ, P)*, then* T |= ϕ*.*

*Proof Sketch.* Assume σ is a winning strategy for SAFE in G<sup>∀</sup>(T, ϕ, P). Let t<sub>1</sub>,...,t<sub>k</sub> ∈ *Traces*(T) be arbitrary. We iteratively construct stuttered versions t′<sub>1</sub>,...,t′<sub>k</sub> of t<sub>1</sub>,...,t<sub>k</sub> by querying σ on abstracted prefixes of t<sub>1</sub>,...,t<sub>k</sub>: whenever σ schedules copy i, we take a proper step on t<sub>i</sub>; otherwise we stutter. By construction of G<sup>∀</sup>(T, ϕ, P), the stuttered traces t′<sub>1</sub>,...,t′<sub>k</sub> align at observation points. In particular, we have [π<sub>1</sub> → (t<sub>1</sub>)<sub>ξ1</sub>,...,π<sub>k</sub> → (t<sub>k</sub>)<sub>ξk</sub>] |= φ iff [π<sub>1</sub> → (t′<sub>1</sub>)<sub>ξ1</sub>,...,π<sub>k</sub> → (t′<sub>k</sub>)<sub>ξk</sub>] |= φ. Moreover, the sequence of abstract states in G<sup>∀</sup>(T, ϕ, P) forms an abstraction of t′<sub>1</sub>,...,t′<sub>k</sub> and shows that A<sub>φ</sub> cannot reach a bad state when reading (t′<sub>1</sub>)<sub>ξ1</sub>,...,(t′<sub>k</sub>)<sub>ξk</sub> (as σ is winning). This shows that [π<sub>1</sub> → (t′<sub>1</sub>)<sub>ξ1</sub>,...,π<sub>k</sub> → (t′<sub>k</sub>)<sub>ξk</sub>] |= φ and thus [π<sub>1</sub> → (t<sub>1</sub>)<sub>ξ1</sub>,...,π<sub>k</sub> → (t<sub>k</sub>)<sub>ξk</sub>] |= φ. As this holds for all traces t<sub>1</sub>,...,t<sub>k</sub> ∈ *Traces*(T), we get T |= ϕ as required.

*Game Construction and Complexity.* If the background theory is decidable, G<sup>∀</sup>(T, ϕ, P) can be constructed effectively using at most 2<sup>|P|+1</sup> · 2<sup>k</sup> queries to an SMT solver. Checking whether SAFE wins G<sup>∀</sup>(T, ϕ, P) can be done with a simple fixpoint computation of the attractor in linear time.

Our game-based method of finding a reduction within a given abstraction is closely related to the notion of a *property-directed self-composition* [39]. The only previously known algorithm for finding such a reduction is based on an optimized enumeration [39], which, in the worst case, requires O(2<sup>|P|+1</sup> · 2<sup>k</sup>) many enumerations. Our worst-case complexity thus matches the bound inferred by [39] but avoids the explicit enumeration of reductions (and the concomitant repeated construction of the abstract state space) and is, we believe, conceptually simpler. Moreover, our game-based technique is the key stepping stone for extending our method beyond k-safety in Sect. 6.

### **6 Verification Beyond** *k***-Safety**

Building on the game-based interpretation of a reduction, we extend our verification beyond ∀<sup>∗</sup> properties to support ∀<sup>∗</sup>∃<sup>∗</sup> properties. We accomplish this by *combining* the game-based reading of a reduction (as discussed in the previous section) with a game-based reading of existential quantification. For the remainder of this section, fix an STS T = (X, *init*, *step*) and let

$$\varphi = \forall \pi_1 : \xi_1 \ldots \forall \pi_l : \xi_l.\ \exists \pi_{l+1} : \xi_{l+1} \ldots \exists \pi_k : \xi_k.\ \phi$$

be the OHyperLTL formula we wish to check, i.e., we universally quantify over l traces, followed by an existential quantification over k − l traces. We assume that for every existential quantification ∃π<sub>i</sub> : ξ<sub>i</sub> occurring in ϕ, *valid*(t, ξ<sub>i</sub>) holds for every t ∈ *Traces*(T) (we discuss this assumption in Remark 1).

#### **6.1 Existential Trace Quantification as a Game**

The idea of a game-based verification of ∀<sup>∗</sup>∃<sup>∗</sup> properties is to consider a ∀<sup>∗</sup>∃<sup>∗</sup> property as a game between verifier and refuter [23]. The refuter controls the l universally quantified traces by moving through l copies of the system (thereby producing traces π<sub>1</sub>,...,π<sub>l</sub>), and the verifier reacts by, incrementally, moving through k − l copies of the system (thereby producing traces π<sub>l+1</sub>,...,π<sub>k</sub>). If the verifier has a strategy that ensures that the resulting traces satisfy φ, then T |= ϕ holds. We call such a strategy for the verifier a *witness strategy*.

We combine this game-based reading of existential quantification with our game-based interpretation of a reduction by, additionally, letting the verifier control the scheduling of the system. When played on the *concrete* state space of T, the game proceeds in three stages: 1) the verifier selects a valid scheduling M ⊆ {π<sub>1</sub>,...,π<sub>k</sub>}; 2) the refuter selects successor states for all universally quantified copies by fixing an assignment to X′<sub>π1</sub>,...,X′<sub>πl</sub> (only moving copies scheduled by M); 3) the verifier reacts by choosing successor states for the existentially quantified copies by fixing an assignment to X′<sub>πl+1</sub>,...,X′<sub>πk</sub> (again, only moving copies scheduled by M). Afterward, the process repeats.

As we work within a fixed abstraction of T, the verifier cannot, however, choose concrete successor states directly but only operate at the precision captured by the abstraction. Following the general scheme of abstract games, we therefore underapproximate the moves available to the verifier [2]. Formally, we abstract the three-stage game outlined before (which was played at the level of concrete states) into a simpler abstract game consisting of only two stages. In the first stage, the verifier selects both a scheduling M and a *restriction* on the set of abstract successor states, i.e., a set of abstract states A. In the second stage, the refuter cannot choose an arbitrary abstract successor state (an arbitrary M-successor in the terminology of Sect. 5), but only a successor contained in the restriction A. To guarantee the soundness of this approach, we ensure that the verifier can only pick restrictions that are *valid*, i.e., restrictions that underapproximate the possibilities of the verifier at the level of concrete states.

*Game Construction.* We modify our game from Sect. 5 as follows. States are either of the form (ŝ, q, b) (as in Sect. 5) or of the form (ŝ, q, b, M, A), where ŝ, q, b, and M are as in Sect. 5, and A ⊆ 𝔹<sup>n</sup> is a subset of abstract states (the restriction). To reflect the restriction, we modify transition rules **(1)** and **(3)**. Rule **(2)** remains unchanged.

$$\frac{\forall \pi_i \in M.\ \neg b[i] \lor \neg obs(\hat{s})[i] \quad validRes_{\hat{s},M}(A)}{(\hat{s}, q, b) \rightsquigarrow (\hat{s}, q, b, M, A)}\ \textbf{(1)} \quad \frac{\hat{s}' \in A \quad b' = b[i \mapsto \top]_{\pi_i \in M}}{(\hat{s}, q, b, M, A) \rightsquigarrow (\hat{s}', q, b')}\ \textbf{(3)}$$

In rule **(1)**, the safety player (who, again, takes the role of the verifier) selects both a scheduling M and a restriction A such that *validRes*<sub>ŝ,M</sub>(A) holds (which we define below). The reachability player (who takes the role of the refuter) can, in rule **(3)**, select any successor contained in A.

*Valid Restriction.* The above game construction depends on the definition of *validRes*<sub>ŝ,M</sub>(A). Intuitively, A is a valid restriction if it underapproximates the possibilities of a witness strategy that can pick concrete successor states for all existentially quantified traces. That is, from every concrete state in ⟦ŝ⟧, a witness strategy (at the level of concrete states) can guarantee a move to a concrete state that is abstracted to an abstract state within A. Formally, we define *validRes*<sub>ŝ,M</sub>(A) as follows:

$$\forall \{X_{\pi_i}\}_{i=1}^{k}.\ \forall \{X'_{\pi_i}\}_{i=1}^{l}.\ \llbracket\hat{s}\rrbracket \land \bigwedge_{i=1}^{l} ite\Big(\pi_i \in M,\ step_{\pi_i},\ \bigwedge_{x \in X} x'_{\pi_i} = x_{\pi_i}\Big)$$
$$\Rightarrow\ \exists \{X'_{\pi_i}\}_{i=l+1}^{k}.\ \bigwedge_{i=l+1}^{k} ite\Big(\pi_i \in M,\ step_{\pi_i},\ \bigwedge_{x \in X} x'_{\pi_i} = x_{\pi_i}\Big) \land \bigvee_{\hat{s}' \in A} \llbracket\hat{s}'\rrbracket'.$$

This expresses that for all concrete states in ⟦ŝ⟧ (assignments to {X<sub>πi</sub>}<sup>k</sup><sub>i=1</sub>) and all concrete successor states for the universally quantified copies (assignments to {X′<sub>πi</sub>}<sup>l</sup><sub>i=1</sub>), there exist successor states for the existentially quantified copies (assignments to {X′<sub>πi</sub>}<sup>k</sup><sub>i=l+1</sub>) such that one of the abstract states in A is reached.
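The validRes condition can likewise be checked by brute force on a toy instance (two copies, one variable per copy, both copies scheduled — all our own simplifications standing in for the quantified SMT query):

```python
# Sketch of the validRes check: for every concrete state in [[s_hat]]
# and every universal successor, some existential successor must land
# in one of the abstract states in A. Copy 1 is universal, copy 2 is
# existential; satisfiability/validity is decided by enumeration.

from itertools import product

def valid_restriction(s_hat, A, preds, step_univ, step_exist, domain):
    def abstract(x1, x2):
        return tuple(p(x1, x2) for p in preds)
    for x1, x2 in product(domain, repeat=2):
        if abstract(x1, x2) != s_hat:
            continue                       # not a concrete state of s_hat
        for x1p in domain:
            if not step_univ(x1, x1p):
                continue                   # refuter's possible moves
            # the verifier must find some existential successor into A
            if not any(step_exist(x2, x2p) and abstract(x1p, x2p) in A
                       for x2p in domain):
                return False
    return True

# Both copies perform a nondeterministic choice: restricting to the
# abstract state "x1 = x2" is valid, since the verifier can always
# mirror the refuter's choice.
preds = [lambda a, b: a == b]
any_step = lambda x, xp: True
```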

*Example 1.* With this definition at hand, we can validate the restrictions chosen by the strategy in Fig. 3c. For example, in state α<sub>7</sub> the strategy schedules M = {2} and restricts the successor states to {α<sub>8</sub>}, even though the abstract state [(6, 4), a<sub>1</sub> = a<sub>2</sub>, x<sub>1</sub> ≠ x<sub>2</sub>] is also a {2}-successor of α<sub>7</sub>. If we spell out *validRes*<sub>α7,{2}</sub>({α<sub>8</sub>}) we get

$$\forall X_1 \cup X_2 \cup X'_1.\ \underbrace{a_1 = a_2}_{\llbracket \alpha_7 \rrbracket} \land \Big(\bigwedge_{z \in X} z'_1 = z_1\Big) \Rightarrow \exists X'_2.\ \underbrace{a'_2 = a_2 \land y'_2 = y_2}_{step_{\pi_2}} \land \underbrace{(a'_1 = a'_2 \land x'_1 = x'_2)}_{\llbracket \alpha_8 \rrbracket'}$$

where X = {a, x, y}. Here we assume that *step* := (a′ = a ∧ y′ = y) is the update performed by the instruction x ← ★<sub>ℕ</sub> from Q2:3 to Q2:4. The above formula is valid.

*Correctness.* Call the resulting game G<sup>∀∃</sup>(T, ϕ, P). The game combines the search for a reduction with that for a witness strategy (both within the precision captured by P).<sup>3</sup> We can show:

**Theorem 3.** *If player SAFE wins* G<sup>∀∃</sup>(T, ϕ, P)*, then* T |= ϕ*.*

*Proof Sketch.* Let σ be a winning strategy for SAFE in G<sup>∀∃</sup>(T, ϕ, P), and let t<sub>1</sub>,...,t<sub>l</sub> ∈ *Traces*(T) be arbitrary. We use σ to incrementally construct witness traces t<sub>l+1</sub>,...,t<sub>k</sub>. In every abstract state ŝ, σ selects a scheduling M and a restriction A such that *validRes*<sub>ŝ,M</sub>(A) holds. We plug the current *concrete* state (reached in our construction of t<sub>l+1</sub>,...,t<sub>k</sub>) into the universal quantification of *validRes*<sub>ŝ,M</sub>(A) and obtain (concrete) witnesses for the existential quantification that, by definition of *validRes*<sub>ŝ,M</sub>(A), are valid successors for the existentially quantified copies in T.

*Remark 1.* Recall that we assume that for every existential quantification ∃π<sub>i</sub> : ξ<sub>i</sub> occurring in ϕ and all t ∈ *Traces*(T), *valid*(t, ξ<sub>i</sub>) holds. This is important to ensure that the safety player (the verifier) cannot avoid observation points forever. We could drop this assumption by strengthening the winning condition in G<sup>∀∃</sup>(T, ϕ, P) to explicitly state that, in order to win, SAFE must visit observation points on existentially quantified traces infinitely often.

<sup>3</sup> In particular, G<sup>∀∃</sup>(T, ϕ, P) (strictly) generalizes the construction of G<sup>∀</sup>(T, ϕ, P) from Sect. 5: if k = l (i.e., the property is a ∀<sup>∗</sup>-property), the unique minimal valid restriction from ŝ, M is {ŝ′ | ŝ −→<sub>M</sub> ŝ′}, i.e., the set of all M-successors of ŝ. The safety player can thus not be more restrictive than allowing *all* M-successors (as in G<sup>∀</sup>(T, ϕ, P)).

*Clairvoyance vs. Abstraction.* The cooperation between reduction (the verifier's ability to select schedulings) and witness strategy (the ability to select restrictions on the successors) can be seen as a limited form of prophecy [1,14]. By first scheduling the universal copies, the witness strategy can peek at future moves before committing to a successor state, as we saw, e.g., in Fig. 3. The "theoretically optimal" reduction is thus a sequential one that first schedules only the universally quantified traces (until an observation point is reached) and thereby provides maximal information for the witness strategy. However, in the context of a fixed abstraction, this reduction is not always optimal. For example, in Fig. 3 the strategy schedules the loop in lock-step, which is crucial for generating a proof with simple (linear) invariants. In particular, Fig. 3 admits neither a witness strategy in the (pure) lock-step reduction nor a proof with linear invariants in a sequential reduction. Our verification framework therefore strikes a delicate balance between the clairvoyance needed by the witness strategy and the precision captured in the abstraction, further emphasizing why the searches for reduction and witness strategy need to be mutually dependent.

#### **6.2 Constructing and Solving G<sup>∀∃</sup>(T, ϕ, P)**

Constructing the game graph of G<sup>∀∃</sup>(T, ϕ, P) requires identifying all valid restrictions (of which there are exponentially many in the number of abstract states, and thus doubly exponentially many in the number of predicates), each of which requires solving a quantified SMT query. We propose a more effective algorithm that solves G<sup>∀∃</sup>(T, ϕ, P) without constructing it explicitly. Instead, we iteratively refine an abstraction G̃ of G<sup>∀∃</sup>(T, ϕ, P). Our method hinges on the following easy observation:

**Lemma 1.** *For any* ŝ *and* M*, the set* {A | *validRes*<sub>ŝ,M</sub><sup>A</sup>} *is upward closed (w.r.t.* ⊆*).*

Our initial abstraction consists of all possible restrictions (even those that might be invalid), i.e., we allow all restrictions of the form (ŝ, M, A) where A ⊆ {ŝ′ | ŝ −→<sub>M</sub> ŝ′}.<sup>4</sup> This overapproximates the power of the safety player, i.e., a winning strategy for SAFE in G̃ may not be valid in G<sup>∀∃</sup>(T, ϕ, P). To remedy this, we propose the following inner refinement loop: if we find a winning strategy σ for

<sup>4</sup> Note that {ŝ′ | ŝ −→<sub>M</sub> ŝ′} is always a valid restriction. Importantly, we can compute {ŝ′ | ŝ −→<sub>M</sub> ŝ′} locally, i.e., by iterating over abstract states as opposed to *sets* of abstract states.

SAFE in G̃, we check whether all restrictions chosen by σ are valid. If this is the case, σ is also winning for G<sup>∀∃</sup>(T, ϕ, P) and we can apply Theorem 3. If we find an invalid restriction (ŝ, M, A) used by σ, we refine G̃ by removing not only the restriction (ŝ, M, A) but *all* (ŝ, M, A′) with A′ ⊆ A (which is justified by Lemma 1). The algorithm is sketched in Algorithm 1. The subroutine *Restrictions*(σ) returns all restrictions used by σ, i.e., all tuples (ŝ, M, A) such that σ uses an edge (ŝ, q, b) → (ŝ, q, b, M, A) for some q, b. *Remove*(G̃, (ŝ, M, A′)) removes from G̃ all edges of the form (ŝ, q, b) → (ŝ, q, b, M, A′) for some q, b, and *Solve* solves a finite-state safety game. To improve the algorithm further, in line 4 we always compute a maximal safety strategy, i.e., a strategy that selects maximal restrictions (w.r.t. ⊆) and therefore allows us to eliminate many invalid restrictions from G̃ simultaneously. For safety games, such a maximal winning strategy always exists (see, e.g., [11]). Note that while G̃ is large, solving this finite-state game can be done very efficiently. The running time of solving G<sup>∀∃</sup>(T, ϕ, P) is dominated by the SMT queries, of which our refinement loop, in practice, requires very few.
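
A minimal sketch of this inner refinement loop, with `solve`, `restrictions_of`, and `remove` standing in for *Solve*, *Restrictions*, and *Remove*, and `is_valid` standing in for the quantified SMT query *validRes*; the toy instantiation at the bottom is entirely hypothetical:

```python
def lazy_solve(game, solve, restrictions_of, is_valid, remove):
    """Sketch of Algorithm 1: solve the abstract game without enumerating
    all (exponentially many) restrictions up front."""
    while True:
        sigma = solve(game)
        if sigma is None:
            return None                     # SAFE loses even the overapproximation
        invalid = [r for r in restrictions_of(sigma) if not is_valid(r)]
        if not invalid:
            return sigma                    # also winning in the real game (Theorem 3)
        for r in invalid:
            remove(game, r)                 # prunes r and all A' subset of A (Lemma 1)

# Toy instantiation: a "strategy" is just the single restriction set it uses.
game = {"R": {frozenset({1}), frozenset({1, 2})}}
solve = lambda g: min(g["R"], key=len) if g["R"] else None    # hypothetical solver
restrictions_of = lambda sigma: [sigma]
is_valid = lambda r: r == frozenset({1, 2})                   # pretend SMT answer
remove = lambda g, r: g["R"].difference_update({a for a in g["R"] if a <= r})
print(lazy_solve(game, solve, restrictions_of, is_valid, remove))
```

In the toy run, the first candidate restriction is found invalid, pruned together with all its subsets, and the next iteration returns the valid one; the real algorithm additionally prefers maximal strategies so that a single SMT counterexample prunes many restrictions.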

### **7 Implementation and Evaluation**

Combining Theorem 3 with our iterative solver from Sect. 6.2, we obtain an algorithm to verify ∀<sup>∗</sup>∃<sup>∗</sup>-safety properties within a given abstraction. We have implemented a prototype of our method in a tool we call HyPA. We use Z3 [36] to discharge SMT queries. The input of our tool is provided as an arbitrary STS in the SMT-LIB format [5], making it *language-independent*. In our programs, we make the program counter explicit, allowing us to track predicates locally [32].

*Evaluation for* k*-Safety.* As a special case of ∀<sup>∗</sup>∃<sup>∗</sup> properties, HyPA is also applicable to k-safety verification. We collected an exemplifying suite of programs and k-safety properties from the literature [27,39–41] and manually translated them into STS (this can be automated easily). The results are given in Table 1. As done by Shemer et al. [39], we already provide a

**Table 1.** Evaluation of HyPA on k-safety instances. We give the size of the abstract game-space (Size), the time taken to compute the abstraction (t<sub>abs</sub>), and the overall time taken by HyPA (t). Times are given in seconds.


set of predicates that is sufficient for *some* reduction (but not necessarily the lock-step or sequential one), the search for which is then automated by HyPA. Our results show that the game-based search for a reduction can verify interesting

**Table 2.** Evaluation of HyPA on ∀<sup>∗</sup>∃<sup>∗</sup>-safety verification instances. We give the size and construction time of the initial abstraction (Size and t<sub>abs</sub>). For both the direct (explicit) and the lazy (Algorithm 1) solver, we give the time to construct (and solve) the game (t<sub>solve</sub>) and the overall time (t = t<sub>abs</sub> + t<sub>solve</sub>). For the lazy solver, we additionally give the number of refinement iterations (#Ref). Times are given in seconds. TO indicates a timeout after 5 min.


k-safety properties from the literature. We also note that, currently, the vast majority of the time is spent on the construction of the abstract system. If we moved to a fixed language, the computation time of the initial abstraction could be reduced by using existing (heavily optimized) abstraction tools [18,32].

*Evaluation Beyond* k*-Safety.* The main novelty of HyPA lies in its ability to, for the first time, verify temporal properties beyond k-safety. As none of the existing tools can verify such properties, we compiled a collection of very small example programs and ∀<sup>∗</sup>∃<sup>∗</sup>-safety properties. Additionally, we modified the Boolean programs from [13] (where GNI was checked on Boolean programs) by including data from infinite domains. The properties we checked range from refinement properties for compiler optimizations, through general refinement of nondeterministic programs, to generalized non-interference. Verification often requires a non-trivial combination of reduction and witness strategy (as the reduction must, e.g., compensate for branches of different lengths). As before, we provide a set of predicates and let HyPA automatically search for a witness strategy with an accompanying reduction. We list the results in Table 2. To highlight the effectiveness of our inner refinement loop, we apply both a direct (explicit) construction of G<sup>∀∃</sup>(T, ϕ, P) and the lazy (iterative) solver in Algorithm 1. Our lazy solver clearly outperforms the explicit construction and is often the only method that solves the game in reasonable time. In particular, we require very few refinement iterations and therefore also few expensive SMT queries. Unsurprisingly, verifying properties beyond k-safety is much more challenging than k-safety verification, as it involves the *synthesis* of a witness function, which is already 2EXPTIME-hard for finite-state systems [23,37]. We emphasize that no other existing tool can verify any of these benchmarks.

# **8 Related Work**

*Asynchronous Hyperproperties.* Recently, many logics for the formal specification of asynchronous hyperproperties have been developed [9,13,17,31]. Our logic OHyperLTL is closely related to stuttering HyperLTL (HyperLTL<sub>S</sub>) [17]. In HyperLTL<sub>S</sub>, each temporal operator is endowed with a set of temporal formulas Γ, and steps where the truth values of all formulas in Γ remain unchanged are ignored during the operator's evaluation. As for most mechanisms used to design asynchronous hyperlogics [9,17,31], finite-state model checking of HyperLTL<sub>S</sub> is undecidable. By contrast, in OHyperLTL we always observe the trace at a fixed location, which is key for ensuring decidable finite-state model checking.

k*-Safety Verification.* The literature on k-safety verification is rich. Many approaches verify k-safety by using a form of self-composition [8,20,25,28] and often employ reductions to obtain compositions that are easier to verify. Our game-based interpretation of a reduction (Sect. 5) is related to Shemer et al. [39], who study k-safety verification within a given predicate abstraction using an enumeration-based solver (see Sect. 5 for a discussion). Farzan and Vandikas [27] present a counterexample-guided refinement loop that simultaneously searches for a reduction and a proof. Sousa and Dillig [40] facilitate reductions at the source-code level within a program logic.

∀<sup>∗</sup>∃<sup>∗</sup>*-Verification.* Barthe et al. [7] describe an asymmetric product of the system such that only a subset of the behavior of the second system is preserved, thereby allowing the verification of ∀<sup>∗</sup>∃<sup>∗</sup> properties. Constructing an asymmetric product and verifying its correctness (i.e., showing that the product preserves all behavior of the first, universally quantified, system) is challenging. Unno et al. [41] present a constraint-based approach to verify functional (as opposed to temporal) ∀∃ properties in infinite-state systems using an extension of constrained Horn clauses called pfwCHC. The underlying verification approach is orthogonal to ours: pfwCHC allows for a clean separation of the actual verification and the verification conditions, whereas our approach combines both. For example, our method can prove the existence of a witness strategy without ever formulating precise constraints on the strategy (which seems challenging). Coenen et al. [23] introduce the game-based reading of existential quantification to verify temporal ∀<sup>∗</sup>∃<sup>∗</sup> properties in a synchronous and finite-state setting. By contrast, our work constitutes the first verification method for temporal ∀<sup>∗</sup>∃<sup>∗</sup>-safety properties in *infinite-state* systems. The key to our method is a careful integration of reductions, which is not possible in a synchronous setting. For finite-state systems (where the abstraction is precise) and synchronous specifications (where we observe every step), our method subsumes the one in [23]. Beutner and Finkbeiner [14] use prophecy variables to ensure that the game-based reading of existential quantification is complete in a finite-state setting. Automatically constructing prophecies for infinite-state systems is interesting future work. Pommellet and Touili [38] study the verification of HyperLTL in infinite-state systems arising from pushdown systems. By contrast, we study verification in infinite-state systems that arise from the infinite variable domains used in software.

*Game Solving.* Our game-based interpretations are naturally related to infinite-state game solving [4,16,26,42]. State-of-the-art solvers for infinite-state games unroll the game [26], use necessary subgoals to inductively split a game into subgames [4], encode the game as a constraint system [16], or iteratively refine the controllable predecessor operator [42]. We tried to encode our verification approach directly as an infinite-state linear-arithmetic game. However, existing solvers (which, notably, work *without* a user-provided set of predicates) could not solve the resulting game [4,26]. Our method of encoding the witness strategy using *restrictions* corresponds to hyper-must edges in general abstract games [2,3]. Our inner refinement loop for solving a game with hyper-must edges without explicitly identifying all edges (Algorithm 1) is thus also applicable in general abstract games.

### **9 Conclusion**

In this work, we have presented the first verification method for temporal hyperproperties beyond k-safety in infinite-state systems arising in software. Our method is based on a game-based interpretation of reductions and existential quantification and allows for the mutual dependence of both. Interesting future directions include the integration of our method into a counterexample-guided refinement loop that automatically refines the abstraction, as well as ways to lift the current restriction to temporally safe specifications. Moreover, it is interesting to study if, and to what extent, the numerous other methods developed for k-safety verification of infinite-state systems (apart from reductions) are applicable to the vast landscape of hyperproperties that lies beyond k-safety.

**Acknowledgments.** This work was partially supported by the DFG in project 389792660 (Center for Perspicuous Systems, TRR 248). R. Beutner carried out this work as a member of the Saarbrücken Graduate School of Computer Science.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Scalable Shannon Entropy Estimator**

Priyanka Golia1,2(B), Brendan Juba<sup>3</sup>, and Kuldeep S. Meel<sup>2</sup>

<sup>1</sup> Indian Institute of Technology Kanpur, Kanpur, India

pgolia@cse.iitk.ac.in

<sup>2</sup> National University of Singapore, Singapore, Singapore

<sup>3</sup> Washington University in St. Louis, St. Louis, USA

**Abstract.** Quantified information flow (QIF) has emerged as a rigorous approach to quantitatively measure confidentiality; the information-theoretic underpinning of QIF allows end-users to link the computed quantities with the computational effort required on the part of the adversary to gain access to desired confidential information. In this work, we focus on the estimation of Shannon entropy for a given program Π. As a first step, we focus on the case wherein a Boolean formula ϕ(X, Y) captures the relationship between inputs X and outputs Y of Π. Such formulas ϕ(X, Y) have the property that for every valuation to X, there exists exactly one valuation to Y such that ϕ is satisfied. The existing techniques require O(2<sup>m</sup>) model counting queries, where m = |Y|.

We propose the first efficient algorithmic technique, called EntropyEstimation, to estimate the Shannon entropy of ϕ with PAC-style guarantees, i.e., the computed estimate is guaranteed to lie within a (1 ± ε) factor of the ground truth with confidence at least 1 − δ. Furthermore, EntropyEstimation makes only O(min(m, n)/ε<sup>2</sup>) counting and sampling queries, where m = |Y| and n = |X|, thereby achieving a significant reduction in the number of model counting queries. We demonstrate the practical efficiency of our algorithmic framework via a detailed experimental evaluation. Our evaluation demonstrates that the proposed framework scales to formulas beyond the reach of previously known approaches.

#### **1 Introduction**

Over the past half-century, the cost effectiveness of digital services has led to an unprecedented adoption of technology in virtually all aspects of our modern lives. Such adoption has invariably led to sensitive information being stored in data centers around the world and increasingly complex software accessing the information in order to provide the services that form the backbone of our modern economy and social interactions. At the same time, it is vital that protected information does not leak, as such leakages may have grave financial and societal

EntropyEstimation is available open source at https://github.com/meelgroup/entropyestimation. The names of the authors are sorted alphabetically, and the order does not reflect contribution.

consequences. Consequently, the detection and prevention of information leakage in software have attracted sustained interest in the security community.

The earliest efforts on information leakage focused on *qualitative* approaches that sought to return a Boolean output of the form "yes" or "no" [11,26,30]. While these qualitative approaches successfully capture situations where part of the code accesses prohibited information, they are not well suited to situations wherein some information leakage is inevitable. An oft-repeated example of such a situation is a *password checker*, wherein every response "incorrect password" does leak information about the *secret password*. As a result, the past decade has seen the rise of quantified information flow analysis (QIF) as a rigorous approach to quantitatively measure confidentiality [7,53,57]. The information-theoretic underpinnings of QIF analyses allow an end-user to link the computed quantities with the probability of an adversary successfully guessing a secret, or the worst-case computational effort required for the adversary to infer the underlying confidential information. Consequently, QIF has been applied in diverse use cases such as detecting software side channels [40], inferring search-engine queries through auto-complete response sizes [21], and measuring the tendency of Linux to leak TCP-session sequence numbers [59].

The standard recipe for using the QIF framework is to measure the information leakage from an underlying program Π as follows. In a simplified model, a program Π maps a set of controllable inputs (C) and secret inputs (I) to outputs (O) observable to an attacker. The attacker is interested in inferring I based on the output O. A diverse array of approaches has been proposed to efficiently model Π, with techniques relying on a combination of symbolic analysis [48], static analysis [24], automata-based techniques [4,5,14], SMT-based techniques [47], and the like. For each, the core underlying technical problem is to determine the leakage of information for a given observation. This leakage is often captured using entropy-theoretic notions, such as Shannon entropy [7,16,48,53] or min-entropy [7,44,48,53]. In this work, we focus on computing Shannon entropy.

In this work, we focus on entropy estimation for programs modeled by Boolean formulas; nevertheless, our techniques are general and can be extended to other models such as automata-based frameworks. Let a formula ϕ(X, Y) capture the relationship between X and Y such that for every valuation to X there is at most one valuation to Y such that ϕ is satisfied; one can view X as the set of inputs and Y as the set of outputs. Let m = |Y| and n = |X|. Let p be a probability distribution over {0, 1}<sup>Y</sup> such that for every assignment σ: Y → {0, 1} we have p<sub>σ</sub> = |sol(ϕ(Y → σ))|/2<sup>n</sup>, where sol(ϕ(Y → σ)) denotes the set of solutions of ϕ(Y → σ). Then, the entropy of ϕ is defined as H<sub>ϕ</sub>(Y) = Σ<sub>σ</sub> p<sub>σ</sub> log(1/p<sub>σ</sub>).
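
As an illustration of these definitions (not of the algorithm itself), the following brute-force sketch computes H<sub>ϕ</sub>(Y) for a hypothetical 4-bit password checker; the names and the secret are ours:

```python
from itertools import product
from math import log2

# Hypothetical circuit formula phi(X, Y): a 4-bit password checker whose only
# output is Y = [X == SECRET]; every input has exactly one output valuation.
n = 4
SECRET = (1, 0, 1, 1)

def out(x):                        # the unique Y-valuation for an X-valuation
    return (int(x == SECRET),)

# p_sigma = |sol(phi(Y -> sigma))| / 2^n ; H_phi(Y) = sum p_sigma * log(1/p_sigma)
counts = {}
for x in product((0, 1), repeat=n):
    counts[out(x)] = counts.get(out(x), 0) + 1
probs = {sigma: c / 2 ** n for sigma, c in counts.items()}
H = sum(p * log2(1 / p) for p in probs.values())
print(f"H = {H:.4f} bits")         # small: the checker leaks little per query
```

The computed entropy is far below 1 bit, matching the intuition from the password-checker example: a single "incorrect password" response leaks very little about the secret.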

The past decade has witnessed a multitude of entropy estimation techniques with varying guarantees on the quality of their estimates [9,17,35,58]. The problem of computing the entropy of a distribution represented by a given circuit is closely related to the EntropyDifference problem considered by Goldreich and Vadhan [34], and shown to be SZK-complete. We therefore do not expect to obtain polynomial-time algorithms for this problem. The techniques proposed to compute H(ϕ) exactly compute p<sub>σ</sub> for each σ. Observe that computing p<sub>σ</sub> is equivalent to the problem of model counting, which seeks to compute the number of solutions of a given formula. The exact techniques therefore require O(2<sup>m</sup>) model-counting queries [13,27,39] and thus often do not scale for large values of m. Accordingly, the state of the art often relies on sampling-based techniques that perform well in practice but can only provide lower or upper bounds on the entropy [37,49]. As is often the case, techniques that only guarantee lower or upper bounds can output estimates that are arbitrarily far from the ground truth. This raises the question: *can we design efficient techniques for approximate estimation whose estimates have PAC-style* (ε, δ) *guarantees? That is, can we compute an estimate that is guaranteed to lie within a* (1 ± ε)*-factor of the ground truth for all possible values, with confidence at least* 1 − δ*?*

The primary contribution of our work is the first efficient algorithmic technique (given in our algorithm EntropyEstimation) to estimate H<sub>ϕ</sub>(Y) with PAC-style guarantees for all possible values of H<sub>ϕ</sub>(Y). In particular, given a formula ϕ, EntropyEstimation returns an estimate that is guaranteed to lie within a (1 ± ε)-factor of H<sub>ϕ</sub>(Y) with confidence at least 1 − δ. We stress that we obtain such a multiplicative estimate even when H<sub>ϕ</sub>(Y) is very small, as in the case of a password checker as described above. Furthermore, EntropyEstimation makes only O(min(m, n)/ε<sup>2</sup>) counting and sampling queries, even though the support of the distribution specified by ϕ can be of size O(2<sup>m</sup>).

While the primary focus of this work is theoretical, we seek to demonstrate that our techniques can be translated into practically efficient algorithms. As such, we focused on developing a prototype using off-the-shelf samplers and counters. As a first step, we use GANAK [52] for model counting queries and SPUR [3] for sampling queries. Our empirical analysis demonstrates that EntropyEstimation can be translated into practice and achieves significant speedup over the baseline.

It is worth mentioning that recent approaches in quantified information leakage focus on programs that can be naturally translated to string and SMT constraints, and therefore employ model counters for string and SMT constraints. Since counting and sampling are closely related, we hope the algorithmic improvements attained by EntropyEstimation will spur the development of samplers for SMT and string constraints, and lead to practical implementations of EntropyEstimation in other domains. We stress again that while we present EntropyEstimation for programs modeled as Boolean formulas, our analysis applies to other approaches, such as automata-based approaches, modulo access to the appropriate sampling and counting oracles.

The rest of the paper is organized as follows: we present the notations and preliminaries in Sect. 2. We then discuss related work in Sect. 3. Next, we present an overview of EntropyEstimation including a detailed description of the algorithm and an analysis of its correctness in Sect. 4. We then describe our experimental methodology and discuss our results with respect to the accuracy and scalability of EntropyEstimation in Sect. 5. Finally, we conclude in Sect. 6.

### **2 Preliminaries**

We use lowercase letters (with subscripts) to denote propositional variables and uppercase letters to denote subsets of variables. The formula ∃Y ϕ(X, Y) is existentially quantified in Y, where X = {x<sub>1</sub>, ..., x<sub>n</sub>} and Y = {y<sub>1</sub>, ..., y<sub>m</sub>}. For notational clarity, we use ϕ to refer to ϕ(X, Y) when clear from the context. We denote by Vars(ϕ) the set of variables appearing in ϕ(X, Y). A literal is a Boolean variable or its negation.

A *satisfying assignment* or solution of a formula ϕ is a mapping τ: Vars(ϕ) → {0, 1} on which the formula evaluates to True. For V ⊆ Vars(ϕ), τ<sub>↓V</sub> represents the truth values of the variables in V in a satisfying assignment τ of ϕ. We denote the set of all solutions of ϕ by sol(ϕ). For S ⊆ Vars(ϕ), we define sol(ϕ)<sub>↓S</sub> as the set of solutions of ϕ projected on S.

The problem of *model counting* is to compute |sol(ϕ)| for a given formula ϕ. Projected model counting is defined analogously using sol(ϕ)<sub>↓S</sub> instead of sol(ϕ), for a given projection set<sup>1</sup> S ⊆ Vars(ϕ). A *uniform sampler* outputs a solution y ∈ sol(ϕ) such that Pr[y is output] = 1/|sol(ϕ)|.
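
These three primitives can be illustrated by brute-force, truth-table versions; they are stand-ins for the dedicated counters and samplers used later (e.g., GANAK and SPUR), and all names are ours:

```python
import random
from itertools import product

def solutions(phi, variables):
    """Enumerate sol(phi) by truth table; phi is a predicate over a dict
    assignment. Brute-force stand-in for a model counter."""
    for bits in product((0, 1), repeat=len(variables)):
        tau = dict(zip(variables, bits))
        if phi(tau):
            yield tau

def projected_count(phi, variables, S):
    """|sol(phi) projected on S|: distinct restrictions of solutions to S."""
    return len({tuple(tau[v] for v in S) for tau in solutions(phi, variables)})

def uniform_sample(phi, variables, rng=random):
    """Uniform sampler: each solution with probability 1/|sol(phi)|
    (stand-in for a dedicated uniform sampler)."""
    return rng.choice(list(solutions(phi, variables)))

# phi = (x1 or x2) and (y == x1 and x2) over {x1, x2, y}: 3 solutions
phi = lambda t: (t["x1"] or t["x2"]) and t["y"] == (t["x1"] and t["x2"])
V = ["x1", "x2", "y"]
print(len(list(solutions(phi, V))), projected_count(phi, V, ["x1", "x2"]))  # → 3 3
```

Here the projected count on {x1, x2} coincides with |sol(ϕ)| because each projected assignment extends to exactly one solution, which is precisely the circuit-formula condition defined next.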

We say that ϕ is a circuit formula if for all assignments τ<sub>1</sub>, τ<sub>2</sub> ∈ sol(ϕ), we have τ<sub>1↓X</sub> = τ<sub>2↓X</sub> =⇒ τ<sub>1</sub> = τ<sub>2</sub>. It is worth remarking that if ϕ is a circuit formula, then X is an independent support.

For a circuit formula ϕ(X, Y) and σ: Y → {0, 1}, we define p<sub>σ</sub> = |sol(ϕ(Y → σ))|/|sol(ϕ)<sub>↓X</sub>|. Given a circuit formula ϕ(X, Y), we define the entropy of ϕ, denoted H<sub>ϕ</sub>(Y), as follows: H<sub>ϕ</sub>(Y) = −Σ<sub>σ∈2<sup>Y</sup></sub> p<sub>σ</sub> log(p<sub>σ</sub>).

### **3 Related Work**

Shannon entropy is a fundamental concept in information theory and, as such, has been studied by theoreticians and practitioners alike. While this is, to the best of our knowledge, the first work that provides Probably Approximately Correct (PAC) (ε, δ)-approximation guarantees for all values of the entropy while requiring only logarithmically (in the size of the support of the distribution) many queries, we survey below the prior work relevant to ours.

Goldreich and Vadhan [34] showed that the problem of estimating the entropy for circuit formulas is complete for statistical zero-knowledge. Estimation of the entropy via collision probabilities has been considered in the statistical physics community, but these techniques only provide lower bounds [43,55]. Batu et al. [9] considered entropy estimation in a *black-box* model wherein one is allowed to sample σ ∈ 2<sup>Y</sup> with probability proportional to p<sub>σ</sub>, and p<sub>σ</sub> is revealed along with the sample σ. Batu et al. showed that any algorithm that can estimate the entropy within a factor of 2 in this model must use Ω(2<sup>m/8</sup>) samples. Furthermore, Batu et al. proposed a multiplicative approximation scheme assuming a lower bound on H; precisely, it requires a number of samples that grows linearly with 1/H. Their scheme also gives rise to an additive approximation scheme.

<sup>1</sup> Projection set has been referred to as sampling set in prior work [19,54].

Guha et al. [35] improved Batu et al.'s scheme to obtain (ε, δ) multiplicative estimates using O(m log(1/δ)/(ε<sup>2</sup>H)) samples, matching Batu et al.'s lower bound. Note that this still grows with 1/H.

A more restrictive model has also been considered, wherein we only get access to samples (with the assurance that every σ is sampled with probability proportional to p<sub>σ</sub>). Valiant and Valiant [58] obtained an asymptotically optimal algorithm in this setting, which requires Θ(2<sup>m</sup>/m) samples to obtain an additive approximation. Chakraborty et al. [17] considered the problem in a different setting, in which the algorithm is given the ability to sample σ from a *conditional distribution*: the algorithm is permitted to specify a set S and obtains σ from the distribution conditioned on σ ∈ S. We remark that, as discussed below, our approach makes use of such conditional samples by sampling from a modified formula that conjoins the circuit formula with a formula for membership in S. In any case, Chakraborty et al. use O((1/ε<sup>8</sup>)m<sup>7</sup> log(1/δ)) conditional samples to approximately learn the distribution, and can only provide an additive approximation of the entropy. A helpful survey of all of these different models and algorithms was recently given by Canonne [15].

In this paper, we rely on advances in model counting. Theoretical investigations into model counting were initiated by Valiant in his seminal work that defined the complexity class #P and showed that the problem of model counting is #P-complete. From a practical perspective, the earliest work on model counting [12] focused on improving enumeration-based strategies via partial solutions. Subsequently, Bayardo and Pehoushek [10] observed that if a formula can be partitioned into subsets of clauses, also called components, such that the subsets are over disjoint sets of variables, then the model count of the formula is the product of the model counts of the components. Building on Bayardo and Pehoushek's scheme, Sang et al. [50] showed how conflict-driven clause learning can be combined with component caching, which has been further improved by Thurley [56] and Sharma et al. [52]. Another line of work focuses on compilation-based techniques, wherein the core approach is to compile the input formula into a subset L of negation normal form for which counting is tractable. The past five years have witnessed a surge of interest in the design of projected model counters [6,18,20,42,45,52]. In this paper, we employ GANAK [52], the state-of-the-art projected model counter; an entry based on GANAK won the projected model counting track at the 2020 model counting competition [31].
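
Bayardo and Pehoushek's component observation can be checked on a toy CNF: if two clause sets share no variables, the count of their conjunction equals the product of the component counts. A brute-force sketch (all names ours):

```python
from itertools import product

def count_models(clauses, variables):
    """Brute-force #SAT for a CNF given as a list of clauses; a literal is a
    (variable, polarity) pair. Illustration only, exponential in |variables|."""
    total = 0
    for bits in product((0, 1), repeat=len(variables)):
        tau = dict(zip(variables, bits))
        if all(any(tau[v] == pol for v, pol in clause) for clause in clauses):
            total += 1
    return total

# Two variable-disjoint components: (a or b) and (c or not d)
comp1, vars1 = [[("a", 1), ("b", 1)]], ["a", "b"]
comp2, vars2 = [[("c", 1), ("d", 0)]], ["c", "d"]
whole = count_models(comp1 + comp2, vars1 + vars2)
print(whole, count_models(comp1, vars1) * count_models(comp2, vars2))  # → 9 9
```

Each component has 3 models, and the conjunction over all four variables has 3 × 3 = 9; component-based counters exploit exactly this factorization to avoid enumerating the joint space.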

Another crucial ingredient for our technique is access to an efficient sampler. Counting and sampling are closely related problems, and therefore, the development of efficient counters spurred the research on the development of samplers. In a remarkable result, Huang and Darwiche [36] showed that the traces of model counters are in d-DNNF (deterministic Decomposable Negation Normal Form [25]), which was observed to support sampling in polynomial time [51]. Achlioptas, Hammoudeh, and Theodoropoulos [3] observed that one can improve the space efficiency by performing an on-the-fly traversal of the underlying trace of a model counter such as SharpSAT [56].

Our work builds on a long line of work in the QIF community that identified a close relationship between quantified information flow and model counting [4,5,27,33,38,59]. There are also many symbolic-execution-based approaches for QIF that require a number of model counting calls linear in the size of the observable domain, that is, exponential in the number of bits representing the domain [8,46]. Another closely related line of work concerns the use of model counting in side-channel analysis [28,29,33]. Similarly, there exist sampling-based approaches for black-box leakage estimation that either require too many samples to converge (much larger than the product of the sizes of the input and output domains [23]) or use ML-based approaches that predict the error of the ideal classifier for predicting secrets given observables [22]. However, these approaches cannot provide PAC guarantees on the estimation. While we focus on the case where the behavior of a program can be modeled with a Boolean formula ϕ, the underlying technique is general and can be extended to cases where programs (and their abstractions) are modeled using automata [4,5,14].

Before concluding our discussion of prior work, we remark that Köpf and Rybalchenko [41] used Batu et al.'s [9] lower bounds to conclude that their scheme could not be improved without using structural properties of the program. In this context, our paper continues the direction alluded to by Köpf and Rybalchenko and designs the first efficient multiplicative approximation scheme by utilizing white-box access to the program.

# **4 EntropyEstimation: Efficient Estimation of H(ϕ)**

In this section, we focus on the primary technical contribution of our work: an algorithm, called EntropyEstimation, that takes a circuit formula ϕ(X, Y ) and returns an (ε, δ)-estimate of H(ϕ). We first give a technical overview of the design of EntropyEstimation in Sect. 4.1, then describe the algorithm in detail, and finally present the accompanying analysis of its correctness and complexity.

#### **4.1 Technical Overview**

At a high level, EntropyEstimation uses a median-of-means estimator: we first estimate H(ϕ) to within a (1±ε)-factor with probability at least 5/6 by computing the mean of the underlying estimator, and then take the median of many such estimates to boost the probability of correctness to 1 − δ.
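As a toy illustration of this median-of-means scheme (our own sketch, not the paper's implementation; `draw_sample`, the biased-coin distribution, and all constants are illustrative):

```python
import math
import random
import statistics

def median_of_means(draw_sample, t, T, rng):
    """Median of T batch means, each batch averaging t i.i.d. samples."""
    batch_means = []
    for _ in range(T):
        batch_means.append(sum(draw_sample(rng) for _ in range(t)) / t)
    # Each batch mean lies within (1 +/- eps)E with probability >= 5/6
    # (Chebyshev); the median fails only if half the batches fail (Hoeffding).
    return statistics.median(batch_means)

# Toy self-information estimator: a biased coin with Pr[1] = 0.25,
# so the expectation is the Shannon entropy H(0.25) ~ 0.8113 bits.
p = {0: 0.75, 1: 0.25}
def draw_sample(rng):
    sigma = 1 if rng.random() < 0.25 else 0
    return math.log2(1 / p[sigma])

rng = random.Random(0)
est = median_of_means(draw_sample, t=2000, T=9, rng=rng)
```

With these (arbitrary) batch sizes the estimate lands close to the true entropy of the coin.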

Let us consider a random variable $S$ over the domain $\operatorname{sol}(\varphi)_{\downarrow Y}$ such that $\Pr[S = \sigma] = p_\sigma$ for $\sigma \in \operatorname{sol}(\varphi)_{\downarrow Y}$, and consider the self-information function $g : \operatorname{sol}(\varphi)_{\downarrow Y} \to [0,\infty)$ given by $g(\sigma) = \log\frac{1}{p_\sigma}$. Observe that the entropy $H(\varphi) = \mathsf{E}[g(S)]$. Therefore, a simple estimator would be to sample $S$ using our oracle and then estimate the expectation of $g(S)$ by a sample mean. At this point, we observe that, given access to a uniform sampler, UnifSample, we can simply first sample $\tau \in \operatorname{sol}(\varphi)$ uniformly at random and then set $S = \tau_{\downarrow Y}$, which gives $\Pr[S = \tau_{\downarrow Y}] = p_{\tau_{\downarrow Y}}$. Furthermore, observe that $g(\sigma)$ can be computed via a query to a model counter. In their seminal work, Batu et al. [9] observed that the variance of $g(S)$, denoted by $\mathsf{variance}[g(S)]$, can be at most $m^2$. The required number of sample queries, based on a straightforward analysis, would be

$$\Theta\left(\frac{\mathsf{variance}[g(S)]}{\varepsilon^2 \cdot (\mathsf{E}[g(S)])^2}\right) = \Theta\left(\frac{\sum_{\sigma} p_\sigma \log^2 \frac{1}{p_\sigma}}{\varepsilon^2\left(\sum_{\sigma} p_\sigma \log \frac{1}{p_\sigma}\right)^2}\right).$$

However, $\mathsf{E}[g(S)] = H(\varphi)$ can be arbitrarily close to 0, and therefore this does not provide a reasonable upper bound on the required number of samples.
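To make the identity H(ϕ) = E[g(S)] concrete, here is a small self-contained check (our own toy distribution, not one of the paper's benchmarks) that the sample mean of the self-information converges to the entropy:

```python
import math
import random

# Toy output distribution p_sigma over four output valuations (sums to 1)
p = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}

# Exact entropy: H = sum_sigma p_sigma * log2(1/p_sigma) = 1.75 bits
H = sum(q * math.log2(1 / q) for q in p.values())

# Naive estimator: sample sigma ~ p, average g(sigma) = log2(1/p_sigma)
rng = random.Random(1)
outputs, weights = zip(*sorted(p.items()))
samples = rng.choices(outputs, weights=weights, k=50_000)
est = sum(math.log2(1 / p[s]) for s in samples) / len(samples)
```

The catch, as noted above, is that the number of samples needed for a multiplicative guarantee blows up as H approaches 0, which is what the case split below repairs.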

To address the lack of a lower bound on $H(\varphi)$, we observe that for ϕ to have $H(\varphi) < 1$, there must exist $\sigma_{high} \in \operatorname{sol}(\varphi)_{\downarrow Y}$ such that $p_{\sigma_{high}} > \frac{1}{2}$. We then observe that, given access to a sampler and a counter, we can identify such a $\sigma_{high}$ with high probability, thereby allowing us to consider the two cases separately: (A) $H(\varphi) > 1$ and (B) $H(\varphi) < 1$. Now, for case (A), we could use Batu et al.'s bound on $\mathsf{variance}[g(S)]$ [9] and obtain an estimator that would require $\Theta\left(\frac{\mathsf{variance}[g(S)]}{\varepsilon^2 \cdot (\mathsf{E}[g(S)])^2}\right)$ sampling and counting queries. It is worth remarking that the bound $\mathsf{variance}[g(S)] \le m^2$ is indeed tight, as a uniform distribution over $\operatorname{sol}(\varphi)_{\downarrow X}$ would achieve the bound. Therefore, we instead focus on the expression $\frac{\mathsf{variance}[g(S)]}{(\mathsf{E}[g(S)])^2}$ and prove that when $\mathsf{E}[g(S)] = H(\varphi) > h$, we can upper bound $\frac{\mathsf{variance}[g(S)]}{(\mathsf{E}[g(S)])^2}$ by $(1+o(1))\cdot\frac{m}{h}$, so that $(1+o(1))\cdot\frac{m}{h\cdot\varepsilon^2}$ samples suffice, thereby reducing the complexity from $m^2$ to $m$ (observe that here $H(\varphi) > 1$, that is, we can take $h = 1$).

Now, we return to case (B), wherein we have identified $\sigma_{high} \in \operatorname{sol}(\varphi)_{\downarrow Y}$ with $p_{\sigma_{high}} > \frac{1}{2}$. Let $r = p_{\sigma_{high}}$ and $H_{rem} = \sum_{\sigma\in \operatorname{sol}(\varphi)_{\downarrow Y} \setminus \{\sigma_{high}\}} p_\sigma \log\frac{1}{p_\sigma}$. Note that $H(\varphi) = r \log\frac{1}{r} + H_{rem}$. Therefore, we focus on estimating $H_{rem}$. To this end, we define a random variable $T$ that takes values in $\operatorname{sol}(\varphi)_{\downarrow Y} \setminus \{\sigma_{high}\}$ such that $\Pr[T = \sigma] = \frac{p_\sigma}{1-r}$. Using the function $g$ defined above, we have $H_{rem} = (1-r)\cdot\mathsf{E}[g(T)]$. Again, we have two cases, depending on whether $H_{rem} \ge 1$ or not; if it is, then we can bound the ratio $\frac{\mathsf{variance}[g(T)]}{(\mathsf{E}[g(T)])^2}$ similarly to case (A). If not, we observe that the denominator is at least 1 for $r \ge 1/2$, and when $H_{rem}$ is so small, we can upper bound the numerator by $(1+o(1))\cdot m$, giving overall $\frac{\mathsf{variance}[g(T)]}{(\mathsf{E}[g(T)])^2} \le (1+o(1))\cdot m$, so that $(1+o(1))\cdot\frac{m}{\varepsilon^2}$ samples again suffice. We can thus estimate $H_{rem}$ using the median-of-means estimator.

#### **4.2 Algorithm Description**

Algorithm 1 presents the proposed algorithmic framework EntropyEstimation. EntropyEstimation takes a formula ϕ(X, Y ), a tolerance parameter ε, and a confidence parameter δ as input, and returns an estimate ĥ of the entropy H<sub>ϕ</sub>(Y ) that is guaranteed to lie within a (1±ε)-factor of H<sub>ϕ</sub>(Y ) with confidence at least 1 − δ. Algorithm 1 assumes access to the following subroutines: a projected model counter, ComputeCount, and a uniform sampler, UnifSample.


#### **Algorithm 1.** EntropyEstimation(ϕ(X, Y ), ε, δ)

1: m ← |Y |; n ← |X|
2: z ← ComputeCount(ϕ(X, Y ), X)
3: **for** i ∈ [1, ⌈log(10/δ)⌉] **do**
4:   τ ← UnifSample(ϕ)
5:   r ← z⁻¹ · ComputeCount(ϕ(X, Y ) ∧ (Y ↔ τ↓Y ), X)
6:   **if** r > 1/2 **then**
7:     ϕ̂ ← ϕ ∧ ¬(Y ↔ τ↓Y )
8:     t ← (6/ε²) · min{ n/(2 log(1/(1−r))), m + log(m + log m + 2.5) }
9:     ĥrem ← SampleEst(ϕ̂, z, t, 0.9·δ)
10:    ĥ ← (1−r)·ĥrem + r·log(1/r)
11:    **return** ĥ
12: t ← (6/ε²) · (min{ n, m + log(m + log m + 1.1) } − 1)
13: ĥ ← SampleEst(ϕ, z, t, 0.9·δ)
14: **return** ĥ

**Algorithm 2.** SampleEst(ϕ, z, t, δ)

1: C ← [ ]
2: T ← ⌈(9/2) · log(2/δ)⌉
3: **for** i ∈ [1, T] **do**
4:   est ← 0
5:   **for** j ∈ [1, t] **do**
6:     τ ← UnifSample(ϕ)
7:     r ← z⁻¹ · ComputeCount(ϕ(X, Y ) ∧ (Y ↔ τ↓Y ), X)
8:     est ← est + log(1/r)
9:   C.Append(est/t)
10: **return** Median(C)

SampleEst: Algorithm 2 presents the subroutine SampleEst, which also assumes access to the ComputeCount and UnifSample subroutines. SampleEst takes as input a formula ϕ(X, Y ); the projected model count z of ϕ(X, Y ) over X; the number of required samples, t; and a confidence parameter δ, and returns a median-of-means estimate of the entropy. Algorithm 2 starts off by computing the value of T, the number of repetitions required to ensure confidence at least 1 − δ for the estimate. The algorithm has two loops: an outer loop (lines 3–9) and an inner loop (lines 5–8). The outer loop runs for ⌈(9/2) log(2/δ)⌉ rounds, where in each round Algorithm 2 appends a mean estimate, *est*/t, to the list C. In each round of the inner loop, Algorithm 2 updates the value of *est*: line 6 draws a sample τ using the UnifSample(ϕ(X, Y )) subroutine. At line 7, the value of r is computed as the ratio of the projected model count of ϕ(X, Y ) ∧ (Y ↔ τ↓Y ) over X to z; to compute the projected model count, Algorithm 2 calls the subroutine ComputeCount on input (ϕ(X, Y ) ∧ (Y ↔ τ↓Y ), X). At line 8, *est* is incremented by log(1/r), and at line 9 the mean *est*/t is appended to C. Finally, at line 10, Algorithm 2 returns the median of C.

Returning to Algorithm 1: it starts by computing, at line 2, the value of z as the projected model count of ϕ(X, Y ) over X, via the ComputeCount subroutine. Next, Algorithm 1 attempts to determine whether there exists an output with probability greater than 1/2 by iterating over lines 3–11 for ⌈log(10/δ)⌉ rounds. Line 4 draws a sample τ by calling the UnifSample(ϕ(X, Y )) subroutine. Line 5 computes the value of r as the ratio of the projected model count of ϕ(X, Y ) ∧ (Y ↔ τ↓Y ) to z. Line 6 checks whether r is greater than 1/2, and chooses one of two paths accordingly: if r > 1/2, lines 7–11 estimate the remaining entropy on the formula conditioned to exclude τ↓Y and return (1−r)·ĥrem + r·log(1/r); otherwise, once the loop ends, lines 12–14 estimate the entropy directly via SampleEst.
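Putting the two procedures together, a minimal end-to-end sketch looks as follows. This is our illustration only: the toy circuit (Y = X mod 3), the brute-force `compute_count` and `unif_sample` stand-ins for GANAK-style projected counting and SPUR-style uniform sampling, and all helper names are ours, not the paper's implementation.

```python
import math
import random
from statistics import median

n, m = 6, 2                    # |X| = 6 input bits, |Y| = 2 output bits
f = lambda x: x % 3            # toy circuit: output Y is X mod 3

def compute_count(y=None):
    """Brute-force stand-in for a projected model counter over X,
    optionally conditioned on a fixed output valuation y."""
    return sum(1 for x in range(2 ** n) if y is None or f(x) == y)

def unif_sample(rng):
    """Stand-in for a uniform sampler: x uniform, y fixed by the circuit."""
    return f(rng.randrange(2 ** n))

def sample_est(z, t, T, rng, exclude=None):
    """Algorithm 2 (SampleEst): median of T means of t self-information terms."""
    means = []
    for _ in range(T):
        est, drawn = 0.0, 0
        while drawn < t:
            y = unif_sample(rng)
            if y == exclude:   # condition on Y != sigma_high in case (B)
                continue
            est += math.log2(z / compute_count(y))
            drawn += 1
        means.append(est / t)
    return median(means)

def entropy_estimation(eps, delta, rng):
    """Algorithm 1 (EntropyEstimation), logs taken base 2."""
    z = compute_count()
    T = math.ceil(4.5 * math.log2(2 / (0.9 * delta)))
    for _ in range(math.ceil(math.log2(10 / delta))):
        y = unif_sample(rng)
        r = compute_count(y) / z
        if r > 0.5:                              # heavy output found
            t = math.ceil(6 / eps ** 2 *
                          min(n / (2 * math.log2(1 / (1 - r))),
                              m + math.log2(m + math.log2(m) + 2.5)))
            h_rem = sample_est(z, t, T, rng, exclude=y)
            return (1 - r) * h_rem + r * math.log2(1 / r)
    t = math.ceil(6 / eps ** 2 *                 # no heavy output
                  (min(n, m + math.log2(m + math.log2(m) + 1.1)) - 1))
    return sample_est(z, t, T, rng)

rng = random.Random(0)
h_hat = entropy_estimation(eps=0.2, delta=0.1, rng=rng)
z = compute_count()
h_true = sum((compute_count(y) / z) * math.log2(z / compute_count(y))
             for y in {f(x) for x in range(2 ** n)})   # ~ log2(3) bits
```

On this toy instance no output has probability above 1/2, so the heavy-output branch never fires and the estimate comes from the final call to `sample_est`.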


#### **4.3 Theoretical Analysis**

**Theorem 1.** *Given a circuit formula* ϕ *with* |Y | ≥ 2*, a tolerance parameter* ε > 0*, and confidence parameter* δ > 0*, the algorithm* EntropyEstimation *returns* hˆ *such that*

$$\Pr\left[ (1 - \varepsilon) H_{\varphi}(Y) \le \hat{h} \le (1 + \varepsilon) H_{\varphi}(Y) \right] \ge 1 - \delta$$

We first analyze the median-of-means estimator computed by SampleEst.

**Lemma 1.** *Given a circuit formula* <sup>ϕ</sup> *and* <sup>z</sup> <sup>∈</sup> <sup>N</sup>*, an accuracy parameter* ε > <sup>0</sup>*, a confidence parameter* δ > <sup>0</sup>*, and a batch size* <sup>t</sup> <sup>∈</sup> <sup>N</sup> *for which*

$$\frac{1}{t\epsilon^{2}} \cdot \left(\frac{\sum_{\sigma \in 2^{Y}} \frac{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}{|\operatorname{sol}(\varphi)_{\downarrow X}|} \left(\log \frac{z}{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}\right)^{2}}{\left(\sum_{\sigma \in 2^{Y}} \frac{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}{|\operatorname{sol}(\varphi)_{\downarrow X}|} \log \frac{z}{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}\right)^{2}} - 1\right) \le 1/6$$

*the algorithm* SampleEst *returns an estimate* <sup>h</sup><sup>ˆ</sup> *such that with probability* <sup>1</sup> <sup>−</sup> <sup>δ</sup>*,*

$$\begin{aligned} \hat{h} &\leq (1+\epsilon) \sum_{\sigma \in 2^Y} \frac{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}{|\operatorname{sol}(\varphi)_{\downarrow X}|} \log \frac{z}{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|} \ \text{and} \\ \hat{h} &\geq (1-\epsilon) \sum_{\sigma \in 2^Y} \frac{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}{|\operatorname{sol}(\varphi)_{\downarrow X}|} \log \frac{z}{|\operatorname{sol}(\varphi(Y \mapsto \sigma))|}. \end{aligned}$$

*Proof.* Let $R_{ij}$ be the random value taken by r in the $i$th iteration of the outer loop and $j$th iteration of the inner loop. We observe that $\{R_{ij}\}_{(i,j)}$ is a family of i.i.d. random variables. Let $C_i = \frac{1}{t}\sum_{j=1}^{t} \log\frac{1}{R_{ij}}$ be the value appended to C at the end of the $i$th iteration of the outer loop. Clearly $\mathsf{E}[C_i] = \mathsf{E}[\log\frac{1}{R_{ij}}]$. Furthermore, we observe that by independence of the $R_{ij}$,

$$\mathsf{variance}[C_i] = \frac{1}{t}\,\mathsf{variance}\!\left[\log\frac{1}{R_{ij}}\right] = \frac{1}{t}\left(\mathsf{E}\!\left[(\log R_{ij})^2\right] - \mathsf{E}\!\left[\log\frac{1}{R_{ij}}\right]^2\right).$$

Now, by Chebyshev's inequality,

$$\begin{split} \Pr\left[|C\_i - \mathsf{E}[\log \frac{1}{R\_{ij}}]| > \epsilon \mathsf{E}[\log \frac{1}{R\_{ij}}] \right] &< \frac{\mathsf{variance}[C\_i]}{\epsilon^2 \mathsf{E}[\log \frac{1}{R\_{ij}}]^2} \\ &= \frac{\mathsf{E}[(\log R\_{ij})^2] - \mathsf{E}[\log \frac{1}{R\_{ij}}]^2}{t \cdot \epsilon^2 \mathsf{E}[\log \frac{1}{R\_{ij}}]^2} \\ &\leq 1/6 \end{split}$$

by our assumption on t.

Let $L_i \in \{0, 1\}$ be the indicator random variable for the event that $C_i < \mathsf{E}[\log \frac{1}{R_{ij}}] - \epsilon\,\mathsf{E}[\log \frac{1}{R_{ij}}]$, and let $H_i \in \{0, 1\}$ be the indicator random variable for the event that $C_i > \mathsf{E}[\log \frac{1}{R_{ij}}] + \epsilon\,\mathsf{E}[\log \frac{1}{R_{ij}}]$. Since these are disjoint events, $B_i = L_i + H_i$ is also an indicator random variable, for their union. So long as $\sum_{i=1}^{T} L_i < T/2$ and $\sum_{i=1}^{T} H_i < T/2$, the value returned by SampleEst (the median of C) is as desired. By the above calculation, $\Pr[L_i = 1] + \Pr[H_i = 1] = \Pr[B_i = 1] < 1/6$, and we note that $\{(B_i, L_i, H_i)\}_i$ is a family of i.i.d. random variables. Observe that by Hoeffding's inequality,

$$\Pr\left[\sum\_{i=1}^{T} L\_i \ge \frac{T}{6} + \frac{T}{3}\right] \le \exp(-2T\frac{1}{9}) = \frac{\delta}{2}$$

and similarly $\Pr\left[\sum_{i=1}^{T} H_i \ge \frac{T}{2}\right] \le \frac{\delta}{2}$. Therefore, by a union bound, the returned value is adequate with probability at least 1 − δ overall.

The analysis of SampleEst relied on a bound on the ratio of the first and second "moments" of the self-information in our truncated distribution. Suppose that for all assignments σ to Y , $p_\sigma \le 1/2$. We observe that then $H_\varphi(Y) \ge \sum_{\sigma\in 2^Y} p_\sigma \cdot 1 = 1$. We also observe that, on account of the uniform distribution on X, any σ in the support of the distribution has $p_\sigma \ge 1/2^{|X|}$. Such bounds allow us to bound the relative variance of the self-information:

**Lemma 2.** *Let* $\{p_\sigma \in [1/2^{|X|}, 1]\}_{\sigma\in 2^Y}$ *be given. Then,*

$$\sum\_{\sigma \in 2^Y} p\_\sigma (\log p\_\sigma)^2 \le |X| \sum\_{\sigma \in 2^Y} p\_\sigma \log \frac{1}{p\_\sigma}$$

*Proof.* We observe simply that

$$\sum\_{\sigma \in 2^Y} p\_\sigma (\log p\_\sigma)^2 \le \log 2^{|X|} \sum\_{\sigma \in 2^Y} p\_\sigma \log \frac{1}{p\_\sigma} = |X| \sum\_{\sigma \in 2^Y} p\_\sigma \log \frac{1}{p\_\sigma}.$$

**Lemma 3.** *Let* $\{p_\sigma \in [0, 1]\}_{\sigma\in 2^Y}$ *be given with* $\sum_{\sigma\in 2^Y} p_\sigma \le 1$ *and*

$$H = \sum\_{\sigma \in 2^Y} p\_{\sigma} \log \frac{1}{p\_{\sigma}} \ge 1.$$

*Then*

$$\frac{\sum\_{\sigma \in 2^Y} p\_\sigma (\log p\_\sigma)^2}{\left(\sum\_{\sigma \in 2^Y} p\_\sigma \log \frac{1}{p\_\sigma}\right)^2} \le \left(1 + \frac{\log(|Y| + \log|Y| + 1.1)}{|Y|}\right)|Y|.$$

*Similarly, if* H ≤ 1 *and* |Y | ≥ 2*,*

$$\sum\_{\sigma \in 2^Y} p\_\sigma(\log p\_\sigma)^2 \le |Y| + \log(|Y| + \log|Y| + 2.5).$$

Concretely, both cases give a bound that is at most 2|Y | for |Y | ≥ 3; |Y | = 8 gives a bound less than 1.5·|Y | in both cases, |Y | = 64 gives a bound less than 1.1·|Y |, and so on.

*Proof.* By induction on the size of the support, $\mathrm{supp} = \{\sigma \in 2^Y \mid p_\sigma > 0\}$, we'll show that when $H \ge 1$, the ratio is at most $\log|\mathrm{supp}| + \log(\log|\mathrm{supp}| + \log\log|\mathrm{supp}| + 1.1)$. The base case is when there are only two elements ($|Y| = 1$), in which case we must have $p_0 = p_1 = 1/2$, and the ratio is uniquely determined to be 1. For the induction step, observe that whenever any subset of the $p_\sigma$ take value 0, this is equivalent to a distribution with smaller support, for which, by the induction hypothesis, the ratio is at most

$$\begin{split} \log(|\text{supp}| - 1) + \log(\log(|\text{supp}| - 1) + \log\log(|\text{supp}| - 1) + 1.1) \\ < \log|\text{supp}| + \log(\log|\text{supp}| + \log\log|\text{supp}| + 1.1). \end{split}$$

Consider any value of $H_\varphi(Y) = H$. With the entropy fixed, we need only maximize the numerator of the ratio subject to $H_\varphi(Y) = H$. Indeed, we've already ruled out a ratio of $|\mathrm{supp}|$ for solutions in which any of the $p_\sigma$ take value 0, and clearly we cannot have any $p_\sigma = 1$, so we only need to consider interior points that are local optima. We use the method of Lagrange multipliers: for some λ, all $p_\sigma$ must satisfy $\log^2 p_\sigma + 2\log p_\sigma - \lambda(\log p_\sigma + 1) = 0$, which has solutions

$$\log p_{\sigma} = \frac{\lambda}{2} - 1 \pm \sqrt{\left(1 - \frac{\lambda}{2}\right)^2 + \lambda} = \frac{\lambda}{2} - 1 \pm \sqrt{1 + \lambda^2/4}.$$

We note that the second derivatives with respect to $p_\sigma$ are equal to $\frac{2\log p_\sigma}{p_\sigma} + \frac{2-\lambda}{p_\sigma}$, which are negative iff $\log p_\sigma < \frac{\lambda}{2} - 1$; hence we attain local maxima only for the solution $\log p_\sigma = \frac{\lambda}{2} - 1 - \sqrt{1 + \lambda^2/4}$. Thus, there is a single value $p_\sigma$, which, by the entropy constraint, must satisfy $|\mathrm{supp}|\, p_\sigma \log\frac{1}{p_\sigma} = H$, which we'll show gives

$$p\_{\sigma} = \frac{H}{|\text{supp}|(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho)}$$

for some $\rho \le 1.1$. For $|\mathrm{supp}| = 3$, we know $1 \le H \le \log 3$, and we can verify numerically that $\frac{\log(\log\frac{3}{H} + \log\log\frac{3}{H} + \rho)}{\log\frac{3}{H}} \in (0.42, 0.72)$ for $\rho \in [0, 1]$. Hence, by Brouwer's fixed point theorem, such a choice of $\rho \in [0, 1]$ exists. For $|\mathrm{supp}| \ge 4$, observe that $\frac{|\mathrm{supp}|}{H} \ge 2$, so $\frac{\log(\log\frac{|\mathrm{supp}|}{H} + \log\log\frac{|\mathrm{supp}|}{H})}{\log\frac{|\mathrm{supp}|}{H}} > 0$. For $|\mathrm{supp}| = 4$, $\frac{\log(\log\frac{4}{H} + \log\log\frac{4}{H} + \rho)}{\log\frac{4}{H}} \in [0, 1]$, and similarly for all integer values of $|\mathrm{supp}|$ up to 15, $\frac{\log(\log\frac{|\mathrm{supp}|}{H} + \log\log\frac{|\mathrm{supp}|}{H} + 1.1)}{\log\frac{|\mathrm{supp}|}{H}} < 1.1$, so we can obtain $\rho \in (0, 1.1)$. Finally, for $|\mathrm{supp}| \ge 16$, we have $\frac{|\mathrm{supp}|}{H} \le 2^{|\mathrm{supp}|/2}H$, and hence $\frac{\log(\log\frac{|\mathrm{supp}|}{H} + \log\log\frac{|\mathrm{supp}|}{H} + \rho)}{\log\frac{|\mathrm{supp}|}{H}} \le 1$, so

$$\begin{aligned} |\text{supp}| \frac{H(\log \frac{|\text{supp}|}{H} + \log(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho))}{|\text{supp}|(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho)} \\ \leq H \frac{\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + 1}{\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho} \end{aligned}$$

Hence it is clear that this gives H for some ρ ≤ 1. Observe that for such a choice of ρ, using the substitution above, the ratio we attain is

$$\begin{split} \frac{|\text{supp}| \cdot H}{H^2 \cdot |\text{supp}|(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho)} & \left( \log \frac{|\text{supp}|(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho)}{H} \right)^2 \\ &= \frac{1}{H} (\log \frac{|\text{supp}|}{H} + \log(\log \frac{|\text{supp}|}{H} + \log \log \frac{|\text{supp}|}{H} + \rho)) \end{split}$$

which is monotone in 1/H, so using the fact that H ≥ 1, we find it is at most

$$
\log|\text{supp}| + \log(\log|\text{supp}| + \log\log|\text{supp}| + \rho)
$$

which, recalling ρ < 1.1, gives the claimed bound.

For the second part, observe that by the same considerations, for fixed H,

$$\sum\_{\sigma \in 2^Y} p\_\sigma (\log p\_\sigma)^2 = H \log \frac{1}{p\_\sigma}$$

for the unique choice of <sup>p</sup>σ for <sup>|</sup><sup>Y</sup> <sup>|</sup> and <sup>H</sup> as above, i.e., we will show that for |Y | ≥ 2, it is

$$H\left(\log\frac{2^{|Y|}}{H} + \log(\log\frac{2^{|Y|}}{H} + \log\log\frac{2^{|Y|}}{H} + \rho)\right)$$

for some ρ ∈ (0, 2.5). Indeed, we again consider the function

$$f(\rho) = \frac{\log(\log \frac{2^{|Y|}}{H} + \log \log \frac{2^{|Y|}}{H} + \rho)}{\log \log \frac{2^{|Y|}}{H}},$$

and observe that for $2^{|Y|}/H > 2$, $f(0) > 0$. Now, when $|Y| \ge 2$ and $H \le 1$, $2^{|Y|}/H \ge 4$. We will see that the function $d(\rho) = f(\rho) - \rho$ has no critical points for $2^{|Y|}/H \ge 4$ and $\rho > 0$, and hence its maximum is attained at the boundary, i.e., at $\frac{2^{|Y|}}{H} = 4$, at which point we see that $f(2.5) < 2.5$. So, for such values of $\frac{2^{|Y|}}{H}$, $f$ maps $[0, 2.5]$ into $[0, 2.5]$, and hence by Brouwer's fixed point theorem again, for all $|Y| \ge 2$ and $H \le 1$ some $\rho \in (0, 2.5)$ exists for which $\log\frac{1}{p_\sigma} = \log\frac{2^{|Y|}}{H} + \log(\log\frac{2^{|Y|}}{H} + \log\log\frac{2^{|Y|}}{H} + \rho)$ gives $\sum_{\sigma\in 2^Y} p_\sigma \log\frac{1}{p_\sigma} = H$. Indeed, $d'(\rho) = \frac{1}{\ln 2\,(\log\frac{2^{|Y|}}{H} + \log\log\frac{2^{|Y|}}{H} + \rho)\,\log\log\frac{2^{|Y|}}{H}} - 1$, which has a singularity at $\rho = -\log\frac{2^{|Y|}}{H} - \log\log\frac{2^{|Y|}}{H}$, and otherwise has a critical point at $\rho = \frac{1}{\ln 2\,\log\log\frac{2^{|Y|}}{H}} - \log\frac{2^{|Y|}}{H} - \log\log\frac{2^{|Y|}}{H}$. Since $\log\frac{2^{|Y|}}{H} \ge 2$ and $\log\log\frac{2^{|Y|}}{H} \ge 1$ here, these are both clearly negative.

Now, we'll show that this expression (for $|Y| \ge 2$) is maximized when $H = 1$. Observe first that the expression $H(|Y| + \log\frac{1}{H})$, as a function of H, has no critical points for $H \le 1$: the derivative is $|Y| + \log\frac{1}{H} - \frac{1}{\ln 2}$, so a critical point requires $H = 2^{|Y| - (1/\ln 2)} > 1$. Hence we see that this expression is maximized at the boundary, when $H = 1$. Similarly, the rest of the expression,

$$H\log\left(|Y| + \log\frac{1}{H} + \log\left(|Y| + \log\frac{1}{H}\right) + 2.5\right)$$

viewed as a function of H, only has critical points for

$$\log(|Y| + \log\frac{1}{H} + \log(|Y| + \log\frac{1}{H}) + 2.5) = \frac{\frac{1}{\ln 2}(1 + \frac{1}{|Y| + \log\frac{1}{H}})}{|Y| + \log\frac{1}{H} + \log(|Y| + \log\frac{1}{H}) + 2.5}$$

i.e., it requires

$$\left( |Y| + \log \frac{1}{H} + \log\left(|Y| + \log \frac{1}{H}\right) + 2.5 \right) \log\left(|Y| + \log \frac{1}{H} + \log\left(|Y| + \log \frac{1}{H}\right) + 2.5 \right) = \frac{1}{\ln 2} \left(1 + \frac{1}{|Y| + \log \frac{1}{H}}\right).$$

But the right-hand side is at most $\frac{3}{2\ln 2} < 3$, while the left-hand side is at least 13. Thus, it also has no critical points, and its maximum is likewise attained at the boundary, $H = 1$. Thus, overall, when $H \le 1$ and $|Y| \ge 2$ we find

$$\sum\_{\sigma \in 2^Y} p\_\sigma(\log p\_\sigma)^2 \le |Y| + \log(|Y| + \log|Y| + 2.5).$$

Although the assignment of probability mass used in the bound did not sum to 1, this bound is nevertheless nearly tight. For any $\gamma > 0$, letting $H = 1 + \Delta$ where $\Delta = \frac{1}{\log^\gamma(2^{|Y|}-2)}$, the following solution attains a ratio of $(1-o(1))|Y|^{1-\gamma}$: for any two $\sigma^*_1, \sigma^*_2 \in 2^Y$, set $p_{\sigma^*_i} = \frac{1-\epsilon}{2}$, and set the rest to $\frac{\epsilon}{2^{|Y|}-2}$, for $\epsilon$ chosen below. To obtain

$$\begin{aligned} H &= 2 \cdot (\frac{1}{2} - \frac{\epsilon}{2}) \log \frac{2}{1 - \epsilon} + (2^{|Y|} - 2) \cdot \frac{\epsilon}{2^{|Y|} - 2} \log \frac{2^{|Y|} - 2}{\epsilon} \\ &= (1 - \epsilon)(1 + \log(1 + \frac{\epsilon}{1 - \epsilon})) + \epsilon \log \frac{2^{|Y|} - 2}{\epsilon} \end{aligned}$$

observe that since $\log(1 + x) = \frac{x}{\ln 2} + \Theta(x^2)$, we will need to take

$$\begin{split} \epsilon &= \frac{\Delta}{\log(2^{|Y|}-2) + \log\frac{1-\epsilon}{\epsilon} - (1+\frac{1}{\ln 2}) + \Theta(\epsilon^2)} \\ &= \frac{\Delta}{\log(2^{|Y|}-2) + \log\log(2^{|Y|}-2) + \log\frac{1}{\Delta} - (1+\frac{1}{\ln 2}) - \frac{\epsilon}{\ln 2} + \Theta(\epsilon^2)} . \end{split}$$

For such a choice, we indeed obtain the ratio

$$\frac{(1-\epsilon)\log^2\frac{2}{1-\epsilon}+\epsilon\log^2\frac{(2^{|Y|}-2)}{\epsilon}}{H^2} \ge (1-o(1))|Y|^{1-\gamma}.$$

Using these bounds, we are finally ready to prove Theorem 1:

*Proof.* We first consider the case where no $\sigma \in \operatorname{sol}(\varphi)_{\downarrow Y}$ has $p_\sigma > 1/2$; here, the condition on line 6 of EntropyEstimation never passes, so we return the value obtained by SampleEst on line 13. Note that we must have $H_\varphi(Y) \ge 1$ in this case. So, by Lemma 3,

$$\frac{\sum\_{\sigma \in 2^Y} p\_\sigma (\log p\_\sigma)^2}{\left(\sum\_{\sigma \in 2^Y} p\_\sigma \log \frac{1}{p\_\sigma}\right)^2} \le \min\left\{ |X|, \left(1 + \frac{\log(|Y| + \log|Y| + 1.1)}{|Y|}\right) |Y| \right\}$$

and hence, by Lemma 1, $t \ge \frac{6 \cdot (\min\{|X|,\; |Y| + \log(|Y| + \log|Y| + 1.1)\} - 1)}{\varepsilon^2}$ suffices to ensure that the returned $\hat{h}$ is satisfactory with probability $1 - \delta$.

Next, we consider the case where some $\sigma^* \in \operatorname{sol}(\varphi)_{\downarrow Y}$ has $p_{\sigma^*} > 1/2$. Since the total probability is 1, there can be at most one such $\sigma^*$. So, in the distribution conditioned on $\sigma \ne \sigma^*$, i.e., $\{p'_\sigma\}_{\sigma\in 2^Y}$ that sets $p'_{\sigma^*} = 0$ and $p'_\sigma = \frac{p_\sigma}{1 - p_{\sigma^*}}$ otherwise, we now need to show that t satisfies

$$\frac{1}{t\varepsilon^2} \left( \frac{\sum_{\sigma \neq \sigma^*} p'_\sigma \left(\log \frac{1}{(1 - p_{\sigma^*}) p'_\sigma}\right)^2}{\left(\sum_{\sigma \neq \sigma^*} p'_\sigma \log \frac{1}{(1 - p_{\sigma^*}) p'_\sigma}\right)^2} - 1 \right) < \frac{1}{6}$$

to apply Lemma 1. We first rewrite this expression. Letting $H = \sum_{\sigma\ne\sigma^*} p'_\sigma \log\frac{1}{p'_\sigma}$ be the entropy of this conditional distribution,

$$\begin{split} \frac{\sum_{\sigma \neq \sigma^*} p'_{\sigma} \left(\log \frac{1}{(1 - p_{\sigma^*}) p'_{\sigma}}\right)^2}{\left(\sum_{\sigma \neq \sigma^*} p'_{\sigma} \log \frac{1}{(1 - p_{\sigma^*}) p'_{\sigma}}\right)^2} &= \frac{\sum_{\sigma \neq \sigma^*} p'_{\sigma} \left(\log \frac{1}{p'_{\sigma}}\right)^2 + 2H \log \frac{1}{1 - p_{\sigma^*}} + \left(\log \frac{1}{1 - p_{\sigma^*}}\right)^2}{\left(H + \log \frac{1}{1 - p_{\sigma^*}}\right)^2} \\ &= \frac{\sum_{\sigma \neq \sigma^*} p'_{\sigma} \left(\log \frac{1}{p'_{\sigma}}\right)^2 - H^2}{\left(H + \log \frac{1}{1 - p_{\sigma^*}}\right)^2} + 1. \end{split}$$

Lemma 2 now gives rather directly that this quantity is at most

$$\frac{H|X| - H^2}{(H + \log\frac{1}{1 - p\_{\sigma^\*}})^2} + 1 < \frac{|X|}{2\log\frac{1}{1 - p\_{\sigma^\*}}} + 1.$$

For the bound in terms of |Y |, there are now two cases depending on whether H is greater than 1 or less than 1. When it is greater than 1, the first part of Lemma 3 again gives

$$\frac{\sum\_{\sigma \in 2^Y} p'\_{\sigma} (\log p'\_{\sigma})^2}{H^2} \le |Y| + \log(|Y| + \log|Y| + 1.1).$$

When $H < 1$, on the other hand, recalling $p_{\sigma^*} > 1/2$ (so $\log \frac{1}{1-p_{\sigma^*}} \ge 1$), the second part of Lemma 3 gives that our expression is less than

$$\frac{|Y| + \log(|Y| + \log|Y| + 2.5) - H^2}{(H + \log\frac{1}{1 - p\_{\sigma^\*}})^2} < |Y| + \log(|Y| + \log|Y| + 2.5).$$

Thus, by Lemma 1,

$$t \ge \frac{6 \cdot \min\{\frac{|X|}{2\log\frac{1}{1-p\_{\sigma^\*}}}, |Y| + \log(|Y| + \log|Y| + 2.5)\}}{\varepsilon^2}$$

suffices to obtain $\hat{h}$ such that $\hat{h} \le (1 + \varepsilon) \sum_{\sigma\ne\sigma^*} \frac{p_\sigma}{1-p_{\sigma^*}} \log\frac{1}{p_\sigma}$ and $\hat{h} \ge (1 - \varepsilon) \sum_{\sigma\ne\sigma^*} \frac{p_\sigma}{1-p_{\sigma^*}} \log\frac{1}{p_\sigma}$; hence we obtain such an $\hat{h}$ with probability at least $1 - 0.9\cdot\delta$ on line 10, provided we pass the test on line 6 of Algorithm 1, thus identifying $\sigma^*$. Note that this value is adequate, so we need only guarantee that the test on line 6 passes in some iteration with probability at least $1 - 0.1\cdot\delta$.

To this end, note that each sample $\tau_{\downarrow Y}$ drawn on line 4 is equal to $\sigma^*$ with probability $\frac{|\operatorname{sol}(\varphi(Y \mapsto \sigma^*))|}{|\operatorname{sol}(\varphi)_{\downarrow X}|} > \frac{1}{2}$ by hypothesis. Since each iteration of the loop is an independent draw, the probability that the condition on line 6 is not met in any of the $\log\frac{10}{\delta}$ draws is less than $(1 - \frac{1}{2})^{\log\frac{10}{\delta}} = \frac{\delta}{10}$, as needed.
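As a quick numeric sanity check of this bound (our own check, taking logs base 2 and reusing the δ = 0.09 setting from the evaluation in Sect. 5):

```python
import math

delta = 0.09
draws = math.ceil(math.log2(10 / delta))  # iterations of lines 3-11
miss = 0.5 ** draws                       # Pr[sigma* never sampled]
# Without the ceiling, (1/2)^{log2(10/delta)} equals delta/10 exactly;
# the ceiling only lowers the miss probability further.
```

Here `draws` comes out to 7 and `miss` to 1/128, comfortably below δ/10 = 0.009.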

#### **4.4 Beyond Boolean Formulas**

We now focus on the case where the relationship between X and Y is modeled by an arbitrary relation R instead of a Boolean formula ϕ. As noted in Sect. 1, program behaviors are often modeled with other representations such as automata [4,5,14]. Automata-based modeling often has X represented as the input to the given automaton A, while every realization of Y corresponds to a state of A. Instead of an explicit description of A, one can rely on a symbolic description of A. Two families of techniques are currently used to estimate the entropy. The first is to enumerate the possible *output* states and, for each such state s, estimate the number of strings accepted by A if s were the only accepting state of A. The other relies on uniformly sampling a string σ, noting the final state of A when run on σ, and then applying a histogram-based technique to estimate the entropy.

In order to use the algorithm EntropyEstimation, one requires access to a sampler and a model counter for automata; the past few years have witnessed the design of efficient counters for automata in the context of string constraints. In addition, EntropyEstimation requires access to a conditioning routine to implement the substitution step, i.e., Y ↦ τ↓Y , which is easy to accomplish for automata by marking the corresponding state as non-accepting.

# **5 Empirical Evaluation**

To evaluate the runtime performance of EntropyEstimation, we implemented a prototype in Python that employs SPUR [3] as a uniform sampler and GANAK [52] as a projected model counter. We experimented with 96 Boolean formulas arising from diverse applications, ranging from QIF benchmarks [32] and plan recognition [54] to bit-blasted versions of SMTLIB benchmarks [52,54] and QBFEval competitions [1,2]. The value of n = |X| varies from 5 to 752, while the value of m = |Y | varies from 9 to 1447.

In all of our experiments, the parameters δ and ε were set to 0.09 and 0.8, respectively. All experiments were conducted on a high-performance computer cluster, each node consisting of an E5-2690 v3 CPU with 24 cores and 96 GB of RAM, with a memory limit of 4 GB per core. Experiments were run in single-threaded mode on a single core with a timeout of 3000 s.

*Baseline:* As our baseline, we implemented the following approach to compute the entropy exactly, which is representative of current state-of-the-art approaches [13,27,39]<sup>2</sup>. For each valuation σ ∈ sol(ϕ)<sub>↓Y</sub>, we compute p<sub>σ</sub> = |sol(ϕ(Y → σ))| / |sol(ϕ)<sub>↓X</sub>|, where |sol(ϕ(Y → σ))| is the count of satisfying assignments of ϕ(Y → σ), and |sol(ϕ)<sub>↓X</sub>| is the projected model count of ϕ over X. Finally, the entropy is computed as ∑<sub>σ∈2<sup>Y</sup></sub> p<sub>σ</sub> log(1/p<sub>σ</sub>).
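This enumeration-based baseline can be sketched over an explicit truth table (a toy stand-in for the GANAK-based counting in the actual implementation):

```python
import math
from itertools import product

def exact_entropy(phi, n_x, n_y):
    """Enumerate Y-valuations and accumulate p * log2(1/p).
    phi(x_bits, y_bits) -> bool plays the role of the formula."""
    sols = [(x, y) for x in product([0, 1], repeat=n_x)
                   for y in product([0, 1], repeat=n_y) if phi(x, y)]
    total = len({x for x, _ in sols})        # |sol(phi)| projected on X
    entropy = 0.0
    for sigma in {y for _, y in sols}:       # sol(phi) projected on Y
        p = sum(1 for _, y in sols if y == sigma) / total
        entropy += p * math.log2(1 / p)
    return entropy

# Toy formula encoding y = x1 AND x2: p(y=1) = 1/4, p(y=0) = 3/4.
phi = lambda x, y: y[0] == (x[0] & x[1])
```

On this toy formula the entropy is 0.25·log 4 + 0.75·log(4/3) ≈ 0.811 bits; the real baseline replaces the explicit enumeration by model-counting calls.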

Our evaluation demonstrates that EntropyEstimation scales to formulas beyond the reach of the enumeration-based baseline approach. Within a given timeout of 3000 s, EntropyEstimation was able to estimate the entropy for all the benchmarks, whereas the baseline approach terminated on only 14 benchmarks. Furthermore, EntropyEstimation estimated the entropy within the allowed tolerance for *all* the benchmarks.

#### **5.1 Scalability of EntropyEstimation**

Table 1 presents the performance of EntropyEstimation vis-a-vis the baseline approach for 20 benchmarks.<sup>3</sup> Column 1 of Table 1 gives the names of the

<sup>2</sup> We wish to emphasize that none of the previous approaches can provide (ε, δ)-style theoretical guarantees without enumerating all possible assignments to Y .

<sup>3</sup> The complete analysis for all of the benchmarks is deferred to the technical report https://arxiv.org/pdf/2206.00921.pdf.


**Table 1.** "-" represents that entropy could not be estimated due to timeout. Note that m = |Y | and n = |X|.

benchmarks, while columns 2 and 3 list the numbers of X and Y variables. Columns 4 and 5 present the time taken and the number of samples used by the baseline approach, and columns 6 and 7 present the same for EntropyEstimation. The number of samples required by the baseline approach is |sol(ϕ)<sub>↓Y</sub>|.

Table 1 clearly demonstrates that EntropyEstimation outperforms the baseline approach. As shown in Table 1, there are benchmarks for which the projected model count on Y is greater than 10<sup>30</sup>, i.e., the baseline approach would need 10<sup>30</sup> valuations to compute the entropy exactly. By contrast, the proposed algorithm EntropyEstimation needed at most ∼10<sup>4</sup> samples to estimate the entropy within the given tolerance and confidence. Our approach thus reduces the number of required samples dramatically, making entropy estimation scalable.

#### **5.2 Quality of Estimates**

There were only 14 benchmarks out of 96 for which the enumeration-based baseline approach finished within the given timeout of 3000 s. Therefore, we compared the entropy estimated by EntropyEstimation with the baseline for those 14 benchmarks only. Figure 1 shows how accurate the estimates of EntropyEstimation were. The y-axis represents the observed error, calculated as max(Estimated/Exact − 1, Exact/Estimated − 1), and the x-axis represents the benchmarks, ordered by increasing observed error; that is, a bar at position x shows the observed error for one benchmark (lower is better).
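The observed-error metric penalizes over- and under-estimation symmetrically; a minimal sketch:

```python
def observed_error(estimated, exact):
    """max(Estimated/Exact - 1, Exact/Estimated - 1): relative error
    that treats over- and under-estimates alike."""
    return max(estimated / exact - 1, exact / estimated - 1)
```

For instance, a 20% overestimate (1.2 vs. 1.0) and the mirror-image underestimate (1.0 vs. 1.2) both yield an observed error of 0.2, well below the tolerance ε = 0.8.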

**Fig. 1.** The accuracy of estimated entropy using EntropyEstimation for 14 benchmarks. ε = 0.8, δ = 0.09. (Color figure online)

The red horizontal line in Fig. 1 indicates the maximum allowed tolerance (ε), which was set to 0.80 in our experiments. We observe that for *all* 14 benchmarks, EntropyEstimation estimated the entropy within the allowed tolerance; in fact, the observed error was greater than 0.1 for just 2 out of the 14 benchmarks, and the maximum error observed was 0.29.

*Alternative Baselines:* As discussed earlier, several other algorithms have been proposed for estimating the entropy. For example, Valiant and Valiant's algorithm [58] obtains an ε-additive approximation using O(2<sup>m</sup>/(ε<sup>2</sup>m)) samples, and Chakraborty et al. [17] compute such approximations using O(m<sup>7</sup>/ε<sup>8</sup>) samples. We stress that neither of these is exact, and thus they could not be used to assess the accuracy of our method as presented in Fig. 1. Moreover, based on Table 1, we observe that the number of sampling or counting calls that could be completed within the timeout was roughly 2 × 10<sup>4</sup>, where m ranges between 10<sup>1</sup> and 10<sup>3</sup>. Thus, the method of Chakraborty et al. [17], which would take 10<sup>7</sup> or more samples on all benchmarks, would not be competitive with our method, which never used more than 2 × 10<sup>4</sup> calls. The method of Valiant and Valiant, on the other hand, would likely allow a few more benchmarks to be estimated (perhaps up to a fifth of the benchmarks). Still, it would not be competitive with our technique except on the smallest benchmarks (those for which the baseline required < 10<sup>6</sup> samples, about a third of our benchmarks), since we were otherwise more than a factor of m faster than the baseline.

# **6 Conclusion**

In this work, we considered estimating the Shannon entropy of a distribution specified by a circuit formula ϕ(X, Y). Prior work relied on O(2<sup>m</sup>) model counting queries and, therefore, could not scale beyond small values of m. In contrast, we proposed a novel technique, EntropyEstimation, which takes advantage of access to the formula ϕ via conditioning. EntropyEstimation makes only O(min(m, n)) model counting and sampling queries, and therefore scales significantly better than the prior approaches.

**Acknowledgments.** This work was supported in part by the National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004], Ministry of Education Singapore Tier 2 grant [MOE-T2EP20121-0011], NUS ODPRT grant [R-252-000-685-13], an Amazon Research Award, and NSF awards IIS-1908287, IIS-1939677, and IIS-1942336. We are grateful to the anonymous reviewers for constructive comments to improve the paper. The computational work was performed on resources of the National Supercomputing Centre, Singapore: https://www.nscc.sg.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **PoS4MPC: Automated Security Policy Synthesis for Secure Multi-party Computation**

Yuxin Fan<sup>1</sup>, Fu Song1,2(B) , Taolue Chen<sup>3</sup>, Liangfeng Zhang<sup>1</sup>, and Wanwei Liu4,5

<sup>1</sup> School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China

songfu@shanghaitech.edu.cn

<sup>2</sup> Shanghai Engineering Research Center of Intelligent Vision and Imaging, Shanghai 201210, China

<sup>3</sup> Department of Computer Science, Birkbeck, University of London, London WC1E 7HX, UK

<sup>4</sup> College of Computer Science, National University of Defense Technology, Changsha 410073, China

<sup>5</sup> State Key Laboratory for High Performance Computing, Changsha 410073, China

**Abstract.** Secure multi-party computation (MPC) is a promising technique for privacy-preserving applications. A number of MPC frameworks have been proposed to reduce the burden of designing customized protocols, allowing non-experts to quickly develop and deploy MPC applications. To improve performance, recent MPC frameworks allow users to declare as secret only those variables whose values must be protected. However, in practice, it is usually highly non-trivial for non-experts to specify secret variables: declaring too many degrades performance, while declaring too few compromises privacy. To address this problem, in this work we propose an automated security policy synthesis approach that declares as few secret variables as possible without compromising security. Our approach is a synergistic integration of type inference and symbolic reasoning. The former quickly infers a sound, but sometimes conservative, security policy, whereas the latter identifies secret variables in a security policy that can be declassified in a precise manner. Moreover, the results from symbolic reasoning are fed back to type inference to refine the security types even further. We implement our approach in a new tool **PoS4MPC**. Experimental results on five typical MPC applications confirm the efficacy of our approach.

This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62072309, No. 61872340 and No. 61872371, the Open Fund from the State Key Laboratory of High Performance Computing of China (HPCL) (202001- 07), an overseas grant from the State Key Laboratory of Novel Software Technology, Nanjing University, and Birkbeck BEI School Project (EFFECT).

S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 385–406, 2022. https://doi.org/10.1007/978-3-031-13185-1\_19

#### **1 Introduction**

Secure multi-party computation (MPC) is a powerful cryptographic paradigm that allows mutually distrusting parties to collaboratively compute a public function over their private data without a trusted third party, revealing nothing beyond the result of the computation and their own private data [14,43]. MPC has potential for broad use in practical applications, e.g., truthful auctions, avoiding satellite collisions [22], private machine learning [41], and data analysis [35]. However, practical deployment of MPC has been limited by its computational and communication complexity.

To foster applications of MPC, a number of general-purpose MPC frameworks have been proposed, e.g., [9,24,29,34,37,44]. These frameworks provide high-level languages for specifying MPC applications as well as compilers for translating them into executable implementations, thus drastically reducing the burden of designing customized protocols and allowing non-experts to quickly develop and deploy MPC applications. To improve performance, many MPC frameworks provide features to declare secret variables so that only those variables are protected. However, such frameworks usually do not rigorously verify whether there is information leakage, or, on some occasions, provide only lightweight checking (via, e.g., information-flow analysis). Even when a framework is equipped with formal security guarantees, it is challenging for non-experts to develop an MPC program that simultaneously achieves good performance and formal security guarantees [3,28]. A typical fallback for a user is to declare all variables secret, while ideally one would declare as few secret variables as possible to achieve good performance without compromising security.

In this work, we propose an automated security policy synthesis approach for MPC. We first formalize the leakage of an MPC application in the ideal world as a set of private inputs and define the notion of security policy, which assigns each variable a security level. This bridges the language-level and protocol-level leakages, so our approach is independent of the specific MPC protocol being used. Based on the leakage characterization, we provide a type system to infer security policies by tracking both the control and data flow of information from private inputs. While a security policy inferred from the type system formally guarantees that the MPC application will not leak more information than the result of the computation and the participants' own private data, it may be too conservative: some variables could be declassified without compromising security but with improved performance. Therefore, we propose a symbolic reasoning approach to identify secret variables in security policies that can be declassified without compromising security. We also feed the results from the symbolic reasoning back to type inference to refine the security types further.

We implement our approach in a new tool **PoS4MPC** (**Po**licy **S**ynthesis for **MPC**) based on the LLVM Compiler [1] and the KLEE symbolic execution engine [10]. Experimental results on five typical MPC applications show that our approach can generate less restrictive security policies than using the type system alone. We also deploy the generated security policies in two MPC frameworks, Obliv-C [44] and MPyC [37]. The results show that, for instance, the security policies generated by our approach can reduce the execution time by 31%–1.56 × 10<sup>5</sup>%, the circuit size by 38%–3.61 × 10<sup>5</sup>%, and the communication traffic by 39%–4.17 × 10<sup>5</sup>% in Obliv-C.

To summarize, our main technical contributions are as follows.


**Outline.** Section 2 presents the motivation of this work and overview of our approach. Section 3 gives the background of MPC. Section 4 introduces a simple language on which we formalize the leakage of MPC applications. We propose a type system for inferring security policies in Sect. 5 and a symbolic reasoning approach for declassification in Sect. 6. Implementation details and experimental results are given in Sect. 7. Finally, we discuss related work in Sect. 8 and conclude this paper in Sect. 9.

Missing proofs can be found in the full version of this paper [15].

### **2 Motivation**

Figure 1 shows a motivating example that computes the richest among three millionaires. To preserve privacy, the millionaires can privately send their inputs to a trusted third party (TTP), as shown in Fig. 2 (ideal-world). This reveals the richest millionaire with the least leakage of information. Table 1 shows the leakage for each result r = 1, 2, 3, as well as the leakage if the secret branching variables c1 and c2 are declassified (i.e., changed from secret to public).
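Figure 1 itself is not reproduced here; a plausible sketch of the computation, with secret branching variables c1 and c2 as described in the text (the exact comparison directions are assumptions), is:

```python
def richest(a, b, c):
    """Richest of three millionaires; returns 1, 2, or 3.
    c1 and c2 are the secret branching variables from Fig. 1."""
    c1 = a >= b               # compares the first two inputs
    max_ab = a if c1 else b
    c2 = max_ab >= c          # compares the running maximum with c
    if c2:
        return 1 if c1 else 2
    return 3                  # r = 3 iff c > max(a, b)
```

In this sketch, declassifying c2 reveals only whether r = 3, which is already implied by the result, whereas declassifying c1 additionally reveals whether a ≥ b even when r = 3.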


**Table 1.** Leakage from each result and declassified secret branching variables

To achieve the same functionality without a TTP, secure multi-party computation (MPC) was proposed [14,43]. One can implement the computation using an MPC protocol π, where all the parties collaboratively compute the result over their private inputs via network communications (shown in Fig. 2 (real-world)).

To facilitate applications of MPC, various MPC frameworks, e.g., Obliv-C [44], MP-SPDZ [24] and MPyC [37], have been proposed, which provide high-level languages for specifying MPC applications, as well as compilers for translating them into executable implementations. To improve performance, these frameworks often allow users to declare secret variables so that only the values of secret variables are protected. However, in practice, it is usually quite challenging for non-experts to specify secret variables properly: declaring too many secret variables degrades performance, whereas declaring too few risks compromising security and privacy.

In this work, we propose an automated synthesis approach, aiming to declare as few secret variables as possible but without compromising security. To capture privacy, we formalize the leakage of MPC applications in the ideal-world as a set of private inputs. For instance, the leakage of the result r = 1 in the motivating example is the set of inputs such that a ≥ b ∧ a ≥ c. We introduce the notion of security policy, which assigns each variable a security level, to bridge the language-level and protocol-level leakages, so that our approach is independent of specific MPC protocols being used. The language-level leakage of a security policy is characterized by a set of private inputs with respect to not only the result but also the values of public variables in the intermediate computations.

Based on the leakage characterization, we propose a type system to automatically infer security policies, inspired by the work of proving noninterference of programs [40]. Our type system tracks both control-flow and data-flow of information from the private inputs, and infers a security policy. For instance, all the variables in the motivating example are inferred as secret.

Although a security policy inferred by the type system formally guarantees that the MPC application will not leak more information than in the ideal world, it may be too conservative. For instance, declassifying the variable c2 in the example would not compromise security: as shown in Table 1, the leakage caused by declassifying c2 can be deduced from the leakage of the result. In contrast, we cannot declassify c1, as neither a ≥ b nor a < b can be deduced from the leakage c > max(a, b); once c1 is declassified, the adversary would learn whether a ≥ b or a < b. This problem is akin to the downgrading and declassification of high security levels in information-flow analysis [27], and could be solved via self-composition [39,42], which often requires users to write annotations for procedure contracts and loop invariants. In this work, for the sake of efficiency and usability for non-experts, we propose an alternative approach based on symbolic execution. We leverage symbolic execution to finitely represent a potentially infinite set of concrete executions, and propose an automated approach that infers whether a secret variable can be declassified by reasoning about pairs of symbolic executions. For instance, in the motivating example, our approach is able to identify that c2 can be declassified without compromising security. In general, the experimental results show that our approach is effective and the generated security policies can significantly improve the performance of MPC applications.

#### **3 Secure MPC**

Fix a set of variables X over a domain D. We write **x**<sub>n</sub> ∈ X<sup>n</sup> and **v**<sub>n</sub> ∈ D<sup>n</sup> for the tuples (x<sub>1</sub>, ··· , x<sub>n</sub>) and (v<sub>1</sub>, ··· , v<sub>n</sub>), respectively. (The subscript n may be dropped when it is clear from the context.)

**MPC in the Ideal-World**. An n-party MPC application f : D<sup>n</sup> → D confidentially computes a given function f(**x**): each party P<sub>i</sub>, for 1 ≤ i ≤ n, sends her private input v<sub>i</sub> ∈ D to a TTP T, which computes and returns the result f(**v**) to all the parties. In the ideal world, an adversary that controls any of the n parties learns no more than the output f(**v**) and the private inputs of the corrupted (dishonest) parties.

We characterize the leakage of an MPC application f(**x**) by a set of private inputs. Hereafter, we assume, w.l.o.g., that the first k parties (i.e., P<sub>1</sub>, ··· , P<sub>k</sub>) are corrupted by the adversary, for some k ≥ 1. For a given output v ∈ D, let f<sup>v</sup> ⊆ D<sup>n</sup> be the set {**v** ∈ D<sup>n</sup> | f(**v**) = v}. Intuitively, f<sup>v</sup> is the set of private inputs **v** ∈ D<sup>n</sup> under which f evaluates to v. From the result v, the adversary learns the set f<sup>v</sup>, but cannot tell which element of f<sup>v</sup> was the actual input. We refer to f<sup>v</sup> as the indistinguishable space of the private inputs w.r.t. the result v. The input domain D<sup>n</sup> is then partitioned into the indistinguishable spaces {f<sup>v</sup>}<sub>v∈D</sub>.

When the adversary controls the parties P<sub>1</sub>, ··· , P<sub>k</sub>, she learns the set Leak<sup>f</sup><sub>iw</sub>(v, **v**<sub>k</sub>) := {(v<sub>1</sub>, ··· , v<sub>n</sub>) ∈ D<sup>n</sup> | **v**<sub>k</sub> = (v<sub>1</sub>, ··· , v<sub>k</sub>)} ∩ f<sup>v</sup> from the result v and the adversary-chosen private inputs **v**<sub>k</sub> ∈ D<sup>k</sup>.
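For a small finite domain, the indistinguishable spaces and the sets Leak<sup>f</sup><sub>iw</sub> can be enumerated directly; a toy sketch (the two-bit OR function is an illustrative stand-in for f):

```python
from itertools import product

def indistinguishable_spaces(f, domain, n):
    """Partition domain^n into the spaces {f^v}_{v in D}."""
    spaces = {}
    for vs in product(domain, repeat=n):
        spaces.setdefault(f(*vs), set()).add(vs)
    return spaces

def leak_iw(f, domain, n, v, chosen):
    """Leak^f_iw(v, v_k): inputs consistent with result v and the
    adversary-chosen inputs of the first k corrupted parties."""
    k = len(chosen)
    space = indistinguishable_spaces(f, domain, n).get(v, set())
    return {vs for vs in space if vs[:k] == chosen}

f_or = lambda a, b: a | b   # a toy 2-party function over bits
```

For f_or, the space f<sup>1</sup> contains three inputs; an adversary controlling P<sub>1</sub> with input 0 who observes result 1 narrows this down to the single input (0, 1).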

**Definition 1 (Leakage in the ideal-world).** *For an MPC application* f(**x**<sub>n</sub>)*, the leakage of computing* v = f(**v**<sub>n</sub>) *in the ideal-world is* Leak<sup>f</sup><sub>iw</sub>(v, **v**<sub>k</sub>)*, for the adversary-chosen private inputs* **v**<sub>k</sub> ∈ D<sup>k</sup> *and the result* v ∈ D*.*

**MPC in the Real-World**. An MPC application in the real-world is implemented using some MPC protocol π (denoted by π<sup>f</sup>), by which all the parties collaboratively compute π<sup>f</sup>(**x**) over their private inputs **v** without any TTP T. An introduction to MPC protocols can be found in [14].

There are generally two types of adversaries in the real world: semi-honest and malicious. An adversary is semi-honest (a.k.a. passive) if the corrupted parties run the protocol honestly as specified, but may try to learn the private information of other parties by observing the protocol execution (i.e., network messages and program states). An adversary is malicious (a.k.a. active) if the corrupted parties can deviate arbitrarily from the prescribed protocol (e.g., control, manipulate, and inject messages) in an attempt to learn the private information of the other parties. In this work, we consider semi-honest adversaries, which are supported by most MPC frameworks and often serve as a basis for MPC in more robust settings with more powerful adversaries.

A protocol π is (semi-honest) secure if what a (semi-honest) adversary can achieve in the real-world can also be achieved by a corresponding adversary in the ideal-world. Semi-honest security ensures that the corrupted parties learn no more information from executing the protocol than what they can learn from the result and the private inputs of the corrupted parties. Therefore, the leakage of an MPC application f(**x**) in the real-world against the semi-honest adversary can also be characterized using the indistinguishability of private inputs.

**Definition 2.** *An MPC protocol* π *is (semi-honest) secure if for any MPC application* f(**x**<sub>n</sub>)*, adversary-chosen private inputs* **v**<sub>k</sub> ∈ D<sup>k</sup>*, and result* v ∈ D*, the leakage of computing* v = π<sup>f</sup>(**v**<sub>n</sub>) *is* Leak<sup>f</sup><sub>iw</sub>(v, **v**<sub>k</sub>)*.*

# **4 Language-Level Leakage Characterization**

In this section, we characterize the leakage of MPC applications from the language perspective.

#### **4.1 A Language for MPC**

We consider a simple language While for implementing MPC applications. The syntax of While programs is defined as follows.

$$\begin{array}{l} p ::= \text{skip} \mid x = e \mid p\_1; p\_2 \mid \text{if } x \text{ then } p\_1 \text{ else } p\_2 \mid \text{return } x \\\ \mid \text{while } x \text{ do } p \mid \text{repeat } n \text{ do } p \end{array}$$

where e is an expression defined as usual and n is a positive integer.

Despite its simplicity, While suffices to illustrate our approach, and our tool supports a real-world language. Note that we introduce two loop constructs. The while loop can only be used with secret-independent conditions, while the repeat loop (with a fixed number n of iterations) can have secret-dependent conditions. The restriction on the while loop is necessary: the adversary knows when the loop terminates, so secret information may be leaked if a secret-dependent condition is used [44].

The operational semantics of While programs is defined in a standard way (cf. [15]). In particular, repeat n do p repeats the loop body p a fixed number n of times. A configuration is a tuple ⟨p, σ⟩, where p denotes a statement and σ : X → D denotes a state mapping variables to values. The evaluation of an expression e under a state σ is denoted by σ(e). A transition from ⟨p, σ⟩ to ⟨p′, σ′⟩ is denoted by ⟨p, σ⟩ → ⟨p′, σ′⟩, and →<sup>∗</sup> denotes the transitive closure of →. An execution starting from the configuration ⟨p, σ⟩ is a sequence of configurations. We write ⟨p, σ⟩ ⇓ σ′ if ⟨p, σ⟩ →<sup>∗</sup> ⟨skip, σ′⟩. We assume that each execution ends in a return statement, i.e., all the while loops always terminate. We denote by ⟨p, σ⟩ ⇓ σ′ : v the execution returning value v.
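The semantics above can be sketched as a small interpreter; the tuple encoding of statements and the use of Python expressions for e are illustrative assumptions, not the authors' formalization:

```python
def run(p, sigma):
    """Evaluate statement p under state sigma; return (sigma', value),
    where value is None unless a return statement was reached."""
    kind = p[0]
    if kind == "skip":
        return sigma, None
    if kind == "assign":                      # x = e
        _, x, e = p
        s = dict(sigma)
        s[x] = eval(e, {}, sigma)             # sigma(e)
        return s, None
    if kind == "seq":                         # p1; p2
        s, v = run(p[1], sigma)
        return (s, v) if v is not None else run(p[2], s)
    if kind == "if":                          # if x then p1 else p2
        _, x, p1, p2 = p
        return run(p1 if sigma[x] else p2, sigma)
    if kind == "while":                       # while x do p
        _, x, body = p
        s = sigma
        while s[x]:
            s, v = run(body, s)
            if v is not None:
                return s, v
        return s, None
    if kind == "repeat":                      # repeat n do p (fixed n)
        _, n, body = p
        s = sigma
        for _ in range(n):
            s, v = run(body, s)
            if v is not None:
                return s, v
        return s, None
    if kind == "return":                      # return x
        return sigma, sigma[p[1]]
    raise ValueError(kind)
```

For example, repeat 3 do s = s + 1 starting from s = 0 returns 3.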

#### **4.2 Leakage Characterization in Ideal/Real-World**

An MPC application f(**x**) is implemented as a While program p. An execution of the program p evaluates the computation f(**x**) as if a TTP directly executed the program p on the private inputs. In this setting, the adversary cannot observe any intermediate states of the execution other than the final result.

Let X<sub>in</sub> = {x<sub>1</sub>, ··· , x<sub>n</sub>} ⊆ X be the set of private input variables. We denote by State<sub>0</sub> the set of initial states. Given a tuple of values **v**<sub>k</sub> ∈ D<sup>k</sup> and a result v ∈ D, let Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>) denote the set of states σ ∈ State<sub>0</sub> such that ⟨p, σ⟩ ⇓ σ′ : v for some state σ′ and σ(x<sub>i</sub>) = v<sub>i</sub> for 1 ≤ i ≤ k. Intuitively, when the adversary controls the parties P<sub>1</sub>, ··· , P<sub>k</sub>, she learns the set of states Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>) from the result v and the adversary-chosen private inputs **v**<sub>k</sub> ∈ D<sup>k</sup>. We can reformulate the leakage of an MPC application f(**x**) in the ideal-world (cf. Definition 1) as follows.

**Proposition 1.** *Given an MPC application* f(**x**<sub>n</sub>) *implemented by a program* p*,* **v**<sub>n</sub> ∈ Leak<sup>f</sup><sub>iw</sub>(v, **v**<sub>k</sub>) *iff there exists a state* σ ∈ Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>) *such that* σ(x<sub>i</sub>) = v<sub>i</sub> *for* 1 ≤ i ≤ n*.*

We use security policies to characterize the leakage of MPC applications in the real-world.

**Security Level.** We consider a lattice of security levels L = {Sec, Pub} with Pub ⊑ Pub, Pub ⊑ Sec, Sec ⊑ Sec, and Sec ⋢ Pub. We denote by ℓ<sub>1</sub> ⊔ ℓ<sub>2</sub> the least upper bound of two security levels ℓ<sub>1</sub>, ℓ<sub>2</sub> ∈ L, namely, Sec ⊔ ℓ = ℓ ⊔ Sec = Sec for ℓ ∈ L, and Pub ⊔ Pub = Pub.

**Definition 3.** *A security policy* ℓ : X → L *for the MPC application* f(**x**) *is a function that associates each variable* x ∈ X *with a security level* ℓ(x) ∈ L*.*

Given a security policy ℓ and a security level l ∈ L, let X<sub>ℓ</sub><sup>l</sup> := {x | ℓ(x) = l} ⊆ X, i.e., the set of variables with security level l under ℓ. We lift the order ⊑ to security policies: ℓ ⊑ ℓ′ if ℓ(x) ⊑ ℓ′(x) for each x ∈ X. When executing the program p with a security policy ℓ using an MPC protocol π, we assume that the adversary can observe the values of the public variables x ∈ X<sub>ℓ</sub><sup>Pub</sup>, but not those of the secret variables x ∈ X<sub>ℓ</sub><sup>Sec</sup>.
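The two-point lattice and the pointwise order on policies are simple enough to sketch directly (the example policy and variable names are illustrative):

```python
PUB, SEC = "Pub", "Sec"

def join(l1, l2):
    """Least upper bound on {Pub, Sec}: Sec absorbs everything."""
    return SEC if SEC in (l1, l2) else PUB

def policy_leq(p1, p2):
    """p1 <= p2 iff p1(x) <= p2(x) for every variable x."""
    return all(l == PUB or p2[x] == SEC for x, l in p1.items())

# A hypothetical policy for the millionaires example: only c2 is public.
policy = {"a": SEC, "b": SEC, "c1": SEC, "c2": PUB}
public_vars = {x for x, l in policy.items() if l == PUB}
```

Any policy is below the all-Sec policy under this order, which is why assigning Sec everywhere is always sound but rarely optimal.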

This is a practical assumption and is well supported by existing frameworks. For instance, Obliv-C [44] allows developers to define an MPC application in an extension of the C language; when compiled and linked, the result is a concrete garbled-circuit protocol π<sup>p</sup> whose computation does not reveal the values of any oblivious-qualified variables. Thus, all the secret variables specified by the security policy can be declared as oblivious-qualified variables in Obliv-C, while all the public variables specified by the security policy are declared without the oblivious qualification. Similarly, MPyC [37] is a Python package for implementing MPC applications that allows programmers to define instances of secret-typed variable classes using Python's class mechanism. When executing MPC applications, instances of secret-typed variable classes are protected via Shamir's secret sharing protocol [38]. Thus, all the secret variables specified by the security policy can be declared as instances of secret-typed variable classes in MPyC, while all the public variables specified by the security policy are declared as instances of Python's standard classes.

**Leakage Under a Security Policy.** Fix a security policy ℓ for the program p. Remark that the values of the secret variables are not known, even at runtime, by any party, as they are encrypted. This means that, unlike secret-independent conditions, secret-dependent conditions cannot be executed normally, and thus should be removed, using e.g. multiplexers, before transforming into circuits. We define the transformation T<sub>ℓ</sub>(·, ·), where c is the selector of a multiplexer.

- T<sub>ℓ</sub>(c, skip) = skip
- T<sub>ℓ</sub>(c, return x) = return x
- T<sub>ℓ</sub>(c, x = e) = x = x + c × (e − x)
- T<sub>ℓ</sub>(c, p<sub>1</sub>; p<sub>2</sub>) = T<sub>ℓ</sub>(c, p<sub>1</sub>); T<sub>ℓ</sub>(c, p<sub>2</sub>)
- T<sub>ℓ</sub>(c, if x then p<sub>1</sub> else p<sub>2</sub>) = if x then T<sub>ℓ</sub>(1, p<sub>1</sub>) else T<sub>ℓ</sub>(1, p<sub>2</sub>), if c = 1 ∧ ℓ(x) = Pub; and T<sub>ℓ</sub>(c & x, p<sub>1</sub>); T<sub>ℓ</sub>(c & ¬x, p<sub>2</sub>), otherwise
- T<sub>ℓ</sub>(c, while x do p) = while x do T<sub>ℓ</sub>(1, p), if c = 1 ∧ ℓ(x) = Pub; and Error, otherwise
- T<sub>ℓ</sub>(c, repeat n do p) = repeat n do T<sub>ℓ</sub>(c, p)

Intuitively, c in T<sub>ℓ</sub>(c, ·) indicates whether the statement is under some secret-dependent branching statement. Initially, c = 1. During the transformation, c is conjoined with the branching condition x or ¬x when transforming if x then p<sub>1</sub> else p<sub>2</sub> if x is secret or c ≠ 1; the control flow inside must then be protected. If c = 1 and the condition variable x is public, the statement need not be protected. T<sub>ℓ</sub>(c, x = e) simulates a multiplexer with two different values depending on whether the assignment x = e is in the scope of some secret-dependent condition: at runtime, the value of e is assigned to x if c is 1; otherwise x does not change. T<sub>ℓ</sub>(c, while x do p) enforces that the while loop is used under secret-independent conditions and that x is public in the security policy, and throws an error otherwise. The other cases are straightforward. We denote by p<sub>ℓ</sub> the program T<sub>ℓ</sub>(1, p), on which we will define the leakage of p in the real-world.
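The transformation T<sub>ℓ</sub> can be sketched over a tuple-encoded AST (an illustrative encoding; selectors and expressions are kept as strings, and "1" denotes the top-level selector):

```python
PUB, SEC = "Pub", "Sec"

def transform(c, stmt, level):
    """T_l(c, stmt): compile away secret-dependent control flow.
    level(x) returns the security level of variable x."""
    kind = stmt[0]
    if kind in ("skip", "return"):
        return stmt
    if kind == "assign":                      # x = e  ->  x = x + c*(e - x)
        _, x, e = stmt
        if c == "1":
            return stmt
        return ("assign", x, f"{x} + ({c})*(({e}) - {x})")
    if kind == "seq":
        return ("seq", transform(c, stmt[1], level),
                       transform(c, stmt[2], level))
    if kind == "if":
        _, x, p1, p2 = stmt
        if c == "1" and level(x) == PUB:      # public branch: keep the if
            return ("if", x, transform("1", p1, level),
                             transform("1", p2, level))
        # secret branch: execute both arms under complementary selectors
        return ("seq", transform(f"{c}&{x}", p1, level),
                       transform(f"{c}&!{x}", p2, level))
    if kind == "while":
        _, x, body = stmt
        if c == "1" and level(x) == PUB:
            return ("while", x, transform("1", body, level))
        raise ValueError("while loop under a secret-dependent condition")
    if kind == "repeat":
        _, n, body = stmt
        return ("repeat", n, transform(c, body, level))
    raise ValueError(kind)
```

For example, an if on a secret condition c1 is flattened into a sequence whose assignments are muxed by the selectors 1 & c1 and 1 & ¬c1.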

For every state σ : X → D, let σ<sub>Pub</sub> : X<sub>ℓ</sub><sup>Pub</sup> → D denote the projection of the state σ onto the public variables X<sub>ℓ</sub><sup>Pub</sup>. For each execution ⟨p<sub>ℓ</sub>, σ<sub>1</sub>⟩ ⇓ σ<sub>2</sub>, we denote by ⟨p<sub>ℓ</sub>, σ<sub>1</sub>⟩ ⇓<sub>Pub</sub> σ<sub>2</sub> the sequence of configurations in which each state σ is replaced by σ<sub>Pub</sub>.

Recall that the adversary can observe the values of the public variables x ∈ X<sub>ℓ</sub><sup>Pub</sup> when executing the program p<sub>ℓ</sub>. Thus, from an execution ⟨p<sub>ℓ</sub>, σ<sub>1</sub>⟩ ⇓ σ<sub>2</sub> : v, she observes the sequence ⟨p<sub>ℓ</sub>, σ<sub>1</sub>⟩ ⇓<sub>Pub</sub> σ<sub>2</sub> and the result v, written ⟨p<sub>ℓ</sub>, σ<sub>1</sub>⟩ ⇓<sub>Pub</sub> σ<sub>2</sub> : v. For every state σ ∈ Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>), we denote by Leak<sup>p,ℓ</sup><sub>rw</sub>(v, σ) the set of states σ′ ∈ Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>) such that ⟨p<sub>ℓ</sub>, σ′⟩ ⇓<sub>Pub</sub> σ′<sub>1</sub> : v and ⟨p<sub>ℓ</sub>, σ⟩ ⇓<sub>Pub</sub> σ<sub>1</sub> : v are identical.

**Definition 4.** *A security policy* ℓ *is perfect for a given MPC application* f(**x**<sub>n</sub>) *implemented by the program* p*, denoted by* ℓ |=<sub>p</sub> f(**x**<sub>n</sub>)*, if* T<sub>ℓ</sub>(1, p) *does not throw any errors, and for all adversary-chosen private inputs* **v**<sub>k</sub> ∈ D<sup>k</sup>*, results* v ∈ D*, and states* σ ∈ Leak<sup>p</sup><sub>iw</sub>(v, **v**<sub>k</sub>)*, we have that*

$$
\mathsf{Leak}^{p}_{\mathsf{iw}}(v,\mathbf{v}_{k}) = \mathsf{Leak}^{p,\ell}_{\mathsf{rw}}(v,\sigma).
$$

Intuitively, a perfect security policy ensures that for every state σ ∈ Leak_iw^p(v, **v**_k), from the observation p_ϱ, σ ⇓Pub σ' : v, the adversary learns only the same set Leak_iw^p(v, **v**_k) of initial states as in the ideal world.

Our goal is to compute a perfect security policy ϱ for every program p that implements the MPC application f(**x**). A naive way is to assign the high security level Sec to all the variables in X; this, however, may suffer from low performance, as all the intermediate computations then have to be performed on encrypted data and conditional statements have to be removed. Ideally, a security policy should not only be perfect but also annotate as few secret variables as possible.

#### **5 Type System**

In this section, we present a sound type system to automatically infer perfect security policies. We first define noninterference of a program p w.r.t. a security policy ϱ, which is shown to entail the perfectness of ϱ.

**Definition 5.** *A program* p *is noninterfering w.r.t. a security policy* ϱ*, written as* ϱ*-noninterfering, if* T_ϱ(1, p) *does not throw any errors and* p_ϱ, σ_1 ⇓Pub σ_2 : v *and* p_ϱ, σ'_1 ⇓Pub σ'_2 : v *are the same for each pair of states* σ_1, σ'_1 ∈ State_0*.*

Intuitively, ϱ-noninterference ensures that for all private inputs of the n parties (without the adversary-chosen private inputs), the adversary observes the same sequence of configurations in all the executions that return the same value.

The ϱ-noninterference of p entails the perfectness of ϱ, where the adversary can choose arbitrary private inputs **v**_k ∈ D^k of the corrupted participants (P_1, ···, P_k) for any k ≥ 1.

**Proposition 2.** *If* p *is* ϱ*-noninterfering for a security policy* ϱ*, then* ϱ |=_p f(**x**)*.*

Note that the converse of Proposition 2 does not necessarily hold, due to the adversary-chosen private inputs. For instance, suppose p_ϱ, σ_1 ⇓Pub σ_2 : v and p_ϱ, σ'_1 ⇓Pub σ'_2 : v are identical for every pair of states σ_1, σ'_1 ∈ Leak_iw^p(v, v_1), and p_ϱ, σ_3 ⇓Pub σ_4 : v and p_ϱ, σ'_3 ⇓Pub σ'_4 : v are identical for every pair of states σ_3, σ'_3 ∈ Leak_iw^p(v, v'_1). If v_1 ≠ v'_1, then p_ϱ, σ_1 ⇓Pub σ_2 : v and p_ϱ, σ_3 ⇓Pub σ_4 : v may still differ, in which case p is not ϱ-noninterfering.

Based on Proposition 2, we present a type system for inferring a perfect security policy ϱ of a given program p such that p is ϱ-noninterfering. The typing judgement has the form c ⊢ p : ϱ ⇒ ϱ', where the type contexts ϱ, ϱ' are security policies, p is the program under typing, and c is the security level of the current control flow. The typing judgement c ⊢ p : ϱ ⇒ ϱ' states that, given the security level c of the current control flow and the type context ϱ, the statement p is typable and yields a new, updated type context ϱ'.

The type inference rules are shown in Fig. 3. They track the security levels of both data- and control-flow of information from private inputs, where ϱ(e) denotes the least upper bound of the security levels ϱ(x) of the variables x used in the expression e, and ϱ_1 ⊔ ϱ_2 is the security policy such that for every variable x ∈ X, (ϱ_1 ⊔ ϱ_2)(x) = ϱ_1(x) ⊔ ϱ_2(x). lfp(c, n, ϱ, p) is ϱ if n = 0 or ϱ' = ϱ, and otherwise lfp(c, n − 1, ϱ', p), where c ⊢ p : ϱ ⇒ ϱ'. Note that constants have the security level Pub. Most of these rules are standard.

$$
\frac{}{\mathsf{c} \vdash \texttt{skip} : \varrho \Rightarrow \varrho}\ [\text{T-Skip}]
\qquad
\frac{\varrho' = \varrho[x \mapsto \mathsf{c} \sqcup \varrho(e)]}{\mathsf{c} \vdash x = e : \varrho \Rightarrow \varrho'}\ [\text{T-Assign}]
$$

$$
\frac{\mathsf{c} \vdash p_1 : \varrho \Rightarrow \varrho_1 \quad \mathsf{c} \vdash p_2 : \varrho_1 \Rightarrow \varrho_2}{\mathsf{c} \vdash p_1 ; p_2 : \varrho \Rightarrow \varrho_2}\ [\text{T-Seq}]
\qquad
\frac{\mathsf{c} \sqcup \varrho(x) \vdash p_1 : \varrho \Rightarrow \varrho_1 \quad \mathsf{c} \sqcup \varrho(x) \vdash p_2 : \varrho \Rightarrow \varrho_2 \quad \varrho' = \varrho_1 \sqcup \varrho_2}{\mathsf{c} \vdash \texttt{if}\ x\ \texttt{then}\ p_1\ \texttt{else}\ p_2 : \varrho \Rightarrow \varrho'}\ [\text{T-If}]
$$

$$
\frac{\mathsf{c} \sqcup \varrho(x) = \mathsf{Pub} \quad \varrho' = \mathsf{lfp}(\mathsf{c}, n, \varrho, p)}{\mathsf{c} \vdash \texttt{while}\ x\ \texttt{do}\ p : \varrho \Rightarrow \varrho'}\ [\text{T-While}]
\qquad
\frac{}{\mathsf{c} \vdash \texttt{return}\ x : \varrho \Rightarrow \varrho}\ [\text{T-Return}]
$$

**Fig. 3.** Type inference rules

Rule T-Assign disables the data-flow and control-flow of information from the security level Sec to the security level Pub. To meet this constraint, the security level of the variable x is updated to the least upper bound c ⊔ ϱ(e) of the security level of the current control flow c and the security levels of the variables used in the expression e. Rule T-If passes the security level c of the current control flow into both branches, preventing assignments to public variables in those two branches when c = Sec. Rule T-While requires that the loop condition is public and that the loop is used under secret-independent conditions, ensuring that T_ϱ(1, p) does not throw any errors. Rule T-Return does not impose any constraints on x, as the return value is observable to the adversary.
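The rules of Fig. 3 can be prototyped as a recursive transfer function over the same kind of hypothetical AST as before (all names are ours; an expression is represented here simply by the list of variables it mentions, since only their levels matter):

```python
# Sketch of the type inference: levels "Pub" < "Sec", lub = least upper bound.

def lub(*levels):
    return "Sec" if "Sec" in levels else "Pub"

def level_of(expr_vars, rho):
    # ϱ(e): lub of the levels of the variables in e (constants are Pub)
    return lub(*[rho[v] for v in expr_vars]) if expr_vars else "Pub"

def infer(c, stmt, rho):
    kind = stmt[0]
    if kind in ("skip", "return"):            # T-Skip / T-Return
        return dict(rho)
    if kind == "assign":                      # T-Assign: x gets c ⊔ ϱ(e)
        x, expr_vars = stmt[1], stmt[2]
        rho2 = dict(rho)
        rho2[x] = lub(c, level_of(expr_vars, rho))
        return rho2
    if kind == "seq":                         # T-Seq
        return infer(c, stmt[2], infer(c, stmt[1], rho))
    if kind == "if":                          # T-If: branches under c ⊔ ϱ(x)
        x, p1, p2 = stmt[1], stmt[2], stmt[3]
        c2 = lub(c, rho[x])
        r1, r2 = infer(c2, p1, rho), infer(c2, p2, rho)
        return {v: lub(r1[v], r2[v]) for v in rho}
    if kind == "while":                       # T-While: condition must be public
        x, body = stmt[1], stmt[2]
        if lub(c, rho[x]) != "Pub":
            raise TypeError("secret-dependent while loop")
        while True:                           # least fixed point, as in lfp(...)
            rho2 = infer(c, body, rho)
            if rho2 == rho:
                return rho
            rho = rho2
    raise ValueError(kind)
```

For instance, `infer("Pub", p, rho0)` on `if a then x = b else skip` with `a` secret raises the level of `x` to `Sec`, since the assignment happens under secret-dependent control.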

Let ϱ_0 : X → L be the mapping such that ϱ_0(x) = Sec for all x ∈ X^Sec, and ϱ_0(x) = Pub otherwise. If the typing judgement Pub ⊢ p : ϱ_0 ⇒ ϱ is valid, then the values of all the public variables specified by ϱ do not depend on any values of private inputs. Thus, it is straightforward to get that:

**Proposition 3.** *If the typing judgement* Pub ⊢ p : ϱ_0 ⇒ ϱ *is valid, then the program* p *is* ϱ*-noninterfering.*

From Propositions 2 and 3, we have

**Corollary 1.** *If* Pub ⊢ p : ϱ_0 ⇒ ϱ *is valid, then* ϱ *is perfect, i.e.,* ϱ |=_p f(**x**)*.*

#### **6 Degrading Security Levels**

The type system allows us to infer a security policy ϱ such that the type judgement Pub ⊢ p : ϱ_0 ⇒ ϱ is valid, from which we can deduce that ϱ |=_p f(**x**), i.e., ϱ is perfect for the MPC application f(**x**) implemented by the program p. However, the security policy ϱ may be too conservative, i.e., some secret variables specified by ϱ can be declassified without compromising the security. In this section, we propose an automated approach to identify these variables. We mainly consider minimizing the number of secret branching variables, viz., the secret variables used in branching conditions, as they usually incur a high computation and communication overhead. W.l.o.g., we assume that for each secret branching variable x there is only one assignment to x and it is used in only one conditional statement. (We can rename variables in p if this assumption does not hold, where the renamed variables have the same security levels as their original names.) With this assumption, whether x can be declassified depends only on the unique conditional statement where it occurs.

**Fig. 4.** The symbolic semantics of While programs

Fix a security policy ϱ such that ϱ |=_p f(**x**). Suppose that if x then p_1 else p_2 is not used under secret-dependent conditions. Let ϱ' be the security policy ϱ[x → Pub]. It is easy to see that T_{ϱ'}(1, p) does not raise any errors. Therefore, to declassify x, we need to ensure that p_{ϱ'}, σ ⇓Pub σ_1 : v and p_{ϱ'}, σ' ⇓Pub σ'_1 : v are identical for all adversary-chosen private inputs **v**_k ∈ D^k, results v ∈ D, and states σ, σ' ∈ Leak_iw^p(v, **v**_k). However, as the number of initial states may be large and even infinite, it is infeasible to check all pairs of executions.

We propose to use symbolic executions to represent the potentially infinite sets of (concrete) executions. Each symbolic execution t is associated with a path condition φ, which denotes the set of initial states satisfying φ, from each of which the execution has the same sequence of statements. Thus, the conjunction φ ∧ e = v, where e is the symbolic return value and v is a concrete value, represents the set of initial states from which the executions have the same sequence of statements and return the same result v. It is not difficult to observe that checking whether x in if x then p_1 else p_2 can be declassified amounts to checking whether, for every pair of symbolic executions t_1 and t_2 that both include if x then p_1 else p_2, x has the same truth value in t_1 and t_2 whenever t_1 and t_2 return the same value. This can be solved by invoking off-the-shelf SMT solvers.

#### **6.1 Symbolic Semantics**

Let E denote the set of expressions over the private input variables **x** and constants. A path condition φ ∈ E is a conjunction of Boolean expressions. A state σ ∈ State_0 satisfies φ, denoted by σ |= φ, if φ evaluates to True under σ. A symbolic state α is a function X → E that maps variables to symbolic expressions. α(e) denotes the symbolic value of the expression e under α, obtained from e by replacing each occurrence of a variable x by α(x). The initial symbolic state, denoted by α_0, is the identity function over the private input variables **x**.

The symbolic semantics of While programs is defined by transitions between symbolic configurations, as shown in Fig. 4, where SAT(φ) is True iff the constraint φ is satisfiable. A symbolic configuration is a tuple p, α, φ, where p is a statement, α is a symbolic state, and φ is the path condition that should be satisfied to reach p, α, φ. We write p, α, φ → p', α', φ' for a transition from p, α, φ to p', α', φ'. The symbolic semantics is almost the same as the operational semantics, except that (1) the path conditions are collected and checked for conditional statements and while loops, and (2) a transition may be nondeterministic if both φ ∧ α(x) and φ ∧ ¬α(x) are satisfiable.

We denote by →* the transitive closure of →, where its path condition is the conjunction of those of the individual transitions. A symbolic execution starting from a symbolic configuration p, α, φ is a sequence of symbolic configurations, written p, α, φ ⇓ (α', φ'), if p, α, φ →* skip, α', φ'. Moreover, we denote by p, α, φ ⇓ (α', φ') : e the symbolic execution p, α, φ ⇓ (α', φ') with the symbolic return value e. We denote by SymExe the set of all the symbolic executions p, α_0, True ⇓ (α, φ) : e of the program p. Note that α_0 is the initial symbolic state. Recall that we assumed all the (concrete) executions terminate; thus SymExe is a finite set of finite sequences of symbolic configurations.
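A toy symbolic executor can illustrate how SymExe is enumerated (the AST and the string-based expressions are our own simplification; the tool uses KLEE): each execution yields a path condition together with a symbolic return value, and the satisfiability pruning SAT(φ) is elided here.

```python
def subst(e, alpha):
    # substitute current symbolic values for variables (longest names first,
    # a crude guard against substring capture in this string-based sketch)
    for v in sorted(alpha, key=len, reverse=True):
        e = e.replace(v, f"({alpha[v]})")
    return e

def sym_exec(stmts, alpha, phi):
    # returns all pairs (path condition, symbolic return value)
    if not stmts:
        return []
    s, rest = stmts[0], stmts[1:]
    if s[0] == "assign":                       # x = e: update the symbolic state
        a2 = dict(alpha); a2[s[1]] = subst(s[2], alpha)
        return sym_exec(rest, a2, phi)
    if s[0] == "return":                       # terminate with symbolic value
        return [(phi, subst(s[1], alpha))]
    if s[0] == "if":                           # fork on both branch outcomes
        x, p1, p2 = s[1], s[2], s[3]
        b = subst(x, alpha)
        return (sym_exec(p1 + rest, alpha, phi + [b]) +
                sym_exec(p2 + rest, alpha, phi + [f"!({b})"]))
    raise ValueError(s[0])

# all symbolic executions of:  if a < b then return b else return a
prog = [("if", "a < b", [("return", "b")], [("return", "a")])]
runs = sym_exec(prog, {"a": "a", "b": "b"}, [])
```

Here `runs` contains two symbolic executions: one with path condition `a < b` returning `b`, and one with the negated condition returning `a`.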

#### **6.2 Relating Symbolic Executions to Concrete Executions**

A symbolic execution t = p, α_0, True ⇓ (α, φ) : e represents the set of (concrete) executions starting from the states σ ∈ State_0 such that σ |= φ. Formally, consider σ ∈ State_0 such that σ |= φ; by concretizing all the symbolic values of the variables x in each symbolic state α' with the concrete values σ(α'(x)) and projecting out all the path conditions, the symbolic execution t becomes the execution p, σ ⇓ σ' : σ(e), written σ(t). For each execution p, σ ⇓ σ' : v, there are a unique symbolic execution t such that σ(t) = p, σ ⇓ σ' : v and a unique execution p_ϱ, σ ⇓ σ'' : v in the program p_ϱ. We denote by RW_{ϱ,σ}(t) the execution p_ϱ, σ ⇓ σ'' : v and by RW^Pub_{ϱ,σ}(t) the sequence p_ϱ, σ ⇓Pub σ'' : v.

For all adversary-chosen private inputs **v**_k ∈ D^k, result v ∈ D, and initial state σ ∈ Leak_iw^p(v, **v**_k), we can reformulate the set Leak_rw^{p,ϱ}(v, σ) as follows. (Recall that Leak_rw^{p,ϱ}(v, σ) is the set of states σ' ∈ Leak_iw^p(v, **v**_k) such that p_ϱ, σ' ⇓Pub σ'_1 : v and p_ϱ, σ ⇓Pub σ_1 : v are identical.)

**Proposition 4.** *For each state* σ' ∈ Leak_iw^p(v, **v**_k)*,* σ' ∈ Leak_rw^{p,ϱ}(v, σ) *iff for every symbolic execution* t' = p, α_0, True ⇓ (α', φ') : e' ∈ SymExe *such that* σ' |= φ' ∧ e' = v*,* RW^Pub_{ϱ,σ'}(t') *and* RW^Pub_{ϱ,σ}(t) *are identical, where* t *is the symbolic execution* p, α_0, True ⇓ (α, φ) : e *such that* σ |= φ ∧ e = v*.*

Proposition 4 allows us to consider only the symbolic executions p, α_0, True ⇓ (α, φ) : e ∈ SymExe such that σ |= φ ∧ e = v when checking whether ϱ is perfect.

#### **6.3 Reasoning About Symbolic Executions**

We leverage Proposition 4 to identify secret variables that can be declassified without compromising the security by reasoning about symbolic executions. For each expression φ ∈ E, Primed(φ) denotes the "primed" expression φ where each private input variable x_i is replaced by x'_i (i.e., its primed version).

Consider two symbolic executions t = p, α_0, True ⇓ (α, φ) : e and t' = p, α_0, True ⇓ (α', φ') : e'. Assume if x then p_1 else p_2 is not used under any secret-dependent conditions. Recall that we assumed x is used only in if x then p_1 else p_2. Then, t and t' execute the same subsequence (say p^1, ···, p^m) of occurrences of the statement if x then p_1 else p_2. Let e_1, ···, e_m (resp. e'_1, ···, e'_m) be the symbolic values of x when executing p^1, ···, p^m in the symbolic execution t (resp. t'). Define the constraint Ψ_x(t, t') as

$$
\Psi_x(t, t') \triangleq \left(\phi \wedge \mathsf{Primed}(\phi') \wedge e = \mathsf{Primed}(e')\right) \Rightarrow \left(\bigwedge_{i=1}^{m} e_i = \mathsf{Primed}(e'_i)\right).
$$

Intuitively, Ψ_x(t, t') asserts that for every pair of states σ, σ' ∈ State_0, if σ (resp. σ') satisfies the path condition φ (resp. φ') and σ(e) and σ'(e') are identical, then for each 1 ≤ i ≤ m, the values of x are the same when executing the conditional statement p^i in both RW_{ϱ,σ}(t) and RW_{ϱ,σ'}(t').
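The validity check for Ψ_x(t, t') can be prototyped by brute-force enumeration over a small finite domain (the tool discharges it with an SMT solver instead); path conditions, return values, and the symbolic values of x are passed here as Python predicates over initial states, all names being our own:

```python
from itertools import product

def psi_valid(phi, ret, xs, phi_p, ret_p, xs_p, domain, nvars):
    """Check (phi ∧ phi' ∧ ret = ret') ⇒ ∧_i xs[i] = xs'[i] over domain^nvars.

    phi/ret/xs describe t (unprimed), phi_p/ret_p/xs_p describe t' (primed)."""
    for s in product(domain, repeat=nvars):        # unprimed initial state
        for sp in product(domain, repeat=nvars):   # primed initial state
            if phi(s) and phi_p(sp) and ret(s) == ret_p(sp):
                if any(xi(s) != xip(sp) for xi, xip in zip(xs, xs_p)):
                    return False                   # counterexample found
    return True
```

For example, with φ ≜ a ≥ b, φ' ≜ a' < b', and both executions returning a + b, a branching variable whose symbolic value is a ≥ b (true under φ) versus a' ≥ b' (false under φ') makes Ψ invalid, while a constant branching value keeps it valid.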

**Proposition 5.** *For each pair of states* σ, σ' ∈ Leak_iw^p(v, **v**_k) *such that* σ |= φ ∧ e = v *and* σ' |= φ' ∧ e' = v*, if* Ψ_x(t, t') *is valid and* RW^Pub_{ϱ,σ}(t) *and* RW^Pub_{ϱ,σ'}(t') *are identical, then* RW^Pub_{ϱ',σ}(t) *and* RW^Pub_{ϱ',σ'}(t') *are identical, where* ϱ' = ϱ[x → Pub]*.*

Recall that x can be declassified in a perfect security policy ϱ if ϱ' = ϱ[x → Pub] is still perfect, namely, p_{ϱ'}, σ ⇓Pub σ_1 : v and p_{ϱ'}, σ' ⇓Pub σ'_1 : v are identical for all adversary-chosen private inputs **v**_k ∈ D^k, results v ∈ D, and states σ, σ' ∈ Leak_iw^p(v, **v**_k). By Proposition 5, if Ψ_x(t, t') is valid for each pair of symbolic executions t, t' ∈ SymExe, we can deduce that ϱ' is still perfect.

**Theorem 1.** *If* ϱ |=_p f(**x**) *and* Ψ_x(t, t') *is valid for each pair of symbolic executions* t, t' ∈ SymExe*, then* ϱ[x → Pub] |=_p f(**x**)*.*

*Example 1.* Consider two symbolic executions t and t' in the motivating example such that the path condition φ (resp. φ') of t (resp. t') is a ≥ b ∧ c > a (resp. a < b ∧ c > b), and both return the result 3. The secret branching variable c2 has the symbolic values c > a and c > b in t and t', respectively. Then

$$\Psi\_{\mathsf{c2}}(t, t') \triangleq (\mathbf{a} \ge \mathbf{b} \land \mathbf{c} > \mathbf{a} \land \mathbf{a}' < \mathbf{b}' \land \mathbf{c}' > \mathbf{b}' \land 3 = 3) \Rightarrow ((\mathbf{c} > \mathbf{a}) = (\mathbf{c}' > \mathbf{b}')).$$

Obviously, Ψ_c2(t, t') is valid. We can show that for any other pair (t, t') of symbolic executions, Ψ_c2(t, t') is also valid. Therefore, the secret branching variable c2 can be declassified in any perfect security policy ϱ.

In contrast, the secret branching variable c1 has the symbolic value a < b in both t and t'. Then,

$$\Psi\_{\mathbf{c1}}(t, t') \triangleq (\mathbf{a} \ge \mathbf{b} \wedge \mathbf{c} > \mathbf{a} \wedge \mathbf{a'} < \mathbf{b'} \wedge \mathbf{c'} > \mathbf{b'} \wedge 3 = 3) \Rightarrow ((\mathbf{a} < \mathbf{b}) = (\mathbf{a'} < \mathbf{b'})).$$

Ψ_c1(t, t') is not valid; thus the secret branching variable c1 cannot be declassified.
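Both implications of this example can be confirmed by exhaustive enumeration over a small finite domain (the tool would discharge them with an SMT solver; the bound 4 below is an arbitrary choice of ours):

```python
from itertools import product

def valid_c2():
    # (a>=b ∧ c>a ∧ a'<b' ∧ c'>b' ∧ 3=3) ⇒ ((c>a) = (c'>b'))
    for a, b, c, ap, bp, cp in product(range(4), repeat=6):
        if a >= b and c > a and ap < bp and cp > bp:
            if (c > a) != (cp > bp):
                return False
    return True

def valid_c1():
    # (a>=b ∧ c>a ∧ a'<b' ∧ c'>b' ∧ 3=3) ⇒ ((a<b) = (a'<b'))
    for a, b, c, ap, bp, cp in product(range(4), repeat=6):
        if a >= b and c > a and ap < bp and cp > bp:
            if (a < b) != (ap < bp):
                return False
    return True
```

`valid_c2()` returns True while `valid_c1()` returns False (e.g., a = b = 0, c = 1 versus a' = 0, b' = 1, c' = 2 is a counterexample for c1), matching the discussion above.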

**Fig. 5.** The workflow of our tool **PoS4MPC**

**Refinement**. Theorem 1 allows us to check whether the secret branching variable x of a conditional statement if x then p_1 else p_2 that is not used under any secret-dependent conditions can be declassified. If x can be declassified without compromising the security, we feed the result back to the type system before checking the next secret branching variable. This allows us to refine the security levels of variables that are updated in branches; namely, the type inference rule T-If is refined to the following one.

$$
\frac{\mathsf{c}' = (x \text{ can be declassified} \mathrel{?} \mathsf{Pub} : \varrho(x)) \quad \mathsf{c} \sqcup \mathsf{c}' \vdash p_1 : \varrho \Rightarrow \varrho_1 \quad \mathsf{c} \sqcup \mathsf{c}' \vdash p_2 : \varrho \Rightarrow \varrho_2 \quad \varrho' = \varrho_1 \sqcup \varrho_2}{\mathsf{c} \vdash \texttt{if}\ x\ \texttt{then}\ p_1\ \texttt{else}\ p_2 : \varrho \Rightarrow \varrho'}\ [\text{T-If}]
$$

#### **7 Implementation and Evaluation**

We have implemented our approach in a tool named **PoS4MPC**. The workflow of **PoS4MPC** is shown in Fig. 5. The input is an MPC program in C, which is parsed into an intermediate representation (IR) by the LLVM compiler [1], where the call graph and control flow graphs are constructed at the LLVM IR level. We then perform the type inference, which computes a perfect security policy for the given program. To be accurate, we perform a field-sensitive pointer analysis [6], and our type inference is also field-sensitive. As the next step, we leverage the KLEE symbolic execution engine [10] to explore all the feasible symbolic executions, as well as the symbolic values of the return variable and the secret branching variables of each symbolic execution. We fully explore loops, since the bounds of loops in MPC are public and decided by user-specified inputs. Based on these, we iteratively check whether a secret branching variable can be degraded, and the result is fed back to the type inference to refine security levels before checking the next secret branching variable. After that, we transform the program into the input of Obliv-C [44], by which the program can be compiled into executable implementations, one for each party. Obliv-C is an extension of C for implementing 2-party MPC applications using Yao's garbled circuit protocol [43]. For experimental purposes, **PoS4MPC** also supports the high-level MPC framework MPyC [37], which is a Python package for implementing n-party MPC applications (n ≥ 1) using Shamir's secret sharing protocol [38]. The C program is transformed into Python by a translator.

**Table 2.** Number of (secret) branching variables

We also implement an optimization in our tool to alleviate the path explosion problem. Instead of directly checking the validity of Ψ_x(t, t') for each secret branching variable x and pair of symbolic executions t and t', we first check whether the premise φ ∧ Primed(φ') ∧ e = Primed(e') of Ψ_x(t, t') is satisfiable. We can conclude that Ψ_x(t, t') is valid for every secret branching variable x if the premise φ ∧ Primed(φ') ∧ e = Primed(e') is unsatisfiable. Furthermore, this yields a sound compositional reasoning approach which allows us to split a program into a sequence of function calls: when no pair of the symbolic executions of a function can result in the same return value, we can conclude that Ψ_x(t, t') is valid for every secret branching variable x and every pair of symbolic executions t and t' of the entire program. This optimization reduces the time for reasoning about the symbolic executions of PSI (resp. QS) from 95.9 s–8.1 h (resp. 504.6 s) to 1.7 s–79.6 s (resp. 11.6 s) as the input array size varies from 10 to 100 (resp. is 10).
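The optimization can be sketched as a pre-check on the shared premise before any per-variable constraint is examined (enumeration over a finite domain stands in for the SMT solver; all names are hypothetical):

```python
from itertools import product

def premise_sat(phi, ret, phi_p, ret_p, domain, nvars):
    # is φ ∧ Primed(φ') ∧ e = Primed(e') satisfiable at all?
    return any(phi(s) and phi_p(sp) and ret(s) == ret_p(sp)
               for s in product(domain, repeat=nvars)
               for sp in product(domain, repeat=nvars))

def check_pair(phi, ret, bv, phi_p, ret_p, bv_p, domain, nvars):
    """bv / bv_p: map each secret branching variable x to the list of
    symbolic values of x (as predicates over the initial state)."""
    if not premise_sat(phi, ret, phi_p, ret_p, domain, nvars):
        return {x: True for x in bv}           # every Ψ_x is vacuously valid
    return {x: all(not (phi(s) and phi_p(sp) and ret(s) == ret_p(sp))
                   or all(f(s) == g(sp) for f, g in zip(bv[x], bv_p[x]))
                   for s in product(domain, repeat=nvars)
                   for sp in product(domain, repeat=nvars))
            for x in bv}
```

When the two executions can never return the same value, the single premise check replaces all per-variable validity checks, which is exactly what makes the compositional variant cheap.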

#### **7.1 Evaluation Setup**

For an evaluation of our approach, we conduct experiments on five typical 2-party MPC applications [2], i.e., quicksort (QS) [21], linear search (LinS) [13], binary search (BinS) [13], almost search (AlmS), and private set intersection (PSI) [5]. QS outputs the list of indices of a given integer array **a** in its sorted version, where the first half of **a** is given by one party and the second half of **a** by the other party. LinS (resp. BinS and AlmS) outputs the index of an integer b in an array **a** if it exists, and −1 otherwise, where the integer array **a** is the input from one party and the integer b is the input from the other party. LinS always scans the array from start to end, even after it has found the integer b. BinS is a standard iterative approach on a sorted array, where the array index is protected via oblivious random access machine [20]. AlmS is a variant of BinS where the input array is almost sorted, namely, each element is either at its correct position or at the closest neighbour of its correct position. PSI outputs the intersection of two integer sets, each of which is an input from one party.

All the experiments were conducted on a desktop with 64-bit Linux Mint 20.1, Intel Core i5-6300HQ CPU, 2.30 GHz and 8 GB RAM. When evaluating MPC applications, the client of each party is executed with a single thread.

#### **7.2 Performance of Security Policy Synthesis**

**Security Policy**. The results of our approach are shown in Table 2, where column (LOC) shows the number of lines of code, column (#Branch var) shows the number of branching variables, and column (#Other var) shows the number of other variables; columns (After TS) and (After Check) respectively show the number of secret branching variables after applying the type system and after checking whether the secret branching variables can be declassified; columns (Before refinement) and (After refinement) respectively show the number of other secret variables before and after refining the type inference by feeding back the results of the symbolic reasoning. (Note that the input variables are excluded in the counting.)

**Table 3.** Execution time of our security policy synthesis approach

We can observe that only a few variables (2 for QS, 1 for LinS, 2 for BinS, 2 for AlmS, and 2 for PSI) can be found to be public by solely using the type system. With our symbolic reasoning approach, more secret branching variables can be declassified without compromising the security (3 for QS, 1 for LinS, 1 for BinS, 2 for AlmS, and 1 for PSI). After refining the type inference using the results of the symbolic reasoning approach, further secret variables can be declassified (2 for QS, 1 for LinS, and 2 for PSI). Overall, our approach annotates 2, 1, 7, 12, and 1 internal variables as secret out of 10, 4, 10, 16, and 6 variables for QS, LinS, BinS, AlmS, and PSI, respectively.

**Execution Time**. The execution time of our approach is shown in Table 3, where columns (SE) and (Check) respectively show the execution time (in seconds unless indicated by h for hours) of collecting symbolic executions and of checking whether secret branching variables can be declassified, varying the size of the input array for each program from 10 to 100 in steps of 10. We do not report the execution time of our type system, as it is less than 0.1 s for each benchmark.

We can observe that our symbolic reasoning approach is able to check all the secret branching variables in a few minutes (up to 294.4 s) except for QS. After an in-depth analysis, we found that the number of symbolic executions is exponential in the length of the input array for QS and PSI, while it is linear in the length of the input array for the other benchmarks. Our compositional reasoning approach works very well on PSI; otherwise it would take a similar execution time as on QS. Indeed, a loop of PSI is implemented as a sequence of function calls, each of which has a fixed number of symbolic executions. Furthermore, no pair of symbolic executions in the called function can result in the same return value. Therefore, the number of symbolic executions and the execution time of our symbolic reasoning approach are reduced significantly. However, our compositional reasoning approach does not work on QS. Although the number of symbolic executions grows exponentially on QS, the execution time of checking whether secret branching variables can be declassified is still reduced by our optimization, which avoids checking the constraint Ψ_x(t, t') when its premise φ ∧ Primed(φ') ∧ e = Primed(e') is unsatisfiable.

**Fig. 6.** Execution time (Time) in seconds, the number of gates (Gate) in 10<sup>6</sup> gates, and communication (Comm.) in MB using Obliv-C

**Fig. 7.** Execution time (Time) in seconds using MPyC

#### **7.3 Performance Improvement of MPC Applications**

To evaluate the performance improvement of the MPC applications, we compare the execution time (in seconds), the size of the circuits (in 10<sup>6</sup> gates), and the volume of communication traffic (in MB) of each benchmark under the security policies v1 and v2, where v1 is obtained by solely applying our type system and v2 is obtained from v1 by degrading security levels and refinement without compromising the security. The reported improvements are calculated as (result of v1)/(result of v2) − 1, taking the average over 10 repetitions in order to minimize the noise.

**Obliv-C.** The results in Obliv-C are depicted in Fig. 6 (note the logarithmic scale of the vertical axis), where the size of the random input array for each benchmark varies from 10 to 100 with step size 10. Overall, we can observe that the performance improvement is significant, especially on QS. In detail, compared with the security policy v1 on QS (resp. LinS, BinS, AlmS, and PSI), on average the security policy v2 reduces (1) the execution time by 1.56 × 10<sup>5</sup>% (resp. 45%, 38%, 31% and 36%), (2) the size of circuits by 3.61 × 10<sup>5</sup>% (resp. 368%, 52%, 38% and 275%), and (3) the volume of communication traffic by 4.17 × 10<sup>5</sup>% (resp. 367%, 53%, 39% and 274%). This demonstrates the performance improvement of the MPC applications in Obliv-C, which uses Yao's garbled circuit protocol.

**MPyC.** The results in MPyC are depicted in Fig. 7. Since MPyC does not provide the size of circuits or the volume of communication traffic, we only report the execution time in Fig. 7. The results show that degrading security levels also improves the execution time in MPyC, which uses Shamir's secret sharing protocol. Compared with the security policy v1 on benchmark QS (resp. LinS, BinS, AlmS, and PSI), on average the security policy v2 reduces the execution time by 2.5 × 10<sup>4</sup>% (resp. 64%, 23%, 17% and 996%).

We note the difference between the improvements in Obliv-C and MPyC. It arises because: (1) Obliv-C and MPyC use different MPC protocols, whose improvements vary, as Yao's protocol (Obliv-C) is efficient for Boolean computations while the secret-sharing protocol (MPyC) is efficient for arithmetic computations; and (2) the proportion of downgraded variables differs, and a larger proportion of downgraded variables (in particular branching variables with large branches) boosts performance more.

#### **8 Related Work**

**MPC Frameworks.** Early efforts on MPC frameworks provide high-level languages for specifying MPC applications and compilers for translating them into executable implementations [8,23,31,32]. For instance, Fairplay compiles 2-party MPC programs written in a domain-specific language into Yao's garbled circuits [31]. FairplayMP [8] extends Fairplay to the multi-party setting using a modified version of the BMR protocol [7] with a Java interface. Other efforts aim at improving the efficiency of operations in circuits and the size of circuits. Mixed MPC protocols have also been proposed to improve efficiency [9,26,34], as the efficiency of MPC protocols varies across operations. These frameworks explore the implementation space of operations in specific MPC protocols (e.g., garbled circuits, secret sharing, and homomorphic encryption), as well as their conversions. However, all these frameworks either compile an MPC program entirely or compile it according to user-annotated secret variables to improve performance, without formal security guarantees. Our approach improves the performance of MPC applications by declassifying secret variables without compromising security, which is orthogonal to the above optimization work.

**Security of MPC Applications.** MPC applications implemented in MPC frameworks are not necessarily secure, due to information leakage during execution in the real world. Therefore, information-flow type systems and data-flow analyses have been adopted in MPC frameworks, e.g., [24,37,44]. However, they only consider security verification, not the automatic generation of security policies as we do in the current paper. Moreover, these approaches cannot identify some variables (e.g., c2 in our motivating example) that can actually be declassified without compromising security. Kerschbaum [25] proposed to infer public intermediate values by reasoning in an epistemic modal logic, with a goal similar to ours of declassifying secret variables. However, it is unclear how efficient this approach is, as its performance was not reported [25].

Alternatively, self-composition, which reduces the security problem to a safety problem on two copies of a program, has been adopted by [3], where the safety problem can be solved by safety verification tools. However, safety verification remains challenging, and these approaches often require user annotations (e.g., procedure contracts and loop invariants) that are non-trivial for MPC practitioners. Our work differs from them in that: (1) they only use the self-composition reduction to verify security instead of automatically generating a security policy; (2) they have to check almost all the program variables, which is computationally expensive, while we first apply an efficient type system to infer a security policy and then only check whether the secret branching variables in the security policy can be declassified; and (3) we check whether secret branching variables can be declassified by reasoning about pairs of symbolic executions, which can be seen as a divide-and-conquer approach without annotations, and the results can be fed back to the type system to efficiently refine security levels. We remark that the self-composition reduction could also be used to check whether a secret branching variable can be declassified.

**Information-Flow Analysis.** A rich body of literature has studied the verification of information-flow security and noninterference in programs [12], which requires that confidential data does not flow to outputs. This is too restrictive for programs which allow secret data to flow to some non-secret outputs, e.g., MPC applications; therefore the security notion was later extended with declassification (a.k.a. delimited release) [27]. These security properties are verified by type systems (e.g., [27]), self-composition (e.g., [39]), or relational reasoning (e.g., [4]). Some of these techniques have been adapted to verify timing side-channel security, e.g., [11,30,42]. However, as the usual notions of security in these settings do not require reasoning about arbitrary leakage, these techniques are not directly applicable to our setting. Different from existing analyses using symbolic execution [33], our approach takes advantage of the public outputs of MPC programs and regards them as a part of the leakage, avoiding the false positives of the noninterference approach and the quantification of information flow.

Finally, we remark that the leakage model considered in this work differs from those used in power side-channel security [16–19,45] and timing side-channel security [11,30,36,42], which leverage physical side-channel information; ours instead assumes that the adversary is able to observe all the public information during the computation.

#### **9 Conclusion**

We have formalized the leakage of an MPC application, which bridges the language-level and protocol-level leakages via security policies. Based on this formalization, we have presented an approach to automatically synthesize a security policy that can improve the performance of MPC applications without compromising their privacy. Our approach is essentially a synergistic integration of type inference and symbolic reasoning with security-type refinement. We implemented our approach in the tool **PoS4MPC**. The experimental results on five typical MPC applications confirm that our approach can significantly improve the performance of MPC applications.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Explaining Hyperproperty Violations**

Norine Coenen<sup>1(B)</sup>, Raimund Dachselt<sup>2</sup>, Bernd Finkbeiner<sup>1</sup>, Hadar Frenkel<sup>1</sup>, Christopher Hahn<sup>1</sup>, Tom Horak<sup>3</sup>, Niklas Metzger<sup>1</sup>, and Julian Siber<sup>1</sup>

<sup>1</sup> CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
{norine.coenen,finkbeiner,hadar.frenkel,christopher.hahn,niklas.metzger,julian.siber}@cispa.de

<sup>2</sup> Interactive Media Lab, Technische Universität Dresden, Dresden, Germany
dachselt@acm.org

<sup>3</sup> elevait GmbH & Co. KG, Dresden, Germany
tom.horak@elevait.de

**Abstract.** Hyperproperties relate multiple computation traces to each other. Model checkers for hyperproperties thus return, in case a system model violates the specification, a set of traces as a counterexample. Fixing the erroneous relations between traces in the system that led to the counterexample is a difficult manual effort that highly benefits from additional explanations. In this paper, we present an explanation method for counterexamples to hyperproperties described in the specification logic HyperLTL. We extend Halpern and Pearl's definition of actual causality to sets of traces witnessing the violation of a HyperLTL formula, which allows us to identify the events that caused the violation. We report on the implementation of our method and show that it significantly improves on previous approaches for analyzing counterexamples returned by HyperLTL model checkers.

# **1 Introduction**

While model checking algorithms and tools (e.g., [12,17,18,26,47,55]) have, in the past, focused on trace properties, recent failures in security-critical systems, such as Heartbleed [28], Meltdown [59], Spectre [52], or Log4j [1], have triggered the development of model checking algorithms for properties that relate multiple computation traces to each other, i.e., *hyperproperties* [21]. Although the counterexample returned by such a model checker for hyperproperties, which takes the shape of a *set* of traces, may aid in the debugging process, understanding and narrowing down which features are actually responsible for the erroneous

This work was funded by DFG grant 389792660 as part of TRR 248 – CPEC, by the DFG as part of Germany's Excellence Strategy EXC 2050/1 – Project ID 390696704 – Cluster of Excellence "*Centre for Tactile Internet*" (CeTI) of TU Dresden, by the European Research Council (ERC) Grant OSARES (No. 683300), and by the German Israeli Foundation (GIF) Grant No. I-1513-407./2019.

S. Shoham and Y. Vizel (Eds.): CAV 2022, LNCS 13371, pp. 407–429, 2022. https://doi.org/10.1007/978-3-031-13185-1\_20

relation between the traces in the counterexample requires significantly more manual effort than for trace properties. In this paper, we develop an explanation technique for these more complex counterexamples that identifies the *actual causes* [44–46] of hyperproperty violations.

Existing hyperproperty model checking approaches (e.g., [33,35,49]) take a HyperLTL formula as input. HyperLTL is a temporal logic extending LTL with explicit trace quantification [20]. For example, observational determinism, which requires that all traces π, π′ agree on their observable outputs *lo* whenever they agree on their observable inputs *li*, can be formalized in HyperLTL as ∀π.∀π′. □(*li*<sub>π</sub> ↔ *li*<sub>π′</sub>) → □(*lo*<sub>π</sub> ↔ *lo*<sub>π′</sub>). In case a system model violates observational determinism, the model checker consequently returns a set of two execution traces witnessing the violation.

A first attempt at explaining model checking results of HyperLTL specifications has been made with HyperVis [48], which visualizes a counterexample returned by the model checker MCHyper [35] in a browser application. While the visualizations are already useful to analyze the counterexample at hand, the tool fails to identify causes for the violation in several security-critical scenarios. This is because HyperVis identifies important atomic propositions that appear in the HyperLTL formula and highlights these in the trace and the formula. For detecting causes, however, this is insufficient: a cause for a violation of observational determinism, for example, could be a branch on the valuation of a secret input, which is not even part of the formula (see Sect. 3 for a running example).

Defining what constitutes an actual cause for an effect (a violation) in a given scenario is a seminal contribution by Halpern and Pearl [44–46], who refined and formalized earlier approaches based on counterfactual reasoning [58]: causes are sets of events such that, in the counterfactual world where they do not appear, the effect does not occur either. One of the main insights of Halpern and Pearl's work, however, is that naive counterfactuals are too imprecise. If, for instance, our actual cause preempted another potential cause, the mere absence of the actual cause will not be enough to prevent the effect, which will still be produced by the other cause in the new scenario. Halpern and Pearl's definition therefore allows one to carefully control for other possible causes through the notion of *contingencies*. In the modified definition [44], contingencies allow fixing certain features of the counterfactual world to be exactly as they are in the actual world, regardless of the system at hand. Such a contingency effectively modifies the dynamics of the underlying model, and one insight of our work is that defining actual causality for reactive systems also requires modifying the system under a contingency. Notably, most works on trace causality [13,39] consider only counterfactuals, not contingencies, and thus are not able to find true actual causes.

In this paper, we develop the notion of actual causality for effects described by HyperLTL formulas and use the generated causes as explanations for counterexamples returned by a model checker. We show that an implementation of our algorithm is practically feasible and significantly advances the state of the art in explaining and analyzing HyperLTL model checking results.

# **2 Preliminaries**

We model a system as a *Moore machine* [62] T = (S, s<sub>0</sub>, AP, δ, l), where S is a finite set of states, s<sub>0</sub> ∈ S is the initial state, AP = I ∪· O is the set of atomic propositions consisting of inputs I and outputs O, δ : S × 2<sup>I</sup> → S is the transition function determining the successor state for a given state and set of inputs, and l : S → 2<sup>O</sup> is the labeling function mapping each state to a set of outputs. A *trace* t = t<sub>0</sub>t<sub>1</sub>t<sub>2</sub> ... ∈ (2<sup>AP</sup>)<sup>ω</sup> of T is an infinite sequence of sets of atomic propositions with t<sub>i</sub> = A<sub>i</sub> ∪ l(s<sub>i</sub>), where A<sub>i</sub> ⊆ I and δ(s<sub>i</sub>, A<sub>i</sub>) = s<sub>i+1</sub> for all i ≥ 0. We usually write t[n] to refer to the set t<sub>n</sub> at the (n + 1)-th position of t. With *traces*(T), we denote the set of all traces of T. For some sequence of inputs a = a<sub>0</sub>a<sub>1</sub>a<sub>2</sub> ... ∈ (2<sup>I</sup>)<sup>ω</sup>, the trace T(a) is defined by T(a)<sub>i</sub> = a<sub>i</sub> ∪ l(s<sub>i</sub>) and δ(s<sub>i</sub>, a<sub>i</sub>) = s<sub>i+1</sub> for all i ≥ 0. A trace property P ⊆ (2<sup>AP</sup>)<sup>ω</sup> is a set of traces. A hyperproperty H is a lifting of a trace property, i.e., a *set of sets of traces*. A model T satisfies a hyperproperty H if the set of traces of T is an element of the hyperproperty, i.e., *traces*(T) ∈ H.
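The trace construction above can be sketched in a few lines of Python. This is a minimal illustration under our own encoding (all names are ours, not from the paper): states, inputs, and outputs are plain strings, and each trace letter is the union of the current input set with the label of the current state.

```python
# Sketch (assumption: our own encoding of a Moore machine, not the paper's code).
class MooreMachine:
    def __init__(self, s0, delta, label):
        self.s0 = s0          # initial state
        self.delta = delta    # dict: (state, frozenset of inputs) -> state
        self.label = label    # dict: state -> frozenset of outputs

    def trace(self, inputs):
        """Yield t[i] = A_i ∪ l(s_i) while following the transition function."""
        s = self.s0
        for a in inputs:
            yield frozenset(a) | self.label[s]
            s = self.delta[(s, frozenset(a))]

# Toy two-state machine: output 'o' iff the previous input contained 'i'.
delta = {("s0", frozenset()): "s0", ("s0", frozenset({"i"})): "s1",
         ("s1", frozenset()): "s0", ("s1", frozenset({"i"})): "s1"}
label = {"s0": frozenset(), "s1": frozenset({"o"})}
machine = MooreMachine("s0", delta, label)
prefix = list(machine.trace([{"i"}, set(), {"i"}, set()]))
```

Since the label depends only on the state, the output at step i + 1 reflects the input at step i, which is exactly the Moore-machine discipline used throughout the paper.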

#### **2.1 HyperLTL**

HyperLTL is a recently introduced logic for expressing temporal hyperproperties, extending linear-time temporal logic (LTL) [64] with trace quantification:

$$\begin{aligned} \varphi &::= \forall \pi.\varphi \mid \exists \pi.\varphi \mid \psi\\ \psi &::= a\_{\pi} \mid \neg \psi \mid \psi \land \psi \mid \mathsf{O}\,\psi \mid \psi \mathcal{U}\psi \end{aligned}$$

We also consider the usual derived Boolean (∨, →, ↔) and temporal operators (ϕ R ψ ≡ ¬(¬ϕ U ¬ψ), ◇ϕ ≡ *true* U ϕ, □ϕ ≡ *false* R ϕ). The semantics of HyperLTL formulas is defined with respect to a set of traces *Tr* and a trace assignment Π : V → *Tr* that maps trace variables to traces. To update the trace assignment so that it maps trace variable π to trace t, we write Π[π → t].


We explain counterexamples found by MCHyper [24,35], which is a model checker for HyperLTL formulas, building on ABC [12]. MCHyper takes as inputs a hardware circuit, specified in the Aiger format [8], and a HyperLTL formula. MCHyper solves the model checking problem by computing the self-composition [6] of the system. If the system violates the HyperLTL formula, MCHyper returns a counterexample. This counterexample is a set of traces through the original system that together violate the HyperLTL formula. Depending on the type of violation, this counterexample can then be used to debug the circuit or refine the specification iteratively.
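The self-composition idea behind ∀π.∀π′ model checking can be illustrated on finite data. The following brute-force sketch (an assumption of ours, not MCHyper's actual symbolic algorithm) checks the quantifier-free body on every pair of trace prefixes and returns a violating pair as a counterexample; `obs_det` is a hypothetical finite-prefix rendering of observational determinism.

```python
# Sketch (assumption): brute-force ∀π.∀π' check over finite trace prefixes.
from itertools import product

def obs_det(t1, t2):
    """Finite-prefix body of observational determinism: li-agreement implies lo-agreement."""
    lows_in_agree = all(("li" in a) == ("li" in b) for a, b in zip(t1, t2))
    lows_out_agree = all(("lo" in a) == ("lo" in b) for a, b in zip(t1, t2))
    return (not lows_in_agree) or lows_out_agree   # implication

def check_forall_forall(traces, body):
    """Return None if every pair satisfies `body`, else a counterexample pair."""
    for t1, t2 in product(traces, repeat=2):
        if not body(t1, t2):
            return (t1, t2)   # the set of traces witnessing the violation
    return None

cex = check_forall_forall([[{"lo"}], [{"hi"}]], obs_det)
```

Here the two one-step traces agree on (absent) low inputs but differ on low outputs, so a counterexample pair is returned, mirroring the shape of MCHyper's output.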

#### **2.2 Actual Causality**

A formal definition of what actually causes an observed effect in a given context has been proposed by Halpern and Pearl [45]. Here, we outline the version later modified by Halpern [44]. Causality is defined with respect to a *causal model* M = (S, F), given by a *signature* S and a set of *structural equations* F, which define the dynamics of the system. A signature S is a tuple (U, V, D), where U and V are disjoint sets of variables, termed *exogenous* and *endogenous* variables, respectively, and D defines the *range* of possible values D(Y) for all variables Y ∈ U ∪ V. A *context* u is an assignment to the variables in U ∪ V such that the values of the exogenous variables are determined by factors outside of the model, while the value of each endogenous variable X is defined by the associated structural equation f<sub>X</sub> ∈ F. An *effect* ϕ in a causal model is a Boolean formula over assignments to endogenous variables. We say that a context u of a model M satisfies a partial variable assignment X = x, for X ⊆ U ∪ V, if the assignments in u and in x coincide for every variable in X. The extension to Boolean formulas over variable assignments is as expected. For a context u and a partial variable assignment X = x, we denote by (M, u)[X ← x] the context u in which the values of the variables in X are set according to x, and all other values are computed according to the structural equations.

The actual causality framework of Halpern and Pearl aims at defining what events (given as variable assignments) are the cause for the occurrence of an effect in a specific given context. We now provide the formal definition.

**Definition 1 (**[44,45]**).** *A partial variable assignment* X = x *is an* actual cause *of the effect* ϕ *in* (M, u) *if the following three conditions hold.*

*AC1:* (M, u) ⊨ X = x *and* (M, u) ⊨ ϕ*, i.e., both cause and effect are true in the actual world.*

*AC2: There is a set* W ⊆ V *of endogenous variables and an assignment* x′ *to the variables in* X *s.t. if* (M, u) ⊨ W = w*, then* (M, u)[X ← x′, W ← w] ⊨ ¬ϕ*.*

*AC3:* X *is minimal, i.e., no subset of* X *satisfies AC1 and AC2.*

Intuitively, AC2 states that in the counterfactual world obtained by intervening on the cause X = x in the actual world (that is, setting the variables in X to x′), the effect does not appear either. However, intervening on the possible cause might not be enough, for example when that cause preempted another. After the intervention, this other cause may produce the effect again, thereby clouding the effect of the intervention. To address this problem, AC2 allows resetting values through the notion of *contingencies*: the set of variables W can be reset to w, which is (implicitly) universally quantified. However, since the actual world has to model W = w, it is in fact uniquely determined. AC3, lastly, enforces the cause to be minimal by requiring that all variables in X are strictly necessary to achieve AC1 and AC2. For an illustration of Halpern and Pearl's actual causality, see Example 1 in Sect. 3.

#### **3 Running Example**

Consider a security-critical setting with two security levels: a high-security level h and a low-security level l. Inputs and outputs labeled as high-security, denoted by *hi* and *ho* respectively, are confidential and thus only visible to the user itself, or, e.g., admins. Inputs and outputs labeled as low-security, denoted by *li* and *lo* respectively, are public and are considered to be observable by an attacker.

Our system of interest is modeled by the state-graph representation shown in Fig. 1, which an attacker treats as a black box. The system is run without any low-security inputs, but branches depending on the given high-security inputs. If, in one of the first two steps of an execution, a high-security input *hi* is encountered, the system outputs only the high-security variable *ho* directly afterwards and, in the subsequent steps, both outputs, regardless of inputs. If no high-security input is given in the first step, the low-security output *lo* is enabled, and after the second step both outputs are again enabled, regardless of what input is fed into the system.

**Fig. 1.** State graph representation of our example system.

A prominent example hyperproperty is *observational determinism* from the introduction, which states that any sequence of low-security inputs always produces the same low-security outputs, regardless of the high-security inputs: ϕ = ∀π.∀π′. □(*li*<sub>π</sub> ↔ *li*<sub>π′</sub>) → □(*lo*<sub>π</sub> ↔ *lo*<sub>π′</sub>). The formula states that all traces π and π′ must agree on the low-security outputs if they agree on the low-security inputs. Our system at hand does not satisfy observational determinism, because the low-security outputs in the first two steps depend on the present high-security inputs. Running MCHyper, a model checker for HyperLTL, results in the following counterexample: t<sub>1</sub> = {}{*lo*}{*ho*, *lo*}<sup>ω</sup> and t<sub>2</sub> = {*hi*}{*hi*, *ho*}{*ho*, *lo*}<sup>ω</sup>. With the same low-security input (none), the traces produce different low-security outputs by visiting s<sub>1</sub> or s<sub>2</sub> on the way to s<sub>3</sub>.
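The violation can be checked mechanically on the finite prefixes of the two counterexample traces. The following sketch (our own encoding, with trace letters as Python sets) confirms that the low inputs agree everywhere while the low outputs differ at the second position.

```python
# Sketch (assumption: finite prefixes of the counterexample traces from the text).
t1 = [set(), {"lo"}, {"ho", "lo"}]           # t1 = {}{lo}{ho,lo}^ω (prefix)
t2 = [{"hi"}, {"hi", "ho"}, {"ho", "lo"}]    # t2 = {hi}{hi,ho}{ho,lo}^ω (prefix)

# Low inputs agree at every position (neither trace contains 'li' at all).
low_inputs_agree = all(("li" in a) == ("li" in b) for a, b in zip(t1, t2))
# Low outputs differ: 'lo' ∈ t1[1] but 'lo' ∉ t2[1].
low_outputs_agree = all(("lo" in a) == ("lo" in b) for a, b in zip(t1, t2))

violates_obs_det = low_inputs_agree and not low_outputs_agree   # True
```

This is exactly the situation the model checker reports: equal low-security inputs, diverging low-security outputs.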

In this paper, our goal is to explain the violation of a HyperLTL formula on such a counterexample. Following Halpern and Pearl's explanation framework [46], an actual cause that is considered to be possibly true or possibly false constitutes an explanation for the user. We only consider causes over input variables, which can be true and false in any model. Hence, finding an explanation amounts to answering which inputs caused the violation on a specific counterexample. Before we answer this question for HyperLTL and the corresponding counterexamples given by sets of traces (see Sect. 4), we first illustrate Halpern and Pearl's actual causality (see Sect. 2.2) with the above running example.

*Example 1.* Finite executions of a system can be modeled in Halpern and Pearl's causal models. Consider inputs as exogenous variables U = {*hi*<sub>0</sub>, *hi*<sub>1</sub>} and outputs as endogenous variables V = {*lo*<sub>1</sub>, *lo*<sub>2</sub>, *ho*<sub>1</sub>, *ho*<sub>2</sub>}. The indices model at which step of the execution the variable appears. We omit the inputs at the third position and the outputs at the first position because they are not relevant for the following exposition. We have D(Y) = {0, 1} for every Y ∈ U ∪ V. Now, the following manually constructed structural equations encode the transitions: (1) *lo*<sub>1</sub> = ¬*hi*<sub>0</sub>, (2) *ho*<sub>1</sub> = *hi*<sub>0</sub>, (3) *lo*<sub>2</sub> = ¬*hi*<sub>1</sub> ∨ ¬*lo*<sub>1</sub>, and (4) *ho*<sub>2</sub> = *lo*<sub>1</sub> ∨ *ho*<sub>1</sub>. Consider the context u = {*hi*<sub>0</sub> = 0, *hi*<sub>1</sub> = 1}, the effect ϕ ≡ *lo*<sub>1</sub> = 1 ∨ *lo*<sub>2</sub> = 1, and the candidate cause *hi*<sub>0</sub> = 0. Because of (1), we have (M, u) ⊨ *hi*<sub>0</sub> = 0 and (M, u) ⊨ *lo*<sub>1</sub> = 1, hence AC1 is satisfied. Regarding AC2, this example allows us to illustrate the need for contingencies to accurately determine the actual cause: If we only consider intervening on the candidate cause *hi*<sub>0</sub> = 0, we still have (M, u)[*hi*<sub>0</sub> ← 1] ⊨ ϕ, because with *lo*<sub>1</sub> = 0 and (3) it follows that (M, u)[*hi*<sub>0</sub> ← 1] ⊨ *lo*<sub>2</sub> = 1. However, in the actual world, the second high input has no influence on the effect. We can control for this by considering the contingency *lo*<sub>2</sub> = 0, which is satisfied in the actual world, but not after the intervention on *hi*<sub>0</sub>. Because of this contingency, we then have (M, u)[*hi*<sub>0</sub> ← 1, *lo*<sub>2</sub> ← 0] ⊨ ¬ϕ, and hence AC2 holds. Because a singleton set automatically satisfies AC3, we can infer that the first high input *hi*<sub>0</sub> = 0 was the actual cause for any low output being enabled in the actual world.
Note that, intuitively, the contingency allows us to ignore some of the structural equations by ignoring the value they assign to *lo*<sup>2</sup> in this context. Our definitions in Sect. 4 will allow similar modifications for counterexamples to hyperproperties.
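Example 1 is small enough to replay programmatically. The sketch below (our own transcription; the function names are hypothetical) evaluates the structural equations (1)–(4), where interventions and contingencies simply override individual endogenous variables before the equations are applied.

```python
# Sketch (assumption): Example 1's structural equations, transcribed directly.
def solve(u, overrides=None):
    """Evaluate the endogenous variables under context u; `overrides` fixes
    some of them (interventions on inputs go into u, contingencies go here)."""
    v = dict(overrides or {})
    hi0, hi1 = u["hi0"], u["hi1"]
    v.setdefault("lo1", int(not hi0))                   # (1) lo1 = ¬hi0
    v.setdefault("ho1", hi0)                            # (2) ho1 = hi0
    v.setdefault("lo2", int(not hi1 or not v["lo1"]))   # (3) lo2 = ¬hi1 ∨ ¬lo1
    v.setdefault("ho2", int(v["lo1"] or v["ho1"]))      # (4) ho2 = lo1 ∨ ho1
    return v

u = {"hi0": 0, "hi1": 1}
effect = lambda v: v["lo1"] == 1 or v["lo2"] == 1       # ϕ ≡ lo1=1 ∨ lo2=1

actual = solve(u)                                       # AC1: ϕ holds via lo1 = 1
naive = solve({"hi0": 1, "hi1": 1})                     # hi0 ← 1 alone: ϕ still holds via lo2
contingent = solve({"hi0": 1, "hi1": 1}, {"lo2": 0})    # AC2: with contingency lo2 ← 0, ¬ϕ
```

The run reproduces the argument in the text: the naive intervention fails because equation (3) re-produces the effect, and only the contingency lo2 ← 0 exposes hi0 = 0 as the actual cause.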

# **4 Causality for Hyperproperty Violations**

Our goal in this section is to formally define actual causality for the violation of a hyperproperty described by a general HyperLTL formula ϕ, observed in a counterexample to ϕ. Such a counterexample is given by a trace assignment to the trace variables appearing in ϕ. Note that, for universal quantifiers, the assignment of a single trace to the bound variable suffices to define a counterexample. For existential quantifiers, this is not the case: to prove that an existential quantifier cannot be instantiated, we need to show that no system trace satisfies the formula in its body, i.e., provide a proof for the whole system. In this work, we are interested in explaining violations of hyperproperties, not proofs of their satisfaction [16]. Hence, we limit ourselves to instantiations of the outermost universal quantifiers of a HyperLTL formula, which can be returned by model checkers like MCHyper [24,35]. Since our goal is to explain counterexamples, restricting ourselves to results returned by existing model checkers is reasonable. Note that MCHyper can still handle formulas of the form ∀<sup>n</sup>∃<sup>m</sup>ϕ, where ϕ is quantifier-free, including interesting information-flow policies like generalized noninterference [61]. The returned counterexample then only contains the n traces that instantiate the universal quantifiers; the existential quantifiers are not instantiated, for the above reason. In the following, we restrict ourselves to formulas and counterexamples of this form.

**Definition 2 (Counterexample).** *Let* T *be a transition system, denote traces*(T) := *Tr, and let* ϕ *be a HyperLTL formula of the form* ∀π<sub>1</sub> ... ∀π<sub>k</sub>. ψ*, where* ψ *is a HyperLTL formula that does not start with* ∀*. A counterexample to* ϕ *in* T *is a partial trace assignment* Γ : {π<sub>1</sub>,...,π<sub>k</sub>} → *Tr such that* Γ, 0 ⊨<sub>*Tr*</sub> ¬ψ*.*

For ease of notation, we sometimes refer to Γ simply as the tuple of its instantiations Γ = ⟨Γ(π<sub>1</sub>),...,Γ(π<sub>k</sub>)⟩. In terms of Halpern and Pearl's actual causality as outlined in Sect. 2.2, a counterexample describes the actual world at hand, which we want to explain. As a next step, we need to define an appropriate language to reason about possible causes and contingencies in our counterexample. We will use sets of *events*, i.e., values of atomic propositions at a specific position of a specific trace in the counterexample.

**Definition 3 (Event).** *An event is a tuple* e = ⟨l<sub>a</sub>, n, t⟩ *such that* l<sub>a</sub> = a *or* l<sub>a</sub> = ¬a *for some atomic proposition* a ∈ AP*,* n ∈ ℕ *is a point in time, and* t ∈ (2<sup>AP</sup>)<sup>ω</sup> *is a trace of a system* T*. We say that a counterexample* Γ = ⟨t<sub>1</sub>,...,t<sub>k</sub>⟩ satisfies *a set of events* C*, denoted* Γ ⊨ C*, if for every event* ⟨l<sub>a</sub>, n, t⟩ ∈ C *the two following conditions hold:*

*1.* t = t<sub>i</sub> *for some* i ∈ {1,...,k}*, i.e., all events in* C *reason about traces in* Γ*,*
*2.* l<sub>a</sub> = a *iff* a ∈ t<sub>i</sub>[n]*, i.e.,* a *holds on trace* t<sub>i</sub> *of the counterexample at time* n*.*

We assume that the set *AP* is a disjoint union of input and output propositions, that is, *AP* = I ∪· O. We say that ⟨l<sub>a</sub>, n, t⟩ is an *input event* if a ∈ I, and we call it an *output event* if a ∈ O. We denote the set of input events by *IE* and the set of output events by *OE*. These events have a direct correspondence with the variables appearing in Halpern and Pearl's causal models: we can identify input events with exogenous variables (because their value is determined by factors outside of the system) and output events with endogenous variables.

We define a cause as a set of input events, while an effect is a possibly infinite Boolean formula over *OE*. Note that, similar to [37], every HyperLTL formula can be represented as a first-order formula over events, e.g., ∀π∀π′. □(a<sub>π</sub> ↔ a<sub>π′</sub>) = ∀π∀π′. ⋀<sub>n∈ℕ</sub>(⟨a, n, π⟩ ↔ ⟨a, n, π′⟩). For some set of events S, let <sup>+</sup>S<sup>k</sup><sub>π</sub> = {a ∈ *AP* | ⟨a, k, π⟩ ∈ S} denote the set of atomic propositions defined positively by S on trace π at position k. Dually, we define <sup>−</sup>S<sup>k</sup><sub>π</sub> = {a ∈ *AP* | ⟨¬a, k, π⟩ ∈ S}.
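The projections <sup>+</sup>S<sup>k</sup><sub>π</sub> and <sup>−</sup>S<sup>k</sup><sub>π</sub> are straightforward set comprehensions. In the sketch below (our own encoding), an event ⟨l<sub>a</sub>, n, t⟩ is a tuple `(sign, a, n, t)` where `sign` is `True` for the positive literal a and `False` for ¬a.

```python
# Sketch (assumption): events encoded as (sign, proposition, time, trace_id).
def pos(S, k, pi):
    """+S^k_π: propositions that S defines positively on trace pi at time k."""
    return {a for (sign, a, n, t) in S if sign and n == k and t == pi}

def neg(S, k, pi):
    """-S^k_π: propositions that S defines negatively on trace pi at time k."""
    return {a for (sign, a, n, t) in S if not sign and n == k and t == pi}

S = {(True, "hi", 0, "t2"), (False, "lo", 2, "t2")}
```

For the set S above, ⟨hi, 0, t<sub>2</sub>⟩ lands in the positive projection at time 0, and ⟨¬lo, 2, t<sub>2</sub>⟩ in the negative projection at time 2.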

In order to define actual causality for hyperproperties, we need to formally define how we obtain the counterfactual executions under some contingency for the case of events on infinite traces. We define a contingency as a set of output events. Mapping Halpern and Pearl's definition to transition systems, contingencies reset outputs in the counterfactual traces back to their value in the original counterexample, which amounts to changing the state of the system and then following the transition function from the new state. For a given trace of the counterexample, we describe all possible behaviors under *arbitrary* contingencies with the help of a counterfactual automaton. The concrete contingency on a trace is defined by additional input variables. In the following, let *I<sub>C</sub>* = {o<sub>C</sub> | o ∈ O} be a set of auxiliary input variables expressing whether a contingency is invoked at the given step of the execution, and let c : O → *I<sub>C</sub>* be the function with c(o) = o<sub>C</sub>.

**Definition 4 (Counterfactual Automaton).** *Let* T = (S, s<sub>0</sub>, *AP*, δ, l) *be a system with* S = 2<sup>O</sup>*, i.e., every state is uniquely labeled, and there exists a state for every combination of outputs. Let* π = π<sub>0</sub> ...π<sub>i</sub>(π<sub>j</sub> ...π<sub>n</sub>)<sup>ω</sup> ∈ *traces*(T) *be a trace of* T *in a finite, lasso-shaped representation. The counterfactual automaton* T<sup>C</sup><sub>π</sub> = (S × {0 ...n}, (s<sub>0</sub>, 0), (*I<sub>C</sub>* ∪· I) ∪· (O ∪· {0 ...n}), δ<sup>C</sup>, l<sup>C</sup>) *is defined as follows:*

$$\begin{array}{l} \delta^{C}((s,k),Y) = (s',k') \text{ where } k' = j \text{ if } k = n \text{, else } k' = k + 1 \text{, and}\\ \quad l(s') = \{ o \in O \mid (o \in \delta(s, Y \cap I) \land c(o) \notin Y) \lor (o \in \pi_{k'} \land c(o) \in Y) \},\\ l^{C}((s,k)) = l(s) \cup \{k\}. \end{array}$$

A counterfactual automaton is effectively a chain of copies of the original system, of the same length as the counterexample. An execution through the counterfactual automaton starts in the first copy, corresponding to the first position in the counterexample trace, and then moves through the chain until it eventually loops back from copy n to copy j. A transition in the counterfactual automaton can additionally set some output variable o as a contingency if the auxiliary input variable o<sub>C</sub> is enabled. In this case, the execution moves to a state in the next copy of the chain where all outputs are as usual, except o, which has the same value as in the counterexample π. Note that, under the assumption that all states of the original system are uniquely labeled and there exists a state for every combination of output variables, the function δ<sup>C</sup> is uniquely determined.<sup>1</sup> A counterfactual automaton for our running example is described in the full version of this paper [22].
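One transition of this construction can be sketched as follows (our own encoding, not the paper's implementation): states are identified with their output labels, auxiliary contingency inputs are strings of the form `"c_o"`, and an output appears in the successor either because the original transition function produces it (no contingency) or because the counterexample trace π contains it (contingency raised).

```python
# Sketch (assumption): one step of δ^C from Definition 4, states = output labels.
def counterfactual_step(delta, pi, outputs, state, k, n, j, inputs):
    """Return (successor label, successor copy index) of the chain automaton."""
    k2 = j if k == n else k + 1                 # loop back from copy n to copy j
    real = frozenset(i for i in inputs if not i.startswith("c_"))
    succ = delta[(state, real)]                  # successor label in the system
    label = frozenset(o for o in outputs
                      if (o in succ and "c_" + o not in inputs)      # normal step
                      or (o in pi[k2] and "c_" + o in inputs))       # contingency
    return label, k2

# Toy instance: one output 'o', one transition, lasso with n = 1, j = 0.
delta = {(frozenset(), frozenset()): frozenset({"o"})}
pi = [frozenset(), frozenset()]                  # counterexample never outputs 'o'
lbl, k2 = counterfactual_step(delta, pi, {"o"}, frozenset(), 0, 1, 0, set())
lbl_c, _ = counterfactual_step(delta, pi, {"o"}, frozenset(), 0, 1, 0, {"c_o"})
```

Without the contingency input the successor outputs `o` as the system dictates; with `c_o` raised, `o` is reset to its (absent) value in π, exactly the "modified dynamics" the text describes.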

Next, we need to define how we intervene on a set of traces with a candidate cause given as a set of input events and a contingency given as a set of output events. We define an intervention function, which transforms a trace of our original automaton into an input sequence of a counterfactual automaton.

**Definition 5 (Intervention).** *For a cause* C ⊆ *IE, a contingency* W ⊆ *OE, and a trace* π*, the function intervene* : (2<sup>AP</sup>)<sup>ω</sup> × 2<sup>*IE*</sup> × 2<sup>*OE*</sup> → (2<sup>I ∪ *I<sub>C</sub>*</sup>)<sup>ω</sup> *returns a trace such that for all* k ∈ ℕ *the following holds: intervene*(π, C, W)[k] = (π[k] \ <sup>+</sup>C<sup>k</sup><sub>π</sub>) ∪ <sup>−</sup>C<sup>k</sup><sub>π</sub> ∪ {c(o) | o ∈ <sup>+</sup>W<sup>k</sup><sub>π</sub> ∪ <sup>−</sup>W<sup>k</sup><sub>π</sub>}*. We lift the intervention function to counterexamples given as a tuple* Γ = ⟨π<sub>1</sub>,...,π<sub>k</sub>⟩ *as follows: intervene*(Γ, C, W) = ⟨T<sup>C</sup><sub>π<sub>1</sub></sub>(*intervene*(π<sub>1</sub>, C, W)),...,T<sup>C</sup><sub>π<sub>k</sub></sub>(*intervene*(π<sub>k</sub>, C, W))⟩*.*

Intuitively, the intervention function *flips* all the events that appear in the cause C: If some a ∈ I appears positively in the candidate cause C, it will appear negatively in the resulting input sequence, and vice versa. For a contingency W, the intervention function enables the corresponding auxiliary input for the counterfactual automaton at the appropriate time point, irrespective of the event's polarity, as the counterfactual automaton will take care of matching the atomic proposition's value to its value in the original counterexample Γ.
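On a finite prefix, the per-position rule of Definition 5 can be sketched directly (our own encoding; since a single trace is intervened on, events drop their trace component and become `(sign, proposition, time)` tuples, and c(o) is rendered as the string `"c_o"`).

```python
# Sketch (assumption): intervene(π, C, W)[k] = (π[k] \ +C^k) ∪ -C^k ∪ {c(o) | o ∈ ±W^k}.
def intervene(pi, cause, contingency):
    out = []
    for k, letter in enumerate(pi):
        drop = {a for (sign, a, n) in cause if sign and n == k}       # +C^k: flip off
        add = {a for (sign, a, n) in cause if not sign and n == k}    # -C^k: flip on
        aux = {"c_" + a for (_, a, n) in contingency if n == k}       # raise c(o)
        out.append((set(letter) - drop) | add | aux)
    return out

pi = [{"hi"}, {"hi", "ho"}, {"ho", "lo"}]
flipped = intervene(pi, {(True, "hi", 0)}, {(True, "lo", 2)})
```

The positive cause event ⟨hi, 0⟩ removes hi at step 0, and the contingency on lo at step 2 only raises the auxiliary input `c_lo`; the counterfactual automaton then restores lo's original value.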

<sup>1</sup> The same reasoning can be applied to arbitrary systems by considering for contingencies largest sets of outputs for which the assumption holds, with the caveat that the counterfactual automaton may model fewer contingencies. Consequently, computed causes may be less precise in case multiple causes appear in the counterexample.

#### **4.1 Actual Causality for HyperLTL Violations**

We are now ready to formalize what constitutes an actual cause for the violation of a hyperproperty described by a HyperLTL formula.

**Definition 6 (Actual Causality for HyperLTL).** *Let* Γ *be a counterexample to a HyperLTL formula* ϕ *in a system* T*. The set* C *is an actual cause for the violation of* ϕ *on* Γ *if the following conditions hold.*

**SAT** Γ ⊨ C*.*

**CF** *There exists a contingency* W *and a non-empty subset* C′ ⊆ C *such that* Γ ⊨ W *and intervene*(Γ, C′, W) ⊨<sub>*traces*(T)</sub> ϕ*.*

**MIN** C *is minimal, i.e., no subset of* C *satisfies SAT and CF.*

Unlike in Halpern and Pearl's definition (see Sect. 2.2), the condition SAT requires Γ to satisfy only the cause, as we already know that the effect ¬ϕ, i.e., the violation of the specification, is satisfied by virtue of Γ being a counterexample. CF is the counterfactual condition corresponding to AC2 in Halpern and Pearl's definition, and it states that, after intervening on the cause under a certain contingency, the set of traces satisfies the property. (Note that we use a conjunction of two statements here while Halpern and Pearl use an implication. This is because they implicitly quantify universally over the values of the variables in the set W, which should be as in the actual world, whereas in our setting the set of contingencies already defines explicit values.) MIN is the minimality criterion directly corresponding to AC3.

*Example 2.* Consider our running example from Sect. 3, i.e., the system from Fig. 1 and the counterexample to observational determinism Γ = ⟨t<sub>1</sub>, t<sub>2</sub>⟩. Let us consider what it means to intervene on the cause C<sub>1</sub> = {⟨*hi*, 0, t<sub>2</sub>⟩}. Note that we have Γ ⊨ C<sub>1</sub>, hence the condition SAT is satisfied. For CF, let us first consider an intervention without contingencies. This results in *intervene*(Γ, C<sub>1</sub>, ∅) = ⟨t′<sub>1</sub>, t′<sub>2</sub>⟩ = ⟨t<sub>1</sub>, {}{*hi*, *lo*}{*ho*}{*ho*, *lo*}<sup>ω</sup>⟩. However, *intervene*(Γ, C<sub>1</sub>, ∅) ⊨<sub>*traces*(T)</sub> ¬ϕ, because the low outputs of t′<sub>1</sub> and t′<sub>2</sub> differ at the third position: *lo* ∈ t′<sub>1</sub>[2] and *lo* ∉ t′<sub>2</sub>[2]. This is because now the second high input takes effect, which was preempted by the first cause in the actual counterexample. The contingency W<sub>2</sub> = {⟨*lo*, 2, t<sub>2</sub>⟩} now allows us to control for this by *modifying the state* after taking the second high input as follows: *intervene*(Γ, C<sub>1</sub>, W<sub>2</sub>) = ⟨t′′<sub>1</sub>, t′′<sub>2</sub>⟩ = ⟨t<sub>1</sub>, {}{*hi*, *lo*}{*ho*, *lo*}{*ho*, *lo*}<sup>ω</sup>⟩. Note that t′′<sub>2</sub> is not a trace of the model depicted in Fig. 1, because there is no transition that explains the step from t′′<sub>2</sub>[1] to t′′<sub>2</sub>[2]. It is, however, a trace of the counterfactual automaton T<sup>C</sup><sub>t<sub>2</sub></sub> (see full version [22]), which encodes the set of counterfactual worlds for the trace t<sub>2</sub>. The fact that we consider executions that are not part of the original system allows us to infer that only the first high input was an actual cause in our running example. Disregarding contingencies, we would need to consider both high inputs as an explanation for the violation of observational determinism, even though the second high input had no influence. Our treatment of contingencies corresponds directly to Halpern and Pearl's causal models, which allow ignoring certain structural equations, as outlined in Example 1.

*Remark:* With our definitions, we strictly generalize Halpern and Pearl's actual causality to reactive systems modeled as Moore machines and effects expressed as HyperLTL formulas. Their structural equation models can be encoded in a one-step Moore machine; an effect specified as a Boolean combination of primitive events can be encoded in the more expressive logic HyperLTL. Just as for Halpern and Pearl, our actual causes are not unique. While there can exist several different actual causes, the set of all actual causes is always unique. It is also possible that no actual cause exists: if the effect occurs on all system traces, there may be no actual cause on a given individual trace.
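To make the encoding claim in the remark concrete, here is a minimal Python sketch (all names and the toy model are hypothetical, not the paper's implementation) of reading a Boolean structural equation model as a one-step Moore machine: the exogenous variables are the machine's inputs, and a single transition computes the valuation of all endogenous variables.

```python
def one_step_moore(equations, exogenous):
    """Evaluate an acyclic structural equation model in one Moore-machine
    step: the state reached after a single transition is the valuation of
    all endogenous variables.

    `equations` is a list of (variable, function) pairs in topological
    order; each function maps the current valuation to a Boolean."""
    env = dict(exogenous)
    for var, fn in equations:
        env[var] = fn(env)
    return env

# Toy model from the causality literature: fire occurs if lightning
# strikes or the arsonist drops a match.
equations = [("fire", lambda e: e["lightning"] or e["arsonist"])]
state = one_step_moore(equations, {"lightning": False, "arsonist": True})
```

Intervening on a variable (as in Halpern and Pearl's models) then amounts to overriding one entry of `env` instead of evaluating its equation.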

#### **4.2 Finding Actual Causes with Model Checking**

In this section, we consider the relationship between finding an actual cause for the violation of a HyperLTL formula starting with a universal quantifier and model checking of HyperLTL. We show that the problem of finding an actual cause can be reduced to a model checking problem where the generated formula for the model checking problem has one additional quantifier alternation. While there might be a reduction resulting in a more efficient encoding, our current result suggests that causality checking is the harder problem. The key idea of our reduction is to use counterfactual automata (that encode the given counterexample and the possible counterfactual traces) together with the HyperLTL formula described in the proof to ensure the conditions SAT, CF, and MIN on the witnesses for the model checking result.

**Proposition 1.** *We can reduce the problem of finding an actual cause for the violation of a HyperLTL formula starting with a universal quantifier to the HyperLTL model checking problem with one additional quantifier alternation.*

*Proof.* Let Γ = {t1, . . . , tk} be a counterexample for the formula ∀π1 . . . ∀πk. ϕ, where ϕ is a HyperLTL formula that does not start with a universal quantifier. For readability, we provide the proof for the case Γ = {t1, t2}, but it can be extended to any natural number k. We assume that t1, t2 have some ω-regular representation, as otherwise the initial problem of computing causality is not well defined. That is, we denote ti = ui(vi)<sup>ω</sup> such that |ui · vi| = ni.

In order to find an actual cause, we need to find a pair of traces t′1, t′2 that are counterfactuals for t1, t2; satisfy the property ϕ; and such that the changes from t1, t2 to t′1, t′2 are minimal with respect to set containment. Changes in inputs between ti and t′i in the loop part vi should reoccur in t′i repeatedly. Note that the differences between the counterexample t1, t2 and the witness t′1, t′2 of the model checking problem encode the actual cause, i.e., in case of a difference, the cause contains the event that is present on the counterexample. To reason about these changes, we use the counterfactual automaton T<sup>C</sup><sub>i</sub> for each ti, which also allows us to search for the contingency W as part of the input sequence of T<sup>C</sup><sub>i</sub>. Note that each T<sup>C</sup><sub>i</sub> consists of ni copies that indicate in which step the automaton is with respect to ti and its loop vi. For m > |ui|, we label each state (si, m) in T<sup>C</sup><sub>i</sub> with the additional label L<sub>s<sub>m</sub>,i</sub>, to indicate that the system is now in the loop part of ti. In addition, we add to the initial state of T<sup>C</sup><sub>i</sub> the label li, and we add to the initial state of the system T the label l<sub>or</sub>. The formula ψ<sup>i</sup><sub>loop</sub> below states that the trace π begins its run from the initial state of T<sup>C</sup><sub>i</sub> (and thus stays in this component through the whole run), and that every time π visits a state on the loop, the same input sequence is observed. This way we enforce the periodic input behavior of the traces t1, t2 on t′1, t′2.

$$\psi\_{loop}^i(\pi) := l\_{i,\pi} \land \bigwedge\_{L\_{s\_m,i}} \bigvee\_{A \subseteq I} \Box(L\_{s\_m,i,\pi} \to (\bigwedge\_{a \in A} a\_{\pi} \land \bigwedge\_{a \notin A} \neg a\_{\pi}))$$

For a subset of locations N ⊆ [1, ni] and a subset of input propositions A ⊆ I, we define the formula ψ<sup>i</sup><sub>diff</sub>[N,A](π), which states that π differs from ti in at least all events ⟨a, m, ti⟩ for a ∈ A, m ∈ N; and the formula ψ<sup>i</sup><sub>eq</sub>[N,A](π), which states that for all events that are not defined by A and N, π is equal to ti.

$$\psi\_{diff}^{i}[N,A](\pi) = \bigwedge\_{j \in N, a \in A} \mathsf{O}^{j}(a\_{\pi} \neq a\_{t\_{i}})$$

$$\psi\_{eq}^{i}[N,A](\pi) = \bigwedge\_{j \notin N, a \in I} \mathsf{O}^{j}(a\_{\pi} \leftrightarrow a\_{t\_{i}}) \wedge \bigwedge\_{j \in [1, n\_{i}], a \notin A} \mathsf{O}^{j}(a\_{\pi} \leftrightarrow a\_{t\_{i}})$$

We now define the formula ψ<sup>i</sup><sub>min</sub>, which states that the set of inputs (and locations) on which trace π differs from ti is not contained in the corresponding set for π′. We only check locations up to the length ni of ti.

$$\psi\_{min}^i(\pi,\pi') \coloneqq \bigwedge\_{N \subseteq [1,n\_i]} \bigwedge\_{A \subseteq I} \left( \left( \psi\_{diff}^i[N,A](\pi) \wedge \psi\_{eq}^i[N,A](\pi) \right) \to \neg \psi\_{eq}^i[N,A](\pi') \right)$$

Denote ϕ := Q1τ1 . . . Qnτn. ϕ′(π1, π2), where Qi ∈ {∀, ∃} and the τi are trace variables for i ∈ [1, n]. The formula ψcause described below states that the two traces π′1 and π′2 are part of the systems T<sup>C</sup><sub>1</sub>, T<sup>C</sup><sub>2</sub>, have the same loop structure as t1 and t2, and satisfy ϕ. That is, these traces can be obtained by changing the original traces t1, t2 and avoid the violation.

$$
\psi\_{cause}(\pi\_1', \pi\_2') := \varphi'(\pi\_1', \pi\_2') \land \bigwedge\_{i=1,2} \psi\_{loop}^i(\pi\_i')
$$

Finally, ψactual described below states that the counterfactuals π′1, π′2 correspond to a minimal change in the input events with respect to t1, t2. All other traces that the formula reasons about start at the initial state of the original system and are thus not affected by the counterfactual changes. We verify ψactual against the product automaton T × T<sup>C</sup><sub>1</sub> × T<sup>C</sup><sub>2</sub> to find those traces π′<sub>i</sub> ∈ T<sup>C</sup><sub>i</sub> that witness the presence of a cause, counterfactual, and contingency.

$$\psi\_{actual} := \exists \pi\_1' \exists \pi\_2'. \; \forall \pi\_1'' \pi\_2''. \; Q\_1 \tau\_1 \dots Q\_n \tau\_n. \; \psi\_{cause}(\pi\_1', \pi\_2') \land \bigwedge\_{i=1,2} (l\_{i, \pi\_i'} \land l\_{i, \pi\_i''})$$

$$\land \bigwedge\_{i \in [1,n]} l\_{or, \tau\_i} \land \left(\psi\_{cause}(\pi\_1'', \pi\_2'') \rightarrow \left(\bigwedge\_{i=1,2} \psi\_{min}^i(\pi\_i', \pi\_i'')\right)\right)$$

Then, if there exist two such traces π′1, π′2 in the system T × T<sup>C</sup><sub>1</sub> × T<sup>C</sup><sub>2</sub>, they correspond to a minimal cause for the violation. Otherwise, there are no traces of the counterfactual automata that can be obtained from t1, t2 using counterfactual reasoning and satisfy the formula ϕ.

We have shown that we can use HyperLTL model checking to find an actual cause for the violation of a HyperLTL formula. The resulting model checking problem has an additional quantifier alternation, which suggests that identifying actual causes is a harder problem. Therefore, we restrict ourselves to finding actual causes for violations of universal HyperLTL formulas. This keeps the algorithms we present in the next section practical, as we start without any quantifier alternation and need to solve a model checking problem with a single quantifier alternation. While this restriction excludes some interesting formulas, many can be strengthened into this fragment such that we are able to handle close approximations (cf. [25]). Any additional quantifier alternation in the original formula carries over to an additional quantifier alternation in the resulting model checking problem, which in turn leads to an exponential blow-up. The scalability of our approach is thus limited by the complexity of the model checking problem.

#### **5 Computing Causes for Counterexamples**

In this section, we describe our algorithm for finding actual causes of hyperproperty violations. Our algorithm is implemented on top of MCHyper [35], a model checker for hardware circuits and the alternation-free fragment of HyperLTL. In case of a violation, our analysis enriches the provided counterexample with an actual cause, which explains the reason for the violation to the user.

We first provide an overview of our algorithm and then discuss each step in detail. First, we compute an over-approximation of the cause using a satisfiability analysis over the transitions taken in the counterexample. This analysis results in a set of events C̃. As we show in Proposition 2, every actual cause C for the violation is a subset of C̃. In addition, in Proposition 3 we show that the set C̃ satisfies conditions SAT and CF. To ensure MIN, we search for the smallest subset C ⊆ C̃ that satisfies SAT and CF. This set C is then our minimal, and therefore actual, cause.

To check condition CF, we need to check the counterfactual of each candidate cause C, and potentially also look for contingencies for C. We structure our discussion as follows. We first discuss the calculation of the over-approximation C̃ (Sect. 5.1), then we present the ActualCause algorithm that identifies a minimal subset of C̃ that is an actual cause (Sect. 5.2), and finally we discuss in detail the calculation of contingencies (Sect. 5.3). In the following sections, we use a reduction of the universal fragment of HyperLTL to LTL, and the advantages of the linear translation of LTL to alternating automata, as we now briefly outline.

*HyperLTL to LTL.* Let ϕ be a ∀<sup>n</sup>-HyperLTL formula and Γ be the counterexample. We construct an LTL formula ϕ′ from ϕ as follows [31]: atomic propositions indexed with different trace variables are treated as different atomic propositions, and trace quantifiers are eliminated. For example, ∀π, π′. a<sub>π</sub> ∧ a<sub>π′</sub> results in the LTL formula a<sub>π</sub> ∧ a<sub>π′</sub>. As for Γ, we use the same renaming in order to zip all traces into a single trace, for which we assume the finite representation t′ = u′ · (v′)<sup>ω</sup>, which is also the structure of the model checker's output. The trace t′ is a violation of the formula ϕ′, i.e., t′ satisfies ¬ϕ′. We denote ϕ̄ := ¬ϕ′. We can then assume, for implementation purposes, that the specification (and its violation) is an LTL formula and the counterexample is a single trace. After our causal analysis, the translation back to a cause over hyperproperties is straightforward, as we maintain all information about the different traces in the counterexample. Note that this translation works due to the synchronous semantics of HyperLTL.
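The zipping step can be sketched in Python as follows (an illustration with an assumed trace representation, not the paper's implementation): each lasso trace is a pair (u, v) of proposition-set lists, and proposition a on trace i is renamed to the pair (a, i). For simplicity we assume the stems and loops have already been unrolled to equal lengths.

```python
def zip_traces(traces):
    """Zip lasso-shaped traces into one lasso over renamed propositions.

    traces: list of (u, v) pairs, where u (stem) and v (loop) are lists of
    sets of proposition names.  All stems are assumed to have equal length,
    likewise all loops (unroll to the lcm of the loop lengths first)."""
    def merge(position_sets):
        # Rename proposition a on trace i to the pair (a, i) and take the union.
        return frozenset((a, i) for i, s in enumerate(position_sets) for a in s)
    stem = [merge([t[0][k] for t in traces]) for k in range(len(traces[0][0]))]
    loop = [merge([t[1][k] for t in traces]) for k in range(len(traces[0][1]))]
    return stem, loop

# Two toy traces over {hi, lo}: t1 = {}({lo})^ω and t2 = {hi}({})^ω
t1 = ([set()], [{"lo"}])
t2 = ([{"hi"}], [set()])
stem, loop = zip_traces([t1, t2])
```

The inverse mapping, from an event on the zipped trace back to a trace of Γ, is immediate because the trace index i is kept in every renamed proposition.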

*Finite Trace Model Checking Using Alternating Automata.* In verifying condition CF (that is, in computing counterfactuals and contingencies), we need to apply finite trace model checking, as we want to check whether the modified trace at hand still violates the specification ϕ, that is, satisfies ϕ̄. To this end, we use the linear algorithm of [36], which exploits the linear translation of ϕ̄ to an alternating automaton A<sub>ϕ̄</sub> and checks the satisfaction of ϕ̄ using backwards analysis. An alternating automaton [68] generalizes non-deterministic and universal automata; its transition relation is a Boolean function over the states. A run of an alternating automaton is then a *run tree* that captures the conjunctions in the formula. We use the algorithm of [36] as a black box (see App. A.2 in [22] for a formal definition of alternating automata and App. A.3 in [22] for the translation from LTL to alternating automata). For the computation of contingencies we use an additional feature of the algorithm of [36]: it returns an accepting run tree T of A<sub>ϕ̄</sub> on t′, with annotations of the nodes that represent atomic subformulas of ϕ̄ taking part in the satisfaction of ϕ̄. We use this feature also in Sect. 5.1 when calculating the set of candidate causes.

#### **5.1 Computing the Set of Candidate Causes**

The events that might have been a part of the cause of the violation are in fact all events that appear on the counterexample or, equivalently, all events that appear in u′ and v′. Note that due to the finite representation, this is a finite set of events. Yet, not all events in this set can cause the violation. In order to remove events that could not have been a part of the cause, we perform an analysis of the transitions of the system taken during the execution of t′. With this analysis we detect which events appearing in the trace locally cause the respective transitions, and thus might be part of the global cause. Events that did not trigger a transition in this specific trace cannot be a part of the cause. Note that causing a transition and being an actual cause are two different notions: actual causality is defined over the behaviour of the system, not over individual traces. We denote the over-approximation of the cause by C̃. Formally, we represent each transition as a Boolean function over inputs and states. Let δ<sub>n</sub> denote the formula representing the transition of the system taken when reading t′[n], and let c<sub>a,n,i</sub> be a Boolean variable that corresponds to the event ⟨a<sub>t<sub>i</sub></sub>, n, t′⟩.<sup>2</sup> Denote

$$\psi\_n^{t'} = \bigwedge\_{a\_{t\_i} \in t'[n]} c\_{a,n,i} \land \bigwedge\_{a\_{t\_i} \notin t'[n]} \neg c\_{a,n,i}$$

that is, ψ<sup>t′</sup><sub>n</sub> expresses the exact set of events in t′[n]. In order to find events that might trigger the transition δ<sub>n</sub>, we compute an *unsatisfiable core* of ψ<sub>n</sub> = (¬δ<sub>n</sub>) ∧ ψ<sup>t′</sup><sub>n</sub>. Intuitively, the unsatisfiable core of ψ<sub>n</sub> is a set of events that force the system to take this specific transition. For every c<sub>a,n,i</sub> (or ¬c<sub>a,n,i</sub>) in the unsatisfiable core that is also a part of ψ<sup>t′</sup><sub>n</sub>, we add ⟨a, n, ti⟩ (or ⟨¬a, n, ti⟩) to C̃.
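The local analysis can be illustrated with a brute-force stand-in for the unsatisfiable-core computation (the actual implementation uses a SAT solver; the function and inputs below are illustrative only): a set of events forces a transition exactly when (¬δn) together with those fixed events is unsatisfiable, i.e., every completion of the remaining events still takes the transition.

```python
from itertools import combinations, product

def forcing_events(delta_n, valuation):
    """Return a smallest set of events from `valuation` whose actual values
    force the transition predicate `delta_n` (a function over dicts mapping
    variable name -> bool).  Mimics an unsat core of (¬delta_n) ∧ ψ_n by
    brute force; only practical for a handful of variables."""
    names = sorted(valuation)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            fixed = {v: valuation[v] for v in subset}
            free = [v for v in names if v not in subset]
            # (¬delta_n) ∧ fixed is unsatisfiable iff every completion of
            # the free events still satisfies delta_n.
            if all(delta_n({**fixed, **dict(zip(free, bits))})
                   for bits in product([False, True], repeat=len(free))):
                return set(subset)
    return set(names)

# Toy transition: taken iff hi is set and reset is not; lo is irrelevant,
# so it is excluded from the candidate cause.
delta = lambda m: m["hi"] and not m["reset"]
core = forcing_events(delta, {"hi": True, "reset": False, "lo": True})
```

In the paper's setting this check is delegated to a SAT solver's assumption-based core extraction rather than enumerated explicitly.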

We use unsatisfiable cores in order to find input events that are necessary for taking a transition. However, this might not be enough. There are cases in which inputs that appear in the formula ϕ̄ are not detected using this method, as they are not essential for taking a transition; nevertheless, they might be part of the actual cause, as negating them can avoid the violation. Therefore, as a second step, we apply the algorithm of [36] on the annotated automaton A<sub>ϕ̄</sub> in order to find the specific events that affect the satisfaction of ϕ̄, and we add these events to C̃. In summary, the unsatisfiable core approach provides us with inputs that affect the computation and might cause the violation even though they do not appear in the formula itself, while the alternating automaton allows us to find inputs that are not essential for the computation, but might still be part of the cause as they appear in the formula.

**Proposition 2.** *The set* C̃ *is indeed an over-approximation of the cause for the violation. That is, every actual cause* C *for the violation is a subset of* C̃*.*

*Proof (sketch).* Let e = ⟨a, n, t′⟩ be an event such that e is not in the unsatisfiable core of ψ<sub>n</sub> and does not directly affect the satisfaction of ϕ̄ according to the alternating automata analysis. That is, the transition corresponding to ψ<sup>t′</sup><sub>n</sub> is taken regardless of e, and thus all future events on t′ remain the same regardless of the valuation of e. In addition, the valuation of the formula ϕ̄ is the same regardless of e, since: (1) e does not directly affect the satisfaction of ϕ̄; (2) e does not affect future events on t′ (and obviously it does not affect past events). Therefore, every set C with e ∈ C is not minimal, and does not form a cause. Since the above holds for all events e ∉ C̃, it follows that C ⊆ C̃ for every actual cause C.

**Proposition 3.** *The set* C̃ *satisfies conditions SAT and CF.*

*Proof.* The condition SAT is satisfied, as we add to C̃ only events that indeed occur on the counterexample trace. For CF, observe that C̃ is a superset of the actual cause C, so the same contingency and counterfactual for C also apply to C̃. This is because, in order to compute a counterfactual, we are allowed to flip any subset of the events in C, and any such subset is also a subset of C̃.

<sup>2</sup> That is, ¬c<sub>a,n,i</sub> corresponds to the event ⟨¬a<sub>t<sub>i</sub></sub>, n, t′⟩. Recall that the atomic propositions on the zipped trace t′ are annotated with the original trace ti from Γ.

**Algorithm 1:** ActualCause(ϕ, Γ, C̃)

**Input**: Hyperproperty ϕ, counterexample Γ violating ϕ, and a set of candidate causes C̃ for which conditions SAT and CF hold.

**Output**: A set of input events C which is an actual cause for the violation.

```
1  for i ∈ [1, . . . , |C̃| − 1] do
2      for C ⊂ C̃ with |C| = i do
3          let Γf = intervene(Γ, C, ∅);
4          if Γf ⊨ ϕ then
5              return C;
6          else
7              W̃ = ComputeContingency(ϕ, Γ, C);
8              if W̃ ≠ ∅ then
9                  return C;
10 return C̃;
```
In addition, in computing contingencies, we are allowed to flip any subset of outputs as long as they agree with the counterexample trace, which is independent of C̃ and C.

#### **5.2 Checking Actual Causality**

Due to Proposition 2, we know that in order to find an actual cause, we only need to consider subsets of C̃ as candidate causes. In addition, since C̃ satisfies condition SAT, so do all of its subsets. We thus only need to check conditions CF and MIN for subsets of C̃. Our actual causality computation, presented in Algorithm 1, works as follows. We start with the set C̃, which satisfies SAT and CF. We then check if there exists a smaller cause that satisfies CF. This is done by iterating over all subsets C of C̃, ordered by size and starting with the smallest ones, and checking whether the counterfactual for C manages to avoid the violation; and if not, whether there exists a contingency for this C. If the answer to one of these questions is yes, then C is a minimal cause that satisfies SAT, CF, and MIN, and thus we return C as our actual cause. We now elaborate on CF and MIN.
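The subset enumeration of Algorithm 1 can be sketched in Python as follows; the two callbacks stand in for the model checking steps and are assumptions of this sketch, not the paper's API.

```python
from itertools import combinations

def actual_cause(c_tilde, still_violates, find_contingency):
    """Sketch of Algorithm 1 (ActualCause).

    c_tilde: candidate cause, a set of events satisfying SAT and CF.
    still_violates(cause): True iff flipping `cause` (with the empty
        contingency) still violates the property -- stands in for
        finite-trace model checking.
    find_contingency(cause): a non-empty contingency set, or None."""
    events = sorted(c_tilde)
    for size in range(1, len(events)):           # smallest subsets first (MIN)
        for cause in map(set, combinations(events, size)):
            if not still_violates(cause):        # counterfactual avoids violation
                return cause
            if find_contingency(cause):          # or some contingency repairs it
                return cause
    return set(c_tilde)                          # fall back to C̃ (SAT and CF hold)

# Toy run: only flipping event "a" avoids the violation; no contingencies.
cause = actual_cause({"a", "b", "c"},
                     still_violates=lambda c: "a" not in c,
                     find_contingency=lambda c: None)
```

Enumerating subsets by increasing size is exactly what makes the first returned set minimal with respect to set containment.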

*CF.* As mentioned above, checking condition CF is done in two stages: checking for counterfactuals and computing contingencies. We first show that we do not need to consider all possible counterfactuals, but only one counterfactual for each candidate cause.

**Proposition 4.** *In order to check whether a candidate cause* C̃ *is an actual cause, it is enough to test the single counterfactual in which all the events in* C̃ *are flipped.*

*Proof.* Assume that there is a strict subset C of C̃ such that we only need to flip the valuations of events in C in order to find a counterfactual or contingency; thus C satisfies CF. Since C is a smaller cause than C̃, we will find it during the minimality check.

**Algorithm 2:** ComputeContingency(ϕ, Γ, C)

**Input**: Hyperproperty ϕ, a counterexample Γ and a potential cause C.

**Output**: A set of output events W which is a contingency for ϕ, Γ, and C, or ∅ if no contingency is found.


```
10 return ∅;
```

We assume that CF holds for the input set C̃ and check whether it holds for any smaller subset C ⊂ C̃. CF holds for C if (1) flipping all events in C is enough to avoid the violation of ϕ, or if (2) there exists a non-empty set of contingencies for C that ensures that ϕ is not violated. The computation of contingencies is described in Algorithm 2. Verifying condition CF involves model checking traces against an LTL formula: we check in Algorithm 1 (line 3) whether the property ϕ is still violated on the counterfactual trace with the empty contingency, and in Algorithm 2 (line 7) on the counterfactual traces resulting from the different contingency sets we consider. In both scenarios, we apply finite trace model checking, as described at the beginning of Sect. 5 (as we assume lasso-shaped traces).

*MIN.* To check whether C̃ is minimal, we need to check whether there exists a subset of C̃ that satisfies CF. We check CF for all subsets, starting with the smallest ones, and report the first subset that satisfies CF as our actual cause. (Note that we already established that C̃ and all of its subsets satisfy SAT.)

#### **5.3 Computing Contingencies**

Recall that the role of contingencies is to eliminate the effect of other possible causes from the counterfactual world, in case these causes did not affect the violation in the actual world. More formally, in computing contingencies we look for a set W of output events such that changing these outputs from their value in the counterfactual to their value in the counterexample t′ results in avoiding the violation. Note that the inputs remain as they are in the counterfactual. The problem of finding contingencies is hard, and in general equivalent to the problem of model checking, since we need to consider all traces that result from changing some subset of events (output + time step) from the counterfactual back to the counterexample, and to check whether there exists a trace in this set that avoids the violation. Unfortunately, in the worst case we are unable to avoid a complexity exponential in the size of the original system. However, our experiments show that in practice, most cases do not require the use of contingencies.

Our algorithm for computing contingencies (Algorithm 2) works as follows. Let t<sup>f</sup> be the counterfactual trace. As a first step, we use the annotated run tree T of the alternating automaton A<sub>ϕ̄</sub> on t<sup>f</sup> to detect output events that appear in ϕ̄ and take part in satisfying ϕ̄. Subsets of these output events are our first candidates for contingencies, as they are directly related to the violation (Algorithm 2, lines 4–9). If we are not able to find a contingency this way, we continue to check all possible subsets of output events that differ from the original counterexample trace. We test the different outputs by feeding the counterfactual automaton of Definition 4 with additional inputs from the set I<sub>C</sub>. The resulting trace is then our candidate contingency, which we try to verify against ϕ. The number of different input sequences is bounded by the size of the product of the counterfactual automaton and the automaton for ϕ̄, and thus the process terminates.
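A simplified version of the exhaustive second phase can be sketched as follows; `holds` stands in for finite-trace model checking, and the flat trace representation (lists of proposition sets, no lasso handling) is an assumption of this sketch.

```python
from itertools import combinations

def compute_contingency(cf_trace, cex_trace, outputs, holds):
    """Search for a contingency: a set of output events on which the
    counterfactual trace is reset to its counterexample value such that
    the property then holds.  Traces are lists of sets of propositions;
    `holds(trace)` stands in for finite-trace model checking."""
    # Output events where the counterfactual and the counterexample disagree.
    diffs = [(n, a)
             for n, (cf, cex) in enumerate(zip(cf_trace, cex_trace))
             for a in outputs if (a in cf) != (a in cex)]
    for size in range(1, len(diffs) + 1):
        for w in combinations(diffs, size):
            trace = [set(step) for step in cf_trace]
            for n, a in w:   # restore the counterexample's output value
                (trace[n].add if a in cex_trace[n] else trace[n].discard)(a)
            if holds(trace):
                return set(w)
    return set()             # no contingency found

# Toy run: the property holds iff "lo" is set at step 0.
w = compute_contingency(cf_trace=[set(), set()],
                        cex_trace=[{"lo"}, set()],
                        outputs={"lo"},
                        holds=lambda tr: "lo" in tr[0])
```

Note that only output events are ever reset; the inputs of the counterfactual trace are left untouched, mirroring the definition above.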

**Theorem 1 (Correctness).** *Our algorithm is sound and complete. That is, let* Γ *be a counterexample with a finite representation to a* ∀<sup>n</sup>*-HyperLTL formula* ψ*. Then, our algorithm returns an actual cause for the violation, if one exists.*

*Proof. Soundness.* Since we verify each candidate set of inputs against the conditions SAT, CF, and MIN, every output of our algorithm is indeed an actual cause. *Completeness.* If there exists a cause, then by Proposition 2 it is a subset of the finite set C̃. Since in the worst case we test every subset of C̃, if there exists a cause we will eventually find it.

### **6 Implementation and Experiments**

We implemented Algorithm 1 and evaluated it on publicly available example instances of HyperVis [48] for which state graphs were available. In the following, we provide implementation details, report on the running time, and show the usefulness of the implementation by comparing with the highlighting output of HyperVis. Our implementation is written in Python and uses py-aiger [69] and Spot [27]. We compute the candidate cause according to Sect. 5.1 with PySAT [50], using Glucose 4 [3,66], which builds on MiniSat [66]. We ran experiments on a MacBook Pro with a 3.3 GHz Dual-Core Intel Core i7 processor and 16 GB RAM<sup>3</sup>.

*Experimental Results.* The results of our experimental evaluation can be found in Table 1. We report on the size of the analyzed counterexample |Γ|, the size of the violated formula |ϕ|, how long it took to compute the first, over-approximated cause (see time(C̃)), the approximation C̃ itself, the number of computed minimal causes #(C), and the time it took to compute all of them (see time(∀C)). The Running Example is described in Sect. 3; the instance Security in & out

<sup>3</sup> Our prototype implementation and the experimental data are both available at: https://github.com/reactive-systems/explaining-hyperproperty-violations.


**Table 1.** Experimental results of our implementation. Times are given in ms.

refers to a system which leaks high security input by not satisfying a noninterference property, the Drone examples consider a leader-follower drone scenario, and the Asymmetric Arbiter instances refer to arbiter implementations that do not satisfy a symmetry constraint. Specifications can be found in the full version of this paper [22].

Our first observation is that the cause candidate C̃ can be efficiently computed thanks to the iterative computation of unsatisfiable cores (Sect. 5.1). The cause candidate provides a tight over-approximation of possible minimal causes. As expected, the runtime for finding minimal causes increases for larger counterexamples. However, as our experiments show, the overhead is manageable, because we optimize the search for all minimal causes by only considering subsets of C̃ instead of naively going over every combination of input events (see Proposition 2). Compared to the computationally heavy task of model checking to obtain a counterexample, our approach incurs little additional cost, which matches our theoretical results (see Proposition 1). During our experiments, we found that computing the candidate C̃ first has, in addition to providing a powerful heuristic, another benefit: even when the computation of minimal causes becomes increasingly expensive, C̃ can serve as an intermediate result for the user. By filtering for important inputs, such as high security inputs, C̃ already gives great insight into why the property was violated. In the asymmetric arbiter instance, for example, the input events ⟨¬*tb secret*, 3, t0⟩ and ⟨*tb secret*, 3, t1⟩ of C̃, which cause the violation, immediately catch the eye (cf. App. A.4 in [22]).

*Comparison to HyperVis.* HyperVis [48] is a tool for visualizing counterexamples returned by the HyperLTL model checker MCHyper [35]. It highlights the events in the trace that it considers responsible for the violation based on the formula and the set of traces, without considering the system model. However, violations of many relevant security policies such as observational determinism are not caused by events whose atomic propositions appear in the formula, as can be seen in our running example (see Sect. 3 and Example 2). When running the highlight function of HyperVis for the counterexample traces t1, t2 of the running example, the output events ⟨lo, 1, t1⟩ and ⟨¬lo, 1, t2⟩ are highlighted, neglecting the decisive high security input hi. Using our method additionally reveals the input events ⟨¬hi, 0, t1⟩ and ⟨hi, 0, t2⟩, i.e., an actual cause (see Table 1). This pattern can be observed throughout all instances considered in our experiments. For instance, in the Asymmetric arbiter instance mentioned above, the input events causing the violation also do not occur in the formula (see App. A.5 in [22]), and thus HyperVis does not highlight this important cause of the violation.

### **7 Related Work**

With the introduction of HyperLTL and HyperCTL<sup>∗</sup> [20], temporal hyperproperties have been studied extensively: satisfiability [29,38,60], model checking [34,35,49], program repair [11], monitoring [2,10,32,67], synthesis [30], and expressiveness studies [23,37,53]. Causal analysis of hyperproperties has been studied theoretically based on counterfactual builders [40] instead of actual causality, as in our work. Explanation methods [4] exist for trace properties [5,39,41,42,70], integrated in several model checkers [14,15,19]. Minimization [54] has been studied, as well as analyzing several system traces together [9,43,65]. There exists work on explaining counterexamples for function block diagrams [51,63]. MODCHK uses a causality analysis [7] returning an over-approximation, while we provide minimal causes. Lastly, there are approaches that define actual causes for the violation of a trace property using Event Order Logic [13,56,57].

### **8 Conclusion**

We present an explanation method for counterexamples to hyperproperties described by HyperLTL formulas. We lift Halpern and Pearl's definition of actual causality to effects described by hyperproperties and counterexamples given as sets of traces. Like the definition that inspired us, we allow modifications of the system dynamics in the counterfactual world through contingencies, and define these possible counterfactual behaviors in an automata-theoretic approach. The evaluation of our prototype implementation shows that our method is practically applicable and significantly improves the state-of-the-art in explaining counterexamples returned by a HyperLTL model checker.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Distilling Constraints in Zero-Knowledge Protocols**

Elvira Albert<sup>1</sup>, Marta Bellés-Muñoz<sup>2</sup>, Miguel Isabel<sup>1</sup>, Clara Rodríguez-Núñez<sup>1(B)</sup>, and Albert Rubio<sup>1</sup>

<sup>1</sup> Complutense University of Madrid, Madrid, Spain clarrodr@ucm.es <sup>2</sup> Pompeu Fabra University, Barcelona, Spain

**Abstract.** The most widely used *Zero-Knowledge* (ZK) protocols require provers to prove they know a solution to a computational problem expressed as a *Rank-1 Constraint System* (R1CS). An R1CS is essentially a system of non-linear arithmetic constraints over a set of signals, whose security level depends on its non-linear part only, as the linear (additive) constraints can be easily solved by an attacker. Distilling the essential constraints from an R1CS by removing the part that does not contribute to its security is important, not only to reduce costs (time and space) of producing the ZK proofs, but also to reveal to cryptographic programmers the real hardness of their proofs. In this paper, we formulate the problem of distilling constraints from an R1CS as the (hard) problem of simplifying constraints in the realm of non-linearity. To the best of our knowledge, it is the first time that constraint-based techniques developed in the context of formal methods are applied to the challenging problem of analysing and optimizing ZK protocols.

# **1 Introduction**

Zero-Knowledge (ZK) protocols [8,15,17,27] enable one party, called the prover, to convince another one, called the verifier, that a statement is true without revealing any information beyond the veracity of the statement. In this context, we understand a statement as a relation between an instance, a *public* input known to both prover and verifier, and a *witness*, a *private* input known only to the prover, which belongs to a language L in the nondeterministic polynomial time (NP) complexity class [5,15]. The most popular, efficient, and general-purpose ZK protocols are ZK-SNARKs: ZK Succinct Non-interactive ARguments of Knowledge. While a proof guarantees the existence of a witness in a language L, an *argument of knowledge* proves that, with very high probability, the prover knows a concrete valid witness in L. A ZK-SNARK does not require interaction between the prover and the verifier, and regardless of the size of the statement being proved, the size of the proof is succinct. These appealing properties have made ZK-SNARKs crucial tools in many real-world applications with strong privacy requirements. A prominent example is Zcash [4]. ZK protocols are also being used in conjunction with smart contracts, in the so-called *ZK-rollups*, for enhancing the scalability of distributed ledgers [18].

Like most ZK systems, ZK-SNARKs operate in the model of *arithmetic circuits*, meaning that the NP language L is that of satisfiable arithmetic circuits. The gates of an arithmetic circuit consist of additions and multiplications modulo p, where p is typically a large prime number of approximately 254 bits [3]. The wires of an arithmetic circuit are called signals, and can carry any value from the prime finite field F*p*. In the ZK context, there is usually a set of public inputs known both to the prover and the verifier, and the prover proves that she knows a valid assignment to the rest of signals that satisfies the circuit (i.e., the witness). Most ZK-SNARK protocols draw from a classical algebraic form for encoding circuits and wire assignment called rank-1 constraint system (R1CS). An R1CS encodes a circuit as a set of quadratic constraints over its variables, so that a correct execution of a circuit is equivalent to finding a satisfying variable assignment. This way, a valid witness for an arithmetic circuit translates naturally into a solution of its R1CS representation.

Although ZK protocols guarantee that a malicious verifier cannot extract a witness from a proof, they do not prevent the verifier from attacking the statement directly. Hence, it is important that the prover is aware of the difficulty of the statement being proved. In this regard, it is challenging for cryptographic developers that apply ZK protocols to complex computations to assess the real hardness of the produced computational problem, which also makes the resulting systems difficult to verify and audit. This is partly because a syntactic assessment (e.g., based on counting the number of non-linear constraints) can be inaccurate and misleading. This is the case if the R1CS contains *redundant* constraints, i.e., constraints that can be deduced from others or constraints that follow from linear constraints, since they do not contribute to the hardness of the computational statement. Distilling the relevant constraints is important, on the one hand, for efficiency, to reduce the costs (time and space) of producing the ZK proofs, and, on the other, because redundancy can mislead developers into believing that the statement is far more complex than it really is. It is clear that when arithmetic circuits are defined over a finite field of small order, the problem can be attacked by brute force, and if the system consists only of linear constraints, a solution can be found in polynomial time [25]. Moreover, in R1CS-based systems like [17], only multiplication gates add complexity to the statement. Also note that linear constraints induce a way to compute the value of one signal from a linear combination of the others, and hence we can easily extend a witness for the other signals to a witness for all the signals. As a result, the difficulty of finding a solution to a system relies mostly on the number of *non-redundant non-linear constraints*.

*Contributions.* This case study paper applies techniques developed in the context of formal methods to distill constraints from the R1CS systems used by ZK protocols. The main challenges are related, on the one hand, to reasoning with non-linear information in a finite field and, on the other hand, to dealing with very large constraint systems. Briefly, our main contributions are: (1) we present a formal framework to reason on circuit reduction which generalizes the application of different existing optimizations and the reduction strategy in which they are applied, (2) we introduce a concrete new optimization technique based on Gaussian elimination that allows deducing linear constraints from the non-linear constraints, (3) we implement our approach within circom [21] (a novel domain-specific language and compiler for defining arithmetic circuits) and also develop an interface for using it on the R1CS generated by ZoKrates [12], (4) we experimentally evaluate its performance on multiple real-world circuits (including templates from the circom library [22] and from ZoKrates [12], implementations of different SHA-2 hash functions, elliptic curve operations, etc.).

# **2 Preliminaries**

This section introduces some preliminary notions and notation. We consider a finite field $\mathbb{F}_p$ of prime order $p$. As usual, $\mathbb{F}_p^n$ denotes a sequence of $n$ values in $\mathbb{F}_p$. We drop $p$ from $\mathbb{F}$ when it is irrelevant. An arithmetic circuit (over the field $\mathbb{F}$) consists of wires (represented by means of signals $s_i \in \mathbb{F}$) connected to gates (represented by *quadratic constraints*). Signals can be public or private. We now define the concepts of quadratic constraints and R1CS over a set of signals.

**Definition 1 (R1CS).** *A* quadratic constraint *over a set of signals* $\{s_1, \ldots, s_n\}$ *is an equation of the form* $Q : A \times B - C = 0$*, where* $A, B, C \in \mathbb{F}[s_1, \ldots, s_n]$ *are linear polynomials over the variables* $s_1, \ldots, s_n$*, i.e.,* $A = a_0 + a_1 s_1 + \cdots + a_n s_n$*,* $B = b_0 + b_1 s_1 + \cdots + b_n s_n$*, and* $C = c_0 + c_1 s_1 + \cdots + c_n s_n$*, where* $a_i, b_i, c_i \in \mathbb{F}$ *for all* $i \in \{0, \ldots, n\}$*. A* rank-1 constraint system (R1CS) *over a set of signals* $T$ *is a collection of quadratic constraints over* $T$*.*

We say that a quadratic constraint $Q$ is *linear* when $A$ or $B$ only has the constant term, i.e., $a_i = 0$ for all $i \in \{1, \ldots, n\}$ or $b_i = 0$ for all $i \in \{1, \ldots, n\}$, and *non-linear* otherwise. As R1CS systems only contain quadratic constraints, in what follows we simply call them *constraints*, and specify whether they are linear where needed. We use the standard notation $S \models c$ to indicate that a constraint $c$ is deducible from a set of constraints $S$, and $|S|$ for the number of constraints in $S$.
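To make Definition 1 concrete, the following sketch (our own illustrative code, not part of any compiler; the tiny prime p = 13 stands in for a real prime of roughly 254 bits) encodes a quadratic constraint as three coefficient vectors and checks satisfaction and linearity:

```python
# A quadratic constraint A*B - C = 0 over F_p, following Definition 1.
# Each of A, B, C is a coefficient list [k0, k1, ..., kn] encoding the
# linear polynomial k0 + k1*s1 + ... + kn*sn.  (Illustrative sketch.)

P = 13  # small prime for illustration; real systems use ~254-bit primes

def evaluate(coeffs, assignment):
    """Evaluate a linear polynomial at an assignment (s1, ..., sn) mod P."""
    return (coeffs[0] + sum(c * s for c, s in zip(coeffs[1:], assignment))) % P

def satisfies(A, B, C, assignment):
    """Check A(s) * B(s) - C(s) == 0 (mod P)."""
    return (evaluate(A, assignment) * evaluate(B, assignment)
            - evaluate(C, assignment)) % P == 0

def is_linear(A, B, C):
    """A constraint is linear iff A or B has only the constant term."""
    return all(c == 0 for c in A[1:]) or all(c == 0 for c in B[1:])

# The constraint w * z - w - 3 = 0 over signals (v, w, x, y, z):
A = [0, 0, 1, 0, 0, 0]   # A = w
B = [0, 0, 0, 0, 0, 1]   # B = z
C = [3, 0, 1, 0, 0, 0]   # C = w + 3

assert satisfies(A, B, C, (1, 1, 0, 6, 4))   # w=1, z=4: 1*4 - 1 - 3 = 0
assert not is_linear(A, B, C)                # both A and B carry a signal
```

A linear constraint such as $y - z - 2 = 0$ fits the same shape with $A = 1$, $B = y - z - 2$, and $C = 0$, which `is_linear` classifies as linear.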

**Definition 2 (arithmetic circuit and witness).** *An* (arithmetic) circuit *is a tuple* $C = (U, V, S)$ *where* $U$ *represents the set of public signals,* $V$ *represents the set of private signals, and the R1CS* $S = \{Q_1, \ldots, Q_m\}$ *over the signals* $U \cup V$ *represents the circuit operations. Given an assignment* $u$ *for* $U$*, a* witness *for* $C$ *is an assignment* $v$ *for* $V$ *s.t.* $u$ *together with* $v$ *is a solution to the R1CS* $S$*.*

We use the terms *circuit* and *R1CS* (or just *constraint system*) interchangeably when the signals used in the circuit are clear. Given a circuit $C$ and a public assignment for $U$, a ZK protocol is a mechanism that allows a prover to prove to a verifier that she knows a private assignment for $V$ that, together with the assignment for $U$, satisfies the R1CS describing $C$. ZK protocols guarantee that the proof will not reveal any information about $V$.

*Example 1.* We consider a circuit $C_1 = (U, V, S_1)$ over a finite field $\mathbb{F}$, with $U = \{v, w\}$, $V = \{x, y, z\}$, and $S_1$ given by the following constraint system:

$$\begin{aligned} Q_1 &: w \times (y+z) - 4x - 10 = 0, & Q_2 &: w \times z - w - 3 = 0, \\ Q_3 &: (x - w + 1) \times v - v + 1 = 0, & Q_4 &: y - z - 2 = 0. \end{aligned}$$

This circuit contains 3 non-linear constraints (Q1, Q2, and Q3) and a linear one (Q4). Because of its small size, we can easily solve the system (i.e., give the value of each signal in terms of only one of them) and find the set of solutions:

$$W = \{(v, w, x, y, z) \mapsto (1,\ w,\ w - 1,\ 3w^{-1} + 3,\ 3w^{-1} + 1) \mid w \in \mathbb{F} \setminus \{0\}\}.$$
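The family W can be checked mechanically. The following sketch (our own code; the small prime p = 13 is an illustrative stand-in for a large prime field) verifies that every member of W satisfies S1:

```python
# Verify the solution family of Example 1 over a small field F_p.
# For every w != 0, (v,w,x,y,z) = (1, w, w-1, 3*w^{-1}+3, 3*w^{-1}+1)
# should satisfy Q1..Q4.  (Illustrative check with p = 13.)

P = 13

def solutions():
    for w in range(1, P):                    # w ranges over F_P \ {0}
        w_inv = pow(w, P - 2, P)             # inverse via Fermat's little theorem
        yield (1, w, (w - 1) % P, (3 * w_inv + 3) % P, (3 * w_inv + 1) % P)

def satisfies_S1(v, w, x, y, z):
    return (w * (y + z) - 4 * x - 10) % P == 0 \
       and (w * z - w - 3) % P == 0 \
       and ((x - w + 1) * v - v + 1) % P == 0 \
       and (y - z - 2) % P == 0

assert all(satisfies_S1(*sol) for sol in solutions())
```

The check passes for all twelve values of w in this field, matching the parametric description of W.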

A cryptographic problem can be modeled by different circuits producing the same solutions. This relation among circuits can be formalized as *circuit equivalence*, which is a natural extension of constraint system equivalence. We say that two circuits $C = (U, V, S)$ and $C' = (U, V, S')$ are *equivalent*, written $C \equiv C'$, if $S$ and $S'$ have the same set of solutions. Consequently, if $C$ and $C'$ are equivalent, they have the same set of solutions and hence of witnesses.

*Example 2.* The circuit $C_2 = (U, V, S_2)$, with the same sets of public and private signals $U$ and $V$ as $C_1$, and the R1CS $S_2$ given by the constraints:

$$Q\_1': w \times y - 3w - 3 = 0, \quad Q\_2': y - z - 2 = 0, \quad Q\_3': v - 1 = 0, \quad Q\_4': x - w + 1 = 0,$$

has the same set of solutions (and thus witnesses) as $C_1$. Hence, $C_1 \equiv C_2$.
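For fields of small order, this equivalence can even be checked by brute force. The sketch below (our own code, with p = 13 standing in for a large field) enumerates all assignments and confirms that S1 and S2 have exactly the same solutions:

```python
# Brute-force check that S1 and S2 (Examples 1 and 2) have the same
# solutions over a small field, i.e. C1 is equivalent to C2.  (p = 13.)
from itertools import product

P = 13

def sols(system):
    return {t for t in product(range(P), repeat=5) if system(*t)}

def S1(v, w, x, y, z):
    return (w*(y + z) - 4*x - 10) % P == 0 and (w*z - w - 3) % P == 0 \
       and ((x - w + 1)*v - v + 1) % P == 0 and (y - z - 2) % P == 0

def S2(v, w, x, y, z):
    return (w*y - 3*w - 3) % P == 0 and (y - z - 2) % P == 0 \
       and (v - 1) % P == 0 and (x - w + 1) % P == 0

assert sols(S1) == sols(S2)      # same witnesses
assert len(sols(S1)) == P - 1    # one solution per w in F_P \ {0}
```

Both systems admit exactly one solution per non-zero value of w, matching the family W of Example 1.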

# **3 A Formal Framework for R1CS Reduction**

R1CS optimizations are applied within state-of-the-art compilers like circom [21] or ZoKrates [12]. Common to such existing compiler optimizations is the application of rules to simplify and eliminate linear constraints and/or to deduce information from them. As our first contribution, we present a formal framework for R1CS reduction based on a rule-based transformation system which is general enough to be a formal basis for developing specific simplification techniques and reduction strategies. In particular, the simplifications already applied in the above compilers are instantiations of our framework.

The notion of reduction that our framework formalizes is key to defining the security level of circuits. When two circuits model the same problem, they provide the same level of security. However, an assessment of their security level based on syntactically counting the number of non-linear constraints can lead to a wrong estimation of their security. For instance, circuits $C_1$ and $C_2$ (see Examples 1 and 2) model the same problem, although $C_2$ needs a single non-linear constraint to define its set of solutions (instead of three, as $C_1$). This happens because some of the non-linear constraints of $C_1$ are not essential and can be substituted by linear constraints. Besides, we can observe that in $C_2$ the signals $x$ and $z$ are only involved in linear constraints, whereas in $C_1$ they appear in non-linear ones. In other words, *having a circuit with more private signals involved in non-linear constraints (e.g.,* $C_1$*) does not ensure further security if these private signals can be deduced from linear combinations of the others*. We build our notion of *circuit reduction* upon this concept.

**Definition 3 (circuit-reduction).** *Let* $C = (U, V, S)$ *be a circuit with* $U \cup V = \{s_1, \ldots, s_n\}$*, and* $C' = (U, V', S')$ *another circuit with* $V' \subseteq V$*.*


Intuitively, we have that for every signal defined in $V'$, the values of the two witnesses match, and for the signals defined in $V \setminus V'$, the value of the witness of $C$ can be obtained from a linear combination of the values of the assignment for $U$ and $\varphi$.

*Example 3.* Let $C_3$ be $(\{v, w\}, \{y\}, S_3)$ with $S_3 = \{Q_1' : w \times y - 3w - 3 = 0,\ Q_3' : v - 1 = 0\}$. Let us show that $C_1$ (from Example 1) strictly reduces to $C_3$. From Example 2, we have that every solution of $C_1$ restricted to $\{v, w, y\}$ is also a solution of $C_3$ (since $S_3 \subseteq S_2$ and $C_2 \equiv C_1$) and that in every witness $\varphi'$ of $C_2$ we have $\varphi'(x) = \varphi'(w) - 1$ and $\varphi'(z) = \varphi'(y) - 2$. Therefore, taking $\lambda^x_0 = -1$, $\lambda^x_{pos(w)} = 1$, $\lambda^z_0 = -2$, $\lambda^z_{pos(y)} = 1$ (where the function $pos(s_i)$ abstracts the index $i$ of the variable $s_i$ in the set of signals), we have that $C_3 \models_l C_1$. Finally, since $\{y\} \subset \{x, y, z\}$ and, given an assignment for $\{v, w\}$, every witness of $C_1$ restricted to $\{y\}$ is a witness for $C_3$, we can conclude.

We now present a set of transformation rules that ensure circuit reducibility. The transformation is based on finding linear consequences of the constraint system to guarantee that the transformed set of constraints linearly follows from the original system. Our transformation rules operate on pairs in $\mathcal{K} \times \mathcal{S}_L$, where $\mathcal{K}$ is the set of arithmetic circuits and $\mathcal{S}_L$ is the set of linear constraint systems. As usual, we use infix notation, writing $(C, S_L) \Rightarrow (C', S_L')$, and denote by $\Rightarrow^+$ and $\Rightarrow^*$ its transitive and reflexive-transitive closure, respectively. Given a circuit $C$, if $(C, \emptyset) \Rightarrow^* (C', S_L)$, then $C'$ is a reduction for $C$, and the linear system $S_L$ shows how to prove that $C' \models_l C$. In the following, we assume that there exists a total order $<$ among the private signals in $V$, which is used to select a signal among the private signals of a constraint $c$, denoted by $V(c)$.

**Fig. 1.** Circuit transformation rules.

The remove rule allows us to remove redundant constraints. The deduce rule is needed to extract from $S$ linear relations among the signals. Finally, the simplify rule allows us to safely remove a signal $s$ from $V$ by replacing it by an equivalent linear combination of public and (strictly) smaller private signals in $S$. The fact that we replace a private signal by strictly smaller ones prevents this rule from being applied infinitely many times. When no constraint or private signal can be removed from a circuit (e.g., from $C_3$) after applying a sequence of reduction rule steps, the circuit is considered *irreducible* and we call it a *normal form*. Note that the linear constraints in $S_L$ with signals not belonging to $U \cup V'$ are the ones that track how to obtain the missing signals from the remaining ones.

The three rules from Fig. 1 are terminating, and they are contained in the circuit reducibility relation (Definition 3) when projected to the first component (the circuit). Regarding confluence, if $(C, S_L) \Rightarrow^* (C_1, S_{L_1})$ and $(C, S_L) \Rightarrow^* (C_2, S_{L_2})$, then we have that $(C_1, S_{L_1}) \Rightarrow^* (C_1', S_{L_1}')$ and $(C_2, S_{L_2}) \Rightarrow^* (C_2', S_{L_2}')$ such that $C_1'$ and $C_2'$ are equivalent (see Appendix).

*Example 4.* Let us apply our reduction system to find a normal form of $(C_1, \emptyset)$, which corresponds to its reduction. At each step, we label the arrow with the applied rule and show only the component that is modified with respect to the previous step (we write $\_$ to indicate that a component keeps its value from the previous step):

$$\begin{aligned}
((U, V, S_1),\ \emptyset) &\overset{\mathsf{deduce}}{\Rightarrow} ((\_, \_, \_),\ \{L_1 : z = y - 2\}) \\
&\overset{\mathsf{simplify}}{\Rightarrow} ((\_, \ \_ \setminus \{z\},\ \_[z \mapsto y - 2]),\ \_) \\
&\overset{\mathsf{remove}}{\Rightarrow} ((\_, \_, \ \_ \setminus \{0 = 0\}),\ \_) \\
&\overset{\mathsf{deduce}}{\Rightarrow} ((\_, \_, \_),\ \_ \cup \{L_2 : x = w - 1\}) \\
&\overset{\mathsf{simplify}}{\Rightarrow} ((\_, \ \_ \setminus \{x\},\ \_[x \mapsto w - 1]),\ \_) \\
&\overset{\mathsf{remove}}{\Rightarrow} ((\_, \_, \ \_ \setminus \{Q : w \times (2y - 2) - 4w - 6 = 0\}),\ \_)
\end{aligned}$$

Here, $(C_3, \{L_1, L_2\})$ is a normal form of $(C_1, \emptyset)$ and, as we have already seen in Example 3, $C_3$ is a reduction for $C_1$. Note that $\{L_1, L_2\}$ shows how to obtain the values of the removed signals as linear combinations.
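The simplify and remove steps above boil down to substituting a linear polynomial into the A, B, C parts of each constraint and discarding what becomes trivial. A sketch of this mechanics (our own encoding, with p = 13 for illustration; not the compiler's data structures):

```python
# Replay the simplify steps of Example 4: substitute z -> y - 2 and
# x -> w - 1 into the linear parts A, B, C of each constraint.
# Linear polynomials are dicts {var: coeff}; key '1' is the constant.

P = 13  # small prime for illustration

def substitute(poly, var, repl):
    """Replace `var` by the linear polynomial `repl` inside `poly`."""
    out = {v: c for v, c in poly.items() if v != var}
    coeff = poly.get(var, 0)
    for v, c in repl.items():
        out[v] = (out.get(v, 0) + coeff * c) % P
    return {v: c % P for v, c in out.items() if c % P != 0}

def apply(Q, var, repl):
    return tuple(substitute(part, var, repl) for part in Q)

# Q1: w * (y + z) - (4x + 10) = 0 as a triple (A, B, C):
Q1 = ({'w': 1}, {'y': 1, 'z': 1}, {'x': 4, '1': 10})
# Q4: 1 * (y - z - 2) - 0 = 0:
Q4 = ({'1': 1}, {'y': 1, 'z': -1 % P, '1': -2 % P}, {})

z_to_y = {'y': 1, '1': -2 % P}   # L1: z = y - 2
x_to_w = {'w': 1, '1': -1 % P}   # L2: x = w - 1

# After z -> y - 2, Q4 becomes the tautology 1 * 0 - 0 = 0 (remove rule):
assert apply(Q4, 'z', z_to_y)[1] == {}

# After both substitutions, Q1 becomes w * (2y - 2) - (4w + 6) = 0:
A, B, C = apply(apply(Q1, 'z', z_to_y), 'x', x_to_w)
assert (A, B, C) == ({'w': 1}, {'y': 2, '1': -2 % P}, {'w': 4, '1': 6})
```

The resulting constraint $w \times (2y - 2) - (4w + 6) = 0$ is twice $wy - 3w - 3 = 0$, so the final remove step of the derivation applies.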

# **4 Circuit Reduction Using Constraint Simplification**

In this section, we introduce different strategies to apply the transformation rules described in Fig. 1, and also to approximate the deduction relation $S \models c$ in the rules remove and deduce. Note that the classical version of our problem is undecidable, but since we work in a finite field, it becomes decidable. However, as the order of $\mathbb{F}$ is large, deciding it exactly is still impractical and approximation is required.

As an example, let us show how the simplification techniques applied in ZoKrates and circom fit in our framework. In both languages, besides the removal of tautologies, all simplification steps are made using linear constraints that are part of the set of constraints. In particular, in a first step both languages handle the so-called *redefinitions* (i.e., constraints of the form $x = y$), and in a second step all the remaining linear constraints are eliminated by applying the necessary substitutions. In our framework, these simplification steps can be described as a sequence of deduce steps to obtain the linear constraints that will be applied as substitutions, followed by a sequence of simplify steps, and a sequence of remove steps to delete the tautologies obtained after the substitutions. The whole sequence can be repeated until no linear constraints are left in the circuit. The specific strategy followed to perform the sequence of deduce steps that yields the substitutions has a big impact on the efficiency of the process. For instance, circom considers all maximal clusters of linear constraints (sharing signals) in the system and then infers in one go all the substitutions to be applied for every cluster, using a lazy version of Gauss-Jordan elimination. This process can be very expensive when the number of constraints in the R1CS is very large (e.g., hundreds of millions in ZK-rollups like Hermez [20]).
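The cluster-wise elimination of linear constraints can be pictured as Gauss-Jordan elimination over F_p on the coefficient matrix of a cluster, where each pivot column yields one substitution. A minimal sketch (our own simplified version, not circom's lazy implementation; p = 13 for illustration):

```python
# Sketch of the linear-elimination step: Gauss-Jordan over F_p on a
# cluster of linear constraints, yielding one substitution per pivot.

P = 13

def gauss_jordan(rows, ncols):
    """Reduce rows (lists over F_P, last column = constant) to RREF."""
    rows = [r[:] for r in rows]
    pivots, r = [], 0
    for c in range(ncols - 1):
        piv = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        inv = pow(rows[r][c], P - 2, P)          # modular inverse of pivot
        rows[r] = [(x * inv) % P for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                f = rows[i][c]
                rows[i] = [(a - f * b) % P for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    return rows, pivots

# A cluster over signals (x, y): the redefinition x - y = 0 together
# with x + y - 4 = 0.  Rows encode a*x + b*y + c = 0 as [a, b, c].
rows, pivots = gauss_jordan([[1, -1 % P, 0], [1, 1, -4 % P]], 3)

# The RREF gives the substitutions x = 2 and y = 2:
assert rows[0][:2] == [1, 0] and (-rows[0][2]) % P == 2
assert rows[1][:2] == [0, 1] and (-rows[1][2]) % P == 2
```

Each reduced row `s_i + c = 0` directly provides a substitution `s_i -> -c` (more generally, a pivot signal in terms of the non-pivot ones), which is then applied by simplify.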

Similar techniques based on analyzing the linear constraints are applied in other circuit-design languages. However, to the best of our knowledge, no language uses the non-linear part of the circuit to infer new linear constraints or to remove redundant constraints, and this constitutes the second main contribution of this work. In the remainder of this section, we present a new approach inspired by techniques used in program analysis and SMT solving, like [9,11], where non-linear reasoning is reduced to linear reasoning. We can assume that we have first applied the aforementioned strategies to obtain an R1CS containing only non-linear constraints (or linear constraints with only public signals). Then, in our framework, the problem of inferring new linear constraints from a non-linear R1CS can be formalized as a synthesis problem as follows: "*given a circuit* $(U, V, S)$*, where* $U \cup V = \{s_1, \ldots, s_n\}$*, our goal is to find a linear expression* $l = c_0 + c_1 s_1 + \ldots + c_n s_n$ *with* $c_0, c_1, \ldots, c_n \in \mathbb{F}$ *such that* $S \models l = 0$*.*" In order to solve this problem, we follow an efficient approach in which we restrict ourselves to the case where $l = 0$ can be expressed as a linear combination of constraints in $S$, i.e., of the form $\sum_k \lambda_k * Q_k$ with $Q_k \in S$ and $\lambda_k \in \mathbb{F}$. It is clear that any constraint $l = 0$ obtained using this approach satisfies $S \models l = 0$, but we are only interested in the ones that are linear. In the following two stages, we describe how to obtain the linear expressions $l$, and hence infer the constraints.

**Stage 1.** First, for each constraint $Q_k : A_k \times B_k - C_k = 0$, $k \in \{1, \ldots, m\}$, we expand the multiplication $A_k \times B_k$, obtaining the expression $\sum_{1 \le i \le j \le n} Q_k[i, j] * s_i s_j + L_k$, where $Q_k[i, j]$ for $1 \le i \le j \le n$ denotes the coefficient of the monomial $s_i s_j$ in the constraint $Q_k$, and $L_k$ is the linear part of $A_k \times B_k$.

*Example 5.* Let us consider the circuit from Example 4 after applying the first three transformation rules, i.e., after removing the linear constraints. We denote the resulting circuit $C_4 = (U, V_4, S_4)$, where $U \cup V_4 = \{v, w, x, y\}$ and $S_4$ is given by:

$$\begin{aligned} Q\_1 &: w \times (2y - 2) - 4x - 10 = 0, \ Q\_2 : w \times (y - 2) - w - 3 = 0, \\ Q\_3 &: (x - w + 1) \times v - v + 1 = 0. \end{aligned}$$

Here, for $Q_1$ we have $A_1 = w$, $B_1 = 2y - 2$ and $C_1 = 4x + 10$ (recall that we consider $A_1 \times B_1 - C_1 = 0$). Expanding the multiplication $A_1 \times B_1 = 2wy - 2w$, we obtain $L_1 = -2w$ and $Q_1[2, 4] = 2$ (for $wy$), the latter being the only non-zero coefficient of a quadratic monomial. Similarly, for $Q_2$ we have $C_2 = w + 3$, $Q_2[2, 4] = 1$ (also for $wy$) and $L_2 = -2w$. Finally, for $Q_3$ we have $C_3 = v - 1$, $Q_3[1, 3] = 1$ (for $vx$), $Q_3[1, 2] = -1$ (for $vw$) and $L_3 = v$.
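Stage 1 is a routine polynomial expansion. The sketch below (our own code, with p = 13 for illustration) reproduces the coefficients computed for Q1 in Example 5:

```python
# Stage 1: expand A_k * B_k of a constraint into quadratic coefficients
# Q_k[i,j] plus a linear part L_k.  Linear polynomials are dicts
# index -> coeff, where index 0 is the constant term.  (Sketch.)

P = 13  # small prime for illustration

def expand(A, B):
    """Return (quadratic coeffs {(i,j): c} with i <= j, linear part)."""
    quad, lin = {}, {}
    for i, a in A.items():
        for j, b in B.items():
            c = (a * b) % P
            if i == 0 or j == 0:                 # monomial of degree <= 1
                k = max(i, j)
                lin[k] = (lin.get(k, 0) + c) % P
            else:
                key = (min(i, j), max(i, j))     # s_i * s_j with i <= j
                quad[key] = (quad.get(key, 0) + c) % P
    return ({k: v for k, v in quad.items() if v},
            {k: v for k, v in lin.items() if v})

# Q1 of Example 5 over signals (v, w, x, y) with indices (1, 2, 3, 4):
# A1 = w and B1 = 2y - 2.
quad, lin = expand({2: 1}, {4: 2, 0: -2 % P})
assert quad == {(2, 4): 2}      # Q1[2,4] = 2, the monomial w*y
assert lin == {2: -2 % P}       # L1 = -2w
```

Running `expand` on A2 = w, B2 = y - 2 and on A3 = x - w + 1, B3 = v recovers the remaining coefficients stated in Example 5 in the same way.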

**Stage 2.** Now, we can model a sufficient condition for linearity using the previous ingredients: if there exist $\lambda_1, \ldots, \lambda_m \in \mathbb{F}$ such that, for every $i, j$ with $1 \le i \le j \le n$, we have $\sum_{k=1}^{m} \lambda_k * Q_k[i, j] = 0$, then $l = \sum_{k=1}^{m} \lambda_k * (L_k - C_k)$ is linear and $S \models l = 0$. Moreover, assuming that $S$ is consistent, either $l = 0$ is the tautology $0 = 0$ or it is a non-trivial linear constraint. In the first case, any of the constraints $Q_k$ with $\lambda_k \neq 0$ follows from the rest of the constraints, and we can apply the remove rule. In the second case, we can apply deduce and later simplify if $l$ contains at least one private signal. Note that, after applying simplify, one of the constraints $Q_k$ with $\lambda_k \neq 0$ will follow from the rest, and we will finally be able to apply remove.

*Example 6 (continued).* Following the example, we need to find $\lambda_1, \lambda_2, \lambda_3$ such that (considering only the non-zero coefficients $Q_k[i, j]$) $2\lambda_1 + \lambda_2 = 0$ (for $Q[2, 4]$), $\lambda_3 = 0$ (for $Q[1, 3]$), and $-\lambda_3 = 0$ (for $Q[1, 2]$). Since the monomials $vx$ and $vw$ occur only once, the only solution for $\lambda_3$ is $0$. Now, solving $2\lambda_1 + \lambda_2 = 0$, we get $\lambda_2 = -2\lambda_1$; hence, we take the solution $\lambda_1 = 1$ and $\lambda_2 = -2$. With this solution, $l = 1 * (-2w - (4x + 10)) + (-2) * (-2w - (w + 3)) + 0 * (v - (v - 1))$. Hence, we obtain $4w - 4x - 4 = 0$, which is equivalent to $x - w + 1 = 0$, the deduced linear constraint used in Example 4 to reduce the original circuit.

To conclude, finding $\lambda_1, \ldots, \lambda_m \in \mathbb{F}$ such that $\sum_{k=1}^{m} \lambda_k * Q_k[i, j] = 0$ for every $i, j$ with $1 \le i \le j \le n$ is a linear problem that can be solved using Gaussian elimination or similar techniques. Note that we are only interested in solutions with at least one $\lambda_k \neq 0$. Therefore, we can efficiently synthesize new linear constraints, or show that some constraint follows from the others, using this approach.
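Putting both stages together on Example 5, the search for the λ's is a nullspace computation modulo p. The sketch below is our own code: a tiny brute-force search stands in for Gaussian elimination (fine for three unknowns), and p = 13 for illustration. It rediscovers the constraint x - w + 1 = 0:

```python
# Stage 2 on Example 5: find lambda_1..lambda_3 killing every quadratic
# monomial, then build l = sum_k lambda_k * (L_k - C_k).  (Sketch; a
# real implementation uses Gaussian elimination instead of brute force.)
from itertools import product

P = 13

# One row per quadratic monomial; columns are the constraints Q1..Q3.
#                Q1  Q2  Q3
MONOMIALS = {
    'wy': [2,  1,  0],
    'vx': [0,  0,  1],
    'vw': [0,  0, -1 % P],
}

def nullspace_vector(rows, ncols):
    """Brute-force a non-zero lambda vector killing every row."""
    for lam in product(range(P), repeat=ncols):
        if any(lam) and all(sum(a * c for a, c in zip(lam, row)) % P == 0
                            for row in rows):
            return lam
    return None

lam = nullspace_vector(list(MONOMIALS.values()), 3)
assert lam is not None and lam[2] == 0     # vx, vw force lambda_3 = 0

# L_k - C_k for Q1..Q3 as dicts signal -> coeff ('1' is the constant):
LC = [{'w': -2 % P, 'x': -4 % P, '1': -10 % P},   # -2w - (4x + 10)
      {'w': -3 % P, '1': -3 % P},                 # -2w - (w + 3)
      {'1': 1}]                                   # v - (v - 1)
l = {}
for k, d in zip(lam, LC):
    for v, c in d.items():
        l[v] = (l.get(v, 0) + k * c) % P
l = {v: c for v, c in l.items() if c}

# l is a non-zero multiple of x - w + 1 = 0, the constraint of Example 6:
inv = pow(l['x'], P - 2, P)
assert {v: (c * inv) % P for v, c in l.items()} == {'x': 1, 'w': -1 % P, '1': 1}
```

With p = 13 the search returns λ = (1, 11, 0), i.e., (1, -2, 0), and l equals 4w - 4x - 4 up to the modulus, exactly as derived by hand in Example 6.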

Regarding the practical application of our technique, since we are sometimes handling very large sets of non-linear constraints, additional engineering is needed to make it work. For instance, we first remove those constraints that contain a quadratic monomial appearing in no other constraint, and then compute maximal clusters sharing the same quadratic monomials. We have observed in our experimental evaluation that, in general, even for large circuits, each cluster remains small. Thanks to this, we obtain rather small independent sets of constraints that can be solved in parallel using Gaussian elimination.
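The clustering step can be sketched with a union-find over shared quadratic monomials (our own illustrative code, not the implementation in the tool):

```python
# Cluster constraints that share quadratic monomials (union-find), so
# that Gaussian elimination can run on each cluster independently.

def clusters(constraint_monomials):
    """constraint_monomials: list of sets of quadratic monomials."""
    parent = list(range(len(constraint_monomials)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    owner = {}   # monomial -> representative constraint seen first
    for k, monos in enumerate(constraint_monomials):
        for m in monos:
            if m in owner:
                parent[find(k)] = find(owner[m])   # union the two groups
            else:
                owner[m] = k
    groups = {}
    for k in range(len(constraint_monomials)):
        groups.setdefault(find(k), set()).add(k)
    return sorted(sorted(g) for g in groups.values())

# Example 5: Q1 and Q2 share the monomial wy; Q3 is on its own.
assert clusters([{'wy'}, {'wy'}, {'vx', 'vw'}]) == [[0, 1], [2]]
```

On the circuit of Example 5 this yields the cluster {Q1, Q2} (sharing wy) and the singleton {Q3}, so the nullspace computation of Stage 2 only needs to consider the first cluster.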

# **5 Experimental Results**

This section describes our experimental evaluation in two settings. On the one hand (Sect. 5.1), we have implemented our techniques within circom [21], a novel domain-specific language and compiler for defining arithmetic circuits, fully written in Rust. The circom compiler generates executable code (WebAssembly or C++) to compute the witness, together with the R1CS, since both are later needed by ZK tools to produce ZK proofs. The implementation is available in a public fork of the compiler [1]. On the other hand (Sect. 5.2), we have decoupled the constraint optimization module from the circom compiler in a new project, which is accessible online [2], in order to be able to use it after other cryptographic-language compilers that produce R1CS, in our case ZoKrates [12]. ZoKrates is a high-level language that allows the programmer to abstract away the technicalities of building arithmetic circuits. The input to our optimizer is the R1CS in the smtlib2 format generated by ZoKrates. The goal of our experiments is twofold: (1) assess the scalability of the approach when applied to real-world circuits used in industry, and (2) evaluate its impact on code that is already highly optimized (such as circom's libraries, developed in a low-level language by experienced programmers) and on code automatically compiled from a high-level language such as ZoKrates. In both cases, the optimizations of linear constraints that the compilers include (see Sect. 4) are enabled, so that the reduction gains are due only to our optimization. Experimental results have been obtained using an AMD Ryzen Threadripper PRO 3995WX 64-core processor with 512 GB of RAM (Linux kernel Debian 5.10.70-1).

#### **5.1 Results on** circom **Circomlib**

circom is a modular language that allows the definition of parameterizable small circuits called "templates" and has its own library called circomlib [22]. This library is widely used for cryptographic purposes and contains hundreds of templates such as comparators, hash functions, digital signatures, binary and decimal converters, and many more. Our experiments have been performed on the available test cases from circomlib. Many of them have been carefully programmed by experienced cryptographers to avoid unnecessary non-linear constraints, so our optimization cannot deduce new linear constraints in them. Still, we are able to reduce 26% of the total tests (12 out of 46).

Table 1 shows the results for the five circuits that we optimize the most. For each of them, we show: (**#C**) the number of generated constraints, (**#R**) the number of removed constraints, (**G%**) the gains expressed as **#R**/**#C** x 100, and (**T(s)**) the compilation time. The largest gain is for pointbits loopback, where circom generates 2,333 constraints and we remove 381 of them; our gain is 16.33% and the compilation time is 13.4 s. As explained in Sect. 4, for each linear constraint deduced by our technique, we are always able to remove a non-linear constraint and, in general, also a signal. Note that we sometimes produce new linear constraints in which all the involved signals are public, and thus none of them can be removed. Importantly, in spite of the manual simplifications already made in most of the circuits in circomlib, our techniques detect further redundant constraints in a short time. Such small reductions in templates of circomlib can produce larger gains, since they are repeatedly used as subcomponents in industrial circuits.

**Table 1.** Results on circomlib.

#### **5.2 Results on** ZoKrates **Stdlib**

We have used two kinds of circuits from the ZoKrates stdlib for our experimental evaluation: (1) The first four circuits, shaXbit, are implementations of different SHA-2 hash functions [19], where X indicates the size of the output. SHA-2 hashes are constructed from the repeated use of simple computation units that heavily use bit operations. Bit operations are very inefficient inside arithmetic circuits [13] and, as a result, the number of constraints describing these circuits is very large, as shown in Table 2. The number of constraints deduced is quite low for this kind of circuit, since specialized optimization for bitwise operations is required (other compilers, like xJsnark [23], are specialized in this). This also happens in the circom implementation of SHA-256-2 (row 1 of Table 1). However, Poseidon [16] is a recent hash function that was designed taking into account the nature of arithmetic circuits over a prime field $\mathbb{F}$, and as a result, the function can be described with many fewer constraints. Our approach is able to optimize the current implementation of Poseidon by more than 20%, which represents a very significant reduction. (2) The second kind are the last four circuits: they correspond to the groundwork for implementing elliptic curve cryptography inside circuits. Our optimizer detects, in negligible time, that more than 23% of their constraints are redundant and can be removed. Verifying whether a pair of public/private keys matches (ProofOfOwnership) is fundamental in almost every security setting; hence, the optimization of this circuit becomes particularly relevant for saving blockchain space. For this reason, we have parameterized ProofOfOwnership by the number of public/private key pairs to be verified, and we have measured the performance impact (time and memory consumption) of the snarkjs setup step for these circuits without simplification (Table 3) and after simplification (Table 4). The results show the effect of our reduction when the constraints are later used by snarkjs to produce ZK proofs.

**Table 2.** Results on stdlib.


**Table 3.** Results on different instantiations of ProofOfOwnership from stdlib without non-linear simplification. The ERROR in the last row is an out-of-memory error.

**Table 4.** Results on different instantiations of ProofOfOwnership from stdlib with non-linear simplification.


The impact of our simplification on the setup step of snarkjs is relevant and goes beyond offsetting the increase in compilation time. However, this step is applied only once. We have also measured the impact on performance when generating a ZK proof for a given witness using snarkjs after the setup step; this is the action that is repeated many times in a real context. Our experiments show that, e.g., with ProofOfOwnership-400 we improve from 41 s to 35 s, and with ProofOfOwnership-1000 we improve from 1 m 53 s to 1 m 12 s.

In conclusion, our experiments show that the higher the level of abstraction, the more redundant constraints the compiler introduces in the R1CS. Our proposed techniques are an efficient and effective solution to enhance the performance in this setting. On the other hand, circuits written in a low-level language by security experts (usually optimized by hand), or circuits using bitwise operations, leave little room for optimization by our techniques.

# **6 Related Work and Conclusions**

We have proposed the application of (non-linear) constraint reasoning techniques to the new application domain of ZK protocols. Our approach has wide applicability since, in the last few years, much effort has been put into developing new programming languages that enable the generation and verification of ZK proofs and that also focus on the design of arithmetic circuits and the constraint encoding. Among the different solutions, we can distinguish: libraries (bellman [7], libsnark [29], snarky [28]), programming-focused languages (ZoKrates [12], xJsnark [23], zinc [24], Leo [10]), and hardware-description languages (circom). As opposed to the initial library approach, both programming and hardware-description languages put the focus on the design of arithmetic circuits and the constraint encoding. In this regard, ZoKrates, xJsnark, and the circom compiler implement one simple but powerful R1CS-specific optimization called *linearity reduction*: it consists of substituting the linear constraints to generate a new circuit whose system only consists of non-linear constraints. However, they do not deduce new constraints to detect further redundancies in the system. Linearity reduction is a particular case of our reduction rules in which the only linear constraints that can be deduced and added to the linear system are those that follow from linear constraints already present in the constraint system. On the other side, the constraint system generated by Leo is only optimized at the level of its intermediate representation, not at the R1CS level, at which our method works.

Finally, there has been a joint effort towards standardization and interoperability between different programs, such as CirC [26], an infrastructure for building compilers to logical constraint representations. Currently, CirC only applies the linearity reduction explained above. Recently, an interface called zkInterface [6] has been built to improve the interoperability among several frontends, like ZoKrates and snarky, which provide means to express statements in a high-level language and compile them into an R1CS representation, and several backends implementing ZK protocols, like Groth16 [17] and Pinocchio [27], which use the R1CS representation to produce ZK proofs. zkInterface could benefit from our optimization by applying our reduction to every circuit generated by any of the accepted frontends. Since zkInterface is also written in Rust, our optimizer could easily be integrated as a new gadget for the tool in the future. Finally, we believe that the techniques presented in this paper can lead to new reduction schemes applicable to PlonK [14] constraint systems.

**Acknowledgements.** This research was funded by the Spanish MCIN-AEI-10.13039/501100011033-FEDER "Una manera de hacer Europa" projects RTI2018-094403-B-C31 and RTI2018-094403-B-C33, by the CM project S2018/TCS-4314 co-funded by EIE Funds of the European Union, and by the project RTI2018-102112-B-100 (AEI/FEDER, UE).


# **Formal Methods for Hardware, Cyber-physical, and Hybrid Systems**

# Oblivious Online Monitoring for Safety LTL Specification via Fully Homomorphic Encryption

Ryotaro Banno<sup>1</sup>(B), Kotaro Matsuoka<sup>1</sup>, Naoki Matsumoto<sup>1</sup>, Song Bian<sup>2</sup>, Masaki Waga<sup>1</sup>, and Kohei Suenaga<sup>1</sup>

<sup>1</sup> Kyoto University, Kyoto, Japan banno@fos.kuis.kyoto-u.ac.jp
<sup>2</sup> Beihang University, Beijing, China

Abstract. In many Internet of Things (IoT) applications, data sensed by an IoT device are continuously sent to the server and monitored against a specification. Since the data often contain sensitive information, and the monitored specification is usually proprietary, both must be kept private from the other end. We propose a protocol to conduct *oblivious online monitoring*—online monitoring conducted without revealing the private information of each party to the other—against a safety LTL specification. In our protocol, we first convert a safety LTL formula into a DFA and conduct online monitoring with the DFA. Based on *fully homomorphic encryption (FHE)*, we propose two online algorithms (Reverse and Block) to run a DFA obliviously. We prove the correctness and security of our entire protocol. We also show the scalability of our algorithms theoretically and empirically. Our case study shows that our algorithms are fast enough to monitor blood glucose levels online, demonstrating our protocol's practical relevance.

### 1 Introduction

Internet of Things (IoT) [3] devices enable various service providers to monitor personal data of their users and to provide useful feedback to the users. For example, a smart home system can save lives by raising an alarm when a gas stove is left on to prevent a fire. Such a system is realized by the continuous monitoring of the data from the IoT devices in the house [8,18]. Another application of IoT devices is medical IoT (MIoT) [16]. In MIoT applications, biological information, such as electrocardiograms or blood glucose levels, is monitored, and the user is notified when an abnormality is detected (such as arrhythmia or hyperglycemia).

In many IoT applications, monitoring must be conducted *online*, i.e., a stream of sensed data is continuously monitored, and the violation of the monitoring specification must be reported even before the entire data are obtained. In the smart home and MIoT applications, online monitoring is usually required, as continuous sensing is crucial for immediately notifying emergency responders, such as police officers or doctors, of ongoing abnormal situations.

Fig. 1. The proposed oblivious online LTL monitoring protocol.

Fig. 2. How our algorithms consume the data d_1, d_2, ..., d_n with the DFA M.

As specifications generally contain proprietary information or sensitive parameters learned from private data (e.g., with specification mining [27]), *the specifications must be kept secret*. One approach to preserving this privacy is to adopt the client-server model for the monitoring system. In such a model, the sensing device sends the collected data to a server, where the server performs the necessary analyses and returns the results to the device. Since the client does not have access to the specification, the server's privacy is preserved.

However, the client-server model does *not* inherently protect the client's privacy from the servers, as the data collected from and results sent back to the users are revealed to the servers in this model; that is to say, a user has to *trust* the server. This trust is problematic if, for example, the server itself intentionally or unintentionally leaks sensitive data of device users to an unauthorized party. Thus, we argue that a monitoring procedure should achieve the following goals:

**Online Monitoring.** The monitored data need not be known beforehand.
**Client's Privacy.** The server shall not know the monitored data and results.
**Server's Privacy.** The client shall not know what property is monitored.

We call a monitoring scheme with these properties *oblivious online monitoring*. By an oblivious online monitoring procedure, 1) a user can get a monitoring result hiding her sensitive data and the result itself from a server, and 2) a server can conduct online monitoring hiding the specification from the user.

*Contribution.* In this paper, we propose a novel protocol (Fig. 1) for oblivious online monitoring against a specification in *linear temporal logic (LTL)* [33]. More precisely, we use a *safety LTL formula* [26] as a specification, which can be translated to a deterministic finite automaton (DFA) [36]. In our protocol, we first convert a safety LTL formula into a DFA and conduct online monitoring with the DFA. For online and oblivious execution of a DFA, we propose two algorithms based on *fully homomorphic encryption* (FHE). FHE allows us to evaluate an arbitrary function over ciphertexts, and there is an FHE-based algorithm to evaluate a DFA obliviously [13]. However, this algorithm is only *leveled* homomorphic, i.e., the FHE parameters depend on the number of the monitored ciphertexts, and it is thus not applicable to online monitoring.

In this work, we first present a *fully* homomorphic *offline* DFA evaluation algorithm (Offline) by extending the leveled homomorphic algorithm in [13]. Although we can remove the parameter dependence using this method, Offline consumes the ciphertexts from back to front (Fig. 2a). As a result, Offline is still limited to offline usage only. To truly enable online monitoring, we propose two new algorithms based on Offline: Reverse and Block. In Reverse, we *reverse* the DFA and apply Offline to the reversed DFA (Fig. 2b). In Block, we split the monitored ciphertexts into fixed-length *blocks* and process each block sequentially with Offline (Fig. 2c). We prove that both of the algorithms have *linear* time complexity and *constant* space complexity to the length of the monitored ciphertexts, which guarantees the scalability of our entire protocol.
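The block-splitting idea behind Block can be illustrated with a small plaintext sketch (no encryption involved; all function and variable names here are ours, not the paper's): each block of B symbols is summarized as a state-to-state map over the currently reachable states, and the DFA's verdict is reported after every completed block.

```python
def block_monitor(states, delta, q0, final, word, B):
    """Plaintext sketch of the Block idea: split the input into blocks of
    B symbols, summarize each block as a state-to-state map over the
    currently reachable states, and report M's verdict after every block."""
    reachable, q, verdicts = {q0}, q0, []
    for start in range(0, len(word) - len(word) % B, B):
        block = word[start:start + B]
        # Compose the transition function over this block, restricted
        # to the states reachable at the block boundary.
        step = {s: s for s in reachable}
        for a in block:
            step = {s: delta[(t, a)] for s, t in step.items()}
        q = step[q]                       # state after this block
        reachable = set(step.values())    # states reachable at next boundary
        verdicts.append(1 if q in final else 0)
    return verdicts

# Parity DFA: accepts words with an odd number of 1s (illustrative example).
parity_delta = {("even", 0): "even", ("even", 1): "odd",
                ("odd", 0): "odd", ("odd", 1): "even"}
```

In the actual protocol the per-block computation runs over ciphertexts via Offline; this sketch only shows why constant state (the reachable set and one current state summary) suffices per block.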

On top of our online algorithms, we propose a protocol for oblivious online LTL monitoring. We assume that the client is *malicious*, i.e., the client can deviate arbitrarily from the protocol, while the server is *honest-but-curious*, i.e., the server honestly follows the protocol but tries to learn the client's private data by exploiting the obtained information. We show that the privacy of both parties can be protected under the standard IND-CPA security of FHE schemes with the addition of *shielded randomness leakage* (SRL) security [10,21].

We implemented our algorithms for DFA evaluation in C++20 and evaluated their performance. Our experiment results confirm the scalability of our algorithms. Moreover, through a case study on blood glucose level monitoring, we also show that our algorithms run fast enough for online monitoring, i.e., our algorithms are faster than the sampling interval of the current commercial devices that sample glucose levels.

Our contributions are summarized as follows:


*Related Work.* There are various works on DFA execution without revealing the monitored data (See Table 1 for a summary). However, to our knowledge, there is no existing work achieving all of our three goals (i.e., *online monitoring*, *privacy of the client*, and *privacy of the server* ) simultaneously. Therefore, none of them is applicable to oblivious online LTL monitoring.

Homomorphic encryption, which we also utilize, has been used to run a DFA obliviously [13,25]. Among different homomorphic encryption schemes, our algorithm is based on the algorithm in [13]. Although these algorithms guarantee the *privacy of the client* and the *privacy of the server*, all of the homomorphic-encryption-based algorithms are limited to offline DFA execution and do not achieve *online monitoring*. We note that the extension of [13] for online DFA execution is one of our technical contributions.

Table 1. Related work on DFA execution with *privacy of the client*.

In [1], the authors propose an LTL runtime verification algorithm without revealing the monitored data to the server. They propose both offline and online algorithms to run a DFA converted from a safety LTL formula. The main issue with their online algorithm is that the DFA running on the server must be revealed to the client, and the goal of *privacy of the server* is not satisfied.

*Oblivious DFA evaluation (ODFA)* [9,20,22,31,35,37] is a technique to run a DFA on a server while keeping the DFA secret from the client and the monitored data secret from the server. Although the structure of the DFA is not revealed to the client, the client has to know the number of states. Consequently, the goal *privacy of the server* is satisfied *only partially*. Moreover, to the best of our knowledge, none of the ODFA-based algorithms support online DFA execution. Therefore, the goal *online monitoring* is not satisfied.

*Organization.* The rest of the paper is organized as follows: In Sect. 2, we overview LTL monitoring (Sect. 2.1), the FHE scheme we use (Sect. 2.2), and the leveled homomorphic offline algorithm (Sect. 2.3). Then, in Sect. 3, we explain our fully homomorphic offline algorithm (Offline) and two online algorithms (Reverse and Block). We describe the proposed protocol for oblivious online LTL monitoring in Sect. 4. After we discuss our experimental results in Sect. 5, we conclude our paper in Sect. 6.

### 2 Preliminaries

*Notations.* We denote the set of all nonnegative integers by N, the set of all positive integers by N⁺, and the set {0, 1} by B. Let X be a set. We write 2^X for the powerset of X, X^* for the set of finite sequences of X elements, and X^ω for the set of infinite sequences of X elements. For u ∈ X^ω, we write u_i ∈ X for the i-th element (0-based) of u, u_{i:j} ∈ X^* for the subsequence u_i, u_{i+1}, ..., u_j of u, and u_{i:} ∈ X^ω for the suffix of u starting from its i-th element. For u ∈ X^* and v ∈ X^* ∪ X^ω, we write u · v for the concatenation of u and v.

*DFA.* A deterministic finite automaton (DFA) is a 5-tuple (Q, Σ, δ, q_0, F), where Q is a finite set of states, Σ is a finite alphabet, δ : Q × Σ → Q is a transition function, q_0 ∈ Q is an initial state, and F ⊆ Q is a set of final states. If the alphabet of a DFA is B, we call it a *binary* DFA. For a state q ∈ Q and a word w = σ_1σ_2...σ_n, we define δ(q, w) := δ(...δ(δ(q, σ_1), σ_2), ..., σ_n). For a DFA M and a word w, we write M(w) := 1 if M accepts w; otherwise, M(w) := 0. We also abuse the above notations for nondeterministic finite automata (NFAs).
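The 5-tuple definition above translates directly into code. The following is a minimal plaintext sketch (class and example names are ours): `run` implements the extended transition function δ(q, w), and `accepts` implements M(w).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DFA:
    """A DFA (Q, Sigma, delta, q0, F) with delta given as a dict."""
    states: frozenset
    alphabet: frozenset
    delta: dict          # (state, symbol) -> state
    q0: object
    final: frozenset

    def run(self, q, word):
        """Extended transition function delta(q, w)."""
        for sigma in word:
            q = self.delta[(q, sigma)]
        return q

    def accepts(self, word):
        """M(w) := 1 if M accepts w; otherwise 0."""
        return 1 if self.run(self.q0, word) in self.final else 0

# Example binary DFA: accepts words with an odd number of 1s.
odd_ones = DFA(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({0, 1}),
    delta={("even", 0): "even", ("even", 1): "odd",
           ("odd", 0): "odd", ("odd", 1): "even"},
    q0="even",
    final=frozenset({"odd"}),
)
```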

#### 2.1 LTL

We use *linear temporal logic (LTL)* [33] to specify the monitored properties. The following BNF defines the syntax of LTL formulae: φ, ψ ::= ⊤ | p | ¬φ | φ ∧ ψ | Xφ | φUψ, where φ and ψ range over LTL formulae and p ranges over a set AP of atomic propositions.

An LTL formula asserts a property of u ∈ (2^AP)^ω. The sequence u expresses an execution trace of a system; u_i is the set of the atomic propositions satisfied at the i-th time step. Intuitively, ⊤ represents an always-true proposition; p asserts that u_0 contains p, and hence p holds at the 0-th step in u; ¬φ is the negation of φ; and φ ∧ ψ is the conjunction of φ and ψ. The temporal proposition Xφ expresses that φ holds from the next step (i.e., u_{1:}); φUψ expresses that ψ holds eventually and φ continues to hold until then. We write ⊥ for ¬⊤; φ ⟹ ψ for ¬φ ∨ ψ; φ ∨ ψ for ¬(¬φ ∧ ¬ψ); Fφ for ⊤Uφ; Gφ for ¬(F¬φ); and, with X^n denoting n occurrences of X,

$$\mathsf{G}_{[n,m]}\phi \;\text{ for }\; \underbrace{\mathsf{X}\cdots\mathsf{X}}_{n}(\phi\land\mathsf{X}(\phi\land\cdots\land\mathsf{X}\phi)) \qquad \mathsf{F}_{[n,m]}\phi \;\text{ for }\; \underbrace{\mathsf{X}\cdots\mathsf{X}}_{n}(\phi\lor\mathsf{X}(\phi\lor\cdots\lor\mathsf{X}\phi)),$$

where the nested part contains m − n occurrences of X, i.e., G_{[n,m]}φ (resp. F_{[n,m]}φ) asserts that φ holds at every (resp. some) step in the interval [n, m].

We formally define the semantics of LTL below. Let <sup>u</sup> <sup>∈</sup> (2AP)<sup>ω</sup>, <sup>i</sup> <sup>∈</sup> <sup>N</sup>, and <sup>φ</sup> be an LTL formula. We define the relation u, i <sup>|</sup><sup>=</sup> <sup>φ</sup> as the least relation that satisfies the following:

$$\begin{aligned}
u, i \models \top &\qquad u, i \models p \overset{\text{def}}{\iff} p \in u_i \qquad u, i \models \neg\phi \overset{\text{def}}{\iff} u, i \not\models \phi\\
u, i \models \phi \land \psi &\overset{\text{def}}{\iff} u, i \models \phi \text{ and } u, i \models \psi \qquad u, i \models \mathsf{X}\phi \overset{\text{def}}{\iff} u, i + 1 \models \phi\\
u, i \models \phi\mathsf{U}\psi &\overset{\text{def}}{\iff} \text{there exists } j \ge i \text{ such that } u, j \models \psi \text{ and,}\\
&\phantom{\overset{\text{def}}{\iff}}\; \text{for any } k,\ i \le k < j \implies u, k \models \phi.
\end{aligned}$$

We write u |= φ for u, 0 |= φ and say u *satisfies* φ.
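The semantics above can be executed directly on *ultimately periodic* infinite words, i.e., words of the form u = prefix · cycle^ω, since satisfaction at positions beyond the prefix repeats with the cycle's period. The following sketch (our own encoding of formulae as tuples; not from the paper) bounds the search for an U-witness by one period past the prefix, which is sound by that periodicity.

```python
def evaluate(formula, prefix, cycle, i=0):
    """Decide u, i |= formula on the ultimately periodic word
    u = prefix . cycle^omega, each position a frozenset of atomic props."""
    ell, p = len(prefix), len(cycle)

    def letter(j):
        return prefix[j] if j < ell else cycle[(j - ell) % p]

    def holds(f, j):
        tag = f[0]
        if tag == "top":
            return True
        if tag == "ap":                      # atomic proposition
            return f[1] in letter(j)
        if tag == "not":
            return not holds(f[1], j)
        if tag == "and":
            return holds(f[1], j) and holds(f[2], j)
        if tag == "next":                    # X
            return holds(f[1], j + 1)
        if tag == "until":                   # U: witness k <= max(j, ell) + p - 1
            for k in range(j, max(j, ell) + p):
                if holds(f[2], k) and all(holds(f[1], m) for m in range(j, k)):
                    return True
            return False
        raise ValueError(tag)

    return holds(formula, i)

# G(not p) encoded via the derived operators: G phi = not(top U not phi),
# so G(not p) = not(top U p).
G_not_p = ("not", ("until", ("top",), ("ap", "p")))
```

For example, on the word ({p}) · (∅)^ω the formula G¬p fails (p holds at step 0), while on (∅)^ω it holds.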

In this paper, we focus on the *safety* fragment [26] of LTL (i.e., nothing bad happens). A finite sequence w ∈ (2^AP)^* is a *bad prefix* for an LTL formula φ if w · v ⊭ φ holds for any v ∈ (2^AP)^ω. For any bad prefix w, we cannot extend w to an infinite word that satisfies φ. An LTL formula φ is a *safety* LTL formula if any w ∈ (2^AP)^ω satisfying w ⊭ φ has a bad prefix for φ.

A *safety monitor* (or simply a *monitor*) is a procedure that takes w ∈ (2^AP)^ω and a safety LTL formula φ and generates an alert if w ⊭ φ. From the definition of safety LTL, it suffices for a monitor to detect a bad prefix of φ. It is known that, for any safety LTL formula φ, we can construct a DFA M_φ recognizing the set of the bad prefixes of φ [36], which can be used as a monitor.
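A monitor built this way is just an online run of the bad-prefix DFA: feed each observed letter to M_φ and alert as soon as a final state is reached. A plaintext sketch (function names and the example DFA are ours), using the bad-prefix DFA for G¬p, whose bad prefixes are exactly the prefixes containing a letter with p:

```python
def monitor(delta, q0, final, trace):
    """Online safety monitor: feed the trace letter by letter to the
    bad-prefix DFA and alert as soon as a bad prefix is detected.
    Returns the index where the bad prefix completes, or None."""
    q = q0
    for t, letter in enumerate(trace):
        q = delta[(q, letter)]
        if q in final:
            return t
    return None

# Bad-prefix DFA for G(not p): any prefix containing a letter with p is bad.
delta = {("ok", frozenset()): "ok",
         ("ok", frozenset({"p"})): "bad",
         ("bad", frozenset()): "bad",
         ("bad", frozenset({"p"})): "bad"}
```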

#### 2.2 Torus Fully Homomorphic Encryption

Homomorphic encryption (HE) is a form of encryption that enables us to apply operations to encrypted values *without decrypting them*. In particular, a type of HE, called Fully HE (FHE), allows us to evaluate arbitrary functions over encrypted data [11,19,23,24]. We use an instance of FHE called TFHE [13] in this work. We briefly summarize TFHE below; see [13] for a detailed exposition.

Table 2. Summary of TFHE ciphertexts, where N is a parameter of TFHE.

We are concerned with the following two-party secure computation, where the involved parties are a client (called Alice) and a server (called Bob): 1) Alice generates the keys used during computation; 2) Alice encrypts her plaintext messages into ciphertexts with her keys; 3) Alice sends the ciphertexts to Bob; 4) Bob conducts computation over the received ciphertexts and obtains the encrypted result *without decryption*; 5) Bob sends the encrypted results to Alice; 6) Alice decrypts the received results and obtains the results in plaintext.

Keys. There are three types of keys in TFHE: *secret key* SK, *public key* PK, and *bootstrapping key* BK. All of them are generated by Alice. PK is used to encrypt plaintext messages into ciphertexts, and SK is used to decrypt ciphertexts into plaintexts. Alice keeps SK private, i.e., the key is known only to herself but not to Bob. In contrast, PK is public and also known to Bob. BK is generated from SK and can be safely shared with Bob without revealing SK. BK allows Bob to evaluate the homomorphic operations (defined later) over the ciphertext.

Ciphertexts. Using the public key, Alice can generate three kinds of ciphertexts (Table 2): TLWE (Torus Learning With Errors), TRLWE (Torus Ring Learning With Errors), and TRGSW (Torus Ring Gentry-Sahai-Waters). Homomorphic operations provided by TFHE are defined over each of the specific ciphertexts. We note that different ciphertexts have different data structures, and their conversion can be time-consuming. Table 2 shows one such example.

In TFHE, different types of ciphertexts represent different plaintext messages. A TLWE ciphertext represents a Boolean value. In contrast, TRLWE represents a vector of Boolean values of length N, where N is a TFHE parameter. We can regard a TRLWE ciphertext as a vector of TLWE ciphertexts, and the conversion between a TRLWE ciphertext and a TLWE one is relatively easy. A TRGSW ciphertext also represents a Boolean value, but its data structure is quite different from TLWE, and the conversion from TLWE to TRGSW is slow.

TFHE provides different encryption and decryption functions for each type of ciphertext. We write Enc(x) for a ciphertext of a plaintext x; Dec(c) for the plaintext message for the ciphertext c. We abuse these notations for all three types of ciphertexts.

Besides, TFHE supports *trivial samples* of TRLWE. A trivial sample of TRLWE has the same data structure as a TRLWE ciphertext but is *not* encrypted, i.e., anyone can tell the plaintext message represented by the trivial sample. We denote by Trivial(n) a trivial sample of TRLWE whose plaintext message is (b_1, b_2, ..., b_N), where each b_i is the i-th bit in the binary representation of n.
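The plaintext content of Trivial(n) is just a bit decomposition of n padded to length N. A one-line sketch (our helper name; we assume a least-significant-bit-first ordering, which the text leaves implicit):

```python
def trivial_bits(n, N=8):
    """Plaintext content (b_1, ..., b_N) of Trivial(n): the i-th bit of the
    binary representation of n (assumed least-significant first), padded to N."""
    return [(n >> i) & 1 for i in range(N)]
```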

Homomorphic Operations. TFHE provides *homomorphic operations*, i.e., operations over ciphertexts without decryption. Among the operators supported by TFHE [13], we use the following ones.

CMux(d, **c**_true, **c**_false) : TRGSW × TRLWE × TRLWE → TRLWE
Given a TRGSW ciphertext d and TRLWE ciphertexts **c**_true, **c**_false, CMux outputs a TRLWE ciphertext **c**_result such that Dec(**c**_result) = Dec(**c**_true) if Dec(d) = 1, and otherwise Dec(**c**_result) = Dec(**c**_false).

SampleExtract(k, **c**) : N × TRLWE → TLWE
Let Dec(**c**) = (b_1, b_2, ..., b_N). Given k < N and a TRLWE ciphertext **c**, SampleExtract outputs a TLWE ciphertext c where Dec(c) = b_{k+1}.

Intuitively, CMux can be regarded as a multiplexer over TRLWE ciphertexts with a TRGSW selector input. The operation LookUp regards **c**_1, **c**_2, ..., **c**_{2^n} as encrypted entries composing a LookUp Table (LUT) of depth n and d_1, d_2, ..., d_n as inputs to the LUT. Its output is the entry selected by the LUT inputs. LookUp is constructed from 2^n − 1 CMux gates arranged in a tree of depth n. SampleExtract outputs the k-th element of **c** as a TLWE ciphertext. Notice that all these operations work over ciphertexts without decrypting them.

Noise and Operations for Noise Reduction. In generating a TFHE ciphertext, we ensure its security by adding some random numbers called *noise*. An application of a TFHE operation adds noise to its output ciphertext; if the noise in a ciphertext becomes too large, the TFHE ciphertext cannot be correctly decrypted. There is a special type of operation called *bootstrapping*<sup>1</sup> [23], which reduces the noise of a TFHE ciphertext.

Bootstrapping_BK(c) : TLWE → TRLWE
Given a bootstrapping key BK and a TLWE ciphertext c, Bootstrapping outputs a TRLWE ciphertext **c** where Dec(**c**) = (b_1, b_2, ..., b_N) and b_1 = Dec(c). Moreover, the noise of **c** becomes a constant that is determined by the parameters of TFHE and is independent of c.

CircuitBootstrapping_BK(c) : TLWE → TRGSW
Given a bootstrapping key BK and a TLWE ciphertext c, CircuitBootstrapping outputs a TRGSW ciphertext d where Dec(d) = Dec(c). The noise of d becomes a constant that is determined by the parameters of TFHE and is independent of c.

<sup>1</sup> Note that bootstrapping here has nothing to do with bootstrapping in statistics.

#### Algorithm 1: The leveled homomorphic offline algorithm [13].

```
Input  : A binary DFA M = (Q, Σ = B, δ, q0, F) and TRGSW monitored
         ciphertexts d1, d2, ..., dn
Output : A TLWE ciphertext c satisfying
         Dec(c) = M(Dec(d1)Dec(d2)...Dec(dn))
1 for q ∈ Q do
2     cn,q ← q ∈ F ? Trivial(1) : Trivial(0)   // Initialize each cn,q
3 for i = n, n−1, ..., 1 do
4     for q ∈ Q such that q is reachable from q0 by (i−1) transitions do
5         ci−1,q ← CMux(di, ci,δ(q,1), ci,δ(q,0))
6 c ← SampleExtract(0, c0,q0)
7 return c
```

These bootstrapping operations are used to keep the noise of a TFHE ciphertext small enough to be correctly decrypted. Bootstrapping and Circuit-Bootstrapping are almost two and three orders of magnitude slower than CMux, respectively [13].

Parameters for TFHE. There are many parameters for TFHE, such as the length N of the message of a TRLWE ciphertext and the standard deviation of the probability distribution from which a noise is taken. Certain properties of TFHE depend on these parameters. These properties include the security level of TFHE, the number of TFHE operations that can be applied without bootstrapping ensuring correct decryption, and the time and the space complexity of each operation. The complete list of TFHE parameters is presented in the full version [4].

We remark that we need to determine the TFHE parameters *before* performing any TFHE operation. Therefore, we need to know the number of applications of homomorphic operations without bootstrapping *in advance*, i.e., the homomorphic circuit depth must be determined *a priori*.

#### 2.3 Leveled Homomorphic Offline Algorithm

Chillotti et al. [13] proposed an *offline* algorithm to evaluate a DFA over TFHE ciphertexts (Algorithm 1). Given a DFA M and TRGSW ciphertexts d_1, d_2, ..., d_n, Algorithm 1 returns a TLWE ciphertext c satisfying Dec(c) = M(Dec(d_1)Dec(d_2)...Dec(d_n)). For simplicity, for a state q of M, we write M_i(q) for M(q, Dec(d_i)Dec(d_{i+1})...Dec(d_n)).

In Algorithm 1, we use a TRLWE ciphertext **c**_{i,q} whose first element represents M_{i+1}(q), i.e., whether we reach a final state by reading Dec(d_{i+1})Dec(d_{i+2})...Dec(d_n) from q. We abuse this notation for i = n, i.e., the first element of **c**_{n,q} represents whether q ∈ F. In Lines 1 and 2, we initialize **c**_{n,q}: for each q ∈ Q, we let **c**_{n,q} be Trivial(1) if q ∈ F; otherwise, we let **c**_{n,q} be Trivial(0). In Lines 3–5, we construct **c**_{i−1,q} inductively by feeding each monitored ciphertext d_i to CMux from tail to head. Here, **c**_{i−1,q} represents M_i(q) because M_i(q) = M_{i+1}(δ(q, Dec(d_i))). We note that, for efficiency, we only construct **c**_{i−1,q} for the states reachable from q_0 by i − 1 transitions. In Line 6, we extract the first element of **c**_{0,q_0}, which represents M_1(q_0), i.e., M(Dec(d_1)Dec(d_2)...Dec(d_n)).
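The backward induction of Algorithm 1 is easy to see in plaintext: keep, for every state q, a bit "would reading the remaining suffix from q end in a final state?", and update all bits with each symbol from back to front. A sketch with plaintext bits in place of TRLWE ciphertexts (our names; the reachability restriction of Line 4 is omitted for brevity):

```python
def eval_dfa_backward(states, delta, q0, final, word):
    """Plaintext analogue of Algorithm 1: consume the word from back to
    front; c[q] tracks whether the remaining suffix leads q to a final state."""
    c = {q: 1 if q in final else 0 for q in states}      # c_{n,q}
    for sigma in reversed(word):                          # i = n, ..., 1
        # CMux(d_i, c_{i,delta(q,1)}, c_{i,delta(q,0)}) in plaintext:
        new_c = {q: c[delta[(q, 1)]] if sigma == 1 else c[delta[(q, 0)]]
                 for q in states}
        c = new_c
    return c[q0]                                          # M(w)

# Parity DFA: accepts words with an odd number of 1s (illustrative example).
parity_delta = {("even", 0): "even", ("even", 1): "odd",
                ("odd", 0): "odd", ("odd", 1): "even"}
```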

# Algorithm 2: Our fully homomorphic offline algorithm (Offline).

```
Input  : A binary DFA M = (Q, Σ = B, δ, q0, F), TRGSW monitored
         ciphertexts d1, d2, ..., dn, a bootstrapping key BK, and Iboot ∈ N+
Output : A TLWE ciphertext c satisfying
         Dec(c) = M(Dec(d1)Dec(d2)...Dec(dn))
1  for q ∈ Q do
2      cn,q ← q ∈ F ? Trivial(1) : Trivial(0)
3  for i = n, n−1, ..., 1 do
4      for q ∈ Q such that q is reachable from q0 by (i−1) transitions do
5          ci−1,q ← CMux(di, ci,δ(q,1), ci,δ(q,0))
6      if (n − i + 1) mod Iboot = 0 then
7          for q ∈ Q such that q is reachable from q0 by (i−1) transitions do
8              ci−1,q ← SampleExtract(0, ci−1,q)
9              ci−1,q ← Bootstrapping_BK(ci−1,q)
10 c ← SampleExtract(0, c0,q0)
11 return c
```

**Theorem 1 (Correctness [13, Thm. 5.4]).** *Given a binary DFA M and TRGSW ciphertexts d_1, d_2, ..., d_n, if c in Algorithm 1 can be correctly decrypted, then Algorithm 1 outputs c satisfying Dec(c) = M(Dec(d_1)Dec(d_2)...Dec(d_n)).*

Complexity Analysis. The time complexity of Algorithm 1 is determined by the number of applications of CMux, which is O(n|Q|). Its space complexity is O(|Q|) because we can use two sets of |Q| TRLWE ciphertexts alternately for **c**_{2j−1,q} and **c**_{2j,q} (for j ∈ N⁺).

Shortcomings of Algorithm 1. We cannot use Algorithm 1 under an *online* setting due to two reasons. Firstly, Algorithm 1 is a *leveled* homomorphic algorithm, i.e., the maximum length of the ciphertexts that Algorithm 1 can handle is determined by TFHE parameters. This is because Algorithm 1 does not use Bootstrapping, and if the monitored ciphertexts are too long, the result <sup>c</sup> cannot be correctly decrypted due to the noise. This is critical in an online setting because we do not know the length n of the monitored ciphertexts in advance, and we cannot determine such parameters appropriately.

Secondly, Algorithm 1 consumes the monitored ciphertext from back to front, i.e., the last ciphertext d<sup>n</sup> is used in the beginning, and d<sup>1</sup> is used in the end. Thus, we cannot start Algorithm 1 before the last input is given.

#### 3 Online Algorithms for Running DFA Obliviously

In this section, we propose two online algorithms that run a DFA obliviously. As a preparation for these online algorithms, we also introduce a fully homomorphic offline algorithm based on Algorithm 1.

#### 3.1 Preparation: Fully Homomorphic Offline Algorithm (**Offline**)

As preparation for introducing an algorithm that can run a DFA under an online setting, we enhance Algorithm 1 so that we can monitor a sequence of ciphertexts whose length is unknown *a priori*. Algorithm 2 shows our *fully homomorphic* offline algorithm (Offline), which does not require TFHE parameters to depend on the length of the monitored ciphertexts. The key difference lies in Lines 6–9 of Algorithm 2. Here, after every Iboot monitored ciphertexts consumed, we reduce the noise by applying Bootstrapping to the ciphertexts **c**_{i−1,q} representing the states of the DFA. Since the amount of the noise accumulated in **c**_{i−1,q} is determined only by the number of the processed ciphertexts, we can keep the noise levels of **c**_{i−1,q} low and ensure that the monitoring result c is correctly decrypted. Therefore, by using Algorithm 2, we can monitor an arbitrarily long sequence of ciphertexts as long as the interval Iboot is properly chosen according to the TFHE parameters. We note that we still cannot use Algorithm 2 for online monitoring because it consumes the monitored ciphertexts from back to front.

# Algorithm 3: Our first online algorithm (Reverse).

```
Input  : A binary DFA M, TRGSW monitored ciphertexts d1, d2, d3, ..., dn, a
         bootstrapping key BK, and Iboot ∈ N+
Output : For every i ∈ {1, 2, ..., n}, a TLWE ciphertext ci satisfying
         Dec(ci) = M(Dec(d1)Dec(d2)...Dec(di))
1  let M^R = (Q^R, B, δ^R, q0^R, F^R) be the minimum reversed DFA of M
2  for q^R ∈ Q^R do
3      c0,q^R ← q^R ∈ F^R ? Trivial(1) : Trivial(0)
4  for i = 1, 2, ..., n do
5      for q^R ∈ Q^R do
6          ci,q^R ← CMux(di, ci−1,δ^R(q^R,1), ci−1,δ^R(q^R,0))
7      if i mod Iboot = 0 then
8          for q^R ∈ Q^R do
9              ci,q^R ← SampleExtract(0, ci,q^R)
10             ci,q^R ← Bootstrapping_BK(ci,q^R)
11     ci ← SampleExtract(0, ci,q0^R)
12     output ci
```

#### 3.2 Online Algorithm 1: **Reverse**

To run a DFA online, we modify Offline so that the monitored ciphertexts are consumed from front to back. Our main idea is illustrated in Fig. 2b: we *reverse* the DFA M beforehand and feed the ciphertexts d1, d2, ..., dn to the reversed DFA M^R serially, from d1 to dn.

Algorithm 3 shows the outline of our first online algorithm (Reverse) based on the above idea. Reverse takes the same inputs as Offline: a DFA M, TRGSW ciphertexts d1, d2, ..., dn, a bootstrapping key BK, and a positive integer Iboot indicating the interval of bootstrapping. In Line 1, we construct the minimum DFA M^R satisfying M^R(w) = M(w^R) for any w = σ1σ2 ... σk ∈ B^*, where w^R = σk ... σ1. We can construct such a DFA by reversing the transitions of M and then applying the powerset construction and the minimization algorithm.
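The reversal in Line 1 can be sketched in plaintext as follows. This Python sketch (the dict-based DFA encoding and all names are ours) applies the powerset construction to the reversed transition relation, taking F as the initial subset and accepting exactly the subsets containing q0; the final minimization step of the paper is omitted for brevity:

```python
def run(dfa, start, word):
    """Run a (subset- or plain-)DFA given as dfa[q] = (on 0, on 1)."""
    q = start
    for b in word:
        q = dfa[q][b]
    return q

def reverse_dfa(delta, q0, final):
    """Powerset construction on the reversed transitions of (delta, q0, final).
    The minimization step from the paper is intentionally left out."""
    rev = {q: ([], []) for q in delta}     # rev[p][b] = states q with delta[q][b] = p
    for q, (n0, n1) in delta.items():
        rev[n0][0].append(q)
        rev[n1][1].append(q)
    start = frozenset(final)               # initial state of M^R is F
    states, trans, todo = {start}, {}, [start]
    while todo:
        s = todo.pop()
        succ = []
        for b in (0, 1):
            t = frozenset(p for q in s for p in rev[q][b])
            succ.append(t)
            if t not in states:
                states.add(t)
                todo.append(t)
        trans[s] = tuple(succ)
    accept = {s for s in states if q0 in s}  # accepting iff the subset contains q0
    return trans, start, accept

# Toy DFA M: accepts words ending in 1
delta = {0: (0, 1), 1: (0, 1)}
trans, s0, acc = reverse_dfa(delta, 0, {1})
w = [1, 0, 1, 1]
# Check the defining property: M^R(w^R) = M(w)
assert (run(trans, s0, w[::-1]) in acc) == (run(delta, 0, w) in {1})
```

The property M^R(w) = M(w^R) is exactly what Theorem 2 relies on.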

In the loop of Lines 4–12, the reversed DFA M^R consumes each monitored ciphertext di; this corresponds to the loop of Lines 3–9 in Algorithm 2. The main difference lies in Lines 5 and 8: Algorithm 3 applies CMux and

# Algorithm 4: Our second online algorithm (Block).

```
Input : A binary DFA M = (Q, Σ = B, δ, q0, F), TRGSW monitored ciphertexts
        d1, d2, d3, ..., dn, a bootstrapping key BK, and B ∈ N+
Output: For every i ∈ N+ (i ≤ ⌊n/B⌋), a TLWE ciphertext ci satisfying
        Dec(ci) = M(Dec(d1)Dec(d2) ... Dec(di×B))
 1 S1 ← {q0}    // Si: the states reachable by (i − 1) × B transitions
 2 for i = 1, 2, ..., ⌊n/B⌋ do
 3     Si+1 ← {q ∈ Q | ∃si ∈ Si. q is reachable from si by B transitions}
           // We denote Si+1 = {s^{i+1}_1, s^{i+1}_2, ..., s^{i+1}_{|Si+1|}}
 4     for q ∈ Q do
 5         if q ∈ Si+1 then
 6             j ← the index of Si+1 such that q = s^{i+1}_j
 7             c^{Ti}_{B,q} ← Trivial((j − 1) × 2 + (q ∈ F ? 1 : 0))
 8     for k = B, B − 1, ..., 1 do
 9         for q ∈ Q such that q is reachable from a state in Si by (k − 1) transitions do
10             c^{Ti}_{k−1,q} ← CMux(d(i−1)B+k, c^{Ti}_{k,δ(q,1)}, c^{Ti}_{k,δ(q,0)})
11     if |Si| = 1 then
12         c^{cur}_{i+1} ← c^{Ti}_{0,q} where Si = {q}
13     else
14         for l = 1, 2, ..., ⌈log2(|Si|)⌉ do
15             cl ← SampleExtract(l, c^{cur}_i)
16             d′l ← CircuitBootstrappingBK(cl)
17         c^{cur}_{i+1} ← LookUp({c^{Ti}_{0,s^i_1}, c^{Ti}_{0,s^i_2}, ..., c^{Ti}_{0,s^i_{|Si|}}}, {d′1, ..., d′⌈log2(|Si|)⌉})
18     ci ← SampleExtract(0, c^{cur}_{i+1})
19     output ci
```

Bootstrapping to all the states of M^R, while Algorithm 2 only considers the states reachable from the initial state. This is because, in online monitoring, we monitor a stream of ciphertexts without knowing the number of the remaining ciphertexts, and, by the minimality of M^R, every state of M^R is potentially reachable from the initial state q^R_0 by the reversed remaining ciphertexts dn, dn−1, ..., di+1.

Theorem 2. *Given a binary DFA* M*, TRGSW ciphertexts* d1, d2, ..., dn*, a bootstrapping key* BK*, and a positive integer* Iboot*, for each* i ∈ {1, 2, ..., n}*, if* ci *in Algorithm 3 can be correctly decrypted, Algorithm 3 outputs* ci *satisfying* Dec(ci) = M(Dec(d1)Dec(d2) ... Dec(di))*.*

*Proof (sketch).* SampleExtract and Bootstrapping in Lines 9 and 10 do not change the decrypted value of ci. Therefore, Dec(ci) = M^R(Dec(di) ... Dec(d1)) for i ∈ {1, 2, ..., n} by Theorem 1. As M^R is the reversed DFA of M, we have Dec(ci) = M^R(Dec(di) ... Dec(d1)) = M(Dec(d1) ... Dec(di)).

#### 3.3 Online Algorithm 2: **Block**

A drawback of Reverse is that the number of states of the reversed DFA can explode exponentially due to the powerset construction (see Sect. 3.4 for details). Another idea for an online algorithm, which avoids reversing the DFA, is illustrated in Fig. 2c: we split the monitored ciphertexts into *blocks* of fixed size B

and process each block in the same way as Algorithm 2. Intuitively, for each block d1+(i−1)×B, d2+(i−1)×B, ..., dB+(i−1)×B of ciphertexts, we compute the function Ti : Q → Q satisfying Ti(q) = δ(q, d1+(i−1)×B d2+(i−1)×B ... dB+(i−1)×B) by a variant of Offline, and keep track of the current state of the DFA after reading the current prefix d1, d2, ..., dB+(i−1)×B.
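In plaintext, this block-wise idea amounts to composing the transition function over each block and applying the result to the tracked current state. The following Python sketch (the dict-based DFA encoding and all names are ours) yields the monitoring verdict after every full block, as Block does:

```python
def block_monitor(delta, q0, final, bits, B):
    """Plaintext analogue of the Block idea: split the input into blocks
    of size B, compute the block transition function T_i : Q -> Q, and
    apply it to the tracked current state.  Yields M(prefix) after each
    full block; a trailing partial block is ignored, as in Algorithm 4."""
    cur = q0
    for i in range(0, len(bits) - len(bits) % B, B):
        block = bits[i:i + B]
        # T maps every state q to the state reached from q after the block
        T = {}
        for q in delta:
            s = q
            for b in block:
                s = delta[s][b]
            T[q] = s
        cur = T[cur]
        yield cur in final

# Toy DFA: accepts words with an even number of 1s; delta[q] = (on 0, on 1)
delta = {0: (0, 1), 1: (1, 0)}
outs = list(block_monitor(delta, 0, {0}, [1, 0, 1, 1, 1, 0], B=3))
# after 101 (two 1s) and after 101110 (four 1s) the verdict is accepting
```

In the homomorphic setting, T_i is computed on ciphertexts by Lines 4–10 of Algorithm 4, and "applying T_i to the current state" becomes the encrypted selection of Lines 11–17.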

Algorithm 4 shows the outline of our second online algorithm (Block) based on the above idea. Algorithm 4 takes a DFA M, TRGSW ciphertexts d1, d2, ..., dn, a bootstrapping key BK, and an integer B representing the interval of output. To simplify the presentation, we make the following assumptions, which are relaxed later: 1) B is small, and a trivial TRLWE sample can be correctly decrypted after B applications of CMux; 2) the number |Q| of the states of the DFA M is at most 2^(N−1), where N is the length of the message represented by one TRLWE ciphertext.

The main loop of the algorithm is sketched in Lines 2–19. In each iteration, we consume the i-th block consisting of B ciphertexts, i.e., d(i−1)B+1, ..., d(i−1)B+B. In Line 3, we compute the set Si+1 = {s^{i+1}_1, s^{i+1}_2, ..., s^{i+1}_{|Si+1|}} of the states reachable from q0 by reading a word of length i × B.

In Lines 4–10, for each q ∈ Q, we construct a ciphertext representing Ti(q) by feeding the current block to a variant of Offline. More precisely, we construct a ciphertext c^{Ti}_{0,q} representing the pair of the Boolean value indicating whether Ti(q) ∈ F and the state Ti(q) ∈ Q. Such a pair is encoded in a TRLWE ciphertext as follows: the first element indicates whether Ti(q) ∈ F, and the other elements are the binary representation of j − 1, where j ∈ N+ is such that s^{i+1}_j = Ti(q) (cf. Line 7).
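The encoding of such a pair can be illustrated with a toy plaintext model of the TRLWE slots. In this Python sketch (the slot model, the toy length N = 8, and all names are ours; the paper uses N = 1024), slot 0 carries the accept bit and the remaining slots carry the bits of j − 1:

```python
N = 8  # toy TRLWE message length; the paper uses N = 1024

def encode_state(j, accepting):
    """Toy plaintext analogue of the TRLWE encoding of T_i(q): slot 0 is
    the accept bit, slots 1.. are the binary representation of j - 1,
    where j is the 1-based index of the state in S_{i+1}."""
    slots = [0] * N
    slots[0] = 1 if accepting else 0
    for l in range(1, N):
        slots[l] = (j - 1) >> (l - 1) & 1
    return slots

def decode_state(slots):
    """Recover (j, accepting) from the slot vector."""
    j = sum(slots[l] << (l - 1) for l in range(1, N)) + 1
    return j, bool(slots[0])

slots = encode_state(j=5, accepting=True)
# decode_state(slots) recovers (5, True)
```

This also explains the constant in Line 7 of Algorithm 4: (j − 1) × 2 + (q ∈ F ? 1 : 0) is exactly this bit layout read as a single integer with the accept bit in the lowest position.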

In Lines 11–17, we construct the ciphertext c^{cur}_{i+1} representing the state of the DFA M after reading the current prefix d1, d2, ..., dB+(i−1)×B. If |Si| = 1, since the unique element q of Si is the only possible state before consuming the current block, the state after reading it is Ti(q). Therefore, we let c^{cur}_{i+1} = c^{Ti}_{0,q}.

Otherwise, we use the ciphertext c^{cur}_i, which represents the state q before consuming the current block, and again let c^{cur}_{i+1} = c^{Ti}_{0,q}. Since the elements of c^{cur}_i (except for the first one) represent q (see Line 7), we extract them by applying SampleExtract (Line 15) and convert them to TRGSW ciphertexts by applying CircuitBootstrapping (Line 16). Then, we choose c^{Ti}_{0,q} by applying LookUp and assign it to c^{cur}_{i+1}.
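The selection performed by LookUp can be pictured as a multiplexer tree over the index bits. The following Python sketch is a plaintext illustration under our own assumptions (the paper does not spell out LookUp's internals; we model it as a CMux tree, with plain Booleans standing in for the TRGSW bits d′l, least-significant bit first):

```python
def lookup(values, bits):
    """Plaintext analogue of LookUp: select values[j] where the bits
    (LSB first, playing the role of Dec(d'_1), ..., Dec(d'_log2 n))
    spell out the index j.  Implemented as a CMux tree: each level
    halves the candidate list.  len(values) must be a power of two."""
    cand = list(values)
    for b in bits:
        # CMux(b, hi, lo): keep the odd-position half when b = 1
        cand = [hi if b else lo for lo, hi in zip(cand[0::2], cand[1::2])]
    return cand[0]

# index bits 1,0 (LSB first) encode index 1, selecting 'b'
assert lookup(['a', 'b', 'c', 'd'], [1, 0]) == 'b'
```

With n candidates, this uses n − 1 CMux-style selections and ⌈log2 n⌉ index bits, which matches the ⌈log2(|Si|)⌉ CircuitBootstrapping calls in Lines 14–16.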

The output after consuming the current block, i.e., M(Dec(d1)Dec(d2) ... Dec(d(i−1)B+B)), is stored in the first element of the TRLWE ciphertext c^{cur}_{i+1}. It is extracted by applying SampleExtract in Line 18 and output in Line 19.

Theorem 3. *Given a binary DFA* M*, TRGSW ciphertexts* d1, d2, ..., dn*, a bootstrapping key* BK*, and a positive integer* B*, for each* i ∈ {1, 2, ..., ⌊n/B⌋}*, if* ci *in Algorithm 4 can be correctly decrypted, Algorithm 4 outputs a TLWE ciphertext* ci *satisfying* Dec(ci) = M(Dec(d1)Dec(d2) ... Dec(di×B))*.*

*Proof (sketch).* Let qi be δ(q0, Dec(d1)Dec(d2) ... Dec(di×B)). It suffices to show that, for each iteration i of the loop in Line 2, Dec(c^{cur}_{i+1}) represents the pair of the Boolean value indicating whether qi ∈ F and the state qi ∈ Q in the above encoding format; this is because ci represents the first element of c^{cur}_{i+1}. Algorithm 4 selects c^{cur}_{i+1} from {c^{Ti}_{0,q}}_{q∈Si} in Line 12 or Line 17. By a slight variant of Theorem 1 applied to Lines 11–17, we can show that c^{Ti}_{0,q} represents whether Ti(q) ∈ F and the state Ti(q). Therefore, the proof is completed by showing Dec(c^{cur}_{i+1}) = Dec(c^{Ti}_{0,qi−1}).


Table 3. Complexity of the proposed algorithms with respect to the number |Q| of the states of the DFA and the size |φ| of the LTL formula. For Block, we show the complexity *before* the relaxation.

We prove Dec(c^{cur}_{i+1}) = Dec(c^{Ti}_{0,qi−1}) by induction on i. If i = 1, |Si| = 1 holds, and by qi−1 ∈ Si, we have Dec(c^{cur}_{i+1}) = Dec(c^{Ti}_{0,qi−1}). If i > 1 and |Si| = 1, Dec(c^{cur}_{i+1}) = Dec(c^{Ti}_{0,qi−1}) holds similarly. If i > 1 and |Si| > 1, by the induction hypothesis, Dec(c^{cur}_i) represents whether Ti−1(qi−2) = qi−1 ∈ F and the state qi−1. By the construction in Line 16, Dec(d′l) is equal to the l-th bit of (j − 1), where j is such that s^i_j = qi−1. Therefore, the result of the application of LookUp in Line 17 is equal to c^{Ti}_{0,s^i_j} (= c^{Ti}_{0,qi−1}), and we have Dec(c^{cur}_{i+1}) = Dec(c^{Ti}_{0,qi−1}).

We note that Block generates an output for every B monitored ciphertexts, while Reverse generates an output for every monitored ciphertext.

We also remark that, when B = 1, Block consumes every monitored ciphertext from front to back. However, such a setting is slow due to the huge number of CircuitBootstrapping operations, as pointed out in Sect. 3.4.

Relaxations of the Assumptions. When B is too large, c^{Ti}_{0,q} may not be correctly decrypted. We can relax this restriction by inserting Bootstrapping just after Line 10, much like in Algorithm 2. When the number |Q| of the states of the DFA M is larger than 2^(N−1), we cannot store the index j of the state in one TRLWE ciphertext (Line 7). We can relax this restriction by using multiple TRLWE ciphertexts for c^{Ti}_{0,q} and c^{cur}_{i+1}.

#### 3.4 Complexity Analysis

Table 3 summarizes the complexity of our algorithms with respect to both the number |Q| of the states of the DFA and the size |φ| of the LTL formula. Note that, for Block, we do not relax the above assumptions, for simplicity. The number of applications of the homomorphic operations is linear in the length n of the monitored ciphertexts, and the space complexity is independent of n. This shows that our algorithms satisfy the properties essential to good online monitoring: 1) they store only the minimum amount of data, and 2) they run quickly enough in a real-time setting [5].

The time and space complexity of Offline and Block are linear in |Q|. Moreover, in these algorithms, when the i-th monitored ciphertext is consumed, only the states reachable by a word of length i are considered, which often makes the scalability even better. In contrast, the time and space complexity of Reverse are exponential in |Q| because of the worst-case size of the reversed DFA produced by the powerset construction. Since the size of the reversed DFA is usually reasonably small, however, the practical scalability of Reverse is also much better than the worst case, as observed in the experiments in Sect. 5.

For Offline and Block, |Q| is *doubly* exponential in |φ| because we first convert φ to an NFA (one exponential) and then construct a DFA from the NFA (second exponential). In contrast, for Reverse, it is known that we can construct a reversed DFA for φ whose size is at most *singly* exponential in |φ| [15]. Note that, in a practical scenario such as the one in Sect. 5, the size of the DFA constructed from φ is expected to be much smaller than the worst case.

# 4 Oblivious Online LTL Monitoring

In this section, we formalize the scheme of oblivious online LTL monitoring. We consider a two-party setting with a client and a server, referred to as Alice and Bob, respectively. We assume that Alice has a private data sequence w = σ1σ2 ... σn to be monitored, where σi ∈ 2^AP for each i ≥ 1, while Bob has a private LTL formula φ. The purpose of oblivious online LTL monitoring is to let Alice know whether σ1σ2 ... σi |= φ for each i ≥ 1, while preserving the privacy of both Alice and Bob.

#### 4.1 Threat Model

We assume that Alice is *malicious*, i.e., Alice can deviate arbitrarily from the protocol to try to learn φ. We also assume that Bob is *honest-but-curious*, i.e., Bob correctly follows the protocol, but he tries to learn w from the information he obtains from the protocol execution. We do not assume that Bob is malicious in the present paper; a protocol that is secure against malicious Bob requires more sophisticated primitives such as zero-knowledge proofs and is left as future work.

*Public and Private Data.* We assume that the TFHE parameters, the parameters of our algorithms (e.g., Iboot and B), Alice's public key PK, and Alice's bootstrapping key BK are public to both parties. The input w and the monitoring result are private for Alice, and the LTL formula φ is private for Bob.

#### 4.2 Protocol Flow

The protocol flow of oblivious online LTL monitoring is shown in Fig. 3. It takes σ1, σ2, ..., σn, φ, and b ∈ B as its parameters, where b is a flag that indicates the algorithm Bob uses: Reverse (b = 0) or Block (b = 1). After generating her


Fig. 3. Protocol of oblivious online LTL monitoring.

secret key and sending the corresponding public and bootstrapping key to Bob (Lines 1–3), Alice encrypts her inputs into ciphertexts and sends the ciphertexts to Bob one by one (Lines 5–8). In contrast, Bob first converts his LTL formula φ to a binary DFA M (Line 4). Then, Bob serially feeds the received ciphertexts from Alice to Reverse or Block (Line 9) and returns the encrypted output of the algorithm to Alice (Lines 10–13).

Note that, although the alphabet of a DFA constructed from an LTL formula is 2^AP [36], our proposed algorithms require a binary DFA. Thus, in Line 4, we convert the DFA constructed from φ to a binary DFA M by inserting auxiliary states. Besides, in Line 6, we encode an observation σi ∈ 2^AP by a sequence σ̃i := (σ̃i^1, σ̃i^2, ..., σ̃i^|AP|) ∈ B^|AP| such that pj ∈ σi if and only if σ̃i^j is true, where AP = {p1, ..., p|AP|}. We also note that, to account for this encoding, the parameters of Block must be set so that it generates an output for each |AP|-size block of Alice's inputs, i.e., B is taken to be equal to |AP|.
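The bit encoding of Line 6 can be sketched directly. In this Python sketch, the concrete proposition names are our own illustrative choice; only the ordering of AP and the "pj ∈ σi iff the j-th bit is 1" convention come from the text:

```python
AP = ["p1", "p2", "p3"]   # ordered atomic propositions (names are illustrative)

def encode_obs(sigma):
    """Encode an observation sigma ⊆ AP as the |AP|-bit sequence fed to
    the binary DFA (Line 6 of the protocol, before encryption): the j-th
    bit is 1 iff p_j ∈ sigma."""
    return [1 if p in sigma else 0 for p in AP]

# {p1, p3} is encoded as the bit sequence 1 0 1
assert encode_obs({"p1", "p3"}) == [1, 0, 1]
```

Each of these |AP| bits is then encrypted separately into a TRGSW ciphertext, which is why Block must produce one output per |AP| consumed ciphertexts.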

Here, we provide brief sketches of the correctness and security analysis of the proposed protocol. See the full version [4] for detailed explanations and proofs.

Correctness. We can show that Alice obtains correct results in our protocol directly by Theorem 2 and Theorem 3.

Security. Intuitively, after the execution of the protocol described in Fig. 3, Alice should learn M(σ̃1 · σ̃2 ··· σ̃i) for every i ∈ {1, 2, ..., n} but nothing else. Likewise, Bob should learn the input size n but nothing else.

*Privacy for Alice.* We observe that Bob only obtains Enc(σ̃i^j) from Alice for each i ∈ {1, 2, ..., n} and j ∈ {1, 2, ..., |AP|}. Therefore, we need to show that Bob learns nothing from the ciphertexts generated by Alice. Since TFHE provides IND-CPA security [7], Alice's privacy is easily guaranteed.

*Privacy for Bob.* The privacy guarantee for Bob is more complex than that for Alice. Here, Alice obtains σ̃1, σ̃2, ..., σ̃n and the results M(σ̃1 · σ̃2 ··· σ̃i) for every i ∈ {1, 2, ..., n} in plaintext. In the protocol (Fig. 3), Alice obtains neither φ and M themselves nor their sizes, and it is known that a finite number of membership checks M(w) cannot uniquely identify M if no additional information (e.g., |M|) is given [2,32]. Thus, it is impossible for Alice to identify M (or φ) from the input/output pairs.

Nonetheless, to fully guarantee the model privacy of Bob, we also need to show that, when Alice inspects the result ciphertext c′, it is impossible for Alice to learn Bob's specification, i.e., which homomorphic operations Bob applied to obtain c′. A TLWE ciphertext contains a random nonce and a noise term. By randomizing c′ properly in Line 11, we ensure that the random nonce of c′ is not biased [34]. By assuming SRL security [10,21] over TFHE, we can ensure that there is no information leakage about Bob's specification through the noise bias. A more detailed discussion is in the full version [4].

### 5 Experiments

We experimentally evaluated the proposed algorithms (Reverse and Block) and protocol. We pose the following three research questions:

RQ1: Are the proposed algorithms scalable with respect to the length of the monitored ciphertexts and the size of the DFA?
RQ2: Are the proposed algorithms fast enough for a practical online monitoring scenario?
RQ3: Is the client-side encryption fast enough on a resource-constrained device?


To answer RQ1, we conducted an experiment with our original benchmark where the length of the monitored ciphertexts and the size of the DFA are configurable (Sect. 5.1). To answer RQ2 and RQ3, we conducted a case study on blood glucose monitoring; we monitored blood glucose data obtained by simglucose<sup>2</sup> against specifications taken from [12,38] (Sect. 5.2). To answer RQ3, we measured the time spent on the encryption of plaintexts, which is the heaviest task for a client during the execution of the online protocol.

We implemented our algorithms in C++20; our implementation is publicly available<sup>3</sup>. We used Spot [17] to convert a safety LTL formula to a DFA, and Spot's utility program ltlfilt to calculate the size of an LTL formula<sup>4</sup>. We used TFHEpp [30] as the TFHE library, with N = 1024 as the size of the message represented by one TRLWE ciphertext, which is a parameter of TFHE. The complete TFHE parameters we used are shown in the full version [4].

For RQ1 and RQ2, we ran experiments on a workstation with Intel Xeon Silver 4216 (3.2 GHz; 32 cores and 64 threads in total), 128 GiB RAM, and Ubuntu 20.04.2 LTS. We ran each instance of the experiment setting five times

<sup>2</sup> https://github.com/jxx123/simglucose.

<sup>3</sup> Our implementation is uploaded to https://doi.org/10.5281/zenodo.6558657.

<sup>4</sup> We desugared a formula by ltlfilt with option --unabbreviate="eFGiMRˆW" and counted the number of the characters.

Fig. 4. Experimental results for Mm. The left figure shows runtimes when the number of states (i.e., m) is fixed to 500, while the right one shows runtimes when the number of monitored ciphertexts (i.e., n) is fixed to 50000.

and reported the average. We measured the time to consume all of the monitored ciphertexts in the main loop of each algorithm, i.e., in Lines 4–12 in Reverse and in Lines 2–19 in Block.

For RQ3, we ran experiments on two single-board computers with and without Advanced Encryption Standard (AES) [14] hardware accelerator. ROCK64 has ARM Cortex A53 CPU cores (1.5 GHz; 4 cores) with AES hardware accelerator and 4 GiB RAM. Raspberry Pi 4 has ARM Cortex A72 CPU cores (1.5 GHz; 4 cores) without AES hardware accelerator and 4 GiB RAM.

#### 5.1 RQ1: Scalability

Experimental Setup. In the experiments to answer RQ1, we used a simple binary DFA Mm, which accepts a word w if and only if the number of occurrences of 1 in w is a multiple of m. The number of the states of Mm is m.
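The benchmark DFA Mm is easy to construct explicitly. The following Python sketch (the dict-based encoding is ours) builds Mm as a modulo-m counter over the 1s in the word:

```python
def make_Mm(m):
    """The benchmark DFA M_m: state q counts the number of 1s seen so
    far modulo m; the only accepting state is 0, so M_m accepts w iff
    the number of 1s in w is a multiple of m.  |Q| = m."""
    delta = {q: (q, (q + 1) % m) for q in range(m)}   # (on 0, on 1)
    return delta, 0, {0}

def accepts(dfa, word):
    delta, q, final = dfa
    for b in word:
        q = delta[q][b]
    return q in final

M3 = make_Mm(3)
# 1011 contains three 1s (a multiple of 3); 1100 contains two
assert accepts(M3, [1, 0, 1, 1]) and not accepts(M3, [1, 1, 0, 0])
```

Varying m directly controls the state count |Q| = m, which is what makes this family convenient for the scalability experiment.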

Our experiments are twofold. In the first experiment, we fixed the DFA size m to 500 and increased the size n of the input word w from 10000 to 50000. In the second experiment, we fixed n = 50000 and changed m from 10 to 500. The parameters we used are Iboot = 30000 and B = 150.

Results and Discussion. Figure 4 shows the results of the experiments. In the left plot of Fig. 4, we observe that the runtimes of both algorithms are linear in the length of the monitored ciphertexts. This coincides with the complexity analysis in Sect. 3.4.

In the right plot of Fig. 4, we observe that the runtimes of both algorithms are at most linear in the number of states. For Block, this coincides with the complexity analysis in Sect. 3.4. For Reverse, this is much more efficient than its worst-case complexity, which is exponential in |Q|; the reason is that the size of the reversed DFA does not increase here.

In both plots of Fig. 4, we observe that Reverse is faster than Block. Moreover, in the left plot of Fig. 4, the curve of Block is steeper than that of Reverse. This is because 1) the reversed DFA M^R_m has the same size as Mm, 2) CircuitBootstrapping is about ten times slower than Bootstrapping, and 3) Iboot is much larger than B.

Overall, our experiment results confirm the complexity analysis in Sect. 3.4. Moreover, the practical scalability of Reverse with respect to the DFA size is much better than the worst case, at least for this benchmark. Therefore, we answer RQ1 affirmatively.

#### 5.2 RQ2 and RQ3: Case Study on Blood Glucose Monitoring

Experimental Setup. To answer RQ2, we applied Reverse and Block to the monitoring of blood glucose levels. The monitored values are generated by simulation of type 1 diabetes patients. We used the LTL formulae in Table 4. These formulae are originally presented as signal temporal logic [28] formulae [12, 38], and we obtained the LTL formulae in Table 4 by discrete sampling.

To simulate the blood glucose levels of type 1 diabetes patients, we adopted simglucose, a Python implementation of the UVA/Padova Type 1 Diabetes Simulator [29]. We recorded the blood glucose levels every minute<sup>5</sup> and encoded each of them in nine bits. For ψ1, ψ2, ψ4, we used 720 min of the simulated values; for φ1, φ4, φ5, we used seven days of the values. The parameters we used are Iboot = 30000 and B = 9.
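The nine-bit encoding of each sample can be sketched as follows. This Python sketch is our own illustration: the paper does not specify the bit order, so the least-significant-bit-first convention here is an assumption:

```python
def encode_glucose(value):
    """Encode one blood-glucose sample as the nine bits fed to the
    binary DFA (Sect. 5.2).  Bit order is LSB first, which is an
    assumption; the paper only states that each value uses nine bits."""
    assert 0 <= value < 2 ** 9, "nine bits cover the range 0..511"
    return [(value >> k) & 1 for k in range(9)]

bits = encode_glucose(180)   # 180 mg/dL fits comfortably in nine bits
```

Each of the nine bits is then encrypted into a TRGSW ciphertext, which matches the choice B = 9: Block emits one verdict per complete glucose sample.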

To answer RQ3, we encrypted plaintexts into TRGSW ciphertexts 1000 times using two single-board computers (ROCK64 and Raspberry Pi 4) and reported the average runtime.

Results and Discussion (RQ2). The results of the experiments are shown in Table 5. The result for ψ4 with Reverse is missing because the reversed DFA for ψ4 is too huge, and its construction was aborted due to the memory limit.

Although the size of the reversed DFA was large for ψ1 and ψ2, in all the cases, we observe that both Reverse and Block took at most 24 s on average to process each blood glucose value. This is partly because |Q| and |Q^R| are not so large in comparison with the upper bounds described in Sect. 3.4, i.e., doubly and singly exponential in |φ|, respectively. Since each value is recorded every minute, both algorithms, at least on average, finished processing each value before the next measured value arrived, i.e., no congestion occurred. Therefore, our experimental results confirm that, in a practical scenario of blood glucose monitoring, both of our proposed algorithms are fast enough to be used in the online setting, and we answer RQ2 affirmatively.

We also observe that the average runtimes of ψ1, ψ2, ψ4 and of φ1, φ4, φ5 with Block are comparable, although the monitoring DFAs of ψ1, ψ2, ψ4 are significantly larger than those of φ1, φ4, φ5. This is because the numbers of reachable states during execution are similar among these cases (from 1 up to 27 states). As mentioned in Sect. 3.4, Block only considers the states reachable by a word of length i when the i-th monitored ciphertext is consumed, and thus it ran much faster even though the monitoring DFA is large.

Results and Discussion (RQ3). It took 40.41 ms and 1470.33 ms on average to encrypt a blood glucose value (i.e., nine bits) on ROCK64 and Raspberry Pi 4, respectively. Since each value is sampled every minute, our experimental results confirm that both machines are fast enough to be used in an online setting. Therefore, we answer RQ3 affirmatively.

<sup>5</sup> Current continuous glucose monitors (e.g., Dexcom G4 PLATINUM) record blood glucose levels every few minutes, and our sampling interval is realistic.


Table 4. The safety LTL formulae used in our experiments. ψ1, ψ2, ψ4 are originally from [12], and φ1, φ4, φ5 are originally from [38].

Table 5. Experimental results of blood glucose monitoring, where Q is the state space of the monitoring DFA and Q^R is the state space of the reversed DFA.


We also observe that encryption on ROCK64 is more than 35 times faster than that on Raspberry Pi 4. This is mainly because of the hardware accelerator for AES, which is used in TFHEpp to generate TRGSW ciphertexts.

#### 6 Conclusion

We presented, to the best of our knowledge, the first oblivious online LTL monitoring protocol. Our protocol allows online LTL monitoring that conceals 1) the client's monitored inputs from the server and 2) the server's LTL specification from the client. We proposed two online algorithms (Reverse and Block) using an FHE scheme called TFHE. In addition to the complexity analysis, we experimentally confirmed the scalability and practicality of our algorithms with an artificial benchmark and a case study on blood glucose level monitoring.

Our immediate future work is to extend our approach to LTL semantics with multiple values, e.g., LTL3 [6]. Extension to monitoring continuous-time signals, e.g., against an STL [28] formula, is also future work. Another future direction is to conduct a more realistic case study of our framework with actual IoT devices.

Acknowledgements. This work was partially supported by JST ACT-X Grant No. JPMJAX200U, JSPS KAKENHI Grant No. 22K17873 and 19H04084, and JST CREST Grant No. JPMJCR19K5, JPMJCR2012, and JPMJCR21M3.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Abstraction Modulo Stability for Reverse Engineering**

Anna Becchi1,2(B) and Alessandro Cimatti<sup>1</sup>

<sup>1</sup> Fondazione Bruno Kessler, Trento, Italy
{abecchi,cimatti}@fbk.eu
<sup>2</sup> University of Trento, Trento, Italy

**Abstract.** The analysis of legacy systems requires the automated extraction of high-level specifications. We propose a framework, called Abstraction Modulo Stability, for the analysis of transition systems operating in stable states, and responding with run-to-completion transactions to external stimuli. The abstraction captures the effects of external stimuli on the system state, and describes it in the form of a finite state machine. This approach is parametric on a set of predicates of interest and the definition of stability. We consider some possible stability definitions which yield different practically relevant abstractions, and propose a parametric algorithm for abstraction computation. The obtained FSM is extended with guards and effects on a given set of variables of interest. The framework is evaluated in terms of expressivity and adequacy within an industrial project with the Italian Railway Network, on reverse engineering tasks of relay-based interlocking circuits to extract specifications for a computer-based reimplementation.

**Keywords:** Timed Transition Systems · Property extraction · Simulations · Relay-based circuits

# **1 Introduction**

The maintenance of legacy systems is known to be a very costly task, and the lack of knowledge hampers the possibility of a reimplementation with more modern technologies. Legacy systems may have been actively operating for decades, but their behavior is known only to a handful of people. It is therefore important to have automated means to reverse-engineer and understand their behavior, for example in the form of state machines or temporal properties.

We focus on understanding systems that exhibit self-stabilizing behaviors, i.e. that are typically in a stable state, and respond to external stimuli by reaching stability in a possibly different state. As an industrially relevant example, consider legacy Railway Interlocking Systems based on Relay technology (RRIS): these are electro-mechanical circuits for the control of railway stations, with thousands of components that respond to the requests of human operators to activate the shunting routes for the movement of the trains. They support a computational model based on "run-to-completion", where a change in a part of the circuit (e.g. a switch closing) may change the power in another part of the circuit, and in turn operate other switches, until a stable condition is (hopefully) reached. This is very different in spirit from typical "cycle-based" control implemented in computer-based systems such as SCADA.

In this paper, we tackle the problem of extracting abstract specifications of the possible behaviors of an infinite-state timed transition system. The idea is to understand how the system evolves from a stable state, in response to a given stimulus, to the next stable state. In addition, we are interested in knowing under which conditions the transitions are possible and which are the effects on selected state variables. All this information is presented in the form of an extended finite state machine, which can be seen as a collection of temporal specifications satisfied by the system.

We make the following contributions. First, we propose the general framework of *Abstraction Modulo Stability*, a white-box analysis of self-stabilizing systems with run-to-completion behavior. The set of abstract states is the grid induced by a set of given predicates of interest. The framework is generic and parameterized with respect to the notion of stability. Different notions of stability are possible, depending on several factors: whether remaining in a region is possible (for some paths) or necessary (for all paths); whether the horizon of persistence in the stable region is unbounded, or lower-bounded in the number of discrete transitions and/or in the actual time. The framework also takes into account the notion of reachability in the concrete space, in order to limit the amount of spurious behaviors in the abstract description. We illustrate the relations holding between the corresponding abstractions, depending on the strength of the selected notion of stability.

Second, we present a practical algorithm to compute stability abstractions. We face two key difficulties. In the general case, one abstract transition is associated to a sequence of concrete transitions, of possibly unbounded length, so that a fixpoint must be reached. Furthermore, we need to make sure that the sequence starts from a reachable state. Contrast this with the standard SMT-based computation of predicate abstractions [15], where one transition in the abstract space corresponds to one concrete transition, and reachability is not considered.

Third, we show how to lift to the abstract space other relevant variables from the concrete space, so that each abstract transition is associated with guards and effects. This results in a richer abstraction where the abstract states (typically representing control modes) are complemented by information on the data flow of the additional variables (typically representing the actual control conditions in a given mode).

We experimentally evaluate the approach on several large RRIS implementing the control logic for shunting routes and switch controls. This research is strongly motivated by an ongoing activity on the migration of the Italian Railway Network from relay-based interlocking to computer-based interlocking [3]. Stability abstraction is the chosen formalism to reverse engineer the RRIS, and to automatically provide the actual specifications for computer-based interlocking. We demonstrate the effectiveness of the proposed algorithms, and the crucial role of reachability in terms of precision of the abstractions.

**Related Works.** This work has substantial differences with most of the literature in abstraction. For example, Predicate Abstraction (PA) [11] can be directly embedded within the framework; furthermore, PA does not take into account concrete reachability; finally, an abstract transition is the direct result of a concrete transition, and not, as in our case, of a sequence of concrete transitions.

In [5] the authors propose to analyze abstract transitions between invariant regions with an approximated approach. In comparison, we propose a general framework, parameterized on the notion of stability. Additionally, we propose effective algorithms to construct automata from concrete behaviors only, which symbolically represent the guards and the effects of the transitions.

The idea of weak bisimilarity [19], proposed for the comparison of observable behaviors of CCS, is based on collapsing sequences of silent, internal actions. The main difference with our approach is that weak bisimilarity is not used to obtain an abstraction for reverse engineering. Furthermore, in Abstraction Modulo Stability, observability is a property of states, and the silent actions are collapsed only when passing through unobservable (i.e., unstable) states.

Somewhat related are the techniques for specification mining, that have been extensively studied, for example in hardware and software. For example, DAIKON [9] extracts candidate invariant specifications from simulations. In our approach, the abstraction directly results in temporal properties that are guaranteed to hold on the system being abstracted. Yet, simulation-based techniques might be useful to bootstrap the computation of Abstraction Modulo Stability.

The work in [1] proposes techniques for the analysis of RRIS, assuming that a description of the stable states is already given. There are two key differences: first, the analysis of transient states is not considered; second, the extraction of a description in terms of stable states is a manual (and thus inefficient and error prone) task. For completeness, we mention the vast literature on the application of formal methods to railways interlocking systems (see e.g. [6,12,13,17,18]). Aside from the similarity in the application domain, these works are not directly related, given their focus on the verification of the control algorithms.

**Structure of the Paper.** In Sect. 2 we present the background notions. In Sect. 3 we present the framework of Abstraction Modulo Stability. In Sect. 4 we present the algorithms for computing abstraction. In Sect. 5 we present the experimental evaluation. In Sect. 6 we draw some conclusions and present the directions of future work.

#### **2 Background**

We work in the setting of Satisfiability Modulo Theories (SMT) [4], with quantifier-free first-order formulae interpreted over the theory of Linear Real Arithmetic (LRA). We use P, Q to denote sets of Boolean variables, p, q to denote truth assignments, and the standard Boolean connectives ∧, ∨, ¬, → for conjunction, disjunction, negation and implication; ⊤ and ⊥ denote true and false, respectively. For a set of variables V, let Ψ<sub>T</sub>(V) denote the set of first-order formulae over a theory T with free variables in V. When clear from context we omit the subscript. Let V′ ≐ {v′ | v ∈ V}. For a formula φ ∈ Ψ(V), let φ′ denote φ[V′/V], i.e. the substitution of each variable v ∈ V with v′.

A finite state automaton is a tuple A = ⟨Q, L, Q0, R⟩ where: Q is a finite set of states; L is the alphabet; Q0 ⊆ Q is the set of initial states; R ⊆ (Q × L × Q) is the labeled transition relation. We also consider automata with transitions annotated by guards and effects expressed as SMT formulae over given sets of variables. For (q1, ℓ, q2) ∈ R, we write q1 −ℓ→<sub>A</sub> q2. Let A1 and A2 be two automata defined on the same set of states Q and on the same alphabet L including a label τ: we say that A1 weakly simulates A2, and we write A1 ≲ A2, if whenever q −ℓ→<sub>A1</sub> q′, then q −ℓ→<sub>A2</sub> −τ→<sup>∗</sup><sub>A2</sub> q′, where −τ→<sup>∗</sup> is a (possibly null) sequence of transitions labeled with τ.

A symbolic timed transition system is a tuple M = ⟨V, C, Σ, Init, Invar, Trans⟩, where: V is a finite set of state variables; C ⊆ V is a set of clock variables; Σ is a finite set of Boolean variables encoding the alphabet; Init(V), Invar(V), Trans(V, Σ, V′) are SMT formulae describing the initial states, the invariant and the transition relation, respectively. The clocks in C are real-valued variables. We restrict the formulae over clock variables to atoms of the form c ⋈ k, for c ∈ C, k ∈ ℝ and ⋈ ∈ {≤, <, ≥, >, =}. The clock invariants are convex. We allow the other variables in V to be either Boolean or real-valued.

A state is an assignment for the V state variables, and let S denote the set of all the interpretations of V . We assume a distinguished clock variable *time* ∈ C initialized with *time* = 0 in Init, representing the global time.

The system evolves following either a discrete or a timed step. The timed transition entails that there exists δ ∈ ℝ<sup>+</sup> such that c′ = c + δ for each clock variable c ∈ C, and v′ = v for all the other variables<sup>1</sup>. The discrete transition entails that *time*′ = *time* and can change the other variables instantaneously.

A valid trace π is a sequence of states (s0, s1, ...) that all fulfill the Invar condition, such that s0 |= Init and, for all i, (si, ℓi, si+1) |= Trans(V, Σ, V′) for some assignment ℓi to Σ. We denote with Reach(M) the set of states that are reachable by a valid trace in M. We adopt a hyper-dense semantics: in a trace π, time is weakly monotonic, i.e. si.*time* ≤ si+1.*time*. We disregard Zeno behaviors, i.e. every finite run is a prefix of a run in which *time* diverges.

The states in which time cannot elapse, i.e. which are forced to take an instantaneous discrete transition, are called *urgent* states. We assume the existence of a boolean state variable *urg* ∈ V which is true in all and only the urgent states. Namely, for every pair of states (si, si+1) in a path π where si.*urg* is true, then (si.*time* = si+1.*time*).

<sup>1</sup> We abuse the notation and write P = Q for P ↔ Q when P and Q are Boolean variables.

We consider CTL+P [16], a branching-time temporal logic with the future and past temporal operators. A history h = (s0, ..., sn) for M is a finite prefix of a trace of M. For a CTL+P formula ψ, write M, h |= ψ meaning that after h, s<sup>n</sup> satisfies ψ in M. Operators AGψ, E(ψ<sup>1</sup> U ψ2), Hψ are used with their standard interpretations (*in every future* ψ *will always hold*, *there exists a future in which* ψ<sup>1</sup> *holds until* ψ2, *in the current history* ψ *always held*, respectively).

#### **3 The Framework of Abstraction Modulo Stability**

#### **3.1 Overview**

We tackle the problem of abstracting a concrete system in order to mine relevant high-level properties about its behavior.

We are interested in how the system reacts to stimuli: when an action is performed, we want to skip the intermediate steps that are necessary to accomplish an induced effect, and evaluate how stable conditions are connected to each other. The definition of stability is the core filter that defines which states we want to observe when following a run-to-completion process, i.e., the run triggered by a stimulus under the assumption that the inputs remain stationary. In practice, several definitions of stability are necessary, each of them corresponding to a different level of abstraction.

An additional element of the desired abstraction is that the relevant properties concern particular evaluations of the system. We consider a given abstract space, which intuitively holds the observable evaluations of the system, onto which we will project the concrete states.

In this section we describe a general framework for *Abstraction Modulo Stability*, which is parametric with respect to the abstract domain and the definition of stability. The result will be a finite state system which simulates the original model, by preserving only the stable way-points on the abstract domain, and by skipping the transient (i.e., unstable and unobservable) states.

Finally, we define how the obtained abstract automata can be enriched with guards and effects for each transition.

*Example 1.* Consider as running example the timed transition system S shown in the right hand side of Fig. 1 which models a tank receiving a constant incoming flow of water, with an automatic safety valve.

S has a clock variable c which monitors the velocity of the filling and emptying processes, and reads an input Boolean variable in.flow. The status of this variable is controlled by the environment E, shown in the left hand side of the figure. In the transition relation of E, the variables in Σ encode the labels for the stimuli, which are variations of the input variable in.flow. In particular, if Σ = τ, then in.flow is unchanged, and we say that the system S is not receiving any stimulus. S reacts according to the updated value of in.flow′. The discrete transitions of S are labeled with guards and with resetting assignments on the clock variable (in the form *[guards]/resets*). The system starts in the Empty location. A discrete transition reacts to a true in.flow′ by jumping to Filling and resetting c := 0. The invariant c ≤

**Fig. 1.** A timed transition system representing a tank of water.

10 of Filling forces the system to transit to a Warning location after 10 time units, corresponding to the time needed to reach a critical level. Warning is urgent: as soon as S reaches this state, it is forced to take the next discrete transition. The urgency of location Warning models the causality relation between the evaluation of the level of water and the instantaneous opening of a safety valve. Due to the latter, in location Full the system dumps all the incoming water and keeps the level of water stable. If the input is closed, S transits to Emptying. In this condition, water is discharged faster: after 2 time units the system is again in Empty. The transitions between Filling and Emptying describe the system's reaction to a change of the input while in the charging/discharging process.

We consider as predicates of interest exactly the five locations of the system. The stability abstraction of the composed system is meant to represent the stable conditions reached after the triggering events defined by Σ.

#### **3.2 Abstraction Modulo Stability**

Consider a symbolic timed transition system M = ⟨X, C, Σ, Init, Invar, Trans⟩ whose discrete transitions are labeled by assignments to Σ representing stimuli. A stimulus corresponds to a variation of some variables I ⊆ X which we call *input* variables. Namely, we can picture M as a closed system partitioned into an *environment* E which changes the variables I, and an open system S which reads the conditions of the updated variables I and reacts accordingly: Trans(X, Σ, X′) = Trans<sub>E</sub>(I, Σ, I′) ∧ Trans<sub>S</sub>(V, I′, V′), with V = X \ I.

In particular, we assume a distinguished assignment τ to the labels Σ, corresponding to the absence of stimuli: Trans<sub>E</sub>[Σ/τ] = (I′ ↔ I). The transition labeled with τ is the *silent* or *internal* transition. It corresponds to the discrete changes which keep the inputs stationary (i.e., unchanged) and to the timed transitions. We write M<sup>τ</sup> for the restriction of M which evolves only with the silent transition τ, i.e., under the assumption that no external interrupting action is performed on S, so that I′ ↔ I is entailed by the transition relation. We assume that M is never blocked waiting for an external action: this makes M<sup>τ</sup> always responsive to the τ transition. Moreover, we assume that Zeno behaviors are not introduced by this restriction.

We define a framework for abstracting M parametric on an abstract domain Φ and a stability definition σ.

*Abstract Domain.* Among the variables of the system M, consider a set of Boolean variables P ⊆ X representing important predicates. The abstract domain Φ is the domain of the Boolean combinations of the P variables.

*Stability Definition.* Let σ(X) be a CTL+P formula providing a stability criterion.

**Definition 1 (**σ**-Stability).** *A concrete state* s *with history* h = (s0,...,s) *is* σ*-stable if and only if*

$$\mathcal{M}^{\tau}, h \models \sigma.$$

Note that the stability is evaluated in M<sup>τ</sup>, i.e. under the assumption that the inputs are stationary: at the reception of an external stimulus, a σ-stable state might move to a new concrete state which does not satisfy σ. We say that a state s is σ-stable in a region p ∈ Φ if it is σ-stable and s |= p.

The states for which M<sup>τ</sup>, (s0, ..., s) ⊭ σ are said to be σ-*unstable*. These states might be transient during a convergence process which leads to the next stable state. In the following we will omit the prefix σ when clear from context.

**Definition 2 (Abstraction Modulo** σ**-Stability).** *Given a concrete system* M = ⟨X, C, Σ, Init, Invar, Trans⟩*, with* P ⊆ X *Boolean variables, the* abstraction modulo σ-stability *of* M *is a finite state automaton* A<sub>σ</sub> = ⟨Φ, 2<sup>Σ</sup>, Init<sub>σ</sub>, Trans<sub>σ</sub>⟩*. For each* p0 ∈ Φ*,* p0 |= Init<sub>σ</sub> *if and only if there exists a state* s0 ∈ S *such that* s0 |= Init*, and, with* h0 = (s0)*,*

$$\mathcal{M}^\tau, h\_0 \models \mathrm{E}(\neg \sigma \,\mathrm{U}\, (\sigma \wedge p\_0)).$$

*For each* p1, p2 ∈ Φ*,* ℓ ∈ 2<sup>Σ</sup>*, the triple* (p1, ℓ, p2) |= Trans<sub>σ</sub> *if and only if there exist states* s0, s1, s2 ∈ S *and histories* h1 = (s0, ..., s1)*,* h2 = (s2) *such that* (s1, ℓ, s2) |= Trans*, and such that*

$$
\mathcal{M}^\tau, h\_1 \models \sigma \wedge p\_1, \qquad \mathcal{M}^\tau, h\_2 \models \mathrm{E}(\neg \sigma \,\mathrm{U}\, (\sigma \wedge p\_2)).
$$

Abstract automaton A<sub>σ</sub> simulates with a single abstract transition a run of the concrete system M that connects two σ-stable states with a single event and possibly multiple steps of internal τ transitions. We call such a convergence process a *run-to-completion* triggered by the initial event.

Observe that the abstraction is led by the definition of σ-stability. It preserves only the abstract regions in which there is a σ-stable state. The transient states are not exposed, hence disregarding also the behaviors of M in which a new external stimulus interrupts a convergence still in progress. In other words, it represents the effects of stimuli accepted only in stable conditions.
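For a finite explicit-state system, Definition 2 can be prototyped directly: from each σ-stable source, take one labeled concrete step and then follow τ transitions through unstable states until stable targets are found. The following Python sketch is illustrative only (state names refer to the tank running example; the stimulus labels and function names are our own assumptions):

```python
# Hypothetical explicit-state reading of Definition 2: 'trans' is a set of
# triples (s, label, s'), with label "tau" for silent steps; 'stable' decides
# sigma-stability of a state; 'absf' projects states onto the domain Phi.

def stable_targets(trans, stable, s):
    """Stable states reachable from s via tau paths through unstable states,
    witnessing E(not sigma U (sigma and p2))."""
    found, seen, stack = set(), {s}, [s]
    while stack:
        q = stack.pop()
        if stable(q):
            found.add(q)      # stop the search past a stable state
            continue
        for (a, lbl, b) in trans:
            if a == q and lbl == "tau" and b not in seen:
                seen.add(b)
                stack.append(b)
    return found

def abstraction(trans, stable, absf):
    """The abstract transitions (p1, label, p2) of Definition 2."""
    return {(absf(s1), lbl, absf(t))
            for (s1, lbl, s2) in trans
            if stable(s1)
            for t in stable_targets(trans, stable, s2)}
```

On the tank with hypothetical stimulus labels open/close and stable states {Empty, Full}, this yields exactly the transitions (Empty, open, Full) and (Full, close, Empty).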

In this way, A<sup>σ</sup> satisfies invariant properties that would have been violated in σ-unstable states, transient along an internal run-to-completion.

*Reachability-Aware Abstraction.* Abstractions modulo stability can be tightened by considering only concrete reachable states in M. In fact, in the setting of reverse engineering, considering unreachable states may result in an abstraction that includes impossible behaviors that have no counterpart in the concrete space. This is done by enforcing the first state of h1 in Definition 2 to be reachable in M. This is an orthogonal option to the choice of the stability definition σ.

#### **3.3 Instantiating the Framework**

The level of abstraction of Aσ, i.e., the disregarded behaviors, is directly induced by the chosen definition of σ. Its adequacy depends on both the application domain and the objective of the analysis. We now explore some possibilities that we consider relevant in practice.

*Predicate Abstraction.* Firstly, we show that the Abstraction Modulo Stability framework is able to cover the known *predicate abstraction* [11,14]. With a trivial stability condition

$$
\sigma\_1 \doteq \top,
$$

every concrete state s is stable and is projected in the abstract region it belongs to (p = ∃(X \ P) . s). In this way, all concrete transitions (including the timed ones) are reflected in the corresponding A<sup>σ</sup><sup>1</sup> .

*Non-urgent Abstraction.* Urgent states are the ones in which time cannot elapse, and are forced to transit with a discrete transition. They are usually exploited to decompose a complex action made of multiple steps and to faithfully model the causality along a cyclical chain of events. Unfortunately, by construction, urgent states introduce *transient* conditions which may be physically irrelevant. In practice, in the analysis of the system's behaviors, one may want to disregard the intermediate steps of a complex instantaneous action.

To this aim, we apply the Abstraction Modulo Stability framework and keep only the states in which time can elapse, even for an arbitrarily small amount:

$$
\sigma\_2(X) \doteq \neg urg.
$$

The obtained abstract automaton A<sup>σ</sup><sup>2</sup> has transitions that correspond to *instantaneous* run-to-completion processes, skipping urgent states until time is allowed to elapse.

*Example 2.* On the left hand side of Fig. 2 we show the abstraction of the tank system obtained using σ1. An abstract transition connects two predicates (recall that in this example predicates correspond to concrete locations) if they are connected in S, by either a discrete or a timed transition.

On the right hand side of Fig. 2 we show the abstraction obtained using σ2. With respect to A<sup>σ</sup><sup>1</sup> , here location Warning is missing, since time cannot elapse in it, and an abstract transition connects directly Filling to Full.

**Fig. 2.** Abstractions modulo σ<sup>1</sup> and σ<sup>2</sup> on the tank running example.

*Eq-predicate Abstractions.* Let *Eq*(P) be a formula expressing implicitly that the interpretations of the abstract predicates are not changing during a transition (either a discrete or a timed step).

We now address the intuitive definition: *"a stable state is associated with behaviors that preserve the abstract predicates for enough time, i.e., if the system is untouched, then the predicates do not change value for a sufficient time interval"*. One can choose to measure the permanence of s in p ∈ Φ in terms of number of steps (e.g., at least K concrete steps, with K ∈ ℕ<sup>+</sup>), or in terms of continuous time (e.g., for at least T time, with T ∈ ℝ<sup>+</sup>), or both.

This intuitive definition can be interpreted both backward and forward. In this paragraph we illustrate the backward perspective.

Consider the doubly bounded definition

$$
\sigma\_3^{T,K}(X) \doteq \mathcal{H}^{>T,>K} Eq(P),
$$

where: M<sup>τ</sup>, h |= σ3<sup>T,K</sup> if and only if h = (s0 ... si), with i ≥ K, and for some p ∈ 2<sup>P</sup>

$$\begin{pmatrix} \forall j \in [i-K,\, i]:\ s\_j \models p \ \wedge \\ s\_i.time - s\_{i-K}.time > T \end{pmatrix}.$$

Such characterization of stability captures the states that have been in the same predicate assignment for at least K steps *and* at least T time has elapsed in such frame. Several variants of this definition are possible, e.g. by using only one bound.

This definition is referred to as *backward* since we consider the history of the system: a stable state has a past trajectory that remained in the same abstract region for enough time/steps. It is practically relevant in contexts where it is useful to highlight the dwell time of the system in a given condition. The only visible behaviors are the ones that were exposed for sufficient time/steps.
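On a finite history represented as (predicate-valuation, timestamp) pairs, the condition defining σ3<sup>T,K</sup> is a direct window check. This minimal Python sketch is illustrative only; the history encoding is an assumption of the fragment, not the paper's representation:

```python
# Minimal sketch: a history is a list of (preds, time) pairs, where 'preds'
# is the tuple of truth values of the predicates P in that state.

def sigma3_stable(history, T, K):
    """True iff the last K+1 states agree on P and span more than T time."""
    i = len(history) - 1
    if i < K:
        return False
    window = history[i - K:]             # states s_{i-K}, ..., s_i
    preds_i, time_i = history[i]
    return (all(p == preds_i for (p, _) in window)
            and time_i - history[i - K][1] > T)
```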

It can be easily seen that if a history h satisfies σ3<sup>T2,K</sup>, then it also satisfies σ3<sup>T1,K</sup>, with T1 ≤ T2.

Notably, for the instantiations of σ3 with K = 1, a state is stable if it has just finished a timed transition elapsing at least T time. In the following, we omit the superscript K from σ3<sup>T,K</sup> when K = 1. We have that if a history h satisfies σ3<sup>T</sup>, then it also satisfies σ2. Namely, while every urgent state (i.e., a

**Fig. 3.** Abstractions modulo σ<sup>T</sup> =7 <sup>3</sup> and σ<sup>4</sup> on the tank running example.

transient state for σ2) is transient also for σ3<sup>T</sup>, under σ3<sup>T</sup> the non-urgent states that are accidentally traversed in 0 time also become transient, for example because an outgoing discrete transition is immediately enabled.

*Future Eq-predicate Abstractions.* In contrast to the backward evaluation of σ3, one can think of assessing stability forward, by looking at the future(s)<sup>2</sup> of the state. A possible definition in this perspective would be

$$
\sigma\_4(X) \doteq \mathrm{AG}\, \mathit{Eq}(P),
$$

asking that, as long as only τ transitions are taken, the system will never change the evaluation of predicates. Namely, once a state is σ4-stable, it can change the predicates only with an external event, and the abstract states in A<sup>σ</sup><sup>4</sup> are closed under τ transitions. This is similar in spirit to the notion of P-stable abstraction of [5], with the difference that in the latter arbitrary regions are considered.

Within this perspective, alternative definitions can be obtained by interchanging the existential/universal path quantifiers (e.g., EG *Eq*(P) characterizes a state for which there exists a future that never changes the predicate evaluations), or by bounding the "globally" operator (e.g., AG<sup>>K</sup> *Eq*(P) captures a state which is guaranteed to expose the same evaluations of the predicates in the next K steps). Observe that all these variants would assess the σ-stability of a state *before* it has actually proven to expose the same predicates for enough time/steps.
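For a finite-state system, checking σ4 = AG *Eq*(P) amounts to verifying that the whole τ-closure of a state agrees with it on the predicate valuation. A hedged Python sketch follows (tau_succ and preds are assumed helper functions of this fragment, not the paper's API):

```python
# Sketch of sigma_4 = AG Eq(P) on a finite system: 'tau_succ' maps a state
# to its tau-successors, 'preds' to its predicate valuation.

def sigma4_stable(tau_succ, preds, s):
    """True iff every tau-reachable state keeps the valuation preds(s)."""
    target, seen, stack = preds(s), {s}, [s]
    while stack:
        q = stack.pop()
        if preds(q) != target:
            return False      # some tau path changes a predicate
        for r in tau_succ(q):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return True
```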

*Example 3.* On the left hand side of Fig. 3 we show the abstraction obtained with the σ3<sup>T,K</sup> definition, using T = 7 and K = 1. State Emptying is unstable, since time cannot elapse in it for more than T time: namely, from Full, at the reception of the stimulus which closes in.flow, all the τ-paths lead to Empty in less than T time. On the other hand, Filling is kept, since the system may stay in this location for enough time to be considered relevant.

On the right hand side of Fig. 3 we show the abstraction obtained with σ4. Here, the stable states are only Empty and Full: the others are abstracted away, since they are not invariant for the τ internal transition. Each external event directly leads to the end of a timed process which converges in the next stable state. Note that in this setting, abstract transitions labeled with τ can only be self-loops.

<sup>2</sup> Note that, in contrast to the backward case where the past is unique, in the forward case we adopt a branching time view with multiple futures.

Here, Aσ<sup>4</sup> corresponds to the P-stable abstraction because the chosen abstract domain Φ is able to express the "minimally stable" regions [5] of M.

Observe that A<sub>σ4</sub> would also be obtained by increasing the time bound of σ3<sup>T</sup>, e.g., with T = 15.

As the examples show, different stability definitions induce abstract automata with different numbers of states and transitions. The following proposition states the effect on the abstract automata of making the stability definition stricter. Let us write p1 −ℓ→<sub>σ</sub> p2 meaning that (p1, ℓ, p2) |= Trans<sub>σ</sub> in A<sub>σ</sub>.

**Proposition 1.** *Let* σ *and* σ′ *be two stability definitions such that every history that is* σ*-stable is also* σ′*-stable, and let* A<sub>σ</sub> *and* A<sub>σ′</sub> *be the corresponding abstractions modulo stability of the same concrete model* M*. Then,* A<sub>σ</sub> *weakly simulates* A<sub>σ′</sub>*, i.e.,* A<sub>σ</sub> ≲ A<sub>σ′</sub>*.*

*Proof.* By definition, if p1 −ℓ→<sub>σ</sub> p2, then there exists (s1, ℓ, s2) |= Trans with (1) M<sup>τ</sup>, h1 |= σ ∧ p1, and (2) M<sup>τ</sup>, h2 |= E(¬σ U (σ ∧ p2)), with h1 = (s0, ..., s1) and h2 = (s2). Since every σ-stable history is also σ′-stable, from (1) we obtain that M<sup>τ</sup>, h1 |= σ′ ∧ p1, and from (2) we derive

$$\begin{aligned} \mathcal{M}^\tau, h\_2 \models \mathrm{EF}(\sigma \wedge p\_2) &\implies \mathcal{M}^\tau, h\_2 \models \mathrm{EF}(\sigma' \wedge p\_2) \\ &\implies \mathcal{M}^\tau, h\_2 \models \mathrm{E}(\neg \sigma' \,\mathrm{U}\, (\sigma' \wedge \mathrm{EX}\, \mathrm{E}(\neg \sigma' \,\mathrm{U}\, \dots (\sigma' \wedge p\_2) \dots ))) \end{aligned}$$

Hence, p1 −ℓ→<sub>σ′</sub> −τ→<sup>∗</sup><sub>σ′</sub> p2, and A<sub>σ</sub> ≲ A<sub>σ′</sub>.

**Corollary 1.** *For all bounds* T1 ≤ T2 ∈ ℝ<sup>+</sup>*,*

$$\mathcal{A}\_{\sigma\_3^{T\_2}} \lesssim \mathcal{A}\_{\sigma\_3^{T\_1}} \lesssim \mathcal{A}\_{\sigma\_2} \lesssim \mathcal{A}\_{\sigma\_1}$$

#### **3.4 Extending with Guards and Effects**

Abstract transitions in A<sub>σ</sub> are labeled with the stimulus that has triggered the abstracted run-to-completion process. Recall that a stimulus ℓ ∈ 2<sup>Σ</sup> is connected to a (possibly null) variation of the inputs I by Trans<sub>E</sub>(I, Σ, I′). A *guard* for an abstract transition (p1, ℓ, p2) is a formula on the I variables entailed by Trans<sub>E</sub>[Σ/ℓ] which describes the configurations of inputs that, starting from p1 with event ℓ, lead to p2. In order to enrich the description of the *effects* of an abstract transition, we also consider a subset of state variables O ⊆ V, called *output* variables. Observe that an abstract transition may be witnessed by multiple concrete paths, each with its own configuration of inputs and outputs. Hence, we can keep track of a precise correlation between guards and effects with a unique relational formula on the I and O variables. This formula is obtained as a disjunction of all the configurations of inputs and outputs in the concrete states accomplishing stability in p2 (since the configuration of I set by the stimulus is preserved by τ along the run-to-completion process).
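On explicit valuations, the relational guard/effect formula degenerates to a set of input/output pairs per abstract transition, one disjunct per concrete witness. A toy Python sketch (in the paper this formula is built symbolically over the I and O variables; the witness encoding here is an assumption):

```python
# Toy sketch: each witness is (p1, label, p2, inputs, outputs); the guard/
# effect annotation of an abstract transition is the set (disjunction) of
# its witnesses' input/output valuations.

def annotate(witnesses):
    """Group the I/O valuations of concrete witnesses per abstract transition."""
    ann = {}
    for (p1, lbl, p2, inputs, outputs) in witnesses:
        ann.setdefault((p1, lbl, p2), set()).add((inputs, outputs))
    return ann
```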

*Example 4.* The stability abstractions shown in Figs. 2 and 3 are equipped with *guard* constraints, given as evaluations of the original input variable in.flow (shown in square brackets next to the labels of the stimuli).

### **4 Algorithms for Stability Abstractions**

In order to build the abstract automaton structure we have to check, for every pair (p1, p2) ∈ Φ × Φ, whether there exists a (reachable) σ-stable state s1 in p1 with (s1, ℓ, s2) |= Trans and M<sup>τ</sup>, s2 |= E(¬σ U (σ ∧ p2)). Reachability analysis and CTL/LTL model checking for infinite-state systems are undecidable problems. The work in [5] computes overapproximations of the regions that are invariant for silent transitions (i.e., it addresses an unbounded stability criterion AGφ), exploiting the abstract interpretation framework. This approach also overapproximates the multiple stable targets that may be given by the non-determinism of the concrete system.

Here, instead, we deal precisely with the non-determinism of the underlying concrete system by collecting information about the actual, visible consequences of an action, focusing on *bounded* stability definitions. In fact, we consider stability criteria that do not require fixpoint computations in the concrete system, *and* we under-approximate the reachability analysis by fixing a bound on the length of unstable paths. Namely, our algorithm follows an iterative deepening approach, which considers progressively longer unstable run-to-completion paths, seeking the next stable condition.

Intuitively, we search for concrete witnesses of an abstract transition from p1 to p2 by looking for a concrete path connecting a concrete σ-stable state s1 in p1 to a σ-stable state in p2, via a bounded reachability analysis from s1.

Notice that the algorithm builds a *symbolic* characterization of the stability automaton. In fact, instead of enumerating all pairs (p1, p2) ∈ Φ × Φ and checking whether they are connected by some concrete path, we incrementally build a formula characterizing all the paths of M<sup>τ</sup> connecting two σ-stable states. Then, we project this formula onto the P variables, thus obtaining symbolically all the abstract transitions having a witness of that length. This intuition is similar to the one exploited in [15] to efficiently compute predicate abstractions.

Moreover, having a formula representing the finite paths of M<sup>τ</sup> connecting two σ-stable states, we can extract guards and effects with a projection onto the I and O variables. Namely, while checking the existence of an abstract transition, we also synthesize the formula on I and O annotating it.

A significant characteristic of our approach, also with respect to the classical instantiation of predicate abstraction, is that we refine the abstract transitions by forcing the concrete states to be reachable from the initial condition.

In the following we describe the general algorithm for computing abstractions, parametric in the stability definition σ, and then show how the criteria proposed in Sect. 3.3 can actually be passed as parameters.

#### **4.1 Symbolic Algorithm for Bounded Stability**

Consider the symbolic encoding of an automaton M = ⟨X, C, Σ, Init, Invar, Trans⟩,<sup>3</sup> and a classification of the variables in X distinguishing the boolean predicate variables P, the input variables I, and the output variables O.

<sup>3</sup> For exposition purposes, we assume that Trans entails both Invar and Invar′.

We address the computation of the formulae Initσ(P) and Transσ(P, I, O, P′), for a stability definition provided as a formula σ(X0,...,Xn) with n ∈ N. The algorithm performs a reachability analysis based on two bounds: an overall bound L on the length of the explored concrete paths, and a bound U on the length of the unstable run-to-completion segments.


**Pseudocode 1.** Reachability-aware symbolic computation of the abstract transition relation Transσ


*Computation of* Transσ*.* Pseudocode 1 shows the algorithm for extraction of the transition relation Transσ. It builds a formula

$$\text{Init}(X\_0) \;\wedge \bigwedge\_{0 \le h < j} \text{Trans}(X\_h, X\_{h+1}) \;\wedge\; \sigma(X\_{i-n}, \dots, X\_i) \;\wedge \bigwedge\_{i-n \le h < i} I\_h = I\_{h+1} \;\wedge \bigwedge\_{i < h < j} \big( I\_h = I\_{h+1} \wedge \neg\sigma(X\_{h-n}, \dots, X\_h) \big) \;\wedge\; \sigma(X\_{j-n}, \dots, X\_j)$$

for each i, j with 0 ≤ j − i ≤ U and j < L. The procedure exploits the incrementality of SMT solvers, which organize assertions in a stack: the push/pop interface allows the addition of layers, in which new formulae can be inserted with the assert primitive. In this way, we can progressively build the path and avoid its recomputation for every pair i, j. Namely, for each j < L, we first build the path up to j (line 6) and assert σ-stability in j (line 9). Then we progressively try values of i going backward (in order to better exploit incrementality), constraining the I variables to be unchanged and asserting σ-instability (lines 11–12).

Function S.project-on() (line 16) performs an existential quantification of the formula currently on the solver stack. We preserve the variables P<sub>i</sub> and P<sub>j</sub>, which characterize the two stable states connected by the transition. The variables I<sub>j</sub> and O<sub>j</sub> are also preserved: in this way, we extract the guards and effects formulae directly while building the abstract transition. Notice that, due to the input stability hypothesis enforced along the unstable path, the input configuration read in j is the same as the one read immediately after the external event in i + 1.

Every contribution Trans<sub>σ</sub><sup>(i,j)</sup> found in this way is then merged into a single Transσ, after substitution of the variables in P, I, O, P′. Observe that an important optimization is to assert the negation of the already computed formula Transσ (shifted to the current i, j indices) before each projection (line 15), in order to avoid recomputing the same transitions.

*Reachability-Awareness.* A reachability-unaware version would drop the first part of the formula characterizing the path from 0 to i − n.

The described algorithm is reachability-aware, meaning that every considered stable state is, by construction, reachable from the initial condition Init. This is important to extract actually concretizable behaviors, and is a main difference with respect to the classical predicate abstraction technique: it is well known that the mere projection of the single transition relation onto the boolean predicates may introduce several spurious behaviors.

Note that the reachability-aware improvement is based on *concrete* reachability. In contrast, the algorithm of [5] exploits abstract reachability until a fixpoint in the abstract automaton, possibly incurring further over-approximations induced by the use of convergence accelerators.

*Computation of* Initσ*.* The algorithm for the extraction of the initial state Init<sup>σ</sup> is similar: it builds a formula

$$\text{Init}(X\_0) \;\wedge \bigwedge\_{0 \le h < i} \big( \text{Trans}(X\_h, X\_{h+1}) \wedge I\_h = I\_{h+1} \big) \;\wedge\; \sigma(X\_{i-n}, \dots, X\_i)$$

for every i ≤ U. Initσ is the collection of the contributions Init<sub>σ</sub><sup>(i)</sup>, obtained by fixing a stable slot in the last position i and projecting onto the P<sub>i</sub> variables.

#### **4.2 Instantiating the Algorithm**

The bounded stability definitions presented in Sect. 3 can be unrolled and expressed in the form σ(X0,...,Xn).

*Predicate Abstraction.* σ1(X0) = ⊤ trivially needs only the current variables. Observe that in this case we can use the bound U = 1, since the instability constraint is always unsatisfiable.

*Non-urgent Abstraction.* Given a classification of urgent conditions, σ2(X0) = ¬*urg*<sub>0</sub> can also be established by looking only at the current variables (it only needs n = 0).

*Eq-predicate Abstraction.* More generally, given bounds K and T, we encode that the abstract region has not changed for the last K steps and that at least T time has elapsed, using n = K and

$$\sigma\_3^{T,K}(X\_0, \dots, X\_K) = \bigwedge\_{0 \le h < K} P\_h = P\_{h+1} \;\wedge\; \text{elapsed}(X\_0, \dots, X\_K) \ge T$$

where elapsed(X0,...,XK) denotes the total time elapsed along the last K steps, expressed over the clock variables in C.

#### **5 Experimental Evaluation**

We evaluate the applicability and the adequacy of stability abstractions for the reverse engineering of real-world Relay-based Railway Interlocking Systems.

*Relay-Based Railway Interlocking Systems (RRIS).* RRIS are complex electro-mechanical circuits used to control stations and train traffic. Such systems receive stimuli from an external environment, including both human operators (e.g., performing actions on buttons) and physical entities (e.g., a train passing over some sensors). In response, they control railway elements, like signaling lights or railway switches. Internally, they use relays to propagate signals: relays are electro-mechanical components which, when activated, change the position of an associated contact after a (possibly null) delay.

The controlling logic implemented by RRIS is hidden by complex legacy internal optimizations performed over the years by numerous electro-mechanical engineers. For this reason, it is hard to understand their high-level behavior and highlight the connections between stimuli and observable railway properties.

The experimental evaluation is based on real-world RRIS schematics that are intended to control level crossings and shunting routes. Using the tool norma [2], the considered RRIS have been modeled and automatically converted into timed transition systems in the syntax of Timed nuXmv [7]. The obtained models involve several real-valued variables (modeling voltages and currents in the circuits), changing according to the configuration of the boolean variables (modeling the switches of the circuit). The discrete state changes when an external event updates the position of a switch, or as a consequence of the activation of an internal relay. Hence, these systems react to an external variation with a chain of internal transitions. The duration of the triggered run-to-completion process is important: urgent states are widely used to model the causality relation between the activation of an instantaneous relay and the action performed on the associated switch; timed relays may impose a small delay, so that the internal response is actually very fast and almost unobservable.


**Table 1.** Result of the abstraction of routesN RRIS benchmarks with different stability definitions.

*Abstraction Modulo Stability of RRIS.* The Timed nuXmv model checker was used to convert the models produced by norma into untimed transition systems in SMV. The algorithm presented in Sect. 4 has been implemented using the pySMT library [10] and the MathSAT5 SMT solver [8]. It requires as input a classification of the variables X, selecting the predicates P, the inputs I, and the outputs O, which can be directly provided by railway domain experts. We choose as P the status of some relays or (boolean variables associated with) linear predicates on the electrical variables, representing, for example, the status of a lamp.

Tables 1 and 2 report the number of variables X, P, I, O for each benchmark. Column Φ reports the size of the resulting abstract domain, obtained by considering all the *consistent* combinations of the P predicates (with respect to the invariant of the model).

We show the results of the Abstraction Modulo Stability considering the stability definitions described in Sect. 3.3, using the algorithm of Sect. 4 with bounds L = 40 and U = 15. All the experiments ran on a 2.4 GHz CPU, with the timeout (to) set to 15 h and the memory limit set to 20 GB.

Columns "Aσ states" and "Aσ trans" report the number of abstract states and transitions, respectively, computed by counting the configurations of the predicate variables in the abstract automaton Aσ. As stated in Corollary 1, the corresponding abstract automata have progressively fewer states.

Stability abstractions were used by railway experts from the Italian Railway Network company (RFI) to understand two main families of legacy RRIS.

*Routes.* routesN is a RRIS regulating the activation/deactivation of N shunting routes competing for the same resources. The implemented logic takes care of avoiding the simultaneous activation of conflicting routes. In such RRIS the inputs are the switches controlled by a human operator attempting to enable/disable a route; the outputs are the status of some internal entities that we want to monitor; the predicates are the status of lamps representing whether the routes have been registered.

In the routes benchmarks the delays used in the run-to-completion processes are very small, so that in the obtained abstract automata (Table 1) there is no difference between σ<sub>3</sub><sup>T=1</sup> and σ<sub>3</sub><sup>T=7</sup> (i.e., if a state has stayed in the same predicate for 1 time unit, then it can also stay there for 7). These abstract automata clearly highlight the consequences of the requests of a human operator with respect to the active/inactive status of the routes involved. As an example, the abstraction of routes02 (a circuit handling two routes) has only 4 stable states, which show that the routes are incompatible and that one of them has priority over the other, while disregarding all the intermediate steps that the concrete system needs to progressively check the availability of the resources. These steps are visible with a less strict stability definition, like σ1 or σ2.

Table 1 also evaluates the effectiveness of the reachability refinement. When dropping the prefix starting from the initial states of the concrete system, the algorithm would consider several spurious behaviors. Especially in these benchmarks, the resulting abstract automaton would also show the unreachable states (e.g., the ones in which two routes are in conflict), therefore reducing its relevance for the reverse engineering purpose. Moreover, the *reach.unaware* abstraction may be harder to compute, as it has to explore more transitions and more models in the guards and effects formulae.

*Railway Switch.* r-switch is a RRIS modeling a railway switch. It has several externally controlled switches and only 4 relevant observations, defining its abstract state. The schema can be instantiated as nominal (N) or faulty (F), by injecting faulty behaviors into some physical components. We consider three versions: r-switch1 interacts with a free environment, showing a wide number of circuit configurations; r-switch2 and r-switch3, instead, exploit some assumptions on the environment and expose fewer inputs, and, although using different internal implementations, are supposed to guarantee the same controlling logic.

Table 2 reports the features of the abstract automata obtained for these benchmarks. Here, during a run-to-completion process, some states dwell in the same predicate for a time 1 ≤ t ≤ 7, so that they are visible in σ<sub>3</sub><sup>T=1</sup> but skipped by σ<sub>3</sub><sup>T=7</sup> when reporting the corresponding abstract transition.

Again, the *reach.unaware* option reports more transitions. The difference is especially evident in the nominal versions, as the faulty concrete system already covers more behaviors. Even when the number of abstract transitions is the same, the *reach.aware* option reports more precise guards and effects, i.e., each annotating formula on I and O has fewer models.

By looking at the abstract automata, the user can recover the triggering reasons that make the system reach certain states (e.g., the ones that are shown in r-switch1 and not in r-switch2). Namely, A<sup>σ</sup> can highlight the enabling conditions for certain behaviors, which may apply far from the final observable consequence and would be hard to inspect by hand. In this way, the user can also collect the assumptions needed to avoid certain behaviors (e.g.,


**Table 2.** Result of the abstraction of r-switch RRIS benchmarks with different stability definitions.

in understanding what changes were made from r-switch1 to r-switch2 or r-switch3 schemas).

Finally, as expected, r-switch2 and r-switch3 have exactly the same abstract automata for every stability definition and nominal/faulty configuration, since they are two different implementations for the same observable properties.

*P-Stable Abstractions.* We also tried the implementation of [5], for approximated P-stable abstractions (σ4), which uses BDDs and convex polyhedra. On small handcrafted models like the tank system used as a running example, we could run all the approaches and confirm the output automata described in Sect. 3. Nonetheless, in the analysis of RRIS the approach of [5] turned out to be impractical, being unable to deal with any of the considered RRIS models due to the high number of variables.

More importantly, in our case studies, σ4 would likely result in abstractions that are too aggressive, hiding states that are practically interesting, such as those that emerge from the analysis of run-to-completion processes with non-negligible duration.

#### **6 Conclusions**

In this paper we presented a framework for the reverse engineering of legacy systems. Starting from a symbolic timed transition system, the framework supports the construction of abstractions in the form of state machines with guards and effects over transitions. The abstractions are parameterized on the notion of stability. We propose an SMT-based algorithm for abstraction computation, and we instantiate it over several notions of stability.

The results have been evaluated within an industrial project with the Italian Railway Network, on reverse-engineering tasks of complex relay-based interlocking circuits. The experimental analysis demonstrated that the approach is practical, and able to construct abstractions for complex real-world circuits. Taking reachability into account allowed us to produce tighter, more informative representations of the system under inspection. Railway signaling engineers involved in the project considered the proposed approach adequate in terms of expressiveness and able to provide substantial support in understanding the legacy RRIS.

In the future, we will define an "anytime" version of the algorithms, so that the abstraction can be incrementally visualized as the computation proceeds, and leverage parallelization to increase efficiency. Given the positive feedback from the RFI experts, we plan to integrate the proposed abstraction techniques within a RRIS modeling front-end, and to apply them to a larger set of interlockings.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Reachability of Koopman Linearized Systems Using Random Fourier Feature Observables and Polynomial Zonotope Refinement**

Stanley Bak<sup>1</sup> , Sergiy Bogomolov<sup>2</sup> , Brandon Hencey<sup>3</sup> , Niklas Kochdumper1(B) , Ethan Lew<sup>4</sup> , and Kostiantyn Potomkin<sup>2</sup>

<sup>1</sup> Department of Computer Science, Stony Brook University, Stony Brook, NY, USA {stanley.bak,niklas.kochdumper}@stonybrook.edu <sup>2</sup> School of Computing, Newcastle University, Newcastle Upon Tyne, UK {sergiy.bogomolov,k.potomkin2}@newcastle.ac.uk <sup>3</sup> Air Force Research Laboratory, Wright-Patterson Air Force Base, Dayton, OH, USA brandon.hencey@us.af.mil <sup>4</sup> Galois Inc., Portland, OR, USA

elew@galois.com

**Abstract.** Koopman operator linearization approximates nonlinear systems of differential equations with higher-dimensional linear systems. For formal verification using reachability analysis, this is an attractive conversion, as highly scalable methods exist to compute reachable sets for linear systems. However, two main challenges are present with this approach, both of which are addressed in this work. First, the approximation must be sufficiently accurate for the result to be meaningful, which is controlled by the choice of *observable functions* during Koopman operator linearization. By using random Fourier features as observable functions, the process becomes more systematic than earlier work, while providing a higher-accuracy approximation. Second, although the higher-dimensional system is linear, simple convex initial sets in the original space can become complex non-convex initial sets in the linear system. We overcome this using a combination of Taylor model arithmetic and polynomial zonotope refinement. Compared with prior work, the result is more efficient, more systematic and more accurate.

**Keywords:** Koopman operator · Reachability analysis · Polynomial zonotopes · Random Fourier features · Formal verification

#### **1 Introduction**

Despite recent advances, systems described by nonlinear ordinary differential equations are still hard to analyze, control, and verify. On the other hand, a powerful body of methods and theories exists for linear systems making analysis, control, and verification much easier, even for high-dimensional systems. The efficiency of techniques related to reachability analysis for linear systems [4,6,15] motivates the use of Koopman operator linearization, where a higher-dimensional linear system approximates the dynamic behavior of a nonlinear system. Koopman operator techniques are also well-suited for data-driven approaches since the Koopman linearized system can be directly created from measurements, bypassing a potentially complex modeling step. The Koopman framework has been successfully applied to many applications, including control [26,28], state estimation [31] and recently, formal verification [5].

The main contribution of this paper is to advance the state of the art in formal verification using reachability analysis on Koopman operator linearized systems. First, we improve the accuracy of the finite Koopman linearization by employing random Fourier features [29]. In contrast with an *ad hoc*, finite-dimensional feature space, random Fourier features leverage the powerful *kernel trick* from machine learning [36,38] to generate a computationally tractable mapping over an infinite-dimensional feature space. Second, we improve speed. Instead of using an SMT solver to reason over non-convex initial sets, we propose combining Taylor models with polynomial zonotope refinement. A comparison on the same nonlinear system benchmarks used in the earlier Koopman verification work [5] demonstrates both the improved accuracy and the improved verification speed.

#### **1.1 Related Work**

The concept of Koopman operator linearization was originally introduced in 1931 [22]. Instead of investigating the dynamic evolution of the original system state, the Koopman approach considers the evolution of so-called *observable functions* or *observables* defined by nonlinear transformations of the original system state. Since the set of all possible observables defines a vector space, it then holds that the dynamic behavior of every nonlinear system can be equivalently represented by an infinite dimensional linear system. Because it is obviously infeasible to handle infinite dimensions, a finite set of observables is used in practice. Given such a set, the system matrix resulting in the most accurate linear approximation of the original system behavior can be determined using extended dynamic mode decomposition [41].

Many different methods for determining good observables have been proposed: Carleman linearization [7] equivalently represents the dynamic behavior of polynomial systems with an infinite dimensional linear system. The corresponding observables are multi-variate monomials, which are determined by repeatedly computing the time-derivative of the current observables. Terminating this iteration after a certain number of steps yields a finite set of observables. Carleman linearization can be extended to general nonlinear systems by using a Taylor series expansion. A finite set of observables defines an exact linear representation of the original system if the vector space spanned by the observables is closed under the operation of Lie-derivatives [34]. Consequently, a natural approach is to refine an initial set of observables by removing observables that violate the condition [34]. This concept can be extended to obtain polynomial instead of linear representations for the original nonlinear system [35]. Another class of approaches uses neural networks as observables [16,43], where the weights of the network are trained on traces of the real system. Since these approaches usually train the system matrix together with neural networks, they circumvent the subsequent application of dynamic mode decomposition. If one aims to reason about the original system based on the Koopman linearization, some quantification of the approximation error is required. Several approaches derive error bounds for truncated Carleman linearization [3,12,24] considering quadratic systems [24], polynomial systems [12], as well as general nonlinear systems [3].
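For intuition, the Carleman-style iteration of repeated derivatives can be reproduced on the small system ẋ1 = x1, ẋ2 = x2 − x1<sup>4</sup> used as the running example in Sect. 2 (an illustrative SymPy sketch, not the method of any cited work): starting from the state monomials, each Lie derivative may introduce new monomials, and here the iteration closes after one step.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = {x1: x1, x2: x2 - x1**4}          # vector field of the running example

def lie(g):
    # Lie derivative of observable g along f: (dg/dx) . f(x)
    return sp.expand(sum(sp.diff(g, v) * rhs for v, rhs in f.items()))

# Carleman-style iteration: add every new monomial appearing in a Lie
# derivative of the current observables (the list grows while we iterate).
obs = [x1, x2]
for g in obs:
    for term in lie(g).as_ordered_terms():
        mono = term.as_coeff_Mul()[1]  # strip the numeric coefficient
        if mono not in obs:
            obs.append(mono)
print(obs)
```

The observable set closes at {x1, x2, x1<sup>4</sup>}: the Lie derivative of x1<sup>4</sup> is 4x1<sup>4</sup>, already in the span, which is exactly the closure condition of [34].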

The main motivation for using the Koopman framework for reachability analysis is that reachable sets for linear systems can be computed efficiently [11,15,23] even for high-dimensional systems [2,4,6], while reachability analysis for nonlinear systems [1,8,27] is often computationally demanding and potentially results in large over-approximations. Another advantage is that the Koopman approach can also be applied to data-driven systems where no model is available. Due to the nonlinear transformation of the initial state defined by the observables, reachability analysis for Koopman operator linearized systems represents a special type of reachability problem. To the best of our knowledge, only two approaches exist so far: The first approach [13] utilizes the error bounds for quadratic systems [24] to compute an enclosure of the reachable set for weakly nonlinear systems based on a finite Carleman linearization, where interval arithmetic [17] is applied to enclose the image of the initial set through the observables. The second approach [5], which represents the work closest to our method, presents two different verification strategies: 1) direct encoding of the nonlinear transformation defined by the observables using an SMT solver, and 2) zonotope domain splitting, where the initial set is recursively split into smaller sets until the specification can be verified or falsified.

#### **1.2 Overview**

In this work we address the two main bottlenecks of formal verification for Koopman operator linearized systems, which are the selection of observables and the computation of the image of the initial set through the nonlinear transformation defined by the observables. In particular, while currently observables often have to be selected manually by the user, we generate observables in a systematic fashion using random Fourier features. As we demonstrate with numerical experiments, these observables yield high-accuracy approximations of the real system behavior. Moreover, while previous approaches either compute very conservative convex enclosures of the image through the observables [13] or have to split the initial set in order to achieve a desired precision [5], we calculate tight non-convex enclosures of the image by combining Taylor model arithmetic with polynomial zonotopes. To conduct collision checks between the resulting non-convex reachable set enclosures and unsafe regions we then use a novel polynomial zonotope refinement strategy, which is significantly faster than the previous SMT solver and zonotope domain splitting approaches [5].

The remainder of the paper is structured as follows: We first recapitulate some preliminary results that are required throughout the paper in Sect. 2. In the main part we then describe the systematic generation of observables using random Fourier features in Sect. 3, before we present our proposed verification algorithm in Sect. 4. Finally, we demonstrate the superior performance of random Fourier feature observables and our verification algorithm in comparison with existing techniques on various benchmark systems in Sect. 5.

#### **1.3 Notation**

In the remainder of this paper, we will use the following notation: sets are denoted by calligraphic letters, matrices by uppercase letters, vectors by lowercase letters, and lists by bold uppercase letters. Given a vector b ∈ R<sup>n</sup>, b<sub>(i)</sub> refers to the i-th entry. Given a matrix A ∈ R<sup>n×m</sup>, A<sub>(i,·)</sub> represents the i-th matrix row, A<sub>(·,j)</sub> the j-th column, and A<sub>(i,j)</sub> the j-th entry of matrix row i. Given a discrete set of positive integer indices H = {h<sub>1</sub>,...,h<sub>w</sub>} with 1 ≤ h<sub>i</sub> ≤ m for all i ∈ {1,...,w}, A<sub>(·,H)</sub> is used for [A<sub>(·,h<sub>1</sub>)</sub> ... A<sub>(·,h<sub>w</sub>)</sub>], where [C D] denotes the concatenation of two matrices C and D. The symbols **0** and **1** represent matrices of zeros and ones of proper dimension, the empty matrix is denoted by [ ], and I<sub>n</sub> ∈ R<sup>n×n</sup> is the identity matrix. Given an ordered list **L** = (l<sub>1</sub>,...,l<sub>n</sub>), **L**<sub>(i)</sub> = l<sub>i</sub> refers to the i-th entry and |**L**| = n denotes the number of elements in the list. Moreover, the concatenation of two lists **L**<sub>1</sub> and **L**<sub>2</sub> is denoted by (**L**<sub>1</sub>, **L**<sub>2</sub>). The left multiplication of a matrix M ∈ R<sup>m×n</sup> with a set S ⊂ R<sup>n</sup> is defined as MS = {Ms | s ∈ S}, and the Cartesian product of two sets is denoted by the × operator. We further introduce an n-dimensional interval as I = [l, u] with l<sub>(i)</sub> ≤ u<sub>(i)</sub> for all i, l, u ∈ R<sup>n</sup>.

### **2 Preliminaries**

Our approach utilizes several existing techniques and concepts, which we shortly recapitulate here. We use the nonlinear system

$$\begin{aligned} \dot{x}\_1 &= x\_1 \\ \dot{x}\_2 &= x\_2 - x\_1^4 \end{aligned} \tag{1}$$

in combination with the initial set X<sub>0</sub> = [−2, 2] × [0, 4] as a running example throughout this section.

#### **2.1 Koopman Operator Linearization**

First, we describe the general concept of Koopman operator linearization [22]. Given a nonlinear system

$$\frac{\partial x}{\partial t} = f(x) \quad \text{with} \quad x \in \mathbb{R}^n, \ f: \mathbb{R}^n \to \mathbb{R}^n,\tag{2}$$

our goal is to find observables <sup>g</sup><sup>i</sup> : <sup>R</sup><sup>n</sup> <sup>→</sup> <sup>R</sup> such that the dynamics of the resulting new variables gi(x) is linear:

$$\frac{\partial g(x)}{\partial t} = A \, g(x) \quad \text{with} \quad A \in \mathbb{R}^{m \times m}, \tag{3}$$

where g(x) = [g<sub>1</sub>(x) ... g<sub>m</sub>(x)]<sup>T</sup> is the observable function. Since the new variables g<sub>i</sub>(x) are functions of the original system state x, the linear system (3) defines an equivalent representation of the dynamic behavior of the original system (2). Usually, the number of observables m is significantly larger than the dimension n of the original system.

Let us demonstrate Koopman linearization for our exemplary system in (1). By choosing the observables g<sub>1</sub>(x) = x<sub>1</sub>, g<sub>2</sub>(x) = x<sub>2</sub>, and g<sub>3</sub>(x) = x<sub>1</sub><sup>4</sup> we obtain the linear system

$$
\frac{\partial}{\partial t} \begin{bmatrix} g\_1(x) \\ g\_2(x) \\ g\_3(x) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} g\_1(x) \\ g\_2(x) \\ g\_3(x) \end{bmatrix}
$$

since ∂g<sub>1</sub>(x)/∂t = ẋ<sub>1</sub> = x<sub>1</sub>, ∂g<sub>2</sub>(x)/∂t = ẋ<sub>2</sub> = x<sub>2</sub> − x<sub>1</sub><sup>4</sup> = g<sub>2</sub>(x) − g<sub>3</sub>(x), and ∂g<sub>3</sub>(x)/∂t = 4x<sub>1</sub><sup>3</sup> ẋ<sub>1</sub> = 4x<sub>1</sub><sup>4</sup> = 4g<sub>3</sub>(x).

The exact linearization using a finite number of observables demonstrated by the example above is unfortunately only possible for a small number of special systems. In practice one therefore usually aims to instead determine a linear system (3) that approximates the dynamic behavior of the nonlinear system (2) well enough. Given observables gi(x), the system matrix A resulting in the best approximation can be determined by applying extended dynamic mode decomposition [41] to traces of the original system. Since those traces can also be generated by simulating black-box systems or by measuring the real system behavior, we do not necessarily require a model (2) of the original system. This is one of the biggest advantages of the Koopman framework making it well suited for data-driven approaches. The approach we present in this work verifies Koopman linearized systems using reachability analysis:
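As an illustration of this fitting step, the following sketch (illustrative Python/NumPy, not part of the paper's toolchain) applies extended dynamic mode decomposition to snapshot data of the running example (1), lifted with the exact observables g(x) = [x<sub>1</sub>, x<sub>2</sub>, x<sub>1</sub><sup>4</sup>] from above; a least-squares fit recovers the lifted one-step matrix.

```python
import numpy as np

def g(x):
    # exact observables for the running example: [x1, x2, x1^4]
    return np.array([x[0], x[1], x[0]**4])

def flow(x, dt):
    # closed-form solution of x1' = x1, x2' = x2 - x1^4 over time dt
    a = x[0]**4
    return np.array([x[0] * np.exp(dt),
                     (x[1] + a / 3) * np.exp(dt) - (a / 3) * np.exp(4 * dt)])

rng = np.random.default_rng(0)
dt = 0.1
X = rng.uniform(-1, 1, size=(200, 2))            # sampled initial states
GX = np.array([g(x) for x in X]).T               # lifted snapshots, 3 x 200
GY = np.array([g(flow(x, dt)) for x in X]).T     # lifted successors

# extended dynamic mode decomposition: least-squares fit of GY ~= K GX
K = GY @ np.linalg.pinv(GX)
```

Because the observables are exact for this system, the fitted K coincides (up to numerics) with the matrix exponential e<sup>AΔt</sup> of the 3 × 3 system matrix shown above; with inexact observables, the fit is only the best linear approximation on the sampled data.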

**Definition 1.** *(Reachable set) Given an initial set* $\mathcal{X}_0 \subset \mathbb{R}^n$*, the reachable set for a Koopman linearized system is*

$$\mathcal{R}(t) := \left\{ \xi(t, g(x\_0)) \: \mid \: x\_0 \in \mathcal{X}\_0 \right\},$$

*where* $\xi(t, g(x_0))$ *is the solution to* (3) *at time* $t \in \mathbb{R}_{\geq 0}$ *for the initial state* $g(x_0)$*.*

Consequently, to compute the reachable set for a Koopman linearized system one first needs to propagate the initial set through the nonlinear transformation defined by the observables, followed by the calculation of the reachable set for the linear system in (3) using a reachability algorithm. This procedure is visualized in Fig. 1. Definition 1 defines the reachable set for the observables $g_i(x)$. However, since safety specifications are typically defined on the original system state $x$ rather than on $g(x)$, we usually require the reachable set $\mathcal{R}_x(t)$ for the original state for verification. This issue can easily be resolved by using the original system state $x$ for the first $n$ observables, $g_i(x) = x_{(i)}$, $i = 1, \ldots, n$, in which case $\mathcal{R}_x(t)$ can be obtained via projection: $\mathcal{R}_x(t) = [I_n\ \mathbf{0}]\, \mathcal{R}(t)$.

**Fig. 1.** Schematic visualization of reachability analysis for Koopman linearized systems: We first transform the initial set to the higher-dimensional observable space using g(x), then compute the reachable set of the linear system using the matrix exponential eAΔt with time-step size Δt, and finally obtain the reachable set in the original state space via projection.
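For a single initial state of the running example, the lift-propagate-project procedure from Fig. 1 can be sketched as follows. The truncated-series matrix exponential and the hand-derived closed-form solution of the nonlinear system are illustration aids, not part of the paper's implementation:

```python
import math

A = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, 0.0, 4.0]]   # from the example

def expm(M, terms=40):
    """Matrix exponential via truncated Taylor series (adequate for small M)."""
    n = len(M)
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    T = [row[:] for row in E]
    for k in range(1, terms):
        T = [[sum(T[i][l] * M[l][j] for l in range(n)) / k for j in range(n)]
             for i in range(n)]
        E = [[E[i][j] + T[i][j] for j in range(n)] for i in range(n)]
    return E

def reach_point(x0, t):
    """[I_n 0] e^{A t} g(x0): lift, propagate linearly, project back."""
    z = [x0[0], x0[1], x0[0] ** 4]                       # observables g(x0)
    Et = expm([[a * t for a in row] for row in A])
    zt = [sum(Et[i][j] * z[j] for j in range(3)) for i in range(3)]
    return zt[:2]                                        # projection [I_2 0]

# Closed-form solution of the nonlinear example for comparison (hand-derived):
# x1(t) = x1(0) e^t,  x2(t) = (x2(0) + c/3) e^t - (c/3) e^{4t},  c = x1(0)^4
x0, t = (0.5, 1.0), 0.5
c = x0[0] ** 4
exact = [x0[0] * math.exp(t),
         (x0[1] + c / 3) * math.exp(t) - (c / 3) * math.exp(4 * t)]
approx = reach_point(x0, t)
print(max(abs(a - e) for a, e in zip(approx, exact)))
```

Since the linearization of the example is exact, the projected Koopman solution matches the nonlinear solution up to the truncation error of the series.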

#### **2.2 Taylor Model Arithmetic**

Taylor model arithmetic [25] can be utilized to compute tight non-convex enclosures for the image through a nonlinear function. It is based on a set representation called Taylor models:

**Definition 2.** *(Taylor model) Given a polynomial function* $p: \mathbb{R}^s \to \mathbb{R}^n$*, an interval domain* $\mathcal{D} \subset \mathbb{R}^s$*, and an interval remainder* $\mathcal{Y} \subset \mathbb{R}^n$*, a Taylor model* $\mathcal{T}(x)$ *is defined as*

$$\forall x \in \mathcal{D}: \quad \mathcal{T}(x) := \{ p(x) + y \mid y \in \mathcal{Y} \}.$$

*The Taylor order* $\kappa \in \mathbb{N}$ *defines an upper bound for the polynomial degree of the polynomial* $p(x)$*. The set defined by a Taylor model is*

$$\left\{ \mathcal{T}(x) \mid x \in \mathcal{D} \right\} = \left\{ p(x) + y \mid x \in \mathcal{D}, \ y \in \mathcal{Y} \right\}.$$

*For a concise notation we use the shorthand* $\mathcal{T}(x) = \langle p(x), \mathcal{Y}, \mathcal{D} \rangle_T$*.*

The general concept of Taylor model arithmetic is to define rules on how to perform the arithmetic operations $+$, $-$, $\cdot$, and $/$ as well as elementary functions such as $\sin(x)$ or $\sqrt{x}$ on Taylor models [25, Sec. 2]. Since every nonlinear function represents a composition of arithmetic operations and elementary functions, the image through the function can then be computed by successively evaluating those rules. Given two one-dimensional Taylor models $\mathcal{T}_1(x) = \langle p_1(x), \mathcal{Y}_1, \mathcal{D} \rangle_T$ and $\mathcal{T}_2(x) = \langle p_2(x), \mathcal{Y}_2, \mathcal{D} \rangle_T$, the rules for addition and multiplication are for example given as

$$\begin{aligned} \mathcal{T}_1(x) + \mathcal{T}_2(x) &:= \left\langle p_1(x) + p_2(x),\, \mathcal{Y}_1 + \mathcal{Y}_2,\, \mathcal{D} \right\rangle_T \\ \mathcal{T}_1(x) \cdot \mathcal{T}_2(x) &:= \left\langle p_1(x) \cdot p_2(x),\, \mathcal{Y}_1 \cdot \mathcal{Y}_2 + \mathcal{I}_1 \cdot \mathcal{Y}_2 + \mathcal{Y}_1 \cdot \mathcal{I}_2,\, \mathcal{D} \right\rangle_T, \end{aligned}$$

where $\mathcal{I}_1 = \{ p_1(x) \mid x \in \mathcal{D} \}$ and $\mathcal{I}_2 = \{ p_2(x) \mid x \in \mathcal{D} \}$. The rules for elementary functions are obtained using a finite Taylor series expansion, where the order of the Taylor series is equal to the Taylor order $\kappa$. For $\sin(x)$ we for example obtain with $\kappa = 2$ the rule

$$\sin\left(T\_1(x)\right) := \left\langle \sin(c) + \cos(c) \left(p\_1(x) - c\right) - 0.5\sin(c) \left(p\_1(x) - c\right)^2, \mathcal{Y}, \mathcal{D} \right\rangle\_T,$$

where the expansion point $c$ is chosen as $c = p_1(c_d)$ with $c_d$ being the center of the domain $\mathcal{D}$, and the interval $\mathcal{Y}$ computed according to [25, Sec. 2] encloses the remainder of the Taylor series. Due to the finite Taylor series approximation, Taylor model arithmetic yields a tight enclosure rather than the exact image. The accuracy of the enclosure can be improved by choosing a larger Taylor order.
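The addition and multiplication rules can be sketched for one-dimensional Taylor models as follows, with polynomials stored as coefficient lists and intervals as pairs; the bounds $\mathcal{I}_1, \mathcal{I}_2$ are obtained by interval-arithmetic evaluation, and, as a simplification, no truncation to the Taylor order $\kappa$ is performed:

```python
# One-dimensional Taylor models: (coeffs, remainder) over a domain D, where
# coeffs[k] is the coefficient of x^k and remainder is an interval (lo, hi).

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    ps = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(ps), max(ps))

def poly_bound(p, D):
    """Interval enclosure of a polynomial over D via Horner's scheme."""
    b = (0.0, 0.0)
    for c in reversed(p):
        b = iadd(imul(b, D), (c, c))
    return b

def tm_add(t1, t2):
    p1, y1 = t1; p2, y2 = t2
    n = max(len(p1), len(p2))
    p = [(p1[k] if k < len(p1) else 0.0) + (p2[k] if k < len(p2) else 0.0)
         for k in range(n)]
    return (p, iadd(y1, y2))

def tm_mul(t1, t2, D):
    """Multiplication rule: remainder Y1*Y2 + I1*Y2 + Y1*I2 (no truncation)."""
    p1, y1 = t1; p2, y2 = t2
    p = [0.0] * (len(p1) + len(p2) - 1)
    for i, a in enumerate(p1):
        for j, b in enumerate(p2):
            p[i + j] += a * b
    i1, i2 = poly_bound(p1, D), poly_bound(p2, D)
    y = iadd(imul(y1, y2), iadd(imul(i1, y2), imul(y1, i2)))
    return (p, y)

# T1 = x + [-0.01, 0.01] and T2 = 1 + x^2 on D = [-1, 1]
D = (-1.0, 1.0)
psum, Ysum = tm_add(([0.0, 1.0], (-0.01, 0.01)), ([1.0, 0.0, 1.0], (0.0, 0.0)))
p, Y = tm_mul(([0.0, 1.0], (-0.01, 0.01)), ([1.0, 0.0, 1.0], (0.0, 0.0)), D)

# Enclosure check by sampling: (x + y1)(1 + x^2) must lie in p(x) + Y.
for i in range(21):
    x = -1.0 + 0.1 * i
    for y1 in (-0.01, 0.01):
        val = (x + y1) * (1.0 + x * x)
        px = sum(ck * x ** k for k, ck in enumerate(p))
        assert px + Y[0] - 1e-12 <= val <= px + Y[1] + 1e-12
print(p, Y)
```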

For our verification approach we apply Taylor model arithmetic to compute the image of the initial set through the observable function. The initial set $\mathcal{X}_0 = [-2, 2] \times [0, 4]$ for the exemplary system in (1) can be represented by the Taylor model $\mathcal{T}(x) = \langle x, \mathbf{0}, \mathcal{X}_0 \rangle_T$. Applying Taylor model arithmetic to the observable function $g(x)$ defined by the observables $g_1(x) = x_1$, $g_2(x) = x_2$, and $g_3(x) = x_1^4$ then yields the Taylor model

$$\left\{ g(x) \mid x \in \mathcal{X}_0 \right\} \subseteq \left\langle \begin{bmatrix} x_1 \\ x_2 \\ x_1^4 \end{bmatrix}, \mathbf{0}, [-2, 2] \times [0, 4] \right\rangle_T,\tag{4}$$

which represents the exact image in this case since the observables contain polynomial functions only.

#### **2.3 Set Representations**

In this work we use polynomial zonotopes to represent reachable sets, polytopes to represent unsafe sets, and zonotopes for efficient collision checking. Let us first introduce polytopes, for which we consider the halfspace representation:

**Definition 3.** *(Polytope) Given a matrix* $H \in \mathbb{R}^{s \times n}$ *and a vector* $d \in \mathbb{R}^s$*, the halfspace representation of a polytope* $\mathcal{P} \subset \mathbb{R}^n$ *is defined as*

$$\mathcal{P} \coloneqq \{ x \in \mathbb{R}^n \mid H \, x \leqslant d \}.$$

*We use the shorthand* $\mathcal{P} = \langle H, d \rangle_P$*.*

A halfspace $\mathcal{H} \subset \mathbb{R}^n$ is a special case of a polytope consisting of a single inequality constraint $h^T x \leq d$ with $h \in \mathbb{R}^n$, $d \in \mathbb{R}$. We use the shorthand $\mathcal{H} = \langle h, d \rangle_H$. Another special type of polytope is the zonotope, which can be stored efficiently using so-called generators:

**Definition 4.** *(Zonotope) Given a center vector* $c \in \mathbb{R}^n$ *and a generator matrix* $G \in \mathbb{R}^{n \times p}$*, a zonotope* $\mathcal{Z} \subset \mathbb{R}^n$ *is defined as*

$$\mathcal{Z} := \left\{ c + \sum\_{i=1}^p \alpha\_i \, G\_{(\cdot, i)} \; \middle| \; \alpha\_i \in [-1, 1] \right\},$$

*where the scalars* $\alpha_i$ *are called factors. We use the shorthand* $\mathcal{Z} = \langle c, G \rangle_Z$*.*

Polynomial zonotopes are a non-convex set representation originally introduced for reachability analysis of nonlinear systems [1]. We use the sparse representation of polynomial zonotopes [20]<sup>1</sup>:

**Definition 5.** *(Polynomial zonotope) Given a constant offset* $c \in \mathbb{R}^n$*, a generator matrix of dependent generators* $G \in \mathbb{R}^{n \times h}$*, a generator matrix of independent generators* $G_I \in \mathbb{R}^{n \times q}$*, and an exponent matrix* $E \in \mathbb{N}_0^{p \times h}$*, a polynomial zonotope* $\mathcal{PZ} \subset \mathbb{R}^n$ *is defined as*

$$\mathcal{P}\mathcal{Z} := \left\{ c + \sum\_{i=1}^{h} \left( \prod\_{k=1}^{p} \alpha\_k^{E\_{(k,i)}} \right) G\_{(\cdot,i)} + \sum\_{j=1}^{q} \beta\_j G\_{I(\cdot,j)} \; \middle| \; \alpha\_k, \beta\_j \in [-1, 1] \right\}.$$

*The scalars* $\alpha_k$ *are called dependent factors since a change in their value affects multiplication with multiple generators. Consequently, the scalars* $\beta_j$ *are called independent factors because they only affect multiplication with one generator. We use the shorthand* $\mathcal{PZ} = \langle c, G, G_I, E \rangle_{PZ}$*.*

Using polynomial zonotopes for verification has two main advantages: the conversion from a Taylor model to a polynomial zonotope is exact, and polynomial zonotopes preserve the dependencies between the factors, which enables both efficient splitting and the extraction of critical initial states.


For verification we therefore convert the Taylor model representing the image of the initial set through the observable function to a polynomial zonotope, for which collision checks with the unsafe sets can be efficiently realized using zonotope enclosures that are iteratively refined by splitting the polynomial zonotope.

The conversion of the Taylor model in (4) corresponding to our running example in (1) yields the following polynomial zonotope

$$\begin{aligned} \left\langle \begin{bmatrix} x_1\\ x_2\\ x_1^4 \end{bmatrix}, \mathbf{0}, [-2, 2] \times [0, 4] \right\rangle_T &= \left\langle \begin{bmatrix} 0\\ 2\\ 0 \end{bmatrix}, \begin{bmatrix} 2 & 0 & 0\\ 0 & 2 & 0\\ 0 & 0 & 16 \end{bmatrix}, [\, ], \begin{bmatrix} 1 & 0 & 4\\ 0 & 1 & 0 \end{bmatrix} \right\rangle_{PZ} \\ &= \left\{ \begin{bmatrix} 0\\ 2\\ 0 \end{bmatrix} + \begin{bmatrix} 2\\ 0\\ 0 \end{bmatrix} \alpha_1 + \begin{bmatrix} 0\\ 2\\ 0 \end{bmatrix} \alpha_2 + \begin{bmatrix} 0\\ 0\\ 16 \end{bmatrix} \alpha_1^4 \, \middle| \, \alpha_1, \alpha_2 \in [-1, 1] \right\}, \end{aligned}$$

where the high-level idea of the conversion is to represent the interval domain $\mathcal{D}$ with dependent zonotope factors $\alpha_i \in [-1, 1]$.
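The conversion can be checked numerically: evaluating the polynomial zonotope above at dependent factors $(\alpha_1, \alpha_2)$ must reproduce $g(x)$ for the corresponding original state $x = (2\alpha_1,\, 2 + 2\alpha_2)$. A minimal sketch:

```python
import random

# Polynomial zonotope from the conversion above (no independent generators):
c = [0.0, 2.0, 0.0]
G = [[2.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 16.0]]      # G[d][i]: dimension d, dependent generator i
E = [[1, 0, 4],
     [0, 1, 0]]             # E[k][i]: exponent of alpha_k in generator i

def pz_point(alpha):
    """Evaluate Definition 5 for fixed dependent factors alpha."""
    pt = c[:]
    for i in range(3):
        w = 1.0
        for k in range(2):
            w *= alpha[k] ** E[k][i]
        for d in range(3):
            pt[d] += w * G[d][i]
    return pt

random.seed(1)
for _ in range(100):
    a = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    x1, x2 = 2.0 * a[0], 2.0 + 2.0 * a[1]     # original state covered by X0
    gx = [x1, x2, x1 ** 4]                    # observable function g(x)
    assert max(abs(u - v) for u, v in zip(pz_point(a), gx)) < 1e-9
print("conversion reproduces g(x) on sampled factors")
```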

#### **3 Linearization via Fourier Features**

We now present the automated generation of observables using random Fourier features [10]. Let us first motivate why Fourier features are a good choice for

<sup>1</sup> In contrast to [20, Def. 1], we explicitly do not integrate the constant offset c in G. Moreover, we omit the identifier vector used in the original work [20] for simplicity.

observables. For Koopman linearization, the observables $g(x)$ define a transformation to a high-dimensional space. One commonly used approach to handle such high-dimensional spaces efficiently is the *kernel trick*: in many algorithms the data points $x, y \in \mathbb{R}^n$ only appear in the form of inner products $g(x)^T g(y)$. In this case it suffices to define a kernel function $k(x, y)$ that represents the similarity measure $g(x)^T g(y)$ between data points in the high-dimensional feature space, rather than explicitly defining a transformation $g(x)$ to this space. Kernel functions can also represent more general features that are not vectors, and even infinite-dimensional features, which motivates their application in the Koopman framework. The kernel trick is mainly applied in machine learning [36], for tasks such as regression [38], clustering [18], and classification [39]. However, the extended dynamic mode decomposition algorithm [41] can also be formulated in terms of inner products [42], so that the kernel trick can be applied for Koopman linearization. Rather than explicitly choosing observables $g(x)$ we can therefore select a kernel function instead, which implicitly defines the observable function $g(x)$ through the kernel's relation to an inner product space. Commonly used kernels are radial basis function kernels, polynomial kernels, and spline kernels.

The kernel trick cannot be applied directly to our reachability technique since we require an explicit formulation of the observables $g(x)$. We therefore first select a kernel function $k(x, y)$, and then determine observables $g(x)$ that yield a good approximation of the kernel function, $k(x, y) \approx g(x)^T g(y)$. Random Fourier features are a common technique to approximate kernel functions [10,29]. They are based on Bochner's theorem [33, Sec. 1.4.3], which links a weakly stationary kernel function to a Fourier transform:

$$k(x,y) = \int\_{\mathbb{R}^n} e^{j\,\omega^T(x-y)} \, d\mu(\omega) = \mathbb{E}\_{\omega} \left( e^{j\,\omega^T x} \, \overline{e^{j\,\omega^T y}} \right),\tag{5}$$

where the function $\mu: \mathbb{R}^n \to [0, 1]$ defines a probability distribution, $\mathbb{E}_\omega(\cdot)$ denotes the expected value with respect to $\omega$, $j$ is the imaginary unit, and $\overline{a}$ denotes the complex conjugate of a complex number $a \in \mathbb{C}$. The distribution $\mu(\omega)$ associated with a specific kernel can be obtained by taking the inverse Fourier transform of $k(x, y)$ [29]. We can collect $m$ samples from the distribution $\mu(\omega)$ to approximate the expected value in (5), which finally yields

$$k(x, y) = \mathbb{E}\_{\omega} \left( e^{j\,\omega^T x} \, \overline{e^{j\,\omega^T y}} \right) \approx \frac{1}{m} \sum\_{i=1}^{m} \underbrace{e^{j\,\omega\_i^T x}}\_{g\_i(x)} \underbrace{e^{j\,\omega\_i^T y}}\_{g\_i(y)}.$$

The random Fourier features are the resulting observables $g_i(x)$ that approximate the kernel function. Note that we can omit the constant factor $\frac{1}{m}$ since extended dynamic mode decomposition will automatically scale the observables accordingly. We consider real-valued kernels only, so we use Euler's formula $e^{jx} = \cos(x) + j \sin(x)$ to simplify the random Fourier features to

$$g\_i(x) = \sqrt{2}\cos(\omega\_i^T x + b\_i), \quad i = 1, \ldots, m,\tag{6}$$

where the shift $b_i$ is selected uniformly from the interval $[0, 2\pi]$ and $\omega_i$ is drawn randomly from the probability distribution $\mu(\omega)$ corresponding to the kernel that is used. While this random selection might appear to be a disadvantage at first sight, it is guaranteed that the random Fourier feature approximation converges to the exact kernel function when increasing the number of observables [29]. Moreover, we observed in our numerical experiments that changes in the values for $b_i$ and $\omega_i$ do not significantly influence the accuracy of the resulting linear approximation.

In summary, the random Fourier features presented above represent a systematic method for selecting a finite set of accurate observables that requires only a few hyperparameters. These hyperparameters are the type of kernel, the kernel parameters, and the number of observables. For the numerical experiments in this paper we use a radial basis function kernel

$$k(x,y) = e^{-\frac{\|x-y\|\_2^2}{2\ell^2}},$$

which contains the lengthscale $\ell$ as its only parameter. The probability distribution $\mu(\omega)$ for this kernel is the multivariate normal distribution with covariance matrix $\ell^{-2} \cdot I_n$ centered at the origin [29, Fig. 1].
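The construction in (6) can be sketched in a few lines: draw $\omega_i$ from $\mathcal{N}(0, \ell^{-2} I_n)$ and $b_i$ uniformly from $[0, 2\pi]$, and check that $\frac{1}{m}\, g(x)^T g(y)$ approaches the radial basis function kernel. The dimension, lengthscale, and test points below are arbitrary illustration values:

```python
import math
import random

random.seed(0)
n, m, ell = 2, 20000, 1.5      # dimension, number of features, lengthscale

# omega_i ~ N(0, ell^{-2} I_n), b_i ~ U[0, 2*pi]
omega = [[random.gauss(0.0, 1.0 / ell) for _ in range(n)] for _ in range(m)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(m)]

def g(x):
    """Random Fourier feature observables, Eq. (6)."""
    return [math.sqrt(2.0) * math.cos(sum(w * xi for w, xi in zip(om, x)) + bi)
            for om, bi in zip(omega, b)]

def rbf(x, y):
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-d2 / (2.0 * ell ** 2))

x, y = [0.3, -0.7], [1.1, 0.4]
approx = sum(gx * gy for gx, gy in zip(g(x), g(y))) / m   # (1/m) g(x)^T g(y)
print(abs(approx - rbf(x, y)))   # small Monte-Carlo error, shrinks with m
```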

### **4 Verification Using Reachability Analysis**

We now present our novel verification algorithm for Koopman linearized systems, which is summarized in Algorithm 1. For simplicity we assume that the specification we aim to verify is described by a single unsafe set $\mathcal{U}$, but the extension to multiple unsafe sets is straightforward. We first apply Taylor model arithmetic (see Sect. 2.2) to compute a tight non-convex enclosure of the image of the initial set $\mathcal{X}_0$ through the observable function $g(x)$ in Line 3. Since it simplifies the computation of the zonotope enclosures required later on, we then convert the resulting Taylor model to a polynomial zonotope in Line 4. This polynomial zonotope is used as the initial set for the computation of the reachable set of the Koopman linearized system in Line 5, for which we can use any reachability algorithm for linear systems. For simplicity we assume here that the obtained reachable sets are exact. In the general case where the exact reachable set cannot be computed, one can for example incorporate the error measures from [14] and [40] into the verification algorithm.

The problem we are facing now is that the reachable sets $\mathcal{R}_0, \ldots, \mathcal{R}_{t_F/\Delta t}$ are represented by polynomial zonotopes, a set representation for which exact collision checks with the unsafe set $\mathcal{U}$ are computationally demanding. We resolve this issue by applying a novel polynomial zonotope refinement procedure in lines 6–19, where we recursively split the polynomial zonotopes until we can either verify or falsify the specification using zonotope enclosures of the split sets. In particular, we first enclose each polynomial zonotope in the queue $\mathbf{L}$ with a zonotope in Line 9. For a zonotope $\mathcal{Z} = \langle c, G \rangle_Z$, collision checks with an unsafe set as performed in Line 10 are very efficient: if the unsafe set is a halfspace $\mathcal{U} = \langle h, d \rangle_H$, we have according to [15, Sec. 5.1]

$$(\mathcal{Z} \cap \mathcal{U} \neq \emptyset) \Leftrightarrow \left( h^T c - \sum_{i=1}^p |h^T G_{(\cdot,i)}| \leqslant d \right) \tag{7}$$
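The check in (7) compares the minimum of $h^T x$ over the zonotope with $d$; the sketch below also extracts the factors $\alpha = -\operatorname{sign}(h^T G)$ that attain this minimum, which are used later for falsification. Generators are stored as columns; the example zonotope and halfspace are illustration values:

```python
def intersects_halfspace(c, G, h, d):
    """Eq. (7): zonotope <c, G> intersects {x | h^T x <= d} iff
    h^T c - sum_i |h^T G_(.,i)| <= d (generators stored as columns)."""
    hG = [sum(hk * gk for hk, gk in zip(h, col)) for col in G]
    hc = sum(hk * ck for hk, ck in zip(h, c))
    return hc - sum(abs(v) for v in hG) <= d

def most_critical_factors(G, h):
    """Factors alpha = -sign(h^T G) attaining the minimum of h^T x over Z."""
    hG = [sum(hk * gk for hk, gk in zip(h, col)) for col in G]
    return [-1.0 if v > 0 else 1.0 for v in hG]

# Zonotope <c, G> = [-2, 2] x [0, 4] against the unsafe halfspace x2 >= 3,
# written as h^T x <= d with h = (0, -1), d = -3:
c, G = [0.0, 2.0], [[2.0, 0.0], [0.0, 2.0]]
print(intersects_halfspace(c, G, [0.0, -1.0], -3.0))   # True: max x2 is 4 >= 3
print(intersects_halfspace(c, G, [0.0, -1.0], -5.0))   # False: max x2 is 4 < 5
alpha = most_critical_factors(G, [0.0, -1.0])
x = [ci + sum(a * col[k] for a, col in zip(alpha, G)) for k, ci in enumerate(c)]
print(x)   # [2.0, 4.0], the point of Z deepest inside the halfspace
```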

#### **Algorithm 1.** Verification of Koopman linearized systems

**Require:** Koopman linearized system $\dot{g}(x) = A\, g(x)$, initial set $\mathcal{X}_0$, final time $t_F$, specification given as an unsafe set $\mathcal{U}$, time step size $\Delta t$, initial Taylor order $\kappa_0$.
**Ensure:** System is safe (res = ⊤) or unsafe (res = ⊥).
1: res ← ⊥, $\kappa \leftarrow \kappa_0$ (initialization)
2: **repeat**
3: $\mathcal{T}(x) \leftarrow \{ g(x) \mid x \in \mathcal{X}_0 \}$ (computed using Taylor model arithmetic with order $\kappa$)
4: $\mathcal{PZ} \leftarrow \mathcal{T}(x)$ (convert Taylor model to polynomial zonotope, see [20, Prop. 4])
5: $\mathcal{R}_0, \ldots, \mathcal{R}_{t_F/\Delta t} \leftarrow$ reachability analysis of $\dot{g}(x) = A\, g(x)$ for initial set $\mathcal{PZ}$
6: $\mathbf{L} \leftarrow (\mathcal{R}_0, \ldots, \mathcal{R}_{t_F/\Delta t})$ (initialize queue of not yet verified sets)
7: **repeat**
8: $\mathcal{PZ} \leftarrow \mathbf{L}_{(1)}$, $\mathbf{L} \leftarrow (\mathbf{L}_{(2)}, \ldots, \mathbf{L}_{(|\mathbf{L}|)})$ (pop first element from queue)
9: $\mathcal{Z} \leftarrow$ zonotope enclosure of $\mathcal{PZ}$ (see [20, Prop. 5])
10: **if** $\mathcal{Z} \cap \mathcal{U} \neq \emptyset$ **then** (check if specification is satisfied, see (7) and (8))
11: $x_0, t \leftarrow$ most critical initial state and corresponding time
12: **if** $[I_n\ \mathbf{0}]\, e^{A t} g(x_0) \in \mathcal{U}$ **then**
13: **return** (specification falsified ⇒ system is unsafe)
14: **else**
15: $\mathcal{PZ}_1, \mathcal{PZ}_2 \leftarrow$ split $\mathcal{PZ}$ (see Prop. 1 and (11))
16: $\mathbf{L} \leftarrow (\mathbf{L}, \mathcal{PZ}_1, \mathcal{PZ}_2)$ (add new sets to queue)
17: **end if**
18: **end if**
19: **until** $\mathbf{L} = (\,)$ or splitting does not yield any further improvement
20: $\kappa \leftarrow \kappa + 1$ (increase Taylor order)
21: **until** $\mathbf{L} = (\,)$ (queue empty ⇒ no intersection with $\mathcal{U}$)
22: res ← ⊤ (if this line is reached no reachable set intersects $\mathcal{U}$ ⇒ system is safe)

For general polytopes $\mathcal{U} = \langle H, d \rangle_P$, collision checks can be realized using linear programming:

$$(\mathcal{Z} \cap \mathcal{U} \neq \emptyset) \Leftrightarrow (\delta = 0), \tag{8}$$

where

$$\delta = \min_{\alpha, x} \| c + G\alpha - x \|_1 \quad \text{s.t.} \quad \alpha \in [-\mathbf{1}, \mathbf{1}],\ Hx \leqslant d. \tag{9}$$

If the specification cannot be verified, we next try to falsify it in lines 11–13 by extracting from $\mathcal{Z}$ the initial point $x_0$ that is expected to violate the specification the most. For a halfspace $\mathcal{U} = \langle h, d \rangle_H$ the vector of zonotope factors $\alpha = [\alpha_1\ \ldots\ \alpha_p]^T$ resulting in the largest violation is given as $\alpha = -\operatorname{sign}(h^T G)$, where the signum function is interpreted elementwise. Since the factors $\alpha$ of the zonotope enclosure are related to the dependent factors of the original polynomial zonotope, and since polynomial zonotopes preserve dependencies during reachability analysis [21], we can then directly extract the initial point $x_0$ corresponding to $\alpha$ from the polynomial zonotope. For general polytopes we can use the optimal $\alpha$ from the linear program in (9) to estimate the most critical initial point. If we can neither verify nor falsify the specification, we have a so-called spurious counterexample that arises due to the over-approximation introduced by the zonotope enclosure. We therefore split the polynomial zonotope in

**Fig. 2.** Reachable set for the Roessler system (see Sect. 5.1) at time $t = 2.95$, where polynomial zonotopes are depicted by solid lines, the corresponding zonotope enclosures are depicted by dashed lines, and the unsafe set is shown in orange. While the zonotope enclosure of the original polynomial zonotope is too conservative to verify the specification (left), splitting the polynomial zonotope once reduces the over-approximation enough for verification to succeed (right).

this case in Line 15 since splitting reduces the over-approximation in the zonotope enclosure (see Fig. 2). The split sets are then added to the queue in Line 16, where we use a first-in, first-out scheme for the queue to detect easy falsifications fast before excessively splitting the sets.

One remaining issue we are facing is that Taylor model arithmetic is not exact. Due to the over-approximation in the initial set it can therefore happen that we can neither verify nor falsify the specification by splitting the polynomial zonotope. To solve this issue we embed our whole algorithm into a repeat-until loop that iteratively increases the order $\kappa$ used for Taylor model arithmetic (see Line 20). Since Taylor model arithmetic converges to the exact result as the order goes to infinity, we obtain a complete algorithm that is guaranteed to terminate. In practice we can often prevent computationally expensive iterations of the outer loop by choosing the initial order $\kappa_0$ large enough. It remains to decide when to stop splitting the polynomial zonotopes and increase the Taylor order instead (see Line 19). The simplest method is to just use an upper bound for the number of recursive splits that are performed. A more sophisticated approach is to abort splitting if the distance between the most critical point $[I_n\ \mathbf{0}]\, e^{A t} g(x_0)$ and the unsafe set $\mathcal{U}$ is smaller than the over-approximation in the polynomial zonotope $\mathcal{PZ}$, which is given by the independent generators.
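The interplay of enclosure, falsification attempt, and splitting can be illustrated on a one-dimensional toy problem, with intervals in place of polynomial zonotopes and a naive interval evaluation in place of the zonotope enclosure; this sketches the refinement idea only, not Algorithm 1 itself:

```python
def f(x):
    return x * x - x

def f_enclosure(lo, hi):
    """Naive interval evaluation of f(x) = x*x - x; over-approximates because
    the two occurrences of x are treated independently, like a coarse
    enclosure of a nonlinear image."""
    ps = (lo * lo, lo * hi, hi * hi)
    return (min(ps) - hi, max(ps) - lo)

def verify(lo0, hi0, unsafe, max_splits=20):
    """Refinement loop: enclose, try to falsify at candidate points, split.
    The unsafe set is {y | y >= unsafe}."""
    queue, splits = [(lo0, hi0)], 0
    while queue:
        lo, hi = queue.pop(0)                    # FIFO queue, as in the text
        if f_enclosure(lo, hi)[1] < unsafe:
            continue                             # enclosure proves piece safe
        if f(lo) >= unsafe or f(hi) >= unsafe:
            return "unsafe"                      # concrete counterexample
        if splits >= max_splits:
            return "unknown"                     # spurious, give up splitting
        mid, splits = 0.5 * (lo + hi), splits + 1
        queue += [(lo, mid), (mid, hi)]
    return "safe"

# On [0, 2] the true maximum of f is 2, but the unsplit enclosure reaches 4:
print(f_enclosure(0.0, 2.0))       # (-2.0, 4.0): spurious vs. {y >= 2.6}
print(verify(0.0, 2.0, 2.6))       # "safe" after two splits
print(verify(0.0, 2.0, 1.9))       # "unsafe": f(2) = 2 violates y >= 1.9
```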

Finally, we provide a closed-form expression for splitting a polynomial zonotope since this operation is not specified in the original work [20]:

**Proposition 1.** *(Split) Given a polynomial zonotope* $\mathcal{PZ} = \langle c, G, G_I, E \rangle_{PZ} \subset \mathbb{R}^n$ *and the index* $r \in \{1, \ldots, p\}$ *of one dependent factor, the operation* $\mathrm{split}(\mathcal{PZ}, r)$ *returns two polynomial zonotopes* $\mathcal{PZ}_1$*,* $\mathcal{PZ}_2$ *satisfying* $\mathcal{PZ}_1 \cup \mathcal{PZ}_2 = \mathcal{PZ}$*:*

502 S. Bak et al.

$$\mathcal{P}\mathcal{Z}\_1 = \left\langle c, \left[\hat{G}\_1^{(1)} \; \ldots \; \hat{G}\_h^{(1)}\right], G\_I, \left[\hat{E}\_1 \; \ldots \; \hat{E}\_h\right] \right\rangle\_{PZ}$$

$$\mathcal{P}\mathcal{Z}\_2 = \left\langle c, \left[\hat{G}\_1^{(2)} \; \ldots \; \hat{G}\_h^{(2)}\right], G\_I, \left[\hat{E}\_1 \; \ldots \; \hat{E}\_h\right] \right\rangle\_{PZ}$$

*with*

$$
\begin{aligned}
\hat{E}_i &= \begin{bmatrix}
E_{(\{1,\ldots,r-1\},i)} & E_{(\{1,\ldots,r-1\},i)} & \ldots & E_{(\{1,\ldots,r-1\},i)} & E_{(\{1,\ldots,r-1\},i)} \\
0 & 1 & \ldots & E_{(r,i)} - 1 & E_{(r,i)} \\
E_{(\{r+1,\ldots,p\},i)} & E_{(\{r+1,\ldots,p\},i)} & \ldots & E_{(\{r+1,\ldots,p\},i)} & E_{(\{r+1,\ldots,p\},i)}
\end{bmatrix}, \\
\hat{G}_i^{(k)} &= \begin{bmatrix} b_{i,0}^{(k)} \cdot G_{(\cdot,i)} & \ldots & b_{i,E_{(r,i)}}^{(k)} \cdot G_{(\cdot,i)} \end{bmatrix}, \\
b_{i,j}^{(1)} &= 0.5^{E_{(r,i)}} \binom{E_{(r,i)}}{j}, \qquad
b_{i,j}^{(2)} = -0.5^{E_{(r,i)}} \left( 2\,(E_{(r,i)} \bmod 2) - 1 \right) \binom{E_{(r,i)}}{j},
\end{aligned}
$$

*where* $x \bmod y$*,* $x, y \in \mathbb{N}_0$*, is the modulo operation and* $\binom{w}{z}$*,* $w, z \in \mathbb{N}_0$*, denotes the binomial coefficient. To remove redundancies we subsequently apply the* compact *operation as defined in [20, Prop. 2] to* $\mathcal{PZ}_1$ *and* $\mathcal{PZ}_2$*.*

*Proof.* The split operation is based on the substitution of the selected dependent factor α<sup>r</sup> with two new dependent factors αr,<sup>1</sup> and αr,2:

$$\{ \alpha_r \mid \alpha_r \in [-1, 1] \} = \{ 0.5\,(1 + \alpha_{r,1}) \mid \alpha_{r,1} \in [-1, 1] \} \cup \{ -0.5\,(1 + \alpha_{r,2}) \mid \alpha_{r,2} \in [-1, 1] \}. \tag{10}$$

Inserting this substitution into the definition of polynomial zonotopes in Definition 5 yields

$$
\begin{aligned}
\mathcal{PZ} &= \left\{ c + \sum_{i=1}^{h} \left( \prod_{k=1}^{p} \alpha_k^{E_{(k,i)}} \right) G_{(\cdot,i)} + \sum_{j=1}^{q} \beta_j G_{I(\cdot,j)} \;\middle|\; \alpha_k, \beta_j \in [-1, 1] \right\} \stackrel{(10)}{=} \\
&\underbrace{\left\{ c + \sum_{i=1}^{h} \Bigg( \prod_{\substack{k=1 \\ k \neq r}}^{p} \alpha_k^{E_{(k,i)}} \Bigg) \left( \frac{1+\alpha_{r,1}}{2} \right)^{E_{(r,i)}} G_{(\cdot,i)} + \sum_{j=1}^{q} \beta_j G_{I(\cdot,j)} \;\middle|\; \alpha_k, \beta_j, \alpha_{r,1} \in [-1, 1] \right\}}_{\mathcal{PZ}_1} \\
&\cup \underbrace{\left\{ c + \sum_{i=1}^{h} \Bigg( \prod_{\substack{k=1 \\ k \neq r}}^{p} \alpha_k^{E_{(k,i)}} \Bigg) \left( \frac{1+\alpha_{r,2}}{-2} \right)^{E_{(r,i)}} G_{(\cdot,i)} + \sum_{j=1}^{q} \beta_j G_{I(\cdot,j)} \;\middle|\; \alpha_k, \beta_j, \alpha_{r,2} \in [-1, 1] \right\}}_{\mathcal{PZ}_2}.
\end{aligned}
$$

Finally, with

$$\begin{aligned} \left(\frac{1+\alpha\_{r,1}}{2}\right)^{E\_{(r,i)}} &= b\_{i,0}^{(1)} + b\_{i,1}^{(1)}\alpha\_{r,1} + b\_{i,2}^{(1)}\alpha\_{r,1}^2 + \dots + b\_{i,E\_{(r,i)}}^{(1)}\alpha\_{r,1}^{E\_{(r,i)}}\\ \left(\frac{1+\alpha\_{r,2}}{-2}\right)^{E\_{(r,i)}} &= b\_{i,0}^{(2)} + b\_{i,1}^{(2)}\alpha\_{r,2} + b\_{i,2}^{(2)}\alpha\_{r,2}^2 + \dots + b\_{i,E\_{(r,i)}}^{(2)}\alpha\_{r,2}^{E\_{(r,i)}} \end{aligned}$$

we obtain the equations above.

The split operation for polynomial zonotopes is not exact, meaning that the resulting sets usually overlap (see Fig. 2). To minimize the size of the overlapping region we split the dependent factor with index $r$ that maximizes the following heuristic:

$$\max\_{r \in \{1, \dots, p\}} \sum\_{i=1 \atop E\_{(r,i)} > 1}^{h} \left(1 - 0.5^{E\_{(r,i)}}\right) \|G\_{(\cdot, i)}\|\_{2},\tag{11}$$

where $G \in \mathbb{R}^{n \times h}$ and $E \in \mathbb{N}_0^{p \times h}$ are the generator and exponent matrix of the polynomial zonotope. Moreover, since the goal of splitting in Algorithm 1 is to verify a certain specification, it is advisable to first project the polynomial zonotope onto the halfspace normal directions of the unsafe set $\mathcal{U}$ before evaluating the heuristic (11), in order to direct the splitting process towards directions that are beneficial for verification.
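Proposition 1 can be implemented directly. The sketch below splits the first dependent factor of the running-example polynomial zonotope and checks by sampling that $\mathcal{PZ}_1$ and $\mathcal{PZ}_2$ are exactly the reparametrizations of $\mathcal{PZ}$ given by (10); the compact operation from [20, Prop. 2] is omitted:

```python
from math import comb
import random

# Running-example polynomial zonotope (no independent generators):
c = [0.0, 2.0, 0.0]
G = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 16.0]]  # rows x generators
E = [[1, 0, 4], [0, 1, 0]]                                 # p = 2, h = 3

def evaluate(c, G, E, alpha):
    """Evaluate Definition 5 for fixed dependent factors alpha."""
    n, h, p = len(c), len(E[0]), len(E)
    pt = c[:]
    for i in range(h):
        w = 1.0
        for k in range(p):
            w *= alpha[k] ** E[k][i]
        for d in range(n):
            pt[d] += w * G[d][i]
    return pt

def split(c, G, E, r):
    """Prop. 1: split dependent factor r into PZ1, PZ2 with PZ1 u PZ2 = PZ."""
    n, h, p = len(c), len(E[0]), len(E)
    out = []
    for part in (1, 2):
        Gk, Ek = [[] for _ in range(n)], [[] for _ in range(p)]
        for i in range(h):
            e = E[r][i]
            sgn = 1.0 if part == 1 else -(2 * (e % 2) - 1)   # b^(1) vs b^(2)
            for j in range(e + 1):
                b = sgn * 0.5 ** e * comb(e, j)
                for d in range(n):
                    Gk[d].append(b * G[d][i])
                for kk in range(p):
                    Ek[kk].append(j if kk == r else E[kk][i])
        out.append((c[:], Gk, Ek))
    return out

(c1, G1, E1), (c2, G2, E2) = split(c, G, E, r=0)

# PZ1 must equal PZ with alpha_r = (1 + a)/2, PZ2 with alpha_r = -(1 + a)/2:
random.seed(0)
for _ in range(50):
    a0, a1 = random.uniform(-1, 1), random.uniform(-1, 1)
    q1 = evaluate(c, G, E, [(1 + a0) / 2, a1])
    q2 = evaluate(c, G, E, [-(1 + a0) / 2, a1])
    assert all(abs(u - v) < 1e-9 for u, v in zip(evaluate(c1, G1, E1, [a0, a1]), q1))
    assert all(abs(u - v) < 1e-9 for u, v in zip(evaluate(c2, G2, E2, [a0, a1]), q2))
print("split reproduces the reparametrizations from Eq. (10)")
```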

Note that the polynomial zonotope refinement technique presented in this section is not restricted to the verification of Koopman linearized systems, but can equally be applied to collision checks of polynomial zonotopes or Taylor models with halfspaces and polytopes in general. Moreover, by inverting the inequality constraints, polynomial zonotope refinement can also be applied to check whether a Taylor model or polynomial zonotope is contained in a halfspace or polytope.

### **5 Experimental Results**

We now evaluate the performance of random Fourier feature observables and our novel reachability algorithm on various benchmark systems. For this, we compare our approach with the closest method from the literature [5]. Since the algorithms presented there are implemented in Julia, we also implemented our approach in Julia to obtain a fair comparison of the computation time. In our implementation we use the package TaylorModels.jl<sup>2</sup> for Taylor model arithmetic and the package DataDrivenDiffEq.jl<sup>3</sup> for extended dynamic mode decomposition. All computations are carried out on a 3.2 GHz 8-core AMD Ryzen 7 5800H processor with 16 GB memory. We published our implementation together with a repeatability package that reproduces the results shown in this paper as a CodeOcean compute capsule<sup>4</sup>.

#### **5.1 Benchmarks**

Let us first define all benchmarks that we use for the evaluation. Again, we consider the same systems and specifications as in [5] for a fair comparison:

<sup>2</sup> https://github.com/JuliaIntervals/TaylorModels.jl.

<sup>3</sup> https://datadriven.sciml.ai/.

<sup>4</sup> https://codeocean.com/capsule/8730054/tree/v1.

**Roessler Attractor:** The dynamic equations for the Roessler attractor [32] are

$$\begin{aligned} \dot{x}\_1 &= -x\_2 - x\_3\\ \dot{x}\_2 &= x\_1 + 0.2 \, x\_2\\ \dot{x}\_3 &= 0.2 + x\_3 \, (x\_1 - 5.7), \end{aligned}$$

and we consider the initial set $\mathcal{X}_0 = [-0.05, 0.05] \times [-8.45, -8.35] \times [-0.05, 0.05]$, the final time $t_F = 6$, and the unsafe region $x_2 \geq 6.375 - 0.025 \cdot i$ parameterized by $i \in [0, 20]$.

**Steam Governor:** The dynamic equations for the steam governor [37] are

$$\begin{aligned} \dot{x}\_1 &= x\_2\\ \dot{x}\_2 &= x\_3^2 \sin(x\_1) \cos(x\_1) - \sin(x\_1) - 3x\_2\\ \dot{x}\_3 &= \cos(x\_1) - 1, \end{aligned}$$

and we consider the initial set $\mathcal{X}_0 = [0.95, 1.05] \times [-0.05, 0.05] \times [0.95, 1.05]$, the final time $t_F = 3$, and the unsafe set $x_2 \leq -0.25 + 0.01 \cdot i$ parameterized by $i \in [0, 10]$.

**Coupled Van-der-Pol Oscillator:** The dynamic equations for the coupled Van-der-Pol oscillator [30] are

$$\begin{aligned} \dot{x}\_1 &= x\_2 \\ \dot{x}\_2 &= \left(1 - x\_1^2\right)x\_2 - x\_1 + \left(x\_3 - x\_1\right) \end{aligned} \qquad \begin{aligned} \dot{x}\_3 &= x\_4 \\ \dot{x}\_4 &= \left(1 - x\_3^2\right)x\_4 - x\_3 + \left(x\_1 - x\_3\right), \end{aligned}$$

and we consider the initial set $\mathcal{X}_0 = [-0.025, 0.025] \times [0.475, 0.525] \times [-0.025, 0.025] \times [0.475, 0.525]$, the final time $t_F = 2$, and the unsafe set $x_1 \geq 1.25 - 0.05 \cdot i$ parameterized by $i \in [1, 16]$.

**Biological System:** The dynamic equations for the biological system [19] are

$$\begin{aligned} \dot{x}_1 &= -0.4\, x_1 + 5\, x_3\, x_4 & \dot{x}_5 &= -5\, x_5\, x_6 + 5\, x_3\, x_4 \\ \dot{x}_2 &= 0.4\, x_1 - x_2 & \dot{x}_6 &= 0.5\, x_7 - 5\, x_5\, x_6 \\ \dot{x}_3 &= x_2 - 5\, x_3\, x_4 & \dot{x}_7 &= -0.5\, x_7 + 5\, x_5\, x_6, \\ \dot{x}_4 &= 5\, x_5\, x_6 - 5\, x_3\, x_4 \end{aligned}$$

and we consider the initial set $\mathcal{X}_0 = [0.99, 1.01] \times \cdots \times [0.99, 1.01]$, the final time $t_F = 2$, and the unsafe set $x_4 \leq 0.883 + 0.002 \cdot i$ parameterized by $i \in [0, 20]$.

#### **5.2 Approximation Error**

We first investigate the accuracy of the Koopman linearized system with respect to the original nonlinear dynamics, where we compare our random Fourier feature observables with the ad hoc observables from [5]. These ad hoc observables consist of multi-variate polynomials of the system state x up to a fixed order, trigonometric functions of the time t, and combinations of these

**Fig. 3.** Relative simulation error between Koopman linearized systems and the original nonlinear system in percent.

(*e.g.,* $x_1\, x_2 \sin^2(t) \cos(t)$). To obtain the data traces required for extended dynamic mode decomposition we simulate the original nonlinear systems for 500 points sampled from the corresponding initial set, where a Sobol sequence is used for sampling. For the generation of the random Fourier feature observables according to (6) we use the parameters $\ell = 0.3$ and $m = 71$ for the Roessler attractor, $\ell = 1.62$ and $m = 72$ for the steam governor, $\ell = 1.24$ and $m = 132$ for the coupled Van-der-Pol oscillator, and $\ell = 1.81$ and $m = 105$ for the biological system, where $\ell$ is the lengthscale parameter of the kernel and the number of observables $m$ is chosen identical to the one used for the ad hoc observables [5]. As a measure for the accuracy we use the Euclidean distance between simulated trajectories of the original nonlinear system and the Koopman linearized system. The initial points for these trajectories are the center and the vertices of the initial set. According to Fig. 3, the random Fourier feature observables are more accurate than the ad hoc observables used in earlier work [5] for the steam governor and the Roessler attractor. Moreover, while for the short time horizons considered in Fig. 3 the ad hoc observables seem to be more precise for the coupled Van-der-Pol oscillator and the biological system, over longer time horizons the error of the ad hoc observables explodes. This is visualized in Fig. 4, where the trajectory corresponding to the ad hoc observables progresses in a completely different direction than the original system, while the random Fourier features stay accurate. Random Fourier features are thus not only a more systematic approach for choosing observables, but also improve the precision of the resulting Koopman linearized system.

**Fig. 4.** Comparison of simulations for Koopman linearized systems with the ground truth from the original nonlinear system for a time horizon of t<sub>F</sub> = 10, where the biological system is shown on the left and the coupled Van-der-Pol oscillator is shown on the right.

#### **5.3 Verification Using Reachability Analysis**

We now compare our novel verification algorithm for Koopman linearized systems with the verification strategies presented in [5]. In particular, we compare to verification of the original nonlinear system using Flow\* [9], direct encoding of nonlinear constraints using an SMT solver [5, Sec. 4.1], and zonotope domain splitting [5, Sec. 4.4]. Both approaches from [5] consider discrete-time safety, where the system is considered safe if the specification is satisfied at time points 0, Δt, 2Δt, . . . , t<sub>F</sub> with Δt = 0.05. While our verification algorithm also supports continuous-time safety, we consider discrete-time safety here to obtain a fair comparison. Note that for discrete-time safety the reachable set computation in Line 5 of Algorithm 1 simplifies to R<sub>i</sub> = [I<sub>n</sub> **0**] e<sup>AiΔt</sup> X<sub>0</sub>, i = 0, . . . , t<sub>F</sub>/Δt. For the comparison we consider both the ad hoc observables used in [5] and the random Fourier feature observables presented here.
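Under this simplification, each discrete-time reachable set is obtained by propagating the initial set in the lifted observable space through successive matrix exponentials and projecting onto the first n (state) coordinates. A minimal sketch, where the initial set is represented as a hypothetical zonotope given by a center and a generator matrix (the paper's actual set representation is richer):

```python
import numpy as np
from scipy.linalg import expm

def discrete_reach(A, c, G, n, dt, t_final):
    """R_i = [I_n 0] e^{A i dt} X_0 for a zonotope X_0 = (c, G) in the lifted
    observable space; returns the (center, generators) of each projected set."""
    k = A.shape[0]
    P = np.eye(n, k)                      # projection [I_n 0]
    E = expm(A * dt)                      # one-step matrix exponential
    M = np.eye(k)                         # running e^{A i dt}
    sets = []
    for _ in range(int(round(t_final / dt)) + 1):
        sets.append((P @ (M @ c), P @ (M @ G)))
        M = E @ M                         # e^{A (i+1) dt} = e^{A dt} e^{A i dt}
    return sets

# toy check: pure rotation dynamics, so e^{A t} rotates the set
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
c0, G0 = np.array([1.0, 0.0]), 0.1 * np.eye(2)
sets = discrete_reach(A, c0, G0, n=2, dt=np.pi / 8, t_final=np.pi / 2)
```

Since the map is linear, the zonotope structure is preserved exactly at every step; the collision checks against unsafe sets are then performed on these projected sets.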

The resulting computation times for verification are summarized in Table 1. For all benchmark instances our novel verification algorithm has the lowest computation time, and it is often orders of magnitude faster than the other verification approaches. The main reason for this is that with our polynomial refinement strategy we can completely avoid the computationally expensive calls to SMT solvers used by the other methods. Moreover, while the computation time of the other approaches often depends on how difficult it is to verify or falsify the specification, our algorithm exhibits roughly equal runtimes for all specifications. The explanation is that the polynomial zonotope refinement approach that we use for the collision checks with unsafe sets is very efficient, so that the majority of the runtime is spent on the computation of the image through the observable function using Taylor model arithmetic, a task which is independent of the specification. Interestingly, using random Fourier features instead of ad hoc observables can either prolong or accelerate the verification process, depending on the benchmark instance and verification approach used. However, even if

**Table 1.** Computation time in seconds for verification or falsification of the benchmark systems from Sect. 5.1 using different approaches, where the symbol – indicates that the computation timed out after 2 h. The parameter i specified in the second column changes the specification, and the third column shows whether the specification can be verified or falsified.


they prolong the time required for verification in some cases, the usage of random Fourier feature observables can be justified by their superior accuracy demonstrated in Sect. 5.2. Yet another observation is that direct encoding and zonotope domain splitting are not able to verify or falsify the high-dimensional biological model at all if random Fourier feature observables are used. The reason is that both of these approaches apply an SMT solver for verification, which does not scale to high dimensions and is not well suited for handling the trigonometric functions and the strong coupling between variables used in random Fourier feature observables. In summary, our proposed verification algorithm outperforms all existing verification techniques for Koopman linearized systems in terms of runtime. In addition, it handles different types of observables well and scales to high-dimensional systems.

#### **6 Conclusion**

We presented two major improvements for reachability analysis of Koopman operator linearized systems: First, we use random Fourier features as observable functions, which yields a systematic approach requiring much less user insight than previous methods. Second, we handle the nonlinear transformation of the initial state by combining Taylor model arithmetic with polynomial zonotope refinement. As demonstrated on several nonlinear system benchmarks, the combination of these two techniques is both extremely accurate and extremely fast.

The main trade-off with Koopman linearized systems is that the guarantees are on the system approximation, not the original system. Despite this, we believe the method could still be useful for verification in systems engineering, where the goal is to produce evidence that the system meets its requirements. It could also be effective for finding unsafe counterexamples (falsification), for analyzing systems where only simulation code is provided, or even for real-world systems where sensor measurements could be used to create a Koopman linearized model for analysis. As such systems do not have models given as symbolic differential equations, most traditional reachability methods cannot be applied.

**Acknowledgements.** This material is based upon work supported by the Air Force Office of Scientific Research, the DARPA Assured Autonomy program under the United States Air Force, and the Office of Naval Research under award numbers FA9550-19- 1-0288, FA9550-21-1-0121, FA9550-22-1-0450, FA2386-17-1-4065, FA8750-19-C-0092, and N00014-22-1-2156. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force, DARPA, or the United States Navy. Distribution Statement A: Approved for Public Release; Distribution is Unlimited. PA: AFRL-2022- 1356.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **RINO: Robust INner and Outer Approximated Reachability of Neural Networks Controlled Systems**

Eric Goubault(B) and Sylvie Putot

LIX, Ecole Polytechnique, CNRS and Institut Polytechnique de Paris, 91128 Palaiseau, France {eric.goubault,sylvie.putot}@polytechnique.edu

**Abstract.** We present a unified approach, implemented in the RINO tool, for the computation of inner and outer-approximations of reachable sets of discrete-time and continuous-time dynamical systems, possibly controlled by neural networks with differentiable activation functions. RINO combines a zonotopic set representation with generalized mean-value AE extensions to compute under and over-approximations of the robust range of differentiable functions, and applies these techniques to the particular case of learning-enabled dynamical systems. The AE extensions require an efficient and accurate evaluation of the function and its Jacobian with respect to the inputs and initial conditions. For continuous-time systems, possibly controlled by neural networks, the function to evaluate is the solution of the dynamical system. It is over-approximated in RINO using Taylor methods in time coupled with a set-based evaluation with zonotopes. We demonstrate the good performance of RINO compared to the state-of-the-art tools Verisig 2.0 and ReachNN\* on a set of classical benchmark examples of neural network controlled closed-loop systems. For generally comparable precision to Verisig 2.0 and higher precision than ReachNN\*, RINO is always at least one order of magnitude faster, while also computing the more involved inner-approximations that the other tools do not compute.

**Keywords:** Neural networks verification · Reachability analysis · Robustness · Inner-approximation

# **1 Introduction**

Over the last few years, neural networks have emerged as an increasingly classical choice for the control of autonomous systems, in particular due to their properties as universal function approximators. However, their adoption in safety-critical systems, the inherent uncertainties of the dynamic environment, and their sensitivity to adversarial examples make it crucial to establish their safety and robustness. This verification is challenging because of the complex nonlinear characteristics of neural networks. Recent works have proposed approaches and tools to bound the output uncertainty of neural networks with respect to input perturbations. However, many of them are restricted to the analysis of networks with ReLU activation functions. Moreover, the approaches considering general differentiable activation functions and systems with general nonlinear dynamics provide over-approximations whose conservatism is difficult to estimate. RINO proposes a scalable and adaptive approach to compute both inner (or under) and outer (or over) approximations for the closed-loop reachability problem of neural network controlled systems with differentiable activation functions. The outer-approximation allows for property verification, while the inner-approximation allows for property refutation. Combined, the inner and outer-approximations make it possible to assess the conservatism of the approximations.

As the behavior of a neural network controlled closed-loop system relies on the interaction between the continuous dynamics and the neural network controller, good precision requires not only computing the output range but also describing the input-output mapping of the controller. In this work, we propose to use a zonotope-based abstraction to compute in a unified way both the reachable sets of neural networks and dynamical systems. This seamless integration of the reachability of neural networks and dynamical systems has the advantage of naturally propagating useful correlations through the different components of the closed-loop system, resulting in an efficient and precise approach compared to many existing works which rely on external reachability tools.

#### *Contributions*


*Related Work.* The safety verification of DNNs has received considerable attention recently, with several threads of work being developed. We draw below a non-exhaustive panorama focusing on available tools for reachability analysis of neural network controlled systems with smooth activation functions.

Different approaches have been proposed for the reachability analysis of closed-loop systems with neural network controllers, often by a transformation to a continuous or hybrid system reachability problem. Sherlock [6] targets both the open-loop and closed-loop problems with ReLU activation functions, in particular using the regressive polynomial rule inference approach [5] for the closed loop, and Flow\* [3] for the reachability of the dynamical system. NNV [24] also targets both the open-loop and closed-loop verification problems, with various activation functions and set representations such as polyhedra or star sets [23], and different reachability algorithms for dynamical systems relying on CORA [1] and the MPT toolbox [18]. ReachNN [13] and its successor ReachNN\* [7] propose a reachability analysis based on Bernstein polynomials for closed-loop systems with general activation functions, also relying on Flow\* [3] for the reachability of the dynamical system. Verisig [14] handles NNCS with nonlinear plants controlled by sigmoid-based networks, exploiting the fact that the sigmoid is the solution to a differential equation to transform the neural network into an equivalent hybrid system, which is then fed to Flow\*. Verisig 2.0 [15] uses preconditioned Taylor models to propagate reachable sets in neural networks, and also relies on Flow\* for reachability of the hybrid system component.

The very recent works [21] and [12], implemented over JuliaReach and in POLAR respectively, are also closely related to our work. In [21], the authors implement a bridge between zonotope abstractions and Taylor model abstractions in order to combine tools analyzing controllers (e.g. using zonotopes like DeepZ [22]) with tools analyzing ordinary differential equations (e.g. Flow\* [3]). In [12], the authors use a polynomial arithmetic combining Bernstein polynomials and Taylor models to iteratively over-approximate network layers, according to whether the activation function is differentiable or not.

#### **2 Problem Statement and Background**

#### **2.1 Robust Reachability of Closed-Loop Dynamical Systems**

We consider in this work a closed-loop system consisting of a plant with states x, modeled as a discrete-time or continuous-time system with time-varying disturbances w and inputs u, where some components of the control inputs can be the output of a neural network h taking x as input. For notational simplicity, we focus on continuous-time systems and define:

$$\begin{cases} \dot{x}(t) = f(x(t), u(t), w(t)) & \text{if } t \ge 0 \\ x(t) = x\_0 & \text{if } t = 0 \end{cases} \tag{1}$$

where f is a sufficiently smooth function, at least C<sup>1</sup>, and the controls u and disturbances w are assumed to be piecewise sufficiently smooth, C<sup>k</sup> for some k ≥ 0. This allows discontinuous controls and disturbances, where the discontinuities can only appear at discrete times t<sub>j</sub>.

The neural network h is a fully-connected feedforward NN with differentiable activation functions, defined as the composition h(x) = h<sub>L</sub> ◦ h<sub>L−1</sub> ◦ . . . ◦ h<sub>1</sub>(x) of L layers, where each layer h<sub>i</sub>(x) = σ(W<sub>i</sub>x + b<sub>i</sub>) performs a linear transform followed by a sigmoid or hyperbolic tangent activation σ. We assume the control is decomposed as u(t) = (u<sub>1</sub>(t), u<sub>2</sub>(t)), where u<sub>2</sub>(t) is a control input defined in U<sub>2</sub> and u<sub>1</sub>(t) is the output of the neural network controller. This controller is executed in a time-triggered fashion with control step T, so that u<sub>1</sub>(t) = h(x(t<sub>k</sub>)) for t ∈ [t<sub>k</sub>, t<sub>k</sub> + T), where t<sub>k</sub> = kT for integers k ≥ 0. System (1) can then be rewritten as

$$\begin{cases} \dot{x}(t) = f(x(t), h(x(t\_k)), u\_2(t), w(t)) & \text{if } t \in [t\_k, t\_k + T), \ t\_k = kT, k \ge 0\\ x(t) = x\_0 & \text{if } t = 0 \end{cases} \tag{2}$$

Let ϕ<sup>f</sup>(t; x<sub>0</sub>, u<sub>2</sub>, w) for time t ∈ T denote the *time trajectory* of (2) with initial state x(0) = x<sub>0</sub>, for input signal u<sub>2</sub> and disturbance w.
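The time-triggered closed loop (2) can be illustrated with a plain pointwise simulation: the controller output is sampled and held over each control step of length T, while the plant is integrated between control instants. Everything below (the toy plant, the random controller weights, and the explicit Euler integrator) is an illustrative stand-in, not RINO's guaranteed set-based computation.

```python
import numpy as np

def nn_controller(layers):
    """Fully-connected feedforward net h = h_L . ... . h_1, each layer
    a linear transform followed by a tanh activation."""
    def h(x):
        for W, b in layers:
            x = np.tanh(W @ x + b)
        return x
    return h

def simulate(f, h, x0, T, t_end, dt):
    """Time-triggered closed loop: u1 is held at h(x(t_k)) on [t_k, t_k + T)."""
    x, u1 = np.asarray(x0, float), None
    traj = [x.copy()]
    steps_per_T = int(round(T / dt))
    for i in range(int(round(t_end / dt))):
        if i % steps_per_T == 0:          # control instant t_k = k T
            u1 = h(x)
        x = x + dt * f(x, u1)             # explicit Euler step
        traj.append(x.copy())
    return np.array(traj)

# toy plant xdot = -x + u1 with a random 1-hidden-layer tanh controller
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 1)), rng.normal(size=4)),
          (rng.normal(size=(1, 4)), rng.normal(size=1))]
traj = simulate(lambda x, u: -x + u, nn_controller(layers),
                x0=[1.0], T=0.2, t_end=2.0, dt=0.01)
```

RINO replaces the pointwise state by a zonotopic set and the Euler step by guaranteed Taylor-model integration, but the control-sampling structure is the same.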

We consider the problem of computing inner and outer-approximations of robust reachable sets as introduced in [9], defined here as

$$R^f\_{\mathcal{A}\mathcal{E}}(t; \mathbb{X}\_0, \mathbb{U}\_2, \mathbb{W}) = \{x \mid \forall w \in \mathbb{W}, \exists u\_2 \in \mathbb{U}\_2, \exists x\_0 \in \mathbb{X}\_0, \, x = \varphi^f(t; x\_0, u\_2, w)\}$$

Note that this notion of robust reachability extends the classical notions of minimal and maximal reachability [20]. We use the subscript notation AE to indicate that the reachable set is minimal with respect to the disturbances w (universal quantification, A) and maximal with respect to the input u<sub>2</sub> (existential quantification, E), and that the universal quantification always precedes the existential quantification.

#### **2.2 Mean-Value Inner and Outer-Approximating Robust Extensions**

A classical but often overly conservative way to overapproximate the image of a set by a real-valued function f : <sup>R</sup><sup>m</sup> <sup>→</sup> <sup>R</sup> is the natural interval extension <sup>F</sup> : IR<sup>m</sup> <sup>→</sup> IR, IR being the set of intervals with real bounds, which consists in replacing real operations by their interval counterparts in the expression of the function.
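As a concrete illustration of this conservatism (toy code, not part of RINO), a minimal interval type evaluating f(x) = x(1 − x) by its natural extension over [0, 1] returns [0, 1], although the true range is only [0, 0.25]: the two occurrences of x are treated as independent.

```python
class I:
    """Closed interval [lo, hi] with just the operations needed here."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, o):
        return I(self.lo + o.lo, self.hi + o.hi)
    def __rsub__(self, c):                      # scalar - interval
        return I(c - self.hi, c - self.lo)
    def __mul__(self, o):
        ps = [a * b for a in (self.lo, self.hi) for b in (o.lo, o.hi)]
        return I(min(ps), max(ps))

x = I(0.0, 1.0)
F = x * (1 - x)   # natural extension: [0, 1], vs. true range [0, 0.25]
```

The mean-value extensions discussed next reduce exactly this kind of over-estimation by exploiting derivative information.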

A generally more accurate extension relies on a linearization by the mean-value theorem. Mean-value extensions can be generalized to compute ranges that are robust to disturbances, identified as a subset of the input components. Let f be a continuously differentiable function from R<sup>m</sup> to R with input decomposed as x = (u, w) ∈ (U, W) ⊆ IR<sup>m</sup>. We define the robust range of the function f on (U, W), robust with respect to the component w ∈ W, as R<sup>f</sup><sub>AE</sub>(U, W) = {z | ∀w ∈ W, ∃u ∈ U, z = f(u, w)}.

For a continuously differentiable function f : R<sup>m</sup> → R<sup>n</sup>, we write ∇f = (∂f<sub>i</sub>/∂x<sub>j</sub>)<sub>1≤i≤n, 1≤j≤m</sub> for its Jacobian matrix. We write ⟨x, y⟩ for the scalar product of vectors x and y, and |x| for the componentwise absolute value. For a vector of intervals X = [X̲, X̄], we write c(X) = (X̲ + X̄)/2 and r(X) = (X̄ − X̲)/2 for its center and radius, defined componentwise.

**Theorem 1 (**[8]**, slightly simplified version of Thm. 2).** *Let* f *be a continuously differentiable function from* R<sup>m</sup> *to* R *and* X = U × W ⊆ IR<sup>m</sup>*. Let* F<sup>0</sup>*,* ∇<sup>X</sup><sub>w</sub> *and* ∇<sup>X</sup><sub>u</sub> *be vectors of intervals such that* f(c(X)) ⊆ F<sup>0</sup>*,* {|∇<sub>w</sub>f(u, w)|, (u, w) ∈ X} ⊆ ∇<sup>X</sup><sub>w</sub> *and* {|∇<sub>u</sub>f(u, w)|, (u, w) ∈ X} ⊆ ∇<sup>X</sup><sub>u</sub>*. We have:*

$$\begin{split} & \left[ \overline{\mathcal{F}^0} - \langle \underline{\nabla\_u^{\mathcal{X}}}, r(\mathcal{U}) \rangle + \langle \overline{\nabla\_w^{\mathcal{X}}}, r(\mathcal{W}) \rangle,\ \underline{\mathcal{F}^0} + \langle \underline{\nabla\_u^{\mathcal{X}}}, r(\mathcal{U}) \rangle - \langle \overline{\nabla\_w^{\mathcal{X}}}, r(\mathcal{W}) \rangle \right] \subseteq \boldsymbol{R}\_{\mathcal{A}\mathcal{E}}^f(\mathcal{U}, \mathcal{W}) \\ & \boldsymbol{R}\_{\mathcal{A}\mathcal{E}}^f(\mathcal{U}, \mathcal{W}) \subseteq \left[ \underline{\mathcal{F}^0} - \langle \overline{\nabla\_u^{\mathcal{X}}}, r(\mathcal{U}) \rangle + \langle \underline{\nabla\_w^{\mathcal{X}}}, r(\mathcal{W}) \rangle,\ \overline{\mathcal{F}^0} + \langle \overline{\nabla\_u^{\mathcal{X}}}, r(\mathcal{U}) \rangle - \langle \underline{\nabla\_w^{\mathcal{X}}}, r(\mathcal{W}) \rangle \right] \end{split}$$

Theorem 1 provides inner and outer-approximations of the robust range (or of the classical range when there is no disturbance component w) of scalar-valued functions, or of the projections on each component of vector-valued functions, using bounds on the slopes over the input set. The result is useful to compute a projected range that is robustly reachable with respect to the disturbances w, or as a building block in computing an under-approximation of the image of a vector-valued function, as stated in Theorem 3 of [8].
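The bound formulas of Theorem 1 can be exercised numerically. The sketch below hardcodes interval bounds for a linear toy function f(u, w) = u + 0.5w on U = [−1, 1], W = [−0.1, 0.1]; since the gradients are constant, the interval enclosures are exact and the inner and outer bounds coincide with the exact robust range [−0.95, 0.95]. The function and its inputs are illustrative only, not RINO's implementation.

```python
def mean_value_AE(F0, Du, Dw, rU, rW):
    """Inner/outer bounds of the robust range R^f_AE(U, W) per Theorem 1.
    F0 = (lo, hi) encloses f(c(U), c(W)); Du, Dw = (lo, hi) enclose
    |df/du| and |df/dw| over U x W; rU, rW are the input radii."""
    F0l, F0h = F0
    Dul, Duh = Du
    Dwl, Dwh = Dw
    inner = (F0h - Dul * rU + Dwh * rW, F0l + Dul * rU - Dwh * rW)
    outer = (F0l - Duh * rU + Dwl * rW, F0h + Duh * rU - Dwl * rW)
    return inner, outer

# f(u, w) = u + 0.5 w: |df/du| = 1, |df/dw| = 0.5, f(center) = 0
inner, outer = mean_value_AE(F0=(0.0, 0.0), Du=(1.0, 1.0),
                             Dw=(0.5, 0.5), rU=1.0, rW=0.1)
```

Note how the disturbance radius shrinks the range from both sides: the robustly reachable values are those reachable for every w, hence the inner bound loses 0.05 at each end compared to the classical range [−1, 1].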

Note that the accuracy of the mean-value AE extension can be improved with an evaluation by a quadrature formula ([10], Sect. 4.2). Alternatively, an order 2 Taylor-based extension ([10], Sect. 3) can be used.

#### **2.3 Reachability of Neural Network Controlled Closed-Loop Systems**

The inner and outer-approximations defined in Sect. 2.2 can be computed for f being a simple function, possibly involving a neural network evaluation, f being the function defined by the iterated values of a discrete-time system, or, finally, f being the solution flow of the closed-loop system (2).

In both the discrete-time and continuous-time cases, and whether a neural network controller is present or not, the evaluation over sets of an outer-approximation of the image of the solution and of its Jacobian with respect to inputs and disturbances is needed in order to apply Theorem 1.
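Evaluating a function together with its derivative in one pass is what forward-mode automatic differentiation provides. A small dual-number type sketches the idea (illustrative only; FADBAD++'s templated design lets the same mechanism be instantiated over interval or affine-form types instead of floats):

```python
class Dual:
    """Forward-mode AD value v + d*eps: d carries the derivative."""
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v + o.v, self.d + o.d)
    __radd__ = __add__
    def __mul__(self, o):                      # product rule
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)
    __rmul__ = __mul__

def deriv(f, x):
    """Return f(x) and f'(x) in a single forward pass."""
    y = f(Dual(x, 1.0))
    return y.v, y.d

val, grad = deriv(lambda x: x * x + 3 * x, 2.0)   # f(2) = 10, f'(2) = 7
```

Instantiated over a set representation, the same pass yields the interval enclosures of the Jacobian entries that Theorem 1 consumes.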

In our work and implementation, we advocate the use of a unique abstraction by affine forms (or zonotopes for the geometric view of a tuple of variables represented by affine forms) for these sets and these evaluations, including performing reachability of the neural network controller. This abstraction is very convenient and versatile to over-approximate any smooth function, providing a good tradeoff between efficiency and precision in most cases (and for more precision, one can consider extensions with e.g. polynomial zonotopes [2]).

For continuous-time systems, we use Taylor expansions in time of the solution on a time grid. To build these Taylor expansions, we evaluate the function f and its (Lie) derivatives over affine forms by a combination of automatic differentiation and numerical evaluation in affine arithmetic, as described in e.g. [9]. The neural network is seen as a nonlinear function h, composed with f to build the function g for which we compute the solution flow. Theorem 1 is applied to this solution flow. We build the abstraction of h, and thus of g, by a simple propagation of affine forms by affine arithmetic in the network: linear transformers are exact, and we propagate affine forms through the activation functions seen as standard nonlinear functions relying on the elementary exponential function, tanh(x) = 2/(1 + e<sup>−2x</sup>) − 1 and sig(x) = 1/(1 + e<sup>−x</sup>). For differentiating the activation functions, we use tanh′(x) = 1 − tanh(x)<sup>2</sup> and sig′(x) = sig(x)(1 − sig(x)).
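Propagating an affine form through tanh can be sketched with a min-range linearization, one standard choice in affine arithmetic (RINO's actual transformer may differ), using tanh′(x) = 1 − tanh(x)² to pick the slope. The class below is a deliberately minimal, hypothetical affine-form type.

```python
import math

class Affine:
    """Affine form x0 + sum_i xi * eps_i with noise symbols eps_i in [-1, 1]."""
    _fresh = [0]
    def __init__(self, x0, terms=None):
        self.x0, self.terms = x0, dict(terms or {})
    @classmethod
    def var(cls, lo, hi):
        cls._fresh[0] += 1
        return cls((lo + hi) / 2, {cls._fresh[0]: (hi - lo) / 2})
    def range(self):
        r = sum(abs(c) for c in self.terms.values())
        return (self.x0 - r, self.x0 + r)

def tanh_affine(x):
    """Min-range linearization: slope s = min tanh' on [a, b]; the remainder
    tanh(t) - s*t is then nondecreasing there, so its range is [d(a), d(b)]."""
    a, b = x.range()
    s = min(1 - math.tanh(a) ** 2, 1 - math.tanh(b) ** 2)
    d_lo, d_hi = math.tanh(a) - s * a, math.tanh(b) - s * b
    out = Affine(s * x.x0 + (d_lo + d_hi) / 2,
                 {k: s * c for k, c in x.terms.items()})
    Affine._fresh[0] += 1
    out.terms[Affine._fresh[0]] = (d_hi - d_lo) / 2   # fresh noise symbol
    return out

x = Affine.var(-0.5, 0.5)
y = tanh_affine(x)   # y.range() encloses tanh([-0.5, 0.5])
```

The key point is that the linear part `s * c` keeps the existing noise symbols, so correlations with other variables of the closed-loop system are preserved through the activation; only the nonlinearity error goes into a fresh symbol.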

# **3 Implementation**

As mentioned in the introduction, RINO implements all ideas presented in [8–11] for the joint computation of inner and outer approximations of robustly reachable sets of differentiable nonlinear discrete-time [8,10] or continuous-time systems [8,9], possibly with constant delays [11]. For experiments with systems without neural networks, we refer to the results presented in these works, obtained with a previous version of RINO.

RINO is written in C++. Intervals and zonotopes are used for set representation: the tool relies on the FILIB++ library [19] for interval computations and the aaflib library<sup>1</sup> for affine arithmetic [4]. Ole Stauning's FADBAD++ library<sup>2</sup> is used for automatic differentiation: its template-based implementation enables us to easily evaluate the differentiation in the set representation of our choice (mostly affine forms or zonotopes). The tool takes as inputs:


It computes inner and outer-approximations of the projection on each component of the ranges, as well as joint 2D and 3D inner-approximations (provided as a YAML file and Jupyter/Python-produced figures). In addition to the classical ranges, RINO computes approximations of output ranges that are reachable robustly or adversarially with respect to disturbances, specified as a subset of the inputs. In the experiments presented hereafter, we consider only examples of classical reachability, for which comparisons with existing work are available, but the extension to robust reachability based on our previous work is straightforward.

# **4 Experiments**

For space reasons, we focus here on the main novelty which is the extension of this previous work to compute under and over-approximations of (robust) reachable sets of neural network controlled systems (2).

*Choice of Tools and Benchmark Examples.* We compare RINO against ReachNN\* and Verisig 2.0, which are the most recent fully-fledged reachability analyzers for neural network based control systems, and for which comparisons with other tools on classical benchmarks are well documented in e.g. [15]. They both improve on their previous versions, Verisig and ReachNN, and on the state-of-the-art

<sup>1</sup> http://aaflib.sourceforge.net.

<sup>2</sup> http://www.fadbad.com.

tools Sherlock, also based on Flow\*, and NNV. As noted in e.g. [15]: "Firstly, note that Verisig takes significantly more time to compute reachable sets (21 times slower in the case of the B5 benchmark). Furthermore, Verisig is unable to verify some properties due to increasing error. Note that NNV is unable to verify any of the properties considered in this paper due to high approximation error." Note, though, that there have been improvements to the internal solvers used in NNV which should qualify the latter statement (see e.g. [16]). We do not compare with the implementation in JuliaReach [21] since, first, timings are difficult to compare with an interpreted framework, and, second, it would require mixing several tools together, with many potential combinations. We try to provide elements of comparison with POLAR [12], but in many ways the latter addresses a different problem, with the emphasis on being able to interpret e.g. ReLU activation functions.


**Table 1.** List of benchmarks (see [15])

We use a large subset (7/10) of the examples from Verisig 2.0 [15], which are benchmarks used by most of the tools in the field, through e.g. the ARCH competition [17]. We also consider the same settings in terms of initial sets and the same time horizon. These are recalled in Table 1.

We indicate some of RINO's reachability results on these benchmarks in Table 2, before comparing the tightness and computing times with other tools.


**Table 2.** RINO's results for time step 0.05 (except Mountain Car, step 1.)

*Settings.* All tools, Verisig 2.0, ReachNN\* and RINO, were run without GPU support, under an Ubuntu 18.04 Docker image, on a Mac running macOS Big Sur 11.2.3 with a 2.3 GHz Intel Core i9 processor and 16 GB of memory. Verisig 2.0 and ReachNN\* were run with the Reproducibility Package of Verisig 2.0 [15]. For fairness of timing results, we also ran RINO with Docker, and the runtime ratios given in Table 3 are those using these Docker versions. RINO was also run natively on the same Mac. The performance degradation between the two versions of RINO can be estimated from the full data given in Table 2: it ranges from none to a 40% increase (with one exception at 80%), mostly between 20% and 30%. This is higher than generally observed with Docker, but is due to the fact that Docker on macOS is known to perform poorly on I/O through the underlying file system. The performance therefore degrades more when the system has higher dimension and more time steps to evaluate, since RINO logs all estimated ranges for all variables in separate files.

*Comparisons Results.* We compare in Table 3 the running times of Verisig 2.0, ReachNN\* and RINO, and volumes of their final over-approximations, more precisely the widths of the projections of each component at final time horizon.

The three tools depend on some parameters, in particular the integration time step and the order of approximation. RINO does not require much tuning of the integration time step and Taylor model order, so we use one fixed time step of 0.05 for all examples. For Verisig 2.0 and ReachNN\* we use the settings of the CAV Reproducibility Package, which we assume give good results. Verisig 2.0 and ReachNN\* actually perform poorly on the same examples with a fixed time step of 0.05 s.

We experimented with different time steps for RINO. The precision is relatively stable and does not necessarily improve when decreasing the time step. Indeed, as already noted [25], the improvement in approximation by Taylor models on smaller time steps is balanced by the loss of precision due to the set-based abstraction being performed more often. Note also that the analysis time does not depend linearly on the time step: the control step, which rules the frequency at which the analysis of the neural network controller has to be performed, is fixed (see Table 1) and does not depend on the integration time step.

Column 2 in Table 3 describes the relative width of the intervals given by Verisig 2.0 for each variable at the final time and for each system, with respect to the one given by RINO. Column 4 is the same, but for ReachNN\*. Columns 3 and 5 give the ratio of the analysis time of Verisig 2.0 (respectively ReachNN\*), with respect to the analysis time of RINO.

In all cases, RINO is much faster than both Verisig 2.0 and ReachNN\*, by factors ranging from 13 to 638.5. Moreover, this includes, for RINO, the time to compute the inner-approximations that Verisig 2.0 and ReachNN\* do not compute. ReachNN\* could not analyze TORA because of lack of memory on our platform, and timed out on ACC. Finally, interpolating the timings given in Table 1 of [12], e.g. for B1 (sig), Verisig 2.0 is reported to take 47 s whereas POLAR is reported to take 20 s on their platform. As Verisig 2.0 took 81.33 s on our platform, we can infer that RINO, with e.g. 3.62 s for B1, is almost certainly much faster than POLAR.

RINO's precision is of the same order as Verisig 2.0, and always better than ReachNN\* by a factor of about 2 to 10. RINO is in fact even substantially more precise than Verisig 2.0 in some cases (B1 and B2 in particular).

*Inner-Approximations.* Let us take example B1 (with the sigmoid-based controller), and suppose we have a safety property that the value of x<sub>1</sub> should never exceed 1. Figure 1a represents the inner-approximation as a filled blue region, the bounds of the outer-approximation as plain black lines, and values actually reached, obtained by trajectories for sampled initial conditions, as purple dots. The over-approximation alone only raises a potential alarm with respect to the unsafe zone (in red); only the inner-approximation actually proves that the safety


**Table 3.** Precision and running time comparisons RINO [timestep=0.05] vs Verisig 2.0 [time steps of [15]] vs ReachNN\* [time steps of [15]]

property is falsified. We also note in this figure that the over-approximation is very tight, given that samples give almost indistinguishable ranges. Figure 1b represents the inner and outer approximations of the joint range (x<sub>1</sub>, x<sub>2</sub>) as well as an estimation by sampling. As shown by the samples, (x<sub>1</sub>, x<sub>2</sub>) becomes almost a 1D curve after some time, making the inner approximation extremely difficult to estimate. Indeed, our inner-approximation, in orange, is fairly precise for the first time steps, and the corresponding inner skewed boxes are rotated to match the curved, 1D shape of the samples. The green boxes printed on the picture are the box enclosures of the actually computed outer-approximation. Note that the inner-approximation of the projections on each component can be non-empty while the joint inner range is empty, as some approximation is introduced in the joint inner range computation (as a skewed box) from the projected ranges.

**Fig. 1.** B1: inner-approximation, outer-approximation and sampling (purple dots) (Color figure online)

#### **5 Conclusion and Future Work**

We presented the RINO tool, dedicated to the reachability analysis of dynamical systems, possibly controlled by neural networks. While providing accurate results, RINO is significantly faster than other state-of-the-art tools, which is key to addressing real-life reachability problems, where the systems and neural networks can be of high dimension. Moreover, as far as we are aware, it is the only existing tool to propose inner-approximations of the reachable sets of such systems. We currently handle only differentiable activation functions. We are considering abstractions to handle ReLU activations as well, even though the approach is less natural in that case, as it will introduce conservatism. We also plan to improve the accuracy of our current results by further specializing this work to exploit the structure of neural networks, such as the monotonicity of activation functions. Finally, robustness is a crucial property for neural network enabled systems, and we plan to explore the possibilities offered by the computation of robust reachable sets.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **STLmc: Robust STL Model Checking of Hybrid Systems Using SMT**

Geunyeol Yu , Jia Lee , and Kyungmin Bae(B)

Pohang University of Science and Technology, Pohang, Korea kmbae@postech.ac.kr

**Abstract.** We present the STLmc model checker for signal temporal logic (STL) properties of hybrid systems. The STLmc tool can perform STL model checking up to a robustness threshold for a wide range of hybrid systems. Our tool utilizes the refutation-complete SMT-based bounded model checking algorithm by reducing the robust STL model checking problem into Boolean STL model checking. If STLmc does not find a counterexample, the system is guaranteed to be correct up to the given bounds and robustness threshold. We demonstrate the effectiveness of STLmc on a number of hybrid system benchmarks.

# **1 Introduction**

Signal temporal logic (STL) [31] has emerged as a popular property specification formalism for hybrid systems. STL formulas describe linear-time properties of continuous real-valued signals. Because hybrid systems exhibit both discrete and continuous behaviors, STL provides a convenient and expressive way to specify their important requirements. STL has a vast range of applications to hybrid systems, including automotive systems [26], robotics [24,40], medical systems [36], IoT [7], smart cities [30], etc.

Due to the infinite-state nature of hybrid systems with continuous dynamics, most techniques and tools for analyzing STL properties focus on monitoring and falsification. These techniques analyze concrete signal samples obtained by simulating hybrid automata, to monitor the system's behavior [13,15,32] or to find counterexamples [1,37,43], often in combination with stochastic optimization. To this end, STL monitoring and falsification use a quantitative semantics that defines a *robustness degree* indicating how well a formula is satisfied. However, these methods cannot be used to guarantee correctness.

Recently, several STL model checking techniques have been proposed for hybrid systems [3,29,35]. In particular, the SMT-based bounded model checking algorithms [3,29] are refutation-complete, i.e., they can guarantee correctness up to given bounds. However, these techniques are based on the Boolean semantics of STL instead of quantitative semantics. This is a limitation for hybrid systems as small perturbations of signals can cause the system to violate the properties *verified* by Boolean STL model checking. Moreover, there exists no tool with a convenient user interface implementing STL model checking techniques.

This paper presents the STLmc tool for robust STL model checking of hybrid systems. Our tool can verify that, up to given bounds, the robustness degree of an STL formula ϕ is greater than a *robustness threshold* ε > 0 for all possible behaviors of the system. We reduce the robust STL model checking problem to Boolean STL model checking using *ε-strengthening* (perturbing the problem by ε to make it harder to be true), first proposed in [21] for first-order logic and extended here to STL. We then apply the refutation-complete bounded model checking algorithm [3,29] to build an SMT encoding of the resulting Boolean STL model checking problem, which can be solved using SMT solvers.

Apart from the robust STL model checking method, STLmc also implements several techniques to improve the usability and scalability of the tool:


We demonstrate the effectiveness of the STLmc tool on a number of hybrid system benchmarks, including linear, polynomial, and ODE dynamics, and nontrivial STL properties. The tool is available at https://stlmc.github.io.

#### **2 Background: Robust STL Model Checking**

*Hybrid Automata.* Hybrid systems are often formalized as *hybrid automata* [25], defined as tuples H = (Q, X, *init*, *inv*, *jump*, *flow*). A set of modes Q specifies discrete states. A set of real-valued variables X = {x<sub>1</sub>, ..., x<sub>l</sub>} gives continuous states. A pair ⟨q, v⟩ of a mode q ∈ Q and a vector v ∈ R<sup>l</sup> constitutes a state of H. An initial condition *init*(q, v) defines a set of initial states. An invariant condition *inv*(q, v) defines a set of valid states. A jump condition *jump*(q, v, q′, v′) defines a discrete transition from ⟨q, v⟩ to ⟨q′, v′⟩. A flow condition *flow*(q, v, v<sub>t</sub>, t) defines a continuous evolution of X's values from v to v<sub>t</sub> over time t in mode q.

A *signal* σ represents a continuous execution of a hybrid automaton H, given by a function σ : [0, τ) → Q × R<sup>l</sup> with a time bound τ > 0. A signal σ is called a *trajectory* of a hybrid automaton H, written σ ∈ H, if σ describes a valid behavior of H: formally, there exists a sequence of times 0 = t<sub>0</sub> < t<sub>1</sub> < ... < τ such that: (i) σ(t<sub>0</sub>) is an initial state by *init*; (ii) for i ≥ 1, H's state evolves from σ(t<sub>i−1</sub>) according to *flow*, while satisfying *inv*, over each time interval [t<sub>i−1</sub>, t<sub>i</sub>); and (iii) for i ≥ 1, a discrete transition occurs by *jump* at each time point t<sub>i</sub>.

*Signal Temporal Logic.* Signal temporal logic (STL) is widely used to specify properties of hybrid systems [31]. The syntax of STL is defined by:

$$\varphi ::= p \mid \neg \varphi \mid \varphi \land \varphi \mid \varphi \, \mathbf{U}_I \, \varphi$$

where p denotes state propositions, and I ⊆ R<sub>≥0</sub> is any interval of nonnegative real numbers. Examples of state propositions include relational expressions of the form f(x) ≥ 0 over the variables X with a real-valued function f : R<sup>l</sup> → R. Other common Boolean and temporal operators can be derived by the usual equivalences: e.g., ϕ ∨ ϕ′ ≡ ¬(¬ϕ ∧ ¬ϕ′), ♦<sub>I</sub> ϕ ≡ ⊤ **U**<sub>I</sub> ϕ, □<sub>I</sub> ϕ ≡ ¬♦<sub>I</sub> ¬ϕ, etc.

We consider a quantitative semantics of STL based on *robustness degrees* [15]. The semantics of a state proposition p is defined as a function p : Q × R<sup>l</sup> → R̄ that assigns to a state the degree to which p is true, where R̄ = R ∪ {−∞, ∞}. Specifically, the robustness degree of a state proposition f(x) ≥ 0 is the value of f(x). E.g., the robustness degree of x ≥ 4 is the value of x − 4 at a given state. The robustness degree of an STL formula can be defined as follows [15], where the time bound τ of a signal is explicitly taken into account.<sup>1</sup>

**Definition 1.** *Given an STL formula* ϕ*, a signal* σ : [0, τ) → R<sup>l</sup>*, and a time* t ∈ [0, τ)*, the robustness degree* ρ<sub>τ</sub>(ϕ, σ, t) ∈ R̄ *is defined inductively by:*<sup>2</sup>

$$\begin{aligned} \rho_{\tau}(p,\sigma,t) &= p(\sigma(t)) \\ \rho_{\tau}(\neg\varphi,\sigma,t) &= -\rho_{\tau}(\varphi,\sigma,t) \\ \rho_{\tau}(\varphi_{1}\wedge\varphi_{2},\sigma,t) &= \min(\rho_{\tau}(\varphi_{1},\sigma,t),\rho_{\tau}(\varphi_{2},\sigma,t)) \\ \rho_{\tau}(\varphi_{1}\,\mathbf{U}_{I}\,\varphi_{2},\sigma,t) &= \sup_{t'\in(t+I)\cap[0,\tau)} \min\bigl(\rho_{\tau}(\varphi_{2},\sigma,t'),\inf_{t''\in[t,t']}\rho_{\tau}(\varphi_{1},\sigma,t'')\bigr) \end{aligned}$$

The robust STL model checking problem is to determine if the robustness degree of an STL formula ϕ is always greater than a given robustness threshold ε > 0 for all possible trajectories of a hybrid automaton H.
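For intuition, Definition 1 can be evaluated approximately on a time-sampled signal. The following sketch is our own illustration only (the tuple encoding of formulas and the sampling scheme are not STLmc's internals), assuming formulas built from `('prop', f)`, negation, conjunction, and bounded until:

```python
import math

def rho(phi, ts, xs, i):
    """Robustness degree of phi at sample index i of the signal given by
    times ts and states xs (a discretization of Definition 1).
    Formulas: ('prop', f), ('neg', p), ('and', p, q), ('until', (a, b), p, q)."""
    kind = phi[0]
    if kind == 'prop':
        return phi[1](xs[i])
    if kind == 'neg':
        return -rho(phi[1], ts, xs, i)
    if kind == 'and':
        return min(rho(phi[1], ts, xs, i), rho(phi[2], ts, xs, i))
    if kind == 'until':
        (a, b), p, q = phi[1], phi[2], phi[3]
        best = -math.inf
        for j in range(i, len(ts)):
            if ts[i] + a <= ts[j] <= ts[i] + b:
                hold = min(rho(p, ts, xs, k) for k in range(i, j + 1))
                best = max(best, min(rho(q, ts, xs, j), hold))
        return best
    raise ValueError(kind)

# Eventually_[0,10] (x >= 4), derived as (true U_[0,10] x >= 4), on x(t) = t:
ts = [k * 0.02 for k in range(500)]            # samples of [0, 10)
top = ('prop', lambda x: math.inf)             # 'true' has robustness +inf
phi = ('until', (0.0, 10.0), top, ('prop', lambda x: x - 4.0))
```

On this signal, `rho(phi, ts, ts, 0)` approaches 6 (the supremum of x − 4 over [0, 10)) as the grid is refined, while the robustness of x ≥ 4 alone at t = 0 is −4.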

**Definition 2 (Robust STL Model Checking).** *For a time bound* τ > 0*, an STL formula* ϕ *is satisfied at time* t ∈ [0, τ) *on a hybrid automaton* H *with respect to a robustness threshold* ε > 0 *iff for every trajectory* σ ∈ H*,* ρ<sub>τ</sub>(ϕ, σ, t) > ε*.*

*A Running Example.* Consider two rooms interconnected by an open door. The temperature x<sup>i</sup> of each room, i = 0, 1, changes depending on the heater's mode <sup>q</sup><sup>i</sup> ∈ {On, Off} and the temperature of the other room. The continuous dynamics of x<sup>i</sup> can be specified as the following ODEs, where Ki, hi, ci, d<sup>i</sup> are determined by the size of the room, the heater's power, and the size of the door [2,19,25]:

$$
\dot{x}_i = \begin{cases}
K_i(h_i - (c_i x_i - d_i x_{1-i})) & \text{(On)} \\
-K_i(c_i x_i - d_i x_{1-i}) & \text{(Off)}
\end{cases}
$$

<sup>1</sup> Cf. the Boolean semantics of STL [29,31], where the satisfaction of an STL formula is defined as a Boolean value (i.e., true or false).

<sup>2</sup> The Minkowski sum of intervals I and J is denoted by I + J. For a singular interval, {t} + I is written t + I. We write sup<sub>a∈A</sub> g(a) and inf<sub>a∈A</sub> g(a) to denote the least upper bound and the greatest lower bound of the set {g(a) | a ∈ A}, respectively.

**Fig. 1.** A hybrid automaton for the networked thermostats.

Figure 1 shows a hybrid automaton of our networked thermostat controllers. Initially, both heaters are off and the temperatures are between 18 and 22. The jumps between modes then define a control logic to keep the temperatures within a certain range using only one heater. We are interested in robust model checking of nontrivial STL properties, such as:

φ<sub>1</sub>**:** ♦<sub>[0,15]</sub>(x<sub>0</sub> ≥ 14 **U**<sub>[0,∞)</sub> x<sub>1</sub> ≤ 19): at some moment in the first 15 s, x<sub>1</sub> is less than or equal to 19; until then, x<sub>0</sub> is greater than or equal to 14.

φ<sub>2</sub>**:** □<sub>[2,4]</sub>(x<sub>0</sub> − x<sub>1</sub> ≥ 4 → ♦<sub>[3,10]</sub> x<sub>0</sub> − x<sub>1</sub> ≤ −3): between 2 and 4 s, whenever x<sub>0</sub> − x<sub>1</sub> ≥ 4, x<sub>0</sub> − x<sub>1</sub> ≤ −3 holds at some point between 3 and 10 s later.
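The running example can be simulated directly with Euler integration. The sketch below is ours, and all constants (K, h, c, d, the 17/23 switching thresholds, the step size) are hypothetical stand-ins: the paper's actual values appear in its Fig. 1 and Fig. 2, not here.

```python
# Hypothetical parameters for the two-room thermostat dynamics.
K = (0.015, 0.045); h = (100.0, 200.0); c = (1.0, 1.0); d = (0.01, 0.01)

def step(q, x, dt):
    """One Euler step of the coupled ODEs (q[i] is True iff heater i is On)."""
    dx = []
    for i in (0, 1):
        coupling = c[i] * x[i] - d[i] * x[1 - i]
        dx.append(K[i] * (h[i] - coupling) if q[i] else -K[i] * coupling)
    return [x[i] + dt * dx[i] for i in (0, 1)]

def simulate(x0, tau=25.0, dt=0.01):
    q, x, traj = [False, False], list(x0), []
    for n in range(int(round(tau / dt))):
        # Toy jump logic with at most one heater On, as in the paper's model:
        # switch the colder room's heater On below 17, Off above 23.
        if not any(q) and min(x) < 17.0:
            q[x.index(min(x))] = True
        for i in (0, 1):
            if q[i] and x[i] > 23.0:
                q[i] = False
        x = step(q, x, dt)
        traj.append((n * dt, tuple(x)))
    return traj
```

Starting from x0 = (20, 20), the simulated temperatures stay within a bounded band, which is exactly the kind of behavior the STL properties φ<sub>1</sub> and φ<sub>2</sub> interrogate.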

# **3 The** STLmc **Model Checker**

The STLmc tool can model check STL properties of hybrid automata, given three parameters: ε > 0 (a robustness threshold), τ > 0 (a time bound), and N ∈ N (a discrete bound). STLmc provides an expressive input format to easily specify a wide range of hybrid automata. STLmc also provides a visualization command to give an intuitive description of counterexamples.

#### **3.1 Input Format**

The input format of STLmc, inspired by dReach [28], consists of five sections: variable declarations, mode definitions, initial conditions, state propositions, and STL properties. Mode and continuous variables define discrete and continuous states of hybrid automata. Mode definitions specify flow, jump, and invariant conditions. STL formulas can also include user-defined state propositions.

Figure 2 shows the input model of the hybrid automaton described in the running example above. Constants are introduced with the const keyword. Two mode variables on0 and on1 denote the heaters' modes. Continuous variables x0 and x1 are declared with domain intervals. There are three "mode blocks" that specify the three modes in Fig. 1 and their invariant, flow, and jump conditions.

In mode blocks, a mode component includes a set of logic formulas over mode variables. An inv component contains a set of logic formulas over continuous variables. A flow component can include ODEs over continuous variables. A jump component contains a set of jump conditions of the form *guard* => *reset*, where *guard* and *reset* are logic formulas over mode and continuous variables, and "primed" variables denote states after the jump has occurred.

**Fig. 2.** An input model example

STL properties are declared in the goal section, and "named" propositions are declared in the proposition section. State propositions are arithmetic and relational expressions over mode and continuous variables. For example, in Fig. 2, the STL formula f1 contains the two state propositions x<sub>0</sub> ≥ 14 and x<sub>1</sub> ≤ 19, and the formula f2 contains the user-defined propositions p1 and p2.
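To make the section structure concrete, a model file in this format might look roughly as follows. This is an illustrative sketch only: the constants, thresholds, and exact surface syntax are our own guesses based on the description above and the dReach-style layout, not a verbatim excerpt of Fig. 2.

```
// illustrative sketch, not the exact STLmc grammar
const real tau = 25;

bool on0; bool on1;          // mode variables
[10, 35] x0; [10, 35] x1;    // continuous variables with domains

{ mode: !on0 and !on1;       // both heaters Off
  inv:  x0 > 10; x1 > 10;
  flow: d/dt[x0] = ...; d/dt[x1] = ...;
  jump: x0 < 17 => (on0' and !on1' and x0' = x0 and x1' = x1); }

// two further mode blocks, (on0, !on1) and (!on0, on1), omitted

init: !on0 and !on1 and 18 <= x0 <= 22 and 18 <= x1 <= 22;

proposition: p1 <=> x0 - x1 >= 4; p2 <=> x0 - x1 <= -3;

goal: f1: <>[0,15] (x0 >= 14 U[0,inf) x1 <= 19);
      f2: [][2,4] (p1 -> <>[3,10] p2);
```

Each mode block pairs the discrete state (the mode formulas) with its invariant, flow, and jump conditions, mirroring the hybrid-automaton components of Sect. 2.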

#### **3.2 Command Line Options**

STLmc provides a command-line interface with the options listed in Table 1. The options -two-step and -parallel enable the two-step solving optimization of Sect. 4.3. STLmc supports three SMT solvers, chosen based on the continuous dynamics: Z3 [12] and Yices2 [17] can deal with linear and polynomial dynamics (solutions of the ODEs are linear functions or polynomials), and dReal [22] can approximately deal with Lipschitz-continuous ODE dynamics.

A discrete bound N limits the number of mode changes and *variable points*, at which the truth value of some STL subformula changes. This is a distinctive parameter of STL model checking that cannot in general be derived from a time bound τ or a maximal number of jumps (say, m). E.g., for any positive natural number n ∈ N, consider the function y(t) = sin((π/τ) · n · t); the state proposition y > 0 has n − 1 variable points even if there is no mode change (m = 0).<sup>3</sup>
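This count is easy to confirm numerically. The check below is our own sketch (dense sampling, not the tool's algorithm), with τ and n chosen arbitrarily:

```python
import math

def variable_points(y, tau, samples=100_000):
    """Count sign changes of the state proposition y(t) > 0 on (0, tau)
    by dense sampling (a numerical check, not STLmc's algorithm)."""
    flips, prev = 0, None
    for k in range(1, samples):
        cur = y(k * tau / samples) > 0   # sample strictly inside (0, tau)
        if prev is not None and cur != prev:
            flips += 1
        prev = cur
    return flips

tau, n = 25.0, 4                          # arbitrary illustrative choices
y = lambda t: math.sin(math.pi / tau * n * t)
```

Here `variable_points(y, tau)` returns n − 1 = 3, independently of any mode changes.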

For the input model in Fig. 2, the following command found a counterexample for the formula f2 at bound 2 with respect to ε = 2 in about 15 s using dReal:

<sup>3</sup> This example also hints that STL model checking can be arbitrarily complex even for a single mode; τ and m cannot bound such a model checking computation, whereas N can bound the computation involving *both* discrete and continuous behaviors.


**Table 1.** Some command line options for STLmc.

**Fig. 3.** Visualization of a counterexample (horizontal dotted lines denote ε = 2).

```
$./stlmc ./therm.model -bound 5 -time-bound 25 -threshold 2 \
         -goal f2 -solver dreal -two-step -parallel -visualize
result : counterexample found at bound 2 (14.70277s)
```

Similarly, the following command verified the formula f1 up to bounds N = 5 and τ = 25 with respect to ε = 0.5 in about 819 s using dReal:

```
$./stlmc ./therm.model -bound 5 -time-bound 25 -threshold 0.5 \
         -goal f1 -solver dreal -two-step -parallel
result : True (818.73110s)
```
STLmc provides a command to visualize counterexamples for robust STL model checking. It can generate images representing counterexample trajectories and robustness degrees. Figure 3 shows the visualization graphs, showing the values of variables or robustness degrees over time, generated for the formula f2 = □<sub>[2,4]</sub>(x<sub>0</sub> − x<sub>1</sub> ≥ 4 → ♦<sub>[3,10]</sub>(x<sub>0</sub> − x<sub>1</sub> ≤ −3)) with the subformulas:

$$\begin{aligned} \mathbf{f2\_1} &= x\_0 - x\_1 \ge 4 \to \Diamond\_{[3,10]}(x\_0 - x\_1 \le -3) \qquad &\mathbf{f2\_2} = \neg(x\_0 - x\_1 \ge 4) \\ \mathbf{f2\_3} &= \Diamond\_{[3,10]}(x\_0 - x\_1 \le -3) \qquad &\mathbf{p\_1} = x\_0 - x\_1 \ge 4 \qquad &\mathbf{p\_2} = x\_0 - x\_1 \le -3 \end{aligned}$$

The robustness degree of f2 is less than ε at time 0, since the robustness degree of f2<sub>1</sub> goes below ε in the interval [2, 4], which is because the degrees of both f2<sub>2</sub> and f2<sub>3</sub> are less than ε in [2, 4]. The robustness degree of f2<sub>3</sub> is less than ε in [2, 4], since the robustness degree of p<sub>2</sub> is less than ε in [5, 14] = [2, 4] + [3, 10].

**Fig. 4.** The STLmc architecture

# **4 Algorithms and Implementation**

Figure 4 shows the architecture of the STLmc tool. The tool first reduces robust STL model checking to Boolean STL model checking using ε-strengthening. It then applies an existing SMT-based STL model checking algorithm [3,29]. The satisfiability of the SMT encoding can be checked directly using an SMT solver, or using the two-step solving algorithm to improve the performance for ODE dynamics. Our tool is implemented in around 9,500 lines of Python code.

#### **4.1 Reduction to Boolean STL Model Checking**

As usual in model checking, robust STL model checking is equivalent to finding a counterexample. Specifically, an STL formula ϕ is not satisfied on a hybrid automaton H with respect to a robustness threshold ε > 0 iff there exists a counterexample for which the robustness degree of ¬ϕ is greater than or equal to −ε. (Formally, ¬(∀σ ∈ H. ρ<sub>τ</sub>(ϕ, σ, t) > ε) iff ∃σ ∈ H. ρ<sub>τ</sub>(¬ϕ, σ, t) ≥ −ε.)

Consider a state proposition x < 0. Its robust model checking is equivalent to finding a counterexample σ ∈ H with ρ<sub>τ</sub>(x ≥ 0, σ, t) ≥ −ε, which is equivalent to ρ<sub>τ</sub>(x ≥ −ε, σ, t) ≥ 0. Observe that x ≥ −ε is *weaker* than x ≥ 0 by ε. The notion of ε-weakening was first introduced in [21] for first-order logic, and we extend the definitions of ε-weakening and ε-strengthening to STL as follows.

**Definition 3.** *The ε-weakening* ϕ<sup>−ε</sup> *and ε-strengthening* ϕ<sup>+ε</sup> *of* ϕ *are defined as follows:* (p<sup>−ε</sup>)(s) = p(s) + ε *and* (p<sup>+ε</sup>)(s) = p(s) − ε *for a state* s*, and:*

$$\begin{aligned} (\neg \varphi)^{-\epsilon} &\equiv \neg (\varphi^{+\epsilon}) & (\varphi_1 \wedge \varphi_2)^{-\epsilon} &\equiv \varphi_1^{-\epsilon} \wedge \varphi_2^{-\epsilon} & (\varphi_1 \mathbf{U}_I \varphi_2)^{-\epsilon} &\equiv \varphi_1^{-\epsilon} \mathbf{U}_I \varphi_2^{-\epsilon} \\ (\neg \varphi)^{+\epsilon} &\equiv \neg (\varphi^{-\epsilon}) & (\varphi_1 \wedge \varphi_2)^{+\epsilon} &\equiv \varphi_1^{+\epsilon} \wedge \varphi_2^{+\epsilon} & (\varphi_1 \mathbf{U}_I \varphi_2)^{+\epsilon} &\equiv \varphi_1^{+\epsilon} \mathbf{U}_I \varphi_2^{+\epsilon} \end{aligned}$$

Finding a counterexample of ϕ for robust STL model checking can be reduced to finding a counterexample of the ε-strengthening ϕ<sup>+ε</sup> for Boolean STL model checking. The satisfaction of ϕ under the Boolean STL semantics [29,31] is denoted by σ, t |=<sub>τ</sub> ϕ. We have the following theorem (see our report [42] for details).

**Theorem 1.** *(1)* ∃σ ∈ H. σ, t |=<sub>τ</sub> ¬(ϕ<sup>+ε</sup>) *implies* ∃σ ∈ H. ρ<sub>τ</sub>(¬ϕ, σ, t) ≥ −ε*, and (2)* ∀σ ∈ H. σ, t ⊭<sub>τ</sub> ¬(ϕ<sup>+ε</sup>) *implies* ∀σ ∈ H. ρ<sub>τ</sub>(ϕ, σ, t) ≥ ε*.*

As a consequence, a counterexample of ϕ<sup>+ε</sup> for Boolean STL model checking is also a counterexample of ϕ for robust STL model checking. If there is no counterexample of ϕ<sup>+ε</sup> for Boolean STL model checking, then ϕ is satisfied on H with respect to any robustness threshold 0 < ε′ < ε. It is worth noting that ϕ may not be satisfied on H with respect to ε itself.
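The ε-weakening and ε-strengthening of Definition 3 are a straightforward mutually recursive structural transformation. The sketch below is our own illustration (the tuple encoding of formulas is not STLmc's internal representation); propositions carry their quantitative semantics as a function, so e.g. x ≥ 0 weakens to x ≥ −ε, whose robustness is x + ε:

```python
def weaken(phi, eps):
    """eps-weakening of Definition 3: proposition robustness increases by eps
    (x >= 0, robustness x, becomes x >= -eps, robustness x + eps)."""
    kind = phi[0]
    if kind == 'prop':
        f = phi[1]
        return ('prop', lambda s, f=f: f(s) + eps)
    if kind == 'neg':                  # (neg phi)^{-eps} == neg (phi^{+eps})
        return ('neg', strengthen(phi[1], eps))
    if kind == 'and':
        return ('and', weaken(phi[1], eps), weaken(phi[2], eps))
    if kind == 'until':
        return ('until', phi[1], weaken(phi[2], eps), weaken(phi[3], eps))
    raise ValueError(kind)

def strengthen(phi, eps):
    """eps-strengthening: the dual, flipping direction at negations."""
    kind = phi[0]
    if kind == 'prop':
        f = phi[1]
        return ('prop', lambda s, f=f: f(s) - eps)
    if kind == 'neg':                  # (neg phi)^{+eps} == neg (phi^{-eps})
        return ('neg', weaken(phi[1], eps))
    if kind == 'and':
        return ('and', strengthen(phi[1], eps), strengthen(phi[2], eps))
    if kind == 'until':
        return ('until', phi[1], strengthen(phi[2], eps), strengthen(phi[3], eps))
    raise ValueError(kind)

# x >= 4 has robustness x - 4; strengthened by eps = 0.5 it becomes x >= 4.5.
p = ('prop', lambda x: x - 4.0)
sp = strengthen(p, 0.5)
```

At the state x = 4, the strengthened proposition has robustness −0.5 (harder to satisfy), while the weakened one has robustness +0.5, matching the duality through negation.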

#### **4.2 Boolean STL Model Checking Algorithm**

For Boolean STL model checking, there exist *refutationally complete* bounded model checking algorithms [3,29] with two bound parameters: τ for the time domain, and N for the number of mode changes and variable points. A time point t is a *variable point* if the truth value of some subformula of ϕ changes at t. The algorithms build an SMT encoding Ψ<sup>N,τ</sup><sub>H,¬ϕ</sub> of the Boolean STL model checking problem:

**Theorem 2.** *[3,29]* Ψ<sup>N,τ</sup><sub>H,¬ϕ</sub> *is satisfiable iff there is a counterexample trajectory* σ ∈ H*, with at most* N *variable points and mode changes, such that* σ, t |=<sub>τ</sub> ¬ϕ*.*

For hybrid automata with polynomial continuous dynamics, the satisfiability of the encoding Ψ can be precisely determined using standard SMT solvers, including Z3 [12] and Yices2 [17]. For ODE dynamics, the satisfiability of Ψ is undecidable in general, but there exist specialized solvers, such as dReal [22] and iSAT-ODE [18], that can approximately determine the satisfiability.

To support various SMT solvers, the implementation of STLmc utilizes a generic wrapper interface based on the SMT-LIB standard [5]. Therefore, a new SMT solver that follows SMT-LIB can easily be integrated with our tool. Moreover, STLmc can also detect the most suitable solver for a given input model; e.g., if the model has ODE dynamics, the tool chooses dReal.

The encoding Ψ includes universal quantification over time, e.g., because of invariant conditions. Several SMT solvers (including Z3 and Yices2) support such ∃∀-conditions, but at high computational cost [27]. For polynomial dynamics, we implement the encoding method of [10] to simplify ∃∀-conditions into quantifier-free formulas. For ODE dynamics, dReal natively supports ∃∀-conditions [23].

#### **4.3 Two-Step Solving Algorithm**

To reduce the complexity of ODE dynamics, we propose a two-step solving algorithm in Algorithm 1, inspired by the lazy SMT solving approach [38]:


We also implement a simple method to avoid redundant scenarios by minimizing scenarios. A scenario π = l<sub>1</sub> ∧ ··· ∧ l<sub>m</sub> is minimal if, for every i, the formula (¬l<sub>i</sub> ∧ ⋀<sub>j≠i</sub> l<sub>j</sub>) → Ψ, in which one literal of π is negated, is not valid. To minimize a scenario π, we use a dual propagation approach [33]. Since π implies Ψ, the formula π ∧ ¬Ψ is unsatisfiable. We compute an unsatisfiable core of π ∧ ¬Ψ using Z3 to extract a minimal scenario from π.

We parallelize the two-step solving algorithm by checking the satisfiability of the refinements in parallel. If any refinement is satisfiable and a counterexample is found, all other jobs are terminated. If all refinements checked in parallel are unsatisfiable, there is no counterexample. As shown in Sect. 5, this greatly improves the performance for the ODE cases in practice.

#### **Algorithm 1:** Two-Step SMT Solving Algorithm

```
Input: hybrid automaton H, STL formula ϕ, threshold ε, bounds τ and N
1: for k = 1 to N do
2:     Ψ ← abstraction of the encoding Ψ^{k,τ}_{H,¬(ϕ^{+ε})} without flow and inv
3:     while checkSat(Ψ) is Sat do
4:         π ← a minimal satisfying scenario
5:         π̂ ← the refinement of π with flow and inv
6:         if checkSat(π̂) is Sat then
7:             return counterexample(result.satAssignment)
8:         Ψ ← Ψ ∧ ¬π
9: return True
```
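Under the same stand-in assumptions (callables replacing the SMT calls, which are ours for illustration), the inner loop of the two-step algorithm for a single bound k can be sketched as:

```python
def two_step(scenarios, refine_sat):
    """One bound of the two-step loop, with SMT calls replaced by stand-ins:
    `scenarios` yields satisfying scenarios of the flow/inv-free abstraction,
    and `refine_sat(pi)` checks the refined scenario with flow and inv."""
    blocked = []                      # conjuncts 'not pi' added to Psi
    for pi in scenarios:              # while checkSat(Psi) is Sat
        if pi in blocked:             # pi no longer satisfies Psi
            continue
        if refine_sat(pi):            # checkSat(pi-hat) is Sat
            return pi                 # counterexample scenario
        blocked.append(pi)            # Psi := Psi and not pi
    return None                       # no counterexample at this bound
```

Returning `None` corresponds to the algorithm's `True` result for the bound; in the parallel variant, the refinement checks of several scenarios run concurrently instead of sequentially.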

# **5 Experimental Evaluation**

We evaluate the effectiveness of the STLmc model checker using a number of hybrid system benchmarks and nontrivial STL properties.<sup>4</sup> We use the following models, adapted from existing benchmarks [2,6,19,20,25,34]: load management for two batteries (Bat), two networked water tank systems (Wat), autonomous driving of two cars (Car), a railroad gate (Rail), two networked thermostats (Thm), a spacecraft rendezvous (Space), navigation of a vehicle (Nav), and a filtered oscillator (Oscil). We use a modified model with either linear, polynomial, or ODE dynamics to analyze the effect of different continuous dynamics. For each model, we use three STL formulas with nested temporal operators. More details on the benchmark models can be found in the longer report [42].

We measure the SMT encoding size and execution time for robust STL model checking, up to discrete bound N = 20 for linear models, N = 10 for polynomial models, and N = 5 for ODE models, with a timeout of 60 min. We use different time bounds τ and robustness thresholds ε for different models, since τ and ε depend on each model. As the underlying SMT solver, we use Yices2 for linear and polynomial models, and dReal, with precision δ = 0.001, for ODE models. We run both direct SMT solving (1-step) and two-step SMT solving (2-step), using 25 cores to parallelize the two-step solving algorithm. We ran all experiments on an Intel Xeon 2.8 GHz machine with 256 GB of memory.

The experimental results are summarized in Table 2, where |Ψ| denotes the size of the SMT encoding Ψ (in thousands), measured as the number of connectives in Ψ. For the model checking results, ⊤ indicates that the tool found no counterexample up to bound N, and ⊥ indicates that the tool found a counterexample at some bound k ≤ N. For the algorithms (Alg.), we write the one of the two with the better

<sup>4</sup> For reachability properties, STLmc has a similar performance to other SMT-based tools, because STLmc uses the same SMT encoding. Indeed, our previous work [29] shows that the underlying algorithm used for STLmc has comparable performance to other tools for reachability properties. Nonetheless, our companion report [42] also includes some experimental results comparing STLmc with four reachability analysis tools (HyComp [9], SpaceEx [20], Flow\* [8], and dReach [28]).


**Table 2.** Robust Bounded Model Checking of STL (Time in seconds)

performance. For the 2-step case, we also write the number of minimal scenarios generated (#π). In practice, two-step SMT solving timed out for all linear and polynomial models, and direct SMT solving timed out for all ODE models.

As shown in Table 2, our tool can perform robust model checking of nontrivial STL formulas for hybrid systems with different continuous dynamics. The ODE models generally take longer than the linear and polynomial models because of the high computational cost of ODE solving. Nevertheless, our parallelized two-step SMT solving method works well, and all model checking analyses finish before the timeout. In contrast, for linear and polynomial models with a larger discrete bound N ≥ 10, direct SMT solving is usually effective but the two-step method is not: there are too many scenarios, and scenario generation does not terminate within 60 min. The two algorithms implemented in our tool are thus complementary.

#### **6 Related Work**

There exist many tools for falsifying STL properties of hybrid systems, including Breach [14], S-TaLiRo [1], and TLTk [11]. STL falsification techniques are based on STL monitoring [13,32] and often use stochastic optimization techniques, such as ant-colony optimization [1], Monte-Carlo tree search [43], deep reinforcement learning [41], and so on. These techniques are often quite useful for finding counterexamples in practice but, as mentioned, cannot be used to verify STL properties of hybrid systems.

There exist many tools for analyzing reachability properties of hybrid systems based on reachable-set computation, including C2E2 [16], Flow\* [8], Hylaa [4], and SpaceEx [20]. They can be used to guarantee the correctness of invariant properties of the form p → □<sub>I</sub> q, but cannot verify general STL properties. In contrast, STLmc uses a refutation-complete bounded STL model checking algorithm to verify general STL properties, including complex ones.

Our tool is also related to SMT-based tools for analyzing hybrid systems, including dReach [28], HyComp [9], and HybridSAL [39]. These techniques also focus on analyzing invariant properties of hybrid systems, but some SMT-based tools, such as HyComp, can verify LTL properties of hybrid systems. Unlike STLmc, they cannot deal with general STL properties of hybrid systems.

#### **7 Concluding Remarks**

We have presented the STLmc tool for robust bounded model checking of STL properties of hybrid systems. STLmc can verify that, up to given bounds, the robustness degree of an STL formula ϕ is always greater than a given robustness threshold ε for all possible behaviors of a hybrid system. STLmc also provides a convenient user interface with an intuitive counterexample visualization.

Our tool leverages the reduction from robust model checking to Boolean model checking, and utilizes the refutation-complete SMT-based Boolean STL model checking algorithm to guarantee correctness up to given bounds and find subtle counterexamples. STLmc can deal with hybrid systems with (nonlinear) ODEs using dReal. We have shown using various hybrid system benchmarks that STLmc can effectively analyze nontrivial STL properties.

Future work includes extending our tool with other hybrid system analysis methods, such as reachable-set computation, besides SMT-based approaches.

**Acknowledgments.** This work was supported in part by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. 2021R1A5A1021944 and No. 2019R1C1C1002386).

#### **References**

1. Annpureddy, Y., Liu, C., Fainekos, G., Sankaranarayanan, S.: S-TaLiRo: a tool for temporal logic falsification for hybrid systems. In: Abdulla, P.A., Leino, K.R.M. (eds.) TACAS 2011. LNCS, vol. 6605, pp. 254–257. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19835-9_21



# **UCLID5: Multi-modal Formal Modeling, Verification, and Synthesis**

Elizabeth Polgreen1,2(B) , Kevin Cheang1(B) , Pranav Gaddamadugu<sup>1</sup>, Adwait Godbole<sup>1</sup> , Kevin Laeufer<sup>1</sup> , Shaokai Lin<sup>1</sup> , Yatin A. Manerkar<sup>3</sup>, Federico Mora<sup>1</sup> , and Sanjit A. Seshia<sup>1</sup>

<sup>1</sup> UC Berkeley, Berkeley, USA elizabeth.polgreen@ed.ac.uk <sup>2</sup> University of Edinburgh, Edinburgh, UK <sup>3</sup> University of Michigan, Ann Arbor, USA

**Abstract.** UCLID5 is a tool for the multi-modal formal modeling, verification, and synthesis of systems. It enables one to tackle verification problems for heterogeneous systems such as combinations of hardware and software, or those that have multiple, varied specifications, or systems that require hybrid modes of modeling. A novel aspect of UCLID5 is an emphasis on the use of syntax-guided and inductive synthesis to automate steps in modeling and verification. This tool paper presents new developments in the UCLID5 tool including new language features, integration with new techniques for syntax-guided synthesis and satisfiability solving, support for hyperproperties and combinations of axiomatic and operational modeling, demonstrations on new problem classes, and a robust implementation.

# **1 Overview**

Tools for formal modeling and verification are typically specialized for particular domains and for particular methods. For instance, software verification tools like Boogie [4] focus on modeling sequential software and Floyd-Hoare style reasoning, while hardware verifiers like ABC [5] are specialized for sequential circuits and SAT-based equivalence and model checking. Specialization makes sense when the problems fit well within a homogeneous problem domain with specific verification needs. However, there is an emerging class of problems, such as in security and cyber-physical systems (CPS), where the systems under verification are heterogeneous, the types of specifications to be verified are varied, or no single type of model is effective for verification. An example of such a problem is the verification of trusted computing platforms [37], which involve hardware and software components working in tandem, and where the properties to be checked include invariants, refinement checks, and hyperproperties. There is a need for automated formal methods and tools to handle this class of problems.

UCLID5 is a system for *multi-modal* formal modeling, verification, and synthesis that addresses the above need. UCLID5 is multi-modal in three important ways. First, it permits different modes of modeling, using axiomatic and operational semantics, or combinations of concurrent transition systems and procedural code. This enables modeling systems with multiple characteristics. Second, it offers a varied suite of specification modes, including first-order formulas in a combination of logical theories, temporal logic, inline assertions, pre- and post-conditions, system invariants, and hyperproperties. Third, it supports the first two capabilities with a varied suite of verification techniques, including Floyd-Hoare style proofs, k-induction and bounded model checking (BMC), verification of hyperproperties, and syntax-guided and inductive synthesis to provide more automation in tedious steps of verification, or to automate the modeling process (as proposed in [34]).

The UCLID5 framework was first proposed in 2018 [35], itself a major evolution of the much older UCLID system [6], one of the first satisfiability modulo theories (SMT) based modeling and verification tools. Since that publication [35], which laid out the vision for the tool and described a preliminary implementation, the utility of the tool has been demonstrated on several problem classes (e.g., [7,8,25]), such as for verifying security across the hardware-software interface. The syntax has been extended and state-of-the-art methods for syntax-guided synthesis (SyGuS) have also been integrated into the tool [28], including new capabilities for satisfiability and synthesis modulo oracles [32]. This tool paper presents an overview of the latest version of UCLID5, highlighting novel multi-modal aspects of the tool, as well as the new features supported since 2018 [35]. The paper is structured as follows: in Sect. 2 we give an overview of the UCLID5 tool; in Sect. 3 we detail different multi-modal aspects of the tool, highlighting new features; and in Sect. 4 we present a case study using UCLID5 to verify a Trusted Abstract Platform. We cover related work in Sect. 5. The new features we highlight are:

- direct support for specifying and verifying hyperproperties, via the hyperinvariant and hyperaxiom constructs (Sect. 3.1);
- full integration of syntax-guided synthesis across all verification modes (Sect. 3.1);
- reasoning with external oracles via oracle function symbols (Sect. 3.2);
- groups and finite quantifiers, enabling combined operational and axiomatic modeling (Sect. 3.2).


#### **2 Overview of UCLID5**

In verification mode, UCLID5 reduces the question of whether a model satisfies a given specification to a set of constraints that can be solved by an off-the-shelf SMT solver. In synthesis mode, UCLID5 reduces the problem of finding an interpretation for an uninterpreted function such that the specification is satisfied to a SyGuS problem that can be solved by an off-the-shelf SyGuS solver. To do so, UCLID5 performs the following main tasks, as shown in Fig. 1:

*Front End:* UCLID5 takes models written in the UCLID5 language as input. The command-line front-end allows user configuration, including specifying the external SMT-solver/SyGuS-solver to be used, as well as enabling certain utilities such as automatically converting uninterpreted functions to arrays. The parser builds an abstract syntax tree from the model.

*AST Passes:* UCLID5 performs a number of transformations and checks on the abstract syntax tree, including type-checking and inlining of procedures. This intermediate representation supports limited control flow such as if-statements and switch-cases, but loops are not permitted in procedural code: bounded for-loops are removed via unrolling, and while loops are replaced with user-provided invariants. Unbounded control flow can instead be handled by modeling the system as transition systems, where each module consists of a transition system with an init block and a next block, each represented as a separate AST.
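To illustrate the unrolling pass, the following Python sketch removes bounded for-loops from a hypothetical mini-AST by emitting one copy of the loop body per iteration. The AST types, the naive textual substitution of the loop index, and the example program are our own illustrative assumptions, not UCLID5's actual internals:

```python
from dataclasses import dataclass

# Hypothetical mini-AST: only assignments and bounded for-loops.
@dataclass
class Assign:
    target: str
    expr: str

@dataclass
class BoundedFor:
    var: str
    lo: int
    hi: int     # inclusive upper bound
    body: list

def unroll(stmts):
    """Replace every bounded for-loop by one copy of its body per
    iteration, substituting the concrete loop index into each
    statement's expression (naive textual substitution, for brevity)."""
    out = []
    for s in stmts:
        if isinstance(s, BoundedFor):
            for i in range(s.lo, s.hi + 1):
                out.extend(Assign(b.target, b.expr.replace(s.var, str(i)))
                           for b in unroll(s.body))
        else:
            out.append(s)
    return out

# for i in 0..2 { sum = sum + i }  becomes three straight-line assignments.
prog = [BoundedFor("i", 0, 2, [Assign("sum", "sum + i")])]
flat = unroll(prog)
```

After this pass the procedural code is loop-free, which is what allows the later passes to treat it as straight-line code.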

*Symbolic Simulator:* The symbolic simulator performs a simulation of the transition system in the model, according to the verification command provided, and produces a set of assertions. For instance, if bounded model checking is used, UCLID5 will symbolically execute the main module a bounded number of times. UCLID5 encodes the violation of each independent verification condition as a separate assertion tree.
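The idea behind the bounded unrolling can be pictured with an explicit-state Python sketch that steps a transition system a fixed number of times and checks an invariant at each step. The Fibonacci-style transition and the invariant a <= b are assumed to match the paper's running example; a real symbolic simulator would emit SMT constraints rather than execute concretely:

```python
def bmc(init, step, invariant, bound):
    """Explicit-state analogue of bounded model checking: run the
    transition system `bound` steps from the initial state and return
    the first depth at which the invariant is violated, or None if it
    holds throughout."""
    state = init
    for k in range(bound + 1):
        if not invariant(state):
            return k          # counterexample depth
        state = step(state)
    return None

# Fibonacci-style module (assumed semantics of the running example):
init = (0, 1)                           # a, b
step = lambda s: (s[1], s[0] + s[1])    # a, b := b, a + b
inv = lambda s: s[0] <= s[1]            # invariant: a <= b
```

Here `bmc(init, step, inv, 20)` finds no violation, mirroring a BMC run that passes up to the chosen bound.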

*Synth-Lib Interface:* UCLID5 supports both synthesis and verification. The Synth-Lib interface constructs either a verification or a synthesis problem from the assertions generated by the symbolic simulator. The verification problems are passed to the SMT-LIB interface, which converts each assertion in UCLID5's intermediate representation to an assertion in SMT-LIB. Similarly, the synthesis problems are passed to the SyGuS-IF interface, which converts each assertion to an assertion in SyGuS-IF. The verification and synthesis problems are then passed to the appropriate provided external solver and the result is reported back to the user.

**Fig. 1.** Architecture of UCLID5

**Basic UCLID5 Models.** A simple UCLID5 model that computes the Fibonacci sequence is shown in Fig. 2. UCLID5 models are contained within modules, which comprise three parts: a system model, represented using combinations of sequential, concurrent, operational, and axiomatic modeling, as described in Sect. 3.2; a system specification, described in Sect. 3.1; and a proof script that specifies the verification tasks UCLID5 should perform to prove that the system satisfies its specification, using a variety of supported verification and synthesis techniques described in Sect. 3.1.

### **3 Multi-modal Language Features**

#### **3.1 Multi-modal Verification and Synthesis**

**Specification.** UCLID5 supports a variety of different types of specifications. The standard properties supported include inline assertions and assumptions in sequential code, pre-conditions and post-conditions for procedures, and global axioms and invariants (both as propositional predicates, and temporal invariants in Linear Temporal Logic (LTL)).

The latest version of UCLID5 further provides direct support for hyperinvariants and hyperaxioms (for *k*-safety). This new support for direct hyperproperties comprises two new language constructs: hyperaxiom and hyperinvariant. The former places an assumption on the behavior of the module, as if *n* instances of the module were instantiated, and the latter is an invariant over *n* instances of the module, which is verified via the usual verification methods. A variable *x* from the *n*th instance of the module is referred to in the predicate as *x.n*, and the number of instantiated modules is determined by the maximum *n* appearing in the invariants and axioms. For example, hyperinvariant[2] det_xy: y.1 == y.2 asserts that a 2-safety hyperproperty holds.
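A 2-safety hyperinvariant amounts to a claim about every pair of executions, which can be checked on the self-composition of the system. The Python sketch below makes this concrete by stepping two instances of a toy transition system in lockstep; the system, the property, and the bounded exploration are all invented for illustration and stand in for real model checking:

```python
from itertools import product

def check_2safety(inits, step, hyperinv, bound):
    """Check a 2-safety hyperinvariant up to `bound` steps by running
    the self-composition: two instances of the (deterministic) system
    stepped in lockstep from every pair of initial states."""
    for s1, s2 in product(inits, inits):
        for _ in range(bound + 1):
            if not hyperinv(s1, s2):
                return False
            s1, s2 = step(s1), step(s2)
    return True

# Toy system: state is (y, secret). y evolves independently of the
# secret, so y.1 == y.2 holds across any two instances.
inits = [(0, 0), (0, 7)]
step = lambda s: (s[0] + 1, s[1] * 3)
hyperinv = lambda a, b: a[0] == b[0]    # y.1 == y.2
```

A hyperinvariant mentioning the secret component (e.g. secret.1 == secret.2) would fail on the initial pair, which is exactly the kind of counterexample the 2-safety check reports.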

**Verification.** To verify these specifications, we implement multiple classic techniques. As a result, once a model is written in UCLID5, the user can deploy a combination of verification techniques, depending on the properties targeted. UCLID5 supports a range of verification techniques including: bounded model checking (for LTL, hyperinvariants, and assertion-based properties); induction and k-induction for assertion-based invariants and hyperinvariants; and verification of pre- and post-conditions on procedures and hyperinvariants.
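The shape of a k-induction check can be sketched over a small finite state space, with the SMT queries of the real procedure replaced by exhaustive enumeration. The mod-8 counter below is an invented example, not taken from the paper:

```python
def k_induction(states, init, step, inv, k):
    """Finite-state sketch of k-induction. Base case: the invariant
    holds on the first k states from init. Step case: whenever k
    consecutive states all satisfy the invariant, so does the next
    one; here checked by enumerating every window start in the
    finite domain rather than asking an SMT solver."""
    s = init
    for _ in range(k):                      # base case
        if not inv(s):
            return False
        s = step(s)
    for s0 in states:                       # inductive step
        window = [s0]
        for _ in range(k - 1):
            window.append(step(window[-1]))
        if all(inv(w) for w in window) and not inv(step(window[-1])):
            return False
    return True

# Invented example: a mod-8 counter embedded in a larger raw state
# space. The invariant v < 8 is 1-inductive; v < 4 is not (state 3
# steps to 4).
step8 = lambda v: (v + 1) % 8
```

The same skeleton applies to hyperinvariants once the state is taken to be a tuple of instance states, as in the self-composition above.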

As an exemplar of the utility of multi-modal verification, consider the hyperproperty based models verified by Sahai et al. [33]. These models use both procedure verification and induction to verify k-trace properties.

**Synthesis.** The latest version of UCLID5 integrates program synthesis fully across all the verification modes previously described. Specifically, users are able to declare and use *synthesis functions* anywhere in their models, and UCLID5 will seek to automatically synthesize function bodies for these functions such that the user-selected verification task will pass. In this section, we give an illustrative example of synthesis in UCLID5, we provide the necessary background on program synthesis, and then we formulate the existing verification techniques inside of UCLID5 for synthesis.

**Fig. 2.** UCLID5 Fibonacci model. Part 3 shows the new synthesis syntax, and how to find an auxiliary invariant.

Consider the UCLID5 model in Fig. 2. The user wants to prove by induction that the invariant a_le_b at line 13 always holds. Unfortunately, the proof fails because the invariant is not inductive. Without synthesis, the user would need to manually strengthen the invariant until it became inductive. However, the user can ask UCLID5 to automatically do this for them. Figure 2 demonstrates this on lines 16, 17 and 18. Specifically, the user specifies a function to synthesize called h at lines 16 and 17, and then uses h at line 18 to strengthen the existing set of invariants. Given this input, UCLID5, using e.g. cvc5 [3] as a syntax-guided synthesis engine, will automatically generate the function h(x, y) = x >= 0, which completes the inductive proof.

In this example, the function to synthesize represents an inductive invariant. However, functions to synthesize are treated exactly like any interpreted function in UCLID5: the user could have called h anywhere in the code. Furthermore, this example uses induction and a global invariant, however, the user could also have used a linear temporal logic (LTL) specification and bounded model checking (BMC). In this sense, our integration is fully flexible and generic. Furthermore, the integration scheme allows us to enable synthesis for any verification procedure in UCLID5, by simply letting users declare and use functions to synthesize and relying on existing SyGuS-IF solvers to carry out the automated reasoning.
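The synthesis loop can be pictured as searching a grammar of candidate bodies for h until the strengthened invariant becomes inductive. The Python sketch below enumerates a tiny hand-written candidate grammar and checks inductiveness on a bounded integer domain; both the grammar and the bounded check are our own stand-ins for the SyGuS solver and the SMT-based inductiveness query:

```python
from itertools import product

def is_inductive(inv, step, init, states):
    """Bounded stand-in for an SMT inductiveness check: the invariant
    must hold in the initial state and be preserved by one transition
    from every state in the finite domain that satisfies it."""
    return inv(init) and all(inv(step(s)) for s in states if inv(s))

step = lambda s: (s[1], s[0] + s[1])    # Fibonacci: a, b := b, a + b
base = lambda s: s[0] <= s[1]           # the user's invariant a <= b
domain = list(product(range(-8, 9), repeat=2))

# Hand-written candidate grammar for h (an assumption for illustration).
candidates = [
    ("a == b", lambda s: s[0] == s[1]),
    ("a >= 0", lambda s: s[0] >= 0),
    ("b >= 0", lambda s: s[1] >= 0),
]

found = None
for name, h in candidates:
    strengthened = lambda s, h=h: base(s) and h(s)
    if is_inductive(strengthened, step, (0, 1), domain):
        found = name
        break
# found is "a >= 0": a <= b alone is not inductive (a could be
# negative), but a <= b together with a >= 0 is.
```

A real SyGuS engine explores a declared grammar symbolically rather than a fixed list, but the accept/reject criterion plays the same role.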

#### **3.2 Multi-modal Modeling**

**Combining Concurrent and Sequential Modeling.** A unique feature of the UCLID5 modeling language is the ability to easily combine sequential and concurrent modeling. A user can express models of sequential programs, including standard control flow, procedure calls, and sequential updates, in a sequential model, and then combine these components within a system designed for concurrent modeling based on transition systems. The sequential program modeling is inspired by systems such as Boogie [4] and allows the user to port Boogie models to UCLID5. Concurrent modeling is done by defining transition systems with a set of initial states and a transition relation: within UCLID5, each module is a transition system, and a main module can be defined that controls when each child module is stepped. For an example of this combination of sequential and concurrent modeling, we refer the reader to the CPU example presented in the original UCLID5 paper [35], which uses concurrent modules to instantiate multiple CPU modules, modeled as transition systems, with sequential code modeling the execution of instructions, and to the case study in Sect. 4.

**Reasoning with External Oracles.** New in the latest version, UCLID5 supports modeling with *oracle function symbols* [32] in both verification and synthesis. Namely, a user can include "oracle functions" in any UCLID5 model, where an oracle function is a function without a provided implementation, but which is associated with a user-provided external binary that can be queried by the solver. We note that oracle functions (and functions in general) can only be first-order within the UCLID5 modeling language, i.e., functions cannot receive functions as arguments.

This support is useful in cases where some components of the system are difficult or impossible to model, but could be compiled into a binary that the solver can query; or where the model of the system would be challenging for an SMT solver to reason about (for instance, highly non-linear arithmetic), and it may be better to outsource that reasoning to an external binary.
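The flavour of oracle-guided reasoning can be sketched as follows: the checker never sees a logical definition of primality, it only queries an oracle, here an ordinary Python function standing in for the external isprime binary that the solver would invoke. The property and candidate range are invented for illustration:

```python
def isprime_oracle(x):
    """Stand-in for the external `isprime` binary: the checker only
    observes its input/output behaviour, never its implementation."""
    if x < 2:
        return False
    return all(x % d for d in range(2, int(x ** 0.5) + 1))

def check_with_oracle(candidates, oracle, prop):
    """Bounded check of a property mentioning the oracle function:
    instead of axiomatising primality in the logic, each needed value
    is obtained by querying the oracle, as in SMTO/SyMO."""
    for x in candidates:
        answer = oracle(x)            # oracle query
        if not prop(x, answer):
            return x                  # counterexample
    return None

# Property: no even number above 2 is prime (true on this range).
cex = check_with_oracle(range(4, 100, 2), isprime_oracle,
                        lambda x, is_p: not is_p)
```

The point is that the reasoning engine treats the oracle as a black box whose answers it can request but never inspect, exactly the contract an oracle function symbol gives the solver.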

UCLID5 supports oracle function symbols in verification by interfacing with a solver that supports Satisfiability Modulo Theories and Oracles (SMTO) [32], and in synthesis by interfacing with a solver that supports Synthesis Modulo Oracles (SyMO) [32].

Oracle function symbols are declared like functions, with the keyword oracle, and an annotation pointing to the binary implementation. For instance oracle function [isprime] Prime (x: integer): boolean would indicate to the solver that the binary isprime takes an integer as input and returns a boolean. This is translated into the corresponding syntax in SMTO or SyMO, as detailed in [30].

An exemplar of such reasoning in a synthesis file is available in the artifact [31], where we use UCLID5 to synthesize a safe and stabilizing controller for a Linear Time Invariant system, similar to Abate et al. [1].

**Combining Operational and Axiomatic Modeling.** UCLID5 can model a system being verified using an operational (transition system-based) approach, as Fig. 2 shows. However, UCLID5 also supports modeling a system in an *axiomatic* manner, whereby the system is specified as a set of properties over traces. Any execution satisfying the properties is allowed by the system, and any execution violating the properties is disallowed. Axiomatic modeling can provide order-of-magnitude performance improvements over operational models in certain cases [2], and is often well suited to systems with large amounts of non-determinism. We provide an example of fully axiomatic modeling in the artifact [31].

However, uniquely, UCLID5 allows users to specify multi-modal systems using a combination of operational and axiomatic modeling. In such models, some constraints on the execution are enforced by the initial state and transition relation (operational modeling), while others are enforced through axiomatic invariants (axiomatic modeling). This allows the user to choose the mode of modeling most appropriate to each constraint. For example, the ILA-MCM work [39] combined operational ILA (Instruction Level Abstraction) models to describe the functional behavior of processing elements with memory consistency model (MCM) orderings that are more naturally specified axiomatically [2]. (MCM orderings constrain shared-memory communication and synchronization between multiple processing elements.) The combined model, used for System-on-Chip verification, worked by sharing variables (called "facets") between both the models. UCLID5 makes it much easier to perform such a combination.

Figure 3 depicts parts of a UCLID5 model of microarchitectural execution that uses both operational and axiomatic modeling (similar to that from the ILA-MCM work), based on the *µ*spec specifications of COATCheck [24]. In this model, the steps of instruction execution are driven by the init and next blocks, i.e., the operational component of the model. Multiple instructions can step at any time (curTime denotes the current time in the execution), but they can only take one step per timestep. Meanwhile, axioms such as the fifoFetch axiom enforce ordering *between* the execution of multiple instructions. The fifoFetch axiom specifically enforces that instructions in program order on the same core must be fetched in program order. (Enforcing this order is tricky using operational modeling alone). The transition rules and axioms operate over the same data structures, ensuring that executions of the final model abide by both sets of constraints.

*µ*spec models routinely function by grounding quantifiers over a finite set of instructions. Thus, to fully support *µ*spec axiomatic modeling, we introduce two new language features: *groups* and *finite quantifiers*. A group is a set of objects of a single type. A group can have any number of elements, but it must be finite, and the group is immutable once created. For instance, the group testInstrs in Fig. 3 consists of four instructions. Finite quantifiers, meanwhile, are used to quantify over group elements.
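Grounding a finite quantifier over a group simply expands it into a conjunction of instances, one per group element. The sketch below does this in Python for a hypothetical group of four instructions and a fifoFetch-style ordering axiom; the Instr fields and concrete timestamps are our own assumptions, not the contents of Fig. 3:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    core: int
    po: int       # program-order index on its core
    fetch: int    # fetch timestamp

def finite_forall(group, pred):
    """Ground a finite quantifier: the formula
    finite_forall i in G :: P(i) becomes the conjunction of P(i)
    over every element of the finite group G."""
    return all(pred(i) for i in group)

# Hypothetical group of four instructions (two per core).
testInstrs = [Instr(0, 0, 1), Instr(0, 1, 3),
              Instr(1, 0, 2), Instr(1, 1, 4)]

# fifoFetch-style axiom: same-core, program-ordered instructions are
# fetched in program order.
fifo_fetch = finite_forall(
    testInstrs,
    lambda i1: finite_forall(
        testInstrs,
        lambda i2: not (i1.core == i2.core and i1.po < i2.po)
                   or i1.fetch < i2.fetch))
```

Because groups are finite and immutable, this expansion is always possible, which is what lets the axioms coexist with the operational transition relation in one SMT query.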

This example showcases UCLID5's highly flexible multi-modal modeling capability. Models can be purely operational, purely axiomatic, or a combination of the two. Note that axiomatic modeling relies on the new language features finite forall and groups. For a further example of axiomatic and operational multi-modal modeling, we refer the reader to the case study checking reachability properties in reactive embedded systems described in the artifact [31].

**Fig. 3.** UCLID5 model that incorporates both operational modeling (through the init and next blocks) and axiomatic modeling (through the axiom keyword).

# **4 Case Study: TAP Model**

The final case study we wish to describe verifies a model of a trusted execution environment. Trusted execution environments [10,11,17,20] often provide a software interface for users to execute enclaves, using hardware primitives to enforce memory isolation. In contrast to software which requires reasoning about sequential code, hardware modeling uses a paradigm that permits concurrent updates to a system. Moreover, verifying hyperproperties such as integrity requires reasoning about multiple instances of a system which most existing tools are not well suited for. In this section, we present the UCLID5 port<sup>1</sup> of the Trusted

<sup>1</sup> https://github.com/uclid-org/trusted-abstract-platform/.

Abstract Platform (TAP), which was originally<sup>2</sup> written in Boogie and introduced by Subramanyan et al. [37] to model an abstract idealized trusted enclave platform. We demonstrate how UCLID5's multi-modal support alleviates the difficulties of modeling the TAP in existing tools.

**Fig. 4.** UCLID5 transition system-styled model of TAP and the integrity proof.

**Modeling the TAP and Proving Integrity.** The UCLID5 model of TAP in Fig. 4 demonstrates some of UCLID5's key features: the enclave operations of the TAP model (e.g. launch) are implemented as procedures, and a transition relation of the TAP is defined using a next block that either executes an untrusted adversary operation or the trusted enclave, which in turn executes one of the enclave operations atomically. Proving the integrity hyperproperty on the TAP thus only requires two instantiations of the TAP model, specifying the integrity invariants, and defining a next block which steps each of the TAP instances as shown in the integrity proof module. The integrity proof in UCLID5 uses inductive model checking.

<sup>2</sup> https://github.com/0tcb/TAP.

**Results and Statistics of the TAP Modules.** Table 1 shows the approximate size of the TAP model in both Boogie and UCLID5. #pr, #fn, #an, and #ln refer to the number of procedures, functions, annotations, and lines of code, respectively. Annotations are the number of loop invariants, assertions, assumptions, and pre- and post-conditions that were manually specified. The verification time includes compilation and solving.

**Table 1.** Boogie vs. UCLID5 model results

While the #ln for the TAP model in UCLID5 is higher than that of the Boogie model due to stylistic differences, the crucial difference is in the integrity proof. The original Boogie model implements the TAP model and integrity proof as procedures, where the transition of the TAP model is implemented as a while loop. This lack of support for modeling transition systems means that a hyperproperty such as integrity introduces duplicate state variables and requires context switching and additional procedures for the new variables, which makes the model difficult to maintain and self-composition unwieldy. In UCLID5, the proof is no longer implemented as a procedure; rather, we create instances of the TAP model. We also note that the number of annotations is smaller in UCLID5 than in Boogie for both the TAP model and the proof. Additionally, the UCLID5 model lends itself to more direct verification of hyperproperties.

The verification results were obtained on a machine with a 2.6 GHz 6-core Intel Core i7 and 16 GB of RAM running macOS. As shown on the right of Table 1, the verification runtimes of the Boogie and UCLID5 models and proofs are comparable.

#### **5 Related Work**

There are a multitude of verification and synthesis tools related to UCLID5. In this brief review, we highlight prominent examples and contrast them with UCLID5 along the key language features described in Sect. 3.

UCLID5 allows users to combine sequential and concurrent modeling (see Sect. 3.2). Most existing tools primarily support modeling either sequential computation, e.g. [4,21,38], or concurrent computation, e.g. [5,9,14,26,27]. Users can often work around a tool's modeling focus by manually encoding support for other computation paradigms (for example, Dafny can be used to model concurrent systems [22]), but this is not always straightforward, and limited support for different paradigms can manifest as limitations in downstream applications. For example, the Serval [29] framework, based on Rosette, cannot reason about concurrent code. UCLID5, to the best of our knowledge, is the only verification tool natively supporting modeling with external oracles.

UCLID5 supports different kinds of specifications and verification procedures (see Sect. 3.1). Most existing tools [5,9,21] do not support multi-modal verification at all, and tools that do offer multi-modal verification do not offer the same range of options as UCLID5. For example, [26] does not support linear temporal logic, and [13,27] do not support hyperproperty verification.

Finally, UCLID5 supports a generic integration with program synthesis (see Sect. 3.1), so related work includes a number of synthesis engines. The SKETCH system [36] synthesizes expressions to fill holes in programs, and has subsequently been applied to program repair [16,19]. UCLID5 is more flexible than this work: it allows users to declare unknown functions even in the verification annotations, and it supports multiple verification algorithms and types of properties. Rosette [38] provides support for synthesis and verification, but, unlike UCLID5, its synthesis is limited to bounded specifications of sequential programs, and external synthesis engines are not supported. Synthesis algorithms have been used to assist in verification tasks, such as proving safety and termination of loops [12] and generating invariants [15,40], but none of this work to date integrates program synthesis fully into an existing verification tool. Before the new synthesis integration, UCLID5 supported synthesis of inductive invariants. The key insight of this work is to generalize the synthesis support and to unify all synthesis tasks by re-using the verification back-end.

#### **6 Software Project**

The source code for UCLID5 is made publicly available under a BSD license<sup>3</sup>. UCLID5 is maintained by the UCLID5 team<sup>4</sup>, and we welcome patches from the community. Additional front-ends are available for UCLID5, including translators from Firrtl [18]<sup>5</sup> and RISC-V binaries<sup>6</sup> to UCLID5 models. An artifact including the code for the case studies in this paper is available [31].

**Acknowledgments.** The UCLID5 project is grateful for the significant contributions by Pramod Subramanyan, one of the original creators of the tool. This work was supported in part by NSF grant 1837132, the DARPA grant FA8750-20-C-0156 (LOGiCS), by the Qualcomm Innovation Fellowship, and by Amazon and Intel.

### **References**


<sup>3</sup> https://github.com/uclid-org/uclid.

<sup>4</sup> https://github.com/uclid-org/uclid/blob/master/CONTRIBUTORS.md.

<sup>5</sup> https://github.com/uclid-org/chiselucl.

<sup>6</sup> https://github.com/uclid-org/riscverifier.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
