**Einar Broch Johnsen Manuel Wimmer (Eds.)**

# **Fundamental Approaches to Software Engineering**

**25th International Conference, FASE 2022 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022 Munich, Germany, April 2–7, 2022 Proceedings**

## Lecture Notes in Computer Science 13241

Founding Editors

Gerhard Goos, Germany Juris Hartmanis, USA

## Editorial Board Members

Elisa Bertino, USA Wen Gao, China Bernhard Steffen, Germany Gerhard Woeginger, Germany Moti Yung, USA

## Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany Benjamin C. Pierce, University of Pennsylvania, USA Bernhard Steffen, University of Dortmund, Germany Deng Xiaotie, Peking University, Beijing, China Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this series at https://link.springer.com/bookseries/558


Editors Einar Broch Johnsen University of Oslo Oslo, Norway

Manuel Wimmer Johannes Kepler University Linz Linz, Austria

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-99428-0 ISBN 978-3-030-99429-7 (eBook) https://doi.org/10.1007/978-3-030-99429-7

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## ETAPS Foreword

Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital of Bavaria, in Germany.

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organizing these conferences in a coherent, highly synchronized conference program enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops took place that attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University College London, UK, and Cornell University, USA) and Tomáš Vojnar (Brno University of Technology, Czech Republic) and the conference-specific invited speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck (University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated learning.

As this event was the 25th edition of ETAPS, part of the program was a special celebration where we looked back on the achievements of ETAPS and its constituting conferences in the past, but we also looked into the future, and discussed the challenges ahead for research in software science. This edition also reinstated the ETAPS mentoring workshop for PhD students.

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the Technical University of Munich (TUM) and the LMU Munich. The former was founded in 1868, and the latter in 1472 as the 6th oldest German university still running today. Together, they have 100,000 enrolled students, regularly rank among the top 100 universities worldwide (with TUM's computer-science department ranked #1 in the European Union), and their researchers and alumni include 60 Nobel laureates. The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer (general, financial, and workshop chair), Julia Eisentraut (organization chair), and Alexandros Evangelidis (local proceedings chair).

ETAPS 2022 was further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König (Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch Johnsen (Oslo), Dana Fisman (Be'er Sheva), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Roşu (Illinois), Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina (Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastián Uchitel (London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz).

I'd like to take this opportunity to thank all authors, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2022.

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their enormous efforts to make ETAPS a fantastic event.

February 2022 Marieke Huisman ETAPS SC Chair ETAPS e.V. President

## Preface

This volume contains the papers presented at FASE 2022, the 25th International Conference on Fundamental Approaches to Software Engineering. FASE 2022 was organized as part of the annual European Joint Conferences on Theory and Practice of Software (ETAPS 2022).

FASE is concerned with the foundations on which software engineering is built, including topics like software engineering as an engineering discipline, requirements engineering, software architectures, software quality, model-driven development, software processes, software evolution, AI-based software engineering, and the specification, design, and implementation of particular classes of systems, such as (self-)adaptive, collaborative, AI, embedded, distributed, mobile, pervasive, cyber-physical, or service-oriented applications.

FASE 2022 received 61 submissions and used a double-anonymous reviewing process. Each submission was reviewed by three Program Committee members. After an online discussion period, the Program Committee accepted 17 papers as part of the conference program (28% acceptance rate).

FASE 2022 hosted the 4th International Competition on Software Testing (Test-Comp 2022). Test-Comp is an annual comparative evaluation of testing tools. This edition had 12 participating tools from academia and industry. These proceedings contain the competition report and two system descriptions of participating tools. The system-description papers were reviewed and selected by a separate Program Committee: the Test-Comp jury. Each paper was assessed by at least three reviewers. Two sessions in the FASE program were reserved for the presentation of the results: the summary by the Test-Comp chair and the presentations of the participating tools by the developer teams in the first session, and the community meeting in the second session.

Many people contributed to the success of FASE 2022. We are grateful to the Program Committee members and reviewers for their thorough reviews and constructive discussions. We thank the ETAPS 2022 organizers, in particular, Jan Křetínský and Dirk Beyer (General Chairs), Julia Eisentraut (Organization Chair), Maximilian Weininger (Web Chair), and Alexandros Evangelidis (Proceedings Chair). We thank Marieke Huisman (ETAPS Steering Committee Chair) and Tarmo Uustalu (ETAPS Publicity Chair) for managing the process, and Andrzej Wasowski (FASE Steering Committee Chair) for his feedback and support. We are especially grateful to our Artefact Evaluation Committee Chairs Marie-Christine Jakobs and Eduard Kamburjan. Last but not least, we would like to thank the authors for their excellent work.

February 2022 Einar Broch Johnsen Manuel Wimmer

## Organization

### Program Committee Chairs


### Steering Committee

Wil van der Aalst RWTH Aachen, Germany Jordi Cabot Universitat Oberta de Catalunya, Spain Marsha Chechik University of Toronto, Canada Esther Guerra Universidad Autónoma de Madrid, Spain Reiner Hähnle Technische Universität Darmstadt, Germany Reiko Heckel University of Leicester, UK Tiziana Margaria University of Limerick, Ireland Fernando Orejas Universitat Politècnica de Catalunya, Spain Julia Rubin University of British Columbia, Canada Alessandra Russo Imperial College London, UK Andy Schürr Technische Universität Darmstadt, Germany Perdita Stevens University of Edinburgh, UK Mariëlle Stoelinga Universiteit Twente, The Netherlands Gabriele Taentzer Philipps-Universität Marburg, Germany Andrzej Wasowski IT University of Copenhagen, Denmark Heike Wehrheim Universität Paderborn, Germany

## Program Committee

Erika Ábrahám RWTH Aachen, Germany Shaukat Ali Simula Research Lab, Norway Étienne André Université de Lorraine, France Thorsten Berger Ruhr-Universität Bochum, Germany Tomáš Bureš Charles University, Czech Republic Jane Cleland-Huang University of Notre Dame, USA Carlo Furia Università della Svizzera italiana, Switzerland Stijn de Gouw Open Universiteit, The Netherlands Esther Guerra Universidad Autónoma de Madrid, Spain Ichiro Hasuo National Institute of Informatics, Japan Ralf Huuck Logilica, Australia Marie-Christine Jakobs Technische Universität Darmstadt, Germany Gerti Kappel TU Wien, Austria Leen Lambers BTU Cottbus-Senftenberg, Germany Martin Leucker Universität zu Lübeck, Germany


## Artefact Evaluation Committee Chairs


## Artifact Evaluation Committee

Chinmayi Baramashetru University of Oslo, Norway Saverio Giallorenzo Università di Bologna, Italy, and Inria, France Pablo Gómez-Abajo Universidad Autónoma de Madrid, Spain Elena Gómez-Martínez Universidad Autónoma de Madrid, Spain Jan Haltermann Carl von Ossietzky Universität Oldenburg, Germany Hannes Kallwies Universität zu Lübeck, Germany Dylan Marinho Loria, Université de Lorraine, France Felix Pauck Universität Paderborn, Germany Sven Peldszus Universität Koblenz-Landau, Germany Mohammad Rezaalipour Università della Svizzera italiana, Switzerland Cedric Richter Carl von Ossietzky Universität Oldenburg, Germany Maya Retno Ayu Setyautami Universitas Indonesia, Indonesia Asmae Heydari Tabar Technische Universität Darmstadt, Germany Matthias Volk Universiteit Twente, The Netherlands Xiuheng Wu Nanyang Technological University, Singapore Liu Ye Nanyang Technological University, Singapore

## Additional Reviewers

Milad Abdullah Aren Babikian Lara Bargmann

Luca Berardinelli Benedikt Bollig Tabea Bordis

Ricardo Caldas Diego Damasceno Istvan David Francesca Del Bonifro Swaib Dragule Leo Freitas Antonio Garmendia Marcus Gerhold Bineet Ghosh Razan Ghzouli Hans-Dieter Hiep Tobias John Danylo Khalyeyev Karam Kharraz Alexander Knüppel Paul Kobialka Jens Kosiol Wardah Mahmood Mukelabai Mukelabai Argentina Ortega Juliane Päßler Tobias Pett

Ricardo Prudencio Qusai Ramadan Mathias Ramparison Henrique Rebêlo Tobias Runge Lucas Sakizloglou Arnaud Sangnier Sota Sato Alceste Scalas Malte Schmitz Martina Seidl Oszkár Semeráth Arnab Sharma Thiago D. Simão Sabine Sint Martin Steffen Daniel Thoma Palina Tolmach Michal Töpfer Gianluca Turin Freek Verbeek Chenguang Zhu

## Program Committee and Jury — Test-Comp




## FASE Contributions

## Information-flow Interfaces

Ezio Bartocci<sup>1</sup>, Thomas Ferrère<sup>2</sup>, Thomas A. Henzinger<sup>3</sup>, Dejan Nickovic<sup>4</sup>, and Ana Oliveira da Costa<sup>1</sup>

<sup>1</sup> Technische Universität Wien, Vienna, Austria {ezio.bartocci, ana.costa}@tuwien.ac.at

<sup>2</sup> Imagination Technologies, Kings Langley, UK thomas.ferrere@imgtec.com

<sup>3</sup> IST Austria, Klosterneuburg, Austria tah@ist.ac.at

<sup>4</sup> AIT Austrian Institute of Technology, Vienna, Austria dejan.nickovic@ait.ac.at

Abstract. Contract-based design is a promising methodology for taming the complexity of developing sophisticated systems. A formal contract distinguishes between assumptions, which are constraints that the designer of a component puts on the environments in which the component can be used safely, and guarantees, which are promises that the designer asks from the team that implements the component. A theory of formal contracts can be formalized as an interface theory, which supports the composition and refinement of both assumptions and guarantees. Although there is a rich landscape of contract-based design methods that address functional and extra-functional properties, we present the first interface theory that is designed for ensuring system-wide security properties. Our framework provides a refinement relation and a composition operation that support both incremental design and independent implementability. We develop our theory for both stateless and stateful interfaces. We illustrate the applicability of our framework with an example inspired from the automotive domain.

Keywords: Contract-based design, Interface Theory, Hyperproperties, Information-flow.

## 1 Introduction

The rise of pervasive information and communication technologies seen in cyber-physical systems, internet of things, and blockchain services has been accompanied by a tremendous growth in the size and complexity of systems [28]. Subtle dependencies involving multiple architectural layers and unforeseen environmental interactions can expose these systems to cyber-attacks. This problem is further exacerbated by the heterogeneous nature of their constituent components, which are often developed independently by different teams or providers. In such a scenario, defining and enforcing security requirements across components at an early stage of the design process becomes a necessity. This engineering approach is called security-by-design. Although in recent years there has been impressive progress in the verification of security properties for individual system components, the science of compositional security design [22,23] is still in its infancy.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 956123 and was funded in part by the FWF project W1255-N23 and by the ERC-2020-AdG 101020093.

Security policies are usually enforced by restricting the flow of information in a system [30]. Information-flow policies define which information a user or a software/hardware component is allowed to observe or to interfere with while interacting with another component.

The goal of information-flow control is to ensure that a system as a whole satisfies the desired policies. It is especially challenging to verify that there are no side-channels or implicit flows that violate a given policy. For example, in a modern car, the tight coupling between the cyber and the physical components allows an attacker to infer computational properties, such as secrets used for encryption, from side-channels, such as power consumption and electromagnetic radiation [17]. Moreover, the increasing connectivity of automotive systems with their environment makes it easier for the attacker to gather data about the system behavior. The attacker can use this data to exploit weaknesses of the system implementation and gain control over the system [32,7]. These attacks often rely on analyzing and comparing multiple observations to deduce protected information. From a formal-language perspective, such security vulnerabilities are not characterized by properties of a single system execution, but rather by properties of sets of execution traces, which are called hyperproperties [12].

The rigorous design of systems that satisfy information flow requirements is essential from the security perspective. This activity can be supported by the verification of information flow properties, a well-studied problem with a rich landscape of both theory and tools, ranging from language-based [29,18,15,11] to simulation-based [25] approaches. Nevertheless, the existing verification solutions do not address two important aspects. First, components in complex systems are often heterogeneous and cannot be analysed with a single verification tool. Moreover, it is not clear how to combine component verification outcomes to infer system-level information-flow properties. Second, existing methods do not provide guidelines on which information flow properties need to be verified against individual components to provide system-level guarantees regarding leakage of information.

In this paper, we present a contract-based design [8] approach for the structural aspect of information-flow policies. Contract-based design provides a formal framework for building complex systems from individual components, mixing both top-down and bottom-up steps. A top-down step decomposes and refines system-wide requirements; a bottom-up step assembles a system by combining available components. A formal contract distinguishes between assumptions, which are constraints that the designer of a component puts on the environments in which the component can be used safely, and guarantees, which are promises that the designer asks from the team that implements the component. A theory of formal contracts can be formalized as an interface theory, which supports the composition and refinement of both assumptions and guarantees [2,3,31]. While there is a rich landscape of interface theories for functional and extra-functional properties [10,4,13,20], we present the first interface theory that is designed for ensuring system-wide security properties, thus paving the way for a science of safety and security co-engineering.

The focus on the structural aspects of information flow and abstraction from concrete semantics enables compositional reasoning in the presence of heterogeneous components and is complementary to the existing body of work on information flow verification. A different component implementation verified under different semantics could result in different flows being detected. However, after deriving the component flows from the implementation under some concrete semantics, the theory can be agnostic about the underlying semantic interpretation. Hence it enables the design of secure systems from trusted components by abstracting away how information flows and by focusing on whether it can flow at all. In essence, our approach makes it possible to decompose system-level information flow requirements and derive component properties that need to hold, thus providing a divide-and-conquer procedure for organizing verification tasks.

Our theory is based on information-flow assumptions as well as information-flow guarantees. As an interface theory, our theory supports both incremental design and independent implementability [3]. Incremental design allows the composition of different system parts, each coming with their own assumptions and guarantees, without requiring additional knowledge of the overall design context. Independent implementability enables the separate refinement of different system parts by different teams that, without gaining additional information about each other's design choices, can still be certain that their designs, once combined, preserve the specified system-wide requirements. While in previous interface theories, the environment of a component is held responsible for meeting assumptions, and the implementation of the component for the guarantees, there are cases of information-flow violations for which blame cannot be assigned uniquely to the implementation or the environment. In information-flow interfaces we therefore introduce, besides assumptions and guarantees, a new, third type of constraint, called properties, whose enforcement is the shared responsibility of the implementation and the environment.

We develop our framework for both stateless and stateful interfaces. Stateless information-flow interfaces are built from primitive information-flow constraints (assumptions, guarantees, and properties) of the form "the value of a variable x is always independent of the value of another variable y." Stateful information-flow interfaces add a temporal dimension, e.g., "the value of y is independent of x until the value of z is independent of x." The temporal dimension is introduced through a natural notion of state and state transition for interfaces, not through logical operators. We prove that our calculus of information-flow interfaces satisfies the principles of incremental design and independent implementability.

## 2 Application Example

We showcase the applicability of our theory with an example from the automotive industry: a stepwise design of a shared communication infrastructure (a bus) from distance warners and a wheel sensor to the braking system and the odometer. We adapted this use-case from the industrial case study presented by Marcus Mikulcak et al. [25]. The main goal of this system design is to ensure the integrity of a communication channel used to perform a safety-critical functionality. We consider two integrity levels, high and low, to characterize functionalities in our system. Then, we want to guarantee that data exchanged by high-integrity functionalities is not compromised by low-integrity functions.

Distance warners sense the car's proximity to other objects and send their analysis to other components. In our example, we have two distance warners, at the front and the back of the car, that use the shared bus to communicate with the braking system. The wheel sensor senses the wheel rotations and sends this information through the shared bus to the odometer. The braking system is a high-integrity system since it performs safety-critical functions. Hence the communication channel between the distance warners and the braking system is classified as high-integrity, while the communication between the wheel sensor and the odometer is low-integrity. Thus, data sent by the wheel sensor should not interfere with the high-integrity channel to prevent distance warnings sent to the braking system from being delayed or lost. The main goal of our design process is to guarantee that the closed-system requirement that information from the wheel sensor does not flow to the braking system is propagated accordingly to subsystems through successive decomposition and refinement steps.

Fig. 1: Representation of the objects in our theory.

Figure 1 shows the graphical representation we adopt throughout the paper for the objects in our theory. We represent the open-system no-flow requirements with dashed arrows: arrows to input ports are assumptions, while arrows to output ports are guarantees. The closed-system no-flows, called properties, are represented as dotted arrows. To improve the readability of the drawings, each drawn property implicitly comes with the same guarantee over the open system. When it is clear from the context, we may omit port names.

We present, in Fig. 2, the stepwise design of the security requirement that data from the wheel sensor, wheel_tick, should not flow to the targets of the distance warners, distw_f_t and distw_b_t. The first interface in Fig. 2 includes two properties that specify this security requirement. The system is then decomposed into the sending subsystems (warners and wheel sensor), the shared bus, and the receiving subsystems.

Fig. 2: Top-down design of a shared communication infrastructure used by two distance warners, distw_f_s and distw_b_s, and a wheel sensor, wheel_tick, to communicate with the braking system, distw_f_t and distw_b_t, and the odometer, odometer, respectively.

Naturally, we keep the two properties from the first interface as properties in the Bus interface. However, this natural decomposition does not define a well-formed interface according to our theory because the properties in the Bus interface cannot be satisfied given the interface's current assumption and guarantee. The environment allows a flow from wheel_tick to the source of the front distance warner, wheel_tick → distw_f_s. Together with the flow allowed by the guarantee from a distance warner source to its target, this yields the flow wheel_tick → distw_f_s → distw_f_t, which is forbidden by the interface's properties. If we specified the no-flow properties in the Bus interface as guarantees, then the interface would be well-formed. However, the composition of the three subsystems would not satisfy the initial specification because guarantees only apply to implementations of their interface, and the flow described above would still be allowed in the composition of the three subsystems. This illustrates two applications of the information-flow interface theory: to detect inconsistent no-flow specifications and faulty decompositions. Moreover, when an interface is not well-formed we can provide a witness for the property violation. We can use this witness to guide the refinement of an ill-formed interface into a well-formed one.
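The witness mentioned above can be pictured as a path in a graph of permitted flows. The following is a minimal sketch under stated assumptions, not the authors' procedure: permitted flows are encoded as directed edges, identifiers are written with underscores, and the helper name `find_witness` and the edge sets are hypothetical, chosen only to mirror the Bus example.

```python
from collections import deque


def find_witness(allowed_flows, source, target):
    """Return a shortest chain of allowed flows from source to target, or None.

    `allowed_flows` is a set of (from, to) pairs; this encoding is an
    assumption made for illustration.
    """
    successors = {}
    for src, dst in allowed_flows:
        successors.setdefault(src, []).append(dst)

    # Breadth-first search that remembers each node's predecessor.
    predecessor = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = predecessor[node]
            return path[::-1]
        for nxt in sorted(successors.get(node, [])):
            if nxt not in predecessor:
                predecessor[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical edges: the environment permits wheel_tick -> distw_f_s, and
# the Bus guarantee permits each warner source to reach its own target.
environment_flows = {("wheel_tick", "distw_f_s")}
bus_flows = {("distw_f_s", "distw_f_t"), ("distw_b_s", "distw_b_t")}

witness = find_witness(environment_flows | bus_flows, "wheel_tick", "distw_f_t")
print(witness)  # ['wheel_tick', 'distw_f_s', 'distw_f_t']
```

Without the environment edge, `find_witness` returns `None`, which corresponds to the situation after the missing assumption has been added to the Bus interface.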

In the second step of our refinement, in Figure 2, we add the missing assumptions to the Bus interface. Our notion of composition compatibility between interfaces requires that the Sending interface includes guarantees that imply the Bus assumptions, as the Sending interface will be part of the Bus environment. At this point, with a certified decomposition of the original specification, our theory guarantees that each subsystem can now be further refined independently (possibly by different teams). The last step illustrates an independent refinement of the Sending and the Receiving interfaces.

In Fig. 3, we present the stateful view of the system, which requires that the system satisfies the composition of the Sending, the Bus, and the Receiving interfaces derived in Fig. 2 at all times. We present, as well, a refinement of that specification, which requires that at each time point only one of the sending components can use the bus. The interfaces that define each state are named after the sending component that can use the bus (e.g., in the state Swheel only wheel_tick can use the bus). If the access to the bus is mutually exclusive, then we can simplify the assumptions on the environment in the Bus interface. With more guarantees on the implementations we need fewer assumptions to satisfy the properties.

Fig. 3: Design of mutually exclusive shared communication infrastructure for distance warners and the wheel odometer. Each state is defined by the composition of the interfaces inside.

Finally, the components of our system can be, for instance, the Simulink and Stateflow models provided to the authors [25] by their industrial partners. We can then use the tool introduced in their work to verify whether these components implement the stateful interfaces we derived.

In summary, our framework defines relations on both stateless and stateful interfaces specifying information-flow policies that allow us to check whether: (i) a given interface refines (or abstracts) the current specification; (ii) two interfaces are compatible for composition; (iii) a specification is consistent; (iv) the information flows in a component define an implementation of a given interface; and (v) a system decomposition refines the system specification.

## 3 Stateless Information-flow Interfaces

In this section, we introduce a stateless interface and component algebra for secure information flow. Information flows between two variables when the value of one influences the other.

We are interested in the structural properties of information flow within a system and define relations abstracting flows, flow relations, as being both reflexive and transitively closed. An information-flow component abstracts the implementation of a system by a flow relation. An information-flow interface specifies forbidden flows in an open system by defining three kinds of constraints: assumptions, guarantees, and properties. The assumption characterizes flows that we assume are not part of the environment while the guarantee describes all flows the system forbids and that are local to it. The property qualifies the forbidden flows at the interaction between the system and its environment. Hence, it represents a requirement on the closed system that needs to be enforced by guarantees on the open system and assumptions on its environment.

Definition 1. Let X and Y be disjoint sets of input and output variables, respectively, with Z = X ∪ Y the set of all variables. A stateless information-flow component is a tuple (X, Y, M), where M ⊆ Z × Y is a (reflexive and transitive) flow relation, called flows. A stateless information-flow interface is a tuple (X, Y, A, G, P), where: A ⊆ Z × X is a relation, called assumption; G ⊆ Z × Y is a relation, called guarantee; and P ⊆ Z × Y is a relation, called property.
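To make the shape of these tuples concrete, here is a minimal Python sketch of Definition 1. It is an illustration under assumptions, not the paper's formalization: the class names, the field name `prop`, and the helper `flow_relation` are hypothetical, and "reflexive" is read as containing (y, y) for every output y so that M stays inside Z × Y.

```python
from dataclasses import dataclass


def flow_relation(outputs, edges):
    """Close `edges` reflexively on `outputs` and transitively (naive fixpoint)."""
    closure = set(edges) | {(y, y) for y in outputs}
    changed = True
    while changed:
        changed = False
        for a, b in tuple(closure):
            for c, d in tuple(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return frozenset(closure)


def _check_pairs(pairs, sources, targets, name):
    # Every pair must go from an allowed source to an allowed target.
    if not all(s in sources and t in targets for s, t in pairs):
        raise ValueError(f"{name} has a pair outside its declared domain")


@dataclass(frozen=True)
class Component:
    inputs: frozenset   # X
    outputs: frozenset  # Y
    flows: frozenset    # M, a subset of Z x Y

    def __post_init__(self):
        if self.inputs & self.outputs:
            raise ValueError("X and Y must be disjoint")
        _check_pairs(self.flows, self.inputs | self.outputs, self.outputs, "M")


@dataclass(frozen=True)
class Interface:
    inputs: frozenset      # X
    outputs: frozenset     # Y
    assumption: frozenset  # A, a subset of Z x X
    guarantee: frozenset   # G, a subset of Z x Y
    prop: frozenset        # P, a subset of Z x Y

    def __post_init__(self):
        if self.inputs & self.outputs:
            raise ValueError("X and Y must be disjoint")
        z = self.inputs | self.outputs
        _check_pairs(self.assumption, z, self.inputs, "A")
        _check_pairs(self.guarantee, z, self.outputs, "G")
        _check_pairs(self.prop, z, self.outputs, "P")


# Illustrative instance loosely based on the bus example (names assumed).
X = frozenset({"distw_f_s", "wheel_tick"})
Y = frozenset({"distw_f_t", "odometer"})
bus = Component(X, Y, flow_relation(
    Y, {("distw_f_s", "distw_f_t"), ("wheel_tick", "odometer")}))
no_leak = frozenset({("wheel_tick", "distw_f_t")})
bus_interface = Interface(X, Y, frozenset(), no_leak, no_leak)
```

The constructors only enforce the domain constraints stated in the definition; they do not check that `flows` is actually closed, which is why the sketch builds it through `flow_relation`.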

Given an interface F we are interested in components that do not implement flows forbidden by either the interface guarantees (called implementations of F) or the interface assumptions (called environments of F).

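Definition 2. A component $f_E = (Y, X, E)$ is called an environment of $F = (X, Y, A, G, P)$. An environment is admissible for $F$, denoted by $f_E \models F$, iff $E \subseteq \overline{A}$, where $\overline{A} = (Z \times X) \setminus A$ is the complement of the assumption. A component $f = (X, Y, M)$ implements the interface $F$, denoted by $f \models F$, iff $M \subseteq \overline{G}$, where $\overline{G} = (Z \times Y) \setminus G$ is the complement of the guarantee.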
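Definitions 1 and 2 can be made concrete with finite sets of pairs. The following is a minimal sketch, assuming that relations are Python sets of `(source, target)` tuples; the variable names `x1`, `x2`, `y` and the helper names are hypothetical and are not taken from the running example.

```python
from itertools import product

def complement(no_flows, sources, targets):
    """All pairs in sources × targets that are not forbidden."""
    return set(product(sources, targets)) - set(no_flows)

def implements(M, X, Y, G):
    """f = (X, Y, M) implements F iff M lies in the complement of G in Z × Y."""
    return set(M) <= complement(G, X | Y, Y)

def admissible(E, X, Y, A):
    """f_E = (Y, X, E) is admissible iff E lies in the complement of A in Z × X."""
    return set(E) <= complement(A, X | Y, X)

# Hypothetical interface with inputs x1, x2 and output y.
X, Y = {"x1", "x2"}, {"y"}
G = {("x1", "y")}   # guarantee: x1 must not flow to y
A = {("y", "x2")}   # assumption: y must not flow to x2

M_ok = {("y", "y"), ("x2", "y")}        # respects the guarantee
M_bad = M_ok | {("x1", "y")}            # implements a forbidden flow
E_ok = {("x1", "x1"), ("x2", "x2"), ("y", "x1")}
E_bad = E_ok | {("y", "x2")}            # violates the assumption
```

Here `implements(M_ok, X, Y, G)` and `admissible(E_ok, X, Y, A)` hold, while the `_bad` variants are rejected because each contains exactly one forbidden pair.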

Example 1.

Fig. 4: Interface Bus with an implementation, bus, and an admissible environment, sending.

In Figure 4, we have the first refinement of the interface Bus from our application example. The Bus interface specifies the requirement on the closed system (using properties) that there are no-flows from wheel tick to both distw f s and distw b s. The Bus interface also specifies this requirement as a guarantee on the open system. The bus component is an implementation of Bus because its only flow, from distw f s to distw f t, is not in the guarantees of the Bus interface. Since Bus has no assumptions, the sending component is an admissible environment for Bus.

When we compose the components sending and bus, there is a flow from wheel tick to distw f t, which is in the properties of Bus. Hence, the assumption and guarantee specified over the open system are not enough to ensure the property over the closed system. The composition of these two components witnesses that the Bus interface is not well-formed.

An information-flow interface is well-formed when it has at least one implementation and one admissible environment. Therefore, all of its relations must be irreflexive. We refer to irreflexive relations as no-flow relations. A well-formed interface ensures, additionally, that an interface property is consistent with its assumptions and guarantees. An interface property is not consistent when the flow relation defined by the composition of one of the interface's admissible environments with one of its implementations includes a pair specified in the interface property. To check whether the property is consistent, we compute the flow relation of the closed system defined by an interface F, which includes all flows that are in the composition of any of the interface's admissible environments with one of its implementations. The main challenge is that, in general, the complement of an interface's guarantee (assumption) may not define the flow relation of any of its implementations (environments). Hence there may be no maximal implementation or admissible environment for a given interface.

Example 2.

In Figure 5, we have two components, $bus_s$ and $bus_t$, that implement the interface Bus from the previous example. A maximal implementation of Bus must include the flows in both $bus_s$ and $bus_t$. As flows are transitively closed, the maximal implementation would include a flow from wheel tick to distw f t, which violates the Bus guarantees.

Since maximal implementations and maximal admissible environments need not exist, we cannot characterize all flows of the closed system defined by an interface $F$ by computing the transitive closure of all pairs in the complement of $F$'s assumption and guarantee, $(\overline{A} \cup \overline{G})^{*}$. This approach would yield more flows than the flows of the closed system defined by $F$. Instead, we consider all pairs $(z, z')$ such that there exists a path from $z$ to $z'$ that alternates between flows in the complement of the assumption, $\overline{A}$, and the complement of the guarantee, $\overline{G}$. We define this notion below as the composition between no-flow relations. In Proposition 1 we prove that this definition captures our intended relation between an interface property and its environments and implementations.

Definition 3. A no-flow relation $N \subseteq (A \cup B) \times B$ is an irreflexive relation, and its complement is $\overline{N} = ((A \cup B) \times B) \setminus N$. Let $N \subseteq (A \cup B) \times B$ and $N' \subseteq (A' \cup B') \times B'$ be two no-flow relations. The set of flows defined by their composition is $N \bullet N' = (Id_{A' \cup B'} \cup \overline{N}) \circ (\overline{N'} \circ \overline{N})^{*} \circ (Id_{B} \cup \overline{N'})$, where $Id_{Z} = \{(z, z) \mid z \in Z\}$ and $R \circ R' = \{(z, z'') \mid (z, z') \in R \text{ and } (z', z'') \in R'\}$ is the usual composition between relations.
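The composition of Definition 3 can be computed directly on finite relations. The sketch below is one possible reading, assuming relations are sets of pairs and that the reflexive-transitive closure is taken over the union of all variables involved; the variables `w`, `s`, `t` and the function names are hypothetical. The example mirrors the situation of Example 2: individually allowed flows `w → s` and `s → t` would chain into a forbidden flow under a plain closure, but not under the alternating composition.

```python
from itertools import product

def compose(R, S):
    """Relational composition: first a step in R, then a step in S."""
    return {(a, d) for (a, b) in R for (c, d) in S if b == c}

def star(R, universe):
    """Reflexive-transitive closure of R over `universe`."""
    result = {(v, v) for v in universe}
    while True:
        extended = result | compose(result, R)
        if extended == result:
            return result
        result = extended

def complement(N, sources, targets):
    return set(product(sources, targets)) - set(N)

def noflow_compose(N, src, tgt, N2, src2, tgt2):
    """N • N' with N ⊆ src × tgt and N' ⊆ src2 × tgt2."""
    Nc, N2c = complement(N, src, tgt), complement(N2, src2, tgt2)
    ident = lambda S: {(v, v) for v in S}
    left = ident(src2) | Nc
    middle = star(compose(N2c, Nc), set(src) | set(src2))
    right = ident(tgt) | N2c
    return compose(compose(left, middle), right)

# Hypothetical interface: input w, outputs s and t, no assumptions,
# and the guarantee that w does not flow to t.
X, Y = {"w"}, {"s", "t"}
Z = X | Y
A, G = set(), {("w", "t")}

closed = noflow_compose(A, Z, X, G, Z, Y)
naive = star(complement(A, Z, X) | complement(G, Z, Y), Z)
```

In this instance `("w", "t")` belongs to `naive` but not to `closed`, which illustrates why the plain closure over-approximates the flows of the closed system.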

We now have all the ingredients to define well-formed interfaces.

Definition 4. An interface $(X, Y, A, G, P)$ is well-formed iff $A$, $G$ and $P$ are no-flow relations, and the property is consistent, i.e. $(A \bullet G) \cap P = \emptyset$.

Proposition 1. For all well-formed interfaces $F = (X, Y, A, G, P)$, and for all components $f = (X, Y, M)$ and $f_E = (Y, X, E)$: if $f$ implements $F$, $f \models F$, and $f_E$ is an admissible environment of $F$, $f_E \models F$, then their combined flows are consistent with the property of $F$, $(M \cup E)^{*} \cap P = \emptyset$.
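As an illustration of Definition 4, the check below tests irreflexivity of the three relations and consistency of the property against $A \bullet G$. It is a sketch under the same finite-set encoding as before; the two-variable interface and all function names are assumptions made for this example.

```python
from itertools import product

def compose(R, S):
    return {(a, d) for (a, b) in R for (c, d) in S if b == c}

def star(R, universe):
    result = {(v, v) for v in universe}
    while True:
        extended = result | compose(result, R)
        if extended == result:
            return result
        result = extended

def closed_system_flows(X, Y, A, G):
    """A • G for an interface with inputs X and outputs Y (Definition 3)."""
    Z = X | Y
    A_bar = set(product(Z, X)) - A
    G_bar = set(product(Z, Y)) - G
    ident = lambda S: {(v, v) for v in S}
    left = ident(Z) | A_bar
    middle = star(compose(G_bar, A_bar), Z)
    right = ident(X) | G_bar
    return compose(compose(left, middle), right)

def irreflexive(R):
    return all(a != b for (a, b) in R)

def well_formed(X, Y, A, G, P):
    no_flows = irreflexive(A) and irreflexive(G) and irreflexive(P)
    return no_flows and not (closed_system_flows(X, Y, A, G) & P)

X, Y = {"x"}, {"y"}
P = {("x", "y")}                                   # global no-flow from x to y
guarded = well_formed(X, Y, set(), {("x", "y")}, P)  # guarantee enforces P
unguarded = well_formed(X, Y, set(), set(), P)       # nothing enforces P
```

With the guarantee in place the interface is well-formed; without it, the pair `("x", "y")` appears among the flows of the closed system and the property is inconsistent.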

#### 3.1 Composition and Incremental Design

We now present how to compose components and interfaces. We introduce a compatibility predicate that checks whether the composition of two interfaces is a well-formed interface. We prove that these two notions support the incremental design of systems.

The different types of variables between interfaces $F$ and $F'$ are defined as $Y_{F,F'} = Y \cup Y'$, $X_{F,F'} = (X \cup X') \setminus Y_{F,F'}$, and $Z_{F,F'} = Y_{F,F'} \cup X_{F,F'}$. The same definition applies to components $f$ and $f'$. The composition of components $f$ and $f'$ is the reflexive and transitive closure of the union of the individual component flows, i.e. $f \otimes f' = (X_{f,f'}, Y_{f,f'}, (M \cup M')^{*})$. We present interface composition by defining separately $A$, $G$ and $P$ of the composed interface.

We compose interfaces through their shared variables. Shared variables between two interfaces are all variables that are an input variable in one of the interfaces and an output variable in the other one. The composite flows between two interfaces are the set of all flows that are in the composition of any of their implementations. As for the definition of flows in the closed system defined by an interface, the composite flows are the composition of the guarantees of the interfaces being composed (as defined in Definition 3). The composition of two interfaces should not restrict their sets of implementations, thus the composite guarantees are the complement of the composite flows.

Definition 5. Let $F = (X, Y, A, G, P)$ and $F' = (X', Y', A', G', P')$ be two interfaces. Their composite flows are $\overline{G}_{F,F'} = G \bullet G'$. The composite guarantees of $F$ and $F'$ are defined as $G_{F,F'} = (Z_{F,F'} \times Y_{F,F'}) \setminus \overline{G}_{F,F'}$, also denoted by $G_{F \otimes F'}$.
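Component composition amounts to a closure computation. The sketch below assumes components are triples `(X, Y, M)` of Python sets; restricting the closed relation to pairs that end in an output variable is our reading of the typing $M \subseteq Z \times Y$, not something stated explicitly in the text. The variable names are hypothetical.

```python
def closure(pairs, variables):
    """Reflexive-transitive closure of `pairs` over `variables`."""
    rel = set(pairs) | {(v, v) for v in variables}
    while True:
        step = {(a, d) for (a, b) in rel for (c, d) in rel if b == c}
        if step <= rel:
            return rel
        rel |= step

def compose_components(f, g):
    (X1, Y1, M1), (X2, Y2, M2) = f, g
    Y = Y1 | Y2                       # outputs of the composite
    X = (X1 | X2) - Y                 # inputs not produced internally
    closed = closure(M1 | M2, X | Y)
    # Assumption: keep only flows into output variables (M ⊆ Z × Y).
    return X, Y, {(z, y) for (z, y) in closed if y in Y}

f = ({"a"}, {"b"}, {("a", "b"), ("b", "b")})
g = ({"b"}, {"c"}, {("b", "c"), ("c", "c")})
X, Y, M = compose_components(f, g)
```

The shared variable `b` becomes an output of the composite, and the transitive flow from `a` to `c` appears in `M` even though neither component implements it alone.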

The assumption of an interface resulting from the composition of multiple interfaces is the weakest condition on the environment that allows the interfaces being composed to work together. Additionally, it must support incremental design, i.e. the admissibility of an environment must be independent of the order in which the interfaces are composed.

Naturally, all assumptions of each interface must be considered during composition. However, not all of them can be kept as assumptions of the composite interface, because shared variables will be output variables of the composition. If the environment can still influence the information flow to a shared variable, then we may need to add assumptions to prevent such a flow. Propagated assumptions between two interfaces are derived by looking in their respective assumptions for no-flow pairs pointing to a shared variable.

Example 3.

Fig. 6: Propagating assumptions.

In Figure 6, we depict an interface specifying information-flow policies for a car immobilizer, Fimm, along with an interface for a Controller Area Network (CAN bus), Fcan. Interface Fimm has only one assumption that key does not flow to can. In this design, the immobilizer uses the CAN to communicate with the car electronic control unit (ECU). Our goal is to compose both interfaces.

These interfaces share the port can. Thus, can will be an output port of their composition. The interface Fcan cannot guarantee that the only assumption in Fimm is satisfied after composition because it does not have a port key. As we are working with open systems and assume that the environment is helpful, we can add further assumptions to ensure the correctness of this composition. For example, we can add assumptions that prevent key from flowing to an input port in Fcan that can flow to can. Such flows could be part of a flow from key to can, which would violate the assumption we want to enforce. In this case, we note that in Fcan information in ecu can flow to can. So, the composite interface needs to include the assumption that key does not flow to ecu. This is a propagated assumption.

Definition 6. The set of assumptions propagated from $F = (X, Y, A, G, P)$ to $F' = (X', Y', A', G', P')$ is $\hat{A}_{F \to F'} = \{(z, z') \mid \exists s \in X \cap Y' \text{ s.t. } (z, s) \in A \text{ and } (z', s) \in \overline{G}_{F,F'}\}$. The set of all propagated assumptions of $F$ and $F'$ is $\hat{A}_{F,F'} = \hat{A}_{F \to F'} \cup \hat{A}_{F' \to F}$. The composite assumptions of $F$ and $F'$ are defined as $A_{F,F'} = (A \cup A' \cup \hat{A}_{F,F'}) \cap (Z_{F,F'} \times X_{F,F'})$, also denoted by $A_{F \otimes F'}$.

Example 4. From the example before, information from the ports ecu, imm and deb can all flow to can. So, they are flows in the composite interface and, by Definition 5, $\{(ecu, can), (imm, can), (deb, can)\} \subseteq \overline{G}_{F_{imm},F_{can}}$. Then, $\hat{A}_{F_{imm} \to F_{can}} = \{(key, ecu), (key, imm), (key, deb)\}$. From those assumptions only $(key, ecu)$ points to a variable in $X_{F,F'}$, so $A_{F_{imm},F_{can}} = \{(key, ecu)\}$.

The properties of the composition contain all properties of each interface being composed. They additionally include all properties derived from the assumptions and guarantees of the composite. Derived properties are guarantees that hold under any admissible environment. They are defined by all pairs $(z, y)$ in an interface guarantee s.t. there is no combination of flows allowed by its assumptions and guarantees that creates a flow from $z$ to $y$. The derived properties of an assumption $A$ and guarantee $G$ are then defined as $P^{A,G} = G \setminus (A \bullet G)$. The composite properties of $F$ and $F'$ are $P_{F,F'} = P \cup P' \cup P^{A_{F,F'},G_{F,F'}}$.

Definition 7. The composition of two interfaces $F$ and $F'$ is the interface $F \otimes F' = (X_{F,F'}, Y_{F,F'}, A_{F,F'}, G_{F,F'}, P_{F,F'})$, where $A_{F,F'}$ is defined in Definition 6, $G_{F,F'}$ in Definition 5 and $P_{F,F'}$ in the previous paragraph.
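The computation in Example 4 can be replayed mechanically. This is a sketch only: the figure is not reproduced here, so the port directions assigned to the two interfaces below are assumptions chosen to agree with the text, the immobilizer's assumption set and an empty assumption set for the CAN interface are taken as given, and only the three composite flows named in the example are supplied.

```python
def propagated(A, shared, composite_flows):
    """Assumptions (z, s) on a shared port s, pushed back to every z' that flows to s."""
    return {(z, z2)
            for (z, s) in A if s in shared
            for (z2, s2) in composite_flows if s2 == s}

def composite_assumptions(A1, X1, Y1, A2, X2, Y2, composite_flows):
    Y = Y1 | Y2
    X = (X1 | X2) - Y
    hat = (propagated(A1, X1 & Y2, composite_flows)
           | propagated(A2, X2 & Y1, composite_flows))
    # Keep only pairs that point to an input of the composite.
    return {(z, x) for (z, x) in A1 | A2 | hat if x in X}

# Assumed port directions (hypothetical, the figure is unavailable).
X_imm, Y_imm = {"key", "can"}, {"imm"}
X_can, Y_can = {"ecu", "imm"}, {"can", "deb"}
A_imm, A_can = {("key", "can")}, set()

# The part of the composite flows mentioned in Example 4.
flows = {("ecu", "can"), ("imm", "can"), ("deb", "can")}

hat = propagated(A_imm, X_imm & Y_can, flows)
A_comp = composite_assumptions(A_imm, X_imm, Y_imm, A_can, X_can, Y_can, flows)
```

Under these assumptions `hat` contains the three propagated pairs of the example, and `A_comp` retains only `("key", "ecu")`, because `imm`, `deb` and `can` are outputs of the composite.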

We allow composition for any two arbitrary interfaces. However, not all compositions result in a well-formed interface. We define next the notions of two interfaces being composable and compatible. Composability imposes the syntactic restriction that the output variables of the two interfaces are disjoint. Compatibility captures the semantic requirement that whenever an interface $F$ provides inputs to another interface $F'$, then $F'$ needs to include guarantees that imply the assumptions of $F$.

Definition 8. Two interfaces $F = (X, Y, A, G, P)$ and $F' = (X', Y', A', G', P')$ are composable iff $Y \cap Y' = \emptyset$. The interfaces $F$ and $F'$ are compatible, denoted $F \sim F'$, iff they are composable and $((A \cup A') \cap (Z_{F,F'} \times Y_{F,F'})) \subseteq G_{F,F'}$.
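Definition 8 combines Definitions 3 and 5: the assumptions that point to outputs of the composite must be covered by the composite guarantees. The following sketch encodes an interface as a tuple `(X, Y, A, G)` (properties play no role in the check); the port names echo the immobilizer example, but the concrete relations are hypothetical and not read off the figure.

```python
from itertools import product

def compose(R, S):
    return {(a, d) for (a, b) in R for (c, d) in S if b == c}

def star(R, universe):
    result = {(v, v) for v in universe}
    while True:
        extended = result | compose(result, R)
        if extended == result:
            return result
        result = extended

def composite_flows(X1, Y1, G1, X2, Y2, G2):
    """G • G' as in Definitions 3 and 5."""
    Z1, Z2 = X1 | Y1, X2 | Y2
    G1_bar = set(product(Z1, Y1)) - G1
    G2_bar = set(product(Z2, Y2)) - G2
    ident = lambda S: {(v, v) for v in S}
    left = ident(Z2) | G1_bar
    middle = star(compose(G2_bar, G1_bar), Z1 | Z2)
    right = ident(Y1) | G2_bar
    return compose(compose(left, middle), right)

def compatible(F1, F2):
    (X1, Y1, A1, G1), (X2, Y2, A2, G2) = F1, F2
    if Y1 & Y2:                        # not composable
        return False
    Y = Y1 | Y2
    Z = X1 | X2 | Y
    guarantees = set(product(Z, Y)) - composite_flows(X1, Y1, G1, X2, Y2, G2)
    relevant = {(z, y) for (z, y) in A1 | A2 if y in Y}
    return relevant <= guarantees

F_can = ({"ecu", "imm"}, {"can"}, set(), set())
F_imm_strict = ({"key", "can"}, {"imm"}, {("key", "can")}, {("key", "imm")})
F_imm_loose = ({"key", "can"}, {"imm"}, {("key", "can")}, set())
```

In this hypothetical setting `F_imm_strict` is compatible with `F_can` because its guarantee cuts every path from `key` to `can`, whereas `F_imm_loose` is not: `key` may reach `imm` and from there `can`, which contradicts the assumption on `can`.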

Clearly, both the composition operator and the compatibility relation are commutative. Additionally, we prove that composition preserves well-formedness and that it supports incremental design of systems. The full proofs are in the appendix.

Theorem 1. Let $F$ and $F'$ be well-formed interfaces. If the interfaces are compatible, $F \sim F'$, then their composition, $F \otimes F'$, defines a well-formed interface.

Theorem 2 (Incremental design). Let $F$, $F'$ and $F''$ be interfaces. If $F \sim F'$ and $(F \otimes F') \sim F''$, then $F' \sim F''$ and $F \sim (F' \otimes F'')$.

Proof. We first prove that composite assumptions are associative. We assume that $F \sim F'$ and $(F \otimes F') \sim F''$. The most interesting case is when $(z, s)$ is an assumption of $F$ and $s$ is a shared variable between $F$ and $F' \otimes F''$. Then, we need to prove that $(z, s) \in G_{F,F' \otimes F''}$. We prove this by assuming towards a contradiction that $(z, s) \in \overline{G}_{F,F' \otimes F''}$. We illustrate it in Figure 7.

By composite flows being associative, $(z, s) \in \overline{G}_{F \otimes F',F''}$. Since $(z, s)$ is an assumption of $F$ and $(s', s) \in \overline{G}_{F \otimes F'}$, we have the derived assumption $(z, s') \in \hat{A}_{F \to F'}$ and so $(z, s') \in A_{F \otimes F'}$. Moreover, $(z, s') \in \overline{G}_{F \otimes F',F''}$, because $z$ can flow to $s'$ when $F \otimes F'$ is composed with $F''$. This contradicts our initial assumption that $(F \otimes F') \sim F''$.

Fig. 7: Incremental design.

We prove additionally that composition is associative for compatible interfaces.

Theorem 3. If $F \sim F'$ and $F \otimes F' \sim F''$, then $(F \otimes F') \otimes F'' = F \otimes (F' \otimes F'')$.

Finally, we show that flows resulting from the composition of any components that implement two given interfaces are allowed by the composition of these interfaces.

Proposition 2. For all interfaces $F$ and $F'$ and all components $f = (X, Y, M)$ and $f' = (X', Y', M')$, if $f \models F$ and $f' \models F'$, then the composition of the components implements the composition of the interfaces, $f \otimes f' \models F \otimes F'$.

#### 3.2 Refinement and Independent Implementability

We now define a refinement relation between interfaces. Intuitively, an interface $F'$ refines $F$ iff $F'$ admits more environments than $F$, while possibly constraining its implementations.

Definition 9. Interface $F' = (X', Y', A', G', P')$ refines $F = (X, Y, A, G, P)$, written $F' \preceq F$, when $A' \subseteq A$, $G \subseteq G'$ and $P \subseteq P'$.

Let $F$ and $F'$ be interfaces s.t. $F' \preceq F$. Let $f = (X, Y, M)$ and $f_E = (Y, X, E)$ be components. Then, (a) if $f \models F'$, then $f \models F$; and (b) if $f_E \models F$, then $f_E \models F'$.

Additionally, we show below that refinement and composition support independent implementability.

Theorem 4 (Independent implementability). For all well-formed interfaces $F'_1$, $F_1$ and $F_2$, if $F'_1 \preceq F_1$ and $F_1 \sim F_2$, then $F'_1 \sim F_2$ and $F'_1 \otimes F_2 \preceq F_1 \otimes F_2$.

Proof. The challenging part is to prove that the refined composite contains all properties of the abstracted one, i.e. $P_{F_1 \otimes F_2} \subseteq P_{F'_1 \otimes F_2}$. We prove by induction on $n \in \mathbb{N}$ that if a pair of variables $(z, y)$ cannot be defined by assume-guarantee paths of size at most $n$ of the abstract composition, then it cannot be defined by assume-guarantee paths of size at most $n$ of the refined composition. The base case is easy to see. If for all $(z, s) \in A_{F_1,F_2}$ s.t. there exists $(s, y) \in G_{F_1,F_2}$, then, by $F'_1 \preceq F_1$, it follows that for all $(z, s) \in A_{F'_1,F_2}$ there exists $(s, y) \in G_{F'_1,F_2}$. Hence if $(z, y) \notin A_{F_1,F_2} \circ G_{F_1,F_2}$, then $(z, y) \notin A_{F'_1,F_2} \circ G_{F'_1,F_2}$ as well.
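The refinement check of Definition 9 is three set inclusions. A minimal sketch, assuming each interface is given as a triple `(A, G, P)` of no-flow sets over the same variables; the concrete relations and the symbol chosen for refinement are illustrative assumptions.

```python
def refines(refined, abstract):
    """F' refines F: fewer assumptions, more guarantees, more properties."""
    A_r, G_r, P_r = refined
    A_a, G_a, P_a = abstract
    return A_r <= A_a and G_a <= G_r and P_a <= P_r

def implements(M, G):
    """Flows M implement an interface iff they avoid all pairs in G."""
    return not (set(M) & set(G))

# Hypothetical interfaces over inputs x1, x2 and output y.
abstract = ({("y", "x1")}, {("x1", "y")}, {("x1", "y")})
refined = (set(), {("x1", "y"), ("x2", "y")}, {("x1", "y")})

M = {("y", "y")}   # an implementation of the refined interface
```

Since `refined` drops an assumption and adds a guarantee, it refines `abstract` but not conversely, and any flows that avoid the larger guarantee set of `refined`, such as `M`, also avoid the guarantee of `abstract`, as statement (a) above requires.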

#### 3.3 Discussion

Properties. In this work, we consider transitively closed flows. In this setting, in an open system, information can flow from $z$ to $z'$ by flowing from $z$ to $s$ through the environment, and then from $s$ to $z'$ through one of its implementations. As our algebra focuses on the design of structural requirements of no-flows in open systems, it needs to support the specification of global no-flow requirements. We made them explicit by introducing properties. If we did not include properties in our interfaces, then either assumptions or guarantees would need to take over the role of specifying global no-flows. Let's assume that, alternatively, guarantees would be interpreted as global no-flows. Then, to support incremental design, the compatibility criteria between interfaces would turn out to be overly restrictive, with intuitive and correct designs being considered incompatible. This led us to the distinction between guarantees and properties, where properties may be supported by assumptions on the environment that can restrict the set of compatible interfaces. In other words, the main advantage of having properties is that the designer can choose how to split the responsibilities between the environment and the implementations to satisfy a global no-flow.

Semantics. The structural approach that abstracts away semantic considerations is an important feature of our theory. The practicability of our approach lies in the support for the design of such requirements by decoupling the design process from (its orthogonal) semantic considerations. Hence, our approach does not deny semantics, but rather separates the design of specifications from component implementation concerns. The presented approach even allows using tailored semantics and tools for different parts of the design. For example, at the bottom (component) level, no-flows and flows relations can be instantiated with different semantic interpretations. After deriving the component no-flows from the implementation under a concrete semantics, the theory can be agnostic about the underlying semantic interpretation and can focus on whether there exists a flow at all.

#### 4 Stateful Information-Flow Interfaces

We extend our theory with stateful components and interfaces. These are transition systems in which each state is a stateless component or interface, respectively.

Definition 10. Let $X$ and $Y$ be disjoint sets of input and output variables, respectively, with $Z = X \cup Y$ the set of all variables. Let $Q$ be a set of states with $\hat{q} \in Q$ being the initial state and $\delta : Q \to 2^{Q}$ be a transition relation. A stateful information-flow component $f$ is a tuple $(X, Y, Q, \hat{q}, \delta, \mathcal{M})$, where $\mathcal{M} : Q \to 2^{Z \times Y}$ is a state labeling such that for all states $q \in Q$, $\mathcal{M}(q)$ defines a flow relation. We denote by $f(q) = (X, Y, \mathcal{M}(q))$ the stateless component implied by the labeling of $q$. A stateful information-flow interface $F$ is a tuple $(X, Y, Q, \hat{q}, \delta, \mathcal{A}, \mathcal{G}, \mathcal{P})$, where $\mathcal{A} : Q \to 2^{Z \times X}$ is called assumption; $\mathcal{G} : Q \to 2^{Z \times Y}$ is called guarantee; and $\mathcal{P} : Q \to 2^{Z \times Y}$ is called property. For each state $q \in Q$ we denote by $F(q) = (X, Y, \mathcal{A}(q), \mathcal{G}(q), \mathcal{P}(q))$ the stateless interface implied by the assumption, guarantee and property of $q$.

A stateful interface $F$ is well-formed iff $F(\hat{q})$ is a well-formed stateless interface, and for all $q \in Q$ reachable from $\hat{q}$ the stateless interface $F(q)$ is well-formed. In what follows, $F = (X, Y, Q, \hat{q}, \delta, \mathcal{A}, \mathcal{G}, \mathcal{P})$ and $F' = (X', Y', Q', \hat{q}', \delta', \mathcal{A}', \mathcal{G}', \mathcal{P}')$ are stateful interfaces, and $f = (X, Y, Q_f, \hat{q}_f, \delta_f, \mathcal{M})$ and $f_E = (Y, X, Q_E, \hat{q}_E, \delta_E, \mathcal{E})$ are stateful components.

A stateful component f implements a stateful interface F if there exists a simulation relation from f to F such that the stateless components in the relation implement the stateless interfaces they are related to. Admissible environments require a simulation relation from them to the interface they are admissible on.

Definition 11. A component $f$ implements the interface $F$, denoted by $f \models F$, iff there exists $H \subseteq Q_f \times Q$ s.t. $(\hat{q}_f, \hat{q}) \in H$ and for all $(q_f, q) \in H$: (i) $f(q_f) \models F(q)$; and (ii) if $q'_f \in \delta_f(q_f)$, then there exists a state $q' \in \delta(q)$ s.t. $(q'_f, q') \in H$. A component $f_E$ is an admissible environment for the interface $F$, denoted by $f_E \models F$, iff there exists a relation $H \subseteq Q \times Q_E$ s.t. $(\hat{q}, \hat{q}_E) \in H$ and for all $(q, q_E) \in H$: (i) $f_E(q_E) \models F(q)$; and (ii) if $q' \in \delta(q)$, then there exists a state $q'_E \in \delta_E(q_E)$ s.t. $(q', q'_E) \in H$.

As for stateless interfaces, an interface's properties are satisfied after we compose any of its implementations $f$ with any of its admissible environments $f_E$.

Proposition 3. For all well-formed interfaces $F$, and all relations $H \subseteq Q_f \times Q$ and $H_E \subseteq Q \times Q_E$ that witness $f \models F$ and $f_E \models F$, respectively, it holds: (i) $(\mathcal{M}(\hat{q}_f) \cup \mathcal{E}(\hat{q}_E))^{*} \cap \mathcal{P}(\hat{q}) = \emptyset$; and (ii) for all $q \in Q$ that are reachable from $\hat{q}$, if $(q_f, q) \in H$ and $(q, q_E) \in H_E$, then $(\mathcal{M}(q_f) \cup \mathcal{E}(q_E))^{*} \cap \mathcal{P}(q) = \emptyset$.
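For finite state spaces, the existence of a witness relation for the implementation part of Definition 11 can be decided by computing the greatest simulation. The sketch below uses the standard fixed-point refinement (start from all pairs satisfying condition (i) and remove pairs violating condition (ii)); this algorithm is our choice for illustration, not one prescribed by the text, and the two-state systems are hypothetical.

```python
def implements(component, interface):
    """component = (Qf, q0f, delta_f, M); interface = (Q, q0, delta, G).

    delta maps a state to its set of successors; M and G map a state to
    its flows and guarantee no-flows, respectively.
    """
    Qf, q0f, delta_f, M = component
    Q, q0, delta, G = interface
    # Condition (i): the stateless component avoids the forbidden flows.
    H = {(qf, q) for qf in Qf for q in Q if not (M[qf] & G[q])}
    changed = True
    while changed:
        changed = False
        for (qf, q) in list(H):
            # Condition (ii): every component step is matched by the interface.
            matched = all(any((qf2, q2) in H for q2 in delta[q])
                          for qf2 in delta_f[qf])
            if not matched:
                H.discard((qf, q))
                changed = True
    return (q0f, q0) in H

# Interface: initially unconstrained, then x must no longer flow to y.
iface = ({"q0", "q1"}, "q0",
         {"q0": {"q1"}, "q1": {"q1"}},
         {"q0": set(), "q1": {("x", "y")}})

good = ({"s0", "s1"}, "s0",
        {"s0": {"s1"}, "s1": {"s1"}},
        {"s0": {("x", "y"), ("y", "y")}, "s1": {("y", "y")}})

bad = ({"s0", "s1"}, "s0",
       {"s0": {"s1"}, "s1": {"s1"}},
       {"s0": {("x", "y"), ("y", "y")}, "s1": {("x", "y"), ("y", "y")}})
```

The component `good` stops the flow from `x` to `y` after its first step and is accepted, while `bad` keeps the flow in `s1` and therefore has no witness relation.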

Composition of two components is defined as their synchronous product. The composition of two interfaces is defined as their synchronous product, as well. However, we only keep the states that are defined by the composition of two compatible stateless interfaces.

Definition 12. Let $F$ and $F'$ be two interfaces. Their composition is defined as the tuple $F \otimes F' = (X_{F,F'}, Y_{F,F'}, Q_{F,F'}, \hat{q}_{F,F'}, \delta_{F,F'}, \mathcal{A}_{F,F'}, \mathcal{G}_{F,F'}, \mathcal{P}_{F,F'})$, where: $\hat{q}_{F,F'} = (\hat{q}, \hat{q}')$ and $Q_{F,F'} = \{\hat{q}_{F,F'}\} \cup \{(q, q') \mid F(q) \sim F'(q')\}$; $(q_2, q'_2) \in \delta_{F,F'}(q_1, q'_1)$ iff $q_2 \in \delta(q_1)$ and $q'_2 \in \delta'(q'_1)$; and for all $(q, q') \in Q_{F,F'}$: $F_{F,F'}(q, q') = F(q) \otimes F'(q')$.

Two stateful interfaces are compatible if the stateless interfaces defined by their initial states are compatible, i.e. $F(\hat{q}) \sim F'(\hat{q}')$. It follows from the results proved for the stateless interfaces that compatibility is commutative, composition preserves well-formedness and stateful interfaces support incremental design.
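The product construction of Definition 12 can be sketched independently of how stateless interfaces are represented. In the code below, `compatible` and `compose` are caller-supplied stand-ins for the stateless relation $\sim$ and operator $\otimes$; the string labels and the toy predicate are purely hypothetical, and restricting successors to the kept states is an assumption of this sketch.

```python
def product_interface(F1, F2, compatible, compose):
    """Synchronous product of stateful interfaces (states, init, delta, label)."""
    Q1, q01, d1, L1 = F1
    Q2, q02, d2, L2 = F2
    init = (q01, q02)
    # Keep the initial pair plus every pair of compatible stateless interfaces.
    states = {init} | {(q, p) for q in Q1 for p in Q2
                       if compatible(L1[q], L2[p])}
    # Both interfaces move together; successors outside `states` are dropped.
    delta = {(q, p): {(q2, p2) for q2 in d1[q] for p2 in d2[p]} & states
             for (q, p) in states}
    label = {(q, p): compose(L1[q], L2[p]) for (q, p) in states}
    return states, init, delta, label

# Toy stand-ins: labels are opaque strings instead of stateless interfaces.
F1 = ({"a0", "a1"}, "a0", {"a0": {"a1"}, "a1": {"a1"}},
      {"a0": "open", "a1": "strict"})
F2 = ({"b0", "b1"}, "b0", {"b0": {"b0", "b1"}, "b1": {"b1"}},
      {"b0": "safe", "b1": "leaky"})

toy_compatible = lambda l1, l2: (l1, l2) != ("strict", "leaky")
toy_compose = lambda l1, l2: (l1, l2)

states, init, delta, label = product_interface(F1, F2, toy_compatible, toy_compose)
```

In this toy run the pair `("a1", "b1")` is excluded because its labels are declared incompatible, so the only successor of the initial state is `("a1", "b0")`.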

Proposition 4. If f |= F and g |= G, then f ⊗ g |= F ⊗ G.

Given an interface, we define transitions parameterized by no-flows on its input variables (i.e. with fixed assumptions) or on its output variables (i.e. with fixed guarantees and properties).

Definition 13. Let $F$ be an interface. Input transitions from a given state $q \in Q$ are defined as $\delta^{X}(q) = \{\delta^{X}(q, A) \mid A \subseteq Z \times X\}$ with $\delta^{X}(q, A) = \{q' \in \delta(q) \mid \mathcal{A}(q') = A\}$. Output transitions from a given state $q \in Q$ are defined as $\delta^{Y}(q) = \{\delta^{Y}(q, G, P) \mid G \subseteq Z \times Y \text{ and } P \subseteq Z \times Y\}$ with $\delta^{Y}(q, G, P) = \{q' \in \delta(q) \mid \mathcal{G}(q') = G \text{ and } \mathcal{P}(q') = P\}$.

Interface $F_R$ refines $F_A$ if all output steps of $F_R$ can be simulated by $F_A$, while all input steps of $F_A$ can be simulated by $F_R$. This corresponds to alternating refinement [5].

Definition 14. Interface $F_R$ refines $F_A$, written $F_R \preceq F_A$, iff there exists a relation $H \subseteq Q_R \times Q_A$ s.t. $(\hat{q}_R, \hat{q}_A) \in H$ and for all $(q_R, q_A) \in H$: (i) $F_R(q_R) \preceq F_A(q_A)$; and (ii) for all sets of states $O \in \delta^{Y}_R(q_R)$, there exists $O' \in \delta^{Y}_A(q_A)$ s.t. for all sets of states $I' \in \delta^{X}_A(q_A)$, there exists $I \in \delta^{X}_R(q_R)$ s.t. $(O \cap I) \times (O' \cap I') \subseteq H$.

Fig. 8: Refined interfaces with witness: (a) relation $\{(\hat{q}_1, \hat{q}'_1), (q_2, q'_2)\}$; and (b) relation $\{(\hat{q}_1, \hat{q}'_1), (q_2, q'_2), (q_3, q'_2)\}$.
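Definition 13 groups the successors of a state by their labels. The sketch below enumerates only the non-empty groups, which is an assumption of this example (read literally, the definition also yields the empty set for every label that no successor carries). The guarantee and property labels follow the description of Figure 8(b); the empty assumption labels are hypothetical.

```python
from collections import defaultdict

def input_transitions(q, delta, A):
    """Successors of q grouped by their assumption label."""
    groups = defaultdict(set)
    for q2 in delta[q]:
        groups[frozenset(A[q2])].add(q2)
    return dict(groups)

def output_transitions(q, delta, G, P):
    """Successors of q grouped by their (guarantee, property) labels."""
    groups = defaultdict(set)
    for q2 in delta[q]:
        groups[(frozenset(G[q2]), frozenset(P[q2]))].add(q2)
    return dict(groups)

delta = {"q1": {"q2", "q3"}, "q2": set(), "q3": set()}
A = {"q1": set(), "q2": set(), "q3": set()}            # hypothetical labels
G = {"q1": set(), "q2": {("x", "y")}, "q3": {("x", "y"), ("x'", "y")}}
P = {"q1": set(), "q2": {("x", "y")}, "q3": {("x", "y")}}

ins = input_transitions("q1", delta, A)
outs = output_transitions("q1", delta, G, P)
```

Here the two successors share one input transition, since their assumptions coincide, but form two distinct output transitions because their guarantees differ.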

Example 5. In Figure 8 we depict two examples of refined stateful interfaces.

In Figure 8(a) the stateless interface in each state only uses output ports and it only specifies properties. The initial state of both stateful interfaces is the same, so they clearly refine each other. As there are no assumptions and guarantees, by Definition 14, we need to check that for all successors $q_s$ of the initial state in the refined interface, there exists a successor $q'_s$ of the initial state in the abstract interface such that $\mathcal{P}_A(q'_s) \subseteq \mathcal{P}_R(q_s)$. This holds for the states $(q_2, q'_2)$. Hence the relation $\{(\hat{q}_1, \hat{q}'_1), (q_2, q'_2)\}$ witnesses the refinement. Note that the refined interface is obtained by removing a nondeterministic choice on the transition function.

The witness relation for the refinement depicted in Figure 8(b) is $\{(\hat{q}_1, \hat{q}'_1), (q_2, q'_2), (q_3, q'_2)\}$. The initial states are the same, so condition (i) in Definition 14 is trivially satisfied. The refined interface has two distinct output transitions from the initial state $\hat{q}_1$. It can either go to state $q_2$ by choosing the set of guarantees and property with only one element $(x, y)$ or it can transition to state $q_3$ by committing to the set of no-flows $\{(x, y), (x', y)\}$ for the guarantees and $\{(x, y)\}$ as property. From the initial state of the abstract interface, there exists only one input transition possible, to assume that $x$ does not flow to $x'$ and $y'$ does not flow to $x$. The following holds for both states accessible from the initial state in the refined interface: $\mathcal{A}_R(q_2) \subseteq \mathcal{A}_A(q'_2)$ and $\mathcal{A}_R(q_3) \subseteq \mathcal{A}_A(q'_2)$. The refined interface specifies an alternative transition from the initial state (represented by state $q_3$) that allows more environments while restricting the implementation and preserving the property.

Theorem 5. Let $F' \preceq F$. (a) If $f \models F'$, then $f \models F$. (b) If $f_E \models F$, then $f_E \models F'$.

Theorem 6 (Independent implementability). For all well-formed interfaces $F'_1$, $F_1$ and $F_2$, if $F'_1 \preceq F_1$ and $F_1 \sim F_2$, then $F'_1 \sim F_2$ and $F'_1 \otimes F_2 \preceq F_1 \otimes F_2$.

The composition operation on stateful information-flow interfaces can be generalized to distinguish between compatible and incompatible transitions of interfaces when they are composed. Usually this is done by labeling transitions with letters from an alphabet, so that only transitions with the same letter can be synchronized. While necessary for practical modeling, we omit this technical generalization to allow the reader to focus on the novelty of our formalism, which is the ability to specify information-flow constraints (environment assumptions, implementation guarantees, and global properties) at each state of an interface.

#### 5 Related Work

To the best of our knowledge, we are the first to provide a theory for top-down and bottom-up design of information-flow system requirements that supports both incremental design and independent implementability of systems. The literature on information flow closest to our work focuses on its semantic aspects. The novelty of our work lies in the explicit separation of the structural concerns from the semantic aspects of information flow.

Language-based techniques have proved useful to verify and enforce information-flow policies [29]. Examples range from type systems [15] to program analysis using program-dependency graphs (PDGs) [18,16]. In our approach we aim at composition and refinement notions that are independent of the language adopted for the implementations.

Information-flow properties can be specified with respect to the observed behavior of a system, in which each of its execution runs is abstracted as a trace. In this approach, properties often compare multiple executions of a system to certify that no forbidden flow can be deduced by an observer. Such properties over multiple execution traces are called hyperproperties [12]. Temporal logics [26], like LTL or CTL\*, are used to specify trace properties of reactive systems. HyperLTL and HyperCTL\* [11] extend temporal logics by introducing quantifiers over path variables. They allow relating multiple executions and expressing information-flow security properties [12,11]. Epistemic temporal logics (ETL) [9] provide the knowledge connective with an implicit quantification over traces. With ETL we can reason about the knowledge gain of agents over time. Then, we can specify which information can be learned by the agents while interacting with the system [6]. All these LTL extensions reason about closed systems while our approach allows compositional reasoning about open systems. Moreover, we focus here on the structural aspect of information flow, and not yet on its semantic interpretation. Thus, all information-flow trace-based semantics are orthogonal to our approach.

Interface theories belong to the broader area of contract-based design [8], originally popularized by Meyer [24], following earlier ideas introduced by Floyd and Hoare [14,19]. Our theory follows closely the philosophy for formal frameworks for systems design introduced for Interface automata (IA) [1] and Assume/Guarantee (A/G) [2] interfaces. Interface theories were later extended with extra-functional requirements such as resource [10], timing [4,13] and security [21] requirements. Unlike in previous interface formalisms, we had to introduce the notion of properties, which capture the intent of the designer and can be used to steer the refinement of interfaces.

Interface for structure and security (ISS) [21] is a variant of IA that enables specification of two types of actions on (1) low and (2) high confidential information. ISS uses a bisimulation-based notion of non-interference that checks whether the system behaves in the same way when high actions are performed or when they are considered hidden actions. Our approach is orthogonal to IA and their extensions: we do not characterise the type of actions of each component, but only their input/output ports, defining explicitly the information-flow relations between variables.

Our approach took inspiration from relational interfaces (RIs) [31]. RIs specify the legal inputs that the environment is allowed to provide to the component along with the legal outputs that the component can generate when provided with these inputs. RIs do not have assumptions and guarantees defined separately. Instead, they have a contract that specifies the desired input-output behavior. A contract in RIs is expressed over individual traces. Thus, an RI contract can only relate input and output values in a trace, and not across multiple traces. This considerably restricts the expressivity of RIs concerning information-flow properties. Besides, RIs are trace-based interfaces, while in our approach we focus on the structural aspect of information flow, which may change from state to state (in the stateful case). Our approach can be seen as a limited way to introduce relational properties into A/G interfaces, namely solely for guiding refinement. This limited way avoids many of the technical complexities of general relational interfaces [31].

#### 6 Conclusion

We propose a novel interface theory to specify information-flow properties. Our framework includes both stateless and stateful interfaces and supports both incremental design and independent implementability. To achieve this, unlike in previous interface formalisms, we introduce the notion of properties which captures the intent of the designer for the interaction between assumptions and guarantees. Moreover, properties can be used to steer the refinement of interfaces. It will be interesting to study the introduction of such design-guiding properties in the context of other interface languages.

As future work, we will explore how to extend our theory with sets of must-flows, i.e. support for modal specifications [27]. This would make it possible, for example, to specify flows that a state q must implement so that the system can transition to a different state, which is useful for specifying declassification of information. Another direction is to explore trace semantics for our interfaces.

#### References


Conference (ISSC). pp. 4–9. Institution of Engineering and Technology (2008). https://doi.org/10.1049/cp:20080630


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## A Survey-driven Feature Model for Software Traceability Approaches

Edouard Romari Batot<sup>1</sup>, Sebastien Gérard<sup>2</sup>, and Jordi Cabot<sup>1,3</sup>

- <sup>2</sup> CEA LIST, Paris, France, sebastien.gerard@cea.fr
- <sup>3</sup> ICREA, Barcelona, España, jordi.cabot@icrea.cat

Abstract. Traceability is the capability to represent, understand and analyze the relationships between software artefacts. Traceability is at the core of many software engineering activities. This is a mixed blessing, as traceability research is scattered among various research subfields, which impairs a global view and the integration of the different innovations around the recording, identification, evaluation and management of traces. This also limits the adoption of traceability solutions in industry.

In this sense, the goal of this paper is to present a characterization of the traceability mechanism as a feature model depicting the shared and variable elements in any traceability proposal. The features in the model are derived from a survey of papers related to traceability published in the literature. We believe this feature model is useful to assess and compare different proposals and to provide a common terminology and background. Beyond the feature model, the survey we conducted also helped us to identify a number of challenges to be solved in order to move traceability forward, especially in a context where, due to the increasing importance of AI techniques in Software Engineering, traces are more important than ever in order to be able to reproduce and explain AI decisions.

## 1 Introduction

The need for traceability has always been salient in software and systems development. Across the years, there has been a continuous interest in developing techniques to facilitate the representation and analysis of traces and links between related artefacts. Such traces help explain the execution and evolution of these artefacts, as required in many software engineering activities and disciplines such as code generation, program understanding, software maintenance, and debugging.

The importance of traceability was first recognized in system engineering, especially in relation to the development and certification of critical systems, where it is a primary concern. As an example, traceability is part of any certification mechanism in all commercial software-based aerospace systems as stated in documents like the RTCA DO-178C (2012) [76,62]. The consideration of various levels of abstraction in software development, and of the meaning of verification in the model-based development paradigm (in which abstract representations, i.e. models, are the core artefacts for conceptualization), was later introduced with companion documents (specifically, DO-331). The automotive industry has followed the same path with the construction of an international standard for functional safety, ISO 26262 [46].

Despite this strong evidence of the need for explicit (and automated) tracing abilities in software development, traceability is not widely adopted, and is even less often automated. There is little feedback from its concrete use in industry beyond the critical domains above [75] and, where such use exists, it ends up being mostly a manual process [55]. Moreover, with no standard definition or representation of traces, it is difficult to bridge the gaps between the different partial traceability solutions existing in research subfields [4,102,101]. Even the software engineering body of knowledge does not seem to properly consider the power of traceability, as it *mentions* traceability only once [16].

The foundation for an effective modelling of traceability is scattered across a profuse literature. Approaches vary greatly in their means and goals. Moreover, most focus on specific pairs of artefacts and therefore remain difficult to integrate in different industrial scenarios. Note also that this happens in a context where artificial intelligence techniques are being integrated in development processes, raising reproducibility and explainability concerns, both of which require the assistance of traceability mechanisms.

This paper aims to provide a comprehensive perspective on the state of the art of traceability techniques in software development and their limitations. Our short-term goal is to facilitate the evaluation and comparison of current solutions. Our mid-term goal is to accelerate the development of new traceability solutions that could benefit from the existing ones, thanks to our new conceptualization in the form of a feature model describing the potential dimensions and concerns a traceability solution may wish to consider. We do not create the feature model based only on our (partial) knowledge and expertise in the domain. Instead, we ground our classification in a survey of the published literature in this field. According to this survey, we group the traceability features in three main dimensions: trace definition, trace identification and trace management, with the corresponding feature hierarchies for each of them.

The paper is organized as follows. Section 2 gives an overview of the scientific work related to traceability. We then recall some basic terminology in Section 3. Section 4 describes how we conducted our literature review and Section 5 presents a detailed feature model derived from the survey of the retrieved works. This analysis also helps us to propose a number of discussion points and open challenges in Section 6 before concluding this work.

#### 2 State of the art of software traceability

Traceability was proposed, from the very beginning of software engineering, to ensure that a system being developed actually reflects its design. Already in the original NATO working conference, quality projects were praised for making "the system that they are designing contain explicit *traces* of the design process" [81]. From that point on, traceability has been studied from a myriad of perspectives, dimensions and applications.

Historically, traceability started in requirement engineering. The very idea of following the impact of changes in the requirements on other artefacts (and backward) was then, and remains today, the most prominent goal [35]. Precise and rich requirements allow a proper follow-up of their later implementations [21]. Over time, the use of traces (*i.e.,* the record of (inter-)dependencies between artefacts) has proved applicable to most if not all spheres of software maintenance. The use of traces spans software certification and testing, feature location, debugging, code generation, and so on. With the proliferation of traceability purposes, some authors explicitly asked for better sharing of experiences in using traceability [36] and evaluating the solutions existing so far [91]. Surveys and literature reviews trying to group and compare them began to appear as well, though most of them focused on specific subareas such as requirement engineering [35,15], model-driven development [32,101,70,86,63], software product lines [96,3], benchmarking [91], and information retrieval [23,13,39]. To complement these scientific surveys, Konigs *et al.* survey the industrial application of traceability approaches, showing their limited penetration [52]. Neumuller *et al.* show that the adoption is worse in small businesses where traceability is even less automated [67]. Finally, Charalampidou *et al.* add to the conclusion of other surveys that "although many studies include some empirical validation", there is still much to be done with respect to validation and reproducibility [20].

This is aggravated by the fact that, as pointed out above, many of the proposals belong to different research subfields, which limits the discovery and awareness of alternative solutions. For instance, authors point out that researchers in requirement engineering and in model-based development do not communicate enough with each other [101,70,85]. This lack of communication and shared understanding is one of the open challenges in the traceability domain [22,4,28]. To solve this issue, several works aim at proposing specific traceability models. Unfortunately, many investigations suffer from a lack of generalizability due to the specific nature of the problem being solved (*e.g.,* certification conformity [51], model transformation coevolution [38]), or the specific nature of the solution considered (*e.g.,* w.r.t. its language: SysML [65], w.r.t. its engineering field: SPL [3], agile [60]).

As an example, the automatic identification of trace links is one of the most studied features. There are plenty of proposals but, as they are evaluated using different datasets and configurations, they cannot be directly compared [89,39,13]. Another example is model-driven engineering, where the use of traceability-specific languages together with automated model transformation appears as fertile ground for end-to-end traceability. This led authors to present classifications and terminologies for a systematic perspective on the tracing of MDE development [70,28,85]. Nevertheless, proposals tend to focus on a specific model-driven engineering problem, the co-evolution of models and transformations [2], instead of aiming for more general solutions. Mustafa *et al.* argue that "the main issues in traceability nowadays are building traceability models that can accommodate the capturing of traceability information and providing common semantics for trace links" [63]. As a result of this confusing situation, authors asked for more standardized practices. Two proposals gather fundamental and model-based terminology [36,45]. We take our general knowledge about traceability from them and add to their definitions an actionable categorization for existing and coming traceability approaches.

We agree with these authors that this lack of a *de jure / de facto* standard is hampering the benefits of current solutions and hindering evolution in the field. This paper intends to cover this gap by proposing a traceability characterization that stems from the analysis of existing proposals. We believe this model can be useful to researchers trying to improve traceability techniques in any subfield and to practitioners looking for a way to compare and choose the traceability solution that best suits their needs.

## 3 Towards a common traceability terminology

A clear conclusion from the previous section is the lack of a commonly agreed-upon conceptualization for traceability that helps evaluate, compare and reuse traceability solutions over a variety of scenarios and application domains. Thus, the *incoherency problem* still arises in traceability research [100]. Even if an individual article makes a claim that withstood rigorous testing and statistical analysis, it might not use the same words as an adjacent article, or it might use the same words but intend different meanings. For instance, the term *traceability* is used to designate both the ability to trace system elements, and the traceability links (the relations) themselves [15,4].

Therefore, before proposing our global traceability feature model to classify traceability solutions, we first recap the different usages of the key traceability concepts and propose a unified definition that we will use in the rest of the paper.

#### 3.1 Traceability components

Traceability research refers mainly to a definition from Gotel *et al.* that defines traceability as the ability to describe and follow the life-cycle of a requirement, from its initial specification to the design and code elements of the system implementing it [35]. This is still the most popular meaning for traceability [15,7], even if modeling approaches try to generalize this notion by seeing traceability as a valuable tool to link all types of artefacts at either the same or different levels of abstraction [56,95].

Regardless of the specific interpretation of traceability, we observe a division of knowledge into four main areas:


The first area is a high-level concern that influences the requirements of the other three to cover the specific needs of a project. These three will therefore be used to structure our feature model later on. Note that the representation component should be part of any traceability solution as it is the base component to be able to, at the very least, express traceability information.

#### 3.2 Traceability glossary

We propose some general definitions for the most frequently encountered traceability terms while searching for and studying solutions for traceability in any of the above categories. These definitions, mostly borrowed from past literature [36,45], aim to encompass the different uses and dimensions of traceability depicted above. Our set of terms is not exhaustive but provides a common core generic enough to be then adapted to specific scenarios. This is also why we try to be precise with the definitions, while also offering room for slightly different (but compatible) interpretations.


Links can be explicit or implicit. An implicit link relates artefacts at a syntactic or semantic level without the need for an explicit link to be part of the model (*e.g.,* a binary class and its respective source code artefact are implicitly "linked" to each other, yet this relation is not part of any language or grammar definition) [70].

Fig. 1: Survey Process.


On top of these concepts, a recent work by Holtmann *et al.* makes a distinction between a *foundational* and a *specifically model-based* terminology [45]. The latter adds a specification for *model* and *language* scope definitions, as well as a distinction between *relational* and *referential* trace links.


Some of these concepts will explicitly appear in our traceability feature model. Others act as requirements and usages that should be supported or facilitated by the features in the model. They should be taken into account when choosing a traceability solution, depending on how well that solution covers the features of interest for the project at hand.

#### 4 Traceability Survey method

In this section we describe the methodology we followed to collect papers proposing traceability solutions, including at the very least the core *representation* component (see previous section). The analysis of these papers gives rise to the feature model we present next.

The selection process combined the manual selection of a few approaches based on our own experience working in this field and/or covered by other meta-studies [36,4,22,39] together with a systematic literature search mining bibliographic data sources following the literature review process established by Kitchenham and Charters [49]. Fig. 1 depicts the three main steps of the process.

#### 4.1 Data source and search strategy

We used DBLP [10] as our core electronic database to search for primary studies on traceability. To avoid missing possibly relevant approaches, we decided not to constrain the search to a specific period. However, we limited the search to papers of five pages or more in order to exclude opinion and vision papers, posters, tool demos and other types of short papers, thereby reducing the number of results while maximizing their quality.

Based on the topic of this survey, we defined the terms of the search query according to the recommendations of Kitchenham and Charters [49]. We applied the query to the titles and abstracts of potentially relevant publications. As very generic terms like "trace" or "traceability" returned thousands of results, we decided to combine trace-related keywords with language-related ones in the search query, since we target traceability proposals that, at the very least, discuss how traces need to be represented or expressed, rather than only discussing their application to some specific domain without going into the details. As many traceability languages are model-based, we included model, modeling, and other core MDE concepts as part of the language variations. This resulted in a total of 203 papers.

Here is the exact query we applied:

```
.*(([Tt]rac(eability|ing))|([Tt]race[rs])).* AND
.*(([Mm]odel[- ])(([Dd]riven)|([Bb]ased))|
MD[DAE]|Model[l]ing|[Tt]ransformation| DSL|[Ll]anguage).*
```
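
To make the query concrete, the following Python sketch applies its two conjuncts to a piece of text. It is only an illustration: splitting the query at `AND` into two patterns, dropping the surrounding `.*` in favour of `re.search`, and the names `TRACE`, `LANGUAGE` and `matches_query` are assumptions of this sketch, and the DBLP search engine itself is not modelled.

```python
import re

# First conjunct of the query: trace-related keywords.
TRACE = re.compile(r"([Tt]rac(eability|ing))|([Tt]race[rs])")

# Second conjunct: language- and MDE-related keywords. The line breaks of the
# printed query are removed; the space before "DSL" is kept as printed.
LANGUAGE = re.compile(
    r"(([Mm]odel[- ])(([Dd]riven)|([Bb]ased))|"
    r"MD[DAE]|Model[l]ing|[Tt]ransformation| DSL|[Ll]anguage)"
)


def matches_query(text: str) -> bool:
    """Return True when both conjuncts match somewhere in the text."""
    return bool(TRACE.search(text)) and bool(LANGUAGE.search(text))


# Hypothetical titles, invented for illustration.
titles = [
    "A Model-Driven Approach to Requirements Traceability",
    "Tracing Contaminants in Groundwater",
]
selected = [title for title in titles if matches_query(title)]
```

Only the first title is kept: the second one contains a trace-related keyword but no language-related one.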
#### 4.2 Pruning

In what follows, we describe our inclusion and exclusion criteria. We further explain how we applied these criteria on the previous set of papers.


3. traceability is the main concern of the paper

Before we applied these criteria to the potential papers fetched by our query, we automatically removed papers fewer than five pages long. We also automatically removed papers whose titles mentioned "biology", "education", "kinetics", "logistics", "physiology", "physics", "neuroscience", "agriculture", and "food", each of which appeared in a couple of results. We manually examined the 183 papers left and excluded 40 papers that did not fulfill the criteria or were duplicates.
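
The automatic part of this pruning step can be sketched as a simple predicate. The function name, its parameters and the case-insensitive substring matching are assumptions made for illustration; the text does not say how the filter was actually implemented.

```python
# Title words listed in the text as grounds for automatic removal.
EXCLUDED_TITLE_WORDS = (
    "biology", "education", "kinetics", "logistics", "physiology",
    "physics", "neuroscience", "agriculture", "food",
)
MIN_PAGES = 5  # papers shorter than this are dropped


def passes_automatic_filter(title: str, pages: int) -> bool:
    """Keep a paper only if it is long enough and its title is on topic."""
    if pages < MIN_PAGES:
        return False
    lowered = title.lower()  # assumption: matching ignores case
    return not any(word in lowered for word in EXCLUDED_TITLE_WORDS)


# Hypothetical records (title, page count), invented for illustration.
candidates = [
    ("Traceability in Model Transformations", 12),
    ("Tracing Language Use in Food Logistics", 10),
    ("A Traceability DSL: Tool Demo", 4),
]
kept = [title for title, pages in candidates if passes_automatic_filter(title, pages)]
```

The papers that survive this filter are the ones examined manually against the inclusion and exclusion criteria.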

#### 4.3 Snowballing

At the end of the previous steps, we double-checked that we did not miss any potentially relevant approach for a number of reasons, *e.g.*, some workshop papers are only indexed by ACM, and some papers may use synonyms for traceability such as "composition" or "extension".

Finally, we added papers we were aware of based on direct knowledge or from other surveys we had read (if not already in the result set) and a few more we found by snowballing on the references of the selected papers. They amount to a total of 10 more papers. This led to a final result of 159 papers. Among them, there are 41 journal articles, 82 in conference proceedings, and 36 workshop reports (see Table 1). Fig. 2 shows the chronological distribution of the selected publications.

Fig. 2: Papers selected related to traceability and modeling.


Table 1: Publication types of the selected papers.

#### 4.4 Threats to validity in the selection process

We acknowledge limitations in the execution of our survey method. First, we only used DBLP as a source database. Yet, it is recognized as a representative electronic database for scientific publications on software engineering and already contains more than five million publications from over two million authors. Setting the limit based on the number of pages alone to exclude short papers is another threat to validity. Yet, it is a reproducible practice that limits the number of papers to analyse and thus helps concentrate on the topic rather than the engineering of the survey. Finally, the vocabulary related to traceability is scattered among various fields of application with their respective nuances. We mitigate the risk of missing papers by manually adding papers that were not using variations of this term but were still referenced by papers that did. Still, focusing on traceability as a key term was also a conscious decision as we wanted to characterize the works in this field, focusing on those papers that define themselves as part of it.

#### 5 A feature model to characterize software traceability

This section presents our feature model describing the traceability features and dimensions found in the analysis of the literature. Our feature model groups them by similarity and provides additional descriptions of the most important aspects of each one, *e.g.*, the existing alternative implementations of the same feature and/or the most and the least studied ones in each group. The next subsections provide some background on feature modeling and then zoom in on each of the three main dimensions of traceability: trace representation, trace identification, and trace management.

#### 5.1 Introduction to feature modelling

A feature model leverages features as the abstraction mechanism to reason about product variability. It is a hierarchically arranged set of features, where relationships between a parent feature and its child features may be categorized as: *and* – all subfeatures must be selected, *alternative* – only one subfeature can be selected, *inclusive or* – one or more can be selected, *mandatory*, and *optional* [48]. Each feature represents an increment in product functionality.
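
As a minimal sketch of these relationship kinds, the Python code below encodes a feature tree and checks whether a set of selected feature names is a valid configuration. The class and function names, the choice to express *and* groups through per-child mandatory/optional flags, and the toy model are all assumptions made for illustration; they do not reproduce the feature model of this paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Group(Enum):
    AND = "and"                  # each child follows its own mandatory/optional flag
    ALTERNATIVE = "alternative"  # exactly one child is selected
    OR = "or"                    # one or more children are selected


@dataclass
class Feature:
    name: str
    mandatory: bool = True       # only consulted when the parent group is AND
    group: Group = Group.AND     # how the children of this feature are related
    children: list["Feature"] = field(default_factory=list)


def names(feature: Feature) -> set[str]:
    """All feature names in the subtree rooted at `feature`."""
    result = {feature.name}
    for child in feature.children:
        result |= names(child)
    return result


def is_valid(feature: Feature, selected: set[str]) -> bool:
    """Check a configuration against the subtree of a feature assumed selected."""
    chosen = [c for c in feature.children if c.name in selected]
    if feature.children:
        if feature.group is Group.ALTERNATIVE and len(chosen) != 1:
            return False
        if feature.group is Group.OR and not chosen:
            return False
        if feature.group is Group.AND and any(
            c.mandatory and c.name not in selected for c in feature.children
        ):
            return False
    for child in feature.children:
        if child.name in selected:
            if not is_valid(child, selected):
                return False
        elif names(child) & selected:
            return False  # a descendant is selected without its parent
    return True


# Toy model, invented for illustration only.
model = Feature("Solution", children=[
    Feature("Representation"),  # mandatory child of an AND group
    Feature("Identification", mandatory=False, group=Group.OR, children=[
        Feature("Manual"), Feature("RuleBased"),
    ]),
    Feature("Storage", group=Group.ALTERNATIVE, children=[
        Feature("Database"), Feature("Spreadsheet"),
    ]),
])
```

With this toy model, selecting `Representation`, `Storage` and `Database` is a valid configuration, whereas selecting both `Database` and `Spreadsheet` violates the *alternative* group.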

Feature modeling is a technique that has been intensively used for documenting the points of variability in a software product line, how the points of variability constrain one another, and what constitutes a complete configuration of the system. But beyond product lines, feature models are also increasingly used to shed light on complex domains by representing the core concerns and variation points in complex ecosystems (*e.g.*, [17]), as we do in this paper.

#### 5.2 Trace definition and representation

All approaches must discuss their representation of trace artefacts even if they can differ on the type of traces they consider and the application they target. Representations are so diverse that our survey selected more than 80 papers mentioning their own distinct definition for traceability, with 20 metamodels effectively depicted in those papers. Some researchers present generic graph-based representations [87,37] while others focus on representations much more specific to a concrete application, like a metamodel for change impact analysis [34] or multi-model consistency [94]. In both cases, approaches differ in what they target and in how they represent a trace.

Fig. 3 shows the hierarchy of features related to the definition and the representation of trace artefacts. A particular focus is put on the typing of traces' relationships. Typing relationships is important to add semantics to the trace so that the engineer can know not only what the linked artefacts are but also why they are linked. As such, it facilitates the application of traceability solutions to specific domains. We also detail the genericity of the language, the nature of the artefacts covered by the traceability proposal, and the possibility to annotate traces with quality properties.

Fig. 3: Features related to the representation of a trace.

We would like to highlight the contribution of model-based approaches for traceability in this section. The use of MDE tooling such as ATL [84,47], or the Eclipse Modeling Framework (EMF) allows the automated generation of traceability information as a side effect of executing operations [32,101]. The modeling community has proposed metamodels for end-to-end traceability [43,41], as well as metamodels specific to engineering domains such as model transformation [47,3,97,11] or software product lines [47,97]. Paige *et al.* call for more flexible modeling where models of different formats are associated with each other through annotations that allow automated bond or dependency inference between both application and engineering domains [89,72].

**Language** Languages specific to traceability provide the ability to represent trace artefacts with increased relevance and accuracy. Yet, they often suffer from being built *ad hoc* and offer little reusability in other domains. Among these domain-specific languages for traceability, some authors attempt a generic definition of traceability [43,6] while others provide a language specific to a single domain, *e.g.*, traceability for software product lines [3].

We found few studies interested in the use of general-purpose software languages for traceability, even though this would be appealing to industrial partners interested in instrumenting their legacy systems code with traceability information to facilitate future evolution or migrations [65]. Representing traces in spreadsheets, text files, or databases offers a gentler learning curve than using a domain-specific language, but at the cost of a cognitive gap between software engineers and domain experts. As an unfortunate consequence, "the maintenance costs turns out to grow accordingly [to the usability of generic representations] and team members fail to keep the trace artefacts up-to-date" [21].

A potential sweet spot lies in the making of orthogonal approaches that "plug" traceability concerns on top of other languages to benefit from an existing language structure while keeping most of the benefits of using a DSL.

**Artefacts targeted** We distinguish between the nature of the artefacts targeted by traceability purposes and their granularity, as both dimensions are important. For the nature aspect, on the one hand, investigations differ on the development phase they target. Linking requirement specifications to design and code level predominates in the literature, with more than 50% of the papers in the survey addressing requirement traceability. Other phases such as test and verification are targeted as well but in a lesser proportion (10 approaches). On the other hand, the type of the artefacts is important to deduce the level of potential generalization to other phases of the software lifecycle. Papers focus on four different types: unstructured documents, structured grammar-based artefacts, structured model-based artefacts, and binaries.

With regard to the granularity of the artefacts targeted, *i.e.*, their level of decomposition, few approaches go for a customizable granularity to adapt to artefact hierarchies [43,60] while most of the others focus on specific types of artefacts (*e.g.,* to concentrate their work on specific optimizations of trace identification).

**Relationship types** As many authors have demonstrated, offering the user the ability to define personalized types of relations between the artefacts of a system fosters the comprehensibility of the traces produced [68]. We distinguish between approaches offering predefined types and approaches allowing custom typing. Often the predefined types relate to the field of software engineering (implements, inherits, uses, executes ...), but not exclusively. For example, Maletic *et al.* mention that a separation between *causal*, *non causal*, and *navigation* relationships can be appropriate [57]. Predefined types allow increased monitoring and user-friendliness to developers. They are found in most contributions concerned with the optimization of trace identification. On the other hand, allowing users to define the types of relationships specific to their area of expertise helps to fill the gap between the design and the use of tracing functionalities [102].

Obviously, a fixed typing facilitates the analysis of the traces, as the potential set of semantics and interpretations is fixed, while offering domain-specific types increases the usability and comprehensibility of the approach. As an example, SysML v2 offers a more powerful mechanism to define links between artefacts compared to the previous SysML version (where we had a sole dependency-like mechanism).
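
The trade-off between predefined and custom relationship types can be illustrated with a small data structure. In the sketch below, the enumeration values come from the examples given above, while the class names, the use of a free-form string for custom types, and the artefact identifiers are assumptions made for illustration. Sources and targets are tuples, so that a link may relate several artefacts on either side.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class PredefinedType(Enum):
    IMPLEMENTS = "implements"
    INHERITS = "inherits"
    USES = "uses"
    EXECUTES = "executes"


@dataclass(frozen=True)
class TraceLink:
    sources: tuple[str, ...]               # identifiers of the source artefacts
    targets: tuple[str, ...]               # identifiers of the target artefacts
    link_type: Union[PredefinedType, str]  # a plain string denotes a custom type

    def __post_init__(self) -> None:
        if not self.sources or not self.targets:
            raise ValueError("a trace link needs at least one source and one target")

    @property
    def is_custom(self) -> bool:
        """True when the link uses a user-defined relationship type."""
        return not isinstance(self.link_type, PredefinedType)


# Hypothetical artefact identifiers, invented for illustration.
fixed = TraceLink(("REQ-1",), ("Auth.java",), PredefinedType.IMPLEMENTS)
custom = TraceLink(("REQ-1", "REQ-2"), ("hazard-log.xlsx",), "mitigates")
```

An analysis tool can enumerate `PredefinedType` exhaustively, whereas custom types such as `"mitigates"` carry domain meaning that only the users who defined them can interpret.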

The literature also shows a distinction between approaches considering relationships with multiple sources and targets and relationships allowing only a single source.

**Trace quality** In most of the papers, quality aspects are barely mentioned. It seems that the quality of the generated traces is not a major focus, or at least storing and annotating the traces with such information is not. Yet, a few studies mention coverage and integrity. The coverage of a set of execution traces is used in approaches for software testing [33]. Coverage is also used by Rath *et al.* who address the problem of missing links between commits and issues with a classifier they train on textual commit information to identify missing links between issues and commits (*i.e.,* a lack in the coverage indicates such missing links) [82]. Matrix-based visualizations are particularly fit to assist coverage-related tasks (see Section 5.4). Integrity of traces is addressed in work on model transformation where co-evolution involves an automatic verification of their coherence with other (versatile) software artefacts [94,92]. In the same manner, Heisig *et al.* tag links whose end artefacts have been modified or deleted to inform the user of such changes [43]. The co-evolution of traces implies measuring distances between artefacts (syntactic, cognitive, geographic, cultural...) [9]. It also refers to the analysis of the changes of the system that impact traceability artefacts [34,98]. In our survey, nine papers address artefacts co-evolution and 17 tackle model transformation limitations. The latter are a valuable tool to automate co-evolution tasks. In the many studies focusing on the optimization of link identification, the quality of the results is mainly evaluated with precision and recall measurements and never relies on inherent characteristics of the trace artefacts. Few researchers include user feedback [13].
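
For reference, precision and recall over trace links can be computed as follows. This is a generic sketch: representing a link as a (source, target) pair, the function name, and returning 0.0 for an empty set are conventions assumed here, not taken from the surveyed papers.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision and recall of retrieved links against a ground truth."""
    true_positives = len(retrieved & relevant)
    # Convention assumed here: an empty denominator yields 0.0.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical links between requirements and source files.
retrieved = {("R1", "a.py"), ("R1", "b.py"), ("R2", "c.py"), ("R3", "d.py")}
relevant = {("R1", "a.py"), ("R2", "c.py"), ("R2", "e.py")}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved links are correct (precision 0.5) and two of the three true links are found (recall 2/3).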

#### 5.3 Trace identification

Fig. 4: Features related to the identification of trace links

Fig. 4 shows the hierarchy of features related to the identification of traces with four main possible categories: the manual elicitation of traces, their live record during execution and evolution, rule-based alternatives to assist the user with automation potential, and AI-augmented identification with domain contextualization.

**Manual elicitation** Manual elicitation makes it possible to create traces in an *ad hoc* manner. As an example, one of our industrial partners chose to hire a developer to elicit the trace links necessary for a certification commitment. This was chosen rather than a (semi-)automated approach, as they were not convinced the effort of augmenting an existing tool would pay off for that specific project.

**Recording instrumentation** Teams can instrument the live recording of traces during the execution and the evolution of software artefacts. This way, traces recording the system changes are a side effect of those same changes. There are initiatives to instrument existing languages such as ATL with rich log generation [84,31], while others consider trace recording an aspect that can be woven into existing languages [78,84]. Ziegenhagen *et al.* mix execution traces with metadata [103], and use developer interaction records [104] to enrich existing traceability artefacts.

Model transformations are considered the heart and soul of software modeling and, consequently, numerous studies attempt to enrich trace generation during transformation execution [97,83,31]. This ubiquitous integration (see Fig. 5, bottom branch) allows a semantically rich tracing of target and source artefacts [71]. Unfortunately, this option can only be applied while the system is being built, not when the system is already in place.

Identification rules Once a system is in place, teams can identify rules that help retrieve and maintain traceability relations [64,93]. Nentwich *et al.* describe a novel semantics for first-order logic that produces links instead of truth values and give an account of their content management strategy, which provides rule-based link generation and consistency checking [66]. At the model level, Grammel *et al.* use a graph-based model matching technique to exploit metamodel matching techniques for the generation of trace links for arbitrary source and target models [37], and Saada *et al.* recover execution traces of model transformations using genetic algorithms [83].

Domain contextualization Back in 1992, Borillo *et al.* published an article on the use of information retrieval techniques for linguistics applied to spatial software engineering [14]. This precursor work paved the way for AI-augmented traceability, where machine learning algorithms help extract knowledge specific to the application domain (later called domain-contextualized traceability [40]). This is especially useful when the source (or target) of the trace link is an unstructured document or when such a document is key to inferring traces among other artefacts.

Today, domain contextualization by means of machine learning for topic modeling, word embedding, and more generally knowledge extraction from unorganized text documents, is the most popular traceability feature [39,102]. This collective effort made the identification of bonds between requirement specifications and other artefacts possible with a gradually improving precision [5,23]. Studies on domain contextualization are separated into three subgroups according to the type of tools used (algebraic information retrieval models, statistical language models, and neural networks). For example, Florez *et al.* derive fine-grained requirement to source code links [30], Rath *et al.* complete missing links between commits and issues [82], Marcus *et al.* identify links between documentation and source code [59]. An interesting publication from Poshyvanyk *et al.* shows that mixing expertise both in information retrieval techniques and engineering domains gives far better results than when taken separately [79]. McMillan *et al.* add that using structural information together with textual information benefits automated link recovery (between requirements and source code) [61]. In total, we found 22 approaches dedicated to this topic alone in our survey. We do not discuss in this paper the techniques related to data collection and training optimization. These are important features for automated learning which are discussed in depth in specialized literature.
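To make the idea of an algebraic information retrieval model more concrete, the following minimal sketch ranks candidate artefacts for a requirement by the cosine similarity of their term-frequency vectors. It is an illustration only: the artefact names and texts are hypothetical, the tokenizer is deliberately naive, and the surveyed approaches rely on richer models (e.g., term weighting, topic models, or embeddings).

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(requirement, artefacts):
    """Return (artefact name, score) pairs, best candidate first."""
    query = term_vector(requirement)
    scores = {name: cosine(query, term_vector(text)) for name, text in artefacts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical artefacts, for illustration only.
artefacts = {
    "auth.py": "check user password and open login session",
    "report.py": "render monthly sales report as pdf",
}
ranking = rank_candidates("The user shall login with a password", artefacts)
```

A candidate link would then be proposed for every pair whose score exceeds a chosen threshold; that threshold is precisely the kind of parameter the evaluation studies discussed in this section try to tune.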

Teams are also using genetic algorithms to cope with the variety of algorithms and parameters these approaches use [58,73], and structural information to foster the interweaving of methodologies [74]. Unfortunately, a common critique has been raised against these positive results. Too many teams compete with each other to achieve better precision and recall when there is no standard for effectively quantifying tracing artefacts into such variables. Too few attempt to qualify the overall relation between these measurements and the effective impact on software development [22].
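For reference, precision and recall of a link recovery technique are usually computed against a reference set of links. The sketch below assumes such a gold standard exists; the link identifiers are hypothetical.

```python
def precision_recall(recovered, reference):
    """Compare recovered trace links against a reference (gold-standard) set."""
    recovered, reference = set(recovered), set(reference)
    true_positives = len(recovered & reference)
    precision = true_positives / len(recovered) if recovered else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical (requirement, file) links, for illustration only.
reference = {("R1", "auth.py"), ("R2", "report.py"), ("R3", "db.py"), ("R4", "ui.py")}
recovered = {("R1", "auth.py"), ("R2", "report.py"), ("R2", "ui.py")}
precision, recall = precision_recall(recovered, reference)
```

Both values depend entirely on how the reference set was built, and neither says whether a recovered link actually helps a developer, which is the core of the critique above.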

In that regard, Shin *et al.* propose a set of guidelines for benchmarking automated traceability techniques. Their evaluation (of 24 approaches) shows that evaluation methods, even when used appropriately, are sometimes not suitable for other application domains and that the variation in results across projects is not investigated [91]. This corroborates Borg *et al.* who, in a systematic literature mapping on information retrieval approaches to traceability, notice that there is no empirical evidence that any IR model consistently outperforms another [13]. The ability to continuously improve the learning process is mentioned in the literature, but we found no evidence of its application.

Tool assessment Very few of the traceability approaches have been empirically assessed on industrial use cases. The current trend of reporting solely precision and recall values indicates an important issue in the automated identification of traces and may explain the weak investment of industry in this sector [13,69].

Borg *et al.* published a taxonomy for information retrieval techniques applied to traceability [12]. They emphasize the importance of assessing the tooling used to derive or identify traces. More specifically, the authors distinguish two orthogonal dimensions: the evaluation context, which specifies *where* in the context the tool is assessed (*e.g.,* at a technical, work task, or project level); and the study environment, which shows the kind of data used to carry out the assessment (*e.g.,* proprietary, open source, or academic). These features affect the measurable attributes used for the assessment as well as their generalizability.

#### 5.4 Trace management

Fig. 5 shows the hierarchy of features related to the management of trace artefacts: their maintenance, integrity, persistence, and integration in running software systems.

Trace Maintenance Trace links may be affected by changes on the artefacts they link (directly or transitively) and can therefore easily become obsolete. This gradual decay must be taken seriously into account to avoid having to re-elicit traces every time they need to be analyzed. Manual maintenance is not always impossible, but it is typically not feasible in practice due to the amount of information such inspections would involve. Co-evolution techniques [64,26,80] attempt to ease the burden of keeping trace links up to date [88,19].

Fig. 5: Tool support for traceability management.

Beyond being able to manipulate traces, we also need to offer proper ways to visualize and inspect them [29]. The use of graphical representations stimulates human perception, and the integration of such techniques in traceability frameworks is a useful feature to augment user awareness [43]. On the other hand, matrix-based views offer a valuable perspective to understand and analyse traces [53]. They are particularly efficient in assisting the visualization of coverage characteristics of traceability [33,82].
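As a small illustration of why matrix-based views suit coverage tasks, the following sketch builds a traceability matrix from a set of links and reads coverage gaps directly off its empty rows. The requirement and test identifiers are hypothetical.

```python
def traceability_matrix(sources, targets, links):
    """Build a boolean matrix: one row per source, one column per target."""
    return [[(s, t) in links for t in targets] for s in sources]


def uncovered(sources, matrix):
    """Sources whose row contains no link at all, i.e. coverage gaps."""
    return [s for s, row in zip(sources, matrix) if not any(row)]


# Hypothetical requirements, test cases, and links.
requirements = ["R1", "R2", "R3"]
tests = ["T1", "T2"]
links = {("R1", "T1"), ("R1", "T2"), ("R3", "T2")}

matrix = traceability_matrix(requirements, tests, links)
for req, row in zip(requirements, matrix):
    print(req, " ".join("x" if cell else "." for cell in row))

gaps = uncovered(requirements, matrix)
coverage = 1 - len(gaps) / len(requirements)
```

Here the empty row for `R2` is the visual cue that a requirement has no associated test.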

In parallel, allowing a rich formulation of queries to assist the exploration of existing traces helps reduce the amount of information users need to navigate through [19]. More precisely, structured text, in the form of metamodel instances or XML sheets, allows query-based mining of trace datasets [24]. Interaction-wise, hypertext links are a *de facto* standard to browse trace links. Indeed, following links through successive clicks has become almost natural. Querying relies on the type of representation of traceability artefacts: SQL-like languages benefit from a long history of information mining, while dedicated languages offer better legibility. Genetic programming has also permitted the automation of query formulation [77].

Trace Integrity To cope with the decay and volatility mentioned above, ways to determine the integrity of existing traces are greatly needed. Work on these questions, although loudly called for by literature studies, is scarce in practice [101,4]. The first option is the manual annotation or vetting of trace links to inform about their level of reliability. Annotations allow a qualitative and quantitative evaluation [18]. This is the case for back-propagation of verification and validation results between design and requirements [42]. Some approaches enable the definition of invariant rules while manipulating traces or their targets [19]. If the invariant is violated, an exception for that trace is automatically generated. For example, we could define a rule that is violated when a change occurs in an artefact targeted by a trace if the corresponding link was identified more than two versions prior to the current version. In the same vein, Heisig *et al.* tag trace links when their target (or source) artefacts are modified or deleted [43]. Thanks to the ubiquitous integration of the tool, a warning is consequently raised in EMF.
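The example rule given above (a change to a targeted artefact violates the invariant if the link was identified more than two versions ago) could be sketched as follows. The class and function names are hypothetical and do not correspond to any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class TraceLink:
    source: str
    target: str
    identified_in: int  # version in which the link was identified


class StaleTraceError(Exception):
    """Raised when a change violates the staleness invariant of a link."""


def check_invariant(link, changed_artefact, current_version, max_age=2):
    """Raise if the changed artefact is the target of a link older than max_age versions."""
    age = current_version - link.identified_in
    if changed_artefact == link.target and age > max_age:
        raise StaleTraceError(
            f"{link.source} -> {link.target} was identified {age} versions ago"
        )


link = TraceLink("REQ-7", "Billing.java", identified_in=3)
check_invariant(link, "Billing.java", current_version=5)  # age 2: still accepted
try:
    check_invariant(link, "Billing.java", current_version=6)  # age 3: violated
    violated = False
except StaleTraceError:
    violated = True
```

In a tool, such a check would typically run whenever an artefact is saved, so that the exception can be turned into a warning or a request to re-vet the link.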

Trace persistence Many different storage alternatives exist for traceability artefacts. One option is to use an SQL-like grammar to store and retrieve traces with the power of database tooling, or to use XML documents to represent trace matrices in a transformable format [57,27]. Industry uses many informal formats, and link representations often remain implemented in spreadsheets, text files, databases, or requirement management tools. These links deteriorate quickly during a project as time-pressured team members fail to update them. Researchers aiming at a reusable approach favour model-based representations able to express specifically defined concepts related to traceability (often in a specific domain of application). The burden of keeping traces coherent is eased in model-based solutions [21].

Another concern lies in the recording of trace evolution. The creation of a trace should be recorded, along with the successive changes that affect it, for evolution analysis. Integrity measures relative to evolution events (*e.g.,* creation, modification) should be recorded as well to evaluate their evolution over a period of time. Rahimi *et al.* ensure the co-evolution of artefacts and traces [80] using a set of heuristics coupled with refactoring detection and information retrieval techniques to detect change scenarios between contiguous versions of software systems.

System integration Like most MDE approaches, Helming *et al.* use the same modeling language for both traceability and system artefacts [44]. Tracing features are embedded in the language. The joint use of EMF and a dedicated traceability metamodel (both written in Ecore) facilitates the integration of traceability features, including graphical versions to stimulate human perception and standard analysis of traces in the native environment. Galvao *et al.*, in their seminal work on traceability and MDE, call for more loosely coupled traceability support that can integrate external relationships with independent representations (in another, ideally common, language) [32], as also elaborated by Azevedo *et al.* [6]. Finally, the SysMLv2 implementation committee is calling for an *orthogonal* implementation of features such as traceability, annotations, and comments through meta-level libraries in order to keep concerns separated at the design level.

### 6 Discussion

The feature model is a first step towards a shared understanding of all dimensions involved in a traceability solution. Ideally, a company interested in a certain set of such dimensions could try to create its perfect traceability solution by combining the top solutions for each dimension. But this is not yet a real possibility, as those solutions would be difficult to combine and, more importantly, several of the features in the feature model do not yet have a satisfactory solution. This section elaborates on this discussion by presenting some open challenges in software traceability research.

Common traceability metamodel. We have counted over 20 different traceability metamodel proposals. Nevertheless, some are solutions limited to the specific problems the authors present as case studies, and these metamodels are rarely reused, if ever. This proliferation makes it challenging for different traceability solutions to interoperate. The research community should agree on a unified proposal that facilitates the composability of traceability solutions.

Security of trace data. Considering that traceability is a major aspect in certification and other critical applications, it is surprising to see so little interest in security concerns in relation to trace artefacts. We believe security mechanisms (even simple rule-based access control) for traceability are needed to control who can modify which trace data, given the implications such changes can have.

Library of trace types and semantics. We already mentioned the importance of having a rich set of types for traces to let engineers express the reasons behind the creation of a given trace. But at the same time, complete freedom makes reusability of analysis techniques difficult. We would like to see a rich yet predefined set of types for traces that could then be imported in new traceability projects.

Usefulness of identified traces. Managing a large number of traces is time-consuming. As such, we should make sure every explicit trace is actually useful. So far, algorithms aimed at automatically identifying traces are compared based on standard properties like precision and recall. But they should be evaluated on "usefulness": are those traces useful for the end user, or are they simply redundant noise?

Verification, validation and testing of traces. Our ample literature on verification, validation and testing methods for software engineering should be extended to deal with trace data, especially from a temporal perspective. Reasoning on outdated and potentially incorrect trace data could have strongly damaging impacts on the system as a whole. So far, very few approaches target these aspects, except for co-evolution in model-driven engineering. A recent study shows that the ability to justify the quality and integrity of traces with evidence and uncertainty evaluation is a prerequisite to robust and reliable traceability [8]. Given the effort required to create traces in the first place, it is important to instill more confidence in practitioners who are unsure whether creating traces is worthwhile.

Traceability as core concern in general languages. Another important step towards the mainstream adoption of traceability in industry is the integration of the common traceability metamodel in popular modeling languages like UML or SysML, in the form of a profile (to be able to directly reuse existing modeling tools available for those languages) or new packages in the respective standards. This way, traceability would become a core concern and a first-class modeling primitive in software development, while still being a rich concept and not just a variation of the simple, generic dependency relationship we can use right now in those languages.

Working together with the industry. Orthogonal to all the others, we (the research community) should aim at more frequent exchanges with practitioners to better understand why they still create traces manually instead of reusing any of the dozens of existing solutions. Some reasons have already been hinted at in this paper, but there might be others we are not aware of. If we want traceability research to transfer to industry, more and better communication flows should be part of the agenda.

### 7 Conclusion

Our survey reveals a continuous interest in traceability even if, often, it does not have the spotlight it deserves given the key role it plays in a good deal of software engineering tasks<sup>4</sup>. Work relating to traceability is indeed disseminated within established research communities (e.g., debugging, SPL). Existing conceptualizations vary greatly depending on the community to which their authors belong as well as the objectives they aim at. As a consequence, a clear and measurable idea of the costs and benefits of software traceability is slow to emerge. To help visualize, classify and compare the different traceability approaches, we propose a feature model covering all important traceability aspects, as derived from a thorough analysis of the traceability literature. Following the existing body of work, we put special emphasis on separating how traces are represented from how they are identified and managed.

Beyond the feature model, our analysis highlights several limitations of current traceability approaches that should be further developed. We believe advancing on those aspects is especially important, even more given the new traceability challenges posed by the growing use of AI in Software Engineering (e.g. in terms of reproducibility and explainability of the AI decisions) [90,99]. In this sense, we hope this paper serves as a "wake-up call" to make sure new AI for SE proposals come together with a proper traceability mechanism that assists engineers in evaluating and understanding the impact of the new AI components in the software engineering process instead of having to blindly trust them.

As further work, we plan to work on the above-mentioned aspects, starting with a collaboration with some of the authors of other proposals to map and bridge their algorithms and techniques to our modular and quality-focused metamodel, in order to combine the benefits of a unified and generic approach with those of a more domain-specific representation. We will also study how to better embed traceability concepts into mainstream modeling languages (like UML or SysML) to further facilitate the adoption of traceability.

Acknowledgements: This work has been partially funded by the Spanish government (LOCOSS project - PID2020-114615RB-I00), and receives support from the ECSEL Joint Undertaking (AIDOaRt - grant agreement No 101007350).

<sup>4</sup> As an example, ICSE'18 awarded a trace-based paper as the most influential paper in the past 10 years [50]. The work introduced a novel trace-based approach to debugging. Though the focus was on the debugging aspect of the paper, traceability was the key to achieving that debugging improvement. The word "trace" alone is mentioned 46 times in the 10-page paper.

## References



## Construction of Verifier Combinations Based on Off-the-Shelf Verifiers

Dirk Beyer<sup>1</sup> , Sudeep Kanav<sup>1</sup> , and Cedric Richter<sup>2</sup>

<sup>1</sup> LMU Munich, Munich, Germany <sup>2</sup> Carl von Ossietzky University, Oldenburg, Germany

Abstract. Software verifiers have different strengths and weaknesses, depending on properties of the verification task. It is well-known that combinations of verifiers via portfolio and selection approaches can help to combine the strengths. In this paper, we investigate (a) how to easily compose such combinations from existing, 'off-the-shelf' verification tools without changing them and (b) how much performance improvement easy combinations can yield, regarding the effectiveness (number of solved problems) and efficiency (consumed resources). First, we contribute a method to systematically and conveniently construct verifier combinations from existing tools, using the composition framework CoVeriTeam. We consider sequential portfolios, parallel portfolios, and algorithm selections. Second, we perform a large experiment on 8 883 verification tasks to show that combinations can improve the verification results without additional computational resources. All combinations are constructed from off-the-shelf verifiers, that is, we use them as published. The result of our work suggests that users of verification tools can achieve a significant improvement at a negligible cost (only configure our composition scripts).

Keywords: Software verification · Program analysis · Cooperative verification · Tool Combinations · Portfolio · Algorithm Selection · CoVeriTeam

## 1 Introduction

Automatic software verification has been an active area of research for many decades, and various tools and techniques have been developed to solve the problem of verifying software [3, 7, 9, 25, 34, 37]. The research has also been adopted in practice [2, 22, 24, 39]. Each tool and technique has its own strengths in specific areas. In such a scenario, it is natural to combine these tools to benefit from the strengths of the individual tools, leading to a 'meta verifier' that solves more problems. Most current combination approaches are hardcoded, that is, the choice of the tools and the way to combine them is specifically programmed.

We contribute a method to construct combinations in a systematic way, independently from the set of tools to use. As for the types of combinations, we considered sequential and parallel portfolio [36], and algorithm selection [47]. The combinations are composed and executed with the tool CoVeriTeam [15].<sup>1</sup>

<sup>1</sup> https://gitlab.com/sosy-lab/software/coveriteam/

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 49–70, 2022. https://doi.org/10.1007/978-3-030-99429-7\_3

CoVeriTeam is a tool that is based on off-the-shelf atomic actors, which are executable units based on tool archives. It provides a simple language to construct tool combinations, and manages the download and execution of the existing tools on the provided input. CoVeriTeam provides a library of atomic actors for many well-known and publicly available verification tools. A new verification tool can be easily integrated into CoVeriTeam within a few minutes of effort.

For our experimental evaluation, we selected eight of the verification tools that participated in the 10th competition on software verification [6]. We reused the archives submitted to this competition, and composed combinations of three types (sequential and parallel portfolio, algorithm selection) with 2, 3, 4, and 8 verification tools: in total 12 combinations. We evaluated these 12 combinations on a large benchmark set consisting of 8 883 verification tasks in total and compared the results of the combinations against the results of the existing tools.

We show that all three combination approaches can lead to considerable improvements of the performance regarding effectiveness (number of correctly solved instances) and efficiency (consumed resources).

Contributions. We make the following contributions:


## 2 Improving Verification by Verifier Combinations

In this study, we explore different strategies for combining verifiers to improve the overall verification effectiveness. We focus on the most commonly applied black-box combinations (i.e., combinations that require neither changes to the existing tools nor communication between verification tools), which we briefly describe in the following.

Verifier Combinations. Existing strategies for combining verifiers can generally be classified into one of the following three categories: sequential portfolios [17, 33, 53], parallel portfolios [35, 36, 40], and algorithm selectors [8, 28, 47, 48, 50]. We provide an overview of these composition strategies in Figs. 1 and 2.

Sequential Portfolio. Portfolios combine several verification algorithms by executing them either sequentially or in parallel. A sequential portfolio (Fig. 1) executes a set of verifiers sequentially by running one verifier after another. In this setting, each verifier is assigned a specific time limit and the verifier runs until it finds a solution or reaches the time limit. If the current verifier is able to solve the given verification task, the sequential composition is stopped and the solution is emitted. Otherwise, if a verifier runs into a timeout without computing a result, the current algorithm is stopped and the next one is started. CPA-Seq [17, 53] and Ultimate Automizer [33] are examples of sequential portfolios.
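The control flow of a sequential portfolio can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited tools: verifiers are modeled as plain Python callables, and `make_verifier` is a hypothetical stand-in that simulates a tool needing a given amount of time.

```python
CONCLUSIVE = {"true", "false"}


def sequential_portfolio(task, verifiers):
    """Run (verifier, time_limit) pairs in order; stop at the first conclusive verdict."""
    for verifier, time_limit in verifiers:
        verdict = verifier(task, time_limit)
        if verdict in CONCLUSIVE:
            return verifier.__name__, verdict
    return None, "unknown"


def make_verifier(name, needed_seconds):
    """Hypothetical stand-in: answers only if given at least `needed_seconds`."""
    def verifier(task, time_limit):
        return task["expected"] if needed_seconds <= time_limit else "unknown"
    verifier.__name__ = name
    return verifier


# The first tool would need 30 s but only gets 10 s, so the second one is started.
portfolio = [
    (make_verifier("fast_but_shallow", 30), 10),
    (make_verifier("slow_but_thorough", 40), 50),
]
winner, verdict = sequential_portfolio({"expected": "true"}, portfolio)
```

The time spent by the first verifier is lost when it times out, which is why the order of the verifiers and the split of the time budget matter in practice.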

Fig. 1: Sequential portfolio of verifiers. Each verifier runs for a certain amount of time. If a verifier stops without computing a result (grey box), the next one is started (white box with double borders).

Parallel Portfolio. In contrast to sequential portfolios, a parallel portfolio (Fig. 2(a)) executes all verification algorithms in parallel, while sharing all system resources like CPU time and memory. As soon as one algorithm solves the given verification problem, the portfolio is stopped. Based on the assumption that all verifiers provide only sound solutions, we can safely take the first solution computed as the final result of the overall portfolio. PredatorHP [35, 40] is an example of a parallel portfolio.

Algorithm Selection. To reduce spending resources on unsuccessful verifiers, algorithm selectors (Fig. 2(b)) are designed to select the verification algorithm that is likely well suited to solve a given verification task. More precisely, the algorithm selector analyzes the given verification problem for common characteristics (typically program features like the existence of a loop or an array) and based on these features, selects a verification algorithm likely suited for the given problem. Then the selected verifier is executed. Algorithm selectors were recently explored for selecting a task-dependent verification algorithm (e.g., in PeSCo [48, 50]) or a complete verification strategy (e.g., in CPAchecker [8]).

The above combination types have their own advantages and limitations when applied in real-world scenarios. While algorithm selectors avoid the need to share resources, the approach relies heavily on the selection algorithm used. If the selection algorithm is not powerful enough or the selection task is too difficult, the selector fails to identify a verifier equipped for the given task. Although portfolios avoid this problem by assigning the verification task to several verifiers, each verifier gets fewer resources, which could lead to out-of-resource failures.

## 3 Construction of Verifier Combinations with CoVeriTeam

CoVeriTeam [15] is a tool for creating and executing tool combinations for cooperative verification [20]. It consists of a language for tool composition, and an execution engine for this language. Tools are considered as verification actors (verifiers, validators, testers, transformers), and the inputs consumed and outputs produced by the tools as verification artifacts (programs, specifications, witnesses, results). Verification artifacts are seen as basic objects, verification actors as basic operations, and tool combinations as composition of these operations.

CoVeriTeam supports execution of most of the well-known automated verification tools that are publicly available. The composition operators supported by CoVeriTeam are: SEQUENCE, PARALLEL, REPEAT, and ITE. SEQUENCE executes the composed tools sequentially, PARALLEL in parallel, REPEAT repeatedly until a termination condition is satisfied, and ITE is an if-then-else that executes one tool if the provided condition is true and the other otherwise. The work in this paper uses SEQUENCE, PARALLEL, ITE, and a newly developed PORTFOLIO.

Fig. 2: Comparison of (a) parallel portfolio and (b) algorithm selection

#### 3.1 Verifier Based on Sequential Portfolio

Figure 3 shows the construction of a sequential portfolio of two verifiers verifier1 and verifier2 using CoVeriTeam. This construction uses two kinds of compositions: SEQUENCE and ITE. At the outermost level, it is a sequence of verifier1 and an actor that is itself a composition, namely an ITE composition. Let us call it ite\_verifier. When we execute this composition, first, verifier1 is executed and then ite\_verifier. ite\_verifier first checks if verifier1 was successful in verification or not (i.e., verdict ∉ {T, F}). If verifier1 was successful, then it forwards the results, otherwise, verifier2 is executed and its results are taken. This construction can be generalized to create sequential portfolios of arbitrary sizes. We used it to create sequential portfolios of 2, 3, 4, and 8 verifiers.
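The nesting of SEQUENCE and ITE described above, and its generalization to arbitrary sizes, can be modeled with ordinary higher-order functions. The sketch below is a plain-Python model under simplifying assumptions (actors are functions from an artifact dictionary to an artifact dictionary); it does not use the CoVeriTeam language or API, and all function names are hypothetical.

```python
from functools import reduce


def sequence(first, second):
    """SEQUENCE: run `first`, then pass the input merged with its output to `second`."""
    return lambda artifacts: second({**artifacts, **first(artifacts)})


def ite(condition, then_actor, else_actor):
    """ITE: run `then_actor` if `condition` holds on the artifacts, else `else_actor`."""
    return lambda artifacts: (then_actor if condition(artifacts) else else_actor)(artifacts)


def forward(artifacts):
    """Pass the artifacts through unchanged, i.e. forward the previous result."""
    return artifacts


def solved(artifacts):
    """Success condition: the verdict is conclusive."""
    return artifacts.get("verdict") in {"T", "F"}


def sequential_portfolio(verifiers):
    """Build SEQUENCE(v1, ITE(solved, forward, SEQUENCE(v2, ...))) for any number of verifiers."""
    *init, last = verifiers
    return reduce(lambda rest, v: sequence(v, ite(solved, forward, rest)), reversed(init), last)


# Hypothetical stand-ins for real verifier actors.
def verifier1(artifacts):
    return {"verdict": "unknown"}  # gives up on every task


def verifier2(artifacts):
    return {"verdict": "F"}  # always reports a violation


portfolio = sequential_portfolio([verifier1, verifier2])
result = portfolio({"program": "example.c", "spec": "unreach-call"})
```

Folding from the last verifier backwards means that a portfolio of eight verifiers is built by the same two operators as a portfolio of two.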

#### 3.2 Verifier Based on Parallel Portfolios

We developed a composition operator for parallel portfolio in CoVeriTeam. In this composition, multiple tools are executed in parallel and the result of the one that succeeds first is taken. The composition consists of a set of verification actors of the same type (verifiers, testers, etc.), and a success condition defined over the artifacts produced by these actors. When one actor finishes, the success condition is evaluated: if it holds, then the output of this actor is taken and the execution of the remaining actors is stopped. Otherwise, the portfolio waits for the next actor to finish and repeats the check. If none of the actors produces an output that satisfies the success condition, the result of the last one is taken.

Figure 4 shows a parallel portfolio of two verifiers verifier1 and verifier2. In this case, both the verifiers are executed simultaneously. When one verifier finishes, its result is checked for the success condition (i.e., verdict ∈ {T, F}). If the success condition holds then the result is forwarded, otherwise, the result is discarded and we wait for the second verifier to finish. Once a successful result is available, the remaining executing verifiers are terminated. For our experiments, we created parallel portfolios of 2, 3, 4, and 8 tools.
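The behavior of the parallel portfolio, including the fallback to the last result, can be sketched with Python's standard `concurrent.futures` module. This is an illustration under simplifying assumptions, not CoVeriTeam's implementation: verifiers are hypothetical Python functions run in threads, which cannot be forcibly terminated, whereas a real portfolio would stop the remaining tool processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_portfolio(task, verifiers, success):
    """Run all verifiers concurrently; return the first result satisfying `success`.

    If no result satisfies the condition, the result of the last verifier
    to finish is returned.
    """
    executor = ThreadPoolExecutor(max_workers=len(verifiers))
    futures = [executor.submit(v, task) for v in verifiers]
    result = None
    try:
        for future in as_completed(futures):
            result = future.result()
            if success(result):
                break
    finally:
        # Requires Python 3.9+. Running threads are not killed; a real
        # portfolio would terminate the remaining tool processes here.
        executor.shutdown(wait=False, cancel_futures=True)
    return result


# Hypothetical stand-ins for real verifiers.
def quick_but_inconclusive(task):
    time.sleep(0.05)
    return {"tool": "quick", "verdict": "unknown"}


def slower_but_conclusive(task):
    time.sleep(0.2)
    return {"tool": "slower", "verdict": "T"}


result = parallel_portfolio(
    "example.c",
    [quick_but_inconclusive, slower_but_conclusive],
    success=lambda r: r["verdict"] in {"T", "F"},
)
```

The inconclusive result of the quicker tool is discarded, and the portfolio returns as soon as the slower tool reports a verdict in {T, F}.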

#### 3.3 Verifier Based on Algorithm Selection

We designed and implemented a generic selection framework in CoVeriTeam for selecting verifiers. The framework decomposes the algorithm-selection process into two phases: (1) a feature-extraction phase, in which a feature encoder extracts a set of predefined features for a given verification task (i.e., certain characteristics that are believed to indicate difficulty for a verifier), and (2) selection to identify an appropriate verifier based on the extracted features. Each phase is constructed using CoVeriTeam actors (explained below in more detail). Figure 5 shows the CoVeriTeam composition of a verifier based on algorithm selection.

Feature Encoder. The first component of our framework is the feature encoder. Given a verification task consisting of a program P and a specification S, the goal of the feature encoder is to encode the problem into a meaningful feature-vector (FV) representation, which we can later use to select a verification tool. Typically, the representation encodes certain features of a program which might correlate with the performance of a verifier, such as the occurrence of specific loop patterns [28] or variable types [29]. In this study, we encode verification problems via a learning-based feature encoder by employing a pretrained CSTTransformer [50]. The CSTTransformer first parses a given program P into a simplified abstract syntax tree (AST) representation. Afterwards, a specific type of neural network processes the AST structure to produce a vector representation. The last encoding step is learned by pretraining the neural network on selecting various verification tools. While this approach was originally developed to learn a vector representation optimized for a specific verifier composition, the authors showed that the learned encoder can be effectively reused across many new selection tasks, often outperforming other hand-crafted feature encoders.

Selection of Verifiers Based on the Individual Difficulty of the Tasks. The same task might be solved with one tool in a few seconds, while another is not able to find a solution within the given resource constraints. Therefore, to avoid wasting resources on tools that are not well suited for a given task, the algorithm selector aims to predict the difficulty of a task before executing a tool. Then, the tool that is predicted to be the best suited tool for the task is executed.

Similar to previous work [28, 50], we learn to predict the difficulty of a task with hardness models [55]. Based on the previously computed vector representation, a hardness model learns to predict the hardness of a given task for a specific tool. In our case, this reduces to a binary classification problem of predicting whether a tool can solve a task or not. We address this by training logistic regression classifiers. The classifier's confidence that a verifier will fail a particular task then determines the hardness of the task.

Now, given a set of hardness models (each assessing the hardness of a verification task for a specific tool), a verification tool is selected for which the task is likely easy (i.e., the respective model outputs the lowest hardness score). The final selection is done by a comparator implemented in CoVeriTeam that selects a tool by comparing the hardness scores.
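The selection step described above can be sketched in a few lines. The following is a minimal illustration, not the CoVeriTeam implementation: the tool names, weights, and feature vectors are made up, and each hardness model is assumed to be a plain logistic model whose output is read as the probability that the tool fails the task.

```python
import math

def hardness(weights, bias, features):
    """Logistic model output, read as the probability that the tool fails."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def select_verifier(models, features):
    """Comparator: pick the tool whose model reports the lowest hardness."""
    scores = {tool: hardness(w, b, features) for tool, (w, b) in models.items()}
    return min(scores, key=scores.get), scores

# Hypothetical models: tool name -> (weights, bias). All values are invented.
MODELS = {
    "tool_a": ([2.0, -1.0], 0.0),
    "tool_b": ([-1.5, 0.5], 0.2),
}

if __name__ == "__main__":
    feature_vector = [1.0, 0.0]  # stand-in for the encoder's output
    best, scores = select_verifier(MODELS, feature_vector)
    print(best, scores)
```

Because every tool has its own model and the comparator only compares scores, removing a tool amounts to deleting its entry from the dictionary.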

#### 3.4 Extensibility

To facilitate future research and the design of novel combinations, we implemented all combination types such that they can be easily configured and extended. Extending a combination with a new verifier only requires an actor definition for that verifier in CoVeriTeam. Afterwards, this actor can be put in a sequential or parallel portfolio by adding it to the composition. While our algorithm selector can be easily used with all tools employed during our experiments, extending a combination based on algorithm selection with a new verifier requires a bit more effort. However, by using hardness models together with a common feature representation we simplified the process required for configuring algorithm selection. In fact, we are able to modify the set of verifiers to select from by simply adding or removing individual hardness models. While previous approaches to verifier selection often require training the complete selector from scratch, our combination can be extended by training a single hardness model.<sup>2</sup> For training a new model, we provide all training scripts that were used for training our hardness models and a precomputed dataset of vector representations for SV-COMP 2021. Therefore, to integrate a new tool in our algorithm selector, one only needs to run the respective verifier once on (a subset of) the benchmark set. The results then act as training examples.

Fig. 6: Subsets of verification tools used for composition
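To illustrate how a single hardness model for a new tool could be trained from one benchmark run, here is a small sketch. It is not the authors' training script: it fits a logistic regression by plain batch gradient descent, and the feature vectors and labels are toy values invented for the example (label 1 means the new tool failed the task).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_hardness_model(vectors, failed, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    vectors: precomputed feature vectors, one per task.
    failed:  1 if the new tool failed the task, 0 if it solved it.
    Returns (weights, bias).
    """
    dim, n = len(vectors[0]), len(vectors)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, failed):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_hardness(model, x):
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy data: the hypothetical new tool fails whenever the first feature is set.
VECTORS = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
FAILED = [0, 0, 1, 1]
NEW_MODEL = train_hardness_model(VECTORS, FAILED)
```

The existing models stay untouched; only the new tool's model is fitted and added to the set the comparator chooses from.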

## 4 Evaluation

We perform a thorough experimental evaluation on a large benchmark set in order to show the potential of combinations. We address the following research questions concerning the comparative evaluation of combinations against standalone tools:

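- RQ 1. Can a sequential portfolio of verifiers perform better than standalone tools with respect to
	- (a) number of solved verification tasks, and
	- (b) resource consumption?
- RQ 2. Can a parallel portfolio of verifiers perform better than standalone tools with respect to
	- (a) number of solved verification tasks, and
	- (b) resource consumption?
- RQ 3. Can a verifier based on algorithm selection perform better than standalone tools with respect to
	- (a) number of solved verification tasks, and
	- (b) resource consumption?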

### 4.1 Experimental Setup

Selection of Existing Verifiers. We selected eight existing verification tools that performed well in a recent competition on software verification (SV-COMP 2021) [6]. We excluded two verifiers from consideration: VeriAbs [27] and PeSCo [49]. VeriAbs was excluded because its license does not allow us to use it for scientific evaluation, and PeSCo because it is a derivative of CPAchecker that would not contribute to the diversity of technology in the combinations. The chosen set of verifiers used for the tool combinations is depicted in Fig. 6.

<sup>2</sup> A single hardness model can be trained within a few minutes on a modern CPU.

Tool Combinations. We evaluated twelve verifier combinations: for each of sequential portfolio, parallel portfolio, and algorithm selection, we constructed a combination of 2, 3, 4, and 8 verifiers. These variants of combinations with different numbers of verifiers allowed us to quantify the influence of the number of verifiers on the performance. We constructed these subsets of verifiers to maximize the number of tasks (from our benchmark set) that can be solved by at least one tool in the subset. For sequential portfolios, we additionally rank the verifiers in descending order of their success on the benchmark. We used the results from SV-COMP 2021 to achieve this. Figure 6 illustrates the sets of verifiers that we composed in different types of combinations.
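The subset construction and ranking described above can be sketched as follows. This is an illustration under stated assumptions: the tool names and solved-task sets are invented, the search is exhaustive (feasible for eight tools), and "success on the benchmark" is approximated here by the number of solved tasks, which may differ from the measure the authors used.

```python
from itertools import combinations

def best_subset(solved_by, k):
    """Return the size-k subset of tools that solves the most tasks jointly."""
    def coverage(subset):
        return len(set().union(*(solved_by[t] for t in subset)))
    return max(combinations(sorted(solved_by), k), key=coverage)

def sequential_order(subset, solved_by):
    """Rank tools in descending order of the number of tasks they solve."""
    return sorted(subset, key=lambda t: len(solved_by[t]), reverse=True)

# Hypothetical data: tool name -> set of task ids solved in a previous run.
SOLVED_BY = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3},
    "c": {5, 6},
    "d": {4, 5},
}

if __name__ == "__main__":
    subset = best_subset(SOLVED_BY, 2)
    print(subset, sequential_order(subset, SOLVED_BY))
```

Note that the best pair is not simply the two individually strongest tools: tools that solve the same tasks add little to each other.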

Execution Environment. Our experiments were executed on machines with the following configuration: one 3.4 GHz CPU (Intel Xeon E3-1230 v5) with 8 processing units (virtual cores), 33 GB RAM, operating system Ubuntu 20.04. Each verification run (execution of one tool or combination on one verification task) was limited to 8 processing units, 15 min of CPU time, and 15 GB memory. This configuration is the same as the configuration used in SV-COMP 2021, allowing us to use the competition results of the standalone tools for comparison.

Benchmark Selection. Our benchmark set consists of all the verification tasks with specification unreach-call from the open-source collection of verification tasks SV-Benchmarks<sup>3</sup> . Each verification task consists of a program written in C and a specification. The specification is a safety property describing that an error location should never be reached. The benchmark set includes all verification tasks of the competition categories ReachSafety and Concurrency, and a part of the verification tasks in category SoftwareSystems. In total, there were 8 883 verification tasks in our benchmark set. We evaluated our combinations on the version of the benchmark set that was used in SV-COMP 2021 (tag svcomp21).

Scoring Schema. We not only count the number of results of each kind<sup>4</sup> for the verification tasks, but also report the scores as used in the competition, because this models what the community considers as quality. A verifier is awarded score points as follows: 2 score points for each correct proof, 1 score point for each correct alarm, -32 score points for each wrong proof, and -16 score points for each wrong alarm. This schema has been used in SV-COMP [6] for several years and has been accepted by the verification community for judging the quality of results.
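As a worked example of the scoring schema, the following sketch computes a total score from a list of results. The result encoding as `(status, kind)` pairs is an assumption made for this illustration; results that are neither correct nor wrong (e.g., timeouts) are assumed to contribute 0 points.

```python
# Score points per result, as listed in the scoring schema.
SCORE = {
    ("correct", "proof"): 2,
    ("correct", "alarm"): 1,
    ("wrong", "proof"): -32,
    ("wrong", "alarm"): -16,
}

def total_score(results):
    """Sum the score points; results without a verdict score 0."""
    return sum(SCORE.get(r, 0) for r in results)

# Hypothetical run: 3 correct proofs, 2 correct alarms, 1 wrong alarm, 1 timeout.
EXAMPLE = (
    [("correct", "proof")] * 3
    + [("correct", "alarm")] * 2
    + [("wrong", "alarm"), ("unknown", "timeout")]
)

if __name__ == "__main__":
    print(total_score(EXAMPLE))  # 3*2 + 2*1 - 16 = -8
```

The example shows how strongly the schema penalizes wrong results: a single wrong alarm outweighs five correct results.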

Resource Measurement and Benchmark Execution. We used the state-of-the-art benchmarking framework BenchExec [18] for executing our benchmarks. It executes tools in isolation, reports the resource consumption, and also enforces the resource limitations. It provides measurements of the consumption of CPU time, wall time, memory, and CPU energy during an execution of a tool.

#### 4.2 Results of Existing Verifiers as Standalone

Table 1 shows the summary of results of the execution of the standalone tools on the selected benchmark set. These results are publicly available in the respective reproduction package of the competition [5] and on the competition web site<sup>5</sup>. We only adjust the presentation to our needs here.

Table 1: Standalone verifiers

<sup>3</sup> https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks

<sup>4</sup> Either claims of program correctness or alarms of specification violations.

Figure 7 shows the quantile plots of the results, where the x-coordinate represents the quantile of score obtained by the tool below the run time represented by the y-coordinate. We used a logarithmic scale for time ranges between 1 and 1000 seconds, and a linear scale between 0 and 1 second. The graph of a tool that solves more verification tasks extends farther to the right, and the graph of a faster tool is lower. The farther to the right a plot goes and the lower it remains, the better. More details about these plots are given elsewhere [4].
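To make the reading of such plots concrete, the following sketch computes the data points of a score-based quantile plot from a list of results. It assumes the common construction in which the graph starts at the sum of the negative scores of all wrong results and then accumulates the scores of correct results in order of increasing run time; the input values are invented.

```python
def quantile_points(results):
    """Compute (accumulated score, run time) points for a quantile plot.

    results: list of (score, cpu_time_s) pairs, one per verification run.
    Wrong results (negative score) shift the starting point to the left;
    runs without a verdict (score 0) do not contribute a point.
    """
    x = sum(score for score, _ in results if score < 0)
    points = []
    correct = sorted((r for r in results if r[0] > 0), key=lambda r: r[1])
    for score, time_s in correct:
        x += score
        points.append((x, time_s))
    return points

# Hypothetical results: two proofs, one alarm, one wrong alarm, one timeout.
RESULTS = [(2, 10.0), (1, 0.5), (-16, 3.0), (2, 100.0), (0, 900.0)]

if __name__ == "__main__":
    print(quantile_points(RESULTS))
```

Under this construction, a tool with many wrong results starts far to the left even if it is fast, which is why both the start and the end of a graph matter.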

Figure 8 shows the resource consumption for standalone tools using a parallel-coordinates plot (each parallel coordinate represents a different variable). The plot shows the number of unsolved tasks, and resource consumption per score point. The lower the plot of a tool is, the better it is for the user.

#### 4.3 RQ 1: Evaluation of Sequential-Portfolio Verifier

We now present the results of the sequential-portfolio verifier against the existing standalone verifier with the highest score: CPAchecker.

<sup>5</sup> https://sv-comp.sosy-lab.org/2021/results/results-verified

Fig. 8: Standalone verifiers: Parallel-coordinates plot showing unsolved tasks and resource consumption per score point

Table 2 shows the summary of results for the sequential portfolios. In general, the sequential portfolio achieves a better score than the best-performing standalone tool. The portfolio with 8 tools performs worst, which is expected: as we increase the size of the portfolio, the amount of time allocated to each verifier decreases, so the verifiers can only solve relatively easy tasks. The table also shows that the portfolio requires more resources to solve the tasks. This is a side effect of the sequential portfolio, as all the resources consumed by unsuccessful attempts to solve a given task by the verifiers in a sequence are still counted in the resource consumption. Also, the portfolio with 8 tools has a considerable number of wrong results, as it is reduced to fast results, which need not come from the verifiers earlier in the sequence. The index at which a verifier is placed plays a key role in the performance of the sequential portfolio. If we put a verifier that produces results fast but has more wrong results first in the sequential portfolio, then the overall results will contain many wrong results.
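The effect of portfolio size on a sequential portfolio can be illustrated with a toy simulation. This is a simplified model, not the CoVeriTeam implementation: it assumes that the CPU-time budget is split into equal slices, and the run times and verdicts are invented.

```python
def run_sequential(portfolio, runs, budget_s=900.0):
    """Simulate a sequential portfolio on one task.

    portfolio: ordered list of tool names.
    runs: tool name -> (cpu_time_needed_s, verdict) for this task.
    Returns (verdict, winning tool or None, cpu_time_consumed_s).
    """
    slice_s = budget_s / len(portfolio)  # assumption: equal time slices
    consumed = 0.0
    for tool in portfolio:
        needed, verdict = runs[tool]
        if needed <= slice_s:
            return verdict, tool, consumed + needed
        consumed += slice_s  # an unsuccessful attempt still costs resources
    return "unknown", None, consumed

# Hypothetical run times (seconds) and verdicts for a single task.
RUNS = {
    "a": (400.0, "true"),
    "b": (50.0, "false"),
    "c": (1000.0, "true"),
    "d": (1000.0, "true"),
}

if __name__ == "__main__":
    print(run_sequential(["a", "b"], RUNS))
    print(run_sequential(["a", "b", "c", "d"], RUNS))
```

With two tools, the first verifier finishes within its slice; with four tools, its slice is too short, the time it used is still counted, and the answer comes from the faster second verifier instead.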

Figure 9 shows the quantile plot of scores. As a portfolio is biased towards the verifiers that compute results fast and not towards correctness, we see the sequential-portfolio combinations starting farthest to the left, i.e., having the most negative score, or the most wrong results. CPAchecker has the fewest wrong results, and therefore its starting point is farthest to the right.


Table 2: Sequential portfolios of different sizes with CPAchecker

Figure 10 shows that CPAchecker is more resource efficient than the sequential portfolios. The sequential combination with the best score performs worst in resource efficiency.

#### 4.4 RQ 2: Evaluation of Parallel-Portfolio Verifier

We now present the results of the parallel-portfolio verifiers. The parallel portfolio mostly achieves a worse score than the best-performing standalone tool, but the parallel portfolio with 3 tools scores better. The parallel portfolio is affected by two aspects: (1) the size of the parallel portfolio (if too many tools are used, none of them gets enough resources to verify the task), and (2) the selection of tools (if there is a fast tool that produces many wrong results, it reduces the score). In general, the parallel portfolio produces more wrong results, even more than the sequential portfolio, because the tools run in parallel, whereas in a sequential portfolio this can be somewhat mitigated by putting a more sound tool before a less sound tool. Table 3 shows the summary of results for the parallel portfolios.
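Both aspects can be seen in a toy model of a parallel portfolio. The sketch below is a simplification made for illustration, not the CoVeriTeam implementation: it assumes that the CPU-time budget is shared equally among the tools and that the first tool to finish determines the result; all run times and verdicts are invented.

```python
def run_parallel(portfolio, runs, budget_s=900.0):
    """Simulate a parallel portfolio on one task.

    portfolio: list of tool names running concurrently.
    runs: tool name -> (cpu_time_needed_s, verdict) for this task.
    Returns (verdict, winning tool or None).
    """
    share_s = budget_s / len(portfolio)  # assumption: equal CPU-time shares
    finished = [(runs[t][0], t) for t in portfolio if runs[t][0] <= share_s]
    if not finished:
        return "unknown", None
    _, tool = min(finished)  # the fastest tool's answer is taken
    return runs[tool][1], tool

# Hypothetical run times (seconds) and verdicts for a single task.
RUNS = {
    "a": (400.0, "true"),
    "b": (50.0, "false"),
    "c": (350.0, "true"),
    "d": (500.0, "true"),
    "e": (600.0, "true"),
}

if __name__ == "__main__":
    print(run_parallel(["a", "c"], RUNS))
    print(run_parallel(["a", "c", "d", "e"], RUNS))
    print(run_parallel(["a", "b"], RUNS))
```

In this model, adding slow tools to a portfolio can turn a solved task into an unsolved one, and a single fast tool decides the verdict whether or not it is correct.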

Figure 11 shows that parallel portfolios have many more wrong results when compared to CPAchecker. Interestingly, the graph for ParPortfolio-3, the best performing parallel portfolio, remains lower than CPAchecker, i.e., it takes less CPU time. This is because the parallel portfolio takes results of the most efficient tool. Figure 12 shows that the best performing parallel portfolio performs better than CPAchecker in terms of resource efficiency except memory consumption.

Fig. 9: Sequential portfolios: Score-based quantile plot comparing the best and the worst sequential portfolio (SeqPortfolio-4 and SeqPortfolio-8, respectively) with the best performing standalone tool (CPAchecker)

Fig. 10: Sequential portfolios: Parallel-coordinates plot showing unsolved tasks and resource consumption per score point for best and worst portfolio (SeqPortfolio-4 and SeqPortfolio-8, resp.) and the best standalone tool (CPAchecker)

#### 4.5 RQ 3: Evaluation of Algorithm Selection Verifier

We now present the results of the algorithm-selection verifier. Table 4 shows the summary of results for algorithm selection: There is a clear trend of better results with more verifiers. This is expected because our selector that was trained using machine learning has more options to choose from, and can choose the better one. Also, an algorithm-selection verifier does not need to share resources for the verification task. It needs to perform the prediction, which takes some resources; but after this step all the provided resources are available to the verifier. The number of wrong results is also comparable with CPAchecker, as the training process is biased towards selecting the verifiers that are correct.

In Fig. 13, all the plots start from similar scores but at different times. Initially, CPAchecker performs better with respect to CPU time, but after around half the scores, algorithm selection starts being more efficient. Figure 14 shows that algorithm selection is also more resource efficient than CPAchecker.


Table 3: Parallel portfolios of different size with CPAchecker

#### 4.6 Discussion

The experiments show that each of the compositions has a configuration that can perform better than any standalone tool in terms of correctly solved tasks. Initially, we thought that portfolios would be less resource efficient than standalone tools, and, in particular, would not be able to solve hard tasks as the resources allocated to each tool would be less. But the experimental data support the opposite: the benchmark set had only a few such tasks, and for most of the tasks that were hard for one tool, there was some other tool that solved them in the given time. This was especially pronounced in the parallel portfolio. The verifiers in the portfolios have to be selected with different strengths; otherwise there is no benefit, and the portfolio might even perform worse.

Both portfolios prefer fast results, as there is no selector. To mitigate this, one needs to either select the tools carefully or add a validation step.

Our algorithm selection was based on a model trained using machine learning. The training penalized the tools that produced more incorrect results, but it did not consider the resource consumption of these tools. In comparison to both portfolios, the verifier based on algorithm selection produced far fewer incorrect results. We think that if we had used the resource consumption data in our training, the verifier based on selection would have consumed fewer resources. Our verifier combinations are easy to construct by simply selecting tools that complement each other well. Although this strategy is simple, we found that it still leads to successful combinations for all evaluated combination types. Nevertheless, the combinations can be further fine-tuned to achieve even better results.

Fig. 11: Parallel portfolios: Score-based quantile plot comparing the best and the worst performing parallel portfolios (ParPortfolio-3 and ParPortfolio-8, respectively) with the best performing standalone tool (CPAchecker)

Fig. 12: Parallel portfolios: Parallel-coordinates plot showing unsolved tasks and resource consumption per score point of best and worst portfolio (ParPortfolio-3 and ParPortfolio-8, resp.) and the best standalone tool (CPAchecker)

The portfolio compositions are easy to construct, and with a well-diversified tool selection, portfolios can perform well. Also, the portfolios should not be too large unless we are willing to increase the resources. On the other hand, training the selection requires more preliminary work, but with limited resources and enough choice (number of tools) the selection-based verifier works better.

#### 5 Threats to Validity

External Validity. A combination of tools can only be as good as the parts it is combined from. Therefore, the concrete instantiation of our tool combinations is limited by the selected tools and their configuration. We have selected eight of the most powerful verification tools as determined by the annual software-verification competition, and executed them in the original configuration as submitted to the competition. Furthermore, our evaluation results only hold for the given benchmark set. While we have evaluated our tool combinations on programs taken from one of the largest and most diverse verification benchmarks publicly available, the performance of the evaluated combinations might differ on other sets of tasks.

Table 4: Algorithm-selection-based verifiers of different sizes with CPAchecker

Similarly, this also impacts the training of our algorithm selector. The training of a learning-based algorithm selector, which we employ for tool combinations based on algorithm selection, requires a large and diverse set of verification tasks; and each task has to be labeled with the execution results of each tool in our combination. The used benchmarks repository<sup>6</sup> was created by the efforts of the verification community over many years. We are not aware of any other benchmark set of verification tasks that is as diverse as this one. As a result, we had to train our algorithm selector on the same dataset that we later use for benchmarking the tool combinations. Therefore, we only showed that algorithm selection improves the performance of verification on the given benchmark set, and the selector might only generalize to a set of tasks with similarly distributed verification tasks. For a fair comparison, we (1) restricted the training to linear models, which are known to generalize well, (2) trained only on a random subset of the benchmark, and (3) cross-validated our model over multiple benchmark splits. The variance of selection performance between different splits was less than 1 %. Therefore, the performance of our trained algorithm selector is likely independent of the random subset selected for training.

<sup>6</sup> https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks

Fig. 13: Algorithm-selection-based verifiers: Score-based quantile plot comparing the best and the worst performing portfolio (AlgoSelection-8 and AlgoSelection-3, respectively) with the best performing standalone tool (CPAchecker)

Fig. 14: Algorithm-selection-based verifiers: Parallel-coordinates plot showing unsolved tasks and resource consumption per score point of the best and the worst performing algorithm selection (AlgoSelection-8 and AlgoSelection-2, respectively) and the best performing standalone tool (CPAchecker)

Finally, the evaluation of algorithm selection is dependent on the chosen selection methodology and choosing alternative selection methods, for example, based on hand-crafted rules, might impact the evaluation. However, the design of hand-crafted methods is not straightforward and might require deep expert knowledge about the tool implementation. Depending on the human designer, this design process might in addition be biased in favor of certain tool combinations, which could also impact the experimental results.

For sequential portfolios, we ordered verifiers in sequence according to their performance in SV-COMP 2021. Changing the order of the tools might change the results with respect to resource consumption as well as soundness.

Internal Validity. We have used the same verifier archives, benchmark set, benchmarking framework, resource limits, and infrastructure to execute our experiments as was used in SV-COMP 2021. This minimizes the influence of a changing environment on our experiments, allowing us to compare results of our verifier combinations to the results of the standalone tools from SV-COMP 2021.

CoVeriTeam induces an overhead of about 0.8 s for each actor in the composition, and around 44 MB memory overhead [15]. It is possible that one can reduce this overhead by using shell scripts, but we decided in favor of using CoVeriTeam for composing tools because of the modular design. The overhead is especially pronounced in our algorithm-selector composition. We could have saved a few seconds if we had used a monolithic algorithm selector instead of composing one.

#### 6 Related Work

Combination Strategies for Software Verification. Combining verifiers to increase the verification performance is well established in the domain of software verification [1, 8, 20, 26, 31, 33, 46, 48, 49, 53]. In fact, the top three winning entries of the software-verification competition SV-COMP 2021 all combine various verification techniques to achieve their performance [6]. CPAchecker [8] combines up to six different verification approaches into three sequential portfolios that are task-dependently selected with an algorithm selector. PeSCo [49] ranks verification algorithms according to their predicted likelihood of solving a given task and then executes them sequentially in descending order. Ultimate Automizer [33] employs an integrated tool chain of preprocessing and verification algorithm to solve a given task. PredatorHP [46] and UFO [1] demonstrate that parallel portfolios can also be a promising strategy when running multiple specialized algorithms at the same time. Even though previous work showed that internal combinations can be successfully applied to improve the effectiveness of a single tool, we show that similar combinations can be effectively employed to combine 'off-the-shelf' verifiers. This gives us the unique opportunity to further increase the number of verifiable programs by simply combining state-of-the-art verification tools.

Cooperative methods [20] distribute the workload of a single verification task among multiple algorithms to combine their strengths. For example, conditional model checking [11, 12, 13, 14] runs two or more verifiers in sequence, while the program is reduced after every step to the state space of the program left unexplored by the previous algorithm. CoVeriTest [10], a tool for test-case generation based on verification, interleaves multiple verifiers, while (partially) sharing the analysis state between algorithms. MetaVal [19] integrates verification tools for witness validation (i.e., to check whether a previous verifier obtained a comprehensible result) by instrumenting the produced witness into the verified program. While cooperative methods are effective for reducing the workload of a verification task, employing cooperative methods at tool level would require exchanging analysis information between tools. In general, existing verification tools are not well suited for this type of cooperation, which led us to explore black-box verifier combinations. In addition, we showed that non-cooperative methods can improve the verification effectiveness without the need to adapt the employed tools.

Combining Algorithms Beyond Software Verification. The idea of combining algorithms to improve performance has been successfully applied in many research areas including SAT solving [51, 54, 56], constraint-satisfaction programs [21, 45, 57] and combinatorial-search problems [41]. Employed approaches traditionally focused on portfolio-based approaches [21, 51, 54], but recent techniques started to integrate algorithm selectors for either selecting single algorithms [45, 56] or portfolios of algorithms [44, 57]. For example, earlier works in SAT solving [51, 54] focused on parallel-portfolio solvers, while later works such as SATzilla [56] further improve the solving process by selecting a task-dependent solver. However, existing techniques often employ hybrid strategies between portfolios and algorithm selection to achieve state-of-the-art performance. In this context, Kashgarani and Kotthoff [38] have recently shown that parallel portfolios are generally bottlenecked by the available resources and that a pure algorithm selector that selects a single algorithm performs better. While we observed that portfolios of software verifiers are also restricted by available resources (i.e., the performance generally stops improving after a certain portfolio size), we found that all evaluated combination types yield a similar performance gain when configured correctly.

## 7 Conclusion

This paper describes a method to construct combinations of verification tools in a systematic and modular way. The method does not require any changes to the verification tools that are used to construct the combinations. Our experimental evaluation shows that all three considered combinations (sequential portfolio, parallel portfolio, and algorithm selection) can lead to performance improvements. The improvements can be significant although the construction does not require significant development effort, because we use CoVeriTeam for the combination and execution of verification tools. We hope that our contribution makes it easy for practitioners to get access to the best performance out of the latest research and development efforts in software verification.

### Declarations

Data Availability Statement. A reproduction package including all our results is available at Zenodo [16]. Additionally, the result tables are also available on a supplementary web page for convenient browsing.<sup>7</sup>

Funding Statement. This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).

Acknowledgement. We thank Tobias Kleinert for implementing the parallel portfolio combination in CoVeriTeam.

<sup>7</sup> https://www.sosy-lab.org/research/coveriteam-combinations

#### References



## On the Detection of Doped Software by Falsification<sup>⋆</sup>

Sebastian Biewer<sup>1</sup> and Holger Hermanns<sup>1,2</sup>

<sup>1</sup> Saarland University, Saarland Informatics Campus, Saarbrücken, Germany biewer@depend.uni-saarland.de

<sup>2</sup> Institute of Intelligent Software, Guangzhou, China

Abstract. Software doping is a phenomenon that refers to the presence of hidden software functionality, whose existence is only in the interest of the manufacturer. The most prominent example is the diesel emissions scandal. There is a need for methods that identify software doping, and such methods are bound to be applied to the final product with little or no knowledge about its internals. Black-box analysis techniques have recently been developed for this purpose, harvesting the formal foundations of software doping. This paper integrates them with established falsification techniques for the purpose of real-world applicability. With a focus on the diesel scandal and emissions tests on chassis dynamometers, we make the testing procedures significantly more effective in terms of time and cost. The theoretical results are implemented in a prototypical doping tester.

### 1 Introduction

Embedded software is the innovation driver of our times. Software-defined systems are permeating our communication, perception, and storage technology as well as our personal interactions with technical systems at an unprecedented pace. "Software-defined everything" is among the hottest buzzwords in IT technology today [2, 18].

There is a tremendous problem hiding behind this apparently unstoppable trend: The owners of the physical "hull" of everything will not be the ones owning the software defining everything, nor will they have the right to look at what and how everything is defined. This is because commercial software typically is protected by intellectual property rights of the software manufacturer. This prohibits any attempt to disassemble the software or to reconstruct its inner workings, although it is the very software that is forecast to be defining everything. The use of machine-learnt software components amplifies the problem considerably. Since commercial interests of the software manufacturers are seldom aligned with the interest of end users, the promise of software-defined everything might well become a dystopia from the perspective of individual digital sovereignty.

<sup>⋆</sup> This work is partly supported by DFG grant 389792660 as part of TRR 248 – CPEC, the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101008233, and by the Key-Area Research and Development Program Grant 2018B010107004 of Guangdong Province.

A massive example of software-defined collective damage is the diesel emissions scandal. Over a period of more than 10 years, millions of diesel-powered cars have been equipped with illegal software that altogether polluted the environment for the sake of commercial advantages of the car manufacturers. At its core, this was made possible by the fact that only a single, precisely defined test setup was put in place for checking conformance with exhaust emissions regulations. This made it a trivial software engineering task to identify the test particularities and to turn off emission cleaning outside these particular conditions. This is an archetypical instance of software doping.

Against this background, there is an urgent need to establish stronger and enforceable requirements on the systems we are interacting with, and this is indeed echoed in legislatory frameworks [24]. However, the roll-out of such requirements in everyday practice needs a firm understanding of the technological basis for enforcing such requirements, respectively for identifying violations thereof.

This paper is part of ongoing research addressing this challenge. It harvests the outcomes of three recent scientific achievements: (i) formal definitions of software doping based on contracts enforcing well-defined software behaviour in the vicinity of standardised behaviour [12], (ii) a solid foundation for doping tests to be carried out in practice [6], and (iii) probabilistic falsification techniques developed to guide the search for property violations in cyber-physical system engineering [1, 20]. By combining the above ingredients, this paper addresses the question of how to perform cost-effective doping tests that are indeed likely to succeed in uncovering actual cases of doped software. It approaches this question both from a foundational and from a practical perspective. On the foundational side, we introduce a temporal hyperlogic to reason about signals, which we use to characterise the falsifiable fragment of a software doping contract. Great care is taken for this to work on the actual time-discrete traces that are recorded from the real system, which itself is running in continuous time. On the practical side, we discuss a novel approach to probabilistic falsification that overcomes the problem that in many practical cases the possibility to carry out masses of highly-controlled experiments with a physical system is severely limited by cost or time budgets. To account for this, we add a passive recording component to the concept of falsification, which observes the system in-the-wild to propose only a few candidate traces to be inspected under lab conditions. All this is instantiated in the context of automotive emissions, where lab conditions correspond to expensive test runs on a chassis dynamometer, while observing the system in-the-wild is nothing else than collecting statistics while driving on normal roads.

The paper makes the following distinguished contributions: (i) a linear temporal logic for hyperproperties over continuous signals that enables quantitative reasoning across traces, (ii) a logical reformulation of the falsifiable fragment of a software doping contract, (iii) a probabilistic falsification technique that uses passive recording for cost-effective doping testing, and (iv) an exemplary instantiation of these concepts in the context of automotive emissions.

Related Work. Software doping theory provides a formal basis for enlarging the requirements on vehicle exhaust emissions beyond overly narrow lab test conditions. That conceptual limitation has by now been addressed by the official authorities responsible for car type approval [24, 25]: The old NEDC-based test procedure is replaced by the newer Worldwide Harmonised Light Vehicles Test Procedure (WLTP), which is deemed to be more realistic. WLTP replaces the NEDC test by a new WLTC test, but WLTC still is just a single test scenario. In addition, WLTP embraces so-called Real Driving Emissions (RDE) tests to be conducted on public roads. A recently launched mobile phone app [8], LolaDrives, harvests runtime monitoring technology for making low-cost RDE tests accessible to everyone.

Learning or approximating the behaviour of a system under test has been studied intensively. Meinke and Sindhu [19] were among the first to present a testing approach incrementally learning a Kripke structure representing a reactive system. Volpato and Tretmans [27] propose a learning approach which gradually refines an under- and over-approximation of an input-output transition system representing the system under test. The correctness of this approach needs several assumptions, e.g., an oracle indicating when, for some trace, all outputs, which extend the trace to a valid system trace, have been observed.

#### 2 Background

This section introduces the necessary background on temporal logics for hyperproperties and for continuous signals and on the basics of probabilistic falsification, and reviews the formal definitions of software doping.

#### 2.1 Temporal Logics

Linear Temporal Logic (LTL) [22] is a popular formalism to reason about properties of traces. A trace is an infinite word where each literal is a subset of AP, the set of atomic propositions. Programs are interpreted as sets S<sub>LTL</sub> ⊆ (2<sup>AP</sup>)<sup>ω</sup> of such traces. LTL provides expressive means to characterise sets of traces, often called trace properties.

Temporal Logics for Hyperproperties. For some set of traces T, a trace property defines a subset of T, whereas a hyperproperty defines a set of subsets of T. In this way it specifies which traces are valid in combination with one another. Many temporal logics have been extended to corresponding hyperlogics supporting the specification of hyperproperties. HyperLTL [11] is such a temporal logic for the specification of hyperproperties of reactive systems. It extends LTL with trace quantifiers and trace variables that make it possible to refer to multiple traces within a logical formula. A HyperLTL formula is defined by the following grammar where π is drawn from a set V of trace variables:

$$\begin{array}{rcl} \psi & ::= & \exists \pi.\, \psi \mid \forall \pi.\, \psi \mid \phi \\ \phi & ::= & a_{\pi} \mid \neg \phi \mid \phi \land \phi \mid \mathsf{X}\, \phi \mid \phi \,\mathsf{U}\, \phi \end{array}$$

The quantifiers ∃ and ∀ quantify existentially and universally, respectively, over the set of traces. For example, the formula ∀π. ∃π′. φ means that for every trace π there exists another trace π′ such that φ holds over the pair of traces. To account for distinct valuations of atomic propositions across distinct traces, the atomic propositions are indexed with trace variables: for some atomic proposition a ∈ AP and some trace variable π ∈ V, a<sub>π</sub> states that a holds in the initial position of trace π. The temporal operators and Boolean connectives are interpreted as usual for LTL. In particular, X φ means that φ holds in the next state of every trace under consideration. Likewise, φ U φ′ means that φ′ eventually holds in every trace under consideration at the same point in time, provided φ holds in every previous instant in all such traces. Further operators are derivable: F φ ≡ true U φ enforces φ to eventually hold in the future, G φ ≡ ¬F ¬φ enforces φ to always hold, and the weak-until operator φ W φ′ ≡ φ U φ′ ∨ G φ allows φ to always hold as an alternative to the obligation for φ′ to eventually hold. We refer to [11] for the formal semantics.

Temporal Logics over Continuous Domains. LTL enables reasoning over traces σ ∈ (2<sup>AP</sup>)<sup>ω</sup> which are of discrete nature with respect to the time domain they represent. With each literal in the trace representing a time step, σ can equivalently be viewed as a function N → 2<sup>AP</sup>. One extension of LTL is Signal Temporal Logic (STL) [13, 17], which instead is used for reasoning over real-valued signals that may change in value along an underlying time domain. A signal is a function s : T → R where T is the time domain. The time domain T can be either N (discrete-time signals), or R<sub>≥0</sub> (continuous-time signals). This can be lifted to multi-dimensional signals w(t) = (s<sub>1</sub>(t), . . . , s<sub>n</sub>(t)), mapping each time point to some element of R<sup>n</sup>. We refer to such a w : T → R<sup>n</sup> as a (discrete-time or continuous-time) trace of width n in the sequel.

STL formulas can express properties of systems modelled as sets S<sub>STL</sub> ⊆ (T → R<sup>n</sup>) of traces of some fixed width n, basically by making the atomic properties refer to booleanizations of the signal values. The syntax of the variant of STL that we use in this paper is as follows, where f : R<sup>n</sup> → R:

$$
\phi ::= \top \mid f > 0 \mid \neg \phi \mid \phi \land \phi \mid \phi \,\mathsf{U}\, \phi
$$

STL replaces atomic propositions by threshold predicates of the form f > 0, which hold if and only if function f applied to the signal values at the current time returns a positive value. The Boolean operators and the Until operator U are very similar to those of HyperLTL. The Next operator X is not part of STL, because "next" is without precise meaning in continuous time. The definitions of the derived operators F, G and W are the same as for HyperLTL. Formally, the Boolean semantics of an STL formula φ at time point t ∈ T for a trace w = (s<sub>1</sub>, . . . , s<sub>n</sub>) is defined inductively in Fig. 1.

$$\begin{array}{lcl} w,t \models \top & & \\ w,t \models f > 0 & \text{iff} & f(s_1(t), \dots, s_n(t)) > 0 \\ w,t \models \neg \phi & \text{iff} & w,t \not\models \phi \\ w,t \models \phi \land \psi & \text{iff} & w,t \models \phi \text{ and } w,t \models \psi \\ w,t \models \phi \,\mathsf{U}\, \psi & \text{iff} & \text{exists } t' \ge t \text{ s.t. } w,t' \models \psi \text{ and for all } t'' \in [t, t'),\ w,t'' \models \phi \end{array}$$

Fig. 1: Boolean semantics of STL formulas

Quantitative Interpretation. STL has been extended by a quantitative semantics [1, 13, 14] as presented in Fig. 2. This semantics is designed in such a way that whenever ρ(φ, w, t) ≠ 0, its sign indicates whether w, t ⊨ φ holds in the Boolean semantics. For any STL formula φ, trace w and time t, if ρ(φ, w, t) > 0, then w, t ⊨ φ holds, and if ρ(φ, w, t) < 0, then w, t ⊨ φ does not hold. For the scope of this paper, we work with the untimed Until operator, instead of allowing U<sub>[a,b]</sub> for arbitrary bounds a, b ∈ R. With only the untimed Until operator, the continuous and discrete semantics [14] coincide.

$$\begin{aligned} \rho(\top, w, t) &= \infty \\ \rho(f > 0, w, t) &= f(s_1(t), \dots, s_n(t)) \\ \rho(\neg \phi, w, t) &= -\rho(\phi, w, t) \\ \rho(\phi \land \psi, w, t) &= \min(\rho(\phi, w, t), \rho(\psi, w, t)) \\ \rho(\phi \,\mathsf{U}\, \psi, w, t) &= \sup_{t' \ge t} \min\left\{ \rho(\psi, w, t'), \inf_{t'' \in [t, t')} \rho(\phi, w, t'') \right\} \end{aligned}$$

Fig. 2: Quantitative semantics of STL formulas
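As a minimal sketch, and assuming a finite discrete-time trace (so that sup and inf become max and min over the recorded samples), the quantitative semantics of Fig. 2 can be evaluated by structural recursion. The tuple encoding of formulas and the names `rho`, `eventually` and `always` are illustrative choices made here, not notation from the paper.

```python
import math

# Formulas are nested tuples (a hypothetical encoding for illustration):
#   ("top",), ("pred", f), ("not", phi), ("and", phi, psi), ("until", phi, psi)
# A trace w is a list of samples; each sample is a tuple of n real values.

def rho(phi, w, t):
    """Quantitative value of phi on the finite discrete-time trace w at index t."""
    kind = phi[0]
    if kind == "top":
        return math.inf
    if kind == "pred":                       # threshold predicate f > 0
        return phi[1](*w[t])
    if kind == "not":
        return -rho(phi[1], w, t)
    if kind == "and":
        return min(rho(phi[1], w, t), rho(phi[2], w, t))
    if kind == "until":
        best = -math.inf
        prefix_min = math.inf                # inf over the empty interval [t, t)
        for t2 in range(t, len(w)):
            best = max(best, min(rho(phi[2], w, t2), prefix_min))
            prefix_min = min(prefix_min, rho(phi[1], w, t2))
        return best
    raise ValueError(f"unknown operator: {kind}")

def eventually(phi):                         # F phi = top U phi
    return ("until", ("top",), phi)

def always(phi):                             # G phi = not F not phi
    return ("not", eventually(("not", phi)))
```

For a trace of width 1 with samples 1, 3, 2 and the predicate x − 2.5 > 0, the sketch yields a positive value for F (the threshold is exceeded once) and a negative value for G (it is not exceeded throughout), matching the sign convention described above.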

Robustness and Falsification. The value of the quantitative semantics can serve as a robustness estimate and as such be used to search for a violation of the property at hand, i.e., to falsify it. The robustness of STL formula φ is its quantitative value at time 0, that is, R<sub>φ</sub>(w) := ρ(φ, w, 0). So, falsifying a formula φ for a system S<sub>STL</sub> boils down to a search problem with the goal condition R<sub>φ</sub>(w) < 0. Successful falsification algorithms solve this problem by understanding it as the optimisation problem minimise<sub>w∈S<sub>STL</sub></sub> R<sub>φ</sub>(w). Algorithm 1 [1, 20] sketches an algorithm for Monte-Carlo Markov Chain falsification, which is based on acceptance-rejection sampling [10]. Our version of the algorithm works on system traces instead of an input space. An input to the algorithm is an initial trace w and a computable robustness function R. Robustness computation for finite timed traces of simulations of a system has been discussed in the literature [13, 14]; we omit this discussion here. The third input PS is a proposal scheme that proposes a new trace to the algorithm based on the previous one (line 2). The parameter β (used in line 3) can be adjusted during the search and is a means to avoid being trapped in local minima, which would prevent the search from finding a global minimum. Any two traces w and w′ ∈ S<sub>STL</sub> with robustness values R(w) and R(w′) are sampled with probability proportional to e<sup>−βR(w)</sup> / e<sup>−βR(w′)</sup> (lines 3-6). The algorithm seeks to minimise R over the system's traces S<sub>STL</sub>, and terminates when it finds a trace with a negative robustness value, i.e., a trace that violates the STL property from which R is derived.
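The following is a minimal sketch of a Metropolis-style acceptance-rejection loop that is consistent with the description above; it is not a reproduction of Algorithm 1. The function name `mcmc_falsify`, the iteration budget `max_iter`, the explicit random generator `rng` and the two-argument form of the proposal scheme `PS(w, rng)` are assumptions made for illustration.

```python
import math
import random

def mcmc_falsify(w, R, PS, beta=1.0, max_iter=10_000, rng=None):
    """Search for a trace with negative robustness.

    w    : initial trace
    R    : robustness function, mapping a trace to a real number
    PS   : proposal scheme; here it takes the current trace and a random
           generator and returns a candidate trace (an assumed signature)
    beta : controls how readily a worse candidate is accepted
    Returns a falsifying trace, or None once the assumed budget max_iter
    is exhausted.
    """
    rng = rng or random.Random()
    for _ in range(max_iter):
        if R(w) < 0:
            return w
        candidate = PS(w, rng)
        # Ratio e^{-beta R(candidate)} / e^{-beta R(w)}, capped at 1.
        alpha = math.exp(min(0.0, -beta * (R(candidate) - R(w))))
        if rng.random() <= alpha:
            w = candidate
    return w if R(w) < 0 else None
```

A candidate with lower robustness is always accepted, while a worse candidate is accepted with a probability that shrinks as β grows; this is the mechanism that allows the search to leave local minima.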

#### 2.2 Software Doping

Contracts and Robustness. Earlier work [12] has developed a formal basis for the purpose of characterising software doping, by providing precise definitions of when the system's behaviour is clean, i.e., does not contain hidden functionalities not in the interest of the user. If a program exhibits behaviour that is not clean, it is doped.

All cleanness definitions are based on the assumption that there is some well-defined and agreed standard input/output behaviour of the system. Robust cleanness, the cleanness definition that we work with in this paper, extends this behaviour to the vicinity around the inputs and outputs close to the standard behaviour. The definition of "vicinity" and of "standard behaviour" is assumed to be part of a contract between software manufacturer and user. The contract entails the standard behaviour, distance functions for input and output values, and distance thresholds to define the input and output vicinity, respectively. With this, a system behaviour is considered clean, if its output is (or stays) in the output vicinity of the standard, unless the input is (or moves) outside the standard's input vicinity.

Example 1. A concrete contract for diesel-powered cars will, for instance, enforce bounded deviations in exhaust emissions provided the driving profile stays in the bounded vicinity of the standardised tests (such as NEDC or WLTC). Recent experiments [6] have considered contracts based on NEDC with speed values as inputs and NO<sub>x</sub> emissions as output values, together with distance functions computing the absolute difference of speed inputs and NO<sub>x</sub> outputs, respectively, and with value thresholds of 15 km/h for inputs and 80 mg/km for outputs.

A function d : X × X → R<sub>≥0</sub> is a pseudometric function if it satisfies d(x, x) = 0, d(x, y) = d(y, x) and d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ X. We let σ[k] denote the k-th literal of the infinite word σ.

Reactive Execution Model. We can view a (nondeterministic) reactive program as a function S<sub>R</sub> : In<sup>ω</sup> → 2<sup>(Out<sup>ω</sup>)</sup> perpetually mapping inputs In to sets of outputs Out [12]. A contract is a tuple C = ⟨StdIn, d<sub>In</sub>, d<sub>Out</sub>, κ<sub>i</sub>, κ<sub>o</sub>⟩ where StdIn ⊆ In<sup>ω</sup> is the input space of the system designated to define the standard behaviour, d<sub>In</sub> : (In<sup>∗</sup> × In<sup>∗</sup>) → R<sub>≥0</sub> and d<sub>Out</sub> : (Out<sup>∗</sup> × Out<sup>∗</sup>) → R<sub>≥0</sub> are pseudometric distance functions on finite words over inputs, respectively outputs, and κ<sub>i</sub> ∈ R<sub>≥0</sub> is a constant defining the maximum distance to the standard input allowed, and similarly κ<sub>o</sub> ∈ R<sub>≥0</sub> is the maximum distance between two outputs such that they are still considered sufficiently close. For the purpose of this paper, we assume the distance functions to be induced by pointwise pseudometric functions of the form d<sub>In</sub> : (In × In) → R<sub>≥0</sub> and d<sub>Out</sub> : (Out × Out) → R<sub>≥0</sub> in a past-forgetful manner.

Definition 1. A reactive program S<sub>R</sub> : In<sup>ω</sup> → 2<sup>(Out<sup>ω</sup>)</sup> is robustly clean w.r.t. contract C = ⟨StdIn, d<sub>In</sub>, d<sub>Out</sub>, κ<sub>i</sub>, κ<sub>o</sub>⟩ if for all input sequences i, i′ ∈ In<sup>ω</sup> with i ∈ StdIn, it holds for arbitrary k ≥ 0 that whenever d<sub>In</sub>(i[j], i′[j]) ≤ κ<sub>i</sub> for all j ≤ k, then

1. for all o ∈ S<sub>R</sub>(i) there exists o′ ∈ S<sub>R</sub>(i′) such that d<sub>Out</sub>(o[k], o′[k]) ≤ κ<sub>o</sub>, and
2. for all o′ ∈ S<sub>R</sub>(i′) there exists o ∈ S<sub>R</sub>(i) such that d<sub>Out</sub>(o[k], o′[k]) ≤ κ<sub>o</sub>.

The definition enforces that whenever an input i′ remains within κ<sub>i</sub> vicinity around the standard input i, then the output sets generated by i and i′ are at most κ<sub>o</sub> away from each other.

HyperLTL Characterisation. D'Argenio et al. [12] prove that the following two HyperLTL formulas characterise robust cleanness in the sense of Definition 1.

$$\forall \pi_1.\, \forall \pi_2.\, \exists \pi'_2.\ \mathsf{StdIn}_{\pi_1} \to \Big( \mathsf{G}(\mathsf{i}_{\pi_2} = \mathsf{i}_{\pi'_2}) \land \big( (\hat{d}_{\mathsf{Out}}(\mathsf{o}_{\pi_1}, \mathsf{o}_{\pi'_2}) \le \kappa_{\mathsf{o}}) \,\mathsf{W}\, (\hat{d}_{\mathsf{In}}(\mathsf{i}_{\pi_1}, \mathsf{i}_{\pi'_2}) > \kappa_{\mathsf{i}}) \big) \Big) \tag{1}$$

$$\forall \pi_1.\, \forall \pi_2.\, \exists \pi'_1.\ \mathsf{StdIn}_{\pi_1} \to \Big( \mathsf{G}(\mathsf{i}_{\pi_1} = \mathsf{i}_{\pi'_1}) \land \big( (\hat{d}_{\mathsf{Out}}(\mathsf{o}_{\pi'_1}, \mathsf{o}_{\pi_2}) \le \kappa_{\mathsf{o}}) \,\mathsf{W}\, (\hat{d}_{\mathsf{In}}(\mathsf{i}_{\pi'_1}, \mathsf{i}_{\pi_2}) > \kappa_{\mathsf{i}}) \big) \Big) \tag{2}$$
The non-atomic propositions in the formulas above are syntactic sugar; the input and output values in system S<sub>LTL</sub> give rise to a binary encoding into sets of atomic propositions.

Mixed-IO Model. The reactive execution model and the HyperLTL characterisation above have the strict requirement that for every input, the system produces exactly one output. Recent work [5, 6] instead considers mixed-IO models, where a program S<sub>IO</sub> ⊆ (In ∪ Out)<sup>ω</sup> is a subset of traces containing both inputs and outputs, but without any restriction on the order or frequency in which inputs and outputs appear in the trace. In particular, they are not required to strictly alternate (but they may, and in this way the reactive execution model can be considered a special case). A particularity of this model is the distinct output symbol δ for quiescence, i.e., the absence of an output. For example, finite behaviour can be expressed by adding infinitely many δ symbols to a finite trace. In this model, standard behaviour is captured by a subset Std ⊆ S<sub>IO</sub> of traces of a system S<sub>IO</sub>. To capture the notion of robust cleanness in the mixed-IO model, every trace is projected into an input, respectively output domain. The set of input symbols contains one additional element –<sub>i</sub> that indicates that in the respective step an output was produced, while masking the concrete output. Similarly, the set of output symbols contains the additional element –<sub>o</sub> to mask a concrete input symbol. Projection on inputs ↓<sub>i</sub> : (In ∪ Out)<sup>ω</sup> → (In ∪ {–<sub>i</sub>})<sup>ω</sup> and projection on outputs ↓<sub>o</sub> : (In ∪ Out)<sup>ω</sup> → (Out ∪ {–<sub>o</sub>})<sup>ω</sup> are defined for all traces σ ∈ (In ∪ Out)<sup>ω</sup> and k ∈ N as follows: σ↓<sub>i</sub>[k] := if σ[k] ∈ In then σ[k] else –<sub>i</sub> and similarly σ↓<sub>o</sub>[k] := if σ[k] ∈ Out then σ[k] else –<sub>o</sub>. The distance functions d̄<sub>In</sub> and d̄<sub>Out</sub> apply on input and output symbols or their respective masks, i.e. they are pseudometrics in (In ∪ {–<sub>i</sub>}) × (In ∪ {–<sub>i</sub>}) → R<sub>≥0</sub> ∪ {∞} and, respectively, (Out ∪ {–<sub>o</sub>}) × (Out ∪ {–<sub>o</sub>}) → R<sub>≥0</sub> ∪ {∞}. As for the reactive model, we define a contract formally as a tuple C = ⟨Std, d̄<sub>In</sub>, d̄<sub>Out</sub>, κ<sub>i</sub>, κ<sub>o</sub>⟩ (where StdIn is replaced by Std, d<sub>In</sub> by d̄<sub>In</sub>, and d<sub>Out</sub> by d̄<sub>Out</sub>). Its satisfaction is defined by the adapted robust cleanness definition below [6].
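The two projections can be sketched on a finite prefix of a mixed-IO trace as follows. The representation of symbols as tagged pairs `("in", value)` and `("out", value)`, the strings `"-i"` and `"-o"` standing in for the masks, and the function names are assumptions made for illustration only.

```python
MASK_I = "-i"            # stands in for the mask of the input projection
MASK_O = "-o"            # stands in for the mask of the output projection
DELTA = ("out", "δ")     # quiescence, modelled as a distinct output symbol

def project_inputs(prefix):
    """Keep input symbols; replace every output symbol by the input mask."""
    return [x if x[0] == "in" else MASK_I for x in prefix]

def project_outputs(prefix):
    """Keep output symbols; replace every input symbol by the output mask."""
    return [x if x[0] == "out" else MASK_O for x in prefix]
```

Both projections preserve the length and the positions of the trace, so the k-th symbol of a projection always refers to the k-th step of the original trace.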

Definition 2. A system S<sub>IO</sub> ⊆ (In ∪ Out)<sup>ω</sup> is robustly clean w.r.t. contract C = ⟨Std, d̄<sub>In</sub>, d̄<sub>Out</sub>, κ<sub>i</sub>, κ<sub>o</sub>⟩ if and only if Std ⊆ S<sub>IO</sub> and for all σ ∈ Std, σ′ ∈ S<sub>IO</sub> and k ≥ 0 it holds that whenever d̄<sub>In</sub>(σ[j]↓<sub>i</sub>, σ′[j]↓<sub>i</sub>) ≤ κ<sub>i</sub> for all j ≤ k then


Def. 2 contains two requirements, numbered 1 and 2. In the following, we will sometimes explicitly address either of these conditions by referring to it as the first, respectively second condition of robust cleanness.

### 3 Logical characterisation for mixed IO

This section discusses how to reformulate robust cleanness to make it amenable to probabilistic falsification. For this, we translate eq. (2) into a HyperSTL formula, subsequently remove its quantifiers by means of a highly efficient parallel composition on the level of traces and, finally, carefully adapt this quantifier-free representation to the mixed-IO model.

Hyperlogics over Continuous Domains. Previous work [21] extends STL to HyperSTL echoing the extension of LTL to HyperLTL. A major challenge of the robustness computation for HyperSTL formulas is the adequate handling of the continuous time domain when comparing two execution traces of a system. For systems that can be simulated, this can be avoided [21] by composing one or more copies of the simulation model in parallel to itself [11]. Snapshots of the composed system are effectively snapshots of the individual copies of the model at exactly the same time point. This approach is not available when interacting with (black-box) real-world cyber-physical systems (CPS). In such scenarios, a suitable logic is HyperSTL\* [7], an extension of STL\* [9], which enables the comparison of different time points in different traces by means of a freeze operator. We use a variant of this idea, but with a HyperSTL syntax similar to [21].

$$
\begin{array}{rcl}
\psi & ::= \exists \pi. \psi \mid \,\forall \pi. \psi \mid \,\phi \\
\phi & ::= \top \quad \mid \, f > 0 \mid \,\neg\phi \mid \,\phi \land \phi \mid \,\phi \,\mathsf{U}\,\phi \,\,.
\end{array}
$$

The meaning of the universal and existential quantifier is as for HyperLTL. A crucial difference to the other logics presented above is the proposition f > 0. In contrast to HyperLTL and to the existing definition of HyperSTL, we consider it insufficient to allow propositions to refer to only a single trace. In HyperLTL that does not cause harm, because atomic propositions of individual traces can be compared by means of the Boolean connectives. To formulate thresholds for real values, however, we feel the need to allow real values from multiple traces to be combined in the function f, and thus to appear as arguments of f. Hence, in our semantics of HyperSTL, f > 0 holds if and only if the result of f, applied to all traces quantified over, is greater than 0. For this to work formally, the arity of function f is the product of the trace width n and the number m of traces quantified over at the occurrence of f > 0 in the formula, so f : (R<sup>n</sup>)<sup>m</sup> → R.

A trace assignment [11] Π : V → S<sub>STL</sub> is a partial function assigning traces of S<sub>STL</sub> to variables. Let Π[π := w] denote the same function as Π, except that π is mapped to trace w. The quantitative semantics of a HyperSTL formula ψ, at time point t ∈ T, for a system S ⊆ (T → R<sup>n</sup>) and a trace assignment Π is defined inductively:

$$\begin{aligned} \rho(\exists \pi.\, \psi, \mathsf{S}, \Pi, t) &= \max_{w \in \mathsf{S}} \rho(\psi, \mathsf{S}, \Pi[\pi := w], t) \\ \rho(\forall \pi.\, \psi, \mathsf{S}, \Pi, t) &= \min_{w \in \mathsf{S}} \rho(\psi, \mathsf{S}, \Pi[\pi := w], t) \\ \rho(\top, \mathsf{S}, \Pi, t) &= \infty \\ \rho(f > 0, \mathsf{S}, \Pi, t) &= f(\Pi(\pi_1)(t), \dots, \Pi(\pi_m)(t)) \quad \text{for } dom(\Pi) = \{\pi_1, \dots, \pi_m\}^{1} \\ \rho(\neg \phi, \mathsf{S}, \Pi, t) &= -\rho(\phi, \mathsf{S}, \Pi, t) \\ \rho(\phi_1 \land \phi_2, \mathsf{S}, \Pi, t) &= \min(\rho(\phi_1, \mathsf{S}, \Pi, t), \rho(\phi_2, \mathsf{S}, \Pi, t)) \\ \rho(\phi_1 \,\mathsf{U}\, \phi_2, \mathsf{S}, \Pi, t) &= \sup_{t' \ge t} \min\left\{ \rho(\phi_2, \mathsf{S}, \Pi, t'), \inf_{t'' \in [t, t')} \rho(\phi_1, \mathsf{S}, \Pi, t'') \right\} \end{aligned}$$

It is an easy exercise to show that for continuous-time signals this quantitative semantics of HyperSTL is a conservative extension of the quantitative semantics of STL discussed above. For discrete-time signals it is important to understand that discrete time points often represent points in continuous time. It is widely accepted that this can be cast into a (strictly monotonic) timing function τ : N → R<sub>≥0</sub> [3, 14]. The HyperSTL semantics given above is meaningful in a discrete-time setting if all traces share the same timing function.

HyperSTL characterisation. As discussed in Section 2.2, robust cleanness is a hyperproperty. Recent work on testing and monitoring of robust cleanness [6] explains the difficulties of monitoring such hyperproperties. In essence, it turns out that the first condition of Definition 2 cannot be refuted by observing a real system. Intuitively, this is because this condition effectively puts a constraint on the lower bound of the size of the sets of outputs that a system must be able to produce whereas the second condition enforces an upper bound. A violation of the upper-bound constraint is irrevocable, i.e., once observed, the system is for sure not robustly clean. However, not having observed an output that is larger than the lower bound does not exclude the possibility of observing such an output in the future. We therefore follow [6] and focus only on the second condition of the robust cleanness definition in our work on falsification. For the HyperLTL characterisation this means that we only work with the second formula, labelled (2).

<sup>1</sup> We admit some sloppiness; the set dom(Π) should have a fixed order.

The HyperLTL characterisation (2) assumes the system to be a subset of (2<sup>AP</sup>)<sup>ω</sup> and works with distances between traces by means of a Boolean encoding into atomic propositions. We will describe how to transform the HyperLTL formula (2) into a HyperSTL formula, where systems are given as subsets of (T → R<sup>n</sup>) for some width n ∈ N. Robust cleanness distinguishes between inputs and outputs, and we assume that the input set In and the output set Out are represented as signals of width m, respectively width l. The system space then is S<sub>STL</sub> ⊆ (T → R<sup>m+l</sup>). Solely for the sake of clarity, we will in the sequel, unless otherwise stated, restrict to m = l = 1, i.e., In ⊆ R and Out ⊆ R, and thus work with a fixed width of 2, hence S<sub>STL</sub> ⊆ (T → In × Out).

We can assume a set Std ⊆ S<sub>STL</sub> as given, which defines all standard behaviours of the system. The HyperSTL characterisation of the HyperLTL formula (2) is then

$$\begin{aligned} \forall \pi_1.\, \forall \pi_2.\, \exists \pi'_1.\ \mathsf{Std}_{\pi_1} > 0 \to \Big( & \mathsf{G}\big(|\mathsf{i}_{\pi_1} - \mathsf{i}_{\pi'_1}| \le 0 \land \mathsf{Std}_{\pi'_1} > 0\big) \;\land \\ & \big((d_{\mathsf{Out}}(\mathsf{o}_{\pi'_1}, \mathsf{o}_{\pi_2}) - \kappa_{\mathsf{o}} \le 0) \,\mathsf{W}\, (d_{\mathsf{In}}(\mathsf{i}_{\pi'_1}, \mathsf{i}_{\pi_2}) - \kappa_{\mathsf{i}} > 0)\big) \Big) \end{aligned} \tag{3}$$

The quantifiers remain unchanged relative to (2). The predicate StdIn<sub>π<sub>1</sub></sub>, which holds if and only if π<sub>1</sub> is a standard input, is replaced by the function Std<sub>π<sub>1</sub></sub>, which returns a positive value if π<sub>1</sub> is in Std, and a non-positive value otherwise. The input equality requirement of π<sub>1</sub> and π′<sub>1</sub> is ensured by globally enforcing |i<sub>π<sub>1</sub></sub> − i<sub>π′<sub>1</sub></sub>| ≤ 0.

Since we switched from the concept of standard inputs to the concept of standard traces, we must also check that π′<sub>1</sub> is a standard trace. This echoes the setup in Definition 2, where the second requirement asks for a trace σ″ ∈ Std instead of a trace from S<sub>IO</sub>, see [6] for an elaborate discussion. In the operands of the Weak-Until operator W, we replace the AP-encoded versions d̂<sub>In</sub> and d̂<sub>Out</sub> by the original distance functions d<sub>In</sub> and d<sub>Out</sub>, and we perform simple arithmetic operations to match the syntactic requirements of HyperSTL.

We remark that for encoding Std<sub>π</sub>, due to the absence of the Next-operator in HyperSTL, it might be necessary to add a clock signal s(t) = t to traces in a preprocessing step, not considered here for the sake of avoiding cluttered notation.

Quantifier Elimination. In many practical settings, where the different standard behaviours are spelled out upfront explicitly (as in NEDC and WLTC), it can be assumed that the number of distinct standard behaviours Std is finite (while there are infinitely many possible behaviours in S<sub>STL</sub>). Finiteness of Std makes it possible to remove the quantifiers by enumeration, and opens the way to work with the STL fragment of HyperSTL, after proper adjustments.

Let Std = {w<sub>1</sub>, . . . , w<sub>c</sub>} be an arbitrary standard set with c unique standard traces. We demonstrate the quantifier elimination by letting the placeholder V(π<sub>1</sub>, π<sub>2</sub>) stand for the subformula of formula (3) that follows the second quantifier (starting with ∃π′<sub>1</sub>. …). We can switch the order of the ∀-quantifiers without changing the semantics of the formula, so we are working with ∀π<sub>2</sub>. ∀π<sub>1</sub>. V(π<sub>1</sub>, π<sub>2</sub>). Then, by replacing the second quantifier with the infinite conjunction [23], we get

$$\forall \pi_2.\ \bigwedge_{w \in \mathsf{S}_{\mathsf{STL}}} V(w, \pi_2).$$

The latter can be split into a finite and an infinite conjunction

$$\forall \pi_2.\ \bigwedge_{w \in \mathsf{Std}} V(w, \pi_2) \wedge \bigwedge_{w \in \mathsf{S}_{\mathsf{STL}} \setminus \mathsf{Std}} V(w, \pi_2). \tag{4}$$

Let W(π<sub>1</sub>, π<sub>2</sub>, π′<sub>1</sub>) be the placeholder such that V(π<sub>1</sub>, π<sub>2</sub>) = ∃π′<sub>1</sub>. Std<sub>π<sub>1</sub></sub> > 0 → W(π<sub>1</sub>, π<sub>2</sub>, π′<sub>1</sub>). Unfolding V in the right (infinite) conjunction in formula (4) reveals

$$\bigwedge_{w \in \mathsf{S}_{\mathsf{STL}} \setminus \mathsf{Std}} \exists \pi'_1.\ \mathsf{Std}_w > 0 \to W(w, \pi_2, \pi'_1).$$

It follows directly from the definition of Std<sub>π</sub> that for all w ∉ Std, Std<sub>w</sub> is non-positive. Hence that fragment of the formula is trivially fulfilled, and formula (4) is equivalent to

$$\forall \pi_2.\ \bigwedge_{w \in \mathsf{Std}} V(w, \pi_2).$$

Combined with similar reasoning for the ∃-operator and disjunctions we can altogether rewrite formula (3) into

$$\bigwedge_{w \in \mathsf{Std}} \bigvee_{w' \in \mathsf{Std}} \Big( \mathsf{G}(|\mathsf{i}_w - \mathsf{i}_{w'}| \le 0) \;\land\; \big((d_{\mathsf{Out}}(\mathsf{o}_{w'}, \mathsf{o}) - \kappa_{\mathsf{o}} \le 0) \,\mathsf{W}\, (d_{\mathsf{In}}(\mathsf{i}_{w'}, \mathsf{i}) - \kappa_{\mathsf{i}} > 0)\big) \Big), \tag{5}$$

where the ∀-quantification over π<sub>1</sub> is replaced by the conjunction over standard traces w, the ∃-quantification of π′<sub>1</sub> by the disjunction over standard traces w′, and the remaining ∀-quantification of π<sub>2</sub> is eliminated by rewriting into a trace formula and removing the trace indices from i<sub>π<sub>2</sub></sub> and o<sub>π<sub>2</sub></sub>.

Self-composition in logic. Formula (5) is not yet an STL formula, because the distance function d<sub>In</sub> needs to compare the trace input with inputs of constant traces from the set Std. A popular technique to analyse hyperproperties is self-composition of a system [4, 15]. We use a syntactic variant of parallel self-composition as follows. For a trace width n, we compose the signals of the trace under investigation w = (s<sub>1</sub>, . . . , s<sub>n</sub>) and the signals of each of the, say, c standard traces {(s<sup>1</sup><sub>1</sub>, . . . , s<sup>1</sup><sub>n</sub>), . . . , (s<sup>c</sup><sub>1</sub>, . . . , s<sup>c</sup><sub>n</sub>)} = Std. The composed trace then is of width n + nc, it is w′ = (s<sub>1</sub>, . . . , s<sub>n</sub>, s<sup>1</sup><sub>1</sub>, . . . , s<sup>1</sup><sub>n</sub>, . . . , s<sup>c</sup><sub>1</sub>, . . . , s<sup>c</sup><sub>n</sub>). For the restricted case considered here (one-dimensional input and output signals), w′ = (i, o, i<sub>1</sub>, o<sub>1</sub>, . . . , i<sub>c</sub>, o<sub>c</sub>) is of trace width 2 + 2c. The resulting STL formula for monitoring robust cleanness is

$$\bigwedge_{1 \le a \le c} \bigvee_{1 \le b \le c} \Big( \mathsf{G}(|\mathsf{i}_a - \mathsf{i}_b| \le 0) \;\land\; \big((d_{\mathsf{Out}}(\mathsf{o}_b, \mathsf{o}) - \kappa_{\mathsf{o}} \le 0) \,\mathsf{W}\, (d_{\mathsf{In}}(\mathsf{i}_b, \mathsf{i}) - \kappa_{\mathsf{i}} > 0)\big) \Big). \tag{6}$$

Recall that a discrete time interpretation of such a formula requires all system traces to share the same timing function τ .

Embedding into the mixed-IO model. The STL formula (6) still is bound to inputs and outputs forming pairs synchronized in time. A more realistic scenario is that of inputs and outputs occurring independently of each other. In particular, when testing a real-world CPS, the testing interface can either pass an input to the system under test or receive an output, but not both at the same time. Furthermore, certain tests require passing a series of inputs before receiving an output at all [6]. The mixed-IO model supports such real-world testing scenarios. Mixed-IO signals are always defined in the discrete time domain. A mixed-IO signal s ∈ (In ∪ Out)<sup>ω</sup> (or, equivalently, s : N → In ∪ Out) is similar to a real-valued discrete-time signal but the value domain R is replaced by the domain In ∪ Out. A discrete-time mixed-IO trace w = (s<sub>1</sub>, . . . , s<sub>n</sub>) ∈ ((In ∪ Out)<sup>ω</sup>)<sup>n</sup> is a tuple of n mixed-IO signals. Accordingly, predicates of the form f > 0 must use functions f that produce real values for mixed-IO signals. Formula (6) requires that all traces share the same timing function. For continuous-time signals, we ensure that this condition is met by transforming all traces into traces with a common value frequency (say, 1 Hz) by averaging the values observed in a time unit (of one second). Let w = (s<sub>1</sub>, . . . , s<sub>n</sub>) ∈ (N → In ∪ Out)<sup>n</sup> be a recorded trace with some timing function τ : N → R<sub>≥0</sub> that is sampled with at least one value per time unit, i.e., τ(i + 1) − τ(i) ≤ 1 for all i ∈ N. This trace is condensed to a new trace w′ = (s′<sub>1</sub>, . . . , s′<sub>n</sub>) with timing function τ′(i) = i, and s′<sub>j</sub>(t) := average ⋃<sub>τ(i)∈[t,t+1)</sub> s<sub>j</sub>(i) for 1 ≤ j ≤ n, i.e., each signal s′<sub>j</sub> is piecewise constant: for each unit time interval [t, t + 1) the signal value is set to the average of all signal values originally recorded in that unit time interval.
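The condensing step can be sketched for a single signal as follows, assuming real-valued samples and a list of time stamps that plays the role of the timing function τ. The function name `condense` and the error raised for an empty unit interval are choices made for this illustration.

```python
import math
from collections import defaultdict

def condense(samples, timestamps):
    """Average the samples recorded in each unit interval [t, t + 1).

    samples[i] was recorded at time timestamps[i]. The result has one value
    per time unit, i.e. it follows the timing function tau'(i) = i.
    """
    if not samples:
        return []
    buckets = defaultdict(list)
    for value, tau in zip(samples, timestamps):
        buckets[math.floor(tau)].append(value)
    condensed = []
    for t in range(max(buckets) + 1):
        if not buckets[t]:
            # The text assumes at least one recorded value per time unit.
            raise ValueError(f"no sample recorded in [{t}, {t + 1})")
        condensed.append(sum(buckets[t]) / len(buckets[t]))
    return condensed
```

For instance, samples 10, 20, 30 and 50 recorded at times 0.0, 0.5, 1.2 and 2.9 are condensed to one value for each of the intervals [0, 1), [1, 2) and [2, 3).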

For adjusting formula (6), let Std = {s<sub>1</sub>, . . . , s<sub>c</sub>} ⊆ (In ∪ Out)<sup>ω</sup> be a set of standard traces, each in the form of a single mixed-IO signal. Following the syntactic self-composition idea from above, the composition of a trace w under investigation with Std is the trace w′ = (w, s<sub>1</sub>, . . . , s<sub>c</sub>) ∈ ((In ∪ Out)<sup>ω</sup>)<sup>c+1</sup>. This needs two subtle adjustments of the formula. First, the distances d<sub>In</sub> and d<sub>Out</sub> are replaced by their mixed-IO counterparts d̄<sub>In</sub> and d̄<sub>Out</sub>, and instead of directly accessing inputs and outputs, the current value is projected to the input and output domain, respectively. Second, since the sets In and Out are opaque, the expression |i<sub>a</sub> − i<sub>b</sub>| can no longer be evaluated; it is replaced by the distance d̄ based on d̄<sub>In</sub> and d̄<sub>Out</sub>. The resulting formula is

$$\begin{aligned}
&\bigwedge_{1 \le a \le c} \bigvee_{1 \le b \le c} \Big( \mathsf{G}\big(\bar{d}(s_a{\downarrow_i}, s_b{\downarrow_i}) \le 0\big) \;\wedge \\
&\qquad \big( (\bar{d}_{\mathsf{Out}}(s_b{\downarrow_o}, s{\downarrow_o}) - \kappa_o \le 0) \;\mathsf{W}\; (\bar{d}_{\mathsf{In}}(s_b{\downarrow_i}, s{\downarrow_i}) - \kappa_i > 0) \big) \Big),
\end{aligned} \tag{7}$$

where $\bar{d}$ is defined for some $\varepsilon > 0$ as

$$
\bar{d}(x,y) := \begin{cases}
0, & \text{if } x = y \\
\bar{d}_{\mathsf{In}}(x,y) + \varepsilon, & \text{if } x \neq y \land x, y \in \mathsf{In} \\
\bar{d}_{\mathsf{Out}}(x,y) + \varepsilon, & \text{if } x \neq y \land x, y \in \mathsf{Out} \\
\infty, & \text{otherwise}.
\end{cases}
$$

In the second and third clause of the above definition we add some positive value $\varepsilon$ to the result of $\bar{d}_{\mathsf{In}}$ and $\bar{d}_{\mathsf{Out}}$, because they are pseudometrics, and $\bar{d}_{\mathsf{In}}(i_1, i_2)$ could be 0 even if $i_1 \neq i_2$. For the correctness of formula (7), however, it is crucial that $\bar{d}(x, y) = 0$ if and only if $x = y$. For a good performance of the falsification algorithm, we will nevertheless want to make use of $\bar{d}_{\mathsf{In}}$ and $\bar{d}_{\mathsf{Out}}$ if $i_1 \neq i_2$. We remark that $\bar{d}$ is not a metric, because the triangle inequality requirement is now violated.
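The role of $\varepsilon$ can be illustrated with a small Python sketch. The encoding of values as tagged pairs `('i', v)` and `('o', v)`, the default value of `eps`, and the toy pseudometric `coarse` are assumptions made for this illustration only.

```python
import math

def d_bar(x, y, d_in, d_out, eps=1e-6):
    """Distance that is 0 exactly when x == y.

    x, y  -- tagged values: ('i', v) for inputs, ('o', v) for outputs (assumed encoding)
    d_in  -- pseudometric on input values
    d_out -- pseudometric on output values
    eps   -- some fixed positive constant
    """
    if x == y:
        return 0.0
    if x[0] == "i" and y[0] == "i":
        return d_in(x[1], y[1]) + eps
    if x[0] == "o" and y[0] == "o":
        return d_out(x[1], y[1]) + eps
    return math.inf  # an input can never be matched with an output

def coarse(a, b):
    """Toy pseudometric: ignores the fractional part, so distinct values may have distance 0."""
    return abs(math.floor(a) - math.floor(b))
```

With `coarse`, the inputs 1.2 and 1.7 have pseudometric distance 0, yet `d_bar` still reports a strictly positive value for them, which is what the subformula $\mathsf{G}(\bar{d}(\cdot,\cdot) \le 0)$ relies on.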

The discussion above has assembled all the details to formally back the following theorem, stating that a system satisfies formula (7) if and only if it satisfies the second condition of robust cleanness in Definition 2.

Theorem 1. Let $C = \langle \mathsf{Std}, \bar{d}_{\mathsf{In}}, \bar{d}_{\mathsf{Out}}, \kappa_i, \kappa_o \rangle$ be a contract for some system $S_{\mathsf{IO}} \subseteq (\mathsf{In} \cup \mathsf{Out})^{\omega}$ with $\mathsf{Std} = \{\sigma_1, \ldots, \sigma_c\} \subseteq S_{\mathsf{IO}}$, and let $\varphi$ denote formula (7). Then, for all $\sigma' \in S_{\mathsf{IO}}$, it holds that $(\sigma', \sigma_1, \ldots, \sigma_c), 0 \models \varphi$ if and only if for all $\sigma \in \mathsf{Std}$ and $k \ge 0$ such that $\bar{d}_{\mathsf{In}}(\sigma[j]{\downarrow_i}, \sigma'[j]{\downarrow_i}) \le \kappa_i$ holds for all $j \le k$, there exists $\sigma'' \in \mathsf{Std}$ such that $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$ and $\bar{d}_{\mathsf{Out}}(\sigma'[k]{\downarrow_o}, \sigma''[k]{\downarrow_o}) \le \kappa_o$.

Example 2. We consider $C = \langle \mathsf{Std}, \bar{d}_{\mathsf{In}}, \bar{d}_{\mathsf{Out}}, \kappa_i, \kappa_o \rangle$ where $\mathsf{Std} = \{w_1, w_2\}$ contains the two standard traces $w_1 = 1_i\, 2_i\, 3_i\, 7_o\, 0_i\, \delta^{\omega}$ and $w_2 = 0_i\, 1_i\, 2_i\, 3_i\, 6_o\, \delta^{\omega}$. Here we decorate inputs with index i and outputs with index o, i.e., $w_1$ describes a system receiving the three inputs 1, 2, and 3, then producing the output 7, and finally receiving input 0 before entering quiescence. We take

$$
\bar{d}_{\mathsf{Out}}(o_1, o_2) = \begin{cases}
|o_1 - o_2|, & \text{if } o_1, o_2 \in \mathbb{R} \\
0, & \text{if } o_1 = o_2 = -_o \text{ or } o_1 = o_2 = \delta \\
\infty, & \text{otherwise},
\end{cases}
$$

and similarly for $\bar{d}_{\mathsf{In}}$. The contractual value thresholds are assumed to be $\kappa_i = 1$ and $\kappa_o = 6$.

Assume we are observing the trace $w = 0_i\, 1_i\, 2_i\, 6_o\, 0_i\, \delta^{\omega}$, to be monitored with formula (7). First notice that for combinations of $a$ and $b$ in (7) where $a \neq b$, the subformula $\mathsf{G}(\bar{d}(s_a{\downarrow_i}, s_b{\downarrow_i}) \le 0)$ is always false, because $s_1$ and $s_2$ (i.e., the combination of $w_1$ and $w_2$) have different values at time point 0. Hence, it remains to show that

$$\begin{aligned}
&(\bar{d}_{\mathsf{Out}}(w_1{\downarrow_o}, w{\downarrow_o}) - \kappa_o \le 0) \;\mathsf{W}\; (\bar{d}_{\mathsf{In}}(w_1{\downarrow_i}, w{\downarrow_i}) - \kappa_i > 0) \;\land \\
&(\bar{d}_{\mathsf{Out}}(w_2{\downarrow_o}, w{\downarrow_o}) - \kappa_o \le 0) \;\mathsf{W}\; (\bar{d}_{\mathsf{In}}(w_2{\downarrow_i}, w{\downarrow_i}) - \kappa_i > 0).
\end{aligned}$$

For the first part, the distance between the inputs in $w$ and $w_1$ is always 1 at positions 1 to 3; it is 0 at position 4 (because $-_i$ is compared to $-_i$) and at position 5 and beyond. Thus, $\bar{d}_{\mathsf{In}}(w_1{\downarrow_i}, w{\downarrow_i}) - \kappa_i$ is always at most 0, and the right-hand side of the W operator is always false. Consequently, by definition of W, the left operand of W must always hold, i.e., $\bar{d}_{\mathsf{Out}}(w_1{\downarrow_o}, w{\downarrow_o})$ must always be less than or equal to 6. This is the case for $w_1$ and $w$: at all positions except 4, $-_o$ is compared to $-_o$ (or $\delta$ to $\delta$), so the difference is 0, and at position 4, the distance between 6 and 7 is 1.

For the second W-formula, $w$ is compared to $w_2$. These two traces are comparable only to a limited extent: the order of input and output is swapped at the last two positions of the signals before quiescence. Hence, the right operand of W is true at position 4, and the formula holds for the remaining trace. For positions 1 to 3, the input distances are 0, because the input values are identical. At these positions, the left operand must hold. The values are input values, so $-_o$ is compared to $-_o$ at each position. This distance is defined to be 0, so it holds that $-6 \le 0$, and the formula is satisfied. Since both formulas hold, their conjunction holds, too, and trace $w$ is qualified as robustly clean. There could, however, be other system traces not considered in this example that could violate robust cleanness of the system.
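The two W-formulas of Example 2 can be replayed mechanically. The following Python sketch is an illustration under stated assumptions, not the authors' monitor: trace values are encoded as tagged pairs, the projections map a non-matching value to a placeholder (`NO_IN`, `NO_OUT`), quiescence stays visible in the output projection, and the infinite $\delta^{\omega}$ tail is represented by a single quiescent step because both operands are constant on it.

```python
import math

NO_IN, NO_OUT, DELTA = "-i", "-o", "delta"   # placeholder symbols (assumed names)

def proj_in(x):
    """Project a trace value to the input domain."""
    return x[1] if isinstance(x, tuple) and x[0] == "i" else NO_IN

def proj_out(x):
    """Project a trace value to the output domain; quiescence stays visible."""
    if x == DELTA:
        return DELTA
    return x[1] if isinstance(x, tuple) and x[0] == "o" else NO_OUT

def d_in(a, b):
    if a == NO_IN or b == NO_IN:
        return 0 if a == b else math.inf
    return abs(a - b)

def d_out(a, b):
    special = (NO_OUT, DELTA)
    if a in special or b in special:
        return 0 if a == b else math.inf
    return abs(a - b)

def holds_weak_until(std, obs, kappa_i, kappa_o):
    """Check (d_out - kappa_o <= 0) W (d_in - kappa_i > 0) at position 0."""
    n = max(len(std), len(obs)) + 1           # one extra step stands for the delta tail
    pad = lambda t: t + [DELTA] * (n - len(t))
    for s, w in zip(pad(std), pad(obs)):
        if d_in(proj_in(s), proj_in(w)) - kappa_i > 0:
            return True                        # right operand holds: obligation released
        if d_out(proj_out(s), proj_out(w)) - kappa_o > 0:
            return False                       # left operand violated before release
    return True                                # left operand holds forever

W1 = [("i", 1), ("i", 2), ("i", 3), ("o", 7), ("i", 0)]
W2 = [("i", 0), ("i", 1), ("i", 2), ("i", 3), ("o", 6)]
W = [("i", 0), ("i", 1), ("i", 2), ("o", 6), ("i", 0)]

clean = holds_weak_until(W1, W, 1, 6) and holds_weak_until(W2, W, 1, 6)
```

Under these assumptions `clean` evaluates to true, matching the manual argument above; replacing the output 6 in `W` by, say, 20 makes the first conjunct fail.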

Restriction of input space. Robust cleanness puts semantic requirements on fragments of a system's input space, outside of which the system's behaviour remains unspecified. Typically, the fragment of the input space covered is rather small. To falsify the STL formula (7), the falsifier has two challenging tasks. First, it has to find a way to stay in the relevant input space, i.e., select inputs with a distance of at most $\kappa_i$ from the standard behaviour. Second, and only if this is assured, it can search for an output large enough to violate the $\kappa_o$ requirement. Here, a large robustness estimate provided by the quantitative semantics of STL cannot serve as an indicator for deciding whether an input is too far off or whether an output stays too close to the standard behaviour.

The general strength of the falsification technique is its proven ability to discover outputs of a black-box system violating a property. That is why the technique is considered suitable for real-world robust cleanness tests. We can improve its efficiency significantly by narrowing upfront the input space the falsifier uses.

In practice, test execution traces will always be finite. In previous real-life doping tests, test execution lengths have been bounded by some constant $B \in \mathbb{N}$ [6], i.e., systems are represented as sets of finite traces $S \subseteq (\mathsf{In} \cup \mathsf{Out})^{B}$ (each of which can, for formality reasons, be considered suffixed with $\delta^{\omega}$). In this bounded horizon, we can provide a predicate discriminating between relevant and irrelevant input sequences. Formally, the restriction to the relevant input space fragment of a system $S \subseteq (\mathsf{In} \cup \mathsf{Out})^{B}$ is given by the set $\mathsf{In}_{\mathsf{Std},\kappa_i} = \{ w \in S \mid \exists w' \in \mathsf{Std}.\ \bigwedge_{k=0}^{B-1} (\bar{d}_{\mathsf{In}}(w(k){\downarrow_i}, w'(k){\downarrow_i}) \le \kappa_i) \}$. Since Std and B are finite, membership is computable.
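Membership in the restricted input space amounts to a bounded existential check over the standard traces. A minimal Python sketch, assuming that traces are already projected to real-valued inputs and that the distance is the absolute difference (the function name and the sample numbers are hypothetical):

```python
def in_restricted_input_space(trace, standards, kappa_i, bound):
    """True if some standard trace stays within kappa_i of `trace` at every step < bound.

    trace     -- sequence of real-valued inputs of length >= bound
    standards -- finite collection of standard input sequences of length >= bound
    """
    return any(
        all(abs(trace[k] - std[k]) <= kappa_i for k in range(bound))
        for std in standards
    )

# Hypothetical three-step standard speed profile and a tolerance of 15.
standard_profiles = [[0.0, 10.0, 20.0]]
near = in_restricted_input_space([5.0, 20.0, 30.0], standard_profiles, 15.0, 3)
far = in_restricted_input_space([0.0, 30.0, 20.0], standard_profiles, 15.0, 3)
```

The second candidate deviates by 20 at the middle step and is therefore rejected, so a falsifier that samples only accepted candidates never leaves the relevant fragment.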

There are rare cases in which this optimisation may prevent the falsifier from finding a counterexample. This is only the case if there is an input prefix leading to a violation of the formula for which there is no suffix such that the whole trace satisfies the $\kappa_i$ constraint. Below is a pathological example in which this could make a difference.

Example 3. Apart from NO<sub>x</sub> emissions, NEDC (and WLTC) tests are used to measure fuel consumption. Consider a contract similar to the contracts above, but with fuel rate as the output quantity. Assuming a "normal" fuel rate behaviour during the standard test, there might be a test within a reasonable $\kappa_i$ distance in which an excessive amount of fuel is wasted. Then, the fuel tank might run empty before the intended end of the test, which therefore could not be finished within the $\kappa_i$ distance, because the speed would be constantly 0 at the end. The actually driven test is not in the set $\mathsf{In}_{\mathsf{Std},\kappa_i}$, but there is a prefix within $\kappa_i$ distance that violates the robust cleanness property.

#### 4 Diesel Emissions

This section discusses how to tailor the generic probabilistic falsification approach for STL based on Algorithm 1 to the particular case of diesel emissions, and reports on empirical observations when putting the approach into practice.

Robustness. In the case of diesel emissions doping, the only standard behaviour is either the NEDC or the WLTC. Assuming, for example, NEDC, let $C = \langle \{\mathsf{NEDC} \circ o\}, \kappa_i, \kappa_o, d_{\mathsf{In}}, d_{\mathsf{Out}} \rangle$ be a diesel emissions specific contract, where NEDC is the sequence of 1180 inputs with the $k$th input defining the speed of the car $k$ seconds after the beginning of the test. Here, the output $o$ suffixed to NEDC is the (average) amount of NO<sub>x</sub> emitted during the NEDC drive. By restricting the input space to $\mathsf{In}_{\{\mathsf{NEDC} \circ o\},\kappa_i}$ as explained in Section 3, formula (7) can be simplified to

$$\mathsf{G}(\bar{d}_{\mathsf{Out}}((\mathsf{NEDC} \circ o){\downarrow_o}, s{\downarrow_o}) - \kappa_o \le 0). \tag{8}$$

This is because the conjunction and disjunction over standard traces become obsolete when there is only a single standard trace. For the same reason, the requirement $\mathsf{G}(\bar{d}(s_a{\downarrow_i}, s_b{\downarrow_i}) \le 0)$ becomes obsolete, as the compared traces are always identical. In the W subformula, the right proposition is always false because of the restricted input space: the proposition collapses to $\bar{d}_{\mathsf{In}}((\mathsf{NEDC} \circ o){\downarrow_i}, s{\downarrow_i}) - \kappa_i > 0$, and the input domain $\mathsf{In}_{\{\mathsf{NEDC} \circ o\},\kappa_i}$ is $\{ s \in S \mid \forall k \in [0, 1180].\ \bar{d}_{\mathsf{In}}(s(k){\downarrow_i}, (\mathsf{NEDC} \circ o)(k){\downarrow_i}) \le \kappa_i \}$. Thus, by the definition of W and U, the W subformula is equivalent to formula (8). We implemented Algorithm 1 for the robustness computation according to formula (8).

Emissions Approximation. In practice, running tests like NEDC with real cars is a time-consuming and expensive endeavour. Furthermore, rental companies usually prohibit carrying out tests on chassis dynamometers with rented cars. On the other hand, car emission models for simulation are not available to the public, and models provided by the manufacturer cannot be considered trustworthy. To carry out our experiments, we instead use an approximation technique that estimates the amount of NO<sub>x</sub> emissions of a car along a certain trajectory based on data recorded during previous trips with the same car, sampled at a frequency of 1 Hz (one sample per second). Notably, these trips do not need to have much in common with the trajectory to be approximated. A trip is represented as a finite sequence $T \in (\mathbb{R} \times \mathbb{R} \times \mathbb{R})^{*}$ of triples, where each such triple $(v, a, n)$ represents the speed, the acceleration and the absolute amount of NO<sub>x</sub> emitted at a particular time instant in the sample. Speed and acceleration can be considered the main parameters influencing the instant emission of NO<sub>x</sub>. This is, for instance, reflected in the RDE regulation [16, 24], where the decisive quantities to validate the test route and driving behaviour during RDE tests are speed and acceleration.

A recording $\mathcal{D}$ is the union of finitely many trips $T$. We can turn such a recording into a predictor $\mathcal{P}$ of the NO<sub>x</sub> values given pairs of speed and acceleration as follows:

$$\mathcal{P}(v, a) = \mathsf{average}\,[\, n \mid \exists v', a'.\ (|v - v'| \le 2 \land |a - a'| \le 2 \land (v', a', n) \in \mathcal{D}) \,].$$

The amount of NO<sub>x</sub> assigned to a pair $(v, a)$ here is the average of all NO<sub>x</sub> values seen in the recording $\mathcal{D}$ for $v \pm \ell$ and $a \pm \ell$, with $0 \le \ell \le 2$. To overcome measurement inaccuracies and to increase the robustness of the approximated emissions, the speed and acceleration may deviate by up to 2 km/h and 2 m/s², respectively. This tolerance is adopted from the official NEDC regulation [26], which allows deviations of up to 2 km/h while driving the NEDC.
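The predictor can be sketched as a neighbourhood average over the recorded triples. The Python code below is an illustration only: the function names, the treatment of the recording as a list (so that repeated samples count multiple times), the `None` result for uncovered regions, and the sample data are assumptions.

```python
def make_predictor(recording, tol=2.0):
    """Build P(v, a) from a recording of (speed, acceleration, NOx) triples.

    recording -- list of (v, a, n) samples from previous trips
    tol       -- allowed deviation in speed and in acceleration (2 in the text)
    """
    def predict(v, a):
        matches = [n for (v2, a2, n) in recording
                   if abs(v - v2) <= tol and abs(a - a2) <= tol]
        # No recorded sample nearby: the predictor is undefined for this pair.
        return sum(matches) / len(matches) if matches else None
    return predict

# Hypothetical recording with three samples.
predict = make_predictor([(50.0, 0.5, 10.0), (51.0, 1.0, 20.0), (90.0, 0.0, 100.0)])
estimate = predict(50.5, 0.7)   # averages the first two samples
```

How pairs without any nearby sample are handled is not specified in the text; returning `None` here merely makes the gap explicit.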

Experiment setup. To demonstrate the practical applicability of our implementation of Algorithm 1 and our NO<sub>x</sub> approximation, we report here on two experiments. For the first experiment, we use recordings from Biewer et al. [8]. They used the app LolaDrives to perform low-cost RDE tests and recorded the data received from a car's diagnosis port. Using the two RDE recordings that appear in their work, the above predictor can be used to estimate the NO<sub>x</sub> emission during NEDC to be 86 mg/km. Their car was an Audi A6 Avant Diesel, admitted in June 2020. We rented the successor of this car model, admitted in 2021, and recorded three low-cost RDE trips with the help of LolaDrives. The new version of this car turned out to have a significantly better emission cleaning system: the estimated amount of NO<sub>x</sub> emitted during the NEDC is 9 mg/km. In the sequel, we will refer to the first car as A20 and to the second as A21. Car A20 has previously been falsified w.r.t. the RDE specification. Neither A20 nor A21 has been falsified w.r.t. robust cleanness.

Contracts. Before turning to falsification, we need to spell out meaningful contracts. The input domain $\mathsf{In} \subseteq \mathbb{R}^{*}$ is the set of finite speed trajectories, and the set $\mathsf{Out} \subseteq \mathbb{R}$ represents the average amount of NO<sub>x</sub> emitted during the test. $d_{\mathsf{In}}$ must be past-forgetful, hence only the last speed value in each trace must be considered. A natural distance function for inputs is $d_{\mathsf{In}}(v_1, v_2) = |v_1 - v_2|$. Similarly, a measurement for the distance of outputs is $d_{\mathsf{Out}}(o_1, o_2) = |o_1 - o_2|$. Adding the necessary technicalities for the mixed-IO setting, we get $\bar{d}_{\mathsf{In}}$ and $\bar{d}_{\mathsf{Out}}$ as defined in Example 2. For $\kappa_i$, it turned out that $\kappa_i = 15$ km/h is a reasonable choice, as it leaves enough flexibility for human-caused driving mistakes and intended deviations [6]. The threshold for NO<sub>x</sub> emissions under lab conditions is 80 mg/km. The emission limits for RDE tests depend on the admission date of the car. Cars admitted in 2020 or earlier must emit 168 mg/km at most, and cars admitted later must adhere to the limit of 120 mg/km. For our experiments, we use $\kappa_o = 88$ mg/km for A20 and $\kappa_o = 40$ mg/km for A21 to have the same tolerances as for RDE tests. Effectively, the upper threshold for A20 is 86 + 88 = 174 mg/km, and for A21 the limit is 9 + 40 = 49 mg/km. Notice that for software doping analysis, the output observed for a certain standard behaviour and the constant $\kappa_o$ define the effective threshold; this threshold is typically different from the threshold defined by the regulation.

Fig. 3: NEDC speed profile (blue, dashed) and input falsifying $C$ for $\kappa_o = 88$ mg/km (red) with 182 mg/km of emitted NO<sub>x</sub>.

Evaluation. We modified Algorithm 1 by adding a timeout condition, i.e., if the algorithm is not able to find a falsifying counterexample within 3,000 iterations, it terminates and returns both the trace for which the smallest robustness has been observed and its corresponding robustness value. Hence, if falsification of robust cleanness for a system is not possible, the algorithm outputs an upper bound on how robustly the system satisfies robust cleanness.

For the concrete case of the diesel emissions, the robustness value during the first 1180 inputs (sampled from the restricted input space $\mathsf{In}_{\mathsf{Std},\kappa_i}$) is always $\kappa_o$. When the NEDC output $o_{\mathsf{NEDC}}$ and the non-standard output $o$ are compared, the robustness value is $\kappa_o - |o_{\mathsf{NEDC}} - o|$ (cf. eq. (8), the quantitative semantics of STL, and the definition of $\bar{d}_{\mathsf{Out}}$). Hence, for test cycles with small robustness values, we get NO<sub>x</sub> emissions $o$ that are either very small or very large compared to $o_{\mathsf{NEDC}}$. We ran the modified Algorithm 1 on A20 and A21 for the contracts defined above. For A20, it found a robustness value of −8, i.e., it was able to falsify robust cleanness relative to the assumed contract and found a test cycle for which NO<sub>x</sub> emissions of 182 mg/km are predicted. The test cycle is shown in Fig. 3. For A21, the smallest robustness estimate found, even after 100 independent executions of the algorithm, was 38, i.e., A21 is predicted to satisfy robust cleanness with a very high robustness upper bound. The corresponding test cycle is shown in Fig. 4.

Fig. 4: NEDC speed profile (blue, dashed) and input maximising NO<sub>x</sub> emissions to 11 mg/km (red).
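The reported robustness values follow directly from the expression $\kappa_o - |o_{\mathsf{NEDC}} - o|$. The short Python check below uses only numbers stated in this section; the function name is hypothetical.

```python
def robustness(kappa_o, o_nedc, o_test):
    """Robustness of the output comparison: negative means robust cleanness is falsified."""
    return kappa_o - abs(o_nedc - o_test)

# A20: estimated NEDC emissions 86 mg/km, kappa_o = 88 mg/km, falsifying cycle 182 mg/km.
rob_a20 = robustness(88, 86, 182)
# A21: estimated NEDC emissions 9 mg/km, kappa_o = 40 mg/km, worst cycle found 11 mg/km.
rob_a21 = robustness(40, 9, 11)
```

The first value is −8 and the second is 38, in line with the figures reported for A20 and A21.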

#### 5 Conclusion & Future Work

This paper marks an important milestone in making software doping tests of real-world CPS practically feasible. Regarding test execution effort, real-world testing of CPS is not scalable; the number of tests realistically executable is usually very limited. Probabilistic falsification has its strength in repetitive testing of a system model in a strategic way. We improved this approach by embedding it into a very natural problem solving strategy: Patiently observing the system in-the-wild for the purpose of eventually conducting a small set of doping tests in an even more strategic way. With this paper, we have laid the formal foundations, and we have carved out the aspects that dominate practical applicability. For the latter we focussed on the automotive emissions context. In that context, we are currently spending considerable effort on the acquisition of more high-quality training data. We are building a car data platform (CDP) as a central place for automotive data, which, most importantly, includes the app LolaDrives for convenient recording, uploading and crowd-sourcing of data. With increasing amounts of data collected we hope to be able to roll out predictions that are more and more precise. Finally, we will extend the approach to broader application contexts, to make software doping tests available across the wider CPS domain.

#### References


Programming Languages and Systems - 26th European Symposium on Programming, ESOP 2017, Proceedings. LNCS, vol. 10201, pp. 83–110. Springer (2017). https://doi.org/10.1007/978-3-662-54434-1_4



## Estimating Worst-case Resource Usage by Resource-usage-aware Fuzzing<sup>⋆</sup>

Liqian Chen<sup>1</sup> (✉), Renjie Huang<sup>1,2</sup>, Dan Luo<sup>1</sup>, Chenghu Ma<sup>1,2</sup>, Dengping Wei<sup>1</sup> (✉), and Ji Wang<sup>1,2</sup>

<sup>1</sup> College of Computer Science, National University of Defense Technology, Changsha, China

{lqchen,renjiehuang,luodan,machenghu,dpwei,wj}@nudt.edu.cn

<sup>2</sup> State Key Laboratory of High Performance Computing, Changsha, China

Abstract. Worst-case resource usage provides useful guidance in the design, configuration and deployment of software, especially when it runs in a context with a limited amount of resources. Static resource-bound analysis can provide sound upper bounds on worst-case resource usage but may yield overly conservative, or even unbounded, results. In this paper, we present a resource-usage-aware fuzzing approach to estimate worst-case resource usage. The key idea is to guide the fuzzing process using the resource-usage amount together with resource-usage relevant coverage. Moreover, we leverage semantic patches to make use of static analysis information (including control-flow, function-call, etc.) to instrument the original program, for the sake of aiding the subsequent fuzzing. We have conducted experiments to estimate the worst-case usage of various resources in real-world programs, including heap memory, stack depths, sockets, user-defined resources, etc. The preliminary experimental results show the promising ability of our approach in estimating worst-case resource usage in real-world programs, compared with two state-of-the-art fuzzing tools (AFL and MemLock).

Keywords: Fuzzing · Resource Usage · Static Analysis

#### 1 Introduction

Resources refer to any abstractions offered to a process by system calls, apart from the process itself. Typical resources in practice include heap/stack memory, sockets, file descriptors, threads, database connections, gas consumed in Solidity smart contracts, etc. In addition, there exist a variety of user-defined application-dependent resources in applications, such as buffers, memory pools, number of licenses consumed, etc. Worst-case resource usage provides useful guidance in the design, configuration and deployment of software, especially when the software runs with a limited amount of resources, e.g., in the context of modern cyber-physical systems, mobile systems and IoT devices. Unexpected or uncontrolled resource usage may degrade program performance, or even lead to CWE (Common Weakness Enumeration) vulnerabilities (such as uncontrolled-resource-consumption, file-descriptor-exhaustion, etc.).

<sup>⋆</sup> This work is supported by the National Key R&D Program of China (No. 2017YFB1001802), and the NSFC (Nos. 61872445, 62032024).

Static resource-bound analysis can provide sound upper bounds on worst-case resource usage but may yield overly conservative, or even unbounded, results. Moreover, most existing static resource-bound analysis techniques [1, 2, 4, 5, 8, 9, 14] focus on deriving an upper bound on the number of accesses to a given control location, or simply a bound on the iterations of a loop (or recursion). The programs under analysis are often small-scale, and complex syntactic constructs are usually abstracted away for simplicity.

In real-world programs, resources are often manipulated via specific APIs which may involve complex structures. Moreover, the usage amount of resources often depends not only on the parameters of these APIs, but also on the running system environment. For example, considering malloc(n) in C programs, its actual allocation amount of heap memory depends on the running environment (due to factors such as alignment, the current first available free slot, etc.) and is to some extent nondeterministic before execution. The allocation may fail or may allocate memory with a size larger than n (e.g., due to alignment). In such cases, dynamic analysis methods are highly desirable.

In this paper, we present a resource-usage-aware fuzzing approach to estimate worst-case resource usage. We use the resource-usage amount together with resource-usage relevant coverage to guide the fuzzing process, so as to generate inputs triggering large resource-usage amounts. More precisely, we use a different definition of branch coverage and additionally use the resource-usage amount to guide the fuzzing process. Moreover, we also leverage semantic patches [11] to make use of static analysis information (including control-flow, function-call, etc.) to instrument the original program. Such information is helpful in aiding the subsequent fuzzing during runtime. We have conducted experiments to estimate the worst-case usage of various resources in real-world programs, including heap memory, stack depths, sockets, user-defined resources, etc. Preliminary experimental results show the promising ability of our approach in estimating worst-case resource usage in real-world programs, compared with two state-of-the-art fuzzing tools (AFL and MemLock).

#### 2 Approach

In this section, we describe the basic process of our approach (shown in Fig. 1).

#### 2.1 Static analysis and instrumentation

For the target program, we first identify all program locations (i.e., program points) of the calls to resource-usage operations in the program. Such resource-usage operations can be APIs provided by systems or libraries, as well as application programmer-defined APIs. From the point of view of increasing/decreasing the resource-usage amount, all operations changing resource usage can essentially be reduced to allocation (i.e., increasing) and deallocation (i.e., decreasing) operations. To this end, we define two basic modeling functions, RAlloc(int n) and RDealloc(int n), which model the allocation and the deallocation of n units of a resource, respectively.

Fig. 1. Workflow of resource-usage-aware fuzzing


We will instrument invocations of these two basic modeling functions to explicitly model the resource usage for each resource-usage operation in the original program, according to its semantics. For example, to model pFile = fopen(...), we will instrument (afterwards) RAlloc(pFile != NULL ? 1 : 0). To model free(p), we will instrument (beforehand) RDealloc(malloc_usable_size(p)), wherein the malloc_usable_size(p) function (which is a C library function) returns the number of usable bytes in the block pointed to by p. To model the change of call-stack depths, we instrument RAlloc(1) and RDealloc(1) at the entry and the exit (before the return statement) of each function, respectively. Note that in each run of resource-usage fuzzing, we consider only one type of resource. The fuzzing engine will track the invocations of RAlloc(int n) and RDealloc(int n) and capture their parameters to maintain the current amount and the historical peak amount of resource usage at runtime.

On the other hand, many functions and basic blocks in the program are useful for implementing functionality of the program but not relevant to resource usage. Based on this insight, we propose to guide the fuzzing process to cover functions and basic blocks that are relevant to resource usage.

– First, we make use of the call graph of the target program to identify the list of all functions that directly or indirectly invoke resource-usage operations.<sup>3</sup> Then we instrument the coverage-label function covl() before the invocations of these functions. We use covl() to identify basic blocks that involve resource usage, which will be further used to define resource-usage-aware coverage.

– Second, for each program block containing invocations of resource-usage modeling functions (i.e., RAlloc(), RDealloc()), the label function covl() or the exit function exit() (as well as similar functions such as raise()), we instrument the label function covl() before the control-flow branch in which this block is located (e.g., in the then branch) and also at the beginning of the block in the other branch (e.g., the else branch). We conduct instrumentations of covl() in a bottom-up manner, i.e., from inner to outer blocks.

<sup>3</sup> Specifically, to track stack depth, we first collect a set FSet of functions that directly or indirectly call recursive functions. For other functions, we calculate for each function the depth from the main() function to that function according to the call graph, and add to FSet the top-K percent (e.g., top 30%) of functions with large depths.
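The first step, finding all functions that directly or indirectly invoke a resource-usage operation, is a backward reachability query on the call graph. The Python sketch below illustrates the idea under the assumption that the call graph is available as a dictionary from callers to callees; the function and variable names are hypothetical and do not reflect the tooling actually used.

```python
from collections import deque

def functions_reaching(call_graph, resource_ops):
    """Return all functions that directly or indirectly call a function in resource_ops.

    call_graph   -- dict mapping each function name to the names it calls
    resource_ops -- set of resource-usage operations (e.g. {"malloc"})
    """
    # Invert the graph so that we can walk from callees back to callers.
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    relevant, queue = set(), deque(resource_ops)
    while queue:
        func = queue.popleft()
        for caller in callers.get(func, ()):
            if caller not in relevant:
                relevant.add(caller)
                queue.append(caller)
    return relevant

# Hypothetical call graph: only the path main -> parse -> read_block touches the heap.
graph = {"main": ["parse", "log"], "parse": ["read_block"],
         "read_block": ["malloc"], "log": ["printf"]}
heap_users = functions_reaching(graph, {"malloc"})
```

In this toy graph, `log` is excluded because none of its callees allocates heap memory, so no coverage label would be placed before calls to it.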

We leverage the program transformation tool Coccinelle [12] to automatically instrument statements invoking resource-usage modeling functions as well as the coverage-label function covl() into the original program. Coccinelle is a program matching and transformation engine which allows us to write so-called semantic patches [11] for specifying desired code matches and transformations. In particular, the transformation engine of Coccinelle is defined in terms of control flow, and is thus well suited to instrumenting coverage-label functions in the control-flow branches where resource usage is located.

Fig. 2. Example illustration

Example illustration. Fig. 2 illustrates the above process via an example (named libtirpc_slice) extracted from an old version of libtirpc (a Transport-Independent RPC library for Linux) which contains a known CVE vulnerability<sup>4</sup>. The cause of this CVE is that the return value of makefd_xprt() was not checked in all instances, which could lead to a crash when the server exhausted the maximum number of available file descriptors. Fig. 2(a) shows the slice extracted from the original code of libtirpc. Fig. 2(b) shows part of the semantic patch applied for instrumentation. The instrumented program is shown in Fig. 2(c). This program consumes socket connections, e.g., by calling accept() as shown on Line 13 in Fig. 2(a). We use the semantic patch shown in Fig. 2(b) to instrument the resource-usage modeling function RAlloc(1) as well as the coverage-label function covl() at the program location where a connection is established successfully. The instrumented code is highlighted in Fig. 2(c).

<sup>4</sup> https://ubuntu.com/security/CVE-2018-14622

#### 2.2 Fuzzing loop


Algorithm 1

Require: an instrumented program P, and a set of initial seeds I<sub>0</sub>

Ensure: (max_res, BuggyS) where max_res is the largest resource usage amount found, and BuggyS is a set of test cases triggering resource-usage bugs

```
1: max_res ← 0
2: BuggyS ← ∅
3: SeedQueue ← I0
4: while time not expired do
5:   s ← select(SeedQueue)
6:   s′ ← mutate(s)
7:   trace ← execute(s′)
8:   n_res ← resPeak(trace)
9:   if n_res > max_res then
10:    max_res ← n_res
11:    SeedQueue ← SeedQueue ∪ {s′}
12:  else
13:    if find_new_path(trace) then
14:      SeedQueue ← SeedQueue ∪ {s′}
15:    end if
16:  end if
17:  if trigger_crash(trace) then
18:    BuggyS ← BuggyS ∪ {s′}
19:  end if
20: end while
21: return (max_res, BuggyS)
```
Algorithm 1 shows the main procedure of our resource-usage-aware fuzzing. The algorithm first selects an input s from the seed pool SeedQueue, mutates it and generates a mutant s′. Then, the fuzzer runs the mutant input and monitors its execution. If the mutant input consumes more resources or leads to new resource-usage-aware coverage, it will be added to the seed pool as an interesting input. This process is similar to the process of traditional coverage-based grey-box fuzzers (e.g., AFL). The main difference is that the resource-usage-aware fuzzer uses a different definition of branch coverage and adds resource consumption guidance to retain interesting inputs. Now we give the details.

Resource-usage-aware coverage. Traditional coverage-based grey-box fuzzers use instrumentation to capture basic block transitions, and log edge coverage information during runtime. For example, AFL uses a random number to represent each basic block, and each transition from one basic block to another is marked by the Exclusive-OR (and right shift) result of the two random values. The identifier of each transition is treated as an address, and each time the transition is triggered, the hit count at that address is incremented. During runtime, AFL records edge coverage information, including whether the edge has been visited and the count of hits.

In this paper, we concentrate only on resource usage in a program, while many basic blocks in the program are useful for implementing functionality of the program but not relevant to resource usage. Based on this insight, we log only transitions between those basic blocks that contain resource-usage modeling functions (i.e., RAlloc(), RDealloc()), the coverage-label function covl() or the exit() function. For example, consider an execution trace $B^{r}_{1}, B_{2}, \ldots, B_{n-1}, B^{r}_{n}$ wherein only $B^{r}_{1}$ and $B^{r}_{n}$ contain the aforementioned resource-usage relevant functions. We will log it as a transition from $B^{r}_{1}$ to $B^{r}_{n}$, and increase the count of hits of this transition. Resource-usage-aware edge coverage is more fine-grained and sensitive than traditional edge coverage in distinguishing different resource usage.
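The difference between the two notions of coverage can be illustrated as follows. This Python sketch is a simplification under stated assumptions: blocks are identified by plain names or integers, `edge_id` mimics the XOR-with-shift scheme described above for AFL, and all function names are hypothetical.

```python
def edge_id(prev_block_id, cur_block_id):
    """AFL-style transition identifier: XOR of the current id with the shifted previous id."""
    return cur_block_id ^ (prev_block_id >> 1)

def resource_aware_edges(block_trace, relevant_blocks):
    """Count transitions between consecutive resource-relevant blocks only.

    block_trace     -- executed basic blocks in order
    relevant_blocks -- blocks containing RAlloc/RDealloc, covl() or exit()
    """
    hits = {}
    prev = None
    for block in block_trace:
        if block not in relevant_blocks:
            continue                      # irrelevant blocks are skipped entirely
        if prev is not None:
            hits[(prev, block)] = hits.get((prev, block), 0) + 1
        prev = block
    return hits

# The trace B1, B2, B3, B4 with relevant blocks B1 and B4 yields one logged edge.
edges = resource_aware_edges(["B1", "B2", "B3", "B4"], {"B1", "B4"})
```

Two executions that differ only in which irrelevant blocks lie between `B1` and `B4` map to the same logged transition, so the seed queue is not filled with inputs that differ in resource-irrelevant behaviour.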

Resource-usage amount guidance. When the resource-usage-aware fuzzer runs an input on the instrumented program, it collects not only the resource-usage-aware coverage information but also the resource-usage amount. The fuzzing engine maintains two variables, resc_cur and resc_peak, to track respectively the current amount and the historical peak amount of resource usage. It captures the parameters of RAlloc(n) and RDealloc(n), and updates the current amount as well as the historical peak amount of resource usage accordingly.
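The bookkeeping of the current and peak amounts can be illustrated with a small sketch. The class `ResourceMonitor` and its method names are hypothetical stand-ins for the callbacks behind RAlloc(n) and RDealloc(n); the actual engine may update its counters differently.

```python
class ResourceMonitor:
    """Tracks the current and the historical peak amount of one resource."""

    def __init__(self):
        self.resc_cur = 0   # amount currently held
        self.resc_peak = 0  # largest amount held so far

    def r_alloc(self, n):
        # Called with the parameter of RAlloc(n).
        self.resc_cur += n
        self.resc_peak = max(self.resc_peak, self.resc_cur)

    def r_dealloc(self, n):
        # Called with the parameter of RDealloc(n); the peak is never lowered.
        self.resc_cur -= n

m = ResourceMonitor()
m.r_alloc(10)
m.r_alloc(5)
m.r_dealloc(10)
m.r_alloc(3)
print(m.resc_cur, m.resc_peak)  # 8 15
```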

Overall guidance mechanism. As shown in Algorithm 1, after executing an input s′, we collect the peak resource-usage amount of the running trace through resPeak(trace) (Line 8). If this input leads to more resource usage, it is added to the seed pool for further mutation (Lines 9-11). Besides, if it leads to new resource-usage-aware coverage, it is also added to the seed pool for further mutation (Lines 13-14). In addition, if the input triggers a crash, it is added to BuggyS, which collects the set of test cases triggering resource-usage bugs.
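The guidance loop described above can be sketched roughly as follows. This is an assumption-laden toy, not Algorithm 1 itself: `toy_run` and `toy_mutate` are hypothetical stand-ins for the instrumented target and the mutation engine, and the crash threshold is invented for illustration.

```python
import random

def toy_run(data):
    """Hypothetical target: returns (peak usage, covered edges, crashed?)."""
    peak = sum(data)                      # pretend each byte is an allocation size
    edges = {("len", len(data) % 4)}      # stand-in for resource-usage-aware edges
    return peak, edges, peak > 2000       # invented crash condition

def toy_mutate(data, rng):
    """Hypothetical mutator: overwrite one byte or append a new one."""
    buf = bytearray(data)
    if buf and rng.random() < 0.5:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)

def fuzz(seeds, run, mutate, iterations, rng):
    seed_queue, buggy = list(seeds), []
    max_peak, seen_edges = 0, set()
    for _ in range(iterations):
        s_prime = mutate(rng.choice(seed_queue), rng)
        peak, edges, crashed = run(s_prime)
        interesting = False
        if peak > max_peak:               # more resource usage than ever seen
            max_peak, interesting = peak, True
        if not edges <= seen_edges:       # new resource-usage-aware coverage
            seen_edges |= edges
            interesting = True
        if interesting:
            seed_queue.append(s_prime)
        if crashed:
            buggy.append(s_prime)
    return seed_queue, buggy, max_peak

queue, buggy, peak = fuzz([b"\x01"], toy_run, toy_mutate, 300, random.Random(0))
print(len(queue), len(buggy), peak)
```

The two retention criteria are checked independently, so an input that lowers the peak but reaches a new transition is still kept for later mutation.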

#### 3 Experiments

We have implemented our approach in a prototype fuzzer named ResFuz<sup>5</sup>, based on MemLock [16], which is built on top of AFL [17]. We employ Coccinelle [12] to conduct program instrumentation.

We conduct preliminary experiments on several open-source programs, including jasper, openjpeg and yara, which are also part of the benchmark used in [16], as well as the small example libtirpc slice explained in Fig. 2. More specifically, jasper and openjpeg contain many heap resource operations, while yara contains recursive functions. Moreover, jasper and openjpeg contain many user-defined application-specific resource-usage operations. E.g., jasper uses operations like jas_malloc() and jas_free() to manage a heap memory pool with a user-configurable size. Similarly, openjpeg uses operations like opj_malloc() and opj_free() to manage a specific type of heap memory. The small program libtirpc slice contains socket operations, as explained in Sect. 2.1. We compare ResFuz against two other state-of-the-art fuzzers, namely AFL and MemLock [16]. All our experiments have been performed on machines with an Intel(R) Core(TM) i9-10940X CPU (3.30GHz) and 32GB of RAM under 64-bit Ubuntu LTS 20.04. We run each fuzzer for 6 hours (except 10 minutes for libtirpc slice) each time, repeat each experiment 3 times, and report the average as the result.

<sup>5</sup> The artifact is available at https://doi.org/10.5281/zenodo.5894821.

Fig. 3. The growth trend of resource usage

Fig. 3 plots the growth trend of the resource peaks found over time. The vertical axis shows the peak amount of the resource consumed (heaps for jasper and openjpeg, stack depths for yara, sockets for libtirpc slice). Fig. 3 shows that ResFuz outperforms the two baseline fuzzers in finding large resource consumption in almost all cases (except for jasper shown in Fig. 3(b), for which MemLock performs slightly better than ResFuz). In particular, as shown in Figs. 3(d-f), for user-defined resources in openjpeg and jasper as well as sockets in libtirpc slice, ResFuz provides much better results than the other two tools. This is because the guidance mechanism in ResFuz is based on resource-usage amount and resource-usage-aware coverage information, which accelerates the process of adding inputs that trigger large resource usage to the seed pool. Note that for these user-defined resources and sockets, MemLock uses the consumption of the general heap to guide the fuzzing process, while ResFuz uses respectively the consumption of the specific OPJ heap (in openjpeg), JAS heap (in jasper), and sockets (in libtirpc slice) to guide the fuzzing process.

#### 4 Related Work

Using dynamic analysis or fuzzing to find resource-usage relevant bugs has received much attention in recent years. PREDATOR [3] is an automated black-box testing tool for the detection and identification of local resource-exhaustion vulnerabilities in network servers; it computes resource usage profiles to predict the utilization of every monitored resource for test inputs. Radmin [7] confines the resource usage of a target program from its benign executions to the learned automata and then uses them to detect resource usage anomalies. Neither PREDATOR nor Radmin uses fuzzing. MemFuzz [6] uses memory access (rather than memory consumption) instrumentation in addition to branch coverage to guide evolutionary fuzzing. Recently, researchers have turned their attention to algorithmic complexity vulnerabilities, with tools such as SlowFuzz [13], Singularity [15] and PerfFuzz [10]. The basic idea behind these tools is to use the number of executed instructions as the guidance for fuzzing. However, all these works consider time complexity issues.

The most relevant work to our technique is MemLock [16], which uses memory-usage-guided fuzzing to generate inputs with excessive memory consumption and trigger uncontrolled memory consumption bugs. MemLock also uses memory consumption information to guide the fuzzing process and considers two kinds of memory resources, i.e., stack memory and heap memory. Compared with MemLock, we consider the usage of general resources, including memory, file descriptors, socket connections, user-defined resources, etc. Moreover, MemLock uses the default branch coverage of AFL (which considers transitions of all basic blocks) to guide the fuzzing process, while our approach adopts resource-usage-aware coverage (which considers transitions between basic blocks that are relevant to resource usage). In addition, we employ semantic patches to make use of the resource-usage relevant call graph and control-flow graph to conduct instrumentation at source code level, while MemLock uses the control-flow graph in the same way as AFL (to define branch coverage) and uses the call graph only to determine stack memory usage (by instrumenting at the entry and exit of functions).

#### 5 Conclusion and Future Work

In this paper, we present a resource-usage-aware fuzzing approach to estimate worst-case resource usage. It employs resource-usage amount and resource-usage-aware coverage to guide the fuzzing process, in order to generate inputs that trigger massive resource usage. Moreover, we employ semantic patches to make use of resource-usage relevant call graph and control-flow graph information to conduct instrumentation, in order to aid the subsequent fuzzing process. We have conducted experiments to estimate worst-case resource usage of various resources in real-world programs, including heap memory, stack depths, sockets, user-defined resources, etc. Preliminary experimental results show its promising ability to estimate worst-case resource usage in real-world programs, compared with two state-of-the-art fuzzing tools.

For future work, we plan to conduct experiments on more real-world programs and over more kinds of resources. We also plan to conduct evaluation comparison with more state-of-the-art fuzzing tools. Furthermore, we will evaluate our approach in detecting resource-usage bugs and security-critical vulnerabilities in real-world programs.

## References



## Quantitative Program Sketching using Lifted Static Analysis

Aleksandar S. Dimovski<sup>1</sup>

Mother Teresa University, st. Mirche Acev nr. 4, 1000 Skopje, North Macedonia aleksandar.dimovski@unt.edu.mk

Abstract. We present a novel approach for resolving numerical program sketches under Boolean and quantitative objectives. The input is a program sketch, which represents a partial program with missing numerical parameters (holes). The aim is to automatically synthesize values for the parameters, such that the resulting complete program satisfies: a Boolean (qualitative) specification given in the form of assertions; and a quantitative specification that estimates the number of execution steps to termination and which the synthesizer is expected to optimize.

To address the above quantitative sketching problem, we encode a program sketch as a program family (a.k.a. software product line) and analyze it by specifically designed lifted analysis algorithms based on abstract interpretation. In particular, we use a combination of forward (numerical) and backward (termination) lifted analysis of program families to find the variants (family members) that satisfy all assertions, and moreover are optimal with respect to the given quantitative objective. The variants obtained in this way represent "correct & optimal" sketch realizations. We present a prototype implementation of our approach within the FamilySketcher tool for resolving C sketches with numerical types. We have evaluated our approach on a set of benchmarks, and experimental results confirm the effectiveness of our approach.

Keywords: Quantitative program sketching · Software Product Lines · Abstract Interpretation

## 1 Introduction

A sketch [29,30] is a partial program with missing numerical expressions called holes to be discovered by the synthesizer. Previous approaches for program sketching [29,30,17] automatically synthesize integer constant values for the holes so that the resulting complete program satisfies Boolean (qualitative) properties in the form of assertions. However, the need for considering combined Boolean and quantitative properties is prominent in many applications. Still, quantitative properties have been largely missing from previous approaches for program sketching. In particular, there has been no possibility for measuring the "goodness" of solutions. Boolean properties are used to define minimal requirements for the synthesized complete programs. Still, there are usually many different complete programs that satisfy the Boolean properties, and some of them may be preferred over the others. Therefore, it is important to define synthesis algorithms that construct complete programs (solutions) that not only meet the Boolean properties, but are also optimal with respect to a given quantitative objective [2,6]. This is the so-called quantitative sketching problem.

In this paper, we use lifted static analysis based on abstract interpretation [25] for program families (a.k.a. software product lines) [8] to solve this quantitative sketching problem. The key observation is that all possible sketch realizations constitute a program family, where each numerical hole is represented as a numerical feature. A program family describes a set of similar programs as variants of some common code base [8]. At compile-time, a variant of a program family is derived by assigning concrete values to a set of features (configuration options) relevant for it, and only then is this variant compiled or interpreted. Program families (often in C) enriched with compile-time configurability by the C preprocessor CPP [8,21] are today widely used in open-source projects and industry [21]. By using the proposed transformation from program sketches to program families, we reduce the quantitative sketching problem to selecting those variants (family members) from the corresponding program family that satisfy all assertions and are optimal with respect to the given quantitative objective. As a quantitative objective we consider here the sufficient preconditions inferred by a quantitative termination analysis that estimates the efficiency of a program by computing upper bounds on the number of execution steps to termination. More specifically, we use a combination of forward and backward lifted analysis to solve this problem. The forward numerical lifted analysis infers numerical invariants for all members of a program family, thus finding the "correct" variants that satisfy all assertions. Subsequently, the backward termination lifted analysis is performed on a sub-family of "correct" variants to infer piecewise-defined ranking functions, which provide upper bounds on the number of execution steps to termination. The variants with minimal ranking function are reported as optimal complete programs that solve the original quantitative sketching problem.

To find the required variants (i.e., the solution to the quantitative sketching problem), we use specifically designed lifted static analysis algorithms, which efficiently analyze all variants of the program family simultaneously, without generating any of them explicitly [3,24,22,28,19,11,20,16]. Lifted analysis processes the common code base of a program family directly, exploiting the similarities among individual variants to reduce analysis effort. It reports precise analysis results for all variants of the family. In particular, we use an efficient, abstract interpretation-based lifted analysis of program families with numerical features [16], where sharing is explicitly possible between equivalent analysis elements corresponding to different variants. This is achieved by using a specialized decision tree lifted domain [16] that provides a symbolic and compact representation of the lifted analysis elements. More precisely, the elements of the lifted domain are decision trees, in which decision nodes are labelled with linear constraints over features, while leaf nodes belong to an existing single-program analysis domain (e.g., some numerical domain [25] or the termination domain [31,32]). The decision trees recursively partition the space of all variants (i.e., the space of possible combinations of feature values), whereas the program properties at the leaves provide analysis information corresponding to each partition (i.e., to those variants that satisfy the constraints along the path to the given leaf node). This way, the forward (numerical) lifted analysis partitions the given family into "correct", "incorrect", and "I don't know" (inconclusive) sub-families (sets of variants) with respect to the given assertions. The backward (termination) lifted analysis additionally partitions the "correct" sub-family with respect to the estimated number of execution steps to termination.

Because of their special structure and possibilities for sharing equivalent analysis results, the decision tree-based lifted analyses are able to converge to a solution very fast, even for program families (sketches) that contain numerical features (holes) with large domains and thus give rise to astronomical search spaces. This is particularly true for sketches in which holes appear in (linear) expressions that can be exactly represented in the underlying numerical domains used in the decision trees (e.g., polyhedra). In those cases, we can design very efficient lifted analyses with extended (improved) transfer functions for assignments and tests.

We have implemented our approach in a prototype program synthesizer, called FamilySketcher [17]. The numerical abstract domains (e.g., intervals, octagons, polyhedra) from the APRON library [23] are used as parameters of the underlying decision trees. FamilySketcher calls the Z3 SMT solver [26] to solve the optimization problem that represents the given quantitative objective. We illustrate this approach for automatic completion of various numerical C sketches from the Sketch project [29,30], SV-COMP (https://sv-comp.sosy-lab.org/), and the SyGuS-Competition (https://sygus.org/) [1]. We compare the performance of our approach against the most popular sketching tool Sketch [29,30] and a Brute-Force enumeration approach that checks all sketch realizations one by one for correctness and optimality.

In summary, this work makes the following contributions: (1) We combine forward and backward lifted analyses to resolve numerical program sketches with respect to both Boolean and quantitative specifications; (2) We implement our approach in the FamilySketcher tool, which uses numerical domains from the APRON library as parameters and the Z3 tool for solving the underlying (linear) optimization problem; (3) We evaluate our approach and compare its performance with the Sketch tool and the Brute-Force enumeration approach.

#### 2 Motivating Examples

Let us consider the Loop1a sketch taken from SyGuS-Competition [1]:

```
void main() {
① int x := ??1, y := 0;
② while ⓗ (x > ??2) {
③   x := x-1;
④   y := y+1; }
⑤ assert (y > 2); //assert (y < 8);
}
```
Fig. 1: Lifted numerical invariant at location ⑤ of Loop1a (solid edges = true, dashed edges = false).

Fig. 2: Lifted ranking function at location ① of Loop1a.

Fig. 3: Lifted ranking function at location ① of Loop1b.

which contains two numerical holes, denoted by ??<sub>1</sub> and ??<sub>2</sub>. The synthesizer should replace the holes with constants from ℤ, such that the synthesized program satisfies the assertion at location ⑤ under all possible inputs. Moreover, we want to select the most efficient correct program, i.e. the one that terminates in the minimum number of execution steps.

We transform the Loop1a sketch to a program family, which contains two numerical features A and B with domains [Min, Max] ⊆ ℤ.<sup>1</sup> Since both holes in the Loop1a sketch occur in (linear) expressions that can be exactly represented in numerical domains (e.g. intervals), the Loop1a program family is obtained by replacing the two holes ??<sub>1</sub> and ??<sub>2</sub> with the features A and B. The total number of variants that can be generated from this family is (Max − Min + 1)<sup>2</sup>, so that each variant corresponds to one possible sketch realization. We perform a forward numerical lifted analysis based on decision trees [16] of the Loop1a program family. The decision tree (lifted numerical invariant) inferred at location ⑤ is shown in Fig. 1. Notice that the inner nodes of the decision tree in Fig. 1 are labeled with polyhedral linear constraints defined over feature variables A and B, while the leaves are labeled with polyhedral linear constraints defined over program and feature variables x, y, A and B. The edges of decision trees are labeled with the truth value of the decision on the parent node: we use solid edges for true (i.e., the constraint in the parent node is satisfied) and dashed edges for false (i.e., the negation of the constraint in the parent node is satisfied). Note that linear constraints in decision nodes implicitly take domains of features into account. For example, the decision node (A≤B) is satisfied when (A≤B) ∧ (Min≤A≤Max) ∧ (Min≤B≤Max). From the invariant inferred at location ⑤ shown in Fig. 1, we can see that the given assertion (y > 2) may be valid in the leaf node that can be reached along the path satisfying the constraint ¬(A≤B), i.e. (A-B ≥ 1). In fact, (y > 2) holds when the stronger constraint (A-B ≥ 3) is satisfied. Thus, any variant that satisfies the above constraint (A-B ≥ 3) represents a "correct" solution to the Loop1a sketch. To find a "correct & optimal" solution, we perform a backward termination lifted analysis based on decision trees [13] of the Loop1a sub-family satisfying (A-B ≥ 3). The decision tree representing the lifted ranking function of the above sub-family at initial location ① is shown in Fig. 2.<sup>2</sup> Notice that the leaf nodes represent affine functions defined over feature and program variables. We can see that the ranking function is: 3A-3B+4. We call the Z3 solver [26] to solve the following linear optimization problem: find values for A and B that minimize the value of the ranking function 3A-3B+4 subject to the constraint (A-B ≥ 3) ∧ (3A-3B+4 > 0). Minimizing this function gives us values for A and B that are desirable according to the quantitative criterion while satisfying the given assertion. The solution produced by Z3 is: A=3 and B=0 with the minimal objective 13. Therefore, the synthesizer reports this variant, i.e. the program where ??<sub>1</sub>=3 and ??<sub>2</sub>=0, as a "correct & optimal" solution.

<sup>1</sup> Note that Min and Max represent some minimal and maximal representable integers. E.g., we may take Min = 0 and Max = 31 for 5-bit sizes of holes.
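The Loop1a figures above can be cross-checked by plain enumeration on a small domain. The sketch below is a brute-force sanity check, not the lifted analysis or the Z3 encoding used by the tool; it assumes Min = 0 and Max = 31 (the 5-bit example from the footnote) and takes the ranking function 3A-3B+4 directly from the text. The helper names `run_loop1` and `rank` are introduced here for illustration only.

```python
MIN, MAX = 0, 31  # assumed 5-bit hole domains

def run_loop1(a, b):
    """Execute the Loop1 body with ??1 = a and ??2 = b; return the final y."""
    x, y = a, 0
    while x > b:
        x -= 1
        y += 1
    return y

def rank(a, b):
    """Ranking function reported in the text for the correct sub-family."""
    return 3 * a - 3 * b + 4

# Variants for which the assertion (y > 2) holds.
correct = [(a, b)
           for a in range(MIN, MAX + 1)
           for b in range(MIN, MAX + 1)
           if run_loop1(a, b) > 2]

# They coincide with the sub-family A - B >= 3.
assert all(a - b >= 3 for a, b in correct)

best = min(rank(a, b) for a, b in correct)
print(best)        # 13
print(rank(3, 0))  # 13, so A=3, B=0 attains the minimum
```

Every variant with A-B = 3 attains the same objective value, so A=3, B=0 is one of several minimisers on this domain.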

We consider an alternative sketch of Loop1a, denoted by Loop1b, in which the assertion at location ⑤ is (y < 8). The numerical invariant inferred at location ⑤ is the same as for Loop1a, as shown in Fig. 1. However, there are now two solutions to the assertion (y < 8): (A ≤ B) when the left leaf node is reached, and (1≤A-B≤7) when the right leaf node is reached. We perform two backward termination lifted analyses to find optimal solutions for both correct sub-families: (A≤B) and (1≤A-B≤7). The lifted ranking function inferred at the initial location is given in Fig. 3. The solutions to the given optimization problem produced by the Z3 solver are: A=0, B=0 with the minimal objective 4 for the case (A≤B); and A=1, B=0 with the minimal objective 7 for the case (1≤A-B≤7).

Let us consider the Loop2a sketch in Fig. 10. The lifted numerical invariant inferred at location ⑥ is shown in Fig. 4. We can see that the assertion (y > 2) is valid for variants satisfying: (A-B ≥ 1) ∧ (3 ≤ A ≤ Max). The lifted ranking function inferred for this sub-family is shown in Fig. 5. It represents a piecewise-defined ranking function since it depends on the value of the input variable x. To represent piecewise-defined ranking functions graphically in decision trees, we use rounded rectangles to represent second-level decision nodes that are labelled with linear constraints defined over both feature and program variables. Thus, they partition the configuration and memory space, i.e. the possible values of feature and program variables (see Fig. 5). The obtained "correct & optimal" solution is: A=3 and B=0 with the minimal objective 3 when (x>10) and -3x+36 when (x≤10). Similarly, we can resolve the Loop2b sketch, where the assertion (y < 8) is considered. The "correct" variants satisfy: (A-B ≥ 1) ∧ (Min ≤ A ≤ 7), and the "correct & optimal" solution is: A=1 and B=0 with the minimal objective 3 when (x > 10) and -3x+36 when (x ≤ 10). Note that the inferred ranking functions for "correct" sub-families of Loop2a and Loop2b in Figs. 5 and 6 do not depend on feature variables, so any "correct" solution is "optimal" as well.

From the decision trees inferred by performing lifted analyses of our motivating examples, we can see that the decision tree-based representation uses only one or two leaf nodes, although there are many variants in total. This possibility for sharing equivalent analysis information corresponding to different variants confirms that decision trees are a symbolic and compact representation of lifted analysis elements. This is the key for obtaining efficient lifted analyses of program families with large configuration spaces, and thus for efficiently solving the quantitative sketching problem.

<sup>2</sup> Termination analysis is backward, so the final result is reported in the initial location.

Fig. 4: Lifted numerical invariant before assertion in Loop2a.

Fig. 5: Lifted ranking function at initial location of Loop2a.

Fig. 6: Lifted ranking function at initial location of Loop2b.

### 3 Transforming Sketches to Program Families

We now introduce the IMP language that we use to illustrate our work. We describe two extensions of IMP: IMP?? for writing program sketches, and $\overline{\text{IMP}}$ for writing program families. Finally, we define the transformation of sketches to program families and show its correctness.

IMP. We use a simple imperative language, called IMP [27,25], for writing general-purpose single-programs. Program variables Var are statically allocated, and the only data type is the set ℤ of mathematical integers. The syntax is:

$$\begin{array}{l} s ::= \mathtt{skip} \mid \mathtt{x} := ae \mid s; s \mid \mathtt{if}\,(be)\,\mathtt{then}\,s\,\mathtt{else}\,s \mid \mathtt{while}\,(be)\,\mathtt{do}\,s \mid \mathtt{assert}\,(be), \\ ae ::= n \mid [n, n'] \mid \mathtt{x} \mid ae \oplus ae, \qquad be ::= ae \bowtie ae \mid \neg be \mid be \wedge be \mid be \vee be \end{array}$$

where n ranges over integers ℤ, [n, n′] over integer intervals, x over program variables Var, ⊕ ∈ {+, −, ∗, /}, and ⋈ ∈ {<, ≤, =, ≠}. Intervals [n, n′] denote a random choice of an integer in the interval. The set of all statements s is denoted by Stm; the set of all arithmetic expressions ae is denoted by AExp; the set of all boolean expressions be is denoted by BExp.

A program state σ : Σ = Var → ℤ is a mapping from program variables to values. The meaning of boolean expressions [[be]] : Σ → P({true, false}), arithmetic expressions [[ae]] : Σ → P(ℤ), and statements [[s]] : Σ → P(Σ), are defined by induction on their structure [27,25]. For example, the meaning of an arithmetic expression ae is a function from a state to a set of values:

$$\begin{aligned} [\![n]\!]\sigma &= \{n\}, \quad [\![[n, n']]\!]\sigma = \{n, \ldots, n'\}, \quad [\![\mathtt{x}]\!]\sigma = \{\sigma(\mathtt{x})\}, \\ [\![ae_0 \oplus ae_1]\!]\sigma &= \{n_0 \oplus n_1 \mid n_0 \in [\![ae_0]\!]\sigma,\ n_1 \in [\![ae_1]\!]\sigma\} \end{aligned}$$
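The set-valued semantics of arithmetic expressions can be mirrored by a small evaluator. This is an illustrative sketch only: the tuple encoding of expressions and the name `eval_ae` are assumptions, and division is left out because the text does not say how division by zero is treated.

```python
import operator

# Binary operators covered by this sketch (division deliberately omitted).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_ae(ae, sigma):
    """Return the set of possible values of expression ae in state sigma.

    Hypothetical encoding: ("num", n), ("interval", n, n2),
    ("var", "x"), ("op", symbol, ae0, ae1).
    """
    kind = ae[0]
    if kind == "num":
        return {ae[1]}
    if kind == "interval":
        return set(range(ae[1], ae[2] + 1))  # nondeterministic choice in [n, n']
    if kind == "var":
        return {sigma[ae[1]]}
    if kind == "op":
        _, symbol, ae0, ae1 = ae
        return {OPS[symbol](n0, n1)
                for n0 in eval_ae(ae0, sigma)
                for n1 in eval_ae(ae1, sigma)}
    raise ValueError(f"unknown expression: {ae!r}")

# x + [0, 2] in a state where x = 5
print(sorted(eval_ae(("op", "+", ("var", "x"), ("interval", 0, 2)), {"x": 5})))  # [5, 6, 7]
```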

We write [[s]] for the set of final states that can be derived by executing s from some initial input state [27,25].

IMP??. The language for sketches IMP?? is obtained by extending IMP with a basic hole construct, denoted by ??. The numerical hole ?? is a placeholder that the synthesizer must replace with a suitable integer constant.

$$ae ::= \dots \mid \text{??}$$

Each hole occurrence in a program sketch is assumed to be uniquely labelled as ??<sub>i</sub> and has a bounded integer domain [n, n′]. We will sometimes write $??_i^{[n,n']}$ to make explicit the domain of a given hole.

Let H be a set of holes in a program sketch. We define a control function ϕ : Φ = H → ℤ to describe the value of each hole in the sketch. Thus, ϕ fully describes a candidate solution to the sketch. We write s<sup>ϕ</sup> to describe a candidate solution to the sketch s fully defined by control function ϕ.

$\overline{\text{IMP}}$. Let F = {A<sub>1</sub>, …, A<sub>n</sub>} be a finite and totally ordered set of numerical features available in a program family. For each feature A ∈ F, dom(A) ⊆ ℤ denotes the set of possible values that can be assigned to A. A valid combination of feature values represents a configuration k, which specifies one variant of a program family. It is given as a valuation function k : F → ℤ, which is a mapping that assigns a value from dom(A) to each feature A ∈ F. We assume that only a subset K of all possible configurations is valid. An alternative representation of configurations is based upon propositional formulae. Each configuration k ∈ K can also be represented by a propositional formula: (A<sub>1</sub> = k(A<sub>1</sub>)) ∧ … ∧ (A<sub>n</sub> = k(A<sub>n</sub>)). The set of configurations K can also be represented as a formula: ∨<sub>k∈K</sub> k. We define feature expressions, denoted FeatExp(F), as the set of propositional logic formulas over constraints of F generated by:

$$\theta ::= \text{true} \mid e_{\mathcal{F}} \bowtie e_{\mathcal{F}} \mid \neg \theta \mid \theta_1 \land \theta_2 \mid \theta_1 \lor \theta_2, \qquad e_{\mathcal{F}} ::= n \in \mathbb{Z} \mid A \in \mathcal{F} \mid e_{\mathcal{F}} \oplus e_{\mathcal{F}}$$

When a configuration k ∈ K satisfies a feature expression θ ∈ FeatExp(F), we write k ⊨ θ, where ⊨ is the standard satisfaction relation. We write [[θ]] to denote the set of configurations from K that satisfy θ, that is, k ∈ [[θ]] if k ⊨ θ.
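The satisfaction relation and the set [[θ]] can be illustrated on a small configuration space. The following sketch uses a hypothetical tuple encoding of feature expressions; the names `sat`, `denote` and `configs` are introduced here for the example and do not come from the paper.

```python
from itertools import product
import operator

CMP = {"<": operator.lt, "<=": operator.le, "=": operator.eq, "!=": operator.ne}
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_fe(e, k):
    """Evaluate an arithmetic feature expression: int, feature name, or (op, l, r)."""
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return k[e]
    op, left, right = e
    return ARITH[op](eval_fe(left, k), eval_fe(right, k))

def sat(k, theta):
    """Decide whether configuration k satisfies feature expression theta."""
    tag = theta[0]
    if tag == "true":
        return True
    if tag == "not":
        return not sat(k, theta[1])
    if tag == "and":
        return sat(k, theta[1]) and sat(k, theta[2])
    if tag == "or":
        return sat(k, theta[1]) or sat(k, theta[2])
    if tag == "cmp":
        _, op, left, right = theta
        return CMP[op](eval_fe(left, k), eval_fe(right, k))
    raise ValueError(f"unknown feature expression: {theta!r}")

def configs(domains):
    """All valuations of the features (here every combination is assumed valid)."""
    names = sorted(domains)
    return [dict(zip(names, vals)) for vals in product(*(domains[n] for n in names))]

def denote(theta, K):
    """The set of configurations from K that satisfy theta."""
    return [k for k in K if sat(k, theta)]

K = configs({"A": range(4), "B": range(4)})
theta = ("not", ("cmp", "<=", "A", "B"))  # the constraint not(A <= B)
print(len(K), len(denote(theta, K)))      # 16 6
```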

The language for program families $\overline{\text{IMP}}$ is obtained by extending IMP with a new compile-time conditional statement for encoding multiple variants and a new arithmetic expression that represents a feature variable. The new statement "#if (θ) s #endif" contains a feature expression θ ∈ FeatExp(F) as a presence condition, such that the statement s is included in the variant corresponding to a configuration k ∈ K only if θ is satisfied by k. The syntax is:

$$s ::= \dots \mid \#\mathtt{if}\ (\theta)\ s\ \#\mathtt{endif}, \qquad ae ::= \dots \mid A \in \mathcal{F}$$

Any other preprocessor conditional constructs can be desugared and represented only by the #if construct. For example, #if (θ) s<sub>0</sub> #elif (θ′) s<sub>1</sub> #endif is translated into the following: #if (θ) s<sub>0</sub> #endif ; #if (¬θ ∧ θ′) s<sub>1</sub> #endif. Note that feature variables A ∈ F can occur in arbitrary expressions in $\overline{\text{IMP}}$, not only in presence conditions of #if-s as in traditional program families [21,24].

The semantics of $\overline{\text{IMP}}$ has two stages: first, given a configuration k ∈ K, compute an IMP single-program without #if-s and features A ∈ F; second, evaluate the obtained program using the standard IMP semantics [24]. The first stage is specified by the projection function π<sub>k</sub>, which recursively pre-processes all sub-statements and sub-expressions of statements. Hence, π<sub>k</sub>(skip) = skip, π<sub>k</sub>(x:=ae) = x:=π<sub>k</sub>(ae), π<sub>k</sub>(s;s′) = π<sub>k</sub>(s);π<sub>k</sub>(s′), π<sub>k</sub>(ae⊕ae′) = π<sub>k</sub>(ae)⊕π<sub>k</sub>(ae′), and π<sub>k</sub>(ae ⋈ ae′) = π<sub>k</sub>(ae) ⋈ π<sub>k</sub>(ae′). For "#if (θ) s #endif", statement s is included in the variant if k ⊨ θ; otherwise, if k ⊭ θ, statement s is removed:<sup>3</sup>

$$\pi_k(\#\mathtt{if}\ (\theta)\ s\ \#\mathtt{endif}) = \begin{cases} \pi_k(s) & \text{if } k \models \theta \\ \mathtt{skip} & \text{if } k \not\models \theta \end{cases}$$

For a feature A ∈ F, the projection function π<sub>k</sub> replaces A with the value k(A) ∈ ℤ, that is, π<sub>k</sub>(A) = k(A).
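The first stage, projection onto a single variant, can be sketched as a recursive function over a small abstract syntax tree. The tuple encoding, the tags `"#if"` and `"feat"`, and the use of a Python predicate for the presence condition θ are all simplifying assumptions made for this illustration.

```python
def project_expr(e, k):
    """Replace every feature occurrence ("feat", name) in e by its value k[name]."""
    if isinstance(e, tuple):
        if e[0] == "feat":
            return k[e[1]]
        op, left, right = e
        return (op, project_expr(left, k), project_expr(right, k))
    return e  # integer constant or program variable

def project(s, k):
    """Compute the variant of statement s for configuration k."""
    tag = s[0]
    if tag == "skip":
        return s
    if tag == "assign":
        return ("assign", s[1], project_expr(s[2], k))
    if tag == "seq":
        return ("seq", project(s[1], k), project(s[2], k))
    if tag == "while":
        return ("while", project_expr(s[1], k), project(s[2], k))
    if tag == "assert":
        return ("assert", project_expr(s[1], k))
    if tag == "#if":
        # s[1] is a predicate over configurations standing in for theta.
        return project(s[2], k) if s[1](k) else ("skip",)
    raise ValueError(f"unknown statement: {s!r}")

# x := A; #if (A > 1) x := x + 1 #endif
family = ("seq",
          ("assign", "x", ("feat", "A")),
          ("#if", lambda k: k["A"] > 1, ("assign", "x", ("+", "x", 1))))

print(project(family, {"A": 3}))
# ('seq', ('assign', 'x', 3), ('assign', 'x', ('+', 'x', 1)))
print(project(family, {"A": 0}))
# ('seq', ('assign', 'x', 0), ('skip',))
```

The result contains neither #if statements nor feature variables, so it can be handed to an ordinary single-program interpreter or analysis.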

Transformation. We want to transform an input sketch ŝ with a set of m holes $??_1^{[n_1,n'_1]}, \ldots, ??_m^{[n_m,n'_m]}$ into an output program family s with a set of features A<sub>1</sub>, …, A<sub>m</sub> with domains [n<sub>1</sub>, n′<sub>1</sub>], …, [n<sub>m</sub>, n′<sub>m</sub>], respectively. The set of configurations K in s includes all possible combinations of feature values.

If a hole occurs in a (linear) expression that can be exactly represented in the underlying numerical abstract domain D, then we can handle the hole in a more efficient symbolic way by an extended lifted analysis. Given the polyhedra domain P, we say that a hole ?? can be exactly represented in P if it occurs in an expression of the form: α<sub>1</sub>x<sub>1</sub> + … + α<sub>i</sub>?? + … + α<sub>n</sub>x<sub>n</sub> + β, where α<sub>1</sub>, …, α<sub>n</sub>, β ∈ ℤ and x<sub>1</sub>, …, x<sub>n</sub> are program variables or other hole occurrences. Similarly, we define that a hole can be exactly represented in the interval I and the octagon O domains if it occurs in expressions of the form ±?? + β and ±x ± ?? + β, respectively (where β ∈ ℤ and x is a program variable or other hole occurrence).

We now define rewrite rules for eliminating holes ?? from a program sketch ŝ. Let $s[??^{[n,n']}]$ be a basic (non-compound) statement in which the hole $??^{[n,n']}$ occurs as a sub-expression. When the hole $??^{[n,n']}$ occurs in an expression that can be represented exactly in the numerical domain D, we eliminate ?? using the symbolic rewrite rule:

$$s[??^{[n,n']}] \quad \leadsto \quad s[\mathtt{A}] \tag{SR}$$

Otherwise, if the hole $??^{[n,n']}$ occurs in an expression that cannot be represented exactly in the numerical domain D, then we use the explicit rewrite rule:

$$s[??^{[n,n']}] \quad \leadsto \quad \#\mathtt{if}\ (\mathtt{A}=n)\ s[n]\ \#\mathtt{elif}\ \ldots\ \#\mathtt{elif}\ (\mathtt{A}=n'-1)\ s[n'-1]\ \#\mathtt{else}\ s[n']\ \#\mathtt{endif} \tag{ER}$$

The set of features F is also updated with the fresh feature A. We write Rewrite(ˆs) to be the resulting program family obtained by repeatedly applying rules (SR) and (ER) on a program sketch ˆs to saturation.

Example 1. Reconsider the Loop1a and Loop2a sketches from Section 2. All holes ?? can be represented exactly in the interval domain, so we use the symbolic (SR) rule to obtain the program family. Now consider the sketch: int x; while (x ≥ 0) x := ??\*x+10. The hole ?? occurs in the non-linear expression ??\*x, so it cannot be represented exactly in any numerical domain D. Thus, we use the explicit (ER) rule to obtain the program family. □
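To make the explicit rule concrete, the following sketch expands a statement with one hole into the corresponding chain of #if directives. It is a purely textual illustration: the placeholder string `??`, the function name `explicit_rewrite`, and the one-directive-per-line layout are assumptions of this example, not the tool's implementation.

```python
def explicit_rewrite(stmt_template: str, feature: str, lo: int, hi: int) -> str:
    """Expand a statement with one hole '??' over the domain [lo, hi].

    Produces one branch per value of the fresh feature; the last value is
    covered by the #else branch, as in rule (ER).
    """
    if lo > hi:
        raise ValueError("empty hole domain")
    if lo == hi:
        # A singleton domain needs no variability at all.
        return stmt_template.replace("??", str(lo))
    lines = []
    for v in range(lo, hi):
        keyword = "#if" if v == lo else "#elif"
        lines.append(f"{keyword} ({feature}={v}) {stmt_template.replace('??', str(v))}")
    lines.append(f"#else {stmt_template.replace('??', str(hi))}")
    lines.append("#endif")
    return "\n".join(lines)


if __name__ == "__main__":
    # The non-linear statement from Example 1 with a small hole domain [0, 2].
    print(explicit_rewrite("x := ??*x+10", "A", 0, 2))
```

The number of generated branches grows with the size of the hole domain, which is why holes handled this way make the analysis depend on the domain size.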

<sup>3</sup> Since any k ∈ K is a valuation function, we have that either k |= θ holds or k ̸|= θ (which is equivalent to k |= ¬θ) holds, for any θ ∈ FeatExp(F).

The following result establishes the correctness of our transformation. It can be proved by structural induction on statements and expressions.

Theorem 1. Let ŝ be a sketch with holes ??<sub>1</sub>, . . . , ??<sub>n</sub>, ϕ be a control function, and ŝ<sup>ϕ</sup> be a candidate solution of ŝ. Let s = Rewrite(ŝ) be a program family, in which features A<sub>1</sub>, . . . , A<sub>n</sub> correspond to holes ??<sub>1</sub>, . . . , ??<sub>n</sub>. We define a configuration k ∈ K such that k(A<sub>i</sub>) = ϕ(??<sub>i</sub>) for 1 ≤ i ≤ n. Then, we have: [[ŝ<sup>ϕ</sup>]] = [[π<sub>k</sub>(s)]].

## 4 Decision Tree-based Lifted Analyses

In the context of program families, lifting means taking a static analysis that works on IMP single-programs, and transforming it into an analysis that works on IMP program families, without preprocessing them. In this work, we will use lifted versions of the (forward) numerical analysis [25] and the (backward) termination analysis [31] from the abstract interpretation framework [9]. They will be used to infer numerical invariants and piecewise-defined ranking functions in all program locations. We work with lifted analyses based on the lifted domain of decision trees [16], in which the leaf nodes belong to an existing single-program domain (e.g., a numerical or termination domain) and decision nodes are linear constraints over feature variables. This way, we encapsulate the set of configurations K into decision nodes, where each top-down path represents a subset of configurations from K, and we store in each leaf node the analysis property generated from the variants corresponding to the given configurations.

#### 4.1 Abstract domain for decision nodes

The domain of decision nodes C<sub>D<sub>V</sub></sub> is the finite set of linear constraints defined over a set of variables V = {X<sub>1</sub>, . . . , X<sub>k</sub>}. C<sub>D</sub> is constructed using the numerical domain D (see Section 4.2) by mapping a conjunction of constraints from D to a finite set of constraints in P(C<sub>D</sub>). We assume the set of variables V = {X<sub>1</sub>, . . . , X<sub>k</sub>} to be a finite and totally ordered set, such that the ordering is X<sub>1</sub> > . . . > X<sub>k</sub>. We impose a total order <<sub>C<sub>D</sub></sub> on C<sub>D</sub> to be the lexicographic order on the coefficients α<sub>1</sub>, . . . , α<sub>k</sub> and constant α<sub>k+1</sub> of the linear constraints:

$$\begin{array}{ll} (\alpha_1 \cdot X_1 + \ldots + \alpha_k \cdot X_k + \alpha_{k+1} \ge 0) <_{\mathbb{C}_{\mathbb{D}}} (\alpha'_1 \cdot X_1 + \ldots + \alpha'_k \cdot X_k + \alpha'_{k+1} \ge 0) \\ \iff \quad \exists j > 0.\ \forall i < j.\ (\alpha_i = \alpha'_i) \land (\alpha_j < \alpha'_j) \end{array}$$

The negation of linear constraints is formed as: ¬(α<sub>1</sub>X<sub>1</sub> + . . . + α<sub>k</sub>X<sub>k</sub> + β ≥ 0) = −α<sub>1</sub>X<sub>1</sub> − . . . − α<sub>k</sub>X<sub>k</sub> − β − 1 ≥ 0. For example, the negation of X − 3 ≥ 0 is −X + 2 ≥ 0. To ensure canonical representation of decision trees, a linear constraint c and its negation ¬c cannot both appear as decision nodes. Thus, we only keep the largest constraint with respect to <<sub>C<sub>D</sub></sub> between c and ¬c.
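The negation and the canonical choice between c and ¬c can be illustrated with a few lines of code. The tuple encoding of a constraint (coefficients followed by the constant) and the function names are assumptions made for this sketch; Python's built-in tuple comparison is lexicographic, which matches the order described above.

```python
from typing import Tuple

# (a1, ..., ak, beta) encodes a1*X1 + ... + ak*Xk + beta >= 0 (hypothetical encoding)
Constraint = Tuple[int, ...]


def negate(c: Constraint) -> Constraint:
    """Integer negation: not(e + beta >= 0) is -e - beta - 1 >= 0."""
    *coeffs, beta = c
    return tuple(-a for a in coeffs) + (-beta - 1,)


def canonical(c: Constraint) -> Constraint:
    """Keep the larger of c and its negation in the lexicographic order."""
    return max(c, negate(c))


if __name__ == "__main__":
    c = (1, -3)          # X - 3 >= 0
    print(negate(c))     # the constraint -X + 2 >= 0
    print(canonical(negate(c)))
```

Because negating twice yields the original constraint, `canonical` returns the same representative whether it is given c or ¬c.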

#### 4.2 Abstract domain for leaf nodes

We assume the existence of a single-program abstract domain A defined over a set of variables V = {X<sub>1</sub>, . . . , X<sub>k</sub>}. The domain A is equipped with sound operators for concretization γ<sub>A</sub>, ordering ⊑<sub>A</sub>, join ⊔<sub>A</sub>, meet ⊓<sub>A</sub>, bottom ⊥<sub>A</sub>, top ⊤<sub>A</sub>, widening ∇<sub>A</sub>, and narrowing △<sub>A</sub>, as well as sound transfer functions for tests (boolean expressions) FILTER<sub>A</sub>, forward assignments F-ASSIGN<sub>A</sub>, and backward assignments B-ASSIGN<sub>A</sub>. More specifically, FILTER<sub>A</sub>(a : A, be : BExp) returns an abstract element from A obtained by restricting a to satisfy the test be; F-ASSIGN<sub>A</sub>(a : A, x:=ae : Stm) returns an updated version of a by abstractly evaluating x:=ae in it; whereas B-ASSIGN<sub>A</sub>(b : A, x:=ae : Stm) returns an abstract element from A from which the abstract element b holds after evaluating x:=ae. Note that a in F-ASSIGN<sub>A</sub> is an invariant in the initial location of x:=ae that needs to be propagated forward, while b in B-ASSIGN<sub>A</sub> is an invariant in the final location of x:=ae that needs to be propagated backwards. We will sometimes write A<sub>V</sub> to explicitly denote the set of variables V over which A is defined. In this work, we will use domains A<sub>Var</sub>, A<sub>F</sub>, and A<sub>Var∪F</sub>.

For the forward numerical analysis, we will instantiate A with some of the known numerical domains ⟨D, ⊑<sub>D</sub>⟩, such as Intervals ⟨I, ⊑<sub>I</sub>⟩ [9,25], Octagons ⟨O, ⊑<sub>O</sub>⟩ [25], and Polyhedra ⟨P, ⊑<sub>P</sub>⟩ [25]. The elements of I are intervals of the form ±X ≥ β, where X ∈ V, β ∈ Z; the elements of O are conjunctions of octagonal constraints of the form ±X<sub>1</sub> ± X<sub>2</sub> ≥ β, where X<sub>1</sub>, X<sub>2</sub> ∈ V, β ∈ Z; while the elements of P are conjunctions of polyhedral constraints of the form α<sub>1</sub>X<sub>1</sub> + . . . + α<sub>k</sub>X<sub>k</sub> + β ≥ 0, where X<sub>1</sub>, . . . , X<sub>k</sub> ∈ V, α<sub>1</sub>, . . . , α<sub>k</sub>, β ∈ Z.

For the backward termination analysis, we will instantiate A with the termination decision tree domain T<sub>T</sub>(C<sub>D<sub>Var∪F</sub></sub>, F<sub>A</sub>), also written T<sub>T</sub> for short, introduced by Urban and Miné [31,32], where C<sub>D<sub>Var∪F</sub></sub> is the domain for decision nodes and F<sub>A</sub> is the domain of affine functions for leaf nodes. The elements of F<sub>A</sub> are: {⊥<sub>F</sub>, ⊤<sub>F</sub>} ∪ {f : Z<sup>|Var∪F|</sup> → N | f(x<sub>1</sub>, . . . , x<sub>n</sub>) = m<sub>1</sub>x<sub>1</sub> + . . . + m<sub>n</sub>x<sub>n</sub> + q}, where f ∈ F<sub>A</sub> is a natural-valued function of program and feature variables representing an upper bound on the number of steps to termination; the element ⊥<sub>F</sub> represents potential non-termination; and ⊤<sub>F</sub> represents the lack of information to conclude. The leaf nodes belonging to F<sub>A</sub>\{⊥<sub>F</sub>, ⊤<sub>F</sub>} and {⊥<sub>F</sub>, ⊤<sub>F</sub>} represent defined and undefined leaf nodes, respectively. A termination decision tree t′ ∈ T<sub>T</sub> is either a leaf node ≪f≫ with f ∈ F<sub>A</sub>, or [[c′ : tl′, tr′]], where c′ ∈ C<sub>D<sub>Var∪F</sub></sub> (denoted by t′.c) is the smallest constraint with respect to <<sub>C<sub>D</sub></sub> appearing in the tree t′, tl′ (denoted by t′.l) is the left subtree of t′ representing its true branch, and tr′ (denoted by t′.r) is the right subtree of t′ representing its false branch. The path along a decision tree establishes a set of program states and a set of configurations (those that satisfy the encountered constraints), and leaf nodes represent partially-defined ranking functions over the given program states and configurations. The transfer function B-ASSIGN<sub>T<sub>T</sub></sub>(t′, x:=ae) substitutes the arithmetic expression ae for the variable x in linear constraints occurring within decision nodes of t′ and in functions occurring in leaf nodes of t′, whereas the transfer function FILTER<sub>T<sub>T</sub></sub>(t′, be) generates a set of linear constraints J from test be and restricts t′ such that all paths satisfy the constraints from J. Finally, both transfer functions increment the constant q of defined functions f ∈ F<sub>A</sub>\{⊥<sub>F</sub>, ⊤<sub>F</sub>} in all leaf nodes of t′.

We refer to [25,31] for a precise definition of all operations and transfer functions of the intervals, octagons, polyhedra, and termination decision tree domains.



#### 4.3 Decision tree lifted domains

We now define the decision tree lifted domain T(C<sub>D<sub>F</sub></sub>, A<sub>Var∪F</sub>), written T for short, for representing lifted analysis properties [16]. A decision tree t ∈ T(C<sub>D</sub>, A) is either a leaf node ≪a≫ with a ∈ A, or [[c : tl, tr]], where decision node c ∈ C<sub>D</sub> (denoted by t.c) is the smallest constraint with respect to <<sub>C<sub>D</sub></sub> appearing in the tree t, tl (denoted by t.l) is the left subtree of t representing its true branch, and tr (denoted by t.r) is the right subtree of t representing its false branch. The path along a decision tree establishes the set of configurations (those that satisfy the encountered constraints), and the leaf nodes represent their analysis properties.

Operations. The concretization function γ<sub>T</sub> of a decision tree t ∈ T(C<sub>D</sub>, A) returns γ<sub>A</sub>(a) for k ∈ K that satisfies the set C ∈ P(C<sub>D</sub>) of constraints accumulated along the top-down path to the leaf node a ∈ A.

The binary operations rely on the algorithm for tree unification [16,31], which finds a common labelling of decision nodes of two trees t<sub>1</sub> and t<sub>2</sub>. Note that the tree unification does not lose any information. All binary operations, including ordering ⊑<sub>T</sub>, join ⊔<sub>T</sub>, meet ⊓<sub>T</sub>, widening ∇<sub>T</sub>, and narrowing △<sub>T</sub>, are performed leaf-wise on the unified decision trees. For example, the ordering t<sub>1</sub> ⊑<sub>T</sub> t<sub>2</sub> of two unified decision trees t<sub>1</sub> and t<sub>2</sub> is defined recursively as:

$$\ll a_1 \gg \sqsubseteq_{\mathbb{T}} \ll a_2 \gg \;=\; a_1 \sqsubseteq_{\mathbb{A}} a_2, \qquad [\![c : tl_1, tr_1]\!] \sqsubseteq_{\mathbb{T}} [\![c : tl_2, tr_2]\!] \;=\; (tl_1 \sqsubseteq_{\mathbb{T}} tl_2) \wedge (tr_1 \sqsubseteq_{\mathbb{T}} tr_2)$$

The top is ⊤<sub>T</sub> = ≪⊤<sub>A</sub>≫, while the bottom is ⊥<sub>T</sub> = ≪⊥<sub>A</sub>≫.
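The recursive definition of the ordering can be sketched as follows for trees that have already been unified. The `Leaf` and `Node` classes, the string labels for constraints, and the use of integer intervals as the leaf domain are assumptions made for this illustration; tree unification itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    value: Tuple[int, int]  # an interval (lo, hi) as a toy leaf domain


@dataclass(frozen=True)
class Node:
    constraint: str  # label of the decision node, e.g. "A-3>=0"
    left: "Tree"     # true branch
    right: "Tree"    # false branch


Tree = Union[Leaf, Node]


def interval_leq(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """a is below b when interval a is contained in interval b."""
    return b[0] <= a[0] and a[1] <= b[1]


def tree_leq(t1: Tree, t2: Tree, leaf_leq: Callable = interval_leq) -> bool:
    """Leaf-wise ordering of two unified trees (same decision nodes)."""
    if isinstance(t1, Leaf) and isinstance(t2, Leaf):
        return leaf_leq(t1.value, t2.value)
    if isinstance(t1, Node) and isinstance(t2, Node) and t1.constraint == t2.constraint:
        return (tree_leq(t1.left, t2.left, leaf_leq)
                and tree_leq(t1.right, t2.right, leaf_leq))
    raise ValueError("trees are not unified")
```

The other binary operations follow the same recursion, with the leaf-level comparison replaced by the leaf-level join, meet, widening, or narrowing.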

Transfer functions. We define lifted transfer functions for tests, (forward and backward) assignments (ASSIGN<sub>T</sub>), and #if-s [16]. We consider two types of tests be and assignments x:=ae: those in which be and ae contain only program variables, and those in which be and ae contain both feature and program variables.

Transfer function ASSIGN<sub>T</sub><sup>4</sup> for handling an assignment x:=ae in the input tree t, when the set of variables in ae is vars(ae) ⊆ Var, is implemented by applying ASSIGN<sub>A</sub> leaf-wise, as shown in Algorithm 1. Similarly, transfer function FILTER<sub>T</sub> for handling tests be ∈ BExp, when vars(be) ⊆ Var, is implemented by applying FILTER<sub>A</sub> leaf-wise.
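Leaf-wise application of a single-program transfer function can be sketched as a map over the leaves of the tree. The data structure, the name `map_leaves`, and the toy interval transfer function for x := x + 1 are assumptions made for this illustration and are not taken from Algorithm 1.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    value: Tuple[int, int]  # interval for x, a toy single-program property


@dataclass(frozen=True)
class Node:
    constraint: str  # decision node over features, kept unchanged
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def map_leaves(t: Tree, transfer: Callable) -> Tree:
    """Apply a single-program transfer function to every leaf of t."""
    if isinstance(t, Leaf):
        return Leaf(transfer(t.value))
    return Node(t.constraint,
                map_leaves(t.left, transfer),
                map_leaves(t.right, transfer))


def assign_x_plus_one(iv: Tuple[int, int]) -> Tuple[int, int]:
    """Toy forward assignment x := x + 1 on an interval for x."""
    return (iv[0] + 1, iv[1] + 1)
```

Since the assigned expression mentions no features, the decision nodes are untouched and only the leaf properties change.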

Transfer function ASSIGN<sub>T</sub> for x:=ae, when vars(ae) ⊆ Var ∪ F, is given in Algorithm 2. It accumulates into the set C ∈ P(C<sub>D</sub>) (initialized to K) the constraints encountered along the paths of the decision tree t (Line 2), up to the leaf nodes in which the assignment is performed by ASSIGN<sub>A<sub>Var∪F</sub></sub>. That is, we first merge the constraints from the leaf node t defined over Var ∪ F and the constraints from decision nodes C ∈ P(C<sub>D<sub>F</sub></sub>) defined over F, by using the ⊎<sub>Var∪F</sub> operator, and then we apply ASSIGN<sub>A<sub>Var∪F</sub></sub> on the obtained result (Line 1).

<sup>4</sup> Note that ASSIGN is an abbreviation for both F-ASSIGN and B-ASSIGN.

Algorithm 2: ASSIGN<sub>T</sub>(t, x:=ae, C) when vars(ae) ⊆ Var ∪ F

1. if isLeaf(t) then return ASSIGN<sub>A<sub>Var∪F</sub></sub>(t ⊎<sub>Var∪F</sub> C, x:=ae);
2. else return [[t.c : ASSIGN<sub>T</sub>(t.l, x:=ae, C ∪ {t.c}), ASSIGN<sub>T</sub>(t.r, x:=ae, C ∪ {¬t.c})]];

Algorithm 3: FILTER<sub>T</sub>(t, be, C) when vars(be) ⊆ Var ∪ F

1. if isLeaf(t) then
2. &emsp;a′ = FILTER<sub>A<sub>Var∪F</sub></sub>(t ⊎<sub>Var∪F</sub> C, be);
3. &emsp;J = a′↾<sub>F</sub>;
4. &emsp;if isRedundant(J, C) then return ≪a′≫;
5. &emsp;else return RESTRICT(≪a′≫, C, J\C);
6. else return [[t.c : FILTER<sub>T</sub>(t.l, be, C ∪ {t.c}), FILTER<sub>T</sub>(t.r, be, C ∪ {¬t.c})]];

Transfer function FILTER<sub>T</sub> for test be, when vars(be) ⊆ Var ∪ F, is described by Algorithm 3. Similarly to ASSIGN<sub>T</sub> in Algorithm 2, it accumulates the constraints along the paths in a set C ∈ P(C<sub>D</sub>) up to the leaf nodes, and applies FILTER<sub>A<sub>Var∪F</sub></sub> on an abstract element obtained by merging constraints in the leaf node and in C (Line 2). The obtained result a′ is a new leaf node, and additionally a′ is projected on feature variables using the ↾<sub>F</sub> operator to generate a new set of constraints J that is added to the given path to a′ by using the function RESTRICT [16] (Lines 3–5). The function isRedundant(J, C) checks if the constraints from J are redundant with respect to the set C.

Finally, the transfer function for #if directives is defined as:

$$[\![\texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}]\!]_{\mathbb{T}}\, t = [\![s]\!]_{\mathbb{T}}(\mathrm{FILTER}_{\mathbb{T}}(t,\theta,\mathcal{K})) \sqcup_{\mathbb{T}} \mathrm{FILTER}_{\mathbb{T}}(t,\neg\theta,\mathcal{K})$$

where [[s]]<sub>T</sub>(t) is the transfer function for s and FILTER<sub>T</sub>(t, θ, K) is defined by Algorithm 3 since θ contains only features. The transfer function for assertions is: [[assert(be)]]<sub>T</sub> t = FILTER<sub>T</sub>(t, be, K).

After applying transfer functions, the obtained decision trees may contain some redundancy that can be exploited to further compress them. We use several optimizations [16]. For example, if the constraints on a path to some leaf are unsatisfiable, we eliminate that leaf node; if a decision node has two identical subtrees, then we keep only one subtree and eliminate the decision node.
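The second optimization mentioned above, removing a decision node whose two subtrees are identical, can be sketched with a bottom-up pass. The data structure and the function name `compress` are assumptions made for this example; the elimination of leaves on unsatisfiable paths is not shown, since it needs a constraint solver.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: object  # any comparable leaf property


@dataclass(frozen=True)
class Node:
    constraint: str
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def compress(t: Tree) -> Tree:
    """Remove decision nodes whose true and false subtrees coincide."""
    if isinstance(t, Leaf):
        return t
    left, right = compress(t.left), compress(t.right)
    if left == right:
        # The decision no longer distinguishes anything: keep one subtree.
        return left
    return Node(t.constraint, left, right)
```

Compressing the children first lets a redundancy deep in the tree propagate upwards, so a whole chain of decision nodes can collapse into a single leaf.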

#### 4.4 Decision tree-based lifted analysis

Operations and transfer functions of T(C<sub>D</sub>, D) and T(C<sub>D</sub>, T<sub>T</sub>) are used to perform the numerical and termination lifted analysis of program families, respectively. The numerical lifted analysis derived from T(C<sub>D</sub>, D), written as T<sub>F</sub> for short, is a pure forward analysis that infers numerical invariants in all program locations. We define the analysis function [[s]]<sub>T<sub>F</sub></sub> t that takes as input a decision tree t corresponding to the initial location of statement s, and outputs a decision tree over-approximating the numerical invariant in the final location of s. The input decision tree t<sup>K</sup><sub>in,F</sub> at the initial location of a program family has only one leaf node ⊤<sub>D<sub>Var∪F</sub></sub> and decision nodes that define the set K. Lifted invariants are propagated forward from the initial location towards the final location taking assignments, #if-s, and tests into account, with widening and narrowing around while-s. We apply delayed widening [9], which means that we start extrapolating by widening only after a fixed number of iterations of a loop have been analyzed explicitly.
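The effect of delayed widening can be seen on a single-program example with the interval domain. The sketch below computes an invariant for x at the head of the loop `while (x <= bound) x := x + 1`; the loop, the function names, and the omission of narrowing are simplifying assumptions of this illustration, which does not involve decision trees.

```python
import math


def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def leq(a, b):
    """Interval a is contained in interval b."""
    return b[0] <= a[0] and a[1] <= b[1]


def widen(a, b):
    """Standard interval widening: unstable bounds jump to infinity."""
    lo = a[0] if a[0] <= b[0] else -math.inf
    hi = a[1] if a[1] >= b[1] else math.inf
    return (lo, hi)


def loop_head_invariant(init, bound, delay):
    """Invariant for x at the head of `while (x <= bound) x := x + 1`."""
    inv, iterations = init, 0
    while True:
        if inv[0] <= bound:
            # Filter by the loop test, then apply the body x := x + 1.
            body = (inv[0] + 1, min(inv[1], bound) + 1)
            new = join(init, body)
        else:
            new = init
        if leq(new, inv):
            return inv  # post-fixpoint reached
        # Iterate explicitly during the delay, then extrapolate.
        inv = join(inv, new) if iterations < delay else widen(inv, new)
        iterations += 1
```

With a delay larger than the number of loop iterations the analysis finds the exact bound, whereas a short delay extrapolates the upper bound to infinity; a subsequent narrowing step could recover some of that precision.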

Similarly, we define the termination lifted analysis derived from T(C<sub>D</sub>, T<sub>T</sub>), written as T<sub>B</sub> for short. It is a pure backward analysis that infers ranking functions in all program locations. We define the analysis function [[s]]<sub>T<sub>B</sub></sub> t that takes as input a decision tree t in the final location of statement s, and outputs a decision tree over-approximating the ranking function in the initial location of s. The input decision tree t<sup>K</sup><sub>in,B</sub> at the final location of a program family has only one leaf node 0 (zero function) and decision nodes that define the set K. Lifted ranking functions are propagated backward from the final towards the initial location.

We establish correctness of the lifted analysis based on T(C<sub>D</sub>, A) by showing that it produces results identical to those of the Brute-Force enumeration approach based on the domain A. Let [[s]]<sub>T</sub> denote the transfer function of statement s of IMP in T(C<sub>D</sub>, A), and let [[s]]<sub>A</sub> denote the transfer function of statement s of IMP in A. Given t ∈ T(C<sub>D</sub>, A), we denote by Pr<sub>k</sub>(t) ∈ A the leaf node of tree t that corresponds to the variant k ∈ K.

Theorem 2. Pr<sub>k</sub>([[s]]<sub>T</sub>(t)) = [[π<sub>k</sub>(s)]]<sub>A</sub>(Pr<sub>k</sub>(t)) for all k ∈ K.

Example 2. In Figs. 7 and 8 we depict decision trees at locations O<sup>2</sup> and O<sup>h</sup> inferred by performing (forward) numerical analysis based on the domain T(C<sub>P</sub>, P) of the Loop1a program family (see Section 2). In order to enforce convergence of the analysis, we apply the widening operator at the loop head, i.e. at the location O<sup>h</sup> before the while test. We can see how the invariant at location O<sup>5</sup> shown in Fig. 1 is inferred from the invariant at location O<sup>h</sup>.

Subsequently, we perform a (backward) lifted termination analysis based on the domain T(C<sub>P</sub>, T<sub>T</sub>) of the Loop1a sub-family satisfying (A-B ≥ 3). Lifted decision trees inferred at locations O<sup>h</sup> and O<sup>1</sup> are shown in Figs. 9 and 2, respectively. We can see how, by back-propagating the tree at location O<sup>h</sup>, denoted t<sub>O<sup>h</sup></sub> (see Fig. 9), via assignments y := 0 and x := A at location O<sup>1</sup>, we obtain the tree at location O<sup>1</sup>, denoted t<sub>O<sup>1</sup></sub> (see Fig. 2). The transfer function B-ASSIGN<sub>T</sub>(t<sub>O<sup>h</sup></sub>, x := A) will generate the tree t<sub>O<sup>1</sup></sub> where x is replaced with A. The new decision node (A≥B+1) and the leaf node with ranking function 2 are eliminated from t<sub>O<sup>1</sup></sub> since they are redundant with respect to (A-B≥3).

Fig. 7: Invariant at loc. O<sup>2</sup> of Loop1a. Fig. 8: Invariant at loc. O<sup>h</sup> of Loop1a. Fig. 9: Ranking fun. at loc. O<sup>h</sup> of Loop1a.

## 5 Synthesis Algorithm

We can now solve the quantitative sketching problem using lifted analysis algorithms. More specifically, we delegate the effort of conducting an effective search of all possible sketch realizations to an efficient lifted static analyzer, which combines the forward numerical and the backward termination analyses.

The synthesis algorithm SYNTHESIZE(ŝ : Stm) for solving a sketch ŝ is given in Algorithm 4. First, we transform the program sketch ŝ into a program family s = Rewrite(ŝ) (Line 1). Then, we call function [[s]]<sub>T<sub>F</sub></sub> t<sup>K</sup><sub>in,F</sub> to perform the forward numerical lifted analysis of s. The inferred decision tree t<sub>F</sub> at the final location of s is analyzed by function FindCorrect (Line 3) to find the sets of variants for which non-⊥<sub>D</sub> and non-⊤<sub>D</sub> leaf nodes are reachable. The variants for which a ⊥<sub>D</sub> leaf node is reachable are "incorrect" with respect to the given assertions, whereas the variants for which a ⊤<sub>D</sub> leaf node is reachable are "I don't know" (inconclusive). For each non-⊥<sub>D</sub> and non-⊤<sub>D</sub> leaf node, we generate the set of variants K′ ⊆ K that satisfy the encountered linear constraints along the top-down path to that leaf node as well as the given assertions. For each such "correct" set of variants K′, we perform the backward termination lifted analysis [[s]]<sub>T<sub>B</sub></sub> t<sup>K′</sup><sub>in,B</sub>. The inferred decision tree t<sub>B</sub> is analyzed by function FindOptimal (Line 7). It calls the Z3 solver [26] to solve the following optimization problem: find a model that minimizes the value of ranking functions t′ ∈ T<sub>T</sub>, such that the linear constraints along the top-down paths to those leaf nodes are satisfied. More formally, given a decision tree t ∈ T(C<sub>D</sub>, T<sub>T</sub>), we define the function ϕ[C]t that finds a set of pairs (k, t′) consisting of valid configurations k ∈ K and the corresponding ranking function t′ ∈ T<sub>T</sub> as follows:

$$\phi[C](\ll t' \gg) = \{ (k, t') \mid k \in \mathcal{K},\ k \models C \}, \qquad \phi[C]([\![c : tl, tr]\!]) = \phi[C \cup \{c\}](tl) \cup \phi[C \cup \{\neg c\}](tr)$$

The optimization problem is the following. Given a decision tree t<sub>B</sub> ∈ T(C<sub>D</sub>, T<sub>T</sub>) inferred at the initial location of s, find a configuration k ∈ K such that the corresponding ranking function is minimal. That is, min<sub>k∈K</sub> {t′ | (k, t′) ∈ ϕ[K]t<sub>B</sub>}.
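The function ϕ and the minimization can be illustrated on a toy tree by explicit enumeration. Note that this sketch swaps the Z3-based optimization for brute-force enumeration over small finite feature domains, and it assumes that the ranking functions in the leaves depend only on features; the tuple encoding and all names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Tuple, Union

# (a1, ..., ak, c) encodes a1*A1 + ... + ak*Ak + c over the features (assumed encoding)
Linear = Tuple[int, ...]


@dataclass(frozen=True)
class Leaf:
    ranking: Optional[Linear]  # None stands for an undefined leaf


@dataclass(frozen=True)
class Node:
    constraint: Linear  # holds for k when its value at k is >= 0
    left: "Tree"        # true branch
    right: "Tree"       # false branch


Tree = Union[Leaf, Node]


def value(f: Linear, k: Tuple[int, ...]) -> int:
    *coeffs, const = f
    return sum(a * v for a, v in zip(coeffs, k)) + const


def leaf_for(t: Tree, k: Tuple[int, ...]) -> Leaf:
    """Follow the decision nodes satisfied by configuration k."""
    while isinstance(t, Node):
        t = t.left if value(t.constraint, k) >= 0 else t.right
    return t


def phi(t: Tree, domains):
    """Pairs (k, bound) for all configurations that reach a defined leaf."""
    pairs = []
    for k in product(*domains):
        leaf = leaf_for(t, k)
        if leaf.ranking is not None:
            pairs.append((k, value(leaf.ranking, k)))
    return pairs


def find_optimal(t: Tree, domains):
    """Configuration with the smallest ranking-function value, if any."""
    return min(phi(t, domains), key=lambda pair: pair[1], default=None)
```

Enumeration is only feasible for small domains; delegating the minimization to an optimizing solver avoids iterating over every configuration.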

The configuration k with the minimal ranking function found by Z3 is reported as a "correct and optimal" solution to the quantitative sketching problem.

Theorem 3. SYNTHESIZE(ŝ) is correct and terminates.

Algorithm 4: SYNTHESIZE(ŝ : Stm)

1. s = Rewrite(ŝ);
2. t<sub>F</sub> = [[s]]<sub>T<sub>F</sub></sub> t<sup>K</sup><sub>in,F</sub>;
3. C = FindCorrect(t<sub>F</sub>);
4. while C ≠ ∅ do
5. &emsp;K′ = C.remove();
6. &emsp;t<sub>B</sub> = [[s]]<sub>T<sub>B</sub></sub> t<sup>K′</sup><sub>in,B</sub>;
7. &emsp;sol.insert(FindOptimal(t<sub>B</sub>))
8. return sol

## 6 Evaluation

We evaluate our approach for program sketching by comparing it with the Brute-Force enumeration approach and the popular Sketch tool.

Implementation We have implemented our synthesis algorithm for quantitative program sketching [14] within the FamilySketcher tool [17]. It uses the lifted decision tree domain T(C<sub>D</sub>, A), where A is instantiated either to a numerical abstract domain D or to the termination decision tree domain T<sub>T</sub>. The abstract operations and transfer functions for the numerical domains D (intervals, octagons, polyhedra) are provided by the APRON library [23], while those for the termination decision tree domain are provided by the Function tool [32]. The tool is written in OCaml and consists of around 7K LOC. The current tool provides only limited support for arrays, pointers, struct and union types. The only basic data type is mathematical integers, which is sufficient for our evaluation.

Within FamilySketcher, we have also implemented the Brute-Force enumeration approach that analyzes all variants (sketch realizations), one by one, using the single-program domains D and T<sub>T</sub>.

Experiment setup and Benchmarks All experiments are executed on a 64-bit Intel® Core™ i7-1165G7 CPU @ 2.80GHz, VM LUbuntu 20.10, with 8 GB memory, and we use a timeout value of 300 seconds. All times are reported as the average over five independent executions. We report the times needed for the actual analysis task to be performed. The implementation is available from [14]: https://zenodo.org/record/5898643#.YhJLRejMLIU. We compare our approach with the program sketching tool Sketch version 1.7.6, which uses SAT-based counterexample-guided inductive synthesis [30,29], and with the Brute-Force enumeration approach. The evaluation is performed on several C numerical sketches collected from the Sketch project [30,29], SV-COMP (https://sv-comp.sosy-lab.org/), and the SyGuS-Competition [1]. We use the following benchmarks: Loop1a and Loop1b (Sec. 2), Loop2a and Loop2b (Fig. 10), LoopCond (Fig. 11), NestedLoop (Fig. 12), vmcai2004 (Fig. 13).

Performance Results Table 1 shows the results of synthesizing our benchmarks. Note that Sketch reports only one "correct" solution for each sketch, which does not have to be "optimal" with respect to the given quantitative objective. FamilySketcher and Brute-Force use the polyhedra domain as parameter.

Fig. 10: Loop2a. Fig. 11: LoopCond. Fig. 12: NestedLoop. Fig. 13: vmcai2004.

The Loop1a and Loop1b sketches are handled symbolically by the (SR) rule. Thus, our approach does not depend on the sizes of hole domains. FamilySketcher terminates in (around) 0.016 sec for Loop1a and in 0.026 sec for Loop1b. In contrast, Brute-Force and Sketch do depend on the sizes of holes. Sketch terminates in 37.74 sec (resp., 2.44 sec) for 16-bit holes for Loop1a (resp., Loop1b). It times out for bigger hole sizes of Loop1a. Sketch often reports "correct & optimal" solutions for both sketches. Similarly, our tool can handle Loop2a and Loop2b symbolically in 0.060 sec and 0.047 sec. However, Sketch cannot resolve them, since it unrolls the loop 8 times by default. If the loop is unrolled 11 times, Sketch terminates but often reports the empty solution.

The LoopCond sketch contains one hole that can be handled symbolically by the (SR) rule. FamilySketcher has similar running times for all domain sizes, reporting the solution ?? ≥ 2 and ranking function 4x+8. In contrast, Sketch resolves this example only if the loop is unrolled as many times as the size of the hole and inputs (e.g., 32 times for 5 bits). Hence, Sketch's performance declines as the size of the hole grows, and it times out for 16 bits.

The NestedLoop sketch contains two holes that can be handled symbolically by the (SR) rule. FamilySketcher terminates in (around) 0.126 sec for all sizes of holes. The "correct" solution is (??<sub>1</sub> − ??<sub>2</sub> ≥ 0) ∧ (Min ≤ ??<sub>2</sub> ≤ 1), while the "correct & optimal" solution is (??<sub>1</sub> = ??<sub>2</sub> = 0) with ranking function 13. On the other hand, Brute-Force takes 65.03 sec for 5-bit holes and times out for larger sizes, while Sketch cannot resolve this benchmark.

The vmcai2004 sketch contains two holes. The first one, ??<sub>1</sub>, is handled symbolically by the (SR) rule, while the second one, ??<sub>2</sub>, is handled explicitly by the (ER) rule. The performance of FamilySketcher depends on the size of ??<sub>2</sub>. The decision tree inferred in the location before the assertion contains one leaf node for each possible value of feature B (features A and B represent ??<sub>1</sub> and ??<sub>2</sub>). The sub-family of "correct" solutions is (1 ≤ A ≤ Max) ∧ (B ≥ 10), while the "correct & optimal" solution is (A=1) ∧ (B=10) with ranking function 6. Sketch scales better in this case, reporting one "correct" (but not "optimal") solution. However, FamilySketcher still outperforms the Brute-Force approach.


Table 1: Performance results of FamilySketcher vs. Sketch vs. Brute-Force. FamilySketcher and Brute-Force use Polyhedra domain. All times in sec.

Discussion In summary, we can see that FamilySketcher often outperforms Sketch, especially in the case of sketches that can be handled symbolically by the (SR) rule. However, for sketches with holes that need to be handled by the (ER) rule, the performance of our tool declines, which is the consequence of the need to explicitly consider all values of those holes. Even in this case, FamilySketcher scales better than Brute-Force. This is due to the fact that Brute-Force compiles and executes the fixed-point iterative algorithm once for each variant, while our approach does it once for the whole family and can additionally share analysis results between variants. Moreover, FamilySketcher reports the "correct & optimal" solution, while Sketch reports the first found "correct" solution.

Threats to validity The current tool has only limited support for arrays, pointers, struct and union types. However, the above features are largely orthogonal to the solution proposed here. In particular, these features complicate the semantics of single-programs and the implementation of domains for leaf nodes, but have no impact on the semantics of variability-specific constructs. We perform lifted analysis of relatively small benchmarks. However, the focus of our approach is to combat the realization space blow-up of sketches, not their LOC size. So, we expect to obtain similar or better results for larger benchmarks. Although we analyze a relatively small set of benchmarks, we expect the results to carry over to other benchmarks.

## 7 Related Work

The proposed program sketcher uses numerical abstract domains as parameters, so it can be applied for synthesizing programs with numerical data types. The existing widely-known sketching tool Sketch [29,30], which uses SAT-based counterexample-guided inductive synthesis, is more general and especially suited for synthesizing bit-manipulating programs. However, Sketch reasons about loops by unrolling them, so it is very sensitive to the degree of unrolling. Our approach, being based on abstract interpretation, does not have this constraint, since we use the widening extrapolation operator to handle unbounded loops and an infinite number of execution paths in a sound way. This is stronger than fixing a priori a bound on the number of iterations of loops as in the Sketch tool. Moreover, Sketch may need several iterations to converge, reporting only one solution. On the other hand, our approach needs only one iteration to perform lifted analysis, reporting several, and very often all, "correct" solutions. This is the key for applying our approach to solve the quantitative sketching problem. Another work for solving a quantitative sketching problem is proposed by Chaudhuri et al. [6]. The quantitative optimum they consider is that the expected output value on probabilistic inputs is minimal [5]. They use smoothed proof search and probabilistic analysis to implement this approach in the Fermat tool built on top of Sketch. In contrast, in this work the quantitative optimum we consider is that the worst-case behavior of the program is minimal.

Recently, several works have been proposed that solve the sketching synthesis problem using product line analysis and verification algorithms. Ceska et al. [4] use a counterexample guided abstraction refinement technique for analyzing product lines to resolve probabilistic PRISM sketches. The work [17] uses a (forward) numerical lifted analysis based on abstract interpretation to resolve numerical sketches. We extend here this approach by considering the more general quantitative sketching problem, where we additionally employ a (backward) termination lifted analysis to find a solution that is not only "correct" but also "optimal" with respect to the given quantitative objective.

Several lifted analyses based on abstract interpretation have been proposed recently [24,11,12,16,18,15,13] for analyzing program families with #if-s. Midtgaard et al. [24] have proposed the lifted tuple-based analysis, while the work [11,12] improves the tuple representation by using lifted binary decision diagram (BDD) domains. They are applied to program families with only Boolean features. Subsequently, the lifted decision tree domain has been proposed to handle program families with both Boolean and numerical features [16,18], as well as dynamic program families where features can change during run-time [15]. The above lifted analyses are forward and infer numerical invariants, while a backward termination analysis for inferring ranking functions is proposed in [13].

Decision-tree abstract domains have been used in the abstract interpretation community recently [10,7,32]. Segmented decision tree abstract domains have enabled path dependent static analysis [10,7]. Their elements contain decision nodes that are determined either by values of program variables [10] or by the if conditions [7], whereas the leaf nodes are numerical properties. Urban and Miné [31,32] use decision tree abstract domains to prove program termination.

## 8 Conclusion

In this work, we proposed a new approach for synthesis of program sketches, such that the resulting program satisfies the combined boolean and quantitative specifications. We have shown that both reasoning tasks can be accomplished using a combination of forward and backward lifted analysis. We experimentally demonstrate the effectiveness of our approach on a variety of C benchmarks.

## References


https://doi.org/10.1016/j.cola.2021.101032



## SixthSense: Debugging Convergence Problems in Probabilistic Programs via Program Representation Learning

Saikat Dutta, Zixin Huang, and Sasa Misailovic

University of Illinois, Urbana, Illinois, 61820, USA {saikatd2,zixinh2,misailo}@illinois.edu

Abstract. Probabilistic programming aims to open the power of Bayesian reasoning to software developers and scientists, but identification of problems during inference and debugging are left entirely to the developers and typically require significant statistical expertise. A common class of problems when writing probabilistic programs is the lack of convergence of the probabilistic programs to their posterior distributions.

We present SixthSense, a novel approach for predicting probabilistic program convergence ahead of run and its application to debugging convergence problems in probabilistic programs. SixthSense's training algorithm learns a classifier that can predict whether a previously unseen probabilistic program will converge. It encodes the syntax of a probabilistic program as motifs – fragments of the syntactic program paths. The decisions of the classifier are interpretable and can be used to suggest the program features that contributed significantly to program convergence or non-convergence. We also present an algorithm for augmenting a set of training probabilistic programs that uses guided mutation.

We evaluated SixthSense on a broad range of widely used probabilistic programs. Our results show that SixthSense features are effective in predicting convergence of programs for given inference algorithms. SixthSense obtained an Accuracy of over 78% for predicting convergence, substantially above the state-of-the-art techniques for predicting program properties, Code2Vec and Code2Seq. We show the ability of SixthSense to guide the debugging of convergence problems: it pinpoints the causes of non-convergence significantly better than Stan's built-in warnings.

Keywords: Probabilistic Programming · Debugging · Machine Learning

#### 1 Introduction

Probabilistic programs (PP) express complicated Bayesian models as simple computer programs, used in various domains [22, 38, 44, 54], including important applications such as epidemic modeling [23] and single-cell genomics [42]. Probabilistic languages extend conventional languages with constructs for sampling from probability distributions (prior), conditioning on data, and probabilistic queries, such as the distribution reshaped by conditioning on the data (posterior) [26]. Probabilistic programming systems (PP systems) compile the programs and compute the results using an efficient inference algorithm, while hiding the intricate details of inference. Most practical inference algorithms are non-deterministic and approximate. For instance, Markov Chain Monte Carlo (MCMC) algorithms [28, 40, 48] run a probabilistic program multiple times (each run is referred to as an iteration) to sample data points from the posterior distribution. They drive today's popular PP systems, such as Stan [9].

MCMC algorithms have a nice theoretical property: in the limit, the samples they generate come from the correct posterior distribution. But, in practice, a user can only execute the algorithm for a finite time budget and hence needs to fine-tune the algorithms to balance between quality of inference and execution time. This complicates development: the programmer needs to write the program in a way that interacts well with the algorithm and select some parameters specific for the inference algorithms. For instance, inference may fail to properly initialize, silently produce inaccurate results, or generate non-independent samples from the posterior distribution. Even identifying and afterward resolving these challenges currently requires significant statistical expertise.

An important property for successful inference is convergence, since non-convergence is often a cause of inaccurate (or wrong) results. Convergence means that the samples generated by the inference algorithm represent the target distribution. While there exist metrics for convergence (e.g., the Gelman-Rubin diagnostic [25]) in the statistics literature, a comprehensive study of which model features could cause non-convergence is lacking. A data-driven understanding of the causes could help developers debug non-convergence issues without requiring expert knowledge. Moreover, the existing convergence diagnostics are not predictive: they cannot be determined ahead of time, i.e., without running the program. Building a prediction model for convergence ahead of time would save the time needed to run programs (often minutes or more). It would also enable a faster program debug/update cycle.

#### 1.1 SixthSense

We present SixthSense, the first approach for identifying convergence problems in probabilistic programs ahead-of-run. SixthSense adopts a learning approach: it trains a classifier that can, for a previously unseen probabilistic program and its data, predict whether the program will converge in a specified number of steps (for a given threshold of the Gelman-Rubin diagnostic). The decisions of the classifier are interpretable and can be used to suggest which program features lead to the convergence or non-convergence of the program.

To train such a classifier, SixthSense needs to overcome several challenges that are beyond the big-code techniques studied for conventional languages [4, 5, 31, 37, 47]. First, probabilistic programs are small (20-100 lines of code) compared to conventional programs, but their execution is complicated, with conditioning statements for data and non-standard semantics that perform Bayesian inference. Second, due to their relative novelty, there are few publicly available probabilistic programs that can be used for training. Finally, we should be able to interpret why a program is predicted to converge or not to converge, in order to guide developers in debugging non-convergence issues.

Representing Structural, Data, and Runtime Features: To learn a classifier, we embed the syntactic and semantic program features in a numerical vector. To encode program structure, we observe that many snippets of code in probabilistic programs form patterns (sampling from distributions, hierarchical models, relations between variables) that may repeat within the single program or across programs. We identify those patterns as motifs – fragments of probabilistic program code, consisting of several adjacent abstract-syntax-tree nodes (e.g., neighboring statements or expressions).

SixthSense learns the set of features from the subset of motifs it identifies in the code. It groups together similar motifs by calculating a low-dimensional representation of the motifs using randomized discrete projections [8]. This way, it can balance the accuracy of prediction and the size of the learned models. We also engineered a set of data features (e.g., means, variances) and the runtime features – diagnostics from early warmup iterations that the inference algorithms compute as they execute. These features cannot be learned by the approaches that focus on static code features [4, 5, 31, 47].

Mutation-Based Program Generation: We present a novel technique based on program and data mutations that produces a diverse set of probabilistic programs with a good balance between converging and non-converging programs, with the goal to augment the training set. Our technique takes a set of seed programs as input, analyzes them and applies a set of pre-defined mutations which aim to change the semantics of generated programs. To obtain better diversity, our algorithm identifies (via locality-sensitive hashing [6]) and discards any mutant that is too similar to the one that was generated before.

Interpretable Predictor Results: For problem diagnosis and debugging of probabilistic programs, it is important to be able to interpret why the algorithm predicted non-convergence. Our learning algorithm leverages random forests for this task. It relates the likely cause of non-convergence to specific statements or expressions in the program code.

#### 1.2 Results

In this work, we learn the classifiers for convergence of three popular classes of probabilistic programs: Regression, Time Series, and Mixture Models. We obtained 166 seed programs, across the three classes, from an open source repository of Stan programs [52]. For each class, SixthSense generated more than 10,000 mutants. We train our classifiers for multiple thresholds of the convergence score (Gelman-Rubin diagnostic) to evaluate the sensitivity of our classifiers.

Our evaluation shows the effectiveness of SixthSense in predicting convergence of probabilistic programs compared to two state-of-the-art learning algorithms for conventional code: Code2Vec [5] and Code2Seq [4]. We measure the prediction quality via Accuracy (ratio of sum of True Positives and True Negatives to total tested programs), Precision (ratio of True Positives to total classified as Positives) and Recall (ratio of True Positives to total actual Positives). Here True Positive is a program that is predicted to converge and it indeed converges; the others are defined analogously.
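The three scores defined above can be illustrated with a minimal sketch. The function name `classification_metrics` and the boolean encoding (True meaning "converges") are assumptions made for this illustration and are not taken from the SixthSense implementation.

```python
def classification_metrics(predicted, actual):
    """Return (accuracy, precision, recall) for boolean labels.

    True stands for 'converges'. A True Positive is a program that is
    predicted to converge and indeed converges.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no positive predictions / no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical predictions for five programs.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
accuracy, precision, recall = classification_metrics(predicted, actual)
```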

SixthSense obtains an average Accuracy score across the three model classes of 78% for convergence prediction (with almost equally high precision and recall). SixthSense, with just code features, outperforms Code2Vec [5] by 8 percentage points on average and Code2Seq [4] by 5 percentage points on average (for a tight convergence threshold). Moreover, we also show that Accuracy scores increase to over 83% when adding runtime features obtained after just the first 10-200 samples from the warmup stage of the inference algorithm (which is less than 10% of its run-time). SixthSense also has higher precision for all model classes, and recall higher than Code2Vec but similar to Code2Seq. SixthSense's prediction time is less than a second and the model size is modest – less than 20 MB, which is 25-37% smaller than Code2Vec/Code2Seq.

We further demonstrate, by studying 40 non-converging programs, that SixthSense can pinpoint the locations in the code that cause non-convergence for 29 programs. In contrast, Stan's runtime warnings point to non-convergence causes in only 5 programs.

#### 1.3 Contributions

We highlight the main contributions of this paper:


#### 2 Example

We describe how SixthSense computes motifs, trains the predictor and demonstrate how we can use it to guide the debugging of probabilistic programs. Figure 1 shows two variants of a Mixture model in Stan. A Mixture Model is a probabilistic model that assumes that each observed data point comes from one out of N independent sub-distributions of values. Each sub-distribution has an associated probability (called mixing ratio) of being chosen.

The programs A and B in Figures 1(a) and 1(b) have several (unknown) parameters: mean mu and variance sigma of the normal sub-distribution; theta is the mixing ratio of the sub-distributions and p1 is an auxiliary parameter. The programs also access the array of observations, y, of size K. Each observation in y is assumed to be sampled from one of these two sub-distributions: a normal distribution (as normal_lpdf) or a uniform distribution (as the constant 0.5). For program B, consider a novice developer who was confused about Stan's target statement [51] and calculated the negative likelihood instead.

<sup>1</sup> SixthSense is publicly available at https://github.com/uiuc-arc/sixthsense.

Fig. 1: An example of two models with different convergence behaviors. We obtain the features from the Abstract Syntax Tree (AST) of source code and data (not shown here). We use them as inputs to the trained Random Forest Model for predicting the label (Converging/Not Converging). We can also obtain the most important features which likely contributed to (non)-convergence.

When run with Stan's default NUTS inference algorithm for 1000 iterations, the program A converges and the program B does not converge. Our goal is to predict, before running the programs, whether they will converge. If they do not converge, we would also want to know why and use this information to debug the program.

Feature Extraction. First, we extract different classes of features for each program in the corpus of mutants. These include motifs – fragments of the AST, augmented with data features, and run-time features. To extract motifs, we parse each program and construct an AST. Then, starting from each node, we obtain all AST paths of length L by traversing the ancestors of the node. Figures 1(c) and 1(d) present one sub-tree for the function call statement (in loop) in the programs A and B respectively, and several motifs that SixthSense extracts. The elements of a motif are the sequence of node type IDs, which serve as feature vectors.

A good learning algorithm should be able to combine similar motifs and operate only on groups of them. To identify such groups of motifs, we apply random discrete projections, a well-known technique for reducing the dimensionality of the feature space. It maps the feature vectors of the IDs onto a hash value with a much smaller dimension. The random projections algorithm has a distance-preserving property, which means that the similar vectors (even when they are not grouped together) will have similar low-dimensional representations. This property allows us to apply standard learning algorithms on this low-dimensional representation while preserving the similarity of the original motifs.

Computing Reference Solutions and Labels. To compute the program labels (i.e., 'converging', 'not-converging'), SixthSense runs them for the default 1000 iterations using Stan's MCMC algorithm (NUTS). For convergence, we calculate a well-known diagnostic called the Gelman-Rubin ($\hat{R}$) statistic [25]. If the $\hat{R}$ statistic is within a certain bound (close to 1.0), it indicates that the program converged.
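As an illustration of how such a label could be derived, the following sketch computes the classic (non-split) Gelman-Rubin statistic for a single parameter from several chains and compares it to a threshold. It is a simplified textbook formulation, not necessarily the exact variant that Stan or SixthSense uses; the function names and the default threshold of 1.05 are assumptions chosen for this example.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic R-hat for one parameter.

    `chains` is a list of equally long lists of samples (one list per chain).
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average within-chain sample variance.
    w = mean(variance(c) for c in chains)
    # B: between-chain variance, scaled by the chain length.
    b = n * variance(chain_means)
    if w == 0:
        # Degenerate case: constant chains. Equal means count as converged.
        return 1.0 if b == 0 else float("inf")
    var_plus = (n - 1) / n * w + b / n
    return sqrt(var_plus / w)


def label_converged(chains, threshold=1.05):
    """Assumed labeling rule: True if R-hat is within the threshold."""
    return gelman_rubin(chains) <= threshold


# Two chains stuck in different regions yield a large R-hat.
apart = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
print(label_converged(apart))
```

A program with several parameters would need one such value per parameter; how the per-parameter values are combined into a single program label is not specified here.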

Training. Given a sufficient number of training programs, SixthSense extracts the features and gets the labels for convergence. SixthSense then generates precise and interpretable predictors. We build separate models for predicting convergence for each model class, since models in three classes are significantly different in both semantics and the way they interact with inference algorithms. The model classes are easy to identify for users without expertise or through simple analytical tools.

Prediction. We use the classifier trained on the batch of Mixture Models for convergence. We use a threshold of 1.05 for the Gelman-Rubin diagnostic (a very tight bound). SixthSense correctly predicts the True label for the program in Figure 1(a) and the False label for the program in Figure 1(b). The total time required for computing the features and doing the prediction for a single program is less than a second, compared to 53 seconds on average to run a program.

Interpretation and Debugging. Our combination of random projections – which groups very similar motifs together, even if they appear at different locations in the program – and the random forest classification – which can easily explain its decisions – proves effective in identifying the parts of the program that impede convergence. Namely, we can employ SixthSense's random forest classifier to identify top features. When SixthSense predicts non-convergence, the user can debug the program according to the top features.

Now consider the scenario where a novice Stan developer used the negative log-likelihood in Stan's target statement, and wrote program B (Figure 1(b)). SixthSense predicts that B does not converge, and gives the topmost feature as the path segment (motif) starting from the negative sign to the parameters in the log-likelihood calculation (function log_mix). Figure 2 presents this motif. There were three such motifs in program B (one for each argument of the log_mix function), highly contributing to the non-convergence prediction. In contrast, this motif is missing from program A (Figure 1(a)), and thus has negatively contributed in the converging prediction. This observation validates our earlier intuition about the cause of the difference in the nature of the two programs and is correctly inferred by our prediction model.

It is intuitive for the user to fix a non-converging program by altering the program code that corresponds to the top features. For program B, after the topmost motif indicates the location that contributes to non-convergence, removing the negative sign would allow the program to converge. After applying the change, the user can use SixthSense to predict again, or even iteratively search for a good fix. This iterative debugging would be much faster than running through the full compilation and execution with Stan. At the same time, SixthSense can provide more directed warning messages.

Fig. 2: Topmost motif in program B

#### 3 Overview

Figure 3 shows the architecture of SixthSense. We next describe each of its components.

Feature Computation. SixthSense's features can be broadly divided into three major groups: (1) automatically selected AST (Abstract Syntax Tree) based features - motifs - which represent fragments of the AST; (2) data features; and (3) runtime features of the inference algorithm. We present our feature selection and summarization in Section 4.

Fig. 3: SixthSense Training Workflow

Program Generation. The generator uses the input set of seed programs to generate a batch of mutants. We use two sets of transformations to mutate the program: (1) Expansive Mutations produce more complex models compared to the original ones (e.g., add a new parameter), and (2) Reducing Mutations simplify the models by simplifying arithmetic expressions, removing conditional statements, etc. Our adaptive mutator uses nearest neighbor algorithms to efficiently explore the feature space of the programs. We explain the mutations and the algorithms in Section 5.

Program Runner. It runs each generated mutant and collects several statistics such as samples from MCMC iterations and runtimes.

Metric Calculator. Typically, the MCMC algorithms provide samples for each parameter from the posterior distribution. The metric calculator computes the convergence for each parameter using the samples from the posterior.

Model Trainer. Using the syntax, data and runtime features and metrics computed by the previous components, the Model Trainer builds a machine learning model for predicting the behavior of probabilistic models for the given inference algorithm. Here, we used Random Forest Classifier.

We build models to predict, for given metric thresholds, (1) convergence of the models using static features of model and data, (2) convergence of the models using static features and run-time diagnostics from initial phases of sampling, and (3) the iteration count for which the model will converge.

Deploying the Trained Model. Once the trainer produces the model, we can use it to predict the convergence of new programs. For a given program and its dataset, SixthSense runs the feature extractor, runs it through the predictor and outputs the convergence label. It also reports on the features that contributed most to the prediction, and relates them back to the source code.

#### 4 Learning Program Features

We present the description of the programs and SixthSense's approach for collecting code, data, and runtime features.

Probabilistic Programs Syntax. A probabilistic program is an imperative program with additional constructs for sampling from distributions, conditioning the model on observed data values, and one or more queries for either the posterior distribution or expected value of a parameter. In this work, we use a subset of the syntax of Storm-IR [19] for representing probabilistic programs, as shown in Figure 4.

```
x    ∈ Vars
c    ∈ Consts ∪ {−∞, ∞}
aop  ∈ {+, −, ∗, /, ^}
bop  ∈ {=, >, ...}
Dist ∈ {Normal, Uniform, ...}
ID   ∈ String

Type    ::= Int | Float
Decl    ::= x : Type | x : [c+]
Expr    ::= c | x | Expr aop Expr | Expr bop Expr
Stmt    ::= x = Expr | Decl | observe(Dist(Expr+), x)
          | x ∼ Dist(Expr+) | for x ∈ 1..n; {Stmt∗}
          | if (Expr) then Stmt∗ else Stmt∗
Query   ::= posterior(x) | expectation(x)
Program ::= Stmt∗ Query∗
```
Fig. 4: Syntax of Storm-IR [19]

Representing Program Paths. To understand the causes of non-convergence and for better debuggability, we select a representation that is easy to train and interpret. Existing approaches Code2Vec/Code2Seq [4, 5] aim to predict variable names through natural-language semantics, and they encode the path between any two terminal nodes in the Abstract Syntax Tree (AST). Instead, we encode the sequences of AST nodes with limited length to pinpoint the semantic issues. We formalize our notions:

Definition 1. (Abstract Syntax Tree) Similar to [5], we define an AST for a program P as a tuple $\langle N, T, X, s, \delta, \phi, \psi \rangle$. $N$ is a set of non-terminal nodes, $T$ is the set of terminal nodes, $X$ is a set of values, $s \in N$ is the root node, $\delta : N \to (N \cup T)^*$ is a function which maps each non-terminal node to a list of its children, $\phi : T \to X$ is a function which maps each terminal node to some value, and $\psi : N \to \mathbb{N}$ maps each non-terminal node to a unique natural number.

Definition 2. (AST Path) An AST path is a path between the nodes in the AST, which starts from one non-terminal node and ends at another non-terminal node, passing through the ancestors of each node at each step.

Definition 3. (Motifs) A Motif encodes an AST path from a node passing through the ancestors of length up to L. For a given AST path $\langle N_1, N_2, \ldots, N_L \rangle$, where $N_i \in \delta(N_{i+1})$ for all $i \in 1..L-1$, we can define the motif as the list $\langle I_1, I_2, \ldots, I_L \rangle$, where $I_m = \psi(N_m)$ for all $m \in 1..L$.

#### 4.1 Extracting Features from Programs

Motivation. Two major challenges in efficiently encoding the motifs in a feature vector include (1) the large numbers of different paths that a program may have, and (2) the variability of length between different paths. A general approach to solve both problems is to design a flexible scheme for dimensionality reduction, which encodes the rich structures, like our motifs as a smaller set of program properties.

We rest our approach on two observations. First, despite a huge number of possible syntactic paths, similar motifs repeat often in a single program and across multiple programs. Therefore, we need to think only about the subsets of all possible paths that appear in the corpus of programs. Second, the variability between motifs is often local, and many similar (though not-identical) motifs may lead to the same program behaviors. Therefore, instead of encoding each motif in the feature vector independently, we can group similar motifs and encode only the group.

To reduce the dimensionality of available paths and group together similar motifs, we use Random Discretized Projections (RDP) [8], a hashing technique for reducing the dimensionality of large feature vectors. It is well-known in data mining, but has not been used for big-code representation. RDP calculates hash values that are used to group similar items into the same buckets with high probability based on a similarity metric (e.g., cosine similarity). The hash value represents the motif-group in the feature vector.

Extracting Features from Individual Programs. Lines 5-9 in Algorithm 1 describe the procedure to extract motifs from a program. We iterate over the nodes in the AST and, for each node, extract a sequence of nodes by visiting the parent nodes up to level L, using the function GetMotifAt (line 6), which we define recursively as GetMotifAt(N,L) = N :: GetMotifAt(parent(N),L−1) with base cases GetMotifAt(∅,L)=∅ and GetMotifAt(N,0)=∅.
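The recursive definition of GetMotifAt translates almost directly into code. The sketch below is illustrative only: the `Node` class, its field names, and the integer type IDs are assumptions made for the example, not part of the SixthSense code base.

```python
class Node:
    """Minimal AST node: a numeric node type ID and a link to its parent."""

    def __init__(self, type_id, parent=None):
        self.type_id = type_id
        self.parent = parent


def get_motif_at(node, length):
    """Collect up to `length` node type IDs, walking from `node` to its ancestors.

    Mirrors GetMotifAt(N, L) = N :: GetMotifAt(parent(N), L - 1), with the
    empty list returned for a missing node or a length of zero.
    """
    if node is None or length == 0:
        return []
    return [node.type_id] + get_motif_at(node.parent, length - 1)


# Hypothetical three-level tree: root (ID 1) -> statement (ID 2) -> expression (ID 3).
root = Node(1)
stmt = Node(2, parent=root)
expr = Node(3, parent=stmt)
print(get_motif_at(expr, 2))
```

A motif that reaches the root before collecting L elements comes back shorter than L, which is the case the padding step handles.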

The function SimilarityHash (line 7) computes a hash key of each motif using Random Discretized Projections (RDP) [8]. If the size of the motif is smaller than L (e.g., because the node does not have a sufficient number of parents), PadRight pads the motif to the maximum size with unused elements. We increase the count for a hash key each time a motif maps to that same key (line 8). RDP has a configurable number of projections and bin size. These parameters can be tuned to make similarity more or less fine-grained. They also control, indirectly, the size of the feature vector, the construction of which we describe next.
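To make the hashing step concrete, here is a small self-contained sketch of random discretized projections: each motif vector is projected onto a few random Gaussian directions and each projection is divided by the bin width and floored. This is a generic formulation written for illustration; the helper names `make_rdp_hasher` and `pad_right`, the filler value 0, and the seed handling are assumptions, and the paper's implementation relies on an existing library instead.

```python
import random
from math import floor


def pad_right(motif, length, filler=0):
    """Pad a short motif with a filler ID (assumed unused) up to `length`."""
    return list(motif) + [filler] * (length - len(motif))


def make_rdp_hasher(dim, projections=5, bin_width=5.0, seed=0):
    """Build a hash function based on random discretized projections.

    Nearby vectors fall into the same bin along each random direction with
    high probability, so they tend to share a hash key.
    """
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(projections)]

    def similarity_hash(motif):
        return tuple(
            floor(sum(a * x for a, x in zip(direction, motif)) / bin_width)
            for direction in directions
        )

    return similarity_hash


hasher = make_rdp_hasher(dim=4)
key = hasher(pad_right([3, 7], 4))  # a tuple of 5 integer bin indices
```

Increasing `bin_width` or lowering `projections` makes the grouping coarser, which in turn shrinks the number of distinct keys and hence the feature vector.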

Calculating Feature Vectors. Given a batch of programs Batch and the motif length L, we iterate over the batch to extract the motifs for each program (lines 5-9), as described in the paragraph above. Then, to store all the motifs, we first use InitFVTable to create a feature vector table F with one row per program and one column per unique motif (feature) across all programs in the batch (line 10). Each row of F is the feature vector of a program prog, and each cell stores the count of a motif m in prog (lines 11-13). index maps between the motif hash code and the column index in F. Finally, we output all the feature vectors.
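The table construction can be sketched as follows, assuming that each program has already been reduced to a list of motif hash keys. The function name `build_feature_table` and the use of plain Python lists are choices made for this example; only the roles of F and index come from the description above.

```python
from collections import Counter


def build_feature_table(motif_keys_per_program):
    """Build the count table F and the key-to-column mapping `index`.

    `motif_keys_per_program` holds, for each program, the hash keys of all
    motifs extracted from it (repeats included).
    """
    counts = [Counter(keys) for keys in motif_keys_per_program]

    # Assign one column to every distinct hash key seen in the batch.
    index = {}
    for program_counts in counts:
        for key in program_counts:
            index.setdefault(key, len(index))

    # One row per program, one column per motif group.
    table = [[0] * len(index) for _ in counts]
    for row, program_counts in zip(table, counts):
        for key, count in program_counts.items():
            row[index[key]] = count
    return table, index


# Two hypothetical programs with string stand-ins for hash keys.
table, index = build_feature_table([["a", "b", "a"], ["b", "c"]])
```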

#### 4.2 Data Features

The nature of the data-set may determine the performance of the probabilistic model when run using an inference algorithm. For instance, in the absence of sufficient data, the choice of prior distributions becomes very important. Similarly, a strong prior with very small variance is unlikely to converge to the correct results in such a scenario [2]. SixthSense computes data metrics like sparsity (number of non-zero elements), auto-correlation (correlation between values of a time series), skewness (asymmetry of the distribution), maximum/minimum variances of the model's prior distributions, and several others for observed and predictor data variables.
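A few of these data metrics can be sketched in plain Python. The exact formulas SixthSense uses are not given in the text, so the definitions below (lag-1 autocorrelation and the moment-based skewness coefficient) are common textbook choices assumed for illustration; `sparsity` follows the parenthetical definition in the text.

```python
from statistics import mean


def sparsity(values):
    """Number of non-zero elements, following the definition in the text."""
    return sum(1 for v in values if v != 0)


def lag1_autocorrelation(values):
    """Correlation between consecutive values of a series (assumed lag of 1)."""
    m = mean(values)
    denominator = sum((v - m) ** 2 for v in values)
    numerator = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return numerator / denominator if denominator else 0.0


def skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0


observations = [0.0, 0.0, 0.0, 4.0]  # hypothetical data array
features = [sparsity(observations),
            lag1_autocorrelation(observations),
            skewness(observations)]
```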

#### 4.3 Runtime Features

For inference algorithms like MCMC, diagnostics from the early stages (warmup) of sampling can often indicate the presence or absence of problems with the model and associated data. Such diagnostics can help in discovering problems earlier so that the users can update their model for more efficient performance. Unfortunately, they are not predictive in nature: manually observing the raw values may not provide a good intuition about the program execution. However, our prediction engine can infer useful information from them.

To validate this intuition, we collect several runtime features from MCMC chains during the early stages of warmup iterations. These features are algorithm specific. For NUTS, they include the posterior log density (log probability that the data is produced by the model using the current set of parameters), tree depth, divergence of the simulated trajectory, acceptance rate of the generated sample, step-size (the distance between consecutive samples), leapfrog steps, and the energy estimate of the Hamiltonian.

#### 5 Program Generation for Training Set Augmentation

In this section, we describe our approach of generating mutant programs from a corpus of seed programs. To produce mutants from the original seed programs, we define two kinds of transformations – for code and data.

#### 5.1 Code Mutations

Our Code Mutations can be broadly classified into two sets: (1) expansive mutations, which make more complicated models from the original one, and (2) reducing mutations, which reduce the complexity of the models.

Expansive Mutations. These include Auxiliary Parameter Creator, which converts a distribution argument to a parameter in the program; Conjugate Replacer, which replaces prior distributions with distributions conjugate [46] to the likelihood when possible; Dimension Expander, which expands the dimension of a scalar parameter to match the data dimension; Constant Replacer, which lifts a constant in the program to a parameter with an appropriate distribution; and Data to Parameter Transformer, which randomly replaces a real-valued data array with a parameter of the same dimension.

Reducing Mutations. These include Arithmetic Simplifier, which replaces arithmetic expressions with either of the operands or changes the arithmetic operation; Conditional Eliminator, which replaces conditional statements with either of the branches; Distribution Simplifier, which replaces complex distributions like Laplace or Weibull with common distributions like Normal or Uniform; and Math-Function Call Eliminator, which replaces common math functions like log, exp, etc. with constants. These transformations have previously been used by [19] for testing PP systems.

#### 5.2 Data Mutations

Apart from source code transformations, we also added several data transformations. Such transformations help in changing the distribution of values in the data set, which could produce challenging scenarios for the probabilistic model or inference algorithm to work with. The data mutations include scaling by a constant, adding arbitrary noise, Box-Cox transformation [49], scaling to new mean and standard deviation, cube root transform, and random replacement of values with values from the same data set.
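Several of the data mutations listed above are simple element-wise transforms. The following sketch shows plausible versions of five of them (the Box-Cox transformation is omitted); the function names, parameters, and the handling of zero-variance data are assumptions made for this example.

```python
import random
from math import copysign
from statistics import mean, pstdev


def scale(data, constant):
    """Scale every value by a constant."""
    return [constant * x for x in data]


def add_noise(data, sigma, rng):
    """Add zero-mean Gaussian noise (one possible form of 'arbitrary noise')."""
    return [x + rng.gauss(0.0, sigma) for x in data]


def rescale(data, new_mean, new_std):
    """Shift and scale the data to a new mean and standard deviation."""
    m, s = mean(data), pstdev(data)
    if s == 0:
        return [new_mean for _ in data]  # constant data has no spread to rescale
    return [new_mean + new_std * (x - m) / s for x in data]


def cube_root(data):
    """Sign-preserving cube root transform."""
    return [copysign(abs(x) ** (1.0 / 3.0), x) for x in data]


def resample_some(data, fraction, rng):
    """Replace roughly `fraction` of the values with values drawn from the same data."""
    return [rng.choice(data) if rng.random() < fraction else x for x in data]


rng = random.Random(0)
mutated = rescale(add_noise([1.0, 2.0, 3.0, 4.0], 0.1, rng), 10.0, 2.0)
```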

#### 5.3 Adaptive Algorithm for Mutant Generation

To generate programs with different runtime behaviors, it is important to explore programs with diverse semantic and syntactic features. Our mutation algorithm randomly applies several mutations to the original program. However, to diversify the generated mutants it uses a nearest neighbor based algorithm (Locality Sensitive Hashing [12]), which only selects a representative set of mutants in multiple rounds.


Algorithm 2 presents the mutant selection algorithm. The inputs for the algorithm are seed programs S, total number of programs to generate M, and the number of programs to generate in each batch B from each seed program. The algorithm returns the selected mutant programs set progs as output. First, we initialize the LSH (Locality Sensitive Hashing) engine. We used four Random Discrete Projections hash functions. Next, in each round, we first choose a seed program using the chooseSeed function. The chooseSeed function randomly chooses among the original seed program s and the mutants generated (in progs) from it in earlier rounds. Next, we generate a new batch of programs of size B using generatePrograms.

For each new generated program k, we compute its feature vector and number of neighbors among the already generated programs. We select the program only if it has no neighbors in the already selected set of programs. Finally, the algorithm returns the selected set of programs once it has generated the target M programs. The generatePrograms algorithm (Algorithm 3) generates M mutants for a seed program S. For each program, in each iteration, it applies a set of randomly chosen mutations and adds it to the set of new programs. Finally, it returns the set of new programs to the caller. Using this algorithm, we obtained a diverse set of probabilistic programs with a good balance of converging/non-converging behavior.
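The selection loop described above can be approximated by the following sketch. It is not the paper's Algorithm 2: the callables `mutate` and `bucket_of` are hypothetical stand-ins for the mutation engine and for the LSH bucket of a program's feature vector, seeds are picked uniformly at random in each round, and a program counts as having a neighbor when its bucket is already occupied.

```python
import random


def select_mutants(seeds, total, batch_size, mutate, bucket_of, rng,
                   max_rounds=10_000):
    """Collect up to `total` mutants that land in pairwise distinct buckets."""
    selected = []
    # For each seed, the pool that chooseSeed draws from: the seed and its kept mutants.
    pools = {i: [seed] for i, seed in enumerate(seeds)}
    occupied = set()

    for _ in range(max_rounds):
        if len(selected) >= total:
            break
        i = rng.randrange(len(seeds))
        parent = rng.choice(pools[i])                              # chooseSeed
        batch = [mutate(parent, rng) for _ in range(batch_size)]   # generatePrograms
        for candidate in batch:
            key = bucket_of(candidate)
            if key in occupied:        # too similar to an already selected program
                continue
            occupied.add(key)
            selected.append(candidate)
            pools[i].append(candidate)
            if len(selected) >= total:
                break
    return selected


# Toy stand-ins: "programs" are integers, mutation adds a random offset,
# and two programs are neighbors when they share the same tens bucket.
rng = random.Random(0)
mutants = select_mutants(
    seeds=[0, 100], total=5, batch_size=3,
    mutate=lambda p, r: p + r.randint(1, 50),
    bucket_of=lambda p: p // 10,
    rng=rng,
)
```

The `max_rounds` cap is an addition for safety in this sketch, so that the loop terminates even if the mutations stop producing programs in new buckets.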

#### 6 Methodology

We present the methodology for collecting seed probabilistic models and the program features and metrics we compute.

Seed Probabilistic Models. We collected a corpus of probabilistic models from the most comprehensive open-source repository of Stan Models [52]<sup>2</sup>. Out of a total of 505 models, we selected the three most common categories: Regression (120 models), Time-Series (23), and Mixture Models (23, augmented with 3 from [33]). The models come with their datasets.

Inference Engine and Sampling. We used NUTS, the default inference engine of Stan [24]. We executed all programs using 4 MCMC chains with 1000 iterations each for the warmup phase and for sampling. This configuration is the default for Stan. We also checked the eventual convergence by running the programs for many more iterations. We used 100,000 as the maximum number (the convergence metrics do not change significantly even for 10<sup>6</sup> iterations for the seed models).

Feature Extraction. We used a Python based implementation of Randomized Discretized Projection [1]. We set its hyper-parameters P=5 and bin-width B=5, which worked well to reduce the dimensionality of the vector space.

Random Forests. We used Random Forests Classifier from Scikit-Learn package in Python for training. We use 5-fold cross validation for training. We extract top features using TreeInterpreter [56].

Execution Setup. We performed the mutant generation and feature computation on an Intel Xeon 3.6 GHz machine with 6 cores and 32 GB RAM. We used Azure Batch Scheduling Service to run all the programs and metrics computations. We capped the MCMC execution under 240 minutes.

#### 6.1 Baselines, Metrics, and Classification

Baselines. We compare SixthSense to three baselines. The first, Code2Vec [5], and the second, Code2Seq [4], are state-of-the-art predictors based on Deep Neural Networks for big-code. They were originally used to predict function names from code. We adapted these systems to do classification for each threshold of convergence, by extracting path contexts (subsets of paths similar to our motifs) from the code. Finally, the third baseline, the majority classifier, assigns the most frequent label in the training set to all the predicted programs. It indicates the prediction 'hardness' when the training set is imbalanced.

Metrics. We used a common metric for measuring convergence, the Gelman-Rubin diagnostic (R̂) [25]. Ideally, the value of this metric should be close to 1.0. An observed value of R̂ of, e.g., 1.05 is considered a good indication of convergence. Larger values, e.g., 1.5 and greater, are considered weaker evidence of convergence. Given a threshold, we assign the label True to a program if the metric value is within the threshold and False otherwise.
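The labeling rule can be illustrated with a minimal sketch. It assumes the classic (non-split) form of the Gelman-Rubin statistic for a single scalar parameter; the text does not state which variant was used, and Stan itself reports refined variants. The function names `gelman_rubin` and `convergence_label` and the toy chains are hypothetical.

```python
import math
import statistics


def gelman_rubin(chains):
    """Classic (non-split) Gelman-Rubin statistic for one scalar parameter.

    `chains` is a list of equally long lists of draws, one list per MCMC chain.
    This is an assumed variant, not necessarily the one used in the evaluation.
    """
    n = len(chains[0])  # number of draws per chain
    chain_means = [statistics.mean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    # B: variance of the chain means, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


def convergence_label(chains, threshold=1.05):
    """True if the diagnostic is within the threshold, False otherwise."""
    return gelman_rubin(chains) <= threshold


# Toy data: two chains that agree, and two chains stuck in different regions.
agreeing = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 2.5]]
disagreeing = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
print(convergence_label(agreeing), convergence_label(disagreeing))
```

With a tighter threshold, fewer programs receive the label True, which is why the evaluation reports results per threshold.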

<sup>2</sup> The number of probabilistic programs available in public sources is low compared to conventional languages. This is in part due to the novelty of these languages and the expertise required to create and interpret those models. As a further challenge, Stan programs require a corresponding data set of sufficient size, which many Stan programs on GitHub do not have. Finally, most publicly available programs are tuned to converge on their available data sets.

#### 6.2 Evaluation Experimental Setup

Training and Test Sets. We generate a corpus of mutant programs for each seed program using the approach discussed in Section 5.3. We create a test-train split for every seed program in the following way: (1) the test set consists of a single seed program and all its mutants; (2) the training set contains all other seeds and their mutants. Thus, the training is not aware of any mutants of the test seed program. For each such split, we train a classifier using the training set and evaluate its performance (using the metrics below) on the test set. With this strategy we obtain metrics for each split (each representing one seed program and its mutants). Finally, we compute the average performance across the splits.

Training a predictor by leaving out each model and its mutants in test set allows us to stress-test the model predictor. We choose this evaluation strategy because the number of original seed programs in each class is low compared to conventional big-code data-sets. Every seed probabilistic program represents a different statistical model and using this strategy helps us evaluate the sensitivity of the classifiers for each such model.
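The splitting strategy can be sketched as follows. This is a minimal illustration, not the evaluation harness itself: the representation of a program as a `(seed_name, program_id)` pair, the toy corpus, and the `train_and_score` callback are assumptions made for the example.

```python
def leave_one_seed_out(programs):
    """Yield (held_out_seed, train, test) splits.

    `programs` is a list of (seed_name, program_id) pairs; a seed program and
    all of its mutants share the same seed_name (hypothetical representation).
    """
    seeds = sorted({seed for seed, _ in programs})
    for held_out in seeds:
        # Test set: the held-out seed program and all its mutants.
        test = [p for p in programs if p[0] == held_out]
        # Training set: every other seed program and its mutants.
        train = [p for p in programs if p[0] != held_out]
        yield held_out, train, test


def average_score(programs, train_and_score):
    """Average a per-split score; `train_and_score(train, test)` is supplied
    by the caller and stands in for training and evaluating a classifier."""
    scores = [train_and_score(train, test)
              for _, train, test in leave_one_seed_out(programs)]
    return sum(scores) / len(scores)


# Toy corpus with three seeds and a few mutants each.
corpus = [
    ("regr_a", "regr_a"), ("regr_a", "regr_a_mut1"), ("regr_a", "regr_a_mut2"),
    ("regr_b", "regr_b"), ("regr_b", "regr_b_mut1"),
    ("regr_c", "regr_c"),
]
splits = list(leave_one_seed_out(corpus))
```

Because a seed and its mutants never straddle the split, a classifier cannot score well merely by recognizing near-duplicates of training programs.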

Classification Scores. We used Precision, Recall, Accuracy, and AUC [21] to evaluate the performance of the learned classifier. They range between 0 and 1 (higher is better). We use the same metrics for all the baselines.

Accuracy and AUC are adequate metrics for our scenario. Since we perform training by creating a test-train split for every seed program and its mutants (Section 6.2), the test set can become imbalanced in some cases, e.g., with no or few positive labels, no true or false positives, or extremely different split sizes.
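The effect of an imbalanced split on the scores can be seen in a small sketch. It computes accuracy, precision, and recall from boolean labels and reports `None` where a score is undefined; AUC is omitted because it needs prediction probabilities. The function name and the toy labels are hypothetical.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, and recall for boolean labels.

    Precision or recall is None when its denominator is zero, which happens
    on splits without predicted or actual positives.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# A test split in which no program converges (no positive labels).
labels = [False, False, False, False]
predictions = [False, False, True, False]
scores = classification_scores(labels, predictions)
```

In this split recall is undefined and precision is 0, while accuracy (0.75) remains meaningful, which illustrates why averaged precision and recall are more sensitive to small, unbalanced splits.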

Fig. 5: SixthSense Prediction Accuracy for Convergence (Measured Using Gelman-Rubin Diagnostic)

#### 7 Evaluation

#### 7.1 Predicting Convergence of Inference

Figure 5 presents the prediction scores for SixthSense when predicting convergence of MCMC algorithms (NUTS in this case). The Y-axis shows the accuracy scores for each prediction model (higher is better). The X-axis shows the four thresholds (1.05-1.2) of the convergence metric, the Gelman-Rubin diagnostic, that we considered in our evaluation. We chose this range to test how general the prediction can be as the individual program labels change. For each threshold, we plot the accuracy scores of our prediction model (SixthSense) together with Code2Vec, Code2Seq and a Majority Label Classifier, as vertical bars in different colors. We evaluated the trained model on a held-out test set (see Section 6.2).

Table 1: Precision (P) and recall (R) (R̂ = 1.05)

Comparison with Code2Vec/Code2Seq. Figure 5 shows that SixthSense, using solely AST motifs, performs better than Code2Vec and Code2Seq (see also the ablation study in Section 8). The results show that SixthSense's learned classifiers have an accuracy score close to 0.8. These prediction rates are already useful for users because they help them avoid wasting time compiling and running programs that would likely not converge. Our training algorithm is able to learn classifiers that generalize well across different thresholds.

For Regression and Mixture models, SixthSense has consistently better accuracy than the other approaches across all thresholds. For the tightest convergence bound R̂ = 1.05, its accuracy is 5 percentage points higher than the alternatives for Regression, and 8 percentage points higher for Mixture. For TimeSeries models, the accuracy score of SixthSense is 1 percentage point higher than that of Code2Seq.

Table 1 presents the precision and recall for R̂ = 1.05. SixthSense exhibits consistently higher precision over Code2Vec (8 to 10 percentage points) and Code2Seq (5 to 10 percentage points). SixthSense also has higher recall than Code2Vec (1 to 7 percentage points), while the recalls of SixthSense and Code2Seq are comparable (within 2 percentage points). Recall that the precision/recall are averaged over those for different splits and can be more sensitive to small and unbalanced splits.

Table 2 shows the AUC scores for SixthSense, SixthSense with runtime features and Code2Vec. Code2Seq does not provide its probability of predictions, which prevents us from computing its AUC score. The results show that SixthSense improves in AUC score over Code2Vec for all classes.

The prediction accuracy, precision, and recall from Tables 1 and 2 persist for higher thresholds of R̂.

Comparison to Majority Label Classifier. Figure 5 shows the comparison of SixthSense to a naive Majority Label Classifier, which has a classification accuracy of 0.5. This indicates a significant improvement of SixthSense over an uninformed random choice.

Predicting with Warm-up Runtime Features. Figure 5 presents the impact of SixthSense's AST features augmented with runtime features (Section 4.3) sampled from the first 200 iterations of the warmup stage (at this point Stan still does not issue warnings for our programs). Recall that the results of these iterations are dropped by the inference algorithm, as in this phase the mixing of the MCMC chains has just begun. However, they can be useful in addition to code features: they help improve the prediction by a further 6 percentage points for Regression and TimeSeries, and 8 percentage points for Mixture models (R̂ = 1.05). Table 2 also shows the improvement in AUC of combined AST and runtime features over the AST-only version of SixthSense. Note, however, that collecting runtime features still requires compiling the program and starting its execution. While this time differs among systems and datasets, it may be non-trivial, as is the case for Stan (e.g., around 30 seconds for compilation). This time may be an important factor when deciding to use a runtime predictor for different PP systems. We also present a feature ablation study in Section 8.

#### 7.2 Debugging Non-Converging Programs

When SixthSense's learned model predicts that a model will not converge, two natural follow-up questions are (1) which part of the program is the likely culprit for non-convergence and (2) how many iterations would be sufficient for the model to converge, if it converges at all.

Debugging Approach. We interpret the outcomes SixthSense predicts, and leverage the AST features and the random forests to help pinpoint which part of the program leads to non-convergence.

To obtain the set of programs, we randomly selected 40 probabilistic programs from our test sets, equally across the three model classes, which SixthSense correctly identified as non-converging for 1000 iterations. For each program, we obtained the most important features from the learned random forest. We selected the top-5 features (motifs) and inspected the model to identify whether the parts covered by the motifs contain the culprit of non-convergence. The top-5 features typically cover only 5% of all the motifs, which means SixthSense points to a relatively small scope to debug.

We make up to two manual updates to each model by making changes only to the AST elements identified by the motifs or the referenced observed data. These changes represent simple semantic modifications that a user of a probabilistic program might make as they explore various possible models for their data. We simulate a try-and-check interactive search with these localized transformations. For instance, SixthSense identified a constant array in a regression equation as one of the top motifs. Converting that constant into a parameter made the model converge. Some of our attempted updates include changing the variance (constant) of a distribution, changing the distribution for a parameter, changing a parameter to a constant, and removing mathematical functions (e.g., abs, log) when they are redundant.

After transforming the model, we run inference to see if it converges. We further check if the model becomes accurate (or correct) after the fix, since non-convergence often causes inaccurate (or wrong) results. For each model, we apply accuracy tests from Bayesian model checking [25, Ch.6]: we compute the mean squared error to compare the new model's result to its correct data, and we also visually inspect the resulting density plot to check if it matches the correct distribution. Multiple student authors inspected the updates and agreed that these changes followed the protocol described above.

Results. Table 3 presents the results for this debugging application. Column 1 (Class) presents the classes of randomly sampled models. Column 2 (#Models) presents the number of mutant models we randomly selected from each class. Column 3 (6s Upd.) presents the number of programs that we manually updated to converge using the method above. Column 4 (Stan Warn.) presents the number of programs for which Stan issued a warning during sampling. Column 5 (Stan Upd.) presents the number of programs for which Stan's warnings helped update the program to converge.

Table 3: Debugging Non-Converging Models

Overall, we were able to identify the problem and make 29 of the 40 models converge after updating them. Specifically, we corrected 16 models by replacing a parameter indicated by SixthSense with a constant, 6 by simplifying mathematical functions, 3 by changing constants in distributions, 2 by converting constants to parameters, and 2 by changing distributions for parameters. All the code elements we changed were pointed to by the top three motifs SixthSense returned. For the 11 models that we were not able to update, we believe that the model correction would require more complex changes than those we specified in the setup above.

We ran SixthSense again on the 29 updated, now converging models. It correctly predicted that 21 will converge (8 from Regression, 8 from TimeSeries and 5 from Mixture); this is, interestingly, close to the prediction rates from Section 7.1. This illustrates that SixthSense can be useful in the iterative debugging loop.

These results demonstrate the advantage of the interpretability of SixthSense's learned model. Using motifs from the AST as features and a simple learning model (random forests) helps the user easily identify key program components which affect the runtime behavior of a probabilistic model. In comparison, identifying such important features is hard for more complex neural network-based models and might require more low-level handling of the learned model. In particular, Code2Vec and Code2Seq do not provide a way to interpret how their predictions were made.

Comparison to Stan's runtime warnings. Compared to Stan's runtime warnings, SixthSense motifs reveal more fine-grained patterns that hinder convergence. For most of the non-converging models (29 out of the 40 in this experiment), Stan did not issue a warning (beyond the high R̂ value at the end of inference). The 12 warnings issued by Stan only concern function domains. Seven out of the 12 were not related to non-convergence. For instance, one program returns "Warning: normal lpdf: Scale parameter is -0.0799029, but must be >0." Changing the scale parameter limits does not help. Instead, SixthSense identifies a fix that is not at this location.

The remaining 5 Stan runs indicate non-convergence and can help with updating the model. However, they were not as helpful in locating the causes as SixthSense. One example where both SixthSense and Stan indicated a problem is the program with the expression normal(exp(w0)+sqrt(abs(w1))∗x1+w2∗x2,s). Stan warned about the overflow in the first argument of normal, disregarding its sub-expressions. SixthSense traced the problem to the sqrt and abs sub-expressions; simplifying these function expressions indeed fixed the non-convergence.

#### 8 Sensitivity Analysis

We present various sensitivity analyses of SixthSense to justify our design choices.


Table 5: Training with Noisy Labels (R̂ = 1.05)


#### 8.1 Feature Ablation Study

Table 4 shows the Accuracy score for convergence predictions when trained with different combinations of feature groups (AST features, AST and data features, and all features). Runtime features are from 200 warmup iterations. The AST features (motifs) alone contribute a major portion of the Accuracy scores in all cases. Data features do not have much impact on these models. Runtime features, after a certain number of iterations, further improve prediction (they are in fact a strong predictor, but do not establish a relation with the program code). Obtaining runtime statistics comes at the cost of compiling and running the program. This cost is often over 30 seconds for Stan.

Impact of noisy labels on the prediction. To evaluate the robustness of our prediction, we perturb the class labels in the training set with different noise levels. We use the version of SixthSense that applies Rank Pruning [41]. Table 5 shows the Accuracy scores for the different model classes for several noise levels (1-5%). For each noise level, the Robust column shows the scores when trained using the Rank Pruning algorithm and the Basic column shows the scores for baseline SixthSense. Even in the presence of significant training noise, our learning approach maintains high Accuracy scores. For instance, the performance on Mixture Models remains almost constant (close to 78%) even when 5% of the labels are wrong.
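The perturbation of training labels can be sketched as below. The text does not specify how the noise was injected, so this sketch assumes uniform random flips of a fixed fraction of boolean labels; the function name `perturb_labels` and the seed parameter are hypothetical.

```python
import random


def perturb_labels(labels, noise_level, seed=0):
    """Flip a fraction `noise_level` of boolean labels, chosen uniformly.

    Assumption: noise is injected as uniform random flips; the original
    experiments may have used a different noise model.
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    n_flips = round(noise_level * len(labels))
    flip_at = set(rng.sample(range(len(labels)), n_flips))
    return [(not y) if i in flip_at else y for i, y in enumerate(labels)]


# 100 training labels with 5% noise: exactly five labels end up wrong.
clean = [True] * 60 + [False] * 40
noisy = perturb_labels(clean, noise_level=0.05)
flipped = sum(1 for a, b in zip(clean, noisy) if a != b)
```

Training once on `noisy` with and once without a noise-robust learner would yield the two kinds of scores compared per noise level.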

Other sensitivity studies. We also performed other sensitivity studies on the features and generated programs. First, we looked at different motif sizes. For three motif sizes (5, 10, 20) at the threshold R̂ = 1.05, we do not see a significant increase in the Accuracy score. This reflects that even smaller motifs obtained from probabilistic programs can be very effective for predicting their runtime behavior. Therefore, we used a motif size of 5 in all our experiments.

We then removed overlapping motifs, which reduced the Accuracy scores (by 2 to 5 percentage points). Other experiments, such as different LSH configurations to remove syntactically similar programs from the training set, did not show substantial deviation from the reported scores.

#### 9 Related Work

Probabilistic Programming. Probabilistic programming languages (PPLs) and their underlying inference systems have recently gained significant interest from research and industry [9, 10, 26, 27, 29, 36, 38, 45, 55, 58]. Typically, PPLs (e.g., Stan) only provide simple runtime diagnostics and timing information as they run. In contrast, SixthSense is a predictive data-driven approach that complements these efforts.

The prior debugging approach for PPLs [39] requires augmenting the Bayesian network representation with additional labels and extending the inference algorithm. However, its applicability is limited since state-of-the-art PP systems typically do not use a Bayesian network representation. Our approach learns program features useful for debugging without modifications to the inference algorithm. Existing tools [15, 19] find lower-level implementation bugs in probabilistic programming systems.

Several recent approaches have explored the nature of regression tests in probabilistic and machine learning applications such as the causes and fixes for flaky tests [17, 18], usage of seeds in tests [14], and speeding up expensive regression tests [16].

Predicting Program Properties from Big-Code. Much attention has recently been devoted to uses of machine learning to analyze and predict various program properties. Notable examples include predicting variable names/types via statistical program models [47], predicting patches [35], summarizing code [3, 31], and API discovery [5, 57]. However, all of these works apply learning to conventional programs (C/Java/JavaScript), obtained from massive code repositories. Moreover, many of these approaches predict static program properties (e.g., names/types), rather than execution properties like convergence. While some of these approaches benefit from the natural-language semantics of identifiers [4, 5], we are interested in the semantics of the program itself, which are better represented by the sequence of AST nodes.

We also present how to augment the corpus of programs with diverse programs via guided mutation. While our approach bears similarity to data augmentation in machine learning [11, 50, 53], probabilistic programs have complex structure defined by many syntactic (and often semantic) rules.

Predicting Algorithm Performance. Researchers developed machine learning approaches that predict hardness of NP-hard problems (e.g., SAT, SMT, ILP) [7, 32, 34]. These works are complementary and their syntax and semantics are considerably simpler than for probabilistic programs. Researchers also proposed models for performance of other machine learning architectures [13, 20, 30, 43], but their techniques and applications are orthogonal to ours.

### 10 Conclusion

We presented SixthSense, a novel approach and system, which predicts convergence for probabilistic programs and helps guide the debugging of convergence issues. We show SixthSense is effective in extracting features from probabilistic programs and learning a prediction model. Compared to the state-of-the-art techniques, our results show significant improvement in accuracy.

### Acknowledgments

This research was supported in part by NSF Grants No. CCF-1846354, CCF-1956374, CCF-2008883, USDA NIFA Grant No. NIFA-2024827, a gift from Facebook, a Facebook Graduate Fellowship, and Microsoft Azure Credits. We would also like to thank Prof. Jian Peng for the useful comments on an earlier draft.

#### References



## Finding Semantic Bugs Fast<sup>⋆</sup>

Lukas Grätz (✉), Reiner Hähnle, and Richard Bubel

Technical University of Darmstadt, Darmstadt, Germany {graetz,haehnle,bubel}@cs.tu-darmstadt.de

Abstract. Finding semantic bugs in code is difficult and requires precious expert time. Lacking comprehensive formal specifications, deductive verification is not an option. We propose an incremental specification procedure: With the help of automatic verification tools, a domain expert is guided through program runs and source code locations. The expert validates a run at certain locations and creates lightweight annotations. Formal methods training is not required. We demonstrate by example that this approach is capable of quickly detecting different kinds of semantic bugs. We position our approach in the middle ground between fully-fledged deductive verification and bug finding without semantic guidance.

#### 1 Introduction

The main obstacle against using program verification tools for bug finding is not their efficiency, but a lack of meaningful formal specifications that capture the intended semantics of a given program [2,9]. This is unfortunate, because semantic bugs are dominant over memory-related bugs [15], but cannot be found by existing bug finding approaches [1,3,6,7,18], which look for syntactic patterns or generic errors (such as uncaught exceptions, memory faults, etc.).

A notorious, relatively recent example was an alleged error in software used in the UK to send mammography invitations to women in a certain age group [11]. Not all letters were sent according to the specification, which would statistically have led to belated diagnosis and possibly premature death of some women. As it turned out, the specification was drawn up in hindsight, after the software had been in use for years. Detecting the mismatch would have required an expert to look at exactly the right decision points in the code and to compare the implicit assumptions with the specification. This is a challenging task: (i) There might be a vast number of inputs and runs: how to choose the ones that give insight into a possible semantic bug? (ii) Keeping track of implicit assumptions and checking their validity in a given run is tedious, time-consuming, and error-prone.

In this paper we propose a novel approach to help experts find semantic bugs: These are bugs where functional and expected behavior in a domain context deviate, without domain-independent symptoms like abrupt termination or blocking. We address the issues above with dedicated tool and language support.

© The Author(s) 2022

<sup>⋆</sup> Supported by Deutsche Forschungsgemeinschaft (DFG) - Project number 351097374.

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 145–154, 2022. https://doi.org/10.1007/978-3-030-99429-7\_8

The main ingredients are: (1) to render implicit assumptions in the code explicit, traceable, and automatically checkable in the form of lightweight (Boolean Java) specification expressions and a simple labeling mechanism; (2) to use (automatic) deductive verification to guide the expert and to validate assumptions.

We do not aim at a fully automatic process, which we deem futile to detect non-trivial semantic bugs. We also are not interested in complete contract-based specification [13] as typically used in deductive verification [9], which we consider unrealistic in many cases, because of the required effort and the need for training in writing formal specifications. In contrast, the partial specifications we aim at are incrementally produced by a software engineer, guided by tool support. The annotations do not cover the full functionality of the analyzed software, but only part of the input space and source code. Therefore, the resulting annotations can stay simple and close to a designer's understanding of the code. Specific training in formal methods is not required.

The flip side of our approach is that we are unable to provide formal guarantees about the absence of bugs. We share this limitation with other bug-finding technology, such as systematic debugging [18], bug finding tools [1], test case generation [7], or code inspection [6]. On the other hand, all of the mentioned techniques either look for a fixed set of syntactic conditions or assume the presence of a specification, whereas we guide the user to come up with semantically relevant specification annotations. Consequently, we hope to occupy a sweet spot between fully-fledged deductive verification and bug finding without semantic guidance. In addition, partial specifications can help static verification as well as deductive verification tools. In this NIER paper we sketch our approach and illustrate how it works with simple examples. A robust implementation and full evaluation are envisaged.

#### 2 Validating Program Runs

To explain our approach we use the min method shown in Fig. 2. More realistic examples are provided in Section 4. We are given a software system and its source code. The system could be a method, a command line interface, or some other piece of software in which we want to find bugs. We assume the system is already free of memory-related or termination bugs (covered by existing tools); in particular, for any given input, there is no runtime exception and the system terminates in some final state. We will have a (virtual) code position for all final states: This is where the validation routine will start, see below.

Our validation process is performed by a domain expert, guided by a software assistant. A domain expert knows how the system should behave. In general, there is no (formal) specification, so we need an expert, who might be a software tester, code reviewer, or debugging specialist. We assume the expert understands source code and is able to validate the behavior of a given program run.

Program runs are supplied in the beginning. These could have been collected from log files while the system was running. We can often reconstruct a run if all inputs (and events) are given. We assume the runs cover potential semantic bugs; the more runs, the better. Although the expert could validate every single run, it is unrealistic to look into all of them: there are far too many.

#### 2.1 Syntax

We illustrate our approach with Java source code, but it is applicable to any other (imperative) programming language with suitable tool support.

The expert will stepwise annotate the software under validation with partial specifications. We contribute a simple annotation syntax. Annotations are placed after //@ or between /\*@ and \*/, compatible with JML [12]. We do not use full JML, but only a fragment consisting of the labeled assertions and assumptions produced by the grammar in Fig. 1. The asserted/assumed expressions Expr are also simplified: A domain expert only needs to write side effect-free boolean Java expressions—quantifiers or other JML constructs are not required.

Assumptions and assertions are labeled using a prefixed identifier ALabel inside < and >. Labeled assumptions/assertions are only effective when explicitly referred to; they are not assumed in general. To make such references, we extend assert statements with the keyword assuming. In step 4 of Fig. 2, the assertion labeled aRes holds when assuming aLb.

The syntax allows an assertion to be assuming a logical combination of (other) labeled assertions/assumptions. A conjunction of ALabels is written as <l1,l2,...>. Any positive combination of ALabels in positive disjunctive normal form (PDNF) is supported: One can build a complex (acyclic) graph of assumptions and assertions depending on each other. We will see that the PDNF is naturally obtained by the validation steps. The PDNF also makes checking/verifying assertions easy, see Sect. 3.

Our annotations bear resemblance to [4,5], where the keyword verified is used instead of assuming. In contrast to [4,5], labeled assumptions are not expected to hold true in every run (in Fig. 2, assumption aGb is true for half of the inputs). Hence, in our setting, there is no point in trying to verify assumptions. Instead, to check or verify a claim, we can use a labeled assertion with assuming <>, that is, an assertion without an assumption. For example, we write assert x>0 assuming <> instead of assume x>0 to assume and verify that variable x is greater than 0.

When the system under validation reaches a termination point, we are not asserting any specific claim. Usually, however, a number of assertions must hold before the (virtual) termination point. These are listed in an OnlyAssuming clause. If the system boundaries are given by a method (as in Fig. 2), the OnlyAssuming declaration is placed before the method; in the example, it is <aRes> or <bRes>. This corresponds to a JML method specification clause [12].


Fig. 1. Syntax of labeled assumptions/assertions.

Fig. 2. Simple implementation of int min(int,int) with four validation steps.

#### 2.2 Validation Procedure

A validation assistant is intended to guide an expert in validating all program runs without having to scrutinize every single run. In each validation step, the expert validates a single program run against an assertion and provides justifications for his or her judgment in the form of assumptions. The validation assistant knows a current set of assertions G at certain source code locations and a set of program runs R. In the beginning, G includes merely one implicit, trivial assertion (assert true, always satisfied) at the (virtual) termination point of the program. The set G grows after each validation step.

Example 1. We perform the validation procedure for int min(int,int), Fig. 2. In the initial setting (omitted from the figure) we just have the source code without any annotation.

In the first validation step (step 1), the expert is given a program run with input a==3, b==7, return value m==3 and the implicit assertion at the termination point. The expert judges this run to be valid, places assuming <aRes> above the method (virtually at the termination point), and then <aRes>assert m==a as justification. Verification tools check the implicit (and trivial) assert true at the virtual termination point under <aRes>. In step 2, the expert looks at the same program run, but now has to give assumptions for the assertion aRes. The program run is still valid under the new assertion. The expert now adds an assuming <aLb>, and places the assumption <aLb>assume a<=b at the start of the method. Tools check assertion aRes under aLb.

In step 3, the expert is given a different program run a==9, b==0, m==0, plus the trivial assertion at the virtual termination point. The expert adds or <bRes> in the corresponding assuming, and <bRes>assert m==b before the method returns. Again, tools verify. In step 4, the expert looks at the same program run as before. Since assertion bRes is now assuming aGb, assumption aGb is added. Tools check, and the validation procedure ends successfully at this point with a partially specified program; no bug was found.

We consolidate into the general description of the validation procedure:

Validation Assistant. Given sets of program runs R and assertions G, the latter containing only the implicit assert true at termination point. Repeat:

1. Choose<sup>1</sup> r ∈ R, g ∈ G such that:


2. Validation step (see below).

Validation Step. Given assertion g ∈ G and program run r ∈ R:


(a) In case of assuming <>, continue with 4.

	- (a) Expert adds assumption labeled with AL<sub>k</sub>.
	- (b) Expert adds assertion labeled with AL<sub>k</sub> (initially without assuming). The new assertion is added to G.
	- (a) All assumptions/assertions AL<sub>1</sub>, …, AL<sub>n</sub> are satisfied in r.
	- (b) For all r̃ ∈ R: if AL<sub>1</sub>, …, AL<sub>n</sub> are satisfied in r̃ then g is also satisfied.
	- (c) Attempt to formally verify g assuming AL<sub>1</sub>, …, AL<sub>n</sub>, see Sect. 3.
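The two run-based checks of the validation step (all chosen labels are satisfied in r, and in every run the labels imply g) can be illustrated with a minimal sketch on the runs of Example 1. It models runs as dictionaries and labeled annotations as predicates. The expression a > b for aGb is an assumption (the text names the label but not its expression), and all function names are hypothetical.

```python
# Program runs of int min(int,int) from Example 1: inputs a, b and result m.
runs = [
    {"a": 3, "b": 7, "m": 3},
    {"a": 9, "b": 0, "m": 0},
]

# Labeled assumptions/assertions, modeled as predicates over a run.
labeled = {
    "aLb": lambda r: r["a"] <= r["b"],
    "aGb": lambda r: r["a"] > r["b"],  # assumed expression for label aGb
    "aRes": lambda r: r["m"] == r["a"],
    "bRes": lambda r: r["m"] == r["b"],
}


def conjunct_holds(labels, run):
    """A conjunction <l1,l2,...> holds if every label holds in the run."""
    return all(labeled[label](run) for label in labels)


def pdnf_holds(pdnf, run):
    """A PDNF (list of conjunctions) holds if some conjunction holds."""
    return any(conjunct_holds(conjunct, run) for conjunct in pdnf)


def check_step(goal, labels, run, all_runs):
    """Run-based checks only: the labels hold in `run`, and in every run
    where the labels hold, the goal assertion holds as well."""
    satisfied_in_run = conjunct_holds(labels, run)
    implied_in_all_runs = all(
        labeled[goal](r) for r in all_runs if conjunct_holds(labels, r)
    )
    return satisfied_in_run and implied_in_all_runs


print(check_step("aRes", ["aLb"], runs[0], runs))
print(check_step("bRes", ["aGb"], runs[1], runs))
```

A run such as a==2, b==5, m==5 from a faulty implementation would satisfy aLb but violate aRes, so the second check would fail and point the expert to a potential bug. Formal verification of g under the assumptions is outside this sketch.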

#### 3 Checking/Verification

Formal verification can be achieved by translating labeled assertions into ordinary JML assertions as described below. The latter can be handled with state-of-the-art verification tools: For example, we can combine static verification and, for each program run separately, run-time assertion checking.

The translation processes one single assertion and its corresponding assumptions at a time and generates a separate verification task for each. For example, take assertion aRes from step 4 in Fig. 2. There is just one corresponding assumption aLb, so we delete all other assumptions in the source file. The resulting code is left with only two annotations: //@ assume a<=b; and //@ assert m==a; without labels and assuming.

<sup>1</sup> If possible, choose the same run or the same assertion as in a previous iteration. This simplifies the validation step for the expert.

The translation of general ALabelPDNFs is more complex. Consider, for example, assertion pricePlausible in Fig. 4, line 19. We must show that the assertion holds, given either <dscdReg,minPr> or one of the other two disjuncts. We create three verification tasks. In the first, dscdReg and minPr become JML assumptions and pricePlausible becomes a JML assertion (similarly for the other two). We obtain assume discount==0 and assume movie.getPrice() > 5.60, as well as the assertion assert dscdPrice >= 5.00. Observe that the labeled assertion dscdReg is translated into an assumption.
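
The splitting of an assuming clause into one verification task per disjunct can be sketched as follows. This is a minimal illustration and not the translation script from [8]: the class and method names are hypothetical, labels are mapped to their conditions by a plain map, the conditions are copied from Fig. 4, and the placement of each annotation at its program point is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one verification task per disjunct of an assuming clause.
class VerificationTaskSketch {

    // Every label of a disjunct becomes a plain JML assume; the labeled
    // assertion under consideration becomes a plain JML assert.
    static List<List<String>> tasks(Map<String, String> conditions,
                                    String assertion,
                                    List<List<String>> disjuncts) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> disjunct : disjuncts) {
            List<String> task = new ArrayList<>();
            for (String label : disjunct) {
                task.add("//@ assume " + conditions.get(label) + ";");
            }
            task.add("//@ assert " + conditions.get(assertion) + ";");
            result.add(task);
        }
        return result;
    }

    // The three tasks for assertion pricePlausible of Fig. 4.
    static List<List<String>> pricePlausibleTasks() {
        Map<String, String> c = new LinkedHashMap<>();
        c.put("dscdReg", "getDiscount(age) == 0");
        c.put("dscdJun", "getDiscount(age) == 10");
        c.put("dscdSen", "getDiscount(age) == 15");
        c.put("minPr", "movie.getPrice() > 5.60");
        c.put("pricePlausible", "dscdPrice >= 5.00");
        return tasks(c, "pricePlausible", Arrays.asList(
            Arrays.asList("dscdReg", "minPr"),
            Arrays.asList("dscdJun", "minPr"),
            Arrays.asList("dscdSen", "minPr")));
    }

    public static void main(String[] args) {
        for (List<String> task : pricePlausibleTasks()) {
            System.out.println(String.join("\n", task));
            System.out.println();
        }
    }
}
```

Each printed group corresponds to one verification task that could be handed to a JML tool independently of the others.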

After translation, we can perform checks with any tool that understands JML and Java. We plan to use deductive verification as well as run-time assertion checking tools for every single program run. Depending on the result from the tools, disjuncts in the assuming are highlighted in different colors, as in Figs. 3, 4:

- white: Assertion unchecked
- red: Assertion is violated in some run
- green: Assertion is formally verified
- blue: All runs are fine, but verification only partial due to system limitation
- yellow: All runs are fine, but verification failed and gave a counter example
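
The color scheme can be read as a small decision function over the run-time result and the verification result. The sketch below is one possible reading; the enum names and the precedence (run-time results are consulted before the verification verdict) are assumptions made for illustration.

```java
// Hypothetical sketch of the highlighting scheme for disjuncts.
class HighlightSketch {
    enum Runs { UNCHECKED, VIOLATED, ALL_FINE }
    enum Verdict { VERIFIED, PARTIAL, COUNTEREXAMPLE }
    enum Color { WHITE, RED, GREEN, BLUE, YELLOW }

    static Color classify(Runs runs, Verdict verdict) {
        if (runs == Runs.UNCHECKED) return Color.WHITE; // nothing checked yet
        if (runs == Runs.VIOLATED) return Color.RED;    // violated in some run
        switch (verdict) {
            case VERIFIED:       return Color.GREEN;    // formally verified
            case COUNTEREXAMPLE: return Color.YELLOW;   // runs fine, proof failed
            default:             return Color.BLUE;     // runs fine, proof partial
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(Runs.ALL_FINE, Verdict.PARTIAL)); // prints BLUE
    }
}
```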

To demonstrate our approach, we wrote a script to translate the annotations of all three examples in this paper [8]. We successfully reproduced the respective assertion verification. We expect the performance of deductive verification tools to be practical, as a side benefit of the restricted syntax.

#### 4 Examples

We demonstrate the validation procedure with two examples. Example 3 is less algorithmic and oriented towards real-world software, where an expert familiar with the application domain is essential for validating a software's behavior. Example 2 features an implementation of int max(int[]), which produces incorrect results for certain inputs. We will find the bug in two validation steps.

Example 2. Fig. 3 displays an implementation of int max(int[]). It produces incorrect results for some inputs. However, we might not detect this immediately, as it gives correct results in the majority of cases. Moreover, it does not throw an exception, except when a.length==0. The supplementary material [8] contains a list of 100 random input arrays we used in the experiments. Each array contains between one and four random entries with values in [0, 100] (equally distributed). From that list, 11 of 100 runs give an invalid result.

Initially, the code in Fig. 3 is not annotated. Then we start the validation procedure. The set of initial goals consists of the return point of max(a), and we will, as usual, start there. The assistant chooses a program run, for example, the one corresponding to input a = {35,38,36,55}. Now the domain expert performs the first validation step. The expert observes that result 55 is correct. The expert slightly generalizes this: Whenever a.length == 4 and a[3] is greater than or equal to each of the other three elements, the result must be a[3]. Consequently, the expert adds assertion max3res and assumption max3of4 as in Fig. 3. Both names were chosen by the expert. Now the tool checks whether the assertion holds whenever max3of4 holds. It turns out we have six runs (of 100 input arrays) where the assumption max3of4 holds. For one run, the assertion is violated: for input {56,56,69,91}, the program outputs 69 instead of 91.

```
//@ assuming <max3res> or <max0res>;
int max(int a[]) {
    //@ <max3of4>assume a.length==4 && a[0]<=a[3] && a[1]<=a[3] && a[2]<=a[3];
    //@ <max0of1>assume a.length==1;
    //@ <max0of4>assume a.length==4 && a[1]<=a[0] && a[2]<=a[0] && a[3]<=a[0];
    int m = a[0];
    for (int k=0; k < a.length; k++) {
        if ( m < a[k]) {
            m = a[k++];
        }
    }
    //@ <max3res>assert m==a[3] assuming <max3of4>;
    //@ <max0res>assert m==a[0] assuming <max0of1> or <max0of4>;
    return m;
}
```

Since an invalid run was found, we are done. Observe that the domain expert merely scrutinized the initial program run, where the result was still correct. There are cases where more iterations are necessary. For example, the validation assistant could have started with a singleton array {70} or with array {81,73,26,15}. For either we would need two or more iterations as these program runs do not have any similarity with one of the 11 invalid runs. See Fig. 3 for the annotated program with further assumptions max0of1 and max0of4.
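
The runs mentioned in Example 2 can be replayed with a few lines of Java. The sketch below copies the method body from Fig. 3 unchanged (including its defect) and expresses assumption max3of4 and assertion max3res as ordinary boolean methods; the class name and the helper methods are hypothetical and only serve to emulate run-time assertion checking.

```java
import java.util.Arrays;

// Hypothetical harness that replays the inputs discussed in Example 2.
class MaxBugSketch {
    // Body as in Fig. 3: k is advanced twice whenever a new maximum is
    // found, so the element following that maximum is skipped.
    static int max(int[] a) {
        int m = a[0];
        for (int k = 0; k < a.length; k++) {
            if (m < a[k]) {
                m = a[k++];
            }
        }
        return m;
    }

    // Assumption max3of4 from Fig. 3.
    static boolean max3of4(int[] a) {
        return a.length == 4 && a[0] <= a[3] && a[1] <= a[3] && a[2] <= a[3];
    }

    // Assertion max3res, only required in runs where max3of4 holds.
    static boolean max3resHolds(int[] a) {
        return !max3of4(a) || max(a) == a[3];
    }

    public static void main(String[] args) {
        int[][] inputs = { {35, 38, 36, 55}, {56, 56, 69, 91}, {70}, {81, 73, 26, 15} };
        for (int[] a : inputs) {
            System.out.println(Arrays.toString(a) + " -> max = " + max(a)
                + ", max3res ok = " + max3resHolds(a));
        }
    }
}
```

For {56,56,69,91} the update at k == 2 also advances k past index 3, so 91 is never inspected and the method returns 69.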

Example 3. In a price calculation for cinema tickets, there are movies with different age restrictions and ticket prices, and there are several age groups with different discounts. The example might get much more complex with discount criteria such as happy hours, theme days, or vouchers. We conjecture that our validation approach works in these cases, too.

The relevant fragment of the ticket price calculation software is displayed in Fig. 4. Our initial goal is to validate program runs of the method nextTicket, starting from the termination points. The expert might first place an assertion in the called method calcDscdPrice, and then place the corresponding age group assumption in lines 5–7 of nextTicket.

There is a subtle bug which manifests in assertion pricePlausible (line 19) under assumption senior. Let's say the expert placed this assertion because of the cinema's policy to sell tickets for at least 5 €. Assertion minPr guarantees that the normal price for each movie is more than 5.60 €. This holds for all program runs but cannot be formally proven, because the implementation of Movie is outside the boundary of the system under validation. Accordingly, the corresponding assuming <> is highlighted blue. Going back to pricePlausible, assume movie 2 has price 5.70 in some program run. With the senior discount of 15%, the price becomes 5.70 · 0.85 ≈ 4.85, which is below 5.00; hence <dscdSen,minPr> is marked red.

```
 1 //@ assuming <pricePlausible> or <tooYoung>;
 2 public void nextTicket(Scanner input) {
 3   System.out.print("Enter age: ");
 4   int age = input.nextInt();
 5   //@ <junior>assume age < 16;
 6   //@ <regular>assume 16 <= age && age < 65;
 7   //@ <senior>assume 65 <= age;
 8   System.out.print("Select movie (1/2): ");
 9   int movieNumber = input.nextInt();
10   //@ <mv1>assume movieNumber == 1;
11   //@ <mv2>assume movieNumber == 2;
12   Movie movie = movies[movieNumber];
13   //@ <tooYoung>assert age < movie.getMinAge() assuming <junior,mv1>;
14   if (age < movie.getMinAge()) {
15     System.out.println("Too young for this movie.");
16     return;
17   }
18   double dscdPrice = calcDscdPrice(movie, age);
19   //@ <pricePlausible>assert dscdPrice >= 5.00
             assuming <dscdReg,minPr> or <dscdJun,minPr> or <dscdSen,minPr>;
20   System.out.printf("Your price: %.2f €\n", dscdPrice);
21 }
22 private double calcDscdPrice(Movie movie, int age) {
23   //@ <dscdReg>assert getDiscount(age) == 0 assuming <regular>;
24   //@ <dscdJun>assert getDiscount(age) == 10 assuming <junior>;
25   //@ <dscdSen>assert getDiscount(age) == 15 assuming <senior>;
26   //@ <minPr>assert movie.getPrice() > 5.60 assuming <>;
27   return movie.getPrice() * (1 - getDiscount(age)/100.0);
28 }
```
Fig. 4. Cinema Example.
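
The violation of pricePlausible for seniors can be reproduced with a small stand-alone calculation. In the sketch below, getDiscount is a hypothetical implementation (Fig. 4 does not show it) chosen only so that it satisfies assertions dscdReg, dscdJun and dscdSen; the price formula is the one from line 27, with the movie price passed in directly.

```java
// Hypothetical stand-alone version of the price calculation of Fig. 4.
class TicketPriceSketch {
    // Assumed implementation consistent with dscdJun, dscdReg and dscdSen.
    static int getDiscount(int age) {
        if (age < 16) return 10; // junior
        if (age < 65) return 0;  // regular
        return 15;               // senior
    }

    // Same formula as line 27 of Fig. 4.
    static double calcDscdPrice(double price, int age) {
        return price * (1 - getDiscount(age) / 100.0);
    }

    // Condition of assertion pricePlausible.
    static boolean pricePlausible(double price, int age) {
        return calcDscdPrice(price, age) >= 5.00;
    }

    public static void main(String[] args) {
        double price = 5.70; // satisfies minPr: price > 5.60
        for (int age : new int[]{10, 30, 70}) {
            System.out.printf("age %d: %.3f plausible=%b%n",
                age, calcDscdPrice(price, age), pricePlausible(price, age));
        }
    }
}
```

With a price of 5.70, the junior and regular cases stay above 5.00, while the senior case drops to roughly 4.845, which is the red disjunct <dscdSen,minPr>.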

#### 5 Conclusion and Related Work

We presented a procedure with which a software engineer validates program runs while iteratively generating a partial specification. This helps to find semantic bugs fast. The annotations can be re-used, for example, in regression verification.

Our validation procedure incorporates the use of verification and assertion checking tools. Assertion annotations have been in use since the 1970s [14]; verification has an even longer tradition. In contract-based verification [9], specifications are structured along method declarations, whereas our approach allows arbitrary dependencies via labeled assumptions, syntactically inspired by [4,5].

In [16], it is observed that in-house tests do not match the behavior of field program runs. Our approach directly validates the latter. Our validation is finished if every program run is covered by assertions highlighted in green/blue. This suggests an alternative to their proposed solution—generating test cases mimicking field runs [17].

Even if an assertion could not be formally verified (blue/yellow/red) we check it against said program field runs. We believe that this will suffice in our setting, without excluding future enhancements. Notably, there is an approach to generate test cases for partially unverified assertions [5].

An attempt to improve code reviews by animated symbolic execution is reported in [10]. In contrast, we guide an expert systematically through the code.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## SMC4PEP: Stochastic Model Checking of Product Engineering Processes

Hassan Hage<sup>1,2</sup>, Emmanouil Seferis<sup>1,3</sup>, Vahid Hashemi<sup>2</sup>, and Frank Mantwill<sup>1</sup>

<sup>1</sup> Helmut-Schmidt-University, Holstenhofweg 85, 22043 Hamburg, Germany hassan.hage@hsu-hh.de

<sup>2</sup> AUDI AG, Auto-Union-Straße 1, 85057 Ingolstadt, Germany

<sup>3</sup> Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany

Abstract. Product Engineering Processes (PEPs) are used for describing complex product developments in big enterprises such as those in the automotive and avionics industries. The Business Process Model and Notation (BPMN) is a widely used language to encode interactions among several participants in such PEPs. In this paper, we present SMC4PEP, a tool that converts graphical representations of a business process in the BPMN standard to an equivalent discrete-time stochastic control process called a Markov Decision Process (MDP). To this end, we first follow the approach described in an earlier investigation to generate a semantically equivalent business process which is more capable of handling the PEP complexity. In particular, the interaction between different levels of abstraction is realized by events rather than direct message flows. Afterwards, SMC4PEP converts the generated process to an MDP model described in the syntax of the probabilistic model checking tool PRISM. As such, SMC4PEP provides a framework for automatic verification and validation of business processes, in particular with respect to requirements from legal standards such as Automotive SPICE. Moreover, our experimental results confirm a faster verification routine due to the smaller MDP models generated from the alternative event-based BPMN models.

Keywords: Product Engineering Processes · Verification and validation · Probabilistic model checking · Markov decision processes · Probabilistic reward CTL.

## 1 Introduction

The ever-increasing technical challenges in products, for instance autonomous driving in the automotive industry, require *Original Equipment Manufacturers (OEMs)* to restructure their *Product Engineering Process (PEP)* from a mechanical-oriented to a system-oriented development to enable a rigorous verification and validation of its processes with respect to safety and non-safety requirements [5]. Additionally, legal authorities oblige OEMs to address consistency and traceability in their PEPs through compliance with standards such as *Automotive Software Process Improvement and Capability Determination (A-SPICE)* [21]. As the quality of a product depends on the quality of its processes [17], consistent and high-quality processes are required for adequately addressing technical challenges, legal compliance and customer satisfaction.

© The Author(s) 2022. E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 155–162, 2022. https://doi.org/10.1007/978-3-030-99429-7\_9

A well-known and widely used modelling language for processes in industrial PEPs is *Business Process Model and Notation (BPMN)* [7], which we refer to as *pool-based BPMN (*pBPMN*)*. pBPMNs provide different users with their internal process workflows in a graphical notation and show the communication and dependencies between different organizations within the PEP. To address the above-mentioned challenges, the previous work in [8] shows the need for a revision of the BPMN language, which is called *event-based BPMN (*eBPMN*)* in this paper. The processes, which are modelled according to the BPMN guidelines, are enriched with events and time symbols, while the message flows of all processes are removed. In this way, we capture time aspects such as milestones of PEPs, enable communication between processes on different levels of abstraction by means of events, determine the logical dependencies between processes, and finally remove process redundancies to ensure consistency and traceability in PEPs. These considerations on the process design motivated us to adopt eBPMN as the design language in SMC4PEP. We discuss later that eBPMN is more beneficial than its pBPMN counterpart because it yields smaller MDPs and hence enables a faster verification routine. The core part of SMC4PEP converts pBPMNs to eBPMNs while implicitly reducing the model size, which is done by removing redundant processes without losing information. As a by-product, it realizes consistency in PEPs by message passing on different levels of abstraction, which is not the case if pBPMN is used as the design language. Then, SMC4PEP converts the generated eBPMN to an equivalent MDP described in the syntax of the probabilistic model checking tool PRISM [15]. SMC4PEP not only ensures consistency in PEPs but also allows for automated verification of the generated MDPs against formal descriptions of requirements from legal standards such as A-SPICE.

### 2 Related Tools

There exist different tools for analyzing business processes. Due to the wide industrial use of the pBPMN standard, the most common tools for analyzing business processes use this graphical representation of processes as an initial model.

The work of Ou-Yang and Lin in [19] provides an approach to translate pBPMNs to the Modified BPEL4WS representation and then to the Colored Petri-net XML (CP-NXML), which can finally be verified by using CPN tools. This approach has restrictions in the support of split and merge conditions. The approaches of Daclin et al. in [1] and Mendoza Morales in [18] convert pBPMNs to a set of Timed Automata (TA) and use Clocked Computation Tree Logic (CCTL) for the verification. In the work of Lam in [16], pBPMNs are converted to the New Symbolic Model Verifier (NuSMV) language. NuSMV then enables an analysis of the processes using model checking techniques and the verification of properties expressed in Computation Tree Logic (CTL). The approaches discussed in [1, 16, 18, 19] do not consider probability distributions and non-deterministic choices of processes, which are required for complex processes such as PEPs. Duran et al. [3] use Rewriting Logic to enrich pBPMNs with timing and probabilistic properties. They verify stochastic properties such as synchronization time and probability distributions by means of the Parallel Statistical Model Checking And Quantitative Analysis (PVeStA) tool. However, message passing between different processes, especially on different levels of abstraction, is not considered. Finally, Herbert in [14] develops an algorithm for converting pBPMNs into MDPs, where resources like timing and probabilities are considered while message passing is performed between sub-processes. Nevertheless, the investigated processes are small, and hence message passing between large processes, in particular with different levels of abstraction, is not considered. Moreover, the process model is designed with less message passing and complexity to avoid the well-known state-space explosion in the generated MDP model, which consequently means that this approach is not applicable to complex processes like PEPs.

#### 3 SMC4PEP Architecture and Workflow

As shown in Fig. 1, SMC4PEP consists of three modules, namely: (I) Differentiator, (II) Converter and (III) Generator. The Differentiator determines whether the input model is a pBPMN or an eBPMN. If it is a pBPMN, the Converter automatically converts the process model to an eBPMN and then passes it to the Generator. Otherwise, with an eBPMN as input, SMC4PEP skips the Converter and moves automatically to the Generator. Finally, the Generator converts the eBPMN into an MDP described in the PRISM syntax, which can directly be analyzed in PRISM. The process of generating the output PRISM model consists of three steps, discussed as follows.

Fig. 1. Architecture of the tool SMC4PEP.

*Input.* SMC4PEP requires a business process model as input, with no limitation on the number of abstraction levels. Process models can be designed either according to the guidelines in [7] or [8] with different modelling tools such as Enterprise Architect [4]. Each process model needs to be exported as an XML document so that SMC4PEP can read it.

*SMC4PEP.* The Differentiator of SMC4PEP receives the input document and checks the content of the BPMN model based on the syntactic and semantic differences between eBPMN and pBPMN. According to [7], message passing between processes is performed by message flows from task to task of the associated sub-processes, while each sub-process obtains its own boundary called a pool. In the eBPMN approach, message flows and pools are eliminated [8] and each sub-process obtains its own diagram. Then the process is enriched by events to enable message passing between the processes. If a pBPMN is detected, the Differentiator triggers the Converter; otherwise the Converter is skipped and SMC4PEP automatically starts the Generator.

The Converter of SMC4PEP first analyzes the number of identical processes within the whole process model in order to remove redundant processes of the pBPMN that may occur on different levels of abstraction. A process is redundant if it is equal to a second process in all elements of the model, that is, in the number and content of tasks, the number and content of events, the number and content of gateways, the role/responsible person of the process, and the number and order of sequence flows. The definition of these elements is available in [7]. When equal processes are detected, SMC4PEP eliminates all of them apart from one. Afterwards, all pools of the process models are removed and each sub-process obtains its own diagram. Then, message flows are eliminated and replaced with events to ensure message passing and logical dependencies between the processes on different levels of abstraction. Note that the message passing of the removed processes is also taken into account, so that only one process enables communication between different levels of abstraction. Finally, the initial pBPMN model has been converted into an eBPMN and the Converter triggers the Generator.
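
The redundancy criterion used by the Converter can be illustrated with a short Java sketch. It is not the SMC4PEP implementation: the class and field names are hypothetical, BPMN elements are reduced to strings, list comparison is order-sensitive for all element kinds (an assumption), and the diagram name is assumed not to take part in the comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep one representative of each group of equal processes.
class RedundancySketch {
    static class Process {
        final String name; // diagram name, assumed not to be compared
        final String role;
        final List<String> tasks, events, gateways, sequenceFlows;

        Process(String name, String role, List<String> tasks, List<String> events,
                List<String> gateways, List<String> sequenceFlows) {
            this.name = name; this.role = role; this.tasks = tasks;
            this.events = events; this.gateways = gateways;
            this.sequenceFlows = sequenceFlows;
        }

        // Two processes are treated as equal if all compared elements agree.
        List<Object> key() {
            return Arrays.<Object>asList(role, tasks, events, gateways, sequenceFlows);
        }
    }

    // Eliminates all equal processes apart from the first one encountered.
    static List<Process> removeRedundant(List<Process> model) {
        Map<List<Object>, Process> unique = new LinkedHashMap<>();
        for (Process p : model) {
            unique.putIfAbsent(p.key(), p);
        }
        return new ArrayList<>(unique.values());
    }

    // Invented example data: the first two processes differ only in name.
    static List<Process> demoModel() {
        List<String> none = new ArrayList<>();
        return Arrays.asList(
            new Process("Test planning (level 1)", "Tester",
                Arrays.asList("plan test"), Arrays.asList("start", "end"), none,
                Arrays.asList("start->plan test", "plan test->end")),
            new Process("Test planning (level 2)", "Tester",
                Arrays.asList("plan test"), Arrays.asList("start", "end"), none,
                Arrays.asList("start->plan test", "plan test->end")),
            new Process("Release", "Manager",
                Arrays.asList("approve"), Arrays.asList("start", "end"), none,
                Arrays.asList("start->approve", "approve->end")));
    }

    public static void main(String[] args) {
        for (Process p : removeRedundant(demoModel())) {
            System.out.println(p.name);
        }
    }
}
```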

The Generator requires an eBPMN, which is provided either by the Differentiator or by the Converter. First, the process model is split into its individual diagrams. Afterwards, the Generator converts each diagram to an MDP, taking into account message passing on different levels of abstraction by events, probability distributions and non-deterministic choices. In the next step, the Generator of SMC4PEP generates a PRISM module list for each MDP model; these lists are then combined into one main PRISM module list. Finally, if a timeline [8] is available in the process model, the PRISM module list is enriched with the values of the timeline to consider time aspects and process execution costs as rewards in the MDP model described in the PRISM syntax.
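
To give an impression of the kind of text the Generator has to produce, the following Java sketch emits a single PRISM module for a tiny MDP. It is an illustration only: the class, the action names and the example process are invented, a single state variable s is assumed, and rewards as well as the combination of several modules are omitted. Only basic PRISM constructs (the mdp keyword, module ... endmodule, and guarded commands with probabilistic updates) are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: printing a small MDP in the PRISM modelling language.
class PrismEmitterSketch {
    static class Command {
        final String action;  // empty string for an unlabeled command
        final int from;       // source state
        final double[] probs; // branch probabilities, summing to 1
        final int[] targets;  // target state of each branch

        Command(String action, int from, double[] probs, int[] targets) {
            this.action = action; this.from = from;
            this.probs = probs; this.targets = targets;
        }
    }

    static String emit(String module, int maxState, List<Command> commands) {
        StringBuilder sb = new StringBuilder();
        sb.append("mdp\n\n");
        sb.append("module ").append(module).append("\n");
        sb.append("  s : [0..").append(maxState).append("] init 0;\n");
        for (Command c : commands) {
            sb.append("  [").append(c.action).append("] s=").append(c.from).append(" -> ");
            for (int i = 0; i < c.probs.length; i++) {
                if (i > 0) sb.append(" + ");
                sb.append(c.probs[i]).append(":(s'=").append(c.targets[i]).append(")");
            }
            sb.append(";\n");
        }
        sb.append("endmodule\n");
        return sb.toString();
    }

    static String demo() {
        List<Command> cs = new ArrayList<>();
        // Two commands enabled in state 0 model a non-deterministic choice.
        cs.add(new Command("review", 0, new double[]{1.0}, new int[]{1}));
        cs.add(new Command("skip", 0, new double[]{1.0}, new int[]{2}));
        // Probabilistic outcome of the review: accept or start over.
        cs.add(new Command("", 1, new double[]{0.9, 0.1}, new int[]{2, 0}));
        // Self-loop in the final state so that it is not a deadlock.
        cs.add(new Command("", 2, new double[]{1.0}, new int[]{2}));
        return emit("process", 2, cs);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```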

*Output.* SMC4PEP saves the generated MDP model, described in the syntax of PRISM, as a DAT document which can be loaded into the probabilistic model checker PRISM. It is worth mentioning that a number of other tools are able to read the PRISM modelling language. Among others, the model checkers Storm [2], PARAM [10], ePMC [11] and Modest [12] can read our generated PRISM model to model check various properties of interest.

### 4 Case Studies

For the evaluation of SMC4PEP, we converted two different use cases with SMC4PEP. Beforehand, we developed an algorithm inspired by the work of [14] to convert a pBPMN directly into an MDP. Note that this conversion is not applicable to complex processes with different levels of abstraction. Complexity here means a higher number of message passings between processes, probability distributions and non-deterministic choices. Therefore, for the evaluation we assumed that in pBPMN a communication between different levels of abstraction is possible by merging all diagrams into one main diagram, although this is not the case in real processes. This assumption is made to obtain the MDP sizes of the pBPMN. In this way, the MDP sizes generated from a pBPMN and an eBPMN model can be compared and the effectiveness of the eBPMN can be assessed. The first use case describes the process of testing an autonomous park pilot with three levels of abstraction and includes five roles, where each role performs its associated task of the process. The second use case handles a more complex process of an urgent request for a change of the vehicle construction during the PEP. In total, this use case extends over four levels of abstraction and includes eleven roles. Both use cases are provided by an automotive OEM. We ran all experiments on a Core i7 laptop running Windows 10.

Table 1 summarizes the results obtained with SMC4PEP. For the first use case with two levels of abstraction, the MDP model generated from the eBPMN has 33.8% fewer states and 40.7% fewer transitions than the one generated from the pBPMN. Moreover, at the third level of abstraction, the MDP model generated from the eBPMN has 67.78% fewer states and 73.11% fewer transitions than the one generated from the pBPMN. The build time of the MDP model for the eBPMN with three levels of abstraction is higher than for the pBPMN. Note that the MDP model is built only once, so the build time has no impact on the run-time of model checking the MDP. This is typical when a formalism such as an MDP is generated once from a very large BPMN model and then used several times to model check various properties. The generated MDP models of use case two with four levels of abstraction are large compared to the first use case due to the high number of activities, probability distributions and non-deterministic choices of the processes. Nevertheless, the effectiveness of the eBPMN for complex processes is strongly confirmed by the generated MDP size of the second use case on four levels of abstraction, which is far smaller than the MDP size of the pBPMN. Finally, our generated MDP models from eBPMN are much smaller than those obtained with the approach discussed in [14]. In particular, for the second use case we obtained a reduction in model size of several orders of magnitude, which is significant for an efficient model checking routine. However, similar to [14], we also observe the state space explosion problem, which can be alleviated using bisimulation minimization techniques [6,9,13].


Table 1. Results of the analyzed processes.

Finally, we use the PRISM tool to model check some properties of interest described in the *Probabilistic Reward Computation Tree Logic (PRCTL)* [20]. Note that for SMC4PEP we provide the first use case as an eBPMN to capture time and cost aspects of the PEP by a timeline, while the second use case is described first in pBPMN and then converted to eBPMN. First, we verify some properties based on the A-SPICE guidelines [21], denoted by ϕ<sub>1</sub>, ϕ<sub>2</sub> and ϕ<sub>3</sub>. The properties are taken from the *Generic Practice (GP)* of A-SPICE Level 2 [21], where each level of A-SPICE determines the quality of the processes. Property ϕ<sub>1</sub> denotes GP 2.1.7 of A-SPICE, which requires that there are no deadlocks in the processes and that the final state of the process is reached with a probability of 100%. Additionally, ϕ<sub>2</sub> denotes GP 2.1.2, which ensures the ability of performing the process to fulfil the identified objectives, similar to ϕ<sub>1</sub>. Moreover, ϕ<sub>3</sub> denotes GP 2.1.3, through which we ensure that our process does not deviate from its original setting according to A-SPICE. Finally, for use case one, the non-functional properties are denoted by ϕ<sub>4</sub>, which delivers the minimum number of days (d) for performing the whole process, and by ϕ<sub>5</sub>, which gives the expected cost of the process in accumulated working days (wd). Note that ϕ<sub>4</sub> is obtained with the GUI simulator of PRISM. The results of the property verification obtained from PRISM are depicted in Table 2.

## 5 Conclusion

In this paper we presented the new tool SMC4PEP. In its first phase, it enables an automated conversion of complex process models such as PEPs that are modelled according to the BPMN standard [7] into revised process models based on the modelling approach of [8]. This conversion paves the way for consistency and traceability of complex PEPs by removing redundant processes and enabling an exchange between different levels of process abstraction. In the second phase, SMC4PEP converts the new process model into an MDP to capture stochastic properties of a PEP and to enable an automated verification of the MDP using PRISM against formal descriptions of requirements. When a new PEP is designed based on [8], SMC4PEP also considers the timeline of processes to capture time and cost aspects of a PEP that are essential for developing a new product, in particular in the automotive and avionics industries. Finally, we demonstrated the effectiveness of our tool in an automotive case study where we compared pBPMNs with eBPMNs and verified some properties of interest such as legal regulations from A-SPICE.

Acknowledgments. This work is supported by the Helmut-Schmidt-University in Hamburg and by the AVAI project at AUDI AG in Ingolstadt.

#### References





## Symbolic Predictive Cache Analysis for Out-of-Order Execution

Zunchen Huang () and Chao Wang

University of Southern California, Los Angeles CA 90089, USA {zunchenh, wang626}@usc.edu

Abstract. We propose a trace-based symbolic method for analyzing cache side channels of a program under a CPU-level optimization called out-of-order execution (OOE). The method is predictive in that it takes the in-order execution trace as input and then analyzes all possible out-of-order executions of the same set of instructions to check if any of them leaks sensitive information of the program. The method has two important properties. The first one is accurately analyzing cache behaviors of the program execution under OOE, which is largely overlooked by existing methods for side-channel verification. The second one is efficiently analyzing the cache behaviors using an SMT solver based symbolic technique, to avoid explicitly enumerating a large number of out-of-order executions. Our experimental evaluation on C programs that implement cryptographic algorithms shows that the symbolic method is effective in detecting OOE-related leaks and, at the same time, is significantly more scalable than explicit enumeration.

Keywords: program analysis · out-of-order execution · side channel · SMT solver

### 1 Introduction

There has been growing interest in recent years in detecting side-channel leaks in software using automated program analysis and verification techniques, due to the increased awareness of the threat of real-world side-channel attacks [4,15,18]. These are side-channel attacks because they exploit dependencies between sensitive information of the program and non-functional properties of the computing platform, including cache-related timing variations caused by CPU-level optimizations such as pipelining and branch prediction. While there are existing methods for detecting these side channels based on static analysis [6,28,31] and symbolic execution [3, 10–12, 29], they do not accurately model an important CPU-level optimization called out-of-order execution (OOE).

Fig. 1. Spreca – symbolic predictive analysis for out-of-order execution.

Out-of-order execution is widely adopted by modern CPUs. It is possible for a program to be free of side-channel leaks when instructions are executed in the program order but have leaks when they are executed out of order. Here, the program order refers to the order in which instructions appear in the program. However, modeling out-of-order execution during program analysis is a challenging task due to the inherently large number of possible scenarios that must be considered. Generally speaking, instructions within a fixed window (an imaginary window used to model the effect of hardware features including the reorder buffer, issue queue, and load-store queue) may be executed in any order as long as it respects the semantics of the program. Thus, given N instructions, the number of possible execution orders can be as large as O(N!). Since it is practically intractable to examine these execution orders individually, existing methods had to choose from the following two undesired outcomes: if they over-approximate, they may report bogus leaks since some infeasible execution orders will be included; but if they under-approximate, they may miss real leaks since some feasible execution orders will be excluded.
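
The size of this search space, and the pruning effect of program dependencies, can be made concrete with a small counting routine. The Java sketch below counts, by explicit enumeration, the orders of N instructions that respect a given dependency relation. It only illustrates the combinatorics: the class and method names are hypothetical, the fixed window is ignored, and explicit enumeration is exactly what a symbolic method is meant to avoid.

```java
// Hypothetical sketch: counting execution orders that respect dependencies.
class ReorderCountSketch {

    // deps[i][j] == true means instruction i must execute before instruction j.
    static long countOrders(int n, boolean[][] deps) {
        return count(n, deps, new boolean[n], 0);
    }

    private static long count(int n, boolean[][] deps, boolean[] done, int placed) {
        if (placed == n) return 1;
        long total = 0;
        for (int j = 0; j < n; j++) {
            if (done[j]) continue;
            boolean ready = true;
            for (int i = 0; i < n; i++) {
                if (deps[i][j] && !done[i]) { ready = false; break; }
            }
            if (ready) {
                done[j] = true;
                total += count(n, deps, done, placed + 1);
                done[j] = false;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 4;
        boolean[][] none = new boolean[n][n];
        boolean[][] pairs = new boolean[n][n];
        pairs[0][1] = true; // instruction 1 depends on instruction 0
        pairs[2][3] = true; // instruction 3 depends on instruction 2
        System.out.println(countOrders(n, none));  // prints 24
        System.out.println(countOrders(n, pairs)); // prints 6
    }
}
```

Without dependencies all 4! = 24 orders are feasible; two independent dependency pairs already reduce this to 6.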

To solve the aforementioned problem, we propose a trace-based symbolic predictive analysis to accurately and efficiently analyze the OOE related cache behaviors. Here, accurately means that our method does not over- or under-approximate the OOE behaviors but precisely encodes these behaviors as a set of logical constraints; efficiently means that our method avoids enumerating the out-of-order executions explicitly to avoid the exponential blowup; instead it leverages an off-the-shelf SMT solver to conduct a symbolic analysis of the logical constraints. Our method is predictive in that, given an in-order execution trace of the program, it analyzes the cache behaviors of all out-of-order executions of the instructions that appeared in the in-order execution, instead of executing them.

Fig. 1 shows the overall flow of our method, named Spreca, which takes an annotated C program as input; the annotation marks program inputs as either public or private (secret). Internally, our method has three steps. In the first step, it utilizes the LLVM compiler to parse the C program, compute the program dependencies, and use the information to instrument the LLVM bit-code. The instrumented program, at run time, can generate the in-order execution trace. In the second step, our method encodes the set of all possible OOE related cache behaviors as a set of logical constraints, to be solved by an off-the-shelf SMT solver. In the third step, our method checks if there are secret-dependent divergent cache behaviors, e.g., an out-of-order execution causing a cache hit for one value of the secret variable but a cache miss for another value of the secret variable.
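
What a secret-dependent divergent cache behavior looks like can be shown on a toy example. The following Java sketch simulates a direct-mapped cache with two lines for three loads, one of which uses a secret-dependent address. It is an explicit simulation for illustration, not the SMT encoding of the method: the cache geometry, the address layout and all names are assumptions, and the loads are assumed to be independent so that they may be reordered.

```java
import java.util.Arrays;

// Hypothetical toy model: hit/miss patterns under two execution orders.
class CacheDivergenceSketch {
    static final int LINES = 2; // direct-mapped cache with two lines

    // Load 0 reads table[k] at address 2 + k (k is the secret, 0 or 1);
    // loads 1 and 2 both read a variable x at address 0.
    static int address(int instr, int k) {
        return instr == 0 ? 2 + k : 0;
    }

    // Executes the loads in the given order and records, per instruction,
    // whether it hit (true) or missed (false) in the cache.
    static boolean[] simulate(int[] order, int k) {
        int[] cache = new int[LINES];
        Arrays.fill(cache, -1); // -1 marks an empty line
        boolean[] hit = new boolean[order.length];
        for (int instr : order) {
            int addr = address(instr, k);
            int line = addr % LINES;
            hit[instr] = cache[line] == addr;
            cache[line] = addr;
        }
        return hit;
    }

    // True if the hit/miss pattern differs between the two secret values.
    static boolean leaks(int[] order) {
        return !Arrays.equals(simulate(order, 0), simulate(order, 1));
    }

    public static void main(String[] args) {
        System.out.println(leaks(new int[]{0, 1, 2})); // in-order: prints false
        System.out.println(leaks(new int[]{1, 0, 2})); // reordered: prints true
    }
}
```

In program order, the second read of x always hits. If the secret-dependent load is executed between the two reads of x, it evicts x only for k = 0, so the last load misses for one secret value and hits for the other.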

The main contribution of our work is symbolically modeling the OOE related cache behaviors accurately and efficiently. We design the SMT encoding (to be presented in Section 5) carefully to make it compact. For example, a straightforward encoding of all possible permutations of N instructions would lead to an SMT formula of size O(N<sup>3</sup> ), since any instruction may have any other instruction as its predecessor and, as a result, the update function must be encoded for each predecessor's cache state and the current cache state. Our method, in contrast, avoids most of these update functions by leveraging the program dependency relations recorded in the in-order execution trace to prune away the infeasible permutations.

Our method differs significantly from the method of Guo et al. [10,11] based on symbolic execution. While their method also uses symbolic analysis, they only made the program input symbolic, whereas the out-of-order executions are still enumerated explicitly (this is evident based on their use of a technique designed for speeding up explicit enumeration, called partial order reduction). In other words, for each out-of-order execution, they had to generate an SMT formula to check if it has divergent cache behaviors; as a result, they did not avoid the exponential blowup. In contrast, our method generates a single SMT formula to encode all possible out-of-order executions associated with the in-order execution. In addition to being more efficient, our single-formula based encoding can be more easily adapted to model other CPU-level optimizations by slightly modifying how dependencies are encoded as logical constraints.

We have implemented our method in a software tool by leveraging the open-source LLVM compiler [17] and the Z3 SMT solver [19]. Specifically, we use LLVM to parse the C program, compute the program dependencies, and instrument the bit-code, to generate the in-order execution trace at run time. We use Z3 to implement symbolic analysis of the out-of-order executions. We evaluated our method on a set of C programs from OpenSSL that implement well-known block ciphers and cryptographic hash functions. The experimental results show that our method, by accurately modeling the OOE-related cache behaviors, can detect OOE-related side-channel leaks that otherwise would have been missed. The results also show that our SMT solver-based symbolic analysis is significantly more scalable than explicit enumeration.

To summarize, this paper makes the following contributions:


The remainder of this paper is organized as follows. First, we motivate our work using examples in Section 2. Then, we provide the technical background in Section 3. Next, we present our method in Sections 4 and 5, followed by the experimental results in Section 6. We review the related work in Section 7. Finally, we give our conclusions in Section 8.

### 2 Motivation

In this section, we use examples to illustrate the cache behaviors of the in-order execution and an out-of-order execution. We also explain the high-level idea of our trace-based symbolic analysis.

#### 2.1 The Example Program

Fig. 2 shows the code snippet which, for ease of presentation, is written in a mixture of C and simplified assembly language. Here, assume i ∈ {0, 1, 2} is a secret variable and each array element A[i] occupies 4 bytes in memory. Furthermore, while our method handles realistic cache size and configurations, in this motivating example, we assume the cache has only one set, consisting of 3 cache lines, with each cache line holding only 4 bytes. We assume the cache is fully associative, and uses the LRU (least recently used) replacement policy. Under these assumptions, each array element A[i] occupies an entire cache line.

```
1 load A[0];
2 load A[1];
3 load A[2];
4 store A[i]; /* Can the secret value i affect the cache behavior? */
5 load B;
```
Fig. 2. An example program where the value of i is a secret.

#### 2.2 The Execution Order

The order in which instructions are written in a program is called the program order. During the in-order execution, instructions are executed according to their program order. Without loss of generality, we assume that there are two types of instructions: memory-related instructions such as Load and Store, and non-memory-related instructions, such as ALU and branch instructions. As far as this work is concerned, our focus is on memory-related instructions because non-memory instructions do not affect cache behavior<sup>1</sup>.

Fig. 3 compares the in-order execution on the left with a possible out-of-order execution on the right. The out-of-order execution is a permutation of instructions of the in-order execution that, at the same time, must respect the semantics of the original program. In both of these two execution traces, each row represents an instruction and its associated memory address. Note that while a program may have if-else statements and thus multiple paths, an execution trace corresponds to only one program path.

<sup>1</sup> Non-memory instructions may impose ordering constraints over memory-related instructions. These constraints are computed by our method, and used to constrain the analysis of out-of-order executions; details are in Section 4.


Fig. 3. Two execution orders of the example program in Fig. 2.

#### 2.3 The Cache State

Given a program execution, regardless of whether it is the in-order execution or one of the out-of-order executions, it is straightforward to compute the changes of the cache state at each step. The cache state of our running example can be defined as a tuple $S = \langle Age(A[0]), Age(A[1]), Age(A[2]), Age(B) \rangle$, consisting of the ages of the cache lines associated with the four program variables. Since we assume that the cache holds at most 3 variables (lines) at any moment, if a variable is inside the cache, its age must be 0, 1, or 2; and if it is evicted from the cache, its age must be 3. Initially, the cache state is $S_0 = \langle -1, -1, -1, -1 \rangle$, where $-1$ is a special symbol meaning the variable has not been loaded into the cache yet.

**In-Order Cache Behavior.** As shown in Fig. 4 for the in-order execution, executing the first instruction load A[0] changes the cache state from $S_0$ to $S_{I_1} = \langle 0, -1, -1, -1 \rangle$, where $S_{I_1}$ is the cache state after executing $I_1$. That is, variable A[0] now occupies the youngest cache line. Similarly, after executing the first three instructions, the cache state becomes $S_{I_3} = \langle 2, 1, 0, -1 \rangle$, meaning that A[2] occupies the youngest cache line and A[0] occupies the oldest cache line. Thus, executing the instruction store A[i] results in a cache hit regardless of whether i = 0, 1, or 2. At this moment, the age of variable B remains $-1$ since it has not yet been loaded into the cache.


Fig. 4. Cache behavior of the in-order execution does not depend on the secret value i; that is, for all i = 0, 1, 2, accessing A[i] results in a cache hit.

**Out-of-Order Cache Behavior.** There can be many out-of-order executions, or permutations of instructions, corresponding to an in-order execution. While they must preserve the semantics of the in-order execution, they do not have to preserve its cache behavior. Thus, even if the in-order execution does not have divergent cache behaviors (with respect to a secret variable), one of the out-of-order executions may have divergent cache behaviors. As shown in Fig. 5 for this particular out-of-order execution that reorders store A[i] and load B, when i ≠ 0, accessing A[i] results in a cache hit, but when i = 0, it results in a cache miss, because load B has already evicted A[0], the oldest line.


Fig. 5. Cache behavior of the out-of-order execution depends on the secret value i; that is, accessing A[i] results in a cache hit when i ≠ 0 but a cache miss when i = 0.
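The two executions of Figs. 4 and 5 can be replayed in a few lines of Python. This is a minimal sketch, assuming the 3-line fully associative LRU cache of Section 2.1; the helper names `access_hits` and `status_of_store` are illustrative and are not part of Spreca.

```python
def access_hits(order, cache_lines=3):
    """Replay a sequence of memory blocks on an LRU cache; return hit/miss per access."""
    resident = []                      # most recently used block is last
    hits = []
    for block in order:
        hit = block in resident
        hits.append(hit)
        if hit:
            resident.remove(block)
        elif len(resident) == cache_lines:
            resident.pop(0)            # evict the least recently used block
        resident.append(block)
    return hits


def status_of_store(i, out_of_order):
    """Hit (True) or miss (False) of `store A[i]` for a given secret value i."""
    loads = ["A[0]", "A[1]", "A[2]"]
    store = f"A[{i}]"
    # Fig. 5 swaps the last two instructions: load B runs before store A[i].
    order = loads + (["B", store] if out_of_order else [store, "B"])
    return access_hits(order)[order.index(store, 3)]


in_order = {i: status_of_store(i, out_of_order=False) for i in range(3)}
reordered = {i: status_of_store(i, out_of_order=True) for i in range(3)}
print(in_order)    # the store hits for every secret value
print(reordered)   # the store misses only for i = 0
```

In the reordered run, load B evicts A[0] before the store executes, so only i = 0 misses.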

#### 2.4 The Side-channel Leak

Whenever the cache behavior of an execution (regardless of whether it is the in-order execution or an out-of-order execution) depends on the value of a secret variable, it is called a side-channel leak. This is a security risk because, in modern CPUs, a cache hit only takes 1-3 CPU cycles whereas a cache miss may take up to a hundred CPU cycles. By observing the difference in the execution time of a victim program, the attacker may be able to deduce a certain amount of information about the secret.

In our running example, since store A[i] is dependent on the value of the secret variable i, we need to check if executing store A[i] leads to divergent cache behaviors. During the in-order execution, the answer is no, since it results in a cache hit for all i = 0, 1, and 2. Thus, the in-order execution has no side-channel leak. During one of the out-of-order executions, however, the answer is yes, since it results in a cache hit for some value of i but a cache miss for some other value of i. Thus, the out-of-order execution has a leak.

Generally speaking, there are two types of side-channel analysis techniques: approximate and accurate. While over- or under-approximation may be fast, it leads to poor results, i.e., reporting bogus leaks or missing real leaks. Thus, we are only concerned with accurate analysis techniques. In this context, while it is possible to examine each individual out-of-order execution, it will lead to exponential blowup. Our method, in contrast, encodes the cache behaviors of all out-of-order executions in a single logical formula. The formula is then solved using an efficient, off-the-shelf SMT solver to avoid an exponential blowup.

### 3 Preliminaries

In this section, we present the technical background related to our analysis of the out-of-order executions and divergent cache behaviors.

Fig. 6. The instruction window and the different execution orders.

#### 3.1 The Execution Model

Recall that modern CPUs may execute instructions of a program in any order as long as the end result remains the same. The default order is the program order, i.e., the order in which instructions appear in the program. For performance reasons, however, the CPU does not always follow the program order, because some instructions may be significantly slower than others and, instead of waiting for the slower instructions to complete, the CPU may choose to execute some subsequent instructions as long as the program semantics is preserved.

**Instruction Window.** As shown in Fig. 6, we use an imaginary instruction window to abstract the behavior of various hardware components inside the CPU for supporting out-of-order execution. The size of this instruction window depends on the CPU, including but not limited to the sizes of its reorder buffer, issue queue, and load-store queue. For this work, however, there is no need to delve into the hardware details. Instead, it suffices to assume that within this imaginary window of N instructions, the CPU may choose any execution order as long as the end result remains the same.

**Data Hazards.** To make sure that the end result remains the same, only the out-of-order executions that respect the data dependencies of the original program are allowed. In the computer architecture literature, violations of such dependencies are called hazards. Specifically, there are three types of hazards, named RAW (read after write), WAR (write after read), and WAW (write after write), respectively. It is worth noting that RAR (read after read) is not a hazard.
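The hazard types can be captured by a small classifier over two accesses to the same address. This is an illustrative sketch only; the function name `hazard` and the string encoding of operations are assumptions made for the example.

```python
def hazard(first_op, second_op):
    """Classify the hazard between two accesses to the same address.

    `first_op` precedes `second_op` in program order; each is "load" or "store".
    Returns "RAW", "WAR", "WAW", or None for read after read.
    """
    if first_op == "store" and second_op == "load":
        return "RAW"   # the later read needs the value of the earlier write
    if first_op == "load" and second_op == "store":
        return "WAR"   # the later write must not overtake the earlier read
    if first_op == "store" and second_op == "store":
        return "WAW"   # the two writes must keep their relative order
    return None        # read after read: the pair may be reordered freely


print(hazard("load", "store"))   # WAR
print(hazard("load", "load"))    # None
```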

#### 3.2 The Cache Model

Without loss of generality, we assume the cache has K cache lines in total and each cache line has 64 bytes. The cache lines are further divided into M sets, which means each set has (K/M) cache lines. The memory is also divided into 64-byte blocks, each of which is mapped to a unique set. Within the same set, however, the 64-byte block may occupy any of the cache lines. Thus, within the set, it is called fully associative; overall, the entire cache is called set associative. In this context, a fully associative cache is a special case (K-way set associative), while a direct mapped cache is another special case (1-way set associative).
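As a small illustration of this cache model, the sketch below maps an address to its set and computes the number of lines per set. The text only states that each block maps to a unique set; the modulo placement used here is a common choice and is an assumption, as are the function names.

```python
LINE_BYTES = 64


def cache_set(addr, num_sets):
    """Set index of the 64-byte block containing addr (modulo placement assumed)."""
    block = addr // LINE_BYTES
    return block % num_sets


def ways(total_lines, num_sets):
    """Lines per set, i.e. K/M in the notation above."""
    return total_lines // num_sets


# With K = 128 lines: one set is fully associative, 128 sets is direct mapped.
print(ways(128, 1), ways(128, 128))
# Two addresses in the same 64-byte block always fall into the same set.
print(cache_set(0x77ef5bd0, 128) == cache_set(0x77ef5bd4, 128))
```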

**The Cache State.** The cache state is a tuple $S_I = \langle Age(v_1), \ldots, Age(v_n) \rangle$, where each $v_i \in Vars$ ($1 \le i \le n$) is a variable in the program, $Age(v_i)$ is the age of the cache line associated with $v_i$, and $Vars$ is the set of all variables. Here, we use the subscript in $S_I$ to indicate that it is the cache state resulting from executing the instruction $I$. Assume that $K$ is the number of cache lines in a set. The domain of $Age(v_i)$ is $\{0, 1, \ldots, K, -1\}$, where an age from 0 to $K - 1$ means the variable is inside the cache, while $K$ means the variable is evicted from the cache and $-1$ means it has never been loaded into the cache.

We assume that the cache uses the LRU (least recently used) replacement policy. Given a cache state $S_I$ and an instruction $I'$, the new cache state $S_{I'}$ is computed by the $Update(S_I, I')$ function. Assuming that $v \in Vars$ is the variable used by the instruction $I'$, $u_1 \in Vars$ is another variable whose age was younger than $v$ in $S_I$, and $u_2 \in Vars$ is yet another variable whose age was older than $v$ in $S_I$, we compute the new cache state $S_{I'} = \langle Age'(v_1), \ldots, Age'(v_n) \rangle$ as follows:

– $Age'(v) = 0$;

– $Age'(u_1) = Age(u_1) + 1$;

– $Age'(u_2) = Age(u_2)$.

That is, the most recently used variable ($v$) occupies the youngest cache line, any variable ($u_1$) whose age was younger than $v$ in $S_I$ increases its age by 1, and any variable ($u_2$) whose age was older than $v$ in $S_I$ keeps its age unchanged.
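The update rule can be written as a short Python function. This is a sketch under stated assumptions rather than the tool's implementation: the rule above covers a variable that is already resident, so treating a non-resident variable (age $K$ or $-1$) as older than every resident line is an assumption, and the name `update` and the dictionary representation are illustrative.

```python
def update(state, v, assoc):
    """Apply one access to variable v under LRU and return the new age map.

    Ages follow the text: 0..assoc-1 resident, assoc evicted, -1 never loaded.
    Assumption: a non-resident v counts as older than every resident line.
    """
    old = state[v]
    v_resident = 0 <= old < assoc
    new = {}
    for u, age in state.items():
        if u == v:
            new[u] = 0                    # most recently used line
        elif 0 <= age < assoc and (not v_resident or age < old):
            new[u] = age + 1              # younger lines age by one
        else:
            new[u] = age                  # older, evicted, or never loaded
    return new


VARS = ["A[0]", "A[1]", "A[2]", "B"]
state = {v: -1 for v in VARS}             # S_0 of the running example
for v in ["A[0]", "A[1]", "A[2]"]:
    state = update(state, v, assoc=3)
print(state)                              # ages 2, 1, 0, -1 as in Section 2.3
```

Applying `update(state, "B", 3)` next pushes A[0] to age 3, which is the eviction behind the miss in Fig. 5.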

#### 3.3 The Side-channel Leak Condition

Whenever there is a dependency between the secret and some divergent cache behaviors of an execution, there is a side-channel leak. Thus, there are two requirements. First, there must be divergent cache behaviors, i.e., a memory-related instruction that causes a cache miss for some input value but a cache hit for some other input value. Second, the input value causing divergent cache behaviors must be a secret, e.g., a password, security token, or cryptographic key.

Thus, the side-channel leak condition can be defined as follows:

$$\exists\, E, I, v_1, v_2 \,.\; CacheStatus(E, I, v_1) \neq CacheStatus(E, I, v_2)$$

Here, $E$ denotes an execution, and $I \in E$ is an instruction in $E$; $v_1$ and $v_2$ are two values of a secret variable $v_s \in Vars$; and $CacheStatus(E, I, v_s)$ is a function that returns the cache status (hit or miss) when instruction $I$ is executed in $E$ using $v_s$.
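As a worked instance, consider the running example of Section 2. Writing $E_{ooe}$ for the out-of-order execution of Fig. 5 and $E_{ino}$ for the in-order execution (both names are introduced here only for illustration), the condition is witnessed by $I = I_4$ with the secret values $v_1 = 0$ and $v_2 = 1$:

$$CacheStatus(E_{ooe}, I_4, 0) = \text{miss} \;\neq\; \text{hit} = CacheStatus(E_{ooe}, I_4, 1)$$

whereas no witness exists for the in-order execution, since

$$CacheStatus(E_{ino}, I_4, v) = \text{hit} \quad \text{for all } v \in \{0, 1, 2\}.$$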

### 4 Analyzing the In-Order Execution

In this section, we present our method for generating, and then analyzing the in-order execution trace. There are two tasks. The first one is to compute the dependencies of memory-related instructions. The second one is to compute the default cache states. Both the dependencies and the default cache states will be used during our symbolic analysis of the out-of-order executions.

#### 4.1 Computing the Dependencies

There are two types of dependencies associated with the in-order execution of a program: explicit dependencies and implicit dependencies.

**Explicit Dependencies.** Explicit dependencies refer to data conflicts that can be directly observed during the execution, by looking at the actual addresses of memory blocks used by the instructions at run time. Consider the in-order execution example in Fig. 3 (left). Since both instructions $I_4$ and $I_1$ access the memory block at the address 0x77ef5bd0, and at least one of them is a store operation, these two instructions have an explicit dependency; that is, they cannot be reordered during out-of-order execution.


Fig. 7. Example implicit dependency that cannot be observed in the execution trace.

**Implicit Dependencies.** Implicit dependencies, on the other hand, refer to data conflicts that cannot be directly observed during the in-order execution. Fig. 7 shows an example. The code snippet shows that store A[1] is dependent on load A[0], through the def-use chain of (register) variables r1-r3. Since non-memory instructions (mul, add, mov in this example) do not show up in the logged execution trace, their constraints on the memory instructions would be lost if we did not compute and record them explicitly in the execution trace.

In our method, we compute the implicit dependencies by statically analyzing the LLVM bit-code of the program before instrumenting the bit-code to add self-logging capabilities. Then, we execute the instrumented code to obtain the trace. As a result, the implicit dependencies will be captured in the execution trace as a special relation ($DEP_{sta}$). Static program analysis has a global view of the program and thus is well suited for computing the implicit dependencies. Inside LLVM, the bit-code is represented in Static Single Assignment (SSA) form, meaning each variable is defined only once, which makes it possible to efficiently compute the implicit dependencies [20].

In addition to the implicit dependencies ($DEP_{sta}$) computed by static analysis, we also compute the explicit dependencies ($DEP_{dyn}$) based on the actual addresses that appear in the execution trace: for each memory address, instructions that use the address are checked to see if they have data hazards (RAW, WAR, or WAW). For instructions that have data hazards, their relative execution order during in-order execution cannot be violated; otherwise, the original program semantics may be changed.

Given both the statically computed $DEP_{sta}$ and the dynamically computed $DEP_{dyn}$, we compute their transitive closure to obtain $DEP = (DEP_{sta} \cup DEP_{dyn})^{*}$, which represents the complete set of dependency constraints that must be respected at all times, to ensure that the out-of-order executions examined by our symbolic analysis are feasible.
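The construction of the dependency relation can be sketched as follows. This is a minimal illustration, not the tool's code: instructions are identified by 0-based trace positions, each trace entry is an assumed `(operation, address)` pair, and the static relation `dep_sta` is supplied by hand as a hypothetical def-use dependency.

```python
from itertools import product


def explicit_deps(trace):
    """DEP_dyn: pairs (i, j), i before j, that touch the same address with
    at least one store (RAW, WAR, or WAW)."""
    deps = set()
    for i, (op_i, addr_i) in enumerate(trace):
        for j in range(i + 1, len(trace)):
            op_j, addr_j = trace[j]
            if addr_i == addr_j and "store" in (op_i, op_j):
                deps.add((i, j))
    return deps


def transitive_closure(pairs, n):
    """Warshall-style closure over instruction indices 0..n-1."""
    closed = set(pairs)
    for k, i, j in product(range(n), repeat=3):   # k is the outermost loop
        if (i, k) in closed and (k, j) in closed:
            closed.add((i, j))
    return closed


trace = [("load", 0x10), ("store", 0x10), ("load", 0x20), ("store", 0x20)]
dep_sta = {(1, 2)}                      # hypothetical def-use chain through registers
dep = transitive_closure(explicit_deps(trace) | dep_sta, len(trace))
print(sorted(dep))
```

In this small trace the single static edge links the two address groups, so after closure every earlier instruction is ordered before every later one.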

The fact that static analysis is conservative in nature will not affect the correctness of our subsequent symbolic analysis. Since not all memory-addressing instructions can be statically resolved, as shown by the example instruction store A[i] in Fig. 2, static analysis may soundly over-approximate the possible dependencies of memory-related instructions. This is not a problem because it guarantees that, as long as two instructions are marked as independent, it is always safe to reorder these instructions during out-of-order execution. This is crucial for ensuring that leaks detected by our method are feasible.

#### 4.2 Computing the Default Cache States

Given the in-order execution trace, we perform an in-order simulation to compute the default cache states, which will be used during our symbolic analysis of the out-of-order executions.

We regard the in-order execution trace as a sequence of instructions $T_{ino} = \{I_1, \ldots, I_n\}$. The type of each instruction may be Load, Store, Symbolic Load, or Symbolic Store. Each Load/Store instruction is associated with an actual memory address. Each Symbolic Load/Store instruction is associated with a range of addresses that it may use.

Starting with an initial cache state $S_0$, we compute the sequence of cache states $T_{cache} = \{S_0, S_{I_1}, \ldots, S_{I_n}\}$ using the update function defined in Section 3.2. While the update function in Section 3.2 uses the LRU replacement policy, other cache replacement policies can also be implemented easily.

The result of in-order simulation will be given to our symbolic analysis, to examine the set of all possible out-of-order executions. Here, an out-of-order execution, denoted $T_{ooe} = \{I'_1, \ldots, I'_n\}$, is a permutation of instructions of the in-order execution. That is, for all $1 \le i \le n$ and instruction $I_i \in T_{ino}$, there exists $1 \le j \le n$, $i \ne j$, such that $I'_j \in T_{ooe}$ and $I'_j = I_i$, and vice versa.

### 5 Analyzing the Out-of-Order Executions

In this section, we present our method for symbolically analyzing the out-of-order executions.

#### 5.1 Symbolic Encoding

Our method uses a single logical formula ($\Phi$) to encode the behaviors of all out-of-order executions of instructions within a sliding window of size $N$, together with the condition under which an out-of-order execution has secret-dependent, divergent cache behaviors. It guarantees that $\Phi$ is satisfiable if and only if there exists such a side-channel leak in the sliding window of size $N$. Thus, when setting the value of $N$, there is a trade-off between coverage and scalability.

Before explaining how Φ is constructed from the in-order execution trace, however, we need to define the notations used in the symbolic encoding.


With these notations, we define the formula Φ as a conjunction of the following subformulas:

$$\Phi = \Phi_{pc} \wedge \Phi_{cs} \wedge \Phi_{ics} \wedge \Phi_{rep} \wedge \Phi_{dep} \wedge \Phi_{divc}$$

where $\Phi_{pc}$ is the program counter constraint, $\Phi_{cs}$ is the cache state constraint, $\Phi_{ics}$ is the initial cache state constraint, $\Phi_{rep}$ is the cache replacement constraint, $\Phi_{dep}$ is the dependency constraint, and $\Phi_{divc}$ is the divergence condition constraint.

**Program Counter Constraint ($\Phi_{pc}$).** To get a total order of the $N$ instructions, we require that, for all $0 \le i \le N$, the value of $PC\_I_i$ is unique; furthermore, we require $0 \le PC\_I_i \le N$. Thus, the constraint is defined as

$$\Phi_{pc} = \bigwedge_{0 \le i \le N} (0 \le PC\_I_i \le N) \;\wedge \bigwedge_{0 \le i, j \le N,\; i \ne j} (PC\_I_i \ne PC\_I_j)$$

**Cache State Constraint ($\Phi_{cs}$).** Let $MAX$ be the cache's associativity, or the maximal number of cache lines that can be mapped to a memory address. After executing an instruction $I_i$, if $0 \le Age\_addr_k\_I_i < MAX$, it means the memory block at $addr_k$ is inside the cache; but if $Age\_addr_k\_I_i = MAX$, it means the memory block is evicted out of the cache<sup>2</sup>. Thus, the constraint is defined as

$$\Phi_{cs} = \bigwedge_{0 \le i \le N,\; 0 \le k \le M} (-1 \le Age\_addr_k\_I_i \le MAX)$$

<sup>2</sup> $Age\_addr_k\_I_i = -1$ means the memory block has never been loaded into the cache.

**Initial Cache State Constraint ($\Phi_{ics}$).** Before the first instruction is executed, the cache must be set to a proper initial state. In other words, variables $Age\_addr_1\_I_0, \ldots, Age\_addr_M\_I_0$ must be initialized based on the default cache states computed by in-order simulation (Section 4.2). Thus, the constraint is defined as

$$\Phi_{ics} = \bigwedge_{0 \le k \le M} (Age\_addr_k\_I_0 = init\_age\_addr_k)$$

**Replacement Constraint ($\Phi_{rep}$).** Assuming that instruction $I_j$ is immediately before $I_i$ during an out-of-order execution, we define the cache line ages after executing $I_i$ based on their ages after executing the predecessor instruction $I_j$. Let $addr_k$ be the address used by $I_i$, $addr_{k_1}$ be any address whose age was younger than that of $addr_k$ immediately before executing $I_i$, and $addr_{k_2}$ be any address whose age was older than that of $addr_k$. According to the update function defined in Section 3.2, we set $Age\_addr_k\_I_i$ to 0, set $Age\_addr_{k_1}\_I_i$ to $(Age\_addr_{k_1}\_I_j + 1)$, and set $Age\_addr_{k_2}\_I_i$ to $Age\_addr_{k_2}\_I_j$. Let the relation $UpdateRel(I_i, I_j)$ be the conjunction of the constraints defined above.

If a symbolic address (secret-dependent) is used by $I_i$, we encode it into the update relation as follows: for each concrete address that may be instantiated from the symbolic address, we construct an update relation $UpdateRel()$ under the assumption that it may be the actual address used by $I_i$.

Overall, the cache replacement constraint is defined as

$$\Phi_{rep} = \bigwedge_{0 \le i, j \le N,\; i \ne j} \big( (PC\_I_i = PC\_I_j + 1) \implies UpdateRel(I_i, I_j) \big)$$

**Dependency Constraint ($\Phi_{dep}$).** To ensure that out-of-order executions are feasible, we enforce the relative order of any two instructions if they have dependencies according to the $DEP$ relation. Thus, the constraint is defined as

$$\Phi_{dep} = \bigwedge_{0 \le i, j \le N,\; i \ne j,\; DEP(I_i, I_j)} (PC\_I_i < PC\_I_j)$$

That is, if $I_j$ depends on $I_i$, then $I_i$ must be executed before $I_j$.
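To see what the program counter and dependency constraints admit, the sketch below enumerates their solutions by brute force for the running example. It is only an illustration of the constraints' meaning: the method itself hands them to an SMT solver precisely to avoid such enumeration. Instructions are 0-based here, and the `dep` set is an assumed dependency relation in which the symbolic store may conflict with each of the three loads.

```python
from itertools import permutations


def feasible_orders(n, dep):
    """Yield PC assignments (pc[i] = position of instruction i) in which every
    instruction has a unique position and every DEP pair keeps its order."""
    for pc in permutations(range(n)):          # unique positions by construction
        if all(pc[i] < pc[j] for i, j in dep):
            yield pc


# Running example, 0-based: the loads of A[0..2] are 0..2, store A[i] is 3,
# and load B is 4. Assumed DEP: each load must precede the symbolic store.
dep = {(0, 3), (1, 3), (2, 3)}
orders = list(feasible_orders(5, dep))
print(len(orders), "of 120 permutations are feasible")
```

The assignment `(0, 1, 2, 4, 3)`, which places load B before store A[i], is among the feasible ones and corresponds to the execution of Fig. 5.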

**Divergent Cache Constraint ($\Phi_{divc}$).** Let $Var_s$ be a symbolic (secret) variable whose values include $v_1, v_2, \ldots$ and let $I_i$ be a symbolic instruction whose actual addresses include $addr_{v_1}, addr_{v_2}, \ldots$ Here, the value $v_1$ corresponds to $addr_{v_1}$ and the value $v_2$ corresponds to $addr_{v_2}$. If accessing the memory block at $addr_{v_1}$ leads to a cache hit and accessing $addr_{v_2}$ leads to a cache miss (or vice versa), the target instruction $I_i$ has divergent cache behaviors. Thus, the constraint is defined as

$$\Phi_{divc} = \bigvee_{\forall v_1, v_2} (0 \le Age\_addr_{v_1}\_I_i < MAX) \wedge (Age\_addr_{v_2}\_I_i \ge MAX)$$

Conjoining all of the subformulas defined above, we can construct the entire formula Φ which is satisfiable (SAT) if and only if there is a side-channel leak during one of the out-of-order executions.

#### 5.2 The Overall Algorithm

The overall algorithm for predictive cache analysis is shown in Algorithm 1, which takes the in-order execution trace $T_{ino} = \{I_1, \ldots, I_n\}$, the in-order cache state trace $T_{cache} = \{S_0, \ldots, S_n\}$, and the sliding window size $N$ as input. Internally, it uses a sliding window of $N$ instructions, $T_{window}$, to generate the SMT formula $\Phi$. For this window, $S_{init}$ is the initial cache state as computed by in-order simulation, and $I_{target}$ is the target instruction. The formula $\Phi$ is satisfiable if and only if an out-of-order execution of the instructions within the window leads to divergent cache behaviors at the instruction $I_{target}$.


1: for $pos \leftarrow 1$ to $(n - N)$ do
2: $first = (pos - N > 0)\ ?\ (pos - N) : 1$
3: $T_{window} = T_{ino}[first, pos]$
4: $I_{target} = T_{ino}[pos]$
5: $S_{init} = T_{cache}[first - 1]$
6: $\Phi = \text{BuildFormula}(T_{window}, I_{target}, S_{init})$
7: if ( $\text{SAT}(\Phi) == true$ ) print LEAK FOUND

**Running Example.** We use the example code snippet in Fig. 2 to illustrate the symbolic encoding presented in this section. For this example, the in-order execution trace generated by our method is shown in the top half of Fig. 8. Note that A is marked as symbolic since A[i] is affected by the unknown variable i. The logical constraints are shown in the bottom half. Assume that the target instruction is $I_4$, meaning that we want to construct a formula $\Phi$ to check if $I_4$ has divergent cache behaviors.

The program counter and cache state constraints are shown in Lines 10-12; recall that each program counter variable must have a unique value. The dependency constraints are shown in Line 13. Then, in Line 14, we show the two symbolic variables used to check divergent cache behaviors; their values are in the range of the symbolic store in Line 5.

The update function for instruction $I_4$ starts from Line 15. If v1==0x77ef5bd0, which means the address 0x77ef5bd0 is used, the age of that address after executing $I_4$ is set to 0. The dependency relations indicate that $I_5$ is allowed to execute before $I_4$. Lines 16 to 18 show an example age-update constraint: guarded by the program counter constraint, it states the condition under which $Age\_0x77ef5bd4\_I_4$ increases by 1 relative to its value after the predecessor $I_5$, according to Section 5.1. Similarly, we encode the other predecessors of $I_4$ for the update function in Line 19. Finally, we encode the divergent cache constraint in Line 20.


Fig. 8. An example encoding where the register variable i holds a secret value.

#### 5.3 Optimizations of the Symbolic Encoding

Without optimization, the size of the formula $\Phi$ may be as large as $O(N^2 M)$ in the worst case, where $N$ is the number of instructions in the sliding window and $M$ is the number of memory addresses used inside the window. In practice, however, many of the logical constraints can be skipped. Here, we propose two optimization techniques.

**Skipping the Infeasible Cache Update Relations.** While constructing the constraints that update the cache states of the instructions, the default approach is to assume that, for any instruction $I_i$, any other instruction $I_j$ in the same window may be executed immediately before $I_i$. This means it must construct $N^2$ update relations. However, due to the dependencies among instructions captured by the $DEP$ relation, there may be many instruction pairs $(I_j, I_i)$ such that $I_j$ is not allowed to execute before $I_i$. By leveraging this information, we can skip many of these update relations.

**Skipping the Unnecessary $\Phi_{divc}$ Constraints.** In many cases, by checking the initial cache state with respect to the sliding window of $N$ instructions, we may be able to know that divergent cache behaviors are impossible during any of the out-of-order executions. In other words, $\Phi_{divc}$ is guaranteed to be unsatisfiable (UNSAT). Thus, we can avoid generating $\Phi$. Toward this end, we check for the following two conditions, each of which is sufficient for $\Phi_{divc}$ to be UNSAT:

– All ages are too young: Inside the initial cache state (with respect to the window), if all cache line ages are less than $(MAX - M)$, where $M$ is the number of unique addresses used in this window, we skip checking any of the instructions in this window for divergent cache behaviors. This is because the cache is large enough that, regardless of the execution order, none of the cache lines will be evicted.

– The age of addr accessed by the target instruction is too young: Inside the initial cache state, if the age of addr is less than $(MAX - M)$, we skip checking this particular target instruction for divergent cache behaviors. This is because, regardless of the value of the secret variable, this particular cache line will never be evicted out of the cache.

Table 1. Statistics of the benchmark programs and the execution traces.
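The two skip conditions for the divergence check amount to simple threshold tests on the initial cache state of a window. The sketch below is illustrative: the function names are hypothetical, and requiring the age to be non-negative (so that a never-loaded block with age $-1$ is not treated as young) is an added assumption, since the text only states the upper bound.

```python
def window_is_safe(init_ages, assoc, num_addrs):
    """First condition: every line is young enough that no execution order of
    the window can evict it. Ages of -1 (never loaded) do not qualify."""
    return all(0 <= age < assoc - num_addrs for age in init_ages.values())


def target_is_safe(init_ages, addr, assoc, num_addrs):
    """Second condition: the line accessed by the target cannot be evicted."""
    return 0 <= init_ages[addr] < assoc - num_addrs


ages = {0x10: 0, 0x50: 5}                         # initial ages for two addresses
print(window_is_safe(ages, assoc=128, num_addrs=4))          # large cache: skip the window
print(window_is_safe(ages, assoc=8, num_addrs=4))            # age 5 is not below 8 - 4
print(target_is_safe(ages, 0x10, assoc=8, num_addrs=4))      # this target can still be skipped
```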

### 6 Experiments

We have implemented our method in a tool named Spreca, which builds upon the LLVM compiler [17] and the Z3 SMT solver [19]. Specifically, it uses LLVM to implement the static analysis component, which takes a C program as input and computes the dependencies of memory-related instructions before instrumenting the LLVM bit-code; the instrumented bit-code, after compilation, is used to generate the execution trace at run time. We use Z3 to implement our symbolic analysis component, which takes the logged execution trace as input and generates SMT formulas of the cache states for leakage detection. Overall, our implementation includes 3.6K lines of C++ code inside LLVM for trace generation, SMT encoding and leakage detection, as well as 0.5K lines of Python/Bash script code for processing the trace files and automation. The archive is available at: https://doi.org/10.5281/zenodo.6117196.

#### 6.1 Benchmarks

The benchmarks used to evaluate our tool are a set of C programs from OpenSSL 1.1.1k that implement well-known block ciphers such as AES and DES and cryptographic hash functions such as SHA256 and Whirlpool. The statistics of these benchmark programs are shown in Table 1, including the name of the program, a short description, the number of lines of C code, and statistics of the logged execution trace, which serves as the input to our symbolic analysis method. For each execution trace, we show the trace length, the number of Store (ST) operations, the number of Load (LD) operations, the number of distinct memory locations touched by the execution, and the number of corresponding cache lines.

Our experiments were designed to answer the following questions:


Table 2. Results of our symbolic predictive analysis method for 8K fully associative cache, with LRU replacement policy, and window size set to 10.


Toward this end, for each benchmark program, we applied our symbolic analysis method to check if it can find OOE-related cache side-channel leaks, i.e., leaks that otherwise would not show up unless out-of-order execution is considered. To evaluate the scalability of our method, we also compared it with a baseline explicit analysis method. Due to space limitations, we omit the detailed algorithm of the explicit analysis method, which systematically enumerates the same set of out-of-order executions of instructions considered by our symbolic analysis method. Thus, both our symbolic method and the explicit method examine the same type of secret-dependent divergent cache behaviors, but they differ in efficiency and scalability.

#### 6.2 Leakage Detection Results

Table 2 shows the results of our symbolic analysis method. These results were obtained using the following parameters: the cache has a total of 8K bytes, divided into 128 cache lines, with 64 bytes per cache line. The cache is fully associative, with the LRU replacement policy. The OOE window size is set to 10, meaning the number of Load/Store instructions that will be executed out of order is bounded to 10. Recall that inside the reorder buffer, there can be many non-memory instructions (e.g., arithmetic operations); thus, setting the window size to 10 is a reasonable choice. In this table, Columns 1-2 show the program name and the trace length. Columns 3-5 show the number of SMT solver calls, the number of satisfiable (SAT) instances, and the number of unsatisfiable (UNSAT) instances. Column 6 shows the number of leaking sites detected by our method and Column 7 shows the total analysis time in seconds.
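To make the cache configuration above concrete, the following is a minimal sketch of a concrete (not symbolic) simulator for a fully associative cache with 128 lines of 64 bytes and LRU replacement. It is not part of Spreca; the names `Cache`, `cache_init`, and `cache_access` are hypothetical, and the age bookkeeping is only one possible way to realize LRU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u    /* bytes per cache line */
#define NUM_LINES 128u   /* 8K bytes / 64 bytes per line */

/* Hypothetical model of a fully associative cache with LRU ages. */
typedef struct {
    uint64_t tag[NUM_LINES];    /* address divided by LINE_SIZE */
    unsigned age[NUM_LINES];    /* 0 = most recently used */
    bool     valid[NUM_LINES];
} Cache;

void cache_init(Cache *c) {
    assert(c != NULL);
    memset(c, 0, sizeof *c);    /* all lines start out invalid */
}

/* Returns true on a hit; on a miss the line is loaded, evicting the oldest
 * line if no free slot is left. */
bool cache_access(Cache *c, uint64_t addr) {
    assert(c != NULL);
    uint64_t tag = addr / LINE_SIZE;
    unsigned slot = NUM_LINES;  /* slot that already holds the line, if any */
    unsigned victim = 0;        /* first free slot, else the oldest line */
    bool have_free = false;
    for (unsigned i = 0; i < NUM_LINES; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            slot = i;
        } else if (!c->valid[i]) {
            if (!have_free) { victim = i; have_free = true; }
        } else if (!have_free && c->age[i] > c->age[victim]) {
            victim = i;
        }
    }
    bool hit = (slot != NUM_LINES);
    /* On a miss every resident line gets older; on a hit only younger ones. */
    unsigned old_age = hit ? c->age[slot] : NUM_LINES;
    if (!hit) {
        slot = victim;
        c->tag[slot] = tag;
        c->valid[slot] = true;
    }
    for (unsigned i = 0; i < NUM_LINES; i++)
        if (c->valid[i] && i != slot && c->age[i] < old_age)
            c->age[i]++;
    c->age[slot] = 0;
    return hit;
}
```

In this model, a line is evicted only after 128 other distinct lines have been touched since its last use, which is the kind of age-based reasoning the skip conditions in the symbolic encoding rely on.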

Note that the number of SMT solver calls may be smaller than the number of instructions in the trace and, in many cases, is 0 because of the optimizations implemented during our symbolic encoding: for any instruction, if our simple checks reveal that no OOE-related divergent cache behavior is possible, we skip the more time-consuming SMT solver call. Also note that the number of leaking sites in Column 6, which are locations in the original C program, may be smaller than the number of UNSAT instances in Column 4; this is because multiple UNSAT results may be mapped to the same source code location.

Fig. 9. Comparison of the analysis time: symbolic method versus explicit method.

To confirm that the leaking sites reported in Table 2 are indeed feasible (5 for SEED and 6 for Camellia), we manually inspected the source code and the LLVM bit-code of both SEED and Camellia. Our manual inspection shows that the reordered sequences provided by the SMT solver are indeed feasible as we check them against the source code. We also find that the divergent cache behaviors are real in that the two concrete values computed for each symbolic (sensitive) variable can indeed lead to a cache hit in one case but a cache miss in the other case.

#### 6.3 Scalability Results

To evaluate the scalability of our symbolic analysis method, we compared its analysis time to that of the baseline explicit enumeration method. This experiment was conducted on SEED, with the OOE window size set to 2, 4, 6, 8 and 10, respectively. This is because the computational complexity of the problem increases exponentially as the OOE window size increases. The results are shown in Fig. 9, where the x-axis is the OOE window size and the y-axis is the analysis time in seconds. The blue line represents our symbolic method while the red line represents the explicit method.

The results in Fig. 9 show that, while our symbolic method has a higher fixed cost (associated with generating SMT formulas, calling the Z3 solver, and interpreting the results), and thus is slower than the explicit method when the OOE window size is smaller, it becomes significantly more efficient when the window size is larger. The figure also shows that, as expected, the explicit method has an exponential blowup – its analysis time is actually worse than exponential (factorial in the window size) – whereas the scalability of our symbolic method is significantly better.
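As a rough illustration of the factorial growth (an upper bound added here for intuition; it ignores the dependency constraints that rule out many reorderings in practice), a window of $w$ memory instructions admits at most $w!$ orderings. For the window sizes used in this experiment:

$$2! = 2, \qquad 4! = 24, \qquad 6! = 720, \qquad 8! = 40320, \qquad 10! = 3628800$$

Going from a window of 8 to a window of 10 thus multiplies the number of candidate orderings by $9 \cdot 10 = 90$.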

## 7 Related Work

As we have mentioned earlier, the most closely related work is that of Guo et al. [10, 11], which relies on KLEE to detect cache side channels. However, their method only treats program input as symbolic, while still explicitly enumerating the out-of-order executions. Unlike their method, we analyze the set of all possible out-of-order executions symbolically by encoding them in a single logical formula to avoid the exponential blowup. In this sense, our method is the only predictive analysis method that can symbolically analyze the cache behaviors of out-of-order executions.

Besides our method and the method of Guo et al. [10, 11], there are many other techniques for analyzing cache side channels. Some of them use symbolic execution as well, e.g., to detect concurrency-related leaks [12] as well as leaks in sequential programs [3, 21, 29, 32]. Others use static analysis techniques including those based on abstract interpretation [6, 28, 30, 31]. In addition to leakage detection, there are techniques for leakage quantification [1, 2, 5, 7, 16] as well. However, none of these prior works considers out-of-order execution.

Beyond side-channel leakage detection and leakage quantification, cache analysis has been used in other applications such as estimating the worst-case execution time (WCET) of real-time software [9, 13, 25]. Beyond cache analysis, the idea of trace-based predictive analysis has been applied to multithreaded programs to detect concurrency bugs [8, 14, 22–24, 26, 27]. However, a crucial difference is that while concurrency bugs are violations of functional properties of a program, our method for side-channel analysis focuses exclusively on non-functional properties.

## 8 Conclusions

We have presented a symbolic method for analyzing the cache behaviors of out-of-order executions associated with an in-order execution trace. The method uses static analysis to compute dependencies before instrumenting the program to generate the in-order execution trace. Then, it uses an SMT-solver-based symbolic analysis to analyze the cache behaviors of all out-of-order executions. Our experiments on cryptographic software code show that the symbolic analysis method is effective in detecting OOE-related cache side-channel leaks and is significantly more scalable than explicit analysis. For future work, we plan to extend our method to detect side-channel leaks caused by other CPU-level optimizations.

Acknowledgements This work was partially funded by the U.S. National Science Foundation grants CNS-1722710 and CNS-1702824.

### References



## PEQtest: Testing Functional Equivalence<sup>⋆</sup>

Marie-Christine Jakobs (B) and Maik Wiesner

Technical University of Darmstadt, Department of Computer Science, Darmstadt, Germany jakobs@cs.tu-darmstadt.de

Abstract. Refactoring a program without changing the program's functional behavior is challenging. To keep behavioral changes from going undetected, one may apply approaches that compare the functional behavior of the original and refactored programs. Difference detection approaches often use dedicated test generators and may be inefficient (i.e., execute (some of) the non-modified code twice). In contrast, proving functional equivalence often requires expensive verification. Therefore, we propose PEQtest, which aims at localized functional equivalence testing while relying on existing tests or test generators. To this end, PEQtest derives a test program from the original program by replacing each code segment being refactored with program code that encodes the equivalence of the original and its refactored code segment. The encoding is similar to program encodings used by some verification-based equivalence checkers. Furthermore, we prove that the test program derived by PEQtest indeed checks functional equivalence. Moreover, we implemented PEQtest in a prototype and evaluated it on several examples. Our evaluation shows that PEQtest successfully detects refactored programs that change the program behavior and that it often performs better than the state-of-the-art equivalence checker PEQcheck.

#### 1 Introduction

Developers refactor programs [16] to improve quality attributes such as performance. For instance, a developer may parallelize a program with OpenMP [30] to improve performance. While a refactoring changes the program code, e.g., adds OpenMP pragmas, to improve the program's quality, the changes must not alter the program's functional behavior. To ensure that a refactored program is reliable, we must check that the refactoring preserves the functional behavior.

Various approaches exist that aim to safeguard refactored programs from altered behavior. In practice, developers often perform regression testing [54], but the success of detecting altered behavior depends on the test suite and its test oracle(s). If refactoring rules are applied, one can prove the correctness of the applied refactoring rules [45,22,44]. In contrast, incremental verification techniques, e.g., [53,39,8,35], propose solutions for efficient re-verification of changed programs, but they typically need a specification of the functional behavior, which rarely exists. Another solution, which does not require a specification, is to inspect whether or when the original and the refactored program are functionally equivalent. Approaches aiming to detect differences in the behavior [26,52,46,20,31,29,36,47] are inefficient, i.e., execute each test case on the original and the refactored program or function, and often use dedicated test generators. Approaches aiming to prove functional equivalence [5,56,40,14,13,43,49,41,34,4,15,17,38,23,42,19] use heavyweight verification techniques, rarely support parallel programs, and often consider all possible variable values.

<sup>⋆</sup> This work was funded by the Hessian LOEWE initiative within the Software-Factory 4.0 project.

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 184–204, 2022. https://doi.org/10.1007/978-3-030-99429-7\_11

Listing 1.1: Original program

    void sum_seq(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      for (int i = 1; i <= N; i++)
        a[i] = a[i-1] + i;
    }

Listing 1.2: Refactored program

    int sum_par(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      #pragma omp parallel for
      for (int i = 1; i <= N; i++)
        a[i] = (i*(i+1))/2;
    }

Listing 1.3: Generated test program

    int sum_test(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      store(a, 0);
      for (int i = 1; i <= N; i++)
        a[i] = a[i-1] + i;
      store(a, 1);
      restore(a, 0);
      #pragma omp parallel for
      for (int i = 1; i <= N; i++)
        a[i] = (i*(i+1))/2;
      store(a, 2);
      eq_store(a, 1, 2);
    }

Fig. 1: Original, sequential program (top left), which initializes each array entry i with ∑<sub>j=0</sub><sup>i</sup> j, the refactored program (bottom left), which parallelizes the array initialization using OpenMP and utilizing that ∑<sub>j=0</sub><sup>i</sup> j = i·(i+1)/2, as well as the generated program for testing functional equivalence (right)

Our goal is to develop a lightweight, test-based approach for functional equivalence checking, for which we can use existing tests or test generators. Inspired by equivalence checkers [17,38,23,51,42,2,19] that transform the equivalence of two programs into a set of verification tasks (i.e., programs with assertions), our PEQtest approach transforms the equivalence of two programs into a test program. To restrict equivalence testing to relevant program values and to reduce the duplicate execution of non-modified code, PEQtest generates a single test program (verification task) that executes the unchanged code only once and individually checks equivalence of each refactored code segment in the context of the original program. The individual checks use a similar idea as UC-KLEE [38], which verifies equivalence of functions. More concretely, PEQtest derives the test program from the original program by extending each original code segment with (a) the refactored code segment and (b) code to store, restore, and compare the values of modified variables. To store, restore, and compare these values, PEQtest relies on checkpoints, which save the values of a given set of modified variables in a given program state.

In our example (Fig. 1), PEQtest first detects that the original (sequential) code segment (framed, dark blue) and the refactored (parallelized) code segment (frameless, light blue) modify variable a.<sup>1</sup> Thereafter, PEQtest derives the test program (right) from the original program (top left). It adds the parallelized code segment. To provide the same input to the original and refactored code segment, PEQtest uses checkpoint 0 to store modified variables. The test program calls store(a, 0); to save the values of modified variable a in checkpoint 0 before the original code segment and calls restore(a, 0); to restore the values of modified variables before the refactored code segment. To make the result of both code segments available for equivalence checking, the test program stores the values of modified variable a after each code segment in checkpoints 1 and 2, respectively. Finally, the equivalence test eq\_store checks whether checkpoints 1 and 2 contain equivalent values for the modified variable a.
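The walkthrough above can be turned into a small runnable sketch. The checkpoint functions below are hypothetical stand-ins written for this illustration, not PEQtest's actual implementation: unlike the calls in Fig. 1, they take an explicit array length, they keep snapshots in a fixed-size global table, and `sum_test` returns the result of the equivalence check instead of aborting. The array is zero-initialized so that the first snapshot does not read indeterminate values.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LEN 256           /* N is an unsigned char, so N + 1 <= 256 */
#define NUM_CHECKPOINTS 3

/* Hypothetical checkpoint storage: one snapshot of the array per checkpoint. */
static int checkpoints[NUM_CHECKPOINTS][MAX_LEN];

/* Save the first len entries of a in checkpoint cp. */
static void store(const int *a, size_t len, int cp) {
    assert(cp >= 0 && cp < NUM_CHECKPOINTS && len <= MAX_LEN);
    memcpy(checkpoints[cp], a, len * sizeof a[0]);
}

/* Overwrite the first len entries of a with the values saved in checkpoint cp. */
static void restore(int *a, size_t len, int cp) {
    assert(cp >= 0 && cp < NUM_CHECKPOINTS && len <= MAX_LEN);
    memcpy(a, checkpoints[cp], len * sizeof a[0]);
}

/* True iff checkpoints cp1 and cp2 agree on the first len entries. */
static bool eq_store(size_t len, int cp1, int cp2) {
    return memcmp(checkpoints[cp1], checkpoints[cp2], len * sizeof(int)) == 0;
}

/* Test program in the style of the generated program in Fig. 1. */
bool sum_test(unsigned char N) {
    int a[MAX_LEN] = {0};
    size_t len = (size_t)N + 1;
    a[0] = 0;
    store(a, len, 0);                 /* input of both code segments */
    for (int i = 1; i <= N; i++)      /* original code segment */
        a[i] = a[i - 1] + i;
    store(a, len, 1);                 /* output of the original segment */
    restore(a, len, 0);               /* same input for the refactored segment */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 1; i <= N; i++)      /* refactored code segment */
        a[i] = (i * (i + 1)) / 2;
    store(a, len, 2);                 /* output of the refactored segment */
    return eq_store(len, 1, 2);
}
```

Calling `sum_test(10)` returns `true` because both segments compute the same prefix sums; the pragma takes effect only when the file is compiled with OpenMP support.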

We proved that PEQtest generates test programs that can indeed detect inequivalence and that if no execution of the test program reveals an inequivalence, original and refactored program are equivalent. As a proof-of-concept, we implemented PEQtest and used it to check several program parallelizations and a few sequential refactorings. Our evaluation shows that PEQtest reliably detects inequivalences and typically outperforms the state-of-the-art equivalence checker PEQcheck [19].

## 2 Background

Program Syntax. To present our approach, we rely on a simple imperative language on integer variables.<sup>2</sup> Since synchronization issues, e.g., deadlocks, do not affect how our approach works and we want to keep the programming language simple, our language supports parallel execution, but no synchronization operations. Below, we show the grammar of the programming language that we use to present our approach.

$$S := E \mid v :=\_\ell aexpr; \mid \mathbf{if}\_\ell \; bexpr \; \mathbf{then} \; S\_1 \; \mathbf{else} \; S\_2 \mid \mathbf{while}\_\ell \; bexpr \; \mathbf{do} \; S \mid S\_1 \, S\_2 \mid [S\_1 \parallel \dots \parallel S\_n]$$

We use E to denote the empty program and assume that arithmetic expressions aexpr in assignments and Boolean expressions bexpr in if and while statements are built with standard operators on integers. To build more complex programs S, several subprograms S<sub>i</sub> may be assembled into a sequence or into a parallel statement. To unambiguously identify the original and refactored code segments during test program generation<sup>3</sup> and any subprogram in our proofs, we assume that each basic statement is annotated with a label ℓ, which must be unique in the complete program. Moreover, we use the set 𝒱 to refer to all program variables and subset 𝒱(S) ⊆ 𝒱 to refer to the variables occurring in (sub)program S. Similarly, subset 𝒱(expr) ⊆ 𝒱 represents all variables that occur in an arithmetic or Boolean expression expr.

<sup>1</sup> Both segments also modify variable i, but it is a local variable, which can be ignored.

<sup>2</sup> Our implementation supports a subset of C programs, which may use OpenMP pragmas for parallelization.

<sup>3</sup> For our implementation, one only needs to specify the start and end of code segments i.

Fig. 2: Rules for operational semantics

While the programming language above is sufficient to represent original and refactored programs, the test programs derived by our approach also use checkpointing to store, restore, and compare relevant parts (e.g., modified variables) of program states. To support checkpointing and checkpoint comparison, we extend the programming language for test programs with the three checkpoint functions eq\_store, restore, and store. All three functions get as input a subset V ⊆ 𝒱 of relevant variables and one or two arithmetic expressions (typically integer constants) to refer to the relevant checkpoints.

S := eq\_store(V, aexpr<sub>1</sub>, aexpr<sub>2</sub>); | restore(V, aexpr); | store(V, aexpr);

Program Semantics. We formalize the program semantics using a fairly standard operational semantics that defines how a program executes. A program execution is a sequence of transitions between execution states. An execution state is a triple of a program, a data state, and an additional checkpoint state. A data state is a function σ : 𝒱 → ℤ that provides an integer value for each program variable. We denote the set of all data states by Σ. A checkpoint state is a function ξ : ℕ → Σ that maps checkpoints i to data states σ. The set Ξ denotes all checkpoint states.

The 12 rules shown in Fig. 2, which consist of 7 standard rules plus 5 newly introduced rules highlighted in light gray, define the possible transitions. As usual, we write σ(expr) for the evaluation of expr in data state σ ∈ Σ.<sup>4</sup> The state update σ[v := σ(aexpr)], which is used in the rule for the assignment, returns a new data state σ<sub>n</sub> with σ<sub>n</sub>(w) = σ(w) for all w ∈ 𝒱 \ {v} and σ<sub>n</sub>(v) = σ(aexpr). Similarly, the multi state update σ[V ← σ′], which is used by the new store and restore rules, returns a new data state σ<sub>n</sub> with σ<sub>n</sub>(w) = σ(w) for all w ∈ 𝒱 \ V and σ<sub>n</sub>(v) = σ′(v) for all v ∈ V. In addition, the checkpoint update ξ[c := σ<sub>u</sub>], which is used in the store rule, returns a new checkpoint state ξ<sub>n</sub> with ξ<sub>n</sub>(i) = ξ(i) for all i ∈ ℕ \ {c} and ξ<sub>n</sub>(c) = σ<sub>u</sub>.<sup>5</sup> Also, note that instead of assuming that E S and [E∥…∥E] S are equivalent to S, we introduce two nop rules, which make our proofs simpler. Having formalized the transitions, we now inductively define the executions ex(S) of a program S with two inference rules:


We write (S, σ, ξ) →<sup>∗</sup> (S′, σ′, ξ′) if the intermediate steps of the execution are unimportant. Furthermore, we say that execution (S, σ, ξ) →<sup>∗</sup> (S′, σ′, ξ′) (i) terminates normally if S′ = E and (ii) violates a checkpoint equivalence if S′ violates a checkpoint equivalence in (σ′, ξ′). In general, a program S violates a checkpoint equivalence in (σ′, ξ′) if either (a) there exists a statement S<sub>eq</sub> = eq\_store(V, aexpr<sub>1</sub>, aexpr<sub>2</sub>); such that ∃v ∈ V : ξ′(σ′(aexpr<sub>1</sub>))(v) ≠ ξ′(σ′(aexpr<sub>2</sub>))(v) and S = S<sub>eq</sub> or S = S<sub>eq</sub> S′ or (b) S = [S<sub>1</sub>∥…∥S<sub>i</sub>∥…∥S<sub>n</sub>] or S = [S<sub>1</sub>∥…∥S<sub>i</sub>∥…∥S<sub>n</sub>] S′ and there exists at least one subprogram S<sub>i</sub> that violates a checkpoint equivalence in (σ′, ξ′). Finally, a program S violates a checkpoint equivalence if there exists an execution (S, σ, ξ) →<sup>∗</sup> (S′, σ′, ξ′) ∈ ex(S) such that S′ violates a checkpoint equivalence in (σ′, ξ′).
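As a small worked instance of the update operations used by the transition rules (the concrete variables and values are chosen here purely for illustration), consider two variables x and y and the data states

$$\sigma = \{x \mapsto 1,\ y \mapsto 2\}, \qquad \sigma' = \{x \mapsto 7,\ y \mapsto 9\}.$$

The state update for the assignment x := x + y and the multi state update for the variable set {y} then yield

$$\sigma[x := \sigma(x + y)] = \{x \mapsto 3,\ y \mapsto 2\}, \qquad \sigma[\{y\} \leftarrow \sigma'] = \{x \mapsto 1,\ y \mapsto 9\},$$

and the checkpoint update changes only the addressed checkpoint:

$$\xi[0 := \sigma](0) = \sigma, \qquad \xi[0 := \sigma](i) = \xi(i) \ \text{ for all } i \neq 0.$$

Under the same assumptions, if ξ(1) = σ and ξ(2) = σ′, the statement eq\_store({x}, 1, 2); violates a checkpoint equivalence, since ξ(1)(x) = 1 ≠ 7 = ξ(2)(x).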

Partial Equivalence. We are interested in whether two (sub)programs are functionally equivalent, i.e., compute the same output when given the same input. Like many other approaches to equivalence checking, we focus on partial equivalence, i.e., we limit equivalence to executions that terminate normally.<sup>6</sup> In addition, we utilize that checkpoint functions are not used in programs, but are only introduced to test functional equivalence. Therefore, our definition of partial equivalence focuses on data states and ignores checkpoint states.

Definition 1. (Sub)programs S1 and S2 are partially equivalent (S1 ≡ S2) if

$$\begin{array}{c} \forall \sigma, \sigma', \sigma'' \in \Sigma, \xi, \xi', \xi'', \xi''' \in \Xi: \bigl((S1, \sigma, \xi) \to^\* (E, \sigma', \xi') \in ex(S1) \\ \land\ (S2, \sigma, \xi'') \to^\* (E, \sigma'', \xi''') \in ex(S2)\bigr) \implies \sigma' = \sigma'' \ . \end{array}$$

<sup>4</sup> Note that we do not specify the expression evaluation in detail because we have not fixed the expression syntax. However, we assume that the result of evaluating integer constant c in data state σ is the constant c (i.e., σ(c) = c) and that expression evaluation is deterministic (i.e., σ(expr) = x ∧ σ(expr) = y ⟹ x = y).

<sup>5</sup> The store rule determines the state σ<sub>u</sub> using a multi state update and the index c by evaluating an arithmetic expression (often a constant) in the current data state.

<sup>6</sup> Note that we still may detect that a refactoring introduces non-termination because if a refactoring introduces non-termination, our test program either detects inequivalence or does not terminate for some inputs.

Variable Modification. To make equivalence testing more efficient, we only want to checkpoint modified variables, i.e., the checkpoint should only store the values of those variables whose value may change. The following definition formalizes the set of variables modified by a (sub)program.

Definition 2. Let S be a (sub)program. The variables modified by S are:

M(S) := {v ∈ 𝒱 | ∃σ, σ′ ∈ Σ, ξ, ξ′ ∈ Ξ : (S, σ, ξ) →<sup>∗</sup> (·, σ′, ξ′) ∧ σ(v) ≠ σ′(v)}.

For instance, in programs written in our programming language that do not use restore statements, variables can only be modified by assignments. For those programs, the set M(S) of modified variables can be overapproximated by the set of variables that occur in S on the left-hand side of an assignment. In the following, we describe any overapproximation of the modified variables, e.g., the one sketched above, by M<sub>≈</sub> : S → 2<sup>𝒱</sup> and assume that M(S) ⊆ M<sub>≈</sub>(S).
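The syntactic overapproximation sketched above, collecting every variable that occurs on the left-hand side of an assignment, can be written as a short recursive traversal. The AST below is a hypothetical simplification made for this illustration: expressions are omitted because they do not influence the result, sequences and parallel statements are binary, and variables are identified by indices below 32 so that a variable set fits into a bitmask.

```c
#include <assert.h>
#include <stddef.h>

/* Statement kinds of the simple language (checkpoint functions excluded). */
typedef enum { EMPTY, ASSIGN, IF, WHILE, SEQ, PAR } Kind;

/* Hypothetical AST node; expressions are left out on purpose. */
typedef struct Stmt {
    Kind kind;
    int lhs;                      /* ASSIGN: index of the assigned variable */
    const struct Stmt *child[2];  /* IF: then/else, WHILE: body in child[0],
                                     SEQ/PAR: the two parts */
} Stmt;

typedef unsigned VarSet;          /* bit i set <=> variable i is in the set */

/* Overapproximates the modified variables of s by its assignment targets. */
VarSet modified_approx(const Stmt *s) {
    if (s == NULL)
        return 0;
    switch (s->kind) {
    case EMPTY:
        return 0;
    case ASSIGN:
        assert(s->lhs >= 0 && s->lhs < 32);
        return 1u << s->lhs;
    default:                      /* IF, WHILE, SEQ, PAR: union over children */
        return modified_approx(s->child[0]) | modified_approx(s->child[1]);
    }
}
```

The result is an overapproximation rather than the exact set because an assignment inside a branch that is never taken, or an assignment that writes back the old value, still contributes its target variable.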

#### 3 Generating Test Programs with PEQtest

Our goal is to test equivalence between an original and a refactored program, neither of which uses checkpoint functions. As explained earlier, checkpoint functions are supposed to be used by test programs only. In this section, we describe how PEQtest generates the test program for equivalence testing, prove soundness of the generated test program, i.e., show that the generated test program checks functional equivalence, and discuss limitations of PEQtest's program generation as well as our implementation.

Sound Test Program Generation. To test functional equivalence of two subprograms, the idea of our PEQtest approach is to execute both subprograms with the same input and compare their outputs. The test program generated by PEQtest will execute the two subprograms sequentially so that their executions cannot interfere with each other. Furthermore, it will ensure that both subprograms get equal inputs, which may be produced by the (original) program, and that their outputs can be compared. Many verification approaches for functional equivalence [17,38,23,51,42,2,19] use a similar setup, but do not restrict the inputs. To ensure equal inputs and make outputs available, these approaches either (1) duplicate (shared, modified) variables, replace the variables in one of the subprograms by the duplicated ones, and assign equal values to the original and duplicated inputs [17,42,2,19], (2) add additional variables to store the input and output values and restore the input after the execution of the first subprogram [51,23], or (3) use dedicated functions, e.g., checkpoint functions, to store and restore inputs and outputs [38]. For our test program, we choose option (3) because it does not change the subprograms and, thus, simplifies test program generation and makes the test program easier to comprehend.

Next, we discuss how we implement option (3). To lower the test effort, we decide to only store and compare values of variables that may be modified by one of the two subprograms. Since this set cannot always be determined precisely and different overapproximations are imaginable, we use parameter V to provide this set to the test program generator. Moreover, we aim at localized equivalence testing. Thus, our test program likely includes more than one functional equivalence test, namely one for each pair of original and refactored subprogram. While the output must be stored directly after the execution of each subprogram, the output comparison can be done at the end of the test program or after the execution of original and refactored subprogram. We choose the second option because it allows us to reuse checkpoints and lets the test program stop at the first difference of outputs, which makes it easier to detect which pair of subprograms is responsible for the failure of the test program, i.e., which pair of subprograms is inequivalent. We stop at the first difference instead of, e.g., logging the difference because test execution becomes faster, but we address the logging alternative when discussing the limitations. The following definition shows how we encode the functional equivalence test for an original subprogram S1 and the refactored subprogram S2 for a given overapproximation V of the set of modified variables.

test\_eq(V, S1, S2) := store(V, 0); S1 store(V, 1); restore(V, 0); S2 store(V, 2); eq\_store(V, 1, 2);
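The encoding above can be mirrored by a small executable harness. The sketch below is an illustration under stated assumptions rather than PEQtest's implementation: data states are fixed-size integer arrays, variable sets are bitmasks, subprograms are C function pointers, and `store` is assumed to update only the variables in V of the addressed checkpoint via a multi state update. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NVARS 3
#define NCHECKPOINTS 3

typedef struct { int val[NVARS]; } DataState;                   /* sigma */
typedef struct { DataState cp[NCHECKPOINTS]; } CheckpointState; /* xi */
typedef unsigned VarSet;                  /* bit i set <=> variable i in V */
typedef void (*Subprogram)(DataState *);  /* runs on the data state in place */

/* Multi state update dst[V <- src]. */
static void multi_update(DataState *dst, VarSet V, const DataState *src) {
    for (int i = 0; i < NVARS; i++)
        if (V & (1u << i))
            dst->val[i] = src->val[i];
}

static void store(VarSet V, int c, const DataState *sigma, CheckpointState *xi) {
    assert(c >= 0 && c < NCHECKPOINTS);
    multi_update(&xi->cp[c], V, sigma);
}

static void restore(VarSet V, int c, DataState *sigma, const CheckpointState *xi) {
    assert(c >= 0 && c < NCHECKPOINTS);
    multi_update(sigma, V, &xi->cp[c]);
}

static bool eq_store(VarSet V, int c1, int c2, const CheckpointState *xi) {
    for (int i = 0; i < NVARS; i++)
        if ((V & (1u << i)) && xi->cp[c1].val[i] != xi->cp[c2].val[i])
            return false;
    return true;
}

/* Returns true iff no checkpoint equivalence is violated for this input. */
bool test_eq(VarSet V, Subprogram s1, Subprogram s2, DataState *sigma) {
    CheckpointState xi;
    memset(&xi, 0, sizeof xi);
    store(V, 0, sigma, &xi);
    s1(sigma);
    store(V, 1, sigma, &xi);
    restore(V, 0, sigma, &xi);
    s2(sigma);
    store(V, 2, sigma, &xi);
    return eq_store(V, 1, 2, &xi);
}

/* Example subprograms over x = val[0] and y = val[1]; each modifies only y. */
void double_by_add(DataState *s) { s->val[1] = s->val[0] + s->val[0]; }
void double_by_mul(DataState *s) { s->val[1] = 2 * s->val[0]; }
void square(DataState *s)        { s->val[1] = s->val[0] * s->val[0]; }
```

With V = {y} (bitmask `0x2`) and x = 3, `test_eq` accepts the pair `double_by_add`/`double_by_mul` and rejects `double_by_add`/`square`. For x = 2 it accepts the latter pair as well, which illustrates that a single passing input does not establish equivalence.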

Next, we show that our test encoding is sound, i.e., it may detect inequivalences if the two subprograms S1 and S2 are inequivalent. Our encoding uses checkpoint equivalence to detect whether two subprograms S1 and S2 are inequivalent, i.e., differ in their outputs. Hence, it must violate a checkpoint equivalence if S1 and S2 are inequivalent. We can ensure even more and show that the test encoding is also complete. As shown by the following theorem, our test encoding violates a checkpoint equivalence if and only if S1 and S2 are inequivalent.

Theorem 1. Let S1 and S2 be (sub)programs without calls to checkpoint functions and M<sub>≈</sub> be an overapproximation of the modified variables. Then, S1 ≡ S2 iff test\_eq(M<sub>≈</sub>(S1) ∪ M<sub>≈</sub>(S2), S1, S2) does not violate a checkpoint equivalence.

Proof (Sketch). Let M := M<sub>≈</sub>(S1) ∪ M<sub>≈</sub>(S2).

⇒ Let (test\_eq(M, S1, S2), σ, ξ) →<sup>∗</sup> (eq\_store(M, 1, 2);, σ<sub>5</sub>, ξ<sub>5</sub>) be arbitrary. Show with semantics that there exists an execution

(test\_eq(M, S1, S2), σ, ξ)

→ (S1 store(M, 1); restore(M, 0); S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>1</sub>, ξ<sub>1</sub>)

→<sup>∗</sup> (store(M, 1); restore(M, 0); S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>2</sub>, ξ<sub>2</sub>)

→<sup>∗</sup> (S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>3</sub>, ξ<sub>3</sub>)

→<sup>∗</sup> (store(M, 2); eq\_store(M, 1, 2);, σ<sub>4</sub>, ξ<sub>4</sub>)

→ (eq\_store(M, 1, 2);, σ<sub>5</sub>, ξ<sub>5</sub>)

with σ = σ<sub>1</sub> = σ<sub>3</sub>, for all v ∈ 𝒱 \ M also σ(v) = σ<sub>5</sub>(v), and for all v ∈ M we have ξ<sub>5</sub>(1)(v) = σ<sub>2</sub>(v) and ξ<sub>5</sub>(2)(v) = σ<sub>4</sub>(v).

Conclude that there exist (S1, σ<sub>1</sub>, ξ<sub>1</sub>) →<sup>∗</sup> (E, σ<sub>2</sub>, ξ<sub>2</sub>) and (S2, σ<sub>3</sub>, ξ<sub>3</sub>) →<sup>∗</sup> (E, σ<sub>4</sub>, ξ<sub>4</sub>) with σ = σ<sub>1</sub> = σ<sub>3</sub> and for all v ∈ M we have ξ<sub>5</sub>(1)(v) = σ<sub>2</sub>(v) and ξ<sub>5</sub>(2)(v) = σ<sub>4</sub>(v). By assumption (S1 ≡ S2), σ<sub>2</sub> = σ<sub>4</sub> and, thus, ξ<sub>5</sub>(1)(v) = σ<sub>2</sub>(v) = σ<sub>4</sub>(v) = ξ<sub>5</sub>(2)(v). By semantics, (test\_eq(M, S1, S2), σ, ξ) →<sup>∗</sup> (eq\_store(M, 1, 2);, σ<sub>5</sub>, ξ<sub>5</sub>) does not violate a checkpoint equivalence.

⇐ Let (S1, σ<sub>1</sub>, ξ<sub>1</sub>) →<sup>∗</sup> (E, σ<sub>2</sub>, ξ<sub>2</sub>) and (S2, σ<sub>3</sub>, ξ<sub>3</sub>) →<sup>∗</sup> (E, σ<sub>4</sub>, ξ<sub>4</sub>) be arbitrary with σ<sub>1</sub> = σ<sub>3</sub>. Show with semantics that there exists an execution

(test\_eq(M, S1, S2), σ, ξ)

→ (S1 store(M, 1); restore(M, 0); S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>1</sub>, ξ<sub>1</sub>)

→<sup>∗</sup> (store(M, 1); restore(M, 0); S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>2</sub>, ξ<sub>2</sub>)

→<sup>∗</sup> (S2 store(M, 2); eq\_store(M, 1, 2);, σ<sub>3</sub>, ξ<sub>3</sub>)

→<sup>∗</sup> (store(M, 2); eq\_store(M, 1, 2);, σ<sub>4</sub>, ξ<sub>4</sub>)

→ (eq\_store(M, 1, 2);, σ<sub>5</sub>, ξ<sub>5</sub>)

with σ = σ<sub>1</sub> = σ<sub>3</sub>, for all v ∈ 𝒱 \ M also σ(v) = σ<sub>5</sub>(v), and for all v ∈ M we have ξ<sub>5</sub>(1)(v) = σ<sub>2</sub>(v) and ξ<sub>5</sub>(2)(v) = σ<sub>4</sub>(v).

Since the test program does not violate a checkpoint equivalence, for all v ∈ M we know σ<sub>2</sub>(v) = ξ<sub>5</sub>(1)(v) = ξ<sub>5</sub>(2)(v) = σ<sub>4</sub>(v). We conclude that σ<sub>2</sub> = σ<sub>4</sub>. □

So far, we can use the test encoding to test or even verify functional equivalence of complete programs. Following the idea of PEQcheck [19], which checks equivalence on the level of subprograms rather than on the level of functions or programs, our goal is to split testing of equivalence into multiple subtests, namely one subtest per pair of original and refactored subprogram. While PEQcheck builds one equivalence task per pair and verifies all tasks on every input, our PEQtest approach generates one single test program that only provides inputs produced by the original program.<sup>7</sup> More concretely, PEQtest derives the test program from the original program by replacing the subprograms being refactored with the test encoding test\_eq of the original and refactored subprogram.

Currently, we assume that PEQtest is informed about the refactored subprograms. More concretely, given original program S and refactored program S′, we assume that there exists a partial, injective replacement function γ : 2<sup>S</sup> ⇀ 2<sup>S</sup> <sup>8</sup> such that S′ can be derived from S by replacing all subprograms S<sub>1</sub> of S with S<sub>1</sub> ∈ preImg(γ) by γ(S<sub>1</sub>). Generally, we write S2 = Γ(S1, γ) to denote that S2 is derivable from S1 by replacing all subprograms S<sub>s</sub> of S1 by γ(S<sub>s</sub>). For the PEQtest approach, we assume that the replacement function γ only describes the refactoring of the original program S, i.e., preImg(γ) only contains subprograms of S. In addition, the replacement must be unambiguous. Hence, we do not allow S<sub>1</sub>, S<sub>2</sub> ∈ preImg(γ) such that S<sub>2</sub> is a subprogram of S<sub>1</sub> nor S<sub>1</sub>, S<sub>1</sub>S<sub>2</sub> ∈ img(γ) such that S<sub>1</sub> is a subprogram of S and S<sub>1</sub> ∉ preImg(γ).<sup>9</sup> We also require that E, [E∥…∥E] ∉ (preImg(γ) ∪ img(γ)) and ¬∃S : E S, [E∥…∥E] S ∈ (preImg(γ) ∪ img(γ)) because these are not proper programs. To avoid that interference of parallel statements can invalidate the result of a test, no subprogram in preImg(γ) (img(γ)) may occur in a parallel statement of the original (refactored) program. Thus, a refactoring in a parallel statement must be described by a refactoring of the parallel statement. Note that for proper programs one can always use γ = {(S, S′)}.<sup>10</sup>

To generate our test program, PEQtest requires a replacement function γ<sub>test</sub> that maps the subprograms being refactored to their test encodings. PEQtest derives γ<sub>test</sub> from the replacement function γ, which describes the refactoring. For each subprogram in the domain, PEQtest replaces its image (the refactored subprogram) by the test encoding of that subprogram and its refactored subprogram, using an M<sub>≈</sub> to determine the set of modified variables.

<sup>7</sup> If all original and refactored subprograms are equivalent (which we aim to inspect), the original and refactored program will provide the same inputs.

<sup>8</sup> If γ is not injective, one can make it injective by properly changing statement labels.

<sup>9</sup> One can achieve this by proper choices of code segments and statement labels.

<sup>10</sup> However, one may need to adapt some of the labels in S′.

$$\gamma_{\text{test}}(\gamma, M_{\approx}) := \{ (S_1, \mathit{test\_eq}(M_{\approx}(S_1) \cup M_{\approx}(\gamma(S_1)), S_1, \gamma(S_1))) \mid S_1 \in \mathit{preImg}(\gamma) \}$$

Let us briefly discuss why γ<sub>test</sub> fulfills the requirements on a replacement function. Since the test encoding contains γ(S<sub>1</sub>), function γ<sub>test</sub> inherits injectivity from γ. By construction, test encodings are unequal to E, E S, [E∥…∥E], and [E∥…∥E]S and start with checkpoint functions, which we assume that the original program does not contain. The remaining requirements are fulfilled because we only replace refactored subprograms by the corresponding test encoding.

Now, we have everything at hand to generate the test program, which can then be used to detect inequivalences with an existing test approach, e.g., [12,1]. As explained, we derive the test program from the original program by replacing the subprograms being refactored with the test encoding test_eq of the original and refactored subprogram. To achieve this, we use the replacement function γ<sub>test</sub>.

$$\mathit{test\_prog}(S, \gamma, M_{\approx}) := \Gamma(S, \gamma_{\text{test}}(\gamma, M_{\approx}))$$
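The two definitions above can be sketched on a toy program representation. The following Python sketch is only an illustration and not the ROSE-based implementation: the tuple-based program type, the names `replace_subprograms`, `derive_gamma_test`, `build_test_program`, and `EquivalenceCheck`, and the `modified` callback standing in for M<sub>≈</sub> are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple, Union

# Toy representation (an assumption for illustration): a program is an atomic
# statement (str), a sequence of subprograms (tuple), or an equivalence check.
Program = Union[str, Tuple["Program", ...], "EquivalenceCheck"]


@dataclass(frozen=True)
class EquivalenceCheck:
    """Stands in for test_eq(V, S1, S2): compare S1 and S2 on the variables V."""
    variables: FrozenSet[str]
    original: Program
    refactored: Program


def replace_subprograms(program: Program, gamma: Dict[Program, Program]) -> Program:
    """Toy version of Gamma(S, gamma): replace each subprogram in the domain of gamma."""
    if program in gamma:
        # Domain elements are assumed not to be nested, so we stop descending here.
        return gamma[program]
    if isinstance(program, tuple):
        return tuple(replace_subprograms(sub, gamma) for sub in program)
    return program


def derive_gamma_test(
    gamma: Dict[Program, Program],
    modified: Callable[[Program], Set[str]],
) -> Dict[Program, Program]:
    """Map each refactored subprogram S1 to the check of S1 against gamma(S1)."""
    return {
        s1: EquivalenceCheck(frozenset(modified(s1) | modified(s2)), s1, s2)
        for s1, s2 in gamma.items()
    }


def build_test_program(program, gamma, modified) -> Program:
    """Toy version of test_prog(S, gamma, M): apply gamma_test to the original program."""
    return replace_subprograms(program, derive_gamma_test(gamma, modified))


# Hypothetical example: one loop is replaced by a parallelized variant.
original = ("x = 0", "for i: x += i", "return x")
gamma = {"for i: x += i": "parallel for i: x += i"}
test_program = build_test_program(original, gamma, lambda stmt: {"x", "i"})
print(test_program)
```

Note that the non-refactored statements are kept once, while only the refactored segment is wrapped into a check, which mirrors the localized nature of the test program.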

Again, let us consider soundness, but now for the test program. Our goal is to detect inequivalences caused by a refactoring. Thus, we do not give any guarantees if the original program is non-deterministic, i.e., not equivalent to itself, which can only occur if it contains non-deterministic parallel statements or checkpoint functions. We already assumed that checkpoint functions are only used by the test program, but not by the original or refactored program. For our soundness discussion, we also exclude programs that contain non-replaced, non-deterministic parallel statements. More concretely, we assume that all parallel statements S<sub>p</sub> that are not replaced, i.e., for which there does not exist a subprogram S<sub>s</sub> ∈ preImg(γ) such that S<sub>p</sub> = S<sub>s</sub> or S<sub>p</sub> is a subprogram of S<sub>s</sub>, are deterministic (S<sub>p</sub> ≡ S<sub>p</sub>). In this case, the following theorem ensures that our PEQtest approach can soundly detect inequivalences, i.e., the test program generated by PEQtest is able to detect a violation of a checkpoint equivalence if original and refactored program are inequivalent.

Theorem 2. Let S and S′ be programs without calls to checkpoint functions, M<sub>≈</sub> an overapproximation of the modified variables, γ be a replacement function such that S′ = Γ(S, γ), and all non-replaced parallel statements S<sub>p</sub> of S be deterministic (S<sub>p</sub> ≡ S<sub>p</sub>). If S ≢ S′, then there exists (S<sub>0</sub>, σ<sub>0</sub>, ξ<sub>0</sub>) →<sup>∗</sup> (S<sub>n</sub>, σ<sub>n</sub>, ξ<sub>n</sub>) ∈ ex(test_prog(S, γ, M<sub>≈</sub>)) that violates a checkpoint equivalence.

Finally, let us look at the contraposition of the above theorem. While our intention for PEQtest is testing and detection of equivalence violations, the corollary below states that we can alternatively verify the test program generated by PEQtest to show functional equivalence.

Fig. 3: Behaviorally equivalent original and refactored program whose code segments are not equivalent

Corollary 1. Let S and S′ be programs without calls to checkpoint functions, M<sub>≈</sub> an overapproximation of the modified variables, γ be a replacement function such that S′ = Γ(S, γ), and all non-replaced parallel statements S<sub>p</sub> of S be deterministic (S<sub>p</sub> ≡ S<sub>p</sub>). If no execution (S<sub>0</sub>, σ<sub>0</sub>, ξ<sub>0</sub>) →<sup>∗</sup> (S<sub>n</sub>, σ<sub>n</sub>, ξ<sub>n</sub>) ∈ ex(test_prog(S, γ, M<sub>≈</sub>)) violates a checkpoint equivalence, then S ≡ S′.

Discussion of Limitations. Functional equivalence of two programs is undecidable [17]. While our PEQtest approach is sound under certain assumptions, it may report violations of checkpoint equivalences although original and refactored program are equivalent. Hence, it may be incomplete. One reason is a wrong choice of code segments. For example, consider Fig. 3. Although the two code segments of original and refactored program (highlighted in blue and green, respectively) are inequivalent, the programs are equivalent. For our experiments, we ensured that we do not make the wrong choice for the code segments. In practice, one may check whether a reported violation is a false alarm caused by a wrong choice of code segments by reusing the test input that caused the violation to execute one or more test programs generated by PEQtest that use the same original and refactored program but larger segments, e.g., using segments on function or program level, or iteratively merging segments until the violation is disproved or the segments become the programs.

Next, let us discuss the assumptions used in Theorem 2. One can easily get rid of the assumption that non-replaced parallel statements must be deterministic. Basically, PEQtest needs to extend γ with pairs (S<sub>p</sub>, S<sub>p</sub>) for all non-replaced parallel statements S<sub>p</sub>. Supporting checkpoint functions is more challenging because PEQtest must be able to store and restore checkpoints and it must ensure that its checkpoints and the program's checkpoints do not interfere. While one may find such an encoding, our definition of partial equivalence does not cover checkpoint states. Also, it does not support non-deterministic programs since our main motivation for PEQtest is the refactoring or parallelization of sequential programs, not the refactoring of non-deterministic, parallel programs. To properly support checkpointing and all kinds of parallel programs, our definition of equivalence and PEQtest need to be adapted significantly.

Also, the requirements on the replacement function restrict our PEQtest approach. While many assumptions can be met by adapting labels of statements, the requirements that code segments must be subprograms and must not occur in a parallel statement are major restrictions. However, note that this only limits the granularity of code segments, but not the applicability of the approach.

Finally, we want to mention that in our above formalization we chose to stop as soon as PEQtest finds a violation because it simplified our proofs. To always inspect all refactored code segments, one can either move PEQtest's checks to the end of the test program and use different checkpoints per test encoding, or only write a log entry instead of stopping when detecting a difference. To ensure that one still tests on values of the original program, one must restore the output of the original program at the end of each test encoding or swap S<sub>1</sub> and S<sub>2</sub> in the test encoding test_eq, i.e., execute the refactored subprogram S<sub>2</sub> before the original subprogram S<sub>1</sub>. Our current implementation postpones PEQtest's checks to the end of the test program and restores the output of the original program at the end of each test encoding.

Implementation. We support test program generation for a subset of C programs with or without OpenMP directives. So far, we do not support programs with pointer aliasing (except for parameter passing). While we allow pointers and dynamic memory allocation, we do not support the modification of dynamic data structures in original or refactored code segments. The reason is that we checkpoint arrays and structs by recursively checkpointing their elements and checkpoint pointers by dereferencing them and then checkpointing the dereferenced non-pointer element. Thus, our current implementation only works correctly if pointers that need to be checkpointed are non-null and do not change in original or refactored code segments.<sup>11</sup>

Our test program generation relies on the ROSE compiler framework [37]. To store and restore checkpoints, we use a minicpr library, but we built our own library to compare checkpoints. Our implementation assumes that the start and end of a code segment i is specified by pragma statements #pragma scope i and #pragma epocs i. Currently, we insert them manually. For OpenMP parallelization (our main field of application), insertion is mostly straightforward. Often, choosing the code blocks associated with the outermost OpenMP directives is a good choice. This can easily be automated, but has not been implemented yet.
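To illustrate how code segments delimited by these pragmas could be located, the following Python sketch scans C source text line by line. It is not the ROSE-based implementation; the function name `extract_segments`, the regular expressions, and the assumption that segment identifiers are non-negative integers and that segments are not nested are all introduced here for illustration.

```python
import re
from typing import Dict, List, Optional

# Assumed syntax: "#pragma scope <id>" opens and "#pragma epocs <id>" closes segment <id>.
SCOPE = re.compile(r"^\s*#pragma\s+scope\s+(\d+)\s*$")
EPOCS = re.compile(r"^\s*#pragma\s+epocs\s+(\d+)\s*$")


def extract_segments(source: str) -> Dict[int, List[str]]:
    """Collect the source lines of each marked code segment, keyed by segment id."""
    segments: Dict[int, List[str]] = {}
    open_id: Optional[int] = None
    for line in source.splitlines():
        start = SCOPE.match(line)
        end = EPOCS.match(line)
        if start:
            if open_id is not None:
                raise ValueError("nested segments are not handled in this sketch")
            open_id = int(start.group(1))
            segments[open_id] = []
        elif end:
            if open_id != int(end.group(1)):
                raise ValueError("segment end does not match the open segment")
            open_id = None
        elif open_id is not None:
            segments[open_id].append(line)
    if open_id is not None:
        raise ValueError("segment %d is never closed" % open_id)
    return segments


# Hypothetical parallelized program with one marked segment.
example = """int main() {
  int sum = 0;
#pragma scope 1
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 10; i++) sum += i;
#pragma epocs 1
  return sum;
}"""
print(extract_segments(example))
```

In this example, the marked segment is the code block associated with the outermost OpenMP directive, which matches the selection heuristic described above.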

For each code segment, our implementation runs ROSE's definition-use analysis to detect the modified variables M<sub>≈</sub> that are visible after the code segment. If a code segment contains procedure calls, we also add all global variables and all variables occurring in the parameter expression of a pointer or array argument to the modified variables M<sub>≈</sub>. Based on the computed set M<sub>≈</sub> of modified variables, we then extend the sequential code segment with the refactored code segment and the calls to the checkpoint library necessary to store and restore checkpoints. In contrast to our formalization, the store and restore operations only get the checkpoint name, while additional calls are used to inform the checkpoint library which variables V to consider. Also note that the test program generated by our implementation stores the output of the original and refactored code segments in checkpoints that differ for each execution of a test encoding and performs output comparison at the end of the test program, which allows us to inspect all checkpoints at once and to possibly find multiple violations.

<sup>11</sup> Due to internals of the used checkpoint library, pointers must not change after they are first checkpointed.

Next, we describe the checkpoint comparison. For each variable in the two checkpoints<sup>12</sup>, we check whether their content is equivalent. Except for floating point values, we rely on C's byte level comparison function memcmp. Often, implementations of floating point operations like + are not associative, but small differences of floating point values are tolerable. Thus, our comparison of floating point values succeeds when the difference of the values is within a tolerance ε.<sup>13</sup>
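The comparison rule can be sketched as follows. This Python sketch is an illustration, not the comparison library itself: checkpoints are modeled as dictionaries from variable names to values, Python equality stands in for memcmp, and the tolerance is assumed to be applied to the absolute difference. The names `values_equivalent` and `compare_checkpoints` are hypothetical.

```python
from typing import Any, Dict, List

EPSILON = 1e-8  # tolerance used in the evaluation (footnote 13)


def values_equivalent(a: Any, b: Any, eps: float = EPSILON) -> bool:
    """Compare two checkpointed values; floats are compared up to eps."""
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        # Arrays are compared element by element.
        return len(a) == len(b) and all(
            values_equivalent(x, y, eps) for x, y in zip(a, b)
        )
    if isinstance(a, float) and isinstance(b, float):
        # Assumption: "within a tolerance" means absolute difference <= eps.
        return abs(a - b) <= eps
    # Stand-in for the byte-level comparison of non-floating-point values.
    return a == b


def compare_checkpoints(
    original: Dict[str, Any], refactored: Dict[str, Any], eps: float = EPSILON
) -> List[str]:
    """Return the names of all variables whose contents are not equivalent."""
    if original.keys() != refactored.keys():
        raise ValueError("compared checkpoints must store the same variables")
    return sorted(
        name
        for name in original
        if not values_equivalent(original[name], refactored[name], eps)
    )


# Summation order changes the floating point result slightly,
# as may happen when a reduction is parallelized.
cp_original = {"sum": (0.1 + 0.2) + 0.3, "n": 3, "arr": [1.0, 2.0]}
cp_refactored = {"sum": 0.1 + (0.2 + 0.3), "n": 3, "arr": [1.0, 2.0 + 1e-3]}
print(compare_checkpoints(cp_original, cp_refactored))
```

Here the two values of `sum` differ in the last bits only and are accepted, while the deviation in `arr` exceeds the tolerance and is reported.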

#### 4 Evaluation

The goals of our experiments are to (a) study how effective and efficient PEQtest's detection of inequivalences is and to (b) compare PEQtest to an existing equivalence checker. For our comparison, we choose PEQcheck because it also supports localized checking for OpenMP programs.

#### 4.1 Experimental Setup

Benchmark. To check equivalence of sequential and parallelized programs, we use the tasks from the DataRaceBench (DRB) benchmark suite [24,50] (version 1.3.2), which addresses common mistakes in OpenMP parallelization and contains OpenMP programs with and without data races. From the DataRaceBench, we exclude all tasks with threadprivate directives, which we cannot cover with our segments, and all tasks that require at least an OpenMP 4.5 compiler or that offload computation to a different device (i.e., use the target construct) because they are neither supported by PEQtest nor PEQcheck. In total, we get 132 tasks (26 equivalent and 106 inequivalent tasks). We manually selected the code segments following the idea discussed in the implementation paragraph and use the DataRaceBench programs without OpenMP constructs for the sequential (original) programs. To execute the generated test programs, we use the inputs provided by DataRaceBench.

To check equivalence of two sequential program versions, we consider all non-recursive programs from Rêve [15]. However, we exclude loop4 and loop5, which were not available, as well as digits10, digits!10, and barthe2, which declare different sets of output variables in original and refactored program and, thus, are detected inequivalent during test program generation. To make the programs executable, we remove the mark annotations, which have no implementation, and extend each of the programs with a test driver that generates random inputs. The code segments are the same as in the evaluation of PEQcheck [19]. In total, we get 15 sequential tasks (5 equivalent and 10 inequivalent tasks).

Tool Configurations. To study the trade-off between effectiveness and efficiency, we examine three PEQtest configurations, which differ in the resources used during test program execution. The low effort configuration uses one thread and runs the test program once. The other two configurations use two threads for the DRB tasks and one thread for the sequential tasks while running the test program 10 and 100 times, respectively. For the competitor PEQcheck [19], we use a setup similar to [19]. For the DRB tasks, PEQcheck combines the PEQcheck encoding<sup>14</sup> (revision 9dc36b) and verifier CIVL [42] (version 1.20_5259) using the theorem prover Z3 [27] (version 4.8.7). We restrict CIVL to two threads, set its timeout to 5 min, and disable the division by zero and memory leakage checks. For the sequential tasks, PEQcheck combines the PEQcheck encoding with verifier CPAchecker [7] (version 2.0). For verification, we use CPAchecker's default analysis, which is also limited to 5 min.

<sup>12</sup> By construction, checkpoints that are compared store the same variables.

<sup>13</sup> In our evaluation, we use ε = 10<sup>−8</sup>.

Environment. We use a time limit of 5 min per task and run our experiments on an Ubuntu 20.04 machine with an Intel Core i7 (1.8 GHz) and 32 GB of RAM.

#### 4.2 Experiments

RQ 1: How effective is PEQtest with minimal resources? To answer this research question, we look at PEQtest's results for the low effort configuration (1 thread, 1 run). For the DataRaceBench (DRB) tasks (left) and the sequential (SEQ) tasks (right), Tab. 1 shows for all three PEQtest configurations the absolute and relative number<sup>15</sup> of correctly detected inequivalences, the number of missed inequivalences, i.e., inequivalences that are not detected, the number of equivalent tasks for which an inequivalence is incorrectly detected (i.e., the false alarms), and the number of equivalent tasks for which no inequivalence is detected. For the two classes in which no inequivalence is detected (missed inequivalence or correctly detected no inequivalence), we also distinguish between the two reasons for not detecting inequivalences: (1) no inequivalences are reported during test program execution and (2) task not completed, e.g., test program generation failed or a timeout occurred during test program generation or execution.

Looking at the first two columns of the DRB tasks and the two columns of the SEQ tasks in Tab. 1, which show the results of the low effort configuration, we observe that for our examples PEQtest does not report any false alarms, i.e., the number of incorrectly detected inequivalences is zero. Thus, we have 100 % precision for inequivalence detection. More surprisingly, PEQtest detects more than half of the inequivalences (i.e., recall > 50 %) with its low effort configuration and, thus, without parallel execution in case of the parallelized DRB tasks. Studying the detected inequivalences, we observe that almost all the detected inequivalent DRB tasks use a variable to which data-sharing attribute (first)private is assigned and that is visible, but typically not live after the parallelized code segment. The data-sharing attribute makes the variable thread-local during execution of the parallelized code segment and prevents the thread-local variable values from becoming available after the parallelized code segment.

<sup>14</sup> https://git.rwth-aachen.de/svpsys-sw/FECheck

<sup>15</sup> The relative numbers are the absolute numbers divided by the total number of equivalent and inequivalent tasks, respectively.

Table 1: For each of the three PEQtest configurations, the absolute and relative number of DRB and sequential (SEQ) tasks for which inequivalence is detected correctly, is missed, is detected incorrectly, and is correctly not detected. If no inequivalence is detected, the table also distinguishes between no inequivalence reported (i.e., no inequivalence observed in runs) and task not completed due to a timeout or failure.


Furthermore, many of the detected inequivalent sequential tasks are inequivalent for many different input values. We conclude that inequivalences caused by the discussed data-sharing attributes or input-insensitive inequivalences can easily be detected with a single run and thread.
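For readers who want to relate the table categories to the precision and recall values mentioned above, the following Python sketch shows the usual computation. The function name `precision_recall` and the example counts are hypothetical and are not taken from Tab. 1.

```python
from typing import Optional, Tuple


def precision_recall(
    detected: int, missed: int, false_alarms: int
) -> Tuple[Optional[float], Optional[float]]:
    """Compute precision and recall of inequivalence detection.

    detected:     inequivalent tasks reported as inequivalent
    missed:       inequivalent tasks for which nothing was reported
    false_alarms: equivalent tasks reported as inequivalent
    """
    reported = detected + false_alarms
    inequivalent = detected + missed
    # Both measures are undefined when their denominator is zero.
    precision = detected / reported if reported else None
    recall = detected / inequivalent if inequivalent else None
    return precision, recall


# Hypothetical counts, chosen only to illustrate the computation.
print(precision_recall(detected=6, missed=4, false_alarms=0))
```

With no false alarms, precision is 1.0 regardless of how many inequivalences are missed, which is why recall is the more informative measure when comparing the configurations.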

RQ 2: Does PEQtest's effectiveness increase when given more resources and what are the costs? First, we examine whether PEQtest performs better if we increase the resources for testing, i.e., the number of runs and for parallelized programs also the number of threads used during test program execution. Comparing the results of our three PEQtest configurations (Tab. 1), we observe that there is no difference for the sequential tasks. The reason is that one can only detect the missed inequivalences with particular inputs whose random generation is unlikely. For the DRB tasks, however, the number of correctly detected inequivalences increases and the number of missed inequivalences decreases when providing more resources. All other entries stay the same. Hence, PEQtest's effectiveness may increase (i.e., its recall increases) when we allow it to use more resources. Especially, using more than one thread for parallelized programs increases the effectiveness significantly, as one could have expected. For our examples, using 100 instead of 10 runs hardly improves PEQtest's effectiveness. In general, PEQtest misses inequivalences in the DRB tasks if the generation of the test program fails (10 tasks). In addition, it misses inequivalences for SIMD constructs (2 tasks), inequivalences depending on thread scheduling (13 tasks), and inequivalences in I/O behavior (7 tasks), e.g., values written via printf, which our implementation does not support yet.<sup>16</sup>

<sup>16</sup> Support for I/O can be added by writing all outputs to the checkpoint.

Fig. 4: Per-task comparison of the execution time of all test program runs (left) and the total runtime of PEQtest (right) in the low effort configuration (1 thread, 1 run) against the other two configurations (2 threads for DRB tasks and 1 thread for sequential tasks, and 10 (▲) or 100 (■) runs)

Second, we examine the costs for increasing PEQtest's resources for test program execution. To this end, we look at the execution times PEQtest consumes for all test program runs and the total execution time (test program generation and execution). Figure 4 compares for each task that does not belong to one of the task not completed categories the times for the low effort configuration (1 thread, 1 run, x-axis) with the other two configurations of PEQtest. As one could have expected, the scatter plot on the left-hand side of Fig. 4 shows that the execution times for the test programs scale linearly with the number of runs. A similar behavior can often be observed when the total time of the low effort configuration is not dominated by the test program generation (> 3 s).

In summary, providing more resources often increases PEQtest's effectiveness while causing at most a linear increase of runtime costs. In particular for parallelized tasks, using more than one thread is beneficial. However, we require many runs of the generated test program to find schedule-dependent or input-sensitive inequivalences.
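The difficulty of finding input-sensitive inequivalences with random inputs can be illustrated with a small sketch. The two functions, the failing input 4711, and the helper names below are hypothetical and only serve as an example; they are not taken from the benchmark.

```python
import random
from typing import Callable, Optional


def original(x: int) -> int:
    return x * 2


def refactored(x: int) -> int:
    # Hypothetical refactoring bug that only shows for a single input value.
    return 0 if x == 4711 else x * 2


def random_differential_test(
    f: Callable[[int], int],
    g: Callable[[int], int],
    runs: int,
    lo: int,
    hi: int,
    seed: int,
) -> Optional[int]:
    """Run f and g on random inputs; return the first input on which they differ."""
    rng = random.Random(seed)
    for _ in range(runs):
        x = rng.randint(lo, hi)
        if f(x) != g(x):
            return x
    return None


def detection_probability(runs: int, domain_size: int) -> float:
    """Chance that uniform sampling hits one specific failing input at least once."""
    return 1.0 - (1.0 - 1.0 / domain_size) ** runs


# With 100 runs over a 32-bit input domain, hitting the single failing
# input is very unlikely.
print(detection_probability(100, 2**32))
```

Under these assumptions, increasing the number of runs from 10 to 100 raises the detection probability only marginally, which is consistent with the observation that more runs did not help for the sequential tasks.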

RQ 3: How does PEQtest compare against state-of-the-art? We compare PEQtest's configuration using 100 runs with equivalence checker PEQcheck [19], which also performs localized checks, but relies on verification. Since PEQtest's and PEQcheck's definitions of functional equivalence differ (PEQtest considers all variables, while PEQcheck only considers live variables), we restrict the comparison of PEQtest and PEQcheck to those 72 DRB tasks and 8 sequential tasks that (1) are either equivalent or inequivalent for both notions of equivalence and (2) in which the code segments affect at least one variable that is live afterwards.

Table 2 shows the results of PEQtest and PEQcheck on the restricted benchmark. The structure of Tab. 2 is the same as Tab. 1. Looking at Tab. 2, we first observe that both approaches do not incorrectly detect an inequivalence, i.e., they do not report false alarms. Hence, the precision for inequivalence detection is 100 %. For the sequential tasks, PEQcheck detects one additional inequivalent task, for which PEQtest times out. In contrast, PEQtest detects significantly more inequivalent DRB tasks (i.e., has a higher recall) and, thus, misses fewer inequivalent DRB tasks. An important reason for the lower recall of PEQcheck is that PEQcheck's inspection fails in 87.5 % of the DRB tasks. The major failure causes are timeouts (30 %), missing support for OpenMP constructs in the verifier CIVL (31 %), and the detection of violations that are unrelated to functional equivalence, e.g., array out of bounds accesses in a verification task, which is generated by PEQcheck to check functional equivalence. Despite PEQcheck's worse performance, it can verify the task DRB076-flush-orig-no.c, for which PEQtest failed. Finally, we remark that although PEQtest has a higher time limit than PEQcheck (namely, 5 min per run instead of 5 min per verification task), there exist only two tasks in which PEQtest requires more than 5 min in total and PEQcheck could have profited from a higher time limit.

Table 2: For PEQtest and PEQcheck, the absolute and relative number of DRB and sequential (SEQ) tasks for which inequivalence is detected correctly, is missed, is detected incorrectly, and is correctly not detected. If no inequivalence is detected, the table also distinguishes between no inequivalence reported and task not completed due to a timeout or failure.

Summing up, PEQtest is typically a better choice than PEQcheck when aiming to find inequivalences. In particular, PEQtest profits from relying on compiler support of OpenMP constructs and from checking equivalence only for the test inputs. Thus, PEQtest is well-suited for inequivalence detection, but in contrast to PEQcheck, which considers all inputs, it rarely proves equivalence.

#### 5 Related Work

Approaches inspecting functional equivalence aim at proving equivalence or detecting behavioral differences. Alternatively, they characterize for which inputs equivalence is ensured.

Proving Functional Equivalence. Approaches proving functional equivalence may use relational verification [6,5], (bi)simulation relations [56,40,14,13], or domain-specific checks [55,9,10,18,25]. Other approaches transform the programs into models and check model equivalence [43,49,41]. ARDiff [4] compares symbolic summaries and Rêve [15] translates the equivalence into Horn constraints. Several approaches [17,38,23,51,42,2,19] encode equivalence checking into programs. Their encoding idea is similar to PEQtest's encoding of the functional equivalence tests. The closest encoding is the encoding of UCKLEE [38], which also uses checkpointing, while the other approaches duplicate variables. Despite similar encodings, these approaches do not test, but verify the generated programs. A further difference is that they typically generate more than one program, namely one per changed unit (program [42], function [17,38,23,51], or refactored code segment [2,19]). Each generated program only consists of the functional equivalence check of the respective unit and typically considers all possible inputs. In contrast, PEQtest embeds the equivalence tests into the original program and only considers inputs produced by the original program.

Difference Detection. Relative debugging [3] executes the original and refactored program in parallel and compares the values of user-defined variables or data structures at user-defined program locations, which is more fine-grained than functional equivalence. Nevertheless, several techniques focus on detecting differences of the functional behavior. Differential monitoring [28] applies runtime monitoring that runs two programs, e.g., original and refactored program, in parallel, distributes any input to both programs, compares their outputs, and forwards equivalent outputs to the environment, while aborting in case of inequivalence. Following the idea of differential testing [26], BERT [20], shadow symbolic execution [31], and HyDiff [29] generate tests and execute the generated tests on original and refactored program to detect differences in the behavior. BERT [20] generates inputs to cover the changed code parts. Shadow symbolic execution [31] uses a more advanced test generation that is steered towards internal behavioral differences. HyDiff [29] combines shadow symbolic execution with fuzzing, using the tests from the shadow symbolic execution to steer the fuzzer AFL. In contrast, Qi et al. [36] and eXpress [47] directly aim at generating difference revealing tests. To this end, they steer the test generation to find test inputs that reach a change that affects the output. While the previous techniques use special test generators, Diffut [52] and DiffGen [46] rely on standard test generators. Diffut keeps shadow variables for the original program in the refactored program, wraps the method of the original program to extend it with equivalence checks, and uses JML annotations to force the execution of the wrapped method of the original program while testing the refactored program. DiffGen [46] generates one test driver per changed method that copies the input, executes original and refactored method with original and copied input, respectively, and contains one check per output. DiffGen's encoding idea is similar to PEQtest's encoding of functional equivalence tests, but PEQtest focuses on refactored code segments.

Semantic Characterization of Differences. To provide more information in case of non-equivalence, a few approaches compute or (under)approximate the condition when original and refactored program are equivalent. To this end, they use symbolic execution [34,48], abstract interpretation [21,32,33], or testing [11].

#### 6 Conclusion

While refactorings are necessary to improve software quality, correct refactoring, i.e., a refactoring that does not change the functional behavior of the software, is challenging. Several solutions have been proposed to detect that refactored programs alter the behavior, some of which compare the functional behavior of original and refactored programs.

Approaches checking functional equivalence often use heavyweight (formal) verification. Furthermore, difference detection approaches frequently use dedicated test generators and execute (some of) the non-modified code twice, once for the original and once for the refactored program (function). To overcome these restrictions, we propose PEQtest, which can be used to test (the intended application) or to formally verify functional equivalence. The test program generated by PEQtest, for which we proved that it checks functional equivalence, allows us to rely on compiler support, e.g., for OpenMP, to reuse existing tests or test generators, and at the same time to exploit that refactorings are often local, thus avoiding executing non-modified code more than once in each test program execution. To this end, PEQtest replaces each refactored code segment in the program, e.g., a parallelized code segment, by a local check that inspects the equivalence of the corresponding original and refactored code segment.

We implemented PEQtest and evaluated it with the DataRaceBench benchmark suite and sequential refactorings already used to evaluate other functional equivalence checkers. Our experiments show that PEQtest detects many of the inequivalent tasks, e.g., incorrectly parallelized tasks, using a limited amount of resources, while reporting no false alarm. A comparison with the state-of-the-art equivalence checker PEQcheck reveals that PEQtest often performs better.

#### References



## **An Institutional Approach to Communicating UML State Machines**

Tobias Rosenberger<sup>1,2</sup>, Alexander Knapp<sup>3</sup>, and Markus Roggenbach<sup>1</sup>

<sup>1</sup> Swansea University, Swansea, U.K. {t.rosenberger.971978, m.roggenbach}@swansea.ac.uk

<sup>2</sup> VERIMAG, Université Grenoble Alpes, Grenoble, France

<sup>3</sup> Universität Augsburg, Augsburg, Germany knapp@informatik.uni-augsburg.de

**Abstract** We present a new approach to providing institution-based semantics for communicating UML state machines in the form of a hybrid modal logic M<sup>↓</sup><sub>D</sub>. A theoroidal comorphism maps M<sup>↓</sup><sub>D</sub> into the Casl institution. This allows for symbolic reasoning on communicating UML state machines.

#### **1 Introduction**

In line with a long-standing line of research [5,6,15,4], we set out on a general programme to bring together multi-view system specification with UML diagrams and heterogeneous specification and verification based on institution theory, giving the different system views both a joint semantics and richer tool support. Institutions, a formal notion of a logic, are a principled way of creating such joint semantics. They make moderate assumptions about the data constituting a logic, give uniform notions of well-behaved translations between logics and, given a graph of such translations, automatically give rise to a joint institution.

UML state machines are an object-based variant of Harel statecharts. Within the UML, state machines are a central means to specify system behaviour. In previous work [16], an institutional semantics for UML state machines was provided that allowed for symbolic reasoning. Such symbolic reasoning can be of advantage as, in principle, it allows one to verify properties of UML state machines with large or infinite state spaces. Here, we extend this work in order to cater for communication.

A typical scenario for such communication is the interaction between a User, an ATM, and a Bank in order to authenticate the User as legitimate owner of a bank card by checking an entered PIN. Figure 1 depicts a UML modelling for this scenario. In brief, the system consists of the ATM and the Bank, where we consider User interaction as an external communication. The scenario begins with the User entering a bank card and a PIN. The ATM requests their verification by the Bank. The Bank checks validity of the card/PIN combination and communicates the result to the ATM. We model the validity check as internal, non-deterministic choice made by the Bank. In case of a positive result, the ATM will return the card to the User. In case of a negative result, the User is given a second and third chance to enter a correct PIN. After the third verification failure, the ATM will keep the card. A typical question on this model is whether the ATM will consider the verification successful only if the Bank has already come to the same conclusion. To answer such questions, one needs to take into account the behaviour of all state machines involved as well as how they can communicate via the ports and connectors as specified by a composite structure diagram.

**Figure 1.** UML diagrams for the ATM example (implicit completion events omitted): Composite structure diagram: top; state machine: left ATM, right Bank.

© The Author(s) 2022

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 205–224, 2022. https://doi.org/10.1007/978-3-030-99429-7\_12

Closest to our approach are the works [6,4]. Both papers address the topic of communicating state machines; however, both fail to provide institutions of state machines, as reported in [15,16]. Learning from the reason for this shortcoming, rather than capturing UML state machines directly as an institution, [16] builds up a new logic in which UML state machines can be embedded. Here, we extend this logic for communication. In particular, we treat UML event pools as part of composite structure diagrams rather than of state machines. State machines are seen as a completely open system, which is (partially) closed by 'wiring up' in a communication structure. Overall, this leads to a separation of concerns: event pools and transitions can be analysed independently.

A number of authors give formal semantics to communicating state machines, however with a purpose different from symbolic analysis of UML. The Object Management Group provides an executable semantics of UML Composite Structures [14]. Their objective is to provide an interpreter for the executable subset fUML of the UML. Dragomir [12] define transformations from composite structure diagrams to communicating extended timed automata for the purpose of simulation, static analysis and model-checking. Mazzanti et al. [8] provide a UML model checker that also covers composite structure diagrams. A quite comprehensive formal semantics has been provided by Liu et al. [7], again with the main purpose of supporting model checking.

In Section 2, we recall the notion of an institution and sketch the CFOL<sup>=</sup> institution of Casl, which we use for specifying data. In Section 3, we extend the hybrid modal logic M<sup>↓</sup><sub>D</sub> [16] to cater also for output by adding the notion of messages (in [16] with input only). For structures and formulæ, this requires us to introduce relativisations with regard to a set of outputs. We show that the extended logic M<sup>↓</sup><sub>D</sub> is an institution, can be embedded into Casl via a theoroidal comorphism, and allows for "borrowing" of Casl theorem proving support. In Section 4, we show how to embed simple UML state machines with output into the extended logic M<sup>↓</sup><sub>D</sub>. In Section 5, we provide an institution for simple UML composite structures by enriching our extended logic M<sup>↓</sup><sub>D</sub> with elements capturing connectors and event queues. Again, "borrowing" of Casl theorem proving support is possible. Finally, in Section 6, we demonstrate that our approach allows for automated theorem proving.

## **2 Background on Institutions and** Casl

Institutions are an abstract formalisation of the notion of logical systems combining signatures, structures, sentences, and satisfaction under the slogan "truth is invariant under change of notation" [3]. Institutions can be related in different ways by institution (forward) (co-)morphisms, where a so-called theoroidal institution comorphism covers a particular case of encoding a "poorer" logic into a "richer" one. The algebraic specification language Casl [11] uses an institution of first-order logic at its basic specification level, where mainly signature items and axiom sentences are listed. On its structured specifications level, Casl offers institution-independent combination mechanisms to build larger specifications in a hierarchical and modular fashion. We use Casl's basic institution CFOL<sup>=</sup> of first-order logic with equality and sort generation constraints [9] and construct a theoroidal institution comorphism from our hybrid modal logic institution M<sup>↓</sup><sub>D</sub> to CFOL<sup>=</sup>.

#### **2.1 Institutions and Theoroidal Institution Comorphisms**

An *institution* I = (S<sup>I</sup>, Str<sup>I</sup>, Sen<sup>I</sup>, ⊨<sup>I</sup>) consists of (i) a category of *signatures* S<sup>I</sup>; (ii) a contravariant *structures functor* Str<sup>I</sup> : (S<sup>I</sup>)<sup>op</sup> → Cat, where Cat is the category of (small) categories; (iii) a *sentence functor* Sen<sup>I</sup> : S<sup>I</sup> → Set, where Set is the category of sets; and (iv) a family of *satisfaction relations* ⊨<sup>I</sup><sub>Σ</sub> ⊆ |Str<sup>I</sup>(Σ)| × Sen<sup>I</sup>(Σ) indexed over Σ ∈ |S<sup>I</sup>|, such that the following *satisfaction condition* holds for all σ : Σ → Σ′ in S<sup>I</sup>, ϕ ∈ Sen<sup>I</sup>(Σ), and M′ ∈ |Str<sup>I</sup>(Σ′)|:

$$\mathrm{Str}^{\mathcal{I}}(\sigma)(M') \models^{\mathcal{I}}\_{\Sigma} \varphi \iff M' \models^{\mathcal{I}}\_{\Sigma'} \mathrm{Sen}^{\mathcal{I}}(\sigma)(\varphi)\,.$$

Str<sup>I</sup> (σ) is called the *reduct* functor, Sen<sup>I</sup> (σ) the *translation* function.
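To make the satisfaction condition concrete, the following minimal sketch checks it exhaustively for a toy propositional institution. This toy logic is an assumption made for illustration only and is not part of the constructions in this paper: signatures are finite sets of proposition symbols, signature morphisms are renamings given as dictionaries, structures are the sets of symbols that are true, and sentences are nested tuples.

```python
from itertools import chain, combinations

# Sentences of the toy logic: ("var", p) | ("not", phi) | ("or", phi, psi)

def translate(sigma, phi):
    """Sen(sigma): rename proposition symbols along the signature morphism."""
    tag = phi[0]
    if tag == "var":
        return ("var", sigma[phi[1]])
    if tag == "not":
        return ("not", translate(sigma, phi[1]))
    return ("or", translate(sigma, phi[1]), translate(sigma, phi[2]))

def reduct(sigma, model_prime):
    """Str(sigma): view a Sigma'-structure (set of true symbols) as a Sigma-structure."""
    return frozenset(p for p, q in sigma.items() if q in model_prime)

def satisfies(model, phi):
    tag = phi[0]
    if tag == "var":
        return phi[1] in model
    if tag == "not":
        return not satisfies(model, phi[1])
    return satisfies(model, phi[1]) or satisfies(model, phi[2])

def powerset(symbols):
    items = sorted(symbols)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))
    return (frozenset(s) for s in subsets)

def satisfaction_condition_holds(sigma, target_symbols, phi):
    """Check Str(sigma)(M') |= phi  iff  M' |= Sen(sigma)(phi) for all M'."""
    return all(
        satisfies(reduct(sigma, m), phi) == satisfies(m, translate(sigma, phi))
        for m in powerset(target_symbols))

sigma = {"p": "a", "q": "a"}  # a non-injective renaming of {p, q} into {a, b}
phi = ("or", ("var", "p"), ("not", ("var", "q")))
print(satisfaction_condition_holds(sigma, {"a", "b"}, phi))
```

The check compares both sides of the equivalence for every structure over the target signature; reduct and translation go in opposite directions, which is exactly the contravariance of Str<sup>I</sup>.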

A *theory presentation* T = (Σ, Φ) in the institution I consists of a signature Σ ∈ |S<sup>I</sup>|, also denoted by Sig(T), and a set of sentences Φ ⊆ Sen<sup>I</sup>(Σ). Its *model class* Mod<sup>I</sup>(T) is the class {M ∈ Str<sup>I</sup>(Σ) | M ⊨<sup>I</sup><sub>Σ</sub> ϕ for all ϕ ∈ Φ} of the Σ-structures satisfying the sentences in Φ. A *theory presentation morphism* σ : (Σ, Φ) → (Σ′, Φ′) is given by a signature morphism σ : Σ → Σ′ such that M′ ⊨<sup>I</sup><sub>Σ′</sub> Sen<sup>I</sup>(σ)(ϕ) for all ϕ ∈ Φ and M′ ∈ Mod<sup>I</sup>(Σ′, Φ′). Theory presentations in I and their morphisms form the category Pres<sup>I</sup>.

A *theoroidal institution comorphism* ν = (ν<sup>Sig</sup>, ν<sup>Mod</sup>, ν<sup>Sen</sup>) : I → I′ consists of a functor ν<sup>Sig</sup> : S<sup>I</sup> → Pres<sup>I′</sup> inducing the functor ν<sup>S</sup> = ν<sup>Sig</sup>; Sig : S<sup>I</sup> → S<sup>I′</sup> on signatures, a natural transformation ν<sup>Mod</sup> : (ν<sup>Sig</sup>)<sup>op</sup>; Mod<sup>I′</sup> →˙ Str<sup>I</sup> on models and structures, and a natural transformation ν<sup>Sen</sup> : Sen<sup>I</sup> →˙ ν<sup>S</sup>; Sen<sup>I′</sup> on sentences, such that for all Σ ∈ |S<sup>I</sup>|, M′ ∈ |Mod<sup>I′</sup>(ν<sup>Sig</sup>(Σ))|, and ϕ ∈ Sen<sup>I</sup>(Σ) the following *satisfaction condition* holds:

$$\nu^{\mathrm{Mod}}\_{\Sigma}(M') \models^{\mathcal{I}}\_{\Sigma} \varphi \iff M' \models^{\mathcal{I}'}\_{\nu^{S}(\Sigma)} \nu^{\mathrm{Sen}}\_{\Sigma}(\varphi)\,.$$

A theory presentation (Σ, Φ) over the institution I is *translated* via a theoroidal institution comorphism ν : I → I′ into the theory presentation ν<sup>Pres</sup>(Σ, Φ) = (Σ<sub>ν</sub>, Φ<sub>ν</sub> ∪ ν<sup>Sen</sup><sub>Σ</sub>(Φ)) over I′, where ν<sup>Sig</sup>(Σ) = (Σ<sub>ν</sub>, Φ<sub>ν</sub>) and ν<sup>Sen</sup><sub>Σ</sub>(Φ) = {ν<sup>Sen</sup><sub>Σ</sub>(ϕ) | ϕ ∈ Φ}.

#### **2.2** Casl **and the Institution** CFOL<sup>=</sup>

At the level of basic Casl specifications, CFOL<sup>=</sup> offers declarations of *sorts*, *operations*, and *predicates* with given argument and result sorts. Formally, this defines a *many-sorted signature* Σ = (S, F, P) with a set S of sorts, an S<sup>∗</sup>×S-sorted family F = (F<sub>w,s</sub>)<sub>w s∈S<sup>+</sup></sub> of *total function symbols*, and a family P = (P<sub>w</sub>)<sub>w∈S<sup>∗</sup></sub> of *predicate symbols*. Using these symbols, one may then write axioms in first-order logic with equality. Moreover, one can specify *data types*, each given by a list of data constructors and, optionally, selectors. Data types may be declared to be *generated* or *free*. Generatedness amounts to an implicit higher-order induction axiom and intuitively states that all elements of the data types are reachable by constructor terms ("no junk"); freeness additionally requires that all these constructor terms are distinct ("no confusion"). Basic Casl specifications denote the class of all algebras which fulfil the declared axioms, i.e., Casl has loose semantics. More formally, for CFOL<sup>=</sup> a *many-sorted* Σ*-structure* M consists of a non-empty carrier set s<sup>M</sup> for each s ∈ S, a total function f<sup>M</sup> : w<sup>M</sup> → s<sup>M</sup> for each function symbol f ∈ F<sub>w,s</sub>, and a predicate p<sup>M</sup> for each predicate symbol p ∈ P<sub>w</sub>. A *many-sorted* Σ*-sentence* is a closed many-sorted first-order formula over Σ or a sort generation constraint.

## **3 The Hybrid Modal Logic** M<sup>↓</sup><sub>D</sub> **for Event/Data Systems**

The logic M<sup>↓</sup><sub>D</sub> is a hybrid modal logic for specifying and reasoning about event/data-based reactive systems. The modal part of the logic handles transitions between system configurations, where the modalities describe guarded configuration moves based on input and output events with arguments, i.e., messages, and the corresponding effects on data. The hybrid part of the logic allows binding control states of system configurations and jumping to configurations with such control states explicitly. M<sup>↓</sup><sub>D</sub> with its signatures, sentences, and structures forms an institution. Furthermore, M<sup>↓</sup><sub>D</sub> can be translated into Casl via a theoroidal institution comorphism.

We extend the logic and the comorphism of [16] by including output. A modal formula ⟨|i : φ ⫽ [O]<sub>N</sub> : ψ|⟩ϱ now says that in the current configuration an input message according to i can be accepted if precondition state predicate φ holds and that, in response, output messages according to [O]<sub>N</sub> and satisfying the transition predicate ψ can be produced such that ϱ holds afterwards. The messages frame [O]<sub>N</sub> tells that besides outputs from O also additional messages according to N can be sent. This relativisation allows M<sup>↓</sup><sub>D</sub> to specify the "cone of messages above O" in a finite and, in particular, institution-compatible way that also is extensible into a theoroidal institution comorphism from M<sup>↓</sup><sub>D</sub> to Casl. We furthermore demonstrate that for pure M<sup>↓</sup><sub>D</sub>-invariants the comorphism leads to simpler Casl proof obligations that are easier to automate in theorem proving.

For the inclusion of *data* in M<sup>↓</sup><sub>D</sub>, we assume given a consistent, monomorphic Casl specification Dt. The interpretation of the sorts S(Dt) of Dt represents the different kinds of data, like the integers or lists of integers. Requiring Dt to be monomorphic fixes these carrier sets as there is, up to isomorphism, a single model D of Dt. We also use open formulæ F<sup>Casl</sup><sub>Sig(Dt),X</sub> over sorted variables X = (X<sub>s</sub>)<sub>s∈S(Dt)</sub> and their satisfaction relation D, β ⊨<sup>Casl</sup><sub>Sig(Dt),X</sub> ϕ for a variable valuation β : X → D, i.e., β = (β<sub>s</sub> : X<sub>s</sub> → s<sup>D</sup>)<sub>s∈S(Dt)</sub>.

#### **3.1 Data States and Transitions**

A *data signature* A consists of a finite set of *attributes* |A| and a sorting s(A): |A| → S(Dt). A *data signature morphism* from a data signature A to a data signature A′ is a function α: |A| → |A′| such that s(A)(a) = s(A′)(α(a)) for all a ∈ |A|. We sometimes identify A with the S(Dt)-sorted family (s(A)<sup>−1</sup>(s))<sub>s∈S(Dt)</sub>.

A *data state* ω for a data signature A is given by an attribute valuation ω: A → D, i.e., ω(a) ∈ s(A)(a)<sup>D</sup> for a ∈ |A|; in particular, Ω(A) = D<sup>A</sup> is the set of A-data states. The *state predicates* F<sup>D</sup><sub>A,X</sub> are the formulæ in F<sup>Casl</sup><sub>Sig(Dt),A∪X</sub>, taking A as well as an additional S(Dt)-indexed family X as variables. A state predicate φ ∈ F<sup>D</sup><sub>A,X</sub> is to be interpreted over an A-data state ω and variable valuation β : X → D and we define the *satisfaction relation* ⊨<sup>D</sup> by

$$\omega, \beta \models\_{A,X}^{\mathcal{D}} \phi \iff \mathcal{D}, \omega \cup \beta \models\_{Sig(Dt), A \cup X}^{\mathrm{Casl}} \phi\,.$$

The α*-reduct* of an A′-data state ω′: A′ → D along a data signature morphism α: A → A′ is given by the A-data state ω′|α: A → D with (ω′|α)(a) = ω′(α(a)) for every a ∈ |A|. The *state predicate translation* F<sup>D</sup><sub>α,X</sub> : F<sup>D</sup><sub>A,X</sub> → F<sup>D</sup><sub>A′,X</sub> along α: A → A′ is given by the Casl-formula translation F<sup>Casl</sup><sub>Sig(Dt),α∪1<sub>X</sub></sub> along the substitution α∪1<sub>X</sub>. Reduct and translation fulfil the following *satisfaction condition* due to the general substitution lemma for Casl:

$$
\omega' | \alpha, \beta \models\_{A,X}^{\mathcal{D}} \phi \iff \omega', \beta \models\_{A',X}^{\mathcal{D}} \mathcal{F}\_{\alpha,X}^{\mathcal{D}}(\phi) \; .
$$
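The interplay of reduct and translation on data states can be illustrated by a small sketch. It assumes a hypothetical mini-language of state predicates given as nested tuples (attributes, variables, integer constants, `le`, `and`) in place of full Casl formulæ; the attribute names `tries`, `atm_tries`, and `bank_ok` and the variable `limit` are illustrative and not taken from the ATM model.

```python
def reduct_state(alpha, omega_prime):
    """omega'|alpha: read each attribute a of A as alpha(a) in the A'-data state."""
    return {a: omega_prime[target] for a, target in alpha.items()}

def translate_predicate(alpha, phi):
    """Rename attributes along alpha; variables and constants stay untouched."""
    tag = phi[0]
    if tag == "attr":
        return ("attr", alpha[phi[1]])
    if tag in ("var", "const"):
        return phi
    return (tag,) + tuple(translate_predicate(alpha, arg) for arg in phi[1:])

def evaluate(omega, beta, phi):
    """Evaluate a term or predicate over data state omega and valuation beta."""
    tag = phi[0]
    if tag == "attr":
        return omega[phi[1]]
    if tag == "var":
        return beta[phi[1]]
    if tag == "const":
        return phi[1]
    left, right = (evaluate(omega, beta, arg) for arg in phi[1:])
    if tag == "le":
        return left <= right
    if tag == "and":
        return left and right
    raise ValueError(f"unknown connective: {tag}")

alpha = {"tries": "atm_tries"}                    # morphism A -> A'
omega_prime = {"atm_tries": 2, "bank_ok": 1}      # an A'-data state
beta = {"limit": 3}                               # valuation of X
phi = ("le", ("attr", "tries"), ("var", "limit"))  # state predicate over A, X

lhs = evaluate(reduct_state(alpha, omega_prime), beta, phi)
rhs = evaluate(omega_prime, beta, translate_predicate(alpha, phi))
print(lhs, rhs)
```

Both sides agree because the translated predicate looks up `alpha(a)` in ω′ exactly where the original predicate looks up `a` in the reduct.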

A *data transition* (ω, ω′) for a data signature A is a pair of A-data states; in particular, Ω<sup>2</sup>(A) = (D<sup>A</sup>)<sup>2</sup> is the set of A-data transitions. It holds that (D<sup>A</sup>)<sup>2</sup> ≅ D<sup>2A</sup>, where 2A = A ⊎ A and we assume that no attribute in A ends in a prime ′ and all attributes in the second summand are adorned with an additional prime. The *transition predicates* F<sup>2D</sup><sub>A,X</sub> are the formulæ F<sup>D</sup><sub>2A,X</sub>. The satisfaction relation ⊨<sup>2D</sup> for a transition predicate ψ ∈ F<sup>2D</sup><sub>A,X</sub>, data transition (ω, ω′) ∈ Ω<sup>2</sup>(A), and valuation β : X → D is defined as

$$(\omega, \omega'), \beta \models^{2\mathcal{D}}\_{A,X} \psi \iff \omega + \omega', \beta \models^{\mathcal{D}}\_{2A,X} \psi$$

where ω + ω′ ∈ Ω(2A) with (ω + ω′)(a) = ω(a) and (ω + ω′)(a′) = ω′(a).
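The encoding of a data transition as a single data state over 2A can be sketched as follows. Transition predicates are modelled here simply as Python functions over an environment dictionary, which is an assumption made for brevity; attribute and variable names are hypothetical and assumed to be disjoint.

```python
def combine(omega, omega_next):
    """omega + omega': unprimed names read the pre-state, primed names the post-state."""
    if any(name.endswith("'") for name in omega):
        raise ValueError("attribute names must not end in a prime")
    merged = dict(omega)
    merged.update({name + "'": value for name, value in omega_next.items()})
    return merged

def holds(psi, omega, omega_next, beta):
    """(omega, omega'), beta |= psi  iff  omega + omega', beta |= psi."""
    return psi({**combine(omega, omega_next), **beta})

# Hypothetical postcondition: the counter is incremented and stays within the limit.
def psi(env):
    return env["tries'"] == env["tries"] + 1 and env["tries'"] <= env["limit"]

print(holds(psi, {"tries": 1}, {"tries": 2}, {"limit": 3}))
```

Priming only the post-state attributes is what allows a transition predicate to be checked with the machinery already available for state predicates.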

The α*-reduct* of an A′-data transition (ω′, ω″) along a data signature morphism α: A → A′ is given by the A-data transition (ω′, ω″)|α = (ω′|α, ω″|α). The *transition predicate translation* F<sup>2D</sup><sub>α,X</sub> along α is given by F<sup>D</sup><sub>2α,X</sub> with 2α: 2A → 2A′ defined by 2α(a) = α(a) and 2α(a′) = α(a)′. Like for data states, reduct and translation fulfil the following *satisfaction condition*:

$$(\omega', \omega'') | \alpha, \beta \models\_{A,X}^{2\mathcal{D}} \psi \iff (\omega', \omega''), \beta \models\_{A',X}^{2\mathcal{D}} \mathcal{F}\_{\alpha,X}^{2\mathcal{D}}(\psi)\,.$$

#### **3.2 Events and Messages**

An *event signature* E consists of a finite set of events |E| and a map s(E): |E| → S(Dt)<sup>∗</sup> assigning to each e ∈ |E| its list of parameter sorts. An *event signature morphism* η : E → E′ is a function η : |E| → |E′| such that s(E)(e) = s(E′)(η(e)) for all e ∈ |E|. We write e(X) for e ∈ |E| and s(E)(e) = s<sub>1</sub>, …, s<sub>n</sub> when choosing n different *parameters* X = x<sub>1</sub>, …, x<sub>n</sub>, and also e(X) ∈ E in this case; when f = e(X), we write X(f) for X and we furthermore lift this notation to sets and lists of events. We sometimes identify the parameter list X with the S(Dt)-sorted family ({x<sub>i</sub> | s<sub>i</sub> = s})<sub>s∈S(Dt)</sub> and write s(E)(e)(x<sub>i</sub>) for s<sub>i</sub>.

A *message* e(β) over an event signature E is given by an event e(X) ∈ E with its parameters X instantiated by a parameter valuation β : X → D such that β(x) ∈ s<sup>D</sup> for s(E)(e)(x) = s; the set of all messages over an event signature E is denoted by Ê(E). When ê = e(β) ∈ Ê(E), we write β(ê) for β, and when e(X) ∈ E and β : Y → D for X ⊆ Y, we write e(β) for e(β↾X); both notations are furthermore lifted to sets and lists.

The set of *shufflings* F̂<sub>1</sub> ∥ F̂<sub>2</sub> of two message lists F̂<sub>1</sub> and F̂<sub>2</sub> is inductively given by

$$\begin{aligned} \hat{F} \parallel \varepsilon &= \{\hat{F}\} = \varepsilon \parallel \hat{F}\,, \\ (\hat{f} :: \hat{F}\_1) \parallel \hat{F}\_2 &= \{\hat{f} :: \hat{F} \mid \hat{F} \in \hat{F}\_1 \parallel \hat{F}\_2\} = \hat{F}\_1 \parallel (\hat{f} :: \hat{F}\_2)\,. \end{aligned}$$

An event signature morphism η : E → E′ is lifted to a message e(β) ∈ Ê(E) by setting Ê(η)(e(β)) = η(e)(β) ∈ Ê(E′) and also to sets and lists of messages.
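For intuition, shufflings of two finite message lists can be enumerated with a few lines of code. The sketch below reads the inductive definition as the usual interleaving of two lists, where every result starts with the head of one of the two lists and the relative order inside each list is preserved; this reading is an assumption of the sketch. Messages are represented as opaque hashable values.

```python
def shufflings(xs, ys):
    """Return the set of all interleavings of the tuples xs and ys."""
    if not xs:
        return {tuple(ys)}
    if not ys:
        return {tuple(xs)}
    # Either the head of xs or the head of ys comes first.
    head_from_xs = {(xs[0],) + rest for rest in shufflings(xs[1:], ys)}
    head_from_ys = {(ys[0],) + rest for rest in shufflings(xs, ys[1:])}
    return head_from_xs | head_from_ys

print(sorted(shufflings(("a", "b"), ("x",))))
```

For lists without common elements, the number of shufflings of lists of lengths m and n is the binomial coefficient "m + n choose m", so the enumeration is only practical for short lists.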

#### **3.3 Event/Data Signatures**

An *event/data signature* Σ consists of *input* and *output* event signatures I(Σ) and O(Σ), and a data signature A(Σ). An *event/data signature morphism* σ : Σ → Σ′ consists of an input event signature morphism I(σ): I(Σ) → I(Σ′), an output event signature morphism O(σ): O(Σ) → O(Σ′), and a data signature morphism A(σ): A(Σ) → A(Σ′). We lift the event signatures and signature morphisms to messages by writing Î(Σ) for Ê(I(Σ)), Ô(Σ) for Ê(O(Σ)), Î(σ) for Ê(I(σ)), and Ô(σ) for Ê(O(σ)).

The category of M<sup>↓</sup><sub>D</sub>*-signatures* S<sup>M<sup>↓</sup><sub>D</sub></sup> consists of the event/data signatures and signature morphisms.

#### **3.4 Event/Data Structures**

A *configuration* γ = (c, d) consists of a *control state* c from some set of control states C and a *data state* d from some set of data states D. Given a data signature A the data state of γ may be *labelled* by a map ω such that ω(d) ∈ Ω(A). For a set of configurations Γ we write C(Γ) for its set of control states.

A Σ-*event/data structure* M = (Γ, R, Γ<sub>0</sub>, ω) over an event/data signature Σ consists of a set of *configurations* Γ ⊆ C × D, a family of *transition relations* R = (R<sub>î,Ô</sub> ⊆ Γ × Γ)<sub>î∈Î(Σ),Ô∈Ô(Σ)<sup>∗</sup></sub>, and a non-empty set of *initial configurations* Γ<sub>0</sub> ⊆ Γ such that Γ is *reachable* from Γ<sub>0</sub> via R, i.e., for all γ ∈ Γ there are γ<sub>0</sub> ∈ Γ<sub>0</sub>, n ≥ 0, î<sub>1</sub>, …, î<sub>n</sub> ∈ Î(Σ), Ô<sub>1</sub>, …, Ô<sub>n</sub> ∈ Ô(Σ)<sup>∗</sup>, and (γ<sub>k</sub>, γ<sub>k+1</sub>) ∈ R<sub>î<sub>k+1</sub>,Ô<sub>k+1</sub></sub> for all 0 ≤ k < n with γ<sub>n</sub> = γ; and a *data state labelling* ω: D → Ω(A(Σ)).

We write c(M)(γ) = c and ω(M)(γ) = ω(d) for γ = (c, d) ∈ Γ, Γ(M) for Γ, C(M) for {c(M)(γ) | γ ∈ Γ(M)}, R(M) for R, Γ<sub>0</sub>(M) for Γ<sub>0</sub>, C<sub>0</sub>(M) for C(Γ<sub>0</sub>), and Ω<sub>0</sub>(M) for {ω(M)(γ<sub>0</sub>) | γ<sub>0</sub> ∈ Γ<sub>0</sub>}.

The above definition restricts structures to reachable ones only. Although an M<sup>↓</sup><sub>D</sub>-sentence will hold in an event/data structure if it is satisfied in all its initial states, the modal and hybrid operators of M<sup>↓</sup><sub>D</sub> will allow for expressing that a certain property holds in all (reachable) states of the structure.

The σ*-reduct* of a Σ′-event/data structure M′ along the event/data signature morphism σ : Σ → Σ′ is the Σ-event/data structure M′|σ such that

– Γ(M′|σ) ⊆ Γ(M′) as well as R(M′|σ) = (R(M′|σ)<sub>î,Ô</sub>)<sub>î∈Î(Σ),Ô∈Ô(Σ)<sup>∗</sup></sub> are inductively defined by Γ<sub>0</sub>(M′) ⊆ Γ(M′|σ) and, for all γ′, γ″ ∈ Γ(M′), î ∈ Î(Σ), and Ô ∈ Ô(Σ)<sup>∗</sup>, if γ′ ∈ Γ(M′|σ) and (γ′, γ″) ∈ R(M′)<sub>Î(σ)(î),Ô(σ)(Ô)</sub>, then γ″ ∈ Γ(M′|σ) and (γ′, γ″) ∈ R(M′|σ)<sub>î,Ô</sub>;

– Γ<sub>0</sub>(M′|σ) = Γ<sub>0</sub>(M′); and

– ω(M′|σ)(γ′) = (ω(M′)(γ′))|A(σ) for all γ′ ∈ Γ(M′|σ).

This σ-reduct keeps exactly those transitions that are a direct image along σ. It would also be possible to additionally keep transitions that show a super-list of the outputs that can be reached by σ. When moving to M<sup>↓</sup><sub>D</sub>-sentences, however, it turns out to be impossible to fix a particular list of outputs.

Given sets of input events J ⊆ I(Σ) and output events N ⊆ O(Σ), we denote by Γ<sup>J,N</sup>(M, γ) and Γ<sup>J,N</sup>(M), respectively, the set of configurations of a Σ-event/data structure M that are J, N-reachable from a configuration γ ∈ Γ(M) and from an initial configuration γ<sub>0</sub> ∈ Γ<sub>0</sub>(M), respectively. Here a γ<sub>n</sub> ∈ Γ(M) is J, N*-reachable in* M *from* a γ<sub>1</sub> ∈ Γ(M) if there are n ≥ 1, î<sub>2</sub>, …, î<sub>n</sub> ∈ Î(J), Ô<sub>2</sub>, …, Ô<sub>n</sub> ∈ Ô(N)<sup>∗</sup>, and (γ<sub>k</sub>, γ<sub>k+1</sub>) ∈ R(M)<sub>î<sub>k+1</sub>,Ô<sub>k+1</sub></sub> for all 1 ≤ k < n.
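On a finite structure, J, N-reachability amounts to a graph search that only follows transitions whose input event lies in J and whose output events all lie in N. The following sketch assumes an explicit list of transitions `(source, input_message, output_messages, target)`, with messages as pairs of an event name and an argument tuple; the configurations and event names are hypothetical and only loosely inspired by the ATM example.

```python
from collections import deque

def reachable(transitions, start, J, N):
    """Configurations J,N-reachable from any configuration in `start`."""
    seen = set(start)
    queue = deque(start)
    while queue:
        gamma = queue.popleft()
        for src, (in_evt, _args), outputs, tgt in transitions:
            if src != gamma or in_evt not in J:
                continue
            if any(out_evt not in N for out_evt, _out_args in outputs):
                continue
            if tgt not in seen:
                seen.add(tgt)
                queue.append(tgt)
    return seen

# Configurations are pairs (control state, number of failed attempts).
transitions = [
    (("Idle", 0), ("card", (17,)), (), ("CardEntered", 0)),
    (("CardEntered", 0), ("PIN", (1234,)), (("verify", (17, 1234)),), ("Verifying", 0)),
    (("Verifying", 0), ("reenterPIN", ()), (), ("CardEntered", 1)),
    (("Verifying", 0), ("verified", ()), (), ("Verified", 0)),
]
initial = [("Idle", 0)]

print(sorted(reachable(transitions, initial, {"card", "PIN"}, {"verify"})))
```

Since n ≥ 1 in the definition, every configuration is J, N-reachable from itself, which is why the start configurations are part of the result even for empty J and N.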

The Σ-event/data structures form the discrete category Str<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ) of M<sup>↓</sup><sub>D</sub>*-structures* over Σ. For each σ : Σ → Σ′ in S<sup>M<sup>↓</sup><sub>D</sub></sup> the σ*-reduct functor* Str<sup>M<sup>↓</sup><sub>D</sub></sup>(σ): Str<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ′) → Str<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ) is given by Str<sup>M<sup>↓</sup><sub>D</sub></sup>(σ)(M′) = M′|σ.

#### **3.5 Event/Data Formulæ and Sentences**

The Σ*-event/data formulæ* F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,S</sub> over an event/data signature Σ and a set of *state variables* S are inductively defined by

– ϕ: data state sentence ϕ ∈ F<sup>D</sup><sub>A(Σ),∅</sub> holds in the current configuration;


We write (@s)ϱ for (@<sup>I(Σ),O(Σ)</sup> s)ϱ, □ϱ for □<sup>I(Σ),O(Σ)</sup>ϱ, [i ⫽ [O]<sub>N</sub> : ψ]ϱ for ¬⟨i ⫽ [O]<sub>N</sub> : ψ⟩¬ϱ, and true for ↓s . s; we write O for [O]<sub>∅</sub>.

Two different kinds of relativisations are used in M<sup>↓</sup><sub>D</sub>-formulæ: For the jump operator (@<sup>J,N</sup> s)ϱ and the globally operator □<sup>J,N</sup>ϱ the subsets of input events J ⊆ I(Σ) and output events N ⊆ O(Σ) restrict the referable states in an M<sup>↓</sup><sub>D</sub>-structure to those that are J, N-reachable. On the other hand, [O]<sub>N</sub> specifies that besides messages from O additional messages for events in N ⊆ O(Σ) can be mixed into the output, such that, in particular, [O]<sub>∅</sub> requires exactly O. Since the set of output events is assumed to be finite, [O]<sub>N</sub> can be used to specify message lists of arbitrary length with finitely many formulæ. Moreover, the syntactic information in both kinds of relativisations is kept through a translation to another M<sup>↓</sup><sub>D</sub>-signature.

Let σ : Σ → Σ′ be an event/data signature morphism. The *event/data formulæ translation* F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>σ,S</sub> : F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,S</sub> → F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ′,S</sub> along σ is recursively given by

$$
\begin{array}{l}
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varphi) = \mathcal{F}^{\mathcal{D}}\_{A(\sigma),\emptyset}(\varphi); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(s) = s; \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}({\downarrow} s \,.\, \varrho) = {\downarrow} s \,.\, \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S \uplus \{s\}}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}((@^{J,N} s)\varrho) = (@^{I(\sigma)(J),O(\sigma)(N)} s)\,\mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\Box^{J,N} \varrho) = \Box^{I(\sigma)(J),O(\sigma)(N)}\,\mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\langle i \mathbin{/\!\!/} [O]\_{N} : \psi \rangle \varrho) = \\
\qquad \langle I(\sigma)(i) \mathbin{/\!\!/} [O(\sigma)(O)]\_{O(\sigma)(N)} : \mathcal{F}^{2\mathcal{D}}\_{A(\sigma),X(i) \cup X(O)}(\psi) \rangle\, \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\langle\!| i : \phi \mathbin{/\!\!/} [O]\_{N} : \psi |\!\rangle \varrho) = \\
\qquad \langle\!| I(\sigma)(i) : \mathcal{F}^{\mathcal{D}}\_{A(\sigma),X(i)}(\phi) \mathbin{/\!\!/} [O(\sigma)(O)]\_{O(\sigma)(N)} : \mathcal{F}^{2\mathcal{D}}\_{A(\sigma),X(i) \cup X(O)}(\psi) |\!\rangle\, \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\neg \varrho) = \neg \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho); \\
- \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho\_{1} \vee \varrho\_{2}) = \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho\_{1}) \vee \mathcal{F}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}\_{\sigma,S}(\varrho\_{2}).
\end{array}
$$

The set Sen<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ) of Σ*-event/data sentences* is given by F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,∅</sub>, the *event/data sentence translation* Sen<sup>M<sup>↓</sup><sub>D</sub></sup>(σ): Sen<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ) → Sen<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ′) by F<sup>M<sup>↓</sup><sub>D</sub></sup><sub>σ,∅</sub>.

#### **3.6 Satisfaction Relation for** M<sup>↓</sup><sub>D</sub>

Let Σ be an event/data signature, M a Σ-event/data structure, S a set of state variables, v : S → C(M) a state variable assignment, and γ ∈ Γ(M). The *satisfaction relation* for event/data formulæ is inductively given by

– M, v, γ ⊨<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,S</sub> ϕ iff ω(M)(γ), ∅ ⊨<sup>D</sup><sub>A(Σ),∅</sub> ϕ;

– M, v, γ ⊨<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,S</sub> s iff v(s) = c(M)(γ);


For a Σ ∈ |S<sup>M<sup>↓</sup><sub>D</sub></sup>|, an M ∈ |Str<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ)|, and a ρ ∈ Sen<sup>M<sup>↓</sup><sub>D</sub></sup>(Σ) the *satisfaction relation* M ⊨<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ</sub> ρ holds if, and only if, M, ∅, γ<sub>0</sub> ⊨<sup>M<sup>↓</sup><sub>D</sub></sup><sub>Σ,∅</sub> ρ for all γ<sub>0</sub> ∈ Γ<sub>0</sub>(M).

**Theorem 1.** (S<sup>M<sup>↓</sup><sub>D</sub></sup>, Str<sup>M<sup>↓</sup><sub>D</sub></sup>, Sen<sup>M<sup>↓</sup><sub>D</sub></sup>, ⊨<sup>M<sup>↓</sup><sub>D</sub></sup>) *is an institution.*

**from** *Basic/StructuredDatatypes* **get** List, Set % import finite lists and sets

**spec** Trans<sub>Σ</sub> = Dt

**then free type** InEvt ::= I(Σ)

**free type** OutEvt ::= O(Σ)

**then** List[sort OutEvt] **and** Set[sort InEvt] **and** Set[sort OutEvt]

**then sort** Ctrl

**free type** Conf ::= conf(c : Ctrl; A(Σ))

**preds** init : Conf; trans : Conf × InEvt × List[OutEvt] × Conf

· ∃g : Conf · init(g) % there is some initial configuration

**then free** { **pred** reachable : Set[InEvt] × Set[OutEvt] × Conf × Conf

∀g, g′, g″ : Conf; J : Set[InEvt]; N : Set[OutEvt]; i : InEvt; O : List[OutEvt]

· reachable(J, N, g, g)

· reachable(J, N, g, g′) ∧ i ∈ J ∧ O ⊆ N ∧ trans(g′, i, O, g″) ⇒ reachable(J, N, g, g″) }

**then preds** reachable(J : Set[InEvt], N : Set[OutEvt], g : Conf) ⇔ ∃g<sub>0</sub> : Conf · init(g<sub>0</sub>) ∧ reachable(J, N, g<sub>0</sub>, g);

reachable(g : Conf) ⇔ reachable(I(Σ), O(Σ), g)

**then pred** mixed : List[OutEvt] × Set[OutEvt] × List[OutEvt]

∀o, o′ : OutEvt; O, O′ : List[OutEvt]; N : Set[OutEvt]

· mixed(O, N, O)

· mixed(o :: O, N, o :: O′) **if** mixed(O, N, O′)

· mixed(O, N, o′ :: O′) **if** mixed(O, N, O′) ∧ o′ ∈ N

**end**

> **Figure 2.** Frame for translating M<sup>↓</sup><sub>D</sub> into Casl.

#### **3.7 A Theoroidal Comorphism from** M<sup>↓</sup><sub>D</sub> **to** Casl

We define a theoroidal comorphism from M<sup>↓</sup><sub>D</sub> to Casl. The construction mainly follows the standard translation of modal logics to first-order logic [1] and extends the scheme of [16] by outputs.

The basis is a representation of M<sup>↓</sup><sub>D</sub>-signatures and the frame given by M<sup>↓</sup><sub>D</sub>-structures as a Casl-specification as shown in Fig. 2. The signature translation

ν<sup>Sig</sup> : S<sup>M<sup>↓</sup><sub>D</sub></sup> → Pres<sup>Casl</sup>

maps an M<sup>↓</sup><sub>D</sub>-signature Σ to the Casl-theory presentation given by Trans<sub>Σ</sub> and an M<sup>↓</sup><sub>D</sub>-signature morphism to the corresponding theory presentation morphism. Trans<sub>Σ</sub> first of all covers the events according to I(Σ) and O(Σ) with types InEvt and OutEvt, and the configurations with type Conf showing a single constructor conf for the control state from Ctrl and a data state given by assignments to the attributes from A(Σ). Furthermore, Trans<sub>Σ</sub> sets the frame for describing reachable transition systems with a set of initial configurations, a transition relation, and reachability predicates, where the specification of reachable uses Casl's "structured free" construct to ensure reachability to be inductively defined. Finally, a predicate mixed is included for representing the shufflings of a list of outputs with some additional output events.
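The intended meaning of the predicate mixed can be illustrated by an executable reading of its three axioms in Fig. 2. The sketch below treats mixed as the least predicate closed under these axioms, which is an assumption of the sketch, and represents output messages as opaque values, lists as tuples, and N as a Python set.

```python
def mixed(O, N, O_mixed):
    """True if O_mixed arises from O by inserting additional elements of N."""
    if O == O_mixed:                      # axiom: mixed(O, N, O)
        return True
    if not O_mixed:
        return False
    head, tail = O_mixed[0], O_mixed[1:]
    # axiom: mixed(o :: O, N, o :: O') if mixed(O, N, O')
    if O and O[0] == head and mixed(O[1:], N, tail):
        return True
    # axiom: mixed(O, N, o' :: O') if mixed(O, N, O') and o' in N
    return head in N and mixed(O, N, tail)

print(mixed(("verify",), {"log"}, ("log", "verify", "log")))
```

With an empty set N only the original list is accepted, matching the reading that [O]<sub>∅</sub> requires exactly O.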

The model translation

$$\nu\_{\Sigma}^{\mathrm{Mod}} \colon \mathrm{Mod}^{\mathrm{Casl}}(\nu^{Sig}(\Sigma)) \to \mathrm{Str}^{\mathcal{M}^{\downarrow}\_{\mathcal{D}}}(\Sigma)$$

then can rely on this encoding. In particular, for a model M′ ∈ Mod<sup>Casl</sup>(ν<sup>Sig</sup>(Σ)), there are bijective maps ι<sub>M′,Conf</sub> : Conf<sup>M′</sup> ≅ Ctrl<sup>M′</sup> × Ω(A(Σ)) for the configurations as well as ι<sub>M′,InEvt</sub> : InEvt<sup>M′</sup> ≅ Î(Σ) and ι<sub>M′,OutEvt</sub> : OutEvt<sup>M′</sup> ≅ Ô(Σ) for the messages. Moreover, mixed<sup>M′</sup>(ι<sup>−1</sup><sub>M′,OutEvt</sub>(Ô), ι<sup>−1</sup><sub>M′,OutEvt</sub>(N), ι<sup>−1</sup><sub>M′,OutEvt</sub>(Ô′)) if, and only if, Ô′ ∈ Ô ∥ N̂′ with N̂′ ∈ N<sup>∗</sup>. The M<sup>↓</sup><sub>D</sub>-structure resulting from a Casl-model M′ of Trans<sub>Σ</sub> can thus be defined by

$$\begin{array}{l} -\varGamma(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) = \iota\_{M',\mathrm{Conf}}^{-1}(\{g^{M'} \in \mathrm{Conf}^{M'} \mid \mathrm{reachable}^{M'}(g^{M'})\}) \\ -R(\nu\_{\Sigma}^{\mathrm{Mod}}(M'))\_{\hat{\imath},\hat{O}} = \{ (\gamma,\gamma') \in \varGamma(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) \times \varGamma(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) \mid \\ \qquad \mathrm{trans}^{M'}(\iota\_{M',\mathrm{Conf}}(\gamma),\iota\_{M',\mathrm{InEvt}}^{-1}(\hat{\imath}),\iota\_{M',\mathrm{OutEvt}}^{-1}(\hat{O}),\iota\_{M',\mathrm{Conf}}(\gamma')) \} \\ -\varGamma\_0(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) = \{ \gamma \in \varGamma(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) \mid \mathrm{init}^{M'}(\iota\_{M',\mathrm{Conf}}(\gamma)) \} \\ -\omega(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) = \{ (c,\omega) \in \varGamma(\nu\_{\Sigma}^{\mathrm{Mod}}(M')) \mapsto \omega \} \end{array}$$

For M<sup>↓</sup><sub>D</sub>-sentences, we first define a formula translation

$$\nu\_{\Sigma,S,g}^{\mathcal{F}} \colon \mathcal{F}\_{\Sigma,S}^{\mathcal{M}\_{\mathcal{D}}^{\downarrow}} \to \mathcal{F}\_{\nu^{S}(\Sigma),S \cup \{g\}}^{\mathrm{Casl}}$$

which, mimicking the standard translation, takes a variable g : Conf as a parameter that records the "current configuration" and also uses a set S of state names for the control states. The translation embeds the data state and 2-data state formulæ using the substitution A(Σ)(g) = {a ↦ a(g) | a ∈ A(Σ)} for replacing the attributes a ∈ A(Σ) by the accessors a(g). The translation of M<sup>↓</sup><sub>D</sub>-formulæ then reads

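$$
\begin{array}{l}
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\varphi) = \mathcal{F}^{\mathrm{Casl}}\_{\nu^{S}(\Sigma),A(\Sigma)(g)}(\varphi) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(s) = (s = c(g)) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}({\downarrow} s \,.\, \varrho) = \exists s : \mathrm{Ctrl} \,.\, s = c(g) \land \nu^{\mathcal{F}}\_{\Sigma,S \uplus \{s\},g}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}((@^{J,N} s)\varrho) = \forall g' : \mathrm{Conf} \,.\, (c(g') = s \land \mathrm{reachable}(J, N, g')) \Rightarrow \nu^{\mathcal{F}}\_{\Sigma,S,g'}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\Box^{J,N} \varrho) = \forall g' : \mathrm{Conf} \,.\, \mathrm{reachable}(J, N, g, g') \Rightarrow \nu^{\mathcal{F}}\_{\Sigma,S,g'}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\langle i \mathbin{/\!\!/} [O]\_{N} : \psi \rangle \varrho) = \exists X : s(I(\Sigma))(i);\ X' : s(O(\Sigma))(O);\ O' : \mathrm{List}[\mathrm{OutEvt}];\ g' : \mathrm{Conf} \,.\, \\
\qquad \mathrm{mixed}(O(X'), N, O') \land \mathrm{trans}(g, i(X), O', g') \land \mathcal{F}^{\mathrm{Casl}}\_{\nu^{S}(\Sigma),A(\Sigma)(g) \cup A(\Sigma)(g') \cup 1\_{X \cup X'}}(\psi) \land \nu^{\mathcal{F}}\_{\Sigma,S,g'}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\langle\!| i : \phi \mathbin{/\!\!/} [O]\_{N} : \psi |\!\rangle \varrho) = \forall X : s(I(\Sigma))(i) \,.\, \mathcal{F}^{\mathrm{Casl}}\_{\nu^{S}(\Sigma),A(\Sigma)(g) \cup 1\_{X}}(\phi) \Rightarrow \\
\qquad \exists X' : s(O(\Sigma))(O);\ O' : \mathrm{List}[\mathrm{OutEvt}];\ g' : \mathrm{Conf} \,.\, \\
\qquad \mathrm{mixed}(O(X'), N, O') \land \mathrm{trans}(g, i(X), O', g') \land \mathcal{F}^{\mathrm{Casl}}\_{\nu^{S}(\Sigma),A(\Sigma)(g) \cup A(\Sigma)(g') \cup 1\_{X \cup X'}}(\psi) \land \nu^{\mathcal{F}}\_{\Sigma,S,g'}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\neg \varrho) = \neg \nu^{\mathcal{F}}\_{\Sigma,S,g}(\varrho) \\
- \nu^{\mathcal{F}}\_{\Sigma,S,g}(\varrho\_{1} \vee \varrho\_{2}) = \nu^{\mathcal{F}}\_{\Sigma,S,g}(\varrho\_{1}) \vee \nu^{\mathcal{F}}\_{\Sigma,S,g}(\varrho\_{2})
\end{array}
$$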

Building on the translation of formulæ, the sentence translation

$$\nu\_{\Sigma}^{\mathrm{Sen}} \colon \mathrm{Sen}^{\mathcal{M}\_{\mathcal{D}}^{\downarrow}}(\Sigma) \to \mathrm{Sen}^{\mathrm{Casl}}(\nu^{S}(\Sigma))$$

only has to require additionally that evaluation starts in an initial state:

– ν<sup>Sen</sup><sub>Σ</sub>(ρ) = ∀g : Conf . init(g) ⇒ ν<sup>F</sup><sub>Σ,∅,g</sub>(ρ)
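The shape of this translation can be sketched for a small fragment of the logic. The sketch below covers only data state sentences, state variables, the binder, the globally operator, negation, and disjunction; formulæ are nested tuples, data state sentences are given as Python functions from the name of the current configuration variable to a formula string, and the textual output syntax is an ad hoc ASCII rendering rather than exact Casl. All of these representation choices are assumptions of the sketch.

```python
from itertools import count

def translate(rho, g, fresh):
    """Translate a formula relative to the configuration variable named g."""
    tag = rho[0]
    if tag == "data":    # rho[1] renders the sentence with attributes a as a(g)
        return rho[1](g)
    if tag == "state":   # state variable s
        return f"{rho[1]} = c({g})"
    if tag == "bind":    # binder: bind s to the current control state
        _, s, body = rho
        return f"(exists {s}: Ctrl . {s} = c({g}) /\\ {translate(body, g, fresh)})"
    if tag == "box":     # globally operator relativised to J and N
        _, J, N, body = rho
        g2 = f"g{next(fresh)}"
        return (f"(forall {g2}: Conf . reachable({J}, {N}, {g}, {g2}) "
                f"=> {translate(body, g2, fresh)})")
    if tag == "not":
        return f"not {translate(rho[1], g, fresh)}"
    if tag == "or":
        return f"({translate(rho[1], g, fresh)} \\/ {translate(rho[2], g, fresh)})"
    raise ValueError(f"unsupported formula: {tag}")

def translate_sentence(rho):
    """Sentence translation: evaluation starts in an initial configuration."""
    return f"forall g: Conf . init(g) => {translate(rho, 'g', count(1))}"

# Hypothetical invariant: the attribute tries never exceeds 3.
invariant = ("box", "I", "O", ("data", lambda g: f"tries({g}) <= 3"))
print(translate_sentence(invariant))
```

Threading the configuration variable through the recursion is what mimics the standard translation: each modality introduces a fresh configuration variable, and the body is translated relative to it.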

**Theorem 2.** (ν<sup>Sig</sup>, ν<sup>Mod</sup>, ν<sup>Sen</sup>) *is a theoroidal comorphism from* M<sup>↓</sup><sub>D</sub> *to* Casl*.*

For a Casl-proof of an M<sup>↓</sup><sub>D</sub>-*invariant* □ϕ such that ϕ has to hold in every reachable configuration, the full generality of the reachable predicate can sometimes be avoided by replacing the proof obligation ∀g : Conf . reachable(g) ⇒ F<sup>Casl</sup><sub>ν<sup>S</sup>(Σ),A(Σ)(g)</sub>(ϕ) by the usual stepwise induction scheme, which only requires demonstrating that the invariant holds in all initial configurations and that it is preserved by every transition. Moreover, the M<sup>↓</sup><sub>D</sub>-state formula ϕ can be generalised into a Casl-invariant.
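The stepwise induction scheme, and the reason why an invariant may have to be generalised, can be illustrated on a finite toy system. In the sketch below, configurations are plain integers and transitions are given as explicit pairs; both are assumptions made for illustration. The step check ranges over all listed transitions, including those starting in unreachable configurations, just as the Casl proof obligation quantifies over all configurations.

```python
def is_inductive(inv, initial, transitions):
    """Check the two obligations of the stepwise induction scheme."""
    base = all(inv(g) for g in initial)
    step = all(inv(tgt) for src, tgt in transitions if inv(src))
    return base and step

initial = [0]
# The transition from 5 to 7 starts in a configuration unreachable from 0.
transitions = [(0, 2), (2, 4), (4, 6), (5, 7)]

def never_seven(n):
    return n != 7

def even_and_bounded(n):
    return n % 2 == 0 and n <= 6

print(is_inductive(never_seven, initial, transitions))
print(is_inductive(even_and_bounded, initial, transitions))
```

The property `never_seven` holds in every reachable configuration but is not preserved by the transition from 5; the stronger `even_and_bounded` is inductive and implies it, which corresponds to generalising the state formula into a stronger invariant.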

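**Proposition 1.** *Let* $(\Sigma, P)$ *be a theory presentation in* $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ *and* $(\nu^{S}(\Sigma), \Phi)$ *a theory presentation in* Casl *such that* $\mathrm{Mod}^{\mathrm{Casl}}(\nu^{\mathrm{Pres}}(\Sigma, P)) \subseteq \mathrm{Mod}^{\mathrm{Casl}}(\nu^{S}(\Sigma), \Phi)$*. Let* $\mathrm{inv}_{\mathrm{Casl}}(g) \in \mathcal{F}^{\mathrm{Casl}}_{\nu^{S}(\Sigma),\{g\}}$ *be a* Casl*-formula with a single free variable* $g$ *and* $\mathrm{inv}_{\mathcal{M}^{\downarrow}_{\mathcal{D}}} \in \mathcal{F}^{\mathcal{D}}_{A(\Sigma),\emptyset}$ *an* $\mathcal{M}^{\downarrow}_{\mathcal{D}}$*-state formula, such that*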


*hold in every model* $M' \in \mathrm{Mod}^{\mathrm{Casl}}(\nu^{S}(\Sigma), \Phi)$*. Then* $\nu^{\mathrm{Mod}}_{\Sigma}(M') \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma} \Box\,\mathrm{inv}_{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ *for all models* $M' \in \mathrm{Mod}^{\mathrm{Casl}}(\nu^{\mathrm{Pres}}(\Sigma, P))$*.*

#### **4 Simple UML State Machines with Outputs**

UML state machines [13, Ch. 14] provide means to specify the reactive behaviour of objects or component instances. These entities hold an internal data state, typically given by a set of attributes or properties as specified in a static structure, and shall react to event occurrences like incoming messages by firing different transitions in different control states. Such transitions may have a guard depending on event arguments and the internal state and may change, as an effect, the internal control and data state of the entity as well as send out messages on their own. Beyond such "simple" means for specifying reactive entities, UML state machines offer also more advanced modelling constructs, like hierarchical states or compound transitions, which, however, we defer to future work.

In our formal account, extending again [16], a *simple UML state machine with outputs* $U$ uses an event/data signature $\Sigma(U)$ for its input and output events as well as its attributes and consists of a finite set of *control states* $C(U)$; a finite set of *transition specifications* $T(U)$ of the form $(c, \varphi, i(X), o_1(X_1), \dots, o_m(X_m), \psi, c')$ with


– *postcondition* transition predicate $\psi \in \mathcal{F}^{2\mathcal{D}}_{A(\Sigma(U)),\,X \cup \bigcup_{1 \le k \le m} X_k}$;

an *initial control state* $c_0(U) \in C(U)$; and an *initial state predicate* $\phi_0(U) \in \mathcal{F}^{\mathcal{D}}_{A(\Sigma(U)),\emptyset}$, such that $C(U)$ is *syntactically reachable*, i.e., for every $c \in C(U) \setminus \{c_0(U)\}$ there are $(c_0(U), \varphi_1, i_1, O_1, \psi_1, c_1), \dots, (c_{n-1}, \varphi_n, i_n, O_n, \psi_n, c_n) \in T(U)$ with $n > 0$ and $c_n = c$. The constraint of syntactic reachability is only introduced to simplify semantic and algorithmic constructions on simple UML state machines with output.
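Syntactic reachability only depends on the source and target control states of the transition specifications, so it can be checked by a plain graph search. The sketch below assumes that the specifications have already been projected to `(source, target)` pairs; the function name and the state names in the example are hypothetical.

```python
from collections import deque


def syntactically_reachable(states, transitions, c0):
    """Return True if every control state is reachable from c0.

    `transitions` is an iterable of (source, target) pairs, i.e. the
    transition specifications with guards, events and outputs dropped.
    """
    successors = {}
    for source, target in transitions:
        successors.setdefault(source, set()).add(target)

    seen = {c0}
    todo = deque([c0])
    while todo:  # breadth-first search from the initial control state
        current = todo.popleft()
        for nxt in successors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return set(states) <= seen


# Hypothetical three-state machine.
edges = [("Idle", "CardEntered"), ("CardEntered", "PINEntered"), ("PINEntered", "Idle")]
print(syntactically_reachable({"Idle", "CardEntered", "PINEntered"}, edges, "Idle"))
print(syntactically_reachable({"Idle", "CardEntered", "PINEntered", "Orphan"}, edges, "Idle"))
```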

A $\Sigma(U)$-event/data structure $M$ is a *model* of a simple UML state machine $U$ with output, $M \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$, if $C(U) \subseteq C(M)$ up to a bijective renaming, $C_0(M) = \{c_0(U)\}$, $\Omega_0(M) \subseteq \{\omega \in |\Omega(A(\Sigma(U)))| \mid \omega \models^{\mathcal{D}}_{A(\Sigma(U)),\emptyset} \phi_0(U)\}$, and if the following holds for all $(c, d) \in \Gamma(M)$:


The last requirement that all transitions in a model are due to transition specifications does not cover the requirement of *input enabledness* for UML state machines: An event for which currently no transition can fire is discarded. This behaviour can be added by a syntactical transformation extending the set of transition specifications by self-loops with empty outputs for all situations where some event is not accepted.
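The syntactical transformation for input enabledness can be sketched as follows. This is a simplified illustration, not the construction of the paper: guards are modelled as Python predicates over an attribute valuation, event arguments are ignored, and the postcondition of the added self-loops (attributes remain unchanged) is left implicit. All names are hypothetical.

```python
def complete_with_self_loops(states, events, specs):
    """Add a discarding self-loop for every (state, event) pair.

    `specs` is a list of tuples (source, guard, event, outputs, target),
    where `guard` maps an attribute valuation to a bool. The guard of an
    added self-loop is the negation of the disjunction of all guards of
    existing specifications for the same source state and event.
    """
    completed = list(specs)
    for state in states:
        for event in events:
            guards = [g for (src, g, ev, _, _) in specs if src == state and ev == event]

            def discard(attrs, guards=guards):
                # Enabled exactly when no original transition can fire.
                return not any(g(attrs) for g in guards)

            completed.append((state, discard, event, [], state))
    return completed


# Hypothetical machine with one guarded transition.
specs = [("Idle", lambda attrs: attrs["x"] > 0, "go", ["done"], "Busy")]
completed = complete_with_self_loops(["Idle", "Busy"], ["go"], specs)

idle_loop, busy_loop = completed[1], completed[2]
print(idle_loop[1]({"x": 0}), idle_loop[1]({"x": 1}), busy_loop[1]({"x": 1}))
```

In state `Idle` the event `go` is discarded only when the original guard fails, whereas in `Busy`, which has no transition for `go`, it is always discarded.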

In UML, completion events are produced whenever a state completes its internal behaviour and such events have always to be prioritised in event processing; the reaction to a completion event is indicated by a transition without a triggering event. For the simple machines with output described here, where states do not show internal behaviour, the only use of completion events is to let a machine make progress autonomously without external input. For using this feature, the machine's event/data signature has to be extended by such events and the transition specifications have to take completions into account. Still, the prioritisation cannot be covered by a single state machine alone, as it has no event processing discipline of its own.

Extending the characterisation algorithm in [16] with outputs, it can be shown that $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is expressive enough to capture the model class of a simple UML state machine with output $U$ by a single sentence $\varrho_U$ such that $M \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$ if, and only if, $M \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma(U)} \varrho_U$. The simplest case is a single transition specification $(c, \varphi, i, O, \psi, c')$: By requiring $(@c)\langle| i : \varphi \multimap O : \psi |\rangle c'$ it can be ensured that a model indeed shows a transition from control state $c$ to the control state $c'$ for the input event $i$ with precondition $\varphi$ satisfied which outputs $O$ with $\psi$ satisfied. For requiring that such a transition for input $i$ and output $O$ is only offered when the precondition $\varphi$ and the transition condition $\psi$ hold, a formula $(@c)[i \multimap O : \neg\varphi \lor \neg\psi]\mathrm{false}$ has to be added. For ensuring that no other output than $O$ can be produced, on the one hand $(@c)[i \multimap O' : \mathrm{true}]\mathrm{false}$ for every $O' \neq O$ that is at most the length of $O$ has to be added and on the other hand $(@c)[i \multimap [O']_{O(\Sigma)} : \mathrm{true}]\mathrm{false}$ for every $O'$ with length one more than $O$.

Reasoning over a simple UML state machine with output $U$ in Casl via the translation of $U$'s characterising sentence along the theoroidal comorphism of Thm. 2 will involve some not fully transpicuous axioms due to the necessary exclusion of some behaviour using formulæ like $(@c)[i \multimap [O']_{O(\Sigma)} : \mathrm{true}]\mathrm{false}$. It is therefore sometimes advantageous to directly use the requirements for $M$ being a model of $U$ to obtain another characterisation of the trans predicate in the Casl presentation for the comorphism, which then can be favourably combined with Prop. 1 for proving invariants:

**Proposition 2.** *Let* $U$ *be a simple UML state machine with output and let* $M' \in \mathrm{Mod}^{\mathrm{Casl}}(\nu^{\mathrm{Sig}}(\Sigma(U)))$ *such that* $\nu^{\mathrm{Mod}}_{\Sigma(U)}(M') \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$*. Then*

$$\begin{array}{l}
M' \models^{\mathrm{Casl}}_{\nu^{S}(\Sigma)} \forall g : \mathrm{Conf}\,.\,\mathrm{reachable}(g) \Rightarrow \\
\quad \big(\forall g' : \mathrm{Conf};\ i_{*} : \mathrm{InEvt};\ O_{*} : \mathrm{List}[\mathrm{OutEvt}]\,.\,\mathrm{trans}(g, i_{*}, O_{*}, g') \Leftrightarrow \\
\qquad \bigvee_{(c,\varphi,i,O,\psi,c') \in T(U)} \exists X : \overline{s}(I(\Sigma))(i);\ X' : \overline{s}(O(\Sigma))(O)\,.\, \\
\qquad\quad \big(c(g) = c \land \mathcal{F}^{\mathrm{Casl}}_{\nu^{S}(\Sigma),A(\Sigma)(g)\cup 1_{X}}(\varphi) \land i_{*} = i(X) \land O_{*} = O(X') \land {} \\
\qquad\qquad \mathcal{F}^{\mathrm{Casl}}_{\nu^{S}(\Sigma),A(\Sigma)(g)\cup A(\Sigma)(g')\cup 1_{X\cup X'}}(\psi) \land c(g') = c'\big)\big)\,.
\end{array}$$

#### **5 Simple UML Composite Structures**

A UML composite structure [13, Ch. 11] specifies the internal structure of a class or component and its collaborations. For our purposes, a composite structure is given by class or component instances, its so-called *parts*, that can communicate through their attached *ports* specifying provided and required interfaces and being linked by *connectors*. All connectors are assumed to be binary and each part to be equipped with a state machine for describing its behaviour.

A *composite structure signature* $\Delta$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ consists of a set $\mathrm{Cmp}(\Delta)$ of *parts* $c$ each equipped with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature $\Sigma(\Delta, c)$ for its input and output events and internal attributes; a set $\mathrm{Prt}(\Delta)$ of *ports* $p$ each showing a part $\mathrm{cmp}(\Delta)(p) \in \mathrm{Cmp}(\Delta)$ as well as an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature $\Sigma(\Delta, p)$ without attributes (i.e., $A(\Sigma(\Delta, p)) = \emptyset$) for its *provided* (input) and *required* (output) events; and a symmetric binary relation $\mathrm{Con}(\Delta) \subseteq \mathrm{Prt}(\Delta) \times \mathrm{Prt}(\Delta)$ of *connectors* such that


We say that port $p \in \mathrm{Prt}(\Delta)$ is *open* in $\Delta$ if there is no $p' \in \mathrm{Prt}(\Delta)$ such that $(p, p') \in \mathrm{Con}(\Delta)$; otherwise $p$ is *connected*.

A *composite structure signature morphism* $\delta : \Delta \to \Delta'$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ consists of a function $\mathrm{Cmp}(\delta) : \mathrm{Cmp}(\Delta) \to \mathrm{Cmp}(\Delta')$ mapping parts, together with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature morphism $\Sigma(\delta, c) : \Sigma(\Delta, c) \to \Sigma(\Delta', \mathrm{Cmp}(\delta)(c))$ for each $c \in \mathrm{Cmp}(\Delta)$; a function $\mathrm{Prt}(\delta) : \mathrm{Prt}(\Delta) \to \mathrm{Prt}(\Delta')$ mapping ports, together with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature morphism $\mathrm{Prt}(\delta)(p) : \Sigma(\Delta, p) \to \Sigma(\Delta', \mathrm{Prt}(\delta)(p))$, preserving

– the part owning each port $p$, i.e., $\mathrm{Cmp}(\delta)(\mathrm{cmp}(\Delta)(p)) = \mathrm{cmp}(\Delta')(\mathrm{Prt}(\delta)(p))$;

– the connections, i.e., if $(p, p') \in \mathrm{Con}(\Delta)$, then $(\mathrm{Prt}(\delta)(p), \mathrm{Prt}(\delta)(p')) \in \mathrm{Con}(\Delta')$.

The category of $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$*-signatures* $\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}$ consists of the composite structure signatures and signature morphisms over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$.

For a $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$-signature $\Delta$, a $\Delta$*-composite structure structure* (sic!) over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is a family $\mathcal{C} = (\mathcal{C}(c) \in |\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma(\Delta, c))|)_{c \in \mathrm{Cmp}(\Delta)}$ consisting of an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure for each part $c$. The $\delta$*-reduct* $\mathcal{C}'|\delta$ of a $\Delta'$-composite structure structure $\mathcal{C}'$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ along a composite structure signature morphism $\delta : \Delta \to \Delta'$ is computed component-wise as $(\mathcal{C}'(\mathrm{Cmp}(\delta)(c))|\Sigma(\delta, c))_{c \in \mathrm{Cmp}(\Delta)}$. The $\Delta$-composite structure structures form the discrete category $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta)$ of $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$*-structures* over $\Delta$. For each signature morphism $\delta : \Delta \to \Delta'$ in $\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}$ the $\delta$*-reduct functor* $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\delta) : \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta') \to \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta)$ is given by $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\delta)(\mathcal{C}') = \mathcal{C}'|\delta$.

In UML, state machines organised in a composite structure communicate with each other by sending messages which are stored in event pools. A state machine draws a message from its event pool, which is typically implemented as an event queue, and reacts to this message by firing one of its enabled transitions or by discarding it when no transition is enabled. This communication scheme is obtained for a $\Delta$-composite structure structure $\mathcal{C}$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ by constructing an overall $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure over an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature that reflects the parts, the ports, and the connections in its events and attributes, but includes explicit event queues as additional attributes. The overall $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure over this queue-based $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature then implements the selection of an event from a part's event queue, the reactions of this part to this event, and the distribution of the produced messages to the connected parts.

Formally, we construct a functor $\Sigma_{\mathrm{q}} : \mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} \to \mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ on signatures that assigns to a composite structure signature $\Delta$ the *queue-based* event/data signature $\Sigma_{\mathrm{q}}(\Delta) = \bigcup_{c \in \mathrm{Cmp}(\Delta)} (\Sigma(\Delta, c) \cup \{\mathrm{q}_c : \hat{I}(\Sigma(\Delta, c))^{*}\})$ and to a composite structure signature morphism the canonically corresponding event/data signature morphism. For a composite structure signature $\Delta$ and a part $c \in \mathrm{Cmp}(\Delta)$ there is a natural signature embedding $\eta^{\mathrm{q}}_{\Delta,c} : \Sigma(\Delta, c) \to \Sigma_{\mathrm{q}}(\Delta)$.

For a $\Delta$-composite structure structure $\mathcal{C}$ we construct an overall $\Sigma_{\mathrm{q}}(\Delta)$-event/data structure $M_{\mathcal{C}}$ as follows: An overall configuration of $M_{\mathcal{C}}$ consists, for each part $c \in \mathrm{Cmp}(\Delta)$, of an *event queue* $q(c) \in \hat{I}(\Sigma(\Delta, c))^{*}$ stored in the attribute $\mathrm{q}_c$ and a part configuration $\gamma(c) \in \Gamma(\mathcal{C}(c))$; initially, all parts are in some of their initial configurations and all event queues are empty. In an overall configuration $(q(c), \gamma(c))_{c \in \mathrm{Cmp}(\Delta)}$ an overall transition to another overall configuration $(q'(c), \gamma'(c))_{c \in \mathrm{Cmp}(\Delta)}$ *reacts* to some $\hat{\imath} \in \hat{I}(\Sigma_{\mathrm{q}}(\Delta))$ and *outputs* some $\hat{O} \in \hat{O}(\Sigma_{\mathrm{q}}(\Delta))^{*}$. This $\hat{\imath}$ can either instantiate some provided event $i \in I(\Sigma(\Delta, p_{*}))$ of some of the open ports $p_{*} \in \mathrm{Prt}(\Delta)$ with $c_{*} = \mathrm{cmp}(\Delta)(p_{*})$, or it is the head of the event queue of some $c_{*} \in \mathrm{Cmp}(\Delta)$ such that $i \in I(\Sigma(\Delta, c_{*}))$. In the latter case, $\hat{\imath}$ is removed from the event queue of $c_{*}$. In both cases, the reaction of part $c_{*}$ is any transition $(\gamma(c_{*}), \gamma'_{*}) \in R(\mathcal{C}(c_{*}))_{\hat{\imath},\hat{O}}$ and overall $\gamma' = \gamma\{c_{*} \mapsto \gamma'_{*}\}$. Finally, all outputs $p.\hat{o} \in \hat{O}$ such that $(p, p') \in \mathrm{Con}(\Delta)$ and $\mathrm{cmp}(\Delta)(p') = c'$ are appended to the respective event queue of part $c'$. This defines a natural transformation $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}_{\mathrm{q}} : \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} \mathrel{\dot{\to}} \Sigma_{\mathrm{q}}; \mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ with $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}_{\mathrm{q},\Delta}(\mathcal{C}) = M_{\mathcal{C}}$.
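One overall transition of the queue-based structure can be sketched operationally as follows. This is a simplified illustration under several assumptions: part behaviour is given as a deterministic function from a configuration and an event to a new configuration and a list of `(port, message)` outputs, messages are not renamed when crossing a connector, and events without a matching transition are discarded. The two toy parts are hypothetical and are not the ATM and Bank machines of the example.

```python
from collections import deque


def composite_step(queues, confs, react, connectors, owner, part, external=None):
    """Perform one overall transition in which `part` reacts to one event.

    The event is either supplied from outside (an open port) or taken
    from the head of the part's own event queue. Outputs sent over a
    connected port are appended to the queue of the part owning the peer
    port; outputs on open ports are only returned to the caller.
    """
    if external is not None:
        event = external
    else:
        event = queues[part].popleft()
    confs[part], outputs = react[part](confs[part], event)
    for port, message in outputs:
        peer = connectors.get(port)
        if peer is not None:
            queues[owner[peer]].append(message)
    return outputs


def atm(conf, event):
    if conf == "CardEntered" and event == "PIN":
        return "PINEntered", [("atm.bank", "verify")]
    return conf, []  # discard the event


def bank(conf, event):
    if conf == "Idle" and event == "verify":
        return "Verifying", [("bank.atm", "verified")]
    return conf, []  # discard the event


queues = {"atm": deque(), "bank": deque()}
confs = {"atm": "CardEntered", "bank": "Idle"}
react = {"atm": atm, "bank": bank}
connectors = {"atm.bank": "bank.atm", "bank.atm": "atm.bank"}  # symmetric relation
owner = {"atm.bank": "atm", "bank.atm": "bank"}

composite_step(queues, confs, react, connectors, owner, "atm", external="PIN")
composite_step(queues, confs, react, connectors, owner, "bank")
print(confs, list(queues["atm"]), list(queues["bank"]))
```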

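**Theorem 3.** $(\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \mathrm{Sen}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \models^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})})$ *with* $\mathrm{Sen}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} = \Sigma_{\mathrm{q}}; \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ *and* $\mathcal{C} \models^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}_{\Delta} \varrho$ *if, and only if,* $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}_{\mathrm{q},\Delta}(\mathcal{C}) \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma_{\mathrm{q}}(\Delta)} \varrho$ *is an institution.*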

$\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$ inherits the event/data formulæ of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ and the underlying $\mathcal{D}$, though extended by queue attributes. In particular, we have for a part $c \in \mathrm{Cmp}(\Delta)$ that a transition sentence $\langle| i : \varphi \multimap O : \psi |\rangle\varrho$ (in the current configuration there are valuations and a transition for the incoming message and the outgoing messages such that these valuations satisfy transition formula $\psi$ and $\varrho$ holds afterwards) locally formulated for this part can be faithfully transferred to the global composite structure, abbreviating the embedding $\eta^{\mathrm{q}}_{\Delta,c}$ to $\eta$,

$$\begin{array}{l}
\langle|\, \eta(i) : \mathcal{F}^{\mathcal{D}}_{A(\eta),X(i)}(\varphi) \land (\mathrm{hd}(\mathrm{q}_c) = I(\eta)(i) \lor \mathrm{open}_{\Delta,c}(I(\eta)(i))) \multimap {} \\
\quad O(\eta)(O) : \mathcal{F}^{2\mathcal{D}}_{A(\eta),X(i)\cup X(O)}(\psi) \land {} \\
\quad \bigwedge_{a \in A(\Sigma_{\mathrm{q}}(\Delta)) \setminus (A(\Sigma(\Delta,c)) \cup \{\mathrm{q}_c \mid c \in \mathrm{Cmp}(\Delta)\})} a = a' \land {} \\
\quad \mathrm{dist}_{\Delta,c}(I(\eta)(i), O(\eta)(O), (\mathrm{q}_c, \mathrm{q}'_c)_{c \in \mathrm{Cmp}(\Delta)}) \,|\rangle\, \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\eta)(\varrho)\ ,
\end{array}$$

where hd yields the head of a queue, open checks whether the part's port for the event is open, the frame condition $a = a'$ ranges over all attributes not pertaining to $c$ or the queues, and dist removes the input and distributes the outputs to the queues.

#### **6 Verification Example: Communication between** User**,** ATM **and** Bank

We applied<sup>4</sup> the technique set out in this paper to the example from the introduction concerning a typical interaction between a User, an ATM component and a Bank component.

We formalised the state machines for the Bank and the ATM as well as their communication in Casl. We then set out to show a safety property (by means of a stronger invariant) on this system by inductive verification, as justified by Prop. 1. We first tried to show the preservation of said invariant using fully automatic provers connected to Hets [10], the main tool suite for verification based on Casl and institution theory. However, no inductive automated provers are currently connected to Hets. Therefore, handling freely generated datatypes would require manual intervention to add suitable induction schemes, defeating our goal of automation. Instead we utilised the interactive theorem prover KIV [2]. This prover supports algebraic specifications similar to Casl and offers extensive heuristics for inductive proofs. KIV's heuristics fully automatically discharged all proof obligations in our experiments. The translation of the Casl specifications into KIV is straightforward.

<sup>4</sup> Full specifications and proofs accessible at: https://rosento.github.io/2021-paper-composite/

With our process clarified, we can now state the safety property we will prove:

safe-def: safe(g) ↔ (ctrl(caConf(g)) = Verified → wasVerified(cbConf(g)) = 1); used for: s, ls;

The above introduces an axiom safe-def defining the predicate safe and marks the axiom for use as a simplifier rule (s) and a local simplifier rule (ls) for the KIV system.

The predicate safe ranges over a type of system configurations, each consisting of the ATM configuration (caConf) and queue, as well as the bank configuration (cbConf) and queue. The machine configurations in turn consist of the control state and attributes. The safety predicate holds in a configuration *iff*, whenever the ATM is in control state Verified, the bank attribute wasVerified has the value 1.

The behaviours of Bank and ATM are defined in the form of an initial state predicate and a transition predicate. For space reasons we show only one transition:

```
atmTrans-def: atmTrans(atmConf(sa1, c1, p1, t1), in, out, atmConf(sa2, c2, p2, t2))
  ↔ ∃ c : CardId, p : Pin . . . .
        ∨ ( sa1 = CardEntered
           ∧ in = msg(userCom, PIN(p)) ∧ out = (msg(atmCompl, PINEnteredCompl) +l [])
           ∧ p2 = p ∧ sa2 = PINEntered ∧ c2 = c1 ∧ t2 = t1)
        ∨ . . .; used for: s, ls;
```
The ATM transitions from one configuration to another, receiving an input event and sending out a list of messages. Each ATM configuration consists of (in that order) the control state, the card id to be verified, the PIN to be verified and the counter for the number of verification attempts. We give the definition of the transition predicate by a disjunction of the conditions of all syntactic transitions, including the control state before, the input event, the output list, variables to be set, the control state after and variables to remain unchanged. Given these machine predicates and a predicate dist to encode connectors, we can then define the transition predicate for the overall system:

```
trans-def: trans(conf(ca1, qa1, cb1, qb1), in, out, conf(ca2, qa2, cb2, qb2))
  ↔ dist(out, qa1, qa2, qb1, qb2)
     ∧ ( (atmTrans(ca1, in, out, ca2) ∧ cb2 = cb1 )
        ∨ (bankTrans(cb1, in, out, cb2) ∧ ca2 = ca1)); used for: s, ls;
```
Initially, the queues are empty and the machines are in their initial configurations.

Having thus defined the machines, we turn to verification and define an invariant strong enough to show both its own preservation and our safety property. The idea is to control the queues' status that allows us to enter the Verified state on the ATM or to reset the wasVerified attribute. In essence the invariant can be syntactically read off from the composite structure.

```
invar-def: invar(conf(ca, qa, cb, qb)) ↔ ∃ x.
    (ctrl(ca) = Idle ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = empty)
  ∨ (ctrl(ca) = CardEntered ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = empty)
  ∨ (ctrl(ca) = PINEntered ∧ ctrl(cb) = Idle ∧ qa = enq(x, empty) ∧ qb = empty)
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = enq(x, empty))
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Verifying ∧ qa = empty ∧ qb = enq(x, empty))
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = VeriSuccess ∧
               qa = empty ∧ qb = enq(x, empty) ∧ wasVerified(cb) = 1)
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = VeriFail ∧ qa = empty ∧ qb = enq(x, empty))
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧
               qa = enq(msg(bankCom, reenterPIN), empty) ∧ qb = empty)
  ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧
               qa = enq(msg(bankCom, verified), empty) ∧ qb = empty ∧ wasVerified(cb) = 1)
  ∨ (ctrl(ca) = Verified ∧ ctrl(cb) = Idle ∧
               qa = enq(x, empty) ∧ qb = empty ∧ wasVerified(cb) = 1); used for: s, ls;
```
Note that we can mostly ignore attribute values, as well as all distinctions between queue elements unrelated to our verification task. We can then formulate lemmas to the effect that this invariant does in fact imply the safety property, that it is satisfied in any legal initial configurations and that it is preserved by all transitions. These lemmas are as follows, again limited to one example for the transitions:

```
Safe: invar(g) → safe(g);
Init: init(g) → invar(g);
. . .
Trans6: g1 = conf(atmConf(Verifying, c, p, t), qa, cb, qb)
      ∧ qa ≠ empty ∧ top(qa) = msg(atmCom, verified)
      ∧ g2 = conf(atmConf(Verified, c, p, t),
                   enq(msg(atmCompl, VerifiedCompl), deq(qa)), cb, qb)
      ∧ invar(g1) → invar(g2);
```
Formulating separate lemmas for each transition instead of one lemma using the transition predicate helps us avoid a combinatorial explosion in the theorem prover.

Providing our specification to KIV with all definitions marked as simplifier rules and activating the heuristics mode "PL heuristics + structural induction", each of our lemmas is proved without noticeable delay, i.e., the verification of the invariant is successful and does not pose any difficulty to the prover.

#### **7 Conclusion**

We have developed two new institutions extending the hybrid modal logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ [16]. One institution caters for simple UML state machines with outputs; an extension of it captures simple UML composite structure diagrams. Besides providing formal semantics for communicating UML state machines, via comorphisms these institutions provide a bridge towards theorem proving for UML. Through an elementary example we could demonstrate that, thanks to our framework, effective automated theorem proving for communicating UML state machines is possible.

Future work will be on proof automation. In particular we plan to implement the translations from UML into extended $\mathcal{M}^{\downarrow}_{\mathcal{D}}$, the institution comorphisms from extended $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to Casl, and possibly the link from Hets to KIV. Yet another important aspect is to implement analyses of the composite structure and its state machines with a view to automatically generating lemmas for automated theorem proving. In terms of our general research programme, the next topic to tackle is UML interactions and how they relate to or refine UML state machines. Going beyond the UML, it would be interesting to consider a truly heterogeneous framework, in which composite structure diagrams connect not only UML state machines, but also components specified in languages such as TLA or Event-B.

## **References**



## Semantic Code Search in Software Repositories using Neural Machine Translation

Evangelos Papathomas, Themistoklis Diamantopoulos, and Andreas Symeonidis

Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki, Thessaloniki, Greece
epapathom@ece.auth.gr, thdiaman@issel.ee.auth.gr, symeonid@ece.auth.gr

Abstract. Nowadays, software development is accelerated through the reuse of code snippets found online in question-answering platforms and software repositories. In order to be efficient, this process requires forming an appropriate query and identifying the most suitable code snippet, which can sometimes be challenging and particularly time-consuming. Over the last years, several code recommendation systems have been developed to offer a solution to this problem. Nevertheless, most of them recommend API calls or sequences instead of reusable code snippets. Furthermore, they do not employ architectures advanced enough to exploit the semantics of natural language and code in order to form the optimal query from the question posed. To overcome these issues, we propose CodeTransformer, a code recommendation system that provides useful, reusable code snippets extracted from open-source GitHub repositories. By employing a neural network architecture that comprises advanced attention mechanisms, our system effectively understands and models natural language queries and code snippets in a joint vector space. Upon evaluating CodeTransformer quantitatively against a similar system and qualitatively using a dataset from Stack Overflow, we conclude that our approach can recommend useful and reusable snippets to developers.

Keywords: code reuse · semantic analysis · neural transformers.

## 1 Introduction

The wide uptake of open-source software in the last few decades has accelerated software development through code reuse. Nowadays, developers search online for ways to solve issues that arise during the development process, such as writing code for complex tasks, integrating APIs, or fixing bugs. The popularity of this paradigm has been further boosted by the introduction of online repositories (e.g. GitHub) and programming communities (e.g. Stack Overflow).

As code reuse has become a vital aspect of today's software development, the challenge of finding appropriate answers to programming-related questions in the vastness of the Internet led to the development of code recommendation systems. While the majority focus on providing API calls and sequences (e.g. DeepAPI [10]), a select few have the advantage of recommending reusable code snippets (e.g. DeepCS [9]). Such systems that employ whole snippet extraction mechanisms are greatly valued, as they significantly reduce development time.

However, they are also prone to important limitations. Many accept queries in specialized query languages instead of natural language. In addition, most systems do not employ mechanisms advanced enough to extract the semantics found both in the queries and the source code. And even though some systems engage in semantic analysis (e.g. DeepCS [9], CodeSearchNet [12]), crucial information, such as the control flow of a code snippet, is discarded. Finally, the aforementioned systems typically employ non-annotated datasets and, by extension, lack in terms of training and quantitative evaluation, as ground truth data are essential for the training of a system and the assessment of its performance.

Acknowledging the need for advancing code reuse, GitHub initiated the CodeSearchNet challenge [12], a public competition for code search, specifically aiming to improve on four baseline models using an annotated dataset. These models receive queries in natural language and employ different neural network architectures to return high-quality code snippets. The CodeSearchNet challenge overall provides an interesting testbed due to the variety of programming languages and code snippets in the dataset and the evaluation tools offered.

Influenced by this challenge, in this paper we present CodeTransformer, a system that receives natural language queries and provides reusable code snippets. CodeTransformer uses state-of-the-art neural network and language understanding techniques, while it also employs a custom similarity metric and a custom loss function. Our system does not require a specialized query language; instead, it receives queries in natural language and employs neural machine translation to offer reusable snippets in the form of methods. We train our system on a state-of-the-practice annotated dataset and evaluate its effectiveness against the baseline CodeSearchNet systems [12]. Finally, we assess its applicability in a question-answering context using data from Stack Overflow.

#### 2 Related Work

Code search systems can be distinguished into two categories, those producing sequences of API calls and those producing reusable code. The first category includes systems such as SWIM [21] and T2API [19], which translate text queries to API calls and then synthesize their usage code, i.e. code that uses the calls. SWIM extracts API calls related to a query using Bing and forms their usage code, including the control flow. A limitation is that it cannot handle the semantics of queries (e.g. "convert int to string" and "convert string to int"). T2API is trained on Stack Overflow posts and uses the GraLan language model [17] to model dependencies between API calls and synthesize their usage code.

A different approach to API call recommendation is taken by MULAPI [24]. Apart from usage examples, MULAPI also analyzes the source code and API libraries of a project to provide an implementation of the requested feature. The system also maps the repository of the code to recommend files as locations for the provided API usage code. The architecture of MULAPI comprises a Stanford Word Segmenter for text preprocessing and a Vector Space Model to assess the similarity between texts. FOCUS [18] is a similar system that analyzes a project's repository and other open source repositories using Abstract Syntax Trees and assesses their similarity using context-aware collaborative filtering. Next, it mines API calls from the most similar repositories and presents them to the developers.

Other systems treat code recommendation as a machine translation problem. One of them is DeepAPI [10], which utilizes a Neural Network architecture to transform natural language queries to API sequences. It consists of a recurrent neural network (RNN) encoder that processes natural language using attention mechanisms and an RNN decoder using an Inverse Document Frequency (IDF) based weighting mechanism to output API sequences. BIKER [4] is a similar system that receives natural language queries and assesses their similarity to Stack Overflow question posts and API documentation. Post texts and code snippets are handled as text and are used to train an embedding model that takes into account IDF weights and recommends relevant API calls.

Word2API [15] also bridges the semantic gap between natural language and code to provide API recommendations. The system creates tuples of method descriptions and API sequences that are used to train a word embedding model for vector generation. A more advanced approach was implemented by DeepAPIRec [6]. Its architecture consists of Tree-LSTMs, a long short-term memory (LSTM) unit variant that organizes information in an inverse tree structure. DeepAPIRec also utilizes a statistical parameter model of data dependency that allows recommending parameter values for the APIs suggested by the Tree-LSTM.

The second category of systems comprises the ones that recommend reusable code snippets instead of API calls. One of them is Seahawk [20], an Eclipse plugin that, given a query, returns a ranked list of relevant Stack Overflow posts. The posts are retrieved using Apache Solr and ranked using tf-idf. The snippets found in the posts can be integrated into the code of a project. Like Seahawk, NLP2Code [5] is an Eclipse plugin that retrieves code snippets from Stack Overflow posts. NLP2Code processes natural language text and snippets using the TaskNav algorithm and measures their grammatical correlation with the Stanford CoreNLP Toolkit. The system receives natural language queries and employs a customized version of Google Search Engine for search. StackSearch [8] also extracts information from Stack Overflow posts and recommends code snippets using a hybrid language model that combines tf-idf and fastText [3]. Its results are also accompanied with labels extracted using named entity recognition.
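Ranking by tf-idf, as mentioned for Seahawk, can be sketched in a few lines. The variant below (relative term frequency times the logarithm of the inverse document frequency, whitespace tokenisation) is one common textbook formulation and an assumption on our part; it is not taken from Seahawk or Apache Solr. The function name and the toy posts are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(query, documents):
    """Return document indices ordered by descending tf-idf score."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores.append(score)
    # sorted() is stable, so ties keep their original order.
    return sorted(range(n_docs), key=lambda k: -scores[k])


posts = ["read file lines", "write file", "sort list"]
print(tfidf_rank("read file", posts))
```

The rare term "read" contributes more to the score than "file", which occurs in two of the three posts.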

An interesting alternative is DeepCS [9], which recommends reusable code snippets given a natural language query. DeepCS employs two RNN encoders, one that receives natural language descriptions of methods and one that receives a fusion of method names, API sequences and code tokens. The system then max pools the embeddings generated by the two encoders and assesses their similarity using cosine similarity. DeepCS can understand the semantics of natural language and code to a certain extent; however, it relies on the generated vectors to rank its results without considering further code features such as context.

In contrast to systems that utilize raw data dumps from Stack Overflow or code repositories, CodeSearchNet [12] introduced a well-curated dataset specifically designed for semantic code search, as it consists of docstring and code tokens which highlight their semantics while also facilitating the preprocessing. Moreover, it introduced four different baselines, each using a different architecture for its encoders (Neural Bag-of-Words, Bidirectional RNNs, 1D Convolutional Neural Networks and Self-Attention). CodeSearchNet outperforms most systems due to the quality of its dataset and its powerful neural architectures. However, it ignores certain semantics, such as the control flow of the code, so it favors keyword-based methods instead of those using semantic information.

Although the aforementioned systems are effective in certain scenarios, they have important limitations. Most of them handle natural language input as keywords, i.e. they measure token frequency instead of analyzing semantics and context. Also, most systems output API calls or API usage code instead of reusable snippets. Deep learning systems often do not employ custom similarity metrics and loss functions. In contrast, CodeTransformer is trained on high-quality annotated data from the CodeSearchNet corpus. It analyzes the query and code semantics using word embeddings generated with state-of-the-art attention mechanisms. We employ a hybrid similarity metric and build a custom loss function that are suited to the challenge at hand. Thus, our system is able to comprehend relations between similar queries (e.g. "how to write to command line" and "how to output to terminal") and distinguish queries with lexically minor, yet semantically major differences (e.g. "convert int to string" and "convert string to int").

#### 3 Semantic Code Search using Machine Translation

The architecture of our system, shown in Figure 1, comprises four modules: the Dataset Builder, the Neural Network, the Index Builder, and the Search Engine. The Dataset Builder preprocesses the natural language and code data to produce a clean dataset, including the vocabularies of the input and target languages. The Neural Network module generates word embeddings and extracts the most important features per language using attention mechanisms.

Fig. 1. The architecture of CodeTransformer

Max pooling is used on the word embeddings to generate a single embedding for each natural language and code sequence. The Index Builder builds a vector space containing the sequence embeddings. Each code vector is assigned to an index to allow nearest neighbor search when a natural language vector is received. The Search Engine receives an input query in the GUI and forwards it to the Computations submodule, where the Neural Network analyzes it and generates a natural language sequence embedding. This vector representation of the query is inserted in the vector space to search for its nearest code vectors. The results are forwarded back to the GUI and presented to the user. These modules are further analyzed in the following subsections.

#### 3.1 Data Preprocessor

Dataset Overview The CodeSearchNet corpus comprises over 6.4 million code snippets written in 6 languages, with over 2.3 million of them annotated using docstrings [12]. The snippets were extracted from GitHub repositories, and filtered to remove test functions/constructors, trim long docstrings, and apply de-duplication [16,1]. CodeTransformer was implemented using the Java dataset of the corpus that contains over 1.5 million snippets, of which over 0.54 million come with docstrings. Although we use Java as a proof of concept, it is important to note that our system is mostly language agnostic. Our methodology can be applied to other languages, e.g. Python or JavaScript, with minimal changes.

For each snippet, the dataset contains fields about its origin (repo, path, url, sha) and fields concerning its data (original/full string, method name, extracted code and docstring). The code and the documentation of the snippet (docstring) are also provided as tokens. Table 1 depicts a sample entry of the dataset.


Table 1. An example entry of the dataset

After manual inspection, we concluded that the majority of the dataset entries contain valid natural language docstrings, extracted from each function. However, in certain entries the snippets are not properly annotated and in others the automated natural language text extractor has failed to extract the docstring correctly. For instance, in the docstring of Table 1, the extracted docstring tokens are ['helper', 'method', 'to', 'return', 'a', '{']. To avoid having docstrings that are incorrect or are not properly tokenized, we first preprocess the dataset.

Data Preprocessing We create two separate preprocessing pipelines to effectively target the docstrings and the code data. The regular expressions of Table 2 enable modifications in the tokens of the dataset.


Table 2. Regular expressions for preprocessing

For the removal of noisy natural language data, we designed a pipeline of preprocessing steps, as described below:


As an example, the docstring of the snippet shown in Table 1 produces the tokens ['helper', 'method', 'to', 'return', 'json', 'node', 'from', 'the', 'tree'].

<sup>1</sup> The limits were defined after studying the data and concluding that most entries with inefficient docstrings contained fewer than 6 docstring tokens, while also noting that 30 tokens are adequate for a well-defined description of a function.

Concerning noisy code data, we designed a preprocessing pipeline that slightly differs from those of other systems. Most systems do not sufficiently exploit the control flow information of a code snippet. Instead, they solely focus on function and variable names, as well as control flow words, such as if, else, for, etc. To fully exploit the programming symbols of snippets, we perform the following steps:



Table 3. The encoding of programming symbols to unique tokens


As an example, the code of the method snippet shown in Table 1 produces the tokens shown in Figure 2.

'protected', 'json', 'node', 'get', 'required', 'node', 'openingparen', 'json', 'node', 'tree', 'string', 'field', 'name', 'closingparen', 'openingbrace', 'assert', 'not', 'null', 'openingparen', 'tree', 'tree', 'must', 'not', 'be', 'null', 'closingparen', 'semicolon', 'json', 'node', 'node', 'assignoperator', 'tree', 'get', 'openingparen', 'field', 'name', 'closingparen', 'semicolon', 'assert', 'state', 'openingparen', 'notequaloperator', 'null', 'notoperator', 'openingparen', 'node', 'instanceof', 'null', 'node', 'closingparen', 'openingparen', 'closingparen', 'missing', 'json', 'field', 'addoperator', 'field', 'name', 'addoperator', 'closingparen', 'semicolon', 'return', 'node', 'semicolon', 'closingbrace'

Fig. 2. Example tokens extracted from the code of the method snippet of Table 1

Our preprocessing pipeline minimizes the loss of information by performing data augmentation on docstrings and code. In docstrings where the information is insufficient, the pipeline replaces them with separated camelCase function names (e.g. 'camelCase' becomes 'camel case') that are representative of the code. The pipeline also encodes most code symbols to words instead of removing them and, thus, reinforces code semantics such as control and data flow.
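The two augmentation steps described above can be sketched as follows. This is a minimal illustration and not the actual pipeline: the function names are hypothetical, and the symbol dictionary is only the subset of word forms that is visible in Figure 2, so any symbol outside that subset (for instance the dot) is simply skipped here.

```python
import re

# Partial symbol-to-word mapping. The word forms are those visible in Figure 2;
# the complete mapping is not reproduced, so this subset is an assumption.
SYMBOL_WORDS = {
    "!=": "notequaloperator",
    "(": "openingparen",
    ")": "closingparen",
    "{": "openingbrace",
    "}": "closingbrace",
    ";": "semicolon",
    "=": "assignoperator",
    "!": "notoperator",
    "+": "addoperator",
}

# Longest symbols first, so that "!=" is not read as "!" followed by "=".
_SYMBOL_PATTERN = "|".join(
    re.escape(s) for s in sorted(SYMBOL_WORDS, key=len, reverse=True)
)
_TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*|" + _SYMBOL_PATTERN)


def split_camel_case(identifier):
    """Split an identifier such as 'getRequiredNode' into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", identifier)
    return [part.lower() for part in parts]


def tokenize_code(code):
    """Turn identifiers into lowercase words and known symbols into word tokens."""
    tokens = []
    for match in _TOKEN_RE.findall(code):
        if match in SYMBOL_WORDS:
            tokens.append(SYMBOL_WORDS[match])
        else:
            tokens.extend(split_camel_case(match))
    return tokens


print(tokenize_code("JsonNode node = tree.get(fieldName);"))
```

Encoding symbols as words keeps them inside the same vocabulary as identifiers, so the encoder can attend to them like any other token.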

#### 3.2 Neural Network

In this subsection we present the main module of our system, a neural network that employs transformers to map natural language queries to source code.

Network Architecture The main architecture of CodeTransformer is based on Matching Networks [23], a neural network architecture designed to solve One-Shot Learning problems. Our system, however, follows a slightly different approach, as it uses an improved embedding similarity metric and does not require an external memory to function. As we discuss in the following subsections, our architecture utilizes self-attention encoders and a hybrid geometric similarity metric. In contrast to the original approach, ours does not use a softmax function on its output, as the similarity metric we selected does not natively support it. In Figure 3 we present the architecture of the Neural Network module.

Fig. 3. The main architecture of the Neural Network module

Transformers To maximize the semantic abilities of our system, we employed the state-of-the-art Transformers architecture on both of its encoders [22]. A Transformer consists of two modules, an encoder and a decoder, with minimal architectural differences. Considering the fact that a Matching Network performs feature extraction and not direct translation of language data, our implementation solely requires encoders for its function. The architecture of the Transformer encoder is presented in Figure 4.

The Transformer encoder comprises an embedding layer, a Positional Encoding layer and encoder layers, i.e. consecutive blocks of Multi-Head Attention and Feed-Forward Network layers. In our implementation, we opted for three stacked encoder layers, as they provide sufficient depth for achieving high efficiency.

Fig. 4. The architecture of a Transformer encoder

Before inserting a token sequence into an encoder, we create a vocabulary that includes the most frequently occurring words and then encode them to integers. We build two vocabularies, each consisting of 10,000 unique words. After encoding, we pad each entry with zeros to form tensors of equal dimensions. To enhance the generalization capabilities of our system, we reshuffle the dataset at the start of every training iteration and divide it into batches of 128.
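A minimal sketch of these steps is given below. The reserved ids (0 for padding, 1 for out-of-vocabulary tokens), the decision to count them within the 10,000 entries, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary tokens


def build_vocabulary(sequences, max_size=10_000):
    """Map the most frequent tokens to integer ids, starting after the reserved ids."""
    counts = Counter(token for seq in sequences for token in seq)
    most_common = counts.most_common(max_size - 2)
    return {token: index + 2 for index, (token, _) in enumerate(most_common)}


def encode_and_pad(sequences, vocab):
    """Encode tokens as integers and right-pad with zeros to a common length."""
    encoded = [[vocab.get(token, UNK) for token in seq] for seq in sequences]
    width = max(len(seq) for seq in encoded)
    return [seq + [PAD] * (width - len(seq)) for seq in encoded]


def shuffled_batches(rows, batch_size=128, seed=None):
    """Reshuffle the rows and yield consecutive batches (called once per epoch)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


docs = [["return", "json", "node"], ["return", "node"], ["return"]]
vocab = build_vocabulary(docs)
print(encode_and_pad(docs, vocab))
```

One vocabulary would be built from the docstring tokens and a second one from the code tokens, since the two encoders do not share a vocabulary.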

When a token sequence is received as input, the encoder embeds the tokens in a high-dimensional vector space. In other words, the encoder generates word embeddings, i.e. vector representations aiming to extract token information. The encoder generates word embeddings of 128 dimensions using an embedding layer. The natural language encoder and the source code encoder have identical parameter values, but each encoder has its own distinct weights and vocabulary. To generate sequence embeddings we use max pooling, as it extracts the most essential features of the embeddings output by the stacked encoder layers.
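As a small illustration of the pooling step, the sketch below takes the element-wise maximum over the token embeddings of one sequence. Skipping padded positions is an assumption made here, and the three-dimensional vectors stand in for the 128-dimensional embeddings of the encoder.

```python
def max_pool(token_embeddings, token_ids, pad_id=0):
    """Element-wise maximum over the non-padding positions of one sequence."""
    kept = [vec for vec, tid in zip(token_embeddings, token_ids) if tid != pad_id]
    return [max(column) for column in zip(*kept)]


embeddings = [
    [0.1, 0.9, -0.3],  # first token
    [0.4, 0.2, 0.5],   # second token
    [9.0, 9.0, 9.0],   # padding position, ignored
]
print(max_pool(embeddings, token_ids=[5, 7, 0]))
```

The result has the same dimension as a single word embedding regardless of the sequence length, which is what allows queries and snippets of different lengths to share one vector space.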

Similarity Metric The similarity between natural language and code sequence embeddings is usually quantified using the Euclidean distance or the cosine similarity. However, the computation of the Euclidean distance between two vectors does not contain any information about the angle between the two vectors. On the other hand, cosine similarity does not consider the magnitude of the vectors.

Our system utilizes a hybrid similarity metric, the Triangle's Area Similarity - Sector's Area Similarity [11], also known as TS-SS, which improves upon the aforementioned metrics by incorporating the Euclidean distance, the magnitude difference and the angle between two vectors to compute their similarity. The Triangle's Area Similarity (TS) comprises the Euclidean distance, the magnitude of each vector and the angle between them, while the Sector's Area Similarity (SS) provides the magnitude difference. The TS of two vectors A and B is:

$$TS\left(A,B\right) = \frac{|A| \cdot |B| \cdot \sin\left(\theta'\right)}{2} \tag{1}$$

where θ is the angle between the two vectors, i.e. θ = cos<sup>−1</sup>(A · B / (|A| · |B|)), and θ′ is defined as θ + 10°. We use θ′ instead of θ so that the computation remains valid in the case of overlapping vectors (when θ = 0), for which sin(θ) would reduce TS to zero regardless of the magnitudes. The SS of two vectors A and B is defined as:

$$SS\left(A,B\right) = \pi \left(ED\left(A,B\right) + MD\left(A,B\right)\right)^2 \cdot \left(\frac{\theta'}{360}\right) \tag{2}$$

where θ′ is defined as above, while ED(A, B) and MD(A, B) correspond to the Euclidean distance and the magnitude difference between the two vectors, respectively. Given the dimension of the vectors N, the magnitude difference is:

$$MD\left(A,B\right) = \left| \sqrt{\sum\_{n=1}^{N} A\_n^2} - \sqrt{\sum\_{n=1}^{N} B\_n^2} \right| \tag{3}$$

Merging TS and SS via addition is not possible, as they are on different scales. According to Heidarian and Dinneen [11], their multiplication establishes a new scale that sufficiently represents similarity. Consequently, TS-SS is computed as:

$$TS-SS\left(A,B\right) = \frac{|A| \cdot |B| \cdot \sin\left(\theta'\right) \cdot \theta' \cdot \pi \cdot \left(ED\left(A,B\right) + MD\left(A,B\right)\right)^2}{720} \tag{4}$$

TS-SS values range from 0 to infinity, with 0 indicating that two vectors are identical. Accordingly, the TS-SS value of two dissimilar vectors is larger than zero, without any upper limit. In our implementation, we decided to calculate the reciprocal TS-SS in favor of the custom loss function we use during our network's training process. The final similarity of the two vectors is computed as:

$$Similarity\left(A,B\right) = \frac{1}{TS - SS\left(A,B\right)}\tag{5}$$
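Equations (1) to (5) can be sketched in plain Python as shown below. The function names are hypothetical, and the small constant eps is an assumption added here to keep the reciprocal finite when the two vectors are identical (TS-SS = 0); how that case is handled in practice is not stated above.

```python
import math


def _norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def ts_ss(a, b):
    """TS-SS of two vectors: 0 for identical vectors, larger when less similar."""
    norm_a, norm_b = _norm(a), _norm(b)
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    theta_prime = math.degrees(math.acos(cosine)) + 10.0  # in degrees

    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    magnitude_diff = abs(norm_a - norm_b)

    triangle = norm_a * norm_b * math.sin(math.radians(theta_prime)) / 2.0
    sector = math.pi * (euclidean + magnitude_diff) ** 2 * theta_prime / 360.0
    return triangle * sector


def similarity(a, b, eps=1e-12):
    """Reciprocal TS-SS; eps is an assumed safeguard against division by zero."""
    return 1.0 / (ts_ss(a, b) + eps)


# Same direction, different magnitude: cosine similarity is 1, TS-SS is not 0.
print(ts_ss([1.0, 0.0], [3.0, 0.0]))
# Orthogonal unit vectors.
print(ts_ss([1.0, 0.0], [0.0, 1.0]))
```

The first call illustrates the point made about cosine similarity: the two vectors overlap in direction, yet the metric still separates them because of their magnitude difference.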

Loss Function The neural network of CodeTransformer outputs a square similarity matrix, where each row represents a natural language embedding and each column represents a source code embedding. The diagonal matrix cells correspond to the positive pairs of natural language and source code and their values ought to be high. The rest of the matrix cells correspond to the negative pairs, and their values ought to be low. At network initialization, all embeddings contain random values and are scattered throughout the vector space. As a result, in order to bring all similar embeddings closer during training, we need to utilize a loss function that is based on the computations of the reciprocal TS-SS.

A loss function typically used by similar systems (such as CodeSearchNet [12] and DeepCS [9]) is a variation of Hinge loss, computed as follows:

$$Loss = \max\left(0, 1 - positive + negative\right) \tag{6}$$

Upon testing this variation of Hinge loss, we observed that it could not be successfully integrated with either the vanilla or the reciprocal TS-SS. Even after modifying the function's margin to a value larger than 1, to account for the infinite value range of TS-SS, the result was always the same. The embeddings constantly collapsed to a specific point, not allowing distinct sequence embeddings for each positive pair.

This led us to design a custom loss function, based on the squared variation of Hinge loss. We name this loss function Squared Margin Loss and define it as:

$$Loss = \left(\max\left(0, margin - positive\right)\right)^2 + negative^2\tag{7}$$

Furthermore, the derivatives of our loss function are defned as follows:

$$\frac{\partial}{\partial \left( positive\right)} Loss = \begin{cases} -2 \cdot \left(margin - positive\right), & \text{if } positive < margin\\ 0, & \text{otherwise} \end{cases} \tag{8}$$

$$\frac{\partial}{\partial \left(negative\right)} Loss = 2 \cdot negative \tag{9}$$

Because both terms are squared, the Squared Margin Loss penalizes large loss values more strongly and small loss values less. Thus, the function supports the convergence of the network during the first epochs and its optimization at later epochs.

By further restricting the first term with the max function, the positive pairs whose similarity value is above the margin do not take part in the computation of the loss. In this case, however, the similarity values of the corresponding negative pairs continue to decrease. This allows the similarity values of the diagonal to increase further than the margin. Without the use of the max function, the elements that have crossed the margin would generate useless losses and positive gradients, resulting in the fluctuation of their similarity values around the margin.
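The behavior described above can be checked with a minimal sketch of the Squared Margin Loss and its derivatives. Differentiating (margin − positive)² with respect to positive gives −2 · (margin − positive), which is the form used below. The way the per-pair values are aggregated over the similarity matrix (a mean over the off-diagonal cells of each row, then a mean over rows) is an assumption, as are the function names.

```python
def squared_margin_loss(positive, negative, margin=5.0):
    """Squared Margin Loss for one positive and one negative similarity value."""
    return max(0.0, margin - positive) ** 2 + negative ** 2


def loss_gradients(positive, negative, margin=5.0):
    """Partial derivatives of the loss with respect to (positive, negative)."""
    d_positive = -2.0 * (margin - positive) if positive < margin else 0.0
    d_negative = 2.0 * negative
    return d_positive, d_negative


def batch_loss(similarity_matrix, margin=5.0):
    """Assumed aggregation: diagonal cells are positives, all others negatives."""
    total = 0.0
    for i, row in enumerate(similarity_matrix):
        negatives = [value for j, value in enumerate(row) if j != i]
        mean_sq = sum(v * v for v in negatives) / len(negatives) if negatives else 0.0
        total += max(0.0, margin - row[i]) ** 2 + mean_sq
    return total / len(similarity_matrix)


print(squared_margin_loss(3.0, 0.5))  # positive below the margin
print(loss_gradients(6.0, 0.5))       # positive above the margin: no gradient
```

With gradient descent, a negative derivative for a positive pair below the margin increases its similarity, while a pair above the margin receives no update from the first term and only the negative term keeps shrinking.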

Optimizer We train our neural network using the Adaptive Moment Estimation (Adam) optimizer [14], which computes adaptive learning rates for each parameter. Adam stores the exponentially decaying average of past gradients and the exponentially decaying average of past squared gradients. Using Adam ensures that the network converges fast through momentum estimation. The convergence also depends on the learning rate; a poor choice of its value can slow down the training process, or even derail the network's weights. To find the ideal learning rate, we examined a range of values generated by the equation:

$$LearningRate = 1.1^{step/100} \cdot 10^{-10} \tag{10}$$

This function generates values starting from 10<sup>−10</sup> up to a practically infinite value. The learning rate is multiplied by 1.1 once every 100 training steps.
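As a worked example, the schedule of equation (10) and its inverse can be written as follows. The function names are hypothetical, and the schedule is evaluated continuously, exactly as the equation is written.

```python
import math


def learning_rate(step):
    """Learning rate of equation (10) at a given training step."""
    return 1.1 ** (step / 100) * 1e-10


def step_for(rate):
    """Training step at which the schedule reaches the given learning rate."""
    return 100 * math.log(rate / 1e-10) / math.log(1.1)


print(learning_rate(0))         # starting value of the sweep
print(round(step_for(3.2e-4)))  # step at which the sweep reaches 3.2e-4
```

Under this schedule, a rate of 3.2 · 10<sup>−4</sup> is reached after roughly 15,700 training steps, which gives a sense of how long such a sweep has to run before it covers the practically useful range.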

After plotting the accuracy and loss per training step, we noticed a point with a steep increase in accuracy and a steep decrease in loss, as well as a point with a steep decrease in accuracy and a steep increase in loss. Next, we isolated the values between these steps and tested those closer to the lower end, where the increase in accuracy and decrease in loss occur. Through trial and error, we selected a learning rate value of 3.2 · 10<sup>−4</sup>. We set the margin of our network to 5, the number of heads to 8, and the feed-forward dimension d<sub>ff</sub> to 512, and trained the network for 40 epochs, as these have been shown to be enough for the efficiency of the results.

#### 3.3 Index Builder

Due to the complexity of our neural network and the number of its parameters, fast response times cannot be guaranteed. To significantly reduce the processing time of our system, we employed Annoy [2], a tool using a Nearest Neighbor Search algorithm. Using Annoy in the Index Builder module allows us to generate a vector space that contains all the source code embedding vectors of the corpus.

Annoy assigns an index to every code embedding and then sorts the embeddings based on their values by building up a forest of trees. The vector dimension of the vector space is set according to the dimension of the output embedding, which is 128. We calculated the Euclidean distance between the vectors and built 10 trees. Regarding the search process in the vector space, we select the first 100 nearest vectors out of the 10,000 nearest forest nodes. Thus, instead of calculating the similarity between a query and the whole corpus using the neural network, Annoy compares the query vector with the nearest 10,000 code vectors. In addition, Annoy's search time does not seem to be hindered by the embedding dimension.

The search process of a query is executed in three stages. Firstly, the query of the user is preprocessed, so that non-alphanumeric symbols are removed, camelCase tokens are separated and uppercase characters are lowercased. Secondly, every query token is encoded as an integer to be passed as input to the neural network. The neural network, in inference mode, generates the sequence embedding of the query to be inserted into the vector space of the Index Builder. Finally, Annoy extracts the indices of the 10 code vectors nearest to the query, and the corresponding code snippets and GitHub URLs are presented to the user.
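The first and last stages of this process can be sketched as follows. To keep the example self-contained, the approximate Annoy index is replaced by an exhaustive Euclidean search, which returns the same kind of result (indices of the nearest code vectors) but scans every vector; the function names and the two-dimensional toy vectors are illustrative assumptions, and the encoder that produces the query embedding is left out.

```python
import math
import re


def preprocess_query(query):
    """Separate camelCase tokens, drop non-alphanumeric symbols, lowercase."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", query)
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", spaced)
    return cleaned.lower().split()


def nearest_code_vectors(query_vector, code_vectors, k=10):
    """Exhaustive stand-in for the index lookup: indices of the k closest vectors."""
    distances = [
        (math.dist(query_vector, vector), index)
        for index, vector in enumerate(code_vectors)
    ]
    return [index for _, index in sorted(distances)[:k]]


print(preprocess_query("How to readTextFile() in Java?"))
print(nearest_code_vectors([0.0, 0.0], [[5.0, 5.0], [1.0, 0.0], [0.0, 2.0]], k=2))
```

The exhaustive search costs time linear in the corpus size for every query, which is the cost that a tree-based approximate index is meant to avoid.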

#### 4 Evaluation

We evaluate our system using two different datasets, the Java corpus of CodeSearchNet [12], and a set of popular Java questions from Stack Overflow<sup>2</sup>.

The performance of our system is assessed using the Precision at K (P@K), the Mean Reciprocal Rank (MRR) [7] and the Normalized Discounted Cumulative Gain (NDCG) [13]. P@K indicates how many out of the first K results are relevant to the query. MRR further incorporates the order of the results, computed as the mean of the reciprocal rank of each query (the reciprocal rank of the i-th query is 1/rank<sub>i</sub>, where rank<sub>i</sub> is the rank position of the first relevant document). The NDCG is the normalized DCG, computed for N results as:

$$DCG = \sum\_{i=1}^{N} \frac{2^{rel\_i} - 1}{\log\_2(i+1)} \tag{11}$$

where rel<sub>i</sub> is the graded relevance of the result at position i. Thus, NDCG is computed by dividing the result of equation (11) by the ideal DCG, i.e. the one produced if all the results in the list were sorted in the correct order.
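For concreteness, the three metrics can be computed as in the sketch below. The function names are hypothetical; assigning a reciprocal rank of 0 to a query without any relevant result, and an NDCG of 0 when no result has positive relevance, are conventions assumed here.

```python
import math


def precision_at_k(relevant_flags, k):
    """Fraction of the first k results that are relevant (flags are 0 or 1)."""
    return sum(relevant_flags[:k]) / k


def reciprocal_rank(relevant_flags):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(flags_per_query):
    return sum(reciprocal_rank(flags) for flags in flags_per_query) / len(flags_per_query)


def dcg(relevances):
    """Equation (11): graded relevance discounted by log2 of the position."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the same results in ideal order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(precision_at_k([1, 0, 1, 0], k=4))
print(mean_reciprocal_rank([[1, 0], [0, 1], [0, 0]]))
print(ndcg([1, 3]))
```

Note that MRR only looks at the first relevant result of each query, whereas NDCG rewards placing all highly relevant results early.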

#### 4.1 Evaluation using CodeSearchNet Queries

CodeTransformer employs the CodeSearchNet corpus [12] for training and inference, allowing its direct comparison with the implementations of CodeSearchNet. CodeSearchNet comprises four different encoder architectures. One of them is the Self-Attention (SelfAtt) architecture, which was examined in the previous section. The Neural Bag of Words (NBoW) architecture measures word occurrence within a document, therefore it performs well on keyword-based search operations. The 1D Convolutional Neural Network (1D-CNN) architecture learns to recognize complex, non-linear patterns. In contrast to NBoW and 1D-CNN, the Bidirectional RNN (biRNN) architecture further models the word order.

<sup>2</sup> The code and details used to reproduce our findings can be found at the repository: https://github.com/AuthEceSoftEng/CodeTransformer

The four implementations are compared to CodeTransformer on the test set of the CodeSearchNet corpus, which includes 15000 docstring and code snippet pairs, for the computation of MRR. Additionally, the four implementations are compared to CodeTransformer using 99 annotated queries provided by CodeSearchNet for computing NDCG. The results are shown in Table 4. Note that, although our system is not directly compared with DeepCS [9] as the systems use different data, we compare it with the biRNN implementation of CodeSearchNet, which has a neural architecture similar to that of DeepCS.


Table 4. Evaluation results of CodeTransformer and CodeSearchNet

Concerning MRR, our system outperforms CodeSearchNet measurements, indicating that the different strategies followed for our data pipeline are effective. Another factor that may contribute to this result is our preprocessing methodology, as it may be possible that the replacement of insufficient docstrings with function names led to increased MRR values. As a side note, these results were also clear during the validation phase of the algorithms (e.g. the MRR of CodeTransformer for the validation set was the highest at 0.62604, while the second highest was that of CodeSearchNet-SelfAtt at 0.5513).

Concerning NDCG, our system performs slightly better compared to the corresponding Self-Attention implementation of CodeSearchNet, while the NBoW and 1D-CNN implementations perform better than CodeTransformer, possibly because they use docstrings as natural language. However, we note that only a small amount of data was annotated for the computation of NDCG (i.e. only 823 out of 1.5 million Java code snippets). In addition, as the authors of CodeSearchNet note [12], the annotated data were selected using the top 10 results per query, generated by an ensemble of the CodeSearchNet neural models and ElasticSearch, therefore they are what these systems are more likely to produce. Hence, it is possible that correct results are ignored for computing NDCG.

Figure 5 depicts the distribution and the individual MRR values for 99 queries of the test set of CodeSearchNet [12]. As the annotations were not provided, we annotated the first 10 results returned by our system to compute the MRR. The majority of MRR values are equal to 1, indicating that our system returns a relevant result in the first position for more than half of the queries. By examining the results, we found that our system effectively models the semantic information of the text and the code snippets. Indicatively, for Q64, CodeTransformer outputs a function that sorts an array using another array's order, even though almost none of the exact words of the query are present in the code (except for the word "sort"). Semantically similar terms are also effectively interpreted. For example, for Q16, which requests exporting data to an excel file, our system returns an exportXls method, thus modeling the semantic similarity between the terms "excel" and "xls". Similarly, given Q91, which requests data extraction from a text file, CodeTransformer returns a method using the term "read" instead of "extract".

Fig. 5. MRR values of CodeTransformer for the 99 queries of CodeSearchNet dataset

Concerning queries for which our system did not perform as effectively, some of them are relevant to other programming languages and/or are not covered by the corpus. Note that these 99 queries are drawn from 6 languages and thus not all of them are relevant to Java. An example unanswered query is Q34, as readonly arrays do not exist in Java and, therefore, a relevant code snippet is not included in the corpus. After manually inspecting the corpus, we concluded that queries Q53, Q68, Q70, Q73, Q76 and Q97 are likewise focused on other languages. This is also the case for HTML parsing queries, such as queries Q49, Q66, Q72, Q93 and Q98, for which we could find a few Java methods by manual inspection; however, they are mainly targeted at other languages. In any case, considering the results of Table 4 and Figure 5, CodeTransformer seems to provide a relevant answer in the first two positions more often than not.

Finally, as a proof of concept, Table 5 depicts the declarations of the methods returned by our system for query Q91, which refers to "extracting data from a text file". It is clear that the methods respond effectively to the query.

Table 5. Declarations of the methods returned by CodeTransformer for query Q91 "extracting data from a text file"


#### 4.2 Evaluation using Stack Overflow Questions

To further evaluate CodeTransformer, we reviewed its performance on real user queries. Although our model uses docstrings instead of real queries, we consider this experiment adequate for assessing its effectiveness as a proof of concept.

We manually selected the 40 highest-rated Stack Overflow posts at the time of research in which the posters search for Java code snippets. After querying our system using their titles, we obtained 10 results for each query, sorted by their similarity to the query. Next, we manually annotated the relevance of each result to the query, making sure that the result is a valid answer. To limit threats to validity, the annotations were performed without knowledge of the order of the results. Table 6 depicts the questions as well as the rank of the first relevant result and the precision at the first 10 results for each question.

Table 6. Evaluation results of CodeTransformer on the set of the 40 most popular Stack Overflow Java questions


Precision at the first 10 results is relatively high for most queries. Moreover, we may note that CodeTransformer effectively disambiguates among queries with similar context. Consider, e.g., queries S17 and S32 that are both relevant to stack traces; although these queries are similar, the system was able to comprehend the semantics of each query and return several highly ranked relevant results. Even for queries with low precision in their results, CodeTransformer placed the first relevant result in the first or the second position. Thus, even though for some queries there are not many relevant results, the users typically receive at least one correct answer. An example would be query S06, for which the system returned only two relevant results, but one of them is ranked in the first place. It is also notable, in the same query, that CodeTransformer distinguishes between converting "string to integer" and "integer to string".

Eight out of the 40 questions were not answered at all, mostly because a matching function does not exist in the corpus. For example, queries S07, S09, S10, S24, and S37 do not require a whole method for their implementation and, thus, the corpus does not include relevant code snippets. Other queries may be too complex, such as query S18, for which our system returns some relevant code snippets; however, these results do not meet the condition of the fastest way to examine if an integer's square root is an integer.

In Table 7 we provide three example Stack Overflow queries and the corresponding relevant answers. For the first two queries, CodeTransformer has placed the answers at the first position, while for the third query the answer was placed at the second position. As shown by these examples, CodeTransformer indeed retrieves and recommends useful snippets in a question-answering scenario.

#### 5 Conclusion

Although there are several approaches for code snippet retrieval, most of them do not consider the semantics of natural language and code, ignoring essential information regarding the data. Furthermore, several of them recommend API calls or sequences instead of reusable code snippets, requiring more effort from the developer. Deep learning systems are usually more effective; however, most do not employ advanced neural transformer architectures and are limited by the fact that they are not trained on annotated datasets. Our system, CodeTransformer, overcomes these limitations by employing a state-of-the-art neural network architecture. The advanced attention mechanisms of this architecture, together with a specialized similarity metric and a custom loss function, along with the preprocessing pipeline specifically designed to augment natural language and code semantics, allow the system to generate powerful data representations.

Upon evaluating CodeTransformer against the implementations of CodeSearchNet, we found that our system is more effective, especially when the developer would prefer to receive the method most relevant to the query rather than a list of related methods. We further assessed CodeTransformer on a dataset of actual questions from Stack Overflow, with the results indicating that it is capable of retrieving useful code, even for complex natural language queries.

For future work, we consider implementing our network using real-life natural language data, such as Stack Overflow questions, instead of code documentation. In addition, we could train our network using other (less curated) datasets and explore different preprocessing techniques, incorporating the semantics of programming symbols and the information provided by method names to the natural language data. Finally, we could explore whether our system can generate docstrings by providing code snippets as input to the code encoder and comparing their sequence embeddings to docstring sequence embeddings.

Table 7. Example Stack Overflow queries and the answers of CodeTransformer

Query S20 How can I concatenate two arrays in Java?

Result public static String[] concat(String[] array1, String[] array2) { int length1 = array1.length; int length2 = array2.length; int length = length1 + length2; String[] dest = new String[length]; System.arraycopy(array1, 0, dest, 0, length1); System.arraycopy(array2, 0, dest, length1, length2); return dest; }

Query S36 Java string to date conversion

Result

```
public static Date serviceStringToDate(String s) {
  if (s == null)
    return null;
  try {
    return new SimpleDateFormat(serviceDateFormat).parse(s);
  }
  catch (Exception e) {
    return null;
  }
}
```

### Acknowledgements

This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH - CREATE - INNOVATE (project code: T1EDK-02347).




## AequeVox: Automated Fairness Testing of Speech Recognition Systems<sup>⋆</sup>

Sai Sathiesh Rajan, Sakshi Udeshi, and Sudipta Chattopadhyay

Singapore University of Technology and Design, Singapore 487372, Singapore {sai\_rajan,sakshi\_udeshi}@mymail.sutd.edu.sg sudipta\_chattopadhyay@sutd.edu.sg

Abstract. Automatic Speech Recognition (ASR) systems have become ubiquitous. They can be found in a variety of form factors and are increasingly important in our daily lives. As such, ensuring that these systems are equitable to different subgroups of the population is crucial. In this paper, we introduce AequeVox, an automated testing framework for evaluating the fairness of ASR systems. AequeVox simulates different environments to assess the effectiveness of ASR systems for different populations. In addition, we investigate whether the chosen simulations are comprehensible to humans. We further propose a fault localization technique capable of identifying words that are not robust to these varying environments. Both components of AequeVox are able to operate in the absence of ground truth data.

We evaluate AequeVox on speech from four different datasets using three different commercial ASRs. Our experiments reveal that non-native English, female and Nigerian English speakers generate 109%, 528.5% and 156.9% more errors, on average, than native English, male and UK Midlands speakers, respectively. Our user study also reveals that 82.9% of the simulations (employed through speech transformations) had a comprehensibility rating above seven (out of ten), with the lowest rating being 6.78. This further validates the fairness violations discovered by AequeVox. Finally, we show that the non-robust words, as predicted by the fault localization technique embodied in AequeVox, show 223.8% more errors than the predicted robust words across all ASRs.

## 1 Introduction

Automated speech recognition (ASR) systems have made great strides in a variety of application areas, e.g. smart home devices, robotics and handheld devices, among others. The wide variety of applications has made ASR systems serve increasingly diverse groups of people. Consequently, it is crucial that such systems behave in a non-discriminatory fashion. This is particularly important because assistive technologies powered by ASR systems are often the primary mode of interaction for users with certain disabilities [20]. It is therefore critical that an ASR system employed in such technologies is effective in diverse environments and across a wide variety of speakers (e.g. male, female, native English speakers, non-native English speakers), since these systems are often deployed in safety-critical scenarios [18].

$$ASR_{Err}(\text{male} + \text{noise}) \ll ASR_{Err}(\text{female} + \text{noise})$$

Fig. 1: Fairness Testing in AequeVox

<sup>⋆</sup> This work is partially supported by Singapore Ministry of Education (MOE) grant number MOE2018-T2-1-098 and OneConnect Financial grant number RGOCFT2001.

© The Author(s) 2022

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 245–267, 2022. https://doi.org/10.1007/978-3-030-99429-7\_14

In this paper, we are broadly concerned with the fairness properties in ASR systems. Specifically, we investigate whether speech from one group is more robustly recognised as compared to another group. For instance, consider the example shown in Figure 1 for a system ASR. The metric ASR<sub>Err</sub> captures the error rate induced by ASR. Consider speech from two groups of speakers, i.e. male and female. We assume that the ASR has similar error rates for both the groups of speakers, as illustrated in the upper half of Figure 1. We now apply a small, constant perturbation on the speech provided by the two groups. Such a perturbation can be, for instance, addition of small noise, exemplifying the natural conditions that the ASR systems may need to work in (e.g. a noisy environment). If we observe that the ASR<sub>Err</sub> increases disproportionately for one of the speaker groups, as compared to the other, then we consider such a behaviour a violation of fairness (see the second half of Figure 1). Intuitively, Figure 1 exemplifies the violations of Equality of Outcomes [36] in the context of ASR systems, where the male group is provided with a higher quality of service in a noisy environment as compared to the female group. Automatically discovering such scenarios of unfairness via simulating the ASR service in diverse environments is the main contribution of our AequeVox framework.

AequeVox facilitates fairness testing without having any access to ground truth transcription data. Although text-to-speech (TTS) can be used for generating speech, we argue that it is not suitable for accurately identifying the bias towards speech coming from a certain group. Specifically, speakers may intentionally use enunciation, intonation, different degrees of loudness or other aspects of vocalization to articulate their message. Additionally, speakers unintentionally communicate their social characteristics such as their place of origin (through their accent), gender, age and education. This is unique to human speech and TTS systems cannot faithfully capture all the complexities inherent to human speech. Therefore, we believe that fairness testing of ASR systems should involve speech data from human speakers.

We note that human speech (and the ASRs) may be subject to adverse environments (e.g. noise) and it is critical that the fairness evaluation considers such adverse environments. To facilitate the testing of ASR systems in adverse environments, we model the speech signal as a sinusoidal wave and subject it to eight different metamorphic transformations (e.g. noise, drop, low/high pass filter) that are highly relevant in real life. Furthermore, in the absence of manually transcribed speech, we use a differential testing methodology to expose fairness violations. In particular, AequeVox identifies the bias in ASR systems via a two-step approach: Firstly, AequeVox registers the increase in error rates for speech from two groups when subjected to a metamorphic transformation. Subsequently, if the increase in the error rate of one group exceeds the other by a given threshold, AequeVox classifies this as a violation of fairness. To the best of our knowledge, no such differential testing methodology has been proposed before. As a by-product of our AequeVox framework, we highlight words that contribute to errors by comparing their counts in the transcripts of the transformed speech against those of the original speech. This information can be further used to improve the ASR system.

Existing works [17,49] isolate certain sensitive attributes (e.g. gender) and use such attributes to test for fairness. Isolating these attributes is difficult in speech data, making it challenging to apply existing techniques to evaluate the fairness of ASR systems. AequeVox tackles this by formalizing a unique fairness criterion targeted at ASR systems. Despite some existing efforts in testing ASR systems [5,13], these are not directly applicable for fairness testing. Additionally, some of these works require manually labelled speech transcription data [13]. Finally, differential testing via TTS [5] is not appropriate to determine the bias towards certain speakers, as they might use different vocalization that might be impossible (and perhaps irrational) to generate via a TTS. In contrast, AequeVox works on speech signals directly and defines transformations directly on these signals. AequeVox also does not require any access to manually labelled speech data for discovering fairness violations. In summary, we make the following contributions in the paper:




6. We evaluate (via the user study) the human comprehensibility score of the transformations employed by AequeVox on the speech signal. The lowest comprehensibility score was 6.78 and 82.9% of the transformations had a comprehensibility score of more than seven.

## 2 Background

In this section, we introduce the necessary background information.

Fairness in ASR Systems: A recent work, FairSpeech [26], uses conversational speech from black and white speakers to find that the word error rate for individuals who speak African American Vernacular English (AAVE) is nearly twice as large in all cases.

Testing ASR Systems: The major testing focus to date has been on image recognition systems and large language models. Few papers have probed ASR systems. One such work, DeepCruiser [13], applies metamorphic transformations to audio samples to perform coverage-guided testing on ASR systems. Iwama et al. [23] also perform automated testing on the basic recognition capabilities of ASR systems to detect functional defects. CrossASR [5] is another recent paper that applies differential testing to ASR systems.

The Gap in Testing ASR Systems: There is little work on automated methods to formalise and test fairness in ASR systems. In this work, we present AequeVox to test the fairness of ASR systems with respect to different population groups. It accomplishes this with the aid of differential testing of speech samples that have gone through metamorphic transformations of varying intensity. Our experimentation suggests that speech from different groups of speakers receives significantly different quality of service across ASR systems. In the subsequent sections, we describe the design and evaluation of our AequeVox system.

### 3 Methodology

In this section, we discuss AequeVox in detail. In particular, we motivate and formalize the notion of fairness in ASR systems. Then, we discuss our methodology to systematically find the violation of fairness in ASR systems. The notations used are described in Table 1.

Motivation: Equality of outcomes [36] describes a state in which all people have approximately the same material wealth and income, or in which the general economic conditions of everyone's lives are alike. For a software system, equality of outcomes can be thought of as everyone getting the same quality of service from the software they are using. For a lot of software services, providing the same quality of service is baked into the system by design. For example, the results of a search engine only depend on the query. The quality of the result generally does not depend on any sensitive attributes such as race, age, gender and nationality. In the context of an ASR, the quality of service does depend on these sensitive attributes. This inferior quality of service may be especially detrimental in safety-critical settings such as emergency medicine [18] or air traffic management [27,21].

In our work, we show that the quality of service provided by ASR systems is vastly different depending on one's gender/nationality/accent. Suppose there are two groups of people using an ASR system, males and females. They receive approximately the same level of service when using the system at home. However, once they step into a different environment such as a noisy street, the quality of service drops notably for the female users, but does not drop noticeably for the male users. This is a violation of the principle of equality of outcomes (as seen for software systems) and more specifically, group fairness [14]. Such a scenario is unfair (violation of group fairness) because some groups enjoy a higher quality of service than others.

In our work, we aim to automate the discovery of this unfairness. We do this by simulating the environments in which the behaviour of ASR systems is likely to vary. The simulated environment is then applied to speech from different groups. Finally, we measure how different groups are served in different environments.

Formalising Fairness in ASRs: In this section, we formalise the notion of fairness in the context of automated speech recognition systems (ASRs). The fairness definition in ASRs is as follows:

$$|ASR_{Err}(GR_i) - ASR_{Err}(GR_j)| \le \tau \tag{1}$$

Here, GR<sub>i</sub> and GR<sub>j</sub> capture speech from distinct groups of people. If the error rates induced by ASR for group GR<sub>i</sub> (ASR<sub>Err</sub>(GR<sub>i</sub>)) and for group GR<sub>j</sub> (ASR<sub>Err</sub>(GR<sub>j</sub>)) differ beyond a certain threshold, we consider this scenario to be unfair. Such a notion of unfairness was studied in a recent work [26].

In this work, we want to explore whether different groups are fairly treated under varying conditions. Intuitively, we subject speech from different groups to a variety of simulated environments. We then measure the word error rates of the speech in such simulated environments and check if certain groups fare better than others. Formally, we capture the notion of fairness targeted by AequeVox as follows:

$$\begin{aligned} D_i &\leftarrow ASR_{Err}(GR_i) - ASR_{Err}(GR_i + \delta) \\ D_j &\leftarrow ASR_{Err}(GR_j) - ASR_{Err}(GR_j + \delta) \\ &|D_i - D_j| \leq \tau \end{aligned} \tag{2}$$

Here we perturb the speech of the two groups (GR<sub>i</sub> and GR<sub>j</sub>) by adding some δ to the speech. We compare the degradation in the speech (D<sub>i</sub> and D<sub>j</sub>). If the degradation faced by one group is far greater than the one faced by the other, we have a fairness violation. This is because speech from both groups ought to face similar degradation when subject to similar environments (simulated by δ perturbation) when equality of outcomes [36] holds. More specifically, this is a group fairness violation because the quality of service (outcome) depends on the group [14,51].

Algorithm 1 AequeVox Fairness Testing

```
procedure Fairness_Testing(GR_B, MT, GR_1, ..., GR_n, τ, ASR_1, ASR_2)
    Error_Set ← ∅
    for T ∈ MT do
        GR_B^T ← T(GR_B)
        ▷ L computes the average word-level Levenshtein distance
        ▷ between the outputs of ASR_1 and ASR_2
        d_B ← L(ASR_1(GR_B), ASR_2(GR_B))
        d_B^T ← L(ASR_1(GR_B^T), ASR_2(GR_B^T))
        D_B ← d_B^T − d_B
        for k ∈ (1, n) do
            GR_k^T ← T(GR_k)
            d_k ← L(ASR_1(GR_k), ASR_2(GR_k))
            d_k^T ← L(ASR_1(GR_k^T), ASR_2(GR_k^T))
            D_k ← d_k^T − d_k
            if D_B − D_k > τ then
                Error_Set ← Error_Set ∪ (GR_B, GR_k, T)
            end if
        end for
    end for
    return Error_Set
end procedure
```
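The criterion of Equation (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `degradation` and `is_fair` are hypothetical, and the error rates are taken as plain floats that have already been measured for each group.

```python
def degradation(err_original: float, err_perturbed: float) -> float:
    """Change in error rate caused by the perturbation delta.

    Uses the sign convention of Equation (2): original minus perturbed.
    """
    return err_original - err_perturbed


def is_fair(err_i, err_i_pert, err_j, err_j_pert, tau):
    """Return True when |D_i - D_j| <= tau, i.e. Equation (2) holds."""
    d_i = degradation(err_i, err_i_pert)
    d_j = degradation(err_j, err_j_pert)
    return abs(d_i - d_j) <= tau
```

Because the two degradations are compared through an absolute value, the sign convention chosen for the degradation does not change the outcome of the check.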

Example: To motivate our system, let us sketch out an example. Consider texts of approximately the same length spoken by two sets of speakers whose native languages are L<sub>1</sub> and L<sub>2</sub> respectively. Let us assume that both sets of speakers read out a text in English. AequeVox uses two ASR systems and obtains the transcript of this speech. AequeVox then employs differential testing to find the word-level Levenshtein distance [29] between these two sets of transcripts. Let us also assume that the average word-level Levenshtein distance is two and four for L<sub>1</sub> and L<sub>2</sub> native speakers, respectively.

AequeVox then simulates a noisy environment by adding noise to the speech and obtains the transcript of this transformed speech. Let us assume now that the average Levenshtein distance for this transformed speech is 4 and 25 for L<sub>1</sub> and L<sub>2</sub> native speakers, respectively. It is clear that the degradation for the speech of native L<sub>2</sub> speakers is much more severe. In this case, the quality of service that L<sub>2</sub> native speakers receive in noisy environments is worse than that received by L<sub>1</sub> native speakers. This is a violation of fairness which AequeVox aims to detect.
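As a worked check of the example above, taking the degradation as the transformed distance minus the original distance (the convention of Algorithm 1):

$$D_{L_1} = 4 - 2 = 2, \qquad D_{L_2} = 25 - 4 = 21, \qquad D_{L_2} - D_{L_1} = 19$$

With L<sub>2</sub> speakers as the base group, any threshold τ below 19 would flag this pair as a fairness violation. The distances here are the raw illustrative values from the example, not values normalised by transcript length.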

The working principle behind AequeVox holds even if the spoken text is different. This is because AequeVox just measures the relative degradation in ASR performance for a set of speakers. For large datasets, we are able to measure the average degradation in ASR performance with respect to different groups of speakers (e.g. male, female, native, non-native English speakers).

Fig. 2: Sound wave transformations

Fig. 3: AequeVox System Overview

Metamorphic Transformations of Sound: The ability to operate in a wide range of environments is crucial in ASR systems as they are deployed in safety-critical settings such as medical emergency services [18] and air traffic management [21,27], which are known to have interference and noise. Metamorphic speech transformations serve to simulate such scenarios. The key insight for our metamorphic transformations comes from how waves are represented and what can happen to these waves when they are transmitted in different mediums. We realise this insight in the fairness testing system for ASR systems. To the best of our knowledge, AequeVox is the first work that combines this insight from acoustics, software testing and software fairness to evaluate the fairness of ASR systems. AequeVox uses the addition of noise (Figure 2 (b)), amplitude modification (Figure 2 (c)), frequency modification (Figure 2 (d)), amplitude clipping (Figure 2 (e)), frame drops (Figure 2 (f)), low-pass filters (Figure 2 (g)), and high-pass filters (Figure 2 (h)) as metamorphic speech transformations. We choose these transformations because they are the most common distortions for sound in various environments [1].

System Overview: Algorithm 1 provides an outline of our overall test generation process. We realise the notion of fairness described in Equation (2) using differential testing. The error rates (ASR<sub>Err</sub>) for a particular speech clip are found by finding the difference between the outputs of two ASR systems, ASR<sub>1</sub> and ASR<sub>2</sub>. It is important to note that we make a design choice to use differential testing to find the error rate (ASR<sub>Err</sub>). This helps us eliminate the need for ground truth transcription data, which is both labor-intensive and expensive to obtain. Furthermore, AequeVox realises the δ seen in Equation (2) by using metamorphic transformations for speech (see Figure 3). These speech metamorphic transformations represent the various simulated environments for which AequeVox wants to measure the quality of service for different groups. Additionally, the user can customise this δ per their requirements. In our implementation we use eight distinct metamorphic transformations as δ (see Figure 2).

Specifically, we investigate how fairly two ASR systems (ASR<sub>1</sub> and ASR<sub>2</sub>) treat groups (GR<sub>k</sub><sup>T</sup> | k ∈ {1, 2, · · · n}) with respect to a base group (GR<sub>B</sub>). AequeVox achieves this by taking a dataset of speech which contains data from two or more different groups (e.g. male and female speakers, native English and non-native English speakers) and modifies these speech snippets through a set of transformations (MT). These are then divided into base group transformed speech (GR<sub>B</sub><sup>T</sup>) and the transformed speech for other groups (GR<sub>k</sub><sup>T</sup> | k ∈ {1, 2, · · · n}). As seen in Algorithm 1, the average word-level Levenshtein distance (word-level Levenshtein distance divided by the number of words in the longer transcript) between the outputs of the two ASR systems is captured by d<sub>B</sub> and d<sub>B</sub><sup>T</sup> for the original and transformed speech respectively. Similarly, for the comparison groups GR<sub>k</sub><sup>T</sup> (k ∈ {1, 2, · · · n}) the word-level Levenshtein distance is captured by d<sub>k</sub> and d<sub>k</sub><sup>T</sup>. The higher the Levenshtein distance, the larger the error in terms of differential testing. In other words, a larger error in differential testing means that the ASR systems disagree on a higher number of words.
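The normalised distance described above (word-level Levenshtein distance divided by the number of words in the longer transcript) can be sketched as follows. The function names are hypothetical, and lowercasing plus whitespace tokenisation are assumptions of this sketch; the paper does not state how transcripts are tokenised.

```python
def word_levenshtein(ref_words, hyp_words):
    """Edit distance between two word sequences (each edit costs 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(transcript_1, transcript_2):
    """Word-level distance divided by the word count of the longer transcript."""
    words_1 = transcript_1.lower().split()  # assumption: case-insensitive comparison
    words_2 = transcript_2.lower().split()
    longest = max(len(words_1), len(words_2))
    if longest == 0:
        return 0.0  # two empty transcripts agree trivially
    return word_levenshtein(words_1, words_2) / longest
```

For instance, two four-word transcripts that differ in a single word yield a distance of 0.25.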

To capture the degradation in the quality of service for the speech subjected to simulated environments (MT), we compute the difference between the word-level Levenshtein distance for the original and transformed speech. Specifically, we compute D<sub>B</sub> as d<sub>B</sub><sup>T</sup> − d<sub>B</sub> and D<sub>k</sub> as d<sub>k</sub><sup>T</sup> − d<sub>k</sub> (k ∈ {1, 2, · · · n}) for the base and comparison groups, respectively. The higher this metric (D<sub>B</sub> and D<sub>k</sub>), the more severe the degradation in ASR quality of service because of the transformation T.

We compare these metrics and if D<sub>B</sub> exceeds D<sub>k</sub> by some threshold τ, we classify this as an error for the base group (GR<sub>B</sub>) and more specifically a violation of fairness (see Figure 3). In our experiments we set each of the groups in our dataset as the base group (GR<sub>B</sub>) and run the AequeVox technique to find errors with respect to that base group. The lower the errors (as computed via the violation of the assertion D<sub>B</sub> − D<sub>k</sub> ≤ τ), the fairer the ASR systems are with respect to group GR<sub>B</sub>. As an example, let us say Russian speakers are the base group (GR<sub>B</sub>), English speakers are the comparison group (GR<sub>k</sub>) and the value of τ is 0.1. If D<sub>B</sub> exceeds D<sub>k</sub> by more than 0.1, then a fairness violation is counted for the Russian speakers. Otherwise, no fairness errors are recorded.
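The loop of Algorithm 1 can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the ASR systems, the transformations and the distance function are passed in as plain callables, and all names (`fairness_testing`, `average_disagreement`, the dictionary layout of `groups`) are assumptions of this sketch.

```python
from statistics import mean


def average_disagreement(clips, asr_1, asr_2, distance):
    """Mean distance between the two ASR transcripts over a group's clips."""
    return mean(distance(asr_1(clip), asr_2(clip)) for clip in clips)


def fairness_testing(base_name, groups, transformations, tau, asr_1, asr_2, distance):
    """Return (base group, comparison group, transformation) triples with D_B - D_k > tau.

    groups: dict mapping a group name to a non-empty list of speech clips
    transformations: dict mapping a transformation name to a function clip -> clip
    """
    error_set = set()
    base_clips = groups[base_name]
    d_base = average_disagreement(base_clips, asr_1, asr_2, distance)
    for t_name, transform in transformations.items():
        base_transformed = [transform(clip) for clip in base_clips]
        # D_B: degradation of the base group under this transformation
        deg_base = average_disagreement(base_transformed, asr_1, asr_2, distance) - d_base
        for name, clips in groups.items():
            if name == base_name:
                continue
            d_k = average_disagreement(clips, asr_1, asr_2, distance)
            transformed = [transform(clip) for clip in clips]
            # D_k: degradation of the comparison group
            deg_k = average_disagreement(transformed, asr_1, asr_2, distance) - d_k
            if deg_base - deg_k > tau:
                error_set.add((base_name, name, t_name))
    return error_set
```

Running the function once per group, each time with a different `base_name`, mirrors the experimental setup in which every group takes a turn as the base group.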

Fault Localisation: AequeVox introduces a word-level fault localisation technique, which does not require any access to ground truth data. We first illustrate a use case of this fault localisation technique.

#### Algorithm 2 AequeVox Fault Localizer


Fig. 4: AequeVox Fault Localization Overview

Example: Let us consider a corpus of English sentences by a group of speakers (say GR) who speak language L<sub>1</sub> natively. AequeVox builds a dictionary for all the words in the transcript obtained from ASR<sub>1</sub>. An excerpt from such a dictionary appears as follows: {brother : 16, nice : 25, is : 33, · · · }. This means the words brother, nice and is were seen 16, 25 and 33 times in the transcript respectively. Now, assume AequeVox simulates a noisy environment by adding noise with various signal-to-noise ratios (SNR) as follows: {10, 8, 6, 4, 2}. This is the parameter for the transformation (param<sub>T</sub>).

Once AequeVox obtains the transcript of these transformed inputs, it creates dictionaries similar to the ones seen in the preceding paragraph. Let the relevant subset of the dictionary for SNR two (2) be {brother : 1, nice : 23, is : 32, · · · }. We use this to determine that the utterance of the word brother is not robust to noise addition for the group GR. This is because the word brother appears significantly less often in the transcript for the modified speech, as compared to the transcript for the original speech.

AequeVox fault localisation overview: Algorithm 2 provides an overview of the fault localization technique implemented in AequeVox. The goal of the AequeVox fault localisation is to find words for a group (GR) that are not robust to the simulated environments. Specifically, AequeVox finds words which are not recognised by the ASR when subjected to the appropriate speech transformations.

The transformation is represented by T<sub>θ</sub>. Here, T ∈ MT is the transformation and θ ∈ param<sub>T</sub> is the parameter of the transformation, which controls the severity of the transformation.

As seen in Algorithm 2, AequeVox builds the word count dictionaries WC and WC<sub>T<sub>θ</sub></sub> for the original speech and for each θ ∈ param<sub>T</sub>, respectively. For each word, AequeVox finds the difference in the number of appearances of the word in WC and in WC<sub>T<sub>θ</sub></sub> for θ ∈ param<sub>T</sub>. To compute the difference, we locate the minimum number of appearances across all the transformation parameters θ ∈ param<sub>T</sub> (i.e. min\_count in Algorithm 2). This is to locate the worst-case degradation across all transformation parameters. The difference is then calculated between min\_count and the number of appearances of the word in the original speech (i.e. init\_count). If the difference exceeds some user-defined threshold ω, then AequeVox classifies the respective words as non-robust w.r.t. the group GR and transformation T.
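The word-count comparison described above can be sketched as follows. This is a hedged reading of the prose rather than a transcription of Algorithm 2: the names `word_counts` and `non_robust_words`, the whitespace tokenisation, and the use of an absolute count difference against ω are assumptions of this sketch.

```python
from collections import Counter


def word_counts(transcripts):
    """Count how often each word appears across a list of transcripts."""
    counts = Counter()
    for transcript in transcripts:
        counts.update(transcript.lower().split())
    return counts


def non_robust_words(original_transcripts, transformed_by_param, omega):
    """Words whose count drops by more than omega under the worst parameter.

    transformed_by_param: dict mapping a transformation parameter theta to the
    list of transcripts obtained for speech transformed with that theta.
    """
    wc_original = word_counts(original_transcripts)
    wc_transformed = [word_counts(t) for t in transformed_by_param.values()]
    flagged = {}
    for word, init_count in wc_original.items():
        # Worst case over all parameters; a word missing from a transcript counts as 0.
        min_count = min((wc[word] for wc in wc_transformed), default=init_count)
        drop = init_count - min_count
        if drop > omega:
            flagged[word] = drop
    return flagged
```

The returned dictionary maps each flagged word to its worst-case drop in count, which gives a practitioner a simple ranking of the words to inspect first.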

We envision that practitioners can then review the data generated by fault localization (i.e. Algorithm 2) and target the non-robust words to further improve their ASR systems for speech from underrepresented groups [24] and accommodate speech variability [22]. In RQ3, we validate our fault localization method empirically and in RQ4, we show how the proposed fault localization method can be used to highlight fairness violations.

## 4 Datasets and Experimental Setup

ASR Systems under Test: We evaluate AequeVox on three commercial ASR systems from Google Cloud Platform (GCP), IBM Cloud, and Microsoft Azure. We use the standard models for GCP and Azure, and the BroadbandModel for IBM. In all three cases, the audio samples were identically encoded as .wav files using Linear 16 encoding.

In each of the following transformations, we vary a parameter, θ. We call this the transformation parameter. Some of the transformations have abbreviations within parentheses. Such abbreviations are used in later sections to refer to the respective transformations.

Amplitude Scaling (Amp): For amplitude scaling, we scale the audio sequence by a constant by multiplying each individual audio sample by θ.

Clipping: The audio samples are scaled such that their amplitude values are bound by [−1, 1]. AequeVox then clips these samples such that the amplitude range is [−θ, θ]. These clipped samples are then rescaled and encoded.
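Amplitude scaling and clipping can be sketched on a plain list of float samples as below. The function names are hypothetical. The text says that clipped samples are "rescaled" without giving the target range, so rescaling back to [−1, 1] is an assumption of this sketch.

```python
def amplitude_scale(samples, theta):
    """Multiply every audio sample by the constant theta."""
    return [theta * s for s in samples]


def clip(samples, theta):
    """Normalise to [-1, 1], clip to [-theta, theta], then rescale to [-1, 1]."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to clip
    normalised = [s / peak for s in samples]
    clipped = [max(-theta, min(theta, s)) for s in normalised]
    # Assumption: "rescaled" means stretching the clipped range back to [-1, 1].
    return [s / theta for s in clipped]
```

Smaller values of θ flatten more of the waveform, so in this sketch θ = 1 leaves the normalised signal unchanged.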

Drop/Frame: For Drop, AequeVox divides the audio into 20 ms chunks. θ% of these chunks are then randomly discarded (amplitude set to zero) from the audio. For Frame, AequeVox divides the audio into θ ms chunks and 10% of these chunks are then randomly discarded. No two adjacent chunks are discarded.
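Both Drop and Frame reduce to zeroing a random, non-adjacent subset of fixed-length chunks, which the sketch below illustrates. The function name, the seeded random generator, and the greedy way of enforcing non-adjacency are assumptions of this sketch; the text does not say how the chunks are selected.

```python
import random


def drop_chunks(samples, sample_rate, chunk_ms, drop_fraction, seed=0):
    """Zero out a random, non-adjacent subset of fixed-length chunks.

    Drop:  chunk_ms = 20,    drop_fraction = theta / 100
    Frame: chunk_ms = theta, drop_fraction = 0.10
    """
    chunk_len = max(1, int(sample_rate * chunk_ms / 1000))
    n_chunks = (len(samples) + chunk_len - 1) // chunk_len
    target = int(n_chunks * drop_fraction)
    rng = random.Random(seed)
    order = list(range(n_chunks))
    rng.shuffle(order)
    dropped = set()
    for idx in order:
        if len(dropped) == target:
            break
        # Skip a candidate if one of its neighbours was already dropped.
        if idx - 1 not in dropped and idx + 1 not in dropped:
            dropped.add(idx)
    out = list(samples)
    for idx in dropped:
        start = idx * chunk_len
        end = min(start + chunk_len, len(out))
        out[start:end] = [0.0] * (end - start)
    return out
```

The greedy selection can fall short of the target when the requested fraction approaches one half, since at most every other chunk can be dropped without creating adjacent gaps.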

High Pass (HP)/Low Pass (LP) Filter: Here we apply a Butterworth [7] filter of order two to the entire audio file with θ determining the cut-off frequency.
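A second-order Butterworth filter can be written by hand as a single biquad section, which the sketch below does with the standard library only. This is an illustration of the filter type, not the authors' code (the text does not name the library used); the function names are hypothetical and the coefficients follow the common bilinear-transform biquad design with Q = 1/√2.

```python
import math


def butterworth_biquad(cutoff_hz, sample_rate, kind="low"):
    """Normalised coefficients (b, a) of a second-order Butterworth filter."""
    if not 0.0 < cutoff_hz < sample_rate / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # Q = 1/sqrt(2) gives the Butterworth response
    if kind == "low":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    elif kind == "high":
        b = [(1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0, a1, a2 = 1.0 + alpha, -2.0 * cos_w0, 1.0 - alpha
    return [x / a0 for x in b], [a1 / a0, a2 / a0]


def apply_filter(samples, b, a):
    """Direct-form I: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

A quick sanity check is the response to a constant signal: the low-pass variant should settle at the input value, while the high-pass variant should settle at zero.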

Noise Addition (Noise): θ represents the signal-to-noise ratio (SNR) [25] of the transformed audio signal. A lower θ means higher noise in the transformed audio.
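Noise addition at a target SNR can be sketched as below. Two points are assumptions of this sketch and are not stated in the text: that θ is expressed in decibels, and that the added noise is white Gaussian noise. The function names are hypothetical.

```python
import math
import random


def add_noise(samples, snr_db, seed=0):
    """Add white Gaussian noise so that the SNR is approximately snr_db (in dB)."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((y - s) ** 2 for y, s in zip(noisy, clean)) / len(clean)
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)
```

Because the noise power is derived from the signal power, a lower target SNR yields a larger noise standard deviation, matching the statement that a lower θ means higher noise.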

Frequency Scaling (Scale): In this case, θ is the sampling frequency. The lower the value of θ, the slower the audio. In this transformation, the audio is slowed down θ times.


Table 2: Transformations Used

Table 2 lists all the different values used for θ. An additional parameter (θ = 2.0) is used for Amp.

Datasets: We use the Speech Accent Archive (Accents) [54], the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [30], Multi speaker Corpora of the English Accents in the British Isles (Midlands) [11], and a Nigerian English speech dataset [2] to evaluate AequeVox, taking care to ensure male and female speakers are equally represented. Table 3 provides additional details about the setup.

#### 5 Results

In this section, we discuss our evaluation of AequeVox in detail. In particular, we structure our evaluation in the form of four research questions (RQ1 to RQ4). The analysis of these research questions appears in the following sections.

#### RQ1: What is AequeVox's efficacy?

We structure the analysis of this research question into three sections, each corresponding to a dataset we have used in our analysis. All of the relevant data is presented in Table 4 with the lowest errors for each dataset bolded. We first analyse the number of errors (used interchangeably with fairness violations) for each case. Subsequently, we analyse the sensitivity of the errors with respect to the values of τ (τ ∈ {0.01, 0.05, 0.1, 0.15}). Detecting violations of fairness is regulated by parameter τ. Lower values of τ imply that the degradation of word error rates between two groups should be similar, and conversely higher values of τ allow for the difference in degradation of word error rates to be more severe between two groups. Next, we analyse the sensitivity of the pairs of the ASR systems under test. Concretely, we analyse the errors found in the Microsoft Azure and IBM Watson (MS\_IBM), Google Cloud and IBM Watson (IBM\_GCP), and Microsoft Azure and Google Cloud (MS\_GCP) pairs. Finally, we analyse the sensitivity of the AequeVox test generation with respect to the eight different types of transformations implemented (see Figure 2).


Table 4: Errors Discovered by AequeVox

It is important to note that we excluded the two most destructive Scale transformations. This is because the word error rate for these transformations is 0.89 on average out of 1. This degradation may be attributed to the transformation itself rather than the ASR. To avoid such cases, we exclude these transformations from this research question.

Accents Dataset: Native English speakers and Indonesian speakers have the lowest number of errors. On average, speech from non-native English speakers generates 109% more errors in comparison to speech from native English speakers. For the two smallest values of τ, speech from the native English speakers shows the least number of fairness violations. Speech from native English speakers has the lowest, second lowest and third lowest errors for the pairs of ASRs, (MS\_IBM), (MS\_GCP) and (IBM\_GCP) respectively. Speech from native English speakers has the lowest errors for the clipping, two types of frame drops and noise transformations and the second lowest errors for the low-pass filter transformation. The high-pass filter and scaling induce a comparable number of errors from native and a majority of the non-native English speakers. However, speech from native English speakers has the highest number of errors when subject to the amplitude transformation.

Speech from non-native English speakers generally exhibits more fairness violations in comparison to speech from native English speakers.

RAVDESS Dataset: Speech from male speakers has significantly lower errors than speech from female speakers. On average, speech from female speakers generates 528.57% more errors in comparison to speech from male speakers. Speech from male speakers shows significantly fewer fairness violations for all values of τ, and for all ASR pairs tested. Clipping, both types of frame drops, noise, low-pass, amplitude and high-pass transformations induce significantly fewer errors on speech from male speakers. However, speech from male speakers has more errors when subject to scale transformations.

Speech from female speakers has significantly more fairness violations in comparison to speech from male speakers.

Midlands/Nigeria Dataset: Speech from UK Midlands English (ME) speakers has significantly fewer errors than speech from Nigerian English (NE) speakers. On average, speech from NE speakers generates 156.9% more errors in comparison to speech from ME speakers. Speech from ME speakers has significantly fewer fairness errors for all values of τ, and for all ASR pairs tested. For the transformations scale, drop, noise, amplitude, low pass and high pass filters, the speech from ME speakers has significantly fewer errors than speech from NE speakers. Clipping induces more errors in speech from ME speakers, while the frame transformation induces a comparable number of errors in speech from both groups.

Speech from Nigerian English speakers has significantly more fairness errors in comparison to speech from UK Midlands speakers.

#### RQ2: What are the effects of transformations on comprehensibility?

To better understand the effects of the transformations (see Figure 2) on the comprehensibility of the speech, we conducted a user study. Speech of one randomly chosen female native English speaker from the Accents [54] dataset was used since the audio contains nearly all the sounds present in the English language [54]. Survey participants were presented with the original audio file along with a set of transformed speech files in order of increasing intensity. All the transformations (see Figure 2) and transformation parameters (see Table 2) were used. We asked 200 survey participants (sourced through Amazon mTurk) the following question:

*How comprehensible is (transformed) Speech with respect to the Original speech?*

Fig. 5: Average Transformation Comprehensibility Ratings

The rating of one (1) is Not Comprehensible at all and the rating of ten (10) is Just as Comprehensible as the Original.

Unsurprisingly, as seen in Figure 5, increasing the intensity of a transformation had a generally detrimental effect on the comprehensibility of the speech. However, none of the transformations severely affects comprehensibility. All of the transformations had an average comprehensibility rating above 6.75, and 82.9% of the transformations had a comprehensibility rating above 7.


Table 5: Fairness errors where the transformations have a comprehensibility rating of at least 7.2

The average degradation in comprehensibility for the least destructive parameter across all transformations was 24.36%. Noise was the most destructive at 27.75% and drop was the least destructive (20.96%).

The average degradation in comprehensibility for the most destructive parameter across all transformations was 29.18%. In this case, scaling was the most destructive at 32.23% whereas drop was the least destructive with 25.88%.

Additionally, for each transformation, we analyse the percentage drop of comprehensibility between the least and the most destructive transformation parameters. The average drop is 4.82% across all transformations. The scaling and drop transformations show high relative percentage drops of 10.05% and 8.32% respectively. Amplitude, clipping, noise, high-pass and low-pass filters show closer to average drops between 3.1% and 4.5%. Frame, on the other hand, shows very low relative drops at 0.76%.

All the transformations, though destructive, are comprehensible by humans.

For safety-critical applications, we recommend that future work test the whole gamut of transformations. For other use cases, practitioners may choose the transformations that satisfy their needs. To aid this, AequeVox allows users to choose the comprehensibility threshold of the transformations. As seen in Table 5, our conclusion holds even if we choose only the transformations with a higher comprehensibility threshold (7.2). We highlight the group with the fewest errors in each dataset to aid readability. In particular, we observe that speech from male and UK Midlands speakers generally exhibits fewer errors. Setting aside speech from native Gujarati speakers, speech from native English speakers exhibits comparable or better performance than speech from other groups.

#### RQ3: Are the outputs produced by AequeVox fault localiser valid?

To study the validity of the outputs of the fault localiser, we study the number of errors for the predicted robust and non-robust words. We do this by generating speech containing the predicted robust and non-robust words for each ASR tested. We choose an ω of three, three and two for GCP, MS Azure and IBM respectively to choose the non-robust words (see Algorithm 2). We choose the robust words from the set of words that do not show any errors in the presence of noise (count\_diff = 0 in Algorithm 2) for these specific ASR systems. Specifically, we test whether the robust and non-robust words identified by the fault localiser in the Accents dataset are robust in the presence of noise. Our goal is to show that if noise is added to speech containing these non-robust words, the ASR will be less likely to recognise them. Conversely, if noise is added to speech containing the predicted robust words, they are less likely to be affected.

To generate the speech from the output we generate sentences containing the robust and non-robust words predicted by the fault localiser for each ASR using a grammar and then use a text-to-speech (TTS) service to generate speech. The actual randomly selected robust and non-robust words (in bold) and the examples of the sentences generated by the grammar can be seen in Table 6. We use the Google TTS for MS Azure and we use the Microsoft Azure TTS for GCP and IBM to generate the speech.

To evaluate the generality of the outputs of the fault localisation technique, we use the speech produced by the TTS and then add noise to that speech. This speech is used to generate a transcript from the ASR, and the transcript is used to evaluate how many of the predicted robust and non-robust words are incorrect in the transcript. We add the most noise possible to the TTS speech in our AequeVox framework. Specifically, the signal-to-noise ratio (SNR) is 2. We use the TTS-generated speech for 50 sentences for each of the robust and non-robust cases. Each sentence has either a robust or a non-robust word.
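As an illustration of what adding noise at a fixed SNR involves, the following is a minimal sketch and not the AequeVox implementation. It assumes white Gaussian noise and treats the SNR as a linear power ratio; the text does not state whether the value 2 is linear or in decibels. The function names and the synthetic sine-wave input are hypothetical.

```python
import math
import random


def add_noise_at_snr(signal, snr, rng):
    """Return `signal` plus white Gaussian noise scaled to a target SNR.

    Assumption: `snr` is a linear power ratio (signal power / noise power).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that signal_power / scaled_noise_power == snr.
    scale = math.sqrt(signal_power / (snr * noise_power))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr(clean, noisy):
    """Recover the linear SNR from a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy)) / len(clean)
    return signal_power / noise_power


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr=2.0, rng=rng)
```

Because the noise is rescaled against its own measured power, the achieved ratio matches the target up to floating-point error, which `measured_snr(clean, noisy)` can confirm.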

The results of the experiments are seen in Table 7. In the transcript of the speech with noise added at SNR 2, the predicted robust words show zero errors for Microsoft and Google Cloud and 21 errors for IBM. The predicted non-robust words, on the other hand, had 23, 15 and 30 errors.
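The counts above can be cross-checked against the aggregate figure of 223.8% more errors quoted in the conclusion. The sketch below only uses the numbers given in the text; the order of the ASRs for the non-robust counts is not stated there, so only the totals are compared.

```python
# Error counts as reported in the text (Table 7 itself is not reproduced here).
robust_errors = [0, 0, 21]       # Microsoft, Google Cloud, IBM
non_robust_errors = [23, 15, 30] # per-ASR order not stated in the text


def percent_more(larger, smaller):
    """Relative increase of `larger` over `smaller`, in percent."""
    return 100.0 * (larger - smaller) / smaller


total_robust = sum(robust_errors)
total_non_robust = sum(non_robust_errors)
increase = percent_more(total_non_robust, total_robust)
print(f"{total_non_robust} vs {total_robust} errors: {increase:.1f}% more")
```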


The predicted non-robust words have a higher propensity for errors than the predicted robust words.

Note on grammar validity: Since the grammars we use to validate the explanations of AequeVox are handcrafted, they may be prone to errors. To verify these handcrafted grammars, we use 100 sentences produced by each grammar and use the online tool Grammarly [3] to investigate the semantic and syntactic correctness and the clarity of the sentences. The sentences generated by the grammars have a high overall average score of 98.33 out of 100, with the lowest being 96 (see Table 8). On the correctness and clarity measure, all the sentences generated by the grammars score Looking Good and Very Clear.

Table 9: Average word mispredictions in the Accents dataset using the AequeVox localisation techniques

#### RQ4: Can the fault localiser be used to highlight unfairness?

The goal of this RQ is to investigate whether the output of Algorithm 2 can call attention to bias between different groups. Specifically, we evaluate whether some groups show fewer faults, on average, than others. To this end, we use the fault localisation algorithm (Algorithm 2) on the Accents dataset and record the average number of incorrect words in the transcript for each group of the dataset. This is done for each ASR under test. It is also important to note that this technique uses no ground truth data and requires no manual input. This technique is designed to work with just the speech data and metadata (groups).

Table 9 shows the average word drops across all transformations for the Accents dataset for each ASR under test. We highlight the best performing groups by bolding the values. Speech from native English speakers shows the lowest average word drops for the IBM Watson ASR and the third lowest for the GCP and MS Azure ASRs. We also investigate the average word drops for each transformation in AequeVox averaged across all ASRs. Speech from native English speakers has the lowest average word drops for the clipping, two types of frame drops and noise transformations and the second lowest errors for the low-pass filter transformation (see Table 9). For the rest of the transformations, namely amplitude, high-pass filter and scaling, we find that speech from non-native English speakers and speech from native English speakers have comparable average word drops in the majority of cases (see Table 9). This result is consistent with the results seen in RQ1.

The technique seen in Algorithm 2 can be used to highlight bias in speech and the results are consistent with RQ1.

## 6 Threats to Validity

User Study: In conducting the study, two assumptions were made. Firstly, we assume that the degree to which comprehensibility changes when subject to transformations is independent of the characteristics of the speaker's voice. Secondly, we assume that the speech is reflective of the broader English language. In future work, a larger scale user study could be performed to verify the results.

ASR Baseline Accuracy: AequeVox measures the degradation of the speech to characterise the unfairness amongst groups and ASR systems. If the baseline error rate is very high, then the room for further degradation is very low. As a result, AequeVox expects ASR services to have a high baseline accuracy. To mitigate this threat, we use state-of-the-art commercial ASR systems which have high baseline accuracies.

Completeness and Speech Data: AequeVox is incomplete, by design, in the discovery of fairness violations. AequeVox is limited by the speech data and the groups of this speech data used to test these ASR systems. With new data and new groups, it is possible to discover more fairness violations. The practitioners need to provide data to discover these. In our view, this is a valid assumption because the developers of these systems have a large (and growing) corpus of such speech data. It is also important to note that AequeVox does not need the ground truth transcripts for this speech data and such speech data is easier to obtain.

Fault Localisation: To test AequeVox's fault localisation, we identify the robust and non-robust words in the speech and subsequently construct sentences (with the aid of a grammar). These sentences are then converted to speech using text-to-speech (TTS) software and the performance of the robust and non-robust words is measured. In the future, we would like to repeat the same experiment with a fixed set of speakers, which would allow us to capture the peculiarities of speech in contrast to the usage of TTS software.

## 7 Related Work

In the past few years, there has been significant attention to testing ML systems [35,48,32,47,55,34,50,40,56,16,52,8,41,19]. Some of these works target coverage-based testing [48,55,34,32] or leverage property-driven testing [41], while others focus on effective testing in targeted domains, e.g. text [50,40]. None of these works, however, are directly applicable for testing ASR systems. In contrast, the goal of AequeVox is to automatically discover violations of fairness in ASR systems without access to ground truth data.

DeepCruiser [13] uses metamorphic transformations and performs coverage-guided fuzzing to discover transcription errors in ASR systems. Concurrently, CrossASR [5] uses text to generate speech from a TTS engine and subsequently employs differential testing to find bugs in the ASR system. In contrast to these systems, the goal of AequeVox is to automatically find violations of fairness by measuring the degradation of transcription quality from the ASR when the speech is transformed. AequeVox compares this degradation across various groups of speakers and if the difference is substantial, AequeVox characterises this as a fairness violation. Moreover, AequeVox neither requires access to manually labelled speech data nor does it require any white/grey box access to the ASR model. Works on audio adversarial testing [23], [10], [9], [37], [28] aim to find imperceptible perturbations that are specially crafted for an audio file. In contrast, AequeVox aims to find fairness violations. Additionally, AequeVox also proposes automatic fault localisation for ASR systems without using a ground truth transcript.

Unlike AequeVox, recent works on fairness testing have focused on credit rating [17,49,4,57,42,44,43,41], computer vision [12,6] or NLP systems [33,45]. In the systems that deal with such data, it is possible to isolate certain sensitive attributes (gender, age, nationality) and test for fairness based on these attributes. It is challenging to isolate such sensitive attributes in speech data, necessitating a separate fairness testing framework specifically for speech data.

Frameworks such as LIME [38], SHAP [31], Anchor [39] and DeepCover [46] attempt to reason why a model generates a specific output for a specific input. In contrast to this, AequeVox's fault localisation algorithm identifies utterances spoken by a group which are likely not to be recognised by ASR systems in the presence of a destructive interference (such as noise). Recent fault localization approaches aim to highlight either the neurons [15] or the training code [53] that are responsible for a fault during inference. In contrast, AequeVox highlights words that are likely to be transcribed wrongly without having any access to the ground truth transcription and with only blackbox access to the ASR system.

## 8 Conclusion

In this work, we introduce AequeVox, an automated fairness testing technique for ASR systems. To the best of our knowledge, this is the first work that explores considerations beyond error rates for discovering fairness violations. We also show through a user study that the speech transformations used by AequeVox are largely comprehensible. Additionally, AequeVox highlights words where a given ASR system exhibits faults, and we show the validity of these explanations. These faults can also be used to identify unfairness in ASR systems.

AequeVox is evaluated on three ASR systems and we use four distinct datasets. Our experiments reveal that speech from non-native English, female and Nigerian English speakers exhibits more errors, on average, than speech from native English, male and UK Midlands speakers, respectively. We also validate the fault localization embodied in AequeVox by showing that the predicted non-robust words exhibit 223.8% more errors than the predicted robust words across all ASRs.

We hope that AequeVox drives further work on systematic fairness testing of ASR systems. To aid future work, we make all our code and data publicly available at github.com/sparkssss/AequeVox and 10.5281/zenodo.5897347

## References



## SMT-Based Planning Synthesis for Distributed System Reconfigurations

Simon Robillard<sup>1</sup> and Hélène Coullon<sup>2</sup>

<sup>1</sup> LIRMM, CNRS, Université de Montpellier, France <sup>2</sup> IMT Atlantique, Inria, LS2N, Nantes, France

Abstract. Large distributed systems with an emphasis on adaptability are now considered a necessity in many domains, yet reconfiguration of these systems is still largely carried out in an ad hoc fashion, a process that is both inefficient and error-prone. In this paper, we tackle the planning problem for the reconfiguration of distributed systems in the component-based reconfiguration model Concerto. Specifically, given some tasks to execute and a desired final state of the system, we show how to compute a reconfiguration plan that guarantees satisfaction of inter-component dependencies and is also optimized for parallel execution. Our technique relies on an SMT solver to compute the required dependencies between components and ultimately schedule the reconfiguration. We illustrate the use of this technique on a variety of synthetic examples as well as a real use case in the context of an OpenStack system.

Keywords: reconfiguration, planning, synthesis, component models, distributed systems

## 1 Introduction

Large distributed software systems are now ubiquitous, with component-based systems (e.g., service-oriented architectures or microservices) offering a convenient way to structure large applications. Indeed, isolating functionalities in components and building systems through composition greatly enhances adaptability and scalability of applications, two important requirements for many organizations. This approach is also promoted by the massive adoption of highly distributed computing infrastructures such as cloud and edge computing.

However, the advantages of distributed architectures come at the price of increased complexity and technical challenges related to observability, coordination, maintenance, etc. Notably, the system reconfigurations that are required to achieve adaptability commonly lead to faults. For example, a study of 597 unplanned outages that affected popular cloud services between 2009 and 2015 found that 16% of them were caused by a software or hardware upgrade [16]. The study concludes that "the complexity of cloud hardware and software ecosystem has outpaced existing testing, debugging, and verification tools". Indeed, testing and debugging methods are largely inadequate in the context of distributed systems, while the adoption of more suitable formal methods remains marginal in industry. The latter can be attributed to the difficulty of using formal methods and tools. Yet formal methods can lighten the burden of program developers and system administrators instead of adding to it, with synthesis techniques used to generate correct-by-construction programs. In that spirit, we propose to employ a Satisfiability Modulo Theories (SMT) solver to automate the planning of reconfigurations (deployment, migrations, software updates, etc.) of component-based systems, i.e., to generate programs that coordinate the non-functional operations required to perform such reconfigurations. There have been some attempts to synthesize reconfiguration programs for component-based systems (some of them relying on an SMT solver), but they either target ad hoc, non-executable models [20], or are limited to specific cases such as deployment [22], where the problem of executing parallel tasks is reduced to finding a precedence order. In contrast, our work targets the full scope of the component-based reconfiguration model Concerto [9], which provides a formally-defined execution model with expressive constraints on parallelism, as well as a concrete execution engine, making it suitable for formal analysis and experimental work.

In Concerto, reconfigurations are driven by asynchronous behavior requests to components. The execution of a behavior may depend on the state of other components: such dependencies are denoted by ports that form the interface of components, indicating their provisions and requirements towards each other. Section 2 gives an overview of Concerto; for a more complete presentation, the reader can refer to [9]. Our goal with this work is to automatically generate reconfiguration scripts for systems of Concerto components, i.e., to determine the required behaviors and coordinate their execution. We take as a starting point a reconfiguration goal composed of behaviors to execute over some components and a specification of the final state of the system, particularly the statuses of ports. That goal may be provided by a system administrator, or could have been generated in the context of an autonomic control loop [19]. Importantly, it is a partial specification that typically only mentions parts of the system. For example, an administrator may specify only that a certain utility component should execute a behavior to update its software, whereas the completion of this task actually requires other components to suspend and later resume their activity.

Since a reconfiguration goal can require changes in any component of a system, the search space for reconfiguration scripts grows rapidly with the number of components. To synthesize reconfigurations for large systems, we propose a novel technique that takes advantage of the nature of component-based models. It first solves the problem for each component individually, by considering the internals of the component to find relevant behaviors, under the simplifying assumption that external requirements are all satisfied. Later the method coordinates behaviors over the whole system, relying on a first-order encoding of the scheduling problem and making use of the model-finding capabilities of an SMT solver. If this step fails due to unsatisfied dependencies, individual component reconfiguration goals are refined and the process iterated. Section 3 describes this method, and Section 4 measures its performance and scalability on a variety of synthetic examples, and illustrates its applicability on a real use case.

## 2 Reconfiguration with Concerto

Components and Assemblies. A distributed system in Concerto is represented as an assembly, i.e., a collection of components that correspond to control entities for the elements of the system. Components are not intended to represent the functional aspects of those elements, but instead to pilot the actions (installation, maintenance, suspension of service, etc.) required to operate them during their lifespan. In other words, a Concerto component is a wrapper around a new or legacy piece of software (e.g., service, module), typically written by its developer, that acts as replacement for scripts to install and maintain it.

The structural interface of a component is provided by its provide ports and use ports. Provide ports denote services or data provided by that component when those ports are active, while use ports denote requirements that the component has when those ports are active. Ports can be connected in an assembly to allow the satisfaction of component requirements. Connected ports impose synchronization rules between their components: a use port cannot be activated unless connected to an active provide port (the user component may have to wait for that requirement to be fulfilled in order to continue its internal activity) and a provide port cannot be deactivated while connected to an active use port.
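The two synchronization rules on connected ports can be written as simple predicates. This is a minimal sketch, not part of the Concerto engine: the `Port` and `Connection` classes and the port names are hypothetical, and a use port is assumed to be satisfied by any one active provide port it is connected to.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    active: bool = False


@dataclass
class Connection:
    use: Port      # the requiring side
    provide: Port  # the providing side


def can_activate_use(port, connections):
    """A use port may be activated only if connected to an active provide port."""
    return any(c.use is port and c.provide.active for c in connections)


def can_deactivate_provide(port, connections):
    """A provide port may be deactivated only if no connected use port is active."""
    return not any(c.provide is port and c.use.active for c in connections)


# Hypothetical example: one provide port connected to one use port.
dep1_service = Port("service", active=True)
server_use = Port("service1")
links = [Connection(use=server_use, provide=dep1_service)]
```

With these definitions, `can_activate_use(server_use, links)` holds while the provide port is active, and once the use port becomes active, `can_deactivate_provide(dep1_service, links)` no longer holds.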

Internally, components are characterized by places representing milestones in the life cycle, and transitions between places, mapped to concrete reconfiguration actions (e.g., starting a virtual machine, downloading an image, etc.). The internal state of a component is given by its places: at any point during execution, one or more places are active. While a place π is active, transitions originating from it can be (simultaneously) fired, after which π ceases to be active. Conversely, a place π′ becomes active after the completion of all the transitions that reach it. The completion of a transition takes a non-deterministic duration after firing, modeling the execution of the associated action. Active places also determine the statuses of ports: each port is bound to a set of places, and is active whenever one of them is active. Thus the status of ports changes according to the life cycle of the component. In graphic representations, ports are linked to the place (or set of places, denoted by rounded boxes) to which they are bound.

The last characteristic attribute of a component is its set of behaviors. A behavior is a subset of the transitions in a component, such that the associated subgraph is acyclic. At any point in an execution, a component may execute one behavior. Only then can the transitions in that behavior be fired. The behaviors of a component serve as its operational interface: a component may have one behavior including the actions to start it, another including the actions to update it, etc. A component can be requested to execute a behavior, which will determine its evolution and the actions that it performs. Graphically, different behaviors are represented by depicting transitions in different colors.

Figure 1 gives a graphic representation of an assembly. Component dep1 includes three places (uninstalled, installed and running) and three transitions (arrows between places) that belong to three behaviors (deploy, update, and uninstall). Place running is active (denoted by a token) and bound to provide port service, whereas places installed and running are bound to provide port config. Both ports are connected to use ports belonging to server.
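The binding of ports to places described for dep1 can be made concrete with a few lines of code. The sketch below encodes only the bindings stated in the text (service bound to running, config bound to installed and running); the dictionary layout and the function name are illustrative assumptions.

```python
# Port bindings of component dep1 as described in the text.
DEP1_BINDINGS = {
    "service": {"running"},
    "config": {"installed", "running"},
}


def port_status(bindings, active_places):
    """A port is active whenever at least one of its bound places is active."""
    return {port: bool(places & active_places) for port, places in bindings.items()}


# In Figure 1, place running holds the token, so both provide ports are active.
print(port_status(DEP1_BINDINGS, {"running"}))
# If only installed were active, config would stay active but service would not.
print(port_status(DEP1_BINDINGS, {"installed"}))
```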

Fig. 1: A Concerto assembly with three components. For readability, the bindings of ports config1 and config2 are only partially depicted: they also contain places configured, running, s1 and s2.

Reconfiguration Scripts. Concerto is equipped with a simple language to execute reconfigurations. Whereas a Concerto component is written by a developer, the reconfiguration language is intended to be used by system administrators or DevOps engineers. Components are piloted through asynchronous requests via the command pushB(id, b) that asks the component identified by id to execute behavior b. The command takes its name from the fact that requests received by a component are queued and asynchronously executed by that component in the order in which they were received. While a component executes a behavior request, transitions in that behavior are fired until the component reaches a state where none of them can be fired. The behavior request is then considered complete, and the component executes the next one, until no more requests remain. The Concerto language also provides synchronization commands: wait(id) blocks the execution of the reconfiguration program until the component identified by id has executed all behavior requests submitted to it, and waitAll() blocks the execution until all components have executed all pending behavior requests. These three commands allow parallel asynchronous execution in Concerto, leading to more efficient reconfigurations. Based on the description of the components provided by their developers, Concerto can execute reconfiguration scripts, allowing for empirical performance comparisons [10].

The goal of this work is to generate a reconfiguration script using the three aforementioned commands to execute behaviors over components and bring them to a desired state. In addition to those three commands, the Concerto language also provides four usual commands to modify the topology of an assembly: create and delete components, connect and disconnect them. These operations are out of the scope of reconfiguration planning as we define it. Indeed, the decision to modify the topology of the assembly is usually taken by the same entity that determines reconfiguration goals (system administrator or autonomic analysis tool) [15, 17] rather than left to the planning phase. Furthermore, if topological changes in the assembly are deemed necessary, they can almost always be implemented through a reconfiguration script with the following steps: (i) creations of components, (ii) creations of connections, (iii) changes in component states, (iv) deletions of connections and (v) deletions of components [5, 7]. The main difficulty is to determine the operations of the third step that take the components to a safe state, in particular ensuring that none of the connections that will be deleted include an active use port. Computing a reconfiguration program to lead components to a desired state (or to have them perform some required operations) is the focus of this paper.

As an example, consider the assembly in Figure 1, where all the components are running. We wish to run software updates on dep1 and dep2, but this will deactivate their provide port service. To carry out the updates, component server must first deactivate its corresponding use ports, which is accomplished by executing its behavior suspend. Figure 2a depicts a reconfiguration script that performs this, then returns the components to a running state. No explicit synchronization is needed between the suspension of server and the updates: the execution model of Concerto ensures that the updates cannot be executed as long as the provide ports are in use. An explicit synchronization is however needed before re-deploying the server, to prevent it from reactivating its use ports before the updates start. As a side note, the ports config (that represent configuration information that is not affected by the update, such as connection information) remain active throughout the reconfiguration: the fine-grained management of dependencies in Concerto avoids a full restart of the system. This assembly also illustrates the capacity of Concerto components to execute actions in parallel: for example, after server has reached place allocated, it can fire multiple transitions, corresponding to independent reconfiguration actions.
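To make the shape of such a script concrete, the sketch below encodes one plausible command sequence as plain data. It is not a transcription of Figure 2a: the behavior name deploy for server is an assumption drawn from the phrase "re-deploying the server", and whether dep1 and dep2 need a further request after update depends on transitions that the text does not spell out. The tuple encoding is purely illustrative and is not Concerto syntax.

```python
# One plausible script for the update scenario (hypothetical reconstruction).
script = [
    ("pushB", "server", "suspend"),  # frees the use ports bound to service
    ("pushB", "dep1", "update"),     # no explicit wait needed before this:
    ("pushB", "dep2", "update"),     # updates block while the ports are in use
    ("wait", "dep1"),                # explicit synchronization before the
    ("wait", "dep2"),                # server is brought back
    ("pushB", "server", "deploy"),   # assumed behavior name
    ("waitAll",),
]


def queued_behaviors(commands):
    """Group pushB requests per component, preserving their FIFO order."""
    queues = {}
    for command in commands:
        if command[0] == "pushB":
            _, component, behavior = command
            queues.setdefault(component, []).append(behavior)
    return queues


print(queued_behaviors(script))
```

The per-component grouping mirrors the queueing semantics of pushB: each component sees only its own requests, in the order in which the script issued them.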

Concerto provides structured semantic tools to design efficient reconfiguration plans with highly parallel, asynchronous execution. However, taking full advantage of these features adds complexity to the internal structure of components and to associated reconfiguration scripts. Automated synthesis of reconfiguration scripts is therefore particularly useful in this context.

## 3 Reconfiguration Script Synthesis

This section describes the synthesis process used to generate reconfiguration scripts. This process takes as input a description of the current state of the system, namely the topology of the assembly (components and their connections) and the active places. We assume that the system is in a state where no component has pending behavior requests or ongoing transitions. Besides that information, the synthesis process also depends on a reconfiguration goal that is composed of (i) constraints Γports on the final state of ports and (ii) a set of behaviors Γbhv to execute on designated components.

The constraints Γports are given by a partial function that maps specific instances of component ports to a boolean indicating whether that port is required to be active or inactive. A reconfiguration satisfies that goal if it ends in a state such that for any component c and port p, if Γports(c, p) is defined, the port p of component c is active if and only if Γports(c, p) = ⊤. Where the value of Γports is undefined, any status of the port satisfies the constraint. This means that a reconfiguration goal does not have to specify a unique final state for components, but instead allows for multiple target states. It may appear tedious to specify constraints for all components of an assembly when a reconfiguration is specifically aimed at a subset of it, but in practice the current state of the assembly can be used to guide the choice of Γports for those other components. A reasonable strategy might specify that provide ports active before the reconfiguration should remain active, and leave other ports unspecified.

The other element of the reconfiguration goal is the set Γbhv, where each element is a pair composed of a component and a behavior. The reconfiguration satisfies it if it executes at least all these behaviors on the corresponding components. The set Γbhv alone may not correspond to a feasible reconfiguration. For example, a system administrator wishing to update the components of Figure 1 might give a behavior goal Γbhv = {(dep1, update), (dep2, update)} and a port goal Γports that maps every port instance to ⊤. The behaviors listed in that reconfiguration goal are not enough to carry it out, as it lacks a behavior to deactivate the use ports of the server prior to the update, and behaviors to reactivate all ports after the update. The synthesis process must therefore deduce necessary behaviors to carry out the reconfiguration goal, then schedule their executions in a suitable order. It proceeds as follows:


#### 3.1 Determining Sequences of Component Behaviors

A procedure localSeq(c, act <sup>c</sup> , Γbhv , Γports ) finds a sequence of behaviors that satisfies a reconfiguration goal Γ for a single component c starting in a state with active places act <sup>c</sup> . This is achieved by enumerating all sequences of behaviors with at most one occurrence of any behavior, and selecting one that satisfies the goal constraints. In practice, this enumeration is short because the number of behaviors of a component is usually small. More importantly, for a given component state (denoted by its active places), many behaviors do not have transitions originating from the active places. Since executing these behaviors would not have any effect, they can be ignored during the enumeration. Consequently, the number of useful sequences of behaviors to analyze is often much lower than the number of permutations. If no satisfying sequence is found by localSeq, then the problem has no solution, and the whole synthesis process fails. However, if multiple solutions are returned, the best possible sequence is picked, according to some (possibly user-defined) selection criterion. Some interesting optimization criteria are: the length of a sequence, its execution time (if time estimations are available for individual transitions, this may be computed with great accuracy [10]), the number of transitions it executes sequentially, or the number of ports it (de)activates. In our experiments, we used this last criterion, as it picks the component reconfiguration that is least likely to induce changes in other components, leading to simpler and potentially faster reconfiguration plans.

In order to coordinate sequences of behaviors across the assembly, we keep track of port requirements and activity during each behavior of a sequence. In particular, for each behavior in a sequence, we record use ports of the component that are activated at least once by the behavior (they must be connected to an active provide port during the execution of the behavior), and provide ports that are deactivated at least once (they must not be connected to an active use port). In addition, we also record the status of each port at the end of the behavior. This information is computed with a simple traversal of the behavior graph, starting from the places that are active at the beginning of the behavior.

In the example of the update for the assembly in Figure 1, localSeq determines that components dep1 and dep2 should each execute the sequence [update, deploy]: the first behavior is included in Γbhv and the second is required to take the components to a state that satisfies Γports .
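The enumeration performed by localSeq can be sketched in Python as follows. This is a simplified illustration under strong assumptions: a behavior is abstracted as a mapping from a source place to the place reached when the behavior ends, a port is active when one of its places is active, and the default selection criterion is the sequence length rather than the number of (de)activated ports used in our experiments. The place, port, and behavior definitions of `dep_behaviors` are hypothetical.

```python
from itertools import permutations


def local_seq(behaviors, port_places, active, goal_bhv, goal_ports, cost=len):
    """Enumerate behavior sequences (each behavior at most once), pick the cheapest.

    behaviors:   name -> {source place: place reached when the behavior ends}
    port_places: port -> set of places in which the port is active
    active:      set of currently active places
    goal_bhv:    set of behavior names that must appear in the sequence
    goal_ports:  port -> required final status (partial mapping)
    """
    candidates = []
    names = list(behaviors)
    for length in range(len(names) + 1):
        for seq in permutations(names, length):
            if not set(goal_bhv) <= set(seq):
                continue
            state = frozenset(active)
            for name in seq:
                moves = behaviors[name]
                if not any(place in moves for place in state):
                    state = None  # no transition leaves an active place: prune
                    break
                state = frozenset(moves.get(place, place) for place in state)
            if state is None:
                continue
            if all(bool(port_places[p] & state) == want
                   for p, want in goal_ports.items()):
                candidates.append(list(seq))
    return min(candidates, key=cost) if candidates else None


# Hypothetical life cycle for a component such as dep1.
dep_behaviors = {
    "deploy": {"uninstalled": "running", "updated": "running"},
    "update": {"running": "updated"},
    "uninstall": {"running": "uninstalled", "updated": "uninstalled"},
}
dep_ports = {"service": {"running"}}

print(local_seq(dep_behaviors, dep_ports, {"running"}, {"update"}, {"service": True}))
# prints ['update', 'deploy']
```

With an empty behavior goal and a state that already satisfies the port goal, the same function returns the empty sequence, which corresponds to the case of the server component in the running example.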

#### 3.2 Assembly-Level Reconfiguration Scheduling

Once sequences of behaviors to execute over each component have been determined, we turn our attention to the whole assembly and attempt to compute a sequence of reconfiguration commands (specifically, behavior requests and synchronization requests) that execute these behaviors. The challenge is to coordinate these behaviors in a way that satisfies all port requirements. To facilitate coordination and to restrict the search space, we specifically try to generate a reconfiguration composed of steps, such that each component executes at most one behavior per step, and each step is followed by a global synchronization request. This assumption on parallelism is reminiscent of the BSP model [4]. Figure 2b gives an example of such a reconfiguration, to be compared with Figure 2a, which achieves the same result with fewer synchronization points.

```
pushB(server, suspend)
pushB(dep1, update)
pushB(dep2, update)
pushB(dep1, deploy)
pushB(dep2, deploy)
wait(dep1)
wait(dep2)
pushB(server, deploy)
wait(server)
```
(a) Target reconfiguration program.

```
pushB(server, suspend)
waitAll()
pushB(dep1, update)
pushB(dep2, update)
waitAll()
pushB(dep1, deploy)
pushB(dep2, deploy)
waitAll()
pushB(server, deploy)
waitAll()
```
(b) A reconfiguration with four synchronized steps.

Fig. 2: A reconfiguration plan to perform updates on components dep1 and dep2 of the assembly in Figure 1, then restore the system to a working state.

**SMT Constraints.** To find a reconfiguration plan, ordering constraints and port requirements are encoded as a problem in a many-sorted first-order logic (i.e., the logic is equipped with sorts that partition the domain, similarly to a simple type system), and an SMT solver is used to obtain a solution. That encoding of the scheduling problem centers around a sort Behavior, with a finite number of elements that represent the behaviors to schedule. The main task of the SMT solver is to find an interpretation for a function schedule that maps behaviors to a reconfiguration step during which to execute them. Conceptually, schedule could range over natural numbers, with behavior b executed at the i-th step if i = schedule(b). However, such a model would require constraints with universal quantifiers over natural numbers, which pose a challenge for SMT solvers. It is also unnecessary, since there are only a finite number of behaviors to schedule: the number of steps required is at most the number of behaviors, when only one component executes a behavior at each step. If behaviors are executed in parallel over different components, fewer steps are required. Consequently, to improve the performance of the solver, the different steps of the reconfiguration are represented by another finite-domain sort Step, with elements step<sub>1</sub>, …, step<sub>n</sub>, step<sub>final</sub>. The element step<sub>final</sub> represents the ultimate state of the system rather than a reconfiguration step. Accordingly, the scheduling function has the signature schedule : Behavior → Step, and the problem contains the constraint schedule(b) ≠ step<sub>final</sub> for each behavior b.

A successor function succ : Step → Step is needed to describe the effect of a reconfiguration step on the subsequent state of the system. Constraints succ(step<sub>i</sub>) = step<sub>i+1</sub> (for 0 ≤ i < n), succ(step<sub>n</sub>) = step<sub>final</sub> and succ(step<sub>final</sub>) = step<sub>final</sub> define the interpretation of succ. Likewise, to easily express sequentiality constraints, a function int : Step → Int maps each step to its step number, as defined by constraints int(step<sub>i</sub>) = i. With this function, sequentiality is easily expressed: for any two consecutive behaviors b<sub>1</sub> and b<sub>2</sub> in the sequence of behaviors to schedule for a given component, the constraint int(schedule(b<sub>1</sub>)) < int(schedule(b<sub>2</sub>)) is added. This function reintroduces an infinite domain, which we sought to eliminate with the sort Step. However, since the problem contains no quantifiers over integers, the solver only has to check that the aforementioned formula is satisfied by a speculated interpretation of schedule. This limited form of integer reasoning has a negligible impact on the search.

The main difficulty in scheduling a reconfiguration lies in ensuring that port requirements are satisfied for each behavior of a component. A predicate act<sub>p</sub> : Step → Bool is introduced for each (use or provide) port p to indicate the activity status of the port at the beginning of reconfiguration steps. The status of each port p after each behavior b is uniquely defined, as determined during the computation of the sequences of behaviors of the component to which the port belongs. Correspondingly, a constraint [¬]act<sub>p</sub>(succ(schedule(b))) is added to reflect that status. The square brackets denote an optional negation, which is present if the port is inactive at the end of the behavior and absent if it is active. Conversely, the status of a port cannot change if its component is not executing a behavior. For a component with behaviors b<sub>1</sub>, …, b<sub>n</sub>, the constraint schedule(b<sub>1</sub>) ≠ step<sub>i</sub> ∧ ⋯ ∧ schedule(b<sub>n</sub>) ≠ step<sub>i</sub> ⟹ (act<sub>p</sub>(step<sub>i</sub>) ⟺ act<sub>p</sub>(succ(step<sub>i</sub>))) is added for every step i such that 0 ≤ i < n. Port requirements can then be modeled. Let u be a use port that needs to be provided (i.e., connected to an active provide port) during behavior b, and p the provide port to which it is connected. The constraint act<sub>p</sub>(schedule(b)) ensures that p is active (and u provided) when b begins. Conversely, for a provide port p deactivated by a behavior b and connected to a use port u, ¬act<sub>u</sub>(schedule(b)) ensures that u is inactive when b begins. Furthermore, for any behavior b that activates a use port u and any behavior b′ that deactivates the connected provide port p, the constraint schedule(b) ≠ schedule(b′) ensures that the behaviors are executed at different steps, hence separated by a synchronization barrier.

The problem<sup>3</sup> is passed to an SMT solver. If satisfiable, the interpretation found for schedule is used to build a reconfiguration script such as in Figure 2b.

Note that the scheduling problem could be encoded as a SAT problem. However, SMT solvers can reason about the theory EUF (equality and uninterpreted functions) using a dedicated congruence algorithm. We also use (non-recursive) data types, for which some SMT solvers have a dedicated reasoning algorithm [3], to represent the domains of Behavior and Step. These capabilities allow us to encode the problem straightforwardly and obtain solutions efficiently. Also note that the size of the scheduling problem is only a function of the number of behaviors to schedule and the number of component ports, but does not depend on the internal complexity of components, so that optimized components with several parallel transitions will not adversely affect the synthesis method.
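To make the constraints of this subsection concrete, the following Python sketch checks them on the running example. It deliberately swaps the SMT solver for an exhaustive enumeration of step assignments, which is only viable for very small instances, and it is not the encoding used by our tool. The port names, the dictionaries describing effects and requirements, and the assumption that a behavior name occurs at most once per component are all hypothetical simplifications.

```python
from itertools import product


def find_schedule(sequences, initial, effects, needs_active, needs_inactive, separate):
    """Search for a step assignment that satisfies the constraints described above.

    sequences:      component -> ordered list of behavior names
    initial:        port -> status before the first step
    effects:        (component, behavior) -> {port: status at the end of the behavior}
    needs_active:   (component, behavior) -> provide ports that must be active at its start
    needs_inactive: (component, behavior) -> use ports that must be inactive at its start
    separate:       pairs of behaviors that must run in different steps
    """
    behaviors = [(c, b) for c, seq in sequences.items() for b in seq]
    if not behaviors:
        return {}
    for n_steps in range(1, len(behaviors) + 1):  # try the fewest steps first
        for steps in product(range(1, n_steps + 1), repeat=len(behaviors)):
            schedule = dict(zip(behaviors, steps))
            if _valid(schedule, n_steps, sequences, initial, effects,
                      needs_active, needs_inactive, separate):
                return schedule
    return None


def _valid(schedule, n_steps, sequences, initial, effects,
           needs_active, needs_inactive, separate):
    # Sequentiality: consecutive behaviors of a component use increasing steps.
    for c, seq in sequences.items():
        if any(schedule[(c, a)] >= schedule[(c, b)] for a, b in zip(seq, seq[1:])):
            return False
    # Conflicting behaviors must be separated by a barrier.
    if any(schedule[a] == schedule[b] for a, b in separate):
        return False
    act = dict(initial)  # port status at the beginning of the current step
    for step in range(1, n_steps + 1):
        running = [b for b, s in schedule.items() if s == step]
        for b in running:
            if not all(act[p] for p in needs_active.get(b, ())):
                return False
            if any(act[u] for u in needs_inactive.get(b, ())):
                return False
        for b in running:  # a port changes only if its component runs a behavior
            act.update(effects.get(b, {}))
    return True


sequences = {"server": ["suspend", "deploy"],
             "dep1": ["update", "deploy"], "dep2": ["update", "deploy"]}
initial = {"dep1.service": True, "dep2.service": True,
           "server.service1": True, "server.service2": True}
effects = {
    ("server", "suspend"): {"server.service1": False, "server.service2": False},
    ("server", "deploy"): {"server.service1": True, "server.service2": True},
    ("dep1", "update"): {"dep1.service": False},
    ("dep1", "deploy"): {"dep1.service": True},
    ("dep2", "update"): {"dep2.service": False},
    ("dep2", "deploy"): {"dep2.service": True},
}
needs_active = {("server", "deploy"): ["dep1.service", "dep2.service"]}
needs_inactive = {("dep1", "update"): ["server.service1"],
                  ("dep2", "update"): ["server.service2"]}
separate = [(("server", "deploy"), ("dep1", "update")),
            (("server", "deploy"), ("dep2", "update"))]

plan = find_schedule(sequences, initial, effects, needs_active, needs_inactive, separate)
print(plan)
```

On this input the search needs four steps, in the order suspend, update, deploy (dep1 and dep2), deploy (server), which corresponds to the synchronized plan of Figure 2b. The `separate` pairs are what rule out restarting the server in the same step as the updates.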

#### 3.3 Determining Missing Behaviors

Until now, we have considered the scheduling problem under the assumption of a fixed sequence of behaviors to schedule for each component. In general, a set of behaviors may have no feasible schedule. For example, it is not possible to fully execute the behavior update on components dep1 and dep2 of the assembly in Figure 1 without first deactivating the use ports service1 and service2 of component server, i.e., executing its behavior suspend. To plan reconfigurations for an incomplete set of behaviors, we use our SMT encoding of the scheduling problem to detect the point in the reconfiguration at which additional changes must be performed, then we create new component reconfiguration sub-problems and use the solutions to augment the sequences of behaviors to schedule.

<sup>3</sup> Illustrating instances for the running example, in the SMT-LIB file format, can be found at https://doi.org/10.5281/zenodo.5820571.

Let S be a mapping that associates to each component a sequence of behaviors (i.e., the sequence to be executed by that component, as determined in Subsection 3.1). A maximal executable schedule S′ of S is a mapping that associates to each component c a prefix of S(c), such that (i) the scheduling problem corresponding to the sequences in S′ has a solution, and (ii) no reconfiguration problem built by extending a prefix in S′ with one behavior has a solution. Intuitively, a maximal executable schedule is a point up to which the reconfiguration S can be carried out, before unsatisfied port requirements prevent further execution.

Procedure 1 iteratively computes a maximal executable schedule S′ and uses the resulting information to refine the sequences of behaviors to execute for each component, until a solution is found that executes them all. By analyzing the statuses of ports in the assembly at the end of the execution of S′ (which depend only on the last behavior in each sequence), and comparing them to the requirements of the first unscheduled behaviors in S, we deduce a set of provide ports to activate and use ports to deactivate to allow further scheduling of S, and compute intermediary port constraints Γ′ports. For each component c that does not have unscheduled behaviors in S, we determine a sequence s<sub>1</sub> of behaviors that satisfies this intermediate goal (assuming that the component starts with active places act<sup>c</sup><sub>S′</sub> corresponding to its state after executing the last behavior in S′(c)) and a sequence s<sub>2</sub> that takes the component from its state after executing s<sub>1</sub> (active places act<sup>c</sup><sub>s1</sub>) to one that satisfies the port constraints Γports of the original goal. Sequences of behaviors to execute are thus extended ([ ] denotes the empty sequence, and s<sub>1</sub> · s<sub>2</sub> the concatenation of two sequences). To ensure a monotonic search, sequences are extended only for components c without unscheduled behaviors in S, i.e., not the components that brought about the intermediary goal Γ′ports. If no such extension can be found (¬progress), the scheduling of S is blocked by a circular dependency between components and the synthesis process fails. If the procedure terminates, it returns a reconfiguration script corresponding to a solution of the scheduling problem of S.

Consider the example of running updates in the assembly of Figure 1. Initially (see Subsection 3.1), the mapping S of sequences of behaviors computed with localSeq is defined by S(dep1) = S(dep2) = [update, deploy], and S(server) = [ ], because Γbhv does not include any behavior for that component, and the component is already in a state that satisfies Γports. This combination of sequences of behaviors has no feasible schedule. In particular, the mapping S′ that associates to every component the empty sequence is found to be a maximal executable schedule of S. The first unscheduled behaviors in S are two instances of update; they require use ports service1 and service2 of component server to be deactivated. Consequently, two new reconfiguration sub-goals are created for server. The first requires it to reach a state where the two ports are deactivated; a call to localSeq returns the solution s<sub>1</sub> = [suspend]. From the resulting component state, the second reconfiguration sub-goal requires server to go to a state that satisfies Γports. In this case, localSeq returns the sequence s<sub>2</sub> = [deploy]. S is updated so that S(server) = [suspend, deploy]. At this point, S is found to be a maximal executable schedule of itself, and the corresponding solution is returned, i.e., the reconfiguration plan in Figure 2b. Note that Procedure 1 is not guaranteed to terminate, nor is it a complete search algorithm. In particular, it relies on two heuristics: the selection function used when localSeq finds multiple candidate sequences, and the choice of maximal executable schedule for a given mapping S.

```
Procedure globalSolution(A, Γbhv, Γports) is
    for c ∈ A do S(c) ← localSeq(c, act^c_A, Γbhv, Γports);
    while findMaxExecSchedule(S) ≠ S do
        S′ ← findMaxExecSchedule(S);
        Γ′ports ← port conditions required to execute, for every component c,
                  the first behavior in S(c) that is not in S′(c);
        progress ← false;
        for c ∈ A such that S′(c) = S(c) do
            s1 ← localSeq(c, act^c_S′, Γbhv \ S′(c), Γ′ports);
            s2 ← localSeq(c, act^c_s1, ∅, Γports);
            if s1 ≠ [ ] or s2 ≠ [ ] then
                S(c) ← S′(c) · s1 · s2;
                progress ← true;
            end
        end
        if ¬progress then fail;
    end
    return reconfigurationScriptOfSolution(S);
end
```
Procedure 1: Synthesizes a reconfiguration script.

**Computing a Maximal Executable Schedule.** Procedure 1 relies on a function findMaxExecSchedule to compute a maximal executable schedule of a mapping S. This function, shown in Procedure 2, maintains a mapping containing prefixes of elements in S (initially mapping every component to the empty sequence) and incrementally extends those prefixes, checking every time the satisfiability of the corresponding scheduling problem. This procedure calls the SMT solver to check the satisfiability of the scheduling problems. In the actual implementation, some simple checks are also used to quickly detect some trivially unsatisfiable or satisfiable instances of the scheduling problem, although these are left out of Procedure 2 for clarity. The procedure continues until all behaviors have been included or no additional behavior can be scheduled. A maximal executable schedule always exists (the mapping that associates every component to the empty sequence always has a satisfiable scheduling problem, and may be maximal), and findMaxExecSchedule always finds one. However, maximal executable schedules are not unique, and a bad choice may result in an ineffective reconfiguration plan. In the example above, during the second iteration, the mapping S of sequences to schedule is defined by S(server) = [suspend, deploy] and S(dep1) = S(dep2) = [update, deploy]. S itself is a maximal executable schedule of S, but so is the mapping S′ defined by S′(server) = [suspend, deploy] and S′(dep1) = S′(dep2) = [ ]. S′ corresponds to the case where the server is restarted too early. Picking this maximal executable schedule will ultimately lead to a reconfiguration that stops the server at least twice. To avoid this, a good heuristic for findMaxExecSchedule is to extend in priority the prefixes for which the added behavior is least likely to affect other components, i.e., those that deactivate the fewest provide ports and activate the fewest use ports.

```
Procedure findMaxExecSchedule(S) is
    suffixes ← S;
    for c such that suffixes(c) is defined do prefixes(c) ← [ ];
    progress ← true;
    while progress do
        progress ← false;
        for c such that suffixes(c) ≠ [ ] do
            b ← head(suffixes(c));
            if the scheduling problem for prefixes extended with b is satisfiable then
                progress ← true;
                prefixes(c) ← prefixes(c) · [b];
                suffixes(c) ← tail(suffixes(c));
            end
        end
    end
    return prefixes;
end
```
Procedure 2: Computes a maximal executable schedule for sequences of behaviors S.
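Procedure 2 translates almost directly into Python. The sketch below is illustrative only: the satisfiability check is passed in as a callable `is_schedulable`, which stands in for the call to the SMT solver, and `toy_oracle` is a hand-written, hypothetical approximation of that check for the running example.

```python
def find_max_exec_schedule(S, is_schedulable):
    """Greedily extend per-component prefixes of S while they remain schedulable.

    S:              component -> sequence of behaviors to schedule
    is_schedulable: callable taking a mapping of prefixes and returning a bool
                    (a stand-in for the satisfiability check of the solver)
    """
    suffixes = {c: list(seq) for c, seq in S.items()}
    prefixes = {c: [] for c in S}
    progress = True
    while progress:
        progress = False
        for c in suffixes:
            if not suffixes[c]:
                continue
            candidate = dict(prefixes)
            candidate[c] = prefixes[c] + [suffixes[c][0]]
            if is_schedulable(candidate):
                prefixes = candidate
                suffixes[c] = suffixes[c][1:]
                progress = True
    return prefixes


def toy_oracle(prefixes):
    """Hypothetical schedulability rules for the assembly of Figure 1."""
    server = prefixes["server"]
    deps = [prefixes["dep1"], prefixes["dep2"]]
    # An update needs the server to have suspended its use ports first.
    if any("update" in d for d in deps) and "suspend" not in server:
        return False
    # The server cannot be redeployed while a dependency is updated but not redeployed.
    if "deploy" in server and any(d == ["update"] for d in deps):
        return False
    return True


S = {"server": ["suspend", "deploy"],
     "dep1": ["update", "deploy"],
     "dep2": ["update", "deploy"]}
print(find_max_exec_schedule(S, toy_oracle))
```

Since every component is extended by at most one behavior per pass, this traversal order does not restart the server before the updates have been scheduled. A priority order over components, as suggested by the heuristic above, could be obtained by sorting the keys of `suffixes` before each pass.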

#### 3.4 Relaxation of Synchronization Barriers

The assumption that reconfigurations should proceed in globally synchronized steps, although useful to find a solution, severely limits the potential for inter-component parallelism, a key feature of Concerto. A final optimization stage takes the reconfiguration plan with synchronized steps and relaxes synchronization where possible. First, every command waitAll() is replaced with a sequence of commands wait(c) for every component c that executes a behavior in the preceding step. This preserves the semantics of the reconfiguration and makes the targets of synchronization explicit. Then, for a given step i and a given command wait(c) after this step, we apply the following rule: if for all behaviors executed by c since the last command wait(c) up to step i, no provide (resp., use) port is deactivated (resp., activated) and connected to a use (resp., provide) port that is activated (resp., deactivated) at step i + 1, then wait(c) can be delayed until after step i + 1. This rule is applied for every step in order, delaying barriers as late as possible and removing duplicates. This transformation reduces the number of barriers yet ensures that behaviors with conflicting effects on ports remain separated by an explicit synchronization. Port requirements for behaviors do not have to be taken into account, as the Concerto execution model ensures implicit synchronization for those. As an example, this optimization applied to the reconfiguration plan in Figure 2b yields the one in Figure 2a.
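The relaxation rule can be sketched in Python as a single pass over the synchronized steps. In this illustrative version, each pending wait(c) is emitted only when a behavior pushed on c since its last wait conflicts with a behavior of the next step, and remaining waits are emitted at the end. The data layout (`steps`, `use_activated`, `provide_deactivated`, `connections`) and the port names are assumptions made for the example.

```python
def relax_barriers(steps, use_activated, provide_deactivated, connections):
    """Turn globally synchronized steps into a script with delayed wait(c) commands.

    steps:               list of steps, each a list of (component, behavior)
    use_activated:       (component, behavior) -> use ports it activates
    provide_deactivated: (component, behavior) -> provide ports it deactivates
    connections:         iterable of (provide port, use port) pairs
    """
    linked = set(connections)

    def conflict(earlier, later):
        # Earlier behavior deactivates a provide port whose use port is activated later.
        if any((p, u) in linked
               for p in provide_deactivated.get(earlier, ())
               for u in use_activated.get(later, ())):
            return True
        # Earlier behavior activates a use port whose provide port is deactivated later.
        return any((p, u) in linked
                   for u in use_activated.get(earlier, ())
                   for p in provide_deactivated.get(later, ()))

    script = []
    pending = {}  # component -> behaviors pushed since its last wait
    for step in steps:
        for comp in list(pending):
            if any(conflict(old, new) for old in pending[comp] for new in step):
                script.append(f"wait({comp})")
                del pending[comp]
        for comp, bhv in step:
            script.append(f"pushB({comp}, {bhv})")
            pending.setdefault(comp, []).append((comp, bhv))
    script.extend(f"wait({comp})" for comp in pending)
    return script


steps = [[("server", "suspend")],
         [("dep1", "update"), ("dep2", "update")],
         [("dep1", "deploy"), ("dep2", "deploy")],
         [("server", "deploy")]]
connections = [("dep1.service", "server.service1"),
               ("dep2.service", "server.service2")]
provide_deactivated = {("dep1", "update"): ["dep1.service"],
                       ("dep2", "update"): ["dep2.service"]}
use_activated = {("server", "deploy"): ["server.service1", "server.service2"]}

for command in relax_barriers(steps, use_activated, provide_deactivated, connections):
    print(command)
```

Under these assumptions about which ports each behavior affects, the printed script is the program of Figure 2a: the only barriers kept before the end are wait(dep1) and wait(dep2), placed just before the server is redeployed.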

## 4 Experiments

The implementation described here, the examples, and the experimental results are available at https://doi.org/10.5281/zenodo.5820571.

#### 4.1 Implementation

We implemented the synthesis process in a Python tool that attempts to produce a reconfiguration script for a given assembly and reconfiguration goal. The process is entirely automated. Given a description of an assembly and a reconfiguration goal, it generates relevant scheduling problems and interacts with an SMT solver to generate reconfiguration programs. Intermediate scheduling problems can be output in the SMT-LIB file format, the standard used by most SMT solvers [2], and can be solved using any solver that complies with version 2.6 of the SMT-LIB standard. The preferred mode of operation for our tool does not output files, but interacts with the SMT solver Z3 [23] through the Z3 Python API. This interface makes it easy to analyze interpretations returned by the solver for satisfiable problems, and thus to reconstruct schedules. This is the mode of operation used to conduct the experiments described below.

#### 4.2 Results Over Synthetic Examples

To test our technique on a variety of cases, we devised assemblies with four types of topology. In central-user assemblies, a set of provider components, each with a pair of provide ports, is connected to different use ports of one central user component. In central-provider assemblies, one central provider component has a pair of provide ports that is connected to (a pair of use ports of) multiple other components. In linear assemblies, components form a chain such that each component has a pair of provide ports connected to the pair of use ports of the next component. In stratified architectures, components are organized in levels containing up to three components, such that each component in a level has a pair of provide ports connected to use ports on every component in the level above (i.e., a provide port can be connected to up to three use ports). Every component in these assemblies is equipped with behaviors to deploy it, update or suspend it, and uninstall it. Figure 3 depicts those four topologies, with internal nets of components omitted for clarity. As an example of the internal structure of components, Figure 1 shows the central-user assembly with three components. For other types and sizes of assembly, components follow similar internal structures, adapted to offer adequately many ports.
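For illustration, the four topologies can be generated at the level of component connections with a few lines of Python. This sketch ignores ports and internal nets, treats each connection as a (provider, user) pair, and uses hypothetical component names. It is not the generator used for the experiments.

```python
def central_user(n):
    """n - 1 providers, all connected to one central user component."""
    return [(f"provider{i}", "user") for i in range(1, n)]


def central_provider(n):
    """One central provider connected to n - 1 user components."""
    return [("provider", f"user{i}") for i in range(1, n)]


def linear(n):
    """A chain in which each component provides to the next one."""
    return [(f"c{i}", f"c{i + 1}") for i in range(n - 1)]


def stratified(n, width=3):
    """Levels of up to `width` components; each provides to every component one level up."""
    levels = [[f"c{i}" for i in range(start, min(start + width, n))]
              for start in range(0, n, width)]
    return [(provider, user)
            for lower, upper in zip(levels, levels[1:])
            for provider in lower
            for user in upper]


for build in (central_user, central_provider, linear, stratified):
    print(build.__name__, len(build(10)))
```

The length of the longest chain of connections differs markedly between these shapes: it is 1 for the two central topologies regardless of n, whereas it grows linearly with n for linear and stratified assemblies.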

For each architecture, we generated assemblies with 10, 30 and 100 components (scaling the number of providers for central-user, the number of users for central-provider, the length of the chain for linear or the number of levels for stratified), and ran three scenarios. The deployment scenario starts with all components uninstalled, Γports requires the activation of a provide port on the last component(s) in the dependency order, while Γbhv is empty. The update scenario starts with all components running, Γports requires a similar final state, and Γbhv includes update behaviors for components that are first in the dependency order and no behavior for the others. The uninstall scenario starts with all components running, Γports requires the deactivation of all ports, while Γbhv is empty. Each scenario affects every component of the assembly.

Fig. 3: The four assembly topologies in synthetic examples.

Table 1 describes the solving process and resulting solution for these 36 examples. Experiments were executed on a computer with an 8-core 1.6GHz processor and 16 GiB of RAM. Solutions were successfully generated for all but 4 examples (the process was aborted after one hour). For 21 of them, the process took less than a minute. Results indicate that the solving time, and ultimately the success of the method, depend on the topology of the assembly: the assemblies for which some reconfigurations could not be computed within one hour are those with long chains of dependencies (linear and stratified assemblies with 100 components). This can be explained in two ways. Firstly, the propagation of port requirements and the deduction of missing behaviors require a number of iterations of the main loop of Procedure 1 proportional to the length of the longest chain of dependency. Secondly, architectures with long chains of dependencies are less conducive to parallel execution of behaviors, and therefore the instances of scheduling problems solved have a high number of steps, leading to a large search space and long solving times. For example, the deployment of 100 components in the linear architecture ultimately requires 100 steps. For each of the 17 instances of the scheduling problem solved to compute that reconfiguration, the SMT solver took on average 147 seconds to return a solution. In contrast, to deploy 100 components in the central-user architecture, the reconfiguration script requires only 2 steps; as a result, the SMT solver was able to return a solution after only 0.21 seconds on average each time it was called. For difficult problems, the solving time is dominated by calls to the SMT solver as shown in the solving time column of Table 1 (in parentheses, the cumulated time taken by the SMT solver). Overall, these examples show that our method is able to plan reconfigurations affecting a large number of components.
Furthermore, architectures with a very large number of components, such as microservice architectures, typically have a shallow depth rather than long chains of dependencies, and scale horizontally [21,24,25], similarly to our central-user and central-provider architectures. Our method scales well in those conditions.

Writing a correctly coordinated reconfiguration plan with tens of asynchronous behaviors is a non-trivial task. It is particularly difficult when explicit synchronization commands are needed in the reconfiguration script. The execution model of Concerto ensures that this is seldom necessary, but some synchronization barriers are required, e.g., in the update scenarios to prevent early restarts that would block the updates. Our synthesis technique determines synchronization points required for completion of the reconfigurations, but it also avoids synchronization points that would slow the execution unnecessarily. It performs these tasks quickly, with a time gain that is especially significant when compared to the service interruption that an incorrect reconfiguration would cause.

#### 4.3 OpenStack Use Case

We also tested our method on a real OpenStack system. OpenStack is the de facto standard open-source solution to address the IaaS level of the cloud paradigm; it can be seen as the open-source operating system of the cloud.

In previous work [8], Madeus, a subset of Concerto restricted to deployment, was used to deploy an OpenStack system. Following the deployment strategy of the reference production deployment tool Kolla, 11 components were specified, resulting in a real OpenStack deployment up to 70% faster than Kolla. Here we use the same components, extended with behaviors for reconfiguration. We analyzed the official installing, updating and uninstalling procedures of OpenStack to design the associated internal nets. Figure 4 depicts those components and their connections, with details of the internal structure depicted for the four main components, and omitted for clarity on seven others. The reconfiguration starts with all components running. The reconfiguration scenario requires an update of the database component (Γbhv = {(mariadb, update), (mariadb, deploy)}) and Γports specifies that all ports must eventually return to their initial (active) state. Our method generates a reconfiguration plan in 1.95 seconds, correctly deducing missing behaviors for mariadb as well as components affected by the interruption of its service, i.e., keystone, nova, neutron, and glance. The generated plan coordinates 12 behaviors on these 5 components. After optimization, it includes only 2 synchronization points needed to ensure the complete re-deployment of mariadb and keystone, whose services are required by other components.

Table 1: Results of the synthesis process on synthetic examples. For each problem, the table indicates the architecture of the assembly (arch.) and its number of components (size), the number of problems solved by the SMT solver (smt) followed in parentheses by the number of those that were found satisfiable, the total solving time in seconds followed in parentheses by the cumulated time taken by the SMT solver (time), the number of steps in the solution before relaxation of the synchronization barriers (steps), and the number of behaviors executed in that solution (bhvs).

Fig. 4: A Concerto assembly for an OpenStack system.

While the scale of this use case may seem limited, its architecture is not trivial. This real-world scenario leads to a complicated synchronization problem. The 12 behaviors in the reconfiguration program require 8 global synchronization steps before optimization. The optimization phase reduces this to 2 individual synchronization points, thus enhancing the level of parallelism and asynchrony of the reconfiguration program, while preserving its correctness. A DevOps engineer or system administrator would be challenged to write such a program without errors or unnecessary synchronization points, whereas our solution only requires them to specify a reconfiguration goal.

## 5 Related Work

For models with fixed component life cycles, planning and scheduling techniques have been used to plan reconfigurations [1, 13]. Pre-established protocols can also be used: while such solutions are in general less flexible, they have desirable features such as decentralized coordination [14] or recovery policies [6]. Comparatively fewer works study the problem of reconfiguration planning in models with programmable component life cycles, such as Concerto. Kikuchi et al. [20] synthesize reconfiguration plans with a model finder. Unlike us, they assume that all available reconfiguration operations are given in the input of the scheduling problem, which may limit scalability. Operations and reconfiguration goal are encoded in the Alloy specification language, and synthesis is performed by the Alloy Analyzer. This work relies on a simple ad hoc component-based model, with reconfiguration operations that must be sequentially ordered. The model does not have specific execution semantics; instead, the list of operations has to be given by the user, with their effects described as constraints on the states before and after the operations. Therefore the correctness of the correspondence between the synthesized procedure and its executable counterpart depends on the user. Metis [22] closes that gap between planning and execution, as it schedules deployment plans for distributed systems in the Aeolus model [12], which has formal execution semantics. The authors first describe the problem as a generic planning problem and use standard planners to solve it, then present a specialized solving algorithm. Metis is limited to deployment rather than general reconfiguration, making the computation of dependencies more straightforward. Aeolus shares many similarities with Concerto, but lacks intra-component parallelism and asynchronous commands in its reconfiguration language. These features improve the efficiency of reconfigurations but also make them more difficult to plan. Note that these features can also be represented through planning and scheduling problems [18], typically solved by approximation.

The problem of determining reconfiguration goals (i.e., the analysis phase) is complementary to the planning problem. Engage [15] uses a SAT solver to build a complete target configuration (a set of components to deploy) from a partial specification, based on a hierarchical specification of a distributed software stack. It also performs limited planning, namely sequentially ordering deployments. Engage does not account for the state of the system, and is thus limited to initial deployments or reconfigurations from the ground up. Zephyrus [11] and ConfSolve [17] are two tools to infer, from the state of the system/environment, a target configuration that could be used as an entry of our planning tool.

#### 6 Conclusion

We have described a synthesis method for reconfiguration plans of component-based systems that relies on (i) finding local solutions at the component level, (ii) finding a schedule that coordinates those solutions at the assembly level, with the help of an SMT solver, (iii) determining unsatisfied dependencies to refine the reconfiguration goal until it becomes satisfiable, and (iv) optimizing the synthesized reconfiguration plan to improve its level of parallel and asynchronous execution. Dividing the problem in this manner, as opposed to attempting to solve it at once with an SMT solver, is key to solving large instances, although it leads to incompleteness (the third step relies on an incomplete search guided by some heuristic choices). This design decision does not appear to affect the success of the method or the quality of synthesized plans, and allows the technique to scale to applications with a large number of components, as demonstrated in our experiments on synthetic examples and a real use case. To improve scalability on complex architectures, this technique could be adapted to a hierarchical composition model, which would lend itself to a recursive resolution algorithm.

## References



## **Semantic Clone Detection via Probabilistic Software Modeling**

Hannes Thaller<sup>1</sup>, Lukas Linsbauer<sup>2</sup>, and Alexander Egyed<sup>1,⋆</sup>

> <sup>1</sup> Johannes Kepler University Linz, Austria {hannes.thaller, alexander.egyed}@jku.at <sup>2</sup> Technical University of Braunschweig, Germany l.linsbauer@tu-braunschweig.de

**Abstract.** Semantic clone detection is the process of finding program elements with similar or equal runtime behavior. An example is detecting the semantic equality between the recursive and iterative implementations of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. In recent years, this boundary has been tested using interesting new approaches. This article contributes a semantic clone detection approach that detects clones which have 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements for finding behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. It uses the likelihood between model elements as a distance metric. Then, it employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The outputs of SCD-PSM are pairs of program elements (i.e., methods), their distance, and a decision on whether they are clones or not. SCD-PSM yields excellent results with a Matthews Correlation Coefficient greater than 0.9. These results are obtained on classical semantic clone detection problems such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.

**Keywords:** semantic clone detection · probabilistic software modeling · clone detection

## **1 Introduction**

Copying and pasting source code fragments leads to code clones, which are considered an anti-pattern. Code clones increase maintenance costs [31,32], promote bad software design [29,13,17], and introduce or propagate bugs [4,28,14]. However, duplicating code fragments also allows faster adaptation to requirements, the re-use of stable and well-tested solutions [25,26], and helps to overcome language limitations [21,35], thereby lowering development costs. The impact of code clones and the contradicting evidence various studies provide are the topics of an ongoing discussion in the community. Meanwhile, it is certain that developers will continue duplicating source code to leverage its benefits, despite its drawbacks. The key is the awareness and management of clones to maximize efficiency while balancing quality.

<sup>⋆</sup> The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH. This research was funded in part by the Austrian Science Fund (FWF) [P25513].

Traditionally, the clone taxonomy distinguishes between four types of clones [35,2,34]. Type 1-3 describe code clones caused by copying and pasting the source code with or without changes. Type 4 clones describe code clones that do not have any syntactic similarity but implement the same functionality (semantic equivalence). For example, the recursive and iterative implementation of an algorithm (e.g., Fibonacci computation) have no syntactic similarity while implementing the same functionality. Existing tools have limited or no capabilities to detect Type 4 clones [19]. Most current studies exclude them because of the lack of tool support [23,35,2,39,11]. Nevertheless, Type 4 clones exist, and recent research efforts have tried to deepen the understanding of them [19,49,20]. This article provides a significant contribution to semantic clone detection in the form of novel concepts and a prototype implementing them.

We present *Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM)*. SCD-PSM extends our work on Probabilistic Software Modeling (PSM) [43] via a semantic clone detection pipeline. PSM builds Probabilistic Models (PMs) from programs. It analyzes the static structure and dynamic runtime behavior and replicates the program in the form of a generative probabilistic model. These models allow developers to reason about the semantics of a program. SCD-PSM extends this work by leveraging the PMs and causal reasoning to find semantically (i.e., behaviorally) equivalent code elements. SCD-PSM allows full quantification of the behavioral distance of code elements via likelihoods. Furthermore, the likelihood evaluation via PMs allows for statistical significance tests to decide whether a pair of code elements are clones. SCD-PSM detects semantic clones with no textual similarity, such as the iterative and recursive version of an algorithm. The average performance of the approach reaches a Matthews Correlation Coefficient of 0.965 on a complex problem set, indicating a robust method for semantic clone detection. This work extends our previous work [41] with a full evaluation and the theoretical foundation.

Section 2 provides the background needed to understand SCD-PSM, including the basics of PSM. Section 3 clarifies what semantic clones are in the context of this work. Section 4 presents the approach, in which representation, search space, and the various similarity stages are described. Section 5 evaluates the approach, while Section 6 discusses the results. Limitations of the approach and possible threats are given in Section 7 and Section 8. Section 9 compares the work to the state of the art, and Section 10 concludes this article.

```
// Iterative factorial using a for-loop.
int fa(int n) {
    int product = 1;
    for (int i = 1; i <= n; i++) {
        product *= i;
    }
    return product;
}
```
Listing 1.1: *for-loop* implementation of factorial

```
// Iterative factorial using a while-loop.
int fb(int n) {
    int product = 1;
    int i = 1;
    while (i <= n) {
        product *= i;
        i++;
    }
    return product;
}
```
Listing 1.2: *while-loop* implementation of factorial

```
// Recursive factorial.
int fc(int n) {
    if (n <= 1) return 1;
    return fc(n - 1) * n;
}
```
Listing 1.3: *Recursive* implementation of factorial

```
// Delegates to fc() but handles n < 1 according to the guard.
int fd(int n, String guard) {
    if (n < 1 && "val".equals(guard))
        return -1;
    if (n < 1 && "throw".equals(guard))
        throw new IllegalArgumentException("n < 1");
    return fc(n);
}
```
Listing 1.4: *Delegate* implementation of factorial

## **2 Background**

The clone detection research community has a long history and defines many concepts, algorithms, and tools. In contrast, Probabilistic Software Modeling (PSM) is relatively new and combines software engineering and probabilistic modeling. Some terms need clarification; others require an introduction if they diverge from their traditional names.

#### **2.1 Clone Detection**

Clone detection is the process of finding two similar program fragments. Listings 1.1 to 1.4 are four different implementations of the factorial function (*n*!). Listing 1.1 is a *for*-loop implementation, Listing 1.2 uses a *while*-loop, and Listing 1.3 is recursively defined. Finally, Listing 1.4 delegates its implementation to fc() from Listing 1.3 but may also return −1 in case of invalid inputs (including *n* = 0).
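The relationship between the four listings can be checked with a small differential test. The following sketch restates the listings as static Java methods; the class name `FactorialVariants`, the helper `agreeUpTo`, and the tested range are illustrative assumptions and not part of the original study.

```java
// Illustrative sketch: the four factorial variants from Listings 1.1 to 1.4,
// compared by differential testing. Class and helper names are assumptions.
final class FactorialVariants {
    static int fa(int n) {                      // for-loop variant (Listing 1.1)
        int product = 1;
        for (int i = 1; i <= n; i++) product *= i;
        return product;
    }

    static int fb(int n) {                      // while-loop variant (Listing 1.2)
        int product = 1;
        int i = 1;
        while (i <= n) { product *= i; i++; }
        return product;
    }

    static int fc(int n) {                      // recursive variant (Listing 1.3)
        if (n <= 1) return 1;
        return fc(n - 1) * n;
    }

    static int fd(int n, String guard) {        // delegate variant (Listing 1.4)
        if (n < 1 && "val".equals(guard)) return -1;
        if (n < 1 && "throw".equals(guard)) throw new IllegalArgumentException("n < 1");
        return fc(n);
    }

    /** True if fa, fb, and fc return the same value for every n in [0, max]. */
    static boolean agreeUpTo(int max) {
        for (int n = 0; n <= max; n++) {
            if (fa(n) != fb(n) || fb(n) != fc(n)) return false;
        }
        return true;
    }
}
```

Calling `FactorialVariants.agreeUpTo(10)` confirms that the first three variants agree on that range, while `fd(0, "val")` returns −1 where `fa(0)` returns 1, which is the conditional difference mentioned above.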

Representation, pairing, similarity evaluation, and clone decision are the core concepts of clone detection. *Representations* describe on which artifact the detector operates, such as text, graphs (e.g., AST), or probabilistic models. *Pairing* describes the selection of two code fragments that are potentially clones (e.g., fa() and fb()). Each pair is called a *candidate clone pair* (or candidate pair). The *similarity evaluation* measures the similarity of a candidate pair, e.g., by counting the number of different characters. Finally, the *clone decision* labels the candidate pair as a clone given a criterion on the similarity, e.g., less than ten different characters.

The properties of the similarity metric split clones into two groups [35]. Type 1-3 clones capture textual similarity, while Type 4 clones capture semantic similarity [2,23,24,35,34,44]. Type 1 clones (Exact Clones) are program fragments that are identical except for variations in white-space and comments. Type 2 clones (Parameterized Clones) are program fragments that are structurally or syntactically similar except for changes in identifiers, literals, types, and comments. Type 3 clones (Near-Miss Clones) are program fragments that include insertions or deletions in addition to changes in identifiers, literals, types, and layouts. Type 4 clones (Semantic Clones) are program fragments that are functionally or semantically similar (i.e., perform the same computation) without textual similarities. These types are increasingly challenging to detect, with Type 4 being the most complex one. Note that the definition of *Semantic Clones* is often relaxed, so that up to 50% syntactic similarity of the code fragments is allowed (e.g., BigCloneBench [39]). However, we consider these clones as complex Type 3 clones (additions, deletions, reordering) and *not* as semantic clones. This means that semantic clones in the context of this work are clones with no syntactic similarity except for per-chance similarities.

We will use *a* ≃ *b* to denote that *a* is a clone of *b*. Furthermore, *a* ≄ *b* denotes that *a* is not a clone of *b*.

#### **2.2 Programs & Code Elements**

PSM generalizes object-oriented terms to describe *code elements* in a program. Code elements are *types* *T*, *properties* *Pr*, and *executables* *Ex* that refer to, e.g., classes, fields, and methods in Java [1], or classes, properties, and functions in Python [45]. Additional code elements are *parameters* *Pa* and *results* *Re* of executables that refer to parameters and return values of a method. Properties, parameters, and results are *atomic* code elements that have identifiable states at runtime. Types and executables are *compositional* elements that act as a collection of atomic elements. Types *declare* properties and executables, capturing structural relationships. Executables have behavioral relationships that are categorized into *Inputs* (I) and *Outputs* (O). *Inputs* are *received parameters* *Pa*<sup>I</sup>, *read properties* *Pr*<sup>I</sup>, and *requested invocation results* *Re*<sup>I</sup>. *Outputs* are *returned executable results* *Re*<sup>O</sup>, *written properties* *Pr*<sup>O</sup>, and *provided parameters* *Pa*<sup>O</sup>. We will denote atomic elements in lowercase, and compositional elements in bold-face lowercase, e.g., *n* and ***fa*** in Listing 1.1. Executable results are named after their executables, e.g., *fa* in Listing 1.1. ***fc*** = {*n*<sup>*Pa*,I</sup>, *fc*<sup>*Re*,I</sup>, *fc*<sup>*Re*,O</sup>} denotes the code elements of Listing 1.3. For the sake of readability, we will omit the superscript classifiers where this is unambiguous, e.g., ***fa*** = {*n*, *fa*}. The subset of *inputs* is denoted by ***fc***<sup>I</sup> = {*n*<sup>*Pa*,I</sup>, *fc*<sup>*Re*,I</sup>} and *outputs* by ***fc***<sup>O</sup> = {*fc*<sup>*Re*,O</sup>}. Finally, the set of all input and output combinations is given by ***ex***<sup>IO</sup> = {(*i*, *o*) ∈ ***ex***<sup>I</sup> × ***ex***<sup>O</sup>}. For example, ***fd***<sup>IO</sup> = {(*n*, *fd*), (*guard*, *fd*)} describes the IO pairs of fd().
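The construction of the IO pairs of an executable is a plain Cartesian product of its inputs and outputs. The following minimal sketch shows this, with atomic code elements reduced to their names; the class `IoPairs` and the use of `Map.Entry` as a pair type are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the IO-pair construction: every input of an
// executable is combined with every output. Names are hypothetical.
final class IoPairs {
    /** Returns all (input, output) combinations of one executable. */
    static List<Map.Entry<String, String>> ioPairs(List<String> inputs, List<String> outputs) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String in : inputs) {
            for (String out : outputs) {
                pairs.add(Map.entry(in, out));
            }
        }
        return pairs;
    }
}
```

For fd() with inputs *n* and *guard* and the single output *fd*, `IoPairs.ioPairs(List.of("n", "guard"), List.of("fd"))` produces the two pairs (*n*, *fd*) and (*guard*, *fd*) given in the text.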

#### **2.3 Probabilistic Software Modeling**

Probabilistic Software Modeling (PSM) [40] is a data-driven modeling paradigm that transforms a program into a Probabilistic Model (PM). PSM extracts the structure and behavior of a program. Code elements and their dependency graph represent the *structure* as described in Section 2.2. All observable events at runtime represent the *behavior*. The resulting PM and its model elements are a probabilistic copy of the program.

*Model elements* in the PM are the equivalent of code elements in the program. *P*(*x*) denotes the probability distribution of variable *x*, e.g., *P*<sub>*fa*</sub>(*n*) denotes the probability distribution of input parameter *n* of the fa-method. *p*(*x*) denotes the probability of a specific event of a variable, e.g., *p*<sub>*fa*</sub>(*n* = 2). This extends the notation of code elements with probabilistic quantities. However, the notation reasons about the probabilistic behavior of code elements instead of their structural properties.

Each model element is a flow-based latent variable model [7] that learns an invertible mapping between the original observations and an isotropic unit norm Gaussian 𝒩(0, **1**) with *f* : *X* ↦ *Z*. An example for *x* ∈ *X* may be *n* ∈ ***fa*** with *n*<sup>*z*</sup> ∈ ***fa***<sup>*z*</sup> being its latent Gaussian representation. The Gaussian latent space enables the model elements to generate new samples and evaluate the likelihood of samples.

*Generation* (or Sampling) draws, either marginally or conditionally, observations from a model element, simulating the execution of the corresponding code element. For example, drawing 100 observations from ***fa*** ∼ *P*<sub>*fa*</sub>(*n*, *fa*), i.e., values for *n*<sup>I</sup> and *fa*<sup>O</sup>, simulates 100 program executions of this method. An example for *conditional generation* would be ***fa***<sub>|*n* < 10</sub> ∼ *P*<sub>*fa*</sub>(*fa* | *n* < 10) that only draws observations where *n* < 10. The process involves sampling from the latent Gaussian variables and inverting the Gaussian samples to the original domain via the flow *f*<sup>−1</sup>(*z*) = *x*. *Evaluation* takes observations and evaluates their likelihood under a model element. For example, *P*<sub>*fa*</sub>(*n* = 4, *fa* = 24) evaluates the likelihood of input 4 and output 24 under the ***fa*** model element. The process of evaluation involves mapping a given sample into the latent space and evaluating it under the Gaussian *p*<sub>𝒩(0, **1**)</sub>(*f*(*x*)). Generation and evaluation are the core of any PSM application and of SCD-PSM. A detailed description is given in our previous work [43].
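Generation and evaluation can be illustrated with the simplest possible invertible flow. The sketch below uses a one-dimensional affine map with fixed parameters in place of a learned flow; the class `AffineFlow` and its parameters `shift` and `scale` are assumptions for illustration, whereas actual PSM model elements use learned, multi-dimensional flows.

```java
import java.util.Random;

// Illustrative sketch: a one-dimensional affine flow z = f(x) = (x - shift) / scale.
// It stands in for a learned flow; shift and scale are fixed assumptions here.
final class AffineFlow {
    private final double shift;
    private final double scale;

    AffineFlow(double shift, double scale) {
        if (scale <= 0) throw new IllegalArgumentException("scale must be positive");
        this.shift = shift;
        this.scale = scale;
    }

    /** Forward mapping f: X -> Z into the latent Gaussian space. */
    double toLatent(double x) { return (x - shift) / scale; }

    /** Inverse mapping f^-1: Z -> X back to the observation space. */
    double fromLatent(double z) { return z * scale + shift; }

    /** Generation: draw z from N(0, 1) and invert it to an observation. */
    double sample(Random rng) { return fromLatent(rng.nextGaussian()); }

    /** Evaluation: log-density of x via the change-of-variables formula. */
    double logLikelihood(double x) {
        double z = toLatent(x);
        double logStandardNormal = -0.5 * z * z - 0.5 * Math.log(2 * Math.PI);
        return logStandardNormal - Math.log(scale); // log|df/dx| = -log(scale)
    }
}
```

The same two operations, sampling through the inverse and scoring through the forward map, are what the later similarity stages rely on.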

## **3 Semantic Clones**

A clear understanding of what SCD-PSM defines as a *semantic clone* is essential in understanding the approach and its design choices.

**Definition 1.** *A semantic clone is a pair of executables whose (partial) input and output relationships exhibit significant (conditional) similarities.*

Definition 1 defines semantic clones over the similarity between IO relationships of executables. This also holds if the IO relationships are only partially similar, i.e., not all combinations of IO pairs between executables have to be similar. For example, *fd* in Listing 1.4 has two IO pairs (***fd***<sup>IO</sup> = {(*n*, *fd*), (*guard*, *fd*)}) while *fa* in Listing 1.1 has one IO pair (***fa***<sup>IO</sup> = {(*n*, *fa*)}). According to the definition, at least one IO pair comparison needs to be similar such that both executables are declared as a semantic clone (e.g., (*n*, *fd*) ≃ (*n*, *fa*)).

Furthermore, the similarities between IO pairs may only be conditional, i.e., the similarity of matching IO pairs might depend on the state of any other code element in the comparison context. For example, the IO pair (*n*, *fd*) ≃ (*n*, *fa*) is only a perfect clone if fd.guard != "val". If fd.guard == "val", the IO behavior differs for *n* = 0 (fd(0) ↦ −1 while fa(0) ↦ 1). According to the definition, at least parts of the behavior need to be similar, capturing complex multidimensional behavioral patterns in IO relationships.

The rationale behind the comparison of IO relationships is one of cause and effect. If a pair of executables exhibit similar effects given similar causes, then their computational behavior is identical. Extending this rationale by multiple inputs and outputs leads to *partial conditional similarity*.

## **4 Approach**

[Figure 1 panels: Modeling, Search Space, Static Similarity, Dynamic Similarity, Model Similarity; steps numbered 1 to 14.]

Fig. 1: The modeling phase transforms the program into a PM. The search space phase then pairs the PM model elements into candidate pairs. Finally, Static-, Dynamic- and Model Similarity evaluate the behavioral equality of the candidates.

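Figure 1 illustrates SCD-PSM. It is a five-fold approach consisting of the following steps: Modeling (Section 4.1), Search Space (Section 4.2), Static Similarity (Section 4.3), Dynamic Similarity (Section 4.4), and Model Similarity (Section 4.5).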


The approach represents a rejecting filter pipeline that candidate pairs must traverse in order to be declared a clone. Static-, Dynamic-, and Model Similarity represent filter stages of increasing complexity.

The main contribution of this work is the implementation of a semantic clone detection pipeline on top of PSM. Further, we provide an effective process of traversing the potentially large search space of candidate pairs. Finally, we show that the behavioral equivalence of model elements generalizes to the semantic equivalence of code elements.

#### **4.1 Modeling**

Starting from the *Source Code* in Figure 1, PSM builds a *Probabilistic Model* (PM) [40] of the program (1). The PM is also called the Inference Graph (IG), which is a cluster graph [22] with Non-Volume Preserving Flows (NVPs) [7] as clusters. SCD-PSM uses this PM as a representation for the clone detection, similar to text-based clone detectors that use text fragments. The PM is the output of PSM and is considered as given in the context of SCD-PSM.

Executable model elements in the PM act as surrogates for the executables in the program. SCD-PSM pairs these model elements and computes their similarity. If a behaviorally equivalent model element pair is found, then it can be seen as a semantically equivalent code element pair. In conclusion, SCD-PSM allows for method-level semantic clone detection based on PMs representing the original executables in the program.

#### **4.2 Search Space**

Fig. 2: SCD-PSM operates on four levels of abstraction: program, between executable, within executable, and the IO level.

SCD-PSM conducts method-level semantic clone detection, which operates on multiple abstraction levels. Figure 2 illustrates these levels, starting with the program and ending with the inputs and outputs of an executable.

The second step in Figure 1 builds a *within- and between-executable space* that SCD-PSM searches for clones. The *Between-Executable Space (BES)* is the set of executable combinations

$$BES = \{ \{a, b\} \in Ex \times Ex \mid a \neq b \},\tag{1}$$

where *ex*<sub>*a*</sub>, *ex*<sub>*b*</sub> is a *candidate pair* (or executable pair), and *Ex* is the set of all executables in the current analysis (illustrated in Figure 2). The theoretical size of the between-executable space is the number of 2-length combinations without replacement, given by

$$|BES| = \frac{|Ex|!}{2 \cdot (|Ex| - 2)!},\tag{2}$$

where |·| describes the size of the underlying set. Note that the size of the BES is smaller than that of the Cartesian product since {*a*, *b*} = {*b*, *a*}. Figure 1 shows this pairing process in the Search Space aspect (2). The *Within-Executable Space (WES)* is the product of IO pairs
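Equations (1) and (2) can be checked with a few lines of code. The following sketch enumerates the unordered candidate pairs and computes their count; the class name `BetweenExecutableSpace` and the representation of executables as strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the between-executable space (BES).
// Executables are represented by their names; this is an assumption.
final class BetweenExecutableSpace {
    /** Number of unordered pairs of distinct executables, |Ex|! / (2 * (|Ex| - 2)!). */
    static long size(int executables) {
        return (long) executables * (executables - 1) / 2;
    }

    /** Enumerates every unordered candidate pair {a, b} with a != b exactly once. */
    static List<String[]> pairs(List<String> executables) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < executables.size(); i++) {
            for (int j = i + 1; j < executables.size(); j++) {
                result.add(new String[] {executables.get(i), executables.get(j)});
            }
        }
        return result;
    }
}
```

For the four factorial variants this gives 6 candidate pairs, and for the 108 implementation variants of the study dataset it gives 5778.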

$$WES^{ab} = \{(i, j) \in \mathbf{a}^{\mathcal{IO}} \times \mathbf{b}^{\mathcal{IO}}\}. \tag{3}$$

Figure 2 illustrates the WES and one IO pair from the WES that we also call *link*. The theoretical size of the within-executable space is

$$\left|WES^{ab}\right| = \left|\mathbf{a}^{\mathcal{IO}}\right| \cdot \left|\mathbf{b}^{\mathcal{IO}}\right|. \tag{4}$$

For the sake of visualization, IO pairs are not shown in Figure 1 but are abstracted in their executable elements. The maximum theoretical search space is

$$S = \sum_{i} \left|wes(BES_i)\right|, \tag{5}$$

given that *wes* describes a construction function according to Equation (3), and *BES*<sub>*i*</sub> is the *i*-th candidate pair.

In practice, SCD-PSM evaluates only a fraction of the possible combinations because of the skip evaluation. The *skip evaluation* consists of two search-space-limiting factors: greedy evaluation and transitive similarity. *Greedy evaluation* stops the search through the WES once a similar pair is found. The initial detection process only confirms the similarity of a candidate pair. A post-analysis can then extract all possible IO similarities for potential actions. *Transitive similarity* skips evaluations in the BES, because if *a* ≃ *b* and *b* ≃ *c*, then *a* ≃ *c* also holds. In conclusion, SCD-PSM compares IO pairs of executable model elements and uses skip evaluation to traverse the search space efficiently.
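One possible way to realize transitive similarity is to keep detected clones in disjoint sets, so that a pair whose members already share a set needs no further evaluation. The sketch below uses a union-find structure for this; the choice of data structure and the class name `CloneClasses` are assumptions, since the text does not prescribe an implementation.

```java
// Illustrative sketch of transitive similarity: executables already known to be
// clones are merged into one class, so pairs inside a class can be skipped.
// The union-find structure is an assumption, not the paper's stated design.
final class CloneClasses {
    private final int[] parent;

    CloneClasses(int executables) {
        parent = new int[executables];
        for (int i = 0; i < executables; i++) parent[i] = i;
    }

    /** Representative of the clone class containing executable x. */
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving keeps lookups cheap
            x = parent[x];
        }
        return x;
    }

    /** Records a detected clone pair (a is a clone of b). */
    void markClone(int a, int b) { parent[find(a)] = find(b); }

    /** True if the clone relation between a and c already follows by transitivity. */
    boolean canSkip(int a, int c) { return find(a) == find(c); }
}
```

After `markClone(0, 1)` and `markClone(1, 2)`, the pair (0, 2) can be skipped, while a pair involving an unrelated executable still has to be evaluated.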

#### **4.3 Static Similarity**

The static similarity stage is a filter that accepts candidate pairs based on their data type, as shown in Figure 1. Data types in a PSM model are integers, floats, and text.

*Input* (3) of the stage are the IO pairs *WES*<sup>*ab*</sup> = *wes*({*a*, *b*}) of a candidate. The filter *criterion* (4) accepts a candidate pair if at least *one* link (i.e., IO pair) has a matching data type, i.e., both the input and the output have matching data types. *Output* (5) is a boolean decision whether the candidate pair is a clone or not from a static viewpoint. If positive, the candidate pair is moved to the next pipeline stage, i.e., the *Dynamic Similarity* evaluation (see Figure 1). If negative, the candidate pair is marked as *not* a clone (*a* ≄ *b*) and no further processing is conducted. For example, the IO pairs (*n*, *fa*) ≃ (*n*, *fb*) would be statically accepted as clones, as both inputs and outputs have the same data type (integer). A counterexample is the pairing of (*n*, *fa*) with (*guard*, *fd*), where the input data types are integer and text.

The static similarity stage presupposes that the analyzed program is written in a programming language that allows for static analysis. Programs written in programming languages without static typing cannot make use of this filter stage. In conclusion, the static similarity stage filters candidates based on their data type.
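The acceptance criterion of this stage amounts to a type comparison over all links of a candidate. A minimal sketch, assuming that each IO pair has already been reduced to the data types of its input and output, could look as follows; `StaticSimilarity`, `DataType`, and `TypedIoPair` are hypothetical names.

```java
import java.util.List;

// Illustrative sketch of the static similarity filter. All names are
// hypothetical; only the three data types are taken from the text.
final class StaticSimilarity {
    /** Data types distinguished in a PSM model. */
    enum DataType { INTEGER, FLOAT, TEXT }

    /** An IO pair reduced to the data types of its input and output. */
    static final class TypedIoPair {
        final DataType input;
        final DataType output;

        TypedIoPair(DataType input, DataType output) {
            this.input = input;
            this.output = output;
        }
    }

    /** Accepts a candidate if at least one link has matching input and output types. */
    static boolean accepts(List<TypedIoPair> a, List<TypedIoPair> b) {
        for (TypedIoPair left : a) {
            for (TypedIoPair right : b) {
                if (left.input == right.input && left.output == right.output) return true;
            }
        }
        return false;
    }
}
```

With this sketch, an (integer, integer) pair such as (*n*, *fa*) matches (*n*, *fb*) but not a (text, integer) pair such as (*guard*, *fd*).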

#### **4.4 Dynamic Similarity**

The dynamic similarity stage is a filter that accepts candidate pairs based on the runtime data, as shown in Figure 1. Candidate pairs are accepted if at least *one* IO pair (6) has an *insignificantly* diverging runtime distribution (7). This boolean decision is evaluated via a Kolmogorov-Smirnov test [30] and determines whether a pair is a clone from a dynamic viewpoint (8). For example, the IO pair (*n*, *fa*) ≃ (*n*, *fd*) with guard == "val" would be rejected by the filter, given that runtime events with *n* = 0 reach a majority. In comparison, (*n*, *fa*) ≃ (*n*, *fb*) would be accepted by the stage.

A requirement is that the candidates use a synthetic trigger. Otherwise, the comparison of the data distributions may fail because of the different modus operandi of the program. For example, running fa and fb where *n*<sub>*fa*</sub> ∼ 𝒰(0, 4) and *n*<sub>*fb*</sub> ∼ 𝒰(5, 10) would cause the dynamic stage to fail even if the implementations are equivalent. Property-based [12] or random testing can be used to generate diverse synthetic inputs.

In conclusion, the dynamic similarity stage pre-filters candidates based on univariate tests on the input and output events.
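The univariate test of this stage can be sketched with a two-sample Kolmogorov-Smirnov test. The version below uses the standard asymptotic critical value instead of an exact p-value; the class name, the method names, and the significance level passed by the caller are assumptions, and a production implementation would more likely use a statistics library.

```java
import java.util.Arrays;

// Illustrative sketch of a two-sample Kolmogorov-Smirnov test using the
// asymptotic critical value. Names and the alpha parameter are assumptions.
final class KolmogorovSmirnov {
    /** Largest absolute difference between the two empirical CDFs. */
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            while (i < a.length && a[i] <= v) i++; // step past all values <= v
            while (j < b.length && b[j] <= v) j++;
            d = Math.max(d, Math.abs((double) i / a.length - (double) j / b.length));
        }
        return d;
    }

    /** True if the two samples diverge significantly at level alpha. */
    static boolean differs(double[] x, double[] y, double alpha) {
        double n = x.length;
        double m = y.length;
        double critical = Math.sqrt(-0.5 * Math.log(alpha / 2.0)) * Math.sqrt((n + m) / (n * m));
        return statistic(x, y) > critical;
    }
}
```

Samples drawn from disjoint ranges, as in the trigger example with 𝒰(0, 4) and 𝒰(5, 10), yield the maximal statistic of 1 and are therefore reported as different, which is why all candidates should receive the same synthetic trigger.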

#### **4.5 Model Similarity**

The model similarity stage is a filter that accepts candidate pairs based on the models, as shown in Figure 1. This stage conducts a multivariate test by sampling from the executable models and cross evaluating them. This test includes the evaluation of conditional influences caused by elements that are not actively participating in an IO pair. For example, (*n*, *fd*) ≃ (*n*, *fa*) holds but is conditionally dependent on *guard*. The model similarity can factor *guard* into its decision while the dynamic stage can only evaluate the average behavior of an IO pair.

*Input* (9) are the IO pairs of a candidate *WES*<sup>*ab*</sup> = *wes*({*a*, *b*}). The crosswise log-likelihood ratio of the models is computed by *(conditional) generation* and *evaluation*. *Output* is a boolean decision on whether the candidate pair is a clone or not, from a model viewpoint. Figure 1 illustrates the entire process of the model similarity.


The roles between the *null* and *alt* models are then swapped, and the process is repeated. Both log-likelihood ratios are then combined by a pooling operator to produce the clone decision (14).

The role-swap is needed to avoid sub-model relationships. For example, if *M*<sub>*null*</sub> = 𝒩(0, 3) and *M*<sub>*alt*</sub> = 𝒩(0, 1), then *LL*<sub>*alt*</sub> will be very high because *M*<sub>*alt*</sub> is a sub-model of *M*<sub>*null*</sub>. Reversing the roles highlights the differences in the models.

The final decision is based on the Generalized Likelihood Ratio Test (GLRT) [10]. It measures whether the log-likelihoods are significantly different from 0, where *λ* is the test statistic. The null hypothesis is rejected for small ratios *λ* ≤ *c*, where *c* is set to an appropriate false-positive rate. For example, *λ* < log(0.01) allows 1 out of 100 candidates to be a false-positive, i.e., wrongly rejecting semantic equivalence. The pooling operator combines the link results either via hard or soft pooling. *Hard pooling* conducts a GLRT for both links, yielding a positive decision if *both* links are positive. *Soft pooling* averages the link log-likelihood ratios and then computes the GLRT, yielding a positive decision if the joint GLRT is positive. Hard pooling does not allow any sub-model relationships, while soft pooling relaxes this constraint.

In conclusion, the model similarity conducts a multivariate significance test between two models, including possible conditional dependencies.
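The role swap and the two pooling operators can be sketched with simple stand-in models. In the following example, univariate Gaussians replace the learned model elements, and the ratio is taken as the mean difference of log-densities over samples drawn from the null model; both choices, as well as all class and method names, are assumptions for illustration and do not reproduce the exact test statistic of the approach.

```java
import java.util.Random;

// Illustrative sketch of the crosswise log-likelihood ratio with role swap and
// pooling. Gaussians stand in for learned model elements; the form of the
// ratio and the threshold handling are assumptions made for this example.
final class ModelSimilarity {
    /** Stand-in model element: a univariate Gaussian. */
    static final class Gaussian {
        final double mean;
        final double std;

        Gaussian(double mean, double std) {
            this.mean = mean;
            this.std = std;
        }

        double sample(Random rng) { return mean + std * rng.nextGaussian(); }

        double logDensity(double x) {
            double z = (x - mean) / std;
            return -0.5 * z * z - Math.log(std) - 0.5 * Math.log(2 * Math.PI);
        }
    }

    /** Mean of log p_alt(x) - log p_null(x) over particles drawn from the null model. */
    static double logRatio(Gaussian nullModel, Gaussian altModel, int particles, Random rng) {
        double sum = 0.0;
        for (int k = 0; k < particles; k++) {
            double x = nullModel.sample(rng);
            sum += altModel.logDensity(x) - nullModel.logDensity(x);
        }
        return sum / particles;
    }

    /** Hard pooling: both directions must stay above the threshold c. */
    static boolean hardPooling(double ab, double ba, double c) { return ab > c && ba > c; }

    /** Soft pooling: the averaged ratio must stay above the threshold c. */
    static boolean softPooling(double ab, double ba, double c) { return (ab + ba) / 2.0 > c; }
}
```

A caller would compute `logRatio(a, b, ...)` and `logRatio(b, a, ...)` to realize the role swap and then pass both values, together with a threshold such as `Math.log(0.01)`, to one of the pooling methods.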

## **5 Study**

This study answers the following research questions.


Q1 answers whether semantic clones can be detected via SCD-PSM. Q2 answers whether the search space can be efficiently processed using skip evaluation. Q3 answers how the skip evaluation influences the performance of the detection process. This is important because candidate pairs might be skipped based on false-positives or false-negatives.

#### **5.1 Setup**

We implemented a prototype for SCD-PSM on top of Gradient [40], a prototype for PSM. The elements and data flow of the detection process are shown in Figures 1 and 2.


#### **5.2 Dataset**

The study uses three well-known algorithms and 10 Google Code Jam 2017 (GCJ)<sup>1</sup> problems. The total dataset contains 108 implementation variants across 13 clone classes described by *Instance*.

Each clone class was differentially tested to verify the behavior across instances. Factorial, Fibonacci, and Sort do not need any further explanation. The GCJ problems are well specified complex optimization problems packaged in an everyday theme.

The dataset contains in total 5778 (see Equation (2)) candidate pairs, of which 458 are semantic clones and 5320 are not. This yields a positive to negative ratio of 1 : 11.6, indicating a highly imbalanced distribution. An even more pronounced imbalance is to be expected in real-world applications.

Each instance was triggered with input data to allow PSM to model the different implementations. Factorial, Fibonacci, and Sort were triggered by sampling from a uniform distribution 𝒰(0, 20). GCJ problems were triggered by the input data provided by the competition. Each instance received the same trigger.

GCJ problems read from and write to the standard stream, which is impractical in terms of reproducibility. Our dataset is constructed such that each implementation has a run-method representing the cloned executable. The study results are limited to the run-method even if the solutions use helper methods. Helper methods may, for example, be methods that compute parts of the final solution, or reorganize the data. This guarantees a proper problem scope, a well-defined recall and precision, and a clearly defined benchmark for future reproducibility.

<sup>1</sup> https://codingcompetitions.withgoogle.com/codejam/archive

#### **5.3 Controlled Variables**

The study controls for the search space *Evaluation* strategy, *Dynamic False-Positive Rate (D-FPR)*, *Model False-Positive Rate (M-FPR)*, and *Pooling*.


An additional fixed parameter is the *number of particles*. It defines the sample size that is generated during the model similarity stage and is set to |*D*| = 50.

#### **5.4 Response Variables**

The response measures of the study are the number of *Skip Evaluations*, processing *Duration*, *TP, FP, TN, FN*, *Precision*, *Recall*, *F1*, and *Matthews Correlation Coefficient*.

**Skip Evaluations** measures the number of evaluations that were skipped due to the skip evaluation strategy.

**Duration** measures the elapsed time to compute one candidate pair.

**TP, FP, TN, FN** measure the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) detection results compared to the ground truth.

**Precision** measures the fraction of detected clones that are truly clones.

**Recall** measures the fraction of semantic clones that have been found.

**F1** measures the accuracy of a binary classification as the harmonic mean of recall and precision.


Table 1: Results of the top-5 and bottom-1 experiment along with the average performance of the top-5.

Duration in seconds

**Matthews Correlation Coefficient (MCC)** measures the quality of the clone detection in the form of a correlation ranging from −1 to 1, with 0 being a random selection. The MCC will be the reference performance metric as it is the most robust metric in an imbalanced binary classification setting [3]. It is a correlation coefficient which may be interpreted by the guidelines proposed by Evans [9].
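For reference, the response measures above can be written in terms of the confusion-matrix counts. These are the standard textbook definitions, given here as a reading aid and not quoted from the study:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

$$
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$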

#### **5.5 Comparison of Clone Detectors**

In total, eight alternative approaches are used to contextualize the performance of SCD-PSM. The alternatives vary widely in their internal representation and clone detection capabilities, as listed in Table 3. ASTNN (8) and ASTNN Leaky (9) are the same approach but use different evaluation methods. ASTNN Leaky (9) uses a random split of the dataset as reported by the authors [50]. This overestimates the performance of the approach because the training and test datasets are not isolated from each other. For example, the clone pairs (*f<sub>a</sub>*, *f<sub>b</sub>*) and (*f<sub>a</sub>*, *f<sub>c</sub>*) might be in the training split while (*f<sub>b</sub>*, *f<sub>c</sub>*) might be in the test split. ASTNN (8) uses a group-wise Cross Validation (CV), where each clone class is isolated entirely in either the training or the test portion of the dataset. This represents a real-world situation in which the detector is first fitted and then applied to a new system with unknown code fragments.
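The difference between the two evaluation methods can be illustrated with a minimal sketch. The clone classes, fragment names, and helper functions below are hypothetical, and only positive pairs are considered; the sketch shows the isolation property, not the evaluation code used for ASTNN.

```python
from itertools import combinations

# Hypothetical clone classes: fragment names grouped by the behavior they implement.
clone_classes = {
    "factorial": ["f_a", "f_b", "f_c"],
    "fibonacci": ["g_a", "g_b"],
    "sort": ["h_a", "h_b", "h_c"],
}

def clone_pairs(classes):
    """All positive pairs, tagged with the clone class they belong to."""
    return [(cls, pair) for cls, frags in classes.items()
            for pair in combinations(frags, 2)]

def group_wise_split(classes, test_classes):
    """Keep every clone class entirely on one side of the split."""
    train = [p for cls, p in clone_pairs(classes) if cls not in test_classes]
    test = [p for cls, p in clone_pairs(classes) if cls in test_classes]
    return train, test

def fragments(pairs):
    """Set of code fragments that occur in a list of pairs."""
    return {frag for pair in pairs for frag in pair}

# Leaky split: pairs are distributed without regard to their clone class.
pairs = [p for _, p in clone_pairs(clone_classes)]
leaky_train, leaky_test = pairs[0::2], pairs[1::2]

train, test = group_wise_split(clone_classes, test_classes={"sort"})

print("leaky overlap:", sorted(fragments(leaky_train) & fragments(leaky_test)))
print("group-wise overlap:", sorted(fragments(train) & fragments(test)))
```

In the leaky split, fragments seen during training reappear in the test pairs, whereas the group-wise split leaves the two fragment sets disjoint.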

Detectors that report lines instead of methods may produce more results (TP, FP, TN, FN) than present in the dataset. A similar situation is given by ASTNN Leaky that runs multiple evaluations via the cross validation.

#### **5.6 Experiment Results**

Creating the PSM model with Gradient took 2134.38 s, resulting in an average modeling time of 19.75 s for the 195 executables. This includes 87 helper methods.

Table 1 contains the aggregate results of the top-5 experiments along with the results of the worst experiment. The bottom line in Table 1 is the average performance of the top-5 experiments. The generally expected performance of the approach is *very strong* with an MCC of 0.965. High confidence for negative examples is given with no false positives, reflecting the pipeline's false-positive rates (D-FPR × M-FPR). The best experiment featured a *skip evaluation*, a D-FPR of *0.100*, an M-FPR of *0.001*, and *soft pooling* (Nr. 1), reaching an MCC of 0.975. The worst experiment featured an *exhaustive evaluation*, a D-FPR of 0.100, an M-FPR of 0.010, and *hard pooling* (Nr. 16), reaching a *strong* MCC of 0.787. A total of 345 candidates were skipped while reaching a recall of 0.933.

Table 2: Performance breakdown of the best performing experiment listed as Nr. 1 in Table 1.

Duration in seconds

Table 2 lists the cumulative performance of the best model, starting with an initial prediction that all candidates are semantic clones (rejecting pipeline). The *static* stage finds 71.729 % (3816) of the FPs, improving the MCC by 0.409. The *dynamic* stage removes another 27.330 % (1454) of the FPs but introduces 1.528 % (7) of the possible FNs, improving the MCC by 0.527. Finally, the *model* stage removes the remaining 0.939 % (50) of the FPs but introduces a further 3.056 % (14) of the possible FNs, improving the MCC by 0.039.
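The stage-wise MCC gains can be recomputed from the counts given in this paragraph. The following is a minimal sketch: the counts are taken from the text, while the function names and the convention of returning 0 for an empty marginal are assumptions made for illustration.

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 if a marginal is empty (assumed convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

POSITIVES, NEGATIVES = 458, 5320  # ground truth from the dataset description

# (stage, FPs removed, FNs introduced) as reported for the best experiment
stages = [("static", 3816, 0), ("dynamic", 1454, 7), ("model", 50, 14)]

# Rejecting pipeline: initially every candidate is predicted to be a clone.
tp, fp, tn, fn = POSITIVES, NEGATIVES, 0, 0
previous = mcc(tp, fp, tn, fn)
history = []
for name, removed_fp, new_fn in stages:
    fp, tn = fp - removed_fp, tn + removed_fp   # rejected non-clones become TNs
    tp, fn = tp - new_fn, fn + new_fn           # wrongly rejected clones become FNs
    current = mcc(tp, fp, tn, fn)
    history.append((name, round(current, 3), round(current - previous, 3)))
    previous = current

for name, value, gain in history:
    print(f"{name:8s} MCC={value:.3f} gain={gain:+.3f}")
```

Under these counts the gains come out as 0.409, 0.527, and 0.039, and the final value of 0.975 matches the MCC reported for the best experiment.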

On average, 5.884 % (340) of the total 5778 evaluations could be skipped. This equals 74.235 % of the total 458 TPs. On average, 37.359 % (50 354) of the total 134 782 IO pair evaluations could be saved via greedy evaluation. The average duration of the exhaustive experiments was 2394 s, leading to 414 ms per candidate. Skip experiments lasted on average 1988 s, or 344 ms per candidate. The static stage accounted on average for less than 0.1 % of the time per candidate (see Table 2), the dynamic stage for 10.6 %, and the model stage for 89.3 %.

Table 3 lists the detection results of eight alternative clone detectors. Simian, NiCad, and CCAligner found no clones in the dataset. PMD, SourcererCC, Oreo, and iClones found some clones (< 20) with a low recall (4 %). Each of these detectors has a *very weak* performance below an MCC of 0.20. ASTNN with the leaky evaluation has a *very strong* performance with an MCC of 0.976. ASTNN 3-Group CV has a *strong* performance with an MCC of 0.711. The longest computational duration is given by ASTNN with 1034 min.


Table 3: Detection results of other clone detectors on the dataset.

Duration in seconds

## **6 Discussion**

The goal of the study was to provide evidence of whether behavioral equality of model elements generalizes to semantic equality of code elements (Q1). Furthermore, we were interested in the skip evaluation and its performance implications (Q2 and Q3).

#### **6.1 Research Question 1 — Detection Performance**

Table 1 and Table 2 present strong results in favor of Q1. The MCC for the top-5 experiments was *very strong* with all MCCs being above 0.9. Even the worst experiment still yielded a *moderate* performance of 0.749.

Table 3 provides additional context to the results by presenting the detection results of alternative clone detectors. As expected, tools relying heavily on the textual representation of clones have very low recall (Simian, NiCad, CCAligner, PMD) on the dataset. Most clones found by the alternative tools span only a few lines of code. In contrast, iClones finds large clones that include array accesses and manipulations. ASTNN is the best comparison tool and finds many clones with good precision. The approach is sensitive to hyper-parameters and to the training and test split, leading in some cases to a test performance close to an MCC of 0. The low recall of Type 1-3 detectors indicates the high quality of the dataset. The moderate recall of Type 3/4 detectors indicates the high quality of SCD-PSM. Given this evidence, we conclude that Q1 holds.

Q1 — Behavioral equality between model elements generalizes to semantic equality of code elements, allowing for semantic clone detection via probabilistic software modeling.

#### **6.2 Research Question 2 — Skip Evaluation Scalability**

The goal of the static and dynamic stages is to reduce the number of evaluations that the model stage must conduct. Each stage incurs an increasing cost of evaluation per candidate, with the model stage taking the largest share of the evaluation time, 89 %. Every TP has to pass the model stage to be declared a clone (rejecting pipeline). The skip evaluation avoided, on average, the recomputation of 74 % (340) of the TP candidate pairs. The greedy evaluation avoided, on average, the evaluation of 37 % of IO pairs. This offloads most of the evaluation time to the earlier stages, which are computationally inexpensive, while shortcutting the model stage. In comparison to the alternative detectors, SCD-PSM needs substantially more time to compute (1.32 min vs. 29 min). An exception is ASTNN, which has a runtime similar to that of SCD-PSM. Most of the runtime of SCD-PSM is caused by operational overhead, e.g., loading the model from the database. Optimizing this overhead could, as a theoretical maximum, reduce the overall runtime on the dataset to 6.49 min given the average durations for each stage in Table 2. In conclusion, the skip evaluation reduces the number of model evaluations, which are responsible for most of the evaluation time, down to a quarter.

Q2 — Skip evaluation reduces the number of evaluations for the most expensive stage (model) in the SCD-PSM pipeline significantly.

#### **6.3 Research Question 3 — Skip Evaluation Effects**

Skip evaluation can cause cascading errors given an FP. Once an FP is introduced, every semantic clone related to the FP has a chance to become an FP in the same (wrong) clone class itself. These cascading FPs are potential sources of serious performance degradation. Skip evaluation experiments are ranked higher and are significantly better than experiments that conducted an exhaustive search. However, the absolute performance gain is only an MCC of 0.056, hinting at a per-chance significance introduced by the small sample size (16 experiments). Nevertheless, given the evidence in Table 1 and Section 5.6, we can conclude that skip evaluation does not negatively affect the performance of the detector.
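The cascading effect can be made concrete with a small sketch. It assumes that skip evaluation exploits the transitivity of clone classes, as the discussion above suggests; the union-find structure, the fragment names, and the faulty detector are hypothetical and do not reproduce the SCD-PSM implementation.

```python
from itertools import combinations

class CloneClasses:
    """Union-find over code fragments; one set per detected clone class."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def detect(fragments, is_clone):
    """Return (detected clone pairs, number of skipped evaluations)."""
    classes, detected, skipped = CloneClasses(), [], 0
    for a, b in combinations(fragments, 2):
        if classes.find(a) == classes.find(b):
            detected.append((a, b))  # implied by earlier results, not re-evaluated
            skipped += 1
        elif is_clone(a, b):         # the expensive evaluation
            detected.append((a, b))
            classes.union(a, b)
    return detected, skipped

# Hypothetical ground truth: fragments with the same label behave identically.
label = {"f_a": "fact", "f_b": "fact", "f_c": "fact", "s_a": "sort", "s_b": "sort"}
exact = lambda a, b: label[a] == label[b]
# A single wrong answer from the detector merges two clone classes.
faulty = lambda a, b: exact(a, b) or {a, b} == {"f_a", "s_a"}

print(detect(list(label), exact))
print(detect(list(label), faulty))
```

With the exact detector one evaluation is skipped and no error occurs. With the faulty detector, the single wrong answer for (f_a, s_a) causes two further pairs to be accepted without evaluation, which is the cascading behavior described above.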

Q3 — The skip evaluation has no negative impact on the performance of the detector given low false-positive rates.

#### **7 Limitations**

SCD-PSM inherits the limitations of PSM, such as its need for a runnable program to build the model. PSM only models the application structure and its data, not references. References are changing addresses with no relation to the running program. Hence, they have no meaningful underlying distribution that can be modeled. However, once references are dereferenced, e.g., by accessing a field, their accessed data will be part of the model and therefore usable in SCD-PSM. Nevertheless, algorithms with the sole purpose of manipulating references do not work with SCD-PSM.

PSM explodes lists into singular values, since distributions do not contain any order information. This means executables that change the order of sequences are matched based on the values, not their order. As a consequence, an ascending and a descending sorting algorithm are treated as semantically equivalent, leading to a false positive. Extending PSM to distributions of sequences alleviates the issue but is not a trivial task.
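A minimal sketch of this limitation follows. It compares the outputs of two sorting functions once as sequences and once as unordered collections of values; the `Counter` comparison is only a stand-in for comparing value distributions and is not how PSM evaluates models.

```python
from collections import Counter

def sort_ascending(values):
    return sorted(values)

def sort_descending(values):
    return sorted(values, reverse=True)

inputs = [3, 1, 2, 2, 5]  # illustrative input list
asc, desc = sort_ascending(inputs), sort_descending(inputs)

# Exploding the output lists into single values discards the order,
# so both outputs carry the same values with the same frequencies.
same_value_distribution = Counter(asc) == Counter(desc)
same_sequence = asc == desc

print(same_value_distribution, same_sequence)
```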

SCD-PSM cannot detect Type 2-3 clones since textual similarities represent a different problem set. A proof can easily be constructed by adding an arbitrary number of statements that do not influence the behavior of the program but mislead text-based detectors. Inversely, changing one character, e.g., a multiplication to a division, may alter the entire behavior while preserving the general textual similarity.

We employed a controlled laboratory evaluation strategy that allowed us to exactly evaluate the performance metrics and fairly compare them between different clone detectors. This follows a recent trend [38,46,48] in the light of some criticism of opportunistic evaluations on arbitrary open source projects. The controlled laboratory evaluation provides purely functional performance results given a fixed and controlled sample of programs. The generalizability of results obtained from laboratory evaluations is limited; using an opportunistic evaluation strategy avoids this problem. However, that strategy is prone to biases caused by the human oracles (often the authors themselves) or proxy oracles that evaluate the clones. Moreover, a fair comparison between detectors is hardly possible because the true recall of clones is in general unknown. A combination of both evaluation strategies may yield precise and generalizable results. The extension to this study is part of our future work.

## **8 Threats to Validity**

A threat to validity in any semantic clone detection study is given by the programs and code fragments used in the evaluation. Semantic clones may not exhibit the same functional behavior or share too many lexicographical similarities. This study tested every clone class on its behavioral equality. Furthermore, we evaluated text-, token-, graph- and model-based detectors capable of detecting Type 1-3 clones. The low performance of Type 1-3 detectors confirmed the high quality of semantic clones in the benchmark.

### **9 Related Work**

We started this article by defining what *semantic clones* means in the context of our approach (Section 3). While our definition is motivated by the capabilities of our approach, we can see strong similarities to the definition of Juergens [19]. Both definitions define behavioral similarity via IO relationships. Also, Juergens already discussed a notion of partial and conditional similarity. This understanding of Type 4 clones can be seen in multiple more recent studies [8,6,27]. In that, we see the progress of the community in terms of Type 4 clones as the definition becomes more specific.

Many studies evaluated textual clones. However, only a few studies have reported results on semantic clones without relaxing the definition of Type 4. Rattan et al. [34] provided a review of clone detection studies including approaches focused on Type 4 clones. They concluded that some approaches solve approximations (i.e., complex Type 3 clones) of Type 4 clones.

Test-based methods randomly trigger the execution of candidates and measure whether equal inputs cause similar outputs. Jiang and Su [18] were able to find semantically equivalent methods without any syntactic similarities. A similar approach was presented by Deissenboeck et al. [6]. One issue with test-based clone detection is that candidates need a similar signature. Differences in data types or the number of parameters cannot be handled effectively. SCD-PSM works similarly to test-based methods in that it observes the runtime and compares the resulting behavior. However, SCD-PSM builds generative models from the observed behavior, capable of generating, conditioning, and evaluating data. This allows SCD-PSM to bridge signature mismatches by imputing missing code elements and by using a generalized type system.
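The test-based idea, as opposed to the model-based comparison of SCD-PSM, can be sketched as follows. The functions and the helper `behave_alike` are hypothetical; the input range mirrors the U(0, 20) trigger used in the study. Note that the check only works because all candidates share the same signature, which is exactly the restriction mentioned above.

```python
import random

def factorial_iterative(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_recursive(n):
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def behave_alike(f, g, trials=50, seed=0):
    """Test-based check: equal random inputs must yield equal outputs."""
    rng = random.Random(seed)
    samples = (rng.randint(0, 20) for _ in range(trials))
    return all(f(x) == g(x) for x in samples)

print(behave_alike(factorial_iterative, factorial_recursive))
print(behave_alike(factorial_iterative, fibonacci))
```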

Zhao and Huang [51] developed DeepSim, which phrases the problem as a binary classification task. DeepSim uses neural networks to learn encodings of the control and data flow without observing the program's runtime. PSM also uses neural networks but learns an underlying representation of the data flow and runtime. DeepSim was also evaluated on a Google Code Jam dataset. It reached an F1 score of 0.76 on the GCJ 2016 competition, while SCD-PSM reached 0.967 on the GCJ 2017. While not entirely comparable, the results are a good approximation given the similarity in the datasets.

#### **10 Conclusions and Future Work**

In this article, we presented Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM). PSM builds a Probabilistic Model (PM) from a program that can be used to simulate or evaluate a program. We used these PMs to detect semantic clones in programs that have 0 % syntactic similarity.

We discussed the representation, search space, and the static-, dynamic-, and model-similarity stages forming the main aspects of SCD-PSM. The study evaluated SCD-PSM in great detail, resulting in an average MCC greater than 0.9. Also, the study showed the capability to control the false-positive rate, which is important for industry adoption. Finally, we concluded that behavioral equality of model elements generalizes to semantic equality of code elements.

Our future work focuses on constructing a comprehensive benchmark covering controlled and real-world systems for improved generalizability of clone detection studies. Furthermore, semantic clone detection has the potential to enable new methods for fault localization applications [42].

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## QMaxUSE: A Query-based Verification Tool for UML Class Diagrams with OCL Invariants

Hao Wu

Computer Science Department, Maynooth University, Maynooth, Ireland haowu@cs.nuim.ie

Abstract. Verifying whether a UML class diagram annotated with Object Constraint Language (OCL) constraints is consistent involves finding valid instances that provably meet its structural and OCL constraints. Recently, many tools and techniques have been proposed to find valid instances. However, they often do not scale well when the number of OCL constraints significantly increases. In this paper, we present a new tool called QMaxUSE that is capable of automatically verifying a large number of OCL invariants. QMaxUSE works by decomposing them into a set of different queries. It then uses an SMT solver to concurrently verify each query and pinpoints conflicting OCL invariants. Our evaluation results suggest that QMaxUSE can offer up to 30x efficiency improvement in verifying UML class diagrams with a large number of OCL invariants.

### 1 Introduction

Verifying the consistency of a UML class diagram annotated with OCL constraints is a challenging task [1,2,3]. This is because it requires finding an instance satisfying both structural and OCL constraints at the same time. To tackle this challenge, many tools and techniques have been proposed [4,5,6,7,8]. However, most of these tools and techniques do not scale well when the number of OCL invariants significantly increases [9,10,11,12,13,5,14,15,16]. These tools often time out or cannot pinpoint the conflicting OCL invariants that cause a UML class diagram to become inconsistent.

In this paper, we present a new tool, QMaxUSE, that is capable of verifying a large number of complex OCL invariants in an efficient manner. This is achieved by two distinct features provided within QMaxUSE: (1) a query language that allows users to select parts of a UML class diagram to be verified, and (2) a new specialised algorithm that is able to decompose a UML class diagram that has a large number of complex OCL invariants into different queries. These queries can then be verified concurrently via efficient SMT solving. A detailed explanation of our approach can be found in [17].

Related Work. Verifying the consistency of a UML class diagram has gained much attention in recent years, and many approaches and tools have been proposed. A UML class diagram can be considered as a graph, so graph-based approaches are naturally employed for reasoning about consistency [18,19,20,7,21]. Semeráth et al. proposed a new graph solver that is able to generate a much larger number of objects [22]. Their approach utilises a combination of multiple advanced graph-based and SAT-solving techniques to achieve large-scale graph generation. On the other hand, many tools incorporate logic solvers to support OCL constraint solving [14,16,23,24,25]. However, many of them do not scale well and cannot pinpoint conflicting OCL constraints when a UML class diagram is inconsistent. Our goal here is to provide an open-source tool that is capable of not only locating conflicting OCL constraints but also preserving high performance when the number of OCL constraints significantly increases.

https://doi.org/10.1007/978-3-030-99429-7\_17

### 2 Architecture

QMaxUSE is fully automatic and integrated with the USE modelling tool [26]. Currently, it is command-line based and runs on Windows 10 (x64), Ubuntu 20.04 (x64) and macOS Big Sur (x64). QMaxUSE is implemented in Java. It consists of nearly 33k lines of code, of which approximately 3.5k lines are dedicated to its algorithms. The latest version of QMaxUSE is available at [27]:

https://github.com/classicwuhao/qmaxuse

The architecture of QMaxUSE is shown in Figure 1. Overall, it has four layers: front-end, query engine, translation and solver.

Fig. 1. The overall architecture of QMaxUSE.

Front-end. At the front-end layer, QMaxUSE uses parsers from USE to generate ASTs (abstract syntax trees) for a class diagram and OCL invariants. QMaxUSE provides a simple query language that allows users to choose a part of a class diagram and its OCL invariants to be verified. To parse a query issued by a user, we have designed and implemented a query parser. This parser is able to read multiple queries from a single specification file and produce the corresponding ASTs.

Query Engine. QMaxUSE's query engine uses a set of selection algorithms to traverse the ASTs generated from the front-end layer to produce a query result. A query result essentially contains a set of classes, attributes, associations and OCL invariants to be verified. At this layer, QMaxUSE also provides a specialised algorithm (Decomposer) that is able to decompose a class diagram along with OCL invariants into a set of different queries. These queries can then be verified concurrently using a query verification procedure.

Translation. At the translation layer, QMaxUSE uses a first-order translator to translate a query into a set of first-order formulas that can be verified by the SMT solver. The translation here is similar to the one described in [8]. We use uninterpreted functions to encode classes or attributes and linear integer inequalities to capture the multiplicities at an association-end. For an OCL invariant, we traverse its AST and generate an SMT formula by using a combination of first-order theories.

Solver. We have designed a new interface (SolverManager) to optimise the interaction between QMaxUSE and the SMT solver. This interface can reduce extra overhead between our first-order translator and an SMT solver by minimising the number of API calls. Currently, QMaxUSE uses Z3 as its default SMT solver, and this new interface easily allows us to plug in other SMT solvers [28].

## 3 Design

### 3.1 Query

QMaxUSE allows a user to verify a particular set of features of a UML class diagram through a query language. A query expression accepted by QMaxUSE must use a select statement. It allows users to choose multiple features along with OCL invariants from a UML class diagram. A feature here may include a class, an attribute, an association or an OCL invariant. For example, the following query (query 1) selects the University, Department, Student and Module classes and the association teach, along with invariant inv2 of the Student class and the invariants defined under the Module class, from the UML class diagram in Figure 2.

query 1 : select University, Student.\*, Department:teach:Student with Student::inv2, Module::\*

Notably, we allow users to use a wildcard character ∗ to represent a set of features under a specified classifier. Further, it is quite common that an OCL invariant uses features from other classes in its expression. Hence, our selection algorithm implicitly selects these features during the execution of a query. Thus, query 1 also selects the Person class from Figure 2, since inv2 defined under the Student class imposes a constraint on the age attribute that is inherited from the Person class.
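The implicit selection can be pictured with a small sketch. The grammar handled by `parse_query` is inferred from query 1 alone, and the invariant table (including `inv3` and `inv4`) is a hypothetical excerpt, not the model of Figure 2; the sketch tracks classes and invariants only and is not QMaxUSE's selection algorithm.

```python
# Hypothetical excerpt of a model: which classes each invariant refers to.
INVARIANT_USES = {
    "Person::inv1": {"Person"},
    "Student::inv2": {"Student", "Person"},   # constrains the inherited age attribute
    "Module::inv3": {"Module"},
    "Module::inv4": {"Module", "Student"},
}

def parse_query(text):
    """Split 'select <features> with <invariants>' into two lists of strings."""
    body = text.strip()
    if not body.startswith("select "):
        raise ValueError("a query must use a select statement")
    features, _, invariants = body[len("select "):].partition(" with ")
    split = lambda part: [item.strip() for item in part.split(",") if item.strip()]
    return split(features), split(invariants)

def execute(text):
    """Return the selected classes and invariants, including implicit ones."""
    features, invariants = parse_query(text)
    classes = set()
    for feature in features:
        if ":" in feature:                       # Class:association:Class
            left, _assoc, right = feature.split(":")
            classes.update({left, right})
        else:                                    # Class, Class.attr or Class.*
            classes.add(feature.split(".")[0])
    selected = set()
    for inv in invariants:
        owner, name = inv.split("::")
        selected.update(i for i in INVARIANT_USES
                        if i == inv or (name == "*" and i.startswith(owner + "::")))
    for inv in selected:                         # implicit selection
        classes |= INVARIANT_USES[inv]
    return classes, selected

classes, invs = execute(
    "select University, Student.*, Department:teach:Student "
    "with Student::inv2, Module::*"
)
print(sorted(classes))
print(sorted(invs))
```

Person is never named in the query, yet it ends up in the result because `Student::inv2` refers to it.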

Fig. 2. A UML class diagram with the 8 OCL class invariants shows how the students in each department can choose multiple modules to study.

For each query issued by a user, QMaxUSE launches a verification procedure that is able to verify the consistency of the collected features. This verification procedure casts the set of collected features to a set of SMT formulas that can be checked by an SMT solver. If the formulas are not satisfiable, QMaxUSE reports inconsistencies by pinpointing the OCL invariants that cause conflicts. For example, QMaxUSE reports that there is a conflict between OCL invariants inv1 and inv2 after verifying the following query (query 2). It shows that inv1 and inv2 together make the Student class impossible to instantiate. Figure 3 shows a screenshot of QMaxUSE after executing query 2.

query 2 : select Person.\*, Student.\* with Person::inv1, Student::inv2

#### 3.2 Concurrent Verification

QMaxUSE has a dedicated algorithm for performing concurrent verification on UML class diagrams with a large number of OCL invariants. The main idea of this algorithm is to decompose a large number of complex OCL invariants into different queries. For each query, it launches a verification thread to verify that query. In this way, QMaxUSE is able to shift solving a large number of complex formulas from a single run into multiple simultaneous runs on a collection of much smaller and less complex formulas. Therefore, it is particularly powerful when the number of OCL invariants grows significantly.

A high-level structure of this dedicated algorithm is shown in Algorithm 1 [17]. This algorithm takes a UML class diagram annotated with OCL invariants (denoted as model) as its input and outputs a set S that contains all possible conflicting features. It first employs a novel decomposition algorithm to decompose a model into different parts and produces a query for each part of this model. It then executes each query and produces a new query result by explicitly choosing those features that are used by an OCL invariant expression in a query. Once the set of query results is generated, Algorithm 1 launches a number of threads to verify the formulas (Φ) that encode the query results. If Φ is not satisfiable, there must be conflicts. Finally, our algorithm extracts those conflicting features and saves them into the set S.

Fig. 3. A screenshot of running query 2 in QMaxUSE.

#### Algorithm 1: ConcurrentVerification

    Input : A UML class diagram annotated with OCL invariants (model)
    Output: A set of conflicting features that cause inconsistencies (S)
    R ← ∅ ∧ S ← ∅;
    Q ← Decompose(model);                 /* produce a set of queries Q */
    foreach q ∈ Q do
        q_r = q.execute();                /* create a new query result q_r */
        /* add features used in an OCL invariant into a query result q_r */
        foreach inv ∈ q do
            q_r.add(inv.classes(), inv.attributes(), inv.associations(), inv);
        end
        R.add(q_r);
    end
    /* verify model with |R| number of threads */
    foreach q_r ∈ R do
        Φ ← Translate(q_r);               /* cast q_r to SMT formulas */
        ThreadManager.start(QueryVerification(Φ, S));
        /* check satisfiability of Φ and save each conflict that occurs in Φ in the set S */
    end
    return S;
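The overall shape of Algorithm 1 can be sketched in a few lines. This is an illustration under strong assumptions: the invariants are hypothetical predicates over a single integer attribute, `decompose` returns a fixed grouping, and a brute-force search over a small domain stands in for the translation to SMT formulas and the call to the solver.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical invariants over a single integer attribute `age`.
INVARIANTS = {
    "Person::inv1": lambda age: age < 18,
    "Student::inv2": lambda age: age >= 18,
    "Module::inv3": lambda age: age >= 0,
}

def decompose(model):
    """Stand-in for Decompose: one query per group of related invariants."""
    return [["Person::inv1", "Student::inv2"], ["Module::inv3"]]

def satisfiable(names, domain=range(0, 120)):
    """Toy replacement for the SMT check: brute force over a small domain."""
    return any(all(INVARIANTS[n](age) for n in names) for age in domain)

def verify_query(query):
    """Return the smallest conflicting subsets of the query's invariants."""
    if satisfiable(query):
        return set()
    for size in range(1, len(query) + 1):
        conflicts = {frozenset(c) for c in combinations(query, size)
                     if not satisfiable(c)}
        if conflicts:
            return conflicts
    return set()

def concurrent_verification(model):
    queries = decompose(model)
    conflicts = set()
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        for result in pool.map(verify_query, queries):  # one task per query
            conflicts |= result
    return conflicts

print(concurrent_verification(model=None))
```

Collecting the per-query results in the calling thread avoids sharing the set S between workers, which is a simplification of the pseudocode above, where each verification saves its conflicts into S directly.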


Table 1. Evaluation results. Invs=number of OCL invariants, Nodes=size of invariant ASTs, Quant=number of quantifiers, Op=number of operators. TO= Timeout (20min), MaxUSE=QMaxUSE without query and concurrent verification support.

### 4 Results

We use a benchmark from [8] to show the size and complexity of the OCL invariants that QMaxUSE can handle. This benchmark has two parts. Part A covers only a small number of toy examples from [29], and Part B covers a wide range of OCL language features, including nested quantifiers, collections, logical/arithmetic operations and navigations. In particular, Part B contains a large number of complex and conflicting OCL invariants. Table 1 summarises part of our evaluation results for QMaxUSE<sup>1</sup>. The evaluation was carried out on an Intel(R) Core(TM) machine with six 2.8 GHz cores and 16 GB of memory. The underlying SMT solver is Z3 (version 4.8.10). As can be seen, QMaxUSE is able to handle much larger sets of OCL invariants. It gains up to a 30x efficiency improvement when verifying a large number of complex OCL invariants. For example, it takes 131.23 seconds to verify B3 in Group B without our query and concurrent techniques, while QMaxUSE finishes its verification in just 4.6 seconds.

#### 5 Conclusion

In this paper, we have presented our latest verification tool, QMaxUSE. We believe that QMaxUSE can add significant value to the modelling community for two reasons. (1) Users are now able to use QMaxUSE to incrementally verify different parts of their class diagrams by issuing different queries. (2) Our preliminary evaluation results indicate that QMaxUSE can scale well on a large number of complex OCL invariants because of our concurrent verification algorithm.

<sup>1</sup> The complete benchmark is packed within QMaxUSE release files.

## References



## Test-Comp Contributions

## Advances in Automatic Software Testing: Test-Comp 2022

Dirk Beyer <sup>B</sup>

LMU Munich, Munich, Germany

Abstract. Test-Comp 2022 is the 4th edition of the Competition on Software Testing. Research competitions are a means to provide annual comparative evaluations. Test-Comp focusses on fully automatic software test generators for C programs. The results of the competition shall be reproducible and provide an overview of the current state of the art in the area of automatic test-generation. The competition was based on 4 236 test-generation tasks for C programs. Each test-generation task consisted of a program and a test specification (error coverage, branch coverage). Test-Comp 2022 had 12 participating test generators from 5 countries.

Keywords: Software Testing · Test-Case Generation · Competition · Program Analysis · Software Validation · Software Bugs · Test Validation · Test-Comp · Benchmarking · Test Coverage · Bug Finding · Test-Suites · SV-Benchmarks · BenchExec · TestCov · CoVeriTeam

### 1 Introduction

The Competition on Software Testing (Test-Comp, https://test-comp.sosy-lab.org, [5, 6, 7, 9]) showcases the state of the art in the area of automatic software testing. For the 4th time, the competition provides an overview of the results achieved by implementations of the most recent ideas, concepts, and algorithms for fully automatic test generation. This competition report describes the (updated) rules and definitions, presents the competition results, and discusses some interesting facts about the execution of the competition experiments. We use BenchExec [20] to execute the benchmarks and the results are presented in tables and graphs on the competition web site (https://test-comp.sosy-lab.org/2022/results) and are available in the accompanying archives (see Table 3).

This report extends previous reports on Test-Comp [5, 6, 7, 9].

Reproduction packages are available on Zenodo (see Table 3).

<sup>B</sup> dirk.beyer@sosy-lab.org

© The Author(s) 2022

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 321–335, 2022. https://doi.org/10.1007/978-3-030-99429-7\_18

Competition Goals. In summary, the goals of Test-Comp are the following [6]:


Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method and there are many competitions [3]. We refer to the report from Test-Comp 2020 [6] for a more detailed discussion and give here only the references to the most related competitions [3, 10, 41, 43].

## 2 Definitions, Formats, and Rules

Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [5]. In the following, we repeat some important definitions that are necessary to understand the results.

Test-Generation Task. A test-generation task is a pair of an input program (program under test) and a test specification. A test-generation run is a non-interactive execution of a test generator on a single test-generation task, in order to generate a test suite according to the test specification. A test suite is a sequence of test cases, given as a directory of files according to the format for exchangeable test-suites.<sup>1</sup>

Execution of a Test Generator. Figure 1 illustrates the process of executing one test generator on the benchmark suite. One test run for a test generator gets

<sup>1</sup> https://gitlab.com/sosy-lab/software/test-format

Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [6])

as input (i) a program from the benchmark suite and (ii) a test specification (cover bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format. The test runs are executed centrally by the competition organizer. The test-suite validator takes as input the test suite from the test generator and validates it by executing the program on all test cases: for bug finding it checks if the bug is exposed and for coverage it reports the coverage. We use the tool TestCov [19] <sup>2</sup> as test-suite validator.

Test Specification. The specification for testing a program is given to the test generator as input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp 2022).

The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [30]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks like: COVER(init(main()), FQL(COVER EDGES(@DECISIONEDGE))).
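As a minimal sketch, the specification strings described above can be assembled from their parts as follows. The helper names (`fql_spec`, `cover_branches`, `cover_call`) are hypothetical and not part of any Test-Comp tooling; the exact whitespace used in the official property files may differ.

```python
# Sketch: build test specifications of the form
#   COVER(init(main()), FQL(<coverage definition>))
# Helper names are illustrative assumptions, not an official API.

def fql_spec(coverage_definition: str, entry: str = "main") -> str:
    """Wrap an FQL coverage definition into a complete specification."""
    return f"COVER(init({entry}()), FQL({coverage_definition}))"


def cover_branches() -> str:
    """Specification asking that all branches be covered."""
    return fql_spec("COVER EDGES(@DECISIONEDGE)")


def cover_call(function_name: str) -> str:
    """Specification asking that at least one call to the function be covered."""
    return fql_spec(f"COVER EDGES(@CALL({function_name}))")


if __name__ == "__main__":
    print(cover_branches())
    print(cover_call("foo"))
```

The first helper reproduces the complete specification quoted in the text; the second shows the bug-finding variant for a function named foo.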

Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2022; there was no change from 2020 (except that special function \_\_VERIFIER\_error does not exist anymore).

Task-Definition Format 2.0. Test-Comp 2022 again used the task-definition format in version 2.0.

License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [7].

<sup>2</sup> https://gitlab.com/sosy-lab/software/test-suite-validator


Table 1: Coverage specifications used in Test-Comp 2022 (similar to 2019–2021)

## 3 Categories and Scoring Schema

Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasks,<sup>3</sup> which is also used by SV-COMP [8]. As in 2020 and 2021, we selected all programs for which the following properties were satisfied (see issue on GitHub<sup>4</sup> and report [7]):


This selection yielded a total of 4 236 test-generation tasks, namely 776 tasks for category Error Coverage and 3 460 tasks for category Code Coverage. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site.<sup>6</sup> Figure 2 illustrates the category composition.

Category Error-Coverage. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. Every run will be started by a batch script, which produces for every tool and every test-generation task one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that explores the bug (i.e., the specified function was called), and 0 points, otherwise.

Category Branch-Coverage. The second category is to cover as many branches of the program as possible. The coverage criterion was chosen because many test generators support this standard criterion by default. Other coverage criteria can be reduced to branch coverage by transformation [29]. Every run will be started by a batch script, which produces for every tool and every

<sup>3</sup> https://github.com/sosy-lab/sv-benchmarks

<sup>4</sup> https://github.com/sosy-lab/sv-benchmarks/pull/774

<sup>5</sup> https://test-comp.sosy-lab.org/2022/rules.php

<sup>6</sup> https://test-comp.sosy-lab.org/2022/benchmarks.php

Fig. 2: Category structure for Test-Comp 2022; compared to Test-Comp 2021, sub-category ProductLines was added to both main categories Cover-Error and Cover-Branches

test-generation task the coverage of branches of the program (as reported by TestCov [19]; a value between 0 and 1) that are executed for the generated test cases. The score is the returned coverage.

Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time,

Fig. 3: Benchmarking components of Test-Comp and competition's execution flow (same as for Test-Comp 2020)

Table 2: Publicly available components for reproducing Test-Comp 2022


which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [4], page 597).
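The scoring and ranking rules above can be illustrated with a short sketch. The per-task scores follow the text directly. The normalization for meta categories is our reading of the scheme cited as [4] (sum of per-category average scores, multiplied by the average number of tasks per category) and should be checked against that report. All function names and the numbers in the usage example are illustrative assumptions, not competition data.

```python
from typing import Dict, List, Tuple


def error_coverage_score(bug_exposed: bool) -> float:
    """1 point if a generated test case exposes the bug, 0 points otherwise."""
    return 1.0 if bug_exposed else 0.0


def branch_coverage_score(coverage: float) -> float:
    """The score is the branch coverage reported by the validator (0 to 1)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    return coverage


def meta_category_score(sub_scores: Dict[str, List[float]]) -> float:
    """ASSUMPTION: normalization as we understand it from [4]."""
    normalized = [sum(s) / len(s) for s in sub_scores.values()]
    average_size = sum(len(s) for s in sub_scores.values()) / len(sub_scores)
    return sum(normalized) * average_size


def rank(results: List[Tuple[str, float, float]]) -> List[str]:
    """Order tools by score (descending); ties are broken by CPU time (ascending)."""
    ordered = sorted(results, key=lambda r: (-r[1], r[2]))
    return [tool for tool, _, _ in ordered]


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(meta_category_score({"A": [1.0, 0.0, 1.0, 1.0], "B": [0.5, 0.5]}))
    print(rank([("tool-a", 10.0, 5.0), ("tool-b", 10.0, 3.0), ("tool-c", 12.0, 9.0)]))
```

Normalizing by category size keeps a large sub-category from dominating a meta category merely because it contains more tasks.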

## 4 Reproducibility

We followed the same competition workflow that was described in detail in the previous competition report (see Sect. 4, [9]). All major components that were used for the competition were made available in public version-control repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 3, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [7] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.

In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3).

The competition used CoVeriTeam [17] <sup>7</sup> again to provide participants access to the actual competition machines. The competition report of SV-COMP 2022 provides a description on reproducing individual results and on trouble-shooting (see Sect. 3, [10]).

<sup>7</sup> https://gitlab.com/sosy-lab/software/coveriteam


Table 3: Artifacts published for Test-Comp 2022

Table 4: Competition candidates with tool references and representing jury members; new indicates first-time participants, <sup>∅</sup> indicates hors-concours participation


#### 5 Results and Discussion

This section presents the results of the competition experiments. The report shall help to understand the state of the art and the advances in fully automatic test generation for whole C programs, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.

Participating Test Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2022. (The competition jury consists of the chair and one member of each participating team.) An online table with information about all participating systems is provided on the competition web site.<sup>8</sup> Table 5 lists the features and technologies that are used in the test generators.

There are test generators that did not actively participate (e.g., tester archives taken from last year) and that are not included in rankings. Those are called hors-concours participations and the tool names are labeled with a symbol (<sup>∅</sup>).

<sup>8</sup> https://test-comp.sosy-lab.org/2022/systems.php


Table 5: Technologies and features that the test generators used

Computing Resources. The computing environment and the resource limits were the same as for Test-Comp 2020 [6]: Each test run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The test-suite validation was limited to 2 processing units, 7 GB of memory, and 5 min of CPU time. The machines for running the experiments are part of a compute cluster that consists of 167 machines; each test-generation run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86\_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BenchExec [20] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloud<sup>9</sup> to distribute, install,

<sup>9</sup> https://vcloud.sosy-lab.org


Table 6: Quantitative overview over all results; empty cells mark opt-outs; new indicates first-time participants, <sup>∅</sup> indicates hors-concours participation

run, and clean-up test-case generation runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we use CPU Energy Meter [21] (integrated in BenchExec [20]). Further technical parameters of the competition machines are available in the repository which also contains the benchmark definitions. <sup>10</sup>

One complete test-generation execution of the competition consisted of 50 056 single test-generation runs. The total CPU time was 339 days and the consumed energy 88 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 50 832 single test-suite validation runs. The total consumed CPU time was 15 days. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a total of 311 754 test-generation runs (consuming 4.9 years of CPU time). The CPU energy was not measured during preruns.

Quantitative Results. The quantitative results are presented in the same way as last year: Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation

<sup>10</sup> https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp22


Table 7: Overview of the top-three test generators for each category (measurement values for CPU time and energy rounded to two significant digits)

tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the test generator opted out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site<sup>11</sup> and in the results artifact (see Table 3). Table 7 reports the top three test generators for each category. The consumed run time (column 'CPU Time') is given in hours and the consumed energy (column 'Energy') is given in kWh.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [20] because these visualizations make it easier to understand the results of the comparative evaluation. The web site<sup>11</sup> and the results artifact (Table 3) include such a plot for each category; as an example, we show the plot for category Overall (all test-generation tasks) in Fig. 4. We had 11 test generators participating in category Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [4]). A more detailed discussion of score-based quantile plots for testing is provided in the Test-Comp 2019 competition report [7].

Alternative Rankings. Table 8 is similar to Table 7, but contains the alternative ranking category Green Testing. Column 'Quality' gives the score in score points (sp), column 'CPU Time' the CPU usage in hours (h), column

<sup>11</sup> https://test-comp.sosy-lab.org/2022/results

Fig. 4: Quantile functions for category Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by test-generation runs below a certain number of test-generation tasks (y-coordinate). More details were given previously [7]. The graphs are decorated with symbols to make them better distinguishable without color.


Table 8: Alternative rankings; quality is given in score points (sp), CPU time in hours (h), energy in kilo-watt-hours (kWh), the first rank measure in kilojoule per score point (kJ/sp), and the second rank measure in score points (sp);

'CPU Energy' the CPU usage in kilo-watt-hours (kWh), and column 'Rank Measure' reports the values for the rank measure.

Green Testing: Low Energy Consumption. Since a large part of the cost of test generation is caused by the energy consumption, it might be important to also consider the energy efficiency in rankings, as a complement to the official Test-Comp ranking. This alternative ranking category uses the energy consumption per score point as rank measure: $\frac{\text{CPU Energy}}{\text{Quality}}$, with the unit kilo-joule per score point (kJ/sp). The energy is measured using CPU Energy Meter [21], which we use as part of BenchExec [20].
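To make the unit conversion in this rank measure explicit, the following sketch computes the energy per score point from an energy value in kWh and a quality value in score points, using 1 kWh = 3600 kJ. The function name and the input values in the usage line are hypothetical and are not taken from the competition results.

```python
KJ_PER_KWH = 3600.0  # 1 kWh = 3.6 MJ = 3600 kJ


def green_rank_measure(cpu_energy_kwh: float, quality_sp: float) -> float:
    """Energy consumption per score point, in kJ/sp (lower is better)."""
    if quality_sp <= 0:
        raise ValueError("rank measure is undefined for a non-positive score")
    return cpu_energy_kwh * KJ_PER_KWH / quality_sp


if __name__ == "__main__":
    # Made-up inputs: 1 kWh spent for 1000 score points.
    print(green_rank_measure(1.0, 1000.0))
```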

New Test Generators. To acknowledge the test generators that participated for the first time in Test-Comp, we list them in Table 9. CMA-ES Fuzz<sup>∅</sup> and FuSeBMC participated for the first time in


Table 9: New test generators in Test-Comp 2021 and Test-Comp 2022; column 'Subcategories' gives the number of executed categories

Fig. 5: Number of evaluated test generators for each year (top: number of first-time participants; bottom: previous year's participants)

Test-Comp 2021, and Legion/SymCC<sup>new</sup> participated first in Test-Comp 2022. Table 9 also reports the number of subcategories in which the tools participated.

## 6 Conclusion

For the 4th time, the Competition on Software Testing took place and provides an overview of test-generation tools for C programs. The competition event attracted 12 participating teams (see Fig. 5 for the participation numbers and Table 4 for the details). The competition is an off-site competition, and the execution of the experiments is fully automatic and reproducible. To ensure transparency, all components are made available in public repositories and a jury (consisting of members from each team) oversees the process. The produced test suites are validated by the test-suite validator TestCov. The results of the competition are presented at the 25th International Conference on Fundamental Approaches to Software Engineering at ETAPS 2022.

Data-Availability Statement. The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 2. For easy access, the results are presented also online on the competition web site https://test-comp.sosy-lab.org/2022/results.

Funding Statement. This project was funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 418257054 (Coop).

### References


Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## FuSeBMC v4: Smart Seed Generation for Hybrid Fuzzing (Competition Contribution)

Kaled M. Alshmrany<sup>1,2</sup> (✉)\*, Mohannad Aldughaim<sup>1</sup>, Ahmed Bhayat<sup>1</sup>, and Lucas C. Cordeiro<sup>1</sup>

> <sup>1</sup> University of Manchester, Manchester, UK <sup>2</sup> Institute of Public Administration, Jeddah, Saudi Arabia kaled.alshmrany@postgrad.manchester.ac.uk

Abstract. *FuSeBMC* is a test generator for finding security vulnerabilities in C programs. In Test-Comp 2021, we described a previous version that incrementally injected labels to guide Bounded Model Checking (BMC) and Evolutionary Fuzzing engines to produce test cases for code coverage and bug finding. This paper introduces an improved version of *FuSeBMC* that utilizes both engines to produce smart seeds. First, the engines run with a short time limit on a lightly instrumented version of the program to produce the seeds. The BMC engine is particularly useful in producing seeds that can pass through complex mathematical guards. Then, *FuSeBMC* runs its engines with extended time limits using the smart seeds created in the previous round. *FuSeBMC* manages this process in two main ways. Firstly, it uses *shared memory* to record the labels covered by each test case. Secondly, it evaluates test cases, and those of high impact are turned into seeds for subsequent test fuzzing. In this year's competition, we participate in the *Cover-Error*, *Cover-Branches*, and *Overall* categories. The Test-Comp 2022 results show that we significantly increased our code coverage score from last year, outperforming all tools in all categories.

Keywords: Automated Test-Case Generation · Symbolic Execution · Bounded Model Checking · Fuzzing · Security · Seed.

## 1 Overview

Software testing is one of the most crucial phases in software development [11]. Tests often expose critical bugs in software applications. In earlier work [4], we presented *FuSeBMC*, an automated test generation tool that exploits the combination of Fuzzing and BMC. *FuSeBMC* achieved second place in Test-Comp 2021 [5,3] and first place in the *Cover-Error* category. It ranked fourth in the *Cover-Branches* category. This year, we introduce a new version of *FuSeBMC* (v4) that adds smart seed generation and shared memory amongst other improvements and features. The new version significantly improves on the previous version, particularly relating to code coverage. One of the primary contributions of this paper is the linking of a grey-box fuzzer with a bounded model checker. A bounded model checker works by treating a program as a state transition system and then checking whether there exists a transition in this system of length less than a bound k that violates the property to be verified [6,8]. We leverage

<sup>\*</sup>Jury Member

this power of model checkers as a method for smart seed generation. Here, we rate seeds on two metrics. First, the depth of the deepest goal covered by the seed. Second, the number of goals covered uniquely by the seed. Seeds that rate highly on these metrics are called *smart*. During grey-box fuzzing, if a particular branch has not been explored, BMC can be used to provide a model (set of assignments to input variables) that reaches the branch. This model is a smart seed since it covers a previously unexplored branch. It is then added to a seed store. Periodically, seeds are selected from the store for further grey-box fuzzing based on the criteria mentioned above. However, BMC can be slow and resource-intensive. As an alternative, we also carry out a lightweight static program analysis to recognize certain restricted forms of input verification. We analyze the code for conditions on the input variables and ensure that seeds are only selected if they pass these conditions. Together, these contributions turn *FuSeBMC* into a world-class fuzzer.
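The two rating metrics can be illustrated with a small sketch. It assumes that, for each seed, the set of covered goal labels is known and that each goal has a depth in the reachability graph. How the two metrics are combined is not stated in the text; the lexicographic ordering used here (depth first, uniqueness second) is an assumption, as are all names in the code.

```python
from typing import Dict, List, Set


def rate_seeds(coverage: Dict[str, Set[str]], goal_depth: Dict[str, int]) -> List[str]:
    """Return seed names ordered from most to least 'smart'.

    coverage:   seed name -> set of goal labels the seed covers
    goal_depth: goal label -> depth of the goal in the program
    """

    def deepest_goal(seed: str) -> int:
        # Metric 1: depth of the deepest goal covered by the seed.
        return max((goal_depth[g] for g in coverage[seed]), default=0)

    def uniquely_covered(seed: str) -> int:
        # Metric 2: number of goals that no other seed covers.
        others: Set[str] = set()
        for name, goals in coverage.items():
            if name != seed:
                others |= goals
        return len(coverage[seed] - others)

    # ASSUMPTION: compare by depth first, then by number of unique goals.
    return sorted(coverage, key=lambda s: (deepest_goal(s), uniquely_covered(s)), reverse=True)


if __name__ == "__main__":
    cov = {"s1": {"g1", "g2"}, "s2": {"g1", "g3"}, "s3": {"g1"}}
    depth = {"g1": 1, "g2": 2, "g3": 3}
    print(rate_seeds(cov, depth))
```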

## 2 Test Generation Approach

Figure 1 provides an overview of the components within *FuSeBMC* and how these interact. *FuSeBMC* makes use of the Clang tooling infrastructure [1] to instrument programs. In addition, *FuSeBMC* employs three engines in its reachability analysis: one BMC and two fuzzing engines. ESBMC [9,10] is a state-of-the-art SMT-based bounded model checker. For the two fuzzers, one is based on the American Fuzzy Lop (AFL) [7,2], and the other is a custom fuzzer, which we refer to here as the *selective fuzzer* (see [4] for details). In the sections below, we detail how these components work together.

Fig. 1. *FuSeBMC* v4 Framework. This figure illustrates the major components of the *FuSeBMC* test generator and how they interact. Note in particular the seed store, which interacts with the BMC/AFL and the shared memory to produce test cases.

Code Instrumentation. The *FuSeBMC* front-end uses the Clang tooling infrastructure [1] to parse a C program and produce an Abstract Syntax Tree (AST). While traversing the AST, *FuSeBMC* injects labels into each branch, including every conditional statement, loop, and function. Using these labels, *FuSeBMC* can measure the code coverage.

Reachability Graph Analysis. After instrumenting the C program, *FuSeBMC* analyzes it and produces a reachability graph. The graph assigns each goal label to the code block it is located in. Then, *FuSeBMC* ranks goals depending on the strategy chosen. For example, one strategy, which we used in Test-Comp 2022, is to prefer deeper goals over shallower goals. This strategy improves the performance of *FuSeBMC* since a test case that covers a deep goal will also cover shallower goals on the path to it. *FuSeBMC* also ranks some coverage metrics over others, such as conditional coverage over loop coverage.

Seed Generation. A unique aspect of the latest version of *FuSeBMC* is a seed generation phase that is run prior to the start of the principal reachability analysis. In this phase, *FuSeBMC* first lightly instruments the code under test by limiting loop bounds and assuming a narrow range of values for input variables. The bounds on input variables are further limited by carrying out a lightweight static analysis to recognize code that applies verification conditions to input variables. After instrumenting the code, *FuSeBMC* runs its fuzzing and BMC engines with concise time limits (60 s for Test-Comp 2022). The test cases generated by these engines are ranked, and the highest impact test cases are selected as smart seeds for the next round. The selected seeds are added to the *seed store*. The impact of a test case is measured using two metrics.


ESBMC is particularly effective at seed generation as its underlying SMT solvers can be used to discover test cases that circumvent complex mathematical guards. Note that we do not rely on any specific features of the models returned by the SMT solvers. Instead, the strength of the method lies in the solvers' ability to return *some* model that can satisfy a guard and cover goals lying beyond. A fuzzer on its own, randomly mutating a seed, struggles to explore program sections occurring behind complex guards [12].

Reachability Analysis Engines. In its primary phase, *FuSeBMC* carries out reachability analysis. Essentially, this involves running the engines in parallel with longer timeouts on the original, non-instrumented code with the fuzzer making use of the smart seeds. ESBMC is run using an incremental BMC strategy with some fixed time limit for each goal it attempts. *FuSeBMC*'s *Tracer* component coordinates the various engines through the use of *shared memory*. In this shared memory, we have two components. The first component is a "goals covered array" that stores the goals covered so far during the execution. Its purpose is to ensure that no effort is wasted through duplication of work. Secondly, the *Tracer* maintains a set of the currently most effective seeds for the fuzzer to use.

As the engines run and produce new test cases, the *Tracer* monitors these and evaluates them, adding those with the highest impact, as measured by the metrics above, to the seed store. Thus, the seed store is dynamically updated as the analysis progresses. Periodically, it selects a number of the most effective seeds from the store and adds them to shared memory for the fuzzers to use in their next fuzzing round. In parallel, ESBMC uses the "goals covered array" to select an as yet uncovered goal and attempts to find a test case that covers it. Test cases produced by ESBMC are passed directly to the store because they are likely to be beneficial for future fuzzing attempts.

For example, assume that the fuzzers are unable to cover some goal L due to a complex condition guarding it. ESBMC can be used to create a seed that covers L. This seed is then passed to the store and later selected for fuzzing. The fuzzers, armed with a seed that covers L, may well now be able to reach goals deeper than L along L's path. Thus, *FuSeBMC* combines the strengths of both types of engines. The BMC engine produces seeds that bypass complex guards and thereby help the fuzzers explore paths deep within the program.
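The coordination described above can be sketched in a simplified, single-process form. This is not the tool's implementation: real shared memory between processes is replaced by plain Python objects, the impact of a test case is approximated by the newly covered goals only, and the class and method names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class Tracer:
    """Toy model of a 'goals covered array' plus a seed store."""

    def __init__(self, goal_depth: Dict[str, int]) -> None:
        self.goal_depth = goal_depth
        # "Goals covered array": one flag per goal label.
        self.covered: Dict[str, bool] = {g: False for g in goal_depth}
        # Seed store: (impact, seed) pairs.
        self.seed_store: List[Tuple[Tuple[int, int], bytes]] = []

    def record(self, seed: bytes, goals_hit: List[str]) -> List[str]:
        """Register a test case; keep it as a seed if it covers new goals."""
        new_goals = [g for g in goals_hit if not self.covered[g]]
        for g in new_goals:
            self.covered[g] = True
        if new_goals:
            # ASSUMPTION: impact = (deepest new goal, number of new goals).
            impact = (max(self.goal_depth[g] for g in new_goals), len(new_goals))
            self.seed_store.append((impact, seed))
        return new_goals

    def next_uncovered_goal(self) -> Optional[str]:
        """Goal for the BMC engine to attempt next (deepest first)."""
        pending = [g for g, done in self.covered.items() if not done]
        return max(pending, key=lambda g: self.goal_depth[g]) if pending else None

    def select_seeds(self, count: int) -> List[bytes]:
        """Most effective seeds for the next fuzzing round."""
        best = sorted(self.seed_store, key=lambda item: item[0], reverse=True)
        return [seed for _, seed in best[:count]]


if __name__ == "__main__":
    tracer = Tracer({"L1": 1, "L2": 2, "L3": 3})
    tracer.record(b"fuzzer-input", ["L1"])
    print(tracer.next_uncovered_goal())
    tracer.record(b"bmc-input", ["L1", "L3"])
    print(tracer.select_seeds(1))
```

Keeping a single record of covered goals is what lets the engines avoid duplicating work: the BMC engine only targets goals that no test case has reached yet.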

## 3 Strengths and Weaknesses

The strengths of the latest version of *FuSeBMC* are as follows. It runs a dedicated seed generation phase to start the main fuzzing effort with high-quality, high-impact seeds. Furthermore, these seeds are constantly being updated during the main test-generation phase. Beyond this, it incorporates a dedicated subsystem, the *Tracer*, that uses a shared memory store to manage the various engines. By combining the engines, the *Tracer* ensures that *FuSeBMC* far outperforms the individual engines or even the running of the engines in parallel, but isolated. The outcome of these improvements can be seen in the ECA and Combination benchmark sets. Previously, these posed a challenge to *FuSeBMC*. With the latest changes, *FuSeBMC* achieved first place in the Combination subcategory and took second place in the ECA subcategory of the 2022 Test-Comp competition. Since the benchmarks in the ECA category have remained stable between last year's and this year's competitions, we can measure FuSeBMC's improvement in terms of the combined coverage it achieves across the 29 tasks. This improvement stands at a remarkable 60%. The 2022 Test-Comp results also show that *FuSeBMC* has achieved first place in the *Cover-Branches* category with high coverage and validation statistics. However, one of the weaknesses of *FuSeBMC* that we plan to work on is that for large programs, particularly for programs that redefine C library functions, seed generation can be slow and consume too much of the tool's time.

## 4 Tool Setup and Configuration

*FuSeBMC* can be run using the command below. The user is required to set the architecture, the property file path, the competition strategy, and the benchmark path, as:

```
fusebmc.py [-a {32, 64}] [-p PROPERTY FILE]
               [-s {kinduction,falsi,incr,fixed}]
               [BENCHMARK PATH]
```
where -a sets the architecture to 32 or 64, and -p sets the property file to PROPERTY FILE, which lists all the properties to be tested. -s sets the BMC strategy to one of the listed strategies {kinduction,falsi,incr,fixed}. For Test-Comp'22, *FuSeBMC* uses incr for incremental BMC, which relies on ESBMC's symbolic execution engine to increasingly unwind the program loops using an iterative technique. The incr strategy verifies the program for each unwind bound up to a maximum default value of 50 or indefinitely (until it exhausts the time or memory limits). The BenchExec tool-info module is fusebmc.py and the benchmark definition file is FuSeBMC.xml.
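The idea behind the incr strategy can be sketched as a loop over increasing unwind bounds. This is a schematic illustration only and not ESBMC's implementation: the callback `check_at_bound` is a hypothetical stand-in for one bounded check, and starting at bound 1 is an assumption. The default limit of 50 and the option to continue until resources are exhausted follow the description above.

```python
import itertools
import time
from typing import Callable, Optional, Tuple


def incremental_bmc(
    check_at_bound: Callable[[int], Optional[dict]],
    max_bound: Optional[int] = 50,
    time_limit_s: Optional[float] = None,
) -> Optional[Tuple[int, dict]]:
    """Try unwind bounds 1, 2, 3, ... until a test case is found.

    check_at_bound(k) is a hypothetical callback that returns input
    assignments reaching the goal within k unwindings, or None.
    max_bound=None means "continue until the time limit is hit".
    """
    start = time.monotonic()
    for k in itertools.count(1):
        if max_bound is not None and k > max_bound:
            return None
        if time_limit_s is not None and time.monotonic() - start > time_limit_s:
            return None
        model = check_at_bound(k)
        if model is not None:
            return k, model


if __name__ == "__main__":
    # Toy stand-in: the goal becomes reachable after 7 loop unwindings.
    def toy_check(k: int) -> Optional[dict]:
        return {"n": 7} if k >= 7 else None

    print(incremental_bmc(toy_check))
```

Increasing the bound step by step keeps the early checks cheap, at the cost of repeating work for goals that need deep unwinding.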

## 5 Software Project

*FuSeBMC* is implemented using C++, and it is publicly available under the terms of the MIT License at GitHub<sup>1</sup>. The repository includes the latest version of *FuSeBMC* (version 4.1.14). *FuSeBMC* dependencies and instructions for building from source code are all listed in the README.md file. Test-Comp 2022 provides the script, benchmarks, and *FuSeBMC* binary to reproduce the competition's results<sup>2</sup>.

<sup>1</sup> https://github.com/kaled-alshmrany/FuSeBMC

<sup>2</sup> https://test-comp.sosy-lab.org/2022/

## Acknowledgment

The Institute of Public Administration - IPA - Saudi Arabia <sup>1</sup> supports the FuSeBMC development. The work in this paper is also partially funded by the EPSRC grants EP/T026995/1, EP/V000497/1, EU H2020 ELEGANT 957286, and Soteria project awarded by the UK Research and Innovation for the Digital Security by Design (DSbD) Programme.

## References



<sup>1</sup> https://www.ipa.edu.sa/en-us/Pages/default.aspx

## VeriFuzz: Good Seeds for Fuzzing (Competition Contribution)

Ravindra Metta, Raveendra Kumar Medicherla\* (✉), and Hrishikesh Karmarkar

> TCS Research, Tata Consultancy Services, India {r.metta,raveendra.kumar,hrishikesh.karmarkar}@tcs.com

Abstract. We present VeriFuzz 1.2 with two new enhancements: (1) unroll the given program to a short depth and use BMC to produce incomplete test inputs, which are extended into complete inputs, and (2) if BMC fails for this short unrolling, automatically identify the reason and rerun BMC with a corresponding remedial strategy.

Keywords: Coverage Guided Fuzzing · Bounded Model Checking · Scalable Model Checking

#### 1 Introduction

VeriFuzz 1.0 [5] is an automated test input generation tool built on top of AFL [11], a Coverage Guided Fuzzing (CGF) engine, and the PRISM [7] program analysis framework. CGF requires initial test inputs (seeds) to generate newer inputs in order to build a test suite for coverage. In VeriFuzz 1.0, the seeds are generated as follows: (1) random seeds are either generated dynamically, or picked from a small set of unbiased inputs, and (2) for sequentialized concurrent programs that have deep nesting, test inputs are generated using BMC by unrolling all the loop bodies once. However, a direct application of CGF or BMC to a given program may not yield the required coverage, as CGF may get stuck in "complex conditions" [10], while BMC does not scale well for programs with "complex loops". To address these issues, we implemented two key enhancements to VeriFuzz 1.0. The first is to generate incomplete test seeds using BMC and complete them using random inputs. The second is to automatically identify the cause if BMC fails, and re-run BMC with an appropriate remedial strategy. These enhancements, implemented in VeriFuzz 1.2 [8], are described below.

#### 1.1 Enhancement 1: New Seed Generation Approach

Instead of generating a complete test seed using BMC, which scales poorly, for a given program P, we use CBMC [6] to generate an incomplete program P<sup>u</sup> by unwinding P only to a "short" depth d, which is heuristically guessed

<sup>\*</sup> Jury member

© The Author(s) 2022

E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 341–346, 2022. https://doi.org/10.1007/978-3-030-99429-7\_20

to be small enough for BMC to scale. This short unwinding makes rest of P unreachable due to the incompleteness of the unwinding. But it allows BMC to scale much better to Pu. We then use CBMC to produce test input sequences that cover the branches of Pu, using the cbmc options "–cover branches". Each of these inputs forms a valid prefix of a complete test input for P. We denote the set of such prefixes with Tp. When P is executed with a prefix t<sup>p</sup> in Tp, P may still require additional inputs to complete it's execution. We randomly generate such additional inputs, from the value ranges respecting the input types. We append each of these random inputs to the corresponding t<sup>p</sup> to form a complete input for P. Our experimentation showed that this approach helped VeriFuzz 1.2 to cover many more branches, which could not be covered with VeriFuzz 1.0.
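The seed-completion step of Enhancement 1 can be illustrated with a minimal Python sketch. It is not VeriFuzz's implementation: the names `TYPE_RANGES`, `complete_seed` and `complete_all` are hypothetical, the value ranges assume 8-bit chars and 32-bit ints, and the list of input types still required after the prefix is assumed to be known in advance.

```python
import random

# Assumed value ranges for a few C input types (hypothetical; 32-bit int, 8-bit char).
TYPE_RANGES = {
    "char": (-128, 127),
    "unsigned char": (0, 255),
    "int": (-2**31, 2**31 - 1),
    "unsigned int": (0, 2**32 - 1),
}


def complete_seed(prefix, remaining_types, rng):
    """Extend a BMC-generated input prefix into a complete test input.

    prefix          -- input values produced by BMC for the unwound program
    remaining_types -- types of the inputs the program still reads after the prefix
    rng             -- a random.Random instance, so that runs are reproducible
    """
    suffix = []
    for type_name in remaining_types:
        low, high = TYPE_RANGES[type_name]
        # Draw a random value that respects the range of the input type.
        suffix.append(rng.randint(low, high))
    return list(prefix) + suffix


def complete_all(prefixes, remaining_types, seed=0):
    """Complete every prefix in the set with its own random suffix."""
    rng = random.Random(seed)
    return [complete_seed(p, remaining_types, rng) for p in prefixes]
```

Each completed input keeps the BMC prefix unchanged, so the branches reached by the prefix remain reachable, while the random suffix only supplies the inputs needed to finish the run.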

#### 1.2 Enhancement 2: Remedying a Stuck or Failed BMC

We observed that, even for short unwindings on complex programs, BMC often either gets stuck (i.e., does not terminate within the given time budget) or fails with an error. We investigated this problem and found that BMC may get stuck or fail in any of its internal phases during the translation of a given program into a SAT/SMT formula. Sometimes the formula is generated, but the backend SAT/SMT solver times out due to the complexity of the formula. Some of the common BMC failure causes and the remedial actions are:


#### 2 Tool Architecture and Flow

Fig. 1. VeriFuzz architecture.

Figure 1 shows the architecture of VeriFuzz 1.2. The yellow boxes show the enhancements to VeriFuzz 1.0. For BMC, we use CBMC 5.42.0, with Z3 4.8.12 and Glucose Syrup 4.0 [1]. VeriFuzz 1.2 takes two inputs: (1) a program to test, say P, and (2) a property to test, such as branch or error coverage. First, the "BMC" module invokes CBMC on P for a short unwinding, typically with a timeout of 1 minute, to generate the incomplete test inputs. If CBMC times out after generating some incomplete test inputs, those inputs are output. If it times out without generating any, VeriFuzz identifies the phase in which CBMC is stuck and re-runs CBMC with a corresponding remedial strategy. Then, the "Test-Completion" module extends these incomplete test inputs to form complete test inputs. These complete tests are then passed to VeriFuzz 1.0, which fuzzes them using AFL to produce more test inputs.
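The control flow described above can be summarized in a short Python sketch. This is an illustration under stated assumptions, not the tool's code: `run_bmc`, `diagnose`, `remedies` and `complete` are hypothetical callables standing in for the "BMC" and "Test-Completion" modules, and the phase and strategy names used with them are placeholders.

```python
def generate_seeds(run_bmc, diagnose, remedies, complete, timeout_s=60):
    """Sketch of the seed-generation flow (all parameters are hypothetical).

    run_bmc(strategy, timeout_s) -> (finished, prefixes)
        Runs BMC on a short unwinding; strategy=None means the default run.
    diagnose() -> name of the phase in which BMC got stuck
    remedies   -> dict mapping a phase name to a remedial strategy
    complete(prefix) -> complete test input built from an incomplete one
    """
    finished, prefixes = run_bmc(None, timeout_s)

    # Timed out with nothing to show: find the stuck phase and retry once
    # with the matching remedial strategy, if one is known.
    if not finished and not prefixes:
        strategy = remedies.get(diagnose())
        if strategy is not None:
            _, prefixes = run_bmc(strategy, timeout_s)

    # Any incomplete inputs obtained so far are extended into complete ones,
    # which would then be handed to the fuzzer as seeds.
    return [complete(p) for p in prefixes]
```

Passing the modules in as callables keeps the sketch independent of how CBMC is actually invoked and makes the decision logic easy to exercise with stubs.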

#### 3 Strengths and Weaknesses

VeriFuzz 1.0 was enhanced with minor optimizations into VeriFuzz 1.1, which does not contain Enhancement 1 and Enhancement 2 (see Sec. 1). VeriFuzz 1.1 participated in Test-Comp 2020 [2], while VeriFuzz 1.2 participated in Test-Comp 2021 [3] and Test-Comp 2022 [4]. Here, we compare the results of VeriFuzz 1.2 against those of VeriFuzz 1.1 for all the categories common to Test-Comp 2022 and 2020, except ECA (to avoid any bias due to the fixed seeds).

Performance: In Cover-Error, VeriFuzz 1.2 detected 93% of the errors (387 out of 415) with an average time of 17 seconds, while VeriFuzz 1.1 detected 91% (262 out of 287) of the errors with an average time of 33 seconds. In Cover-Branches, VeriFuzz 1.2 covered 68% (scored 1626 out of 2378) of the branches, with an average of 14.7 minutes per benchmark, while VeriFuzz 1.1 covered 59% (scored 872 out of 1485) of the branches, with an average of 13.3 minutes per benchmark. The higher time taken by VeriFuzz 1.2 directly corresponds to the increase in coverage. On device drivers in BusyBox (MemSafety) and LDV (ReachSafety), VeriFuzz 1.2 scored substantially better: 29 of 75 and 57 of 290 respectively, while VeriFuzz 1.1 scored only 19 of 72 and 23 of 290 respectively.
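The percentages quoted above follow directly from the reported counts. The short Python check below recomputes them; the helper `percent` and the dictionary layout are illustrative only, and all numbers are taken from the paragraph above.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to the nearest integer."""
    return round(100 * part / total)


# (achieved, maximum) pairs as reported for each version and category.
reported = {
    ("Cover-Error", "1.2"): (387, 415),
    ("Cover-Error", "1.1"): (262, 287),
    ("Cover-Branches", "1.2"): (1626, 2378),
    ("Cover-Branches", "1.1"): (872, 1485),
    ("BusyBox", "1.2"): (29, 75),
    ("BusyBox", "1.1"): (19, 72),
    ("LDV", "1.2"): (57, 290),
    ("LDV", "1.1"): (23, 290),
}

summary = {key: percent(*counts) for key, counts in reported.items()}

# Ratio of average Cover-Error times: VeriFuzz 1.1 (33 s) over VeriFuzz 1.2 (17 s).
cover_error_speedup = 33 / 17

for (category, version), value in summary.items():
    print(f"{category:15s} VeriFuzz {version}: {value}%")
```

Because the maximum scores differ between the two competition years, the percentages rather than the raw scores are the comparable quantities.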

Usefulness of the enhancements: We analyzed the results of VeriFuzz 1.2 and noticed that in some cases Enhancement 1 was sufficient, while in other cases Enhancement 2 was also needed. For instance, in loop-floats-scientific-comp/loop2-1.c and ntdrivers-simplified/cdaudio_simpl1.cil-1.c, VeriFuzz 1.1 was unable to detect the error, which VeriFuzz 1.2 could detect with the seeds generated by Enhancement 1 alone. In cases like array-examples/sorting_bubblesort_2_ground.i and array-tiling/mlceu.c, Enhancement 2 was also required, in addition to Enhancement 1, to generate a seed that allowed VeriFuzz 1.2 to detect the error, which VeriFuzz 1.1 could not. In Cover-Branches, on benchmarks like loop-industry-pattern/mod3.c (2 seeds generated by BMC) and bitvector/s3_srvr_1a_alt.BV.c.cil.c (73 seeds generated by BMC), VeriFuzz 1.2 could cover 90% of the branches, while VeriFuzz 1.1 could cover none.

Weaknesses: (1) In some cases, e.g. array-multidimensional/copy-2-u.c, BMC runs out of memory, which terminates the entire VeriFuzz process. In some other cases, e.g. float-benchs/inv_square-1.c, mismatches in floating-point interpretation between CBMC and VeriFuzz lead to unintended behavior. These issues indicate that our tooling needs improvement. (2) In the Arrays subcategory of Cover-Error, VeriFuzz 1.2 took more than twice the time of VeriFuzz 1.1. This is because many array benchmarks contain for-loops that iterate over large arrays. In such cases, short unwindings of BMC do not go past the array initialization itself, and hence the seeds generated by BMC were ineffective, adding to the elapsed time. (3) In BusyBox and LDV drivers, there are many benchmarks that VeriFuzz is unable to solve due to issues like complex loops and quadratic constraints, which we are currently working on.

#### 4 VeriFuzz Tool Configuration and Setup

The tool is available at git@gitlab.com:sosy-lab/test-comp/archives-2022.git. To install and run the tool, follow the instructions in the README.txt provided with the tool. The benchexec tool-info module is verifuzz.py and the benchmark description file is verifuzz.xml. A sample run command is as follows: ./scripts/verifuzz.py --propertyFile coverage-error.prp example.c. In 2022, VeriFuzz 1.2 participated in Cover-Branches and Cover-Error categories.

#### 5 Software Project and Contributors

VeriFuzz is developed and maintained by the authors at TCS Research. They can be contacted at VeriFuzz.Tool@tcs.com. We thank everyone who has contributed to VeriFuzz, AFL, PRISM, CBMC, Glucose Syrup and Z3.

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Author Index**

Aldughaim, Mohannad 336 Alshmrany, Kaled M. 336

Bartocci, Ezio 3 Batot, Edouard Romari 23 Beyer, Dirk 49, 321 Bhayat, Ahmed 336 Biewer, Sebastian 71 Bubel, Richard 145

Cabot, Jordi 23 Chattopadhyay, Sudipta 245 Chen, Liqian 92 Cordeiro, Lucas C. 336 Coullon, Hélène 268

da Costa, Ana Oliveira 3 Diamantopoulos, Themistoklis 225 Dimovski, Aleksandar S. 102 Dutta, Saikat 123

Egyed, Alexander 288

Ferrère, Thomas 3

Gérard, Sebastien 23 Grätz, Lukas 145

Hage, Hassan 155 Hähnle, Reiner 145 Hashemi, Vahid 155 Henzinger, Thomas A. 3 Hermanns, Holger 71 Huang, Renjie 92 Huang, Zixin 123 Huang, Zunchen 163

Jakobs, Marie-Christine 184

Kanav, Sudeep 49 Karmarkar, Hrishikesh 341 Knapp, Alexander 205

Linsbauer, Lukas 288 Luo, Dan 92

Ma, Chenghu 92 Mantwill, Frank 155 Medicherla, Raveendra Kumar 341 Metta, Ravindra 341 Misailovic, Sasa 123

Nickovic, Dejan 3

Papathomas, Evangelos 225

Rajan, Sai Sathiesh 245 Richter, Cedric 49 Robillard, Simon 268 Roggenbach, Markus 205 Rosenberger, Tobias 205

Seferis, Emmanouil 155 Symeonidis, Andreas 225

Thaller, Hannes 288

Udeshi, Sakshi 245

Wang, Chao 163 Wang, Ji 92 Wei, Dengping 92 Wiesner, Maik 184 Wu, Hao 310