# **Ilya Sergey (Ed.)**

# **Programming Languages and Systems**

**31st European Symposium on Programming, ESOP 2022 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022 Munich, Germany, April 2–7, 2022 Proceedings**

# Lecture Notes in Computer Science 13240

Founding Editors

Gerhard Goos, Germany
Juris Hartmanis, USA

### Editorial Board Members

Elisa Bertino, USA
Wen Gao, China
Bernhard Steffen, Germany
Gerhard Woeginger, Germany
Moti Yung, USA

### Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy
Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany
Benjamin C. Pierce, University of Pennsylvania, USA
Bernhard Steffen, University of Dortmund, Germany
Deng Xiaotie, Peking University, Beijing, China
Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this series at https://link.springer.com/bookseries/558

Ilya Sergey (Ed.)

# Programming Languages and Systems

31st European Symposium on Programming, ESOP 2022 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022 Munich, Germany, April 2–7, 2022 Proceedings

Editor Ilya Sergey National University of Singapore Singapore, Singapore

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-99335-1 ISBN 978-3-030-99336-8 (eBook) https://doi.org/10.1007/978-3-030-99336-8

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

### ETAPS Foreword

Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital of Bavaria, in Germany.

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organizing these conferences in a coherent, highly synchronized conference program enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops took place that attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University College London, UK, and Cornell University, USA) and Tomáš Vojnar (Brno University of Technology, Czech Republic) and the conference-specific invited speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck (University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated learning.

As this event was the 25th edition of ETAPS, part of the program was a special celebration where we looked back on the achievements of ETAPS and its constituting conferences in the past, but we also looked into the future, and discussed the challenges ahead for research in software science. This edition also reinstated the ETAPS mentoring workshop for PhD students.

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the Technical University of Munich (TUM) and the LMU Munich. The former was founded in 1868, and the latter in 1472 as the 6th oldest German university still running today. Together, they have 100,000 enrolled students, regularly rank among the top 100 universities worldwide (with TUM's computer-science department ranked #1 in the European Union), and their researchers and alumni include 60 Nobel laureates. The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer (general, financial, and workshop chair), Julia Eisentraut (organization chair), and Alexandros Evangelidis (local proceedings chair).

ETAPS 2022 was further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König (Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch Johnsen (Oslo), Dana Fisman (Be'er Sheva), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Roşu (Illinois), Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina (Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastián Uchitel (London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz).

I'd like to take this opportunity to thank all authors, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2022.

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their enormous efforts to make ETAPS a fantastic event.

February 2022

Marieke Huisman
ETAPS SC Chair
ETAPS e.V. President

### Preface

This volume contains the papers accepted at the 31st European Symposium on Programming (ESOP 2022), held during April 5–7, 2022, in Munich, Germany (COVID-19 permitting). ESOP is one of the European Joint Conferences on Theory and Practice of Software (ETAPS); it is dedicated to fundamental issues in the specification, design, analysis, and implementation of programming languages and systems.

The 21 papers in this volume were selected by the Program Committee (PC) from 64 submissions. Each submission received between three and four reviews. After receiving the initial reviews, the authors had a chance to respond to questions and clarify misunderstandings of the reviewers. After the author response period, the papers were discussed electronically using the HotCRP system by the 33 Program Committee members and 33 external reviewers. Two papers, for which the PC chair had a conflict of interest, were kindly managed by Zena Ariola. The reviewing for ESOP 2022 was double-anonymous, and the identities of the authors were revealed only for the eventually accepted papers.

Following the example set by other major conferences in programming languages, for the first time in its history, ESOP featured optional artifact evaluation. Authors of the accepted manuscripts were invited to submit artifacts, such as code, datasets, and mechanized proofs, that supported the conclusions of their papers. Members of the Artifact Evaluation Committee (AEC) read the papers and explored the artifacts, assessing their quality and checking that they supported the authors' claims. The authors of eleven of the accepted papers submitted artifacts, which were evaluated by 20 AEC members, with each artifact receiving four reviews. Authors of papers with accepted artifacts were assigned official EAPLS artifact evaluation badges, indicating that they have taken the extra time and have undergone the extra scrutiny to prepare a useful artifact. The ESOP 2022 AEC awarded Artifacts Functional and Artifacts (Functional and) Reusable badges. All submitted artifacts were deemed Functional, and all but one were found to be Reusable.

My sincere thanks go to all who contributed to the success of the conference and to its exciting program. This includes the authors who submitted papers for consideration; the external reviewers who provided timely expert reviews, sometimes on very short notice; the AEC members and chairs who took great care of this new aspect of ESOP; and, of course, the members of the ESOP 2022 Program Committee. I was extremely impressed by the excellent quality of the reviews, the amount of constructive feedback given to the authors, and the criticism delivered in a professional and friendly tone. I am very grateful to Andreea Costea and KC Sivaramakrishnan, who kindly agreed to serve as co-chairs for the ESOP 2022 Artifact Evaluation Committee. I would like to thank the ESOP 2021 chair Nobuko Yoshida for her advice, patience, and the many insightful discussions on the process of running the conference. I thank all who contributed to the organization of ESOP: the ESOP steering committee and its chair Peter Thiemann, as well as the ETAPS steering committee and its chair Marieke Huisman.

Finally, I would like to thank Barbara König and Alexandros Evangelidis for their help with assembling the proceedings.

February 2022

Ilya Sergey

### Organization

### Program Chair

Ilya Sergey National University of Singapore, Singapore

### Program Committee


### Additional Reviewers


### Artifact Evaluation Committee Chairs


### Artifact Evaluation Committee


Pritam Choudhury, University of Pennsylvania, USA
Jan de Muijnck-Hughes, University of Glasgow, UK
Darius Foo, National University of Singapore, Singapore
Léo Gourdin, Université Grenoble-Alpes, France
Daniel Hillerström, University of Edinburgh, UK
Jules Jacobs, Radboud University, The Netherlands
Chaitanya Koparkar, Indiana University, USA
Yinling Liu, Toulouse Computer Science Research Center, France
Yiyun Liu, University of Pennsylvania, USA
Kristóf Marussy, Budapest University of Technology and Economics, Hungary
Orestis Melkonian, University of Edinburgh, UK
Shouvick Mondal, Concordia University, Canada
Krishna Narasimhan, TU Darmstadt, Germany
Mário Pereira, Universidade NOVA de Lisboa, Portugal
Goran Piskachev, Fraunhofer IEM, Germany
Somesh Singh, Inria, France
Yahui Song, National University of Singapore, Singapore
Vimala Soundarapandian, IIT Madras, India

### Contents



### Categorical Foundations of Gradient-Based Learning

Geoffrey S. H. Cruttwell<sup>1</sup>, Bruno Gavranović<sup>2</sup>, Neil Ghani<sup>2</sup>, Paul Wilson<sup>3</sup>, and Fabio Zanasi<sup>3</sup>

> <sup>1</sup> Mount Allison University, Canada <sup>2</sup> University of Strathclyde, United Kingdom <sup>3</sup> University College London, United Kingdom

Abstract. We propose a categorical semantics of gradient-based machine learning algorithms in terms of lenses, parametric maps, and reverse derivative categories. This foundation provides a powerful explanatory and unifying framework: it encompasses a variety of gradient descent algorithms such as ADAM, AdaGrad, and Nesterov momentum, as well as a variety of loss functions such as MSE and Softmax cross-entropy, shedding new light on their similarities and differences. Our approach to gradient-based learning has examples generalising beyond the familiar continuous domains (modelled in categories of smooth maps) and can be realized in the discrete setting of boolean circuits. Finally, we demonstrate the practical significance of our framework with an implementation in Python.

### 1 Introduction

The last decade has witnessed a surge of interest in machine learning, fuelled by the numerous successes and applications that these methodologies have found in many fields of science and technology. As machine learning techniques become increasingly pervasive, algorithms and models become more sophisticated, posing a significant challenge both to the software developers and the users that need to interface, execute and maintain these systems. In spite of this rapidly evolving picture, the formal analysis of many learning algorithms mostly takes place at a heuristic level [41], or using definitions that fail to provide a general and scalable framework for describing machine learning. Indeed, it is commonly acknowledged across academia, industry, policy makers, and funding agencies that there is a pressing need for a unifying perspective, which can make this growing body of work more systematic, rigorous, transparent and accessible both for users and developers [2, 36].

Consider, for example, one of the most common machine learning scenarios: supervised learning with a neural network. This technique trains the model towards a certain task, e.g. the recognition of patterns in a data set (cf. Figure 1). There are several different ways of implementing this scenario. Typically, at their core, there is a gradient update algorithm (often called the "optimiser"), depending on a given loss function, which updates in steps the parameters of the network, based on some learning rate controlling the "scaling" of the update. All of these components can vary independently in a supervised learning algorithm, and a number of choices are available for loss maps (quadratic error, Softmax cross entropy, dot product, etc.) and optimisers (Adagrad [20], Momentum [37], Adam [32], etc.).

Fig. 1: An informal illustration of gradient-based learning. This neural network is trained to distinguish different kinds of animals in the input image. Given an input X, the network predicts an output Y, which is compared by a 'loss map' with what would be the correct answer ('label'). The loss map returns a real value expressing the error of the prediction; this information, together with the learning rate (a weight controlling how much the model should be changed in response to error), is used by an optimiser, which computes by gradient descent the update of the parameters of the network, with the aim of improving its accuracy. The neural network, the loss map, the optimiser and the learning rate are all components of a supervised learning system, and can vary independently of one another.

This scenario highlights several questions: is there a uniform mathematical language capturing the different components of the learning process? Can we develop a unifying picture of the various optimisation techniques, allowing for their comparative analysis? Moreover, it should be noted that supervised learning is not limited to neural networks. For example, supervised learning is surprisingly applicable to the discrete setting of boolean circuits [50], where continuous functions are replaced by boolean-valued functions. Can we identify an abstract perspective encompassing both the real-valued and the boolean case? In a nutshell, this paper seeks to answer the question:

what are the fundamental mathematical structures underpinning gradient-based learning?

Our approach to this question stems from the identification of three fundamental aspects of the gradient-descent learning process:

(I) computation is parametric, e.g. in the simplest case we are given a function f : P × X → Y and learning consists of finding a parameter p : P such that f(p, −) is the best function according to some criteria. Specifically, the weights on the internal nodes of a neural network are a parameter which the learning is seeking to optimize. Parameters also arise elsewhere, e.g. in the loss function (see later).

(II) information flows bidirectionally: in the forward direction, inputs are turned into predictions; in the backward direction, the error of a prediction is propagated back through the computation, yielding changes to parameters and inputs.

(III) the basis of parameter update is differentiation: the backward flow is computed via the (reverse) derivative of the computation, so that updates move parameters in a direction that reduces the loss.

We model bidirectionality via lenses [6, 12, 29] and, based upon the above three insights, we propose the notion of parametric lens as the fundamental semantic structure of learning. In a nutshell, a parametric lens is a process with three kinds of interfaces: inputs, outputs, and parameters. On each interface, information flows both ways, i.e. computations are bidirectional. These data are best explained with our graphical representation of parametric lenses, with inputs A, A′, outputs B, B′, parameters P, P′, and arrows indicating information flow (below left). The graphical notation also makes evident that parametric lenses are open systems, which may be composed along their interfaces (below center and right).

This pictorial formalism is not just an intuitive sketch: as we will show, it can be understood as a completely formal (graphical) syntax using the formalism of string diagrams [39], in a way similar to how other computational phenomena have been recently analysed e.g. in quantum theory [14], control theory [5, 8], and digital circuit theory [26].

It is intuitively clear how parametric lenses express aspects (I) and (II) above, whereas (III) will be achieved by studying them in a space of 'differentiable objects' (in a sense that will be made precise). The main technical contribution of our paper is showing how the various ingredients involved in learning (the model, the optimiser, the error map and the learning rate) can be uniformly understood as being built from parametric lenses.

We will use category theory as the formal language to develop our notion of parametric lenses, and make Figure 2 mathematically precise. The categorical perspective brings several advantages, which are well-known, established principles in programming language semantics [3,40,49]. Three of them are particularly

Fig. 2: The parametric lens that captures the learning process informally sketched in Figure 1. Note that each component is itself a lens, whose composition yields the interactions described in Figure 1. Defining this picture formally will be the subject of Sections 3-4.

important to our contribution, as they constitute distinctive advantages of our semantic foundations:


We now give a synopsis of our contributions:

	- In Section 3.1, we show how the combinatorial model that is the subject of training can be seen as a parametric lens. The conditions we provide are met by the 'standard' case of neural networks, but also enable the study of learning for other classes of models. In particular, another instance is Boolean circuits: learning of these structures is relevant to binarisation [16] and has been explored recently using a categorical approach [50], which turns out to be a particular case of our framework.
	- In Section 3.2, we show how the loss maps associated with training are also parametric lenses. Our approach covers the cases of quadratic error, Boolean error, Softmax cross entropy, but also the 'dot product loss' associated with the phenomenon of deep dreaming [19, 34, 35, 44].
	- In Section 3.3, we model the learning rate as a parametric lens. This analysis also allows us to contrast how the learning rate is handled in the 'real-valued' case of neural networks with respect to the 'Boolean-valued' case of Boolean circuits.
	- In Section 3.4, we show how optimisers can be modelled as 'reparameterisations' of models as parametric lenses. As case studies, in addition to basic gradient update, we consider the stateful variants: Momentum [37], Nesterov Momentum [48], Adagrad [20], and Adam (Adaptive Moment Estimation) [32]. Also, on Boolean circuits, we show how the reverse derivative ascent of [50] can also be regarded in this way.



	- In Section 5, we describe a proof-of-concept implementation of our framework in Python. We demonstrate our library via a number of experiments, and validate its correctness by achieving accuracy on par with an equivalent model in Keras, a mainstream deep learning framework [11]. In particular, we create a working non-trivial neural network model for the MNIST image-classification problem [33].

– Finally, in Sections 6 and 7, we discuss related and future work.

### 2 Categorical Toolkit

In this section we describe the three categorical components of our framework, each corresponding to an aspect of gradient-based learning: (I) the Para construction (Section 2.1), which builds a category of parametric maps, (II) the Lens construction, which builds a category of "bidirectional" maps (Section 2.2), and (III) the combination of these two constructions into the notion of "parametric lenses" (Section 2.3). Finally (IV) we recall Cartesian reverse differential categories — categories equipped with an abstract gradient operator.

**Notation.** We shall use $f ; g$ for sequential composition of morphisms $f : A \to B$ and $g : B \to C$ in a category, $1_A$ for the identity morphism on $A$, and $I$ for the unit object of a symmetric monoidal category.

#### 2.1 Parametric Maps

In supervised learning one is typically interested in approximating a function $g : \mathbb{R}^n \to \mathbb{R}^m$ for some $n$ and $m$. To do this, one begins by building a neural network, which is a smooth map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbb{R}^p$ is the set of possible weights of that neural network. Then one looks for a value $q \in \mathbb{R}^p$ such that the function $f(q, -) : \mathbb{R}^n \to \mathbb{R}^m$ closely approximates $g$. We formalise these maps categorically via the $\mathbf{Para}$ construction [9, 23, 24, 30].

Definition 1 (Parametric category). Let $(C, \otimes, I)$ be a strict<sup>4</sup> symmetric monoidal category. We define a category $\mathbf{Para}(C)$ with objects those of $C$, and a map from $A$ to $B$ a pair $(P, f)$, with $P$ an object of $C$ and $f : P \otimes A \to B$. The composite of maps $(P, f) : A \to B$ and $(P', f') : B \to C$ is the pair $(P' \otimes P, (1_{P'} \otimes f); f')$. The identity on $A$ is the pair $(I, 1_A)$.

Example 1. Take the category $\mathbf{Smooth}$ whose objects are natural numbers and whose morphisms $f : n \to m$ are smooth maps from $\mathbb{R}^n$ to $\mathbb{R}^m$. As described above, the category $\mathbf{Para}(\mathbf{Smooth})$ can be thought of as a category of neural networks: a map in this category from $n$ to $m$ consists of a choice of $p$ and a map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, with $\mathbb{R}^p$ representing the set of possible weights of the neural network.
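To make the Para construction concrete, here is a minimal Python sketch (ours, not the paper's implementation; all names are our own): a parametric map is represented as a pair of a parameter dimension and a function, and composition concatenates parameter spaces exactly as in Definition 1.

```python
# A parametric map (P, f) : A -> B is modelled as (param_dim, f), where
# f(p, a) computes the output from parameter p and input a.

def para_compose(pf, pg):
    """Compose (P, f) : A -> B with (P', g) : B -> C into (P' (x) P, (1 (x) f); g)."""
    (p_dim, f), (q_dim, g) = pf, pg
    def h(params, a):
        q, p = params[:q_dim], params[q_dim:]   # split the product P' (x) P
        return g(q, f(p, a))
    return (q_dim + p_dim, h)

# Example: two affine maps R -> R, each with parameter space R^2 = (weight, bias).
layer = (2, lambda p, a: p[0] * a + p[1])
network = para_compose(layer, layer)            # parameter space R^2 (x) R^2
print(network[1]([1.0, 0.0, 2.0, 3.0], 5.0))    # 1*(2*5 + 3) + 0 = 13.0
```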

As we will see in the next sections, the interplay of the various components at work in the learning process becomes much clearer once we represent the morphisms of $\mathbf{Para}(C)$ using the pictorial formalism of string diagrams, which we now recall. In fact, we will mildly massage the traditional notation for string diagrams (below left) by representing a morphism $f : A \to B$ in $\mathbf{Para}(C)$ as below right.

*(Diagrams omitted: left, the traditional string-diagram notation for $f$; right, the $\mathbf{Para}$ notation, in which the parameter wire $P$ is drawn vertically.)*

This is to emphasise the special role played by $P$, reflecting the fact that in machine learning data and parameters have different semantics. String-diagrammatic notation also allows us to neatly represent the composition of maps $(P, f) : A \to B$ and $(P', f') : B \to C$ (below left), and the "reparameterisation" of $(P, f) : A \to B$ by a map $\alpha : Q \to P$ (below right), yielding a new map $(Q, (\alpha \otimes 1_A); f) : A \to B$.

*(Diagram (2), omitted: composition of parametric maps (left) and reparameterisation of $(P, f)$ by $\alpha : Q \to P$ (right).)*

<sup>4</sup> One can also define $\mathbf{Para}(C)$ in the case when $C$ is non-strict; however, the result would not be a category but a bicategory.

Intuitively, reparameterisation changes the parameter space of (P, f) : A → B to some other object Q, via some map α : Q → P. We shall see later that gradient descent and its many variants can naturally be viewed as reparameterisations.

Note that the coherence rules for combining the two operations in (2) work as expected, as these diagrams can ultimately be 'compiled' down to string diagrams for monoidal categories.

#### 2.2 Lenses

In machine learning (or even learning in general) it is fundamental that information flows both forwards and backwards: the 'forward' flow corresponds to a model's predictions, and the 'backward' flow to corrections to the model. The category of lenses is the ideal setting to capture this type of structure, as it is a category consisting of maps with both a "forward" and a "backward" part.

Definition 2. For any Cartesian category $C$, the category of (bimorphic) lenses in $C$, $\mathbf{Lens}(C)$, is the category with the following data. Objects are pairs $(A, A')$ of objects in $C$. A map from $(A, A')$ to $(B, B')$ consists of a pair $(f, f^*)$, where $f : A \to B$ (called the get or forward part of the lens) and $f^* : A \times B' \to A'$ (called the put or backward part of the lens). The composite of $(f, f^*) : (A, A') \to (B, B')$ and $(g, g^*) : (B, B') \to (C, C')$ is given by get $f; g$ and put $\langle \pi_0, \langle \pi_0; f, \pi_1 \rangle; g^* \rangle; f^*$. The identity on $(A, A')$ is the pair $(1_A, \pi_1)$.
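The composition rule is easier to see operationally: to push a change at $C$ back to $A$, first run the forward part to reach $B$, apply $g$'s put there, and feed the result to $f$'s put. A small Python sketch (ours; the class and operator are our own notation, not the paper's library) makes this executable:

```python
# A (bimorphic) lens as a pair of functions, following Definition 2.
class Lens:
    def __init__(self, get, put):
        self.get = get                     # get : A -> B
        self.put = put                     # put : A x B' -> A'

    def __rshift__(self, other):
        """Compose self : (A,A') -> (B,B') with other : (B,B') -> (C,C')."""
        get = lambda a: other.get(self.get(a))
        # put(a, c') = f*(a, g*(f(a), c')), as in Definition 2.
        put = lambda a, dc: self.put(a, other.put(self.get(a), dc))
        return Lens(get, put)

# A linear map together with its adjoint as the backward part.
double = Lens(lambda a: 2 * a, lambda a, db: 2 * db)
quad = double >> double
print(quad.get(3), quad.put(3, 1.0))       # 12 and 4.0
```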

The embedding of $\mathbf{Lens}(C)$ into the category of Tambara modules over $C$ (see [7, Thm. 23]) provides a rich string-diagrammatic language, in which lenses may be represented with forward/backward wires indicating the information flow. In this language, a morphism $(f, f^*) : (A, A') \to (B, B')$ is written as below left, which can be 'expanded' as below right.

It is clear in this language how to describe the composite of $(f, f^*) : (A, A') \to (B, B')$ and $(g, g^*) : (B, B') \to (C, C')$, as in (3):

#### 2.3 Parametric Lenses

The fundamental category where supervised learning takes place is the composite Para(Lens(C)) of the two constructions in the previous sections:

Definition 3. The category $\mathbf{Para}(\mathbf{Lens}(C))$ of parametric lenses on $C$ has as objects pairs $(A, A')$ of objects from $C$. A morphism from $(A, A')$ to $(B, B')$, called a parametric lens<sup>5</sup>, is a choice of parameter pair $(P, P')$ and a lens $(f, f^*) : (P, P') \times (A, A') \to (B, B')$, so that $f : P \times A \to B$ and $f^* : P \times A \times B' \to P' \times A'$.

String diagrams for parametric lenses are built by simply composing the graphical languages of the previous two sections — see (1), where respectively a morphism, a composition of morphisms, and a reparameterisation are depicted.

Given a generic morphism in $\mathbf{Para}(\mathbf{Lens}(C))$ as depicted in (1) on the left, one can see how it is possible to "learn" new values from $f$: it takes as inputs an element of $A$, a parameter $P$, and a change $B'$, and outputs a change in $A$, a value of $B$, and a change $P'$. This last element is the key component for supervised learning: intuitively, it says how to change the parameter values to bring the neural network closer to the true value of the desired function.

The question, then, is how one is to define such a parametric lens given nothing more than a neural network, i.e., a parametric map $(P, f) : A \to B$. This is precisely what the gradient operation provides, and its generalisation to categories is explored in the next subsection.

#### 2.4 Cartesian Reverse Differential Categories

Fundamental to all types of gradient-based learning is, of course, the gradient operation. In most cases this gradient operation is performed in the category of smooth maps between Euclidean spaces. However, recent work [50] has shown that gradient-based learning can also work well in other categories; for example, in a category of boolean circuits. Thus, to encompass these examples in a single framework, we will work in a category with an abstract gradient operation.

Definition 4. A Cartesian left additive category [13, Defn. 1] consists of a category $C$ with chosen finite products (including a terminal object), and an addition operation and zero morphism in each homset, satisfying various axioms. A Cartesian reverse differential category (CRDC) [13, Defn. 13] consists of a Cartesian left additive category $C$, together with an operation which provides, for each map $f : A \to B$ in $C$, a map $R[f] : A \times B \to A$ satisfying various axioms.

For $f : A \to B$, the pair $(f, R[f])$ forms a lens from $(A, A)$ to $(B, B)$. We will pursue the idea that $R[f]$ acts as the backward map, thus giving a means to "learn" $f$.

<sup>5</sup> In [23], these are called learners. However, in this paper we study them in a much broader light; see Section 6.

Note that assigning the type $A \times B \to A$ to $R[f]$ hides some relevant information: $B$-values in the domain and $A$-values in the codomain of $R[f]$ do not play the same role as values of the same types in $f : A \to B$: in $R[f]$, they really take in a tangent vector at $B$ and output a tangent vector at $A$ (cf. the definition of $R[f]$ in $\mathbf{Smooth}$, Example 2 below). To emphasise this, we will type $R[f]$ as a map $A \times B' \to A'$ (even though in reality $A = A'$ and $B = B'$), thus meaning that $(f, R[f])$ is actually a lens from $(A, A')$ to $(B, B')$. This typing distinction will be helpful later on, when we want to add additional components to our learning algorithms.

The following two examples of CRDCs will serve as the basis for the learning scenarios of the upcoming sections.

Example 2. The category $\mathbf{Smooth}$ (Example 1) is Cartesian, with the product of objects given by addition of natural numbers, and it is also a Cartesian reverse differential category: given a smooth map $f : \mathbb{R}^n \to \mathbb{R}^m$, the map $R[f] : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ sends a pair $(x, v)$ to $J[f]^T(x) \cdot v$: the transpose of the Jacobian of $f$ at $x$, applied to $v$. For example, if $f : \mathbb{R}^2 \to \mathbb{R}^3$ is defined as $f(x_1, x_2) := (x_1^3 + 2x_1x_2,\; x_2,\; \sin(x_1))$, then $R[f] : \mathbb{R}^2 \times \mathbb{R}^3 \to \mathbb{R}^2$ is given by

$$(x, v) \mapsto \begin{bmatrix} 3x_1^2 + 2x_2 & 0 & \cos(x_1) \\ 2x_1 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}.$$

Using the reverse derivative (as opposed to the forward derivative) is well known to be much more computationally efficient for functions $f : \mathbb{R}^n \to \mathbb{R}^m$ when $m \ll n$ (see, for example, [28]), as is the case in most supervised learning situations (where often $m = 1$).
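As a sanity check of the running example, the following short numpy sketch (ours) computes $R[f]$ explicitly as a Jacobian-transpose-vector product:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1**3 + 2 * x1 * x2, x2, np.sin(x1)])

def R_f(x, v):
    """R[f](x, v) = J[f]^T(x) . v, the reverse derivative of f at x."""
    x1, x2 = x
    J = np.array([[3 * x1**2 + 2 * x2, 2 * x1],   # Jacobian of f at x (3 x 2)
                  [0.0,                1.0],
                  [np.cos(x1),         0.0]])
    return J.T @ v

print(R_f(np.array([1.0, 2.0]), np.array([1.0, 0.0, 0.0])))  # [7. 2.]
```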

Example 3. Another CRDC is the symmetric monoidal category $\mathrm{POLY}_{\mathbb{Z}_2}$ [13, Example 14], with objects the natural numbers and morphisms $f : A \to B$ the $B$-tuples of polynomials in $\mathbb{Z}_2[x_1, \ldots, x_A]$. When presented by generators and relations, these morphisms can be viewed as a syntax for boolean circuits, with parametric lenses for such circuits (and their reverse derivative) described in [50].

### 3 Components of learning as Parametric Lenses

As seen in the introduction, in the learning process there are many components at work: a model, an optimiser, a loss map, a learning rate, etc. In this section we show how each such component can be understood as a parametric lens. Moreover, for each component, we show how our framework encompasses several variations of the gradient-descent algorithms, thus offering a unifying perspective on many different approaches that appear in the literature.

#### 3.1 Models as Parametric Lenses

We begin by characterising the models used for training as parametric lenses. In essence, our approach identifies a set of abstract requirements necessary to perform training by gradient descent, which covers the case studies that we will consider in the next sections.

The leading intuition is that a suitable model is a parametric map, equipped with a reverse derivative operator. Using the formal developments of Section 2, this amounts to assuming that a model is a morphism in $\mathbf{Para}(C)$, for a CRDC $C$. In order to visualise such a morphism as a parametric lens, it then suffices to apply under $\mathbf{Para}(-)$ the canonical morphism $R : C \to \mathbf{Lens}(C)$ (which exists for any CRDC $C$, see [13, Prop. 31]), mapping $f$ to $(f, R[f])$. This yields a functor $\mathbf{Para}(R) : \mathbf{Para}(C) \to \mathbf{Para}(\mathbf{Lens}(C))$, pictorially defined in (4).

Example 4 (Neural networks). As noted previously, to learn a function of type $\mathbb{R}^n \to \mathbb{R}^m$, one constructs a neural network, which can be seen as a function of type $\mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbb{R}^p$ is the space of parameters of the neural network. As seen in Example 1, this is a map in the category $\mathbf{Para}(\mathbf{Smooth})$ of type $\mathbb{R}^n \to \mathbb{R}^m$ with parameter space $\mathbb{R}^p$. Then one can apply the functor in (4) to present a neural network together with its reverse derivative operator as a parametric lens, i.e. a morphism in $\mathbf{Para}(\mathbf{Lens}(\mathbf{Smooth}))$.

Example 5 (Boolean circuits). For learning of Boolean circuits as described in [50], the recipe is the same as in Example 4, except that the base category is $\mathrm{POLY}_{\mathbb{Z}_2}$ (see Example 3). The important observation here is that $\mathrm{POLY}_{\mathbb{Z}_2}$ is a CRDC, see [13, 50], and thus we can apply the functor in (4).

Note that a model/parametric lens $f$ can take as inputs an element of $A$, an element of $B'$ (a change in $B$), and a parameter $P$, and outputs an element of $B$, a change in $A$, and a change in $P$. This is not yet sufficient to do machine learning! When we perform learning, we want to input a parameter $P$ and a pair $A \times B$ and receive a new parameter $P$. Instead, $f$ expects a change in $B$ (not an element of $B$) and outputs a change in $P$ (not an element of $P$). Deep dreaming, on the other hand, wants to return an element of $A$ (not a change in $A$). Thus, to do machine learning (or deep dreaming) we need to add additional components to $f$; we will consider these additional components in the next sections.

#### 3.2 Loss Maps as Parametric Lenses

Another key component of any learning algorithm is the choice of loss map. This gives a measurement of how far the current output of the model is from the desired output. In standard learning in $\mathbf{Smooth}$, this loss map is viewed as a map of type $B \times B \to \mathbb{R}$. However, in our setup, it is naturally viewed as a parametric map from $B$ to $\mathbb{R}$ with parameter space $B$.<sup>6</sup> We also generalise the codomain to an arbitrary object $L$.

Definition 5. A loss map on $B$ consists of a parametric map $(B, \mathsf{loss}) \in \mathbf{Para}(C)(B, L)$ for some object $L$.

Note that we can precompose a loss map (B, loss): B → L with a neural network (P, f) : A → B (below left), and apply the functor in (4) (with C = Smooth) to obtain the parametric lens below right.

*(Diagram (5), omitted: the model $(P, f)$ precomposed with the loss lens $(\mathsf{loss}, R[\mathsf{loss}])$, yielding a parametric lens from $(A, A')$ to $(L, L')$ with parameters $(P, P')$ and $(B, B')$.)*

This is getting closer to the parametric lens we want: it can now receive inputs of type $B$. However, this comes at the cost of needing an input to $L'$; we consider how to handle this in the next section.

Example 6 (Quadratic error). In $\mathbf{Smooth}$, the standard loss function on $\mathbb{R}^b$ is quadratic error: it uses $L = \mathbb{R}$ and has parametric map $e : \mathbb{R}^b \times \mathbb{R}^b \to \mathbb{R}$ given by $e(b_t, b_p) = \frac{1}{2}\sum_{i=1}^{b}((b_p)_i - (b_t)_i)^2$, where we think of $b_t$ as the "true" value and $b_p$ as the predicted value. This has reverse derivative $R[e] : \mathbb{R}^b \times \mathbb{R}^b \times \mathbb{R} \to \mathbb{R}^b \times \mathbb{R}^b$ given by $R[e](b_t, b_p, \alpha) = \alpha \cdot (b_p - b_t, b_t - b_p)$. Note that $\alpha$ suggests the idea of a learning rate, which we will explore in Section 3.3.
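A small numpy sketch (ours) of this loss map and its reverse derivative; the pair returned by `R_e` mirrors the two components of $R[e]$ in Example 6:

```python
import numpy as np

def e(b_true, b_pred):
    """Quadratic error: 1/2 * sum_i ((b_pred)_i - (b_true)_i)^2."""
    return 0.5 * np.sum((b_pred - b_true) ** 2)

def R_e(b_true, b_pred, alpha):
    """R[e](bt, bp, alpha) = alpha * (bp - bt, bt - bp)."""
    return alpha * (b_pred - b_true), alpha * (b_true - b_pred)

bt, bp = np.array([1.0, 0.0]), np.array([0.8, 0.3])
print(e(bt, bp))            # 0.065
print(R_e(bt, bp, -0.1))    # the changes, scaled by a (negative) learning rate
```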

Example 7 (Boolean error). In $\mathrm{POLY}_{\mathbb{Z}_2}$, the loss function on $\mathbb{Z}_2^b$ which is implicitly used in [50] is a bit different: it uses $L = \mathbb{Z}_2^b$ and has parametric map $e : \mathbb{Z}_2^b \times \mathbb{Z}_2^b \to \mathbb{Z}_2^b$ given by

$$e(b_t, b_p) = b_t + b_p.$$

(Note that this is $+$ in $\mathbb{Z}_2$; equivalently, this is XOR.) Its reverse derivative is of type $R[e] : \mathbb{Z}_2^b \times \mathbb{Z}_2^b \times \mathbb{Z}_2^b \to \mathbb{Z}_2^b \times \mathbb{Z}_2^b$, given by $R[e](b_t, b_p, \alpha) = (\alpha, \alpha)$.

Example 8 (Softmax cross entropy). The Softmax cross entropy loss is an $\mathbb{R}^b$-parametric map $\mathbb{R}^b \to \mathbb{R}$ defined by

$$e(b_t, b_p) = \sum_{i=1}^{b} (b_t)_i \big((b_p)_i - \log(\mathrm{Softmax}(b_p)_i)\big),$$

where $\mathrm{Softmax}(b_p)_i = \frac{\exp((b_p)_i)}{\sum_{j=1}^{b} \exp((b_p)_j)}$ is defined componentwise for each class $i$.

We note that, although $b_t$ needs to be a probability distribution, at the moment there is no need to ponder the question of the interaction of probability distributions with the reverse derivative framework: one can simply consider $b_t$ as the image of some logits under the Softmax function.

<sup>6</sup> Here the loss map has its parameter space equal to its input space. However, putting loss maps on the same footing as models lends itself to further generalisations where the parameter space is different, and where the loss map can itself be learned. See Generative Adversarial Networks [9, Figure 7].

Example 9 (Dot product). In deep dreaming (Section 4.2) we often want to focus only on a particular element of the network output $\mathbb{R}^b$. This is done by supplying a one-hot vector $b_t$ as the ground truth to the loss function $e(b_t, b_p) = b_t \cdot b_p$, which computes the dot product of the two vectors. If the ground truth vector $b_t$ is a one-hot vector (active at the $i$-th element), then the dot product performs masking of all outputs except the $i$-th one. Note that the reverse derivative $R[e] : \mathbb{R}^b \times \mathbb{R}^b \times \mathbb{R} \to \mathbb{R}^b \times \mathbb{R}^b$ of the dot product is defined as $R[e](b_t, b_p, \alpha) = (\alpha \cdot b_p, \alpha \cdot b_t)$.

#### 3.3 Learning Rates as Parametric Lenses

After models and loss maps, another ingredient of the learning process is the learning rate, which we formalise as follows.

Definition 6. A learning rate $\alpha$ on $L$ consists of a lens from $(L, L')$ to $(1, 1)$, where $1$ is a terminal object in $C$.

Note that the get component of the learning rate lens must be the unique map to $1$, while the put component is a map $L \times 1 \to L'$, that is, simply a map $\alpha^* : L \to L'$. Thus we can view $\alpha$ as a parametric lens from $(L, L')$ to $(1, 1)$ (with trivial parameter space) and compose it in $\mathbf{Para}(\mathbf{Lens}(C))$ with a model and a loss map (cf. (5)) to get (6):

*(Diagram (6), omitted: the model lens $(f, R[f])$ composed with the loss lens and capped off with the learning rate $\alpha$.)*

Example 10. In standard supervised learning in $\mathbf{Smooth}$, one fixes some $\epsilon > 0$ as a learning rate, and this is used to define $\alpha$: $\alpha$ is simply constantly $-\epsilon$, i.e., $\alpha(l) = -\epsilon$ for any $l \in L$.

Example 11. In supervised learning in $\mathrm{POLY}_{\mathbb{Z}_2}$, the standard learning rate is quite different: for a given $L$ it is defined as the identity function, $\alpha(l) = l$.

Other learning rate morphisms are possible as well: for example, one could fix some $\epsilon > 0$ and define a learning rate in $\mathbf{Smooth}$ by $\alpha(l) = -\epsilon \cdot l$. Such a choice would take into account how far away the network is from its desired goal and adjust the learning rate accordingly.

#### 3.4 Optimisers as Reparameterisations

In this section we consider how to implement gradient descent (and its variants) in our framework. To this aim, note that the parametric lens $(f, R[f])$ representing our model (see (4)) outputs a $P'$, which represents a change in the parameter space. Now, we would like to receive not just the requested change in the parameter, but the new parameter itself. This is precisely what gradient descent accomplishes, when formalised as a lens.

Definition 7. In any CRDC $C$ we can define gradient update as a map $G$ in $\mathbf{Lens}(C)$ from $(P, P)$ to $(P, P')$, consisting of $(G, G^*) : (P, P) \to (P, P')$, where $G(p) = p$ and $G^*(p, p') = p + p'$.<sup>7</sup>

Intuitively, such a lens allows one to receive the requested change in parameter and implement that change by adding that value to the current parameter. By its type, we can now "plug" the gradient descent lens G: (P, P) → (P, P′ ) above the model (f, R[f]) in (4) — formally, this is accomplished as a reparameterisation of the parametric morphism (f, R[f]), cf. Section 2.1. This gives us Figure 3 (left).

Fig. 3: Model reparameterised by basic gradient descent (left) and a generic stateful optimiser (right).

Example 12 (Gradient update in Smooth). In $\mathbf{Smooth}$, the gradient descent reparameterisation will take the output from $P'$ and add it to the current value of $P$ to get a new value of $P$.

Example 13 (Gradient update in Boolean circuits). In the CRDC $\mathrm{POLY}_{\mathbb{Z}_2}$, the gradient descent reparameterisation will again take the output from $P'$ and add it to the current value of $P$ to get a new value of $P$; however, since $+$ in $\mathbb{Z}_2$ is the same as XOR, this can also be seen as taking the XOR of the current parameter and the requested change; this is exactly how this algorithm is implemented in [50].
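A one-line Python sketch (ours) of the gradient-update lens of Definition 7, covering both examples:

```python
# The gradient-update lens G : (P, P) -> (P, P') of Definition 7.
G_get = lambda p: p             # get returns the current parameter
G_put = lambda p, dp: p + dp    # put applies the requested change

print(G_put(0.5, -0.1))         # 0.4 in Smooth
print(G_put(1, 1) % 2)          # 0: in POLY_Z2, + is XOR (addition mod 2)
```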

Other variants of gradient descent also fit naturally into this framework by allowing for additional input/output data with $P$. In particular, many of them keep track of the history of previous updates and use that to inform the next one. This is easy to model in our setup: instead of asking for a lens $(P, P) \to (P, P')$, we ask instead for a lens $(S \times P, S \times P) \to (P, P')$, where $S$ is some "state" object.

<sup>7</sup> Note that, as in the discussion in Section 2.4, we are implicitly assuming that $P = P'$; we have merely notated them differently to emphasise the different "roles" they play (the first $P$ can be thought of as "points", the second as "vectors").

Definition 8. A stateful parameter update consists of a choice of object $S$ (the state object) and a lens $U : (S \times P, S \times P) \to (P, P')$.

Again, we view this optimiser as a reparameterisation which may be "plugged into" a model as in Figure 3 (right). Let us now consider how several well-known optimisers can be implemented in this way.

Example 14 (Momentum). In the momentum variant of gradient descent, one keeps track of the previous change and uses this to inform how the current parameter should be changed. Thus, in this case, we set $S = P$, fix some $\gamma > 0$, and define the momentum lens $(U, U^*) : (P \times P, P \times P) \to (P, P')$ by $U(s, p) = p$ and $U^*(s, p, p') = (s', p + s')$, where $s' = -\gamma s + p'$. Note that momentum recovers gradient descent when $\gamma = 0$.
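In code, the momentum lens might look as follows (a sketch under our own naming; the value of $\gamma$ is an arbitrary choice):

```python
gamma = 0.9                        # momentum coefficient (our choice)

def U_get(s, p):
    return p                       # momentum does not alter the parameter

def U_put(s, p, dp):
    s_new = -gamma * s + dp        # s' = -gamma*s + p'
    return s_new, p + s_new        # (new state, updated parameter)

print(U_put(0.0, 1.0, -0.2))       # with zero state this is plain gradient update
```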

In both standard gradient descent and momentum, our lens representation has a trivial get part. However, as soon as we move to more complicated variants, this is no longer the case, as for instance in Nesterov momentum below.

Example 15 (Nesterov momentum). In Nesterov momentum, one uses the momentum from previous updates to tweak the input parameter supplied to the network. We can precisely capture this by using a small variation of the lens in the previous example. Again, we set $S = P$, fix some $\gamma > 0$, and define the Nesterov momentum lens $(U, U^*) : (P \times P, P \times P) \to (P, P')$ by $U(s, p) = p + \gamma s$ and $U^*$ as in the previous example.

Example 16 (Adagrad). Given any fixed $\epsilon > 0$ and $\delta \sim 10^{-7}$, Adagrad [20] is given by $S = P$, with the lens whose get part is $(g, p) \mapsto p$. The put is $(g, p, p') \mapsto (g', p + \frac{\epsilon}{\delta + \sqrt{g'}} \odot p')$, where $g' = g + p' \odot p'$ and $\odot$ is the elementwise (Hadamard) product. Unlike other optimisation algorithms, where the learning rate is the same for all parameters, Adagrad divides the learning rate of each individual parameter by the square root of the past accumulated gradients.

Example 17 (Adam). Adaptive Moment Estimation (Adam) [32] is another method that computes adaptive learning rates for each parameter by storing exponentially decaying averages of past gradients ($m$) and past squared gradients ($v$). For fixed $\beta_1, \beta_2 \in [0, 1)$, $\epsilon > 0$, and $\delta \sim 10^{-8}$, Adam is given by $S = P \times P$, with the lens whose get part is $(m, v, p) \mapsto p$ and whose put part is $\mathsf{put}(m, v, p, p') = (\widehat{m}', \widehat{v}', p + \frac{\epsilon}{\delta + \sqrt{\widehat{v}'}} \odot \widehat{m}')$, where $m' = \beta_1 m + (1 - \beta_1)p'$, $v' = \beta_2 v + (1 - \beta_2)p'^2$, and $\widehat{m}' = \frac{m'}{1 - \beta_1^t}$, $\widehat{v}' = \frac{v'}{1 - \beta_2^t}$.
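A numpy sketch (ours) of this put map for a single parameter; following the formulation above, the new state stores the bias-corrected moments, and the step counter $t$ used in the bias correction is passed in explicitly:

```python
import numpy as np

beta1, beta2, eps, delta = 0.9, 0.999, 0.01, 1e-8   # our choice of constants

def adam_put(m, v, p, dp, t):
    m_new = beta1 * m + (1 - beta1) * dp            # m' = b1*m + (1-b1)*p'
    v_new = beta2 * v + (1 - beta2) * dp**2         # v' = b2*v + (1-b2)*p'^2
    m_hat = m_new / (1 - beta1**t)                  # bias-corrected moments
    v_hat = v_new / (1 - beta2**t)
    return m_hat, v_hat, p + eps / (delta + np.sqrt(v_hat)) * m_hat

print(adam_put(0.0, 0.0, 1.0, -0.5, t=1))           # (-0.5, 0.25, ~0.99)
```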

Note that, so far, optimisers/reparameterisations have been added to the $P/P'$ wires in order to change the model's parameters (Fig. 3). In Section 4.2 we will study them on the $A/A'$ wires instead, giving deep dreaming.

### 4 Learning with Parametric Lenses

In the previous section we have seen how all the components of learning can be modelled as parametric lenses. We now study how these components can be put together to form supervised learning systems. In addition to the most common examples of supervised learning, i.e. systems that learn parameters, we also study a different kind of system: those that learn their inputs. The latter is a technique commonly known as deep dreaming, and we present it as a natural counterpart of supervised learning of parameters.

Before we describe these systems, it will be convenient to represent all the inputs and outputs of our parametric lenses as parameters. In (6), the $P/P'$ and $B/B'$ inputs and outputs are already parameters; however, the $A/A'$ wires are not. To view the $A/A'$ inputs as parameters, we compose that system with the parametric lens $\eta$ we now define. The parametric lens $\eta$ has type $(1, 1) \to (A, A')$ with parameter space $(A, A')$, defined by $(\mathsf{get}_\eta = 1_A, \mathsf{put}_\eta = \pi_1)$. Composing $\eta$ with the rest of the learning system in (6) gives us the closed parametric lens (7):

*(Diagram (7), omitted: the closed parametric lens, with the $A/A'$, $P/P'$, and $B/B'$ wires all bent into parameter ports.)*

This composite is now a map in $\mathbf{Para}(\mathbf{Lens}(C))$ from $(1, 1)$ to $(1, 1)$; all its inputs and outputs are now vertical wires, i.e., parameters. Unpacking it further, this is a lens of type $(A \times P \times B, A' \times P' \times B') \to (1, 1)$ whose get map is the terminal map, and whose put map is of type $A \times P \times B \to A' \times P' \times B'$. It can be unpacked as the composite $\mathsf{put}(a, p, b_t) = (a', p', b_t')$, where

$$b_p = f(p, a) \qquad (b_t', b_p') = R[\mathsf{loss}](b_t, b_p, \alpha(\mathsf{loss}(b_t, b_p))) \qquad (p', a') = R[f](p, a, b_p').$$

In the next two sections we consider further additions to the image above which correspond to different types of supervised learning.

#### 4.1 Supervised Learning of Parameters

The most common type of learning performed on (7) is supervised learning of parameters. This is done by reparameterising (cf. Section 2.1) the image in the following manner. The parameter ports are reparameterised by one of the (possibly stateful) optimisers described in the previous section, while the backward wires $A'$ of inputs and $B'$ of outputs are discarded. This finally yields the complete picture of a system which learns the parameters in a supervised manner:

Fixing a particular optimiser $(U, U^*) : (S \times P, S \times P) \to (P, P')$, we again unpack the entire construction. This is a map in $\mathbf{Para}(\mathbf{Lens}(C))$ from $(1, 1)$ to $(1, 1)$ whose parameter space is $(A \times S \times P \times B, S \times P)$. In other words, this is a lens of type $(A \times S \times P \times B, S \times P) \to (1, 1)$ whose get component is the terminal map. Its put map has type $A \times S \times P \times B \to S \times P$ and unpacks to $\mathsf{put}(a, s, p, b_t) = U^*(s, p, p')$, where

$$\overline{p} = U(s, p) \qquad b_p = f(\overline{p}, a) \qquad (b_t', b_p') = R[\mathsf{loss}](b_t, b_p, \alpha(\mathsf{loss}(b_t, b_p))) \qquad (p', a') = R[f](\overline{p}, a, b_p').$$

While this formulation might seem daunting, we note that it just explicitly specifies the computation performed by a supervised learning system. The variable $\overline{p}$ represents the parameter supplied to the network by the stateful gradient update rule (in many cases this is equal to $p$); $b_p$ represents the prediction of the network (contrast this with $b_t$, which represents the ground truth from the dataset). Variables with a tick $'$ represent changes: $b_p'$ and $b_t'$ are the changes on predictions and true values respectively, while $p'$ and $a'$ are changes on the parameters and inputs. Furthermore, all of this arises automatically out of the rule for lens composition (3); what we needed to specify is just the lenses themselves.

We justify and illustrate our approach on a series of case studies drawn from the literature. This presentation has the advantage of treating all these instances uniformly in terms of basic constructs, highlighting their similarities and differences. First, we fix some parametric map $(\mathbb{R}^p, f) \in \mathbf{Para}(\mathbf{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$ and the constant negative learning rate $\alpha : \mathbb{R}$ (Example 10). We then vary the loss function and the gradient update, seeing how the put map above reduces to many of the known cases in the literature.

Example 18 (Quadratic error, basic gradient descent). Fix the quadratic error (Example 6) as the loss map and basic gradient update (Example 12). Then the aforementioned put map simplifies. Since there is no state, its type reduces to $A \times P \times B \to P$, and we have $\mathsf{put}(a, p, b_t) = p + p'$, where $(p', a') = R[f](p, a, \alpha \cdot (f(p, a) - b_t))$. Note that $\alpha$ here is simply a constant, and due to the linearity of the reverse derivative (Definition 4), we can slide $\alpha$ from the costate into the basic gradient update lens. Rewriting this update and performing this sliding, we obtain the closed-form update step $\mathsf{put}(a, p, b_t) = p + \alpha \cdot (R[f](p, a, f(p, a) - b_t); \pi_0)$, where the negative descent component of gradient descent is contained in the choice of the negative constant $\alpha$.
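As a concrete illustration, here is a Python sketch (ours) of this closed-form update for a one-dimensional affine model; the model, data, and learning rate are all our own choices:

```python
import numpy as np

alpha = -0.1                                   # constant negative learning rate

def f(p, a):
    return p[0] * a + p[1]                     # affine model, P = R^2, A = B = R

def R_f(p, a, db):
    """Reverse derivative of f at (p, a): Jacobian transposes applied to db."""
    return np.array([a * db, db]), p[0] * db   # (change in p, change in a)

def put(a, p, bt):
    dp, _ = R_f(p, a, alpha * (f(p, a) - bt))  # as in Example 18
    return p + dp

p = np.array([0.0, 0.0])
for a, bt in [(1.0, 3.0), (2.0, 5.0)] * 500:   # samples of the map a |-> 2a + 1
    p = put(a, p, bt)
print(p)                                       # converges towards [2., 1.]
```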

This example gives us a variety of regression algorithms solved iteratively by gradient descent: it embeds some parametric map $(\mathbb{R}^p, f) : \mathbb{R}^a \to \mathbb{R}^b$ into a system which performs regression on input data, where $a$ denotes the input to the model and $b_t$ denotes the ground truth. If the corresponding $f$ is linear and $b = 1$, we recover simple linear regression with gradient descent. If the codomain is multi-dimensional, i.e. we are predicting multiple scalars, then we recover multivariate linear regression. Likewise, we can model a multi-layer perceptron, or even more complex neural network architectures, performing supervised learning of parameters simply by changing the underlying parametric map.

Example 19 (Softmax cross entropy, basic gradient descent). Fix Softmax cross entropy (Example 8) as the loss map and basic gradient update (Example 12). Again the put map simplifies. The type reduces to $A \times P \times B \to P$ and we have $\mathsf{put}(a, p, b_t) = p + p'$, where $(p', a') = R[f](p, a, \alpha \cdot (\mathrm{Softmax}(f(p, a)) - b_t))$. The same rewriting performed in the previous example can be done here.

This example recovers logistic regression, i.e. classification.

Example 20 (Mean squared error, Nesterov momentum). Fix the quadratic error (Example 6) as the loss map and Nesterov momentum (Example 15) as the gradient update. This time the put map $A \times S \times P \times B \to S \times P$ does not have a simplified type. The implementation of put reduces to $\mathsf{put}(a, s, p, b_t) = (s', p + s')$, where $\overline{p} = p + \gamma s$, $(p', a') = R[f](\overline{p}, a, \alpha \cdot (f(\overline{p}, a) - b_t))$, and $s' = -\gamma s + p'$.

This example with Nesterov momentum differs in two key points from all the other ones: i) the optimiser is stateful, and ii) its get map is not trivial. While many other optimisers are stateful, the non-triviality of the get map here showcases the importance of lenses. They allow us to make precise the notion of computing a "lookahead" value for Nesterov momentum, something that is usually handled in ad-hoc ways in practice. Here, the algebra of lens composition handles this case naturally by using the get map, a seemingly trivial, unused piece of data for previous optimisers.

Our last example, using a different base category $\mathrm{POLY}_{\mathbb{Z}_2}$, shows that our framework captures learning in not just continuous, but discrete settings too. Again, we fix a parametric map $(\mathbb{Z}_2^p, f) \in \mathbf{Para}(\mathrm{POLY}_{\mathbb{Z}_2})(\mathbb{Z}_2^a, \mathbb{Z}_2^b)$, but this time we fix the identity learning rate (Example 11) instead of a constant one.

Example 21 (Basic learning in Boolean circuits). Fix XOR as the loss map (Example 7) and the basic gradient update (Example 13). The put map again simplifies. The type reduces to $A \times P \times B \to P$ and the implementation to $\mathsf{put}(a, p, b_t) = p + p'$, where $(p', a') = R[f](p, a, f(p, a) + b_t)$.

A sketch of learning iteration. Having described a number of examples of supervised learning, we outline how to model learning iteration in our framework. Recall the aforementioned put map, whose type is $A \times P \times B \to P$ (for simplicity modelled here without the state $S$). This map takes an input-output pair $(a_0, b_0)$ and the current parameter $p_i$, and produces an updated parameter $p_{i+1}$. At the next time step, it takes a potentially different input-output pair $(a_1, b_1)$ and the updated parameter $p_{i+1}$, and produces $p_{i+2}$. This process is then repeated. We can model this iteration as a composition of the put map with itself, as a composite $(A \times \mathsf{put} \times B); \mathsf{put}$ whose type is $A \times A \times P \times B \times B \to P$. This map takes two input-output pairs $A \times B$ and a parameter, and produces a new parameter by processing these datapoints in sequence. One can see how this process can be iterated any number of times, and even represented as a string diagram.

But we note that, with a slight reformulation of the put map, it is possible to obtain a conceptually much simpler definition. The key insight lies in seeing that the map $\mathsf{put} : A \times P \times B \to P$ is essentially an endo-map $P \to P$ with some extra inputs $A \times B$; it is a parametric map!

In other words, we can recast the put map as a parametric map $(A \times B, \mathsf{put}) \in \mathbf{Para}(C)(P, P)$. Being an endo-map, it can be composed with itself. The resulting composite is an endo-map taking two "parameters": the input-output pair at time step 0 and the one at time step 1. This process can then be repeated, with $\mathbf{Para}$ composition automatically taking care of the algebra of iteration.

*(Diagram omitted: the put map iterated by $\mathbf{Para}$ composition, with one $A \times B$ parameter wire per time step.)*

This reformulation captures the essence of parameter iteration: one can think of it as a trajectory $p_i, p_{i+1}, p_{i+2}, \ldots$ through the parameter space; but it is a trajectory parameterised by the dataset. With different datasets the algorithm will take a different path through this space and learn different things.
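This view of iteration translates directly into a fold over the dataset; the sketch below (ours, reusing the affine model from the sketch after Example 18) threads the parameter through repeated applications of put:

```python
from functools import reduce
import numpy as np

alpha = -0.1

def f(p, a):
    return p[0] * a + p[1]

def put(p, sample):
    """put recast as an endo-map on P, parameterised by a datapoint (a, bt)."""
    a, bt = sample
    db = alpha * (f(p, a) - bt)
    return p + np.array([a * db, db])

dataset = [(1.0, 3.0), (2.0, 5.0)] * 500
p_final = reduce(put, dataset, np.array([0.0, 0.0]))
print(p_final)       # the endpoint of the trajectory through parameter space
```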

#### 4.2 Deep Dreaming: Supervised Learning of Inputs

We have seen that reparameterising the parameter port with gradient descent allows us to capture supervised parameter learning. In this section we describe how reparameterising the input port provides us with a way to enhance an input image to elicit a particular interpretation. This is the idea behind the technique called Deep Dreaming, appearing in the literature in many forms [19, 34, 35, 44].

Deep dreaming is a technique which uses the parameters p of some trained classifier network to iteratively dream up, or amplify, some features of a class b on a chosen input a. For example, if we start with an image of a landscape $a_0$, a label b of a "cat" and a parameter p of a sufficiently well-trained classifier, we can start performing "learning" as usual: computing the predicted class for the landscape $a_0$ for the network with parameters p, and then computing the distance between the prediction and our label of a cat b. When performing backpropagation, the respective changes computed for each layer tell us how the activations of that layer should have been changed to be more "cat"-like. This includes the first (input) layer of the landscape $a_0$. Usually, we discard these changes and apply the gradient update to the parameters. In deep dreaming we instead discard the parameters and apply the gradient update to the input (see (8)). Gradient update here takes these changes and computes a new image $a_1$ which is the same image of the landscape, but changed slightly so as to look more like whatever the network thinks a cat looks like. This is the essence of deep dreaming, where iteration of this process allows networks to dream up features and shapes on a particular chosen image [1].

Just like in the previous subsection, we can write this deep dreaming system as a map in Para(Lens(C)) from (1, 1) to (1, 1) whose parameter space is $(S \times A \times P \times B, S \times A)$. In other words, this is a lens of type $(S \times A \times P \times B, S \times A) \to (1, 1)$ whose get map is trivial. Its put map has the type $S \times A \times P \times B \to S \times A$ and unpacks to $\mathrm{put}(s, a, p, b_t) = U^*(s, a, a')$, where $\overline{a} = U(s, a)$, $b_p = f(p, \overline{a})$, $(b'_t, b'_p) = R[\mathrm{loss}](b_t, b_p, \alpha(\mathrm{loss}(b_t, b_p)))$, and $(p', a') = R[f](p, \overline{a}, b'_p)$.

We note that deep dreaming is usually presented without any loss function, as a maximisation of a particular activation in the last layer of the network output [44, Section 2]. This maximisation is done with gradient ascent, as opposed to gradient descent. However, this is just a special case of our framework where the loss function is the dot product (Example 9). The choice of the particular activation is encoded as a one-hot vector, and the loss function in that case essentially masks the network output, leaving active only the particular chosen activation. The final component is the gradient ascent: this is simply recovered by choosing a positive, instead of a negative, learning rate [44]. We explicitly unpack this in the following example.

Example 22 (Deep dreaming, dot product loss, basic gradient update). Fix Smooth as the base category, a parametric map $(\mathbb{R}^p, f) : \mathrm{Para}(\mathrm{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$, the dot product loss (Example 9), basic gradient update (Example 12), and a positive learning rate $\alpha \in \mathbb{R}$. Then the above put map simplifies. Since there is no state, its type reduces to $A \times P \times B \to A$ and its implementation to $\mathrm{put}(a, p, b_t) = a + a'$, where $(p', a') = R[f](p, a, \alpha \cdot b_t)$. Like in Example 18, this update can be rewritten as $\mathrm{put}(a, p, b_t) = a + \alpha \cdot (R[f](p, a, b_t); \pi_1)$, making a few things apparent. This update does not depend on the prediction $f(p, a)$: no matter what the network has predicted, the goal is always to maximize particular activations. Which activations? The ones chosen by $b_t$. When $b_t$ is a one-hot vector, this picks out the activation of just one class to maximize, which is often done in practice.
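For concreteness, a minimal sketch of this input update, again with a hypothetical reverse derivative rf of the network:

```python
def dream_put(a, p, b_t, rf, alpha=0.01):
    """One deep-dreaming step (Example 22): gradient ascent on the input.

    rf  : assumed reverse derivative of f, rf(p, a, db) -> (dp, da)
    b_t : one-hot vector selecting the activation to maximize
    """
    _, da = rf(p, a, alpha * b_t)   # the dot-product loss masks the output
    return a + da                   # update the input; discard dp
```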

While we present only the most basic image, there is plenty of room left for exploration. The work of [44, Section 2] adds an extra regularization term to the image. In general, the neural network f is sometimes changed to copy a number of internal activations which are then exposed on the output layer. Maximizing all these activations often produces more visually appealing results. In the literature we did not find an example which uses the Softmax-cross entropy (Example 8) as a loss function in deep dreaming, which seems like the more natural choice in this setting. Furthermore, while deep dreaming commonly uses basic gradient descent, there is nothing preventing the use of any of the optimiser lenses discussed in the previous section, or even doing deep dreaming in the context of Boolean circuits. Lastly, the learning iteration described at the end of the previous subsection can be modelled here in an analogous way.

### 5 Implementation

We provide a proof-of-concept implementation as a Python library; full usage examples, source code, and experiments can be found at [17]. We demonstrate the correctness of our library empirically using a number of experiments implemented both in our library and in Keras [11], a popular framework for deep learning. For example, one experiment is a model for the MNIST image classification problem [33]: we implement the same model in both frameworks and achieve comparable accuracy. Note that despite similarities between the user interfaces of our library and of Keras, a model in our framework is constructed as a composition of parametric lenses. This is fundamentally different from the approach taken by Keras and other existing libraries, and highlights how our proposed algebraic structures naturally guide programming practice.

In summary, our implementation demonstrates the advantages of our approach. Firstly, computing the gradients of the network is greatly simplified through the use of lens composition. Secondly, model architectures can be expressed in a principled, mathematical language, as morphisms of a monoidal category. Finally, the modularity of our approach makes it easy to see how various aspects of training can be modified: for example, one can define a new optimization algorithm simply by defining an appropriate lens. We now give a brief sketch of our implementation.

#### 5.1 Constructing a Model with Lens and Para

We model a lens $(f, f^*)$ in our library with the Lens class, which consists of a pair of maps fwd and rev corresponding to $f$ and $f^*$, respectively. For example, we write the identity lens $(1_A, \pi_2)$ as follows:

```python
identity = Lens(lambda x: x, lambda x_dy: x_dy[1])
```

The composition (in diagrammatic order) of Lens values f and g is written f >> g, and monoidal composition as f @ g. Similarly, the type of Para maps is modeled by the Para class, with composition and monoidal product written the same way. Our library provides several primitive Lens and Para values.
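The full implementation lives in the repository [17]; the following is only a minimal sketch of how such a Lens class could look, assuming rev takes the pair (input, output perturbation):

```python
class Lens:
    """A lens: a forward map fwd : A -> B and a reverse map
    rev : (A, B') -> A' taking the pair (input, change in output)."""
    def __init__(self, fwd, rev):
        self.fwd, self.rev = fwd, rev

    def __rshift__(self, other):
        """Sequential composition self >> other (recomputes self.fwd
        in the reverse pass for simplicity)."""
        return Lens(
            lambda x: other.fwd(self.fwd(x)),
            lambda x_dy: self.rev(
                (x_dy[0], other.rev((self.fwd(x_dy[0]), x_dy[1])))))

    def __matmul__(self, other):
        """Monoidal product self @ other, acting componentwise on pairs."""
        return Lens(
            lambda xs: (self.fwd(xs[0]), other.fwd(xs[1])),
            lambda xs_dys: (self.rev((xs_dys[0][0], xs_dys[1][0])),
                            other.rev((xs_dys[0][1], xs_dys[1][1]))))
```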

Let us now see how to construct a single layer neural network from the composition of such primitives. Diagrammatically, we wish to construct the following model, representing a single 'dense' layer of a neural network:

*(String diagram (9) omitted: the dense layer as the composite of the parametric lenses linear, bias, and activation, with parameter ports $\mathbb{R}^{b \times a}$ and $\mathbb{R}^b$.)*

Here, the parameters of linear are the coefficients of a $b \times a$ matrix, and the underlying lens has as its forward map the function $(M, x) \mapsto M \cdot x$, where $M$ is the $b \times a$ matrix whose coefficients are the $\mathbb{R}^{b \times a}$ parameters, and $x \in \mathbb{R}^a$ is the input vector. The bias map is even simpler: the forward map of the underlying lens is simply pointwise addition of inputs and parameters, $(b, x) \mapsto b + x$. Finally, the activation map simply applies a nonlinear function (e.g., sigmoid) to the input, and thus has the trivial (unit) parameter space. The representation of this composition in code is straightforward: we can simply compose the three primitive Para maps as in (9):

```python
def dense(a, b, activation):
    return linear(a, b) >> bias(b) >> activation
```

Note that by constructing model architectures in this way, the computation of reverse derivatives is greatly simplified: we obtain the reverse derivative 'for free' as the put map of the model. Furthermore, adding new primitives is also simplified: the user need simply provide a function and its reverse derivative in the form of a Para map. Finally, notice also that our approach is truly compositional: we can define a hidden layer neural network with n hidden units simply by composing two dense layers, as follows:

```python
dense(a, n, activation) >> dense(n, b, activation)
```
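To give a flavour of how such a primitive could be supplied, here is a sketch of linear, assuming the Lens class sketched above and treating Para as a simple wrapper marking the first input component as the parameter (all names are illustrative, not the library's actual definitions; see [17] for those):

```python
import numpy as np

Para = lambda lens: lens  # stand-in: here Para merely tags the lens

def linear(a, b):
    """Forward map (M, x) -> M @ x; reverse map ((M, x), dy) ->
    (outer(dy, x), M^T dy), i.e., the gradients for M and x."""
    return Para(Lens(
        lambda Mx: Mx[0] @ Mx[1],
        lambda Mx_dy: (np.outer(Mx_dy[1], Mx_dy[0][1]),   # dM = dy x^T
                       Mx_dy[0][0].T @ Mx_dy[1])))        # dx = M^T dy
```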

#### 5.2 Learning

Now that we have constructed a model, we also need to use it to learn from data. Concretely, we will construct a full parametric lens as in Figure 2, and then extract its put map to iterate over the dataset.

By way of example, let us see how to construct the parametric lens (10), representing basic gradient descent over a single-layer neural network with a fixed learning rate.

This morphism is constructed essentially as below, where apply_update(α, f) represents the 'vertical stacking' of α atop f:

```python
apply_update(basic_update, dense) >> loss >> learning_rate(ϵ)
```

Now, given the parametric lens of (10), one can construct a morphism step : $B \times P \times A \to P$, which is simply the put map of the lens. Training the model then consists of iterating the step function over dataset examples $(x, y) \in A \times B$ to optimise some initial choice of parameters $\theta_0 \in P$, by letting $\theta_{i+1} = \mathrm{step}(y_i, \theta_i, x_i)$.
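In Python, training then amounts to a loop of the following shape (a sketch; step is assumed to be the extracted put map):

```python
def train(step, theta0, dataset, epochs=10):
    """Iterate theta_{i+1} = step(y_i, theta_i, x_i) over the dataset."""
    theta = theta0
    for _ in range(epochs):
        for x, y in dataset:
            theta = step(y, theta, x)
    return theta
```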

Note that our library also provides a utility function to construct step from its various pieces:

```python
step = supervised_step(model, update, loss, learning_rate)
```

For an end-to-end example of model training and iteration, we refer the interested reader to the experiments accompanying the code [17].

### 6 Related Work

The work [23] is closely related to ours, in that it provides an abstract categorical model of backpropagation. However, it differs in a number of key aspects. We give a complete lens-theoretic explanation of what is back-propagated via (i) the use of CRDCs to model gradients; and (ii) the Para construction to model parametric functions and parameter update. We can thus go well beyond [23] in terms of examples: their example of smooth functions and basic gradient descent is covered by our Section 4.1.

We also explain some of the constructions of [23] in a more structured way. For example, rather than considering the category Learn of [23] as primitive, here we construct it as a composite of two more basic constructions (the Para and Lens constructions). This flexibility could be used, for example, to compositionally replace Para with a variant allowing parameters to come from a different category, or to replace lenses with the category of optics [38], enabling us to model things such as control flow using prisms.

One more relevant aspect is functoriality. We use a functor to augment a parametric map with its backward pass, just like [23]. However, they additionally augmented this map with a loss map and gradient descent using a functor as well. This added extra conditions on the partial derivatives of the loss function: it needed to be invertible in the second variable. This constraint was not justified in [23], nor is it a constraint that appears in machine learning practice. This led us to re-examine their constructions, coming up with our reformulation that does not require it. While loss maps and optimisers are mentioned in [23] as parts of the aforementioned functor, here they are extracted out and play a key role: loss maps are parametric lenses and optimisers are reparameterisations. Thus, in this paper we instead use Para-composition to add the loss map to the model, and Para 2-cells to add optimisers. The mentioned inverse of the partial derivative of the loss map in the second variable was also hypothesised to be relevant to deep dreaming. We have investigated this possibility thoroughly in our paper, showing

it is the gradient update which is used to dream up pictures. We also correct a small issue in Theorem III.2 of [23]. There, the morphisms of Learn were defined up to an equivalence (p. 4 of [23]) but, unfortunately, the functor defined in Theorem III.2 does not respect this equivalence relation. Our approach instead uses 2-cells which come from the universal property of Para: a 2-cell from $(P, f) : A \to B$ to $(Q, g) : A \to B$ is a lens, and hence has two components: a map $\alpha : Q \to P$ and $\alpha^* : Q \times P \to Q$. By comparison, we can see the equivalence relation of [23] as being induced by a map $\alpha : Q \to P$ alone, and not a lens. Our approach highlights the importance of the 2-categorical structure of learners. In addition, it does not treat the functor Para(C) → Learn as primitive. In our case, this functor has the type Para(C) → Para(Lens(C)) and arises from applying Para to a canonical functor C → Lens(C) that exists for any reverse derivative category, not just Smooth. Lastly, in our paper we took advantage of the graphical calculus for Para, redrawing many diagrams appearing in [23] in a structured way.

Other than [23], there are a few more relevant papers. The work of [18] contains a sketch of some of the ideas this paper evolved from. They are based on the interplay of optics with parameterisation, albeit framed in the setting of diffeological spaces, and requiring cartesian and local cartesian closed structure on the base category. Lenses and learners are studied in the eponymous work of [22], which observes that learners are parametric lenses. They do not explore any of the relevant Para or CRDC structure, but make the distinction between symmetric and asymmetric lenses, studying how they are related to the learners defined in [23]. A lens-like implementation of automatic differentiation is the focus of [21], but learning algorithms are not studied there. A relationship between a category-theoretic perspective on probabilistic modeling and gradient-based optimisation is studied in [42], which also studies a variant of the Para construction. Usage of Cartesian differential categories to study learning is found in [46]. They extend the differential operator to work on stateful maps, but do not study lenses, parameterisation, nor update maps. The work of [24] studies deep learning in the context of Cycle-consistent Generative Adversarial Networks [51] and formalises it via free and quotient categories, making parallels to the categorical formulations of database theory [45]. They do use the Para construction, but do not relate it to lenses nor reverse derivative categories. A general survey of category-theoretic approaches to machine learning, covering many of the above papers, can be found in [43]. Lastly, the concept of parametric lenses has started appearing in recent formulations of categorical game theory and cybernetics [9,10]. The work of [9] generalises the study of parametric lenses into parametric optics and connects it to game-theoretic concepts such as Nash equilibria.

### 7 Conclusions and Future Directions

We have given a categorical foundation of gradient-based learning algorithms which achieves a number of important goals. The foundation is principled and mathematically clean, based on the fundamental idea of a parametric lens. It covers a wide variety of examples: different optimisers and loss maps in gradient-based learning, different settings where gradient-based learning happens (smooth functions vs. Boolean circuits), and both learning of parameters and learning of inputs (deep dreaming). Finally, the foundation is more than a mere abstraction: we have also shown how it can be used to give a practical implementation of learning, as discussed in Section 5.

There are a number of important directions which are possible to explore because of this work. One of the most exciting ones is the extension to more complex neural network architectures. Our formulation of the loss map as a parametric lens should pave the way for Generative Adversarial Networks [27], an exciting architecture whose loss map can be said to be learned in tandem with the base network. In all our settings we have fixed an optimiser beforehand. The work of [4] describes a meta-learning approach which sees the optimiser as a neural network whose parameters and gradient update rule can be learned. This is an exciting prospect since one can model optimisers as parametric lenses, and our framework covers learning with parametric lenses. Recurrent neural networks are another example of a more complex architecture, which has already been studied in the context of differential categories in [46]. When it comes to architectures, future work includes modelling some classical systems as well, such as Support Vector Machines [15], which should be possible with the usage of loss maps such as the hinge loss.

Future work also includes using the full power of the CRDC axioms. In particular, axioms RD.6 and RD.7, which deal with the behaviour of higher-order derivatives, were not exploited in our work, but they should play a role in modelling supervised learning algorithms that use higher-order derivatives (for example, the Hessian) for additional optimisations. Taking this idea in a different direction, one can see that much of our work can be applied to any functor of the form $F : \mathcal{C} \to \mathrm{Lens}(\mathcal{C})$; F does not necessarily have to be of the form $f \mapsto (f, R[f])$ for a CRDC $R$. Moreover, by working with more generalised forms of the lens category (such as dependent lenses), we may be able to capture ideas related to supervised learning on manifolds. And, of course, we can vary the parameter space to endow it with different structure from the functions we wish to learn. In this vein, we wish to use fibrations/dependent types to model the use of tangent bundles: this would foster the extension of the correct-by-construction paradigm to machine learning, thereby addressing the widely acknowledged problem of trusted machine learning. These possibilities are made much easier by the compositional nature of our framework. Another key topic for future work is to link gradient-based learning with game theory. At a high level, the former takes little incremental steps to achieve an equilibrium, while the latter aims to do so in one fell swoop. Formalising this intuition is possible with our lens-based framework and the lens-based framework for game theory [25]. Finally, because our framework is quite general, in future work we plan to consider further modifications and additions to encompass non-supervised, probabilistic, and non-gradient-based learning. This includes genetic algorithms and reinforcement learning.

Acknowledgements. Fabio Zanasi acknowledges support from EPSRC EP/V002376/1. Geoff Cruttwell acknowledges support from NSERC.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Compiling Universal Probabilistic Programming Languages with Efficient Parallel Sequential Monte Carlo Inference⋆

Daniel Lundén¹(✉), Joey Öhman², Jan Kudlicka³, Viktor Senderov⁴, Fredrik Ronquist⁴,⁵, and David Broman¹

¹ EECS and Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden, {dlunde,dbro}@kth.se

² AI Sweden, Stockholm, Sweden, joey.ohman@ai.se

³ Department of Data Science and Analytics, BI Norwegian Business School, Oslo, Norway, jan.kudlicka@bi.no

⁴ Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden, {viktor.senderov,fredrik.ronquist}@nrm.se

⁵ Department of Zoology, Stockholm University

Abstract. Probabilistic programming languages (PPLs) allow users to encode arbitrary inference problems, and PPL implementations provide general-purpose automatic inference for these problems. However, constructing inference implementations that are efficient enough is challenging for many real-world problems. Often, this is due to PPLs not fully exploiting available parallelization and optimization opportunities. For example, handling probabilistic checkpoints in PPLs through continuation-passing style transformations or non-preemptive multitasking—as is done in many popular PPLs—often disallows compilation to low-level languages required for high-performance platforms such as GPUs. To solve the checkpoint problem, we introduce the concept of PPL control-flow graphs (PCFGs)—a simple and efficient approach to checkpoints in low-level languages. We use this approach to implement RootPPL: a low-level PPL built on CUDA and C++ with OpenMP, providing highly efficient and massively parallel SMC inference. We also introduce a general method of compiling universal high-level PPLs to PCFGs and illustrate its application when compiling Miking CorePPL—a high-level universal PPL—to RootPPL. The approach is the first to compile a universal PPL to GPUs with SMC inference. We evaluate RootPPL and the CorePPL compiler through a set of real-world experiments in the domains of phylogenetics and epidemiology, demonstrating up to 6× speedups over state-of-the-art PPLs implementing SMC inference.

Keywords: Probabilistic Programming Languages · Compilers · Sequential Monte Carlo · GPU Compilation

⋆ This project is financially supported by the Swedish Foundation for Strategic Research (FFL15-0032 and RIT15-0012), the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PhyPPL (No 898120), and the Swedish Research Council (grant number 2018-04620).

### 1 Introduction

Probabilistic programming languages (PPLs) allow for encoding a wide range of statistical inference problems and provide inference algorithms as part of their implementations. Specifically, PPLs allow language users to focus solely on encoding their statistical problems, which the language implementation then solves automatically. Many such languages exist and are applied in, e.g., statistics, machine learning, and artificial intelligence. Some example PPLs are WebPPL [20], Birch [32], Anglican [40], and Pyro [10].

However, implementing efficient PPL inference algorithms is challenging for many real-world problems. Most often, universal⁶ PPLs implement general-purpose inference algorithms—most commonly sequential Monte Carlo (SMC) methods [14], Markov chain Monte Carlo (MCMC) methods [18], Hamiltonian Monte Carlo (HMC) methods [12], variational inference (VI) [39], or a combination of these. In some cases, poor efficiency may be due to an inference algorithm not well suited to the particular PPL program. However, in other cases, the PPL implementations do not fully exploit opportunities for parallelization and optimization on the available hardware. Unfortunately, doing this is often tricky without introducing complexity for end-users of PPLs.

A critical performance consideration is handling probabilistic checkpoints [37] in PPLs. Checkpoints are locations in probabilistic programs where inference algorithms must interject, for example, to resample in SMC inference or record random draw locations where MCMC inference can explore alternative execution paths. The most common approach to checkpoints—used in universal PPLs such as WebPPL [20], Anglican [40], and Birch [32]—is to associate them with PPL-specific language constructs. In general, PPL users can place these constructs without restriction, and inference algorithms interject through continuation-passing style (CPS) transformations [9,20,40] or non-preemptive multitasking [32] (e.g., coroutines) that enable pausing and resuming executions. These solutions are often not available in languages such as C and CUDA [1] used for high-performance platforms such as graphics processing units (GPUs), making compiling PPLs to these languages and platforms challenging. Some approaches for running PPLs on GPUs do exist, however. LibBi [29] runs on GPUs with SMC inference but is not universal. Stan [12] and AugurV2 [22] partially run MCMC inference on GPUs but have limited expressive power. Pyro [10] runs on GPUs, but currently not in combination with SMC. In this paper, we compile a universal PPL and run it with SMC on GPUs for the first time.

A more straightforward approach to checkpoints, used for SMC in Birch [32] and Pyro [10], is to encode models with a step function called iteratively. Checkpoints then occur each time step returns. This paper presents a new approach to checkpoint handling, generalizing the step function approach. We write probabilistic programs as a set of code blocks connected in what we term a PPL control-flow graph (PCFG). PPL checkpoints are restricted to occur only at tail position in these blocks, and communication between blocks is allowed only through an explicit PCFG state. As a result, pausing and resuming executions is straightforward: it is simply a matter of stopping after executing a block and then resuming by running the next block. A variable in the PCFG state, set from within the blocks, determines the next block. This variable allows for loops and branching and gives the same expressive power as other universal PPLs. We implement the above approach in RootPPL: a low-level universal PPL framework built using C++ and CUDA with highly efficient and parallel SMC inference. RootPPL consists of both an inference engine and a simple macro-based PPL.

⁶ A term due to Goodman et al. [19]. No precise definition exists, but in principle, a universal PPL program can perform probabilistic operations at any point. In particular, it is not always possible to statically determine the number of random variables.

Fig. 1: The CorePPL and RootPPL toolchain. Solid rectangular components (gray) represent programs and rounded components (blue) translations. The dashed rectangles indicate paper sections.

A problem with RootPPL is that it is low-level and, therefore, challenging to write programs in. In particular, sending data between blocks through the PCFG state can quickly become difficult for more complex models. To solve this, we develop a general technique for compiling high-level universal PPLs to PCFGs. The key idea is to decompose functions in the high-level language into a set of PCFG blocks, such that checkpoints in the original function always occur at tail position in blocks. As a result of the decomposition, the PCFG state must store a part of the call stack. The compiler adds code for handling this call stack explicitly in the PCFG blocks. We illustrate the compilation technique by introducing a high-level source language, Miking CorePPL, and compiling it to RootPPL. Fig. 1 illustrates the overall toolchain.

In summary, we make the following contributions.


Furthermore, we introduce Miking CorePPL in Section 2 and evaluate the performance of RootPPL and the CorePPL compiler in Section 5 on real-world models from phylogenetics and epidemiology, achieving up to 6× speedups over the state-of-the-art. An artifact accompanying this paper supports the evaluation [26]. An extended version of this article is also available [27]. A † symbol in the text indicates more information is available in the extended version.

### 2 Miking CorePPL

This section introduces the Miking CorePPL language, used as a source language for the compiler in Section 4. We discuss design considerations (Section 2.1) and present the syntax and semantics (Section 2.2).

#### 2.1 Design Considerations

Miking CorePPL (or CorePPL for short) is an intermediate representation (IR) PPL, similar to IRs used by LLVM [6] and GCC [2]. This allows the reuse of CorePPL as a target for domain-specific high-level PPLs and PPL compiler back-ends. Consequently, CorePPL needs to be expressive enough to allow easy translation from various domain-specific PPLs and simple enough for practical use as a shared IR for compilers. Therefore, we base CorePPL on the lambda calculus, extended with standard data types and constructs.

We must also consider which PPL-specific constructs to include. Critically, most PPLs include constructs for defining random variables and likelihood updating [21]. CorePPL includes such constructs, including first-class probability distributions, to match the expressive power of existing PPLs.

#### 2.2 Syntax and Semantics

We build CorePPL on top of the Miking framework [11]: a meta-language system for creating domain-specific and general-purpose languages. This allows reusing many existing Miking language components and transformations when building the CorePPL language. More precisely, CorePPL extends Miking Core—a core functional programming language in Miking—with PPL constructs.

A CorePPL program t is inductively defined by

$$\begin{array}{rcl}
\mathtt{t} & ::= & x \;\mid\; \mathtt{lam}\ x.\ \mathtt{t} \;\mid\; \mathtt{t}_1\ \mathtt{t}_2 \;\mid\; \mathtt{let}\ x = \mathtt{t}_1\ \mathtt{in}\ \mathtt{t}_2 \;\mid\; C\ \mathtt{t} \;\mid\; c \\
& \mid & \mathtt{recursive}\ [\mathtt{let}\ x = \mathtt{t}]\ \mathtt{in}\ \mathtt{t} \\
& \mid & \mathtt{match}\ \mathtt{t}_1\ \mathtt{with}\ p\ \mathtt{then}\ \mathtt{t}_2\ \mathtt{else}\ \mathtt{t}_3 \;\mid\; [\mathtt{t}_1,\ \mathtt{t}_2,\ \ldots,\ \mathtt{t}_n] \\
& \mid & \{l_1 = \mathtt{t}_1,\ l_2 = \mathtt{t}_2,\ \ldots,\ l_n = \mathtt{t}_n\} \\
& \mid & \mathtt{assume}\ \mathtt{t} \;\mid\; \mathtt{weight}\ \mathtt{t} \;\mid\; \mathtt{observe}\ \mathtt{t}_1\ \mathtt{t}_2 \;\mid\; D\ \mathtt{t}_1\ \mathtt{t}_2\ \cdots\ \mathtt{t}_{|D|}
\end{array} \tag{1}$$

where the metavariable x ranges over a set of variable names; C over a set of data constructor names; p over a set of patterns; l over a set of record labels; and c over various literals, such as integers, floating-point numbers, booleans, and strings, as well as over various built-in functions in prefix form, such as addi (which adds integers). The notation [let x = t] indicates a sequence of mutually recursive let bindings. The metavariable D ranges over a set of probability distribution names, with |D| indicating the number of parameters for a distribution D. For example, for the normal distribution, $|\mathcal{N}| = 2$. In addition to (1), we will also use the standard syntactic sugar ; to indicate sequencing, as well as if $\mathtt{t}_1$ then $\mathtt{t}_2$ else $\mathtt{t}_3$ for match $\mathtt{t}_1$ with true then $\mathtt{t}_2$ else $\mathtt{t}_3$.

Fig. 2: A toy example encoding a skewed geometric distribution, illustrating CorePPL. Part (a) gives the CorePPL program, and part (b) the corresponding distribution. The upper part of (b) shows the distribution for (a) with line 4 omitted, and the lower part of (b) shows it with line 4 included.

Consider the simple but illustrative CorePPL program in Fig. 2a. The program encodes a variation of the geometric distribution, for which the result is the number of times a coin is flipped until the result is tails. The program's core is the recursive function geometric, defined as a function over the probability of heads for the coin, p. We initially call this function at line 7 with the argument 0.5, indicating a fair coin. On line 2, we define the random variable x to have a Bernoulli distribution (i.e., a single coin flip) using the assume construct (often known as sample in PPLs with sampling-based inference). If the random variable is false (tails), we stop and return the result 1. If the random variable is true (heads), we keep flipping the coin by a recursive call to geometric and add 1 to this result. To illustrate likelihood updating, we make a contrived modification to the standard geometric distribution by adding weight (log 1.5) on line 4. This construct weights the execution by a factor of 1.5 each time the result is heads. Note that CorePPL weight computations are in log-space for numerical stability (hence the log 1.5 to factor by 1.5). Thus, the unnormalized probability of seeing n coin flips, including the final tails, is $0.5^n \cdot 1.5^{n-1}$, where $1.5^{n-1}$ is the factor introduced by the $n-1$ calls to weight. The difference compared to the standard geometric distribution is illustrated in Fig. 2b. The weight construct is also commonly named factor or score in other PPLs.
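For intuition, the following Python sketch mimics one execution of the program in Fig. 2a, returning the sampled value together with its accumulated log-weight (the bookkeeping a PPL performs behind the scenes):

```python
import math
import random

def geometric(p=0.5):
    """One execution of the skewed geometric model of Fig. 2a:
    returns (number of flips, accumulated log-weight)."""
    x = random.random() < p                 # assume (Bernoulli p)
    if not x:
        return 1, 0.0                       # tails: stop
    n, logw = geometric(p)                  # heads: flip again
    return n + 1, logw + math.log(1.5)      # weight (log 1.5)
```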

What separates PPLs from ordinary programming languages is the ability to modify the likelihood of execution paths, akin to the use of weight in Fig. 2a. We often use likelihood modification to condition a probabilistic model on observed data. For this purpose, CorePPL includes an explicit observe construct, which allows for modifying the likelihood based on observed data assumed to originate from a given probability distribution. For instance, observe 0.3 (Normal 0 1) updates the likelihood with $f_{\mathcal{N}(0,1)}(0.3)$ (note that this can equivalently be expressed through weight), where $f_{\mathcal{N}(0,1)}$ is the probability density function of the standard normal distribution. This conditioning can be related to Bayes' theorem: the random variables defined in a program define a prior distribution (e.g., the upper part of Fig. 2b), the uses of the weight and observe primitives a likelihood function, and the inference algorithm of the PPL infers the posterior distribution (e.g., the lower part of Fig. 2b).

CorePPL includes sequences, recursive variants, records, and pattern matching, standard in functional languages. For example, [1, 2, 3] defines a sequence of length 3, {a = false, b = 1.2} a record with labels a and b, and Leaf {age = 1.0} a variant with the constructor name Leaf, containing a record with the label age. The match construct allows pattern matching. For example, match a with Leaf {age = f} then f else 0.0 checks if a is a Leaf and returns its age if so, or 0.0 otherwise. Here, f is a pattern variable that is bound to the value of the age element of a in the then branch of the match.

The data types and pattern matching features in Miking, and consequently CorePPL, are not directly related to the paper's key contributions. Therefore, we do not discuss them further. However, the CorePPL compiler in Section 4.3 supports the features, and the CorePPL models in Section 5 make frequent use of them. We consider CorePPL again in Section 4 when compiling to PCFGs.

### 3 PPL Control-Flow Graphs and RootPPL

This section introduces the new PCFG concept (Section 3.1) and shows how to apply SMC over PCFGs (Section 3.2). Finally, we present the PCFG- and SMC-based RootPPL framework (Section 3.3).

#### 3.1 PPL Control-Flow Graphs

In order to handle checkpoints efficiently without CPS or non-preemptive multitasking, we introduce PPL control-flow graphs (PCFGs). In contrast to traditional PPLs, where checkpoints are most often implicit, we make them explicit and central in the PCFG framework. The main benefit of this approach is that the handling of checkpoints in inference algorithms is greatly simplified, which allows for implementing the framework in low-level languages. However, the explicit checkpoint approach makes PCFGs relatively low-level, and they are mainly intended as a target when compiling from high-level PPLs. We introduce such a compiler in Section 4.

Formally, we define a PCFG as a 6-tuple $(B, S, \mathit{sim}, b_0, b_{\mathit{stop}}, L)$. The first component B is a set of basic blocks, inspired by the basic blocks used as part of control-flow analysis in traditional compilers [8]. In practice, the blocks in B are pieces of code that together make up a complete probabilistic program. Unlike basic blocks used in traditional compilers, we allow these pieces of code to contain branches internally. The second component S is a set of states, representing collections of information that flow between basic blocks. In practice, this state often contains local variables that live between blocks and an accumulated likelihood. The blocks and states form the domain of the function $\mathit{sim} : B \times S \to B \times S \times \{\mathtt{false}, \mathtt{true}\}$. This function performs computation specific to the given block over the given state and outputs a successor block indicating what to execute next, an updated state, and a Boolean indicating whether or not there is a checkpoint at the end of the executed block.

Fig. 3: A PCFG illustration. Part (a) shows an example PCFG. The arrows denote the possible flows of control between the blocks, with regular arrows denoting checkpoint transitions and arrows with open tips non-checkpoint transitions. Part (b) shows a possible execution sequence with sim for (a).

To illustrate this formalization, consider the PCFG in Fig. 3a, for which $B = \{b_0, b_1, \ldots, b_4, b_{\mathit{stop}}\}$. The block $b_0$ is present in every PCFG and represents its entry point. Similarly, the block $b_{\mathit{stop}}$ is a unique block indicating termination, which must be reachable from all other blocks. For some initial state $s_0 \in S$, Fig. 3b illustrates a possible execution sequence starting at $b_0$ in Fig. 3a before terminating at $b_{\mathit{stop}}$. The structure of a PCFG restricts checkpoints to occur only at the end of basic blocks and confines communication between blocks to the state. These restrictions greatly simplify inference algorithm implementations. More precisely, rather than relying on CPS or non-preemptive multitasking, the inference algorithm can simply run a block b with sim, handle the checkpoint, and then run the successor block indicated by the output of sim.
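As a minimal illustration of the formalization (not of the RootPPL implementation), blocks can be modelled in Python as functions from states to (successor, state, checkpoint flag) triples:

```python
from typing import Callable, Dict, Tuple

# A state: mutable variables, typically including an accumulated likelihood.
State = Dict[str, float]
# A block is sim restricted to itself:
# State -> (successor block id, updated state, checkpoint?).
Block = Callable[[State], Tuple[str, State, bool]]

def run(blocks: Dict[str, Block], s: State) -> State:
    """Run one execution from b0 to bstop, ignoring checkpoints
    (i.e., pure control flow, no inference)."""
    b = "b0"
    while b != "bstop":
        b, s, _checkpoint = blocks[b](s)
    return s
```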

#### 3.2 SMC and PCFGs

To prepare for introducing RootPPL in Section 3.3, we present how to apply SMC inference to PCFGs. The work by Naesseth et al. [33] contains a more general and pedagogical introduction to SMC. At a high level, SMC inference works by simulating many instances—known as particles in SMC literature—of a PCFG program concurrently, occasionally resampling the different particles based on their current likelihoods. In CorePPL, for example, such likelihoods are determined by weight and observe. Resampling allows the downstream simulation to focus on particles with a higher likelihood.

In order to apply SMC inference over PCFGs, we need some way of determining the likelihood of the SMC particles. For this, we use the final component of the PCFG definition, $L : S \to \mathbb{R}_{\geq 0}$, which is a function mapping states to a likelihood (a non-negative real number). Concretely, this likelihood is most often stored directly in the state as a real number, and L simply extracts it.

Algorithm 1 defines an SMC algorithm over PCFGs. It takes a PCFG as input, together with a set of N states $\{s_n\}_{n=1}^{N}$, which represent the SMC particles. Step 1 in the algorithm sets up variables $a_n$ and $c_n$, indicating for each particle its current block and whether or not a checkpoint has occurred in it. Step 2 simulates all particles that have not yet reached a checkpoint using sim. This step repeats until all particles have reached a checkpoint (this is a synchronization point for parallel implementations). Step 3 uses the likelihood function L to compute the relative likelihoods of all particles and then resamples them based on this. That is, we sample N particles from the existing N particles (with replacement) based on the relative likelihoods. After resampling, we return to step 2. If all particles have reached the termination block $b_{\mathit{stop}}$, the algorithm terminates and returns the current states.

Note in Algorithm 1 that the input states are not required to be identical. For example, each state should have a unique seed used to generate random numbers (e.g., with assume in CorePPL). Non-identical initial states in Algorithm 1 imply that different particles may traverse the blocks in B differently and reach checkpoints at different times. Although this means that different particles can be at different blocks concurrently, the SMC algorithm is still correct [24]. This PCFG property is essential as it allows for the encoding of universal probabilistic programs in PCFG-based PPLs. Furthermore, it implies that some particles may reach $b_{\mathit{stop}}$ earlier than others. To solve this, we require in Algorithm 1 that $\mathit{sim}(b_{\mathit{stop}}, s) = (b_{\mathit{stop}}, s, \mathtt{true})$ holds for all states s. That is, particles that have finished also participate in resampling and cannot cause step 2 to loop infinitely.
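A compact Python rendering of Algorithm 1 follows (a sketch with multinomial resampling for brevity; RootPPL itself uses systematic resampling, see Section 3.3):

```python
import copy
import random

def smc(blocks, states, L):
    """SMC over a PCFG (cf. Algorithm 1). blocks maps block ids to sim-style
    functions State -> (next id, state, checkpoint?), and must satisfy
    blocks["bstop"](s) == ("bstop", s, True). L maps a state to its
    non-negative likelihood."""
    at = ["b0"] * len(states)                    # step 1: current blocks
    while not all(b == "bstop" for b in at):
        for n in range(len(states)):             # step 2: run to checkpoint
            checkpoint = False
            while not checkpoint:
                at[n], states[n], checkpoint = blocks[at[n]](states[n])
        weights = [L(s) for s in states]         # step 3: resample
        idx = random.choices(range(len(states)), weights=weights,
                             k=len(states))
        states = [copy.deepcopy(states[i]) for i in idx]
        at = [at[i] for i in idx]
    return states
```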

Next, we describe our implementation of PCFGs with SMC: RootPPL.

#### 3.3 RootPPL

We make use of the PCFG framework when implementing RootPPL: a new low-level PPL framework built on top of CUDA C++ and C++, intended for highly optimized and massively parallel SMC inference on general-purpose GPUs. RootPPL consists of two major components: a macro-based C++ PPL for encoding probabilistic models and an SMC inference engine.

Fig. 4: Part (a) illustrates a RootPPL program encoding the state-space model in (2). The text provides details. We set NEXT at line 4 rather than in iter as an optimization. Part (b) defines the RootPPL program state type progState_t.

The macro-based language has two purposes: to support compiling the same program to either CPU or GPU and to simplify the encoding of models for programmers. As a result, the macros hide all hardware details from the programmer. To illustrate this macro-based PPL, consider the example RootPPL program in Fig. 4a. This program encodes a simple state-space model for an object moving along an axis in $\mathbb{R}$, given by

$$X_0 \sim \mathcal{N}(0, 100), \quad X_t \sim \mathcal{N}(x_{t-1} + 2, 1), \quad Y_t \sim \mathcal{N}(x_t, 5), \quad 1 \le t \le T. \tag{2}$$

Here, $X_0$ is the initial position, $X_t$ the following positions, and $Y_t$ a set of noisy observations of the object's position. The inference goal is to determine the distribution of $X_T$ (the final position of the object) conditioned on all $Y_t$.

Fig. 4a implements (2) with two basic blocks, introduced with the BBLOCK macro in RootPPL. The first block init draws $X_0$ using the SAMPLE macro (equivalent to assume in CorePPL) on line 2 and stores the drawn value in the program state variable x through the PSTATE macro. This program state is the RootPPL instantiation of the PCFG state introduced in Section 3.1. Another program state variable, t (corresponding to the index t in the model), is initialized on line 3. As preparation for iterating over the iter block, we set the NEXT construct to iter at line 4. Finally, the block exits by making a direct non-checkpoint transition to iter using the BBLOCK_JUMP macro at line 5.

In iter, we sample $X_1$ at line 9 and write the result to x (overwriting the previous $X_0$, which is no longer needed). Line 10 updates the likelihood using the OBSERVE macro (equivalent to observe in CorePPL), corresponding to observing $Y_1$ in the model. We access all $Y_t$ through the data array, a shared global constant, avoiding memory duplication in the program state. Finally, at line 11, we check if we are at time T (a shared global constant for T). If this is the case, NEXT is set to NULL, indicating termination. This is equivalent to moving to $b_{\mathit{stop}}$ in the PCFG formalization. Otherwise, NEXT keeps its value set at line 4, and we jump to the beginning of the iter block. Not using BBLOCK_JUMP allows iter to return to the inference engine between iterations, indicating checkpoint transitions. In RootPPL, this means that SMC inference will resample the particles before returning to iter for the next iteration.
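To connect back to Section 3.1, the same model can be phrased as PCFG blocks in Python. This is a sketch, not RootPPL: the observations in data are made up, logpdf is a hypothetical helper, and we read the second parameter of $\mathcal{N}$ in (2) as a variance:

```python
import math
import random

data = [2.1, 4.3, 5.9]   # hypothetical observations y_1..y_T
T = len(data)

def logpdf(y, mu, var):
    """Log-density of N(mu, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def init(s):
    s["log_weight"] = 0.0
    s["x"] = random.gauss(0, math.sqrt(100))    # X_0 ~ N(0, 100)
    s["t"] = 0
    return "iter", s, False                     # non-checkpoint jump to iter

def iter_(s):
    s["t"] += 1
    s["x"] = random.gauss(s["x"] + 2, 1)        # X_t ~ N(x_{t-1} + 2, 1)
    s["log_weight"] += logpdf(data[s["t"] - 1], s["x"], 5)  # observe Y_t
    done = s["t"] == T
    return ("bstop" if done else "iter"), s, True  # checkpoint: resample

blocks = {"b0": init, "iter": iter_,
          "bstop": lambda s: ("bstop", s, True)}
```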

The programmer defines the RootPPL program state for each RootPPL program as an arbitrary C++ struct type and passes this type (e.g., progState_t in Fig. 4a) to each basic block. The PSTATE macro accesses the variables in the struct. Fig. 4b illustrates the program state for the example program in Fig. 4a. As described in Section 3.1, this program state is the only possible means to pass data from one basic block to another in RootPPL.

This minimal example does not illustrate all RootPPL language features (e.g., WEIGHT). Further details on the RootPPL language are available on GitHub [4].

The second part of the RootPPL framework is the SMC inference engine. It is crucial to take advantage of the highly parallel nature of SMC and available hardware for parallelization to achieve high performance. For this purpose, RootPPL supports compilation to C++ on a single core, C++ on multiple cores through OpenMP [3], or CUDA C++ [1] with massive parallelism on the GPU.

We present the main inference loop in RootPPL below (cf. Algorithm 1).

1. Initialize the random seeds for the particles.
2. Execute all particles up to their next checkpoints.
3. Check whether all particles have terminated; if so, stop.
4. Resample the particles and go to step 2.
The random seeds in step 1 are initialized differently depending on the compile target. For plain C++ on a single core, one seed is shared between all particles because they are executed sequentially. However, for OpenMP and CUDA, the parallel execution requires that we assign each thread a unique seed shared between all particles running on it. For CUDA, these seeds are placed in thread-local CUDA memory for each particle to minimize memory overhead when using SAMPLE (which is performance-critical). In addition, when compiling to CUDA, we initialize the seeds in parallel using a CUDA compute kernel.

Step 2 executes the particles sequentially, in parallel using OpenMP threads, or in parallel using a CUDA compute kernel. Step 3 then performs a termination check. First, we check if the first particle has terminated. If it has not terminated, we directly move to the resampling step. If it has terminated, we iteratively check other particles to either find a particle that has not terminated or conclude that all particles have terminated and stop the inference. This approach both allows for particles terminating at different times and introduces minimal overhead for the case when all particles terminate simultaneously (which is quite common). When all particles terminate simultaneously, it is enough to check the first particle in all iterations of step 3 except the last.

The resampling step is the most difficult one to parallelize efficiently. The reason is the normalizing sum (e.g., $\sum_{i=1}^{N} L(s_i)$ in Algorithm 1) that we must compute in order to determine resampling probabilities. We use systematic resampling for single-core and OpenMP, and parallel systematic resampling for CUDA, as described in Murray et al. [31] (we do not use in-place propagation). We compute the normalizing sum in parallel via the Thrust library [7] for CUDA.
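For reference, here is a sequential sketch of systematic resampling; the parallel CUDA variant follows Murray et al. [31]:

```python
import random

def systematic_resample(weights):
    """One uniform draw places N evenly spaced pointers over the
    cumulative weights; returns the selected particle indices."""
    n = len(weights)
    total = sum(weights)                 # the normalizing sum
    u = random.uniform(0, total / n)
    indices, cumulative, i = [], weights[0], 0
    for _ in range(n):
        while cumulative < u:            # advance to the pointed particle
            i += 1
            cumulative += weights[i]
        indices.append(i)
        u += total / n                   # next evenly spaced pointer
    return indices
```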

Another important consideration for the inference engine is memory allocation. In particular, the memory allocated for NEXT, the likelihood, and the PSTATE for each particle, is laid out as separate arrays in memory, rather than one big array of structs. This approach, known as memory coalescing, avoids strided memory accesses in global memory and is preferred for parallel operations, particularly for CUDA. Another memory consideration is particle duplication during resampling. For this, we use a custom aligned memory transfer in CUDA because the standard memcpy implementation in CUDA proved to be a bottleneck. With a single core and OpenMP, memcpy runs without issue. Additionally, we perform a specific optimization when copying the program state used in the CorePPL compiler. This program state consists of a possibly large stack (with user-definable size) together with a stack pointer, and we ensure not to copy the unused part of the stack located beyond the stack pointer. This is a critical optimization for the CorePPL compiler.

Other things supported in RootPPL are the estimation of normalizing constants for encoded models and adaptive resampling based on the current effective sample size (ESS). These are standard concepts in SMC inference. For more details, see, e.g., Naesseth et al. [33].

Next, we use RootPPL as the target language for the CorePPL compiler.

### 4 Compiling to PCFGs

This section introduces the ideas for compiling high-level universal PPLs to PCFGs. We present the key transformation—function decomposition into basic blocks—using a toy example (Section 4.1), a formal algorithm (Section 4.2), a high-level overview of the CorePPL-to-RootPPL compiler (Section 4.3), and the compiler's strengths and limitations (Section 4.4).

#### 4.1 Function Decomposition Example

The major challenge when compiling high-level PPLs is implementing pausing and resuming at checkpoints to yield control to an inference algorithm temporarily. Pausing and resuming in low-level languages is especially difficult due to runtime limitations. We solve this problem by compiling to the PCFGs introduced in Section 3, specifically designed for implementation in low-level target languages. A challenge with this approach is that checkpoints can occur at arbitrary locations in high-level probabilistic programs, whereas in PCFGs, checkpoints must always occur at tail position in basic blocks. We solve this by decomposing functions in the source language into a set of basic blocks. Our approach is similar to how functions are decomposed into basic blocks in standard compilers such as GCC [2] and LLVM [6] (see, e.g., Aho et al. [8]). The difference is that we only decompose as needed, based on where checkpoints occur. In particular, we do not decompose functions, and parts of functions, in which checkpoints are guaranteed not to occur. This allows for more optimizations by the underlying compiler (e.g., NVCC or GCC for RootPPL).

Fig. 5: Compilation of a CorePPL program (a) to a RootPPL PCFG (c). Part (b) illustrates an intermediate ANF representation of (a) and also indicates the parts of the program corresponding to the blocks in (c). We provide further details in the text.

(c) Compiled RootPPL PCFG illustration. Some RootPPL constructs are omitted or slightly modified for readability. In particular, we omit the BBLOCK construct used in Fig. 4a. Instead, we illustrate the blocks as nodes in a graph, numbered by indices. The arrows indicate control flow between the blocks, with the incoming arrow to block 1 representing the call to f and the outgoing arrow from block 4 representing the return from f.

Consider the toy CorePPL function in Fig. 5a and the resulting compilation to a RootPPL PCFG in Fig. 5c. For this example, we introduce an explicit SMC checkpoint resample in CorePPL, indicating where SMC should pause executions in order to resample. The resample construct is the sole checkpoint considered in this example (and in the CorePPL compiler), but the method applies generally to arbitrary checkpoints. Optimally, the resample construct should be inserted automatically by the compiler [25]. However, we do not consider this problem in this paper and assume resamples are inserted prior to compilation. The first step in the decomposition is to translate the program into A-normal form (ANF) [15], illustrated in Fig. 5b. ANF is commonly used in compilers and ensures that non-trivial expressions (e.g., function applications and checkpoints) are always name-bound. For CorePPL, ANF guarantees that the body of each let expression, or expression in tail position, is trivial, contains at most one function application, or is an if expression with a trivial condition, resulting in simplified decomposition. We will use the program in Fig. 5b as the target for decomposition in the following. Note that variables introduced by ANF start with a t in Fig. 5b, while the original variables from Fig. 5a start with an s.

The goal of the decomposition is to ensure that we immediately return control to the inference engine at checkpoints. In the PCFG framework, the only way to fulfill this is to ensure that checkpoints occur at tail position in basic blocks. First, consider the resample checkpoint at line 4 in Fig. 5b, causing a split into blocks 1 and 2 in the compiled RootPPL PCFG in Fig. 5c. Note that in block 1, NEXT is set to 2 at line 7 before returning, indicating that the inference engine should resume execution at block 2 after handling the checkpoint, as also illustrated by a closed arrow. Note the stack frame pointer sf in block 1 for this invocation of f, which points to a location in an explicit call stack in the RootPPL program state PSTATE. We require such a call stack due to compiling to PCFGs: any data that lives between basic blocks, such as s1, must be put in the program state. We define the stack frame pointer sf equivalently at the top of all blocks for the decomposed function f in Fig. 5c but replace the definition with . . . in blocks other than the first for brevity.

It is not sufficient to split into blocks at explicit checkpoints. Consider, for example, the recursive call to f in the else branch on line 12 in Fig. 5b. During this function call, we encounter at least one resample, resulting in at least one block split within the function, meaning that all data required by f must be put in an explicit stack frame and stored in the program state. If not, we lose the data between the basic blocks of f. In particular, the block return address ra is stored in the stack frame, indicating which block to return to at the end of the function call. In the case of the call to f at line 12 in Fig. 5b, we must return to line 13. Therefore, we must place line 13 at the beginning of a basic block in Fig. 5c (block 3). In general, we must place all calls to decomposed functions (i.e., functions that may, directly or indirectly, encounter a checkpoint) at tail position in basic blocks. Besides line 13 in Fig. 5b, this also means that line 15 in Fig. 5b cannot be part of block 2. It cannot be part of block 3 either because it may be executed independently of line 13 in Fig. 5b if we take the else branch of the if at line 9 in Fig. 5b. Consequently, we must put it in a separate block (block 4 in Fig. 5c). The decomposition of function applications and if expressions is similar to how standard compilers decompose machine instructions into basic

blocks (sequences of instructions without any internal jumps or branches) [8]. The difference, however, is that we do not split into blocks at all if expressions and function calls. For example, the if at line 6 in Fig. 5b is guaranteed not to include a checkpoint and can be left untouched (lines 4–5 in Fig. 5c). Similarly, the call to geqf at line 5 in Fig. 5b is guaranteed not to encounter any checkpoints. Conservatively determining which functions are guaranteed not to encounter any checkpoints can be done through static analysis. Such a static analysis phase is part of the CorePPL compiler, described in Section 4.3.

We now take a closer look at the call stack handling in Fig. 5c. The following description is specific for RootPPL, but similar solutions must be applied if compiling to other target languages utilizing PCFGs. First, the program state PSTATE consists of a byte array stack and a pointer to the top of this stack named stackPtr. We increment and decrement this stack pointer when stack frames are added and removed, respectively, at function calls and returns. The type STACK_f represents the stack frame for the function f (such a stack frame type must be determined and set up for each function we decompose) and contains its block return address ra, its parameter p (functions with multiple parameters have one entry for each parameter), and an address retValLoc at which we write its return value. Additionally, it contains the local variables s1, s3, and s4 that travel across the blocks in f. Note, however, that local variables used only within a single block do not need to go in the stack frame (e.g., t1 and s2), and the underlying target language (e.g., CUDA for RootPPL) can instead handle them directly. Lines 13–24 in block 2 in Fig. 5c illustrate the recursive call to f at line 12 in Fig. 5b. Here, we allocate a new complete stack frame callsf and initialize ra, p, and retValLoc. Allocating the complete stack frame prior to the function call is different from most standard compilers, which most often allocate the part of the stack frame containing local variables at the start of the called function. This strategy allows for making the allocation size dependent on, e.g., function arguments. Here, we instead know all stack frame sizes at compile time. After setting up the stack frame, we increment the stack pointer at lines 21–23 and pass control to the recursive invocation of f by using BBLOCK_JUMP at line 24. Inversely, we illustrate function return in block 4 on lines 3–7. First, we set the return value, and second, we decrement the stack pointer. Finally, we retrieve the return block from the stack frame and pass control to this block at line 7.

#### 4.2 Function Decomposition Algorithm

We now turn to a formal description of the decomposition algorithm. To avoid going into specifics of the underlying target language, and in particular the call stack handling, we take an abstract view of function bodies and regard them as lists of statements of the form

$$\texttt{stmt} \;::=\; \texttt{checkpoint} \;\mid\; \texttt{call} \;\mid\; \texttt{if}\ [\texttt{stmt}]\ [\texttt{stmt}] \;\mid\; \texttt{other}. \tag{3}$$

Here, the [stmt] syntax indicates a list of stmts. Thus, the if construct inductively contains two lists of stmts—one for each branch.
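For readers who prefer code to grammars, the stmt representation (3) could be realized as a tagged union. The following is a hedged C sketch with illustrative type and field names, not the compiler's actual data type:

```
#include <stddef.h>

typedef struct Stmt Stmt;

typedef struct {
    Stmt  *items;   /* a [stmt]: an array of statements */
    size_t len;
} StmtList;

struct Stmt {
    enum { S_CHECKPOINT, S_CALL, S_IF, S_OTHER } tag;
    StmtList thn, els;   /* the two branch bodies; used only when tag == S_IF */
};
```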

Fig. 6: Illustrating Algorithm 2 on the example from Fig. 5: (a) the body of f in Fig. 5b viewed as a [stmt]; (b) decomposition of (a) into [tstmt] basic blocks.

We illustrate the representation stmt through an example. Consider the program in Fig. 5b and its mapping to stmts in Fig. 6a. Due to ANF, we can view the body of f as a sequence of let bindings and operations separated by ;, each performing a single operation of some kind (e.g., a checkpoint or a function application). We map each such operation to a stmt in Fig. 6a. The resample checkpoint at line 4 in Fig. 5b maps to a checkpoint at line 3 in Fig. 6a, and the application of f at line 12 maps to a call at line 11. However, other applications, such as geqf and leqf, are guaranteed not to encounter any checkpoints. Therefore, they map to others, and not calls. The three ifs at lines 6, 9, and 12 map to ifs. Note that we always lift the if conditions in Fig. 5b to a separate let as a result of ANF, and they are therefore not part of the if representation in stmt. We map all remaining operations to others.

While the illustration above only shows how to map a CorePPL function body to stmts, the representation is general. For example, in the CorePPL compiler (Section 4.3), the decomposition is performed after translation to C, and not at the CorePPL stage. The reason is that there are no basic blocks in CorePPL. It is, therefore, more natural to perform this translation closer to RootPPL.

We now turn to the full decomposition algorithm over lists of stmts, given in Algorithm 2.

Algorithm 2: A functional-style algorithm for function decomposition into basic blocks. We denote tuples with comma-separated expressions within parentheses and sequences with comma-separated items within square brackets. We denote type annotation with the : character, the cons operator with the :: characters, and sequence concatenation with ++. The non-pure function newIndex returns a unique number from N at every call.

```
1 function decompose srcs: [stmt] → (N → [tstmt]) =
 2 let (block, blocks, _) = rec ([], ∅, return) srcs in
 3 blocks ∪ (newIndex (), block)
 4
 5 function initNext next: next+ → next =
 6 match next with none → newIndex () | _ → next
 7
 8 function rec (block, blocks, next) srcs: acc → [stmt] → acc =
 9 match srcs with
10 | [] → match next with
11 | none → (block, blocks, next)
12 | n | return → (block ++ [jump next], blocks, next)
13 | src :: srcs → match src with
14 | checkpoint | call → match srcs with
15 | [] →
16 let next = initNext next in
17 (block ++ [src next], blocks, next)
18 | _ →
19 let index = newIndex () in
20 let block = block ++ [src index] in
21 let (nextBlock, blocks, next) = rec ([], blocks, initNext next) srcs in
22 (block, blocks ∪ (index, nextBlock), next)
23 | other → rec (block ++ [other], blocks, next) srcs
24 | if thn els → match srcs with
25 | [] →
26 let (thn, thnBlocks, thnNext) = rec ([], blocks, next) thn in
27 let (els, elsBlocks, elsNext) = rec ([], thnBlocks, thnNext) els in
28 let thn = if next ≠ elsNext ∧ thnNext = none
29 then thn ++ [jump elsNext] else thn in
30 (block ++ [if thn els], elsBlocks, elsNext)
31 | _ →
32 let (thn, thnBlocks, thnNext) = rec ([], blocks, none) thn in
33 let (els, elsBlocks, elsNext) = rec ([], thnBlocks, thnNext) els in
34 if elsNext = none then rec (block ++ [if thn els], elsBlocks, next) srcs
35 else
36 let thn = if thnNext = none then thn ++ [jump elsNext] else thn in
37 let (nextBlock, blocks, next) =
38 rec ([], elsBlocks, initNext next) srcs in
39 (block ++ [if thn els], blocks ∪ (elsNext, nextBlock), next)
```
The target language representation is a small extension of stmt, adding transitions between N-indexed basic blocks. It is given by

$$\begin{aligned} \texttt{tstmt} \;::=\;& \texttt{checkpoint}\ \texttt{next} \;\mid\; \texttt{call}\ \texttt{next}\\ \mid\;& \texttt{if}\ [\texttt{tstmt}]\ [\texttt{tstmt}] \;\mid\; \texttt{jump}\ \texttt{next} \;\mid\; \texttt{other}. \end{aligned} \tag{4}$$

In particular, we annotate checkpoints and calls with the type next, given by next ::= return | n, where n ∈ N. For checkpoints, the next indicates which block to jump to after handling the checkpoint, and for calls, it indicates the block to return to (e.g., the value set for ra in Fig. 5c) at the end of the function invocation. We also include a jump in tstmt for directly jumping to another block (corresponding to BBLOCK\_JUMP in Fig. 5c). The return case of next indicates that the return address gives the next block for the current function call. For example, BBLOCK\_JUMP(sf->ra) is equivalent to jump return.

Fig. 6b shows the result of applying Algorithm 2 on the [stmt] in Fig. 6a. Note that the block structure in Fig. 6b mirrors that of Fig. 5c. The entry point in Algorithm 2 is the function decompose, which accepts a [stmt] as input and produces a map from indices to [tstmt] as output (e.g., Fig. 6b). The core of Algorithm 2 is the function rec, which recursively constructs the basic blocks. It is called from decompose and makes use of the function initNext. The accumulator is the triple (block, blocks, next) of type acc ::= [tstmt] × (N → [tstmt]) × next⁺, where block is the current block being constructed, blocks are all blocks constructed so far, and next indicates the action to take at tail position in the current block. The type next⁺ is defined as next⁺ ::= next | none. When reaching the end of a block, a value none for next means do nothing, a value return indicates that the next block is the return block for the current function invocation, and a natural number n means that the next block has index n.

We now walk through the translation of Fig. 6a to Fig. 6b. We set the accumulator to ([], ∅, return) at line 2 in Algorithm 2 just before the initial call to rec, indicating that the current block is empty, that we have accumulated no complete blocks so far, and that we must use the return block address when reaching the end of the current block. In the first call to rec, the other at line 2 in Fig. 6a triggers the case at line 23 in Algorithm 2, which accumulates the other in the current block. Next, the checkpoint triggers the case at line 14, followed by line 18, since the checkpoint is not at tail position. At line 19, we create a new index for the following block. We then close the current block by tagging the checkpoint with the new index, resulting in block 1 in Fig. 6b. Next, we recursively create the block following the checkpoint at line 21. Finally, we add the recursively created block with the new index to the map of complete blocks (now also populated by the recursive call) and return the updated accumulator triple at line 22.

The complex part of Algorithm 2 involves the handling of ifs. In particular, we must handle cases where there are block splits within the branches with care. In our example, the first if at line 5 in Fig. 6a triggers the case at line 31 since it is not in tail position. To determine whether or not there is at least one split within the branches, we set next to none for the call on line 32. If a block is split during this call, initNext will be applied on next, and thnNext at line 32 will be a natural number, indicating where the branch jumped to (either through a jump, checkpoint, or call) at tail position. However, if there is no split in the branch, the resulting thnNext remains none. There is no split in the first branch of the if at line 5 in Fig. 6a, and none is passed to the recursive call at line 33 as well. Again, there is no split in the second branch, triggering the then case at line 34, and we accumulate the if in the same way as an other.

Fig. 7: The main components of the CorePPL-to-RootPPL compiler. Grey blocks are programs, and blue blocks are transformations or analyses.

The ifs at lines 7 and 9 in Fig. 6a do contain a split due to the call at line 11, resulting in blocks 2, 3, and 4, shown in Fig. 6b. The elsNext is a natural number for these ifs, and the else case at line 35 is triggered. Here, we must take particular care if there is only a split in the second branch of the if and not the first. In that case, thnNext is none, and unlike the second branch, we do not add a block jump to the end of this branch in the call at line 32. Therefore, we must instead add it at line 36. We add the jump at line 11 in block 2 in Fig. 6b in this way. Note that we do not require an equivalent step to the above for the second branch if the split is only in the first branch, since we pass the next from the first branch to the recursive call for the second branch. After handling the if itself, we recursively create the new block following the if at lines 37–38 (note that we pass the next given as argument to rec here, and use initNext on it to indicate a split has occurred), and give it the index elsNext at line 39.

The case where if is at tail position, at line 25, is handled similarly to the case at line 31. The difference is that we do not pass none to the first branch since there is nothing following the if which we can jump to. Instead, we directly pass the current next to the first call at line 26.

In the blocks resulting from Algorithm 2, call and checkpoint occur only in tail position by construction. As discussed in Section 4.1, this is precisely the required property when compiling to PCFGs.

#### 4.3 CorePPL-to-RootPPL Compiler

Fig. 7 gives an overview of the CorePPL-to-RootPPL compiler components. Besides the techniques described previously, an integral part of the compiler is the C translation step, which translates many of the CorePPL language features to C, including data type definitions and pattern matching. More precisely, CorePPL records and variants are translated to C structs and tagged unions, respectively, while pattern matching is compiled to C if statements.
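As an illustration of this translation scheme, a hypothetical CorePPL variant with two constructors could be compiled along the following lines; the type and field names are invented for the example and are not taken from the compiler:

```
/* A variant with constructors Leaf(value) and Node(left, right),
   compiled to a tagged union. */
typedef struct Tree Tree;
struct Tree {
    enum { LEAF, NODE } tag;
    union {
        struct { double value; } leaf;
        struct { Tree *left, *right; } node;
    } data;
};

/* Pattern matching compiled to C if statements:
   match t with Leaf v -> v | Node l r -> ... */
double leftmost(const Tree *t) {
    if (t->tag == LEAF)
        return t->data.leaf.value;
    else
        return leftmost(t->data.node.left);
}
```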

A simple static analysis phase discovering functions that are guaranteed not to encounter any resamples is also part of the compiler. It iterates through all functions and marks a function as containing a resample if it either directly contains a resample or calls another function containing a resample. We do not need to decompose resample-free functions, and their invocations can be handled directly by the C++ or CUDA compiler (and we do not need to set up an explicit stack frame). An example of such a function invocation is the call geqf s1 1. at line 5 in Fig. 5b. We disallow passing functions as arguments to other functions, as this complicates the analysis. A solution that would allow passing functions as arguments is to use static analysis techniques such as 0-CFA [35] instead.
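This analysis can be read as a simple fixed-point computation over the first-order call graph. The following sketch assumes an adjacency-matrix call graph and is only meant to convey the iteration, not the compiler's actual code:

```
#define MAX_FUNS 64

/* Inputs: which functions contain resample directly, and who calls whom. */
int directResample[MAX_FUNS];
int calls[MAX_FUNS][MAX_FUNS];   /* calls[f][g] != 0 iff f calls g */

/* Output: marked[f] != 0 iff f may (directly or indirectly) resample. */
int marked[MAX_FUNS];

void markResamplers(int nfuns) {
    for (int f = 0; f < nfuns; f++)
        marked[f] = directResample[f];
    for (int changed = 1; changed; ) {       /* iterate to a fixed point */
        changed = 0;
        for (int f = 0; f < nfuns; f++)
            for (int g = 0; g < nfuns; g++)
                if (!marked[f] && calls[f][g] && marked[g])
                    marked[f] = changed = 1;
    }
}
```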

The code generation stage in Fig. 7 adds RootPPL boilerplate code and emits a complete RootPPL program that is provided as input to a C++ or CUDA compiler together with the RootPPL inference engine (see Fig. 1). The CorePPL compiler implementation is hosted at GitHub [4] and consists of approximately 3000 lines of code (a contribution of this paper). Note that the ANF, static analysis, and C translation steps are quite standard, with no new contributions.

An important detail concerning memory allocation in the compiler is the translation between relative and absolute addresses. Fig. 5c illustrates this translation. On line 3 in block 4, we convert the retValLoc relative pointer to an absolute pointer prior to dereferencing, and at lines 18–20 in block 2, the address of s4 is translated to a relative address with respect to the start of the stack before being assigned to retValLoc. This translation is needed because, at checkpoints in RootPPL, resampling copies and moves SMC executions in memory. Therefore, we cannot use absolute addresses to refer to data on the PSTATE stack and must instead use addresses relative to the start of the stack.
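The translation itself amounts to pointer arithmetic against the stack base. A minimal sketch follows; the function names are assumed for illustration and are not from RootPPL:

```
#include <stdint.h>
#include <string.h>

/* Convert a stack-relative offset to an absolute pointer, e.g. before
   dereferencing retValLoc (line 3 in block 4 of Fig. 5c). */
void *toAbsolute(uint8_t *stackBase, size_t rel) {
    return stackBase + rel;
}

/* Convert an absolute pointer into the stack to a relative offset, e.g.
   when storing the address of s4 into retValLoc (block 2 of Fig. 5c). */
size_t toRelative(uint8_t *stackBase, void *absPtr) {
    return (size_t)((uint8_t *)absPtr - stackBase);
}
```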

#### 4.4 Compiler Strengths and Limitations

The main strength of the CorePPL compiler, compared to using other PPL compilers and tools, is the execution time of the compiled programs. In particular, the compilation from a universal PPL to CUDA is the first of its kind and allows for utilizing GPUs for massively parallel SMC inference.

The compiler does, however, have some limitations. Most importantly, the lack of standard garbage collectors in C++ and CUDA leads to restrictions for automatic data allocation. Currently, we support only stack-based allocation, which means that CorePPL programs that allocate and return dynamically sized data structures (e.g., trees or linked lists) from functions are not supported. Consequently, the current compiler cannot handle probabilistic programs encoding distributions over such data structures (e.g., phylogenetic trees)—the distribution must be over fixed-size data types. However, as the evaluation in Section 5 suggests, practically significant universal probabilistic programs over fixed-size data types are plentiful. In general, the compiler supports universal CorePPL programs including both stochastic branching and an unbounded number of (stack-allocated) random variables. Automatic heap-based data allocation is a general challenge when compiling to GPUs and not specific to our approach. Exploring the use of garbage collectors or other means for automatic memory management on GPUs is an interesting direction for future research.

The compiler also lacks support for some features, for which we foresee no substantial technical challenges in implementing in the near future. In particular, the compiler does not support first-class distributions—we restrict distributions to occur immediately at assumes (e.g., the Bernoulli distribution in assume (Bernoulli p) in Fig. 2a). Another possible feature is limited support for nested and higher-order functions.

### 5 Evaluation

This section evaluates RootPPL and the CorePPL-to-RootPPL compiler. The source code for all experiments is publicly available [26]. We compare RootPPL and CorePPL to state-of-the-art SMC PPL implementations on two models: a constant rate birth-death (CRBD) model from evolutionary biology (Sections 5.1 and 5.3) and a vector-borne disease model from epidemiology (Section 5.2). Previous work shows that SMC handles these models particularly well [36,28], and they are therefore good candidates for this evaluation. Comparison with other types of inference algorithms is a challenging problem and beyond the scope of this paper. For example, comparing SMC with variational inference (VI) is challenging as VI is approximate and SMC is asymptotically exact.

In addition to CorePPL (compiled to RootPPL) and RootPPL (hand-tuned), we implement the models above in a set of state-of-the-art PPLs with SMC inference: Birch [32], WebPPL [20], and Pyro [10]. For each PPL, we implement the two models as efficiently as possible, given the available language features. We compile RootPPL with GCC 7.5.0 for single-core and multicore and with CUDA 11.4 for GPU. We compile Birch 1.634 with GCC 7.5.0. We use WebPPL 0.9.15 with Node.js 14.17.6. We use Pyro 1.7.0 with PyTorch 1.9.0 and CUDA 10.2. Additionally, we use Numba 0.54.0—a just-in-time (JIT) compiler for Python to improve the Pyro performance for the Section 5.1 experiment.

To aid the comparison between languages both in the text and in the figures, we suffix PPL names with the symbols (S), (M), and (G) to indicate whether they run on a single core, on multiple cores, or on the GPU, respectively. Despite the CUDA dependency for Pyro, we did not observe any GPU usage during Pyro SMC runs. In Pyro, SMC is a minor inference algorithm, with variational inference instead being the main focus, which may explain the lack of GPU support for SMC. Consequently, we classify SMC in Pyro as (M) and not (G).

We ran all experiments on a machine with a 12-core (24 threads) Intel Xeon Gold 6136 CPU, 64 GB of memory, and an NVIDIA TITAN RTX GPU with 24 GB of memory and 4608 CUDA cores.

#### 5.1 Experiment: Constant-Rate Birth-Death

In this experiment, we consider the non-trivial CRBD model described in Ronquist et al. [36]. This model encodes the posterior distributions of the rates with which new evolutionary lineages arise (birth rate) and die out (death rate), conditioned on the input of a fixed evolutionary tree (phylogeny). We use the dated Alcedinidae phylogeny (kingfisher birds) referenced in Ronquist et al. [36] and introduced in Jetz et al. [23]. A notable feature of this model is that it contains recursive tree constructions, which are only expressible in universal PPLs. The CorePPL implementation of this model consists of 118 lines of code†.

Fig. 8: Execution times for the CRBD experiment, for different numbers of particles N. The vertical line at the top of each bar indicates one standard deviation. PPLs marked (S) run on a single core, (M) on multiple cores, and (G) on the GPU.

We measure execution time. To ensure fairness, we disabled variance-reducing techniques such as delayed sampling [28] and ESS-triggered resampling in all PPLs where available. Consequently, all implementations use precisely the same SMC inference algorithm. We checked this and the implementations' correctness by considering the output normalizing constant estimates in all runs† . The variance and mean of these estimates were comparable for all PPLs.

The results of the experiment are shown in Fig. 8 for three different numbers of SMC particles: 10 000, 100 000, and 1 000 000. We ran the PPL implementations for 100 iterations (a number determined by available time and hardware) for each number of SMC particles. The exceptions are WebPPL (S) and Pyro (M), which we ran only for 10 000 particles due to excessive execution times. For 10 000 particles, WebPPL (S) ran for 55 seconds (standard deviation 0.63 seconds) and Pyro (M) for 250 seconds (standard deviation 28 seconds). We omit WebPPL (S) and Pyro (M) from Fig. 8. Pyro relies heavily upon vectorization through PyTorch, and the expensive operations in the CRBD model are recursive and stochastic tree constructions, which are difficult to vectorize. This explains the exceptionally long execution times for Pyro (M).

RootPPL is the best alternative in all categories. We conjecture that the difference compared to CorePPL is due to hand-tuned details in the RootPPL model. The RootPPL model uses efficient array encodings of the observed tree, precomputes the recursion order over this tree, and encodes it as an iterative procedure. CorePPL instead compiles the tree as a tagged union type with pointers to subtrees in each node and traverses it via recursion. Automatically discovering this transformation from trees to arrays and from recursion to iteration is non-trivial and not considered here, but it has potential for future work.

Fig. 9: Execution times for the Vector-Borne Disease experiment, for different numbers of particles N. The vertical line at the top of each bar indicates one standard deviation. PPLs marked (S) run on a single core, (M) on multiple cores, and (G) on the GPU.

To improve the performance of Pyro, we also applied Numba to manually parallelize the recursive tree construction in the model. The parallelization we apply is more fine-grained than the natural SMC particle parallelism and resulted in an order-of-magnitude performance boost over Pyro (M). Unlike CorePPL, RootPPL, and Birch, the execution times for Pyro/Numba (M) seem to grow sub-linearly when going from 100 000 to 1 000 000 particles, as this only increases the mean execution time from 6.72 seconds to 13.76 seconds. We conjecture that this is related to the different type of parallelism introduced with Numba, in combination with its JIT compilation. Therefore, adding such parallelism to RootPPL and CorePPL is an interesting direction for future work.

#### 5.2 Experiment: Vector-Borne Disease

Next, we consider the vector-borne disease model from Funk et al. [16], which is also studied further in Murray et al. [28]. This epidemiological model encodes a dengue outbreak in Micronesia and includes the spread of disease between mosquito and human populations. The inference is over the number of susceptible, exposed, infectious, and recovered (SEIR) individuals in the populations at discrete time steps (days), and the observations are daily numbers of reported new cases at health centers (the data is available in Funk et al. [16]). The CorePPL implementation of this model consists of 140 lines of code† .

The experiment setup is identical to Section 5.1 but with fewer SMC particles due to the more demanding computations in the model. Fig. 9 shows the results. We omit WebPPL (S) entirely due to high execution times. However, we include Pyro (M) because the simple non-stochastic control flow in this model allows much better vectorization than the CRBD model. The Numba optimization in Section 5.1 relied on the recursive structure of the model; we exclude Pyro/Numba (M) here, as such an optimization is not possible in this model.

Fig. 10: Execution times for the CRBD experiment with variance-reducing techniques for different numbers of particles N. The vertical line at the top of each bar indicates one standard deviation. PPLs marked (S) run on a single core, (M) on multiple cores, and (G) on the GPU. Note the 6× speedup of RootPPL (M) over Birch (M) for N = 100 000.

This time, CorePPL is the best option, by a small margin, over RootPPL. We conjecture that this is due to how RootPPL preallocates memory, which is instead dynamically allocated in CorePPL. This results in copying slightly more memory during resampling for this model in RootPPL.

The difference between GPU and CPU for CorePPL and RootPPL is not as significant as in Fig. 8. We conjecture that this is due to the lower numbers of SMC particles used and to RootPPL using different implementations for binomial distribution sampling on the CPU and GPU. The GPU uses a custom, less efficient version, because the C++ standard library binomial sampling implementation is not available in CUDA. Because binomial sampling is the most expensive operation in this model, a more efficient GPU implementation could improve GPU performance further.

#### 5.3 Experiment: CRBD with Variance-Reducing Techniques

In this experiment, we again consider the CRBD model from Section 5.1, but with delayed sampling and ESS-triggered resampling allowed. Also, we now consider a different, more challenging phylogeny of Tyrant flycatchers [36,23].

Fig. 10 shows the results. Other than the changes above, the setup is identical to Section 5.1. We added static delayed sampling manually to all models to ensure fairness. Note, however, that automatic and dynamic delayed sampling, as introduced in Murray et al. [28], is also natively supported in Birch (but introduces some unfair overhead). CorePPL is omitted here, as adding efficient delayed sampling to the model is rendered more difficult by the current lack of support for mutable data structures. Based on the experiment in Section 5.1, WebPPL (S) and Pyro (M) are also not considered here.

The results offer no surprises compared to Fig. 8, and RootPPL is again the best alternative. Note the increased execution times here compared to Fig. 8, due to the more challenging phylogeny and the delayed sampling overhead (which is greatly compensated by increased inference accuracy).

### 6 Related Work

There are quite a few PPL implementations making use of SMC inference. Most closely related to the contributions in this paper is Birch [32]. Similarly to RootPPL, Birch implements SMC inference, and the target language for compilation is C++. However, while performance is one of the main goals of Birch, some overhead is inevitably introduced by supporting various quality-of-life C++ features—including automatic heap allocation [30] and object-oriented features. RootPPL does not support such features in favor of performance. Similarly to RootPPL, Birch supports CPU parallelism through the use of OpenMP. Compilation to GPUs is, however, currently not supported in Birch.

The PCFG concept can also be related to Birch. In Birch, users write models for SMC inference as a method simulate which the inference algorithm calls iteratively. Resampling only occurs between calls to this method. Furthermore, data is passed between calls to simulate through particle variables stored in an object defined as part of the model (similar to the PCFG state). We can view PCFG basic blocks as a natural generalization of the Birch simulate method, conceptually allowing for many simulate methods with arbitrary control-flow in between them. In particular, SMC particles can take different paths through the PCFG. As with PCFG blocks, the explicit simulate function used in Birch can potentially make it more challenging to express models for programmers. This is not a problem when using our approach of compiling into PCFGs, as we then do the block decomposition automatically.

Besides Birch, parallelism for SMC inference in PPLs is surprisingly absent in previous work. The predecessor of Birch, LibBi [29], is an exception and implements highly performant SMC inference through SIMD instructions, OpenMP, and CUDA. However, in contrast with RootPPL and CorePPL, the LibBi modeling language is not universal. In other words, LibBi cannot express many probabilistic models.

Pyro [10] is a PPL mainly focused on stochastic variational inference, supporting MCMC and SMC in addition. SMC in Pyro is similar to Birch in that models are constructed using an explicit step function (equivalent to simulate in Birch). In general, Pyro supports parallelism through vectorization using PyTorch [5] tensors, which is powerful but also restrictive. We saw this in Section 5.1, where we could not use Pyro tensors to parallelize the tree recursion.

Other universal PPLs implementing SMC inference include WebPPL [20] and Anglican [40]. These languages are embedded in JavaScript and Clojure, respectively, and implement several inference algorithms (including SMC) through CPS transformations. The focus is on ease of modeling through functional-style constructs supported by complex runtimes (V8 for JavaScript and the JVM for Clojure) and on supporting many different inference algorithms. Parallelism for SMC is not directly supported, in contrast with CorePPL and RootPPL, where the focus is parallelism and performance.

Stan [12] and AugurV2 [22] support GPU parallelization of MCMC. Their modeling languages are, however, more restricted than CorePPL. Stan supports explicit parallelization of specific functions, and the AugurV2 compiler can compile to MCMC algorithms running partially in parallel on CUDA. This is quite different from the natural SMC parallelism in CorePPL and RootPPL.

There are also many other probabilistic programming tools, libraries, and languages available, for instance, Gen [13], Turing [17], Hakaru [34], and Edward [38]. Generally, these either focus on assisting users in manually constructing inference algorithms tailored for their specific models or on providing efficient inference for a restricted set of models.

### 7 Conclusion

This paper introduced the concept of PCFGs and a general method for compiling universal PPLs to PCFGs. We illustrated these contributions further through the RootPPL implementation and the CorePPL compiler. This is the first work compiling a universal PPL to GPU with SMC inference. Furthermore, the evaluation showed that CorePPL and RootPPL can deal with real-world SMC inference problems and outperform the current state-of-the-art with up to 6× speedups for challenging models (and even more when compared across CPU and GPU). This gives strong empirical support for the usefulness of the contributions.

Possible improvements upon this work include the exploration of more complex CUDA and C++ runtimes for RootPPL, e.g., runtimes with automatic memory management through garbage collection. Additionally, high-performance implementations similar to RootPPL for other inference methods (e.g., MCMC) are highly relevant for many probabilistic models—for instance, various models from phylogenetics [36]. We leave these topics for future work.

#### Acknowledgments

We thank Lawrence Murray for his assistance with Birch; the anonymous reviewers at ESOP for their valuable comments; Gizem Çaylak for her valuable comments and contributions to CorePPL and Miking; Lars Hummelgren, Viktor Palmkvist, and Oscar Eriksson for their valuable comments and contributions to Miking; and finally all other Miking developers for their contributions to Miking.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Foundations for Entailment Checking in Quantitative Separation Logic⋆

Kevin Batz<sup>1</sup>, Ira Fesefeldt<sup>1</sup>, Marvin Jansen<sup>1</sup>, Joost-Pieter Katoen<sup>1</sup>, Florian Keßler<sup>1</sup>, Christoph Matheja<sup>2,3</sup>, and Thomas Noll<sup>1</sup>

<sup>1</sup> Software Modeling and Verification Group, RWTH Aachen University, Germany
{kevin.batz,fesefeldt,katoen,noll}@cs.rwth-aachen.de

<sup>2</sup> Programming Methodology Group, ETH Zürich, Switzerland

<sup>3</sup> Technical University of Denmark, chmat@dtu.dk

Abstract. Quantitative separation logic (QSL) is an extension of separation logic (SL) for the verification of probabilistic pointer programs. In QSL, formulae evaluate to real numbers instead of truth values, e.g., the probability of memory-safe termination in a given symbolic heap. As with SL, one of the key problems when reasoning with QSL is entailment: does a formula f entail another formula g?

We give a generic reduction from entailment checking in QSL to entailment checking in SL. This allows us to leverage the large body of SL research for the automated verification of probabilistic pointer programs. We analyze the complexity of our approach and demonstrate its applicability. In particular, we obtain the first decidability results for the verification of such programs by applying our reduction to a quantitative extension of the well-known symbolic-heap fragment of separation logic.

### 1 Introduction

Separation logic [29] (SL) is a popular formalism for Hoare-style verification of imperative, heap-manipulating and, possibly, concurrent programs. Its assertion language extends first-order logic with two connectives—the separating conjunction ⋆ and the magic wand −∗—that enable concise specifications of how program memory, or other resources, can be split up and combined. SL builds upon these connectives to champion local reasoning about the resources employed by programs. Consequently, program parts can be verified by considering only those resources they actually access—a crucial property for building scalable tools including automated verifiers [46,12,16,44,31], static analyzers [10,24,14], and interactive theorem provers [32]. At the foundation of almost any automated approach based on SL lies the entailment problem ϕ |= ψ: are all models of SL formula ϕ also models of SL formula ψ? For example, Hoare-style verifiers need to solve entailments whenever they invoke the rule of consequence, and static

<sup>⋆</sup> This work is partially supported by the ERC AdG project 787914 FRAPPANT.

analyzers ultimately solve entailments to perform abstraction. While undecidable in general [1], the wide adoption of SL and the central role of the entailment problem have triggered a massive research effort to identify SL fragments with a decidable entailment problem [11,17,21,22,27,28,35,40,47,18,20], and to build practical entailment solvers [46,12,16,50].

Probabilistic programs, that is, programs with the ability to sample from probability distributions, are an increasingly popular formalism for, amongst others, designing efficient randomized algorithms [42] and describing uncertainty in systems [23,15]. While formal reasoning techniques for probabilistic programs have existed since the 1980s (cf. [37,38,49]), they are rarely automated and typically target only simplistic programming languages. For example, verification techniques that support reasoning about both randomization and data structures are, with notable exceptions [51,9], rare—a surprising situation given that randomized algorithms typically rely on dynamic data structures.

Quantitative separation logic (QSL) is a weakest-precondition-style verification technique that targets randomized algorithms manipulating complex data structures; it marries SL and weakest preexpectations [43]—a well-established calculus for reasoning about probabilistic programs. In contrast to classical SL, QSL's assertion language does not consist of predicates, which evaluate to Boolean values, but expectations (or: random variables), which evaluate to real numbers. QSL has been successfully applied to the verification of randomized algorithms, and QSL expectations have been formalized in Isabelle/HOL [26]. However, reasoning is far from automated—mainly due to the lack of decision procedures or solvers for entailments between expectations in QSL.

This paper presents, to the best of our knowledge, the first technique for automatically deciding QSL entailments. More precisely, we reduce QSL quantitative entailments to classical entailments between SL formulas. Hence, we can leverage two decades of separation logic research to advance QSL entailment checking, and thus also automated reasoning about probabilistic programs.

Contributions. We make the following technical contributions:


Outline. Section 2 introduces (quantitative) separation logic. Section 3 motivates our approach by providing the foundations for probabilistic pointer program verification with QSL together with several examples. We present the key ideas and our main contribution of reducing QSL entailment checking to SL entailment checking in Section 4. We analyse the complexity of our approach in Section 5. In Section 6, we apply our approach to obtain the first decidability results for probabilistic pointer verification. Finally, Section 7 discusses related work and Section 8 concludes.

Detailed proofs are found in an extended version of this paper [7].


Table 1. Metavariables used throughout this paper.

### 2 (Quantitative) Separation Logic

#### 2.1 Program States

Let Vals be a countably infinite set of values, and let Vars be a countably infinite set of variables with domain Vals. The set of stacks is given by

$$\mathsf{Stacks} = \{\, s \;\mid\; s\colon \mathsf{Vars} \to \mathsf{Vals} \,\}.$$

Table 2. Semantics of SL [A] formulae.

Let Locs ⊂ Vals be an infinite set of locations. We denote locations by ℓ and variations thereof. We fix a natural number k ≥ 1 and a heap model where finite sets of locations are mapped to fixed-size records over Vals of size k. Put more formally, the set of heaps is given by

$$\mathsf{Heaps}_{k} = \left\{ h \;\middle|\; h\colon L \to \mathsf{Vals}^{k},\; L \subseteq \mathsf{Locs},\; |L| < \infty \right\}.$$

The set of program states is then given by

$$\mathsf{States} = \{\, (s, h) \;\mid\; s \in \mathsf{Stacks},\; h \in \mathsf{Heaps}_{k} \,\}.$$

Given a program state (s, h) and an expression t over Vars, we denote by t(s) the evaluation of expression t in s, i.e., the value that is obtained by evaluating t after replacing any occurrence of any variable x ∈ Vars in t by the value s(x). We write s[x := v] to indicate that we set variable x to value v ∈ Vals in s, i.e.,<sup>4</sup>

$$s[x := v] \;=\; \lambda y.\begin{cases} v, & \text{if } y = x\\ s(y), & \text{if } y \neq x. \end{cases}$$

For a heap h, h[ℓ := (v<sub>1</sub>, . . . , v<sub>k</sub>)] is defined analogously. For a given heap h: L → Vals<sup>k</sup>, we denote by dom(h) its domain L. Two heaps h<sub>1</sub>, h<sub>2</sub> are disjoint, denoted h<sub>1</sub> ⊥ h<sub>2</sub>, if their domains do not overlap, i.e., dom(h<sub>1</sub>) ∩ dom(h<sub>2</sub>) = ∅. The disjoint union of two disjoint heaps h<sub>1</sub>: L<sub>1</sub> → Vals<sup>k</sup> and h<sub>2</sub>: L<sub>2</sub> → Vals<sup>k</sup> is

$$h_1 \star h_2\colon \mathsf{dom}(h_1) \mathbin{\dot\cup} \mathsf{dom}(h_2) \to \mathsf{Vals}^{k},\qquad (h_1 \star h_2)(\ell) = \begin{cases} h_1(\ell), & \text{if } \ell \in \mathsf{dom}(h_1)\\ h_2(\ell), & \text{if } \ell \in \mathsf{dom}(h_2). \end{cases}$$

#### 2.2 Separation Logic

A predicate Φ ∈ P(States) is a set of states. A predicate Φ is called pure if it does not depend on the heap, i.e., for every stack s and heaps h, h′, we have (s, h) ∈ Φ iff (s, h′) ∈ Φ.

<sup>4</sup> We use <sup>λ</sup>-expressions to denote functions: Function λX. <sup>f</sup> applied to an argument v evaluates to f in which every occurrence of X is replaced by v.

We consider a separation logic SL [A] with standard semantics [48]. A distinguishing aspect is that SL [A] is parametrized by a set A of predicate symbols ψ with given semantics ⟦ψ⟧ ∈ P(States). We often identify predicate symbols ψ with their predicates ⟦ψ⟧. Elements of A build the atoms of SL [A]. Our reduction from quantitative entailments to qualitative entailments does not depend on the choice of these predicate symbols. We therefore take a generic approach that allows for user-defined atoms, e.g., list or tree predicates.

Definition 1. Let A be a countable set of predicate symbols. Formulae in separation logic SL [A] with atoms in A adhere to the grammar

$$
\varphi \quad \rightarrow \quad \vartheta \mid \neg \varphi \mid \varphi \land \varphi \mid \varphi \lor \varphi \mid \exists x \colon \varphi \mid \forall x \colon \varphi \mid \varphi \star \varphi \mid \varphi \mathrel{−\!∗} \varphi,
$$

where ϑ ∈ A, and where x ∈ Vars.

The Boolean connectives ¬, ∧, and ∨ as well as the quantifiers ∃ and ∀ are standard. ⋆ is the separating conjunction and −∗ is the magic wand.

The semantics ⟦ϕ⟧ ∈ P(States) of a formula ϕ ∈ SL [A] is defined by induction on the structure of ϕ as shown in Table 2. Recall that we assume the semantics ⟦ψ⟧ of predicate symbols ψ ∈ A to be given. We often write (s, h) |= ϕ instead of (s, h) ∈ ⟦ϕ⟧. For ϕ, ψ ∈ SL [A], we say that ϕ entails ψ, denoted ϕ |= ψ, if whenever (s, h) ∈ States such that (s, h) |= ϕ, also (s, h) |= ψ.

Example 1. Let Vals = Z, Locs = N>0, and k = 1. A term t is either a variable x ∈ Vars or the constant 0 ∈ Vals. The set A of predicate symbols is

$$\mathsf{A} = \{\, \mathsf{true},\ \mathsf{emp},\ x \mapsto t,\ t = t',\ t \neq t',\ \mathsf{ls}(t, t') \;\mid\; x \in \mathsf{Vars},\ t, t' \text{ terms} \,\}.$$

Here, apart from standard predicates for true, equalities, and disequalities,

1. emp is the empty-heap predicate, i.e.,

(s, h) |= emp iff dom (h) = ∅ ,

2. x ↦ t is the points-to predicate, i.e.,

(s, h) |= x ↦ t iff dom(h) = {s(x)} and h(s(x)) = t(s),

3. the list predicate ls(t, t′) asserts that the heap models a singly-linked list segment from t to t′:

$$\begin{aligned} (s, h) \models \mathsf{ls}(t, t') \quad \text{iff} \quad & \mathsf{dom}(h) = \emptyset \text{ and } t(s) = t'(s), \text{ or}\\ & \text{there exist } n \geq 1 \text{ and terms } t_1, \dots, t_n \text{ with } t_n = t' \text{ such that}\\ & (s, h) \models t \mapsto t_1 \star \dots \star t_{n-1} \mapsto t_n. \end{aligned}$$

In this setting, SL [A] contains, e.g., the well-known symbolic heap fragment of separation logic with lists. For instance, the SL [A] formula

$$
\exists y \colon \exists z \colon\ x \mapsto y \star y \mapsto z \star \mathsf{ls}(z, 0)
$$

asserts that the heap consists of a list with head x of length at least 2.


Table 3. Semantics of QSL [A] formulae.

#### 2.3 Quantitative Separation Logic

In quantitative separation logic [9,39], formulae evaluate to non-negative real numbers or infinity instead of truth values. By conservatively extending the weakest preexpectation calculus by McIver & Morgan [41], this enables the compositional verification of probabilistic pointer programs by reasoning about expected list-sizes, probabilities of terminating with an empty heap, and alike.

We consider here a fragment of quantitative separation logic suitable for reasoning about the likelihood of events in probabilistic pointer programs such as, e.g., the probability of terminating in a given symbolic heap. The formulae we consider evaluate to rational probabilities rather than arbitrary reals or infinity. We denote the set [0, 1] ∩ Q<sub>≥0</sub> of rational probabilities by P. Like SL [A], quantitative separation logic is parameterized by a set A of predicate symbols ψ with given semantics ⟦ψ⟧ ∈ P(States), building the atoms of QSL [A].

Definition 2. Let A be a countable set of predicate symbols. Formulae in quantitative separation logic QSL [A] with atoms in A adhere to the grammar

$$\begin{aligned} f \;\rightarrow\;& [\psi] \;\mid\; [\pi] \cdot f + [\neg\pi] \cdot f \;\mid\; q \cdot f + (1 - q) \cdot f \;\mid\; f \cdot f\\ \mid\;& 1 - f \;\mid\; f \max f \;\mid\; f \min f \;\mid\; ⨆x \colon f \;\mid\; ⨅x \colon f\\ \mid\;& f \star f \;\mid\; [\psi] \mathrel{−\!∗} f, \end{aligned}$$

where ψ, π ∈ A with π pure, q ∈ P, and where x ∈ Vars.

The semantics of a formula f ∈ QSL [A] is a (one-bounded) expectation. The set E<sub>≤1</sub> of one-bounded expectations is defined as

$$\mathbb{E}_{\leq 1} \;=\; \{\, X \;\mid\; X\colon \mathsf{States} \to [0, 1] \,\}.$$

We use the Iverson bracket [30] notation [Φ] to associate with predicate Φ its indicator function. Formally,

$$[\Phi]\colon\ \mathsf{States} \to \{0, 1\},\qquad [\Phi](s, h) = \begin{cases} 1, & \text{if } (s, h) \in \Phi\\ 0, & \text{if } (s, h) \notin \Phi. \end{cases}$$

Given a predicate symbol ψ, we often write [ψ] instead of [⟦ψ⟧]. The semantics ⟦f⟧ ∈ E<sub>≤1</sub> of f ∈ QSL [A] is defined by induction on the structure of f in Table 3. We write f ≡ g if f and g are equivalent, i.e., if ⟦f⟧ = ⟦g⟧. Infima and suprema are taken over the complete lattice ([0, 1], ≤). In particular, inf ∅ = 1 and sup ∅ = 0.

Theorem 1. The semantics of QSL [A] formulae is well-defined, i.e., for all f ∈ QSL [A], we have ⟦f⟧ ∈ E<sub>≤1</sub>.

Proof. By induction on the structure of f.

Let us go over the individual constructs. Formulae of the form [ψ] are the atomic formulae. [π] · g + [¬π] · u is a Boolean choice between g and u that does not depend upon the heap since ⟦π⟧ is pure. q · g + (1 − q) · u is a convex combination of g and u. g · u is the pointwise multiplication of g and u. 1 − g is the quantitative (or probabilistic) negation of g. g max u and g min u are the pointwise maximum and minimum of g and u, respectively.

⨆x: g is the supremum quantification that, given a state (s, h), evaluates to the supremum of the set obtained from evaluating g in (s[x := v], h) for every value v ∈ Vals. In our setting, this supremum is actually a maximum. Dually, ⨅x: g is the infimum quantification.

⋆ and −∗ are the quantitative analogues of the separating conjunction and the magic wand from separation logic as defined in [9]. g ⋆ u is the quantitative separating conjunction of g and u. Intuitively speaking, whereas the qualitative separating conjunction maximizes a truth value under all appropriate partitionings of the heap, the quantitative separating conjunction maximizes a probability. [ψ] −∗ u is the quantitative magic wand. Whereas the qualitative magic wand minimizes a truth value under all appropriate extensions of the heap, the quantitative magic wand minimizes a probability. For an in-depth treatment of these connectives, we refer to [9].
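For concreteness, the two quantitative connectives of our fragment can be spelled out as follows; this is a recollection of the definitions from [9], restricted to the indicator-guarded wand used in Definition 2:

$$(g \star u)(s, h) \;=\; \max \big\{\, g(s, h_1) \cdot u(s, h_2) \;\big|\; h_1 \perp h_2,\ h = h_1 \star h_2 \,\big\},$$

$$\big([\psi] \mathrel{−\!∗} u\big)(s, h) \;=\; \inf \big\{\, u(s, h \star h') \;\big|\; h' \perp h,\ (s, h') \models \psi \,\big\}.$$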

Example 2. Let Vals, Locs, k, and A be as in Example 1. Then QSL [A] contains, e.g., a quantitative extension of the symbolic heap fragment of separation logic with lists. For instance, the QSL [A] formula

$$0.7 \cdot \big(⨆y \colon ⨆z \colon\ [x \mapsto y] \star [y \mapsto z] \star [\mathsf{ls}(z, 0)]\big) \;+\; 0.3 \cdot [\mathsf{emp}]$$

expresses that with probability 0.7 the heap consists of a list with head x of length at least 2 and that with probability 0.3 the heap is empty.

Finally, given f, g ∈ QSL [A], we say that f entails g, denoted f |= g, if

for all (s, h) ∈ States: ⟦f⟧(s, h) ≤ ⟦g⟧(s, h).

Quantitative entailments f |= g generalize classical entailments in the sense that f (pointwise) lower-bounds the quantity g. For example, if g assigns to each state the probability that some program C terminates without a memory error, then the entailment [true] |= g means that C terminates almost-surely, i.e., with probability one. Our problem statement now reads as follows: Reduce entailment checking in QSL [A] to checking finitely many entailments in SL [A].
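For intuition, here is a minimal entailment instance (our own illustration, not one of the examples above):

$$0.5 \cdot [\mathsf{emp}] \;\models\; [\mathsf{emp}], \qquad \text{since } 0.5 \cdot [\mathsf{emp}](s, h) \,\leq\, [\mathsf{emp}](s, h) \text{ for all } (s, h) \in \mathsf{States},$$

whereas the reverse entailment fails on every state with an empty heap.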

### 3 Entailments in Probabilistic Program Verification

Our primary motivation for studying the entailment problem for quantitative separation logic is to provide foundations for the automated verification of probabilistic pointer programs. In this section, we consider examples of such programs written in hpGCL—an extension of McIver & Morgan's probabilistic guarded command language (cf. [41]) by heap-manipulating instructions—and the entailments that arise from their verification. We briefly formalize reasoning about hpGCL programs with weakest liberal preexpectations; for a thorough introduction of hpGCL programs and techniques for their verification, we refer to [9,39].

#### 3.1 Heap-manipulating pGCL

Recall from Section 2.1 that heaps map memory locations to fixed-size records (or tuples) of length k ≥ 1. The set of programs in heap-manipulating probabilistic guarded command language for k = 1, Vals = Z and Locs = N>0, denoted hpGCL, is given by the grammar

$$\begin{aligned} C \;\rightarrow\;& \texttt{skip} \;\mid\; x := E \;\mid\; C\,;\,C \;\mid\; \texttt{if}\,(\,B\,)\,\{\,C\,\}\,\texttt{else}\,\{\,C\,\} \;\mid\; \texttt{while}\,(\,B\,)\,\{\,C\,\}\\ \mid\;& \{\,C\,\}\,[\,p\,]\,\{\,C\,\} \;\mid\; x := \texttt{new}\,(E) \;\mid\; \texttt{free}(E) \;\mid\; {<}E{>} := E' \;\mid\; x := {<}E{>}, \end{aligned}$$
where x ∈ Vars, p ∈ P, E, E<sup>0</sup> are arithmetic expressions and B is a Boolean expression. We assume that expressions do not depend on the heap. For now, we do not fix a specific syntax for expressions but assume evaluation mappings

E : Stacks → Z and B : Stacks → {true, false} .

In addition to the usual control flow structures for sequential composition, conditionals, and loops, skip does nothing, x := E assigns the value E(s) obtained from evaluating expression E in the current program state (s, h) to x, and the probabilistic choice { C<sub>1</sub> } [ p ] { C<sub>2</sub> } flips a coin with bias p—it executes C<sub>1</sub> if the coin flip yields heads, and C<sub>2</sub> otherwise. The allocation x := new (E) nondeterministically selects a fresh location, stores it in x, and puts a record with value E on the heap at that location. Since we assume an infinite address space, allocation never fails. Conversely, free(E) disposes the record at location E from the heap; it fails if no such location exists. The mutation <E> := E′ and the lookup x := <E> update the record at location E to E′ resp. assign to x the value stored at location E; both statements fail if the heap contains no such location.

Table 4. Rules for compositionally computing weakest liberal preexpectations. Here, f is a QSL [A] formula representing the postexpectation. f [x := E] denotes the substitution of every free occurrence of x by E in f. [E ↦ −] desugars to ⨆z: [E ↦ z].

#### 3.2 Weakest Liberal Preexpectations

We formalize reasoning about hpGCL programs in terms of the weakest liberal preexpectation transformer wlp: hpGCL → (QSL [A] → QSL [A]), where A at least contains formulae of the form [E ↦ E′]; Table 4 summarizes the rules for computing wlp of loop-free programs on the program structure.

Conceptually, the weakest liberal preexpectation ⟦wlp⟦C⟧(f)⟧(s, h) of program C with respect to postexpectation f ∈ QSL [A] on (s, h) is the least expected value of ⟦f⟧ (measured in the final states) after successful<sup>5</sup> termination of C on initial state (s, h), plus the probability that C does not terminate on (s, h). Adding the non-termination probability can be thought of as a partial correctness view: we include the non-termination probability of C on state (s, h) in the wlp of C just as we include the state (s, h) in the weakest liberal precondition of C in case C does not terminate on (s, h).

<sup>5</sup> i.e., without encountering a memory error.

A reader familiar with separation logic will realize the close similarity between the rules in Table 4 and the weakest preconditions for SL by Ishtiaq and O'Hearn [29]. The main differences are (1) the use of the quantitative connectives ⋆, −∗, ·, and +, and (2) the additional rule for probabilistic choice, wlp⟦{ C<sub>1</sub> } [ p ] { C<sub>2</sub> }⟧(f), which is a convex sum that weights wlp⟦C<sub>1</sub>⟧(f) and wlp⟦C<sub>2</sub>⟧(f) by p and (1 − p), respectively.
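As a small worked instance of the convex-sum rule (our own illustration; it additionally uses only the assignment rule wlp⟦x := E⟧(f) = f[x := E], in line with the substitution notation of Table 4):

$$\mathsf{wlp}⟦\{\, x := 1 \,\}\,[\,0.5\,]\,\{\, x := 2 \,\}⟧\big([x = 1]\big) \;=\; 0.5 \cdot [1 = 1] + 0.5 \cdot [2 = 1] \;\equiv\; 0.5 \cdot [\mathsf{true}].$$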

The transformer wlp is well-defined in the sense that, for every loop-free hpGCL program and every QSL [A] formula, we obtain—under mild conditions—again a QSL [A] formula:

Theorem 2. Let C ∈ hpGCL be loop-free and A be a set of predicate symbols. If


then, for every QSL [A] formula f, wlp⟦C⟧(f) ∈ QSL [A].

Proof. By induction on loop-free C.

For loops, wlp⟦while ( B ) { C }⟧(f) is typically characterized as the greatest fixed point of loop unrollings. However, we fixed an explicit syntax of formulae instead of allowing arbitrary expectations; the above fixed point is in general not expressible in our syntax.<sup>6</sup> To deal with loops, we thus require a user-supplied invariant I and apply the following proof rule (cf. [34]) to approximate wlp:

$$I \;\models\; [\neg B] \cdot f + [B] \cdot \mathsf{wlp}⟦C'⟧(I) \quad \text{implies} \quad I \;\models\; \mathsf{wlp}⟦\texttt{while}\,(\,B\,)\,\{\,C'\,\}⟧(f).$$

Notice that verifying that I is indeed an invariant via the above rule requires proving an entailment between QSL [A] formulae.

#### 3.3 Interfered Swap

Our first example concerns a program Cswap, implemented in hpGCL below, that attempts to swap the contents of two memory locations x and y. However, since variable x is shared with a concurrently running process, writing to x can be unreliable, that is, instead of the intended value, the concurrently running process may write a corrupted value err into memory with some probability, say 0.001. A similar situation occurs, e.g., when using the protocol described in [2].

$$C_{\mathsf{swap}}\colon\quad \begin{array}{l} \texttt{tmp1} := {<}x{>}\,;\\ \texttt{tmp2} := {<}y{>}\,;\\ \{\; {<}x{>} := \texttt{tmp2} \;\}\;[\,0.999\,]\;\{\; {<}x{>} := \texttt{err} \;\}\,;\\ {<}y{>} := \texttt{tmp1}\,. \end{array}$$

<sup>6</sup> It is noteworthy that a sufficiently expressive syntax for weakest preexpectation reasoning without heaps has been developed only recently [8].

We can use wlp to verify an upper bound on the probability that an erroneous write operation happened by solving the QSL entailment

$$\begin{aligned} &\mathsf{wlp}⟦C_{\mathsf{swap}}⟧\big([x \mapsto z_2] \star [y \mapsto z_1]\big)\\ &\models\; [z_2 = \mathsf{err}] \cdot \big([x \mapsto z_1] \star [y \mapsto z_2]\big) \;+\; [z_2 \neq \mathsf{err}] \cdot 0.999 \cdot \big([x \mapsto z_1] \star [y \mapsto z_2]\big). \end{aligned}$$

That is, the probability that Cswap successfully swaps the contents of x and y is at most 0.999 if y does not initially point to the corrupt value err.

As we will see in Section 6.1, our approach for solving QSL entailments is capable of deciding the above entailment, where wlp⟦C<sub>swap</sub>⟧([x ↦ z<sub>2</sub>] ⋆ [y ↦ z<sub>1</sub>]) is computed according to the rules in Table 4.

#### 3.4 Avoiding Magic Wands

Recall from Table 4 that computing wlp introduces a magic wand (−∗) for almost every statement that accesses the heap. This is unfortunate because many decidable separation logic fragments as well as practical entailment solvers do not support magic wands.

In particular, in Section 6.1 we present a QSL fragment with a decidable entailment problem that supports magic wands only on the left-hand side of entailments. Hence, proving a lower bound on the probability that the program Cswap from above successfully swapped the contents of two memory cells, e.g.,

$$0.98 \cdot \big([x \mapsto z_2] \star [y \mapsto z_1]\big) \;\models\; \mathsf{wlp}⟦C_{\mathsf{swap}}⟧\big([x \mapsto z_1] \star [y \mapsto z_2]\big), \tag{†}$$

might still be possible with our technique but requires a different separation logic fragment to reduce to.

Fortunately, we can often avoid introducing magic wands by employing local reasoning and rules for computing wlp for specific pre- and postexpectations. In particular, the wlp calculus features (1) the frame rule from separation logic, i.e., if no free variable in g is modified by C, then wlp⟦C⟧(f) ⋆ g |= wlp⟦C⟧(f ⋆ g), (2) super-distributivity for convex combinations and maximum, i.e., q · wlp⟦C⟧(f) + (1 − q) · wlp⟦C⟧(g) |= wlp⟦C⟧(q · f + (1 − q) · g) and wlp⟦C⟧(f) max wlp⟦C⟧(g) |= wlp⟦C⟧(f max g), and (3) monotonicity, i.e., f |= g implies wlp⟦C⟧(f) |= wlp⟦C⟧(g). Moreover, we give four examples of specialized rules that avoid magic wands but require specific postexpectations: if x is not a free variable of E or f, and x and y are distinct variables, then

$$\begin{aligned} \text{(i)}\;\; & \mathsf{wlp}⟦x := {<}E{>}⟧\big(([E \mapsto y] \cdot [x = y]) \star f\big) \;=\; [E \mapsto y] \star f\,[x := y]\,;\\ \text{(ii)}\;\; & \mathsf{wlp}⟦{<}E{>} := E'⟧\big([E \mapsto E'] \star f\big) \;=\; [E \mapsto -\,] \star f\,;\\ \text{(iii)}\;\; & \mathsf{wlp}⟦x := \texttt{new}\,(x)⟧\big(⨆y \colon [x \mapsto y] \star f\big) \;=\; f\,[y := x]\,;\ \text{and}\\ \text{(iv)}\;\; & \mathsf{wlp}⟦x := \texttt{new}\,(y)⟧\big([x \mapsto y] \star f\big) \;=\; f\,. \end{aligned}$$

Similar rules have been used successfully for symbolic execution with separation logic in non-probabilistic settings [13]. Combining the above rules with framing, distributivity, and monotonicity often allows avoiding magic wands. In such cases, we have a richer set of decidable SL fragments at our disposal upon which to build solvers for QSL entailments. Coming back to the entailment (†) from above and writing C_swap = C₁; C₂; C₃; C₄, we calculate

$$\begin{aligned}
&\mathsf{wlp}\llbracket C_{\mathsf{swap}}\rrbracket \left( [x \mapsto z_1] \star [y \mapsto z_2] \right)\\
&\mathrel{=\!\!|}\;\; \mathsf{wlp}\llbracket C_1; C_2; C_3\rrbracket \left( \left([y \mapsto \mathsf{tmp1}] \star [x \mapsto \mathsf{tmp2}]\right) \cdot [\mathsf{tmp1} = z_2] \cdot [\mathsf{tmp2} = z_1] \right)\\[2pt]
&\;\;\vdots\\[2pt]
&\mathrel{=\!\!|}\;\; 0.999 \cdot \left([x \mapsto z_2] \star [y \mapsto z_1]\right) \,+\, 0.001 \cdot [\mathsf{false}] \qquad \text{(Rule (i))}
\end{aligned}$$

which yields a preexpectation without magic wands. Hence, we obtain a magic-wand-free entailment in (†). We have used our technique to transform this quantitative entailment into several qualitative entailments and checked them successfully using the separation logic extension of CVC4 [47]. Detailed calculations, the resulting qualitative entailments, and the input for CVC4 in SMT-LIB 2 format can be found in the extended version [7].

#### 3.5 Randomized List Population

Our second example populates a singly-linked list by flipping coins and adding a list element until the coin flip yields heads, i.e., we consider the program

$$\begin{array}{l}
C_{\mathsf{populate}}\colon\quad \mathtt{while}\,(c \neq 0)\,\{\\
\qquad\qquad\qquad\quad \{\,c := 0\,\}\ [0.5]\ \{\,x := \mathtt{new}\,(x)\,\}\\
\qquad\qquad\ \}
\end{array}$$

where x is the head of a linked list. Assume we would like to determine a lower bound on the probability that the above program does not crash and produces a list of length at least two<sup>7</sup>. For that, recall from Example 1 the separation logic formula ls(x, y) for singly-linked list segments. The aforementioned probability is then given by wlp⟦C_populate⟧(f) for the postexpectation

$$f \;=\; \mathcal{S}y\colon \mathcal{S}z\colon\ [x \mapsto y] \star [y \mapsto z] \star [\mathsf{ls}(z, 0)]\ .$$

<sup>7</sup> plus the probability of nontermination, which is 0.

We propose the loop invariant I below to show that I |= wlp⟦C_populate⟧(f), i.e., I is a lower bound on the sought-after probability.

$$I \;=\; \mathcal{S}y\colon [x \mapsto y] \star \Big( [c = 0] \cdot \mathcal{S}z\colon [y \mapsto z] \star [\mathsf{ls}(z, 0)] \;+\; [c \neq 0] \cdot \tfrac{1}{2} \cdot \big( \mathcal{S}z\colon [y \mapsto z] \star [\mathsf{ls}(z, 0)] + \tfrac{1}{2} \cdot [\mathsf{ls}(z, 0)] \big) \Big)\ .$$

To verify that I is indeed a loop invariant (hint: it is), we need to prove that

$$I \;\models\; [c = 0] \cdot f \;+\; [c \neq 0] \cdot \mathsf{wlp}\llbracket \{\,c := 0\,\}\ [0.5]\ \{\,x := \mathtt{new}\,(x)\,\}\rrbracket\,(I)\ .$$

As described in Section 3.4, we can compute wlp in a way such that the resulting formula contains no magic wands. Our reduction from QSL entailments to standard SL entailments then allows us to discharge the above invariant check using existing separation logic solvers with support for fixed list predicates, e.g., [46].

### 4 Quantitative Entailment Checking

We present our main contribution of reducing entailment checking in QSL [A] to entailment checking in SL [A]. We consider the key observations leading to our reduction in Section 4.1. We then deal with the formalization and more technical considerations of our approach in Sections 4.2 and 4.3.

#### 4.1 Idea and Key Observations

We reduce entailment checking in QSL [A] to entailment checking in SL [A], i.e.,

Given f, g ∈ QSL [A], we reduce checking f |= g to checking finitely many entailments of the form ϕ |= ψ with ϕ, ψ ∈ SL [A].

We instantiate QSL [A] and SL [A], respectively, for the sake of concreteness. For that, we fix the set A of predicate symbols given by

$$\mathfrak{A} = \{ \text{true}, \mathsf{emp}, \ x = y, \ x \neq y, \ x \mapsto y \: \mid \ x, y \in \mathsf{Vars} \} \ .$$

Now, consider the following entailment u₁ |= u₂ as a running example:

$$u_1 \;=\; 0.4 \cdot \underbrace{([x \mapsto y] \star [y \mapsto z])}_{=\,g_1} \,+\, 0.6 \cdot \underbrace{[x \mapsto y]}_{=\,g_2} \quad \models\quad \underbrace{0.6 \cdot ([x \mapsto y] \star [\mathsf{true}])}_{=\,u_2}\ .$$

Intuitively speaking, u₁ expresses that with probability 0.4 the heap consists of two cells where x points to y and, separately, y points to z, and that with probability 0.6 the heap consists of a single cell where x points to y. Formula u₂ expresses that with probability 0.6 the heap contains a cell where x points to y. How can we reduce the problem of checking whether u₁ |= u₂ holds to checking finitely many entailments in SL [A]? We rely on two key observations:

Observation 1. For every f ∈ QSL [A], the set

$$\mathsf{Eval}(f) \;= \; \{\llbracket f\rrbracket(s,h) \; \mid \; (s,h) \in \mathsf{States}\} \; \subset \; \mathbb{P}$$

is finite. Moreover, there is an effectively constructible finite and sound overapproximation Val [f] of Eval(f), i.e., Eval(f) ⊆ Val [f].

Example 3. Consider the expectation u₁ from our running example: We have Eval(u₁) = {0, 0.4, 0.6}. We construct a finite overapproximation of Eval(u₁) as follows: First, we observe that both subformulae g₁ and g₂ evaluate to a value in {0, 1}, i.e., Val [g₁] = Val [g₂] = {0, 1}. From Val [g₁] and Val [g₂], we obtain a finite overapproximation Val [u₁] of Eval(u₁) given by

$$\mathsf{Val}\left[u\_{1}\right] \;=\;\left\{ 0.4\cdot\alpha + 0.6\cdot\beta \; \middle|\; \alpha \in \mathsf{Val}\left[g\_{1}\right],\; \beta \in \mathsf{Val}\left[g\_{2}\right] \right\} \;=\;\left\{ 0, 0.4, 0.6, 1 \right\}.$$

Notice that Val [u₁] is a proper superset of Eval(u₁) since 1 ∉ Eval(u₁). △

We consider the construction of Val [f] for arbitrary f ∈ QSL [A] in Section 4.2.

Observation 2. Given f ∈ QSL [A] and a probability α ∈ P, there is an effectively constructible SL [A] formula, which we denote by ⌈α ≼ f⌉, such that (s, h) is a model of ⌈α ≼ f⌉ if and only if f evaluates at least to α on state (s, h), i.e.,

$$\underbrace{(s,h)\models\lceil\alpha\preceq f\rceil}_{\text{in }\mathsf{SL}[\mathfrak{A}]}\qquad\text{iff}\qquad\underbrace{\alpha\;\leq\;\llbracket f\rrbracket(s,h)}_{\text{in }\mathsf{QSL}[\mathfrak{A}]}\ .$$

We can thus lower bound QSL [A] formulae in terms of SL [A] formulae.

Example 4. Continuing our running example, we construct ⌈0.5 ≼ u₁⌉, i.e., an SL [A] formula evaluating to true on state (s, h) if and only if u₁ evaluates at least to 0.5. We start by considering the subformulae of u₁. Since both g₁ and g₂ embed SL [A] predicates, we have for every α ∈ P

$$\begin{array}{rcl} \lceil \alpha \preceq g\_1 \rceil &=& \text{true if } \alpha = 0 \text{ else } x \mapsto y \star y \mapsto z \\ \text{and} \quad \lceil \alpha \preceq g\_2 \rceil &=& \text{true if } \alpha = 0 \text{ else } x \mapsto y \text{ }. \end{array}$$

The intuition is as follows: α = 0 lower bounds every probability. Conversely, if α > 0, then α lower bounds g₁ (resp. g₂) on state (s, h) if and only if (s, h) satisfies the predicate g₁ (resp. g₂). Now, when does u₁ evaluate at least to 0.5? Given Val [g₁] and Val [g₂] and the fact that the valuation of u₁ is a convex combination of the valuations of g₁ and g₂, there are (at most) two cases: Either both g₁ and g₂ evaluate to (at least) 1, or g₂ (but not necessarily g₁) evaluates to (at least) 1. Given ⌈1 ≼ g₁⌉ and ⌈1 ≼ g₂⌉, the aforementioned informal disjunction translates to a formal disjunction in SL [A]:

$$\begin{aligned} \lceil 0.5 \preceq u_1 \rceil &= \left( \lceil 1 \preceq g_1 \rceil \wedge \lceil 1 \preceq g_2 \rceil \right) \vee \lceil 1 \preceq g_2 \rceil \\ &= \left( (x \mapsto y \star y \mapsto z) \wedge x \mapsto y \right) \vee x \mapsto y \end{aligned}$$

Notice that, as is the case for Val [u₁], we construct ⌈0.5 ≼ u₁⌉ syntactically. In particular, we disregard that the disjunct (x ↦ y ⋆ y ↦ z) ∧ x ↦ y is unsatisfiable and therefore equivalent to false. △

We provide the construction of ⌈α ≼ f⌉ for arbitrary QSL [A] formulae f, including quantitative quantifiers and the magic wand, in Section 4.3.

Table 5. Inductive definition of Val [f].

Finally, Observations 1 and 2 together yield our reduction from f |= g to finitely many entailments in SL [A]. Intuitively speaking, we formalize that

> whenever f evaluates at least to α, then g too evaluates at least to α

equivalently in terms of finitely many SL [A] entailments. Put more formally, since Val [f] is finite, we have

$$\begin{aligned}
f \models g \;\;&\text{iff}\;\; \text{for all } (s, h)\colon\ \llbracket f\rrbracket(s,h) \leq \llbracket g\rrbracket(s,h) &&\text{(by definition)}\\
&\text{iff}\;\; \text{for all } (s, h) \text{ and all } \alpha \in \mathsf{Val}\,[f]\colon\ \alpha \leq \llbracket f\rrbracket(s,h) \text{ implies } \alpha \leq \llbracket g\rrbracket(s,h) &&\text{(by Observation 1)}\\
&\text{iff}\;\; \text{for all } (s, h) \text{ and all } \alpha \in \mathsf{Val}\,[f]\colon\ (s,h) \models \lceil\alpha \preceq f\rceil \text{ implies } (s,h) \models \lceil\alpha \preceq g\rceil &&\text{(by Observation 2)}\\
&\text{iff}\;\; \text{for all } \alpha \in \mathsf{Val}\,[f]\colon\ \lceil\alpha \preceq f\rceil \models \lceil\alpha \preceq g\rceil\ . &&\text{(by definition)}
\end{aligned}$$

Example 5. Reconsider our running example. Since |Val [u₁]| = 4, the QSL [A] entailment u₁ |= u₂ is equivalent to the four entailments

$$
\lceil \alpha \preceq u_1 \rceil \;\models\; \lceil \alpha \preceq u_2 \rceil \quad \text{for } \alpha \in \{0, 0.4, 0.6, 1\}
$$

in SL [A], each of which actually holds. △

#### 4.2 Constructing Finite Overapproximations of Eval (f)

We consider the formal construction underlying Observation 1 from the previous section, i.e., given f ∈ QSL [A], we provide a syntactic construction of a finite overapproximation Val [f] of Eval(f). This construction is by induction on the structure of f, as shown in Table 5. For that, we define some shorthands. Given α ∈ P, V, W ⊆ P, and a binary operation ◦: P × P → P, we define

$$
\alpha \cdot V = \{ \alpha \cdot \beta \: \mid \: \beta \in V \} \quad \text{and} \quad V \circ W = \{ \beta \circ \gamma \: \mid \: \beta \in V, \: \gamma \in W \} \; .
$$

Let us now go over the individual cases.

The case f = [ψ]. We have [ψ] (s, h) ∈ {0, 1} by definition.

The case f = [π] · g + [¬π] · u. For every (s, h), the formula f either evaluates to ⟦g⟧(s, h) or to ⟦u⟧(s, h), depending on whether (s, h) |= π holds.

The case f = p · g + (1 − p) · u. The formula f evaluates to p · α + (1 − p) · β for some α ∈ Val [g] and β ∈ Val [u].

The case f = g · u or f = g ⋆ u. The formula f evaluates to α · β for some α ∈ Val [g] and β ∈ Val [u].

The case f = 1 − g. The formula f evaluates to 1 − α for some α ∈ Val [g].

The case f = g ◦ u for ◦ ∈ {max, min}. Since max and min are defined pointwise, the formula f evaluates to some value α ◦ β for α ∈ Val [g] , β ∈ Val [u].

The case f = Sx: g or f = Jx: g. Since Val [g] overapproximates the set of all valuations of g, quantitative quantifiers do not add any valuation.

The case f = [ψ] −∗ g. Recall that

> ⟦f⟧(s, h) = inf {⟦g⟧(s, h ⋆ h′) | h′ ⊥ h and [ψ](s, h′) = 1} .

If the above set is non-empty, the infimum is actually a minimum and therefore f evaluates to some value in Val [g]. If the above set is empty, then ⟦f⟧(s, h) = 1. It is easy to verify that 1 is necessarily an element of Val [g] (cf. [7, Lemma 4]).
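To make Table 5 concrete, the following Haskell sketch implements the Val [·] construction case by case; the datatype QSL and all function names are our own illustrative choices (not part of the paper's formal development), and probabilities are represented as exact Rationals.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- A hypothetical AST for (a core of) QSL[A]; strings stand for embedded
-- SL predicates and pure guards.
data QSL
  = Atom String                        -- [psi], an embedded SL predicate
  | Ite String QSL QSL                 -- [pi]*g + [not pi]*u
  | Convex Rational QSL QSL            -- p*g + (1-p)*u
  | Mul QSL QSL                        -- g * u
  | SepCon QSL QSL                     -- g <sepcon> u
  | OneMinus QSL                       -- 1 - g
  | MaxQ QSL QSL | MinQ QSL QSL        -- g max u, g min u
  | SupQ String QSL | InfQ String QSL  -- S x: g  and  J x: g
  | Wand String QSL                    -- [psi] -* g

-- val f computes the overapproximation Val[f] of Eval(f), per Table 5.
val :: QSL -> Set Rational
val (Atom _)       = Set.fromList [0, 1]
val (Ite _ g u)    = val g `Set.union` val u
val (Convex p g u) = combine (\b c -> p * b + (1 - p) * c) (val g) (val u)
val (Mul g u)      = combine (*) (val g) (val u)
val (SepCon g u)   = combine (*) (val g) (val u)
val (OneMinus g)   = Set.map (1 -) (val g)
val (MaxQ g u)     = combine max (val g) (val u)
val (MinQ g u)     = combine min (val g) (val u)
val (SupQ _ g)     = val g
val (InfQ _ g)     = val g
val (Wand _ g)     = val g  -- 1 is always an element of Val[g] (cf. [7, Lemma 4])

combine :: (Rational -> Rational -> Rational)
        -> Set Rational -> Set Rational -> Set Rational
combine op v w = Set.fromList [op b c | b <- Set.toList v, c <- Set.toList w]
```

For the running example, val (Convex 0.4 (SepCon (Atom "x|->y") (Atom "y|->z")) (Atom "x|->y")) yields the set {0, 2/5, 3/5, 1}, i.e., Val [u₁] from Example 3.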

Summarizing our considerations on Val [f], we get:

Theorem 3. For every f ∈ QSL [A], the effectively constructible set Val [f] ⊂ P given in Table 5 satisfies

$$\mathsf{Eval}(f) \;\subseteq\; \mathsf{Val}\left[f\right]\ .$$
Proof. Straightforward by induction on f.

#### 4.3 Lower Bounding QSL [A] by SL [A] Formulae

Table 6. Inductive definition of ⌈α ≼ f⌉ for a given α ∈ P.

We now consider the formal construction underlying Observation 2 from Section 4.1. That is, given f ∈ QSL [A] and α ∈ P, we provide the syntactic construction of an SL [A] formula ⌈α ≼ f⌉ evaluating to true on state (s, h) if and only if f evaluates at least to α on (s, h). This construction relies on Val [f] from the previous section and is given by induction on the structure of f, as shown in Table 6. We consider the individual constructs. For that, we fix some state (s, h).

The case f = [ψ]. There are two cases. If α = 0, then α trivially lower bounds the value of [ψ]. Conversely, if α > 0, then α lower bounds [ψ] on state (s, h) if and only if (s, h) satisfies ψ.

For the composite cases, recall that by Theorem 3 there are effectively constructible finite sets Val [g] , Val [u] covering all values g and u evaluate to.

The case f = [π]·g+[¬π]·u. The formula f represents a Boolean choice between the formulae g and u, depending on the truth value of π. Hence, there are two cases: If (s, h) does satisfy π, then α lower bounds f iff α lower bounds g. Conversely, if (s, h) does not satisfy π, then α lower bounds f iff α lower bounds u.

The case f = p · g + (1 − p) · u. Since Val [g] and Val [u] cover every possible valuation of g and u, respectively, it follows that α lower bounds the valuation of f if and only if there are β ∈ Val [g] and γ ∈ Val [u] such that (1) β lower bounds g, (2) γ lower bounds u, and (3) α lower bounds the convex sum p · β + (1 − p) · γ.

The case f = g · u. The reasoning is analogous to the previous case.

The case f = 1 − g. We write α ≤ ⟦1 − g⟧(s, h) equivalently as ¬(1 − α < ⟦g⟧(s, h)). In order to turn the strict inequality into a non-strict one, we consider the successor δ of 1 − α in Val [g], i.e., the smallest δ in Val [g] greater than 1 − α. Since Val [g] is finite, such a δ always exists if 1 − α ≠ 1.

For the successor δ, checking whether δ is a lower bound of ⟦g⟧(s, h) is equivalent to checking whether 1 − α is a strict lower bound: if δ is not a lower bound, then we have run out of possible valuations that are strictly lower bounded by 1 − α.

The case f = g ◦ u for ◦ ∈ {max, min}. The probability α lower bounds the maximum of g and u on state (s, h) if and only if α lower bounds g or α lower bounds u. For ◦ = min, the reasoning is dual.

The case f = Sx: g. Recall that

$$\llbracket f\rrbracket(s,h) \;=\; \max \left\{ \llbracket g\rrbracket(s\,[x := v],\, h) \;\middle|\; v \in \mathsf{Vals} \right\}\ .$$

Now observe that α lower bounds the above maximum if and only if α lower bounds some element of the above set, i.e., if and only if there is some v with

α ≤ ⟦g⟧(s [x := v], h), which is equivalent to (s, h) |= ∃x: ⌈α ≼ g⌉.

The case f = Jx: g. Recall that

$$\llbracket f\rrbracket(s,h) \;=\; \min \left\{ \llbracket g\rrbracket(s\,[x := v],\, h) \;\middle|\; v \in \mathsf{Vals} \right\}\ .$$

Since α lower bounds the above minimum if and only if α lower bounds all elements of the above set, the reasoning is dual to the previous case.

The case f = g ⋆ u. Recall that

$$\llbracket f\rrbracket(s,h) \;=\; \max \left\{ \llbracket g\rrbracket(s,h_1) \cdot \llbracket u\rrbracket(s,h_2) \;\middle|\; h = h_1 \star h_2 \right\}\ .$$

Since Val [g] and Val [u] cover every possible valuation of g and u, respectively, α lower bounds the evaluation of f on (s, h) iff there are β ∈ Val [g], γ ∈ Val [u], and h₁, h₂ with h₁ ⋆ h₂ = h such that (1) β lower bounds g on (s, h₁), (2) γ lower bounds u on (s, h₂), and (3) α lower bounds β · γ. Given such β and γ, we can phrase this equivalently in SL [A] as

$$(s,h) \;\models\; \lceil \beta \preceq g \rceil \star \lceil \gamma \preceq u \rceil\ .$$

The case f = [ψ] −∗ g. Recall that

$$\llbracket f\rrbracket(s,h) \;=\; \inf \left\{ \llbracket g\rrbracket(s, h \star h') \;\middle|\; h' \perp h \text{ and } [\psi](s, h') = 1 \right\}\ .$$

Probability α lower bounds the above infimum if and only if, for every extension h′ of the heap h such that the stack s together with h′ satisfies ψ, probability α is a lower bound on ⟦g⟧(s, h ⋆ h′). Put more formally, the latter statement reads

$$\text{for all } h' \perp h \text{ with } (s, h') \models \psi \colon\ (s, h \star h') \models \lceil \alpha \preceq g \rceil\ ,$$

which is equivalent to (s, h) |= ψ −∗ ⌈α ≼ g⌉.
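Continuing the sketch from Section 4.2, the ⌈α ≼ ·⌉ construction of Table 6 can be rendered as follows; the SL datatype and all names are again our own illustration, assuming the QSL type and val from above.

```haskell
-- A hypothetical AST for the SL[A] formulae on the target side.
data SL
  = SLAtom String | SLTrue
  | SLNot SL | SLAnd SL SL | SLOr SL SL
  | SLSepCon SL SL | SLWand String SL
  | SLExists String SL | SLForall String SL

-- cut a f builds the formula "a lower bounds f" by induction on f (Table 6).
cut :: Rational -> QSL -> SL
cut 0 _              = SLTrue   -- 0 trivially lower bounds everything
cut _ (Atom psi)     = SLAtom psi
cut a (Ite p g u)    =          -- Boolean choice: case split on the guard p
  SLOr (SLAnd (SLAtom p)         (cut a g))
       (SLAnd (SLNot (SLAtom p)) (cut a u))
cut a (Convex p g u) =          -- some convex sum p*b + (1-p)*c reaches a
  bigOr [ SLAnd (cut b g) (cut c u)
        | b <- Set.toList (val g), c <- Set.toList (val u)
        , p * b + (1 - p) * c >= a ]
cut a (Mul g u)      =          -- analogous, with products
  bigOr [ SLAnd (cut b g) (cut c u)
        | b <- Set.toList (val g), c <- Set.toList (val u), b * c >= a ]
cut a (SepCon g u)   =          -- products again, but the heap is split
  bigOr [ SLSepCon (cut b g) (cut c u)
        | b <- Set.toList (val g), c <- Set.toList (val u), b * c >= a ]
cut a (OneMinus g)   =          -- via the successor d of 1-a in Val[g]
  case Set.lookupGT (1 - a) (val g) of
    Just d  -> SLNot (cut d g)
    Nothing -> SLTrue
cut a (MaxQ g u)     = SLOr  (cut a g) (cut a u)
cut a (MinQ g u)     = SLAnd (cut a g) (cut a u)
cut a (SupQ x g)     = SLExists x (cut a g)
cut a (InfQ x g)     = SLForall x (cut a g)
cut a (Wand psi g)   = SLWand psi (cut a g)

bigOr :: [SL] -> SL
bigOr []       = SLNot SLTrue   -- the empty disjunction is false
bigOr (d : ds) = foldl SLOr d ds
```

On the running example, cut 0.5 u₁ produces, up to the order of disjuncts and a redundant true conjunct (cf. Remark 1 below), exactly the disjunction from Example 4.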

Our construction thus applies to arbitrary QSL [A] formulae and we get:

Theorem 4. For every f ∈ QSL [A] and all α ∈ P there is an effectively constructible SL [A] formula ⌈α ≼ f⌉ such that for all (s, h) ∈ States, we have

$$(s,h) \;\models\; \lceil \alpha \preceq f \rceil \qquad \text{iff} \qquad \alpha \;\leq\; \llbracket f\rrbracket(s,h)\ .$$

Proof. By induction on f.

Finally, we obtain our main theorem.

Theorem 5. Entailment checking in QSL [A] reduces to entailment checking in SL [A], i.e., for all f, g ∈ QSL [A], we have

f |= g iff for all α ∈ Val [f]: ⌈α ≼ f⌉ |= ⌈α ≼ g⌉ .

Proof. Follows from Theorems 3 and 4 and the reasoning at the end of Section 4.1.
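In terms of the sketches above, Theorem 5 thus amounts to a few lines of glue code, where slEntails stands for an assumed oracle deciding entailment in the chosen SL [A] fragment (e.g., a wrapper around an existing solver):

```haskell
-- f |= g in QSL[A] iff the cut entailment holds for every a in Val[f].
qslEntails :: (SL -> SL -> Bool) -> QSL -> QSL -> Bool
qslEntails slEntails f g =
  and [ slEntails (cut a f) (cut a g) | a <- Set.toList (val f) ]
```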

Remark 1 (Avoiding true in SL [A] entailments). Formulae of the form ⌈α ≼ f⌉ ∈ SL [A] may introduce the atom true, which is not admitted by some decidable separation logic fragments, such as [27]. Fortunately, we can avoid true in ⌈α ≼ f⌉ formulae: true is only required in formulae of the form ⌈0 ≼ f⌉, which arise in two situations when applying Theorem 5: (1) in entailment checks of the form ⌈0 ≼ f⌉ |= ⌈0 ≼ g⌉, which always hold and can thus be omitted, and (2) if f = p · g + (1 − p) · u. In the latter case, if we have α ≠ 0 in

$$\lceil \alpha \preceq f \rceil \ = \bigvee\_{\beta \in \mathsf{Val}[g], \gamma \in \mathsf{Val}[u], p \cdot \beta + (1 - p) \cdot \gamma \ge \alpha} \lceil \beta \preceq g \rceil \wedge \lceil \gamma \preceq u \rceil \ ,$$

then either β ≠ 0 or γ ≠ 0 holds for every disjunct. Hence, subformulae of the form ⌈0 ≼ g⌉ or ⌈0 ≼ u⌉ can be omitted as well. △

### 5 Complexity

We now analyze the complexity of our approach. Recall that Theorem 5 reduces checking f |= g in QSL [A] to checking

$$\text{for all } \alpha \in \mathsf{Val}\left[f\right]\colon\ \lceil \alpha \preceq f \rceil \;\models\; \lceil \alpha \preceq g \rceil$$

in SL [A]. We consider two aspects: (1) the number of SL [A] entailments and (2) the size of the resulting SL [A] formulae occurring in each entailment. We express these quantities in terms of the size of a QSL [A] formula f and an SL [A] formula ϕ, denoted |f| and |ϕ|, respectively. In these sizes, we count every construct in the formula and require that the size of atoms is defined at instantiation. Moreover, we assume that every atom in A is at least of size 1; in particular, the atom true is of size 1. Additionally, for a QSL [A] formula f, we count the constructs that increase the number of possible evaluation results of f, namely q · g + (1 − q) · u, g · u, and g ⋆ u, and denote this number by |f|_p.<sup>8</sup>

We will see that, for an entailment f |= g in QSL [A], (1) the number of SL [A] entailments is in 2^O(|f|_p) in the worst case (see Theorem 6) and (2) the sizes of the resulting SL [A] formulae are in O(|f|) · 2^O(|f|_p²) and O(|g|) · 2^O(|g|_p²), respectively, in the worst case (see Theorem 7). Now assume we have an entailment checker for SL [A] formulae that can solve entailments of the form ⌈α ≼ f⌉ |= ⌈α ≼ g⌉ and whose runtime complexity is SL-Time(n, m), where n and m are the sizes of the SL [A] formulae on the left and right side of an entailment, respectively. Putting the above together, checking the entailment f |= g in QSL [A] then has a runtime complexity of

$$\begin{split} 2^{\mathcal{O}(\left|f\right|\_{p})} \cdot \mathrm{SL-Time} \left( \mathcal{O}(\left|f\right|) \cdot 2^{\mathcal{O}(\left|f\right|\_{p}^{2})}, \mathcal{O}(\left|g\right|) \cdot 2^{\mathcal{O}(\left|g\right|\_{p}^{2})} \right) \\ + \mathcal{O}(\left|f\right|) \cdot 2^{\mathcal{O}(\left|f\right|\_{p}^{2})} + \mathcal{O}(\left|g\right|) \cdot 2^{\mathcal{O}(\left|g\right|\_{p}^{2})} \ . \end{split}$$

If we furthermore reasonably assume that SL-Time(n, m) is at least linear in both arguments (otherwise the entailment checker can only check trivial entailments anyway), the runtime complexity simplifies to

$$2^{\mathcal{O}(\left|f\right|_{p})} \cdot \text{SL-Time}\left(\mathcal{O}(\left|f\right|) \cdot 2^{\mathcal{O}(\left|f\right|_{p}^{2})},\ \mathcal{O}(\left|g\right|) \cdot 2^{\mathcal{O}(\left|g\right|_{p}^{2})}\right)\ .$$

As for aspect (1), we first observe that checking f |= g by means of Theorem 5 requires checking |Val [f]| entailments in SL [A]. However, only the constructs counted by |f|_p increase the number of possible evaluations, which in turn increases the size of the overapproximation Val [f]. Every time one of these constructs occurs, the number of possible evaluations Eval(f) may double; consequently, the overapproximation Val [f] also doubles in size. Other constructs do not increase the number of evaluations but instead inherit the evaluations from their subformulae.

Theorem 6. We have |Val [f]| ≤ 2^(|f|_p + 1). Hence, checking f |= g by means of Theorem 5 requires checking 2^O(|f|_p) entailments in SL [A].

Proof. By induction on f.
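For instance, the running example u₁ = 0.4 · g₁ + 0.6 · g₂ contains two of the counted constructs (one convex combination and, inside g₁, one separating conjunction), so |u₁|_p = 2 and Theorem 6 bounds |Val [u₁]| by 2³ = 8; the overapproximation Val [u₁] = {0, 0.4, 0.6, 1} computed in Example 3 indeed has only four elements.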

For the size of the resulting SL [A] formulae, i.e., aspect (2), recall that we construct entailments of the form

$$
\lceil \alpha \preceq f \rceil \;\models\; \lceil \alpha \preceq g \rceil\ .
$$

<sup>8</sup> For a formal definition see [7].

We thus determine an upper bound on the size of any SL [A] formula ⌈α ≼ f⌉. Here we make an observation similar to aspect (1): whenever one of the constructs counted by |f|_p appears, the size of the formula increases by the factor |Val [f]|. Multiplying such increasing exponential expressions then results, asymptotically, in a squared exponent. The other constructs increase the size by only a constant per construct. Combining both observations, we conclude an upper bound on the size of the formula ⌈α ≼ f⌉.

Theorem 7. For any formula f ∈ QSL [A] and all probabilities α ∈ P, the SL [A] formula ⌈α ≼ f⌉ has size at most 3 · |f| · 2^((|f|_p+1)²). Hence, the size of ⌈α ≼ f⌉ is in O(|f|) · 2^O(|f|_p²).

Proof. By induction on f.

Remark 2 (Complexity of SL [A] Entailments in QSL [A]). By Theorem 6 and Theorem 7, the number of entailments and the size of the formulae ⌈α ≼ f⌉ are only exponential if |f|_p is not constant. However, we would expect an entailment f |= g in QSL [A] in which the probabilistic choice p · g + (1 − p) · u appears neither in f nor in g to have a runtime complexity similar to SL [A] entailment. While it is easy to see that Val [f] = {0, 1} has constant size in this setting, the size of the formula is still exponential: in the absence of probabilistic choice, we generate multiple exponentially-sized tautologies of the form ⌈0 ≼ f⌉. However, due to Remark 1 we can eliminate all occurrences of ⌈0 ≼ f⌉. That means, if f does not contain p · g + (1 − p) · u, then for α ≠ 0 we can construct a formula equivalent to ⌈α ≼ f⌉ whose size is in O(|f|), and |Val [f]| = 2. △

### 6 Application: Decidable hpGCL Verification

Since entailment in full separation logic is undecidable, it is common to consider fragments of separation logic with a (semi-)decidable entailment problem. Given a QSL [A] fragment Q, we provide sufficient and easy-to-check characterizations on SL [A] fragments S ensuring that entailment checking in Q reduces to entailment checking in S. This simplifies the search for decidable fragments of quantitative separation logic.

We then apply our results in Section 6.1 to show the decidability of entailment checking for quantitative symbolic heaps—a quantitative extension of the well-known symbolic heap fragment of separation logic—and demonstrate the applicability to the verification of probabilistic pointer programs.

Our reduction from entailments in QSL [A] to entailments in SL [A] relies on the construction of the ⌈α ≼ f⌉ formulae from Section 4.3. This suggests the following definition:

Definition 3. Let Q be a QSL [A] fragment. We say that an SL [A] fragment S is Q-admissible if ⌈α ≼ f⌉ ∈ S holds for all f ∈ Q and all α ∈ P. △


Table 7. SL [A] requirements for entailment checking in QSL [A].

The syntactic nature of our construction of the formulae ⌈α ≼ f⌉ allows for a syntactic criterion on SL [A] fragments to be Q-admissible.

Lemma 1. Let Q be a QSL [A] fragment. If an SL [A] fragment S satisfies the requirements provided in Table 7, then S is Q-admissible.

Proof. By induction on f.

Finally, we provide a sufficient criterion for the decidability of entailment in QSL [A] fragments given SL [A] fragments with a decidable entailment problem. Since entailment checks ϕ |= ψ in SL [A] can often (but not always) be reduced to unsatisfiability checks ϕ ∧ ¬ψ, we take a more fine-grained perspective and distinguish between fragments for the left- and the right-hand side of entailments, respectively. This distinction matters when, e.g., SL [A] fragments with a decidable satisfiability problem impose restrictions on quantifiers (cf., [20]).

Theorem 8. Let Q1, Q<sup>2</sup> be QSL [A] fragments, and let S1, S<sup>2</sup> be SL [A] fragments. If S<sup>1</sup> is Q1-admissible and S<sup>2</sup> is Q2-admissible, then

$$\begin{aligned}
&\varphi \models \psi \ \textit{for } \varphi \in \mathsf{S}_1, \psi \in \mathsf{S}_2 \ \textit{is decidable}\\
\textit{implies}\qquad &g \models f \ \textit{for } g \in \mathsf{Q}_1, f \in \mathsf{Q}_2 \ \textit{is decidable.}
\end{aligned}$$

Proof. This is a consequence of Theorem 5.

#### 6.1 Quantitative Symbolic Heaps

We now demonstrate that our approach can facilitate the automated verification of probabilistic pointer programs by providing a sample QSL fragment with a decidable entailment problem.

Recall that QSL [A] is parameterized by a set A of predicate symbols. We obtain the quantitative symbolic heap fragment of QSL by instantiating A.

Definition 4. Let A be the set of predicate symbols given by

$$\begin{aligned} \mathfrak{A} \;=\; &\{ \mathsf{true}, \mathsf{emp} \} \cup \{\, x \mapsto (y_1, \dots, y_k) \mid x, y_1, \dots, y_k \in \mathsf{Vars} \,\} \\ &\cup \{\, x = y,\ x \neq y,\ x = y \wedge \mathsf{emp},\ x \neq y \wedge \mathsf{emp} \mid x, y \in \mathsf{Vars} \,\}\ . \end{aligned}$$

Then the set QSH of quantitative symbolic heaps is given by the grammar

$$f \quad \to \quad [\psi] \;\mid\; [\pi] \cdot f + [\neg \pi] \cdot f \;\mid\; q \cdot f + (1 - q) \cdot f \;\mid\; \mathcal{S}x\colon f \;\mid\; f \star f\ . \qquad \triangle$$

Quantitative symbolic heaps naturally extend the symbolic heap fragment of separation logic. Intuitively speaking, a quantitative symbolic heap f specifies probability (sub-)distributions over (symbolic) heaps. By applying Theorem 5, we obtain the following decidability result.

Theorem 9. For loop- and allocation-free hpGCL programs C (that only perform pointer operations, no arithmetic, and have guards from the pure fragment of A) and f₁, f₂ ∈ QSH, it is decidable whether the entailment wlp⟦C⟧(f₁) |= f₂ holds.

Hence, for loop- and allocation-free programs C as above, upper bounds (in terms of quantitative symbolic heaps f₂) on the probability wlp⟦C⟧(f₁) of terminating in a given quantitative symbolic heap f₁ are decidable. We refer to Section 3.3 for an example entailment involving quantitative symbolic heaps. In the sequel, we show how to prove the above result.

Proof of Theorem 9. The proof relies on extended quantitative symbolic heaps eQSH, which include magic wands with points-to formulae on their left-hand side.

Definition 5. The set eQSH of extended quantitative symbolic heaps is given by the grammar

$$g \quad \to \quad [\psi] \;\mid\; [\pi] \cdot g + [\neg\pi] \cdot g \;\mid\; q \cdot g + (1 - q) \cdot g \;\mid\; g \star g \;\mid\; \mathcal{S}x\colon g \;\mid\; [x \mapsto (y_1, \dots, y_k)] \mathrel{-\!\!*} g\ . \qquad \triangle$$

Notice that indeed QSH ⊆ eQSH.

Lemma 2. For every loop- and allocation-free program C ∈ hpGCL without arithmetic and with guards only from the pure fragment of A, extended quantitative symbolic heaps are closed under wlp⟦C⟧, i.e.,

for all g ∈ eQSH: wlp⟦C⟧(g) ∈ eQSH .

In particular, since QSH ⊆ eQSH, we have

$$\text{for all } f \in \mathsf{QSH}\colon\quad \mathsf{wlp}\llbracket C\rrbracket\,(f) \in \mathsf{eQSH}\ .$$

Proof. By induction on the structure of loop- and allocation-free program C.

Hence, if g |= f is decidable for g ∈ eQSH and f ∈ QSH, Theorem 9 follows.

Lemma 3. For g ∈ eQSH and f ∈ QSH, it is decidable whether g |= f holds.

Proof. We employ Lemma 1 to determine two SL [A] fragments S1, S<sup>2</sup> such that S<sup>1</sup> is eQSH-admissible and S<sup>2</sup> is QSH-admissible. Then, by Theorem 8, decidability of g |= f follows from decidability of ϕ |= ψ for ϕ ∈ S<sup>1</sup> and ψ ∈ S2. For that, we exploit the equivalence

ϕ |= ψ iff ϕ ∧ ¬ψ is unsatisfiable .

The latter is decidable by [20, Theorem 3.3] since ϕ ∧ ¬ψ is equivalent to a formula of the form ∃*∀*: ϑ with ϑ quantifier-free such that no formula ϑ₁ −∗ ϑ₂ occurring in ϑ contains a universally quantified variable.

### 7 Related Work

Weakest preexpectations. Weakest precondition reasoning was established in a classical setting by Dijkstra [19] and has been extended to provide semantic foundations for probabilistic programs by Kozen [38,37] and McIver & Morgan [41], who also coined the term weakest preexpectations. Their relation to operational models is studied in [25]. Moreover, weakest preexpectation reasoning has been shown to be useful for obtaining bounds on the expected resource consumption [45] and, in particular, the expected run-time [33] of probabilistic programs.

Logics for probabilistic pointer programs. Although many algorithms rely on randomized dynamic data structures, formal reasoning about programs that are both probabilistic and heap manipulating has received scarce attention. A notable exception is the work by Tassarotti and Harper [51], who introduce a concurrent separation logic with support for probabilistic reasoning, called Polaris. Their focus is on program refinement, employing a semantic model that is based on the idea of coupling, which underlies recent work on probabilistic relational Hoare logics [4]. However, no other decision procedures targeting entailments for QSL or other logics targeting probabilistic pointer programs exist.

Leveraging SL research. As shown in Table 7, building QSL entailment checkers by employing our reduction technique requires the availability of SL fragments that support certain logical operations, and whose entailment problem is decidable. Since the inception of separation logic [29], the latter has been extensively studied. In particular, the symbolic heap fragment of SL has received a lot of attention. Table 8 gives an overview of related approaches. <sup>9</sup>

<sup>9</sup> ⋆ is always covered. Supported (Boolean or separating) connectives are marked with "+", unsupported ones with "–". "∗" means that the restrictions on the connective are more involved. "Pure" means that the connective can only appear in pure formulae and "flat" means that the quantifier needs to be on the outermost level.


Table 8. SL fragments with decidable entailment problem.

### 8 Discussion and Conclusion

We studied entailment checking in QSL by means of a reduction to entailment checking in SL. We analyzed the complexity of our approach and demonstrated its applicability by means of several examples. In particular, our reduction yields the first decidability result for probabilistic pointer program verification.

Our primary goal was to investigate the entailment problem for QSL to pave the way for automated verification of probabilistic pointer programs. Theorem 8 provides a generic result that enables building upon the large body of work dealing with classical SL entailments to obtain both theoretical and practical insights. Theoretically, Theorem 8 gives sufficient criteria to derive QSL fragments with a decidable entailment problem from a classical SL fragment. We derived a QSL fragment such that reasoning about a simple probabilistic heap-manipulating language becomes decidable. More practically, Theorem 8 allows reusing existing (possibly incomplete) SL solvers to solve the entailments derived by our construction; an empirical evaluation of how well existing solvers can deal with these entailments is an interesting direction for future work.

We believe that our fine-grained complexity analysis demonstrates that our approach can be practically feasible: the exponential blow-up in Theorem 7 stems from the number of probabilistic constructs in the given QSL formulae. We expect the number of such constructs to be small for many randomized algorithms. We remark that existing approaches on checking quantitative entailments between heap-independent expectations encounter similar exponential blow-ups (cf., [36,6]). There is thus some evidence that such exponential blow-ups do not prohibit one from automatically verifying non-trivial properties. We are not aware of work on checking quantitative entailments between expectations that avoids such exponential blow-ups.

Future work includes considering richer classes of QSL and applications of entailment checking such as k-induction [6]. Another interesting direction is the applicability of our reduction to other approaches that aim for local reasoning about the resources employed by probabilistic programs, such as [51,3,5].

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Extracting total Amb programs from proofs

Ulrich Berger¹ and Hideki Tsuiki²

¹ Swansea University, Swansea, UK, u.berger@swansea.ac.uk
² Kyoto University, Kyoto, Japan, tsuiki@i.h.kyoto-u.ac.jp

Abstract. We present a logical system CFP (Concurrent Fixed Point Logic) that supports the extraction of nondeterministic and concurrent programs that are provably total and correct. CFP is an intuitionistic first-order logic with inductive and coinductive definitions, extended by two propositional operators, B|_A (restriction, a strengthening of implication) and ↓(B) (total concurrency). The sources of the extraction are formal CFP proofs; the target is a lambda calculus with constructors and recursion, extended by a constructor Amb (for McCarthy's amb), which is interpreted operationally as globally angelic choice and is used to implement nondeterminism and concurrency. The correctness of extracted programs is proven via an intermediate domain-theoretic denotational semantics. We demonstrate the usefulness of our system by extracting a nondeterministic program that translates infinite Gray code into the signed digit representation. A noteworthy feature of our system is that the proof rules for restriction and concurrency involve variants of the classical law of excluded middle that would not be interpretable computationally without Amb.

### 1 Introduction

Nondeterministic bottom-avoiding choice is an important and useful idea. With the widespread use of hardware that supports parallel computation, it has the potential to speed up practical computation and, at the same time, it is related to computation over mathematical structures like real numbers [20,42]. On the other hand, it is not easy to apply theoretical tools like denotational semantics to nondeterministic bottom-avoiding choice [24,29], and guaranteeing correctness and totality of such programs through logical systems is a difficult task.

To explain the subtleness of the problem, let us start with an example. Suppose that M and N are partial programs that, under the conditions A and ¬A, respectively, are guaranteed to terminate and produce values satisfying specification B. Then, by executing M and N in parallel and taking the result obtained first, we should always obtain a result satisfying B. This kind of bottom-avoiding nondeterministic program is known as McCarthy's amb (ambiguous) operator [32], and we denote such a program by Amb(M, N). Amb is called the angelic choice operator and is usually studied as one of the three nondeterministic choice operators (the other two being erratic choice and demonic choice). On the other hand, we are interested in this operator not only from a theoretical point of view but also for the way it behaves as a concurrent program running on a parallel execution mechanism.

If one tries to formalize this idea naively, one faces some obstacles. Let M r B ("M realizes B") denote the fact that a program M satisfies a specification B, and let ↓(B) be the specification that can be satisfied by a concurrent program of the form Amb(M, N) that always terminates and produces a value satisfying B. Then, the above inference could be written as

$$\frac{A \to (M \operatorname{r} B) \qquad \neg A \to (N \operatorname{r} B)}{\operatorname{Amb}(M, N) \operatorname{r} \downarrow(B)}$$

However, this inference is not sound, for the following reason. Suppose that A does not hold, that is, ¬A holds. Then, the execution of N will produce a value satisfying B. But the execution of M may terminate as well, and with data that does not satisfy B, since there is no condition on M if A does not hold. Therefore, if M terminates first in the execution of Amb(M, N), then we obtain a result that may not satisfy B.

To amend this problem, we add a new operator B|_A (pronounced "B restricted to A") and consider the rule

$$\frac{M\operatorname{r}\left(B|\_{A}\right) \quad N\operatorname{r}\left(B|\_{\neg A}\right)}{\operatorname{Amb}(M,N)\operatorname{r}\downarrow(B)}\tag{1}$$

Intuitively, M r (B|_A) means two things: (1) M terminates if A holds, and (2) if M terminates, then the result satisfies B, even in the case that A does not hold. As we will see in Sect. 5.2, the above rule is derivable in classical logic and can therefore be used to prove total correctness of Amb programs.

In this paper, we go a step further and introduce a logical system CFP whose formulas can be interpreted as specifications of nondeterministic programs, although they do not talk about programs explicitly. CFP is defined by adding the two logical operators B|_A and ↓(B) to the system IFP, a logic for program extraction [12] (see also [4,9,7]). A related approach has been developed in the proof system Minlog [38,6,39]. IFP supports the extraction of lazy functional programs from inductive/coinductive proofs in intuitionistic first-order logic. It has a prototype implementation in Haskell, called Prawf [8].

We show that from a CFP-proof of a formula, both a program and a proof that the program satisfies the specification can be extracted (Soundness theorem, Theorem 3). For example, in CFP we have the rule

$$\frac{B|\_A \quad B|\_{\neg A}}{\downarrow(B)} \text{ (Conc-lem)}\tag{2}$$

which is realized by the program λa.λb.Amb(a, b), and whose correctness is expressed by the rule (1). Programs extracted from CFP proofs can be executed in Haskell, implementing Amb with the concurrent Haskell package.
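For a flavor of such an implementation, here is a minimal, self-contained sketch of Amb on top of concurrent Haskell; it is our own illustration of the idea (race both arguments and take whichever reaches a value first), not the code produced by the extraction.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Race two computations; return the result of whichever first evaluates
-- to a value (weak head normal form). The losing thread's putMVar blocks
-- on the already-filled MVar and is simply never observed.
amb :: IO a -> IO a -> IO a
amb m n = do
  out <- newEmptyMVar
  _ <- forkIO (m >>= \x -> x `seq` putMVar out x)
  _ <- forkIO (n >>= \y -> y `seq` putMVar out y)
  takeMVar out
```

Under this reading, the realizer λa.λb.Amb(a, b) of (Conc-lem) corresponds to \a b -> amb a b, and the fairness of the runtime scheduler plays the role of the fairness condition on reduction sequences introduced in Sect. 3.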

Compared with program verification, the extraction approach has the benefit that (a) the proofs from which programs are extracted take place in a formal system that is at a very high level of abstraction and therefore simpler and easier to use than a logic that formalizes concurrent programs (in particular, programs do not have to be written manually at all); (b) not only is the complete extracted program proven correct, but all its sub-programs also come with their specifications and correctness proofs, since these correspond to sub-proofs. This makes it easier to locally modify programs without the danger of compromising overall correctness.

As an application, we extract a nondeterministic program that converts infinite Gray code to signed digit representation, where infinite Gray code is a coding of real numbers by partial digit streams that are allowed to contain a ⊥, that is, a digit whose computation does not terminate [18,42]. Partiality and multi-valuedness are common phenomena in computable analysis and exact real number computation [46,30]. This case study connects these two aspects through a nondeterministic and concurrent program whose correctness is guaranteed by a CFP-proof. The extracted Haskell programs are available in the repository [3].

Organization of the paper: In Sects. 2 and 3 we present the denotational and operational semantics of a functional language with Amb and prove that they match (Thms. 1 and 2). Sects. 4 and 5 describe the formal system CFP and its realizability interpretation, on which our program extraction method is based (Thms. 3 and 5). In Sect. 6 we extract a concurrent program that converts representations of real numbers, and we study its behaviour in Sect. 7. Most proofs, unless very short, are omitted due to space limitations. Full proofs of the main results can be found in the extended version [11].

### 2 Denotational semantics of globally angelic choice

In [32], McCarthy defined the ambiguity operator amb as

$$\mathbf{amb}(x,y) = \begin{cases} x \ \left(x \neq \bot\right) \\ y \ \left(y \neq \bot\right) \\ \bot \left(x = y = \bot\right) \end{cases}$$

where ⊥ means 'undefined' and x and y are taken nondeterministically when both x and y are not ⊥. This is called locally angelic nondeterministic choice since convergence is chosen over divergence for each local call for the computation of amb(x, y). It can be implemented by executing both of the arguments in parallel and taking the result obtained first. Despite being a simple construction, amb is known to have a lot of expressive power, and many constructions of nondeterministic and parallel computation such as erratic choice, countable choice (random assignment), and 'parallel or' can be encoded through it [28]. These multifarious aspects of the operator amb are reflected by the difficulty of its mathematical treatment in denotational semantics. For example, amb is not monotonic when interpreted over powerdomains with the Egli-Milner order [14].

On the other hand, one can consider an interpretation of amb as globally angelic choice, where an argument of amb is chosen so that the whole ambient computation converges, if convergence is possible at all [17,40]. Since globally angelic choice is not defined compositionally, it is not easy to integrate it into the design of a programming language with clear denotational semantics. However, it can be easily implemented by running the whole computation for both of the arguments of amb in parallel and taking the result obtained first. Denotationally, globally angelic choice can be modelled by the Hoare powerdomain construction. However, this would not be suitable for analyzing total correctness because the ordering of the Hoare powerdomain does not discriminate between X and X ∪ {⊥} [23,24]. Instead, we consider a two-stage approach (see Sect. 2.2).

The difference between the locally and the globally angelic interpretation of amb is highlighted by the fact that the former does not commute with function application. For example, if f(0) = 0 but f(1) diverges, then amb(f(0), f(1)) will always terminate with the value 0, whereas f(amb(0, 1)) may return 0 or diverge. On the other hand, the latter term will always return 0 if amb is implemented with a globally angelic semantics. As suggested in [17], we use this commutation property to realize the globally angelic semantics.

#### 2.1 Programs and types

Our target language for program extraction is an untyped lambda calculus with a recursion operator and constructors, as in [12], but extended by an additional constructor Amb that corresponds to the globally angelic version of McCarthy's amb. This could easily be generalized to an Amb operator of any arity ≥ 2.

$$\begin{aligned} \text{Programs} \ni M, N, L, P, Q, R ::= \;& a, b, \dots, f, g \ \text{(program variables)}\\ &\mid \lambda a.\,M \;\mid\; M\,N \;\mid\; M{\downarrow}N \;\mid\; \mathbf{rec}\,M \;\mid\; \bot\\ &\mid \mathtt{Nil} \;\mid\; \mathtt{Left}(M) \;\mid\; \mathtt{Right}(M) \;\mid\; \mathtt{Pair}(M, N) \;\mid\; \mathtt{Amb}(M, N)\\ &\mid \mathbf{case}\ M\ \mathbf{of}\ \{\mathtt{Left}(a) \to L;\ \mathtt{Right}(b) \to R\}\\ &\mid \mathbf{case}\ M\ \mathbf{of}\ \{\mathtt{Pair}(a, b) \to N\}\\ &\mid \mathbf{case}\ M\ \mathbf{of}\ \{\mathtt{Amb}(a, b) \to N\} \end{aligned}$$

Denotationally, Amb is just another pairing operator; its interpretation as globally angelic choice takes effect only through its operational semantics. Though essentially a call-by-name language, our calculus also has strict application M↓N, needed for realizing the rules for restriction and the concurrency operator.

We use a, …, g for program variables to distinguish them from the variables x, y, z of the logical system CFP (Sect. 4). Nil, Left, Right, Pair, Amb are called constructors. Constructors different from Amb are called data constructors; C_d denotes the set of data constructors. Left↓M stands for (λa.Left(a))↓M, etc., and we sometimes write Left and Right for Left(Nil) and Right(Nil). Natural numbers are encoded as 0 ≝ Left, 1 ≝ Right(Left), and so on.

Although programs are untyped, programs extracted from proofs will be typable by the following system of simple recursive types:

$$\text{Types} \ni \rho, \sigma ::= \alpha \ \text{(type variables)} \;\mid\; \mathbf{1} \;\mid\; \rho \times \sigma \;\mid\; \rho + \sigma \;\mid\; \rho \Rightarrow \sigma \;\mid\; \mathbf{fix}\,\alpha\,.\,\rho \;\mid\; \mathrm{A}(\rho)$$

Here, A(ρ) is the type of programs which, if they terminate (see Sect. 3), reduce to a form Amb(M, N) with M, N : ρ. The formation of fix α . ρ has the side conditions that α occurs freely in ρ, that ρ is strictly positive in α (that is, there is no free occurrence of α in ρ in the left part of a function type), and that ρ is not of the form α or A(α). These conditions ensure, among other things, that the type transformer α ↦ ρ has a unique fixed point, which is taken as the semantics of fix α . ρ (see below). We require in A(ρ) that ρ is neither a variable nor of the form fix α₁ . … fix αₙ . A(ρ′) (n ≥ 0). This enables the interpretation of Amb as a bottom-avoiding choice operator (see the explanation below Corollary 1). We call types that satisfy all these conditions regular. An example of a regular type is the type of lazy (partial) natural numbers, nat ≝ fix α . 1 + α.


Fig. 1. Typing rules

The typing rules are listed in Fig. 1. They are valid w.r.t. the denotational semantics given in Sect. 2.2 and extend the rules given in [12]. Recursive types are equirecursive [35] in that M : fix α . ρ iff M : ρ[fix α . ρ/α].

As an example of a program consider

$$f \;\stackrel{\text{Def}}{=}\; \lambda a.\,\mathbf{case}\ a\ \mathbf{of}\ \{\mathtt{Left}(b) \to \mathtt{Left};\ \mathtt{Right}(c) \to \bot\} \tag{3}$$

which implements the function f discussed earlier, i.e., f 0 = 0 and f 1 = ⊥; f has type nat ⇒ nat. Since Amb(0, 1) has type A(nat), the application f Amb(0, 1) is not well-typed. Instead, we consider mapamb f Amb(0, 1), where mapamb : (ρ ⇒ σ) ⇒ A(ρ) ⇒ A(σ) is defined as

mapamb ≝ λf. λc. case c of {Amb(a, b) → Amb(f↓a, f↓b)}

This operator realizes the globally angelic semantics: mapamb f Amb(0, 1) reduces to Amb(f↓0, f↓1), and f↓0 and f↓1 (which are the same as f 0 and f 1 since 0 and 1 are defined) are computed concurrently, and the whole expression reduces to 0, using the operational semantics of Sect. 3. In Sect. 5, we will introduce a concurrent (or nondeterministic) version of Modus Ponens, (Conc-mp), which will automatically generate an application of mapamb.
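In terms of the amb sketch from Sect. 1, the difference between applying f before or after the choice can be reproduced as follows (our own encoding, with Int standing in for nat):

```haskell
f :: Int -> Int
f 0 = 0
f n = f n + 1   -- diverges on every argument other than 0

-- Globally angelic reading, as realized by mapamb: push f under the
-- choice before racing, so the diverging branch f 1 can never win.
globallyAngelic :: IO Int
globallyAngelic = amb (return $! f 0) (return $! f 1)   -- always 0

-- Locally angelic reading: race the bare choices first, apply f after;
-- this may pick 1 and then diverge.
locallyAngelic :: IO Int
locallyAngelic = fmap f (amb (return 0) (return 1))
```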

#### 2.2 Denotational semantics

The denotational semantics has two phases: Phase I interprets programs in a Scott domain D defined by the following recursive domain equation

$$D = \big(\mathtt{Nil} + \mathtt{Left}(D) + \mathtt{Right}(D) + \mathtt{Pair}(D \times D) + \mathtt{Amb}(D \times D) + \mathtt{Fun}(D \to D)\big)_\bot$$

where + and × denote separated sum and cartesian product, and the operation ·_⊥ adds a least element ⊥ ([21] is a recommended reference for domain theory and the solution of domain equations). A closed program M denotes an element ⟦M⟧ ∈ D as defined in Fig. 2. Note that Amb is interpreted (like Pair) as a simple pairing operator.

A type is interpreted as a subdomain, which is a subset of D that is downward closed and closed under suprema of bounded subsets. We use the following operations on subdomains:

$$\begin{aligned} &(X+Y)\_{\perp} \stackrel{\text{Def}}{=} \{ \mathtt{Left}(a) \mid a \in X \} \cup \{ \mathtt{Right}(b) \mid b \in Y \} \cup \{ \bot \} \\ &(X \times Y)\_{\perp} \stackrel{\text{Def}}{=} \{ \mathtt{Pair}(a,b) \mid a \in X, b \in Y \} \cup \{ \bot \} \\ &(X \Rightarrow Y)\_{\perp} \stackrel{\text{Def}}{=} \{ \mathtt{Fun}(f) \mid f: D \to D \text{ continuous}, \forall a \in X (f(a) \in Y) \} \cup \{ \bot \}. \end{aligned}$$

Through the semantics in Fig. 2, closed programs denote elements of D and closed types denote subdomains of D such that the typing rules (Fig. 1) are sound.

In Phase II we assign to every a ∈ D a set data(a) ⊆ D that reveals the role of Amb as a choice operator. The relation 'd ∈ data(a)' is defined (coinductively) as the largest relation satisfying

$$\begin{array}{lcl} d \in \mathrm{data}(a) & \stackrel{\nu}{=} & (a = \mathtt{Amb}(a', b') \wedge a' \neq \bot \wedge d \in \mathrm{data}(a'))\ \vee \\ & & (a = \mathtt{Amb}(a', b') \wedge b' \neq \bot \wedge d \in \mathrm{data}(b'))\ \vee \\ & & (a = \mathtt{Amb}(\bot, \bot) \wedge d = \bot)\ \vee \\ & & \bigvee_{C \in \mathrm{C_d}} \big( a = C(\vec{a}\,') \wedge d = C(\vec{d}\,') \wedge \bigwedge_{i} d'_{i} \in \mathrm{data}(a'_{i}) \big)\ \vee \\ & & (a = \mathtt{Fun}(f) \wedge d = a)\ \vee\ (a = d = \bot)\ . \end{array}$$

Now, every closed program M denotes the set data(⟦M⟧) ⊆ D containing all possible globally angelic choices derived from its denotation in D. For example, data(Amb(0, 1)) = {0, 1} and, for f as defined in (3), we have, as expected, data(mapamb f Amb(0, 1)) = data(Amb(0, ⊥)) = {0}. In Sect. 3 we will define an operational semantics whose fair execution sequences starting with a regular-typed program M compute exactly the elements in data(⟦M⟧).

$$\begin{aligned}
\llbracket a\rrbracket\eta &= \eta(a)\\
\llbracket \lambda a.\,M\rrbracket\eta &= \mathtt{Fun}(f) \ \text{ where } f(d) = \llbracket M\rrbracket\eta[a \mapsto d]\\
\llbracket M\,N\rrbracket\eta &= f(\llbracket N\rrbracket\eta) \ \text{ if } \llbracket M\rrbracket\eta = \mathtt{Fun}(f)\\
\llbracket M{\downarrow}N\rrbracket\eta &= f(\llbracket N\rrbracket\eta) \ \text{ if } \llbracket M\rrbracket\eta = \mathtt{Fun}(f) \text{ and } \llbracket N\rrbracket\eta \neq \bot\\
\llbracket \mathbf{rec}\,M\rrbracket\eta &= \text{the least fixed point of } f \ \text{ if } \llbracket M\rrbracket\eta = \mathtt{Fun}(f)\\
\llbracket C(M_1, \dots, M_k)\rrbracket\eta &= C(\llbracket M_1\rrbracket\eta, \dots, \llbracket M_k\rrbracket\eta) \quad (C \text{ a constructor, including } \mathtt{Amb})\\
\llbracket \mathbf{case}\ M\ \mathbf{of}\ \{\vec{Cl}\}\rrbracket\eta &= \llbracket K\rrbracket\eta[\vec{a} \mapsto \vec{d}\,] \ \text{ if } \llbracket M\rrbracket\eta = C(\vec{d}\,) \text{ and } C(\vec{a}) \to K \in \vec{Cl}\\
\llbracket M\rrbracket\eta &= \bot \ \text{ in all other cases; in particular } \llbracket \bot\rrbracket\eta = \bot
\end{aligned}$$

Here η is an environment that assigns elements of D to variables.

$$\begin{aligned}
D^{\zeta}_{\alpha} &= \zeta(\alpha), \qquad D^{\zeta}_{\mathbf{1}} = \{\mathtt{Nil}, \bot\},\\
D^{\zeta}_{\mathbf{fix}\,\alpha\,.\,\rho} &= \bigcap \big\{X \lhd D \ \big|\ D^{\zeta[\alpha \mapsto X]}_{\rho} \subseteq X\big\} \quad (X \lhd D \text{ means } X \text{ is a subdomain of } D)\\
D^{\zeta}_{\mathrm{A}(\rho)} &= \{\mathtt{Amb}(a, b) \mid a, b \in D^{\zeta}_{\rho}\} \cup \{\bot\}\\
D^{\zeta}_{\rho\,\circ\,\sigma} &= (D^{\zeta}_{\rho} \circ D^{\zeta}_{\sigma})_{\bot} \quad (\circ \in \{+, \times, \Rightarrow\})
\end{aligned}$$

Here ζ is a type environment that assigns subdomains of D to type variables.

Fig. 2. Denotational semantics of programs (Phase I) and types

Example 1. Let M = rec λa.Amb(Left(Nil), Right(a)). M is a closed program of type fix α . A(1 + α). We have data(M) = {0, 1, 2, . . .}. Thus, we can express countable choice (random assignment) with Amb.
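Continuing the Haskell sketch of amb, the countable choice of Example 1 can be mimicked by iterating the binary race, where each step either stops with the current number or keeps counting (again our own illustration):

```haskell
-- Analogue of rec λa. Amb(Left(Nil), Right(a)): nondeterministically
-- returns some natural number, mirroring data(M) = {0, 1, 2, ...}.
countableChoice :: IO Int
countableChoice = go 0
  where go n = amb (return n) (go (n + 1))
```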

Lemma 1. If a ∈ D belongs to a regular type, then the following are equivalent: (1) a ∈ {⊥, Amb(⊥, ⊥)}; (2) {⊥} = data(a); (3) ⊥ ∈ data(a).

### 3 Operational semantics

We define a small-step operational semantics that, in the limit, reduces each closed program M nondeterministically to an element of data(⟦M⟧) (Thm. 1). If M has a regular type, the converse holds as well: for every d ∈ data(⟦M⟧) there exists a reduction sequence for M computing d in the limit (Thm. 2). If M denotes a compact datum, then the limit is obtained after finitely many reductions. In the following, all programs are assumed to be closed.

#### 3.1 Reduction to weak head normal form

A program is called a weak head normal form (w.h.n.f.) if it begins with a constructor (including Amb) or has the form λa.M. We define inductively a small-step leftmost-outermost reduction relation ⇝ on programs, where C ranges over constructors.

$$\begin{aligned}
&\text{(s-i)}\ \ (\lambda a.M)\,N \rightsquigarrow M[N/a]\\
&\text{(s-ii)}\ \ \frac{M \rightsquigarrow M'}{M\,N \rightsquigarrow M'\,N}\\
&\text{(s-iii)}\ \ (\lambda a.M){\downarrow}N \rightsquigarrow M[N/a] \quad \text{if } N \text{ is a w.h.n.f.}\\
&\text{(s-iv)}\ \ \frac{M \rightsquigarrow M'}{M{\downarrow}N \rightsquigarrow M'{\downarrow}N} \quad \text{if } N \text{ is a w.h.n.f.}\\
&\text{(s-v)}\ \ \frac{N \rightsquigarrow N'}{M{\downarrow}N \rightsquigarrow M{\downarrow}N'}\\
&\text{(s-vi)}\ \ \mathbf{rec}\,M \rightsquigarrow M\,(\mathbf{rec}\,M)\\
&\text{(s-vii)}\ \ \mathbf{case}\ C(\vec{M})\ \mathbf{of}\ \{\dots;\ C(\vec{b}) \to N;\ \dots\} \rightsquigarrow N[\vec{M}/\vec{b}\,]\\
&\text{(s-viii)}\ \ \frac{M \rightsquigarrow M'}{\mathbf{case}\ M\ \mathbf{of}\ \{\vec{Cl}\} \rightsquigarrow \mathbf{case}\ M'\ \mathbf{of}\ \{\vec{Cl}\}}\\
&\text{(s-ix)}\ \ M \rightsquigarrow \bot \quad \text{if } M \text{ is } \bot\text{-like (see below)}
\end{aligned}$$

⊥-like programs are those whose syntactic form immediately implies that they denote ⊥; more precisely, they are of the form ⊥, C(M⃗) N, C(M⃗)↓N, or case M of {. . .} where M is a lambda-abstraction or of the form C(M⃗) such that no clause in {. . .} is of the form C(a⃗) → N. W.h.n.f.s are never ⊥-like, and the only typeable ⊥-like program is ⊥.

Lemma 2. (1) ⇝ is deterministic (i.e., M ⇝ M′ for at most one M′). (2) ⇝ preserves the denotational semantics (i.e., ⟦M⟧ = ⟦M′⟧ if M ⇝ M′). (3) M is a ⇝-normal form iff M is a w.h.n.f. (4) [Adequacy Lemma] If ⟦M⟧ ≠ ⊥, then there is a w.h.n.f. V s.t. M ⇝<sup>∗</sup> V.

#### 3.2 Making choices

Next, we define the reduction relation ⇝<sup>c</sup> ('c' for 'choice') that reduces the arguments of Amb in parallel.

$$\begin{aligned}
&\text{(c-i)}\ \frac{M \leadsto M'}{M \stackrel{\mathrm{c}}{\leadsto} M'}
&&\text{(c-ii)}\ \frac{M_1 \leadsto M_1'}{\mathbf{Amb}(M_1, M_2) \stackrel{\mathrm{c}}{\leadsto} \mathbf{Amb}(M_1', M_2)}
&&\text{(c-ii')}\ \frac{M_2 \leadsto M_2'}{\mathbf{Amb}(M_1, M_2) \stackrel{\mathrm{c}}{\leadsto} \mathbf{Amb}(M_1, M_2')}\\[4pt]
&\text{(c-iii)}\ \mathbf{Amb}(M_1, M_2) \stackrel{\mathrm{c}}{\leadsto} M_1 \ \text{ if } M_1 \text{ is a w.h.n.f.}
&&\text{(c-iii')}\ \mathbf{Amb}(M_1, M_2) \stackrel{\mathrm{c}}{\leadsto} M_2 \ \text{ if } M_2 \text{ is a w.h.n.f.}
\end{aligned}$$

From this definition and Lemma 2, it is immediate that M is a ⇝<sup>c</sup>-normal form iff M is a deterministic weak head normal form (d.w.h.n.f.), that is, a w.h.n.f. that does not begin with Amb. Finally, we define a reduction relation ⇝<sup>p</sup> that reduces the arguments of data constructors in parallel.

$$\begin{aligned}
&\text{(p-i)}\ \frac{M \stackrel{\mathrm{c}}{\leadsto} M'}{M \stackrel{\mathrm{p}}{\leadsto} M'}
&&\text{(p-ii)}\ \frac{M_i \stackrel{\mathrm{p}}{\leadsto} M_i' \quad (i = 1, \dots, k)}{C(M_1, \dots, M_k) \stackrel{\mathrm{p}}{\leadsto} C(M_1', \dots, M_k')}\ (C \in \mathcal{C}_{\mathrm{d}})
&&\text{(p-iii)}\ \lambda a.\,M \stackrel{\mathrm{p}}{\leadsto} \lambda a.\,M
\end{aligned}$$

Every (closed) program reduces under ⇝<sup>p</sup> (easy proof by structural induction). For example, Nil ⇝<sup>p</sup> Nil by (p-ii). In the following, all ⇝<sup>p</sup>-reduction sequences are assumed to be infinite.

We call a ⇝<sup>p</sup>-reduction sequence unfair if, intuitively, from some point on, one side of an Amb term is permanently reduced but not the other; more precisely, we define inductively when a sequence M<sub>1</sub> ⇝<sup>p</sup> M<sub>2</sub> ⇝<sup>p</sup> . . . is unfair.

A ⇝<sup>p</sup>-reduction sequence is fair if it is not unfair.

Intuitively, reduction by ⇝<sup>p</sup> proceeds as follows: a program L is head reduced by ⇝ to a w.h.n.f. L′, and if L′ is a data constructor term, all arguments are reduced in parallel by (p-ii). If L′ has the form Amb(M, N), two concurrent threads are invoked for the reductions of M and N in parallel, and the one reduced to a w.h.n.f. first is used. Fairness corresponds to the fact that the 'speed' of each thread is positive, which means, in particular, that no thread can block another. Note that ⇝<sup>c</sup> is not used for the reductions of M and N in (s-ii), (s-iv), (s-v) and (s-viii). This means that ⇝<sup>c</sup> is applied only to the outermost redex. Also, (c-ii) is defined through ⇝, not ⇝<sup>c</sup>, and thus no thread creates new threads. This ability to bound the number of threads was not available in an earlier version of this language [5] (see also the discussion in Sect. 8.1).

#### 3.3 Computational adequacy: Matching denotational and operational semantics

We define M<sup>D</sup> ∈ D by structural induction on programs:

$$\begin{aligned}
C(M_1, \ldots, M_k)_D &= C(M_{1D}, \ldots, M_{kD}) &&(C \in \mathcal{C}_{\mathrm{d}})\\
(\lambda a.M)_D &= \llbracket \lambda a.M\rrbracket\\
M_D &= \bot &&\text{otherwise}
\end{aligned}$$

Since clearly M ⇝<sup>p</sup> N implies M<sub>D</sub> ⊑ N<sub>D</sub>, for every computation sequence M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . . the sequence ((M<sub>i</sub>)<sub>D</sub>)<sub>i∈N</sub> is increasing and therefore has a least upper bound in D. Intuitively, M<sub>D</sub> is the part of M that has been fully evaluated to a data.

A computation of M is an infinite fair sequence M = M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . ..

Theorem 1 (Computational Adequacy: Soundness). For every computation M = M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . ., ⊔<sub>i∈N</sub>(M<sub>i</sub>)<sub>D</sub> ∈ data(⟦M⟧).

The converse does not hold in general, i.e. d ∈ data(⟦M⟧) does not necessarily imply d = ⊔<sub>i∈N</sub>((M<sub>i</sub>)<sub>D</sub>) for some computation of M. For example, for M <sup>Def</sup>= rec λa. Amb(a, ⊥) (for which ⟦M⟧ = ⟦Amb(M, ⊥)⟧) one sees that d ∈ data(⟦M⟧) for every d ∈ D, while M ⇝<sup>p∗</sup> M and M<sub>D</sub> = ⊥. But M has the type fix α . A(α), which is not regular (see Sect. 2.1). For programs of a regular type, the converse of Thm. 1 holds.

Theorem 2 (Computational Adequacy: Completeness). If M has a regular type, then for every d ∈ data(⟦M⟧) there is a computation M = M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . . with d = ⊔<sub>i∈N</sub>((M<sub>i</sub>)<sub>D</sub>).

A computation M = M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . . is productive if some M<sub>i</sub> is a deterministic w.h.n.f. Clearly, this is the case iff ⊔<sub>i∈N</sub>((M<sub>i</sub>)<sub>D</sub>) ≠ ⊥. Therefore, by the Adequacy Theorem and Lemma 1:

Corollary 1. For a program M of regular type, the following are equivalent.


The corollary does not hold without the regularity condition. For example, M = Amb(Amb(Nil, Nil), Amb(⊥, ⊥)) can be reduced to M<sub>1</sub> = Amb(⊥, ⊥), which then repeats forever, whereas M can also be reduced to Nil. McCarthy's amb operator is bottom-avoiding in the sense that when it can terminate, it always terminates. Corollary 1 guarantees a similar property for our globally angelic choice operator Amb.

### 4 CFP (Concurrent Fixed Point Logic)

In [12], the system IFP (Intuitionistic Fixed Point Logic) was introduced. IFP is an intuitionistic first-order logic with strictly positive inductive and coinductive definitions, from whose proofs programs can be extracted. CFP is obtained by adding to IFP two propositional operators, B|<sub>A</sub> and ⇊(B), that facilitate the extraction of nondeterministic and concurrent programs.

#### 4.1 Syntax

CFP is defined relative to a many-sorted first-order language. CFP-formulas have the form A ∧ B, A ∨ B, A → B, ∀x A, ∃x A, s = t (s, t terms of the same sort), P(t⃗) (for a predicate P and terms t⃗ of fitting arities), as well as B|<sub>A</sub> (restriction) and ⇊(B) (concurrency). Predicates are either predicate constants (as given by the first-order language), or predicate variables (denoted X, Y, . . .), or comprehensions λx⃗ A (where A is a formula and x⃗ is a tuple of first-order variables), or fixed points µ(Φ) and ν(Φ) (least fixed point, aka inductive predicate, and greatest fixed point, aka coinductive predicate) where Φ is a strictly positive (s.p.) operator. Operators are of the form λX Q where X is a predicate variable and Q is a predicate of the same arity as X. λX Q is s.p. if every free occurrence of X in Q is at a strictly positive position, that is, at a position that is not in the left part of an implication. We identify (λx⃗ A)(t⃗) with A[t⃗/x⃗], where [t⃗/x⃗] means capture-avoiding substitution.

The following syntactic properties of expressions (i.e., formulas, predicates and operators) will be important. A Harrop expression is one that contains at strictly positive positions neither free predicate variables nor disjunctions (∨) nor restrictions (|) nor concurrency (⇊). An expression is non-Harrop if it is not Harrop; it is non-computational (nc) if it contains neither disjunctions, nor restrictions, nor concurrency, nor free predicate variables. Every nc-formula is Harrop but not conversely. Finally, we define, recursively, when a formula is strict: Harrop formulas and disjunctions are strict. A non-Harrop conjunction is strict if either both conjuncts are non-Harrop or it is a conjunction of a Harrop formula and a strict formula. A non-Harrop implication is strict if the premise is non-Harrop. Formulas of the form ✸x A (✸ ∈ {∀, ∃}) or ✷(λXλx⃗ A) (✷ ∈ {µ, ν}) are strict if A is strict. Formulas of other forms (e.g., B|<sub>A</sub>, ⇊(A), X(t⃗)) are not strict. The significance of these definitions is that Harropness ensures that (a proof of) the formula will have no computational content. Strictness ensures, among other things, that ⊥ is not a realizer (see Sect. 5).

As an additional requirement for formulas to be well-formed, we demand that in formulas of the form B|<sub>A</sub> or ⇊(B), B must be strict.

Notation: P(t⃗) will also be written t⃗ ∈ P, and if Φ is λX Q, then Φ(P) stands for Q[P/X]. Definitions (on the meta level) of the form P <sup>Def</sup>= ✷(Φ) (✷ ∈ {µ, ν}), where Φ = λX λx⃗ A, will usually be written P(x⃗) <sup>✷</sup>= A[P/X]. We write P ⊆ Q for ∀x⃗ (P(x⃗) → Q(x⃗)), ∀x ∈ P A for ∀x (P(x) → A), and ∃x ∈ P A for ∃x (P(x) ∧ A). ¬A <sup>Def</sup>= A → False where False <sup>Def</sup>= µ(λX X).

#### 4.2 Proof rules

The proof rules of CFP contain those of IFP, which are the usual natural deduction rules for intuitionistic first-order logic with equality (see e.g. [53]), plus the following rules for induction and coinduction, where Φ is a s.p. operator:

$$\frac{}{\Phi(\mu(\Phi)) \subseteq \mu(\Phi)}\ \mathrm{CL}(\Phi) \qquad \frac{\Phi(P) \subseteq P}{\mu(\Phi) \subseteq P}\ \mathrm{IND}(\Phi, P)$$

$$\frac{}{\nu(\Phi) \subseteq \Phi(\nu(\Phi))}\ \mathrm{COCL}(\Phi) \qquad \frac{P \subseteq \Phi(P)}{P \subseteq \nu(\Phi)}\ \mathrm{COIND}(\Phi, P)$$

The rules for restriction and concurrency are as follows (with the earlier mentioned condition that in formulas of the form B|<sub>A</sub> or ⇊(B), B must be strict):

$$\frac{A \to (B_0 \lor B_1) \quad \neg A \to B_0 \land B_1}{(B_0 \lor B_1)|_A}\ \text{Rest-intro} \qquad \frac{B|_A \quad B \to (B'|_A)}{B'|_A}\ \text{Rest-bind} \qquad \frac{B}{B|_A}\ \text{Rest-return}$$

$$\frac{A' \to A \quad B|_A}{B|_{A'}}\ \text{Rest-antimon} \qquad \frac{B|_A \quad A}{B}\ \text{Rest-mp} \qquad \frac{}{B|_{\mathrm{False}}}\ \text{Rest-efq} \qquad \frac{B|_A}{B|_{\neg\neg A}}\ \text{Rest-stab}$$

$$\frac{B|_A \quad B|_{\neg A}}{\downdownarrows(B)}\ \text{Conc-lem} \qquad \frac{A}{\downdownarrows(A)}\ \text{Conc-return} \qquad \frac{A \to B \quad \downdownarrows(A)}{\downdownarrows(B)}\ \text{Conc-mp}$$

In Sect. 5 we will prove that each of these rules is realized by a program from our programming language in Sect. 2.

#### 4.3 Tarskian semantics, axioms and classical logic

Although we are mainly interested in the realizability interpretation of CFP, it is important that all proof rules of CFP are also valid w.r.t. a standard Tarskian semantics, provided we identify B|<sub>A</sub> with A → B and ⇊(B) with B.

Like IFP, CFP is parametric in a set A of axioms, which have to be closed nc-formulas. The significance of the restriction to nc-formulas is that these are identical to their (formalized) realizability interpretation (see Sect. 5); in particular, Tarskian and realizability semantics coincide for them. Axioms should be chosen such that they are true in an intended Tarskian model. Since Tarskian semantics admits classical logic, this means that a fair amount of classical logic is available through axioms. For example, for each closed nc-formula A(x⃗), stability, ∀x⃗ (¬¬A(x⃗) → A(x⃗)), can be postulated as an axiom. In addition, the rule (Conc-lem) is a variant of the classical law of excluded middle, and (Rest-stab) permits stability for arbitrary right arguments of restriction.

In our examples and case studies we will use an instance of CFP with a sort for real numbers and some standard axiomatization of real closed fields formulated as a set of nc-formulas. In particular, we will freely use constants, operations and relations such as 0, 1, +, −, ∗, <, | · |, / and assume their expected properties as axioms (expressed as nc-formulas).

### 5 Program extraction

We define a realizability interpretation of CFP that will enable us to extract concurrent programs from proofs. Since the interpretation extends the one in IFP [12], it suffices to define realizability for the restriction and concurrency operators and prove that their proof rules are realizable (Sect. 5.2). All definitions and proofs of this section can be carried out in a formal system RCFP (realizability logic for CFP), which is CFP without | and ⇊ but with classical logic and an extended first-order language that contains the earlier introduced programs and types as terms and the typing relation ':' as a predicate constant, and describes their semantics through suitable axioms. In particular, RCFP proves the correctness of extracted programs (Soundness Theorem 3). Since it only matters that RCFP is classically correct (no realizability interpretation is applied to it), the details of RCFP do not matter and are therefore omitted.

#### 5.1 Realizability

Realizability for CFP is formalized in RCFP and follows the pattern in [12]. For every non-Harrop CFP-formula A, a type τ(A) and an RCFP-predicate R(A) are defined such that R(A) is a subset of τ(A) (more precisely, RCFP proves ∀a (R(A)(a) → a : τ(A)), hence the interpretation of R(A) is a subset of D<sub>τ(A)</sub>). We often write a r A for R(A)(a) ('a realizes A') and r A for ∃a R(A)(a) ('A is realizable').

Since Harrop formulas (see Sect. 4.1) have trivial computational content, it only matters whether they are realizable or not. Therefore, we define for a Harrop formula A an RCFP-formula H(A) that represents the realizability interpretation of A, but with suppressed realizer. Formally, we define by simultaneous recursion, for every Harrop CFP-expression E an RCFP-expression H(E), and for every non-Harrop CFP-expression E an RCFP-expression R(E). It is convenient to set, in addition, for Harrop formulas A, τ(A) <sup>Def</sup>= 1 and R(A) <sup>Def</sup>= λa (a = Nil ∧ H(A)), so that τ(A) and R(A) are defined for all CFP-formulas.

The complete definition, which is shown in Fig. 3, assumes that to each CFP predicate variable X there are assigned a fresh type variable α<sub>X</sub> and a fresh RCFP predicate variable X̃ with one extra argument for domain elements. Furthermore, to define realizability for the fixed points of a Harrop operator λX P, we use the notation

$$\mathbf{H}\_{X}(P) \stackrel{\text{Def}}{=} \mathbf{H}(P[\hat{X}/X])[X/\hat{X}]$$

where X̂ is a fresh predicate constant assigned to the (non-Harrop) predicate variable X. This is motivated by the fact that λX P is Harrop iff P[X̂/X] is. The idea is that H<sub>X</sub>(P) is the same as H(P) but considering X as a (Harrop) predicate constant.

For Harrop formulas A: τ(A) = 1 and R(A) = λa (a = Nil ∧ H(A)).

τ(E) for non-Harrop expressions E:

$$\begin{aligned}
\tau(P(\vec{t}\,)) &= \tau(P) \qquad \tau(A \lor B) = \tau(A) + \tau(B)\\
\tau(A \land B) &= \begin{cases}\tau(A) \times \tau(B) & (A, B \text{ non-Harrop})\\ \tau(A) & (B \text{ Harrop})\\ \tau(B) & (A \text{ Harrop})\end{cases}\\
\tau(A \to B) &= \begin{cases}\tau(A) \Rightarrow \tau(B) & (A \text{ non-Harrop})\\ \tau(B) & (A \text{ Harrop})\end{cases}\\
\tau(B|_A) &= \tau(B) \qquad \tau(\downdownarrows(B)) = \mathbf{A}(\tau(B))\\
\tau(\Diamond x\,A) &= \tau(A)\ (\Diamond \in \{\forall, \exists\}) \qquad \tau(X) = \alpha_X \qquad \tau(P) = \mathbf{1}\ (P \text{ a predicate constant})\\
\tau(\lambda\vec{x}\,A) &= \tau(A) \qquad \tau(\Box(\lambda X\,P)) = \mathbf{fix}\,\alpha_X\,.\,\tau(P)\ (\Box \in \{\mu, \nu\})
\end{aligned}$$

R(E) for non-Harrop expressions E:

$$\begin{aligned}
R(P(\vec{t}\,)) &= \lambda a\,(R(P)(\vec{t}, a))\\
R(A \lor B) &= \lambda c\,(\exists a\,(c = \mathbf{Left}(a) \land a \mathrel{\mathbf{r}} A) \lor \exists b\,(c = \mathbf{Right}(b) \land b \mathrel{\mathbf{r}} B))\\
R(A \land B) &= \begin{cases}\lambda c\,(\exists a, b\,(c = \mathbf{Pair}(a, b) \land a \mathrel{\mathbf{r}} A \land b \mathrel{\mathbf{r}} B)) & (A, B \text{ non-Harrop})\\ \lambda a\,(a \mathrel{\mathbf{r}} A \land H(B)) & (B \text{ Harrop})\\ \lambda b\,(H(A) \land b \mathrel{\mathbf{r}} B) & (A \text{ Harrop})\end{cases}\\
R(A \to B) &= \begin{cases}\lambda c\,(c : \tau(A) \Rightarrow \tau(B) \land \forall a\,(a \mathrel{\mathbf{r}} A \to (c\,a) \mathrel{\mathbf{r}} B)) & (A \text{ non-Harrop})\\ \lambda b\,(b : \tau(B) \land (H(A) \to b \mathrel{\mathbf{r}} B)) & (A \text{ Harrop})\end{cases}\\
R(B|_A) &= \lambda b\,(b : \tau(B) \land (\mathbf{r}\,A \to b \ne \bot) \land (b \ne \bot \to b \mathrel{\mathbf{r}} B))\\
R(\downdownarrows(B)) &= \lambda c\,\exists a, b\,(c = \mathbf{Amb}(a, b) \land a, b : \tau(B) \land (a \ne \bot \lor b \ne \bot)\\
&\qquad\qquad \land\ (a \ne \bot \to a \mathrel{\mathbf{r}} B) \land (b \ne \bot \to b \mathrel{\mathbf{r}} B))\\
R(\Diamond x\,A) &= \lambda a\,(\Diamond x\,(a \mathrel{\mathbf{r}} A))\ (\Diamond \in \{\forall, \exists\}) \qquad R(X) = \tilde{X}\\
R(\lambda\vec{x}\,A) &= \lambda(\vec{x}, a)\,(a \mathrel{\mathbf{r}} A) \qquad R(\Box(\lambda X\,P)) = \Box(\lambda\tilde{X}\,R(P))\ (\Box \in \{\mu, \nu\})
\end{aligned}$$

H(E) for Harrop expressions E:

$$\begin{aligned}
H(P(\vec{t}\,)) &= H(P)(\vec{t}\,) \qquad H(A \land B) = H(A) \land H(B)\\
H(A \to B) &= \begin{cases}\mathbf{r}\,A \to H(B) & (A \text{ non-Harrop})\\ H(A) \to H(B) & (A \text{ Harrop})\end{cases}\\
H(\Diamond x\,A) &= \Diamond x\,H(A)\ (\Diamond \in \{\forall, \exists\}) \qquad H(P) = P\ (P \text{ a predicate constant})\\
H(\lambda\vec{x}\,A) &= \lambda\vec{x}\,H(A) \qquad H(\Box(\lambda X\,P)) = \Box(\lambda X\,H_X(P))\ (\Box \in \{\mu, \nu\})
\end{aligned}$$

Fig. 3. Realizability interpretation of CFP

To see that the definitions make sense, note that a formula P(t⃗) is Harrop iff P is, predicate variables and disjunctions are always non-Harrop, a conjunction is Harrop iff both conjuncts are, an implication A → B is Harrop iff B is, and ∀x A, ∃x A, λx⃗ A are Harrop iff A is. The rationale and correctness of realizability for restriction and concurrency are discussed in Sect. 5.2.

If a formula A is nc, then it is Harrop (see Sect. 4.1 for definitions) but in addition A and H(A) are syntactically identical. In contrast, in general, a Harrop formula A neither implies nor is implied by H(A).

Lemma 3. For every CFP-formula A:


Proof. (1) and (2) are easily proved by structural induction on formulas. (3) follows from the fact that if A is of the form ⇊(B), then B must be strict. (4) is proved by (3) and Corollary 1 (3).

Remarks and examples. The main difference between our interpretation and the usual realizability interpretation of intuitionistic number theory lies in the interpretation of quantifiers. While in number theory variables range over natural numbers, which have concrete, computationally meaningful representations, we make no general assumption of this kind, since it is our goal to extract programs from proofs in abstract mathematics. This is the reason why we interpret quantifiers uniformly: a realizer of a universal statement must be independent of the quantified variable, and a realizer of an existential statement does not contain a witness. A similar uniform interpretation of quantifiers can be found in the Minlog system. The usual definition of realizability of quantifiers in intuitionistic number theory can be recovered by relativization to an inductively defined predicate N describing natural numbers in unary representation:

$$\mathbf{N}(x) \stackrel{\mu}{=} x = 0 \lor \mathbf{N}(x-1)$$

which is shorthand for N Def = µ(λX λx (x = 0 ∨ X(x − 1))). The type τ (N) assigned to N is the recursive type of unary natural numbers

$$\mathbf{nat} \stackrel{\text{Def}}{=} \mathbf{fix} \,\alpha \,. 1 + \alpha \,.$$

Realizability for N works out as

$$a \mathrel{\mathbf{r}} \mathbf{N}(x) \stackrel{\mu}{=} (a = \mathbf{Left} \land x = 0) \lor \exists b\,(a = \mathbf{Right}(b) \land b \mathrel{\mathbf{r}} \mathbf{N}(x - 1)).$$

Thus, N(0), N(1), N(2) are realized by Left (i.e., Left(Nil)), Right(Left), Right(Right(Left)), and so on. Therefore, the (unique) realizer of N(n) is the unary representation of n. Other ways of defining natural numbers may induce different representations. An example of a formula with interesting realizers is the formula expressing that the sum of two natural numbers is a natural number,

$$\forall x, y \ (\mathbf{N}(x) \to \mathbf{N}(y) \to \mathbf{N}(x+y)).\tag{4}$$

It has type nat ⇒ nat ⇒ nat and is realized by a function f that, given realizers of N(x) and N(y), returns a realizer of N(x + y); hence f performs addition of unary numbers.
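For concreteness, the behaviour of such a realizer can be sketched in Haskell over the data type D of Sect. 7; the names unary and add are ours, and the code is only meant to illustrate what a realizer of (4) does.

```
-- A minimal sketch (assuming the data type D from Sect. 7; names are ours):
-- unary numerals and the addition behaviour of a realizer of (4).
unary :: Int -> D
unary 0 = Le Nil             -- realizer of N(0) is Left (i.e. Left(Nil))
unary n = Ri (unary (n - 1)) -- realizer of N(n) is Right(realizer of N(n-1))

add :: D -> D -> D
add (Le _) b = b             -- x = 0: x + y = y
add (Ri a) b = Ri (add a b)  -- x = x' + 1: x + y = (x' + y) + 1
add _      _ = error "not a unary numeral"
```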

Example 2 (Non-terminating realizer). Let

$$\mathbf{D}(x) \stackrel{\text{Def}}{=} x \neq 0 \to \left(x \le 0 \lor x \ge 0\right).$$

Then τ (D) = 2 where 2 = 1 + 1, and a r D(x) unfolds to

$$a : \mathbf{2} \land (x \neq 0 \to (a = \mathbf{Left} \land x \le 0) \lor (a = \mathbf{Right} \land x \ge 0)).$$

Therefore, D(x) is realized by Left if x < 0 and by Right if x > 0. If x = 0, any element of 2 realizes D(x), in particular ⊥. Hence nonterminating programs, which, by Lemma 3 (4), denote ⊥, realize D(x). In contrast, strict formulas are never realized by a nonterminating program, as shown in Lemma 3 (2).

#### 5.2 Partial correctness and concurrency

We explain realizability for B|<sup>A</sup> and (B) and show that the associated proof rules are sound.

As we have seen in Example 2, an implication A → B where A is a Harrop formula is realized by a 'conditionally correct' program M: if H(A) holds, then M realizes B, but otherwise no condition is imposed on M; in particular, M may be non-terminating. However, M may also terminate and yet fail to realize B. This means that termination of a realizer of A → B is not a sufficient condition for correctness (correctness meaning to realize B). But, as explained in the Introduction, this is what we need to concurrently realize a formula. The definition of realizability for the new logical operator | (shown in Fig. 3) achieves exactly this: a realizer of a restriction B|<sub>A</sub> is 'partially correct' in the sense that it is correct iff it terminates. By Lemma 3 (4), for a program M to realize B|<sub>A</sub> means that M has type τ(B), and if A is realizable then all computations of M are productive; conversely, if M has a productive computation then M always (that is, independently of the realizability of A) realizes B.

To highlight the difference between restriction and implication in a more concrete situation, consider (A ∨ B)|<sub>A</sub> vs. A → (A ∨ B) where A is Harrop. Clearly Left realizes A → (A ∨ B), but in general (A ∨ B)|<sub>A</sub> is not realizable. Note that Left even realizes A →<sup>u</sup> (A ∨ B) where →<sup>u</sup> is Schwichtenberg's uniform implication [39]; hence restriction is also different from uniform implication.

The intuition of Amb(a, b) realizing ⇊(A) is that it is a pair of candidate realizers at least one of which is productive, and each productive one is a realizer.

Lemma 4. The rules for restriction and concurrency are realizable.

Proof. The table below shows the realizers of each rule for the (most interesting) case where the conclusion is non-Harrop, using the definitions

$$\begin{aligned}
\mathsf{leftright} &\stackrel{\text{Def}}{=} \lambda b.\,\mathbf{case}\ b\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \mathbf{Left};\ \mathbf{Right}(\_) \to \mathbf{Right}\},\\
\mathsf{mapamb} &\stackrel{\text{Def}}{=} \lambda f.\,\lambda c.\,\mathbf{case}\ c\ \mathbf{of}\ \{\mathbf{Amb}(a, b) \to \mathbf{Amb}(f{\downarrow}a,\ f{\downarrow}b)\}.
\end{aligned}$$

Proofs of their correctness are in [11]. For (Rest-intro), (Rest-stab), and (Conclem), classical logic is needed. Here, we set a seq b Def = (λc. b)↓a.

- (Rest-intro): if b r (A → (B<sub>0</sub> ∨ B<sub>1</sub>)) and H(¬A → B<sub>0</sub> ∧ B<sub>1</sub>), then (leftright b) r (B<sub>0</sub> ∨ B<sub>1</sub>)|<sub>A</sub> (A, B<sub>0</sub>, B<sub>1</sub> Harrop).
- (Rest-bind): if a r B|<sub>A</sub> and f r (B → (B′|<sub>A</sub>)), then (f↓a) r B′|<sub>A</sub> (B non-Harrop); (a seq f) r B′|<sub>A</sub> (B Harrop).
- (Rest-return): if a r B, then a r B|<sub>A</sub>.
- (Rest-antimon): if r (A′ → A) and a r B|<sub>A</sub>, then a r B|<sub>A′</sub>.
- (Rest-mp): if b r B|<sub>A</sub> and r A, then b r B.
- (Rest-efq): ⊥ r B|<sub>False</sub>.
- (Rest-stab): if b r B|<sub>A</sub>, then b r B|<sub>¬¬A</sub>.
- (Conc-lem): if a r B|<sub>A</sub> and b r B|<sub>¬A</sub>, then Amb(a, b) r ⇊(B).
- (Conc-return): if a r A, then Amb(a, ⊥) r ⇊(A).
- (Conc-mp): if f r (A → B) and c r ⇊(A), then (mapamb f c) r ⇊(B) (A non-Harrop); Amb(f, ⊥) r ⇊(B) (A Harrop).
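For concreteness, the two programs defined above can be rendered in Haskell over the data type D of Sect. 7 (the names leftrightH, mapambH and sapp are ours); strict application f↓x is modelled by forcing x to w.h.n.f. with seq.

```
-- A sketch only (assuming D from Sect. 7; names are ours):
leftrightH :: D -> D
leftrightH b = case b of { Le _ -> Le Nil; Ri _ -> Ri Nil }

mapambH :: (D -> D) -> D -> D
mapambH f c = case c of { Amb a b -> Amb (sapp f a) (sapp f b) }
  where sapp g x = x `seq` g x  -- g ↓ x: diverges if x does
-- non-matching cases are left undefined: they denote ⊥, as in the paper
```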

Lemma 5. CFP derives the following rules. The rules are displayed together with their extracted realizers.

$$(1)\ \frac{a \mathrel{\mathbf{r}} B_0|_{A_0} \qquad b \mathrel{\mathbf{r}} B_1|_{A_1} \qquad \neg\neg(A_0 \lor A_1)}{\mathbf{Amb}(\mathbf{Left}{\downarrow}a,\ \mathbf{Right}{\downarrow}b) \mathrel{\mathbf{r}} \downdownarrows(B_0 \lor B_1)}$$

$$(2)\ \frac{a \mathrel{\mathbf{r}} (B \lor C)|_D}{\mathbf{case}\ a\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \bot;\ \mathbf{Right}(b) \to b\} \mathrel{\mathbf{r}} C|_{D \land \neg B}} \quad (C\ \text{strict})$$

Example 3. Continuing Example 2, we modify D(x) to

$$\mathbf{D}'(x) \stackrel{\text{Def}}{=} (x \le 0 \lor x \ge 0)|_{x \ne 0}.$$

A realizer of D′(x), which has type 2, may or may not terminate (non-termination may occur when x = 0). However, in case of termination, the result is guaranteed to realize x ≤ 0 ∨ x ≥ 0. Note that a realizer of D(x) also has type 2 and may or may not terminate, but there is no guarantee that it realizes x ≤ 0 ∨ x ≥ 0 when it does terminate. Nevertheless, D ⊆ D′ follows from (Rest-intro) (since ¬(x ≠ 0) implies x ≤ 0 ∧ x ≥ 0) and is realized by leftright. D′ ⊆ D holds trivially.

Example 4. This builds on Examples 2 and 3 and will be used in Sect. 6. Let t(x) = 1 − 2|x| and consider the predicates E(x) <sup>Def</sup>= D(x) ∧ D(t(x)) and

$$\mathbf{ConSD}(x) \stackrel{\text{Def}}{=} \downdownarrows((x \le 0 \lor x \ge 0) \lor |x| \le 1/2).$$

We show E ⊆ ConSD: From E(x) and Example 3 we get D′(x) and D′(t(x)), which unfold to (x ≤ 0 ∨ x ≥ 0)|<sub>x≠0</sub> and (|x| ≥ 1/2 ∨ |x| ≤ 1/2)|<sub>|x|≠1/2</sub>. By Lemma 5 (2), (|x| ≤ 1/2)|<sub>|x|<1/2</sub>. Since ¬¬((x ≠ 0) ∨ |x| < 1/2), we have ConSD(x) by Lemma 5 (1). Moreover, τ(E) = 2 × 2 and τ(ConSD) = A(3) where 3 <sup>Def</sup>= 2 + 1. The extracted realizer of E ⊆ ConSD is

$$\begin{aligned}
\mathsf{conSD} \stackrel{\text{Def}}{=} \lambda c.\,\mathbf{case}\ c\ \mathbf{of}\ \{\mathbf{Pair}(a, b) \to \mathbf{Amb}(&\mathbf{Left}{\downarrow}(\mathsf{leftright}\ a),\\
&\mathbf{Right}{\downarrow}(\mathbf{case}\ b\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \bot;\ \mathbf{Right}(\_) \to \mathbf{Nil}\}))\}
\end{aligned}$$

of type τ(E ⊆ ConSD) = 2 × 2 ⇒ A(3). Explanation of this program: a is Left or Right depending on whether x ≤ 0 or x ≥ 0, but may also be ⊥ if x = 0. b is Left or Right depending on whether |x| ≤ 1/2 or |x| ≥ 1/2, but may also be ⊥ if |x| = 1/2. Since x = 0 and |x| = 1/2 do not happen simultaneously, by evaluating a and b concurrently we obtain one of them, from which we can determine one of the cases x ≤ 0, x ≥ 0, or |x| ≤ 1/2.
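A possible Haskell rendition of conSD, again only a sketch (it assumes D from Sect. 7 and leftrightH from the previous sketch; bot is a diverging value playing the role of ⊥):

```
conSDH :: D -> D
conSDH c = case c of
  Pair a b -> Amb (sapp Le (leftrightH a))                           -- Left ↓ (leftright a)
                  (sapp Ri (case b of { Le _ -> bot; Ri _ -> Nil })) -- Right ↓ (case b of ...)
  where
    sapp g x = x `seq` g x  -- strict application g ↓ x
    bot = bot               -- divergence, playing the role of ⊥
```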

#### 5.3 Soundness and program extraction

As we did in the above example, one can extract from any CFP-proof of a formula a program that realizes it. This property is called the Soundness Theorem of realizability. Its proof is the same as for IFP [12] but extended by the rules for the new logical operators, whose realizability we proved in Sect. 5.2.

Theorem 3 (Soundness Theorem I). From a CFP-proof of a formula A from a set of axioms one can extract a program M of type τ(A) (which is a regular type) such that RCFP proves M r A from the same axioms.

In CFP, we have a second Soundness Theorem which ensures the correctness of all results of fair computation paths of an extracted program M. More precisely, correctness of M means that all d ∈ data(⟦M⟧) realize the formula A<sup>−</sup> obtained from A by deleting all concurrency operators ⇊. Since A<sup>−</sup> is an IFP formula, the theorem relates the realizability interpretations of CFP and IFP.

However, such a correctness result only holds for formulas whose realizers do not contain Amb in the scope of a lambda-abstraction. This restriction is enforced by the following syntactic admissibility condition: An expression is called admissible if it contains neither free predicate variables nor restrictions (|), and all occurrences of concurrency (⇊) are strictly positive and at non-F-positions. Here, the notion of a subexpression at F-position in a CFP expression is defined inductively by three rules: (i) A subexpression of the form A → B where A and B are both non-Harrop is at F-position. (ii) A subexpression ✷(λX Q) (✷ ∈ {µ, ν}) is at F-position if Q has a free occurrence of X at F-position. (iii) A subexpression within a subexpression at F-position is at F-position.

For example, ⇊(µ(λX λx (x = 0 ∨ ∀y (N(y) → X(f(x, y)))))) is admissible, whereas µ(λX λx (x = 0 ∨ ∀y (N(y) → ⇊(X(f(x, y)))))) is not. The predicate ConSD in Example 4 is admissible.

Theorem 4 (Faithfulness). If a ∈ D realizes an admissible formula A, then all d ∈ data(a) realize A<sup>−</sup>.

Theorems 3 and 4 imply:

Theorem 5 (Soundness Theorem II). From a CFP proof of an admissible formula A from a set of axioms one can extract a program M : τ(A) such that RCFP proves ∀d ∈ data(⟦M⟧) d r A<sup>−</sup> from the same set of axioms.

Thms. 5 and 1, together with classical soundness (see Sect. 4.3), yield:

Theorem 6 (Program Extraction). From a CFP proof of an admissible formula A from a set of axioms one can extract a program M : τ(A) such that for any computation M = M<sub>0</sub> ⇝<sup>p</sup> M<sub>1</sub> ⇝<sup>p</sup> . . ., ⊔<sub>i∈N</sub>(M<sub>i</sub>)<sub>D</sub> realizes A<sup>−</sup> in every model of the axioms.

### 6 Application

As our main case study, we extract a concurrent conversion program between two representations of real numbers in [-1, 1], the signed digit representation and infinite Gray code. In the following, we also write d : p for Pair(d, p).

The signed digit representation is an extension of the usual binary expansion that uses the set SD Def = {−1, 0, 1} of signed digits. The following predicate S(x) expresses coinductively that x has a signed digit representation.

$$\mathbf{S}(x) \stackrel{\nu}{=} |x| \le 1 \land \exists \, d \in \mathbf{SD} \, \mathbf{S}(2x - d)\,,$$

with SD(d) Def = (d = −1 ∨ d = 1) ∨ d = 0. The type of S is τ (S) = 3 <sup>ω</sup> where 3 Def = (1 + 1) + 1 and δ <sup>ω</sup> Def = fix α . δ × α, and its realizability interpretation is

$$p\mathbf{r} \, \mathbf{S}(x) \stackrel{\nu}{=} |x| \le 1 \land \exists \, d \in \mathbf{SD} \, \exists p' \; (p = d : p' \land p' \mathbf{r} \, \mathbf{S}(2x - d))$$

which expresses indeed that p is a signed digit representation of x, that is, p = d<sub>0</sub> : d<sub>1</sub> : . . . with d<sub>i</sub> ∈ SD and x = Σ<sub>i</sub> d<sub>i</sub> 2<sup>−(i+1)</sup>. Here, we identified the three digits d = −1, 1, 0 with their realizers Left(Left), Left(Right), Right.
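To illustrate the representation numerically, here is a hypothetical helper (ours, not extracted from any proof) that evaluates the first n digits of a signed digit stream:

```
-- Illustration only: the rational approximated by the first n signed digits,
-- x ≈ Σ_i d_i * 2^(-(i+1)).
approxSD :: Int -> [Int] -> Rational
approxSD n ds = sum [ fromIntegral d / 2 ^ (i + 1) | (i, d) <- zip [0 .. n - 1] ds ]
```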

Infinite Gray code ([18,42]) is an almost redundancy-free representation of real numbers in [−1, 1] using the partial digits {−1, 1, ⊥}. A stream p = d<sub>0</sub> : d<sub>1</sub> : . . . of such digits is an infinite Gray code of x iff d<sub>i</sub> = sgb(t<sup>i</sup>(x)), where t is the tent function t(x) = 1 − |2x| and sgb is a multi-valued version of the sign function for which sgb(0) is any element of {−1, 1, ⊥} (see also Example 4). One easily sees that t<sup>i</sup>(x) = 0 for at most one i. Therefore, this coding has little redundancy in that the code is uniquely determined and total except for at most one digit, which may be undefined. Hence, infinite Gray code is accessible through concurrent computation with two threads. The coinductive predicate

$$\mathbf{G}(x) \overset{\nu}{=} |x| \le 1 \land \mathbf{D}(x) \land \mathbf{G}(\mathbf{t}(x)) \,,$$

where D is the predicate D(x) <sup>Def</sup>= x ≠ 0 → (x ≤ 0 ∨ x ≥ 0) from Example 2, expresses that x has an infinite Gray code (identifying −1, 1, ⊥ with Left, Right, ⊥). Indeed, τ(G) = 2<sup>ω</sup> and

$$p \mathrel{\mathbf{r}} \mathbf{G}(x) \overset{\nu}{=} |x| \le 1 \land \exists d, p'\,(p = d : p' \land (x \ne 0 \to d \mathrel{\mathbf{r}} (x \le 0 \lor x \ge 0)) \land p' \mathrel{\mathbf{r}} \mathbf{G}(\mathbf{t}(x))).$$
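To make the digit-wise reading d<sub>i</sub> = sgb(t<sup>i</sup>(x)) concrete, here is a small illustration (a hypothetical helper of ours, not part of the extraction) computing Gray-code digits of an exactly represented rational, with divergence playing the role of ⊥:

```
tent :: Rational -> Rational
tent x = 1 - abs (2 * x)

grayDigits :: Int -> Rational -> [Int]
grayDigits n x = [ sgb (iterate tent x !! i) | i <- [0 .. n - 1] ]
  where
    sgb y | y < 0     = -1
          | y > 0     = 1
          | otherwise = sgb y  -- t^i(x) = 0: the digit is undefined (⊥)
```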


In [12], the inclusion S ⊆ G was proved in IFP and a sequential conversion function from signed digit representation to infinite Gray code extracted. On the other hand, a program producing a signed digit representation from an infinite Gray code cannot access its input sequentially from left to right since it will diverge when it accesses ⊥. Therefore, the program needs to evaluate two consecutive digits concurrently to obtain at least one of them. With this idea in mind, we define a concurrent version of S as

$$\mathbf{S}_2(x) \overset{\nu}{=} |x| \le 1 \land \downdownarrows(\exists\,d \in \mathbf{SD}\ \mathbf{S}_2(2x - d))$$

with τ(S<sub>2</sub>) = fix α . A(3 × α), and prove G ⊆ S<sub>2</sub> in CFP (Thm. 7). Then we can extract from the proof a concurrent algorithm that converts infinite Gray code to signed digit representation. Note that, while the formula G ⊆ S<sub>2</sub> is not admissible (it contains ⇊ at an F-position), the formula S<sub>2</sub>(x) is. Therefore, if for some real number x we can prove G(x), the proof of G ⊆ S<sub>2</sub> will give us a proof of S<sub>2</sub>(x) to which Theorem 6 applies. Since S<sub>2</sub>(x)<sup>−</sup> is S(x), this means that we have a nondeterministic program all of whose fair computation paths result in a (deterministic) signed digit representation of x.

Now we carry out the proof of G ⊆ S2. For simplicity, we use pattern matching on constructor expressions for defining functions. For example, we write f (a : t) Def = M for f Def = λx. case x of {Pair(a, t) → M}.

The crucial step in the proof is accomplished by Example 4, since it yields nondeterministic information about the first digit of the signed digit representation of x, as expressed by the predicate

$$\mathbf{ConSD}(x) \stackrel{\text{Def}}{=} \downdownarrows((x \le 0 \lor x \ge 0) \lor |x| \le 1/2).$$

Lemma 6. G ⊆ ConSD.

Proof. G(x) implies D(x) and D(t(x)), and hence ConSD, by Example 4.

The extracted program gscomp : 2<sup>ω</sup> ⇒ A(3) uses the program conSD defined in Example 4:

$$\mathsf{gscomp}\ (a : b : p) \stackrel{\text{Def}}{=} \mathsf{conSD}\ (\mathbf{Pair}(a, b)).$$

We also need the following closure properties of G:

Lemma 7. Assume G(x). Then:

$$\begin{array}{l} \text{(1) } \mathbf{G}(\mathbf{t}(x)), \; \mathbf{G}(|x|), \; and \; \mathbf{G}(-x);\\ \text{(2) } \; if \; x \ge 0, \; then \; \mathbf{G}(2x-1) \; and \; \mathbf{G}(1-x);\\ \text{(3) } \; if \; |x| \le 1/2, \; then \; \mathbf{G}(2x). \end{array}$$

Proof. This follows directly from the definition of G and elementary properties of the tent function t. The extracted programs consist of simple manipulations of the given digit stream realizing G(x), concerning only its tail and first two digits. No nondeterminism is involved. A detailed proof is in [11].

Theorem 7. G ⊆ S2.

Proof. By coinduction. Setting A(x) Def = ∃d ∈ SD G(2x − d), we have to show

$$\mathbf{G}(x) \to |x| \le 1 \land \downdownarrows(A(x)). \tag{5}$$

Assume G(x). Then ConSD(x), by Lemma 6. Therefore, it suffices to show

$$\mathbf{ConSD}(x) \to \downdownarrows(A(x)) \tag{6}$$

which, with the help of the rule (Conc-mp), can be reduced to

$$(x \le 0 \lor x \ge 0 \lor |x| \le 1/2) \to A(x). \tag{7}$$

(7) can be easily shown using Lemma 7: If x ≤ 0, then t(x) = 2x + 1. Since G(t(x)), we have G(2x − d) for d = −1. If x ≥ 0, then G(2x − d) for d = 1 by (2). If |x| ≤ 1/2, then G(2x − d) for d = 0 by (3).

The program onedigit : 2<sup>ω</sup> ⇒ 3 ⇒ 3 × 2<sup>ω</sup> extracted from the proof of (7) from the assumption G(x) is

$$\begin{aligned}
&\mathsf{onedigit}\ (a : b : p)\ c \stackrel{\text{Def}}{=} \mathbf{case}\ c\ \mathbf{of}\ \{\\
&\qquad \mathbf{Left}(d) \to \mathbf{case}\ d\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \mathbf{Pair}(-1,\ b : p);\ \mathbf{Right}(\_) \to \mathbf{Pair}(1,\ (\mathsf{not}\ b) : p)\};\\
&\qquad \mathbf{Right}(\_) \to \mathbf{Pair}(0,\ a : (\mathsf{nh}\ p))\}\\
&\mathsf{not}\ a \stackrel{\text{Def}}{=} \mathbf{case}\ a\ \mathbf{of}\ \{\mathbf{Left}(\_) \to \mathbf{Right};\ \mathbf{Right}(\_) \to \mathbf{Left}\}\\
&\mathsf{nh}\ (a : t) \stackrel{\text{Def}}{=} (\mathsf{not}\ a) : t
\end{aligned}$$

This is lifted to a proof of (6) using mapamb (the realizer of (Conc-mp)). Hence the extracted realizer s : 2<sup>ω</sup> ⇒ A(3 × 2<sup>ω</sup>) of (5) is

$$\mathsf{s}\ p \stackrel{\text{Def}}{=} \mathsf{mapamb}\ (\mathsf{onedigit}\ p)\ (\mathsf{gscomp}\ p).$$

The main program extracted from the proof of Theorem 7 is obtained from the step function s by a special form of recursion, commonly known as coiteration. Formally, we use the realizer of the coinduction rule COIND(Φ<sub>S2</sub>, G), where Φ<sub>S2</sub> is the operator used to define S<sub>2</sub> as a largest fixed point, i.e.

$$\Phi_{\mathbf{S}_2} \stackrel{\text{Def}}{=} \lambda X\,\lambda x\,(|x| \le 1 \land \downdownarrows(\exists d \in \mathbf{SD}\ X(2x - d))).$$

The realizer of coinduction (whose correctness is shown in [12]) also uses a program mon : (α<sub>X</sub> ⇒ α<sub>Y</sub>) ⇒ A(3 × α<sub>X</sub>) ⇒ A(3 × α<sub>Y</sub>) extracted from the canonical proof of the monotonicity of Φ<sub>S2</sub>:

$$\begin{aligned} \mathsf{mon} \ f \ p &= \mathsf{mapamb} \ (\mathsf{mon}' \ f) \ p \\ \text{where} \ \mathsf{mon}' \ f \ (a:t) &= a:f \ t \end{aligned}$$

Putting everything together, we obtain the infinite Gray code to signed digit representation conversion program gtos : 2<sup>ω</sup> ⇒ fix α . A(3 × α):

$$\mathsf{gtos} \stackrel{\mathrm{rec}}{=} (\mathsf{mon}\ \mathsf{gtos}) \circ \mathsf{s}$$

Using the equational theory of RIFP, one can simplify gtos to the following program. The soundness of the RIFP axioms with respect to the denotational semantics and the adequacy property of our language guarantee that these two programs are equivalent.

```
gtos (a : b : t) = Amb(
      (case a of {Left(_) → −1 : gtos (b : t);
                  Right(_) → 1 : gtos ((not b) : t)}),
      (case b of {Left(_) → ⊥;
                  Right(_) → 0 : gtos (a : (nh t))}))
```
In [43], a Gray-code to signed digit conversion program was written with the locally angelic Amb operator, which evaluates the first two cells a and b in parallel and continues the computation based on the value obtained first. In that program, if the value of b is obtained first and it is Left, then a has to be evaluated again. With globally angelic choice, as the above program shows, one can simply discard that value and use the value of the other thread. Globally angelic choice can also speed up the computation if the two threads of Amb are computed in parallel and the whole computation based on the value of Amb obtained second terminates first.

### 7 Implementation

Since our programming language can be viewed as a fragment of Haskell, we can execute the extracted program in Haskell by implementing the Amb operator with the Haskell concurrency module. We comment on the essential points of the implementation. The full code is available from [3].

First, we define the domain D as a Haskell data type:

```
data D = Nil | Le D | Ri D | Pair D D | Fun (D -> D) | Amb D D
```
The ⇝-reduction, which preserves the Phase I denotational semantics and reduces a program to a w.h.n.f. with the leftmost-outermost reduction strategy, coincides with reduction in Haskell. Thus, we can identify extracted programs with Haskell programs of type D that compute that phase.

The ⇝<sup>c</sup> reduction that concurrently calculates the arguments of Amb can be implemented with the Haskell concurrency module. In [19], the (locally angelic) amb operator was implemented in Glasgow Distributed Haskell (GDH). Here, we implemented it with the Haskell libraries Control.Concurrent and Control.Exception as a simple function ambL :: [b] -> IO b that concurrently evaluates the elements of a list and writes the first result obtained to a mutable variable, along the lines sketched below.
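A minimal sketch of such a function (the actual implementation is part of the artifact [3] and may differ in details):

```
import Control.Concurrent (forkIO, killThread)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Exception (evaluate)
import Control.Monad (void)

-- Race the evaluation of all list elements; the first to reach w.h.n.f. wins.
ambL :: [b] -> IO b
ambL xs = do
  result <- newEmptyMVar
  tids <- mapM (\x -> forkIO (evaluate x >>= void . tryPutMVar result)) xs
  v <- takeMVar result   -- blocks until some thread finishes
  mapM_ killThread tids  -- the remaining threads are no longer needed
  return v
```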

Finally, the function ed :: D -> IO D produces an element of data(a) from a ∈ D by activating ambL for the case of Amb a b. It corresponds to ⇝<sup>p</sup> reduction, though it computes the arguments of a pair sequentially. This function is nondeterministic since the result of executing ed (Amb a b) depends on which of the arguments a, b delivers a result first. The set of all possible results of ed a corresponds to the set data(a); a sketch follows.
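A sketch of ed along these lines (using ambL from the previous sketch; the artifact's version [3] may differ, and for infinite data this sketch produces its result only in the limit):

```
ed :: D -> IO D
ed Nil        = return Nil
ed (Le d)     = Le <$> ed d
ed (Ri d)     = Ri <$> ed d
ed (Pair a b) = Pair <$> ed a <*> ed b  -- pairs are evaluated sequentially
ed (Fun f)    = return (Fun f)
ed (Amb a b)  = ambL [a, b] >>= ed      -- resolve Amb by racing its arguments
```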

We executed the program extracted in Section 6 with ed. As we have noted, the number 0 has three Gray codes (i.e., realizers of G(0)): a = ⊥ : 1 : (−1)<sup>ω</sup>, b = 1 : 1 : (−1)<sup>ω</sup>, and c = −1 : 1 : (−1)<sup>ω</sup>. On the other hand, the set of signed digit representations of 0 is A ∪ B ∪ C where A = {0<sup>ω</sup>}, B = {0<sup>k</sup> : 1 : (−1)<sup>ω</sup> | k ≥ 0}, and C = {0<sup>k</sup> : (−1) : 1<sup>ω</sup> | k ≥ 0}, i.e., A ∪ B ∪ C is the set of realizers of S(0). One can calculate

$$\mathsf{gtos}(a) = \mathbf{Amb}(\bot,\ 0 : \mathbf{Amb}(\bot,\ 0 : \dots))$$

and data(gtos(a)) = A. Thus gtos(a) is reduced uniquely to 0 : 0 : . . . by the operational semantics. On the other hand, one can calculate data(gtos(b)) = A ∪ B and data(gtos(c)) = A ∪ C. They are subsets of the set of realizers of S(0), as Theorem 5 says, and gtos(b) is reduced to an element of A ∪ B, as Theorem 6 says.

We also wrote a program that produces a {−1, 1, ⊥}-sequence in which the speed of computation of each digit (−1 and 1) can be controlled. Applying gtos and then ed to such sequences yields the expected results.

### 8 Conclusion

We introduced the logical system CFP by extending IFP [12] with two propositional operators, B|<sub>A</sub> and ⇊(A), and developed a method for extracting nondeterministic and concurrent programs that are provably total and satisfy their specifications.

While IFP already imports classical logic through nc-axioms that need only be true classically, in CFP the access to classical logic is considerably widened through the rule (Conc-lem) which, when interpreting B|<sub>A</sub> as A → B and identifying ⇊(A) with A, is constructively invalid but has nontrivial nondeterministic computational content.

We applied our system to extract a concurrent translation from infinite Gray code to the signed digit representation, thus demonstrating that this approach is not only about program extraction 'in principle' but can be used to solve nontrivial concurrent computation problems through program extraction.

After an overview of related work, we conclude with some ideas for follow-up research.

#### 8.1 Related work

The CSL 2016 paper [5] is an early attempt to capture concurrency via program extraction and can be seen as the starting point of our work. Our main

advances, compared to that paper, are that concurrent execution of partial programs under a globally angelic choice operator is now captured logically, through the new connective B|<sub>A</sub>, and that we can express bounded nondeterminism with complete control over the number of threads, while [5] modelled nondeterminism with countably infinite branching, which is unsuitable or overkill for most applications. Furthermore, our approach has a typing discipline, a sound and complete small-step reduction, and the ability to switch between global and local nondeterminism (see Sect. 8.2 below).

As for the study of angelic nondeterminism, it is not easy to develop a denotational semantics as we noted in Section 2, and it has been mainly studied from the operational point of view, e.g., notions of equivalence or refinement of processes and associated proof methods, which are all fundamental for correctness and termination [28,33,27,37,16,29]. Regarding imperative languages, Hoare logic and its extensions have been applied to nondeterminism and proving totality from the very beginning ([2] is a good survey on this subject). [31] studies angelic nondeterminism with an extension of Hoare Logic.

There are many logical approaches to concurrency. An example is an approach based on extensions of Reynolds' separation logic [36] to the concurrent and higher-order setting [34,13,25]. Logics for session types and process calculi [45,15,26] form another approach that is oriented more towards the formulae-as-types/proofs-as-programs [22,44] or rather proofs-as-processes paradigm [1]. All these approaches provide highly specialized logics and expression languages that are able to model and reason about concurrent programs with fine control of memory and access management and complex communication patterns.

#### 8.2 Modelling locally angelic choice

We remarked earlier that our interpretation of Amb corresponds to globally angelic choice. Surprisingly, locally angelic choice can be modelled by a slight modification of the restriction and concurrency operators: we simply disjoin the formula under the operator with the logically neutral False, more precisely, we set B|′<sub>A</sub> <sup>Def</sup>= (B ∨ False)|<sub>A</sub> and ⇊′(A) <sup>Def</sup>= ⇊(A ∨ False). Then the proof rules in Sect. 4, with | and ⇊ replaced by |′ and ⇊′, respectively, but without the strictness condition, are theorems of CFP. To see that the operator ⇊′ indeed corresponds to locally angelic choice, it is best to compare the realizers of the rule (Conc-mp) for ⇊ and ⇊′. Assume A, B are non-Harrop and f is a realizer of A → B. Then, if Amb(a, b) realizes ⇊(A), then Amb(f↓a, f↓b) realizes ⇊(B). This means that to choose, say, the left argument of Amb as a result, a must terminate and so must the ambient (global) computation f↓a. On the other hand, the program extracted from the proof of (Conc-mp) for ⇊′ takes a realizer Amb(a, b) of ⇊′(A) and returns Amb((up ∘ f ∘ down)↓a, (up ∘ f ∘ down)↓b) as a realizer of ⇊′(B), where up and down are the realizers of B → (B ∨ False) and (A ∨ False) → A, namely up <sup>Def</sup>= λa. Left(a) and down <sup>Def</sup>= λc. case c of {Left(a) → a}. Now, to choose the left argument of Amb, it is enough for a to terminate, since the non-strict operation up will immediately produce a w.h.n.f. without invoking the ambient computation. By redefining the realizers of B|<sub>A</sub> and ⇊(A) as the realizers of B|′<sub>A</sub> and ⇊′(A), and the realizers of the rules of CFP as those extracted from the proofs of the corresponding rules for |′ and ⇊′, we obtain another realizability interpretation of CFP that models locally angelic choice.

#### 8.3 Markov's principle with restriction

So far, (Rest-intro) is the only rule that derives a restriction in a non-trivial way. However, there are other such rules, for example

$$\frac{\forall x \in \mathbf{N}(P(x) \lor \neg P(x))}{\exists x \in \mathbf{N}P(x)|\_{\exists x \in \mathbf{N}P(x)}} \text{ Rest-Markov}$$

If P(x) is Harrop, then (Rest-Markov) is realized by minimization. More precisely, if f realizes ∀x ∈ N (P(x) ∨ ¬P(x)), then min(f) realizes the formula ∃x ∈ N P(x)|<sub>∃x∈N P(x)</sub>, where min(f) computes the least k ∈ N such that f k = Left if such a k exists, and does not terminate otherwise. One might expect as conclusion of (Rest-Markov) the formula ∃x ∈ N P(x)|<sub>¬¬∃x∈N P(x)</sub>. However, because of (Rest-stab) (which is realized by the identity), this wouldn't make a difference. The rule (Rest-Markov) can be used, for example, to prove that Harrop predicates that are recursively enumerable (r.e.) and have r.e. complements are decidable. From the proof one can extract a program that concurrently searches for evidence of membership in the predicate and in its complement; a sketch of the minimization realizer follows.
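A minimal sketch of such a minimization realizer (using D and unary from the earlier sketches; the name minreal is ours):

```
-- Search for the least witness k with f(k) = Left; diverge if there is none,
-- which (Rest-Markov) permits.
minreal :: (D -> D) -> D
minreal f = go 0
  where
    go k = case f (unary k) of
      Le _ -> unary k     -- found: return the unary numeral of k
      _    -> go (k + 1)  -- keep searching
```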

#### 8.4 Further directions for research

The undecidability of equality of real numbers, which is at the heart of our case study on infinite Gray code, is also a critical point in Gaussian elimination where one needs to find a non-zero entry in a non-singular matrix. As shown in [10], our approach makes it possible to search for such 'pivot elements' in a concurrent way. A further promising research direction is to extend the work on coinductive presentations of compact sets in [41] to the concurrent setting.

Acknowledgements. This work was supported by IRSES Nr. 612638 CORCON and Nr. 294962 COMPUTAL of the European Commission, the JSPS Core-to-Core Program, A. Advanced Research Networks, and JSPS KAKENHI 15K00015, as well as the Marie Curie RISE project CID (H2020-MSCA-RISE-2016-731143).

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Why3-do: The Way of Harmonious Distributed System Proofs

Cláudio Belo Lourenço<sup>1</sup> and Jorge Sousa Pinto<sup>2</sup>

<sup>1</sup> Huawei Research Centre, United Kingdom, claudio.lourenco@huawei.com <sup>2</sup> HASLab/INESC TEC & Universidade do Minho, Portugal, jsp@di.uminho.pt

Abstract. We study principles and models for reasoning inductively about properties of distributed systems, based on programmed atomic handlers equipped with contracts. We present the Why3-do library, leveraging a state-of-the-art software verifier for reasoning about distributed systems based on our models. A number of examples involving invariants containing existential and nested quantifiers (including Dijkstra's self-stabilizing systems) illustrate how the library promotes contract-based modular development, abstraction barriers, and automated proofs.

### 1 Introduction

The formal verification of properties of distributed algorithms and protocols is an important and notoriously difficult activity. The dominant approaches are: (i) Automatic exploration of the state space, known as model checking [10,4], a technique that can be used for both safety and liveness properties, expressed using variants of temporal logic. Its application to distributed systems is a consolidated area that has produced many significant results. However, the state explosion phenomenon means that in practice only systems of modest size can be verified. (ii) Deductive reasoning based on the use of inductive invariants. A number of tools [26,18,13] now exist for the verification of single-threaded systems based on first-order logic (FOL), loop invariants, and contracts, with solid theoretical foundations [21,16]. Reasoning about distributed systems using inductive invariants was, until recently, mostly a pen-and-paper activity, but tools like Verdi [42], IronFleet [20], and Ivy [34] have made significant advances to this state of things (see Section 7 for details). Relying on external provers (and in the case of IronFleet, on the Dafny verifier to check the sequential code), these tools support verification of asynchronous message-passing systems based on atomic handlers, reusable network/fault models, and different abstract specification mechanisms.

Based on the same principles, we propose in this paper a conceptual contractbased framework for reasoning about distributed systems, as well as the Why3-do library for the Why3 verifier [18]. Distinctive aspects of our approach include the following:

– It allows for reasoning about distributed systems using a standard program verification tool (rather than a dedicated tool or a proof assistant), and methods and techniques that are standard for sequential software.


Contributions of the Paper. We contribute to the state of the art of distributed system verification, and in general to software verification with Why3: (i) We introduce (Section 3) principles for modular verification of distributed systems based on clonable models, capturing in a uniform way different system semantics. Each model declares a set of handlers equipped with contracts.

(ii) We present (sections 4, 5, 6) a Why3 library with different system models and fault semantics. A concrete system is defined by cloning a model and defining its handlers and invariants. Handler implementations are required to respect the contracts declared in the model, which in particular ensures inductiveness of the invariants. Although Dafny contracts can also be used in IronFleet, the novelty in Why3-do is the presence of dedicated contracts in the library models, that are used to automatically generate verification conditions when cloning.

(iii) We introduce (Section 5) a model-independent specification mechanism based on system traces, to act as abstraction barrier between specification (observable properties) and implementation. Traces are a common specification mechanism; the novelty here is the support for modular development through the use of model-independent clonable specification modules; different implementations can be given for a specification, using different system models.

(iv) We present (Section 6) a locally-shared memory model illustrating how our approach is applied uniformly beyond message-passing models. As far as we are aware Verdi, IronFleet and Ivy work with message-passing systems only.

(v) We formalize one of Dijkstra's self-stabilizing systems [15] and verify its closure (safety) and convergence (liveness) properties using Why3-do. This verification is of independent interest: our proof of convergence, using a measure function, takes advantage of SMT solvers and significantly improves on previous, much more laborious efforts using proof assistants (Section 6).

(vi) We propose two techniques for reasoning with inductive invariants containing existential and nested quantifiers: stepwise bounded validation (Section 6), and the use of dual definitions containing both code and logic (sections 4 and 6). Together with Why3's ability to interact with multiple solvers with different strengths, dual definitions allow for more robust and natural specifications, as well as for easier automated proofs, without the need for tricks like quantifier hiding [20]. Both techniques are explained by means of examples.

```
module MapList
use int.Int, list.List, list.Mem, list.Length, list.NthNoOpt
val function f (x:int) : int requires {x >= 0} ensures {result >= 0}
predicate nonNeg (l:list int) = forall x :int. mem x l -> x >= 0
let rec map_list (l:list int) : list int
  requires { nonNeg l }
  ensures { nonNeg result /\ forall j. 0<=j<length l -> nth j result = f(nth j l) }
  variant { l }
= match l with
  | Nil -> Nil
  | Cons h t -> Cons (f h) (map_list t)
  end
end (* module MapList *)
module MapFib
use int.Int, list.List, list.Mem, list.Length, list.NthNoOpt, ref.Ref
inductive fibpred int int =
  (* fibpred n r holds iff r is the n-th Fibonacci number *)
  | fib_zero : fibpred 0 0
  | fib_one  : fibpred 1 1
  | fib_step : forall n r1 r2:int.
      n >= 2 -> fibpred (n-2) r1 -> fibpred (n-1) r2 -> fibpred n (r1+r2)
let function calcfib (m:int) : int
  requires { m >= 0 }
  ensures { result >= 0 /\ forall r. fibpred m r <-> r=result }
= let n = ref 0 in let x = ref 0 in let y = ref 1 in
  while !n < m do
    invariant { 0 <= !n <= m /\ !x >= 0 /\ !y >= 0 }
    invariant { forall r. (fibpred !n r <-> r = !x) /\ (fibpred (!n+1) r <-> r = !y) }
    variant { m - !n }
    let tmp = !x in x := !y; y := !y+tmp; n := !n+1;
  done;
  !x
clone MapList with val f = calcfib
lemma mapFib_lm: forall l:list int.nonNeg l-> let fibl = map_list l in
                 nonNeg fibl /\ forall j.0<=j<length l-> nth j fibl = calcfib (nth j l)
end (* module MapFib *)
```
Listing 2.1. Why3 example

All the models and example modules mentioned in the paper are available for experimentation in the Why3-do artifact [28].

### 2 The Why3 Languages in a Nutshell

The example in Listing 2.1 illustrates the use of Why3's logic and programming languages, as well as the module cloning mechanism. The MapList module first imports a number of theories for mathematical integers and lists from the standard library. Why3 includes a wide range of theories, usable across provers. A program function f is then declared with the val keyword, including a simple contract: a precondition requiring its argument to be nonnegative, and a postcondition stating that the result is also nonnegative. In the rest of the module this contract will be assumed to hold for f. Next, a logic predicate nonNeg is defined. It uses a universal quantifier to state that every element of its argument list is nonnegative. Finally, the map\_list program function is defined. The definition includes both the function's recursive definition and a contract, in particular a postcondition that uses a universal quantifier to state the mapping property (result refers to the return value). From this module, Why3 will generate verification conditions (VCs) ensuring that the definition is consistent with its contract, assuming the definition of f keeps to its own contract. This interplay between contracts plays a fundamental role in deductive verification.

This little example allows us to elaborate on another aspect of Why3. nonNeg is also a function (returning a truth value), but it lives in a different namespace from map\_list, which is a WhyML program function. nonNeg belongs to Why3's logic language [17], and its definition contains a quantifier, which cannot be used in programs. However, pure program functions, which do not modify the global state, may also be used in the logic, if their declaration includes the function keyword. This is the case of f, used in both the code and the contract of map\_list. We will refer to program functions that can be used in the logic as "let functions". map\_list is also pure, but is not declared as a let function.

Why3 encodes both the code and contracts of let functions, so one may choose to write certain logic functions algorithmically or logically, or both. For instance nonNeg could be defined alternatively as follows (the postcondition is optional):

```
let rec predicate nonNeg (l:list int)
  ensures { result <-> forall x :int. mem x l -> x >= 0 }
  variant { l }
= match l with
  | Nil -> true | Cons h t -> h>=0 && nonNeg t end
```
If the postcondition is present, the logic encoding of the predicate will contain redundancy (no inconsistency can be created since the definition must respect the contract). Writing such "dual definitions" of logic functions may be a good idea for a number of reasons, namely the possibility of including preconditions, and termination checks based on user-provided variants. Moreover, dual definitions increase the robustness of specifications and may facilitate automated proofs of results involving quantifiers. Not every logic function can be defined as a let function: since the latter must remain executable, they may not contain for instance occurrences of logic equality or quantifiers. In these cases let ghost functions can be used. These are pure logic definitions that are not meant to be executed, but are still written as programs.
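As an illustration, a property whose statement needs a quantifier can still be packaged as a program-level definition by marking it ghost. The following is a minimal sketch (module and names are ours, not from the library):

```
module GhostExample
  use int.Int, list.List, list.Mem
  predicate allPositive_logic (l:list int) = forall x :int. mem x l -> x > 0
  (* not executable (the body is a quantified formula), but written in
     program style: usable wherever a program-level predicate is expected *)
  let ghost predicate allPositive (l:list int) = allPositive_logic l
end
```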

A second module, MapFib, defines a program function calcfib that computes Fibonacci numbers using a loop. The recursive definition of the Fibonacci sequence (used in the contract and loop invariant of calcfib) cannot be written as a logic function, since it is not total. It could be defined as a let function with a precondition restricting its domain, but we use instead an inductive predicate fibpred: the formula fibpred n f means that f is the n-th Fibonacci number. Inductive predicates, familiar to readers acquainted with proof assistants, are defined by means of a set of inference rules. They are used in our models to define non-deterministic transition relations on distributed system configurations.

Why3 will generate and successfully discharge VCs ensuring the correctness of calcfib with respect to its contract. Now, since calcfib is in accordance with the contract of f in MapList, this module can be cloned instantiating the latter function with the former. This imports into the current module a copy of every element of MapList, with calcfib substituted for f, and generates refinement VCs, to ensure that calcfib's contract is stronger than f's. Finally, the lemma mapFib\_lm states that indeed map\_list maps the function calcfib as expected.

### 3 Distributed Systems and Models

A distributed system consists of a set N of nodes, each of which can at any moment be in a state taken from a set Σ, together with additional elements, such as a communication network or a shared memory. We will call the global state of such a system a world and denote by W the set of all worlds. In general, worlds will include the local state of every node in the system, captured as a mapping lS : N → Σ. Different models will specialize this basic setting to define different notions of distributed system (and consequently also of world), including for instance different communication and fault models (we will always write N, Σ, or W in the context of a specific system model, left implicit).

Models are handler-based: systems are described by writing code executed by nodes in response to certain events, such as receiving a message from the network or an input from the local environment, or simply being enabled by a guard predicate that becomes true. Handlers are assumed to execute atomically. Each model defines a transition semantics describing how worlds evolve step by step, allowing for all possible schedules (both locally and globally). Each model contains a set of rules inferring judgments of the form w ⇝ w′, meaning that the system's global state w evolves to w′. The general form of the rules states the following: if the world w′ results from w when a handler is executed by one of the system's nodes, then w ⇝ w′.

Let w₀ correspond to the initial state of the system, and ⇝* denote the reflexive-transitive closure of ⇝. A world w is said to be reachable if w₀ ⇝* w. Let Φ be some property of worlds; we will write w |= Φ to signify that Φ is satisfied by the world w. A system is said to be correct with respect to Φ if w |= Φ holds for every reachable world w. A typical correctness proof involves finding an inductive invariant: a property I such that (i) w₀ |= I, and (ii) for every pair w, w′ of worlds, if w |= I and w ⇝ w′, then w′ |= I. If w |= I implies w |= Φ, this is sufficient to guarantee correctness.
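Spelled out as a derived proof rule (a standard induction on the length of executions), the invariant-based argument reads:

$$\frac{w_0 \models I \qquad \forall w\, w'.\ w \models I \wedge w \leadsto w' \Rightarrow w' \models I \qquad \forall w.\ w \models I \Rightarrow w \models \Phi}{\forall w.\ w_0 \leadsto^* w \Rightarrow w \models \Phi}$$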

Contract-based Models. We introduce the use of handler contracts for designing and verifying distributed systems. Let us consider a model with worlds of the form ⟨lS, …⟩, with … standing for other components of worlds in addition to the state function. The signature and contract of a handling function will be of the following general form, where I is a candidate invariant predicate, and other arguments and return values (…) may be present:

```
handle(n : N, lS : N → Σ, …) : (σ : Σ, …)
  requires I⟨lS, …⟩
  ensures I⟨lS[n ↦ σ], …⟩
```
The function returns the new state σ of the node n that executes the handler in a world with state function lS. This general form will be adapted with modifications in different models. For instance, handling functions may have access only to the local state and not to the entire state function lS, or they may return, in addition to a new state, a list of messages to be sent by n. Transition rules have the following general form, updating the state of the node that executes the handler, and reflecting in the world other effects of the execution.

$$\frac{\mathsf{handle}(n, \mathsf{lS}, \dots) = (\sigma, \dots)}{\langle \mathsf{lS}, \dots \rangle \leadsto \langle \mathsf{lS}[n \mapsto \sigma], \dots \rangle}$$

The handler's contract, consisting of precondition I⟨lS, …⟩ and postcondition I⟨lS[n ↦ σ], …⟩, ensures that if the handler is executed in a world satisfying the invariant I, then the world resulting from this transition still satisfies I.

It is common for handlers to have access only to the state σ of the node n where they are being executed. In this case it is not possible to include I⟨lS, …⟩ as a precondition in the contract, since lS is not passed as a parameter. Preservation of the invariant can be written instead as a conditional postcondition, stating that for every world satisfying I in which σ is the state of node n, if this node executes the handler then the resulting world still satisfies I:

$$\begin{array}{l} \mathsf{handle}(n: \mathbf{N}, \sigma: \Sigma, \dots) : (\sigma': \Sigma, \dots) \\ \quad \mathsf{ensures}\ \forall \mathsf{lS}: \mathbf{N} \to \Sigma.\ \sigma = \mathsf{lS}\, n \to I\langle \mathsf{lS}, \dots \rangle \to I\langle \mathsf{lS}[n \mapsto \sigma'], \dots \rangle \end{array}$$

The Why3-do Library. Listing 3.1 illustrates how contract-based models are written as Why3 modules. The World module declares basic types and functions, and defines the world structured type. The Steps module includes val declarations for (i) the initial world, (ii) an inductive invariant predicate, and (iii) a set of handling functions (illustrated here by handle\_1). Contracts enforce that the inductive invariant is satisfied by the initial world, and preserved by handlers. Each handler's contract makes use of a step\_1 auxiliary function, which is also used in the definition of the transition semantics through the step inductive predicate. The module ends with the definition of reachable world, and a lemma stating that the invariant holds in all reachable worlds (this is proved inductively for each model, using proof transformations and SMT solvers).

That is all that is required to define a system model, which may now be cloned to produce concrete distributed systems. Listing 3.2 illustrates how simple this is. We write a System module that defines, first of all, types for nodes, states, messages, and other relevant elements, and if appropriate, well-formedness predicates for different entities. The World module from the desired Why3-do library model can then be cloned, after which the following are defined: (i) the initial world, (ii) a candidate inductive invariant predicate, and (iii) handler functions specifying the behavior of the system's nodes/processes. The Steps module from the same model is now cloned, instantiating these elements. Why3 will produce a set of VCs, generated from the contracts contained in the cloned module, ensuring that the invariant is inductive. Properties of interest can at last be stated and proved (which may involve writing additional definitions and lemmas).

```
 module World (* file model.mlw *)
 type node
 type state
 type world = (map node state, ...)
 function localState (w:world) : map node state = (* projection functions for worlds *)
  let (lS, ...) = w in lS
 end (* module World *)
 module Steps (* file model.mlw *)
 ...
 val function initState (node) : state (* init functions for world components *)
 constant initWorld : world = (initState, ...)
 val ghost predicate indpred (w:world)
  ensures { w=initWorld -> result } (* initial world must satisfy invariant *)
 (* specifying the new world that results from w when n executes a handler yielding results r *)
 function step_1 (w:world) (n:node) (r:(state, ...)) : world =
  let (st, ...) = r in
    let newLocalState = set (localState w) n st in
      (newLocalState, ...)
 (* handlers' arguments include a node h and its state; results include a new state for h *)
 val function handle_1 (h:node) (sig:state) ... : (state, ...)
  ensures { forall w :world. indpred w -> sig = localState w h -> ... ->
               indpred (step_1 w h result) }
 inductive step world world =
 | step_1 : forall w :world, n :node.
            step w (step_1 w n (handle_1 n (localState w n) ...))
 | ...
 inductive step_TR world world =
 | base : forall w :world. step_TR w w
 | step : forall w w' w'' :world. step_TR w w' -> step w' w'' -> step_TR w w''
 predicate reachable (w:world) = step_TR initWorld w
 (* inductive invariant holds in all reachable worlds *)
 lemma indpred_reachable : forall w :world. reachable w -> indpred w
 end (* module Steps *)
```
Listing 3.1. Basic structure of a Why3-do model

### 4 The Basic Message-Passing Model

In this model nodes communicate by exchanging packets: triples of the form (d, s, m), carrying a message m ∈ Msg from node s ∈ N to node d ∈ N, with Msg a given set of messages. Worlds are pairs ⟨lS, nt⟩ where lS : N → Σ is a function assigning a state to each node and nt : Msg* is a network, abstracted as a list of packets. In a system based on this asynchronous model, nodes execute a message handler whenever they receive a message, and may in turn send messages to other nodes. The handleM function implements this local message-handling behavior. Its parameters include the node h handling the message, the node that sent the message, the state of the handling node, and the message itself. It returns a new state for h and a list of packets to be sent to the network.
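For concreteness, a minimal message handler in this model could look as follows. This is a hypothetical echo system (not from the library) whose nodes bounce every received payload back to its sender, leaving their local state unchanged; the invariant-preservation contract is omitted:

```
(* hypothetical echo handler: packets are (dest, src, payload) triples,
   so the reply to a message received from s is addressed to s, with h
   as source; the local state sig is returned unchanged *)
let function handleMsg (h:node) (s:node) (m:msg) (sig:state) : (state, list packet)
= (sig, Cons (s, h, m) Nil)
```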

```
module System (* file system.mlw *)
type node = int
type state = int
clone model.World with type node, type state
let function initState (n:node) : state = ...
let ghost predicate indpred (w:world) = ...
let function handle_1 (h:node) (lS:map node state) : state = ...
clone model.Steps with type node, type state, val initState, val indpred, val handle_1
goal systemProperty : forall w :world. reachable w -> ...
end (* module System *)
```
Listing 3.2. Basic structure of a Why3-do system module

Its signature and contract are (with I a candidate invariant):

$$\begin{array}{l} \mathsf{handleM}(h:\mathsf{N}, s:\mathsf{N}, m:\mathsf{Msg}, \sigma:\Sigma) : (\sigma':\Sigma, \mathsf{nt}':\mathsf{Msg}^*) \\ \quad \mathsf{ensures}\ \forall \mathsf{lS}:\mathsf{N}\to\Sigma,\ \mathsf{nt}:\mathsf{Msg}^*.\ \sigma = \mathsf{lS}\, h \to (h,s,m) \in \mathsf{nt} \\ \qquad \to I\langle\mathsf{lS},\mathsf{nt}\rangle \to I\langle\mathsf{lS}[h\mapsto\sigma'],\ \mathsf{nt}'+\mathsf{nt}-\{(h,s,m)\}\rangle \end{array}$$

The semantics of the model are given by the following transition rule:

$$\frac{\mathsf{handleM}(h,s,m,\mathsf{lS}(h)) = (\sigma,\mathsf{nt}') \qquad (h,s,m) \in \mathsf{nt}}{\langle\mathsf{lS},\mathsf{nt}\rangle \leadsto \langle\mathsf{lS}[h\mapsto\sigma],\ \mathsf{nt}'+\mathsf{nt}-\{(h,s,m)\}\rangle} \quad (message)$$

We use notation +, −, and ∈ for list concatenation, difference, and membership. Any packet that is in transit in the network may be selected by the rule to be delivered and handled by the receiving node. The rule removes the packet from the network, updates the state of the handling node, and sends new packets as prescribed by the handler. The semantics takes into account all possible orders of message delivery, since any message may be extracted from the packet pool. The semantics is otherwise idealized, but the library contains additional models in which messages may be dropped or duplicated by the network (an example verification of a system assuming message duplication is given in Section 5).
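For instance, message duplication can be captured by one additional transition rule that nondeterministically clones an in-transit packet; the following is a sketch (the library's duplicating models fix the exact formulation):

$$\frac{(h,s,m) \in \mathsf{nt}}{\langle \mathsf{lS},\ \mathsf{nt} \rangle \leadsto \langle \mathsf{lS},\ (h,s,m)+\mathsf{nt} \rangle} \quad (duplicate)$$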

The contract of handleM ensures that executions of (message) preserve the invariant I. Let ok_I(handleM) signify that the implementation of the handler adheres to its contract, instantiated with the candidate invariant I. If I holds in the initial world then it is indeed inductive and holds in all reachable worlds:

Lemma 1. Let w₀, w ∈ W and I be a predicate such that ok_I(handleM). If w₀ |= I and w₀ ⇝* w then w |= I.

A simplified version of the corresponding Why3-do model is shown in Listing 4.1. The World module defines the tuple types packet and world and auxiliary functions. Steps declares the following elements to be instantiated when cloning: the ok\_Msg well-formedness predicate; initState and initMsgs, used to construct initWorld; the inductive invariant indpred; and finally the handleMsg handler.

```
 module World
 type node type state type msg
 type packet = (node, node, msg)
 function dest (p:packet) : node = let (d,_,_)=p in d
 function src (p:packet) : node = let (_,s,_)=p in s
 function payload (p:packet) : msg = let (_,_,m)=p in m
 type world = (map node state, list packet)
 function localState (w:world) : map node state = let (lS,_)=w in lS
 function inFlightMsgs (w:world) : list packet = let (_,ifM)=w in ifM
 end (* module World *)
 module Steps
 ...
 predicate ok_Msg (node) (node) (msg)
 val function initState (node) : state
 val constant initMsgs : list packet
 constant initWorld : world = (initState, initMsgs)
 val ghost predicate indpred (w:world)
  ensures { w=initWorld -> result }
  ensures { result -> forall p: packet. mem p (inFlightMsgs w) ->
              ok_Msg (dest p) (src p) (payload p) }
 function step_message (w:world) (p:packet) (r:(state, list packet)) : world
 = let (st, ms) = r in let localState = set (localState w) (dest p) st in
  let inFlightMsgs = ms ++ (remove p (inFlightMsgs w)) in (localState, inFlightMsgs)
 val function handleMsg (h:node) (s:node) (m:msg) (sig:state) : (state, list packet)
  requires { ok_Msg h s m }
  ensures { forall w :world. indpred w -> mem (h, s, m) (inFlightMsgs w) ->
              sig = localState w h -> indpred (step_message w (h, s, m) result) }
 inductive step world world =
 | step_msg : forall w :world, p :packet. mem p (inFlightMsgs w) ->
  step w (step_message w p
    (handleMsg (dest p) (src p) (payload p) (localState w (dest p))))
 inductive step_TR world world = ...
 predicate reachable (w:world) = step_TR initWorld w
 lemma indpred_reachable : forall w :world. reachable w -> indpred w
 end (* module Steps *)
```
Listing 4.1. Message-passing model: modelMP

The contract of indpred ensures that it is satisfied by the initial world, and that all messages in the network are well-formed. Well-formedness conditions are singled out from the invariant because the handler function may need to assume basic facts about messages. The module ends with lemma indpred\_reachable, corresponding to Lemma 1 (the ok_I(handleM) and w₀ |= I premises are enforced by the contracts of indpred and handleMsg). It is proved using a Why3 transformation for predicate induction, and SMT solvers.

Example: Leader Election on a Ring. Leader Election is a coordination problem in which a set of processes or nodes collectively designate one of them to act as leader. One of the simplest solutions to this problem on a unidirectional ring network is the maximum-finding distributed algorithm devised by Chang and Roberts [7]. Let each node have a distinct identifier of some type equipped with a total order relation. Informally the algorithm can be described as follows: (i) messages are node identifiers; each node starts by sending its id to the next node in the ring. (ii) Each node then enters a message-handling loop. If a received message has a higher value than the receiver's id, the message is forwarded to the next node. Otherwise, it is discarded. (iii) If a node receives back a message with its own id, it claims to be the leader. The fundamental property to be proved of this system is that at most one node claims to be leader. The system has been used as an example in [34] and later in [29]. The Ivy description of the system is based on the decidable EPR fragment of FOL (see Section 7), whereas our formalization below uses unrestricted quantification.

The Why3-do encoding of this algorithm is given in Listing 4.2, based on the modelMP library model. The first step is to define types for nodes, identifiers, states, and messages. Identifiers are uniquely associated to nodes by means of the id function and the uniqueIds axiom. The constant n\_nodes is the number of nodes in the ring. A minimum of 3 nodes is assumed, with no upper bound. The constant maxId\_global corresponds to the (unique) node having the highest-value id in the ring. Node states are records having a single field leader of Boolean type, which indicates when a node claims to be leader. The ok\_Msg predicate describes the notion of well-formed message in the ring topology.

The types for nodes and identifiers could be left undefined, with a set of axioms for the next function and the maxId\_global constant. But in our experience, using library types, as well as defined constants, predicates, and functions when adequate, is advantageous from the point of view of provability, and also reduces the danger of introducing inconsistencies. For instance the maxId\_global constant is defined algorithmically using a recursive let function maxId\_fn with a "dual definition" (it is equipped with a contract describing precisely what it does). We could instead simply write an axiom concerning maxId\_global, but using the dual definition let function, containing code, not only increases the degree of assurance in what is being specified, but also makes it easier to reason about, since Why3 will generate a more easily provable set of VCs.
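For contrast, the axiom-only alternative mentioned above might read as follows; this is a hypothetical sketch that the development deliberately avoids:

```
(* purely axiomatic alternative: nothing checks that such a node exists,
   so a mistake here could silently introduce an inconsistency *)
val constant maxId_global : node
axiom maxId_global_ax : 0 <= maxId_global < n_nodes /\
  (forall k :node. 0 <= k < n_nodes -> k <> maxId_global -> id k < id maxId_global)
```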

Cloning the module modelMP.World introduces new composed types and auxiliary definitions. The system description then proceeds to give the initial conditions of the system, by means of a state function initState, and a constant initMsgs for the list of messages that are sent upon booting, also defined by means of a recursive let function. The handler definition then follows. The next element in the module is the invariant; since logic elements like quantifiers and equality are required, it is defined as a let ghost predicate indpred using an auxiliary predicate inv (see Section 2). It states that every in-flight message is well-formed; it contains the id of some node in the ring, with value not less than the sender's id, and it is not the id of any node i such that maxId\_global is located between i and the message's destination node (an auxiliary predicate between is used to express this). Moreover if the message contains its destination's id then that id is the highest in the network. Finally, any node that is claiming to be the leader has the highest id in the ring.

```
 type node = int
 val constant n_nodes : int
 axiom n_nodes_ax : 3 <= n_nodes
 let function next (x:node) : node = mod (x+1) n_nodes
 type id = int
 val function id (node) : id
 axiom uniqueIds : forall i j :node. id i = id j <-> i=j
 let rec function maxId_fn (n:int) : node
  requires { 1 <= n <= n_nodes }
  ensures { 0 <= result < n}
  ensures { forall k :node. 0<=k<n -> k<>result -> id k < id result}
  variant { n }
 = if n=1 then 0
  else let m = maxId_fn (n-1) in if id (n-1) > id m then n-1 else m
 constant maxId_global : node = maxId_fn n_nodes
 type state = { leader : bool }
 type msg = id
 predicate ok_Msg (dest:node) (src:node) (m:msg) =
  0 <= dest < n_nodes /\ 0 <= src < n_nodes /\ dest = next src
 clone modelMP.World with type node = node, type state = state, type msg = msg
 let function initState (i:node) : state = { leader = false }
 let rec function initMsgs_fn (n:node) : list packet
  requires { 0<=n<=n_nodes }
  ensures { forall s d :node, m :msg. mem (d, s, m) result ->
               m = id s /\ d = next s /\ n<=s<n_nodes /\
              (forall i :node. between i maxId_global d -> m <> id i) /\
              (m = id d -> d = maxId_global) }
  variant { n_nodes-n }
 = if (0<=n<n_nodes) then Cons (next n, n, id n) (initMsgs_fn (n+1))
  else Nil
 let constant initMsgs : list packet = initMsgs_fn 0
 let function handleMsg (h:node) (src:node) (m:msg) (s:state) : (state, list packet)
 = if m = (id h) then ({ leader = true }, Nil)
  else if m > id h then (s, Cons (next h, h, m) Nil)
              else (s, Nil)
 predicate between (lo:node) (i:node) (hi:node) =
  (lo < i < hi) \/ (hi < lo < i) \/ (i < hi < lo)
 lemma btw_next_lm : forall i j k :node.
  0 <= i < n_nodes -> 0 <= j < n_nodes -> 0 <= k < n_nodes -> i <> k ->
    between (next i) j k -> between i j k
 predicate inv (lS:map node state) (iFM:list packet) =
  (forall s d :node, m :msg. mem (d, s, m) iFM ->
    (ok_Msg d s m /\ m >= id s /\
    (exists i :node. 0 <= i < n_nodes /\ m = id i) /\
    (forall i :node. between i maxId_global d -> m <> id i) /\
    (m = id d -> d = maxId_global) )) /\
  (forall i:node. 0<=i<n_nodes -> (lS i).leader = true -> i = maxId_global)
 let ghost predicate indpred (w:world) = inv (localState w) (inFlightMsgs w)
 clone modelMP.Steps with type node, type state, type msg, predicate ok_Msg,
  val initState, val initMsgs, val indpred, val handleMsg
 goal uniqueLeader :
  forall w :world, i j:node.
    reachable w -> 0<=i<n_nodes -> 0<=j<n_nodes ->
      (localState w i).leader = true -> (localState w j).leader = true -> i = j
```
Listing 4.2. Leader election on a ring (Chang-Roberts)

The module then clones the Steps module from modelMP instantiating the necessary elements, and formulates the uniqueLeader proof goal. The verification results depend on the provers that are available. In our setup we were able to prove all VCs automatically using the Alt-Ergo [11], CVC4 [5], and Vampire [36] provers, after (i) providing lemma btw\_next\_lm, proved automatically by Alt-Ergo; and (ii) including in the postcondition of function initMsgs\_fn the relevant facts relating in-transit messages and maxId\_global, as required by the invariant. Observe that this postcondition is proved automatically by the program verification engine following the recursive definition of the function.

### 5 Trace Specifications

In the previous section we considered a specification property expressed at the implementation level, with access to internal node states. Other internal elements of worlds, including messages, could be mentioned in such implementation-level properties. It is however very useful to introduce an abstraction barrier between specifications and implementation details. This can be achieved by logging certain observable events onto a trace of the system, and then writing specifications as properties of the trace. Models in our setting can be equipped with traces, allowing for protocols and systems to be specified in this way.

We will illustrate this by equipping the message-passing model of Section 4 with traces. Each system using this model defines an Out type of outputs, and the model defines external events as Evt = N × Out, outputs paired with the node that originated them (other models may use additional notions of external event, such as inputs received by nodes from their local environments). A trace is a sequence of external events; the function rec : N → Out* → Evt* produces a trace from a sequence of outputs, pairing them with the source node. Given a predicate ν on traces and τ ∈ Evt*, we will write τ |= ν when τ satisfies ν.

A commit specification (µp, µf) consists of a predicate µp(Σ, Σ) and a function µf(Σ, Σ) : Out*, expressing respectively when outputs should be produced, and what those outputs should be. The signature of the message handler is similar to that in the model of Section 4, with a trace as additional output. Its contract states that it complies with a given commit specification.

$$\begin{array}{l} \mathsf{handleM}(h:\mathsf{N}, s:\mathsf{N}, m:\mathsf{Msg}, \sigma:\Sigma) : (\sigma':\Sigma, \mathsf{nt}':\mathsf{Msg}^*, l:\mathsf{Out}^*) \\ \quad \mathsf{ensures}\ \forall \mathsf{lS}:\mathsf{N}\to\Sigma,\ \mathsf{nt}:\mathsf{Msg}^*.\ \sigma = \mathsf{lS}\, h \to (h,s,m) \in \mathsf{nt} \\ \qquad \to I\langle\mathsf{lS},\mathsf{nt}\rangle \to I\langle\mathsf{lS}[h\mapsto\sigma'],\ \mathsf{nt}'+\mathsf{nt}-\{(h,s,m)\}\rangle \\ \quad \mathsf{ensures}\ (\mu_p(\sigma,\sigma') \to l = \mu_f(\sigma,\sigma')) \wedge (\neg\,\mu_p(\sigma,\sigma') \to l = \varepsilon) \end{array}$$

We will write ok_{I,µp,µf}(handleM) to signify that the implementation of handleM adheres to its contract, with invariant I and commit specification (µp, µf).

Worlds are tuples ⟨lS, nt, τ⟩ with lS : N → Σ, nt : Msg*, and τ : Evt*. The semantics will now be given by a relation ⇝ ⊆ W × N × W, with w ⇝ₙ w′ meaning that world w transitions to w′ with node n executing a handler. The following transition rule commits outputs to the trace:

$$\frac{\mathsf{handleM}(h,s,m,\mathsf{lS}(h)) = (\sigma,\mathsf{nt}',l) \qquad (h,s,m) \in \mathsf{nt}}{\langle\mathsf{lS},\ \mathsf{nt},\ \tau\rangle \leadsto_h \langle\mathsf{lS}[h\mapsto\sigma],\ \mathsf{nt}'+\mathsf{nt}-\{(h,s,m)\},\ \mathsf{rec}\,h\,l+\tau\rangle} \quad (message)$$

A specification is a triple (µp, µf, ν) consisting of a commit specification and a predicate ν(Evt*) expressing some notion of trace consistency. Correctness implies that the commit specification is respected and traces are consistent.

Definition 1. A system with initial world w₀ ∈ W is said to be correct with respect to a specification (µp, µf, ν) if for every reachable world w = ⟨lS, nt, τ⟩ we have τ |= ν, and whenever w ⇝ₙ w′ with w′ = ⟨lS′, nt′, τ′⟩, if µp(lS n, lS′ n) then τ′ = rec n (µf(lS n, lS′ n)) + τ, and τ′ = τ otherwise.


Lemma 2. Let (µp, µf, ν) be a specification, and I a predicate such that ok_{I,µp,µf}(handleM), w₀ |= I, and for every world w = ⟨lS, nt, τ⟩, w |= I implies τ |= ν. Then the system is correct with respect to (µp, µf, ν).

As usual the lemma is proved mechanically in the Why3-do module for this model. Every Why3-do model extended with traces contains a similar lemma.

A simplified version of the modelMPTrace model is shown in Listing 5.1 (... indicates elements that are preserved from the modelMP module). The world type extends the tuple of modelMP with a trace of type list externalEvent. The functions/predicates commitp, commitf, and consistent, corresponding respectively to µp, µf, and ν, are to be instantiated when cloning the model. The indpred predicate gains a new postcondition ensuring that it enforces consistency of the system's trace (following the conditions of Lemma 2). The step inductive predicate is modified to include as an additional parameter the node involved in each transition. The commit\_step and consistent\_reachable lemmas (mechanically proved, using the contracts of indpred and handleMsg) together correspond to Lemma 2 above.

Example: Distributed Lock. This example will show how Why3-do models can be extended in a flexible way. Its verification was first carried out in [20] and later also in [34] and [29]. We adapt it here to make use of trace specifications, which will allow us to demonstrate their effectiveness as an abstraction barrier. In addition to traces, the example also illustrates the use of guarded actions in models (through the use of enabling predicates), as well as the use of a non-idealized network model, in which in-transit messages can be duplicated. Two implementations will be given: one that satisfies the trace specification when the idealized model is used, and a second one that tolerates message duplication. The specification of the distributed lock system is the following:


```
 module World ...
 type externalEvent ...
 type world = (map node state, list packet, list externalEvent) ...
 function trace (w:world) : list externalEvent = let (_,_,t)=w in t
 end (* module World *)
 module Steps ...
 type output
 type externalEvent
 val function record_outputs (n:node) (outs:list output) : list externalEvent
 predicate commitp (state) (state)
 function commitf (state) (state) : list output
 predicate consistent (t:list externalEvent)
 val ghost predicate indpred (w:world)
  ensures { ... /\ result -> consistent (trace w) }
 function step_message (w:world) (p:packet) (r:(state, list packet, list output)) : world =
  let (st, ms, outs) = r in let localState = set (localState w) (dest p) st in
      let inFlightMsgs = ms ++ (remove p (inFlightMsgs w)) in
        let trace = (record_outputs (dest p) outs) ++ (trace w) in
          (localState, inFlightMsgs, trace)
 val function handleMsg (h:node)(s:node)(m:msg)(sig:state) : (state, list packet, list output)
  requires { ... }
  ensures { ... /\ let (s',_,lo) = result in (commitp s s' ->
                       lo = commitf s s') /\ (not (commitp s s') -> lo = Nil) }
 inductive step world node world =
 | step_msg : forall w :world, p :packet.
    mem p (inFlightMsgs w) -> step w (dest p) (step_message w p
     (handleMsg (dest p) (src p) (payload p) (localState w (dest p))))
 ...
 lemma commit_step :
  forall w w' :world, n :node. reachable w -> step w n w' ->
     (commitp (localState w n) (localState w' n) ->
       trace w' = (record_outputs n (commitf (localState w n) (localState w' n))) ++ trace w)
  /\ (not (commitp (localState w n) (localState w' n)) -> trace w' = trace w)
 lemma consistent_reachable :
  forall w :world. reachable w -> consistent (trace w)
 end (* module Steps *)
```
Listing 5.1. Message-passing model: modelMPTrace

1. whenever a node acquires the lock (its state changes from not holding to holding it), it commits a single output Locked n, where n is the epoch at which it now holds the lock;
2. no outputs are committed by any other step;
3. in every reachable world an output n is stored in position n of the trace.

The system's trace stores the sequence of outputs sent by different nodes. Together, these requirements mean that a node acquiring the lock at epoch n writes to position n of the trace, which implies (since traces are only modified by appending at the head) that no two nodes acquire the lock in the same epoch.
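For example, if nodes a, b, and c successively acquire the lock at epochs 1, 2, and 3, the trace (head leftmost) becomes

$$\tau = (c, \mathsf{Locked}\ 3)\ (b, \mathsf{Locked}\ 2)\ (a, \mathsf{Locked}\ 1)$$

and each event's epoch coincides with the length of the trace from that event to the tail, which is exactly the condition checked by the consistent predicate of Listing 5.2 below.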

Specifications are written as Why3-do modules defining the output and externalEvent types, together with projection functions and the record\_outputs function. Most importantly, they define the commitp and consistent predicates, as well as the commitf function. However, the specification is abstract and does not impose the use of any specific system model. It requires the presence of certain types, but does not specify how the types are implemented. The requirement that states should contain specific information is expressed by declaring functions and/or predicates on states.

```
module Spec
  (* to be instantiated when cloning this module *)
  type node
  type state
  function getEpochS (s:state) : int
  predicate getHeldS (s:state)
  type output = | Locked int
  function getEpochO (o:output) : int =
    match o with | Locked e -> e end
  type externalEvent = (node, output)
  function node (e:externalEvent) : node = let (n,_) = e in n
  function outp (e:externalEvent) : output = let (_,o) = e in o
  let rec function record_outputs (n:node) (outs:list output) : list externalEvent
    ensures { forall i :int. 0<=i<length outs -> nth i result = (n, nth i outs) }
  = ...
  predicate commitp (s:state) (s':state) = not (getHeldS s) /\ getHeldS s'
  function commitf (_:state) (s':state) : list output = Cons (Locked (getEpochS s')) Nil
  predicate consistent (t:list externalEvent) =
    match t with
    | Nil -> true
    | Cons (_,o) tt -> getEpochO o = length t /\ consistent tt
    end
end (* module Spec *)
```
Listing 5.2. Specification module for distributed lock

Implementation modules will define these types and functions and clone the specification module, instantiating them.
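The body of record\_outputs is elided in Listing 5.2; a plausible definition satisfying the stated contract is the following sketch:

```
let rec function record_outputs (n:node) (outs:list output) : list externalEvent
  ensures { forall i :int. 0<=i<length outs -> nth i result = (n, nth i outs) }
  variant { outs }
= match outs with
  | Nil -> Nil
  | Cons o os -> Cons (n, o) (record_outputs n os)
  end
```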

This specification of the distributed lock is written as the Why3-do module of Listing 5.2. It assumes the use of a system model defining types node, state, output, and externalEvent. The requirements above are formalized by the commitp predicate and commitf function (requirements 1 and 2) and by the consistent predicate (requirement 3).


We will consider two message-passing implementations for this specification based on a ring topology, shown in listings 5.3 and 5.4. Node states are records with two fields: a Boolean held indicating whether the node holds the lock, and its current epoch. After the appropriate type definitions, both implementation modules clone the same Spec module, and then the World module from the appropriate model. The idealized model modelMPEnabledTrace is used in the implementation of Listing 5.3, whereas Listing 5.4 uses modelMPEnabledTraceDupl, in which messages can be duplicated. Both are extensions of modelMPTrace (Listing 5.1) with an enabling predicate. Enabling predicates allow for nodes to execute guarded actions: when cloning the model, the enabled predicate (with a node and its state as parameters) and the handleEnbld function are instantiated; the semantics states that the handler may be executed whenever the predicate is true.

```
   type node = int
   val constant n_nodes : int
   axiom n_nodes_ax : 2 <= n_nodes
   let function next (x:node) : node = mod (x+1) n_nodes
   type state = { held : bool; epoch : int }
   function getEpochS (s:state) : int = epoch s
   predicate getHeldS (s:state) = held s
   type msg = int
   predicate ok_Msg (dest:node) (src:node) (_:msg) =
    0<=dest<n_nodes /\ 0<=src<n_nodes /\ dest = next src
   clone specLDT.Spec with type node, type state, function getEpochS, predicate getHeldS
   clone modelMPEnabledTrace.World with type node, type state,
    type msg, type output, type externalEvent
   let function initState (n:node) : state
   = let h = if n=0 then true else false in
      let e = if n=0 then 1 else 0 in
        { held = h; epoch = e }
   let constant initMsgs : list packet = Nil
   let constant initTrace : list externalEvent = Cons (0,Locked(1)) Nil
   let function handleMsg (_:node)(_:node) (m:msg) (s:state) :(state, list packet, list output)
   = if (not (held s) ) then ({ held = True; epoch = m }, Nil, Cons (Locked m) Nil)
    else (s, Nil, Nil)
   let ghost predicate enabled (s:state) (i:node)
   = 0<=i<n_nodes && held s
   let function handleEnbld (h:node) (s:state) : (state, list packet, list output)
   = let e = epoch s in ({ held = False; epoch = e }, Cons (next h, h, e+1) Nil, Nil)
   let rec ghost predicate zeroHeld (lS:map node state) (n:int) = ...
   let rec ghost predicate oneHeld (lS:map node state) (n:int) = ...
   let ghost predicate oneMsg (lp:list packet) = length lp = 1
   let ghost predicate noMsgs (lp:list packet) = length lp = 0
   let rec ghost predicate ok_trace (t:list externalEvent)
    ensures { result -> consistent t }
   = match t with
    | Nil -> true
    | Cons (_,o) Nil -> getEpochO o = 1
    | Cons (_,o1) os ->
      match os with
      | Nil -> true
      | Cons (_,o2) _ -> getEpochO o1=(getEpochO o2)+1 && ok_trace os
      end
    end
   predicate inv (lS:map node state) (iFM:list packet)
    (tr:list externalEvent)
   = (forall p: packet. mem p iFM -> ok_Msg(dest p)(src p)(payload p))
    /\ ((oneMsg iFM /\ zeroHeld lS n_nodes)
        \/ (noMsgs iFM /\ oneHeld lS n_nodes))
    /\ (forall n :node. 0<=n<n_nodes -> held (lS n) ->
         n = node (hd tr) /\ epoch (lS n) = getEpochO(outp (hd tr)))
    /\ (forall p: packet. mem p iFM ->
         src p = node (hd tr) /\ payload p=getEpochO(outp (hd tr))+1)
    /\ length tr > 0 /\ ok_trace tr
   let ghost predicate indpred (w:world)
   = inv (localState w) (inFlightMsgs w) (trace w)
   clone modelMPEnabledTrace.Steps with ...
```
Listing 5.3. Distributed lock with idealized model

```
...
let function handleMsg (_:node) (_:node) (m:msg) (s:state)
  : (s':state, lp:list packet, lo:list output)
= let nop = (s, Nil, Nil) in
     if (held s) || m <= epoch s then nop
    else ({ held = True; epoch = m }, Nil, Cons (Locked m) Nil)
...
(* helper definitions for invariant predicate *)
let rec ghost predicate zeroHeld (lS:map node state)(n:int) ...
let rec ghost predicate atMostOneHeld (lS:map node state)(n:int)...
let rec ghost predicate isFresh (p: packet) (lS:map node state)...
let rec ghost predicate allStale (lS:map node state) (lp:list packet)...
let rec ghost predicate atMostOneFresh (lS:...)(lp:...)...
let rec ghost predicate ok_trace (t:list externalEvent)...
predicate inv (lS:map node state) (iFM:list packet)
  (tr:list externalEvent)
= (forall p: packet. mem p iFM -> ok_Msg (dest p)(src p)(payload p))
  /\ atMostOneFresh lS iFM /\ atMostOneHeld lS n_nodes
  /\ (zeroHeld lS n_nodes \/ allStale lS iFM)
  /\ (forall n :node. 0<=n<n_nodes -> held (lS n) ->
        n = node (hd tr) /\ epoch (lS n) = getEpochO(outp (hd tr)))
  /\ (forall p: packet. mem p iFM -> isFresh p lS ->
    src p = node (hd tr) /\ payload p = getEpochO(outp (hd tr))+1)
  /\ length tr > 0 /\ ok_trace tr
...
```
Listing 5.4. Distributed lock with duplicating messages model


In the present example, enabled is defined as true when a node holds a lock, in which case it is free to release it. The lock is released when handleEnbld executes, sending a message to the next node in the ring. The message includes the value of the sender's current epoch, incremented by one.

The system is initialized with node 0 holding the lock (and this fact is registered in the system trace). The handling functions then follow. The enabling predicate and the corresponding handler are the same in both implementations; it is in the message handlers that they differ. With the idealized model nodes can trust that messages are never stale, so they react by blindly acquiring the lock. With the duplicating model the receiving node first checks whether the epoch in the received message is higher than its present epoch (in which case it cannot be a stale copy of a previous message). The inductive invariants are also different for both implementations, but both include a property expressed with the ok\_trace predicate, stating that events in the trace contain incremental epochs, starting from 1. This implies consistency of the trace (as defined in the specification), and is easier to check for inductiveness.

Let us consider in detail the system of Listing 5.4. A message is fresh if the current epoch of its destination node is lower than the epoch carried by the message. Transfer messages are always sent from the highest-epoch node (holding the lock) and thus, at the time of sending, the destination has a lower epoch, which will be updated when the message is received and the lock acquired. Other copies of the message are stale because their destinations' epochs have since increased. The system's invariant is given as the conjunction of the following properties, using the zeroHeld, atMostOneHeld, allStale, and atMostOneFresh predicates: (i) in-transit messages are well-formed; (ii) there is at most one in-transit fresh message, and at most one node holding a lock; if a node holds a lock then all in-transit messages are stale; (iii) if node n holds the lock then the last Locked x was written in the trace by n, and x is the current epoch of n; (iv) if there exists a fresh in-transit message, then it was sent by the last node that output Locked x, and it carries the value x + 1; (v) the trace obeys the ok\_trace predicate.
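Freshness can be captured directly from this description; the following is a sketch of the isFresh predicate elided in Listing 5.4 (the artifact's definition may differ, e.g. it is declared recursively there):

```
(* a packet is fresh when the epoch it carries is strictly above the
   current epoch of its destination node *)
let ghost predicate isFresh (p:packet) (lS:map node state)
= payload p > epoch (lS (dest p))
```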

The VCs generated for the modules of listings 5.3 and 5.4, proved automatically, establish the correctness of each system with respect to the specification of Listing 5.2: events are being logged in the specified way, and traces are consistent.

### 6 Locally Shared Memory Model

Dijkstra described certain distributed systems (including the self-stabilizing systems described below) using a guarded processes model, in which nodes/processes do not exchange messages, but instead have direct read access to each other's states. Although particular systems will only require read access to a limited set of states (typically those of their immediate neighbors), our model allows read access universally. This is not a shared-memory model in all generality, but it may be implemented over shared memory, with a single-writer multiple-reader data structure for each node's state (and readers-writer locks for atomicity).

We formalize this in our setting as a model where worlds are simply of the form ⟨lS⟩ with lS : N → Σ a state-assigning function. A system based on this model is programmed by defining an enabling predicate on nodes and a handling function describing the behavior that can be executed whenever a node is enabled. Formally we will consider that the enabling predicate has signature ep(n : N, lS : N → Σ), taking as parameters a node and a global state-assigning function, and the handling function has the following signature and contract:

$$\begin{array}{l} \mathsf{handleE}(h:\mathsf{N},\mathsf{lS}:\mathsf{N}\to\Sigma):(\sigma:\Sigma) \\ \quad \mathsf{requires}\ \mathsf{ep}(h,\mathsf{lS}) \land I\langle\mathsf{lS}\rangle \\ \quad \mathsf{ensures}\ I\langle\mathsf{lS}[h\mapsto\sigma]\rangle \end{array}$$

The enabling predicate and the handler code have read access to every node's state, but the handler may only modify the state of the node where it is running. This semantics is given by the following rule:

$$\frac{\mathsf{handleE}(h,\mathsf{lS}) = \sigma \qquad \mathsf{ep}(h,\mathsf{lS})}{\langle \mathsf{lS} \rangle \leadsto_h \langle \mathsf{lS}[h \mapsto \sigma] \rangle} \; (enabled)$$

where ⇝ₕ means that node h runs the handler. The contract of handleE ensures that executions of the (enabled) transition rule preserve the property I (the contract ensures this if the node is enabled, and the semantics only allows for transitions satisfying this requirement). We will write ok_I(ep, handleE) when the implementation of the handling function handleE adheres to its contract, with invariant I and enabling predicate ep. Listing 6.1 shows a simplified version of the Why3-do modelReadallEnabled module, including the following lemma, proved using an induction transformation and SMT solvers.

```
 module World
  type node  type state  type world = map node state
 end
 module Steps
  val predicate validNd (n:node)
  val function initState (node) : state
  constant initWorld : world = initState
  val ghost predicate indpred (w:world)
    ensures { w=initWorld -> result }
  val ghost predicate enabled (map node state) (i:node)
    requires { validNd i }
  function step_enbld (w:world) (n:node) (st:state) : world = set w n st
  val function handleEnbld (h:node) (lS:map node state) : state
    requires { validNd h /\ enabled lS h /\ indpred lS }
    ensures { indpred (step_enbld lS h result) }
  inductive step world node world =
  | step_enbld : forall w :world, n :node. validNd n -> enabled w n ->
                   step w n (step_enbld w n (handleEnbld n w))
  lemma indpred_step :
    forall w w' :world, n :node. step w n w' -> indpred w -> indpred w'
  lemma step_preserves_states :
    forall w w' :world, n i :node. step w n w' -> i<>n -> w i = w' i
  (* keeps track of number of transition steps *)
  inductive step_TR world world int =
  | base : forall w :world. step_TR w w 0
  | step : forall w w' w'' :world, n :node, steps :int.
             step_TR w w' steps -> step w' n w'' -> step_TR w w'' (steps+1)
  lemma noNeg_step_TR : forall w w' :world, steps :int. step_TR w w' steps -> steps >= 0
  lemma indpred_manySteps :
    forall w w' :world, steps :int . step_TR w w' steps -> indpred w -> indpred w'
  predicate reachable (w:world) = exists steps :int. step_TR initWorld w steps
  lemma indpred_reachable : forall w :world. reachable w -> indpred w
 end
```
Listing 6.1. Locally shared memory model: modelReadallEnabled

Lemma 3. Let w₀, w ∈ W, with ep and I predicates such that ok_I(ep, handleE), w₀ |= I, and w₀ ⇝* w. Then w |= I.

Example: Stabilizing Mutual Exclusion. Self-stabilizing systems [15,38] are designed to tolerate failures resulting from "horrible errors" (such as data corruption), by including a recovery mechanism. Given some notion of legal configuration, a system is said to be self-stabilizing if (i) starting from an illegal configuration, all executions eventually converge to a legal configuration (a liveness property), and (ii) legal configurations are closed under normal execution steps, i.e. no illegal configuration is reachable if no corruption of data occurs (a safety property). One of Dijkstra's examples of such a system in his seminal paper [15] was a directed ring of processes sharing a resource, with mutual exclusion enforced by means of a circulating token. Legal configurations are those in which exactly one process carries a token.

```
 module SelfStab_Ring_Closure
  type node = int
  val constant n_nodes : int
  axiom n_nodes_bounds : 2 < n_nodes
  let predicate validNd (n:node) = 0 <= n < n_nodes
  type state = int
  val constant k_states : int
  axiom k_states_lower_bound : n_nodes < k_states
  let function incre (x:state) : state = mod (x+1) k_states
  clone modelReadallEnabled.World with type node, type state
  let function initState (n:node) : state = if n=n_nodes-1 then 1 else 0
  predicate has_token (lS:map node state) (i:node) =
    (i = 0 /\ lS i = lS (n_nodes-1)) \/ (i > 0 /\ i < n_nodes /\ lS i <> lS (i-1))
  let ghost predicate enabled (lS:map node state) (i:node) = has_token lS i
  let function handleEnbld (h:node) (lS:map node state) : state
  = if h = 0 then incre (lS (n_nodes-1)) else lS (h-1)
  let rec ghost predicate atLeastOneToken (lS:map node state) (n:int)
    requires { validNd n }
    ensures { result <-> exists k :int. 0<=k<n /\ has_token lS k }
    variant { n }
  = n > 0 && (has_token lS (n-1) || atLeastOneToken lS (n-1))
  predicate atMostOneToken (lS:map node state) (n:int) = validNd n ->
    forall i j :int. 0<=i<n -> 0<=j<n -> has_token lS i -> has_token lS j -> i=j
  lemma first_last : forall n: int, lS :map node state.
                       n >= 0 -> (forall j :int. 0<j<=n -> lS j = lS (j-1)) -> lS 0 = lS n
  lemma atLeastOneTokenLm : forall w :world. atLeastOneToken w n_nodes
  predicate inv (lS:map node state) =
    (forall n :int. validNd n -> 0 <= lS n < k_states) /\ atMostOneToken lS n_nodes
  let ghost predicate indpred (w:world) = inv w
  clone modelReadallEnabled.Steps with type node, type state,
    val validNd, val initState, val indpred, val enabled, val handleEnbld
  predicate oneToken (w:world) = atMostOneToken w n_nodes /\ atLeastOneToken w n_nodes
  goal oneToken : forall w :world. reachable w -> oneToken w
 end
```
Listing 6.2. Self-stabilizing mutual exclusion on a ring – Closure

In case of failure the system converges back into a single-token configuration. Dijkstra's proposal for self-stabilizing mutual exclusion was the following: processes have integer numbers in {0, …, K−1} as states, with K greater than the size of the ring. Each process observes the state of its predecessor in the ring; the process with index 0 holds a token when its state is the same as that of its predecessor (the last process in the ring); other processes hold a token when their state is different from their predecessor's. When holding a token, each process may modify its state by copying its predecessor's state; node 0 additionally increments (modulo K) this state. For instance, with three processes and K = 4, in configuration (1, 1, 1) only process 0 holds a token; executing its handler yields (2, 1, 1), which passes the token to process 1.

Listing 6.2 shows the Why3-do formalization of this system, based on the locally shared memory model. Nodes and states are both integers; n\_nodes and k\_states are the size of the ring and the number of different states. The enabling predicate is defined as true for a node exactly when it is carrying a token, as specified by the has\_token predicate. The handler defined by handleEnbld copies states as previously described. Mutual exclusion is expressed using predicates atLeastOneToken and atMostOneToken that apply to the first n nodes.

The module of Listing 6.2 verifies the closure property. The invariant expresses that node states are within bounds, and there is no more than one token in the ring. One possible (legal) initial configuration of the system is described by the initState let function. These definitions are instantiated when cloning modelReadallEnabled. The module ends with the oneToken goal, stating that there exists exactly one token in all reachable configurations.

Stepwise Bounded Validation. In the verification of closure we use the following technique: we introduce an axiom bounding the size of the system, passed to the solvers to make automated proofs easier (soundness of the verification may be compromised at this point). We then introduce parts of the invariant step by step, and check them in this bounded system in order to gain insight as to their validity. Once we feel confident about the candidate invariant, we remove the bounding axiom to achieve soundness of the verification, possibly stating additional lemmas or strengthening the invariant. For the present system:

1. We started with the following invariant. Inductiveness is proved automatically, but the oneToken goal cannot be proved from it (as expected):

forall i :int. validNd i -> 0 <= lS i < k\_states.

2. Next, we included atMostOneToken lS n\_nodes in the invariant; preservation was proved automatically, but oneToken could still not be proved. We then added a bounding axiom n\_nodes <= 10, which allowed the goal to be proved.

3. We strengthened the invariant with atLeastOneToken lS n\_nodes and removed the bounding axiom. The oneToken goal was proved trivially; however, the VC pertaining to the preservation of the invariant could not be proved.

4. Preservation could be proved by reintroducing a bound on n\_nodes (with a bound of 1000, all VCs could be proved within 30 seconds in our setup).
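In Why3 terms, each bounding step amounts to adding a single temporary axiom to the system module, along the lines of:

```
(* temporary bounding axiom, used only while exploring candidate
   invariants; it is unsound and must be removed for the final run *)
axiom n_nodes_bound : n_nodes <= 10
```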

These bounded proof results indicate that, in all likelihood, (i) the property atLeastOneToken lS n\_nodes is preserved by system transitions, and thus inductive, but (ii) it is not necessary to include it in the inductive invariant to prove oneToken: in our development the oneToken goal could be proved for up to 10 processes without including the former property in the invariant. The reason for this is that the atLeastOneToken lS n\_nodes property is in fact satisfied by definition in all configurations: for a token to be present, either some pair of adjacent processes must have different states, or the first and last processes must have the same state; if all processes have the same state, the second case holds. Including the property in the invariant still requires a bound (to prove preservation), but this can now have a much higher value (1000 rather than 10).

An unbounded proof is obtained by including in the module the first\_last lemma (proved by induction on n). This allows for the goal to be proved automatically without atLeastOneToken lS n\_nodes in the invariant, and with no upper bound on n\_nodes. We remark that the dual definition (recursive code + contract) of the atLeastOneToken let function was crucial for proving the goal automatically (this was not possible with a logic definition).


Table 7.1. Comparison of DS deductive verification frameworks (MP: message-passing, LSM: locally shared memory, F: functional, I: imperative)

The convergence property is more challenging; its Why3-do formalization can be found in the artifact [28]. We have also verified Dijkstra's version of this system with a bidirectional array topology. Bounded exploration again allowed us to validate parts of the invariant; attaining an unbounded verification required strengthening the invariant rather than adding a lemma.

### 7 Related Work

Deductive verification methods are typically based on first-order logic reasoning and focus on safety properties, with correctness proofs requiring users to manually provide appropriate invariants and to discharge (either automatically or interactively) proof obligations generated in the process. Invariants may apply to loops, recursive functions, or non-deterministic transition relations, and allow for correctness proofs by induction on the length of executions. In the last few years a number of frameworks and tools have been proposed for reasoning about asynchronous message-passing systems using inductive invariants, based on atomic handler models and different specification mechanisms. We will now briefly survey these and compare them with Why3-do in terms of design choices.

Verdi [42] introduced the use of models based on worlds and atomic handlers, with models capturing different fault semantics. Why3-do's semantic framework is inspired by Verdi; we enrich handlers with interface specifications in the form of contracts, allowing for the use of methods that are standard in deductive verification of single-thread software. Verdi is a Coq development, and reasoning is carried out within the Coq proof assistant [22]. The implementation of our framework as a Why3 library allows for the use of automated tools (all the proofs in this paper use SMT solvers and a few Why3 transformations).

Whereas Verdi handlers are defined in a purely functional style, in Why3-do they are written in WhyML, combining functional and imperative features. Verdi supports system transformations that allow for verified systems to be obtained from systems verified with simpler models (additional mechanisms may be automatically introduced to compensate for the presence of faults). Transformations are verified once and for all, so the resulting systems do not need to be verified again. An important difference is that Verdi targets exclusively message-passing systems, whereas Why3-do covers different system models. Verdi supports traces, but specifications may not be written in a completely abstract, model-independent way. In Why3-do this is achieved through the use of clonable specification modules defining commit specifications and trace consistency.

The IronFleet [20] platform is built on top of a deductive verification tool, Dafny [26], which uses the Z3 [31] SMT solver for proofs. Like Verdi, it supports only message-passing systems. A major difference with respect to Why3-do and Verdi is that, instead of a specification mechanism based on traces, IronFleet separates development into a specification level (where worlds are viewed abstractly) and a concrete protocol level, both described in FOL as state machines. A refinement function [1] maps protocol worlds to the specification level, and a refinement proof shows that protocol steps are compatible with the abstract behavior (in Why3-do this is achieved by trace consistency proofs). There is a third, implementation level, where event handlers are programmed using mutable data structures and machine types, for performance and realism. IronFleet extends Dafny with a UDP specification to support networking, which allows non-atomic handlers to be developed assuming low-level interleaving. In order to establish refinement proofs between low-level implementations and protocols, reduction-based reasoning is supported. IronFleet also includes an embedding of TLA that makes it possible to reason about liveness properties. It is overall an ambitious tool that has been used by its authors to verify practical systems.

Up to a point, Why3-do implementations cover both the protocol and implementation levels, since WhyML accommodates both functional programs and stateful code with mutable structures and machine types. Why3 supports code extraction from verified WhyML programs, and it should not be difficult to obtain a distributed implementation from a verified Why3-do system, using one of the available OCaml libraries. Our framework allows for diverse system models, with different implementation infrastructure requirements. In general, each node must run a scheduler that will, for instance, receive incoming local inputs and messages from the network, check enabling predicates, and run the appropriate handlers, reflecting locally and globally the effects prescribed by the semantics.
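
To make these scheduler obligations concrete, here is a minimal C sketch of the event loop such infrastructure might run; it is illustrative only, and all names (event_t, handler_t, next_event) are hypothetical rather than part of Why3-do or its extraction pipeline.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int kind; int payload; } event_t;  /* local input or message */

typedef struct {
    bool (*enabled)(const event_t *);  /* enabling predicate (guard) */
    void (*handle)(const event_t *);   /* atomic handler body        */
} handler_t;

/* Hypothetical per-node scheduler: fetch the next local input or
   network message, then run the first handler whose enabling
   predicate holds for it, reflecting the semantics' atomic steps. */
void scheduler_loop(const handler_t *handlers, size_t n,
                    event_t (*next_event)(void))
{
    for (;;) {
        event_t e = next_event();
        for (size_t i = 0; i < n; i++) {
            if (handlers[i].enabled(&e)) {
                handlers[i].handle(&e);
                break;
            }
        }
    }
}
```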

The Ivy tool [34] differs from Why3-do and the previous frameworks in several important ways. It uses a dedicated modeling/programming language called RML, and a logic language restricted to the effectively propositional (EPR) class of formulas, whose satisfiability is decidable (Ivy also uses Z3). Specifications may refer to any part of the model (there are no distinct specification/protocol layers, and no observation traces). The use of EPR imposes severe restrictions: RML does not allow arithmetic operations, so for instance a ring topology cannot be modeled using integer modulo arithmetic. A verification methodology based on the use of EPR, and details on how it has been used to verify variants of the Paxos protocol, are extensively described in [33] (the method proposed for reducing quantifier alternation is of general interest, even when unrestricted FOL is used). Leveraging the decidability of the logic, Ivy focuses on assisting the user in writing the protocol and its specification, and in discovering adequate inductive invariants. A few initial steps of execution are first considered, which may allow for bugs to be found in the protocol and/or target properties; Ivy then assists the user in finding an inductive invariant by performing interactive strengthening and generalization steps, and by representing states visually.

A more general, comprehensive framework for reasoning about distributed systems has been constructed around the TLA+ specification language, based on the Temporal Logic of Actions [25]. TLA+ is without any doubt a widely successful toolset, and its adoption in practice is well documented [32]. The toolset comprises the specification language itself; the PlusCal algorithmic language; the TLC model checker [43]; the TLAPS proof system [8]; and a development environment. Correctness proofs are based on the notion of refinement mapping [1]. If one writes a TLA+ specification and a PlusCal implementation, and then translates the latter to TLA+, its correctness can be stated as a refinement problem, whose VC is itself written as a TLA+ formula. The TLAPS proof system is an ongoing effort but can already be used to prove many such refinements. TLAPS proofs [12] are constructed using both proof assistants and SMT solvers.

Table 7.1 summarizes the distinctive aspects of the discussed tools. Additionally, the I4 technique has been proposed [29] based on the automatic synthesis (by model checking) of inductive invariants for small instances of protocols, followed by their generalization. Invariants are checked with Ivy, and if necessary the process is repeated, considering a bigger instance or a pruned invariant. Kaizen [23] is a verified blockchain system that has been developed using an approach similar to IronFleet. Implementations of distributed systems that have been formally verified using different tools have been empirically scrutinized in [19].

Program logics for distributed systems have also been the subject of recent work, typically based on or inspired by concurrent separation logics [6], and mechanized in the Coq proof assistant. Notable examples include Disel [39], which focuses on modularity and compositionality, and Aneris [24], which includes support for node-level concurrency in addition to inter-node reasoning. ModP [14] is an actor-based compositional programming framework that offers assume-guarantee reasoning principles to support compositional system testing.

The self-stabilizing ring system has been verified interactively using the PVS [35] and Isabelle [30] proof assistants, and also by symbolic model checking [41,9]. A general framework for building certified proofs of self-stabilizing algorithms (using Coq) is described in [3].

### 8 Conclusion

In this paper we have proposed principles for contract-based verification of distributed systems, based on a library promoting modular development. The approach enables the use of state-of-the-art sequential software verifiers for reasoning about distributed systems, supports model-independent trace specifications, and is uniform across system models, beyond the message-passing setting.

To implement these principles we have chosen the Why3 verification platform. We have shown how specific features of Why3, such as the ability to interface with different solvers and the use of dual definitions, contribute to successful automated proofs. For instance, we were able to prove the inductiveness of an invariant for the leader election protocol containing a quantifier alternation (a sequence of the form ∀∃ [33], outside the decidable EPR logic). In particular, the Alt-Ergo and Vampire solvers were able to prove these VCs, whereas Z3 and CVC4 failed (with a generous timeout value). On the other hand, when the invariant for the self-stabilizing systems included the atLeastOneToken predicate, which contains an existential quantifier, the dual definition of that predicate allowed Z3 and CVC4 (but not the other solvers) to prove inductiveness. In neither case was it necessary to employ invariant quantifier hiding, as in [20].

Unbounded domains (nodes, messages, etc.) are typical of distributed systems. Considering bounded systems, in combination with dual definitions, allowed us to explore the inductiveness of invariant properties before tackling the unbounded case (by strengthening invariants or writing lemmas). This should not be confused with the use of bounded verification in Ivy, which considers the first few system steps in order to debug models, or in I4, which produces finite quantifier-free instances of problems, amenable to model checking.

The framework has two main limitations. First, in the spirit of the verification of sequential programs with Why3, Why3-do targets the verification of distributed systems at the algorithmic level, and is not intended for reasoning about executable implementations (but see the discussion of implementation extraction in Section 7). Second, no support for reasoning with non-atomic handlers is included.

Why3 is a stable tool, actively developed by a solid team, with a growing user community and very low risk of obsolescence. It is being successfully used for formal verification in contexts as diverse as safety-critical programming [2], multicore schedulers [27], and blockchain smart contracts [37,40]. Why3-do brings Why3's strengths in terms of usability and proof engineering to the mechanical verification of distributed systems, making it available to a wider community.

Acknowledgments. The development of Why3-do was initiated during a visit of the second author to the Toccata team at Inria Saclay-Île-de-France/LRI Univ Paris-Saclay/CNRS and greatly benefited from the team's hospitality and Why3 expertise. This work is financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme – NORTE 2020 Programme and by National Funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia, within project NORTE-01-0145-FEDER-028550 – PTDC/EEI-COM/28550/2017.

### References


Science, vol. 7898, pp. 1–20. Springer (2013). https://doi.org/10.1007/978-3-642-38574-2_1


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Relaxed virtual memory in Armv8-A

Ben Simner<sup>1</sup> Alasdair Armstrong<sup>1</sup> Jean Pichon-Pharabod<sup>2</sup> Christopher Pulte<sup>1</sup> Richard Grisenthwaite<sup>3</sup> Peter Sewell<sup>1</sup>

<sup>1</sup> University of Cambridge, UK first.last@cl.cam.ac.uk <sup>2</sup> Aarhus University, Denmark jean.pichon@cs.au.dk <sup>3</sup> Arm Ltd., UK first.last@arm.com

Abstract. Virtual memory is an essential mechanism for enforcing security boundaries, but its relaxed-memory concurrency semantics has not previously been investigated in detail. The concurrent systems code managing virtual memory has been left on an entirely informal basis, and OS and hypervisor verification has had to make major simplifying assumptions.

We explore the design space for relaxed virtual memory semantics in the Armv8-A architecture, to support future system-software verification. We identify many design questions, in discussion with Arm; develop a test suite, including use cases from the pKVM production hypervisor under development by Google; delimit the design space with axiomatic-style concurrency models; prove that under simple stable configurations our architectural model collapses to previous "user" models; develop tooling to compute allowed behaviours in the model integrated with the full Armv8-A ISA semantics; and develop a hardware test harness.

This lays out some of the main issues in relaxed virtual memory, bringing these security-critical systems phenomena into the domain of programming-language semantics and verification with foundational architecture semantics.

### 1 Introduction

Computing relies on virtual memory to enforce security boundaries: hypervisors and operating systems manage mappings from virtual to physical addresses to restrict access to physical memory and memory-mapped devices, and thereby to ensure that processes and virtual machines cannot interfere with each other, or with the parent OS or hypervisor. In a world with endemic use of memory-unsafe languages for critical infrastructure, and of hardware that does not enforce fine-grained protection, virtual memory is one of the few mechanisms one has to enforce strong security guarantees. This has driven interest in hypervisors and virtual machines, and it provides a compelling motivation for verification of the OS-kernel and hypervisor code that manages virtual memory to provide security.

However, any such verification requires a semantics for the protection mechanisms provided by the underlying hardware architecture. There are two major challenges in establishing such a semantics. First, there is its sequential intricacy: virtual memory is one of the most complex aspects of a modern general-purpose architecture. For 64-bit Armv8-A (AArch64) it is described in a 166-page chapter of the prose reference manual [13, Ch.D5] and includes a host of features and options. Second, and more fundamentally, there is its relaxed memory behaviour. Hardware implementations of virtual memory use in-memory representations of the virtual-to-physical address mappings, represented as hierarchical page tables. For performance, there are dedicated cache structures for commonly used mapping data, in Translation Lookaside Buffers (TLBs). Translations are used often – a single load instruction might need 40 or more page-table entries to translate its fetch and access addresses – but they are changed only rarely, and by systems code not user code. Architectures therefore require manual management of TLB caching, e.g. with specific instructions to invalidate old TLB entries that should no longer be used, instead of providing the simpler coherent memory abstraction that they do for normal accesses. All this gives rise to new relaxed-memory effects, with subtle constraints determining when translations are required or forbidden to read from specific writes to the page tables, and systems code has to handle these appropriately to provide the desired virtual-memory abstraction and its security properties.

Previous work has developed hand-written sequential semantics for some aspects of address translation in Arm [57,59,58,60,44,38,41] and x86 [34,35,29,62], but these are at best lightly validated formalisations, and there is no well-validated relaxed-memory concurrency semantics of virtual memory. In the absence of that (and of proof techniques above it), previous OS and hypervisor verification work, e.g. on seL4, CertiKOS, KCore, Hyper-V, the PROSPER hypervisor, and SeKVM [25,40,37,44,11,38,43,61] has had to make major simplifying assumptions, either assuming correctness of TLB management and a single-threaded setting (seL4), or assuming sequentially consistent concurrency with one of those hand-written sequential semantics, or assuming an extended notion of data-race-freedom (we return to the related work in §7).

We explore the design space for Armv8-A relaxed virtual memory semantics, to support future systems-software verification. We contribute:


– We develop tooling to compute the allowed behaviours of litmus tests in our models, integrated with the full Armv8-A ISA semantics, using the authoritative Arm ASL definition of the intra-instruction semantics including page-table walks (§6.1).

– We develop a test harness that lets us run virtual-memory litmus tests bare-metal, albeit currently only for Stage 1 tests, and report results from running these on hardware (§6.2).

Mainstream industrial architecture specifications evolve over many years, balancing hardware-implementation and systems-software concerns. Experience with "user" relaxed-memory concurrency has shown that the process of developing rigorous semantics for arbitrary code provides a useful third input into this process, leading one to ask questions which help clarify the architectural intent. The architects, hardware designers, and system-software authors typically have a deep understanding of the area, but there is usually not, a priori, a well-understood informal specification that just needs to be formalised; instead that needs to be iteratively and collaboratively developed. Our §3 is based on detailed discussion with the Arm Chief Architect (a co-author of this paper); on the current Arm prose documentation [13]; on discussion with the pKVM development team; and on our experimental testing. To the best of our knowledge, our models provide a reasonable basis for software development and for verification, but this paper is surely not the last word on the subject, and it does not give an authoritative definition of the Armv8-A architecture. The history of relaxed-memory models shows that it typically takes multiple years, and gradual refinement of models, to converge on something reasonably stable for a production architecture or language, and even then they continue to change as new knowledge or features arise; with hindsight, few are definitive. Our goal here is rather to lay out some of the main issues, bringing this security-critical systems code into the domain of programming-language semantics and verification, above foundational architecture semantics.

We begin in §2 with an informal introduction to virtual memory in a simple sequential setting, to make this self-contained. This paper is necessarily condensed; an extended version, with our tests, models, proofs, and Isla tooling, is available at https://www.cl.cam.ac.uk/users/pes20/RelaxedVM-Arm/.

Scope and non-goals Our scope is Armv8-A virtual memory for the 64-bit (AArch64) architecture, aiming especially to support aspects relevant to hypervisors such as pKVM. Accordingly, we consider translation with multiple stages (for both hypervisor and OS), multiple levels, and the full Armv8-A intra-instruction semantics and translation walk behaviour (as defined by Arm in ASL and auto-translated to Sail [14]). Our models cover the Armv8-A ETS option as work in progress. We discuss some mixed-size aspects, but our models do not currently cover them. To keep things manageable, we do not consider hardware management of access flags or dirty bits, conflict aborts, feat\_bbm, feat\_cnp, feat\_xs, the interactions between virtual memory and instruction-fetch, or all the relaxed behaviour of exceptions, and we handle only some of the many varieties of the TLBI instruction. We focus on the specification of the architecturally allowed envelope of functional behaviour, not on side-channel phenomena. We include some experimental testing, as a sanity check of our models, but our principal goal is to capture the architectural intent, and our principal validation is from discussion with Arm. Many of the issues should also be relevant to other architectures, but here we address only Armv8-A.

### 2 Background: A Crash Course on Virtual Memory

### 2.1 Virtualising addressing

In conventional computer systems, the underlying memory is indexed by physical addresses (PAs), as are memory-mapped devices. For a small microcontroller running trusted code, accessing resources directly via physical addresses may suffice. Larger systems rely heavily on virtual addressing: they interpose one or more layers of indirection between virtual addresses (VAs) used by instructions and the underlying physical addresses. This lets them:


A simple system might have many processes managed by an operating system, each of which (including the OS) has a partial function that gives the physical address and permissions for the virtual addresses it can use, roughly:

```
translate : VirtualAddress ⇀ PhysicalAddress × 2^{Read,Write,Execute}
```
Typically each process would have access to a subset of the physical addresses (the range of its translate function), disjoint from those of the other processes and from that of the OS, while the OS would have sole access to its own working memory and also access to that of the processes. This is implemented with a combination of hardware and system software. The hardware memory management unit (MMU) automatically translates virtual to physical addresses when doing an access needed to execute an instruction. If the function is undefined, the instruction traps with a page fault; if it is defined but does not have the appropriate accesses, it traps with a permission fault; and if it is defined with the right permissions, the hardware performs the required access using the resulting physical address. The OS has to set up the translate functions, ensure that the appropriate function is used when switching to a new process, and handle those faults. Translation functions are not necessarily injective, and the full translate function has permissions per exception-level, and includes not just access permissions but additional fields for cacheability, shareability, security, contiguity, and others which we elide for simplicity here.
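
As a sketch of this dispatch (hypothetical C types and trap functions, not Arm's ASL definition):

```c
#include <stdbool.h>
#include <stdint.h>

enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };  /* {Read,Write,Execute} */

typedef struct {
    bool     defined;  /* is translate defined at this virtual address? */
    uint64_t pa;       /* output physical address                       */
    unsigned perms;    /* permission bits                               */
} xlat_t;

extern _Noreturn void raise_page_fault(uint64_t va);        /* hypothetical traps */
extern _Noreturn void raise_permission_fault(uint64_t va);

/* Dispatch for one hardware access: page fault if the function is
   undefined, permission fault if the needed right is missing,
   otherwise the access proceeds at the translated physical address. */
uint64_t translate_or_fault(xlat_t (*translate)(uint64_t),
                            uint64_t va, unsigned need)
{
    xlat_t t = translate(va);
    if (!t.defined)
        raise_page_fault(va);
    if ((t.perms & need) != need)
        raise_permission_fault(va);
    return t.pa;
}
```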

#### 2.2 The translation-table walk

The current translate function for execution is determined by a system register, a translation table base register or TTBR, that contains the physical address of a lookup-tree data structure in memory. The details of this structure are (in Armv8-A) highly configurable, e.g. for different page sizes, controlled by various system registers. In a common configuration used by Linux, it maps 4096-byte pages and has a tree up to four levels (0–3) deep. Each non-leaf node of the tree has 512 64-bit entries, indexed by specific bit ranges of the virtual address. Each entry can be either invalid, meaning that the translate function is undefined for this part of the domain; a block (at levels 1 or 2) or page descriptor entry (at level 3), returning an output address and permissions; or a table (at levels 0, 1, or 2), with the physical (or intermediate physical) address of a next-level table with which to continue recursively.
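
For this common 4096-byte-page, four-level configuration, the per-level index is just a 9-bit slice of the virtual address; a minimal sketch, assuming exactly that configuration:

```c
#include <stdint.h>

/* Index into a translation table at a given level (0-3), for 4 KiB
   pages with 512-entry tables: level 3 uses VA bits [20:12], level 2
   bits [29:21], level 1 bits [38:30], and level 0 bits [47:39]. */
static inline unsigned table_index(uint64_t va, int level)
{
    return (unsigned)((va >> (12 + 9 * (3 - level))) & 0x1ff);
}
```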

This translation-table walk function is fully defined in the Arm ASL language.

#### 2.3 Multiple stages of translation

The above suffices for an operating system isolating multiple processes from each other, but one often wants to isolate multiple operating systems (or other guests), managed by a hypervisor. To support this, the architecture provides a second layer of indirection: instead of going straight from virtual to physical addresses, with a single stage of mapping controlled by the OS, one can have two stages, with the OS managing a Stage 1 table which maps virtual addresses to intermediate physical addresses (IPAs), composed with a hypervisor-managed Stage 2 table mapping IPAs to PAs. The full translation composes the two, intersecting their permissions.

```
translate_stage1 : VirtualAddress ⇀ IPA × 2^{Read,Write,Execute}
translate_stage2 : IPA ⇀ PhysicalAddress × 2^{Read,Write,Execute}
```
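
A sketch of that composition, with hypothetical C helpers standing in for the two per-stage walks (the real definition is Arm's ASL walk function):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool defined; uint64_t out; unsigned perms; } map_t;

extern map_t translate_stage1(uint64_t va);   /* VA  -> IPA, OS-managed        */
extern map_t translate_stage2(uint64_t ipa);  /* IPA -> PA, hypervisor-managed */

/* Full two-stage translation: compose the stages, intersecting permissions. */
map_t translate_full(uint64_t va)
{
    map_t s1 = translate_stage1(va);
    if (!s1.defined)
        return s1;                      /* Stage 1 fault */
    map_t s2 = translate_stage2(s1.out);
    if (!s2.defined)
        return s2;                      /* Stage 2 fault */
    return (map_t){ true, s2.out, s1.perms & s2.perms };
}
```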

Armv8-A has various exception levels (ELs), including EL0 (for user processes), EL1 (for OSs or other guests), and EL2 (for a hypervisor). These each have associated translation-table base registers:


Each hardware thread has its own base registers (and other system registers), and so different hardware threads can be using different address spaces (for example, for different processes) at the same time.

#### 2.4 Caching translations in TLBs

A naive hardware implementation of address translation would need many translation memory reads – with four levels, up to 24 with both stages enabled, for every instruction-fetch, read, or write. This would have unacceptable performance, so processors have specialised caches for translation-table walk reads called translation lookaside buffers (or TLBs). Under normal operation the TLBs are invisible to user code, but systems code has to manage them explicitly, to change which translation table is currently in use (e.g. when context switching), or to make changes to the tables for one process or guest. Without correct management a TLB could hold incorrect (stale) data, breaking the protection that the address translation is intended to provide.

The architecture supports explicit TLB maintenance with various flavours of the TLBI instruction (TLB invalidate), to invalidate old entries for specific ranges of virtual or intermediate physical addresses, or even whole ASIDs or VMIDs at once. The memory management unit (MMU) is responsible for performing these translations. It does this by looking at the TLB and, if the TLB does not contain an entry for the given address (called a miss), it performs the translation table walk function as described earlier and caches the result in the TLB (a fill).

TLB maintenance and TLB misses are expensive, and one would not want the cost of TLB invalidation on every context switch, so the architecture provides address space identifiers (ASIDs). The translation table base registers include an ASID in addition to the table base address, and when translation data is cached in a TLB it is tagged with the current ASID, giving the illusion of separate TLBs per ASID, and allowing switching from one to another without TLB maintenance. Eventually the system will need to reclaim and reuse a previously used ASID, and then TLB maintenance is required to clean that ASID's old entries. There are similar identifiers for Stage 2 intermediate physical memory, known as virtual-machine identifiers or VMIDs.
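
Conceptually, these identifiers scope TLB entries to a context; a simplified sketch of the tag check on a cached translation (illustrative only; real TLBs cache entries at many granularities):

```c
#include <stdbool.h>
#include <stdint.h>

/* A cached translation tagged with its context: a lookup hits only if
   the VA page, the ASID (Stage 1), and the VMID (Stage 2) all match,
   giving the illusion of separate TLBs per ASID and per VMID. */
typedef struct {
    uint64_t va_page;
    uint16_t asid;
    uint16_t vmid;
    uint64_t pa_page;
    unsigned perms;
} tlb_entry_t;

static bool tlb_hit(const tlb_entry_t *e, uint64_t va_page,
                    uint16_t asid, uint16_t vmid)
{
    return e->va_page == va_page && e->asid == asid && e->vmid == vmid;
}
```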

### 3 Concurrency Architecture Design Questions

Now we will introduce the main concurrency architecture design questions that arise for Armv8-A virtual memory, within the scope laid out in the introduction. As usual, the architecture has to define an envelope of behaviour that provides the guarantees needed by software, while admitting the relaxed behaviour of the microarchitectural techniques necessary for performance. That means we have to discuss both, including just enough microarchitecture to understand the possible programmer-visible behaviour, before we abstract it in the semantic models we give in §5. The discussion includes points of several kinds: some that are clear in the current Arm documentation, some where Arm have a change in flight, some that are not documented but where the semantics is (after discussion) obviously constrained by existing hardware or software practice, and some where there is a tentative Arm intent that is not yet settled; our modelling raised a number of questions of the latter two kinds. To make this as coherent as possible, we discuss all these in a logical order, laying out the design principles. We have developed a suite of 214 hand-written Isla-compatible virtual-memory litmus tests that illustrate the issues, but to keep this concise we give just the main ideas here. In the extended version, we link to tests for each issue. As a sample, we explain one pKVM test in detail in §4.

#### 3.1 Coherence with respect to physical or virtual addresses

For normal memory accesses, the most fundamental guarantee that architectures provide is coherence: in any execution, for each memory location, there is a total order of the accesses to that location, consistent with the program order of each thread, with reads reading from the most recent write in that order. Hardware implementations provide this, despite their elaborate cache hierarchies and out-of-order pipelines, by coherent cache protocols and pipeline hazard checking, identifying and restarting instructions when possible coherence violations are detected. Previous work on relaxed-memory semantics for architectures has taken virtual addresses as primitive, implicitly considering only execution with well-formed, constant, and injective address translation mappings.

Now, we have to consider whether coherence is with respect to virtual or physical addresses, for non-injective mappings. For Arm, coherence is w.r.t. physical addresses [13, D5.11.1 (p2812)]. This means that if two virtual addresses alias to the same physical address, then (still assuming well-formed and constant translation): a load from one virtual address cannot ignore a program-order (po) previous store to the other; and a load from one virtual address can have its value forwarded from a store to the other, and similarly on a speculative branch.
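
For example, assuming two virtual addresses va1 and va2 that are mapped (well-formedly and constantly) to the same physical address, physical coherence guarantees the single-thread outcome sketched here:

```c
#include <stdint.h>

/* va1 and va2 are assumed to alias: both translate to the same PA. */
void alias_coherence(volatile uint32_t *va1, volatile uint32_t *va2)
{
    *va1 = 1;           /* store via one virtual alias                   */
    uint32_t r = *va2;  /* po-later load via the other alias: it cannot  */
    (void)r;            /* ignore the store to the same PA, so (absent   */
                        /* other writers) r == 1, possibly by forwarding */
}
```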

#### 3.2 Relaxed behaviour from TLB caching

There are two main aspects of the concurrency semantics of virtual memory: the relaxed behaviour arising directly from TLB caching, and the relaxed behaviour of the not-from-TLB (non-TLB) memory reads performed by translations, which read from memory or by forwarding from po-previous writes, and which might supply TLB cache fills. We discuss them in this and the following subsection respectively.

What can be cached: The MMU can cache information from successful translations, and also from translations that result in permission faults, but it is architecturally forbidden from caching information from attempted translations that result in translation faults. This ensures that the handlers of those faults do not need to do TLB maintenance to remove the faulting entry [13, D5.8.1 (p2780)], and makes the potential behaviour for page-table updates from invalid-to-valid and valid-to-any quite different, as we shall see.

TLB implementations might cache any combination of individual page-table entries and partial or complete translations, e.g. from the virtual address and context to the physical address of the last-level page. Conceptually, however, we can simply view a TLB as containing a set of cached page-table-entry writes (i.e., writes that have been read from for a translation), including at least:


That additional information allows the various TLBI instructions to target specific entries. A translation walk can arbitrarily use either a cached write (if one exists) or do a non-TLB read, either from memory or by forwarding from a po-previous write, for any stage or level.

Caching of multiple entries for the same virtual address and context: High-performance hardware implementations may have elaborate TLB structures, including multiple "micro TLBs" per thread. These can be seen as a conceptual single per-thread TLB that can hold zero, one, or more entries for each combination of input address and the other information above. If zero, a translation will necessarily read from memory (with ordering constrained as discussed below). If one or more, a translation may use any of those entries or read from memory (and the write read from might or might not be cached). However, in some cases multiple entries constitute a break-before-make failure, leading to relatively unconstrained behaviour; we return to this below.

When can page-table entries be cached: Any memory read by a translation can be cached. Any thread can spontaneously do a translation for any virtual address at any program point, with respect to its context at that point (though this interacts with the system-register write/read semantics). Spontaneous translations model hardware prefetching, speculative execution, and branch prediction. They mean that, in the absence of cache maintenance, translations may use TLB entries from arbitrarily old writes. Additionally, any thread may do a spontaneous translation at any point using the configuration from any exception level higher than the current one, but not for lower levels. Preventing spontaneous walks at lower EL is essential, as during an EL2 hypervisor switch between VMs, the EL1 control registers will be in an inconsistent state. Allowing spontaneous walks at higher EL models arbitrary interrupts to the higher level and then doing a spontaneous walk there.

Each virtual-memory access by a thread involves a non-spontaneous translation which is constrained by the normal inter-instruction constraints on out-of-order and speculative execution by the thread. These constraints are especially important in order to understand when a translation must fault: as invalid entries cannot be cached, a translation that gives rise to such a fault must be at least in part from a non-TLB read, subject to these ordering constraints.

Coherence of translations: Due to the TLB caching as described above, translations of the same virtual address by the same thread need not see a coherent view of page-table memory. This is in sharp contrast to normal accesses, but analogous to instruction-fetch reads [56] and reads from persistent memory [51].

Removing cached entries: TLBs may spontaneously forget any cached information at any point. To ensure that a cached entry is removed, software must ensure that it will not be spontaneously re-cached. It can do this with a write of an invalid entry and then a DSB instruction (data synchronization barrier) to ensure that it is visible across the system, followed by a TLBI.

Break-before-make failures: When changing an existing translation mapping from one valid entry to another valid entry, Arm require in many cases the use of a break-before-make (BBM) sequence: breaking the old mapping with a write of an invalid entry; a DSB to ensure that is visible across the system; a broadcast TLBI to invalidate any cached entries on all relevant threads; a DSB to wait for the TLBI to finish; and then making the new mapping with a write of the new entry, with additional synchronisation to ensure that it is visible to translations. The current Arm text [13, D5.10.1 (p2795)] identifies six cases of page-table updates that without such a sequence constitute BBM failures, and states very severe architectural consequences for them: failures of coherency, single-copy atomicity, ordering, or uniprocessor semantics. Note that these consequences are architecturally allowed if there could exist a break-before-make-failure change to the translation tables for some virtual address, irrespective of whether the program architecturally accesses it.

This severity is because, in some of the six cases, hardware implementations could give rather arbitrary behaviour, e.g. an amalgamation of old and new entries. From a software point of view, it seems that one must treat such cases more-or-less as fatal errors. This is analogous to the data-race-free-or-catch-fire semantics underlying the C/C++ relaxed memory model [4,33,22,20], in which any program with a consistent execution that includes a race between non-atomic accesses is deemed to have undefined behaviour, and the C/C++ standards do not constrain implementation behaviour for such programs in any way. This makes many potential litmus tests that change between valid entries uninteresting, as they simply exhibit BBM failures.

However, for a processor architecture that supports virtualisation, one cannot regard BBM failures as allowing completely arbitrary behaviour for the entire machine: if one guest virtual machine (at EL1) changes one of its own translation mappings without correctly following the BBM sequence, either mistakenly or maliciously, that should not impact security of the hypervisor (at EL2) or other guests. Instead, one has to bound the arbitrary behaviour to that virtual machine, allowing arbitrary memory and register accesses that are possible within its context. In our exhaustively executable semantics, to keep litmus-test executions finite, we currently simply detect BBM failures; we do not explicitly model that arbitrary behaviour.

In reality, these six BBM failure cases include some where hardware may give such weakly constrained behaviour and others where, because coherence is over physical addresses and the mapping may be temporarily indeterminate, software might see well-defined but nondeterministic or surprising results. These were architected as a guide for system software to produce predictable behaviour, and future versions of the architecture might refine this.

When a hypervisor installs a new guest, it has to be able to reset to a clean state. It can do so with a TLBI covering the address spaces of all the previous guest's processes. There seems to be no need or support for finer-grain cleanup.

#### 3.3 Relaxed behaviour of translation-walk non-TLB reads

Now we turn to the semantics of translation-walk non-TLB reads, those that are satisfied from memory or by forwarding, not from a TLB. This matters especially when one knows that there are no relevant cached TLB entries, e.g. when an invalid entry has been written and a TLBI performed.

Ordering among the translation-walk reads of an access: Each translation-table walk for a virtual-memory access can involve many memory reads, one for each level of the table for each stage of translation.

An example two-stage walk for an access a: R x=v comprises Stage 1 reads T1–T4, Stage 2 reads T11–T44 supporting them, and final Stage 2 reads T\_1–T\_4, twenty-four reads in all (diagram elided). Each Tn is a read of level n of the Stage 1 table. Each of those Stage 1 reads must first itself be translated to get the PA (as the table contains IPAs), and so each Tnk is a read of level k of the Stage 2 table for the address of the Stage 1 table at level n. Once the full Stage 1 walk has been completed, the final output IPA must be translated to the final PA; those are the final four T\_n reads, of the Stage 2 table at level n. The reads are ordered one after another, in the order they appear in the ASL walk function. This ordering must be respected by hardware, as software relies on it when building the tables bottom-up.

Dependencies into translation-walk non-TLB reads: Address dependencies into a memory-access instruction in classic "user" models are now explainable as dataflow dependencies to the translation reads of those accesses, as the address has to be available before a walk can start. These are virtual-address dataflow dependencies (contrasting with physical-address coherence).

Translation-walk non-TLB reads from non-speculative same-thread writes:

PO-past A translation-walk non-TLB read might read from a po-previous pagetable-entry write, but it is only guaranteed to see such a write if there is enough intervening synchronisation. Arm have recently introduced Enhanced Translation Synchronization (ETS), optional in Armv8.0 and mandatory from Armv8.7. Armv8-A implementations without ETS require both a DSB, to make the write visible to translation-walk non-TLB reads, and an ISB, to ensure that any translations for later instructions that were done out-of-order, before the write, are restarted. With ETS, only the DSB is required for a translation-walk non-TLB read to definitely see the write, though one might still need an ISB if the new translation enables new instruction fetch. Because invalid entries cannot be cached, this means that if an entry is initially invalid, then after a write of a valid entry and a DSB;ISB/DSB, translations will use that valid entry. However, the DSB;ISB/DSB does not remove cached entries, so an initially valid entry might be cached by a spontaneous walk, so even after a write (of an invalid or non-BBM-failure valid entry) and a DSB;ISB/DSB, the old entry could still be used by translations. One would need a TLBI sequence to remove old cached entries, which we return to below.
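
As an inline-assembly sketch of the same-thread publication sequence just described (GCC-style asm; pte and new_desc are hypothetical, and the entry is assumed previously invalid, so no TLBI is needed):

```c
#include <stdint.h>

static inline void publish_entry(uint64_t *pte, uint64_t new_desc)
{
    *pte = new_desc;                       /* write the new valid descriptor    */
    asm volatile("dsb ish" ::: "memory");  /* make it visible to walks          */
    asm volatile("isb" ::: "memory");      /* without ETS: restart translations */
                                           /* done early for po-later accesses  */
}
```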

PO-future The Armv8-A architecture allows load-store reordering, but it does not allow writes to become visible to other threads while they are still speculative. In the same vein, translation-walk non-TLB reads cannot read from po-later page-table-entry writes [13, D5.2.5 (p2683)]. Before the po-earlier translation is complete, one cannot know that it is not going to fault, so the later write has to be considered speculative. This prevents a thread-local self-satisfying translation cycle, analogous to the prevention of load-store cycles with dependencies.

PO-present On the margin, can a translation-walk non-TLB read for a write access see that write itself, or a distinct write from the same instruction? The second case could arise from a store-pair or misaligned store that does two writes, with one to a page-table entry that could be used by the other, though real code would typically not do this intentionally. This is explicitly allowed by the current architecture text [13, D5.2.5 (p2683)]. However, that text does not specify whether the translations for those two writes could both read from the other, a self-satisfying translation cycle where the writes write each other's translations. In general such self-satisfying cycles give rise to thin-air behaviours, and the architectural intent is to forbid them.

Translation-walk non-TLB reads from speculative same-thread writes: Speculative execution requires translation walks, which might result in additional page-table entries being cached, but in most cases this is indistinguishable from the effects of a non-speculative spontaneous walk. However, one has to ask whether a translation-walk non-TLB read can see a po-previous write that is still speculative, e.g. while both instructions follow an as-yet-unresolved conditional branch. It is clear that the result of such a walk should not be persistently cached, or made visible to other threads (via a shared TLB), while it remains speculative. Moreover, such translations could lead to arbitrary reads of read-sensitive device locations, which one normally relies on the MMU to prevent. The conclusion is therefore that this must be forbidden.

Translation-walk non-TLB reads from same-thread writes, forbidden past (same-thread TLBI completion): To remove an existing mapping on a single thread, one needs first to write an invalid entry, then a DSB to ensure that has reached memory and thus is visible to translation-walk non-TLB reads (to prevent spontaneous re-caching), then a TLBI to invalidate any cached entries, then a DSB to wait for TLBI completion. Without ETS, one also needs an ISB to ensure that po-later translations that have been done early are restarted. With ETS, the ISB is not always necessary, though it might still be needed for its instruction-cache effects if the change of mapping affects instruction fetch. After all that, an attempted access by that thread is guaranteed to fault.
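
In code, the non-ETS unmap sequence might look as follows (a sketch; the TLBI flavour and its operand encoding, here VA[55:12] with a zero ASID field, depend on the translation regime):

```c
#include <stdint.h>

/* Remove a mapping on the current thread, so that a subsequent access
   by this thread to the address is guaranteed to fault (non-ETS form). */
static inline void unmap_entry(uint64_t *pte, uint64_t va)
{
    *pte = 0;                              /* write an invalid entry           */
    asm volatile("dsb ish" ::: "memory");  /* visible to non-TLB walk reads    */
    asm volatile("tlbi vae1is, %0"         /* invalidate cached entries for va */
                 :: "r"(va >> 12)
                 : "memory");
    asm volatile("dsb ish" ::: "memory");  /* wait for TLBI completion         */
    asm volatile("isb" ::: "memory");      /* restart early po-later walks     */
}
```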

Translation-walk non-TLB reads from other-thread writes, guaranteed past, initially invalid: Now consider when a translation-walk non-TLB read is guaranteed to see a write by another thread of a new entry, assuming that the entry was previously invalid and any cached entries for it invalidated. Consider a two-thread message-passing case, where a producer P0 writes a new valid page-table entry pte\_valid (a), then has some ordering before a write of a flag (b), while a consumer P1 reads the flag (c), then has some ordering before an access Rx or Wx (d) that needs that entry for a translation Tx of virtual address x.

On some Armv8-A implementations that do not support ETS, some "obvious" combinations of ordering on P0 and P1 could lead to an abort of the translation of (d), which some OS software would find difficult to handle. This was the main motivation for ETS: implementations without it can have weak behaviour, requiring strong synchronisation to prevent the abort, while with ETS the architecture is stronger, requiring only weaker ordering to prevent the abort.

Without ETS, two combinations of ordering are architected as sufficient to ensure that the translation (d) sees the new valid entry:

– Case 1: on the consumer P1, a DSB followed by an ISB between the flag read (c) and the translated access (d).
– Case 2: on the producer P0, a DSB, a broadcast TLBI for the virtual address x, and a further DSB between the page-table write (a) and the flag write (b).

In Case 1, the message-passing is enough to ensure the write (a) is in main memory, the P1 ISB ensures that any out-of-order translation of (d) is restarted, and the P1 DSB keeps the read (c) and that ISB in order. In Case 2, the first DSB ensures the write is visible to all threads, the TLBI (broadcast, for the virtual address x) invalidates any older cached entry on P1, and the second DSB waits for that TLBI to be complete, after which any new translation on P1 will have to see the new entry. However, it appears that the probability of an unhandleable abort in practice, where one usually does not have these operations immediately adjacent, and where in many cases the abort could be handled, has been judged low enough that OS code is not necessarily using either of these.

With ETS, the architecture says [13, D5.2.5 (p2683)] that "if a memory access RW1 is Ordered-before a second memory access RW2, then RW1 is also Ordered-before any translation table walk generated by RW2 that generates a Translation fault, Address size fault, or Access flag fault." Microarchitecturally, the intuition here is that with ETS any translation done while speculative that leads to such a fault will have to be reconfirmed as faulting when execution is no longer speculative, so an early faulting translation of (d) would have to be restarted after the ordered-before edges have ensured that (a) is visible. However, in the case that the RW2 instruction faults, there is no read or write event, and if the fault is a translation fault, there is no physical address. One therefore has to ask what the meaning of ordered-before edges into RW2 is, especially for the parts of ordered-before dependent on physical addresses, such as coherence. The conclusion is that this should be only the non-physical-address parts of ordered-before into RW2, and in modelling one needs a "ghost" event to properly record what the dependencies would have been if it had succeeded. Note that this includes ordered-before to RW2 that ends with a data dependency into a write, even though that data would not normally be necessary for the translation.

Even with ETS, one might need an ISB on P1 if the new translation affects instruction fetch.

Translation-walk non-TLB reads from other-thread writes, guaranteed past, initially valid (other-thread TLBI completion): The following test has a read-only mapping for some physical address that is updated with a new writeable mapping to the same physical address, followed by a message-pass to another thread that attempts to write. There is no requirement for break-before-make here, as the output address has not changed, but TLB maintenance is required to ensure that the new writeable entry is guaranteed to be used by later translation reads.

Arm forbid the outcome where the STR faults due to a permission check. This is because the TLBI only completes once all instructions using any old translations which would be invalidated by the TLBI, on all other threads that the TLBI affects, have also completed, and the following DSB waits for that (the same-thread case is different; see §3.3). In practice this means that once the TLBI completes, one of the following holds: either the final STR has not performed its translation of x yet, and will be required to see the writeable mapping for its page-table entry (pte); or the STR has translated using the new writeable mapping; or the STR has already translated using the old read-only mapping, in which case we know that the STR has finished and performed its write, since the TLBI could not complete while it was still in progress. In that last case, if the STR has completed, then so must have the locally-ordered-before LDR, and that must have read 0. This explanation also covers the make-after-break case above, for non-ETS Case 2.

This is reflected in text to be included in future versions of the Arm ARM: A TLB maintenance operation [without nXS] generated by a TLB maintenance instruction is finished for a PE when:

1. all memory accesses generated by that PE using in-scope old translation information are complete.

	- (a) I1 uses the in-scope old translation information, and
	- (b) the use of the in-scope old translation information generates a synchronous data abort, and
	- (c) if I1 did not generate an abort from use of the in-scope old translation information, I1 would generate a memory access that RWx would be locally-ordered-before.

Translation-walk reads from same- and other-thread writes, forbidden past (break-before-make): Now we can finally return to the break-before-make sequence. Normal reads cannot read from the coherence-predecessors of the most coherence-recent write that is visible to them, but translation reads can read old (non-invalid) values from a TLB. To prevent this, and to ensure that a translation read sees a new page-table entry, one has to both ensure that any old TLB entries are invalidated, with a suitable TLBI, and that the new entry is visible to translation-walk non-TLB reads.

Armv8-A says [13, D5.10.1 (p2795)] "A break-before-make sequence on changing from an old translation table entry to a new translation table entry requires the following steps: (1) Replace the old translation table entry with an invalid entry, and execute a DSB instruction. (2) Invalidate the translation table entry with a broadcast TLB invalidation instruction, and execute a DSB instruction to ensure the completion of that invalidation. (3) Write the new translation table entry, and execute a DSB instruction to ensure that the new entry is visible.".
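
Those three steps, as an inline-assembly sketch under the same assumptions as the earlier unmap sketch:

```c
#include <stdint.h>

/* Break-before-make: change a live mapping for va to new_desc. */
static inline void bbm_remap(uint64_t *pte, uint64_t new_desc, uint64_t va)
{
    *pte = 0;                              /* (1) break: write invalid entry */
    asm volatile("dsb ish" ::: "memory");
    asm volatile("tlbi vae1is, %0"         /* (2) broadcast invalidation     */
                 :: "r"(va >> 12)
                 : "memory");
    asm volatile("dsb ish" ::: "memory");  /*     wait for its completion    */
    *pte = new_desc;                       /* (3) make: write the new entry  */
    asm volatile("dsb ish" ::: "memory");  /*     ensure it is visible       */
}
```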

Typically the write of an invalid entry and the TLBI would be on the same thread but, more generally, any shape of the following form should be forbidden (diagram elided): a P0 write of an invalid entry followed by a DSB, ordered (ob) before a P1 TLBI followed by a DSB, ordered before a P2 write of the new entry followed by a DSB (and possibly an ISB), with a subsequent translation-walk read Tx for an access of x, where the trf relation shows the page-table write Tx reads from. In other words, the sequence ensures that the write of the invalid entry, and any co-predecessor writes, are hidden behind the new page-table entry as far as new translations are concerned. Here the P0 DSB and the P0-to-P1 ob ensure the P0 write has propagated to memory before the P1 TLBI starts; the P1 DSB waits for that TLBI to have finished on all threads; the P1-to-P2 ob ensures that has happened before the new page-table-entry write starts; and the final DSB ensures the new write has reached memory and so is visible to translation before subsequent instructions. The P2 ISB is needed on non-ETS hardware, to force restarts of any out-of-order translations for po-later instructions; or (on any hardware) if P2=P1, to ensure any later translations on the TLBI thread are restarted; or if the new mapping affects instruction fetch.

This generalisation seems necessary, as a TLBI might be performed by a virtual CPU at EL1 which is interrupted and rescheduled by an EL2 hypervisor. One should be able to rely on the hypervisor doing a DSB on the same hardware thread as part of the context switch, and that has to suffice. It is sound because the DSBs and TLBI are all broadcast, though note that the DSB waiting for TLBI completion has to be on the same hardware thread as the TLBI.

Translation-walk non-TLB reads from other-thread writes, forbidden future: Above we saw that translation-walk non-TLB reads should not read from po-later writes. How should that be generalised to multiple threads? For the simplest example, consider the translation version of the LB (load-buffering) test, in which two threads translation-read from each other's po-future (iio relates translation reads to their accesses).

Standard LB shapes for normal accesses without dependencies are allowed in Armv8-A, but this example should be forbidden: until each translation is done, one cannot know that the first instruction on each thread will not abort, so one could not make the po-later write visible to the other thread without inter-thread roll-back. In other words, the possibility of translation aborts creates ordering rather like a control dependency from translation reads to po-later writes.

Multicopy atomicity of translation-walk non-TLB reads: The ARMv7 and early Armv8-A architectures for normal accesses were non-multicopy-atomic: a write could become visible to some other threads before becoming visible to all threads, broadly similar in this respect to the IBM POWER architecture [1,53]. This is one of the most fundamental choices for a relaxed memory model. In 2017 Arm revised their Armv8-A architecture to be multicopy-atomic (other multicopy-atomic, or OMCA, in their terminology), a considerable simplification [49,12]. However, there was no consideration at the time of whether this should also apply to the visibility of writes by translation-walk non-TLB reads, or of the force of the ARM statement that a translation table walk is considered to be a separate observer [13, D5.10.2 (p2808)].

For example, consider the following translation-read analogue of the classic WRC+addrs test (diagram elided), which would be forbidden in OMCA Armv8-A for normal reads. Suppose one has ETS, and the last-level page-table entries for x and y are initially invalid and not cached in any TLB. P0 writes a valid entry for x; P1 does a translation that sees that entry and then (via an address dependency) writes a valid entry for y; then P2 does a translation that sees that entry and then (via an address dependency) tries a translation for x. Is that last translation guaranteed to see the valid entry instead of faulting? The faulting outcome might be exhibited by a microarchitecture with a shared TLB between P0 and P1 (e.g. if they are SMT threads on the same core, or have a shared TLB for a subcluster). The tentative Arm conclusion is that this should be forbidden, to avoid software issues with unexpected aborts similar to those motivating ETS.

Now consider the above translation version of LB, generalising from po-future writes to other ob-future writes. For transitive combinations of reads-from and dependencies, it should clearly still be forbidden, to avoid needing inter-thread roll-back; but for ob including coherence edges (coe), one can imagine that a translation read could see a write before the coherence relationships are established, analogous to the weakness of coherence in the Power non-MCA model.

Discussion of these and other examples with Arm led to the tentative conclusion for Armv8-A that translation-walk non-TLB reads (like normal reads) do not see any non-OMCA behaviour. In other words, there is no programmer-visible caching that is observable to the translations of some non-singleton subset of threads but not to others.

#### 3.4 Further issues

Our discussions with Arm identified and clarified various other architectural choices, though for lack of space we cannot discuss them fully here, and our models do not cover them at present. To give a flavour: (1) Misaligned or load/store-pair instructions give rise to multiple accesses, which might be to different pages. Each has its own translation; the translations are not ordered w.r.t. each other, and there is no prioritisation of faults between them. As noted in §3.3, one might translate-read from the other, but not both simultaneously. (2) Normal registers act like a per-thread sequential memory, with reads reading from the most recent po-previous write, but the system registers that control translations can have more relaxed behaviour, requiring ISBs to enforce sequential behaviour. (3) The architecture requires, and OSs rely on, the fact that turning on the MMU does not need TLB maintenance. However, in a two-stage world, if Stage 1 is off, one is still using the TLB for Stage 2, so entries do get added to the TLB. When one later turns on Stage 1, it is essential that the entries added from those earlier Stage 2 translations are not used, so one has to regard them as belonging to a 257th ASID.

### 4 Virtual memory in the pKVM production hypervisor

Protected KVM, or pKVM [30,27,2], is currently being developed by Google to provide a common hypervisor for Android, improving compartmentalisation via a small trusted computing base (TCB) between the Linux kernel and other services. pKVM is built as a component of Linux. During boot, the Linux kernel hands over control of EL2 to the pKVM code, which constructs a memory map for itself and a Stage 2 memory map to encapsulate the Linux kernel. The Linux kernel thereafter runs only at EL1 (managing EL1&0 Stage 1 memory maps for itself and for user processes), as the principal guest, also known as the host (not to be confused with the host hardware). Other services can run as other guests, which are protected from the kernel and vice versa. The kernel remains responsible for scheduling, but context switching and inter-guest communication are done by hypervisor calls to the pKVM code at EL2. This gives us an ideal setting in which to examine the management of virtual memory by production code under Armv8-A relaxed-memory concurrency, with both one and two stages of translation (for EL2 and EL1&0 respectively). The pKVM codebase is small, so it is feasible to examine all uses of TLB management, and we benefit from discussions with the pKVM development team. We have manually abstracted the main pKVM relaxed-virtual-memory scenarios into 14 tests. To give a flavour of these, we give one test in detail, which also illustrates the general form of virtual-memory litmus tests; the others are described in the extended version.

In the simplest case where pKVM is just switching from one virtual CPU (vCPU) to another vCPU in a different VM, pKVM restores the per-CPU register state and sets the VTTBR with the new VMID. So long as the two vCPUs are using disjoint VMIDs there is no requirement for TLB maintenance.

This test, pKVM.vcpu\_run, is shown below (listing elided), typeset (lightly hand-edited) from the TOML input format of our Isla tool (§6.1). Here there is a single physical CPU, initially running a virtual machine VM1, with VMID 0x0001, at EL1. The section on the left defines the initial and all potential states of the page tables, and any other memory state. This test sets up separate translation tables for pKVM at EL2 (which has just a single stage) and for two VMs (each with two stages, Stage 2 controlled by pKVM and Stage 1 controlled by the VM). pKVM's own mapping hyp\_map maps its code. VM1's own Stage 1 mapping vm1\_stage1 maps virtual address x to ipa1, and the initial pKVM-managed Stage 2 mapping vm1\_stage2 maps that ipa1 to pa1, which implicitly initially holds 0. These page tables are described concisely by a small declarative language we developed, determining the page-table memory (here ∼30k) required for the Armv8-A page-table walks.

The top-right block gives the initial Thread 0 register values, including the various page-table base registers. The bottom-right blocks give the code of the test. This starts running at EL2, as one can see from the PSTATE.EL register value. The key assembly lines are annotated with the pKVM source line numbers they correspond to. To switch to run another virtual machine VM2, with VMID 0x0002, on this same physical CPU, pKVM changes VTTBR\_EL2 to the new vm2\_stage2 mapping and, as part of the context-switch register-file changes, restores TTBR0\_EL1 to the VM2's own Stage 1 mapping vm2\_stage1. The code then executes an ERET ("exception-return") instruction to return to EL1, and then tries to read x. The test includes a final assertion of the relaxed outcome that register x2=0, which could occur if the ldr translation used the old VM1 mapping instead of VM2's mapping. In this case that should not be allowed.
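
The core of the switch that the test exercises can be sketched as follows (illustrative, not pKVM's source; vm2_stage2 is assumed to carry VMID 0x0002 in VTTBR_EL2's VMID field, and vm2_stage1 is the guest's Stage 1 base):

```c
#include <stdint.h>

/* Switch Stage 2 to VM2 and restore its Stage 1 base, then return to
   EL1; with disjoint VMIDs, no TLB maintenance is required. */
static inline void switch_to_vm2(uint64_t vm2_stage2, uint64_t vm2_stage1)
{
    asm volatile("msr vttbr_el2, %0" :: "r"(vm2_stage2));
    asm volatile("msr ttbr0_el1, %0" :: "r"(vm2_stage1));
    asm volatile("eret");  /* back at EL1, the guest's ldr of x must now */
                           /* translate via VM2's tables, not VM1's      */
}
```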

Other tests capture more elaborate scenarios. For example, currently the host kernel manages VMIDs and assigns each VM its own VMID. If the host runs out of VMIDs to allocate to new vCPUs, it revokes all previously allocated VMIDs and re-allocates from the beginning, during which pKVM has to ensure that any old vCPUs' translations using a revoked VMID are expelled from all TLBs (pKVM.vcpu\_run.update\_vmid). If there is a concurrently executing vCPU using such a VMID, that vCPU must be paused until after the new VMID generation (and hence any required TLB maintenance), before continuing with a freshly allocated VMID (pKVM.vcpu\_run.update\_vmid.concurrent).

For another example, for pKVM to maintain the illusion that each vCPU is on its own core, the per-core state must be cleaned between running different vCPUs, including ensuring that translations for one vCPU are not cached and visible to another, even if they happen to be in the same VM (and using the same VMID) (pKVM.vcpu\_run.same\_vm).

### 5 Model

We now define a semantic model for Armv8-A relaxed virtual memory that, to the best of our knowledge, captures the Arm architectural intent for the scope laid out in §1 and discussed in §3, including Stage 1 and Stage 2 translation-table walks and the required TLB maintenance. For some important questions, most notably multi-copy atomicity, the Arm intent is currently tentative, so it is not possible to be more definitive. To capture just the synchronization required for "simple" software such as pKVM to work correctly, we also give a weaker model: instead of trying to capture the architecture or the behaviour of hardware exactly, it has individual axioms for each behaviour that such software needs to rely on. This gives an over-approximation of the architecture, which we prove sound with respect to the model given in this section. The two models together delimit the design space.

In §3 and §4 we described the design issues in microarchitectural terms, discussing the behaviour of TLB caching and translation-walk non-TLB reads, along with the needs of system software. We now abstract from microarchitecture: instead of explicitly modelling TLBs, we simply include a translation-read event for each read performed by architected translation-table walks, and define which writes each such translation-read can read from. We give the model in an axiomatic Herd-like [9] style, as an extension to the base Armv8-A semantics [26,49,13]. In principle it would be desirable to also have equivalent abstract-microarchitectural operational models, as for base Armv8-A [49,48] but with explicit TLBs for each thread and events for reading from and into the TLB. However, address translation introduces many more events to litmus-test executions, which would make them harder to explore exhaustively, and a proof of equivalence would be a major undertaking, so we leave this to future work.

The base Armv8-A axiomatic model is defined as a predicate over candidate executions, each of which is a graph with various events (reads, writes, barriers) and relations over them, notably the per-thread program order po, the per-location coherence order co, the reads-from relation rf from writes to reads, the address, data, and control-dependency (addr, data, ctrl) subsets of po, and others. The base model is essentially the conjunction of: an external (inter-thread) acyclicity property, effectively stating that the execution must respect some total order of events hitting the shared memory, constrained by the derived ordered-before (ob) relation; an internal acyclicity property, enforcing per-location coherence; and an atomic axiom for atomic and exclusive operations. As usual in Herd-style models, relations are suffixed e or i to restrict them to their inter-thread or intra-thread parts. The Herd concrete syntax for relational algebra uses [X] for the identity on a set X, ; for composition, ~ for complement, | and & for union and intersection, and \* for product.

We add translation data to events, including virtual, intermediate physical, and physical addresses (as determined by the translation regime). We add events for translation reads (**T**), TLB maintenance (**TLBI**), taking and returning from an exception (**TE** and **ERET**), and writing system registers (e.g. **MSR TTBR**). We modify the loc and co relations to relate events with the same physical address, and add a translation-reads-from relation trf relating each write W to the translation-reads T that read from it. To identify events with the same address we add same-va and same-ipa relations, relating events with the same virtual or intermediate physical address, and same-{va,ipa}-page for events in the same page. To identify events with the same address-space or virtual-machine ID, we use same-vmid and same-asid. The translation-read events within an instruction are related, in the order they appear in the sequential ASL/Sail execution, both to each other and to any memory access or fault event, by the iio ("intra-instruction order") relation. We derive the addr relation from a new primitive tdata relation, which relates read events to the events that use the read value in the translation or computation of an address. For convenience we define new event sets: **C** for all cache-maintenance operations (DC, IC, and TLBI instructions); **T\_f** for all translation-read events which read a descriptor that causes a fault; **W\_inv** for all write events which write an invalid descriptor; **Stage1** and **Stage2** for the **T** events which originate from the respective stage of translation; **ContextChange** for all context-changing events (such as writes to translation-controlling system registers); and **CSE** for all context-synchronizing events (taking and returning from exceptions, and ISB).
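As an example of reading this concrete syntax, the following clause (one of those of tob in Fig. 1 below) relates a translation-read to a program-order-later write of the same thread:

```
(* a translation-read T reaches, via intra-instruction order iio, the explicit
   read or write R|W of its instruction, and is thereby ordered before any
   po-later write W *)
[T] ; iio ; [R|W] ; po ; [W]
```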

The model is given in Fig. 1, in full except for the tlb-affects relation. Its basic form is very similar to previous multi-copy-atomic Armv8-A models. It still has external, internal, and atomic axioms, to which we add a translation-internal axiom ensuring that translations do not read from po-later writes.

```
let tlb-affects =
  (* see extended version *)
let TLB_barrier =
  ([TLBI] ; tlb-affects ; [T] ; tfr ; [W])^-1 & wco
let maybe_TLB_cached =
  ([T] ; trf^-1 ; wco ; [TLBI-S1]) & tlb-affects^-1
let tcache1 = [T & Stage1] ; tfr ; TLB_barrier
let tcache2 = [T & Stage2] ; tfr ; TLB_barrier
let speculative =
    ctrl
  | addr ; po
  | [T] ; instruction-order
(* translation-ordered-before *)
let tob =
    [T_f] ; tfre
  | ([T_f] ; tfri) & (po ; [DSB.SY] ; instruction-order)^-1
  | [T] ; iio ; [R|W] ; po ; [W]
  | speculative ; trfi
(* observed by *)
let obs = rfe | fr | wco | trfe
(* ordered-before TLBI and translate *)
let obtlbi_translate =
    tcache1
  | tcache2 & (iio^-1 ; [T & Stage1] ; trf^-1 ; wco^-1)
  | (tcache2 ; wco? ; [TLBI-S1])
    & (iio^-1 ; [T & Stage1] ; maybe_TLB_cached)
(* ordered-before TLBI *)
let obtlbi =
    obtlbi_translate
  | [R|W|Fault] ; iio^-1 ; (obtlbi_translate & ext) ; [TLBI]
(* context-change ordered-before *)
let ctxob =
    speculative ; [MSR]
  | [CSE] ; instruction-order
  | [ContextChange] ; po ; [CSE]
  | speculative ; [CSE]
  | po ; [ERET] ; instruction-order ; [T]
(* ordered-before a translation fault *)
let obfault =
    data ; [Fault & IsFromW]
  | speculative ; [Fault & IsFromW]
  | [dmbst] ; po ; [Fault & IsFromW]
  | [dmbld] ; po ; [Fault & (IsFromW|IsFromR)]
  | [A|Q] ; po ; [Fault & (IsFromW|IsFromR)]
  | [R|W] ; po ; [Fault & IsFromW & IsReleaseW]
(* ETS-ordered-before *)
let obETS =
    (obfault ; [Fault]) ; iio^-1 ; [T_f]
  | ([TLBI] ; po ; [dsb] ; instruction-order ; [T]) & tlb-affects
(* dependency-ordered-before *)
let dob =
    addr | data
  | speculative ; [W]
  | addr ; po ; [W]
  | (addr | data) ; rfi
  | (addr | data) ; trfi
(* atomic-ordered-before *)
let aob =
    rmw
  | [range(rmw)] ; rfi ; [A|Q]
(* barrier-ordered-before *)
let bob =
    [R] ; po ; [dmbld]
  | [W] ; po ; [dmbst]
  | [dmbst] ; po ; [W]
  | [dmbld] ; po ; [R|W]
  | [L] ; po ; [A]
  | [A|Q] ; po ; [R|W]
  | [R|W] ; po ; [L]
  | [F|C] ; po ; [dsbsy]
  | [dsb] ; po
(* Ordered-before *)
let ob = (obs | dob | aob | bob | iio | tob
  | obtlbi | ctxob | obfault | obETS)^+
(* Internal visibility requirement *)
acyclic po-loc | fr | co | rf as internal
(* External visibility requirement *)
irreflexive ob as external
(* Atomic requirement *)
empty rmw & (fre ; coe) as atomic
(* Writes cannot forward to po-future translates *)
acyclic (po-pa | trfi) as translation-internal
```
Fig. 1: Strong Model (with baseline Armv8-A model parts in gray)

Most of the changes to the model are in the external axiom, where we add several relations to ordered-before (ob): iio relates the intra-instruction events ordered by the ASL; tob ("translation ordered-before") ensures the order arising from the act of translation itself is respected; obtlbi orders translates and their explicit memory events with TLBIs which affect those translations; and ctxob ("context ordered-before") orders events which must come before some context-changing operation or after some context-synchronizing operation. We also add a generalised coherence-order relation, wco, an existentially quantified total order expressing when TLBIs complete with respect to writes.

Coherence: By making loc (and therefore rf and co) relate events with the same physical addresses, we get coherence over physical addresses rather than virtual. Coherence of writes to translation tables is expressed in two places: including trfe in obs captures the fact that translation-table reads from memory microarchitecturally come from the 'flat' coherent storage subsystem, and so the writes that they read from must have been propagated before the translation happened; and the translation-internal axiom forbids forwarding against program-order.

TLB maintenance and break-before-make: The obtlbi relation ensures that instructions whose translations read from writes which are "hidden" by some TLBI instruction are ordered before the completion of that TLBI. This is achieved by the two clauses of obtlbi: the first ensures the translation-before-TLBI ordering is preserved, and the second orders the explicit memory access of any such instruction with the same TLBI as the first clause. To do this, the model computes the set of writes which are in effect "barriered" by a given TLBI instruction. This is done with the tcache relations, which decide which TLBIs affect which translations by looking at the addresses each uses and the wco ordering between the TLBIs and the related writes.

To accurately match up each of the various TLBI instructions with the translations they may affect, we define a tlb-affects relation which relates TLBI events with the T events they are relevant to. We elide the full definition here, as it is simply the product of the enumeration of TLBI variants with the sets of translations that match the exception level, stage, address, ASID, or VMID given in the TLBI instruction. obtlbi\_translate then uses tlb-affects and wco to order any translations that read from 'stale' writes from before the invalidation with the TLBI that invalidated those writes. One notable subtlety here is in Stage 2 translations: since the TLB can store whole VA-to-PA mappings, we must check that the correct Stage 1 invalidations have been performed, in addition to the Stage 2 ones, to be able to order the Stage 2 translation with the TLBI.

Translation-table-walk reading from memory: As noted in §3.3, a translation which results in a translation fault must read from memory or be forwarded from program-order-earlier instructions, and those memory reads behave multi-copy atomically. In general, the only case in which the model can guarantee that such a memory read happens is when the read results in a translation fault, since entries that result in a translation fault cannot be stored in the TLB (§3.2). The model captures this succinctly by including [T\_f];tfr in ob.

In general, a translation-read is ordered after the write which it reads from, as captured by the inclusion of the trfe edge in ob; this is strong enough to ensure that TLB fills and faulting memory walks pull values out of the memory system in a coherent way, but still weak enough to allow other non-multi-copy-atomic behaviour such as forwarding.

As mentioned in §3.3, a DSB ensures that writes are propagated out to memory. For translations, this amounts to ensuring that a faulting translation cannot read from something older than a po-previous DSB-barriered write, as captured by the tob clause which says that a tfri edge from such a faulting translation must not have an interposing DSB.
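Concretely, this is the clause

```
([T_f] ; tfri) & (po ; [DSB.SY] ; instruction-order)^-1
```

of tob in Fig. 1: it selects the (faulting-translation, write) pairs in intra-thread tfr for which the write is po-before a DSB.SY that is instruction-order-before the translation; placing these in ob, together with the DSB's own ordering, forbids executions in which the translation reads from something older than the DSB-barriered write.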

Note that the absence of the full tfr relation in ob for non-faulting translations intentionally allows some incoherence, in essence allowing a translation-read to "ignore" a newer write.

Context-changing operations: In general, the sequential semantics takes care of the context, such as the current base-register and system-register state, for us. The ctxob relation simply ensures that such context-changing operations cannot be taken speculatively, and that context-synchronization ensures that all po-previous context-changing operations are ordered before po-later translations.

Detecting BBM Violations: As discussed in §3.2, we do not model in detail the bounded-catch-fire semantics that currently architecturally results from a missing break-before-make sequence, as that would make it hard to enumerate possible litmus-test executions. Instead, because what one normally wants to know for a litmus test is that it does not exhibit a BBM failure, we conservatively detect the existence of such violations and flag them for the user. This is achieved with a per-candidate-execution predicate, written in SMT, which looks for a situation that could be a break-before-make violation: it asserts that there does not exist a pair of conflicting writes with no interposing break-and-TLBI sequence. This approach is slightly over-approximate, as it may flag two writes that technically conflict even if (for other reasons) they are never used at the same time. This means that while we support programs that switch from one page table to another, we do not support programs that garbage-collect page-table memory and then repurpose it.
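In Cat-like notation, the shape of the check is roughly as follows (a schematic sketch only, not part of the model in Fig. 1; conflicting is a hypothetical relation pairing page-table writes that conflict in the sense above):

```
(* schematic: flag executions containing conflicting page-table writes with no
   interposing invalid-descriptor write (W_inv) followed by a TLBI, in
   generalised-coherence (wco) order *)
flag ~empty (conflicting & ~(wco ; [W_inv] ; wco ; [TLBI] ; wco)) as bbm-violation
```

The actual check is phrased directly in SMT over the candidate execution rather than in Cat.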

ETS: We discussed earlier the optional Armv8-A ETS feature, which provides additional ordering strength for translations. The intuition is that the model would have ghost events in the event that an instruction faults, representing the explicit read or write which would have happened had the instruction not faulted. The model would then have to compute a special variant of ob including such dependencies, but without the physical-address-dependent relations such as loc, rf, and co. Any edge in the version of ob with the ghost events would then become an edge in the real ob, but attached to the faulting translation. To capture this, our model produces fault events which have the correct dependencies (and fault information); the model orders each fault event with respect to the program-order-previous events which would have ordered with it, and places those orderings into ob. This involves manually adding [dmb] ; po ; [fault], addr ; po ; [fault & FromW], etc. to ob. The obETS relation then orders translations which result in a translation fault after anything the fault is ordered after.

Metatheory: To establish that our models provide a simple and sound abstraction we prove three theorems: that for static injectively-mapped address spaces, any execution which is consistent in the model with translation, erasing translation events gives an execution that is consistent in the original Armv8-A model without translation; that for any consistent execution in the original Armv8-A model, there is a corresponding consistent execution in our extended model with translations; and that our weak model is a sound over-approximation of our full translation model, i.e., that for any consistent execution in our full translation model, that same execution is consistent in the weak translation model.

### 6 Tooling

#### 6.1 Isla-based model evaluation

Making relaxed-memory semantics exhaustively executable is essential for exploring their behaviour on examples [66,54,53,20,9,36,65,23,63,49,56]. Handling relaxed virtual memory brings several new challenges. First, even just the sequential definition of Armv8-A address translation, with the page-table walk and its options, is remarkably intricate, defined in thousands of lines of Arm's ASL instruction description language. Manually reimplementing a simplified version would be error-prone and incomplete, so we instead build on our Isla tool [15], which integrates the full 123,000-line Armv8-A ISA semantics (as defined by Arm in ASL and automatically translated into Sail [14]) with SMT-based tooling to evaluate tests w.r.t. axiomatic concurrency models. Previously, Isla supported only "user" models, expressed in a relational-algebra language similar to the Cat language of Herd [9].

Second, previous litmus tests typically involved only a few abstract memory locations and events, but even simple virtual-memory tests require 30kB of page tables, each "user" memory access might involve 24 or more page-table accesses, and each 64-bit descriptor may be represented by a symbolic value representing all possible states that descriptor can be in. To avoid overwhelming the SMT solver during symbolic execution, the formula representing each symbolic descriptor is created dynamically when read. When encoding the final SMT problem that decides whether a candidate execution is allowed, we ensure that only the parts of the page tables actually used by that candidate execution are included. We also implemented a model-specific optimization that removes irrelevant translation events which cannot affect the result of the test, improving performance by a factor of 13 on average, and up to 90 times for some tests. Third, we had to provide a convenient way to express the page-table configuration for each test, using the declarative language of which we saw a small part on the left-hand side of the §4 test.

[Figure: an Isla-generated execution for a WRC test like that of §3.3.] A good user interface is essential; the figure shows how uninteresting translation events can be suppressed in the output to avoid overwhelming noise.

The main result is that, in the strong model, all 214 litmus tests and 14 pKVM tests are allowed or forbidden as intended, based on our discussions with Arm of their architectural intent, except for two pKVM tests which time out. Additionally, we checked that the weak model never forbids any test allowed by the strong model. The tool's performance makes it eminently usable in practice: most tests take around 1 minute, and the full set of litmus tests can be run in less than 2 hours of CPU time on a 36-core Intel Xeon Gold 6240.

We also ran our model on an existing suite of "user" litmus tests, including 1927 additional generated tests, with a constant identity-mapped page table, and checked that the results match RMEM [31] and the official Armv8-A model [26,49,13].

#### 6.2 Experimental testing of hardware

Validation of the models through experimental testing has been a vital part of past relaxed-memory semantics [24,54,3,8], and it is equally important here. However, experimental testing of the concurrent aspects of virtual memory is a far harder problem: such tests need access to privileged parts of the instruction set; they need to set up and use their own exception handlers, which prevents building these tools on top of a standard OS such as Linux; and Stage 2 tests and bare-metal Stage 1 tests require direct access to hardware, preventing the use of hypervisors such as KVM around the harness. To achieve this, we built a harness that runs bare-metal on Armv8 devices and can run Stage 1 (but, as yet, not Stage 2) concurrent virtual-memory litmus tests; it can be found at https://github.com/rems-project/system-litmus-harness. At present it and Isla use different test formats, so we have some tests manually written in both.

We ran tests on three devices with standard Arm cores (A53, A72). The data we collected suggests that in practice, aside from known errata, these cores: respect coherence over physical locations; correctly implement TLB maintenance; are multi-copy atomic w.r.t. translation-table walks; and generally do not disagree with our model, except in one instance where we observed an anomalous result, which is under discussion with Arm.

Further testing on other platforms would be desirable, but our emphasis in this work is principally on exploring the design space and capturing the architectural intent, and the main validation is from discussion with the Arm Chief Architect, who ultimately is responsible for determining what the architecture is. In this context, experimental data serves mainly to provide reassurance that some envisaged architecture strength is not invalidated by extant hardware implementations.

### 7 Related work

There is extensive previous work on "user" relaxed-memory semantics for modern architectures, but very little extending this to cover systems aspects such as virtual memory. We build on the approaches established in "user" models for x86, IBM Power, Arm, and RISC-V, combining executable-as-test-oracle models, discussion with architects, and experimental testing [54,5,7,47,55,53,21,52,46,9,36,31,32,49,64].

Arm publish a machine-readable version of their Armv8-A relaxed memory model [45], in the Cat language of the Herd7 tool [6], but that model does not currently cover relaxed virtual-memory semantics. Independent work in progress by Alglave et al. similarly aims to characterise this, and to update Arm's published model in due course, but with scope complementary to the current paper: it includes hardware updates of access and dirty bits, but is not integrated with the full ASL/Sail instruction semantics and its multiple levels and stages of translation. Both efforts have been informed by discussion with senior Arm staff, and one would hope to synthesise the understanding in future. Hossain et al. [39] develop an "estimated" model for virtual memory in x86 (which has a much less relaxed base semantics) in a broadly similar axiomatic style. Tao et al. [61] axiomatise six conditions for weak data-race-freedom that should be satisfied by Armv8-A kernel code that uses virtual memory in simple ways, together with an extension of Promising-Arm [50] that effectively builds in these conditions; they extend the sequential verification of the SeKVM hypervisor by Li et al. [43] to show it satisfies these conditions. That paper does not attempt to characterise the exact guarantees provided by the Armv8-A architecture, or discuss the issues of our §3; a foundational model such as that of our §5 would let one ground such results on the actual architecture. Simner et al. [56] study relaxed instruction-fetch semantics.

Several works give non-relaxed-memory semantics for Arm or x86 address translation, more or less simplified and with or without TLBs: Bauereiss [14], Goel et al. [34,35], Syeda and Klein [57,59,58,60], Degenbaev [29] (used for verification of a hypervisor shadow pagetable implementation [42,28,11,10]), Barthe et al. [19,17,18,16], Tews et al. [62], Kolanski [41], and Guanciale et al. [38].

### 8 Acknowledgments

We thank Arm Ltd. for its support of Simner's PhD and the wider project of which this is part. We thank the Google pKVM development team, especially Will Deacon, Quentin Perret, Andrew Scull, Andrew Walbran, and Serban Constantinescu, for discussions on pKVM, and the Google Project Oak team, Ben Laurie, Hong-Seok Kim, and Sarah de Haas, for their support. We thank Luc Maranget for comments on a draft.

This work was partially funded by an Arm/EPSRC iCASE PhD studentship (Simner), Arm Limited, Google, ERC Advanced Grant (AdG) 789108 ELVER, and the UK Government Industrial Strategy Challenge Fund (ISCF) under the Digital Security by Design (DSbD) Programme, to deliver a DSbDtech enabled digital platform (grant 105694).

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Verified Security for the Morello Capability-enhanced Prototype Arm Architecture

Thomas Bauereiss<sup>1</sup> , Brian Campbell<sup>2</sup> , Thomas Sewell<sup>1</sup> , Alasdair Armstrong<sup>1</sup> , Lawrence Esswood<sup>1</sup> , Ian Stark<sup>2</sup> , Graeme Barnes<sup>3</sup> , Robert N. M. Watson<sup>1</sup> , and Peter Sewell<sup>1</sup>

> <sup>1</sup> University of Cambridge, Cambridge, UK first.last@cl.cam.ac.uk
> <sup>2</sup> University of Edinburgh, Edinburgh, UK first.last@ed.ac.uk
> <sup>3</sup> Arm Ltd., Cambridge, UK first.last@arm.com

Abstract. Memory safety bugs continue to be a major source of security vulnerabilities in our critical infrastructure. The CHERI project has proposed extending conventional architectures with hardware-supported capabilities to enable fine-grained memory protection and scalable compartmentalisation, allowing historically memory-unsafe C and C++ to be adapted to deterministically mitigate large classes of vulnerabilities, while requiring only minor changes to existing system software sources. Arm is currently designing and building Morello, a CHERI-enabled prototype architecture, processor, SoC, and board, extending the high-performance Neoverse N1, to enable industrial evaluation of CHERI and pave the way for potential mass-market adoption. However, for such a major new security-oriented architecture feature, it is important to establish high confidence that it does provide the intended protections, and that cannot be done with conventional engineering techniques.

In this paper we put the Morello architecture on a solid mathematical footing from the outset. We define the fundamental security property that Morello aims to provide, reachable capability monotonicity, and prove that the architecture definition satisfies it. This proof is mechanised in Isabelle/HOL, and applies to a translation of the official Arm specification of the Morello instruction-set architecture (ISA) into Isabelle. The main challenge is handling the complexity and scale of a production architecture: 62,000 lines of specification, translated to 210,000 lines of Isabelle. We do so by factoring the proof via a narrow abstraction capturing essential properties of arbitrary CHERI ISAs, expressed above a monadic intra-instruction semantics. We also develop a model-based test generator, which generates instruction-sequence tests that give good specification coverage, used in early testing of the Morello implementation and in Morello QEMU development, and we use Arm's internal test suite to validate our model.

This gives us machine-checked mathematical proofs of whole-ISA security properties of a full-scale industry architecture, at design-time. To the best of our knowledge, this is the first demonstration that that is feasible, and it significantly increases confidence in Morello.

### 1 Introduction

Memory safety bugs continue to be a major source of security vulnerabilities, responsible for around 70% of those addressed by Microsoft security updates, and around 70% of the high-severity bugs impacting Chromium [30,14]. Their root causes are well-known legacy design choices and limitations of normal practice: pervasive use of systems programming languages that do not enforce memory protection; hardware that enforces only coarse-grained protection, using virtual memory; and test-and-debug development methods that cannot provide high assurance. These are baked into the critical systems codebase across the industry, and the result, in today's adversarial environment, is that programming errors can often lead to exploitable vulnerabilities.

There are many possible approaches to improving this situation, including the development of safer programming languages, techniques for full functional-correctness verification, and better bug-finding tools. Each is the subject of much research in programming languages and semantics, and all are worthwhile, but the legacy investment, the need for systems code to work close to the machine, and the inability of bug-finding to provide high assurance have made it very hard to radically improve mass-market systems.

Another path, less well explored, is to change the architectural interface to provide hardware mechanisms that enable better enforcement of memory protection. Over the last twelve years, the CHERI project [1] has been extending conventional hardware Instruction-Set Architectures (ISAs) with new architectural features to enable fine-grained memory protection and highly scalable software compartmentalisation. The CHERI memory-protection features allow historically memory-unsafe programming languages such as C and C++ to be adapted to have quite different semantics, replacing many unpredictable undefined-behaviour (UB) cases with predictable fail-stop traps, to provide strong and efficient protection against many currently widely exploited vulnerabilities. Crucially, this requires only minor changes to the sources of existing systems software. The CHERI scalable compartmentalisation features enable the fine-grained decomposition of operating-system (OS) and application code, to limit the effects of security vulnerabilities.

CHERI provides these via hardware support for unforgeable capabilities: in a CHERI ISA [54], instead of using simple 64-bit machine-word virtual-address pointer values to access memory, restricted only by the memory management unit (MMU), one can use 128+1-bit capabilities that encode a virtual address together with the base and bounds of the memory it can access. Encoding these within the capability enables a fast access-time check, faulting if there is a safety violation. A one-bit tag per capability-sized and aligned unit of memory, cleared in the hardware by any non-capability write and not directly addressable, ensures capability integrity by preventing forging, and the ISA design lets code shrink capabilities but never grow them. This architectural mechanism, along with additional sealed-capability features for secure encapsulation, can be used by programming language implementations and systems software in many ways.

Previous academic work on CHERI has developed CHERI-MIPS and CHERI-RISC-V architectures, FPGA processor implementations, and system software including adaptations of Clang/LLVM, linkers, debuggers, FreeRTOS, FreeBSD, and WebKit. The CHERI processor prototypes implement techniques such as compressed capability bounds [58], and a tag controller and cache [26] required to implement memory tagging on off-the-shelf DRAM. The software prototypes use CHERI's architectural features to implement memory-safe CHERI C/C++ programming languages [55], fine-grained spatial memory safety [15], heap temporal memory safety [15], and scalable software compartmentalisation [57]. An analysis of vulnerabilities reported to the Microsoft Security Response Center (MSRC) in 2019 suggested that CHERI memory safety would have deterministically mitigated 30%–70% of them, depending on the usage scenario [27], and porting the FreeBSD kernel and userspace to CHERI required changes to only 0.18% and 0.04% of LoC respectively. Analysis of an open-source desktop stack [53] estimated a 73.8% vulnerability-mitigation rate through a combination of memory protection and software compartmentalisation requiring a 0.026% LoC change.

Achieving widespread adoption of any substantial new architectural feature is also challenging, of course, but the issues differ from those for adoption of a new high-level programming language. It needs coordinated hardware and software change, which is hard to arrange, but on the plus side there are very few architecture vendors, so if a feature becomes (say) part of the mainline Arm architecture, and there is pull from major partners, then it will be implemented in all conforming Arm implementations and become ubiquitously available in devices. For CHERI, the academic results are encouraging, but achieving such adoption first needs an industry-scale evaluation of a high-performance silicon processor implementation and a software stack above it, to demonstrate viability and enable that pull. This is beyond what can be done academically, but hard to justify as a purely commercial project. The 2019–24 UKRI Digital Security by Design (DSbD) challenge resolves this chicken-and-egg difficulty with a combined public-sector and industry (£70m+117m) programme to build and evaluate such a demonstration platform, and to support research and development above it [52].

Arm, supported in part by DSbD, is currently designing and building Morello, a CHERI-enabled prototype architecture, processor, system-on-chip (SoC), and development board, extending the Armv8.2-A architecture and the high-performance Neoverse N1 processor [6,8]. The Morello processor and SoC implement the CHERI ISAv8 protection model, and utilise CHERI's compressed capability bounds and tagged-memory approaches. As of 2021-01, the architecture, emulators, initial development boards with Morello silicon, and initial software toolchains have all been developed. This will allow evaluation of the CHERI mechanisms in a variety of configurations and use cases on a state-of-the-art hardware platform, and paves the way for the potential adoption of CHERI into future production architectures and devices.

In this paper, we describe work to put the Morello architecture and its security properties on a solid mathematical footing from the outset, and to use semantics to ease conventional engineering.

Fig. 1. From Morello ASL source (blue) to auto-generated artifacts (yellow) and verification outcomes (green)

For a new architecture that aims to provide security guarantees, it is especially important to provide high assurance that it actually does. Otherwise, any security flaw in the architecture will be present in any conforming hardware implementation, quite likely impossible to fix or work around after deployment, and the resulting loss of confidence might make further adoption impossible.

For Morello, this is challenging in two ways. First, CHERI needs to be deeply integrated into each base architecture it gets adapted to, most obviously by modifying all virtual-memory-accessing instructions to check bounds and permissions of capabilities, and by adding instructions to explicitly manipulate capabilities, but also in more subtle ways relating to exceptions, virtualisation, and so on. Second, the architecture specification is large and complex. The base Armv8-A architecture is defined in an 8200-page manual [7], to which the Morello architecture supplement adds 1200 more [8]. Fortunately, Arm have recently shifted to using an executable version of their ASL language for instruction-set architecture specification [40,41]. The sequential behaviour is all defined in ASL, and this is what appears in instruction descriptions and auxiliary functions (e.g. for capability compression and address translation) in the documentation. However, it remains very large, 62 000 non-whitespace lines of specification (LoS), and ASL does not itself have a mechanised semantics.

The main intended security property of the Morello architecture is reachable capability monotonicity, with the intuition that the available capabilities cannot be increased during normal execution (i.e., they are monotonically decreasing). This is a whole-system property about arbitrary machine execution, and conventional techniques cannot provide high assurance that the architecture satisfies it. Instead, it needs proof. We translate the Arm ASL definition via the Sail [9] language into Isabelle/HOL [39], extending previous work for Armv8-A, and give a mechanised statement and proof that the property holds of the architecture.

We deal with the challenge of scale by factoring the proof via a narrow abstraction: four relatively simple properties of arbitrary CHERI instruction execution that capture essential aspects of their behaviour. Our intra-instruction semantics focusses on the behaviour of instructions in isolation, interacting with registers and memory, rather than viewing each thread as a single state machine; this monadic interface lets us conveniently express these abstract-CHERI properties of instructions in terms of their register and memory effects. We prove capability monotonicity for arbitrary sequences of instructions above this abstraction, and we instantiate the abstraction for Morello and prove that its many instructions satisfy the required properties. Manual proof effort was required for a number of helper functions defined in the architecture for manipulating and using capabilities, but the bulk of the architecture is handled by automatic proof tools and tactics. Previous work by Nienhuis et al. [38] proved similar results for the much simpler and smaller (6k LoS) CHERI-MIPS architecture with a different approach, manually defining a larger set of abstract actions and proving that those do abstract the instruction semantics. That let one capture instruction intentions more explicitly, but needed more ad hoc machinery, while the new approach we follow here handles the 10x scale-up successfully.

Our proof was developed while the architecture and hardware design were still evolving, using weekly snapshots of Arm's ASL specification, with our automation letting us quickly adapt to changes. This let us identify a number of bugs that could be fixed before the architecture and hardware were finalised.

To validate the ASL-to-Sail translation of the Morello specification, we used the C emulator automatically generated from the Sail model to compare it against Arm's internal Architecture Compliance Kit (ACK) test suite.

Finally, we developed a test generator, using the Isla symbolic execution tooling for Sail [10], to automatically generate interesting instruction-sequence tests, aiming at good specification coverage. These complemented Arm's test suite and were used by Arm as part of their pre-tape-out validation, and were used as the main test suite for development of a Morello version of the QEMU emulator. This helped uncover some bugs in our own tooling as well as discrepancies between different Morello models and emulators. We also used Isla and an earlier Sail-to-SMT flow for quick checking of properties of capability compression.

To summarise, our contributions are:


This gives us machine-checked mathematical proofs of whole-ISA security properties of a full-scale industry architecture, at design-time. To the best of our knowledge, this is the first demonstration that that is feasible, and it significantly increases confidence in Morello.

The main proof took only around 24 person-months, by two people between 2020-03 and 2021-07, following around 23 person-months of preliminary work to get the model into usable Sail and Isabelle forms, to develop our CHERI abstraction in the context of earlier CHERI architectures, and on our Sail-to-SMT flow. Test generation and ACK validation took an additional 17 person-months, including Morello-specific work on Isla. This suggests that such proof could be not just technically but also economically viable for new architecture design, particularly as doing this routinely, as an established flow, would reduce the effort substantially.

As a side benefit, our well-validated Morello semantics is reusable for future software or hardware verification. The Armv8-A ISA is, along with x86, one of the two most important low-level programming languages, and if Morello is successful, then one would expect CHERI extensions to be similarly widely used.

Sail and Isabelle versions of the Morello specification, as well as our definitions and proofs, are available online [3].

Non-goals and limitations (1) Our results establish confidence that the Morello instruction-set architecture design satisfies its fundamental intended security properties. We do not address correctness of the Morello hardware implementation of that architecture, which would be an extremely challenging hardware-verification task, and we do not cover system components that are not specified by the ISA itself, e.g. the Generic Interrupt Controller (GIC). (2) The architecture, as usual, expresses only functional-correctness properties, not timing or power properties, to allow hardware implementation freedom. Properties and proofs about the architecture therefore cannot address side channels, but see [56] for discussion of side channels and CHERI. (3) We consider only the sequential architecture. Studying concurrency effects would require a more complex system model integrating the Morello sequential semantics with a whole-system concurrency memory model, which we leave to future work, but we expect the capability properties to be largely orthogonal to concurrency issues, as long as the write of a capability's body and tag appears atomic. (4) We assume an arbitrary but fixed translation mapping. CHERI capabilities are in terms of virtual addresses, so system software that manages translations has to be trusted or verified. We also assume that the privileged capability-creation instructions are disabled and no external debugger is active, because these features can in general be used to circumvent the capability protections, as discussed in §5.1. (5) Our capability monotonicity property is the most fundamental property one would expect to hold of a CHERI architecture, but it is by no means the only such property. However, stronger properties typically involve specific software idioms, e.g. calling conventions or exception handlers, and their proofs use techniques that have not yet been scaled up to full architectures. We return to this in §8. (6) We prove monotonicity of the Morello specification formally in Isabelle; however, our proof depends on an SMT solver as an oracle for one lemma, as discussed in §5. (7) Our conversion from ASL via Sail to Isabelle is not itself subject to verification, as neither ASL nor Sail has an independent formal semantics; their semantics is effectively defined by this translation. However, it is nontrivial, and there is the possibility of mismatches with the Sail-generated C emulator used for validation; we do not attempt to verify that correspondence. (8) The ASL specification is subject to the limitations documented by Arm in [7, Appendix K14], e.g. with respect to implementation-defined behaviour.

### 2 Overview of the Morello CHERI Architecture

CHERI is an architectural protection model that extends ISAs with a new data type, the architectural capability [54]. The Morello architecture adds CHERI capabilities to Armv8.2-A, the ISA implemented by the Neoverse N1 CPU on which the Morello hardware implementation is based [8].

#### 2.1 CHERI Capabilities on Morello

CHERI capabilities are twice the natural address size of the architecture plus an out-of-band tag bit, which is not independently addressable; for Morello, capabilities are 128+1 bits. The lower 64 bits are the "value", which in most cases represents a virtual address. The upper 64 bits encode metadata, including bounds, permissions, and other mechanisms. The tag provides integrity protection: it is preserved only by legitimate operations on capabilities, and cleared by others. A capability can only be used as such, e.g. for a dereference, if its tag is set.
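Schematically, a Morello capability can be pictured as follows (our illustrative view only: in the real architecture the metadata is a compressed encoding, described next, and the tag is stored out of band rather than as an addressable field):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 128+1-bit capability layout (schematic, not Arm's definition) */
struct capability_view {
    uint64_t value;    /* lower 64 bits: the value, usually a virtual address */
    uint64_t metadata; /* upper 64 bits: compressed bounds, permissions,
                          object type / seal state */
    bool     tag;      /* the "+1" bit: validity tag, preserved only by
                          legitimate capability operations, cleared by others */
};
```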

A sophisticated compression scheme allows a capability to include 64-bit lower and upper virtual-address bounds, encoded into 87 bits in total, with 56 of those shared with the value field (see [8, §2.5.1],[58] for details). Small regions can be described precisely, with an arbitrary size in bytes, while for larger regions, only certain bounds and sizes are expressible. The capability value must be either within the bounds or within a certain range above or below, allowing for common C idioms that transiently construct (but do not dereference) slightly out-of-bounds pointers; other combinations of value and bounds are not representable. This scheme trades off bounds precision for reduced capability size: supporting arbitrary bounds would require more than 128+1 bits per capability, which would have unacceptable performance costs.
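The essence of such schemes can be glossed in floating-point style (our simplification; see [8, §2.5.1] and [58] for the actual encoding): the bounds are stored as short mantissas $B$ and $T$ at a granularity $2^E$, with the remaining high-order bits shared with the value field, so that

```latex
\mathit{base} = B \cdot 2^{E}
\qquad
\mathit{top}  = T \cdot 2^{E}
```

With $E = 0$, small regions get byte-precise bounds; a larger region forces a larger $E$, and hence a base and top that are multiples of $2^E$.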

Four of the 18 permission bits are reserved for software, while the others have architecturally defined meaning. The Load, Store, and Execute permissions control whether a capability can be used for loading or storing data or fetching instructions. Permission bits for loading and storing capabilities, as opposed to data, also exist. The System permission controls access to system registers and operations, in addition to the access control mechanisms of the base Arm architecture. Capabilities can also be sealed, making them immutable and unusable for anything but branching to them; this allows controlled transitions between different security domains. Sealing (or unsealing) a capability requires an authority capability with the Seal (or Unseal) permission; more on this below.

#### 2.2 Capabilities in Registers and Memory

Morello extends the Armv8-A general-purpose integer register file, as well as certain control and status registers, from 64 bits to 128+1 bits. Memory is extended with a tag bit for each 128-bit sized and aligned unit of DRAM.

The Program Counter (PC) is extended to become a Program-Counter Capability (PCC), constraining instruction fetch as well as PC-relative loads (e.g., of global variables). A new Default Data Capability (DDC) special register controls and transforms memory accesses relative to machine-word pointer values by legacy (non-capability) instructions, for legacy code using integer pointers.

#### 2.3 Capability-aware Instructions

Morello extends Armv8-A with new instructions and modifies existing instructions to use and respect capabilities. For example, a Load capability (literal) instruction LDR <Ct>,<label> calculates an address from the PCC value and an immediate offset, loads a capability from memory, and writes it to capability register Ct [8, §4.4.76]. If the PCC capability does not have the load permission, or the calculated address is outside its bounds, a capability fault exception is raised. The tag of the PCC capability is also checked (as part of instruction fetching). Most other instructions authorise loads and stores via a capability in an explicitly identified register, or use DDC, rather than implicitly use PCC.

Conventional execution flow is also controlled by capabilities, with branch instructions to capability destinations (or implicitly w.r.t. the PCC for legacy instructions). Here too the capability must have its tag set and the target virtual address must be within the bounds, and in this case it must authorise execution.

Then there are instructions to access and manipulate the fields of a capability, including arithmetic on its virtual-address value field (corresponding to conventional pointer arithmetic), comparisons, and other operations to extract and manipulate its permissions and other data.

#### 2.4 Domain Transition

CHERI distinguishes between sealed and unsealed capabilities. An unsealed capability can be used directly (e.g. to load and store), but a sealed capability can only be used to request actions be taken by other software. This feature can be used in the context of protection domains or software compartments, in which whole subsystems are given access to a limited subset of memory.

Domain X may have no direct authority over domain Y, but may call into domain Y by invoking one or more sealed capabilities originally sealed by (or for) Y. The invocation installs unsealed versions of the invoked capabilities in registers. This always includes replacing the current PCC, thus performing a jump to a specific code entry point provided by domain Y. These domain transitions are non-monotonic and must be treated specially in our proof.

Variations on this sealing and invocation mechanism enable slightly different calling styles. When sealing capabilities, they can be labelled with an object type, if the authorising capability has that object type in its bounds. The "branch to sealed capability pair" instruction invokes a given code capability and also an argument data capability, checking their object types match, providing object-style encapsulation. Three kinds of specialised sentry (sealed entry) capabilities may be used transparently by direct branch instructions, memory-indirect branch instructions, and memory-indirect branch-to-pair instructions, respectively.

#### 2.5 Exceptions and the Memory Management Unit

In addition to compiler-facing instructions, system functionality such as virtual memory, cache management, and exception handling is also extended, e.g. adding new exception cause codes, and page-table permission bits for loading or storing capabilities. Because exception handling is able to restore reserved registers during exception-level transitions, it is also a form of domain transition, as reserved registers may contain capabilities not available to the executing code.

#### 2.6 Using CHERI in Software

For context, we sketch how CHERI's capability mechanisms are used by software to control and constrain execution. The CHERI team has adapted a large open-source software stack to CHERI, including the LLVM compiler, linkers, debuggers, multiple OSs, and application suites. The verification in this paper is motivated by this software usage, but is itself purely about the architecture.

One of the main uses of capabilities is fine-grained memory protection. Spatial memory safety is achieved in CHERI C/C++ by implementing explicit pointers (those visible in the language, e.g. variables with pointer type) and implied pointers (used by the generated code and runtime, e.g. the stack pointer, PLT entries, and Global Offset Table pointers) with capabilities instead of conventional machine-word integers. These are protected from corruption or reinjection by the CHERI tag mechanism and monotonicity, and hence the memory contents they point to are protected by the capability permission and bounds checks, so long as no other capabilities give undesired access to them. This relies on compiler-generated code, the kernel, the run-time linker, and the C runtime (e.g., the heap allocator) narrowing capability bounds and permissions during execution as appropriate. This protects against many cases in which a C/C++ coding error could lead to an exploitable vulnerability.
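For example (a minimal sketch of the programmer-visible effect, assuming a CHERI C toolchain in which the allocator returns suitably bounded capabilities):

```c
#include <stdlib.h>

int main(void) {
    char *buf = malloc(16); /* under CHERI C, buf is a capability whose bounds
                               the allocator narrows to the 16-byte allocation */
    buf[42] = 1;            /* out of bounds: the store's capability check fails,
                               raising a deterministic capability fault instead
                               of silently corrupting adjacent memory (UB) */
    free(buf);
    return 0;
}
```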

Temporal memory safety, additionally protecting against reuse-after-reallocation errors, is not directly supported by the architecture, but there are a variety of techniques to implement it, especially for heap memory, using CHERI's features [22]. Morello extends the page-table mechanism to allow capability flow to be tracked through memory, supporting revocation of old capabilities.

The other main use of CHERI is software compartmentalisation, splitting the address space into different compartments running separate software. The capability monotonicity property ensures these components are contained in their compartment boundaries. Domain transitions are possible via the sealed capability mechanism, which can be used to set up various inter-compartment interfaces. Often these transitions will all be to a privileged control component, but the architecture also supports direct transition between two mutually distrusting pieces of code. Various software models are supported, from implementing fast inter-process IPC to sandboxed libraries within processes.

```
1 function clause __DecodeA64 ((pc, ([bitone,bitzero,bitzero,bitzero,bitzero,bitzero,
2 bitone,bitzero,bitzero,bitzero,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_]
3 as __opcode)) if SEE < 99) = {
4 SEE = 99; let imm17 = Slice(__opcode, 5, 17); let Ct = Slice(__opcode, 0, 5);
5 decode_LDR_C_I_C(imm17, Ct) }
6
7 val decode_LDR_C_I_C : (bits(17), bits(5)) -> unit
8 function decode_LDR_C_I_C (imm17, Ct) = {
9 let 't = UInt(Ct);
10 let offset : bits(64) = SignExtend(imm17 @ 0b0000, 64);
11 execute_LDR_C_I_C(offset, t) }
12
13 val execute_LDR_C_I_C : forall ('t:Int),(0<='t & 't<=31). (bits(64),int('t)) -> unit
14 function execute_LDR_C_I_C (offset, t) = {
15 CheckCapabilitiesEnabled();
16 let base : VirtualAddress = VAFromCapability(PCC);
17 let address : bits(64) = Align(VAddress(base) + offset, CAPABILITY_DBYTES);
18 VACheckAddress(base, address, CAPABILITY_DBYTES, CAP_PERM_LOAD, AccType_NORMAL);
19 data : bits(129) = MemC_read(address, AccType_NORMAL);
20 let data : bits(129) = CapSquashPostLoadCap(data, base);
21 C_set(t) = data }
22
23 val VACheckAddress : forall ('size : Int).
24 (VirtualAddress, bits(64), int('size), bits(64), AccType) -> unit
25 function VACheckAddress (base, addr64, size, requested_perms, acctype) = {
26 c : bits(129) = undefined;
27 if VAIsBits64(base) then { c = DDC_read() }
28 else { c = VAToCapability(base) };
29 __ignore_15 = CheckCapability(c, addr64, size, requested_perms, acctype) }
30
31 val CheckCapability : forall ('size : Int).
32 (bits(129), bits(64), int('size), bits(64), AccType) -> bits(64)
33 function CheckCapability (c, address, size, requested_perms, acctype) = {
34 let el : bits(2) = AArch64_AccessUsesEL(acctype);
35 let 'msbit = AddrTop(address, el);
36 let s1_enabled : bool = AArch64_IsStageOneEnabled(acctype);
37 addressforbounds : bits(64) = address; [...7 lines setting addressforbounds...]
38 fault_type : Fault = Fault_None;
39 if CapIsTagClear(c) then { fault_type = Fault_CapTag }
40 else if CapIsSealed(c) then { fault_type = Fault_CapSeal }
41 else if not_bool(CapCheckPermissions(c, requested_perms))
42 then { fault_type = Fault_CapPerm }
43 else if (requested_perms & CAP_PERM_EXECUTE) != CAP_PERM_NONE
44 & not_bool(CapIsExecutePermitted(c)) then { fault_type = Fault_CapPerm }
45 else if not_bool(CapIsRangeInBounds(c, addressforbounds, size[64 .. 0]))
46 then { fault_type = Fault_CapBounds };
47 if fault_type != Fault_None then {
48 let is_store : bool = CapPermsInclude(requested_perms, CAP_PERM_STORE);
49 let fault : FaultRecord = CapabilityFault(fault_type, acctype, is_store);
50 AArch64_Abort(address, fault) };
51 return(address) }
```
Fig. 2. Sample Morello instruction semantics, in Sail, for parts of the LDR (literal) instruction [8, §4.4.76] for loading a capability from a PCC-relative address. Lines 1–5 are the relevant opcode pattern-match clause. That calls the decode function on Lines 7–11, which calls the execute function on Lines 13–21. That uses auxiliary function VACheckAddress (Lines 23–29) to check that the PCC capability (wrapped in a VirtualAddress structure) has the right bounds and permissions, raising an exception otherwise (Lines 47–50). MemC\_read (Line 19) performs the load, and CapSquashPostLoadCap (Line 20) performs additional checks, in particular clearing the tag of the loaded capability if the authorising capability does not have capability load permission.

### 3 Concrete Semantics of Morello

The basis for our verification and validation work for Morello is the ISA specification written by Arm in their ASL language. It includes sequential semantics of the capability mechanisms and instructions, along with all of the Armv8-A AArch64 base architecture and its extensions supported by Morello, e.g. floating point and vector instructions, system registers, exceptions, user mode, system mode, hypervisor mode, some debugging features, and virtual memory address translation. In total, the Morello ASL specification is around 62 000 non-whitespace lines, covering 409 instructions, 1050 encodings, 600 automatically generated accessor functions for reading and writing system registers, and 1500 additional helper functions. Arm provided weekly snapshots of the ASL specification while it was being developed.

ASL is a first-order imperative language with exceptions. Originally a paper-only language, it was made executable by Reid et al. [40,41]. It supports bitvectors of computed sizes, but bitvector indexing is not statically checked; it also supports mathematical integers and some limited structured types. The Arm documentation provides an informal description of the language [7, Appendix K14], but no formal semantics. We obtain a formal semantics of Morello by translating the ASL specification into Sail [9], a similar language but with a richer type system and open-source tooling, and thence into Isabelle/HOL, at 90 000 and 210 000 lines of specification (LoS) respectively. Fig. 2 shows parts of the Sail semantics for the Morello LDR (literal) instruction for loading a capability from a PCC-relative address. Even for this one instruction, this is just the tip of the iceberg of the whole semantics: the MemC\_read involves all of address translation, and the call graph of the definitions shown amounts to 7 300 lines of Sail.

We reused the existing open-source Sail tooling and ASL-to-Sail translation [9,10] mostly as-is, with only minor improvements and some engineering work needed to handle Morello. In addition to the Isabelle definitions, we generate a C emulator for validation (§6) using the Sail tool, and we reuse the Isla symbolic execution engine for Sail [10] to generate tests (§7).

### 4 Abstract Formal Model of Capability Monotonicity

The main challenge in proving whole-ISA security properties of Morello is the scale and complexity of the model. Rather than a direct proof above the 210 000 line Isabelle specification, we factor the proof via an abstraction (instantiated for Morello in §5) that captures the essential properties of arbitrary instruction behaviour in any CHERI ISA. It has to spell out aspects of CHERI in some detail, e.g. the different kinds of non-monotonic domain transitions (cf. §2.4), but it abstracts away ISA details not directly relevant for capability monotonicity.

#### 4.1 ISA Abstraction

The abstraction is defined as properties of an arbitrary sequential ISA semantics, encoded in a monadic type with a trace semantics that exposes the individual register and memory effects of instructions. This interface was originally designed to connect Sail ISA semantics to relaxed memory models, but we found the factorisation via effects useful for reasoning even in a simple sequential setting.

The monad essentially corresponds to a free monad over an effect datatype. It is parameterised with a return type 'a, an exception type 'e, and a sum type of register value types 'regval (automatically generated by Sail for each ISA):

```
type M 'regval 'a 'e =
  ...
```
Finished outcomes either indicate successful termination with a return value a (denoted as Done a), an exception (Exception e) which can be caught using a try\_catch combinator, or a failure (Fail msg), e.g. due to a failed assertion. Effect outcomes carry a continuation that expects a response and returns the next monadic outcome. Monadic return wraps a value in Done, while bind just nests the outcomes without interpreting the effects. We also define a corresponding type of events, e.g. E\_read\_reg (with only concrete values, not continuations), along with an effect trace semantics for monadic expressions. We define our requirements on CHERI ISAs in terms of constraints on these traces in §4.4.
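As a rough illustration of this structure, the following OCaml sketch is a cut-down analogue of the outcome monad with just two effect constructors; the constructor names and types are ours, not the Sail-generated Isabelle definitions.

```
(* Cut-down analogue of the outcome monad, for illustration only.
   Done/Fail/Exception are finished outcomes; effect outcomes carry a
   continuation expecting the response to the effect. *)
type ('regval, 'a, 'e) outcome =
  | Done of 'a                 (* successful termination *)
  | Fail of string             (* failure, e.g. a failed assertion *)
  | Exception of 'e            (* catchable via a try_catch combinator *)
  | Read_reg of string * ('regval -> ('regval, 'a, 'e) outcome)
  | Write_reg of string * 'regval * (unit -> ('regval, 'a, 'e) outcome)

let return a = Done a

(* bind nests outcomes without interpreting the effects *)
let rec bind m f =
  match m with
  | Done a -> f a
  | Fail msg -> Fail msg
  | Exception e -> Exception e
  | Read_reg (r, k) -> Read_reg (r, fun v -> bind (k v) f)
  | Write_reg (r, v, k) -> Write_reg (r, v, fun () -> bind (k ()) f)
```

An effect trace then records, for each effect outcome, the concrete response fed to its continuation, which is exactly what the corresponding event type (e.g. E\_read\_reg) captures.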

#### 4.2 CHERI ISA Parameters

In addition to the ISA semantics themselves, our properties are parameterised on aspects of the ISA relevant to CHERI. This includes names of special registers, in particular the program counter capability register PCC, the invoked data capability register IDC (capability register 29 on Morello, r31 on CHERI-RISC-V), registers holding capabilities to exception handlers (VBAR\_ELn on Morello), and privileged registers requiring system register access permission.

Moreover, we need to know which instructions may perform sealed capability invocations, as this potentially constitutes a non-monotonic security domain transition. We model this as functions taking an instruction identifier and an effect trace of a particular execution, and returning, respectively, the directly or indirectly invoked sealed capabilities in the trace. For example, the Morello BRS instruction invokes the sealed capabilities in its two input registers, and other branch instructions can also invoke sealed capabilities if they are sentries.

Finally, the mapping from virtual to physical memory addresses is captured by a pure partial function taking a virtual address and a (partial) instruction execution trace, from which it can extract the required information about the address mapping to determine the physical address, if any. This is needed because capabilities are in terms of virtual addresses, but the memory effects produced by the ISA semantics are in terms of physical addresses, so we need a way to translate between those when formulating requirements on memory accesses in the abstract model. We also assume another function as a parameter to distinguish memory operations that happen as part of an in-memory translation table walk, as the constraints on them differ from those on other memory operations.
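Concretely, one can picture these parameters as a record over the ISA semantics; this is a hedged OCaml sketch with field names of our choosing, whereas the actual development is an Isabelle locale with more structure.

```
(* Hedged sketch of the CHERI ISA parameters; all names are illustrative. *)
module StringSet = Set.Make (String)

type ('cap, 'instr, 'event) cheri_isa_params = {
  pcc : StringSet.t;               (* program counter capability register *)
  idc : StringSet.t;               (* invoked data capability register *)
  vbar : StringSet.t;              (* exception handler base registers *)
  privileged_regs : StringSet.t;   (* require system register access permission *)
  invoked_caps : 'instr -> 'event list -> 'cap list;           (* direct *)
  invoked_indirect_caps : 'instr -> 'event list -> 'cap list;  (* indirect *)
  translate : int64 -> 'event list -> int64 option;  (* virtual -> physical *)
  is_translation_event : 'event -> bool;  (* part of a table walk? *)
}
```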

#### 4.3 Capability Abstraction

We capture capabilities in the abstract model via a typeclass that provides methods for accessing the various fields of capabilities, as well as sealing and unsealing operations. We also define a notion of derivability that serves as an upper bound on the capability manipulations that instructions are normally allowed to perform. Starting from a set of capabilities C, e.g. provided as inputs to an instruction, the set of capabilities derivable from C is defined inductively as the smallest set that contains C itself as well as capabilities obtained from other derivable ones via one of the following:


Of these operations, unsealing is the only one that may grant new privileges that are not already granted by the input capabilities. However, unsealing requires specific authority. An operating system, for example, can control what capabilities a user-space process can unseal by only handing out unsealing authority capabilities with a limited set of object types in their bounds.
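A rough OCaml module-signature analogue of the capability typeclass is sketched below; the method names are ours, and the Isabelle version additionally fixes the derivability rules over these operations.

```
(* Illustrative signature for the capability abstraction. leq is the
   bounds/permissions restriction ordering used by the derivability rules. *)
module type CAPABILITY = sig
  type t
  val is_tagged : t -> bool
  val is_sealed : t -> bool
  val get_obj_type : t -> int64
  val get_perms : t -> int64
  val get_base : t -> int64
  val get_limit : t -> int64
  val seal : t -> int64 -> t     (* seal with a given object type *)
  val unseal : t -> t
  val clear_tag : t -> t
  val leq : t -> t -> bool       (* monotone restriction: c' leq c *)
end
```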

#### 4.4 CHERI ISA Intra-instruction Properties

Our abstraction is defined as the conjunction of four instruction-local properties. They are relatively straightforward to verify for a concrete ISA, and we will describe the proof for Morello in §5. At the same time, the properties imply the whole-ISA property of reachable capability monotonicity, as explained in §4.5. Hence, they serve as a useful intermediate abstraction layer for structuring the overall proof.

The central security guarantee that CHERI ISAs aim to provide is that software cannot forge capabilities and thereby escalate its privileges. Hence, we require that instructions only produce capabilities via the above derivation rules, except for the effects of well-defined transition mechanisms for switching control to another security domain.

Property 1 (Capability register writes). In any execution trace of a single instruction, for every write of a tagged capability to a register at a given point in the trace, one of the following holds:


The first case permits the normal operation of instructions, manipulating capabilities according to the above derivability rules. We allow instructions to use their available capabilities in these operations, which normally includes capabilities read from registers or loaded from memory up to the given point in the trace, with some exceptions: First, capabilities read from privileged registers are unavailable unless the system access permission is also available, i.e. if a tagged and unsealed capability with that permission has been read from PCC before. Second, we exclude capabilities loaded as part of translation table walks, as those loads are not subject to capability checks (although none of the existing CHERI ISAs attempt to load capabilities during translation table walks). Third, capabilities used in a domain transition, e.g. capabilities loaded from memory as part of an indirect sealed capability invocation, are unavailable for normal operations and handled separately by the other cases of Property 1 as follows.

The sealed capability invocation case applies when the capability being written is an invoked capability of the current instruction, as declared when instantiating the CHERI ISA abstraction (see §4.2). Such an invocation performs a branch to the unsealed code capability by writing it to the PCC register, and possibly writes an unsealed data capability to IDC. One of the following cases must hold, representing the different supported kinds of capability invocation:


The ISA exception case is signalled in the Morello model by the helper function AArch64.TakeException throwing a (Sail language) exception after setting up the branch to the exception handler. In this case, we allow a capability to the exception handler to be read from a privileged exception handler base register and written to PCC, even if system register access permission is not available. However, the definition of available capabilities together with our properties guarantee that this capability is not used for any other operations.

```
let store_cap_reg_axiom ISA has_ex invoked_caps invoked_indirect_caps t =
 let use_mem_caps = (invoked_indirect_caps = {}) in
(∀ i c r . (writes_to_reg_at_idx i t = Just r ∧ c ∈ (writes_reg_caps_at_idx ISA i t))
   −→
   (∗ Only store monotonically derivable capabilities to registers ∗)
   (cap_derivable (available_caps ISA use_mem_caps i t) c ∨
   (∗ ... or perform one of the following non-monotonic register writes: ∗)
   (∗ Exception ∗)
   (has_ex ∧ c ∈ exception_targets_at_idx ISA i t ∧ r ∈ ISA.PCC) ∨
   (∗ Capability pair invocation ∗)
   (∃ cc cd. ((c ≤ (unseal cc) ∧ r ∈ ISA.PCC) ∨ (c ≤ (unseal cd) ∧ r ∈ ISA.IDC)) ∧
     cap_derivable (available_caps ISA use_mem_caps i t) cc ∧
     cap_derivable (available_caps ISA use_mem_caps i t) cd ∧
     invokable cc cd ∧ c ∈ invoked_caps) ∨
   (∗ Direct sentry invocation ∗)
   (∃ cs. c ≤ (unseal cs) ∧ is_sentry cs ∧ is_sealed cs ∧ r ∈ ISA.PCC ∧
     cap_derivable (available_caps ISA use_mem_caps i t) cs ∧
     c ∈ invoked_caps) ∨
   (∗ Indirect sentry invocation (writing the unsealed sentry to IDC) ∗)
   (∃ cs. c ≤ (unseal cs) ∧ r ∈ ISA.IDC ∧ is_indirect_sentry cs ∧ is_sealed cs ∧
     cap_derivable (available_reg_caps ISA i t) cs ∧
     c ∈ invoked_indirect_caps) ∨
   (∗ Indirect capability (pair) invocation ∗)
   (∗ (writing the loaded capability/capabilities to PCC/IDC) ∗)
   (∃ c0. ((c ≤ (unseal c0) ∧ is_sealed c0 ∧ is_sentry c0 ∧ r ∈ ISA.PCC) ∨
      (c ≤ c0 ∧ r ∈ (ISA.PCC ∪ ISA.IDC))) ∧
      cap_derivable (available_mem_caps ISA i t) c0 ∧
      c ∈ invoked_caps ∧ invoked_indirect_caps ≠ {})))
```
We formalise Property 1 as a predicate on traces, given in Fig. 3. It takes a number of arguments that we instantiate using the CHERI ISA parameters of §4.2, e.g. with *invoked\_caps* set to the capabilities that the given instruction invokes in the given trace. The predicate details the different cases (and invocation subcases) of Property 1 for all capabilities written to registers, using helper definitions such as available\_caps or invokable (checking permissions and object types of a pair of sealed capabilities).

The other three properties state that capabilities stored to memory must be derivable from available capabilities (here there are no non-monotonic exception cases), and that accesses to memory or privileged registers must be authorised by capabilities with sufficient permissions and bounds.

Property 2 (Capability stores). Every tagged capability stored to memory at a given point in an execution trace of a single instruction is derivable from the available capabilities at that point in the trace.

Property 3 (Privileged registers). Reads from or writes to privileged registers in an execution trace of a single instruction happen only after a tagged and unsealed capability with system register access permission has been read from PCC, unless an ISA exception is raised in the trace and the event is a read from an exception handler base register.

Property 4 (Memory accesses). For every load or store event at a given point in an execution trace of a single instruction, there is a tagged capability available at that point in the trace that authorises the memory operation (further explained below), unless the event is part of a translation table walk. The authorising capability must be unsealed, unless it is an indirect sentry capability being invoked in this trace and the event is a load. If the event is a load or a store of a tagged capability, then the address must be aligned to the capability size.

The authorising capability for memory accesses must be tagged and have the right bounds and permissions: the latter must include load/store permission, and there must be a virtual address range covered by the bounds of the capability that translates to the physical address range covered by the memory event. Loading/storing capabilities (and not just untagged data) requires additional permission bits. The authorising capability must also normally be unsealed; the only allowed case of using a sealed capability for a memory operation is the invocation of an indirect sentry capability. In that case, Property 1 allows the loaded capability (or pair of capabilities) to be written to PCC (or IDC). However, due to the definition of available capabilities, the loaded capabilities will in this case be unavailable for other purposes. Only capabilities loaded via unsealed authorising capabilities can be used for regular operations.
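The shape of this authorisation condition can be rendered as a simplified, self-contained OCaml predicate. The capability record and helpers below are minimal stand-ins rather than Morello's encoding, and the real property quantifies over virtual ranges that translate to the event's physical footprint rather than taking a single virtual address as given.

```
(* Simplified sketch of Property 4's per-access authorisation check;
   every name below is an assumption of this sketch. *)
type cap = {
  tag : bool; sealed : bool; indirect_sentry : bool;
  perms : int64;                 (* permission bits *)
  base : int64; limit : int64;   (* bounds, as 64-bit addresses *)
}

(* does c grant all the bits of p? *)
let has_perm c p = Int64.logand c.perms p = p

let authorises ~(translate : int64 -> int64 option) c ~vaddr ~len ~perm
    ~sentry_invocation_load =
  c.tag
  && (not c.sealed || (sentry_invocation_load && c.indirect_sentry))
  && has_perm c perm
  && c.base <= vaddr
  && Int64.add vaddr (Int64.of_int len) <= c.limit
  && (match translate vaddr with
      | Some _paddr -> true   (* must map onto the event's physical footprint *)
      | None -> false)
```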

In addition to the instruction semantics, our ISA models also contain ASL/Sail code defining instruction fetch and decode behaviour. We use this for generating emulators, but also for stating the whole-ISA monotonicity theorem below with respect to multi-instruction traces produced by a fetch-decode-execute loop. For the fetch segments of these traces, we require the same properties to hold as for individual instruction execution traces, with the only difference being in the authorisation of memory loads: we assume that instruction fetching only loads instructions from memory, so we do not allow instruction fetching to perform capability memory loads, and we require that it checks for the execute rather than the load permission in the authorising capability.

#### 4.5 Capability Monotonicity Theorem

The above single-instruction properties are sufficient to prove a whole-ISA monotonicity theorem for reachable capabilities. This set of reachable capabilities for a given state of the system is defined inductively as the smallest set that includes:


This set is intended to provide an upper bound on the set of capabilities that software can construct (on its own) when starting execution in the given state, and the monotonicity theorem confirms that it is indeed an upper bound.
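The closure can be pictured as a least fixed point; in the illustrative OCaml sketch below, derive and loadable_via stand in for the derivation rules and for capability loads authorised by an already-reachable capability, while the real definition is stated inductively in Isabelle rather than computed.

```
(* Illustrative least fixed point for the reachable-capability set. *)
let rec reach ~derive ~loadable_via caps =
  let fresh =
    List.filter (fun c -> not (List.mem c caps))
      (derive caps @ loadable_via caps)
  in
  if fresh = [] then caps else reach ~derive ~loadable_via (fresh @ caps)
```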

We assume a sequential setting and state the theorem with respect to executions of a sequential fetch-decode-execute loop; reasoning about concurrent behaviour is beyond the scope of this paper. Executing an effect trace $t$ from a state $s$ leading to a state $s'$, written $s \xrightarrow{t} s'$, is possible if the register and memory contents in read events along the trace $t$ correspond to the last written values, if any, or the contents in the initial state $s$ otherwise, and if $s'$ results from $s$ by updating register and memory contents with the values in $t$.

Proving the instruction-local properties of the last subsection for a concrete ISA might also require certain architecture-specific assumptions. We allow the specification of both a capability invariant that is preserved by capability derivation and assumed to hold initially, and a predicate on traces capturing further assumptions, e.g. about system registers. We say that an architecture is a CHERI ISA if all possible traces of instruction execution and fetching that satisfy the architecture-specific trace assumptions, and that read only capabilities satisfying the architecture-specific capability invariants, satisfy the properties of §4.4. Reachable capability monotonicity then holds for executions of arbitrary sequences of instructions, unless and until a transition to another security domain occurs via an ISA exception or sealed capability invocation.

Theorem 1 (Reachable Capability Monotonicity). Let $t = t_{f_1} \cdot t_{e_1} \cdot t_{f_2} \cdot t_{e_2} \cdot \ldots$ be a trace of the fetch-decode-execute loop of a CHERI ISA, alternating fetch/decode traces $t_{f_i}$ and instruction execution traces $t_{e_i}$, and let $s$ be a state such that $s \xrightarrow{t} s'$. If all of the following hold:


the set of capabilities reachable in $s'$ is a subset of the capabilities reachable in $s$.

This guarantees that software cannot escalate its privileges by forging capabilities that are not reachable from the starting state. Non-monotonic changes in the set of reachable capabilities are limited to the specific mechanisms defined above for transferring control to another security domain, i.e. ISA exceptions or sealed capability invocations, installing capabilities belonging to the new domain in the PCC (and possibly IDC) register. The monotonicity guarantee stops before such a domain transition happens. Sealed capability invocations within a security domain are monotonic, however; the theorem does cover capability invocation instructions, e.g. branch instructions taking sentry capabilities, if the unsealed invoked capability is reachable in the current security domain (condition 5 above). The translation invariance assumption (condition 4) rules out non-monotonicity due to the interpretation of capabilities changing when the memory mapping changes. It is assumed to hold for the duration of the given intra-domain trace, but after a domain transition and return, e.g. a system call, one could continue using this theorem with a modified translation mapping.

The proof of Theorem 1 starts with an induction on the number of instructions in the trace. For each individual subtrace $t$ of an instruction fetch or execution with $s \xrightarrow{t} s'$, we show that the available capabilities at any point in $t$ are reachable in $s$, as the definition of available capabilities excludes non-monotonic cases and only includes capabilities that are accessed with suitable permission due to the properties we require. Hence, state updates along $t$ leading to $s'$ (only writing available or invoked, but reachable, capabilities due to the requirements and assumptions) are monotonic.

### 5 Proof of Capability Monotonicity in Morello

#### 5.1 Instantiation of the Abstract Model

In order to instantiate Theorem 1 for Morello, we instantiate the parameters of the abstract model, e.g. the set of privileged registers or the concrete capability representation. We do not currently instantiate the address translation mapping, effectively treating address translation as a black box and assuming an arbitrary but fixed partial mapping, together with a predicate on events to capture assumptions on register and memory contents, under which the mapping produced by the ASL address translation code is guaranteed to coincide with the given mapping. A candidate for instantiating this is the purely functional characterisation of address translation presented in [9, §8] and proved correct there for the base Armv8.3 architecture, under some assumptions about control registers. Using this would also allow (and require) us to substantiate the translation invariance assumption of Theorem 1. In particular, since the translation control registers are protected by the system register access permission, code running without that permission and without write access to the in-memory translation tables cannot modify the translation mapping.

For the monotonicity proof, the main architecture-specific assumption we make is that two privileged system features that could be used to violate monotonicity are inactive: external debuggers, and the experimental instructions SCTAG and STCT that allow setting tags of arbitrary capability bit patterns. Hence, we make assumptions on the contents of certain control registers to disable these (e.g. EDSCR.STATUS = 2 to model non-debug state); the tag setting instructions can also be disabled by removing the system access permission.

The capability invariant that we assume in the initial state is that bounds do not go beyond the 64-bit address space and that their length is non-negative, e.g. to rule out memory accesses that wrap around the edge of the address space. There exist capability encodings that violate this property, but the only way to generate them on Morello is via the tag setting instructions or an external debugger, which we assume to be disabled.

We also assume that the PCC capability is initially unsealed, if it is tagged, which the ASL code relies on in a few places. We proved this as an invariant after a bug we found in a branching helper function (see §5.4) was fixed.

Finally, we have to limit certain kinds of "constrained unpredictable" behaviour. For example, the LDP instruction loads a pair of words into two destination registers. However, if the same register index is used for both destination register arguments to the instruction, then it is left underspecified what value is written to the destination register, if any. One might expect this to be either the original register value or one of the loaded values, but Morello inherits from the base Armv8-A architecture the specification that the register value may be set to an architecturally UNKNOWN value in such cases. For capabilities, the Morello specification [8] further constrains this in rule TSNJF: "If an UNKNOWN value is written to a capability register or to capability-tagged memory, the write does not increase the Capability defined rights available to software." We formalise this by adding an assumption that, in traces for which we want to use the monotonicity theorem, all UNKNOWN capabilities used (appearing in traces in nondeterministic choice events) are reachable from the initial state of the trace.

#### 5.2 Manual Proofs about Capability Encoding Functions

We have to prove that the various functions that make changes to the concrete 129-bit capability representation (as used by the instruction semantics) do so in a monotonic way. The challenging aspect is the compressed capability bounds encoding introduced in [58] and used by Morello (as opposed to the version of CHERI-MIPS targeted by previous verification work [38], which used a simpler, uncompressed 256+1-bit encoding). The compression scheme allows the capability address value and both bounds, three 64-bit values, to be encoded in less than 128 bits. This exploits the fact that in well-behaved code the address should be within the bounds or nearby, so the bounds can be expressed as smaller offsets from it. They are encoded in a floating-point style, with an exponent and a floating "mantissa" window. Typical smaller capabilities have precise bounds, but large capabilities require aligned bounds, to save encoding space; the encoding uses various optimisations to maximise precision [58], [8, §2.5.1].
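The idea can be sketched as follows, deliberately simplified: a bound keeps only an mw-bit mantissa placed at exponent e, with the remaining upper bits taken from the address, so at larger exponents only coarser-aligned bounds are representable. The real scheme's carry/correction terms and precision optimisations [58], [8, §2.5.1] are omitted, and all names are ours.

```
(* Toy floating-point-style bound decode; not Morello's actual encoding. *)
let decode_bound ~e ~mw ~(addr : int64) (mantissa : int64) : int64 =
  let upper = Int64.shift_right_logical addr (e + mw) in
  Int64.shift_left (Int64.logor (Int64.shift_left upper mw) mantissa) e

(* e.g. mw = 3, e = 4: decode_bound ~e:4 ~mw:3 ~addr:0x1234L 0b101L
   computes upper = 0x24L, then ((0x24L lsl 3) lor 0b101L) lsl 4 = 0x1250L;
   only 16-byte-aligned bounds are representable at this exponent. *)
```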

We initially SMT-checked the encoding functions using Sail's existing SMT backend. This provided early design feedback, including discovering an issue in the CapSetBounds function (see §5.4).

When moving from SMT checks to Isabelle proofs that can be integrated into the overall proof, one challenging function is CapIsRepresentableFast, which checks that an update to the capability value by an offset does not change the decoding of the bounds. It is important for performance that this check is done quickly. This fast version only considers the offset arithmetic within the mantissa window, making pessimistic assumptions about overflow/underflow in lower bits. We can prove that this check is sufficient, using algebraic methods in Isabelle/HOL without bit-blasting or SMT proofs.

The most challenging function for us to verify is called CapSetBounds, and is used to narrow capability bounds. The function checks that the requested new bounds fit monotonically in the existing bounds. It also picks an appropriate exponent, aligns to that exponent, and encodes an updated capability.

The main complication is that aligning the bounds to an exponent changes the length slightly, which may be an increase that requires a higher exponent.

The core argument for monotonicity here is non-trivial: the chosen alignment is the minimum one at which bounds enclosing the requested bounds can be encoded. Since the original capability also enclosed this range, its alignment cannot be less than this minimum, so the bounds of the original capability are already aligned to the selected exponent. This finally implies that coercing the requested bounds to the selected exponent does not move them across the original bounds. Part of the proof of this lemma involved a brute-force split into cases for all possible selected exponents, reducing each case to SMT bitvector lemmas which we pass to the CVC4 SMT solver [11]. This relies on the solver as an oracle, as replay of bitvector proofs in Isabelle is only experimental. Initial work on the CHERI compression scheme [58] included HOL4 proofs about these two functions, but this is the first time the crucial monotonicity proof has been done for the set-bounds function.
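The retry structure at the heart of this argument can be sketched as follows; the names and the max_len formula are illustrative, not Morello's actual CapSetBounds.

```
(* Rounding the requested bounds outward to exponent e can grow the length
   past what e can encode, forcing a retry at a larger exponent. *)
let mask e = Int64.sub (Int64.shift_left 1L e) 1L
let align_down x e = Int64.logand x (Int64.lognot (mask e))
let align_up x e = align_down (Int64.add x (mask e)) e

(* largest length encodable with an mw-bit mantissa at exponent e *)
let max_len ~mw e = Int64.shift_left (mask mw) e

let rec set_bounds ~mw base limit e =
  let b = align_down base e and t = align_up limit e in
  if Int64.sub t b <= max_len ~mw e then (e, b, t)
  else set_bounds ~mw base limit (e + 1)
```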

#### 5.3 Proof Engineering

With the model instantiation and lemmas about auxiliary functions in place, the remaining task is to prove that the rest of the ISA uses these functions correctly and satisfies the properties defined in §4.4. We tackle this using a combination of custom proof tactics within Isabelle and an external tool that automatically generates lemmas about the functions and instructions in the architecture. This simple approach worked sufficiently well that we were able to keep up with weekly snapshots of the ASL specification while it was being developed. Re-running the lemma generation tool mostly worked without affecting the existing manually written parts of the proof, with only a few exceptions, e.g. when a refactoring of the (crucial) VACheckAddress function broke some lemmas about it.

The generated lemmas are stated in terms of predicates that reformulate the properties of §4.4 into properties of partial traces, taking an additional parameter that summarises the capabilities available at the start of this part of the trace. This allows us to split up an instruction proof into proofs that the auxiliary functions satisfy the properties and that they are used correctly, e.g. that a function performing a memory store is only called if a suitable authorising capability is available. Most of these proofs are automatically handled by straightforward proof tactics, but our tooling allows manually overriding specific parts of generated lemmas where necessary. We do this for about 100 of the ASL functions and instructions, generally taking the form of small patches, e.g. giving additional hints to the proof tactics, such as additional simplification rules or loop invariants, or adding side conditions to lemma statements, such as assumptions about capability checks for memory-accessing helper functions. The tool outputs the generated lemmas in theory files which are then checked by Isabelle; hence, the external tool does not need to be trusted. The proof consists of around 37 000 generated lines, 8 600 manually written lines, as well as 8 900 lines for the abstract model, monotonicity proof, and proof tools. The proof executes in 7hrs 20mins CPU time on an i7-10510U CPU at 1.80GHz, but only 3hrs 23mins real time thanks to parallel execution, with peak memory consumption of 18GB.

#### 5.4 Bugs and Issues Found

Our verification work uncovered several bugs and issues in the ASL specification.

During our initial SMT-checking of the capability manipulation helper functions, we discovered a previously unknown bug in the top-byte normalisation logic of the CapSetBounds function. It could have caused some of the top bits of a capability's lower or upper bound to change when some of their lower bits were modified, even if the requested bounds were within the original bounds of the input capability, thereby violating monotonicity.

Our Isabelle proof uncovered a bug in the BranchToCapability function where the branch target capability was modified without a check that it is unsealed. Hence, branch instructions could have modified sealed capabilities. The result would not have been directly available to the code that performed the branch, because the modified sealed capability would be installed into PCC, and the subsequent instruction fetch would fault with a sealed capability exception, but as part of exception handling the modified sealed capability would then have been written to the CELR register and become accessible to the exception handler.

Another issue we found was a case of missing capability checks in the implementation of the DC ZVA instruction. This would have allowed software to overwrite memory regions with zeros without capability authorisation.

We also found various issues that were already known to Arm, e.g. the STP instruction checking the tag of the wrong capability, as well as functional bugs not directly affecting our proof of security properties, e.g. a bug in the LDNP and STNP instructions where the wrong memory access type was used.

We reported all of our findings to Arm, and the issues have been fixed.

### 6 Validating the Concrete Semantics

Confidence in our results about Morello's security properties relies on our translation of the specification (from ASL into Sail and Isabelle) accurately reflecting the intended architecture. A key part of ensuring that hardware designs implement Arm architectures correctly is testing against Arm's internal Architectural Compliance Kit (ACK); to validate our translation we ran a large collection of tests from the Morello ACK against a Sail-generated C emulator. This approach was also taken with an earlier AArch64 Sail model [9]. These tests are typically self-contained executables that can be run directly after processor reset without an operating system or peripherals, except for a simple serial device for reporting results and diagnostic information. Each test executes tens or even hundreds of thousands of instructions, so using our fast C emulator was essential.

The ACK covers Morello-specific functionality alongside the relevant parts of the base Armv8.2 architecture in more than 25 000 tests. Its scope is wider than the ASL model's, including features such as performance counters, debug, and tracing, where the ASL has only interfaces or partial information, leaving the detailed specification to prose descriptions. There are also tests for the generic interrupt controller (GIC), a distinct system-on-chip component with a separate specification which is not part of the ISA. Moreover, for the Morello-feature suites, the "implementation defined" behaviour expected by the tests is more constrained than normal, to match the single Morello hardware design.

To manage this complexity we first obtained baseline results from a Morello Arm Fast Model simulator, without the additional support normally used in the ACK testing environment. This matches the contents of the ASL specification more closely. We then excluded tests which required features that are not fully modelled, and adjusted the "implementation defined" portions of the specification to approximate the hardware. By comparing the results from our Sail generated emulator against the baseline we could identify and repair faults in both the ASL specification and our translation. Repairing these issues was important both to ensure that our understanding of the problem was correct and to ensure that tests could run to completion to rule out further issues.

Specific issues that we encountered involved minutiae about how system register bits behave when features are not present (such as AArch32 instructions), a couple of missing cases in our built-in operations used by SIMD instructions, a variable shadowing issue in our translation tools, corner cases in the ASL specification handling of page table capability tracking, and a few exception handling problems. None of these issues affect capability monotonicity.

The resulting pass rate was 98.1% compared with the baseline. The discrepancies were mostly due to limitations of the ASL model, such as limited debugging support, corner cases in address space handling, and the lack of secure memory; a few details with some SIMD instructions and particular processor exceptions require further investigation, but again, they do not affect monotonicity.

### 7 Model-based Test Generation

In addition to the ACK, and before we had access to it, we generated a test suite from the model to check core instruction and capability functions against the implementations, and also to adapt QEMU to support most of Morello. We use symbolic execution, well established as a way to generate high-coverage test suites [12,43] and used previously for a much simpler CHERI architecture [13], both to perturb the initial state to explore different instruction behaviours and to control whether processor exceptions are taken. The latter is particularly useful for CHERI ISAs because most input values would trivially fault at one of the capability checks (e.g. see CheckCapability in Fig. 2). Instruction set specifications are good candidates for symbolic execution because the languages tend to be relatively simple and the number of paths for any given instruction is bounded. To build a test generator for Morello we were able to reuse the Isla symbolic execution tool, which was already being developed for work combining Sail ISAs with relaxed memory models [10].

The test generator operates on traces of instructions, partially or fully chosen at random from the encoding diagrams included in the original ASL. Isla's symbolic execution was extended with a simple sequential memory model using SMT arrays for the main memory and tags. In outline, the generator (sketched after this list):

1. initialises the model by running the processor reset function in the symbolic executor (this is deterministic and does not involve any symbolic state);
2. alters the state so that the parts the test harness can change are symbolic, and fixes other values as necessary (e.g., for memory translation);
3. symbolically executes each instruction in turn to find feasible behaviours and picks one;
4. passes the accumulated path conditions to the Z3 SMT solver [16] to find suitable concrete values for the initial and final states; and
5. constructs the final test with the instructions and the test harness, which will set up the initial state and check the final state after execution.

This harness is hand-written (although automatically producing it in the style of Martignoni et al. [29] would be interesting to explore), so to accelerate development we first restricted our attention to fault-free behaviours with memory management turned off, then gradually added support for exceptions, for a simple fixed memory mapping, and for checks of more of the processor state after execution.
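The generation loop above can be sketched in OCaml as follows, with the Isla and Z3 functionality abstracted as parameters; this is an outline only, not the tools' real APIs.

```
(* High-level shape of the generation loop; every parameter is a
   placeholder for Isla/Z3 functionality. *)
let generate_test ~reset_model ~make_symbolic ~symbolic_step ~solve ~emit
    instructions =
  let st = reset_model () in                  (* 1. deterministic reset *)
  let st = make_symbolic st in                (* 2. free the harness-visible state *)
  let _final, path_cond =                     (* 3. pick one feasible path per instruction *)
    List.fold_left
      (fun (st, pc) instr ->
         let st', pc' = symbolic_step st instr in
         (st', pc' @ pc))
      (st, []) instructions
  in
  match solve path_cond with                  (* 4. SMT model gives concrete values *)
  | Some model -> Some (emit instructions model)  (* 5. wrap in the hand-written harness *)
  | None -> None
```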

Our coverage goal for test generation was to ensure that all of the specification code for manipulating capabilities, and for instructions that were added or modified for Morello, would be executed in some test. This was complicated by non-determinism in parts of the specification. Some instructions have "constrained unpredictable" forms which can have one of several effects; e.g., a load-pair where both destination registers are the same might write UNKNOWN values to them, do nothing, or take a fault. In principle allowing for all of these is possible, but the resulting disjunctions are likely to be much more difficult to solve, and the behaviours themselves are not very interesting, so we discarded these paths.

Another area of non-determinism in the specification is the load/store exclusive instructions used for synchronisation. Even during single-core execution these instructions behave non-deterministically, due to particular memory architecture choices that are left as unimplemented primitive operations in the specification. To test these instructions we added a simple model of the guaranteed behaviour in Sail, which includes assertions to avoid uncertain cases.
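A minimal guaranteed-behaviour model of this kind might look like the following single-entry exclusive-monitor sketch; this is our illustration in OCaml, not the Sail model itself, and real implementations may clear the monitor (failing the store-exclusive) in additional cases, which is exactly the uncertainty the model's assertions step around.

```
(* Toy single-entry exclusive monitor. *)
type monitor = { mutable excl : (int64 * int) option }

let load_exclusive m addr size =
  m.excl <- Some (addr, size)        (* start monitoring this region *)

let store_exclusive m addr size =
  match m.excl with
  | Some (a, s) when a = addr && s = size ->
      m.excl <- None; true           (* succeeds in this simplified model *)
  | _ ->
      m.excl <- None; false          (* fails: monitor lost or mismatched *)
```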

While the number of paths to explore in any instruction is bounded, the number of paths found for some instructions remains impractically large. The main cause is the case splits in the capability compression scheme. We reduce these to a single path by pushing the decisions into the SMT solver using Isla's linearisation feature, extended to support more of the language, which transforms functions with no side effects into a single SMT expression. This was sufficient to perform large-scale test generation with the Morello model.

We checked our progress against our coverage goal using the Sail C backend's coverage measurement support, counting, for each expression in a Sail specification, the number of tests that exercise it. Once we had enough tests that the accumulated coverage began to level out, it was apparent that certain instructions and corner cases were not exercised enough. Overriding the random instruction choice filled in most of the gaps, and temporarily disabling the linearisation allowed exhaustive testing of a key capability function.

The tests found a few minor issues in our tooling and some more bugs in the original ASL specification: several undefined variants of instructions were included, a new load-pair instruction should have been marked "constrained unpredictable", a set-bounds operation could read the wrong register, and a translation fault could be missed in a load-tags instruction. Corrections were made to the specification for these issues; a couple also arose in one of the implementations of Morello, and were then fixed.

Comparing the coverage of these tests with the ACK is instructive. As we used the Sail coverage as a goal, we hit a few gaps in the ACK, such as the set-bounds issue, and a rare corner case in a core capability function. However, the ACK's coverage goals included semantic notions that we cannot capture easily. For example, if a conditional is supposed to be false because the first of three checks will fail, human-authored coverage includes the other checks passing, whereas our generator does not reason about the other checks because the symbolic execution does not reach them.

The generated test suite was also used as the basis for test-driven development of an extension of QEMU's Armv8-A support to Morello. After adding basics, such as tagged memory and the expanded register file, the tests guided which features to implement, easing development. Small errors were picked up automatically, such as confusing the stack pointer and zero registers (which share an encoding) and sign extension bugs, including one in the pre-existing QEMU code where a previous attempt to fix it had missed a subtle issue.

The adapted QEMU now boots CheriBSD, a version of FreeBSD with capability support, although this required some fixes for issues that were not found by the generated test suite. A few involved parts of the state that were not explicitly included in the self-test, particularly around exception handling, but most of them concerned out-of-scope system features.

### 8 Related Work

Nienhuis et al. [38] proved similar results for the CHERI-MIPS architecture, above the Isabelle generated from L3 [23]. CHERI-MIPS is much smaller than Morello (6k LoS), and much simpler, without page tables, virtualisation, vector instructions, etc. They identified 9 properties of the ISA semantics that sufficed to show reachable capability monotonicity and a secure encapsulation result. These captured the capability-relevant intentions of instructions explicitly, but were expressed in terms of a conventional whole-system semantics rather than the intra-instruction trace semantics we use here, which was key to scaling. Each instruction had to be annotated with its intention, extensive work was needed to prove commutativity results, and the properties were MIPS-specific.

The other most closely related work, proving properties of capability architectures, establishes stronger results but for highly idealised architecture definitions. While our monotonicity theorem is about arbitrary machine execution up to a domain crossing, Skorstengaard et al. and Georges et al. [46,47,49,48,24] establish logical-relation methods for reasoning about combinations of arbitrary and known code, the latter mechanised in Iris [28], but for idealised machines rather than full architectures. These add new features to help enforce strong properties, but with unclear hardware implementation cost. Van Strydonck et al. [50] and El-Korashy et al. [19] study secure compilation in similarly idealised settings. Ultimately one would like to scale all these methods to production CHERI architectures. de Amorim et al. [5,4] verify information-flow properties of their SAFE architecture, also for a simplified model.

Capabilities have also been used in the interfaces of numerous operating systems. PSOS [37] uses a hardware tag bit similar to CHERI's, but all capability operations are implemented in the OS rather than in hardware. Various other operating systems use standard hardware but have capabilities as part of their interfaces. These systems are very different to CHERI, but their security models have many similarities. Proofs that a (simplified) OS interface matches an abstract capability security model have been done for the EROS OS [45] and for the seL4 kernel [20]. A subsequent proof connects to the seL4 implementation [44]. Each of these abstract models somewhat resembles ours, e.g. with notions of reachable and derivable capabilities. Our observation that domain-crossing events create extra complications also seems to apply to seL4.

There is a great deal of work devoted to other approaches to improve memory safety which we cannot detail here, but see the review [51]. For just a sample, many projects have developed software-implemented variants of C or C++ that provide greater safety, but typically with rather different performance and code-porting costs to CHERI, and without considering whole-system aspects outside a single C/C++ program [25,36,34,35,17,42,21]. Then there are many hardware-accelerated approaches, e.g. MPX and WatchdogLite, Watchdog, and Hardbound [33,32,31,18]. A different line of work aims at bug-finding rather than deterministic mitigation, e.g. AddressSanitizer [2] and many others.

If widely adopted, Morello would radically change the landscape for such work, and for computer security more generally.

Acknowledgements We thank all the members of the wider CHERI and Morello teams, for their work to make Morello a reality. This work was supported by the UK Industrial Strategy Challenge Fund (ISCF) under the Digital Security by Design (DSbD) Programme, to deliver a DSbDtech enabled digital platform (grant 105694), EPSRC programme grant EP/K008528/1 REMS, ERC AdG 789108 ELVER, Arm iCASE awards, EPSRC IAA KTF funds, the Isaac Newton Trust, the UK Higher Education Innovation Fund (HEIF), Thales E-Security, Microsoft Research Cambridge, Arm, Google, Google DeepMind, HP Enterprise, and the Gates Cambridge Trust. Approved for public release; distribution is unlimited. This work was supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contracts FA8750-10-C-0237 ("CTSRD"), FA8750-11-C-0249 ("MRC2"), HR0011-18-C-0016 ("ECATS"), and FA8650-18-C-7809 ("CIFV"), as part of the DARPA CRASH, MRC, and SSITH research programs. The views, opinions, and/or findings contained in this report are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

### References




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### The Trusted Computing Base of the CompCert Verified Compiler<sup>⋆</sup>

David Monniaux and Sylvain Boulmé

Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag {David.Monniaux,Sylvain.Boulme}@univ-grenoble-alpes.fr

Abstract. CompCert is the first realistic formally verified compiler: it provides a machine-checked mathematical proof that the code it generates matches the source code. Yet, there could be loopholes in this approach. We comprehensively analyze aspects of CompCert where errors could lead to incorrect code being generated. Possible issues range from the modeling of the source and the target languages to some techniques used to call external algorithms from within the compiler.

Keywords: Formally Verified Software · The Coq Proof Assistant

### 1 Introduction

CompCert [35,34,36] is a formally verified compiler for a large subset of the C99 language (extended with some C11 features): there is a proof, checked by a proof assistant, that if the compiler succeeded in compiling a C program and that program executes with no undefined behavior, then the assembly code produced executes correctly with the same observable behavior. Yet, this impressive claim comes with some caveats; in fact, there have been bugs in CompCert, some of which could result in incorrect code being produced without warning [57]. How is this possible?

The question of the Trusted Computing Base (TCB) of CompCert has been alluded to in general overviews of CompCert [37,27], but there has so far been no detailed technical discussion of that topic. While our discussion will focus on CompCert and Coq, we expect that many of the general ideas and insights will apply to similar projects and other proof assistants: other verified compilers, verified static analysis tools, verified solvers, etc.

We analyze the TCB of the official releases of CompCert,<sup>1</sup> and two forks: CompCert-KVX,<sup>2</sup> adding various optimizations and a backend for the Kalray KVX VLIW (very long instruction word) core, and CompCert-SSA,<sup>3</sup> adding optimizations based on static single assignment (SSA) form [6,18]. Versions and changes


<sup>⋆</sup> A software artefact is available from https://doi.org/10.5281/zenodo.5913981

<sup>1</sup> https://github.com/AbsInt/CompCert

<sup>2</sup> https://gricad-gitlab.univ-grenoble-alpes.fr/certicompil/compcert-kvx

<sup>3</sup> https://gitlab.inria.fr/compcertssa/compcertssa


to these software packages are referred to by git commit hashes. We discuss alternative solutions, some of which are already implemented in other projects, their applicability to CompCert, as well as related work.

Sections 2 and 3 analyze the TCB part coming from Coq usage. Section 4 presents the TCB part connecting the Coq specification of CompCert's inputs (source code) to the user view of these inputs. Sections 5 and 6 analyze the TCB part connecting the Coq specification of CompCert's generated programs to the actual platform running these programs. The conclusion (Section 7) summarizes which TCB parts of CompCert (and its forks) are the most error-prone, and discusses possible improvements.

### 2 The Coq Proof Assistant

CompCert is mostly implemented in Coq,<sup>4</sup> an interactive proof assistant [2]. Coq is based on a strict functional programming language, Gallina, built on the Calculus of Inductive Constructions, a higher-order λ-calculus. This language allows writing executable programs, theorem statements about these programs, and proofs of these theorems. CompCert is not directly executed within Coq. Instead, the Coq code is extracted to OCaml code, then linked with some manually written OCaml code. We now discuss how issues in the Coq implementation may impact the correctness of CompCert.

#### 2.1 Issues in Coq Proof Checking

Proofs written directly in Gallina would be extremely tedious and unmaintainable, so proofs are usually built using Coq tactics. While some other proof assistants trust tactics to apply only correct logical steps, this is not the case with Coq: what the tactics build is a λ-term, which could have been typed directly in Gallina if not for the tedium, and this λ-term is checked to be correctly typed by the Coq kernel. This allows tactics to be implemented in arbitrary ways, including calling external tools, without increasing the TCB.

A theorem statement is proved when a λ-term is shown to have the type of that statement (the Curry-Howard correspondence thus identifies statements with types, and proofs with λ-terms). Thus, all logical reasoning in Coq relies on the correctness of the Coq kernel, and some driver routines. In addition to the Coq compiler coqc and Coq toplevel coqtop, a proof checker coqchk provides some level of independent checking.

Coq is a mature development, however "on average, one critical bug has been found every year in Coq" [51]. Let us comment on the official list of these bugs.<sup>5</sup> Interestingly, the list classifies their risk according to whether they can be exploited by accident. We can probably assume that the designers of CompCert would not deliberately write code meant to trigger a specific bug in Coq and

<sup>4</sup> https://coq.inria.fr/

<sup>5</sup> https://github.com/coq/coq/blob/master/dev/doc/critical-bugs

prove false facts about compiled code: exploiting a Coq bug by mistake in a way sufficiently innocuous to evade inspection of the source code, to accept an incorrect optimization that would be triggered only in very specific cases (to evade being found through testing), seems highly unlikely.

Proofs are checked by Coq's kernel, which is essentially a type-checker for the λ-calculus implemented by Coq (the Calculus of Inductive Constructions with universes). There have been a number of critical bugs involving Coq's kernel, particularly in the checking of the guard conditions (whether some inductively defined function truly performs structural induction) and of the universe conditions (Coq has a countable infinity of type universes, all syntactically called Type, distinguished by arithmetic constraints, which must then be checked for validity). These conditions prevent building some terms having paradoxical types. Furthermore, there are options (in the source code or on the command line) that disable checking guard, universe or positivity conditions. For instance, if one disables the guard condition to build a nonterminating function as though it were a terminating one, it is possible to prove "false":

```
Unset Guard Checking.
Fixpoint loop {A : Type} (n : nat) {struct n} : A := loop n.
Lemma false : False. Proof. apply loop. exact O. Qed.
```
coqchk -o lists which guard conditions have been disabled—none in CompCert.

The Coq kernel can evaluate terms (reduce them to a normal form), but is rather slow in doing so. For faster evaluation, it has been extended with a virtual machine (vm_compute) [24] and a native evaluator (native_compute) [10]. Both are complex machinery, and a number of critical bugs have been found in them.<sup>6</sup> In CompCert, there are a few direct calls to vm_compute, and none to native_compute; but there may be indirect calls through tactics calling these evaluators.

#### 2.2 Issues in Coq Extraction

Coq's extractor, as used in CompCert, produces OCaml code from Coq code, which is then compiled and linked together with some other OCaml code. Extraction [39,38], roughly speaking, corresponds to removing non-computational (proof) content, compensating for some typing issues (see below), renaming some identifiers (due to different reserved words), and of course printing out the result. Coq's extractor and OCaml are in the TCB of CompCert.

OCaml's type safety ensures that, barring the use of certain features that circumvent this type safety (unsafe array accesses, marshaling, calls to external C functions, the Obj module allowing unsafe low-level memory accesses, ...), no type mismatch or memory corruption can happen at runtime within that OCaml code. None of these features are used within CompCert, except for calling C

<sup>6</sup> For instance, there used to be a bug with respect to types with more than 255 constructors that allowed proving "false" https://github.com/clarus/falso, so ludicrous that it made it into a satirical site https://inutile.club/estatis/falso/.

functions implementing the OCaml standard library, and some calls to Obj.magic, a universal unsafe cast operator, produced by Coq's extractor.

Calls to Obj.magic are used by the extractor to force OCaml to accept constructs (dependent types, arbitrary type polymorphism) that are correctly typed inside Coq but that, when mapped to OCaml types, result in ill-typed programs. The following program is correct in Coq (or in System F) but cannot be typed within OCaml's Hindley-Milner style of polymorphism, so uses Obj.magic:<sup>7</sup>

```
Definition m (g : ∀ {T}, list T → list T) : Type :=
  ((g (false :: nil)), (g (0 :: nil))).
Extraction m.
```

The following program, which is similar to some code in the Builtins0.v CompCert module, uses dependent types:

```
Inductive data := DNat : nat → data | DBool : bool → data.
Definition get_type (d : data) : Type :=
  match d with DNat _ ⇒ nat | DBool _ ⇒ bool end.
Definition extract (d : data) : get_type d :=
  match d with DNat n ⇒ n | DBool b ⇒ b end.
Require Extraction. Extraction extract.
```
Its extraction uses Obj.magic:<sup>8</sup>

```
let extract = function
  | DNat n -> Obj.magic n
  | DBool b -> Obj.magic b
```

Thus, incorrect behavior in the Coq extractor could, in theory at least, produce OCaml code that would not be type-safe, in addition to producing code not matching the Coq behavior. Is this serious cause for concern? On the one hand, the extraction process is quite syntactic and generic. It seems unlikely that it could produce valid OCaml code that would compile, pass tests, yet occasionally have subtly incorrect behavior.<sup>9</sup> On the other hand, CompCert is perhaps the only major project using the extractor, which is thus not thoroughly tested. We do not know of any extractor bug that could result in CompCert miscompiling. Another related potential source of bugs is the linking of OCaml code extracted from Coq with "external" OCaml code. This is discussed in Section 3.2.

Sozeau et al. [51] study an approach to reduce the TCB of Coq by providing a formally verified (in Coq) implementation of a significant subset of its kernel, paving the road for a formally verified extraction. However, the target language of the extraction (OCaml?) would still be in the TCB. An alternative solution would be direct generation of assembly code from Gallina, as done by Œuf [42]; however, parts of CompCert are currently written in OCaml and would have to be rewritten in Gallina. Œuf extracts Gallina to Cminor, one of

<sup>7</sup> Some System F-like polymorphism was added to OCaml: structure types with polymorphic fields. This is not used by Coq's extractor as of Coq 8.13.2.

<sup>8</sup> Variants of this example correspond to generalized algebraic data types (GADTs), another recent addition to OCaml's type system not yet exploited by the extractor.

<sup>9</sup> Coq's bug tracker lists extractor bugs that, to the best of our knowledge, result in programs that are rejected by OCaml compilers.

the early intermediate languages of CompCert, then produces code using CompCert.<sup>10</sup> CertiCoq<sup>11</sup> [45,44] also extracts to Clight, which may be compiled with any C compiler.

### 3 Use of Axioms in Coq

Coq, like other proof assistants, checks that theorems are properly deduced from a (possibly empty) set of axioms. Axioms are also used as a mechanism to link Gallina programs to external OCaml code through extraction. Improper use of axioms may lead to two forms of inconsistency: logical inconsistency, and inconsistency between the Coq proof and the external OCaml code.

#### 3.1 Logical Inconsistency

Coq is based on type theory, with logical statements seen through the Curry-Howard correspondence: a proof of a logical statement is the same thing as a program having a certain type. In other words, a theorem is proved if and only if there is a λ-term inhabiting the type corresponding to the statement of the theorem. An axiom is thus just the statement that a certain constant, given without definition, inhabits a certain type.

The danger of using axioms is that they may introduce inconsistency, that is, the ability to prove a contradiction, from which, through ex falso quodlibet, any arbitrary statement is provable. Furthermore, several axioms may each be innocuous individually, yet create an inconsistency when combined.

There are several common use cases for axioms in Coq. One is being able to use modes of reasoning that are not supported by Coq's default logic: CompCert<sup>12</sup> adds excluded middle (∀P, P ∨ ¬P) for classical logic, functional extensionality (f = g whenever ∀x, f(x) = g(x)), and proof irrelevance

<sup>10</sup> Other systems meant to generate code from defnitions in a proof assistant, generate code directly rather than reuse an existant compiler. This approach is promoted [31] with the argument that such a process is safer than textual extraction to, say, OCaml. This is not so clear to us. On the one hand, extracting (without proof of correctness) Gallina to a subset of OCaml, printing the result, then running the OCaml compiler, surely adds a lot to the TCB. On the other hand, it is typically difcult to get right in a compiler the modeling of the assembly instructions, the ABI, the foreign function interface, as discussed in Section 5. Bugs at that level are caught by extensive testing. Surely, the OCaml code generator, the many libraries using OCaml's foreign function interface, are more thoroughly tested by usage than a code generator used to extract a few specifc projects developed in a proof assistant.

<sup>11</sup> https://github.com/CertiCoq/certicoq

<sup>12</sup> CompCert module Axioms.v imports module FunctionalExtensionality from the Coq standard library, which states both functional extensionality and proof irrelevance as axioms. Some CompCert modules import the standard Classical module, which states excluded middle as an axiom. Since proof irrelevance is a consequence of excluded middle, it should be possible to just import Classical in Axioms.v and deduce proof irrelevance from it.

(one assumes that the precise λ-term proving a statement is irrelevant: any two proofs of the same statement are considered equal). Meta-theoretical arguments have shown that these three axioms do not introduce inconsistencies.<sup>13</sup>

Another use case for axioms is to introduce names for types, constants, and functions defined in OCaml, with the relationship between these and the OCaml types and functions to be specified for Coq's extraction facility. For instance, to call an OCaml function f : nat -> bool list one would use:

```
Axiom f : nat → list bool.
Extract Inlined Constant f ⇒ "f".
```

This is used extensively in CompCert to call algorithms implemented in OCaml for efficiency, using machine integers and imperative data structures; see Section 3.3. Similarly, one can refer to an OCaml constant as follows:<sup>14</sup>

```
Axiom size : nat.
Extract Inlined Constant size ⇒ "size".
```

Incorrect use of axioms to be realized through extraction can lead to logical inconsistency. Consider, for instance, this variant, where the external definition size is supposed to be a negative natural number (maybe because we mistakenly typed n < 0 instead of n < 10); one can easily derive False from it:

```
Axiom size : { n : nat | n < 0 }.
```
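This derivation can be made explicit (a minimal sketch; the lemma name is ours):

```
Lemma size_inconsistent : False.
Proof.
  (* no natural number n satisfies n < 0: n < 0 unfolds to S n <= 0 *)
  destruct size as [n Hn].
  inversion Hn.
Qed.
```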

One approach for avoiding such logical inconsistencies is to avoid axioms whose types carry logical specifications, that is, proofs (e.g., here, n < 0); this is anyway a good idea, because such types may also result in mismatches (see Section 3.2). No OCaml function in CompCert accessed from Coq has a Coq type carrying a logical specification, with one exception, in CompCert-KVX:

```
Axiom profiling_id : Type.
Axiom profiling_id_eq : ∀ (x y : profiling_id), { x = y } + { x <> y }.
```
These axioms state that there exists a type called profiling\_id fitted with a decidable equality, both of which are defined in OCaml. This decidable equality is a technical dependency of the decidable equality over instructions.

In order to avoid logical inconsistencies due to axioms referring to external definitions, one can prove that the type which the Axiom command claims

<sup>13</sup> There is a model of Coq's core calculus in Zermelo-Fraenkel set theory with the Axiom of Choice and inaccessible cardinals [32,53]. Such a model is compatible with these axioms. Previously, in times when Coq's Set sort was impredicative (it can still be selected to be so by a command-line option), it became apparent that this was incompatible with excluded middle and with forms of choice suitable for finding representatives of quotient sets [15,16]. This should be cause for caution, though we think it unlikely that such paradoxes would be exploited by accident.

<sup>14</sup> This may allow compiling a Coq development once (Coq compilation may be expensive, as certain proofs take a lot of time) and then adjusting some constants when compiling and linking the extracted OCaml code, maybe for different use cases. This is not used in CompCert, which, for flexibility, instead allows certain features to be selected at run time through command-line options.

to be inhabited is indeed inhabited; this establishes that the axiom does not introduce inconsistency. For instance, one can specify an OCaml constant n < 10, to be resolved at compile time, and exclude logical inconsistency by showing that such a constant can actually exist:

```
Axiom size : { n : nat | n < 10 }.
Lemma size_can_exist : { n : nat | n < 10 }.
Proof. exists 0; lia. Qed.
```
This approach is occasionally used in Coq and CompCert for axiomatizing algebraic structures. For instance, Coq specifies constructive reals axiomatically, then provides an implementation that satisfies that specification; CompCert-KVX's impure monad (discussed in Section 3.3) is specified axiomatically, but the authors provide several implementations satisfying that specification [11]. Similarly, the authors could have provided an implementation of profiling\_id (e.g., natural numbers) and profiling\_id\_eq to show that these two axioms do not introduce logical inconsistencies.
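For instance, a consistency witness for the two profiling axioms above might look as follows (a sketch; the \_impl names are ours):

```
Definition profiling_id_impl : Type := nat.
Lemma profiling_id_eq_impl :
  ∀ x y : profiling_id_impl, { x = y } + { x <> y }.
Proof. unfold profiling_id_impl; decide equality. Qed.
```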

#### 3.2 Mismatches between Coq and OCaml

Though sound in itself, the extractor can be used inappropriately. We have just seen that adding an axiom standing for an OCaml function can, if that axiom is not realizable in Coq, lead to logical inconsistency. Even if the axiom is logically consistent, extraction to arbitrary OCaml code can lead to undesirable runtime behavior.

An obvious case is when, in addition to an axiom specifying a constant that refers, at extraction time, to an OCaml function, one adds an axiom specifying the behavior of that function, and the OCaml implementation does not match that specification. For instance, one can specify f to be a function returning a natural number greater than or equal to 3, then, through extraction, define it to return 0:

```
Axiom f : nat → nat. Axiom f_ge_3 : ∀ x, (f x) ≥ 3.
Definition g x := Nat.leb 1 (f x).
Extract Constant f ⇒ "fun x -> O".
```
Unsurprisingly, it is possible to prove in Coq that g always returns true, yet to run the OCaml code and see that it returns false. It is similarly possible to write Coq code with impossible cases that the extractor will extract to assert false, and the extracted code will actually reach this statement and die with an uncaught exception—an outcome that is, after all, better than producing output that contradicts theorems that have been proved. In the following code, False\_rec \_ \_ eliminates on False, which is obtained from the contradiction with f x ≥ 3, and is extracted to an always-failing assertion.

```
Program Definition h x := match f x with
  | O ⇒ False_rec _ _ | S O ⇒ False_rec _ _
  | S (S O) ⇒ False_rec _ _ | S (S (S x)) ⇒ x
  end.
```
Axiomatizing the behavior of externally defined functions circumvents the idea of verified software; nowhere in the CompCert source code is there such an axiomatization. An equivalent but perhaps more discreet way of axiomatizing the behavior of an OCaml function is through dependent types. Consider again the axiom Axiom size : { n : nat | n < 10 } from above. It is possible, through extraction mechanisms, to bind size to the OCaml constant 11: the type of size is extracted to exactly the same OCaml type as nat, the proof component being discarded. It is then possible, similarly, to lead the OCaml code extracted from Coq into cases that should be impossible.
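A hedged sketch of such a binding, assuming the axiom from Section 3.1 (the unary OCaml constant below denotes 11, violating the n < 10 specification):

```
Axiom size : { n : nat | n < 10 }.
(* the sig type erases to plain nat in OCaml, so this is accepted: *)
Extract Constant size ⇒ "S (S (S (S (S (S (S (S (S (S (S O))))))))))".
```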

The only case of such an axiomatization, in CompCert-KVX, is the previously introduced profiling\_id\_eq axiom, which is bound to the Digest.equal function from OCaml's standard library, a function defined to be string equality. We can surely assume OCaml's string equality test to be correct; otherwise, many things in Coq and the other tools used to build CompCert are likely incorrect as well.

It is also possible to instruct the extractor to extract certain Coq types to specific OCaml types, instead of emitting a normal declaration for them. The main use for this is to extract Coq types such as list or bool to the corresponding types in the OCaml standard library, as opposed to introducing a second list type and a second Boolean type; this is in fact so common that the standard Coq.extraction.ExtrOcamlBasic module specifies a number of such extractions, and so does CompCert. This is not controversial. The extractor also allows fully specifying how a Coq type maps to OCaml, including the constructors and the "match" destructor; the only use of this feature in CompCert is in CompCert-KVX, for implementing some forms of hash-consing (Sec. 3.4).
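For illustration, here are directives in the style of the standard library's extraction modules (a sketch; the last one, modeled after the usual nat-to-int mapping, also overrides the "match" destructor):

```
Extract Inductive bool ⇒ "bool" [ "true" "false" ].
Extract Inductive list ⇒ "list" [ "[]" "(::)" ].
(* full form: constructors plus a custom destructor for "match" *)
Extract Inductive nat ⇒ "int" [ "0" "succ" ]
  "(fun fO fS n -> if n = 0 then fO () else fS (n - 1))".
```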

An in-depth discussion of further aspects of Coq/OCaml interfacing may be found in Boulmé's habilitation thesis [11].

#### 3.3 Interfacing External Code as Pure Functions

Coq is based on a pure functional programming language; as in mathematics, if the same function gets called twice with the same arguments, it returns the same value. OCaml is an impure language, in which the same function called with the same arguments may return different values over time, whether it depends on mutable state internal to the program or on external inputs (user input, etc.). By binding Coq axioms to impure functions, we can, again, lead OCaml code extracted from Coq to places it should not go.

For instance, the Boolean expression z extracted from this Coq program is false even though it is proved to be true: it calls the same function twice with the same argument and compares the results;<sup>15</sup> but since that function is impure and returns the value of a counter incremented at each call, two successive calls always return unequal values.

<sup>15</sup> This result is computed by the "Nat.eqb" Boolean equality over naturals (in contrast, the Coq propositional equality, written "=", is only logical).

```
Axiom f : unit → nat.
Extract Constant f ⇒
  "let count = ref O in fun () -> count := S (!count); !count".
Definition z : bool := Nat.eqb (f tt) (f tt).
Lemma ztrue : z = true.
  unfold z; rewrite Nat.eqb_refl; congruence.
Qed.
```
CompCert calls a number of OCaml auxiliary functions as if they were pure functions, most notably the register allocator. These functions are "oracles", in the sense that they are not trusted to return correct results; their results are used to guide compilation choices, and may be subjected to checks. Both CompCert-SSA and CompCert-KVX add further oracles.

Could impure program constructs, in particular mutable state, in these oracles lead to runtime inconsistencies? The code of some of these oracles is simple enough that it can be checked to behave functionally overall: mutable state, if any, is created locally within the function and does not persist across function calls. In the register allocator, there are a few global mutable variables (e.g., max\_age, max\_num\_eqs), and perhaps it is possible to obtain different register allocations for the same function by running the allocator several times. It seems unlikely that some CompCert code would intentionally call a (possibly computationally expensive) oracle twice with the same inputs, then reach an incorrect answer if the two returned values differ. Yet, it is not obvious that this cannot happen.

To avoid such uncertainties, the CompCert-KVX authors encapsulated some of their oracles, in particular oracles used within simulation checkers by symbolic execution [48,47,49], inside the may-return monad of [11]. The monad models nondeterministic behavior: the same function may return different values when called with the same argument, without this leading to inconsistent cases. Beyond soundness, a major feature of this approach is to provide "theorems for free" about polymorphic higher-order foreign OCaml code. In other words, this approach ensures for free (i.e., by the OCaml typechecker) that some invariants proved on the Coq side are preserved by untrusted OCaml code [11]. While this technique has been intensively applied within the Verified Polyhedron Library [12], it is only marginally used within the current CompCert-KVX, only for a linear-time inclusion test between lists.
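The interface of such a may-return monad can be sketched as follows (our names; see [11] for the actual design):

```
Axiom imp : Type → Type.                      (* impure computations *)
Axiom Ret : ∀ {A}, A → imp A.
Axiom Bind : ∀ {A B}, imp A → (A → imp B) → imp B.
(* mayRet k a holds when computation k may return value a *)
Axiom mayRet : ∀ {A}, imp A → A → Prop.
Axiom mayRet_Ret : ∀ A (a b : A), mayRet (Ret a) b → a = b.
```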

This approach however has two drawbacks. Firstly, despite the introduction of tactics based on a weakest-liberal-precondition calculus, the proof effort is heavier than for code written with pure functions in a non-monadic style. Secondly, all the code calling impure functions modeled within the may-return monad itself becomes impure code modeled within that monad, meaning that a significant part of the rest of CompCert (at least the code calling the sequence of optimization phases, and its proofs) would have to be rewritten using that monad.<sup>16</sup>

<sup>16</sup> Much of CompCert is already written in an error monad, of which the may-return monad is a straightforward generalization. It thus seems feasible to rewrite CompCert with the may-return monad instead of the existing error monad. In

CompCert's Coq code accesses the mutable variables storing command-line options through helper functions. This supposes that these variables stay constant once the command line has been parsed, which is the case.
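Such an access might be sketched as follows, with a hypothetical flag option\_xyz (the names here are ours, not CompCert's):

```
Axiom xyz_enabled : unit → bool.
Extract Constant xyz_enabled ⇒ "fun () -> !Clflags.option_xyz".
```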

In Coq, all functions must be shown to be terminating (because nonterminating terms can be used to establish inconsistencies). Arguments for the termination of a function are sometimes more intricate and painful to write in Coq than those for its partial correctness, and termination is not really useful in practice: from the point of view of the end user, there is no difference between a terminating function that takes a prohibitively long time to terminate and a nonterminating function. For this reason, some procedures in CompCert and its forks that search for a solution to a problem (e.g., a fixpoint of an operator) are defined by induction on a positive number, and return a default or error value if the base case of the induction is reached before the solution is found. Iteration.PrimIter, used for instance in the implementation of Kildall's fixpoint-solving algorithm for dataflow analysis, thus uses a large positive constant num\_iterations = 10<sup>12</sup>. Such numbers are often informally known as fuel.
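The scheme can be sketched as follows (our names, not those of the actual Iteration module):

```
Fixpoint iterate {A B : Type} (step : A → A + B)
                 (fuel : nat) (x : A) : option B :=
  match fuel with
  | O ⇒ None                            (* fuel exhausted: give up *)
  | S fuel' ⇒
      match step x with
      | inl x' ⇒ iterate step fuel' x'  (* not done: keep iterating *)
      | inr y ⇒ Some y                  (* solution found *)
      end
  end.
```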

CompCert-SSA takes an even more radical view: a natural number fuel is left undefined, as an axiom, inside the Coq source code, and is extracted to the OCaml code let rec fuel = S fuel, meaning that fuel is circularly defined as its own successor and in practice acts as an infinite stream of successors. Why that choice? num\_iterations is a huge constant belonging to the positive type, which models positive integers in binary notation; there is a custom induction scheme for this type that implements the usual well-founded ordering on positive integers. In contrast, fuel is a natural number in unary notation, on which inductive functions may be defined by structural induction, which is a bit easier than with a custom induction scheme; but it is impossible in practice to define a huge constant in unary notation. The num\_iterations scheme is cleaner, but we have not identified any actual problem with the fuel scheme. The OCaml code extracted from Coq has no way to distinguish fuel from a large constant.
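A hedged sketch of that scheme (the exact CompCert-SSA code may differ):

```
Axiom fuel : nat.
(* a circular OCaml value: an "infinite" tower of successors *)
Extract Constant fuel ⇒ "let rec fuel = S fuel in fuel".
```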

The fuel trick however breaks if pointer equality is exposed on the natural number type [11]. The following program, defined using a "may return" monad, where phys\_eq\_nat is pointer equality on natural numbers, can be proved never to return true; yet, it does return true at runtime: since fuel is a circular value, its predecessor x is, at run time, the very same pointer as fuel itself.

```
Definition fuel_eq_pred :=
  match fuel with
  | O ⇒ Impure.ret false
  | S x ⇒ phys_eq_nat fuel x
  end.
```
practice, this represents a lot of reengineering work. For example, currently, the may-return monad provides a tactic for backward reasoning, based on a weakest-precondition calculus. In contrast, CompCert provides a tactic for forward reasoning on the error monad. Thus, defining a tactic on the may-return monad that behaves like the one for the error monad would help in reducing the amount of changes in CompCert proofs.

#### 3.4 Pointer Equality and Hash-Consing

The normal way in Coq to decide the equality of two tree-like data structures is to traverse them recursively. The worst case of this approach is reached when the structures are equal, in which case they are traversed completely. Unfortunately, this case is frequent in many applications of verified compilation, verified static analysis, etc.: when the data structures represent abstract sets of states (in abstract interpretation), equality of the structures signals equality of these abstract sets, which indicates that a fixed point has been reached; equality between symbolic expressions is used for translation validation through symbolic execution [48]. Furthermore, many algorithms that traverse pairs of tree-like structures admit shortcuts when two substructures are equal: for instance, when computing the union of two sets, if the sets are equal, then the union is just that set [41, §5]; being able to exploit such cases has long been known to be important for the speed of static analyzers [8, §6.1.2].
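As a concrete illustration, a structural equality test on binary trees traverses both trees entirely precisely when they are equal (a minimal sketch):

```
Inductive tree := Leaf | Node : tree → nat → tree → tree.

Fixpoint tree_eqb (t u : tree) : bool :=
  match t, u with
  | Leaf, Leaf ⇒ true
  | Node l x r, Node l' y r' ⇒
      (* worst case: recurse into both subtrees *)
      andb (Nat.eqb x y) (andb (tree_eqb l l') (tree_eqb r r'))
  | _, _ ⇒ false
  end.
```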

If we were programming in OCaml, we could simply use pointer equality (==) for a quick check that two objects are equal: if they are at the same memory location, then they are necessarily structurally equal (the converse is not true in general). In Coq, a naive formalization of this approach could be:

```
Axiom phys_eq : ∀ {A : Type}, A → A → bool.
Axiom phys_eq_implies_eq :
  ∀ {A : Type} (x y : A), phys_eq x y = true → x = y.
```
This approach is however unsound.<sup>17</sup> We prove that x\_eq\_x and x\_eq\_y are equal; yet in the extracted code, the former evaluates to true, the latter to false.

```
Definition x := S O. (* 1 *)  Definition y := S O. (* 1 *)
Definition x_eq_x := phys_eq x x. Definition x_eq_y := phys_eq x y.
Extract Inlined Constant phys_eq ⇒ "(==)".
Recursive Extraction x_eq_x x_eq_y.
Lemma same : x_eq_x = x_eq_y. Proof. reflexivity. Qed.
```
To summarize, OCaml pointer equality can distinguish two structurally equal objects, whereas this is provably impossible for Coq functions: for Coq, x and y are the same, so they are interchangeable as arguments to phys\_eq. This is the functionality issue of Section 3.3 in another guise: the same OCaml function must be allowed to return different values when called with the same argument.

The solution used in CompCert-KVX for checking that symbolic values are equal was thus to model pointer equality as a nondeterministic function in a "may return" monad. In this model [11], pointer equality nondeterministically

<sup>17</sup> We saw in the preceding section another possible cause of unsoundness: if circular data structures are defined in OCaml inside inductive types, pointer equality can be used to establish that a term is equal to one of its strict subterms, which is normally impossible, and thus leads to an absurd case at execution time. To avoid this, either completely disallow linking to circular terms constructed in OCaml, or restrict the pointer equality test to types for which such circular terms are never constructed.

discovers some structural equalities.<sup>18</sup> This solution has one drawback: the whole of the symbolic execution checker is defined within this monad, and the authors unsafely exit from that monad to avoid running much of CompCert through it. It is uncontroversial that pointer equality implies equality of the pointed objects; the only cause for unsoundness in such an approach could be the unsafe exit. Yet, again, why would CompCert-KVX call the symbolic execution engine twice with the same arguments, and reach an absurd case if the outcomes differ?

Opportunistic detection of identical substructures through pointer equality was implemented, for instance, in Astrée [8]. This approach takes advantage of the fact that many algorithms operating on functional data structures simply copy pointers to the parts of structures that are left intact: the opportunistic approach detects that some parts of structures have been left untouched, skipping costly traversals. It however does not work if a structure is reconstructed from scratch, for instance as the result of a symbolic execution algorithm: if two symbolic executions yield the same result, these results are represented by isomorphic data structures, but the pointers are different. What is needed then is hash-consing: when constructing a new node, search a hash table containing all currently existing nodes for an identical node and return it if it exists; otherwise, create a new node and insert it into the table. Hash-consing is widely used in symbolic computation, SMT solvers, etc.; there exist libraries making it easy in OCaml [19], and the OCaml standard library contains a weak hash table module, one of the main uses of which is as a basic building block for hash-consing.

The difficulty is that, though overall the construction of new objects behaves functionally (it returns objects that are structurally identical to what a direct application of a constructor would produce), it internally keeps a global state inside the hash table. Several solutions have been proposed to that problem [14]; one is to keep that global state explicitly inside a state monad, which amounts to threading the current state of the hash table through all computations. In the original version from [14], this implied implementing the hash table by emulating an array using functional data structures, which was very inefficient. Coq 8.13 introduced primitive 63-bit integers and arrays (with a functional interface) optimized for cases where the old version of an updated array is never used again [17, §2.3]; through special extraction directives, these may be extracted to OCaml native integers and arrays. That solution was not adopted for CompCert-KVX only because Coq 8.13 had not yet been released when the project started. Instead, CompCert-KVX has experimented with two alternative approaches to hash-consing.

The first approach used in CompCert-KVX introduces an untrusted OCaml function (modeled as a nondeterministic function within the may-return monad) that constructs terms through the hash-consing mechanism (searching in the hash table, etc.); these terms are then quickly checked for equivalence with the desired terms, using a provably correct checker. For instance, if a term c(a1, ..., an) is to be constructed, and the function returns a term t, then the root constructor

<sup>18</sup> In this model, a given Coq term is not necessarily equal to "itself" for pointer equality, because, in a Coq proposition, "itself" implicitly means a structural copy of "itself".

of t is checked to be c, and the arguments of that constructor are checked to be equal to a1, ..., an by pointer equality.<sup>19</sup> This solution does not add anything to the trusted computing base, apart from pointer equality. A may-return monad is used because the OCaml code is untrusted, and in particular is not trusted to behave functionally. The drawback is that, though the OCaml code will always make sure that there are never two identical terms in memory at different pointer addresses, this is not reflected at the level of proofs: in the Coq model (discussed above) of pointer equality within the may-return monad, pointer equality implies structural equality, but structural equality does not imply pointer equality. However, only the former is needed for a symbolic execution engine that checks that two executions are indeed equivalent by structural equality of terms, as in the scheduler of CompCert-KVX [48].

Having to thread a whole computation through a monad, further adding to proof complexity, for actions that are expected to behave functionally overall, is onerous. One solution is to add hash-consing natively inside the runtime system; for instance, the GimML language,<sup>20</sup> from the ML family [23,22,21], automatically performs hash-consing on datatypes on which it is safe to do so, which is for instance used to implement efficient finite sets and maps. This can be emulated by a "smart constructor" approach [14], replacing, through the extraction mechanism, calls to the term constructor, term pattern matching, and term equality by calls to appropriate OCaml procedures: the constructor performs hash-consing, the pattern matcher performs pattern matching ignoring the internal-use "unique identifier" field used for hash-consing, and term equality is defined to be pointer equality; appropriate OCaml encapsulation prevents manipulation of these terms except through these three functions, and in particular prevents them from being constructed by any method other than the smart constructor. Assuming that this OCaml code is correct, this is indeed sound, due to the global invariant that there never exist two distinct yet structurally identical terms of the hash-consed type reachable in memory at the same time. Because terms can only be built using the smart constructor, and because hash-consing ensures that pointer equality is equivalent to structural equality, pointer equality can indeed be treated as a deterministic function, without the need for a monad. This approach has the benefit of an easy-to-understand interface and simple proofs; it was the second approach experimented with in CompCert-KVX and was used for the HashedSet module [41].
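The extraction side of this scheme might be sketched as follows; the OCaml module HC and its four operations are hypothetical stand-ins for the actual CompCert-KVX code:

```
Inductive term := Var : nat → term | App : term → term → term.

(* constructors become hash-consing smart constructors, and "match"
   a destructor that ignores the unique-identifier field *)
Extract Inductive term ⇒ "HC.term" [ "HC.var" "HC.app" ] "HC.case".

(* term equality is extracted to pointer equality *)
Axiom term_eqb : term → term → bool.
Extract Inlined Constant term_eqb ⇒ "HC.eq".
```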

This second approach adds significantly more OCaml code to the trusted computing base than just assuming that pointer equality implies structural equality. Yet, this OCaml code is small, with few execution paths, and can easily be tested and audited. It assumes the correctness of OCaml's weak hash tables; however, Coq's kernel includes a module (Hashset) that is also implemented using these weak hash tables, so one already assumes that correctness when using Coq.

<sup>19</sup> A unique identifier is added as an extra field to each object, for reasons including efficient hashing. Structural equality is thus modulo differences in unique identifiers.
<sup>20</sup> https://projects.lsv.fr/agreg/?page_id=258 Formerly HimML.

### 4 Front-end and semantic issues

CompCert parses C and assigns a formal semantics to it. As such, it depends on a formal model of the C syntax and a formal semantics for it, supposed to reflect the English specification given in the international standard [4]. CompCert supports an extensive subset of C99 [3] (notable missing items are variable-length arrays and some forms of unstructured branching, à la Duff's device) and some C11 features (note that in C11, support for variable-length arrays is optional).<sup>21</sup>

The formal semantics of C supported by CompCert is called "CompCert C". Converting the source program, given in a text file, to the CompCert C AST (abstract syntax tree) on which the formal semantics is defined relies on many nontrivial transformations: preprocessing, lexing (lexical analysis), parsing (AST building), and typechecking. Most of them are unverified, but trusted. There are two important exceptions: significant parts of the parser and the typechecker of CompCert C are formally verified. The formally verified parser is implemented using the Menhir parser generator, and there is a formal verification of its correctness with respect to an attribute LR(1) grammar [25]. It relies on an unverified "pre-parser" to distinguish type identifiers introduced by typedef from other identifiers (a well-known issue of context-free parsing of C programs). It produces an AST which is then simplified and annotated with types by another unverified pass, called "elaboration". Finally, the resulting CompCert C program is typechecked by the formally verified typechecker. This is where the fully verified front-end of CompCert really starts.

Obviously, a divergence between the semantics of C as understood by CompCert and that semantics as commonly understood by the programmers whose code is to be compiled may lead to problems. Validating such semantics is an important issue [9]. The standard has evolved over time to take into account common programming practices or to solve some contradictions.<sup>22</sup> CompCert's semantics has also evolved to get closer to the standard; see [30]. In recent years, a few minor divergences have been spotted. For instance, there was a minor misimplementation of scoping rules (commit 99918e4) that led the following program to allocate s with size 3 instead of 4 (sizeof(t) was interpreted with t referring to the global variable, whereas the standard mandates that it refer to the t declared just before it on the same line):

```
char t[] = {1, 2, 3};
int main() {
  char t[] = {1, 2, 3, 4}, s[sizeof(t)];
  return sizeof(s);
}
```
Another example: CompCert and other compilers accepted an extension to the syntax of C99 (anonymous fields in structures and unions) but assigned slightly different meanings to it (different behavior during initialization, issue 411).

<sup>21</sup> The CH2O project (https://robbertkrebbers.nl/research/ch2o/) aims at formalizing the ISO C11 standard in Coq. This development is unrelated to the formalization inside CompCert.

<sup>22</sup> See an example on http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm.

The C standard leaves many behaviors undefined—anything can happen if the program exercises such a behavior (the compiler may refuse the program; the program may compile and run but halt abruptly with an error message when encountering the behavior; or it may continue running with arbitrary behavior). Some undefined behaviors, such as out-of-bounds array accesses, are exploited in malicious attacks. The C standard also leaves many behaviors unspecified, meaning the compiler may choose to implement them arbitrarily within a certain range of possibilities, e.g., the order of evaluation of parts of certain expressions with respect to side effects.<sup>23</sup> Actually, distinguishing between unspecified and undefined behavior in the evaluation order is rather complex: see [29] for a formal semantics. Furthermore, many compilers implement extensions to the standard. Some deviate from the standard's mandated behavior in some respects.<sup>24</sup>

Many programs, be they applications, libraries, or system libraries, rely on the behavior of the default compiler on their platform (e.g., gcc on Linux, clang on macOS, Microsoft Visual Studio on Windows).<sup>25</sup> If compilation just fails, then issues are relatively easy to deal with (though maintaining support for multiple compilers, often through conditional compilation and preprocessor definitions, is error-prone); subtler problems may be encountered when software compiles but behaves differently with different compilers.<sup>26</sup> It may be difficult to narrow differences in outcomes down to a bug (including reliance on undefined behavior) or to a difference between valid implementations of unspecified behavior.

The only semantic issue that we know of regarding CompCert's forthcoming version 3.10 concerns bitfields. A write to a bitfield is implemented using bit-shift and bitwise Boolean operations, and these operations produce the "undefined" value if one of their operands is "undefined". Writing to a bitfield originally stored in an uninitialized machine word or long word, which is the case for local variables, thus results in an "undefined" value, whereas the bits written to are actually defined. Reading from that bitfield will then produce the "undefined" value, as can be witnessed by running the program in CompCert's reference interpreter, which stops, complaining of undefined behavior. Fixing this issue would entail using a bit-wise memory model (issue 418).<sup>27</sup>


<sup>23</sup> This should not be confused with syntactic associativity, which is fully defined by the standard.

It may be possible to write and prove correct a phase that would replace this "undefined" value by an arbitrary value and thus result in miscompilation. We do not know, however, of any phase that would produce this in CompCert or its variants.

CompCert-KVX's test suite includes calling the compiler fuzzers Csmith<sup>28</sup> and YarpGen:<sup>29</sup> random programs are generated, compiled with gcc and CompCert-KVX, and run on a simulated target—an error is flagged if the final checksums diverge.

Due to possible semantic differences, for the subset of the C language concerned, between the tools that they use for their formal proofs and CompCert, Gernot Heiser, lead designer of the seL4 verified kernel, argues that translation validation of the results of black-box compilation by gcc is a safer route:

[...] using CompCert would not give us a complete proof chain. It uses a different logic to our Isabelle proofs, and we cannot be certain that its assumptions on C semantics are the same as of our Isabelle proofs.

Another option, for C code produced from a higher-level language by code generators, is to replace CompCert's front-end by a verified code generator for that language, directly targeting one of CompCert's intermediate representations (e.g., Clight) and its semantics, as done for instance by Velus [13] for a subset of the Lustre synchronous programming language.

Some features of the C programming language are not supported by CompCert's formally verified core, but can be supported through optional unverified preprocessing, selected by command-line options: -fstruct-passing allows passing structures (and unions) by value as parameters to functions, as well as returning them from functions;<sup>30</sup> -fbitfields allows bitfields in structures.<sup>31</sup> Preprocessing implements these operations using lower-level constructs (a memory-copy builtin, bit-shift operators), sometimes in ways incompatible with other compilers—CompCert's manual details such incompatibilities.

In addition, option -finline-asm allows inline assembly code with parameter passing, in a way compatible with gcc (implementing a subset of gcc's parameter specification). The semantics of inline assembly code is defined as clobbering registers and memory as specified, and emitting an externally observable event. Option -fall activates structure passing, bitfields, and inline assembly, for maximal compatibility with other compilers.

<sup>28</sup> https://github.com/csmith-project/csmith and [57]

<sup>29</sup> https://github.com/intel/yarpgen

<sup>30</sup> In C, passing pointers to structures that contain parameters or are meant to contain return values is a common idiom. The language however also allows passing or returning the structures themselves, and this is implemented in various ways by compilers, including passing pointers to temporary structures or, for structures small enough to fit within a (long) machine word, passing them directly in an integer register. How to do so on a given platform is specified by the ABI. Parameter passing, with all its particular cases, may be a quite delicate and convoluted part of the ABI.

<sup>31</sup> Recently, direct verified handling of bitfields was added to CompCert (commit d2595e3). This should be available in release 3.10.

Because inline assembly is difficult to use,<sup>32</sup> and because its semantics involves emitting an event, preventing many optimizations, CompCert also provides builtin functions that call specific processor instructions. If a builtin has been given an arithmetic semantics, then it can be compiled into arithmetic operators suitable for optimization; this is the case, for instance, of the "fused multiply-add" operator on the KVX. In contrast, instructions that change special processor registers are defined to emit observable events.

### 5 Assembly back-end issues

The verified parts of CompCert do not output machine code, let alone textual assembly code. Instead, they construct a data structure describing a set of global definitions: variables and functions; a function contains a sequence of instructions and labels. The instructions at that level may be actual processor instructions, or pseudo-instructions, which are expanded by unverified OCaml code into sequences of actual processor instructions. The resulting program is printed as textual assembly code by the TargetPrinter module; most of it consists in printing the appropriate assembly mnemonic for each instruction, together with calling functions for printing addressing modes and register names correctly, but there is some arcane code dealing with the proper loading of pointers to global symbols, the printing of constant pools, etc. Some of this code depends on linking peculiarities and on the target operating system, not only on the target processor.

#### 5.1 Printing Issues

An obvious source of potential problems is the huge "match" statement with one case per instruction, each mapping to a "print" statement. If the "print" statement is incorrect, then the instruction printed will not correspond to the one in the data structure. Printing an ill-formed instruction is not a serious problem, as the assembler will refuse it and compilation will fail. There have however been recent cases where CompCert printed well-formed textual assembly instructions that did not correspond to the instruction in the data structure. The reason why such bugs were not caught earlier is that these instructions are rarely used. Commit 2ce5e496 fixed a bug resulting in some fused multiply-add instructions being printed with arguments in the wrong order; these instructions are selected only if the source code contains an explicit fused multiply-add builtin call, which is rare. In CompCert-KVX, commit e2618b31 fixed a bug whereby "nand" instructions would be printed as "and"; "nand" is selected only for the rare ~(a & b) pattern. The bug was found by compiling randomly generated programs.

In some early versions of CompCert there used to be a code generation bug [57, §3.1] that resulted in an exceedingly large offset being used in relative addressing on the PowerPC architecture; this offset was rejected by the assembler. Similar

<sup>32</sup> Inline assembly is so error-prone that specialized tools have been designed to check that pieces of assembly code match their read/write/clobber specification [46].

issues surfaced later in CakeML on the MIPS-64 architecture [20] and in CompCert on AArch64 (commit c8ccecc). This is a sign that constraints on immediate operand sizes are easily forgotten or mishandled,<sup>33</sup> and a caution: incorrect operand sizes might, in other situations, not result in assembler errors.

#### 5.2 Pseudo-Instructions

In addition to instructions corresponding to actual assembly instructions, the assembler abstract syntax in CompCert features pseudo-instructions, or macro-instructions, most notably: allocation and deallocation of a stack frame; copying a memory block of a statically known size; jumping through a table. The reasons why these are expanded in unverified OCaml code are twofold. First, the correspondence between the semantics of such operations and their decomposition cannot be easily expressed within CompCert's framework for assembly-level small-step semantics, especially the memory model. CompCert models memory as a set of distinct blocks, and pointers as pairs (block identifier, offset within the block);<sup>34</sup> stack allocation and deallocation create or remove memory blocks, whereas on a real machine they just move the stack pointer, which is a plain integer. Jump tables (used for compiling certain switch statements) are arrays of pointers to instructions within the current function, whereas CompCert only knows about function pointers. Second, their expansion may use special instructions (load/store of multiple registers, hardware loops...) not normally selected, the behavior of which may be difficult to express in the semantics<sup>35</sup> or the memory model. This is typically the case for memory copy; see below.

**Stack Frame (De)Allocation** Stack (de)allocation pseudo-instructions bridge the gap between the abstract representation of memory as a set of blocks completely separated from each other and the flat address space implemented by most processors, with call frames laid out consecutively and allocation and deallocation amounting to subtracting from or adding to the stack pointer. A refined view, with a correctness proof going down to the flat addressing level, was proposed for the x86 target [55] but not merged into mainline CompCert.

<sup>33</sup> For instance, CompCert-KVX generates loads and stores of register pairs on AArch64, with special care: their offset range is smaller than for ordinary loads and stores.

<sup>34</sup> This reflects the C standard's view that variables and blocks each live in their own separate memory space. For instance, in C, comparisons between pointers to distinct variables have undefined behavior [4, §6.5.8]. Some CompCert versions in which pointers truly are considered to be integers have been proposed [7,43].

<sup>35</sup> Hardware loops, on processors such as the KVX, involve special registers. When the program counter equals the "loop exit" register, and there remain loop iterations to be done, control is transferred to the location specified by the "loop start" register. In all existing CompCert assembly language semantics, non-branching instructions fall through to the next instruction. Modeling hardware loops would thus involve changing the semantics of all instructions to transfer control according to whether the loop exit is reached, proving invariants regarding the hardware loop registers, etc. This could be worth it if hardware loops could be selected for regular code, not just builtins, but that would itself entail considerable changes in previous compiler phases.

**Loading Constants** Certain instructions may need some expansion and case analysis, and possibly auxiliary tables. For instance, on the ARM architecture, long constants must be loaded from constant pools addressed relative to the program counter; thus emitting a constant load entails emitting a load instruction and populating the constant pool, which must be flushed regularly since the range of addressing offsets is small. Getting the address of a symbol (a global or static variable) may also entail multiple instructions, and perhaps a case analysis depending on whether the code is to be position-independent and, in CompCert-KVX, on whether the symbol resides in a thread-local program section.<sup>36</sup> The low-level workings of the implementation of these pseudo-instructions rely on the linker performing relocations, on the application binary interface specifying that certain registers point to certain memory sections, etc.

**Builtins** CompCert allows the user to call special "builtins", dealing mainly with special machine registers and instructions (memory barriers, etc.). These builtins are expanded in Asmexpand or TargetPrinter into actual assembly instructions.

As an example, consider the memory copy builtin, which may be used by the user (with \_\_builtin\_memcpy\_aligned()) to request copying a memory block of known size, and is also issued by the compiler for copying structures. Expanding that builtin may go through a case analysis on block size and alignment: smaller blocks will be copied by a sequence of loads and stores, larger blocks using a loop. The scratch registers may be different in each case, and this case analysis must be replicated in the specification; alternatively, the specification may contain an upper bound on the set of clobbered registers, but in any case no clobbered register should be forgotten. There may also be a complicated distinction of cases regarding which source register is aliased to which other source or scratch register. A bug in that builtin, which did not check alignment and generated improper offsets for load instructions, was found in CompCert on AArch64; the assembler would reject the generated code (commit c8ccecc). Another bug in the same builtin, on four architectures (ARM, AArch64, PowerPC, RISC-V), due to an incorrect test about register aliasing, resulted in successful compilation, assembly, and linking, with incorrect code being emitted (commit c2c871c).

One bug was found in the CompCert-KVX stack frame allocation code; it had no adverse consequences unless a very large stack frame or many parameters were used, which explains why it was not detected earlier (commit fccfa9).

**Clobbered Registers** Expansions of pseudo-instructions and builtins often use scratch registers. The registers that are clobbered by each pseudo-instruction and builtin are defined in the Coq file (Asm.v) giving the semantics of the abstract assembly language. Thus, changes to expansions must coherently affect both the Asm.v specification and the Asmexpand and/or TargetPrinter OCaml modules.

<sup>36</sup> In C11 [4], the \_Thread\_local storage class specifes that one separate copy of the variable exists for each thread. Typically, a processor register points to the threadlocal memory area and these variables are accessed by ofsets from that register. CompCert has no notion of concurrency, but on the KVX, some system variables are thread-local and must be accessed as such even from single-threaded programs.

In the last few years, several specification bugs about registers clobbered by pseudo-instructions and builtins were found in CompCert, on several architectures. Commit 0df99dc4 fixes several wrong specifications of clobbered registers on AArch64; commit a4cfb9c2 on ARM; commit 39710f78 on RISC-V. It seems that none of these bugs could result in the generation of incorrect code, as the registers that were wrongly specified not to be clobbered were not used by the CompCert code generator to store persistent data. The problem is that it would have been possible to modify the code generator, with a full correctness proof, and have CompCert generate incorrect code. For instance, some pseudo-instructions would use the return address register as a scratch register without it being specified as clobbered. Some compilers perform leaf-function optimization: the prologues and epilogues of functions that never call other functions do not save and restore the return address. CompCert applies this optimization only on the PowerPC architecture, and even then only partially; if one had added this optimization to AArch64 or RISC-V, incorrect code would have been generated for leaf functions using the wrongly specified pseudo-instructions, though all proofs would go through.

Bugs in the expansion of builtins due to incorrect specification of clobbered registers (or memory), and those related to outcomes depending on compiler choices (e.g., register aliasing), eerily resemble those due to improper use of inline assembly in C programs [46]. Perhaps similar methods of validation could be used.

As an alternative, we propose moving the parts that deal with case distinctions (register aliasing, sizes, alignments...) out of the unverified, trusted expansion code, possibly introducing one pseudo-assembly instruction for each case. For instance, there could be one "memory copy" pseudo-assembly instruction for each different code sequence to be generated, with fixed "clobbered" registers and explicit constraints on alignment, size, etc. in the specification of the instruction. Verified Coq code would select the proper pseudo-instruction to use. This would likely avoid bugs due to case distinctions in trusted code, alleviate the difficulty of properly specifying the pseudo-instructions and keeping these specifications synchronized with their expansions, and make it easier to perform unit testing on the expansions.

#### 5.3 Microarchitectural Concerns

CompCert-KVX introduced instruction scheduling to CompCert.<sup>37</sup> Instruction scheduling reorders instructions while preserving semantics so as to minimize execution time. Current high-performance processors dynamically reorder instructions, but this is complex and consumes extra energy; in-order processors need the compiler to schedule instructions for good performance, taking into account latencies (the number of clock cycles between the operands of an instruction being read and the results being produced) and resource constraints (the number of instructions that can be executed simultaneously; e.g., a processor may be able to execute two instructions at a time, but only one of them may be a memory access, and only one of them may be floating-point).

<sup>37</sup> Tristan & Leroy [54] had developed scheduling for CompCert but their developments were not made publicly available, let alone integrated into CompCert releases.

Tables of resource uses and latencies are cumbersome to build, and often involve access to private documentation and/or reverse engineering; they are thus likely to be incorrect.<sup>38</sup> Fortunately, all targets of CompCert-KVX have interlocked pipelines, meaning that, if a value is read from a register that awaits a write, the instruction is stalled; thus sequential semantics are preserved: the worst that can happen if incorrect latencies are used is that the pipeline stalls for some cycles, which is a performance issue, not a correctness issue. In contrast, on processors with non-interlocked pipelines, the latencies belong to the semantic definition of the assembly code: a read from a register that awaits a write yields the previous value held in that register. Regarding resource constraints, on a very long instruction word (VLIW) processor, bundles of instructions that exceed resource constraints will be refused by the assembler; on a conventional multiple-issue processor, successive instructions that cannot be issued in the same cycle for lack of resources will be issued sequentially, which is equivalent since the processor preserves sequential semantics even when issuing several instructions at once. We conclude that pipeline modeling issues have no impact on the correctness of the code generated by CompCert-KVX, but solely on its performance.<sup>39</sup>

#### 5.4 Assembling and Linking

CompCert produces assembly code in textual form, which must then be assembled and linked using another toolchain, such as gcc (the GNU Compiler Collection) or clang (LLVM). This toolchain is thus within the TCB. AbsInt GmbH, which sells the commercial releases of CompCert, also sells, for certain architectures, the Valex tool, which matches the code generated by CompCert against the binary code [37,27]. An alternative is direct generation of machine code, as in CakeML [31]; CompCertELF extends CompCert with a verified assembler for the x86 target [56].

Finally, CompCert's correctness proof was originally meant for a "closed world": a program wholly compiled with it as a single module. In reality, most large C projects are compiled from multiple files which are then linked. The correctness proof was later extended, in version 2.7, to account for separate compilation and linking, following [26]. There have been proposals for more ambitious formalizations of the linking process [50], even implementing a verified linker for a subset of ELF on the x86-32 architecture [56].<sup>40</sup> Specifying and proving correct a general ELF linker is itself a fairly ambitious project [28].

### 6 Modeling and Application Binary Interface Issues

The semantics of assembly instructions is defined, for each architecture, in the official manuals from the architecture designers.

<sup>38</sup> The CompCert-KVX team had private documentation on the KVX; despite that, due to the tedium of building tables, they had a few bugs, as shown by commit logs. Their tables for AArch64 and RISC-V are based on the source code of other compilers.

<sup>39</sup> The situation would of course be very different in the case of a tool bounding worst-case execution time through precise processor modeling.

<sup>40</sup> ELF is a standard file format for object code.

The application binary interface (ABI), specific to each combination of architecture and operating system (or execution environment), defines how parameters are to be passed (in which registers, etc.), what kinds of global symbols exist and how they are accessed, which registers are reserved for system use, how the execution stack is to be laid out, what values the high-order bits of long registers may contain when the register holds a shorter value, etc. In contrast, CompCert's vision of values is somewhat abstract, even at the assembly level, which may pose problems, especially when interfacing with other parts of the runtime system.

#### 6.1 Modeling of Values

CompCert considers that a value, e.g., one stored in a register, is either a 32-bit integer; a 64-bit integer; a 32-bit single-precision floating-point number; a 64-bit double-precision floating-point number; a pointer, consisting of a block identifier and an offset; or "undefined", a value that can be refined into any other value, modeling undefined behavior that does not stop program execution (because it has not yet been externally observed). This is, however, an abstraction of reality. Pointers, in reality, are not a pair (block, offset) but a single 32-bit or 64-bit integer. How is a 32-bit value stored in a 64-bit register? Are the higher-order bits indifferent, supposed to be 0 (zero-extension), or equal to the sign bit (sign-extension)?

These modeling issues have subtle consequences on the implementation of certain instructions. If the application binary interface specifies that 32-bit values stored in 64-bit processor registers are 0-extended, then the 0-extension operation as defined in CompCert (taking a 32-bit unsigned value and returning the same value as a 64-bit unsigned integer) can be implemented as a no-operation at assembly level (with the special annotation, for the register allocator, that the target register should be the same as the source register).<sup>41</sup> Similarly, if the application binary interface specifies that 32-bit values stored in 64-bit processor registers are sign-extended, then the sign-extension operation as defined in CompCert can be implemented as a no-operation at assembly level. Finally, the application binary interface may specify that the higher 32 bits of a 64-bit register containing a 32-bit value are arbitrary.
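
As a minimal C illustration of this point: which of the two conversions below is free depends entirely on the ABI convention, not on the C semantics. (The register-level claims in the comments are general facts about AArch64 and common RISC-V conventions, not statements about CompCert's backends.)

```c
#include <stdint.h>

/* Under an ABI where 32-bit values are kept 0-extended in 64-bit
   registers (as on AArch64, where writing a w-register clears the
   upper 32 bits), this conversion compiles to nothing: source and
   destination can share a register. */
uint64_t zero_extend(uint32_t x) { return x; }

/* Under an ABI where 32-bit values are kept sign-extended in 64-bit
   registers (as in common RISC-V conventions), this one is the no-op
   instead, and zero_extend costs an explicit instruction. */
int64_t sign_extend(int32_t x) { return (int64_t)x; }
```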

Since none of the CompCert semantics specifies register contents at the bit level, it is up to the backend designer to be consistent in what instructions assume and ensure, and this consistency is never formally verified. Consistency must extend to the foreign function interface: for instance, if a CompCert-compiled function is called from a function compiled with another compiler that considers that the higher-order 32 bits contain arbitrary values, but CompCert assumes that values are 0-extended, then incorrect behavior may ensue.

The modeling of certain instructions is delicate. The KVX processor supports, in addition to normal loads from memory, speculative loads, otherwise

<sup>41</sup> This also explains why, on some platforms, the code produced by CompCert contains useless moves. If a 32-bit value needs to be extended to 64 bits in such a way that both the 32-bit and the 64-bit versions are live after the extension, then these two values, even if they are implemented by the same bit-string, will have to reside in two different registers, since the CompCert value semantics distinguishes 32-bit from 64-bit values.

known as non-trapping or dismissible loads. A normal load from an incorrect memory address will trap; on the KVX, a speculative load from an incorrect address returns 0 instead of trapping. Here, "incorrect" is meant with respect to the page tables of the processor. In the intermediate representations of CompCert-KVX, speculative loads from incorrect memory locations return the special value "undefined", whereas a normal load would terminate execution. "Undefined" is a form of "poison value" propagating through operations; e.g., adding it to an integer yields "undefined". The assembly-level semantics, however, defined the value returned by a speculative load from an incorrect memory location as 0, as per the processor documentation. 0 is a valid refinement of "undefined", and the proofs go through. This is however incorrect modeling, because it conflates two different notions: memory accesses invalid with respect to the CompCert semantics, and memory accesses invalid with respect to the processor memory management unit:<sup>42</sup> the former are strictly included in the latter:<sup>43</sup> a valid CompCert memory block may occupy a portion of a valid memory page, but the processor will allow accesses to the whole page. Using this incorrect semantics, one could perform a speculative load from a location known to be incorrect with respect to the CompCert semantics (for instance, just past the end of a block allocated on the stack) and assume that this load would return 0, whereas this location, when read, would return another value. Commit 5798f56b replaced this default value by "undefined", which is correct: any value is a valid refinement of "undefined".
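
The following C sketch makes the unsoundness concrete. The intrinsic `dismissible_load` is hypothetical, standing in for however the compiler exposes the KVX speculative load; the point is only the mismatch between the two notions of "incorrect".

```c
#include <stdint.h>

/* Hypothetical intrinsic standing in for the KVX dismissible load;
   the actual compiler interface differs. */
extern int32_t dismissible_load(const int32_t *p);

int32_t past_the_end(void) {
    int32_t buf[4] = {1, 2, 3, 4};
    /* &buf[4] points one past the end of the CompCert block: the
       CompCert semantics yields "undefined" here, yet the address
       lies on a mapped page, so the hardware returns whatever bytes
       happen to be there -- not necessarily 0.  Modeling the result
       as 0 would let a proof assume a value the hardware does not
       guarantee. */
    return dismissible_load(&buf[4]);
}
```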

### 6.2 Foreign Function Interface

CompCert's application binary interface (ABI) is not specified in a single place in CompCert: it comprises the calling convention, the value conventions implicit in the choice of instructions, etc. The correctness theorem of CompCert relates the execution of a C program, started from the main function, to the execution of the assembly program produced by its compilation, also started from the main function. It does not discuss functions compiled with other compilers calling a function compiled using CompCert. It also assumes that functions called from CompCert-compiled code use the same calling convention. As explained in CompCert's manual:

CompCert attempts to generate object code that respects the Application Binary Interface of the target platform and that can, therefore, be linked with object code and libraries compiled by other C compilers.

The manual then describes areas where CompCert's ABI differs from those of other compilers on the targets that it supports. Again, none of these other ABIs were formalized, so the statement of differences in the manual is not based on a formal compatibility analysis, but rather on human analysis.

<sup>42</sup> Or, rather, the association of the processor memory management unit and the virtual memory subsystem of the operating system.

<sup>43</sup> In the case of memory over-commit by the OS, a valid memory access with respect to CompCert semantics may result in a segmentation violation. We do not consider this issue here, since it is a case of the OS promising resources to the program then reneging on its promises, and thus not supplying a stable execution environment.

#### 6.3 Runtime System

The runtime system for C is rather limited compared to those of other languages. It uses the C standard library supplied by the target platform. CompCert makes no assumption about it—calls to the standard library are just calls to external functions, and the sequence of these calls, as observable events, in the source semantics is reflected in the assembly code—except for the heap memory allocation and deallocation functions malloc() and free(), which have special treatment and are given specific semantics (creation and destruction of memory blocks in the CompCert memory model). CompCert assumes that this allocator is correct with respect to CompCert's infinite memory model. In particular, CompCert assumes that malloc always succeeds and never returns the null pointer, which seems unsound: in theory, some formally verified optimizations could incorrectly remove defensive checks against heap overflow. In practice, we do not know of any optimization in CompCert exploiting this model of malloc. This assumption of infinite memory has been removed in CompCertS [7], at the price of a large extension of CompCert.
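
The kind of defensive check at stake is the usual one below; under the infinite-memory model of malloc, the null test is dead code in theory, though, as noted above, no current CompCert optimization is known to exploit this.

```c
#include <stdlib.h>

void *xmalloc(size_t n) {
    void *p = malloc(n);
    /* Under CompCert's model, malloc never returns NULL, so a
       verified optimization could in principle treat this test as
       dead code and remove it -- even though on a real machine
       malloc can fail. */
    if (p == NULL)
        abort();
    return p;
}
```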

In CompCert, basic floating-point operations have a semantics defined according to IEEE 754 in round-to-nearest mode. This assumes no change to the rounding mode through a library call or direct access to special CPU registers.

Some processors do not support some expensive arithmetic operations (e.g., floating-point operations, division) in hardware. These are replaced by calls to functions in the runtime system, which are axiomatized to perform the required operation by a combination of elementary instructions. This creates a somewhat paradoxical situation where, for the same operation (say, 32-bit integer division): (i) if the operation is implemented in hardware, then it is trusted; (ii) if implemented in software through a call to the runtime system, then it is trusted; (iii) if implemented in software through expansion inside CompCert, then one has to provide a full proof that this expansion implements the operation: its execution coincides with that of the operator on argument values on which this operator has defined behavior. One argument is that the hardware is likely to have been designed from existing floating-point designs and thoroughly tested with many test vectors,<sup>44</sup> and software emulation is likely to come from a well-tested established library,<sup>45</sup> whereas an expansion in CompCert has probably not been tested as well.
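
To make the proof burden of case (iii) concrete, here is the textbook magic-number expansion of division by the constant 3 (shown purely as an illustration of the kind of expansion a compiler may perform internally, not as CompCert's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Division by 3 via multiply-high and shift: 0xAAAAAAAB equals
   ceil(2^33 / 3), so for every 32-bit unsigned x we have
   (x * 0xAAAAAAAB) >> 33 == x / 3.  An in-compiler expansion of this
   kind carries exactly such a nontrivial correctness argument, which
   CompCert must prove, whereas a hardware divider (case i) or a
   runtime-library routine (case ii) is simply trusted. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main(void) {
    assert(div3(0) == 0);
    assert(div3(7) == 2);
    assert(div3(0xFFFFFFFFu) == 0x55555555u);
    return 0;
}
```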

### 7 Insights and Conclusion

Some natural questions about "verified" software are: how truly safe is it? What kinds of constructs should be considered suspicious? As more designs come

<sup>44</sup> E.g. the Berkeley hardfloat library (https://github.com/ucb-bar/berkeley-hardfloat) is used in certain RISC-V designs. Yet, its authors remind potential users that "These units are works in progress. They may not be yet completely free of bugs [...]".

<sup>45</sup> E.g. the Berkeley SoftFloat library (http://www.jhauser.us/arithmetic/SoftFloat.html); but, again, "Releases 3 through 3c of Berkeley SoftFloat contain bugs in the square root functions that may be of concern for some uses. Those bugs are believed to be repaired in Release 3d and later."

with some formal proofs of correctness, even regulatory agencies have had to provide guidelines [1]. It is of course perilous to draw general conclusions from the analysis of one single project; here are some insights.

None of the problems found were in the verified parts of CompCert: the chances seem slim that one stumbles into a proof-checker bug by accident, does not notice that something is amiss, and believes to have proved a theorem that actually does not hold. This explains why the number of bugs found in CompCert releases is many orders of magnitude below that of usual compilers [52]. By construction, the bugs of CompCert are located in a limited subpart of the software, called its TCB, which may however not be as small as one might naively expect.

Two bugs were found in the front-end elaboration rules, "corner cases" that should rarely be found in real programs (hence their late discovery). A few subtle semantic bugs were also found in some back-ends. However, most bugs were found in the very last part of the back-end, which expands and prints assembly instructions. The causes of these bugs are: (i) the tedium of writing correct printers for each instruction with appropriate operand ordering, and the lack of systematic unit testing of the printers; (ii) the number of different cases, especially in the choice of register arguments, in the expansion of pseudo-instructions, and again the lack of systematic testing that all cases are correct; (iii) the difficulties in keeping synchronized the specification of the pseudo-assembly instructions (in Coq) and the code performing their expansion, in two different files. All these seem to be common software engineering issues, amenable to standard software engineering solutions such as systematic testing of all cases.

All these issues pertain to the specification and trusted (but unverified) parts of the CompCert back-end, which echoes the results of early experiments that found bugs in these parts [57]. In contrast, no bugs due to the use of axioms for interfacing untrusted code, or due to the use of the extractor to OCaml, were found. In academic circles, however, much attention is often given to doing away with such axioms and the extractor; this may not reflect the most pressing needs. There seems to be a chasm between, on the one hand, what feels relevant and interesting to experts in proof assistants or type theoreticians, and, on the other hand, what would actually increase reliability in verified compilers or similar tools.

In our opinion, the primary focus for increasing trust in CompCert (and removing possible further bugs) should be a validation mechanism for its assembly and ABI specification with respect to the actual execution platform. For example, SAIL provides a formal ISA semantics for ARMv8 that has been tested against the ARM Architecture Validation Suite [5]. However, CompCert cannot be directly plugged into SAIL, because of its more abstract view of the ISA. And this would not solve the issues related to the runtime environment and the ABI.

### Acknowledgements

We wish to thank A. Miquel for helpful references on the metatheory of Coq, as well as L. Gourdin, X. Leroy and C. Six for discussions about CompCert.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### View-Based Owicki–Gries Reasoning for Persistent x86-TSO<sup>⋆</sup>

Eleni Vafeiadi Bila<sup>1</sup>, Brijesh Dongol<sup>1</sup>, Ori Lahav<sup>2</sup>, Azalea Raad<sup>3</sup>, and John Wickerson<sup>3</sup>

> <sup>1</sup> University of Surrey, Guildford, UK (b.dongol@surrey.ac.uk)
> <sup>2</sup> Tel Aviv University, Tel Aviv, Israel
> <sup>3</sup> Imperial College London, London, UK

Abstract. The rise of persistent memory is disrupting computing to its core. Our work aims to help programmers navigate this brave new world by providing a program logic for reasoning about x86 code that uses low-level operations such as memory accesses and fences, as well as persistency primitives such as flushes. Our logic, Pierogi, benefits from a simple underlying operational semantics based on views, is able to handle optimised flush operations, and is mechanised in the Isabelle/HOL proof assistant. We detail the proof rules of Pierogi and prove them sound. We also show how Pierogi can be used to reason about a range of challenging single- and multi-threaded persistent programs.

Keywords: Persistent memory, x86-TSO, Owicki–Gries, Isabelle/HOL, verification

### 1 Introduction

In our era of big data, the long-established boundary between 'memory' and 'storage' is increasingly blurred. Persistent memory is a technology that sits in both camps, promising both the durability of disks and data access times similar to those of DRAM. Embracing this technology requires rethinking our decades-old programming paradigms. As data held in memory is no longer wiped after a system restart, there is an opportunity to write persistent programs – programs that can recover their progress and continue computing even after a crash.

However, writing persistent programs is extremely challenging, as it requires the programmer to keep track of which memory writes have become persistent,

<sup>⋆</sup> Vafeiadi Bila is supported by VeTSS. Dongol is supported by EPSRC grants EP/V038915/1, EP/R032556/1, EP/R025134/2 and ARC Discovery Grant DP190102142. Lahav is supported by the Israel Science Foundation (grant 1566/18), by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 851811), and by the Alon Young Faculty Fellowship. Raad is supported by a UKRI Future Leaders Fellowship [grant number MR/V024299/1]. Wickerson is supported by an EPSRC Programme Grant (EP/R006865/1).

and which have not. This is further complicated in a multi-threaded setting by the intricate interplay between the rules of memory persistency (which determine the order in which writes become persistent) and those of memory consistency (which determine what data can be observed by which threads).

To address this difficulty, we provide a foundation for persistent programming. We develop a program logic, Pierogi, for reasoning about x86 code that uses low-level operations such as memory accesses and fences, as well as persistency primitives such as flushes. We demonstrate the utility of Pierogi by using it to reason about a range of challenging single- and multi-threaded persistent programs, including some that demonstrate the subtle interplay between optimised flush (flushopt) and store fence (sfence) instructions. Using the Isabelle/HOL proof assistant, we have mechanised the Pierogi rules and proved them sound with respect to an operational semantics for x86 persistency [9]. One benefit of our Isabelle/HOL formalisation is that Pierogi is already partially automated: once the user has produced a proof outline (i.e. annotated each instruction with a postcondition), they can simply use Isabelle/HOL's sledgehammer, which automatically decides which axioms and rules of the proof system need invoking to verify the whole program. Our mechanisation, which includes all the example programs discussed in this paper, is available as auxiliary material [4,5].

State of the art To our knowledge, the only program logic for persistent programs is POG (Persistent Owicki–Gries) [31]. As with Pierogi, POG enables reasoning about persistent x86 programs and is based on the Owicki–Gries method [30]. However, unlike Pierogi, POG is not mechanised in a proof assistant, and does not support optimised flush (flushopt) instructions. Optimised flush instructions are an important persistency primitive as they are considerably faster than ordinary flush instructions. Indeed, Intel's experiments on their Skylake microarchitecture indicate that they can be nine times faster when applied to buffers that hold tens of kilobytes of data [19, p. 289], and hence programmers are impelled: "If flushopt is available, use flushopt over flush." However, flushopt is a tricky instruction for programmers and program logic designers alike: compared to flush, flushopt can be reordered with more instructions under x86.

Pierogi can reason efficiently about x86 persistency (including flushopt instructions) thanks to two key recent advances: 1) Px86view [9], the view-based operational semantics of x86 persistency; and 2) the C11 Owicki–Gries logic [11–13] to reason about view-based operational semantics, which we adapt to Px86view.

Our contributions 1) We present a program logic, called Pierogi, for reasoning about persistent x86 programs. 2) We mechanise (and partially automate) Pierogi in Isabelle/HOL, and prove it sound relative to an established operational semantics for x86 persistency. 3) We demonstrate the utility of Pierogi by using it to verify several idiomatic persistent x86 programs.

Outline We begin with an overview of memory consistency and persistency in x86 and provide an example-driven account of Pierogi reasoning (§2). We describe the assertion language and proof rules of Pierogi in §3, and verify a selection of programs using Pierogi in §4. We present the view-based operational semantics of x86 persistency and prove the soundness of Pierogi in §5.

Auxiliary material Additional examples as well as the proofs of theorems stated in the paper are given in the accompanying technical appendix [5]. Our Isabelle/HOL mechanisation is available as auxiliary material [4].

### 2 Overview and Motivation

Recent operational models for weak memory use views to capture relaxed behaviours of concurrent programs [9, 11, 21, 22], where the memory records the entire history of writes that have taken place thus far. This way, different threads can have different subsets of these writes (i.e. different views) visible to them. Below, we review Px86view, a view-based operational semantics for x86 persistency (§2.1); we then describe Pierogi (§2.2) using a series of running examples.

#### 2.1 Px86view at a Glance

In the literature of concurrency semantics, consistency models describe the permitted behaviours of programs by constraining the volatile memory order, i.e. the order in which memory writes are made visible to other threads, while persistency models describe the permitted behaviours of programs upon recovering from a crash (e.g. a power failure) by defining the persistent memory order, i.e. the order in which writes are committed to persistent memory. To distinguish between the two, memory stores are differentiated from memory persists: the former denotes the process of making a write visible to other threads, whilst the latter denotes the process of committing writes to persistent memory (durably).

Px86view Consistency The consistency semantics of Px86view is that of the well-known TSO (total store ordering) [36] model, where later (in program order) reads can be reordered before earlier writes on different locations. This is illustrated in the store buffering (sb) example below (left), next to the message passing (mp) example (right):

$$
\underbrace{\begin{array}{l||l}
\texttt{store } x\ 1; & \texttt{store } y\ 1; \\
a := \texttt{load } y & b := \texttt{load } x
\end{array}}_{\textsf{sb:}\ \ a\,=\,0 \,\wedge\, b\,=\,0\ \text{allowed}}
\qquad\qquad
\underbrace{\begin{array}{l||l}
\texttt{store } x\ 42; & a := \texttt{load } y; \\
\texttt{store } y\ 7 & b := \texttt{load } x
\end{array}}_{\textsf{mp:}\ \ a\,=\,7 \,\wedge\, b\,=\,0\ \text{forbidden}}
$$

Specifically, assuming x=y=0 initially, since a := load y (resp. b := load x) can be reordered before store x 1 (resp. store y 1), it is possible to observe the weak behaviour a=0 ∧ b=0. A well-known way of modelling such reorderings in TSO is through store buffers: when a thread τ executes a write store x v, its effects are not immediately made visible to other threads; rather, they are delayed in a thread-local (store) buffer only visible to τ, and propagated to the memory at a later time, whereby they become visible to other threads. For instance, when store x 1 and store y 1 are delayed in the respective thread buffers (and thus not visible to one another), then a := load y and b := load x may both read 0.

Cho et al. [9] capture this by associating each thread τ with a coherence view (also called a thread-observable view), describing the writes observable by τ. Distinct threads may have different coherence views. For instance, after executing store x 1 and store y 1, the coherence view of the left thread may include store x 1 and not store y 1, while that of the right may include store y 1 and not store x 1. This way, a := load y (resp. b := load x) may read the initial value 0, as its coherence view does not include store y 1 (resp. store x 1).

After SC (sequential consistency) [27], TSO is one of the strongest consistency models and supports synchronisation patterns such as message passing, as shown in mp above, where a = 7 ∧ b = 0 cannot be observed. Specifically, (assuming x=y=0 initially) if the right thread reads 7 from y (written by the left thread), then the left thread passes a message to the right. Under TSO, message passing ensures that the instruction writing the message and all those ordered before it (e.g. store x 42; store y 7) are executed (ordered) before the instruction reading it (e.g. a := load y). As such, since b := load x is executed after a := load y, if a=7 (i.e. store x 42 is executed before a := load y), then b=42.

Px86view Persistency Cho et al. [9] recently developed the Px86view model, a view-based description of the Intel-x86 persistency semantics, which follows a buffered, relaxed persistency model. Under a buffered model, memory persists occur asynchronously [10]: they are buffered in a queue to be committed to persistent memory at a future time. This way, persists occur after their corresponding stores and as prescribed by the persistency semantics, while allowing the execution to proceed ahead of persists. As such, after recovering from a crash, only a prefix of the persistent memory order may have persisted. (The alternative is unbuffered persistency, in which stores and persists happen simultaneously.)

Under relaxed persistency, the volatile and persistent memory orders may disagree: the order in which writes are made visible to other threads may differ from the order in which they are persisted. (The alternative is strict persistency, in which the volatile and persistent memory orders coincide.)

The relaxed and buffered persistency of Px86view is shown in Fig. 1a. If a crash occurs during (or after) the execution of Fig. 1a, at crash time either write may have persisted, and thus x, y ∈ {0, 1} upon recovery. Note that the two writes cannot be reordered under Intel-x86 (TSO) consistency, and thus at no point during the normal (non-crashing) execution of Fig. 1a is x=0, y=1 observable. Nevertheless, in case of a crash it is possible to observe x=0, y=1 after recovery. That is, due to the relaxed persistency of Px86view, the store order (x before y) is separate from the persist order (y before x). More concretely, under Px86view the writes may persist 1) in any order, when they are on distinct locations; or 2) in the volatile memory order, when they are on the same location.<sup>4</sup>

To afford more control over when pending writes are persisted, Intel-x86 provides explicit persist instructions such as flush x and flushopt x that can be used to persist the pending writes on x.<sup>5</sup> This is illustrated in Fig. 1b: executing flush x persists the earlier write on x (i.e. store x 1) to memory. As such, if

<sup>4</sup> Given a cache line (a set of locations), writes on distinct cache lines may persist in any order, while writes on the same cache line persist in the volatile memory order. For brevity, we assume that each cache line contains a single location, thus forgoing the need for cache lines. However, it is straightforward to lift this assumption.

<sup>5</sup> Executing flush x or flushopt x persists the pending writes on all locations in the cache line of x. However, as discussed, we assume cache lines contain single locations.


Fig. 1: Example Px86view programs and possible values after recovery from a crash ( ). In all examples <sup>x</sup>, <sup>y</sup>, <sup>z</sup> are distinct locations in persistent memory such that x=y=z=0 initially, and a is a (thread-local) register.

the execution of Fig. 1b crashes and upon recovery y=1, then x=1. That is, if store y 1 has executed and persisted before the crash, then so must have the earlier store x 1; flush x. Note that y=1 ⇒ x=1 describes a crash invariant, in that it holds upon crash recovery regardless of when (i.e. at which program point) the crash may have occurred. Observe that this crash invariant is guaranteed thanks to the ordering constraints on flush instructions. Specifically, flush instructions are ordered with respect to all writes; as such, flush x in Fig. 1b cannot be reordered with respect to either write, and thus upon recovery y=1 ⇒ x=1.

However, instruction reordering means that persist instructions may not execute at the intended program point and thus may not guarantee the intended persist ordering. Specifically, flushopt x is only ordered with respect to earlier writes on x, and may be reordered with respect to later writes, as well as earlier writes on different locations. This is illustrated in Fig. 1c: flushopt x is not ordered with respect to store y 1 and may be reordered after it. Therefore, if a crash occurs after store y 1 has executed and persisted but before flushopt x has executed, then it is possible to observe y=1, x=0 on recovery. That is, there is no guarantee that store x 1 persists before store y 1, despite the intervening flushopt x.

In order to prevent such reorderings and to strengthen the ordering constraints between flushopt and later instructions, one can use either fence instructions, namely sfence (store fence) and mfence (memory fence), or atomic read-modify-write (RMW) instructions such as compare-and-set (CAS) and fetch-and-add (FAA). More concretely, sfence, mfence and RMW instructions are ordered with respect to all (both earlier and later) flushopt, flush and write instructions, and can be used to prevent reorderings such as that in Fig. 1c. This is illustrated in Fig. 1d. Unlike in Fig. 1c, the intervening sfence ensures that flushopt in Fig. 1d is ordered with respect to store y 1 and cannot be reordered after it, ensuring that store x 1 persists before store y 1 (i.e. y=1 ⇒ x=1 upon recovery), as in Fig. 1b. Note that replacing sfence in Fig. 1d with mfence or an RMW yields the same result. Alternatively, one can think of flushopt x as executing asynchronously, in that its effect (persisting x) does not take place immediately upon execution, but rather at a later time. However, upon executing a barrier instruction (i.e. mfence, sfence or an RMW), execution is blocked until the effects of earlier flushopt instructions take place; that is, executing such barrier instructions ensures that earlier flushopt instructions behave synchronously (like flush).
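
In C on Intel hardware, the Fig. 1d pattern corresponds roughly to the sketch below, using the CLFLUSHOPT and SFENCE intrinsics (compile with, e.g., -mclflushopt). This is an informal illustration only: it assumes x and y reside in persistent memory (e.g. via a DAX mapping) and ignores compiler-level reordering, which real persistent-memory code would handle with atomics or volatile accesses.

```c
#include <immintrin.h>  /* _mm_clflushopt (CLFLUSHOPT), _mm_sfence */

int x = 0, y = 0;  /* assumed to be placed in persistent memory */

void publish(void) {
    x = 1;
    _mm_clflushopt(&x); /* flushopt x: on its own, it may be reordered
                           past the store to y (as in Fig. 1c) */
    _mm_sfence();       /* orders the flushopt before later stores,
                           making it behave like flush (Fig. 1d) */
    y = 1;              /* crash invariant: y==1 implies x==1 */
}
```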

$$
\begin{array}{c}
\{P : \forall \tau \in \{1,2\}.\ [x]_\tau = [y]_\tau = \{0\}\} \\[4pt]
\begin{array}{l||l}
\{P_1 : 7 \notin [y]_2 \wedge a = 0\} & \{Q_1 : [y]_2 \subseteq \{0,7\} \wedge (7 \in [y]_2 \Rightarrow \langle y,7\rangle[x]_2 = \{42\})\} \\
\texttt{store } x\ 42;\ \ /\!\!/\ \textsf{SP}_1, \textsf{Cons} & a := \texttt{load } y;\ \ /\!\!/\ \textsf{LP}_1, \textsf{LP}_2, \textsf{Cons} \\
\{P_2 : [x]_1 = \{42\} \wedge 7 \notin [y]_2\} & \{Q_2 : a = 7 \Rightarrow [x]_2 = \{42\}\} \\
\texttt{store } y\ 7;\ \ /\!\!/\ \textsf{SP}_1, \textsf{Cons} & b := \texttt{load } x\ \ /\!\!/\ \textsf{LP}_1, \textsf{Cons} \\
\{P_3 : \mathit{true}\} & \{Q_3 : a = 7 \Rightarrow b = 42\}
\end{array} \\[4pt]
\{Q : a = 7 \Rightarrow b = 42\}
\end{array}
$$

Fig. 2: A Pierogi proof sketch of message passing (mp), where the // annotation at each step identifies the Pierogi proof rule (in §3.4) applied, and the highlighted assertions capture the effects of the preceding instruction.

The example in Fig. 1e illustrates how message passing can impose persist orderings on the writes of different threads. (Note that the program in the left thread of Fig. 1e is that of Fig. 1b.) As in mp, if a = 1, then store x 1; flush x is executed before a := load y (thanks to message passing). Consequently, since store z 1 is executed after a := load y when a = 1, we know store x 1; flush x is executed before store z 1. Therefore, if upon recovery z=1 (i.e. store z 1 has persisted before the crash), then x=1 (store x 1; flush x must have also persisted before the crash). As before, replacing flush x in Fig. 1e with flushopt x; C yields the same result upon recovery when C is an sfence/mfence or an RMW.

### 2.2 Pierogi: View-Based Owicki–Gries Reasoning for Px86view

Sequential Reasoning about Consistency using Views In Fig. 2 we present a Pierogi proof sketch of mp. Recall that in order to account for possible write-read reorderings on Intel-x86 architectures, Px86view associates each thread τ with a coherence view, describing the writes visible to τ. To reason about such thread-observable views, Pierogi supports assertions of the form [x]<sup>τ</sup> = S, stating that τ may read any value in the set S for location x. That is, the coherence view of τ for x consists of the writes whose values are those in S.

In the remainder of this article we enumerate the threads in our examples from left to right; e.g. the left and right threads in Fig. 2 are identifed as 1 and 2, respectively. Moreover, we assume the registers of distinct threads have distinct names. The precondition P in Fig. 2 thus states that both threads may initially only read 0 for both x and y: ∀τ ∈ {1, 2}. [x]<sup>τ</sup> =[y]<sup>τ</sup> ={0}.

In the case of thread 1, we can weaken P (using the standard rule of consequence of Hoare logic – see Cons in §3) to obtain P<sup>1</sup>. Upon executing store x 42, (1) we weaken the resulting assertion by dropping the a = 0 conjunct; and (2) we update the observable view of thread 1 on x to reflect the new value of x: [x]<sup>1</sup> = {42}; that is, after executing store x 42, the only value observable by thread 1 for x is 42. Similarly, after executing store y 7, we could assert [y]<sup>1</sup> = {7}; however, this is not necessary for establishing the final postcondition Q, and we thus simply weaken the postcondition to true (P<sup>3</sup>).


Fig. 3: Proof sketches of Fig. 1b (left) and Fig. 1d (right)

Analogously, in the case of thread 2 we weaken P to obtain Q<sup>1</sup>: [y]<sup>2</sup> = {0} implies [y]<sup>2</sup> ⊆ {0, 7} and 7 ∈ [y]<sup>2</sup> ⇒ ⟨y, 7⟩[x]<sup>2</sup> = {42}. Note that 7 ∈ [y]<sup>2</sup> ⇒ ⟨y, 7⟩[x]<sup>2</sup> = {42} is a vacuously true implication, as [y]<sup>2</sup> = {0} and thus 7 ∉ [y]<sup>2</sup>. The ⟨y, 7⟩[x]<sup>2</sup> denotes a conditional view assertion [11] that describes how reading a value on one location (y) affects the thread-observable view on a different location (x). More concretely, ⟨y, 7⟩[x]<sup>2</sup> = {42} states that if thread 2 executes a load on y and reads value 7, it may subsequently only observe value 42 for x. This is indeed the essence of message passing in mp: once thread 2 reads 7 from y, it may only read 42 for x thereafter. As such, after executing the read instruction a := load y, (1) we apply the LP<sup>1</sup> rule (in Fig. 7), which simply replaces [y]<sup>2</sup> with the local register a in which the value of y is read; and (2) we replace the conditional assertion ⟨y, 7⟩[x]<sup>2</sup> = {42} with the implication a = 7 ⇒ [x]<sup>2</sup> = {42}, stating that if the value read by thread 2 for y (in a) is 7, then its observable view for x is {42}. Similarly, upon executing b := load x we simply apply LP<sup>1</sup> to replace [x]<sup>2</sup> with the local register b in which the value of x is read. Lastly, the final postcondition Q is given by the conjunction of the thread-local postconditions (P<sup>3</sup> ∧ Q<sup>3</sup>).

Concurrent Reasoning and Stability In our description of the Pierogi proof sketch in Fig. 2 thus far we focused on sequential (per-thread) reasoning, ignoring how concurrent threads may affect the validity of assertions at each program point. Specifically, as in existing concurrent logics [11, 26, 30, 31], we must ensure that the assertions at each program point are stable under concurrent operations. For instance, to ensure that P<sup>1</sup> remains stable under the concurrent operation a := load y, we require that executing a := load y on states satisfying the conjunction of P<sup>1</sup> and the precondition of a := load y (i.e. Q<sup>1</sup>) not invalidate P<sup>1</sup>, in that the resulting states continue to satisfy P<sup>1</sup>; that is, {P<sup>1</sup> ∧ Q<sup>1</sup>} a := load y {P<sup>1</sup>} holds. Similarly, we must ensure that P<sup>1</sup> is stable under b := load x, i.e. {P<sup>1</sup> ∧ Q<sup>2</sup>} b := load x {P<sup>1</sup>} holds. Analogously, we must establish the stability of P<sup>2</sup>, P<sup>3</sup>, Q<sup>1</sup>, Q<sup>2</sup> and Q<sup>3</sup> under concurrent operations. In §3 we present syntactic rules that simplify the task of checking stability obligations. It is then straightforward to show that the assertions in Fig. 2 are stable.

Reasoning about flush Persistency To reason about the relaxed, buffered persistency of Px86view, Cho et al. [9] introduce persistency views, determining the possible persisted values for each location, i.e. the values of those writes that may have persisted to memory. Note that the persistency view determines the possible values observable upon recovery from a crash. By contrast, the (per-thread) coherence views determine the observable values during normal (non-crashing) executions, and have no bearing on the post-crash values.

Analogously, we extend Pierogi with assertions of the form [x]<sup>P</sup> = S, stating that the persistent view for x includes writes whose values are given by S. To see this, consider the Pierogi proof sketch of Fig. 1b in Fig. 3 (left). Initially, y holds 0 in persistent memory: [y]<sup>P</sup> = {0}. (Note that the precondition could additionally include [x]<sup>1</sup> = [y]<sup>1</sup> = {0} ∧ [x]<sup>P</sup> = {0} to denote that initially the thread may only observe 0 for x and y and that x holds 0 in persistent memory; however, this is not needed for the proof and we thus forgo it.)

As before, after executing store x 1, the observable value for x is updated, as denoted by [x]<sup>1</sup> = {1}. Moreover, after executing flush x, the persisted value for x is 1, as denoted by [x]<sup>P</sup> = {1}, obtained by committing (persisting) the observable value for x (i.e. [x]<sup>1</sup> = {1}) to memory (see FP<sup>1</sup> in Fig. 7). Finally, after executing store y 1, the observable value for y is updated, as denoted by [y]<sup>1</sup> = {1}.

Crash Invariants Recall that y=1 ⇒ x=1 in Fig. 1b denotes a crash invariant, in that it describes the persistent memory upon recovery from a crash at any program point. This is because we have no control over when a crash may occur. To capture such invariants, in Pierogi we write quadruples of the form {P} C {Q} : I, where {P} C {Q} denotes a Hoare triple and I denotes the crash invariant. If C is a sequential program, I must follow from every assertion (including P and Q) in the proof. For instance, in the proof outline of Fig. 3 (left) all four assertions imply the invariant [y]<sup>P</sup> = {1} ⇒ [x]<sup>P</sup> = {1}. We discuss the meaning of crash invariants for concurrent programs below.
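
For instance, the outline of Fig. 3 (left) can be summarised as the following quadruple (our rendering; the pre- and postconditions are those of the figure):

$$
\{[y]^{\mathsf{P}} = \{0\}\}\ \ \texttt{store } x\ 1;\ \texttt{flush } x;\ \texttt{store } y\ 1\ \ \{[x]^{\mathsf{P}} = \{1\} \wedge [y]_1 = \{1\}\}\ :\ \big([y]^{\mathsf{P}} = \{1\} \Rightarrow [x]^{\mathsf{P}} = \{1\}\big)
$$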

Reasoning about flushopt Persistency Recall that, unlike flush, flushopt instructions (due to instruction reordering) may behave asynchronously and their effects may not take place immediately after execution. As such, unlike for flush x, after executing flushopt x we cannot simply copy the observable view on x to the persistent view on x.

To capture the asynchronous nature of flushopt, Cho et al. [9] introduce yet another set of views, namely the thread-local asynchronous views: the asynchronous view of thread τ on x describes the values (writes) that will be persisted at a later time (asynchronously) by τ upon executing a barrier instruction. That is, 1) when thread τ executes flushopt x, its asynchronous view of x is advanced to at least its observable view of x; and 2) when τ executes a barrier (sfence, mfence or RMW), its persistent view for each location is advanced to at least its corresponding asynchronous view. We model this in Pierogi by 1) setting [x]<sup>A</sup><sub>τ</sub> to be a subset of [x]<sup>τ</sup> when flushopt x is executed; and 2) setting [x]<sup>P</sup> to be a subset of [x]<sup>A</sup><sub>τ</sub> (for each location x) when a barrier is executed.

This is illustrated in the proof sketch of Fig. 1d in Fig. 3 (right). In particular, unlike in the proof sketch of Fig. 1b in Fig. 3 (left), after executing flushopt x we

$$
\begin{array}{c}
\{P : a = 0 \wedge \forall o \in \{x,y,z\}, \tau \in \{1,2\}.\ [o]_\tau = [o]^{\mathsf{P}} = \{0\}\} \\[4pt]
\begin{array}{l||l}
\{P_1 : [y]_2 = \{0\} \wedge [z]^{\mathsf{P}} = \{0\} \wedge a = 0\} & \{\mathit{true}\} \\
\texttt{store } x\ 1;\ \ /\!\!/\ \textsf{SP}_1 & a := \texttt{load } y; \\
\{P_2 : [y]_2 = \{0\} \wedge [z]^{\mathsf{P}} = \{0\} \wedge a = 0 \wedge [x]_1 = \{1\}\} & \{\mathit{true}\} \\
\texttt{flush } x;\ \ /\!\!/\ \textsf{FP}_1, \textsf{Cons} & \texttt{if } (a = 1) \\
\{P_3 : [x]^{\mathsf{P}} = \{1\}\} & \quad \{a = 1\} \\
\texttt{store } y\ 1;\ \ /\!\!/\ \textsf{SP}_1, \textsf{Cons} & \quad \texttt{store } z\ 1; \\
\{P_4 : [x]^{\mathsf{P}} = \{1\}\} & \{\mathit{true}\}
\end{array} \\[4pt]
\{Q : [x]^{\mathsf{P}} = \{1\}\} \qquad I : [z]^{\mathsf{P}} = \{1\} \Rightarrow [x]^{\mathsf{P}} = \{1\}
\end{array}
$$

Fig. 4: A Pierogi proof sketch of Fig. 1e

cannot simply copy the thread-observable view to the persistent view. Rather, we copy the thread-observable view [x]<sup>1</sup> to its asynchronous view and assert [x]<sup>A</sup><sub>1</sub> = {1}; and upon executing the subsequent sfence, we copy the thread-asynchronous view to the persistent view and assert [x]<sup>P</sup> = {1}.

Putting It All Together We next present a Pierogi proof sketch of Fig. 1e in Fig. 4. The proof of the left thread is analogous to that in Fig. 3 (left); the proof of the right thread is straightforward and applies standard reasoning principles. The final postcondition Q is obtained by weakening the conjunction of per-thread postconditions.

Note that the crash invariant I follows from the assertions at each program point of thread 1 (i.e. P<sup>1</sup> ∨ P<sup>2</sup> ∨ P<sup>3</sup> ∨ P<sup>4</sup> ⇒ I). That is, the crash invariant must follow from the assertions at all program points of some thread (e.g. thread 1 in Fig. 4). In the case of sequential programs (e.g. in Fig. 3), this amounts to all program points (of the only executing thread). Intuitively, we must ensure that the crash invariant holds at every program point regardless of how the underlying state changes. As the assertions are stable under concurrent operations, it is thus sufficient to ensure that there exists some thread whose assertions at each program point imply the crash invariant.

### 3 The Pierogi Proof Rules and Reasoning Principles

We proceed with a description of our verification framework. As with prior work [11], the view-based semantics for persistent TSO [9] allows us to use the standard Owicki–Gries rules [2, 30] for compound statements. The main adjustment is the introduction of a new specialised assertion language capable of expressing properties about the different "views" described intuitively in §2. As such, since view updates are highly non-deterministic, the standard "assignment axiom" of Hoare Logic (and by extension Owicki–Gries) is no longer applicable. Moreover, unlike under SC, reads in a weak memory setting have a side-effect: their interaction with the memory location being read causes the view of the executing

$$
\begin{array}{c}
v, u \in \mathit{Val} \triangleq \mathbb{N} \qquad x, y, \ldots \in \mathit{Loc} \qquad a, b, \ldots \in \mathit{Reg} \qquad \tau \in \mathit{Tid} \triangleq \mathbb{N} \qquad i, j, k, \ldots \in \mathit{Lab} \qquad \hat{a}, \hat{b}, \ldots \in \mathit{AuxVar} \\[4pt]
\hat{e} \in \mathit{AuxExp} ::= v \mid \hat{a} \mid \hat{e}+\hat{e} \mid \cdots \qquad
e \in \mathit{Exp} ::= v \mid a \mid e+e \mid \cdots \qquad
B \in \mathit{BExp} ::= \mathit{true} \mid B \wedge B \mid \cdots \\[4pt]
\alpha \in \mathit{ASt} ::= \texttt{skip} \mid a := e \mid a := \texttt{load } x \mid \texttt{store } x\ e \mid a := \texttt{CAS } x\ e\ e \mid \texttt{sfence} \mid \texttt{mfence} \mid \texttt{flush } x \mid \texttt{flushopt } x \\[4pt]
\mathit{ls} \in \mathit{LSt} ::= \alpha\ \texttt{goto } j \mid \texttt{if } B\ \texttt{goto } j\ \texttt{else to } k \mid \langle \alpha\ \texttt{goto } j, \hat{a} := \hat{e} \rangle \\[4pt]
\Pi \in \mathit{Prog} \triangleq \mathit{Tid} \times \mathit{Lab} \to \mathit{LSt} \qquad \vec{pc} \in \mathit{PC} \triangleq \mathit{Tid} \to \mathit{Lab}
\end{array}
$$

Fig. 5: The Pierogi domains and programming language

thread to advance. Therefore, we resort to a set of proof rules that describe how views are modified and manipulated, as formalised by our view-based assertions.

### 3.1 The Pierogi Programming Language

We present the programming language in Fig. 5. Atomic statements (in ASt) comprise skip, assignment, memory reads and writes, barrier instructions and explicit persists. Specifically, a := e evaluates expression e and returns the result in (thread-local) register a; a := load x reads from memory location x and returns the value in register a; and store x e writes the evaluated value of e into location x. The statement a := CAS x e<sup>1</sup> e<sup>2</sup> denotes a 'compare-and-set' on location x, from the evaluated value of e<sup>1</sup> to the evaluated value of e<sup>2</sup>, and sets a to 1 if the CAS succeeds and to 0 otherwise. Finally, mfence denotes a memory fence, sfence denotes a store fence, and flush x and flushopt x denote explicit persist instructions (see §2).

Formally, we model a program Π as a function mapping each pair (τ, i) of thread identifier and label to the labelled statement (in LSt) to be executed. A labelled statement may be 1) a plain statement of the form α goto j, comprising an atomic statement α to be executed and the label j of the next statement; 2) a conditional statement of the form if B goto j else to k to accommodate branching, which proceeds to label j if B holds and to k otherwise; or 3) a statement with an auxiliary update ⟨α goto j, â := ê⟩, which behaves as α goto j, but in addition (in the same atomic step) updates the value of the auxiliary variable â with the auxiliary expression ê. It is well known that Owicki–Gries proofs require auxiliary variables to record the history of executions, to differentiate states that would otherwise not be distinguishable [30]. We show how auxiliary variables are used in Pierogi in the flush buffering example (§4).

We track the control flow within each thread via the program counter function, ⃗pc, recording the program counter of each thread. We assume a designated label, ι ∈ Lab, representing the initial label; i.e. each thread begins execution with ⃗pc(τ) = ι. Similarly, ζ ∈ Lab represents the final label. Moreover, if ⃗pc(τ) = i at the current execution step, then: 1) when Π(τ, i) = α goto j or Π(τ, i) = ⟨α goto j, â := ê⟩, then ⃗pc(τ) = j at the next step; 2) when Π(τ, i) = if B goto j else to k at the current step, then if B holds in the current state, ⃗pc(τ) = j at the next step; otherwise ⃗pc(τ) = k at the next step.
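
As a rough illustration, the shape of Π and the program-counter update can be sketched in C as follows. All names and encodings are ours, for illustration only; the execution of the atomic statement α and of auxiliary updates is elided.

```c
#include <stdbool.h>

/* A miniature rendering of the program representation: Prog maps a
   (thread, label) pair to a labelled statement, and pc maps each
   thread to its current label. */

typedef enum { SKIP, ASSIGN, LOAD, STORE, CAS,
               SFENCE, MFENCE, FLUSH, FLUSHOPT } AtomicKind;

typedef struct {
    bool is_cond;     /* false: "alpha goto j"; true: "if B goto j else to k" */
    AtomicKind alpha; /* the atomic statement, when is_cond is false */
    int goto_then;    /* label j */
    int goto_else;    /* label k, used only when is_cond is true */
} LStmt;

typedef LStmt (*Prog)(int tau, int lab);   /* Pi : Tid x Lab -> LSt */
typedef bool (*CondEval)(const LStmt *ls); /* evaluates B in the current state */

/* One step of thread tau: fetch the statement at pc[tau] and advance
   the program counter, mirroring the goto/if description above. */
void step(Prog prog, int pc[], int tau, CondEval eval_cond) {
    LStmt ls = prog(tau, pc[tau]);
    if (ls.is_cond)
        pc[tau] = eval_cond(&ls) ? ls.goto_then : ls.goto_else;
    else
        pc[tau] = ls.goto_then;
}
```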

Example 1. The program in Fig. 4, assuming that the left thread has id 1, is given as follows. The formalisation of the right thread is omitted, but is similar.

$$
\Pi \triangleq \left\{
\begin{array}{l}
(1, \iota) \mapsto \texttt{store } x\ 1\ \texttt{goto } 2,\quad
(1, 2) \mapsto \texttt{flush } x\ \texttt{goto } 3, \\
(1, 3) \mapsto \texttt{store } y\ 1\ \texttt{goto } \zeta,\ \dots
\end{array}
\right\}
$$

#### 3.2 View-Based Expressions

As with prior work on the RC11 model [21], we interpret Pierogi expressions directly over a view-based state. We use expressions tailored for the view-based Px86view model [9], which allow us to express relationships between diferent system components, including the persistent memory.

Our expressions fall into one of four categories: 1) current view expressions, which describe the current views of diferent system components (e.g. the persistent view); 2) conditional view expressions [11], which describe a view on a location after reading a particular value on a diferent location; 3) last view expressions, which hold if a component is viewing the last write to a location; and 4) write-count expressions, which describe the number of writes to a location.

Our current view expressions comprise [x]<sup>τ</sup>, [x]<sup>P</sup> and [x]<sup>A</sup><sub>τ</sub>, as described below; as shown in §2, each of these expressions describes a set of possible values.


Conditional view expressions are of the form ⟨x, v⟩[y]<sup>τ</sup> , as described below. As discussed in §2, conditional expressions capture the crux of message passing.

⟨x, v⟩[y]<sup>τ</sup> returns a set of values that τ may read for y after it reads value v for x. In particular, if ⟨x, v⟩[y]<sup>τ</sup> = S holds for some set S and τ executes a := load x, then in the state immediately after the load, if a = v, then [y]<sup>τ</sup> ⊆ S (see LP<sup>2</sup> in Fig. 7).

Last-view expressions (cf. [16]) are boolean-valued and hold if a particular component is synchronised (i.e. observes the latest value) on the given location. Such expressions provide determinism guarantees for load and flush. For instance, if the view of τ is the last write on x, then a read from x by τ will load this last value. Last-view expressions comprise VxW<sub>τ</sub> and VxW<sup>F</sup><sub>τ</sub>:

VxW<sub>τ</sub> holds if τ is currently viewing the last write to x. Thus, for example, if VxW<sub>τ</sub> holds, then a load from x by τ reads the last write to x. Note that unlike architectural operational models [36], in the view model [9] writes are visible to all threads as soon as they occur.

VxW<sup>F</sup><sub>τ</sub> holds if a flush of x by τ is guaranteed to flush the last write to x to persistent memory.

Lastly, write-count expressions are of the form |x, v|, as described below. Such assertions are useful for inferring view expressions from known facts about the number of writes in the system with a particular value (see Fig. 11).


#### 3.3 Owicki–Gries Reasoning

We present the Pierogi proof system as an extension of Hoare Logic with Owicki–Gries reasoning to account for concurrency. The main differences are that 1) our program annotations contain view-based assertions that allow reasoning about weak and persistent memory behaviours; and 2) we define a crash invariant to describe the recoverable state of the program after a crash. We proceed by first defining proof outlines, then providing syntactic rules for proving their validity. Our proof rules are syntactic, and thus can be understood and used without having to understand the details of the underlying Px86view model.

We let Assertion<sub>pv</sub> be the set of assertions (i.e. predicates over Px86view states) that use view-based expressions (§3.2). A crash invariant, I ∈ Inv ⊂ Assertion<sub>pv</sub>, is defined over persistent views only, i.e. it only comprises persistent view expressions of the form [x]<sup>P</sup>. We model program annotations via an annotation function, ann ∈ Ann = Tid × Lab → Assertion<sub>pv</sub>, associating each program point (τ, i) with its assertion. A proof outline is a tuple (in, ann, I, fn), where in, fn ∈ Assertion<sub>pv</sub> are the initial and final assertions.

Example 2. The annotation of the proof in Fig. 4 is given by ann, with the mappings of thread 1 as shown below; the mappings of thread 2 are similar.

$$ann \triangleq \left\{ (1, \iota) \mapsto P\_1, (1, 2) \mapsto P\_2, (1, 3) \mapsto P\_3, (1, \zeta) \mapsto P\_4, \dots \right\}$$

Additionally, we have in ≜ a = 0 ∧ ∀o ∈ {x, y, z}, τ ∈ {1, 2}. [o]<sup>τ</sup> = [o]<sup>P</sup> = {0}, fn ≜ [x]<sup>P</sup> = {1} and I ≜ [z]<sup>P</sup> = {1} ⇒ [x]<sup>P</sup> = {1}.

Definition 1 (Valid proof outline). A proof outline (in, ann, I, fn) is valid for a program Π if the following hold:

- Initialisation. For all τ ∈ Tid, in ⇒ ann(τ, ι).
- Finalisation. (⋀<sub>τ∈Tid</sub> ann(τ, ζ)) ⇒ fn.
- Local correctness. For all τ ∈ Tid and i ∈ Lab, either:
  – Π(τ, i) = α goto j and {ann(τ, i)} α {ann(τ, j)}; or
  – Π(τ, i) = if B goto j else to k and both ann(τ, i) ∧ B ⇒ ann(τ, j) and ann(τ, i) ∧ ¬B ⇒ ann(τ, k) hold; or
  – Π(τ, i) = ⟨α goto j, â := ê⟩ and {ann(τ, i)} α {ann(τ, j)[ê/â]}.
- Stability. For all τ<sub>1</sub>, τ<sub>2</sub> ∈ Tid such that τ<sub>1</sub> ≠ τ<sub>2</sub> and i<sub>1</sub>, i<sub>2</sub> ∈ Lab:
  – if Π(τ<sub>1</sub>, i<sub>1</sub>) = α goto j, then {ann(τ<sub>2</sub>, i<sub>2</sub>) ∧ ann(τ<sub>1</sub>, i<sub>1</sub>)} α {ann(τ<sub>2</sub>, i<sub>2</sub>)};
  – if Π(τ<sub>1</sub>, i<sub>1</sub>) = ⟨α goto j, â := ê⟩, then {ann(τ<sub>2</sub>, i<sub>2</sub>) ∧ ann(τ<sub>1</sub>, i<sub>1</sub>)} α {ann(τ<sub>2</sub>, i<sub>2</sub>)[ê/â]}.
- Persistence. There exists τ ∈ Tid such that for all i ∈ Lab, ann(τ, i) ⇒ I.

Intuitively, Initialisation (resp. Finalisation) ensures that the initial (resp. final) assertion of each thread holds at the beginning (resp. end); Local correctness establishes annotation validity for each thread; Stability ensures that each (local) thread annotation is interference-free under the execution of other threads [30]; and Persistence ensures that the crash invariant holds at every program point for some thread.

Example 3. Given the program in Example 1 and its annotation in Example 2, both Initialisation and Finalisation clearly hold. Moreover, Persistence holds for thread 1. For Local correctness of thread 1, we must prove (1)–(3) below; Local correctness of thread 2 is similar.

$$\{P_1\}\ \texttt{store } x\ 1\ \{P_2\} \tag{1}$$

$$\{P_2\}\ \texttt{flush } x\ \{P_3\} \tag{2}$$

$$\{P_3\}\ \texttt{store } y\ 1\ \{P_4\} \tag{3}$$

For Stability of P<sub>1</sub> (the precondition of store x 1 in thread 1) against thread 2 we must prove:

$$\{P_1\}\ a := \texttt{load } y\ \{P_1\} \tag{4}$$

$$\{P_1 \wedge a = 1\}\ \texttt{store } z\ 1\ \{P_1\} \tag{5}$$

Stability of other assertions (i.e., P2–P4) is similar. We prove (1)–(5) in §3.4.

### 3.4 Pierogi Proof Rules

One of the main benefits of Pierogi is the ability to perform proofs at a high level of abstraction. In this section, we provide the set of proof rules that we use. The annotation within a proof outline is, in essence, an invariant mapping each program location to an assertion that holds at that location. Thus, we prove local correctness by checking that each atomic step of a thread establishes the assertions in that thread. Similarly, we check stability by checking each assertion in one thread against each atomic step of the other threads. To enable proof abstraction, we introduce a set of proof rules that describe the interaction between the assertions from §3.2 and the atomic program steps. We use the standard decomposition rules from Hoare Logic to reduce proof outlines and enable our rules over atomic steps to be applied.

Standard Decomposition Rules The standard decomposition rules we use are given in Fig. 6, which allow one to weaken preconditions and strengthen postconditions, and decompose conjunctions and disjunctions.

Rules for Atomic Statements and View-Based Assertions Weak and persistent memory models (e.g. Px86) are inherently non-deterministic. Moreover, in contrast to sequential consistency, in view-based operational semantics

$$
\textsc{Cons}\ \dfrac{P' \Rightarrow P \qquad \{P\}\ \alpha\ \{Q\} \qquad Q \Rightarrow Q'}{\{P'\}\ \alpha\ \{Q'\}}
\qquad
\textsc{Conj}\ \dfrac{\{P_1\}\ \alpha\ \{Q_1\} \qquad \{P_2\}\ \alpha\ \{Q_2\}}{\{P_1 \wedge P_2\}\ \alpha\ \{Q_1 \wedge Q_2\}}
\qquad
\textsc{Disj}\ \dfrac{\{P_1\}\ \alpha\ \{Q_1\} \qquad \{P_2\}\ \alpha\ \{Q_2\}}{\{P_1 \vee P_2\}\ \alpha\ \{Q_1 \vee Q_2\}}
$$


Fig. 6: Standard decomposition rules of Pierogi

Fig. 7: Selected proof rules for atomic statements executed by thread τ

(such as Px86view) instructions such as a := load x may have a side-effect, since they may update the view of the thread performing the load (cf. [11]). Therefore, unlike Hoare Logic, which contains a single rule for assignment, we have a set of rules for atomic statements, describing their interaction with view-based assertions. Each of the rules in this section has been proved sound with respect to the view-based semantics encoded in Isabelle/HOL.

A selection of these rules for the atomic statements is given in Fig. 7, where the statement is assumed to be executed by thread τ. The first column contains the pre/postcondition triple, the second any additional constraints, and the third the labels that we use to refer to the rules in our descriptions below. Unless explicitly mentioned as a constraint, we do not assume that threads, locations and values are distinct; e.g. rule LP<sup>3</sup> (referring to τ and τ′) holds regardless of whether τ = τ′ or not.

The rules in Fig. 7 provide high-level insights into the low-level semantics of Px86view without having to understand the operational details. The LP<sup>i</sup> rules are for the statement a := load x. Rule LP<sup>1</sup> states that if τ's view of x is the set of values S, then in the post state a is an element of S and moreover τ's view of x is a subset of S (since τ's view may have shifted). By LP<sup>2</sup>, provided the conditional view of τ on y (with condition x = u) is S, if the load returns value u, then the view of τ is shifted so that [y]<sup>τ</sup> ⊆ S. We only have [y]<sup>τ</sup> ⊆ S in the postcondition because there may be multiple writes to x with value u; reading x may shift the view to a later such write, thus reducing the set of values that τ can read for y. LP<sup>3</sup> describes conditions for a deterministic load by thread τ. The precondition assumes that there is only one write to x with value u, and that some thread τ′ sees the last write to x with value u. Then, if τ reads u, its view of x is also constrained to just the set containing u.
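
Written as triples, the two simplest load rules read as follows. This is our transcription of the prose description, not a verbatim quote of Fig. 7, which may carry additional side conditions:

$$
\textsc{LP}_1:\ \{[x]_\tau = S\}\ \ a := \texttt{load } x\ \ \{a \in S \wedge [x]_\tau \subseteq S\}
\qquad
\textsc{LP}_2:\ \{\langle x,u\rangle[y]_\tau = S\}\ \ a := \texttt{load } x\ \ \{a = u \Rightarrow [y]_\tau \subseteq S\}
$$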

The store rules, SP<sup>i</sup>, reflect the fact that a new write modifies the views of the other threads as well as the persistent memory and asynchronous views. The first four rules describe the interaction of a store by thread τ with current view assertions. By SP<sup>1</sup>, the store ensures that the current view of τ is solely the value v written by τ. This is because in Px86view, new writes are introduced by the executing thread, τ, with a maximal timestamp (see the store rule in Fig. 12), and τ's view is updated to this new write. SP<sup>2</sup>, SP<sup>3</sup> and SP<sup>4</sup> are similar: assuming that the view (of another thread, the persistent memory, and the asynchronous view, respectively) in the pre-state is S, they show that the view in the post state is S ∪ {v}. Rule SP<sup>5</sup> allows one to introduce a conditional observation assertion ⟨x, v⟩[y]<sup>τ′</sup> where τ′ ≠ τ. The pre-state of SP<sup>5</sup> assumes that τ's view of y is the set S, and that τ′ cannot view value v for y. Rule SP<sup>6</sup> introduces last-view assertions for τ after τ performs a write to x, and finally SP<sup>7</sup> states that the number of writes to x with value v increases by 1 after executing store x v.
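
Schematically, again transcribing the prose rather than quoting the figure, the first two store rules have the following shape, where the store is executed by thread τ:

$$
\textsc{SP}_1:\ \{\mathit{true}\}\ \ \texttt{store } x\ v\ \ \{[x]_\tau = \{v\}\}
\qquad
\textsc{SP}_2:\ \{[x]_{\tau'} = S\}\ \ \texttt{store } x\ v\ \ \{[x]_{\tau'} = S \cup \{v\}\}
$$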

Rules FP<sup>i</sup> describe the effect of flush x on the state. FP<sup>1</sup> states that, provided that the current view of τ for x is the set of values S, after executing flush x, we are guaranteed that both the persistent view and the asynchronous view of τ for x are subsets of S. We obtain a subset in the post state since the Px86view semantics potentially moves the persistent and asynchronous views forward. Similarly, by FP<sup>2</sup>, if the current persistent view of x is S, then after executing flush x the persistent view will be a subset of S. Finally, FP<sup>3</sup> provides a mechanism for establishing a deterministic persistent view u for x. The precondition assumes that some thread's view of x is the last write with value u and that τ's view is such that the flush is guaranteed to flush this last write to x.
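
In the same transcribed notation, the first two flush rules are:

$$
\textsc{FP}_1:\ \{[x]_\tau = S\}\ \ \texttt{flush } x\ \ \{[x]^{\mathsf{P}} \subseteq S \wedge [x]^{\mathsf{A}}_\tau \subseteq S\}
\qquad
\textsc{FP}_2:\ \{[x]^{\mathsf{P}} = S\}\ \ \texttt{flush } x\ \ \{[x]^{\mathsf{P}} \subseteq S\}
$$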

Rule OP describes how the asynchronous view of τ in the postcondition of flushopt x is related to the current view of τ and the asynchronous view in the precondition. Finally, rule SFP describes the relationship between the persistent view in the postcondition and the asynchronous view and persistent view in the precondition for an sfence instruction.
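
Following the intuition of §2.2 (flushopt advances the asynchronous view to at least the coherence view, and sfence advances the persistent view to at least the asynchronous view), plausible readings of these two rules are:

$$
\textsc{OP}:\ \{[x]_\tau = S\}\ \ \texttt{flushopt } x\ \ \{[x]^{\mathsf{A}}_\tau \subseteq S\}
\qquad
\textsc{SFP}:\ \{[x]^{\mathsf{A}}_\tau = S\}\ \ \texttt{sfence}\ \ \{[x]^{\mathsf{P}} \subseteq S\}
$$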

Our Isabelle/HOL development contains further rules for the other instructions, including mfence and cas, which we omit here for space reasons. In addition, we prove the stability of several assertions (see Fig. 8 for a selection). An assertion P is stable over a statement α executed by τ if {P} α {P} holds.

Well-formedness The final major aspect of our framework is a well-formedness condition that describes the set of reachable states in the Px86view semantics. The condition is expressed as an invariant of the semantics: it holds initially, and is stable under every possible transition of Px86view. In fact, the rules in Figs. 7 and 8 are proved with respect to this well-formedness condition.

The majority of the well-formedness constraints are straightforward, e.g. describing the relationship between the views of different components. The most


Fig. 8: Selection of stable assertions for atomic statements executed by thread τ

important component of the well-formedness condition is a non-emptiness condition on views, which states that [x]_τ ≠ ∅ ∧ [x]^P ≠ ∅ ∧ [x]^A_τ ≠ ∅. For instance, a consequence of this condition is that, in combination with LP1, we have:

$$\{ [y]_\tau = \{v\} \}\;\; a := \text{load}\; x \;\;\{ [y]_\tau = \{v\} \} \tag{6}$$

Worked Example We now return to the proof obligations from Example 3 and demonstrate how they can be discharged using the proof rules described above. For Local correctness, condition (1) holds by Conj (from Fig. 6) together with stability rules WS1, WS2 and WS4 (from Fig. 8), which establish the first three conjuncts in the postcondition, and SP1 from Fig. 7, which establishes the final conjunct. Condition (2) holds by FP1 in Fig. 7 together with Cons (from Fig. 6). Finally, condition (3) holds by WS2 (from Fig. 8).

Both the Stability conditions (4) and (5) from Example 3 hold by the stability rules in Fig. 8 together with Cons and Conj (from Fig. 6). In particular, for (4), we use rules LS1, LS2 and LS4, and for (5), we use WS1, WS2 and WS4.

### 4 Examples

In this section we present a selection of programs that we have verified in Isabelle/HOL. These examples highlight specific aspects of Px86, in particular the interaction between flushopt and sfence, as well as aspects of our view-based assertion language that simplify verification.

Optimised Message Passing We start by considering a variant of Fig. 1e, which contains two optimisations. First, we notice that the flushing of the write to x in thread 1 can be moved to thread 2, since the write to z is guarded by whether or not thread 2 reads the flag y. Second, it is possible to replace the flush by a more optimised flushopt followed by an sfence. We confirm correctness of these optimisations via the proof outline in Fig. 9. The optimised message passing in Fig. 9 ensures the same persistent invariant as Fig. 1e. However, the way in


Fig. 9: Proof outline for optimised message passing

which this is established differs. In particular, in Fig. 1e, the persistent invariant holds due to thread 1, whereas in Fig. 9 it holds due to thread 2.

With respect to the persistent invariant, the most important sequence of steps takes place in thread 2 if it reads 1 for y. Note that by the conditional view assertion in the precondition of a := load y, thread 2 is guaranteed to read 1 for x after reading 1 for y. Thus, if the test of the if statement succeeds, then thread 2 must see 1 for x. This view is translated into an asynchronous view after the flushopt is executed, and then to the persistent view after executing sfence. Note that until this occurs, we can guarantee that [z]^P = {0}, which trivially guarantees the persistent invariant.

Flush Buffering Our next example is a variation of store buffering (sb) and is used to highlight how writes by different threads on different locations interact with flushes. Here, thread 1 writes to x and flushes y, while thread 2 writes to y then flushes x.<sup>6</sup> The writes to w and z are used to witness whether the flushes in both threads have occurred. The persistent invariant states that, if both w and z hold 1 in persistent memory, then either x or y has the new value (i.e. 1) in persistent memory. If both threads perform their flush operations, then at least one must flush value 1, since a flush cannot be reordered with a store.

Although simple to state, the proof is non-trivial since it requires careful analysis of the order in which the stores to x and y occur. In the semantics of Cho et al. [9], the flush corresponding to the second store instruction executed synchronises with writes to all locations. Thus, for example, if thread 1's store to x is executed after thread 2's store to y, then the subsequent flush in thread 1 is guaranteed to flush the new write to y.

The above intuition requires reasoning about the order in which operations occur. To facilitate this, we use auxiliary variables â and b̂ to record the order in which the writes to x and y occur; â = 1 if the write to x occurs before the

<sup>6</sup> Note that the flush operations here are analogous to the load instructions in sb.


Fig. 10: Proof outline for fush bufering

write to y, and â = 2 if the write to x occurs after the write to y. Let us now consider the precondition of flush y (the reasoning for flush x is symmetric). There are two disjuncts to consider.


Epoch Persistency In our next example, we demonstrate how writes of different threads on the same location interact with an optimised flush on the same location, as well as how the ordering of optimised flushes/loads alters the persistency behaviour. The crash invariant of Fig. 11 states that if z and y hold the value 1 in persistent memory, then x has the value 2 in persistent memory.

In order for thread 2 to read value 2 for x, the store of 2 at x must be performed before the store of 1, and [x]_2 = {1, 2}. Establishing the persistent


Fig. 11: Proof outline for epoch persistency

invariant for thread 2 requires reasoning about the view of thread 2 for address x (i.e. [x]_2) after the execution of the instruction a := load x. Notice here that a := load x is ordered with respect to the later flushopt x instruction. Consequently, any impact of the execution of the load on [x]_2 will also affect [x]^A_2. Taking into account the ordering of the writes at the address x, we can conclude that if thread 2 reads the value 2, it reads the value of the last write at x. This is expressed with the assertion ⌈x⌉_1 in the precondition of a := load x, which states that thread 1's view of x is the last write to x. By rule LP3, if a thread τ's view of an address x contains only the last write at this address, and the last value written at this address appears only once in the memory, then if a thread τ′ reads this value at x, its view of x (i.e. [x]_τ′) is guaranteed to contain only the last written value at x. Consequently, after reading value 2, thread 2's view of x contains only the value 2 (i.e. [x]_2 = {2}). Execution of flushopt x ensures [x]^A_2 = {2} (by rule OP). As a result, in the case that the if statement succeeds, after the execution of the sfence it is guaranteed that the value 2 is persisted at x (i.e. [x]^P = {2}). In the case that the if statement fails, [y]^P = {0} must hold, and thus the persistent invariant holds trivially.

### 5 Pierogi Soundness

In this section we present the Px86view model from [9] (§5.1), formally interpret our assertions as predicates on states of that model (§5.2), and establish the soundness of the proposed reasoning technique (§5.3).

$$\frac{\alpha = a := e \quad v = T.\mathsf{regs}(e) \quad T' = T[\mathsf{regs}(a) \mapsto v]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{assign})$$

$$\frac{\alpha = \mathtt{store}\ x\ e \quad v = T.\mathsf{regs}(e) \quad M' = M \mathbin{+\!\!+} [\langle x := v \rangle] \quad T' = T[\mathsf{coh}(x) \mapsto |M|]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M' \rangle}\;(\textsc{store})$$

$$\frac{\alpha = a := \mathtt{load}\ x \quad M[t] = \langle x := v \rangle \quad T.\mathsf{coh}(x) = t \quad T' = T[\mathsf{regs}(a) \mapsto v]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{load-internal})$$

$$\frac{\alpha = a := \mathtt{load}\ x \quad M[t] = \langle x := v \rangle \quad T.\mathsf{coh}(x) < t \quad x \notin M(t..T.\mathsf{vrNew}] \quad T' = T[\mathsf{regs}(a) \mapsto v,\ \mathsf{coh}(x) \mapsto t,\ \mathsf{vrNew} \mapsto_{\sqcup} t,\ \mathsf{vpReady} \mapsto_{\sqcup} t]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{load-external})$$

$$\frac{\alpha = \mathtt{sfence} \quad T' = T[\mathsf{vpReady} \mapsto_{\sqcup} T.\mathsf{maxcoh},\ \mathsf{vpCommit} \mapsto_{\sqcup} T.\mathsf{vpAsync}]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{sfence})$$

$$\frac{\alpha = \mathtt{flush}\ x \quad T' = T[\mathsf{vpAsync}(x) \mapsto_{\sqcup} T.\mathsf{maxcoh},\ \mathsf{vpCommit}(x) \mapsto_{\sqcup} T.\mathsf{maxcoh}]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{flush})$$

$$\frac{\alpha = \mathtt{flushopt}\ x \quad T' = T[\mathsf{vpAsync}(x) \mapsto_{\sqcup} T.\mathsf{coh}(x) \sqcup T.\mathsf{vpReady}]}{\langle T, M \rangle \xrightarrow{\alpha} \langle T', M \rangle}\;(\textsc{flushopt})$$

$$\frac{\vec{pc}(\tau) = i \quad \Pi(\tau, i) = \alpha\ \mathtt{goto}\ j \quad \langle \vec{T}(\tau), M \rangle \xrightarrow{\alpha} \langle T', M' \rangle \quad \vec{pc}' = \vec{pc}[\tau \mapsto j] \quad \vec{T}' = \vec{T}[\tau \mapsto T']}{\langle \vec{pc}, \vec{T}, M, G \rangle \Rightarrow_{\Pi} \langle \vec{pc}', \vec{T}', M', G \rangle}\;(\textsc{program-normal})$$

$$\frac{\vec{pc}(\tau) = i \quad \Pi(\tau, i) = \mathtt{if}\ B\ \mathtt{goto}\ j\ \mathtt{else\ goto}\ k \quad \vec{pc}' = \vec{pc}\left[\tau \mapsto \begin{cases} j & \vec{T}(\tau).\mathsf{regs}(B) = \mathsf{true} \\ k & \vec{T}(\tau).\mathsf{regs}(B) = \mathsf{false} \end{cases}\right]}{\langle \vec{pc}, \vec{T}, M, G \rangle \Rightarrow_{\Pi} \langle \vec{pc}', \vec{T}, M, G \rangle}\;(\textsc{program-if})$$

$$\frac{\vec{pc}(\tau) = i \quad \Pi(\tau, i) = \langle \alpha\ \mathtt{goto}\ j,\ \hat{a} := \hat{e} \rangle \quad \langle \vec{T}(\tau), M \rangle \xrightarrow{\alpha} \langle T', M' \rangle \quad \vec{pc}' = \vec{pc}[\tau \mapsto j] \quad \vec{T}' = \vec{T}[\tau \mapsto T'] \quad G' = G[\hat{a} \mapsto G(\hat{e})]}{\langle \vec{pc}, \vec{T}, M, G \rangle \Rightarrow_{\Pi} \langle \vec{pc}', \vec{T}', M', G' \rangle}\;(\textsc{program-ghost})$$

Fig. 12: Transitions of Px86view for a program Π

#### 5.1 The Px86view Model

Like previous view-based models, Px86view employs a non-standard memory capturing all previously executed writes, alongside so-called "thread views" that track several positions of each thread in that history and enforce limitations on the thread's ability to read from and write to the memory. In addition, the thread views contain the necessary information for determining the possible contents of the non-volatile memory upon a system crash. Formally, Px86view's memory and thread states are defined as follows.

Definition 2 (Px86view's memory). A memory M ∈ Memory is a list of messages, where each message has the form ⟨x := v⟩ for some x ∈ Loc and v ∈ Val. We use w.loc and w.val to refer to the two components of a message w. We use standard list notations for memories (e.g. M₁ ++ M₂ for appending memories, [w] for a singleton memory, and |M| for the length of M). We refer to indices (starting from 0) in a memory M as timestamps, and denote the t-th element of M as M[t]. We use ⊔ for obtaining the maximum among timestamps (i.e. t₁ ⊔ t₂ = max(t₁, t₂)), and extend this notation pointwise to functions. We write x ∉ M(t₂..t₁] for the condition ∀t₂ < t ≤ t₁. M[t].loc ≠ x.
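
To make the definitions concrete, the following minimal Python sketch encodes a memory as a list of messages with indices as timestamps; the names Message and no_write_in are ours, not the paper's mechanisation (which is in Isabelle/HOL).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    loc: str  # the location x of a message <x := v>
    val: int  # the value v

# A memory M is a list of messages; the timestamp of a message is its index.

def no_write_in(M, x, t2, t1):
    """x not in M(t2..t1]: no write to x at any timestamp t with t2 < t <= t1
    (vacuously true when t1 <= t2); assumes t1 < len(M)."""
    return all(M[t].loc != x for t in range(t2 + 1, t1 + 1))
```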

Definition 3 (Px86view's thread states). A thread state T ∈ Thread is a record consisting of the following fields: coh : Loc → N, vrNew : N, vpReady : N, vpAsync : Loc → N, and vpCommit : Loc → N. We use standard function/record update notation (e.g. T′ = T[coh(x) ↦ t] denotes the thread state obtained from T by modifying the x entry in the coh component of T to t). In addition, ↦⊔ is used to incorporate certain timestamps in fields (e.g. T[vrNew ↦⊔ t] denotes the thread state obtained from T by modifying the vrNew component of T to T.vrNew ⊔ t). We denote by T.maxcoh the maximum among the coherence view timestamps (T.maxcoh = ⊔ₓ T.coh(x)).
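
Continuing the sketch (again our own encoding, using the paper's field names), a thread state with its derived maxcoh and the ↦⊔ update can be rendered as:

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    coh: dict       # Loc -> coherence timestamp per location
    vrNew: int      # the vrNew view
    vpReady: int    # the vpReady view
    vpAsync: dict   # Loc -> the vpAsync view per location
    vpCommit: dict  # Loc -> the vpCommit view per location

    def maxcoh(self):
        # T.maxcoh: the maximum coherence timestamp over all locations
        return max(self.coh.values())

    def join_vrNew(self, t):
        # T[vrNew |->_join t]: incorporate timestamp t into vrNew
        self.vrNew = max(self.vrNew, t)
```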

The two components, together with program counters and the "ghost memory", are combined in Px86view's machine states as defined next.

Definition 4 (Px86view's machine states). A machine state is a tuple σ = ⟨p⃗c, T⃗, M, G⟩ where p⃗c : Tid → Lab is a mapping assigning the next program label to be executed by each thread, T⃗ : Tid → Thread is a mapping assigning the current thread state to each thread, M ∈ Memory is the current memory, and G : AuxVar → Val stores the current values of the auxiliary variables. Below we assume that G is extended to expressions ê ∈ AuxExp in a standard way. We denote the components of a machine state σ by σ.p⃗c, σ.T⃗, σ.M, and σ.G. In addition, we denote by σ.maxpCommit(x) the maximum among the persistency view timestamps for location x (σ.maxpCommit(x) = ⊔_τ σ.T⃗(τ).vpCommit(x)).

The transitions of Px86view are presented in Fig. 12. These closely follow the model in [9] with minor presentational simplifications. Note, however, that, for simplicity and following [23], we conservatively assume that writes persist atomically at the location granularity (representing, e.g. machine words) rather than at the granularity of the width of a cache line. We refer the interested reader to [9] for a detailed discussion of the transition rules in Fig. 12.

The above operational definitions naturally induce a notion of an execution (or a "run") of Px86view on a certain program Π starting from some initial state of the form ⟨λτ. ι, T⃗, M, G⟩. A system crash might occur at any point during the execution. Again, following the model of [9], the non-volatile memory (NVM) is not modeled as a concrete part of the state. Instead, the possible contents of the NVM can be inferred from the machine state (specifically from the memory and the vpCommit views of the different threads), as defined next. This definition is presented as a "crash transition" in [9].

Definition 5. A non-volatile memory NVM : Loc → Val is possible in a state σ if for every x ∈ Loc, there exists some t such that σ.M[t] = ⟨x := NVM(x)⟩ and x ∉ σ.M(t..σ.maxpCommit(x)].
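
A direct transcription of this definition under the sketch above (maxpCommit here is assumed to be a map from locations to the maximum vpCommit(x) over all threads):

```python
def possible_nvm_values(M, maxpCommit, x):
    """All values that NVM(x) may hold after a crash (Definition 5)."""
    return {M[t].val
            for t in range(len(M))
            if M[t].loc == x and no_write_in(M, x, t, maxpCommit[x])}
```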

### 5.2 The Semantics of Pierogi Assertions

We present the formal definitions of the expressions introduced in §3.2 in terms of Px86view's machine states.

Current and conditional views When formalising the current and conditional view expressions, we start with auxiliary functions that return the sets of observable timestamps visible to the components in question, and then extract the values in memory corresponding to these timestamps. To facilitate this, we define

Vals(M, TS) ≜ {M[t].val | t ∈ TS}

where M ∈ Memory and TS is a set of timestamps.

Thread view To define the meaning of the thread view expression, [x]_τ, we use:

$$\begin{aligned} \mathsf{TS}^{\mathsf{OF}}_{\tau}(\sigma, x, t) &\triangleq \{ t' \mid \sigma.M[t'].\mathsf{loc} = x \,\wedge\, \sigma.\vec{T}(\tau).\mathsf{coh}(x) \le t' \,\wedge\, x \notin \sigma.M(t'..t] \} \\ \mathsf{TS}_{\tau}(\sigma, x) &\triangleq \mathsf{TS}^{\mathsf{OF}}_{\tau}(\sigma, x, \sigma.\vec{T}(\tau).\mathsf{vrNew}) \end{aligned}$$

TS^OF_τ(σ, x, t) returns the set of timestamps that are observable from timestamp t for thread τ to read for location x in state σ; and TS_τ(σ, x) returns the set of timestamps that are observable for τ to read x in σ. Note that after instantiating t to σ.T⃗(τ).vrNew in TS^OF_τ(σ, x, t), we obtain the premises of the load rules in Fig. 12. Then, [x]_τ ≜ λσ. Vals(σ.M, TS_τ(σ, x)), i.e. the set of values in σ.M corresponding to the timestamps in TS_τ(σ, x).
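
Under the same sketch, the observable-timestamp functions and the thread view can be transcribed as follows (here a state σ is represented by the pair of a thread state T and a memory M):

```python
def ts_of(T, M, x, t):
    """TS^OF: timestamps observable from timestamp t for reading x."""
    return {tp for tp in range(len(M))
            if M[tp].loc == x and T.coh[x] <= tp and no_write_in(M, x, tp, t)}

def ts(T, M, x):
    """TS: timestamps observable for reading x, taken from the thread's vrNew."""
    return ts_of(T, M, x, T.vrNew)

def view(T, M, x):
    """[x]_tau: the set of values the thread may currently read for x."""
    return {M[t].val for t in ts(T, M, x)}
```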

Persistent memory view For the persistent memory view expression, [x]^P, we use:

$$\mathsf{TS}^{\mathsf{P}}(\sigma, x) \triangleq \{ t \mid \sigma.M[t].\mathsf{loc} = x \,\wedge\, x \notin \sigma.M(t..\sigma.\mathsf{maxpCommit}(x)] \}$$

which returns the set of timestamps that are observable to the persistent memory for x in σ. Then, [x]^P ≜ λσ. Vals(σ.M, TS^P(σ, x)). Note that the second conjunct within the definition of TS^P(σ, x) is precisely the condition that links Px86view states to NVM states (Definition 5). Given this definition, we have:

Proposition 1. A non-volatile memory NVM : Loc → Val is possible in a state σ if NVM(x) ∈ [x]^P(σ) for every x ∈ Loc.

Asynchronous memory view To define the meaning of the asynchronous memory view, [x]^A_τ, we use:

$$\mathsf{TS}^{\mathsf{A}}_{\tau}(\sigma, x) \triangleq \{ t \mid \sigma.M[t].\mathsf{loc} = x \,\wedge\, x \notin \sigma.M(t..\sigma.\vec{T}(\tau).\mathsf{vpAsync}(x)] \}$$

which returns the timestamps of the asynchronous view of thread τ in location x and state σ. Then, as before, [x]^A_τ ≜ λσ. Vals(σ.M, TS^A_τ(σ, x)).

Conditional view The functions used to define the conditional memory view, ⟨x, v⟩[y]_τ, are slightly more sophisticated than those above. We define:

$$\mathsf{TS}^{\mathsf{OV}}_{\tau}(\sigma, x, v) \triangleq \left\{ t' \;\middle|\; \exists t \in \mathsf{TS}_{\tau}(\sigma, x).\; \sigma.M[t].\mathsf{val} = v \,\wedge\, t' = \mathsf{if}\ t = \sigma.\vec{T}(\tau).\mathsf{coh}(x)\ \mathsf{then}\ \sigma.\vec{T}(\tau).\mathsf{vrNew}\ \mathsf{else}\ t \sqcup \sigma.\vec{T}(\tau).\mathsf{vrNew} \right\}$$

$$\mathsf{TS}^{\mathsf{CO}}_{\tau}(\sigma, x, v, y) \triangleq \bigcup \{ \mathsf{TS}^{\mathsf{OF}}_{\tau}(\sigma, y, t) \mid t \in \mathsf{TS}^{\mathsf{OV}}_{\tau}(\sigma, x, v) \}$$

where TS^OV_τ(σ, x, v) returns the set of timestamps that τ can observe for x with value v. Assuming t is a timestamp that τ can observe for x, and the value for x at t is v, the corresponding timestamp t′ that TS^OV_τ(σ, x, v) returns is σ.T⃗(τ).vrNew if τ's coherence view for x is t, and the maximum of t and σ.T⃗(τ).vrNew otherwise. Given this, TS^CO_τ(σ, x, v, y) returns the timestamps that τ can observe for y, from any timestamp t ∈ TS^OV_τ(σ, x, v). Finally, the set of conditional values is defined by ⟨x, v⟩[y]_τ ≜ λσ. Vals(σ.M, TS^CO_τ(σ, x, v, y)).
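
Transcribing these two functions in the same sketch style (our encoding, under the same assumptions as the previous sketch):

```python
def ts_ov(T, M, x, v):
    """TS^OV: for each timestamp of x observable with value v, the read view
    the thread would shift to after reading it."""
    return {T.vrNew if t == T.coh[x] else max(t, T.vrNew)
            for t in ts(T, M, x) if M[t].val == v}

def cond_view(T, M, x, v, y):
    """<x, v>[y]_tau: values of y observable after reading v for x."""
    return {M[tp].val
            for t in ts_ov(T, M, x, v)
            for tp in ts_of(T, M, y, t)}
```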

Last view assertions We use the following auxiliary definition:

Last(M, x) ≜ ⊔{t | M[t].loc = x}

which returns the timestamp of the last write to x in M. Then, the last view assertions are given by:


Value count Finally, the value count expression is defined as follows:

$$|x, v| \triangleq \lambda \sigma. \left| \{ t \mid \sigma.M[t] = \langle x := v \rangle \} \right|$$
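
Both Last and the value count are one-liners in the sketch encoding used above:

```python
def last(M, x):
    """Last(M, x): the timestamp of the last write to x in M."""
    return max(t for t in range(len(M)) if M[t].loc == x)

def value_count(M, x, v):
    """|x, v|: the number of messages <x := v> in M."""
    return sum(1 for w in M if w.loc == x and w.val == v)
```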

### 5.3 Soundness of Pierogi

Given the above building blocks, the soundness of the proposed reasoning technique is stated as follows.

Theorem 1 (Soundness of Pierogi). Suppose that a program Π has a valid proof outline ⟨in, ann, I, fin⟩. Let σ be a state of Px86view that is reachable in an execution of Π from some state σ_init of the form ⟨λτ. ι, T⃗_init, M_init, G_init⟩ such that σ_init ∈ in. Then, the following hold:


Finally, it is straightforward to show the soundness of a standard "auxiliary variable transformation" [30] which removes all auxiliary variables from a program Π (translating each command ⟨α goto j, â := ê⟩ into α goto j), provided that the crash invariant and the final assertion do not contain occurrences of the auxiliary variables. Indeed, it is easy to see that the auxiliary memory G in the operational semantics in Fig. 12 serves only as an instrumentation, and does not restrict the possible runs. (Formally, if Π′ is obtained from Π by removing all auxiliary variables and ⟨p⃗c, T⃗, M, G′⟩ is reachable in ⇒Π′ from some initial state, then ⟨p⃗c, T⃗, M, G⟩ is reachable in ⇒Π from the same state for some G.)

### 6 Mechanisation

Perhaps the greatest strength of our development is an integrated Isabelle/HOL mechanisation providing a fully fledged semi-automated verification tool for Px86view programs. This mechanisation builds on the existing work on Owicki–Gries for RC11 by Dalvandi et al. [11, 12], applying it to the Px86view semantics. We start by encoding the operational semantics of Cho et al. [9], followed by the view-based assertions described in §3.2. Then, we prove correctness of all of the proof rules for the atomic statements, including those described in §3.4. These rules can be challenging to prove since they require unfolding of the assertions and examination of the low-level operational semantics and their effect on the views of different system components.

Once proved, the rules provided are highly reusable, and are key to making verification feasible. Specifically, when showing the validity of a proof outline (Definition 1), Isabelle/HOL generates the necessary proof obligations (after minor interactions) and automatically finds the set of high-level proof rules needed to discharge each proof obligation via the built-in sledgehammer tool [6]. This enables a high degree of experimentation and debugging of proof outlines, including the ability to reduce assertion complexity once a proof outline is validated.

The base development (semantics, view-based assertions, and soundness of proof rules) comprises ∼7000 lines of Isabelle/HOL code. With this base development in place, each example comprises 200–400 lines of code (including the encoding of the program, the annotations, and the proofs of validity). The entire development took approximately 3 months of full-time work.

### 7 Related Work

The soundness of Pierogi is proven relative to the Px86view model of Cho et al. [9]; there are, however, other equivalent models in the literature [1, 23, 32, 34], as well as other persistency models [33, 35]. While the original persistent x86 semantics has asynchronous explicit persist instructions [34], the underlying model assumed here is due to Cho et al. [9], with synchronous persist instructions. Nevertheless, Khyzha and Lahav [23] formally proved that the two alternatives are equivalent when reasoning about states after crashes (e.g. using our "crash invariants").

As mentioned in §1, the only existing program logic for persistent programs is POG [31], which (like Pierogi) is a descendant of Owicki–Gries [30]. Pierogi goes beyond POG by handling examples that involve flushopt instructions, which cannot be directly verified using POG. Raad et al. [31] provide a transformation technique to replace certain patterns of flushopt and sfence with flush. Specifically, given a program Π that includes flushopt instructions, provided that Π meets certain conditions, this transformation mechanism rewrites Π into an equivalent program Π′ that uses flush instructions instead, allowing one to use POG. However, there are three limitations to this strategy: 1) the rewriting is an external mechanism that requires stepping outside the POG logic; 2) the rewriting is potentially expensive and must be done for every program that includes flushopt; and 3) the transformation technique is incomplete, in that not all programs meet the stipulated conditions (e.g. Epoch Persistency 2), and thus cannot be verified using this technique. Pierogi has no such limitations, as we showed in the examples in Section 4. Moreover, POG has no corresponding mechanisation, and developing a mechanisation that also efficiently handles the program transformation for flushopt instructions would be non-trivial.

The Owicki–Gries method was first applied to non-SC memory consistency by Lahav et al. [26]. One way in which their approach, which targets the release/acquire memory model, differs from ours is that they aim to use standard SC-like assertions; in order to retain soundness under a weak memory model, they had to strengthen the standard stability conditions on proof outlines. Dalvandi et al. [11, 13] took a different approach when designing their Owicki–Gries logic for the release/acquire fragment of C11: by employing a more expressive, view-based assertion language, they were able to stick with the standard stability requirement. In our work, we follow Dalvandi et al.'s approach. However, our assertions are fine-tuned to cope with the other types of view present in Px86view, such as those corresponding to the persistent and the asynchronous views. It is interesting that some of the principles of view-based reasoning apply to different memory models, and future work could look at unifying reasoning across models.

Dalvandi et al. [13] have developed a deeper integration of their view-based logic using the Owicki–Gries encoding of Nipkow and Prensa Nieto [28] in Isabelle/HOL. Such an integration would be straightforward for Pierogi too, allowing verification to take place without translating programs into a transition system. This would be much more difficult for POG, since the Owicki–Gries rules themselves are different from the standard encoding in Isabelle/HOL, in addition to the transformation required for flushopt instructions discussed above.

The idea of extending Hoare triples with crash conditions first appeared in the work of Chen et al. [8]. However, that work supports neither concurrency nor explicit flushing instructions. Related ideas are found in the works of Ntzik et al. [29] and Chajed et al. [7]. However, in contrast to Pierogi, both of these works 1) assume sequentially consistent memory, as opposed to a weak memory model such as TSO; 2) assume strict persistency (where store and persist orders coincide); and 3) assume there is a synchronous flush operation, which is easier to reason about than the asynchronous flushopt operation.

Besides program logics, there have been other recent efforts to help programmers reason about persistent programs. For instance, Abdulla et al. [1] have proven that state-reachability for persistent x86 is decidable, thus opening the door to automatic verification of persistent programs, and Gorjiara et al. [18] and Kokologiannakis et al. [25] have developed model checkers for finding bugs in persistent programs. Recent works have considered durable atomic objects such as concurrent data structures [17] and transactional memory [3] and their verification [3, 14, 15], which have been designed to satisfy conditions such as durable linearizability [20, 24] and durable opacity [3]. These proofs assume persistency under SC; our work provides foundations for extending these proofs to persistent x86-TSO.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Abstraction for Crash-Resilient Objects<sup>⋆</sup>

Artem Khyzha⋆⋆ and Ori Lahav

Tel Aviv University, Tel Aviv, Israel artkhyzha@mail.tau.ac.il and orilahav@tau.ac.il

Abstract. We study abstraction for crash-resilient concurrent objects using non-volatile memory (NVM). We develop a library-correctness criterion that is sound for ensuring contextual refinement in this setting, thus allowing clients to reason about library behaviors in terms of their abstract specifications, and library developers to verify their implementations against the specifications abstracting away from particular client programs. As a semantic foundation we employ a recent NVM model, called Persistent Sequential Consistency, and extend its language and operational semantics with useful specification constructs. The proposed correctness criterion accounts for NVM-related interactions between client and library code due to explicit persist instructions, and for calling policies enforced by libraries. We illustrate our approach on two implementations and specifications of simple persistent objects with different prototypical durability guarantees. Our results provide the first approach to formal compositional reasoning under NVM.

Keywords: Non-volatile memory · Linearizability · Library abstraction

### 1 Introduction

Non-volatile memory, or NVM for short, is an emerging technology that enables byte-addressable and high-performance storage alongside data persistency across system crashes. This combination of features allows researchers and practitioners to develop a variety of efficient crash-resilient data structures (see, e.g., [14, 32]). Recently, NVM has started to become available in commodity architectures of manufacturers such as Intel and ARM [4, 23], and formal (operational and declarative) models of these systems have been proposed [10, 25, 30].

Unfortunately, like other new technologies, NVM puts more burden on programmers. Indeed, to get close to the performance of DRAM, writes to the NVM are first kept in volatile (i.e., losing contents upon crashes) caches, and only later persist (i.e., propagate to the NVM), possibly not in the order in which they were

<sup>⋆</sup> This research was supported by the Israel Science Foundation (grants 1566/18 and 2005/17) and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 851811). Additionally, the first author was supported by the Blavatnik Family Foundation, and the second by the Alon Young Faculty Fellowship.

<sup>⋆⋆</sup> Now at Arm Ltd, Cambridge, UK

issued. This results in counterintuitive behaviors even for sequential programs, and requires careful management using barriers of different kinds, a.k.a. explicit persist instructions, to guarantee that the system recovers to a consistent state upon a failure. Combined with standard concurrency issues, programming on such machines is highly challenging.

To tackle the complexity and make NVM widely applicable, one would naturally want to draw on libraries encapsulating highly optimized concurrent crash-resilient data structures (a.k.a. persistent objects). This approach goes both ways: programmers should be able to reason about their code using abstract library specifications that hide the implementation details, and in turn, library developers should be able to verify "once and for all" their implementations against their specifications, abstracting away from a particular client program. From a formal standpoint, this indispensable modularity requires us to have a so-called (library) abstraction theorem: a correctness condition that guarantees the soundness of client reasoning that assumes the specification instead of the implementation. Put differently, the abstraction theorem should allow one to establish contextual refinement, i.e., conclude that the specification reproduces the implementation's client-observable behaviors under any (valid) context. To the best of our knowledge, while several correctness criteria for persistent objects, akin to classical linearizability, have been proposed and established for multiple sophisticated implementations, none of them has been formally related to contextual refinement by an abstraction theorem of this kind.

In this paper we formulate and prove an abstraction theorem for concurrent programs utilizing non-volatile memory. We target the "Persistent Sequential Consistency" model of [25], or PSC, which enriches standard sequentially consistent shared memory with non-volatile storage, using per-location FIFO buffers to account for delayed and out-of-order persistence of writes. PSC constitutes a relatively simple model that is very close to developers' informal understanding of NVM. While existing hardware does not implement PSC as is, [25] presented compiler mappings from PSC to x86 (based on its persistency model from [30]), which can be used to ensure PSC semantics on Intel machines. Directly supporting relaxed memory models is left for future work.

Auxiliary material. An extended version, including proofs of theorems stated in the paper, is available at https://arxiv.org/abs/2111.03881.

### 2 Key Challenges and Ideas

We outline the main challenges and the key ideas in our solutions. We keep the discussion informal, leaving the formal development to later sections.

#### 2.1 Library Specifications

A choice of a formalism for specifying library behaviors is integral to stating a library abstraction theorem. For libraries of concurrent data structures (a.k.a. concurrent objects), a popular approach is to give specifications in terms of sequential objects with the help of the classical notion of linearizability [21], which requires every sequence of method calls and returns that a concurrent program can produce to correspond to a sequence that can be generated by the sequential object. In this approach, a sequential object, represented by a set of sequences of pairs of method invocations and their associated responses, constitutes the library specification. Then, abstraction allows the client to reason about calls to a concurrent library as if they execute atomically on a single thread, or, equivalently, protected by a global lock [7, 13].

For libraries of crash-resilient objects, there is more than one natural way of interpreting sequential specifications and adapting the linearizability definition, and no single notion of correctness w.r.t. sequential specifications captures all the different options. A crash-resilient object may ensure that all methods completed by the moment of crash survive through it, or that some prefix of them does. It may also choose different possibilities for methods in progress at the moment of crash (whether or not they are allowed to take their effect at some later point after the crash). Multiple adaptations of linearizability have been proposed, each relating crash-resilient objects to sequential specifications in a different way. This includes: strict linearizability [3], persistent atomicity [19], and durable linearizability and its buffered variant [24]. Among them, buffered durable linearizability, which allows for efficient implementations, ended up not being compositional, which means that it may happen that two (non-interacting) libraries are both correct, but their combination is not. In fact, since each of the different notions is useful for particular objects, one may naturally want to mix different correctness notions in a single client program. This would force the client to reason with several alternatives for interpreting sequential specifications, and to make sure that they compose well with one another.

To approach this variety, we believe it is necessary to follow a different approach, which is standard in concurrent program verification (see, e.g., [18, 20, 26]), and was applied before for deriving abstraction theorems in different contexts [8, 16, 17]. The idea is to take a library's specification to be just another library, where the latter is intended to have a simpler implementation. Then, we define a library correctness condition stating what it means for one library L to refine another library L♯ (equivalently, for L♯ to abstract L), and prove an abstraction theorem that ensures that when the library correctness condition is met, the behaviors of any client using L are contained in the behaviors of the client using L♯. Such a theorem is only useful if the correctness condition avoids quantification over all possible clients, which would make the theorem trivial.

Using code for specifying libraries has several advantages over correctness notions based on sequential specifications. First, specifications and implementations are expressed and reasoned about in a unified framework, alleviating the need to interpret the use of sequential specifications by concurrent programs with system failures. Instead, the client of the theorem replaces complex library code with simpler specification code, and thus works with the semantics of a single language. Second, it enables a layered verification technique for library developers, allowing them to prove library correctness by introducing one or more intermediate implementations between L and L♯. Finally, this formulation of the abstraction theorem is compositional (a.k.a. local) by construction, meaning that objects can be specified and verified in isolation.

Now, "code as a specifcation" is only useful if the programming language is sufciently expressive for desirable specifcations. For concurrent objects, "atomic blocks", often included in theoretic programming languages, provide a handy specifcation construct. For NVM, one needs a way to govern the persistence similarly, ofering intuitive specifcations for libraries that simplify client reasoning. For that matter, viewing the out-of-order persistence of writes to diferent cache lines as the major source of counterintuitive behaviors in NVM, we propose a new specifcation construct, which we call persistence blocks. Roughly speaking, such blocks may only persist in their entirety, so that persistence blocks ensure an "all-or-nothing" persistency behaviors to the writes they protect.

For example, when recovering after a crash during a run of the tiny program ẋ := 1; ẏ := 1,<sup>1</sup> due to out-of-order persistence (writes to different cache lines are not guaranteed to persist in the order in which they were issued), we may reach any combination of values satisfying ẋ ∈ {0, 1} ∧ ẏ ∈ {0, 1}. In turn, if a persistence block is used, as in beginPB(ẋ, ẏ); ẋ := 1; ẏ := 1; endPB(ẋ, ẏ), then only ẋ = ẏ = 0 ∨ ẋ = ẏ = 1 are possible upon recovery.
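
The contrast can be enumerated explicitly; the following toy Python enumeration (ours, not the formal PSC semantics of §4) lists the post-crash states of this two-write program with and without the persistence block:

```python
from itertools import product

def crash_states(use_persistence_block):
    writes = [("x", 1), ("y", 1)]  # the two writes of the example program
    if use_persistence_block:
        # The block persists in its entirety or not at all.
        return {tuple((loc, v if done else 0) for loc, v in writes)
                for done in (False, True)}
    # Out-of-order persistence: any subset of the writes may have persisted.
    return {tuple((loc, v if f else 0) for (loc, v), f in zip(writes, flags))
            for flags in product((False, True), repeat=len(writes))}

print(crash_states(False))  # all four combinations of x, y in {0, 1}
print(crash_states(True))   # only x = y = 0 or x = y = 1
```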

Our blocks are closely related to persistent transactions of the PMDK library [22] (but we avoid the term transaction, since persistence blocks do not ensure isolation when executed concurrently). In our technical development, we extend the PSC model with instructions for persistence blocks, and carefully construct their semantics (see §4.2) to allow the abstraction result. We believe that persistence blocks are a useful specification construct for various data structures, where data consistency naturally involves multiple locations (often, pointers) being in-sync with one another.

#### 2.2 Client-Library Interaction Using Explicit Persist Instructions

The key to establishing a library abstraction theorem is in decomposing a program into two interacting sub-parts, a client and a library, and understanding the interactions between them. These interactions are usually defined in terms of histories, taken to be sequences of method invocations and responses, along with the values being passed. The library correctness condition (the premise of the abstraction theorem) requires that histories produced by using a library L are also produced by its specification L♯ when both libraries are used by a certain "most general client" (MGC, for short) that concurrently invokes arbitrary methods of L an arbitrary number of times with every possible argument. The abstraction theorem ensures that if the library correctness condition holds, then L refines L♯ for any client.

Thus, for the abstraction theorem to hold, one has to make sure that the interactions between any client and the library are fully captured in the history produced by the library when used by the MGC. In crash-free sequentially

<sup>1</sup> We use "overdots" to denote non-volatile variables. We assume that all variables are initialized to 0 and that ˙x and ˙y lie on diferent cache lines.

consistent shared memory semantics, this is ensured by the standard assumption that the client and the library manipulate disjoint sets of memory locations. Indeed, this restriction guarantees that clients can communicate with libraries only via values passed to and returned from method invocations.

However, we observe that under NVM, mutual interactions between the client and the library go beyond passed values, even when assuming disjointness of memory locations, which makes the standard notion of a library history insufficient. As a simple example, consider an interface with just one method f, specified by L♯ = [f ↦ sfence; return]. The sfence instruction, called "store fence", is an explicit persist instruction meant to be used in conjunction with optimized barriers called "flush-optimal" (denoted by fo). Its role is to guarantee the persistence of previous write instructions that are guarded by flush-optimal instructions. Concretely, under PSC (following x86), after a thread executes ẋ := 1; fo(ẋ); sfence, we know that the write of 1 to ẋ has persisted (i.e., been propagated to the NVM), while without the sfence, it may still sit in the volatile part of the memory system.

In turn, consider an implementation L, given by L = [f ↦ return], that implements f by doing nothing. Clearly, L does not implement L♯ correctly. Indeed, for the (sequential) client program ẋ := 1; fo(ẋ); call(f); ẏ := 1 that uses L♯, we have ẏ = 1 ⟹ ẋ = 1 as a global invariant: if the system has crashed and we have ẏ = 1 in the NVM, then the sfence ensures that ẋ = 1 is in the NVM as well. Nevertheless, due to out-of-order persistence, if we use L in this program, we may get ẏ = 1 ∧ ẋ = 0 after a crash. Now, the client and the libraries above mention disjoint locations, and the histories that L may produce for the MGC are exactly the histories that L♯ produces (all well-formed sequences of "call" and "return"). Thus, when inspecting histories of L and of L♯, we do not have sufficient information to observe the difference between them.
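
The failure can be replayed in a deliberately simplified toy model (ours; it collapses PSC's per-location buffers into a single pending map, so it is only an illustration): fo marks a pending write for the next store fence, sfence persists all marked writes, and a crash drops whatever is still pending.

```python
class ToyNVM:
    def __init__(self):
        self.nvm = {"x": 0, "y": 0}  # persisted (non-volatile) contents
        self.pending = {}            # volatile writes, not yet persisted
        self.marked = set()          # locations guarded by fo()

    def write(self, loc, v): self.pending[loc] = v
    def fo(self, loc): self.marked.add(loc)

    def sfence(self):
        for loc in list(self.marked):  # persist every fo-guarded write
            if loc in self.pending:
                self.nvm[loc] = self.pending.pop(loc)
        self.marked.clear()

    def persist_some(self, locs):  # nondeterministic background persistence
        for loc in locs:
            if loc in self.pending:
                self.nvm[loc] = self.pending.pop(loc)

    def crash(self):
        self.pending.clear(); self.marked.clear()

def run(f_does_sfence, background):
    m = ToyNVM()
    m.write("x", 1); m.fo("x")
    if f_does_sfence: m.sfence()  # body of f in the specification
    m.write("y", 1)
    m.persist_some(background)    # writes that happen to persist before crash
    m.crash()
    return m.nvm

print(run(False, ["y"]))  # {'x': 0, 'y': 1}: y = 1 but x = 0 after a crash
print(run(True,  ["y"]))  # {'x': 1, 'y': 1}: the invariant y=1 => x=1 holds
```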

Generally speaking, the challenge stems from the fact that certain explicit persist instructions (sfence and other instructions whose implementation in the hardware contains an implicit store fence, such as RMWs in x86), which can be executed by the library, impose conditions on the persistence of writes performed by the client that ran earlier on the same processor.

We address this challenge in two ways. First, we can sidestep the problem by weakening the semantics of store fences, making them relative to a set of locations (those used by the library or those used by the client). To do so, we extend the programming language with a specification construct similar to a store fence, but only affecting a given set of locations, and we restrict its use by each component to mention only the locations it owns. The use of these localized instructions instead of store fences is sufficient to ensure that the interaction between client and library is fully captured in histories, and allows us to establish the expected abstraction theorem. Libraries that do not intend to provide store fence functionality to their clients can readily replace store fences with their localized counterparts. Doing so gives more freedom to alternative implementations of the same specification, which may, e.g., use alternative persist instructions without the store fence functionality (such as CLFLUSH in [23]).

On the other hand, it is possible that in performance-critical systems, clients would like to rely on a store fence that is executed anyway by the library for the library's own needs. For that, the library developer needs to use a standard store fence in the library's specification rather than the localized counterpart, and the abstraction theorem has to handle store fences with their standard, non-localized semantics. To do so, we expose in histories not only method invocations and responses, but also store fences. Roughly speaking, this means that in addition to the standard requirement on values passed by method invocations and responses, for L to refine L♯, we would also require that L performs a store fence whenever L♯ does (which does not hold for the example above). Our notion of history in §5 is set to allow store fences (alongside their weaker localized versions), and the abstraction theorem in §6 shows that these extended histories are expressive enough for defining the library-correctness condition.

#### 2.3 Handling Calling Policies

The third challenge we address concerns abstraction for libraries that enforce certain calling policies on their clients.<sup>2</sup> For instance, a library implementing a lock may require that the calls of each thread for acquiring and releasing the lock perfectly interleave, and a library implementing a single-producer queue may require that only one thread is calling the enqueue method. In the context of NVM, libraries often demand that a distinguished recovery method is called after every crash before invoking any other method of the library. When the client uses the library in a way that violates the calling policy, the library developer ensures nothing, and the blame is assigned to the client.

In the presence of calling policies, the contextual refinement guaranteed by the library abstraction theorem, stating that all behaviors of a program Pr[L] that uses L are also behaviors of the program Pr[L♯] that uses L♯, is only applicable for a program Pr that respects the calling policy. An interesting compositionality question arises: Are we allowed to assume the library's specification when checking whether a program adheres to the calling policy (that is, require that Pr[L♯] adheres to the policy), or should this obligation be satisfied for the library's implementation (that is, require that Pr[L] adheres to the policy)?

The latter option would limit the applicability of the abstraction theorem for client reasoning. Indeed, it may be the case that establishing that Pr [L] adheres to the policy depends on the implementation L, whereas the abstraction theorem should allow reasoning without knowing the implementation at all. On the other hand, the former option seems circular, as it uses contextual refnement to establish its own precondition.

In this paper we show that requiring that Pr[L♯] adheres to the policy is actually sufficient for ensuring contextual refinement. Roughly speaking, our proof avoids circular reasoning by inspecting a minimal contextual refinement violation, for which we are able to establish policy adherence when using L, given

<sup>2</sup> This challenge is not particular to NVM, but, interestingly, to the best of our knowledge, it has not been addressed in previous work establishing abstraction theorems.

policy adherence when using L♯. To the best of our knowledge, this is a novel argument in the context of library abstraction. It is akin to DRF (data-race freedom) guarantees in weak memory concurrency, where programs are often guaranteed to have strong semantics (usually, sequential consistency) provided that certain race-freedom conditions hold in all runs under the strong semantics.

We note that many libraries' calling policies are "structural", namely they only enforce certain ordering constraints on the clients that do not depend on the values returned by the library (in particular, "execute recovery first" is a structural policy). In these cases, policy adherence holds even for an over-approximation L_stub of L that returns arbitrary values. Certainly, however, this is not always the case. For example, a library L implementing standard list methods, cons and head, may require that head is only called on non-empty lists (like, e.g., pop_front in C++, which triggers undefined behavior if applied to an empty list [1]). Then, invoking head with the value returned from cons does adhere to the calling policy, but this is not the case for the over-approximated library L_stub, which allows cons to return the empty list.

### 3 NVM Programs: Syntax and Semantics

In this section we begin to present the formal settings for our results. As standard in memory models, it is convenient to break the operational semantics into two parts: a program semantics (a.k.a. thread subsystem) and a memory semantics. We represent both components as labeled transition systems whose transition labels correspond to the operations they perform. We then consider the synchronized runs of the program and the memory, where program actions that interact with the memory are matched by actions executed by the memory system (see §4.1).

Next, we focus on the program part of the semantics, presenting both syntax (§3.1) and semantics (§3.2). We use the following standard notations.

Notation for finite sequences. For a finite alphabet Σ, we denote by Σ* (respectively, Σ⁺) the set of all (non-empty) sequences over Σ. We use ϵ to denote the empty sequence. The length of a sequence s is denoted by |s|. We often identify sequences with their underlying functions (whose domain is {1, ..., |s|}), and write s(k) for the symbol at position 1 ≤ k ≤ |s| in s. We write σ ∈ s if σ appears in s, that is if s(k) = σ for some 1 ≤ k ≤ |s|. We use "·" for concatenating sequences, and identify symbols with sequences of length 1.

#### 3.1 Program Syntax

The domains and metavariables used to range over them are as follows:

values: v, u ∈ Val = {0, 1, 2, ...}
shared non-volatile variables: ẋ, ẏ ∈ NVVar = {ẋ, ẏ, ...}
shared volatile variables: x̃, ỹ ∈ VVar = {x̃, ỹ, ...}
shared variables: x, y ∈ Var = NVVar ∪ VVar
register names: r ∈ Reg = {a, b, ...}
thread identifiers: τ, π ∈ Tid = {T1, T2, ..., TN}
method names: f ∈ F (with main ∉ F)

Thus, there are three kinds of variables: shared non-volatile, shared volatile, and thread-local ones (called registers), which are also volatile. A distinguished name main is reserved for the starting point of the program execution.

For concreteness, we present a simple programming-language syntax. Its expressions and instructions are given by the following grammar:<sup>3</sup>

$$\begin{array}{rcl} e &::=& r \mid v \mid e + e \mid e = e \mid e \neq e \mid \dots \\ \mathit{inst} &::=& r := e \mid \mathtt{if}\ e\ \mathtt{goto}\ n_1 \parallel \dots \parallel n_m \mid \mathtt{havoc} \mid x := e \mid r := x \\ && \mid\ \mathtt{fl}(\dot{x}) \mid \mathtt{fo}(\dot{x}) \mid \mathtt{sfence} \mid \mathtt{call}(f) \mid \mathtt{return} \\ && \mid\ \mathtt{lsfence}(\dot{X}) \mid \mathtt{beginPB}(\dot{X}) \mid \mathtt{endPB}(\dot{X}) \end{array}$$

Expressions are constructed with arithmetic and boolean operations over registers and values. Instructions consist of a local assignment r := e; a conditional if e goto n₁ ∥ ... ∥ n_m for non-deterministically jumping to a program counter from {n₁, ..., n_m} when e evaluates to non-zero or, otherwise, skipping (goto n₁ ∥ ... ∥ n_m can be encoded as if 1 goto n₁ ∥ ... ∥ n_m); havoc for arbitrarily modifying all registers; a write to memory x := e; and a read from memory r := x. There are also explicit persist instructions: a flush instruction fl(ẋ) and its optimized version fo(ẋ), called flush-optimal (referred to as CLFLUSH and CLFLUSHOPT in [23]), as well as the store fence instruction sfence (see §2.2).

This standard instruction set is extended to support calling and specifying library methods. There is a call instruction call(f) and a return instruction return. The novel specification constructs include the local store fence instruction lsfence(Ẋ), which relaxes the semantics of sfence by only enforcing the persistence ordering for the given set Ẋ of variables (thus, lsfence(NVVar) is equivalent to sfence); and instructions to begin and end a persistence block, beginPB(Ẋ) and endPB(Ẋ), respectively. The persistence block demarcates the writes that need to persist simultaneously after the block ends, either non-deterministically or triggered by a flush on some variable in Ẋ.

Next, we employ three syntactic categories:


<sup>3</sup> In the extended version of this paper, we also include read-modify-write instructions.

#### 3.2 Program Semantics

We give semantics to the syntactic objects using labeled transition systems.

Definition 1. A labeled transition system (LTS) is a tuple A = ⟨Σ, Q, qInit, T⟩, where Σ is a set of transition labels, Q is a set of states, qInit ∈ Q is the initial state, and T ⊆ Q × Σ × Q is a set of transitions. We often write q −σ→ q′ to denote a transition ⟨q, σ, q′⟩. We denote by A.Σ, A.Q, A.qInit, and A.T the components of an LTS A. We write −σ→_A for the relation {⟨q, q′⟩ | q −σ→ q′ ∈ A.T} and −→_A for ⋃_{σ∈Σ} −σ→_A. For a sequence t ∈ A.Σ*, we write −t→_A for the composition −t(1)→_A ; ... ; −t(|t|)→_A. A sequence t ∈ A.Σ* such that A.qInit −t→_A q for some q ∈ A.Q is called a trace of A. We denote by traces(A) the set of all traces of A. A state q ∈ A.Q is called reachable in A if A.qInit −t→_A q for some t ∈ traces(A).
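
A small concrete rendering of this definition (our own Python encoding, purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class LTS:
    labels: set
    states: set
    init: object
    trans: set  # set of (q, label, q') transition triples

    def is_trace(self, t):
        """Is the label sequence t a trace of this LTS?"""
        frontier = {self.init}
        for lab in t:
            frontier = {q2 for (q1, l, q2) in self.trans
                        if l == lab and q1 in frontier}
            if not frontier:
                return False
        return True

# Example: a two-state system alternating labels "a" and "b".
ab = LTS({"a", "b"}, {0, 1}, 0, {(0, "a", 1), (1, "b", 0)})
print(ab.is_trace(["a", "b", "a"]))  # True
print(ab.is_trace(["b"]))            # False
```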

Next, we define the LTSs induced by instruction sequences, sequential programs, and concurrent programs. We will often identify the syntactic objects with the LTSs they induce (e.g., when writing expressions like S.Q for a sequential program S). The transition labels of these LTSs feature action labels.

Definition 2. An action label takes one of the following forms: a read R(x, v), a write W(x, v), a flush FL(ẋ), a flush-opt FO(ẋ), an sfence SF, a local sfence LSF(Ẋ), a start beginPB(Ẋ) or an end endPB(Ẋ) of a persistence block, a call CALL(f, φ), or a return RET(f, φ), where x ∈ Var, v ∈ Val, ẋ ∈ NVVar, Ẋ ⊆ NVVar, f ∈ F, and φ : Reg → Val. We denote by Lab the set of all action labels. The functions typ and var retrieve (when applicable) the type (R/W/…) and variable (x or ẋ) of an action label. We write varset(l) for the set of variables mentioned in l (e.g., varset(R(x, v)) = {x}, varset(LSF(Ẋ)) = Ẋ, and varset(SF) = ∅).

Action labels represent the interactions that a program has with the memory.

Definition 3. The LTS induced by an instruction sequence I is given as follows. Its states are pairs ⟨pc, φ⟩ of a program counter pc ∈ N and a register store φ : Reg → Val, and its transitions are:


$$\begin{array}{c}
\dfrac{I(pc) = r := e \qquad \phi' = \phi[r \mapsto \phi(e)]}{\langle pc, \phi\rangle \xrightarrow{\epsilon}_{I} \langle pc+1, \phi'\rangle}
\qquad
\dfrac{\begin{array}{c}I(pc) = \mathsf{if}\ e\ \mathsf{goto}\ n_1 \mid \dots \mid n_m \\ \phi(e) \neq 0 \implies pc' \in \{n_1, \dots, n_m\} \qquad \phi(e) = 0 \implies pc' = pc+1\end{array}}{\langle pc, \phi\rangle \xrightarrow{\epsilon}_{I} \langle pc', \phi\rangle}
\qquad
\dfrac{I(pc) = \mathsf{havoc}}{\langle pc, \phi\rangle \xrightarrow{\epsilon}_{I} \langle pc+1, \phi'\rangle}
\\[12pt]
\dfrac{I(pc) = x := e \qquad l = \mathsf{W}(x, \phi(e))}{\langle pc, \phi\rangle \xrightarrow{l}_{I} \langle pc+1, \phi\rangle}
\qquad
\dfrac{I(pc) = r := x \qquad l = \mathsf{R}(x, v) \qquad \phi' = \phi[r \mapsto v]}{\langle pc, \phi\rangle \xrightarrow{l}_{I} \langle pc+1, \phi'\rangle}
\\[12pt]
\dfrac{I(pc) \in \{\mathsf{fl}(\_), \mathsf{fo}(\_), \mathsf{sfence}, \mathsf{lsfence}(\_), \mathsf{beginPB}(\_), \mathsf{endPB}(\_)\} \qquad l = \mathit{matching\_label}(I(pc))}{\langle pc, \phi\rangle \xrightarrow{l}_{I} \langle pc+1, \phi\rangle}
\end{array}$$

Recall that program semantics is separate from memory semantics, which is why the transitions above completely ignore the restrictions arising from the memory system. In particular, the write to memory x := e only announces itself in the label. The read from memory r := x loads an arbitrary value v into the destination register r, announcing that value in the read label. Other instructions act as no-ops, and simply announce themselves in the transition label, using the function matching_label that maps each instruction to its label (fl(ẋ) ↦ FL(ẋ), fo(ẋ) ↦ FO(ẋ), and so on).
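
The rules of Def. 3 can also be read as a non-deterministic step function. The Python sketch below is our own rendering (the instruction encoding, the finite value set, and the name step are assumptions made for illustration; havoc is elided); it enumerates the successors of a state ⟨pc, φ⟩ together with the announced label, with expressions encoded as functions of the register store:

```
def step(I, pc, phi, values):
    """Yield (label, pc', phi') triples for instruction sequence I at
    state <pc, phi>; label None stands for the silent label epsilon."""
    instr = I[pc]
    kind = instr[0]
    if kind == "assign":                  # r := e
        _, r, e = instr
        yield (None, pc + 1, {**phi, r: e(phi)})
    elif kind == "if-goto":               # if e goto n1 | ... | nm
        _, e, targets = instr
        if e(phi) != 0:
            for n in targets:             # non-deterministic jump
                yield (None, n, phi)
        else:
            yield (None, pc + 1, phi)
    elif kind == "write":                 # x := e: only announced in the label
        _, x, e = instr
        yield (("W", x, e(phi)), pc + 1, phi)
    elif kind == "read":                  # r := x: loads an arbitrary value
        _, r, x = instr
        for v in values:
            yield (("R", x, v), pc + 1, {**phi, r: v})
    else:                                 # fl, fo, sfence, ...: no-ops that
        yield (instr, pc + 1, phi)        # announce themselves as labels
```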

Finally, call(f) and return instructions are not handled at this level, but receive special semantics at the level of sequential programs, as defined next.

Definition 4. The LTS induced by a sequential program S is given as follows. Its states are tuples ⟨pc, φ, pc_s, f⟩, where:

	- ⟨pc, φ⟩ is a state of the instruction sequence (see Def. 3) storing the state of the sequence currently running.
	- pc_s ∈ N ∪ {⊥}, called the stored program counter, is used to remember the program position to jump to when the current instruction sequence returns, where pc_s = ⊥ means that the main method is currently running. (Recall that we assume that S(f) is flat for every f ∈ F, so we do not need to record the call stack.)
	- f ∈ F ∪ {main}, called the active method, tracks the method that is currently running.
	- We denote by q.pc, q.φ, q.pc_s, and q.f the components of a state q ∈ S.Q.

$$\begin{array}{cc}
\textsc{normal}\ \dfrac{l_\epsilon \in \mathsf{Lab} \cup \{\epsilon\} \qquad f \in \{\mathsf{main}\} \cup \mathsf{F} \qquad \langle pc, \phi\rangle \xrightarrow{l_\epsilon}_{S(f)} \langle pc', \phi'\rangle}{\langle pc, \phi, pc_{\mathtt{s}}, f\rangle \xrightarrow{l_\epsilon}_{S} \langle pc', \phi', pc_{\mathtt{s}}, f\rangle}
&
\textsc{call}\ \dfrac{S(\mathsf{main})(pc) = \mathsf{call}(f) \qquad l = \mathsf{CALL}(f, \phi)}{\langle pc, \phi, \bot, \mathsf{main}\rangle \xrightarrow{l}_{S} \langle 0, \phi, pc+1, f\rangle}
\\[14pt]
\textsc{return}\ \dfrac{S(f)(pc) = \mathsf{return} \qquad l = \mathsf{RET}(f, \phi)}{\langle pc, \phi, pc_{\mathtt{s}}, f\rangle \xrightarrow{l}_{S} \langle pc_{\mathtt{s}}, \phi, \bot, \mathsf{main}\rangle}
&
\textsc{non-det-sfence}\ \dfrac{l = \mathsf{SF}}{\langle pc, \phi, pc_{\mathtt{s}}, f\rangle \xrightarrow{l}_{S} \langle pc, \phi, pc_{\mathtt{s}}, f\rangle}
\end{array}$$

The normal transition lifts the instruction-sequence transition to the level of sequential programs. Note that the transition applies for any method (main or other). The call transition passes control from the main method to some other method, jumping the program counter to the first instruction and storing the return point (pc + 1). The return transition passes control back using the stored return point. For simplicity, we do not have any argument-passing mechanism and use the full register store for that purpose. (If needed, each component may store the values it needs in the memory, and reload them later on.)

Finally, non-det-sfence is a non-standard transition that we find technically convenient to have. It allows the program to non-deterministically execute an sfence at any point. Since, as will become apparent when presenting the memory system, sfences only restrict the possible behaviors, this transition is safe to include in the program semantics. It is particularly useful for simplifying the library correctness condition, which only considers inclusion of sets of histories (see §5). For instance, switching the roles of L and L♯ from §2.2, the library implementing f using sfence should be considered a refinement of the one that simply returns. For that, we allow the no-op specification to perform non-deterministic sfences that match the ones executed by the concrete implementation.

Lastly, the LTS induced by a concurrent program is defined as follows.

Definition 5. The LTS induced by a (concurrent) program Pr is given as follows. Its states are maps q̄ assigning a state of Pr(τ) to every thread identifier τ ∈ Tid, and its transitions are:


$$\begin{array}{cc}
\textsc{normal}\ \dfrac{l_\epsilon \in \mathsf{Lab} \cup \{\epsilon\} \qquad \overline{q}(\tau) \xrightarrow{l_\epsilon}_{Pr(\tau)} q'}{\overline{q} \xrightarrow{\tau, l_\epsilon}_{Pr} \overline{q}[\tau \mapsto q']}
\qquad
\textsc{crash}\ \dfrac{}{\overline{q} \xrightarrow{\sharp}_{Pr} \overline{q}_{\mathsf{Init}}}
\end{array}$$

### 4 The PSC Memory System

We present PSC ("Persistent Sequential Consistency"), the persistency model used as the memory system. We first introduce the model as it is in [25] (extended with standard volatile memory alongside the non-volatile one), following its operational presentation as an LTS with non-deterministic memory-internal transitions that flush stores from the volatile part to the non-volatile part. In §4.1, we define the synchronization of programs with the PSC memory system. In §4.2, we present the extensions added in this paper that are useful for library abstraction. Finally, in §4.3, we establish certain separation properties of PSC that are essential in our proofs.

Roughly speaking, a state in PSC consists of a non-volatile memory (mapping non-volatile variables to values) and a volatile memory (mapping volatile variables to values). The volatile memory works just as a normal sequentially consistent memory, keeping track of the latest written value for every variable and returning that value for reads. Upon a crash, the contents of the volatile memory are reset to their initial state. The non-volatile memory behaves observationally the same between crashes, but its contents survive crashes. To model delayed and out-of-order persistence of writes, write steps to non-volatile variables do not alter the non-volatile memory immediately when issued. Instead, writes first go to volatile per-variable persistence FIFO buffers, which maintain the writes to each variable that are yet to persist. Then, PSC non-deterministically takes persist steps that apply the oldest update from a persistence buffer to the non-volatile memory. Reads from non-volatile variables retrieve the latest value in the relevant buffer, or the value from the non-volatile memory if that buffer is empty, thus providing standard sequentially consistent semantics in the absence of system crashes. Upon a crash, the buffers are reset to their initial (empty) state, but the contents of the non-volatile memory remain intact.

Explicit persist instructions can be used to control the persistence of writes. A "flush" barrier for a certain variable blocks the execution until the relevant persistence buffer is empty, thus forcing all previous writes to that variable to persist. Alternatively, a (cheaper) "flush-optimal" barrier for a certain variable enqueues a special marker in the persistence buffer of this variable, accompanied by the thread identifier of the thread that issued the barrier. The effect of flush-optimal is delayed until the same thread performs an sfence, which blocks the execution until all flush-optimal markers of that thread are dequeued from all buffers. The fact that the persistence buffers are FIFO ensures that an sfence by some thread forces the persistence of all writes executed before a flush-optimal issued by the same thread.
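
To make the buffer discipline concrete, here is a small Python model (our illustration only; class and method names are invented, and thread interleaving, volatile memory, and persistence blocks are omitted):

```
from collections import deque

class PSCBuffers:
    """Toy model of PSC's per-variable persistence FIFO buffers."""
    def __init__(self, nv_vars):
        self.nvm = {x: 0 for x in nv_vars}        # non-volatile memory
        self.buf = {x: deque() for x in nv_vars}  # pending entries per variable

    def write(self, x, v):
        self.buf[x].append(("W", v))              # writes do not hit NVM yet

    def read(self, x):
        for kind, v in reversed(self.buf[x]):     # latest buffered write wins
            if kind == "W":
                return v
        return self.nvm[x]                        # buffer empty: read NVM

    def persist(self, x):
        """Internal per step: apply the oldest buffered entry of x."""
        if self.buf[x]:
            kind, v = self.buf[x].popleft()
            if kind == "W":
                self.nvm[x] = v                   # FO markers just disappear

    def flush_blocked(self, x):
        """fl(x) may proceed only once x's buffer is empty."""
        return bool(self.buf[x])

    def fo(self, x, tid):
        self.buf[x].append(("FO", tid))           # delayed flush marker

    def sfence_blocked(self, tid):
        """sfence by tid may proceed only when no FO(tid) marker remains."""
        return any(("FO", tid) in b for b in self.buf.values())

    def crash(self):
        for b in self.buf.values():               # buffers reset; NVM survives
            b.clear()
```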

Definition 6. PSC is the LTS defined as follows. Its states are tuples M = ⟨ṁ, m̃, P⟩, where:

	- ṁ : NVVar → Val is called the non-volatile memory.
	- m̃ : VVar → Val is called the volatile memory.
	- P : NVVar → PLBuf is called the persistence buffer. Here, PLBuf denotes the set of all per-location persistence buffers, each of which is a finite sequence p of entries of the form W(v) for v ∈ Val (writes), or FO(τ) for τ ∈ Tid (flush-optimal markers). The persistence buffer P assigns a per-location persistence buffer to every non-volatile variable.<sup>4</sup>

We denote by M.ṁ, M.m̃, and M.P the components of a state M ∈ PSC.Q, and write M[X ↦ Y] for the state obtained from M by setting M.X to Y.


The transitions follow the intuitive account above. Those corresponding to program transitions are labeled with pairs in Tid × Lab. For instance, a transition labeled with ⟨τ, R(x, v)⟩ means that thread τ reads the value v from the (volatile or non-volatile) shared variable x.

<sup>4</sup> We conservatively assume that writes persist at the location granularity, rather than at the cache-line granularity as happens in real machines.

$$\begin{array}{c}
\textsc{v-write}\ \dfrac{l = \mathsf{W}(\tilde{x}, v) \qquad \tilde{m}' = M.\tilde{\mathtt{m}}[\tilde{x} \mapsto v]}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\tilde{\mathtt{m}} \mapsto \tilde{m}']}
\qquad
\textsc{nv-write}\ \dfrac{l = \mathsf{W}(\dot{x}, v) \qquad p' = M.\mathtt{P}(\dot{x}) \cdot \mathsf{W}(v) \qquad P' = M.\mathtt{P}[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\mathtt{P} \mapsto P']}
\\[12pt]
\textsc{v-read}\ \dfrac{l = \mathsf{R}(\tilde{x}, v) \qquad M.\tilde{\mathtt{m}}(\tilde{x}) = v}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M}
\qquad
\textsc{nv-read}\ \dfrac{l = \mathsf{R}(\dot{x}, v) \qquad v \text{ is the value of the last write in } M.\mathtt{P}(\dot{x}), \text{ or } M.\dot{\mathtt{m}}(\dot{x}) \text{ if there is none}}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M}
\\[12pt]
\textsc{flush}\ \dfrac{l = \mathsf{FL}(\dot{x}) \qquad M.\mathtt{P}(\dot{x}) = \epsilon}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M}
\qquad
\textsc{flush-opt}\ \dfrac{l = \mathsf{FO}(\dot{x}) \qquad p' = M.\mathtt{P}(\dot{x}) \cdot \mathsf{FO}(\tau) \qquad P' = M.\mathtt{P}[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\mathtt{P} \mapsto P']}
\qquad
\textsc{sfence}\ \dfrac{l = \mathsf{SF} \qquad \forall \dot{x}.\ \mathsf{FO}(\tau) \notin M.\mathtt{P}(\dot{x})}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M}
\\[12pt]
\textsc{persist-write}\ \dfrac{M.\mathtt{P}(\dot{x}) = \mathsf{W}(v) \cdot p' \qquad \dot{m}' = M.\dot{\mathtt{m}}[\dot{x} \mapsto v]}{M \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M[\dot{\mathtt{m}} \mapsto \dot{m}', \mathtt{P} \mapsto M.\mathtt{P}[\dot{x} \mapsto p']]}
\qquad
\textsc{persist-fo}\ \dfrac{M.\mathtt{P}(\dot{x}) = \mathsf{FO}(\tau) \cdot p'}{M \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M[\mathtt{P} \mapsto M.\mathtt{P}[\dot{x} \mapsto p']]}
\qquad
\textsc{crash}\ \dfrac{}{M \xrightarrow{\sharp}_{\mathrm{PSC}} M[\tilde{\mathtt{m}} \mapsto \tilde{m}_{\mathsf{Init}}, \mathtt{P} \mapsto \lambda \dot{x}.\, \epsilon]}
\end{array}$$


Fig. 1. Transitions of PSC

#### 4.1 Linking Programs and Memories

To give semantics to programs running under PSC, the thread system is synchronized with the PSC memory system. Formally, the synchronization of a program Pr with PSC is another LTS, denoted by Pr ⋈ PSC, defined as follows:


$$\begin{array}{ccc}
\textsc{synchronized} & \textsc{program-internal} & \textsc{memory-internal}\\[2pt]
\dfrac{\alpha \in (\mathsf{Tid} \times \mathsf{Lab}) \cup \{\sharp\} \qquad \overline{q} \xrightarrow{\alpha}_{Pr} \overline{q}' \qquad M \xrightarrow{\alpha}_{\mathrm{PSC}} M'}{\langle\overline{q}, M\rangle \xrightarrow{\alpha}_{Pr \bowtie \mathrm{PSC}} \langle\overline{q}', M'\rangle}
&
\dfrac{\alpha \in \mathsf{Tid} \times \{\epsilon\} \qquad \overline{q} \xrightarrow{\alpha}_{Pr} \overline{q}'}{\langle\overline{q}, M\rangle \xrightarrow{\alpha}_{Pr \bowtie \mathrm{PSC}} \langle\overline{q}', M\rangle}
&
\dfrac{\alpha = \mathtt{per} \qquad M \xrightarrow{\alpha}_{\mathrm{PSC}} M'}{\langle\overline{q}, M\rangle \xrightarrow{\alpha}_{Pr \bowtie \mathrm{PSC}} \langle\overline{q}, M'\rangle}
\end{array}$$

The above transitions are "synchronized transitions" of Pr and PSC, using the labels to decide what to synchronize on. Both the program and the memory take the same step for transition labels that are common to both LTSs, only the program steps for transition labels that are only program transitions, and only the memory steps for transition labels that are only memory transitions.
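
Operationally, this synchronized product can be phrased as a single step function; the sketch below is our paraphrase (the function name and label encodings are assumptions, not the paper's):

```
def sync_steps(prog_steps, mem_steps, shared, internal_prog, q, M, alpha):
    """One step of the synchronized system Pr |><| PSC (our sketch).
    prog_steps/mem_steps map (state, label) to sets of successor states.
    shared: labels both components synchronize on ((tid, action) pairs
    and the crash label); internal_prog: program-internal labels
    ((tid, eps)); anything else (the per label) is memory-internal."""
    if alpha in shared:                    # synchronized step
        return {(q2, M2) for q2 in prog_steps(q, alpha)
                         for M2 in mem_steps(M, alpha)}
    if alpha in internal_prog:             # program steps alone
        return {(q2, M) for q2 in prog_steps(q, alpha)}
    return {(q, M2) for M2 in mem_steps(M, alpha)}   # memory steps alone
```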

#### 4.2 Extending PSC for Library Abstraction

We present the modifications of PSC for supporting the new specification constructs: localized sfences and persistence blocks. When referring to PSC in the sequel, we mean the following revised version.

Local store fences. Localized sfences are straightforwardly supported by the following additional memory transition:

$$\textsc{local-sfence}\ \dfrac{l = \mathsf{LSF}(\dot{X}) \qquad \forall \dot{x} \in \dot{X}.\ \mathsf{FO}(\tau) \notin M.\mathtt{P}(\dot{x})}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M}$$

Here, instead of blocking until all FO(τ) entries are removed from all buffers, we only require that such entries are not present in the buffers associated with variables from a certain set (mentioned in the action label and corresponding to the argument of the lsfence(Ẋ) instruction).

Persistence blocks. We assume an infinite set BlockID of block identifiers that are non-deterministically allocated when blocks are opened. The state of the memory system keeps track of a mapping assigning the current open block identifier to every thread and non-volatile variable, or ⊥ if the variable is not a part of an open block of the thread. When writing to non-volatile variables, the associated block identifiers are attached to the write entries in the per-location persistence buffers. In turn, the propagation from the buffers to the NVM ensures that blocks are propagated only after they are no longer open, and only in their entirety. To do so, we generalize the persist step of PSC to allow simultaneous propagation of multiple entries from the buffers. To respect the per-variable FIFO order, the propagated entries should form a prefix of each buffer.

Formally, this requires the following modifications:

	- B : Tid → NVVar → (BlockID ∪ {⊥}) is called the active-block mapping. It assigns a block identifier (or ⊥ if there is no active block) to every thread identifier and non-volatile variable.
	- Bid ⊆ BlockID × P(NVVar) is called the block identifiers set. It is used to store all persistence block identifiers occurring so far, each accompanied by the set of non-volatile variables that it protects.

We denote by M.B and M.Bid the additional components of a state M. We impose the following well-formedness conditions:


$$\textsc{nv-write}\ \dfrac{l = \mathsf{W}(\dot{x}, v) \qquad p' = M.\mathtt{P}(\dot{x}) \cdot M.\mathtt{B}(\tau)(\dot{x}){:}\mathsf{W}(v) \qquad P' = M.\mathtt{P}[\dot{x} \mapsto p']}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\mathtt{P} \mapsto P']}$$

5. The following two transitions, for opening and closing blocks, are added:

$$\begin{array}{cc}
\textsc{beginPB}\ \dfrac{\begin{array}{c} l = \mathsf{beginPB}(\dot{X}) \qquad \forall \dot{x} \in \dot{X}.\ M.\mathtt{B}(\tau)(\dot{x}) = \bot \qquad \langle j, \_\rangle \notin M.\mathtt{Bid} \\ B' = M.\mathtt{B}[\tau \mapsto \lambda \dot{x}.\ \mathsf{if}\ \dot{x} \in \dot{X}\ \mathsf{then}\ j\ \mathsf{else}\ M.\mathtt{B}(\tau)(\dot{x})] \end{array}}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\mathtt{B} \mapsto B', \mathtt{Bid} \mapsto M.\mathtt{Bid} \cup \{\langle j, \dot{X}\rangle\}]}
&
\textsc{endPB}\ \dfrac{\begin{array}{c} l = \mathsf{endPB}(\dot{X}) \\ B' = M.\mathtt{B}[\tau \mapsto \lambda \dot{x}.\ \mathsf{if}\ \dot{x} \in \dot{X}\ \mathsf{then}\ \bot\ \mathsf{else}\ M.\mathtt{B}(\tau)(\dot{x})] \end{array}}{M \xrightarrow{\tau, l}_{\mathrm{PSC}} M[\mathtt{B} \mapsto B']}
\end{array}$$

Thus, opening a block allocates a fresh identifier and sets the active-block mapping accordingly. In turn, closing a block resets the relevant variables in the active-block mapping.

6. The following transition is used instead of persist-write and persist-fo. It generalizes both by simultaneously persisting several entries together (each p_ẋ below stands for a sequence of entries).

$$\textsc{persist}\ \dfrac{\begin{array}{c}
l = \mathtt{per} \qquad \forall \dot{x}.\ M.\mathtt{P}(\dot{x}) = p_{\dot{x}} \cdot P'(\dot{x}) \\
\forall j.\ (\exists \dot{x}.\ j{:}\mathsf{W}(\_) \in p_{\dot{x}}) \implies \forall \dot{x}.\ ((\forall \tau.\ M.\mathtt{B}(\tau)(\dot{x}) \neq j) \land j{:}\mathsf{W}(\_) \notin P'(\dot{x})) \\
\dot{m}' = \lambda \dot{x}.\ \begin{cases} v & \text{the last write entry in } p_{\dot{x}} \text{ has value } v \\ M.\dot{\mathtt{m}}(\dot{x}) & \text{there are no write entries in } p_{\dot{x}} \end{cases}
\end{array}}{M \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M[\dot{\mathtt{m}} \mapsto \dot{m}', \mathtt{P} \mapsto P']}$$

This step imposes two restrictions. First, the persisted entries from each buffer (p_ẋ) should form a prefix of that buffer, so that FIFO semantics is maintained. Second, to respect the persistence blocks, if some entry of a given block is persisted (∃ẋ. j:W(_) ∈ p_ẋ), then that block should not be currently active for any thread (∀ẋ, τ. M.B(τ)(ẋ) ≠ j) and no entries of that block should remain in the volatile buffers (∀ẋ. j:W(_) ∉ P′(ẋ)).
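
The two conditions can also be read off the following Python sketch of the generalized persist step (ours; the entry encodings and names are assumptions, and prefixes is expected to mention every buffered variable, possibly with length 0):

```
from collections import deque

def persist_prefixes(nvm, buf, open_blocks, prefixes):
    """Persist a chosen prefix of each per-variable buffer. Entries are
    ("W", v) for plain writes, ("FO", tid) for markers, and ("BW", j, v)
    for writes inside persistence block j; open_blocks is the set of
    block ids currently open by some thread. Returns False (doing
    nothing) if the chosen prefixes violate the block conditions."""
    taken = {x: list(buf[x])[:n] for x, n in prefixes.items()}
    rest = {x: list(buf[x])[n:] for x, n in prefixes.items()}
    persisted = {e[1] for es in taken.values() for e in es if e[0] == "BW"}
    for j in persisted:
        if j in open_blocks:
            return False        # block j is still open
        if any(e[0] == "BW" and e[1] == j
               for es in rest.values() for e in es):
            return False        # block j would persist only partially
    for x in prefixes:
        for e in taken[x]:      # apply writes in FIFO order: last one wins
            if e[0] in ("W", "BW"):
                nvm[x] = e[-1]
        buf[x] = deque(rest[x])
    return True
```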

We note that nested and interleaved blocks are allowed; the following program demonstrates such a case: beginPB(ẋ, ẏ); ẋ := 1; beginPB(ż, ẇ); ż := 1; ẇ := 1; endPB(ż, ẇ); ẏ := 1; endPB(ẋ, ẏ). Here, ẋ = 1 and ẏ = 1 must persist together; ż = 1 and ẇ = 1 must persist together; but these two pairs can persist independently of each other, in any order. Thus, provided that the client and the library use blocks over their own locations, the block instructions of each component are invisible to the other.

#### 4.3 Separation Properties

To enable our library abstraction proof, the key property required of PSC, which we preserved in its extensions, is the ability to separate PSC states into disjoint parts (the library's part and the client's part) and to capture each memory transition in terms of its effect on the two parts. Next, we formulate this property, which we will later use to prove library abstraction. In fact, our arguments for library abstraction rely only on the properties below, and never "unfold" the PSC-related definitions. This allows one to refine and extend PSC, as long as the separation properties are preserved.

The separation of PSC states is stated in terms of the following restriction operator relative to a set of variables. For persistence blocks to behave correctly, we need an auxiliary condition on this set: we say that a set Ẋ ⊆ NVVar separates a state M ∈ PSC.Q if for every ⟨j, Ẏ⟩ ∈ M.Bid, we have Ẏ ⊆ Ẋ or Ẏ ⊆ NVVar \ Ẋ.

Definition 7. The restriction of M ∈ PSC.Q onto a set X ⊆ Var such that X ∩ NVVar separates M, denoted by M|_X, is the state M′ ∈ PSC.Q given by:


The next lemma states the separation property of PSC, providing a precise characterization of each PSC transition in terms of transitions on the restrictions M|_X and M|_{Var\X}. A special case is needed for store fence transitions, since taking these transitions enforces conditions on both restrictions.

Lemma 1. Let X ⊆ Var be such that X ∩ NVVar separates a state M₁.

1. For every τ ∈ Tid and l ∈ Lab \ {SF} with varset(l) ⊆ X:
$$M_1 \xrightarrow{\tau, l}_{\mathrm{PSC}} M_2 \iff (M_1|_X \xrightarrow{\tau, l}_{\mathrm{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} = M_2|_{\mathsf{Var} \setminus X})$$
2. For every τ ∈ Tid:
$$M_1 \xrightarrow{\tau, \mathsf{SF}}_{\mathrm{PSC}} M_2 \iff (M_1|_X \xrightarrow{\tau, \mathsf{SF}}_{\mathrm{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\tau, \mathsf{SF}}_{\mathrm{PSC}} M_2|_{\mathsf{Var} \setminus X})$$
3. $$M_1 \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M_2 \iff (M_1|_X \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\mathtt{per}}_{\mathrm{PSC}} M_2|_{\mathsf{Var} \setminus X})$$
4. $$M_1 \xrightarrow{\sharp}_{\mathrm{PSC}} M_2 \iff (M_1|_X \xrightarrow{\sharp}_{\mathrm{PSC}} M_2|_X \;\land\; M_1|_{\mathsf{Var} \setminus X} \xrightarrow{\sharp}_{\mathrm{PSC}} M_2|_{\mathsf{Var} \setminus X})$$

The proof of Lemma 1 proceeds by standard case analysis ranging over all possible transitions of PSC. Finally, the following operation is used below to compose a state from a client component and a library component (see Lemma 2).

Definition 8. Let M₁, M₂ be states of PSC, and X₁, X₂ ⊆ Var such that X₁ ∩ X₂ = ∅. The merge of M₁ and M₂ w.r.t. X₁ and X₂, denoted by ⟨M₁, X₁⟩ ⊎ ⟨M₂, X₂⟩, is the state M ∈ PSC.Q defined by:

$$M.\dot{\mathtt{m}}(\dot{x}) = \begin{cases} M_1.\dot{\mathtt{m}}(\dot{x}) & \dot{x} \in X_1 \\ M_2.\dot{\mathtt{m}}(\dot{x}) & \dot{x} \in X_2 \\ 0 & \text{otherwise} \end{cases} \quad \text{(with similar definitions for } M.\tilde{\mathtt{m}},\ M.\mathtt{P},\ M.\mathtt{B}\text{)}$$
$$M.\mathtt{Bid} = \{\langle j, \dot{Y}\rangle \in M_1.\mathtt{Bid} \mid \dot{Y} \subseteq X_1\} \cup \{\langle j, \dot{Y}\rangle \in M_2.\mathtt{Bid} \mid \dot{Y} \subseteq X_2\}$$

### 5 Libraries and Their Clients

We present the notions of libraries and clients, as well as the definitions needed for stating the abstraction theorem: histories and most general clients.

Libraries. We take a library L to be a function assigning to method names in dom(L) ⊆ F flat instruction sequences representing the method bodies. In the context of some library L, we refer to the implementations of the methods in {main} ∪ F \ dom(L) in a program Pr as the client of L.

Client-library composition. We consider the common case where libraries and their clients never access the same shared variables. To formally define this restriction, we use the following notations for the sets of locations used by instruction sequences, libraries, and their clients:


$$\mathrm{Var}(Pr \setminus F) \stackrel{\text{def}}{=} \bigcup\nolimits_{\tau \in \mathsf{Tid}} \mathrm{Var}(Pr(\tau)(\mathsf{main})) \cup \bigcup\nolimits_{\tau \in \mathsf{Tid},\, f \in \mathsf{F} \setminus F} \mathrm{Var}(Pr(\tau)(f))$$
Then, client-library composition is defined as follows.

Definition 9. A library L is safe for a program Pr if Var(L) ∩ Var(Pr \ dom(L)) = ∅. When L is safe for Pr, we write Pr[L] for the program obtained from Pr by setting Pr(τ)(f) = L(f) for every τ ∈ Tid and f ∈ dom(L).

Note that we always have Var(Pr [L] \ dom(L)) = Var(Pr \ dom(L)).

Histories. Histories record the interactions between libraries and clients. Formally, a history h of a library L is a sequence of transition labels representing a crash, a call to a method of L, a return from a method of L, or an sfence, i.e., labels from the set HTLab_{dom(L)}, which is defined as follows:

$$\begin{aligned} \mathsf{Lab}\_{F} & \stackrel{\text{def}}{=} \{ \mathsf{SF} \} \cup \{ \mathsf{CALL}(f, \phi), \mathsf{RET}(f, \phi) \mid f \in F, \phi : \mathsf{Reg} \to \mathsf{Val} \} \\ \mathsf{HT} \mathsf{Lab}\_{F} & \stackrel{\text{def}}{=} (\mathsf{Tid} \times \mathsf{Lab}\_{F}) \cup \{ \sharp \} \end{aligned}$$

Definition 10. Let t be a trace of Pr ⋈ PSC for some program Pr. The history induced by t w.r.t. a set F ⊆ F, denoted by H_F(t), is the subsequence of t over HTLab_F consisting of (in the same order in which they appear in t): call and return labels ⟨τ, CALL(f, φ)⟩ and ⟨τ, RET(f, φ)⟩ with f ∈ F; SF-labels ⟨τ, SF⟩; and crash labels. The notation H_F(t) is extended to sets of traces in the obvious way. The set of histories w.r.t. F of Pr, denoted by H_F(Pr), is given by H_F(traces(Pr ⋈ PSC)). When F = F (i.e., the set of all method names), we simply write H(t) and H(Pr).
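
Def. 10 is a plain filter over traces; in code (our sketch, with an assumed label encoding):

```
def history(t, F):
    """Project trace t onto history labels w.r.t. method set F (sketch).
    Trace elements are "#" for a crash, or (tid, label) pairs with label
    of the form ("CALL", f, phi), ("RET", f, phi), ("SF",), etc."""
    h = []
    for a in t:
        if a == "#":                                    # crash label
            h.append(a)
        else:
            tid, lab = a
            if lab[0] == "SF" or (lab[0] in ("CALL", "RET") and lab[1] in F):
                h.append(a)
    return h
```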

Most general clients. We capture library calling policies (see §2.3) using the notion of a "most general client": a non-deterministic client that invokes the library methods in the most general way allowed by the policy. Formally, a most general client MGC is given as a (concurrent) program. Adherence to the calling policy is defined as follows.

Definition 11. Let L be a library, and let Pr and MGC be programs such that L is safe for both Pr and MGC. We say that Pr correctly calls L w.r.t. MGC if H_{dom(L)}(Pr[L]) ⊆ H_{dom(L)}(MGC[L]).

The policy of a library with no restrictions on its clients (beyond the separation of shared resources) is expressed by an MGC, called MGC_free, that repeatedly invokes arbitrary library methods with arbitrary initial stores. Often, persistent objects include a recovery method meant to be executed after a crash, before any other method is invoked. We call such a policy MGC_rec. Formally, MGC_free (for dom(L) = {f₁, …, fₙ}) and MGC_rec (for dom(L) = {f₁, …, fₙ} ⊎ {recover}) assign the following main method to each thread τ:

```
MGC_free(τ)(main) =
  BEGIN : havoc;
          goto f1 | ... | fn | END;
  f1    : call(f1); goto BEGIN;
  ...
  fn    : call(fn); goto BEGIN;
  END   :

MGC_rec(τ)(main) =
          a := CAS(x̃, 0, 1); if a = 0 goto REC; goto WAIT;
  REC   : call(recover); ỹ := 1; goto BEGIN;
  WAIT  : a := ỹ; if a = 0 goto WAIT; goto BEGIN;
  BEGIN : ... rest of the code as in MGC_free ...
```
In MGC_rec, using a compare-and-swap, one thread performs the recovery. All other threads wait until the recovery ends to start their method invocations.

### 6 The Library Abstraction Theorem

In this section we state and prove the library abstraction theorem. The premise of this theorem, the library correctness condition, is formulated as follows.

Definition 12. Let L and L♯ be libraries, both safe for a program MGC. We say that L refines L♯ w.r.t. MGC, denoted by L ⊑_MGC L♯, if both libraries implement the same methods and H(MGC[L]) ⊆ H(MGC[L♯]).

Next, the abstraction theorem states that L ⊑_MGC L♯ ensures that any client adhering to the library's calling policy may safely use the implementation L while reasoning about possible behaviors in terms of the specification L♯. Our notion of "a behavior" includes the generated histories, as well as the states reachable by the composition of the program and the memory system. Including reachable states is intended to assist safety verification. Clearly, we cannot require that the program states match for threads that are currently executing a method of L. In addition, since L and L♯ may update the memory differently (e.g., use different variables), we should only consider the variables of the client when inspecting the memory states. This leads us to the following statement.

Theorem 1 (Abstraction). Suppose that L ⊑_MGC L♯. Let MGC and Pr be programs such that both L and L♯ are safe for MGC and Pr, and Pr correctly calls L♯ w.r.t. MGC. If $\langle q_{\mathsf{Init}}, M_{\mathsf{Init}}\rangle \xrightarrow{t}_{Pr[L] \bowtie \mathrm{PSC}} \langle q, M\rangle$, then there exist t♯ and ⟨q♯, M♯⟩ such that the following hold:


Note that L ⊑_MGC L♯ is necessary for the conclusion to hold: otherwise, MGC itself is a client that can observe behaviors of L that are impossible for L♯. Following §2.3, we also note that policy adherence is required w.r.t. L♯.

To prove the abstraction theorem, the following key lemma is used multiple times (with different arguments). It allows us to compose the client's part from one trace with the library's part from another into one combined trace.

Lemma 2 (Composition). Let L and L′ be libraries implementing the same set F of methods such that both are safe for a program Pr, and L is also safe for a program Pr′. Suppose that $\langle q_{\mathsf{Init}}, M_{\mathsf{Init}}\rangle \xrightarrow{t_{\mathrm{cl}}}_{Pr[L'] \bowtie \mathrm{PSC}} \langle q_{\mathrm{cl}}, M_{\mathrm{cl}}\rangle$, $\langle q_{\mathsf{Init}}, M_{\mathsf{Init}}\rangle \xrightarrow{t_{\mathrm{lib}}}_{Pr'[L] \bowtie \mathrm{PSC}} \langle q_{\mathrm{lib}}, M_{\mathrm{lib}}\rangle$, and H_F(t_cl) = H_F(t_lib). Then, there exists a trace t such that H(t) = H(t_cl) and $\langle q_{\mathsf{Init}}, M_{\mathsf{Init}}\rangle \xrightarrow{t}_{Pr[L] \bowtie \mathrm{PSC}} \langle q, M\rangle$, for:


The proof of Lemma 2 relies on the inherent disjointness of client-library composition (the library being safe for its client program), which we leverage in the following two ways.

Firstly, we extract client-local and library-local transition properties from all transitions of Pr[L′] ⋈ PSC and Pr′[L] ⋈ PSC. Thus, when we consider a transition of Pr[L′] ⋈ PSC corresponding to an instruction outside of a method of L′, we show that an analogous transition is possible with the same program state, but with a memory state zeroing out the locations used by the library L′. Similarly, when we consider a transition of Pr′[L] ⋈ PSC corresponding to an instruction in a method of L, we show that an analogous transition is possible with almost the same program state, except that we alter its stored program counter, and with a memory state zeroing out the locations used by the client Pr′. The justifications for these steps follow by the (⇒) directions of Lemma 1.

Secondly, we compose the client-local transition properties Pr exhibits in t_cl and the library-local transition properties L exhibits in t_lib while constructing transitions of Pr[L] ⋈ PSC for a trace t. Knowing that L is safe for Pr, we consider client-local transition properties from t_cl corresponding to transitions we wish to recreate in t, and replace zeroed-out memory locations with locations of L. Dually, we consider library-local transition properties from t_lib corresponding to transitions we wish to recreate in t, and replace zeroed-out memory locations with locations of Pr. The (⇐) directions of Lemma 1 justify such transformations. For instance, non-SF-transitions can be composed, provided that the client program preserves the library memory state, and vice versa; while crashes and SF-transitions record an interaction between a client program and a library, and therefore need to be performed in synchrony.

We use these two ideas in proving Lemma 2 by induction on the sum of the lengths of t_cl and t_lib, and use their local transition properties to justify composing them in synchrony. For the base case, we can simply take t = ε. For the induction step, we consider the last labels in t_cl and t_lib, as well as the cases when one of the traces is empty. When t_cl = t′_cl · α_cl and t_lib = t′_lib · α_lib, we use the trace t′ given by the induction hypothesis for t_cl and t_lib with the last action removed from either or both of them, and let t = t′ · α_cl or t = t′ · α_lib.

Then, the abstraction theorem is proved as follows.

Proof outline for Thm. 1. It suffices to show H(Pr[L]) ⊆ H(Pr[L♯]); then the claim follows using Lemma 2 by letting L := L♯, L′ := L, Pr := Pr, and Pr′ := Pr. Suppose otherwise, and let h be a shortest history in H(Pr[L]) \ H(Pr[L♯]). Let t be a shortest trace in traces(Pr[L] ⋈ PSC) with H(t) = h. Consider the last transition label α in t. The minimality of h and t ensures that α must be a return transition label for some f ∈ dom(L). Indeed, otherwise, we can show that α is enabled at the end of a corresponding trace of Pr[L♯] ⋈ PSC, which contradicts the fact that h ∉ H(Pr[L♯]). (The full argument here requires applying Lemma 2 with L := L♯, L′ := L, Pr := Pr, and Pr′ := Pr.)

Now, using the fact that Pr correctly calls L♯ w.r.t. MGC, we again apply Lemma 2 with L := L, L′ := L♯, Pr := MGC, and Pr′ := Pr, and derive that α is enabled at the end of a corresponding trace of MGC[L] ⋈ PSC. Then, L ⊑_MGC L♯ ensures that H_{dom(L)}(t) ∈ H_{dom(L)}(MGC[L♯]). Using Lemma 2 for the last time (applied with L := L♯, L′ := L, Pr := Pr, and Pr′ := MGC), we obtain that h = H(t) ∈ H(Pr[L♯]), which contradicts our assumption. ⊓⊔

The following corollary of Thm. 1 states that, like classical linearizability, our correctness condition is compositional (a.k.a. local), meaning that a library consisting of several (non-interacting) libraries can be abstracted by considering each sub-library separately. Formally, the composition of libraries L₁, …, Lₙ with pairwise disjoint sets of declared methods, denoted by L₁ ⊎ … ⊎ Lₙ, is defined to be the library obtained by taking the union of L₁, …, Lₙ. Compositionality is formulated as follows.

Corollary 1 (Compositionality). The following two conditions together imply that L₁ ⊎ … ⊎ Lₙ ⊑_MGC L♯₁ ⊎ … ⊎ L♯ₙ:


To end this section, we provide a simple lemma that allows one to establish L ⊑_MGC L♯ by applying standard simulation arguments to crashless traces (with the observable transitions being those that induce history labels). To this end, we require a simulation relation on the non-volatile memories generated by MGC[L] ⋈ PSC and MGC[L♯] ⋈ PSC that holds for the very initial memory and is preserved during crashless executions.

Lemma 3. A trace t is ṁ₀-to-ṁ if $\langle q_{\mathsf{Init}}, M_{\mathsf{Init}}[\dot{\mathtt{m}} \mapsto \dot{m}_0]\rangle \xrightarrow{t}_{Pr \bowtie \mathrm{PSC}} \langle q, M[\dot{\mathtt{m}} \mapsto \dot{m}]\rangle$ for some q and M. Suppose that some relation R on NVVar → Val satisfies:


Then, assuming dom(L) = dom(L♯), we have that L ⊑_MGC L♯.

Furthermore, if MGC[L♯] has no fo(·) and sfence instructions, then MGC[L♯] ⋈ PSC can take non-deterministic sfence steps (see §3) whenever MGC[L] ⋈ PSC takes SF-steps, so store fences can be ignored when checking H(t) = H(t♯).

### 7 An Application: Persistent Pairs

We illustrate the use of the library abstraction theorem for a simple concurrent and persistent data structure: a pair of values that supports write and read operations. We present two specifications, and an implementation for each specification. Both specifications ensure atomicity (i.e., linearizability if the system does not crash) and "data consistency" (reads return values written by a single write invocation), but they differ in their persistency guarantees. For the concurrency aspect, the implementations follow the sequence lock (seqlock, for short) mechanism, which uses a version counter along with the pair and allows readers to avoid blocking [6]. For durability, the implementations employ different techniques: one uses a "redo log" and the other is based on "checkpoints".

A durable pair. The first specification, a library we denote by L♯_pair, consists of three methods: write for writing the two values of the pair, read for reading the pair, and recover for recovering from a crash. The specification is as follows:<sup>5</sup>


A volatile lock (l̃) is used to ensure atomicity. For durability, writes use persistence blocks, which ensure that the two parts of the pair persist simultaneously. After the block is ended, fl(ẋ₁) (equivalent here to fl(ẋ₂), due to the persistence block) ensures that the block persists. If the system crashes after a write has completed, the written values are guaranteed to survive the crash. Thus, there is nothing to be done at recovery. Nevertheless, aiming to allow implementations that do require recovery, the library policy requires that recovery is executed after every crash, before other methods are invoked (MGC_rec in §5).

Next, we present an implementation of L♯_pair, which we denote by L_pair. We write x := y instead of a read of y (into some fresh register) followed by a write to x. We also omit some necessary register bookkeeping: since histories record the whole register store in call/return labels, strictly speaking, implementations must unroll changes to registers not used to pass return values.


Ignoring crashes, atomicity is guaranteed here using a seqlock. As for persistency, observe first that writing directly to the NVM is wrong, since we cannot control the non-deterministic propagation: if a crash occurs during the execution of write, it is possible that only one part of the pair has persisted, and the recovery method will not have sufficient information for reinitializing the pair correctly. Instead, write first records its "job" in ⟨ẋ₁ⁿᵉʷ, ẋ₂ⁿᵉʷ⟩. Then, if a crash happens and

<sup>5</sup> Our simplified language has no mechanism for argument passing. We assume that write receives its arguments (and read returns its results) via designated registers, a₁ and a₂.

the write was in the middle of updating ⟨ẋ₁, ẋ₂⟩ (as identified by observing an odd version number), the recovery will complete the job of the writer. We note that the (rather extensive) use of flushes (or flush-optimals followed by an sfence) is necessary here in order to restrict the out-of-order persistence. The final write to ṡ in write does not have to be explicitly persisted. Indeed, if a crash happens between this write and its persistence, recovery will redo the (idempotent) job.
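
A rough single-threaded Python rendering of this redo-log scheme (ours, not the paper's figure; the seqlock and all concurrency are omitted, mem stands for the non-volatile variables, initialized to 0, and flush marks where fl(·), or fo(·) followed by sfence, would be issued):

```
class DurablePair:
    def __init__(self, mem, flush):
        self.mem, self.flush = mem, flush   # mem: dict of NV variables

    def write(self, v1, v2):
        m, flush = self.mem, self.flush
        m["s"] += 1                          # odd version: write in progress
        m["x1_new"], m["x2_new"] = v1, v2    # record the "job" first
        flush("s"); flush("x1_new"); flush("x2_new")
        m["x1"], m["x2"] = v1, v2            # then update the pair itself
        flush("x1"); flush("x2")
        m["s"] += 1                          # even again; not flushed: if it
                                             # fails to persist, recovery just
                                             # redoes the (idempotent) job

    def recover(self):
        m, flush = self.mem, self.flush
        if m["s"] % 2 == 1:                  # crashed mid-update: redo the job
            m["x1"], m["x2"] = m["x1_new"], m["x2_new"]
            flush("x1"); flush("x2")
            m["s"] += 1; flush("s")
```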

Theorem 2. L_pair ⊑_MGC_rec L♯_pair.

Our proof sketch uses Lemma 3, letting ⟨ṁ, ṁ♯⟩ ∈ R if the following hold:


Using the abstraction theorem, we obtain that for a program Pr that uses L_pair correctly (i.e., calls recovery first after every crash), for every state ⟨q, M⟩ that is reachable in Pr[L_pair] ⋈ PSC, there exists a state ⟨q♯, M♯⟩ reachable in Pr[L♯_pair] ⋈ PSC and indistinguishable from ⟨q, M⟩ from the client perspective.

A buffered durable pair. A second specification, denoted by L♯_bpair, allows for "buffered" behaviors, which enable faster implementations by weakening the persistency guarantees [24]. Instead of requiring operations to persist before returning, it only requires that operations are "persistently ordered" before returning.


Compared to L♯_pair, the explicit flush instruction fl(ẋ₁) in the write method is omitted, which means that a crash after a completed write may take the pair back to its state before the write. Thus, the state after a crash need not necessarily be fully up-to-date. An additional method, called sync, can be used to ensure that previous writes have persisted. Without sync, an implementation could simply ignore persistency and store the pair in the volatile memory, which corresponds to an execution of L♯_bpair in which the persistence buffers are never flushed.

An implementation can be obtained as follows:


This implementation exploits the freedom allowed by the specification. Writes and reads again employ a seqlock, but this time they only use volatile variables. In turn, sync sets a "checkpoint", and recovery rolls the state back to the latest complete checkpoint. To this end, a non-volatile flag ḟ is used to detect crashes that occur while the checkpoint ⟨ẋ₁ⁿᵉˣᵗ, ẋ₂ⁿᵉˣᵗ⟩ is being set. Thus, before storing the checkpoint, the previous checkpoint is stored in the non-volatile variables ⟨ẋ₁ᵖʳᵉᵛ, ẋ₂ᵖʳᵉᵛ⟩. Upon recovery, given the value of the flag, we know whether we can restore the state from the currently stored checkpoint, or, if a crash happened during the store of this checkpoint (which means that sync did not return), we set the pair to the previously stored one.
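
In the same illustrative Python style as before (our sketch; the seqlock is again omitted, names are invented, and mem is assumed initialized to 0), the checkpoint scheme could look as follows:

```
class BufferedDurablePair:
    def __init__(self, mem, flush):
        self.mem, self.flush = mem, flush   # mem: dict of NV variables
        self.v1 = self.v2 = 0               # the pair lives in volatile memory

    def write(self, v1, v2):
        self.v1, self.v2 = v1, v2           # no persist instructions at all

    def sync(self):
        m, flush = self.mem, self.flush
        # Save the previous checkpoint before overwriting the current one.
        m["x1_prev"], m["x2_prev"] = m["x1_next"], m["x2_next"]
        flush("x1_prev"); flush("x2_prev")
        m["f"] = 1; flush("f")              # flag: checkpoint update started
        m["x1_next"], m["x2_next"] = self.v1, self.v2
        flush("x1_next"); flush("x2_next")
        m["f"] = 0; flush("f")              # flag: checkpoint is complete

    def recover(self):
        m = self.mem
        if m["f"] == 0:                     # current checkpoint is intact
            self.v1, self.v2 = m["x1_next"], m["x2_next"]
        else:                               # crashed mid-sync: roll back
            self.v1, self.v2 = m["x1_prev"], m["x2_prev"]
            m["f"] = 0; self.flush("f")
```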

Theorem 3. L_bpair ⊑_MGC_rec L♯_bpair.

Our proof sketch uses Lemma 3, letting ⟨ṁ, ṁ♯⟩ ∈ R if the following hold:


### 8 Related and Future Work

Library abstraction theorems. Previous work has developed library abstraction theorems for crashless shared-memory concurrency. First, [13] formalized the intuition that standard linearizability as defined in [21] corresponds to contextual refinement (and also proved a completeness result: the converse holds provided that threads have other means of interaction besides the library). Later, [7] refined and reformulated this result using history inclusion instead of linearizability, which is closer to our formalization. Other abstraction results account for liveness [16], resource-transferring programs [17], and x86-TSO [8]. Our composition lemma (Lemma 2) is inspired by [8], which addresses a challenge close to the one posed by store fence instructions in NVM, where actions of the client and the library affect each other even if they access distinct locations. To do so, the notion of a history is extended to expose events that correspond to the flushing of certain entries from the x86-TSO store buffers, which is close to what we do to handle store fences. Our alternative approach to this problem, i.e., introducing a relaxed version of the store fence, is novel.

While our framework is operational, library abstraction was also studied before for declarative shared memory concurrency semantics, particularly in the context of the C11 weak memory model [5, 28].

Linearizability notions for persistent objects. Different approaches for adapting the standard linearizability criterion, which is based on crash-free sequential specifications [21], were proposed before [3,19,24], but were not formally related to contextual refinement. Since methods like recover and sync (see §7) are meaningless in crash-free sequential specifications, they require ad-hoc external treatment in these linearizability adaptations. The variety of approaches to interpreting crash-free sequential specifications for crash-resilient concurrent objects makes it hard, in particular, to combine libraries with different linearizability guarantees in a single program.

In turn, these existing notions are typically expressible in the refinement framework that we employ. For example, in the crashless setting, by wrapping each method of a sequential implementation S of some object inside a global lock, one obtains an abstract library L♯_S for that object that corresponds to the conditions imposed by standard linearizability [7] (a library L is linearizable w.r.t. S if every crashless history induced by a trace of MGC[L] is also induced by some trace of MGC[L♯_S]). Now, when crashes are involved, by wrapping each method of S inside a global lock and a persistence block followed by an explicit flush instruction (like L♯_pair in §7), one obtains an abstract library L♯_S that corresponds to the conditions imposed by the strict linearizability of [3] (L is strictly linearizable w.r.t. S if L ⊑_MGC L♯_S). Thus, our results can be used to derive contextual refinement (using L♯_S as a specification) from strictly linearizable objects. We note that while the original definition of strict linearizability was for a model with per-processor failures, what we consider here is its application to full system crashes.

Durable linearizability [24] weakens strict linearizability by allowing methods that were active during a crash to take their effect at any later point in the execution (or never), instead of requiring that the effect of such methods is visible immediately after the crash (or never). This weakening aims to allow lazy recovery for large structures, where either the recovery procedure is executed in parallel to other methods after a crash, or the methods themselves participate in recovering the data structure when they are further executed. This notion can also be expressed as an abstract implementation in our language. For this matter, every update method in the specification would: first record its task in a work-set; remove the task from the work-set; flush the updated work-set; and perform the task as in L♯_S described above. In turn, every query method may choose to complete any task it finds in the work-set, since the method performing such a task has crashed during its invocation. For persistent pairs (see §7), this is illustrated by the specification below. The non-volatile variable ẇ is the multiset holding the work-set, with atomic add and remove operations, and l̃_rw is an abstract multiple-readers-single-writer lock used to resolve races on the work-set.


A "bufered" version of strict linearizability, which only requires the existence of a prefx of the completed invocations to be observed after a crash, is also naturally derived by considering L # <sup>S</sup> <sup>b</sup> which is obtained from a sequential implementation S by wrapping each method of S inside a global lock and a persistence block (without an explicit fush instruction) and ensuring that there is a single non-volatile variable that is written to by all library methods (introducing such a variable if needed).<sup>6</sup>

An alternative operational characterization of durable linearizability using Input/Output automata was developed in [12] and used to formally establish this property for the persistent queue of [14], by providing a full-blown simulation proof using the KIV proof assistant.<sup>7</sup> Nevertheless, this work does not relate the proved correctness criterion to contextual refinement.

Persistency models. The underlying model we assume is PSC by [25], a strengthening of Px86 [30] that formalizes the Intel-x86 persistency. The paper [25] provided compiler mappings that ensure PSC semantics on machines guaranteeing Px86 semantics. We extended the general semantic framework with libraries, and extended PSC with local store fences and persistence blocks.

Future work. Future work includes extending our proof method and results to weaker persistency models, such as persistent x86-TSO [30] and ARM [10]; handling random-access shared memory with allocations and deallocations (instead of the simplified shared-variables model we employ); and lifting the strict condition that libraries and clients live in disjoint address spaces by allowing them to transfer ownership of certain locations (as was done in [17] for standard volatile memory).

In addition, extending and adapting methods for refinement verification under volatile memory is needed in order to provide library developers with means to validate our library-correctness conditions. Such methods may include automated checking by approximation [7], layered interactive verification in the style of [20,27], and formal logics such as the one in [26]. Similarly, developing formal methods and tools that allow using library specifications for client reasoning is left for future work, including decidable reachability analysis [2], program logics [29], and principled testing [15]. Finally, it is interesting to see how logical atomicity notions established by program logics, such as [11,31], which have been extended to cover crashes in disk-based storage systems [9], can be adapted for establishing our correctness condition and/or for client reasoning.

<sup>6</sup> Since the corresponding "buffered" correctness notion is not compositional, while the refinement-based notion is (see Corollary 1), one cannot expect to have a per-object translation of a sequential implementation S into a concurrent and persistent implementation L♯_{S,b}. Indeed, the addition of a single non-volatile variable that is written to by all library methods is not a per-object translation (i.e., for two sequential library implementations S₁ and S₂ implementing disjoint sets of methods and operating on disjoint variables, we will not have L♯_{S₁∪S₂,b} = L♯_{S₁,b} ∪ L♯_{S₂,b}).

<sup>7</sup> See https://kiv.isse.de/projects/Durable-Queue.html.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Static Race Detection for Periodic Programs<sup>⋆</sup>

Varsha P Suresh<sup>1</sup>, Rekha Pai<sup>2</sup>, Deepak D'Souza<sup>2</sup>, Meenakshi D'Souza<sup>1</sup>, and Sujit Kumar Chakrabarti<sup>1</sup>

<sup>1</sup> International Institute of Information Technology Bangalore, Bengaluru, India {varsha.suresh,meenakshi,sujitkc}@iiitb.ac.in
<sup>2</sup> Indian Institute of Science, Bengaluru, India {rekhapai,deepakd}@iisc.ac.in

Abstract. We consider the problem of statically detecting data races in periodic real-time programs that use locks and run on a single-processor platform. We propose a technique based on a small set of rules that exploits the priority, periodicity, locking, and timing information of the tasks in the program. One of the key requirements is a response time analysis for such programs, and we propose an algorithm to compute this for the case of non-nested locks. We have implemented our analysis for real-time programs written in C in a tool called PePRacer, and evaluated its performance on a small set of benchmarks from the literature.

Keywords: Real-time systems · Periodic programs · Static analysis · Data races · WCRT analysis

### 1 Introduction

Periodic real-time applications (or simply periodic programs) are a class of real-time systems that comprise a set of tasks, each of which comes with an associated priority and periodicity, and is executed according to a scheduling policy, like priority-based preemptive scheduling, on a real-time operating system. Thus, each task is made ready to run at the beginning of its period (though it may actually get to execute only later, depending on its priority and how long it has been waiting in the ready queue), and may be preempted during its execution by higher-priority tasks that have been made ready to run. Many of these systems are safety-critical in nature, being widely employed in avionics, robotics, and autonomous systems.

These systems are also essentially concurrent in nature (even if we consider single-processor platforms), since a running task may be preempted by a higher-priority task, causing them to interleave in time. With concurrency come the attendant problems of data races: it is not difficult to imagine a scenario where a low-priority task is updating a shared data structure, or even a multi-word variable like a long int, when it is preempted by a higher-priority task that

<sup>⋆</sup> Supported by the University Grants Commission (UGC), New Delhi, India and the Royal Academy of Engineering, UK.

goes on to access the potentially inconsistent shared data. Thus, it is common for real-time application developers to use synchronization mechanisms like locks to protect accesses to shared data structures (like the ones used to control wheel movement in a robot) or resources (like an LCD display). Real-time operating systems typically provide a variety of lock mechanisms, from standard locks or semaphores to priority-inheritance-based locks [18].

Our focus in this paper is on giving a way to statically (that is, by analyzing the source code of the application, rather than running it) detect races in periodic programs that use standard locks. The emphasis in static analysis techniques is on soundness: we do not eliminate a pair of conflicting accesses unless we can prove that they do not race. The other side of the coin is precision: how close is the set of potential races reported to the actual set of races in the program. The basic technique used in the programming-languages community to statically detect races is a lockset analysis, which computes the set of locks that are must-held at each statement in a task, and declares two statements to be non-racy if they hold a common lock. More recent techniques [17,20] exploit priority information to declare accesses to be non-racy: for instance, a high-priority task does not need to protect its accesses from a lower-priority task.
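
The lockset rule itself is a one-liner; the sketch below (ours; the must-held lock sets would come from a standard dataflow analysis, not shown here) states it in Python:

```
def non_racy_by_lockset(locks_held_at, s1, s2):
    """Two conflicting statements are non-racy if some lock is must-held
    at both; locks_held_at maps a statement to its must-held lock set."""
    return bool(locks_held_at[s1] & locks_held_at[s2])
```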

However, none of these techniques seek to exploit the inherent periodic nature or execution times of the tasks in these programs. For example, a simple observation is that if two tasks have the same period and don't take any locks, they can never overlap in time. Exploiting timing information is also key to improving the precision of a race analysis technique for these programs. The notion of worst-case response time (WCRT) of a task measures the maximum time an instance of the task may take to complete its execution starting from the beginning of its period. As an example of how we can use conservative WCRT estimates, if we can conclude from the WCRT information that a low-priority task always finishes execution before the next arrival of a high-priority task, we can declare them to be non-racy.

While computing the WCRT of tasks in periodic programs is well studied in the real-time systems community, starting from [13,12] for periodic programs without locks, and for periodic programs with priority-inheritance-based locks [18], as far as we are aware there are no techniques available for periodic programs with standard locks. One of the contributions of this paper is to extend the classical technique of [12] to compute WCRT estimates for programs with non-nested locks, given worst-case execution time (WCET) estimates of tasks and lock-unlock blocks (or critical sections).
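
For reference, the classical lock-free analysis that this paper extends computes the least fixed point of the recurrence R = C_i + Σ_{j ∈ hp(i)} ⌈R / T_j⌉ · C_j; a direct Python rendering (ours, with tasks indexed so that tasks 0..i−1 have higher priority than task i):

```
import math

def wcrt(C, T, i):
    """Worst-case response time of task i under fixed-priority preemptive
    scheduling without locks (classical recurrence); C and T hold the
    WCETs and periods. Returns None if the response time exceeds T[i]."""
    R = C[i]
    while True:
        R_next = C[i] + sum(math.ceil(R / T[j]) * C[j] for j in range(i))
        if R_next > T[i]:
            return None              # task i cannot be shown schedulable
        if R_next == R:
            return R                 # least fixed point reached
        R = R_next
```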

We then go on to give a set of six rules (in the spirit of the ideas described above) to soundly eliminate pairs of conflicting accesses, leading to a sound, efficient, and fairly precise race-detection technique for such programs.

We have implemented our analysis in a tool called PePRacer for detecting races in such programs written in C. One of the inputs to the tool is a WCET analysis for the different blocks in the program tasks, which we obtain using the WCET analysis tool Heptane [11]. We have run our tool on several benchmarks, including robot controllers from the nxtOSEK project [2]. Our tool runs in a fraction of a second on these benchmarks, and on average eliminates 97% of conflicting access pairs as non-racy.

An overview of our technique is presented in the next section on an example adapted from one of our benchmarks. Periodic programs and their execution semantics are introduced in Sec. 3. Sec. 4 formally defines the notions of conflicting accesses and data races. Algorithms for computing safe bounds on response times of periodic programs with locks are presented in Sec. 5.2. Sec. 6 gives the rules for disjointedness of tasks and the race detection algorithm for periodic programs. Our experiments on benchmark examples are detailed in Sec. 7, followed by a discussion on related work in Sec. 8.

### 2 Overview

We provide an overview of our technique with an illustrative example adapted from the "lego_osek" robot controller, based on the OSEK operating system, from [2]. Fig. 1 shows some excerpts from this example. The controller's job is to control the motion of a two-wheeled robot so that it follows a line (which it detects using light sensors); it also detects obstacles along the way (using a sonar sensor) and avoids them by braking and moving to the left. The controller has two tasks, TaskControl and TaskObstAvoid, that do the line-following control and the obstacle detection and avoidance, respectively. TaskControl has high priority (a higher value indicates higher priority) and runs every 10ms, while TaskObstAvoid has low priority and runs every 30ms. The two tasks access some shared locations, including structures for actuating the left and right wheel motors, an LCD display, and a boolean "obstacle-detected" flag. TaskControl reads two light sensor values, does some computation with them, and writes them to the LCD display. The access to the LCD display is protected by acquiring and releasing the lcd_lock lock. Finally, it computes the new speed and brake values, which are then written to the wheel motor structures, after checking that the obstacle flag is not set. The TaskObstAvoid task reads the sonar and left light sensors, does some computation on them, sets the obstacle flag based on these values, and displays them on the LCD (making sure to take a lock on it first). If the obstacle flag was set, it goes on to write to the left wheel structure to brake and turn the robot to the left.

We note that there are several conflicting accesses to the shared variables, including lines 13 and 33 on lcd, lines 16 and 29, and 16 and 31, on obstacle, and lines 19–20 and 36–37 on left_wheel. Apart from the accesses to lcd, which are protected by a lock, the other accesses appear to be racy at first glance. For instance, while TaskObstAvoid is updating the left wheel structure, it could be preempted by the higher-priority TaskControl, which goes on to write to the same structure, potentially leading to a harmful race.

Our key idea is to exploit the priority, periodicity, and worst-case response times of the tasks to show that these accesses cannot race. Fig. 2 shows the periodic execution of the two tasks. Notice that if the low-priority task is guaranteed to finish its execution before the next instance of the higher-priority task

```
 1. // Shared structures and variables
 2. struct motor right_wheel;
 3. struct motor left_wheel;
 4. struct display lcd;
 5. bool obstacle = 0;

 6. void TaskControl() { // Per 10, Prio 2 (high)
 7.   int sensor_right, sensor_left;
 8.   // Read and calibrate sensor values
 9.   sensor_right = get_light_sensor(right);
10.   sensor_left = get_light_sensor(left);
11.   lock(lcd_lock);
12.   // display sensor values on LCD
13.   show_var(sensor_right, sensor_left);
14.   unlock(lcd_lock);
15.   // Motor control, uses sensor values
16.   if (!obstacle) {
17.     right_wheel.speed = ...;
18.     right_wheel.brake = 0;
19.     left_wheel.speed = ...;
20.     left_wheel.brake = 0;
21.   }
22. }

23. void TaskObstAvoid() { // Per 30, Prio 1 (low)
24.   int sonar_value, sensor_left;
25.   // Read and calibrate sensor values
26.   sonar_value = get_sonar_sensor();
27.   sensor_left = get_light_sensor(left);
28.   if (...)
29.     obstacle = 1;
30.   else
31.     obstacle = 0;
32.   lock(lcd_lock);
33.   show_var(sonar_value, sensor_left);
34.   unlock(lcd_lock);
35.   if (obstacle) { // avoid by moving left
36.     left_wheel.speed = ...;
37.     left_wheel.brake = 1;
38.   }
39. }
```
Fig. 1: An example periodic program adapted from Lego-OSEK

is scheduled, there can be no interleaving of the two tasks, and we can declare all the conflicting accesses non-racy. However, concluding this in the presence of locks is not easy, and our first contribution is a way of computing an estimate of the worst case response times of tasks that take non-nested locks (as in the example program). Using raw WCET times of the tasks and their lock blocks (like lines 11–14), for the platform on which the robot controller is to run, we use Algo. 2 (described in Sec. 5) to compute an estimate of the response time of TaskObstAvoid. Rule 3 (described in Sec. 6) then allows us to eliminate all the pairs of conflicting accesses as non-racy.

We note that techniques such as [17,20] that consider task priorities and locks (but not periodicities and response times) would not be able to eliminate any of the conflicting access pairs, except the accesses to lcd which are protected by a lock.

Fig. 2: Task timelines for Lego-OSEK example

### 3 Periodic Programs

A periodic program is a collection of tasks. Each task has an associated function, period, and priority. There is a designated init task which is the only task that is ready to run initially. An execution of the program begins with running the function associated with the init task, which initializes shared variables. It then makes other tasks ready to run using the start command. The init task runs only once.

The execution of the tasks is orchestrated by a priority-based preemptive scheduler. It is important to point out that we assume a single-processor platform. The scheduler selects one of the enabled tasks for execution on a highest-priority-first basis. A task with period T is enabled every T time units. If more than one task of the highest priority is ready to run, the longest-waiting task is picked for execution; this is also known as First-Come-First-Served (FCFS) scheduling.

The task functions operate on a set of shared variables V using assignment statements, and accesses to the shared variables can be synchronized using the lock and unlock commands. The set of commands Cmd_V (over a set of variables V) that can be used in a periodic program is shown in Table 1.


Table 1: Periodic Program Commands Cmd<sup>V</sup>

Formally, a periodic program P is a tuple (V, L, T) where V is a finite set of shared variables, L is a finite set of locks, and T is a finite set of tasks, including a designated init task. A task τ ∈ T is a tuple (G_τ, T_τ, p_τ), where G_τ is the task function, T_τ is the period of the task, and p_τ is its priority. The task function G_τ is represented as a Control Flow Graph (CFG) G_τ = (Loc_τ, I_τ, ent_τ, ext_τ), where Loc_τ is the finite set of locations of τ, I_τ ⊆ Loc_τ × Cmd_V × Loc_τ is the set of instructions of τ, and ent_τ, ext_τ ∈ Loc_τ are the entry and exit locations of τ, respectively. We denote the set of locations and instructions in P by Loc_P = ∪_{τ∈T} Loc_τ and I_P = ∪_{τ∈T} I_τ respectively, assuming the sets of locations to be disjoint across tasks. We drop the subscripts whenever they are clear from the context.
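For concreteness, the representation above can be realized with plain C data structures. The following is a minimal, hypothetical encoding (all type and field names are ours, not prescribed by the paper):

```c
/* A minimal sketch of the periodic-program representation (hypothetical names). */
typedef enum { SKIP, BEGIN, END, ASSIGN, ASSUME, LOCK, UNLOCK, START } cmd_kind;

typedef struct {
    cmd_kind kind;
    int var, lock;        /* variable written / lock taken, where relevant */
} cmd;

typedef struct {
    int from, to;         /* source and target locations l and l' */
    cmd c;                /* the command labelling the CFG edge */
} instr;

typedef struct {
    int num_locs, ent, ext;   /* |Loc_tau|, entry, and exit locations */
    int num_instrs;
    instr *instrs;            /* the edge set I_tau of the CFG */
    int period, priority;     /* T_tau and p_tau */
} task;

typedef struct {
    int num_vars, num_locks, num_tasks;
    task *tasks;              /* tasks[0] can be the designated init task */
} periodic_program;
```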

An example periodic program and the CFG representation of one of its tasks, ObsDect, are shown in Fig. 3. The periodic program has two tasks that implement a simple robotic controller, apart from the default init task. The ObsDect task function detects an obstacle based on the sensor input in the sIn variable and takes a corrective action. The MoveForward task function directs the robot to move forward if there is no obstacle. The ObsDect task has high priority (value 2) and runs every 100 time units, while the MoveForward task has lower priority (value 1) and runs only every 200 time units. Both tasks access the shared variables obstacle and forward.

Fig. 3: Example program and the CFG representation

We now define the semantics of a periodic program P = (V, L, T) as a labeled transition system S_P = ⟨S, s_in, ⇒⟩ where S is the set of states, s_in ∈ S is the initial state, and ⇒ is the transition relation, as defined below. In the following, Q_T denotes the set of possible task priority queues and ε denotes the empty queue. We also assume that the tasks in P have distinct priorities drawn from {1, ..., k}, with a higher value indicating higher priority. For an integer expression e, a boolean expression b, and an environment φ for V, we denote by [[e]]_φ the integer value that e evaluates to in φ, and by [[b]]_φ the boolean value that b evaluates to in φ. For a function f : X → Y and elements x ∈ X and y ∈ Y, we use the notation f[x ↦ y] to denote the function f′ : X → Y given by f′(x) = y and f′(z) = f(z) for all z different from x.

A state s ∈ S is a tuple (R, W, A, B, pc, φ, tick, r) where

– R is a priority queue of tasks that are ready to run,

– W is the set of tasks waiting to be scheduled,

– A : L → T ∪ {undef} records, for each lock, the task currently holding it (if any),

– B : L → Q_T maps each lock to a priority queue of tasks blocked on it,

– pc : T → Loc_P gives the current location of each task,

– φ is an environment assigning values to the variables in V,

– tick is the current value of the tick counter, and

– r is the currently running task.


The initial state s_in is defined to be (ε, T \ {init}, ∅, ∅, λτ.ent_τ, λx.0, 0, init), denoting the fact that initially the init task is the running task, while the other tasks are not ready to run and are instead waiting to be scheduled; no task has acquired a lock and hence none is blocked; all the tasks are at their entry locations; and all the variables, as well as the tick counter, are initialized to zero.

We now define the transition relation ⇒ ⊆ S × I_P × S as follows. For a state s = (R, W, A, B, pc, φ, tick, r), a task τ, and an instruction ι = (l, c, l′) in G_τ, we have s ⇒_ι s′ iff one of the rules in Fig. 4 is satisfied: if, for a command c, the conditions on state s specified in the antecedent (above the line) hold, then the transition s ⇒_ι s′ in the consequent (below the line) also holds.

In the Start rule, for the start command executed by the init task, all the tasks in W that are waiting to be scheduled onto the ready queue are enqueued onto R. We now pick the highest priority task, which is at the head of the updated ready queue, to be the next running task. Once the init task executes the start command, it plays no role in the rest of the execution.

The rule uses the ENQ(Q, S) function which, when given a priority queue Q of tasks and a set S of tasks, enqueues each task in S onto the queue Q. The function enq(Q, s) is the standard enqueue function for a priority queue Q. The function deq(Q) returns the queue with the head element removed. The function head(Q), when given a priority queue Q of tasks, returns the task with the highest priority, which is at the head of Q.
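A minimal sketch of these queue operations, assuming a simple array-backed priority queue whose head is the highest-priority task and where ties are broken FCFS (all names are illustrative):

```c
#define MAXT 16

typedef struct { int tasks[MAXT]; int len; } pqueue;   /* tasks[0] is the head */

int prio[MAXT];                      /* static priority of each task */

/* enq: insert t behind every queued task of priority >= prio[t],
   so that equal-priority tasks are served FCFS. */
static void enq(pqueue *q, int t) {
    int i = q->len;
    while (i > 0 && prio[q->tasks[i - 1]] < prio[t]) {
        q->tasks[i] = q->tasks[i - 1];
        i--;
    }
    q->tasks[i] = t;
    q->len++;
}

static int head(const pqueue *q) { return q->tasks[0]; }

static void deq(pqueue *q) {                     /* drop the head element */
    for (int i = 1; i < q->len; i++) q->tasks[i - 1] = q->tasks[i];
    q->len--;
}

static void ENQ(pqueue *q, const int *set, int n) {   /* enqueue a set of tasks */
    for (int i = 0; i < n; i++) enq(q, set[i]);
}
```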

The End rule is defined for the end command to signal completion of the currently running task. Hence the task is inserted into the wait list W. Moreover, the highest priority task in the ready queue R, which is at its head, is removed from R and made the running task. The rule requires that the ready queue R be non-empty.

The Alock rule is defined for the lock(m) command: if the running task r requests a lock m which is not acquired by any task (as given by A(m) = undef), then the running task proceeds with acquiring the lock. The Block rule is defined for the lock(m) command when the running task cannot acquire the lock: if the running task r requests a lock m which is acquired by a task τ′ (as given by A(m) = τ′), then r is blocked by enqueuing it onto the blocked queue B(m). This calls for a re-schedule, and hence the highest priority task from the non-empty ready queue R is made the running task.

The Unlock rule is defined for the unlock(m) command. If the running task r requests the release of the lock m which it was holding, or it was the

– Skip: c = skip, pc(τ) = l, τ = r ⟹ s ⇒_ι (R, W, A, B, pc[τ ↦ l′], φ, tick, r)

– Begin: c = begin, pc(τ) = l, τ = r ⟹ s ⇒_ι (R, W, A, B, pc[τ ↦ l′], φ, tick, r)

– Assume: c = assume(b), pc(τ) = l, τ = r, [[b]]_φ = true ⟹ s ⇒_ι (R, W, A, B, pc[τ ↦ l′], φ, tick, r)

– Assign: c = x := e, pc(τ) = l, τ = r ⟹ s ⇒_ι (R, W, A, B, pc[τ ↦ l′], φ[x ↦ [[e]]_φ], tick, r)

– Start: c = start, pc(τ) = l, τ = r = init ⟹ s ⇒_ι (deq(ENQ(R, W)), ∅, A, B, pc[τ ↦ l′], φ, tick, head(ENQ(R, W)))

– End: c = end, pc(τ) = l, τ = r, R ≠ ε ⟹ s ⇒_ι (deq(R), W ∪ {r}, A, B, pc[τ ↦ l′], φ, tick, head(R))

– Alock: c = lock(m), pc(τ) = l, τ = r, A(m) = undef ⟹ s ⇒_ι (R, W, A[m ↦ τ], B, pc[τ ↦ l′], φ, tick, r)

– Block: c = lock(m), pc(τ) = l, τ = r, A(m) = τ′, R ≠ ε ⟹ s ⇒_ι (deq(R), W, A, B[m ↦ enq(B(m), r)], pc, φ, tick, head(R))

– Unlock: c = unlock(m), pc(τ) = l, τ = r, (A(m) = r ∨ A(m) = undef), B(m) = ε ⟹ s ⇒_ι (R, W, A[m ↦ undef], B, pc[τ ↦ l′], φ, tick, r)

– Unl-wk: c = unlock(m), pc(τ) = l, τ = r, A(m) = r, Q = B(m) ≠ ε, head(Q) = τ′, p_{τ′} ≤ p_r ⟹ s ⇒_ι (enq(R, τ′), W, A[m ↦ undef], B[m ↦ deq(Q)], pc[τ ↦ l′], φ, tick, r)

– Unl-cs: c = unlock(m), pc(τ) = l, τ = r, A(m) = r, Q = B(m) ≠ ε, head(Q) = τ′, p_{τ′} > p_r ⟹ s ⇒_ι (enq(R, r), W, A[m ↦ undef], B[m ↦ deq(Q)], pc[τ ↦ l′], φ, tick, τ′)

– Tick: v = inc(tick), S = {τ′ ∈ W | v is a multiple of T_{τ′}} ⟹ s ⇒ (deq(ENQ(R, S ∪ {r})), W \ S, A, B, pc, φ, v, head(ENQ(R, S ∪ {r})))

Fig. 4: Transition relation capturing the execution semantics of a periodic program

case that no task was holding the lock (as given by A(m) = r ∨ A(m) = undef), then the running task can proceed with releasing the lock. Further, if no tasks are blocked on this lock m (as given by B(m) = ε), then the current task continues to be the running task. The Unl-wk rule is defined for the unlock(m) command when a lower priority task is blocked on the lock: if the running task requests the release of the lock m which it was holding, and the task τ′ at the head of the blocked priority queue B(m) has priority lower than the running task, then τ′ is unblocked by dequeuing it from B(m) and enqueuing it onto the ready queue R. Task r continues to be the running task. The Unl-cs rule is defined for the unlock(m) command when a higher priority task is blocked on lock m: if the running task requests the release of the lock m which it was holding, and a higher priority task τ′ is blocked on the lock, then τ′ is unblocked by dequeuing it from its blocked queue B(m). The task τ′, being of higher priority, is selected as the next running task, while the current running task r is enqueued onto the ready queue R.

The Tick rule models the handling of a timer interrupt, signalling that a unit of time has elapsed. The tick counter is incremented by one, and the tasks in W whose periods divide the tick count are moved to the ready queue R. The current running task r is also enqueued onto the ready queue. We then pick the highest priority task in the updated ready queue, which is at its head, as the next task to run.
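Operationally, the Tick rule corresponds to a timer-interrupt handler of roughly the following shape, continuing the illustrative queue sketch above (the globals and names are ours, and all periods are assumed positive):

```c
pqueue ready;                        /* the ready queue R */
int waiting[MAXT], num_waiting;      /* the waiting set W */
int period[MAXT];                    /* T_tau for each task (assumed > 0) */
int running, tick;                   /* current task r and the tick counter */

static void on_tick(void) {
    tick++;
    /* move every waiting task whose period divides the new tick count to R */
    int still_waiting = 0;
    for (int i = 0; i < num_waiting; i++) {
        int t = waiting[i];
        if (tick % period[t] == 0) enq(&ready, t);
        else waiting[still_waiting++] = t;
    }
    num_waiting = still_waiting;
    enq(&ready, running);     /* the interrupted task competes with the released ones */
    running = head(&ready);   /* highest priority wins; FCFS breaks ties */
    deq(&ready);
}
```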

The Skip, Begin, Assign, and Assume rules for the skip, begin, assignment, and assume commands, respectively, are standard.

An execution of a periodic program P is a finite sequence of transitions ρ = δ_1, ..., δ_n (n ≥ 1) such that there exists a sequence of states s_0, ..., s_n of S, with each δ_i ∈ ⇒ of the form (s_{i−1}, ι_i, s_i) for some ι_i, and s_0 = s_in.

The semantics we have defined so far abstracts away the "real-time" aspect of a periodic program. We can obtain the real-time semantics of a periodic program by considering a concrete execution environment which fixes the execution time of each instruction (say in a bounded interval of time), and restricting ourselves to executions where the tick interrupt is driven by a real-time clock and is consistent with the time taken to execute instructions between two ticks. Henceforth we fix such an environment and focus on the induced subset of executions of a periodic program.

### 4 Data Races

Let P = (V, L, T) be a periodic program. In an execution of P, tasks are executed periodically, and hence many instances of a task get executed during the course of the execution. Consider two tasks τ_1 and τ_2 in T, and two non-empty paths π and π′ in G_{τ_1} and G_{τ_2}, respectively. We say π and π′ may happen in parallel in P if there is an execution ρ of P, and instances of τ_1 and τ_2 in ρ which execute along the paths π and π′ respectively, in such a way that the paths π and π′ interleave (that is, either π′ begins after π has begun but not yet ended, or vice versa).

We now define when two statements s_1 and s_2 (corresponding to instructions ι_1 = (l_1, c_1, l_1′) and ι_2 = (l_2, c_2, l_2′)) in tasks τ_1 and τ_2, respectively, may happen in parallel. Consider the program P′ obtained from P by enclosing the statements s_1 and s_2 in skip statements. Formally, we obtain P′ by replacing the instruction ι_1 by the instructions (l_1, skip, m_1), (m_1, c_1, m_1′), and (m_1′, skip, l_1′), where m_1 and m_1′ are new locations in Loc_{τ_1}; and similarly for ι_2. Let π_1 be the path l_1 →_skip m_1 →_{c_1} m_1′ →_skip l_1′ in G_{τ_1}′, and similarly π_2 in G_{τ_2}′. We now say s_1 and s_2 may happen in parallel in P if the paths π_1 and π_2 may happen in parallel in the program P′.

Two statements are called conflicting if they are read/write accesses to the same variable and at least one of them is a write. We say two statements s_1 and s_2 in P are involved in a data race (or simply are racy) if they are conflicting accesses that may happen in parallel. As an example, in the program of Fig. 3, the accesses to obstacle in lines 10 and 20 are conflicting. Without any assumptions on the execution times of the two tasks, these two statements are also racy, since there is an execution of the augmented program in which the skip-blocks around these two statements interleave.

Finally, we define what it means for a "block" of code to happen in parallel with another. A block of code in P is specified by a pair (l, X), where, for some task τ in P, l is a location in Loc_τ and X ⊆ Loc_τ is a subset of locations reachable from l in task τ. An initial path in a block B = (l, X) of a task τ in P is a non-empty path in G_τ that begins at l and stays within the set of locations X, except possibly for the last location on the path. We say a statement s = (m, c, m′) in P belongs to block B = (l, X) if m belongs to the set X. We say two blocks B_1 and B_2 of P may happen in parallel if there are two initial paths π_1 in B_1 and π_2 in B_2 which may happen in parallel with each other. Otherwise, B_1 and B_2 are disjoint.

### 5 Response Time and its Computation

Our aim in this section is to give a way of computing a safe bound on the response time of tasks in a periodic program with locks. We begin by recalling some of the basic notions.

Consider a sequential piece of compiled code B executing on a given hardware platform. Assume that the code does not have to compete for the processor time with other processes (in particular there is no preemption, and lock statements succeed without blocking). The execution time of B may still vary depending on reads of input and other shared locations, which are assumed to return nondeterministic values during the execution. If we consider the supremum of these execution times we obtain the worst-case execution time (WCET) of B on the given platform. There are many static analysis techniques and tools that help us obtain conservative estimates on the WCET of a program on a given platform. We refer the reader to [21] for a survey of these techniques and tools.

Let us now consider a periodic program P = (V, L, T) which we want to execute in a given execution environment. Let τ be a task in T, and consider an execution ρ of P in this environment. There could be many instances of τ executing in ρ. Consider one such instance, where at time t, τ moves into the ready queue with the program counter pointing to its start location, and let t′ be the time at which this instance completes (that is, τ executes its end instruction). Then the response time of this instance of τ is t′ − t. We are interested in the worst case response time (WCRT) of τ, which is defined to be the supremum of the response times over all instances of τ and all executions of P in the given environment.

In a similar way we can define the WCRT of a block of code B in τ, where we take the initial time t to be the time the instance of τ is in the ready queue with the program counter pointing to the beginning of B, and t′ to be the time the last instruction of B completes.

We note that the response time of a task (or a block of code) may exceed its WCET, as the task may lose processor time due to preemption by higher priority tasks, or due to blocking lock attempts. To illustrate this, consider a periodic program with three tasks τ_1 (priority 1, period 20), τ_2 (priority 2, period 13), and τ_3 (priority 3, period 8). Suppose the tasks have a simple structure comprising straight-line code, and each of them takes and releases a common lock l. Let the WCET of each segment of the tasks be as shown in Fig. 5, and consider the portion of a possible execution of P shown in Fig. 6. We note that τ_2, which has a WCET of 3, is ready to run at time 39 but completes execution only at time 44; thus its response time in this instance is 5. This is due to the 2 units of processor time taken away by task τ_3 when it interrupts τ_2's execution. Notice also that the top priority task τ_3 is delayed by 1 unit of time waiting for τ_2 to release the lock it had acquired before being preempted.

Fig. 5: Block WCETs of tasks of example program

We say a periodic program P is schedulable if the WCRT of each task is less than or equal to its period. However, since it is difficult to know the exact WCRT, we will look for a conservative WCRT estimate which is less than or equal to the period of the task, to declare that a program is schedulable.

Fig. 6: Illustrating response time

#### 5.1 Computing Response Time without Locks

In the classical setting of periodic programs without locks, a conservative estimate of the WCRT of each task can be computed using Eq (1) below [12,13]. Let P = (V, L, T) be a periodic program. For convenience, we assume in the rest of this section that P has tasks τ_1, ..., τ_n with distinct priorities (we ignore the init task), and, without loss of generality, that τ_i has priority i. Further, each task τ_i has a WCET estimate C_i. Consider the equation below from [12], which in turn is based on the analysis in [13]. Here the R_i are variables representing the WCRT of the tasks τ_i.

$$R\_i = C\_i + \sum\_{j>i} (\lceil R\_i / T\_j \rceil \cdot C\_j). \tag{1}$$

Theorem 1 ([12,13]). The least solution to Eq (1), whenever it exists, is an upper bound on the WCRT of task τ_i.

Proof. Let L be any solution to Eq (1). We argue that L must upper bound the response time of any instance of task τ_i. Consider an instance of task τ_i that is enabled (enters the ready queue) at time t, and consider the time point t + L. If we ask how much processor time can be taken away in the interval [t, t + L] by a higher priority task τ_j, it is clearly bounded by ⌈L/T_j⌉ · C_j. Thus, the total time that can be taken away by all higher priority tasks put together is bounded by Σ_{j>i} (⌈L/T_j⌉ · C_j). This leaves at least C_i time for task τ_i to execute, and hence it must complete execution by t + L.

Algo. 1 below, which is similar to the recursive procedure proposed in [12], computes the least solution of Eq (1) for each task, yielding conservative estimates of the WCRT of the tasks, and thereby determines whether a periodic program is schedulable.

#### 5.2 Computing Response Time with Locks

Thm. 1 no longer holds (and Algo. 1 is no longer sound) when tasks are allowed to take locks. This can be seen from the example program and sample execution

#### Algorithm 1: Check Schedulability (No Locks)

Data: Periodic program P without locks, WCET estimates C_i for each task τ_i
Result: P schedulable or not, and if so, a WCRT estimate for each task

    foreach task τ_i do
        L_i^prev := 0;  L_i := C_i;
        while (L_i is not a solution to Eq (1) and L_i < T_i) do
            tmp := L_i;
            L_i := L_i + Σ_{j>i} ((⌈L_i/T_j⌉ − ⌈L_i^prev/T_j⌉) · C_j);
            L_i^prev := tmp;
        end
        if (L_i does not satisfy Eq (1) or L_i > T_i) then return "Unschedulable";
    end
    return "Schedulable", L_1, ..., L_n;
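As a sketch, the fixpoint iteration of Algo. 1 can be transcribed directly into C. Here tasks are indexed 1..n in increasing priority order, with WCETs C[] and periods T[] as inputs (the function name and indexing scheme are ours):

```c
#include <math.h>

/* Least solution of Eq (1) for task i, or -1 if it exceeds the period T[i].
   Tasks are indexed 1..n, with a higher index meaning higher priority. */
static double wcrt_no_locks(int i, int n, const double C[], const double T[]) {
    double L = C[i];
    while (L <= T[i]) {
        double next = C[i];
        for (int j = i + 1; j <= n; j++)       /* interference from higher priorities */
            next += ceil(L / T[j]) * C[j];
        if (next == L) return L;               /* fixpoint reached: L solves Eq (1) */
        L = next;                              /* monotone ascent to the least solution */
    }
    return -1.0;                               /* unschedulable: estimate exceeds T[i] */
}
```

Seeding with C_i and repeatedly re-evaluating the right-hand side of Eq (1) ascends monotonically, so the first fixpoint reached is the least solution, i.e., the desired WCRT estimate.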

in Figs. 5 and 6, where for instance task τ_3 has a response time of 3, but the least solution to the corresponding Eq (1) is 2. However, as we show below, it is possible to extend the classical approach to handle non-nested locks.

Before we consider the general case, it will be instructive to first consider the example program of Fig. 5. Let C_1, C_2, C_3 stand for the WCET estimates of tasks τ_1, τ_2, τ_3 respectively, and C_l^1, C_l^2, C_l^3 for the WCET estimates of the blocks B^1, B^2, B^3 respectively. Let us begin by asking what the response time of the block B^1 is. Recall that this is the portion of code between the lock(l) and unlock(l) statements in τ_1. Since B^1 does not contain any lock statements, its response time follows Eq (1), and we can write Eq (6) to capture its response time, U_l^1. In a similar way, the response time U_l^2 of the block B^2 is given by Eq (5). It is easy to see that the response time U_l^3 of the block B^3 in the highest priority task τ_3 is simply C_l^3.

Next, we consider the top priority task τ_3. The only extra time it may spend is in waiting for its lock(l) instruction to succeed. This may happen because one of the lower priority tasks has acquired lock l and is yet to release it. Suppose this task is τ_2. Then τ_2 must be somewhere in block B^2. But how long can it be before τ_2 releases l? At most the response time of B^2. In a similar way, if τ_1 has taken the lock, τ_3 may end up waiting for at most the response time of B^1. Note also that τ_3 may have to wait for at most one of τ_2 or τ_1 to complete its lock block, never both. Thus, its response time is given by Eq (2).

Now let us consider task τ_2. It may be delayed either (a) waiting for its lock(l) statement to succeed because τ_1 has taken the lock l, or (b) because τ_3 takes away some time by preempting it. The former is bounded by the response time of B^1, while the latter is bounded by the number of times τ_3 can interrupt it times the WCET of τ_3. Thus the response time of τ_2 is captured by Eq (3).

$$R\_3 = C\_3 + \max(U\_l^2, U\_l^1) \tag{2}$$

$$R\_2 = C\_2 + U\_l^1 + \left[R\_2/T\_3\right] \cdot C\_3 \tag{3}$$

$$R\_1 = C\_1 + \lceil R\_1/T\_3 \rceil \cdot C\_3 + \lceil R\_1/T\_2 \rceil \cdot C\_2 \tag{4}$$

$$U\_l^2 = C\_l^2 + \lceil U\_l^2 / T\_3 \rceil \cdot C\_3 \tag{5}$$

$$U\_l^1 = C\_l^1 + \lceil U\_l^1 / T\_3 \rceil \cdot C\_3 + \lceil U\_l^1 / T\_2 \rceil \cdot C\_2 \tag{6}$$

To find the least solution to Eqs (2–6), we can apply the analogue of Algo. 1 to first compute U_l^2 = 3.5 and U_l^1 = 6 using Eqs (5–6). We can then use these values to compute R_1 = 8, R_2 = 13, and R_3 = 8. Since these are within the respective periods of the tasks, we declare the program schedulable.

We can now tackle the general case. Consider a periodic program P = (V, L, T) satisfying the following assumptions (in addition to distinct priorities): locks are non-nested, that is, a task never acquires a lock while holding another; and for each task τ_i and lock l ∈ L, the lock(l)-unlock(l) blocks of τ_i are denoted B_{l,1}^i, ..., B_{l,n_{l,i}}^i, with WCET estimates C_{l,k}^i, where N_l^i denotes the number of lock(l) statements in τ_i.

The equations below capture the WCRT of the tasks and lock blocks of P. The variables here are the R_i, representing the WCRT of the tasks τ_i, and the U_{l,k}^i, representing the WCRT of the blocks B_{l,k}^i, respectively.

$$R\_i = C\_i + \sum\_{l \in L} (N\_l^i \cdot \max\_{j < i} U\_{l,k}^j) + \sum\_{j > i} (\lceil R\_i / T\_j \rceil \cdot C\_j) \tag{7}$$

$$U\_{l,k}^i = C\_{l,k}^i + \sum\_{j>i} (\lceil U\_{l,k}^i / T\_j \rceil \cdot C\_j) \tag{8}$$

Theorem 2. The least solution to the system of Eqs (7, 8), whenever it exists, is an upper bound on the corresponding WCRT of the tasks τ_i and the blocks B_{l,k}^i.

Proof. Once again we show that any solution to the system of equations (7) and (8) is an upper bound on the WCRT of the tasks and lock blocks of P, respectively. Let L_1, ..., L_n and L_{l,k}^i (for i ∈ {1, ..., n}, l ∈ L, and k ∈ {1, ..., n_{l,i}}) be a solution to the equations above. We first argue that the WCRT of a block B_{l,k}^i is bounded by L_{l,k}^i. Since the block is free of lock statements, this is like the classical case, and an argument similar to that of Thm. 1 shows that L_{l,k}^i is an upper bound on the WCRT of B_{l,k}^i.

To argue that the WCRT of task τ_i is bounded by L_i, consider an execution of an instance of task τ_i where it is made ready at time t, and consider the time interval from t to t + L_i. We claim that τ_i must finish its execution before t + L_i. Task τ_i may lose time for two reasons: (a) it is blocked on one of its lock(l) instructions because some other task τ has taken the lock l. Now it must be the case that τ is a task of lower priority than τ_i. For suppose τ had a higher priority than τ_i. Then either it must have got blocked after acquiring l and before releasing it, or it was preempted by a still higher priority task τ′. The former case is ruled out since we do not allow nested locks. We can now apply similar reasoning to τ′, and so on; but the buck must stop at the highest priority task. Since it cannot be preempted, it must be blocked waiting to acquire another lock; this is a contradiction to our no-nested-locks assumption. Thus, the total time that can be taken away due to τ_i waiting for locks is bounded by Σ_{l∈L} (N_l^i · max_{j<i} L_{l,k}^j) (corresponding to the second term in Eq. (7)). The second reason τ_i may lose time is (b) preemption by higher priority tasks. As before, this is bounded by Σ_{j>i} (⌈L_i/T_j⌉ · C_j) (the third term in Eq. (7)). Thus, at least C_i time must remain in the interval from t to t + L_i for τ_i to execute, and hence it must complete execution before t + L_i.

Algo. 2 computes the least solution to the system of Eqs. (7, 8), and checks schedulability of a periodic program with non-nested locks.

### 6 Rules for Disjointness

In this section we describe a set of rules which tell us when two tasks of a periodic program are disjoint (that is, can never happen in parallel). We will then use these rules to propose a race-detection algorithm for periodic programs.

#### 6.1 Disjoint Block Rules

Let P = (V, L, T) be a periodic program that (a) satisfies the no-nested-locks condition of Sec. 5.2, and (b) has WCRT estimates R_τ for each task τ satisfying R_τ ≤ T_τ (that is, P is schedulable). The rules below tell us when two whole task bodies, or two blocks within them, are disjoint. Fig. 7 illustrates Rules 1–5.

– Rule 1: Let τ and τ′ be two tasks in T such that:

	- τ and τ′ have the same priority (i.e., p_τ = p_{τ′}); and
	- neither τ nor τ′ shares a lock with a lower priority task.

	Then τ and τ′ are disjoint.

– Rule 2: Let τ and τ′ be two tasks in T such that:

	- τ and τ′ have the same period (i.e., T_τ = T_{τ′}); and
	- neither τ nor τ′ shares a lock with a lower priority task.

	Then τ and τ′ are disjoint.

– Rule 3 (Low-Multiple-of-High): Let τ_l and τ_h be two tasks in T such that:

	- τ_l has a lower priority than τ_h (i.e., p_{τ_l} < p_{τ_h});
	- the period of τ_l is a multiple of the period of τ_h; and
	- τ_h does not share a lock with any task of priority lower than τ_l.

	Then τ_l and τ_h are disjoint.

#### Algorithm 2: Check Schedulability With Locks

Data: Periodic program P with locks, WCET estimates C_i for each task τ_i and C_{l,k}^i for each lock block B_{l,k}^i
Result: P schedulable or not; if schedulable, WCRT estimates for each task

    foreach block B_{l,k}^i do
        L_{l,k}^{i,prev} := 0;  L_{l,k}^i := C_{l,k}^i;
        while (L_{l,k}^i does not satisfy Eq (8) and L_{l,k}^i < T_i) do
            tmp := L_{l,k}^i;
            L_{l,k}^i := L_{l,k}^i + Σ_{j>i} ((⌈L_{l,k}^i/T_j⌉ − ⌈L_{l,k}^{i,prev}/T_j⌉) · C_j);
            L_{l,k}^{i,prev} := tmp;
        end
        if (L_{l,k}^i does not satisfy Eq (8) or L_{l,k}^i > T_i) then return "Unschedulable";
    end
    foreach task τ_i do
        L_i^prev := 0;  L_i := C_i + Σ_{l∈L} (N_l^i · max_{j<i} L_{l,k}^j);
        while (L_i does not satisfy Eq (7) and L_i < T_i) do
            tmp := L_i;
            L_i := L_i + Σ_{j>i} ((⌈L_i/T_j⌉ − ⌈L_i^prev/T_j⌉) · C_j);
            L_i^prev := tmp;
        end
        if (L_i does not satisfy Eq (7) or L_i > T_i) then return "Unschedulable";
    end
    return "Schedulable", L_1, ..., L_n;


– Rule 4 (High-Multiple-of-Low): Let τ_l and τ_h be two tasks in T such that:

	- τ_l has a lower priority than τ_h (i.e., p_{τ_l} < p_{τ_h});
	- the period of τ_h is a multiple of the period of τ_l; and
	- τ_h does not share a lock with any task of priority lower than τ_l.

	Then τ_l and τ_h are disjoint.

– Rule 5 (Low-WCRT): Let τ_l and τ_h be two tasks in T such that:

	- τ_l has a lower priority than τ_h (i.e., p_{τ_l} < p_{τ_h});
	- T_{τ_h} is not a multiple of T_{τ_l}; and
	- the WCRT estimate R_{τ_l} of τ_l is at most m (i.e., R_{τ_l} ≤ m), where m is the smallest positive element of the set

	{(k · T_{τ_h}) mod T_{τ_l} | k ∈ N}

	(note that such an m must exist by the second condition above).

	Then τ_l and τ_h are disjoint.

– Rule 6 (Lock): Let B_l and B_l′ be two lock(l)-unlock(l) blocks in distinct tasks τ and τ′, respectively. Then B_l and B_l′ are disjoint.

We now show that Rules 1–6 are sound.

Theorem 3. Consider a periodic program P with no nested locks, and with WCRT estimates that make it schedulable. If two tasks or blocks satisfy the premise of one of the rules above, then the identified tasks or blocks are indeed disjoint in P.

Proof. Let us fix a periodic program P without nested locks, and with WCRT estimates R_τ for each task τ in P which witness the schedulability of P. Now suppose τ and τ′ are two tasks in P satisfying the premise of Rule 1, namely that they have the same priority and neither of them shares a lock with a lower priority task. If there were no higher priority tasks and τ and τ′ took no locks at all, then clearly instances of τ and τ′ can never overlap in their execution, since neither can preempt the other. However, even if there were a higher priority task, say τ″, note that by our scheduling semantics, if τ″ were to interrupt τ during its execution, τ would resume execution ahead of any other task of the same priority that may be ready. So τ and τ′ cannot interleave due to preemption by a higher priority task. The other possible cause for interleaving would be that, say, τ gets blocked while trying to take a lock l that is already held by some other task of higher or lower priority. However, as argued earlier, a higher priority task holding l is ruled out, and the case of a lower priority task holding l is ruled out by the premise of Rule 1. Thus it follows that τ and τ′ cannot overlap in any execution. The soundness of Rule 2 follows by a similar argument.

For Rule 3, suppose the period of τ_l is a multiple of that of τ_h. Say τ_l is made ready at some time t (which must be a multiple of its period T_{τ_l}). Now either t is also a multiple of T_{τ_h}, in which case τ_h will begin execution before τ_l, or τ_h is next scheduled at some time t′ > t. In the former case, the only reason τ_h may not complete before τ_l gets to execute is that τ_h is blocked on acquiring a lock. As in earlier arguments, this lock can only have been acquired by a task of priority lower than τ_l; but this is ruled out by the premise of the rule. In the latter case, by the premise of the rule, t + R_{τ_l} ≤ t′. Hence τ_l will complete its execution before τ_h can preempt it at t′.

For Rule 4, suppose T_{τ_h} is a multiple of T_{τ_l}. Consider a time t when τ_l is made ready. If τ_h is not also enabled at t, then by schedulability, τ_l must complete before t + T_{τ_l}, which is before the time τ_h is enabled next. Hence they cannot overlap in this case. If τ_h is also enabled along with τ_l at t, then it must

Fig. 7: Illustrating Rules 1–5

begin execution before τ_l does. The only reason it may not complete before τ_l is allowed to begin execution is that it is blocked on acquiring a lock l held by a task of lower priority than τ_l. But this is ruled out by the premise of the rule.

For Rule 5, again consider τ_l and τ_h satisfying the premise of the rule. Let t be a time point at which τ_l is made ready. Either t is a multiple of T_{τ_h}, in which case τ_h is also made ready at the same time, or τ_h next arrives at some time t′ later than t. The former case is similar to the situations considered earlier, and the instances of τ_l and τ_h cannot overlap. In the latter case, by the premise of the rule, we must have t + R_{τ_l} ≤ t + m ≤ t′, and hence τ_l finishes its execution by t′, and the two tasks cannot overlap. The soundness of Rule 6 is standard.

#### 6.2 Computing the value m in Rule 5

Rule 5 requires us to compute the value m, which is the smallest positive remainder that we can obtain by dividing an integral multiple of T_{τ_h} by T_{τ_l}. It is not difficult to see that all possible remainders must occur in the interval [0, T], where T is the LCM of T_{τ_l} and T_{τ_h}. Thus it is sufficient to look at the multiples of T_{τ_h} up to T, and set m to be the minimum positive remainder obtained by dividing these by T_{τ_l}.
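A direct C rendering of this computation (the function name is ours; both periods are assumed positive):

```c
/* Smallest positive value of (k * Th) mod Tl over k = 1, 2, ..., lcm(Th,Tl)/Th.
   Returns 0 if no positive remainder exists (i.e., Tl divides Th). */
static long smallest_remainder(long Th, long Tl) {
    long a = Th, b = Tl;
    while (b != 0) { long t = a % b; a = b; b = t; }   /* a = gcd(Th, Tl) */
    long lcm = (Th / a) * Tl;
    long m = 0;
    for (long k = Th; k <= lcm; k += Th) {             /* multiples of Th up to the LCM */
        long r = k % Tl;
        if (r > 0 && (m == 0 || r < m)) m = r;
    }
    return m;
}
```

For instance, for the periods in the example of Fig. 5, T_{τ_h} = 8 and T_{τ_l} = 13, the multiples of 8 up to lcm(8, 13) = 104 leave remainders 8, 3, 11, 6, 1, ... modulo 13, so the function returns m = 1.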

#### 6.3 Race Detection Algorithm

We now present the algorithm to detect races in periodic programs. Algo. 3 first identifies the set of shared variables accessed in the program and then lists all the conflicting access pairs, all of which are initially assumed to be potentially racy. Using the rules of Sec. 6 and the lockset analysis described next, the algorithm then prunes out the pairs of accesses found to be non-racy.

An iterative lockset analysis computes the set of locks held at each statement of a program P. At the program entry, no locks are assumed to be held. For the lock(l) command, the lockset after the command is the lockset before it together with the lock l; for the unlock(l) command, it is the lockset before it with the lock l removed; for any other command, the lockset is unchanged. The join operation in this analysis is the intersection of locksets.
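As a sketch, the transfer function and join of this analysis can be coded directly, representing locksets as bit masks. The code below reuses the illustrative cmd_kind enum from our earlier representation sketch in Sec. 3 (all names are ours):

```c
typedef unsigned int lockset;   /* bit i set <=> lock i is held */

/* Transfer function: lockset after a command, given the lockset before it. */
static lockset transfer(cmd_kind k, int lock, lockset before) {
    switch (k) {
    case LOCK:   return before | (1u << lock);    /* acquire: add the lock    */
    case UNLOCK: return before & ~(1u << lock);   /* release: remove the lock */
    default:     return before;                   /* other commands: unchanged */
    }
}

/* Join at CFG merge points: only locks held on all incoming paths survive. */
static lockset join(lockset a, lockset b) { return a & b; }
```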

The algorithm uses the notion of covers, which needs further explanation. Let τ_1 and τ_2 be two tasks in a periodic program P, and let s_1 and s_2 be two statements in P. We say the pair of tasks (τ_1, τ_2) covers the pair of statements (s_1, s_2) if either s_1 is a statement in G_{τ_1} and s_2 is a statement in G_{τ_2}, or vice versa (i.e., s_1 in G_{τ_2} and s_2 in G_{τ_1}).
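The cover check itself is a simple symmetric membership test; a minimal C sketch (names ours):

```c
/* Does the task pair (t1, t2) cover the statement pair (s1, s2)?
   task_of[s] gives the task containing statement s. */
static int covers(int t1, int t2, int s1, int s2, const int task_of[]) {
    return (task_of[s1] == t1 && task_of[s2] == t2) ||
           (task_of[s1] == t2 && task_of[s2] == t1);
}
```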

### 7 Experimental Evaluation

In this section we first describe the implementation of Algo. 3 to detect races in periodic programs. We then explain the benchmarks used to evaluate the implementation followed by a discussion of the results.

#### Algorithm 3: Race Detection

Data: Periodic program P
Result: List of potential races PR

    Identify the set of shared variables V;
    Find the list CA of conflicting accesses on V;
    PR := CA;
    Find the list DT of disjoint task pairs using the rules in Sec. 6;
    foreach pair (s_1, s_2) of conflicting accesses in PR do
        if there is a pair (τ_1, τ_2) of tasks in DT such that (τ_1, τ_2) covers (s_1, s_2) then
            PR := PR − {(s_1, s_2)};          // (s_1, s_2) is non-racy
    end
    Perform lockset analysis on each task in P;
    foreach pair (s_1, s_2) of conflicting accesses in PR do
        let L_1 be the lockset at s_1 and L_2 the lockset at s_2;
        if L_1 ∩ L_2 ≠ ∅ then
            PR := PR − {(s_1, s_2)};          // (s_1, s_2) is non-racy
    end
    return PR;                                // set of potential races

#### 7.1 Implementation

We implemented Algo. 3 in the tool PePRacer [19], as shown in Fig. 8. The tool has a preprocessor, which inlines the functions in the input program, and a time analyzer, which computes the WCET of the tasks using Heptane [11] and then their WCRT using Algo. 2. The CA generator identifies the shared accesses in the program (essentially, accesses to global variables or to shared locations through pointers) and lists the conflicting access pairs. The Rules Checker applies the rules of Sec. 6 to the conflicting accesses, using the response times to identify disjoint task pairs, and eliminates the pairs found to be non-racy. The Lockset Analyzer computes the locks held at each statement in the program and eliminates further non-racy pairs from the remaining conflicting accesses. The tool finally displays the potentially racy pairs.

We implemented PePRacer in the OCaml-based C Intermediate Language (CIL) static analysis framework [15]. The Inliner step in PePRacer uses the built-in inline pass of CIL, while the lockset algorithm and the Rules Checker are implemented as new passes in CIL. The implementation of the WCET Analyzer is explained next.

WCET Analysis. WCET analysis was carried out on the benchmarks using the Heptane [11] tool. Heptane accepts inputs in the form of C programs. To prepare

Fig. 8: Schematic of PePRacer

the benchmark programs, the following modifications were made to them: all non-C constructs in the benchmarks were translated to suitable C constructs (e.g., TASKs in OSEK programs were converted to correspondingly named functions); all code was merged into a single C file; since Heptane needs the source code of the entire program being analysed, and some benchmark programs did not have source available for some of their parts, all such code was replaced with minimal stubs; and loop bounds were provided using ANNOT\_MAXITER, as required by Heptane. These loop bounds were computed by manual inspection.
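For illustration, a loop-bound annotation of the kind described above might look as follows. This is only a schematic sketch: the exact annotation mechanism and syntax are Heptane's, and the macro rendering and function here are invented for illustration (the macro is defined as a no-op so the sketch compiles):

```c
/* Schematic: ANNOT_MAXITER marks a loop with its maximum iteration count for
   the WCET analysis. Defined as a no-op here purely so the sketch compiles;
   consult the Heptane documentation for the real annotation form. */
#ifndef ANNOT_MAXITER
#define ANNOT_MAXITER(n)
#endif

int sum_samples(const int samples[8]) {
    int total = 0;
    for (int i = 0; i < 8; i++) {
        ANNOT_MAXITER(8);   /* loop bound, computed by manual inspection */
        total += samples[i];
    }
    return total;
}
```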

For each benchmark, the WCET was computed separately for each of its task entry functions. Heptane supports WCET analysis for the ARM and MIPS architectures. Where possible, the analysis was run using default settings for both architectures. The difference between the WCET results for the two architectures was found to average around 4%, never exceeding 20%. In our analysis we use the values for the ARM architecture.

Some aspects that may lead to our WCET estimates not being conservative are as follows:


For more accurate WCET analysis, data corresponding to the specific target architecture under consideration should be used. Several WCET analysis tools are available [21], both commercial and academic. The choice of analysis tool influences the accuracy of the WCET analysis.

#### 7.2 Benchmarks

We tested the implementation on the benchmark periodic programs shown in Table 2. Most real-world periodic programs are proprietary and difficult to gain access to. Hence we resorted to programs from the nxtOSEK benchmark set, the lego-osek-master project, the ev3OSEK benchmark set, the nxt-osek-sumo-master project, the AADLib benchmark set [1], and examples from [10] and [14] for the evaluation of the tool. The programs in AADLib are configured to run on FreeRTOS, while the others are designed to run on the OSEK real-time operating system. The program fse\_obstacle.c implements a simplified version of a robotic controller which detects obstacles in its proximity, while avionics.c specifies the general functions, data interactions, and timing constraints of a hypothetical avionics Mission Control Computer (MCC) system. Biped\_robot.c is a sample program for a LATTEBOX NXTe/LSC based biped robot. Sumo.c implements a robot which attempts to push its opponent out of a circle. A Bluetooth-based radio-controlled car is implemented in nxtgt.c. In lego\_osek.c, a robot detects obstacles and avoids collision by changing its angle and speed. Objectfollower.c implements a follower: it moves forward as long as the object it tracks moves forward, and stops when the object stops; follower.c is similar. A two-wheeled self-balancing radio-controlled robot is implemented in nxtway\_gs.c. Ardupilot.c, taken from [1], is a simple version of the popular autopilot system supporting many vehicle types. sumoR.c and carR.c are racy versions of the programs sumo.c and car.c, respectively.

We have annotated the programs with task attributes like periodicity, priority, and WCET, along with details of the locks held. The non-periodic tasks in some of the programs are treated as tasks with a high period. We have inlined the helper functions called in the tasks, along with the calls to library functions; this brings out the accesses to shared structures in the libraries. For example, the ecrobot library function ecrobot\_set\_motor\_speed, which is called in lego\_osek.c, accesses the shared NXT\_PORT\_A port. The GetResource and ReleaseResource functions, used to take and release locks respectively, are treated as the lock and unlock commands in our analysis. Note that in OSEK, resources are locked according to the Priority Ceiling Protocol (PCP); for our evaluation, however, we assume these programs use standard locks. We believe the placement of locks would not change even if the developer were using standard locks. FreeRTOS supports the use of standard locks.
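Concretely, this treatment amounts to a purely syntactic mapping of the OSEK resource API onto the lock and unlock commands of our program model; a hypothetical analysis-time shim of the following form captures it (the mapping is our modeling assumption, not part of OSEK itself):

```c
/* Hypothetical shim: model OSEK resource acquisition and release as the
   standard lock/unlock commands of the program model for analysis purposes. */
#define GetResource(res)      lock(res)
#define ReleaseResource(res)  unlock(res)
```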

#### 7.3 Results

We ran our tool on the benchmark programs on an Intel Quad Core i7-3770 3.40GHz machine running Ubuntu 18.04.4. Table 2 shows the results. The "Tasks" column gives the number of tasks in the program, "Sched." indicates whether the program is schedulable (by Algo. 2), the number of conflicting accesses in a program is listed under the "CA" column, and the count of potentially racy pairs is given under the "PR" column. The "%Elim." column gives the percentage of conflicting accesses found to be non-racy. The last column gives the time taken by the tool, measured using the Linux time command.


Table 2: Results

Our tool detects the avionics.c program to be non-schedulable, which is also detected by [14]. Rules 3, 4, and 5 depend on the response times of the tasks, so we bypassed the application of these rules for avionics.c; its "PR" entry gives the count of potentially racy pairs detected after applying the other rules. The last two rows of the table show the data for benchmarks that have been modified to make them racy, by changing the periods, execution times, etc. Our tool is able to filter out a large fraction of the conflicting access (CA) pairs as non-racy (on average, 97% of CA pairs are eliminated).

Table 3 gives the coverage of Rules 1–6. Here each rule is applied independently to the conflicting accesses, to demonstrate the value of each rule separately. Column "R1" gives the count of CA pairs flagged as non-racy by Rule 1 alone, and similarly for the other columns. Recall that the non-trivial rules, Rules 3–5, use periodicity and/or response times to declare CA pairs non-racy; a careful look at their counts in Table 3 reveals their usefulness, as some pairs are detected by these rules while not being covered by the simpler ones. It is also worth observing that the CA pairs detected as non-racy by Rule 6 (the lock-based rule) are all covered by the other rules. Developers can use this information to decide whether to use expensive constructs like lock-unlock to ensure mutual exclusion when task periodicity and response times already ensure it by themselves.


Table 3: Rule Coverage

### 8 Related Work

We begin with work related to computing response times and schedulability analysis. Apart from the work of [13,12] already mentioned, feasibility analysis for real-time periodic tasks without locks has been studied by Baruah et al. [4] and Pellizzoni and Lipari [16]. Baruah [3] studies schedulability under Earliest Deadline First and the Stack Resource Policy (EDF+SRP) and gives an efficient algorithm for checking schedulability. Bertogna et al. [5] study resource holding times (how long a task may hold a lock/resource) and give algorithms for computing and minimizing these times.

In closely related classical work on real-time systems with locks, Sha et al. [18] consider a very general setting of priority-based preemptive scheduling, with FCFS among waiting tasks of the same priority (similar to our setting) and arbitrarily nested locks, and give sufficient conditions for the schedulability of programs under these conditions. However, the locks they consider are priority-inheritance locks, which elevate the priority of a task in a critical section to a level based on the priorities of the tasks waiting for (or that might acquire) the resource. Programs with such locks have the useful property that the blocking time of a task is bounded by the longest WCET of a lock block (critical section) of a lower priority task. This facilitates their analysis and their bounds on response time. In our setting of standard locks (though restricted to be non-nested), it is not clear whether such properties can be exploited.

Related work on the verification of periodic programs can be broadly classified into two categories: verification of periodic programs using techniques like model checking and symbolic execution, and detection of data races, using static analysis techniques, in embedded programs similar to periodic programs.

Periodic programs with tasks prioritized in a rate-monotonic fashion and communicating via shared variables have been verified against safety properties using bounded model checking, with different kinds of locks, in [7], [6], and [8]. In the first paper of the series [7], the authors provide a time-bounded verification of safety properties, where the sequentializations of programs are considered with respect to the number of jobs of each task within the time bound. Priority and preemption locks are considered in [7], and the work is extended to Priority Inheritance Protocol (PIP) locks in [8]. [6] proposes a new sequential composition mechanism to reduce the number of sequentializations and make bounded verification scalable. However, the verification is bounded to a certain depth and in general cannot be used to soundly detect all data races.

PLC programs are very similar to our periodic programs and are widely used in embedded safety-critical software. Symbolic execution of PLC programs is developed in [10], where the authors convert PLC programs into C programs and use their rate-monotonic, priority-based, preemptive scheduling semantics to reduce the number of interleavings considered. The only way to use their symbolic execution to detect data races would be for the developer to introduce a counter for each shared variable, increment and decrement this counter around accesses, and then check for violations of assertions that encode racy accesses to these variables. This technique is unlikely to scale.

Static-analysis-based techniques for detecting data races in embedded software kernels and applications have been of recent research interest [17], [9], [20]. Schwarz et al. [17] provide an algorithm to detect data races in multi-task programs with priority ceiling locks. Additional synchronization mechanisms, including dynamic threads and suspend/resume of the scheduler and tasks, are considered in [20]. Both works exploit priorities and locks, but do not consider periodicity and WCRT information as we do, and would give less precise results on the class of periodic programs considered in this paper.

### 9 Conclusion

In this work we have proposed a technique for statically detecting data races in periodic real-time programs with locks. Our contributions include a response time analysis for such programs when locks are used in a non-nested manner. Going forward, interesting directions include using the insights of this paper to perform precise and efficient data-flow analysis for such programs; improving the tightness of the response time analysis; and extending the technique to detect high-level races in this class of programs, as well as to periodic programs with other locking mechanisms, such as priority-inheritance locks, and other scheduling policies.

### References



### Probabilistic Total Store Ordering

Parosh Aziz Abdulla<sup>1</sup>, Mohamed Faouzi Atig<sup>1</sup>, Raj Aryan Agarwal<sup>2</sup>, Adwait Godbole<sup>3</sup>, and Krishna S.<sup>2</sup>

<sup>1</sup> Uppsala University, Sweden <sup>2</sup> IIT Bombay, India <sup>3</sup> University of California, Berkeley, USA adwait@berkeley.edu

Abstract. We present Probabilistic Total Store Ordering (PTSO), a probabilistic extension of the classical TSO semantics. For a given (finite-state) program, the operational semantics of PTSO induces an infinite-state Markov chain. We resolve the inherent non-determinism due to process scheduling and memory updates according to given probability distributions. We provide a comprehensive set of results showing the decidability of several properties for PTSO, namely (i) Almost-Sure (Repeated) Reachability: whether a run, starting from a given initial configuration, almost surely visits (resp. almost surely repeatedly visits) a given set of target configurations. (ii) Almost-Never (Repeated) Reachability: whether a run from the initial configuration almost never visits (resp. almost never repeatedly visits) the target. (iii) Approximate Quantitative (Repeated) Reachability: to approximate, up to an arbitrary degree of precision, the measure of runs that start from the initial configuration and (repeatedly) visit the target. (iv) Expected Average Cost: to approximate, up to an arbitrary degree of precision, the expected average cost of a run from the initial configuration to the target. We derive our results through a non-trivial combination of results from the classical theory of (infinite-state) Markov chains, the theories of decisive and eager Markov chains, specific techniques from combinatorics, as well as decidability and complexity results for the classical (non-probabilistic) TSO semantics. As far as we know, this is the first work that considers probabilistic verification of programs running on weak memory models.

### 1 Introduction

The classical Sequential Consistency (SC) semantics [1] has been a fundamental assumption in concurrent programming. SC guarantees that process operations are atomic: a write operation performed by a given process is immediately visible to all the other processes. However, designers of modern computer systems, in their quest for increased system efficiency, often sacrifice the SC guarantee. Instead, the processes communicate asynchronously, allowing a delay in the propagation of write operations. Due to the propagation delay, written values can become available to processes at different time points, and in an order that may differ from the order in which they were generated. This asynchronous behavior gives rise to new semantics, collectively referred to as weak memory models [2]. In the presence of weak memory models, programs exhibit new, and often unexpected, behaviors, bringing about complex challenges in the design and analysis of concurrent systems. Even text-book programs may behave erroneously; the classical Dekker mutual exclusion protocol is a case in point. The ubiquity of weak memory models has led to an extensive research effort on the testing and verification of concurrent programs running under such semantics.

Existing works on the verification of programs running on weak memory models consider safety properties such as state reachability, assertion violation, and robustness. While safety properties are fundamental, we also need to prove liveness properties, i.e., to show that the program indeed makes progress. This is, of course, true already in the case of SC: a program such as a mutual exclusion protocol needs to guarantee that each process will eventually reach its critical section. The satisfiability of liveness properties often depends on the type of fairness conditions on process executions provided by the underlying platform [3,4]. The reason is the presence of concurrency non-determinism, i.e., the inherent non-determinism in program behavior due to the different possible ways in which the scheduler can interleave the processes. The scheduler may always neglect a given process, which means that the process will never make progress (e.g., never reach its critical section). Therefore, we need the scheduler to follow a fair selection policy that allows each process to advance in its execution. The situation is even more complicated in the case of weak memory models, since we also need to deal with a second source of non-determinism, besides concurrency non-determinism, namely (data) propagation non-determinism. Since write operations are propagated asynchronously, there is in general no way to predict if, when, and in which order write operations become visible to the processes.

In this paper we present a framework for the verification of liveness properties for concurrent programs running under the classical Total Store Ordering (TSO) semantics [5]. The TSO model puts an unbounded store (write) buffer between each process and the main memory. The buffer carries pending write operations that have been performed by the process. These operations are propagated from the buffer to the shared memory in a FIFO manner. When a process performs a write operation, it appends the operation as a message to its buffer. When a process reads a variable, it searches its buffer for a pending write operation on that variable. If such operations exist then it reads from the most recent one. If no such operation exists, it fetches the value of the variable from the main memory. The TSO propagation mechanism is a typical example of how propagation non-determinism arises: the write operations are propagated to the shared memory non-deterministically, and a process sees the other processes' write operations only when the latter are available in the memory. Therefore, having a scheduler that fairly selects the processes is not sufficient. We also need to ensure that the write operations propagate to the processes sufficiently often.

Traditional fairness conditions such as strong or weak fairness [3,4,6] cannot capture propagation policies adequately, since they unrealistically allow slow propagation, i.e., they allow write operations to propagate at a lower rate than the rate at which they are issued. For instance, strong fairness guarantees that messages are transferred infinitely often from the buffers to the memory. Still, it does not constrain the relative frequency of write and update operations, and hence it does not prevent the buffer contents from growing unboundedly. In such a scenario, more and more un-propagated messages may accumulate inside the buffers, and a given process may, from some point on, be confined to reading only its own writes, since it will not see the memory updates made by the other processes. Accordingly, verifying liveness properties subject to strong fairness may wrongly deem the system incorrect: even if a process is selected infinitely often by the scheduler and write operations are propagated infinitely often to the memory, the process may incorrectly be judged not to make progress due to slow propagation.

While slow propagation can arise in theory under the above-mentioned fairness conditions, it is almost never observed in practice. Existing platforms implement different policies, such as invalidation or write-back policies, to flush the buffers at regular intervals [7,8]. This prevents the buffers from growing beyond a certain size, and implicitly ensures propagation fairness. In fact, this is true to the degree that non-SC behaviors are (relatively) rarely observed on TSO platforms [9,10].

In this paper, we perform verification of liveness properties for concurrent programs under TSO using probabilistic fairness [11]. As far as we know, this is the first work that considers probabilistic verification of programs running on weak memory models. In our model, both process scheduling and message propagation are carried out according to given probability distributions. We assign a weight (a natural number) to each process. We resolve concurrency non-determinism probabilistically by letting the scheduler select the next process to execute with a probability that reflects the weight of the process compared to the weights of the other processes that are enabled in the same configuration. After each process step, we allow an update step, in which the buffers transfer parts of their contents to the memory. We make the probability distribution uniform over all possible update operations in the given configuration<sup>4</sup>. As we will see later in the paper, defining the model in this way implies that we assign low probabilities to program runs that unboundedly increase the number of messages inside the buffers. Accordingly, our model is more faithful to real program behavior than models induced by non-probabilistic fairness conditions.

We perform a comprehensive analysis of the decidability of verifying liveness properties for concurrent programs running under the TSO semantics, subject to probabilistic fairness. In fact, verifying programs running on the TSO memory model poses a difficult challenge even for safety properties. The unboundedness of the buffers implies that the state space of the system is infinite, even when the input program is finite-state [12,13]. Similarly, the operational semantics of our model gives rise to Markov chains with infinite state spaces. Furthermore, in general, liveness properties give rise to more difficult problems than safety properties, since the former are interpreted over

<sup>4</sup> Our framework allows several other types of probability distributions (see Sec. 9).

infinite program executions while the latter are interpreted over finite executions. Our results rely on nontrivial combinations of results from the classical theory of (infinite-state) Markov chains [14,15], the theories of decisive and eager Markov chains [16,17], specific techniques from combinatorics [18], as well as decidability and complexity results for the classical (non-probabilistic) TSO semantics [19,13]. Concretely, we show the decidability of the following problems, each of which is defined by giving an initial configuration γinit and a set Target of process target states.

Qualitative Analysis (Sec. 6). In qualitative reasoning, we are interested in knowing whether the given property is satisfied with probability 1 (almost surely satisfied), or with probability 0 (almost never satisfied). We show that the satisfiability of these properties can be reduced to similar problems on the underlying (non-probabilistic) transition systems for classical TSO. The actual probabilities appearing in the induced Markov chains are then inconsequential; only their non-zeroness matters. This is useful whenever the probabilities have not been measured exactly, or the portion of the system giving rise to probabilistic behavior has not been designed yet. We consider the following flavors of qualitative analysis: Almost-Sure (Repeated) Reachability<sup>5</sup>: whether a run of the system from γinit will almost surely visit (resp. repeatedly visit) Target; Almost-Never (Repeated) Reachability: whether a run of the system from γinit will almost never visit (resp. repeatedly visit) Target. Furthermore, we show that all these problems have non-primitive-recursive complexities.

Quantitative Analysis (Sec. 7). The task is to estimate to an arbitrary degree of precision the probability by which a run from γinit (repeatedly) visits Target, rather than only checking whether the probability is equal to one or zero.

Expected Average Cost (Sec. 8). We study the expected cost of runs that start from γinit until they reach Target. To that end, we extend our model with a cost function that assigns a fixed cost to each instruction in the language. Calculating expected costs of runs has many potential applications. For instance, one might be interested in the mean-time of reaching a target, i.e., the average number of steps before reaching the target [20]. In the context of weak memory models in general, and TSO in particular, one can perform a more refined analysis by also taking into account the fact that specific instructions, e.g., memory fences, have higher costs [21]. Incorporating instruction costs in the model makes average cost analysis reflect the efficiency of the program more faithfully than an instruction-count-based metric. There have been several approaches towards optimizing fence implementations in hardware [22,23,24], which exploit the fact that non-SC behaviours are rare even in unfenced code. A quantitative analysis of the prevalence of behaviours and of the cost of executing instructions can help determine the efficacy of such implementations.

<sup>5</sup> While repeated reachability is a liveness property, plain reachability in the non-probabilistic case is a safety property. However, in the presence of probabilities, plain reachability measures the probability of convergence towards a target state, and hence it can be considered a form of liveness property. In any case, this is a matter of definition and has no bearing on the rest of the paper.

The supplementary material [25] contains detailed proofs of all the lemmas and theorems.

Related Work Only recently has there been increased interest in the formulation and verification of liveness properties for weak memory models. The work in [26] factors the system into process and memory subsystems and defines notions of fairness for each. This is reminiscent of our approach, where we consider probabilistic policies for process scheduling and memory updates. Their model, on the other hand, is non-probabilistic and comes with weaker fairness guarantees, which we describe in more detail in Sec. 5.1. The liveness verification problem for TSO has been considered in [27], which shows undecidability for various liveness properties. However, that work again uses non-probabilistic notions of fairness. We show in this paper that with stronger (probabilistic) fairness, the reachability and repeated reachability problems become decidable.

The authors of [12] show the undecidability of the repeated reachability problem, without fairness conditions, for finite-state programs running under the TSO semantics. In contrast, we show that checking repeated reachability qualitatively is decidable (Sec. 6.2), and that we can even compute the measure of runs satisfying the property with arbitrary precision (Sec. 7.2).

There has been a huge amount of work on the verification of finite-state Markov chains (see, e.g., [20,28]). However, since the buffers in TSO are unbounded, we get an infinite-state Markov chain. There is also a substantial literature on the verification of infinite-state Markov chains, where specialized techniques are developed for particular classes of systems. Several works have considered probabilistic push-down automata and probabilistic recursive machines [29,30,31]. However, these techniques do not apply in our case, since push-down automata cannot encode the FIFO store-buffer data structure.

Works such as [32,16,33,34] develop algorithmic and complexity results for checking termination and reachability for systems such as probabilistic VASS, probabilistic Petri nets, and probabilistic multi-counter systems. Again, these models are different from ours and cannot encode FIFO queues.

The works closest to ours are those on probabilistic lossy channel systems [16,17]. These works also rely on the frameworks of decisive and eager Markov chains. However, lossy channel systems and TSO are fundamentally different, and the manner in which we instantiate the frameworks of decisive/eager Markov chains differs. The decidability of verification for probabilistic extensions of lossy channel systems is sensitive to the definition of message losses: if messages are only allowed to be lost at one end of the channel (a model that is close to our notion of memory updates), then all non-trivial verification problems become undecidable for probabilistic lossy channel systems [35]. Therefore, although there is a reduction from TSO to lossy channel systems in the case of non-probabilistic models [12], we know of no such reduction between the corresponding probabilistic models.

Finally, the concept of decisiveness has been extended to more general models such as generalized semi-Markov processes, stochastic timed automata [36], and lossy channel-based stochastic games [37].

### 2 Preliminaries

In this section, we introduce notation and recall the basics of transition systems, temporal logic, and Markov chains.

Basic Notation The size of a set A is denoted by |A|. We use A^∗ and A^ω to denote the sets of finite resp. infinite words over (a possibly infinite set) A, and let ε be the empty word. For a word w, |w| denotes its length (|w| = ∞ if w is infinite). For i : 1 ≤ i ≤ |w|, we use w[i] to denote the i-th element of w. We define head(w) := w[1] and tail(w) := w[2] ··· w[|w|]. We use a ∈ w to denote that w[i] = a for some i : 1 ≤ i ≤ |w|. For words w1 ∈ A^∗ and w2 ∈ (A^∗ ∪ A^ω), we use w1 · w2 to denote their concatenation. For k ∈ N, we define A^k := {w ∈ A^∗ | |w| = k}, i.e., the set of words over A of length k.

Transition Systems A transition system is a pair ⟨Γ, →⟩, where Γ is a (potentially infinite) set of configurations and → ⊆ Γ × Γ is the transition relation. We write γ → γ′ to denote that ⟨γ, γ′⟩ ∈ →, and write ∗→ for the reflexive transitive closure of →. For k ∈ N, we write γ k→ γ′ to denote that there is a sequence γ0 → γ1 → ··· → γk with γ0 = γ and γk = γ′, i.e., a sequence of k transition steps leading from γ to γ′. For ∼ ∈ {<, ≤, =}, we write γ ∼k→ γ′ to denote that γ m→ γ′ for some m : 0 ≤ m ∼ k.

Temporal Logic A run ρ of a transition system T = ⟨Γ, →⟩ is an infinite word γ0 γ1 ... of configurations such that γi → γi+1 for all i ≥ 0. We use ρ[i] to denote γi. We say that ρ is a γ-run if ρ[0] = γ, and use Runs(γ) to denote the set of γ-runs. A path π is a finite prefix of a run, and a γ-path is a finite prefix of a γ-run. We use the standard notation γ |=T φ to state that γ satisfies the CTL∗ state formula φ, and ρ |=T φ to state that ρ satisfies the path<sup>6</sup> formula φ. We refer the reader to [38] for the details of CTL∗.

For γ ∈ Γ and G ⊆ Γ, we say that G is reachable from γ, denoted γ |=T ∃♦G, if there is a γ-run ρ such that ρ[i] ∈ G for some i. For k ∈ N, γ ∈ Γ, and G ⊆ Γ, ρ |=T ♦^k G says that ρ reaches G first at the k-th step. For ∼ ∈ {<, ≤, =, ≥, >}, ρ |=T ♦^{∼k} G says that ρ |=T ♦^m G holds for some m : 0 ≤ m ∼ k. The statement ρ |=T ◯^k G says that ρ visits G at the k-th step (but possibly earlier).

Markov Chains A Markov chain C is a pair ⟨Γ, M⟩, where Γ is a (potentially infinite) set of configurations, and M : Γ × Γ → [0, 1] is a transition probability matrix over Γ, called the probability matrix of C; i.e., M satisfies ∀γ ∈ Γ. Σ_{γ′∈Γ} M(γ, γ′) = 1. A Markov chain C = ⟨Γ, M⟩ induces an underlying transition system, denoted C↓. We define C↓ := ⟨Γ, →⟩, where → := {⟨γ, γ′⟩ | M(γ, γ′) > 0}. The underlying transition system has the same set of configurations, with transitions between configurations that have non-zero transition probability under C. This allows us to lift the temporal logic concepts defined above to Markov chains.

<sup>6</sup> We call infinite sequences runs and finite sequences paths. Traditionally, however, CTL<sup>∗</sup> refers to properties of infinite sequences (our runs) as path formulae.

Probability Measures Consider a Markov chain C = hΓ, Mi. The probability of taking path π is the product of single step probabilities along π:

$$\operatorname{Prob}_{\mathcal{C}}\left(\pi\right) := \prod_{i=0,\ldots,|\pi|-1} M\left(\pi[i], \pi[i+1]\right)$$
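As a small illustration (the toy chain and names are ours), the probability of a finite path is simply the product of the matrix entries along it:

```python
# A toy finite Markov chain: M[a][b] is the probability of stepping a -> b.
M = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 1.0}}

def prob_of_path(path):
    """Probability of a finite path: product of single-step probabilities."""
    result = 1.0
    for src, dst in zip(path, path[1:]):
        result *= M[src].get(dst, 0.0)
    return result

print(prob_of_path(["a", "a", "b", "a"]))     # 0.5 * 0.5 * 1.0 = 0.25
```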

For a configuration γ, we adopt the usual probability space on γ-runs, with the σ-algebra generated by the cylindrical sets starting from γ (see [39,20] for details). For a path formula φ, we define ProbC(γ |= φ) := ProbC({ρ ∈ Runs(γ) | ρ |=C φ}) (which is measurable by [40]); e.g., given a set F ⊆ Γ, ProbC(γ |= ♦F) is the measure of γ-runs that reach F. If ProbC(γ |= φ) = 1 then we say that almost all γ-runs of C satisfy φ. Following the literature, we also say that γ |=C φ holds almost surely (almost certainly), or that φ holds almost surely from γ.

### 3 Concurrent Programs

A (concurrent) program consists of a set of processes that run in parallel and communicate through a set of shared variables. The operation of the program is controlled by a central scheduler that selects the processes to execute one after the other. We assume a finite set Procs of processes that share a set X of variables. Fig. 1 gives the grammar for a small but general assembly-like language that we use for defining the syntax of concurrent programs. A program instance P is described by a set of shared variables, var∗, followed by the code of the processes, (proc reg∗ instr∗)∗. Each process p ∈ Procs has a finite set Regs_p of (local) registers. We assume that the sets of registers of the different processes are disjoint, and define Regs_P := ∪_{p∈Procs} Regs_p.

Each process declares its set of registers, reg<sup>∗</sup> , followed by a sequence of instructions. We assume that the data domain of X and Regs<sup>P</sup> is a finite set V, with a special element 0 ∈ V.

```
prog  ::= var∗ (proc reg∗ instr∗)∗
instr ::= lbl : stmt
stmt  ::= var := reg
       |  reg := var
       |  reg := expr
       |  reg := CAS(var, reg, reg)
       |  if reg then lbl
       |  term
```
Fig. 1. A simple programming language.

Instructions An instruction i is of the form l : s, where l is a unique (across processes) label and s is a statement. Labels represent program counters of processes and indicate the instruction that the process executes the next time it is scheduled. A read/write statement either writes the value of a register to a shared variable, reads the value of a shared variable into a register, or updates the value of a register by evaluating an expression. We assume a set expr of expressions over constants and registers, not referring to the shared variables. The CAS statement is the standard compare-and-swap operation, and if-statements have their usual interpretation. Iterative constructs such as while and for, as well as goto-statements, can be encoded with branching if-statements as usual.

The fence statement, which flushes the contents of the buffer of the process, can be simulated using the CAS statement. The statement term causes the process to terminate its execution. Sometimes, we refer to an instruction by its statement; e.g., we call the instruction r:=x (where r is a register and x is a shared variable) a read instruction, and similarly for a write instruction, etc. The semantics of these instructions is explained through a set of inference rules in Sec. 4.

Labels We define Lbl_p to be the set of labels that occur in the code of process p, and define Lbl_P := ∪_{p∈Procs} Lbl_p. We assume that term has the label l_p^term. We define Instr_p to be the set of instructions occurring in p, and define Instr_P := ∪_{p∈Procs} Instr_p. For an instruction i of the form l : s, we define λ(i) := l and stmt(i) := s. Abusing notation, we also define stmt(l) := s. For a process p ∈ Procs and an instruction i ∈ Instr_p with stmt(i) ≠ term, we define next(i) to be the (unique) instruction following i in the code of p. For an instruction l1 : (if a then l2), we assume, without loss of generality<sup>7</sup>, that l1 ≠ l2.

Scheduler The scheduler selects the process from Procs to run next. The operational model for classical TSO [41] uses a non-deterministic scheduler; we adopt a scheduler that selects the next process probabilistically. The scheduling policy is defined by a function Sched: Sched(p) ∈ N denotes the scheduling weight assigned to the process p. If p is enabled (i.e., the process can execute its next instruction; formally defined in Sec. 4), then p is scheduled at the next step with a probability that is proportional to Sched(p).

### 4 Operational Semantics

The operational model for classical TSO [41] describes the semantics as a transition system. We also take an operational approach. However, we differ in a fundamental aspect: classical TSO models the choice between transitions as non-deterministic choice. We, on the other hand, model it as probabilistic choice, obtaining a system called Probabilistic TSO (PTSO for short). Adding probabilities induces a Markov chain, which governs the behaviours of PTSO.

A program is described by a pair: the set of processes Procs and the scheduler policy Sched. In this section, we fix such a program P = ⟨Procs, Sched⟩. We develop the operational semantics of P under PTSO as an infinite-state Markov chain ⟦P⟧MC := ⟨Γ_P, M_P⟩. We begin by defining the set of configurations Γ_P (Sec. 4.1). Then we describe the behavior of P under classical TSO using a transition system ⟦P⟧TS (Sec. 4.2). Finally, we extend the transition system to a Markov chain ⟦P⟧MC by giving probability distributions that govern process scheduling and memory updates.

<sup>7</sup> We make the restriction for technical convenience. The case where l1 = l2 does not introduce conceptual difficulties, but the restriction simplifies the presentation by eliminating some corner cases when we define probability measures (Sec. 5) and when we introduce our cost model (Sec. 8).

#### 4.1 Configurations

The central feature of TSO is the store buffer: a FIFO buffer in which pending write operations are queued as messages. The semantics equips each process p ∈ Procs with an unbounded buffer, here called the p-buffer, that carries the pending write operations issued by p that have not yet reached the shared memory.

A configuration ⟨λ, R, B, M⟩ consists of four attributes: a labeling state (λ), a register state (R), a buffer state (B), and a memory state (M). We use Γ_P to denote the set of configurations of P.

A labeling state is a function λ : Procs → Lbl_P that defines, for p ∈ Procs, the label λ(p) ∈ Lbl_p of the next instruction to be executed by p.

A register state is a function R : Regs_P → V that maps each register a ∈ Regs_P to its current value R(a) ∈ V. For an expression e, we use R(e) to denote the evaluation of e against the register state R.

A single-buffer state w is a word in (X × V)∗, describing the content of the p-buffer for some process p ∈ Procs. The buffer contains a sequence of pending write messages, i.e., pairs of the form ⟨x, v⟩ representing a write of value v to variable x. A buffer state is a function B : Procs → (X × V)∗ that gives, for each process p ∈ Procs, a single-buffer state describing the content of the p-buffer.

A memory state is a function M : X → V that assigns to each variable x ∈ X its current value M(x) ∈ V in the shared memory.

Fig. 2. The classical TSO semantics: process transitions (green), update transitions (orange), and the overall transition (Full-TSO).

Consider a configuration γ = ⟨λ, R, B, M⟩. We say that γ is plain if B(p) = ε for all p ∈ Procs, i.e., all the buffers in γ are empty. We use Γ_P^plain to denote the set of plain configurations of P. Notice that Γ_P^plain ⊆ Γ_P and that Γ_P^plain is finite. For a label l ∈ Lbl_P, we write l ∈ γ if λ(p) = l for some p ∈ Procs. We define Γ_P^l := {γ ∈ Γ_P | l ∈ γ}, i.e., the set of configurations in which l occurs.

For a configuration γ = ⟨λ, R, B, M⟩, we define the size of γ by |γ| := Σ_{p∈Procs} |B(p)|, i.e., the total number of messages in the buffers of γ. For ∼ ∈ {<, ≤, =, ≥, >} and ℓ ∈ N, we define Γ_P^{∼ℓ} := {γ ∈ Γ_P | |γ| ∼ ℓ}, i.e., the set of configurations whose total number of messages m satisfies m ∼ ℓ.

#### 4.2 The Classical TSO Semantics

We recall the classical semantics of TSO, given as a transition system ⟦P⟧TS = ⟨Γ_P, −→_P⟩. We define the transition relation −→_P through the set of inference rules in Fig. 2. The relation −→_P is the composition of two relations: the relation −→proc describes the processes' execution steps, and the relation −→update describes memory updates, where pending writes are propagated to the memory.

Process Transitions We define the process transition relation −→proc := ∪_{p∈Procs} p−→proc as a union of relations, each corresponding to one process (the rule proc). The inference rules defining p−→proc, for a process p ∈ Procs, are depicted in Fig. 2. Each rule corresponds to one step performed by p. After executing an instruction, p moves on to the next instruction in its code; it executes the latter instruction when it is next selected by the scheduler.

A write instruction (x := a) assigns the value of the local register a to the shared variable x. The process appends a write message, consisting of x together with the value R(a) of a, to the head of the p-buffer. A read instruction (a := x) assigns the value of the shared variable x to the local register a. The value of x is either fetched from the p-buffer (read-own-write) or from the shared memory (read-from-memory). We capture both cases in one inference rule, using the function FetchVal defined as follows. Let w be the contents of the p-buffer. We write x ∈ w if ⟨x, v⟩ ∈ w for some v ∈ V, and write x ∉ w otherwise. We define (i) FetchVal(x)(w)(M) := v if x ∈ w and w = w1 · ⟨x, v⟩ · w2 with x ∉ w1; and (ii) FetchVal(x)(w)(M) := M(x) if x ∉ w. In case (i), the value of x is taken from the latest x-message in the p-buffer. In case (ii), no x-message exists in the p-buffer, and the value is read from the shared memory.
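The following sketch instantiates the definition of FetchVal; the list representation (newest message first, matching the convention of appending at the head) and the names are our own:

```python
def fetch_val(x, w, memory):
    """FetchVal(x)(w)(M): the value read for variable x, given the p-buffer
    contents w (list of (variable, value) pairs, newest first) and memory M."""
    for (y, v) in w:          # scanning from the head finds the decomposition
        if y == x:            # w = w1 . (x, v) . w2 with no x-message in w1
            return v          # case (i): latest pending write to x
    return memory[x]          # case (ii): no x-message, read from memory

print(fetch_val("x", [("y", 3), ("x", 7), ("x", 1)], {"x": 0}))  # prints: 7
```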

The instruction b := CAS(x, a1, a2) checks whether the p-buffer is empty and the value of the shared variable x is equal to the value of the register a1. If so, we atomically assign the value of the register a2 to x, and assign the value true to b (the rule CAS-true). If the value of x differs from the value of a1, then we do not change the value of x, but assign the value false to b (the rule CAS-false). If the p-buffer is not empty, then p is disabled in the current configuration. We define the set of disabled processes at configuration γ:

$$\mathsf{disab}\left(\gamma\right) := \left\{ p \;\middle|\; \big(\mathsf{stmt}(\lambda(p)) = \mathtt{term}\big) \lor \Big(\big(\mathsf{stmt}(\lambda(p)) = (b := \mathtt{CAS}(\mathtt{x}, \mathtt{a}_1, \mathtt{a}_2))\big) \land \big(\mathcal{B}(p) \neq \varepsilon\big)\Big) \right\}$$

In other words, it is the set of processes that are disabled in γ, either because they have terminated or because they are about to perform a CAS operation and their buffers are not empty. We say that p is disabled in γ if p ∈ disab(γ), and that γ is disabled if all processes are disabled in γ. If a process (resp. configuration) is not disabled, then it is enabled. If γ is disabled, we make a dummy transition that does not change γ (the rule disabled)<sup>8</sup>. Notice that if γ −→proc γ′, then there is a unique process p ∈ Procs such that γ p−→proc γ′.
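As a complement, here is a hypothetical Python rendering of the two CAS rules and the enabledness condition; the function signature and representation are ours:

```python
def cas(buffer, memory, x, a1, a2):
    """b := CAS(x, a1, a2) for a process with store buffer `buffer`: enabled
    only when the buffer is empty; then atomically compare-and-swap on x."""
    if buffer:
        return None          # process disabled: pending writes not yet flushed
    if memory[x] == a1:
        memory[x] = a2       # CAS-true: the new value hits memory immediately
        return True
    return False             # CAS-false: memory unchanged, b := false

memory = {"x": 0}
print(cas([], memory, "x", 0, 5), memory)   # True {'x': 5}
print(cas([("x", 1)], memory, "x", 5, 7))   # None: p-buffer not empty
```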

Update Transitions Between two process transitions, the system may perform a (possibly empty) sequence of update steps. The rule empty-update describes an empty update step. Each single-update step pops one write message from the end of the p-buffer for some process p and uses it to update the memory. The update rule captures the effect of a sequence of such single-update steps. We define the update transition relation −→update := ∪_{α∈Procs∗} α−→update as a union of relations, each corresponding to a given sequence of update steps; the word α gives the sequence of processes that perform the updates. The net effect is that the system (i) pops a (possibly empty) suffix from the buffer of each process, (ii) shuffles these into one sequence, and (iii) uses the resulting sequence to update the memory. Notice that each selection of suffixes in step (i) may result in several different sequences, due to multiple interleavings in step (ii). Observe that −→_P is deadlock-free, i.e., for each configuration γ ∈ Γ_P there is at least one configuration γ′ ∈ Γ_P such that γ −→_P γ′.

#### 4.3 Adding Probabilities: PTSO

We define the Markov chain ⟦P⟧MC = ⟨Γ_P, M_P⟩. The set Γ_P of configurations is defined as above. The probability matrix M_P is defined as the composition of two probability distributions: (i) the process probability distribution Mproc and (ii) the update probability distribution Mupdate, which add probabilities to the process transition relation −→proc and the update transition relation −→update, respectively.

The Process Probability Distribution: the Scheduler At each program step (−→_P), a process is selected for execution according to a probability given by the scheduler. In a configuration γ, the scheduler selects an enabled process p ∈ enab(γ) with a probability Rweight(γ)(p) that reflects the relative weight of p compared to those of the other enabled processes:

$$\mathsf{Rweight}\left(\gamma\right)\left(p\right) = \begin{cases} 0 & \text{if } p \in \mathsf{disab}\left(\gamma\right) \\ \frac{\mathsf{Sched}\left(p\right)}{\sum_{p' \in \mathsf{enab}\left(\gamma\right)} \mathsf{Sched}\left(p'\right)} & \text{if } p \in \mathsf{enab}\left(\gamma\right) \end{cases} \tag{1}$$

This gives the probability that p executes in the next step from γ. For configurations γ and γ′ with γ p−→proc γ′, we define Mproc(γ, γ′) := Rweight(γ)(p). In other words, we move from γ to γ′ with a probability given by the relative weight of p in γ. We define Mproc(γ, γ′) := 0 if γ ̸−→proc γ′. To account for

<sup>8</sup> The latter transition is not strictly needed, but it is included for technical convenience.

the case where all processes are disabled in γ, we define Mproc(γ, γ) := 1 if γ is disabled.
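A minimal sketch of Eq. 1, under an assumed dictionary representation of Sched and of the enabled set (all names are ours):

```python
def rweight(sched, enabled, p):
    """Relative weight of process p (Eq. 1): 0 if p is disabled, otherwise
    p's share of the total weight of the enabled processes."""
    if p not in enabled:
        return 0.0
    return sched[p] / sum(sched[q] for q in enabled)

sched = {"p1": 2, "p2": 1, "p3": 1}
enabled = {"p1", "p3"}                                # p2 is disabled here
print([rweight(sched, enabled, p) for p in sched])    # [0.666..., 0.0, 0.333...]
```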

Faithfulness Our model uses a scheduling policy that assigns a fixed scheduling weight Sched(p) to each process p in the system. This is a case of memoryless scheduling, i.e., the probability distribution over processes does not depend on the execution history. However, we can relax this constraint to allow any scheduling policy that satisfies the faithfulness condition:

$$\forall p \in \mathsf{Procs}.\quad \mathsf{Rweight}\left(\gamma\right)\left(p\right) = 0 \iff p \in \mathsf{disab}\left(\gamma\right)$$

In words, at each step, each enabled process should be scheduled with non-zero probability. A scheduler that assigns scheduling weights such that the above condition holds is called a faithful scheduler.

Schedulers with memory The above criterion allows for schedulers that are more refined than the memoryless scheduler. As an example, on implementations of TSO, processes are often scheduled for multiple consecutive steps, since unnecessary context switching wastes processor resources. To reflect this, we can consider a scheduler that assigns a higher probability to the previously scheduled process pprv. For some choice of constant weights Sched, we define a new choice of weights Sched′, where λ > 1 is a parameter:

$$\mathsf{Sched}'(p) := \begin{cases} \mathsf{Sched}\left(p\right) & \text{if } p \neq p_{prv} \\ \lambda \cdot \mathsf{Sched}\left(p\right) & \text{otherwise} \end{cases}$$

In this case, pprv is re-scheduled with a weight that is larger by a factor of λ; a larger λ implies a stronger tendency to re-schedule a process. This scheduling policy still satisfies faithfulness. One can extend this idea to more intricate policies, e.g., ones that account for the k previous steps; a small sketch of the boosted weighting appears below.
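Under the same assumed dictionary representation, Sched′ is a one-liner:

```python
def boosted_sched(sched, p_prv, lam):
    """Sched': multiply the weight of the previously scheduled process by
    lam > 1; all other weights stay unchanged."""
    return {p: (lam * w if p == p_prv else w) for p, w in sched.items()}

# With lam = 3, p1 is rescheduled with probability 6/8 rather than 2/4.
print(boosted_sched({"p1": 2, "p2": 1, "p3": 1}, "p1", 3))
```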

To better illustrate the concerns and challenges of verification, we continue to adopt the simple (memoryless) scheduler proposed earlier. However, we emphasize that our results extend to faithful schedulers.

The Update Probability Distribution: the Memory Update Policy Between process steps, pending messages from the store buffers are propagated to the shared memory (the update transition). The details of this write propagation are implementation-specific, with policies tuned towards system performance. Classical TSO models this propagation non-deterministically; we, on the other hand, consider a probabilistic update policy. In a similar manner to the scheduling probabilities, the update probability distribution defines the probability by which a configuration γ reaches another configuration γ′ through an update step (−→update). Recall that an update step consists of a sequence of (single) update operations. The number of possible update sequences from γ is finite, since the size of each buffer is finite. In our model, we assume that the update distribution is the uniform distribution over all possible update sequences. We note that, starting from γ, different update sequences can lead to the same configuration γ′: different shufflings of the selected suffixes (see Sec. 4.2) may lead to the same memory state. To reflect this, for configurations γ and γ′, we define

$$M_{\mathsf{update}}\left(\gamma, \gamma'\right) := \frac{\left|\left\{ \alpha \;\middle|\; \gamma \xrightarrow{\alpha}_{\mathsf{update}} \gamma' \right\}\right|}{\left|\left\{ \alpha \;\middle|\; \exists \gamma''.\; \gamma \xrightarrow{\alpha}_{\mathsf{update}} \gamma'' \right\}\right|}$$

i.e., the fraction of update sequences that lead to the configuration γ′.
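The following brute-force sketch makes this distribution concrete by enumerating steps (i)–(iii) of Sec. 4.2; the oldest-first list representation and all names are ours, and the enumeration is exponential, intended only as an executable reading of the definition:

```python
from itertools import product, permutations
from collections import Counter

def update_outcomes(buffers, memory):
    """Count, for each configuration reachable by one update transition, how
    many update sequences alpha lead to it. `buffers` maps each process id
    to its list of (variable, value) messages, oldest first."""
    procs = sorted(buffers)
    outcomes = Counter()
    # (i) choose how many oldest messages each process flushes
    for ks in product(*[range(len(buffers[p]) + 1) for p in procs]):
        pool = [p for p, k in zip(procs, ks) for _ in range(k)]
        # (ii) every distinct interleaving alpha of the chosen suffixes
        for alpha in set(permutations(pool)):
            mem, taken = dict(memory), {p: 0 for p in procs}
            for p in alpha:                   # (iii) apply the writes in order
                x, v = buffers[p][taken[p]]
                mem[x] = v
                taken[p] += 1
            rest = tuple((p, tuple(buffers[p][k:])) for p, k in zip(procs, ks))
            outcomes[(rest, tuple(sorted(mem.items())))] += 1
    return outcomes

# Two pending writes to x give 5 update sequences (empty, p, q, pq, qp), so
# the uniform policy assigns each of them probability 1/5.
out = update_outcomes({"p": [("x", 1)], "q": [("x", 2)]}, {"x": 0})
print(sum(out.values()))                      # prints: 5
```

Mupdate(γ, γ′) is then outcomes[γ′] divided by the total count; in this tiny example the five resulting configurations happen to be pairwise distinct, so each gets probability 1/5.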

Left-Biasedness Though we adopt a specific update distribution, we provide a generic condition on the update policy that is sufficient for our results to hold. We call this the left-biasedness property. Here we give an intuitive description of left-biasedness and defer the formal definition to Sec. 8.

Intuitively, left-biasedness requires that, for sufficiently large configurations, the probability that the configuration size decreases in a single −→_P step is strictly greater than p for some p > 1/2. Left-biasedness allows a wide class of more refined update policies, e.g., ones where no message propagation is performed when the number of messages is smaller than a certain value, or where only the messages inside the buffers of some (probabilistically selected) processes are propagated.

Though our results apply more generally to models characterized by faithfulness (of the scheduling policy) and left-biasedness (of the update policy), we continue to adopt the fixed-weight (memoryless) scheduler and the uniform update policy, for the reasons described above.

The Full Probability Distribution We combine the process and update probability distributions to derive the probability matrix M_P, and thus obtain the Markov chain ⟦P⟧MC. Consider configurations γ and γ′ with γ −→_P γ′, and let γ″ be the unique configuration such that γ −→proc γ″ −→update γ′. Then we define M_P(γ, γ′) := Mproc(γ, γ″) · Mupdate(γ″, γ′).

Lemma 1 M_P is a probability distribution on Γ_P; hence, ⟦P⟧MC is a Markov chain.

### 5 PTSO: Concepts and Properties

We now present some concepts underlying Probabilistic TSO and its properties.

PTSO Refines Classical TSO Having introduced ⟦P⟧TS and ⟦P⟧MC in Sec. 4, we note that they are closely related: ⟦P⟧TS is the underlying transition system of ⟦P⟧MC.

Lemma 2 (⟦P⟧MC)↓ = ⟦P⟧TS for any program P.

In particular, this means that the PTSO system ⟦P⟧MC is a refinement of ⟦P⟧TS: a behaviour is observed in ⟦P⟧TS iff it is seen in ⟦P⟧MC with non-zero probability. Whenever the context is clear, we write P instead of ⟦P⟧TS or ⟦P⟧MC.

Label Reachability We formulate our verification problems in terms of reachability of instruction labels. To simplify the notation, we identify a label l ∈ Lbl_P with the set Γ_P^l of configurations in which l occurs. We say that "l is reachable" rather than "Γ_P^l is reachable", and write ♦l instead of ♦{γ ∈ Γ_P | l ∈ γ}. In [13,12] the authors show that label reachability from a plain configuration is decidable. The following lemma generalizes this to the case where the source configuration need not be plain and the destination can be a particular plain configuration.

Lemma 3 For a program P, a configuration γ ∈ Γ_P, and a plain configuration γ′ ∈ Γ_P^plain, it is decidable whether γ ∗−→_P γ′.

Extending this, we obtain Lemma 4: to decide whether γ ∗−→_P l, we query whether γ ∗−→_P γ′ for each γ′ ∈ Γ_P^l ∩ Γ_P^plain (since update transitions can always drain the buffers without changing the labels, l is reachable from γ iff some plain configuration containing l is). Decidability follows since Γ_P^plain is finite and each query is decidable by Lemma 3.

Lemma 4 For a program P, a configuration γ ∈ Γ_P, and a label l ∈ Lbl_P, it is decidable whether γ ∗−→_P l.

#### 5.1 Left-Orientedness and Attractors

We show that the set Γ_P^plain of plain configurations has an attractor property in the sense of [16]. In our setting, this means that any run of ⟦P⟧MC almost surely visits Γ_P^plain infinitely often.

Small and large configurations To arrive at this result, we consider a generalization of plain configurations, called small configurations, denoted Γ_P^small. Γ_P^small consists of configurations with a small number of messages inside their buffers. Concretely, a configuration γ is small if |γ| ≤ 4, i.e., the total number of messages inside the buffers does not exceed 4.<sup>9</sup> We define the set of large configurations by Γ_P^large := Γ_P − Γ_P^small = Γ_P^{≥5}. We show that the Markov chain ⟦P⟧MC is left-oriented in the sense of [42]: for any large configuration γ ∈ Γ_P^large, the expected change in configuration size in a single −→_P step is negative.

An illustrative example We explain the update probability distribution through the code snippet below, consisting of two processes:

```
// procL          // procR
0: x = 1          2: x = 2
1: goto 0         3: a = x
                  4: goto 2
```

To begin with, let us only consider the process on the left (procL). It executes an infinite loop, writing 1 to the variable x. Let us consider the evolution of the buffer size of procL, i.e., the number of (x,1) messages in the procL-buffer. Assume that on reaching label 0, procL has 6 messages in its buffer. The −→_P step consists of a process transition −→proc followed by an update transition −→update. In the −→proc step, the write increases the size of the buffer by one, giving a buffer of size 7. Following this, the −→update step may push any number of messages to the memory. Since the update policy chooses uniformly amongst the possible update sequences, the resulting configuration has one amongst {0, ..., 7} messages in the procL-buffer, each occurring with probability 1/8. The next −→proc step (a goto) does not change the buffer size, but the following −→update step can still propagate messages. The reasoning for subsequent steps is similar.
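A quick Monte Carlo reading of this example (our simplification: we simulate only the write steps of procL, so each −→_P step adds one message and then leaves a uniformly chosen number of messages in the buffer):

```python
import random
from collections import Counter

random.seed(0)

def step(size):
    """One step of the writer: the write adds one message, then the uniform
    update leaves any of 0..size+1 messages, each with equal probability."""
    return random.randint(0, size + 1)

size, sizes = 6, Counter()          # start with 6 messages, as in the example
for _ in range(100_000):
    size = step(size)
    sizes[size] += 1

# The walk is strongly attracted to small buffer sizes: empirically, the vast
# majority of steps end with only a handful of pending messages.
print(sizes.most_common(5))
```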

<sup>9</sup> This value is an artifact of the probabilistic policies we have adopted in Sec. 4.

Comparison with other notions of fairness In each −→proc step, at most one message is added to the process buffers (when the process performs a write), whereas the following −→update step can remove a large number of messages. Hence, from sufficiently large configurations, the system has a tendency to move towards configurations with smaller buffer sizes. Formally, we prove the following lemma, using the left-orientedness property mentioned earlier.

Lemma 5 Prob_P(γ |= □♦Γ_P^plain) = 1 for all configurations γ ∈ Γ_P.

For the above example, PTSO guarantees that the process on the right (procR) eventually reads the value 1 into register a. This follows since, in a plain configuration, the buffer of procR is empty and hence it can read the value from the memory; this happens almost surely. We highlight that other notions of fairness, such as strong fairness in process scheduling (discussed in [27]) as well as memory fairness [26], cannot provide this guarantee. In particular, memory fairness from [26] would consider the following execution, which exactly alternates the writes of both processes but in which procR reads before its own write is pushed to memory, to be fair and hence permissible:

```
x = 1   x = 2   a = x // 2   x = 1   x = 2   a = x // 2   x = 1   ···
```

B-Plain Configurations We can refine our analysis of the attractor property enjoyed by the set Γ_P^plain of plain configurations. We consider a subset of Γ_P^plain that we call the set of bottom plain configurations (or B-plain configurations, for short), denoted Γ_P^Bplain. Intuitively, a B-plain configuration is a member of a bottom strongly connected component of the graph of plain configurations. Formally, a configuration γ ∈ Γ_P is B-plain if (i) γ ∈ Γ_P^plain, and (ii) for any γ′ ∈ Γ_P^plain, if γ ∗−→_P γ′ then γ′ ∗−→_P γ. Since any run of the system almost surely visits the set Γ_P^plain infinitely often, it also almost surely visits a B-plain configuration infinitely often.

Lemma 6 Prob_P(γ |= □♦Γ_P^Bplain) = 1 for all configurations γ ∈ Γ_P.

### 6 Qualitative (Repeated) Reachability

Given: a program P, a configuration γinit ∈ Γ_P, and a label l ∈ Lbl_P.
Qual Reach: Determine whether Prob_P(γinit |= ♦l) = 1.
Qual Rep Reach: Determine whether Prob_P(γinit |= □♦l) = 1.

In this section, we perform qualitative reachability analysis for PTSO. Given a program P, a configuration γinit, and a label l, we check whether a γinit-run almost surely reaches l. We also consider qualitative repeated reachability, where we ask whether a γinit-run visits l repeatedly (i.e., infinitely often) with probability 1. We further consider the almost-never variants of these problems, where we check whether the probabilities are 0 rather than 1. We prove that these problems are decidable, with non-primitive-recursive complexities.

### 6.1 Almost-Sure Reachability

The qualitative reachability problem, Qual Reach, is defined above. The algorithm in Fig. 3 solves Qual Reach by analyzing the transition system ⟦P⟧TS, the underlying transition system of PTSO. If l occurs in γinit, then the property trivially holds, and we answer positively. Otherwise, the algorithm considers a new program P′ obtained by replacing the statement labeled l by a new statement that makes P′ terminate immediately if l is reached. Let p ∈ Procs be the unique process such that l ∈ Lbl_p. We define P′ := ⟨(Procs − {p}) ∪ {p′}, Sched⟩, where p′ is a fresh process derived from p by replacing stmt(l) with goto l_new^term, for a fresh label l_new^term ∉ Lbl_P, and adding a term instruction at label l_new^term. The remaining instructions of p′ are identical to those of p.

Fig. 3. The almost-sure reachability algorithm.

The loop on line 3 cycles through the (finite) set of plain configurations. For each plain configuration γ of the original program P, we check: (i) whether γ is reachable from the initial configuration γinit in P′ — by the construction of P′, this is equivalent to checking whether γ is reachable from γinit in P without observing label l; and (ii) whether γ can reach the label l. If the answer to (i) is yes and the answer to (ii) is no, then we have found a finite path π in P that, starting from γinit and without visiting l, reaches a configuration γ from which l is not reachable. This implies that Prob_P(γinit |= ♦l) < 1. If none of the plain configurations satisfies the condition, then each plain configuration γ reachable from γinit has a path to l. Now, by the attractor lemma, any run almost surely visits Γ_P^plain infinitely often, and by the fairness property of Markov chains, it almost surely visits l.
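The following skeleton is one possible rendering of the algorithm's overall structure; the oracle parameters stand for the decidable checks of Lemmas 3 and 4, and all names are ours:

```python
def almost_sure_reach(plain_configs, l_in_init, reach_avoiding_l, can_reach_l):
    """Qual Reach skeleton. l_in_init: does l occur in the initial
    configuration?  reach_avoiding_l(g): is g reachable from it in P'
    (equivalently, in P without observing l)?  can_reach_l(g): is l
    reachable from g?"""
    if l_in_init:
        return True
    for g in plain_configs:            # finitely many plain configurations
        if reach_avoiding_l(g) and not can_reach_l(g):
            return False               # witness path: the probability is < 1
    # every reachable plain configuration can still reach l; by the attractor
    # property, a run then visits l almost surely
    return True
```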

#### 6.2 Almost-Sure Repeated Reachability

For almost-sure repeated reachability, we are interested in determining whether the γinit-runs visit l infinitely often with probability 1. The algorithm is similar to the one for almost-sure reachability: we check whether there exists a plain configuration γ that satisfies (γinit ∗−→_P γ) ∧ ¬(γ ∗−→_P l), in which case we return false. The difference is that we do not need to transform the program as in the case of almost-sure reachability. Details are in the supplementary material.

#### 6.3 Almost-Never (Repeated) Reachability

The almost-never variants of the (repeated) reachability problems, Never Qual Reach resp. Never Qual Rep Reach, ask whether the probabilities are equal to 0 rather than 1.

Given: a program P, a configuration γinit ∈ Γ_P, and a label l ∈ Lbl_P.
Never Qual Reach: Determine whether Prob_P(γinit |= ♦l) = 0.
Never Qual Rep Reach: Determine whether Prob_P(γinit |= □♦l) = 0.

The solution to Never Qual Reach is straightforward, since Prob_P(γinit |= ♦l) = 0 iff ¬(γinit ∗−→_P l). On the other hand, the Never Qual Rep Reach problem requires a search over the B-plain configurations γ satisfying γinit ∗−→_P γ ∗−→_P l. Due to space constraints, we defer the algorithm and proofs to the appendix.

#### 6.4 Decidability and Complexity

The algorithms can be effectively implemented since (i) Γ plain P is finite; and (ii) the conditions of the for-loops and if-statements can be checked effectively, as implied by Lemma 4. This gives Theorem 1. Theorem 2 is proved through reductions from the reachability problem under the classical (non-probabilistic) TSO semantics [19]. The non-primitive-recursive lower bounds follow from the corresponding result for reachability of classical TSO.

Theorem 1. Qual Reach, Qual Rep Reach, Never Qual Reach, Never Qual Rep Reach are all decidable.

Theorem 2. Qual Reach, Qual Rep Reach, Never Qual Reach, Never Qual Rep Reach all have non-primitive-recursive complexities.

### 7 Quantitative (Repeated) Reachability

In this section we discuss quantitative reachability problems for PTSO. In contrast to qualitative analysis from Sec. 6, the task here is to compute the actual probability. We are not able to compute the probabilities exactly, but we can approximate the probability with an arbitrary degree of precision.

#### 7.1 Approximate Quantitative Reachability

Given: a program P, a configuration γinit ∈ Γ_P, a label l ∈ Lbl_P, and a precision ε ∈ R^{>0}.
Quant Reach: Determine θ s.t. Prob_P(γinit |= ♦l) ∈ [θ, θ + ε].
Quant Rep Reach: Determine θ s.t. Prob_P(γinit |= □♦l) ∈ [θ, θ + ε].

In the approximate quantitative reachability problem, Quant Reach, given a precision parameter ε, we are interested in determining an approximation θ satisfying θ ≤ Prob<sup>P</sup> (γinit |= ♦l) ≤ θ + ε.

The algorithm in Fig. 4 solves the problem by successively improving the approximation at each iteration until we are within ε-precision of the exact value. The algorithm maintains two variables: PosApprx (positive approximation) is an under-approximation of the probability with which l is reachable from γinit, and NegApprx (negative approximation) is an under-approximation of the probability with which l is not reachable from γinit. PosApprx serves as a lower bound on θ, while, 1 − NegApprx serves as an upper bound: PosApprx ≤ θ ≤ 1 − NegApprx.

```
Algorithm: Quant Reach
Input: P: program; γinit ∈ ΓP: configuration; l ∈ LblP: label; ε ∈ R>0: precision
 1 Var
 2   PosApprx, NegApprx ∈ R: approximations; waiting ∈ (ΓP × R)∗: queue
 3 PosApprx := 0; NegApprx := 0; waiting := ⟨γinit, 1⟩
 4 while PosApprx + NegApprx < 1 − ε do
 5   ⟨γ, φ⟩ := head(waiting); waiting := tail(waiting)
 6   if l ∈ γ then PosApprx := PosApprx + φ
 7   else if ¬(γ ∗→P l) then NegApprx := NegApprx + φ
 8   else
 9     for each γ′ with γ →P γ′ do waiting := waiting · ⟨γ′, φ · MP(γ, γ′)⟩
10 return PosApprx
```
Fig. 4. The quantitative reachability algorithm.

The algorithm iteratively improves these approximations until we reach a point where their sum is within ε from 1 (line 4). In such a case, the desired value of θ = PosApprx is an ε-precise approximation.

To calculate the approximations, the algorithm performs a forward reachability analysis starting from the initial configuration γinit. It generates the set of γinit-paths in breadth-first manner, using the FIFO queue waiting. For each generated path π, it also calculates the probability of π. Instead of the whole path π, waiting only stores the last configuration γ of π together with the probability φ of π, as a pair ⟨γ, φ⟩.

The approximation variables are initialized to zero (line 3), and the waiting queue is initialized to contain the single pair ⟨γinit, 1⟩, representing the initial configuration γinit (which occurs with probability one). The while-loop executes until we achieve the desired precision. At each iteration, we check whether we have already reached the desired precision. If not, the algorithm pops the pair ⟨γ, φ⟩ from the waiting queue. There are three possibilities, depending on γ: (i) if l ∈ γ, the mass φ is added to PosApprx (line 6); (ii) if l is not reachable from γ, the mass φ is added to NegApprx (line 7); (iii) otherwise, the successors of γ are appended to waiting, each with its share φ · MP(γ, γ′) of the mass (line 9).
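To see the loop in action, here is a self-contained Python rendering of Fig. 4, run on a toy finite chain standing in for the infinite-state ⟦P⟧MC (the chain, the exact reachability oracle, and all names are ours):

```python
from collections import deque

# Toy chain: from "init" the target "l" is reached with probability 0.5.
M = {"init": {"l": 0.5, "dead": 0.5}, "l": {"l": 1.0}, "dead": {"dead": 1.0}}

def can_reach(g, target):
    """Exact reachability oracle via graph search (stand-in for Lemma 4)."""
    seen, todo = set(), [g]
    while todo:
        h = todo.pop()
        if h == target:
            return True
        if h not in seen:
            seen.add(h)
            todo.extend(M[h])
    return False

def quant_reach(init, target, eps):
    pos, neg = 0.0, 0.0
    waiting = deque([(init, 1.0)])
    while pos + neg < 1 - eps:
        g, phi = waiting.popleft()
        if g == target:
            pos += phi                        # case (i)
        elif not can_reach(g, target):
            neg += phi                        # case (ii)
        else:                                 # case (iii): expand successors
            for h, p in M[g].items():
                waiting.append((h, phi * p))
    return pos        # pos <= Prob(init |= ♦ target) <= pos + eps

print(quant_reach("init", "l", 1e-6))         # prints: 0.5
```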
To show the correctness of the algorithm, let PosApprx(i) and NegApprx(i) denote the values of PosApprx and NegApprx prior to the i-th iteration. We show that, as i → ∞, the value of PosApprx(i) + NegApprx(i) tends to 1. Technically, this follows from Lemma 5: any γinit-run almost surely either (i) reaches a plain configuration from which l is not reachable, or (ii) repeatedly reaches plain configurations from which l is reachable, in which case it almost surely reaches l. This implies that Prob_P(γinit |= ♦(l ∨ ¬∃♦l)) = 1, i.e., a γinit-run almost surely either reaches l or reaches a configuration from which l is not reachable, and hence PosApprx(i) + NegApprx(i) tends to 1. Finally, by Lemma 4 we can effectively check the condition of the if-statement, and hence the algorithm terminates.

The correctness of the approximation on termination follows from the fact that PosApprx(i) and NegApprx(i) are under-approximations of the reaching and non-reaching probabilities, respectively. This follows from the following invariants:

$$\begin{gathered}
\mathsf{PosApprx}^{(i)} \leq \mathit{Prob}_{\mathcal{P}}\left(\gamma_{init} \models \Diamond\, \mathsf{l}\right) \qquad
\mathsf{NegApprx}^{(i)} \leq \mathit{Prob}_{\mathcal{P}}\left(\gamma_{init} \models \Box\,\neg \mathsf{l}\right) \\
\mathit{Prob}_{\mathcal{P}}\left(\gamma_{init} \models \Diamond\, \mathsf{l}\right) \leq 1 - \mathit{Prob}_{\mathcal{P}}\left(\gamma_{init} \models \Box\,\neg \mathsf{l}\right) \\
\mathsf{PosApprx}^{(i)} + \mathsf{NegApprx}^{(i)} > 1 - \varepsilon \ \text{ holds on termination}
\end{gathered}$$

These imply that, on termination, PosApprx is within ε of the exact probability.

Theorem 3. Quant Reach is solvable.

#### 7.2 Approximate Quantitative Repeated Reachability

In the case of the approximate quantitative repeated reachability problem, we are interested in approximating the probability of visiting a given label l infinitely often. We develop an algorithm that uses an iterative approximation scheme similar to the reachability case. We defer the full details of this algorithm to the supplementary material, and instead give an intuitive explanation of how it differs from Sec. 7.1.

This algorithm, too, maintains the approximations PosApprx and NegApprx and iteratively narrows the error margin until it is smaller than ε. The main difference is in the condition on line 6 of Fig. 4. In the case of reachability, the lower estimate PosApprx is increased when l ∈ γ. In the repeated reachability case, this is not sufficient; we need to ensure that there is no state γ′ that is reachable from the current state γ and from which l is not reachable. The existence of such a γ′ implies the existence of a non-zero-measure continuation of the current run in which l is not visited infinitely often. Hence, the condition of the if-statement is modified to: ∀γ′ ∈ BPlain. (γ ∗→_P γ′) ⇒ (γ′ ∗→_P l).

Note that naively we would have to check the above condition for all configurations γ′ ∈ Γ_P, which is infeasible since Γ_P is an infinite set. We address this by using Lemma 6, which shows that runs from all configurations almost surely reach a B-plain configuration. Hence, it is sufficient to check the condition only for the (finitely many) B-plain configurations, which are precomputed in BPlain.

Theorem 4. Quant Rep Reach is solvable.

### 8 Expected Average Costs

In this section, we develop a cost model for concurrent programs in which we assign a cost to the execution of each instruction, the goal being to approximate the expected cost of the runs that reach a given label.

#### 8.1 Computing costs over runs

A cost function Cost : Lbl_P → N^{>0} for a program P defines, for each label l ∈ Lbl_P, the cost of executing the instruction at l. A particular way to define the function is to assign a cost to each instruction in the programming language, so that Cost(l) depends only on stmt(l) and not on l itself; we, however, consider the general case. We extend Cost to runs as follows. Consider configurations γ = ⟨λ, R, B, M⟩ and γ′ such that γ −→_P γ′. If γ p−→_P γ′ for a process p, then we define Cost(γ, γ′) := Cost(λ(p)), i.e., the cost of the instruction executed by p. Recall from Sec. 4 that p is unique, and therefore the function is well-defined. If γ is disabled or if ¬(γ −→_P γ′), then we define Cost(γ, γ′) := 0. Consider a run ρ ∈ {ρ ∈ Runs(γ) | ρ |=_P ♦^{=i} l}, i.e., a γ-run that reaches l for the first time at step i. We define Cost(ρ)(l) := Σ_{1≤j≤i−1} Cost(ρ[j], ρ[j+1]), i.e., the sum of the costs of all executed instructions along ρ up to the first visit to l.
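A minimal sketch of Cost(ρ)(l), under the simplifying assumption that a run prefix is represented by the sequence of labels it executes (representation and names are ours):

```python
def cost_to_first_visit(run_labels, cost, l):
    """Sum the instruction costs along a run until the first visit to l;
    `cost` maps each label to its cost in N>0."""
    total = 0
    for lab in run_labels:
        if lab == l:
            return total
        total += cost[lab]
    return None      # this finite prefix never reaches l

print(cost_to_first_visit(["a", "b", "a", "l"], {"a": 1, "b": 4}, "l"))  # 6
```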

For a configuration γ, a label l, and a cost function Cost, we define a random variable X_{γ,l,Cost} : Ω → R over the support Ω = γ · Γ_P^ω as follows:

$$X_{\gamma,\mathsf{l},\mathsf{Cost}}\left(\rho\right) := \begin{cases} 0 & \text{if } \rho \not\in \{\rho \in \mathsf{Runs}\left(\gamma\right) \mid \rho \models_{\mathcal{P}} \Diamond\,\mathsf{l}\} \\ \mathsf{Cost}\left(\rho\right)\left(\mathsf{l}\right) & \text{otherwise} \end{cases}$$

Given: a program P, a configuration γinit ∈ Γ_P^plain, a cost function Cost : Lbl_P → N^{>0}, a label l ∈ Lbl_P s.t. γinit |= ∃♦l, and a precision ε ∈ R^{>0}.
Exp Ave Cost: Determine θ s.t. E(X_{γinit,l,Cost} | γinit |= ∃♦l) ∈ [θ, θ + ε].

The expectation E(X_{γ,l,Cost}) is the expected cost of reaching l from γ, and E(X_{γ,l,Cost} | γ |= ∃♦l) is the conditional expectation over the runs that reach l. If ¬(γ |=_P ∃♦l), then the expected cost is not defined. If, however, γ |=_P ∃♦l, then E(X_{γ,l,Cost} | γ |= ∃♦l) = E(X_{γ,l,Cost})/Prob_P(γ |= ♦l), which follows since the cost of the non-reaching runs is zero. The expected average cost problem, stated in the box above, asks to approximate E(X_{γinit,l,Cost} | γinit |= ∃♦l) to ε-precision.

#### 8.2 Eagerness

Our solution to Exp Ave Cost relies on the fact that ⟦P⟧MC satisfies an eagerness property in the sense of [17]. In our setting, eagerness means that the probability of avoiding the target label l decreases exponentially with the number of steps. Concretely, we show that there are two constants, the eagerness degree E_P ∈ R^{>0} and the eagerness threshold η_P ∈ R^{>0}, satisfying the following:

$$\forall \gamma \in \Gamma_{\mathcal{P}}^{small}\ \forall \mathsf{l} \in \mathsf{Lbl}_{\mathcal{P}}\ \forall n \geq \eta_{\mathcal{P}}:\quad \gamma \models_{\mathcal{P}} \exists\Diamond\,\mathsf{l} \;\Rightarrow\; \mathit{Prob}_{\mathcal{P}}\left(\gamma \models \Diamond^{\geq n}\,\mathsf{l}\right) \leq \left(\mathcal{E}_{\mathcal{P}}\right)^{n}$$

i.e., for n ≥ η_P, the probability of avoiding l during the first n steps decreases exponentially with n. The following lemma forms the crux of this section.

Lemma 7 (Eagerness Lemma) E_P and η_P exist and are computable.

We devote this subsection to an overview of the proof of Lemma 7 (the formal proof is provided in the supplementary material). We consider the behavior of runs with respect to the small and large configurations, exploiting the fact that the runs of the system tend to gravitate towards the small configurations. Here, however, we use a property called left-biasedness (defined below) that is stronger than the left-orientedness property of Sec. 5.1.

To prove Lemma 7, we show that, for a small configuration γ ∈ Γ_P^small, the runs from γ satisfy the following three properties with high probability: (i) they make their first return to Γ_P^small within a small number of steps; (ii) they return to Γ_P^small multiple times within a small number of steps; and (iii) if they eventually reach l, then they do so within a small number of steps. We combine these results to obtain the proof of Lemma 7.

Gravity: First Return We recall that buffer sizes can increase by at most one during process transitions, and that any number of messages can be flushed to the memory during an update transition (Sec. 4 and Sec. 5.1). Based on this, we show left-biasedness, defined as follows:

Left-biasedness: for all γ ∈ Γ_P^large, the probability of moving from γ to a smaller configuration is bounded below by 2/3, and the probability of moving to a larger configuration is bounded above by 1/3, regardless of P.

Using left-biasedness, we show that the set Γ_P^small has a gravity property: a run starting from a small configuration will, with high probability, return to Γ_P^small (for the first time) within a small number of steps. Formally, we define the gravity parameter G_P as follows: q̂ := 2/3, p̂ := 1/3, and G_P := 2·√(q̂ · p̂) = (2√2)/3. We prove the following lemma.

Lemma 8 (Gravity Lemma) Prob_P(γ ⊨_P ♦^{≥n} Γ_P^small) ≤ (G_P)^n, for all γ ∈ Γ_P^small and all n ∈ N.

The lemma states that, starting from a small configuration, the probability that a run avoids Γ_P^small in the next n steps decreases exponentially with n.
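The decay is genuine because the gravity parameter is a constant strictly below 1:

$$\mathcal{G}_{\mathcal{P}} = 2\sqrt{\hat{q}\cdot\hat{p}} = 2\sqrt{\tfrac{2}{3}\cdot\tfrac{1}{3}} = \frac{2\sqrt{2}}{3} \approx 0.943 < 1$$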

Multiple Revisits Notice that the gravity lemma is concerned with the first return to the set of small configurations. We now apply this argument repeatedly to conclude that, with high probability, multiple revisits to small configurations take place "quickly". That is, the set of runs starting from Γ_P^small and frequently revisiting Γ_P^small has a high measure. To formalize these arguments, we make the following definition. For m, n with 1 ≤ m ≤ n, we define Visit_P(n, m) to be the set of runs that visit the set Γ_P^small exactly m times in their first n − 1 steps<sup>10</sup>. We use the Visit predicate to partition the set of γ-runs, depending on how often they return to Γ_P^small during their first n steps. We distinguish these as

<sup>10</sup> For technical convenience, we use n − 1 instead of n in the definition of Visit. This allows us to avoid some corner cases in the proofs.

Sporadic Runs (S-Runs): runs that visit Γ_P^small only sporadically during their first n steps, and Frequent Runs (F-Runs): runs that visit Γ_P^small frequently during their first n steps. We will derive a constant ν ∈ N (see below) that delineates the border between these sets. We formally define:

$$\begin{aligned} \mathsf{SRuns}\left(\gamma\right)\left(n\right) &:= \bigcup_{1 \le m \le \lfloor \frac{n}{\nu} \rfloor} \left\{ \rho \in \mathsf{Runs}\left(\gamma\right) \mid \rho \models_{\mathcal{P}} \mathsf{Visit}_{\mathcal{P}}\left(n,m\right) \right\} \\ \mathsf{FRuns}\left(\gamma\right)\left(n\right) &:= \bigcup_{\lfloor \frac{n}{\nu} \rfloor + 1 \le m \le n} \left\{ \rho \in \mathsf{Runs}\left(\gamma\right) \mid \rho \models_{\mathcal{P}} \mathsf{Visit}_{\mathcal{P}}\left(n,m\right) \right\} \end{aligned}$$

Fig. 5. Configuration sequences of S-, F-, and D-runs. Green dots represent small configurations, blue dots large configurations. All runs start in a small (plain) configuration. Within the first n configurations, an S-run visits Γ_P^small at most ⌊n/ν⌋ times, while F- and D-runs visit Γ_P^small at least ⌊n/ν⌋ + 1 times. A D-run is a special case of an F-run that does not visit the label l (red dot) in the first n steps.

The value ⌊n/ν⌋ distinguishes the S-Runs from the F-Runs. Our goal is to give an upper bound on the measure of the S-Runs. For a prefix path π of length n, there are (n−1 choose m−1) ways to choose the m − 1 indices along π at which Γ_P^small is reached (since the run starts from Γ_P^small). Each of the m − 1 path fragments between these indices represents one consecutive revisit of Γ_P^small. By Lemma 8, the measure of the set of such runs is bounded by (G_P)^{n−m} = ((2√2)/3)^{n−m}, giving

$$\mathit{Prob}_{\mathcal{P}}\left(\mathsf{SRuns}\left(\gamma\right)\left(n\right)\right) \leq \sum_{m=1}^{\lfloor \frac{n}{\nu} \rfloor} \binom{n-1}{m-1} \cdot \mathcal{G}_{\mathcal{P}}^{\,n-m} \leq \left(\frac{\sqrt{8}}{3} \cdot \left(\frac{\nu}{\nu-1}\right) \cdot \left(2 + \sqrt{3} \cdot \nu\right)^{\frac{1}{\nu}}\right)^{n}$$

under the condition that 4 ≤ 2·ν ≤ n. The second inequality is obtained through algebraic manipulations using G_P = (2√2)/3. Define f(x) := (√8/3) · (x/(x−1)) · (2 + √3·x)^{1/x}. We have f(150) ≈ 0.986 < 1. Hence, for the parameter ν := 150, defining E_P^S := f(ν), we have the following lemma, where the bound decays exponentially with n since E_P^S < 1.
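As a quick numeric sanity check of the claim f(150) < 1, here is a sketch in Haskell using the reconstruction of f above:

```
-- f as defined above; Double arithmetic suffices for this check.
f :: Double -> Double
f x = (sqrt 8 / 3) * (x / (x - 1)) * (2 + sqrt 3 * x) ** (1 / x)

main :: IO ()
main = print (f 150)  -- prints a value ≈ 0.985, safely below 1
```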

Lemma 9 (S-Run Bound) Prob_P(SRuns(γ)(n)) ≤ (E_P^S)^n, for all γ ∈ Γ_P^small and all n such that 2·ν = 300 ≤ n.

Reaching the label l We now turn our attention to the set of F-Runs. Our goal is to show that if an F-Run reaches l then, with a high probability, it will reach l "quickly". To that end, we consider the opposite scenario and introduce a subset of the F-Runs which we call Delayed Runs (D-Runs):

$$\mathsf{DRuns}\left(\gamma\right)\left(\mathtt{l}\right)\left(n\right) := \bigcup_{m=\lfloor\frac{n}{\nu}\rfloor+1}^{n}\left\{\rho \in \mathsf{Runs}\left(\gamma\right) \mid \rho \models_{\mathcal{P}} \Diamond^{\geq n}\,\mathtt{l} \,\wedge\, \mathsf{Visit}_{\mathcal{P}}\left(n,m\right)\right\}$$

A D-Run is an F-Run that delays its first visit to the label l until at least the n-th step. We show that the measure of D-Runs decreases as n increases. Note that l is reachable from every configuration along a path that ends at l. Therefore, we consider the set A := {γ ∈ Γ_P^small | γ ⊨_P ∃♦l} of small configurations from which l is reachable. We analyze how often a run starting from a small configuration visits A before finally visiting the label l. For sets of configurations G_1, G_2 ⊆ Γ_P, a run ρ, and m ∈ N, we write ρ ⊨ G_1 Before^m G_2 to denote that ρ visits the set G_1 at least m times before visiting G_2 for the first time. Notice that

$$\mathsf{DRuns}\left(\gamma\right)\left(\mathtt{l}\right)\left(n\right) \subseteq \bigcup_{m=\lfloor\frac{n}{\nu}\rfloor+1}^{n} \left\{\rho \in \mathsf{Runs}\left(\gamma\right) \mid \rho \models_{\mathcal{P}} \mathcal{A}\ \mathsf{Before}^{m}\,\mathtt{l}\right\}\tag{2}$$

To upper bound the measure of D-Runs, we start by upper bounding the measure of the set {ρ ∈ Runs(γ) | ρ ⊨_P A Before^m l}, i.e. the γ-runs making m visits to A before visiting l. We consider the probability that a run from a small configuration γ visits l before returning to γ. We can compute a µ such that

$$0 < \mu \le \min_{\gamma \in \mathcal{A}} \mathit{Prob}_{\mathcal{P}}\left(\gamma \models_{\mathcal{P}} \bigcirc\left(\mathtt{l}\ \mathsf{Before}^{1}\,\gamma\right)\right) \tag{3}$$

Hence µ is a lower bound on the measure of runs that start from some configuration γ ∈ A and visit l before returning to γ. To obtain an upper bound on the measure of D-Runs, we show the following inequality:

$$\mathit{Prob}_{\mathcal{P}}\left(\mathsf{DRuns}\left(\gamma\right)\left(\mathtt{l}\right)\left(n\right)\right) \leq \sum_{m=\lfloor\frac{n}{\nu}\rfloor+1}^{n} \sum_{\gamma' \in \mathcal{A}} \left(1-\mu\right)^{\lceil\frac{m}{|\mathcal{A}|}\rceil-1} \leq \frac{|\mathcal{A}|}{(1-\mu)\cdot\left(1-(1-\mu)^{\frac{1}{|\mathcal{A}|}}\right)} \cdot \left((1-\mu)^{\frac{1}{\nu\cdot|\mathcal{A}|}}\right)^{n}$$

The first inequality follows from formulas (2) and (3), while the second is obtained through algebraic manipulations. Define E_P^D such that (1−µ)^{1/(ν·|A|)} < E_P^D < 1. Such an E_P^D is computable since ν, A, and µ are computable. Since (1−µ)^{1/(ν·|A|)} < E_P^D, it follows that there is a natural number, denoted η_P^D, such that

$$\frac{|\mathcal{A}|}{(1-\mu)\cdot\left(1-(1-\mu)^{\frac{1}{|\mathcal{A}|}}\right)} \cdot \left((1-\mu)^{\frac{1}{\nu\cdot|\mathcal{A}|}}\right)^{n} \leq \left(\mathcal{E}_{\mathcal{P}}^{\mathsf{D}}\right)^{n}$$

for all n ≥ η_P^D. This gives the following lemma.

Lemma 10 (D-Run Bound) Prob_P(DRuns(γ)(l)(n)) ≤ (E_P^D)^n, for all γ ∈ Γ_P^small and all n ≥ η_P^D.

Proof of Lemma 7 We now give a sketch of the proof of the eagerness property.

Choose a value E_P^SD such that max(E_P^S, E_P^D) < E_P^SD < 1. From Lemma 9 and Lemma 10 it follows that, for some constant η_P^SD > max(η_P^D, 300), we have Prob_P(γ ⊨_P ♦^{=n} l) ≤ (E_P^SD)^n for all n > η_P^SD. The final step is to extend the argument to the set of γ-runs that reach l in n or more steps (as required by Lemma 7):

$$\mathit{Prob}_{\mathcal{P}}\left(\gamma \models_{\mathcal{P}} \Diamond^{\geq n}\,\mathtt{l}\right) = \sum_{k=n}^{\infty}\mathit{Prob}_{\mathcal{P}}\left(\gamma \models_{\mathcal{P}} \Diamond^{=k}\,\mathtt{l}\right) \leq \sum_{k=n}^{\infty} \left(\mathcal{E}_{\mathcal{P}}^{\mathsf{SD}}\right)^{k} = \frac{\left(\mathcal{E}_{\mathcal{P}}^{\mathsf{SD}}\right)^{n}}{1 - \mathcal{E}_{\mathcal{P}}^{\mathsf{SD}}}$$

Choose E_P (which exists since E_P^SD < 1) such that E_P^SD < E_P < 1. There exists an η_P such that (E_P^SD)^n / (1 − E_P^SD) ≤ (E_P)^n for all n ≥ η_P, and hence Prob_P(γ ⊨_P ♦^{≥n} l) ≤ (E_P)^n for all n ≥ η_P. This gives us the result.

#### 8.3 The Algorithm

We now describe the algorithm, whose goal is to approximate E(X_{γ_init,l,Cost} | γ_init ⊨ ∃♦l). The scheme it follows is similar to that of the quantitative section: it iteratively improves an approximation until it is ε-precise. The implementation, however, is much more challenging, since we need to maintain error margins on both the costs and the probabilities. The algorithm performs a forward reachability analysis, starting from γ_init and generating successively longer γ_init-paths in a breadth-first manner.

The variable waiting contains triples of the form ⟨γ, ψ, φ⟩, each corresponding to a γ_init-path waiting to be analysed. For such a path π, γ is the last configuration of π, ψ is the cost of π, and φ is the probability of taking π. We initialize waiting to contain a single triple corresponding to the empty path from γ_init, namely ⟨γ_init, 0, 1⟩. Prior to the i-th iteration of the loop (line 10), waiting contains the triples corresponding to paths of length i. In each iteration, the triples in waiting are analysed and the triples for paths one step deeper are generated for the next iteration.

```
Algorithm: Solving Exp Ave Cost
Input: P: program; γ_init ∈ Γ_P: configuration; l ∈ Lbl_P: label with γ_init |= ∃♦l;
       Cost: Lbl_P → N>0: cost function; ε ∈ R>0: precision;
 1 Var
 2   waiting, waiting' ∈ (Γ_P × R × R)*: queues;
 3   CostApprx ∈ R: approximation of E(X_{γ_init,l,Cost});
 4   ProbApprx ∈ R: under-approximation of Prob_P(γ_init |=_P ♦l);
 5   CostError ∈ R, ProbError ∈ R: over-approximations of errors;
 6   k, n ∈ N;
 7 k := MaxCost(Cost); n := 0;
 8 CostApprx := 0; ProbApprx := 0; waiting := ⟨γ_init, 0, 1⟩;
 9 CostError := k / (1 − E_P)^2; ProbError := 1 / (1 − E_P);
10 repeat
11   n := n + 1; waiting' := ∅;
12   for i = 1 to |waiting| do
13     ⟨γ, ψ, φ⟩ := waiting[i];
14     if l ∈ γ then
15       CostApprx := CostApprx + ψ·φ; ProbApprx := ProbApprx + φ;
16     else
17       for all γ' : γ −→_P γ' do
18         waiting' := waiting' · ⟨γ', ψ + Cost(γ, γ'), φ · M_P(γ, γ')⟩;
19   CostError := CostError · E_P; ProbError := ProbError · E_P;
20   waiting := waiting';
21 until (CostApprx + CostError)/ProbApprx − CostApprx/(ProbApprx + ProbError) < ε
       ∧ (ProbApprx > 0) ∧ (n ≥ η_P);
22 return CostApprx / (ProbApprx + ProbError)
```
Fig. 6. The expected average cost algorithm.
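For readers who prefer executable notation, the following Haskell sketch mirrors the loop structure of Fig. 6 under stated assumptions: `succs` (the one-step successors with transition costs and probabilities M_P), `atLabel`, and the constants `eP`, `etaP`, and `k` are hypothetical inputs supplied from elsewhere, not definitions given in the paper's text.

```
-- A minimal sketch of the Exp Ave Cost loop; one 'go' call corresponds to
-- one iteration of the repeat-until loop in Fig. 6.
expAveCost
  :: (cfg -> [(cfg, Double, Double)])  -- successors γ' with (Cost(γ,γ'), M_P(γ,γ'))
  -> (cfg -> Bool)                     -- does the configuration carry label l?
  -> Double -> Int                     -- eagerness degree E_P and threshold η_P
  -> Double -> Double                  -- k = MaxCost(Cost) and precision ε
  -> cfg                               -- initial configuration γ_init
  -> Double
expAveCost succs atLabel eP etaP k eps gInit =
    go 1 [(gInit, 0, 1)] 0 0 (k / (1 - eP) ^ (2 :: Int)) (1 / (1 - eP))
  where
    go n waiting costApprx probApprx costErr probErr
        | done      = costApprx' / (probApprx' + probErr')
        | otherwise = go (n + 1) waiting' costApprx' probApprx' costErr' probErr'
      where
        -- paths that have reached l contribute to the approximations (lines 14-15)
        hits       = [ (psi, phi) | (g, psi, phi) <- waiting, atLabel g ]
        costApprx' = costApprx + sum [ psi * phi | (psi, phi) <- hits ]
        probApprx' = probApprx + sum [ phi      | (_,   phi) <- hits ]
        -- all other paths are extended by one step (lines 17-18)
        waiting'   = [ (g', psi + c, phi * p)
                     | (g, psi, phi) <- waiting, not (atLabel g)
                     , (g', c, p) <- succs g ]
        costErr'   = costErr * eP          -- line 19
        probErr'   = probErr * eP
        done       = probApprx' > 0 && n >= etaP
                  && (costApprx' + costErr') / probApprx'
                   - costApprx' / (probApprx' + probErr') < eps
```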

The iterations calculate increasingly precise approximations of E(X_{γ_init,l,Cost}) and of Prob_P(γ_init ⊨_P ♦l), maintained in the variables CostApprx and ProbApprx, respectively. We maintain two additional variables, CostError and ProbError, that provide upper bounds on the estimation errors. Defining MaxCost(Cost) := max {Cost(l) | l ∈ Lbl_P}, we explain the correctness of the algorithm through a number of invariants.

Lemma 11 The algorithm maintains the following invariants, where invariants (1, 2, 5, 6) hold for all i > 0 and invariants (3, 4) hold for all i ≥ η_P.


Invariants 5 and 6 imply that CostError^(i) and ProbError^(i) tend to 0 as i → ∞. Hence

$$\lim_{i\to\infty} \left(\frac{\mathsf{CostApprx}^{(i)}+\mathsf{CostError}^{(i)}}{\mathsf{ProbApprx}^{(i)}} - \frac{\mathsf{CostApprx}^{(i)}}{\mathsf{ProbApprx}^{(i)}+\mathsf{ProbError}^{(i)}}\right) = 0,$$

implying termination. Since n ≥ η_P when the algorithm terminates, invariants 3 and 4 imply that CostApprx^(n) ≤ E(X_{γ,l,Cost}) ≤ CostApprx^(n) + CostError^(n) and ProbApprx^(n) ≤ Prob_P(γ ⊨_P ♦l) ≤ ProbApprx^(n) + ProbError^(n). Combining these two inequalities with the termination condition of the algorithm, we get the following:

$$\frac{\mathsf{CostApprx}^{(n)}}{\mathsf{ProbApprx}^{(n)} + \mathsf{ProbError}^{(n)}} \le \frac{E(X_{\gamma,\mathtt{l},\mathsf{Cost}})}{\mathit{Prob}_{\mathcal{P}}\left(\gamma \models_{\mathcal{P}} \Diamond\,\mathtt{l}\right)} < \frac{\mathsf{CostApprx}^{(n)}}{\mathsf{ProbApprx}^{(n)} + \mathsf{ProbError}^{(n)}} + \varepsilon$$

Hence, on termination, θ := CostApprx^(n) / (ProbApprx^(n) + ProbError^(n)) is within ε of the true value, implying the correctness of the algorithm. We get the following theorem.

Theorem 5. The above algorithm solves Exp Ave Cost.

### 9 Conclusions, Discussions, and Perspectives

We presented PTSO, a probabilistic extension of the classical TSO semantics. We have shown decidability/computability results for a wide range of properties, such as quantitative and qualitative reachability, repeated reachability, and expected average costs. As far as we know, this is the first study of probabilistic verification for weak memory models, and it opens many avenues for future work.

Refined Probability Distributions. For ease of presentation, we developed our results in the context of specific scheduling and update policies. However, we emphasize that our results carry over to policies satisfying faithfulness and left-orientedness, which are fairly weak conditions. Hence, we believe that developing more refined models that better capture the behaviours of TSO implementations, using techniques such as parameter estimation, is interesting future work.

General Cost Models. Similar observations apply to cost models: our algorithm works for all cost functions such that the cost of a path is exponentially bounded by its length. In particular, developing cost models that closely mimic the usage of processor resources, e.g. costs based on reads from the local store buffer vs. reads from memory, can be useful for gaining a better understanding of an implementation.

Other Memory Models. Finally, we are interested in extending our approach to other weak memory models such as RA/SRA, POWER, and ARM.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Linearity and Uniqueness: An Entente Cordiale

Daniel Marshall<sup>1</sup> (✉), Michael Vollmer<sup>1</sup>, and Dominic Orchard<sup>1,2</sup>

<sup>1</sup> University of Kent, Canterbury, UK — {dm635,m.vollmer,d.a.orchard}@kent.ac.uk
<sup>2</sup> University of Cambridge, UK

Abstract. Substructural type systems are growing in popularity because they allow for a resourceful interpretation of data which can be used to rule out various software bugs. Indeed, substructurality is finally taking hold in modern programming; Haskell now has linear types roughly based on Girard's linear logic but integrated via graded function arrows, Clean has uniqueness types designed to ensure that values have at most a single reference to them, and Rust has an intricate ownership system for guaranteeing memory safety. But despite this broad range of resourceful type systems, there is comparatively little understanding of their relative strengths and weaknesses or whether their underlying frameworks can be unified. There is often confusion about whether linearity and uniqueness are essentially the same, or are instead 'dual' to one another, or somewhere in between. This paper formalises the relationship between these two well-studied but rarely contrasted ideas, building on two distinct bodies of literature, showing that it is possible and advantageous to have both linear and unique types in the same type system. We study the guarantees of the resulting system and provide a practical implementation in the graded modal setting of the Granule language, adding a third kind of modality alongside coeffect and effect modalities. We then demonstrate via a benchmark that our implementation benefits from expected efficiency gains enabled by adding uniqueness to a language that already has a linear basis.

Keywords: linear types · uniqueness types · substructural logic

### 1 Introduction

Linear types [15, 57] and uniqueness types [5, 47] are two influential and longstanding flavours of substructural type system. As these approaches have developed, it has become clear in the community (both in folklore and the literature) that these are closely related ideas. For example, the chapter on substructurality in Advanced Topics in Types and Programming Languages [62] describes uniqueness types as "a variant of linear types". This framing is supported by various works which, for example, make reference to "a form of linearity (called uniqueness)" [33] or other such statements of equality or similarity [38].

But reading a different set of papers gives a contrasting impression that linearity and uniqueness are not the same but in some sense dual to one another, and with different behaviour for at least some applications. Recent work on linear types for Haskell [7] describes the two concepts as being "at their core, dual" and later having a "weak duality". The impression that these two approaches behave differently is backed up by much of the theoretical work on uniqueness types, with one paper stating that "although both linear logic and uniqueness typing are substructural logics, there are important differences" [56], closely followed by a tantalising mention of the fact that "some systems based on linear logic are much closer to uniqueness typing than to linear logic".

It is clear, at least, that both linear types and uniqueness types are substructural type systems: they both restrict structural rules (in particular, contraction and weakening) of type systems that are the Curry-Howard counterparts to regular intuitionistic logic. This captures the well-known maxim that "not all things in life are free" [61]; many kinds of data behave resourcefully, and are subject to constraints on their usage. Sensitive data should not be infinitely duplicated and passed around freely, file handles should not be arbitrarily discarded without being properly closed, and communication channels should not be used without adherence to a fixed protocol, to name a few!

Thanks to these clear benefits, notions of substructurality are slowly but surely making their way into the programming ecosystem, with languages such as Haskell [7], Idris [9], Clean [47], Rust [24], ATS [65], and Granule [36] all having type systems that behave substructurally in some way. What is not clear, however, is what exactly the relationship is between these varying systems; for instance, it is not obvious how to relate linearity and uniqueness. Linear types, though they themselves come in various forms, are most often based on the linear logic of Girard [15], and in the strictest sense they treat values as resources which must be used exactly once and never again. On the other hand, uniqueness types are named as such because they aim to ensure that values are guaranteed to have at most one reference to them [40,45,47,48,55,56], with a view towards allowing them to be safely updated in-place. Do these two requirements always coincide, or are there cases where they diverge?

In this paper, we resolve this long-standing confusion, building on two distinct bodies of literature to develop an accurate understanding of the contexts in which linear and uniqueness types behave the same or behave differently, and their relative strengths and weaknesses. Our primary contributions are as follows:


monadic (effectful) modalities already present in the language. The implementation enables the classic primary use of uniqueness: access to safe inplace update in a functional language without working inside a monad.

– In Section 4.2, we confirm the performance benefits of uniqueness types by benchmarking the performance of arrays which allow for in-place update. We generate impure Haskell code from our Granule implementation in order to demonstrate that further efficiency can be gained by adding uniqueness types even when the language is already linear at its core.

Section 5 and Section 6 provide related work and discussion, including relation to ideas in Rust. Various additional details are collected in the appendix [28], including proofs and collected reduction rules for the operational semantics. We also provide an artifact [29], so that the interested reader can experiment with code examples in Granule and reproduce our benchmarks for themselves.

### 2 Key Ideas

It is clear that linear and uniqueness types both involve restricting the substructural rules of intuitionistic logic, but what remains unclear is the exact relationship between the two concepts. This section discusses two widespread understandings of their relationship, both of which are accurate in some respects but fail to capture some key similarities and differences. We then combine aspects of both viewpoints to systematically relate linearity and uniqueness.

### 2.1 Are linearity and uniqueness (essentially) the same?

Perhaps the most well-known substructural types are linear types, which have been studied for decades in the literature [57, 62] as the Curry-Howard counterpart of linear logic [15]. Several languages have implemented linear type systems over the years, including ATS [65], Alms [52] and Quill [32], and they are steadily making their way into the mainstream via extensions to languages like Haskell [7]. Examples of linearity in this paper will focus on the functional language Granule [36] (whose syntax resembles Haskell), since values in Granule are linear by default, making the examples less complex, and also because Granule will later be the foundation upon which we build our unified calculus.

Strictly, linear types treat values as resources which must be used once and then never again. For instance, we can type the identity function, since it binds a single variable and then uses it, but the K combinator (which discards one of its arguments) is not linearly typed. Thus linearity is a claim about the consumption of a resource: a linear type is a contract, which says that we must consume a value that we are given exactly once. Consider the following classic example (rendered in Granule) of a function which cannot be represented using linear types, assuming an interface where eat : Cake → Happy and have : Cake → Cake:

```
impossible : Cake → (Happy, Cake)
impossible cake = (eat cake, have cake)
```

Ill-typed Granule

Note that Granule's function type → is the type of linear functions, more traditionally written (. The above function is ill-typed and the Granule compiler will brand it with a linearity error; this is because the value of type Cake passed into the function is a linear resource, and the body of the function requires us to duplicate it (via contraction), which is forbidden. Thus, linear types remind us of the familiar aphorism: you can't have your cake and eat it too.

Uniqueness types, on the other hand, are primarily aimed at ensuring that values have only a single reference to them, which is a useful property for ensuring the safety of updating data in-place. But is this uniqueness restriction really so different from the constraints of linearity?

One of the most familiar languages featuring uniqueness types is Clean [47], which uses uniqueness for mutable state and input/output, in contrast to languages such as Haskell which use monads for similar purposes. We shall use Clean for our uniqueness examples for the moment, before we introduce our own implementation of uniqueness in Section 4. Consider the following in Clean:

```
impossible :: *Coffee -> (*Awake, *Coffee)
impossible coffee = (drink coffee, keep coffee)
```

Ill-typed Clean

We use coffee instead of cake to distinguish unique values from the linear values of the Granule example, but notice this function has exactly the same structure as the previous example. The operator ∗ denotes a unique type, since unrestricted values are the default in Clean. Similarly to Granule, when presented with this function Clean gives a uniqueness error; the argument of type \*Coffee is duplicated, and so we can no longer guarantee there is only one reference to it upon exiting the function. Think of a \*Coffee as having been freshly poured; we cannot continue acting as though it is fresh once some of it has been drunk!

So far, it seems that the concepts of linearity and uniqueness are very similar after all, as is often claimed. However, neither of these examples uses unrestricted values; we only see values that are linearly typed or uniquely typed. In fact, in a setting where all values must be linear, we can also guarantee that every value is unique, and vice versa! Intuitively, if it is never possible to duplicate a value, then it will never be possible for said value to have multiple references. It is when we also have the ability for unrestricted use (non-linear/non-unique) that differences between linearity and uniqueness begin to arise, as we will soon see.

Much of the classic literature on linear types makes mention of the idea that linearity can be used for tracking whether a value has only one reference, though we know by now that this more accurately describes uniqueness; indeed, one of the oldest such papers by Wadler, which has been (rightly) hugely influential, states that "values of linear type have exactly one reference to them, and so require no garbage collection" [57, p.2]. However, systems akin to the one discussed by Wadler [4,18,46] crucially separate values into two completely distinct and (mostly) disconnected linear and non-linear worlds. In this context, a linear value can never have been duplicated previously and thus must also obey the conditions required for uniqueness. Therefore, it is correct to say that a value of linear type has exactly one reference in such a system.

This issue is further discussed in a later article by Wadler [58], though uniqueness types had yet to be invented and so the concept is never referred to by this name. Linear types based on linear logic are defined in Section 3 of said article, for which linearity behaves as we understand it: a value having linear type guarantees that it will not be duplicated or discarded, but the notion of dereliction allows a non-linear variable to be used linearly going forwards [15]. As Wadler states, "dereliction means we cannot guarantee a priori that a variable of linear type has exactly one pointer to it" [58, p.7], and so we cannot guarantee uniqueness of reference in a system based upon linear logic. In Section 7, Wadler goes on to define steadfast types, where dereliction and promotion are again restricted to recover the uniqueness guarantee in addition to the linearity restriction<sup>3</sup>.

However, never being able to duplicate or discard any value is an overly restrictive view of data, preventing many valid uses of various kinds of information, and so modern languages with linear types therefore generally do provide a mechanism for non-linearity rather than working in the 'steadfast' style. Linear logic provides the ! modality (also called the exponential modality), which allows the representation of non-linear (unrestricted) values. In Granule, we can use this modality to rewrite the previous example into one that is now well-typed:

```
possible : !Cake → (Happy, Cake)
possible lots = let !cake = lots in (eat cake, have cake)
```

Granule
We can think of !Cake values as representing an infinite amount of cake, which is made available once we eliminate the modality (via the let) to get an unrestricted (non-linear) variable cake. The functions eat and have are linear functions, so each application in isolation views cake linearly, by an implicit use of dereliction in the type system. Crucially, from an unrestricted value we can produce a linear value, so we can impose the restriction of linearity whenever we like. However it is not possible to produce an unrestricted value from a linear one. This restriction means that linear types are useful for representing resources such as file handles, as in the following example:

```
twoChars : (Char, Char) <IO>
twoChars = let -- do-notation like syntax
      h ← openHandle ReadMode "someFile";
      (h, c1) ← readChar h;
      (h, c2) ← readChar h;
      () ← closeHandle h
  in pure (c1, c2)
```

Granule
<sup>3</sup> The concept of steadfastness coincides with the notion of "necessarily unique" used in languages such as Clean, where a necessarily unique value is one that is unique and also can never be made non-unique [47].

Here, we open a file handle, read two characters from it, and then close it. The linearity of the handle ensures that once we have created it, we must close it properly, and also that we cannot duplicate it along the way. But linearity is less useful in other circumstances. As an example, consider the case of mutable arrays. Discarding an array will not cause any problems,<sup>4</sup> so a linear array would be too restrictive and not allow for some valid use cases; affine types allow discarding behaviour by adding back in weakening [52]. But in order to be able to mutate an array we need to be able to guarantee that no other references to it exist, and in this sense linearity is not strong enough; any linear value could have previously been a non-linear one that was duplicated any number of times before being specialised (via dereliction) to a linear type. For representing mutable arrays, we are better served by considering uniqueness types.

Uniqueness behaves differently to linearity in the context of a system with the ability to describe unrestricted values. If we have an unrestricted value, we certainly cannot produce a unique one from it which would violate the guarantee of uniqueness; we cannot claim that a value has only one reference to it when it could have been duplicated and manipulated elsewhere. But conversely, if we have a unique value, there is no harm in dropping this guarantee and producing an unrestricted value; a non-unique value does not need to make any promises about how many references may exist. Thus, in Clean we can write:

```
possible :: *Coffee -> (Awake, Coffee)
possible coffee = (drink coffee, keep coffee)
```

Clean
Here, we require that the input is unique (it has type \*Coffee), so for the function to be well-typed we can no longer claim this value is unique once it reaches the output, as it has been duplicated along the way (and it now must have type Coffee). The information here is flowing in the opposite direction to that for linearity; the possible function in Clean would be ill-typed if we replaced unique values with linear ones, and vice versa for the earlier Granule example. This directionality allows us to represent mutable arrays much more easily with uniqueness. For example, the following destructively fills a real-valued array:

```
fill :: *{Real} Int -> *{Real}
fill a1 0 = a1
fill a1 i
  # f = toReal i
  # a2 = {a1 & [i - 1] = f} // write f to index i-1 in a1
  = fill a2 (i - 1)         // recurse with unique array a2
```

Clean
<sup>4</sup> One may worry that discarding an array could cause space leaks, but this can be tempered via garbage collection. If a non-linear value will no longer be used we know statically that it can be garbage collected, and thus it is harmless to reuse the space occupied by this value going forwards. This will allow us to update unique objects such as arrays destructively without being concerned about referential transparency.

Here, we take in a unique array of floating point numbers and an unrestricted integer value, and fill the first cells of the array with the numbers up to that value. We know that it is safe to write to the array because it is unique, so no other references to it can exist elsewhere; once we are finished with the array later on, however, it is fine to discard it, as with arrays in most other functional programming languages, which would not be possible if our array were linearly typed. This does mean, however, that uniqueness types are not appropriate for the earlier example of file handles: we cannot ensure that a unique file handle is closed, since it can be discarded at any time.

In summary, linearity and uniqueness provide the same guarantees up until a system also has a notion of unrestricted value (non-linear or non-unique). The complementary but distinct use cases shown above make it clear that it would be valuable to have both linear and unique values together in a single language, but this has previously not been possible. Our main contribution is a core calculus that allows linearity and uniqueness to coexist and interact, demonstrated also via an implementation in the Granule language. Next we consider the question of duality, and how to formally describe how linearity and uniqueness differ.

#### 2.2 Are linearity and uniqueness dual?

It is common in folklore and in the literature to describe linearity and uniqueness as somehow dual to one another (see e.g., [32, 52]) but rigorous versions of this statement are more rarely found. The earliest formalisation of uniqueness is from Harrington's 'uniqueness logic' [20], which we use as a foundation for much of the following. Harrington constructs a logic which is on the surface much like linear logic, but instead of the ! modality for non-linearity it includes a ◦ modality for non-uniqueness which differs from non-linearity in its introduction rule.

In linear logic, the ! modality acts as a comonad, such that the introduction of ! on the right of a sequent means that all formulae on the left of a sequent must also have ! applied, whilst introduction of ! on the left is unrestricted:

$$\frac{!\Gamma \vdash P}{!\Gamma \vdash\, !P}\,!_{\mathrm{R}} \qquad \frac{\Gamma, P \vdash Q}{\Gamma, !P \vdash Q}\,!_{\mathrm{L}}$$

(also known as storage and dereliction respectively [16]). In contrast, the nonuniqueness modality ◦ of Harrington acts as a monad, meaning that introduction of ◦ on the right is unrestricted but introduction of ◦ on the left of a sequent means that all formulae on the right of the sequent must also have ◦ applied:

$$\frac{\Gamma \vdash P}{\Gamma \vdash P^{\circ}}\,{\circ}_{\mathrm{R}} \qquad \frac{\Gamma, P \vdash Q^{\circ}}{\Gamma, P^{\circ} \vdash Q^{\circ}}\,{\circ}_{\mathrm{L}}$$

The non-uniqueness modality ◦ then has the following structural rules for contraction and weakening, which are conspicuously identical to those for the ! modality representing non-linearity:

$$\frac{\Gamma, P^{\circ}, P^{\circ} \vdash R}{\Gamma, P^{\circ} \vdash R}\,{\circ}_{\mathrm{C}} \quad \frac{\Gamma \vdash R}{\Gamma, P^{\circ} \vdash R}\,{\circ}_{\mathrm{W}} \qquad\qquad \frac{\Gamma, !P, !P \vdash R}{\Gamma, !P \vdash R}\,!_{\mathrm{C}} \quad \frac{\Gamma \vdash R}{\Gamma, !P \vdash R}\,!_{\mathrm{W}}$$

One might be tempted to think that because the introduction rules for ◦ behave dually to those of !, the modalities are simply dual to one another, and thus non-uniqueness is equivalent to linear logic's ?. But since the contraction and weakening rules for ◦ are the same as those for !, this is not quite the case; ◦ behaves dually to ! in some ways but not in others. Formally, ◦ is a monad while ! is a comonad, but both are comonoidal, whereas ? is monoidal.

Linear logic allows us to derive !P ⊢ P (from dereliction), which agrees with our notion of linearity: non-linear values can be restricted to behave linearly going forwards, but if we have a linear value it must remain linear. Uniqueness logic conversely allows us to derive P ⊢ P°, formalising our concept of uniqueness: we can forget the uniqueness guarantee and turn a unique value into a non-unique one, but if we have a non-unique value we cannot go back.
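Both derivations are immediate from the rules above; spelling them out makes the asymmetry visible:

$$\frac{\dfrac{}{P \vdash P}\;\mathrm{id}}{!P \vdash P}\;!_{\mathrm{L}} \qquad\qquad \frac{\dfrac{}{P \vdash P}\;\mathrm{id}}{P \vdash P^{\circ}}\;{\circ}_{\mathrm{R}}$$

In each case the converse sequent (P ⊢ !P, respectively P° ⊢ P) is not derivable: !_R requires every formula on the left to be banged, and ◦_L requires the formula on the right to be marked non-unique.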

We can now make more precise the intuitive notion we have developed, which suggests that linear types provide a restriction on what can be done with a value 'in the future' whilst uniqueness types provide a guarantee about what has been done with a value 'in the past'. The distinction becomes clearer when we consider substitutions, which are generated by β-reductions.

Substitutions are the same whether we are working with linear logic or uniqueness logic, as the rules for functions are identical, but the difference arises when thinking about what it is possible to know about a value in one logic compared to the other. Given a linear value, we know that substituting this value into an expression will preserve linearity, as there is no way to transform a linear value into a non-linear one. Conversely, given a unique expression then we know that any values substituted in will not affect the uniqueness guarantee, as there is no way to transform a non-unique value into a unique one. Thus 'future' refers to outgoing substitutions, while 'past' refers to incoming substitutions.

So if linearity and uniqueness do in fact behave the same in some ways but not all, and they do in fact behave dually in some ways but not all, then what is the overall takeaway? What statement can we make about the relationship between their behaviour that reconciles these two viewpoints?

#### Takeaway. Linearity and uniqueness behave dually with respect to composition, but identically with respect to structural rules, i.e., their internal plumbing.

In other words, internally the non-linear and non-unique modalities are both comonoidal, so they allow for the same behaviour of contraction and weakening for values that are wrapped inside them.

But the duality arises upon considering how we can map into and out of these modalities; we can map out of the non-linear modality and retrieve a linear value, but we can never map into it, giving the modality its familiar comonadic structure. Conversely, we can map a unique value into the non-unique modality to allow for contraction and weakening, but we can never map back out of it, which explains the dual monadic behaviour of this modality.

It is this understanding of the similarities and differences between linearity and uniqueness that will allow us to unify them, and have values of both flavours present in a single type system, which will be our goal for the next section.

### 3 The Linear-Cartesian-Unique Calculus

We now consider how to represent both linearity and uniqueness in the same system. The first choice to make is whether our base values will be linear or unique<sup>5</sup> , as this will influence the directionality of the modalities we need to include in the calculus. Here we present a system where linearity is the base and uniqueness is a modality, as opposed to one where uniqueness is the base and linearity is a modality, for two reasons.


Given a linear basis, we formalise the idea that we can map from unique to non-unique and from non-linear to linear. The key insight is that we treat non-linearity and non-uniqueness as the same state, since both are unrestricted: we can do anything we like with, and have no guarantees about, an unrestricted value. We write ∗P for a P with a uniqueness guarantee, similar to the syntax of Clean and avoiding confusion with Harrington's ◦ modality for non-uniqueness. The resulting calculus, which we call the Linear-Cartesian-Unique calculus (or LCU for short), builds on (intuitionistic multiplicative exponential) linear logic with additional rules for uniqueness.

<sup>5</sup> We choose a substructural basis over an unrestricted one since this more closely maps to both linear and uniqueness logic, where values have substructural behaviour by default unless they are wrapped in a modality.

<sup>6</sup> A similar problem arises from the application of unique functions, and this has been a thorn in the side of developers of uniqueness type systems for some time. The solution applied in Clean is that any function with unique elements in its closure is "necessarily unique", meaning it cannot be subtyped into a non-unique function and applied multiple times. Handily, this coincides with the notion of a linear function, which is why our calculus having a linear base also avoids this problem.

Syntax LCU's syntax is that of the linear λ-calculus with multiplicative products and unit (first line of syntax below) with terms for introducing and eliminating the ! modality and working with the uniqueness modality (second line):

$$\begin{aligned} t \;::=\;& x \mid \lambda x.t \mid t_1\, t_2 \mid (t_1, t_2) \mid \mathsf{let}\,(x, y) = t_1\;\mathsf{in}\; t_2 \mid \mathsf{unit} \mid \mathsf{let}\,\mathsf{unit} = t_1\;\mathsf{in}\; t_2 \\ &\mid\; !t \mid \mathsf{let}\,!x = t_1\;\mathsf{in}\; t_2 \mid \&t \mid \mathbf{copy}\; t_1\; \mathbf{as}\; x\; \mathbf{in}\; t_2 \mid \ast t \end{aligned} \tag{terms}$$

The meaning is explained in the next section with reference to typing.

#### 3.1 Typing

Typing judgments are of the form Γ ⊢ t : A, with types A defined:

$$A, B ::= A \multimap B \mid A \otimes B \mid 1 \mid\; !A \mid \ast A \tag{types}$$

Thus our type syntax comprises linear function types A ⊸ B, linear multiplicative products A ⊗ B, a linear multiplicative unit 1, the non-linearity modality !A, and the uniqueness modality ∗A.

Typing contexts are defined as follows:

$$\Gamma ::= \emptyset \mid \Gamma, x{:}A \mid \Gamma, x{:}[A] \tag{contexts}$$

which are either empty, contexts extended with a linear assignment x : A, or contexts extended with a non-linear assignment, denoted x : [A]. This marking of assumptions in a context as linear or non-linear (see Terui [50]) is one way to guarantee substitution is admissible (avoiding, for example, issues pointed out by Wadler where substitution is not well-typed if care is not taken [59, 60], an issue noted also by Prawitz in 1965 in the context of S4 modal logic [42]).

Throughout, the comma operator , concatenates disjoint contexts.

We introduce the key typing rules inline. Figure 1 collects the full set of rules. The linear λ-calculus core is typed by the following three rules:

$$\frac{}{[\Gamma], x{:}A \vdash x : A}\;\text{var} \qquad \frac{\Gamma, x{:}A \vdash t : B}{\Gamma \vdash \lambda x.t : A \multimap B}\;\text{abs} \qquad \frac{\Gamma_1 \vdash t_1 : A \multimap B \quad \Gamma_2 \vdash t_2 : A}{\Gamma_1 + \Gamma_2 \vdash t_1\,t_2 : B}\;\text{app}$$

In the case of var, a linear variable is used but the rest of the context must be marked as non-linear, denoted by [Γ] which marks all assumptions as non-linear.

Definition 1 (All non-linear assumptions). A context Γ containing only non-linear assumptions is denoted by writing [Γ] in the typing rules, defined inductively: [∅] holds, and [Γ] implies [Γ, x : [A]].

In the case of app, the two subterms are typed in different contexts which are then combined via context addition.

Definition 2 (Context addition). The partial operation + on contexts is the union of two contexts as long as they are disjoint in their linear assumptions and any variables occurring in both contexts are both non-linear assumptions, i.e.

$$\Gamma_1 + \Gamma_2 = \Gamma_1 \cup \Gamma_2 \quad \text{iff} \quad \forall x \in \mathsf{dom}(\Gamma_1) \cap \mathsf{dom}(\Gamma_2).\; \exists A.\; \Gamma_1(x) = \Gamma_2(x) = [A]$$
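As a small illustration of this partial operation, here is a hedged Haskell sketch; the Either-based encoding of linear vs. non-linear assumptions is our own assumption for the example, not part of the calculus.

```
import qualified Data.Map as Map

-- Left a: a linear assumption x : A; Right a: a non-linear assumption x : [A]
type Ctx a = Map.Map String (Either a a)

-- Context addition is partial: it fails (Nothing) unless every shared
-- variable is a non-linear assumption with the same type on both sides.
ctxAdd :: Eq a => Ctx a -> Ctx a -> Maybe (Ctx a)
ctxAdd g1 g2
  | all bothNonLinear (Map.elems shared) = Just (Map.union g1 g2)
  | otherwise                            = Nothing
  where
    shared = Map.intersectionWith (,) g1 g2
    bothNonLinear (Right a, Right b) = a == b
    bothNonLinear _                  = False
```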

The non-linear modality ! has the following introduction and elimination rules and related dereliction rule:

$$\frac{[\Gamma] \vdash t : A}{[\Gamma] \vdash\, !t :\, !A}\;!_{I} \qquad \frac{\Gamma_1 \vdash t_1 :\, !A \quad \Gamma_2, x{:}[A] \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathsf{let}\,!x = t_1\;\mathsf{in}\;t_2 : B}\;!_{E} \qquad \frac{\Gamma, x{:}A \vdash t : B}{\Gamma, x{:}[A] \vdash t : B}\;\text{der}$$

The left-most rule captures the idea that a computation t of type A can be used non-linearly, by 'promoting' it to !A as long as all its inputs are also non-linear, denoted by [Γ] in the context. The middle rule eliminates a non-linear modality (a capability to use an A value non-linearly) by composing it with a variable x which is non-linear in t₂. These rules are accompanied by the 'dereliction' rule, which says that non-linear variables can be treated as linear variables.

So far everything is standard from other linear type systems. We now move to our uniqueness modality which has two syntactic constructs: borrow and copy:

$$\frac{\Gamma \vdash t : \ast A}{\Gamma \vdash \&t \,:\, !A}\;\text{borrow} \qquad \frac{\Gamma_1 \vdash t_1 :\, !A \quad \Gamma_2, x{:}\ast A \vdash t_2 :\, !B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{copy}\; t_1\; \mathbf{as}\; x\; \mathbf{in}\; t_2 :\, !B}\;\text{copy}$$

The borrow rule maps a unique value to a non-linear value, allowing a uniqueness guarantee to be forgotten. In terms of the operational semantics (see Section 3.4), this causes evaluation of t before the borrow. Next, the copy rule says that a non-linear value of type A can be copied to produce a unique A which is used by t2; the input is required to be non-linear so that we cannot circumvent a linearity restriction by copying a linear value, and the output is required to be non-unique so that we cannot leverage the copy to smuggle out a value which pretends to be truly unique. These rules in turn are accompanied by the 'necessitation' rule that says values can be assumed unique as long as they have no dependencies:

$$\frac{\emptyset \vdash t : A}{[\Gamma] \vdash \ast t : \ast A}\;\text{nec}$$

The borrow and copy rules in this logic suggest a monad-like relationship between the ! and ∗ modalities, with the borrow rule representing the 'return' of the monad and the copy rule likewise acting as the 'bind'. The ∗ modality is not in itself a monad (or indeed, a comonad like !); rather, it acts as a functor over which the ! modality becomes a relative monad [3]. A relative monad comprises a functor J and an object mapping T, along with an operation η : JX → TX and a mapping from JX → TY arrows to TX → TY arrows, with axioms analogous to the monad axioms. Thus, here J is the uniqueness modality ∗ and T the non-linearity modality !. If one imagines the dual version of this logic where the basis is unique, the hypothetical linearity modality would act as a functor making the non-uniqueness modality into a relative comonad [1, 37] in much the same way.
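To fix intuitions, a relative monad can be sketched as a Haskell type class (an illustrative rendering of the structure, not part of the paper's formal development), where `j` plays the role of ∗ and `t` the role of !:

```
{-# LANGUAGE MultiParamTypeClasses #-}

-- A relative monad t over a functor j: a unit η : j x -> t x and a bind
-- extending arrows j x -> t y to t x -> t y (cf. Altenkirch et al. [3]).
class RelativeMonad j t where
  runit :: j x -> t x                  -- corresponds to 'borrow'
  rbind :: t x -> (j x -> t y) -> t y  -- corresponds to 'copy ... as x in ...'
```

The equational axioms of Section 3.2 below then line up with the relative monad laws for this interface.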

#### 3.2 Equational theory

One way of understanding the meaning of the LCU calculus is to see its equational theory (which we later prove sound against its operational model). The

$$\begin{gathered}
\frac{}{[\Gamma], x{:}A \vdash x : A}\;\text{var} \qquad
\frac{\Gamma, x{:}A \vdash t : B}{\Gamma \vdash \lambda x.t : A \multimap B}\;\text{abs} \qquad
\frac{\Gamma_1 \vdash t_1 : A \multimap B \quad \Gamma_2 \vdash t_2 : A}{\Gamma_1 + \Gamma_2 \vdash t_1\,t_2 : B}\;\text{app} \\[6pt]
\frac{\Gamma_1 \vdash t_1 : A \quad \Gamma_2 \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash (t_1, t_2) : A \otimes B}\;\otimes_I \qquad
\frac{\Gamma_1 \vdash t_1 : A \otimes B \quad \Gamma_2, x{:}A, y{:}B \vdash t_2 : C}{\Gamma_1 + \Gamma_2 \vdash \mathsf{let}\,(x, y) = t_1\;\mathsf{in}\;t_2 : C}\;\otimes_E \\[6pt]
\frac{}{[\Gamma] \vdash \mathsf{unit} : 1}\;1_I \qquad
\frac{\Gamma_1 \vdash t_1 : 1 \quad \Gamma_2 \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathsf{let}\,\mathsf{unit} = t_1\;\mathsf{in}\;t_2 : B}\;1_E \qquad
\frac{\Gamma, x{:}A \vdash t : B}{\Gamma, x{:}[A] \vdash t : B}\;\text{der} \\[6pt]
\frac{[\Gamma] \vdash t : A}{[\Gamma] \vdash\, !t :\, !A}\;!_I \qquad
\frac{\Gamma_1 \vdash t_1 :\, !A \quad \Gamma_2, x{:}[A] \vdash t_2 : B}{\Gamma_1 + \Gamma_2 \vdash \mathsf{let}\,!x = t_1\;\mathsf{in}\;t_2 : B}\;!_E \\[6pt]
\frac{\Gamma \vdash t : \ast A}{\Gamma \vdash \&t :\, !A}\;\text{borrow} \qquad
\frac{\Gamma_1 \vdash t_1 :\, !A \quad \Gamma_2, x{:}\ast A \vdash t_2 :\, !B}{\Gamma_1 + \Gamma_2 \vdash \mathbf{copy}\;t_1\;\mathbf{as}\;x\;\mathbf{in}\;t_2 :\, !B}\;\text{copy} \qquad
\frac{\emptyset \vdash t : A}{[\Gamma] \vdash \ast t : \ast A}\;\text{nec}
\end{gathered}$$

Fig. 1: Collected typing rules for the LCU calculus

calculus has the standard βη-equalities for the multiplicative linear λ-calculus fragment, which includes the following βη rules for !:

$$\text{let} \, !x = !t \, \text{in} \, t' \equiv [t/x]t' \tag{\beta!}$$

$$\text{let}!x = t \operatorname{in}!x \equiv t \tag{\eta!}$$

along with the following equalities on the uniqueness fragment:

$$\mathbf{copy}\; t\; \mathbf{as}\; x\; \mathbf{in}\; \&x \equiv t \tag{unitR}$$

$$\mathbf{copy}\; \&v\; \mathbf{as}\; x\; \mathbf{in}\; t' \equiv [v/x]t' \tag{unitL}$$

$$\mathbf{copy}\; t_1\; \mathbf{as}\; x\; \mathbf{in}\; (\mathbf{copy}\; t_2\; \mathbf{as}\; y\; \mathbf{in}\; t_3) \equiv \mathbf{copy}\; (\mathbf{copy}\; t_1\; \mathbf{as}\; x\; \mathbf{in}\; t_2)\; \mathbf{as}\; y\; \mathbf{in}\; t_3 \tag{associativity}$$

The first axiom states that copying a non-linear t into a unique value x and immediately borrowing it to be non-linear is equivalent to just t. The second axiom states that borrowing a unique value v and copying it to a unique x in the scope of t′ is the same as substituting v for x. The last axiom gives associativity of copying, under the side condition that x is not free in t₃. These equations are exactly the relative monad axioms [3], though we specialise (unitL) slightly by restricting to values to account for the reduction semantics.

The typability of these axioms relies on the admissibility of linear and non-linear substitution, shown in Section 3.5 on the metatheory of the calculus.

#### 3.3 Exploiting uniqueness for mutation

A key use for ensuring uniqueness of a reference is that this allows mutation to be used safely—the original pun behind Wadler's "Linear Types can Change the World" [57]. To illustrate this idea, and consider its soundness in the next section, we extend the LCU calculus with a primitive type of arrays:

$$A ::= \ldots \mid \mathsf{Array}\; A \mid \mathbb{N} \mid \mathbb{F}$$

where N are natural numbers used for sizes and indices and F floating-point values. The calculus is also extended with operations for floating-point arrays, typed by axiomatic rules (with built-in weakening):


These operations provide the interface for exploiting unique array references, where writeArray performs mutation, as the type system guarantees that uniquely typed values have not been duplicated in the past (Section 3.5). We ignore out-of-bounds exceptions, as this is an orthogonal issue which could be solved using indexed types. We elide the rules for typing numerical terms here.
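A hypothetical Haskell rendering of this interface (types only; the names other than writeArray and the stub bodies are illustrative assumptions, not the calculus' primitives) shows how a unique array is threaded through reads and writes:

```
-- 'Unique a' stands in for *a; 'Arr' stands in for Array F.
newtype Unique a = Unique a
data Arr = Arr

newArray :: Int -> Unique Arr
newArray _ = Unique Arr                  -- stub: allocate an array of a given size

readArray :: Unique Arr -> Int -> (Double, Unique Arr)
readArray a _ = (0.0, a)                 -- stub: read a cell, returning the array

writeArray :: Unique Arr -> Int -> Double -> Unique Arr
writeArray a _ _ = a                     -- stub: safe in-place update
```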

Our implementation in Section 4 replays these ideas in a practical setting. The next section gives the operational heap model for the calculus, where the semantics of mutation is made concrete.

### 3.4 Operational heap model

We define an operational model for the LCU calculus to make the meaning of uniqueness and linearity more concrete, and to prove that our type system enforces the desired properties. The semantics is call-by-name and resembles a small-step operational semantics but instead uses a notion of heaps, both to capture the idea of a memory reference to arrays as well as to give a way to track resource usage on program variables. We adapt the model of Choudhury et al. [11], which was used to track resource usage in a pure language with graded types. Our model applies this idea to a non-graded setting, extended to include reference counting for uniqueness. To prove that linearity and uniqueness are respected (soundness), the heap semantics incorporates some typing information in order to ease the theorem statements and proofs as shown in Section 3.5.

Single-step reductions in the operational model are of the form:

$$H \vdash t \leadsto H' \vdash t' \mid \Gamma \mid \Delta \tag{single-step judgment form}$$

where H is the incoming heap, which provides bindings for the variables that appear in t and for array allocations. The result of the reduction is a new term t′ with an updated heap H′, as well as two additional pieces of information: Γ gives us a 'binding context' recording the typing of any binders that were encountered (or 'opened') during reduction, and ∆ gives us a 'usage context' containing an account of how variables were used. Usage contexts are defined as:

$$\Delta ::= \emptyset \mid \Delta, x{:}r \quad \text{(usage contexts)} \qquad r ::= 1 \mid \omega \quad \text{(usage/reference counter)}$$

where r is a usage marker that says a variable was used either once (denoted 1) or used more than once (denoted ω). Usage has a preorder ≤ where 1 ≤ ω.

We extend the syntax of terms with a value form a representing runtime array references to the heap. In order to account for their type, the syntax of contexts is extended to include assumptions a : Array A, which are treated as a different syntactic category of variables. Additional runtime typing rules for array reference terms a are provided, akin to a variable rule (see appendix [28]).

Heaps are defined as follows akin to a context but containing two kinds of 'allocations' for variables x and for array references a:

$$H ::= \emptyset \mid H, x \mapsto_r (\Gamma \vdash t : A) \mid H, a \mapsto_r \mathsf{arr} \tag{heaps}$$

In the case of extending the heap with a variable allocation for x, the heap records that x can be used according to r and that it maps to a term t, along with its typing, which is present only to aid the metatheory. For brevity, we sometimes write x ↦_r t instead of x ↦_r (Γ ⊢ t : A) when the typing is not important. In the case of an array reference a, the heap records the number of references currently held to it, where r is again used (representing either one reference 1 or many ω), and describes the heap-only array representation term arr pointed to by that reference (whose syntax we introduce later along with the relevant rules).

Multi-step reductions are composed from zero or more single-step reductions, with judgments of the form H ⊢ t ⇒ H′ ⊢ t′ | Γ | ∆, given by two rules capturing empty reduction sequences and extending a sequence at its head:

$$\frac{}{H \vdash t \Rightarrow H \vdash t \mid \emptyset \mid \emptyset}\;\text{refl} \qquad \frac{H \vdash t_1 \leadsto H' \vdash t_2 \mid \Gamma_1 \mid \Delta_1 \quad H' \vdash t_2 \Rightarrow H'' \vdash t_3 \mid \Gamma_2 \mid \Delta_2}{H \vdash t_1 \Rightarrow H'' \vdash t_3 \mid \Gamma_1, \Gamma_2 \mid \Delta_1 + \Delta_2}\;\text{ext}$$

In the case of ext the binding contexts are disjoint (since we treat binders as unique in a standard way) but the usage contexts are added as follows:

$$\Delta_1 + \emptyset = \Delta_1 \qquad \Delta_1 + (\Delta_2, x : r) = \begin{cases} (\Delta_1 + \Delta_2), x : r & x \notin \mathsf{dom}(\Delta_1) \\ (\Delta_1' + \Delta_2), x : \omega & \Delta_1 = \Delta_1', x : r' \end{cases}$$

i.e., if a variable x appears in both usage contexts then it is marked x : ω in the resulting context, since for the purposes of our counting we distinguish only 0 uses (via absence in Δ), 1 use, or many uses (ω).
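To make the counting concrete, here is a minimal Haskell sketch of usage contexts and their addition; the representation as a Data.Map and all names are our own illustration, not part of the formal development:

```haskell
import qualified Data.Map as Map

-- Usage markers: a variable was used once (1) or many times (ω).
data Usage = One | Omega deriving (Eq, Show)

-- A usage context maps variables to markers; absence means 0 uses.
type UsageCtx = Map.Map String Usage

-- Adding usage contexts: a variable present on both sides is promoted
-- to Omega, since we only distinguish zero, one, or many uses.
addUsage :: UsageCtx -> UsageCtx -> UsageCtx
addUsage = Map.unionWith (\_ _ -> Omega)
```

For example, adding the context {x : 1} to {x : 1, y : 1} yields {x : ω, y : 1}.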

**Heap model** The reduction rules are collected in the appendix [28], but we explain the core rules of the single-step reduction relation here. Unlike in a normal small-step semantics, variables have a reduction, with two possibilities:

$$\frac{}{H, x \mapsto_1 t \vdash x \leadsto H \vdash t \mid \emptyset \mid x : 1}\ \leadsto_{\mathrm{VAR}1} \qquad \frac{}{H, x \mapsto_\omega t \vdash x \leadsto H, x \mapsto_\omega t \vdash t \mid \emptyset \mid x : 1}\ \leadsto_{\mathrm{VAR}2}$$

Both reduce a variable x to the term t which is assigned to x in the heap. In the left rule, we started out with a heap capability of 1 (linear) so after the reduction we remove x from the heap. In the right rule, we have a heap capability of ω (non-linear) so we preserve the assignment to x in the outgoing heap.
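As a sketch of the operational difference between the two rules, the following Haskell fragment (illustrative only; Term, Heap, and reduceVar are our own names, not from the paper) consumes a linear binding on use but preserves a non-linear one:

```haskell
import qualified Data.Map as Map

data Usage = One | Omega deriving (Eq, Show)
data Term  = Var String | App Term Term  -- remaining constructors elided

-- A heap binds each variable to a term guarded by a usage capability.
type Heap = Map.Map String (Usage, Term)

-- Reducing a variable: report the bound term, and either delete the
-- binding (capability 1) or keep it (capability ω); both count one use.
reduceVar :: String -> Heap -> Maybe (Term, Heap)
reduceVar x h = do
  (u, t) <- Map.lookup x h
  pure ( t
       , case u of
           One   -> Map.delete x h  -- linear: binding is consumed
           Omega -> h )             -- non-linear: binding is preserved
```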

β-reduction is then given as follows:

$$\frac{\Gamma \vdash t' : A}{H \vdash (\lambda x.t)\; t' \leadsto H, x \mapsto_1 (\Gamma \vdash t' : A) \vdash t \mid x : A \mid \emptyset} \quad \leadsto_\beta$$

Rather than using a substitution, the result is the body term t under a heap extended with x assigned to the (typed) argument term t′. This heap binding is given a resource capability of 1 since functions are linear. In the output, we remember that a linear binding has been opened up in the scope of the term. An inductive rule allows an application to reduce on the left:

$$\frac{H \vdash t_1 \leadsto H' \vdash t_1' \mid \Gamma \mid \Delta}{H \vdash t_1\, t_2 \leadsto H' \vdash t_1'\, t_2 \mid \Gamma \mid \Delta}\ \leadsto_{\mathrm{APP}}$$

We elide the rules for products and unit, which follow much the same scheme: one congruence rule to evaluate the scrutinee of an elimination form, and one to enact a β-reduction. For the ! modality, this scheme gives us the !β rule, which creates a non-linear binding of x to the term t₁:

$$\frac{[\Gamma] \vdash t_1 : A}{H \vdash \mathsf{let}\; !x = !t_1\; \mathsf{in}\; t_2 \leadsto H, x \mapsto_\omega ([\Gamma] \vdash t_1 : A) \vdash t_2 \mid x : [A] \mid \emptyset} \quad \leadsto_{!\beta}$$

The more interesting rules are for the uniqueness aspects of the language. Borrowing & (which maps a unique value type ∗A to a non-linear value !A) has a congruence rule and a reduction to enact a borrow:

$$\frac{H \vdash t \leadsto H' \vdash t' \mid \Gamma \mid \Delta}{H \vdash \&t \leadsto H' \vdash \&t' \mid \Gamma \mid \Delta}\ \leadsto_{\&} \qquad \frac{\mathsf{dom}(H) \equiv \mathsf{arrRefs}(v)}{H, H' \vdash \&({*}v) \leadsto ([H]_\omega), H' \vdash\, !v \mid \emptyset \mid \emptyset}\ \leadsto_{\&\beta}$$

The action is in the right-hand rule, where the incoming heap is split into two parts such that H provides the allocations for all array references in v (enforced by the premise). The unique value ∗v is wrapped to be non-linear in the result !v, and thus all of its array references are now marked as 'many' via [H]_ω, which replaces all reference counts with ω, e.g.:

$$H', a \mapsto_1 \mathbf{arr} \vdash \&({*}a) \leadsto H', a \mapsto_\omega \mathbf{arr} \vdash\, !a \mid \emptyset \mid \emptyset$$

Thus, borrowing enacts the idea that a reference is no longer unique and may be used many times (and hence now is a non-linear value). Copying then has three reductions; a congruence (elided), a reduction which forces evaluation under the non-linear modality, and a β-reduction to enact copying to a unique value:

$$\frac{H \vdash t \leadsto H' \vdash t' \mid \Gamma \mid \Delta}{H \vdash \mathsf{copy}\; !t\; \mathsf{as}\; x\; \mathsf{in}\; t_2 \leadsto H' \vdash \mathsf{copy}\; !t'\; \mathsf{as}\; x\; \mathsf{in}\; t_2 \mid \Gamma \mid \Delta}\ \leadsto_{\mathsf{copy}!}$$

$$\frac{\mathsf{dom}(H') \equiv \mathsf{arrRefs}(v) \qquad (H'', \theta) = \mathsf{copy}(H')}{H, H' \vdash \mathsf{copy}\; !v\; \mathsf{as}\; x\; \mathsf{in}\; t \leadsto H, H', H'', x \mapsto_1 (\Gamma \vdash {*}\theta(v) : {*}A) \vdash t \mid x : {*}A \mid \emptyset}\ \leadsto_{\mathsf{copy}\beta}$$

The copy! rule evaluates under ! so that the first term can be reduced to a value v to be copied in the next rule. The copy_β rule enacts copying, where dom(H′) ≡ arrRefs(v) marks the part of the heap with array references coming from v. Then copy(H′) copies the arrays in this part of the heap, creating a heap fragment H″ and a renaming θ which maps from the old array references to the references of the new copies. This renaming is applied to v in the newly bound unique variable x. Thus the value θ(v) refers to any freshly copied arrays.

Lastly, the semantics of the array primitives uses an array representation on the heap, where arr is some array object and arr[i] = v indicates that the i-th element is bound to the value v; we write a # H for an array reference a which is fresh for heap H:

$$\frac{a \,\#\, H}{H \vdash \mathsf{newArray}\; n \leadsto H, a \mapsto_1 \mathbf{arr} \vdash {*}a \mid \emptyset \mid \emptyset}$$

$$\frac{}{H, a \mapsto_r (\mathbf{arr}[i] = v) \vdash \mathsf{readArray}\; ({*}a)\; i \leadsto H, a \mapsto_r (\mathbf{arr}[i] = v) \vdash (v, {*}a) \mid \emptyset \mid \emptyset}$$

$$\frac{}{H, a \mapsto_r \mathbf{arr} \vdash \mathsf{writeArray}\; ({*}a)\; i\; v \leadsto H, a \mapsto_r (\mathbf{arr}[i] = v) \vdash {*}a \mid \emptyset \mid \emptyset}$$

$$\frac{}{H, a \mapsto_r \mathbf{arr} \vdash \mathsf{deleteArray}\; ({*}a) \leadsto H \vdash \mathsf{unit} \mid \emptyset \mid \emptyset}$$

Thus newArray creates a fresh array reference a and allocates a new array on the heap with a single reference count. The readArray and writeArray primitives work as expected to read and destructively update the array referenced by a, whose reference count is arbitrary but unchanged by the reduction. Lastly deleteArray deallocates the array. Noticeably, the rules do not enforce uniqueness; but as we see in the next section, well-typed programs preserve uniqueness of references.
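The array rules can likewise be modelled directly over a heap of reference-counted arrays; the following Haskell sketch is our own illustration (integer references, Double elements), not the actual implementation:

```haskell
import qualified Data.Map as Map

data Count = OneRef | ManyRefs deriving (Eq, Show)

type Ref  = Int
type Arr  = Map.Map Int Double
type Heap = Map.Map Ref (Count, Arr)

-- newArray: allocate under a reference fresh for the heap (a # H),
-- with a single reference count.
newArray :: Int -> Heap -> (Ref, Heap)
newArray n h =
  let a = maybe 0 ((+ 1) . fst) (Map.lookupMax h)
  in (a, Map.insert a (OneRef, Map.fromList [(i, 0) | i <- [0 .. n - 1]]) h)

-- writeArray: destructive in-place update; the count is unchanged.
writeArray :: Ref -> Int -> Double -> Heap -> Heap
writeArray a i v = Map.adjust (\(c, arr) -> (c, Map.insert i v arr)) a

-- deleteArray: deallocate the array outright.
deleteArray :: Ref -> Heap -> Heap
deleteArray = Map.delete
```

As in the rules above, nothing in this model enforces uniqueness; that guarantee comes from the type system.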

#### 3.5 Metatheory

Proofs of all the statements that follow are provided in the appendix [28]. We first establish some key results showing the admissibility of substitution and weakening, which are leveraged in later proofs:

Lemma 1 (Linear substitution). If Γ′ ⊢ t′ : A and Γ, x : A ⊢ t : B then Γ′ + Γ ⊢ [t′/x]t : B.

Lemma 2 (Non-linear substitution). If [Γ′] ⊢ t′ : A and Γ, x : [A] ⊢ t : B then [Γ′] + Γ ⊢ [t′/x]t : B.

Lemma 3 (Weakening is admissible). If Γ ⊢ t : A then Γ, [Γ′] ⊢ t : A.

Next, the heap model allows us to establish the key properties of well-typed programs respecting linearity and uniqueness restrictions. We first define when a heap is compatible with a typing context:

Definition 3 (Heap-context compatibility). A heap H is compatible with a typing context Γ if H contains assignments for every variable in the context and the typing contexts of the terms in the heap are also compatible with the heap. The relation is defined inductively as:

$$\frac{H \bowtie \Gamma}{H, a \mapsto_r \mathbf{arr} \bowtie \Gamma, a : \mathsf{Array}\ A} \qquad \frac{H \bowtie (\Gamma_1 + \Gamma_2) \quad \Gamma_2 \vdash t : A \quad x \notin \mathsf{dom}(H)}{(H, x \mapsto_r (\Gamma_2 \vdash t : A)) \bowtie (\Gamma_1, x : A)}\ \text{LIN}$$

$$\frac{H \bowtie (\Gamma_1 + [\Gamma_2]) \quad [\Gamma_2] \vdash t : A \quad x \notin \mathsf{dom}(H)}{(H, x \mapsto_\omega ([\Gamma_2] \vdash t : A)) \bowtie (\Gamma_1, x : [A])}\ \omega$$

Thus, a heap compatible with Γ1, x : A contains an assignment for x marked with a usage annotation r which can be either 1 for linear or ω for non-linear use. Note that non-linear values can be used linearly, as captured by dereliction (the der typing rule). However, a non-linear assumption must have a heap assignment marked with ω (rule ω), where the dependencies of the assigned term t must all be non-linear in the remaining compatibility judgment on the rest of the heap.

From a heap (and likewise from a typing context) we can also extract usage information. This is useful for focusing on resource usage as follows:

Definition 4 (Usage context extraction). For a context Γ or heap H we can extract usage information, denoted $\overline{\Gamma}$ or $\overline{H}$ respectively, defined as:

$$\begin{aligned}
&\overline{\emptyset} = \emptyset \qquad \overline{(\Gamma, x : [A])} = \overline{\Gamma}, x : \omega \qquad \overline{(\Gamma, a : \mathsf{Array}\ A)} = \overline{\Gamma} \qquad \overline{(\Gamma, x : A)} = \overline{\Gamma}, x : 1 \\
&\overline{\emptyset} = \emptyset \qquad \overline{(H, x \mapsto_r (\Gamma \vdash t : A))} = \overline{H}, x : r \qquad \overline{(H, a \mapsto_r \mathbf{arr})} = \overline{H}
\end{aligned}$$

We now give the two main theorems about our calculus which give us the properties that linearity is respected (called conservation, Theorem 4) and that uniqueness is respected (Theorem 5).

Theorem 4 (Conservation). For a well-typed term Γ ⊢ t : A and all Γ₀ and H such that H ⋈ (Γ₀ + Γ) and a reduction H ⊢ t ⤳ H′ ⊢ t′ | Γ₁ | Δ we have:

$$\exists \Gamma'.\ \Gamma' \vdash t' : A \;\wedge\; H' \bowtie (\Gamma_0 + \Gamma') \;\wedge\; (\overline{H'} + \Delta) \sqsubseteq (\overline{H}, \overline{\Gamma_1})$$

The first conjunct is regular type preservation, linked with heap compatibility in the second conjunct. The last conjunct expresses the core of conservation: the resource usage accrued in this reduction, given by Δ, plus the remaining resources given in the heap H′, are approximated (via ⊑, the pointwise lifting of ≤) by the original resources given in the heap H plus the specification of the resources from any variable bindings Γ₁ encountered along the way. The context Γ₀ accounts for bindings not described by Γ, and is key to the inductive proof of this result.

We then establish that all heap references have only one reference to them at the end of execution.

Theorem 5 (Uniqueness). For a well-typed term Γ ⊢ t : ∗A and all Γ₀ and H such that H ⋈ (Γ₀ + Γ), and given a multi-reduction to a value H ⊢ t ⇒ H′ ⊢ ∗v | Γ′ | Δ, for all a ∈ arrRefs(v) (array references in v) we have:

$$(a \mapsto_1 t' \in H \implies \exists t''.\ a \mapsto_1 t'' \in H') \quad \wedge \quad (a \notin \mathsf{dom}(H) \implies \exists t''.\ a \mapsto_1 t'' \in H')$$

i.e., any array references contributing to the final term that are unique in the incoming heap stay unique in the resulting term, and any new array references contributing to the final term are also unique.

Notice that there is a certain duality between the conservation theorem and the uniqueness theorem which mirrors the weak duality between linearity and uniqueness. The statement of conservation is a generalised way to say that if a variable is linear then it will always be used in a linear way, or in other words that linearity restrictions will always be upheld; conversely, the uniqueness theorem tells us that if a variable is unique then it must always have been used in a unique way, or in other words that it does not have multiple references.

One important point to notice is that the additional rules (borrow and copy) that we include for unique types are in fact trivial cases when it comes to the uniqueness theorem since they can never output a value with a unique type. This makes sense as the idea behind these additional rules is to mediate the interaction between uniqueness and non-uniqueness, and this interaction can only ever go in the direction of producing values that are non-unique.

A sub-result of conservation is type preservation which is complemented by a separate progress result in Theorem 6 to give syntactic type safety:

Theorem 6 (Progress). Values of the heap model v are given by:

v ::= (t₁, t₂) | unit | ∗t | !t | λx.t | i | a | p  (value terms sub-grammar)

where p are partially-applied primitives, e.g., newArray, readArray, readArray (∗a). Given Γ ⊢ t : A, then t is either a value, or if H ⋈ (Γ₀ + Γ) there exists a heap H′, term t′, usage context Δ, and context Γ′ such that H ⊢ t ⤳ H′ ⊢ t′ | Γ′ | Δ.

Finally, we see that the operational semantics, extended to full β-reduction (i.e., all congruences), supports the equational theory:

Theorem 7 (Soundness with respect to the equational theory). For all t₁, t₂ such that Γ ⊢ t₁ : A and Γ ⊢ t₂ : A and t₁ ≡ t₂, and given H such that H ⋈ Γ, there exists a value (irreducible term) v and Γ₁, Γ₂, Δ₁, Δ₂ such that there are full β-reductions to the same value:

$$H \vdash t_1 \Rightarrow_\beta H' \vdash v \mid \Gamma_1 \mid \Delta_1 \quad\wedge\quad H \vdash t_2 \Rightarrow_\beta H' \vdash v \mid \Gamma_2 \mid \Delta_2$$

### 4 Implementation

#### 4.1 Frontend

The implementation of uniqueness types in Granule follows much the same pattern as the logic defined earlier. Granule already possesses a semiring-graded necessity modality, where for a pre-ordered semiring (R, ∗, 1, +, 0, ⊑) there is a family of types {A_r}_{r∈R}. We represent the ! from linear logic (and our calculus) via the pre-ordered semiring {0, 1, ω} (none-one-tons [30]) with !A = A_ω.<sup>7</sup>

The semiring is defined with r + s = r if s = 0, r + s = s if r = 0, and ω otherwise; r ∗ 0 = 0 ∗ r = 0, r ∗ ω = ω ∗ r = ω (for r ≠ 0), and r ∗ 1 = 1 ∗ r = r; with ordering 0 ⊑ ω and 1 ⊑ ω. This semiring allows us to represent both linear and non-linear use: variables graded with 1 must be used linearly, those graded with 0 must be discarded, and a grade of ω permits unconstrained use à la linear logic's !.
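As a sanity check, the semiring can be transcribed directly; this Haskell sketch (our own names, not Granule's internals) implements the operations and ordering just described:

```haskell
-- The none-one-tons semiring {0, 1, ω}.
data NOT = Zero | One | Tons deriving (Eq, Show)

plus :: NOT -> NOT -> NOT
plus r Zero = r
plus Zero s = s
plus _    _ = Tons          -- any other combination is ω

times :: NOT -> NOT -> NOT
times Zero _    = Zero
times _    Zero = Zero
times One  r    = r
times r    One  = r
times Tons Tons = Tons      -- r * ω = ω * r = ω for r ≠ 0

-- Ordering: 0 ⊑ ω and 1 ⊑ ω, plus reflexivity.
leq :: NOT -> NOT -> Bool
leq r s = r == s || s == Tons
```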

<sup>7</sup> It may not seem obvious that such a graded modality does exactly represent the behaviour of linear logic's !, and in fact capturing the precise behaviour of ! does require some additional semiring structure which is present in Granule [22].

Fig. 2: Relationship between various flavours of substructural types demonstrating how they can all be represented using Granule's expressive modalities.

(In Granule, A_ω can be written as the type A [Many], but we syntactically alias this to !A for simplicity and ease of understanding.)

As in LCU, uniqueness is represented by a new modality, which we call ∗ to match the calculus (and so that the syntax of programs involving uniqueness will be familiar to Clean users). The uniqueness modality wraps a value that behaves 'linearly' (and so cannot be duplicated or discarded), with the key difference being that we provide primitive functions which allow ! to act as a relative monad over unique values. The primitives have the following type signatures:

```
Granule
uniqueReturn : ∀ {a : Type} . *a → !a -- borrow
uniqueBind : ∀ {a b : Type} . (*a → !b) → !a → !b -- copy
```
The uniqueReturn function here implements the borrow rule from the calculus (acting as the 'return' of the relative monad), and similarly the uniqueBind function implements the copy rule (acting as the 'bind').

We provide syntactic sugar for both of these primitives for convenience, with syntax designed to evoke the rules from the LCU calculus; &x is equivalent to writing uniqueReturn x, while clone t1 as x in t2 is equivalent to writing uniqueBind (λx → t2) t1. <sup>8</sup> A simple example of uniqueness types in action is given below, to demonstrate the idea.

```
Granule
sip : *Coffee → (Coffee, Awake)
sip fresh = let !coffee = &fresh in (keep coffee, drink coffee)
```
Here, borrowing (&) converts the unique Coffee value into an unrestricted one, so that it can be duplicated and used twice for the two separate functions. Note however that the uniqueness guarantee is lost in the process, so both of the output values are non-unique (linear, in this case).

Figure 2 illustrates the relationship between uniqueness, linearity and other common forms of substructural typing in the resulting system.

<sup>8</sup> In the implementation we use 'clone' rather than 'copy', as the name 'copy' is often used elsewhere in Granule, e.g. for the non-linear function which duplicates its input.

We implemented a built-in library for primitive floating point arrays in Granule, matching the interface for arrays of floats that was introduced as an extension to the LCU calculus in Section 3.3, with operations typed as follows:

```
Granule
newFloatArray : Int → *FloatArray
readFloatArray : *FloatArray → Int → (Float, *FloatArray)
writeFloatArray : *FloatArray → Int → Float → *FloatArray
deleteFloatArray : *FloatArray → ()
```
The writeFloatArray primitive updates an array destructively in place since we have a guarantee that no other references exist to the array which has been passed in. In the next section, we use this set of primitives to evaluate the performance of our implementation, by measuring the performance gains from allowing for in-place updates in this fashion. We have another set of immutable primitives akin to the above (but written with a suffix I) which work with non-unique arrays, e.g. readFloatArrayI : FloatArray → Int → (Float, FloatArray), and thus do not perform mutation.

The following shows an example of clone, where a new array is borrowed and a copy of this borrowed FloatArray on line 3 is deleted, leaving the original (now immutable) instance of the array unaffected on line 4:

```
Granule
1 let x = newFloatArray 10 in
2 let [y] = &x in
3 let [()] = clone [y] as y' in (let () = deleteFloatArray y' in [()])
4 in readFloatArrayI y 10
```

#### 4.2 Compilation and Evaluation

As part of our implementation of uniqueness types in Granule, as described in Section 4.1, we also implemented a simple compiler that translates programs into Haskell. This compiler preserves the value types, but erases all of Granule's substructural types (linear, unique, graded, etc.). As a result, we can take advantage of both Granule's flexible type system and Haskell's libraries and optimizing compiler. For this paper, all performance results were measured by compiling Granule programs to Haskell, and compiling the resulting Haskell with GHC 9.0.1. The measurements were collected on an ordinary MacBook with a 2 GHz quad-core Intel i5 processor and 16 GB of RAM.

As mentioned in Section 1, one motivation for using uniqueness types is to do the kind of in-place mutation necessary for efficient programming with arrays. To check that our implementation is reasonable, we carried out an evaluation using an array processing benchmark. The benchmark recursively allocates and sums up lists of arrays of various sizes, with the goal of demonstrating the benefits of uniqueness types for arrays in functional programming. Each iteration of the benchmark allocates a list of a thousand arrays, populates the arrays with values, then traverses the list to sum them up. We prepared two versions of this benchmark: one with functional in-place updates and manual (safe) deletion of unique arrays, and one with non-unique, immutable, garbage collected arrays and updates via copying. The overall performance of these two benchmarks is shown in Figure 3, with lower bars/numbers representing better performance. The results, while not surprising, do confirm that array-handling is generally more efficient when in-place mutation is allowed. Additionally, in Figure 3, we compared the time spent in garbage collection between the two versions of the benchmark. Because our implementation allocates unique data outside of GHC's heap, and uniqueness types allow programmers to directly de-allocate objects in memory, the unique version of the benchmark spends significantly less time in garbage collection. For this benchmark, the unique arrays are outside of the garbage collected heap and directly de-allocated, while other incidental objects (closures, lists, and so on) are still handled by the garbage collector.

Fig. 3: Performance of mutable vs. immutable arrays in Granule. Lower is better.

Of course, this is a somewhat contrived benchmark. Real-world Haskell libraries, for example, typically provide functional high-level interfaces for array manipulation while using unsafe code to mutate arrays internally. The popular vector library<sup>9</sup> is one example, and repa [25] is another. Additionally, there is significant prior work on improving the efficiency of functional programs operating on arrays (for example, using combinators like map and fold along with aggressive fusion [13,25,27]), which we will not dwell on. The main point is that, at some stage, arrays must be mutated. Rather than having this happen through unsafe code, or via external C or Fortran, uniqueness types give us a way to do that mutation directly in our functional language, efficiently and safely.

Crucially, in these comparisons, all versions of the programs are implemented in the same language: Granule. With our extensions, the language is expressive enough to encompass a variety of programming approaches. Functional programmers may freely mix and match from a variety of options for data management and manipulation. Object lifetimes may be either manually or automatically managed, and object contents may allow in-place mutation or be immutable.

<sup>9</sup> https://hackage.haskell.org/package/vector

### 5 Related Work

Uniqueness types are most well known for their appearance in the Clean language [40, 47], where they are used in lieu of monadic computation and for the efficiency gains offered by in-place update. In Clean, computation is based on graph rewriting and reduction; constants such as numbers are graphs, and functions are graph rewriting formulas. This gives the type system a rather different feel to those offered by more recent functional programming languages, and makes it more difficult to capture the benefits of Clean-style uniqueness in a modern setting, hence the value in our pursuit of this goal.

Some theoretical groundwork for Clean's uniqueness types has certainly been developed over the years, particularly in works by de Vries among others [55,56]; these papers aim to clarify the distinction between Clean's type system and systems based on the λ-calculus. Further work makes headway on the problem of distinguishing uniqueness from other substructural systems [53,54]. This follows a similar theoretical approach to the one demonstrated in our paper; such ideas for limited settings inspired the groundwork for our system, which is more general and has a practical implementation.

Other languages (old and new) featuring uniqueness types include Single-Assignment C [45], Mercury [48] and Cogent [35].

Ownership was first developed as a framework for understanding aliasing in object-oriented languages [34], and is intended to give a high-level structural view of objects and references in much the same way that powerful type systems give a high-level structural view of data. Ownership is now most familiar due to being pervasive in the Rust programming language, for which multiple formalisations have been attempted; RustBelt [24] gives a lower-level encoding of Rust intended for formal verification while Oxide [64] is a higher-level encoding designed for more theoretical work, among others [39]. Extending these ideas to other languages is an active area of research; RefinedC [44] is one example.

Regions have been used over the years in the context of effect systems [23, 26]. One of the primary motivations of research into region types was their application in region-based memory management [51], which aimed to bring some of the benefits of traditional stack-based memory management to higher-order functional languages. Regions divide values based on their lifetimes, so a system with region types can safely allocate and de-allocate memory for values based on region type information, eliminating the need for garbage collection.

Early on, regions were restricted to have LIFO (last-in, first-out) lifetimes which followed the block structure of a language, but later work relaxed this constraint using uniqueness (see: static capabilities [63] and Cyclone [21]); a unique reference to a region ensures there are no aliases to the region, and that it can therefore be promptly de-allocated. Additionally, regions themselves act as a way to control aliasing, and can be thought of as equivalence classes for a "may alias" relation—in other words, values which do not share a region may not alias with one another, and so if a value does not share a region with anything else then it may be safely mutated in place.

Work on Cyclone [14] demonstrated the relationship between regions and unique pointers, observing that "unique pointers are essentially lightweight, dynamic regions that hold exactly one object." Beyond that, Rust's lifetimes are heavily based on regions, and there exists an extension of ML called Affe [43] which aims to support both linearity and borrowing using regions.

Capabilities are tokens that a function must possess in order to be able to access a particular location in memory. Capabilities are linear, and cannot be duplicated or discarded, in order to prevent them from being forged [17]. Implementations exist for various object-oriented languages such as Java [2] and Scala [19]; more functional languages taking inspiration from the idea of capabilities also exist [33, 41]. Recent work on linear constraints for Haskell [49], which hopes to allow for something similar to borrowing within the framework of linear Haskell, also descends from work on capabilities. Ambient capabilities can also be internalised as a comonad to capture purity within an impure language [12].

### 6 Future Work

**Ownership via fractional permissions** Though Granule can now represent values with both linear and unique types, the language allows for much more fine-grained analysis of resourceful data via grading. For instance, we can replay our earlier non-linearity example but with some extra information in the types:

```
Granule
accurate : Cake [2] → (Happy, Cake)
accurate [cake] = let extra = have cake in (eat cake, extra)
```
Instead of an infinite amount of cake we specify that we have exactly two cakes; the cake on the right-hand side must be linear as we only have one usage remaining. If we used the input three times we would receive a type error.

Given that we can move beyond the simple binary view of linear and nonlinear, one might suspect that we could track the quantity of existing references to a value more accurately than just unique or non-unique. We propose taking inspiration here from Boyland's notion of fractional permissions [8].

The purpose of fractional permissions is to allow multiple readers to access the same resource without losing the ability to later gain unique write access. A "permission" can be split up, allowing read-only access to multiple consumers, and then later recombined (while ensuring no other permissions still exist).

To relate these with our calculus, let us hypothesise that ∗₁P is a 'complete' unique value that we can read from or write to, and that we can split up arbitrarily into 'fractionally' unique values ∗ₙP where 0 < n ≤ 1, as follows:

$$\ast_n P \otimes \ast_m P \iff \ast_{n+m} P$$

As with fractional permissions, fractional values must only be used for behaviour that does not involve mutation, because whilst a value is only fractionally unique we cannot guarantee that other references do not exist. We should only regain this ability if we recombine the guarantees into a complete ∗₁P.
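For instance, under the hypothesised splitting law, a complete permission could be handed to two readers and later reassembled to restore write access (a worked instance of the rule above, not a rule of the existing calculus):

$$\ast_1 P \iff \ast_{\frac{1}{2}} P \otimes \ast_{\frac{1}{2}} P \iff \left(\ast_{\frac{1}{4}} P \otimes \ast_{\frac{1}{4}} P\right) \otimes \ast_{\frac{1}{2}} P$$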

This model closely resembles ownership as in Rust [24]: we can think of a value of type ∗ₙP for n < 1 as being equivalent to a Rust-style &P, a borrowed value that we cannot mutate.<sup>10</sup> When a value has been borrowed, the original value cannot be written to until we are finished with the borrows, much like we would need to collect all the fractionally unique values back together to get back to our original unique ∗₁P. Being able to more closely model Rust's powerful ownership system would make this a fruitful avenue for future research.

**Linear Haskell** Granule's linear basis and assortment of modalities allows for a particularly natural embedding of the LCU calculus, but this does not preclude the theory of this paper from being applied in other contexts. One particularly valuable setting to consider would be Haskell, which as of GHC 9 already has linear types based on an underlying graded system called λq→.

Haskell's graded representation of linearity involves function types (a %r -> b) which carry a multiplicity annotation r; at present, this can be either 'One (linear) or 'Many (unrestricted). But λq→ is designed to be extensible, and the possibility of introducing additional multiplicities is welcomed [7, 49].
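As a small illustration of these annotations (a sketch against GHC 9 with the LinearTypes extension, not code from this paper):

```haskell
{-# LANGUAGE LinearTypes #-}

-- %1 marks a linear arrow; a plain -> is shorthand for multiplicity 'Many.

-- Linear: the argument pair must be consumed exactly once.
swap :: (a, b) %1 -> (b, a)
swap (x, y) = (y, x)

-- Unrestricted: an ordinary arrow permits duplicating the argument.
dup :: a -> (a, a)
dup x = (x, x)
```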

The original paper on linear Haskell [7] mentions that "linear types are conceptually simpler than uniqueness type systems, giving a clearer path to implementation in GHC", and also that "functional languages have more use for fusion than in-place update". Our clarification of the relationship between linearity and uniqueness demonstrates that not only are uniqueness types no more complex conceptually than linear ones, they can comfortably sit alongside one another in a single calculus; our evaluation demonstrates that while linearity is certainly useful, there are still further practical benefits to be gained from introducing uniqueness into a language with linear types. Perhaps these contributions will begin to forge a path towards a future for Haskell where linear types and uniqueness types can both be leveraged for their respective strengths.

**Adjoint models** Benton's linear/non-linear (LNL) logic [6] consists of two fragments: intuitionistic (non-linear) logic Φ ⊢_I X and a mixed fragment of intuitionistic linear logic with non-linear hypotheses Φ, Γ ⊢_L A. These two fragments are connected by a pair of modalities Lin(X) and Mny(A), which form an adjunction; the ! modality can then be recovered by !A = Lin(Mny(A)).

Breaking the ! modality into two and allowing linear logic to be mixed with non-linear logic has been a valuable endeavour, and so a natural question is whether it is possible to build an LNL-style adjoint model for our unified LCU calculus. It seems plausible that building an adjoint model for just uniqueness logic would not be too difficult; this would be very similar to the LNL model but with the adjunction moving in the opposite direction, and the monadic modality ◦ from uniqueness logic could be represented in much the same way that the comonadic ! can be recovered in LNL.

<sup>10</sup> Rust also includes mutable borrows, which allow the borrower to both read from and write to their borrowed reference. These are a much closer analogue to our current non-fractional calculus, since mutable borrows must be unique.

An adjoint model for the full LCU calculus would be more interesting. This would most likely involve three fragments, two of which would be symmetric monoidal categories (for unique and linear values) and one of which would be a Cartesian closed category (for unrestricted values), with two adjunctions allowing values to flow from unique to unrestricted to linear as we might hope.

**Ordered and dependent types** As expressive as Granule's type system may be, there are opportunities for enforcing stronger properties on programs elsewhere in the landscape of type theories. One possibility is that in addition to restricting contraction and weakening, it is also possible to restrict exchange, giving ordered type theories which correspond to noncommutative logic.

Such systems can be used to model stack-based memory allocation (as opposed to heap-based), since without exchange an object may only be used when it is at the top of the stack [10,62]. But much like linearity, these systems restrict the use of exchange in the future; is there an equivalent of uniqueness for ordered types which guarantees that exchange has never been applied in the past, and could this be useful for tracking references on the stack?

Another possibility is to bring uniqueness into the realm of dependent types. Recent work on graded modal dependent type theory (GrTT) [31] allows for capturing requirements on variable usage at both the type and computation levels; grades come in pairs, where the first component is the computation-level grading and the second component is the type-level grading. Strictly linear usage in types is rare – is there value in being able to represent uniqueness here?

### 7 Conclusion

Linearity and uniqueness are both well-studied concepts with similar substructural foundations, but differing benefits; linearity enables the careful management of resourceful data, while uniqueness offers the possibility of safe in-place updates. By formalising the relationship between these two ideas, building on two distinct bodies of literature, we have shown that there is value in having both linear and unique types in the same type system. This could be a first step on the road towards properly understanding the relationships between more advanced substructural type systems, such as the fine-grained resource tracking of Granule and Idris and the complex memory management provided by Rust.

Moreover, we implemented this system in the graded modal setting of the Granule language and provided benchmarks to demonstrate the efficiency gains that can be accessed via adding uniqueness to a language that already has a linear basis. The opportunities to incorporate uniqueness types into languages outside of Granule are apparent, and this paper offers both a theoretical underpinning for uniqueness as it relates to linearity as well as clear validation of the performance benefits that a system which unifies linearity and uniqueness can offer.

**Acknowledgments** This work was supported by an EPSRC Doctoral Training Award (Marshall) and EPSRC grant EP/T013516/1 (Verifying Resource-like Data Use in Programs via Types).

### References





### A Framework for Substructural Type Systems <sup>⋆</sup>

James Wood and Robert Atkey

University of Strathclyde, Glasgow, UK {james.wood.100,robert.atkey}@strath.ac.uk

Abstract. Mechanisation of programming language research is of growing interest, and the act of mechanising type systems and their metatheory is generally becoming easier as new techniques are invented. However, state-of-the-art techniques mostly rely on structurality of the type system — that weakening, contraction, and exchange are admissible and variables can be used unrestrictedly once assumed. Linear logic, and many related subsequent systems, provide motivations for breaking some of these assumptions.

We present a framework for mechanising the metatheory of certain substructural type systems, in a style resembling mechanised metatheory of structural type systems. The framework covers a wide range of simply typed syntaxes with semiring usage annotations, via a metasyntax of typing rules. The metasyntax for the premises of a typing rule is related to bunched logic, featuring both sharing and separating conjunction, roughly corresponding to the additive and multiplicative features of linear logic. We use the uniformity of syntaxes to derive type system-generic renaming, substitution, and a form of linearity checking.

Keywords: Formalised syntax · substructural types · mechanised metatheory · quantitative typing

### 1 Introduction

In this paper, we treat the metatheory of a class of substructural type systems related to linear logic [11]. This class is variously known as coeffectful [17, 18], quantitative [4, 7], or resource-aware [10], or is given no particular name [1, 19], and generalises bounded linear logic to track variable usage with semiring annotations. In all of these systems, we have some ambient semiring R, and in the judgements of the type system, variables are annotated by elements of R describing how that variable can be used. The additive structure of R gives the ability to count, or otherwise accumulate, usages of variables in multiple subterms. The multiplicative structure gives rise to a form of modality, for example allowing multiple or unlimited reuse, or movement between security levels, in the type system.


<sup>⋆</sup> James Wood is supported by an EPSRC Studentship. Robert Atkey is supported by EPSRC grant EP/T026960/1.


The aspect of such systems we tackle here is their basic metatheory and mechanisation thereof.

We build upon both the general structural framework of Allais et al. [3] and the substructural techniques of Wood and Atkey [21]. What Allais et al. did to consolidate and codify mechanisation techniques for propositional natural deduction systems, based on intrinsically typed syntax and de Bruijn indices, we aim to replicate for linear-like systems based on semiring usage annotations. By picking a trivial semiring, our work can subsume that of Allais et al., except for the many pieces of machinery we have not yet ported to this new framework.

Our work complements that of Orchard et al. [17] on the Granule programming language. Where Granule focuses on writing programs in the language and running them, we focus on metatheoretic reasoning about type systems.

Our work is similar in scope to that of Licata et al. [13], though we work in a natural deduction style rather than a sequent calculus style. Where Licata et al. are much more agnostic in terms of substructurality — allowing for noncommutative and bunched logics — we are much more agnostic in terms of syntax. The system of Licata et al. is essentially a single calculus, supporting "product" (F) types and "function" (U) types, parametrised on a mode theory describing its structural rules. For this system, they derive the strong result of cut elimination. Meanwhile, we leave syntax design to the user, and consequently can only guarantee substitution (which we can only get because of our commitment to natural deduction).

This paper proceeds as follows. In section 2, we review and fix conventions pertaining to partially ordered semirings and vectors over them. In section 3, we introduce an informal meta-syntax allowing us to state substructural typing rules succinctly and without explicit reference to contexts. In section 4, we mechanise that meta-syntax, giving a type of descriptions of type systems, and interpreting those descriptions as types of intrinsically typed terms. In section 5, we discuss usage-aware environments: a generalisation of the structures used in simultaneous renaming and substitution proofs. We use environments in section 6 to state an alternative elimination principle for terms, and give examples of such eliminations in section 7. The examples are syntax-generic renaming and substitution, a specific denotational semantics, and a syntax-generic usage elaborator. Finally, we conclude and discuss future work in section 8.

The work presented in this paper has been mechanised in Agda, with the code available for building upon [22].

### 2 Vectors over semirings

The basic algebraic structure we deal with is partially ordered semirings, or posemirings for short. A posemiring is a (not necessarily commutative) semiring on a partially ordered set, where both operations are monotonic. As in many similar formalisms, posemiring elements represent usage restrictions, with addition collecting restrictions from multiple uses, multiplication handling usage under a modality, and the order giving subsumption of restrictions, comparable to subtyping.

Definition 1. A posemiring is a tuple (R, ≤, 0, +, 1, ∗) such that (R, ≤) is a partially ordered set, (R, 0, +) is a commutative monoid, (R, 1, ∗) is a monoid, + and ∗ are monotonic, and ∗ distributes over 0 and + on both sides.

Example 1 (Zero-one-many). The poset {0 > ω < 1} forms a posemiring under normal numeric addition (with 1 + 1 = 1 + ω = ω + ω = ω) and multiplication (with ω ∗ ω = ω). This gives us a way to mark whether variables are unused (0), used linearly (1), or used unrestrictedly (ω) in the current (sub)term. The ordering says that unrestricted-use variables can also remain unused or be used linearly.

Example 2 (Variance). The set {∼, ↑, ↓, ?}, with ∼ at the bottom and ? at the top of the order, forms a posemiring with addition being meet, 0 being top (?), 1 being ↑, and multiplication being commutative and determined by ↓ ∗ ↓ = ↑ and ∼ ∗ ↓ = ∼ ∗ ∼ = ∼. This gives us a way to track the variance with which variables are used, with the aim that all terms be monotonic in their free variables. ↑ stands for covariance, ↓ for contravariance, ∼ for invariance, and ? for a variable with no guarantees, in which we must be constant.

An element of a chosen posemiring R describes the usage restrictions on a variable. Therefore, a vector of elements from R describes the usage restrictions of a whole context's worth of variables. From the posemiring operations of R, we derive the standard vector operations of zero, addition, and multiplication by a scalar. We can also form the standard basis vectors at any given dimension. From the order on R, we get a pointwise order on vectors.
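These derived operations are straightforward; the following Haskell sketch (with an assumed Posemiring class; all names are our own illustration, not the paper's Agda development) shows them over lists as vectors:

```haskell
-- An assumed interface for posemirings; the laws are left implicit here.
class Posemiring r where
  zero, one    :: r
  (<+>), (<.>) :: r -> r -> r

-- Zero vector and pointwise addition.
zeroVec :: Posemiring r => Int -> [r]
zeroVec n = replicate n zero

addVec :: Posemiring r => [r] -> [r] -> [r]
addVec = zipWith (<+>)

-- Multiplication by a scalar, applied pointwise.
scaleVec :: Posemiring r => r -> [r] -> [r]
scaleVec r = map (r <.>)

-- The i-th standard basis vector at dimension n (written ⟨i| in the text).
basis :: Posemiring r => Int -> Int -> [r]
basis n i = [ if j == i then one else zero | j <- [0 .. n - 1] ]
```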

Vectors of a given length form a module over the posemiring R, analogously to how vectors over a field form a vector space. The partial order on such vectors is pointwise.

Definition 2. A (left) module over a posemiring, given a posemiring R, is a partially ordered commutative monoid (M, 0_M, +_M) with, for each r ∈ R, a pomonoid morphism r · (−) : M → M, such that the collection of these respects the posemiring structure on r. Specifically, for all instantiations of the variables:

– If r ≤ r′ and u ≤ u′, then r · u ≤ r′ · u′.
– r · 0_M = 0_M and r · (u +_M v) = r · u +_M r · v.
– 0 · u = 0_M and (r + s) · u = r · u +_M s · u.
– 1 · u = u and (r ∗ s) · u = r · (s · u).

We care to define modules so as to define module morphisms, also known as linear maps, which we use extensively when relating two contexts (as we do, for example, in simultaneous substitution). For the sake of mechanisation, we choose to define module morphisms relationally rather than functionally, giving a somewhat unfamiliar-looking definition that is equivalent to the usual functional definition. The main advantage of this relational approach is that proofs of relatedness for typical linear maps compose and decompose via data constructors and pattern matching.

Definition 3. A (relational) linear map Ψ between modules M and N over a posemiring R is a relation ∼ on the underlying sets of M and N satisfying the following laws (with → standing for implication and quantifiers binding most loosely):

– ∀u, u′, v, v′. u ≤ u′ → v′ ≤ v → u ∼ v → u′ ∼ v′
– ∀v. (∃u. u ≤ 0 ∧ u ∼ v) → v ≤ 0
– ∀u₀, u₁, v. (∃u. u ≤ u₀ + u₁ ∧ u ∼ v) → (∃v₀, v₁. u₀ ∼ v₀ ∧ u₁ ∼ v₁ ∧ v ≤ v₀ + v₁)
– ∀r, u′, v. (∃u. u ≤ ru′ ∧ u ∼ v) → (∃v′. u′ ∼ v′ ∧ v ≤ rv′)
– ∀u. ∃v. u ∼ v ∧ (∀v′. u ∼ v′ → v′ ≤ v)

Intuitively, Q ∼ P, where P and Q are row vectors, is equivalent to P ≤ QΨ, where Ψ is the matrix representing the linear map and the right-hand side is a vector-matrix multiplication. It is important that we think of row vectors and right-multiplication by a matrix because, without commutativity of the underlying posemiring, we can only expect (rQ)Ψ = r(QΨ) and not Ψ(rQ) = r(ΨQ). In section 5, we use the matrix notation for convenience, while in the Agda code we write Ψ .rel P Q.

### 3 Bunched Typing Rules

We now let R be an arbitrary posemiring. Our framework represents well typed and R-usaged terms intrinsically. Intrinsic typing means that we represent well typed and R-usaged terms (and only those) as inhabitants of an inductive family Rγ ⊢ A indexed by usage context R, type context γ, and type A. We represent the shape of a context as a nullary-binary tree, with typing and usage contexts being functions that assign types and elements of R, respectively, to leaves of the tree. Using trees instead of lists for typing contexts has the advantage that extension of a context by multiple variables does not lead to complex counting arguments to access the pre-existing variables, because context extension is (judgementally) injective. However, these precise details will eventually become irrelevant, as we will be able to use simultaneous renaming to smooth over any structural differences between contexts.

Figure 1 presents a prototypical example of a system that our framework can represent, which is a subsystem of the λR system of Wood and Atkey [21]. Each rule is given as a constructor: the premises are named p, s, t, etc., and the conclusion is a constructor applied to those metalanguage variables. Object language variables are represented intrinsically as members of the type Rγ ∋ A, which is a proof that the type A appears in the typing context, i : γ ∋ A, together with a proof that R ≤ ⟨i|. Expanding the vector notation, the latter condition says that the selected variable i must have a usage annotation ≤ 1 in R, while

$$\frac{x : \mathcal{R}\gamma \ni A}{\mathsf{var}\,x : \mathcal{R}\gamma \vdash A} \qquad \frac{t : \mathcal{R}\gamma, 1A \vdash B}{{\multimap}\mathrm{I}\,t : \mathcal{R}\gamma \vdash A \multimap B} \qquad \frac{p : \mathcal{R} \leq \mathcal{P} + \mathcal{Q} \quad s : \mathcal{P}\gamma \vdash A \multimap B \quad t : \mathcal{Q}\gamma \vdash A}{{\multimap}\mathrm{E}\,(p, s, t) : \mathcal{R}\gamma \vdash B}$$

$$\frac{t : \mathcal{R}\gamma \vdash A_i}{{\oplus}\mathrm{I}_i\,t : \mathcal{R}\gamma \vdash A_0 \oplus A_1} \qquad \frac{p : \mathcal{R} \leq \mathcal{P} + \mathcal{Q} \quad s : \mathcal{P}\gamma \vdash A \oplus B \quad u : \mathcal{Q}\gamma, 1A \vdash C \quad v : \mathcal{Q}\gamma, 1B \vdash C}{{\oplus}\mathrm{E}\,(p, s, u, v) : \mathcal{R}\gamma \vdash C}$$

$$\frac{p : \mathcal{R} \leq r\mathcal{P} \quad t : \mathcal{P}\gamma \vdash A}{{!}\mathrm{I}\,(p, t) : \mathcal{R}\gamma \vdash {!}_r A} \qquad \frac{p : \mathcal{R} \leq \mathcal{P} + \mathcal{Q} \quad s : \mathcal{P}\gamma \vdash {!}_r A \quad t : \mathcal{Q}\gamma, rA \vdash C}{{!}\mathrm{E}\,(p, s, t) : \mathcal{R}\gamma \vdash C}$$

Fig. 1. A prototypical posemiring-usaged system

all other variables must have a usage annotation ≤ 0. We use the constructors ↙ and ↘ to describe a path down the nullary-binary tree, terminated by the word here. The var rule imports variables into terms.

The remaining rules are the introduction and elimination rules for three type constructors: ⊸I and ⊸E for function types A ⊸ B where the bound variable is annotated with 1 for "use once"; ⊕I and ⊕E for sum types A ⊕ B; and !I and !E for a R-annotated exponential modality !rA.

There are two key observations to make about this system, which will guide the way we build our generic framework for R-annotated substructural systems:


These observations indicate a way to regularise and streamline the presentation of this system. Instead of treating each premise and the conclusion as having potentially unrelated typing and usage constraints, we make use of combinators for combining premises that will relate their usage and typing contexts to the conclusion by construction. This idea comes from the work of Rouvoet et al. [20], including the ˙→ and −∗ connectives we use later. To handle binders, which introduce variables, we make use of a combinator that adds a variable with a given R-annotation to an ambient context, without having to explicitly mention the parts of the context that have not changed. This technique is already present in some paper presentations of type systems, and is formalised by Allais et al. [3]. To manage how usage annotations are distributed between premises, we use the separating (∗) and sharing (×˙) conjunction connectives from Bunched Implications [16]. To handle the !I rule, we will need a scaling modality, r · −. The semantics of the bunched connectives we will use in this paper are:

$$\begin{aligned}
\dot{1}\,\mathcal{R} &:= 1 \\
(T \mathbin{\dot{\times}} U)\,\mathcal{R} &:= T\mathcal{R} \times U\mathcal{R} \\
(T \mathbin{\dot{\to}} U)\,\mathcal{R} &:= T\mathcal{R} \to U\mathcal{R} \\
I^{\ast}\,\mathcal{R} &:= \mathcal{R} \le 0 \\
(T \ast U)\,\mathcal{R} &:= \Sigma\mathcal{P}, \mathcal{Q}.\ (\mathcal{R} \le \mathcal{P} + \mathcal{Q}) \times T\mathcal{P} \times U\mathcal{Q} \\
(T \mathbin{-\!\!\ast} U)\,\mathcal{P} &:= \Pi\mathcal{Q}, \mathcal{R}.\ (\mathcal{R} \le \mathcal{P} + \mathcal{Q}) \to T\mathcal{Q} \to U\mathcal{R} \\
(r \cdot T)\,\mathcal{R} &:= \Sigma\mathcal{P}.\ (\mathcal{R} \le r\mathcal{P}) \times T\mathcal{P}
\end{aligned}$$

The function connectives ˙→ and −∗ are not used in typing rules, but are used in the rest of the framework (though one can interpret the horizontal line in a typing rule as ˙→ plus universal quantification). An important point to note is that bunched combinators induce linear combinations of substructures, in the sense of the linear algebra of posemirings described in the previous section.

<sup>1</sup> There is an unfortunate clash of terminology here: multiplicative rules add their usage contexts, while additive rules share their usage contexts.

$$\frac{x : {\ni}\,A}{\mathsf{var}\,x : {\vdash}\,A} \qquad \frac{t : 1A \vdash B}{{\multimap}\mathrm{I}\,t : {\vdash}\,A \multimap B} \qquad \frac{(s : {\vdash}\,A \multimap B) \ast (t : {\vdash}\,A)}{{\multimap}\mathrm{E}\,(s, t) : {\vdash}\,B}$$

$$\frac{t : {\vdash}\,A_i}{{\oplus}\mathrm{I}_i\,t : {\vdash}\,A_0 \oplus A_1} \qquad \frac{(s : {\vdash}\,A \oplus B) \ast ((u : 1A \vdash C) \mathbin{\dot{\times}} (v : 1B \vdash C))}{{\oplus}\mathrm{E}\,(s, u, v) : {\vdash}\,C}$$

$$\frac{r \cdot (t : {\vdash}\,A)}{{!}\mathrm{I}\,t : {\vdash}\,{!}_r A} \qquad \frac{(s : {\vdash}\,{!}_r A) \ast (t : rA \vdash C)}{{!}\mathrm{E}\,(s, t) : {\vdash}\,C}$$

Fig. 2. The prototypical system of figure 1 restated in terms of bunched combinators.

Figure 2 shows our prototypical system restated with implicit contexts and the bunched combinators. The inductive family is now denoted ⊢ A, only mentioning context extensions, as we do in the rules ⊸I, ⊕E and !E. Thus, in the var rule, the context is completely suppressed. The ⊸I rule just has to state that a new variable with usage annotation 1 and type A is added to the context. The ⊸E rule uses the separating conjunction (∗) to combine the premises, indicating that the usages of the two premises are added together for the conclusion. The ⊕E rule demonstrates the sharing conjunction ×˙ : the scrutinee term s and the clause terms t, u are combined by separating conjunction, because their usages must be combined, but the clause terms are combined by the sharing conjunction, because they have the same usage context.

Bunched combinators, along with suppression of unchanged typing contexts, lead to a more streamlined presentation of the system without the clutter of explicit usage context inequalities. However, the larger advantage for us is that systems constructed using these combinators automatically admit renaming, substitution, and other scope-, type-, and usage-safe traversals. If we were to allow arbitrary modification of the context in premises, these results would not be possible, since there would be no guarantee that a substitution (for instance) could be "pushed" up from a conclusion to the premises. As we will see in section 5, our generic notion of environment (e.g., a simultaneous substitution) is based around linear transformations, and so automatically commutes with the linear combinations of premises induced by the bunched connectives. This is the key to our generic results for all of the systems describable in our framework.

### 4 Generic syntax

We take the insights of the previous section and use them to build a generic framework for posemiring-annotated substructural systems in Agda. We will first show descriptions of systems, which are comprised of rules that have premises combined using the bunched combinators. We then show how to construct the Agda data type of intrinsically well scoped, typed, and resourced terms for any given system of our framework. We use the prototypical system from figure 2 as a running example. Section 4.3 presents further examples that our framework can express.

We now start to use Agda notation for record and data type declarations, to emphasise that our framework has been implemented.

#### 4.1 Descriptions of Systems

A type System is made up of multiple Rules. Each Rule comprises a Premises and a conclusion type. We assume that there is a Ty : Set of types for the system in scope.

The Premises data type describes premises of rules, using the bunched combinators from section 3. A single premise is introduced by the ⟨_'⊢_⟩ constructor. This allows binding of additional variables ∆ (with specified types and usage annotations) and the specification of a conclusion type A for this premise. The remaining constructors are descriptions for the bunched connectives.

```
data Premises : Set where
  ⟨_'⊢_⟩ : (∆ : Ctx) (A : Ty) → Premises
  '1˙ : Premises
  _'×˙_ : (p q : Premises) → Premises
  'I∗ : Premises
  _'∗_ : (p q : Premises) → Premises
  _'·_ : (r : Ann) (p : Premises) → Premises
```
A Rule is a pair of some Premises and a conclusion. We use an infix arrow as a suggestive notation for rules.

```
record Rule : Set where
  constructor _=⇒_
  field premises : Premises; conclusion : Ty
```
Finally, a System consists of a set of rule labels (i.e., constructor names), and for each label a description of the corresponding rule. We use ▷ as infix notation for systems to associate the label set with the rules.

```
record System : Set1 where
  constructor _▷_
  field Label : Set; rules : (l : Label) → Rule
```
As an example, we transcribe the system defined in figure 2 into a description. We give the set of types of this system as a data type Ty (together with a base type ι). We assume that there is a posemiring Ann in scope for the annotations. There is one label for each instantiation of a logical rule, but the labels contain no further information about subterms or restrictions on the context. This will be provided when we associate labels with Rules in a System.


```
data Ty : Set where
  ι : Ty
  _⊸_ : (A B : Ty) → Ty
  ! : (r : Ann) (A : Ty) → Ty
  _⊕_ : (A B : Ty) → Ty

data Side : Set where ll rr : Side

data 'λR : Set where
  '⊸I '⊸E : (A B : Ty) → 'λR
  '!I : (r : Ann) (A : Ty) → 'λR
  '!E : (r : Ann) (A C : Ty) → 'λR
  '⊕I : (i : Side) (A B : Ty) → 'λR
  '⊕E : (A B C : Ty) → 'λR
```

To build a system, we associate with each label a rule:

```
λR : System
λR = 'λR ▷ λ where
  ('⊸I A B) → ⟨ [ 1# · A ]ᶜ '⊢ B ⟩ =⇒ (A ⊸ B)
  ('⊸E A B) → (⟨ []ᶜ '⊢ A ⊸ B ⟩ '∗ ⟨ []ᶜ '⊢ A ⟩) =⇒ B
  ('!I r A) → (r '· ⟨ []ᶜ '⊢ A ⟩) =⇒ (! r A)
  ('!E r A C) → (⟨ []ᶜ '⊢ ! r A ⟩ '∗ ⟨ [ r · A ]ᶜ '⊢ C ⟩) =⇒ C
  ('⊕I ll A B) → ⟨ []ᶜ '⊢ A ⟩ =⇒ (A ⊕ B)
  ('⊕I rr A B) → ⟨ []ᶜ '⊢ B ⟩ =⇒ (A ⊕ B)
  ('⊕E A B C) →
    ⟨ []ᶜ '⊢ A ⊕ B ⟩ '∗ (⟨ [ 1# · A ]ᶜ '⊢ C ⟩ '×˙ ⟨ [ 1# · B ]ᶜ '⊢ C ⟩) =⇒ C
```
Compared to figure 2, modulo the Agda notation, we can see that the fundamental structure has been preserved: the rules match one-to-one, and the bunched premises are the same. A major difference is that we do not include a counterpart to the var rule in a System. Variables are common to all the systems representable in our framework.
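Since the framework itself is in Agda, it may help to see the same shape in a more familiar functional language. The following Haskell sketch is our own rough transliteration (the names Prem, One, Times, UnitI, Star, Scale, and System are ours, not part of the development):

```
-- A hypothetical Haskell transliteration of Premises/Rule/System,
-- parametrised by a type of types ty and of usage annotations ann.
data Premises ty ann
  = Prem [(ann, ty)] ty                        -- ⟨ ∆ '⊢ A ⟩: bind ∆, conclude A
  | One                                        -- '1˙ (sharing unit)
  | Times (Premises ty ann) (Premises ty ann)  -- '×˙ (sharing conjunction)
  | UnitI                                      -- 'I∗ (separating unit)
  | Star (Premises ty ann) (Premises ty ann)   -- '∗ (separating conjunction)
  | Scale ann (Premises ty ann)                -- '· (scaling)

data Rule ty ann = Premises ty ann :=> ty      -- premises =⇒ conclusion

newtype System lbl ty ann = System (lbl -> Rule ty ann)
```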

#### 4.2 Terms of a System

The next thing we want to do is build terms of the described type system. The following definitions are useful for talking about types indexed over contexts, judgement forms, and judgement forms admitting newly bound variables, respectively.

```
OpenType : ∀ ℓ → Set (suc ℓ)
OpenType ℓ = Ctx → Set ℓ
OpenFam : ∀ ℓ → Set (suc ℓ)
OpenFam ℓ = Ctx → Ty → Set ℓ
ExtOpenFam : ∀ ℓ → Set (suc ℓ)
ExtOpenFam ℓ = Ctx → OpenFam ℓ
```
To specify the meaning of descriptions, we assume some X : ExtOpenFam, over which we form one layer of syntax, using the function ⟦_⟧p that interprets Premises, defined below. The first argument to X is the new variables bound by this layer of syntax, as exemplified in the first clause of ⟦_⟧p. The second argument is the context containing the variables carried over from the previous layer. Notice that this is not, in general, the same as the context from the previous layer, because the usage annotations may have been changed by connectives like '∗ and '·. The third argument is the type of subterm required.

The remaining clauses of ⟦_⟧p are the interpretation into bunched combinators. The superscript ᶜ on the bunched connectives denotes that they have been lifted from predicates on usage vectors to predicates on contexts, with the type component of the context shared throughout. The additive connectives 1˙ and ×˙ are already polymorphic (not relying on anything specific about usage vectors), so they do not need a ᶜ variant.

```
⟦_⟧p : Premises → ExtOpenFam ℓ → OpenType ℓ
⟦ ⟨ ∆ '⊢ A ⟩ ⟧p X Γ = X ∆ Γ A
⟦ '1˙ ⟧p X = 1˙;  ⟦ p '×˙ q ⟧p X = ⟦ p ⟧p X ×˙ ⟦ q ⟧p X
⟦ 'I∗ ⟧p X = I∗ᶜ;  ⟦ p '∗ q ⟧p X = ⟦ p ⟧p X ∗ᶜ ⟦ q ⟧p X
⟦ r '· p ⟧p X = r ·ᶜ ⟦ p ⟧p X
```
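Read as predicates on usage vectors, the lifted connectives have a direct combinatorial meaning. A small Haskell sketch of our own, over natural-number annotations (where additive splits can be enumerated; none of these names belong to the framework):

```
type Usage = [Int]            -- one natural-number annotation per variable
type Pred  = Usage -> Bool

-- I∗ : the empty splitting; all annotations must be 0
unitSep :: Pred
unitSep = all (== 0)

-- P ∗ Q : the usages split additively between the two premises
sep :: Pred -> Pred -> Pred
sep p q r = or [ p a && q b | (a, b) <- splits r ]
  where
    splits = foldr (\n rest -> [ (x : xs, (n - x) : ys)
                               | x <- [0 .. n], (xs, ys) <- rest ])
                   [([], [])]

-- r · P : the usages are r times some usages satisfying P (exact division, r ≠ 0)
scale :: Int -> Pred -> Pred
scale r p q = r /= 0 && all (\n -> n `mod` r == 0) q && p (map (`div` r) q)
```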

The interpretation of a Rule checks that the rule targets the desired type and then interprets the rule's premises ps. Notice that the interpretation of the premises is independent of the conclusion of the rule, which accounts for the use of OpenType in ⟦_⟧p versus OpenFam in ⟦_⟧r.

```
⟦_⟧r : Rule → ExtOpenFam ℓ → OpenFam ℓ
⟦ ps =⇒ A′ ⟧r X Γ A = A′ ≡ A × ⟦ ps ⟧p X Γ
```

The interpretation of a System is to choose a rule label l from L and interpret the corresponding rule rs l in the same context and for the same conclusion.

```
⟦_⟧s : System → ExtOpenFam ℓ → OpenFam ℓ
⟦ L ▷ rs ⟧s X Γ A = Σ[ l ∈ L ] ⟦ rs l ⟧r X Γ A
```

The most obvious way to make such an X is to use some existing OpenFam on an extended context. We define Scope to do this: take the new variables ∆, concatenate them onto the existing context Γ, and pass the extended context on to the judgement T.

```
Scope : ∀ {ℓ} → OpenFam ℓ → ExtOpenFam ℓ
Scope T ∆ Γ A = T (Γ ++ᶜ ∆) A
```

We use Scope to deal with new variables in syntax. Terms resemble the free monad over a layer-of-syntax functor, though that picture is complicated by variable binding. A term is either a variable or a use of a logical rule together with terms for each of the required subterms. The Size argument is a use of Agda's sized types to record that subterms are smaller than the surrounding term for the termination checker.

```
data [_,_]_⊢_ (d : System) : Size → OpenFam 0ℓ where
  'var : ∀[ _⊐−_ →˙ [ d , ↑ sz ]_⊢_ ]
  'con : ∀[ ⟦ d ⟧s (Scope [ d , sz ]_⊢_) →˙ [ d , ↑ sz ]_⊢_ ]
```
This definition uses →˙, which, analogously to ×˙, is an index-preserving version of the function space. We take →˙ to handle n-many indices — in this case two (the context and the type). The notation ∀[ T ] stands for ∀ {x₁ … xₙ} → T x₁ … xₙ, where T is a type family with n-many indices.
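The free-monad picture mentioned above can be made concrete if we ignore the binding and usage bookkeeping. The following Haskell sketch is ours, not part of the framework; its bind operation is exactly simultaneous substitution:

```
-- Terms as the free monad over a layer-of-syntax functor f:
data Tm f v = Var v | Con (f (Tm f v))

instance Functor f => Functor (Tm f) where
  fmap g (Var v) = Var (g v)
  fmap g (Con t) = Con (fmap (fmap g) t)

instance Functor f => Applicative (Tm f) where
  pure = Var
  mf <*> mx = mf >>= \g -> fmap g mx

instance Functor f => Monad (Tm f) where
  Var v >>= s = s v                    -- substituting at a variable
  Con t >>= s = Con (fmap (>>= s) t)   -- push the substitution through one layer
```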

Terms in this data type are difficult to write by hand, due to the need for proofs that the usage contexts are handled correctly. For example, the following term is needed to show that, in the {0, 1, ω} (linearity) posemiring of example 1, !ω forms a comonad. Pattern synonyms ⊸I, !E′, and !I′ stand for applications of 'con, with the latter two taking explicit usage contexts and proofs. On concrete posemirings (as in this example), unification is particularly poor at inferring the usage contexts from the proofs, because addition and multiplication are no longer (judgementally) injective. The function var# is a way of turning a statically known de Bruijn level and a usage proof into an application of 'var.

```
cojoin-!ω : ∀ A → [ λR , ∞ ] []ᶜ ⊢ (! ω# A ⊸ ! ω# (! ω# A))
cojoin-!ω A = ⊸I (!E′ … (var# 0 …) (!I′ … (!I′ … (var# 1 …))))
  -- each … is an explicit usage context or usage-algebra proof
```

Writing terms like this is clearly unsustainable. We will see a way of automating the necessary proofs via a System-generic elaborator in section 7.2.

#### 4.3 Other syntaxes and syntactic forms

The system µµ˜. We can encode a usage-annotated version of System L/the µµ˜ calculus [8] — a syntax for classical logic — in such a way that contexts capture the undistinguished parts of the sequent. As such, the generic substitution lemma we get in section 7.1 is the form of substitution required in standard µµ˜-calculus metatheory. Though the µµ˜-calculus is originally described as a sequent calculus [8], we use the techniques of Herbelin [12, p. 12] and Lovas and Crary [14] to present it as a natural deduction system, thus giving a notion of variable to the system.

Unlike the single judgement form of λR and standard simply typed λ-calculi, the µµ˜-calculus has three judgement forms: terms, coterms, and commands. Read logically, terms and coterms respectively prove and refute propositions (types), while commands exhibit contradictions. This means that the abstract Ty of the generic framework is instantiated to Conc (for conclusion) as below, with Ty not being exposed directly to the generic framework. For now, we consider just multiplicative disjunction ` (par) and negation/duality, besides an uninterpreted base type. These are enough to exhibit classical behaviour.

```
data Ty : Set where
  base : Ty
  _`_ : (rA sB : Ann × Ty) → Ty
  _ˆ⊥ : (A : Ty) → Ty

data Conc : Set where
  com : Conc
  trm cot : (A : Ty) → Conc
```
With Ty instantiated as Conc, all terms are assigned a Conc type, as are all the variables. No variables are given com type, similar to how, in the bidirectional typing syntax of Allais et al. [3, p. 25], no variables are given Check type. How to observe this invariant is covered in that paper, so we will not repeat it here (having not yet seen how to write traversals on terms).

The syntax comprises a cut between a term and a coterm of the same type, the eponymous µ and ˜µ constructs for proof by contradiction, and then term and coterm (introduction and elimination) forms for negation and par.

```
data 'MMT : Set where
  'cut 'µ 'µ∼ : (A : Ty) → 'MMT
  'λ 'λ∼ : (A : Ty) → 'MMT
  '⟨-,-⟩ 'µ⟨-,-⟩ : (rA sB : Ann × Ty) → 'MMT

MMT : System
MMT = 'MMT ▷ λ where
  ('cut A) → ⟨ []ᶜ '⊢ trm A ⟩ '∗ ⟨ []ᶜ '⊢ cot A ⟩ =⇒ com
  ('µ A) → ⟨ [ 1# , cot A ]ᶜ '⊢ com ⟩ =⇒ trm A
  ('µ∼ A) → ⟨ [ 1# , trm A ]ᶜ '⊢ com ⟩ =⇒ cot A
  ('λ A) → ⟨ []ᶜ '⊢ cot A ⟩ =⇒ trm (A ˆ⊥)
  ('λ∼ A) → ⟨ []ᶜ '⊢ trm A ⟩ =⇒ cot (A ˆ⊥)
  ('⟨-,-⟩ rA@(r , A) sB@(s , B)) →
    r '· ⟨ []ᶜ '⊢ cot A ⟩ '∗ s '· ⟨ []ᶜ '⊢ cot B ⟩ =⇒ cot (rA ` sB)
  ('µ⟨-,-⟩ rA@(r , A) sB@(s , B)) →
    ⟨ [ r , cot A ]ᶜ ++ᶜ [ s , cot B ]ᶜ '⊢ com ⟩ =⇒ trm (rA ` sB)
```
Duplicability There is one more bunched combinator we have experimented with adding to the framework:

$$(\square T)\mathcal{R} := \Sigma \mathcal{R}'. (\mathcal{R}' \le \mathcal{R}) \times (\mathcal{R}' \le 0) \times (\mathcal{R}' \le \mathcal{R}' + \mathcal{R}') \times T\mathcal{R}'$$

The idea of (□T) R is to assert that R, or some refinement of it, can be both discarded and duplicated indefinitely, and that in the refinement we have a T. We use this combinator to introduce subterms that are used an unknown number of times, for example the continuations of the eliminator of an inductive type, or other fixed points. We can also use it in linear/non-linear style systems [6] to make sure linear variables are not available in the intuitionistic fragment.

Adding the □ combinator is the only feature we have found that requires our linear maps to be functional rather than merely relational.

### 5 Environments

We have now seen how to build data types of intrinsically well typed and well usaged terms for a given System. In the next section, we will define a generic traversal function that assigns a "semantics" to each term. Traversals operate on open terms, so they need a way to assign semantics to variables in a type- and usage-respecting manner. This is the function fulfilled by environments.

Given a semantic notion of variable V : OpenFam, we use the notation Γ ⊨ⱽ A, meaning V Γ A, for the type of inhabitants of V in the context Γ at type A. In the non-substructural systems of Allais et al. [3], a V-environment Γ ⱽ=⇒ ∆ is nothing more than a function ∀A → ∆ ⊐− A → Γ ⊨ⱽ A, mapping variables to V-things. In our usage-annotated setting, though, we must correctly distribute the resources tracked by the annotations, making sure that we have enough resources in Γ to cover all the demands in ∆. Following our previous work [21], this accounting is expressed via the presence of a linear transformation:

Definition 4 (Environment). A V-environment between annotated contexts Γ and ∆ (decomposed as Pγ and Qδ, respectively, when convenient) is a linear map Ψ : R^|∆| → R^|Γ| (written postfix) such that P ≤ QΨ, together with, for each A, P′, and Q′ such that P′ ≤ Q′Ψ, a "lookup" function from Q′δ ⊐− A to P′γ ⊨ⱽ A.

In Agda code, we use [ V ] Γ ⊨ A instead of Γ ⊨ⱽ A, and [ V ] Γ ⇒ᵉ ∆ instead of Γ ⱽ=⇒ ∆.

The specification of the lookup function has some redundancy. Notice that, for Q′δ ⊐− A to hold, we must have Q′ ≤ ⟨i| for some i. Instead of P′ ≤ Q′Ψ, asking for P′ ≤ ⟨i|Ψ would be just as general. Additionally, all of the Vs we consider satisfy the subusaging property (that P′ ≤ P yields a coercion Pγ ⊨ⱽ A → P′γ ⊨ⱽ A), in which case we could just ask for an inhabitant of (⟨i|Ψ)γ ⊨ⱽ A. However, we find the stated definition technically expedient because, by this point, basis vectors and raw indices (instead of usage-checked variables) are below our level of abstraction. We prefer to work with linear relatedness and ⊐−-variables.

By instantiating V in definition 4, we obtain resource-correct versions of familiar notions: letting V be ⊐− yields resource-correct renamings, and letting V be ⊢ (i.e., terms) yields resource-correct substitutions.

We may informally assign variable names to the entries in the domain context.

Example 3. Assume R is the natural numbers with ordering given by = and the usual addition and multiplication. There is a ⊐−-environment (a renaming)

$$(6a : A,\ 0b : B,\ 1c : C,\ 0d : D) \overset{\sqsupset-}{\Longrightarrow} (1C,\ 2A,\ 4A).$$

The mapping of variables to variables (the target entries 1C, 2A, and 4A look up c, a, and a, respectively) and the matrix giving the linear map Ψ are:

$$\Psi = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
Note that (6 0 1 0) = (1 2 4)Ψ. The first column of Ψ, corresponding to the variable 6a : A, contains two 1s because a has been duplicated (via contraction). The second and fourth columns are all 0 because the variables b and d have been discarded (via weakening). The third column contains one 1 because c is used once. This 1 appears above the 1s to its left because c has been permuted (via exchange) past a. Each of the rows of the matrix is a basis vector because variables can only be formed in contexts with basis-compatible annotations.
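The claim (6 0 1 0) = (1 2 4)Ψ is ordinary matrix arithmetic; a few lines of Haskell (our own illustration, not part of the development) confirm it:

```
-- Ψ as rows, one per target-context entry (1C, 2A, 4A)
psi :: [[Int]]
psi = [ [0, 0, 1, 0]   -- 1C looks up c
      , [1, 0, 0, 0]   -- 2A looks up a
      , [1, 0, 0, 0] ] -- 4A looks up a

-- row vector times matrix
vecMat :: [Int] -> [[Int]] -> [Int]
vecMat q m = foldr1 (zipWith (+)) [ map (k *) row | (k, row) <- zip q m ]

check :: Bool
check = vecMat [1, 2, 4] psi == [6, 0, 1, 0]   -- (1 2 4)Ψ = (6 0 1 0)
```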

Relocation An environment ρ : Pγ ⱽ=⇒ Qδ does not determine P and Q; we can replace them with any P′ and Q′ that are related by the linear map ρ.Ψ (that is, the linear map of the environment ρ):

Lemma 1 (relocate). Given an environment ρ : Pγ ⱽ=⇒ Qδ and P′ and Q′ such that P′ ≤ Q′ (ρ.Ψ), there is also an environment of type P′γ ⱽ=⇒ Q′δ with the same linear map and action on variables.

Relocation will be used when pushing environments into subterms in section 6.3.

Inductive Construction When V supports subusaging, we can construct a V-environment by cases on the shape of the target context, using the following rules, which are stated with the bunched connectives from section 3:

$$\frac{I^{*}}{\langle\rangle : \overset{\mathcal{V}}{\Longrightarrow} \cdot} \qquad \frac{\left(\rho : \overset{\mathcal{V}}{\Longrightarrow} \Delta_l\right) \ast \left(\sigma : \overset{\mathcal{V}}{\Longrightarrow} \Delta_r\right)}{\langle\rho, \sigma\rangle : \overset{\mathcal{V}}{\Longrightarrow} \Delta_l, \Delta_r} \qquad \frac{r \cdot \left(M : \overset{\mathcal{V}}{\vDash} A\right)}{\langle M \rangle : \overset{\mathcal{V}}{\Longrightarrow} rA}$$

Left to right: we can create an environment into the empty context when all usage annotations on the source context are 0; we can create an environment into a concatenated context when we can additively split up the annotations of the source context and produce environments into both halves from the split sources; and we can create an environment into a singleton context rA when we can divide the source context by r and produce a V-value of the appropriate type in the divided context.

Example 4. Assume R is the natural numbers with ordering given by = and the usual addition and multiplication, and ⊢ is the type of terms for a System with function application. There is an environment (substitution)

$$\langle \langle z \rangle, \langle y\,z \rangle \rangle : (0x : A,\ 2y : B \multimap C,\ 3z : B) \overset{\vdash}{\Longrightarrow} (1B,\ 2C).$$

We rely on the observations that (0 2 3) = (0 0 1) + (0 2 2) and, on the right, that (0 2 2) = 2 (0 1 1). Then, we have 0x : A, 0y : B ⊸ C, 1z : B ⊢ z : B and 0x : A, 1y : B ⊸ C, 1z : B ⊢ y z : C.

We could have used these rules to inductively define what environments are. However, we found that difficult to work with. It is often easier to do linear-algebraic proofs separately from the rest of an environment. For example, for identity and composition of environments (below), definition 4 is easier to use because we can rely on the identity and composition of linear maps. Concretely, an inductive proof of identity would, for example, involve constructing an environment of type Pγ, Qδ ⱽ=⇒ Pγ, Qδ by constructing environments of types Pγ, 0δ ⱽ=⇒ Pγ and 0γ, Qδ ⱽ=⇒ Qδ. These are not identity environments, so we would have to strengthen the induction hypothesis.

Renameability Renamings, i.e., ⊐−-environments, are a particularly important case of environments. Renamings form a category, with identity and composition following from the identity and composition of linear maps. As in the work of Fiore et al. [9], presheaves over renamings are an important class of open families. In Agda code, we abbreviate ⊐−=⇒ (which would usually be [ ⊐− ] ⇒ᵉ) as ⇒ʳ.

In a setting where new variables can be bound in the middle of a derivation, it is important that the values we carry around while traversing a term can handle the existence of variables that appear but are not used. We call any such notion of value renameable. The cofree renameable open type on an open type T is □ʳ T (unrelated to the □ combinator mentioned at the end of section 4.3), with T then being renameable if it forms a □ʳ-coalgebra.

Definition 5. For T an open type, (□ʳ T) Γ := ∀⟨ (−) ⊐−=⇒ Γ →˙ T ⟩. That is, □ʳ T holds at Γ when T holds not only at Γ, but also at any other Γ⁺ which renames to Γ.

Definition 6. We say that T is renameable whenever there is a function renˆT : ∀[ T →˙ □ʳ T ]. That is, whenever T holds at Γ, it also holds at any Γ⁺ which renames to Γ.

A renameable notion of value gives rise to a renameable notion of environment, essentially by renaming each contained value in an appropriate way. On the other side, environments admit renamings of their codomains in the opposite direction to that given by renameability.

Lemma 2 (renˆEnv). If (−) ⊨ⱽ A is renameable for all A, then so is (−) ⱽ=⇒ ∆ for all ∆.

Lemma 3. From Γ ⱽ=⇒ ∆ and ∆ ⊐−=⇒ Θ, we get Γ ⱽ=⇒ Θ.

Proof sketch. Notice that the lookup component of an environment maps variables in the codomain to values in the domain. We can apply the renaming to these variables.

### 6 Semantics

Given a V-environment Γ ⱽ=⇒ ∆, the function semantics we define in this section assigns a C-value in context Γ to every term in context ∆, where C is an OpenFam serving as the carrier of the semantic interpretation of terms (V being the semantic interpretation of variables). Before we can define semantics, we need to treat recursion through rules' premises (section 6.1) and the extension of environments when going under variable binders (section 6.2).

#### 6.1 A layer of syntax is functorial

A basic property of the universe of syntaxes we described in section 4 is that every syntax supports a functorial action on subterms, realised by the function map-s. Its type says that to map a function f over a layer of syntax, there must be a linear map F relating the domain and codomain usage contexts, and f should be usable wherever the domain and codomain usage contexts are similarly related by F.

```
map-s : (s : System) →
  (∀ {Θ P′ Q′} → F .rel P′ Q′ → ∀[ X Θ (ctx P′ γ) →˙ Y Θ (ctx Q′ δ) ]) →
  (∀ {P Q} → F .rel P Q → ∀[ ⟦ s ⟧s X (ctx P γ) →˙ ⟦ s ⟧s Y (ctx Q δ) ])
```
This generality is needed because usage contexts change between a term and its immediate subterms—they are decomposed according to the bunched connectives used in the rules. X and Y are ExtOpenFams, with Θ being the context extension for a subterm (i.e., the variables newly bound in that subterm). Unlike usage annotations, types in the contexts γ and δ, and the conclusion types implicit here, are preserved throughout. This is the essence of the usage annotation based approach—we use traditional techniques for variable binding, with the usage annotations layered on top.

The heart of map-s is map-p, which recursively works through the structure ps of the premises of the applied rule, acting on each subterm it finds. Here, particularly in the clauses for '∗ and '·, we see why it is not enough for the function on subterms to apply at usage contexts P and Q — rather, it also needs to apply at any similarly related P′ and Q′. In the case of '∗, we have that P ≤ Pᴹ + Pᴺ, with M and N being collections of subterms in usage contexts Pᴹ and Pᴺ, respectively. Linearity of F yields Qᴹ and Qᴺ such that Q ≤ Qᴹ + Qᴺ, and we use map-p recursively at (Pᴹ, Qᴹ) and (Pᴺ, Qᴺ) on M and N. The cases for '· and 'I∗ are similar, each using a different aspect of linearity. In contrast, the cases for '1˙ and '×˙, which are the only constructors used in fully structural systems, do not involve any changes to the usage contexts.

```
map-p : (ps : Premises) →
  (∀ {Θ P′ Q′} → F .rel P′ Q′ → ∀[ X Θ (ctx P′ γ) →˙ Y Θ (ctx Q′ δ) ]) →
  (∀ {P Q} → F .rel P Q → ⟦ ps ⟧p X (ctx P γ) → ⟦ ps ⟧p Y (ctx Q δ))
map-p ⟨ Γ '⊢ A ⟩ f r M = f r M
map-p '1˙ f r _ = _
map-p (ps '×˙ qs) f r (M , N) = map-p ps f r M , map-p qs f r N
map-p 'I∗ f r I∗⟨ sp0 ⟩ = I∗⟨ F .rel-0ᵐ (sp0 , r) ⟩
map-p (ps '∗ qs) f r (M ∗⟨ sp+ ⟩ N) =
  let rM ↘, sp+′ ,↙ rN = F .rel-+ᵐ (sp+ , r) in
  map-p ps f rM M ∗⟨ sp+′ ⟩ map-p qs f rN N
map-p (p '· ps) f r (⟨ sp* ⟩· M) =
  let r′ , sp*′ = F .rel-*ᵐ (sp* , r) in ⟨ sp*′ ⟩· map-p ps f r′ M
```

#### 6.2 The Kripke function space

At this point we introduce a minor generalisation of OpenFam and ExtOpenFam: I —OpenFam and I —ExtOpenFam. We obtain the definition of I —OpenFam by replacing the textual occurrence of Ty by the parameter I, and similarly for I —ExtOpenFam.

The definition Kripke V C ∆ is a kind of function space that describes a C-value parametrised by ∆-many additional Vs (all correctly typed and usage-annotated). It is used to describe how to go under binders in a higher-order abstract syntax style — to go under a binder, we must provide semantic interpretations for all the additional variables:

```
Kripke : (V : OpenFam v) (C : I —OpenFam c) → I —ExtOpenFam _
Kripke = Wrap λ V C ∆ Γ A → □ʳ ([ V ]_⇒ᵉ ∆ −∗ᶜ [ C ]_⊨ A) Γ
```
Wrap is a device that turns any type family into an equivalent type family that is judgementally injective in its indices, which helps with Agda's type inference. It turns the type family into a parametrised record with a single field get, whose type is the type in the body of the λ-abstraction. For understanding the meaning of Kripke, Wrap can be ignored.

If ∆ is of the form s₁B₁, …, sₙBₙ, then Kripke V C ∆ Γ A is equivalent to □ʳ (s₁ ·ᶜ [ V ] ⊨ B₁ −∗ᶜ ⋯ −∗ᶜ sₙ ·ᶜ [ V ] ⊨ Bₙ −∗ᶜ [ C ] ⊨ A) Γ by currying. That is to say, the Kripke function expects a value for each newly bound variable, at the multiplicity of its annotation, together with the resources supporting each of those values. We use the "magic wand" function space here to enforce the invariant that the freshly bound variables have usage annotations that are added to those of the existing variables, not shared with them. The use of the □ʳ modality ensures that we can still use the function in the presence of additional variables introduced by weakening.

Kripke is functorial in the C argument, as witnessed by the mapKC function, which is essentially post-composition:

```
mapKC : ∀ {A B} → ∀[ [ C ] ⊨ A →˙ [ C′ ] ⊨ B ] →
        ∀ {∆ Γ} → Kripke V C ∆ Γ A → Kripke V C′ ∆ Γ B
mapKC f b .get ren .app∗ sp ρ = f (b .get ren .app∗ sp ρ)
```
### 6.3 Semantic traversal

We can now state the data required to implement a traversal assigning semantics to terms. For open families V and C, interpreting variables and terms respectively, we assume that V is renameable, that V is embeddable in C, and that we have an algebra for a layer of syntax, where bound variables are handled using the Kripke function space:

```
record Semantics (d : System) (V : OpenFam v) (C : OpenFam c)
       : Set (suc 0ℓ ⊔ v ⊔ c) where
  field
    renˆV : ∀ {A} → Renameable ([ V ] ⊨ A)
    ⟦var⟧ : ∀[ V →˙ C ]
    ⟦con⟧ : ∀[ ⟦ d ⟧s (Kripke V C) →˙ C ]
```
We mutually define the action semantics and its lemma body. The purpose of semantics is to turn a term into a C-value, using a V-environment and the fields of Semantics. Meanwhile, body does a similar job, but also deals with newly bound variables. In particular, body takes a term in a context extended by Θ and produces a Kripke function from V-values for Θ to C-values.

```
semantics : ∀ {Γ ∆} → [ V ] Γ ⇒ᵉ ∆ → ∀ {sz} →
  ∀[ [ d , sz ] ∆ ⊢_ →˙ [ C ] Γ ⊨_ ]
body : ∀ {Γ ∆} → [ V ] Γ ⇒ᵉ ∆ → ∀ {sz Θ} →
  ∀[ Scope [ d , sz ]_⊢_ Θ ∆ →˙ Kripke V C Θ Γ ]
```
To implement the new recursor semantics, we use the standard recursor, which in one case gives us a variable v, and in the other gives us a structure of subterms M, each of which is in an extended context. To deal with a variable v, we look it up in the environment ρ, then use the ⟦var⟧ field to map the resulting V-value to a C-value. To deal with a structure of subterms M, we use the functoriality of the syntactic structure to consider each subterm separately. On a subterm, we apply body, which amounts to a recursive call to semantics with an extended environment. Recall that relocate (lemma 1) adjusts the environment ρ to work in the usage contexts of the subterms.

```
semantics ρ ('var v) = ⟦var⟧ $ ρ .lookup (ρ .fit-here) v
semantics ρ ('con M) = ⟦con⟧ $
  map-s (ρ .Ψ) d (λ r → body (relocate ρ r)) (ρ .fit-here) M
```
For body, we are given a subterm M, to which we want to apply semantics. To do so, we need an extended version of the initial environment ρ. We express this as the generation of a Kripke function that produces the extended environment given interpretations of the fresh variables. We take ρ, which is an environment covering ∆, and σ, which is an environment covering Θ, and glue them together using the inductive rules for generating environments, after having renamed ρ via lemma 2 to make it fit the new context Γ⁺ (intended to be Γ ++ᶜ Θ):

```
extend : ∀ {Γ ∆ Θ} → [ V ] Γ ⇒ᵉ ∆ → Kripke V ([ V ]_⇒ᵉ_) Θ Γ (∆ ++ᶜ Θ)
extend ρ .get ren .app∗ sp σ = ++ᵉ (renˆEnv renˆV ρ ren ∗⟨ sp ⟩ σ)
```

To define body, we use mapKC to post-compose the environment extension with the λ-function that takes an extended environment and acts with it on M.

```
body ρ M = mapKC (λ σ → semantics σ M) (extend ρ)
```

### 7 Example traversals

In this section, we provide three example uses of semantic traversals: generic renaming and substitution, a usage elaborator, and a denotational semantics. The reader is also encouraged to see the far greater range of examples in the work of Allais et al. [3], which should adapt to our usage-annotated setting. Renaming and substitution are essential results, while the latter two examples focus on usage annotations.

A result we will use throughout this section is reification. When we have an index-preserving mapping from usage-checked variables to V-values, we can construct environments of the form Γ ⱽ=⇒ Γ (identity environments) for all Γ. This lets us write the reify function, which simplifies our obligations when giving a Semantics by coercing Kripke functions into plain C-values in an extended context.

Lemma 4 (reify). If V is an open family such that there is a function v : ∀[ ⊐− →˙ V ], we get a function of type ∀[ Kripke V C →˙ Scope C ] for any C.

Proof. Let b : Kripke V C ∆ Γ A; that is, b is a Kripke function yielding C-computations. We want to apply b so as to get a C (Γ, ∆) A. Let Pγ = Γ and Qδ = ∆. The □ʳ in the type of b allows us to reverse-rename Γ to Γ, 0δ. Then we give the −∗-function an argument in context 0γ, ∆, noting that (Γ, 0δ) + (0γ, ∆) = (Γ, ∆), as we wanted for the result. The argument needs type 0γ, ∆ ⱽ=⇒ ∆. We produce this via lemma 3 from an environment ρ : 0γ, ∆ ⱽ=⇒ 0γ, ∆ created using v, and a renaming which is the complement of that used on the □ʳ.

All of the Vs used in examples in this paper support identity environments. However, Allais et al. [3, p. 27] give some important examples that do not support identity environments, and thus cannot use reify (lemma 4). The feature that causes the lack of support for identity environments is that a semantics can make use of the fact that only variables of particular kinds are bound by the syntax. In the examples of Allais et al., a bidirectionally typed language only binds variables that synthesise their type, as opposed to those whose type is checked. The semantics of type-checking and elaboration rely on variables synthesising their type, so V is chosen to cover only those variables. Instead of using reify, we must observe that each syntactic form only binds such synthesising variables. Similar phenomena would appear in, say, a call-by-value language where variables are values (not computations), or a polarised language where variables always have a polarity matching their type.

### 7.1 Renaming and substitution

In an unpublished note, McBride [15] gives a parametrised traversal yielding homomorphisms of syntax, the canonical examples of which are simultaneous renaming and simultaneous substitution. The parameters are collected in the record Kit. We make a minor change to the original presentation: where we have the renˆV field, McBride has a field wk allowing only context extensions. As for the other two fields, vr allows us to map variables to V-values, so as to put newly bound variables in environments; and tm allows us to extract terms from V-values, as required when we use the environment to handle a free variable.

```
record Kit (d : System) (V : OpenFam v) : Set (suc 0ℓ ⊔ v) where
  field
    renˆV : ∀ {A} → Renameable ([ V ] ⊨ A)
    vr : ∀[ ⊐− →˙ V ]
    tm : ∀[ V →˙ [ d , ∞ ]_⊢_ ]
```
Where McBride gave the traversal explicitly, we go via our generic semantic traversal. The first two fields of Semantics derive directly from fields of Kit. Meanwhile, to handle term constructors, we first reify to get a collection of traversed subterms, and then use 'con to assemble these subterms into a syntactic form of the same shape as the one we started with. The vr field is used implicitly in reify, as it is what shows that V-identity environments exist.

```
kit→sem : Kit d V → Semantics d V [ d , ∞ ]_⊢_
kit→sem K .renˆV = K .renˆV
kit→sem K .⟦var⟧ = K .tm
kit→sem {d = d} K .⟦con⟧ = 'con ◦ map-s′ d reify
  where open Kit K using (identityEnv)
```
The action of a syntactic traversal on logical rules is basically fixed: we preserve the logical rule and extend the environment with any newly bound variables according to vr. Meanwhile, the action on variables is relatively unconstrained: we look up the variable in the environment to get a V-value, then transform that V-value into a term using tm.

The idea of simultaneous renaming is that variables replace variables, whereas with simultaneous substitution, terms replace variables. This translates to environments for renaming containing ⊐−-values (variables), and environments for substitution containing ⊢-values (terms).

```
Ren-Kit : Kit d _⊐−_
Ren-Kit = record { renˆV = renˆ⊐− ; vr = id ; tm = 'var }
```

Notice that renˆ⊢, witnessing the fact that terms are renameable, is a corollary of Ren-Kit.

```
Sub-Kit : Kit d [ d , ∞ ]_⊢_
Sub-Kit = record { renˆV = renˆ⊢ ; vr = 'var ; tm = id }
```

#### 7.2 A usage elaborator

Using the constructs we have seen so far, producing example terms soon becomes extremely tedious. We can achieve some abbreviation by using pattern synonyms to wrap around 'con expressions, but we still have to produce essentially bespoke proofs whenever we use a usage-sensitive part of the syntax. The size of each of these proofs is roughly proportional to the number of free variables, so the amount of proof we have to write grows roughly quadratically with the size of terms. An additional factor, which we cannot see on paper, is that type checking time for these proofs soon becomes prohibitive for interactive development.

Our aim in this subsection is to automate usage constraint proofs, making terms both easier to write and more performant to check. We invoke the automation by writing terms in a syntax where usage constraints have been trivialised, and then use a semantic traversal over the simplifed syntax to try to produce a fully elaborated term in the original syntax. We write the automation in a way that is generic in the syntax description, thus avoiding repetition and facilitating the prototyping of new type systems.

The type of syntax descriptions depends on the type of usage annotations because of variable binding. For example, in the !r-E rule of figure 2, the right premise binds a new variable with annotation r, where r is drawn from the ambient posemiring. The scaling combinator also makes direct reference to the posemiring. To produce a simplified syntax description, where usage constraints are trivialised, we set the ambient posemiring to the 1-element 0 posemiring. In contrast to syntax descriptions, even though types can contain usage annotations, the type of types does not depend on the type of usage annotations. This means that, in our simplified syntax, terms have types from the original system even though variables have trivial usage annotations. We define the 0 posemiring as follows, being careful to use the 0-field record type ⊤ so that everything algebraic gets solved by Agda's η-laws. Indeed, in this very definition, all of the semiring operations and laws are canonically inferred.

```
0-poSemiring : PoSemiring 0ℓ 0ℓ 0ℓ
0-poSemiring = record
  { Carrier = ⊤; _≈_ = λ _ _ → ⊤; _≤_ = λ _ _ → ⊤ }
```
The elaboration process is monadic. In particular, we use the List (non-determinism) monad to give all of the possible annotation choices on the free variables of a term. We believe the commitment to multiple solutions is inherent when the syntax contains '1˙. For example, in the intermediate stages of elaborating (⊢ λx. (∗, ∗)) : A ⊸ ⊤ ⊗ ⊤ with a usage-counting posemiring (assuming reasonable rules for ⊤ and ⊗), it is unclear whether to use the variable x in the left ∗ or the right ∗. This uncertainty should be reflected in the final result.

The non-deterministic choices we make during elaboration are enumerated by the fields of NonDetInverses. These choices are driven by the typing rules and a candidate usage vector for the conclusion. For example, +⁻¹ r is needed when we encounter a '∗ in the syntax and the candidate usage annotation we are considering is r. Then, +⁻¹ r is a list of pairs of annotations p and q that r can split into, together with a proof of the splitting. For 0#⁻¹ and 1#⁻¹, inverses to constants, we are given the candidate r, and typically return an empty list if the constraint cannot be satisfied, or a singleton list containing a proof. *⁻¹ is used when we encounter scaling, in which case we know both the scaling factor r (from the syntax description) and the candidate q. These inverse operations combine monadically (in fact, applicatively) to give inverses to the vector operations of zero, addition, scaling, and basis.

```
record NonDetInverses : Set where
  field
    0#⁻¹ : (r : Ann) → List (r ≤ 0#)
    +⁻¹ : (r : Ann) → List (∃ \ ((p , q) : _ × _) → r ≤ p + q)
    1#⁻¹ : (r : Ann) → List (r ≤ 1#)
    *⁻¹ : (r q : Ann) → List (∃ \ p → q ≤ r * p)
```

We choose the V of our semantics to be (unannotated) variables. For the C, we consider functions from candidate usage vectors R to lists of elaborated derivations with usage annotations given by R. The protocol this encodes is that the user provides an unannotated term together with a candidate usage context R, and usage elaboration returns a list of possible ways the term could be annotated such that the conclusion has usage context R. The module name U refers to the fact that we are taking the ambient posemiring to be the 0 posemiring. The effect on OpenFam is that the usage annotations of any contexts we consider are uninformative (hence the _ on the left below).

```
C : System → U.OpenFam 0ℓ
C sys (U.ctx _ γ) A = ∀ R → List ([ sys , ∞ ] ctx R γ ⊢ A)
```

To traverse the unannotated terms, we produce a Semantics over the unannotated system uSystem sys. To write it, we make use of idiom brackets ⦇ … ⦈, which have the effect of replacing top-level spines of applications by (List-)applicative applications. Field by field: we already know that variables are renameable. To interpret a variable, we consider all the possible proofs that such a variable could be well annotated, and package them up as a variable term via the applicative machinery. Finally, for compound terms, we first reify the unannotated subterms, and then combine the subterms via a lemma.

```
elab-sem : ∀ sys → U.Semantics (uSystem sys) U._⊐−_ (C sys)
elab-sem sys .renˆV = U.renˆ⊐−
elab-sem sys .⟦var⟧ (U.lvar i q _) R = ⦇ 'var ⦇ (lvar i q) (⟨ i |⁻¹ R) ⦈ ⦈
elab-sem sys .⟦con⟧ b R =
  let rb = U.map-s′ (uSystem sys) U.reify b in
  ⦇ 'con (lemma sys rb) ⦈
```
The lemma essentially goes through the shape of the premises, combining the collections of subterms in the natural way. For example, at each '×˙ we take the Cartesian product of the possibilities for each half, and at each '∗ we non-deterministically split the incoming usage annotations and then take the Cartesian product. When it comes to newly bound variables, the syntax description tells us their annotations, so no further non-determinism is introduced there.

```
lemma : ∀ (sys : System) {A Γ} →
  U.⟦ uSystem sys ⟧s (U.Scope (C sys)) (uCtx Γ) A →
  List (⟦ sys ⟧s (Scope [ sys , ∞ ]_⊢_) Γ A)
```
To actually use elab-sem on terms, we take the associated semantics and pass it the identity environment (an identity renaming in this case, because V is a family of variables). We use elab-unique, which further checks statically that exactly one derivation is returned. The candidate usage vector R will be [] for closed terms, and otherwise we have to supply the intended usage annotations.

We can now use the elaborator to automatically infer the usage annotations for the example at the end of section 4.2. This allows us to write:

```
cojoin-!ω : ∀ {A} → [ λR , ∞ ] []ᶜ ⊢ (! ω# A ⊸ ! ω# (! ω# A))
cojoin-!ω = elab-unique (⊸I (!E (var# 0) (!I (!I (var# 1))))) []
```
We have instantiated the usage elaborator so that: 0#⁻¹ is a singleton on 0 and ω, and empty on 1; 1#⁻¹ is a singleton on 1 and ω, and empty on 0; +⁻¹ gives 0 ↦ [(0, 0)], 1 ↦ [(0, 1), (1, 0)], and ω ↦ [(ω, ω)]; and *⁻¹ gives (ω, 0) ↦ [0], (ω, 1) ↦ [], and (ω, ω) ↦ [ω] (omitting the (0, _) and (1, _) cases for brevity). Note that we do not consider splitting ω up as, say, 1 + ω, because this splitting would introduce more non-determinism without allowing any more terms to be typed. As such, the only non-determinism arises when we have variables annotated 1 and need to do an additive split, as when we apply the !E rule in the example above. At this point, the variable can become either 0-annotated in the left subterm and 1-annotated on the right, or vice versa. We will find that, because the left subterm wants to use that variable, the former choice is rejected. The function var# is a convenience for converting statically known natural numbers, representing de Bruijn levels, into variable terms.
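Concretely, this instantiation amounts to a handful of finite tables. A Haskell sketch of our own (names are ours; proofs of the inequalities are replaced by the unit type, and the (0, _) and (1, _) cases of *⁻¹ are omitted, as in the text):

```
data Ann = Zero | One | Omega deriving (Eq, Show)

zeroInv :: Ann -> [()]            -- 0#⁻¹: singleton on 0 and ω, empty on 1
zeroInv One = []
zeroInv _   = [()]

oneInv :: Ann -> [()]             -- 1#⁻¹: singleton on 1 and ω, empty on 0
oneInv Zero = []
oneInv _    = [()]

plusInv :: Ann -> [(Ann, Ann)]    -- +⁻¹: ways to split r additively
plusInv Zero  = [(Zero, Zero)]
plusInv One   = [(Zero, One), (One, Zero)]
plusInv Omega = [(Omega, Omega)]

multInv :: Ann -> Ann -> [Ann]    -- *⁻¹ r q: find p with q ≤ r * p
multInv Omega Zero  = [Zero]
multInv Omega One   = []
multInv Omega Omega = [Omega]
multInv _     _     = []          -- (0, _) and (1, _) cases omitted, as in the text
```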

#### 7.3 A denotational semantics

To justify the name semantics, we give an example traversal that is a denotational semantics in the usual sense. The semantics we take is a refinement of that of Abel and Bernardy [2], which gives a way to extract parametricity theorems from substructurally typed programs. Example theorems are that all linear terms act as permutations on some fixed set of resources, and that all monotonically typed terms really are monotonic in the way their typing suggests.

To abbreviate this section, we use a simplified syntax compared to λR. We allow an arbitrary family of base types BaseTy and a single type former (r , A) ⊸ B, equivalent to (! r A) ⊸ B from the earlier system.

```
data Ty : Set where
  base : BaseTy → Ty
  _⊸_ : (rA : Ann × Ty) (B : Ty) → Ty
```

In the term syntax, λ-abstraction now binds a variable with annotation r, while application needs to scale its argument by r (both in accordance with the function type they are acting on).

```
data 'AnnArr : Set where
  'lam 'app : (rA : Ann × Ty) (B : Ty) → 'AnnArr

AnnArr : System
AnnArr = 'AnnArr ▷ λ where
  ('lam rA B) → ⟨ [ rA ]ᶜ '⊢ B ⟩ =⇒ rA ⊸ B
  ('app rA@(r , A) B) → ⟨ []ᶜ '⊢ rA ⊸ B ⟩ '∗ r '· ⟨ []ᶜ '⊢ A ⟩ =⇒ B
```
As a running example, we take the usage annotations to be the 4-element variance posemiring (example 2). We establish the property that all terms are monotonic in their free variables. This monotonicity can be covariant or contravariant (or neither or both) depending on the annotation of each free variable. This provides an additional example to those of Abel and Bernardy.

We will take semantics of this system into world-indexed relations [2, 5]. A world-indexed relation (WRel) over a poset of worlds W is a set over which we have a W-indexed binary relation satisfying a presheaf-like property with respect to the order on W. The Agda code for world-indexed relations and constructions on them can be found in Wood and Atkey [22].

Example 5. When W is the 1-element set, a world-indexed relation is just a set equipped with a binary relation.

Morphisms (WRelMor) between world-indexed relations R and S consist of a mapping between the underlying sets such that, at each fxed world w, the mapping preserves relatedness from R to S.

When the poset of worlds forms a (relational) commutative monoid, such world-indexed relations support a symmetric monoidal closed structure, with objects denoted Iᴿ, ⊗ᴿ, and ⊸ᴿ. These reuse the bunched connectives I∗, ∗, and −∗, now over worlds rather than contexts.

The final piece of semantics we need is a bang operator. We allow the semantic bang to be an arbitrary annotation-indexed functor on world-indexed relations. This functor must respect all of the structure on the indices, making it a graded comonad over multiplication, as well as lax monoidal at any particular index r. These laws are listed in the Generic.Linear.Example.WRel module of [22].

Example 6. With W being the 1-element set and annotations coming from the variance posemiring, we can define the following bang. It is always the identity on the set component, while on the relation component it flips the relation for contravariance and takes a conjunction to achieve both covariance and contravariance. When we want neither covariance nor contravariance, we use the always-true predicate 1˙.

```
!ᴿ : WayUp → WRel ≤ʷ → WRel ≤ʷ
!ᴿ a R .set = R .set
!ᴿ ↑↑ R .rel = R .rel
!ᴿ ↓↓ R .rel x y = R .rel y x
!ᴿ ?? R .rel x y = 1˙
!ᴿ ˜˜ R .rel x y = R .rel x y ×˙ R .rel y x
!ᴿ a R .subres = id
```
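Ignoring worlds (they are trivial in this example), the same bang can be sketched in Haskell as a relation transformer; the names below are ours, not the framework's:

```
type Rel a = a -> a -> Bool

data WayUp = UpUp | DownDown | Unknown | Both   -- ↑↑, ↓↓, ??, ˜˜

bang :: WayUp -> Rel a -> Rel a
bang UpUp     r = r                             -- covariance: keep the relation
bang DownDown r = flip r                        -- contravariance: flip it
bang Unknown  _ = \_ _ -> True                  -- neither: always true
bang Both     r = \x y -> r x y && r y x        -- both: conjunction of the two
```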

The semantics of a type is given by ⟦_⟧, which maps into world-indexed relations. The function type is interpreted using ⊸ᴿ and !ᴿ. Contexts are interpreted by ⟦_⟧ᶜ, using ⊗ᴿ and Iᴿ. Terms are interpreted as morphisms by the open family ⟦_⊢_⟧. Variables are interpreted by lookupᴿ (definition omitted).

```
lookupᴿ : ∀ {Γ A} → Γ ⊐− A → ⟦ Γ ⊢ A ⟧
```

Now we give a Semantics. The choice of V as ⊐− is somewhat arbitrary, given that a standard denotational semantics would not use intermediate environments in the same sense as renaming and substitution, but it allows us to reuse the standard facts that variables support renaming and identity environments. With this choice of V and C, we interpret environment entries by lookupᴿ. Meanwhile, for the logical rules, we ignore environments by using reify to deal just with morphisms in an extended context. As such, λ-abstractions are easy to interpret, while applications require some massaging to remove the extension by an empty context, followed by some plumbing to split the interpretation of the context according to the usage constraints and feed the interpretation of the argument n′ into the interpretation of the function m′.

```
Wrel : Semantics AnnArr _⊐−_ ⟦_⊢_⟧
Wrel .renˆV = renˆ⊐−
Wrel .⟦var⟧ = lookupᴿ
Wrel .⟦con⟧ ('lam rA B , ≡.refl , m) = curryᴿ (reify m)
Wrel .⟦con⟧ ('app rA B , ≡.refl , m ∗⟨ sp+ ⟩ (⟨ sp* ⟩· n)) =
  …  -- uncurryᴿ (reify m) composed, via the splittings sp+ and sp*, with reify n
```

Then, the semantics of terms is given by the function semantics Wrel 1ʳ, where 1ʳ is the identity renaming.

Example 7. We can make a subtraction function from primitive addition and negation on integers. Subtraction is covariant in its first argument and contravariant in its second argument. We give the definition in pseudocode, though it is also amenable to the usage elaborator of section 7.2, suitably instantiated.

$$\omega\,p : {\uparrow\uparrow}\mathbb{Z} \multimap {\uparrow\uparrow}\mathbb{Z} \multimap \mathbb{Z},\ \omega\,n : {\downarrow\downarrow}\mathbb{Z} \multimap \mathbb{Z} \vdash \mathit{minus} : {\uparrow\uparrow}\mathbb{Z} \multimap {\downarrow\downarrow}\mathbb{Z} \multimap \mathbb{Z}$$
$$\mathit{minus} := \lambda x.\ \lambda y.\ p\;x\;(n\;y)$$

After feeding in Agda's addition and negation functions as the interpretations of the free variables (noting that they are both monotonic in the required way), we get the following free theorem.

thm : x Z.≤ x′ → y′ Z.≤ y → x Z.+ (Z.- y) Z.≤ x′ Z.+ (Z.- y′)
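The extracted theorem is an ordinary monotonicity statement. As a sanity check, here it is phrased as a testable Haskell property (our own phrasing, not part of the development):

```
minus :: Integer -> Integer -> Integer
minus x y = x + negate y

-- thm as a boolean property: x <= x' and y' <= y imply minus x y <= minus x' y'
prop_minusMono :: Integer -> Integer -> Integer -> Integer -> Bool
prop_minusMono x x' y y' =
  not (x <= x' && y' <= y) || minus x y <= minus x' y'
```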

### 8 Conclusions

We have presented a framework for doing metatheory of a class of substructural type systems in Agda. The framework gives us renaming, substitution, and a usage elaborator for new syntaxes for free, which we hope can facilitate prototyping and the mechanisation of more interesting semantic results. Besides the mechanised framework itself, we believe its methodology — the use of bunched premise combinators — can guide and simplify the development of (potentially unmechanised) substructural type systems.

Our account of substructurality is based on the linear-algebraic principles described by Wood and Atkey [21]. However, these details only really affect the definition of environment, in which the use of linear maps is motivated by their being the standard notion of morphism between vectors. We could imagine that a similar notion of morphism could be found for the kinds of annotations used by Licata et al. [13], allowing a framework that covers finer substructural systems.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### A Dependent Dependency Calculus

Pritam Choudhury1, Harley Eades III<sup>2</sup> , and Stephanie Weirich<sup>3</sup>

<sup>1</sup> University of Pennsylvania, Philadelphia, PA, USA, pritam@seas.upenn.edu <sup>2</sup> Augusta University, Augusta, GA, USA <sup>3</sup> University of Pennsylvania, Philadelphia, PA, USA

Abstract. Over twenty years ago, Abadi et al. established the Dependency Core Calculus (DCC) as a general-purpose framework for analyzing dependency in typed programming languages. Since then, dependency analysis has shown many practical benefits to language design: its results can help users and compilers enforce security constraints and eliminate dead code, among other applications. In this work, we present a Dependent Dependency Calculus (DDC), which extends this general idea to the setting of a dependently-typed language. We use this calculus to track both run-time and compile-time irrelevance, enabling faster typechecking and program execution.

Keywords: Dependent Types · Information Flow · Irrelevance

### 1 Dependency Analysis

Consider this judgment from a type system that has been augmented with dependency analysis.

```
x :ᴸ Int, y :ᴴ Bool, z :ᴹ Bool ⊢ if z then x else 3 :ᴹ Int
```
In this judgment, L, M, and H stand for low, medium, and high security levels, respectively. The computed value of the expression is meant to be a medium-security result. The inputs x, y, and z have been marked with their respective security levels. This expression type-checks because it is permissible for medium-security results to depend on both low- and medium-security inputs. Note that the high-security boolean variable y is not used in the expression. However, if we replace z with y in the conditional, then the type checker would reject that expression. Even though the high-security input would not be returned directly, the medium-security result would still depend on it.
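The flow discipline behind this judgment can be made concrete in a few lines. The following Haskell sketch of the three-point lattice and its flow check is our own illustration, not part of DCC:

```
-- Security levels with L ⊑ M ⊑ H, using the derived Ord as the lattice order
data Level = L | M | H deriving (Eq, Ord, Show)

-- information may flow from l to l' exactly when l ⊑ l'
flowsTo :: Level -> Level -> Bool
flowsTo = (<=)

-- the example: a medium-security result may depend on L- and M-inputs, not on H
wellTyped :: Bool
wellTyped = (L `flowsTo` M) && (M `flowsTo` M) && not (H `flowsTo` M)
```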

Dependency analysis, as we see above, is an expressive addition to programming languages. Such analyses allow languages to protect sensitive information [30,16], support run-time code generation [33], slice programs while preserving behavior [34], etc. Several existing dependency analyses were unified by Abadi et al. [1] in their Dependency Core Calculus (DCC). This calculus has served as a foundation for static analysis of dependencies in programming languages.

What makes DCC powerful is the parameterization of the type system by a generic lattice of dependency levels. Dependency analysis, in essence, is about ensuring secure information flow—that information never flows from more secure to less secure levels. Denning [13] showed that a lattice model, where increasing order corresponds to higher security, can be used to enforce secure flow of information. DCC integrates this lattice model with the computational λ-calculus [22] by grading the monad operator of the latter with elements of the former. This integration enables DCC to analyze dependencies in its type system.

However, even though many typed languages have included dependency analysis in some form, this feature has seen relatively little attention in the context of dependently-typed languages. This is unfortunate because, as we show in this paper, dependency analysis can provide an elegant foundation for compile-time and run-time irrelevance, two important concerns in the design of dependently-typed languages. Compile-time irrelevance identifies sub-expressions that are not needed for type checking, while run-time irrelevance identifies sub-expressions that do not affect the result of evaluation. By ignoring or erasing such sub-expressions, compilers for dependently-typed languages increase the expressiveness of the type system, improve compilation time, and produce more efficient executables.

Therefore, in this work, we augment a dependently-typed language with a primitive notion of dependency analysis and use it to track compile-time and run-time irrelevance. We call this language DDC, for Dependent Dependency Calculus, in homage to DCC. Although our dependency analyses are structured differently, we show that DDC can faithfully embed the terminating fragment of DCC and support its many well-known applications, in addition to our novel application of tracking compile-time and run-time irrelevance.

More specifically, our work makes the following contributions:


ness theorems for DDC. This mechanization is available online⁴ and as a self-contained artifact [11].

⁴ https://github.com/sweirich/graded-haskell

### 2 Irrelevance and Dependent Types

Run-time irrelevance (sometimes called erasure) and compile-time irrelevance are two forms of dependency analyses that arise in dependent type theories. Tracking these dependencies helps compilers produce faster executables and makes type checking more flexible [27,19,6,20,3,18,4,24,32,23].

#### 2.1 Run-time Irrelevance

Parts of a program that are not required at run time are said to be run-time irrelevant. Our goal is to identify such parts. Let us consider some examples. We shall mark variables and arguments with ⊤ if they can be erased prior to execution and leave them unmarked if they must be preserved.

For example, the polymorphic identity function can be marked as:

id : Π x:⊤Type. x -> x
id = λ⊤x. λy. y

The first parameter of the identity function, x, is only needed during type checking; it can be erased before execution. The second parameter, y, though, is required at run time. When we apply this function to arguments, as in (id Bool⊤ True), we can erase the first argument, Bool, but the second one, True, must be retained.

Indexed data structures provide another example of run-time irrelevance.

Consider the Vec datatype for length-indexed vectors, as it might look in a core language inspired by GHC [31,35]. The Vec datatype has two parameters, n and a, that also appear in the types of the data constructors Nil and Cons. These parameters are relevant to Vec, but irrelevant to the data constructors. (In the types of the constructors, the equality constraints (n ∼ Zero) and (n ∼ Succ m) force n to be equal to the length of the vector.)

```
Vec  : Nat -> Type -> Type
Nil  : Π n:⊤Nat. Π a:⊤Type. (n ∼ Zero) => Vec n a
Cons : Π n:⊤Nat. Π a:⊤Type. Π m:⊤Nat. (n ∼ Succ m) => a -> Vec m a -> Vec n a
```
Now consider a function vmap that maps a given function over a given vector. The length of the vector and the type arguments are not necessary for running vmap; they are all erasable. So we assign them ⊤.

```
vmap : Π n:⊤Nat. Π a b:⊤Type. (a -> b) -> Vec n a -> Vec n b
vmap = λ⊤n a b. λ f xs.
         case xs of
           Nil          -> Nil
           Cons m⊤ x xs -> Cons m⊤ (f x) (vmap m⊤ a⊤ b⊤ f xs)
```
Note that the ⊤-marked variables m, a, and b appear in the definition of vmap, but only in ⊤ contexts. By requiring that 'unmarked' terms do not depend on terms marked with ⊤, we can track run-time irrelevance and guarantee safe erasure. Observe that even though these arguments are marked with ⊤ to describe their use in the definition of vmap, this marking does not reflect their usage in the type of vmap. In particular, we are free to use these variables with Vec in a relevant manner.
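
For intuition about the payoff, here is a plain-Haskell sketch (our own illustration, with hypothetical names) of what remains of Vec and vmap once every ⊤-marked argument, index, and proof is erased:

```
-- hypothetical erased counterparts: the ⊤-marked indices and the
-- equality proofs of Nil/Cons leave no run-time trace
data Vec a = Nil | Cons a (Vec a) deriving Show

-- erased vmap is exactly the ordinary map: no lengths or types passed
vmap :: (a -> b) -> Vec a -> Vec b
vmap _ Nil         = Nil
vmap f (Cons x xs) = Cons (f x) (vmap f xs)
```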

#### 2.2 Compile-time Irrelevance

Some type constructors may have arguments which can be ignored during type checking. Such arguments are said to be compile-time irrelevant. For example, suppose we have a constant function that ignores its argument and returns a type.

phantom : Nat⊤ -> Type
phantom = λ⊤x. Bool

To type check idp below, we must show that phantom 0 equals phantom 1. Without compile-time irrelevance, we need to β-reduce both sides to show that the input and output types are equal.

idp : phantom 0⊤ -> phantom 1⊤
idp = λ x. x

However, in the presence of compile-time irrelevance, we can use the dependency information contained in the type of a function to reason about it abstractly. Because the function f below ignores its argument, it is sound to equate the input and output types.

```
ida : Π f:⊤(Nat⊤ -> Type). f 0⊤ -> f 1⊤
ida = λ⊤f. λ x. x
```
In the absence of compile-time irrelevance, we cannot type-check ida. So compile-time irrelevance makes type checking more flexible.

Compile-time irrelevance can also make type checking faster when types contain expensive computation that can safely be ignored. For example, the following program type checks even without compile-time irrelevance, but in that case the type checker must show that fib 28 reduces to 317811, where fib is the Fibonacci function.

```
idn : Π f:⊤(Nat⊤ -> Type). f (fib 28)⊤ -> f 317811⊤
idn = λ⊤f. λ x. x
```
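
For a sense of the cost being avoided, the following Haskell sketch (our own; it assumes the usual convention fib 0 = 0 and fib 1 = 1, under which fib 28 = 317811) shows the computation that a checker without compile-time irrelevance would have to normalize:

```
-- naive Fibonacci; fib 28 = 317811, at the cost of several hundred
-- thousand recursive calls that a checker without compile-time
-- irrelevance would perform during type equality checking
fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (fib 28)  -- 317811
```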
So far, we have used two annotations on variables and terms: ⊤ for irrelevant ones and 'unmarked' for relevant ones. We used ⊤ to mark both arguments that can be erased at run time and arguments that can be safely ignored by the type checker. However, sometimes we need a finer distinction.

#### 2.3 Strong Irrelevant Σ-types

Consider the type Σm:⊤Nat. Vec m a, which contains pairs whose first component is marked as irrelevant. This type might be useful, say, for the output of a filter function for vectors, where the length of the output vector cannot be calculated statically. If we never need this length at run time, then it would be good to mark it with ⊤ so that it need not be stored.<sup>5</sup>

However, marking m with ⊤ means that the first component of a pair of this type must also be compile-time irrelevant. This results in a significant limitation for strong Σ-types: we cannot project the second component from the pair. Say we have ys : Σm:⊤Nat. Vec m a. The type of (π₁ ys) is a Nat that can only be used in irrelevant positions. However, note that the argument n in Vec n a must be compile-time relevant; otherwise, the type checker would equate Vec 0 a with Vec 1 a, making the length index meaningless. The type of (π₂ ys) would then be Vec (π₁ ys) a, which is ill-formed because an irrelevant term (π₁ ys) appears in a relevant position.

Therefore, we do not want to mark the first component of the output of filter with ⊤. However, if we leave it unmarked, we cannot erase it at run time, something that we might want to do. A way out of this quandary comes from considering terms that are run-time irrelevant but not compile-time irrelevant. Such terms sit between completely irrelevant and completely relevant terms: they should not depend upon irrelevant terms, and relevant terms should not depend upon them. We mark such terms with a new annotation, C, with the constraints that 'unmarked' terms do not depend on C terms and C terms do not depend on ⊤ terms. The three annotations then correspond to the three levels of a lattice modelling secure information flow, with ⊥ < C < ⊤, using ⊥ in lieu of 'unmarked'. We call this lattice Lᴵ, for irrelevance lattice. Using this lattice, we can type check the following filter function.

```
filter : Π n:⊤Nat. Π a:⊤Type. (a -> Bool) -> Vec n a -> Σm:C Nat. Vec m a
filter = λ⊤n a. λ f vec.
          case vec of
            Nil -> (Zero^C, Nil)
            Cons n1⊤ x xs
              | f x -> ((Succ (π₁ ys))^C, Cons (π₁ ys)⊤ x (π₂ ys))
                         where ys = filter n1⊤ a⊤ f xs
              | _ -> filter n1⊤ a⊤ f xs
```
Eisenberg et al. [14] observe that, in Haskell, it is important to use projection functions to access the components of the pair that results from the recursive call (as in π₁ ys and π₂ ys) to ensure that filter is not excessively strict. If filter instead used pattern matching to eliminate the pair returned by the recursive

<sup>5</sup> It is, however, safe for m to be used in a relevant position in the body of the Σ-type even when it is marked with ⊤. This marking indicates how the first component of a pair having this type is used, not how the bound variable m is used in the body of the type.

call, it would need to filter the entire vector before returning the first successful value. This filter function demonstrates the practical utility of strong irrelevant Σ-types: it supports the same run-time behavior as the usual list filter function, but with a more richly-typed data structure.
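
The strictness issue can be replayed in plain Haskell, dropping the lengths; this sketch (our own, with hypothetical names) contrasts pattern matching on the recursive call with lazy projections:

```
-- strict variant: the case forces the recursive call's pair before
-- producing anything, so the whole list is filtered before the first
-- element of the result is available
filterStrict :: (a -> Bool) -> [a] -> (Int, [a])
filterStrict _ [] = (0, [])
filterStrict p (x:xs) =
  case filterStrict p xs of
    (n, ys) -> if p x then (n + 1, x : ys) else (n, ys)

-- lazy variant, in the style of the vector filter above: projections
-- delay the recursive call until a component is actually demanded
filterLazy :: (a -> Bool) -> [a] -> (Int, [a])
filterLazy _ [] = (0, [])
filterLazy p (x:xs)
  | p x       = (1 + fst rest, x : snd rest)
  | otherwise = rest
  where rest = filterLazy p xs

-- head (snd (filterLazy even [0..])) returns 0 immediately;
-- the strict variant would diverge on an infinite list
```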

### 3 A Simple Dependency Analyzing Calculus

Our ultimate goal is a dependent dependency calculus. However, we start with a simply-typed version so that we can explain our approach to dependency analysis and non-interference in a simplified setting.

We call the calculus of this section SDC, for Simple Dependency Calculus. This calculus is parameterized by a lattice of labels or grades, which can also be thought of as security levels.<sup>6</sup> An excerpt of this calculus appears in Figure 1; it is an extension of the simply-typed λ-calculus with a grade-indexed modal type T^ℓ A. The modal type T^ℓ A can be thought of as putting a security barrier of grade ℓ around the values of A. The calculus itself is also graded: in a typing judgment, the derived term and every variable in the context carries a label or grade. (The specification of the full system, which includes unit, products, and sums, appears in the extended version of this paper [12].)

#### 3.1 Type System

The typing judgment has the form Ω ⊢ a :^ℓ A, which means that "ℓ is allowed to observe a" or that "a is visible at ℓ". Selected typing rules for SDC appear in Figure 1. Most rules are straightforward and propagate the level of the sub-terms to the expression.

The rule SDC-Var requires that the grade of the variable in the context be less than or equal to the grade of the observer. In other words, an observer at level ℓ is allowed to use a variable from level k if and only if k ≤ ℓ. If the variable's level is too high, then this rule does not apply, ensuring that information can always flow to more secure levels but never to less secure ones. The abstraction rule SDC-Abs uses the current level of the expression for the newly introduced variable in the context. This makes sense because the argument to the function is checked at the same level in rule SDC-App.

The modal type, introduced and eliminated with rules SDC-Return and SDC-Bind respectively, manipulates the levels. The former says that, if a term is (ℓ ∨ ℓ₀)-secure, then we can put it in an ℓ₀-secure box and release it at level ℓ. An ℓ₀-secure boxed term can be unboxed only by someone who has security clearance for ℓ₀, as we see in the latter rule. The join operation in rule SDC-Bind ensures that b can depend on a only if b itself is ℓ₀-secure, that is, only if ℓ₀ ≤ ℓ.
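
For instance, instantiating rule SDC-Return with ℓ = L and ℓ₀ = H shows how a high-security value can be boxed and the box itself handled by a low-security observer: the premise checks the contents at L ∨ H = H, while the boxed result is visible at L.

$$\frac{\varnothing \vdash \mathsf{True} :^{L \vee H} \mathsf{Bool}}{\varnothing \vdash \eta^{H}\ \mathsf{True} :^{L} T^{H}\,\mathsf{Bool}}\ \text{SDC-Return}$$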

<sup>6</sup> We use the terms label, level and grade interchangeably.

$$\begin{array}{llll}
\text{labels} & \ell, k & ::= & \bot \mid \top \mid k \vee \ell \mid k \wedge \ell \mid \ldots\\
\text{types} & A, B & ::= & \textbf{Unit} \mid A \to B \mid T^{\ell} A\\
\text{terms} & a, b & ::= & x \mid \lambda x{:}A.\ a \mid a\ b \mid \eta^{\ell}\ a \mid \textbf{bind}^{\ell}\ x = a\ \textbf{in}\ b
\end{array}$$

$$\frac{x :^{k} A \in \Omega \quad k \le \ell}{\Omega \vdash x :^{\ell} A}\ \text{SDC-Var}
\qquad
\frac{\Omega, x :^{\ell} A \vdash b :^{\ell} B}{\Omega \vdash \lambda x{:}A.\ b :^{\ell} A \to B}\ \text{SDC-Abs}
\qquad
\frac{\Omega \vdash b :^{\ell} A \to B \quad \Omega \vdash a :^{\ell} A}{\Omega \vdash b\ a :^{\ell} B}\ \text{SDC-App}$$

$$\frac{\Omega \vdash a :^{\ell \vee \ell_0} A}{\Omega \vdash \eta^{\ell_0}\ a :^{\ell} T^{\ell_0} A}\ \text{SDC-Return}
\qquad
\frac{\Omega \vdash a :^{\ell} T^{\ell_0} A \quad \Omega, x :^{\ell_0 \vee \ell} A \vdash b :^{\ell} B}{\Omega \vdash \textbf{bind}^{\ell_0}\ x = a\ \textbf{in}\ b :^{\ell} B}\ \text{SDC-Bind}$$

$$(\lambda x{:}A.\ b)\ a \leadsto b\{a/x\}\ \ \text{SDCStep-Beta}
\qquad
\textbf{bind}^{\ell}\ x = \eta^{\ell}\ a\ \textbf{in}\ b \leadsto b\{a/x\}\ \ \text{SDCStep-BindBeta}$$

Fig. 1. Simple Dependency Calculus SDC (excerpt)

#### 3.2 Meta-theoretic Properties

This type system satisfies the following properties related to levels.

First, we can always weaken our assumptions about the variables in the context: if a term is derivable with an assumption held at some grade, then it is also derivable with that assumption held at any lower grade. Below, for any two contexts Ω₁, Ω₂, we say that Ω₁ ≤ Ω₂ iff they are the same modulo the grades and, further, for any x, if x :^ℓ₁ A ∈ Ω₁ and x :^ℓ₂ A ∈ Ω₂, then ℓ₁ ≤ ℓ₂.

Lemma 1 (Narrowing). If Ω′ ⊢ a :^ℓ A and Ω ≤ Ω′, then Ω ⊢ a :^ℓ A.

Narrowing says that we can always downgrade any variable in the context. Conversely, we cannot upgrade context variables in general, but we can upgrade them to the level of the judgment.

Lemma 2 (Restricted Upgrading). If Ω₁, x :^ℓ₀ A, Ω₂ ⊢ b :^ℓ B and ℓ₁ ≤ ℓ, then Ω₁, x :^ℓ₀∨ℓ₁ A, Ω₂ ⊢ b :^ℓ B.

The restricted upgrading lemma is needed to show subsumption. Subsumption states that, if a term is visible at some grade, then it is also visible at all higher grades.

Lemma 3 (Subsumption). If Ω ⊢ a :^ℓ A and ℓ ≤ k, then Ω ⊢ a :^k A.

Subsumption is necessary (along with a standard weakening lemma) to show that substitution holds for this language. For substitution, we need to ensure that the level of the variable matches that of the substituted expression.

Lemma 4 (Substitution). If Ω₁, x :^ℓ₀ A, Ω₂ ⊢ b :^ℓ B and Ω₁ ⊢ a :^ℓ₀ A, then Ω₁, Ω₂ ⊢ b{a/x} :^ℓ B.

SDC terms are reduced using a call-by-name strategy. An excerpt of the small-step semantics appears in Figure 1. Note how the labels on the introduction form and the corresponding elimination form match up in rules SDCStep-Beta and SDCStep-BindBeta. Further, note that we could also have used a call-by-value strategy to reduce SDC terms; we chose call-by-name because our development is motivated by potential applications in Haskell.
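
As a concrete rendering of these two rules, the following Haskell sketch implements one call-by-name step on a toy, untyped version of the SDC syntax; the datatype, the naive substitution (which assumes distinct binder names and ignores capture), and all names are our own illustrative choices:

```
-- toy, untyped SDC terms; Level is a three-point stand-in for an
-- arbitrary lattice
data Level = Bot | Mid | Top deriving (Eq, Ord, Show)

data Tm
  = Var String
  | Lam String Tm            -- λx. b (type annotation omitted)
  | App Tm Tm
  | Ret Level Tm             -- η^ℓ a
  | Bind Level String Tm Tm  -- bind^ℓ x = a in b
  deriving Show

-- naive substitution; assumes all binder names are distinct (no capture)
subst :: String -> Tm -> Tm -> Tm
subst x a t = case t of
  Var y        -> if x == y then a else t
  Lam y b      -> Lam y (if x == y then b else subst x a b)
  App b c      -> App (subst x a b) (subst x a c)
  Ret l b      -> Ret l (subst x a b)
  Bind l y b c -> Bind l y (subst x a b) (if x == y then c else subst x a c)

-- one call-by-name step
step :: Tm -> Maybe Tm
step (App (Lam x b) a) = Just (subst x a b)              -- SDCStep-Beta
step (App b a)         = fmap (`App` a) (step b)
step (Bind l x (Ret l' a) b)
  | l == l'            = Just (subst x a b)              -- SDCStep-BindBeta
step (Bind l x a b)    = fmap (\a' -> Bind l x a' b) (step a)
step _                 = Nothing
```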

For a call-by-name operational semantics, the above lemmas allow us to prove a standard progress-and-preservation type soundness result, which we omit here.

Next, we show that our type system is secure by proving non-interference.

#### 3.3 A Syntactic Proof of Non-interference

When users with low-security clearance are oblivious to high-security data, we say that the system enjoys non-interference. Non-interference results from level-specific views of the world. The values η^H True and η^H False appear the same to an L-user, while an H-user can differentiate between them. To capture this notion of a level-specific view, we design an indexed equivalence relation on open terms, ∼ℓ, called indexed indistinguishability, shown in Figure 2. To define this relation, we need the labels of the variables in the context but not their types. So we use grade-only contexts Φ, defined as Φ ::= ∅ | Φ, x : ℓ. These contexts are like graded contexts Ω without the type information on variables; we write |Ω| for the grade-only context obtained from Ω.

Informally, Φ ⊢ a ∼ℓ b means that a and b appear the same to an ℓ-user. For example, η^H True ∼L η^H False, but ¬(η^H True ∼H η^H False). We define the relation ∼ℓ by structural induction on terms. We think of terms as ASTs annotated at various nodes with labels, say ℓ₀, that determine whether an observer ℓ is allowed to look at the corresponding sub-tree. If ℓ₀ ≤ ℓ, then observer ℓ can start exploring the sub-tree; otherwise, the entire sub-tree appears as a blob. So we can also read Φ ⊢ a ∼ℓ b as: "a is syntactically equal to b at all parts of the terms marked with any label ℓ₀ such that ℓ₀ ≤ ℓ, but the two may be arbitrarily different elsewhere."

Note the rule SGEq-Return in Figure 2. It uses an auxiliary relation, $\Phi \vdash^{\ell}_{\ell_0} a_1 \sim a_2$. This auxiliary extended equivalence relation formalizes

<sup>7</sup> Because this relation is untyped, its analogue for DDC is similar. For each lemma below, we include a reference to the location in the Coq development where it may be found for the dependent system.

Φ ⊢ a ∼ℓ b   (Indexed Indistinguishability)

Fig. 2. Indexed indistinguishability for SDC (Excerpt)

the idea discussed above: if ℓ₀ ≤ ℓ, then a₁ and a₂ must be indistinguishable at ℓ; otherwise, they may be arbitrary terms.
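
The following self-contained Haskell sketch renders the shape of these relations on a toy syntax; Level stands in for an arbitrary lattice, and the Ret case mirrors rule SGEq-Return. It is our own illustration, not the mechanized definition:

```
data Level = Bot | Mid | Top deriving (Eq, Ord, Show)

data Tm
  = Var String
  | Lam String Tm
  | App Tm Tm
  | Ret Level Tm
  | Bind Level String Tm Tm
  deriving (Eq, Show)

-- indist l a b: terms a and b look the same to an l-observer
indist :: Level -> Tm -> Tm -> Bool
indist _ (Var x)         (Var y)           = x == y
indist l (Lam x a)       (Lam y b)         = x == y && indist l a b
indist l (App a1 a2)     (App b1 b2)       = indist l a1 b1 && indist l a2 b2
indist l (Ret l0 a)      (Ret l0' b)       =
  -- mirrors SGEq-Return: boxed contents matter only if the box is visible
  l0 == l0' && (not (l0 <= l) || indist l a b)
indist l (Bind l0 x a c) (Bind l0' y b d)  =
  l0 == l0' && x == y && indist l a b && indist l c d
indist _ _ _ = False

-- indist Bot (Ret Top (Var "u")) (Ret Top (Var "v")) == True
-- indist Top (Ret Top (Var "u")) (Ret Top (Var "v")) == False
```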

Now, we explore some properties of the indistinguishability relation.<sup>7</sup> If we remove the second component from an indistinguishability judgment, Φ ⊢ a ∼ℓ b, we get a new judgment, Φ ⊢ a : ℓ, called the grading judgment. Corresponding to every indistinguishability rule, we define a grading rule where the indistinguishability judgments have been replaced with their grading counterparts. Terms derived using these grading rules are called well-graded. We can show that well-typed terms are well-graded.

Lemma 5 (Typing implies grading). If Ω ⊢ a :^ℓ A, then |Ω| ⊢ a : ℓ.

Lemma 6 (Equivalence). Indexed indistinguishability at ℓ is an equivalence relation on well-graded terms at ℓ.

The above lemma shows that indistinguishability is an equivalence relation. Observe that at the highest element of the lattice, ⊤, this equivalence degenerates to the identity relation.

Indistinguishability is closed under extended equivalence. The following is like a substitution lemma for the relation.

Lemma 7 (Indistinguishability under substitution). If Φ, x : ℓ ⊢ b₁ ∼k b₂ and $\Phi \vdash^{k}_{\ell} a_1 \sim a_2$, then Φ ⊢ b₁{a₁/x} ∼k b₂{a₂/x}.

With regard to the above lemma, consider the situation when ¬(ℓ ≤ k), for example, when ℓ = H and k = L. In such a situation, for any two terms a₁ and a₂, if Φ, x : ℓ ⊢ b₁ ∼k b₂, then Φ ⊢ b₁{a₁/x} ∼k b₂{a₂/x}. Let us work out a concrete example. For a typing derivation x :^H A ⊢ b :^L Bool, we have, by Lemmas 5 and 6, x : H ⊢ b ∼L b. Then, ∅ ⊢ b{a₁/x} ∼L b{a₂/x}. This is almost non-interference in action. What is left to show is that the indistinguishability relation respects the small-step semantics, written a₁ ⇝ a₂. The small-step relation is standard call-by-name reduction.

Theorem 1 (Non-interference). If Φ ⊢ a₁ ∼k a₁′ and a₁ ⇝ a₂, then there exists some a₂′ such that a₁′ ⇝ a₂′ and Φ ⊢ a₂ ∼k a₂′.

Since the step relation is deterministic, there is exactly one such a₂′ that a₁′ steps to. Going back to our last example, we see that b{a₁/x} and b{a₂/x} take steps in tandem and remain L-indistinguishable after each and every step. Since the language itself is terminating, both terms reduce to boolean values, which are themselves L-indistinguishable. But indistinguishability on boolean values is just the identity relation. This means that b{a₁/x} and b{a₂/x} reduce to the same value.

The indistinguishability relation gives us a syntactic method for proving non-interference for programs derived in SDC. Essentially, we show that a user with low-security clearance cannot distinguish between high-security values just by observing program behavior.

Next, we show that SDC is no less expressive than the terminating fragment of DCC.

#### 3.4 Relation with Sealing Calculus and Dependency Core Calculus

SDC is very similar to the sealing calculus λ[] of Shikuma and Igarashi [29]. Like SDC, λ[] has a label on the typing judgment.<sup>8</sup> But unlike SDC, λ[] uses standard, ungraded typing contexts Γ. Both calculi have the same types. As far as terms are concerned, there is only one difference: the sealing calculus has an unseal term whereas SDC uses bind. We present the rules for sealing and unsealing terms in λ[] below.<sup>9</sup>

$$\frac{\Gamma \vdash a :^{\ell \vee \ell_0} A}{\Gamma \vdash \eta^{\ell_0}\ a :^{\ell} T^{\ell_0} A}\ \text{Sealing-Seal}
\qquad
\frac{\Gamma \vdash a :^{\ell} T^{\ell_0} A \quad \ell_0 \le \ell}{\Gamma \vdash \textbf{unseal}^{\ell_0}\ a :^{\ell} A}\ \text{Sealing-Unseal}$$

Shikuma and Igarashi [29] have shown that λ[] is equivalent to DCC_pc, an extension of the terminating fragment of DCC. Therefore, we compare SDC to DCC by simulating λ[] in SDC. For this, we define a translation ⟦·⟧ from λ[] to SDC. Most of the cases are handled inductively in a straightforward manner. For unseal, we have ⟦unseal^ℓ a⟧ := bind^ℓ x = ⟦a⟧ in x.

<sup>8</sup> Note that our labels correspond to observer levels of [29], which can be viewed as a lattice.

<sup>9</sup> We take the liberty of making small cosmetic changes in the presentation.

With this translation, we can give a forward and a backward simulation connecting the two languages. The reduction relation ⇝ below is full reduction for both languages, the strategy used by Shikuma and Igarashi [29] for λ[]. Full reduction is a non-deterministic reduction strategy whereby a β-redex in any sub-term may be reduced.

Theorem 2 (Forward Simulation). If a ⇝ a′ in λ[], then ⟦a⟧ ⇝ ⟦a′⟧ in SDC.

Theorem 3 (Backward Simulation). For any term a in λ[], if ⟦a⟧ ⇝ b in SDC, then there exists a′ in λ[] such that b = ⟦a′⟧ and a ⇝ a′.

The translation also preserves typing. In fact, a source term and its target have the same type. Below, for an ordinary context Γ, the graded context Γ^ℓ denotes Γ with the labels of all variables set to ℓ.

Theorem 4 (Translation Preserves Typing). If Γ ⊢ a :^ℓ A in λ[], then Γ^ℓ ⊢ ⟦a⟧ :^ℓ A in SDC.

The above translation shows that the terminating fragment of DCC can be embedded into SDC. Therefore SDC is at least as expressive as the terminating fragment of DCC. Further, SDC lends itself nicely to syntactic proof techniques for non-interference. This approach generalizes to more expressive systems, as we shall see in the next section, where we extend SDC to a general dependent dependency calculus.

### 4 A Dependent Dependency Analyzing Calculus


Fig. 3. Dependent Dependency Calculus Grammar (Types and Terms)

Here and in the next section, we present dependently-typed languages with dependency analysis in the style of SDC. The first extension, called DDC⊤, is a straightforward integration of labels and dependent types. This system subsumes SDC, and so can be used for the same purposes. Here, we show how it can be used to analyze run-time irrelevance. Then, in Section 5, we generalize this system to DDC, which allows definitional equality to ignore unnecessary sub-terms, thus also enabling compile-time irrelevance. We present the system in this way both to simplify the presentation and to show that DDC⊤ is an intermediate point in the design space.

Both DDC⊤ and DDC are pure type systems [5]. They share the same syntax, shown in Figure 3, combining terms and types into the same grammar. They are parameterized by a set of sorts s, a set of axioms A(s₁, s₂), which is a binary relation on sorts, and a set of rules R(s₁, s₂, s₃), which is a ternary relation on sorts. For simplicity, we assume, without loss of generality, that for every sort s₁, there is some sort s₂ such that A(s₁, s₂).<sup>10</sup>

We annotate several syntactic forms with grades for dependency analysis. The dependent function type, written Πx:^ℓ A.B, includes the grade of the argument to a function having this type. Similarly, the dependent pair type, written Σx:^ℓ A.B, includes the grade of the first component of a pair having this type.<sup>11</sup> We can interpret these types as a fusion of the usual, ungraded dependent types and the graded modality T^ℓ A we saw earlier. In other words, Πx:^ℓ A.B acts like the type Πy:(T^ℓ A). bind x = y in B, and Σx:^ℓ A.B acts like the type Σy:(T^ℓ A). bind x = y in B. Because of this fusion, we do not need to add the graded modality as a separate type form: we can define T^ℓ A as Σx:^ℓ A.Unit. Using Πx:^ℓ A.B instead of Πy:(T^ℓ A). bind x = y in B has an advantage: the former allows x to be held at differing grades while type checking B and the body of a function having this Π-type, while the latter requires x to be held at the same grade in both cases. We utilize this flexibility in Section 5.

#### 4.1 DDC⊤: Π-types

The core typing rules for DDC⊤ appear in Figure 4. As in the simple type system, the variables in the context are labelled, and the judgment itself includes a label ℓ. Rule DCT-Var is similar to its counterpart in the simply-typed language: the variable being observed must be graded less than or equal to the level of the observer. Rule DCT-Pi propagates the level of the expression to the sub-terms of the Π-type. Note that this type is annotated with an arbitrary label ℓ₀: the purpose of this label is to denote the level at which the argument to a function having this type may be used.

In rule DCT-Abs, the parameter of the function is introduced into the context at level ℓ₀ ∨ ℓ (akin to rule SDC-Bind). In rule DCT-App, the argument to the function is checked at level ℓ₀ ∨ ℓ (akin to rule SDC-Return). Note that the Π-type is checked at ⊤ in rule DCT-Abs. In DDC⊤, level ⊤ corresponds to 'compile-time' observers and motivates the superscript ⊤ in the language name.

Rule DCT-Conv converts the type of an expression to an equivalent type. The judgment |Ω| ⊢ A ≡⊤ B is a label-indexed definitional equality relation

<sup>10</sup> This assumption does not lead to any loss of generality because, given a pure type system (S′, A′, R′) that does not meet the above condition, we can provide another pure type system (S″, A″, R″), where S″ = S′ ∪ {D} (for some D ∉ S′), A″ = A′ ∪ {(s, D) | s ∈ S″}, and R″ = R′, such that there exists a straightforward bisimulation between the two systems.

<sup>11</sup> We use standard abbreviations when x is not free in B: we write ^ℓA → B for Πx:^ℓ A.B and ^ℓA × B for Σx:^ℓ A.B.


Fig. 4. DDC⊤ type system (core rules)

instantiated to ⊤. This relation is the closure of the indexed indistinguishability relation (Section 3.3) under small-step call-by-name evaluation. When instantiated to ⊤, the relation degenerates to β-equivalence. So rule DCT-Conv essentially casts a term to a β-equivalent type; in the next section, however, we utilize the flexibility of label-indexing to cast a term to a type that may not be β-equivalent. Also, note that the equality relation itself is untyped. As such, we need the third premise to guarantee that the new type is well-formed.

#### 4.2 DDC⊤: Σ-types

The language DDC⊤ includes Σ-types, as specified by the rules below.

$$\frac{\Omega \vdash A :^{\ell} s_1 \quad \Omega, x :^{\ell} A \vdash B :^{\ell} s_2}{\Omega \vdash \Sigma x{:}^{\ell_0} A.B :^{\ell} s_2}\ \text{DCT-WSigma}
\qquad
\frac{\Omega \vdash a :^{\ell_0 \vee \ell} A \quad \Omega \vdash b :^{\ell} B\{a/x\} \quad \Omega \vdash \Sigma x{:}^{\ell_0} A.B :^{\top} s}{\Omega \vdash (a^{\ell_0}, b) :^{\ell} \Sigma x{:}^{\ell_0} A.B}\ \text{DCT-WPair}$$

Like Π-types, Σ-types include a grade that is not related to how the bound variable is used in the body of the type. The grade indicates the level at which the first component of a pair having the Σ-type may be used. In rule DCT-WPair, we check the first component a of the pair at a level raised by ℓ₀, the level annotating the type, akin to rule SDC-Return. The second component b is checked at the current level.

$$\frac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \quad \Omega, x :^{\ell_0 \vee \ell} A, y :^{\ell} B \vdash c :^{\ell} C\{(x^{\ell_0}, y)/z\} \quad \Omega, z :^{\top} (\Sigma x{:}^{\ell_0} A.B) \vdash C :^{\top} s}{\Omega \vdash \textbf{let } (x^{\ell_0}, y) = a\ \textbf{in}\ c :^{\ell} C\{a/z\}}\ \text{DCT-LetPair}$$

The rule DCT-LetPair eliminates pairs using dependently-typed pattern matching. The pattern variables x and y are introduced into the context while checking the body c. Akin to rule SDC-Bind, the level of the first pattern variable, x, is raised by ℓ₀. The result type C is refined by the pattern match, informing the type system that the pattern (x^ℓ₀, y) is equal to the scrutinee a.

Because of this refinement in the result type, we can define the projection operations through pattern matching. In particular, the first projection π₁^ℓ₀ a := let (x^ℓ₀, y) = a in x, while the second projection π₂^ℓ₀ a := let (x^ℓ₀, y) = a in y. These projections can be type checked according to the following derived rules:

$$\frac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \quad \ell_0 \le \ell}{\Omega \vdash \pi_1^{\ell_0}\, a :^{\ell} A}\ \text{DCT-Proj1}
\qquad
\frac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \quad \ell_0 \le \ell}{\Omega \vdash \pi_2^{\ell_0}\, a :^{\ell} B\{\pi_1^{\ell_0}\, a/x\}}\ \text{DCT-Proj2}$$

Note that the derived rule DCT-Proj1 limits access to the first component through the premise ℓ₀ ≤ ℓ, akin to rule Sealing-Unseal. This condition makes sense because it aligns the observability of the first component of the pair with the label on the Σ-type.

#### 4.3 Embedding SDC into DDC⊤

Here, we show how to embed SDC into DDC⊤.

We define a translation function ⟦·⟧ that takes the types and terms of SDC to terms of DDC⊤. For types, the translation is defined as: ⟦A → B⟧ := ^⊥⟦A⟧ → ⟦B⟧, ⟦A × B⟧ := ^⊥⟦A⟧ × ⟦B⟧, and ⟦T^ℓ A⟧ := Σx:^ℓ⟦A⟧.Unit. For terms, the translation is straightforward except for the following cases: ⟦η^ℓ a⟧ := (⟦a⟧^ℓ, unit) and ⟦bind^ℓ x = a in b⟧ := let (x^ℓ, y) = ⟦a⟧ in ⟦b⟧, where y is a fresh variable. By lifting the translation to contexts, we show that the translation preserves typing.

Theorem 5 (Trans. Preserves Typing). If Ω ⊢ a :^ℓ A in SDC, then ⟦Ω⟧ ⊢ ⟦a⟧ :^ℓ ⟦A⟧ in DDC⊤.

Next, assuming a standard call-by-name small-step semantics for both languages, we can provide a bisimulation.

Theorem 6 (Forward Simulation). If a ⇝ a′ in SDC, then ⟦a⟧ ⇝ ⟦a′⟧ in DDC⊤.

Theorem 7 (Backward Simulation). For any term a in SDC, if ⟦a⟧ ⇝ b in DDC⊤, then there exists a′ in SDC such that b = ⟦a′⟧ and a ⇝ a′.

Hence, SDC can be embedded into DDC⊤, preserving meaning. As such, DDC⊤ can analyze dependencies in general.

#### 4.4 Run-time Irrelevance

Next, we show how to track run-time irrelevance using DDC⊤. We use the two-element lattice {⊥, ⊤} with ⊥ < ⊤, where ⊥ and ⊤ correspond to run-time relevant and run-time irrelevant terms respectively. So we need to erase terms marked with ⊤. We first define a more general indexed erasure function ⌊·⌋ℓ on DDC⊤ terms that erases everything an ℓ-user should not be able to see. The function is defined by straightforward recursion in most cases; for example, ⌊x⌋ℓ := x, ⌊Πx:^ℓ₀ A.B⌋ℓ := Πx:^ℓ₀ ⌊A⌋ℓ.⌊B⌋ℓ, and ⌊λ^ℓ₀ x.b⌋ℓ := λ^ℓ₀ x.⌊b⌋ℓ. The interesting cases are applications and pairs: ⌊b a^ℓ₀⌋ℓ := ⌊b⌋ℓ (⌊a⌋ℓ)^ℓ₀ if ℓ₀ ≤ ℓ, and ⌊b⌋ℓ unit^ℓ₀ otherwise; ⌊(a^ℓ₀, b)⌋ℓ := ((⌊a⌋ℓ)^ℓ₀, ⌊b⌋ℓ) if ℓ₀ ≤ ℓ, and (unit^ℓ₀, ⌊b⌋ℓ) otherwise. These cases are defined this way because, if ¬(ℓ₀ ≤ ℓ), an ℓ-user should not be able to see a, so we replace it with unit.
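
The interesting cases translate almost directly into code. The following Haskell sketch (our own toy rendering, not the Coq definition) implements ⌊·⌋ℓ on a small graded syntax, replacing invisible positions with unit:

```
-- toy graded terms for illustrating indexed erasure
data Level = Bot | Mid | Top deriving (Eq, Ord, Show)

data Tm
  = Var String
  | Lam Level String Tm     -- λ^ℓ0 x. b
  | App Tm Tm Level         -- b a^ℓ0
  | Pair Level Tm Tm        -- (a^ℓ0, b)
  | Unit
  deriving Show

-- erase l t: replace every subterm an l-user may not see with Unit
erase :: Level -> Tm -> Tm
erase _ (Var x)        = Var x
erase l (Lam l0 x b)   = Lam l0 x (erase l b)
erase l (App b a l0)
  | l0 <= l            = App (erase l b) (erase l a) l0
  | otherwise          = App (erase l b) Unit l0
erase l (Pair l0 a b)
  | l0 <= l            = Pair l0 (erase l a) (erase l b)
  | otherwise          = Pair l0 Unit (erase l b)
erase _ Unit           = Unit
```

Instantiated at ℓ = ⊥ over the two-element lattice, this sketch drops exactly the ⊤-marked arguments and first components.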

This erasure function is closely related to the indistinguishability relation we saw in Section 3.3, extended to a dependent setting. (This definition appears in the extended version of this paper [12].) The erasure function maps each equivalence class formed by the indistinguishability relation to a canonical element. We have verified the following lemmas using the Coq proof assistant; footnotes give the file and lemma name of the corresponding mechanized results.

Lemma 8 (Canonical Element<sup>12</sup>). If Φ ⊢ a₁ ∼ℓ a₂, then ⌊a₁⌋ℓ = ⌊a₂⌋ℓ.

Further, a well-graded term and its erasure are indistinguishable.

Lemma 9 (Erasure Indistinguishability<sup>13</sup>). If Φ ⊢ a : ℓ, then Φ ⊢ a ∼ℓ ⌊a⌋ℓ.

Next, we can show that erased terms simulate the reduction behavior of their unerased counterparts.

Lemma 10 (Erasure Simulation<sup>14</sup>). If Φ ⊢ a : ℓ and a ⇝ b, then ⌊a⌋ℓ ⇝ ⌊b⌋ℓ. Moreover, if a is a value, then so is ⌊a⌋ℓ.

This lemma follows from Lemma 9 and the non-interference theorem (Theorem 1). Therefore, it is safe to erase, before run time, all sub-terms marked with ⊤.

This shows that we can correctly analyze run-time irrelevance using DDC⊤. However, supporting compile-time irrelevance requires some changes to the system. We take them up in the next section.

<sup>12</sup> erasure.v:Canonical element. <sup>13</sup> erasure.v:Erasure Indistinguishability

<sup>14</sup> erasure.v:Step erasure,Value erasure

### 5 DDC: Run-time and Compile-time Irrelevance

#### 5.1 Towards Compile-time Irrelevance

Recall that terms that may be safely ignored while checking for type equality are said to be compile-time irrelevant. In DDC⊤, the conversion rule DCT-Conv checks for type equality at ⊤.

$$\frac{\Omega \vdash a :^{\ell} A \qquad |\Omega| \vdash A \equiv_{\top} B \qquad \Omega \vdash B :^{\top} s}{\Omega \vdash a :^{\ell} B}\ \text{DCT-Conv}$$

The equality judgment used in this rule, |Ω| ⊢ A ≡⊤ B, is an instantiation of the general judgment Φ ⊢ a ≡ℓ b, which is the closure of the indistinguishability relation at ℓ under β-equivalence. When ℓ is ⊤, indistinguishability is just identity. As such, the equality relation at ⊤ degenerates to standard β-equivalence. So rule DCT-Conv does not ignore any part of the terms when checking for type equality.

To support compile-time irrelevance, then, we need the conversion rule to use equality at some grade strictly less than ⊤, so that ⊤-marked terms may be ignored. For the irrelevance lattice Lᴵ, the level C can be used for this purpose. For any other lattice L, we can add two new elements, C and ⊤, above every other element, such that L < C < ⊤, and thereafter use level C for this purpose. So, for any lattice, we can support compile-time irrelevance by equating types at C.

Referring back to the examples in Section 2.2, note that for phantom : Nat⊤ -> Type, we have phantom 0⊤ ≡C phantom 1⊤. With this equality, we can type check idp : phantom 0⊤ -> phantom 1⊤ = λ x. x, even without knowing the definition of phantom.

Now, observe that in rule DCT-Conv, the new type B is also checked at ⊤. If we want to check for type equality at C, we need to make sure that the types themselves are checked at C. However, checking types at C would rule out variables marked ⊤ from appearing in them. This would prevent us from expressing many examples, including the polymorphic identity function.

To move out of this impasse, we take inspiration from EPTS [20,21]. The key idea, adapted from [20], is to use a judgment of the form C ∧ Ω ⊢ a :^C A instead of a judgment of the form Ω ⊢ a :^⊤ A. The operation C ∧ Ω takes the pointwise meet of the labels in the context Ω with C, reducing any label marked ⊤ to C and thereby making it available for use in a C-expression. This operation is called truncation. Other systems use similar mechanisms for tracking irrelevance; for example, the "context resurrection" operation of [27] and [3] makes proof variables and irrelevant variables in the context available for use, much as C ∧ Ω makes ⊤-marked variables in the context available for use.
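
Truncation itself is a pointwise operation on contexts. The sketch below (our own rendering, over the three-point lattice, where meet coincides with min) shows it in code:

```
data Level = Bot | C | Top deriving (Eq, Ord, Show)

-- grade-only view of a context Ω: variable names with their labels
type Ctx = [(String, Level)]

-- truncation, written C ∧ Ω in the text: pointwise meet with C
-- (meet is min in this totally ordered three-point lattice)
truncateCtx :: Level -> Ctx -> Ctx
truncateCtx c = map (\(x, l) -> (x, min c l))

-- truncateCtx C [("x", Top), ("y", Bot)] == [("x", C), ("y", Bot)]
```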

#### 5.2 DDC: Basics

Next, we design a general dependency-analyzing calculus, DDC, that takes advantage of compile-time irrelevance in its type system. DDC is a generalization of DDC⊤ and EPTS• [20]. When C equals ⊤, DDC degenerates to DDC⊤, which does not use compile-time irrelevance. When C equals ⊥, DDC degenerates to EPTS•, which identifies compile-time and run-time irrelevance. A crucial distinction between EPTS• and DDC is that while the former is tied to a two-element lattice, the latter can use any lattice. Thus, not only can DDC distinguish between run-time and compile-time irrelevance, it can also simultaneously track other dependencies.


Fig. 5. Dependent type system with compile-time irrelevance (core rules). The figure defines the typing judgment Ω ⊢ a :^ℓ A along with the truncated-at-top judgment Ω ⊩ a :^ℓ A.

The core typing rules of DDC appear in Figure 5. Compared to DDC⊤, this type system maintains the invariant that for any Ω ⊢ a :^ℓ A, we have ℓ ≤ C. To ensure that this is the case, rules T-Type and T-Var include this precondition. This restriction means that we cannot derive any term at ⊤ in DDC. We get around this restriction by deriving C ∧ Ω ⊢ a :^C A in place of Ω ⊢ a :^⊤ A.

Wherever DDC⊤ uses ⊤ as the observer level on a typing judgment, DDC uses truncation and level C instead. If DDC⊤ uses some grade other than ⊤ as the observer level, DDC leaves the derivation as is. So a DDC⊤ judgment Ω ⊢ a :^ℓ A is replaced with a truncated-at-top judgment Ω ⊩ a :^ℓ A, which can be read as follows: if ℓ = ⊤, use the truncated version C ∧ Ω ⊢ a :^C A; otherwise, use the normal version Ω ⊢ a :^ℓ A, as shown in Figure 5. In the typing rules, uses of this new judgment are highlighted in gray to emphasize the modification with respect to DDC⊤.

#### 5.3 Π-types

Rule T-Pi is unchanged. The lambda rule T-AbsC now checks the type at C after truncating the variables in the context to C. The application rule T-AppC checks the argument using the truncated-at-top judgment. Note that if ℓ₀ = ⊤, the term a can depend upon any variable in Ω. Such a dependence is allowed since information can always flow from relevant to irrelevant contexts.

To see how irrelevance works in this system, let's consider the definition and use of the polymorphic identity function.

id : Π x:⊤Type. x -> x
id = λ⊤x. λ y. y

In DDC⊤, the type Π x:⊤Type. x -> x is checked at ⊤. Here, however, it must be checked at level C, which requires the premise x :^C Type ⊢ x -> x :^C Type. Note that if we used the same grade for the bound variable x in rules T-Pi and T-AbsC, we would be in trouble, because variable x is compile-time relevant while we check the type, even though it is irrelevant in the term.<sup>15</sup>

Finally, observe that rule T-ConvC uses definitional equality at C instead of ⊤, and that the new type is checked after truncation.

#### 5.4 Σ-types


We also need to modify the typing rules for Σ-types accordingly. In particular, when we create a pair, we check the first component using the truncated-at-top judgment. This is akin to how we check the argument in rule T-AppC. Note that if ℓ₀ = ⊤, the first component a is compile-time irrelevant. In such a situation, we cannot type check the second projection, since it requires the first projection, as we see in the derived<sup>16</sup> projection rules below. So pairs having type Σx:^⊤A.B can only be eliminated via pattern matching if B mentions x. However, pairs having type Σx:^C A.B can be eliminated via projections.

For example, for an output of the filter function, ys : Σm:^C Nat. Vec m Bool, we have π₁ ys :^C Nat and π₂ ys : Vec (π₁ ys) Bool. Note that (π₁ ys) is visible at C and is used in the type of (π₂ ys). We can substitute (π₁ ys) for m in (Vec m Bool) because m :^C Nat ⊢ Vec m Bool :^C Type. However, (π₁ ys) cannot be used at ⊥, so it can still be erased at run time.

$$\frac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \quad \ell_0 \le \ell}{\Omega \vdash \pi_1^{\ell_0}\, a :^{\ell} A}\ \text{T-Proj1C}
\qquad
\frac{\Omega \vdash a :^{\ell} \Sigma x{:}^{\ell_0} A.B \quad \ell_0 \le \ell}{\Omega \vdash \pi_2^{\ell_0}\, a :^{\ell} B\{\pi_1^{\ell_0}\, a/x\}}\ \text{T-Proj2C}$$

<sup>15</sup> This is why we fuse the graded modality with the dependent types. If they were separated, and we had to bind here, it would be a problem, since a dependent function and its type have different restrictions vis-à-vis the bound variable.

<sup>16</sup> strong exists.v:T wproj1,T wproj2

#### 5.5 Non-interference

DDC satisfies a non-interference theorem analogous to the one presented for SDC, using suitable definitions of the grading relation, written Φ ⊢ a : ℓ, and indexed indistinguishability, written Φ ⊢ b₁ ∼ℓ b₂. The complete definitions of these judgments appear in the extended version of this paper [12].

Lemma 11 (Typing implies grading<sup>17</sup>). If Ω ⊢ a :^ℓ A, then |Ω| ⊢ a : ℓ.

Lemma 12 (Equivalence<sup>18</sup>). Indexed indistinguishability at ℓ is an equivalence relation on well-graded terms at ℓ.

Lemma 13 (Indistinguishability under substitution<sup>19</sup>). If Φ, x : ℓ ⊢ b₁ ∼k b₂ and $\Phi \vdash^{k}_{\ell} a_1 \sim a_2$, then Φ ⊢ b₁{a₁/x} ∼k b₂{a₂/x}.

Theorem 8 (Non-interference for DDC<sup>20</sup>). If Φ ⊢ a₁ ∼k a₁′ and a₁ ⇝ a₂, then there exists some a₂′ such that a₁′ ⇝ a₂′ and Φ ⊢ a₂ ∼k a₂′.

#### 5.6 Consistency of Equality

The equality relation of DDC incorporates compile-time irrelevance. To show that the type system is sound, we need to show that the equality relation is consistent. Consistency of definitional equality means that there is no derivation that equates two types having different head forms. For example, it should not equate Nat with Unit.

Note that if ⊤ inputs can interfere with C outputs, the equality relation cannot be consistent. To see why, let x :^⊤ A ⊢ b :^C Bool and, for a₁, a₂ : A, let the terms b{a₁/x} and b{a₂/x} reduce to True and False respectively. Now, (λ^⊤x. if b then Nat else Unit) a₁^⊤ ≡C (λ^⊤x. if b then Nat else Unit) a₂^⊤. But then, by β-equivalence, Nat ≡C Unit.

To prove consistency, we construct a standard parallel reduction relation and show that this relation is confluent. Thereafter, we prove that if two terms

<sup>17</sup> typing.v:Typing Grade <sup>18</sup> geq.v:GEq refl,GEq symmetry,GEq trans

<sup>19</sup> subst.v:CEq GEq equality substitution <sup>20</sup> geq.v:CEq GEq respects Step

are definitionally equal at ℓ, then they are joinable at ℓ, meaning that they reduce, through parallel reduction, to two terms that are indistinguishable at ℓ. Next, we show that joinability at ℓ implies consistency. Therefore, we conclude that for any ℓ, the equality relation at ℓ is consistent. This implies that the equality relation at C, which ignores sub-terms marked with ⊤, is sound. Hence, DDC tracks compile-time irrelevance correctly. Note that DDC can track run-time irrelevance in the same way as DDC⊤.

We formally state consistency in terms of head forms, i.e., syntactic forms that correspond to types, such as sorts s, Unit, Πx:^ℓ A.B, etc.

Theorem 9 (Consistency<sup>21</sup>). If Φ ⊢ a ≡ℓ b, and a and b are both head forms, then they have the same head form.

#### 5.7 Soundness Theorem

DDC is type sound and we have checked this and other results using the Coq proof assistant. Below, we give an overview of the important lemmas in this development.

The properties below are stated for DDC, but they also apply to DDC⊤, since DDC degenerates to DDC⊤ whenever C = ⊤. First, we list the properties related to grading that hold for all judgments: indexed indistinguishability, definitional equality, and typing. (We state the lemmas only for typing; their counterparts are analogous.) These lemmas are similar to their simply-typed counterparts in Section 3.2.

Lemma 14 (Narrowing<sup>22</sup>). If Ω ⊢ a :^ℓ A and Ω′ ≤ Ω, then Ω′ ⊢ a :^ℓ A.

Lemma 15 (Weakening<sup>23</sup>). If Ω₁, Ω₂ ⊢ a :^ℓ A, then Ω₁, Ω, Ω₂ ⊢ a :^ℓ A.

Lemma 16 (Restricted upgrading<sup>24</sup>). If Ω₁, x :^ℓ₀ A, Ω₂ ⊢ b :^ℓ B and ℓ₁ ≤ ℓ, then Ω₁, x :^ℓ₀∨ℓ₁ A, Ω₂ ⊢ b :^ℓ B.

Next, we list some properties that are specific to the typing judgment. For any typing judgment in DDC, the observer grade ℓ is at most C. Further, the observer grade of any judgment can be raised up to C.

Lemma 17 (Bounded by C<sup>25</sup>). If Ω ⊢ a :^ℓ A, then ℓ ≤ C.

Lemma 18 (Subsumption<sup>26</sup>). If Ω ⊢ a :^ℓ A and ℓ ≤ k and k ≤ C, then Ω ⊢ a :^k A.

Note that we do not require contexts to be well-formed in the typing judgment; instead, we add context well-formedness constraints, as required, to our lemmas. The following lemmas hold for well-formed contexts. A context Ω is well-formed, written ⊢ Ω, iff for any assumption x :^ℓ A in Ω, we have Ω′ ⊩ A :^⊤ s, where Ω′ is the prefix of Ω that appears before the assumption.

<sup>21</sup> consist.v:DefEq Consistent <sup>22</sup> narrowing.v:Typing narrowing

<sup>23</sup> weakening.v:Typing weakening <sup>24</sup> pumping.v:Typing pumping

<sup>25</sup> pumping.v:Typing leq C <sup>26</sup> typing.v:Typing subsumption

Lemma 19 (Substitution<sup>27</sup>). If Ω₁, x :^ℓ₀ A, Ω₂ ⊢ b :^ℓ B and ⊢ Ω₁ and Ω₁ ⊩ a :^ℓ₀ A, then Ω₁, Ω₂{a/x} ⊢ b{a/x} :^ℓ B{a/x}.

Next, if a term is well-typed in our system, the type itself is also well-typed.

Lemma 20 (Regularity<sup>28</sup>). If Ω ⊢ a :^ℓ A and ⊢ Ω, then Ω ⊩ A :^⊤ s.

Finally, we have the two main lemmas proving type soundness.

Lemma 21 (Preservation<sup>29</sup>). If Ω ⊢ a :^ℓ A and ⊢ Ω and a ⇝ a′, then Ω ⊢ a′ :^ℓ A.

Lemma 22 (Progress<sup>30</sup>). If ∅ ⊢ a :^ℓ A, then either a is a value or there exists some a′ such that a ⇝ a′.

Hence, DDC is type sound. We have seen earlier that it tracks run-time and compile-time irrelevance correctly.

DDC is parameterized by a generic pure type system and a generic lattice. When the parameterizing pure type system is strongly normalizing, such as the Calculus of Constructions, type-checking is decidable. In the next section, we provide a demonstration.

### 6 Type Checking

Not every instance of DDC, viewed as a pure type system, admits decidable type checking. For example, in the presence of the type:type axiom, the system includes non-terminating computations via Girard's paradox. As a result, we cannot decide equality in that system, so type checking is undecidable. However, if the sorts, axioms, and rules are chosen such that the language is strongly normalizing, then we can define a decidable type checking algorithm. This algorithm is standard, but it relies on a decision procedure for the equality judgment.

Our consistency proof, described in Section 5.6, gives us a start. This proof uses an auxiliary binary relation called joinability, which holds when two terms can use multiple steps of parallel reduction to reach two simpler terms that differ only in their unobservable components. Joinability and definitional equality induce the same relation on DDC terms: two DDC terms are definitionally equal if and only if they are joinable<sup>31</sup>, which means that a decision procedure based on joinability is sound and complete for DDC's labeled definition of equivalence.

Therefore, the decidability of type checking reduces to showing strong normalization. If we select the sorts, axioms, and rules of DDC to match those of the Calculus of Constructions [5], we believe that this result holds, but we leave a direct proof for future work. However, by translating this instance of DDC to ICC*, we can show that a sublanguage of this instance is strongly normalizing.

<sup>27</sup> typing.v:Typing substitution CTyping <sup>28</sup> typing.v:Typing regularity

<sup>29</sup> typing.v:Typing preservation <sup>30</sup> progress.v:Typing progress

<sup>31</sup> consist.v:DefEq Joins,Joins DefEq

ICC* [6] is a version of the Implicit Calculus of Constructions with annotations that support decidable type checking. Because it includes only (relevant and irrelevant) Π-types, we must restrict our attention to the corresponding fragment of DDC.

We define the following translation, written $\widetilde{\cdot}$, that converts DDC terms to ICC* terms. The key parts of this translation map arguments labeled C and below to relevant arguments, and those labeled greater than C, such as ⊤, to irrelevant arguments.<sup>32</sup>

$$\widetilde{x} = x \qquad \widetilde{s} = s \qquad \widetilde{\Pi x{:}^{\ell} A.B} = \begin{cases} \Pi(x : \widetilde{A}).\,\widetilde{B} & \text{if } \ell \le C\\ \Pi[x : \widetilde{A}].\,\widetilde{B} & \text{otherwise} \end{cases}$$

$$\widetilde{\lambda x{:}^{\ell} A.\,b} = \begin{cases} \lambda(x : \widetilde{A}).\,\widetilde{b} & \text{if } \ell \le C\\ \lambda[x : \widetilde{A}].\,\widetilde{b} & \text{otherwise} \end{cases} \qquad \widetilde{b\ a^{\ell}} = \begin{cases} \widetilde{b}\,(\widetilde{a}) & \text{if } \ell \le C\\ \widetilde{b}\,[\widetilde{a}] & \text{otherwise} \end{cases}$$

Note that ICC* compares terms for equality after an erasure operation, written ·*, that removes all irrelevant arguments. Now, we can show that the above translation preserves definitional equality and typing. Here, $\widetilde{\Omega}$ denotes Ω with the labels at the variable bindings omitted.

Lemma 23 (Translation preservation). If Φ ⊢ A ≡C B, then $\widetilde{A}^{\ast} \cong_{\beta\eta} \widetilde{B}^{\ast}$. If Ω ⊢ a :^ℓ A, then $\widetilde{\Omega} \vdash \widetilde{a} : \widetilde{A}$.

Next, observe that, because β-reductions are preserved by the translation, any parallel reduction in DDC between terms a and b at level C, where a ≠ b, corresponds to a non-empty sequence of reduction steps $\widetilde{a} \to^{+}_{\beta\iota} \widetilde{b}$ in ICC*. That means that an infinite sequence of parallel reductions a₀, a₁, ..., where each term differs from the previous one, corresponds to an infinite sequence of reductions $\widetilde{a_0}, \widetilde{a_1}, \ldots$ in ICC*. Therefore, as all well-typed ICC* terms are strongly normalizing, we can conclude that the same holds for this instance of DDC.

Non-terminating instances of DDC. For pure type systems that are not strongly normalizing, such as the type:type language, there is an alternative approach to developing a calculus with decidable type checking, following Weirich et al. [35]. The key idea is to develop an annotated version of DDC that book-keeps additional information from typing and equality derivations. In such an annotated version, the conversion rule would include an explicit coercion annotation that witnesses the equality between the concerned types, thus avoiding the need for normalization.

<sup>32</sup> The syntax of ICC<sup>∗</sup> uses parentheses to indicate usual (relevant) arguments and square brackets to indicate arguments that are irrelevant at both run time and compile time.

### 7 Discussion and Related Work

#### 7.1 Irrelevance in Dependent Type Theories

Overall, compile-time and run-time irrelevance is a well-studied topic in the design of dependent type systems. Some systems focus only on support for run-time irrelevance [18,4,8,19,20,32]. Others focus on compile-time irrelevance [27,3]. Some systems support both, but require them to overlap [6,21,35,24]. The system of Moon et al. [23] does not require them to overlap, but their type system does not make use of compile-time irrelevance in the conversion rule.

To compare, the system DDC⊤ presented here supports run-time irrelevance only and is similar to the core language of Tejiščák [32]. However, note that DDC⊤ can track dependencies in general, while the system of [32] tracks run-time irrelevance alone. DDC, on the other hand, is the only system that we are aware of that tracks run-time and compile-time irrelevance separately and makes use of the latter in the conversion rule. Further, DDC tracks these irrelevances in the presence of strong Σ-types with erasable first components, something which, to the best of our knowledge, no prior work has been able to do.

Prior work has identified the difficulty of handling strong Σ-types with erasable first components in a setting that tracks compile-time irrelevance. Abel and Scherer [3] point out that strong irrelevant Σ-types make their theory inconsistent. Similarly, EPTS• [21] cannot define the projections for pairs having such Σ-types. The reason is that EPTS• is hard-wired to a two-element lattice which identifies compile-time and run-time irrelevance; as such, projections from such pairs lead to type unsoundness. For example, considering the first components to be run-time irrelevant, the pairs (Int, unit) and (Bool, unit) are run-time equivalent. Since EPTS• identifies run-time and compile-time irrelevance, these pairs are also compile-time equivalent. Then, taking the first projections of these pairs, one ends up with Int and Bool being compile-time equivalent. We resolve this problem by distinguishing between run-time and compile-time irrelevance, thus requiring a lattice with three elements.

Next, we compare our work with the existing literature with respect to the equality relation. We analyze compile-time irrelevance to enable the equality relation to ignore unnecessary sub-terms. However, since our equality relation is untyped, we cannot include type-dependent rules in our system, such as η-equivalence for the Unit type. Several prior works on irrelevance [19,6,21,32] use an untyped equality relation. Some prior work, such as [27,3], does consider compile-time irrelevance in the context of type-directed equality. But such systems require irrelevant arguments to functions to appear only irrelevantly in the codomain type of the function, thus ruling out several examples, including the polymorphic identity function.

#### 7.2 Quantitative Type Systems

Our work is closely related to quantitative type systems [26,15,9,18,4,25,2,10,23]. Such systems provide a fine-grained accounting of coeffects, viewed as resources, for example, variable usage, linearity, liveness, etc. A typical judgment from a quantitative type system [10] may look like:

x :¹ Bool, y :¹ Int, z :⁰ Bool ⊢ if x then y + 1 else y − 1 :¹ Int

The variable x is used once in the condition, the variable y is used once in each of the branches while the variable z is not used at all. As such, they are marked with these quantities in the context.

This form of judgment is very similar to our typing judgments, with quantities appearing in place of levels. However, there is a crucial difference: to the right of the turnstile, any level may appear in our judgments, but only the quantity 1 can appear in the typing judgments of quantitative systems. A quantitative system that allows an arbitrary quantity to the right of the turnstile is not closed under substitution [18,4]. As such, quantitative systems are tied to a fixed reference, while our systems can view programs from different reference levels. This difference in form results from the difference in purpose: quantitative systems count while our systems compare. Counting requires a fixed standard or reference, whereas comparison does not. Applications that require counting, like linearity tracking, are handled well by quantitative systems, while applications that require comparison, like ensuring secure information flow, are handled well by systems of our kind.

From a type-theoretic standpoint, in general, quantitative systems cannot eliminate pairs through projections. This is so because there is no general way to split the resources of the context that type-checks a pair. Eliminating pairs through projections is straightforward in our systems because the grade on the typing judgment can control where the projections are visible.
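Schematically, the tension can be seen in the following illustrative rules (our own sketch, not taken from any particular quantitative system):

$$\frac{\Gamma_1 \vdash v_1 : A \qquad \Gamma_2 \vdash v_2 : B}{\Gamma_1 + \Gamma_2 \vdash \langle v_1, v_2 \rangle : A \otimes B} \qquad\qquad \frac{\Gamma \vdash v : A \otimes B}{\Gamma \vdash \pi_1\, v : A}\ (?)$$

The candidate projection rule (?) would have to silently discard the resources consumed in typing the second component, which is unsound when grades count usage; when grades merely compare visibility levels, as in our systems, nothing needs to be discarded.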

#### 7.3 Dependency Analysis and Dependent Type Theory

Dependency analysis and dependent type theories have come together in some existing work.

Like our system, Prost [28] extends the λ-cube so that it may track dependencies. However, unlike our system, this work uses sorts to track dependencies. It is inspired by the distinction between sorts in the Calculus of Constructions where computationally relevant and irrelevant terms live in sorts Set and Prop respectively. As Mishra-Linger [21] points out, such an approach ties up two distinct language features, sorts and dependency analysis, which can be treated in a more orthogonal manner.

Bernardy and Guilhem's type theory in color [7] is closely related to our work. This type theory uses colors to erase terms while we use grades. Colors and grades both form a lattice structure, and their usage in the respective type systems is quite similar. However, in type theory in color, internalized parametricity is used to reason about erasure; so it is important that the type theory be logically consistent. Our work does not rely on the normalizing nature of the theory; we take a direct route to analyzing erasure.

Like our work, Lourenço and Caires [17] track information flow in a dependent type system. But Lourenço and Caires [17] focus on more imperative features, like the modeling of state, while we focus on irrelevance. A distinguishing feature of their system is that they allow security labels to depend upon terms, something that we do not attempt here.

### 8 Conclusion

We started with the aim of designing a dependent calculus that can analyze dependencies in general, and run-time and compile-time irrelevance in particular. Towards this end, we designed a simple dependency calculus, SDC, and then extended it to two dependent calculi, DDC<sup>⊤</sup> and DDC. DDC<sup>⊤</sup> can track run-time irrelevance, while DDC can track both run-time and compile-time irrelevance, along with other dependencies.

In the future, we would like to explore how irrelevance interacts with other dependencies. We also want to explore whether our systems can be integrated with existing graded type systems, especially quantitative type systems. Yet another interesting direction for research is how they compare with graded effect systems.

Our work lies in the intersection of dependency analysis and irrelevance tracking in dependent type systems. Both these areas have rich literature of their own. We hope that the connections established in this paper will be mutually beneficial and help in the future exploration of dependencies and irrelevance in dependent type systems.

### 9 Acknowledgments

The first two authors were supported by the National Science Foundation under Grant Nos. 1703835 and 1521539. The second author was supported by the National Science Foundation under Grant No. 2104535.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Polarized Subtyping

Zeeshan Lakhani<sup>1</sup> (B) , Ankush Das<sup>3</sup> , Henry DeYoung<sup>1</sup> , Andreia Mordido<sup>2</sup> , and Frank Pfenning<sup>1</sup>

<sup>1</sup> Carnegie Mellon University, Pittsburgh, PA, USA {zlakhani,hdeyoung,fp}@cs.cmu.edu

<sup>2</sup> LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal

afmordido@fc.ul.pt

<sup>3</sup> Amazon, Cupertino, CA, USA<sup>⋆⋆</sup>
daankus@amazon.com

Abstract. Polarization of types in call-by-push-value naturally leads to the separation of inductively defined observable values (classified by positive types), and coinductively defined computations (classified by negative types), with adjoint modalities mediating between them. Taking this separation as a starting point, we develop a semantic characterization of typing with step indexing to capture observation depth of recursive computations. This semantics justifies a rich set of subtyping rules for an equirecursive variant of call-by-push-value, including variant and lazy records. We further present a bidirectional syntactic typing system for both values and computations that elegantly and pragmatically circumvents difficulties of type inference in the presence of width and depth subtyping for variant and lazy records. We demonstrate the flexibility of our system by systematically deriving related systems of subtyping for (a) isorecursive types, (b) call-by-name, and (c) call-by-value, all using a structural rather than a nominal interpretation of types.

Keywords: Call-by-push-value · Semantic Typing · Subtyping

### 1 Introduction

Subtyping is an important concept in programming languages because it simultaneously allows more programs to be typed and more precise properties of programs to be expressed as types. The interaction of subtyping with parametric polymorphism and recursive types is complex and, despite a lot of progress and research, not yet fully understood.

In this paper we study the interaction of subtyping with equirecursive types in call-by-push-value [53, 54], which separates the language of types into positive and negative layers. This polarization elegantly captures that positive types classifying observable values are inductive, while negative types classifying (possibly recursive) computations are coinductive. It lends itself to a particularly simple semantic definition of typing using a mixed induction/coinduction [9, 13, 22]. From this definition, we can immediately derive a form of semantic subtyping [15, 35, 36].

<sup>⋆⋆</sup> Work performed prior to joining Amazon.

Concretely, we realize the mixed induction/coinduction via step-indexing and carry out our metatheory in Brotherston and Simpson's system CLKID<sup>ω</sup> of circular proofs [14]. This includes a novel proof that syntactic versions of typing and subtyping are sound with respect to our semantic definitions. While we also conjecture that subtyping is precise (in the sense of [55]), we postpone this more syntactic property to future work.

Because our foundation is call-by-push-value, a paradigm that synthesizes call-by-name and call-by-value based on the logical principle of polarization, we obtain several additional results in relatively straightforward ways. For example, both width and depth subtyping for variant and lazy records are naturally included. Furthermore, following Levy's interpretation of call-by-value and call-by-name functional languages into call-by-push-value, we extract subtyping relations and algorithms for these languages and prove them sound and complete. We also note that we can directly interpret the isorecursive types in Levy's original formulation of call-by-push-value [53].

We further provide a systematic notion of bidirectional typing that avoids some complexities that arise in a structural type system with variant and lazy records. The resulting decision procedure for typing is quite precise and suggests clear locations for noting failure of typechecking. The combination of equirecursive call-by-push-value with bidirectional typing achieves some of the goals of refinement types [24, 34], which fit a structural system inside a generative type language. Here we have considerably more freedom and less redundancy. However, we do not yet treat intersection types or polymorphism.

We summarize our main contributions:


These are followed by a discussion of related work and a conclusion. Additional material and proofs are provided in an appendix of the extended paper version [49].

### 2 Equirecursive Call-by-Push-Value

Call-by-push-value [53, 54] is characterized by a separation of types into positive τ⁺ and negative σ⁻ layers, with shift modalities going back and forth between them. The intuition is that positive types classify observable values v while negative types classify computations e.

$$\begin{aligned} \tau^+, \sigma^+ &::= \tau\_1^+ \otimes \tau\_2^+ \mid \mathbf{1} \mid \oplus \{\ell \colon \tau\_\ell^+\}\_{\ell \in L} \mid \downarrow \sigma^- \mid t^+ \\\ \sigma^-, \tau^- &::= \tau^+ \to \sigma^- \mid \otimes \{\ell \colon \sigma\_\ell^-\}\_{\ell \in L} \mid \uparrow \tau^+ \mid s^- \end{aligned}$$

The usual binary product τ × σ splits into two: τ⁺ ⊗ σ⁺ for eager, observable products inhabited by pairs of values, and $\&\{\ell : \sigma_\ell^-\}_{\ell \in L}$ for lazy, unobservable records with a finite set L of fields we can project out. Binary sums are also generalized to variant record types $\oplus\{\ell : \tau_\ell^+\}_{\ell \in L}$.<sup>4</sup> These are not just a programming convenience but allow for richer subtyping: lazy and variant record types support both width and depth subtyping, whereas the usual binary products and sums support only the latter. For example, width subtyping means that ⊕{false : 1} is a subtype of bool⁺ = ⊕{false : 1, true : 1}, while 1 would not be a subtype of the usual binary sum 1 + 1. Neither is 1 a subtype of bool⁺, demonstrating the utility of variant record types with one label, such as ⊕{false : 1}. Similar examples exist for lazy record types. This way, we recover some of the benefits of refinement types without the syntactic burden of a distinct refinement layer.

The shift ↓σ⁻ is inhabited by an unevaluated computation of type σ⁻ (a "thunk"). Conversely, the shift ↑τ⁺ includes a value as a trivial computation (a "return"). Levy [53] writes U B instead of ↓σ⁻ and F A instead of ↑τ⁺.

Finally, we model recursive types not by explicit constructors µα⁺. τ⁺ and να⁻. σ⁻ but by type names t⁺ and s⁻ which are defined in a global signature Σ. They may mutually refer to each other. We treat these as equirecursive (see Section 3) and we require them to be contractive, which means the right-hand side of a type definition cannot itself be a type name. Since we would like to directly observe the values of positive types, the definitions of type names t⁺ = τ⁺ are inductive. This allows inductive reasoning about values returned by computations. On the other hand, negative type definitions s⁻ = σ⁻ are recursive rather than coinductive in the usual sense, which would require, for example, stream computations to be productive. Because we do not wish to restrict recursive computations to those that are productive in this sense, they are "productive" only in the sense that they satisfy a standard progress theorem.

Next, we come to the syntax for values v of a positive type and computations e of a negative type. Variables x always stand for values and therefore have a positive type. We use j to stand for labels, naming fields of variant records or lazy records, where j · v injects the value v into a sum with the alternative labeled j, and e.j projects the field j out of a lazy record e. When we quantify over a (always finite) set of labels we usually write ℓ as a metavariable for the labels.

$$v ::= x \mid \langle v_1, v_2 \rangle \mid \langle\rangle \mid j \cdot v \mid \mathtt{thunk}\ e$$

$$\begin{aligned} e ::= {} & \lambda x.\, e \mid e\; v \mid \{\ell = e_\ell\}_{\ell \in L} \mid e.j \mid \mathtt{return}\ v \mid \mathtt{let\ return}\ x = e_1\ \mathtt{in}\ e_2 \mid f \\ & \mid \mathtt{match}\ v\ (\langle x, y\rangle \Rightarrow e) \mid \mathtt{match}\ v\ (\langle\rangle \Rightarrow e) \mid \mathtt{match}\ v\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \mid \mathtt{force}\ v \\ \Sigma ::= {} & \cdot \mid \Sigma,\, t^+ = \tau^+ \mid \Sigma,\, s^- = \sigma^- \mid \Sigma,\, f : \sigma^- = e \end{aligned}$$

<sup>4</sup> We borrow the notation from linear logic even though no linearity is implied.

In order to represent recursion, we use equations f = e in the signature, where f is a defined expression name, which we distinguish from variables, and all equations can mutually reference each other. An alternative would have been explicit fixed point expressions fix f. e, but this mildly complicates both typing and mutual recursion. Also, it seems more elegant to represent all forms of recursion at the level of types and expressions in the same manner. We also choose to fix a type for each expression name in a signature. Otherwise, each occurrence of f in an expression could potentially be assigned a different type, which strays into the domain of parametric polymorphism and intersection types.

Following Levy, we do not allow names for values because this would add an undesirable notion of computation to values, and, furthermore, circular values would violate the inductive interpretation of positive types. As discussed in [53, Chapter 4], they could be added back conservatively under some conditions.

#### 2.1 Dynamics

For the operational semantics, we use a judgment e ↦ e′ defined inductively by the following rules, which may reference a global signature Σ to look up the definitions of expression names f. In contrast, values do not reduce. The dynamics of call-by-push-value are defined as follows:

$$\begin{gathered}
\frac{}{(\lambda x.\, e)\; v \mapsto [v/x]e} \qquad
\frac{e \mapsto e'}{e\; v \mapsto e'\; v} \qquad
\frac{}{\mathtt{let\ return}\ x = \mathtt{return}\ v\ \mathtt{in}\ e \mapsto [v/x]e} \\[1ex]
\frac{e_1 \mapsto e_1'}{\mathtt{let\ return}\ x = e_1\ \mathtt{in}\ e_2 \mapsto \mathtt{let\ return}\ x = e_1'\ \mathtt{in}\ e_2} \qquad
\frac{(j \in L)}{\{\ell = e_\ell\}_{\ell \in L}.j \mapsto e_j} \qquad
\frac{e \mapsto e'}{e.j \mapsto e'.j} \\[1ex]
\frac{}{\mathtt{force}\ (\mathtt{thunk}\ e) \mapsto e} \qquad
\frac{f : \sigma^- = e \in \Sigma}{f \mapsto e} \qquad
\frac{}{\mathtt{match}\ \langle v_1, v_2\rangle\ (\langle x,y\rangle \Rightarrow e) \mapsto [v_1/x][v_2/y]e} \\[1ex]
\frac{}{\mathtt{match}\ \langle\rangle\ (\langle\rangle \Rightarrow e) \mapsto e} \qquad
\frac{(j \in L)}{\mathtt{match}\ (j \cdot v)\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \mapsto [v/x_j]e_j}
\end{gathered}$$

Note that some computations, specifically λx. e, $\{\ell = e_\ell\}_{\ell \in L}$, and return v, do not reduce and may be considered values in other formulations. Here, we call them terminal computations and use the judgment e terminal to identify them.

$$\frac{}{\lambda x.\, e\ \mathsf{terminal}} \qquad \frac{}{\{\ell = e_\ell\}_{\ell \in L}\ \mathsf{terminal}} \qquad \frac{}{\mathtt{return}\ v\ \mathsf{terminal}}$$
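For instance, running these rules on a tiny program: a force applied to a thunk steps to the suspended computation, which is then terminal and can be observed through its returned value.

$$\mathtt{force}\ (\mathtt{thunk}\ (\mathtt{return}\ \langle\rangle)) \mapsto \mathtt{return}\ \langle\rangle \qquad\qquad \mathtt{return}\ \langle\rangle\ \mathsf{terminal}$$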

We will silently use simple properties of computations in the remainder of the paper which follow by straightforward induction.

#### Lemma 1 (Computation).


### 2.2 Some Sample Programs<sup>5</sup>

Example 1 (Computing with Binary Numbers). We show some example programs for binary numbers in "little endian" representation (least significant bit first) and in standard form, that is, without leading zeros.

$$\begin{aligned} \mathsf{bin}^+ &= \oplus\{\mathsf{e} : \mathbf{1},\ \mathsf{b0} : \mathsf{bin},\ \mathsf{b1} : \mathsf{bin}\} \\ \mathsf{std}^+ &= \oplus\{\mathsf{e} : \mathbf{1},\ \mathsf{b0} : \mathsf{pos},\ \mathsf{b1} : \mathsf{std}\} \\ \mathsf{pos}^+ &= \oplus\{\hphantom{\mathsf{e} : \mathbf{1},\ } \mathsf{b0} : \mathsf{pos},\ \mathsf{b1} : \mathsf{std}\} \end{aligned}$$

We expect the subtyping relationships pos ≤ std ≤ bin to hold, because every positive standard number is a standard number, and every standard number is a binary number. According to our definition and rules in Sections 3 and 5 these will hold semantically as well as syntactically.

We now show some simple definitions f : σ⁻ = e.

```
six : ↑pos = return b0 · b1 · b1 · e · ⟨⟩
```
The increment function on binary numbers implements the carry with a recursive call, which has to be wrapped in a let/return.

$$\begin{aligned} \mathsf{inc} : {} & \mathsf{std} \to \uparrow\mathsf{pos} \\ = {} & \lambda x.\, \mathtt{match}\ x\ (\ \mathsf{e} \cdot u \Rightarrow \mathtt{return}\ \mathsf{b1} \cdot \mathsf{e} \cdot u \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ } \mid\ \mathsf{b0} \cdot x' \Rightarrow \mathtt{return}\ \mathsf{b1} \cdot x' \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ } \mid\ \mathsf{b1} \cdot x' \Rightarrow \mathtt{let\ return}\ y' = \mathsf{inc}\ x'\ \mathtt{in\ return}\ \mathsf{b0} \cdot y'\ ) \end{aligned}$$

By subtyping, we also have inc : std → ↑std, for example, but not inc : bin → ↑bin since bin ≰ std. However, the definition could be separately checked against this type, which points towards an eventual need for intersection types.

The following incorrect version of the decrement function does not have the indicated desired type!

$$\begin{aligned} \mathsf{dec}_0 : {} & \mathsf{pos} \to \uparrow\mathsf{std} \qquad \%\ \text{incorrect!} \\ = {} & \lambda x.\, \mathtt{match}\ x\ (\ \mathsf{b0} \cdot x' \Rightarrow \mathtt{let\ return}\ y' = \mathsf{dec}_0\ x'\ \mathtt{in\ return}\ \mathsf{b1} \cdot y' \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ } \mid\ \mathsf{b1} \cdot x' \Rightarrow \mathtt{return}\ \mathsf{b0} \cdot x'\ ) \end{aligned}$$

The error here is quite precisely located by the bidirectional type checker (see Section 6): when we inject b0 · x′ in the second branch, it is not the case that x′ : pos as required for standard numbers! And, indeed, dec₀ (b1 · e · ⟨⟩) ↦* return b0 · e · ⟨⟩, which is not in standard form. On the other hand, the fact that a branch for e · u is missing is correct because the type pos does not have an alternative for this label.

We can fix this problem by discriminating one more level of the input (which could be made slightly more appealing by a compound syntax for nested pattern matching).

$$\begin{aligned} \mathsf{dec} : {} & \mathsf{pos} \to \uparrow\mathsf{std} \\ = {} & \lambda x.\, \mathtt{match}\ x\ (\ \mathsf{b0} \cdot x' \Rightarrow \mathtt{let\ return}\ y' = \mathsf{dec}\ x'\ \mathtt{in\ return}\ \mathsf{b1} \cdot y' \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ } \mid\ \mathsf{b1} \cdot x' \Rightarrow \mathtt{match}\ x'\ (\ \mathsf{e} \cdot u \Rightarrow \mathtt{return}\ \mathsf{e} \cdot u \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ \mid\ \mathsf{b1} \cdot x' \Rightarrow \mathtt{match}\ x'\ (\ } \mid\ \mathsf{b0} \cdot x'' \Rightarrow \mathtt{return}\ \mathsf{b0} \cdot \mathsf{b0} \cdot x'' \\ & \hphantom{\lambda x.\, \mathtt{match}\ x\ (\ \mid\ \mathsf{b1} \cdot x' \Rightarrow \mathtt{match}\ x'\ (\ } \mid\ \mathsf{b1} \cdot x'' \Rightarrow \mathtt{return}\ \mathsf{b0} \cdot \mathsf{b1} \cdot x''\ )\ ) \end{aligned}$$

<sup>5</sup> These examples and more are captured in our open access implementation artifact [50].

Example 2 (Computing with Streams). We present an example of a type with mixed polarities: a stream of standard numbers with a finite amount of padding between consecutive numbers. The programmer's intent is for the stream to be lazy and infinite, i.e., no end-of-stream is provided. But because we do not restrict recursion, even a well-typed implementation may diverge and fail to produce another number. On the other hand, the padding must always be finite because the meaning of positive types is inductive. We present padded streams as two mutually dependent type definitions, one positive and one negative. Because our type definitions are equirecursive this isn't strictly necessary, and we could just substitute out the definition of pstream⁻.

For our example, we also define a subtype with zero padding; a type forcing a single padding label none between any two elements could also be expressed.

$$\begin{aligned} \mathsf{pstream}^- &= \uparrow(\mathsf{std} \otimes \mathsf{padding}) \\ \mathsf{padding}^+ &= \oplus\{\mathsf{none} : \mathsf{padding},\ \mathsf{some} : \downarrow\mathsf{pstream}\} \\ \mathsf{zstream}^- &= \uparrow(\mathsf{std} \otimes \oplus\{\mathsf{some} : \downarrow\mathsf{zstream}\}) \end{aligned}$$

In zstream, we see the significance of variant record types with just one label: some. We exploit this in Section 7 to interpret isorecursive types into equirecursive ones. We have that zstream ≤ pstream, which means we can pass a stream with zero padding into any function expecting one with arbitrary padding.

We now program two mutually recursive functions to create a stream with zero padding from a stream with arbitrary (but finite!) padding.

```
compress : (↓pstream) → zstream
omit : padding → zstream
compress = λs. let return np = force s in
                match np (⟨n, p⟩ ⇒ return ⟨n, some · thunk (omit p)⟩)
omit = λp. match p ( none · p' ⇒ omit p'
                   | some · s ⇒ compress s )
```
Example 3 (Omega). As a final example in this section we consider the embedding of the untyped λ-calculus. The untyped term under consideration is (λx. x x) (λx. x x). The first thing to notice is that this term is not even syntactically well-formed, because x stands for a value, but in x x the function part needs to be an expression. Closely related is that the "usual" definition for the embedding of the untyped λ-calculus (see, for example, [42]), U = U → U, isn't properly polarized. So, we define it as U⁻ = (↓U) → U instead:

$$\begin{array}{ll} \omega : (\downarrow \mathsf{U}) \to \mathsf{U} & \qquad \Omega : \mathsf{U} \\ \omega = \lambda x. (\mathsf{force} \, x) \, x & \qquad \Omega = \omega \, (\mathsf{thunk} \, \omega) \end{array}$$

Because our type definitions are equirecursive, both of these definitions are well-typed. Moreover, we also have ω : U, and in fact the embedding of every untyped λ-term will have type U. We also observe that ω (thunk ω) ↦³ ω (thunk ω), and therefore it represents a well-typed diverging term. Of course, f : U = f is also well-typed and reduces to itself in one step.
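Concretely, the three steps unfold the definition of ω in the signature, β-reduce, and cancel force against thunk:

$$\omega\ (\mathtt{thunk}\ \omega) \;\mapsto\; (\lambda x.\, (\mathtt{force}\ x)\ x)\ (\mathtt{thunk}\ \omega) \;\mapsto\; (\mathtt{force}\ (\mathtt{thunk}\ \omega))\ (\mathtt{thunk}\ \omega) \;\mapsto\; \omega\ (\mathtt{thunk}\ \omega)$$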

Remarkably, with our notion of semantic typing we will see that Ω has every type σ⁻ and not just U [49, Appendix B, Example 9]!

### 3 Semantic Typing

Our aim is to justify both typing and subtyping by semantic means. We therefore start with semantic typing of closed values and computations, written v ∈ τ⁺ and e ∈ σ⁻. From this we can, for example, define semantic subtyping for positive types τ⁺ ⊆ σ⁺ as $\forall v.\, v \in \tau^+ \supset v \in \sigma^+$.

Conceptually, semantic typing is a mixed inductive/coinductive definition. Values are typed inductively, which yields the correct interpretation of purely positive types such as natural numbers, lists, or trees, describing finite data structures. Computations are typed coinductively because they include the possibility of infinite computation by unbounded recursion. While we assume we can observe the structure of values, computations e cannot be observed directly. Different notions of observation for computation would yield different definitions of semantic typing. For our purposes, since we want to allow unfettered recursion, we posit we can (a) observe the fact that a computation steps according to our dynamics, even if we cannot examine the computation itself, and (b) when a computation is terminal we can observe its behavior by applying elimination forms (for types τ⁺ → σ⁻ and $\&\{\ell : \sigma_\ell^-\}_{\ell \in L}$) or by observing its returned value (for the type ↑τ⁺).

Besides capturing a certain notion of observability, our semantics incorporates the usual concept of type soundness which is important both for implementations and for interpreting the results of computations. These are:

Semantic Preservation (Theorem 1) If e ∈ σ⁻ and e ↦ e′ then e′ ∈ σ⁻.

Semantic Progress (Theorem 2) If e ∈ σ⁻ then either e ↦ e′ for some e′ or e is terminal (but not both). This captures the usual slogan that "well-typed programs do not go wrong" [57]. An implementation will not accidentally treat a pair as a function or try to decompose a function as if it were a pair.

Semantic Observation If v ∈ τ⁺ then the structure of the value v is determined (inductively) by the type τ⁺. Similarly, a terminal computation e ∈ ↑τ⁺ must have the form e = return v with v ∈ τ⁺.

These combine to the following: if we start a computation for e ∈ ↑τ⁺ then either e ↦* return v for an observable value v ∈ τ⁺ after a finite number of steps, or e does not terminate.

These are close to their usual syntactic analogues, but the fact that we do not rely on any form of syntactic typing is methodologically significant. For example, if we have a program that does not obey a syntactic typing discipline but behaves correctly according to our semantic typing, our results will apply and this program, in combination with others that are well typed, will both be safe (semantic progress) and return meaningfully observable results (semantic preservation and observation). This point has been made passionately by Dreyer et al. [28] and applied, for example, to trusted libraries in Rust [47]. Another example can be found in gradual typing [38, 60]. As long as we can prove by any means that the "dynamically typed" portion of the program is semantically well-typed (even if not syntactically so), the combination is sound and can be executed without worry, returning a correctly observable result. A third example is provided by session types for message-passing concurrency [44]. While it is important to have a syntactic type discipline, processes in a distributed system may be programmed in a variety of languages some of which will have much weaker guarantees. Being able to prove their semantic soundness then guarantees the behavioral soundness of the composed system.

Semantic typing in the context of call-by-push-value is well-suited for encoding computational effects, such as input/output, memory mutation, nontermination, etc. Call-by-push-value was designed as a study for the λ-calculus with effects [53, Sec. 2.4], stratifying terms into values (which have no side-effects) and computations (which might). Through the lens of semantic typing, we can ensure behavioral soundness in the presence of effects.

#### 3.1 Semantic Typing with Observation Depth

Despite the extensive work on mixed inductive and coinductive definitions [3, 11, 20, 21, 22, 43, 48, 51, 59, 61, 69], there is no widely accepted style of presenting such definitions and reasoning with them concisely in a mathematical language of discourse. With some regret, we therefore present our semantic definition by turning the coinductive part into an inductive one, following the basic idea underlying step indexing [7, 8, 10, 27]. Since the coinduction has priority over the induction, arguments proceed by nested induction, first over the step index and second over the structure of the inductive definition. This representation of mixed definitions implies that reasoning over step indices has lexicographic priority over values.

An alternative point of view is provided by sized types [5, 6]. Both sized types and step indexing employ the same concept of observation depth; however, for sized types, we would observe data constructors, whereas for step indexing we observe computation steps. General recursion is supported in our system because "productivity" in the negative layer means that computations can step rather than produce a data constructor. The step index is actually the (universally quantified) observation depth for a coinductively defined predicate. We do not index the (existentially quantified) size of the inductive predicate but use its structure directly since values are finite and become smaller. All step indices k, i and occasionally j range over natural numbers. We use three judgments,

1. $e \in_k \sigma^-$ ($e$ has semantic type $\sigma^-$ at index $k$)
2. $e \mathrel{\hat{\in}}_{k+1} \sigma^-$ (terminal $e$ has semantic type $\sigma^-$ at index $k+1$)
3. $v \in_k \tau^+$ ($v$ has semantic type $\tau^+$ at index $k$)

They should be defined by nested induction, first on $k$ and second on the structure of $v/e$, where part 2 can rely on part 1 for a computation that is not terminal. We write $v < v'$ when $v$ is a strict subexpression of $v'$. The clauses of the definition can be found in Figure 1.

A few notes on these definitions. When expanding type definitions $t = \tau^+$ and $s = \sigma^-$ we rely on the assumption that type definitions are contractive, so one of the immediately following cases will apply next.

$$\begin{aligned}
v \in_k t &\;\triangleq\; v \in_k \tau^+ \quad \text{for } t = \tau^+ \in \Sigma \\
v \in_k \tau_1^+ \otimes \tau_2^+ &\;\triangleq\; v = \langle v_1, v_2\rangle,\ v_1 \in_k \tau_1^+ \text{, and } v_2 \in_k \tau_2^+ \ \text{for some } v_1, v_2 \\
v \in_k \mathbf{1} &\;\triangleq\; v = \langle\rangle \\
v \in_k \oplus\{\ell : \tau_\ell^+\}_{\ell \in L} &\;\triangleq\; v = j \cdot v_j \text{ and } v_j \in_k \tau_j^+ \ \text{for some } j \in L \\
v \in_k \downarrow\sigma^- &\;\triangleq\; v = \mathtt{thunk}\ e \text{ and } e \in_k \sigma^- \ \text{for some } e \\[1ex]
e \in_0 \sigma^- &\quad \text{always} \\
e \in_{k+1} \sigma^- &\;\triangleq\; (e \mapsto e' \text{ and } e' \in_k \sigma^-) \text{ or } (e\ \mathsf{terminal} \text{ and } e \mathrel{\hat{\in}}_{k+1} \sigma^-) \\[1ex]
e \mathrel{\hat{\in}}_{k+1} s &\;\triangleq\; e \mathrel{\hat{\in}}_{k+1} \sigma^- \quad \text{for } s = \sigma^- \in \Sigma \\
e \mathrel{\hat{\in}}_{k+1} \tau^+ \to \sigma^- &\;\triangleq\; e\; v \in_{k+1} \sigma^- \ \text{for all } i \le k \text{ and } v \text{ with } v \in_i \tau^+ \\
e \mathrel{\hat{\in}}_{k+1} \&\{\ell : \sigma_\ell^-\}_{\ell \in L} &\;\triangleq\; e.j \in_{k+1} \sigma_j^- \ \text{for all } j \in L \\
e \mathrel{\hat{\in}}_{k+1} \uparrow\tau^+ &\;\triangleq\; e = \mathtt{return}\ v \ \text{for some } v \in_k \tau^+ \\[1ex]
v \in \tau^+ &\;\triangleq\; v \in_k \tau^+ \ \text{for all } k \\
e \in \sigma^- &\;\triangleq\; e \in_k \sigma^- \ \text{for all } k
\end{aligned}$$

Fig. 1. Definition of Semantic Typing

This means that, unlike many definitions in this style, the types do not necessarily get smaller. For the inductive part (typing of values), the values do get smaller, and for the coinductive part (typing of computations) the step index will get smaller because in the case of functions and records the constructed expression is not terminal.
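As a minimal worked instance of these clauses, consider why return ⟨⟩ ∈ ↑1. Since return ⟨⟩ does not step and is terminal, unfolding the definition at index k + 1 gives

$$\mathtt{return}\ \langle\rangle \in_{k+1} \uparrow\mathbf{1} \iff \mathtt{return}\ \langle\rangle \mathrel{\hat{\in}}_{k+1} \uparrow\mathbf{1} \iff \langle\rangle \in_k \mathbf{1} \iff \langle\rangle = \langle\rangle$$

and $e \in_0 \sigma^-$ holds unconditionally, so return ⟨⟩ ∈ₖ ↑1 for every k.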

A number of variations on this definition are possible. A particularly interesting one avoids decreasing the step index unless recursion is unrolled [8, 27, 60] so sources of nontermination can be characterized more precisely. It may also be possible to keep the step index constant when analyzing a terminal computation of type ↑τ⁺. Stripping the return constructor constitutes a form of observation, and therefore decreasing the index seems both appropriate and simplest.

The quantification over i ≤ k in the case of terminal computations of function type seems necessary because we need the relation to be downward closed so that it defines a deflationary fixed point [4, 41]. Values and computations are then semantically well-typed if they are well-typed for all step indices.

#### Lemma 2 (Downward Closure).

1. $e \in_k \sigma^-$ implies $e \in_i \sigma^-$ for all $i \le k$
2. $e \mathrel{\hat{\in}}_{k+1} \sigma^-$ implies $e \mathrel{\hat{\in}}_{i+1} \sigma^-$ for all $i \le k$
3. $v \in_k \tau^+$ implies $v \in_i \tau^+$ for all $i \le k$

Proof. By a routine nested induction on k and the structure of v/e where part 2 can appeal to part 1 when e is not terminal.

Here are some semantic types that can easily be verified (see [49, Appendix B]).

Example 4 (Semantic Typing).


#### 3.2 Properties of Semantic Typing

The properties of semantic preservation and progress follow immediately just by applying the definitions and Lemma 1, so we elide their proofs.

Theorem 1 (Semantic Preservation). If e ∈ σ⁻ and e ↦ e′ then e′ ∈ σ⁻.

Theorem 2 (Semantic Progress). If e ∈ σ⁻ then either e ↦ e′ or e is terminal, but not both.

### 4 Subtyping

The semantics of subtyping is quite easy to express using semantic typing.

Definition 1 (Semantic Subtyping).

1. τ⁺ ⊆ σ⁺ iff v ∈ τ⁺ implies v ∈ σ⁺ for all v.
2. τ⁻ ⊆ σ⁻ iff e ∈ τ⁻ implies e ∈ σ⁻ for all e.

We would now like to give a syntactic definition of subtyping that expresses an algorithm and show it both sound and complete with respect to the given semantic definition. The intuitive rules for subtyping shouldn't be surprising, although to our knowledge our formulation is original.

#### 4.1 Empty and Full Types

A first observation is that τ⁺ ⊆ σ⁺ whenever τ⁺ is an empty type, regardless of σ⁺, because the necessary implication holds vacuously. So we need an algorithm to determine emptiness of a positive type. For the most streamlined presentation (which is also most suitable for an implementation) we first put the signature into a normal form that alternates between structural types and type names.

$$\begin{aligned} \tau^+ &::= t_1 \otimes t_2 \mid \mathbf{1} \mid \oplus\{\ell : t_\ell\}_{\ell \in L} \mid \downarrow s \\ \sigma^- &::= t \to s \mid \&\{\ell : s_\ell\}_{\ell \in L} \mid \uparrow t \\ \Sigma &::= \cdot \mid \Sigma,\, t = \tau^+ \mid \Sigma,\, s = \sigma^- \mid \Sigma,\, f : \sigma^- = e \end{aligned}$$
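For example, the definition of bin from Example 1 is brought into this normal form by naming its structural subterm 1 (the auxiliary name u is ours, matching the normalization used in Example 6 below):

$$\mathsf{bin} = \oplus\{\mathsf{e} : \mathsf{u},\ \mathsf{b0} : \mathsf{bin},\ \mathsf{b1} : \mathsf{bin}\} \qquad \mathsf{u} = \mathbf{1}$$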

A usual presentation of emptiness maintains a collection of recursive types in a context in order to do a kind of loop detection. For example, the type t = 1 ⊗ t is empty because we may assume that t is empty while testing 1 ⊗ t. Instead, we express this and similar kinds of arguments using valid circular reasoning. If one were to formalize it, it would be in CLKID<sup>ω</sup> [14], although the succedent of any sequent is either empty or a singleton (as in CLJID<sup>ω</sup> [12]).

We construct circular derivations for t empty where t is a positive type name. Note that negative types are never empty. We can form a valid cycle when we encounter a goal t empty as a proper subgoal of t empty. Since we fix a signature Σ once and for all before defining each judgment such as emptiness or subtyping, we omit the index Σ since it never changes. The rules can be found in Figure 2.

$$\begin{gathered}
\frac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \in \Sigma \qquad t_j\ \mathsf{empty}\ (\forall j \in L)}{t\ \mathsf{empty}}\ \oplus\text{EMP}
\qquad (\text{no rules for } t = \mathbf{1} \text{ or } t = \downarrow s) \\[1ex]
\frac{t = t_1 \otimes t_2 \in \Sigma \qquad t_1\ \mathsf{empty}}{t\ \mathsf{empty}}\ \otimes\text{EMP}_1
\qquad
\frac{t = t_1 \otimes t_2 \in \Sigma \qquad t_2\ \mathsf{empty}}{t\ \mathsf{empty}}\ \otimes\text{EMP}_2
\end{gathered}$$

Fig. 2. Circular Derivation Rules for Emptiness

Example 5. We continue Example 4, part (4), building a formal circular derivation. We first bring the signature into normal form, Σ = {u₀ = 1, t₀ = u₀ ⊗ t₀}, and then construct

$$\dfrac{t_0 = u_0 \otimes t_0 \in \Sigma \qquad \dfrac{}{t_0\ \mathsf{empty}}\ \text{CYCLE}(*)}{t_0\ \mathsf{empty}\ (*)}\ \otimes\text{EMP}_2$$

Theorem 3 (Emptiness). If t empty then for all k and v, $v \notin_k t$.

Proof. We interpret the judgment t empty semantically as $v \in_k t \vdash \cdot$ (which expresses $v \notin_k t$ in a sequent), where t is given and k and v are parameters and therefore implicitly universally quantified. The proof of this judgment is carried out in a circular metalogic. We translate each inference rule for t empty into a derivation for $v \in_k t \vdash \cdot$, where each unproven subgoal corresponds to a premise of the rule. When the derivation of t empty is closed by a cycle, the corresponding derivation of $v \in_k t \vdash \cdot$ is closed by a corresponding cycle in the metalogic. The cases can be found in [49, Appendix D].

Next we symmetrically define what it means for a computation type σ⁻ to be full, namely that it is inhabited by every (semantically well-typed) computation. A simple example is the type &{ }, that is, the lazy record without any fields. It contains every well-typed expression because all projections (of which there are none) are well-typed. It turns out that fullness is defined directly in terms of emptiness.

We may construct a derivation using the following rules. It could be circular, since the judgment t empty allows circular derivations.

$$\frac{s = t_1 \to s_2 \in \Sigma \qquad t_1\ \mathsf{empty}}{s\ \mathsf{full}}\ \to\text{FULL} \qquad \frac{s = \&\{\} \in \Sigma}{s\ \mathsf{full}}\ \&\text{FULL} \qquad (\text{no rule for } s = \uparrow t)$$

We interpret s full as the entailment $e \in_k r \vdash e \in_k s$. In other words, we are assuming that e is semantically well-typed at some r and use that to show that it will then also be well-typed at the unrelated s.

Theorem 4 (Fullness). If s full then $e \in_k r$ implies $e \in_k s$ for all k, e, and r.

Proof. (see [49, Appendix E])

Note that there is no rule that would allow us to conclude that s = t₁ → s₂ is full if s₂ is full. Such a rule would be unsound: consider { } ∈ &{ }. It is not the case that { } ∈ 1 → &{ }, so 1 → &{ } is not full, even though &{ } is. Similarly, λx. { } ∈ 1 → &{ }, but λx. { } ∉ &{l : &{ }}, so &{l : &{ }} is not full.

### 4.2 Syntactic Subtyping

The rules for syntactic subtyping build a circular derivation of t⁺ ≤ u⁺ and s⁻ ≤ r⁻. A circularity arises when a goal t ≤ u or s ≤ r arises as a subgoal strictly above a goal that is of one of these two forms. In general, we use t and u to stand for positive type names and s and r for negative type names without annotating those names. The polarity will also be clear from the context. Moreover, in the interest of saving space, we write t = τ⁺ and s = σ⁻ when these definitions are in the fixed global signature Σ. The rules can be found in Figure 3. In particular, we would like to highlight the ⊥sub⁺, ⊥sub⁻, and ⊤sub rules, which incorporate emptiness and fullness into syntactic subtyping. For example, among other subtypings, the ⊥sub⁺ rule establishes t ≤ u whenever t = t₁ ⊗ t₂ and either t₁ empty or t₂ empty.

Example 6. We revisit Example 1 to show that pos ≤ std. We have annotated each subgoal from the ⊕sub rule with the corresponding label; we have elided the references to the ⊕sub rule in the derivation for lack of space. Again, we normalize the signature before running the algorithm.

$$\begin{aligned}
\mathsf{u}^+ &= \mathbf{1} \\
\mathsf{std}^+ &= \oplus\{\mathsf{e} : \mathsf{u},\ \mathsf{b0} : \mathsf{pos},\ \mathsf{b1} : \mathsf{std}\} \\
\mathsf{pos}^+ &= \oplus\{\hphantom{\mathsf{e} : \mathsf{u},\ } \mathsf{b0} : \mathsf{pos},\ \mathsf{b1} : \mathsf{std}\}
\end{aligned}$$

$$\dfrac{
\dfrac{\dfrac{}{[\mathsf{b0}]\ \mathsf{pos} \le \mathsf{pos}}\ \text{CYCLE}(*) \quad \dfrac{}{[\mathsf{b1}]\ \mathsf{std} \le \mathsf{std}}\ \text{CYCLE}(**)}{[\mathsf{b0}]\ \mathsf{pos} \le \mathsf{pos}\ (*)}
\qquad
\dfrac{\dfrac{}{[\mathsf{e}]\ \mathsf{u} \le \mathsf{u}} \quad \dfrac{}{[\mathsf{b0}]\ \mathsf{pos} \le \mathsf{pos}}\ \text{CYCLE}(*) \quad \dfrac{}{[\mathsf{b1}]\ \mathsf{std} \le \mathsf{std}}\ \text{CYCLE}(**)}{[\mathsf{b1}]\ \mathsf{std} \le \mathsf{std}\ (**)}
}{\mathsf{pos} \le \mathsf{std}}$$

$$\begin{gathered}
\frac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\ \otimes\text{SUB} \qquad
\frac{t = \mathbf{1} \quad u = \mathbf{1}}{t \le u}\ \mathbf{1}\text{SUB} \qquad
\frac{t = \downarrow s \quad u = \downarrow r \quad s \le r}{t \le u}\ \downarrow\text{SUB} \\[1ex]
\frac{t = \oplus\{\ell : t_\ell\}_{\ell \in L} \quad u = \oplus\{j : u_j\}_{j \in J} \quad \forall(\ell \in L) :\ t_\ell\ \mathsf{empty}\ \text{or}\ (\ell \in J\ \text{and}\ t_\ell \le u_\ell)}{t \le u}\ \oplus\text{SUB} \qquad
\frac{t\ \mathsf{empty}}{t \le u}\ \bot\text{SUB}^+ \\[1ex]
\frac{s = t_1 \to s_2 \quad r = u_1 \to r_2 \quad u_1 \le t_1 \quad s_2 \le r_2}{s \le r}\ \to\text{SUB} \qquad
\frac{s = \&\{\ell : s_\ell\}_{\ell \in L} \quad r = \&\{j : r_j\}_{j \in J} \quad (L \supseteq J) \quad \forall(j \in J) :\ s_j \le r_j}{s \le r}\ \&\text{SUB} \\[1ex]
\frac{s = \uparrow t \quad r = \uparrow u \quad t \le u}{s \le r}\ \uparrow\text{SUB} \qquad
\frac{s = \uparrow t \quad t\ \mathsf{empty}}{s \le r}\ \bot\text{SUB}^- \qquad
\frac{r\ \mathsf{full}}{s \le r}\ \top\text{SUB}
\end{gathered}$$

Fig. 3. Circular Derivation Rules for Subtyping

From a circular derivation we now construct a valid circular proof in an intuitionistic metalogic [12]. For example, t ≤ u is interpreted as t ⊆ u, that is, every value in t is also a value in u. We actually prove a slightly stronger theorem, namely that the step index on both sides can remain the same.

#### Theorem 5 (Soundness of Subtyping).

1. If t ≤ u then $v \in_k t \vdash v \in_k u$ for all k and v (and so, t ⊆ u).
2. If s ≤ r then $e \in_k s \vdash e \in_k r$ for all k and e (and so, s ⊆ r).

Proof. We proceed by a compositional translation of the circular derivation of subtyping into a circular derivation in the metalogic. For each rule we construct a derived rule on the semantic side with corresponding premises and conclusion.

When the subtyping proof is closed due to a cycle, we close the proof in the metalogic with a corresponding cycle. In order for this cycle to be valid, it is critical that the judgments in the premises of the derived rule are strictly smaller than the judgments in the conclusion. Since our mixed logical relation is defined by nested induction, first on the step index k and second on the structure of the value v or expression e, the lexicographic measure (k, v/e) should strictly decrease. Some sample cases can be found in [49, Appendix F].

Besides soundness, reflexivity and transitivity of syntactic subtyping are two other properties that we prove for assurance that the syntactic subtyping rules are sensible and have no obvious gaps. These proofs can be found in [49, Appendix G]. Ligatti et al. [55] also consider a notion of preciseness as a syntactic means for judging the correctness of their syntactic subtyping rules. As they mention in [55, Sec. 6.2], this property is highly language-sensitive, depending on the choice of evaluation strategy (strict vs. nonstrict), where nonstrict subtyping relies on "which primitives are present in the language, sometimes in nonorthogonal ways." Moreover, preciseness requires syntactically well-typed counterexamples, whereas we also consider ill-typed terms. We can straightforwardly prove that syntactic subtyping for purely positive types (in relation to strict evaluation) is complete with respect to semantic subtyping. We leave the preciseness of syntactic subtyping of negative types for future consideration.

### 5 Syntactic Typing and Soundness

We now introduce a syntactic typing judgment, at the moment without regard to decidability. Such a judgment is often called declarative typing, in contrast with the algorithmic typing of Section 6 (Figure 4). We prove that all syntactically well-typed terms are also semantically well-typed. Conceptually, a declarative system is unnecessary because the bidirectional system is very closely related, and there are no problems in justifying the soundness of the bidirectional system directly with respect to our semantics. Besides the fact that there is a small amount of additional bureaucracy (the rules are divided among four judgments instead of two, and there are two additional rules), it is also the case that the standard versions of call-by-name and call-by-value use a similar form of declarative typing and are therefore easier to relate to our system in Section 8.

Because all declarations in a signature can be mutually recursive, each declaration f : σ <sup>−</sup> = e is checked assuming all other declarations are valid. The soundness proof below justifies this. The complete set of judgments and rules with their corresponding presuppositions can be found in [49, Appendix H, Figs. 7 and 8]. For these rules, we need contexts Γ, defined as usual with the presupposition that all variables declared in a context are distinct.

$$
\Gamma ::= \cdot \mid \Gamma, x \colon \tau^+
$$

The rules for the key judgments Γ ⊢ v : τ⁺ and Γ ⊢ e : σ⁻ can be obtained from the bidirectional rules in Section 6 by replacing both v ⇐ τ⁺ and v ⇒ τ⁺ with v : τ⁺ and, similarly, e ⇐ σ⁻ and e ⇒ σ⁻ with e : σ⁻. Moreover, one should drop the two annotation rules anno⁺ and anno⁻ because these are not in the source language for declarative typing.

We would like to show that the syntactic typing rules are sound with respect to their semantic interpretation. For that, we first define simultaneous substitutions θ of closed values for variables and $\theta \in_k \Gamma$ for the semantic interpretation of contexts as sets of substitutions at step index k.

$$\theta ::= \cdot \mid \theta, v/x$$

$$(\cdot) \in\_k (\cdot) \quad \text{always}$$

$$(\theta, v/x) \in\_k (\Gamma, x: \tau^+) \triangleq \theta \in\_k \Gamma \text{ and } v \in\_k \tau^+$$

On the semantic side, we define

1. $\Gamma \models v \in_k \tau^+$ iff for all $\theta \in_k \Gamma$ we have $v[\theta] \in_k \tau^+$
2. $\Gamma \models e \in_k \sigma^-$ iff for all $\theta \in_k \Gamma$ we have $e[\theta] \in_k \sigma^-$

We now can prove a number of lemmas, one for each syntactic typing rule. A representative selection of the lemmas, each written as an admissible rule for semantic typing, can be given by:


The proofs are somewhat interesting: some require induction on k, others follow more directly by definition. Due to a lack of space, the proofs can be found in [49, Appendix I], each admissible rule formulated as a separate lemma.

### Theorem 6 (Soundness of Syntactic Typing). Assume $\theta \in_k \Gamma$.

1. If Γ ⊢ v : τ⁺ then $v[\theta] \in_k \tau^+$
2. If Γ ⊢ e : σ⁻ then $e[\theta] \in_k \sigma^-$

Proof. We construct a circular proof based on the typing derivation, and the typing derivations for all definitions f : σ⁻ = e ∈ Σ. There are three kinds of cases (see [49, Appendix I] for samples of each):


Because soundness is stated for all θ, Γ, and k, we can immediately obtain corollaries such as that · ⊢ v : τ⁺ implies that v ∈ τ⁺, and that · ⊢ e : σ⁻ implies that e ∈ σ⁻.

### 6 Bidirectional Typing

We now shift from our declarative typing system to an algorithmic one that describes a practical decision procedure. We choose to express it as a bidirectional typechecking algorithm, particularly to avoid inference issues regarding subsumption [45] and our extensive use of type names and variant records, as well as the approach's deep integration with polarized logics [29, Section 8.3]. Moreover, bidirectional typing is quite robust with respect to language extensions where various inference procedures are not.

$$\begin{gathered}
\frac{x : \tau^+ \in \Gamma}{\Gamma \vdash x \Rightarrow \tau^+}\ \text{VAR} \qquad
\frac{\Gamma \vdash v_1 \Leftarrow \tau_1^+ \quad \Gamma \vdash v_2 \Leftarrow \tau_2^+}{\Gamma \vdash \langle v_1, v_2\rangle \Leftarrow \tau_1^+ \otimes \tau_2^+}\ \otimes\text{I} \qquad
\frac{\Gamma \vdash v \Rightarrow \tau_1^+ \otimes \tau_2^+ \quad \Gamma, x{:}\tau_1^+, y{:}\tau_2^+ \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathtt{match}\ v\ (\langle x,y\rangle \Rightarrow e) \Leftarrow \sigma^-}\ \otimes\text{E} \\[1ex]
\frac{}{\Gamma \vdash \langle\rangle \Leftarrow \mathbf{1}}\ \mathbf{1}\text{I} \qquad
\frac{\Gamma \vdash v \Rightarrow \mathbf{1} \quad \Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathtt{match}\ v\ (\langle\rangle \Rightarrow e) \Leftarrow \sigma^-}\ \mathbf{1}\text{E} \qquad
\frac{\Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \mathtt{thunk}\ e \Leftarrow \downarrow\sigma^-}\ \downarrow\text{I} \qquad
\frac{\Gamma \vdash v \Rightarrow \downarrow\sigma^-}{\Gamma \vdash \mathtt{force}\ v \Rightarrow \sigma^-}\ \downarrow\text{E} \\[1ex]
\frac{(j \in L) \quad \Gamma \vdash v \Leftarrow \tau_j^+}{\Gamma \vdash j \cdot v \Leftarrow \oplus\{\ell : \tau_\ell^+\}_{\ell \in L}}\ \oplus\text{I} \qquad
\frac{\Gamma \vdash v \Rightarrow \oplus\{\ell : \tau_\ell^+\}_{\ell \in L} \quad \forall(\ell \in L) :\ \Gamma, x_\ell{:}\tau_\ell^+ \vdash e_\ell \Leftarrow \sigma^-}{\Gamma \vdash \mathtt{match}\ v\ (\ell \cdot x_\ell \Rightarrow e_\ell)_{\ell \in L} \Leftarrow \sigma^-}\ \oplus\text{E} \\[1ex]
\frac{\Gamma, x{:}\tau^+ \vdash e \Leftarrow \sigma^-}{\Gamma \vdash \lambda x.\, e \Leftarrow \tau^+ \to \sigma^-}\ \to\text{I} \qquad
\frac{\Gamma \vdash e \Rightarrow \tau^+ \to \sigma^- \quad \Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash e\; v \Rightarrow \sigma^-}\ \to\text{E} \qquad
\frac{f : \sigma^- = e \in \Sigma}{\Gamma \vdash f \Rightarrow \sigma^-}\ \text{NAME} \\[1ex]
\frac{\forall(\ell \in L) :\ \Gamma \vdash e_\ell \Leftarrow \sigma_\ell^-}{\Gamma \vdash \{\ell = e_\ell\}_{\ell \in L} \Leftarrow \&\{\ell : \sigma_\ell^-\}_{\ell \in L}}\ \&\text{I} \qquad
\frac{\Gamma \vdash e \Rightarrow \&\{\ell : \sigma_\ell^-\}_{\ell \in L} \quad (j \in L)}{\Gamma \vdash e.j \Rightarrow \sigma_j^-}\ \&\text{E} \\[1ex]
\frac{\Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash \mathtt{return}\ v \Leftarrow \uparrow\tau^+}\ \uparrow\text{I} \qquad
\frac{\Gamma \vdash e_1 \Rightarrow \uparrow\tau^+ \quad \Gamma, x{:}\tau^+ \vdash e_2 \Leftarrow \sigma^-}{\Gamma \vdash \mathtt{let\ return}\ x = e_1\ \mathtt{in}\ e_2 \Leftarrow \sigma^-}\ \uparrow\text{E} \\[1ex]
\frac{\Gamma \vdash v \Rightarrow \tau^+ \quad \tau^+ \le \sigma^+}{\Gamma \vdash v \Leftarrow \sigma^+}\ \text{SUB}^+ \qquad
\frac{\Gamma \vdash e \Rightarrow \tau^- \quad \tau^- \le \sigma^-}{\Gamma \vdash e \Leftarrow \sigma^-}\ \text{SUB}^- \qquad
\frac{\Gamma \vdash v \Leftarrow \tau^+}{\Gamma \vdash (v : \tau^+) \Rightarrow \tau^+}\ \text{ANNO}^+ \qquad
\frac{\Gamma \vdash e \Leftarrow \sigma^-}{\Gamma \vdash (e : \sigma^-) \Rightarrow \sigma^-}\ \text{ANNO}^-
\end{gathered}$$

Fig. 4. Bidirectional Typing


Bidirectional typechecking [68] has been a popular choice for presenting algorithmic typing, especially when concerned with subtyping [30], and is decidable for a wide range of rich type systems. This approach splits each of the typing judgments, Γ ⊢ v : τ⁺ and Γ ⊢ e : σ⁻, into checking (⇐) and synthesis (⇒) judgments for values and expressions, respectively: Γ ⊢ v ⇐ τ⁺, Γ ⊢ v ⇒ τ⁺ and Γ ⊢ e ⇐ σ⁻, Γ ⊢ e ⇒ σ⁻.

We follow the recipe laid out by [25, 32]: introduction rules check and elimination rules synthesize. More precisely, the principal judgment, premise or conclusion, has the connective being introduced by checking or eliminated by synthesis.

We introduce two new forms of syntactic values (v : τ⁺) and computations (e : σ⁻) which exist purely for typechecking purposes and are erased before evaluation. This is not actually used in any of our examples because definitions in the signature already require annotations.

Applying the recipe, we can easily convert our declarative rules into bidirectional ones, as laid out in Section 5. The only rules we add to the system are anno<sup>+</sup> and anno<sup>−</sup>, which allow us to prove completeness. All the examples in Section 2.2 check with these rules and only require type annotations at the top level of the declarations in the signature.

Due to our use of equirecursive types, the implementation of this system can closely follow the structure of the rules in Figures 2, 3, and 4. First, as mentioned in Section 4.1, we convert the signature into a normal form that alternates structural types and type names. Then, we determine all the empty type names, using a memoization table for t⁺ empty to easily construct circular derivations of emptiness (bottom-up) using the rules in Figure 2. If constructing such a derivation fails, then t⁺ is nonempty. Fullness is derived from emptiness non-recursively. From there, we build a memoization table for t⁺ ≤ u⁺ and s⁻ ≤ r⁻, for positive and negative type names, so we can construct circular derivations of subtyping between names (also bottom-up). This happens lazily, only computing t⁺ ≤ u⁺ or s⁻ ≤ r⁻ if typechecking requires this information.

Bidirectional typing, given subtyping, follows the rules in Figure 4, including the rules for positive and negative subsumption, but it requires that the types in annotations are also translated to normal form, possibly introducing new (user-invisible) definitions in the signature.

The theorems (with straightforward proofs) for soundness and completeness of bidirectional typechecking can be found in [49, Appendix J, Thms. 12 and 13].

### 7 Interpretation of Isorecursive Types

Our system uses equirecursive types, which allow many subtyping relations since there are no term constructors for folding recursive types. Moreover, equirecursive types support the normal form where constructors are always applied to type names (see Section 4.1), simplifying our algorithms, their description and implementations. Most importantly, perhaps, equirecursive types are more general because we can directly interpret isorecursive types, which are embodied by fold and unfold operators, into our equirecursive setting and apply our results.

We give a short sketch here; details can be found in [49, Appendix K]. For every recursive type µα⁺. τ⁺ we introduce a definition $t^+ = \oplus\{\mathsf{fold}_\mu : [t/\alpha]\tau\}$. Similarly, for every corecursive type να⁻. σ⁻ we introduce a definition $s^- = \&\{\mathsf{fold}_\nu : [s/\alpha]\sigma\}$. Now, the labels $\mathsf{fold}_\mu$ and $\mathsf{fold}_\nu$, tagging the sole choice of a unary variant or lazy record, respectively, play exactly the role that the fold constructor plays for recursive types. This entirely straightforward translation is enabled by our generalization of the binary sum and lazy pairs to variant and lazy records, respectively, so we can use them in their unary form.
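For instance, under this translation the isorecursive natural numbers µα⁺. ⊕{zero : 1, succ : α} would be represented by the definition (the name nat is ours, for illustration):

$$\mathsf{nat}^+ = \oplus\{\mathsf{fold}_\mu : \oplus\{\mathsf{zero} : \mathbf{1},\ \mathsf{succ} : \mathsf{nat}\}\}$$

so that values take the form fold<sub>µ</sub> · (zero · ⟨⟩), fold<sub>µ</sub> · (succ · (fold<sub>µ</sub> · (zero · ⟨⟩))), and so on, with the label fold<sub>µ</sub> marking each unfolding.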

### 8 Call-by-Name and Call-by-Value

More familiar than call-by-push-value (CBPV) are the lazy, call-by-name (CBN) and eager, call-by-value (CBV) operational semantics that underlie the Haskell and ML families of functional programming languages. Levy [54] has shown that both CBN and CBV exist as fragments of CBPV, exhibiting translations from CBN and CBV types and terms into the CBPV language. In this section, we derive systems of subtyping for CBN and CBV from these translations into ours and prove them sound and complete. We discover that they are minor variants of existing systems for CBN [39] and CBV [55] subtyping.

Because polarized subtyping is able to connect Levy's translations with existing systems for CBN and CBV subtyping, it serves as further evidence that those prior translations and our subtyping rules are, in some sense, canonical. Moreover, it is yet one more piece of evidence that CBPV is an effective synthesis of evaluation orders in which to study the theory of functional programming.

#### 8.1 Call-by-name

Consider a CBN language with the following types. The language of terms and the standard statics and dynamics can be found in [49, Appendix L].

$$\tau, \sigma ::= \tau \to \sigma \mid \tau_1 \otimes \tau_2 \mid \mathbf{1} \mid \oplus\{\ell : \tau_\ell\}_{\ell \in L} \mid \&\{\ell : \tau_\ell\}_{\ell \in L}$$

In this section, we will focus on function types τ → σ and variant record types $\oplus\{\ell : \tau_\ell\}_{\ell \in L}$ and their corresponding terms.

Levy [54] presents translations, $(-)^{\boxplus}$, from CBN types and terms to CBPV negative types and expressions, respectively. An auxiliary translation, $\downarrow(-)^{\boxplus}$, on contexts is also used. Here, we elide the translation of terms other than variables and the terms for function and variant record types; the full translation on terms can be found in [54].

$$\begin{aligned}
(\tau \to \sigma)^{\boxplus} &= \downarrow\tau^{\boxplus} \to \sigma^{\boxplus} & (x)^{\boxplus} &= \mathtt{force}\ x \\
(\tau_1 \otimes \tau_2)^{\boxplus} &= \uparrow(\downarrow\tau_1^{\boxplus} \otimes \downarrow\tau_2^{\boxplus}) & (\lambda x.\, e)^{\boxplus} &= \lambda x.\, e^{\boxplus} \\
(\mathbf{1})^{\boxplus} &= \uparrow\mathbf{1} & (e_1\; e_2)^{\boxplus} &= e_1^{\boxplus}\; (\mathtt{thunk}\ e_2^{\boxplus}) \\
(\oplus\{\ell : \tau_\ell\}_{\ell \in L})^{\boxplus} &= \uparrow\oplus\{\ell : \downarrow\tau_\ell^{\boxplus}\}_{\ell \in L} \\
(\&\{\ell : \sigma_\ell\}_{\ell \in L})^{\boxplus} &= \&\{\ell : \sigma_\ell^{\boxplus}\}_{\ell \in L}
\end{aligned}$$

We also translate type names t to fresh type names $t^{\boxplus}$, translating the body of t's definition and inserting additional type names as required for the normal form that alternates between structural types and type names. Levy [54] proves that well-typed terms are well-typed after the translation to CBPV is applied. Our syntactic typing rules are the same, so the theorem carries over to our setting.
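As a small worked instance, take the CBN type bool = ⊕{false : 1, true : 1} (our own abbreviation, for illustration). The translation yields

$$\mathsf{bool}^{\boxplus} = \uparrow\oplus\{\mathsf{false} : \downarrow\uparrow\mathbf{1},\ \mathsf{true} : \downarrow\uparrow\mathbf{1}\} \qquad\qquad (\mathsf{bool} \to \mathsf{bool})^{\boxplus} = \downarrow\mathsf{bool}^{\boxplus} \to \mathsf{bool}^{\boxplus}$$

where the inserted shifts record the points at which a CBN term is suspended or returned.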

We adapt the subtyping system of Gay and Hole [39] from the π-calculus to a λ-calculus, which reverses the direction of subtyping relative to their classical system, and we add empty records, obtaining the CBN syntactic subtyping rules shown in Figure 5.

These rules introduce a CBN syntactic subtyping judgment t ≤ u. To distinguish it from CBPV syntactic subtyping, we will take care in this section to always mark CBPV type names with superscript pluses and minuses, leaving CBN type names unmarked. As for CBPV syntactic subtyping, the rules for CBN subtyping shown in Figure 5 build a circular derivation. Just as before, a circularity arises when a goal t ≤ u arises as a proper subgoal of itself.

$$
\begin{array}{c}
\dfrac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1 \le t_1 \quad t_2 \le u_2}{t \le u}\;{\to}\mathrm{sub}_n
\qquad
\dfrac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\;{\otimes}\mathrm{sub}_n
\qquad
\dfrac{t = \mathbf{1} \quad u = \mathbf{1}}{t \le u}\;\mathbf{1}\mathrm{sub}_n
\\[2ex]
\dfrac{t = \oplus\{\ell : t_\ell\}_{\ell\in L} \quad u = \oplus\{j : u_j\}_{j\in J} \quad (L \subseteq J) \quad \forall(\ell\in L):\, t_\ell \le u_\ell}{t \le u}\;{\oplus}\mathrm{sub}_n
\qquad
\dfrac{t = \&\{\ell : t_\ell\}_{\ell\in L} \quad u = \&\{j : u_j\}_{j\in J} \quad (L \supseteq J) \quad \forall(j\in J):\, t_j \le u_j}{t \le u}\;{\&}\mathrm{sub}_n
\\[2ex]
\dfrac{t = \oplus\{\,\}}{t \le u}\;{\bot}\mathrm{sub}_n
\qquad
\dfrac{u\ \mathsf{full}}{t \le u}\;{\top}\mathrm{sub}_n
\qquad
\dfrac{u = \&\{\,\}}{u\ \mathsf{full}}\;{\&}\mathrm{full}_n
\end{array}
$$

Fig. 5. Circular Derivation Rules for Call-by-Name Subtyping
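To see how a circular derivation arises, consider the hypothetical CBN definitions $\mathsf{nat} = \oplus\{\mathit{zero} : \mathbf{1},\ \mathit{succ} : \mathsf{nat}\}$ and $\mathsf{pos} = \oplus\{\mathit{succ} : \mathsf{nat}\}$ (auxiliary names for $\mathbf{1}$ elided). The goal $\mathsf{pos} \le \mathsf{nat}$ is an instance of ${\oplus}\mathrm{sub}_n$:

$$\dfrac{\mathsf{pos} = \oplus\{\mathit{succ}:\mathsf{nat}\} \quad \mathsf{nat} = \oplus\{\mathit{zero}:\mathbf{1},\ \mathit{succ}:\mathsf{nat}\} \quad (\{\mathit{succ}\}\subseteq\{\mathit{zero},\mathit{succ}\}) \quad \mathsf{nat}\le\mathsf{nat}}{\mathsf{pos} \le \mathsf{nat}}\;{\oplus}\mathrm{sub}_n$$

The subgoal $\mathsf{nat} \le \mathsf{nat}$ in turn requires $\mathbf{1} \le \mathbf{1}$ (closed by $\mathbf{1}\mathrm{sub}_n$) and $\mathsf{nat} \le \mathsf{nat}$ again; the latter recurs as a proper subgoal of itself and is closed by the circular discipline.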

These rules are exact analogues of those of Gay and Hole [39], with one exception. The three rules involving empty variants and records, namely ${\bot}\mathrm{sub}_n$, ${\top}\mathrm{sub}_n$, and ${\&}\mathrm{full}_n$, have no analogues in [39] only because their language did not include the corresponding empty internal and external choice types.

As we will prove below, the CBN subtyping rules in Figure 5 are exactly those for which t ≤ u in the CBN language if and only if $t^{\boxplus} \le u^{\boxplus}$ in the CBPV metalanguage. We thereby show that our polarized subtyping on the image of Levy's CBN translation is sound and complete with respect to Gay and Hole's CBN subtyping.

Before proceeding to those proofs, it is worth pointing out that many of these CBN subtyping rules exactly follow CBPV, with a few notable differences. First, the ${\oplus}\mathrm{sub}_n$ rule does not permit empty branches that do not occur in the supertype. This is because the $\downarrow$ shifts that appear in $(\oplus\{\ell : \tau_\ell\}_{\ell\in L})^{\boxplus}$ prevent each branch from being empty; there is no emptiness rule for $\downarrow$ shifts in CBPV subtyping. Second, for this CBN language, only types $t = \&\{\,\}$ are full. In particular, a CBN function type $t = t_1 \to t_2$ is never full, even though a CBPV function type $s^{-} = t_1^{+} \to s_2^{-}$ is full if the argument type $t_1^{+}$ is empty. This stems from the $\downarrow$ shift that appears in the argument type in $(\tau \to \sigma)^{\boxplus} = \downarrow\tau^{\boxplus} \to \sigma^{\boxplus}$. Third, the reader may be surprised by the omission of an emptiness judgment for CBN types. The ${\bot}\mathrm{sub}_n$ rule mentions the CBN type $t = \oplus\{\,\}$, which looks like it ought to be an empty type (the CBPV type $t_0^{+} = \oplus\{\,\}$ is empty, after all). Yes, but the CBN translation of $t = \oplus\{\,\}$ is in fact the negative type $t^{\boxplus} = \uparrow\oplus\{\,\}$, and negative types are never empty. Nevertheless, $t^{\boxplus} = \uparrow\oplus\{\,\} \le u^{\boxplus}$ in this case.

Now we prove that polarized subtyping on the image of Levy's CBN embedding, $(-)^{\boxplus}$, is sound and complete with respect to the CBN subtyping rules of Figure 5. The proofs can be found in [49, Appendix L].

#### Theorem 7 (Soundness of Polarized Subtyping, Call-by-Name).

1. If $t^{\boxplus}$ full, then $t$ full. 2. If $t^{\boxplus} \le u^{\boxplus}$, then $t \le u$.
#### Theorem 8 (Completeness of Polarized Subtyping, Call-by-Name).

1. If $t$ full, then $t^{\boxplus}$ full. 2. If $t \le u$, then $t^{\boxplus} \le u^{\boxplus}$.

#### 8.2 Call-by-Value

We can follow a similar procedure for Levy's CBV translation. Consider a CBV language with the following types. The language of terms, typing rules, and standard dynamics can be found in [49, Appendix M].

$$\tau, \sigma ::= \tau \to \sigma \mid \tau_1 \otimes \tau_2 \mid \mathbf{1} \mid \oplus\{\ell : \tau_\ell\}_{\ell \in L} \mid \&\{\ell : \sigma_\ell\}_{\ell \in L}$$

The translations, $(-)^{\boxplus}$, that Levy [54] presents from CBV types and terms to CBPV positive types and expressions are as follows. We only present the translation of variables, function abstractions, and function applications; the full translation on terms can be found in [54].

$$
\begin{array}{ll}
\textit{Types} & \textit{Terms}\\[1ex]
(\tau \to \sigma)^{\boxplus} = \downarrow(\tau^{\boxplus} \to \uparrow\sigma^{\boxplus}) & (x)^{\boxplus} = \mathtt{return}\;x\\
(\tau_1 \otimes \tau_2)^{\boxplus} = \tau_1^{\boxplus} \otimes \tau_2^{\boxplus} & (f)^{\boxplus} = \mathtt{force}\;f \quad \text{for}\ f : \tau = e \in \Sigma\\
(\mathbf{1})^{\boxplus} = \mathbf{1} & (\lambda x.\,e)^{\boxplus} = \mathtt{return}\,(\mathtt{thunk}\,(\lambda x.\,e^{\boxplus}))\\
(\oplus\{\ell : \tau_\ell\}_{\ell\in L})^{\boxplus} = \oplus\{\ell : \tau_\ell^{\boxplus}\}_{\ell\in L} & (e_1\,e_2)^{\boxplus} = \mathtt{let}\ \mathtt{return}\,x = e_2^{\boxplus}\ \mathtt{in}\\
(\&\{\ell : \sigma_\ell\}_{\ell\in L})^{\boxplus} = \downarrow\&\{\ell : \uparrow\sigma_\ell^{\boxplus}\}_{\ell\in L} & \qquad\qquad\quad\ \mathtt{let}\ \mathtt{return}\,f = e_1^{\boxplus}\ \mathtt{in}\ (\mathtt{force}\,f)\,x
\end{array}
$$

We also translate type names t to fresh type names $t^{\boxplus}$, translating the body of t's definition and inserting additional type names as required for the normal form that alternates between structural types and type names.

Levy proves that well-typed terms translate to well-typed expressions. Because our syntactic typing rules are the same as his, his theorem carries over.

We adapt the CBV subtyping system of Ligatti et al. [55] to our setting, which means that we include variants and lazy records with width and depth subtyping and replace isorecursive types with equirecursive ones. We obtain the syntactic subtyping rules shown in Figure 6. Once again, we will take care to distinguish the CBV syntactic subtyping judgment, t ≤ u, from CBPV syntactic subtyping by marking CBPV type names with pluses and minuses.

$$
\begin{array}{c}
\dfrac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1 \le t_1 \quad t_2 \le u_2}{t \le u}\;{\to}\mathrm{sub}_v
\qquad
\dfrac{t = t_1 \otimes t_2 \quad u = u_1 \otimes u_2 \quad t_1 \le u_1 \quad t_2 \le u_2}{t \le u}\;{\otimes}\mathrm{sub}_v
\qquad
\dfrac{t = \mathbf{1} \quad u = \mathbf{1}}{t \le u}\;\mathbf{1}\mathrm{sub}_v
\\[2ex]
\dfrac{t = \oplus\{\ell : t_\ell\}_{\ell\in L} \quad u = \oplus\{j : u_j\}_{j\in J} \quad \forall(\ell\in L\setminus J):\, t_\ell\ \mathsf{empty} \quad \forall(\ell\in L\cap J):\, t_\ell \le u_\ell}{t \le u}\;{\oplus}\mathrm{sub}_v
\qquad
\dfrac{t = \&\{\ell : t_\ell\}_{\ell\in L} \quad u = \&\{j : u_j\}_{j\in J} \quad (L \supseteq J) \quad \forall(j\in J):\, t_j \le u_j}{t \le u}\;{\&}\mathrm{sub}_v
\\[2ex]
\dfrac{t\ \mathsf{empty} \quad u = \sigma}{t \le u}\;{\bot}\mathrm{sub}_v
\qquad
\dfrac{t = t_1 \to t_2 \quad u = u_1 \to u_2 \quad u_1\ \mathsf{empty}}{t \le u}\;{\top}\mathrm{sub}^{\to\to}_v
\qquad
\dfrac{t = \&\{\ell : t_\ell\}_{\ell\in L} \quad u = u_1 \to u_2 \quad u_1\ \mathsf{empty}}{t \le u}\;{\top}\mathrm{sub}^{\&\to}_v
\qquad
\dfrac{t = t_1 \to t_2 \quad u = \&\{\,\}}{t \le u}\;{\top}\mathrm{sub}^{\to\&}_v
\\[2ex]
\dfrac{t = t_1 \otimes t_2 \quad t_i\ \mathsf{empty}}{t\ \mathsf{empty}}\;\mathrm{emp}^{\otimes_i}_v
\qquad
\dfrac{t = \oplus\{\ell : t_\ell\}_{\ell\in L} \quad \forall(\ell\in L):\, t_\ell\ \mathsf{empty}}{t\ \mathsf{empty}}\;\mathrm{emp}^{\oplus}_v
\\[2ex]
\text{(no emptiness rules for } \mathbf{1},\ \to,\ \text{and } \&\text{)}
\end{array}
$$

Fig. 6. Circular Derivation Rules for Call-by-Value Subtyping

The rules shown in Figure 6 build circular derivations.

These rules match those of Ligatti et al., with one minor exception that we will detail below. As we will prove, these rules are exactly those for which t ≤ u in the CBV language if and only if $t^{\boxplus} \le u^{\boxplus}$ in the CBPV metalanguage.

Before proceeding to the proofs, a few remarks about these rules. First, unlike the CBN ${\oplus}\mathrm{sub}_n$ rule, the ${\oplus}\mathrm{sub}_v$ rule here includes the possibility that some components of a variant record type may be empty. More generally, the differences between CBN and CBV subtyping arise from the differences in emptiness and fullness between the two calculi. Emptiness and fullness are quite sensitive to the eager/lazy distinction between the two evaluation strategies. Because this distinction manifests in almost every layer of a complex type, the two subtyping systems diverge more than one might expect.

Second, besides the adaptations mentioned above, the rules of Figure 6 diverge from those of Ligatti et al. in only one way. Ligatti et al. [55] have the rule "t ≤ u if $u = u_1 \to u_2$ and $u_1$ empty", which generalizes the ${\top}\mathrm{sub}^{\to\to}_v$, ${\top}\mathrm{sub}^{\&\to}_v$, and ${\top}\mathrm{sub}^{\to\&}_v$ rules of Figure 6 (assuming that Ligatti et al. would also have "t ≤ u if $u = \&\{\,\}$" had they included lazy records in their language).

Somewhat unexpectedly, polarized subtyping on the image of Levy's CBV translation would be incomplete with respect to this more general rule. This is because the $\downarrow$ shift inserted by Levy's translation acts as a barrier to fullness: "t ≤ u if $u = {\downarrow}r$ and r full" would be unsound in polarized subtyping. For example, Ligatti et al. have $\mathbf{1} \le \mathbf{0} \to \mathbf{1}$ for an empty type $\mathbf{0}$, but we do not have $\mathbf{1}^{\boxplus} = \mathbf{1} \le \downarrow(\mathbf{0} \to \uparrow\mathbf{1}) = (\mathbf{0} \to \mathbf{1})^{\boxplus}$ because the unit value $\langle\,\rangle$ does not have type $\downarrow(\mathbf{0} \to \uparrow\mathbf{1})$. This phenomenon is primarily of theoretical interest since it is confined to functions that can never be applied to any arguments and to empty records (and only when they are compared against CBV types $t_1 \otimes t_2$, $\mathbf{1}$, and $\oplus\{\ell : t_\ell\}_{\ell\in L}$). Nevertheless, we conjecture that a more differentiated translation of types and terms could restore completeness.

These observations notwithstanding, we can prove that the CBV subtyping rules of Figure 6 are sound and complete with respect to the subtyping rules for CBPV under Levy's translation. The proofs can be found in [49, Appendix M].

#### Theorem 9 (Soundness of Polarized Subtyping, Call-by-Value).

1. If $t^{\boxplus}$ empty, then $t$ empty. 2. If $t^{\boxplus} \le u^{\boxplus}$, then $t \le u$.
#### Theorem 10 (Completeness of Polarized Subtyping, Call-by-Value).

1. If $t$ empty, then $t^{\boxplus}$ empty. 2. If $t \le u$, then $t^{\boxplus} \le u^{\boxplus}$.

### 9 Related Work and Discussion

We now dive deeper into research related to our underlying theme: how polarization affects the definition of subtyping and its interaction with recursive types across varying interpretations.

Subtyping Recursive Types. The groundwork for coinductive interpretations of subtyping equirecursive types was laid by Amadio and Cardelli [9] and subsequently refined by others [13, 37]. Danielsson and Altenkirch [22] also provided significant inspiration, since they formally clarify that subtyping recursive types relies on mixed induction/coinduction. In using an equirecursive presentation within different calculi, our work has been influenced by its predominant use in session types [19, 23, 40] and, in particular, by Gay and Hole's coinductive subtyping algorithm [39], which we take as a template for call-by-name subtyping.

Another important influence has been the work on refinement types [24, 34], which are also recursive but exist within predefined universes of generative types. As such, their subtyping relations are simpler in their interactions, but they face many of the same issues, such as emptiness checking. One can see this paper as an attempt to free refinement types from some of their restrictions while retaining their good properties. The key ingredients are (1) explicitly separating values from computations via polarization, (2) the introduction of variant and lazy records and their width and depth subtyping rules (owing much to [70]), and (3) simple bidirectional typechecking. What is still missing is the use of intersections and unions that allow subtyping to propagate more richly to higher-order types [31].

Our treatment of empty—value-uninhabited—and full types in Section 4.1, as well as our call-by-value interpretation in Section 8.2 builds on Ligatti et al.'s work [55] on precise subtyping with isorecursive types.

Our direct interpretation of isorecursive types and translation into an equirecursive setting furthers numerous works either comparing or relating the two formulations [67, 73, 74]. In particular, Abadi and Fiore [1] and, more recently, Patrignani et al. [63] prove, with varying approaches, that terms typable in one setting can be typed in the other (and vice versa). The former treats type equality inductively and is focused on syntactic considerations. The latter treats type equality coinductively and analyzes types semantically. Neither of these handles subtyping or mixed coinductive/inductive types as in our study.

Finally, Zhou et al. [76] provide a helpful overview of subtyping recursive types at large, and they discuss how Ligatti et al.'s complete set of rules requires very specific environments for subtyping as well as non-standard subtyping rules. This observation demonstrates why our semantic typing/subtyping approach can offer a more flexible abstraction for reasoning about expressive type systems while maintaining type safety.

Semantic Typing and Subtyping. Semantic typing goes back to Milner's semantic soundness theorem [57], which defined a well-typed program as one that is semantically free of type violations. Whereas syntactic typing specifies a fixed set of syntactic rules from which safe terms can be constructed, semantic typing here combines two requirements: positive types circumscribe observable values, exposing their structure, while computations of negative types are only required to behave in a safe way. As we demonstrate throughout Section 5, we can prove our semantic definitions compatible with our syntactic typing rules, so that syntactic type soundness falls out easily (Theorem 6).

Milner's initial model did not scale well to richer types, such as recursive types. With a lens toward more expressive systems, step indexing has become a prominent approach [7, 8, 10, 27], which we use to observe that a computation in our model steps according to our dynamics.

As with syntactic/semantic typing, syntactic subtyping is the more typical approach to modeling subtyping relations, compared with its semantic counterpart. Nonetheless, in a line of work that has operated almost in parallel with research on semantic types, research on semantic subtyping has also made strides [35, 15, 66]. Mainly, these works exploit semantic subtyping to develop type systems based on set-theoretic subtyping relations and properties, particularly in the context of richer types, including polymorphic functions [17, 16, 65] and variants [18], recursive types (interpreted coinductively), and union, intersection, and negation connectives [36]. A major theme in this line of work is excising "circularity" [15, 36] by means of an involved bootstrapping technique, as issues arise when the denotation of a type is defined simply as the set of values having that type.

We depart from this line of research in the treatment of functions (defined computationally rather than set-theoretically), recursive types (an equirecursive setting, inductive for the positive layer and coinductive for the negative layer), both variant and lazy record types, and the commitment to explicit polarization (including our incorporation of emptiness/fullness). The last of these eliminates circularity and ties together multiple threads of this study.

With this combination of semantic typing and subtyping, our work provides a metatheory for a more interesting set of typed expressions while also providing a stronger and more flexible basis for type soundness [28], as semantic typing can reason about syntactically ill-typed expressions as long as those expressions are semantically well-typed. This combination scales well to our polarized, mixed setting and focus on subtyping in the presence of recursive types.

Polarized Type Theory and Call-by-Push-Value. At the core of this work has been the call-by-push-value (CBPV) calculus [53, 54] with its notions of values, computations, and the shifts between them. Beyond Levy's work, this subsuming paradigm has formed the foundation of much recent research, ranging from probabilistic domains [33] to reasoning about effects [56] and dependent types [64]. New et al.'s [60] gradual typing extension of the calculus shares similarities with our use of step indexing, but its relations (binary rather than unary), dynamics, and step-counting are treated differently, and its goals are very different as well; in particular, it does not cover subtyping.

To our knowledge, there are no direct treatments of subtyping recursive types in a CBPV system, nor applications of a full semantic typing approach to subtyping in this context. As we have shown, it is a fruitful setting for our investigation, since the explicit polarization of the language mirrors the mixed reasoning required to analyze subtyping.

Though CBPV and polarized type theory typically go hand-in-hand, there are investigations that look at polarization (focusing) and algebraic typing and subtyping from alternate perspectives. Steffen [72] predates Levy's research and presents polarity as a kinding system for exploiting monotone and antimonotone operators in subtyping function application. Abel [2] built upon this and extended it with sized types. The inherent connection between types and evaluation strategy has also been studied in the setting of program synthesis [71] and proof theory [58], but these do not share our specific semantic concerns.

Polarization as an organizing principle for subtyping is present in Zeilberger's thesis [75], but it addresses a fundamentally different problem in multiple ways, e.g., using "classical" types and continuations and lacking width and depth subtyping. The biggest difference, however, is that its setting considers refinement types, while we do not have a refinement relation and show that some of the advantages of refinement types can be achieved without the additional layer.

Two studies on a global approach to algebraic subtyping [26, 62] define subtyping relationships with generative datatype constructors while discussing polarity (here with a different meaning) and discarding semantic interpretations. However, the generative nature of the datatype constructors makes this work quite different from ours.

Mixed Coinductive/Inductive Reasoning for Recursive Types. The natural separation of positive and negative layers in CBPV led us through the literature on mixed coinductive/inductive definitions for recursive types. Related to our work in this paper, Danielsson and Altenkirch [22] and Jones and Pearce [46] provide definitions for equirecursive subtyping relations in a mixed setting while using a suspension monad for non-terminating computations, which shares an affinity with force/return CBPV computations. Danielsson and Altenkirch, however, do not try to justify the structural typing rules themselves via semantic typing of values or expressions, only the subtyping rules. Jones and Pearce are closer to our approach since they also use a semantic interpretation of types for expressions. While not polarized, they do consider inductive and coinductive types separately, but they do not lift them to cover function types, instead studying other constructs such as unions.

Komendantsky [48] manages infinitary subtyping (for only function and recursive types) via a semantic encoding, by folding an inductive relation into a coinductive one. We work in the opposite direction, turning the coinductive portion into an inductive one by step indexing. Lepigre and Raffalli [52] mix induction and coinduction in a syntax-directed framework, focusing on circular proof derivations and sized types [6], also managing inductive types coinductively. Cohen and Rowe [21] provide a proposal for circular reasoning in a mixed setting, but their focus is on a transitive closure logic built around least and greatest fixed point operators. It seems quite plausible that we could use such systems to formalize our investigation, although we found some merit in using step indexing and Brotherston and Simpson's circular proof system for induction [14].

### 10 Conclusion

We introduced a rich system of subtyping for an equirecursive variant of call-by-push-value and proved its soundness via semantic means. We also provided a bidirectional typechecking algorithm and illustrated its expressiveness through several different kinds of examples. We showed the fundamental nature of the results by deriving systems of subtyping for isorecursive types and for languages with call-by-name and call-by-value dynamics. The limitations of the present system lie primarily in the lack of intersection and union types and of parametric polymorphism, which are the subject of ongoing work.

Acknowledgements. We wish to express our gratitude to the anonymous reviewers of this paper for their comments. Support for this research was provided by the NSF under Grant No. 1718276 and by FCT through the CMU Portugal Program, the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020), and the project SafeSessions (PTDC/CCI-COM/6453/2020).

### References


LICS 2009, 11-14 August 2009, Los Angeles, CA, USA. pp. 71–80. IEEE Computer Society (2009), https://doi.org/10.1109/LICS.2009.34


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Structured Handling of Scoped Effects

Zhixuan Yang<sup>1</sup>, Marco Paviotti<sup>1</sup>, Nicolas Wu<sup>1</sup>, Birthe van den Berg<sup>2</sup>, and Tom Schrijvers<sup>2</sup>

<sup>1</sup> Imperial College London, London, United Kingdom {s.yang20,m.paviotti,n.wu}@imperial.ac.uk
<sup>2</sup> KU Leuven, Leuven, Belgium {birthe.vandenberg,tom.schrijvers}@kuleuven.be

Abstract. Algebraic effects offer a versatile framework that covers a wide variety of effects. However, the family of operations that delimit scopes are not algebraic and are usually modelled as handlers, thus preventing them from being used freely in conjunction with algebraic operations. Although proposals for scoped operations exist, they are either ad-hoc and unprincipled, or too inconvenient for practical programming. This paper provides the best of both worlds: a theoretically-founded model of scoped effects that is convenient for implementation and reasoning. Our new model is based on an adjunction between a locally finitely presentable category and a category of functorial algebras. Using comparison functors between adjunctions, we show that our new model, an existing indexed model, and a third approach that simulates scoped operations in terms of algebraic ones have equal expressivity for handling scoped operations. We consider our new model to be the sweet spot between ease of implementation and structuredness. Additionally, our approach automatically induces fusion laws of handlers of scoped effects, which are useful for reasoning and optimisation.

Keywords: Computational effects · Category theory · Haskell · Algebraic theories · Scoped effects · Handlers · Abstract syntax

### 1 Introduction

For a long time, monads [45, 60, 68] have been the go-to approach for purely functional modelling of and programming with side effects. However, in recent years an alternative approach, algebraic effects [48], has been gaining traction. A big breakthrough was the introduction of handlers [52], which made algebraic effects suitable for programming and has led to numerous dedicated languages and libraries implementing algebraic effects and handlers. In comparison to monads, algebraic effects provide a more modular approach to computations with effects, in which the syntax and semantics of effects are separated: computations invoking algebraic operations can be defined syntactically, and the semantics of the operations are given by handlers separately, in possibly many ways.

A disadvantage of algebraic effects is that they are less expressive than monads; not all effects can be easily expressed or composed within their confines. For instance, operations like catch for exception handling, spawn for parallel composition of processes, or once for restricting nondeterminism are not conventional algebraic operations; instead they delimit a computation within their scope. Such operations are usually modelled as handlers, but the problem is that they cannot be freely used amongst other algebraic operations: when a handler implementing a scoped operation is applied to a computation, the computation is transformed from a syntactic tree of algebraic operations into some semantic model implementing the scoped operation. Consequently, all subsequent operations on the computation can only be given in the particular semantic model rather than as mere syntactic operations, thus nullifying the crucial advantage of modularity when separating syntax and semantics of effects.

To remedy the situation, Wu et al. [70] proposed a practical, but ad-hoc, generalisation of algebraic effects in Haskell that encompasses scoped effects, which has been adopted by several algebraic effects libraries [32, 42, 56]. More recently, Piróg et al. [46] sought to put this ad-hoc approach to scoped effects on the same formal footing as algebraic effects. Their solution resulted in a construction based on a level-indexed category, called indexed algebras, as the way to give semantics to scoped effects. However, this formalisation introduces a disparity between syntax and semantics that makes indexed algebras not as structured as the programs they interpret: they use an ad-hoc hybrid fold that requires indexing for the handlers, but not for the program syntax. Moreover, indexed algebras are not ideal for widespread implementation, as they require dependent typing, at least in a limited form such as GADTs [25].

This paper presents a more structured way of handling scoped effects, which we call functorial algebras. They are principled and formally grounded on category theory, and at the same time more structured than the indexed algebras of Piróg et al. [46], in the sense that the structure of functorial algebras directly follows the abstract syntax of programs with scoped effects. Functorial algebras enjoy the following advantages over indexed algebras:


The structure and contributions of this paper are as follows:

– We highlight the loss of modularity when modelling scoped operations as handlers and sketch how the problem is solved using functorial algebras in Haskell, along with a number of programming examples (Section 2).


Finally, we discuss related work (Section 6) and conclude (Section 7). An extended version of this paper [72] contains appendices and proofs for this paper, and our implementations can also be found online [71].

### 2 Scoped Effects for the Working Programmer

We start with a recap of handlers of algebraic effects (Section 2.1), and then we highlight the loss of modularity when modelling non-algebraic effectful operations as handlers (Section 2.2). We then show how the problem is solved by modelling them as scoped operations and handling them with functorial algebras in Haskell (Section 2.3), whose categorical foundation will be developed later.

### 2.1 Handlers of Algebraic Effects

For the purpose of demonstration, in this section we base our discussion on a simplistic implementation of effect handlers in Haskell using free monads, although the problem with effect handlers highlighted in this section applies to other more practical implementations of effect handlers, either as libraries (e.g. [27, 33]) or standalone languages (e.g. [7, 36, 40]).

Following Plotkin and Pretnar [52], computational effects, such as exceptions, mutable state, and nondeterminism, are described by signatures of primitive effectful operations. Signatures can be abstractly represented by Haskell functors:

    class Functor f where fmap :: (a → b) → f a → f b

The following functor ES (with the evident Functor instance) is the signature of three operations: throwing an exception, writing and reading an Int-state:

    data ES x = Throw | Put Int x | Get (Int → x)                         (1)
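For concreteness, the evident Functor instance mentioned above reads:

```
instance Functor ES where
  fmap _ Throw     = Throw          -- no continuation to map over
  fmap f (Put s x) = Put s (f x)
  fmap f (Get k)   = Get (f . k)
```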

Typically, a constructor of a signature functor Σ has a type isomorphic to P → (R → x ) → Σ x for some types P and R. As in (1), the types of the three constructors are isomorphic to Throw :: () → (Void → x ) → ES x , Put :: Int → (() → x ) → ES x and Get :: () → (Int → x ) → ES x respectively where Void is the empty type. Each constructor of a signature functor Σ is thought of as an operation that takes a parameter of type P and produces a result of type R, or equivalently, has R-many possible ways to continue the computation after the operation. Given any (signature) functor Σ, computations invoking operations from Σ are modelled by the following datatype, called the free monad of Σ,

    data Free Σ a = Return a | Call (Σ (Free Σ a))

whose first case represents a computation that just returns a value, and the second case represents a computation calling an operation from Σ with more Free Σ a subterms as arguments, which are understood as the continuation of the computation after this call, depending on the outcome of this operation.

The inductive datatype Free Σ a comes with a recursion principle:

    handle :: (Σ b → b) → (a → b) → Free Σ a → b
    handle alg g (Return x) = g x
    handle alg g (Call op)  = alg (fmap (handle alg g) op)

which folds a tree of operations Free Σ a into a type b, providing a way Σ b → b, usually called a Σ-algebra, to perform operations from Σ on b and a way a → b to transform the returned type a of computations to b. The function handle can be used to give Free Σ a monad instance:

    return :: a → Free Σ a                (>>=) :: Free Σ a → (a → Free Σ b) → Free Σ b
    return = Return                       m >>= k = handle Call k m

The monadic instance allows the programmer to build effectful computations using the do-notation in a clean way. For example, the following program updates the state s to n / s for some n :: Int, and throws an exception when s is 0:

    safeDiv :: Int → Free ES Int
    safeDiv n = do s ← get
                   if s ≡ 0 then Call Throw
                            else do {put (n / s); return (n / s)}

where the auxiliary wrapper functions (the so-called smart constructors in the Haskell community) that invoke Call appropriately are

    get = Call (Get Return)          put n = Call (Put n (Return ()))

The free monad merely models effectful computations syntactically without specifying how these operations are actually implemented. Indeed, the program safeDiv above is defined without saying how mutable state and exceptions are implemented at all. To actually give useful semantics to programs built with free monads, the programmer uses the handle function above to interpret programs with Σ-algebras, which are called handlers in this context.
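As a concrete illustration (our own sketch, not from the paper; runES is a hypothetical name), the state and exception operations of ES can be handled into a state-passing function:

```
-- Interpret ES into a state-passing function; Nothing signals an
-- uncaught exception. (Illustrative sketch only.)
runES :: Free ES a -> Int -> Maybe (Int, a)
runES = handle alg gen
  where
    gen x          = \s -> Just (s, x)  -- return the final state and value
    alg Throw      = \_ -> Nothing      -- abort on an uncaught throw
    alg (Put s' k) = \_ -> k s'         -- continue with the new state
    alg (Get k)    = \s -> k s s        -- pass the current state along
```

Under this handler, runES (safeDiv 10) 2 would yield Just (5, 5), while runES (safeDiv 10) 0 would yield Nothing.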

For example, given a program r :: Free ES a for some a, a handler catchHdl r :: ES (Free ES a) → Free ES a that gives the usual semantics to throw is

    catchHdl :: Free ES a → ES (Free ES a) → Free ES a
    catchHdl r Throw = r
    catchHdl r op    = Call op                                            (2)

which evaluates r for recovery in case of throwing an exception, and leaves other operations untouched in the free monad. An important advantage of the approach of effect handlers is that different semantics of a computational effect can be given by different handlers. For example, suppose that in some scenario one would like to interpret exceptions as unrecoverable errors and stop the execution of the program when an exception is raised. Then the following handler can be defined for this behaviour:

    catchHdl′ :: Free ES a → ES (Free ES (Maybe a)) → Free ES (Maybe a)
    catchHdl′ r Throw = return Nothing
    catchHdl′ r op    = Call op                                           (3)

As expected, applying these two handlers to the program safeDiv 5 produces different results (of types Free ES Int and Free ES (Maybe Int) respectively):

    handle (catchHdl (return 42)) return (safeDiv 5)
      = do s ← get
           if s ≡ 0 then return 42 else do {put (5 / s); return (5 / s)}

    handle (catchHdl′ (return 42)) (return · Just) (safeDiv 5)
      = do s ← get
           if s ≡ 0 then return Nothing
                    else do {put (5 / s); return (Just (5 / s))}

Note that exception throwing and catching are modelled differently in the approach of algebraic effects and handlers, one as an operation in the signature ES and one as a handler, although it is natural to expect both of them to be operations of the effect of exceptions. This asymmetry results from the fact that exception catching is not algebraic: if catch was modelled as a binary operation in the signature, then the monadic bind >>= of the free monad earlier, which intuitively means sequential composition of programs, would imply that (catch r p) >>= k = catch (r >>= k) (p >>= k), which is semantically undesirable. Thus the perspective of Plotkin and Pretnar [52] is that non-algebraic operations like catch should be deemed different from algebraic operations, and they can be modelled as handlers (of algebraic operations).

### 2.2 Scoped Operations as Handlers Are Not Modular

However, this treatment of non-algebraic operations leads to a somewhat subtle complication: as observed by Wu et al. [70], when non-algebraic operations (such as catch) are modelled with handlers, these handlers play a dual role of (i) modelling the syntax of the operation (the scope for which exceptions are caught by catch) and (ii) giving semantics to it (when an exception is caught, run the recovery program). To see the problem more concretely, ideally one would like to have a syntactic operation catch of the following type that acts on computations without giving specific semantics a priori,

    catch :: Free ES a → Free ES a → Free ES a

allowing one to write programs like

    prog = do {x ← catch (safeDiv 5) (return 42); put (x + 1)}            (4)

and the semantics of (both algebraic and non-algebraic) operations in prog can be given separately by handlers. Unfortunately, when catch is modelled as handlers catchHdl or catchHdl<sup>0</sup> as in the last subsection, the program prog must be written differently depending on which handler is used:

```
do x ← handle (catchHdl (return 42)) return (safeDiv 5); put (x + 1)

vs.

do xMb ← handle (catchHdl′ (return 42)) (return · Just) (safeDiv 5)
   case xMb of
     Nothing  → return Nothing
     (Just x) → do r ← put (x + 1); return (Just r)
```
The issue is that these handlers interpret the operation catch in different semantic models, Free ES a and Free ES (Maybe a), and this affects both the value x that is returned and the way the subsequent put is expressed. Therefore, the non-algebraic operation catch, modelled as a handler, is not as modular as algebraic operations, weakening the advantage of programming with algebraic effects.

#### 2.3 Scoped Effects and Functorial Algebras

Now we present an overview of a solution to the problem highlighted above: we model exception catching as a scoped effect [46] and handle it using functorial algebras. The underlying machinery will be developed more formally in later sections.

Syntax of Scoped Operations To achieve modularity for (non-algebraic) operations delimiting scopes, such as catch, which are called scoped operations, Piróg et al. [46] generalise the free monad Free Σ to a monad Prog Σ Γ accommodating both algebraic and scoped operations. The monad is parameterised by two functors Σ and Γ, called the algebraic signature and the scoped signature respectively. The intention is that a constructor Op :: (R → x) → Σ x of the algebraic signature represents an algebraic operation Op producing an R-value as usual, whereas a constructor Sc :: (N → x) → Γ x of the scoped signature represents a scoped operation Sc creating N-many scopes enclosing programs.

Example 1. As in the previous subsection, the effect of exceptions has an algebraic operation for throwing exceptions, which produces no values, and a scoped operation for catching exceptions, which creates two scopes, one enclosing the program for which exceptions are caught, and the other enclosing the recovery computation. Thus the algebraic and scoped signatures are respectively

    data Throw x = Throw               data Catch x = Catch x x           (5)

Example 2. An effect of explicit nondeterminism has two algebraic operations for nondeterministic choice and a scoped operation Once:

    data Choice x = Fail | Or x x      data Once x = Once x               (6)

The intention is that this effect implements logic programming [20]—solutions to a problem are exhaustively searched: operation Or p q splits a search branch into two; Fail marks a failed branch; and the scoped operation Once p keeps only the first solution found by p, making it semi-deterministic, which is useful for speeding up the search with heuristics from the programmer.

Similar to the free monad, the Prog monad models the syntax of computations invoking operations from Σ and Γ:

    data Prog Σ Γ a = Return a
                    | Call  (Σ (Prog Σ Γ a))
                    | Enter (Γ (Prog Σ Γ (Prog Σ Γ a)))                   (7)

Thus an element of Prog Σ Γ a can either (i) return an a-value without causing effects, or (ii) call an algebraic operation in Σ with more subterms of Prog Σ Γ a as the continuation after the operation, or (iii) enter the scope of a scoped operation. The third case deserves more explanation: the first Prog in (Γ (Prog Σ Γ (Prog Σ Γ a))) represents the programs enclosed by the scoped operation, and the second Prog represents the continuation of the program after the scoped operation, and thus the boundary between programs inside and outside the scope is kept in the syntax tree, which is necessary because collapsing the boundary might change the meaning of a program. The distinction between algebraic and scoped operations can be seen more clearly from the monadic bind of Prog (the monadic return of Prog is just Return):

    (>>=) :: Prog Σ Γ a → (a → Prog Σ Γ b) → Prog Σ Γ b
    (Return a) >>= k = k a
    (Call op)  >>= k = Call (fmap (>>=k) op)
    (Enter sc) >>= k = Enter (fmap (fmap (>>=k)) sc)

For algebraic operations, extending the continuation (>>=k) directly acts on the argument to the algebraic operation, whereas for scoped operation, (>>=k) acts on the second layer of Prog. Thus for an algebraic operation o, (o p) >>= k and o (p >>= k) have the same representation, whereas for a scoped operation s, (s p) >>= k and s (p >>= k) have different representations, which is precisely the distinction between algebraic and scoped operations.

The constructors Call and Enter are clumsy to work with, and for writing programs more naturally, we define smart constructors for operations. Generally, for algebraic operations Op ::F x → Σ x and scoped operations Sc ::G x → Γ x , the smart constructors are

    op :: F (Prog Σ Γ a) → Prog Σ Γ a      sc :: G (Prog Σ Γ a) → Prog Σ Γ a
    op = Call · Op                         sc = Enter · fmap (fmap return) · Sc

For example, the smart constructor for Catch (Example 1) is

    catch :: Prog Σ Catch a → Prog Σ Catch a → Prog Σ Catch a
    catch h r = Enter (Catch (fmap return h) (fmap return r))
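Likewise, a smart constructor for throwing (used in program (8) below) follows the generic op pattern; a minimal sketch of ours, assuming the signatures of Example 1:

```
-- 'throw' is algebraic, so it needs no second Prog layer; 'g' is any
-- scoped signature, e.g. Catch. (Our sketch, following the 'op' pattern.)
throw :: Prog Throw g a
throw = Call Throw
```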

With all the machinery in place, we can now define the program (4) using Prog, which we could not write with Free:

    prog = do {x ← catch (safeDiv 5) (return 42); put (x + 1)}

Handlers of Scoped Operations Similar to Free, the Prog monad merely models the syntax of effectful computations, and more useful semantics need to be given

    data EndoAlg Σ Γ f = EndoAlg
      { returnE :: ∀x . x → f x
      , callE   :: ∀x . Σ (f x) → f x
      , enterE  :: ∀x . Γ (f (f x)) → f x }

    data BaseAlg Σ Γ f a = BaseAlg
      { callB  :: Σ a → a
      , enterB :: Γ (f a) → a }

    hcata :: (Functor Σ, Functor Γ) ⇒ EndoAlg Σ Γ f → Prog Σ Γ a → f a
    hcata alg (Return x)    = returnE alg x
    hcata alg (Call op)     = (callE alg · fmap (hcata alg)) op
    hcata alg (Enter scope) = (enterE alg · fmap (hcata alg · fmap (hcata alg))) scope

    handle :: (Functor Σ, Functor Γ)
           ⇒ EndoAlg Σ Γ f → BaseAlg Σ Γ f b → (a → b) → Prog Σ Γ a → b
    handle ealg balg gen (Return x) = gen x
    handle ealg balg gen (Call op)  = (callB balg · fmap (handle ealg balg gen)) op
    handle ealg balg gen (Enter sc) =
      (enterB balg · fmap (hcata ealg · fmap (handle ealg balg gen))) sc

Fig. 1: A Haskell implementation of handling with functorial algebras

by handlers. Although Piróg et al. [46] developed a notion of indexed algebras for this purpose, indexed algebras turn out to be more complicated than necessary (we will discuss them in Section 4), and the contribution of this paper is a simpler kind of handlers for scoped operations, which we call functorial algebras.

Given signatures Σ and Γ, a functorial algebra for them is a quadruple ⟨f, b, ealg, balg⟩, for some functor f called the endofunctor carrier and type b called the base carrier. The other two components, ealg :: EndoAlg Σ Γ f and balg :: BaseAlg Σ Γ f b, are called the endofunctor algebra and the base algebra. Their types are fully shown in Figure 1. The intuition is that the functor f and ealg interpret the parts of a program enclosed by scoped operations, and the type b and balg interpret the part of a program not enclosed by any scopes.

Example 3. The standard semantics of exception catching (cf. handler (2)) can be implemented by a functorial algebra with the conventional Maybe functor as the endofunctor carrier and the following EndoAlg:
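The record itself is elided here; a plausible reconstruction (our sketch, written excE, chosen to be consistent with the variant enterE′ shown later) is:

```
excE :: EndoAlg Throw Catch Maybe
excE = EndoAlg
  { returnE = Just                  -- a pure result succeeds
  , callE   = \Throw -> Nothing     -- an uncaught throw aborts the scope
  , enterE  = \(Catch p r) -> case p of
      Just k  -> k                  -- no exception: continue after the scope
      Nothing -> case r of          -- exception: fall back to the recovery
        Just k  -> k
        Nothing -> Nothing }
```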


For the base carrier that interprets operations not enclosed by any catch, a straightforward choice is just taking Maybe a as the base carrier for a type a, and setting callB = callE and enterB = enterE, which means that operations inside and outside scopes are interpreted in the same way.

In general, we can define a specialised version of handle (Figure 1) that only takes an endofunctor algebra as input for interpreting operations inside and outside scopes in the same way:

    handleE :: EndoAlg Σ Γ f → Prog Σ Γ a → f a
    handleE ealg@(EndoAlg {..}) = handle ealg (BaseAlg callE enterE) returnE

Applying handleE excE to the following program produces Just 43 as expected.

    do {x ← catch throw (return 42); return (x + 1)}                      (8)

For the non-standard semantics (cf. (3)) that disables exception recovery, one can define another endofunctor algebra excE′ by replacing enterE in excE with

    enterE′ :: Catch (Maybe (Maybe a)) → Maybe a
    enterE′ (Catch Nothing  _) = Nothing
    enterE′ (Catch (Just k) _) = k

With excE′, handling the program in (8) produces Nothing as expected.

Now we provide some intuition for how functorial algebras work. First note that the three fields of EndoAlg in Figure 1 precisely correspond to the three cases of Prog (7). Thus by replacing the constructors of Prog with the corresponding fields of EndoAlg, we have a polymorphic function hcata ealg :: ∀x . Prog Σ Γ x → f x (Figure 1) turning a program into a value in f .

The function handle (Figure 1) takes a functorial algebra, a function gen :: a → b and a program p as arguments, and it handles all the effectful operations in p by using hcata ealg for interpreting the part of p inside scoped operations and balg for interpreting the outermost layer of p outside any scoped operations. The function gen corresponds to the 'value case' of handlers of algebraic effects, which transforms the a-value returned by a program into the type b for interpretation.

We close this section with some more examples of handling scoped effects with functorial algebras. The supplementary material of this paper also contains an OCaml implementation of functorial algebras and the following examples.

Example 4. The standard way to handle explicit nondeterminism with the semideterministic operator once (Example 2) is using a functorial algebra with the list functor as the endofunctor carrier together with the following algebra:
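The algebra is elided here; a plausible reconstruction (our sketch, written ndetE, consistent with the results described below) is:

```
ndetE :: EndoAlg Choice Once []
ndetE = EndoAlg
  { returnE = \x -> [x]             -- a single solution
  , callE   = \op -> case op of
      Fail   -> []                  -- a failed branch has no solutions
      Or l r -> l ++ r              -- collect solutions of both branches
  , enterE  = \(Once xss) -> case xss of
      []       -> []                -- no solution found inside the scope
      (xs : _) -> xs }              -- keep only the first solution
```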


Then applying handleE ndetE to the following program produces [1, 2] as expected. In comparison, if once were algebraic, the result would be [1].

do {n ← once (or (return 1) (return 3)); or (return n) (return (n + 1))}

Example 5. In the last example we used the list functor to interpret explicit nondeterminism, resulting in the depth-first search (DFS) strategy. As noted by Spivey [59], other search strategies can be implemented by other choices of functors. For example, depth-bounded search (DBS) can be implemented with the functor Int → [a], and breadth-first search (BFS) with the functor [[a]] (or Kidney and Wu's [31] more efficient LevelT functor).

A powerful application of scoped effects is modelling search strategies:

    data Strategy x = DFS x | BFS x | DBS Int x

so that the programmer can freely specify the search strategy of nondeterministic choices in a scope. The algebraic signature Choice and scoped signature Strategy can be handled by a functorial algebra carried by the endofunctor ([a ], [[a ]],Int → [a ]) and a base type [a ] (assuming that depth-first search is the default strategy). The complete code is in the supplementary material.

Example 6. A scoped operation for the effect of mutable state is the operation local s p, which executes the program p with state s and restores the original state after p finishes. Thus (local s p) >>= k is different from local s (p >>= k), and local should be modelled as a scoped operation of signature data Local s a = Local s a. Together with the usual algebraic operations get and put of state, Local can be interpreted with a functorial algebra carried by the state monad type State s a = s → (s, a). The essential part of the functorial algebra is the following enterE for Local (complete code in the supplementary material):

    enterE :: Local s (State s (State s a)) → State s a
    enterE (Local s′ f) s = let (_, k) = f s′ in k s
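Following the generic sc pattern from Section 2.3, a smart constructor for local might read as follows (our sketch; sig stands for the algebraic state signature, and the evident Functor instance for Local s is assumed):

```
-- Evident functor instance for the scoped signature. (Our sketch.)
instance Functor (Local s) where
  fmap f (Local s x) = Local s (f x)

-- Enter a Local scope; the 'fmap return' inserts the inner layer that
-- marks the end of the scope, as in the generic 'sc' pattern.
local :: Functor sig => s -> Prog sig (Local s) a -> Prog sig (Local s) a
local s p = Enter (Local s (fmap return p))
```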

Example 7. Parallel composition of processes is not an operation in the usual algebraic presentations of process calculi [61, 62] precisely because it is not algebraic: (p | q) >>= k ≠ (p >>= k) | (q >>= k). Again, we can model it as a scoped operation, and different scheduling behaviours of processes can be given as different functorial algebras. The supplementary material contains complete code for handling parallel composition using the so-called resumption monad [11, 47].

### 3 Categorical Foundations for Scoped Operations

We now move on to a categorical foundation for scoped effects and functorial algebras. First, we recall some standard category theory underlying algebraic effects and handlers (Section 3.1) and also Piróg et al.'s [46] monad P that models the syntax of scoped operations, which is exactly the Prog monad in the Haskell implementation (Section 3.2). Then, we define functorial algebras formally (Section 3.3) and show that there is an adjunction between the category of functorial algebras and the base category (Section 3.4) inducing the monad P, which provides a means to interpret the syntax of scoped operations.

The rest of this paper assumes familiarity with basic category theory, such as adjunctions, monads, and initial algebras, which are covered by standard texts [6, 41, 55]. The mathematical notation in this paper is summarised in the appendices, which may be consulted if the meaning of some symbols is unclear.

### 3.1 Syntax and Semantics of Algebraic Operations

The relationships between equational theories, Lawvere theories, monads, and computational effects have been studied for decades from many perspectives [23, 30, 45, 48, 54, 57]. Here we recap a simplified version of the equational theories of Kelly and Power [30], which we follow to model algebraic and scoped effects on locally finitely presentable (lfp) categories [1].

Locally Finitely Presentable Categories The use of lfp categories in this paper is limited to some standard results about the existence of many initial algebras in lfp categories, and thus a reader not familiar with lfp categories may follow this paper with some simple intuition: a category C is lfp if it has all (small) colimits and a set of finitely presentable objects such that every object in C can be obtained by 'glueing' (formally, as filtered colimits of) some finitely presentable objects. For example, Set is lfp with finite sets as its finitely presentable objects, and indeed every set can be obtained by glueing, here meaning taking the union of, all its finite subsets: $X = \bigcup\,\{\, N \subseteq X \mid N\ \text{finite} \,\}$. Other examples of lfp categories include the category of partially ordered sets, the category of graphs, the category of small categories, and presheaf categories (we refer the reader to the excellent exposition [57] for concrete examples); lfp categories are thus broad enough to cover many semantic settings of programming languages.

Moreover, an endofunctor F : C → C is said to be finitary if it preserves 'glueing' (filtered colimits), which implies that its values FX are determined by its values at finitely presentable objects: $FX \cong F(\mathrm{colim}_i\, N_i) \cong \mathrm{colim}_i\, FN_i$, where the $N_i$ are the finitely presentable objects that generate X when glued together. For example, polynomial functors $\coprod_{n\in\mathbb{N}} P_n \times (-)^n$ on Set are finitary, where $P_n$ is a set for every n.

Algebraic Operations on LFP Categories Fixing an lfp category C, we take finitary endofunctors Σ : C → C as signatures of operations on C. Like in Section 2.1, the intuition is that every natural transformation $\coprod_{\mathcal{C}(R,-)} P \to \Sigma-$, for some object P : C and finitely presentable object R : C, stands for an operation taking a parameter of type P and R-many arguments. The category Σ-Alg of Σ-algebras is defined as usual: it has pairs ⟨X : C, α : ΣX → X⟩ as objects, and as morphisms ⟨X, α⟩ → ⟨X′, α′⟩ it has morphisms h : X → X′ such that h · α = α′ · Σh. The following classical results (see e.g. [2, 5]) give sufficient conditions for constructing initial and free Σ-algebras:

Lemma 1. If the category C has finite coproducts and colimits of all ω-chains and the functor Σ : C → C preserves them, then the forgetful functor $\mathcal{U}_{\Sigma}$ : Σ-Alg → C forgetting the structure maps has a left adjoint $\mathsf{Free}_{\Sigma}$ : C → Σ-Alg mapping every X : C to a Σ-algebra ⟨Σ*X, op_X⟩, where Σ*X denotes the initial algebra μY. X + ΣY and op_X : ΣΣ*X → Σ*X.

Lemma 1 is applicable to our setting, since C being lfp directly implies that it has all colimits, and finitary functors Σ preserve colimits of ω-chains because colimits of ω-chains are filtered. Hence we have an adjunction $\mathsf{Free}_{\Sigma} \dashv \mathcal{U}_{\Sigma}$ : Σ-Alg → C. We denote the monad arising from the adjunction by $\Sigma^{*} = \mathcal{U}_{\Sigma}\mathsf{Free}_{\Sigma}$ (which is implemented as the Free Σ monad in Section 2.1). The idea is still that syntactic terms built from operations in Σ are modelled by the monad Σ*, and semantics of operations are given by Σ-algebras. Any Σ-algebra ⟨X, α : ΣX → X⟩ and morphism g : A → X in C induce an interpretation morphism $handle_{\langle X,\alpha\rangle}\,g : \Sigma^{*}A \to X$ such that

$$handle_{\langle X,\alpha\rangle}\,g = \mathcal{U}_{\Sigma}(\epsilon_{\langle X,\alpha\rangle} \cdot \mathsf{Free}_{\Sigma}\,g) : \Sigma^{*}A = \mathcal{U}_{\Sigma}\mathsf{Free}_{\Sigma}A \to X \tag{9}$$

where $\epsilon_{\langle X,\alpha\rangle} : \mathsf{Free}_{\Sigma}\,\mathcal{U}_{\Sigma}\langle X,\alpha\rangle \to \langle X,\alpha\rangle$ is the counit of $\mathsf{Free}_{\Sigma} \dashv \mathcal{U}_{\Sigma}$.

Algebraic Effects and Handlers The perspective of Plotkin and Pretnar [52] is that computational effects are characterised by signatures Σ of primitive effectful operations, and they determine monads Σ<sup>∗</sup> that model programs syntactically. Then Σ-algebras are handlers [52] of operations that can be applied to programs using (9) to give specific semantics to operations.

The approach of algebraic effects has led to a significant body of research on programming with effects and handlers, but it imposes an assumption on the operations to be modelled: the construction of Σ* in Lemma 1 [2, 5] implies that the multiplication μ of the monad Σ* satisfies the algebraicity property: $op \cdot (\Sigma \circ \mu) = \mu \cdot (op \circ \Sigma^{*}) : \Sigma\Sigma^{*}\Sigma^{*} \to \Sigma^{*}$, where $op : \Sigma\Sigma^{*} \to \Sigma^{*}$. Intuitively, this means that every operation in Σ must commute with sequential composition of computations. Many, but not all, effectful operations satisfy this property, and they are called algebraic operations.

Adjoint Approach to Effects The crux of algebraic effects and handlers is the adjunction $\mathsf{Free}_{\Sigma} \dashv \mathcal{U}_{\Sigma}$. However, we have not relied on the adjunction being the free/forgetful one at all: given any monad P : C → C that models the syntax of effectful programs, if $L \dashv R : \mathcal{D} \to \mathcal{C}$ is an adjunction such that $RL \cong P$ as monads, then objects D in D provide a means to interpret programs PA: for any g : A → RD in C, we have the following interpretation morphism

$$\mathit{handle}_{D}\, g = R(\epsilon_{D} \cdot Lg) \;:\; PA \cong R(LA) \to RD \tag{10}$$

The intuition for g is that it transforms the returned value A of a computation into the carrier RD, so it corresponds to the 'value case' of effect handlers [8]. Piróg et al. [46] call this the adjoint-theoretic approach to the syntax and semantics of effects, and they construct an adjunction between indexed algebras and the base category for modelling scoped operations. Earlier, Levy [37] and Kammar and Plotkin [28] adopted a similar adjunction-based viewpoint in the treatment of call-by-push-value calculi: value types are interpreted in the base category C, and computation types are interpreted in the algebra category D.

Remark 1. A notable missing part of our treatment is the equations that specify operations in a signature. Following Kelly and Power [30], an equation for a signature Σ : C → C can be formulated as a pair of monad morphisms σ, τ : Γ* → Σ* for some finitary functor Γ, and taking their coequaliser Γ* ⇉ Σ* → M in the category of finitary monads constructs a monad M that represents terms modulo the equation given by σ and τ. Although it seems straightforward to extend this formulation of equational theories to work with scoped effects, we do not consider equations in this paper for the sake of simplicity.

Remark 2. Working with lfp categories precludes operations with infinitely many arguments, such as the get operation (1) of mutable state when the state has infinitely many possible values, but this limitation is not inherent and can be lifted by moving to locally κ-presentable categories [1] for some larger cardinal κ.

### 3.2 Syntax of Scoped Operations

Not all operations in programming languages can be adequately modelled as algebraic operations on Set: for example λ-abstraction [16], memory cell generation [38, 48] (more generally, effects with dynamically generated instances [62]), explicit substitution [18], and channel restriction in the π-calculus [61]; their syntax is usually modelled in functor categories. More recently, Piróg et al. [46] extended Ghani and Uustalu [18]'s work to model a family of non-algebraic operations, which they call scoped operations. In this subsection, we review their development in the setting of lfp categories. Throughout the rest of the paper, we fix an lfp category C, referred to as the base category, which is intended to be the category in which the types of a programming language are interpreted. Furthermore, we fix two finitary endofunctors Σ, Γ : C → C and call them the algebraic signature and the scoped signature respectively.

Syntax Endofunctor P Now our goal is to construct a monad P : C → C that models the syntax of programs with algebraic operations in Σ and non-algebraic scoped operations in Γ. First we construct its underlying endofunctor. When C is Set, the intuition for programs P A is that they are terms inductively built from the following inference rules:

$$\frac{a \in A}{var(a) \in PA} \qquad \frac{o \in \Sigma n \quad k : n \to PA}{o(k) \in PA} \qquad \frac{s \in \Gamma n \quad p : n \to PX \quad k : X \to PA}{\{s(p);k\} \in PA}$$

where n ranges over finite sets, o ∈ Σn represents an algebraic operation of |n| arguments, and similarly s ∈ Γn is a scoped operation that creates |n| scopes (X in the third rule ranges over sets). The difference between algebraic and scoped operations is manifested by the additional explicit continuation k in the third rule: it is not the case that sequentially composing s(p) with a continuation k can be absorbed into the arguments p, as it can for algebraic operations, so the continuation of a scoped operation must be kept explicitly in the syntax. When C is any lfp category, these rules translate to the following recursive equation for the functor P : C → C:

$$PA \;\cong\; A + \Sigma(PA) + \int^{X:\mathbb{C}} \coprod_{\mathbb{C}(X,\,PA)} \Gamma(PX) \tag{11}$$

where the existentially quantified X in the third rule is translated to a coend ∫^{X:C} in C [41]. Moreover, the coend in (11) is isomorphic to Γ(P(PA)): by the coend formula for Kan extensions, it computes exactly Lan_I(Γ∘P)(PA), the left Kan extension of Γ∘P along the identity functor I : C → C, and by definition Lan_I(Γ∘P) = Γ∘P. Thus (11) is equivalent to

$$PA \cong A + \Sigma(PA) + \Gamma(P(PA)) \tag{12}$$

which is exactly the Prog Σ Γ datatype that we saw in the Haskell implementation (7). To obtain a solution to (12), we construct a (higher-order) endofunctor G : Endo_f(C) → Endo_f(C) to represent the grammar, where Endo_f(C) is the category of finitary endofunctors on C:

$$G = \mathsf{Id} + \Sigma \circ - + \Gamma \circ - \circ - \tag{13}$$

where Id : C → C is the identity functor. Lemma 1 is then applicable: Endo_f(C) has all small colimits, since colimits in functor categories are computed pointwise and C has all small colimits; moreover, G preserves all filtered colimits, in particular colimits of ω-chains, because direct verification shows that the composition functor − ∘ = : Endo_f(C) × Endo_f(C) → Endo_f(C) is finitary. Since initial algebras are precisely free algebras generated by the initial object, Lemma 1 yields an initial G-algebra ⟨P : Endo_f(C), in : GP → P⟩, and in is an isomorphism. Thus the P obtained in this way is indeed a solution to (12)—the endofunctor modelling the syntax of programs with algebraic and scoped operations.
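Concretely, in Haskell the solution of (12) is the Prog Σ Γ datatype mentioned above; the following is a minimal sketch (constructor names are ours):

```
-- P A  ≅  A + Σ(P A) + Γ(P(P A)), cf. (12)
data Prog f g a
  = Var a                              -- returned values
  | Op (f (Prog f g a))                -- algebraic operations
  | Scope (g (Prog f g (Prog f g a)))  -- scoped operations: the inner Prog
                                       -- layer is the program inside the scope

instance (Functor f, Functor g) => Functor (Prog f g) where
  fmap h (Var a)    = Var (h a)
  fmap h (Op op)    = Op (fmap (fmap h) op)
  fmap h (Scope sc) = Scope (fmap (fmap (fmap h)) sc)
```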

Monadic Structure of P Next we equip the endofunctor P with a monad structure. This can be done in several ways: by the general result about Σ-monoids [14, 16] in Endo_f(C), by [43, Theorem 4.3], or by the following relatively straightforward argument from [46]. By the 'diagonal rule' for computing initial algebras of Backhouse et al. [4], P = μG (13) is isomorphic to P′ = μX. Id + Σ ∘ X + Γ ∘ P ∘ X. Note that P′ is exactly (Σ + Γ ∘ P)* as an endofunctor by Lemma 1, thus

$$P \cong (\Sigma + \Gamma \circ P)^{*} \;:\; \mathsf{Endo}_f(\mathbb{C}) \tag{14}$$

Then we equip P with the same monad structure as the ordinary free monad (Σ + Γ ∘ P)*. The implementation in (7) is exactly this monad structure.
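On the Haskell sketch above, this is the free-monad bind of (Σ + Γ ∘ P)*: substitution goes into the arguments of Op nodes but only into the outer layer of Scope nodes, leaving the enclosed program untouched (a sketch, assuming the Prog datatype from above):

```
instance (Functor f, Functor g) => Applicative (Prog f g) where
  pure = Var
  mf <*> mx = mf >>= \h -> fmap h mx

instance (Functor f, Functor g) => Monad (Prog f g) where
  Var a    >>= k = k a
  Op op    >>= k = Op (fmap (>>= k) op)
  -- the continuation is bound in the outer Prog layer only;
  -- the scoped body (the inner Prog) is left as it is
  Scope sc >>= k = Scope (fmap (fmap (>>= k)) sc)
```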

#### 3.3 Functorial Algebras of Scoped Operations

To interpret the monad P (12) modelling the syntax of scoped operations, it is natural to expect semantics to be given by G-algebras on Endo_f(C), so that interpretation is the catamorphism from μG to a G-algebra. Following the adjoint-theoretic approach (10), we would then like an adjunction between G-Alg and C whose induced monad is isomorphic to P. However, there seems to be no natural way to construct such an adjunction unless we replace G-algebras with a slight extension of them, which we refer to as functorial algebras, as the notion for giving semantics to scoped operations. In the following, we first define functorial algebras formally (Definition 1) and then establish the adjunction between the category of functorial algebras and the base category (Theorem 1), which allows us to interpret P with functorial algebras.

A functorial algebra is carried by an endofunctor H : C → C together with an object X in C. The endofunctor H comes with a morphism α^G : GH → H in Endo_f(C), and the object X is equipped with a morphism α^I : ΣX + Γ(HX) → X in C. The intuition is that, given a program of type PX ≅ X + Σ(PX) + Γ(P(PX)), the middle P in Γ∘P∘P corresponds to the part of the program enclosed by a scoped operation (the p in {s(p)>>=k}), and this part is interpreted by H with α^G. After the enclosed part is interpreted, α^I interprets the outermost layer of the program into X, in the same way as for free monads of algebraic operations. More precisely, let I : Endo_f(C) × C → C be a bifunctor such that

$$I_H X = \Sigma X + \Gamma(HX) \qquad\qquad I_\sigma f = \Sigma f + \Gamma(\sigma \cdot Hf) \tag{15}$$

for all H : Endo_f(C) and X : C and all morphisms σ : H → H′ and f : X → X′. (The first argument H to I is written as a subscript so that we have the more compact notation I*_H when taking the free monad of I_H : C → C with the first argument fixed.) Then we define an endofunctor Fn : Endo_f(C) × C → Endo_f(C) × C such that

$$\mathsf{Fn}\langle H, X \rangle = \langle GH,\; I_H X \rangle \tag{16}$$

Definition 1. A functorial algebra is an object ⟨H, X⟩ in Endo_f(C) × C paired with a structure map Fn⟨H, X⟩ → ⟨H, X⟩, or equivalently a quadruple

$$\left\langle H : \mathsf{Endo}_{f}(\mathbb{C}),\quad X : \mathbb{C},\quad \alpha^{G} : GH \to H,\quad \alpha^{I} : \Sigma X + \Gamma(HX) \to X \right\rangle$$

where GH = Id + Σ ∘ H + Γ ∘ H ∘ H. Morphisms between two functorial algebras ⟨H₁, X₁, α^G₁, α^I₁⟩ and ⟨H₂, X₂, α^G₂, α^I₂⟩ are pairs ⟨σ : H₁ → H₂, f : X₁ → X₂⟩ making the following diagrams commute:

$$\begin{array}{ccc} GH_1 & \xrightarrow{\;\alpha_1^G\;} & H_1\\ {\scriptstyle G\sigma}\big\downarrow & & \big\downarrow{\scriptstyle \sigma}\\ GH_2 & \xrightarrow{\;\alpha_2^G\;} & H_2 \end{array} \qquad\qquad \begin{array}{ccc} \Sigma X_1 + \Gamma(H_1 X_1) & \xrightarrow{\;\alpha_1^I\;} & X_1\\ {\scriptstyle I_\sigma f}\big\downarrow & & \big\downarrow{\scriptstyle f}\\ \Sigma X_2 + \Gamma(H_2 X_2) & \xrightarrow{\;\alpha_2^I\;} & X_2 \end{array}$$

Functorial algebras and their morphisms form a category Fn-Alg.

Example 8. We reformulate our programming example of nondeterministic choice with once from Example 4 in the formal setting. Let C = Set in this example and 1 = {⋆} be some singleton set. We define signature endofunctors

$$\Sigma X = 1 + X \times X \qquad\qquad \Gamma X = X$$

so that Σ represents the nullary algebraic operation fail and the binary algebraic operation or, and Γ represents the unary scoped operation once that creates one scope. Let List : Set → Set be the endofunctor mapping a set X to the set of finite lists


with elements from X. We define natural transformations α^Σ : Σ ∘ List → List and α^Γ : Γ ∘ List ∘ List → List by

$$\alpha_X^{\Sigma}(\iota_1\,\star) = \mathit{nil}, \quad \alpha_X^{\Sigma}(\iota_2\,\langle x, y\rangle) = x \mathbin{+\!\!+} y, \quad \alpha_X^{\Gamma}(\mathit{nil}) = \mathit{nil}, \quad \alpha_X^{\Gamma}(\mathit{cons}\ x\ xs) = x$$

where nil is the empty list, ++ is list concatenation, and cons x xs is the list with the element x in front of xs. Then for any set X, ⟨List, List X⟩ carries a functorial algebra with structure maps

$$\alpha^G = [\eta^{\mathit{List}}, \alpha^{\Sigma}, \alpha^{\Gamma}] : G\,\mathit{List} \to \mathit{List} \qquad \alpha^I = [\alpha_X^{\Sigma}, \alpha_X^{\Gamma}] : I_{\mathit{List}}X \to X \tag{17}$$

where η List : Id → List wraps any element into a singleton list.
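In Haskell, Example 8 reads as follows (a sketch with our own names; the signature functors correspond to Σ and Γ above):

```
{-# LANGUAGE DeriveFunctor #-}

-- Sigma X = 1 + X × X  and  Gamma X = X
data Choice k = Fail | Or k k deriving Functor
newtype Once k = Once k deriving Functor

-- alpha^Sigma: fail yields no results; or concatenates the results
algS :: Choice [a] -> [a]
algS Fail       = []
algS (Or xs ys) = xs ++ ys

-- alpha^Gamma: once keeps only the first result produced in the scope
algG :: Once [[a]] -> [a]
algG (Once [])       = []
algG (Once (xs : _)) = xs
```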

The last example exhibits that one can define a functorial algebra carried by ⟨H, HX⟩ from a G-algebra on H : Endo_f(C) by simply choosing the object component to be HX for an arbitrary X : C. In other words, there is a faithful functor G-Alg → Fn-Alg, which yields functorial algebras that interpret the outermost layer of a program (the part not enclosed by any scoped operation) in the same way as the inner layers. In general, however, the object component of a functorial algebra offers the flexibility that the outermost layer can be interpreted differently from the inner layers, as in the following example.

Example 9. Continuing Example 8, if one is only interested in the final number of possible outcomes, then one can define a functorial algebra ⟨List, ℕ, α^G, α^I⟩ where α^G is as in (17) and

$$\alpha^I(\iota_1(\iota_1\,\star)) = 0, \quad \alpha^I(\iota_1(\iota_2\,\langle x, y\rangle)) = x + y, \quad \alpha^I(\iota_2\ \mathit{nil}) = 0, \quad \alpha^I(\iota_2\,(\mathit{cons}\ n\ ns)) = n$$
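In the Haskell rendering, the object component of this algebra (again with our own names) is:

```
-- alpha^I of Example 9: the outermost layer only counts outcomes
baseS :: Choice Int -> Int
baseS Fail     = 0
baseS (Or m n) = m + n

baseG :: Once [Int] -> Int
baseG (Once [])      = 0
baseG (Once (n : _)) = n
```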

#### 3.4 Interpreting with Functorial Algebras

In the rest of this section we show how functorial algebras can be used to interpret programs P A (12) with scoped operations. We first construct a simple adjunction ↑ ⊣ ↓ between the base category C and Endo_f(C) × C, which is then composed with the free/forgetful adjunction Free_Fn ⊣ U_Fn between Endo_f(C) × C and Fn-Alg for the functor Fn (16). The resulting adjunction (18) is proven to induce a monad T isomorphic to P (Theorem 1), and by the adjoint-theoretic approach to syntax and semantics (10), this adjunction provides a means to interpret scoped operations modelled by the monad P (Theorem 2).

First we define functor ↑ : C → Endo<sup>f</sup> (C) × C such that ↑ X = h0, Xi where 0 : Endo<sup>f</sup> (C) is the initial endofunctor—the constant functor sending everything to the initial object in C. The functor ↑ is left adjoint to the projection functor ↓ : Endo<sup>f</sup> (C) × C → C of the second component.

Then we would like to compose ↑ ⊣ ↓ with the free/forgetful adjunction Free_Fn ⊣ U_Fn for the endofunctor Fn (16) on Endo_f(C) × C, and the latter adjunction indeed exists.

Lemma 2. The endofunctor Fn (16) on Endo_f(C) × C has free algebras, i.e. there is a functor Free_Fn : Endo_f(C) × C → Fn-Alg left adjoint to the forgetful functor U_Fn : Fn-Alg → Endo_f(C) × C.

These two adjunctions are depicted in the following diagram:

$$\mathsf{Fn}\text{-}\mathsf{Alg}\ \overset{\mathsf{Free}_{\mathsf{Fn}}}{\underset{U_{\mathsf{Fn}}}{\leftrightarrows}}\ \mathsf{Endo}_f(\mathbb{C}) \times \mathbb{C}\ \overset{\uparrow}{\underset{\downarrow}{\leftrightarrows}}\ \mathbb{C} \tag{18}$$

and we compose them to obtain an adjunction Free_Fn ↑ ⊣ ↓ U_Fn between Fn-Alg and C, giving rise to a monad T = ↓ U_Fn Free_Fn ↑. In the rest of this section, we prove that T is isomorphic to P (12) in the category of monads. This result is crucial for this paper, since it allows us to interpret scoped operations modelled by the monad P with functorial algebras in Fn-Alg.

We first establish a technical lemma characterising the free Fn-algebra on the product category Endo_f(C) × C in terms of the free algebras in C and Endo_f(C).

Lemma 3. There is a natural isomorphism between FreeFn and the following

$$\widehat{\mathsf{Free}}_{\mathsf{Fn}}\langle H, X \rangle = \left\langle G^{*}H : \mathsf{Endo}_f(\mathbb{C}),\quad (I_{G^{*}H})^{*}X : \mathbb{C},\quad op_H^{G^{*}},\quad op_X^{(I_{G^{*}H})^{*}} \right\rangle$$

where op^{G*}_H : G(G*H) → G*H and op^{(I_{G*H})*}_X : I_{G*H}((I_{G*H})*X) → (I_{G*H})*X are the structure maps of the free G-algebra and the free I_{G*H}-algebra respectively.

Theorem 1. Monads P (12) and T (18) are isomorphic as monads.

Remark 3. In general, the right adjoint ↓ U_Fn is not monadic since it does not reflect isomorphisms, which is a necessary condition for monadicity by Beck's monadicity theorem [41]. This entails that the category Fn-Alg of functorial algebras is not equivalent to the category of Eilenberg-Moore algebras of P. Nonetheless, as we will see in Section 4, functorial algebras and Eilenberg-Moore algebras have the same expressive power for interpreting scoped operations.

The isomorphism established in Theorem 1 enables us to interpret programs modelled by the monad P using functorial algebras, following (10): for any functorial algebra ⟨H, X, α^G, α^I⟩ (Definition 1) and any morphism g : A → X in the base category C, there is a morphism

$$\mathit{handle}_{\langle H, X, \alpha^G, \alpha^I \rangle}\, g = {\downarrow} U_{\mathsf{Fn}}(\epsilon_{\langle H, X, \alpha^G, \alpha^I \rangle} \cdot \mathsf{Free}_{\mathsf{Fn}} {\uparrow} g) \;:\; TA \cong PA \to X \tag{19}$$

which interprets programs P A with the functorial algebra ⟨H, X, α^G, α^I⟩. Furthermore, we can derive the following recursive formula (20) for this interpretation morphism, which is exactly the Haskell implementation in Figure 1.

Theorem 2 (Interpreting with Functorial Algebras). For any functorial algebra α = ⟨H, X, α^G, α^I⟩ as in Definition 1 and any morphism g : A → X for some A in the base category C, let h = ⦇α^G⦈ : P → H be the catamorphism from the initial G-algebra P to the G-algebra α^G : GH → H. The interpretation of P A with this algebra α and g satisfies

$$\mathit{handle}_{\alpha}\, g = [\,g,\quad \alpha_{\Sigma}^{I} \cdot \Sigma(\mathit{handle}_{\alpha}\, g),\quad \alpha_{\Gamma}^{I} \cdot \Gamma h_{X} \cdot \Gamma P(\mathit{handle}_{\alpha}\, g)\,] \cdot \mathit{in}_{A}^{\circ} \tag{20}$$

where in° : P → Id + Σ ∘ P + Γ ∘ P ∘ P is the isomorphism between P and GP, and α^I_Σ = α^I · ι₁ : ΣX → X and α^I_Γ = α^I · ι₂ : Γ(HX) → X are the two components of α^I : ΣX + Γ(HX) → X.

To summarise, we have defined the notion of functorial algebras, which we use to handle scoped operations. The heart of the development is the adjunction (18), which induces a monad isomorphic to the monad P (12) that models the syntax of programs with scoped operations; from it we derive a recursive formula (20) that interprets programs with functorial algebras. The formula is exactly the implementation in Figure 1: the datatype EndoAlg represents α^G in (20), the datatype BaseAlg corresponds to α^I, and the function hcata implements ⦇α^G⦈.
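Figure 1 is not reproduced here; the following sketch (with assumed record layouts) shows how (20) can be implemented on the Prog datatype from Section 3.2:

```
{-# LANGUAGE RankNTypes #-}

-- alpha^G : GH -> H, split into its three components
data EndoAlg f g h = EndoAlg
  { retE   :: forall x. x -> h x            -- Id component
  , opE    :: forall x. f (h x) -> h x      -- Sigma component
  , scopeE :: forall x. g (h (h x)) -> h x  -- Gamma component
  }

-- alpha^I : Sigma X + Gamma(H X) -> X, split into its two components
data BaseAlg f g h b = BaseAlg
  { opB    :: f b -> b
  , scopeB :: g (h b) -> b
  }

-- hcata: the catamorphism from P to the G-algebra alpha^G
hcata :: (Functor f, Functor g) => EndoAlg f g h -> Prog f g a -> h a
hcata alg (Var a)    = retE alg a
hcata alg (Op op)    = opE alg (fmap (hcata alg) op)
hcata alg (Scope sc) = scopeE alg (fmap (hcata alg . fmap (hcata alg)) sc)

-- handle follows the recursive formula (20)
handle :: (Functor f, Functor g)
       => EndoAlg f g h -> BaseAlg f g h b -> (a -> b) -> Prog f g a -> b
handle ealg balg gen (Var a)    = gen a
handle ealg balg gen (Op op)    = opB balg (fmap (handle ealg balg gen) op)
handle ealg balg gen (Scope sc) =
  scopeB balg (fmap (hcata ealg . fmap (handle ealg balg gen)) sc)
```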

### 4 Comparing the Models of Scoped Operations

Functorial algebras are not the only option for interpreting scoped operations. In this section we compare functorial algebras with two other approaches: Piróg et al. [46]'s indexed algebras, and Eilenberg-Moore (EM) algebras of the monad P (12), which simulate scoped operations with algebraic operations. After a brief description of these two kinds of algebras, we compare them and show that their expressive power is in fact equivalent.

#### 4.1 Interpreting Scoped Operations with Eilenberg-Moore Algebras

In standard algebraic effects, handlers are just Σ-algebras for some signature functor Σ : C → C, and it is well known that the category Σ-Alg of Σ-algebras is equivalent to the category C^{Σ*} of EM algebras of the monad Σ*. Thus handlers of algebraic operations are exactly EM algebras of the monad Σ* modelling the syntax of algebraic operations. This observation suggests that we may also use EM algebras of the monad P (12) as the notion of handlers for scoped operations.

Lemma 4. EM algebras of P are equivalent to (Σ + Γ ◦ P)-algebras. In other words, an EM algebra of P is equivalently a tuple

$$\langle X : \mathbb{C},\ \alpha_{\Sigma} : \Sigma X \to X,\ \alpha_{\Gamma} : \Gamma(PX) \to X \rangle \tag{21}$$

Thus we obtain a way of interpreting scoped operations based on the adjunction Free_{Σ+Γ∘P} ⊣ U_{Σ+Γ∘P}: given an EM algebra α = ⟨X, α_Σ, α_Γ⟩ of P as in (21), for any A : C and morphism g : A → X, the interpretation of P A by g and this EM algebra is

$$\mathit{handle}_{\alpha}\, g = U_{\Sigma + \Gamma \circ P}(\epsilon_{\alpha} \cdot \mathsf{Free}_{\Sigma + \Gamma \circ P}\, g) \;:\; PA \cong (\Sigma + \Gamma \circ P)^{*} A \to X \tag{22}$$

The formula (22) can also be turned into a recursive form:

$$\mathit{handle}_{\alpha}\, g = [\,g,\quad \alpha_{\Sigma} \cdot \Sigma(\mathit{handle}_{\alpha}\, g),\quad \alpha_{\Gamma} \cdot \Gamma P(\mathit{handle}_{\alpha}\, g)\,] \cdot \mathit{in}_{A}^{\circ} \tag{23}$$

which is suitable for implementation (see the appendices for more details).
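On the Prog sketch, (23) amounts to the following interpreter, in which the scoped body is handed to α_Γ as an uninterpreted Prog value (a sketch, not the paper's appendix code):

```
-- An EM algebra of P as a (Sigma + Gamma . P)-algebra, cf. (21)
handleEM :: (Functor f, Functor g)
         => (f b -> b)              -- alpha_Sigma
         -> (g (Prog f g b) -> b)   -- alpha_Gamma
         -> (a -> b)                -- the value case g
         -> Prog f g a -> b
handleEM aS aG gen (Var a)    = gen a
handleEM aS aG gen (Op op)    = aS (fmap (handleEM aS aG gen) op)
handleEM aS aG gen (Scope sc) = aG (fmap (fmap (handleEM aS aG gen)) sc)
```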

Interpreting scoped operations with EM algebras can be understood as simulating scoped operations with algebraic operations and general recursion: a signature (Σ, Γ) of algebraic-and-scoped operations is simulated by a signature Σ + Γ ∘ P of algebraic operations, where P is recursively given by (Σ + Γ ∘ P)*. In this way, one can simulate scoped operations in languages implementing algebraic effects that allow signatures of operations to be recursive, such as [7, 19, 36], but not in the original design by Plotkin and Pretnar [52], which requires signatures of operations to mention only some base types.

The downside of this simulating approach is that the denotational semantics of the language becomes more complex and usually involves solving some domain-theoretic recursive equations, like in [7]. Moreover, this approach typically requires handlers to be defined with general recursion, which obscures the inherent structure of scoped operations, making reasoning about handlers of scoped operations more difficult.

#### 4.2 Indexed Algebras of Scoped Effects

Indexed algebras of scoped operations by Piróg et al. [46] are yet another way of interpreting scoped operations. They are based on the following adjunction:

$$\mathsf{Ix}\text{-}\mathsf{Alg}\ \overset{\mathsf{Free}_{\mathsf{Ix}}}{\underset{U_{\mathsf{Ix}}}{\leftrightarrows}}\ \mathbb{C}^{|\mathbb{N}|}\ \overset{\Uparrow}{\underset{\Downarrow}{\leftrightarrows}}\ \mathbb{C} \tag{24}$$

where C^{|ℕ|} is the functor category from the discrete category |ℕ| of natural numbers to the base category C. That is to say, an object in C^{|ℕ|} is a family of objects A_i in C indexed by natural numbers i ∈ |ℕ|, and a morphism τ : A → B in C^{|ℕ|} is a family of morphisms τ_i : A_i → B_i in C (with no coherence conditions). An endofunctor Ix : C^{|ℕ|} → C^{|ℕ|} is defined to characterise indexed algebras:

$$\mathsf{Ix}\,A = \Sigma \circ A + \Gamma \circ ({\lhd} A) + ({\rhd} A),$$

where ◁ and ▷ are functors C^{|ℕ|} → C^{|ℕ|} shifting indices such that (◁A)_i = A_{i+1}, (▷A)_0 = 0, and (▷A)_{i+1} = A_i. Objects in Ix-Alg are called indexed algebras. Furthermore, since a morphism (▷A) → A is in bijection with a morphism A → (◁A), an indexed algebra can be given by the following tuple:

$$\langle A : \mathbb{C}^{|\mathbb{N}|},\quad a : \Sigma \circ A \to A,\quad d : \Gamma({\lhd} A) \to A,\quad p : A \to {\lhd} A \rangle \tag{25}$$

The operational intuition is that the carrier A_i at level i interprets the part of the syntax enclosed by i layers of scopes. When interpreting a scoped operation Γ(P(PX)) at layer i, the part of the syntax outside the scope is interpreted first, resulting in Γ(P A_i); then the indexed algebra provides a way p to promote the carrier to the next level, resulting in Γ(P A_{i+1}). After the inner layer is also interpreted, giving Γ A_{i+1}, the indexed algebra provides a way d to demote the carrier, producing A_i again. Additionally, the morphism a interprets ordinary algebraic operations.

Example 10. Example 8 for nondeterministic choice with once can be expressed with an indexed algebra as follows. For any set X, we define an indexed object A : C^{|ℕ|} by A_0 = List X and A_{i+1} = List A_i. The object A carries an indexed algebra with the following structure maps: for all i ∈ ℕ, a_i(ι₁ ⋆) = nil and

$$a_i(\iota_2\,\langle x, y\rangle) = x \mathbin{+\!\!+} y, \quad d_i(\mathit{nil}) = \mathit{nil}, \quad d_i(\mathit{cons}\ x\ xs) = x, \quad p_i(x) = \mathit{cons}\ x\ \mathit{nil}$$

The adjunction Free_Ix ⊣ U_Ix in (24) is the free/forgetful adjunction for Ix on C^{|ℕ|}. The other adjunction ⇑ ⊣ ⇓ is given by ⇓A = A_0, (⇑X)_0 = X, and (⇑X)_{i+1} = 0 for all i ∈ ℕ. Importantly, Piróg et al. [46] show that the monad induced by the adjunction (24) is isomorphic to the monad P (12), thus indexed algebras can also be used to interpret scoped operations:

$$\mathit{handle}_{\langle A,a,d,p\rangle}\, g = {\Downarrow} U_{\mathsf{Ix}}(\epsilon_{\langle A,a,d,p\rangle} \cdot \mathsf{Free}_{\mathsf{Ix}}\, {\Uparrow} g) \tag{26}$$

in the same way as for functorial algebras in Section 3.4. Interpreting with indexed algebras can also be implemented in Haskell with GHC's DataKinds extension for type-level natural numbers (see the appendices).

#### 4.3 Comparison of Resolutions

Now we come back to the real subject of this section: comparing the expressivity of the three ways of interpreting scoped operations. Specifically, we construct comparison functors between the respective categories of the three kinds of algebras, which translate one kind of algebra to another in a way that preserves the induced interpretation in the base category. Categorically, the three kinds of algebras correspond to three resolutions of the monad P, which form a category of resolutions (Definition 2) with comparison functors as morphisms. In this category, the Eilenberg-Moore resolution is the terminal object, which automatically gives us comparison functors translating the other kinds of algebras to EM algebras. To complete the circle of translations, we then construct comparison functors K^EM_Fn : C^P → Fn-Alg translating EM algebras to functorial ones (Section 4.4) and K^Fn_Ix : Fn-Alg → Ix-Alg translating functorial algebras to indexed ones (Section 4.5).

Definition 2 (Resolutions [35]). Given a monad M on C, the category Res(M) of resolutions of M has as objects adjunctions ⟨D, L ⊣ R : D → C, η, ε⟩ whose induced monad RL is M. A morphism from a resolution ⟨D, L ⊣ R, η, ε⟩ to ⟨D′, L′ ⊣ R′, η′, ε′⟩ is a functor K : D → D′, called a comparison functor, that commutes with the left and right adjoints, i.e. KL = L′ and R′K = R.

We have seen adjunctions for indexed algebras, EM algebras and functorial algebras respectively, each inducing the monad P up to isomorphism, so each of them can be identified with an object in the category Res(P). For each resolution ⟨D, L, R, η, ε⟩, we have been using the objects D in D to interpret scoped operations modelled by P: for any morphism g : A → RD in C, the interpretation of P A by D and g is handle_D g = R(ε_D · Lg) : PA = RLA → RD. Crucially, we show that interpretations are preserved by comparison functors.

Lemma 5 (Preservation of Interpretation). Let K : D → D′ be any comparison functor between resolutions ⟨D, L, R, η, ε⟩ and ⟨D′, L′, R′, η′, ε′⟩ of some monad M : C → C. For any object D in D and any g : A → RD in C,

$$\mathit{handle}_D\ g = \mathit{handle}_{KD}\ g \;:\; MA \to RD\ (= R'KD) \tag{27}$$

where the two sides interpret MA using L ⊣ R and L′ ⊣ R′ respectively.

This lemma implies that if there is a comparison functor K from a resolution L ⊣ R : D → C to a resolution L′ ⊣ R′ : D′ → C of the monad P, then K translates every D-object to a D′-object inducing the same interpretation. Thus the expressive power of D for interpreting P is not greater than that of D′, in the sense that every handle_D g that one can obtain from an object D in D can also be obtained from the algebra KD in D′. Consequently, the three kinds of algebras for interpreting scoped operations have the same expressivity if we can construct a circle of comparison functors between their categories, which is what we do in the following.

Translating to EM Algebras As shown in [41], an important property of the Eilenberg-Moore adjunction is that it is the terminal object in the category Res(M) for any monad M; that is, there is a unique comparison functor from every resolution to the Eilenberg-Moore resolution. Specifically, given a resolution ⟨D, L, R, η, ε⟩ of a monad M, the unique comparison functor K from D to the category C^M of Eilenberg-Moore algebras is

$$KD = \left(M(RD) = RLRD \xrightarrow{\,R\epsilon_D\,} RD\right) \qquad\text{and}\qquad K(D \xrightarrow{\,f\,} D') = Rf$$

Lemma 6. There uniquely exist comparison functors K^Ix_EM : Ix-Alg → C^P and K^Fn_EM : Fn-Alg → C^P from the resolutions of indexed algebras and functorial algebras to the resolution of EM algebras.

#### 4.4 Translating EM Algebras to Functorial Algebras

Now we construct a comparison functor K^EM_Fn : C^P → Fn-Alg translating EM algebras to functorial ones. The idea is straightforward: given an EM algebra on X, we map it to the functorial algebra with X interpreting the outermost layer and the functor P interpreting the inner layers, which essentially leaves the inner layers uninterpreted until they reach the outermost layer.

Since C^P is isomorphic to (Σ + Γ ∘ P)-Alg, we can define K^EM_Fn on (Σ + Γ ∘ P)-algebras instead. Any ⟨X : C, α : (Σ + Γ ∘ P)X → X⟩ is mapped by K^EM_Fn to the functorial algebra

$$\langle P,\ X,\ \mathit{in} : GP \to P,\ \alpha : (\Sigma + \Gamma \circ P)X \to X \rangle$$

and any morphism f in (Σ + Γ ∘ P)-Alg is mapped to ⟨id_P, f⟩. To show that K^EM_Fn is a comparison functor, we only need to show that it commutes with the left and right adjoints of both resolutions. Details can be found in the appendices.

Lemma 7. The functor K^EM_Fn is a comparison functor from the Eilenberg-Moore resolution of P to the resolution Free_Fn ↑ ⊣ ↓ U_Fn of functorial algebras.

#### 4.5 Translating Functorial Algebras to Indexed Algebras

At this point we have comparison functors K^Ix_EM : Ix-Alg → C^P and K^EM_Fn : C^P → Fn-Alg. To complete the circle of translations, we construct a comparison functor K^Fn_Ix : Fn-Alg → Ix-Alg in this subsection. The idea of this translation is that, given a functorial algebra carried by an endofunctor H : Endo_f(C) and an object X : C, we map it to an indexed algebra obtained by iterating the endofunctor H on X. More precisely, K^Fn_Ix : Fn-Alg → Ix-Alg maps a functorial algebra

$$\langle H : \mathsf{Endo}_f(\mathbb{C}),\ X : \mathbb{C},\ \alpha^{G} : \mathsf{Id} + \Sigma \circ H + \Gamma \circ H \circ H \to H,\ \alpha^{I} : \Sigma X + \Gamma(HX) \to X \rangle$$

to an indexed algebra carried by A : C^{|ℕ|} with A_i = H^i X, i.e. iterating H i times on X. The structure maps ⟨a : Σ ∘ A → A, d : Γ(◁A) → A, p : A → ◁A⟩ of this indexed algebra are given by

$$\begin{aligned} a_0 &= (\alpha^I \cdot \iota_1) : \Sigma X \to X & a_{i+1} &= (\alpha^G_{H^i X} \cdot \iota_2) : \Sigma H H^i X \to H^{i+1} X\\ d_0 &= (\alpha^I \cdot \iota_2) : \Gamma H X \to X & d_{i+1} &= (\alpha^G_{H^i X} \cdot \iota_3) : \Gamma H H H^i X \to H^{i+1} X \end{aligned}$$

and p_i = α^G_{H^i X} · ι₁ : H^i X → H H^i X. On morphisms, K^Fn_Ix maps a morphism ⟨τ : H → H′, f : X → X′⟩ in Fn-Alg to the family σ in Ix-Alg with σ_i : H^i X → H′^i X′, defined by σ_0 = f and σ_{i+1} = τ ∘ σ_i, where ∘ is horizontal composition.

Lemma 8. K^Fn_Ix is a comparison functor from the resolution Free_Fn ↑ ⊣ ↓ U_Fn of functorial algebras to the resolution Free_Ix ⇑ ⊣ ⇓ U_Ix of indexed algebras.

Since comparison functors preserve interpretation (Lemma 5), the lemma above implies that the expressivity of functorial algebras is not greater than indexed ones. Together with the comparison functors defined earlier, we conclude that the three kinds of algebras—indexed, functorial and Eilenberg-Moore algebras—have the same expressivity for interpreting scoped operations.

Remark 4. Although the three kinds of algebras have the same expressivity in theory, they structure the interpretation of scoped operations in different ways: EM algebras impose no constraint on how the part of syntax enclosed by scopes is handled; indexed algebras demand that it be handled layer by layer but impose no coherence conditions between the layers; functorial algebras additionally force all inner layers to be handled in a uniform way by an endofunctor.

On the whole, it is a trade-off between simplicity and structuredness: EM algebras are the simplest for implementation, whereas the structuredness of functorial algebras makes them easier to reason about. This is another instance of the preference for structured programming over unstructured language features, in the same way that structured loops are favoured over goto, although they have the same expressivity in theory [13].

### 5 Fusion Laws of Interpretation

An advantage of the adjoint-theoretic approach to syntax and semantics is that the naturality of an adjunction directly offers fusion laws of interpretation, which fuse a morphism following an interpretation into a single interpretation. Such laws have proven to be a powerful tool for reasoning about and optimising programs that manipulate abstract syntax [12, 21, 65, 66], in particular handlers of algebraic effects [69, 73]. In this section, we present the fusion law for functorial algebras.

#### 5.1 Fusion Laws of Interpretation

Recall that given any resolution L ⊣ R with counit ε of some monad M : C → C, where L : C → D, for any g : A → RD we have an interpretation morphism

$$\mathit{handle}_D\ g = R(\epsilon_D \cdot Lg) \;:\; MA \to RD$$

Then whenever we have a morphism of the form f · handle_D g—an interpretation followed by some morphism—the following fusion law allows one to fuse it into a single interpretation morphism.

Lemma 9 (Interpretation Fusion). Assume L ⊣ R is a resolution of a monad M : C → C, where L : C → D. For every D : D, g : A → RD and f : RD → X, if there are some D′ and h : D → D′ in D such that RD′ = X and Rh = f, then

$$f \cdot \mathit{handle}_D\, g = \mathit{handle}_{D'}\,(f \cdot g) \tag{28}$$

Applying the lemma to the three resolutions of P gives us three fusion laws: for any D : D where D ∈ {Ix-Alg, Fn-Alg, C^P}, one can fuse f · handle_D g into a single interpretation provided one can exhibit f as (the carrier part of) a D-morphism. In particular, the following is the fusion law for functorial algebras.

Corollary (Fusion Law for Functorial Algebras). Let α̂₁ = ⟨H₁, X₁, α^G₁, α^I₁⟩ be a functorial algebra (Definition 1) and g : A → X₁, f : X₁ → X₂ be any morphisms in C. If there is a functorial algebra α̂₂ = ⟨H₂, X₂, α^G₂, α^I₂⟩ and a functorial algebra morphism ⟨σ : H₁ → H₂, f : X₁ → X₂⟩ from α̂₁ to α̂₂, then

$$f \cdot \mathit{handle}_{\hat{\alpha}_1}\, g = \mathit{handle}_{\hat{\alpha}_2}\,(f \cdot g)$$

Example 11. Let α̂ be the functorial algebra of nondeterminism with once in Example 8 and len : List A → ℕ be the function mapping a list to its length. Then, by the fusion law, len · handle_α̂ g = handle_β̂ (len · g) if we can find a suitable functorial algebra β̂ : Fn-Alg and a morphism h : α̂ → β̂ such that ↓ U_Fn h = len. In fact, a suitable β̂ is just the functorial algebra in Example 9, with h = ⟨id, len⟩.
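With the sketches from earlier sections, this instance of fusion can be spelled out concretely (names are ours):

```
nondetE :: EndoAlg Choice Once []
nondetE = EndoAlg { retE = \x -> [x], opE = algS, scopeE = algG }

listB :: BaseAlg Choice Once [] [a]
listB = BaseAlg { opB = algS, scopeB = algG }

countB :: BaseAlg Choice Once [] Int
countB = BaseAlg { opB = baseS, scopeB = baseG }

-- The fused and unfused interpretations agree:
--   length . handle nondetE listB (\a -> [a])
-- =          handle nondetE countB (\_ -> 1)
```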

Example 12. Although Piróg et al. [46] propose the adjunction (24) to interpret scoped operations with indexed algebras, their Haskell implementation is not a faithful implementation of the interpretation morphism (26), but a more efficient one that skips the step of transforming P into the isomorphic free indexed algebra (⇓ U_Ix Free_Ix ⇑). However, it was previously unclear whether this implementation indeed coincides with the interpretation morphism (26), due to the discrepancy between the syntax monad P and indexed algebras.

This issue was in fact one of our original motivations for developing functorial algebras—a way to interpret P that directly follows the syntactic structure. Using the comparison functors to translate between indexed and functorial algebras, we can reason about Piróg et al. [46]'s implementation with functorial algebras, and its correctness can be established using fusion laws. This extended case study is in the appendices.

### 6 Related Work

The most closely related work is that of Piróg et al. [46] on categorical models of scoped effects. That work in turn builds on Wu et al. [70], who introduced the notion of scoped effects after identifying modularity problems with using algebraic effect handlers for catching exceptions [52]. Scoped effects have found their way into several Haskell implementations of algebraic effects and handlers [32, 42, 56].

Effect Handlers and Modularity Spivey [60], Moggi [44] and Wadler [67] initiated the use of monads for modelling and programming with computational effects. Soon after, the desire arose to define complex monads by combining modular definitions of individual effects [26, 63], and monad transformers were developed to meet this need [39]. Several years later, algebraic effects were proposed as a more structured alternative approach to defining and combining computational effects [22, 48, 49]. The addition of handlers [52] has made them practical for implementation, and many languages and libraries have been developed since. Schrijvers et al. [58] have characterised modular handlers by means of modular carriers and shown that they correspond to a subclass of monad transformers.

Scoped operations are generally not algebraic operations in the original design of algebraic effects [48], but as we have seen in Section 4.1, an alternative view on Eilenberg-Moore algebras of scoped operations is to regard them as handlers of algebraic operations of signature Σ + Γ ∘ P. However, the functor Σ + Γ ∘ P involves the type P modelling computations, and thus it is not a valid signature of algebraic effects in the original design of effect handlers [51, 52], in which the signature of algebraic effects can only be built from some base types, to avoid interdependence between the denotations of signature functors and computations. In spite of that, many later implementations of effect handlers, such as Eff [7], Koka [36] and Frank [40], do not impose this restriction on signature functors (at the cost that the denotational semantics involves solving recursive domain-theoretic equations), and thus scoped operations can be implemented in these languages with EM algebras as handlers.

Other variations of scoped effects have been suggested. Recently, Poulsen et al. [53] and van den Berg et al. [9] have proposed a notion of staged or latent effects, a variant of scoped effects for modelling the deferred execution of computations inside lambda abstractions and similar constructs. Ahman and Pretnar [3] investigate asynchronous effects, and they note that interrupt handlers are in fact scoped operations. We have not yet investigated this in our framework, but it will be an interesting use case.

Abstract Syntax This work focusses on the problem of abstract syntax and semantics of programs. The practical benefit of abstract syntax is that it allows for generic programming in languages like Haskell that have support for, e.g., type classes, GADTs [25], and so on. As an example, Swierstra [64] showed that it is possible to modularly create compilers by formalising syntax in Haskell.

Fiore et al. [16, 17] first formalise abstract syntax categorically for operations with variable binding. Subsequently, Ghani and Uustalu [18] model the abstract syntax of explicit substitutions as an initial algebra in the endofunctor category and show that it is a monad. Piróg et al. [46] and this paper use a monad P, which is a slight generalisation of the monad of explicit substitutions, to model the syntax of scoped operations. The datatype underlying P is an instance of the nested datatypes studied by Bird and Paterson [10] and Johann and Ghani [24].

In this paper we have not treated equations on effectful operations, which are both theoretically and practically important. Plotkin and Power [48] show that theories of various effects with suitable equations determine their corresponding monads, and later Hyland et al. [22] show that certain combinations of effect theories are equivalent to monad transformers. Equations are also used for reasoning about programs with algebraic effects and handlers [34, 50, 73]. Possible ways to extend scoped effects with equations include the approach in [29] (Remark 1), the categorical framework of equational systems [14], second order Lawvere theories [15], and syntactic frameworks like [62].

### 7 Conclusion

The motivation of this work is to develop a structured approach to the syntax and semantics of scoped operations. We believe our proposal, functorial algebras, is at a sweet spot in the trade-off between structuredness and simplicity, allowing practical examples of scoped operations to be programmed and reasoned about naturally, and to be implemented in modern functional languages such as Haskell and OCaml. We put our model and two other models for interpreting scoped effects in the same categorical framework, and we showed that they have equivalent expressivity for interpreting scoped effects, although they form non-equivalent categories. The uniform theoretical framework also induces fusion laws of interpretation in a straightforward way.

There are two strands of work that should be pursued from here. The first would be investigating ways to compose algebras of scoped operations. The second would be the design of a language supporting handlers of scoped operations natively, along with its type system and operational semantics.

### Acknowledgements

This work is supported by EPSRC grant number EP/S028129/1 on 'Scoped Contextual Operations and Effects', by FWO project G095917N, and KU Leuven project C14/20/079. The authors would like to thank the anonymous reviewers for their constructive feedback.

### References

1. Adámek, J., Rosický, J.: Locally Presentable and Accessible Categories. London Mathematical Society Lecture Note Series, Cambridge University Press (1994). https://doi.org/10.1017/CBO9780511600579


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Region-based Resource Management and Lexical Exception Handlers in Continuation-Passing Style

Philipp Schuster (), Jonathan Immanuel Brachthäuser, and Klaus Ostermann

University of Tübingen, Germany philipp.schuster@uni-tuebingen.de

Abstract. Regions are a useful tool for the safe and automatic management of resources. Due to their scarcity, resources are often limited in their lifetime which is associated with a certain scope. When control flow leaves the scope, the resources are released. Exceptions can non-locally exit such scopes and it is important that resources are also released in this case.

Continuation-passing style is a useful compiler intermediate language that makes control flow explicit. All calls are tail calls and the runtime stack is not used. It can also serve as an implementation technique for control effects like exceptions. In this case throwing an exception means jumping to a continuation which is not the current one.

How is it possible to offer region-based resource management and exceptions in the same language and translate both to continuation-passing style? In this paper, we answer this question. We present a typed language with resources and exceptions, and its translation to continuation-passing style. The translation can be defined modularly for resources and exceptions – the correct interaction between the two automatically arises from simple composition. We prove that the translation preserves well-typedness and semantics.

### 1 Introduction

Regions were originally introduced for the safe and automatic management of memory [33]. Since then, much research extended their usefulness for memory management in different scenarios [9, 12–14]. Regions are also a useful tool for controlling the allocation, release, and use of any kind of scarce resource even when considering memory to be plentiful [19]. Resources are organized into a stack of regions which corresponds to nested scopes in the program. Resources in a region are automatically released when control flow leaves the corresponding scope. A type-and-region system guarantees resource safety, i.e., that there is no access to a resource outside of its corresponding scope.

Exceptions allow for non-local exits from scopes. It is important that resources are released not only upon normal return, but also when an exception is thrown. A type-and-effect system statically ensures that certain error conditions do not occur when running a program. In the case of exceptions, for example, we want to guarantee exception safety, i.e., every exception is eventually caught. Some work on regions explicitly caters to exceptions [14, 18, 19, 32]. Still, the interaction between regions, exceptions, and first-class functions is non-trivial. To the best of our knowledge region safety for a language with this combination of features has not yet been formally established.

Continuation-passing style (CPS) is an attractive [2, 8, 17] intermediate representation for programs. Control flow is explicit and many program optimizations amount to simple inlining and beta reduction. CPS can also be an implementation technique for control effects like exceptions [16, 17, 26]. Optimization of programs using these features still amounts to inlining and reduction. In CPS all calls are tail calls. Importantly, there is no runtime stack that a thrown exception unwinds. Instead, throwing an exception means jumping to a continuation other than the current one.
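As a minimal illustration (not the paper's translation), here is how an exception handler can be represented in CPS as just another continuation, sketched in Haskell:

```
type K r a = a -> r   -- continuations with answer type r

-- Divide, jumping to the handler's continuation instead of the
-- current one when the divisor is zero.
divCPS :: Int -> Int -> K r Int -> K r Int -> r
divCPS x y throwK k
  | y == 0    = throwK 0
  | otherwise = k (x `div` y)

-- divCPS 5 0 (\_ -> -1) id  evaluates to  -1: the current continuation
-- id is simply discarded; no stack is unwound.
```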

A CPS translation (from a source to a target language in CPS) must of course be correct, i.e. preserve the semantics of the source language. Ideally, the target language is also typed, and the translation takes well-typed terms to well-typed terms. Moreover, when we translate a source program with exceptions to CPS, well-typedness of the target term should also entail exception safety. However, there is not yet a single CPS translation for both exceptions and resource management in the same language. Moreover, since in CPS there is no stack, it is not possible to run cleanup actions during unwinding. Therefore it is not clear how such a combination in CPS could guarantee proper release of resources when an exception is thrown.

We present an intermediate language Λρ with resources and exceptions. It has a type-and-effect system keeping track of regions to model both the lifetime of resources and the scope of exception handlers. We define its operational semantics as an instrumented [23] abstract machine, which manipulates a runtime stack. We prove progress (Theorem 1) and preservation (Theorem 2) for this semantics in the proof assistant Coq. Resource safety (Corollary 1) and exception safety (Corollary 2) follow as corollaries. To our knowledge, this is the first proof of safety for a language with region-based resource management, exceptions, and first-class functions.

We define a CPS translation from Λ<sup>ρ</sup> to System F with base types and primitive operations. The translation takes well-typed terms to well-typed terms (Theorem 3). We implemented the translation as a shallow embedding into the dependently typed language Idris 2. It does not use any special runtime constructs, neither for regions nor for exceptions. The translation is correct: translated terms simulate the abstract machine semantics step-wise (Theorem 4). This entails resource safety and exception safety for CPS translated terms.

Our key technical idea is to understand regions as describing the runtime stack. In the operational semantics, language constructs for resources and exceptions push freshly generated markers onto the runtime stack. At runtime, a region stands for the concrete list of markers on the stack and subregioning evidence stands for the concrete difference between two such lists. In CPS there is no stack. Under our CPS translation, regions are answer types [30], and subregioning evidence terms are answer-type coercing functions. They move from one region to another one. This allows us to define the CPS translations of resource management and exceptions separately while having them interact correctly.
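A rough Haskell sketch of this idea, with the representation and the direction of the coercions assumed (the paper's actual translation is given in Section 4):

```
-- A statement checked in region r is translated to a CPS computation
-- whose answer type plays the role of r.
newtype Stmt r a = Stmt { run :: (a -> r) -> r }

-- Evidence that r2 is a subregion of r1 becomes a coercion between the
-- corresponding answer types (assumed direction; it stands for the
-- cleanup actions separating the two regions on the stack).
type Ev r2 r1 = r1 -> r2

refl :: Ev r r            -- the empty evidence 0
refl = id

trans :: Ev r3 r2 -> Ev r2 r1 -> Ev r3 r1   -- evidence composition (⊕)
trans l32 l21 = l32 . l21
```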

The rest of the paper is organized as follows. In Section 2, we introduce the main ideas behind our language Λρ. In Section 3, we formally present Λρ. We start with a base language with type-level region tracking and term-level subregioning evidence. We gradually extend this base language with region-based resource management and exceptions. In Section 4, we define the CPS translation for Λ<sup>ρ</sup> to System F. We do so gradually, first for the base language, then for resources, then for exceptions. In Section 5 we compare to related work and in Section 6 we summarize the key ideas and outline future work.

### 2 Overview

Here, we provide an informal overview of our main ideas and the language Λρ. We start by re-iterating how regions are used for resource management. We then introduce exceptions and show how we translate them to CPS. Finally, we combine resources and exceptions and demonstrate how our translation reveals information about the use of resources in the presence of non-local exits.

### 2.1 Regions for Resources

As a first example, let us see how regions can be used to manage file handles in Λρ. Our type system follows Fluet and Morrisett [12] and Kiselyov and Shan [19] with some minor differences.

Example 1. Consider the following simple example, which copies the first line of a file "input" into a file "output" and additionally inserts a line at the beginning and a line at the end of the output file. Both files are automatically closed and any attempt, accidental or not, to use them after they are closed will fail.

```
pool { [r1](p1 : Pool r1, l1 : r1 v Top) ⇒
  val out: File r1 = open(p1, "output", 0);
  writeln(out, "start", 0);
  pool { [r2](p2 : Pool r2, l2 : r2 v r1) ⇒
    val in: File r2 = open(p2, "input", 0);
    val firstLine = readln(in, 0);
    writeln(out, firstLine, l2)
  };
  writeln(out, "end", 0);
  return ()
}
```
We use a pool { ... } statement to create a fresh resource pool. A pool is a reference to a list of open files. All files in this list are automatically closed when control flow leaves the enclosed block. The pool statement introduces a region variable r1, a pool variable p1 and subregioning evidence l1. We then open the file "output" in pool p1. In our type system, every statement is checked in a region. The overall statement is checked in the top-level region Top. The enclosed block is checked in region r1. When we open a file, we have to explicitly pass evidence that the current region is a subregion of the pool's region. In this example, we pass the reflexivity evidence 0 : r1 v r1. We create a second pool p2 in a second region r2, which is clearly inside of r1. This fact is witnessed by the evidence variable l2. When we write to the output file, we have to provide evidence that the current region r2 is inside of the file's region r1. We provide l2 : r2 v r1.

For this simple example, after applying our CPS translation and some beta reduction we get the following straight-line code.

```
λk.
 let p1 = createPool ();
 let out = openFile p1 "output";
 writeLine out "start";
 let p2 = createPool ();
 let in = openFile p2 "input";
 let firstLine = readLine in;
 writeLine out firstLine;
 releasePool p2;
 writeLine out "end";
 releasePool p1;
 k ()
```
The original program did not contain any interesting control flow, and our CPS translation results in a sequence of primitive operations. There is no overhead for protecting resources when no exception is thrown. Later we will see how we clean up resources when there are exceptions. But first, let us look at our CPS translation of exceptions.

#### 2.2 Regions for Exception Handlers

Exceptions abort the current computation to an exception handler. An exception that is thrown while the corresponding handler is not on the stack results in an error condition that we statically prevent from happening. In Λρ, we use the same mechanism for resources and exceptions and enforce exception safety in terms of regions: in order to throw to an exception handler, we require evidence that the corresponding handler is still on the stack.

Exceptions in Λ<sup>ρ</sup> are lexically scoped: the connection between a thrown exception and its handler is established by a variable that stands for this very handler [5, 6, 35, 36]. This style of exceptions is in contrast to traditional exceptions, which are caught by the dynamically closest handler. Lexical exception handlers have advantages when reasoning about higher-order functions. Operationally, each try statement generates a fresh marker at runtime and pushes a catch frame with this marker onto the stack. We explicitly pass these markers as values of type Catch r. For example, consider the following program.

Example 2. The function safeDiv divides two numbers, but throws an exception when the second number is zero.

```
def safeDiv[r](x : Int, y : Int, e : Catch r) at r {
  if (y == 0) { throw(e, 0) }
  else { return (x / y) }
}
```
In addition to the two parameters x and y, the function safeDiv receives a catch marker e. When y is zero we throw to e. For this to be safe we need to guarantee that we only throw to e in the dynamic extent of the corresponding exception handler. But this is the very same problem we had with pools. So we use the very same solution: When we throw to a catch frame of type Catch r we have to provide evidence that the current region is a subregion of the catch's region, in this example 0 : r v r.

The function safeDiv is region polymorphic. It abstracts over a region variable r. It is also annotated to run in the region r. To handle the exception we use our safeDiv function as follows.

```
try { [r1](e1 : Catch r1, l1 : r1 v Top) ⇒
  safeDiv[r1](5, 0, e1)
} catch { return 0 }
```
Very much like the pool statement, the exception handler introduces a region variable r1, a handler e1, and subregioning evidence l1. In the call to safeDiv, we instantiate the region variable r to r1 and pass the exception handler e1. The example illustrates that we can guarantee exception safety by the very same mechanism we use for region safety.

When we translate this program to CPS, inline the function safeDiv, and after applying beta reduction and commuting conversions we get the following:

```
λk2. if (0 ≡ 0) then k2 0 else k2 (5 / 0)
```

When we translate programs to CPS, control flow becomes explicit. This is also true in the presence of control effects like exceptions. Because of this, optimizing programs in CPS amounts to beta reduction. How then can we achieve the same in the presence of resources and exceptions?

#### 2.3 Combining Resources and Exceptions

Consider the following simple program that mixes pools and exceptions.

Example 3. We install an exception handler and create two resource pools. We open a file in the inner pool, open a file in the outer pool, and then throw an exception.

```
try { [r1](e1 : Catch r1, l1 : r1 v Top) ⇒
  pool { [r2](p2 : Pool r2, l2 : r2 v r1) ⇒
    pool { [r3](p3 : Pool r3, l3 : r3 v r2) ⇒
      open(p3, "input", 0);
      open(p2, "output", l3);
      throw(e1, l3 ⊕ l2)
    }
  }
} catch { return 1 }
```
To open files into the pools, we have to provide evidence, as before. To throw an exception to the outer handler e1, we provide evidence that region r3 is inside of r1: we compose the evidence variables as l3 ⊕ l2 to get evidence of type r3 v r1.

This program, after CPS translation, reduces to the following program. The exception handler is known and will be eliminated. Again, simplifying control flow amounts to beta reduction as usual in CPS.

```
λk.
 let p2 = createPool ();
 let p3 = createPool ();
 openFile p3 "input";
 openFile p2 "output";
 releasePool p3;
 releasePool p2;
 k 1
```
In our framework, these simplifications of control flow also correctly account for proper creation and release of resources. We can blindly reduce the translated program without any extra considerations.

#### 2.4 First-Class Functions

The language Λ<sup>ρ</sup> supports first-class functions. For example, consider the following program which factors out a common pattern as a higher-order function.

```
def withFile[r0](path: String, f: [r](File r, r v r0) →r Unit) at r0 {
  pool { [r1](p1: Pool r1, l1: r1 v r0) ⇒
    val file = open(p1, path, 0);
    f[r1](file, l1)
  }
}
```
The function withFile is region polymorphic: it abstracts over the region r0 in which it can be used. The function f must be region polymorphic too, because we use it under a new region r1. We instantiate its region parameter with r1 and pass the evidence l1. It would be possible to give withFile the following signature:

```
withFile : [r0](path: String, f: [r](File r) →r Unit) →r0 Unit
```
Here, the function parameter f would not receive any evidence. This variant of withFile would be less useful, as f could not access any resources from outside of the call-site of withFile.

### 3 A Language with Regions, Resources, and Exceptions

In this section, we formally present Λ<sup>ρ</sup> and its operational semantics. We will introduce Λ<sup>ρ</sup> step-by-step starting with a base language with support for type-level region tracking but no interesting term-level features that make use of them. We then add resource pools, exceptions, and finally consider the combination of the two. The operational semantics is given in terms of an abstract machine that manipulates a runtime stack. In Section 4, we present a CPS translation of Λρ, following the same incremental development.

The paper is accompanied by a mechanized formalization of Λ<sup>ρ</sup> and its operational semantics in the Coq theorem prover [3], including the usual theorems of Progress (Theorem 1) and Preservation (Theorem 2). Resource- and exception safety follow as corollaries: whenever we use a resource (like a file) it is live (Corollary 1), and whenever we throw an exception the corresponding handler is on the stack (Corollary 2).

Our operational semantics will push freshly generated markers onto the runtime stack. A region is the list of concrete markers on the stack and evidence is the list of markers that is the difference between two such lists. Although they do not play any role computationally, for our proofs we will substitute these lists for region variables and evidence variables at runtime. Our typing rule for runtime evidence then makes proving region safety and exception safety possible.

#### 3.1 Syntax

Figure 1 defines the syntax of the core of Λρ. We use fine-grain call-by-value [22] and syntactically distinguish between statements, which can have effects, and pure expressions.

Function values (i.e., { [r](x : τ) at ρ ⇒ s }) abstract over a list of type-level region parameters (i.e., r), and a list of term-level parameters (i.e., x : τ). Each function is defined to run exactly in a region ρ, but otherwise functions are unsurprising. Since our focus is on the interaction between regions and control effects, we omit type abstraction from this presentation. Our mechanization includes type polymorphism, which is orthogonal to the rest of the calculus. We define the following short-hand notation for named function definitions:

$$\mathsf{def}\ f[\overline{r}](\overline{x : \tau})\ \mathsf{at}\ \rho\ \{\,s_0\,\};\ s \;\;\equiv\;\; \mathsf{val}\ f = \mathsf{return}\ \{\,[\overline{r}](\overline{x : \tau})\ \mathsf{at}\ \rho \Rightarrow s_0\,\};\ s$$

The list of region parameters scopes over the parameter types, the return type, the annotated region ρ, and the body s of the function. We apply functions to a list of regions ρ and a list of arguments e.

We introduce two additional concepts: type-level regions and term-level evidence. Type-level regions ρ are region variables r or the top-level region ⊤.


#### Names:

x, y, l ::= x | y | l | ... (value variables)

r, s ::= r | s | ... (region variables)

Fig. 1. Syntax of the core of Λρ.

Intuitively, the top-level region denotes the bottom part of the runtime stack. Term-level evidence expressions are either the empty evidence 0 witnessing reflexivity of subregioning, or the composition of evidence e ⊕ e′ witnessing transitivity of subregioning. By convention, we use the meta-variables f and l to stand for variables of function type and evidence type respectively, and we use the meta-variable i to stand for expressions of evidence type.
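Operationally (Section 3.3), regions and evidence will be read as lists of markers. The following Haskell sketch (marker representation assumed by us) previews that reading: reflexivity is the empty list, and composition is concatenation, which is associative as required below.

```haskell
-- Sketch: evidence as the list of markers separating two nested regions.
type Marker   = String   -- e.g. "@a5f"; the representation is our assumption
type Evidence = [Marker]

refl :: Evidence          -- the empty evidence 0 : ρ ⊑ ρ
refl = []

compose :: Evidence -> Evidence -> Evidence   -- e ⊕ e', associative
compose = (++)
```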

#### 3.2 Typing

Figure 2 defines the typing of core Λρ. We type statements and expressions with different judgement forms. While both are typed in an environment Γ containing value and region bindings, only statements are typed in a given region ρ. Statements may perform effectful (that is, serious in the terminology of Reynolds [24]) computation, which is only safe in specific regions. In contrast, expressions are pure (that is, trivial) and can be evaluated independent of any region.

Typing of Statements Rule Val types sequencing of statements. We type the two statements s0 and s in the same region ρ of the compound statement. Returning a result of a computation (rule Ret) can be typed in any region. In rule App, we apply a function e0 to a list of regions ρ and to a list of arguments e. The type of e0 is a function type in a region ρ0. The overall statement is typed in a region ρ. The premise ρ = ρ0[r ↦ ρ] requires that, after substituting regions ρ for the region variables r, both have to be syntactically the same. Note that we do not have any implicit or explicit subtyping of function types here or elsewhere. All region subtyping exclusively occurs through the passing of subregioning evidence.

#### Statement Typing: Γ | ρ ⊢ s : τ

$$\frac{\Gamma \vdash e_0 : \forall[\overline{r}](\overline{\tau}) \to^{\rho_0} \tau_0 \qquad \overline{\Gamma \vdash e : \tau[\overline{r \mapsto \rho}]} \qquad \rho = \rho_0[\overline{r \mapsto \rho}]}{\Gamma \mid \rho \vdash e_0[\overline{\rho}](\overline{e}) \;:\; \tau_0[\overline{r \mapsto \rho}]}\;\text{[App]}$$

$$\frac{\Gamma \mid \rho \vdash s_0 : \tau_0 \qquad \Gamma,\, x_0 : \tau_0 \mid \rho \vdash s : \tau}{\Gamma \mid \rho \vdash \mathsf{val}\ x_0 = s_0;\ s \;:\; \tau}\;\text{[Val]} \qquad \frac{\Gamma \vdash e : \tau}{\Gamma \mid \rho \vdash \mathsf{return}\ e \;:\; \tau}\;\text{[Ret]}$$

#### Expression Typing: Γ ⊢ e : τ

$$\frac{\Gamma(x) = \tau}{\Gamma \vdash x : \tau}\;\text{[Var]} \qquad \frac{}{\Gamma \vdash n : \mathsf{Int}}\;\text{[Lit]} \qquad \frac{\Gamma,\, \overline{r},\, \overline{x : \tau} \mid \rho \vdash s_0 : \tau_0}{\Gamma \vdash \{[\overline{r}](\overline{x : \tau})\ \mathsf{at}\ \rho \Rightarrow s_0\} \;:\; \forall[\overline{r}](\overline{\tau}) \to^{\rho} \tau_0}\;\text{[Fun]}$$

$$\frac{}{\Gamma \vdash 0 : \rho \sqsubseteq \rho}\;\text{[Reflexive]} \qquad \frac{\Gamma \vdash e : \rho \sqsubseteq \rho' \qquad \Gamma \vdash e' : \rho' \sqsubseteq \rho''}{\Gamma \vdash e \oplus e' : \rho \sqsubseteq \rho''}\;\text{[Transitive]}$$

Fig. 2. Type system of the core of Λρ.

Typing of Expressions The typing rules for variables Var and primitives Lit are standard. Rule Fun types functions. We type the body s0 of the function in an environment extended with the region parameters r and the value parameters x : τ. Every function is annotated with a region ρ that specifies exactly the region it will have to be called in. This region ρ is also the region we type the body s0 in. The region parameters r may appear in the parameter types, the return type, the function's region ρ, and the body s0. This allows us to write region-polymorphic functions that can run in any region. Value parameters of evidence type allow us to write region-polymorphic functions that are constrained to run in a subregion that meets these constraints.

Reflexivity evidence 0 witnesses that every region is nested within itself, and evidence e ⊕ e′ witnesses the transitivity of nesting, which is reflected in their typing rules. We require the composition of evidence to be associative.

#### 3.3 Operational Semantics

Figure 3 presents the operational semantics of core Λρ. A machine state ⟨s ‖ K⟩ consists of the statement s under evaluation and the runtime stack K. For the core of Λρ, the stack K is a list of frames of the form val x = □; s. The reduction rules are mostly standard. The first rule (return) returns to the next frame on the stack. The second rule (push) focuses on s0 and pushes a frame onto the stack. Finally, rule (call) performs reduction by simultaneously substituting region arguments ρ for region variables r and trivial expressions e for term parameters x.

Fig. 3. Abstract machine semantics of core Λρ: syntax of the abstract machine, machine steps, extended syntax, and runtime regions and evidence.

Region parameters, the annotated region ρ, and evidence terms are operationally irrelevant. As already mentioned, we need them to maintain invariants in our proofs.

The core of Λρ, as presented, does not yet contain features with interesting operational behavior. While we can abstract over regions, eventually all region variables will be instantiated with the top-level region and evidence will always be the trivial evidence.

Figure 3 also defines runtime regions and evidence values in core Λρ. We extend the syntax of values with evidence values w, and the syntax of regions with runtime regions u. Both are the empty list • for now. In the next two sections, we will extend their syntax to lists of markers h. The top-level region ⊤ is the empty runtime region •.

To connect type-level regions ρ with the concrete runtime stack K, we define a semantic function R⟦·⟧, which computes the runtime region of the current stack. In core Λρ, the only possible runtime region is the empty list. To give meaning to evidence expressions, we define a semantic function V⟦·⟧. Currently, the only possible evidence value is the empty list.

#### 3.4 Resource Pools

In this subsection, we add statements for region-based resource management to Λρ. As in the introduction, we use files as an example for resources. Figure 4

#### Syntax:

Statements s ::= ... | pool { [r](x, l) ⇒ s } (new resource pool) | open(e, e0, i) (open file) | readln(e, i) (read contents)

Types τ ::= ... | Pool ρ | File ρ

#### Typing Rules:

$$\frac{\Gamma,\, r,\, x : \mathsf{Pool}\ r,\, l : r \sqsubseteq \rho \mid r \vdash s : \tau}{\Gamma \mid \rho \vdash \mathsf{pool}\ \{[r](x, l) \Rightarrow s\} : \tau}\;\text{[Pool]} \qquad \frac{\Gamma \vdash e : \mathsf{File}\ \rho' \qquad \Gamma \vdash i : \rho \sqsubseteq \rho'}{\Gamma \mid \rho \vdash \mathsf{readln}(e, i) : \mathsf{String}}\;\text{[Read]}$$

$$\frac{\Gamma \vdash e : \mathsf{Pool}\ \rho' \qquad \Gamma \vdash e_0 : \mathsf{String} \qquad \Gamma \vdash i : \rho \sqsubseteq \rho'}{\Gamma \mid \rho \vdash \mathsf{open}(e, e_0, i) : \mathsf{File}\ \rho'}\;\text{[Open]}$$

Fig. 4. Syntax and typing rules of resource pools.

#### Syntax of Frames:

F ::= ... | #pool_h { □ } (resource pool frame)

#### Machine Steps:

(release) ⟨return e ‖ #pool_h {□} :: K⟩ → ⟨return e ‖ K⟩, do releasePool(h)

(pool) ⟨pool { [r](x, l) ⇒ s0 } ‖ K⟩ → ⟨s0[r ↦ u][x ↦ h][l ↦ w] ‖ #pool_h {□} :: K⟩, do h = createPool(), where u = po h :: R⟦K⟧ and w = po h :: •

(open) ⟨open(h, e, i) ‖ K⟩ → ⟨return x ‖ K⟩, when po h in R⟦K⟧, do x = openFile(h, e)

(read) ⟨readln(p, i) ‖ K⟩ → ⟨return x ‖ K⟩, when po h in R⟦K⟧, do x = readLine(p), where h = p.getPool

Runtime Regions and Evidence: h ::= @a5f | @4b2 | ... (markers), w ::= ... | po h :: w (evidence values), u ::= ... | po h :: u (runtime regions)

Runtime Region of Stack: R⟦ #pool_h {□} :: K ⟧ = po h :: R⟦K⟧

Fig. 5. Abstract machine semantics of pool-based resource management.

#### Syntax:

Statements s ::= ... | try { [r](x, l) ⇒ s } catch { s } (exception handler) | throw(e, i) (throw exception)

Types τ ::= ... | Catch ρ
#### Typing Rules:

$$\frac{\Gamma,\, r,\, x : \mathsf{Catch}\ r,\, l : r \sqsubseteq \rho \mid r \vdash s_0 : \tau \qquad \Gamma \mid \rho \vdash s : \tau}{\Gamma \mid \rho \vdash \mathsf{try}\ \{[r](x, l) \Rightarrow s_0\}\ \mathsf{catch}\ \{s\} : \tau}\;\text{[Try]} \qquad \frac{\Gamma \vdash e : \mathsf{Catch}\ \rho' \qquad \Gamma \vdash i : \rho \sqsubseteq \rho'}{\Gamma \mid \rho \vdash \mathsf{throw}(e, i) : \tau}\;\text{[Throw]}$$

Fig. 6. Syntax and typing rules of exceptions.

introduces three additional statement forms, which introduce and eliminate non-trivial evidence to assert that all files are correctly closed. The pool statement delimits a new region in which we run the enclosed statement s. It introduces three variables: a fresh region variable r, a variable x : Pool r, and evidence l : r v ρ, witnessing that the fresh region r is a subregion of the outer region ρ. The open statement receives a pool argument e, a filename e0, and an evidence argument i : ρ v ρ′ that witnesses that the current region ρ is nested within the pool's region ρ′. Rule Read for readln statements is similar.

Figure 5 extends the operational semantics. Frames can now be pool frames which contain a marker h. In rule (pool), we allocate a fresh marker h and push a pool frame onto the stack. In rule (release), we pop the pool frame and release the pool h, closing all associated resources. Our goal is to ensure that all access to marker h happens between these two steps.

To this end, rules (open) and (read) dynamically assert that the marker h is on the current stack K. Accessing a pool that fails this test would result in a stuck term. As it turns out, the mere existence of evidence i suffices to show that the assertion always succeeds (Corollary 1).

For our proof of this fact, Figure 5 extends the syntax of runtime regions and evidence. Runtime regions now include lists of pool markers and so do evidence values. The runtime region of a stack K is the list of markers that have been pushed onto it. We extend the function R⟦·⟧ to extract this list. During execution, region variables r stand for runtime regions u. In rule (pool) we substitute the runtime region po h :: R⟦K⟧ for the region variable r and the singleton list po h :: • for the evidence variable l. Later we will see how the typing rule for evidence values connects type-level runtime regions with the concrete runtime region of the current stack K.
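The function R⟦·⟧ admits a direct recursive reading. A small Haskell sketch (frame representation assumed by us):

```haskell
-- Sketch of R⟦·⟧ for Λρ with pools: the runtime region of a stack is the
-- list of pool markers on it, innermost first.
type Marker = String

data Frame
  = ValFrame          -- val x = □; s
  | PoolFrame Marker  -- #pool_h { □ }

regionOf :: [Frame] -> [Marker]
regionOf []                = []
regionOf (PoolFrame h : k) = h : regionOf k
regionOf (ValFrame    : k) = regionOf k
```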

#### 3.5 Exceptions

Figure 6 extends Λ<sup>ρ</sup> with two new statement forms. The try ... catch ... statement delimits a new region in which we run the enclosed statement s0. It introduces

#### Syntax of Frames:

F ::= ... | #catch_h { □ } { s } (catch frame)

#### Machine Steps:

(popcatch) ⟨return e ‖ #catch_h {□} {s} :: K⟩ → ⟨return e ‖ K⟩

(try) ⟨try { [r](x, l) ⇒ s0 } catch { s } ‖ K⟩ → ⟨s0[r ↦ u][x ↦ h][l ↦ w] ‖ #catch_h {□} {s} :: K⟩, do h = generateFresh(), where u = ca h :: R⟦K⟧ and w = ca h :: •

(throw) ⟨throw(h, i) ‖ K⟩ → ⟨throw(h, V⟦i⟧) ‖ K⟩

(unwind) ⟨throw(h, w) ‖ val x = □; s :: K⟩ → ⟨throw(h, w) ‖ K⟩

(forward) ⟨throw(h, ca h′ :: w) ‖ #catch_h′ {□} {s} :: K⟩ → ⟨throw(h, w) ‖ K⟩, where h ≠ h′

(catch) ⟨throw(h, •) ‖ #catch_h {□} {s} :: K⟩ → ⟨s ‖ K⟩

#### Runtime Regions and Evidence:

w ::= ... | ca h :: w (evidence values), u ::= ... | ca h :: u (runtime regions)

#### Runtime Region of Stack:

R⟦ #catch_h {□} {s} :: K ⟧ = ca h :: R⟦K⟧


Fig. 7. Abstract machine semantics of exceptions.

three variables: a fresh region variable r, a variable x : Catch r, and an evidence variable l : r v ρ, witnessing that the fresh region r is a subregion of the outer region ρ. The throw statement receives a handler e to throw to, and evidence i : ρ v ρ′ that the current region ρ is nested in the handler's region ρ′.

Figure 7 extends the operational semantics. Frames can now be catch frames with a marker h and a catch statement s. In rule (try) we generate a fresh marker h and push a catch frame with this marker and the catch statement onto the stack. The handler x is this marker h. In rule (popcatch) we pop this catch frame upon normal return. In rule (throw) we transition from normal execution to unwinding. Here h is a catch marker, and V⟦i⟧ evaluates the evidence expression i to a list of catch markers. In rules (unwind) and (forward) we unwind the stack frame-by-frame until we find the matching catch frame (catch). Because each try

#### Extended Machine Steps:

(free) ⟨throw(h, po h′ :: w) ‖ #pool_h′ {□} :: K⟩ → ⟨throw(h, w) ‖ K⟩, do releasePool(h′)

Fig. 8. Abstract machine semantics of combining resources and exceptions.

statement generates a fresh marker at runtime, and we search for this marker during unwinding, exceptions have generative semantics [5, 6, 35, 36].

Figure 7 extends the syntax of runtime regions and evidence. They now include lists of catch markers. Again, evidence guarantees that unwinding never fails, i.e. the corresponding marker is always somewhere on the stack. Remarkably, we pop elements off the evidence value w in lock-step with popping catch frames off the stack and never get stuck in doing so. We always find the matching catch frame exactly when the evidence value is the empty list. The evidence value precisely reflects the list of markers between the region of the throw statement and the region of the catch statement. Importantly, this also holds for the combined language Λ<sup>ρ</sup> (Corollary 4).
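This lock-step discipline can be rendered as a small Haskell sketch (our rendering, with assumed representations; the (free) case for pools is the one added in Section 3.6 below):

```haskell
-- Sketch of unwinding with evidence: the evidence lists the markers
-- between the throw and its handler, so we pop evidence entries and
-- marker frames in lock-step (Corollary 4 guarantees they agree).
data Mark  = Po String | Ca String
data Frame = ValFrame | PoolFrame String | CatchFrame String

unwind :: [Mark] -> [Frame] -> [Frame]
unwind []          (CatchFrame _  : k)           = k            -- (catch)
unwind (Ca h : ws) (CatchFrame h' : k) | h == h' = unwind ws k  -- (forward)
unwind (Po h : ws) (PoolFrame  h' : k) | h == h' = unwind ws k  -- (free): releasePool h'
unwind ws          (ValFrame      : k)           = unwind ws k  -- (unwind)
unwind _ _ = error "unreachable: the evidence always matches the stack"
```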

#### 3.6 Combining Resource Pools and Exceptions

When we extend the core language with both pools and exceptions, we notice that the machine gets stuck when we would have to unwind through a pool frame. Figure 8 extends the reduction relation with this missing case. When we unwind through a #pool_h′ frame, we release the pool h′. In full Λ<sup>ρ</sup>, regions are lists whose elements are either pool markers or exception markers. Evidence is, again, the same. Having to add the rule in Figure 8 shows that under our operational semantics, the two extensions are not orthogonal. We have to explicitly consider their interaction. In Section 4, we define a CPS translation for Λρ. Remarkably, both extensions can be defined separately and the correct interaction automatically arises from their composition. Perhaps more importantly, the resulting terms in CPS can be reduced freely without having to consider the interaction between pools and exceptions.

#### 3.7 Metatheory of Λ<sup>ρ</sup>

We started out with core Λ<sup>ρ</sup> only supporting regions and subregioning evidence. We then added two extensions, pools and exceptions, first individually, then together to arrive at the full language. Although we use resource pools for files as an example, our approach generalizes to region-based management of any resource. Indeed, in our mechanization, we do not model files and the pool statement only pushes and pops the fresh marker. Instead of open and readln we have a statement check with the following typing rule:

$$\frac{\Gamma \vdash e : \mathsf{Pool}\ \rho' \qquad \Gamma \vdash i : \rho \sqsubseteq \rho' \qquad \Gamma \mid \rho \vdash s : \tau}{\Gamma \mid \rho \vdash \mathsf{check}(e, i);\ s : \tau}\;\text{[Check]}$$

### Stack Typing: ⊢ K : τ

$$\frac{}{\vdash \bullet : \tau}\;\text{[Exit]} \qquad \frac{x : \tau \mid \mathcal{R}\llbracket K \rrbracket \vdash s : \tau_1 \qquad \vdash K : \tau_1}{\vdash \mathsf{val}\ x = \square;\ s :: K \,:\, \tau}\;\text{[Frame]}$$

$$\frac{\vdash K : \tau}{\vdash \#\mathsf{pool}_h\,\{\square\} :: K \,:\, \tau}\;\text{[\#Pool]} \qquad \frac{\emptyset \mid \mathcal{R}\llbracket K \rrbracket \vdash s : \tau \qquad \vdash K : \tau}{\vdash \#\mathsf{catch}_h\,\{\square\}\,\{s\} :: K \,:\, \tau}\;\text{[\#Catch]}$$

#### Abstract Machine Typing:

$$\frac{\emptyset \mid \mathcal{R}\llbracket K \rrbracket \vdash s : \tau \qquad \vdash K : \tau}{\vdash \langle s \parallel K \rangle\ \mathsf{ok}}\;\text{[Machine]}$$

#### Evidence Value Typing:

$$\frac{u_0 = w \mathbin{+\!\!+} u_1}{\emptyset \vdash w : u_0 \sqsubseteq u_1}\;\text{[Evidence]}$$

Fig. 9. Abstract machine typing of Λ<sup>ρ</sup>

It asserts that the given pool is on the current runtime stack, i.e., live, and crashes the program if it is not. Otherwise, it continues to execute statement s. We can safely access resources by first performing a runtime check and then using unsafe primitive operations. For example, we would define

open(e, e0, i) := check(e, i); openFile(e, e0)

As we will see shortly, this check never fails.
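A Haskell sketch of the check-then-use pattern (our rendering; the runtime region is represented as a marker list):

```haskell
-- 'check' asserts that the pool's marker is on the current runtime
-- region and then continues with the rest of the computation.
check :: String -> [String] -> a -> a
check h region rest
  | h `elem` region = rest
  | otherwise       = error "pool not live"  -- never happens (Corollary 1)
```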

Soundness We mechanized the formalization of Λ<sup>ρ</sup> in the Coq theorem prover and showed the usual theorems of progress and preservation of the stepping relation on machine states M.

#### Theorem 1 (Progress).

If ⊢ M ok, then either M → M′ or M is of the form ⟨return e ‖ •⟩ for some expression e.

#### Theorem 2 (Preservation).

If ⊢ M ok and M → M′, then ⊢ M′ ok.

Figure 9 presents the typing rules for the abstract machine. An abstract machine state is well-typed when the statement s is well-typed in the concrete runtime region of the stack K. The typing judgement ⊢ K : τ types stacks K that expect a value of type τ. An evidence value is well-typed when it is the difference between two runtime regions u0 and u1.

Properties The following properties follow directly from progress and preservation. Firstly, whenever we use a pool, it is live. The operational semantics inspects the runtime stack. But since the check always succeeds we do not have to actually perform it.

#### Corollary 1 (Resource Safety).

If ⟨open(h, e0, i) ‖ K⟩ ok, then po h is in R⟦K⟧.

Secondly, whenever we throw an exception, the corresponding handler is on the stack. Moreover, as we have seen from the operational semantics, during the search for the correct handler, we encounter precisely the markers that are in the evidence value.

#### Corollary 2 (Effect Safety).

If ⟨throw(h, i) ‖ K⟩ ok, then ca h is in R⟦K⟧.

Thirdly, every function runs in exactly the runtime region its type requires. In other words, the type-level region ρ will at runtime stand for the concrete runtime region of the stack this function is called in.

#### Corollary 3 (Region Correspondence).

If ⟨{ [r](x : τ) at ρ ⇒ s0 }[u](e) ‖ K⟩ ok, then ρ[r ↦ u] = R⟦K⟧.

Finally, evidence values are exactly the difference between the two regions. This corollary is inspired by the similarly named theorem of Xie et al. [34].

#### Corollary 4 (Evidence Correspondence).

If an evidence value w has type ρ0 ⊑ ρ1, then ρ0 and ρ1 are runtime regions u0 and u1, and u0 = w ++ u1.

Together, these corollaries make runtime evidence on the one hand, and marker frames on the stack on the other, redundant. Unwinding can use either evidence terms or markers on the stack, since the two agree. The operational semantics uses both to establish this fact. Likewise, the liveness check for pools is redundant, since it always succeeds; it exists only to establish this fact.

We could erase evidence terms and rely only on marker frames on the stack. In the next section, we translate to CPS, where there is no stack. Therefore, we do the opposite: we erase marker frames and rely purely on evidence terms to have the correct content at runtime. This is possible because of the correspondence between evidence and runtime regions. Ultimately, it allows us to prove that CPS-translated terms behave exactly like the operational semantics (Theorem 4).

### 4 Translation of Regions, Pools, and Exceptions to CPS

We now present the translation of Λ<sup>ρ</sup> into System F (with file primitives) in CPS. As a result of the translation, the stack K becomes an evaluation context [10], regions become answer types, and evidence terms become answer-type coercions. As before, we will define the translations of core Λ<sup>ρ</sup> and the two extensions with file pools and exceptions step-by-step. Our translation can serve as a compilation technique for languages with control effects and resources into any language

#### Translation of Types:

$$\begin{aligned}
\mathcal{T}\llbracket \mathsf{Int} \rrbracket &= \mathsf{Int} \qquad & \mathcal{T}\llbracket r \rrbracket &= r \qquad & \mathcal{T}\llbracket \top \rrbracket &= \mathsf{Void} \\
\mathcal{T}\llbracket \forall[\overline{r}](\overline{\tau}) \to^{\rho} \tau' \rrbracket &= \forall \overline{r}.\ \overline{\mathcal{T}\llbracket \tau \rrbracket} \to \mathsf{Cps}\ \mathcal{T}\llbracket \rho \rrbracket\ \mathcal{T}\llbracket \tau' \rrbracket \\
\mathcal{T}\llbracket \rho \sqsubseteq \rho' \rrbracket &= \forall a.\ \mathsf{Cps}\ \mathcal{T}\llbracket \rho' \rrbracket\ a \to \mathsf{Cps}\ \mathcal{T}\llbracket \rho \rrbracket\ a
\end{aligned}$$

#### Translation of Expressions:

$$\begin{aligned}
\mathcal{E}\llbracket x \rrbracket &= x \\
\mathcal{E}\llbracket \{[\overline{r}](\overline{x : \tau})\ \mathsf{at}\ \rho \Rightarrow s\} \rrbracket &= \Lambda \overline{r}.\ \lambda \overline{x}.\ \mathcal{S}\llbracket s \rrbracket_{\rho} \\
\mathcal{E}\llbracket 0 \rrbracket &= \Lambda a.\ \lambda m.\ m \\
\mathcal{E}\llbracket e_1 \oplus e_2 \rrbracket &= \Lambda a.\ \lambda m.\ \mathcal{E}\llbracket e_1 \rrbracket\ a\ (\mathcal{E}\llbracket e_2 \rrbracket\ a\ m)
\end{aligned}$$

#### Translation of Statements:

$$\begin{aligned}
\mathcal{S}\llbracket \mathsf{val}\ x = s_0;\ s_1 \rrbracket_{\rho} &= \lambda k.\ \mathcal{S}\llbracket s_0 \rrbracket_{\rho}\ (\lambda x.\ \mathcal{S}\llbracket s_1 \rrbracket_{\rho}\ k) \\
\mathcal{S}\llbracket \mathsf{return}\ e \rrbracket_{\rho} &= \lambda k.\ k\ (\mathcal{E}\llbracket e \rrbracket) \\
\mathcal{S}\llbracket e_0[\overline{\rho}](\overline{e}) \rrbracket_{\rho} &= \mathcal{E}\llbracket e_0 \rrbracket\ \overline{\mathcal{T}\llbracket \rho \rrbracket}\ \overline{\mathcal{E}\llbracket e \rrbracket}
\end{aligned}$$

#### Auxiliary Definitions:

$$\mathsf{Cps}\ R\ A = (A \to R) \to R$$

Fig. 10. Translation from core Λ<sup>ρ</sup> to System F.

that supports first-class functions, making it widely applicable. Moreover, as demonstrated by Schuster et al. [26], modeling control effects with CPS can enable compile-time optimizations for significant performance improvements. We implemented the CPS translation of Λ<sup>ρ</sup> as a shallow embedding in Idris 2 [7].

### 4.1 Translation of Core Λ<sup>ρ</sup>

Figure 10 defines the translation of core Λ<sup>ρ</sup> to System F. Our translation targets one particular variant of CPS, called iterated CPS [11, 25]. Every stack segment, delimited by a marker, is represented by its own continuation argument. That is, in iterated CPS, functions do not receive one but potentially multiple continuations. This will only become relevant in the presence of exceptions (Section 4.3).

Translation of Types Base types, such as Int, are left unchanged by the translation. We translate region variables to type variables in System F and the top-level region to the empty type Void. The translation on types shows that the iterated CPS translation is (so far) very similar to the traditional CPS translation. In particular, the auxiliary meta-definition Cps R A is defined as the familiar type (A → R) → R of computations in CPS with return type A and answer type R. Evidence terms are functions between effectful computations, as can be seen from the translation of evidence types.
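The two central types can be transliterated to Haskell (a sketch; the paper targets System F, and the names here are ours):

```haskell
{-# LANGUAGE RankNTypes #-}

-- computations in CPS with answer type r
type Cps r a = (a -> r) -> r

-- T⟦ρ ⊑ ρ'⟧: an answer-type coercion from the outer region ρ' to the
-- inner region ρ, polymorphic in the value type a
type Coerce outer inner = forall a. Cps outer a -> Cps inner a
```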

Translation of Terms As usual in CPS, we translate sequencing of statements to push a frame onto the current continuation k; that is, the continuation first runs s1 and then continues with k. Return statements are translated to tail calls of the current continuation.

Fig. 11. Translation of Λ<sup>ρ</sup> with resource pools (extended translation rules and auxiliary definitions).

Again, viewing continuations as stacks, this is in accordance with the operational semantics given in Section 3.3. In general, statements with return type τ that have to be run in a region ρ are translated to terms of type Cps T⟦ρ⟧ T⟦τ⟧. This can, for instance, be seen in the translation of function types. We translate regions to answer types. Region abstractions are translated to type abstractions and region-polymorphic functions have a polymorphic answer type [30]. We translate evidence expressions to functions that lift a computation to run in a different region. The reflexivity evidence is translated to the polymorphic identity function, and transitivity of evidence amounts to function composition.
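In the same Haskell rendering (a sketch; the synonyms are repeated so the block stands alone), the statement and evidence translations of Fig. 10 become:

```haskell
{-# LANGUAGE RankNTypes #-}

type Cps r a = (a -> r) -> r
type Coerce outer inner = forall a. Cps outer a -> Cps inner a

ret :: a -> Cps r a                  -- S⟦return e⟧: tail-call the continuation
ret x = \k -> k x

seqS :: Cps r a -> (a -> Cps r b) -> Cps r b   -- S⟦val x = s0; s1⟧:
seqS s0 s1 = \k -> s0 (\x -> s1 x k)           -- push a frame onto k

refl :: Coerce r r                   -- E⟦0⟧: the polymorphic identity
refl = id

trans :: Coerce r1 r0 -> Coerce r2 r1 -> Coerce r2 r0   -- E⟦e1 ⊕ e2⟧:
trans e1 e2 = \m -> e1 (e2 m)                           -- function composition
```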

In the remainder of this section, we present the rest of the translation of our language with pools and exceptions Λρ. Later, we show that the translated code in CPS simulates the operational semantics given in Section 3.

#### 4.2 Resource Pools

In Figure 4, we have seen the definition of Λ<sup>ρ</sup> with resource pools. Figure 11 defines the translation to CPS. As we have seen in Section 3.7, we do not need any runtime checks to prevent markers and files from being used outside of their region. Indeed, in CPS there is no stack, which we could check for markers.

The pool statement creates a fresh resource pool. The translation instantiates r with the outer answer type T⟦ρ⟧. When control leaves the enclosed block, the pool is released. In its translation we use the auxiliary function RunPool. It binds the current continuation k and creates a fresh pool h. We run the given computation m with h and a continuation where we push a frame that releases the pool onto the current continuation k. This ensures that we release the pool when we return normally from the enclosed block.
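A Haskell sketch of RunPool and the accompanying evidence (the pool primitives below are stand-ins we assume; IO serves as the final answer type):

```haskell
{-# LANGUAGE RankNTypes #-}

type Cps r a = (a -> r) -> r

newtype Pool = Pool Int
createPool :: IO Pool
createPool = pure (Pool 0)     -- stand-in for the real primitive
releasePool :: Pool -> IO ()
releasePool _ = pure ()        -- stand-in for the real primitive

-- LiftPool h: evidence for the fresh region; it runs only when control
-- leaves the pool's body non-locally, releasing the pool on the way out.
liftPool :: Pool -> Cps (IO r) b -> Cps (IO r) b
liftPool h m = \k -> releasePool h >> m k

-- RunPool: create a fresh pool, run the body with pool and evidence, and
-- release the pool on normal return.
runPool :: (Pool -> (forall b. Cps (IO r) b -> Cps (IO r) b) -> Cps (IO r) a)
        -> Cps (IO r) a
runPool body = \k -> do
  h <- createPool
  body h (liftPool h) (\x -> releasePool h >> k x)

-- Composing two pool coercions releases both pools, mirroring the
-- evidence composition l3 ⊕ l2 from Example 3:
releaseBoth :: Pool -> Pool -> Cps (IO r) b -> Cps (IO r) b
releaseBoth p3 p2 = liftPool p3 . liftPool p2
```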

#### Extended Translation Rules:

$$\begin{aligned}
\mathcal{T}\llbracket \mathsf{Catch}\ \rho \rrbracket &= \mathsf{Cps}\ \mathcal{T}\llbracket \rho \rrbracket\ \mathsf{Void} \\
\mathcal{S}\llbracket \mathsf{try}\ \{[r](x, l) \Rightarrow s_0\}\ \mathsf{catch}\ \{s\} \rrbracket_{\rho} &= \mathsf{RunCps}\ ((\Lambda r.\ \lambda x.\ \lambda l.\ \mathcal{S}\llbracket s_0 \rrbracket_{r})\ (\mathsf{Cps}\ \mathcal{T}\llbracket \rho \rrbracket\ \mathcal{T}\llbracket \tau \rrbracket)\ (\lambda k.\ \mathcal{S}\llbracket s \rrbracket_{\rho})\ \mathsf{LiftCps}) \\
\mathcal{S}\llbracket \mathsf{throw}(e, i) \rrbracket_{\rho} &= \mathcal{E}\llbracket i \rrbracket\ \mathsf{Void}\ \mathcal{E}\llbracket e \rrbracket
\end{aligned}$$

#### Auxiliary Definitions:

$$\begin{aligned}
\mathsf{RunCps}\ m &= m\ (\lambda x.\ \lambda k.\ k\ x) \\
\mathsf{LiftCps} &= \Lambda a.\ \lambda m.\ \lambda k.\ \lambda j.\ m\ (\lambda x.\ k\ x\ j)
\end{aligned}$$
Fig. 12. Translation of Λ<sup>ρ</sup> with exceptions.

Evidence terms are functions LiftPool h that release the pool h. Our types make sure that we evaluate the evidence if and only if we non-locally leave the body of the pool. In Section 3.4, evidence was a list of pools. Here, evidence still contains a list of pools, but this list is hidden in the closure environment of the evidence. Evidence composition conceptually concatenates these lists.

The open statement opens a file and registers it in the pool. The readln statement uses a primitive to read from a file. We require evidence that the pool is live, i.e. on the runtime stack, but do not have to actually use it. As we have seen in Section 3.7 its existence is enough to assert that accessing the file is safe.

Example 4. Let us consider a simplified version of the motivating example (Section 2.1). The following program

```
pool { [r1](p1: Pool r1, l1: r1 v T) ⇒
  val f = open(p1, "input", 0);
  return 0
}
```
translates to this term in System F, of type Cps Void Int:

```
λk.
  let h = createPool ();
  (Λr1. λp1. λl1. λk1.
    let f = openFile p1 "input";
    k1 0) Void h
    (Λa. λm. λk. releasePool h; m k)
    (λx. releasePool h; k x)
```
This term can be normalized to the following:

λk. let h = createPool (); let f = openFile h "input"; releasePool h; k 0

#### 4.3 Exceptions

In this subsection, we present the translation of exceptions. Whereas in the operational semantics (Section 3.5) we have divided the stack into regions with markers, we now have multiple stacks, i.e., continuations. We have seen that evidence terms contained exactly the list of markers we have to unwind when we throw to a handler. Now we take advantage of this fact and let the evidence be the unwinding action itself. Figure 12 presents the translation of exceptions. It is different from the translation to double-barrelled CPS [17, 29], where functions only ever get exactly two continuations. Under our translation to iterated CPS, functions can receive any number of continuations.

To support aborting the computation, we instantiate the answer type r of the translated body s0 to be the type Cps T⟦ρ⟧ T⟦τ⟧. This adds another layer of CPS and one additional (curried) continuation argument. In the translation of try ... catch ... statements, we use RunCps. It runs the given computation m with an additional continuation which is initially empty. The evidence l lifts the given computation from the inner region to the outer region. It will be bound to LiftCps which pushes the current continuation onto the next one.
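The two auxiliary definitions admit a compact Haskell sketch (our rendering):

```haskell
type Cps r a = (a -> r) -> r

-- RunCps: supply the initially empty continuation for the new layer.
runCps :: Cps (Cps r a) a -> Cps r a
runCps m = m (\x k -> k x)

-- LiftCps: push the current continuation onto the next one.
liftCps :: Cps r b -> Cps (Cps r a) b
liftCps m = \k j -> m (\x -> k x j)

-- A Catch ρ aborts: it discards its (inner) continuation.
abortWith :: a -> Cps (Cps r a) b
abortWith v = \_k1 k2 -> k2 v
```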

A Catch ρ is a CPS expression that aborts the computation. That is, the handler (λk. S⟦s⟧ρ) discards the current continuation k. In the translation of statement throw(e, i), we call the provided evidence i and then the handler e. Running the evidence lifts the handler into the correct region, making it compatible with the current answer type. It is safe for the handler to discard the continuation k, since all cleanup actions contained in k are run by the evidence.

Example 5. Let us consider the example from Section 2.2. The following program

```
try { [r1](e1 : Catch r1, l1 : r1 v T) ⇒
  safeDiv[r1](5, 0, e1)
} catch {
  return 0
}
```
translates to this term of type Cps Void Int:

```
(Λr1. λe1. λl1.
  safeDiv r1 5 0 e1
) (Cps Void Int)
  (λk1. λk2. k2 0)
  (Λa. λm. λk. λj. m (λx. k x j))
  (λx. λk. k x)
```
The resulting System F term can be beta reduced and eta expanded to:

λk2. safeDiv (Cps Void Int) 5 0 (λk1. λk2. k2 0) (λx. λk. k x) k2

We instantiate the answer type r of safeDiv with r1, which itself is instantiated with Cps Void Int. The return type is Cps (Cps Void Int)Int and our program now receives two continuations. To abort, the exception handler discards the first one (i.e., k1) and returns 0 to the second one (i.e., k2).
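This behavior can be observed directly in a runnable Haskell rendering of the example (a sketch; we instantiate the outermost answer type to Int instead of Void so the result can be printed):

```haskell
import Data.Void (Void, absurd)

type Cps r a = (a -> r) -> r

-- our rendering of translated safeDiv: the handler is a computation of
-- type Cps r Void that aborts instead of returning
safeDiv :: Int -> Int -> Cps r Void -> Cps r Int
safeDiv x y handler k = if y == 0 then handler absurd else k (x `div` y)

-- try { safeDiv(5, 0, e1) } catch { return 0 }: the handler discards the
-- first continuation and returns 0 to the second one
main :: IO ()
main = print (safeDiv 5 0 (\_k1 k2 -> k2 0) (\x k2 -> k2 x) id)  -- prints 0
```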

#### 4.4 Combining Resource Pools and Exceptions

Well-typed programs in Λ<sup>ρ</sup> translate to well-typed programs in System F.

Theorem 3 (Well-typedness of Translated Terms).

If Γ | ρ ⊢ s : τ, then T⟦Γ⟧ ⊢ S⟦s⟧ρ : (T⟦τ⟧ → T⟦ρ⟧) → T⟦ρ⟧.

Proof. Straightforward induction over the typing derivation.

The translation of exception handlers in Section 4.3 automatically interacts correctly with the evidence terms we have defined for resource pools in Section 4.2: We clear a pool exactly when an exception is thrown across it. This is because we have chosen the translation of evidence to be a concrete computation that moves from one region to another one.

Example 6. The following is an extended example where we combine resource pools and exceptions in a more complicated way. The program splits a large input file into smaller files of 100 lines each.

```
try { [r1](stop : Catch r1, l1 : r1 v Top) ⇒
  withFile[r1]("input", { [r2](in: File r2, l2 : r2 v r1) ⇒
    def copyFile(target : String) at r2 {
      withFile[r2](target, { [r3](out: File r3, l3 : r3 v r2) ⇒
        def copyLine() at r3 {
          if (isEOF(in, l3)) { throw(stop, l3 ⊕ l2) }
          else { writeln(out, readln(in, l3), 0) }
        };
        def innerLoop(toCopy : Int) at r3 {
          if (toCopy > 0) { copyLine(); innerLoop(toCopy - 1) }
        };
        innerLoop(100)
      })
    };
    def loop(n : Int) at r2 { copyFile("output" ++ n); loop(n + 1) };
    loop(0)
  })
} catch { return () }
```
When we encounter the end of the input file, we simply throw an exception to terminate the program. We can be confident that all resources will be properly cleaned up and so fearlessly use exceptions to structure control flow. The outer loop, for example, never returns. It is terminated by throwing an exception. This program, after CPS translation, manually applying contification [17], and beta reduction, reduces to the code in Figure 13.

Our CPS translation of both regions and control enables aggressive optimization. For example, at the end of the input file, we immediately release both pools and return. Since we only apply well-known optimizations on functional programs, we can be certain of their correctness without having to reason explicitly about resources, control effects, or their combination. The overall correctness of the optimized result rests on the correctness of our CPS translation.

#### 4.5 Simulation of the Machine Semantics by the CPS translation

In Section 3, we defined an operational semantics for Λρ. In this section we defined a CPS translation for Λρ. We now show that the two behave the same. This entails that the operational properties from Section 3 carry over to the

```
λk.
 let p2 = createPool ();
 let in = openFile p2 "input";
 let rec loop n = (λk1.
   let p3 = createPool ();
   let out = openFile p3 ("output" ++ n);
   let rec innerLoop toCopy = (λk2.
    if (toCopy > 0)
    then if isEOF(in)
      then releasePool p3; releasePool p2; (λk4. k4 0)
      else let line = readLine in; writeLine out line; innerLoop (toCopy − 1)
    else releasePool p3; loop (n + 1)
   );
   innerLoop 100
 );
 loop 0 k
```
Fig. 13. Result of translating Example 6 to CPS.

CPS translation and that optimization via beta reduction is sound. To show preservation of semantics, we extend our translation to machine states [4, 15]. We translate statements to terms and stacks to evaluation contexts in System F. We define the translation M⟦·⟧ of machine states as the plugging of the translation of the statement into the translation of the stack. The full translation is available in a separate technical report [27].

We show that for each step the machine takes, there is a corresponding (possibly empty) sequence of steps between the translated terms.

#### Theorem 4 (Simulation).

If M → M′, then M⟦M⟧ →* M⟦M′⟧.

Proof. By considering each case of the stepping relation. The (throw) step needs its own lemma, which we show by induction on possible evidence expressions.

Since for simulation we are only interested in operational behavior, we target the untyped lambda calculus (with primitives for file management) instead of System F. The translation of statements is the same as S⟦s⟧ρ in Figures 10, 11, and 12, but we erase all type annotations, type abstractions, and type applications. There is no harm in doing so, since our target is in CPS, where the evaluation order is explicit.

While the operational semantics given in Section 3 discards frames during unwinding, for our proof of simulation we have to retain them. We do so in a third component of the machine state ⟨throw(h, w) ‖ K ‖ H⟩: the stack trace H. This is necessary because the CPS translation discards the whole continuation in one step, while the operational semantics unwinds the stack frame-by-frame.

We translate the empty stack to a special primitive function done, which will return the overall result of the program. It is called exactly once, when the machine is in its final state and we return to the empty stack.

Example 7. Pools are created and released exactly when they would be in the operational semantics. As an illustration, consider the following sequence of machine steps where we unwind a pool frame:

```
⟨ throw(h1, (po h2 :: •)) ‖ #pool_h2 {□} :: #catch_h1 {□} {return 1} :: • ⟩
→ ⟨ throw(h1, (po h2 :: •)) ‖ #pool_h2 {□} :: #catch_h1 {□} {return 1} :: • ‖ • ⟩
→ ⟨ throw(h1, •) ‖ #catch_h1 {□} {return 1} :: • ‖ #pool_h2 {□} :: • ⟩
→ ⟨ return 1 ‖ • ⟩
```
The first step (throw) goes from normal execution to the unwinding state which accumulates frames in its third component. The next two steps are (free) and (catch). In CPS, we can observe the same program trace:

((LiftPool h2) (λk1. λk2. k2 1)) (λx. releasePool h2; (λx. λk. k x) x) done → (λk. releasePool h2; (λk1. λk2. k2 1) k) (λx. releasePool h2; (λx. λk. k x) x) done → (λk1. λk2. k2 1) (λx. releasePool h2; (λx. λk. k x) x) done → (λk2. k2 1) done

Example 8. Although we do not have any markers generated at runtime, the CPS translation exactly mimics the behavior of the operational semantics, which does have them. Consider another example, where we throw an exception to an outer handler. The steps are (throw), (forward), and (catch).

```
⟨ throw(h1, (ca h2 :: •)) ‖ #catch_h2 {□} {return 2} :: #catch_h1 {□} {return 1} :: • ⟩
→ ⟨ throw(h1, (ca h2 :: •)) ‖ #catch_h2 {□} {return 2} :: #catch_h1 {□} {return 1} :: • ‖ • ⟩
→ ⟨ throw(h1, •) ‖ #catch_h1 {□} {return 1} :: • ‖ #catch_h2 {□} {return 2} :: • ⟩
→ ⟨ return 1 ‖ • ⟩
```
In CPS, we start out with three continuations, then we push the first one onto the second one, then the exception handler discards both in one step:

(LiftCps (λk1. λk2. k2 1)) (λx. λk. k x) (λx. λk. k x) done → (λk. λj. (λk1. λk2. k2 1) (λy. k y j)) (λx. λk. k x) (λx. λk. k x) done → (λk1. λk2. k2 1) (λy. (λx. λk. k x) y (λx. λk. k x)) done → (λk2. k2 1) done

The CPS translation exhibits the same behavior as the operational semantics. It simulates the generative semantics of exceptions. Remarkably, it does not need any runtime support for markers on the stack to do so. Indeed, in CPS there is no stack!

### 5 Related Work

Out of the large body of work on regions, the one most closely related, and indeed which has been the basis of our work, is the one by Kiselyov and Shan [19], which in turn is based on work of Fluet and Morrisett [12]. Kiselyov and Shan provide a library for region-based resource management in Haskell. They demonstrate how types, regions, and subregioning evidence are inferred, which we do not discuss. They deal with builtin Haskell exceptions, but leave a formal proof to future work. We go further, and add exceptions as a language feature, and prove region- and exception safety. Moreover, we present a CPS translation of these features.

Crary et al. [9] present a language with dynamic regions, where regions do not have to be nested, resource access is safe, but resource cleanup is not automatic but explicit. Their language is presented in CPS. Indeed, to quote Fluet et al. [13]: "Dynamic regions are not restricted to LIFO lifetimes and can be treated as first-class objects. They are particularly well suited for iterative computations, CPS-based computations, and event-based servers where lexical regions do not suffice." We present a CPS translation of lexical regions where resources are automatically released, even when an exception is thrown.

Clearly also related is the line of work on monadic encapsulation of state [20, 23, 28]. The most recent work in this line [31] presents a mechanized proof of a number of equivalences in the presence of encapsulated mutable state. We merely prove that references are not used outside of their region, but do so in the presence of exceptions.

Kiselyov and Ishii [18] present a Haskell library for effect handlers based on a variant of the free monad in Haskell. Their library supports user-defined effects and handlers and they provide a range of pre-defined effects like exceptions, nondeterminism, and state. They also discuss a region effect for safe and automatic allocation and release of resources, which correctly works in the presence of the exception effect. Other effects, like non-determinism, are explicitly ruled out by the type system when they would be used across a resource delimiter. They reify the structure of the program as a free monad and then write interpreters over this structure, whereas we translate programs to CPS. Moreover we provide a proof of region- and exception safety, which is out of scope of their work.

Leijen [21] reports on an extension of the programming language Koka with support for resources and finalization. They support general effect handlers, while we merely discuss the special case of exceptions. Their approach requires sophisticated modification of the language runtime, whereas our approach can be explained as a translation to pure System F. They allow for more complex finalization patterns, where users explicitly run the finalizers of a resumption. This is to avoid running finalizers on linearly used resumptions, a problem that we completely side-step by only discussing exceptions.

Ahman and Bauer [1] present an approach to resource management: Runners. They guarantee that cleanup actions are run exactly once. We offer the same guarantee. We present an operational semantics that relates resource management to the stack and a translation of programs to CPS. Their denotational semantics translates programs to essentially a free monad.

Thielecke [29] compares different control constructs by their translation to double-barrelled CPS, where functions receive exactly two continuations. In contrast, under our iterated CPS translation, functions can receive any number of continuations. Moreover, even in the case where we pass two continuations, there is a difference. In double-barrelled CPS, translated terms have type:

$$(\llbracket A \rrbracket \to Ans) \to (\llbracket B \rrbracket \to Ans) \to Ans$$

Under our iterated CPS translation, such terms would have type:

$$(\llbracket A \rrbracket \to (\llbracket B \rrbracket \to Ans) \to Ans) \to (\llbracket B \rrbracket \to Ans) \to Ans$$

Their work is neither concerned with resources nor multiple different exception handlers.

Thielecke [30] studies the connection between control effects and continuation passing. His work introduces some of the ideas presented in this paper: regions are answer types, region polymorphism is answer-type polymorphism, and effect masking introduces a fresh region to delimit the extent of control effects. We expand upon his work in several ways: Instead of a single control operator call/cc, we consider a language with multiple layers of exceptions and resources. Therefore, on the type level, we have subregioning evidence between nested regions, and on the term level, we translate to iterated CPS.

Our iterated CPS translation of exceptions is closely related to the one presented by Schuster et al. [26]. However, they do not support effect-polymorphic functions. Our translation to System F is similar to the one for effect handlers sketched in Appendix B of Hillerström et al. [15].

### 6 Conclusion

We presented Λρ, a language with first-class functions, regions, resources, and exceptions. Its type system guarantees safe access to resources and safe use of exceptions. We then presented a CPS translation that preserves these guarantees.

We view regions as describing runtime stacks. This view is very much in line with recent work on effect handlers. One might wonder whether our approach scales to more general control effects, which do not discard the current continuation, and perhaps even use it multiple times. This is the subject of ongoing investigation.

### Acknowledgments

The work on this project was supported by the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) – project number DFG-448316946.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## A Predicate Transformer for Choreographies

Computing Preconditions in Choreographic Programming

Sung-Shik Jongmans<sup>1,2</sup> and Petra van den Bos<sup>3</sup>

<sup>1</sup> Department of Computer Science, Open University, Heerlen, the Netherlands

<sup>2</sup> CWI, Amsterdam, the Netherlands

<sup>3</sup> Formal Methods and Tools Group, University of Twente, Enschede, the Netherlands

Abstract. Construction and analysis of distributed systems is difficult; choreographic programming is a deadlock-freedom-by-construction approach to simplify it. In this paper, we present a new theory of choreographic programming. It supports for the first time: construction of distributed systems that require decentralised decision making (i.e., if/while-statements with multiparty conditions); analysis of distributed systems to provide not only deadlock freedom but also functional correctness (i.e., pre/postcondition reasoning). Both contributions are enabled by a single new technique, namely a predicate transformer for choreographies.

### 1 Introduction

Construction and analysis of distributed systems that consist of message passing processes is hard. Typical challenges include providing deadlock freedom (i.e., the processes never get stuck) and functional correctness (i.e., the processes compute the intended outcome). Choreographic programming [8,9,10] is a deadlock-freedom-by-construction approach to make implementation and verification of distributed systems easier. In this paper, to address two limitations of existing theories, we present a new theory of choreographic programming. It supports for the first time: construction of distributed systems that require decentralised decision making; analysis of distributed systems to provide not only deadlock freedom but also functional correctness.

### 1.1 Background: Choreographic Programming by Example

To explain choreographic programming, consider a distributed system in which two processes enact roles Client and Server. First, a username and password are communicated from Client to Server. Next, Server checks Client's credentials and informs Client about the outcome: if authentication succeeded, the execution continues; if it failed, it ends. We construct and analyse this system as follows:

1. Initially, we write a global program G ("the choreography"); it prescribes the behaviour of all roles, collectively, from their shared perspective.

<sup>C</sup>."foo"\_S.<sup>x</sup> ; <sup>C</sup>.123\_S.<sup>y</sup> ; if <sup>S</sup>.auth(x,y) (S.SUCC\_<sup>C</sup> ; <sup>G</sup><sup>0</sup> ) (S.FAIL\_C)

Fig. 1: Workflow of choreographic programming

In this notation, p.e→q.y prescribes a value communication to share data from role p to role q: expression e is evaluated at p, sent at p, received at q, and stored in variable y at q. Similarly, p.ℓ→q prescribes a label communication to share decisions: label ℓ is actively selected at p ("internal choice"), sent at p, received at q, and passively branched on at q ("external choice"). Furthermore, G1 ; G2 and if r.e G1 G2 prescribe a sequence and a conditional choice (i.e., if e is evaluated to true at r, then G1 is executed, or else G2). Now, informally, the first theorem of choreographic programming is this:

#### Theorem 1 (Deadlock Freedom). Every global program is deadlock-free.

2. Subsequently, we decompose global program G into local programs L<sup>C</sup> and L<sup>S</sup> ("the processes"), using a projection function; every local program prescribes the behaviour of one role, individually, from its own perspective.

Client: CS!"foo" ; CS!123 ; SC?{SUCC : L 0 C , FAIL : skip} Server: CS?x ; CS?y ; if S.auth(x,y) (SC!SUCC ; L 0 S ) (SC!FAIL)

In this notation, pq!e and pq?y prescribe a send and a receive of a value from p to q. Similarly, pq!ℓ and pq?{ℓi : Li}i∈I prescribe a send and a receive of a label (i.e., if ℓj is received for some j ∈ I, then Lj is executed).

Now, informally, the second theorem of choreographic programming is this:

Theorem 2 (Operational Equivalence). Every well-formed global program is operationally equivalent to the parallel composition of its projections.

"Well-formedness" is a syntactic condition on global programs; we discuss it in more detail later. Here, we just claim that G above is indeed well-formed.

3. Finally, we compose local programs L<sup>C</sup> and L<sup>S</sup> in parallel ("the distributed system"), by deploying them concurrently, and by executing them at their own pace; as they run, L<sup>C</sup> and L<sup>S</sup> send and receive messages as prescribed. Now, Thm. 1 and Thm. 2 together entail that L<sup>C</sup> and L<sup>S</sup> are deadlock-free, by construction, without extra analysis. Figure 1 summarises the workflow.

#### 1.2 Related Work: State of the Art & Open Problems

Early work on choreographic programming was presented by Carbone et al. [8,9] (using binary session types [34]) and by Carbone and Montesi [10] (using multiparty session types [35]); substantial progress has been made since. For instance, Montesi and Yoshida developed a theory of compositional choreographic programming that supports open distributed systems [42]; Carbone et al. studied connections between choreographic programming and linear logic [11,12,7]; Dalla Preda et al. combined choreographic programming with dynamic adaptation [48,46,47]; Cruz-Filipe and Montesi developed a minimal Turing-complete language of global programs [16,19]; Cruz-Filipe et al. presented a technique to extract global programs from families of local programs ("choreography extraction") [14]; and recently, Giallorenzo et al. studied a correspondence between choreographic programming and multitier languages [29]. Other work on choreographic programming includes results on case studies [15], procedural abstractions [18], asynchronous communication [17], polyadic communication [20,31], implementability [28], and formalisation/mechanisation in Coq [21,22]. Furthermore, theoretical developments are supported in practice by several tools, including Chor [10], AIOCJ [48,47], and Choral [29].

However, all publications cited above have two limitations:

1. Regarding the construction of distributed systems, existing work on choreographic programming supports only centralised decision making: every if/while-statement in a global program has a one-party condition, evaluated at a single role. For instance, in the example above, the decision to continue or end the execution is made by Server alone; Client is duly informed afterwards, with a label communication, as it needs to know how to proceed, but the decision is really Server's.

However, in many distributed systems, it is impractical (i.e., unnecessary or unnatural), or even impossible, for a single role to make decisions.

For instance, consider a distributed system in which two processes enact roles Player1 and Player2 to simulate a game of chess. The idea is that, at the end of every turn, a move is communicated from "active" Playeri to "passive" Playerj, after which a decision must be made: should Playerj take a next turn, or is the game over? The key point here is that every role has enough knowledge to check if the latest move is, in fact, the final one. So after every turn, every role can privately—without a label communication—decide to continue or end the execution; moreover, unanimity is guaranteed. It is, thus, unnecessary to additionally use a label communication to have one role explicitly inform the other one about how to proceed. Yet, all publications cited above force the usage of a label communication in this situation anyway.

2. Regarding the analysis of distributed systems, existing work on choreographic programming focusses on providing deadlock freedom. In contrast, providing functional correctness has not received due attention. This is surprising: given the sequential programming style in which global programs are expressed, it seems worthwhile to study how classical verification techniques for sequential code can be adapted to choreographic programming.

Beyond choreographic programming, all other choreography-based approaches that we know of are limited to centralised decision making, including conversation protocols (e.g., [3,27]), multiparty session types (MPST) (e.g., [35,13,23,24]), and MPST extensions to support value-based reasoning using assertions [5], dependent types [51,25], and refinement types [52]. Furthermore, we note that (elements of) deductive verification and session types were combined in Actris [32] and ParTypes [41]. Actris supports reasoning about functional correctness (using separation logic [44,36]), but only for binary sessions. In contrast, ParTypes supports multiparty sessions, but it does not consider functional correctness.

Table 1: State of the art (e.g., [9,10,12,19,29,42,47]) vs. this paper

#### 1.3 Contributions of This Paper

In this paper, we address the two limitations described in Sect. 1.2.


Table 1 summarises our contributions relative to the state of the art; it also shows a minimal example to illustrate the essential difference between centralised and decentralised decision making. With centralised decision making (left global program), first, only Bob shares x2 with Alice; next, only Alice compares it with x1 and shares the outcome with Bob. In contrast, with decentralised decision making (right global program), first, both Alice and Bob share their values; next, both Alice and Bob compare them, but they do not need to share the outcomes, as their unanimity is guaranteed.

#### 1.4 Key Challenge: How to Check If Unanimity Is Guaranteed?

So far, we have seen two examples of decentralised decision making (i.e., Player1 and Player2 in Sect. 1.2; Alice and Bob in Sect. 1.3). In both examples, we noted that "unanimity is guaranteed"; this is crucially important to provide deadlock freedom. As a counterexample of what can go wrong in the absence of unanimity, suppose that Bob's condition in Tab. 1 were x2==true (i.e., he ignores Alice's value). In that case, unanimity is not guaranteed, so Alice and Bob can diverge: Alice privately decides to enter one branch, while Bob privately decides to enter the other branch. A deadlock subsequently ensues if, for instance, Alice needs to await a message from Bob in her branch, while Bob needs to await a message from Alice in his branch.

Thus, the key challenge to support decentralised decision making in choreographic programming is this: "How to check if unanimity is guaranteed?" The pivotal insight is that this question can be reduced to a seemingly unrelated one: "Given a global program and a postcondition, how to compute a precondition?" It was first answered for sequential code by Dijkstra in the 1970s [26], in terms of a predicate transformer to compute weakest preconditions. A crucial technical contribution of this paper is a non-trivial adaptation of Dijkstra's seminal work, tailored for choreographic programming, to provide not only functional correctness (i.e., ensure the truth of the postcondition) but also deadlock freedom in the presence of decentralised decision making (i.e., ensure unanimity).
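To make Dijkstra's idea concrete before we adapt it, the following minimal Haskell sketch (our own toy rendering of wp for a simple sequential language, not this paper's calculus) computes weakest preconditions for assignment, sequencing, and binary conditionals; the choreographic adaptation in Sects. 4–6 must additionally handle weak sequencing, interleaving, and multiparty conditions.

```haskell
-- A toy rendering of Dijkstra's wp [26] for a sequential language;
-- all names (Expr, Stmt, wp, ...) are ours, for illustration only.
data Expr = Var String | Lit Integer | Add Expr Expr | Eq Expr Expr

data Formula = Atom Expr | Not Formula | And Formula Formula | Imp Formula Formula

data Stmt = Skip | Assign String Expr | Seq Stmt Stmt | If Expr Stmt Stmt

-- Substitute expression e for variable x (first-order, so no capture issues).
substE :: String -> Expr -> Expr -> Expr
substE x e (Var y) | x == y = e
substE _ _ a@(Var _)        = a
substE _ _ a@(Lit _)        = a
substE x e (Add a b)        = Add (substE x e a) (substE x e b)
substE x e (Eq a b)         = Eq (substE x e a) (substE x e b)

subst :: String -> Expr -> Formula -> Formula
subst x e (Atom a)  = Atom (substE x e a)
subst x e (Not f)   = Not (subst x e f)
subst x e (And f g) = And (subst x e f) (subst x e g)
subst x e (Imp f g) = Imp (subst x e f) (subst x e g)

-- wp s chi: the weakest condition that guarantees chi after executing s.
wp :: Stmt -> Formula -> Formula
wp Skip         chi = chi
wp (Assign x e) chi = subst x e chi                       -- chi[e/x]
wp (Seq s1 s2)  chi = wp s1 (wp s2 chi)                   -- computed backwards
wp (If e s1 s2) chi = And (Imp (Atom e) (wp s1 chi))
                          (Imp (Not (Atom e)) (wp s2 chi))
```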

#### 1.5 Organisation of This Paper

In Sect. 2, to further motivate this paper's new theory, we present more examples of real(istic) distributed systems that require decentralised decision making.

The new theory is presented in Sects. 3–7: in Sect. 3, we present some preliminaries; in Sect. 4, we present a base calculus of global programs, without if/while-statements, but with a main theorem that covers both deadlock freedom and functional correctness; in Sect. 5 and Sect. 6, to support decentralised decision making, we extend the base calculus with if/while-statements; in Sect. 7, we present a calculus of local programs and projection. Thus, Sects. 4–6 cover the upper half of Fig. 1, while only Sect. 7 covers the bottom half.

Appendices appear in the full version of this paper [39]. Detailed definitions, auxiliary lemmas, main theorems, and proofs appear in a technical report [40].

### 2 Motivating Examples

To further motivate the usefulness and necessity of this paper's new theory, in this section, we present examples of real(istic) distributed systems that require decentralised decision making; see Appx. A [39] for additional examples. Throughout the section, we adopt a programmer's perspective and present only global programs (i.e., all construction and analysis activities that a programmer carries out manually in the workflow, happen in the upper half of Fig. 1).

Regarding the usefulness of the new theory, the following example shows that centralised decision making can be impractical (i.e., unnatural or unnecessary).

Example 1 (Chess simulation). From Sect. 1.2, recall the distributed system in which two processes enact roles Player1 and Player2 to simulate a game of chess.

```
1. P1.b:=board() ; P2.b:=board() ;
2. while P1.!done(b)
3.   (P1.CONTINUE ⊸ P2 ; G12 ;
4.    if P2.!done(b)
5.      (P2.CONTINUE ⊸ P1 ; G21)
6.      (P2.END ⊸ P1 ; skip)) ;
7. P1.END ⊸ P2

            (a) Centralised

1. P1.b:=board() ; P2.b:=board() ;
2. while P1.!done(b) ∧ P2.!done(b)
3.   (G12 ;
4.    if P1.!done(b) ∧ P2.!done(b)
5.      G21
6.      skip)

            (b) Decentralised
```
Fig. 2: Global programs for chess simulation (Exmp. 1)

Figure 2 shows two global programs: one that uses centralised decision making (at Player1 and Player2, in alternating order), and one that uses the new theory's decentralised decision making; both have auxiliary global programs G12 (Player1 is active, Player2 is passive; details omitted) and G21 (vice versa).

In Sect. 1.2, we argued for the usefulness of decentralised decision making in this example: the label communications in Fig. 2a are actually unnecessary.

Regarding the necessity of the new theory, the following example shows that centralised decision making can be impossible. In the example, notation G1 ∥ G2 prescribes an interleaving; it is used to express that the order in which G1 and G2 are executed does not matter (i.e., it is not intended to be multi-threading; there is no interaction between G1 and G2). By convention, sequencing binds stronger than interleaving. For instance, G1 ; G2 ∥ G3 should be read as (G1 ; G2) ∥ G3.

Example 2 (Probabilistic leader election in anonymous clique networks). Consider a distributed system in which k anonymous processes (i.e., they have no predefined identifiers) need to elect a leader among them. For clique networks (i.e., each process has a channel to each other process), a probabilistic version of Peleg's algorithm [45] can be used in the style of Itai and Rodeh [37,38]. The algorithm proceeds in rounds. In every round, every process picks a random identifier and sends it to every other process. If there is a unique maximal identifier, then the process that picked it becomes the leader. If not, another round follows.

Figure 3 shows a global program for k=3; it crucially relies on the new theory's decentralised decision making. We write r.[x1, . . . , xn]:=[e1, . . . , en] to abbreviate r.x1:=e1 ; · · · ; r.xn:=en, while we write p.e ⊸ [q1.x1, . . . , qn.xn] to abbreviate p.e ⊸ q1.x1 ; · · · ; p.e ⊸ qn.xn. First, the processes initialise five variables (lines 1–3): seed is used to pick random identifiers; id1, id2, and id3 are used to store and compare identifiers; leader indicates whether or not the process was elected. Next, the processes enter the loop (lines 4–7), each of whose iterations represents one round: in every iteration, every process increments its seed, picks a random identifier, and shares it. When the maximal identifier is unique, the processes exit the loop. One process marks itself as leader (lines 8–10).

The point of this example is that the probabilistic version of Peleg's algorithm for cliques—actually, any leader election algorithm—cannot faithfully be implemented using centralised decision making. The reason is that centralised decision

```
1. (P1.[seed, id1, id2, id3, leader]:=[-1, -1, -1, -1, false] ∥
2.  P2.[seed, id1, id2, id3, leader]:=[-1, -1, -1, -1, false] ∥
3.  P3.[seed, id1, id2, id3, leader]:=[-1, -1, -1, -1, false]) ;
4. while ⋀{r.!maxIsUnique(id1,id2,id3)}r∈{P1,P2,P3}
5.   (P1.seed:=seed+1 ; P1.id1:=random1(seed) ; P1.id1 ⊸ [P3.id1, P2.id1] ∥
6.    P2.seed:=seed+1 ; P2.id2:=random2(seed) ; P2.id2 ⊸ [P1.id2, P3.id2] ∥
7.    P3.seed:=seed+1 ; P3.id3:=random3(seed) ; P3.id3 ⊸ [P2.id3, P1.id3]) ;
8. if ⋀{r.id1 == max(id1,id2,id3)}r∈{P1,P2,P3} (P1.leader:=true) (skip) ;
9. if ⋀{r.id2 == max(id1,id2,id3)}r∈{P1,P2,P3} (P2.leader:=true) (skip) ;
10. if ⋀{r.id3 == max(id1,id2,id3)}r∈{P1,P2,P3} (P3.leader:=true) (skip)
```
Fig. 3: Global program for probabilistic leader election in anonymous clique networks (k=3), using decentralised decision making

making inherently requires the presence of a distinguished process (to evaluate a one-party condition and share the outcome). However, the motivation to run a leader election algorithm in the first place is that such a distinguished process is not yet agreed upon. That is, centralised decision making requires asymmetry of processes, whereas leader election algorithms require symmetry.
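To make the round structure of Fig. 3 concrete, here is a rough, centralised Haskell simulation of the algorithm's logic for k=3; the names (randomId, roundLoop) and the deterministic stand-in for random1, random2, and random3 are ours, and a faithful version would of course be distributed and genuinely random.

```haskell
import Data.List (sort)

-- Deterministic stand-in for random1/random2/random3: each process derives
-- its identifier from the round seed plus its own index (our assumption).
randomId :: Int -> Int -> Int
randomId i seed = (seed * 31 + i * 17) `mod` 8

-- Is the maximal identifier picked by exactly one process?
maxIsUnique :: [Int] -> Bool
maxIsUnique ids = case reverse (sort ids) of
  (m1 : m2 : _) -> m1 /= m2
  _             -> True

-- One round per iteration: bump the seed, pick identifiers, and repeat until
-- the maximum is unique; the process that picked it becomes the leader.
roundLoop :: Int -> (Int, [Bool])
roundLoop seed =
  let ids = [randomId i seed | i <- [1, 2, 3]]
  in if maxIsUnique ids
       then (seed, [x == maximum ids | x <- ids])  -- one leader flag per process
       else roundLoop (seed + 1)
```

Evaluating roundLoop 0 yields the first seed whose identifiers have a unique maximum, together with the leader flags of the three processes.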

### 3 Setting the Stage: Data and Conditions

The topic of interest in this paper is "processes that communicate", rather than "data that are communicated". For this reason, we assume that there exists some underlying calculus of data (Sect. 3.1), but we omit most of its details; they are orthogonal to this paper's contributions. On top of it, we adopt a logic to write preconditions, postconditions, and conditions in if/while-statements (Sect. 3.2).

#### 3.1 Data

Let R = {A, B, C, . . .} denote a universe of roles, ranged over by p, q, r. Let X = {x, y, z, . . .} denote a universe of variables, ranged over by x, y, z. Let V = {error, true, false, 0, 1, 2, . . .} denote a universe of values, ranged over by v (i.e., V contains at least a distinguished value error, booleans, and numbers, but we also use other data types in examples, including functions). Let E denote a universe of expressions, ranged over by e; it is induced by the following grammar:

e ::= r.x (role-qualified variable) | v | e1==e2 | e1&lt;e2 | e1&amp;&amp;e2 | !e | e1+e2 | · · · (compound expressions)

Let S = R ⇀ (X ⇀ V) denote a universe of states (i.e., partial functions from roles to partial functions from variables to values), ranged over by S; the idea is that every state has a separate section for every role of interest, to model disjoint memory spaces. Let eval : S × E → V denote a total evaluation function. For instance, eval{A↦{x↦5, y↦6}}(A.x+A.y) = 11. We assume that bogus expressions are evaluated to error. For instance, eval∅(1+true) = error.

Regarding terminology, we say that every role-qualified variable r.x is "local to r". If every role-qualified variable that occurs in e is local to r, then e is "local to r". Regarding notation, if e is local to r, then we often move all "r."-qualifiers that occur in e to the front. For instance, we write A.x+y instead of A.x+A.y.
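As an illustration, the following Haskell sketch renders states and the total evaluation function; the constructor names are ours, and the expression grammar is truncated to the cases shown above.

```haskell
import qualified Data.Map as M

type Role = String
type Var  = String

data Val = Error | B Bool | N Integer deriving (Eq, Show)

data Expr = RVar Role Var | Lit Val                      -- r.x and values
          | Eq2 Expr Expr | Lt Expr Expr | And2 Expr Expr
          | Not1 Expr | Add Expr Expr
  deriving (Eq, Show)

-- States: one separate variable store per role (disjoint memory spaces).
type State = M.Map Role (M.Map Var Val)

-- Total evaluation: bogus expressions evaluate to Error.
eval :: State -> Expr -> Val
eval s (RVar r x) = maybe Error id (M.lookup r s >>= M.lookup x)
eval _ (Lit v)    = v
eval s (Eq2 a b)  = case (eval s a, eval s b) of
                      (Error, _) -> Error
                      (_, Error) -> Error
                      (u, v)     -> B (u == v)
eval s (Lt a b)   = case (eval s a, eval s b) of
                      (N m, N n) -> B (m < n)
                      _          -> Error
eval s (And2 a b) = case (eval s a, eval s b) of
                      (B p, B q) -> B (p && q)
                      _          -> Error
eval s (Not1 a)   = case eval s a of { B p -> B (not p); _ -> Error }
eval s (Add a b)  = case (eval s a, eval s b) of
                      (N m, N n) -> N (m + n)
                      _          -> Error

-- eval {A -> {x -> 5, y -> 6}} (A.x + A.y) == N 11
example :: Val
example = eval (M.fromList [("A", M.fromList [("x", N 5), ("y", N 6)])])
               (Add (RVar "A" "x") (RVar "A" "y"))
```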

#### 3.2 Conditions

We adopt the following basic logic over expressions in E. Let Ψ denote a universe of formulas, ranged over by φ, χ, ψ; it is induced by the following grammar:

$$\phi, \chi, \psi ::= e \mid \neg\psi \mid \psi_1 \land \psi_2 \mid \forall\psi$$

Informally, given state S, formulas have the following meaning relative to S: expression e is true iff it evaluates to true in S; ¬ψ is true iff ψ is not true in S; ψ1 ∧ ψ2 is true iff both ψ1 and ψ2 are true in S; and ∀ψ is true iff ψ is true in every state (i.e., regardless of S).
Formally, an interpretation function maps formulas to the sets of states in which they are true, denoted by ⟦−⟧; it is induced by the following equations:

$$\llbracket e \rrbracket = \{\mathcal{S} \mid \mathsf{eval}_{\mathcal{S}}(e) = \mathsf{true}\} \qquad \llbracket \neg\psi \rrbracket = \mathbb{S} \setminus \llbracket \psi \rrbracket \qquad \llbracket \psi_1 \land \psi_2 \rrbracket = \llbracket \psi_1 \rrbracket \cap \llbracket \psi_2 \rrbracket \qquad \llbracket \forall\psi \rrbracket = \begin{cases} \mathbb{S} & \text{if } \llbracket\psi\rrbracket = \mathbb{S} \\ \emptyset & \text{otherwise} \end{cases}$$

Regarding terminology, if every expression that occurs in ψ is local to r, then ψ is "local to r"; if so, the truth of ψ can be checked at r. Regarding notation, we often write ⋀{ψr}r∈{r1,...,rn} instead of ψr1 ∧ · · · ∧ ψrn if ψr is local to r for every r ∈ {r1, . . . , rn}. Furthermore, we write ψ1 ∨ ψ2 and ψ1 → ψ2 for disjunction and implication. Finally, we write ψ1 ≡ ψ2 instead of ⟦ψ1⟧ = ⟦ψ2⟧.
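Continuing the previous sketch, formulas and their interpretation can be rendered as follows; because ⟦ψ⟧ ranges over all states, this sketch checks truth in a given state and approximates the ∀ modality over a finite universe of states supplied by the caller (an assumption on our part, to keep the sketch runnable).

```haskell
-- Formulas over expressions, as in Sect. 3.2; disjunction and implication
-- are derived from negation and conjunction, as in the paper.
data Formula = FExpr Expr | FNot Formula | FAnd Formula Formula | FAll Formula

-- holds uni s psi: is psi true in state s, with "always" (FAll) checked
-- against the finite universe uni of states?
holds :: [State] -> State -> Formula -> Bool
holds _   s (FExpr e)  = eval s e == B True
holds uni s (FNot f)   = not (holds uni s f)
holds uni s (FAnd f g) = holds uni s f && holds uni s g
holds uni _ (FAll f)   = all (\s' -> holds uni s' f) uni  -- true in every state
```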

### 4 Global Programs: Base Calculus

To gently introduce the main components of the new theory, in this section, we present a base calculus of global programs, without if/while-statements, but with a main theorem that covers both deadlock freedom and functional correctness.

Initially, we present the syntax and semantics (Sect. 4.1); subsequently, we present a predicate transformer (Sect. 4.2); finally, we present the main theorem, which relies on the predicate transformer (Sect. 4.3). In the next sections, we extend the base calculus to support decentralised decision making.

#### 4.1 Syntax and Semantics

Let Γ and G denote universes of global actions and global programs, ranged over by γ and G; they are induced by the following grammar:

γ ::= q.y:=e | p.e ⊸ q.y        G ::= skip | γ | G1 ; G2 | G1 ∥ G2

Informally, these grammar elements have the following meaning:

– Global program skip prescribes no behaviour.
– Global action q.y:=e prescribes an assignment: e is evaluated at q and stored in variable y at q.
– Global action p.e ⊸ q.y prescribes a value communication, as in Sect. 1.1.
– Global program G1 ; G2 prescribes a weak sequence of G1 and G2: global actions in G2 that are independent of G1 may be executed before G1 has terminated. For instance, in A.x:=5 ; B.y:=6, the assignment at Bob is independent of the assignment at Alice, so they may be executed out-of-order. In contrast, in A.x:=5 ; A.x+1 ⊸ B.y, the communication from Alice to Bob depends on the assignment at Alice, so they must be executed in-order. In general, when two global actions have disjoint subjects (i.e., participating roles), they are considered independent and may be executed out-of-order.

Out-of-order execution of global actions with disjoint subjects is common in choreographic programming: it was first introduced by Carbone and Montesi to deal with latent concurrency among roles in global action sequences [10].

– Global program G1 ∥ G2 prescribes an interleaving of G1 and G2.

Formally, we define the operational semantics of global programs at two "layers".

(1) The "top layer" consists of an abstract termination relation, denoted by ↓, and an abstract labelled reduction relation, denoted by → in the style of process algebra (e.g., [2]). More precisely, G ↓ means that G can terminate, while G ψ,γ −−→ G<sup>0</sup> means that G can reduce to G<sup>0</sup> when ψ is true (i.e., conditionally) by executing γ. For instance, the following abstract execution is possible:

$$\mathsf{A.x:=5}; \mathsf{A.x+1} \multimap \mathsf{B.y} \xrightarrow{\mathsf{true}, \mathsf{A.x:=5}} \mathsf{skip}; \mathsf{A.x+1} \multimap \mathsf{B.y} \xrightarrow{\mathsf{true}, \mathsf{A.x+1} \multimap \mathsf{B.y}} \mathsf{skip}; \mathsf{skip} \downarrow$$

First, the global program reduces by executing an assignment; next, it reduces by executing a communication; next, it terminates. For simplicity, skips are not automatically cleaned up by the reduction rules (but they could be).

Relations ↓ and → are induced by the rules in Fig. 4a. Most rules are standard [2]. Notably, in this section, every reduction is unconditional (i.e., labelled with true) due to rule [→-Act]. The only special rule is rule [→-Seq2]: it states that if G2 can reduce to G′2 by executing γ (right premise), and if γ is independent of G1 (left premise), then G1 ; G2 can reduce accordingly (conclusion). We note that independence is defined in terms of disjointness of subjects, as explained above. For instance, the following abstract out-of-order execution is possible:


(a) Base calculus

$$\frac{\psi = \bigwedge\{e_r\}_{r\in R} \quad \gamma = 1^R}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2 \xrightarrow{\psi,\gamma} G_1}\ \text{[→-If1]} \qquad \frac{\psi = \bigwedge\{\neg e_r\}_{r\in R} \quad \gamma = 2^R}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2 \xrightarrow{\psi,\gamma} G_2}\ \text{[→-If2]}$$

$$\frac{\psi = \bigwedge\{e_r\}_{r\in R} \quad \gamma = 1^R}{\textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G \xrightarrow{\psi,\gamma} G \,;\, \textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G}\ \text{[→-While1]}$$

$$\frac{\psi = \bigwedge\{\neg e_r\}_{r\in R} \quad \gamma = 2^R}{\textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G \xrightarrow{\psi,\gamma} \mathbf{skip}}\ \text{[→-While2]}$$

(b) Extension with if/while-statements – explained in Sect. 5

$$\frac{R = R_1 \cup R_2 \qquad R_1 \neq \emptyset \text{ implies } G_1 \downarrow \qquad R_2 \neq \emptyset \text{ implies } G_2 \downarrow}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \downarrow}\ \text{[↓-NIf]}$$

$$\frac{r \in R \setminus (R_1 \cup R_2) \qquad \psi = e_r \qquad \gamma = 1^{\{r\}}}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1 \cup \{r\}}\ G_2|_{R_2}}\ \text{[→-NIf1]}$$

$$\frac{r \in R \setminus (R_1 \cup R_2) \qquad \psi = \neg e_r \qquad \gamma = 2^{\{r\}}}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2 \cup \{r\}}}\ \text{[→-NIf2]}$$

$$\frac{G_1 \xrightarrow{\psi,\gamma} G_1' \qquad \mathsf{subj}(\gamma) \subseteq R_1 \setminus R_2}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1'|_{R_1}\ G_2|_{R_2}}\ \text{[→-NIf3]}$$

$$\frac{G_2 \xrightarrow{\psi,\gamma} G_2' \qquad \mathsf{subj}(\gamma) \subseteq R_2 \setminus R_1}{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2|_{R_2} \xrightarrow{\psi,\gamma} \textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1|_{R_1}\ G_2'|_{R_2}}\ \text{[→-NIf4]}$$

$$\frac{\textbf{if } \bigwedge\{e_r\}_{r\in R}\ (G \,;\, \textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset})|_{\emptyset}\ \ \mathbf{skip}|_{\emptyset} \xrightarrow{\psi,\gamma} G'}{\textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset} \xrightarrow{\psi,\gamma} G'}\ \text{[→-NWhile]}$$

(c) Extension with non-blocking if/while-statements – explained in Sect. 6

Fig. 4: Abstract operational semantics of global programs ("top layer")

$$\frac{G \downarrow}{(G, \mathcal{S}) \downarrow}\ [{\downarrow}] \qquad \frac{G \xrightarrow{\psi,\gamma} G' \qquad \mathcal{S} \in \llbracket\psi\rrbracket \qquad \gamma^{\mathsf{c}} = \mathsf{eval}_{\mathcal{S}}(\gamma)}{(G, \mathcal{S}) \xrightarrow{\gamma^{\mathsf{c}}} (G', \mathsf{effect}(\gamma^{\mathsf{c}}, \mathcal{S}))}\ [{\to}]$$

$$\mathsf{effect}(q.y{:=}v, \mathcal{S}) = \mathcal{S}[v/q.y] \qquad \mathsf{effect}(p.v \multimap q.y, \mathcal{S}) = \mathcal{S}[v/q.y]$$

$$\mathcal{S}[v/q.y] = \{r \mapsto \mathcal{S}(r) \mid q \neq r\} \cup \{q \mapsto (\{x \mapsto \mathcal{S}(q)(x) \mid x \neq y\} \cup \{y \mapsto v\})\}$$

Fig. 5: Concrete operational semantics of global programs ("bottom layer")

$$\mathsf{A.x{:=}5}\ ;\ \mathsf{B.y{:=}6} \xrightarrow{\mathsf{true},\ \mathsf{B.y{:=}6}} \mathsf{A.x{:=}5}\ ;\ \mathsf{skip} \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \mathsf{skip}\ ;\ \mathsf{skip} \downarrow$$

(2) The "bottom layer" consists of a concrete termination predicate, denoted by ↓ (same symbol as before), and a concrete labelled reduction relation, denoted by → (ditto). The idea is that the bottom layer enriches the top layer by taking into account states, in terms of configurations of the form (G, S). More precisely, (G, S) ↓ means that G can terminate in S, while (G, S) γ c −→ (G<sup>0</sup> , S 0 ) means that G can reduce to G<sup>0</sup> by executing γ c in S to obtain S 0 . We write γ <sup>c</sup>—with a superscript "c"—to indicate that it is a "concrete" global action in which every expression has been evaluated to a value (using S). For instance, the following concrete execution is possible:

$$\begin{array}{l} (\mathsf{A.x{:=}5}\ ;\ \mathsf{A.x{+}1} \multimap \mathsf{B.y},\ \{\mathsf{A} \mapsto \{\mathsf{x} \mapsto 0\}, \mathsf{B} \mapsto \{\mathsf{y} \mapsto 0\}\}) \\ \quad \xrightarrow{\mathsf{A.x{:=}5}} (\mathsf{skip}\ ;\ \mathsf{A.x{+}1} \multimap \mathsf{B.y},\ \{\mathsf{A} \mapsto \{\mathsf{x} \mapsto 5\}, \mathsf{B} \mapsto \{\mathsf{y} \mapsto 0\}\}) \\ \quad \xrightarrow{\mathsf{A.6} \multimap \mathsf{B.y}} (\mathsf{skip}\ ;\ \mathsf{skip},\ \{\mathsf{A} \mapsto \{\mathsf{x} \mapsto 5\}, \mathsf{B} \mapsto \{\mathsf{y} \mapsto 6\}\}) \downarrow \end{array}$$

Relations ↓ and → are induced by the rules in Fig. 5. Rule [↓] states that if G can terminate, then so can (G, S), regardless of S. More interestingly, rule [→] states that if G can reduce to G′ when ψ is true by executing γ (left premise), and if ψ is indeed true in S (middle premise), and if γ^c is the "concretisation" of γ such that every expression is first evaluated using S (right premise), then (G, S) can reduce accordingly, and the effect of γ^c is applied to S (conclusion); the latter means that a variable is bound to a new value in S, formalised using "substitution notation". For instance (cf. the second reduction in the concrete execution above), if S = {A ↦ {x ↦ 5}, B ↦ {y ↦ 0}}, then effect(evalS(A.x+1 ⊸ B.y), S) = effect(A.6 ⊸ B.y, S) = {A ↦ {x ↦ 5}, B ↦ {y ↦ 6}}.
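Continuing the sketch from Sect. 3, the effect function of Fig. 5 amounts to a single store update at the assigned (or receiving) role; the constructor names are again ours.

```haskell
-- Concrete global actions: every expression has already been evaluated.
data ActC = AssignC Role Var Val      -- q.y := v
          | CommC Role Val Role Var   -- p.v -o q.y

-- S[v/q.y]: bind y to v in q's section of the state, leaving the rest as-is.
update :: State -> Role -> Var -> Val -> State
update s q y v = M.insert q (M.insert y v (M.findWithDefault M.empty q s)) s

effect :: ActC -> State -> State
effect (AssignC q y v) s = update s q y v
effect (CommC _ v q y) s = update s q y v  -- only the receiver's store changes
```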

Our formalisation of the operational semantics has two novelties:




$$\frac{\checkmark_R(G_1) \qquad \checkmark_R(G_2) \qquad R_1, R_2 \subseteq R \qquad R_1 \neq \emptyset \text{ implies } R_2 = \emptyset \qquad R_2 \neq \emptyset \text{ implies } R_1 = \emptyset}{\checkmark_R(\textbf{if } \bigwedge\{e_r\}_{r \in R}\ G_1|_{R_1}\ G_2|_{R_2})}\ \text{[✓-NIf]}$$

(c) Extension with non-blocking if/while-statements – explained in Sect. 6

Fig. 6: Well-formedness rules of global programs
We end this subsection with a well-formedness relation, denoted by ✓, to check a few basic syntactic properties of global programs; it is induced by the rules in Fig. 6a. For now, there are two aims (to be extended in subsequent sections for if/while-statements):


#### 4.2 Predicate Transformer

In the next subsection, the main theorem for global programs will be as follows (informally): if the global program is well-formed, and if the precondition is true in the initial state, then deadlock freedom and functional correctness are provided. In this subsection, we present a technique to automatically compute preconditions such that the main theorem can indeed be formulated and proved.


$$\begin{array}{c} \phi(\mathbf{skip}, \chi) = \chi \qquad \phi(q.y{:=}e, \chi) = \chi[e/q.y] \qquad \phi(p.e \multimap q.y, \chi) = \chi[e/q.y] \\[1ex] \phi(G_1 \,;\, G_2, \chi) = \phi(G_1, \phi(G_2, \chi)) \qquad \phi(G_1 \parallel G_2, \chi) = \begin{cases} \phi(G_1, \phi(G_2, \chi)) & \text{if } G_1 \mathbin{\#} G_2 \\ \mathsf{false} & \text{otherwise} \end{cases} \end{array}$$

(a) Base calculus

$$\begin{array}{l} \phi(\textbf{if } \bigwedge\{e_r\}_{r\in R}\ G_1\ G_2,\ \chi) \;=\; (\bigwedge\{e_r\}_{r\in R} \rightarrow \phi(G_1, \chi)) \,\land\, (\bigwedge\{\neg e_r\}_{r\in R} \rightarrow \phi(G_2, \chi)) \,\land\, (\bigwedge\{e_{r_1} \rightarrow e_{r_2}\}_{r_1, r_2 \in R}) \\[1.5ex] \phi(\textbf{while } \bigwedge\{e_r\}_{r\in R}\ \{\psi_{\mathrm{inv}}\}\ G,\ \chi) \;=\; \psi_{\mathrm{inv}} \land \forall\big(\psi_{\mathrm{inv}} \rightarrow \big((\bigwedge\{e_r\}_{r\in R} \rightarrow \phi(G, \psi_{\mathrm{inv}})) \,\land\, (\bigwedge\{\neg e_r\}_{r\in R} \rightarrow \chi) \,\land\, (\bigwedge\{e_{r_1} \rightarrow e_{r_2}\}_{r_1, r_2 \in R})\big)\big) \end{array}$$

(b) Extension with if/while-statements – explained in Sect. 5

$$\begin{aligned} \phi(\textbf{if } \bigwedge\{e_r\}_{r \in R}\ G_1|_{R_1}\ G_2|_{R_2},\ \chi) &= \begin{cases} \phi(\textbf{if } \bigwedge\{e_r\}_{r \in R}\ G_1\ G_2,\ \chi) & \text{if } R_1 = \emptyset = R_2 \\ \phi(G_2, \chi) \land \bigwedge\{\neg e_r\}_{r \in R \setminus R_2} & \text{if } R_1 = \emptyset \neq R_2 \\ \phi(G_1, \chi) \land \bigwedge\{e_r\}_{r \in R \setminus R_1} & \text{if } R_1 \neq \emptyset = R_2 \\ \mathsf{false} & \text{if } R_1 \neq \emptyset \neq R_2 \end{cases} \\ \phi(\textbf{while } \bigwedge\{e_r\}_{r \in R}\ \{\psi_{\mathrm{inv}}\}\ G|_{\emptyset},\ \chi) &= \phi(\textbf{while } \bigwedge\{e_r\}_{r \in R}\ \{\psi_{\mathrm{inv}}\}\ G,\ \chi) \end{aligned}$$

(c) Extension with non-blocking if/while-statements – explained in Sect. 6

Fig. 7: Predicate transformer to compute preconditions

Let φ denote a predicate transformer function; it is defined by the equations in Fig. 7a, where χ[e/q.y] denotes substitution of e for q.y in χ. In words, φ consumes a global program G and a postcondition χ as input, and it produces a precondition φ(G, χ) as output. The idea is that φ is sound: if φ(G, χ) is true in the initial state, then after executing G, χ is true in the final state. Essentially, Fig. 7a is an adaptation of Dijkstra's predicate transformer to compute weakest preconditions for sequential code [26], denoted by wp. More precisely:

– For skip, q.y:=e, and p.e ⊸ q.y, the definition of φ mirrors that of wp: the precondition is the postcondition with the assigned or communicated expression substituted for the updated variable (i.e., χ[e/q.y]). Figure 8a shows an example.
$$\phi(\mathsf{A.x{+}1} \multimap \mathsf{B.y},\ \mathsf{A.x{+}B.y{==}11}) = \mathsf{A.x{+}A.x{+}1{==}11} \equiv \mathsf{A.x{+}A.x{==}10} \equiv \mathsf{A.x{==}5}$$

(a) Communication

$$\phi(\gamma \,;\, \mathsf{A.x{+}1} \multimap \mathsf{B.y},\ \mathsf{A.x{+}B.y{==}11}) = \phi(\gamma,\ \phi(\mathsf{A.x{+}1} \multimap \mathsf{B.y},\ \mathsf{A.x{+}B.y{==}11})) = \phi(\gamma,\ \mathsf{A.x{+}A.x{+}1{==}11}) = \mathsf{5{+}5{+}1{==}11} \equiv \mathsf{true}$$

(b) Sequence

$$\begin{array}{l} \phi(\gamma \,;\, \textbf{if } (\mathsf{A.x{==}5} \land \mathsf{B.y{==}6})\ \mathsf{B.y{:=}7}\ \mathbf{skip},\ \chi) \\ \quad = \phi(\gamma,\ \phi(\textbf{if } (\mathsf{A.x{==}5} \land \mathsf{B.y{==}6})\ \mathsf{B.y{:=}7}\ \mathbf{skip},\ \chi)) \\ \quad = \phi(\gamma,\ (\mathsf{A.x{==}5} \land \mathsf{B.y{==}6} \rightarrow \phi_1) \land (\neg\mathsf{A.x{==}5} \land \neg\mathsf{B.y{==}6} \rightarrow \phi_2) \land (\mathsf{A.x{==}5} \leftrightarrow \mathsf{B.y{==}6})) \\ \quad = (\mathsf{5{==}5} \land \mathsf{B.y{==}6} \rightarrow \phi_1[\mathsf{5}/\mathsf{A.x}]) \land (\neg\mathsf{5{==}5} \land \neg\mathsf{B.y{==}6} \rightarrow \phi_2[\mathsf{5}/\mathsf{A.x}]) \land (\mathsf{5{==}5} \leftrightarrow \mathsf{B.y{==}6}) \\ \quad \equiv (\mathsf{B.y{==}6} \rightarrow \phi_1[\mathsf{5}/\mathsf{A.x}]) \land (\mathsf{false} \rightarrow \phi_2[\mathsf{5}/\mathsf{A.x}]) \land \mathsf{B.y{==}6} \\ \quad \equiv \phi_1[\mathsf{5}/\mathsf{A.x}] \land \mathsf{B.y{==}6} \end{array}$$

(c) Conditional choice – explained in Sect. 5. Let φ1 = φ(B.y:=7, χ) and φ2 = φ(skip, χ).

Fig. 8: Examples of φ. Let γ = A.x:=5.

– For G1 ; G2, as in the definition of wp, the precondition is computed backwards: φ(G1 ; G2, χ) = φ(G1, φ(G2, χ)), first through G2 and then through G1. Figure 8b shows an example: if true is true (i.e., unconditionally), then after executing the global program, the sum of A.x and B.y is 11.

However, unlike Dijkstra's setting (i.e., strong sequencing), there is a caveat in our setting (i.e., weak sequencing): G1 and G2 may be executed out-of-order. This makes proving the soundness of φ more challenging than in Dijkstra's work (notably, establishing the correspondence between the backwards computation of a precondition and the forwards execution of the sequence).

– For G1 ∥ G2 (absent in Dijkstra's work), the definition of φ is inspired by the notion of disjoint parallelism in Hoare logic [33,1]. There are two cases. If G1 and G2 are non-interfering, which means that the variables that occur in G1 and G2 are disjoint, denoted as G1 # G2, then the order in which G1 and G2 are executed does not affect the truth/falsehood of the postcondition; in that case, a precondition is computed by assuming, arbitrarily, in-order execution of G1 and G2 (but any other interleaving would work as well). If G1 and G2 are interfering, then φ yields false, so no state satisfies the precondition. This is sound but not complete (i.e., there exist deadlock-free and functionally-correct global programs for which the computed precondition is nevertheless false). For our present purposes, however, φ is "complete enough" (e.g., all examples in Sect. 2 and Appx. A [39] are supported).<sup>4</sup>
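The following Haskell sketch continues the one from Sect. 3 and renders our reading of φ for the base calculus; the equations are paraphrased from the prose above and from Fig. 8, so treat them as an approximation rather than the authors' verbatim definition.

```haskell
import qualified Data.Set as S

data Act    = Assign Role Var Expr     -- q.y := e
            | Comm Role Expr Role Var  -- p.e -o q.y
data Global = GSkip | GAct Act | GSeq Global Global | GPar Global Global

-- All role-qualified variables read or written by a global program.
gvars :: Global -> S.Set (Role, Var)
gvars GSkip                 = S.empty
gvars (GAct (Assign q y e)) = S.insert (q, y) (evars e)
gvars (GAct (Comm _ e q y)) = S.insert (q, y) (evars e)
gvars (GSeq g1 g2)          = gvars g1 `S.union` gvars g2
gvars (GPar g1 g2)          = gvars g1 `S.union` gvars g2

evars :: Expr -> S.Set (Role, Var)
evars (RVar r x) = S.singleton (r, x)
evars (Lit _)    = S.empty
evars (Eq2 a b)  = evars a `S.union` evars b
evars (Lt a b)   = evars a `S.union` evars b
evars (And2 a b) = evars a `S.union` evars b
evars (Not1 a)   = evars a
evars (Add a b)  = evars a `S.union` evars b

-- chi[e/q.y]: substitute e for the role-qualified variable q.y.
substF :: Role -> Var -> Expr -> Formula -> Formula
substF q y e (FExpr a)  = FExpr (substE q y e a)
substF q y e (FNot f)   = FNot (substF q y e f)
substF q y e (FAnd f g) = FAnd (substF q y e f) (substF q y e g)
substF q y e (FAll f)   = FAll (substF q y e f)

substE :: Role -> Var -> Expr -> Expr -> Expr
substE q y e a@(RVar r x) = if r == q && x == y then e else a
substE _ _ _ a@(Lit _)    = a
substE q y e (Eq2 a b)    = Eq2 (substE q y e a) (substE q y e b)
substE q y e (Lt a b)     = Lt (substE q y e a) (substE q y e b)
substE q y e (And2 a b)   = And2 (substE q y e a) (substE q y e b)
substE q y e (Not1 a)     = Not1 (substE q y e a)
substE q y e (Add a b)    = Add (substE q y e a) (substE q y e b)

phi :: Global -> Formula -> Formula
phi GSkip                 chi = chi
phi (GAct (Assign q y e)) chi = substF q y e chi     -- as wp for assignment
phi (GAct (Comm _ e q y)) chi = substF q y e chi     -- the receiver's y gets e
phi (GSeq g1 g2)          chi = phi g1 (phi g2 chi)  -- backwards, as wp
phi (GPar g1 g2)          chi
  | S.null (gvars g1 `S.intersection` gvars g2) = phi g1 (phi g2 chi)
  | otherwise = FExpr (Lit (B False))  -- interference: sound but incomplete
```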

The following proposition follows almost directly from the definitions. It states that if φ(γ, χ) is true in S, then χ is true in S′, after executing γ.

Proposition 1. If S ∈ ⟦φ(γ, χ)⟧ and S′ = effect(evalS(γ), S), then S′ ∈ ⟦χ⟧.

<sup>4</sup> Even though φ requires non-interference, interleaving (∥) offers additional expressive power beyond weak sequencing (;). This is because non-interference (for ∥) is defined in terms of disjointness of variables, whereas independence (for ;) is defined in terms of disjointness of roles. For instance, A.x:=5 and A.y:=6 are non-interfering, but not independent. Consequently, A.x:=5 ∥ A.y:=6 allows the assignments to happen in any order, whereas A.x:=5 ; A.y:=6 requires them to happen from left to right.

#### 4.3 Deadlock Freedom and Functional Correctness

The aim of this subsection is to formulate and prove the main theorem for global programs, which covers both deadlock freedom and functional correctness.

To give a uniform presentation across Sects. 4–6, we formulate the lemmas and theorem for the base calculus in this section in such a way that they are reusable, verbatim, for the extensions in the next sections. As a result, some formulations are more restrictive than necessary for the base calculus, but this is fine.

The first two lemmas pertain to φ's soundness. The first lemma states that if G is well-formed and can terminate, then the truth of φ(G, χ) implies the truth of χ (i.e., the postcondition has been brought about). The second lemma states that if G is well-formed and can reduce to G′ when ψ is true by executing γ, then the truth of φ(G, χ) ∧ ψ implies the truth of χ, after executing γ ; G′ (i.e., the postcondition is being brought about by executing γ).

Lemma 1. If ✓R(G) and G ↓, then ⟦φ(G, χ)⟧ ⊆ ⟦χ⟧.

Proof. By induction on the derivation of G ↓.

Lemma 2. If ✓R(G) and G −ψ,γ→ G′, then ⟦φ(G, χ) ∧ ψ⟧ ⊆ ⟦φ(γ ; G′, χ)⟧.

Proof. By induction on the derivation of G −ψ,γ→ G′. The interesting case is rule [→-Seq2], with G = G1 ; G2. We need to prove the following inclusions:

$$\llbracket \phi(G_1, \phi(G_2, \chi)) \land \psi \rrbracket \subseteq \llbracket \phi(G_1 \,;\, \gamma \,;\, G_2', \chi) \rrbracket \subseteq \llbracket \phi(\gamma \,;\, G_1 \,;\, G_2', \chi) \rrbracket$$

The first inclusion can be proved using the induction hypothesis and G2 −ψ,γ→ G′2 (right premise of rule [→-Seq2]). The second inclusion can be proved using subj(G1) ∩ subj(γ) = ∅ (left premise) and ✓R(G), to establish that the variables that occur in G1 and γ are disjoint as well (i.e., G1 and γ are non-interfering).

The next lemma states that well-formedness is preserved by reduction.

Lemma 3. If ✓R(G) and ⟦φ(G, χ)⟧ ≠ ∅ and G −ψ,γ→ G′, then ✓R(G′).

Proof. By induction on the derivation of G −ψ,γ→ G′.

The next lemma states that if G is well-formed, and if φ(G, χ) is true in S, then either G can terminate, or G can reduce to some G′ (i.e., G is not stuck).

Lemma 4. If ✓R(G) and S ∈ ⟦φ(G, χ)⟧, then either G ↓, or there exist ψ, γ, G′ such that G −ψ,γ→ G′ and S ∈ ⟦ψ⟧.

Proof. By induction on the derivation of ✓R(G).

Now, our main theorem for global programs states that if G is well-formed, and if φ(G, χ) is true in S, and if (G, S) has a sequence of reductions to (G†, S†), then either (G†, S†) can terminate and χ is true in S†, or (G†, S†) can reduce. Thus, an execution of (G, S) consists of either finitely many reductions, followed by termination, or infinitely many (i.e., deadlock freedom); in the former case, upon termination, the postcondition is true (i.e., functional correctness).

Theorem 3. If ✓R(G) and S ∈ ⟦φ(G, χ)⟧ and (G, S) −γ^c_1→ · · · −γ^c_n→ (G†, S†), then:
1. Either (G†, S†) ↓, or there exist γ^c, G‡, S‡ such that (G†, S†) −γ^c→ (G‡, S‡).
2. If (G†, S†) ↓, then S† ∈ ⟦χ⟧.

Proof. First, we inductively apply Prop. 1 and Lems. 2–3 along the reduction sequence to prove ✓R(G†) and S† ∈ ⟦φ(G†, χ)⟧. Next, we apply Lem. 4 to prove deadlock freedom and Lem. 1 to prove functional correctness, using Fig. 5.

### 5 Global Programs: If/While-Statements

In the previous section, to gently introduce the main components of our theory, we presented a base calculus of global programs. In this section, we extend it with if/while-statements to support decentralised decision making.

#### 5.1 Syntax and Semantics

Recall that Γ and G denote universes of global actions and global programs, ranged over by γ and G; they are induced by the following extended grammar:

γ ::= · · · (page 8) | i^R        G ::= · · · (page 8) | if ⋀{er}r∈R G1 G2 | while ⋀{er}r∈R {ψinv} G

Informally, the new grammar elements have the following meaning:

– Global program if ⋀{er}r∈R G1 G2 prescribes a conditional choice of G1 and G2, with a multiparty condition: every role r ∈ R privately evaluates its own conjunct er. Three cases can be distinguished:
	- Case A: If er is true for every r ∈ R, then everyone enters G1.
	- Case B: If er is false for every r ∈ R, then everyone enters G2.
	- Case C: If er1 is true, but er2 is false, for some r1, r2 ∈ R, then someone enters G1, but someone else enters G2.

Cases A and B are the "good" situations in which the roles are unanimous. In contrast, case C is the "bad" situation that leads to deadlock.

For simplicity, in this section, we assume that roles make private decisions together (i.e., at the same time), using two synchronisation barriers. Intuitively, in operational terms, this means that for every role r: first, it privately evaluates its own conjunct er; next, it reaches one of two barriers, depending on the truth/falsehood of er; next, it waits until every other role has privately evaluated a conjunct and reached a barrier as well. In cases A and B, all roles eventually reach the same barrier, so it breaks, and all roles enter one branch together; in case C, the roles never reach the same barrier—they are divided—so neither one of them breaks, and the roles get stuck.

(We note that barriers are often undesirable in distributed systems. In the next section, therefore, we also extend the base calculus with barrier-free if/while-statements. However, as the technical details of the barrier-free versions are considerably more complicated than the barrier-based versions, but partly rely on similar principles, we present the barrier-based ones first.) An if-statement cannot terminate: a decision must be made.

– Global program while ⋀{er}r∈R {ψinv} G prescribes a conditional loop of G. The idea is similar to if ⋀{er}r∈R G1 G2, including non-termination. Condition ψinv is the loop invariant; it does not affect the operational semantics of while-statements, but it is used to compute preconditions.

Formally, for if/while-statements, → is induced by the rules in Fig. 4b (page 10). The presence of rules [→-If1] and [→-If2] corresponds to cases A and B, whereas the absence of other rules corresponds to case C (i.e., there are no reductions when roles are not unanimous). For instance, when G = A.x:=5 ; if (A.x==5 ∧ B.y==6) B.y:=7 skip, the following two abstract executions are possible:

$$G \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \bullet \xrightarrow{\mathsf{A.x{==}5} \land \mathsf{B.y{==}6},\ 1^{\{\mathsf{A},\mathsf{B}\}}} \bullet \xrightarrow{\mathsf{true},\ \mathsf{B.y{:=}7}} \bullet \downarrow \qquad G \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \bullet \xrightarrow{\neg\mathsf{A.x{==}5} \land \neg\mathsf{B.y{==}6},\ 2^{\{\mathsf{A},\mathsf{B}\}}} \bullet \downarrow$$

First, G reduces by executing an assignment at Alice (both executions); next, it reduces by executing private decisions at Alice and Bob together to enter the then-branch (left execution) or else-branch (right); next, in the former case, it reduces by executing an assignment at Bob and terminates, whereas in the latter case, it terminates. Regarding concrete executions, two situations are possible:

– If B.y is initially 6, then the left abstract execution can induce a deadlock-free concrete one: after the first concrete reduction, A.x is 5, and B.y is still 6, so A.x==5 ∧ B.y==6 is true (i.e., case A, unanimity), enabling the sequel.
– If B.y is initially not 6, then neither abstract execution can induce a deadlock-free concrete one: after the first concrete reduction, A.x is 5, but B.y is still not 6, so both A.x==5 ∧ B.y==6 and ¬A.x==5 ∧ ¬B.y==6 are false (i.e., case C, non-unanimity), disabling the sequel and causing a deadlock.

This example shows that we need a technique to infer that B.y must initially be 6 to ensure unanimity for deadlock freedom; we present it in the next subsection.

We end this subsection with an extension of ✓ for if/while-statements; it is induced by the rules in Fig. 6b (page 12). There is a third aim now (cf. page 12):

3. Rules [✓-If] and [✓-While] ensure that every role (in R) has its own conjunct in every multiparty condition. The idea is that every role always needs to know which branch to enter, so it must participate in every decision.<sup>5,6</sup>

<sup>5</sup> Well-formedness (every role has its own conjunct) and the grammar of if/while-statements (every conjunct is local to a role) are jointly similar to the variable-knowledge condition of Neykova et al. [43]; they ensure that formulas are projectable (Fig. 10b).

<sup>6</sup> It is possible to encode choices in which only a few (not all) roles participate using extra variables; the idea is outlined at the end of Appx. A [39]. However, this encoding is not always practical/user-friendly. We therefore aim to offer "native" support for such choices too, using a form of merging [8,9,10]; see also Appx. D [39].

#### 5.2 Predicate Transformer

We proceed with an extension of φ for if/while-statements; it is defined by the equations in Fig. 7b (page 13). As before, the definition of φ for if/while-statements is an adaptation of the definition of wp (i.e., Dijkstra's original predicate transformer [26]), but it differs on crucial points as well. More precisely:

– For if ⋀{er}r∈R G1 G2, the definition of φ has three conjuncts. The first (resp. second) conjunct states that if every er is true (resp. false), then the precondition of the then-branch (resp. else-branch) is true. This is similar to the definition of wp, and it includes case A (resp. B) on page 16.

The third conjunct states that every er1 must imply every er2 (i.e., they are either all true or all false); this is new relative to the definition of wp, and it excludes case C on page 16 (i.e., if the precondition computed by φ is true, then case C will never arise). The following proposition makes this precise.

#### Proposition 2. ⟦⋀{er1 → er2}r1,r2∈R⟧ ⊆ ⟦⋀{er}r∈R ∨ ⋀{¬er}r∈R⟧.

Thus, φ accumulates logical requirements not only to ensure the truth of the postcondition for functional correctness (i.e., the first and second conjunct), but also to ensure unanimity for deadlock freedom (i.e., the third conjunct). Figure 8c (page 14) shows an example, featuring the same global program as G on page 17: if φ1[5/A.x] is true (to ensure the truth of χ) and B.y is 6 (to ensure unanimity), then after executing the global program, χ is true. Thus, φ mechanises our reasoning about G on page 17.

– For while ⋀{er}r∈R {ψinv} G, the definition of φ has an "outer conjunction" and an "inner conjunction". The inner conjunction is similar to φ for if-statements: either every er and the precondition of the body are true, to (re-)enter the loop, or every ¬er and the postcondition are true, to exit it.

The second outer conjunct states that always (i.e., in every state, i.e., before and after executing the body), if the invariant is true, then the inner conjunction is true; the first outer conjunct states that the invariant is indeed true (i.e., before executing the body). This is similar to the definition of wp.
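Continuing the Haskell sketch, the equations of Fig. 7b can be rendered as follows, with a multiparty condition given as one conjunct per role; implication is encoded via ¬ and ∧, as in Sect. 3.2, and phi refers to the base-calculus sketch of Sect. 4.2.

```haskell
fAnds :: [Formula] -> Formula
fAnds = foldr FAnd (FExpr (Lit (B True)))

-- psi1 -> psi2, encoded as not (psi1 and not psi2).
fImp :: Formula -> Formula -> Formula
fImp f g = FNot (FAnd f (FNot g))

-- phi for a blocking if with one conjunct per role.
phiIf :: [(Role, Expr)] -> Global -> Global -> Formula -> Formula
phiIf conj g1 g2 chi =
  let es  = [FExpr e        | (_, e) <- conj]
      nes = [FNot (FExpr e) | (_, e) <- conj]
  in fAnds [ fImp (fAnds es)  (phi g1 chi)            -- all true: then-branch
           , fImp (fAnds nes) (phi g2 chi)            -- all false: else-branch
           , fAnds [fImp e1 e2 | e1 <- es, e2 <- es]  -- unanimity (excludes case C)
           ]

-- phi for a blocking while with invariant inv; FAll is the "always" modality.
phiWhile :: [(Role, Expr)] -> Formula -> Global -> Formula -> Formula
phiWhile conj inv g chi =
  let es   = [FExpr e        | (_, e) <- conj]
      nes  = [FNot (FExpr e) | (_, e) <- conj]
      body = fAnds [ fImp (fAnds es)  (phi g inv)     -- re-enter: body keeps inv
                   , fImp (fAnds nes) chi             -- exit: postcondition holds
                   , fAnds [fImp e1 e2 | e1 <- es, e2 <- es] ]
  in FAnd inv (FAll (fImp inv body))
```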

#### 5.3 Deadlock Freedom and Functional Correctness

To extend the main theorem for global programs (Thm. 3, page 16) to cover if/while-statements, we need to extend the auxiliary lemmas (Lem. 1–4, page 15 onwards); the proof of the theorem relies on the lemmas and is the same.

Lemma 5. Lemmas 1–4 hold for this section's extension.

Proof. For Lem. 1, there are no new cases (i.e., no new termination rules in Fig. 4b). For Lems. 2–3, the new cases (i.e., new reduction rules in Fig. 4b) can be proved directly. For Lem. 4, the new cases (i.e., new well-formedness rules in Fig. 6b) can be proved using Prop. 2, to establish that rule [→-If1] or rule [→-If2] is applicable in such a way that S ∈ ⟦ψ⟧ holds as well.


Theorem 4. Theorem 3 holds for this section's extension.

Proof. The same as the proof of Thm. 3, using Lem. 5 instead of Lems. 1–4.

### 6 Global Programs: Non-Blocking If/While-Statements

In the previous section, we extended the base calculus of global programs with blocking if/while-statements; they require roles to make private decisions together (i.e., at the same time), using barriers. In this section, we extend the base calculus also with non-blocking if/while-statements; they allow roles to make private decisions alone (i.e., at their own pace). This is often preferable.

#### 6.1 Syntax and Semantics

Recall that G denotes a universe of global programs, ranged over by G; it is induced by the following extended grammar:

G ::= · · · (page 16) | if ⋀{er}r∈R G1|R1 G2|R2 | while ⋀{er}r∈R {ψinv} G|∅

Informally, the new grammar elements have the following meaning:<sup>7</sup>

– Global program if ⋀{er}r∈R G1|R1 G2|R2 prescribes a non-blocking conditional choice of G1 and G2. It relies on similar principles as the blocking version; notably, the same cases A, B, and C on page 16 are applicable.

The key difference with the blocking version is that roles make private decisions alone (i.e., at their own pace), without using synchronisation barriers. Intuitively, in operational terms, this means that for every role r: first, it privately evaluates its own conjunct er; next, it immediately enters a branch. To accommodate this, extra syntactic bookkeeping (in the form of the "|R1" and "|R2" notation) is needed to keep track of roles' decisions.

More precisely, at any time, Ri contains every role that has already made a private decision to enter Gi. Initially, both R1 and R2 are empty. In case A (resp. B), R1 (resp. R2) eventually becomes "full" and contains all roles, while R2 (resp. R1) always remains empty. In case C, both R1 and R2 eventually become non-empty, but they always remain "non-full" as well.

A non-blocking if-statement can terminate when all roles have made a private decision and every entered branch can terminate.

– Global program while ⋀{er}r∈R {ψinv} G|∅ prescribes a non-blocking conditional loop of G. The idea is similar to if ⋀{er}r∈R G1|R1 G2|R2, except that no extra bookkeeping is needed (i.e., a fixed ∅ in "|∅"): non-blocking while-statements will be unfolded into non-blocking if-statements. (The reason for the seemingly redundant "|∅" notation is to give non-blocking while-statements a different grammatical form than blocking ones.)

<sup>7</sup> Blocking and non-blocking if/while-statements have different syntax. This makes it possible to mix the blocking and non-blocking versions in the same global program (we have not encountered a compelling use case for this yet, though).

Formally, for non-blocking if/while-statements, ↓ and → are induced by the rules in Fig. 4c (page 10). Rule [↓-NIf] states that if every role has made a private decision (left premise), and if G1 and G2 can terminate when at least one role has entered it (middle and right premise), then the non-blocking if-statement can terminate. The effect of the "Ri ≠ ∅" conditions is that a non-entered branch does not need to be able to terminate for the whole if-statement to be able to terminate. We note that rule [↓-NIf] also covers the case in which both R1 and R2 are non-empty, which should never have happened in the first place; shortly, we will rule it out using well-formedness and the predicate transformer.

Rules [→-NIf1] and [→-NIf2] state that if r has not made a private decision yet (left premise), then the non-blocking if-statement can reduce by executing one. For instance, when G = A.x:=5 ; if (A.x==5 ∧ B.y==6) B.y:=7|∅ skip|∅ and ψ = A.x==5 ∧ B.y==6, the following two abstract executions are possible:

$$G \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \bullet \xrightarrow{\mathsf{A.x{==}5},\ 1^{\{\mathsf{A}\}}} \bullet \xrightarrow{\mathsf{B.y{==}6},\ 1^{\{\mathsf{B}\}}} \bullet \xrightarrow{\mathsf{true},\ \mathsf{B.y{:=}7}} \bullet \downarrow$$

$$G \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \bullet \xrightarrow{\mathsf{A.x{==}5},\ 1^{\{\mathsf{A}\}}} \bullet \xrightarrow{\neg\mathsf{B.y{==}6},\ 2^{\{\mathsf{B}\}}} \bullet$$

First, G reduces twice by executing an assignment and a private decision at Alice alone to enter the then-branch (both executions); next, it reduces by executing a private decision at Bob alone to enter the then-branch (top execution) or else-branch (bottom); next, in the latter case, it is stuck. Regarding concrete executions, if B.y is initially not 6, then a deadlock-free one does not exist: the top abstract execution cannot be enriched (i.e., after the second reduction, the sequel is disabled); the bottom abstract execution can be enriched but gets stuck. We note that unlike rules [→-If1] and [→-If2], there is no direct correspondence between rules [→-NIf1] and [→-NIf2] and cases A, B, C on page 16.

Rules [→-NIf3] and [→-NIf4] state that if G1 or G2 can reduce by executing γ (left premise), and if the subjects of γ have previously entered G1 or G2 (right premise), then the non-blocking if-statement can reduce accordingly. This means that global actions in the branches can be executed out-of-order, already before all private decisions have been made. We note that the set differences in the premises of these rules are needed because, in general (but undesirably), R1 and R2 may overlap; shortly, we will rule out this possibility using well-formedness and the predicate transformer. For instance, with the same G as above, also the following abstract execution is possible (due to rule [→-Seq2] as well):

$$G \xrightarrow{\mathsf{B.y{==}6},\ 1^{\{\mathsf{B}\}}} \bullet \xrightarrow{\mathsf{true},\ \mathsf{B.y{:=}7}} \bullet \xrightarrow{\mathsf{A.x{==}5},\ 1^{\{\mathsf{A}\}}} \bullet \xrightarrow{\mathsf{true},\ \mathsf{A.x{:=}5}} \mathsf{skip}\ ;\ \textbf{if } \psi\ \mathsf{skip}|_{\{\mathsf{A},\mathsf{B}\}}\ \mathsf{skip}|_{\emptyset} \downarrow$$

Rule [→-NWhile] unfolds the non-blocking while-statement.

We end this subsection with an extension of ✓ for non-blocking if/while-statements; it is induced by the rules in Fig. 6c (page 12). There is a fourth aim now (cf. page 12 and page 17):

4. Rule [✓-NIf] ensures that case A or B on page 16 applies, but not case C: when roles make private decisions alone, they must still be unanimous.

### 6.2 Predicate Transformer

For non-blocking if/while-statements, φ is defined by the equations in Fig. 7c (page 13). It is based on the extension for the blocking variants in Fig. 7b:

– For if ⋀{er}r∈R G1|R1 G2|R2, the definition of φ has four cases.

If R1 and R2 are both empty, then no role has made a private decision to enter a branch yet, so the precondition is the same as for blocking if-statements (i.e., either choice is still possible). This shows that blocking and non-blocking if-statements are functionally equivalent in the following sense: to ensure that the same postcondition is true in the end, the same precondition must be true in the beginning.

If Ri and Rj are empty and non-empty, then the roles in Rj have privately decided to enter Gj. Thus, the precondition of Gj must be true. Moreover, to ensure that the remaining roles in R \ Rj will privately make the same decision to enter Gj, their conjuncts must be all true (if j=1) or all false (if j=2) as well. In this way, cases A and B on page 16 are included.

If R1 and R2 are both non-empty, then roles have privately decided to enter both G1 and G2, which should never have happened. Thus, the precondition is false. In this way, case C on page 16 is excluded.

– For while ⋀{er}r∈R {ψinv} G|∅, no role has made a private decision to (re-)enter the loop or exit it yet, so the precondition is the same as for blocking while-statements. When the first role privately decides, the non-blocking while-statement is unfolded into a non-blocking if-statement.
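Continuing the Haskell sketch, the four cases of Fig. 7c become a straightforward case distinction on the bookkeeping sets R1 and R2 (here r1s and r2s), reusing phiIf from the previous sketch.

```haskell
-- phi for a non-blocking if: r1s and r2s are the roles that have already
-- entered the then-branch and the else-branch, respectively.
phiNIf :: [(Role, Expr)] -> [Role] -> [Role]
       -> Global -> Global -> Formula -> Formula
phiNIf conj r1s r2s g1 g2 chi
  | null r1s && null r2s = phiIf conj g1 g2 chi   -- nobody has decided yet
  | null r1s             = FAnd (phi g2 chi)      -- some entered G2; the rest
      (fAnds [FNot (FExpr e) | (r, e) <- conj, r `notElem` r2s])  -- must follow
  | null r2s             = FAnd (phi g1 chi)      -- some entered G1; the rest
      (fAnds [FExpr e | (r, e) <- conj, r `notElem` r1s])         -- must follow
  | otherwise            = FExpr (Lit (B False))  -- both entered: case C
```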

### 6.3 Main Theorem: Deadlock Freedom and Functional Correctness

To extend the main theorem for global programs (Thm. 3, page 16) to cover non-blocking if/while-statements, we need to extend the auxiliary lemmas (Lems. 1–4, page 15 onwards); the proof of the theorem relies on the lemmas and is the same.

Lemma 6. Lemmas 1–4 hold for this section's extension.

Proof. For Lem. 1, the new case (i.e., rule [↓-NIf] in Fig. 4c) can be proved using ✓R(G), to rule out the degenerate case that a non-blocking if-statement with the "empty" multiparty condition ⋀{er}r∈∅ can terminate. For Lem. 2, the new cases (i.e., new reduction rules in Fig. 4c) can be proved directly. For Lem. 3, the new cases (i.e., new reduction rules in Fig. 4c) can be proved using ✓R(G) and ⟦φ(G, χ)⟧ ≠ ∅ (first and second premise of Lem. 3), to establish that R1 or R2 is empty before the reduction, and remains empty after it (i.e., case C on page 16 never arises). For Lem. 4, the new cases (i.e., new well-formedness rules in Fig. 6c) can be proved using Prop. 2, to establish that rule [→-NIf1] or rule [→-NIf2] is applicable in such a way that S ∈ ⟦ψ⟧ holds as well.

Theorem 5. Theorem 3 holds for this section's extension.

Proof. The same as the proof of Thm. 3, using Lem. 6 instead of Lems. 1–4.

### 7 Local Programs and Projection

In the previous sections, to cover the upper half of Fig. 1, we incrementally presented a calculus of global programs with blocking and non-blocking if/while-statements. In this section, to cover the bottom half, we present a complementary calculus of local programs and a projection function.

### 7.1 Syntax and Semantics

Let Λ and L denote universes of local actions and local programs, ranged over by λ and L; they are induced by the following grammar:

$$\begin{array}{lcl} \lambda & ::= & \mathsf{r}.y := e \ \mid\ \mathsf{pq}\,!\,e \ \mid\ \mathsf{pq}\,?\,y \ \mid\ \mathsf{i}^{R}_{r} \ \mid\ \tau \\[2pt] L & ::= & \mathsf{skip} \ \mid\ \lambda \ \mid\ L_1 ; L_2 \ \mid\ L_1 \parallel L_2 \ \mid\ R.\mathbf{if}\ e\ L_1\ L_2 \ \mid\ R.\mathbf{while}\ e\ L \\ & & \mid\ \mathbf{if}\ e|_n\ L_1|_{R_1}\ L_2|_{R_2} \ \mid\ \mathbf{while}\ e|_n\ L|_{\emptyset} \end{array}$$

Informally, these grammar elements have the following meaning: $\mathsf{r}.y := e$ is an assignment at role $\mathsf{r}$, $\mathsf{pq}\,!\,e$ a send, $\mathsf{pq}\,?\,y$ a receive, $\mathsf{i}^{R}_{r}$ a private decision of role $r$ among the roles in $R$, and $\tau$ a delay. The constructs for local programs mirror those for global programs; in the non-blocking if/while-statements, the annotation $n$ records the number of roles that have not yet privately decided to enter a branch.

Formally, the abstract termination and reduction relations for local programs are induced by the same rules as in Fig. 4 (page 10), mutatis mutandis, except:


Let $R \rightharpoonup L$ denote a universe of families of local programs (i.e., partial functions from roles to local programs), ranged over by $\mathcal{L}$. Informally, $\mathcal{L}$ prescribes a parallel composition of the $k$ local programs in its image $\mathcal{L}(r_1), \ldots, \mathcal{L}(r_k)$. Formally, the abstract termination and reduction relations are induced by the rules in Fig. 9. They state that an assignment and a delay are executed alone, while a send–receive pair and a collection of private decisions are executed together. We note that for $n{=}1$, the bottom-left rule to execute $\mathsf{i}^{\{r_1,\ldots,r_n\}}$ covers the case of non-blocking if/while-statements. Furthermore, the mechanisms by


Fig. 9: Abstract operational semantics of families of local programs. $\mathcal{L}[r \mapsto L'_r]$ denotes the update of the image of $r$ in $\mathcal{L}$ to $L'_r$.


(b) Global programs. Let $\circ \in \{;, \parallel\}$ and $r \in R$.

Fig. 10: Decomposition of global actions/programs into local actions/programs

which "togetherness" arises (i.e., channels and barriers) are left implicit; they are implementation details. The concrete termination and reduction relations are induced by the same rules as in Fig. 5 (page 11), mutatis mutandis.

To decompose global actions and programs into local ones, let ${\upharpoonright}$ denote a projection function; it is induced by the equations in Fig. 10. In words, ${\upharpoonright}$ consumes a global program $G$ and a role $r$ as input, and it produces a local program $G{\upharpoonright}r$ as output. The idea is that ${\upharpoonright}$ is sound and complete: roughly, $G$ can terminate or reduce by executing $\gamma$ if, and only if, $G{\upharpoonright}r$ can similarly terminate or reduce by executing $\gamma{\upharpoonright}r$. The interesting cases of Fig. 10 are as follows:


First, suppose that $G$ reduces by executing a global action $\gamma$ in which $r$ does participate. To ensure that $G{\upharpoonright}r$ can similarly reduce by executing $\gamma{\upharpoonright}r$, it suffices to register in $G{\upharpoonright}r$ whether or not $r$ has already entered a branch in $G$ (and which one). This is achieved by "$|_{R_1\cap\{r\}}$" and "$|_{R_2\cap\{r\}}$".

Second, suppose that $G$ reduces by executing a global action $\mathsf{i}^{\{r'\}}$ in which $r$ does not participate, using rule [→-NIf1] or rule [→-NIf2], so another role $r'$ enters $G_1$ or $G_2$. To ensure that $G{\upharpoonright}r$ can similarly reduce by executing $\tau$, it suffices to register in $G{\upharpoonright}r$ the number of roles that have not yet entered a branch in $G$, excluding $r$. This is achieved by "$|_{|R\setminus(R_1\cup R_2\cup\{r\})|}$".

Third, suppose that $G$ reduces by executing a global action $\gamma$ in which $r$ does not participate, using rule [→-NIf3] or rule [→-NIf4]. To ensure that $G{\upharpoonright}r$ can similarly reduce, no additional information needs to be registered.
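To make these three cases concrete, here is a minimal Haskell sketch of how such a projection could be implemented for a toy fragment of global programs. The types, constructor names, and the treatment of a role's own conjunct are illustrative assumptions, not the paper's formal definition in Fig. 10:

```haskell
-- Sketch: projecting a non-blocking if-statement onto a role, registering
-- (i) whether the role has entered a branch and (ii) how many other roles
-- have not yet entered one. Toy types; all names are illustrative.
type Role = String
type Expr = String

data Global
  = GSkip
  | GSeq Global Global
  | GNIf [(Role, Expr)] (Global, [Role]) (Global, [Role])
    -- if ⋀{e_r}_{r∈R}  G1|R1  G2|R2

data Local
  = LSkip
  | LSeq Local Local
  | LNIf Expr Int (Local, [Role]) (Local, [Role])
    -- if e|n  L1|R1∩{r}  L2|R2∩{r}

proj :: Global -> Role -> Local
proj GSkip        _ = LSkip
proj (GSeq g1 g2) r = LSeq (proj g1 r) (proj g2 r)
proj (GNIf conds (g1, r1) (g2, r2)) r =
  LNIf e n (proj g1 r, r1 `cap` [r]) (proj g2 r, r2 `cap` [r])
  where
    e = maybe "true" id (lookup r conds)   -- r's own conjunct (assumption)
    -- roles other than r that have not yet entered a branch:
    n = length [ q | (q, _) <- conds, q /= r
                   , q `notElem` r1, q `notElem` r2 ]
    cap xs ys = [ x | x <- xs, x `elem` ys ]
```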

#### 7.2 Operational Equivalence

Informally, our main theorem for local programs and projection is as follows: if the global program is well-formed, and if the computed precondition is true in the initial state, then the global program and the family of its projections are operationally equivalent. In the rest of this section, we first present auxiliary lemmas; next, we present the main theorem.

The first lemma pertains to soundness of ${\upharpoonright}$. It states that if $G$ is well-formed and can terminate or reduce, then $G{\upharpoonright}r$ can similarly terminate or reduce.

#### Lemma 7.

1. If $\mathsf{X}_R(G)$ and $r \in R$ and $G \downarrow$, then $(G{\upharpoonright}r) \downarrow$.
2. If $\mathsf{X}_R(G)$ and $r \in R$ and $G \xrightarrow{\psi,\gamma} G'$, then $(G{\upharpoonright}r) \xrightarrow{\psi{\upharpoonright}r,\,\gamma{\upharpoonright}r} (G'{\upharpoonright}r)$.

Proof. By induction on the derivation of $G \downarrow$ (item 1) and $G \xrightarrow{\psi,\gamma} G'$ (item 2). The interesting cases are rules [→-If1], [→-If2], [→-While1], and [→-While2]: in those cases, we use premises $\mathsf{X}_R(G)$ and $r \in R$ to establish that $r$ must have its own conjunct in the multiparty condition, so it must contribute to $\gamma$.

The second lemma pertains to completeness of ${\upharpoonright}$. It states that if $G$ is well-formed, and if $G{\upharpoonright}r$ can terminate, then $G$ can similarly terminate. Furthermore, it states that if $G$ is well-formed, and if every $G{\upharpoonright}r$ can reduce by executing $\gamma{\upharpoonright}r$, for every subject $r$ of $\gamma$, then $G$ can similarly reduce.

#### Lemma 8.

1. If $\mathsf{X}_R(G)$ and $(G{\upharpoonright}r) \downarrow$, then $G \downarrow$.
2. If $\mathsf{X}_R(G)$ and $(G{\upharpoonright}r) \xrightarrow{\psi_r,\,\gamma{\upharpoonright}r} L'_r$, for every $r \in \mathsf{subj}(\gamma)$, then $G \xrightarrow{\psi,\gamma} G'$ and $\psi_r = \psi{\upharpoonright}r$ and $L'_r = G'{\upharpoonright}r$, for every $r$, for some $\psi, G'$.

Proof. By induction on the derivation of $(G{\upharpoonright}r) \downarrow$ (item 1) and the derivations of $(G{\upharpoonright}r) \xrightarrow{\psi_r,\,\gamma{\upharpoonright}r} L'_r$, for every $r \in \mathsf{subj}(\gamma)$ (item 2). The interesting cases are [→-Par1] and [→-Par2]: we use premise $\mathsf{X}_R(G)$ to establish that either the left operand is reduced in every $G{\upharpoonright}r$, or the right operand (otherwise, there is no unique $G'$).

Thus, the previous lemmas show that a global program and its family of projections can simulate each other's behaviour, at the abstract "top layer" of the operational semantics. The following theorem shows that this result can be extended to the concrete "bottom layer": it states that if $G$ is well-formed, and if $\varphi(G, \chi)$ is true in $S$, then $(G, S)$ and $(\{G{\upharpoonright}r\}_{r\in R}, S)$ are weakly bisimilar (e.g., [30]), denoted with ≈. This means that $(G, S)$ and $(\{G{\upharpoonright}r\}_{r\in R}, S)$ can coinductively simulate each other's behaviour, modulo delays (i.e., operational equivalence).

Theorem 6. If $\mathsf{X}_R(G)$ and $S \in \llbracket\varphi(G, \chi)\rrbracket$, then $(G, S) \approx (\{G{\upharpoonright}r\}_{r\in R}, S)$.

Proof. We prove the theorem using Lems. 7–8 and Fig. 5. See Appx. C [39] for a more detailed overview of the steps, including a weak bisimulation relation.

### 8 Conclusion

We presented a new theory of choreographic programming. It supports for the first time: construction of distributed systems that require decentralised decision making; analysis of distributed systems to provide not only deadlock freedom but also functional correctness. Both contributions are enabled by a single new technique, namely a predicate transformer for choreographies.

The following corollary summarises our main theorems (Thms. 3–6):

Corollary 1. If global program $G$ (with multiparty conditions in if/while-statements) is well-formed, and if precondition $\varphi(G, \chi)$ is true in initial state $S$, then the family of projections $(\{G{\upharpoonright}r\}_{r\in R}, S)$ is deadlock-free and functionally correct.

For instance, in Sect. 2, we presented a deadlock-free global program for leader election; in Appx. E [39], we demonstrate how to prove its functional correctness; by Cor. 1, these properties are preserved by projection.

We implemented the new theory on top of the existing VerCors tool for deductive verification [4]; we present this implementation elsewhere.

In future work, we aim to extend the new theory with: (1) asynchronous communication; (2) a new version of merging [8,9,10] for decentralised decision making (see also footnote 6); (3) more flexible interleaving by relaxing the disjointness requirement for interleaving to support shared variables (e.g., using concurrent separation logic [6,44]).

Acknowledgments. Funded by the Netherlands Organisation for Scientific Research (NWO): 016.Veni.192.103.

### References



### Comparing the expressiveness of the π-calculus and CCS

Rob van Glabbeek¹,²

¹ Data61, CSIRO, Sydney, Australia
² School of Comp. Sc. and Engineering, Univ. of New South Wales, Sydney, Australia
rvg@cs.stanford.edu

Abstract. This paper shows that the π-calculus with implicit matching is no more expressive than CCSγ, a variant of CCS in which the result of a synchronisation of two actions is itself an action subject to relabelling or restriction, rather than the silent action τ. This is done by exhibiting a compositional translation from the π-calculus with implicit matching to CCSγ that is valid up to strong barbed bisimilarity.

The full π-calculus can be similarly expressed in CCSγ enriched with the triggering operation of Meije.

I also show that these results cannot be recreated with CCS in the rôle of CCSγ, not even up to reduction equivalence, and not even for the asynchronous π-calculus without restriction or replication.

Finally I observe that CCS cannot be encoded in the π-calculus.

### 1 Introduction

The π-calculus [23,24,22,33] has been advertised as an "extension to the process algebra CCS" [23] adding mobility. It is widely believed that the π-calculus has features that cannot be expressed in CCS, or other immobile process calculi—so called in [27]—such as ACP and CSP.

"the π-calculus has a much greater expressiveness than CCS"

[Sangiorgi [32]]

"Mobility – of whatever kind – is important in modern computing. It was not present in CCS or CSP, [...] but [...] the π-calculus [...] takes mobility of linkage as a primitive notion." [Milner [22]]

The present paper investigates this belief by formally comparing the expressive power of the π-calculus and immobile process calculi.

Following [10,11] I define one process calculus to be at least as expressive as another up to a semantic equivalence ∼ iff there exists a so-called valid translation up to ∼ from the other to the one. Validity entails compositionality, and requires that each translated expression is ∼-equivalent to its original. This concept is parametrised by the choice of a semantic equivalence that is meaningful for both the source and the target language. Any language is as expressive as any other up to the universal relation, whereas almost no two languages are equally expressive up to the identity relation. The equivalence ∼ up to which a translation is valid is a measure of the quality of the translation, and thereby of the degree to which the source language can be expressed in the target.

Robert de Simone [34] showed that a wide class of process calculi, including CCS [20], CSP [6], ACP [4] and SCCS [18], are expressible up to strong bisimilarity in Meije [1]. In [8] I sharpened this result by eliminating the crucial rôle played by unguarded recursion in De Simone's translation, now taking $\mathrm{aprACP}_R$ as the target language. Here $\mathrm{aprACP}_R$ is a fragment of the language ACP of [4], enriched with relational relabelling, and using action prefixing instead of general sequential composition. It differs from CCS only in its more versatile communication format, allowing multiway synchronisation instead of merely handshaking, in the absence of a special action τ, and in the relational nature of the relabelling operator. The class of languages that can be translated to Meije and $\mathrm{aprACP}_R$ consists of the ones whose structural operational semantics fits a format due to [34], now known as the De Simone format. They can be considered the "immobile process calculi" alluded to above. The π-calculus does not fit into this class—its operational semantics is not in De Simone format.

To compare the expressiveness of mobile and immobile process calculi I first of all need to select a suitable semantic equivalence that is meaningful for both kinds of languages. A canonical choice is strong barbed bisimilarity [26,33]. Strong barbed bisimilarity is not a congruence for either CCS or the π-calculus, but it is used as a semantic basis for defining suitable congruences on languages [26,33]. For CCS, the familiar notion of strong bisimilarity [19] arises as the congruence closure of strong barbed bisimilarity. For the π-calculus, the congruence closure of strong barbed bisimilarity yields the notion of strong early congruence, called strong full bisimilarity in [33]. In general, whatever its characterisation in a particular calculus, strong barbed congruence is the name of the congruence closure of strong barbed bisimilarity, and a default choice for a semantic equivalence [33].

My first research goal was to find out if there exists a translation from the π-calculus to CCS that is valid up to strong barbed bisimilarity. The answer is negative. In fact, no compositional translation of the π-calculus to CCS is possible, even when weakening the equivalence up to which it should be valid from strong barbed bisimilarity to strong reduction equivalence, and even when restricting the source language to the asynchronous π-calculus [5] without restriction and replication. This disproves a result of [3].

My next research goal was to find out if there is a translation from the π-calculus to any other immobile process calculus, and if yes, to keep the target language as close as possible to CCS. Here the answer turned out to be positive. How close the target language can be kept to CCS depends on which version of the π-calculus I take as source language. My first choice was the original π-calculus, as presented in [23,24], as it is at least as expressive as its competitors. It turns out, however, that the matching operator $[x{=}y]P$ of [23,24] is the source of a complication. The book [33] merely allows matching to occur as part of action prefixing, as in $[x{=}y]u(z).P$ or $[x{=}y]\bar{u}v.P$. I call this implicit matching. Matching was introduced in [23,24] to facilitate complete equational axiomatisations of the π-calculus, and [33] shows that for that purpose implicit matching is sufficient.

To obtain a valid translation from the π-calculus with implicit matching (henceforth called πIM) to an upgraded variant of CCS, the only upgrade needed is to turn the result of a synchronisation of two actions into a visible action, subject to relabelling or restriction, rather than the silent action τ. I call this variant CCSγ, where γ is a commutative partial binary communication function, just like in ACP [4]. CCSγ is a fragment of $\mathrm{aprACP}_R$, which also carries a parameter γ. If $\gamma(a, b) = c$, this means that an $a$-action of one component in a parallel composition may synchronise with a $b$-action of another component, into a $c$-action; if $\gamma(a, b)$ is undefined, the actions $a$ and $b$ do not synchronise. CCS can be seen as the instance of CCSγ with $\gamma(\bar{a}, a) = \tau$, and γ undefined for other pairs of actions. But as target language for my translation I will need another choice of the parameter γ.

An important feature of ACP, which greatly contributes to its expressiveness, is multiway synchronisation. This is achieved by allowing an action γ(a, b) to synchronise with an action c into γ(γ(a, b), c). This feature is not needed for the target language of my translations. So I require that γ(γ(a, b), c) is always undefined.

To obtain a valid translation from the full π-calculus, with an explicit matching operator, I need to further upgrade CCSγ with the triggering operator of Meije, which allows a relabelling of the first action of its argument only.

By a general result of [11], the validity up to strong barbed bisimilarity of my translation from πIM to CCSγ (and from π to $\mathrm{CCS}^{\mathrm{trig}}_\gamma$) implies that it is even valid up to an equivalence on their disjoint union that on π coincides with strong barbed congruence, or strong early congruence, and on $\mathrm{CCS}^{\mathrm{trig}}_\gamma$ is the congruence closure of strong barbed bisimilarity under translated contexts. The latter is strictly coarser than strong bisimilarity, which is the congruence closure of strong barbed bisimilarity under all $\mathrm{CCS}^{\mathrm{trig}}_\gamma$ contexts.

Having established that πIM can be expressed in CCSγ, the possibility remains that the two languages are equally expressive. This, however, is not the case. There does not exist a valid translation (up to any reasonable equivalence) from CCS—thus neither from CCSγ—to the π-calculus, even when disallowing the infinite sum of CCS, as well as unguarded recursion. This is a trivial consequence of the power of the CCS renaming operator, which cannot be mimicked in the π-calculus. Using a simple renaming operator that is as finite as the successor function on the natural numbers, CCS, even without infinite sum and unguarded recursion, allows the specification of a process with infinitely many weak barbs, whereas this is fundamentally impossible in the π-calculus.

### 2 CCS

CCS [19] is parametrised with sets $\mathcal{K}$ of agent identifiers and $\mathcal{A}$ of visible actions. The set $\bar{\mathcal{A}}$ of co-actions is $\bar{\mathcal{A}} := \{\bar{a} \mid a \in \mathcal{A}\}$, and $\mathcal{L} := \mathcal{A} \cup \bar{\mathcal{A}}$ is the set of labels. The function $\bar{\cdot}$ is extended to $\mathcal{L}$ by declaring $\bar{\bar{a}} = a$. Finally, $Act := \mathcal{L} \uplus \{\tau\}$ is the set of actions. Below, $a, b, c, \ldots$ range over $\mathcal{L}$ and $\alpha, \beta$ over $Act$. A relabelling is a function $f : \mathcal{L} \to \mathcal{L}$ satisfying $f(\bar{a}) = \overline{f(a)}$; it extends


Table 1. Structural operational semantics of CCS

to $Act$ by $f(\tau) := \tau$. The class $T_{\mathrm{CCS}}$ of CCS terms, expressions, processes or agents is the smallest class¹ including:

– $\sum_{i\in I} P_i$ for index sets $I$ and $P_i \in T_{\mathrm{CCS}}$ (choice),
– $\alpha.P$ for $\alpha \in Act$ and $P \in T_{\mathrm{CCS}}$ (action prefixing),
– $P|Q$ for $P, Q \in T_{\mathrm{CCS}}$ (parallel composition),
– $P{\backslash}L$ for $L \subseteq \mathcal{L}$ and $P \in T_{\mathrm{CCS}}$ (restriction),
– $P[f]$ for relabellings $f$ and $P \in T_{\mathrm{CCS}}$ (relabelling), and
– $A$ for $A \in \mathcal{K}$ (agent identifiers).

One writes $P_1 + P_2$ for $\sum_{i\in I} P_i$ when $I = \{1, 2\}$, and $\mathbf{0}$ when $I = \emptyset$. Each agent identifier $A \in \mathcal{K}$ comes with a unique defining equation of the form $A \stackrel{\mathrm{def}}{=} P$, with $P \in T_{\mathrm{CCS}}$. The semantics of CCS is given by the labelled transition relation ${\to} \subseteq T_{\mathrm{CCS}} \times Act \times T_{\mathrm{CCS}}$. The transitions $P \xrightarrow{\alpha} Q$ with $P, Q \in T_{\mathrm{CCS}}$ and $\alpha \in Act$ are derived from the rules of Table 1.

Arguably, the most authentic version of CCS [20] features a recursion construct instead of agent identifiers. Since there exists a straightforward valid translation from the version of CCS presented here to the one from [20], the latter is at least as expressive. Therefore, when showing that a variant of CCS is at least as expressive as the π-calculus, I obtain a stronger result by using agent identifiers.
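As a reading aid for Table 1, here is a minimal Haskell sketch of the standard CCS transition rules — finite sums only, with illustrative names and an explicit environment for agent-identifier definitions. It is an interpretation of the standard rules under these assumptions, not the paper's formal semantics:

```haskell
import Data.Maybe (fromMaybe)

data Act = Tau | In String | Out String deriving (Eq, Show)

data P
  = Sum [P]              -- finite choice; 0 is Sum []
  | Pre Act P            -- action prefixing α.P
  | Par P P              -- parallel composition P|Q
  | Res [String] P       -- restriction P\L
  | Rel (Act -> Act) P   -- relabelling P[f]
  | Agent String         -- agent identifier A

co :: Act -> Act
co (In a)  = Out a
co (Out a) = In a
co Tau     = Tau

-- All transitions P --α--> P'; `defs` maps each identifier to its body.
step :: [(String, P)] -> P -> [(Act, P)]
step defs (Sum ps)  = concatMap (step defs) ps
step _    (Pre a p) = [(a, p)]
step defs (Par p q) =
     [ (a, Par p' q) | (a, p') <- step defs p ]
  ++ [ (a, Par p q') | (a, q') <- step defs q ]
  ++ [ (Tau, Par p' q')                        -- handshake of a with ā
     | (a, p') <- step defs p, a /= Tau
     , (b, q') <- step defs q, b == co a ]
step defs (Res l p) =
  [ (a, Res l p') | (a, p') <- step defs p, name a `notElem` map Just l ]
  where name (In x)  = Just x
        name (Out x) = Just x
        name Tau     = Nothing
step defs (Rel f p) = [ (app a, Rel f p') | (a, p') <- step defs p ]
  where app Tau = Tau       -- relabellings leave τ fixed
        app a   = f a
step defs (Agent x) = step defs (fromMaybe (Sum []) (lookup x defs))
```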

### 3 CCSγ

CCSγ has four parameters: the same set $\mathcal{K}$ of agent identifiers as for CCS, an alphabet $\mathcal{A}$ of visible actions, with a subset $\mathcal{S} \subseteq \mathcal{A}$ of synchronisations², and a

¹ CCS [19,20] allows arbitrary index sets $I$ in summations $\sum_{i\in I} P_i$. As a consequence, $T_{\mathrm{CCS}}$ is a proper class rather than a set. Although this is unproblematic, many computer scientists prefer the class of terms to be a set. This can be achieved by choosing a cardinal κ and requiring the index sets $I$ to satisfy $|I| < \kappa$. To enable my translation from the π-calculus to $\mathrm{CCS}^{\mathrm{trig}}_\gamma$, κ should exceed the size of the set of names used in the π-calculus.

² These have been added solely to prevent multiway synchronisation.

partial communication function $\gamma : (\mathcal{A} \setminus \mathcal{S})^2 \rightharpoonup \mathcal{S} \cup \{\tau\}$, which is commutative, i.e. $\gamma(a, b) = \gamma(b, a)$, where each side of this equation is defined just when the other side is. Compared to CCS there are no co-actions, so $Act := \mathcal{A} \uplus \{\tau\}$.

The syntax of CCSγ is the same as that of CCS, except that parallel composition is denoted ∥ rather than |, following ACP [4,2]. This indicates a semantic difference: the rule for communication in the middle of Table 1 is for CCSγ replaced by

$$\frac{P \xrightarrow{a} P' \quad Q \xrightarrow{b} Q'}{P \parallel Q \xrightarrow{c} P' \parallel Q'} \ \big(\gamma(a, b) = c\big).$$

Moreover, relabelling operators $f : \mathcal{A} \to Act$ are allowed to rename visible actions into τ, but not vice versa.³ They are required to satisfy $c \in \mathcal{S} \Rightarrow f(c) \in \mathcal{S} \cup \{\tau\}$. These are the only differences between CCS and CCSγ.
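The replaced rule can be rendered as follows, continuing the sketch style above; `gammaOf` builds a commutative partial communication function from a finite table, and `sync` derives the synchronisation transitions of a parallel composition. All names are illustrative assumptions:

```haskell
type Name = String
data Act = Tau | Vis Name deriving (Eq, Show)

-- A commutative partial communication function, built from a finite table;
-- Nothing means the two actions do not synchronise (τ never synchronises).
gammaOf :: [((Name, Name), Act)] -> Act -> Act -> Maybe Act
gammaOf table (Vis a) (Vis b) =
  case lookup (a, b) table of
    Just c  -> Just c
    Nothing -> lookup (b, a) table
gammaOf _ _ _ = Nothing

-- From P --a--> P' and Q --b--> Q', derive P∥Q --c--> P'∥Q' when γ(a,b) = c.
sync :: (Act -> Act -> Maybe Act) -> [(Act, p)] -> [(Act, q)] -> [(Act, (p, q))]
sync gamma ps qs =
  [ (c, (p', q')) | (a, p') <- ps, (b, q') <- qs, Just c <- [gamma a b] ]

-- CCS itself is the instance with γ(ā, a) = τ and γ undefined otherwise,
-- e.g. representing the co-name of "a" as "a_bar":
ccs :: Act -> Act -> Maybe Act
ccs = gammaOf [ (("a_bar", "a"), Tau) ]
```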

### 4 Strong barbed bisimilarity

The semantics of the π-calculus and CCS can be expressed by associating a labelled or a barbed transition system with these languages, with processes as states. Semantic equivalences are defined on the states of labelled or barbed transition systems, and thereby on π- and CCS processes.

Definition 1. A labelled transition system (LTS) is a pair $(S, \to)$ with $S$ a class (of states) and ${\to} \subseteq S \times A \times S$ a transition relation, for some suitable set of actions $A$.

I write $P \xrightarrow{\alpha} Q$ for $(P, \alpha, Q) \in {\to}$, $P \xrightarrow{\alpha}$ for $\exists Q.\ P \xrightarrow{\alpha} Q$, and $P \not\xrightarrow{\alpha}$ for its negation. The structural operational semantics of CCS presented before creates an LTS with as states all CCS processes and the transition relation derived from the operational rules, with $A := Act$.

Definition 2. A strong bisimulation is a symmetric relation $\mathcal{R}$ on the states of an LTS such that

– if $P \mathrel{\mathcal{R}} Q$ and $P \xrightarrow{\alpha} P'$ then $\exists Q'.\ Q \xrightarrow{\alpha} Q' \wedge P' \mathrel{\mathcal{R}} Q'$.

Processes $P$ and $Q$ are strongly bisimilar—notation $P \leftrightarrow Q$—if $P \mathrel{\mathcal{R}} Q$ for some strong bisimulation $\mathcal{R}$.

As is well-known, ↔ is an equivalence relation, and a strong bisimulation itself. Through the operational semantics of CCSγ, strong bisimilarity is defined on CCSγ processes.

Definition 3. A barbed transition system (BTS) is a triple $(S, \mapsto, \downarrow)$ with $S$ a class (of states), ${\mapsto} \subseteq S \times S$ a reduction relation, and ${\downarrow} \subseteq S \times B$ an observability predicate for some suitable set of barbs $B$.

³ Renaming into τ could already be done in CCS by means of parallel composition. Hence this feature in itself does not add extra expressiveness.


Table 2. The actions

One writes $P{\downarrow_b}$ for $P \in S$ and $b \in B$ when $(P, b) \in {\downarrow}$. A BTS can be extracted from an LTS with $\tau \in A$, by means of a partial observation function $O : A \rightharpoonup B$. The states remain the same, the reductions are taken to be the transitions labelled τ (dropping the label in the BTS), and $P{\downarrow_b}$ holds exactly when there is a transition $P \xrightarrow{\alpha} Q$ with $O(\alpha) = b$.

In this paper I consider labelled transition systems whose actions $\alpha \in A$ are of the forms presented in Table 2. Here $x$ and $y$ are names, drawn from the disjoint union of two sets $\mathcal{Z}$ and $\mathcal{R}$ of public and private names, and $M$ is a (possibly empty) matching sequence, a sequence of matches $[x{=}y]$ with $x, y \in \mathcal{Z} \uplus \mathcal{R}$ and $x \neq y$. The set of names occurring in $M$ is denoted $n(M)$. In Table 2, also the free names $fn(\alpha)$ and bound names $bn(\alpha)$ of an action α are defined. The set of names of α is $n(\alpha) := fn(\alpha) \cup bn(\alpha)$. Consequently, also the actions $Act$ of my instantiation of CCSγ need to have the forms of Table 2. For the translation into barbed transition systems I take $B := \mathcal{Z} \cup \bar{\mathcal{Z}}$, where $\bar{\mathcal{Z}} := \{\bar{a} \mid a \in \mathcal{Z}\}$, and $O(\alpha)$ as indicated in Table 2, provided $M = \varepsilon$ and $O(\alpha) \in B$.
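The extraction of a BTS from an LTS is mechanical; here is a minimal Haskell sketch, with an illustrative action type standing in for Table 2's actions:

```haskell
-- Reductions are the τ-transitions with the label dropped; P↓b holds when
-- some outgoing transition's label is observed as b. The step function and
-- observation function are parameters (illustrative, not the paper's).
type Barb = String
data Act = Tau | Vis String deriving (Eq, Show)

reductions :: (p -> [(Act, p)]) -> p -> [p]
reductions step p = [ q | (Tau, q) <- step p ]

barbs :: (Act -> Maybe Barb) -> (p -> [(Act, p)]) -> p -> [Barb]
barbs obs step p = [ b | (a, _) <- step p, Just b <- [obs a] ]
```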

Definition 4. A strong barbed bisimulation is a symmetric relation $\mathcal{R}$ on the states of a BTS such that

– if $P \mathrel{\mathcal{R}} Q$ and $P \mapsto P'$ then $\exists Q'.\ Q \mapsto Q' \wedge P' \mathrel{\mathcal{R}} Q'$,

– and if $P \mathrel{\mathcal{R}} Q$ and $P{\downarrow_b}$ then also $Q{\downarrow_b}$.

Processes $P$ and $Q$ are strongly barbed bisimilar—notation $P$ •∼ $Q$—if $P \mathrel{\mathcal{R}} Q$ for some strong barbed bisimulation $\mathcal{R}$.

Again, •∼ is an equivalence relation, and a strong barbed bisimulation itself. Through the above definition, strong barbed bisimilarity is defined on all LTSs occurring in this paper, as well as on my instantiation of CCSγ. It can also be used to compare processes from different LTSs, namely by taking their disjoint union.

### 5 The π-calculus

The π-calculus [23,24] is parametrised with an infinite set $\mathcal{N}$ of names and, for each $n \in \mathbb{N}$, a set $\mathcal{K}_n$ of agent identifiers of arity $n$. The set $T_\pi$ of π-calculus terms, expressions, processes or agents is the smallest set including:

– $\mathbf{0}$, as well as $A(y_1, \ldots, y_n)$ for $A \in \mathcal{K}_n$ and $y_i \in \mathcal{N}$,
– $\tau.P$, $\bar{x}y.P$ and $x(y).P$ for $x, y \in \mathcal{N}$ and $P \in T_\pi$,
– $(\nu y)P$ and $[x{=}y]P$ for $x, y \in \mathcal{N}$ and $P \in T_\pi$,
– $P|Q$ and $P + Q$ for $P, Q \in T_\pi$.

The order of precedence among the operators is the order of the listing above. A process $\alpha.\mathbf{0}$ with $\alpha = \tau$ or $\bar{x}y$ or $x(y)$ is often written $\alpha$.

n(P) denotes the set of all names occurring in a process P. An occurrence of a name y in a term is bound if it occurs in a subterm of the form x(y).P or (νy)P; otherwise it is free. The set of names occurring free (resp. bound) in a process P is denoted fn(P) (resp. bn(P)).

Each agent identifier $A \in \mathcal{K}_n$ is assumed to come with a unique defining equation of the form

$$A(x\_1, \ldots, x\_n) \stackrel{\text{def}}{=} P$$

where the names $x_i$ are all distinct and $fn(P) \subseteq \{x_1, \ldots, x_n\}$.

The π-calculus with implicit matching (πIM) drops the matching operator, instead allowing prefixes of the form $M\bar{x}y.P$, $Mx(y).P$ and $M\tau.P$, with $M$ a matching sequence.

A substitution is a partial function $\sigma : \mathcal{N} \rightharpoonup \mathcal{N}$ such that $\mathcal{N} \setminus (dom(\sigma) \cup range(\sigma))$ is infinite. For $\vec{x} = (x_1, \ldots, x_n), \vec{y} = (y_1, \ldots, y_n) \in \mathcal{N}^n$, $\{\vec{y}/\vec{x}\}$ denotes the substitution given by $\sigma(x_i) = y_i$ for $1 \leq i \leq n$. One writes $\{y/x\}$ when $n{=}1$.

For x ∈ N , x[σ] denotes σ(x) if x ∈ dom(σ) and x otherwise; M[σ] is the result of changing each occurrence of a name x in M into x[σ], while dropping resulting matches [y=y].

For a substitution σ, the process $P\sigma$ is obtained from $P \in T_\pi$ by simultaneous substitution, for all $x \in dom(\sigma)$, of $x[\sigma]$ for all free occurrences of $x$ in $P$, with change of bound names to avoid name capture. A formal inductive definition is:

$$\begin{array}{rcl} \mathbf{0}\sigma &=& \mathbf{0} \\ (M\tau.P)\sigma &=& M[\sigma]\tau.(P\sigma) \\ (M\bar{x}y.P)\sigma &=& M[\sigma]\overline{x[\sigma]}\,y[\sigma].(P\sigma) \\ (Mx(y).P)\sigma &=& M[\sigma]x[\sigma](z).(P\{z/y\}\sigma) \\ ((\nu y)P)\sigma &=& (\nu z)(P\{z/y\}\sigma) \\ ([x{=}y]P)\sigma &=& [x[\sigma]{=}y[\sigma]](P\sigma) \\ (P|Q)\sigma &=& (P\sigma)|(Q\sigma) \\ (P+Q)\sigma &=& (P\sigma)+(Q\sigma) \\ A(\vec{y})\sigma &=& A(\vec{y}[\sigma]) \end{array}$$

where $z$ is chosen outside $fn((\nu y)P) \cup dom(\sigma) \cup range(\sigma)$; in case $y \notin dom(\sigma) \cup range(\sigma)$ one always picks $z := y$.
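A minimal Haskell sketch of this capture-avoiding substitution, for a π-calculus fragment without matching and agent identifiers; the constructor names and the fresh-name supply are illustrative assumptions:

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Name = String

data Pi
  = Nil
  | Out Name Name Pi   -- x̄y.P
  | In  Name Name Pi   -- x(y).P  (y bound in P)
  | New Name Pi        -- (νy)P   (y bound in P)
  | Par Pi Pi

free :: Pi -> S.Set Name
free Nil         = S.empty
free (Out x y p) = S.insert x (S.insert y (free p))
free (In  x y p) = S.insert x (S.delete y (free p))
free (New y p)   = S.delete y (free p)
free (Par p q)   = free p `S.union` free q

apply :: M.Map Name Name -> Name -> Name   -- x[σ]
apply s x = M.findWithDefault x x s

subst :: M.Map Name Name -> Pi -> Pi       -- Pσ
subst s p0 = case p0 of
  Nil       -> Nil
  Out x y p -> Out (apply s x) (apply s y) (subst s p)
  In  x y p -> let z = fresh y p in In (apply s x) z (rebind y z p)
  New y p   -> let z = fresh y p in New z (rebind y z p)
  Par p q   -> Par (subst s p) (subst s q)
  where
    -- z is chosen outside fn((νy)P) ∪ dom(σ) ∪ range(σ); z := y if possible
    fresh y p = head [ z | z <- y : map (\n -> y ++ replicate n '\'') [1 ..]
                         , z `S.notMember` avoid ]
      where avoid = S.delete y (free p)
                    `S.union` M.keysSet s
                    `S.union` S.fromList (M.elems s)
    rebind y z p = subst s (subst (M.singleton y z) p)   -- (P{z/y})σ
```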

A congruence is an equivalence relation ∼ on $T_\pi$ such that $P \sim Q$ implies $\tau.P \sim \tau.Q$, $\bar{x}y.P \sim \bar{x}y.Q$, $x(y).P \sim x(y).Q$, $(\nu y)P \sim (\nu y)Q$, $[x{=}y]P \sim [x{=}y]Q$, $P|U \sim Q|U$, $U|P \sim U|Q$, $P + U \sim Q + U$ and $U + P \sim U + Q$. Let ≡ be the smallest congruence on $T_\pi$ allowing renaming of bound names, i.e., that satisfies $x(y).P \equiv x(z).(P\{z/y\})$ and $(\nu y)P \equiv (\nu z)(P\{z/y\})$ for any $z \notin fn((\nu y)P)$. If $P \equiv Q$, then $Q$ is obtained from $P$ by means of α-conversion. Due to the choice of $z$ above, substitution is precisely defined only up to α-conversion.

Note that $P \equiv Q$ implies that $fn(P) = fn(Q)$, and also that $P\sigma \equiv Q\sigma$ for any substitution σ.

### 6 The semantics of the π-calculus

Fig. 1. Semantics of the π-calculus

Whereas CCS has only one operational semantics, the π-calculus is equipped with at least five, as indicated in Figure 1. The late operational semantics stems from [24], the origin of the π-calculus. It is given by the action rules of Table 3. These rules generate a labelled transition system in which the states are the π-calculus processes and the transitions are labelled with the actions $\tau$, $\bar{x}y$, $\bar{x}(y)$ and $x(y)$ of Table 2 (always with $M$ the empty string). Here I take $\mathcal{Z} := \mathcal{N}$ and $\mathcal{R} := \emptyset$. For πIM, rule match is omitted. A process $[x{=}y]\alpha.P$ has no outgoing transitions, similar to $\mathbf{0}$.

In [24] the late and early bisimulation semantics of the π-calculus were proposed.

Definition 5. A late bisimulation is a symmetric relation $\mathcal{R}$ on π-processes such that, whenever $P \mathrel{\mathcal{R}} Q$, $\alpha$ is either $\tau$ or $\bar{x}y$ or $\bar{x}(z)$, and $z \notin n(P) \cup n(Q)$,

– if $P \xrightarrow{\alpha} P'$ then $\exists Q'.\ Q \xrightarrow{\alpha} Q' \wedge P' \mathrel{\mathcal{R}} Q'$, and
– if $P \xrightarrow{x(z)} P'$ then $\exists Q'.\ Q \xrightarrow{x(z)} Q' \wedge \forall y.\ P'\{y/z\} \mathrel{\mathcal{R}} Q'\{y/z\}$.

Processes $P$ and $Q$ are late bisimilar—notation $P \mathrel{\dot\sim_L} Q$—if $P \mathrel{\mathcal{R}} Q$ for some late bisimulation $\mathcal{R}$. They are late congruent—notation $P \sim_L Q$—if $P\{\vec{y}/\vec{x}\} \mathrel{\dot\sim_L} Q\{\vec{y}/\vec{x}\}$ for any substitution $\{\vec{y}/\vec{x}\}$.


#### Table 3. Late structural operational semantics of the π-calculus

Early bisimilarity ($\dot\sim_E$) and congruence ($\sim_E$) are defined likewise, but with $\forall y \exists Q'$ instead of $\exists Q' \forall y$. In [24,33] it is shown that $\dot\sim_L$ and $\dot\sim_E$ are congruences for all operators of the π-calculus, except for the input prefix. $\sim_E$ and $\sim_L$ are congruence relations for the entire language; in fact they are the congruence closures of $\dot\sim_L$ and $\dot\sim_E$, respectively. By definition, $\dot\sim_L \subseteq \dot\sim_E$, and thus $\sim_L \subseteq \sim_E$.

Lemma 1 ([24]). Let $P \equiv Q$ and $bn(\alpha) \cap n(Q) = \emptyset$. If $P \xrightarrow{\alpha} P'$ then $Q \xrightarrow{\alpha} Q'$ for some $Q'$ with $P' \equiv Q'$.

This implies that ≡ is a late bisimulation, so that ≡ ⊂ ∼L.

In [25] the early operational semantics of the π-calculus is proposed, presented in Table 4; it uses free input actions xy instead of bound inputs x(y). This is also the semantics of [33]. The semantics in [25,33] requires us to identify processes modulo α-conversion before applying the operational rules. This is equivalent to adding rule alpha of Table 4.

A variant of the late operational semantics incorporating rule alpha is also possible. In this setting rule alpha-open can be simplified to open, and likewise input to $x(y).P \xrightarrow{x(y)} P$. By Lemma 1, the late operational semantics with alpha gives rise to the same notions of early and late bisimilarity as the late operational semantics without alpha; the addition of this rule is entirely optional. Interestingly, the rule alpha is not optional in the early operational semantics, not even when reinstating alpha-open.

Example 1. Let $P := \bar{x}y \mid (\nu y)(x(z))$. One has $(\nu y)(x(z)) \xrightarrow{x(z)}_L (\nu y)\mathbf{0}$ and thus $P \xrightarrow{\tau}_L \mathbf{0} \mid (\nu y)\mathbf{0}$ by com. However, $(\nu y)(x(z)) \xrightarrow{xy}_E (\nu y)\mathbf{0}$ is forbidden by the side condition of res, so in the early semantics without alpha $P$ cannot make a τ-step. Rule alpha comes to the rescue here, as it allows $P \equiv \bar{x}y \mid (\nu w)(x(z)) \xrightarrow{\tau}_E \mathbf{0} \mid (\nu w)\mathbf{0}$.


Table 4. Early structural operational semantics of the π-calculus

By the following lemma, the early transition relation $\to_E$ is completely determined by the late transition relation $\to_{\alpha L}$ with alpha:

Lemma 2 ([25]). Let $P \in T_\pi$ and $\beta$ be $\tau$, $\bar{x}y$ or $\bar{x}(y)$.
– $P \xrightarrow{\beta}_E Q$ iff $P \xrightarrow{\beta}_{\alpha L} Q$.
– $P \xrightarrow{xy}_E Q$ iff $P \xrightarrow{x(z)}_{\alpha L} R$ for some $R, z$ with $Q \equiv R\{y/z\}$.

The early transition relations allow a more concise definition of early bisimilarity:

Proposition 1 ([25]). An early bisimulation is a symmetric relation $\mathcal{R}$ on $T_\pi$ such that, whenever $P \mathrel{\mathcal{R}} Q$ and $\alpha$ is an action with $bn(\alpha) \cap (n(P) \cup n(Q)) = \emptyset$,

$$- \text{ if } P \xrightarrow{\alpha}\_{E} P' \text{ then } \exists Q' \text{ with } Q \xrightarrow{\alpha}\_{E} Q' \text{ and } P' \text{ } \mathcal{R} \ Q'.$$

Processes $P$ and $Q$ are early bisimilar iff $P \mathrel{\mathcal{R}} Q$ for some early bisimulation $\mathcal{R}$.

Through the general method of Section 4, taking $\mathcal{Z} := \mathcal{N}$ and $\mathcal{R} := \emptyset$, a barbed transition system can be extracted from the late or early labelled transition system of the π-calculus; by Lemmas 1 and 2 the same BTS is obtained either way. This defines strong barbed bisimilarity •∼ on $T_\pi$. The congruence closure of •∼ is early congruence [33]. In [21] a reduction semantics of the π-calculus is given, which yields a BTS right away. Up to strong barbed bisimilarity, this BTS is the same as the one extracted from the late or early LTS.

In [32] yet another operational semantics of the π-calculus was introduced, in a style called symbolic by Hennessy & Lin [16], who had proposed it for a version of value-passing CCS. It is presented in Table 5. The transitions are labelled with actions $\alpha$ of the form $M\beta$, where $M$ is a matching sequence and $\beta$ an action as in the late operational semantics. When $x \neq y$ the matching sequence $M$ prepended with $[x{=}y]$ is denoted $[x{=}y]M$; however, $[x{=}x]M$ simply denotes $M$.

In the operational semantics of CCS, τ -actions can be thought of as reactions that actually take place, whereas a transition labelled a merely represents the


Table 5. Late symbolic structural operational semantics of the π-calculus

For the π-calculus, the blue Ms are omitted; for πIM the purple rules.

potential of a reaction with the environment, one that can take place only if the environment offers a complementary transition $\bar{a}$. In case the environment never does an $\bar{a}$, this potential will not be realised. A reduction semantics (as in [22]) yields a BTS that only represents directly the realised actions—the τ-transitions or reductions—and reasons about the potential reactions by defining the semantics of a system in terms of reductions that can happen when placing the system in various contexts. An LTS, on the other hand, directly represents transitions that could happen under some conditions only, annotated with the conditions that enable them. For CCS, this annotation is the label $a$, saying that the transition is conditional on an $\bar{a}$-signal from the environment. As a result of this, semantic equivalences defined on labelled transition systems tend to be congruences for most operators right away, and do not need much closure under contexts.

Seen from this perspective, the operational semantics of the π-calculus of Table 3 or 4 is a compromise between a pure reduction semantics and a pure labelled transition system semantics. Input and output actions are explicitly included to signal potential reactions that are realised in the presence of a suitable communication partner, but actions whose occurrence is conditional on two different names $x$ and $y$ denoting the same channel are entirely omitted, even though any π-process can be placed in a context in which $x$ and $y$ will be identified. As a consequence of this, the early and late bisimilarities need to be closed under all possible substitutions or identifications of names before they turn into early and late congruences. The operational semantics of Table 5 adds the conditional transitions that were missing in Table 3, and hence can be seen as a true labelled transition system semantics.

In this paper I need the early symbolic operational semantics of the π-calculus, presented in Table 6. Although new, it is the logical combination of the early and the (late) symbolic semantics. Its transitions that are labelled


Table 6. Early symbolic structural operational semantics of the π-calculus

with actions having an empty matching sequence are exactly the transitions of the early semantics, so the BTS extracted from this semantics is the same.

For πIM, rule symb-match is omitted, but tau, output and input carry the matching sequence M (indicated in blue).

### 7 Valid translations

A signature Σ is a set of operator symbols $g$, each of which is equipped with an arity $n \in \mathbb{N}$. The set $T_\Sigma$ of closed terms over Σ is the smallest set such that, for all $g \in \Sigma$,

$$P_1, \ldots, P_n \in T_\Sigma \;\Rightarrow\; g(P_1, \ldots, P_n) \in T_\Sigma.$$

Call a language simple if its expressions are the closed terms $T_\Sigma$ over some signature Σ. The π-calculus is simple in this sense; its signature consists of the binary operators $+$ and $|$, the unary operators $\tau.$, $\bar{x}y.$, $x(y).$, $(\nu y)$ and $[x{=}y]$ for $x, y \in \mathcal{N}$, and the nullary operators (or constants) $\mathbf{0}$ and $A(y_1, \ldots, y_n)$ for $A \in \mathcal{K}_n$ and $y_i \in \mathcal{N}$. CCS is not quite simple, since it features the infinite choice operator.

Let $L$ be a language. An $n$-ary $L$-context $C$ is an $L$-expression that may contain special variables $X_1, \ldots, X_n$—its holes. For $C$ an $n$-ary context, $C[P_1, \ldots, P_n]$ is the result of substituting $P_i$ for $X_i$, for each $i = 1, \ldots, n$.

Definition 6. Let $L'$ and $L$ be languages, generating sets of closed terms $T_{L'}$ and $T_L$. Let $L'$ be simple, with signature Σ. A translation from $L'$ to $L$ (or an encoding from $L'$ into $L$) is a function $\mathcal{T} : T_{L'} \to T_L$. It is compositional if for each $n$-ary operator $g \in \Sigma$ there exists an $n$-ary $L$-context $C_g$ such that $\mathcal{T}(g(P_1, \ldots, P_n)) = C_g[\mathcal{T}(P_1), \ldots, \mathcal{T}(P_n)]$.

Let ∼ be an equivalence relation on $T_{L'} \cup T_L$. A translation $\mathcal{T}$ from $L'$ to $L$ is valid up to ∼ if it is compositional and $\mathcal{T}(P) \sim P$ for each $P \in T_{L'}$.

The above definition stems in essence from [10,11], but could be simplified here since [10,11] also covered the case that $L'$ is not simple. Moreover, here I restrict attention to what are called closed term languages in [11].
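Compositionality in the sense of Definition 6 amounts to defining the translation by structural recursion, with one fixed target context per source operator. A minimal Haskell sketch for a toy source and target signature (all names and the choice of contexts are illustrative):

```haskell
-- The translation below is compositional because each source operator g is
-- mapped to a fixed target context C_g, filled with the translated arguments.
data Src = SNil | SPre String Src | SPar Src Src

data Tgt = TNil | TPre String Tgt | TPar Tgt Tgt | TRes String Tgt

-- the contexts C_g; they may only plug their arguments into the holes
cNil :: Tgt
cNil = TNil

cPre :: String -> Tgt -> Tgt
cPre a x1 = TPre a x1                  -- C_{a.}[X1] = a.X1

cPar :: Tgt -> Tgt -> Tgt
cPar x1 x2 = TRes "aux" (TPar x1 x2)   -- C_{|}[X1,X2], say, with bookkeeping

translate :: Src -> Tgt
translate SNil       = cNil
translate (SPre a p) = cPre a (translate p)
translate (SPar p q) = cPar (translate p) (translate q)
```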

### 8 The unencodability of π into CCS

In this section I show that there exists no translation of the π-calculus to CCS that is valid up to •∼. I even show this for the fragment $\pi_A^-$ of the (asynchronous) π-calculus without choice, recursion, matching and restriction (thus only featuring inaction, action prefixing and parallel composition).

Definition 7. Strong reduction bisimilarity, $\leftrightarrow_r$, is defined just as strong barbed bisimilarity in Definition 4, but without the requirement on barbs.

I show that there is no translation of $\pi_A^-$ to CCS that is valid up to $\leftrightarrow_r$. As $\leftrightarrow_r$ is coarser than •∼, this implies my claim above. It may be useful to read this section in parallel with the first half of Section 14.

Definition 8. Let $\succeq$ be the smallest preorder on CCS contexts such that $\sum_{i\in I} E_i \succeq E_j$ for all $j \in I$, $E|F \succeq E$, $E|F \succeq F$, $E{\backslash}L \succeq E$, $E[f] \succeq E$ and $A \succeq P$ for all $A \in \mathcal{K}$ with $A \stackrel{\mathrm{def}}{=} P$. A variable $X$ occurs unguarded in a context $E$ if $E \succeq X$.

If the hole $X_1$ occurs unguarded in the unary context $E[\;]$ and $U \xrightarrow{\tau}$ (resp. $U \xrightarrow{\tau}\xrightarrow{\tau}$) then $E[U] \xrightarrow{\tau}$ (resp. $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$).

Lemma 3. Let $E[\;]$ be a unary and $C[\;,\;]$ a binary CCS context, and $P, Q, P', Q', U \in T_{\mathrm{CCS}}$. If $E[C[P, Q]] \xrightarrow{\tau}$ and $U \xrightarrow{\tau}$ but neither $E[C[P', Q]] \xrightarrow{\tau}$ nor $E[C[P, Q']] \xrightarrow{\tau}$ nor $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$, then $C[P, Q] \xrightarrow{\tau}$.

Proof. Since the only rule in the operational semantics of CCS with multiple premises has a conclusion labelled τ, it can occur at most once in the derivation of a CCS transition. Thus, such a derivation is a tree with at most two branches. Now consider the derivation of $E[C[P, Q]] \xrightarrow{\tau}$. If none of its branches prods into the subprocess $P$, the transition would be independent of what is substituted there, thus yielding $E[C[P', Q]] \xrightarrow{\tau}$. Thus, by symmetry, both $P$ and $Q$ are visited by branches of this proof. It suffices to show that these branches come together within the context $C$, as this implies $C[P, Q] \xrightarrow{\tau}$. So suppose, towards a contradiction, that the two branches come together in $E$. Then $E$ must have the form $E_1[E_2[\;]\,|\,E_3[\;]]$, where the hole $X_1$ occurs unguarded in $E_2$, $E_3$ as well as $E_1$. But in that case $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$, contradicting the assumptions. ⊓⊔

Lemma 4. If $D[\;,\;,\;]$ is a ternary CCS context, $P_1, P_2, P_3 \in T_{\mathrm{CCS}}$, and $D[P_1, P_2, P_3] \xrightarrow{\tau}$, then there exists an $i \in \{1, 2, 3\}$ and a CCS context $E[\;]$ such that $D'[P] \xrightarrow{\tau} E[P]$ for any $P \in T_{\mathrm{CCS}}$. Here $D'$ is the unary context obtained from $D[\;,\;,\;]$ by substituting $P_j$ for the hole $X_j$, for all $j \in \{1, 2, 3\}$, $j \neq i$.

Proof. Since the derivation of $D[P_1, P_2, P_3] \xrightarrow{\tau}$ has at most two branches, one of the $P_i$ is not involved in this proof at all. Thus, the derivation remains valid if any other process $P$ is substituted in the place of that $P_i$; the target of the transition remains the same, except for $P$ taking the place of $P_i$ in it. ⊓⊔

Theorem 1. There is no translation from $\pi_A^-$ to CCS that is valid up to $\leftrightarrow_r$.

Proof. Suppose, towards a contradiction, that $\mathcal{T}$ is a translation from $\pi_A^-$ to CCS that is valid up to $\leftrightarrow_r$. By definition, this means that $\mathcal{T}$ is compositional and that $\mathcal{T}(P) \leftrightarrow_r P$ for any $\pi_A^-$-process $P$.

As $\mathcal{T}$ is compositional, there exists a ternary CCS context $D[\;,\;,\;]$ such that, for any $\pi_A^-$-processes $R, S, T$,

$$\mathcal{T}\big(\bar{x}v \mid x(y).(R|S|T)\big) = D[\mathcal{T}(R), \mathcal{T}(S), \mathcal{T}(T)].$$

Since $\bar{x}v \mid x(y).(\mathbf{0}|\mathbf{0}|\mathbf{0}) \xrightarrow{\tau}$ as well as $\mathcal{T}\big(\bar{x}v \mid x(y).(\mathbf{0}|\mathbf{0}|\mathbf{0})\big) \leftrightarrow_r \bar{x}v \mid x(y).(\mathbf{0}|\mathbf{0}|\mathbf{0})$, it follows that $\mathcal{T}\big(\bar{x}v \mid x(y).(\mathbf{0}|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau}$, i.e., $D[\mathcal{T}(\mathbf{0}), \mathcal{T}(\mathbf{0}), \mathcal{T}(\mathbf{0})] \xrightarrow{\tau}$. Hence Lemma 4 can be applied. For simplicity I assume that $i = 1$; the other two cases proceed in the same way. So there is a CCS context $E[\;]$ such that $D[P, \mathcal{T}(\mathbf{0}), \mathcal{T}(\mathbf{0})] \xrightarrow{\tau} E[P]$ for all CCS terms $P$. In particular, for all $\pi_A^-$-terms $R$,

$$\mathcal{T}\big(\bar{x}v \mid x(y).(R|\mathbf{0}|\mathbf{0})\big) = D[\mathcal{T}(R), \mathcal{T}(\mathbf{0}), \mathcal{T}(\mathbf{0})] \xrightarrow{\tau} E[\mathcal{T}(R)]. \qquad (1)$$

I examine the translations of the π-calculus expressions $\bar{x}v \mid x(y).(R|\mathbf{0}|\mathbf{0})$, for $R \in \{\bar{y}z|v(w),\ \mathbf{0}|v(w),\ \bar{y}z|\mathbf{0},\ \tau\}$.

Since $\bar{x}v \mid x(y).(\bar{y}z|v(w)|\mathbf{0}|\mathbf{0}) \xrightarrow{\tau}\xrightarrow{\tau}$ and $\mathcal{T}$ respects $\leftrightarrow_r$,

$$\mathcal{T}\big(\bar{x}v \mid x(y).(\bar{y}z|v(w)|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau}\xrightarrow{\tau}.$$

In the same way, neither $\mathcal{T}\big(\bar{x}v \mid x(y).(\mathbf{0}|v(w)|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau}\xrightarrow{\tau}$ nor $\mathcal{T}\big(\bar{x}v \mid x(y).(\bar{y}z|\mathbf{0}|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau}\xrightarrow{\tau}$. $\qquad$ (2)

Furthermore, since $\mathcal{T}$ respects $\leftrightarrow_r$ and there is no $S \in T_\pi$ such that

$$\bar{x}v \mid x(y).(\bar{y}z|v(w)|\mathbf{0}|\mathbf{0}) \xrightarrow{\tau} S \not\xrightarrow{\tau},$$

there is no $S \in T_{\mathrm{CCS}}$ with $\mathcal{T}\big(\bar{x}v \mid x(y).(\bar{y}z|v(w)|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau} S \not\xrightarrow{\tau}$. $\qquad$ (3)

By (1) and (3), $E[\mathcal{T}(\bar{y}z|v(w))] \xrightarrow{\tau}$.

By (1) and (2), $E[\mathcal{T}(\mathbf{0}|v(w))] \not\xrightarrow{\tau}$ and $E[\mathcal{T}(\bar{y}z|\mathbf{0})] \not\xrightarrow{\tau}$.

Since $\mathcal{T}$ is compositional, there is a binary CCS context $C_|[\;,\;]$ such that $\mathcal{T}(P|Q) = C_|[\mathcal{T}(P), \mathcal{T}(Q)]$ for any $P, Q \in T_\pi$. It follows that

$$\begin{array}{l} E[C_|[\mathcal{T}(\bar{y}z), \mathcal{T}(v(w))]] \xrightarrow{\tau} \\ E[C_|[\mathcal{T}(\mathbf{0}), \mathcal{T}(v(w))]] \not\xrightarrow{\tau} \\ E[C_|[\mathcal{T}(\bar{y}z), \mathcal{T}(\mathbf{0})]] \not\xrightarrow{\tau} \end{array}$$

Moreover, since $\tau \xrightarrow{\tau}$, also $U := \mathcal{T}(\tau) \xrightarrow{\tau}$; but, since it is not the case that $\bar{x}v \mid x(y).(\tau|\mathbf{0}|\mathbf{0}) \xrightarrow{\tau}\xrightarrow{\tau}\xrightarrow{\tau}$, neither does $\mathcal{T}\big(\bar{x}v \mid x(y).(\tau|\mathbf{0}|\mathbf{0})\big) \xrightarrow{\tau}\xrightarrow{\tau}\xrightarrow{\tau}$ hold, nor $E[U] \xrightarrow{\tau}\xrightarrow{\tau}$. So by Lemma 3, $\mathcal{T}(\bar{y}z|v(w)) = C_|[\mathcal{T}(\bar{y}z), \mathcal{T}(v(w))] \xrightarrow{\tau}$, yet $\bar{y}z|v(w) \not\xrightarrow{\tau}$. This contradicts the validity of $\mathcal{T}$ up to $\leftrightarrow_r$. ⊓⊔

### 9 A valid translation of πIM into CCSγ

Given a set $\mathcal{N}$ of names, I now define the parameters $\mathcal{K}$, $\mathcal{A}$ and γ of the language CCSγ that will be the target of my encoding. First of all, $\mathcal{K}$ will be the disjoint union of all the sets $\mathcal{K}_n$ for $n \in \mathbb{N}$, of $n$-ary agent identifiers from the chosen instance of the π-calculus.

Take $p \notin \mathcal{N}$. Let $\mathcal{R}_0 := \{{}^{\varsigma}p \mid \varsigma \in \{e, \ell, r\}^*\}$. The set $\mathcal{R}$ of private names is $\{u^{\upsilon} \mid u \in \mathcal{R}_0 \wedge \upsilon \in \{\prime\}^*\}$. Let $\mathcal{S} = \{s_1, s_2, \ldots\}$ be an infinite set of spare names, disjoint from $\mathcal{N}$ and $\mathcal{R}$. Let $\mathcal{Z} := \mathcal{N} \uplus \mathcal{S}$ and $\mathcal{H} := \mathcal{Z} \uplus \mathcal{R}$.⁴

I take $Act$ to be the set of all expressions α from Table 2, as defined in Section 4 (in terms of $\mathcal{Z}$ and $\mathcal{R}$), so $\mathcal{A} := Act \setminus \{\tau\}$. The communication function γ is given by $\gamma(Mxy, N\bar{v}y) = [x{=}v]MN\tau$, just as for rule e-s-com in Table 6.

For $\vec{x} = (x_1, \ldots, x_n) \in \mathcal{N}^n$ and $\vec{y} = (y_1, \ldots, y_n) \in \mathcal{H}^n$, with the $x_i$ distinct, let $\{\vec{y}/\vec{x}\}^{\mathcal{S}} : \mathcal{S} \cup \{x_1, \ldots, x_n\} \rightharpoonup \mathcal{H}$ be the substitution σ with $\sigma(x_i) = y_i$ and $\sigma(s_i) = x_i$ for $i = 1, \ldots, n$, and $\sigma(s_i) = s_{i-n}$ for $i > n$. These functions extend homomorphically to $\mathcal{A}$ and thereby constitute CCSγ relabellings. Abbreviate $[\{\vec{y}/\vec{x}\}^{\mathcal{S}}]$ by $[\vec{y}/\vec{x}]$ and $[\{z/y\}^{\mathcal{S}}]$ by $[z/y]$.

For $\eta \in \{\ell, r, e\}$ and $y \in \mathcal{Z}$, let the surjective substitutions $\eta : \mathcal{R} \rightharpoonup \mathcal{R}$ and $p_y : \{y\} \cup \mathcal{R} \to \{y\} \cup \mathcal{R}$ be given by:

$$\begin{array}{llcll} \eta({}^{\varsigma}p) &:= {}^{\eta\varsigma}p & \qquad & p_y(y) &:= p \\ \eta({}^{\varsigma}p^{\upsilon\prime}) &:= {}^{\varsigma}p^{\upsilon} \ \ \text{if } \varsigma \neq \eta\zeta & & p_y(p') &:= y \\ & & & p_y(u) &:= e(u) \quad \text{if } u \neq y, p'. \end{array}$$

These $\sigma : \mathcal{H} \rightharpoonup \mathcal{H}$ are injective, i.e., $x[\sigma] \neq y[\sigma]$ when $x \neq y$. Also they yield CCSγ relabellings. The following compositional encoding, which will be illustrated with examples in Section 12, defines my translation from πIM to CCSγ.

$$\begin{array}{lcl} \mathcal{T}(\mathbf{0}) &:=& \mathbf{0} \\ \mathcal{T}(M\tau.P) &:=& M\tau.\mathcal{T}(P) \\ \mathcal{T}(M\bar{x}y.P) &:=& M\bar{x}y.\mathcal{T}(P) \\ \mathcal{T}(Mx(y).P) &:=& \displaystyle\sum_{z \in \mathcal{H}} Mxz.\big(\mathcal{T}(P)[z/y]\big) \\ \mathcal{T}((\nu y)P) &:=& \mathcal{T}(P)[p_y] \\ \mathcal{T}(P \mid Q) &:=& \mathcal{T}(P)[\ell] \parallel \mathcal{T}(Q)[r] \\ \mathcal{T}(P+Q) &:=& \mathcal{T}(P) + \mathcal{T}(Q) \\ \mathcal{T}(A(\vec{y})) &:=& A[\vec{y}/\vec{x}] \qquad \text{when } A(\vec{x}) \stackrel{\mathrm{def}}{=} P \end{array}$$

where the CCSγ agent identifier $A$ has the defining equation $A \stackrel{\mathrm{def}}{=} \mathcal{T}(P)$ when $A(\vec{x}) \stackrel{\mathrm{def}}{=} P$ was the defining equation of the agent identifier $A$ from the π-calculus.

To explain what this encoding does: inaction, silent prefix, output prefix and choice are translated homomorphically. The input prefix is translated into an infinite sum over all possible input values $z$ that could be received, of the received message $Mxz$ followed by the continuation process $\mathcal{T}(P)[z/y]$. Here $[z/y]$ is a CCS relabelling operator that simulates substitution of $z$ for $y$ in $\mathcal{T}(P)$. This

⁴ The names in $\mathcal{S}$ and in $\mathcal{R} \setminus \mathcal{R}_0$ exist solely to make the substitutions $\{\vec{y}/\vec{x}\}^{\mathcal{S}}$, $\eta$ and $p_y$ surjective. Here σ is surjective iff $dom(\sigma) \subseteq range(\sigma)$.

implements the rule early-input from Table 6. Agent identifiers are also translated homomorphically, except that their arguments ~y are replaced by relabelling operators.

Restriction is translated by simply dropping the restriction operator, but renaming the restricted name $y$ into a private name $p$ that generates no barbs. The operator $[p_y]$ injectively renames all private names ${}^{\varsigma}p$ that occur in the scope of $(\nu y)$ by tagging all of them with a tag $e$. This ensures that the new private name $p$ is fresh, so that no name clashes can occur that in πIM would have been prevented by the restriction operator.

Parallel composition is almost translated homomorphically. However, each private name on the right is tagged with an $r$, and on the left with an $\ell$. This guarantees that private names introduced at different sides of a parallel composition cannot interact. Interaction is only possible when the name is passed on in the appropriate way.
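The shape of this encoding can be sketched in Haskell; the infinite sum over $\mathcal{H}$ becomes a lazily name-indexed choice, and relabellings are kept symbolic. Everything here (types, constructor names, the treatment of tags) is an illustrative assumption, not the paper's definition:

```haskell
type Name = String

data Pi                      -- a πIM fragment without match and identifiers
  = PNil
  | POut Name Name Pi        -- x̄y.P
  | PIn  Name Name Pi        -- x(y).P
  | PNu  Name Pi             -- (νy)P
  | PPar Pi Pi
  | PSum Pi Pi

data Relab = SubstFor Name Name   -- [z/y], i.e. [{z/y}^S]
           | Priv Name            -- [p_y]: y ↦ p, retagging private names
           | Tag Char             -- [ℓ] and [r]

data CCSg
  = GNil
  | GOut Name Name CCSg           -- output prefix x̄y.–
  | GInSum Name (Name -> CCSg)    -- Σ_{z∈H} xz.(– indexed by z)
  | GRel Relab CCSg               -- relabelling –[f]
  | GPar CCSg CCSg                -- parallel composition ∥
  | GSum CCSg CCSg                -- choice +

encode :: Pi -> CCSg
encode PNil         = GNil
encode (POut x y p) = GOut x y (encode p)
encode (PIn x y p)  = GInSum x (\z -> GRel (SubstFor z y) (encode p))
encode (PNu y p)    = GRel (Priv y) (encode p)
encode (PPar p q)   = GPar (GRel (Tag 'l') (encode p))
                           (GRel (Tag 'r') (encode q))
encode (PSum p q)   = GSum (encode p) (encode q)
```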

The main result of this paper states the validity of the above translation, and thus that CCSγ is at least as expressive as πIM:

Theorem 2. For $P \in T_\pi$ one has $\mathcal{T}(P)$ •∼ $P$.

See http://theory.stanford.edu/~rvg/abstracts.html#153 for a proof.

Theorem 2 says that each π-calculus process is strongly barbed bisimilar to its translation as a CCSγ process. The labelled transition systems of the π-calculus and CCSγ are both of the type presented in Section 4, i.e. with transition labels taken from Table 2. There also the associated barbs are defined. By Theorem 2 each π transition $P \xrightarrow{\tau} P'$ can be matched by a CCSγ transition $\mathcal{T}(P) \xrightarrow{\tau} Q$ with $\mathcal{T}(P')$ •∼ $Q$. Likewise, each CCSγ transition $\mathcal{T}(P) \xrightarrow{\tau} Q$ can be matched by a π transition $P \xrightarrow{\tau} P'$ with $\mathcal{T}(P')$ •∼ $Q$. Moreover, if $P$ has a barb $x$ (or $\bar{x}$) then so does $\mathcal{T}(P)$, and vice versa. Here a π or CCSγ process $P$ has a barb $a \in \mathcal{Z} \cup \bar{\mathcal{Z}}$ iff $P \xrightarrow{ay} P'$ or $P \xrightarrow{a(y)} P'$ for some name $y \in \mathcal{H}$ and process $P'$. Transitions $P \xrightarrow{M\bar{x}y} P'$, $P \xrightarrow{M\bar{x}(y)} P'$, $P \xrightarrow{Mxy} P'$ or $P \xrightarrow{Mx(y)} P'$ with $M \neq \varepsilon$ or $x \in \mathcal{R}$ generate no barbs.

### 10 The ideas behind this encoding

The above encoding combines seven ideas, each of which appears to be necessary to achieve the desired result. Accordingly, the translation could be described as the composition of seven encodings, leading from πIM to CCS<sup>γ</sup> via six intermediate languages. Here a language comprises syntax as well as semantics. Each of the intermediate languages has a labelled transition system semantics where the labels are as described in Section 4. Accordingly, at each step it is well-defined whether strong barbed bisimilarity is preserved, and one can show it is. These proofs go by induction on the derivation of transitions, where the transitions with visible labels are necessary steps even when one would only be interested in the transitions with τ -labels. There are various orders in which the seven steps can be taken. The seven steps are:

Fig. 2. Translation from the π-calculus with implicit matching to CCSγ. Definitions of the intermediate languages $\pi_{\mathrm{IM}}(\mathcal{Z},\mathcal{R})$ and $\pi^{\dagger}_{\mathrm{IM}}(\mathcal{Z},\mathcal{R})$ are not provided here.


As indicated in Figure 2, my translation maps the π-calculus with implicit matching to a subset of CCSγ. On that subset, π-calculus behaviour can be replayed faithfully, at least up to strong early congruence, the congruence closure of strong barbed bisimilarity (cf. [11]). However, the interaction between a translated π-calculus process and a CCSγ process outside the image of the translation may be disturbing, and devoid of good properties. Also, in case intermediate languages are encountered on the way from πIM to CCSγ, which is just one of the ways to prove my result, no guarantees are given on the sanity of those languages outside the image of the source language, i.e. on their behaviour outside the realm of clash-free processes after Step 3 has been taken.

### 11 Triggering

To include the general matching operator in the source language I need to extend the target language with the triggering operator $s{\Rightarrow}P$ of Meije [1,34]:

$$\frac{P \xrightarrow{\alpha} P'}{s \Rightarrow P \xrightarrow{s\alpha} P'}$$

Meije features signals and actions; each signal s can be "applied" to an action α, and doing so yields an action sα. In this paper the actions are as in Table 2, and a signal is an expression [x=y] with x, y ∈ N ; application of a signal to an action was defined in Section 6.

Triggering cannot be expressed in CCSγ, as rooted weak bisimilarity [2], the weak congruence of [19,20], is a congruence for CCSγ but not for triggering. However, rooted branching bisimilarity [12] is a congruence for triggering [9].

My translation from πIM to CCSγ can be extended into one from the full π-calculus to $\mathrm{CCS}^{\mathrm{trig}}_\gamma$ by adding the clause

$$\mathcal{T}([x{=}y]P) \;:=\; [x{=}y] \Rightarrow \mathcal{T}(P).$$

Theorem 2 applies to this extended translation as well.

### 12 Examples

Example 2. The outgoing transitions of $x(y).\bar{y}w$ are

$$x(y).\bar{y}w \xrightarrow{xz_i} \bar{z}_iw \xrightarrow{\bar{z}_iw} \mathbf{0} \qquad (i = 1, 2, \ldots).$$

The same applies to its translation $\sum_{z\in\mathcal{H}} xz.\big((\bar{y}w.\mathbf{0})[z/y]\big)$:

$$\sum_{z\in\mathcal{H}} xz.\big((\bar{y}w.\mathbf{0})[z/y]\big) \xrightarrow{xz_i} (\bar{y}w.\mathbf{0})[z_i/y] \xrightarrow{\bar{z}_iw} \mathbf{0}[z_i/y] \qquad (i = 1, 2, \ldots).$$

Here the $z_i$ range over all names in $\mathcal{N}$. Below I flatten such a picture by drawing the arrows only for one name $z$, which however still ranges over $\mathcal{N}$.

Example 3. The transitions of $P = x(y).\bar{y}w \mid \bar{x}u.u(v)$ are

$$\begin{array}{ccc} (x(y).\bar{y}w)|\bar{x}u.u(v) & \xrightarrow{xz} \bar{z}w|\bar{x}u.u(v) & \xrightarrow{\bar{z}w} \mathbf{0}|\bar{x}u.u(v) \\ \big\downarrow{\scriptstyle \bar{x}u} & \big\downarrow{\scriptstyle \bar{x}u} & \big\downarrow{\scriptstyle \bar{x}u} \\ (x(y).\bar{y}w)|u(v) & \xrightarrow{xz} \bar{z}w|u(v) & \xrightarrow{\bar{z}w} \mathbf{0}|u(v) \end{array}$$

together with $P \xrightarrow{\tau} \bar{u}w|u(v) \xrightarrow{\tau} \mathbf{0}|\mathbf{0}$.

Here $\bar{u}w|u(v)$ is the special case of $\bar{z}w|u(v)$ obtained by taking $z := u$. It thus also has outgoing transitions labelled $\bar{u}w$ and $uq$, for $q \in \mathcal{N}$.

Up to strong bisimilarity, the same transition system is obtained by the translation $\mathcal{T}(P)$ of $P$ in CCSγ.

$$\mathcal{T}(P) = \Big(\sum_{z\in\mathcal{H}} xz.\big((\bar{y}w.\mathbf{0})[z/y]\big)\Big)[\ell] \;\Big\|\; \Big(\bar{x}u.\sum_{z\in\mathcal{H}} uz.\big(\mathbf{0}[z/v]\big)\Big)[r]$$

Since there are no restriction operators in this example, the relabelling operators $[\ell]$ and $[r]$ are of no consequence. Here

$$\mathcal{T}(P) \xrightarrow{\tau} \big((\bar{y}w.\mathbf{0})[u/y]\big)[\ell] \;\Big\|\; \Big(\sum_{z\in\mathcal{H}} uz.\big(\mathbf{0}[z/v]\big)\Big)[r] \xrightarrow{\tau} \big(\mathbf{0}[u/y]\big)[\ell] \;\Big\|\; \big(\mathbf{0}[w/v]\big)[r].$$

Example 4. Let $Q = (\nu x)\big(x(y).\bar{y}w \mid (\nu u)(\bar{x}u.u(v))\big)$. It has no other transitions than

$$Q \xrightarrow{\tau} (\nu x)(\nu u)(\bar{u}w|u(v)) \xrightarrow{\tau} (\nu x)(\nu u)(\mathbf{0}|\mathbf{0}).$$

Its translation $\mathcal{T}(Q)$ into CCSγ is

$$\Big(\Big(\sum_{z\in\mathcal{H}} xz.\big((\bar{y}w.\mathbf{0})[z/y]\big)\Big)[\ell] \;\Big\|\; \Big(\bar{x}u.\sum_{z\in\mathcal{H}} uz.\big(\mathbf{0}[z/v]\big)\Big)[p_u][r]\Big)[p_x]$$

Up to strong bisimilarity, its transition system is the same as that of $P$ or $\mathcal{T}(P)$ from Example 3, except that in transition labels the name $u$ is renamed into the private name ${}^{er}p$, and $x$ is renamed into the private name $p$. One has $\mathcal{T}(Q)$ •∼ $Q$, since private names generate no barbs.

Example 5. The process $(\nu x)(x(y)) \mid (\nu x)(\bar{x}u)$ has no outgoing transitions. Accordingly, its translation

$$\Big(\sum_{z\in\mathcal{H}} xz.\big(\mathbf{0}[z/y]\big)\Big)[p_x][\ell] \;\Big\|\; (\bar{x}u.\mathbf{0})[p_x][r]$$

only has outgoing transitions labelled ${}^{\ell}p\,z$ for $z \in \mathcal{H}$ and $\overline{{}^{r}p}\,u$. Since the names ${}^{\ell}p$ and ${}^{r}p$ are private, these transitions generate no barbs. In this example, the relabelling operators $[\ell]$ and $[r]$ are essential. Without them, the mentioned transitions would have complementary names, and communicate into a τ-transition.

Example 6. Let $P = (\nu y)(\bar{x}y.\bar{y}w) \mid x(u).u(v)$. Then

$$P \xrightarrow{\tau} (\nu y)\big(\bar{y}w \mid y(v)\big) \xrightarrow{\tau} (\nu y)(\mathbf{0}|\mathbf{0}).$$

Now $\mathcal{T}\big((\nu y)(\bar{x}y.\bar{y}w)\big) = (\bar{x}y.\bar{y}w.\mathbf{0})[p_y]$ and

$$\mathcal{T}(x(u).u(v)) = \sum_{z\in\mathcal{H}} xz.\Big(\Big(\sum_{z\in\mathcal{H}} uz.\big(\mathbf{0}[z/v]\big)\Big)[z/u]\Big).$$

Hence $\mathcal{T}\big((\nu y)(\bar{x}y.\bar{y}w)\big)[\ell] \xrightarrow{\bar{x}\,{}^{\ell}p} (\bar{y}w.\mathbf{0})[p_y][\ell]$. Since the substitution $r$ used in the relabelling operator $[r]$ is surjective, there is a name $s$ that is mapped to ${}^{\ell}p$, namely ${}^{\ell}p'$. Considering that $\mathcal{T}(x(u).u(v)) \xrightarrow{xs} \mathcal{T}(u(v))[s/u]$,

$$\mathcal{T}(P) \xrightarrow{\tau} (\bar{y}w.\mathbf{0})[p_y][\ell] \;\Big\|\; \Big(\Big(\sum_{z\in\mathcal{H}} uz.\big(\mathbf{0}[z/v]\big)\Big)[s/u]\Big)[r].$$

These parallel components can perform actions $\overline{{}^{\ell}p}\,w$ and ${}^{\ell}p\,w$, synchronising into a τ-transition, and thereby mimicking the behaviour of $P$.

Example 7. Let P = (νy)(x̄y.(νy)(ȳw)) | x(u).u(v). Then P −τ→ (νy)((νy)(ȳw) | y(v)), which has no further τ-transition. One obtains

$$\mathcal{T}(P) \xrightarrow{\tau} (\bar{y}w.\mathbf{0})[p_y][p_y][\ell] \;\Big\|\; \Big(\sum_{z \in \mathcal{N}} uz.(\mathbf{0}[z/v])\Big)[s/u][r]$$

for a name s that under [r] maps to <sup>ℓ</sup>p<sub>y</sub>. Now the left component can do an action <sup>ℓ</sup>p̃<sub>y</sub>w, whereas the right component can merely match with <sup>ℓ</sup>p<sub>y</sub>w. No synchronisation is possible. This shows why it is necessary that the relabelling [p<sub>y</sub>] not only renames y into p<sub>y</sub>, but also p<sub>y</sub> into p̃<sub>y</sub>.

Example 8. Let P = x(y).x(w).w̄u. Then

$$P|\bar{x}v.\bar{x}y.y(v) \xrightarrow{\tau} x(w).\bar{w}u|\bar{x}y.y(v) \xrightarrow{\tau} \bar{y}u|y(v) \xrightarrow{\tau} \mathbf{0}|\mathbf{0}.$$

Therefore, T(P | x̄v.x̄y.y(v)) must also be able to start with three consecutive τ-transitions. Note that

$$\mathcal{T}(P \mid \bar{x}v.\bar{x}y.y(v)) = \mathcal{T}(P)[\ell] \;\Big\|\; \Big(\bar{x}v.\bar{x}y.\sum_{z \in \mathcal{N}} yz.(\mathbf{0}[z/v])\Big)[r]$$

with

$$\mathcal{T}(P) = \sum_{z \in \mathcal{N}} xz.\Big(\Big(\sum_{z \in \mathcal{N}} xz.\big((\bar{w}u.\mathbf{0})[z/w]\big)\Big)[z/y]\Big).$$

The only way to obtain T(P | x̄v.x̄y.y(v)) −τ→ −τ→ −τ→ is when T(P) −xv→ Q −xy→ −ȳu→. The CCS<sup>γ</sup> process Q must be

$$\Big(\sum_{z \in \mathcal{N}} xz.\big((\bar{w}u.\mathbf{0})[z/w]\big)\Big)[v/y].$$

Given the semantics of CCS relabelling, one must have $\sum_{z \in \mathcal{N}} xz.((\bar{w}u.\mathbf{0})[z/w]) \xrightarrow{\alpha}$, such that applying the relabelling [v/y] to α yields xy. When simply taking [{v/y}] for [v/y], that is, the relabelling that changes all occurrences of the name y in a transition label into v, this is not possible. This shows that a simplification of my translation without use of the spare names S would not be valid.

Crucial for this example is that I only use surjective substitutions. [v/y] is an abbreviation of [{v/y}<sup>S</sup>]. Here {v/y}<sup>S</sup> is a surjective substitution that not only renames y into v, but also sends a spare name s to y. This allows me to take α := xs. Consequently, in deriving the transition $\sum_{z \in \mathcal{N}} xz.((\bar{w}u.\mathbf{0})[z/w]) \xrightarrow{\alpha}$, I choose z to be s, so that

$$\sum_{z \in \mathcal{N}} xz.\big((\bar{w}u.\mathbf{0})[z/w]\big) \xrightarrow{xs} (\bar{w}u.\mathbf{0})[s/w] \xrightarrow{\bar{s}u} \mathbf{0}[s/w].$$

Putting this in the scope of the relabelling [v/y] yields

$$Q \xrightarrow{xy} (\bar{w}u.\mathbf{0})[s/w][v/y] \xrightarrow{\bar{y}u} \mathbf{0}[s/w][v/y]$$

as desired, and the example works out.<sup>5</sup>

This example shows that spare names play a crucial role in intermediate states of CCSγ-translations. In general this leads to stacked relabellings from true names into spare ones and back. Making sure that in the end one always ends up with the right names calls for particularly careful proofs that do not cut corners in the bookkeeping of names.
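As a tiny self-contained illustration of this mechanism (our own sketch, using hypothetical concrete names x, y, v and spare name s), the following contrasts the plain relabelling [{v/y}] with its surjective variant, both viewed as functions on the names occurring in transition labels:

```python
# Sketch (ours): two relabellings on names, as used in Example 8.
def plain(name):
    # [{v/y}]: rename y to v, leave everything else alone
    return "v" if name == "y" else name

def surjective(name):
    # [{v/y}^S]: rename y to v, and additionally send the spare name s to y
    return {"y": "v", "s": "y"}.get(name, name)

# Under `plain`, no name is mapped to y, so no transition of the scoped
# process can be relabelled to the required label xy; under `surjective`,
# a transition labelled xs is relabelled to xy, as exploited above.
print(plain("s"), surjective("s"))  # prints: s y
```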

A last example showing a crucial feature of my translation is discussed in Section 14.

### 13 The unencodability of CCS into π

Let f : A → A be a CCS relabelling function satisfying f(x<sub>i</sub>y) = x<sub>i+1</sub>y. Here (x<sub>i</sub>)<sub>i=0</sub><sup>∞</sup> is an infinite sequence of names, and A is as in Section 4. The CCS process A defined by

$$A := x\_0 y. \mathbf{0} + \tau. (A[f])$$

satisfies ∃P. A −τ→<sup>∗</sup> P ∧ P↓<sub>x<sub>i</sub></sub> for all i ≥ 0, i.e., it has infinitely many weak barbs. It is easy to check that all weak barbs of a π-calculus process Q must be free names of Q, of which there are only finitely many. Consequently, there is no π-calculus process Q with A •∼ Q, and hence no translation of CCS into the π-calculus that is valid up to •∼.<sup>6</sup>
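To spell out the unfolding behind this claim: each τ-step of A stacks one more relabelling [f], and the i-fold relabelled copy still offers its initial action, now renamed to x<sub>i</sub>y:

$$A \xrightarrow{\tau} A[f] \xrightarrow{\tau} A[f][f] \xrightarrow{\tau} \cdots \qquad\text{and}\qquad A[f]^i \xrightarrow{x_i y} \mathbf{0}[f]^i,\ \text{so that } A[f]^i{\downarrow}_{x_i}.$$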

### 14 Related work

My translation from π<sub>IM</sub> to CCS<sup>γ</sup> is inspired by an earlier translation E from a version of the π-calculus to CCS, proposed by Banach & van Breugel [3].

<sup>5</sup> This use of spare names solves the problem raised in [3, Footnote 5].

<sup>6</sup> In [28] it was already mentioned, by reference to Pugliese [personal communication, 1997], that CCS relabelling operators cannot be encoded in the π-calculus.

The paper [3] takes A := {⟨x, y⟩ | x, y ∈ N} for the visible CCS actions; action ⟨x, y⟩ corresponds with my xy, and its complement $\overline{\langle x, y\rangle}$ with my x̄y. On the fragment of π featuring inaction, prefixing, choice and parallel composition, the encoding of [3] is given by

$$\begin{array}{lcl}
\mathcal{E}(\mathbf{0}) & := & \mathbf{0} \\
\mathcal{E}(\tau.P) & := & \tau.\mathcal{E}(P) \\
\mathcal{E}(\bar{x}y.P) & := & \overline{\langle x,y\rangle}.\mathcal{E}(P) \\
\mathcal{E}(x(y).P) & := & \sum_{z \in \mathcal{N}} \langle x,z\rangle.(\mathcal{E}(P)[z/y]) \\
\mathcal{E}(P \mid Q) & := & \mathcal{E}(P) \mid \mathcal{E}(Q) \\
\mathcal{E}(P + Q) & := & \mathcal{E}(P) + \mathcal{E}(Q)
\end{array}$$
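For concreteness, the following is a small executable sketch (our own ASTs and names, not code from [3]) of this encoding; the infinite sum over all names is kept symbolic in a dedicated constructor, since N is infinite:

```python
# Sketch (ours) of the encoding E on the pi-calculus fragment above.
from dataclasses import dataclass

# --- source: the pi-calculus fragment ---
@dataclass
class Nil: pass
@dataclass
class Tau: cont: object                          # tau.P
@dataclass
class Out: chan: str; datum: str; cont: object   # x̄y.P
@dataclass
class In: chan: str; var: str; cont: object      # x(y).P
@dataclass
class Par: left: object; right: object           # P | Q
@dataclass
class Sum: left: object; right: object           # P + Q

# --- target: CCS with actions <x,y>, relabelling, and a symbolic sum ---
@dataclass
class CNil: pass
@dataclass
class CTau: cont: object
@dataclass
class COut: chan: str; datum: str; cont: object  # <x,y>-bar prefix
@dataclass
class CPar: left: object; right: object
@dataclass
class CSum: left: object; right: object
@dataclass
class Relab: body: object; new: str; old: str    # body[new/old]
@dataclass
class SumOverNames: chan: str; var: str; body: object  # sum_z <chan,z>.body

def E(p):
    """Homomorphic on all constructs except input prefixes."""
    if isinstance(p, Nil): return CNil()
    if isinstance(p, Tau): return CTau(E(p.cont))
    if isinstance(p, Out): return COut(p.chan, p.datum, E(p.cont))
    if isinstance(p, In):  # E(x(y).P) = sum_z <x,z>.(E(P)[z/y])
        return SumOverNames(p.chan, "z", Relab(E(p.cont), "z", p.var))
    if isinstance(p, Par): return CPar(E(p.left), E(p.right))
    if isinstance(p, Sum): return CSum(E(p.left), E(p.right))
    raise TypeError(p)

# E(x̄v | x(y).(ȳu | v(w))), the counterexample P discussed below:
P = Par(Out("x", "v", Nil()),
        In("x", "y", Par(Out("y", "u", Nil()), In("v", "w", Nil()))))
print(E(P))
```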

The main result of [3] (Theorem 5.3), stating the correctness of this encoding, says that P ↔<sup>r</sup> Q iff E(P) ↔<sup>r</sup> E(Q), for all π-processes P and Q. Here ↔<sup>r</sup> is strong reduction bisimilarity—see Definition 7. In fact, replacing the call to Lemma 3.5 in the proof of this theorem by a call to Lemma 3.4, they could equally well have claimed the stronger result that P ↔<sup>r</sup> E(P) for all π-processes P, i.e., that E is valid up to ↔<sup>r</sup> .

This result contradicts my Theorem 1 and thus must be flawed. Where it fails can be detected by pushing the counterexample process P := x̄v | x(y).R with R := ȳu | v(w), used in the proof of Theorem 1, through the encoding of [3]. I claim that while P −τ→ v̄u | v(w) −τ→, its translation E(P) cannot do two τ-steps. Hence P ↮<sup>r</sup> E(P). Using a trivial process Q such that P ↔<sup>r</sup> Q ↔<sup>r</sup> E(Q), this also constitutes a counterexample to [3, Theorem 5.3].

Note that E(R) = $\overline{\langle y,u\rangle}.\mathbf{0} \mid \sum_{z \in \mathcal{N}} \langle v,z\rangle.(\mathbf{0}[z/w])$. This process can perform the actions $\overline{\langle y,u\rangle}$ as well as ⟨v, u⟩, but no action τ, since y ≠ v. Now

$$\mathcal{E}(P) = \overline{\langle x,v\rangle}.\mathbf{0} \;\Big|\; \sum_{z \in \mathcal{N}} \langle x,z\rangle.(\mathcal{E}(R)[z/y]).$$

Its only τ -transition goes to 0 | E(R)[v/y]. This process can perform the actions hv, ui as well as hv, ui, but still no action τ , since [v/y] is a CCS relabelling operator rather than a substitution, and it is applied only after any synchronisations between hy, ui.0 and P <sup>z</sup>∈<sup>N</sup> <sup>h</sup>v, zi.(0[z/w]) are derived.

My own encoding T translates the processes P and R essentially in the same way, but now there is a transition $\mathcal{T}(R) \xrightarrow{[y=v]\tau} (\mathbf{0} \,\|\, \mathbf{0}[u/w])$. The renaming [v/y] turns this synchronisation into a τ:

$$\mathcal{T}(P) \xrightarrow{\tau} \mathcal{T}(R)[v/y] \xrightarrow{\tau} (\mathbf{0} \,\|\, \mathbf{0}[u/w])[v/y].$$

The crucial innovation of my approach over [3] in this regard is the switch from the early to the early symbolic semantics of the π-calculus, combined with a switch from CCS as target language to CCSγ.

In [31], Roscoe argues that CSP is at least as expressive as the π-calculus. As evidence he presents a translation from the latter to the former. Roscoe does not provide a criterion for the validity of such a translation, nor a result implying that a suitable criterion has been met. The following observations show that his translation is not compositional, and that it is debatable whether it preserves a reasonable semantic equivalence.


As pointed out in [14,29], even the most bizarre translations can be found valid if one only imposes requirements based on semantic equivalence, and not compositionality. Roscoe's translation is actually rather elegant. However, we do not have a decent criterion to say to what extent it is a valid translation. The expressiveness community strongly values compositionality as a criterion, and this attribute is the novelty brought in by my translation.

### 15 Conclusion

This paper exhibited a compositional translation from the π-calculus to CCS<sup>γ</sup> extended with triggering that is valid up to strong barbed bisimilarity, thereby showing that the latter language is at least as expressive as the former. Triggering is not needed when restricting to the π-calculus with implicit matching (as used for instance in [33]). Conversely, I observed that CCS (and thus certainly CCSγ) cannot be encoded in the π-calculus. I also showed that the upgrade of CCS to CCS<sup>γ</sup> is necessary to capture the expressiveness of the π-calculus.

A consequence of this work is that any system specification or verification that is carried out in the setting of the π-calculus can be replayed in CCSγ. The main idea here is to replace the names that are kept private in the π-calculus by means of the restriction operator, by names that are kept private by means of a careful bookkeeping ensuring that the same private name is never used twice. Of course this in no way suggests that it would be preferable to replay π-calculus specifications or verifications in CCSγ.

My translation encodes the restriction operator (νy) from the π-calculus by renaming y into a "private name". Crucial for this approach is that private names generate no barbs, in contrast with standard approaches where all names generate barbs. This use of private names is part of the definition of strong barbed bisimilarity •∼ on my chosen instance of CCSγ, and justified since that definition is custom made in the present paper. The use of private names can be avoided by placing an outermost CCS restriction operator around any translated π-process. This, however, would violate the compositionality of my translation.

The use of infinite summation in my encoding might be considered a serious drawback. However, when sticking to a countable set of π-calculus names, only countable summation is needed, which, as shown in [8], can be eliminated in favour of unguarded recursion with infinitely many recursion equations. As the original presentation of the π-calculus already allows unguarded recursion with infinitely many recursion equations [24], the latter cannot reasonably be forbidden in the target language of the translation. Still, it is an interesting question whether infinite sums or infinite sets of recursion equations can be avoided in the target language if we rule them out in the source language. My conjecture is that this is possible, but at the expense of further upgrading CCS<sup>γ</sup>, say to aprACP<sup>τ</sup><sub>R</sub>. This would however require work that goes well beyond what is presented here.

An alternative approach is to use a version of CCS featuring a choice quantifier [17] instead of infinitary summation, a construct that looks remarkably like an infinite sum, but is as finite as any quantifier from predicate logic. A choice quantifier binds a data variable z (here ranging over names) to a single process expression featuring z. The present application would need a function from names to CCS relabelling operators. When using this approach, the size of translated expressions becomes linear in the size of the originals.

It could be argued that choice quantification is a step towards mobility. On the other hand, if mobility is associated more with scope extrusion than with name binding itself, one could classify CCS<sup>γ</sup> with choice quantification as an immobile process algebra. A form of choice quantification is standard in mCRL2 [15], which is often regarded as "immobile".

My translation from π to CCS<sup>γ</sup> has a lot in common with the attempted translation of π to CCS in [3]. That one is based on the early operational semantics of CCS, rather than the early symbolic one used here. As a consequence, substitutions there cannot be eliminated in favour of relabelling operators.

A crucial step in my translation yields an intermediate language with an operational semantics in De Simone format. In [7] another representation of the π-calculus is given through an operational semantics in the De Simone format. It uses a different way of dealing with substitutions. This type of semantics could be an alternative stepping stone in an encoding from the π-calculus into CCSγ.

In [28] Palamidessi showed that there exists no uniform encoding of the π-calculus into a variant of CCS. Here uniform means that T(P|Q) = T(P)|T(Q). This does not contradict my result in any way, as my encoding is not uniform. Palamidessi [28] finds uniformity a reasonable criterion for encodings, because it guarantees that the translation maintains the degree of distribution of the system. In [30], however, it is argued that it is possible to maintain the degree of distribution of a system upon translation without requiring uniformity. In fact, the translation offered here is a good example of one that is not uniform, yet maintains the degree of distribution.

Gorla [13] proposes five criteria for valid encodings, and shows that there exists no valid encoding of the π-calculus (even its asynchronous fragment) into CCS. Gorla's proof heavily relies on the criterion of name invariance imposed on valid encodings. It requires for P ∈ T<sub>π</sub> and an injective substitution σ that T(Pσ) = T(P)σ′ for some substitution σ′ that is obtained from σ through a renaming policy. Furthermore, the renaming policy is such that if dom(σ) is finite, then also dom(σ′) is finite. This latter requirement is not met by the encoding presented here, for a single name x ∈ N corresponds with an infinite set of actions xy, the "names" of CCS, and a substitution that merely renames x into z must rename each action xy into zy at the CCS end, thus violating the finiteness of dom(σ′).

My encoding also violates Gorla's compositionality requirement, on grounds that T(P) appears multiple times (actually, infinitely many times) in the translation of M x(y).P. It is however compositional by the definition in [10] and elsewhere. My encoding satisfies all other criteria of [13] (operational correspondence, divergence reflection and success sensitiveness).

### References



### Concurrent NetKAT

### Modeling and analyzing stateful, concurrent networks

Jana Wagemaker<sup>1</sup>, Nate Foster<sup>2</sup>, Tobias Kappé<sup>3</sup>, Dexter Kozen<sup>2</sup>, Jurriaan Rot<sup>1</sup>, and Alexandra Silva<sup>2</sup>

<sup>1</sup> Radboud University, Nijmegen, The Netherlands, Jana.Wagemaker@ru.nl
<sup>2</sup> Cornell University, Ithaca, New York, USA
<sup>3</sup> ILLC, University of Amsterdam, The Netherlands

Abstract. We introduce Concurrent NetKAT (CNetKAT), an extension of NetKAT with operators for specifying and reasoning about concurrency in scenarios where multiple packets interact through state. We provide a model of the language based on partially-ordered multisets (pomsets), which are a well-established mathematical structure for defining the denotational semantics of concurrent languages. We provide a sound and complete axiomatization of this model, and we illustrate the use of CNetKAT through examples. More generally, CNetKAT can be understood as an algebraic framework for reasoning about programs with both local state (in packets) and global state (in a global store).

Keywords: Concurrent Kleene algebra, NetKAT, completeness, concurrency

### 1 Introduction

Kleene algebra (KA) is a well-studied formalism [20,23,34,8] for analyzing and verifying imperative programs. Over the past few decades, various extensions of KA have been proposed for modeling increasingly sophisticated scenarios. For example, Kleene algebra with tests (KAT) [21] models conditional control flow while NetKAT [3,10] models behaviors in packet-switched networks.

A key limitation of NetKAT, however, is that the language is stateless and sequential. It cannot model programs composed in parallel, and it offers no way to reason algebraically about the effects induced by multiple concurrent packets. Meanwhile, the software-defined networking (SDN) paradigm has evolved to include richer functionality based on stateful processing including data aggregation and dynamic routing. In languages like P4 [4], issues of concurrency arise because the semantics depends on the order that packets are processed.

Given this context, it is natural to wonder whether we can add concurrency to NetKAT while retaining the elegance of the underlying framework. In this paper, we answer this question in the affirmative, by developing CNetKAT. However, to do this, we must overcome several challenges. A first hurdle is that networks exhibit many different forms of concurrent behavior. The most obvious source of concurrency arises when multiple packets are processed by different devices. In these situations, certain packets may cause changes in forwarding behavior by modifying global state variables on switches. However, there is also concurrency within individual devices: a high-speed switching chip often has multiple pipelines, each with multiple stages of match-action tables and stateful registers. The tables can be programmed to act concurrently on (parts of) a single packet, and the pipelines also act concurrently on multiple packets.

Another hurdle is that it is not entirely clear how to simultaneously extend KA with networking features and concurrency. Orthogonal to the development of NetKAT, the issue of adding concurrency to KA has been researched extensively, starting with concurrent Kleene algebra (CKA) [13,25,26,17]. However, the combination of concurrency from CKA and tests from KAT is not straightforward (see, e.g., [14,15,16]), which motivated the development of partially-observable concurrent Kleene algebra (POCKA) [37]. In POCKA, a single thread has only a partial view of the state. Hence, when evaluating control guards, a thread makes observations about the machine state, rather than definitive tests. This allows for fine-grained reasoning about concurrent programs with variables, conditionals, loops, and imperative statements that manipulate a shared global memory.

In this work, we use POCKA as a basis for designing a language with state and concurrent threads, which we combine with a multi-packet extension of NetKAT. The resulting language, Concurrent NetKAT (CNetKAT), models the behavior of packets in a network that communicate through a shared global state, and addresses the fundamental and non-trivial question of how to combine concurrency and the interaction between local and global state within KA.

Overall, the contributions of the paper are as follows:


The next section contains an overview of the challenges in extending NetKAT with multiple packets, global state, and concurrency, as well as a glimpse of how to use the language in a practical example.

### 2 Overview

CNetKAT models the behavior of two basic entities: the packets being routed through the network, and a global store, which may be accessed by the network as it processes the packets. These elements give rise to two kinds of basic programs. On the one hand, basic packet programs—imported from NetKAT [3]—include tests (f<sub>i</sub>=n) and modifications (f<sub>i</sub>←n) of packet fields f<sub>1</sub>, . . . , f<sub>N</sub>. Examples of fields are sw, denoting the switch of the packet in the network, and tag, denoting the type of a packet. In general, we expect packets to have fields for a collection of standard attributes; unused fields may be populated with a dummy value.

Fig. 1: Running example

On the other hand, basic state programs include observations<sup>4</sup> (v<sub>i</sub>=n), modifications (v<sub>i</sub>←n) and a copy operation (v<sub>i</sub>←v<sub>j</sub>) on state variables v<sub>1</sub>, . . . , v<sub>M</sub>. It will always be clear from context whether an action concerns a state or field variable. CNetKAT also includes a primitive program a for any set of packets a, which is useful for specifying the set of packets currently being processed.

Remark 1. We could augment the set of primitives with features such as general expressions in assignments. However, to keep things simple, we will only consider these primitives, which are already rich enough to describe non-trivial behaviors.

CNetKAT programs are composed using sequential composition (';'), iteration ('∗'), and non-deterministic choice ('+'), similar to NetKAT. In addition, CNetKAT programs may use the parallel composition operator ('∥').

The full syntax of CNetKAT is given in Figure 2. Before giving a precise account of the semantics, we will go over some simple example programs.

Example 1 (Packet forwarding). Consider the network depicted on the left in Figure 1. Similar to NetKAT, we assume packet movement and variable assignments are instantaneous. Suppose there are two packet types: ♠ and ♥. We want to write a program that transfers packets from node 1 to node 4 by sending ♠ via node 2, and ♥ via node 3. The program running in switch 1 could be

p<sub>1</sub> := sw=1 ; ((tag=♠ ; sw←2) ∥ (tag=♥ ; sw←3))

This program first filters out the packets at switch 1. Next, it launches two parallel threads, both of which receive a copy of the incoming packets. The first thread filters out packets of type ♠ and forwards them to switch 2, while the second thread filters out packets of type ♥, forwarding them to switch 3.

We can write programs p<sub>2</sub>, p<sub>3</sub> and p<sub>4</sub> for the other switches as well, and then compose all of those in parallel to obtain a program for the entire network.

Remark 2. Instant packet movement is not baked into CNetKAT, but rather a consequence of modeling packet location using the field sw. A more advanced model could use an additional field to mark a packet as being "in-flight" until it reaches the next hop. Here, we opt for the simpler model.

<sup>4</sup> Intuitively, these are tests on the state that can be understood as observing the part of the global state containing the variable, hence the terminology.

Example 2 (Global behavior). CNetKAT programs can read and write to a global store, letting earlier actions on packets affect later decisions. For instance, suppose we need ♠ packets to be forwarded only if a ♥ packet already visited switch 3. We can use a global variable v to implement this stateful behavior, writing:

$$\mathsf{sw}=1 \; ; \; \big((v=1 \; ; \; \mathsf{tag}=\spadesuit \; ; \; \mathsf{sw}\leftarrow 2) \;\parallel\; (\mathsf{tag}=\heartsuit \; ; \; \mathsf{sw}\leftarrow 3 \; ; \; v\leftarrow 1)\big)$$

We can program the other switches with p<sub>i</sub>, as shown in Figure 1.

Remark 3 (Concurrency and state). Actions involving global variables are more subtle than those that concern packet fields, due to concurrent threads accessing the global store. For instance, we can write the program v←1 ; v=2, which first sets v to 1 and then asserts that v should have value 2. This may seem inconsistent; however, there may be valid ways of executing this program if there are other threads that change the value of v from 1 to 2 between the assignment v←1 and the assertion v=2. This possibility makes defining a compositional semantics somewhat tricky, as we will discuss below.

Semantics of CNetKAT programs. A packet π is a record of fields f<sub>1</sub>, . . . , f<sub>N</sub>. We write π(sw) for the value of sw in π and π[1/sw] for the packet obtained after updating the value of sw to 1. We denote the set of packets by Pk.

The semantics of a CNetKAT program is represented as a function that takes a set of packets, potentially located in different nodes in the network, and returns a set of possible behaviors that those input packets might produce. More precisely, the semantics function has type ⟦−⟧ : 2<sup>Pk</sup> → 2<sup>Pom·2<sup>Pk</sup></sup>. Here, Pom is the set of pomsets [12,11], which can be thought of as structures that record the causal order between concurrent events (details appear in Section 3.1). An element u·b ∈ ⟦p⟧(a) means "there is an execution of p that changes the global variables according to u, and the set of output packets produced is b".<sup>5</sup>

The semantics is defined in Figure 3. For instance, a packet filter (f=n) takes a set of packets a and returns {1 · a(f=n)}, where a(f=n) contains all packets in a where f has value n and 1 is the pomset representing that the global state did not change. A modification (f←n) takes a set of input packets a and returns {1 · a(f ← n)}, where a(f ← n) = {π[n/f] : π ∈ a}. These two basic packet actions manipulate the local state of the program.
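A minimal executable sketch of these two clauses (ours, simplifying Figure 3 by fixing the pomset component to the empty pomset 1, encoded here as the empty tuple):

```python
# Sketch (ours): packets are immutable (field, value) records, and a
# program maps a packet set to a set of (pomset, packet-set) pairs.
def get(pi, f):
    return dict(pi)[f]

def update(pi, f, n):
    d = dict(pi); d[f] = n
    return tuple(sorted(d.items()))

def test(f, n):
    """[[f=n]](a) = {1 . a(f=n)}: keep the packets where field f is n."""
    return lambda a: {((), frozenset(p for p in a if get(p, f) == n))}

def assign(f, n):
    """[[f<-n]](a) = {1 . a(f<-n)}: set field f to n in every packet."""
    return lambda a: {((), frozenset(update(p, f, n) for p in a))}

a = frozenset({(("sw", 1), ("tag", 0)), (("sw", 2), ("tag", 1))})
print(test("sw", 1)(a))    # one pair: only the sw=1 packet survives
print(assign("sw", 3)(a))  # one pair: both packets moved to sw=3
```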

On the global state we have observations of the form (v=n) and modifications (v←n), (v←v′). Each gives rise to a pair in the semantics—{(v=n) · a}, {(v←n) · a}, {(v←v′) · a}—in which the input set of packets a is returned as output and the assertion or modification is recorded in the pomset.

Lastly, the primitive a ∈ 2<sup>Pk</sup> is useful for writing specifications. This program copies the set of packets a into the global pomset. We will see that this is useful for checking inclusion of certain behaviors in a program's semantics, and in the proof of completeness. Formally, the behavior of a on any input set b is {a · b}, where a is the global state pomset with one node labeled by a.

<sup>5</sup> We use the notation · to denote pairs: u · b denotes the pair (u, b).

Fig. 2: CNetKAT syntax. We highlight constructs not in NetKAT.

To construct more complicated programs, we can combine the basic elements above using operators from Kleene algebra. For instance, p+q is a program that represents a non-deterministic choice between p and q. Its semantics is obtained by taking the union of sets produced by both p and q on the input packets. We can also compose programs sequentially using p ; q, where we first apply p to the input packets and then q to all sets of packets produced by p, and we compose the corresponding global pomsets sequentially. We can iterate a program finitely many times using p<sup>∗</sup>. Lastly, we can combine programs with a parallel operator, p ∥ q, which denotes a program that, on input a, executes both p and q on a, and then combines the results: the pomsets denoting the global components are composed in parallel, and the corresponding sets of output packets joined.

Remark 4 (Concurrency and state, continued). Note that statements observing or modifying global variables are stored in the pomsets but not executed, that is, we do not actually check immediately whether v is indeed 1 but rather simply record it. This may seem like an odd choice at first: why does the semantics not also keep a record of the global store? The reason is related to Remark 3.

Consider the program q = (v=0) ; (v=1), which asserts that v has value 0, and then that it has value 1. In isolation, q does not have any valid behavior, as it sequentially executes two tests that cannot both be valid without intermediate intervention. However, the program q ∥ (v←1) does have valid behavior on some interleavings, namely the ones where the assignment v←1 is scheduled between the two tests.

#### Semantics

$$\llbracket p \rrbracket : 2^{\mathsf{Pk}} \to 2^{\mathsf{Pom}(\mathsf{St} \,\cup\, \mathsf{Act} \,\cup\, 2^{\mathsf{Pk}}) \cdot 2^{\mathsf{Pk}}}$$

$$\begin{array}{ll}
\llbracket p \rrbracket(\emptyset) \triangleq \{1 \cdot \emptyset\} & \llbracket o \rrbracket(a) \triangleq \mathsf{St}^{\ast} \odot \llbracket o \rrbracket_{O} \odot \mathsf{St}^{\ast} \times \{a\}\\[2pt]
\llbracket \mathsf{abort} \rrbracket(a) \triangleq \emptyset & \llbracket e \rrbracket(a) \triangleq \mathsf{St}^{\ast} \odot \{e\} \odot \mathsf{St}^{\ast} \times \{a\} \quad (e \in \mathsf{Act})\\[2pt]
\llbracket \mathsf{skip} \rrbracket(a) \triangleq \{1 \cdot a\} & \llbracket p+q \rrbracket(a) \triangleq \llbracket p \rrbracket(a) \cup \llbracket q \rrbracket(a)\\[2pt]
\llbracket t \rrbracket(a) \triangleq \{1 \cdot \llbracket t \rrbracket_{B}(a)\} & \llbracket p \,; q \rrbracket(a) \triangleq \{(u \cdot v) \cdot b \mid u \cdot a' \in \llbracket p \rrbracket(a),\ v \cdot b \in \llbracket q \rrbracket(a')\}\\[2pt]
\llbracket f \leftarrow n \rrbracket(a) \triangleq \{1 \cdot a(f \leftarrow n)\} & \llbracket p \parallel q \rrbracket(a) \triangleq \{(u \parallel v) \cdot (b \cup c) \mid u \cdot b \in \llbracket p \rrbracket(a),\ v \cdot c \in \llbracket q \rrbracket(a)\}\\[2pt]
\llbracket \mathsf{dup} \rrbracket(a) \triangleq \{a \cdot a\} & \llbracket b \rrbracket(a) \triangleq \{b \cdot a\} \quad (b \in 2^{\mathsf{Pk}})
\end{array}$$

#### Filtering, updates and downwards closure (a ∈ 2<sup>Pk</sup>, Z ⊆ St)

$$\begin{array}{ll}
a(f = n) = \{\pi \in a \mid \pi(f) = n\} & a(f \leftarrow n) = \{\pi[n/f] \mid \pi \in a\}\\[2pt]
\multicolumn{2}{l}{\alpha \le \beta \iff \mathsf{domain}(\beta) \subseteq \mathsf{domain}(\alpha) \;\wedge\; \forall x \in \mathsf{domain}(\beta).\ \alpha(x) = \beta(x)}\\[2pt]
Z_{\le} = \{\alpha \mid \exists \beta \in Z \text{ s.t. } \alpha \le \beta\} & \mathcal{P}_{\le}(\mathsf{St}) = \{Z \mid Z \subseteq \mathsf{St} \wedge Z = Z_{\le}\}
\end{array}$$

Fig. 3: CNetKAT semantics. Pairs u·b in ⟦p⟧(a) indicate that the program p takes input a and the global state change induced by p is encoded in u and constrains the final packet set b. We overload · for sequential composition of pomsets and pairs, while ⊙ is the usual lifting from pomsets to languages.

It stands to reason that a compositional semantics of such programs should include traces with such local inconsistencies, as they may be explained by actions taken by other programs running in parallel [37]. For CNetKAT, this is accomplished by placing the observations and modifications in the pomset.

This leaves us with the question of how to obtain the semantics of a program in isolation. We take a page from POCKA [37], which uses the set of guarded pomsets to filter out the pomsets sensible in isolation; details appear in §5.

One final modification is needed to obtain the CNetKAT semantics from ⟦−⟧. The idea is to allow interleaving between parallel threads [13]. This is accomplished by adding to the semantics all pomsets in which events are "more ordered" than the ones already present in ⟦−⟧. We denote this closed semantics by ⟦−⟧↓; a precise definition is given in §3.

Recording local behavior To apply CNetKAT to various verification tasks, we sometimes need to take snapshots of the local state at different points. For example, if we want to argue that ♥ packets arrived at switch 3 before ♠ packets arrived at switch 2, we need more than the information about inputs and outputs that have occurred so far. We therefore have to extend the language with an operator comparable to dup in NetKAT. On input a, the semantics of the dup operator is the set {a · a}, where the first component is a single node pomset labeled with set of packets a. <sup>6</sup> By recording packets inside the pomset, information about changes to packets also contains their relation to changes to global variables during the execution. Hence, using dup, we can infer causality relations between local and global state changes.

The programs p<sub>1</sub>, p<sub>2</sub>, p<sub>3</sub> and p<sub>4</sub> used in our running example (see Figure 1) can be instrumented with a dup on every entry to and exit from a switch. This encodes extra information in the semantics that can be used for reasoning about packet-forwarding paths as well as global state changes.

$$\begin{array}{l}
p_1 \triangleq \mathsf{sw}=1 \; ; \, \mathsf{dup} \; ; \, \big((v=1 \; ; \, \mathsf{tag}=\spadesuit \; ; \, \mathsf{dup} \; ; \, \mathsf{sw}\leftarrow 2 \; ; \, \mathsf{dup})\\
\qquad\qquad\qquad\qquad\; \parallel \, (\mathsf{tag}=\heartsuit \; ; \, \mathsf{dup} \; ; \, \mathsf{sw}\leftarrow 3 \; ; \, \mathsf{dup} \; ; \, v\leftarrow 1)\big)
\end{array}$$

The overall program of the running example then becomes

$$p \triangleq v \gets 0 ; (p\_1 \parallel p\_2 \parallel p\_3 \parallel p\_4)^\*$$

where the global variable v is initialized to 0, and the programs p<sub>1</sub>, p<sub>2</sub>, p<sub>3</sub>, p<sub>4</sub> are executed in parallel, performing the actions of each individual switch. The Kleene star ensures that the packets may take multiple hops through the network, eventually reaching their final destination (switch 4).

Remark 5. If a dup occurs in parallel to other threads, then these other parallel threads can only change the exact place of the dup-recording in the pomset via possible interleavings, but not influence its content.

Remark 6. We model the collection of in-flight packets as a set, as opposed to e.g. a partially ordered set encoding their order of arrival. This is an abstraction of our framework. Not putting an order on packets simplifies the algebraic presentation and has the advantage that it enables modeling of switches that reorder packets without an additional primitive. If the order of packets is important, information about this order can be extracted from the semantics. In particular, when packets were forwarded can be deduced by inspecting the sets of packets recorded in the pomset component using dup.

Two differences between CNetKAT and NetKAT Readers familiar with NetKAT might wonder why Example 1 uses ∥ instead of + to compose the branches of p<sub>1</sub>. The reason is that in CNetKAT, ∥ is interpreted as multicast and + is interpreted as non-deterministic composition. In NetKAT, programs act on a single input packet, so these coincide. But in CNetKAT, programs act on multiple packets concurrently, so they must be distinguished.

<sup>6</sup> We overload 'a' as a set of packets, a programming primitive and a label used in pomsets, but it always denotes a set of packets in the latter two uses as well.

To illustrate the difference, consider wanting to filter the input packets so that only those where field f has value n or field g has value m remain. In NetKAT, we can use the program f=n + g=m, which can be understood in two different ways. First, we can think of it as using (angelic) non-determinism to select a test, yielding {π} if at least one test passes and ∅ if both tests fail. Alternatively, we can think of it as using multicast to copy the input to both f=n and g=m, then using the tests to perform the required filtering, and finally taking the union of the resulting sets. In NetKAT, the net effect of both interpretations is identical, so multicast and non-determinism can be identified semantically.

However, when we generalize to sets of packets, it is natural to expect that processing a set a with such a disjunction of f=n and g=m would yield the subset of a where each packet satisfies at least one of the tests. Operationally, processing a using these programs could be realized by making two copies of a, then using the tests to perform the required filtering, and taking the union of the resulting sets. This is reflected in the semantics: ⟦f=n ∥ g=m⟧(a) = {1 · (a(f=n) ∪ a(g=m))}, where we get a single pair in the output. If instead we non-deterministically choose between the tests, the result would be the subset where f=n or the subset where g=m. Indeed, we have that ⟦f=n + g=m⟧(a) = {1 · a(f=n), 1 · a(g=m)}. Hence, multicast and non-determinism can no longer be identified in the context of multiple packets. For readers familiar with NetKAT, this means that the Boolean disjunction ∨ is now identified with ∥ rather than +.
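The contrast can be replayed concretely; in the following self-contained sketch (ours, with hypothetical fields f and g), par produces a single pair containing the union of the filtered sets, while plus produces one pair per branch:

```python
# Sketch (ours): + versus || on packet sets, pomset component trivial.
def test(f, n):
    return lambda a: {((), frozenset(p for p in a if dict(p)[f] == n))}

def plus(p, q):        # non-deterministic choice: one pair per branch
    return lambda a: p(a) | q(a)

def par(p, q):         # multicast: run both copies, union the outputs
    return lambda a: {((), b | c) for (_, b) in p(a) for (_, c) in q(a)}

a = frozenset({(("f", 1), ("g", 2)), (("f", 0), ("g", 2))})
print(par(test("f", 1), test("g", 2))(a))   # one pair containing both packets
print(plus(test("f", 1), test("g", 2))(a))  # two pairs: {f=1 packet}, {both}
```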

Lastly, we highlight that CNetKAT's dup is fundamentally different from NetKAT's dup, which just records versions of the packet during execution. In CNetKAT, dup does two things: it implements the same functionality as in NetKAT, but also structures the recording of packets inside the pomset.

Proving properties with CNetKAT In §5, we analyze the behavior of the running example in detail and show how to filter out the behaviors of p that can be obtained when it is run in isolation. In this overview, we establish a simpler property: namely, that p exhibits executions where the packets were at switch 3 before they were at switch 2. We first argue this using the denotational semantics and then illustrate how we can establish the same fact with axiomatic reasoning.

Recall that a pomset accounts for events and the ordering between them. In the following examples, we will depict pomsets as graphs with nodes labeled by state actions, observations and sets of packets, and the ordering indicated by arrows. For instance, a → b means that a happened before b.

We evaluate p on input {♥, ♠}, where both packets start at switch 1. In the closed semantics ⟦p⟧↓({♥, ♠}) we find the following pomset (the · · · indicate that the pomset continues on the next line, not that nodes are omitted), in the first projection, with β a partial function from Var to Val s.t. β(v) = 1:

$$(v \leftarrow 0) \rightarrow \{\heartsuit, \spadesuit\} \rightarrow \{\heartsuit\} \rightarrow \{\heartsuit[3/\text{sw}]\} \rightarrow (v \leftarrow 1) \rightarrow \beta \rightarrow \dotsb$$

$$\cdots \rightarrow \{\spadesuit\} \rightarrow \{\spadesuit\} \rightarrow \{\spadesuit[2/\mathsf{sw}]\} \rightarrow \{\spadesuit[4/\mathsf{sw}]\} \rightarrow \cdots$$

Every node labeled with a set of packets can be understood intuitively as "at this point in the execution these packets were a subset of the total packets present in the network." We can observe in the pomset that the ♥ packet was at switch 3 before the ♠ packet reached switch 2. We also see that v←1 happens between v←0 and β. In the end, both packets are observed at switch 4.

The second projection in the semantics corresponding to this pomset is the set of output packets {♠[4/sw], ♥[4/sw]}.

In the full version of this article [38, Appendix E], we show something stronger: in all behaviors that can happen in isolation, the packet ♥[3/sw] is recorded into the global pomset before the assignment v←1, which precedes the observation that v equals 1 and the generation of the packet ♠[2/sw].

We can write an axiomatic statement that captures that the above behavior is in the closed semantics of p on input {♥, ♠}. To do this, we first need to capture the pictured global state pomset with corresponding set of output packets syntactically, for which we use an abbreviation. Namely, we can write a program that outputs, on any input, a specific packet: for a packet π, we write this program simply as π. The output of ⟦π⟧ on any input is {1 · {π}}. This extends to sets of packets: ♥ ∥ ♠ denotes a program whose semantics is {1 · {♥, ♠}} on any input. This notation pairs well with the use of the letters a ∈ 2<sup>Pk</sup> as programming syntax: if we know which set of packets we (want to) record into the global state pomset with dup, we can also directly write this set of packets in the program as a syntactic letter. For instance, the program (♥ ∥ ♠) ; dup has the same behaviors as (♥ ∥ ♠) ; {♥, ♠}: the moment we execute the dup, we know the current set of packets is {♥, ♠}, and thus writing this set of packets as a letter and recording that letter into the global state pomset will have the same result. Using these two pieces of information, we can write the program

$$\begin{array}{l}
q \triangleq \big((v\leftarrow 0) \,;\, \{\heartsuit,\spadesuit\} \,;\, \{\heartsuit\} \,;\, \{\heartsuit[3/\mathsf{sw}]\} \,;\, (v\leftarrow 1) \,;\, (v=1) \,;\, \{\spadesuit\} \,;\, \{\spadesuit[2/\mathsf{sw}]\} \,;\\
\qquad \big((\{\spadesuit[2/\mathsf{sw}]\} \,;\, \{\spadesuit[4/\mathsf{sw}]\}) \parallel (\{\heartsuit[3/\mathsf{sw}]\} \,;\, \{\heartsuit[4/\mathsf{sw}]\})\big) \,;\, \{\spadesuit[4/\mathsf{sw}], \heartsuit[4/\mathsf{sw}]\}\big) \,;\\
\qquad (\spadesuit[4/\mathsf{sw}] \parallel \heartsuit[4/\mathsf{sw}])
\end{array}$$

The first chunk of this program is the syntactic encoding of the desired global state pomset, where the ♥ packet arrives at switch 3 before the ♠ packet arrives at switch 2, and the final parallel of packets represents the set of output packets. We can prove using the axioms of CNetKAT that

$$(\heartsuit \parallel \spadesuit) ; q \le (\heartsuit \parallel \spadesuit) ; p \tag{2}$$

(2) states that the behavior of q on input {♥, ♠} is included in the behavior of p on the same input. In the behavior of q, it is clear that the ♥ packets are observed at switch 3 before the ♠ packets appear at switch 2.

Remark 7 (Generalized alphabet). Here we see the use of sets of packets as letters in the program syntax. Program q is much closer to the behavior we try to capture, and therefore easier to analyze, than a program containing dup.

To check the validity of equivalences such as (2), we axiomatize CNetKAT and prove it sound and complete. The axioms include the axioms of KA, extended with additional axioms for operations that manipulate packets and the global state. The full axiomatization appears in Section 3.4. For instance, drop ; q ≡ drop states that no outputs are produced in the absence of inputs. The program drop drops the set of inputs and returns {1 · ∅}. Any program q after drop outputs {1 · ∅}, because q is not executed when the input is empty. In contrast, q ; drop ≡ drop does not hold since q might have changed the global state.

In addition to drop, CNetKAT has a program abort, which acts as a unit for non-deterministic choice (+). To illustrate the difference between abort and drop, consider (f=n) ; (f=m) and (v=n) ∧ (v=m), where m ≠ n. The first program filters using f=n and then filters using f=m. This yields {1 · ∅}, since a packet cannot have different values for f. Hence, we can derive (f=n) ; (f=m) ≡ drop. The second program asserts that the global state variable v has value n and value m, which is inconsistent; it would require variable v to have two different values at the same time. Hence, from the axioms we can derive that (v=n) ∧ (v=m) ≡ ⊥ ≡ abort.

We prove in §4 that the axiomatization presented in Section 3.4 is not only sound but also complete—i.e., all programs with the same semantics can be proved equivalent using the axioms. The rest of the paper is devoted to presenting the CNetKAT syntax and semantics formally (§3), and establishing conservativity results over NetKAT and POCKA. Lastly, we present a case study (§5).

### 3 Concurrent NetKAT

This section defines the syntax and semantics of CNetKAT formally.

### 3.1 Pomsets and pomset languages

For a poset (X, ≤) and a set S ⊆ X, define the downwards-closure of S by S<sub>≤</sub> ::= {x | ∃y ∈ S s.t. x ≤ y} and P<sub>≤</sub>(X) ::= {Y ⊆ X | Y = Y<sub>≤</sub>}. It is well-known that P<sub>≤</sub>(X) carries the structure of a bounded distributive lattice, with intersection as meet, union as join, X as top and ∅ as bottom. Further, if (X, ≤) is finite, the lattice is itself finite and thus carries a (necessarily unique) pseudocomplement defined by Y̅ ::= ⋃{Z ∈ P<sub>≤</sub>(X) | Y ∩ Z = ∅}. We provide a concrete lattice with a pseudocomplement below.
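A small sketch (ours) of these constructions for a finite poset, with the order given as a predicate; the pseudocomplement of a downward-closed Y collects exactly the elements whose down-set avoids Y:

```python
# Sketch (ours): downward closure and pseudocomplement on a finite poset.
def down(S, X, le):
    """S_<= : every x below some y in S."""
    return frozenset(x for x in X if any(le(x, y) for y in S))

def pseudocomplement(Y, X, le):
    """Union of all downward-closed Z disjoint from Y; concretely, the
    x whose down-set avoids Y (this set is downward closed and largest)."""
    return frozenset(x for x in X if not (down(frozenset({x}), X, le) & Y))

# Example: X = {a, b, c} with a <= c and b <= c (plus reflexivity).
X = frozenset({"a", "b", "c"})
order = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "c"), ("b", "c")}
le = lambda x, y: (x, y) in order
Y = down(frozenset({"a"}), X, le)      # frozenset({'a'})
print(pseudocomplement(Y, X, le))      # frozenset({'b'})
```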

Pomsets are used to capture the different evolutions of the state as it is accessed concurrently by different threads. Pomsets are labeled posets (up to isomorphism), used as a generalization of words [11,12]. A labeled poset over a finite alphabet Σ is a triple u = ⟨S<sub>u</sub>, ≤<sub>u</sub>, λ<sub>u</sub>⟩, where (S<sub>u</sub>, ≤<sub>u</sub>) is a partially ordered set and λ<sub>u</sub> : S<sub>u</sub> → Σ is the labeling function. For u, v labeled posets, we say u is isomorphic to v, u ≅ v, if there exists a bijection h : S<sub>u</sub> → S<sub>v</sub> that preserves labels—λ<sub>v</sub> ∘ h = λ<sub>u</sub>—and preserves and reflects ordering—s ≤<sub>u</sub> s′ if and only if h(s) ≤<sub>v</sub> h(s′). A pomset over Σ is an isomorphism class of labeled posets over Σ, i.e., the class [v] = {u | u ≅ v} for some labeled poset v. Because pomsets are label-preserving isomorphism classes, the nature of the carrier is not relevant, only its cardinality and order. The triple u = ⟨S<sub>u</sub>, ≤<sub>u</sub>, λ<sub>u</sub>⟩ is a representation of the pomset. However, often we abuse terminology and call u the pomset.

We write Pom(Σ) for the set of pomsets over Σ, and 1 for the empty pomset. When a ∈ Σ, we write a for the pomset represented by the labeled poset with a single node labeled by a. Pomsets can be composed sequentially and in parallel.

The parallel composition of two pomsets is obtained by taking the disjoint union of the carriers, while keeping the ordering relations within each component. Formally, u ∥ v = ⟨S<sub>u∥v</sub>, ≤<sub>u∥v</sub>, λ<sub>u∥v</sub>⟩, with S<sub>u∥v</sub> = S<sub>u</sub> + S<sub>v</sub>, ≤<sub>u∥v</sub> = ≤<sub>u</sub> ∪ ≤<sub>v</sub>, and λ<sub>u∥v</sub>(x) = λ<sub>u</sub>(x) for x ∈ S<sub>u</sub>, and λ<sub>u∥v</sub>(x) = λ<sub>v</sub>(x) for x ∈ S<sub>v</sub>. Two pomsets are composed sequentially by taking the disjoint union of the carriers and ordering all elements of the first before all elements of the second, keeping the ordering relations within each component. Formally, u · v = ⟨S<sub>u·v</sub>, ≤<sub>u·v</sub>, λ<sub>u·v</sub>⟩, with S<sub>u·v</sub> = S<sub>u</sub> + S<sub>v</sub>, ≤<sub>u·v</sub> = ≤<sub>u</sub> ∪ ≤<sub>v</sub> ∪ (S<sub>u</sub> × S<sub>v</sub>) and λ<sub>u·v</sub> = λ<sub>u∥v</sub>.
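These two compositions can be transcribed directly; in the following sketch (ours), a labeled poset is a triple of a carrier set, an order relation given as a set of pairs, and a labeling dictionary, with disjoint union realized by tagging:

```python
# Sketch (ours): parallel and sequential composition of labeled posets.
def _tag(u, side):
    S, le, lab = u
    return ({(side, s) for s in S},
            {((side, a), (side, b)) for (a, b) in le},
            {(side, s): l for s, l in lab.items()})

def par(u, v):
    Su, leu, labu = _tag(u, 0)
    Sv, lev, labv = _tag(v, 1)
    return (Su | Sv, leu | lev, {**labu, **labv})

def seq(u, v):
    Su, leu, labu = _tag(u, 0)
    Sv, lev, labv = _tag(v, 1)
    extra = {(a, b) for a in Su for b in Sv}   # all of u before all of v
    return (Su | Sv, leu | lev | extra, {**labu, **labv})

def atom(a):
    """The single-node pomset labeled a."""
    return ({0}, {(0, 0)}, {0: a})

print(seq(atom("x"), par(atom("y"), atom("z"))))
```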

Gischer introduced a notion of ordering on pomsets [11]: u ⊑ v means that u, v have the same events and labels, but u is "more sequential" than v in the sense that more events are ordered. Formally, u ⊑ v if there exists a label- and order-preserving bijection h : S<sub>v</sub> → S<sub>u</sub>.
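Since checking u ⊑ v just asks for such a bijection, a brute-force search (ours; adequate only for small pomsets) over the triples used in the sketch above reads:

```python
# Sketch (ours): brute-force check of Gischer's subsumption u ⊑ v.
from itertools import permutations

def subsumed(u, v):
    Su, leu, labu = u
    Sv, lev, labv = v
    if len(Su) != len(Sv):
        return False
    sv = list(Sv)
    for perm in permutations(Su):
        h = dict(zip(sv, perm))          # candidate bijection S_v -> S_u
        if all(labu[h[s]] == labv[s] for s in sv) and \
           all((h[a], h[b]) in leu for (a, b) in lev):
            return True
    return False

# With atom/par/seq as above: a·b is more sequential than a||b, so
# subsumed(seq(atom("a"), atom("b")), par(atom("a"), atom("b")))  # True
```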

Pomset languages are simply sets of pomsets. The operations on pomsets lift pointwise to pomset languages, see Figure 3. The semantics of concurrent threads requires ensuring a closure property. In particular, we will close pomset languages under the subsumption order of Gischer. Additionally, for pomsets that contain nodes labeled by observations, we make use of a contraction order: u ≼ v, capturing that u results from v by eliminating consecutive observations that can be collapsed into one. As an example, consider

$$a \rightarrow \alpha \rightarrow b \rightarrow c \qquad\qquad a \rightarrow \alpha \rightarrow b \rightarrow \alpha \rightarrow c$$

Denote these pomsets by u and v respectively, and let α ∈ St. Then u ≼ v. A formal definition can be found in the full version of this article [38, Appendix A].

Definition 1 (Closure). Let L be a pomset language.

$$L{\downarrow}^{\mathsf{exch}} = \{u \mid \exists v \in L \text{ s.t. } u \sqsubseteq v\} \qquad\qquad L{\downarrow}^{\mathsf{contr}} = \{u \mid \exists v \in L \text{ s.t. } u \preceq v\}$$

We define L↓<sup>contr∪exch</sup> as the smallest language containing L and satisfying that if v ∈ L↓<sup>contr∪exch</sup> and u ≼ v or u ⊑ v, then u ∈ L↓<sup>contr∪exch</sup>.

Closure under ⊑ is called exch because it ensures soundness of the exchange law, an axiom introduced in [13] to capture the possibility of interleaving. Closure under contraction is motivated algebraically; it ensures soundness of one of the axioms necessary when adding a test algebra (a PCDL or a BA) to a KA [16].

### 3.2 CNetKAT: syntax and semantics

CNetKAT expressions denote (possibly concurrent) packet processing programs that have access to a global state. Syntactically, CNetKAT is a language built from alphabets of tests and actions, each of which is divided in two categories. For packet tests, we firstly inherit NetKAT's packet predicates, which are elements of a Boolean algebra generated by an alphabet of basic tests on packet fields. Packet predicates t, u include constants drop and pass, denoting false and true, basic tests f=n, negation ¬t, and disjunction t ∨<sub>B</sub> u and conjunction t ∧<sub>B</sub> u operations.

Additionally, we have state observations, which do not have the structure of a Boolean algebra but instead form a pseudocomplemented distributive lattice. Intuitively, the functions denoting the state are partial. State observations o, o′ include constants ⊥ and ⊤, basic tests v=n, pseudocomplement o̅, intersection o ∧ o′ and union o ∨ o′. The other constructs were introduced in §2 (see Figure 2).

The semantics of a program is a function ⟦·⟧ : 2<sup>Pk</sup> → 2<sup>Pom(St ∪ Act ∪ 2<sup>Pk</sup>)·2<sup>Pk</sup></sup> that takes a set of packets a and produces a (possibly empty) set of pairs u·b consisting of a pomset u, recording the global state behavior and the storage of local packets whenever dup is used, and a set of packets b. On an empty input set, every program produces {1 · ∅}, modeling that nothing can happen without packets. Producing the empty set when the input is non-empty models a program that has aborted, whereas producing a set {1 · ∅} models dropping all the packets without any change to the state. Most of the semantics was already explained in §2; in the following we elaborate on some behaviors and illustrate subtleties concerning the units. See Figure 3 for an overview of the full denotational semantics of CNetKAT.

On a non-empty input a, a packet filter t removes packets in a that do not satisfy predicate t and does not touch the state—this is captured by the set {1 · ⟦t⟧<sub>B</sub>(a)}, where ⟦t⟧<sub>B</sub>(a) is interpreted as an element of the Boolean algebra (2<sup>a</sup>, ∪, ∩, ∅, a, \) defined by the poset (2<sup>a</sup>, ⊆), and ⟦t⟧<sub>B</sub>(a) is defined as the homomorphic extension of ⟦f=n⟧<sub>B</sub>(a) = {π ∈ a | π(f) = n}.

A state observation denotes a function that returns a set with elements u · a when applied to a set a. In case the original input set a is empty, nothing happens and the output of ⟦o⟧(a) is simply {1 · ∅}. When a is not empty, the semantics of o makes use of an observation algebra developed in [14,37]. More formally, we take the pseudocomplemented bounded distributive lattice (P<sub>≤</sub>(St), ∪, ∩, St, ∅, ·̅) generated by the poset (St, ≤) with α ≤ β if and only if domain(β) ⊆ domain(α) and ∀x ∈ domain(β). α(x) = β(x). Then, a state observation is interpreted as St<sup>∗</sup> ⊙ ⟦o⟧<sub>O</sub> ⊙ St<sup>∗</sup> × {a}, where ⟦o⟧<sub>O</sub> is an element of P<sub>≤</sub>(St) and defined as the homomorphic extension of the assignment ⟦v=n⟧<sub>O</sub> = {α ∈ St | α(v) = n}. Intuitively, in ⟦o⟧<sub>O</sub>, we find all the partial functions (elements of St) that agree with o. For instance, ⟦v=n⟧<sub>O</sub> contains all partial functions that assign n to v. This also illustrates the need for a pseudocomplement rather than a complement: if threads have only partial information about the state, an observation should be satisfied only if there is positive evidence for it. Hence, e.g. the pseudocomplement of v=n should be satisfied only if v has a value and it is not n, which is not captured by the complement from a Boolean algebra—the complement would also include partial functions that do not assign a value to v. This is incorrect, because if v has no value in a partial observation, we might learn later that the actual value of v was in fact n, and it was therefore incorrect to assert the pseudocomplement of v=n.
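To make this concrete, here is a sketch (ours, on a one-variable universe of partial states): the Boolean complement of ⟦v=1⟧<sub>O</sub> would include the wholly undefined state, while the pseudocomplement keeps only states with positive evidence:

```python
# Sketch (ours): the observation order on partial states (dicts).
def leq(alpha, beta):
    """alpha <= beta iff domain(beta) ⊆ domain(alpha) and they agree."""
    return all(x in alpha and alpha[x] == beta[x] for x in beta)

St = [{}, {"v": 0}, {"v": 1}]          # partial states over one variable

def obs(v, n):
    """[[v=n]]_O restricted to St: all partial states asserting v=n."""
    return [s for s in St if s.get(v) == n]

boolean_complement = [s for s in St if s not in obs("v", 1)]
pseudo_complement = [s for s in St            # positive evidence only:
                     if "v" in s and s["v"] != 1]
print(leq({"v": 0}, {}))    # True: {v:0} refines the empty observation
print(boolean_complement)   # [{}, {'v': 0}]  -- wrongly includes {}
print(pseudo_complement)    # [{'v': 0}]
```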

State modifications are interpreted as a set of elements u · a when applied to a set a. The pomsets u record the state modification surrounded by arbitrary state observations; in the first projection of the semantics of the assignment v←n we get a set of possible pomsets: St<sup>∗</sup> ⊙ {v←n} ⊙ St<sup>∗</sup>.

Remark 8. We surround state changes and observations with arbitrary sequences of states to include global pomsets that have alternating modifications and states in the semantics. Reasoning about behavior of programs is more practical using such alternating pomsets, because the states allow one to take stock of the configuration of the machine in between modifications. The semantics also contains non-alternating pomsets to ensure compositionality w.r.t. the parallel composition.

CNetKAT has six different syntactical units, some of which coincide semantically. There are two units for packets: drop, which drops all the packets ({1·∅}), and pass, which passes the current packets without changing the state ({1·a} on input a). Similarly, we have two units for state observations: ⊥ and ⊤. The first one indicates an inconsistent state, and therefore the whole program exhibits no behavior; its behavior is ∅. The second one indicates any state observation is acceptable, and its behavior on input a is {s · a | s ∈ St}. Lastly there are two units for programs in general: abort, the program without behavior, and skip, the program where nothing happens (on input a its semantics is {1 · a}). Hence, abort is equivalent to ⊥ and skip equivalent to pass. All units behave as {1 · ∅} when the input set is ∅, because nothing happens when there are no packets.

The CNetKAT semantics consists of pairs of global state pomsets and sets of output packets. It might be possible to encode the information of the output packets as a final node in the pomset, but keeping the set of output packets separate allows us to easily track the input-output behavior of a program in terms of packets. This brings CNetKAT closer to NetKAT and its packet processing behavior. In particular, the NetKAT packet processing axioms can only be used because we track the input-output behavior of the program separately.

To obtain the full semantics, and ensure we capture correctly the intended behavior, we need to perform a closure on the state component.

Definition 2 (Closed Semantics). Given a CNetKAT policy p, we define the semantics of p when applied to input a ∈ 2 Pk as

$$\llbracket p \rrbracket{\downarrow}(a) = \big\{\, u \cdot b \;\big|\; v \cdot b \in \llbracket p \rrbracket(a),\; u \in \{v\}{\downarrow}^{\mathsf{contr}\cup\mathsf{exch}} \,\big\}$$

Closure under exch and contr formalizes important intuitions about the semantics of concurrent threads. Closure under exch ensures that all traces resulting from interleaving threads are included, and closure under contr specifies that if two observations hold simultaneously, then it is possible to observe them in sequence. Note that the converse should not hold, as some action in a parallel thread could happen in between the two observations.
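The closure itself is a least fixpoint, and its shape can be sketched generically; the actual exch and contr rules rewrite pomsets, which we leave abstract behind the `step` parameter (a name of ours).

```haskell
import qualified Data.Set as Set

-- Least closure of a set under a one-step rewrite function: the shape of
-- the ↓^{exch∪contr} operation. Terminates whenever the closure is finite.
closure :: Ord a => (a -> [a]) -> Set.Set a -> Set.Set a
closure step s =
  let next = Set.union s (Set.fromList (concatMap step (Set.toList s)))
  in if next == s then s else closure step next
```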

We distinguish state, packet and deterministic packet programs as follows.

Definition 3 (State and deterministic packet programs). Let T_packet denote the set of packet programs, generated by the following grammar:

$$p, q ::= t \in \mathcal{B} \cup \{ f \gets n \mid f \in \mathsf{Fld}, n \in \mathsf{Val} \} \;\mid\; p + q \;\mid\; p \,;\, q \;\mid\; p \parallel q \;\mid\; p^{*}$$

Let T_state(Σ) denote state programs over alphabet Σ:

$$s, v ::= \mathsf{abort} \;\mid\; \mathsf{skip} \;\mid\; u \in \Sigma \;\mid\; s + v \;\mid\; s \,;\, v \;\mid\; s \parallel v \;\mid\; s^{*}$$

Let T_det-pack denote deterministic packet programs:⁷

$$x, y ::= t \in \mathcal{B} \cup \{ f \gets n \mid f \in \mathsf{Fld}, n \in \mathsf{Val} \} \;\mid\; x \,;\, y \;\mid\; x \parallel y$$

In this paper we mostly use state programs over the alphabet O ∪ Act ∪ 2^Pk ∪ {dup}. Whenever we intend this alphabet, we simply write T_state.

We prove the following lemmas regarding the CNetKAT semantics.

Lemma 1 (State and packet program semantics). Let p ∈ T_packet, s ∈ T_state and a ∈ 2^Pk. For all w ∈ ⟦p⟧(a), w is of the form 1 · b for some b ∈ 2^Pk. For all w ∈ ⟦s⟧(a), w is of the form v · a for v a pomset over St ∪ Act ∪ 2^Pk.

For non-empty sets of packets a and a′, the global behavior of a state program without dup is identical on both inputs. Let 2^Pk_ne denote 2^Pk \ {∅}.

Lemma 2. Let s ∈ T_state(O ∪ Act ∪ 2^Pk). For all a, a′ ∈ 2^Pk_ne we have {u | u · b ∈ ⟦s⟧(a)} = {u | u · b ∈ ⟦s⟧(a′)}.

We characterize ⟦−⟧_B in terms of its behavior on subsets of the input set.

Lemma 3. Let t ∈ B and a, b ⊆ Pk. Then ⟦t⟧_B(a ∪ b) = ⟦t⟧_B(a) ∪ ⟦t⟧_B(b).

Lastly, we have a lemma characterising the semantics of a deterministic packet program in terms of its behavior on subsets of the input.

Lemma 4. Let x ∈ T_det-pack and a, b ⊆ Pk. Then ⟦x⟧(a ∪ b) = {1 · (c ∪ d) | ⟦x⟧(a) = {1 · c}, ⟦x⟧(b) = {1 · d}}.

#### 3.3 Is CNetKAT conservative over NetKAT and POCKA?

CNetKAT combines NetKAT and POCKA, so it is natural to ask whether it is a conservative extension of either language. It turns out that the answer is positive for POCKA, and for a fragment of NetKAT. We start by recalling the semantics of NetKAT [3]. Note that NetKAT expressions are exactly the packet programs without ∥.

⁷ Equivalently, we can define T_det-pack by adding to the signature of our algebra a function H that counts the number of ∗'s and +'s a term contains; a packet program p is then an element of T_det-pack if and only if p ∈ T_packet and H(p) = 0.


Definition 4 (NetKAT semantics). Let π ∈ Pk, t ∈ B, and let p, q be NetKAT terms.

$$\begin{array}{ll}
[\![t]\!]_{\mathsf{NK}}(\pi) = [\![t]\!]_{\mathcal{B}}(\{\pi\}) & [\![\mathsf{pass}]\!]_{\mathsf{NK}}(\pi) = \{\pi\} \qquad [\![\mathsf{drop}]\!]_{\mathsf{NK}}(\pi) = \{\} \\[2pt]
[\![f \gets n]\!]_{\mathsf{NK}}(\pi) = \{\pi[n/f]\} & [\![p \,;\, q]\!]_{\mathsf{NK}}(\pi) = \bigcup_{\pi' \in [\![p]\!]_{\mathsf{NK}}(\pi)} [\![q]\!]_{\mathsf{NK}}(\pi') \\[2pt]
[\![p^{*}]\!]_{\mathsf{NK}}(\pi) = \bigcup_{n} [\![p^{n}]\!]_{\mathsf{NK}}(\pi) & [\![p + q]\!]_{\mathsf{NK}}(\pi) = [\![p]\!]_{\mathsf{NK}}(\pi) \cup [\![q]\!]_{\mathsf{NK}}(\pi)
\end{array}$$
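Definition 4 translates almost directly into code. The following sketch assumes packets are finite maps from field names to values; `Policy` and `evalNK` are our names, and the star case computes the union ⋃_n ⟦p^n⟧_NK as a fixpoint.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Packet = Map.Map String Int

data Policy = Test String Int | Assign String Int
            | Seq Policy Policy | Plus Policy Policy | Star Policy
            | Pass | Drop

evalNK :: Policy -> Packet -> Set.Set Packet
evalNK (Test f n)   p = if Map.lookup f p == Just n then Set.singleton p else Set.empty
evalNK (Assign f n) p = Set.singleton (Map.insert f n p)            -- p[n/f]
evalNK Pass         p = Set.singleton p
evalNK Drop         _ = Set.empty
evalNK (Seq p q)    x = Set.unions [evalNK q x' | x' <- Set.toList (evalNK p x)]
evalNK (Plus p q)   x = evalNK p x `Set.union` evalNK q x
evalNK (Star p)     x = go (Set.singleton x)
  where
    -- accumulate p^0, p^1, ... until no new packets appear; this
    -- terminates because a fixed program mentions finitely many values
    go acc = let next = Set.union acc (Set.unions (map (evalNK p) (Set.toList acc)))
             in if next == acc then acc else go next
```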

Theorem 1. Take π ∈ Pk and a NetKAT term p. Then ⟦p⟧_NK(π) = ⋃ {a′ | 1 · a′ ∈ ⟦p⟧({π})}.

We can derive a further relation between the semantics if we assume there is no use of + and ∗ (the proof uses Lemma 3).

Lemma 5. Let p be built from packet predicates, modifications (f←n), and their sequential composition. Then ⟦p⟧(a) = {1 · ⋃_{x∈a} ⟦p⟧_NK(x)}.

It is worth remarking that the equational theories of NetKAT and CNetKAT are not equivalent: there are programs that are provably equivalent in NetKAT but cannot be proved equivalent with the CNetKAT axioms, as the following example illustrates. Consider the program p + drop, for p a packet program without parallel. In NetKAT, where + is interpreted as multicast, this program is provably equivalent to p: executing p on the input packet while also dropping a copy of the input has the same outcome as just executing p. In CNetKAT, however, this is not the case. Instead, the +-operator is interpreted as non-deterministic choice, and the semantics of p + drop contains the trace 1 · ∅, representing the choice of dropping all the packets, which is not present in the semantics of p. Hence, this NetKAT axiom is unsound in CNetKAT (p + drop ≢ p); instead, the alternative axiom p ∥ drop ≡ p holds, reflecting the fact that ∥ is multicast.
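The contrast can be seen directly in the simplified encoding used in the units sketch above (reusing `Sem`, `one`, and `unitDrop`); merging pomsets by concatenation is a crude stand-in for true parallel composition.

```haskell
-- + collects the behaviors of both branches: a nondeterministic choice.
plus :: Sem ev pk -> Sem ev pk -> Sem ev pk
plus p q a = p a ++ q a

-- ∥ runs both branches on the input and merges each pair of outcomes.
par :: Eq pk => Sem ev pk -> Sem ev pk -> Sem ev pk
par p q a = [ (u ++ v, b ++ filter (`notElem` b) c) | (u, b) <- p a, (v, c) <- q a ]

-- `plus p unitDrop` has the extra trace (1, ∅) of dropping everything,
-- so p + drop ≢ p; `par p unitDrop` returns exactly p's behaviors,
-- matching the CNetKAT axiom p ∥ drop ≡ p.
```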

We now show that the CNetKAT semantics is equivalent to the POCKA semantics on state programs. In [37], POCKA terms are what we defined as state programs over the alphabet O ∪ Act, and they are interpreted in terms of pomset languages over assignments and states, encoded as partial functions, similarly to separation logic [33]. The POCKA semantics is defined in two steps: the first step results in a set containing all pomsets that can be derived directly from the terms; in the second step this set is closed under two laws, exch and contr, that account for all traces that can be built in parallel threads (including simple interleaving).

Definition 5 (POCKA semantics). Let o ∈ O, e ∈ Act, and p, q ∈ T_state(O ∪ Act).


The semantics of a POCKA expression p is ⟦p⟧_POCKA = ⦅p⦆↓^{exch∪contr}, where ⦅−⦆ denotes the first-step pomset semantics.

Theorem 2. CNetKAT is a conservative extension of POCKA: if p is a POCKA term (p ∈ T_state(O ∪ Act)), then for a ≠ ∅, ⟦p⟧↓(a) = {u · a | u ∈ ⟦p⟧_POCKA}.

#### 3.4 Axiomatization

We introduce notation to describe packets and sets of packets axiomatically. Let f_1, ..., f_k be a list of all fields of a packet in some fixed order. Then for each tuple n = n_1, ..., n_k we obtain expressions f_1=n_1 ; · · · ; f_k=n_k and f_1←n_1 ; · · · ; f_k←n_k, which, similarly to NetKAT, we call complete tests and complete assignments. Complete tests are also referred to as atoms, because they are the atoms of the Boolean algebra generated by the tests. We denote the set of atoms by At, complete tests by α and complete assignments by π. There is a one-to-one correspondence between complete tests and complete assignments via the values n. For α ∈ At we denote the corresponding complete assignment by π_α, and for a complete assignment π we denote the corresponding atom by α_π.

There is also a link between sets of packets and terms of the form ∥_{i∈I} π_i. For each set of packets a, we take the set {π_i | i ∈ I} of complete assignments such that each π_i corresponds to a packet of a, and combine them in parallel. Formally, for a set of packets a there exists an expression ∥_{i∈I} π_i, which we denote by Π_a, such that on any input b ≠ ∅, ⟦Π_a⟧(b) = {1 · a}. Similarly, the semantics of an expression of the form ∥_{i∈I} π_i on any input is always {1 · a} for some a ∈ 2^Pk. We use the notation Π_a as a syntactic representation of the set of packets a.
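For instance, in a hypothetical two-field packet format (the fields sw and pt are ours, chosen for illustration), a two-packet set corresponds to a parallel of two complete assignments:

$$a = \{[\mathsf{sw}=1,\mathsf{pt}=2],\ [\mathsf{sw}=3,\mathsf{pt}=1]\} \quad\leadsto\quad \Pi_a = (\mathsf{sw}{\gets}1 \,;\, \mathsf{pt}{\gets}2) \parallel (\mathsf{sw}{\gets}3 \,;\, \mathsf{pt}{\gets}1)$$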

CNetKAT has the structure of a Kleene algebra on state programs, enriched with additional axioms. Tests form a Boolean algebra and state observations a pseudocomplemented distributive lattice (PCDL). The test and observation structures are subject to interaction constraints. The packet processing behavior is captured by the packet axioms, which contain axioms for individual packets and sets of packets. The axioms governing the parallel operator are partially familiar from earlier work on BKA [13,25]. There is also the exchange law familiar from CKA. Lastly, we have axioms for the interactions between state programs and packet programs. The full set of axioms is described in Figure 4. We write ≡ for the smallest congruence on Prg generated by the axioms in Figure 4.

Remark 9 (When is Π_a equal to drop?). Π_a ≡ drop if and only if a is empty: Π_∅ = ∥_{i∈∅} π_i ≡ ⋁∅ ≡ drop. For all other a, we have Π_a ≢ drop.

There are a few subtleties to notice in Figure 4. First, we point out the interaction between drop and abort. When no packets are present, not even abort can be executed. Hence, if we drop all packets and then abort, the abort does not happen: drop ; abort ≡ drop. On the other hand, if we first abort and then drop all the packets, the behavior is equal to just aborting: abort ; drop ≡ abort.
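Glossing over the pomset component of sequential composition, the unit behaviors from earlier in this section make this asymmetry concrete: for a ≠ ∅,

$$[\![\mathsf{drop}\,;\,\mathsf{abort}]\!](a) = [\![\mathsf{abort}]\!](\emptyset) = \{1\cdot\emptyset\} = [\![\mathsf{drop}]\!](a), \qquad [\![\mathsf{abort}\,;\,\mathsf{drop}]\!](a) = \emptyset = [\![\mathsf{abort}]\!](a).$$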

In the axioms of the parallel operator, the axiom s ∥ skip ≡ s from BKA is missing; it only holds when s is a state program, and can be found among the local-state vs. global-state axioms. In addition to the familiar BKA axioms, there is the axiom drop ∥ p ≡ p, in contrast with abort ∥ p ≡ abort.

The local-state vs. global-state axioms capture the interactions between the global pomset and the output packets. The first one, Π_a ; dup ≡ Π_a ; a, captures the intuition that if we know the input is a (due to Π_a, which, as a parallel of complete assignments, essentially overwrites any non-empty input set to a), then we know the dup is recording an "a". The second axiom, Π_a ; w ≡ w ; Π_a, states


Fig. 4: Axioms of CNetKAT. The left column contains the KA axioms, the packet axioms, the axioms for the interaction between the local and global state, and an extensionality axiom. The right column axiomatizes ∥, the algebra of packet tests (a Boolean algebra), and the algebra of partial state observations (a PCDL). The interface axioms connect both lattice operators to the Kleene algebra ones. We write e ≦ f as a shorthand for e + f ≡ f.

that for a dup-free state program w, we can flip the order between changing the set of output packets and performing the state changes in w, as long as Π_a is not the parallel representing the empty set. The latter condition is crucial: if a = ∅, then Π_a ≡ drop, and drop ; w ≡ drop (the global state changes in w are not executed if we have no packets).

The axiom drop ; p ≡ drop for any program p captures the intuition that once there are no packets, nothing happens anymore. The other way around, y ; drop ≡ drop is only true for y a packet program; if y were a state program, the global state changes would be executed whenever we start with a non-empty set of input packets, making the behavior of y ; drop not equivalent to drop.

Lastly, extensionality says that if two programs are equivalent on all inputs a ∈ 2^Pk, then the programs are equivalent. It is not clear whether this axiom is derivable from the others; we hope to settle this question in the future.

### 4 Soundness and Completeness

In this section we prove soundness and completeness of the CNetKAT semantics w.r.t. the axiomatization from Figure 4. For soundness, we prove that if programs p and q are provably equivalent using the axioms, they have the same semantics:

Theorem 3 (Soundness). For all p, q ∈ Prg, if p ≡ q, then ⟦p⟧↓ = ⟦q⟧↓.

Conversely, we will prove that if p and q have the same semantics on all inputs a, then p ≡ q. We structure the completeness proof in four parts:


**Step 1: Normal form.** We prove that for every a ∈ 2^Pk, we can write any program p as Π_a followed by a sum of terms, each consisting of a state program followed by a parallel of complete assignments. This is the most difficult step in the completeness proof.

We derive a few equivalences from Figure 4 regarding complete tests and assignments that make the proof of the normal form easier. We refer to these as the reduced axioms. For α and β complete tests such that α ≠ β, π and π′ complete assignments, and a ∈ 2^Pk_ne, b ∈ 2^Pk, we can derive:

$$\pi \equiv \pi \,;\, \alpha_{\pi} \qquad \alpha \equiv \alpha \,;\, \pi_{\alpha} \qquad \pi \,;\, \pi' \equiv \pi' \qquad \alpha \,;\, \beta \equiv \mathsf{drop} \qquad \Pi_{a} \,;\, \Pi_{b} \equiv \Pi_{b}$$

All of these equivalences are easy consequences of the packet axioms, the packet predicate axioms, the axiom t ∧_B t′ ≡ t ; t′, and the fact that for all packet programs p we have p ; drop ≡ drop ≡ drop ; p [3]. The last reduced axiom is derived in the full version of this article [38, Lemma 14].

Theorem 4 (Normal form). Let p ∈ Prg and a ∈ 2^Pk. There exists a finite set J, and elements u_j ∈ T_state(O ∪ Act ∪ 2^Pk) and b_j ∈ 2^Pk for each j ∈ J, such that

$$\varPi\_a \; ; \, p \equiv \varPi\_a \; ; \sum\_{j \in J} \left( u\_j \; ; \, \varPi\_{b\_j} \right).$$

Sketch. The proof proceeds by induction on the structure of p. For instance, for an assignment f←n, where we take Π_a = ∥_{k∈K} π_k for some non-empty finite index set K and complete assignments π_k, we derive

$$\begin{array}{rll}
\Pi_a \,;\, f{\gets}n & \equiv\ \Pi_a \,;\, \Pi_a \,;\, f{\gets}n & (\Pi_a \,;\, \Pi_b \equiv \Pi_b)\\
& =\ \Pi_a \,;\, \big(\textstyle\parallel_{k\in K} \pi_k\big) \,;\, f{\gets}n & \\
& \equiv\ \Pi_a \,;\, \textstyle\parallel_{k\in K} \big(\pi_k \,;\, f{\gets}n\big) & ((p \parallel q) \,;\, x \equiv (p \,;\, x) \parallel (q \,;\, x))\\
& \equiv\ \Pi_a \,;\, \mathsf{skip} \,;\, \textstyle\parallel_{k\in K} \pi'_k & (p \,;\, \mathsf{skip} \equiv p)
\end{array}$$

where π′_k is π_k with the assignment for f replaced by f←n. If K = ∅, then Π_a ≡ drop and the equivalence above follows immediately. The most difficult case is the star; we use an argument that relies on the fact that matrices over a Kleene algebra form a Kleene algebra [20]. A proof can be found in the full version of this article [38, Appendix D].

**Step 2: Completeness for Π_a-shaped programs.** As mentioned, Π_a-shaped programs are syntactic representations of packet sets. We prove that if two such programs result in the same set of packets on any non-empty input, they are provably equivalent, using the fact that Π_a describes a unique set of packets.

Lemma 6. Let a ∈ 2^Pk_ne and b, c ∈ 2^Pk. If ⟦Π_b⟧↓(a) = ⟦Π_c⟧↓(a), then Π_b ≡ Π_c.

**Step 3: Completeness of sums in the normal form.** We first prove completeness for state programs, using completeness of POCKA. Some caution is needed here: POCKA terms are state terms over the alphabet O ∪ Act, whereas the state terms relevant here also include elements a ∈ 2^Pk.

Lemma 7. Let s, v ∈ T_state(O ∪ Act ∪ 2^Pk) and a ∈ 2^Pk_ne. If ⟦s⟧↓(a) = ⟦v⟧↓(a), then s ≡ v.

Next we prove completeness for expressions of the form s ; Π_a, and then extend this to arbitrary finite sums of such programs:

Lemma 8. Let b, c ∈ 2^Pk, u, v state programs, and a ∈ 2^Pk_ne. Then: ⟦u ; Π_b⟧↓(a) = ⟦v ; Π_c⟧↓(a) implies u ; Π_b ≡ v ; Π_c.

Lemma 9. Let J, K be finite sets, u_j, v_k state programs, and b_j, c_k ∈ 2^Pk for each j ∈ J, k ∈ K. If ⟦Σ_{j∈J}(u_j ; Π_{b_j})⟧↓(a) = ⟦Σ_{k∈K}(v_k ; Π_{c_k})⟧↓(a) for some a ∈ 2^Pk_ne, then Σ_{j∈J}(u_j ; Π_{b_j}) ≡ Σ_{k∈K}(v_k ; Π_{c_k}).

**Step 4: Completeness.** The last lemma before proving completeness relates the semantics of p on input a to the semantics of Π_a ; p on any non-empty input.

Lemma 10. Let b ∈ 2^Pk_ne and a ∈ 2^Pk. For all p ∈ Prg, ⟦Π_a ; p⟧↓(b) = ⟦p⟧↓(a).

Theorem 5 (Completeness). Let p, q ∈ Prg. If for all a ∈ 2^Pk we have ⟦p⟧↓(a) = ⟦q⟧↓(a), then p ≡ q.

Proof. We first show that Π_a ; p ≡ Π_a ; q for all a ∈ 2^Pk. In case a = ∅, Π_a must be the empty parallel; hence Π_a ; p ≡ drop ≡ Π_a ; q. In the rest of the proof we assume a ≠ ∅. Via Lemma 10, we obtain ⟦p⟧↓(a) = ⟦Π_a ; p⟧↓(a) = ⟦Π_a ; q⟧↓(a) = ⟦q⟧↓(a). By Theorem 4 we obtain a normal form Π_a ; p ≡ Π_a ; Σ_{j∈J}(u_j ; Π_{b_j}), and similarly Π_a ; q ≡ Π_a ; Σ_{k∈K}(v_k ; Π_{c_k}). Via soundness we derive ⟦Π_a ; Σ_{j∈J}(u_j ; Π_{b_j})⟧↓(a) = ⟦Π_a ; Σ_{k∈K}(v_k ; Π_{c_k})⟧↓(a), and via Lemma 10 that ⟦Σ_{j∈J}(u_j ; Π_{b_j})⟧↓(a) = ⟦Σ_{k∈K}(v_k ; Π_{c_k})⟧↓(a). With the partial completeness result from Lemma 9, we obtain Σ_{j∈J}(u_j ; Π_{b_j}) ≡ Σ_{k∈K}(v_k ; Π_{c_k}). This leads to

$$\Pi_a \,;\, p \;\equiv\; \Pi_a \,;\, \sum_{j \in J} \big( u_j \,;\, \Pi_{b_j} \big) \;\equiv\; \Pi_a \,;\, \sum_{k \in K} \big( v_k \,;\, \Pi_{c_k} \big) \;\equiv\; \Pi_a \,;\, q$$

Hence, we have derived that Π_a ; p ≡ Π_a ; q for all a ∈ 2^Pk. With the extensionality axiom we can conclude that p ≡ q.

### 5 Examples

This section shows how we can use CNetKAT to model and analyze several concurrent programs. We start by analyzing the running example from §2, and then proceed to a more involved example that combines the behavior of a stateful firewall, a load balancer, and an in-network cache.

#### 5.1 Running Example

Consider again the running example from §2. Because we are ultimately interested in the behavior of the program when the packets have reached their final destination, switch 4, we will add a test sw=4 at the end of the program:

$$p \triangleq (v \gets 0) ; (p\_1 \parallel p\_2 \parallel p\_3 \parallel p\_4)^\* ; (\mathsf{sw} = 4)^\*$$

Recall that the CNetKAT semantics of a program contains traces that are only required to model executions where the program is composed in parallel with another program, to ensure a compositional semantics for the language. However, to analyze the behavior of a program in isolation, we want to eliminate these extra traces. To do this, we follow the same strategy used in [37], where so-called guarded pomsets were proposed. Guarded pomsets are a subclass of pomsets that captures the characteristics of behaviors of (concurrent) programs running in isolation. For example, in a guarded pomset, if one assertion, say v=0, occurs before another assertion, say v=1, there must be an assignment v←1 between the two assertions to account for the change. That is, in an isolated execution every change to variables must be explained by an action in the program.
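On linear traces, this guardedness condition can be sketched as a small check; real pomsets are partial orders and the actual definition also constrains the arrow structure, so the following (with names of our choosing) is only an approximation.

```haskell
data Event = Assert String Int | AssignVar String Int deriving (Eq, Show)

-- Every change between two assertions on the same variable must be
-- explained by an intervening assignment to the newly observed value.
guarded :: [Event] -> Bool
guarded tr =
  and [ any (justifies v n') between
      | (i, Assert v n)   <- idx
      , (j, Assert v' n') <- idx
      , i < j, v == v', n /= n'
      , let between = [ e | (k, e) <- idx, i < k, k < j ] ]
  where
    idx = zip [(0 :: Int) ..] tr
    justifies v n (AssignVar u m) = u == v && m == n
    justifies _ _ _               = False

-- guarded [Assert "v" 0, AssignVar "v" 1, Assert "v" 1] == True
-- guarded [Assert "v" 0, Assert "v" 1]                  == False  (no cause)
```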

To illustrate the difference between pomsets and guarded pomsets, consider our example. We unfold the Kleene star twice and evaluate the resulting program; we obtain a pair with output {♠[4/sw], ♥[4/sw]} and corresponding pomset,

[Pomset diagram: a run with nodes (v←0), {♥, ♠}, β, {♥}, {♠}, ♠[2/sw], (v←1), {♠[2/sw]}, {♥[3/sw]}, {♠[4/sw]}, {♥[4/sw]}, in which β is not preceded by (v←1).]

where β(v) = 1. This pomset is unguarded: β(v) = 1 occurs without a cause.

The semantics also contains a pair with output {♠[4/sw], ♥[4/sw]} and pomset

[Pomset diagram: as above, but starting γ → (v←0) → α → {♥, ♠} and containing an additional arrow from (v←1) to β.]

with α(v) = 0, β(v) = 1, and γ unrestricted. This pomset is guarded because it contains an arrow from v←1 to β, justifying the change in valuation from α to β. As we show in the full version of this article [38, Appendix E], all guarded pomsets in the semantics will have this arrow, and satisfy the desired property: ♥ packets are observed at switch 3 before ♠ packets are observed at switch 2.

Now consider the axiomatic claim we made in §2 (i.e., (2)): (♥ ∥ ♠) ; q ≦ (♥ ∥ ♠) ; p, where q is the program from (1). We can easily see that ⟦q⟧↓({♥, ♠}) ⊆ ⟦p⟧↓({♥, ♠}) holds. Hence, we can use Lemma 10 and the completeness result for CNetKAT (Theorem 5) to obtain (2).

#### 5.2 Stateful Load Balancer, Cache, and Firewall

For a more complex example, consider the network in Figure 5, which is adapted from an example in [2]. The overall goal is to (i) prevent packets from a high-priority server S_h going to low-priority hosts l_1, ..., l_k and (ii) load-balance requests to the servers in a round-robin fashion. We provide naive specifications for the cache, firewall, and load balancer programs in Figure 5. For simplicity, we assume that there is exactly one low-priority host and exactly one high-priority host, i.e., n = k = 1, and we leave the specification of the topology implicit.

Remark 10. In contrast with the previous example, the program in Figure 5 includes reads and writes of a global variable that occur on different physical devices. In principle, synchronizing variables like r would give rise to additional packets that update local copies of variables—a process that could itself be modelled in CNetKAT. We leave the implementation of a translation pass that achieves the synchronization of global variables across switches to future work.

Fig. 5: Stateful firewall between high/low priority hosts and servers.

In [2], the authors point out a problem with the example that arises because the cache has no means to enforce the security policy. One strategy for resolving this problem is to swap the placement of the firewall and the cache. Another is to distribute access control rules onto the cache as well as the firewall. However, there is also a second, more subtle issue: the load balancer uses the global variable r to decide to which server to forward requests. In the presence of multiple packets, another packet may arrive before the change to the global variable occurs, allowing two (or more!) packets to be sent to the same server.

The issue with the load balancer can be observed in the following example. Take as input packets ♠ and ♥ with ♠(src) = ♥(src) = l_1. After being processed at the cache, both packets arrive at the firewall. One of the pairs in the semantics of the firewall F is the following, with α unrestricted and β(r) = 0: (α → (r←0) → β) · {♥[loadb/dst], ♠[loadb/dst]}. After processing by the load balancer, both packets are sent to s_l simultaneously. To illustrate this event, we claim that there is a guarded pomset in the semantics of the load balancer. Observe that in the semantics of L we find the following pomset, with α and β from before (the second β is the result of the r=0 in L): α → (r←0) → β → β → {♥[s_l/dst], ♠[s_l/dst]}. Using closure under contraction, we obtain a guarded pomset (the two β-nodes are merged into one) where both packets appear at s_l at the same time.

A final issue stems from the fact that the firewall implementation is flawed as written. Specifically, it uses a global variable to determine whether a packet should be forwarded on to a high priority host. Of course, if another packet arrives before the current one has been forwarded, the value of this variable might change, resulting in both packets being forwarded to a low priority host.

The issue with the firewall can be observed as follows. Take as input two packets ♠ and ♥ with ♠(src) = s_h and ♥(src) = s_l. After processing by the load balancer, both packets end up in the firewall. One of the pairs in the semantics of the firewall is the following, with α(v) = 1 and β unrestricted: (α · (v←0) ∥ β · (v←1)) · {♥[cache/dst], ♠[cache/dst]}. After processing by the cache, both packets


Fig. 6: Leader logic from [9] and CNetKAT term, with k acceptors.

are sent to h_1 or l_1. To illustrate how the packets travel to, e.g., l_1, we find the following pomset in the semantics of C, with α, β from before and γ(v) = 0:

$$\big( \alpha \rightarrow (v \gets 0) \;\parallel\; \beta \rightarrow (v \gets 1) \big) \rightarrow \gamma \rightarrow \{\heartsuit[l_1/\mathsf{dst}],\ \spadesuit[l_1/\mathsf{dst}]\}$$

This pomset subsumes a guarded pomset. Hence, by exchange closure, we find guarded pomsets in the behavior of C where both packets end up at l_1.

Overall, these examples show that CNetKAT can model subtle interactions between packets that arise in the presence of concurrency and state. Moreover, the axiomatic semantics can be used to prove (in)equivalences between programs.

### 6 Related Work

The core of CNetKAT consists of two extensions of Kleene algebra: NetKAT [3,10], a networking extension of Kleene algebra with tests, and POCKA [37], a concurrent extension of KA. NetKAT describes how single packets move through a network, whereas CNetKAT can handle multiple packets. POCKA was introduced to describe concurrent interactions through global variables, and CNetKAT uses this algebra to enable inter-packet communication. CNetKAT captures interactions between local and global state, which none of the previous work does.

In the family of KA extensions, POCKA is closest to Concurrent Kleene algebra with Observations (CKAO) [15,16], which was proposed to integrate concurrency with conditionals such as if-statements and while-loops. Contrary to CKAO, which uses a Boolean algebra to axiomatize conditionals, POCKA uses a pseudocomplemented distributive lattice (PCDL) as the algebra for tests, which are referred to as observations to mark the difference. The idea to use a PCDL as the algebra for observations was first proposed in [14].

Our work fits within the CKA tradition, which gives a true-concurrency semantics and is thereby distinct from the bisimulation semantics typically considered in process algebras such as CSP and CCS. Another distinction is that CNetKAT uses global state rather than message passing.

Some recently published work has also extended NetKAT with constructs for modeling multi-packet behavior [7]. There, the goal is to model interactions between the control plane and the data plane during dynamic updates. Parallel composition is axiomatized with a left-merge and a communication-merge operator, and the semantics is in terms of bisimilarity instead of traces. The examples largely focus on table updates, not on the flow of packets through the network.

The current paper deviates from earlier concurrent variations on NetKAT, such as Concurrent NetCore [35] and a stateful variant of NetKAT introduced in [31]. Both have a different algebraic structure than NetKAT. Concurrent NetCore does not have a Kleene star, and provides neither a denotational semantics nor an axiomatization. Moreover, it does not handle multiple packets: the use of + in the language is multicast rather than non-determinism, and ∥ is concurrent processing of disjoint fields of the same packet. Because of these restrictions, Concurrent NetCore is less suitable for specifying inter-packet concurrency.

The approach in [31] models interactions among multiple packets, but is accompanied by semantic correctness guarantees, rather than algebraic formalizations as in CNetKAT. A recent PhD thesis [29] contains another version of stateful NetKAT, which assumes packet processing can always be serialized into a deterministic, global order. This assumption enables a simpler semantics and a decision procedure, though completeness is left as an open problem. Flow control in [29] is handled in the style of Guarded Kleene Algebra with Tests [22,36], which means that programs and specifications must be deterministic.

More broadly, there is a growing community doing research on network verification tools. Early work such as HSA [18], Anteater [30], Veriflow [19], Atomic Predicates [39], etc. focused on stateless SDN data planes, while more recent work such as p4v [27] and VMN [32] supports richer models such as P4 and stateful middleboxes. These tools typically use analyses based on symbolic simulation or they encode verification tasks into first-order formulas that can be checked using SMT solvers. To the best of our knowledge, CNetKAT is the first algebraic framework to model network-wide, multi-packet interaction with mutable state.

### 7 Discussion

We proposed CNetKAT, an algebraic framework to reason about programs with both local and global state, in the presence of parallel threads and control-flow statements. We provided a denotational semantics and a complete axiomatization. We also provided examples of how the language can be used to reason about stateful network programs and different sources of concurrency in a network.

As a result of the algebraic approach, the semantics of a program arises from the semantics of its parts. This clashes with the idea of observational equivalence when concurrency comes into play: some behaviors of a program can only be observed when executed concurrently with another program, and not in isolation. Hence it becomes necessary to include some elements in the semantics that do not immediately correspond to observable behavior. This implies that observational equivalence is not the right notion for axiomatising the semantics. However, using the greatest congruence contained in a notion of observational equivalence is interesting; this guided us in the development of our axiomatisation but it remains to be shown that our axiomatisation is indeed the greatest congruence.

CNetKAT relies on a classic approach to proving program correctness: develop a framework that can model both specifications and implementations, and show that equivalence is decidable. Past experience with NetKAT suggests that this approach is usable, although CNetKAT still lacks a procedure to check semantic equivalence, or at least membership of a given pomset. Devising an efficient procedure for this task is our immediate priority. The procedure will most likely rely on automata models such as fork automata [28] or Petri automata [6,5].

Ultimately, we would like to use CNetKAT to reason about stateful and distributed P4 programs. A target case study is provided in [9], which implemented Lamport's Paxos algorithm in the forwarding plane. To show correctness, the authors used a translation to Promela, a model-checking language, and checked that learners never decide on separate values for a single instance of consensus. This property is closely related to guarded pomsets. We would like to use CNetKAT to show correctness of the P4 implementation of the protocol directly (the translation from the P4 code is almost direct; see Figure 6 for an example).

The reader will notice that the CNetKAT expression in Figure 6 uses an action of the form f←v, where f is a field (inst) and v a global variable (instance). Adding actions of the converse form v←f is trivial since the packet logic specifies that f always has exactly one value. However, actions f←v require more care: the value of global variables can only be determined at the end since parallel threads might change it while it is being copied. To accommodate this in the semantics, we will have to allow partially defined packet fields and determine the missing field values at the end (when we check for guarded traces).

Another exciting direction for future work is the development of a library of litmus tests for networking in the spirit of [1]. Litmus tests are carefully crafted concurrent programs operating on shared memory locations that expose subtle bugs in memory models of hardware. One could imagine using the guarded pomsets semantics to discover minimal witnesses of undesired concurrent behavior.

We would also like to investigate the memory model of CNetKAT; this would give insight into the rules followed by operations on the global state. For a partial answer, we can look at POCKA. The guarded fragment of the POCKA semantics was shown to be sequentially consistent (concurrent memory accesses behave as if they are executed sequentially [24]), as it passed the store buffering litmus test [1]. The guarded fragment of the pomsets recording global variable changes is expected to pass this litmus test as well. It is worth investigating whether CNetKAT also supports other weak memory models, such as linearizability.

Acknowledgements N. Foster and T. Kappé were partially supported by DARPA grant HR001120C0107 (Pronto). T. Kappé also received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101027412 (VERLAN). D. Kozen was supported by NSF grant CCF-20008083. A. Silva was partially funded by ERC grant AutoProbe (101002697), EPSRC project CleVer (EP/S028641/1), and a Royal Society fellowship.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Author Index**

Abdulla, Parosh Aziz 317 Agarwal, Raj Aryan 317 Armstrong, Alasdair 143, 174 Atig, Mohamed Faouzi 317 Atkey, Robert 376

Barnes, Graeme 174 Batz, Kevin 57 Bauereiss, Thomas 174 Berger, Ulrich 85 Bila, Eleni Vafeiadi 234 Boulmé, Sylvain 204 Brachthäuser, Jonathan Immanuel 492 Broman, David 29

Campbell, Brian 174 Chakrabarti, Sujit Kumar 290 Choudhury, Pritam 403 Cruttwell, Geoffrey S. H. 1

D'Souza, Deepak 290 D'Souza, Meenakshi 290 Das, Ankush 431 DeYoung, Henry 431 Dongol, Brijesh 234

Eades III, Harley 403 Esswood, Lawrence 174

Fesefeldt, Ira 57 Foster, Nate 575

Gavranović, Bruno 1 Ghani, Neil 1 Godbole, Adwait 317 Grisenthwaite, Richard 143

Jansen, Marvin 57 Jongmans, Sung-Shik 520

Kappé, Tobias 575 Katoen, Joost-Pieter 57 Keßler, Florian 57 Khyzha, Artem 262 Kozen, Dexter 575 Kudlicka, Jan 29

Lahav, Ori 234, 262 Lakhani, Zeeshan 431 Lourenço, Cláudio Belo 114 Lundén, Daniel 29

Marshall, Daniel 346 Matheja, Christoph 57 Monniaux, David 204 Mordido, Andreia 431

Noll, Thomas 57

Öhman, Joey 29 Orchard, Dominic 346 Ostermann, Klaus 492

Pai, Rekha 290 Paviotti, Marco 462 Pfenning, Frank 431 Pichon-Pharabod, Jean 143 Pinto, Jorge Sousa 114 Pulte, Christopher 143

Raad, Azalea 234 Ronquist, Fredrik 29 Rot, Jurriaan 575

S., Krishna 317 Schrijvers, Tom 462 Schuster, Philipp 492 Senderov, Viktor 29 Sewell, Peter 143, 174 Sewell, Thomas 174 Silva, Alexandra 575 Simner, Ben 143 Stark, Ian 174 Suresh, Varsha P 290

Tsuiki, Hideki 85

van den Berg, Birthe 462 van den Bos, Petra 520

van Glabbeek, Rob 548 Vollmer, Michael 346

Wagemaker, Jana 575 Watson, Robert N. M. 174 Weirich, Stephanie 403 Wickerson, John 234

Wilson, Paul 1 Wood, James 376 Wu, Nicolas 462 Yang, Zhixuan 462 Zanasi, Fabio 1