**Christoph Benzmüller · Marijn J. H. Heule · Renate A. Schmidt (Eds.)**

# **Automated Reasoning**

**12th International Joint Conference, IJCAR 2024 Nancy, France, July 3–6, 2024 Proceedings, Part I**

# Lecture Notes in Computer Science

# **Lecture Notes in Artificial Intelligence 14739**

Founding Editor Jörg Siekmann

Series Editors

- Randy Goebel, *University of Alberta, Edmonton, Canada*
- Wolfgang Wahlster, *DFKI, Berlin, Germany*
- Zhi-Hua Zhou, *Nanjing University, Nanjing, China*

The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence.

The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R & D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings.

Christoph Benzmüller · Marijn J. H. Heule · Renate A. Schmidt Editors

# Automated Reasoning

12th International Joint Conference, IJCAR 2024 Nancy, France, July 3–6, 2024 Proceedings, Part I

*Editors*

Christoph Benzmüller
Otto-Friedrich-Universität Bamberg
Bamberg, Germany

Marijn J. H. Heule
Carnegie Mellon University
Pittsburgh, PA, USA

Renate A. Schmidt
University of Manchester
Manchester, UK

ISSN 0302-9743  ISSN 1611-3349 (electronic)

Lecture Notes in Artificial Intelligence

ISBN 978-3-031-63497-0  ISBN 978-3-031-63498-7 (eBook)

https://doi.org/10.1007/978-3-031-63498-7

LNCS Sublibrary: SL7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s) 2024. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

If disposing of this product, please recycle the paper.

# **Preface**

This volume contains the papers of the 12th International Joint Conference on Automated Reasoning (IJCAR) held in Nancy, France, during July 3–6, 2024. IJCAR is the premier international joint conference on all aspects of automated reasoning, including foundations, implementations, and applications, comprising several leading conferences and workshops. IJCAR 2024 brought together the Conference on Automated Deduction (CADE), the International Symposium on Frontiers of Combining Systems (FroCoS), and the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods (TABLEAUX).

Previous IJCAR conferences were held in Siena, Italy (2001), Cork, Ireland (2004), Seattle, USA (2006), Sydney, Australia (2008), Edinburgh, UK (2010), Manchester, UK (2012), Vienna, Austria (2014), Coimbra, Portugal (2016), Oxford, UK (2018), Paris, France (2020, virtual), and Haifa, Israel (2022).

IJCAR 2024 received 115 submissions (130 abstracts) out of which 45 papers were accepted (with an overall acceptance rate of 39%): 39 regular papers (out of 96 regular papers submitted, resulting in a regular paper acceptance rate of 41%) and 6 short papers (out of 19 short papers submitted, resulting in a short paper acceptance rate of 31%). Each submission was assigned to at least three Program Committee members and was reviewed in single-blind mode. All submissions were evaluated according to the following criteria: relevance, originality, significance, correctness, and readability. The review process included a feedback/rebuttal period, during which authors had the option to respond to reviewer comments.

In addition to the accepted papers, the IJCAR 2024 program included three invited talks:


This year marks the 30th anniversary of the CADE ATP System Competition (CASC), which was conceived in 1994 after CADE-12 in Nancy, France, when Christian Suttner and Geoff Sutcliffe were sitting on a bench under a tree in Parc de la Pépinière. In the 28 competitions since then, CASC has been a catalyst for research and development, providing an inspiring environment for personal interaction between ATP researchers and users. A special event took place to celebrate this anniversary.

In addition to the main programme, IJCAR 2024 hosted ten workshops, which took place on July 1–2, and two systems competitions (CASC and Termination). The SAT/SMT/AR 2024 Summer School was held in Nancy the week prior to IJCAR 2024.

The Best Paper Award of IJCAR 2024 went to Hugo Férée, Iris van der Giessen, Sam van Gool, and Ian Shillito for the paper "Mechanised Uniform Interpolation for Modal Logics K, GL, and iSL". The Best Student Paper Award went to Johannes Niederhauser (with Chad E. Brown and Cezary Kaliszyk) for the paper entitled "Tableaux for Automated Reasoning in Dependently-Typed Higher-Order Logic".

Another highlight of the conference was the presentation of the 2024 Herbrand Award for Distinguished Contributions to Automated Reasoning to Armin Biere (Albert-Ludwigs-University Freiburg, Germany) in recognition of "his outstanding contributions to satisfiability solving, including innovative applications, methods for formula pre- and in-processing and proof generation, and a series of award-winning solvers, with deep impact on model checking and verification."

The 2024 Bill McCune PhD Award was given to Katherine Kosaian for the PhD thesis "Formally Verifying Algorithms for Real Quantifier Elimination", completed at Carnegie Mellon University, USA, in 2023.

The main institutions supporting IJCAR 2024 were the University of Lorraine and the Inria research center at the University of Lorraine. We also thank as sponsors: the research laboratory for computer science in Nancy (LORIA), a joint research unit of the University of Lorraine, CNRS, and Inria, its Formal Methods Department, and Métropole du Grand Nancy. For hosting the conference, we thank IDMC Nancy.

We would also like to acknowledge the generous sponsorship of Springer and Imandra Inc., and the support by EasyChair. Finally, we are indebted to the entire IJCAR 2024 Organizing Team for their assistance with the local organization and general management of the conference, especially Didier Galmiche, Stephan Merz, Christophe Ringeissen (Conference Co-Chairs), Sophie Tourret (Workshop, Tutorial and Competition Chair), Peter Lammich (Publicity Chair) and Anne-Lise Charbonnier and Sabrina Lemaire (main administrative support).

May 2024

Christoph Benzmüller
Marijn J. H. Heule
Renate A. Schmidt

# **Organization**

# **Conference Chairs**

- Didier Galmiche, University of Lorraine, France
- Stephan Merz, Inria, University of Lorraine, France
- Christophe Ringeissen, Inria, University of Lorraine, France

# **Program Committee Chairs**

- Christoph Benzmüller, Otto-Friedrich-Universität Bamberg, Germany
- Marijn J. H. Heule, Carnegie Mellon University, USA
- Renate A. Schmidt, University of Manchester, UK

# **Workshop, Tutorial and Competition Chair**

- Sophie Tourret, Inria, France and Max Planck Institute for Informatics, Germany

# **Publicity Chair**

- Peter Lammich, University of Twente, The Netherlands

# **Local Arrangements**

- Anne-Lise Charbonnier
- Sabrina Lemaire

# **Steering Committee**


- Franz Baader, TU Dresden, Germany
- Nikolaj Bjørner, Microsoft, USA
- Agata Ciabattoni, TU Wien, Austria
- Daniela Kaufmann, TU Wien, Austria
- Xavier Parent, TU Wien, Austria
- Lawrence Paulson, University of Cambridge, UK
- Elaine Pimentel, University College London, UK
- Christophe Ringeissen, Inria, University of Lorraine, France
- Renate A. Schmidt, University of Manchester, UK

### **Program Committee**

- Haniel Barbosa, Universidade Federal de Minas Gerais, Brazil
- Christoph Benzmüller, Otto-Friedrich-Universität Bamberg and FU Berlin, Germany
- Armin Biere, University of Freiburg, Germany
- Jasmin Blanchette, Ludwig-Maximilians-Universität München, Germany
- Maria Paola Bonacina, Università degli Studi di Verona, Italy
- Florent Capelli, Université d'Artois, France
- Clare Dixon, University of Manchester, UK
- Pascal Fontaine, Université de Liège, Belgium
- Carsten Fuhs, Birkbeck, University of London, UK
- Didier Galmiche, University of Lorraine, France
- Silvio Ghilardi, Università degli Studi di Milano, Italy
- Jürgen Giesl, RWTH Aachen University, Germany
- Arie Gurfinkel, University of Waterloo, Canada
- Marijn J. H. Heule, Carnegie Mellon University, USA
- Andrzej Indrzejczak, University of Lodz, Poland
- Moa Johansson, Chalmers University of Technology, Sweden
- Patrick Koopmann, Vrije Universiteit Amsterdam, The Netherlands
- Konstantin Korovin, University of Manchester, UK
- Peter Lammich, University of Twente, The Netherlands
- Martin Lange, University of Kassel, Germany
- Tim Lyon, Technische Universität Dresden, Germany
- Kuldeep S. Meel, University of Toronto, Canada
- Stephan Merz, Inria, University of Lorraine, France
- Cláudia Nalon, University of Brasília, Brazil
- Aina Niemetz, Stanford University, USA
- Albert Oliveras, Universitat Politècnica de Catalunya, Spain
- Nicolas Peltier, CNRS, Laboratory of Informatics of Grenoble, France
- Rafael Peñaloza, University of Milano-Bicocca, Italy
- Elaine Pimentel, University College London, UK
- André Platzer, Karlsruhe Institute of Technology, Germany
- Andrei Popescu, University of Sheffield, UK
- Florian Rabe, FAU Erlangen-Nürnberg, Germany
- Giles Reger, Amazon Web Services, USA and University of Manchester, UK
- Giselle Reis, Carnegie Mellon University, Qatar
- Andrew Reynolds, University of Iowa, USA
- Christophe Ringeissen, Inria, University of Lorraine, France
- Philipp Rümmer, University of Regensburg, Germany
- Uli Sattler, University of Manchester, UK
- Tanja Schindler, University of Basel, Switzerland
- Renate A. Schmidt, University of Manchester, UK
- Claudia Schon, Hochschule Trier, Germany
- Stephan Schulz, DHBW Stuttgart, Germany
- Roberto Sebastiani, University of Trento, Italy
- Martina Seidl, Johannes Kepler University Linz, Austria
- Viorica Sofronie-Stokkermans, University of Koblenz, Germany
- Alexander Steen, University of Greifswald, Germany
- Martin Suda, Czech Technical University in Prague, Czech Republic
- Yong Kiam Tan, Institute for Infocomm Research, A\*STAR, Singapore
- Sophie Tourret, Inria, France and Max Planck Institute for Informatics, Germany
- Josef Urban, Czech Technical University in Prague, Czech Republic
- Uwe Waldmann, Max Planck Institute for Informatics, Germany
- Christoph Weidenbach, Max Planck Institute for Informatics, Germany
- Sarah Winkler, Free University of Bozen-Bolzano, Italy
- Yoni Zohar, Bar-Ilan University, Israel

# **Additional Reviewers**

Noah Abou El Wafa, Takahito Aoto, Martin Avanzini, Philippe Balbiani, Lasse Blaauwbroek, Frédéric Blanqui, Thierry Boy de La Tour, Marvin Brieger, Martin Bromberger, James Brotherston, Chad E. Brown, Florian Bruse, Filip Bártek, Julie Cailler, Cameron Calk, Christophe Chareton, Jiaoyan Chen, Karel Chvalovský, Tiziano Dalmonte, Anupam Das, Martin Desharnais, Paulius Dilkas, Marie Duflot, Yotam Dvir, Chelsea Edmonds, Sólrún Halla Einarsdóttir, Clemens Eisenhofer, Zafer Esen, Camillo Fiorentini, Mathias Fleury, Stef Frijters, Florian Frohn, Nikolaos Galatos, Alessandro Gianola, Matt Griffin, Alberto Griggio, Liye Guo, Raúl Gutiérrez, Xavier Généreux, Hans-Dieter Hiep, Jochen Hoenicke, Jonathan Huerta y Munive, Ullrich Hustadt, Cezary Kaliszyk, Jan-Christoph Kassing, Michael Kinyon, Lydia Kondylidou, Boris Konev, George Kourtis, Francesco Kriegel, Falko Kötter, Timo Lang, Jonathan Laurent, Daniel Le Berre, Jannis Limperg, Xinghan Liu, Anela Lolic, Etienne Lozes, Salvador Lucas, Andreas Lööw, Kenji Maillard, Sérgio Marcelino, Andrew M. Marshall, Gabriele Masina, Marcel Moosbrugger, Barbara Morawska, Johannes Oetsch, Eugenio Orlandelli, Jens Otten, Adam Pease, Bartosz Piotrowski, Enguerrand Prebet, Siddharth Priya, Long Qian, Jakob Rath, Colin Rothgang, Reuben Rowe, Jan Frederik Schaefer, Johannes Schoisswohl, Marcel Schütz, Florian Sextl, Ian Shillito, Nicholas Smallbone, Giuseppe Spallitta, Sergei Stepanenko, Georg Struth, Matteo Tesi, Guilherme Toledo, Patrick Trentin, Hari Govind Vediramana Krishnan, Laurent Vigneron, Renaud Vilmart, Dominik Wehr, Tobias Winkler, Frank Wolter, Akihisa Yamada, Michal Zawidzki

# **Contents – Part I**

### **Invited Contributions**





# **Contents – Part II**

### **Intuitionistic Logics and Modal Logics**



# **Invited Contributions**

# **Automated Reasoning for Mathematics**

Jeremy Avigad

Carnegie Mellon University, Pittsburgh, PA 15213, USA avigad@cmu.edu

**Abstract.** Throughout the history of automated reasoning, mathematics has been viewed as a prototypical domain of application. It is therefore surprising that the technology has had almost no impact on mathematics to date and plays almost no role in the subject today. This article presents an optimistic view that the situation is about to change. It describes some recent developments in the Lean programming language and proof assistant that support this optimism, and it reflects on the role that automated reasoning can and should play in mathematics in the years to come.

# **1 The Origins and Foundations of Automated Reasoning**

The fact that IJCAR is celebrating the 30th anniversary of the CADE ATP System Competition (CASC) is a reminder that, at least by the standards of computer science, automated reasoning has a long and venerable history. Some date the field to 1956, when Allen Newell, Herbert Simon, and Cliff Shaw introduced the *Logic Theorist*, a program that used heuristic methods to prove theorems in propositional logic. Two years earlier, however, Martin Davis had implemented Presburger's decision procedure for integer arithmetic on a computer at the Institute for Advanced Study. Davis admitted that the program did not perform well but he reported that it succeeded in proving that the sum of two even numbers is even. In 1960, Henry Gelernter, J. R. Hansen, and Donald Loveland published an article on the *Geometry Machine*, a program that could prove nontrivial theorems in elementary Euclidean plane geometry. The resolution rule for propositional logic was introduced by Davis and Hilary Putnam in 1960, and John Alan Robinson's introduction of a unification algorithm in 1965 established resolution theorem proving as a powerful method for first-order logic.<sup>1</sup>

The theoretical foundations of automated reasoning predate even the introduction of the first electronic computers. In contemporary terms, Kurt Gödel's first incompleteness theorem says that there is no complete, consistent, computably axiomatized theory that contains (or interprets) a modicum of arithmetic. In his 1931 paper, Gödel explained that the theorem, as he stated it,

<sup>1</sup> All of the articles mentioned in this paragraph are found in a collection edited by Siekmann and Wrightson [50]. See also the survey by MacKenzie [36] and the references there.

© The Author(s) 2024

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 3–20, 2024. https://doi.org/10.1007/978-3-031-63498-7\_1

is not in any way due to the special nature of the systems that have been set up, but holds for a wide class of formal systems; among these, in particular, are all systems that result from the two just mentioned through the addition of a finite number of axioms . . . [21,22]

It is interesting to see Gödel struggling to say "computably axiomatized theory" for the simple reason that, at the time, there was no mathematical concept of computability available. He then presented a tentative definition of computability in lectures that he gave at the Institute for Advanced Study in Princeton in 1934 precisely so that he could state the incompleteness theorems in their proper generality. Alan Turing gave his own celebrated definition of computability a few years later and titled the paper "On computable numbers, with an application to the Entscheidungsproblem," providing a negative answer to Hilbert's question as to whether there is a decision procedure for first-order logic. Gödel had expressed uncertainty as to whether his definition exhausts the general notion of computability, but he took Turing's analysis to settle the matter definitively:

In consequence of later advances, in particular of the fact that, due to A. M. Turing's work, a precise and unquestionably adequate definition of the general concept of a formal system can now be given, the existence of undecidable arithmetical propositions and the non-demonstrability of the consistency of a system in the same system can now be proved rigorously for *every* consistent formal system containing a certain amount of finitary number theory. (quoted in [17])

The emphasis is by Gödel, who took the phrase "formal system" to mean a system with computably checkable axioms and rules.

Turing pointed out in his paper that the first incompleteness theorem is a consequence of the undecidability of theories of arithmetic, because if there were a complete, computably axiomatized theory of arithmetic one could decide the provability of a formula by simultaneously searching for a proof of the formula and its negation. Alonzo Church gave an independent proof of the undecidability of arithmetic in 1936, and Stephen Kleene, another key player in developing the modern theory of computability, was also keenly interested in applications to logic and the foundations of mathematics. So logicians were thinking about computable proof systems and proof search even before the arrival of the first digital electronic computers in the 1940s.
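Turing's argument is, at bottom, a dovetailing search, which can be sketched in Lean. Here `checks n φ` is a hypothetical stand-in for "the *n*-th candidate proof is a correct proof of φ" and `neg` forms negations; it is the completeness of the theory that guarantees the search terminates, which is why the definition must be marked `partial`:

```lean
-- A sketch of Turing's observation: in a complete, computably axiomatized
-- theory, provability is decidable by searching simultaneously for a proof
-- of the formula and a proof of its negation. `checks` and `neg` are
-- hypothetical stand-ins for a proof checker and a negation operator.
partial def provable (checks : Nat → String → Bool) (neg : String → String)
    (φ : String) (n : Nat := 0) : Bool :=
  if checks n φ then true                 -- found a proof of φ
  else if checks n (neg φ) then false     -- found a proof of ¬φ
  else provable checks neg φ (n + 1)      -- dovetail: try the next candidate
```

Nothing in the definition itself uses completeness; it is only the termination argument, deliberately left outside the formal development here, that does.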

Fundamental decision procedures for logic and arithmetic also predate the electronic computer. As early as 1915, Leopold Löwenheim gave a decision procedure for monadic first-order logic. (The introduction to Börger et al. [11] provides an excellent overview of early work on the decidability of fragments of first-order reasoning.) Mojżesz Presburger's paper on a decision procedure for arithmetic, based on Thoralf Skolem's method of eliminating quantifiers, was published in 1929. (Presburger never earned a doctorate for that work; Andrzej Mostowski reported [15] that Alfred Tarski thought the result was too simple, a straightforward application of Skolem's method.) In 1930, Tarski obtained a decision procedure for real-closed fields, that is, the first-order theory of the real numbers as an ordered field, though the result was not published until 1948. So logicians were also interested in decision procedures for aspects of mathematical reasoning even before there were computers to implement them.
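As a toy instance of the kind of quantifier elimination at work in Presburger's procedure (my illustration, not an example from the sources above): over the integers, ∃*x* (*y* < *x* ∧ *x* < *z*) is equivalent to the quantifier-free formula *y* + 1 < *z*. In Lean, the `omega` decision procedure for linear integer arithmetic handles the quantifier-free reasoning, while the existential witness is supplied by hand:

```lean
-- Quantifier elimination in miniature: both directions of the equivalence
-- ∃ x, (y < x ∧ x < z)  ↔  y + 1 < z  over the integers.
example (y z : Int) (h : y + 1 < z) : ∃ x : Int, y < x ∧ x < z :=
  ⟨y + 1, by omega, by omega⟩  -- the eliminated quantifier names the witness

example (y z x : Int) (h₁ : y < x) (h₂ : x < z) : y + 1 < z := by
  omega  -- the quantifier-free direction is decided automatically
```

Over the reals, by contrast, the same formula eliminates to *y* < *z*, which is the flavor of difference that Tarski's procedure for real-closed fields has to account for.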

# **2 Taking Stock**

We have seen that the early history of automated reasoning was rooted in the foundations of mathematics and that many of its early pioneers were mathematicians. Even those who were not mathematicians took mathematical reasoning to be a primary target of the technology. Now, almost a century after Presburger's discovery of a decision procedure for arithmetic and almost three quarters of a century after the implementation of the Logic Theorist, it seems reasonable to ask where we stand. What has automated reasoning done for mathematics, and how is it used in mathematics today?

The answer is disappointingly negative. Automated reasoning has had almost no impact at all on mathematics and plays almost no role in the subject today. Few working mathematicians have ever touched an automated reasoning tool, let alone use automated reasoning in their daily work. The technology has contributed to very few mathematical discoveries, even minor ones.

This is surprising. One would think, as the pioneers of the subject clearly did, that mathematical reasoning is ideally suited to automation. To be sure, mathematics requires creativity, intuition, experience, and insight, but it also requires long chains of precise and sometimes tedious reasoning, and it's hard to get the details right. Computers can carry out small inferential steps much more quickly and accurately than we can, so we would expect them to be helpful for exploring and verifying mathematical results. Numeric and symbolic computation have revolutionized the sciences, even though science involves a lot more than computation. Why hasn't automated reasoning had a similar impact on mathematics?

This question is not meant as a criticism. Automated reasoning has made remarkable progress over the past 70 years, and the tools are now quite sophisticated. They have had a significant impact in several important areas, such as hardware and software verification, AI, planning, databases, knowledge representation, program synthesis, and natural language processing. Automated reasoning does not have to look to mathematics for justification. Moreover, the fact that there have been few applications to mathematics doesn't mean that there haven't been any, and the successes are worth celebrating. Finally, the fact that it has been difficult to automate mathematical reasoning is largely a reflection of the fact that mathematics isn't easy to mechanize, and those of us who love the subject wouldn't want it any other way. So my goal here is rather to review some of the successes of automated reasoning for mathematics, understand the challenges, and reflect on the role that automated reasoning can and should play in mathematics in the years to come.

In 2019, I gave a joint talk at FroCoS and TABLEAUX titled "Automated Reasoning for the Working Mathematician." In that talk, I surveyed the use of automated reasoning with interactive proof assistants in the hopes of extracting lessons that I could convey to the automated reasoning community. This article draws on that talk as well as unpublished notes, data, and experiments that I prepared at the time.<sup>2</sup> See also my article, "The Mechanization of Mathematics" [1], and an article by Michael Beeson with the same title [6], for additional examples of applications of automated reasoning to mathematics.

# **3 A Personal History**

As a mathematician who has been using automated reasoning tools for more than two decades, I have a personal interest in the subject. I first experimented with Isabelle and Coq in the late 1990s, and when I started using Isabelle in earnest in 2002, the automation was surprisingly mature. There was a conditional term rewriter, simp [44], along with variations on a general tableau prover (auto, force, and clarify [45]) and a decision procedure for linear arithmetic (arith). Working with students at Carnegie Mellon, I completed a proof of the Hadamard–de la Vallée-Poussin prime number theorem in September of 2004 [2]. A couple of months later, Georges Gonthier announced the verification of the four-color theorem in Coq [24], and soon after that Thomas Hales announced the verification of the Jordan curve theorem in HOL Light [26]. These were early landmarks, providing evidence that substantial mathematical theorems could be formalized.

Many of the challenges in formalizing the prime number theorem stemmed from the fact that Isabelle's libraries were young and incomplete. The automation, however, was remarkably helpful. For example, in the 4,000 lines contained in the last file in the proof, there are 390 invocations of simp, 397 invocations of auto and friends, and 246 invocations of arith. Even now, twenty years later, I have yet to have a better experience with automation.

I spent a sabbatical year in France with Gonthier and his team in 2009–2010, working on the formalization of the Feit–Thompson theorem, using the SSReflect proof language and Coq [25]. In designing and managing the project, Georges made the conscious decision to avoid automation entirely, other than the built-in foundational reduction of Coq expressions, which is fundamental to the methodology of SSReflect. He was skeptical that black-box automation would scale and had more faith in the power of good language design to make formalization manageable.

When I returned from France, I was ready to leave interactive theorem proving behind and turn to automated reasoning. But a talented undergraduate at Carnegie Mellon, Luke Serafin, managed to convince me to work on a project to verify the central limit theorem in Isabelle [3], and I was seduced by the excitement around homotopy type theory at the time to work on another verification project in Coq [4].

What really pulled me back to the world of proof assistants, however, was Leonardo de Moura's decision, in 2013, to launch the Lean project. Leo convinced me that even if one is primarily interested in automation for mathematics, one should build it on top of a secure, expressive foundation, not just to ensure that the automation is reliable, but to have a meaningful specification of what the results mean. For several years, Lean's web pages described the aim of the project as follows:

<sup>2</sup> https://github.com/avigad/arwm.

to bridge the gap between interactive and automated theorem proving, by situating automated tools and methods in a framework that supports user interaction and the construction of fully specified axiomatic proofs.

I don't think Leo anticipated the amount of work he would have to put into implementing an elaborator for dependent type theory and supporting all the features that are needed to make that foundation usable. Work also went into the implementation of a tactic framework, in Lean 2, and the implementation of a metaprogramming language, in Lean 3, that users could use to write their own tactics [18,39]. Lean 4 is a complete rewrite of the system, most of which is now implemented in Lean 4 itself [38]. Leo and Sebastian Ullrich have put a lot of effort into making Lean 4 an efficient programming (and metaprogramming) language, and treating syntax as first-class objects, making the syntactic framework as powerful, flexible, and extensible as the tactic framework.
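As a small illustration of what "syntax as first-class objects" buys the user (a minimal sketch; `double` is my own throwaway example, not part of Lean's library): a user-level macro, written in Lean itself, rewrites its argument before elaboration.

```lean
-- A tiny Lean 4 macro: `double t` is rewritten to `(t) + (t)` at the
-- syntax level, before the term is elaborated and type checked.
macro "double" t:term : term => `(($t) + ($t))

#eval double 21  -- evaluates (21) + (21)
```

The same mechanism scales from one-line notations like this to Mathlib's elaborate notations and to whole embedded domain-specific languages.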

We are now beginning to see automation for Lean 4 that is written in Lean 4, as well as Lean-based experiments on applications of machine learning to mathematical reasoning. Thus, a decade into the Lean project, we are now in an especially good position to realize the initial vision of making it a powerful means of combining automation with user interaction. In the sections that follow, I will discuss prospects for automated reasoning for mathematics in general, but I will also focus specifically on opportunities based on recent developments in Lean.

### **4 Domain-General Reasoning for Verification**

To prepare for the talk at FroCoS and TABLEAUX, I sent a questionnaire to colleagues who had worked on formalization of mathematics to learn about their experiences with automation. One of the interesting findings was that most of the people who I considered to be the best at formalization—people who had formalized vast amounts of interesting mathematics efficiently and with very high quality—used very little automation at all. (Larry Paulson was a notable exception; he has spoken eloquently of the power of automation in enabling him to port large amounts of measure theory and analysis from HOL Light to Isabelle.) The best explanation I could come up with is that even if automation were much better than it is now, serious users would still have to fill in some inferences by hand, which would inevitably require them to learn the library inside out and become skillful at writing explicit proofs. So even when automation is available, power users generally come to know the library and proof language well enough that they don't need to use automation to do what they want to do. If that analysis is correct, it highlights the challenge of scaling the use of formal methods to a broader mathematical audience. Even now, I get frustrated when I have to struggle with an unfamiliar part of the library to carry out an inference that seems painfully obvious. The lack of automation limits the utility of formalization to all but the most determined and dedicated practitioners.

Let me clarify that when I talk about domain-general reasoning, I am setting aside equational rewriting and simplification. It's not clear how to classify such methods. On the one hand, there is nothing more general than the equality relation: wherever there are expressions that denote objects, there are equations that govern them. On the other hand, equational rewriters handle only a small fragment of logical reasoning and the task they perform is focused and specific. If we take domain-general reasoning to encompass problems that, in full generality, are equivalent to the halting problem, it makes sense to exclude equational rewriters, which are designed to reduce expressions to canonical forms in a finite number of steps. In any case, they are incredibly useful. Tools like Isabelle's simp can simplify a formula to True and hence prove more than just equations. They can also carry out conditional rewriting and use backchaining to dispel side conditions. Users can add facts to the rewrite database as they develop new theories. As far as I know, any proof assistant that takes automation seriously has some sort of rewrite engine. Lean's version of simp was one of the first tactics that was implemented in that system.
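For concreteness, here is what this looks like in Lean (a minimal sketch; I am assuming the list lemmas used below are registered as simp lemmas in recent versions of the core library, as they are at the time of writing):

```lean
-- simp rewrites with a database of (conditional) rewrite rules; here it
-- normalizes away the `++ []` and distributes `length` over append.
example (xs ys : List Nat) :
    (xs ++ ys ++ []).length = xs.length + ys.length := by
  simp

-- simp can also prove a proposition outright by rewriting it to True.
example (n : Nat) : n ∈ [n] := by
  simp
```

Users extend the database with the `@[simp]` attribute as they develop new theories, which is exactly the workflow described above for Isabelle's simp.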

To prepare for the talk at FroCoS and TABLEAUX, I also carried out some experiments with Isabelle's *Sledgehammer* [46]. This is a tool that, given a proof goal, uses a relevance filter to select a couple of hundred potentially useful theorems from the library, sends the problem to external provers, and then tries to use the information they return to reconstruct a proof internally. I set myself the task of determining the extent to which I could formalize theorems by writing a proof sketch, calling only Sledgehammer and auto, and then refining the sketch as needed. I formalized three theorems in this way—the mutilated chessboard problem, the intermediate value theorem, and the existence of infinitely many primes congruent to three modulo four—and I took detailed notes as I went. The data, which is still available in the GitHub repository, is not very rigorous, but it helped me understand better some of the places where the automation fell short. In particular, two of the theorems required mild forms of second-order reasoning, such as identities governing the summation of functions over finite sets or reasoning about membership in sets defined by explicit predicates. Now, provers like Vampire, Zipperposition, and E can handle higher-order reasoning natively [7,8,53], which is likely to help.
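Those two kinds of mildly second-order goals can be phrased in Lean with Mathlib along the following lines (my illustration, not the Isabelle goals from the experiments themselves; the lemma name is from current Mathlib):

```lean
import Mathlib

-- Summation of functions over a finite set: a sum distributes over
-- pointwise addition.
example (s : Finset ℕ) (f g : ℕ → ℕ) :
    ∑ x ∈ s, (f x + g x) = (∑ x ∈ s, f x) + ∑ x ∈ s, g x :=
  Finset.sum_add_distrib

-- Membership in a set defined by an explicit predicate unfolds
-- definitionally.
example (n : ℕ) : n ∈ {m : ℕ | 3 < m} ↔ 3 < n := Iff.rfl
```

Both inferences are trivial for a human reader, which is precisely why their absence from first-order automation was so noticeable in the experiments.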

One method of proof reconstruction involves harvesting nothing more than the list of theorems the external prover needed to establish the goal and calling internal, proof-producing automation to redo the search. Isabelle uses a tool called *Metis* for that [31]. Joshua Clune, Yicheng Qian, and Alexander Bentkamp have written a proof-producing superposition theorem prover for Lean called *Duper* to serve a similar purpose, as well as to serve as generic internal automation. They invested considerable effort in adapting conventional resolution methods to dependent type theory. For example, in dependent type theory, a data type might not have any elements, and Skolemization and other components of the search calculus have to be adapted to avoid introducing unsoundness. Duper can instantiate generic type variables on the fly, but that introduces additional technical complications, as does adapting unification procedures to the dependently typed setting. Because it is modeled on Zipperposition, Duper can also handle higher-order inferences, and because it operates on Lean expressions directly (rather than via translation), it can handle other rules tailored specifically to dependent type theory. Testing on standard benchmarks shows that Duper's performance is roughly comparable to Metis's, offering hope that we will have a Sledgehammer for Lean before long. Lean is starting to catch up with Isabelle in other respects as well, with an automated reasoner called Aesop [33], inspired by Isabelle's auto and by tools like Coq's eauto and PVS's grind.
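The benign case of Skolemization is easy to express in Lean using core Lean's choice operator (a sketch): when the existential hypothesis is available pointwise, the Skolem function exists without any inhabitedness assumption.

```lean
-- Classical Skolemization: a ∀∃ statement yields a Skolem function.
example {α β : Type} (P : α → β → Prop) (h : ∀ x, ∃ y, P x y) :
    ∃ f : α → β, ∀ x, P x (f x) :=
  ⟨fun x => Classical.choose (h x),       -- pick a witness for each x
   fun x => Classical.choose_spec (h x)⟩  -- and record that it works
```

The trouble arises in clause form, where the existential sits under a hypothesis: to Skolemize ∀*x* (*Q x* → ∃*y*, *P x y*) one must still produce a value of β when *Q x* fails, and in dependent type theory β may be empty. This is the kind of adaptation Duper has to make.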

When we consider the way that mathematicians use proof assistants, it becomes clear that support for reasoning about algebraic structures is essential. I have sometimes heard computer scientists say that there is no need to use dependent type theory because anything that can be done there can be done just as well in set theory or simple type theory. In principle, any reasonable foundation can interpret any other, possibly modulo a few axiomatic extensions, but generally speaking, the remark fails to appreciate the extent to which algebraic language and thought pervade contemporary mathematics. Any undergraduate student of mathematics can talk about the ring of *n*×*n* matrices of polynomials over Z*/p*Z for a prime number *p*. Moreover, that student knows that multiplication is commutative on the polynomials and their coefficients but not on the matrices, and that nonzero elements of Z*/p*Z have multiplicative inverses while the polynomials and matrices generally don't. In other words, mathematicians easily name complex structures and use generic notation, and they know what properties elements of those structures have. The structures themselves are mathematical objects on par with numbers, functions, and circles, yet at the same time they serve as data types, constraining what can meaningfully be said about their elements. I find it remarkable that Lean and Mathlib are so good at making a vast network of structures available to users without collapsing under the technical requirements. Doing so requires a carefully designed system of type class inference and efficient means of elaborating and unifying the complex expressions that describe the structures and their elements. Dependent type theory may not be the only possible solution, but it is the only one implemented so far that can do anything close to what mathematicians need.
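The example from the text can be written down directly in Lean with Mathlib, where type class inference assembles the structures automatically. The following sketch assumes a recent Mathlib:

```
import Mathlib

-- Polynomials over ℤ/pℤ form a commutative ring...
example (p : ℕ) : CommRing (Polynomial (ZMod p)) := inferInstance

-- ...while n×n matrices over them form a (generally noncommutative) ring.
example (p n : ℕ) : Ring (Matrix (Fin n) (Fin n) (Polynomial (ZMod p))) := inferInstance

-- Nonzero elements of ℤ/pℤ have multiplicative inverses when p is prime.
example (p : ℕ) [Fact p.Prime] : Field (ZMod p) := inferInstance
```

In each case, `inferInstance` asks the elaborator to synthesize the structure from the instances registered in the library.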

Reasoning about a rich algebraic hierarchy is a challenge for automated reasoning for at least two reasons: first, because instantiating generic theorems requires determining whether the types in question are instances of the relevant algebraic structures, and, second, because carrying out unification with expressions that have algebraic components requires determining the identity of objects that have been described in different ways. Both of these tasks are too expensive to be carried out in the midst of an automated search, but, fortunately, we generally don't expect them to be: mathematicians are usually careful to establish the relevant algebraic context explicitly. Duper manages algebraic reasoning using a remarkable preprocessing tool by Qian called *Lean-Auto*, which heuristically instantiates generic theorems, infers the relevant algebraic structures, and chooses canonical representations of expressions, all before the search begins.

Sledgehammers generally work by translating the source language to a simpler one and using tools optimized for equational reasoning, propositional reasoning, and quantifier instantiation. Another approach to automating dependent type theory is to search systematically in dependent type theory itself. There are tools for Coq [16] and Agda [34,51] that take this approach, without making any attempt at completeness. At Carnegie Mellon, Chase Norman has implemented a procedure for a minimal but fully expressive dependent type theory that provides a complete solution to both the unification and type inhabitation problems, generalizing Huet's unification procedure for higher-order logic [30]. The framework is flexible enough to instantiate various heuristics to carry out the search, and the implementation performs well on examples. It will be interesting to see whether such an approach will provide automation that complements the strengths of a sledgehammer.

# **5 Domain-Specific Reasoning for Verification**

Talking about domain-general automation reminds me of a quip that I once heard attributed to Sidney Morgenbesser that philosophers are people who know something about everything but nothing about anything. In an ornery mood, I might complain that first-order theorem provers are good at reasoning about everything but not so good at reasoning about anything in particular. At the opposite end of the spectrum are domain-specific automated reasoning tools that carry out more deterministic and focused tasks, such as reasoning about arithmetic, establishing algebraic identities, reasoning about linear and nonlinear inequalities, and so on.

Tools like these are extremely useful. Lean has benefited from the availability of a metaprogramming language, introduced in Lean 3 [18] and made vastly more powerful in Lean 4 [38], that allows users to write tactics within Lean. The ability to attract mathematical users from 2017 on was bolstered by the fact that users like Mario Carneiro were able to quickly provide them with the tactics they needed. As mathematicians gained expertise with the system, they could design tactics that would help them in their daily work. For example, Heather Macbeth, with the help of Carneiro, wrote the tactics gcongr and positivity to streamline common calculations, which made it possible to shorten a proof like this:

```
calc ‖wp - wq‖ * ‖wp - wq‖
  _ = 2 * (‖a‖ * ‖a‖ + ‖b‖ * ‖b‖) -
        4 * ‖u - half • (wq + wp)‖ * ‖u - half • (wq + wp)‖ := by
      rw [← this]; simp
  _ ≤ 2 * (‖a‖ * ‖a‖ + ‖b‖ * ‖b‖) - 4 * δ * δ :=
      (sub_le_sub_left eq1 _)
  _ ≤ 2 * ((δ + div) * (δ + div) + (δ + div) * (δ + div)) - 4 * δ * δ :=
      (sub_le_sub_right (mul_le_mul_of_nonneg_left
        (add_le_add eq1 eq2') (by norm_num)) _)
  _ = 8 * δ * div + 4 * div * div := by ring
exact
  add_nonneg (mul_nonneg (mul_nonneg (by norm_num) zero_le_δ)
      (le_of_lt Nat.one_div_pos_of_nat))
    (mul_nonneg (mul_nonneg (by norm_num)
      Nat.one_div_pos_of_nat.le) Nat.one_div_pos_of_nat.le)
```

to this:

```
calc ‖wp - wq‖ * ‖wp - wq‖
  _ = 2 * (‖a‖ * ‖a‖ + ‖b‖ * ‖b‖) -
        4 * ‖u - half • (wq + wp)‖ * ‖u - half • (wq + wp)‖ := by
      simp [← this]
  _ ≤ 2 * (‖a‖ * ‖a‖ + ‖b‖ * ‖b‖) - 4 * δ * δ := by gcongr
  _ ≤ 2 * ((δ + div) * (δ + div) + (δ + div) * (δ + div)) - 4 * δ * δ := by gcongr
  _ = 8 * δ * div + 4 * div * div := by ring
positivity
```
Tomáš Skřivan recently contributed a tactic, fun\_prop, that effectively establishes properties like continuity, differentiability, and measurability of functions.
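For instance, goals like the following can typically be discharged in one step. This is an illustrative sketch of my own, assuming a recent Mathlib:

```
import Mathlib

-- fun_prop proves such properties compositionally,
-- from registered lemmas about the component functions.
example : Continuous (fun x : ℝ => x * Real.sin x) := by fun_prop

example : Measurable (fun x : ℝ => Real.exp x + x ^ 2) := by fun_prop
```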

I am grateful to Adam Topaz for writing a small Lean metaprogram to extract tactic usage statistics from a recent version of Mathlib. The data is messy because tactic variants are listed separately when they are called under separate wrappers, including variants that were written to support the port of the library from Lean 3 but are otherwise superseded by newer versions. It is also misleading in that some tactics have been around much longer than others, so the numbers do not reflect the current utility. Nor does the data say anything about the role of domain-specific automation outside of Mathlib. Finally, some tactics are used transiently and are then eliminated from the final proof document, such as those that help find theorems, display information, or write proofs. These do not appear in the list.

Nonetheless, the data is informative. Tactics used to apply theorems are most common (apply, exact, refine, etc., with about 60K instances in all). After that, the vast majority of tactic calls are used for equational reasoning, with more than 52K invocations of Lean's rw tactic, which does manual term rewriting, and more than 60K invocations of Lean's simplifier (simp, simpa, dsimp, and simp\_rw). About 25K invocations are used to decompose data and existential assertions (obtain, rintro, rcases, cases, etc.), and there are about 5K calls to tactics that carry out proof by induction.
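Typical uses of these workhorse tactics look as follows (illustrative examples of my own, not drawn from Mathlib):

```
-- rw rewrites with the given equations from left to right.
example (a b : ℕ) (h : a = b) : a + 0 = b := by rw [Nat.add_zero, h]

-- simp rewrites repeatedly with a curated set of simplification lemmas.
example (xs : List ℕ) : (xs ++ []).length = xs.length := by simp

-- obtain decomposes an existential hypothesis into a witness and a property.
example (h : ∃ n : ℕ, n > 5) : ∃ m : ℕ, m > 3 := by
  obtain ⟨n, hn⟩ := h
  exact ⟨n, by omega⟩
```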

More specialized automation still manages to carry its weight. The linarith tactic is called more than 1,100 times; split\_ifs, which splits a goal to simplify conditional expressions, and ring, which carries out ring calculations, are each called more than 1,000 times; filter\_upwards, a simple tactic that helps reason about filters in measure theory and analysis, is called almost 900 times; norm\_num, which does numeric calculations, is called more than 800 times; a specialization of Aesop for use with category theory, aesop\_cat, is called almost 800 times; positivity, a relative newcomer, is called more than 600 times; and norm\_cast, a tactic that helps mediate casts between numeric domains, and gcongr are each called more than 500 times.
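A few illustrative one-liners convey the flavor of these tactics (my own examples, assuming a recent Mathlib):

```
import Mathlib

-- linarith closes goals that follow from linear arithmetic over the hypotheses.
example (x y : ℝ) (h1 : x < 2) (h2 : y ≤ 1) : x + 2 * y < 4 := by linarith

-- positivity proves goals of the form 0 < e or 0 ≤ e compositionally.
example (x : ℝ) (hx : 0 < x) : 0 < x ^ 2 + x := by positivity

-- norm_num evaluates numeric expressions.
example : (2 : ℝ) ^ 10 = 1024 := by norm_num

-- ring proves identities that hold in any commutative ring.
example (a b : ℝ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by ring
```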

Lean's metaprogramming language also provides flexible ways to support better interactions with automation, both domain-general and domain-specific. Lean's *Widgets* framework [42] allows users to install custom JavaScript-driven displays of objects and information in Lean's "infoview" window in VS Code, and allows user interactions with these graphical displays to communicate information and actions back to Lean and the editor. When the user puts the cursor at the beginning of a tactic invocation in a proof, the infoview window highlights the part of the state that is about to change, and when the cursor is on or at the end of the invocation, it highlights what has just changed. Users can hover over constants in the infoview window to see documentation, they can click on expressions to see their types, and they can control-click on identifiers to jump to their definitions. Users can trace class inference to diagnose failures, and expand or collapse nodes of the search tree in the infoview window. In Lean, automation can return structured error messages that can be explored. Ideally, whenever automation fails, users should have the means to diagnose the problem and come up with suitable interventions to fix it. Developers of automation should keep user interfaces in mind both as targets for automated reasoning and as means for using automation more effectively.

Finally, it is worth mentioning that tools that help users find theorems and explore the library are also essential. Lean provides internal tactics like apply?, exact?, and rw? that suggest atomic proof steps. There is also a good symbolic search engine, *Loogle*, and there is another search tool, *Moogle*, that uses a large language model to answer natural-language queries.
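For example, placing `exact?` at the end of a proof prompts a search of the library. This is a hypothetical session; here one would expect a suggestion along the lines of `exact Nat.add_comm a b`:

```
import Mathlib

-- exact? searches for a single library lemma that closes the goal
-- and reports the proof term it found as a suggestion.
example (a b : ℕ) : a + b = b + a := by exact?
```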

# **6 Automation for the Discovery of New Theorems**

To this point, we have focused on the use of automated reasoning to verify mathematical results that were discovered by conventional means. Mathematicians, however, tend to be much more excited about methods that help with the discovery of new mathematics. The automated reasoning community is justifiably proud of William McCune's use of his theorem prover, EQP, to settle the Robbins conjecture in 1996 [37]. The result, which shows that a certain set of equations can be taken as an alternative axiomatization of Boolean algebras, made the pages of the *New York Times*. Since then, there has also been a small industry in using automated theorem provers to prove theorems about other algebraic structures, like loops and quasigroups. One can think of a loop as like a group except without the associativity axiom, and one can think of a quasigroup as a loop without an identity. First-order theorem provers have been used to establish consequences of these and related axioms, an industry nicely surveyed by Phillips and Stanovský [47].

I have heard mathematicians express annoyance, however, at the suggestion that these results have anything to do with real mathematics. Few mathematicians have heard of the Robbins conjecture or have any interest in quasigroups. More to the point, mathematicians chafe at the implication that modern algebra is about deriving first-order consequences of axioms. Algebraists are interested in classification theorems, which characterize structures in terms of key invariants, structure theorems, which provide means of understanding structures in terms of subobjects and morphisms to other structures, and representation theorems. They are interested in introducing new structures and new spaces of structures, with applications that explain and simplify past results and provide powerful tools for future research. All these involve reasoning about structures within the context of a rich mathematical theory, rather than reasoning deductively from the axioms. As a result, to most mathematicians, the applications of automated reasoning to algebra so far are little more than recreational curiosities.

Applications of SAT solvers to mathematics fare better. For example, in 1912, Issai Schur proved that given any finite coloring of the positive integers, there is a monochromatic solution to the equation *x* + *y* = *z*. Today, this is recognized as a seminal result in both Ramsey theory and additive number theory. The theorem raises the question of whether one can compute the largest initial segment of the positive integers {1, 2, ..., *S*(*k*)} such that there is a *k*-coloring with no such monochromatic solution. It's not hard to establish *S*(1) = 1, *S*(2) = 4, and *S*(3) = 13. In 1965, Golomb and Baumert computed *S*(4) = 44 in a paper that contains other interesting examples of backtracking search [23]. The value *S*(5) = 160 was computed by Heule in 2017 using a SAT solver [28], a result which has drawn praise from the mathematical community. The value of *S*(6) is still unknown.
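The small cases can be checked by brute force. The following sketch (my own illustrative code, not from [23] or [28]) checks that the 2-coloring {1, 4} vs. {2, 3} admits no monochromatic solution in {1, ..., 4}, while every 2-coloring of {1, ..., 5} has one, witnessing S(2) = 4:

```
-- A coloring `c` of {1, ..., n} is Schur-free if no x + y = z with
-- x, y, z all in range is monochromatic (x = y is allowed).
def schurFree (c : Nat → Bool) (n : Nat) : Bool :=
  (List.range' 1 n).all fun x =>
    (List.range' 1 n).all fun y =>
      if x + y > n then true
      else !(c x == c y && c y == c (x + y))

-- The coloring {1, 4} / {2, 3} works for n = 4 ...
#eval schurFree (fun k => k == 1 || k == 4) 4
-- ... but every one of the 2^5 colorings of {1, ..., 5} fails,
-- encoding a coloring by the bits of a number m < 2^5.
#eval (List.range (2 ^ 5)).all fun m => !schurFree (fun k => m.testBit (k - 1)) 5
```

For larger k, of course, this exhaustive search is hopeless, which is where SAT solvers come in.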

Most mathematicians aren't interested in calculating Schur numbers, but the problem is at least interesting by association, since mathematicians recognize Schur's theorem as an important result. The case is similar with a theorem that Paul Erdős dubbed the "happy ending problem" because it led to the marriage of George Szekeres and Esther Klein. The general version of the theorem says that for every positive integer *n*, any sufficiently large finite set of points in general position in the plane contains a convex *n*-gon. Let *f*(*n*) denote the smallest number of points in general position that provides such a guarantee; the value of *f*(*n*) is known only for *n* ≤ 6. A related problem asks for the smallest number of points in general position that guarantees the existence of an *empty* convex *n*-gon, that is, one without any points in its interior. There are infinite sets of points without an empty convex 7-gon, but Nicolás [43] and Gerken [20] proved independently that every sufficiently large set of points in general position contains an empty convex hexagon. Using a SAT solver, Heule and Scheucher [29] recently showed that 30 is the smallest number of points that provides that guarantee.

When SAT solvers are used to solve mathematical problems, it is important to have guarantees that the results are correct. Students at Carnegie Mellon are working on a SAT library for Lean that addresses that concern. Joshua Clune has written an LRAT checker that is currently in use at Amazon Web Services; Wojciech Nawrocki has verified a checker for *knowledge compilation*, a technology that, in particular, can provide precise counts of the number of satisfying assignments to a propositional formula [12]; and Cayden Codel has written a verified checker for SR, a strong proof format for SAT solvers that incorporates symmetries in "without loss of generality" reductions [13]. Perhaps the more serious concern is to verify that a problem that is sent to a SAT solver is a correct representation of the intended problem. This is pressing because the reduction of an ordinary mathematical statement to a SAT problem often relies on complex encodings as well as symmetry breaking and other reductions, and the generation of the propositional formula is further subject to subtle programming errors. Codel, Nawrocki, and James Gallicchio have been working on aspects of the library that address that problem as well [14]. Bernardo Subercaseaux, Nawrocki, Gallicchio, Codel, Carneiro, and Heule have verified the correctness of the encoding used to compute the empty hexagon number [52].

Mathematicians have yet to explore the use of automated reasoning tools to *find* objects of mathematical interest. SAT solvers, SMT solvers, constraint solvers, and model finders are all designed to find objects and structures satisfying given constraints. A popular Isabelle tool called *Nitpick* [10] uses a model finder to look for counterexamples to purported theorems, preventing users from investing time and energy in trying to prove a statement that is false.

Bespoke decision procedures can also aid the process of discovery. An automated reasoning tool called *Walnut* [40,49] implements a decision procedure for an extension of Presburger arithmetic that can express properties of *automatic sequences*, which are, roughly, sequences generated by finite state automata. Consider the question of whether there is an infinite binary sequence with no subsequence of the form *xxx<sup>R</sup>*, where *x* is a finite sequence and *x<sup>R</sup>* is its reversal. It is not hard to see that the sequence 01010101 *...* has that property, but is it possible to find one that is aperiodic? A paper by Mousavi, Schaeffer, and Shallit [41] explains how Walnut helped them construct such a sequence.

I believe that mathematicians' general habit of dismissing "finite problems" as not properly mathematical will change over time. The entire edifice of infinitary mathematics bears on our everyday experience only through measurement and observation, and discrete problems from computer science have already begun to influence mathematical research. It also seems likely that mathematicians will find creative ways to solve infinitary problems by devising representations and reductions that are amenable to automation. The main challenge is that automated reasoning is unfamiliar to them. The history of Lean suggests that mathematicians will go to extraordinary lengths to learn a new technology once they decide that it is interesting and aligns with their goals. To facilitate adoption, it helps to have documentation and expository materials that are written with them in mind. The incentive structures in mathematics and computer science are not good at encouraging that kind of cross-disciplinary outreach, but once the door is open, it is only a matter of time before a new technology becomes part of the mathematical mainstream.

# **7 Machine Learning and Symbolic AI**

It's impossible to write about the prospects of automated reasoning for mathematics today without saying something about machine learning. Machine learning and symbolic AI have complementary strengths: ML is good at synthesizing vast amounts of data but isn't good at getting details right, whereas symbolic methods are good at getting the details right but are overwhelmed by combinatorial explosion in a search space. A central challenge for AI is to design systems that get the best of both worlds by combining the two approaches, and mathematics, where the problems are especially clear and well-defined, is an ideal place to make progress.

It is therefore not surprising that there is growing interest in applications of machine learning to mathematics. Researchers have long explored the use of machine learning techniques to guide symbolic search and to select premises for a sledgehammer, and with the advent of deep learning, there has been a surge of interest in using neural networks to prove theorems in conjunction with an interactive proof assistant. A recent "brief survey" of machine learning in automated reasoning by Blaauwbroek et al. [9] has 168 references, and surveys of deep learning for mathematical reasoning by Lu et al. [35] and by Li et al. [32] have well over 200 references each.

Lean is becoming recognized as an ideal platform for such work. There have been at least two projects on using machine learning for premise selection for Lean [19,48], there are tactics and copilots for Lean [27,54,55], and there are tools that support data extraction and interaction with Lean for machine learning experiments, including one developed by Kaiyu Yang and coauthors [55] and two developed by Kim Morrison.<sup>3</sup> The MiniF2F problem benchmark [56] includes versions for Lean 3, and the ProofNet benchmarks [5] have recently been ported to Lean 4 by Abhijit Chakraborty and Rahul Vishwakarma.<sup>4</sup> The features that make Lean a good platform for automated reasoning also make it a good platform for machine learning and support a synthesis of the two. The size and sophistication of Lean's mathematical library, Mathlib, and the involvement of the mathematical community provide powerful opportunities for progress.

### **8 Conclusions**

Although I began with a disappointing assessment of the current state of automated reasoning for mathematics, I hope I have conveyed good reasons for optimism. Mathematicians are beginning to warm to the use of formal methods, opening up new avenues for progress. As the technology improves, the number of mathematicians making use of automated reasoning tools will increase, providing greater incentives for computer scientists to focus on mathematical applications. This, in turn, will increase the number of mathematicians who use the technology and can therefore provide feedback and even contribute to its development. In short, I expect that we are on the cusp of a virtuous cycle whereby technological improvements lead to more users, which, in turn, lead to further improvements.

<sup>3</sup> https://github.com/semorrison/lean-training-data and https://github.com/leanprover-community/repl.

<sup>4</sup> https://github.com/rahul3613/ProofNet-lean4.

In this article, I have tried to identify some of the key challenges to developing better automation for mathematics, and I have suggested specific approaches that I find promising. The biggest challenges, however, may be sociological rather than technical. Making automation useful for mathematics will require mathematicians and computer scientists working together, and neither discipline will get far on its own. Mathematicians and computer scientists have very different attitudes and outlooks. Computer scientists focus on disseminating information quickly in conference publications, and their success is measured by the number of citations they receive. With the weight of centuries behind them, mathematicians can't be rushed, are mistrustful of academic fads, and tend to look to the leading experts in their fields to determine what is important. Whereas computer scientists value measurable impact in the short term, mathematicians answer to less clear-cut assessments of the quality and depth of their work. It's not that one discipline's standards are better than the other; each has its advantages and problems. It's just that the disparity of outlooks often makes communication difficult.

I'd like to suggest to computer scientists reading this article that it might be nice to adopt a mathematical attitude every once in a while. Imagine thinking about something because you find it interesting, without caring what others think. Imagine exploring ideas to see where they take you, without worrying about whether that will result in a conference publication by the next round of deadlines. Imagine working on a problem just for the joy of taking up the challenge. If all that sounds good to you, you'll find mathematicians to be excellent companions. I am not suggesting that you should turn your back on computer science; of course, you will still have bills to pay. But I am confident that one day, when you look back over your career, any contributions you have made to mathematics will be among the things that you are most proud of, and among those that are closest to your heart. So I invite you to come to the Lean Zulip channel and start talking to mathematicians about the things that automation can do for them. I promise, you won't regret it.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Induction in Saturation**

Laura Kovács<sup>1(B)</sup>, Petra Hozzová<sup>1</sup>, Márton Hajdu<sup>1</sup>, and Andrei Voronkov<sup>2,3</sup>

> <sup>1</sup> TU Wien, Vienna, Austria
> laura.kovacs@tuwien.ac.at
> <sup>2</sup> University of Manchester, Manchester, UK
> <sup>3</sup> EasyChair, Manchester, UK

**Abstract.** Proof by induction is commonplace in modern mathematics and computational logic. This paper overviews and discusses our recent results in turning saturation-based first-order theorem proving into a powerful framework for automating inductive reasoning. We formalize applications of induction as new inference rules of the saturation process, add instances of appropriate induction schemata to the search space, and use these rules and instances immediately upon their addition for the purpose of guiding induction. Our results show, for example, that many problems from formal verification and mathematical theories can now be solved completely automatically using a first-order theorem prover.

# **1 Introduction**

Proof by induction is commonplace in modern mathematics and computational logic. Many number-theoretic arguments rely upon mathematical induction over the natural numbers, while showing correctness of software systems typically requires structural induction over inductively-defined data types, to only name two examples. The wider automation of mathematics, logic, verification and other efforts therefore demands automating induction.

Induction can be automated by reducing goals to subgoals [1,11], so that a goal ∀x.F(x) is proved by induction on x. However, splitting goals into subgoals and organizing proof search accordingly requires expert guidance. As an alternative, inductive reasoning has recently appeared in SMT solvers [13] and first-order theorem provers [3,12,14], complementing strong support for reasoning with theories and quantifiers. These approaches do not reduce goals to subgoals but instead implement tailored instantiations of induction schemas [3,12,14], adjust the underlying calculus with inductive generalizations [4] and function rewriting [6], extend theory reasoning for proving inductive formulas [8], and integrate induction with rewriting for generating auxiliary inductive properties during proof search [5].

This paper describes our recent efforts in these directions, breaking new ground in the automation of inductive reasoning. The distinctive feature of our work is the mechanization of mathematical induction in saturation-based first-order theorem proving, thus turning saturation-based proof search into a powerful framework to reason about software technologies, in particular about inductive properties of functional and imperative programs.

# **2 Induction in Saturation - In a Nutshell**

Our work combines very efficient superposition-based equational reasoning with inductive reasoning, by extending superposition with new inference rules capturing inductive steps within saturation. We refer to these inference rules as *induction rules* and consider them in addition to superposition inferences during proof-search. Following the approach of [12], we capture the application of induction via the following general induction rule:

$$\frac{\overline{L}[t] \lor C}{F \to \forall x. L[x]} \text{ (Ind)},$$

where L[t] is a ground literal, C is a clause, and F → ∀x.L[x] is a valid induction schema. Further, $\overline{L}[t]$ denotes the negation of L[t].

In our work, we consider extensions and variants of the induction rule Ind, in order to add instances of appropriate induction schemata over inductive formulas to be proved. We call these instances *induction axioms* and automate induction in saturation via the following two inter-connected steps:

(i) applying induction rules that add new induction axioms to the search space; and
(ii) integrating these rules with the standard machinery of saturation-based proof search.
For step (i), we pick a formula G in the search space and use induction rules to add new induction axioms Ax to the search space, aiming at proving ¬G, or sometimes a formula more general than ¬G. While our inference rules implement inductive reasoning upon G using Ax, adding only these inference rules to superposition-based proof search would be insufficient for efficient theorem proving. Modern saturation-based theorem provers are powerful not just because of the logical calculi they are based on, such as superposition. What makes them powerful and efficient are redundancy criteria and pruning of the search space; strategies for directing proof search, mainly by clause and inference selection; and theory-specific reasoning, such as built-in support for data types [8]. Therefore, in addition to devising new induction rules in (i), in (ii) we bring redundancy elimination, proof search options, and theory axioms/rules to saturation with induction.

As a result of the combined efforts of (i)–(ii), induction in saturation maintains the efficiency of standard saturation and is not limited to induction over specific (well-founded) theories. Such genericity is particularly important for applying our results in the formal analysis of system requirements. For example, proving that every element in the computer memory is initialized, or that no execution of a user request interferes with another user request, typically requires inductive reasoning with integers and arrays.

In the rest of the paper, we illustrate the automation of induction in saturation within the following three use-cases:


# **3 Induction and Arithmetic**

We first discuss our work in proving inductive properties over (sums of) integers. While integers with the standard <-ordering are not well-founded, we show that we can apply, and automate, induction over any integer interval with a finite bound [7].

In the sequel, we assume a distinguished *integer sort*, denoted by Z. When we use standard integer predicates <, ≤, >, ≥, functions +, −, ..., and constants 0, 1, 2, ..., we assume that they denote the corresponding interpreted integer predicates and functions with their standard interpretations. All other symbols are uninterpreted. We write quantifiers like ∀x ∈ Z to denote that x has the integer sort.

**Example of Induction over Integers.** Consider the recursive function sum of Fig. 1(a), computing the sum of the integers from the integer interval [0, n]. We aim to prove the assertion of Fig. 1(a), denoted via **assert** and stating that the value computed by sum is the closed-form expression describing the sum of the first n positive integers.

In order to prove the assertion of Fig. 1(a) within saturation-based proof search, we proceed as follows. We convert the function definition of sum into first-order axioms and negate the assertion of Fig. 1(a), skolemizing n as σ. We obtain the following unit clauses, with each clause being implicitly universally quantified:

$$\mathsf{sum}(0) = 0 \tag{1}$$

$$n = 0 \lor \mathsf{sum}(n) = n + \mathsf{sum}(n-1) \tag{2}$$

$$\sigma \ge 0 \tag{3}$$

$$2 \cdot \mathsf{sum}(\sigma) \neq \sigma \cdot (\sigma + 1) \tag{4}$$

Clauses (1)–(2) result from the functional definition of sum, whereas clauses (3)–(4) yield the clausified negation of the assertion of Fig. 1(a). We then continue by applying inference rules on these clauses with the goal of refuting the negated assertion by deriving the empty clause, corresponding to a contradiction.
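Fig. 1(a) itself is not reproduced in this chunk; the following Python sketch is a direct executable reading of clauses (1)–(2) and of the assertion whose negation yields (3)–(4). The name `rec_sum` is ours, not taken from Fig. 1(a):

```python
def rec_sum(n: int) -> int:
    """Recursive sum over the interval [0, n], mirroring clauses (1)-(2):
    sum(0) = 0 and, for n != 0, sum(n) = n + sum(n - 1)."""
    if n == 0:
        return 0
    return n + rec_sum(n - 1)

# The assertion of Fig. 1(a): 2 * sum(n) = n * (n + 1) for all n >= 0.
# Clauses (3)-(4) are its skolemized negation; here we merely check
# finitely many instances, whereas the prover refutes the negation for all n.
for n in range(100):
    assert 2 * rec_sum(n) == n * (n + 1)
```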

**Induction Rule over Integers.** When considering integers, we adjust the general induction rule Ind by considering induction over well-founded (integer) intervals. In particular, for proving property (4) of Fig. 1(a), we use the following extension of the Ind rule, where b is a ground term of integer sort:

$$\frac{\overline{L}[t] \lor C \quad t \ge b}{L[b] \land \forall y.(y \ge b \land L[y] \to L[y+1]) \to \forall x.(x \ge b \to L[x])} \quad (\mathsf{IntInd}_{\ge})$$

To refute the negated assertion (4), we instantiate the IntInd≥ rule with L[σ] being 2 · sum(σ) = σ · (σ + 1) and b set to 0, thus deriving the following induction axiom as an instance of the induction schema of IntInd≥:

$$\begin{aligned}
&\big(2 \cdot \mathsf{sum}(0) = 0 \cdot (0 + 1)\\
&\quad\land\ \forall y \in \mathbb{Z}.\,(y \ge 0 \land 2 \cdot \mathsf{sum}(y) = y \cdot (y + 1) \to 2 \cdot \mathsf{sum}(y + 1) = (y + 1) \cdot ((y + 1) + 1))\big)\\
&\to \forall x \in \mathbb{Z}.\,(x \ge 0 \to 2 \cdot \mathsf{sum}(x) = x \cdot (x + 1))
\end{aligned} \tag{5}$$

Recall that saturation-based provers work with clauses rather than with arbitrary formulas. Therefore, the induction axiom (5) is clausified and its clausal normal form (CNF), given below, is added to the search space, where y is skolemized as σ′:

$$2 \cdot \mathsf{sum}(0) \neq 0 \cdot (0 + 1) \lor 2 \cdot \mathsf{sum}(\sigma') = \sigma' \cdot (\sigma' + 1) \lor \neg(x \ge 0) \lor 2 \cdot \mathsf{sum}(x) = x \cdot (x + 1) \quad \text{(6)}$$

$$\begin{aligned} 2 \cdot \mathsf{sum}(0) \neq 0 \cdot (0+1) \lor 2 \cdot \mathsf{sum}(\sigma'+1) \neq (\sigma'+1) \cdot ((\sigma'+1)+1) \lor \neg(x \ge 0) \lor \\ 2 \cdot \mathsf{sum}(x) = x \cdot (x+1) \end{aligned} \tag{7}$$

**Optimizing Induction in Saturation.** Simply instantiating IntInd<sup>≥</sup> and adding the corresponding induction axiom for any clause L[t] ∨ C in the search space would however be inefficient: considering L[t]∨C just like any other clause in saturation may trigger the application of too many inferences. Therefore, we treat premises L[t] ∨ C of induction rules differently in order to *guide the saturation algorithm* in two ways.

First, we ensure that an application of Ind or IntInd≥ is followed by a binary resolution step in which the conclusion of an induction rule is resolved with its (inductive) premise(s). For example, to derive a refutation from (6) and (7), we apply binary resolution on (6) and (7) with (3) and (4), resolving away the last two literals of (6) and (7). A refutation of (4) is then easily derived by using the axioms (1)–(2) defining sum together with arithmetic reasoning over integers. For example, our theorem prover Vampire [9] finds a refutation of (4) in almost no time<sup>1</sup>.

Second, induction can be very explosive – i.e., it may generate many consequences of which only few lead to a refutation. Therefore, in practice, we implement additional requirements on the premises of Ind and IntInd≥, to be used during saturation. Among others, we use heuristics on whether the term t must contain a symbol from the conjecture we are trying to prove; whether we apply induction on non-unit clauses; or whether (in the case of integer induction) we allow L[t] to be a comparison or equality literal, and if so, how many times and in which positions it may contain the term t.

<sup>1</sup> Empirical data reported in this paper have been obtained on computers with AMD Epyc 7502 2.5 GHz processors and 1 TB RAM.

**Fig. 2.** Inductive reasoning with arrays, with valA(j) denoting A[j].

**Induction and Theories.** We note that the sum function and the corresponding assertion of Fig. 1(a) can also be encoded using natural numbers as an inductively defined data type. While the resulting encoding of Fig. 1(a) holds over the naturals, proving it over naturals becomes very complex in practice, as natural numbers do not have built-in arithmetic axioms but rely on term algebra axioms [8]. As a result, when proving Fig. 1(a) over naturals, we also have to prove addition and multiplication properties of the naturals, which themselves require induction, making efficient proof search hard. Our work therefore advocates the combination of inductive reasoning with theory-specific inference rules; in the case of Fig. 1(a), this is the application of induction over integers.

Application of induction over integers becomes especially beneficial when proving complex, non-linear arithmetic conjectures. Figure 1(b) shows such a use case: a function sum_evsq that recursively computes the sum of squares of the first n positive even integers. The assertion of Fig. 1(b) is the well-known closed-form formula for the sum computed by sum_evsq. To prove this assertion, we follow a similar recipe as for Fig. 1(a): instantiate induction inferences over integers with the equality from the assertion, resolve the conclusion of the induction axiom with the literals of the assertion, and then prove the base case and the step case using arithmetic reasoning combined with the definition of sum_evsq. Thanks to theory-specific reasoning together with induction, Vampire proves Fig. 1(b) in less than 1 s.
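Since Fig. 1(b) is not reproduced in this chunk, the sketch below gives one plausible executable reading of it: we assume the standard closed form 2² + 4² + ... + (2n)² = 2n(n+1)(2n+1)/3, and the name `sum_evsq` follows the text:

```python
def sum_evsq(n: int) -> int:
    """Sum of squares of the first n positive even integers,
    2^2 + 4^2 + ... + (2n)^2, computed recursively."""
    if n == 0:
        return 0
    return (2 * n) ** 2 + sum_evsq(n - 1)

# Assumed closed form: 2n(n+1)(2n+1)/3. Multiplying through by 3
# keeps the check in integer arithmetic.
for n in range(50):
    assert 3 * sum_evsq(n) == 2 * n * (n + 1) * (2 * n + 1)
```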

# **4 Induction over Arrays**

We next describe applications of induction in saturation while proving the functional correctness of array-manipulating programs.

**Example of Induction over Arrays.** Consider the imperative program of Fig. 2, annotated with pre-condition (**assume**), post-condition (**assert**) and loop invariant (**invariant**).

Given the pre-condition and the invariant, we aim to prove that, upon loop termination, each A[j] holds the sum of the first j positive integers. Note that the assumed termination of the loop implies the negation of the loop condition: ¬(i < A.size). With this additional formula, the assertion of Fig. 2 clearly holds; yet proving it automatically requires inductive reasoning.

**Induction Rule over Arrays of Integers.** We consider the following variant of the IntInd<sup>≥</sup> rule, using induction over a finite integer interval:

$$\frac{\overline{L}[t] \lor C \quad t \ge b_1 \quad t \le b_2}{L[b_1] \land \forall x.(b_1 \le x < b_2 \land L[x] \to L[x+1]) \to \forall y.(b_1 \le y \le b_2 \to L[y])} \quad (\mathsf{IntInd}_{[\ge]}),$$

where L[t] is a ground literal, and b1, b2 are ground terms. We instantiate IntInd[≥] based on the negated, skolemized and clausified assertion of Fig. 2; to do so, we set L[σ] to 2 · valA(σ) = σ · (σ + 1), and take b1 to be 0 and b2 to be A.size − 1. We further clausify the resulting induction axiom; resolve the clausified axiom against the premises of IntInd[≥]; and finally refute the remaining literals using the invariant, pre-condition and negated loop condition of Fig. 2 within integer arithmetic.
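For concreteness, instantiating the schema of IntInd[≥] with L[x] being 2 · valA(x) = x · (x + 1), b1 = 0 and b2 = A.size − 1 yields the following induction axiom (our reconstruction from the rule above, not copied from [7]):

$$\begin{aligned}
&\big(2 \cdot \mathsf{valA}(0) = 0 \cdot (0 + 1)\\
&\quad\land\ \forall x.\,(0 \le x < \mathsf{A.size} - 1 \land 2 \cdot \mathsf{valA}(x) = x \cdot (x + 1) \to 2 \cdot \mathsf{valA}(x + 1) = (x + 1) \cdot ((x + 1) + 1))\big)\\
&\to \forall y.\,(0 \le y \le \mathsf{A.size} - 1 \to 2 \cdot \mathsf{valA}(y) = y \cdot (y + 1))
\end{aligned}$$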

Note that, unlike in the examples of Fig. 1, one of the bounds of the interval over which we apply induction is *symbolic* – an uninterpreted constant. This is a powerful generalization which allows us to reason about arrays regardless of their specific length. In practice, induction over arrays of integers in Vampire proves the assertion of Fig. 2 in around 1 s.

# **5 Induction over Lists**

We finally present our efforts towards proving inductive properties of functional programs, using a combination of inductively defined data types. We use two datatypes, natural numbers and lists over natural numbers, denoted respectively by N and L. We assume that these datatypes are axiomatised by the *distinctness*, *exhaustiveness* and *injectivity* axioms of term algebras [8].

**Example of Induction over Lists.** Consider the functional program of Fig. 3. We aim to prove the assertion (**assert**) expressing that reversing a list an even number of times results in the same list; to do so, we use an assumption (**assume**) corresponding to an inductive lemma. For proving the assertion of Fig. 3, we translate the function definitions and assumption of Fig. 3 into first-order axioms, negate the assertion of Fig. 3, and clausify the resulting formulas. As a result, the following two clauses are obtained from the negated assertion, respectively introducing Skolem constants σ1 and σ2 for n and xs:

$$\mathsf{even}(\sigma_1) \tag{8}$$

$$\mathsf{revN}(\sigma_2, \sigma_1) \neq \sigma_2 \tag{9}$$

**Fig. 3.** Inductive reasoning with natural numbers and list datatypes.
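Fig. 3 itself is not reproduced in this chunk; the Python below is one plausible executable reading of its definitions and assertion. The function names follow the clauses above, but their exact recursive definitions in Fig. 3 are assumed:

```python
def rev(xs):
    """List reversal; in Fig. 3 this would be defined recursively over L."""
    return list(reversed(xs))

def revN(xs, n):
    """Apply rev to xs n times (n a natural number)."""
    return xs if n == 0 else rev(revN(xs, n - 1))

# Assertion of Fig. 3: if even(n) then revN(xs, n) = xs.
for n in range(0, 10, 2):
    assert revN([1, 2, 3], n) == [1, 2, 3]
```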

**Induction Rule over Lists.** A suitable induction formula that refutes clause (9) is generated in two steps. A formula generated solely from clause (9) may be too strong. Hence, we generate an induction formula that takes clause (8) into account as well. Doing so, we use a generalization of the Ind rule that works on an arbitrary number of premises. Namely, we use the following induction rule with two premises:

$$\frac{\overline{L}[t] \lor C \quad \overline{L'}[t] \lor C'}{F \to \forall x.(L[x] \lor L'[x])} \ (\mathsf{Ind}'),$$

where L[t] and L′[t] are ground literals, C and C′ are clauses, and F → ∀x.(L[x] ∨ L′[x]) is a valid induction schema.

Second, to generate a suitable antecedent F for the induction schema, we notice that the recursion used in the definition of even suggests an induction principle different from standard structural induction over natural numbers. These insights lead us to generate the following induction axiom:


After clausifying this axiom and resolving the conclusion literals with the premises (8) and (9), a first-order refutation using the term algebra axioms and the clausified function definitions and assumption of Fig. 3 is straightforward; Vampire finds a refutation almost immediately.

**Optimizing Induction in Saturation.** Note that Fig. 3 uses an auxiliary inductive lemma (**assume**) in order to prove the assertion of Fig. 3. An additional challenge in automating the proof of the assertion of Fig. 3 therefore comes with the task of generating and proving auxiliary inductive lemmas during saturation.

Proving the lemma of Fig. 3 needs further induction steps; however, the generation of a suitable induction formula is only triggered by *an instance* of the respective lemma. Since the superposition calculus is optimized to avoid generating clauses unnecessary for first-order reasoning, either (i) we tweak the parameters of superposition such that the generation of an instance of the lemma *is necessary* for first-order reasoning, or (ii) we perform additional sound inferences (on top of superposition and induction inferences) to derive these instances.

Addressing these challenges, we develop different term ordering families (e.g. KBO or LPO), parameterized by various symbol precedences or weight functions; and devise literal selection functions to vary the inferred consequences of a subgoal [5]. As a result, we select different inductive lemmas during saturation. Further, we use function definitions not as axioms but as rewrite rules, in order to ensure that recursively defined functions are expanded/rewritten into their (likely much larger) definitions [6]. With such optimizations at hand, Vampire proves the assertion of Fig. 3 without using the asserted inductive lemma (**assume**), but by generating the respective inductive lemma of Fig. 3 completely automatically.

# **6 Conclusions and Outlook**

Automated reasoning about system requirements is one of the most active areas of formal methods [2,10]. Our work addresses recent reasoning demands in the presence of induction, needed for example in proving safety and security requirements over software systems or establishing mathematical conjectures. In particular, we turn saturation-based first-order theorem proving into a powerful workhorse for automating induction. When integrating induction in saturation, the space of possibilities to exploit is very large. As such, should one approach fail to bring considerable improvements, one may quickly study and investigate other approaches, allowing for further improvements and advancements in mechanizing induction. As saturation-based first-order theorem proving is not yet fully integrated in the tool chain for ensuring software reliability, we believe automating induction in saturation will bring significant further advances in the theory and practice of both automated reasoning and formal verification.

**Acknowledgements.** We thank the entire developer team of the Vampire theorem prover, and in particular Daneshvar Amrollahi, Michael Rawson, Giles Reger, and Martin Suda, for valuable discussions. We acknowledge generous funding from the ERC Consolidator Grant ARTIST 101002685, the TU Wien Doctoral College SecInt, the FWF SFB project SpyCoDe F8504, the WWTF ICT22-007 grant ForSmart, and the Amazon Research Award 2023 QuAT.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Stepping Stones in the TPTP World

Geoff Sutcliffe

University of Miami, Miami, USA geoff@cs.miami.edu

Abstract. The TPTP World is a well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. There are key components that help make the TPTP World a success: the TPTP problem library was first released in 1993, the CADE ATP System Competition (CASC) was conceived after CADE-12 in 1994, problem difficulty ratings were added in 1997, the current TPTP language was adopted in 2003, the SZS ontologies were specified in 2004, the TSTP solution library was built starting around 2005, the Specialist Problem Classes (SPCs) have been used to classify problems since 2010, the SystemOnTPTP service has been offered from 2011, the StarExec service was started in 2013, and a world of TPTP users have helped all along. This paper reviews these stepping stones in the development of the TPTP World.

# 1 Introduction

The TPTP World [38,41] (once gently referred to as the "TPTP Jungle" [3]) is a well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. Salient components of the TPTP World are the TPTP languages [47], the TPTP problem library [37], the TSTP solution library [38], the SZS ontologies [36], the Specialist Problem Classes (SPCs) and problem difficulty ratings [51], SystemOnTPTP [33] and StarExec [32], and the CADE ATP System Competition (CASC) [40]. There are dependencies between these parts of the TPTP World, as shown in Fig. 1, forming a series of "stepping stones" from TPTP standards to happy users, described in the following sections of this paper . . .


There is a cycle of dependencies from the TPTP problem library, to the TSTP solution library, to the problem difficulty ratings, and back to the TPTP problem library. This cycle means that building these components, particularly building releases of the TPTP problem library, requires iteration until stability is reached.

Fig. 1. Dependencies between the Stepping Stones

Various parts of the TPTP World have been deployed in a range of applications, in both academia and industry. Since the first release of the TPTP problem library in 1993, many researchers have used the TPTP World as an appropriate and convenient basis for ATP system research and development. The web page www.tptp.org provides access to all components.

### 2 The TPTP Language

The TPTP language [42] is one of the keys to the success of the TPTP World. The TPTP language is used for writing both problems and solutions, which enables convenient communication between ATP systems and tools. Originally the TPTP World supported only first-order clause normal form (CNF) [50]. Over the years full first-order form (FOF) [37], typed first-order form (TFF) [2,46], typed extended first-order form (TXF) [44], typed higher-order form (THF) [12, 43], and non-classical forms (NTF) [30] have been added. The standardisation received a significant technical boost in 2006 when the BNF definition was revised so that all the language forms are quite precisely specified.<sup>1</sup> As a result a parser can be generated using lex/yacc [54]. A general principle of the TPTP language is: "we provide the syntax, you provide the semantics". As such, there is no a priori commitment to any semantics for each of the language forms, although in almost all cases the intended logic and semantics are well known.

Problems and solutions are built from *annotated formulae* of the form

*language*(*name*, *role*, *formula*, *source*, *useful\_info*)

The *language*s supported are cnf (clause normal form), fof (first-order form), tff (typed first-order form), and thf (typed higher-order form). The *role*, e.g., axiom, lemma, conjecture, defines the use of the formula. In a *formula*, terms and atoms follow Prolog conventions – functions and predicates start with a lowercase letter or are 'single quoted', and variables start with an uppercase letter. The language also supports interpreted symbols that either start with a \$, e.g., the truth constants \$true and \$false, or are composed of non-alphabetic characters, e.g., integer/rational/real numbers such as 27, 43/92, -99.66. The logical connectives in the TPTP language are !, ?, ~, |, &, =>, <=, <=>, and <~>, for the mathematical connectives ∀, ∃, ¬, ∨, ∧, ⇒, ⇐, ⇔, and ⊕ respectively. Equality and inequality are expressed as the infix operators = and !=. The *source* and *useful\_info* are optional. Figure 2 shows an example with typed higher-order annotated formulae.

# 3 The TPTP Problem Library

The first inklings of the TPTP World emerged at CADE-10 in Kaiserslautern, Germany, in 1990, as a collaboration between Geoff Sutcliffe from the University of Western Australia (James Cook University from 1993), and Christian Suttner from the Technische Universität München. At that time a large number of test problems had accumulated in the ATP community, in both hardcopy [4,14,17,22,25,56] and electronic form [13,29].<sup>2</sup> We observed that ATP system developers were testing their systems, and publishing results, based on small numbers of selected test problems. At the same time some good ideas were seen to be abandoned because they had been tested on inappropriate selections of test problems. These observations motivated us to start collecting ATP test problems into what became the TPTP problem library. A comprehensive library of test problems is necessary for meaningful system evaluations, meaningful system comparisons, repeatability of testing, and the production of statistically significant results. The goal was to support testing and evaluation of ATP systems, to help ensure that performance results accurately reflect the capabilities of the ATP systems being considered.

<sup>1</sup> But note that the BNF "syntax" is not completely strict – the BNF uses an extended BNF form that relegates details to "strict" rules that do not have to be checked by a parser.

<sup>2</sup> To my knowledge, the first circulation of test problems was by Larry Wos in the late sixties.

Fig. 2. THF annotated formulae

Releases of the TPTP problem library are identified by numbers of the form v*Version*.*Edition*.*Patch*. The *Version* enumerates major new releases of the TPTP in which important new features have been added. The *Edition* is incremented each time new problems are added to the current version. The *Patch* level is incremented each time errors found in the current edition are corrected. The first release of the TPTP problem library, v1.0.0, was made on Friday 12th November 1993.
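Under this scheme, releases order lexicographically by (Version, Edition, Patch). A tiny illustrative helper (ours, not part of the TPTP tooling):

```python
def release_key(tag: str):
    """Parse a TPTP release tag such as 'v8.2.0' into a sortable
    (Version, Edition, Patch) tuple."""
    return tuple(int(part) for part in tag.lstrip('v').split('.'))

assert release_key('v1.0.0') == (1, 0, 0)
# Later versions, editions, and patches sort after earlier ones:
assert release_key('v6.3.0') < release_key('v6.4.0') < release_key('v8.2.0')
```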

The problems in the library are classified into *domains* that reflect a natural hierarchy, based mainly on the Dewey Decimal Classification and the Mathematics Subject Classification. Seven main fields are defined: logic, mathematics, computer science, science & engineering, social sciences, arts & humanities, and other. Each field is subdivided into domains, each identified by a three-letter mnemonic, e.g., the social science field has three domains: Social Choice Theory (SCT), Management (MGT), and Geography (GEG).

Table 1 lists the versions of the TPTP up to v9.0.0, with the new feature added, the number of problem domains, and the number of problems.<sup>3</sup> The number of domains indicates the diversity of the problems, while the number of problems indicates the size of the library. The attentive reader might note that many releases have been made in July-September. This is because the CADE ATP System Competition (CASC - see Sect. 9), has an influence on the release cycle of the TPTP.

<sup>3</sup> The data for v9.0.0 is an estimate, because this paper was written before the release was finalised.


Table 1. Overview of TPTP releases

Each TPTP problem file has three parts: a header, optional includes, and annotated formulae. The header section contains information for users, formatted as comments in four parts, as shown in Fig. 3. The first part identifies and describes the problem, the second part provides information about occurrences of the problem in the literature and elsewhere, the third part provides semantic and syntactic characteristics of the problem, and the last part contains comments and bugfix information. The header fields are self-explanatory, but of particular interest are the Status field that is explained in Sect. 5, the Rating field that is explained in Sect. 7, and the SPC field that is explained in Sect. 6. The include section is optional, and if used contains include directives for axiom files, which in turn have the same three-part format as problem files; see Fig. 3 for an example. Inclusion avoids the need for duplication of the formulae in commonly used axiomatizations. The annotated formulae are described in Sect. 2, and Fig. 2 provides an example.

# 4 The TSTP Solution Library

The complement to the TPTP problem library is the TSTP solution library [35,38]. The TSTP is built by running all the ATP systems that are available in the TPTP World on all the problems in the TPTP problem library. The TSTP started being built around 2005, using solutions provided by individual system developers. From 2010 to 2013 the TSTP was generated on a small NSF funded<sup>4</sup> cluster at the University of Miami. Since 2014 the ATP systems have been run on StarExec (see Sect. 8), initially on the StarExec Iowa cluster, and since 2018 on the StarExec Miami cluster. StarExec has provided stable platforms that produce reliably consistent and comparable data in the TSTP. At the time of writing this paper the TSTP contained the results of running 87 ATP systems and system variants on all the problems that each system could attempt (therefore, e.g., systems that do model finding for FOF are not run on THF problems). This produced 1091026 runs, of which 432718 (39.6%) solved the problem. One use of the TSTP is for computing the TPTP problem difficulty ratings (see Sect. 7).

<sup>4</sup> NSF Award 0957438.

Fig. 3. Header of problem DAT016\_1.

TSTP solution files have a header section and annotated formulae. The header section has four parts, as shown in Fig. 4: the first part identifies the ATP system, the problem, and the system's runtime parameters; the second part provides information about the hardware, operating system, and resource limits; the third part provides the SZS result and output values (see Sect. 5), and syntactic characteristics of the solution; the last part contains comments. The solution follows in annotated formulae.

For derivations, where formulae are derived from parent formulae, e.g., in proofs, refutations, etc., the *source* fields of the annotated formulae are used to capture parent-derived formulae relationships in the derivation DAG. This includes the source of the formula – either the problem file or an inference. Inference data includes the name of the inference rule used, the semantic relationship between the parents and the derived formula as an SZS ontology value (see Sect. 5), and a list of the parent annotated formulae names. Figure 5 shows an example refutation [slightly modified, type declarations omitted] from the E system [27] for the problem formulae in Fig. 2, and Fig. 4 shows the corresponding header fields.

Fig. 4. Example derivation header

For interpretations (typically models) the annotated formulae are used to describe the domains and symbol mappings of Tarskian interpretations, or the formulae that induce Herbrand models. A TPTP format for interpretations with finite domains was previously defined [47], and has served the ATP community adequately for almost 20 years. Recently the need to represent interpretations with infinite domains, and Kripke interpretations, has led to the development of a new TPTP format that supports these interpretations [49]. Figure 6 shows the problem formulae and a model that uses integers as the domain. Please read [49] for lots more details!

*Resource Limits:* A common question, often based on mistaken beliefs, is whether the resource limits used should be increased to find more solutions. Analysis shows that increasing resource limits does not significantly affect which problems are solved by an ATP system. Figure 7 illustrates this point; it plots the CPU times taken by several contemporary ATP systems to solve the TPTP problems for the FOF\_THM\_RFO\_\* SPCs, in increasing order of time taken. The data was taken from the TSTP, i.e., using the StarExec Miami computers. The relevant feature of these plots is that each system has a point at which the time taken to find solutions starts to increase dramatically. This point is called the system's Peter Principle [23] Point (PPP) – it is the point at which the system has reached its level of incompetence. Evidently a linear increase in the computational resources beyond the PPP would not lead to the solution of significantly more problems. The PPP thus defines a realistic computational resource limit for the system, and if enough CPU time is allowed for an ATP system to reach its PPP, a usefully accurate measure of what problems it can solve is achieved. The performance data in the TSTP is produced with adequate resource limits.

Fig. 5. Example derivation formulae

# 5 The SZS Ontologies

The SZS ontologies [36] (named "SZS" after the authors of the first presentation of the ontologies [52]) provide values to specify the logical status of problems and solutions, and to describe logical data. The Success ontology provides values for the logical status of a conjecture with respect to a set of axioms, e.g., a TPTP problem whose conjecture is a logical consequence of the axioms is tagged as a Theorem (as in Fig. 3), and a model finder that establishes that a set of axioms (with no conjecture) is consistent should report Satisfiable. The Success ontology is also used to specify the semantic relationship between the parent "axioms" and inferred "conjecture" of an inference, as done in the TPTP format for derivations (see Sect. 4). The NoSuccess ontology provides reasons why an ATP system/tool has failed, e.g., an ATP system might report Timeout. The Dataform ontology provides values for describing logical data, as might be output from an ATP system/tool, e.g., a model finder might output a FiniteModel. Figure 8 shows some of the salient nodes of the ontologies. Their expanded names and their (abbreviated) definitions are listed in Fig. 9.

Fig. 7. CPU times for FOF\_THM\_RFO\_\*

Fig. 8. Extract of the SZS ontologies

The SZS standard also specifies the precise way in which the ontology values should be presented in ATP system output, in order to facilitate easy processing. For example, if an ATP system has established that a conjecture is not a theorem of the axioms, by finding a finite countermodel of the axioms and negated conjecture, the SZS format output would be (see Sect. 4 for the format of the annotated formulae):

```
% SZS status CounterSatisfiable
% SZS output start FiniteModel
... annotated formulae for the finite model
% SZS output end FiniteModel
```
# 6 Specialist Problem Classes

The TPTP problem library is divided into Specialist Problem Classes (SPCs) – classes of problems with the same recognizable logical, language, and syntactic characteristics that make the problems in each SPC homogeneous wrt ATP systems. Evaluation of ATP systems within SPCs makes it possible to say which systems work well for what types of problems. The appropriate level of subdivision for SPCs is that at which less subdivision would merge SPCs for which ATP systems have distinguishable behaviour, and at which further subdivision would unnecessarily split SPCs for which ATP systems have reasonably homogeneous behaviour. Empirically, homogeneity is ensured by examining the patterns of system performance across the problems in each SPC. The SPCs for essentially propositional problems were motivated by the observation that SPASS [55] performed differently on the ALC problems in the SYN domain of the TPTP. A data-driven test for homogeneity is also possible [6].


Fig. 9. SZS ontology values

The characteristics currently used to define the SPCs in the TPTP are shown in Fig. 10. Using these characteristics 223 SPCs are defined in TPTP v8.2.0. For example, the SPC TF0\_THM\_NEQ\_ARI contains typed monomorphic first-order theorems that have no equality but include arithmetic. The header section of each problem in the TPTP problem library (see Sect. 3) includes its SPC. The SPCs are used when computing the TPTP problems difficulty ratings (see Sect. 7).

# 7 Problem Difficulty Ratings

Each TPTP problem has a difficulty rating that provides a well-defined measure of how difficult the problem is for current ATP systems [51]. The ratings are based on performance data in the TSTP solution library (see Sect. 4), and are updated in each TPTP edition.

The TPTP tags problems that are designed specifically to be suited or ill-suited to some ATP system, calculus, or control strategy as *biased* (this includes the SYN000 problems, which are designed for testing ATP systems' parsers). The biased problems are excluded from the calculations. Rating is then done separately for each SPC (see Sect. 6).


Fig. 10. SPC characteristics

First, a partial order between systems is determined according to whether or not a system solves a strict superset of the problems solved by another system. If a strict superset is solved, the first system is said to *subsume* the second. The union of the problems solved by the non-subsumed systems defines the state-of-the-art – all the problems that are solved by any system. The fraction of non-subsumed systems that fail on a problem is the difficulty rating for the problem: problems that are solved by all of the non-subsumed systems get a rating of 0.00 (easy); problems that are solved by some of the non-subsumed systems get a rating between 0.00 and 1.00 (difficult); problems that are solved by none of the non-subsumed systems get a rating of 1.00 (unsolved). As additional output, the systems are given a rating for each SPC – the fraction of difficult problems they can solve.
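The rating computation described above can be sketched as follows; this is a simplified reconstruction, not the actual TPTP tooling, and the system and problem names in the demo are invented:

```python
def ratings(problems, solved_by):
    """Compute TPTP-style difficulty ratings for one SPC.
    problems:  set of all (unbiased) problem names in the SPC.
    solved_by: dict mapping system name -> set of problems it solves."""
    systems = list(solved_by)
    # A system is subsumed if another system solves a strict superset
    # of the problems it solves.
    non_subsumed = [s for s in systems
                    if not any(solved_by[s] < solved_by[t] for t in systems)]
    # Rating = fraction of non-subsumed systems that fail on the problem:
    # 0.00 = easy, 1.00 = unsolved.
    return {p: sum(p not in solved_by[s] for s in non_subsumed) / len(non_subsumed)
            for p in problems}

# Invented demo data: SysB is subsumed by SysA ({'P1'} is a strict
# subset of {'P1', 'P2'}), so only SysA and SysC count.
demo = {'SysA': {'P1', 'P2'}, 'SysB': {'P1'}, 'SysC': {'P2', 'P3'}}
r = ratings({'P1', 'P2', 'P3', 'P4'}, demo)
assert r['P2'] == 0.0   # solved by all non-subsumed systems
assert r['P1'] == 0.5   # solved by SysA but not SysC
assert r['P4'] == 1.0   # solved by no system
```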

The TPTP difficulty ratings provide a way to assess progress in the field – as problems that are unchanged are not actually getting easier, decreases in their difficulty ratings are evidence of progress in ATP systems. Figures 11 and 12 plot the average difficulty ratings overall and for each of the four language forms in the TPTP World (after some sensible data cleaning). Figure 11 is taken from [41], published in 2017. It plots the average ratings for the 14527 problems that had been unchanged and whose ratings had not been stuck at 0.00 or 1.00, from TPTP v5.0.0 that was released in 2010 to v6.4.0 that was released in 2016. Figure 12 is taken from [45], published in 2024. It plots the average ratings for the 16236 problems that had been unchanged and whose ratings had not been stuck at 0.00 or 1.00, from TPTP v6.3.0 that was released in 2016 to v8.2.0 that was released in 2023. The two figures' plots dovetail quite well, which gives confidence that they really are comparable (there are some minor differences caused by slightly different data cleaning and by recent refinements to the difficulty rating calculations). The older plots show a quite clear downward trend both overall and for the four language forms, while the new plots do not. Possible reasons are discussed in [45].

Fig. 11. Ratings from v5.0.0 to v6.4.0 Fig. 12. Ratings from v6.3.0 to v8.2.0

# 8 SystemOnTPTP and StarExec

Some of the early users of the TPTP problem library (see Sect. 3) were working in disciplines other than computer science, e.g. (with a few exemplar references), mathematics [16,26], logic [8,11], management [20,21], planning [28]. These users often selected the Otter system [15] for their experiments, as it was readily available and easy enough to install. As the TPTP World evolved it was clear that more powerful ATP systems were available, especially evident in the CADE ATP System Competition (see Sect. 9). However, these more powerful systems were often not as easy to obtain and install. In order to make the use of ATP easier for non-geek users, an NSF grant was obtained<sup>5</sup> to build the SystemOnTPTP online service, which provides hardware and a web interface for users to submit their problems to most recent versions of a wide range of ATP systems. SystemOnTPTP can provide recommendations for ATP systems that might be most likely to solve a problem, based on the SPC of the user's problem and the system ratings for the SPC (see Sects. 6 and 7). SystemOnTPTP can also run ATP systems (e.g., the recommended systems) in competition parallel [48]. The core SystemOnTPTP service is supported by (i) the SystemB4TPTP service that provides tools to prepare problems for submission to ATP systems, e.g.,

<sup>5</sup> NSF Award 1405674.

axiom selection [9], type checking [12], a scripting language [39], and (ii) the SystemOnTSTP service that provides tools for analysing solutions output from SystemOnTPTP, e.g., interactive viewing of derivations [53], proof checking [34], identification of interesting lemmas in a proof [24]. While many users enjoy interactive use of the services in a web browser, it is also easy to use the services programmatically – in fact this is where most of the use comes from. In 2023 there were 887287 requests serviced (an average of one every 36 s), from 286 countries. One heavy programmatic user is the Sledgehammer component of the interactive theorem prover Isabelle [19].

The TPTP problem library was motivated (see Sect. 3) by the need to provide support for meaningful ATP system evaluation. This need was also (or became) evident in other logic solver communities, e.g., SAT [10] and SMT [5]. For many years testing of logic solvers was done on individual developers' computers. In 2010 a proposal for centralised hardware and software support was developed, and in 2011 a \$2.11 million NSF grant<sup>6</sup> was obtained. This grant led to the development and availability of StarExec Iowa [32] in 2012, and a subsequent \$1.00 million grant<sup>7</sup> in 2017 expanded StarExec to Miami. StarExec has been central to much progress in logic solvers over the last 10 years, supporting 16 logic solver communities, used for running many annual competitions [1], and supporting a great many users.

It was recently announced that StarExec Iowa will be decommissioned. The maintainer of StarExec Iowa explained that "the plan is to operate StarExec as usual for competitions Summer 2024 and Summer 2025, and then put the system into a read-only mode for one year (Summer 2025 to Summer 2026)". StarExec Miami will continue to operate for as long as funding is available, but it will not be able to support the large number of logic solver communities that use the larger StarExec Iowa cluster. In the long run it will be necessary for StarExec users to transition to new environments, and several plans are (at the time of writing) being discussed. One effort, funded by an Amazon Research Award<sup>8</sup>, is to containerise StarExec and ATP systems so they will run in a Kubernetes framework on Amazon Web Services [7]. This will allow communities and individual users to run their own StarExec.

# 9 The CADE ATP System Competition

The CADE ATP System Competition (CASC) [40] is the annual evaluation of fully automatic, classical logic, ATP systems – the world championship for such systems. CASC is held at each CADE (the International Conference on Automated Deduction) and IJCAR (the International Joint Conference on Automated Reasoning) conference – the major forums for the presentation of new

<sup>6</sup> NSF Awards 1058748 and 1058925, led by Aaron Stump and Cesare Tinelli at the University of Iowa.

<sup>7</sup> NSF Award 1730419.

<sup>8</sup> Amazon Research Award "Automated Theorem Proving Community Infrastructure in the AWS Cloud", www.amazon.science/research-awards/recipients/geoffreysutcliffe.

research in all aspects of automated deduction. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, motivate development and implementation of robust ATP systems that can be easily and usefully deployed in applications, provide an inspiring environment for personal interaction between ATP researchers, and expose ATP systems within and beyond the ATP community. Over the years CASC has been a catalyst for impressive improvements in ATP, stimulating both theoretical and implementation advances [18]. It has provided a forum at which empirically successful implementation efforts are acknowledged and applauded, and at the same time provides a focused meeting at which novice and experienced developers exchange ideas and techniques. The first CASC was held at CADE-13 in 1996, at DIMACS in Rutgers University, USA. The CASC web site provides access to all the details: www.tptp.org/CASC. Of particular interest for this IJCAR is that CASC was conceived 30 years ago in 1994 after CADE-12 in Nancy, when Christian Suttner and I were sitting on a bench under a tree in Parc de la Pépinière, burning time before our train departures.

CASC is run in divisions according to problem and system characteristics, in a coarse version of the SPCs (see Sect. 6). Problems for CASC are taken from the TPTP problem library (see Sect. 3), and some other sources. The TPTP problem ratings (see Sect. 7) are used to select appropriately difficult problems from the TPTP problem library. The systems are ranked according to the number of problems solved with an acceptable proof/model output. Ties are broken according to the average time over problems solved.

The design and organization of CASC have evolved over the years to a sophisticated state. From CASC-26 in 2017 to CASC-29 in 2023 the competition stayed quite stable, but each year the various divisions, evaluations, etc., were optimized (as in prior years, when there were also larger changes in the competition design). The changes in the divisions reflect the changing demands and interest in different types of ATP problems, and decisions made for CASC (in the context of the TPTP World) have had an influence on the directions of development in ATP. Over the years 11 divisions have been run . . .


In the 20 CASCs so far 111 distinct ATP systems have been entered. For each CASC the division winners of the previous CASC are automatically entered to provide benchmarks against which progress can be judged. Some systems have emerged as dominant in some of the divisions, with Vampire being a well-known example. The strengths of these systems stem from four main areas: solid theoretical foundations, significant implementation efforts (in terms of coding and data structures), extensive testing and tuning, and an understanding of how to optimize for CASC.

# 10 TPTP World Users

Over the years the TPTP World has provided a platform upon which ATP users have presented their needs to ATP system developers, who have then adapted their ATP systems to the users' needs. The interplay between the TPTP problem library (see Sect. 3) and the CADE ATP System Competition (see Sect. 9) has been particularly effective as a conduit for ATP users to provide samples of their problems to ATP system developers. Users' problems that are contributed to the TPTP are eligible for use in CASC. The problems are then exposed to ATP system developers, who improve their systems' performances on the problems, in order to perform well in CASC. This completes a cycle that provides the users with more effective tools for solving their problems.

Many people have contributed to the TPTP World, with problems, software, advice, expertise, and enthusiasm. I am grateful to them all<sup>9</sup> and here are just a few who have made salient contributions (ordered roughly by date of the contributions mentioned):


### Thank you!

<sup>9</sup> See www.tptp.org/TPTP/TR/TPTPTR.shtml#Conclusion

# 11 Conclusion

This paper has described key components of the TPTP World that help make it a success, linking them together as "stepping stones" that lead from one component to another. The large number of citations to work of others, and explicitly Sect. 10, illustrate how the TPTP World has benefited, and benefited from, users in the ATP community. I am also grateful to the many people who have donated hard cash to the project, helping to keep it alive!

This paper has naturally focused on the successful parts of the TPTP World. There have also been some failed developments and suboptimal (in retrospect) decisions. For example, in 2015 there was an attempt to develop a description logic form for the TPTP language. While some initial progress was made, it ground to a halt without support from the description logic community. A suboptimal design decision, rooted in the early days of the TPTP, is the naming scheme used for problem files. The naming scheme uses three digits to number the problems in each domain, thus setting a limit of 1000 problems, which failed to anticipate the numbers of problems that would be contributed to some of the problem domains. This has been overcome by creating multiple domain directories where necessary, but if it were to be done again, six- or eight-digit problem numbers shared across all domains would be an improvement.

The maintenance and development of the TPTP World is ongoing work. The most recent development is the languages and support for non-classical logics, initially modal logic [30,31]. The new format for representing interpretations (see Sect. 4) will be promulgated in the near future. As always, the ongoing success and utility of the TPTP problem library depends on ongoing contributions of problems – the automated reasoning community is encouraged to continue making contributions of all types of problems.

The TPTP World would not exist without the early strategic insights of Christian Suttner, with his willingness to let me do the organization without interference. Maybe his most wonderful contribution (which took him over two hours to produce when he was visiting me at James Cook University – I think he took a nap) is his wonderfully simple plain-language definition of automated theorem proving: "the derivation of conclusions that follow inevitably from known facts".

# References


Logic. Technical report. CCSOM Preprint 92-74, Department of Statistics and Methodology, University of Amsterdam (1992)


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Theorem Proving and Tools**

# An Empirical Assessment of Progress in Automated Theorem Proving

Geoff Sutcliffe<sup>1</sup>, Christian Suttner<sup>2</sup>, Lars Kotthoff<sup>3</sup>, C. Raymond Perrault<sup>4</sup>, and Zain Khalid<sup>1</sup>

> <sup>1</sup> University of Miami, Miami, USA – geoff@cs.miami.edu, zsk17@miami.edu
> <sup>2</sup> Miami, USA
> <sup>3</sup> University of Wyoming, Laramie, USA – larsko@uwyo.edu
> <sup>4</sup> SRI International, Menlo Park, USA – ray.perrault@sri.com

Abstract. The TPTP World is a well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. This work uses data in the TPTP World to assess progress in ATP from 2015 to 2023.

Keywords: Automated Theorem Proving · Empirical Evaluation · Progress

# 1 Introduction

The TPTP World [69] (www.tptp.org) is a well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. The TPTP World includes the TPTP problem library, the TSTP solution library, standards for writing ATP problems and reporting ATP solutions, tools and services for processing ATP problems and solutions, and it supports the CADE ATP System Competition (CASC). This work uses data in the TPTP World to assess progress in ATP from 2015 to 2023.

Any meaningful assessment of progress in ATP must refer to the ability of ATP systems to solve problems. As the systems improve over time, the problems that they must solve also change to meet the demands of applications (with a fixed set of problems the systems can simply be finely tuned to the set, with inevitable asymptotic progress towards solving all the problems [52]). The TPTP problem library provides an evolving set of ATP problems that reflect the needs of ATP users, and is an appropriate basis for assessing the changing ability of ATP systems (the library is almost monotonically growing, but occasionally problems are removed – see Sect. 4.1). Alongside the TPTP problem library, the TSTP solution library provides data about ATP systems' abilities to solve the problems in the TPTP problem library. This paper examines progress in ATP,

C. Suttner—Deceased.

c The Author(s) 2024

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 53–74, 2024. https://doi.org/10.1007/978-3-031-63498-7\_4

based on the data from TPTP v6.3.0 released on 28th November 2015 to TPTP v8.2.0 released on 13th June 2023. It is important to differentiate between evaluations at instances in time, such as provided by competitions, and evaluations over time. At instances of time the test problems used for evaluation, the systems being evaluated, and the hardware/software platform used, are static, e.g., as done in [20]. That provides a clean basis for a detailed comparison between systems. In contrast, evaluation over time is complicated by changing test problems, changing systems, and changing hardware/software. This dynamic evaluation environment requires additional control to provide meaningful results. The analyses done in this work do not explicitly factor in the resources needed to find solutions, e.g., hardware, time limits, etc.; Sect. 3.1 explains why this makes sense in the ATP context.

*Related Work:* The use of system performance data to evaluate a field of endeavour is common. In the realm of logic-based systems, examples include the various competitions [6] for logic-based systems (e.g., CASC [68], the SAT Competition [36], SMT-COMP [5], the ASP Competition [18]), longitudinal surveys of competitions [20,75], the Technical Performance chapter of Stanford University's AI Index Annual Report [45], the use of Shapley values to evaluate algorithmic improvements in SAT solving [25,41], comparison of algorithmic and hardware advances (in SAT solving) [24], and other more specialized benchmarking, e.g., [89]. A general examination of the requirements for such benchmarking is provided by [9]. An ontology of artificial intelligence benchmarks is described in [14]. [52] provides an insightful analysis of the global dynamics of using benchmark sets in computer vision and natural language processing, and the takeaway messages are broadly applicable, including to benchmark sets for logic-based systems. In all cases the common measures for evaluation are (i) the ability of systems to solve problems, and (ii) the resources required by the systems to solve the problems. In order for results to be relevant, test problems must be representative of the problems the systems will face in applications, and the resource measurements must be appropriate for the availability and demands of the applications.

*Summary of Findings:* There has been progress in the last eight years, with stronger progress from v6.3.0 (2015) up to v7.1.0 (2018), but then a period of quiet until some more signs of progress in v8.2.0 (2023). There have been some first solutions of problems that are of direct interest to humans, a quite large number of first ATP solutions of problems from the TPTP, and some noteworthy improvements in individual ATP systems. There has been an apparent slowing of progress compared to the five years prior to 2015.

*Paper Structure:* Sections 2 and 3 provide a brief background to the TPTP problem library and TSTP solution library, highlighting features relevant to this work. Section 4 describes how the TPTP and TSTP data was prepared for analysis, and describes the measures used. Section 5 is the core of the paper, giving the results with commentary. Section 6 concludes.


Table 1. Overview of TPTP releases

# 2 The TPTP Problem Library

The core of the TPTP World is the TPTP problem library [66]; it is the de facto standard set of test problems for classical logic ATP systems. The problems can be browsed online<sup>1</sup> and documentation is available<sup>2</sup>. Each release of the problem library is identified by a number in the form *version*.*edition*.*patch*. At the time of writing the current release was v8.2.0. Section 3 explains why the analyses of progress presented in this paper start at v6.3.0. Table 1 gives some summary data about the editions from v6.3.0 to v8.2.0. The Size column gives the number of problems in the release at the time of the release, while the Analysed column gives the number of problems left for analysis after the data cleaning described in Sect. 4.1. The acronyms for problem types are given in Sect. 2.1.

Each TPTP problem file has a header section (as comments) that contains information for users in four parts: the first part identifies and describes the problem; the second part provides information about occurrences of the problem in the literature and elsewhere; the third part provides semantic and syntactic characteristics of the problem; the last part contains comments and bugfix information. The third part is most relevant to this work. It contains the problem's SZS status [77] that provides the semantic status of the problem, e.g., if it is a Theorem, a Satisfiable set of formulae, a problem whose status is Unknown, etc. It also includes statistics about the problem's syntax, e.g., the number of formulae, the numbers of symbols, the use of equality and arithmetic, etc. The SZS status and the syntactic characteristics are used to form the Specialist Problem Class of the problem, as explained in Sect. 2.1.

<sup>1</sup> www.tptp.org/cgi-bin/SeeTPTP?Category=Problems.

<sup>2</sup> www.tptp.org/cgi-bin/SeeTPTP?Category=Documents.

# 2.1 Specialist Problem Classes

The problems in the TPTP library are divided into Specialist Problem Classes (SPCs) – classes of problems that are homogeneous wrt recognizable logical, language, and syntactic characteristics. Evaluation of ATP systems within SPCs makes it possible to say which systems work well for what types of problems. Empirically, homogeneity is ensured by examining the patterns of system performance across the problems in each SPC. For example, the separation of "essentially propositional" problems was motivated by observing that SPASS [85] performed differently on the ALC problems in the SYN domain of the TPTP. A data-driven test of homogeneity is also possible [26].

The characteristics used to define the SPCs in TPTP v8.2.0 are . . .


Using these characteristics, 223 SPCs are defined in TPTP v8.2.0. For example, the SPC TF0\_THM\_NEQ\_ARI contains typed monomorphic first-order theorems that have no equality but include arithmetic. Combinations of SPCs are written using UNIX globbing, e.g., TF0\_THM\_\*\_NAR is the combination of TF0\_THM\_EQU\_NAR and TF0\_THM\_NEQ\_NAR – typed monomorphic first-order theorem problems, either with or without equality, but without arithmetic.
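The globbed combinations can be expanded with ordinary UNIX-style pattern matching. A sketch using Python's `fnmatch`; the three-element list here is a stand-in for the full set of 223 SPCs:

```python
# Sketch: expanding a globbed SPC combination with UNIX-style pattern
# matching, via Python's fnmatch. The SPC list is a stand-in for the
# full 223 SPCs defined in TPTP v8.2.0.
from fnmatch import fnmatch

spcs = ["TF0_THM_EQU_NAR", "TF0_THM_NEQ_NAR", "TF0_THM_NEQ_ARI"]

def expand(pattern, spcs):
    return [spc for spc in spcs if fnmatch(spc, pattern)]

print(expand("TF0_THM_*_NAR", spcs))
# -> ['TF0_THM_EQU_NAR', 'TF0_THM_NEQ_NAR']
```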

The SPCs are used when computing the TPTP problems difficulty ratings, as explained in Sect. 2.2.

### 2.2 TPTP Problem Ratings

Each TPTP problem has a difficulty rating that provides a well-defined measure of how difficult the problem is for current ATP systems [76]. The ratings are based on performance data in the TSTP (see Sect. 3), and are updated in each TPTP edition. Rating is done separately for each SPC. First, a partial order between systems is determined according to whether or not a system solves a strict superset of the problems solved by another system. If a strict superset is solved, the first system is said to *subsume* the second. Then the fraction of non-subsumed systems that fail on a problem is the difficulty rating for the problem. Problems that are solved by all of the non-subsumed systems get a rating of 0.00 ("easy"); problems that are solved by some of the non-subsumed systems get a rating between 0.00 and 1.00 ("difficult"); problems that are solved by none of the non-subsumed systems get a rating of 1.00 ("unsolved").

# 3 The TSTP Solution Library

The complement of the problem library is the TSTP solution library [65,67]. The TSTP is built by running all the ATP systems that are available in the TPTP World on all the problems in the TPTP problem library. At the time of writing this paper, the TSTP contained the results of running 87 ATP systems and system variants on all the problems in the TPTP that they could attempt. This produced 1091026 runs, of which 432718 (39.6%) solved the problem. One use of the TSTP is for ATP system developers to examine solutions to problems and thus understand how they can be solved, leading to improvements to their own systems. The use considered here is for TPTP problem ratings.

Prior to 2010 the data in the TSTP came from results submitted by ATP system developers, who performed testing on their own hardware. From 2010 to 2013 the data was generated on the TPTP World servers at the University of Miami. Since 2014 the ATP systems have been run on StarExec [63], initially on the StarExec Iowa cluster, and since 2018 on the StarExec Miami cluster. StarExec has provided stable platforms that produce reliably consistent and comparable data in the TSTP. The analyses presented in Sect. 4 start at TPTP v6.3.0, which was released in November 2015. By that time the problem ratings were based on data produced on the StarExec computers.

The StarExec Iowa computers have a quad-core Intel Xeon E5-2609 CPU running at 2.40 GHz, 128 GiB memory, and the CentOS Linux release 7.9.2009 operating system. The StarExec Miami computers have an octa-core Intel Xeon E5-2667 v4 CPU running at 2.10 GHz, 128 GiB memory, and the CentOS Linux release 7.4.1708 operating system. One ATP system is run on one CPU at a time, with a 300 s CPU time limit and a 128 GiB memory limit (see Sect. 3.1). The minor differences between the Iowa and Miami configurations can be ignored for the task of "solving problems", as is explained in Sect. 3.1.

### 3.1 Resource Limits

Analysis shows that increasing resource limits does not significantly affect which problems are solved by an ATP system. Fig. 1 illustrates this point; it plots the CPU times taken by several contemporary ATP systems to solve the TPTP problems for the FOF\_THM\_RFO\_\* SPCs, in increasing order of time taken. The relevant feature of these plots is that each system has a point at which the time taken to find solutions starts to increase dramatically. This point is called the system's Peter Principle [55] Point (PPP), as it is the point at which the system has reached its level of incompetence. Evidently a linear increase in the computational resources beyond the PPP would not lead to the solution of significantly more problems. The PPP thus defines a realistic computational resource limit for the system. Therefore, provided that enough CPU time and memory are allowed for an ATP system to reach its PPP, a usefully accurate measure of what problems it can solve is achieved. The performance data in the TSTP is produced with adequate resource limits.
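One way to locate a system's PPP in such a plot is to sort its solve times and look for the first sharp jump between successive times. This is a hypothetical sketch; the growth-factor threshold and the example times are assumptions, not taken from the TSTP data:

```python
# Hypothetical sketch of locating a system's Peter Principle Point (PPP):
# sort the solve times, then find where the time to the next solution
# starts to grow sharply. The growth-factor threshold is an assumption.

def ppp_index(times, factor=3.0):
    """Index after which successive solve times jump by more than `factor`."""
    ts = sorted(times)
    for i in range(1, len(ts)):
        if ts[i] > factor * max(ts[i - 1], 1.0):  # floor avoids tiny-time noise
            return i
    return len(ts)

# Illustrative times (seconds): many quick solutions, then a sharp rise.
times = [0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 9.0, 60.0, 280.0]
print(ppp_index(times))
# -> 6 (the jump from 2.0 s to 9.0 s)
```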

Fig. 1. CPU times for FOF\_THM\_RFO\_\*

### 4 Analysis Processes

#### 4.1 Analysis Data

The analyses performed in this assessment use the TPTP problem ratings, and historical data about which ATP systems solved which problems in each TPTP release. The data was extracted from the ProblemAndSolutionStatistics file that accompanies each TPTP release, which summarizes information from the header fields of the TPTP problem files and corresponding TSTP solution files. As explained in Sect. 3, TSTP data starting from TPTP v6.3.0 in November 2015 has been used, taking snapshots at each TPTP edition up to v8.2.0.

Before analysis the rating data was cleaned as follows:

*Cleaning for Bias:* The TPTP tags problems that are designed specifically to be suited or ill-suited to some ATP system, calculus, or control strategy as *biased*. The biased problems were excluded from the analyses.

*Cleaning for Bugfixes:* Over time some problems have had to be removed from the TPTP because they were renamed, duplicates, wrongly formulated, etc. Such problems in a TPTP release are thus not in subsequent releases. The removed problems were excluded from the analyses.

*Cleaning for the Past:* Problems are added to the TPTP in each release, and corresponding TSTP data is generated using the available ATP systems. As it is not possible to run all previously available ATP systems on new problems when they are added to the TPTP, it has been (quite reasonably) assumed that if a problem was unsolved by the current ATP systems when it was added to the TPTP (initial rating 1.00), then it would have been unsolved by previously available ATP systems. The rating data was thus augmented for problems that were added after v6.3.0 and had an initial rating of 1.00, by setting the problems' ratings in the prior TPTP releases to 1.00. There were 1854 such problems. This, however, can lead to an unfairly optimistic view of progress, because those retrospectively added 1.00 ratings increase the average problem rating in the past. For problems that were solved when they were added to the TPTP (initial rating less than 1.00), it is unknown if the previously available ATP systems would have been able to solve them. Augmenting the rating data by setting the problems' ratings in prior TPTP releases to their initial rating of less than 1.00 could lead to an optimistic or pessimistic view of progress, depending on whether the rating was greater or less than the average in the past releases. In this work the rating data was augmented for problems that were added after v6.3.0 and had an initial rating less than 1.00, by setting the problems' ratings in the prior TPTP releases to their initial rating. There were 2632 such problems. The optimistic/pessimistic effect gets stronger when rating data is augmented for problems that were added in more recent TPTP releases. A total of 1854 + 2632 = 4486 problems had their initial ratings propagated backwards, starting from the various releases over the eight years of analysis. Overall this could have had a slightly optimistic impact on the analyses.
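The backward propagation of initial ratings can be sketched as follows. The nested-dictionary layout and the example ratings are assumptions for illustration, not the real ProblemAndSolutionStatistics format:

```python
# Sketch of the backward propagation described above: a problem added in a
# later release has its initial rating copied into all prior releases.
# 'ratings[release][problem]' is an assumed layout, not the real file format.

releases = ["v6.3.0", "v6.4.0", "v7.0.0"]

def propagate_back(ratings, releases):
    for i, rel in enumerate(releases):
        for prob, r in ratings[rel].items():
            for earlier in releases[:i]:
                ratings[earlier].setdefault(prob, r)  # copy initial rating back
    return ratings

ratings = {
    "v6.3.0": {"P1": 0.50},
    "v6.4.0": {"P1": 0.40, "P2": 1.00},   # P2 added unsolved in v6.4.0
    "v7.0.0": {"P1": 0.40, "P2": 0.80, "P3": 0.25},
}
propagate_back(ratings, releases)
print(ratings["v6.3.0"])
# P2 (1.00) and P3 (0.25) have been propagated back to v6.3.0.
```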

*Cleaning for Change:* A counterintuitive feature of an individual problem's difficulty ratings is that they sometimes increase with time. It is counterintuitive because the problem has not changed. (This was also noted in a prior analysis [69].) Increases are caused by new ATP systems or system versions becoming available. If a new system is not subsumed then its TSTP data is used in the rating process: the ratings of problems that it solves decrease, but at the same time the ratings of problems that it does not solve increase – you have to "pay the piper".<sup>3</sup> A common instance of this phenomenon is a new system that can solve some previously unsolved (rating 1.00) problems, but that cannot solve a substantial number of problems that are solved by other systems (rating less than 1.00). In this work the anomaly is resolved by additionally looking at *monotonic ratings*: if a problem's rating in a TPTP release is greater than its previous rating, the monotonic rating is set to the previous lower rating. Monotonized ratings make clear sense in the case of problems that were unsolved (rating 1.00) and were later solved by a new system (the rating drops to less than 1.00) – if a problem is solved, it cannot become unsolved – the solving system still exists in principle. In cases where the rating is less than 1.00 monotonized ratings might be considered to be optimistic because ratings do have to "pay the piper".
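Monotonizing a problem's rating history amounts to taking a running minimum over releases; a minimal sketch:

```python
# Minimal sketch of monotonized ratings: a problem's rating per release is
# replaced by the running minimum, so its difficulty never increases.

def monotonize(ratings):
    """ratings: one problem's difficulty rating per release, in release order."""
    out, lowest = [], float("inf")
    for r in ratings:
        lowest = min(lowest, r)
        out.append(lowest)
    return out

# Unsolved for two releases, then solved; later a new non-subsumed system
# pushes the raw rating back up, but the monotonized rating stays down.
print(monotonize([1.00, 1.00, 0.60, 0.75, 0.50]))
# -> [1.0, 1.0, 0.6, 0.6, 0.5]
```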

# 4.2 Coherent SPC Sets

Five of the analyses performed (see Sect. 4.3) require data from sets of problems with similar characteristics, so that the analysis results are wrt that type of problem. The basis for such sets is the SPCs (see Sect. 2.1), which provide a fine-grained partitioning of the TPTP problems so that each SPC is coherent. Some SPCs that capture compatible problem characteristics can be merged to form a *coherent SPC set*.

The coherent SPC sets used for the analyses are listed in Table 2. The SPC set column lists the SPCs that are in the set, using the abbreviations given in Sect. 2.1. Some noteworthy exclusions are: typed extended first-order problems, because they were added to the TPTP only in v8.0.0; typed polymorphic first-order and higher-order problems, because too few systems are capable of attempting the problems and generating the necessary TSTP data; some SPCs that have too few problems, e.g., TF0\_CSA\_\*\_NAR and TF0\_SAT\_\*\_NAR, which combined have only 154 problems.

# 4.3 Six Analyses

The cleaned TPTP problem ratings and historical TSTP data have been used for six analyses of progress in ATP. Individual problem ratings are used for the first analysis. The other five analyses are wrt the coherent SPC sets described in Sect. 4.2.

*First Solutions:* Arguably the most successful use of ATP comes from the "hammers" [15] associated with Interactive Theorem Proving (ITP) systems, where the individual problems being solved are typically not of direct interest to the human users who are focussed on the larger task being addressed in the ITP system. In contrast, the use of ATP by practitioners to solve individual problems

<sup>3</sup> Conversely, if a system that was not subsumed becomes unavailable, it no longer contributes TSTP data for new problems. This phenomenon is rare (e.g., Isabelle ran fine on StarExec Iowa but did not port to StarExec Miami in 2018) and has not materially impacted the analyses of progress.


Table 2. Coherent SPC sets

that have resisted manual approaches is less common and possibly less successful, but the sparsity makes successes particularly noteworthy. First solutions of problems that are of direct interest to humans are indications of progress. Such problems are identifiable by (i) the rating decreasing from 1.00, and (ii) evidence that the problem is of direct interest to some humans.

*Average Difficulty Ratings:* This is the average problem difficulty rating, and the average monotonized difficulty rating. (This approach was used in [73].) As the problems are unchanged (they are not actually getting easier), decreases are evidence of progress in ATP systems.

*Never-Solved:* This is the fraction of problems that were unsolved (rating 1.00) in all TPTP releases up to each TPTP release, relative to the number in v6.3.0. (The converse of this is plotted in [78].) Decreases are evidence of progress.

*Solved:* For the given system and a given TPTP release, this is . . .

$$\frac{\mathit{ProblemsSolvedInRelease} - \mathit{LeastSolvedAcrossAllReleases}}{\mathit{MostSolvedAcrossAllReleases} - \mathit{LeastSolvedAcrossAllReleases}}$$

The releases with a 1.00 value are those in which the most problems were solved, and those with 0.00 had the least number solved. Increases are evidence of progress.
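The normalization is a plain min-max scaling; a sketch on hypothetical per-release solved counts:

```python
def solved_score(solved_counts):
    """Min-max normalize per-release solved counts: the release(s) solving the
    most problems score 1.00, those solving the least score 0.00."""
    least, most = min(solved_counts), max(solved_counts)
    return [(s - least) / (most - least) for s in solved_counts]

# Hypothetical solved counts for one coherent SPC set over four releases:
print(solved_score([1020, 1034, 1011, 1043]))  # [0.28125, 0.71875, 0.0, 1.0]
```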

*Always-Easy:* This is the converse of Never-solved – the fraction of problems that were easy (rating 0.00) in all TPTP releases back to each TPTP release, relative to the number in v8.2.0. Increases are evidence of progress.
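Both fractions can be sketched over hypothetical per-problem rating histories (one list of ratings per problem, ordered by release; these helpers are illustrative, not the paper's tooling):

```python
def never_solved_fraction(histories, i):
    """Fraction of problems rated 1.00 in all releases up to and including
    release i, relative to the number rated 1.00 in the first release."""
    base = sum(1 for h in histories if h[0] == 1.00)
    still = sum(1 for h in histories if all(r == 1.00 for r in h[:i + 1]))
    return still / base

def always_easy_fraction(histories, i):
    """Fraction of problems rated 0.00 in all releases from release i onwards,
    relative to the number rated 0.00 in the last release."""
    base = sum(1 for h in histories if h[-1] == 0.00)
    still = sum(1 for h in histories if all(r == 0.00 for r in h[i:]))
    return still / base

# Four hypothetical problems across three releases:
hist = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.5], [0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
print(never_solved_fraction(hist, 2))  # 0.5: one of two initially-unsolved remains unsolved
print(always_easy_fraction(hist, 0))   # 0.5: one of two finally-easy problems was always easy
```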

*Shapley Value:* A State-of-the-Art (SotA) ATP system for a TPTP release is defined as one that solves the union of the problems solved by the individual ATP systems, e.g., by using competition parallelism [79]. The Shapley value [87] is the average of the marginal contributions (how much the SotA system improves when adding each given system) over all systems added to all possible subsets of other systems. First, temporal Shapley analysis [41] is used to measure the SotA systems' contributions to progress, normalized by the number of previously unsolved problems so that 0.0 means no previously unsolved problems were solved and 1.0 means all previously unsolved problems were solved. Peaks indicate stronger progress. Next, (non-temporal) Shapley analysis [25] is used to measure the contributions of the individual systems in each release. Finally, temporal Shapley analysis for all systems in all releases is used to measure the contributions of the individual system versions when they were introduced. The latter two analyses were used to provide insights for the commentary about the systems' performances (they are not plotted in Sect. 5).
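The (non-temporal) Shapley computation can be sketched directly from its definition, with the worth of a coalition being the number of problems solved by the union of its members; the system names and solved sets below are invented:

```python
from itertools import combinations
from math import factorial

def shapley(solved_by):
    """Shapley value of each system, where a coalition's worth is the number
    of problems solved by the union of its members (i.e. by the SotA system
    built from them via competition parallelism)."""
    systems = list(solved_by)
    n = len(systems)

    def worth(coalition):
        solved = set()
        for s in coalition:
            solved |= solved_by[s]
        return len(solved)

    values = {}
    for i in systems:
        others = [s for s in systems if s != i]
        total = 0.0
        for k in range(n):  # size of the coalition that system i joins
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (worth(subset + (i,)) - worth(subset))
        values[i] = total
    return values

# Hypothetical solved-problem sets; the values sum to the 4 problems solvable overall:
print(shapley({"E": {1, 2, 3}, "Vampire": {2, 3, 4}, "iProver": {4}}))
```

This direct computation is exponential in the number of systems, which is fine for illustration; the cited analyses use more careful machinery, but the underlying definition is the same.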

# 5 Evidence of Progress

### 5.1 First Solutions

There are some nice examples of ATP systems finding first solutions to problems that are of direct interest to humans . . .


<sup>4</sup> The Robbins problem was posed in personal communications between Edward Huntington, Herbert Robbins, and Alfred Tarski. The background is given in en.wikipedia.org/wiki/Robbins\_algebra.


# 5.2 Solutions and Ratings

A total of 25325 problems were analysed over the coherent SPCs, of which 19762 (78%) were solved in TPTP v6.3.0, increasing to 20227 (80%) in v8.2.0. Of the 25325 problems, 5563 (22%) were unsolved when they were added to the TPTP, of which 1009 (4%) were solved in some release by v8.2.0. Conversely, there were 8984 problems (35%) that had a rating of 0.00 in v8.2.0, of which 2965 (12%) had a higher rating in some preceding release. These overall figures provide evidence of overall progress, but the contributions vary across the coherent SPC sets. Figures 2, 3, 4, 5, 6, 7, 8, 9 and 10 plot the values for each coherent SPC set for the latter five analyses described in Sect. 4.3.<sup>5</sup> The captions provide the numbers of 'P'roblems in TPTP v8.2.0, the number left for analysis after the data cleaning, and the numbers of 'N'ever-solved, 'S'olved, and 'A'lways-easy problems in releases v6.3.0-v8.2.0.

Figures 2, 3, 4, 5, 6 and 7 plot the values for the CNF- and FOF-based coherent SPC sets. CNF is now the "assembly language" of most ATP systems, which typically translate more expressive logics down to CNF. As such, progress in CNF typically contributes to progress in other SPCs.

CNF\_UNS\_RFO\_PEQ\_UEQ showed progress in v6.4.0 due to the strong performance of Twee 2.0 [62], which made a lot more problems always-easy by v7.0.0. Also in v6.4.0, Waldmeister 710 [44] solved five problems that had never been solved before. In v7.4.0 E 2.5 made a strong contribution, then in v8.1.0 Twee 2.4 made another strong contribution, alongside CSE\_E 1.3 [88]. Waldmeister 710 had the highest Shapley value across all the releases, but in v8.1.0 both Twee and CSE\_E solved more problems than Waldmeister. The lowest number of problems solved was in v7.5.0 and v8.0.0, when 23 fewer problems were solved than in v7.4.0 – not many in the context of the 1034 solved in v7.4.0. The only discernible common feature of those 23 problems is that they had ratings over 0.90 in v7.4.0. Apparently some changes in the ATP system versions from v7.4.0 to v7.5.0 made the problems unsolvable in v7.5.0, and further changes reversed the situation for v8.1.0 when 1043 problems were solved.

<sup>5</sup> Data: github.com/GeoffsPapers/ATPProgress2024/raw/master/DataForAnalysis.

CNF\_UNS\_RFO\_\*\_NUE had a small but quite consistent decline in the problem ratings, indicating some progress. The big advances were in v7.0.0 when Vampire 4.2 performed well, including solving 33 problems that had never been solved before. In v8.2.0 SnakeForV 1.0 solved 26 problems that had never been solved before. The biggest drop in problems solved was between v7.2.0 and v7.3.0, when 66 fewer problems were solved. The largest increase in problems solved was between v8.1.0 and v8.2.0, when 50 more problems were solved. SnakeForV was again the big contributor to the increase. SnakeForV is interesting, as it is a variant of Vampire with an independent reimplementation of Spider-style [82] strategy discovery and schedule construction that factors in prover randomization [64].

CNF\_SAT\_RFO\_\* had only one high point, in v6.4.0 when Vampire 4.0.5 made a strong contribution, including solving four problems that had never been solved before. The sudden drop in problems solved in v7.0.0 was due to Prover9 1105 [46] data not being available; the reason is lost in the mists of time, but it is interesting to note that the older system was able to solve some problems that other systems could not. By v7.1.0 new systems had taken up the slack. The plots are all quite stable from v7.1.0 onwards.

{FOF,CNF}\_\*\_EPR\_\* had two points of progress, the first in v7.0.0 and the second in v7.3.0. In v7.0.0 the progress came from iProver 2.6, which had integrated an abstraction-refinement framework [30], and Vampire 4.2, which had some changes in its model building. Between them they solved five problems that had never been solved before. In v7.3.0 iProver 3.0 integrated superposition [23]. The number of problems solved then increased continuously until the drop in v8.2.0, which was due to poorer performances by the new iProver 3.7, SnakeForV 1.0, and Vampire 4.7. These systems share the same FOF to CNF translator, which might have been the source of the common change.

FOF\_THM\_RFO\_\* is the best known of the FOF-based SPCs, with the most ATP systems able to attempt the problems, and is the target of most new systems. The problem difficulty ratings are quite flat, but the number of problems solved increased quite regularly, from 6086 in v6.3.0 to 6235 in v8.2.0. The largest step of progress came in v7.0.0 when Vampire 4.2 solved 72 problems that had never been solved before, thanks to improvements in preprocessing. ET 0.2 [37] also contributed to the progress in v7.0.0. In v7.4.0 Enigma 0.4 [32,33] was a new system that made a strong contribution to progress. Vampire 4.5 also contributed to progress in v7.4.0, with a new layered clause selection approach [27] and a new subsumption demodulation rule [28].

FOF\_{CSA,SAT}\_RFO\_\* is also well known, and along with its typed first-order counterpart (not analysed due to insufficient data) is important for applications, e.g., [22]. The largest sign of progress was in v6.4.0. The main contributors were Vampire 4.0.5 with improvements to its satisfiability checking, and iProver 2.5 with restructured core data structures and improved preprocessing including predicate elimination. Vampire 4.0.5 solved 10 problems that had never been solved before. There is a drop of 10 problems solved from v8.0.0 to v8.2.0. As in CNF\_UNS\_RFO\_PEQ\_UEQ, there is no discernible common feature of those 10 problems, and their ratings were at most 0.75. This again shows that the set of problems solved by evolving versions of systems does not grow monotonically.

Figures 8, 9 and 10 plot the values for the TFF- and THF-based coherent SPC sets. TF0\_THM\_\*\_NAR uses the simplest of the typed TPTP languages. In v7.0.0 there was progress thanks to Vampire 4.2 and CVC4 1.5.2 [4]. In v8.2.0 there was progress thanks to SnakeForV 1.0. In between those points of progress there was a drop in the number of problems solved, from 282 in v7.4.0 down to 260 in v7.5.0, apparently due to poorer performance of CVC4 1.9 in v7.5.0 compared to that of CVC4 1.7 in v7.4.0.

TF0\_THM\_\*\_ARI is important because it uses the simplest TPTP language that includes arithmetic, which occurs naturally in application areas [16,39,53]. There was clearly some significant progress in v6.4.0 as many problems were solved for the first time by Vampire 4.0.5, which had integrated Z3 [50] since Vampire 4.0. This contributed to the increase in the number of problems solved, from 915 in v6.3.0 to 1009 in v6.4.0. CVC4 1.5 [4] and Princess 150706 [57] also performed well.

TH0\_THM\_\*\_NAR uses typed higher-order logic, and despite using a more expressive language than the TF0\_\* SPCs, has been the focus of ATP system development longer [70,74]. The problem ratings declined moderately, and there were bursts of progress in v7.0.0 and v7.5.0. The progress in v7.0.0 was largely thanks to Satallax 3.2 [17], which included a SInE-like [31] procedure for premise selection that enabled it to solve some large problems that were previously out of reach. That progress increased the number of always-easy problems by v7.1.0. In v7.5.0 Zipperposition 2.0 [8] improved over the previous version, and solved 18 problems that had never been solved before.

Fig. 2. CNF\_UNS\_RFO\_PEQ\_UEQ P:1140-1140 N:120-86 S:1020-1049 A:38-233

Fig. 3. CNF\_UNS\_RFO\_\*\_NUE P:4445-4441 N:569-391 S:3873-3966 A:1004-1780

Figure 11 was presented (verbatim) in a prior analysis done at TPTP release v6.4.0 [69]. The figure plotted the average ratings for the 14527 problems that were unchanged in the TPTP since v5.0.0, and whose ratings had not been stuck at 0.00 or 1.00 since v5.0.0. It was noted in [69]: "The ratings generally show a downward trend - there has been progress!". Figure 12 shows the same done at TPTP release v8.2.0, for the 16236 problems that were unchanged in the TPTP since v6.3.0, and whose ratings have not been stuck at 0.00 or 1.00 since v6.3.0.

Fig. 4. CNF\_SAT\_RFO\_\* P:1044-1042 N:155-147 S:887-889 A:476-598

Fig. 5. {FOF,CNF}\_\*\_EPR\_\* P:1457-1425 N:78-43 S:1347-1360 A:1027-1311

Fig. 6. FOF\_THM\_RFO\_\* P:7204-7202 N:1116-818 S:6086-6235 A:696-971

Fig. 7. FOF\_{CSA,SAT}\_RFO\_\* P:1329-1028 N:282-256 S:746-753 A:481-709

The two figures' plots dovetail quite well, which gives confidence that they really are comparable (there are some minor differences caused by the data cleaning done for this work, and recent refinements to the rating calculations [71,72]). The older plots show a quite clear downward trend both overall and for the four types of problems, while the new plots do not. Possible reasons are discussed in the conclusion (Sect. 6).

Fig. 8. TF0\_THM\_\*\_NAR P:400-397 N:120-103 S:277-268 A:117-123

Fig. 9. TF0\_THM\_\*\_ARI P:1176-1087 N:172-58 S:915-1022 A:763-785

Fig. 10. TH0\_THM\_\*\_NAR P:3189-3183 N:461-305 S:2722-2814 A:617-1244

Fig. 11. Ratings from v5.0.0 to v6.4.0

Fig. 12. Ratings from v6.3.0 to v8.2.0

# 6 Conclusion

This paper has presented an empirical assessment of progress in ATP, using data from the TPTP World from TPTP v6.3.0 in 2015 to v8.2.0 in 2023. The assessment has been in terms of six measures, with five of them computed over nine coherent SPC sets of problems that are reasonably homogeneous for ATP systems. The assessment shows that there has been progress in the last eight years, with stronger progress from v6.3.0 (2015) up to v7.1.0 (2018), then a period of quiet until some more signs of progress in v8.2.0 (2023). There have been some first solutions of problems that are of direct interest to humans, and a quite large number of first ATP solutions of problems from the TPTP. The coherent SPCs with the strongest signs of progress were CNF\_UNS\_RFO\_PEQ\_UEQ and TH0\_THM\_\*\_NAR.

In addition to overall trends, it is worth noting some of the salient improvements in individual ATP systems, extracted from Sect. 5 ...


In terms of problem difficulty ratings, the monotonized ratings necessarily went down but the trend was not dramatic, and the raw ratings were generally stable. This is in contrast to the clearly decreasing ratings from 2011 to 2016. The reasons for that apparent slowing of progress are not definitely known, but we have thought of the following possible reasons:


<sup>6</sup> www.dagstuhl.de/23471.

– During those years several ATP systems were optimized for EPR problems, most notably iProver. Putting a division on hiatus leads to less development in that aspect of ATP.

– In [72] it was noted that CASC might be causing incremental development of ATP systems. This concern has been expressed as far back as CASC-JC in 2001 [54]. In response to this concern CASC-J12 will have a new ICU (I Challenge yoU) division that focusses on solving hard problems rather than solving more problems, hoping to stimulate new developments and progress.

This assessment of progress is based on ATP systems' abilities to solve problems. Evaluation of other performance measures would be interesting, e.g., stability of proof search modulo perturbations of the input, and some have been done in other evaluations of logic-based systems. These include measures such as resource usage and verifiability of proofs/models. Evaluation of non-performance measures is often ignored, but for users might be just as necessary. These include measures such as the range of logics covered, ease of building and deploying, portability to different hardware and operating system environments, availability of source code, quality of source code and its documentation, licensing that permits a required level of use or modification, availability of user documentation, and (maybe most importantly!) developer support. These are topics for future assessments.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# A Higher-Order Vampire (Short Paper)

Ahmed Bhayat<sup>1</sup> and Martin Suda<sup>2(B)</sup>

<sup>1</sup> Leicester, UK
ahmed\_bhayat@hotmail.com
<sup>2</sup> Czech Technical University in Prague, Prague, Czech Republic
martin.suda@cvut.cz

Abstract. The support for higher-order reasoning in the Vampire theorem prover has recently been completely reworked. This rework consists of new theoretical ideas, a new implementation, and a dedicated strategy schedule. The theoretical ideas are still under development, so we discuss them at a high level in this paper. We also describe the implementation of the calculus in the Vampire theorem prover, the strategy schedule construction and several empirical performance statistics.

Keywords: Vampire · Higher-Order · Strategy Scheduling

# 1 Introduction

The Vampire prover [15] has supported higher-order reasoning since 2019 [6]. Until recently, this support was via a translation from higher-order logic (HOL) to polymorphic first-order logic using combinators. The approach had its positives; in particular, it avoided the need for higher-order unification. However, our experience suggested that for problems requiring complex unifiers, the approach was not competitive with calculi that do rely on higher-order unification. This intuition was supported by results at the CASC system competition [25].

Due to this, we recently devised an entirely new higher-order superposition calculus, this time based on a standard presentation of HOL. The key idea behind our calculus is that rather than using full higher-order unification, we use a depth-bounded version. That is, when searching for higher-order unifiers, once some predefined number of projection and imitation steps have taken place, the search is backtracked. The crucial difference between our approach and similar approaches is that rather than failing on reaching the depth limit, we turn the set of remaining unification pairs into negative constraint literals, which are returned along with the substitution formed up to that point. This is similar to recent developments in the field of theory reasoning [5].

The new calculus has now been implemented in Vampire along with a dedicated strategy schedule. Together these developments propelled Vampire to first

A. Bhayat—Independent Scholar.

© The Author(s) 2024

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 75–85, 2024. https://doi.org/10.1007/978-3-031-63498-7\_5

place in the THF division of the 2023 edition of the CASC competition.<sup>1</sup> As the completeness of the calculus is an open question which we are working on, we have to date not published a description of the calculus.

In this paper, we describe the calculus, discuss its implementation in Vampire and also provide some details of the strategy schedule and its formation.

# 2 Preliminaries

We assume familiarity with higher-order logic and higher-order unification. Detailed presentations of these can be found in recent literature [2,4,29].

We work with a rank-1 polymorphic, clausal, higher-order logic. For the syntax of the logic we follow a more-or-less standard presentation such as that of Bentkamp et al. [2]. Higher-order applications such as f a c contain subterms with no first-order equivalents, such as f and f a. We refer to these as *prefix* subterms. We represent term variables with x, y, z, function symbols with f, g, h, and terms with s and t. To keep the presentation simple, we omit typing information from our terms.

A substitution is a mapping of variables to terms. Unification is the process of finding a substitution σ for terms t<sup>1</sup> and t<sup>2</sup> such that t1σ ≈ t2σ for some definition of equality (≈) of interest. It is well known that first-order syntactic unification is decidable and that unique most general unifiers exist. For the higher-order case, unification is undecidable, and the set of incomparable unifiers is potentially infinite. A commonly used higher-order unification procedure for enumerating unifiers is Huet's preunification routine [13]. Unlike full higher-order unification, preunification does not attempt to unify terms if both have variable head symbols. Thus, preunification does not require infinitely branching rules, unlike full higher-order unification [29].

The two main rules that extend first-order unification in Huet's procedure are *projection* and *imitation*. We provide a flavour of these via an example. Consider unifying terms s = x a and t = a. In searching for a suitable instantiation of the variable x, we can either attempt to copy the head symbol of t, leading to the substitution x → λy. a, or we can bring one of x's arguments to the head position, leading to the substitution x → λy. y. The first is known as imitation and the second as projection.
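A toy executable sketch (a hypothetical term representation, not any prover's internals) shows that in this example both candidate bindings unify x a with a:

```python
# Toy lambda terms: ('var', x), ('const', c), ('app', f, a), ('lam', x, body).
def subst(t, name, repl):
    """Capture-naive substitution (safe here: binder names are distinct)."""
    kind = t[0]
    if kind == 'var':
        return repl if t[1] == name else t
    if kind == 'const':
        return t
    if kind == 'app':
        return ('app', subst(t[1], name, repl), subst(t[2], name, repl))
    return ('lam', t[1], subst(t[2], name, repl))

def beta(t):
    """Normalize by repeatedly reducing ('app', ('lam', x, b), a)."""
    if t[0] == 'app':
        f, a = beta(t[1]), beta(t[2])
        if f[0] == 'lam':
            return beta(subst(f[2], f[1], a))
        return ('app', f, a)
    if t[0] == 'lam':
        return ('lam', t[1], beta(t[2]))
    return t

a = ('const', 'a')
s = ('app', ('var', 'x'), a)             # the flex term  x a
imitation = ('lam', 'y', a)              # x -> \y. a  (copy the rigid head)
projection = ('lam', 'y', ('var', 'y'))  # x -> \y. y  (promote an argument)

for binding in (imitation, projection):
    print(beta(subst(s, 'x', binding)) == a)  # True for both bindings
```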

We use the concept of a *depth*<sup>n</sup> *unifier*. We do not define the term formally, but provide an intuitive understanding. Consider a higher-order preunification algorithm. Any substitution formed by following a path of the unification tree, starting from the root, that contains exactly n imitation and projection steps, or reaches a leaf using fewer than n such steps, is a *depth*<sup>n</sup> *unifier*. For terms s and t, let Un(s, t) be the set of all depth<sup>n</sup> unifiers of s and t. Note that this set is finite as we are assuming preunification and hence the tree is finitely branching.

For terms s and t, for each depth<sup>n</sup> unifier σ ∈ Un(s, t), we associate a set of negative equality literals C<sup>σ</sup> formed by turning the unification pairs that remain

<sup>1</sup> https://tptp.org/CASC/29/WWWFiles/DivisionSummary1.html.

when the depth limit is reached into negative equalities. In the case σ is an *actual unifier* of s and t, C<sup>σ</sup> is of course the empty set.

To make this clearer, consider the unification tree presented in Fig. 1. There are two depth<sup>2</sup> unifiers, labelled σ<sup>1</sup> and σ<sup>2</sup> in the figure. Related to these, we have Cσ<sup>1</sup> = Cσ<sup>2</sup> = {x<sup>2</sup> a b ≉ b}. There are four depth<sup>3</sup> unifiers (not shown in the figure) and zero depth<sup>n</sup> unifiers for n > 3.

Fig. 1. Unification tree for terms x a b and f b a

### 3 Calculus

Our calculus is parameterised by a selection function and an ordering ≻. Together these give rise to the concept of literals being (strictly) ≻-eligible with respect to a substitution σ [2]. When discussing eligibility we drop ≻ and σ and rely on the context to make these clear. We call a literal s ≉ t, where both s and t have variable heads, a *flex-flex* literal. Such a literal is never selected in the calculus. We present the primary inference rule, Sup, below.

$$\frac{D' \lor t \approx t' \quad C' \lor s\langle u \rangle \mathrel{\dot\approx} s'}{(C' \lor D' \lor s\langle t' \rangle \mathrel{\dot\approx} s' \lor C_\sigma)\sigma}\ \text{Sup}$$

In the rule above, we use ≈˙ to denote either a positive or negative equality. We use s⟨u⟩ to denote that u is a *first-order* subterm of s, that is, a non-prefix subterm that is not below a lambda. The side conditions of the inference are: σ ∈ Un(t, u), u is not a variable, t ≈ t' is strictly eligible in the left premise, s⟨u⟩ ≈˙ s' is eligible in the right premise, and the other standard ordering conditions. The remaining core inference rules are EqRes and EqFact.

$$\frac{C' \lor t \approx t' \lor s \approx s'}{(C' \lor t' \not\approx s' \lor s \approx s' \lor C_\sigma)\sigma}\ \text{EqFact} \qquad \frac{C' \lor s \not\approx t}{(C' \lor C_\sigma)\sigma}\ \text{EqRes}$$

For both rules, σ ∈ Un(t, s). For EqFact, s ≈ s' is eligible in the premise, and for EqRes, s ≉ t is eligible. We also include the inferences ArgCong (see [3]) and FlexFlexSimp, which derives the empty clause ⊥ from a clause containing only flex-flex literals.

$$\frac{C' \lor s \approx s'}{C'\sigma \lor (s\sigma)\, x \approx (s'\sigma)\, x}\ \text{ArgCong} \qquad \frac{x_1\, \overline{s}_n \not\approx x_2\, \overline{r}_m \lor \cdots}{\bot}\ \text{FlexFlexSimp}$$

For ArgCong, s ≈ s' is eligible in the premise, σ is the type unifier of s and s', and x is a fresh variable. In our implementation, the depth parameter n is set via a user option. In the case it is set to 0, the following pair of inferences is added to the calculus.

$$\frac{C' \lor x\, \overline{s}_n \not\approx f\, \overline{t}_m}{(C' \lor x\, \overline{s}_n \not\approx f\, \overline{t}_m)\{x \to \lambda \overline{y}_n.\, f\, \overline{(z_j\, \overline{y}_n)}_m\}}\ \text{Imitate}$$

$$\frac{C' \lor x\, \overline{s}_n \not\approx f\, \overline{t}_m}{(C' \lor x\, \overline{s}_n \not\approx f\, \overline{t}_m)\{x \to \lambda \overline{y}_n.\, y_i\, \overline{(z_j\, \overline{y}_n)}_p\}}\ \text{Project}$$

In these rules, j ranges from 1 to m in Imitate and from 1 to p in Project, and each z<sup>j</sup> is a fresh variable. The literals x s<sup>n</sup> ≉ f t<sup>m</sup> are eligible in the premises, and p is the arity of y<sup>i</sup>, the projected variable. The idea behind introducing these rules is to facilitate the instantiation of head variables with suitable lambda terms when this is not being done as part of unification. Our intuition is that by intertwining the unification and calculus rules in the spirit of the EP calculus [21], the need for explosive rules (such as FluidSup [2]) that simulate superposition underneath variables is removed. The examples we present below support this intuition. Besides the core inference rules, the calculus has a set of rules to handle reasoning about Boolean terms. These are similar to rules discussed in the literature [20,30]. Extensionality is supported either via an axiom or by using unification with abstraction as described by Bhayat [4]. Similarly, Hilbert choice can be supported via a lightweight inference in the manner of Leo-III [20] or via the addition of the Skolemized choice axiom. The calculus also contains various well-known simplification rules such as Demodulation and Subsumption.

Soundness and Completeness. The soundness of the calculus described above is relatively straightforward to show. On the other hand, the completeness of the calculus with respect to Henkin semantics is an open question. We hypothesise that given the right ordering, and with tuning of inference side conditions, the depth<sup>0</sup> variant of the calculus (with the Imitate and Project rules) is refutationally complete. A proof is unlikely to be straightforward due to the fact that we do not select flex-flex literals.

*Example 1.* Consider the following unsatisfiable clause set. Assume a depth of 1. Selected literals are underlined.

$$C = \underline{x \, a \, b \not\approx f \, b \, a} \lor x \, c \, d \not\approx f \, b \, a$$

An EqRes binds x to λy, z. f (x<sup>1</sup> y z) (x<sup>2</sup> y z) and results in C<sup>1</sup> = f (x<sup>1</sup> a b) (x<sup>2</sup> a b) ≉ f b a ∨ f (x<sup>1</sup> c d) (x<sup>2</sup> c d) ≉ f b a. An EqRes on C<sup>1</sup> binds x<sup>1</sup> to λy, z. b and results in C<sup>2</sup> = x<sup>2</sup> a b ≉ a ∨ f b (x<sup>2</sup> c d) ≉ f b a. A final EqRes on C<sup>2</sup> binds x<sup>2</sup> to λy, z. a and results in f b a ≉ f b a, from which it is trivial to obtain the empty clause ⊥.

*Example 2 (Example* 1 *of Bentkamp et al.* [3]*).* Consider the following unsatisfiable clause set. Assume the depth<sup>0</sup> version of the calculus.

$$C_1 = f\ a \approx c \qquad C_2 = h\left(y\,b\right)\left(y\,a\right) \not\approx h\left(g\left(f\,b\right)\right)\left(g\,c\right)$$

An EqRes inference on C<sup>2</sup> results in C<sup>3</sup> = y b ≉ g (f b) ∨ y a ≉ g c. An Imitate inference on the first literal of C<sup>3</sup>, followed by the application of the substitution and some β-reduction, results in C<sup>4</sup> = g (z b) ≉ g (f b) ∨ g (z a) ≉ g c. A further double application of EqRes gives us C<sup>5</sup> = z b ≉ f b ∨ z a ≉ c. We again carry out Imitate on the first literal, followed by an EqRes, to leave us with C<sup>6</sup> = x b ≉ b ∨ f (x a) ≉ c. We can now carry out a Sup inference between C<sup>1</sup> and C<sup>6</sup>, resulting in C<sup>7</sup> = x b ≉ b ∨ c ≉ c ∨ x a ≉ a, from which it is simple to derive ⊥ via an application of Imitate on either the first or the third literal. Note that the empty clause was derived without the need for an inference that simulates superposition underneath variables, unlike in [3].

# 4 Implementation

The calculus described above, along with a dedicated strategy schedule, has been implemented in the Vampire theorem prover.<sup>2</sup> Vampire natively supports rank-1 polymorphic first-order logic. Therefore, we translate higher-order terms into polymorphic first-order terms using the well-known applicative encoding. Note that we use the symbol →, in a first-order type, to separate the argument types from the return type. It should not be confused with the binary, higher-order function type constructor → that we assume to be in the type signature. Application is represented by a polymorphic symbol app : Πα1, α2. ((α1 → α2) × α1) → α2. Lambda terms are stored internally using De Bruijn indices. A lambda is represented by a polymorphic symbol lam : Πα1, α2. α2 → (α1 → α2). De Bruijn indices are represented by a family of polymorphic symbols d<sup>i</sup> : Πα. α for i ∈ N. Thus, the term λx : τ. x is represented internally as lam(τ, τ, d0(τ)). The term λx. f (λz. x) is represented internally (now ignoring type arguments) as lam(app(f, lam(d1))).
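A sketch of the named-to-De-Bruijn conversion on a toy Python term representation (hypothetical; types and the polymorphic type arguments of app and lam are ignored):

```python
def to_de_bruijn(term, bound=()):
    """Convert a named lambda term to a De Bruijn form: strings are symbols or
    variables, ('lam', var, body) binds a variable, ('app', f, a) applies.
    Bound variables become ('d', i), counting binders inside-out from 0."""
    if isinstance(term, str):
        return ('d', bound.index(term)) if term in bound else term
    if term[0] == 'lam':
        _, var, body = term
        return ('lam', to_de_bruijn(body, (var,) + bound))
    _, f, a = term
    return ('app', to_de_bruijn(f, bound), to_de_bruijn(a, bound))

# \x. f (\z. x)  ==>  lam(app(f, lam(d1)))
print(to_de_bruijn(('lam', 'x', ('app', 'f', ('lam', 'z', 'x')))))
# ('lam', ('app', 'f', ('lam', ('d', 1))))
```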

Some of the most important options available are: hol\_unif\_depth to control the depth unification proceeds to, func\_ext to control how function extensionality is handled, cnf\_on\_the\_fly to control how eager or lazy the clausification algorithm is, and applicative\_unif, which replaces higher-order unification with (applicative) first-order unification; this is surprisingly helpful in some cases. Besides the options listed above, there are many other higher-order specific options, as well as options that impact both higher-order and first-order reasoning. These options can be viewed by building Vampire and running it with --help.

<sup>2</sup> See http://bit.ly/3vBQLi4 for the release, https://bit.ly/3Hl3lES for the code.

# 5 Strategies and the Schedule

We generally followed the Spider [27] methodology for strategy discovery and schedule creation. This starts with randomly sampling strategies to solve as-of-yet unsolved problems (or to improve the best known time for problems already known to be solvable). Each newly discovered strategy is optimized with local search to work even better on the single problem which it just solved. This is done by trying out alternative values for each option, possibly in several rounds. A variant of the strategy that improves the solution time, or at least uses a default value of an option, is preferred. The final strategy is then evaluated on the totality of all considered problems, and the process repeats.

In our case, we sought strategies to cover the 3914 TH0 problems of the TPTP library [24] version 8.1.2. The strategy space consisted of 87 options inherited from first-order Vampire and 26 dedicated higher-order options. To sample a random strategy, we considered each option separately and picked its value based on a guess of how useful each is. (E.g., for applicative\_unif we used the relative frequencies of on: 3, off: 10.) During the strategy discovery process we adapted the maximum running time per problem, both for the random probes several times and for the final strategy evaluation: from the order of 1 s up to 100 s. In total, we collected 1158 strategies over the course of approximately two weeks of continuous 60 core CPU computation. The strategies cover 2804 unsatisfiable problems, including 50 problems of TPTP rating 1.0 (which means these problems were not officially solved by an ATP before).
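The per-option weighted sampling can be sketched as follows; apart from the paper's applicative_unif frequencies (on: 3, off: 10), the option values and weights below are invented for illustration:

```python
import random

def sample_strategy(option_space, rng=random):
    """Pick each option's value independently, weighted by a usefulness guess."""
    return {opt: rng.choices(list(weights), weights=list(weights.values()))[0]
            for opt, weights in option_space.items()}

space = {
    'applicative_unif': {'on': 3, 'off': 10},  # relative frequencies from the paper
    'hol_unif_depth': {0: 2, 1: 5, 2: 3},      # invented weights for illustration
}
print(sample_strategy(space))
```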

Once a sufficiently rich set of strategies has been discovered and evaluated, schedule building can be posed as a constraint programming task in which one seeks to allot time slices to individual strategies so as to cover as many problems as possible while not exceeding a given overall time bound T [12,19]. We had good experience with a weighted set cover formulation and a greedy algorithm [9]: starting from an empty schedule, at any point we decide to extend it by scheduling a strategy S for an additional t units of time if this step is currently the best among all possible strategy extensions in terms of "the number of problems that will additionally get covered *divided by* t". This greedy approach does not guarantee an optimal result, but it runs in polynomial time and gives a meaningful answer uniformly for any overall time bound T (see [8] for more details).
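A minimal sketch of this greedy weighted set cover step (the data layout is an assumption for illustration, not Vampire's):

```python
def greedy_schedule(covers, total_budget):
    """Greedy weighted set cover for schedule building.

    `covers[(strategy, t)]` is the set of problems the strategy solves
    within t seconds. At each step, pick the (strategy, t) pair that
    maximizes newly covered problems divided by t.
    """
    schedule, covered, budget = [], set(), total_budget
    while True:
        best, best_ratio = None, 0.0
        for (strat, t), probs in covers.items():
            if t > budget:
                continue
            gain = len(probs - covered)
            if gain / t > best_ratio:
                best, best_ratio = (strat, t), gain / t
        if best is None:
            return schedule
        schedule.append(best)
        covered |= covers[best]
        budget -= best[1]
```

Each iteration is a single scan over all (strategy, time-slice) pairs, so the whole construction runs in polynomial time, as noted above.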

Our final higher-order schedule tries to cover, in this greedy sense, as many problems as possible at several increasing time bounds: starting from the 1 s, 5 s, and 10 s bounds relevant for impatient users, all the way up to the CASC limit of 16 min (2 min on 8 cores) and beyond. In the end, it makes use of 278 out of the 1158 available strategies and manages to cover all the known-to-be-solvable problems in a bit less than 1 h of single-core computation. We stress that our final schedule is a single monolithic sequence and does not branch on any problem characteristics or features.<sup>3</sup>

<sup>3</sup> One additional interesting aspect of our schedule building approach (see Appendix A of our preprint [7] for more details) is that we employ input shuffling and prover randomization [23] and thus treat our strategies as Las Vegas algorithms, whose running time or even success/failure may depend on chance.


Table 1. The most important options in terms of contribution to problem coverage

*Most Important Options:* In Table 1, we list the top five options sorted in descending order of "how many problems we would not be able to cover if the given option could not be varied in strategies" (in other words, as if the listed default value were wired into the prover code).
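The "wired-in default" importance measure can be computed as in the following sketch; the strategy and coverage encodings are hypothetical:

```python
def option_importance(strategies, coverage, defaults):
    """For each option, count the problems that become uncoverable when
    the option is wired to its default value.

    `strategies[name]` maps option names to values; `coverage[name]` is
    the set of problems that strategy solves.
    """
    all_covered = set().union(*coverage.values())
    importance = {}
    for opt, default in defaults.items():
        # keep only the strategies that already use the default for `opt`
        still = [s for s, opts in strategies.items()
                 if opts.get(opt, default) == default]
        remaining = set().union(*(coverage[s] for s in still)) if still else set()
        importance[opt] = len(all_covered - remaining)
    return importance
```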

Based on existing research [28], it is unsurprising to see that varying clausification has a large impact; likewise for varying the unification depth. What is perhaps more surprising is that replacing higher-order unification with applicative first-order unification can be beneficial. `equality_to_equiv` turns equality between Boolean terms into equivalence before the original clausification pass is carried out. The effectiveness of this option is also somewhat surprising.

Table 2. Number of problems solved by a single good higher-order strategy and our schedule at various time limit cutoffs. Run on the 3914 TH0 TPTP problems


*Performance Statistics:* It has long been known [26,31] that a strategy schedule can improve over the performance of a single good strategy by a large margin. Table 2 confirms this phenomenon in our case. For this comparison we selected one of the best-performing (at the 60 s time limit mark) single strategies that we had previously evaluated. From the higher-order perspective, the strategy is interesting for setting `hol_unif_depth` to 4 and supporting choice reasoning via an inference rule (`choice_reasoning` on).<sup>4</sup>

Although our schedule has been developed on (and for) the TH0 TPTP problems, it helps the new higher-order Vampire solve more problems of other origin too. Of the Sledgehammer problems exported by Desharnais et al. in their

<sup>4</sup> Otherwise, it uses Vampire's default settings, except for relying on an incomplete literal selection function [11] and using a relatively high naming threshold [17], i.e., being reluctant to introduce new names for subformulas during clausification.

last prover comparison [10], namely the 5000 problems denoted TH0− in their work, Vampire can now solve 2425, compared to the 2179 obtained by Desharnais et al. with the previous Vampire version (both under 30 s per problem).<sup>5</sup>

We remark that we also developed a different schedule specifically adapted to Sledgehammer problems (in various TPTP dialects, i.e., not just TH0), which has been available to Isabelle [16] users since the September 2023 release.

# 6 Related Work

The idea to intertwine superposition and unification appears in earlier work, particularly in the EP calculus implemented in Leo-III [21]. The main differences between our calculus and EP are:


We also incorporate more recent work on higher-order superposition, mainly from the Matryoshka project [2,28]. Of course, the use of constraints in automated reasoning extends far beyond the realm of higher-order logic. They have been researched in the context of theory reasoning [14,18] and basic superposition [1].

# 7 Conclusion

In this paper, we have presented a new higher-order superposition calculus and discussed its implementation in Vampire. We have also described the newly created higher-order schedule. The combination of calculus, implementation and schedule has already proven effective. However, we believe that there is great room for further exploration and improvement. On the theoretical side, we wish to prove refutational completeness of the calculus (or a variant thereof). On the practical side, we wish to refine the implementation, most notably by adding additional simplification rules.

<sup>5</sup> Our experiments were run on Intel® Xeon® Gold 6140 CPUs @ 2.3 GHz; Desharnais et al. [10] used StarExec [22] with Intel® Xeon® CPU E5-2609 @ 2.4 GHz nodes.

<sup>6</sup> Our understanding is that the implementation of EP in Leo-III does make use of orderings as well as eager unification. However, eager unification does not return unification literals, instead failing once the depth bound is reached. See [20] for details.

Acknowledgments. The second author was supported by project CORESENSE no. 101070254 under the Horizon Europe programme and project RICAIP no. 857306 under the EU-H2020 programme.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Tableaux for Automated Reasoning in Dependently-Typed Higher-Order Logic

Johannes Niederhauser<sup>1(B)</sup>, Chad E. Brown<sup>2</sup>, and Cezary Kaliszyk<sup>1,3</sup>

<sup>2</sup> Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic

<sup>3</sup> School of Computing and Information Systems, University of Melbourne, Melbourne, Australia

Abstract. Dependent type theory gives an expressive type system facilitating succinct formalizations of mathematical concepts. In practice, it is mainly used for interactive theorem proving with intensional type theories, with PVS being a notable exception. In this paper, we present native rules for automated reasoning in a dependently-typed version (DHOL) of classical higher-order logic (HOL). DHOL has an extensional type theory with an undecidable type checking problem which contains theorem proving. We implemented the inference rules as well as an automatic type checking mode in Lash, a fork of Satallax, the leading tableaux-based prover for HOL. Our method is sound and complete with respect to provability in DHOL. Completeness is guaranteed by the incorporation of a sound and complete translation from DHOL to HOL recently proposed by Rothgang et al. While this translation can already be used as a preprocessing step to any HOL prover, to achieve better performance, our system directly works in DHOL. Moreover, experimental results show that the DHOL version of Lash can outperform all major HOL provers executed on the translation.

Keywords: Tableaux · Dependent Types · Higher-Order Logic

# 1 Introduction

Dependent types introduce the powerful concept of types depending on terms. Lists of fixed length are an easy but interesting example. Instead of having a simple type lst, we may have a type Πn: nat. lst n which takes a natural number as argument and returns the type of a list with length n. More generally, lambda terms λx.s now have a dependent type Πx: A.B which makes the type of (λx.s) t dependent on t. With that, it is possible, for example, to specify an unfailing version of the tail function by declaring its type to be Πn: nat. lst (s n) → lst n. Many interactive theorem provers for dependent type theory are available [3,10,14,16]; most of them implement intensional type theories, i.e., they distinguish between a decidable judgmental equality (given by conversions) and provable equality (inhabiting an identity type). Notable exceptions are PVS [19] and F* [21], which implement an extensional type theory. In the context of this paper, we say a type theory is *extensional* if judgmental equality and provable equality coincide, as in [12]. The typing judgment in such type theories is usually undecidable, as shown in [8].
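For readers more familiar with intensional systems, the length-indexed lists and the unfailing tail function can be written, for instance, in Lean 4 (an intensional theory, used here purely to illustrate the types):

```lean
-- Length-indexed lists: the type `Vec α n` depends on the term `n : Nat`.
inductive Vec (α : Type) : Nat → Type where
  | nil  : Vec α 0
  | cons : {n : Nat} → α → Vec α n → Vec α (n + 1)

-- An unfailing tail: by its type, it only accepts non-empty vectors.
def Vec.tail {α : Type} {n : Nat} : Vec α (n + 1) → Vec α n
  | .cons _ xs => xs
```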

The broader topic of this paper is automated reasoning support for extensional type theories with dependent types. Not much has been done to this end, but last year Rothgang et al. [17] introduced an extension of HOL to dependent types which they dub DHOL. In contrast to dependent type theory, automated theorem proving in HOL has a long history and led to the development of sophisticated provers [2,4,20]. Rothgang et al. defined a natural extension of HOL and equipped it with automation support by providing a sound and complete translation from DHOL into HOL. Their translation has been implemented and can be used as a preprocessing step to any HOL prover in order to obtain an automated theorem prover for DHOL. Hence, by committing to DHOL, automated reasoning support for extensional dependent type theories does not have to be invented from scratch but can benefit from the achievements of the automated theorem proving community for HOL.

In this paper, we build on top of the translation from Rothgang et al. to develop a tableau calculus which is sound and complete for DHOL. In addition, dedicated inference rules for DHOL are defined and their soundness is proved. The tableau calculus is implemented as an extension of Lash [6]. The remainder of this paper is structured as follows: Sect. 2 sets the stage by defining DHOL and the erasure from DHOL to HOL due to Rothgang et al. before Sect. 3 defines the tableau calculus and provides soundness and completeness proofs. The implementation is described in Sect. 4. Finally, we report on experimental results in Sect. 5.

# 2 Preliminaries

### 2.1 HOL

We start by giving the syntax of higher-order logic (HOL) which goes back to Church [9]. In order to allow for a graceful extension to DHOL, we define it with a grammar based on [17].

$$T ::= \circ \mid T, a\colon \mathtt{tp} \mid T, x\colon A \mid T, s \tag{\text{theories}}$$

$$\Gamma ::= \cdot \mid \Gamma, x\colon A \mid \Gamma, s \tag{\text{contexts}}$$

$$A, B ::= a \mid A \to B \mid o \tag{\text{types}}$$

$$s, t, u, v ::= x \mid \lambda x\colon A.\,s \mid s\ t \mid \bot \mid \neg s \mid s \Rightarrow t \mid s =_A t \mid \forall x\colon A.\,s \tag{\text{terms}}$$

A theory consists of base type declarations a: tp, typed variable or constant declarations x: A and axioms. Contexts are like theories but without base type declarations. In the following, we will often write s ∈ T, Γ to denote that s occurs in the combination of T and Γ. Furthermore, note that ◦ and · denote the empty theory and context, respectively. Types are declared base types a, the base type of booleans o, or function types A → B. As usual, the binary type constructor → is right-associative. Terms are simply-typed lambda terms (modulo α-conversion) enriched by the connectives ⊥, ¬, ⇒, $=_A$ as well as the typed binding operator ∀. All connectives yield terms of type o (formulas). By convention, application associates to the left, so s t u means (s t) u, with the exception that ¬ s t always means ¬(s t). Moreover, we abbreviate ¬(s $=_A$ t) by s $\neq_A$ t and sometimes omit the type subscript of $=_A$ when it is either clear from the context or irrelevant. We write $s[x_1/t_1, \dots, x_n/t_n]$ to denote the simultaneous capture-avoiding substitution of the $x_i$'s by the $t_i$'s. The set of free variables of a term s is denoted by $\mathcal{V}s$.

A theory T is well-formed if all types are well-formed and axioms have type o with respect to its base type declarations. In that case, we write $\vdash^{\mathsf{s}} T\ \mathsf{Thy}$, where the superscript s indicates that we are in the realm of simple types. Given a well-formed theory T, the well-formedness of a context Γ is defined in the same way and denoted by $\vdash^{\mathsf{s}}_T \Gamma\ \mathsf{Ctx}$. Given a theory T and a context Γ, we write $\Gamma \vdash^{\mathsf{s}}_T A\ \mathtt{tp}$ to state that A is a well-formed type and $\Gamma \vdash^{\mathsf{s}}_T s\colon A$ to say that s has type A. Furthermore, $\Gamma \vdash^{\mathsf{s}}_T s$ denotes that s has type o and is provable from Γ and T in HOL. Finally, we use $\Gamma \vdash^{\mathsf{s}}_T A \equiv B$ to state that A and B are equivalent well-formed types. For HOL this is trivial as it corresponds to syntactic equivalence, but this will change drastically in DHOL.

### 2.2 DHOL

The extension from HOL to DHOL consists of two crucial ingredients:


Thus, the grammar defining the syntax of DHOL is given as follows:


If a base type a has arity 0, it is called a *simple base type*. Note that HOL is the fragment of DHOL where all base types have arity 0. Allowing base types to have term arguments makes type equality a highly non-trivial problem in DHOL. For example, if $\Gamma \vdash^{\mathsf{d}}_T s\colon \Pi x\colon A.\,B$ (the d in $\vdash^{\mathsf{d}}$ indicates that we are speaking about DHOL) and $\Gamma \vdash^{\mathsf{d}}_T t\colon A'$, we still want $\Gamma \vdash^{\mathsf{d}}_T (s\ t)\colon B[x/t]$ to hold if $\Gamma \vdash^{\mathsf{d}}_T A \equiv A'$, so checking whether two types are equal is a problem which occurs frequently in DHOL. Intuitively, we have $\Gamma \vdash^{\mathsf{d}}_T A \equiv A'$ if and only if their simply-typed skeletons, consisting of arrows and base types without their arguments, are equal and, given a base type $a\colon \Pi x_1\colon A_1.\ \cdots\ \Pi x_n\colon A_n.\ \mathtt{tp}$, an occurrence $a\ t_1 \dots t_n$ in $A$ and its corresponding occurrence $a\ t'_1 \dots t'_n$ in $A'$, we have $\Gamma \vdash^{\mathsf{d}}_T t_i =_{A_i[x_1/t_1, \dots, x_{i-1}/t_{i-1}]} t'_i$ for all $1 \le i \le n$. This makes DHOL an extensional type theory where already type checking is undecidable, as it requires theorem proving. Another difference from HOL is the importance of the chosen representation of contexts and theories: since the well-typedness of a term may depend on other assumptions, the order of the type declarations and formulas in a context Γ or theory T is relevant. A formal definition of the judgments $\Gamma \vdash^{\mathsf{d}}_T A\ \mathtt{tp}$, $\Gamma \vdash^{\mathsf{d}}_T s\colon A$, $\Gamma \vdash^{\mathsf{d}}_T s$ and $\Gamma \vdash^{\mathsf{d}}_T A \equiv B$ via an inference system is given in [17]. Since we use more primitive connectives, a minor variant is presented in Fig. 1.
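The two-part type-equality check described above (equal simply-typed skeletons plus pairwise equality obligations on base-type arguments) can be sketched as follows; the type encoding is a hypothetical illustration, and the dependent substitution in the subscript of the obligations is ignored:

```python
# Types: ("base", name, args) for a t1 ... tn, ("arrow", A, B) for A → B.
def skeleton(ty):
    """Erase term arguments of base types, keeping only the simple shape."""
    if ty[0] == "base":
        return ("base", ty[1])
    _, a, b = ty
    return ("arrow", skeleton(a), skeleton(b))

def equality_obligations(ty1, ty2):
    """If the skeletons match, return the pairs (t_i, t'_i) of base-type
    arguments that must be proved equal for ty1 ≡ ty2; else None."""
    if ty1[0] != ty2[0]:
        return None
    if ty1[0] == "base":
        if ty1[1] != ty2[1] or len(ty1[2]) != len(ty2[2]):
            return None
        return list(zip(ty1[2], ty2[2]))
    obs1 = equality_obligations(ty1[1], ty2[1])
    obs2 = equality_obligations(ty1[2], ty2[2])
    return None if obs1 is None or obs2 is None else obs1 + obs2
```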

*Example 1.* Consider the simple base types nat: tp and elem: tp as well as the dependent base type lst: Πx: nat. tp. The constants and functions

```
0: nat s: nat → nat
nil: lst 0 cons: Πn: nat. elem → lst n → lst (s n)
```
provide means to represent their inhabitants. Additionally, we define functions plus: nat → nat → nat

$$\forall n\colon \mathtt{nat}.\ \mathtt{plus}\ 0\ n =_{\mathtt{nat}} n \qquad \forall n, m\colon \mathtt{nat}.\ \mathtt{plus}\ (\mathtt{s}\ n)\ m =_{\mathtt{nat}} \mathtt{s}\ (\mathtt{plus}\ n\ m)$$

and app: Πn: nat. Πm: nat. lst n → lst m → lst (plus n m):

$$\forall n\colon \mathtt{nat}, x\colon \mathtt{lst}\ n.\ \mathtt{app}\ 0\ n\ \mathtt{nil}\ x =_{\mathtt{lst}\ n} x$$

$$\forall n, m\colon \mathtt{nat}, z\colon \mathtt{elem}, x\colon \mathtt{lst}\ n, y\colon \mathtt{lst}\ m.\ \mathtt{app}\ (\mathtt{s}\ n)\ m\ (\mathtt{cons}\ n\ z\ x)\ y =_{\mathtt{lst}\ (\mathtt{s}\ (\mathtt{plus}\ n\ m))} \mathtt{cons}\ (\mathtt{plus}\ n\ m)\ z\ (\mathtt{app}\ n\ m\ x\ y)$$

In the defining equations of app, we annotated the equality sign with the dependent type of the term on the right-hand side. In all cases, the simply-typed skeleton is just lst but for a type check we need to prove the two equalities

$$\forall n\colon \mathtt{nat}.\ \mathtt{plus}\ 0\ n =_{\mathtt{nat}} n \qquad \forall n, m\colon \mathtt{nat}.\ \mathtt{plus}\ (\mathtt{s}\ n)\ m =_{\mathtt{nat}} \mathtt{s}\ (\mathtt{plus}\ n\ m)$$

which are exactly the corresponding axioms for plus. Type checking the conjecture

$$\forall n\colon \mathtt{nat}, x\colon \mathtt{lst}\ n.\ \mathtt{app}\ n\ 0\ x\ \mathtt{nil} =_{\mathtt{lst}\ n} x$$

would require proving $\forall n\colon \mathtt{nat}.\ \mathtt{plus}\ n\ 0 =_{\mathtt{nat}} n$, which can be achieved by induction on natural numbers if we include the Peano axioms.

### 2.3 Erasure

The following definition presents the translation from DHOL to HOL due to Rothgang et al. [17]. Intuitively, the translation erases dependent types to their simply typed skeletons by ignoring arguments of base types. The thereby lost


Fig. 1. Natural Deduction Calculus for DHOL

information on concrete base type arguments is restored with the help of a partial equivalence relation (PER) $A^*$ for each type $A$. A PER is a symmetric and transitive relation. The elements on which it is also reflexive are intended to be the members of the original dependent type, i.e., $\Gamma \vdash^{\mathsf{d}}_T s\colon A$ if and only if $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} A^*\ \overline{s}\ \overline{s}$.

Definition 1. *The translation from DHOL to HOL is given by the erasure function* s *as well as* A<sup>∗</sup> *which computes the formula representing the corresponding PER of a type* A*. The functions are mutually defined by recursion on the grammar of DHOL. The erasure of a theory (context) is defined as the theory (context) which consists of its erased components.*


$$\overline{a\colon \Pi x_1\colon A_1.\ \cdots\ \Pi x_n\colon A_n.\ \mathtt{tp}} = a\colon \mathtt{tp},\ a^*\colon \overline{A_1} \to \dots \to \overline{A_n} \to a \to a \to o,\ a_{\mathtt{per}}$$

$$\begin{aligned} o^*\ s\ t &= s =_o t \\ (a\ t_1\ \dots\ t_n)^*\ s\ t &= a^*\ \overline{t_1}\ \dots\ \overline{t_n}\ s\ t \\ (\Pi x\colon A.B)^*\ s\ t &= \forall x, y\colon \overline{A}.\ A^*\ x\ y \Rightarrow B^*\ (s\ x)\ (t\ y) \end{aligned}$$

*Here,* aper *is defined as follows:*

$$a_{\mathtt{per}} = \forall x_1\colon \overline{A_1}.\ \dots\ \forall x_n\colon \overline{A_n}.\ \forall u, v\colon a.\ a^*\ x_1\ \dots\ x_n\ u\ v \Rightarrow u =_a v$$

Theorem 1 (Completeness [17]).

- *if* $\Gamma \vdash^{\mathsf{d}}_T A\colon \mathtt{tp}$ *then* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \overline{A}\colon \mathtt{tp}$ *and* $A^*$ *is a PER over* $\overline{A}$
- *if* $\Gamma \vdash^{\mathsf{d}}_T A \equiv B$ *then* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \forall x, y\colon \overline{A}.\ A^*\ x\ y =_o B^*\ x\ y$
- *if* $\Gamma \vdash^{\mathsf{d}}_T s\colon A$ *then* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \overline{s}\colon \overline{A}$ *and* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} A^*\ \overline{s}\ \overline{s}$
- *if* $\Gamma \vdash^{\mathsf{d}}_T s$ *then* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \overline{s}$

### Theorem 2 (Soundness [17]).

- *if* $\Gamma \vdash^{\mathsf{d}}_T s\colon o$ *and* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \overline{s}$ *then* $\Gamma \vdash^{\mathsf{d}}_T s$
- *if* $\Gamma \vdash^{\mathsf{d}}_T s\colon A$, $\Gamma \vdash^{\mathsf{d}}_T t\colon A$ *and* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} A^*\ \overline{s}\ \overline{t}$ *then* $\Gamma \vdash^{\mathsf{d}}_T s =_A t$

Note that the erasure treats simple types and dependent types in the same way. In the following, we define a post-processing function Φ on top of the original erasure [17] which allows us to erase to simpler but equivalent formulas. The goal of Φ is to replace $A^*\ s\ t$, where $A$ is a simple type, by $s =_A t$. As a consequence, the guard $A^*\ x\ x$ in $\forall x\colon A.\,s$ for simple types $A$ can be removed. The following definition gives a presentation of Φ as a pattern rewrite system [13].

Definition 2. *Given a HOL term* s*, we define* Φ(s) *to be the HOL term which results from applying the following pattern rewrite rules exhaustively to all subterms in a bottom-up fashion:*

$$\begin{aligned} a^\* \ F \ G \ \ &\rightarrow \ \ F =\_a \ G\\ \forall x, y \colon A. \ (x =\_A y) \Rightarrow (\ F \ x =\_B G \ y) \ &\rightarrow \ \ F =\_{A \to B} G\\ \forall x \colon A. \ (x =\_A x) \Rightarrow \ F \ x \ &\rightarrow \ \forall x \colon A. \ F \ x \end{aligned}$$

*Here,* F, G *are free variables for terms,* a<sup>∗</sup> *denotes the constant for the PER of a simple base type* a *and* A, B *are placeholders for simple types. Given a HOL theory* T*, there are finitely many instances for* a<sup>∗</sup> *but infinite choices for* A *and* B*, so the pattern rewrite system is infinite.*

Lemma 1. *Assume* $\Gamma \vdash^{\mathsf{d}}_T s\colon o$*. Then* $\Gamma \vdash^{\mathsf{d}}_T s$ *if and only if* $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \Phi(\overline{s})$*.*

*Proof.* Since the erasure is sound and complete (Theorem 2 and Theorem 1), it suffices to show that $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \Phi(\overline{s})$ if and only if $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} \overline{s}$. Consider the rules from Definition 2. $\Phi(\overline{s})$ is well-defined: clearly, the rules terminate, and confluence follows from the lack of critical pairs [13]. Hence, it is sufficient to prove $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} l$ if and only if $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} r$ for every rule $l \to r$ in Definition 2. For the first rule, assume $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} a^*\ F\ G$. Since $a_{\mathtt{per}} \in \overline{T}$, we have $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} F =_a G$. Now assume $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} F =_a G$. Since $F$ has type $a$, we obtain $\Gamma \vdash^{\mathsf{d}}_T F =_a F$. Completeness of the erasure yields $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} a^*\ F\ F$. Now the assumption allows us to replace equals by equals, so we conclude $\overline{\Gamma} \vdash^{\mathsf{s}}_{\overline{T}} a^*\ F\ G$. The desired result for the second rule follows from extensionality. Finally, the third rule is an easy logical simplification. □

Given a theory T (context Γ) we write Φ(T) (Φ(Γ)) to denote its erased version where formulas have been simplified with Φ.

Corollary 1. *Assume* $\Gamma \vdash^{\mathsf{d}}_T s\colon o$*. Then* $\Gamma \vdash^{\mathsf{d}}_T s$ *if and only if* $\Phi(\Gamma) \vdash^{\mathsf{s}}_{\Phi(T)} \Phi(\overline{s})$*.*

*Example 2.* Consider again the axiom recursively defining app from Example 1

$$\forall n, m\colon \mathtt{nat}, z\colon \mathtt{elem}, x\colon \mathtt{lst}\ n, y\colon \mathtt{lst}\ m.\ \mathtt{app}\ (\mathtt{s}\ n)\ m\ (\mathtt{cons}\ n\ z\ x)\ y =_{\mathtt{lst}\ (\mathtt{s}\ (\mathtt{plus}\ n\ m))} \mathtt{cons}\ (\mathtt{plus}\ n\ m)\ z\ (\mathtt{app}\ n\ m\ x\ y)$$
which we refer to as $s_{\mathtt{app}}$. Its post-processed erasure $\Phi(\overline{s_{\mathtt{app}}})$ is given by the following formula, which is simpler than $\overline{s_{\mathtt{app}}}$:

$$\forall n, m\colon \mathtt{nat}, z\colon \mathtt{elem}, x\colon \mathtt{lst}.\ \mathtt{lst}^*\ n\ x\ x \Rightarrow \forall y\colon \mathtt{lst}.\ \mathtt{lst}^*\ m\ y\ y \Rightarrow \mathtt{lst}^*\ (\mathtt{s}\ (\mathtt{plus}\ n\ m))\ (\mathtt{app}\ (\mathtt{s}\ n)\ m\ (\mathtt{cons}\ n\ z\ x)\ y)\ (\mathtt{cons}\ (\mathtt{plus}\ n\ m)\ z\ (\mathtt{app}\ n\ m\ x\ y))$$

# 3 Tableau Calculus for DHOL

### 3.1 Rules

The tableau calculus from [1,7] is the basis of Satallax [4] and its fork Lash [6]. We present an extension of this calculus from HOL to DHOL by extending the rules to DHOL as well as providing tableau rules for the translation from DHOL to HOL. A *branch* is a 3-tuple $(T, \Gamma, \Gamma')$ which is *well-formed* if $\vdash^{\mathsf{d}} T\ \mathsf{Thy}$, $\vdash^{\mathsf{d}}_T \Gamma\ \mathsf{Ctx}$ and $\vdash^{\mathsf{s}}_{\Phi(T)} \Gamma'\ \mathsf{Ctx}$. Intuitively, the theory contains the original problem and remains untouched while the contexts grow by the application of rules. Furthermore, DHOL and HOL are represented separately: for DHOL, the theory $T$ and context $\Gamma$ are used, while HOL has a separate context $\Gamma'$ with respect to the underlying theory $\Phi(T)$. In particular, each rule in Fig. 2 really stands for two rules: one that operates in DHOL and the original version that operates in HOL. Except for the erasure rules TER<sup>1</sup> and TER<sup>2</sup>, which add formulas to the HOL context based on information from the DHOL theory and context, the rules always stay in DHOL or HOL, respectively. More formally, a *step* is an $n+1$ tuple $(T, \Gamma, \Gamma'), (T, \Gamma_1, \Gamma'_1), \dots, (T, \Gamma_n, \Gamma'_n)$ of branches where $\bot \notin T, \Gamma, \Gamma'$ and either $\Gamma \subset \Gamma_i$ and $\Gamma' = \Gamma'_i$ for all $1 \le i \le n$, or $\Gamma = \Gamma_i$ and $\Gamma' \subset \Gamma'_i$ for all $1 \le i \le n$. Given a step $A, A_1, \dots, A_n$, the branch $A$ is called its *head* and each $A_i$ is an *alternative*.

A *rule* is a set of steps defined by a schema. For example, the rule T<sup>⇒</sup> from Fig. 2 denotes the set of steps $(T, \Gamma, \Gamma'), (T, \Gamma_1, \Gamma'_1), (T, \Gamma_2, \Gamma'_2)$ where $\bot \notin T, \Gamma, \Gamma'$ and either $s \Rightarrow t \in T, \Gamma$ or $s \Rightarrow t \in \Phi(T), \Gamma'$. In the former case, we have $\Gamma_1 = \Gamma, \neg s$ and $\Gamma_2 = \Gamma, t$ as well as $\Gamma' = \Gamma'_1 = \Gamma'_2$. The latter case is the same but with the primed and unprimed variants swapped.
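Schematically, the two instances of such a rule can be illustrated as follows; this is a toy encoding for exposition, not Lash's internals:

```python
# Branches as (theory, gamma, gamma_prime), each a tuple of formulas.
# Formulas are tuples such as ("imp", s, t) and ("not", s).
def apply_imp(branch, formula):
    """Given s ⇒ t on the branch, produce the two alternatives extending
    it with ¬s and with t, in the DHOL context or the HOL context
    depending on where the premise lives."""
    theory, gamma, gamma_p = branch
    tag, s, t = formula
    assert tag == "imp"
    if formula in theory + gamma:               # premise in the DHOL part
        return [(theory, gamma + (("not", s),), gamma_p),
                (theory, gamma + (t,), gamma_p)]
    return [(theory, gamma, gamma_p + (("not", s),)),   # HOL part
            (theory, gamma, gamma_p + (t,))]
```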

In the original tableau calculus [1,7], normalization is defined with respect to an axiomatized generic operator [·]. As one would expect, one of these axioms states that the operator does not change the semantics of a given term. Since there is no formal definition of DHOL semantics yet, we simply use [s] to denote the βη-normal form of s which is in accordance with our implementation.

A rule *applies* to a branch A if some step in the rule has A as its head. A tableau calculus is a set of steps. Let $\mathcal{T}$ be the tableau calculus defined by the rules in Fig. 2. The side condition of freshness in T¬∀ means that for a given step with head $(T, \Gamma, \Gamma')$ there is no type $A$ such that $y\colon A \in T, \Gamma$ or $y\colon A \in \Phi(T), \Gamma'$, and we additionally require that there is no name $x$ such that $\neg[s\ x] \in T, \Gamma$ or $\neg[s\ x] \in \Phi(T), \Gamma'$. In practice, this means that T¬∀ can be applied at most once to every formula. Furthermore, the side condition $t\colon A$ in the rule T<sup>∀</sup> means that either $\Gamma \vdash^{\mathsf{d}}_T t\colon A$ or $\Gamma' \vdash^{\mathsf{s}}_{\Phi(T)} t\colon A$, depending on whether the premise is in $T, \Gamma$ or $\Phi(T), \Gamma'$. The side condition $s\colon o$ in the rule TER<sup>1</sup> means that $\Gamma' \vdash^{\mathsf{s}}_{\Phi(T)} s\colon o$. This is to prevent application of TER<sup>1</sup> before the necessary type information has been obtained by applying TER<sup>2</sup>.

The set of $\mathcal{T}$*-refutable* branches is defined inductively: if $\bot \in T, \Gamma, \Gamma'$, then $(T, \Gamma, \Gamma')$ is refutable. If $A, A_1, \dots, A_n$ is a step in $\mathcal{T}$ and every alternative $A_i$ is refutable, then $A$ is refutable.

The rules in Fig. 2 strongly resemble the tableau calculus from [1]. In order to support DHOL, we replaced simple types by their dependent counterparts. To that end, we tried to remain as simple as possible by only allowing syntactically equivalent types in T<sup>∀</sup> and TCON: adding a statement like $A \equiv A'$ as a premise would change the tableau calculus as well as the automated proof search significantly, so these situations are handled by the erasure, for which the additional rules TER<sup>1</sup> and TER<sup>2</sup> are responsible.

Fig. 2. Tableau rules for DHOL

It is known that the restriction of T to HOL (without TER<sup>1</sup> and TER<sup>2</sup> ) is sound and complete with respect to Henkin semantics [1,7]. Furthermore, due to Corollary 1, the rules TER<sup>1</sup> and TER<sup>2</sup> define a sound and complete translation from DHOL to HOL with respect to Rothgang et al.'s definition of provability in DHOL [17].

### 3.2 Soundness and Completeness

In general, a soundness result based on the refutability of a branch $(T, \Gamma, \Gamma')$ is desirable. If there were a definition of semantics for DHOL which is a conservative extension of Henkin semantics, the proof could just refer to the satisfiability of $T, \Gamma, \Gamma'$. Unfortunately, this is not the case. Note that an appropriate definition of semantics is out of the scope of this paper: in addition to its conception, we would have to prove soundness and completeness of $\vdash^{\mathsf{d}}$ on top of the corresponding proofs for our novel tableau calculus. Therefore, soundness and completeness of the tableau calculus will be established with respect to provability in DHOL or HOL. Unfortunately, this requirement complicates the proof tremendously, as a refutation can contain a mixture of DHOL, erasure and HOL rules. We would have to consider both HOL and DHOL and establish a correspondence between $\Gamma$ and $\Gamma'$ which is difficult to put succinctly and seems to be impossible without further restricting the notion of a well-formed branch. Therefore, we prove soundness and completeness with respect to a notion of refutability which has three stages: at the beginning, only DHOL rules are applied; the second stage is solely for the erasure; and in the last phase, only HOL rules are applied. Note that this notion of refutability includes the sound but incomplete strategy of only using native DHOL rules as well as the sound and complete strategy of exclusively working with the erasure.

Definition 3. *A branch* $(T, \Gamma, \Gamma')$ *is* s-refutable *if it is refutable with respect to the HOL rules.*

#### Lemma 2. *A well-formed branch* $(T, \Gamma, \Gamma')$ *is s-refutable* $\iff \Gamma' \vdash^{\mathsf{s}}_{\Phi(T)} \bot$*.*

*Proof.* Immediate from soundness and completeness of the original HOL calculus as well as soundness and completeness of $\vdash^{\mathsf{s}}$.

Definition 4. *The set of* e-refutable *branches is inductively defined as follows: if* $(T, \Gamma, \Gamma')$ *is s-refutable and* $\Gamma' \subseteq \Phi(\Gamma)$*, then it is e-refutable. If* $\langle A, A_1 \rangle \in \mathcal{T}_{\mathrm{ER}_1} \cup \mathcal{T}_{\mathrm{ER}_2}$ *and* $A_1$ *is e-refutable, then* $A$ *is e-refutable.*

Lemma 3. *If* $(T, \Gamma, \Gamma')$ *is well-formed and e-refutable, then* $\Phi(\Gamma) \vdash^{\mathsf{s}}_{\Phi(T)} \bot$*.*

*Proof.* Let $(T, \Gamma, \Gamma')$ be well-formed and e-refutable. We proceed by induction on the definition of e-refutability. If $(T, \Gamma, \Gamma')$ is s-refutable, then $\Gamma' \vdash^{\mathsf{s}}_{\Phi(T)} \bot$ by Lemma 2. Since $\Gamma' \subseteq \Phi(\Gamma)$, we also have $\Phi(\Gamma) \vdash^{\mathsf{s}}_{\Phi(T)} \bot$. For the induction step, let $(T, \Gamma, \Gamma'), (T, \Gamma, \Gamma'_1)$ be a step with either $\mathcal{T}_{\mathrm{ER}_1}$ or $\mathcal{T}_{\mathrm{ER}_2}$ and assume that the branch $(T, \Gamma, \Gamma'_1)$ is e-refutable. Since well-formedness of $(T, \Gamma, \Gamma'_1)$ follows from the well-formedness of $(T, \Gamma, \Gamma')$, the induction hypothesis yields $\Phi(\Gamma) \vdash^{\mathsf{s}}_{\Phi(T)} \bot$ as desired.

Definition 5. *The set of* d-refutable *branches is inductively defined as follows: if* $(T, \Gamma, \cdot)$ *is e-refutable or* $\bot \in T, \Gamma$*, then it is d-refutable. If* $\langle A, A_1, \dots, A_n \rangle \in \mathcal{T} \setminus (\mathcal{T}_{\mathrm{ER}_1} \cup \mathcal{T}_{\mathrm{ER}_2})$ *and every alternative* $A_i$ *is d-refutable, then* $A$ *is d-refutable.*

Next, we have to prove soundness of every DHOL rule. For most of the rules, this is rather straightforward. We show soundness of $\mathcal{T}_{\mathrm{FE}}$, $\mathcal{T}_{\mathrm{FQ}}$ and $\mathcal{T}_{\mathrm{DEC}}$ as representative cases and start with an auxiliary lemma.

Lemma 4. *Assume* $\Gamma \vdash^{\mathsf{d}}_T s\colon o$*. We have* $\Gamma \vdash^{\mathsf{d}}_T s$ *if and only if* $\Gamma \vdash^{\mathsf{d}}_T [s]$*.*

*Proof.* By the beta and eta rules, we have $\Gamma \vdash^{\mathsf{d}}_T s =_o [s]$. Using cong, we obtain the desired result in both directions.

Lemma 5 ($\mathcal{T}_{\mathrm{FE}}$). *Let* $(T, \Gamma, \Gamma')$ *be a well-formed branch. Choose* $x$ *such that* $x \notin \mathcal{V}_s \cup \mathcal{V}_t$ *and assume* $s \neq_{\Pi x\colon A.B} t \in T, \Gamma$*. If* $\Gamma, \neg[\forall x\colon A.\, sx = tx] \vdash^{\mathsf{d}}_T \bot$*, then* $\Gamma \vdash^{\mathsf{d}}_T \bot$*.*

*Proof.* From the assumptions and Lemma 4, we obtain $\Gamma \vdash^{\mathsf{d}}_T s \neq_{\Pi x\colon A.B} t$ and $\Gamma \vdash^{\mathsf{d}}_T \forall x\colon A.\, sx =_B tx$. Furthermore, an application of $\forall$e yields $\Gamma, x\colon A \vdash^{\mathsf{d}}_T sx =_B tx$. Using cong$\lambda$, we get $\Gamma \vdash^{\mathsf{d}}_T (\lambda x\colon A.\, sx) =_{\Pi x\colon A.B} (\lambda x\colon A.\, tx)$. Hence, we can apply eta ($x \notin \mathcal{V}_s \cup \mathcal{V}_t$), sym and the admissible rule trans [18], which says that equality is transitive, to get $\Gamma \vdash^{\mathsf{d}}_T s =_{\Pi x\colon A.B} t$ and therefore $\Gamma \vdash^{\mathsf{d}}_T \bot$.

Lemma 6 ($\mathcal{T}_{\mathrm{FQ}}$). *Let* $(T, \Gamma, \Gamma')$ *be a well-formed branch. Assume* $s =_{\Pi x\colon A.B} t \in T, \Gamma$ *and* $x \notin \mathcal{V}_s \cup \mathcal{V}_t$*. If* $\Gamma, [\forall x\colon A.\, sx = tx] \vdash^{\mathsf{d}}_T \bot$*, then* $\Gamma \vdash^{\mathsf{d}}_T \bot$*.*

*Proof.* From the assumptions, $\Gamma \vdash^{\mathsf{d}}_T \neg[s] =_o [\neg s]$, cong and Lemma 4, we obtain $\Gamma \vdash^{\mathsf{d}}_T s =_{\Pi x\colon A.B} t$ and $\Gamma \vdash^{\mathsf{d}}_T \neg\forall x\colon A.\, sx =_B tx$. Furthermore, we have $\Gamma, x\colon A \vdash^{\mathsf{d}}_T sx =_B tx$ by refl and congAppl. Hence, $\forall$i yields $\Gamma \vdash^{\mathsf{d}}_T \forall x\colon A.\, sx =_B tx$ and we conclude by an application of $\neg$e.

Lemma 7 ($\mathcal{T}_{\mathrm{DEC}}$). *Let* $(T, \Gamma, \Gamma')$ *be a well-formed branch. Assume*

$$x\ s_1\ \dots\ s_n \neq_{a\, u_1 \dots u_m} x\ t_1\ \dots\ t_n \in T, \Gamma$$

*and* $\Gamma \vdash^{\mathsf{d}}_T x\colon \Pi y_1\colon A_1 \cdots \Pi y_n\colon A_n.\, a\, u'_1 \dots u'_m$ *where* $u_i = u'_i[y_1/s_1, \dots, y_n/s_n]$ *for* $1 \leq i \leq m$*. If* $\Gamma, s_i \neq_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i \vdash^{\mathsf{d}}_T \bot$ *for all* $1 \leq i \leq n$*, then* $\Gamma \vdash^{\mathsf{d}}_T \bot$*.*

*Proof.* From the assumptions, we obtain $\Gamma \vdash^{\mathsf{d}}_T s_i =_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i$ for all $1 \leq i \leq n$ and $\Gamma \vdash^{\mathsf{d}}_T x =_{\Pi y_1\colon A_1 \cdots \Pi y_n\colon A_n.\, a\, u'_1 \dots u'_m} x$. Hence, $n$ applications of the congruence rule for application yield $\Gamma \vdash^{\mathsf{d}}_T x\ s_1\ \dots\ s_n =_{a\, u_1 \dots u_m} x\ t_1\ \dots\ t_n$. Since we also have $\Gamma \vdash^{\mathsf{d}}_T x\ s_1\ \dots\ s_n \neq_{a\, u_1 \dots u_m} x\ t_1\ \dots\ t_n$, we obtain $\Gamma \vdash^{\mathsf{d}}_T \bot$.

Now we are ready to prove the soundness result for $\mathcal{T}$.

#### Theorem 3. *If* $(T, \Gamma, \cdot)$ *is well-formed and d-refutable, then* $\Gamma \vdash^{\mathsf{d}}_T \bot$*.*

*Proof.* Let $(T, \Gamma, \cdot)$ be well-formed and d-refutable. We proceed by induction on the definition of d-refutability. If $(T, \Gamma, \cdot)$ is e-refutable, the result follows from Lemma 3 together with Corollary 1. If $\bot \in T, \Gamma$, then clearly $\Gamma \vdash^{\mathsf{d}}_T \bot$. For the inductive case, consider a step $(T, \Gamma, \cdot), (T, \Gamma_1, \cdot), \dots, (T, \Gamma_n, \cdot)$ with some DHOL rule. Since $(T, \Gamma, \cdot)$ is d-refutable, all alternatives must be d-refutable. If we manage to show well-formedness of every alternative, we can apply the induction hypothesis to obtain $\Gamma_i \vdash^{\mathsf{d}}_T \bot$ for all $1 \leq i \leq n$. Then, we can conclude $\Gamma \vdash^{\mathsf{d}}_T \bot$ by soundness of the DHOL rules. Hence, it remains to prove well-formedness of the alternatives. In most cases, this is straightforward. We only show one interesting case, namely $\mathcal{T}_{\mathrm{DEC}}$.

Instead of proving $\Gamma, s_i \neq_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i \vdash^{\mathsf{d}}_T \bot$ for all $1 \leq i \leq n$, we show that $\Gamma \vdash^{\mathsf{d}}_T s_i =_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i$ for all $1 \leq i \leq n$. Since $(T, \Gamma, \cdot)$ is a well-formed branch, both $s_1$ and $t_1$ have type $A_1$. Hence, $(T, (\Gamma, s_1 \neq_{A_1} t_1), \cdot)$ is well-formed and our original induction hypothesis yields $\Gamma, s_1 \neq_{A_1} t_1 \vdash^{\mathsf{d}}_T \bot$, from which we obtain $\Gamma \vdash^{\mathsf{d}}_T s_1 =_{A_1} t_1$. Now let $i \leq n$ and assume we have $\Gamma \vdash^{\mathsf{d}}_T s_j =_{A_j[y_1/s_1, \dots, y_{j-1}/s_{j-1}]} t_j$ for all $j < i$ (∗). This is only possible if $\Gamma \vdash^{\mathsf{d}}_T t_j\colon A_j[y_1/s_1, \dots, y_{j-1}/s_{j-1}]$ for all $j < i$. Since $(T, \Gamma, \cdot)$ is a well-formed branch, it is clear that $\Gamma \vdash^{\mathsf{d}}_T s_i\colon A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]$ and $\Gamma \vdash^{\mathsf{d}}_T t_i\colon A_i[y_1/t_1, \dots, y_{i-1}/t_{i-1}]$. From (∗), we obtain

$$\Gamma \vdash^{\mathsf{d}}_T t_i\colon A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}],$$

so $(T, (\Gamma, s_i \neq_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i), \cdot)$ is well-formed. Hence, the original induction hypothesis yields $\Gamma \vdash^{\mathsf{d}}_T s_i =_{A_i[y_1/s_1, \dots, y_{i-1}/s_{i-1}]} t_i$ as desired.

In the previous proof, we can see that for $\mathcal{T}_{\mathrm{DEC}}$, well-formedness of an alternative depends on refutability of all branches to its left. Note that the same holds for $\mathcal{T}_{\mathrm{MAT}}$ and $\mathcal{T}_{\Rightarrow}$. This is a distinguishing feature of DHOL, as in tableaux, branches are usually considered to be independent.

Finally, completeness is immediate from the completeness of the HOL tableau calculus and the erasure:

Theorem 4. *If* $\Gamma \vdash^{\mathsf{d}}_T \bot$*, then* $(T, \Gamma, \cdot)$ *is d-refutable.*

*Proof.* Let $\Gamma \vdash^{\mathsf{d}}_T \bot$. Using Corollary 1 and Lemma 2, we conclude s-refutability of $(T, \Gamma, \Phi(\Gamma))$. By definition, $(T, \Gamma, \Phi(\Gamma))$ is also e-refutable. Furthermore, by inspecting $\mathcal{T}_{\mathrm{ER}_1}$ and $\mathcal{T}_{\mathrm{ER}_2}$, we conclude that $(T, \Gamma, \cdot)$ is also e-refutable and therefore d-refutable.

# 4 Implementation

We implemented the tableau calculus for DHOL as an extension of Lash [6], which is a fork of Satallax, a successful automated theorem prover for HOL [4]. By providing an efficient C implementation of terms with perfect sharing as well as other important data structures and operations, Lash outperforms Satallax when it comes to the basic ground tableau calculus which both of them implement. However, Lash removes a lot of the additional features beyond the basic calculus that were implemented in Satallax. Nevertheless, this was actually beneficial for our purpose, as we could concentrate on adapting the core part. Note that Lash and Satallax do not just implement the underlying ground tableau calculus but make heavy use of SAT solving and a highly customizable priority queue to guide the proof search [4,5].

For the extension of Lash to DHOL, the data structure for terms had to be changed to support dependent function types as well as quantifiers and lambda abstractions with dependent types. Of course, it would be possible to represent everything in the language of DHOL, but the formulation of DHOL suggests that the prover should do as much as possible in the HOL fragment and only use "proper" DHOL when it is really necessary. With this in mind, the parser always tries to produce simply-typed terms first and only resorts to dependent types when it is unavoidable. Therefore, the input problem often looks like a mixture of HOL and DHOL even though everything is included in DHOL. A nice side effect of this design decision is that our extension of Lash works exactly like the original version on the HOL fragment, except that it is expected to be slower due to the numerous case distinctions between simple types and dependent types which are needed in this setting.

Although DHOL is not officially part of TPTP THF, it can be expressed due to the existence of the !>-symbol which is used for polymorphism. Hence, a type $\Pi x\colon A.B$ is represented as !>[X:A]:B. For simplicity and efficiency reasons, we did not implement dependent types by distinguishing base types from their term arguments but represent the whole dependent type as a term. When parsing a base type a, Lash automatically creates an eponymous constant of type tp to be used in dependent types as well as a simple base type a<sup>0</sup> for the erasure and a constant a<sup>∗</sup> for its PER. The flags DHOL\_RULES\_ONLY and DHOL\_ERASURE\_ONLY control the availability of the native DHOL rules and of the erasure, respectively. Note that the implementation is not restricted to d-refutability but allows for arbitrary refutations. In the standard flag setting, however, only the native DHOL rules are used. Clearly, this constitutes a sound strategy. It is incomplete, since the confrontation rule only considers equations with syntactically equivalent types. We have more to say about this in Sect. 4.2.

# 4.1 Type Checking

By default, problems are only type-checked with respect to their simply-typed skeleton. If the option exactdholtypecheck is set, type constraints stemming from the term arguments of dependent base types are generated and added to the conjecture. The option typecheckonly discards the original conjecture, so Lash just tries to prove the type constraints. Since performing the type check involves proper theorem proving, we added the new SZS ontology statuses TypeCheck and InexactTypecheck to the standardized output of Lash. Here, the former means that a problem type checks, while the latter just states that it type checks with respect to its simply-typed skeleton.

For the generation of type constraints, each formula of the problem is traversed as in normal type checking. In addition, every time a type condition $a\ t_1 \dots t_n \equiv a\ s_1 \dots s_n$ comes up and there is some $i$ such that $s_i$ and $t_i$ are not syntactically equivalent, a constraint stating that $s_i = t_i$ is provable is added to the set of type constraints. Note that it does not always suffice to just add $s_i = t_i$, as this equation may contain bound variables or only hold in the context in which the constraint appears. To that end, we keep relevant information about the term context when generating these constraints. Whenever a universal quantifier or lambda abstraction comes up, it is translated to a corresponding universal quantifier in the context, since we want the constraint to hold in any case. While details like applications can be ignored, it is important to keep left-hand sides of implications in the context, as they may be crucial for the constraint to be met. In general, any axiom may contribute to the type-checking proof.
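The traversal just described can be sketched as follows. This is a minimal, hypothetical illustration: the tuple-based term encoding and the name `constraints` are ours, not Lash's (whose implementation is in C), and only the context-tracking idea is taken from the text.

```python
# Minimal sketch of type-constraint generation (hypothetical encoding).
# Formulas: ("forall", var, ty, body) | ("imp", lhs, rhs)
#         | ("tyeq", s, t)  -- an equation site whose dependent type
#                              indices are s and t
#         | ("prop", ...)   -- any other atomic proposition

def constraints(formula, ctx=()):
    """Collect type constraints, each closed under the context ctx of
    binders and assumed left-hand sides of implications."""
    tag = formula[0]
    if tag == "forall":
        _, var, ty, body = formula
        # a binder becomes a universal quantifier in the constraint
        return constraints(body, ctx + (("forall", var, ty),))
    if tag == "imp":
        _, lhs, rhs = formula
        # the right-hand side may additionally assume the left-hand side
        return constraints(lhs, ctx) + constraints(rhs, ctx + (("assume", lhs),))
    if tag == "tyeq":
        _, s, t = formula
        # no obligation if the indices are syntactically equal
        return [] if s == t else [(ctx, (s, t))]
    return []  # other atoms contribute no constraints in this sketch

# Example 3, schematically: forall n: nat, x: lst n. n = 0 => ...,
# where the equation site compares the indices plus n n and n.
goal = ("forall", "n", "nat",
        ("forall", "x", ("lst", "n"),
         ("imp", ("prop", "n = 0"),
          ("tyeq", ("plus", "n", "n"), "n"))))
cs = constraints(goal)
# one constraint: under n: nat, x: lst n and the assumption n = 0,
# prove plus n n = n
```

The single collected constraint corresponds to the provability obligation generated for Example 3 below.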

*Example 3.* The conjecture

$$\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ n =_{\mathsf{nat}} 0 \Rightarrow \mathsf{app}\ n\ n\ x\ x = x$$

is well-typed if the type constraint

$$\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ n =_{\mathsf{nat}} 0 \Rightarrow \mathsf{plus}\ n\ n =_{\mathsf{nat}} n$$

is provable. Lash can generate this constraint and finds a proof quickly using the axiom $\forall n\colon \mathsf{nat}.\ \mathsf{plus}\ 0\ n =_{\mathsf{nat}} n$.

Since conjunctions and disjunctions are internally translated to implications, it is important to note that we process formulas from left to right, i.e. for $x\colon \mathsf{lst}\ n$ and $y\colon \mathsf{lst}\ m$, the proposition $m \neq n \vee x = y$ type checks because we can assume $m = n$ to process $x = y$. Consequently, $x = y \vee m \neq n$ does not type check. As formulas are usually read from left to right, this is a natural adaptation of short-circuit evaluation in programming languages. Furthermore, it is in accordance with the presentation of Rothgang et al. [17] as well as the corresponding implementation in PVS [19]. As a matter of fact, PVS handles its undecidable type-checking problem in essentially the same way as our new version of Lash, by generating so-called *type correctness conditions* (TCCs).

### 4.2 Implementation of the Rules

Given the appropriate infrastructure for dependent types, the implementation of most rules in Fig. 2 is a straightforward extension of the original HOL implementation. For $\mathcal{T}_\forall$, the side condition $\Gamma \vdash^{\mathsf{d}}_T t\colon A$ is undecidable in general. It has been chosen to provide a simple characterization of the tableau calculus. Furthermore, it emphasizes that we do not instantiate with terms whose type does not literally match the type of the quantified variable. In the implementation, we keep a pool of possible instantiations for types $A$ which occur in the problem. The pool gets populated by terms of which we know that they have a given type because this information was available during parsing or proof search. Hence, we only instantiate with terms $t$ for which we already know that $\Gamma \vdash^{\mathsf{d}}_T t\colon A$ holds.
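As an illustration, such a type-indexed pool might look as follows. This is a hypothetical Python sketch under our own naming (`InstantiationPool`, `record`, `candidates`); Lash's real data structures are implemented in C and differ.

```python
# Hypothetical sketch of the instantiation pool: terms are filed under the
# exact (syntactic) type they are known to inhabit, and a quantifier over a
# type A is only instantiated with terms filed under literally that A.
from collections import defaultdict

class InstantiationPool:
    def __init__(self):
        self._by_type = defaultdict(list)

    def record(self, term, ty):
        """Register a term once its type became known during parsing
        or proof search."""
        if term not in self._by_type[ty]:
            self._by_type[ty].append(term)

    def candidates(self, ty):
        """Candidate instantiations for a quantifier over exactly ty."""
        return list(self._by_type[ty])

pool = InstantiationPool()
pool.record("nil", ("lst", "0"))
pool.record("y", ("lst", "n"))
# only the literally identical type matches; e.g. lst (plus 0 0) would
# not be matched against lst 0 here -- the erasure handles such cases
assert pool.candidates(("lst", "n")) == ["y"]
```

The deliberate restriction to syntactically identical types mirrors the side condition of $\mathcal{T}_\forall$ discussed above.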

Given an equation $s =_A t$, there are many candidate representations of $A$ modulo type equality. When we build an equation in the implementation, we usually use the type of the left-hand side. Since all native DHOL rules of the tableau calculus enforce syntactically equivalent types, this ambiguity with respect to the type of an equation leads to problems. For example, consider a situation where $\Gamma \vdash^{\mathsf{d}}_T s\colon A$, $\Gamma \vdash^{\mathsf{d}}_T t\colon B$ and $\Gamma \vdash^{\mathsf{d}}_T s =_A t$, which implies $\Gamma \vdash^{\mathsf{d}}_T A \equiv B$. During proof search, it could be that $t \neq_B s$ is established. Clearly, this is a contradiction which leads to a refutation, but the inequality is usually annotated with the type $B$, which makes the refutation inaccessible for our native DHOL rules. Therefore, we implemented rules along the lines of

$$\mathcal{T}_{\mathrm{SYMCAST}_1}\ \frac{s =_A t}{t =_B s}\ t\colon B \qquad\qquad \mathcal{T}_{\mathrm{SYMCAST}_2}\ \frac{s \neq_A t}{t \neq_B s}\ t\colon B$$

which do not only apply symmetry but also change the type of the equality in a sound way. Like in $\mathcal{T}_\forall$, the side condition should be read as $\Gamma \vdash^{\mathsf{d}}_T t\colon B$, which makes it undecidable. However, in practice, we can compute a representative of the type of $t$ given the available type information. While experimenting with the new DHOL version of Lash, the implementation of these rules proved to be very beneficial for refutations which only work with the DHOL rules. For the future, it is important to note that $\mathcal{T}_{\mathrm{SYMCAST}_1}$ and $\mathcal{T}_{\mathrm{SYMCAST}_2}$ are not sound for the extension of DHOL with predicate subtypes, as $\Gamma \vdash^{\mathsf{d}}_T s =_A t$ and $\Gamma \vdash^{\mathsf{d}}_T t\colon B$ do not imply $\Gamma \vdash^{\mathsf{d}}_T A \equiv B$ anymore.

#### 4.3 Generating Instantiations

Since Lash implements a ground tableau calculus, it does not support higher-order unification. Therefore, the generation of suitable instantiations is a major issue. In the case of DHOL, it is actually beneficial that Lash already implements other means of generating instantiations, since the availability of unification for DHOL is questionable: there exist unification procedures for dependent type theories (see for example [11]), but for DHOL such a procedure would also have to address the undecidable type equality problem.

For simple base types, it suffices to consider so-called *discriminating* terms to remain complete [1]. A term $s$ of simple base type $a$ is discriminating in a branch $A$ if $s =_a t \in A$ or $t =_a s \in A$ for some term $t$. For function terms, completeness is guaranteed by enumerating all possible terms of a given type. Of course, this is highly impractical, and there is the important flag INITIAL\_SUBTERMS\_AS\_INSTANTIATIONS which adds all subterms of the initial problem as instantiations. This heuristic works very well in many cases.

For dependent types, we do not check for type equality when instantiating quantifiers but only use instantiations with the exact same type (cf. $\mathcal{T}_\forall$ in Fig. 2) and let the erasure handle the remaining cases.

An interesting feature of this new version of Lash is the possibility to automatically generate instantiations for induction axioms. Given the constraints of the original implementation, the easiest way to sneak a term into the pool of instantiations is to include it in an easily provable lemma and then use the flag INITIAL\_SUBTERMS\_AS\_INSTANTIATIONS. However, this adds unnecessary proof obligations, so we modified the implementation such that the initial subterms used as instantiations also include lambda abstractions corresponding to universal quantifiers.

*Example 4.* Consider the induction axiom for lists:

$$\begin{aligned} &\forall p\colon (\Pi n\colon \mathsf{nat}.\ \mathsf{lst}\ n \to o).\ p\ 0\ \mathsf{nil} \\ &\Rightarrow (\forall n\colon \mathsf{nat}, x\colon \mathsf{elem}, y\colon \mathsf{lst}\ n.\ p\ n\ y \Rightarrow p\ (\mathsf{s}\ n)\ (\mathsf{cons}\ n\ x\ y)) \\ &\Rightarrow (\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ p\ n\ x) \end{aligned}$$

Even though it works for arbitrary predicates $p$, it is in general very hard for an ATP system to guess the correct instance for a given problem without unification. However, given the conjecture $\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ \mathsf{app}\ n\ 0\ x\ \mathsf{nil} =_{\mathsf{lst}\ n} x$, we can easily read off the correct instantiation for $p$ where $\forall$ is replaced by $\lambda$.
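Concretely, the instantiation read off in this way is (our explicit rendering of the $\lambda$-term the text describes):

$$p := \lambda n\colon \mathsf{nat}.\ \lambda x\colon \mathsf{lst}\ n.\ \mathsf{app}\ n\ 0\ x\ \mathsf{nil} =_{\mathsf{lst}\ n} x,$$

which has the type $\Pi n\colon \mathsf{nat}.\ \mathsf{lst}\ n \to o$ required by the induction axiom.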

# 5 Case Study: List Reversal Is an Involution

Consider the following equational definition of the list reversal function rev:

$$\begin{aligned} \mathsf{rev}\ 0\ \mathsf{nil} &=_{\mathsf{lst}\ 0} \mathsf{nil} \\ \forall n\colon \mathsf{nat}, x\colon \mathsf{elem}, y\colon \mathsf{lst}\ n.\ \mathsf{rev}\ (\mathsf{s}\ n)\ (\mathsf{cons}\ n\ x\ y) &=_{\mathsf{lst}\ (\mathsf{s}\ n)} \mathsf{app}\ n\ (\mathsf{s}\ 0)\ (\mathsf{rev}\ n\ y)\ (\mathsf{cons}\ 0\ x\ \mathsf{nil}) \end{aligned}$$

The conjecture

$$\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ \mathsf{rev}\ n\ (\mathsf{rev}\ n\ x) =_{\mathsf{lst}\ n} x \qquad\qquad (\textsf{rev-invol})$$


Table 1. Amount of problem files per (intermediate) goal

is very easy to state, but turns out to be hard to prove automatically. The proof is based on the equational definitions of plus and app given in Example 1 as well as several induction proofs on lists using the axiom from Example 4. In particular, some intermediate goals are needed to succeed:

$$\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n.\ \mathsf{app}\ n\ 0\ x\ \mathsf{nil} =_{\mathsf{lst}\ n} x \qquad\qquad (\textsf{app-nil})$$

$$\begin{aligned} &\forall n_1\colon \mathsf{nat}, x_1\colon \mathsf{lst}\ n_1, n_2\colon \mathsf{nat}, x_2\colon \mathsf{lst}\ n_2, n_3\colon \mathsf{nat}, x_3\colon \mathsf{lst}\ n_3. \\ &\quad \mathsf{app}\ n_1\ (\mathsf{plus}\ n_2\ n_3)\ x_1\ (\mathsf{app}\ n_2\ n_3\ x_2\ x_3) = \mathsf{app}\ (\mathsf{plus}\ n_1\ n_2)\ n_3\ (\mathsf{app}\ n_1\ n_2\ x_1\ x_2)\ x_3 \end{aligned} \qquad (\textsf{app-assoc})$$

$$\begin{aligned} &\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n, y\colon \mathsf{elem}, m\colon \mathsf{nat}, z\colon \mathsf{lst}\ m. \\ &\quad \mathsf{app}\ (\mathsf{plus}\ n\ (\mathsf{s}\ 0))\ m\ (\mathsf{app}\ n\ (\mathsf{s}\ 0)\ x\ (\mathsf{cons}\ 0\ y\ \mathsf{nil}))\ z = \mathsf{app}\ n\ (\mathsf{s}\ m)\ x\ (\mathsf{cons}\ m\ y\ z) \end{aligned}$$

$$\begin{aligned} &\forall n\colon \mathsf{nat}, x\colon \mathsf{lst}\ n, m\colon \mathsf{nat}, y\colon \mathsf{lst}\ m. \\ &\quad \mathsf{rev}\ (\mathsf{plus}\ n\ m)\ (\mathsf{app}\ n\ m\ (\mathsf{rev}\ n\ x)\ y) = \mathsf{app}\ m\ n\ (\mathsf{rev}\ m\ y)\ x \end{aligned} \qquad (\textsf{rev-invol-lem})$$

Note that for polymorphic lists, this is a standard example of an induction proof with lemmas (see e.g. [15, Section 2.2]). In the dependently-typed case, however, many intermediate equations would be ill-typed in interactive theorem provers like Coq or Lean. In order to succeed in automatically proving these problems, we had to break them down into separate problems for the instantiation of the induction axiom, the base case and the step case of the induction proofs. Often, we additionally needed to organize these subproblems into manageable steps. Overall, we created 34 TPTP problem files, distributed over the intermediate goals as shown in Table 1. Note that already type checking these intermediate problems is not trivial: all type constraints are arithmetic equations, and given the Peano axioms, many of them need to be proven by induction themselves. Since we are mainly interested in the dependently-typed part, we added the needed arithmetical facts as axioms. Overall, the problem files have up to 18 axioms, including the Peano axioms, selected arithmetical results, the defining equations of plus, app and rev as well as the list induction axiom. We left out unnecessary axioms in many problem files to make the proof search feasible.

With our new modes for DHOL, which solely work with the native DHOL rules, Lash can type check and prove all problems easily. If we turn off the native DHOL rules and only work with the erasure, using otherwise the same modes with a 60 s timeout, Lash can still type check all problems but only manages to prove 7 out of the 34 problems. In order to further evaluate the effectiveness of our new implementation, we translated all problems from DHOL to HOL using the *Logic Embedding Tool*<sup>1</sup>, which performs the erasure from [17]. We then tested 16 other HOL provers available on *SystemOnTPTP*<sup>2</sup> on the translated problems with a 60 s timeout (without type checking). We found that 5 of the 34 problems could only be solved by the DHOL version of Lash, including one problem where it needs only 5 inference steps. Detailed results as well as means to reproduce them are available on Lash's website<sup>3</sup> together with its source code.

# 6 Conclusion

Starting from the erasure from DHOL to HOL by Rothgang et al. [17], we developed a sound and complete tableau calculus for DHOL which we implemented in Lash. To the best of our knowledge, this makes Lash the first standalone automated theorem prover for DHOL. According to the experimental results, our new prover, solely using the native DHOL rules, can outperform configurations where the erasure is performed as a preprocessing step for a HOL theorem prover. We hope that this development will raise further interest in DHOL. Possible further work includes theoretical investigations such as the incorporation of choice operators into the erasure as well as a definition of the semantics of DHOL. Furthermore, it is desirable to officially define the TPTP syntax for DHOL, which would open the possibility of establishing a problem data set on which current and future tools can be compared. Finally, we would like to extend Lash to support predicate subtypes. Rothgang et al. already incorporated these into the erasure, but there is no corresponding syntactic support in TPTP yet. In particular, this would get us much closer to powerful automation support for systems like PVS.

Acknowledgments. The results were supported by the Ministry of Education, Youth and Sports within the dedicated program ERC CZ under the project POSTMAN no. LL1902. This work has also received funding from the European Union's Horizon Europe research and innovation programme under grant agreement no. 101070254 CORESENSE as well as the ERC PoC grant no. 101156734 *FormalWeb3*. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Horizon Europe programme. Neither the European Union nor the granting authority can be held responsible for them.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

<sup>1</sup> https://github.com/leoprover/logic-embedding.

<sup>2</sup> https://tptp.org/cgi-bin/SystemOnTPTP.

<sup>3</sup> http://cl-informatik.uibk.ac.at/software/lash-dhol/.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **The Naproche-ZF Theorem Prover (Short Paper)**

Adrian De Lon(B)

University of Bonn, Bonn, Germany adelon@uni-bonn.de

**Abstract.** Naproche-ZF is a new experimental open-source *natural* theorem prover based on set theory; formalizations in Naproche-ZF are written in a controlled natural language embedded into LaTeX, and proof gaps are filled in with automated theorem provers. Naproche-ZF aims to scale natural theorem proving beyond chapter-sized formalizations. In contrast to the Naproche system, the new system uses an extensible grammar-based approach, has more efficient proof automation, and enables larger interconnected formalizations based on a standard library.

# **1 Introduction**

Despite significant progress in theorem provers and successful formalizations of research-level mathematics [17], theorem provers have not yet enjoyed broad adoption among mathematicians [12,29,40]. A common criticism levelled against theorem provers by mathematicians is that formalizations are hard to write *and* read: they are written in unfamiliar languages, contain clutter that is irrelevant to the core mathematical ideas, may require knowledge of various specialized proof tactics, and are simply longer overall (see *de Bruijn factor* [39]).

*Natural theorem provers* are a direct answer to this critique; they aim to check texts *as written* by mathematicians: in natural language and with many proof gaps. Similar ambitions can already be found in the work of pioneers in theorem proving, such as P. Abrahams's 1960s Proofchecker [1], which was intended to check the reasoning of textbooks as-is. Some theorem provers are *partly natural*, such as the influential Mizar system [16] which uses a quasi-natural input language and allows *obvious inferences* [34] as proof gaps. There are significant challenges to the natural approach, both in the processing of natural language and in the high degree of proof automation required to fill proof gaps. However, advances in automated theorem proving and computer hardware have made this approach more feasible.

Naproche [9,26] is a natural theorem prover based on A. Paskevich's implementation of SAD [27,38], extending it with, e.g., set-theoretic primitives, more efficient checking, and an integrated development environment. Students have completed formalizations in Naproche in various areas of mathematics, such as analysis, axiomatic set theory, representation theory, and combinatorics. However, typical formalizations in Naproche use ad hoc axiomatic preliminaries and struggle to scale beyond chapter-length. Medium-sized formalizations of around 3000 lines can take half an hour to check and proving new theorems becomes increasingly difficult.

Naproche-ZF<sup>1</sup> is a reimplementation of key ideas in Naproche with larger formalizations in mind. We shall compare the two systems throughout this paper. Naproche-ZF is developed in tandem with a growing modular standard library (see Footnote 1) containing formalizations of foundational material on sets, relations, functions, orders, ordinals, algebraic structures, topological spaces, and more.

# **2 Controlled Natural Language**

The input language of Naproche-ZF is a new controlled natural language (CNL): it is a carefully chosen and formally specified subset of mathematical English that is embedded into LaTeX for mathematical notation and document structuring. Most mathematicians are familiar with LaTeX, which makes the language easier to learn. Ideally, formalizing in a CNL should feel like writing with a strict style guide. The following example shows a theorem formalized in Naproche-ZF, first the LaTeX source and then the rendering after typesetting.

```
\begin{theorem}[Burali-Forti antinomy]\label{burali_forti}
  There exists no set $\Omega$ such that for all $\alpha$
  we have $\alpha\in\Omega$ iff $\alpha$ is an ordinal.
\end{theorem}
\begin{proof}
  Suppose not. Consider $\Omega$ such that for all $\alpha$ we have
    $\alpha\in\Omega$ iff $\alpha$ is an ordinal.
  For all $x, y$ such that $x\in y\in\Omega$ we have $x\in\Omega$.
  So $\Omega$ is \in-transitive. Thus $\Omega$ is an ordinal.
  Hence $\Omega\in\Omega$. Contradiction.
\end{proof}
```
*Theorem* (Burali-Forti antinomy). There exists no set Ω such that for all α we have α ∈ Ω iff α is an ordinal.

*Proof.* Suppose not. Consider Ω such that for all α we have α ∈ Ω iff α is an ordinal. For all x, y such that x ∈ y ∈ Ω we have x ∈ Ω. So Ω is ∈-transitive. Thus Ω is an ordinal. Hence Ω ∈ Ω. Contradiction. □

Naproche-ZF treats everything outside of fixed *formal environments*, such as definition, theorem, and proof, as comments. This facilitates writing *literate formalizations* that mix informal commentary and formal mathematics in the same document (e.g. [8], cf. literate programming [21]). Other theorem provers also support literate formalization or advanced typesetting; examples include Literate Agda, Isabelle's document preparation system, and a Mizar-to-LaTeX translator [2]. An advantage of CNLs in literate formalization is that it is rarely necessary to restate theorems and proof steps, since the formal statement is already readable.

<sup>1</sup> Available at https://adelon.net/naproche-zf under an open-source license together with its standard library.

**Parsing.** Natural mathematical language is *dynamic*: definitions introduce new lexical items, which can be symbols, words, or phrases. Dynamism complicates the processing of mathematical language. Naproche and Naproche-ZF take different approaches to parsing their input languages and modelling dynamism.

Naproche's parser is defined with monadic parser combinators [18]. It statefully modifies itself as it encounters definitions and translates to an internal formula representation on the fly. Such tight coupling makes it harder to extend its CNL. There are also cases where parsing takes exponential time.

Naproche-ZF splits this process into phases. First it finds all lexical items using a scanner written with applicative regex combinators [7]. For nouns and verbs it then guesses the plural forms with basic *smart paradigms* [10] in the sense of GF [31]. The resulting lexicon becomes a parameter for the grammar of the CNL. Using the Earley [13] Haskell library, the CNL is specified as a context-free grammar in an embedded domain-specific language, and the derived Earley parser [11] runs in cubic time in the worst case, or in quadratic time if the grammar is unambiguous. This grammar-oriented approach makes the initial design and future extension of a CNL easier than parser combinators do: new rules can be stated declaratively, and there is no need to worry about exponential parsing times or about eliminating left-recursion from the grammar.
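To make the grammar-oriented approach concrete, the following is a minimal Earley recognizer with the classic predict/scan/complete phases and cubic worst-case behaviour. The toy grammar and sentence are illustrative only; they are not Naproche-ZF's actual CNL, which is written in Haskell using the Earley library.

```python
# Minimal Earley recognizer (predict/scan/complete), a sketch of the
# parsing technique mentioned above. The grammar below is a toy example.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["every", "N"], ["N"]],
    "VP": [["is", "NP"]],
    "N":  [["ordinal"], ["set"]],
}

def earley_recognize(words, start="S"):
    # chart[k] holds items (head, body, dot, origin)
    chart = [set() for _ in range(len(words) + 1)]
    for body in GRAMMAR[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for k in range(len(words) + 1):
        agenda = list(chart[k])
        while agenda:
            head, body, dot, origin = agenda.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in GRAMMAR:  # predict: expand a nonterminal at k
                    for prod in GRAMMAR[sym]:
                        new = (sym, tuple(prod), 0, k)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
                elif k < len(words) and words[k] == sym:  # scan a terminal
                    chart[k + 1].add((head, body, dot + 1, origin))
            else:  # complete: advance items waiting on this nonterminal
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        new = (h2, b2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(words)])
```

Left-recursive rules pose no problem for this algorithm, which is one reason the grammar-oriented approach is easier to extend than combinator parsing.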

**Accuracy.** Naproche supports a plain text dialect [28] and a LaTeX dialect. Naproche-ZF drops support for the plain text format and uses LaTeX markup in its CNL to avoid ambiguities. For instance, "a" can be ambiguous in Naproche (variable vs. determiner), but the ambiguity is resolved by distinguishing "*a*" ("$a$") from "a" ("a"). Such distinctions, together with avoiding the backtracking behaviour of combinator parsers, significantly improve error specificity and locality. For instance, Naproche often mistakes an unexpected word for a new variable name and will usually offer nonspecific and mislocated error messages along the lines of 'unexpected "."', whereas Naproche-ZF can reliably offer a list of valid tokens at the location of the error. Naproche-ZF uses a stateful tokenizer to handle nested math and text modes, to support, e.g., "\text{...}" within set comprehensions.

Naproche accommodates grammatical number via a synonym instruction. For example, one uses "[synonym number/-s]" to identify "natural number" and "natural numbers". As a result, Naproche accepts ungrammatical sentences such as "x, y is natural numbers". Naproche-ZF instead guesses plural forms via smart paradigms and requires number agreement. Overall, the grammar of Naproche-ZF is stricter, aiming to avoid ambiguity, ungrammaticality, and pitfalls observed in formalizations written by students, where statements had unintuitive meanings in Naproche.
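A smart paradigm in the GF sense guesses inflected forms from the surface shape of the dictionary form alone. The following is a hedged sketch of such a plural guesser; the rules and the handling of multi-word items are simplified heuristics, not Naproche-ZF's actual implementation, and irregular nouns would need explicit lexicon entries.

```python
import re

def guess_plural(noun: str) -> str:
    """Guess the plural of an English noun from its surface form, in the
    style of a GF smart paradigm. Heuristic sketch only; irregular nouns
    (e.g. "vertex") would need explicit lexicon overrides."""
    # Multi-word lexical items: pluralize the final word, assumed here to
    # be the head noun ("degree of freedom"-style items need head marking).
    if " " in noun:
        rest, _, last = noun.rpartition(" ")
        return rest + " " + guess_plural(last)
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"          # class -> classes
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"    # category -> categories
    return noun + "s"               # number -> numbers
```

With such guesses as defaults, the system can enforce number agreement ("x, y are natural numbers") without requiring the user to declare plurals by hand.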

Naproche-ZF supports various idioms. For example, binary relations can be chained and multiple terms can be related to each other ("*a, b < c < d*"), they can appear in bounded quantifiers ("For all *x* ∈ *X* ..."), and sets can be used as binary relations ("*xRy*"). Naproche-ZF also models some *pragmatic phenomena* [32,35]: for example, an existential claim in a proof implicitly introduces a local constant, the same way that the "Consider ..." step does.

# **3 Semantics and Proof Checking**

**Translation.** Naproche-ZF translates from its CNL to a set theory in higher-order logic (HOL) with Henkin semantics [5], using generalized de Bruijn indices [19,20] to handle quantifiers and other binders. However, the system emphasizes reasoning within the first-order fragment where possible, in order to leverage the strength of mature first-order automated theorem provers (ATPs).

Adjectives, verbs, and nouns are translated as predicates. Bounding phrases in quantifications are translated to type guards. For instance, "Every natural number is an ordinal" is translated to "∀n. natural number(n) → ordinal(n)".
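The guard translation can be sketched with a tiny formula datatype. This is an illustrative model only: the `Forall`/`Implies`/`Pred` constructors and the `every` helper are invented for this sketch and are not Naproche-ZF's internal representation.

```python
# Sketch: translating a bounded quantification to a guarded formula.
# The ADT and helper names below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Forall:
    var: str
    body: object

@dataclass
class Implies:
    lhs: object
    rhs: object

@dataclass
class Pred:
    name: str
    args: tuple

def every(noun: str, pred: str) -> Forall:
    """'Every <noun> is a(n) <pred>': the noun becomes a type guard."""
    v = "n"
    return Forall(v, Implies(Pred(noun, (v,)), Pred(pred, (v,))))

def show(phi) -> str:
    if isinstance(phi, Forall):
        return f"∀{phi.var}. {show(phi.body)}"
    if isinstance(phi, Implies):
        return f"{show(phi.lhs)} → {show(phi.rhs)}"
    return f"{phi.name}({', '.join(phi.args)})"

print(show(every("natural_number", "ordinal")))
# ∀n. natural_number(n) → ordinal(n)
```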

Proof automation is currently first-order only, using strong first-order ATPs such as E [36] and Vampire [22,33]. Every nontrivial proof step or intermediate claim leads to a proof task that is exported to an ATP. By default an ATP is given 10 s per task, but most tasks can be solved within fractions of a second. Naproche-ZF will also integrate with ATPs supporting HOL via TPTP THF0 [4]. For an impression of how ATPs are used, consider the step "Then B ∈ 2<sup>A</sup>" in the following formalization of Cantor's theorem.

*Theorem* (Cantor). There exists no surjection from A to 2<sup>A</sup>.

*Proof.* Suppose not. Take a surjection f from A to 2<sup>A</sup>. Let B = {a ∈ A | a ∉ f(a)}. Then B ∈ 2<sup>A</sup>. There exists a′ ∈ A such that f(a′) = B by the definition of surjectivity. Now a′ ∈ B iff a′ ∉ f(a′) = B. Contradiction. □

This step leads to a proof task in which all preceding first-order definitions and theorems, as well as all local assumptions, definitions, and claims can be used as premises. Here the exported TPTP [37] problem contains a few hundred premises, shown below with the conjecture and recent theorems at the top, along with local premises at the bottom. Note that Naproche-ZF transformed the local definition of *B* via set comprehension into the first-order premise cantor1 by an automatic application of the axiom of separation.

```
fof(cantor,conjecture,in(fB,pow(fA))).
[...]
fof(powerset_intro,axiom,![XA,XB]:(subseteq(XA,XB)=>in(XA,pow(XB)))).
[...]
fof(subseteq,axiom,![XA,XB]:(subseteq(XA,XB)<=>![Xa]:(in(Xa,XA)=>in(Xa,XB)))).
[...]
fof(cantor1,axiom,![Xa]:(in(Xa,fB)<=>(in(Xa,fA)&~in(Xa,apply(ff,Xa))))).
fof(cantor2,axiom,in(ff,surj(fA,pow(fA)))).
fof(cantor3,axiom,~~?[X5]:in(X5,surj(fA,pow(fA)))).
```
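The transformation of the local comprehension into the first-order premise cantor1 can be sketched as a string-level translation. The helper below is an illustrative assumption modeled on the listing above, not Naproche-ZF's actual exporter; it only handles the single-variable separation pattern.

```python
# Sketch: eliminate a separation-style comprehension B = {a in A | phi(a)}
# into a first-order TPTP premise, mirroring the cantor1 axiom above.
# `phi` maps a variable name to a TPTP formula for the comprehension body.

def eliminate_separation(name, set_const, bound_set, phi):
    """Return a fof axiom: ![Xa]: (in(Xa,B) <=> (in(Xa,A) & phi(Xa)))."""
    body = f"(in(Xa,{bound_set})&{phi('Xa')})"
    return f"fof({name},axiom,![Xa]:(in(Xa,{set_const})<=>{body}))."

axiom = eliminate_separation(
    "cantor1", "fB", "fA",
    lambda x: f"~in({x},apply(ff,{x}))")
print(axiom)
# fof(cantor1,axiom,![Xa]:(in(Xa,fB)<=>(in(Xa,fA)&~in(Xa,apply(ff,Xa))))).
```

The point is that the set comprehension disappears from the exported problem: the ATP only ever sees the first-order membership condition licensed by the axiom of separation.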
**Sets.** Naproche-ZF's built-in constructs are geared towards higher-order set theories [5,15] extending Zermelo–Fraenkel (ZF) set theory. ZF with the axiom of choice (ZFC) is the de facto foundation of informal mathematics, but additional axioms such as the universe axiom of Tarski–Grothendieck set theory (TG) can be convenient for, e.g., category theory. Variants of TG are used by Mizar, Egal, and Megalodon [6]. Second-order axioms of ZF have corresponding built-in syntax or proof steps in Naproche-ZF that make it possible to use them with first-order proof automation. As we have seen in Cantor's theorem above, set comprehensions are automatically eliminated in some situations using the axioms of separation or *n*-ary replacement. Naproche-ZF has a built-in proof method for ∈-induction and will also feature a mechanism for defining one's own induction principles (proved as higher-order theorems). There is potential for interoperability or integration with systems based on set theory, such as Mizar, Isabelle/ZF, and Megalodon. Naproche-ZF has an experimental export feature that translates theorem statements to Megalodon.

**Structures.** Naproche has no dedicated features for mathematical structures, which means that users have to set up structures themselves. Dealing with notation becomes cumbersome, since one has to annotate explicitly which structure an operation belongs to. In Naproche-ZF one can define structures directly. They are encoded as record datatypes in set theory, with structure operations acting as field projections. The noun phrase of a structure is translated as a predicate, and structure axioms are translated to first-order introduction and elimination rules. This first-order encoding enables structure subtyping and multiple inheritance. In the example below, topological spaces inherit from the built-in *onesorted structure*, which has only a projection "|−|" to the carrier set as an operation and has no axioms.

*Definition.* A topological space X is a onesorted structure equipped with O<sub>X</sub> s.t.


Structure operations are typeset using LaTeX macros that take the structure as an optional argument. In theorems and proofs the structure argument may be omitted when a suitable structure was *instantiated* beforehand. For example, if a theorem statement has a premise of the form "Let *X* be a topological space", one can subsequently write O ("$\opens$") for O<sub>X</sub> ("$\opens[X]$"). When multiple structures with the same operation are instantiated, the last instantiation shadows the previous ones.

**Imports.** The import mechanism of Naproche works similarly to an include directive, which led to redundant checking of shared imports. Naproche-ZF instead tracks imports in a graph to avoid this. The import mechanism in Naproche-ZF uses the command "\import{<file>}", which may be hidden in the rendered document.

**Proofs.** The simplest way of proving a theorem in Naproche-ZF is to leave it entirely to the ATP by not writing an explicit proof. One can also provide a proof of the form "Follows by *justification*", where

*justification* = "set extensionality" | "assumption" | "definition" | *ref*.

The justification "by set extensionality" splits an atomic equation into two goals expressing mutual set inclusion, which is convenient when the ATP is reluctant to use extensionality. Next, "by assumption" and "by definition" restrict the available premises for the ATP to just the local assumptions or the preceding definitions, respectively. Finally, a *ref* is an explicit reference to previous theorems, reusing common LaTeX citation commands such as \cref from the Cleveref package. Only these explicitly cited theorems are then used as premises for the ATP, together with the local assumptions and relevant definitions. Most other proof steps can also be justified in this manner, but we will disregard justifications below for the sake of brevity.

Next, one can state intermediate claims using one of many equivalent phrases such as "We have *Φ*" or "Thus *Φ*". This creates an ATP task for the claim and then adds the claim as an additional assumption for the remainder of the proof. Claims may be justified by a subproof.

One can also perform goal-directed proof steps, similar to many other formalization languages. There are straightforward proof steps like "Assume *Φ*", "Suppose not", and case distinctions. One can obtain witnesses with "Consider *x, y, z* ∼ *X* such that *Φ*" or simplify universal goals with "Fix *x, y, z* ∼ *X* such that *Φ*", where both the bound given by an arbitrary relation symbol ∼ and the such-that refinement are optional.

A proof step of the form "It suffices to show *Φ*" creates a proof task of showing that *Φ* implies the current goal, and then sets *Φ* as the new goal.

Naproche-ZF supports *calculational reasoning* in the align\* environment: each equation may be followed by an \explanation with a citation. Currently calculational reasoning works with equations and biconditionals. Other systems such as Lean and Isabelle have similar features that also support inequalities [3]. We will extend calculational reasoning as needed alongside future formalizations.

**Premise Selection.** Irrelevant premises make it harder for ATPs to find proofs. *Premise selection* [24] is a process that attempts to identify the relevant premises in a problem. Naproche lacks premise selection, which is a major barrier to scaling beyond chapter-sized formalizations. Work is in progress to add premise selection to Naproche-ZF, similar to the premise selection of Sledgehammer [23,30]. The checker already includes a basic MePo-like [25] filter. There is also experimental premise selection using graph neural networks (GNNs), thanks to the help of Mirek Olšák and Josef Urban. The first-order problems exported by Naproche-ZF are structurally similar to the Mizar corpus on which GNN-based filters have performed well [14]. Training data for the GNN can be extracted from explicitly justified proof steps (those that use "by *ref*"). We also expect that premise selection trained on the Mizar corpus would perform well for Naproche-ZF. GNN-based premise selection is experimental and not yet integrated into the checker. We plan to scale up premise selection as we slowly grow the standard library.
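The MePo idea can be sketched as an iterative symbol-based relevance filter: start from the goal's symbols, accept premises that share enough symbols with the relevant set, and add the symbols of accepted premises. The scoring, threshold, and premise names below are simplified illustrations, not the actual MePo or Naproche-ZF filter.

```python
# Sketch of a MePo-style relevance filter. Thresholds and the premise
# "database" below are illustrative assumptions.

def mepo_filter(goal_syms, premises, threshold=0.5, max_rounds=3):
    """premises: dict mapping premise name -> set of symbol names.
    Returns the names of premises deemed relevant to the goal."""
    relevant = set(goal_syms)
    accepted, remaining = [], dict(premises)
    for _ in range(max_rounds):
        # Accept premises whose symbols overlap enough with the relevant set.
        newly = [name for name, syms in remaining.items()
                 if syms and len(syms & relevant) / len(syms) >= threshold]
        if not newly:
            break
        for name in newly:
            accepted.append(name)
            relevant |= remaining.pop(name)  # grow the relevant symbol set
    return accepted

premises = {
    "powerset_intro": {"subseteq", "in", "pow"},
    "subseteq_def":   {"subseteq", "in"},
    "group_assoc":    {"mul", "inv"},
}
print(mepo_filter({"in", "pow"}, premises))
```

On this toy input the set-theoretic premises are kept while the group-theory premise, which shares no symbols with the goal, is filtered out.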

**Performance.** An apples-to-apples performance comparison of Naproche and Naproche-ZF is difficult, since there are no exactly parallel formalizations in the two systems. Naproche-ZF processes texts faster overall, in part due to having a parser with better asymptotic behaviour. The total checking time, however, is dominated by proof searches in external ATPs. ATP tasks are single-threaded in Naproche, whereas Naproche-ZF uses a thread pool to make use of modern multi-core CPUs. Moreover, when an ATP task fails in Naproche, it retries the task after unfolding definitions. When developing large formalizations, this behaviour can lead to sudden explosions in checking time, as proofs that used to be fast suddenly become slow because they are retried multiple times. This behaviour dates back to SAD, where it was useful in the context of smaller formalizations and weaker ATPs. Naproche-ZF calls the ATP once per problem (but does use the portfolio modes of ATPs). The standard library of Naproche-ZF currently stands at a modest 4600 lines (excluding comments and blank lines). Using Vampire as the ATP, it takes less than 10 s to check on an Intel i7-13700K and less than 22 s on an Apple M1. In comparison, Naproche can take over 10 times as long when checking formalizations of similar length.

Naproche-ZF optionally caches the initial segment of successful ATP proofs between runs, resuming checking at the first failed proof, which saves time during proof writing. The cache entry of a proof is invalidated if the premises differ, to avoid reproducibility problems: we do not rely on monotonicity of entailment, since adding irrelevant premises can in practice make ATP proof searches fail.
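Keying cache entries on the exact premise set can be sketched as follows. The class, file-less storage, and hashing scheme are illustrative assumptions, not Naproche-ZF's actual cache format; the essential point is that the key depends on the premise contents but not on their order.

```python
# Sketch of a proof-result cache keyed by conjecture and premise set,
# so any change to the premises invalidates the cached result.

import hashlib

class ProofCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(conjecture: str, premises: list) -> str:
        # Sort the premises: order must not matter, but content must.
        h = hashlib.sha256()
        h.update(conjecture.encode())
        for p in sorted(premises):
            h.update(b"\0" + p.encode())
        return h.hexdigest()

    def lookup(self, conjecture, premises):
        return self._store.get(self._key(conjecture, premises))

    def record(self, conjecture, premises, result: bool):
        self._store[self._key(conjecture, premises)] = result

cache = ProofCache()
cache.record("in(fB,pow(fA))", ["powerset_intro", "subseteq"], True)
# Same premises in a different order: cache hit.
assert cache.lookup("in(fB,pow(fA))", ["subseteq", "powerset_intro"]) is True
# An extra premise changes the key: the cached result no longer applies.
assert cache.lookup("in(fB,pow(fA))", ["subseteq", "powerset_intro", "x"]) is None
```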

# **4 Conclusion and Future Work**

Even in its early state, Naproche-ZF is a new theorem prover showing that natural theorem provers can scale beyond chapter-sized formalizations. It features an extensible grammar-based approach to natural language, a familiar set-theoretic foundation in higher-order logic, and proof automation powered by strong first-order ATPs. Use of concurrency, more control over the proof search process, premise selection, faster parsing, a module system, and other refinements result in a performance improvement of an order of magnitude compared to its predecessor Naproche.

Naproche-ZF is still experimental research-quality software and requires more features, grammar refinement, bug fixes, user testing, and documentation to become user-friendly. Naproche-ZF's checker is a command-line tool and lacks an integrated development environment (IDE), which would make the system more accessible. Currently, user interaction with the ATP within Naproche-ZF is limited: failed or slow proofs sometimes require digging through large logs and experimenting with the ATP on the command line. An IDE for Naproche-ZF should also facilitate better interaction with the ATP, e.g. by giving Sledgehammer-like suggestions after finding proofs.

Naproche-ZF would also benefit from improvements to general purpose proof automation (e.g. better premise selection) and from including special-purpose proof automation, e.g. for arithmetic.

The included standard library is still fairly small and it would be nice to update student formalizations completed in older versions of Naproche to also work in Naproche-ZF.

Currently the LaTeX files are typeset as-is. It would be worthwhile to generate richer HTML (with MathML) or PDF documents, e.g. by linking lexical items to their definitions or enabling progressive disclosure of more complicated proofs.

# **References**


LNCS, vol. 1125, pp. 191–201. Springer, Heidelberg (1996). https://doi.org/10. 1007/BFb0105405


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Reducibility Constraints in Superposition**

Márton Hajdu<sup>1(B)</sup>, Laura Kovács<sup>1</sup>, Michael Rawson<sup>1</sup>, and Andrei Voronkov<sup>2,3</sup>

> <sup>1</sup> TU Wien, Vienna, Austria
> marton.hajdu@tuwien.ac.at
> <sup>2</sup> University of Manchester, Manchester, UK
> <sup>3</sup> EasyChair, Manchester, UK

**Abstract.** Modern superposition inference systems aim at reducing the search space by introducing redundancy criteria on clauses and inferences. This paper focuses on reducing the number of superposition inferences with a single clause by blocking inferences into some terms, provided that inferences of a certain form were previously performed with predecessors of this clause. Other calculi based on blocking inferences, for example basic superposition, rely on variable abstraction or equality constraints to express irreducibility of terms, which however results in blocking inferences with all subterms of the respective terms. Here we introduce reducibility constraints in superposition to enable a more expressive blocking mechanism for inferences. We show that our calculus remains (refutationally) complete and present accompanying redundancy notions. Our implementation in the theorem prover Vampire demonstrates a considerable reduction in the size of the search space when using our new calculus.

**Keywords:** Saturation · Superposition · Redundancy · Reducibility constraints

# **1 Introduction**

Automated reasoners in first-order logic with equality commonly rely on the *superposition calculus* [5,25]. This calculus has been extended with various improvements in order to reduce the search space. For instance, avoiding superposition into variables and ordering literals and clauses are common practices in modern theorem provers [21,29,34].

To reduce generation of redundant clauses in equational reasoning, the "basicness" restriction [16] was introduced at the term level. This idea aided, for example, in finding the proof of the Robbins problem [24]. This restriction blocks superposition (rewriting) inferences into terms resulting from (quantifier) instantiations, considering such terms irreducible in further proof steps. This approach was further generalised to block superposition into terms above variable positions in basic superposition/paramodulation [7,26], while preserving refutational completeness. However, blocking and applying different rewrite steps among equal

**Fig. 1.** Possible superposition sequences into 4.

terms impacts proof search. In this paper, we propose a number of different ways to block inferences, so that the resulting calculus remains complete. The effect of these restrictions resembles some strategies from term rewriting, such as innermost and outermost strategies.

**Motivating Example.** Consider the following satisfiable set C of clauses:

$$\mathcal{C} = \left\{ \begin{array}{ll} (1)\ g(x,b) \simeq a, & (2)\ f(x,b) \simeq x, \\ (3)\ g(a,x) \simeq x, & (4)\ P(g(x,y),f(g(x,b),z)) \end{array} \right\}$$

where x, y, z are variables, a, b constants, f, g function symbols, and P is a predicate symbol. In this paper ≃ denotes equality. Figure 1 shows some derivations of P(a, a) obtained by consecutively superposing into 4 with 1 and 2. It also shows a derivation of P(a, b) obtained by superposing into 4 with 1, then with 3 and 2. Note that Fig. 1 contains many redundant clauses. For example, 4 is redundant w.r.t. 6 and 1, as it is a logical consequence of the (smaller) clauses 6 and 1. Similarly, 7 is redundant w.r.t. 11 and 1.

Many derivations of Fig. 1 could however be avoided by imposing a rewrite order on the inferences. For example, a *leftmost-innermost rewrite order* on inferences derives 13 along the path 4 – 5 – 9 – 13. Whenever we deviate from the leftmost-innermost rewrite order when rewriting a term t, we enforce the order by requiring that any term t′ that is to the left of or inside t is irreducible in further derivations. In other words, we block further inferences with t′. With such a restriction, we cannot rewrite g(x, y) in clause 6, as g(x, y) was to the left of the previously rewritten term f(g(x, b), z). Hence, when using a leftmost-innermost rewrite order in Fig. 1, 9 is only generated by the derivation path 4 – 5 – 9. Similarly, 11 cannot be derived from 7 but can be derived from 6.

**Our Contributions.** We introduce a new superposition calculus that enables various ways to block (rewrite) inferences during proof search. Key to our calculus are *reducibility constraints* that restrict the order of superposition inferences (Sect. 3). Our approach supports and generalizes, among others, the leftmost-innermost rewrite orders mentioned in the motivating example by means of *irreducibility constraints*, allowing us to reduce the number of generated clauses. Furthermore, in our motivating example the derivation 5 – 8 – 12 of Fig. 1 is not needed for the following reason. By superposing into 2 with 3, we derive a ≃ b, which makes one of 12 and 13 redundant w.r.t. the other. As 1 was used to rewrite g(x, b) in Fig. 1 and derive 5, we block superposition into g(x, b) with all clauses except 1 in further derivations. We express this requirement via a *one-step reducibility constraint* (Definition 1), resulting in BLINC, the BLocked INference Calculus. As such, BLINC is parameterized by a rewrite ordering and (ir)reducibility constraints.

We prove<sup>1</sup> that our BLINC calculus is refutationally complete, for which we use a model construction technique (Sect. 4) with new features introduced to take care of constraints. We extend our calculus with redundancy elimination (Sect. 5). When evaluating the BLINC calculus implemented in the Vampire theorem prover, our experiments show that reducibility constraints significantly reduce the search space (Sect. 6).

# **2 Preliminaries**

We work in standard *first-order logic with equality*, where equality is denoted by ≃. We use variables x, y, z, v, w and terms s, t, u, l, r, all possibly with indices. A term is *ground* if it contains no variables. A literal is an unordered pair of terms with polarity, i.e. an equality s ≃ t or a disequality s ≄ t. We write s ≐ t for either an equality or a disequality. A *clause* is a multiset of literals. We denote clauses by B, C, D, and denote by □ the *empty clause*, which is logically equivalent to ⊥.

An *expression* E is a term, literal or clause. We will also consider as expressions the constraints and constrained clauses introduced later. An expression is called *ground* if it contains no variables. We write E[s] to state that the expression E contains a distinguished occurrence of the term s at some position. Further, E[s → t] denotes that this occurrence of s is replaced with t; when s is clear from the context, we simply write E[t]. We say that t is a *subterm* of s[t], denoted by t ⊴ s[t], and a *strict subterm* if additionally t ≠ s[t], denoted by t ◁ s[t]. A *substitution* σ is a mapping from variables to terms such that the set of variables {x | σ(x) ≠ x} is finite. We denote substitutions by θ, σ, ρ, μ, η. The application of a substitution θ to an expression E is denoted Eθ. A substitution θ is called *grounding for an expression* E if Eθ is ground. We denote the set of grounding substitutions for an expression E by GSubs, that is, GSubs(E) = {θ | Eθ is ground}. We denote the empty substitution by ε.

A substitution θ is more general than a substitution σ if θη <sup>=</sup> σ for some substitution η. A substitution θ is a *unifier* of two terms s and t if sθ <sup>=</sup> tθ, and is a *most general unifier*, denoted mgu(s, t), if for every unifier η of s and <sup>t</sup>, there exists a substitution μ s.t. η <sup>=</sup> θμ. We assume that the most-general unifiers of terms are idempotent [2].
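The unification notions above can be sketched in code. The following is a minimal syntactic unification procedure with an occurs check, returning a unifier as a triangular substitution (bindings are chased recursively on application). The term encoding, variables as capitalized strings and compound terms as `(functor, [args])` pairs, is an illustrative assumption.

```python
# Sketch of first-order syntactic unification. Variables are strings
# starting with an uppercase letter; compound terms are (functor, [args]).

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def subst(theta, t):
    """Apply a triangular substitution, chasing variable bindings."""
    if is_var(t):
        return subst(theta, theta[t]) if t in theta else t
    if isinstance(t, tuple):
        f, args = t
        return (f, [subst(theta, a) for a in args])
    return t

def occurs(v, t, theta):
    t = subst(theta, t)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, theta) for a in t[1])

def mgu(s, t):
    """Return a substitution (dict) unifying s and t, or None if none exists."""
    theta, stack = {}, [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = subst(theta, a), subst(theta, b)
        if a == b:
            continue
        if is_var(a):
            if occurs(a, b, theta):   # occurs check: X vs f(X) fails
                return None
            theta[a] = b
        elif is_var(b):
            stack.append((b, a))
        elif isinstance(a, tuple) and isinstance(b, tuple) \
                and a[0] == b[0] and len(a[1]) == len(b[1]):
            stack.extend(zip(a[1], b[1]))  # decompose f(...) vs f(...)
        else:
            return None                    # clash of functors or arities
    return theta

# Unifying f(X, b) with f(g(Y), b) binds X to g(Y).
theta = mgu(("f", ["X", ("b", [])]), ("f", [("g", ["Y"]), ("b", [])]))
```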

<sup>1</sup> Detailed proofs are in the full version of this paper [15].

A binary relation → over the set of terms is a *rewrite relation* if (i) l → r implies lθ → rθ and (ii) l → r implies s[l] → s[r], for all terms l, r, s and substitutions θ. The *reflexive and transitive closure* of a relation → is denoted by →∗. We write ← to denote the inverse of →. Two terms are *joinable*, denoted by s ↓ t, if there exists a term u s.t. s →∗ u ←∗ t. A *rewrite system* R is a set of rewrite rules. A term l is *irreducible* in R if there is no r s.t. l → r ∈ R. Joinability w.r.t. R is denoted by s ↓<sub>R</sub> t. A *rewrite ordering* is a strict rewrite relation. A *reduction ordering* is a well-founded rewrite ordering. In this paper we consider reduction orderings ≻ which are total on ground terms, that is, for distinct ground terms s and t they satisfy s ≻ t or t ≻ s; such orderings are also called *simplification orderings*.

We use the standard definition of the *bag extension* of an ordering [12]. An ordering ≻ on terms induces an ordering on literals by identifying s ≃ t with the multiset {s, t} and s ≄ t with the multiset {s, s, t, t}, and using the bag extension of ≻. We denote this induced ordering on literals also by ≻. Likewise, the ordering ≻ on literals induces an ordering on clauses by using the bag extension of ≻; again, we denote this induced ordering on clauses by ≻. The induced relations on literals and clauses are well-founded (resp. total) if so is the original relation on terms. In examples used in this paper, we assume a KBO simplification ordering with constant weight [19].
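The bag (multiset) extension can be sketched via the standard Dershowitz–Manna characterization: M is greater than N iff the multisets differ and every element occurring more often in N is dominated by some element occurring more often in M. This sketch assumes a strict base order passed in as a predicate.

```python
# Sketch of the bag (multiset) extension of a strict ordering, using the
# Dershowitz-Manna characterization. `gt(m, n)` is the base order.

from collections import Counter

def bag_greater(M, N, gt):
    """M > N in the bag extension iff M != N as multisets and every element
    of the multiset difference N - M is dominated by some element of M - N."""
    M, N = Counter(M), Counter(N)
    diff_M = M - N   # elements M has in excess of N (multiset difference)
    diff_N = N - M   # elements N has in excess of M
    if not diff_M and not diff_N:
        return False  # equal multisets are incomparable under a strict order
    return all(any(gt(m, n) for m in diff_M) for n in diff_N)

# With terms modeled as integers: one copy of 3 dominates three copies of 2.
assert bag_greater([3], [2, 2, 2], lambda a, b: a > b)
# The literal convention above: s != t, i.e. {s, s, t, t}, is larger
# than s = t, i.e. {s, t}, since the extra copies need no dominator.
assert bag_greater([2, 2, 1, 1], [2, 1], lambda a, b: a > b)
```

This matches the convention that a disequality is larger than the corresponding equality, which the completeness argument for superposition relies on.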

Many first-order theorem provers work with clauses [21,29,34]. Let S be a set of clauses. Systems often *saturate* S by computing all logical consequences of S with respect to a sound inference system I. The process of saturating S is called *saturation*. An inference system I is a set of inference rules of the form

$$\frac{C_1 \quad \dots \quad C_n}{C},$$

where C<sub>1</sub>,...,C<sub>n</sub> are the *premises* and C is the *conclusion* of the inference. The inference rule is *sound* if its conclusion is a logical consequence of its premises, that is, C<sub>1</sub>,...,C<sub>n</sub> |= C. The inference is *reductive* w.r.t. an ordering ≻ if C ≺ C<sub>i</sub> for some i = 1,...,n. An inference system I is *sound* if all its inferences are sound. An inference system I is *refutationally complete* if for every unsatisfiable clause set S there is a derivation of the empty clause □ in I. An interpretation I is a model of an expression E if E is true in I. A clause C that is false in an interpretation I is a *counterexample* for I. If a clause set contains a counterexample, then it also contains a minimal counterexample w.r.t. a simplification ordering [6].

# **3 Reducibility Constraints**

This section presents a new blocking calculus, called BLINC (BLocked INference Calculus). This calculus uses specific constraints to block inferences.

**Definition 1 (Constraints).** Let l be a non-variable term and r a term. We call the expression l ⇝ r a *one-step reducibility constraint*, and the expression ↓l an *irreducibility constraint*. A *constraint* is either of the two.

**Fig. 2.** The BLINC calculus

Now let us define the semantics of these constraints.

**Definition 2 (Satisfied Constraints, Violated Constraints).** Let R be a rewrite system. We say that R *satisfies* l ⇝ r if l → r ∈ R, and *satisfies* ↓l if l is irreducible in R. We say that R *violates* a constraint if it does not satisfy it.

An expression C | Γ is a *constrained clause*, where C is a clause and Γ a finite set of constraints. A constrained clause C | Γ is true iff C is true. We denote constrained clauses by C, D, possibly with indices.

**Definition 3 (Blocked Constrained Clause, Blocked Inference).** Let C = C | Γ be a constrained clause. We call the constraint l ⇝ r ∈ Γ *active in* C if s ⊵ l for some term s in C. Likewise, we call ↓l ∈ Γ *active in* C if s ⊵ l for some term s in C. We call C *blocked* if either it contains two active constraints l ⇝ r and l ⇝ r′ such that r and r′ are not unifiable, or it contains two active constraints l ⇝ r and ↓l. An inference is *blocked* if its conclusion is blocked.
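The blockedness test of Definition 3 can be sketched for the ground case, where two right-hand sides are unifiable iff they are syntactically equal. Terms are encoded as strings like `"f(g(a,b),b)"`, and the crude subterm enumeration below is an illustrative assumption, not the paper's implementation.

```python
# Sketch of the blockedness check of Definition 3, restricted to ground
# constraints: a constraint is active if its left-hand side occurs as a
# subterm of some clause term; a clause is blocked if it has two active
# one-step constraints l ~> r, l ~> r' with r != r' (ground terms unify
# iff equal), or an active l ~> r together with an active irreducibility
# constraint on l.

def subterms(t):
    """All subterms of a ground term string like 'f(g(a,b),b)',
    obtained by splitting arguments at top-level commas."""
    out = {t}
    if "(" in t:
        args = t[t.index("(") + 1:-1]
        depth, cur = 0, ""
        for ch in args:
            if ch == "," and depth == 0:
                out |= subterms(cur); cur = ""
            else:
                if ch == "(": depth += 1
                if ch == ")": depth -= 1
                cur += ch
        out |= subterms(cur)
    return out

def is_blocked(clause_terms, one_step, irreducible):
    """one_step: list of (l, r) pairs for l ~> r; irreducible: list of l."""
    occurring = set()
    for t in clause_terms:
        occurring |= subterms(t)
    active_os = [(l, r) for (l, r) in one_step if l in occurring]
    active_ir = [l for l in irreducible if l in occurring]
    for i, (l1, r1) in enumerate(active_os):
        for l2, r2 in active_os[i + 1:]:
            if l1 == l2 and r1 != r2:   # two incompatible one-step demands
                return True
    return any(l in active_ir for (l, _) in active_os)
```

For instance, a clause containing g(a,b) with constraints g(a,b) ⇝ a and g(a,b) ⇝ b is blocked, mirroring Example 2 below; constraints whose left-hand side no longer occurs in the clause are inactive and cause no blocking.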

Our superposition calculus BLINC uses constrained clauses and bans inferences with blocked conclusions. For that, we attach constraints to clauses, as follows.

**Definition 4 (S-ordering).** An *S-ordering* is a partial strict well-order on terms that is stable under substitutions. We use the function B defined below to attach constraints to clauses.

$$\mathcal{B}_{\exists}(s,l) := \{ \downarrow u \mid u \lhd' l,\ u \text{ is non-variable and } u \not\unlhd' s \}$$

 

BLINC is shown in Fig. 2. We assume a literal selection function satisfying the standard condition on ≻, and underline selected literals. The next example illustrates blocked BLINC inferences.

*Example 1.* We use the order on terms as the S-ordering. A Sup inference of BLINC into 4 with 2 from our motivating example on page 2 results in

**Fig. 3.** Inferences from Fig. 1 with blocked inferences in BLINC removed.

Figure 3 illustrates the effectiveness of reducibility constraints when compared to Fig. 1: we removed arcs corresponding to inferences blocked when the order ⊳ is used as the S-ordering. Of the 14 original inferences of Fig. 1, only 7 are not blocked in Fig. 3.

$$\frac{f(x,b)\simeq x \quad P(g(x,y),f(g(x,b),z))}{P(g(x,y),g(x,b)) \mid \{\downarrow b,\downarrow g(x,y),\downarrow g(x,b),f(g(x,b),b)\leadsto g(x,b)\}}\cdot$$

Note that the conclusion is a constrained variant of clause 7 of Fig. 1. Now, the superposition of clause 1 into g(x, y), and hence the following variant of clause 10 of Fig. 1, is blocked:

$$\frac{g(x,b) \simeq a \quad P(g(x,y),g(x,b)) \mid \{\downarrow g(x,y),\downarrow g(x,b),\, f(g(x,b),b) \leadsto g(x,b)\}}{P(a,g(x,b)) \mid \{\downarrow b,\downarrow g(x,b),\, f(g(x,b),b) \leadsto g(x,b),\, g(x,b) \leadsto a\}}$$

Note that the conclusion is blocked by the active constraints g(x, b) ⤳ a and ↓g(x, b). Figure 3 shows the modified version of Fig. 1: using the inference rules of BLINC generates fewer clauses than in Fig. 1.

*Example 2.* Consider now a sequence of superposition inferences into clause 4 by clause 1 and then by clause 3. That is, we consider the derivation 4 – 5 – 8 from Fig. 1 as:

$$\frac{g(a,x)\simeq x \qquad \dfrac{g(x,b)\simeq a \qquad P(g(x,y),f(g(x,b),z))}{P(a,f(g(x,b),z))\mid\{\downarrow b,\, g(x,b)\leadsto a\}}}{P(a,f(b,z))\mid \Delta}$$

The resulting conclusion is constrained and blocked, as we have two active constraints g(a, b) ⤳ a and g(a, b) ⤳ b. As such, and as shown in Fig. 3, clause 8 (and also clause 12) will not be generated by BLINC, in contrast to Fig. 1.

# **4 Model Construction in BLINC**

This section shows completeness of BLINC, with a proof which resembles that of Duarte and Korovin [13]. We start by adjusting terminology to our setting and discussing key differences compared with standard completeness proofs.

**Definition 5 (Closure).** Let <sup>C</sup> <sup>=</sup> C <sup>|</sup> Γ be a constrained clause and θ a substitution. The pair C · θ is called a *closure* and is logically equivalent to Cθ. A closure (C <sup>|</sup> Γ) · θ is *ground* if Cθ <sup>|</sup> Γ θ is ground, in which case we say that θ is *grounding* for C <sup>|</sup> Γ and call (C <sup>|</sup> Γ) · θ <sup>a</sup> *ground instance of* C <sup>|</sup> Γ.

The set of all ground instances of C is denoted C∗. We will denote ground closures by C, D, possibly with indices. If N is a set of constrained clauses, then N∗ is defined as the union of C∗ over all C ∈ N. If C ≻ D, we write C | Γ ≻ D | Δ. Similarly, if Cθ | Γθ ≻ Dσ | Δσ, then we write (C | Γ) · θ ≻ (D | Δ) · σ.

A crucial part in the completeness proof of BLINC is reducing minimal counterexamples to smaller ones. However, due to the blocked inference conditions (5) in Sup, (2) in EqRes, and (3) in EqFac, such a counterexample reduction may not always be possible. We solve this problem in three steps:


**Definition 6 (Partial/Total Models, Blocked/Productive Closures).** Let N be a set of constrained clauses. We will define, for every closure C ∈ N∗, the rewrite system R_C and refer to all such rewrite systems as *partial models*. The definition is by induction on the relation ≻ on ground closures. In parallel to defining R_C, we also define the rewrite system

$$R_{\prec \mathcal{C}} = \bigcup_{\mathcal{D} \prec \mathcal{C}} R_{\mathcal{D}}.$$

The partial model R_C will either be the same as R≺C, or obtained from R≺C by adding a single rewrite rule. In the latter case we will call the closure C *productive*.

The *reduced closure* of a ground closure C·θ is defined as the closure C·θ′ such that for each variable x occurring in C, we have that θ′(x) is the normal form of θ(x) in R≺C·θ. We call a ground closure *reduced* if its reduced closure coincides with the closure itself. Let C·θ be a ground closure and C·θ′ be its reduced closure. We say that C·θ is *blocked w.r.t.* N if R≺C·θ violates some constraint in Γθ′ that is active in Cθ′. A closure that is not blocked w.r.t. N is called *unblocked w.r.t.* N. Let C = (l ≃ r ∨ C′) | Γ. The closure C·θ is called *productive* if


Now we define

$$R_{\mathcal{C}\cdot\theta} = \begin{cases} R_{\prec \mathcal{C}\cdot\theta} \cup \{l\theta \to r\theta\}, & \text{if } \mathcal{C}\cdot\theta \text{ is productive;} \\ R_{\prec \mathcal{C}\cdot\theta}, & \text{otherwise;} \end{cases} \qquad R_{\infty} = \bigcup_{\mathcal{C} \in N^{*}} R_{\mathcal{C}}.$$

Finally, we call <sup>R</sup><sup>∞</sup> *the total model* and define <sup>U</sup>(N) as the set of all closures in N<sup>∗</sup> unblocked w.r.t. N. 
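For intuition, the construction can be sketched for the special case of ground unit equations. The encoding and the two simplifications are ours: we conflate "unblocked" with "every constraint is satisfied by the rules built so far", and "productive" with "the left-hand side is irreducible so far".

```python
# A drastically simplified, ground unit-equation sketch of Definition 6:
# closures are triples (l, r, constraints) listed in increasing order.
# Terms are nested tuples; constraints are ('rw', l, r) and ('irred', l).

def subterms(t):
    yield t
    for arg in t[1:]:
        yield from subterms(arg)

def reducible(t, rules):
    return any(s == lhs for s in subterms(t) for (lhs, _) in rules)

def satisfied(rules, c):
    if c[0] == 'rw':                    # ('rw', l, r): rule l -> r present?
        return (c[1], c[2]) in rules
    return not reducible(c[1], rules)   # ('irred', l): l irreducible?

def build_total_model(closures):
    """Return the total model for the given ground unit closures."""
    R = set()
    for (l, r, constraints) in closures:
        if any(not satisfied(R, c) for c in constraints):
            continue          # blocked closure: contributes no rule
        if not reducible(l, R):
            R.add((l, r))     # productive closure: adds a single rule
    return R
```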

This construction has two standard properties that we will use in our proofs:


The crucial difference between our model construction and the standard model construction is the condition on productive closures to be unblocked w.r.t. N. Let us now define our redundancy notions based on <sup>U</sup>(N) as follows.

**Definition 7 (Redundant Clause/Inference).** A constrained clause C is *redundant w.r.t.* N if every ground instance of C is either blocked w.r.t. N, or follows from smaller ground closures in U(N). An inference with premises C1, ..., Cn and conclusion D is *redundant w.r.t.* N if for each θ grounding for C1, ..., Cn and D either


**Definition 8 (Saturation up to Redundancy).** A set of constrained clauses N is *saturated up to redundancy* if, given non-redundant constrained clauses C1, ..., Cn ∈ N, any BLINC inference with premises C1, ..., Cn and conclusion D is redundant w.r.t. N.

From now on, let N be an arbitrary but fixed set of constrained clauses. We will formulate a sequence of lemmas used in the completeness proof, whose proofs are included in the full version of the paper [15]. The following lemma is used to show that unary inferences with an unblocked premise result in an unblocked conclusion.

**Lemma 1. (Unblocking Inferences)** *Suppose* C, D ∈ N *and* θ *and* σ *are substitutions irreducible in* R≺C·θ *and in* R≺D·σ*, respectively. If* C·θ ≻ D·σ*,* Γθ ⊇ Δσ *and* C·θ *is unblocked w.r.t.* N*, then* D·σ *is unblocked w.r.t.* N*.*

We next prove that the conclusion of a blocked inference is redundant, that is, the conditions that block inferences in BLINC are correct.

**Lemma 2. (Redundancy with Blocked Clauses)** *Let* C *be a constrained clause. If* <sup>C</sup> *is blocked, then all ground instances of* <sup>C</sup> *are blocked w.r.t.* N*.*

The next lemma resembles the standard lemma on counterexample reduction.

**Lemma 3 (Unblocked** Sup**).** *Suppose that (a)* D = (s ≄ t ∨ D′) | Γ *is a constrained clause in* N*, (b)* D·θ *is a ground closure unblocked w.r.t.* N*, (c)* θ *is irreducible in* R≺D·θ*, (d)* sθ ≻ tθ*, (e)* sθ *is reducible in* R≺D·θ*.*

*Then there exist a constrained clause* (l ≃ r ∨ C′) | Π ∈ N*, a* Sup *inference*

$$\frac{l \simeq r \vee C' \mid \Pi \qquad s[u] \not\simeq t \vee D' \mid \Gamma}{(s[r] \not\simeq t \vee C' \vee D')\sigma \mid \Delta}$$

*and a substitution* τ *such that (i)* Dστ = Dθ*, (ii)* ((l ≃ r ∨ C′) | Π)·στ *is productive, and (iii)* ((s[r] ≄ t ∨ C′ ∨ D′)σ | Δ)·στ *is unblocked w.r.t.* N*.*

We are now ready to show completeness of BLINC, starting with the following.

**Theorem 1 (Model of** U(N)**).** *Let* N *be saturated up to redundancy and* □ ∉ N*. Then for each* C ∈ U(N) *we have* R_C ⊨ C*.*

When R_C ⊨ C, we will simply say that C is true. Note that this implies that R_D ⊨ C for all D ≻ C, and also R∞ ⊨ C. We say that C is false if it is not true.

Here, we only prove a few representative cases and refer to [15] for the complete argumentation. Assume, by contradiction, that U(N) contains a ground closure C such that R_C ⊭ C. Since ≻ is well-founded, N∗ contains a minimal unblocked closure C·θ such that R_{C·θ} ⊭ C·θ.

**Case 1.** <sup>C</sup> *is redundant w.r.t.* N*.*

*Proof.* The closure C·θ is unblocked, so it follows from smaller closures C1, ..., Cn in U(N). Since C·θ is false, some Ci is false too, and we are done.

**Case 2.** <sup>C</sup> *contains a variable* <sup>x</sup> *such that* xθ *is reducible in* <sup>R</sup>≺C·<sup>θ</sup>*.*

*Proof.* The reduced closure C·θ′ of C·θ is unblocked w.r.t. N, so C·θ′ ∈ U(N). Since xθ ≻ xθ′ and, for all variables y different from x, yθ ⪰ yθ′, we have C·θ ≻ C·θ′, so C·θ′ is true by minimality. Since yθ = yθ′ is true in R∞ for all variables y, the closures C·θ and C·θ′ are equivalent in R∞, hence C·θ is true and we obtain a contradiction.

**Case 3.** *There is a reductive inference with premises* C1, ..., Cn ∈ N *and conclusion* D *which is redundant w.r.t.* N *such that (a)* {C1·θ, ..., Cn·θ} ⊆ U(N)*, (b)* D·θ *is unblocked w.r.t.* N*, (c)* C·θ = max{C1·θ, ..., Cn·θ}*, and (d)* D·θ ⊨ C·θ*.*

*Proof.* D·θ is implied by ground closures in U(N) smaller than C·θ. These ground closures are true in R_{C·θ} by minimality of C·θ, so D·θ is true, and hence C·θ is also true in R_{C·θ}, a contradiction.

**Case 4.** *None of the previous cases apply, and a negative literal* s ≄ t *is selected in* C*, i.e.,* C = (s ≄ t ∨ C′) | Γ*.*

*Proof.* C·θ is false in R_{C·θ}, so sθ ↓_{R_{C·θ}} tθ. W.l.o.g., assume sθ ⪰ tθ.

**Subcase 4.1.** sθ <sup>=</sup> tθ*.*

*Proof.* Then s and t are unifiable. Consider the EqRes inference

$$\frac{s \not\simeq t \vee C' \mid \Gamma}{C'\sigma \mid \Gamma\sigma}$$

where σ = mgu(s, t). Take any ground instance D·ρ = (C′σ | Γσ)·ρ such that σρ = θ; by the idempotence of σ, we have D·ρ = D·θ. Clearly, C·θ ≻ D·θ and D·θ implies C·θ. As C·θ ≻ D·θ and Γσρ = Γθ, Lemma 1 implies that D·θ is unblocked w.r.t. N. By Case 1, D is not redundant, hence D ∈ N. But then D·θ is a false closure in U(N) which is strictly smaller than C·θ, so we obtain a contradiction.

**Subcase 4.2.** sθ ≻ tθ.

*Proof.* By the conditions on literal selection, we assume that sθ ≄ tθ is maximal in Cθ. By Lemma 3, there is a Sup inference into sθ with a ground closure such that the resulting closure is unblocked w.r.t. N. This side closure is of the form D·θ = ((l ≃ r ∨ D′) | Π)·θ and we have the following Sup inference

$$\frac{l \simeq r \vee D' \mid \Pi \qquad s[l'] \not\simeq t \vee C' \mid \Gamma}{(s[r] \not\simeq t \vee C' \vee D')\sigma \mid \Delta}$$

where σ = mgu(l, l′). Denote the conclusion by C′′ = (s[r] ≄ t ∨ C′ ∨ D′)σ | Δ. Then C·θ ≻ C′′·θ, and D·θ together with C′′·θ implies C·θ. Since C′′·θ is unblocked w.r.t. N, using Lemma 2 we get that C′′ is not blocked, so condition (5) of Sup is satisfied. Similarly to Subcase 4.1, the conclusion yields a smaller false unblocked closure, so we obtain a contradiction.

Next we show that for a saturated set of clauses N, if R∞ is a model for U(N), then it is also a model of N∗, that is, R∞ also satisfies all blocked closures in N∗. This follows from the next theorem.

**Theorem 2 (Model of** <sup>N</sup><sup>∗</sup>**).** *Let* N *be a saturated set of clauses. Every blocked closure* C · θ <sup>∈</sup> N<sup>∗</sup> *follows from* <sup>U</sup>(N)*.*

Using Theorems 1–2, we obtain completeness of BLINC.

**Corollary 1 (Completeness of BLINC).** *Let* N *be saturated up to redundancy. If* N *does not contain* □*, then* N *is satisfiable.*

We conclude with a remark on *constraint inheritance* in BLINC. Note that in the Sup inference rule of Fig. 2, constraints are inherited only from the right premise. It is possible to block more inferences, without losing refutational completeness of BLINC, by allowing constraint inheritance from the left premise of the Sup rule as well. However, we cannot propagate constraints that are non-active in the left premise, as they may become active in the conclusion, making the inference blocked. This effect is illustrated in the following example.

*Example 3.* Consider a superposition into clause 3 with clause 1:

$$\frac{g(x,b)\simeq a \quad g(a,x)\simeq x}{a\simeq b \mid \{\downarrow a,\downarrow b,\, g(a,b)\leadsto a\}}$$

If b ⊐ a, then ↓a is the only active constraint in the conclusion. Consider a superposition with clause 4 where constraints are inherited from both premises:

$$\frac{a \simeq b \mid \{\downarrow a, \downarrow b,\, g(a,b) \leadsto a\} \quad P(g(x,y), f(g(x,b),z))}{P(g(x,y), f(g(x,a),z)) \mid \{\downarrow a, \downarrow b,\, g(a,b) \leadsto a,\, b \leadsto a\}}$$

In the conclusion, ↓b and b ⤳ a are both active, which blocks the inference.

# **5 Redundancy Detection in BLINC**

In this section we discuss redundancy detection in BLINC. We give sufficient conditions for a clause to be redundant when inferences of a specific form are applied. As usual, we call a *simplifying inference*, or *simplification*, any inference such that one of the premises becomes redundant after the conclusion is added to the current set of clauses. Inference rules whose instances are simplifications are called *simplification rules*. When we display a simplification rule, we will denote clauses that become redundant by drawing a line through them.

Definition 7 gives rise to two kinds of simplification criteria: (i) based on blocking, and (ii) when one of the premises C · θ follows from smaller constrained clauses. The following definition captures the first redundancy criterion.

**Definition 9 (Closure/Clause Blocked Relative to Closure/Clause).**

A ground closure C is *blocked relative to* a ground closure D if for every set of constrained clauses N, if <sup>D</sup> is blocked w.r.t. N<sup>∗</sup>, then <sup>C</sup> is blocked w.r.t. <sup>N</sup><sup>∗</sup> too. A constrained clause C is *blocked relative to* a constrained clause D, if every ground instance of C is blocked relative to some ground instance of D. 

This notion will be used for defining simplification rules. We will next present sufficient conditions for checking that a constrained clause is blocked relative to another constrained clause. For example, each ground closure of a clause C | ∅ is unblocked w.r.t. any set N, hence everything is blocked relative to that ground closure. Further, each ground closure with a reducible substitution is blocked relative to its reduced closure.

**Definition 10 (Well-Behaved Constrained Clause).** Let C = C | Γ be a constrained clause. We say that C is *well-behaved* if (i) all constraints in Γ are active in C, and, for each γ ∈ Γ, (ii) if γ = ↓l, then ↓u ∈ Γ for all non-variable u ⊲ l, and (iii) if γ = l ⤳ r, then ↓u ∈ Γ for all non-variable u ⊲ l and l contains all variables of r.

*Example 4.* The clause P(a, f(b, z)) | {↓a, g(a, b) ⤳ a} is not well-behaved but P(a, f(b, z)) | {↓a, ↓b, g(a, b) ⤳ a} is. The clause a ≃ b | {↓a, ↓b, g(a, b) ⤳ a} is not well-behaved since it contains constraints not active in the clause.
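Conditions (ii) and (iii) of Definition 10 are purely syntactic and can be checked directly. The sketch below (encoding ours) skips the activity condition (i), which depends on the chosen S-ordering.

```python
# Checker for conditions (ii) and (iii) of Definition 10; activity, i.e.
# condition (i), is skipped since it depends on the chosen S-ordering.
# Variables are strings, compound terms are nested tuples; constraints are
# ('rw', l, r) for l ~> r and ('irred', l) for "down l".

def proper_subterms(t):
    if isinstance(t, str):
        return
    for arg in t[1:]:
        yield arg
        yield from proper_subterms(arg)

def variables(t):
    if isinstance(t, str):
        yield t
    else:
        for arg in t[1:]:
            yield from variables(arg)

def well_behaved_ii_iii(constraints):
    irreds = {c[1] for c in constraints if c[0] == 'irred'}
    for c in constraints:
        l = c[1]
        # (ii)/(iii): every non-variable proper subterm u of l needs "down u"
        for u in proper_subterms(l):
            if not isinstance(u, str) and u not in irreds:
                return False
        # (iii): l must contain all variables of r
        if c[0] == 'rw' and not set(variables(c[2])) <= set(variables(l)):
            return False
    return True
```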

**Lemma 4. (Relatively Blocked Well-Behavedness)** *Let* C = C | Γ *and* D = D | Δ *be well-behaved constrained clauses, and* σ *a substitution. Then* C *is blocked relative to* D *if* Dσ ⊆ C *and* Δσ ⊆ Γ*.*

In the sequel, we assume that each constrained clause is well-behaved. We next adjust two standard simplifications within superposition [14], namely demodulation in Theorem 3 and subsumption in Theorem 4. Our analogue of *demodulation* is the following special case of Sup in BLINC:

$$(\mathsf{Dem}) \;\; \frac{l \simeq r \mid \Delta \quad \cancel{C[l\sigma] \mid \Gamma}}{C[r\sigma] \mid \Gamma} \quad \text{where} \quad \begin{array}{l} (1)\ l\sigma \succ r\sigma, \\ (2)\ C[l\sigma] \succ (l \simeq r)\sigma, \\ (3)\ \Delta\sigma \subseteq \Gamma. \end{array}$$

**Theorem 3. (BLINC Demodulation)** Dem *is a simplification rule. That is,* C[lσ] | Γ *is redundant w.r.t. any constrained clause set that contains* l ≃ r | Δ *and* C[rσ] | Γ*.*
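The clause part of Dem performs one ordinary demodulation step: find a subterm matching l under some σ and replace it by the corresponding instance of r. A sketch with one-way matching follows; the ordering side conditions (1)–(2) and the constraint condition (3) are deliberately omitted, and the term encoding is ours.

```python
# One demodulation step on the clause part of Dem. Variables are strings,
# compound terms are nested tuples such as ('P', ('g', ('a',), 'x'), ('b',)).

def match(pat, t, subst=None):
    """One-way matching: extend subst so that pat instantiated equals t."""
    subst = dict(subst or {})
    if isinstance(pat, str):
        if pat in subst:
            return subst if subst[pat] == t else None
        subst[pat] = t
        return subst
    if isinstance(t, str) or pat[0] != t[0] or len(pat) != len(t):
        return None
    for p, u in zip(pat[1:], t[1:]):
        subst = match(p, u, subst)
        if subst is None:
            return None
    return subst

def apply_subst(t, subst):
    if isinstance(t, str):
        return subst.get(t, t)
    return (t[0],) + tuple(apply_subst(arg, subst) for arg in t[1:])

def demodulate(t, l, r):
    """Rewrite the first (outermost) subterm of t matching l; None if none."""
    sigma = match(l, t)
    if sigma is not None:
        return apply_subst(r, sigma)
    if isinstance(t, str):
        return None
    for i, arg in enumerate(t[1:], start=1):
        rewritten = demodulate(arg, l, r)
        if rewritten is not None:
            return t[:i] + (rewritten,) + t[i + 1:]
    return None
```

For instance, with the rule g(a, z) → b the atom P(g(a, x), b) rewrites to P(b, b).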

In addition to simplification rules, we will also consider *deletion rules*. These rules delete a (redundant) constrained clause from N provided that N contains another constrained clause or set of constrained clauses. The below deletion rule is our analogue of *subsumption*:

$$(\mathsf{Subs}) \quad D \mid \Delta \qquad \cancel{C \mid \Gamma} \quad \text{where} \quad \begin{array}{l} (1)\ D\sigma \subseteq C, \\ (2)\ \Delta\sigma \subseteq \Gamma, \end{array} \ \text{for some substitution } \sigma.$$

**Theorem 4. (BLINC Subsumption)** Subs *is a deletion rule. That is,* <sup>C</sup> <sup>|</sup> <sup>Γ</sup> *is redundant w.r.t. any constrained clause set that contains* D <sup>|</sup> Δ*.*
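For unit clauses, the subsumption conditions reduce to finding a matcher σ that sends the atom of D to the atom of C and checking that Δσ is contained in Γ. A sketch under this unit-clause assumption (encoding ours):

```python
# Subs for the special case of unit constrained clauses (our simplifying
# assumption). Variables are strings, compound terms nested tuples;
# constraints are ('rw', l, r) and ('irred', l).

def match(pat, t, subst=None):
    """One-way matching: extend subst so that pat instantiated equals t."""
    subst = dict(subst or {})
    if isinstance(pat, str):
        if pat in subst:
            return subst if subst[pat] == t else None
        subst[pat] = t
        return subst
    if isinstance(t, str) or pat[0] != t[0] or len(pat) != len(t):
        return None
    for p, u in zip(pat[1:], t[1:]):
        subst = match(p, u, subst)
        if subst is None:
            return None
    return subst

def apply_subst(t, subst):
    if isinstance(t, str):
        return subst.get(t, t)
    return (t[0],) + tuple(apply_subst(arg, subst) for arg in t[1:])

def subst_constraint(c, subst):
    return (c[0],) + tuple(apply_subst(x, subst) for x in c[1:])

def subsumes(d_atom, delta, c_atom, gamma):
    """Does the unit clause d_atom | delta subsume c_atom | gamma?"""
    sigma = match(d_atom, c_atom)
    if sigma is None:
        return False
    return all(subst_constraint(c, sigma) in set(gamma) for c in delta)
```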

We also introduce two deletion rules based on properties of the constraints of a clause. Namely, in Theorem 5 we introduce a deletion rule resembling "basic blocking" [25], whereas Theorem 6 exploits deletion based on rewrite orders. Consider therefore the following rule:

$$(\mathsf{Block}) \quad l \simeq r \mid \Delta \qquad \cancel{C \mid \Gamma} \quad \text{where} \quad \begin{array}{l} (1)\ C \succ (l \simeq r)\sigma \text{ and } l\sigma \succ r\sigma, \\ (2)\ \Delta\sigma \subseteq \Gamma, \\ (3)\ \text{either (i) } \downarrow l\sigma \in \Gamma \\ \phantom{(3)\ }\text{or (ii) } l\sigma \leadsto r' \in \Gamma \text{ and } r' \succ r\sigma. \end{array}$$

**Theorem 5. (BLINC Blocking)** Block *is a deletion rule. That is,* C | Γ *is redundant w.r.t. any constrained clause set that contains* l ≃ r | Δ*.*

Our last deletion rule relies on the fact that all rewrite rules in any partial model have to be oriented left-to-right according to ≻. That is,

$$(\mathsf{Orient}) \quad \cancel{C \mid \Gamma \cup \{l \leadsto r\}} \quad \text{where} \quad \begin{array}{l} (1)\ r \succ l, \\ (2)\ C \succ (l \simeq r). \end{array}$$

**Theorem 6. (BLINC Orientation)** Orient *is a deletion rule. That is,* C | Γ ∪ {l ⤳ r} *is redundant w.r.t. any constrained clause set.*

We illustrate the above simplification and deletion rules with the following example.

*Example 5.* Consider the following well-behaved constrained clauses:

$$\begin{array}{ll}(1)\ P(g(a,x),b) \mid \{\downarrow b,\, f(x,b) \leadsto b\}, & \quad (2)\ P(g(y,z),w) \mid \{f(z,w) \leadsto b\},\\(3)\ g(a,z) \simeq b \mid \{\downarrow b\}, & \quad (4)\ f(x,y) \simeq a \mid \emptyset. \end{array}$$

By Theorem 4, clause (2) subsumes clause (1). By Theorem 3, clause (1) can be simplified with clause (3) into P(b, b) | {↓b, f(x, b) ⤳ b}. Finally, assuming b ≻ a, clauses (1) and (2) are redundant w.r.t. clause (4) by Theorem 5.

*Remark 1.* **(Simplification Heuristics via Unblocking)** We note that further simplifications (and heuristics) can be implemented by removing constraints from constrained clauses. This process of removing constraints is captured via the following rule:

$$(\mathsf{Unblock}) \;\; \frac{\cancel{C \mid \Gamma}}{C \mid \Delta} \quad \text{where } \Delta \subset \Gamma.$$

Clearly, Unblock is a simplification rule, as removing constraints from a constrained clause preserves completeness in BLINC. 

We note that using the general notion of well-behaved clauses and Lemma 4, any further redundancy elimination technique can be adapted to BLINC. We conclude this section by showing that Theorems 3–6 can be adjusted and combined using the ground redundancy of Definition 7. This results in stronger redundancy detection, as the following example illustrates.

*Example 6.* Consider the following Sup inference:

$$\frac{g(f(v,w),a) \simeq g(w,a) \mid \emptyset \qquad f(g(f(x,y),z),f(y,x)) \simeq z \mid \emptyset}{f(g(y,a),f(y,x)) \simeq a \mid \Delta} \qquad \sigma = \begin{cases} v \mapsto x, \\ w \mapsto y, \\ z \mapsto a, \end{cases}$$

where Δ = {↓f(x, y), ↓f(y, x), ↓a, g(f(x, y), a) ⤳ g(y, a)}. Note that the conclusion is a well-behaved constrained clause. The conclusion cannot be simplified by clauses

> (1) f(x, y) ≃ f(y, x) and (2) f(x, x) ≃ x,

using any of Theorems 3–6. However, using conditions similar to those of the Block deletion rule, we can do the following. Let θ be a substitution that makes the conclusion ground. By a case distinction comparing xθ and yθ,


we conclude that the ground closure (f(g(y, a), f(y, x)) ≃ a | Δ) · θ is redundant in all cases, hence the conclusion is redundant w.r.t. clauses (1) and (2).


**Fig. 4.** Experimental comparison of BLINC variants in Vampire, using 1455 UEQ problems and 2422 PEQ problems.

# **6 Evaluation**

We implemented<sup>2</sup> BLINC in Vampire [21], together with the simplification rules of Sect. 5. We have also implemented a redundancy check called *orderedness* that eagerly checks if the result of a superposition can be deleted. We experimented with several variants of BLINC with redundancy elimination (all techniques discussed in Sect. 5), using different heuristics for removing constraints from clauses via Unblock: (i) blinc1 does not use Unblock; (ii) blinc2 uses Unblock to remove constraints inherited from premises, hence only conclusions of Sup will contain constraints; (iii) blinc3 uses Unblock occasionally on the clause that would simplify the most clauses in the search space when unconstrained; (iv) blinc4 uses Unblock on all clauses at activation. We compare these to standard superposition (baseline).

Solving unit equality (UEQ) problems is still very hard for superposition-based theorem provers, a claim substantiated by results in the CADE ATP System Competition (CASC) [30]. For this reason, our evaluation focused on the UEQ domain of the TPTP benchmark suite, version 8.1.2 [31]. Since our work does not consider (variants of) resolution, but proper superposition, we further restricted the evaluation to the pure equality (PEQ) benchmarks of TPTP. As a result, our experiments use all benchmarks of the UEQ and PEQ divisions of TPTP 8.1.2.

All our experiments are based on a Discount saturation loop [11] and a Knuth-Bendix ordering, with a timeout of 100 s and without AVATAR [32]. Our results are summarized in Fig. 4. The results show that blinc1 performs poorly compared to baseline, blinc3 and blinc4, and that blinc2 performs only slightly better than blinc1. The variant blinc3 performs much better than blinc1 and blinc2 but still does not solve any new problems. The variant blinc4 performs comparably to the state-of-the-art baseline but solves different problems, 28 of them uniquely. Our preliminary results are therefore encouraging for complementing state-of-the-art superposition proving with BLINC reasoning, possibly in a portfolio solver.

We also analysed the impact of BLINC variants on skipping superposition inferences during proof search. Figure 5 shows the distribution of benchmarks by percentage of skipped superposition inferences among all superposition inferences during our runs for blinc variants. blinc1 skips more than half of superposition inferences in a significant number of benchmarks, while the least restrictive blinc4 still reduces the number of superposition inferences by a significant amount in most benchmarks.

<sup>2</sup> https://github.com/vprover/vampire/commit/9c42b448996947e8.

**Fig. 5.** Distribution of UEQ (top) and PEQ (bottom) benchmarks by ratio of skipped superpositions to all superpositions, showing also average (avg) and median (mdn). For example, using blinc1, on average 30.2%, resp. 26.0% of superpositions can be skipped in UEQ, resp. PEQ benchmarks.

# **7 Related Work**

The basicness restriction [16,27] was extended to first-order logic, for example, in *basic superposition* [26] and *basic paramodulation* [7]. The former uses ground unification, the latter closures and variable abstraction to capture irreducibility constraints. In basic paramodulation, *redex orderings* are used similarly to S-orderings in our framework. BLINC expresses more fine-grained blocking, for example, distinguishing between different superpositions on the same term. Related notions in basic superposition have also been formalized [33].

Several critical pair criteria in completion-based theorem proving use irreducibility notions. *Blocking* [4] is similar to basicness, while *compositeness* [4,17] forbids any superpositions into terms with reducible subterms. *General superposition* [35,36] avoids superpositions when more general ones, or ones symmetric in variables, have already been performed. Our BLINC framework handles all such restrictions. These criteria are instances of the *connectedness* criterion [3], which has also been explored in *ground joinability* [1], *ground reducibility* [22] and *ground connectedness* [13].

More general irreducibility constraints were considered in completion [23] and in superposition [18], the latter using a semantic tree method for completeness. Ordering constraints [9,10,20] and unification constraints [8,28] have also been considered, usually moving them to the calculus level. Extending and generalizing our BLINC framework with such constraints is a future challenge.

# **8 Conclusions**

We introduce reducibility constraints to block inferences during superposition reasoning. Our resulting BLINC calculus is refutationally complete and is extended with redundancy elimination, allowing us to maintain efficient reasoning when compared to state-of-the-art superposition proving. Integrating our approach with further inference-blocking constraints, such as blocking more general or outermost superpositions, is an interesting line for future work. Adapting our framework to domain-specific inference rules, e.g. in linear arithmetic or higher-order superposition, is another line for future work.

Other interesting directions are (i) the use of a stronger semantics of constraints, as in Definition 10, and (ii) a "hybrid calculus", improving on blinc3, where we still use constraints for blocking generating inferences but relax them whenever they prevent us from applying a simplification or a deletion rule.

**Acknowledgements.** We thank Konstantin Korovin for fruitful discussions. We acknowledge funding from the ERC Consolidator Grant ARTIST 101002685, the TU Wien SecInt Doctoral College, the FWF SFB project SpyCoDe F8504, the WWTF ICT22-007 grant ForSmart, and the Amazon Research Award 2023 QuAT.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **First-Order Automatic Literal Model Generation**

Martin Bromberger<sup>1</sup>, Florent Krasnopol<sup>1,2</sup>, Sibylle Möhle<sup>1</sup>, and Christoph Weidenbach<sup>1(B)</sup>

<sup>1</sup> Max Planck Institute for Informatics, Saarbrücken, Germany *{*mbromber,fkrasnopol,smoehle,weidenbach*}*@mpi-inf.mpg.de <sup>2</sup> École Normale Supérieure, Paris-Saclay, France

**Abstract.** Given a finite consistent set of ground literals, we present an algorithm that generates a complete first-order logic interpretation, i.e., an interpretation for all ground literals over the signature and not just those in the input set, that is also a model for the input set. The interpretation is represented by first-order linear literals. It can be effectively used to evaluate clauses. A particular application is given by SCL stuck states. The SCL (Simple Clause Learning) calculus always computes with respect to a finite number of ground literals. It then finds either a contradiction or a stuck state, which is a model with respect to the considered ground literals. Our algorithm builds a complete literal interpretation out of such a stuck state model that can then be used to evaluate the clause set. If all clauses are satisfied, an overall model has been found. If the interpretation does not satisfy some clause, this information can be effectively explored to extend the scope of ground literals considered by SCL.

# **1 Introduction**

Explicit and effective model representations as well as model building out of a set of first-order clauses have a long tradition [3,10,12,16,20,24,32,41,48,50,51]. In addition, they naturally arise out of decision procedures for decidable first-order clause set fragments [1–3,11,14,15,18,19,22,25–31,33,36,38,39,43,44,46, 52–55]. The problem we are studying here is to the best of our knowledge new: given a finite set of consistent ground literals, find a finite representation of an overall, typically infinite Herbrand style interpretation, satisfying those ground literals. Of course, there are trivial solutions to this problem, e.g., by assigning any missing ground literal to true or false. Projecting the results of [23] to first-order logic results in such a trivial solution. However, such a solution will not fit our motivating application: the family of SCL calculi [6,7,9,21,37], where we here concentrate on the case of first-order logic without equality. Similar to CDCL [40], SCL computes resolution inferences with respect to a partial ground model, i.e., a consistent sequence of first-order ground literals. The number of ground literals considered by SCL is finite at any point in time, thanks to an upper bound ground literal β with respect to a well-founded (quasi)-ordering. For the purpose of this paper we simply consider the number of symbols in a

c The Author(s) 2024 literal with respect to ≤. In this context SCL either produces the empty clause with respect to β or a partial model satisfying all first-order clause instances smaller than β. In case of such a partial model we want to extend it to an overall interpretation for the clause set and then check whether this interpretation is a model for the first-order clause set considered, or, if not, find a suitable extension to β that then covers false clauses with respect to the generated interpretation. So all considered ground literals are instances of existing literals from some clause set. Therefore, we look for a solution that respects the term structure of the ground literals. Our approach starts with a universal relation and then refines it according to the term structure of the considered ground literals until it fits all ground literals,

For illustration, consider the following very simple example. For the three first-order clauses

$$\begin{array}{cc} \neg P(x) \lor P(g(g(x))) & \qquad P(a) \\ \neg P(g(g(g(g(a))))) \end{array}$$

an SCL run with β = P(g(g(g(a)))), i.e., exclusively atoms smaller than or equal to P(g(g(g(a)))) are dealt with, where for the ordering we simply count symbols, could generate the partial model

$$[P(a), P(g(g(a))), P(g(a))^1, P(g(g(g(a))))]$$

where the third literal is decided and the others are propagated from the above clauses. It is a model for all ground instances with literals smaller than or equal to P(g(g(g(a)))), hence excluding ¬P(g(g(g(g(a))))). Our model building calculus would start with the state

$$(\{P(a), P(g(g(a))), P(g(a)), P(g(g(g(a))))\}; \emptyset; \{P(x)\}; \emptyset)$$

meaning that the initial model assumption for P is the universal relation, i.e., P holds on all ground terms. Processed ground literals are moved from the first to the second component of the state and final literal interpretation literals from the third to the fourth component by the algorithm. Thus finally, all processed ground literals are moved to the second component and the final literal model is contained in the fourth component while the other two are empty. One application of rule Solve, see page 7, immediately establishes the model P(x), because it satisfies all ground literals. Of course, this interpretation does not satisfy the three clauses without the restriction to instances bounded by β. Still, we can use our interpretation to find the smallest clause instance falsified by it, in our example ¬P(g(g(g(g(a))))), and use the maximal literal in that clause as our new β = P(g(g(g(g(a))))). Running SCL with the new β will immediately yield the contradiction. Now consider a small modification of the three clauses where we replace the final unit clause by a disjunction.

$$\begin{array}{cc} \neg P(x) \lor P(g(g(x))) & P(a) \\ \neg P(g(g(g(g(a))))) \lor \neg P(g(g(g(a)))) \end{array}$$

Running SCL on this clause set with β = P(g(g(g(a)))) may yield the same partial model as before and hence the same overall interpretation P(x). Again the final clause is falsified by the interpretation yielding a new minimal β = P(g(g(g(g(a))))). Running SCL again with this β yields a final partial model

$$[P(a), P(g(g(a))), P(g(g(g(g(a))))), \neg P(g(g(g(a)))), \neg P(g(a))^1]$$

and now starting with this ground model

$$(\{P(a), P(g(g(a))), P(g(g(g(g(a))))), \neg P(g(g(g(a)))), \neg P(g(a))\}; \emptyset; \{P(x)\}; \emptyset)$$

the initial candidate interpretation P(x) needs to be refined, because it has both positive and negative instances among the set of ground literals. Refining means that we exhaustively instantiate P(x) by rule Refine, see page 6, until no model candidate atom has both positive and negative instances. This eventually yields

$$(\emptyset; \{P(a), P(g(g(a))), P(g(g(g(g(a))))), \neg P(g(g(g(a)))), \neg P(g(a))\}; \emptyset; \{P(a), \neg P(g(a)), P(g(g(a))), \neg P(g(g(g(a)))), P(g(g(g(g(x)))))\})$$

which in fact covers all ground literals and constitutes a model for the three clauses.
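The β-extension step of the first example above can be mimicked by a toy evaluator: under the candidate interpretation P(x) every positive P-literal is true, and scanning ground clause instances by increasing term size finds ¬P(g(g(g(g(a))))) as the smallest falsified instance. The encoding and all names below are our own illustrative choices, not the paper's implementation:

```python
def g(t):
    return ('g', t)

A = ('a',)  # the constant a

def ground_terms(max_depth):
    """a, g(a), g(g(a)), ... up to the given nesting depth, smallest first."""
    t, out = A, [A]
    for _ in range(max_depth):
        t = g(t)
        out.append(t)
    return out

def holds(lit):
    """Under the candidate interpretation P(x), every positive P-literal is true."""
    sign, _atom = lit
    return sign

def smallest_falsified(clauses, max_depth=6):
    """Scan ground clause instances by instantiating x with increasingly
    large terms; return the first (hence smallest) falsified instance."""
    for t in ground_terms(max_depth):
        for clause in clauses:
            inst = [lit(t) for lit in clause]
            if not any(holds(l) for l in inst):
                return inst
    return None

# The three clauses, each literal as a function of the ground term chosen for x.
C1 = [lambda t: (False, ('P', t)), lambda t: (True, ('P', g(g(t))))]
C2 = [lambda t: (True, ('P', A))]
C3 = [lambda t: (False, ('P', g(g(g(g(A))))))]
```

Here `smallest_falsified([C1, C2, C3])` returns the instance of the third clause, whose maximal literal suggests the new bound β = P(g(g(g(g(a))))).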

The paper is now organized as follows: after fixing some notions and notation in Sect. 2, and a short introduction to SCL, Sect. 3, our contributions are contained in Sect. 4. Important results are: (i) out of a set of ground literals we can generate in polynomial time an overall interpretation, Lemma 4, Lemma 8 and Lemma 5; (ii) our literal model representation satisfies the well-known requirements for explicit model representations [10], in particular supports effective clause evaluation, see page 13; (iii) the literal model representation can effectively be used to find a new minimal β in case it does not satisfy the clause set, Lemma 13. The paper ends with a discussion of the obtained results and further research directions, Sect. 5.

# **2 Preliminaries**

We assume a first-order language without equality over a signature Σ = (Ω, Π) of operator symbols and predicate symbols, respectively. All signature symbols come with a fixed arity. Terms, atoms, literals, clauses and clause sets are defined as usual, where in particular clauses are identified both with disjunctions and multisets of literals. Then N denotes a clause set; C, D denote clauses; L, K, H denote literals; A, B denote atoms; P, Q, R denote predicates; t, s terms; f, g, h function symbols; a, b, c constants; and x, y, z variables. We write f/1 or R/2 for a function symbol of arity 1 or a predicate symbol of arity 2, respectively. The complement of a literal is denoted by the function comp. The atom of a literal is given by the function atom, i.e., atom(¬A) = A and atom(A) = A. Semantic entailment |= is defined as usual, where variables in clauses are assumed to be universally quantified. Substitutions σ, τ are total mappings from variables to terms, where dom(σ) := {x | xσ ≠ x} is finite and codom(σ) := {t | xσ = t, x ∈ dom(σ)}. Their application is extended to literals, clauses, and sets of such objects in the usual way. A term, atom, clause, or a set of these objects is *ground* if it does not contain any variable. A substitution σ is *ground* if codom(σ) is ground. A substitution σ is *grounding* for a term t, literal L, clause C if tσ, Lσ, Cσ is ground, respectively. A literal L is an *atom instance* of a literal K if atom(K)σ = atom(L) for some σ. A term or literal is called *linear* if any variable occurs at most once in it. The function mgu denotes the *most general unifier* of two terms, atoms, literals. We assume that any mgu of two terms or literals does not introduce fresh variables and is idempotent.
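Substitution application and the mgu can be made concrete in a short sketch. The term representation (nested tuples `('f', t1, ..., tn)` for applications, bare strings for variables) and all helper names are our own illustrative choices, not part of the paper:

```python
def is_var(t):
    """A bare string denotes a variable; anything else is a tuple (symbol, args...)."""
    return isinstance(t, str)

def apply_subst(t, sigma):
    """Apply a substitution sigma (dict: variable -> term), resolving chains."""
    if is_var(t):
        return apply_subst(sigma[t], sigma) if t in sigma else t
    return (t[0],) + tuple(apply_subst(arg, sigma) for arg in t[1:])

def occurs(x, t, sigma):
    """Occurs check: does variable x appear in t under sigma?"""
    t = apply_subst(t, sigma)
    if is_var(t):
        return x == t
    return any(occurs(x, arg, sigma) for arg in t[1:])

def mgu(s, t, sigma=None):
    """Most general unifier of two terms/atoms, or None if not unifiable."""
    sigma = dict(sigma or {})
    s, t = apply_subst(s, sigma), apply_subst(t, sigma)
    if is_var(s):
        if s == t:
            return sigma
        if occurs(s, t, sigma):
            return None
        sigma[s] = t
        return sigma
    if is_var(t):
        return mgu(t, s, sigma)
    if s[0] != t[0] or len(s) != len(t):
        return None
    for si, ti in zip(s[1:], t[1:]):
        sigma = mgu(si, ti, sigma)
        if sigma is None:
            return None
    return sigma
```

For example, unifying P(x, f(a)) with P(g(y), z) binds x to g(y) and z to f(a); applying the result to both atoms yields the same atom.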

A *position* is a word over the naturals with empty word ε. The set of positions of a term, atom is inductively defined by: pos(x) = {ε} if x is a variable, pos(f(t1,...,tn)) = {ε} ∪ ⋃_{i=1..n} {ip | p ∈ pos(ti)} for terms, and pos(P(t1,...,tn)) = {ε} ∪ ⋃_{i=1..n} {ip | p ∈ pos(ti)} for atoms. For a position p ∈ pos(t) we define t|_p = t if p = ε and f(t1,...,tn)|_p = ti|_p′ if p = ip′. Moreover, we define by t[s]_p the term, atom one receives by replacing the subterm t|_p at position p of t with the term s. The size of a term t, atom A is defined by size(t) = |pos(t)| or size(A) = |pos(A)|, respectively. The size of a substitution σ is defined by size(σ) = Σ_{x ∈ dom(σ)} size(xσ). The size of a set of terms, atoms, substitutions is the sum of the sizes of its members. A position p ∈ pos(t) is *maximal* in t if for any other position q ∈ pos(t) we have |q| ≤ |p|. The *depth* of a position p is 0 if p = ε and |p| otherwise. The *depth* of a term t, atom A is the maximal depth of any position in t, A, i.e., depth(t) = max{|p| | p ∈ pos(t)} and depth(A) = max{|p| | p ∈ pos(A)}, respectively. The depth of a term s in a term t is the depth of a maximal position p such that t|_p = s.
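The position machinery translates directly into code; here positions are tuples of naturals and `()` plays the role of the empty word ε (the tuple encoding of terms is again our own illustrative choice):

```python
def is_var(t):
    return isinstance(t, str)

def positions(t, prefix=()):
    """All positions of a term/atom t, as tuples of argument indices."""
    if is_var(t):
        return {prefix}
    ps = {prefix}
    for i, arg in enumerate(t[1:], start=1):
        ps |= positions(arg, prefix + (i,))
    return ps

def subterm(t, p):
    """t|_p: the subterm of t at position p."""
    for i in p:
        t = t[i]
    return t

def replace_at(t, p, s):
    """t[s]_p: replace the subterm of t at position p by s."""
    if not p:
        return s
    i = p[0]
    return t[:i] + (replace_at(t[i], p[1:], s),) + t[i + 1:]

def size(t):
    return len(positions(t))

def depth(t):
    return max(len(p) for p in positions(t))
```

For the atom P(a, f(a), a) this yields the positions {ε, 1, 2, 21, 3}, size 5 and depth 2.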

Two literals are *inconsistent* if they have different sign and their atoms are unifiable. A set of literals is *consistent* if it does not contain a pair of inconsistent literals. A *literal interpretation* M is a finite set of consistent literals. A literal interpretation I is *complete* with respect to a signature Σ if for any Σ ground atom A there is a literal K ∈ I such that atom(K)σ = A for some σ. A literal interpretation I satisfies a ground literal K, I |= K if there is an L ∈ I such that Lσ = K for some σ. It satisfies a non-ground literal K if it satisfies all groundings of K.
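Satisfaction of a ground literal by a literal interpretation reduces to one-sided matching (Lσ = K). A minimal sketch under a tuple encoding of terms, with `(sign, atom)` pairs for literals; the representation and names are our own illustrative choices:

```python
def is_var(t):
    return isinstance(t, str)

def match(pattern, ground, sigma=None):
    """One-sided unification: a sigma with pattern*sigma == ground, or None."""
    sigma = dict(sigma or {})
    if is_var(pattern):
        if pattern in sigma:
            return sigma if sigma[pattern] == ground else None
        sigma[pattern] = ground
        return sigma
    if is_var(ground) or pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    for p, g in zip(pattern[1:], ground[1:]):
        sigma = match(p, g, sigma)
        if sigma is None:
            return None
    return sigma

def satisfies(M, lit):
    """I |= K for a ground literal K: some literal in M with the same sign
    whose atom generalizes K's atom."""
    sign, atom = lit
    return any(s == sign and match(a, atom) is not None for (s, a) in M)
```

With M = {P(x, a), ¬P(x, g(y))} we get M |= P(g(a), a) and M |= ¬P(a, g(a)), but not M |= P(a, g(a)).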

We overload "," for sets to denote disjoint union and disjoint addition, e.g., "Γ1, Γ2" stands for Γ1 ∪ Γ2 and "Γ1, L" stands for the set Γ1 ∪ {L}.

# **3 SCL: Clause Learning from Simple Models**

The family of SCL calculi (short for "Simple Clause Learning") [6,9,21,37] lifts CDCL (Conflict-Driven Clause Learning) from propositional logic [34,42,49] to variants of first-order logic. The idea is to have superposition-style resolutions on non-ground, first-order clauses, but instead of the usual static order that guides them, SCL uses as its guide ground partial models Γ, i.e., sequences of ground literals, also called *trails*. A trail for a clause set N is constructed/extended by guessing literals via so-called decisions and by propagating literals based on the current trail and the current clauses in N [9]. This construction continues until we determine that Γ falsifies a ground instance Cσ of a clause C ∈ N. The conflict between Γ and C is then resolved by applying Resolution to C and the clauses used for propagation during the construction of Γ. At the end of these resolutions, SCL learns a new clause D and a prefix Γ′ of Γ from which D can be propagated to start the construction of the next trail, which is guaranteed to never encounter the same conflict due to D. Furthermore, D is not redundant, in particular, not subsumed by any clause.

The maximal length of the trail is always finitely bounded by all literals being smaller than a fixed ground literal β. In case all ground literals have been explored and no clause is falsified, this constitutes a so-called *stuck state*. In a stuck state the trail is a model for all ground clause instantiations smaller than β, but not in general.

In its first, original version [21], the focus of the SCL calculus is on deciding the Bernays-Schoenfinkel class without equality. Moreover, the original version is already a sound and refutationally complete semi-decision procedure for general first-order logic without equality that guarantees non-redundant clause learning. Subsequently, SCL has been extended to handle theories [6] and first-order logic with equality [37].

In the meantime, there exists a refined version [9] unifying and extending the previous versions [6,37] for first-order logic called SCL(FOL). In particular, this version introduces a refined Backtrack rule and a refined reasonable strategy criterion. In parallel we proved correctness and soundness of SCL(FOL) in Isabelle [5]. The Isabelle SCL(FOL) version relaxes some of the original requirements. SCL computations are performed with respect to a quasi-ordering ⪯ on ground atoms where the strict part is well-founded. We adopt this setting also in this paper by instantiating ⪯ with symbol counting and ≤. SCL(FOL) is only allowed to add literals L to the trail Γ with atom(L) ⪯ β for some atom β. Note that the bounding atom β may grow, but only if we reach a stuck state, where Γ |= gnd_β(N) and where the function gnd_β computes the set of all ground instances of a clause set where the grounding is restricted to produce literals L with atom(L) ⪯ β. This guarantees that SCL(FOL) (with a reasonable strategy) will always find a refutation if the input clause set is unsatisfiable. Moreover, for a fixed β, SCL(FOL) turns into a decision procedure for gnd_β(N). And even if we allow β to grow, SCL(FOL) regularly visits partial models Γ that at least satisfy gnd_β(N), that may even be extendable to full models for N, or at least guide the selection of our next bounding literal β [8].
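The symbol-counting instantiation of the bound can be made concrete in a few lines; the term encoding (nested tuples, bare strings for variables) and the helper names are our own illustrative choices:

```python
def symbol_count(t):
    """Number of symbol occurrences in a term or atom.

    Terms/atoms are nested tuples ('f', arg1, ..., argn); a bare string is a
    variable, which counts as one symbol.
    """
    if isinstance(t, str):
        return 1
    return 1 + sum(symbol_count(arg) for arg in t[1:])

def within_bound(atom, beta):
    """atom <= beta in the symbol-counting quasi-ordering."""
    return symbol_count(atom) <= symbol_count(beta)
```

For β = P(g(g(g(a)))) this admits P(g(g(a))) but rejects P(g(g(g(g(a))))), matching the runs in the introduction.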

# **4 Generating Models**

A motivation for our model generating algorithm is the extension of an SCL ground trail Γ out of a stuck state to a complete literal interpretation. Such an interpretation either satisfies the considered clause set, or it falsifies some clause. The latter information can then be used to extend the SCL search for a model or a contradiction. Our extension of Γ is not the trivial one, e.g., assigning all atoms beyond Γ to true. Instead, it respects the literal structure in Γ and naturally extends it to a complete literal interpretation.

The starting point is simply a set of ground literals and the finite signature used to build the set. The algorithm is presented by three abstract rewrite rules operating on a state in a non-deterministic way. The state is a tuple (Γ; Δ; I; M) where Γ, Δ are consistent sets of ground literals; M is a set of linear literals that defines a partial interpretation such that M |= L for each L ∈ Δ and M does not have any conflict with Γ; and I is a set of linear atoms such that I ∪ M represents a complete literal interpretation. Initially M is empty and I is the set {P(x1,...,xn)} for some predicate P and linear atom P(x1,...,xn), denoting the universal relation for P. Processed literals/atoms are moved by the rewrite rules from Γ to Δ and from I to M, respectively. The rewrite calculus then builds an overall interpretation of P according to Γ, where we assume that Γ only contains P literals. So, given a set of ground literals, a separate run starting with the respective literals is needed for each occurring predicate.

The start state is (Γ; ∅; {P(x1,...,xn)}; ∅) for a finite consistent set of ground literals Γ over P and linear atom P(x1,...,xn) and a final state is (∅; Δ; ∅; M) where we will show M |= Γ. We assume a finite signature Σ.

The first rule Refine covers the situation where some atom A in I has both positive and negative instances in Γ. Since Γ is consistent, the atom A can be split into instances A_i of itself such that each of the resulting instances is guaranteed to eventually have only positive or only negative instances in Γ. Note that this may require repeated applications of the rule Refine.

**Refine** (Γ; Δ; I, P(t1,...,tn); M)

⇒mod (Γ; Δ; I ∪ ⋃_{f_i/k_i ∈ Ω} {P(t1,...,tn){x ↦ f_i(y_1,...,y_{k_i})}}; M)

provided P(t1,...,tn)|_p = x, the f_i are all function symbols (including constants) from Ω and k_i denotes their respective arity, all variables in f_i(y_1,...,y_{k_i}) are fresh, different variables, and p is a minimal variable position in P(t1,...,tn) for which there exist literals P(l1,...,ln) and ¬P(r1,...,rn) in Γ such that both are atom instances of P(t1,...,tn), and the subterms P(l1,...,ln)|_p and P(r1,...,rn)|_p are not unifiable

Due to refinement and the construction of the final complete literal interpretation it may happen that certain atoms in I do not have any instances in Γ. They are then moved to the final representation of the interpretation by rule Clean.

**Clean** (Γ; Δ; I, P(t1,...,tn); M) ⇒mod (Γ; Δ; I; M,P(t1,...,tn)) provided P(t1,...,tn) has no atom instance in Γ

Actually, the atom P(t1,...,tn) could be added positively or negatively to the final literal interpretation. In favor of Theorem 12 we stick here to adding all literals without instances positively. If all instances of some atom in I from Γ have an identical sign, they are solved and both the atom and the instances can be removed from I and Γ by rule Solve, respectively.

$$\begin{array}{l} \mathsf{Solve} \quad (\varGamma, \varGamma'; \varDelta; I, P(t\_1, \dots, t\_n); M) \\ \Rightarrow\_{\mathsf{mod}} (\varGamma; \varDelta, \varGamma'; I; M, \#P(t\_1, \dots, t\_n)) \end{array}$$

provided Γ′ ≠ ∅ and consists only of positive literal instances of P(t1,...,tn) or only of negative literal instances; Γ contains no literal instances of P(t1,...,tn); # = ¬ if the literals in Γ′ are negative and # is empty otherwise

*Example 1.* Let Σ = ({a/0,f/1, g/1}, {P/3}) be a signature. Now consider Γ = {K, L} where K = P(a, f(a), a), L = ¬P(a, g(a), a) over the signature Σ. An execution trace of ⇒mod is as follows:

$$\begin{array}{ll} 1: & (\{K, L\}; \emptyset; \{P(x\_1, x\_2, x\_3)\}; \emptyset) \\ 2: & \Rightarrow\_{\mathsf{mod}}^{\mathrm{Refine}} (\{K, L\}; \emptyset; \{P(x\_1, a, x\_3), P(x\_1, f(y\_1), x\_3), P(x\_1, g(y\_2), x\_3)\}; \emptyset) \\ 3: & \Rightarrow\_{\mathsf{mod}}^{\mathrm{Clean}} (\{K, L\}; \emptyset; \{P(x\_1, f(y\_1), x\_3), P(x\_1, g(y\_2), x\_3)\}; \{P(x\_1, a, x\_3)\}) \\ 4: & \Rightarrow\_{\mathsf{mod}}^{\mathrm{Solve}} (\{L\}; \{K\}; \{P(x\_1, g(y\_2), x\_3)\}; \{P(x\_1, a, x\_3), P(x\_1, f(y\_1), x\_3)\}) \\ 5: & \Rightarrow\_{\mathsf{mod}}^{\mathrm{Solve}} (\emptyset; \{K, L\}; \emptyset; \{P(x\_1, a, x\_3), P(x\_1, f(y\_1), x\_3), \neg P(x\_1, g(y\_2), x\_3)\}) \end{array}$$

Step 1: The initial state (Γ; Δ; I; M) consists of the set Γ = {K, L}, the empty set Δ, the set I containing only P(x1, x2, x3) which generalizes all literals over Σ with predicate P, and the empty set of literals M.

Step 2: Both K and L are atom instances of P(x1, x2, x3), but with opposite signs. Moreover, the terms K|_2 = f(a) and L|_2 = g(a) are not unifiable, the position p = 2 is the minimal variable position in P(x1, x2, x3) for which this is the case, and the preconditions of rule Refine are met. A refinement of P(x1, x2, x3) at position 2 takes place, and P(x1, x2, x3) is replaced by literals differing from it only at position 2 by replacing x2 by every constant and function symbol occurring in Σ. The resulting atoms are P(x1, a, x3), P(x1, f(y1), x3), and P(x1, g(y2), x3), and they again cover all P ground instances.

Step 3: The atom P(x1, a, x3) ∈ I has no atom instance in Γ and is moved to M by means of rule Clean.

Step 4: The positive literal K is the only instance of P(x1, f(y1), x3) in Γ, and the preconditions of rule Solve are met. The literal K is moved from Γ to Δ, and P(x1, f(y1), x3) is moved from I to M.

Step 5: The negative literal L is the only atom instance of P(x1, g(y2), x3) in Γ, and it has negative sign, so the preconditions of rule Solve are met. The literal L is moved from Γ to Δ, whereas P(x1, g(y2), x3) is removed from I and added to M with negative sign. Now both Γ and I are empty, and the execution stops with the linear complete literal interpretation M = {P(x1, a, x3), P(x1, f(y1), x3), ¬P(x1, g(y2), x3)}.
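As a sanity check, the three rules can be prototyped in a few lines. The encoding (nested tuples for terms, bare strings for variables, `(sign, atom)` pairs for literals) and all names are our own illustrative choices; the sketch exploits that two ground terms are unifiable iff they are equal, and it reproduces the run of Example 1:

```python
import itertools

def is_var(t):
    return isinstance(t, str)

def match(pattern, ground, sigma=None):
    """One-sided unification: sigma with pattern*sigma == ground, or None."""
    sigma = dict(sigma or {})
    if is_var(pattern):
        if pattern in sigma:
            return sigma if sigma[pattern] == ground else None
        sigma[pattern] = ground
        return sigma
    if is_var(ground) or pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    for p, g in zip(pattern[1:], ground[1:]):
        sigma = match(p, g, sigma)
        if sigma is None:
            return None
    return sigma

def subst(t, sigma):
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(subst(arg, sigma) for arg in t[1:])

def var_positions(t, prefix=()):
    if is_var(t):
        return [prefix]
    out = []
    for i, arg in enumerate(t[1:], start=1):
        out += var_positions(arg, prefix + (i,))
    return out

def at(t, p):
    for i in p:
        t = t[i]
    return t

def build_model(Gamma, start_atom, omega):
    """Exhaustively apply Refine, Clean, Solve for one predicate.

    Gamma: consistent ground literals (sign, atom); omega: [(symbol, arity)].
    Returns Delta and M of a final state (emptyset; Delta; emptyset; M)."""
    fresh = (f"v{i}" for i in itertools.count())
    Gamma, Delta, I, M = list(Gamma), [], [start_atom], []
    while I:
        B = I.pop()
        inst = [(s, a) for (s, a) in Gamma if match(B, a) is not None]
        if not inst:                               # Clean: no instance in Gamma
            M.append((True, B))
        elif len({s for s, _ in inst}) == 1:       # Solve: instances of one sign
            for lit in inst:
                Gamma.remove(lit)
                Delta.append(lit)
            M.append((inst[0][0], B))
        else:                                      # Refine at a minimal variable position
            for p in sorted(var_positions(B), key=len):
                # ground subterms are unifiable iff they are equal
                if any(at(a1, p) != at(a2, p)
                       for s1, a1 in inst for s2, a2 in inst if s1 and not s2):
                    x = at(B, p)                   # soundness guarantees such a p exists
                    for f, k in omega:
                        args = tuple(next(fresh) for _ in range(k))
                        I.append(subst(B, {x: (f,) + args}))
                    break
    assert not Gamma                               # all ground literals processed
    return Delta, M
```

On Example 1 (K = P(a, f(a), a), L = ¬P(a, g(a), a), Ω = {a/0, f/1, g/1}) the run ends with Δ = {K, L} and M assigning the a- and f-atoms positively and the g-atom negatively. The `pop`/`sorted` choices fix one of the admissible non-deterministic strategies; other orders may yield different, equally valid models (cf. Example 9).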

Next we prove that ⇒mod always computes an overall interpretation and model of the initial Γ. The basis for these results is the notion of a sound state below. We then show by induction on the length of a ⇒mod derivation that the initial state is sound and that each successor state of a sound state is sound again.

**Definition 2 (Sound State).** *A state* (Γ; Δ; I; M) *is* sound*, if the following invariants hold:*

1. *all literals in I ∪ M are linear;*
2. *no two atoms in I ∪ M are unifiable;*
3. *any ground atom over Σ with predicate P is an atom instance of some atom in I ∪ M;*
4. *M |= L for all L ∈ Δ;*
5. *any atom of a literal in Γ is an atom instance of some atom in I;*
6. *the maximal depth of the atoms in I ∪ M is at most one plus the maximal depth of the atoms in Γ ∪ Δ.*
**Lemma 3 (Soundness of Initial State).** *The initial state* (Γ; ∅; {P(x1,..., xn)}; ∅) *is sound.*

*Proof.* Invariants 1 to 6 given in Definition 2 hold in the initial state:

1. The atom P(x1,...,xn) ∈ I is linear, because the x1,...,xn are pairwise different variables, and M = ∅.

2. I ∪ M contains only the single atom P(x1,...,xn), hence no two atoms in I ∪ M are unifiable.
3. The atom P(x1,...,xn) ∈I∪ M generalizes all ground atoms over Σ with predicate P.

4. Holds trivially, because Δ = ∅.

5. The atom P(x1,...,xn) ∈ I generalizes all ground atoms over Σ with predicate P, and any atom(L) for L ∈ Γ is one of those.

6. Holds trivially, because the maximal depth of P(x1,...,xn) ∈ I is equal to one. 

**Lemma 4 (Soundness of the** ⇒**mod Rules).** *The rules of* ⇒*mod preserve state soundness.*

*Proof.* The proof is carried out by induction over the number of rule applications. By the induction hypothesis, we assume Invariants 1 to 6 given in Definition 2 hold in a state (Γ′; Δ′; I′; M′) and show that after the application of any rule they are still met in the resulting state (Γ; Δ; I; M).

Rule Refine. The literal L = P(t1,...,tn) ∈ I′ is replaced by literals L_i differing from L solely at the position p, which now contains either a constant symbol or a function symbol whose arguments are fresh, different variables.

1. The literal L is linear by the induction hypothesis. The variables introduced in the literals L_i are fresh and different, hence all literals L_i are linear, too. The literals in M′ are linear by the induction hypothesis and M′ remains unaffected. Therefore, all literals in I ∪ M are linear as well.

2. By the induction hypothesis, no two atoms in I′ ∪ M′ are unifiable. This holds in particular for the atom of L and any other atom in I′ ∪ M′. The terms L_i|_p are not unifiable by the definition of rule Refine, hence the atoms of the L_i are not pairwise unifiable. Since the literals L_i are instances of L, their atoms are not unifiable with any atom in (I′ ∪ M′) \ {L} and thus of I ∪ M. Moreover, the atoms in (I′ ∪ M′) \ {L} are by the induction hypothesis not unifiable with each other.

3. Let A be any ground atom. By the induction hypothesis, there exists a literal K in I′ ∪ M′ such that A is an instance of atom(K), i.e., there exists a substitution σ such that A = atom(K)σ. If K is not the literal L that is refined, then K is still in I ∪ M and so the property holds for atom A. If K is the literal L that is refined at position p with x = L|_p, then we know that atom A has a term A|_p = f_i(s_1,...,s_{k_i}) at position p. This means A is also an instance of the literal L_i = P(t1,...,tn){x ↦ f_i(y_1,...,y_{k_i})} that was newly added to I such that A = atom(L_i)σ_i, where σ_i = σ ∪ {y_1 ↦ s_1,...,y_{k_i} ↦ s_{k_i}}. Hence the property holds for all ground atoms.

4. Both Δ and M remain unaffected, and Invariant 4 still holds.

5. The set Γ remains unaffected by rule Refine, and its atoms are ground atoms over P. By the induction hypothesis, for any atom A in Γ there exist a literal K in I′ with atom(K) = B and a σ such that Bσ = A. If K is not the literal L that is refined, then K is still in I and so the property holds for atom A. If K is the literal L that is refined at position p with x = L|_p, then we know that atom A has a term A|_p = f_i(s_1,...,s_{k_i}) at position p. This means A is also an instance of the literal L_i = P(t1,...,tn){x ↦ f_i(y_1,...,y_{k_i})} that was newly added to I such that A = atom(L_i)σ_i, where σ_i = σ ∪ {y_1 ↦ s_1,...,y_{k_i} ↦ s_{k_i}}. Hence the property holds for all atoms A in Γ.

6. By the definition of rule Refine, the depth of any literal L_i added to I can be at most the depth of P(t1,...,tn) plus one, due to the introduction of a function symbol at position p. Therefore, the maximal depth of the literals in I may increase at most by one. However, the depth of any atom in Γ which is an instance of P(t1,...,tn) is at least equal to the one of P(t1,...,tn). Furthermore, Γ, Δ, and M remain unaltered, and therefore Invariant 6 still holds after the application of rule Refine.

Rule Clean.

1–3,5,6. The sets Γ and Δ remain unaltered. The literal P(t1,...,tn) is moved from I′ to M. Its atom remains unaffected, just as any other literal in I′ ∪ M′, and by the induction hypothesis, Invariants 1–3, 5, and 6 hold.

4. The ground set Δ remains unchanged, and no literal is removed from M′, therefore Invariant 4 still holds after executing rule Clean.

Rule Solve.

1–3. The literal P(t1,...,tn) is moved from I′ to M, either with positive or negative sign. Its atom remains unaffected, just as any other literal in I′ ∪ M′, and by the induction hypothesis, Invariants 1–3 hold.

4. The literals added to Δ are ground instances of #P(t1,...,tn), which is added to M, and Invariant 4 is met after the application of rule Solve.

5. The literals removed from Γ are ground instances of #P(t1,...,tn), and P(t1,...,tn) is removed from I′. Therefore, Invariant 5 still holds after applying rule Solve.

6. The removal of P(t1,...,tn) from I′ and the addition of #P(t1,...,tn) to M do not affect the maximal depth of the atoms in I ∪ M, and Γ ∪ Δ remains unaffected. Therefore, Invariant 6 still holds after applying rule Solve.

Next we show termination and that ⇒mod does not get stuck and always ends in a final state. From now on we only consider sound states.

**Lemma 5 (Termination and Runtime).** ⇒*mod terminates in polynomial time* O(size(Γ)²) *with respect to the size of* Γ*.*

*Proof.* For a state ({L1,...,Ln}; Δ; I; M) let I′ be a multiset of literals {K1,...,Kn} out of literals from I such that atom(Li) = atom(Ki)σi for some σi. Note that for a given {L1,...,Ln} and I the multiset I′ is unique. This is a result of a sound state and Invariant 2 of Definition 2. Let

$$\delta(\{L\_1, \ldots, L\_n\}, \{K\_1, \ldots, K\_n\}) = \sum\_{1 \le i \le n} \text{size}(\sigma\_i).$$

Now the measure (δ({L1,...,Ln}, {K1,...,Kn}), |I|) with >lex strictly decreases with each rule application: the rules Clean and Solve strictly decrease the number of the Li and/or |I|. For the rule Refine the atom P(t1,...,tn) has at least two different instances among the Li, and after application of the rule the respective σi for all those instances decrease in size by one.

There are at most size(Γ) many applications of Refine possible, and for each of these applications at most size(Γ) many applications of Clean or Solve are possible, resulting in the above upper bound. Please recall that the number of symbols in Ω is also bounded by size(Γ).

**Lemma 6 (No Stuck States).** *If for a state* (Γ; Δ; I; M) *we have* Γ ≠ ∅ *or* I ≠ ∅*, then at least one* ⇒*mod rule is applicable.*

*Proof.* Suppose Γ ≠ ∅ and L ∈ Γ. By soundness, Definition 2.5, there exists a literal K ∈ I such that atom(L) is an instance of K. If in addition Γ contains a literal H of sign opposite to the one of L where atom(H) is an instance of K, and there is a minimal variable position p in K such that atom(L)|_p and atom(H)|_p are not unifiable, the preconditions of rule Refine are met. If instead all literals H ∈ Γ, whose atoms are an instance of K, have the same sign as L, rule Solve can be applied. By Definition 2.5, it cannot happen that Γ ≠ ∅ and I = ∅. Now assume Γ = ∅ and I ≠ ∅, and let L be a literal in I. No atom instance of L is contained in Γ, and the preconditions of rule Clean are met.

A consequence of Lemma 6 is that ⇒mod always makes progress, i.e., in any non-terminal state, a rule is applicable. Finally, we prove that ⇒mod in fact produces an overall interpretation satisfying the literals from the initial state.

**Lemma 7 (All Literals are Considered).** *Let* (Γ0; ∅; {P(x1,...,xn)}; ∅) *be an initial state. Then for any (possibly non-final) state* (Γ; Δ; I; M) *obtained during the execution of* ⇒*mod on the initial state, it holds that* Γ ∪ Δ = Γ0*.*

*Proof.* In the initial state (Γ0; Δ0; {P(x1,...,xn)}; ∅), this is obviously the case since Δ0 = ∅. For proving that this property holds throughout the execution of ⇒mod, we assume that it holds in a state (Γ′; Δ′; I′; M′) and show that after applying one rule, it is still met in the resulting state (Γ; Δ; I; M).

Refine, Clean. Both Γ and Δ remain unaltered, hence Γ ∪ Δ = Γ′ ∪ Δ′ = Γ0. Solve. Literals are moved from Γ′ to Δ, hence Γ ∪ Δ = Γ′ ∪ Δ′ = Γ0.

**Lemma 8 (Complete Linear Literal Model).** *Let* (Γ0; ∅; {P(x1,..., xn)}; ∅) *be an initial state and* (∅; Δ; ∅; M) *a final state generated by executing* ⇒*mod on it. Then* M *is a complete linear literal model of* Γ0*.*

*Proof.* M is a complete linear literal interpretation by Definition 2.1-3, Lemma 4. By Lemma 7, we have Δ = Γ0. By Definition 2.4, the literals in M generalize all literals in Δ and hence in Γ0. This proves that M is a model of Γ0. 

Our rules are not deterministic, and several factors affect the model obtained by running ⇒mod with the same initial state (Γ0; ∅; {P(x1,...,xn)}; ∅). If the preconditions of multiple rules are met in a non-final state (Γ; Δ; I; M), we are free to choose the order in which we execute them. If there are literals L, K ∈ I meeting the preconditions of Refine with respect to the same minimal variable position p, either may be chosen. Thus applying ⇒mod to the same trail twice might give us two literal interpretations of different size, as shown by the following example.

*Example 9 (Model Size).* Consider the signature Σ = ({a/0, g/1}, {R/2}) and Γ0 = {L1, L2, L3, L4, L5, L6} where L1 = ¬R(a, a), L2 = R(a, g(a)), L3 = R(g(a), g(g(a))), L4 = R(a, g(g(a))), L5 = ¬R(g(a), g(a)), and L6 = ¬R(g(a), a). A possible run is shown below. The variables or literals we refine in the next step or apply Solve or Clean to, respectively, are underlined.

$$\begin{array}{ll} 0: & (\varGamma\_{0}; \emptyset; \{R(\underline{x}, y)\}; \emptyset) \\ 1: & \Rightarrow\_{\text{mod}}^{\text{Refine}} (\varGamma\_{0}; \emptyset; \{R(a, \underline{y}), R(g(z), y)\}; \emptyset) \\ 2: & \Rightarrow\_{\text{mod}}^{\text{Refine}} (\varGamma\_{0}; \emptyset; \{\underline{R(a, a)}, \underline{R(a, g(u))}, R(g(z), y)\}; \emptyset) \\ 3: & \Rightarrow\_{\text{mod}}^{\text{Solve}^{\*}} (\varGamma\_{1}; \varDelta\_{1}; \{R(g(z), \underline{y})\}; M\_{1}) \\ 4: & \Rightarrow\_{\text{mod}}^{\text{Refine}} (\varGamma\_{1}; \varDelta\_{1}; \{\underline{R(g(z), a)}, R(g(z), g(v))\}; M\_{1}) \\ 5: & \Rightarrow\_{\text{mod}}^{\text{Solve}} (\varGamma\_{2}; \varDelta\_{2}; \{R(g(z), g(\underline{v}))\}; M\_{2}) \\ 6: & \Rightarrow\_{\text{mod}}^{\text{Refine}} (\varGamma\_{2}; \varDelta\_{2}; \{\underline{R(g(z), g(a))}, \underline{R(g(z), g(g(w)))}\}; M\_{2}) \\ 7: & \Rightarrow\_{\text{mod}}^{\text{Solve}^{\*}} (\varGamma\_{3}; \varDelta\_{3}; \emptyset; M\_{3}) \end{array}$$

The initial state is given by (Γ0; ∅; {R(x, y)}; ∅) (step 0). We choose to refine x at position p = 1 in R(x, y), since L1 and L3 are instances of its atom with differing signs and L1|_1 and L3|_1 are not unifiable. Rule Refine replaces R(x, y) by R(a, y) and R(g(z), y) (step 1). Similarly, we refine the variable y at position p = 2 in R(a, y), since L1 and L2 are instances of it having different sign and L1|_2 and L2|_2 are not unifiable (step 2). Then rule Solve can be applied twice, namely to R(a, a) and its negative instance L1 ∈ Γ0, and to R(a, g(u)) and its positive instances L2, L4 ∈ Γ0. We obtain Γ1 = Γ0 \ {L1, L2, L4}, Δ1 = {L1, L2, L4}, and M1 = {¬R(a, a), R(a, g(u))} (step 3). Next, the variable y at position p = 2 in R(g(z), y) is refined, since L3 and L5 are instances of it having opposite sign and their subterms at position 2 are not unifiable. The literal R(g(z), y) is replaced by R(g(z), a) and R(g(z), g(v)) (step 4). Since Γ1 contains only a negative instance of R(g(z), a), namely L6, rule Solve is applied, resulting in Γ2 = Γ1 \ {L6}, Δ2 = Δ1 ∪ {L6}, and M2 = M1 ∪ {¬R(g(z), a)} (step 5). The trail Γ2 contains instances L3 and L5 of R(g(z), g(v)) with different sign. Variable v at position 21 is chosen for refinement, since L3|_21 and L5|_21 are not unifiable, and R(g(z), g(v)) is replaced by R(g(z), g(a)) and R(g(z), g(g(w))) (step 6). Now Γ2 contains only a negative instance of R(g(z), g(a)) and a positive one of R(g(z), g(g(w))), and rule Solve is applicable. This gives us Γ3 = Γ2 \ {L3, L5} = ∅, Δ3 = Δ2 ∪ {L3, L5} = Γ0, and M3 = M2 ∪ {¬R(g(z), g(a)), R(g(z), g(g(w)))} with 5 literals (step 7).

The choice of the variable to be refined in step 1 is not deterministic, and the following steps might lead to a different model. A different run for Γ′0 = Γ0 could be as follows:

$$\begin{array}{ll} 0: & (\Gamma'\_0; \emptyset; \{R(x, \underline{y})\}; \emptyset) \\ 1: \Rightarrow\_{\text{mod}}^{\text{Refine}} & (\Gamma'\_0; \emptyset; \{R(x, a), R(x, g(z))\}; \emptyset) \\ 2: \Rightarrow\_{\text{mod}}^{\text{Solve}} & (\Gamma'\_1; \Delta'\_1; \{R(\underline{x}, g(z))\}; M'\_1) \\ 3: \Rightarrow\_{\text{mod}}^{\text{Refine}} & (\Gamma'\_1; \Delta'\_1; \{R(a, g(z)), R(g(u), g(z))\}; M'\_1) \\ 4: \Rightarrow\_{\text{mod}}^{\text{Solve}} & (\Gamma'\_2; \Delta'\_2; \{R(g(u), g(\underline{z}))\}; M'\_2) \\ 5: \Rightarrow\_{\text{mod}}^{\text{Refine}} & (\Gamma'\_2; \Delta'\_2; \{R(g(u), g(a)), R(g(u), g(g(v)))\}; M'\_2) \\ 6: \Rightarrow\_{\text{mod}}^{\text{Solve}} & (\Gamma'\_3; \Delta'\_3; \emptyset; M'\_3) \end{array}$$

The first refinement step involves p = 2, y = R(x, y)|2, motivated by L1 and L2 (step 1). Now we can execute Solve on R(x, a), since it has only negative instances in Γ′0, namely L1 and L6, obtaining Γ′1 = Γ0 \ {L1, L6}, Δ′1 = Δ′0 ∪ {L1, L6}, and M′1 = {¬R(x, a)} (step 2). The variable x at position 1 of R(x, g(z)) is refined, since L2 and L5 are a positive and a negative instance, respectively, of R(x, g(z)) and their subterms at position 1 are not unifiable; R(x, g(z)) is replaced by R(a, g(z)) and R(g(u), g(z)) (step 3). Now rule Solve can be executed on R(a, g(z)), which generalizes L2 and L4, which are both positive. This results in Γ′2 = Γ′1 \ {L2, L4}, Δ′2 = Δ′1 ∪ {L2, L4}, and M′2 = M′1 ∪ {R(a, g(z))} (step 4). Next, a Refine step on z at position 21 in R(g(u), g(z)) is executed due to L3 and L5, which are instances of R(g(u), g(z)) of opposite sign and whose subterms at position 21 are not unifiable. The literal R(g(u), g(z)) is replaced by R(g(u), g(a)) and R(g(u), g(g(v))) (step 5), which have one instance each in Γ′2. Rule Solve is applied twice, resulting in Γ′3 = Γ′2 \ {L3, L5} = ∅, Δ′3 = Δ′2 ∪ {L3, L5} = Γ0, and M′3 = M′2 ∪ {¬R(g(u), g(a)), R(g(u), g(g(v)))} with 4 literals.

So M3 and M′3 not only differ syntactically, but also contain a different number of literals. Refining x before y led to ¬R(a, a), ¬R(g(z), a) ∈ M3, whereas refining y before x resulted in ¬R(x, a) ∈ M′3, which generalizes both ¬R(a, a) and ¬R(g(z), a).
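That M3 and M′3 nevertheless define the same interpretation can be checked mechanically. The following is a small sketch (our own tuple encoding of terms, not code from the paper): a ground atom is evaluated in a complete linear literal interpretation by matching it against the model's literals, and we verify that M3 and M′3 agree on all ground atoms R(s, t) with terms up to depth 4.

```python
# Ground terms over {a/0, g/1} as nested tuples; variables as ('?', name).
A = ('a',)
def g(t): return ('g', t)
def var(name): return ('?', name)

def match(pattern, term):
    """Match a linear pattern against a ground term."""
    if pattern[0] == '?':
        return True                      # a variable matches anything
    if pattern[0] != term[0] or len(pattern) != len(term):
        return False
    return all(match(p, t) for p, t in zip(pattern[1:], term[1:]))

def truth(model, atom):
    """Truth value of ground atom R(s, t) in a model [(sign, (s, t)), ...]."""
    for sign, (ps, pt) in model:
        if match(ps, atom[0]) and match(pt, atom[1]):
            return sign
    raise ValueError("model is not complete for %r" % (atom,))

# M3 from the first run and M'_3 from the second run.
M3  = [(False, (A, A)), (True, (A, g(var('u')))),
       (False, (g(var('z')), A)), (False, (g(var('z')), g(A))),
       (True, (g(var('z')), g(g(var('w')))))]
M3p = [(False, (var('x'), A)), (True, (A, g(var('z')))),
       (False, (g(var('u')), g(A))), (True, (g(var('u')), g(g(var('v')))))]

def terms(depth):
    yield A
    if depth > 0:
        for t in terms(depth - 1):
            yield g(t)

atoms = [(s, t) for s in terms(4) for t in terms(4)]
assert all(truth(M3, at) == truth(M3p, at) for at in atoms)
```

Both models are complete: every ground atom matches exactly one literal, so the first match decides its truth value.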

In summary, ⇒mod computes an overall interpretation out of the initial finite set of consistent ground literals in polynomial time. We briefly compare our model representation formalism with the long-standing literature, in particular [10,17], which suggested four postulates that should ideally be met by any model representation formalism:

– **Uniqueness.** Each model representation M specifies a single interpretation over Σ.


The model M obtained by ⇒mod is a complete linear literal interpretation. Our representation formalism is therefore a special case of an atomic representation (ARM) [10] if we leave out negative literals which are implicit for ARMs. The validity of the four model building postulates has been shown for ARMs [10]. So the models computed by ⇒mod satisfy the four model building postulates. Clause evaluation for our linear literal models M is straightforward: a clause C is valid iff there is no substitution σ such that for each L ∈ C there is a literal K ∈ M such that Lσ and Kσ are complementary. Recall that this is a consequence of the fact that our literal interpretations are explicit and complete: for any ground atom A over Σ there is a literal K in M such that A is a literal instance of K. The respective procedure for ARMs is more involved [45], whereas in our case established techniques for hyper-resolution apply [35,47,56].

Finally, we derive consequences of our model building procedure for non-ground literals and the SCL calculus: if the computed interpretation does not satisfy all clauses, then it can be used to effectively compute a minimal extension of the ground literal restriction of the SCL calculus.

**Theorem 10 (Non-ground Guarantees).** *Let* Γ *be a set of consistent ground literals. Let* M *be a model generated by* ⇒*mod from* Γ*. Let* L *be a (potentially non-ground) linear literal with* depth(atom(L)) = d*. Let* ε = 1 *if* L *has a position* p *of depth* d *(i.e.,* |p| = d*) such that* L|p *is a constant; otherwise let* ε = 0*. Let* Γ *contain all ground instances* Lσ *of* L *(i.e.,* Lσ ∈ Γ*) with* depth(atom(Lσ)) ≤ d + ε*. Let* Γ *contain no ground instance of* comp(L)*, i.e., for all* K ∈ Γ *it holds that* K *is not unifiable with* comp(L)*. Then* M |= L*.*

*Proof.* By contradiction. We assume that all our assumptions hold, but that M ⊭ L. By Definition 2.3 and Lemmas 3 and 4, M ⊭ L if there exists a K ∈ M that is unifiable with comp(L). Moreover, Γ contains no ground instances of comp(L) by assumption. We now prove by induction that we can only reach states (Γ; Δ; I; M) where A ∈ I is unifiable with atom(L) if A has depth ≤ d + ε and Γ contains all ground instances Lσ of L such that atom(Lσ) is also a ground instance of A and depth(atom(Lσ)) ≤ d + ε. (Note that there always exists at least one such ground instance because A has depth ≤ d + ε.) This property guarantees that Clean can never be applied to an atom A ∈ I that is unifiable with atom(L), and that Solve is only applicable to an atom A ∈ I that is unifiable with atom(L) if there is also an instance atom(Lσ) of A in Γ that ensures that we assign A with the correct polarity. The induction base holds trivially because in the state (Γ; ∅; {P(x1,...,xn)}; ∅) the only atom in I is P(x1,...,xn), it has the minimal depth 1, and Γ contains by assumption all ground instances of L with depth ≤ d + ε. For the induction step, we assume that (Γ; Δ; I; M) is a sound state that satisfies our property and prove that any direct successor state (Γ′; Δ′; I′; M′) must again satisfy our property. We prove this by case distinction:

1) Clean and Solve only remove elements A from I and all positive and negative instances of A from Γ. Together with Definition 2.2, this guarantees that the literals removed from Γ do not match any of the remaining elements of I′. Therefore, the property still holds.

2) Refine on A, but A is not unifiable with atom(L). Trivial, because none of the new elements in I′ will be unifiable with atom(L) either.

3) Refine on A, where A is unifiable with atom(L), depth(A) ≤ d + ε, and the position p of the refined variable has depth |p| < d + ε. By induction, Γ′ = Γ contains all ground instances Lσ of L such that atom(Lσ) is also a ground instance of A and depth(atom(Lσ)) ≤ d + ε. Moreover, any new atom A′ ∈ I′ \ I has at most depth |p| + 1, so still ≤ d + ε. And lastly, since A′ is an instance of A, Γ′ contains all ground instances Lσ of L such that atom(Lσ) is also a ground instance of A′ and depth(atom(Lσ)) ≤ d + ε.

4) Refine on A, where A is unifiable with atom(L), depth(A) = d + ε, and the position p of the refined variable has depth |p| = d + ε. Let A∗ = A mgu(A, atom(L)), and let L∗ = A∗ and L′ = A if L is positive, or else L∗ = ¬A∗ and L′ = ¬A. By induction, Γ′ = Γ contains all ground instances L∗σ with depth(atom(L∗σ)) ≤ d + ε, and they all have the same polarity as L. Moreover, any variable in A either has a position q with depth |q| = d + ε, or there exist no (Aτ), comp(Aτ′) ∈ Γ such that (Aτ)|q and (Aτ′)|q are not unifiable. However, this means that any ground instance (L∗[xq]q)σ of L∗[xq]q must also be in Γ′ = Γ if q is the position of a variable xq in A with |q| < d + ε. Note that, due to the linearity of L and A (and assuming disjoint variables), A∗|q′ ≠ A|q′ if and only if there exist q′, q″ such that q = q′q″, A|q is the position of a variable xq, |q| < d + ε, L|q′ is defined and not a variable, and A∗|q′ = L|q′. This means that we obtain L′/A from L∗/A∗ if we replace all positions q in L∗/A∗ by A|q whenever A|q = xq and |q| < d + ε. Combining this with the previous fact for all variable positions |q| < d + ε, we get that any ground instance L′σ of L′ must be in Γ′ = Γ, and therefore Refine is not applicable. A contradiction.

5) Refine on A, where A is unifiable with atom(L) and depth(A) > d + ε. This case is impossible by the induction hypothesis. □

The preconditions of Theorem 10 may look unrelated to its conclusion at first sight. The first example shows why Γ needs to contain all ground instances Lσ of L of depth depth(atom(L)). The reason is that Refine may lead to an atom K in I that is unifiable with comp(L) although Γ contains no ground instance of comp(K). In our example this is P(x, f(y)). The second example shows why Γ needs to contain all ground instances Lσ of L of depth depth(atom(L)) + 1 if L has a constant at a position p with depth d = depth(atom(L)). The reason is again that an application of Refine may lead to an atom K in I that is unifiable with comp(L) although Γ contains no ground instance of comp(K). In our example this is P(f(x), y).

*Example 11.* (1) Let Γ = {¬P(a, a), ¬P(f(a), a), P(a, f(a))} with signature Σ = ({a/0, f/1}, {P}). Then for the input state (Γ; ∅; {P(x, y)}; ∅) the calculus returns the model M = {¬P(x, a), P(x, f(y))}, because we first need to apply Refine to position 2. Although for ¬P(f(x), y) there is no inconsistent atom in Γ, M ⊭ ¬P(f(x), y).

(2) Let Γ = {¬P(a, a), ¬P(a, b), ¬P(b, a), P(b, b)} with signature Σ = ({a/0, b/0, f/1}, {P}). Then for the input state (Γ; ∅; {P(x, y)}; ∅) the calculus can return the model M = {¬P(a, y), P(f(x), y), ¬P(b, a), P(b, b), P(b, f(y))} if we first apply Refine to position 1. And although for ¬P(x, a) there is no inconsistent atom in Γ, M ⊭ ¬P(x, a).

**Theorem 12 (Non-ground Guarantees by Clean).** *Let* Γ *be a consistent set of ground literals. Let* M *be a model generated by* ⇒*mod from* Γ*. Let* d *be the maximal depth of any negative literal in* Γ*. Let* A *be a linear atom with* depth(A) ≤ d*. Let* Γ *contain all ground instances* Aσ *of* A *with* depth(Aσ) ≤ d*. Then* M |= A*.*

*Proof.* Note that the most general unifier of any two linear literals K, K′ has depth depth(K mgu(K, K′)) = max(depth(K), depth(K′)). Firstly, we show that rule Solve can never add a negative literal ¬B to the model that is unifiable with ¬A. In this case, Solve is only applicable if Γ contains a ground instance ¬Bσ and no ground instance Bσ. The first condition implies depth(¬B) ≤ d, and if B and A are unifiable, then A mgu(A, B) has depth ≤ d. However, since Γ contains all ground instances of A with depth ≤ d, this also means Γ contains all ground instances of A mgu(A, B) with depth ≤ d. Hence our assumptions guarantee that the second condition for Solve is not satisfied if ¬B is unifiable with ¬A. So Solve will not add a literal ¬B that is unifiable with ¬A. Secondly, in addition to Solve, only Clean adds literals to M. All literals added by Clean are atoms, so they cannot unify with ¬A. Hence, M |= A. □

**Lemma 13 (Lower Bound for SCL Refutations).** *Let* Γ *be the ground partial model of an SCL stuck state for the input clause set* N*, bounded by* ≺ *and* β*. This means in particular that (i) every literal* L ∈ Γ *is ground and bounded by* ≺ *and* β *(i.e.,* L ≺ β*), (ii) every ground atom* A ≺ β *is defined by* Γ *(i.e.,* A ∈ Γ *or* ¬A ∈ Γ*), and (iii) for every clause* C ∈ N *and every grounding* σ *of* C*, either* Γ |= Cσ *or there exists a literal* L ∈ Cσ *such that* L ⊀ β*. Let* M *be a complete interpretation (i.e., for every ground atom* A*,* M |= A *or* M |= ¬A*) that models* Γ *(i.e.,* M |= Γ*) but not the clause set* N *(i.e.,* M ⊭ N*). Let* β′ *be a smallest ground literal according to* ≺ *such that there exists a clause* C ∈ N *and a grounding* τ*, where* L ≺ β′ *holds for every literal* L ∈ Cτ*, and* M ⊭ Cτ*. Then there exists no* β∗ ≺ β′ *such that an SCL run on* N *bounded by* ≺ *and* β∗ *finds a refutation.*

*Proof.* The assumptions for β′ and the completeness of M imply that M |= Cσ for all clauses C ∈ N and all groundings σ where L ≺ β∗ holds for every literal L ∈ Cσ. This means one valid SCL run for N, ≺, and β∗ can simply decide all ground atoms A ≺ β∗ according to M, i.e., according to whether M |= A or M |= ¬A, without encountering any conflicts, ending in a stuck state with a set Γ′ such that Γ′ |= gndβ∗(N), where the function gndβ∗ computes the set of all ground instances of a clause set where the grounding is restricted to produce literals L with L ≺ β∗. The existence of this stuck state proves that gndβ∗(N) is satisfiable and that there exists no refutation for it. Hence, no SCL run for N, ≺, and β∗ can find a refutation. □

# **5 Conclusion and Future Work**

Explicit model building is always a compromise between the expressivity of the language used, its computational properties, and the effort needed to actually compute the model. Satisfiability of first-order logic clause sets is not even semi-decidable, so there cannot be a general solution. In the context of SCL, efficient model building and efficient clause evaluation are important aspects, and our quite simple model building language, namely complete linear literal interpretations, serves these two purposes nicely. Still, there may be room for improvement. For example, the three clauses

$$\begin{array}{cc} \neg R(x,x) & R(x,g(x)) \\ \neg R(x,y) \lor \neg R(y,z) \lor R(x,z) \end{array}$$

do not have a finite model. Linear literal interpretations have the finite model property, so there cannot be a finite representation of a model within this language; a more expressive language is needed. For example, assuming an additional constant a and a bound β = R(g(a), g(g(a))), a partial model computed by SCL would be

$$\{ \neg R(a, a), R(a, g(a)), R(g(a), g(g(a))), R(a, g(g(a))), \neg R(g(a), g(a)), \neg R(g(a), a) \}$$

The respective overall model could be represented by the linear Horn clause set

$$\begin{array}{cc} \neg R(a,a) & R(a,g(a)) \\ \neg R(x,y) \to \neg R(g(x),y) & R(x,y) \to R(x,g(y)) \\ \neg R(x,y) \to \neg R(g(x),g(y)) & R(x,y) \to R(g(x),g(y)) \end{array}$$

or by terms with exponents and constraints [4,13]

$$i \ge j \parallel \neg R(g^i(a), g^j(a)) \qquad i < j \parallel R(g^i(a), g^j(a)) \quad \dots$$

However, it is an open question how such representations can be actually computed out of a set of ground literals and how they can be used to efficiently test validity of clauses.
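The constrained representation above denotes the interpretation where R(g^i(a), g^j(a)) holds iff i < j. A quick finite check (a sketch over exponents 0..N only, since the model itself is infinite) that this interpretation satisfies the three clauses:

```python
N = 20
R = lambda i, j: i < j   # R(g^i(a), g^j(a)) iff i < j

for i in range(N):
    assert not R(i, i)        # ¬R(x, x): irreflexivity
    assert R(i, i + 1)        # R(x, g(x)): successor edge
for i in range(N):
    for j in range(N):
        for k in range(N):
            # ¬R(x, y) ∨ ¬R(y, z) ∨ R(x, z): transitivity
            assert (not R(i, j)) or (not R(j, k)) or R(i, k)
```

The check passes because < on the naturals is irreflexive and transitive, and every i is below i + 1.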

The rule Clean may actually add the respective literal either positively or negatively to M. In practice, such literals could be marked in M. Then, when starting from an SCL stuck trail where M is not a model for the clause set, a small but useful extension is to check whether flipping the sign of some of these literals turns M into a model.

In summary, we have presented an algorithm that computes, in polynomial time, from a finite consistent set of ground literals Γ a complete linear literal interpretation M such that M |= Γ. Furthermore, M can be effectively used to evaluate clauses and to determine a minimal extension of the ground literal restriction β from an SCL stuck state.

**Acknowledgements.** We thank our anonymous reviewers for their constructive comments.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Synthesis of Recursive Programs in Saturation**

Petra Hozzová<sup>1(B)</sup>, Daneshvar Amrollahi<sup>2</sup>, Márton Hajdu<sup>1</sup>, Laura Kovács<sup>1</sup>, Andrei Voronkov<sup>1,3,4</sup>, and Eva Maria Wagner<sup>1</sup>

> <sup>1</sup> TU Wien, Vienna, Austria — petra.hozzova@tuwien.ac.at
> <sup>2</sup> Stanford University, Stanford, USA
> <sup>3</sup> University of Manchester, Manchester, UK
> <sup>4</sup> EasyChair, Manchester, UK

**Abstract.** We turn saturation-based theorem proving into an automated framework for recursive program synthesis. We introduce magic axioms as valid induction axioms and use them together with answer literals in saturation. We introduce new inference rules for induction in saturation and use answer literals to synthesize recursive functions from these proof steps. Our proof-of-concept implementation in the Vampire theorem prover constructs recursive functions over algebraic data types, while proving inductive properties over these types.

**Keywords:** Program Synthesis · Saturation · Superposition · Induction · Recursion · Theorem Proving

# **1 Introduction**

Program synthesis is the task of constructing a program P satisfying a given specification F, ensuring that P is correct by design [20]. In this paper we work with a functional specification F of the input-output relation of a program P, *where* F *is given as a* ∀∃ *formula in first-order logic* [1,20]. Validity of a specification formula F ensures that for every input value there exists an output value satisfying F, and therefore there is a function which for every input value gives such an output value. Our goal is to *automatically find a (possibly recursive) program* P *that computes the output, while preserving* F.

As a complementary approach to formal verification, synthesis is inherently more complex [28]. The complexity is further compounded when we consider reasoning about – and synthesizing – programs using recursion. As a remedy, in this paper we advocate for using automated first-order theorem proving as the reasoning back-end to (recursive) program synthesis.

The work [8] extended the saturation-based first-order theorem proving framework to a *saturation-based synthesis framework*. The approach (i) uses saturation-based reasoning to prove that a specification F is valid; (ii) tracks the constructive parts of the proof of F; and (iii) uses them to synthesize a program P satisfying F. In this paper we complement [8] with support for *recursive program* synthesis. We use recent developments on automating induction in saturation [5,7,9,25] and construct recursive programs based on applications of induction.


**Fig. 1.** Axioms of half and the ∀∃-specification for the function computing double.

**Illustrative Example.** Consider the specification (SD) of Fig. 1, which describes the inverse of the half function over natural numbers. Given the axiomatization of half in Fig. 1, our approach synthesizes the recursive function double as a solution of (SD), defined as:

$$\mathsf{double}(\mathsf{0}) \simeq \mathsf{0}$$

$$\forall x. \; \mathsf{double}(\mathsf{s}(x)) \simeq \mathsf{s}(\mathsf{s}(\mathsf{double}(x))) \tag{1}$$

The framework of [8] fails to synthesize a solution of (SD), as double is a recursive program. To the best of our knowledge, there exists no automated approach supporting recursive function synthesis from functional input-output specifications in full first-order logic.
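For concreteness, the recursive solution (1) can be transcribed over Peano numerals and checked against the half axioms. Since Fig. 1 is not reproduced here, the axiomatization of half below is an assumption (the standard one: half(0) ≃ 0, half(s(0)) ≃ 0, half(s(s(x))) ≃ s(half(x))); only double follows (1) exactly.

```python
# Peano numerals as nested tuples: zero = ('0',), s(n) = ('s', n).
ZERO = ('0',)
def s(n): return ('s', n)

def double(n):
    if n == ZERO:
        return ZERO                      # double(0) ≃ 0
    return s(s(double(n[1])))            # double(s(x)) ≃ s(s(double(x)))

def half(n):                             # assumed axiomatization of half
    if n == ZERO or n[1] == ZERO:
        return ZERO                      # half(0) ≃ 0, half(s(0)) ≃ 0
    return s(half(n[1][1]))              # half(s(s(x))) ≃ s(half(x))

def num(k):
    """Build the numeral s^k(0)."""
    return ZERO if k == 0 else s(num(k - 1))

# double is a witness for y in the specification: half(double(x)) equals x.
assert all(half(double(num(k))) == num(k) for k in range(30))
```
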

This paper provides a solution in this respect by exploiting the constructive nature of induction. Intuitively, each case of an induction axiom tells us how to construct the desired program for the next recursive step using the program for the previous recursive step. We capture this construction recipe contained in the applications of induction in saturation-based proof search, by utilizing answer literals ans(r) [4]. When we use an induction axiom in the proof, we introduce a special term into the answer literal, serving for tracking the program corresponding to the induction axiom. As we prove the cases of the induction axiom, we capture their corresponding programs in the answer literal. Finally, when we derive a clause C ∨ ans(r), where C only contains symbols allowed in a program, we convert the special tracker terms from r into recursive functions, and obtain a program for the initial specification conditioned on ¬C.

**Contributions.** We extend saturation-based first-order theorem proving with recursive program synthesis and bring the following contributions<sup>1</sup>:


<sup>1</sup> Proofs are given in the extended version [10] of our paper.


# **2 Preliminaries**

We assume familiarity with standard multi-sorted first-order logic (FOL) with equality. We denote variables by x, y, z, w, u, terms by s, t, r, atoms by A, literals by L, clauses by C, D, and formulas by F, G, all possibly with indices. Further, we write σ for Skolem constants. We reserve the symbol □ for the *empty clause*, which is logically equivalent to ⊥. We write L̄ for the literal complementary to L. By ≃ we denote the equality predicate and write t ̸≃ s as a shorthand for ¬ t ≃ s. We include a conditional term constructor if−then−else in the language, as follows: given a formula F and terms s, t of the same sort, we write if F then s else t to denote the term s if F is true and t otherwise. An *expression* is a term, literal, clause or formula. We write E[t] to denote that the expression E contains the term t. For simplicity, E[s] then denotes the expression E where all occurrences of t are replaced by the term s. Formulas with free variables are considered implicitly universally quantified; that is, we consider closed formulas.

We use the standard semantics for FOL. For an interpretation function I, we denote the interpretation of a variable x, a function symbol f, and a predicate symbol p by x^I, f^I, p^I, respectively. We use the notation e^I, F^I also for the interpretation of expressions e and formulas F, respectively. Further, for a variable or a constant a and a value v, we denote by I{a ↦ v} the interpretation function I′ such that a^{I′} = v and b^{I′} = b^I for any constant or variable b ≠ a. We write F1,...,Fn ⊨ G1,...,Gm to denote that F1 ∧ ... ∧ Fn → G1 ∨ ... ∨ Gm is valid, and extend the notation also to validity modulo a theory T.

We recall the standard notion of λ-expressions. Let t be a term and x a variable. Then λx.t denotes a λ*-expression*. For any interpretation I, we define (λx.t)^I as the function f given by f(v) = t^{I{x ↦ v}} for any value v. Moreover, we extend the notation of λ-expressions to also bind constants. Let c be a constant; then λc.t also denotes a λ*-expression*, and its interpretation (λc.t)^I is the function f given by f(v) = t^{I{c ↦ v}} for any value v.

A *substitution* θ is a mapping from variables to terms. A substitution θ is a *unifier* of two expressions E and E′ if Eθ = E′θ; θ is a *most general unifier* (*mgu*) if for every unifier η of E and E′, there exists a substitution μ such that η = θμ. We denote the mgu of E and E′ by mgu(E, E′).
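The mgu definition above can be illustrated by a minimal Robinson-style unification sketch (our own encoding: variables as `('?', name)`, applications as `(f, arg1, ..., argn)`; the substitution is returned as a dict whose bindings are resolved by chasing):

```python
def subst(t, sigma):
    """Apply substitution sigma to term t, chasing variable bindings."""
    if t[0] == '?':
        v = sigma.get(t[1])
        return subst(v, sigma) if v is not None else t
    return (t[0],) + tuple(subst(a, sigma) for a in t[1:])

def occurs(x, t, sigma):
    t = subst(t, sigma)
    if t[0] == '?':
        return t[1] == x
    return any(occurs(x, a, sigma) for a in t[1:])

def mgu(s, t, sigma=None):
    """Most general unifier of s and t, or None if they are not unifiable."""
    sigma = dict(sigma or {})
    s, t = subst(s, sigma), subst(t, sigma)
    if s == t:
        return sigma
    if s[0] == '?':
        if occurs(s[1], t, sigma):
            return None                  # occurs check: no unifier
        sigma[s[1]] = t
        return sigma
    if t[0] == '?':
        return mgu(t, s, sigma)
    if s[0] != t[0] or len(s) != len(t):
        return None                      # function symbol clash
    for a, b in zip(s[1:], t[1:]):
        sigma = mgu(a, b, sigma)
        if sigma is None:
            return None
    return sigma

x, y = ('?', 'x'), ('?', 'y')
a = ('a',)
def g(t): return ('g', t)

# Unifying R(x, g(x)) with R(a, y) yields x ↦ a, y ↦ g(a).
sigma = mgu(('R', x, g(x)), ('R', a, y))
assert subst(('R', x, g(x)), sigma) == subst(('R', a, y), sigma) == ('R', a, g(a))
```
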

We work with *term algebras* [27], in particular with the special classes of the algebraically defined data types of the natural numbers N, lists L, and binary trees BT.<sup>2</sup> We denote the sorts of symbols and terms by : (colon), e.g., f : τ → α is a function symbol with domain τ and range α. To emphasize the sort τ of a quantified variable x, we write ∀x ∈ τ or ∃x ∈ τ. For a term algebra sort τ, we denote its constructors by Στ. We fix an arbitrary ordering on the constructors,

<sup>2</sup> Definitions of these term algebras are in the extended version [10] of this paper.


**Fig. 2.** Simplified superposition calculus Sup.

and denote the i-th constructor in the order by $c_i$, i.e., $\Sigma_\tau = \{c_1,\dots,c_{|\Sigma_\tau|}\}$. For each $c_i$, we denote its arity by $n_{c_i}$, and by $P_{c_i}$ the set of argument positions of $c_i$ of the sort τ. We only consider the standard models of term algebras. Programs we synthesize may contain terminating recursive functions f : τ → α, where τ is a term algebra sort. We define such a function f by providing a set of equalities $\{f(c(\overline{x})) \simeq t[\overline{x}, f(x_{j_1}),\dots,f(x_{j_{|P_c|}})]\}_{c \in \Sigma_\tau}$, where $P_c = \{j_1,\dots,j_{|P_c|}\}$ and t contains no occurrences of f except for the distinguished ones. An example of such a definition is (1).
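The recursive-definition schema above can be instantiated for the list term algebra with constructors nil and cons(head, tail): here P_cons = {2}, so the defining equalities have the shape f(nil) ≃ t_nil and f(cons(x1, x2)) ≃ t[x1, x2, f(x2)]. A sketch (our own encoding; the length function is our illustrative choice, not an example from the paper):

```python
# Lists as a term algebra: nil and cons(head, tail).
NIL = ('nil',)
def cons(h, t): return ('cons', h, t)

def length(lst):
    if lst[0] == 'nil':
        return ('0',)                    # length(nil) ≃ 0
    return ('s', length(lst[2]))         # length(cons(x, xs)) ≃ s(length(xs))

xs = cons(1, cons(2, cons(3, NIL)))
assert length(xs) == ('s', ('s', ('s', ('0',))))
```

Note that `length` recurses only on the argument position in P_cons, matching the termination discipline of the schema.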

**Saturation and Superposition.** Saturation-based proof search implements *proving by refutation* [16]: validity of F is proved by establishing unsatisfiability of ¬F. Saturation-based first-order theorem provers work with clauses, rather than with arbitrary formulas. To prove a formula F, the provers negate F and further skolemize it and convert it to clausal normal form (CNF). The CNF of ¬F is denoted by cnf(¬F), resulting in a set S of initial clauses. For example, the CNF of the negated and skolemized (SD) is

$$\mathsf{half}(y) \not\simeq \sigma, \tag{2}$$

where σ is a fresh constant used for skolemizing x, and y is implicitly universally quantified. Saturation provers *saturate* S by computing logical consequences of S with respect to a sound inference system I. Whenever the empty clause □ is derived, the set S of clauses is unsatisfiable and F is valid. We may extend the initial set S with additional clauses C1,...,Cn. If C is derived from this extended set, we say C is derived from S *under additional assumptions* C1,...,Cn.

The *superposition calculus* Sup [22] is the most common inference system for first-order logic with equality. Figure 2 shows a simplified version of Sup. The Sup calculus is *sound* (if □ is derived from F, then F is unsatisfiable) and *refutationally complete* (if F is unsatisfiable, then □ can be derived from it).
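Proving by refutation via saturation can be illustrated in miniature. The toy below is propositional binary resolution run to a fixpoint, not the first-order Sup calculus of Fig. 2; clauses are frozensets of literals, and deriving the empty clause shows the input set is unsatisfiable.

```python
def neg(l):
    """Complement of a literal: p <-> ('~', p)."""
    return l[1] if isinstance(l, tuple) else ('~', l)

def resolvents(c1, c2):
    out = []
    for l in c1:
        if neg(l) in c2:
            out.append((c1 - {l}) | (c2 - {neg(l)}))
    return out

def saturate(clauses):
    """Return True iff the empty clause is derivable (set is unsatisfiable)."""
    s = set(clauses)
    while True:
        new = set()
        for c1 in s:
            for c2 in s:
                for r in resolvents(c1, c2):
                    if r not in s:
                        new.add(r)
        if frozenset() in new:
            return True                  # empty clause derived: refutation
        if not new:
            return False                 # saturated without the empty clause
        s |= new

# {p, ~p v q, ~q} is unsatisfiable; {p, ~p v q} is satisfiable.
assert saturate({frozenset({'p'}), frozenset({('~', 'p'), 'q'}),
                 frozenset({('~', 'q')})})
assert not saturate({frozenset({'p'}), frozenset({('~', 'p'), 'q'})})
```

Termination is guaranteed here because only finitely many clauses exist over a finite literal set; first-order saturation needs the ordering restrictions of Sup instead.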

# **3 Recent Developments in Saturation**

In this section we summarize recent results relevant to our work.

**Program Synthesis in Saturation.** Synthesizing (non-recursive) programs in saturation has been initiated in [8]. Here, *computable* and *uncomputable* symbols in the signature are distinguished. Intuitively, computable symbols are those which are allowed to appear in a synthesized program. An expression is *computable* if all symbols it contains are computable. A symbol or an expression is *uncomputable* if it is not computable.

Let A1,...,An be closed formulas. Then

$$A\_1 \land \dots \land A\_n \to \forall \overline{x} \exists y. F[\overline{x}, y] \tag{3}$$

is a *(synthesis) specification with inputs* x *and output* y.

Consider a computable term r[x̄] such that A1 ∧ ... ∧ An → ∀x̄.F[x̄, r[x̄]] holds. Such an r[x̄] is called a *program* for (3) and a *witness for* y *in* (3). If A1 ∧ ... ∧ An → ∀x̄.(F1 ∧ ... ∧ Fn → F[x̄, r[x̄]]) holds for computable formulas F1,...,Fn, then r[x̄] with the condition ⋀ᵢ₌₁ⁿ Fᵢ is a *program with conditions* F1,...,Fn for (3).

Saturation-based theorem proving was extended in [8] to a *saturation-based program synthesis framework*. To this end, the clausified negated specification (3) is extended by an *answer literal* ans:

$$A\_1 \land \dots \land A\_n \land \forall y. (\mathsf{cnf}(\neg F[\overline{\sigma}, y]) \lor \mathsf{ans}(y))\tag{4}$$

The set of clauses (4) is then saturated. During saturation, upon deriving a clause C[σ̄] ∨ ans(r[σ̄]), where r[σ̄] is computable and C[σ̄] is computable and does not contain ans, the program r[x̄] with condition ¬C[x̄] for (3) is recorded and the clause is replaced by C[σ̄]. This step is called *answer literal removal* within saturation. Once saturation terminates by deriving the empty clause □, the final program for (3) is constructed by composing the relevant recorded programs with conditions in a nested if−then−else. To support the derivation of such clauses C[σ̄] ∨ ans(r[σ̄]) and to ensure that answer literals only have computable arguments, the work of [8] extended the superposition calculus Sup with new inference rules.

**Induction in Saturation.** Inductive reasoning has been integrated in saturation [5–7,9,25]. The main idea in this body of work is to apply induction by *theory lemma generation*: based on already derived formulas, generate a suitable induction axiom and add it to the search space. To this end, the following induction rule is used:

$$\frac{L[t] \lor C}{F \to \forall x. L[x]} \text{ (Ind)},$$

where L[t] is a ground literal, C is a clause, and F → ∀x.L[x] is a valid induction axiom. The conclusion of the Ind rule is clausified, yielding cnf(¬F) ∨ L[x]. This clause is resolved with the premise L[t] ∨ C immediately after applying the Ind rule and the resulting clause cnf(¬F) ∨ C is added to the search space.

An example of a valid induction schema is the *structural induction axiom for natural numbers*, where G[x] is any closed formula:

$$\left(G[\mathsf{0}] \land \forall y. (G[y] \to G[\mathsf{s}(y)])\right) \to \forall x. G[x] \tag{5}$$

When we instantiate the schema with G[x] := L[x], we obtain an axiom that can be used in Ind. Since the rule requires L[t] to be ground, this instance of Ind cannot be applied on (2) and thus is not sufficient for proving (SD) of Fig. 1. To prove formulas with a free variable by induction, we extend Ind in Sect. 5.

Note that we can also use a complex formula G[t] in place of the literal L[t] in Ind, obtaining a more involved rule, possibly with multiple premises, similarly to a *multi-clause induction rule* [7] or *induction with arbitrary formulas* [6].

### **4 Saturation with Induction in Constructive Logic**

We first summarize the key challenges our work resolves towards recursive synthesis in saturation, and then present our synthesis approach in Sects. 5–8.

The idea of extracting programs from proofs originates from results in constructive (intuitionistic) logic, starting with Kleene's realizability [14]. In constructive logic, provability of a formula ∀x∃y.F[x, y] implies that there is an algorithm which, given values for x, outputs a value for y satisfying F[x, y].

We note that the structural induction axiom (5) over natural numbers has computational content, as follows. The program r for ∀x.G[x] can be built from a program r0 for G[0] and a program rs for ∀y.(G[y] → G[s(y)]) as:

$$\begin{aligned} r(\mathbf{0}) &\simeq r\_{\mathbf{0}}\\ r(\mathbf{s}(y)) &\simeq r\_{\mathbf{s}}(r(y)) \end{aligned}$$

For this to be useful, we would need to first prove G[0], then prove ∀y.(G[y] → G[s(y)]), and then use the induction axiom to derive ∀x.G[x]. Such an approach towards constructing programs does not, however, work in saturation-based theorem proving, as saturation does not reduce goals to subgoals [2]. Rather, we add the induction axiom as a theory lemma to the proof search and continue saturation, so we do not have proofs of either G[0] or ∀y.(G[y] → G[s(y)]). Constructing programs during saturation becomes even more complex when using answer literals, because clauses generated during saturation may contain these literals. For example, if we try to extract a proof of G[0], we may find a proof with an answer literal in it.
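The computational content of the recursion r(0) ≃ r0, r(s(y)) ≃ rs(r(y)) shown above can be sketched directly (our own encoding; the instantiation for double, with r0 = 0 and rs = λp. s(s(p)), is our illustration of how (1) arises from base and step programs):

```python
# Peano numerals as nested tuples: zero = ('0',), s(n) = ('s', n).
ZERO = ('0',)
def s(n): return ('s', n)

def make_r(r0, rs):
    """Build the program r for all x from a base program r0 and step rs."""
    def r(n):
        if n == ZERO:
            return r0                    # r(0) ≃ r0
        return rs(r(n[1]))               # r(s(y)) ≃ rs(r(y))
    return r

# Instantiation for double: base 0, step adds two successors.
double = make_r(ZERO, lambda p: s(s(p)))

def num(k):
    return ZERO if k == 0 else s(num(k - 1))

assert double(num(3)) == num(6)
```
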

To capture the constructive nature of induction and address the above challenges of program synthesis in saturation, we use the following trick. We modify the induction axiom so that it indirectly stores information about the programs for G[0] and ∀y.(G[y] → G[s(y)]). To do this, instead of adding the induction axiom (5), in Sect. 5 we add what we call a *magic axiom for* (5), where G has an additional argument for storing the program. In Sect. 6 we further convert our magic axioms into formulas to be used to derive recursive programs in saturation.

# **5 Induction with Magic Formulas**

We first present our approach to *proving* formulas with a free variable by induction. We further extend this approach to *synthesis* in Sect. 6. While our approach works the same way with arbitrary term algebras, for the sake of clarity we first introduce our work for natural numbers and then for general term algebras in Sect. 8.

We use the following *magic axiom*:

$$\left(\exists u\_0.G[\mathbf{0}, u\_0] \land \forall y. \left(\exists w.G[y, w] \to \exists u\_\mathbf{s}.G[\mathbf{s}(y), u\_\mathbf{s}]\right)\right) \to \forall z.\exists x.G[z, x] \tag{6}$$

Note that all magic axioms are valid, as they are instances of the structural induction axiom (5) with the quantified formula ∃x.G[t, x] in place of G[t]. The magic of (6) stems from its simple yet powerful expressiveness: when used in proof search, the variables u0, us in the antecedent capture the programs for the base and step cases, allowing us to construct a program for x in the consequent.

Using axiom (6), we introduce the following variant of the Ind rule:

$$\frac{L[t,x] \lor C}{\left(\exists u\_{\mathsf{0}}.L[\mathsf{0},u\_{\mathsf{0}}] \land \forall y.\Big(\exists w.L[y,w] \to \exists u\_{\mathsf{s}}.L[\mathsf{s}(y),u\_{\mathsf{s}}]\Big)\right) \to \forall z.\exists x.L[z,x]}\text{ (MagInd)}$$

where the only free variable of L[t, x] is x and C does not contain x.

*Example 1.* Consider the specification (SD) from Fig. 1. To prove it using superposition, without yet synthesizing the function satisfying (SD), we use the following magic axiom:

$$\left(\exists u\_{\mathsf{0}}.\mathsf{half}(u\_{\mathsf{0}})\simeq\mathsf{0}\land\forall y.\Big(\exists w.\mathsf{half}(w)\simeq y\to\exists u\_{\mathsf{s}}.\mathsf{half}(u\_{\mathsf{s}})\simeq\mathsf{s}(y)\Big)\right)\to\forall z.\exists x.\mathsf{half}(x)\simeq z\tag{7}$$

To use (7) in saturation, we clausify it and skolemize the variables y, w, x as σy, σw, σx(z), respectively. The following is a refutational proof of (SD):

1. half(y) ≄ σ [negated and skolemized specification (SD)]
2. half(u0) ≄ 0 ∨ half(σw) ≃ σy ∨ half(σx(z)) ≃ z [MagInd with (7)]
3. half(u0) ≄ 0 ∨ half(us) ≄ s(σy) ∨ half(σx(z)) ≃ z [MagInd with (7)]
4. half(u0) ≄ 0 ∨ half(σw) ≃ σy [BR 1, 2]
5. half(u0) ≄ 0 ∨ half(us) ≄ s(σy) [BR 1, 3]
6. half(u0) ≄ 0 ∨ half(us) ≄ s(half(σw)) [Sup 4, 5]
7. half(u0) ≄ 0 ∨ half(us) ≄ half(s(s(σw))) [Sup (H3), 6]
8. half(u0) ≄ 0 [ER 7]
9. □ [BR 8, (H2)]

Hence, the magic axiom (6) is sufficient to prove (SD). However, (6) does not suffice to synthesize the program for (SD) from the above proof. Similarly to [8], for synthesis we would use

$$\mathsf{half}(y) \not\simeq \sigma \lor \mathsf{ans}(y) \tag{8}$$

instead of clause 1 and obtain a derivation similar to the one above, but with the answer literal ans(σx(σ)). As σx is a fresh skolem function, it is uncomputable and not allowed in answer literals. Therefore, simply following the approach of [8] fails to synthesize a recursive program from the proof of (SD). We address the challenge of program construction for the skolem function σx in Sect. 6.

# **6 Programs with Primitive Recursion**

We now construct recursive programs for proofs using induction over natural numbers (6). As mentioned in Sect. 4, the antecedent of the induction axiom gives us a recipe for constructing the program for the consequent. To capture this dependence of the consequent program x on the antecedent programs u0, us, we convert the magic axiom (6) into its equivalent prenex normal form, where ∀u0, us precedes ∃x:

$$\exists y, w. \forall u\_0, u\_\mathbf{s}, z. \exists x. \left( \left( G[\mathbf{0}, u\_\mathbf{0}] \land \left( G[y, w] \to G[\mathbf{s}(y), u\_\mathbf{s}] \right) \right) \to G[z, x] \right) \tag{9}$$

The prenex form (9) of the magic axiom (6) allows us to record the dependency on the programs resulting from the base and step cases of induction. For that, we introduce a recursive operator to be used for constructing programs.

**Definition 1 (Primitive Recursion Operator).** Let f1 : α and f2 : N × α → α. The *primitive recursion operator* R *for natural numbers and* α is:

$$\begin{aligned} \mathsf{R}(f\_1, f\_2)(0) &\simeq f\_1 \\ \mathsf{R}(f\_1, f\_2)(\mathsf{s}(y)) &\simeq f\_2(y, \mathsf{R}(f\_1, f\_2)(y)) \end{aligned}$$
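A minimal executable rendering of Definition 1, under the assumption that naturals are modelled as Python ints (the names `R` and `fact` are ours):

```python
# Primitive recursion operator R of Definition 1 (sketch):
# R(f1, f2)(0) = f1 and R(f1, f2)(s(y)) = f2(y, R(f1, f2)(y)).

def R(f1, f2):
    def rec(n):
        return f1 if n == 0 else f2(n - 1, rec(n - 1))
    return rec

# Factorial as an instance: f1 = 1 and f2(y, acc) = (y + 1) * acc.
fact = R(1, lambda y, acc: (y + 1) * acc)
```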

**Lemma 2 (Recursive Witness).** The expression R(u0, λy, w.us)(z) is a witness for the variable x in (9).

Lemma 2 ensures that we can construct a program for the consequent of the magic axiom given programs for the base case and the step case. We next integrate this construction into our synthesis framework using answer literals. For that, we take a close look at the skolemization of the induction axiom (9) and define skolem symbols, denoted by rec, for the variable x, capturing the recursive program.

**Definition 3 (**rec**-Symbols).** Consider formulas G[t, x] with a single free variable x : α, containing a term t : N. For each such formula we introduce a distinct computable function symbol recG[t,x] : α × α × N → α. We refer to such symbols recG[t,x] as rec*-symbols*. When the formula G[t, x] is clear from the context or unimportant, we simply write rec instead of recG[t,x].

A term with a rec-symbol as the top-level functor is called a rec*-term*.

**Definition 4 (Magic Formula).** *The magic formula for* G[t, x] is:

$$\begin{aligned} \forall u\_0, u\_\mathsf{s}, z. \\ \left( \left( G[\mathsf{0}, u\_0] \land \left( G[\sigma\_y, \sigma\_w] \to G[\mathsf{s}(\sigma\_y), u\_\mathsf{s}] \right) \right) \to G[z, \mathsf{rec}\_{G[t, x]}(u\_0, u\_\mathsf{s}, z)] \right) \end{aligned} \tag{10}$$

It is easy to see that the magic formula (10) is obtained by skolemizing the prenex normal form (9) of the magic axiom, where we replace the variables y, w by fresh constants σy, σw, and the variable x by a fresh recG[t,x]-symbol. The constants σy, σw introduced in (10) are said to be *associated with the* recG[t,x]*-term*. An occurrence of a skolem constant σy or σw is considered computable if it occurs in the second argument of a recG[t,x]-term with which it is associated.

We introduce additional requirements for reasoning with rec-terms to ensure that they always represent the recursive function to be synthesized.

**Definition 5 (**rec**-Compliance).** An inference system I is rec*-compliant* if:


Using a rec-compliant inference system I, we derive clauses containing rec-terms. These terms correspond to functions constructed using the operator R.

**Definition 6 (Recursive Function Term).** Let <sup>σ</sup>y, σw be associated with rec(s1, s2, t). Then we call the term <sup>R</sup>(s1, λσy, σw.s2)(t) the *recursive function term corresponding to* rec(s1, s2, t).

For a term r, we denote by r<sup>R</sup> the expression obtained from r by iteratively replacing all rec-terms by their corresponding recursive function terms, starting from the innermost ones. Similarly, formula F <sup>R</sup> denotes the formula F in which we replace all rec-terms by their corresponding recursive function terms.

**Lemma 7 (Recursive Witness for Magic Formulas).** Consider the formula obtained from (10) by replacing recG[t,x](u0, us, z) by its corresponding recursive function term <sup>R</sup>(u0, λσy, σw.us)(z):

$$\begin{aligned} \forall u\_{\mathsf{0}}, u\_{\mathsf{s}}, z. \\ \left( \left( G[\mathsf{0}, u\_{\mathsf{0}}] \land \left( G[\sigma\_{y}, \sigma\_{w}] \to G[\mathsf{s}(\sigma\_{y}), u\_{\mathsf{s}}] \right) \right) \to G[z, \mathsf{R}(u\_{\mathsf{0}}, \lambda \sigma\_{y}, \sigma\_{w}. u\_{\mathsf{s}})(z)] \right) \end{aligned} \tag{11}$$

For every interpretation I, there exists a mapping of skolem constants to values {σy → vy, σw → vw} such that I extended by this mapping is a model of (11). As a consequence, formula (11) is satisfiable.

Lemma 7 implies that we can use formula (11) instead of (10) in derivations while preserving their soundness. Soundness of our approach to recursive program synthesis is stated next.

**Theorem 8 (Semantics of Clauses with Answer Literals and** rec**-terms).** Let C1,...,Cm be clauses and F a formula containing no answer literals and no rec-symbols. Let C be a clause containing no answer literals. Let M1,...,Ml be magic formulas. Assume that, using a sound rec-compliant inference system I, we derive C ∨ ans(r[σ]), where r[σ] is computable, from the set of clauses

$$\{ \ C\_1, \ldots, C\_m, \ M\_1, \ldots, M\_l, \text{ cnf}(\neg F[\overline{\sigma}, y] \lor \mathtt{ans}(y)) \} . $$

Then

$$M\_1^{\mathbb{R}}, \dots, M\_l^{\mathbb{R}}, C\_1, \dots, C\_m \vdash C^{\mathbb{R}}, F[\overline{\sigma}, r^{\mathbb{R}}[\overline{\sigma}]].$$

That is, under the assumptions M<sub>1</sub><sup>R</sup>,...,M<sub>l</sub><sup>R</sup>, C1,...,Cm, ¬C<sup>R</sup>, the computable expression r<sup>R</sup>[x] is a witness for y in ∀x∃y.F[x, y].

Based on Theorem 8, if the CNF of A1,...,An is among C1,...,Cm, then r<sup>R</sup>[x] is a witness for y in (3) under the assumptions M<sub>1</sub><sup>R</sup>,...,M<sub>l</sub><sup>R</sup>, C1,...,Cm, ¬C<sup>R</sup>. The following theorem ensures that we can construct recursive programs with conditions.

**Theorem 9 (Recursive Programs).** Let r[σ] be a computable term, and <sup>C</sup>[σ], C1[σ],...,Cm[σ] be ground computable clauses containing no answer literals and no rec-symbols. Assume that using a sound rec-compliant inference system I, we derive the clause C[σ] ∨ ans(r[σ]) from the CNF of

$$\{ \begin{array}{c} A\_1, \dots, A\_n, \ C\_1[\overline{\sigma}], \dots, C\_m[\overline{\sigma}], \ M\_1, \dots, M\_l, \ \neg F[\overline{\sigma}, y] \lor \mathsf{ans}(y) \end{array} \}$$

where <sup>M</sup>1,...,Ml are magic formulas. Then,

$$\left\langle r^{\mathbb{R}}[\overline{x}],\ \bigwedge\_{j=1}^{m} C\_{j}[\overline{x}] \wedge \neg C[\overline{x}] \right\rangle$$

is a program with conditions for (3).

From Theorem 9 we obtain the following key result on program synthesis.

**Theorem 10 (Recursive Program Synthesis).** Let P1[x],...,Pk[x], where P<sub>i</sub>[x] = ⟨r<sub>i</sub><sup>R</sup>[x], ⋀<sub>j=1</sub><sup>i−1</sup> C<sub>j</sub>[x] ∧ ¬C<sub>i</sub>[x]⟩, be programs with conditions for (3), such that ⋀<sub>i=1</sub><sup>n</sup> A<sub>i</sub> ∧ ⋀<sub>i=1</sub><sup>k</sup> C<sub>i</sub>[x] is unsatisfiable. Then the program P[x] defined as

$$\begin{array}{c} P[\overline{x}] := \text{if } \neg C\_1[\overline{x}] \text{ then } r\_1^{\mathbb{R}}[\overline{x}] \\ \qquad \text{else if } \neg C\_2[\overline{x}] \text{ then } r\_2^{\mathbb{R}}[\overline{x}] \\ \qquad \dots \\ \text{else if } \neg C\_{k-1}[\overline{x}] \text{ then } r\_{k-1}^{\mathbb{R}}[\overline{x}] \\ \text{else } r\_k^{\mathbb{R}}[\overline{x}], \end{array}$$

is a program for (3).
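Operationally, the if-then-else cascade of Theorem 10 can be sketched as follows; the pair representation of a program with conditions and the helper name `combine` are our illustrative assumptions, not the paper's notation:

```python
# Combine programs with conditions (r_i, C_i) into one program, per the
# cascade of Theorem 10: return r_i(x) for the first i whose condition
# C_i[x] is false; the last program serves as the final else branch.

def combine(programs):
    def P(x):
        for r, C in programs[:-1]:
            if not C(x):
                return r(x)
        return programs[-1][0](x)  # else branch: r_k
    return P

# Illustrative use: absolute value assembled from two conditional programs.
abs_prog = combine([(lambda x: x, lambda x: x < 0),    # valid unless x < 0
                    (lambda x: -x, lambda x: False)])  # fallback branch
```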

### **7 Recursive Synthesis in Saturation**

This section integrates the proving and synthesis steps of Sects. 5–6 into saturation. The crux of our approach is that instead of adding standard induction formulas to the search space, we add magic formulas.

Theorems 9–10 imply that to derive recursive programs, we can use any rec-compliant calculus, as long as the calculus supports derivation of clauses C ∨ ans(r), where r is computable and C is ground, computable, and contains neither rec-terms nor answer literals. In our work we rely on the extended Sup calculus of [8], which we (i) further extend by adding magic formulas alongside standard induction formulas, (ii) make rec-compliant by disallowing inferences containing uncomputable rec-terms, and (iii) extend by adding more complex rules for introducing conditions into rec-terms<sup>3</sup>. We illustrate these steps by our running example.

*Example 2.* Using the extended Sup calculus, we synthesize the program for the specification of Fig. 1. With the magic formula corresponding to (7),

$$\begin{aligned} \forall u\_0, u\_\mathsf{s}, z.\ \Big( \big( \mathsf{half}(u\_0) \simeq \mathsf{0} \land \big( \mathsf{half}(\sigma\_w) \simeq \sigma\_y \to \mathsf{half}(u\_\mathsf{s}) \simeq \mathsf{s}(\sigma\_y) \big) \big) \to \mathsf{half}(\mathsf{rec}(u\_0, u\_\mathsf{s}, z)) \simeq z \Big) \end{aligned} \tag{12}$$

we obtain the following derivation<sup>4</sup>:

1. half(y) ≄ σ ∨ ans(y) [negated, skolemized specification with answer literal]
2. half(u0) ≄ 0 ∨ half(σw) ≃ σy ∨ half(rec(u0, us, z)) ≃ z [MagInd with (12)]
3. half(u0) ≄ 0 ∨ half(us) ≄ s(σy) ∨ half(rec(u0, us, z)) ≃ z [MagInd with (12)]
4. half(u0) ≄ 0 ∨ half(σw) ≃ σy ∨ ans(rec(u0, us, σ)) [BR 1, 2]
5. half(u0) ≄ 0 ∨ half(us) ≄ s(σy) ∨ ans(rec(u0, us, σ)) [BR 1, 3]
6. half(u0) ≄ 0 ∨ half(us) ≄ s(half(σw)) ∨ ans(rec(u0, us, σ)) [Sup 4, 5]
7. half(u0) ≄ 0 ∨ half(us) ≄ half(s(s(σw))) ∨ ans(rec(u0, us, σ)) [Sup (H3), 6]
8. half(u0) ≄ 0 ∨ ans(rec(u0, s(s(σw)), σ)) [ER 7]
9. ans(rec(s(0), s(s(σw)), σ)) [BR 8, (H2)]
10. □ [answer literal removal 9]

The program recorded in step 10 of the proof is rec(s(0), s(s(σw)), x)<sup>R</sup> = R(s(0), λσy, σw.s(s(σw)))(x) = f(x), where f is defined as:

$$\begin{aligned} f(\mathbf{0}) &\simeq \mathbf{s}(\mathbf{0})\\ f(\mathbf{s}(n)) &\simeq \mathbf{s}(\mathbf{s}(f(n))) \end{aligned}$$

Note that while the synthesized program satisfies the specification (SD), it does not match the expected definition of the double function from (1). Since the half function rounds down and the specification does not require the synthesized function to produce even results, the base case was resolved in step 9 with (H2), leading to f(0) ≃ s(0). As a result, we have f(n) = s(double(n)) for any n.
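This behaviour is easy to check concretely; the following Python sketch (our own, with s^n(0) modelled as the int n) validates both the specification half(f(n)) ≃ n and the relation f(n) = s(double(n)):

```python
# The function f synthesized in Example 2, with s^n(0) modelled as int n.

def f(n):
    return 1 if n == 0 else f(n - 1) + 2   # f(0) = s(0), f(s(n)) = s(s(f(n)))

def half(n):                               # half rounds down
    return n // 2

def double(n):                             # the intended function from (1)
    return 2 * n

# f satisfies (SD) even though it differs from double: f(n) = s(double(n)).
assert all(half(f(n)) == n for n in range(100))
assert all(f(n) == double(n) + 1 for n in range(100))
```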

Example 2 demonstrates that specification (SD) has multiple solutions and saturation can find a solution different from the intended one. In the next example we modify the specification to have a single solution and synthesize it.

<sup>3</sup> The rules can be found in the extended version of this paper [10].

<sup>4</sup> For the fully detailed derivation, see [10].

*Example 3.* To synthesize the double function, we modify the specification:

$$\text{additional axioms: } \mathsf{even}(\mathsf{0})\tag{E1}$$

$$\neg\mathsf{even}(\mathsf{s}(\mathsf{0}))\tag{E2}$$

$$\forall x. \ (\mathsf{even}(\mathsf{s}(\mathsf{s}(x))) \leftrightarrow \mathsf{even}(x))\tag{E3}$$

$$\text{new specification: } \forall x \exists y.\ (\mathsf{half}(y) \simeq x \land \mathsf{even}(y)) \tag{SD'}$$

After negating and skolemizing (SD') and adding the answer literal, we obtain:

$$\mathsf{half}(y) \not\simeq \sigma \lor \neg \mathsf{even}(y) \lor \mathsf{ans}(y) \tag{13}$$

In this case we use the magic axiom for the conjunction G[t, x] := half(x) ≃ t ∧ even(x):

$$\begin{aligned} \Big( \exists u\_{\mathsf{0}}. (\mathsf{half}(u\_{\mathsf{0}}) \simeq \mathsf{0} \land \mathsf{even}(u\_{\mathsf{0}})) \land \\ \forall y. \big(\exists w. (\mathsf{half}(w) \simeq y \land \mathsf{even}(w)) \to \exists u\_{\mathsf{s}}. (\mathsf{half}(u\_{\mathsf{s}}) \simeq \mathsf{s}(y) \land \mathsf{even}(u\_{\mathsf{s}}))\big) \Big) \\ \to \forall z. \exists x. (\mathsf{half}(x) \simeq z \land \mathsf{even}(x)) \end{aligned} \tag{14}$$

We clausify the magic formula corresponding to (14), and further resolve it with the premise (13) to obtain:

$$\begin{aligned} \mathsf{half}(u\_0) \not\simeq \mathsf{0} \lor \neg \mathsf{even}(u\_0) \lor \mathsf{half}(\sigma\_w) &\simeq \sigma\_y \lor \mathsf{ans}(\mathsf{rec}(u\_0, u\_\mathsf{s}, \sigma)) \\ \mathsf{half}(u\_0) \not\simeq \mathsf{0} \lor \neg \mathsf{even}(u\_0) \lor \mathsf{even}(\sigma\_w) &\lor \mathsf{ans}(\mathsf{rec}(u\_0, u\_\mathsf{s}, \sigma)) \\ \mathsf{half}(u\_0) \not\simeq \mathsf{0} \lor \neg \mathsf{even}(u\_0) \lor \mathsf{half}(u\_\mathsf{s}) &\not\simeq \mathsf{s}(\sigma\_y) \lor \neg \mathsf{even}(u\_\mathsf{s}) \lor \mathsf{ans}(\mathsf{rec}(u\_0, u\_\mathsf{s}, \sigma)) \end{aligned}$$

The refutation of these clauses follows a course similar to the proof in Example 2. However, u0 occurring in the literal ¬even(u0) forces the proof to use (H1) instead of (H2), and thus the final derived answer literal is rec(0, s(s(σw)), σ), corresponding exactly to the function definition of double from (1). Note that deriving this program requires a saturation prover to apply induction on conjunctions of literals.
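Under Definition 6, the answer literal rec(0, s(s(σw)), σ) unfolds to a function with base case 0 and step s(s(·)); a quick Python check (our sketch, ints modelling numerals) confirms that it meets the strengthened specification (SD'):

```python
# The program extracted in Example 3, with s^n(0) modelled as int n.

def double(n):
    return 0 if n == 0 else double(n - 1) + 2   # double(0) = 0, step adds s(s(.))

def half(n):
    return n // 2

def even(n):
    return n % 2 == 0

# double satisfies (SD'): half(double(n)) = n and double(n) is even.
assert all(half(double(n)) == n and even(double(n)) for n in range(100))
```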

### **8 Generalization to Arbitrary Term Algebras**

Our approach from Sects. 5–7 generalizes naturally to arbitrary term algebras. This section summarizes the key parts of this generalization.<sup>5</sup>

Let τ be a (possibly polymorphic) term algebra with constructors Σ<sub>τ</sub> = {c1,...,cn}, where we denote the sort of each c<sub>i</sub> by τ<sub>i,1</sub> × ··· × τ<sub>i,n<sub>ci</sub></sub> → τ,

<sup>5</sup> We state all definitions, lemmas and theorems in the appendix of the extended version of our paper [10].

and P<sub>ci</sub> = {j1,...,j<sub>|Pci|</sub>} denotes the positions of the arguments of c<sub>i</sub> of sort τ, for each i = 1,...,n. Let α be any sort. The *magic axiom for* G[t, x], where t : τ, x : α, is:

$$\left(\bigwedge\_{c \in \Sigma\_{\tau}} \forall\_{i=1}^{n\_c} y\_{c,i}. \left(\left(\bigwedge\_{j \in P\_c} \exists w\_{c,j}. G[y\_{c,j}, w\_{c,j}]\right) \to \exists u\_c. G[c(\overline{y\_c}), u\_c]\right)\right) \to \forall z. \exists x. G[z,x] \tag{15}$$

The corresponding *magic formula* uses the skolem function recG[t,x] : α<sup>n</sup> × τ → α:

$$\forall\_{c \in \Sigma\_{\tau}} u\_c. \forall z. \left( \bigwedge\_{c \in \Sigma\_{\tau}} \left( \bigwedge\_{j \in P\_c} G[\sigma\_{y\_{c,j}}, \sigma\_{w\_{c,j}}] \to G[c(\overline{\sigma\_{y\_c}}), u\_c] \right) \to G[z, \text{rec}\_{G[t,x]}(\overline{u}, z)] \right) \tag{16}$$

Note that each σ<sub>y<sub>ci,j</sub></sub>, σ<sub>w<sub>ci,j</sub></sub> introduced in (16) is considered computable only in the i-th argument of its associated rec-term. We define the *recursion operator* R *for* τ *and* α analogously to Definition 1:

$$\begin{aligned} \mathsf{R}(f\_1, \ldots, f\_n)(c\_1(\overline{x})) &\simeq f\_1(x\_1, \ldots, x\_{n\_{c\_1}}, \mathsf{R}(f\_1, \ldots, f\_n)(x\_{j\_1}), \ldots, \mathsf{R}(f\_1, \ldots, f\_n)(x\_{j\_{|P\_{c\_1}|}})) \\ &\;\;\vdots \\ \mathsf{R}(f\_1, \ldots, f\_n)(c\_n(\overline{x})) &\simeq f\_n(x\_1, \ldots, x\_{n\_{c\_n}}, \mathsf{R}(f\_1, \ldots, f\_n)(x\_{j\_1}), \ldots, \mathsf{R}(f\_1, \ldots, f\_n)(x\_{j\_{|P\_{c\_n}|}})) \end{aligned}$$

where for each i we have f<sub>i</sub> : τ<sub>i,1</sub> × ··· × τ<sub>i,n<sub>ci</sub></sub> × α<sup>|Pci|</sup> → α. Using R, we state an analogue of Lemma 7:

**Lemma 11 (Recursive Witness for Magic Formulas Using** τ **).** Consider the formula obtained from (16) by replacing recG[t,x](u, z) by its corresponding recursive function term:

$$\begin{aligned} &\forall\_{c \in \Sigma\_{\tau}} u\_c. \forall z. \bigg( \bigwedge\_{c \in \Sigma\_{\tau}} \Big( \bigwedge\_{j \in P\_c} G[\sigma\_{y\_{c,j}}, \sigma\_{w\_{c,j}}] \to G[c(\overline{\sigma\_{y\_c}}), u\_c] \Big) \\ &\quad \to G\Big[z, \mathsf{R}\big(\lambda\_{i=1}^{n\_{c\_1}} \sigma\_{y\_{c\_1,i}}, \lambda\_{k \in P\_{c\_1}} \sigma\_{w\_{c\_1,k}}.\, u\_{c\_1}, \ldots, \lambda\_{i=1}^{n\_{c\_n}} \sigma\_{y\_{c\_n,i}}, \lambda\_{k \in P\_{c\_n}} \sigma\_{w\_{c\_n,k}}.\, u\_{c\_n}\big)(z)\Big] \bigg) \end{aligned} \tag{17}$$

For every interpretation, there exists its extension by some {σ<sub>y<sub>c,i</sub></sub> → v<sub>y,c,i</sub>, σ<sub>w<sub>c,k</sub></sub> → v<sub>w,c,k</sub>}<sub>c∈Σ<sub>τ</sub>, i∈{1,...,n<sub>c</sub>}, k∈P<sub>c</sub></sub> such that the extension is a model of (17). As a consequence, formula (17) is satisfiable.

Using Lemma 11, we derive the analogues of Theorems 8–10 for an arbitrary term algebra τ . We then employ magic formulas (16) in MagInd when in the premise L[t, x] ∨ C ∨ ans(r[x]) we have t : τ . We finally note that our synthesis method generalizes also to sorts other than term algebras, as long as the induction axiom used for the sort carries the constructive meaning described in Sect. 4.
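To make the generalized operator concrete, here is a Python sketch of R instantiated for the term algebra of lists with constructors nil and cons(a, t); modelling lists as Python tuples and the names `R_list` and `length` are our assumptions:

```python
# Generalized recursion operator R, instantiated for lists (sketch).
# Constructors: nil (no arguments, no recursive positions) and
# cons(a, t) (one recursive position: the tail t).
# A list cons(a, t) is modelled as the tuple (a,) + t; nil as ().

def R_list(f_nil, f_cons):
    def rec(xs):
        if xs == ():                  # nil case: the constant f_nil
            return f_nil
        head, tail = xs[0], xs[1:]    # cons case: constructor arguments...
        return f_cons(head, tail, rec(tail))  # ...plus the recursive call
    return rec

# Length of a list as an instance of R_list.
length = R_list(0, lambda a, t, acc: acc + 1)
```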

# **9 Implementation and Examples**

**Implementation.** We extended the first-order theorem prover Vampire [16] with a proof-of-concept implementation of our method for recursive program synthesis in saturation. Our implementation consists of approximately 1,100 lines of C++ code and is available online at https://github.com/vprover/vampire/ tree/synthesis-recursive.

We implemented the MagInd rule, as well as a version of MagInd using a magic axiom with base case s(0) for natural numbers and cons(a, nil), for arbitrary a, for lists. To support synthesis requiring induction on specifications ¬F[t, x], where F[t, x] is an arbitrary formula with the only free variable x, we use the following encoding: we change the specification ∀x∃y.F[x, y] to ∀x∃y.p(x, y), where p is a fresh uncomputable predicate, and add the axiom ∀x, y.(p(x, y) ↔ F[x, y]).

**Table 1.** Synthesis examples using natural numbers N, lists L and binary trees BT. The x-variables in the program and synthesized definitions are the inputs. While our framework synthesizes all these examples, our implementation in Vampire only synthesizes those marked with "✓". Note that for "Length of 2 concatenated lists" we consider ++ to be uncomputable.


**Examples.** Our implementation can synthesize the programs for the specifications (SD) and (SD'). We also synthesize further examples over the term algebras<sup>6</sup> of natural numbers N, lists L, and binary trees BT. We display the specifications alongside the programs synthesized by our framework in Table 1. Our framework synthesizes programs for each of the examples<sup>7</sup>, yet our implementation so far supports only a limited set of magic formulas; the "Vampire" column of Table 1 therefore lists which examples are solved in practice.

**Experimental Comparison.** To the best of our knowledge, no other approach to program synthesis supports the setting we consider: functional relational specifications of recursive programs, given in full first-order logic, without user-defined templates. For this reason, we could not compare the practical aspects of our work with other techniques, but we overview related works in Sect. 10. In particular, we note that the tools surveyed in Sect. 10 support a more restrictive or decidable logic than the full first-order setting exploited in our approach. As such, the benchmarks of Table 1 cannot be translated into the input languages of the techniques surveyed in Sect. 10.

# **10 Related Work**

Our approach is conceptually different from existing methods in recursive program synthesis, as we are not restricted to decidable logical fragments, nor to user-defined program templates. Our work supports program specifications in full first-order logic (with theories) and does not require syntactic templates for the programs to be synthesized. In the sequel, we only discuss related approaches that support *full automation* in program synthesis, without templates or user guidance.

We extend the recursion-free synthesis framework of [8], while exploiting ideas from deductive synthesis [17,20,29] using answer literals [4]. We bring recursive program synthesis into the landscape of saturation-based proving and construct programs from saturation proofs with magic axioms. Unlike our setting, the works of [19,29] construct recursive programs from proofs by induction, by reducing the program specification to subgoals corresponding to the cases of the induction axiom. Modern first-order theorem provers mostly implement saturation-based proof search, which, however, does not support a goal-subgoal architecture. Our approach integrates induction directly into saturation and enables automated reasoning with term algebras.

<sup>6</sup> See the extended version of this paper [10] for term algebra constructors and signatures, and for axiomatization and lemmas for the used predicates and functions.

<sup>7</sup> We provide the full derivations of the synthesized programs in [10].

Fully automated methods supporting recursive program synthesis include Synquid [23], Leon [15], Jennisys [18], SuSLik [24], Cypress [12], Burst [21], and Syntrec [11]. Except for Burst and Syntrec, all these works decompose goals into subgoals. Our work complements these methods, by turning saturation into a recursive synthesis framework over first-order theories. As such, our work also differs from Synquid, where term enumeration combined with type checking is used over program specifications within decidable logics. Leon uses recursive schemas corresponding to our recursive operator R, instantiates them by candidate program terms, and checks if they satisfy the specification. Unlike Leon, we support a complete handling of quantifiers via superposition reasoning. Jennisys uses a verifier to generate input-output examples, which differs from our setting of using inductive formulas as logical specifications. Burst generates programs by composition from existing ones, using quantifier-free fragments of first-order logic. In contrast, we support full first-order logic and induction, without using subgoal proof strategies. Finally, we note that Syntrec guarantees bounded/relative correctness of the synthesized programs (using syntactic program templates), while our approach proves correctness of the synthesized program without further restrictions.

The syntax-guided synthesis (SyGuS) framework [1] supports specifications for recursive functions and can encode our examples from Sect. 9. However, to the best of our knowledge, SyGuS methods, including the SMT-based approach of [26], do not support recursive synthesis. While the semantics-guided synthesis framework [13] also supports recursive functions, its (to the best of our knowledge) only solvers Messy [13] and Messy-Enum [3] synthesize programs from input-output examples and using grammars, respectively, rather than purely from logical specifications.

### **11 Conclusions**

We extend the saturation-based framework to recursive program synthesis by exploiting the constructive nature of induction axioms. We introduce magic axioms as a tracking mechanism and seamlessly integrate these axioms into saturation. We then construct correct recursive programs using answer literals in saturation, as also demonstrated by our proof-of-concept implementation. Extending our work with tailored handling of (more general) magic axioms, and respective superposition inferences, is an interesting line of future work. Devising and implementing further, and potentially more general, synthesis rules and induction schemes is another task for future research, allowing us to further strengthen the practical use of our work.

**Acknowledgements.** We acknowledge funding from the ERC Consolidator Grant ARTIST 101002685, the TU Wien SecInt Doctoral College, the FWF SFB project SpyCoDe F8504, the WWTF ICT22-007 grant ForSmart, and the Amazon Research Award 2023 QuAT.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Synthesizing Strongly Equivalent Logic Programs: Beth Definability for Answer Set Programs via Craig Interpolation in First-Order Logic

Jan Heuer and Christoph Wernhard(B)

University of Potsdam, Potsdam, Germany {jan.heuer,christoph.wernhard}@uni-potsdam.de

Abstract. We show a projective Beth definability theorem for logic programs under the stable model semantics: For given programs P and Q and vocabulary V (set of predicates) the existence of a program R in V such that P ∪ R and P ∪ Q are strongly equivalent can be expressed as a first-order entailment. Moreover, our result is effective: A program R can be constructed from a Craig interpolant for this entailment, using a known first-order encoding for testing strong equivalence, which we apply in reverse to extract programs from formulas. As a further perspective, this allows transforming logic programs via transforming their first-order encodings. In a prototypical implementation, the Craig interpolation is performed by first-order provers based on clausal tableaux or resolution calculi. Our work shows how definability and interpolation, which underlie modern logic-based approaches to advanced tasks in knowledge representation, transfer to answer set programming.

# 1 Introduction

Answer set programming [3,35,50,57,60] is one of the major paradigms in knowledge representation. A problem is expressed declaratively as a logic program, a set of rules in the form of implications. An answer set solver returns representations of its *answer sets* or *stable models* [36,49], that is, minimal Herbrand models, excluding models containing facts that are not justified in a non-circular way. Modern answer set solvers such as *clingo* [34] are advanced tools that integrate SAT technology.

Two logic programs can be considered as *equivalent* if and only if they have the same answer sets. However, if two equivalent programs are each combined with some other program, the results are not necessarily equivalent. Thus, it is of much more practical relevance to consider instead a notion of equivalence that guarantees the same answer sets even in combination with other programs: Two logic programs P, Q are *strongly equivalent* [54] if they can be exchanged in the

c The Author(s) 2024 C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 172–193, 2024. https://doi.org/10.1007/978-3-031-63498-7\_11

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 457292495.

context of any other program R without affecting the answer sets of the overall program. That is, P and Q are strongly equivalent if and only if for all logic programs R it holds that P ∪ R and Q ∪ R have the same answer sets.

Although it has been known that strong equivalence of logic programs under the stable model semantics can be translated into equivalence of classical logical formulas, e.g. [55], developments in the languages of answer set programming make this an issue of ongoing research [18,25,38,39,42,53]. The practical objective is to apply first-order provers to determine the equivalence of two logic programs.

We now consider the situation where only a single program is given and a strongly equivalent one is to be synthesized automatically. For the new program, the set of allowed predicates is restricted to a given vocabulary, which to some degree also constrains the positions within rules in which they may occur. Not just "absolute" strong equivalence is of interest, but also strong equivalence with respect to some background knowledge expressed as a logic program. Thus, for given programs P and Q, and vocabulary V we want to find programs R in V such that P ∪ R and P ∪ Q are strongly equivalent.

Our question has two aspects: characterizing the *existence* of a program R for given P, Q, V and, if one exists, the effective *construction* of such an R. As we will show, existence can be addressed by Beth definability [7,21] on the basis of Craig interpolation [20] for first-order logic. The construction can then be performed by extracting an interpolant from a proof of the first-order characterization of existence. We realize this practically with the first-order provers CMProver [74, 75] and Prover9 [58] and an interpolation technique for clausal tableaux [76,77].

To achieve this, we start from a known representation of logic programs in classical first-order logic for the purpose of verifying strong equivalence. We supplement it with a formal characterization to determine whether an arbitrary given first-order formula represents a logic program and, if so, to extract a represented logic program from the formula. This novel "reverse translation" also has other potential applications in program transformation.

Beth definability and Craig interpolation play a key role for advanced tasks in other fields of knowledge representation, in particular for query reformulation in databases [5,6,59,66,72] and description logics as well as ontology-mediated querying [2,16,17,73]. Our work aims to provide for these lines of research the bridge to answer set programming.

*Structure of the Paper.* After providing in Sect. 2 background material on strong equivalence as well as on interpolation and definability, we develop in Sect. 3 our technical results. Our prototypical implementation<sup>1</sup> is then described in Sect. 4. We conclude in Sect. 5 with a discussion of related work and perspectives.

<sup>1</sup> Available from http://cs.christophwernhard.com/pie/asp as free software.

# 2 Background

# 2.1 Notation

We map between two formalisms: logic programs and formulas of classical first-order logic without equality (briefly *formulas*). In both formalisms we have *atoms* p(t<sub>1</sub>,...,t<sub>n</sub>), where p is a *predicate* and t<sub>1</sub>,...,t<sub>n</sub> are terms built from *functions*, including *constants*, and *variables*. We assume variable names to be case insensitive, to account for the custom of writing them uppercase in logic programs and lowercase in first-order logic. Predicates in logic programs are distinct from those in formulas, but with a correspondence: If p is a predicate for use in logic programs, then the two predicates p<sup>0</sup> and p<sup>1</sup>, both with the same arity as p, are for use in formulas. Thus, predicates in formulas are always decorated with a 0 or 1 superscript. To emphasize this, we sometimes speak of 0/1-*superscripted formulas*.

A *literal* is an atom or a negated atom. A *clause* is a disjunction of literals, a *clausal formula* is a conjunction of clauses. The empty disjunction is represented by ⊥, the empty conjunction by ⊤. On occasion we write a clause also as an implication.

A subformula occurrence in a formula has *positive (negative) polarity* if it is in the scope of an even (odd) number of possibly implicit negations. A formula is *universal* if occurrences of ∀ have only positive polarity and occurrences of ∃ have only negative polarity. Semantic entailment and equivalence of formulas are expressed by |= and ≡.

Let F be a formula. *Fun*(F) is the set of functions occurring in it, including constants, and *Pred*(F) is the set of predicates occurring in it. *Pred*<sup>±</sup>(F) is the set of pairs ⟨*pol*, p⟩, where p is a predicate and *pol* ∈ {+, −}, such that an atom with predicate p occurs in F with the polarity indicated by *pol*. We write ⟨+, p⟩ and ⟨−, p⟩ succinctly as +p and −p. To map from the predicates occurring in a formula to predicates of logic programs we define *Pred*<sub>LP</sub>(F) ≝ {p | p<sup>i</sup> ∈ *Pred*(F), i ∈ {0, 1}}. For logic programs P, we define *Fun*(P) and *Pred*(P) analogously as for formulas.

### 2.2 Strong Equivalence as First-Order Equivalence

We consider *disjunctive logic programs with negation in the head* [48, Sect. 5], which provide a normal form for answer set programs [12]. A *logic program* is a set of *rules* of the form

$$A_1; \dots; A_k; \mathbf{not}\ A_{k+1}; \dots; \mathbf{not}\ A_l \;\leftarrow\; A_{l+1}, \dots, A_m, \mathbf{not}\ A_{m+1}, \dots, \mathbf{not}\ A_n.,$$

where A<sub>1</sub>,...,A<sub>n</sub> are atoms and 0 ≤ k ≤ l ≤ m ≤ n. Recall from Sect. 2.1 that an atom can have argument terms with functions and variables. The *positive/negative head* of a rule are the atoms A<sub>1</sub>,...,A<sub>k</sub> and A<sub>k+1</sub>,...,A<sub>l</sub>, respectively. Analogously, the *positive/negative body* of a rule are A<sub>l+1</sub>,...,A<sub>m</sub> and A<sub>m+1</sub>,...,A<sub>n</sub>, respectively. Answer sets with respect to the stable model semantics for this class of programs are defined, for example, in [32].

Next we review the definition of the translation γ used to express strong equivalence of two logic programs in classical first-order logic.<sup>2</sup> It makes use of the fact that strong equivalence can be expressed in the intermediate logic of here-and-there [54], which in turn can be mapped to classical logic [62]. For details and proofs we refer to [25,38,39]. Similar results appeared in [28,55,62,63]. As stated in Sect. 2.1 we assume for each program predicate p two dedicated formula predicates p<sup>0</sup> and p<sup>1</sup> with the same arity. If A is an atom with predicate p, then A<sup>0</sup> is A with p<sup>0</sup> instead of p, and A<sup>1</sup> is A with p<sup>1</sup> instead of p.

Definition 1. For a rule

R = A<sub>1</sub>; ... ; A<sub>k</sub>; **not** A<sub>k+1</sub>; ... ; **not** A<sub>l</sub> ← A<sub>l+1</sub>, ..., A<sub>m</sub>, **not** A<sub>m+1</sub>, ..., **not** A<sub>n</sub>.

with variables *x* define the first-order formulas γ<sup>0</sup>(R) and γ<sup>1</sup>(R) as

$$\begin{aligned} \gamma^0(R) & \stackrel{\text{def}}{=} \forall \mathbf{x} \left( \bigwedge_{i=l+1}^m A_i^0 \wedge \bigwedge_{i=m+1}^n \neg A_i^1 \to \bigvee_{i=1}^k A_i^0 \vee \bigvee_{i=k+1}^l \neg A_i^1 \right), \\ \gamma^1(R) & \stackrel{\text{def}}{=} \forall \mathbf{x} \left( \bigwedge_{i=l+1}^m A_i^1 \wedge \bigwedge_{i=m+1}^n \neg A_i^1 \to \bigvee_{i=1}^k A_i^1 \vee \bigvee_{i=k+1}^l \neg A_i^1 \right). \end{aligned}$$

For a logic program P define the first-order formula γ(P) as

$$
\gamma(P) \stackrel{\text{def}}{=} \bigwedge\_{R \in P} \gamma^0(R) \wedge \bigwedge\_{R \in P} \gamma^1(R).
$$

and define the first-order formula S*<sup>P</sup>* as

$$\mathfrak{S}_P \stackrel{\text{def}}{=} \bigwedge_{p \in \mathit{Pred}(P)} \forall \mathbf{x} \, (p^0(\mathbf{x}) \to p^1(\mathbf{x})),$$

where variables *x* match the arity of p.

Using the transformation γ and the formula S*<sup>P</sup>* we can express strong equivalence as an equivalence in first-order logic.

Proposition 2 ([38]). *Under the stable model semantics two logic programs* P *and* Q *are strongly equivalent iff the following equivalence holds in classical first-order logic:* S<sub>P∪Q</sub> ∧ γ(P) ≡ S<sub>P∪Q</sub> ∧ γ(Q)*.*
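For ground programs the encoding γ is propositional, so Proposition 2 can be tested by brute force over valuations of the superscripted atoms. The following sketch does this; the tuple representation of rules is an assumption made here for illustration and is unrelated to the paper's implementation.

```python
from itertools import product

# A ground rule is (pos_head, neg_head, pos_body, neg_body), tuples of atoms.
# A valuation maps (atom, i) to a truth value, i = 0 or 1 mirroring the
# superscripts p^0, p^1; negation is always evaluated with superscript 1.

def gamma_holds(program, v):
    """Truth of gamma(program), the conjunction of gamma^0 and gamma^1
    of all rules, under valuation v."""
    for ph, nh, pb, nb in program:
        for i in (0, 1):
            body = all(v[a, i] for a in pb) and all(not v[a, 1] for a in nb)
            head = any(v[a, i] for a in ph) or any(not v[a, 1] for a in nh)
            if body and not head:
                return False
    return True

def strongly_equivalent(p, q):
    """Proposition 2, ground case: P and Q are strongly equivalent iff
    S ∧ gamma(P) and S ∧ gamma(Q) agree on all valuations."""
    atoms = sorted({a for r in p + q for part in r for a in part})
    for bits in product([False, True], repeat=2 * len(atoms)):
        v = {(a, i): bits[2 * j + i] for j, a in enumerate(atoms) for i in (0, 1)}
        s = all(not v[a, 0] or v[a, 1] for a in atoms)  # S: p^0 -> p^1
        if (s and gamma_holds(p, v)) != (s and gamma_holds(q, v)):
            return False
    return True
```

For instance, the programs { p. } and { p ← **not** q. } have the same single answer set {p} but are not strongly equivalent (combine both with the fact q.), and the check indeed returns False for them.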

#### 2.3 Definition Synthesis with Craig Interpolation

A formula Q(*x*) is *implicitly definable* in terms of a vocabulary (set of predicate and function symbols) V within a sentence F if, whenever two models of F agree on values of symbols in V , then they agree on the extension of Q, i.e., on the tuples of domain members that satisfy Q. In the special case where Q has no free variables, this means that they agree on the truth value of Q. Implicit definability can be expressed as

$$F \wedge F' \models \forall \mathbf{x} \, (Q(\mathbf{x}) \leftrightarrow Q'(\mathbf{x})),\tag{i}$$

<sup>2</sup> With the notation γ we follow [25]. In [38,39] σ<sup>∗</sup> is used for the same translation.

where F′ and Q′ are copies of F and Q with all symbols not in V replaced by fresh symbols. This semantic notion contrasts with the syntactic one of *explicit definability*: A formula Q(**x**) is *explicitly definable* in terms of a vocabulary V within a sentence F if there exists a formula R(**x**) in the vocabulary V such that

$$F \models \forall \mathbf{x} \; (R(\mathbf{x}) \leftrightarrow Q(\mathbf{x})).\tag{ii}$$

The "Beth property" [7] states the equivalence of both notions and applies to first-order logic. "Craig interpolation" is a tool that can be applied to prove the "Beth property" [21] and, moreover, to construct a formula R from given F, Q, V and a proof of implicit definability. Craig's interpolation theorem [20] applies to first-order logic and states that if a formula F entails a formula G (or, equivalently, if F → G is valid), then there exists a formula H, a *Craig interpolant* of F and G, with the properties that F |= H, H |= G, and the vocabulary of H (predicate and function symbols as well as free variables) is in the common vocabulary of F and G. Craig's theorem can be strengthened to the existence of *Craig-Lyndon interpolants* [56], which satisfy the additional property that predicates in H occur only in polarities in which they occur in both F and G. In our technical framework this condition is expressed as *Pred*<sup>±</sup>(H) ⊆ *Pred*<sup>±</sup>(F) ∩ *Pred*<sup>±</sup>(G).

Craig's interpolation theorem can be proven by constructing H from a proof of F |= G. This works on the basis of sequent calculi [68,71] and analytic tableaux [29]. For calculi from automated first-order reasoning various approaches have been considered [10,40,44,67,76,77]. A method [76] for clausal tableaux [46] performs Craig-Lyndon interpolation and operates on proofs emitted by a general first-order prover, without need to modify the prover for interpolation, and inheriting completeness for full first-order logic from it. Indirectly that method also works on resolution proofs expanded into trees [77].

Observe that the characterization of implicit definability (i) can also be expressed as F ∧ Q(**x**) |= F′ → Q′(**x**). An "explicit" definiens R(**x**) can now be constructed from given F, Q(**x**) and V just as a Craig interpolant of F ∧ Q(**x**) and F′ → Q′(**x**). The synthesis of definitions by Craig interpolation was recognized as a logic-based core technique for view-based query reformulation in relational databases [5,59,66,72]. Often strengthened variations of Craig-Lyndon interpolation are used there that preserve criteria for domain independence, e.g., through relativized quantifiers [5] or range-restriction [77].
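The projection idea behind this construction is easy to demonstrate in the propositional case: for propositional F |= G, existentially quantifying away the atoms private to F yields a Craig interpolant. This brute-force sketch (formulas as Python predicates over valuations, an encoding chosen here for illustration) is not the clausal-tableaux interpolation method used later in the paper.

```python
from itertools import product

def interpolant(f, f_atoms, g_atoms):
    """A Craig interpolant for propositional f |= g: project f onto the
    atoms shared with g, i.e. H(v) holds iff some assignment to f's
    private atoms makes f true. H depends only on shared atoms; f |= H,
    and given f |= g also H |= g."""
    shared = sorted(set(f_atoms) & set(g_atoms))
    private = sorted(set(f_atoms) - set(g_atoms))
    def h(v):
        for bits in product([False, True], repeat=len(private)):
            w = dict(v)
            w.update(zip(private, bits))
            if f(w):
                return True
        return False
    return h, shared
```

E.g. for F = p ∧ q and G = p ∨ r, the shared vocabulary is {p} and the projected interpolant is equivalent to p.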

# 3 Variations of Craig Interpolation and Beth Definability for Logic Programs

We synthesize logic programs according to a variation of Beth's theorem justified by a variation of Craig-Lyndon interpolation. Craig-Lyndon interpolation "out of the box" is not sufficient for our purposes: To obtain logic programs as definientia we need as a basis a stronger form of interpolation where the interpolant is not just a first-order formula but, moreover, the first-order encoding of a logic program, permitting the actual extraction of the program.

# 3.1 Extracting Logic Programs from a First-Order Encoding

We address the questions of how to abstractly characterize first-order formulas that encode a logic program, and how to extract the program from a first-order formula that meets the characterization. As specified in Sect. 2.1, we assume first-order formulas over predicates superscripted with 0 and 1. The notation S<sub>P</sub> (Definition 1) is now overloaded for 0/1-superscripted formulas F as S<sub>F</sub> ≝ ⋀<sub>p ∈ *Pred*<sub>LP</sub>(F)</sub> ∀**x** (p<sup>0</sup>(**x**) → p<sup>1</sup>(**x**)), where the variables **x** match the arity of p. We introduce a convenient notation for the result of systematically renaming all 0-superscripted predicates to their 1-superscripted correspondents.

Definition 3. For 0/1-superscripted first-order formulas F define rename<sub>0→1</sub>(F) as F with all occurrences of 0-superscripted predicates p<sup>0</sup> replaced by the corresponding 1-superscripted predicate p<sup>1</sup>.

Obviously rename<sub>0→1</sub> can be moved over logic operators, e.g., rename<sub>0→1</sub>(F ∧ G) ≡ rename<sub>0→1</sub>(F) ∧ rename<sub>0→1</sub>(G), and rename<sub>0→1</sub>(∀**x** F) ≡ ∀**x** rename<sub>0→1</sub>(F). Semantically, rename<sub>0→1</sub> preserves entailment and thus also equivalence.

Proposition 4. *For* 0/1*-superscripted first-order formulas* F *and* G *it holds that:* (i) *If* F |= G*, then* rename<sub>0→1</sub>(F) |= rename<sub>0→1</sub>(G)*.* (ii) *If* F ≡ G*, then* rename<sub>0→1</sub>(F) ≡ rename<sub>0→1</sub>(G)*.*

Observe that for all rules R it holds that γ<sup>1</sup>(R) = rename<sub>0→1</sub>(γ<sup>0</sup>(R)) and for all logic programs P it holds that γ(P) |= rename<sub>0→1</sub>(γ(P)). On the basis of rename<sub>0→1</sub> we define a first-order criterion for a formula encoding a logic program, that is, a first-order entailment that holds if and only if a given first-order formula encodes a logic program.

Definition 5. A 0/1-superscripted first-order formula F is said to *encode a logic program* iff F is universal and S<sub>F</sub> ∧ F |= rename<sub>0→1</sub>(F).
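In the ground (propositional) case this criterion is directly decidable by enumerating valuations. A brute-force sketch, with clauses represented as sets of literal triples `(atom, superscript, sign)` (a representation assumed here for illustration, not taken from the paper):

```python
from itertools import product

def evaluate(clauses, v):
    """Truth of a clausal formula: literal (a, i, s) holds iff v[a, i] == s."""
    return all(any(v[a, i] == s for (a, i, s) in c) for c in clauses)

def rename01(clauses):
    """rename_{0->1}: move every predicate occurrence to superscript 1."""
    return [frozenset((a, 1, s) for (a, _, s) in c) for c in clauses]

def encodes_program(clauses):
    """Definition 5, ground case: check S_F ∧ F |= rename_{0->1}(F)
    (the universality condition is trivial for ground clause sets)."""
    atoms = sorted({a for c in clauses for (a, _, _) in c})
    renamed = rename01(clauses)
    for bits in product([False, True], repeat=2 * len(atoms)):
        v = {(a, i): bits[2 * j + i] for j, a in enumerate(atoms) for i in (0, 1)}
        s_f = all(not v[a, 0] or v[a, 1] for a in atoms)
        if s_f and evaluate(clauses, v) and not evaluate(renamed, v):
            return False
    return True
```

For instance, the single clause ¬p<sup>0</sup> does not encode a program, while ¬p<sup>0</sup> ∧ ¬p<sup>1</sup>, which is γ(← p.), does.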

This criterion adequately characterizes that the formula represents a logic program on the basis of the translation γ. The following theorem justifies this.

Theorem 6 (Formulas Encoding a Logic Program). (i) *For all logic programs* P *it holds that* γ(P) *is a* 0/1*-superscripted first-order formula that encodes a logic program.* (ii) *If a* 0/1*-superscripted first-order formula* F *encodes a logic program, then there exists a logic program* P *such that* S<sub>F</sub> |= γ(P) ↔ F*,* *Pred*(P) ⊆ *Pred*<sub>LP</sub>(F) *and* *Fun*(P) ⊆ *Fun*(F)*. Moreover, such a program* P *can be effectively constructed from* F*.*

*Proof.* (6.i) Immediate from the definition of γ. (6.ii) Procedure 7, specified below and proven correct in Proposition 8, shows the construction of a suitable program. ∎

Theorem 6.ii claims that the vocabulary of the program is only *included* in the respective vocabulary of the formula. This gives for the formula the freedom of a larger vocabulary with symbols that may be eliminated, e.g., by simplifications.

# Procedure 7 (Decoding an Encoding of a Logic Program).

Input: A 0/1-superscripted first-order formula F that encodes a logic program.

Method:

1. Transform F into a prenex formula ∀**x** M with M a clausal formula, which is possible since F is universal. Classify the clauses of M: let M<sub>1</sub> be the conjunction of the clauses in which only 1-superscripted predicates occur, and let M<sub>0</sub> be the conjunction of the remaining clauses.

2. Split M<sub>1</sub> into conjunctions M′<sub>1</sub> and M″<sub>1</sub> with M<sub>1</sub> ≡ M′<sub>1</sub> ∧ M″<sub>1</sub> such that

$$\forall \mathbf{x} \, \mathsf{rename}_{0 \to 1}(M_0) \models \forall \mathbf{x} \, M_1''.$$

A possibility is to take M′<sub>1</sub> = M<sub>1</sub> and M″<sub>1</sub> = ⊤. Another possibility that often leads to a smaller M′<sub>1</sub> is to consider each clause C in M<sub>1</sub> and place it into M″<sub>1</sub> or M′<sub>1</sub> depending on whether there is a clause D in M<sub>0</sub> such that rename<sub>0→1</sub>(D) subsumes C.

3. Let P be the set of rules

$$A\_1; \dots; A\_k; \text{not } A\_{k+1}; \dots; \text{not } A\_l \gets A\_{l+1}, \dots, A\_m, \text{not } A\_{m+1}, \dots, \text{not } A\_n$$

for each clause

$$\bigwedge_{i=l+1}^{m} A_i^0 \wedge \bigwedge_{i=m+1}^{n} \neg A_i^1 \;\to\; \bigvee_{i=1}^{k} A_i^0 \vee \bigvee_{i=k+1}^{l} \neg A_i^1$$

in M<sub>0</sub> ∧ M′<sub>1</sub>.

Output: Return P, a logic program such that S<sub>F</sub> |= γ(P) ↔ F, *Pred*(P) ⊆ *Pred*<sub>LP</sub>(F) and *Fun*(P) ⊆ *Fun*(F).

Proposition 8. *Procedure 7 is correct.*

*Proof.* For an input formula F and output program P the syntactic requirements *Pred*(P) ⊆ *Pred*<sub>LP</sub>(F) and *Fun*(P) ⊆ *Fun*(F) are easy to see from the construction of P by the procedure. To prove the semantic requirement S<sub>F</sub> |= γ(P) ↔ F we first note the following assumptions, which follow straightforwardly from the specification of the procedure.


The semantic requirement S<sub>F</sub> |= γ(P) ↔ F can then be derived as follows.


Some examples illustrate Procedure 7 and the *encoding a logic program* property.

### Example 9.


*Remark 1.* For the extracted program it is desirable that it is not unnecessarily large. Specifically, it should not contain rules that are easily identified as redundant. Step 2 of Procedure 7 permits techniques to keep M′<sub>1</sub> small. Other possibilities include well-known formula simplification techniques that preserve equivalence, such as removal of tautological or subsumed clauses, which may be integrated into the classification, step 1 of the procedure. In addition, conversions that just preserve equivalence modulo S<sub>F</sub> may be applied, conceptually as a preprocessing step, although in practice possibly implemented on the clausal form. Procedure 7 then receives as input not F but a universal first-order formula F′ whose vocabulary is included in that of F with the property

$$\mathfrak{S}_F \models F' \leftrightarrow F. \tag{iii}$$

Formula F′ then also encodes a program: That S<sub>F′</sub> ∧ F′ |= rename<sub>0→1</sub>(F′) follows from S<sub>F</sub> ∧ F |= rename<sub>0→1</sub>(F), (iii) and Proposition 4.ii. Procedure 7 guarantees for its output S<sub>F′</sub> |= γ(P) ↔ F′, hence by (iii) it follows that S<sub>F</sub> |= γ(P) ↔ F.

Example 10. Consider the following clauses and programs:

C<sub>1</sub> = ¬p<sup>0</sup> ∨ q<sup>1</sup> ∨ r<sup>0</sup>, C<sub>2</sub> = ¬p<sup>0</sup> ∨ q<sup>1</sup> ∨ r<sup>1</sup>, C<sub>3</sub> = ¬p<sup>1</sup> ∨ q<sup>1</sup> ∨ r<sup>1</sup>;

P = r ← p, **not** q. ← p, **not** q, **not** r. P′ = r ← p, **not** q.

Assume as input of Procedure 7 the formula F = C<sub>1</sub> ∧ C<sub>2</sub> ∧ C<sub>3</sub>. Then M<sub>0</sub> = C<sub>1</sub> ∧ C<sub>2</sub> and M<sub>1</sub> = C<sub>3</sub>, and, aiming at a short program, we can set M′<sub>1</sub> = ⊤ and M″<sub>1</sub> = C<sub>3</sub>. The extracted program is then P. By preprocessing F according to Remark 1 we can eliminate C<sub>2</sub> from F and obtain the shorter strongly equivalent program P′.
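The ground case of Procedure 7 can be sketched compactly. Literals are triples `(atom, superscript, sign)` and output rules are `(pos head, neg head, pos body, neg body)` tuples of atom names; both representations are assumptions made here for illustration. On the clauses of Example 10 the sketch performs the subsumption-based split of step 2 and returns the two rules of program P.

```python
def rename01(clause):
    return frozenset((a, 1, s) for (a, _, s) in clause)

def decode(clauses):
    """Ground sketch of Procedure 7: classify clauses into M0 (some
    0-superscripted occurrence) and M1 (only 1-superscripted), drop the
    part of M1 subsumed by rename_{0->1}(M0), and read off rules."""
    m0 = [c for c in clauses if any(i == 0 for (_, i, _) in c)]
    m1 = [c for c in clauses if all(i == 1 for (_, i, _) in c)]
    m1_kept = [c for c in m1 if not any(rename01(d) <= c for d in m0)]
    rules = []
    for c in m0 + m1_kept:
        # clause literal -> rule position:  A^0 positive head,
        # -A^1 "not" in head, -A^0 positive body, A^1 "not" in body
        rules.append((
            tuple(sorted(a for (a, i, s) in c if s and i == 0)),
            tuple(sorted(a for (a, i, s) in c if not s and i == 1)),
            tuple(sorted(a for (a, i, s) in c if not s and i == 0)),
            tuple(sorted(a for (a, i, s) in c if s and i == 1)),
        ))
    return rules
```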

#### 3.2 A Refinement of Craig Interpolation for Logic Programs

With the material from Sect. 3.1 on extracting logic programs from formulas we can now state a theorem on *LP-interpolation*, where *LP* stands for *logic program*. It is a variation of Craig interpolation applying to first-order formulas that encode logic programs. The theorem states not only the existence of an LP-interpolant, but, moreover, also claims effective construction.

Theorem 11 (LP-Interpolation). *Let* F *be a* 0/1*-superscripted first-order formula that encodes a logic program and let* G *be a* 0/1*-superscripted first-order formula such that* *Fun*(F) ⊆ *Fun*(G) *and* S<sub>F</sub> ∧ F |= S<sub>G</sub> → G*. Then there exists a* 0/1*-superscripted first-order formula* H*, called the* LP-interpolant *of* F *and* G*, such that*

*1.* S<sub>F</sub> ∧ F |= H.
*2.* H |= S<sub>G</sub> → G.
*3.* *Pred*<sup>±</sup>(H) ⊆ W ∪ {⟨*pol*, p<sup>1</sup>⟩ | ⟨*pol*, p<sup>0</sup>⟩ ∈ W}, where W = *Pred*<sup>±</sup>(S<sub>F</sub> ∧ F) ∩ *Pred*<sup>±</sup>(S<sub>G</sub> → G).
*4.* *Fun*(H) ⊆ *Fun*(F).
*5.* H encodes a logic program.
*Moreover, if existence holds, then an LP-interpolant* H *can be effectively constructed, via a universal Craig-Lyndon interpolant of* S<sub>F</sub> ∧ F *and* S<sub>G</sub> → G*.*

*Proof.* We show the construction of a suitable formula H. Let H′ be a Craig-Lyndon interpolant of S<sub>F</sub> ∧ F and S<sub>G</sub> → G. Since F and S<sub>F</sub> are universal first-order formulas and we have the precondition *Fun*(F) ⊆ *Fun*(G), we may in addition assume that H′ is a universal first-order formula. (This additional condition is guaranteed, for example, by the interpolation method from [76], which computes H′ directly from a clausal tableaux proof, or indirectly from a resolution proof [77].) Define H ≝ H′ ∧ rename<sub>0→1</sub>(H′). Claims 2, 4 and 5 of the theorem statement are then easy to see. Claim 1 can be shown as follows. We may assume the following.

$$\begin{array}{ll} \text{(1)} \ \ \mathfrak{S}_F \wedge F \models H'. & \text{Since } H' \text{ is a Craig-Lyndon interpolant.}\\ \text{(2)} \ \ \mathfrak{S}_F \wedge F \models \mathsf{rename}_{0 \to 1}(F). & \text{Since } F \text{ encodes a logic program.} \end{array}$$

Claim 1 can then be derived in the following steps.

$$\begin{array}{lll} \text{(3)} & \mathsf{rename}_{0\to1}(\mathfrak{S}_{F}\wedge F) \models \mathsf{rename}_{0\to1}(H'). & \text{By (1) and Proposition 4.i.}\\ \text{(4)} & \mathfrak{S}_{F}\wedge F \models \mathsf{rename}_{0\to1}(\mathfrak{S}_{F}\wedge F). & \text{By (2), since } \mathsf{rename}_{0\to1}(\mathfrak{S}_{F}) \equiv \top.\\ \text{(5)} & \mathfrak{S}_{F}\wedge F \models \mathsf{rename}_{0\to1}(H'). & \text{By (4) and (3).}\\ \text{(6)} & \mathfrak{S}_{F}\wedge F \models H. & \text{By (5) and (1), since } H = H' \wedge \mathsf{rename}_{0\to1}(H'). \end{array}$$

Claim 3 follows because, since H′ is a Craig-Lyndon interpolant, it holds that *Pred*<sup>±</sup>(H′) ⊆ *Pred*<sup>±</sup>(S<sub>F</sub> ∧ F) ∩ *Pred*<sup>±</sup>(S<sub>G</sub> → G). With the predicate occurrences in rename<sub>0→1</sub>(H′), i.e., 1-superscripted predicates in the positions of 0-superscripted predicates in H′, we obtain the restriction of *Pred*<sup>±</sup>(H) stated as claim 3.

For an LP-interpolant of formulas F and G, where F encodes a logic program, the semantic properties stated in claims 1 and 2 are those of a Craig or Craig-Lyndon interpolant of S<sub>F</sub> ∧ F and S<sub>G</sub> → G. The allowed polarity/predicate pairs are those common to S<sub>F</sub> ∧ F and S<sub>G</sub> → G, as in a Craig-Lyndon interpolant, and, in addition, the 1-superscripted versions of polarity/predicate pairs that appear only 0-superscripted among these common pairs. These additional pairs are those that might occur in the result of rename<sub>0→1</sub> applied to a Craig-Lyndon interpolant. In contrast to a Craig interpolant, functions are only constrained by the first given formula F. Permitting only functions common to F and G can result in an interpolant with existential quantification, which thus does not encode a program. Claim 5 states that the LP-interpolant indeed encodes a logic program as characterized in Definition 5. This property is, so to speak, passed through from the given formula F to the LP-interpolant.

### 3.3 Effective Projective Definability of Logic Programs

We present a variation of the "Beth property" that applies to logic programs with stable model semantics and takes strong equivalence into account. The underlying technique is our LP-interpolation, Theorem 11. It maps into Craig-Lyndon interpolation for first-order logic, utilizing that strong equivalence of logic programs can be expressed as first-order equivalence of encoded programs. This approach allows the effective construction of logic programs in the role of "explicit definitions" via Craig-Lyndon interpolation on first-order formulas. Our variation of the "Beth property" is *projective* as it is with respect to a given set of predicates allowed in the definiens. While our LP-interpolation theorem was expressed in terms of first-order formulas that encode logic programs, we now phrase definability entirely in terms of logic programs.<sup>3</sup>

Theorem 12 (Effective Projective Definability of Logic Programs). *Let* P *and* Q *be logic programs and let* V ⊆ *Pred*(P) ∪ *Pred*(Q) *be a set of predicates. The existence of a logic program* R *with* *Pred*(R) ⊆ V *and* *Fun*(R) ⊆ *Fun*(P) ∪ *Fun*(Q) *such that* P ∪ R *and* P ∪ Q *are strongly equivalent is expressible as an entailment between two first-order formulas. Moreover, if existence holds, such a program* R *can be effectively constructed, via a universal Craig-Lyndon interpolant of the left and the right side of the entailment.*

*Proof.* The first-order entailment that characterizes the existence of a logic program R is S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q) |= ¬S<sub>P′</sub> ∨ ¬S<sub>Q′</sub> ∨ ¬γ(P′) ∨ γ(Q′), where the primed programs P′ and Q′ are like P and Q, except that predicates not in V are replaced by fresh predicates. If the entailment holds, we can construct a program R as follows: Let H be the LP-interpolant of γ(P) ∧ γ(Q) and ¬γ(P′) ∨ γ(Q′), as specified in Theorem 11, and extract the program R from H with Procedure 7. That R constructed in this way has the properties claimed in the theorem statement can be shown as follows. Since H is an LP-interpolant it follows that

(1) S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q) |= H. (2) H |= ¬S<sub>P′</sub> ∨ ¬S<sub>Q′</sub> ∨ ¬γ(P′) ∨ γ(Q′). (3) *Pred*<sub>LP</sub>(H) ⊆ V. (4) *Fun*(H) ⊆ *Fun*(P) ∪ *Fun*(Q).

(5) H encodes a logic program.

From the preconditions of the theorem and since R is extracted from H with Procedure 7 and thus meets the properties stated in Theorem 6.ii it follows that

<sup>3</sup> LP-interpolation could also be phrased in terms of logic programs, providing an interpolation result for logic programs on its own, not just as basis for definability. We plan to address this in future work.

(6) V ⊆ *Pred*(P) ∪ *Pred*(Q).

(7) S<sub>H</sub> |= γ(R) ↔ H.

(8) *Pred*(R) ⊆ *Pred*<sub>LP</sub>(H).

(9) *Fun*(R) ⊆ *Fun*(H).

The claimed properties of the theorem statement can then be derived as steps (10), (11) and (18) as follows.


Conversely, we have to show that if there exists a logic program R with the properties in the above theorem statement, then the characterizing entailment S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q) |= ¬S<sub>P′</sub> ∨ ¬S<sub>Q′</sub> ∨ ¬γ(P′) ∨ γ(Q′) does hold. We may assume

(19) *Pred*(R) ⊆ V ⊆ *Pred*(P) ∪ *Pred*(Q).

(20) *Fun*(R) ⊆ *Fun*(P) ∪ *Fun*(Q).

(21) P ∪ R and P ∪ Q are strongly equivalent.

The characterizing entailment can then be derived as follows.

(22) S<sub>P</sub> ∧ S<sub>Q</sub> |= S<sub>R</sub>. By (19).

(23) S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(R) ≡ S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q). By (21), Proposition 2 and (22).

(24) S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q) |= γ(R). By (23).

(25) S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ ¬γ(Q) |= ¬γ(R). By (23).

(26) S<sub>P′</sub> ∧ S<sub>Q′</sub> ∧ γ(P′) ∧ ¬γ(Q′) |= ¬γ(R). By (25), (19) and (20).

(27) S<sub>P</sub> ∧ S<sub>Q</sub> ∧ γ(P) ∧ γ(Q) |= ¬S<sub>P′</sub> ∨ ¬S<sub>Q′</sub> ∨ ¬γ(P′) ∨ γ(Q′). By (26) and (24). ∎

We now give some examples from the application point of view.

Example 13. The following examples show for given programs P, Q and sets V of predicates a possible value of R according to Theorem 12.

(i) Q = p ← q,r. p; q ← r. q ← q,s. V = {p,r} R = p ← r.

In this first example we consider the special case where P is empty and thus not shown. Predicates q and s are redundant in Q, "absolutely" and not just relative to a program P. By Theorem 12, this is proven with the characterizing first-order entailment and, moreover, a strongly equivalent reformulation of Q without q and s is obtained as R.

(ii) P = p(X) ← q(X). Q = r(X) ← p(X). r(X) ← q(X). V = {p, r} R = r(X) ← p(X).

Only r and p are allowed in R. Or, equivalently, q is redundant in Q, *relative* to program P. Again, this is proven with the characterizing first-order entailment and, moreover, a strongly equivalent reformulation of Q without q is obtained as R. It is the clause in Q with q that is redundant relative to P and hence is eliminated in R.

(iii) P = ← p(X), q(X). Q = r(X) ← p(X), **not** q(X). V = {p, r} R = r(X) ← p(X).

Only r and p are allowed in R. The negated literal with q in the body of the rule in Q is redundant relative to P and is eliminated in R.

(iv) P = p(X) ← q(X), **not** r(X). p(X) ← s(X). **not** r(X); s(X) ← p(X). q(X); s(X) ← p(X). Q = t(X) ← p(X). V = {q, r, s, t} R = t(X) ← q(X), **not** r(X). t(X) ← s(X).

The predicate p is not allowed in R. The idea is that p is a predicate that can be used by a client but is not in the actual knowledge base. Program P expresses a schema mapping from the client predicate p to the knowledge base predicates q,r,s. The result program R is a rewriting of the client query Q in terms of knowledge base predicates. Only the first two rules of P actually describe the mapping. The other two rules complete them to a full definition, similar to Clark's completion [19], but here yielding also a program. Such completed predicate specifications seem necessary for the envisaged reformulation tasks.

(v) P = As in Example 13.iv. Q = t(X) ← q(X), **not** r(X). t(X) ← s(X). V = {p, t} R = t(X) ← p(X).

In this example P is from Example 13.iv, while Q and R are also from that example but switched. The vocabulary allows only p and t. While Example 13.iv realizes an unfolding of p, this example realizes folding into p.

(vi) P = n(X) ← z(X). n(X) ← n(Y), s(Y,X). Q = **not** n(X2) ← z(X0), s(X0,X1), s(X1,X2). V = {z, s} R = ← z(X0), s(X0,X1), s(X1,X2).

Program P defines natural numbers recursively. Program Q has a rule whose body specifies the natural number 2 and whose head denies that 2 is a natural number. Because P implies that 2 is a natural number, this head is in R rewritten to the empty head, enforced by disallowing the predicate for natural numbers in R.

(vii) P = c(X,Y,Z) ← r(X,Y), r(Y,Z). ← c(X,Y,Z), **not** r(X,Y). ← c(X,Y,Z), **not** r(Y,Z). Q = r(X,Y); **not** r(X,Y). ← c(X,Y,Z), **not** r(X,Z). V = {r} R = r(X,Y); **not** r(X,Y). ← r(X,Y), r(Y,Z), **not** r(X,Z).

Program Q describes a transitive relation r using the helper predicate c to identify chains where transitivity needs to be checked. In R the use of c is not allowed; program P gives the definition of c. Similarly to Example 13.iv, this realizes an unfolding of c.

Definability according to Theorem 12 inherits potential *decidability* from the first-order entailment problem that characterizes it. If, e.g., in the involved programs only constants occur as function symbols, the characterizing entailment can be expressed as validity in the decidable Bernays-Schönfinkel class.

### 3.4 Constraining Positions of Predicates Within Rules

The polarity sensitivity of LP-interpolation inherited from Craig-Lyndon interpolation, together with the program encoding via superscripted predicates, offers more fine-grained control over the vocabulary of definitions than Theorem 12, by also taking the positions of predicates in rules into account. The following corollary shows this.

Corollary 14 (Position-Constrained Effective Projective Definability of Logic Programs). *Let P and Q be logic programs and let V+, V+1, V− ⊆ Pred(P) ∪ Pred(Q) be three sets of predicates. Call a logic program R* in scope of V+, V+1, V− *if predicates p occur in R only as specified in the following table.*


*The existence of a logic program R in scope of V+, V+1, V− with Fun(R) ⊆ Fun(P) ∪ Fun(Q) such that P ∪ R and P ∪ Q are strongly equivalent is expressible as an entailment between two first-order formulas. Moreover, if existence holds, such a program R can be effectively constructed, via a universal Craig-Lyndon interpolant of the left and the right side of the entailment.*

*Proof (Sketch).* Like Theorem 12, but the renaming of disallowed predicates is applied not already at the program level but in the first-order encoding, taking polarity into account. Let V± be the set of polarity/predicate pairs defined as

$$V^{\pm} \;\stackrel{\mathrm{def}}{=}\; \{+\mathsf{p}^{0} \mid \mathsf{p} \in V_{+}\} \cup \{+\mathsf{p}^{1} \mid \mathsf{p} \in V_{+} \cup V_{+1}\} \cup \{-\mathsf{p}^{0} \mid \mathsf{p} \in V_{-}\} \cup \{-\mathsf{p}^{1} \mid \mathsf{p} \in V_{-}\}.$$

The corresponding entailment underlying definability and LP-interpolation is then

$$S_{P} \wedge S_{Q} \wedge \gamma(P) \wedge \neg\gamma(Q) \;\models\; \neg S'_{P} \vee \neg S'_{Q} \vee \neg\gamma'(P) \vee \gamma'(Q) \vee \neg\mathit{Aux}',$$

where the primed variations of formulas are obtained by replacing each predicate p that does not appear in V± or appears in V± with only a single polarity by a dedicated fresh predicate p′. (Note that negation and priming commute.) With W def= Pred±(¬S_P ∨ ¬S_Q ∨ ¬γ(P) ∨ γ(Q)) we define Aux′ as

$$\mathit{Aux}' \;\stackrel{\mathrm{def}}{=} \bigwedge_{+\mathsf{p} \in V^{\pm},\; -\mathsf{p} \notin V^{\pm},\; +\mathsf{p} \in W} \forall \bar{x}\, (\mathsf{p}(\bar{x}) \to \mathsf{p}'(\bar{x})) \;\wedge \bigwedge_{-\mathsf{p} \in V^{\pm},\; +\mathsf{p} \notin V^{\pm},\; -\mathsf{p} \in W} \forall \bar{x}\, (\mathsf{p}'(\bar{x}) \to \mathsf{p}(\bar{x})),$$

where x̄ is a tuple of variables matching the arity of the respective predicate p.

$$\square$$

In general, for first-order formulas F, G the second-order entailment F |= ∀p G, where p is a predicate, holds if and only if the first-order entailment F |= G′ holds, where G′ is G with p replaced by a fresh predicate p′. This explains the construction of the right side of the entailment in the proof of Corollary 14 as an encoding of quantification upon (or "forgetting about") a predicate in only a *single polarity*. With predicate quantification this can be expressed, e.g., for positive polarity, as

$$\exists^{+}\mathsf{p}\, G \;\stackrel{\mathrm{def}}{=}\; \exists \mathsf{p}'\, (G' \wedge \forall \bar{x}\, (\mathsf{p}(\bar{x}) \to \mathsf{p}'(\bar{x}))),$$

where p′ is a fresh predicate and G′ is G with p replaced by p′. We illustrate the specification of Aux′ with an example.

Example 15. Assume that +p ∈ V± and −p ∉ V±. Let G stand for S_P ∧ S_Q ∧ γ(P) ∧ ¬γ(Q) and let G′ stand for G with all occurrences of p replaced by p′. For now we ignore the restrictions of Aux′ by W, as they only serve a heuristic purpose. The following statements, which are all equivalent to each other, illustrate the specification of Aux′ in a step-by-step fashion. We start by expressing the required "forgetting". Since it appears under a negation, we have to forget here the "allowed" +p. (1) F |= ¬∃⁺p G. (2) F |= ¬∃p′ (G′ ∧ ∀x̄ (p(x̄) → p′(x̄))). (3) F |= ∀p′ (¬G′ ∨ ¬∀x̄ (p(x̄) → p′(x̄))). (4) F |= ¬G′ ∨ ¬∀x̄ (p(x̄) → p′(x̄)). (5) F |= ¬G′ ∨ ¬Aux′.

Observing the restrictions by membership in W in the definition of Aux′ can result in a smaller formula Aux′. Continuing the example, this can be illustrated as follows. Assume +p ∈ V± and −p ∉ V± as before and in addition +p ∉ W. Since +p ∉ W it follows that −p ∉ Pred±(G). So "forgetting" about +p in G is then just ∃p G and the Aux′ component for p is not needed. (Also a further simplification, outlined in the remark below, is possible in this case.)

*Remark 2.* The view of "priming" as predicate quantification justifies a heuristically useful simplification of interpolation inputs for definability: if a predicate p occurs in a formula F only with positive (negative) polarity, then ∃p F is equivalent to F with all atoms of the form p(t̄) replaced by ⊤ (⊥). Hence, if a predicate to be primed occurs in only a single polarity, we can replace all atoms with it by a truth-value constant.
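As a small illustration (our own example, not from the source): the predicate p occurs only positively in ∀x (q(x) → p(x)), hence

$$\exists \mathsf{p}\, \forall x\, (\mathsf{q}(x) \to \mathsf{p}(x)) \;\equiv\; \forall x\, (\mathsf{q}(x) \to \top) \;\equiv\; \top.$$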

We now turn to application possibilities of Corollary 14. While it gives some control over the positions of predicates, it does not allow discriminating between, say, allowing a predicate in negative heads and allowing it in positive bodies. Predicates allowed in positive heads are also allowed in negative bodies. We give some examples.

Example 16. The following examples show, for given programs P, Q and sets V+, V+1, V− of predicates, a possible value of R according to Corollary 14. In all the examples it is essential that a predicate is disallowed in R in only a single polarity. If it were not allowed at all, there would be no solution R.

(i) P = p ← q. Q = r ← p. r ← q. q ← s. V<sup>+</sup> = {p, q,r,s} V+1 = {} V<sup>−</sup> = {p,r,s} R = r ← p. q ← s.

Here q is allowed in R in positive heads (and negative bodies) but not in positive bodies (and negative heads). Parentheses indicate constraints that apply but are not relevant for the example.

(ii) P = p ← q. Q = ← q, **not** p. r ← q. s ← p. V+ = {q,r,s} V+1 = {} V− = {p,q,r,s} R = r ← q. s ← p.

Here p is allowed in R in positive bodies (and negative heads) but not in negative bodies (and positive heads).

(iii) P = p ← q. r ← p. Q = s ← **not** r. r ← q. V+ = {s} V+1 = {r} V− = {p,q,r,s} R = s ← **not** r.

Here r is allowed in R in negative bodies but not in positive heads.

### 4 Prototypical Implementation

We implemented the synthesis according to Theorem 12 and Corollary 14 prototypically with the PIE (*Proving, Interpolating, Eliminating*) environment [74,75], which is embedded in SWI-Prolog [78]. The implementation and all requirements are free software, see http://cs.christophwernhard.com/pie/asp.

For Craig-Lyndon interpolation there are several options, yielding different solutions for some problems. Proving can be performed by CMProver, a clausal tableaux/connection prover [8,9,46] included in PIE, similar to PTTP [69], SETHEO [47] and leanCoP [61], or by Prover9 [58]. Interpolant extraction is performed on clausal tableaux following [76]. Resolution proofs by Prover9 are first translated to tableaux with the *hyper* property, which allows range-restriction and the Horn property to be passed from inputs to outputs of interpolation [77]. Optionally, proofs by CMProver can also be transformed before interpolant extraction to ensure the hyper property. With CMProver it is possible to enumerate alternative interpolants extracted from alternative proofs. More powerful provers such as E [65] and Vampire [43] unfortunately do not emit gap-free proofs that would be suitable for extracting interpolants.

The organization of the implementation closely follows the abstract exposition in Sect. 3, with Prolog predicates corresponding to theorems. For convenience in some applications, the predicate that realizes Theorem 12 and Corollary 14 also permits specifying the vocabulary complementarily, by listing the predicates not allowed in the result. In general, if outputs are largely *semantically* characterized, *simplifications* play a key role. Solutions with redundancies should be avoided, even if they are correct. This concerns all stages of our workflow: preparation of the interpolation inputs, choice or transformation of the proof used for interpolant extraction, the interpolant extraction, the interpolant itself, and the first-order representation of the output program, where strong equivalence must be preserved, possibly modulo a background program. Although our system offers various simplifications at these stages, this seems an area for improvement with large impact on practice. Some particular issues only show up in experiments. For example, for both CMProver and Prover9 a preprocessing of the right sides of the interpolation entailments to reduce the number of distinct variables that are Skolemized by the systems was necessary, even for relatively small inputs.

The application of first-order provers to interpolation for reformulation tasks is rather unknown territory. Experiments with limited success are described in [4]. Our prototypical implementation covers the full range from the application task, synthesis of an answer set program for two given programs and a given vocabulary, to the actual construction of a result program via Craig-Lyndon interpolation by a first-order prover. At least for small inputs, such as the examples in this paper, it successfully produces results. We expect that with larger inputs from applications it at least helps to identify and narrow down the actual issues that arise for practical interpolation with current first-order proving technology. This is facilitated by the embedding into PIE, which allows easy interfacing to community standards, e.g., by exporting proving problems underlying interpolation in the *TPTP* [70] format.

# 5 Conclusion

We presented an effective variation of projective Beth definability based on Craig interpolation for answer set programs with respect to strong equivalence under the stable model semantics. Interpolation theorems for logic programs under the stable model semantics were shown before in [1], where, however, programs appear only on the left side of the entailment underlying interpolation, and the right side as well as the interpolant are just first-order formulas. Craig interpolation and Beth definability for answer set programs were considered in [31,64], but with just existence results for equilibrium logic, which transfer to answer set semantics. The transfer of Craig interpolation and Beth definability from monotonic logics to default logics is investigated in [15]; however, applicability to the stable model semantics and a relationship to strong equivalence are not discussed.

In [73] ontology-mediated querying over a knowledge base for specific description logics is considered, based on Beth definability and Craig interpolation. Interpolation is applied there to the Clark completion [19] of a Datalog program. Although completion semantics is a precursor of the stable model semantics, both agreeing on a subclass of programs, completion seems to be applied in [73] actually to program fragments, or at the "schema level", as in our Examples 13.iv, 13.v and 13.vii. A systematic investigation of these forms of completion is on our agenda. *Forgetting* [22,37] or, closely related, *uniform interpolation* and *second-order quantifier elimination* [30], may be seen as generalizing Craig interpolation: an expression is sought that captures exactly what a given expression says about a restricted vocabulary. A *Craig* interpolant is moreover related to a second given expression entailed by the first, which allows extracting it from a proof of the entailment.

We plan to extend our approach to classes of programs with practically important language extensions. Arithmetic expressions and comparisons in rule bodies are permitted in the language mini-gringo, used already in recent works on verifying strong equivalence [24,26,51]. We considered strong equivalence relative to a context *program* P. These contexts might be generalized to first-order theories that capture theory extensions of logic programs [13,33,41].

So far, our approach characterizes result programs syntactically by restricting allowed predicates and, to some degree, also their positions in rules. Can this be generalized? Restricting allowed functions, including constants, does not seem possible: if a function occurs only in the left side of the entailment underlying Craig interpolation, the interpolant may have existentially quantified variables, making the conversion to a logic program impossible. From the interpolation side it is known that the Horn property and variations of range-restriction can be preserved [77]. It remains to be investigated how this transfers to synthesizing logic programs, where in particular restrictions on the rule form and *safety* [14,45], an important property related to range restriction, are of interest.

In addition to verifying strong equivalence, recent work addresses verifying further properties, e.g., *uniform* equivalence (equivalence under inputs expressed as ground facts) [24,26,52]. The approach is to use completion [19,50] to express the verification problem in classical first-order logic. It is restricted to so-called *locally tight* logic programs [27]. Also forms of equivalence that abstract from "hidden" predicates are mostly considered for such restricted program classes, such as relative equivalence [55], projected answer sets [23], or external behavior [24]. It remains future work to consider definability with uniform equivalence and hidden predicates, possibly using completion for translating logic programs to formulas (instead of γ), although it applies only to restricted classes of programs.

Independently from the application to program synthesis, our characterization of *encodes a program* and our procedure to extract a program from a formula suggest a novel practical method for transforming logic programs while preserving strong equivalence. The idea is as follows, where P is the given program: *first-order transformations* are applied to F def= γ(P) to obtain a first-order formula F′ such that S_F ∧ F ≡ S_F ∧ F′. For transformations that result in a universal formula, F′ encodes a logic program, as argued in Remark 1. Applying the extraction procedure to F′ then results in a program P′ that is strongly equivalent to P. This makes the wide range of known first-order simplifications and formula transformations applicable and provides a firm foundation for the soundness of special transformations. We expect that this approach supplements known dedicated simplifications that preserve strong equivalence [11,23].

With its background in artificial intelligence research, answer set programming is a declarative approach to problem solving, where specifications are processed by automated systems. It is suitable for meta-level reasoning to verify properties of specifications and to synthesize new specifications. On the basis of a technique to verify an equivalence property of answer set programs we developed a synthesis technique. Our tools were Craig interpolation and Beth definability, fundamental insights about first-order logic that relate given formulas to further formulas characterized in certain ways. Practically realized with automated first-order provers, Craig interpolation and Beth definability become tools to synthesize formulas, and, as shown here, also answer set programs.

Acknowledgments. The authors thank anonymous reviewers for helpful suggestions to improve the presentation.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Regularization in Spider-Style Strategy Discovery and Schedule Construction**

Filip Bártek<sup>1,2</sup>(B), Karel Chvalovský<sup>1</sup>, and Martin Suda<sup>1</sup>

<sup>1</sup> Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic

{filip.bartek,karel.chvalovsky,martin.suda}@cvut.cz <sup>2</sup> Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic

**Abstract.** To achieve the best performance, automatic theorem provers often rely on schedules of diverse proving strategies to be tried out (either sequentially or in parallel) on a given problem. In this paper, we report on a large-scale experiment with discovering strategies for the Vampire prover, targeting the FOF fragment of the TPTP library and constructing a schedule for it, based on the ideas of Andrei Voronkov's system Spider. We examine the process from various angles, discuss the difficulty (or ease) of obtaining a strong Vampire schedule for the CASC competition, and establish how well a schedule can be expected to generalize to unseen problems and what factors influence this property.

**Keywords:** Saturation-based theorem proving · Proving strategies · Strategy schedule construction · Vampire

# **1 Introduction**

In 1997 at the CADE conference, the automatic theorem prover Gandalf [30] surprised its contenders at the CASC-14 competition [29] and won the MIX division there. One of the main innovations later identified as a key to Gandalf's success was the use of multiple theorem proving strategies executed sequentially in a time-slicing fashion [31,32]. Nowadays, it is well accepted that a single, universal strategy of an Automatic Theorem Prover (ATP) is invariably inferior, in terms of performance, to a well-chosen portfolio of complementary strategies, most of which do not even need to be complete or very strong in isolation.

Many tools have already been designed to help theorem prover developers discover new proving strategies and/or to combine them and construct proving schedules [7,9,12,16,21,24,33,34]. For example, Schäfer and Schulz employed genetic algorithms [21] for the invention of strong strategies for the E prover [23], Urban developed BliStr and used it to significantly strengthen strategies for the same prover via iterative local improvement and problem clustering [33],

© The Author(s) 2024

Manuscript with additional appendices [1]: https://arxiv.org/abs/2403.12869.

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 194–213, 2024. https://doi.org/10.1007/978-3-031-63498-7_12

and, more recently, Holden and Korovin applied similar ideas in their HOS-ML system [7] to produce schedules for iProver [14]. The last work mentioned—as well as, e.g., MaLeS [16]—also includes a component for *strategy selection*, the task of predicting, based on the input problem's features, which strategy will most likely succeed on it. (Selection is an interesting topic, which is, however, orthogonal to our work and will not be further discussed here.)

For the Vampire prover [15], schedules were for a long time constructed by Andrei Voronkov using a tool called Spider, about which little was known until recently. Its author finally revealed the architectural building blocks of Spider and the ideas behind them at the Vampire Workshop 2023, declaring Spider "a secret weapon behind Vampire's success at the CASC competitions" and "probably the most useful tool for Vampire's support and development" [34]. Acknowledging the importance of strategies for practical ATP usability, we decided to analyze this powerful technology on our own.

In this paper, we report on a large-scale experiment with discovering strategies for Vampire, based on the ideas of Spider (recalled in Sect. 2.1).<sup>1</sup> We target the FOF fragment of the TPTP library [28], probably the most comprehensive benchmark set available for first-order theorem proving. As detailed in Sect. 3, we discover and evaluate (on all the FOF problems) more than 1000 targeted strategies to serve as building blocks for subsequent schedule construction.

Research on proving strategies is sometimes frowned upon as mere "tuning for competitions". While we briefly pause to discuss this aspect in Sect. 4, our main interest in this work is to establish how well a schedule can be expected to generalize to unseen problems. For this purpose, we adopt the standard practice from statistics to randomly split the available problems into a train set and a test set, construct a schedule on one, and evaluate it on the other. In Sect. 6, we then identify several techniques that *regularize*, i.e., have the tendency to improve the test performance while possibly sacrificing the training one.

Optimal schedule construction under some time budget can be expressed as a mixed integer program and solved (given enough time) using a dedicated tool [7,24]. Here, we propose to instead use a simple heuristic from the related set cover problem [3], which leads to a polynomial-time greedy algorithm (Sect. 5). The algorithm maintains the important ability to assign different time limits to different strategies, is much faster than optimal solving (which may overfit to the train set in some scenarios), allows for easy experimentation with regularization techniques, and, in a certain sense made precise later, does not require committing to a single predetermined time budget.

In summary, we make the following main contributions:

– We outline a pragmatic approach to schedule construction that uses a greedy algorithm (Sect. 5), contrasting it with optimal schedules in terms of the quality of the schedules and the computational resources required for their construction (Sect. 6.2). In particular, our findings demonstrate a relative efficacy of the greedy approach for datasets similar to our own.

<sup>1</sup> Not claiming any credit for these, potential errors in the explanation are ours alone.


# **2 Preliminaries**

The behavior of Vampire is controlled by approximately one hundred *options*. These options configure the preprocessing and clausification steps, control the saturation algorithm, clause and literal selection heuristics, determine the choice of generating inferences as well as redundancy elimination and simplification rules, and more. Most of these options range over the Boolean or a small finite domain, a few are numeric (integer or float), and several represent ratios.

Every option has a *default* value, which is typically the most universally useful one. Some option settings make Vampire incomplete. This is automatically recognized, so that when the prover finitely saturates the input without discovering a contradiction, it will report "unknown" (rather than "satisfiable").

A *strategy* is determined by specifying the values of all options. A *schedule* is a sequence $(s_i, t_i)_{i=1}^{n}$ of strategies $s_i$ together with assigned time limits $t_i$, intended to be executed in the prescribed order. We stress that in this work we do not consider schedules that would branch depending on problem features.

# **2.1 Spider-Style Strategy Discovery and Schedule Construction**

We are given a set of problems P and a prover with its space of available strategies S. *Strategy discovery* and *schedule construction* are two separate phases. We work under the premise that the larger and more diverse a set of strategies we first collect, the better for later constructing a good schedule.

Strategy discovery consists of three stages: random probing, strategy optimization, and evaluation, which can be repeated as long as progress is made.

*Random Probing.* We start strategy discovery with an empty pool of strategies S = ∅. A straightforward way to make sure that a new strategy substantially contributes to the current pool S is to always try to solve a problem not yet solved (or *covered*) by any strategy collected so far. We repeatedly pick such a problem and try to solve it using a *randomly sampled* strategy out of the totality of all available strategies S. The sampling distribution may be adapted to prefer option values that were successful in the past (cf. Sect. 3.3). This stage is computationally demanding, but can be massively parallelized.

*Strategy Optimization.* Each newly discovered strategy s, solving an as-of-yet uncovered problem p, will get *optimized* to be as fast as possible at solving p. One explores the strategy neighborhood by iterating over the options (possibly in several rounds), varying option values, and committing to changes that lead to a (local) improvement in terms of solution time or, as a tie-breaker, to a default option value where time differences seem negligible. We evaluate the impact of this stage in Sect. 3.4.
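The optimization stage can be sketched in a few lines of Python (our illustration, not Spider's actual code; `evaluate`, the option domains, and the omission of the default-value tie-breaker are simplifying assumptions):

```python
def optimize(strategy, problem, domains, evaluate, rounds=2):
    """Greedy local search over option values: commit to any change that makes
    `strategy` solve `problem` faster.  `evaluate` is assumed to run the prover
    with the given option setting and return the solution time, or None on
    failure.  (Tie-breaking toward default values, as described in the text,
    is omitted for brevity.)"""
    best = dict(strategy)
    best_time = evaluate(best, problem)
    assert best_time is not None, "start from a strategy that solves the problem"
    for _ in range(rounds):                    # possibly several rounds
        for opt, values in domains.items():    # iterate over the options
            for v in values:                   # ... varying their values
                if v == best[opt]:
                    continue
                candidate = dict(best)
                candidate[opt] = v
                t = evaluate(candidate, problem)
                if t is not None and t < best_time:   # keep local improvements
                    best, best_time = candidate, t
    return best, best_time
```

The search is purely local: each accepted change becomes the new starting point, so the result depends on the iteration order of the options.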

*Strategy Evaluation.* In the final stage of the discovery process, having obtained an optimized version s of s, we evaluate s on all our problems P. (This is another computationally expensive, but parallelizable step.) Thus, we enrich our pool and update our statistics about covered problems. Note that every strategy s we obtain this way is associated with the problem p*<sup>s</sup>* for which it was originally discovered. We will call this problem the *witness problem* of s .

*Schedule Construction* can be tried as soon as a sufficiently rich (cf. Sect. 3) pool of strategies is collected. Since we, for every collected strategy, know how it behaves on each problem, we can pose schedule construction as an optimization task to be solved, e.g., by a (mixed) integer programming (MIP) solver.

In more detail: We seek to allocate time slices t*<sup>s</sup>* > 0 to some of the strategies s ∈ S to cover as many problems as possible while remaining in sum below a given time budget T [7,24]. Alternatively, we may try to cover all the problems known to be solvable in as little total time as possible.<sup>2</sup> In this paper, we describe an alternative schedule construction method based on a greedy heuristic, with a polynomial running time guarantee and other favorable properties (Sect. 5).
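For intuition, the greedy set-cover idea can be sketched as follows (a simplified illustration of ours, not the exact algorithm of Sect. 5; the data layout `results[s][p]`, holding the solution time of strategy s on problem p or None, is an assumption):

```python
def greedy_schedule(results, budget):
    """Repeatedly pick the (strategy, time limit) pair that covers the most
    still-uncovered problems per unit of allotted time, within `budget`."""
    uncovered = {p for times in results.values() for p in times}
    schedule, spent = [], 0
    while uncovered and spent < budget:
        best, best_ratio = None, 0.0
        for s, times in results.items():
            # candidate limits: solution times of s on still-uncovered problems
            for limit in {t for p, t in times.items()
                          if p in uncovered and t is not None}:
                if spent + limit > budget:
                    continue
                gained = sum(1 for p in uncovered
                             if times.get(p) is not None and times[p] <= limit)
                if gained / limit > best_ratio:
                    best, best_ratio = (s, limit), gained / limit
        if best is None:
            break          # nothing more fits into the budget
        s, limit = best
        schedule.append((s, limit))
        spent += limit
        uncovered -= {p for p, t in results[s].items()
                      if t is not None and t <= limit}
    return schedule
```

Restricting candidate limits to observed solution times is what keeps the search polynomial: any other limit either covers the same problems or wastes time.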

### **2.2 CPU Instructions as a Measure of Time**

We will measure computation time in terms of the number of user instructions executed (as available on Linux systems through the perf tool). This is, in our experience, more precise and more stable (on architectures with many cores and many concurrently running processes) than measuring real time.<sup>3</sup>

In fact, we report *megainstructions (Mi)*, where 1 Mi = 2<sup>20</sup> instructions reported by perf. On contemporary hardware, 2000 Mi will typically get used up in a bit less than a second and 256 000 Mi in around 2 min of CPU time. We also set 1 Mi as the granularity for the time limits in our schedules.

# **3 Strategy Discovery Experiment**

Following the recipe outlined in Sect. 2.1, we set out to collect a pool of Vampire (version 4.8) strategies covering the first-order form (FOF) fragment of the

<sup>2</sup> Strictly speaking, these only give us a set of strategy-time pairs (as opposed to a sequence). However, the strategies can be ordered heuristically afterward.

<sup>3</sup> For a more thorough motivation, see Appendix A of [26].

**Fig. 1.** Strategy discovery. *Left:* problem coverage growth in time (uniform strategy sampling distribution vs. an updated one). *Right:* collected strategies ordered by limit (2000, 4000, . . . , 256 000 Mi) and, secondarily, by how many problems each can solve.

TPTP library [28] version 8.2.0. We focused only on proving, so left out all the problems known to be satisfiable, which left us with a set P of 7866 problems. Parallelizing the process where possible, we strived to fully utilize 120 cores (AMD EPYC 7513, 3.6 GHz) of our server equipped with 500 GB RAM.

We let the process run for a total of 20.1 days, in the end covering 6796 problems, as plotted in Fig. 1 (left). The effect of diminishing returns is clearly visible; however, we cannot claim we have exhausted all the possibilities. In the last day alone, 8 strategies were added and 9 new problems were solved.

The rest of Fig. 1 is gradually explained in the following as we cover some important details regarding the strategy discovery process.

### **3.1 Initial Strategy and Varying Instruction Limits**

We seeded the pool of strategies by first evaluating Vampire's default strategy for the maximum considered time limit of 256 000 Mi, solving 4264 problems out of the total 7866.

To save computation time, we did not probe or evaluate all subsequent strategies for this maximum limit. Instead, to exponentially prefer low limits to high ones, we made use of the Luby sequence<sup>4</sup> [18] known for its utility in the restart strategies of modern SAT solvers. Our own use of the sequence was as follows.

The lowest limit was initially set to 2000 Mi and, multiplying the Luby sequence members by this number, we got the progression 2000, 2000, 4000, 2000, 2000, 4000, 8000, . . . as the prescribed limits for subsequent probe iterations. This sequence reaches 256 000 Mi for the first time in 255 steps. At that point, we stopped following the Luby sequence and instead started from the beginning (to avoid eventually reaching limits higher than 256 000 Mi).
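The resulting limit progression is easy to reproduce (a small Python sketch of ours):

```python
def luby(i):
    """The i-th member (1-based) of the Luby sequence 1, 1, 2, 1, 1, 2, 4, ..."""
    k = i.bit_length()
    if i == (1 << k) - 1:                    # i = 2^k - 1: the value is 2^(k-1)
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)      # otherwise recurse into the prefix

# Multiply the sequence by the lowest limit (2000 Mi); one cycle has 255 steps.
limits = [2000 * luby(i) for i in range(1, 256)]

print(limits[:7])                 # [2000, 2000, 4000, 2000, 2000, 4000, 8000]
print(limits.index(256_000) + 1)  # 255: the maximum limit is first reached here
```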

After four such cycles, probes at the lowest limit (2000 Mi) stopped producing new solutions (a sampling timeout of 1 h per iteration was imposed).

<sup>4</sup> https://oeis.org/A182105.

Here, after almost 8.5 days, the "updated 2K" plot in Fig. 1 (left) ends. We then increased the lowest limit to 16 000 Mi and continued in an analogous fashion for 155 iterations and 5.7 more days ("updated 16K"), and eventually increased the lowest limit to 64 000 Mi ("updated 64K") until the end.

Figure 1 (right) is a scatter plot showing the totality of 1096 strategies that we finally obtained and how they individually perform. The primary order on the x axis is by the limit and allows us to make a rough comparison of the number of strategies in each limit group (2000 Mi, 4000 Mi, . . . , 256 000 Mi, from left to right). It is also clear that many strategies (across the limit groups) are, in terms of problem coverage, individually very weak, yet each at some point contributed to solving a problem considered (at that point) challenging.

### **3.2 Problem Sampling**

While the guiding principle of random probing is to constantly aim for solving an as-of-yet unsolved problem, we modified this criterion slightly to produce a set of strategies better suited for an unbiased estimation of schedule performance on unseen problems (as detailed in the second half of this paper).

Namely, in each iteration $i$, we "forgot" a random half $P^F_i$ of all problems $P$, considered only those strategies (discovered so far) whose witness problem lies in the remaining half $P^R_i = P \setminus P^F_i$, and aimed for solving a random problem in $P^R_i$ not covered by any of these strategies. This likely slowed the overall growth of coverage, as many problems would need to be covered several times due to the changing perspective of $P^R_i$. However, we got a (probabilistic) guarantee that any (not too small) subset $P' \subseteq P$ will contain enough witness problems such that their corresponding strategies will cover $P'$ well.
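One iteration of this modified target selection can be sketched as follows; the `witness` and `covers` mappings are hypothetical data structures standing in for the bookkeeping the paper describes:

```python
import random

def pick_target(problems, witness, covers, rng=random):
    """One probing iteration: forget a random half of the problems, keep only
    strategies whose witness problem survived, and pick an uncovered target.
    witness[s]: strategy s's witness problem; covers[s]: problems s solves."""
    forgotten = set(rng.sample(sorted(problems), len(problems) // 2))
    remaining = problems - forgotten
    kept = [s for s in witness if witness[s] in remaining]
    covered = set().union(*(covers[s] for s in kept)) if kept else set()
    uncovered = sorted(remaining - covered)
    return rng.choice(uncovered) if uncovered else None
```

If every remaining problem is already covered by a kept strategy, the iteration yields no target (`None`) and a fresh random split is drawn next time.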

### **3.3 Strategy Sampling**

We sampled a random strategy by independently choosing a random value for each option. The only exceptions were dependent options. For example, it does not make sense to configure the AVATAR architecture (changing options such as acc, which enables congruence closure under AVATAR) if the main AVATAR option (av) is set to off. Such complications can easily be avoided by following, during the sampling, a topological order that respects the option dependencies. (For example, we sample acc only after the value on has been chosen for av.)
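A dependency-respecting sampler along these lines might look as follows; the option domains here are illustrative, not Vampire's actual ones:

```python
import random

# Toy option space with one dependency: `acc` only makes sense when `av` is "on".
OPTIONS = {  # insertion order is a topological order of the dependencies
    "av":  {"domain": ["on", "off"], "requires": None},
    "acc": {"domain": ["on", "off"], "requires": ("av", "on")},
}

def sample_strategy(options=OPTIONS, rng=random):
    """Sample each option in topological order, skipping options whose
    guarding parent option did not receive the required value."""
    strategy = {}
    for name, spec in options.items():
        parent = spec["requires"]
        if parent is None or strategy.get(parent[0]) == parent[1]:
            strategy[name] = rng.choice(spec["domain"])
    return strategy
```

Because `av` is sampled before `acc`, a value for `acc` is only ever drawn when `av` came out as on.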

Even under the assumption of option independence, the mean time in which a random strategy solves a new problem can be strongly influenced by the value distributions for each option. This is because some option values are rarely useful and may even substantially reduce the prover performance, for example, if they lead to a highly incomplete strategy.<sup>5</sup> Nevertheless, not to preclude the

<sup>5</sup> An extreme example is turning off *binary resolution*, the main inference for nonequational reasoning. This can still be useful, for instance when replaced by *unit resulting resolution* [15], but our sampling needs to discover this by chance.

**Fig. 2.** Strategy optimization scatter plots. *Left:* time needed to solve strategy's witness problem (a log–log plot). *Right:* the total number of problems (in thousands) solved.

possibility of discovering arbitrarily wild strategies, we initially sampled every option uniformly where possible.<sup>6</sup>

Once we collected enough strategies,<sup>7</sup> we updated the frequencies for sampling finite-domain options (which make up the majority of all options) by counting how many times each value occurred in a strategy that, at the moment of its discovery, solved a previously unsolved problem. (This was done before a strategy got optimized. Otherwise the frequencies would be skewed toward the default, especially for option values that rarely help but almost never hurt.)
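The frequency update described above can be sketched as follows; the data structures (a list of successful strategies as option-value dicts, plus per-option domains) and the add-one smoothing are our assumptions, not details from the paper:

```python
from collections import Counter

def updated_frequencies(successful_strategies, option_domains):
    """Re-estimate per-option value frequencies from the strategies that, at
    discovery time, solved a previously unsolved problem (a sketch)."""
    freqs = {}
    for opt, domain in option_domains.items():
        counts = Counter(s[opt] for s in successful_strategies if opt in s)
        total = sum(counts.values())
        # add-one smoothing (our assumption) so no value gets probability 0
        freqs[opt] = {v: (counts[v] + 1) / (total + len(domain))
                      for v in domain}
    return freqs
```

The resulting distributions replace the uniform ones when sampling further strategies, biasing discovery toward values that have already proved useful.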

The effect of using an updated sampling distribution for strategy discovery can be seen in Fig. 1 (left). We ran two independent versions of the discovery process, one with the uniform distribution and one with the updated distribution. We abandoned the uniform one after approximately 5 d, by which time it had covered 6324 problems compared to 6607 covered with the help of the updated distribution at the same mark. We can see that the rate at which we were able to solve new problems became substantially higher with the updated distribution.

### **3.4 Impact of Strategy Optimization**

Once random probing finds a new strategy s that solves a new problem p, the task of optimization (recall Sect. 2.1) is to search the option-value neighborhood of s for a strategy s′ that solves p in as few instructions as possible and preferably uses default option values (where this does not compromise performance on p).

The impact of optimization is demonstrated in Fig. 2. On the left, we can see that, almost invariably, optimization substantially improves the performance of the discovered strategy on its witness problem p. The geometric mean of the improvement ratio we observed was 4.2 (and the median 3.2). The right

<sup>6</sup> Exceptions were: 1. The ratios: e.g., for age weight ratio we sampled uniformly its binary logarithm (in the range −10 to 4) and turned this into a ratio afterward (thus getting values between 1 : 1024 and 16 : 1); 2. Unbounded integers (an example being the *naming threshold* [20]), for which we used a geometric distribution instead.

<sup>7</sup> This was done in an earlier version of the main experiment.

scatter plot shows the overall performance of each strategy.<sup>8</sup> Here, the observed improvement is ×1.09 on average (median 1.03), and the improvement is solely an effect of setting option values to default where possible (without this feature, we would get a geometric mean improvement of 0.84 and a median of 0.91). In this sense, the tendency to pick default values regularizes the strategies, making them more powerful also on problems other than their witness problem.

### **3.5 Parsing Does Not Count**

When collecting the performance data about the strategies, we decided to ignore the time it takes Vampire to parse the input problem. This was also reflected in the instruction limiting, so that running Vampire with a limit of, e.g., 2000 Mi would allow a problem to be solved if it takes at most 2000 Mi on top of what is necessary to parse the problem.

The main reason for this decision is that Vampire, in its strategy scheduling mode, starts dispatching strategies only after having parsed the problem, which is done only once. Thus, from the perspective of the individual strategies, parsing time is a form of sunk cost, something that has already been paid.

Although more complex approaches to taking parse time into account when optimizing schedules are possible, we in this work simply pretend that problem parsing always costs 0 instructions. This should be taken into account when interpreting our simulated performance results reported next (in Sect. 4, but also in Sect. 6.2).

# **4 One Schedule to Cover Them All**

Having collected our strategies, let us pretend that we already know how to construct a schedule (to be detailed in Sect. 5) and use this ability to answer some pressing questions, most notably: How much can we now benefit?

Figure 3 plots the cumulative performance (a.k.a. "cactus plot") of the schedules we could build after 2 h, 6 h, 1 day, and the full 20.1 days of strategy discovery. The dashed vertical line denotes the time limit of 256 000 Mi, which roughly corresponds to a 2-minute prover run. For reference, we also plot the behavior of Vampire's default strategy. We can see that already after two hours of strategy discovery, we could construct a schedule improving on the default strategy by 26% (from 4264 to 5403 problems solved). Although the added value per hour spent searching gradually drops, the 20.1-day schedule is still 4% better than the 1-day one (improving from 6197 to 6449 at 256 000 Mi).

The plot's x-axis ends at 8·256 000 Mi, which roughly corresponds to the time limit used by the most recent CASC competitions [27] in the FOF division (i.e., 2 min on 8 cores). The strongest schedule shown in the figure manages to

<sup>8</sup> In a previous version of the main experiment, we evaluated each strategy both before and after optimization, which gave rise to this plot.

**Fig. 3.** Cumulative performance of several greedy schedules, each using a subset of the discovered strategies as gathered in time, compared with Vampire's default strategy

solve 6789 problems (of the 6796 covered in total) at that mark.<sup>9</sup> We remark that this schedule, in the end, employs only 577 of the 1096 available strategies, which points towards a noticeable redundancy in the strategy discovery process.

One way to fit all the solvable problems below the CASC budget would be to use a standard trick and split the totality of problems P into two or more easy-to-define syntactic classes (e.g., Horn problems, problems with equality, large problems, etc.) and construct dedicated schedules for each class in isolation. The prover can then be dispatched to run an appropriate schedule once the input problem features are read. We do not explore this option here. Intuitively, by splitting P into smaller subsets, the risk of overfitting to just the problems for which the strategies were discovered increases, and we mainly want to explore here the opposite, the ability of a schedule to generalize to unseen problems.

### **5 Greedy Schedule Construction**

Having collected a set of strategies S and evaluated each on the problems in P, let us denote by $E^s_p : S \times P \to \mathbb{N} \cup \{\infty\}$ the *evaluation matrix*, which records the obtained solution times (and uses $\infty$ to signify a failure to solve a problem within the evaluation time limit used). Given a time budget T, the *schedule construction problem* (SCP) is the task of assigning a time limit $t_s \in \mathbb{N}$ to every strategy $s \in S$, such that the number of covered problems

$$\left| \bigcup\_{s \in S} \{ p \in P \mid E\_p^s \le t\_s \} \right|,$$

subject to the constraint $\sum_{s \in S} t_s \le T$, is maximized.

<sup>9</sup> It is possible to solve all 6796 covered problems with a schedule that spans 2 582 228 Mi. This is the optimum length – no shorter schedule solving all covered problems exists.



To obtain a schedule as a sequence (as defined in Sect. 2), we would need to order the strategies having $t_s > 0$. This can, in practice, be done in various ways, but since the order does not influence the predicted performance of the schedule under the budget T, we keep it here unspecified (and refer to the mere time assignment $t_s$ as a *pre-schedule* where the distinction matters).

Although it is straightforward to encode SCP as a mixed integer program and attempt to solve it exactly (though it is an NP-hard problem), an adaptation of a greedy heuristic from the closely related (budgeted) maximum coverage problem [3,13] works surprisingly well in practice and runs in time polynomial in the size of $E^s_p$. The key idea is to greedily maximize the number of newly covered problems *divided* by the amount of time this additionally requires.

Algorithm 1 shows the corresponding pseudocode. It starts from an empty schedule t*<sup>s</sup>* and iteratively extends it in a greedy fashion. The key criterion appears on line 4. Note that this line corresponds to an iteration over all available strategies S and, for each strategy s, all meaningful time limits (which are only those where a new problem gets solved by s, so their number is bounded by |P|).

Algorithm 1 departs from the obvious adaptation of the above-mentioned greedy algorithm for the set covering problem [3] in that we allow extending a slice of a strategy $s$ that is already included in the schedule (that is, has $t_s > 0$) and "charge the extension" only for the additional time it claims (i.e., $t - t_s$). This *slice extension* trick turns out to be important for good performance.<sup>10</sup>
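A compact sketch of the greedy construction with the slice-extension trick, based on our reading of the description above (not the authors' implementation); the evaluation matrix is represented as a nested dict where a missing entry plays the role of $\infty$:

```python
def greedy_schedule(E, budget):
    """Greedy SCP heuristic with slice extension.
    E: {strategy: {problem: solution_time}}; missing entries mean "unsolved".
    Returns a pre-schedule {strategy: time_limit} with total time <= budget."""
    t = {s: 0 for s in E}          # current pre-schedule
    covered = set()
    used = 0
    while True:
        best = None                # (ratio, strategy, limit, newly_covered)
        for s, times in E.items():
            # meaningful limits: only times at which s solves some problem
            for limit in sorted(set(times.values())):
                if limit <= t[s] or used + (limit - t[s]) > budget:
                    continue
                newly = {p for p, tp in times.items()
                         if tp <= limit and p not in covered}
                if not newly:
                    continue
                # charge an extension only for the additional time it claims
                ratio = len(newly) / (limit - t[s])
                if best is None or ratio > best[0]:
                    best = (ratio, s, limit, newly)
        if best is None:
            return t
        _, s, limit, newly = best
        used += limit - t[s]
        t[s] = limit
        covered |= newly
```

For each strategy, only limits at which a new problem gets solved need to be considered, bounding the inner loop by |P| as noted above.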

#### **5.1 Do We Need a Budget?**

A budget-less version of Algorithm 1 is easy to obtain (imagine T being very large). When running on a real-world $E^s_p$ (from evaluated Vampire strategies), we noticed that the length of a typical extension $(t - t_s)$ tends to be small relative to the current used-up time $\sum_{s \in S} t_s$ and that the presence of a budget starts affecting the result only when the used-up time comes close to the budget.

As a consequence, if we run a budget-less version and, after each iteration, record the current total time $\sum_{s \in S} t_s$ together with the number of problems covered so far, we get a good estimate (in a single run) of how the algorithm would perform for a whole (densely inhabited) sequence of

<sup>10</sup> The degree of importance of slice extension can be observed in Fig. 5 in Appendix A.

relevant budgets. This is how the plot in Fig. 3 was obtained. Note that this would be prohibitively expensive to do when trying to solve the SCP optimally.

We can also use this observation in an actual prover. If we record and store a journal of the budget-less run, remembering which strategy got extended in each iteration and by how much, we can, given a concrete budget T, quickly replay the journal just to the point before filling up T, and thus get a good schedule for the budget T without having to optimize specifically for T.
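Replaying such a journal is straightforward; a sketch, assuming the journal is a list of (strategy, extension) pairs in the order the budget-less construction made them:

```python
def replay(journal, budget):
    """Replay a budget-less greedy run's journal up to a concrete budget T:
    apply extensions in order and stop just before the budget would be exceeded."""
    t, used = {}, 0
    for s, extra in journal:
        if used + extra > budget:
            break
        t[s] = t.get(s, 0) + extra
        used += extra
    return t
```

Replaying a stored journal takes time linear in its length, so a good schedule for any concrete budget T is obtained essentially for free.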

# **6 Regularization in Schedule Construction**

To estimate the future performance of a constructed schedule on previously unseen problems, we adopt the standard methodology used in statistics: we randomly split our problem set P into a train set $P_{\mathit{train}}$ and a test set $P_{\mathit{test}}$, construct a schedule for the first, and evaluate it on the second.

To reduce the variance of the estimate, we use many such random splits and average the results. In the experiments reported in the following, we actually compute an average over several rounds of 5-fold cross-validation [6]. This means that the size of $P_{\mathit{train}}$ is always 80.0% and the size of $P_{\mathit{test}}$ 20.0% of our problem set P. However, we *re-scale* the reported train and test performance back to the size of the whole problem set P to express them in units that are immediately comparable. We note that the reported performance is obtained through simulation, i.e., it is based only on the evaluation matrix $E^s_p$.

*Training Strategy Sets.* We retroactively simulate the effect of discovering strategies only for the current training problems $P_{\mathit{train}}$. Given our collected pool of strategies S, we obtain the training strategy set $S_{\mathit{train}}$ by excluding those strategies from S whose witness problem lies outside $P_{\mathit{train}}$ (cf. Sect. 3.2). When a schedule is optimized on the problem set $P_{\mathit{train}}$, the training data consists of the results of the evaluations of the strategies $S_{\mathit{train}}$ on the problems $P_{\mathit{train}}$.

# **6.1 Regularization Methods**

We propose several modifications of greedy schedule construction (Algorithm 1) with the aim of improving its performance on unseen problems (the test set performance) while possibly sacrificing some of its training performance.

We observed that the base version could often solve more test problems by assigning more time to strategies introduced into the schedule in early iterations, at the expense of strategies added later (the latter presumably covering just a few expensive training problems and being over-specialized to them). Most of the modifications described next assign more time to strategies added during early iterations, each according to a different heuristic.

**Slack.** The most straightforward regularization we explored extends each nonzero strategy time limit $t_s$ in the schedule by multiplying it by the multiplicative slack $w \ge 1$ and adding the additive slack $b \in \{0, 1, \ldots\}$. For each $t_s > 0$, the new limit $t'_s$ is therefore $t_s \cdot w + b$. To avoid overshooting the budget, we keep track of the total length of the extended schedule during the construction (the implementation details are slightly more complicated but not immediately important). The parameters $w$ and $b$ control the degree of regularization, and with $w = 1$ and $b = 0$, we get the base algorithm.

**Temporal Reward Adjustment.** In each iteration of the base greedy algorithm, we select a combination of strategy $s$ and time limit $t$ that maximizes the number of newly solved problems $n$ per time $t$. Intuitively, the relative degree to which these two quantities influence the selection is arbitrary. To allow stressing $n$ more or less with respect to $t$, we exponentiate $n$ by a regularization parameter $\alpha \ge 0$, so the decision criterion becomes $n^\alpha / t$.

For small values of $\alpha$, the algorithm values the time more and becomes eager to solve problems early. For large values of $\alpha$, on the other hand, the algorithm values the problems more and prefers longer slices that cover more problems. For example, for $\alpha = 1.5$, the algorithm prefers solving 2 problems in 5000 Mi to solving 1 problem in 2000 Mi. Compare this to $\alpha = 1$ (the base algorithm), which would rank these slices the other way around.

**Diminishing Problem Rewards.** By covering a training problem with more than one strategy, we cover it robustly: When a similar testing problem is solved by only one of these strategies, the schedule still manages to solve it. However, the base greedy algorithm does not strive to cover any problem more than once: as soon as a problem is covered by one strategy, this problem stops participating in the scheduling criterion. This is the case even when covering the problem again would cost relatively little time.

Regularization by diminishing problem rewards covers problems robustly by rewarding strategy $s$ not only for the number of *new* problems it covers but also for the problems covered by $s$ that are already covered by the schedule. This is achieved by modifying the slice selection criterion. Instead of maximizing the number of new problems solved per time, we maximize the total reward per time, which is defined as follows: Each problem contributes the reward $\beta^k$, where $k$ is the number of times the schedule has covered the problem and $\beta$ is a regularization parameter ($0 \le \beta \le 1$). We define $0^0 = 1$ so that $\beta = 0$ preserves the original behavior of the base algorithm.

For example, for $\beta = 0.1$, each problem contributes the reward 1 the first time it is covered, 0.1 the second time, 0.01 the third time, etc. Informally, the algorithm values covering a problem a second time in time $t$ as much as covering a new problem in time $10 \cdot t$.

These modifications are independent and can be arbitrarily combined.
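One plausible way to combine the temporal reward adjustment and diminishing problem rewards into a single slice-selection score (a sketch; the exact combination used in the authors' implementation may differ):

```python
def slice_score(cover_counts, extra_time, alpha=1.0, beta=0.0):
    """Regularized slice-selection criterion.
    cover_counts: for each problem the candidate slice solves, the number of
    times the schedule already covers it (0 for newly covered problems).
    extra_time: the additional time the slice (extension) claims."""
    # each problem contributes beta**k; note 0 ** 0 == 1 in Python, so
    # beta=0 counts exactly the newly covered problems (the base criterion)
    reward = sum(beta ** k for k in cover_counts)
    return reward ** alpha / extra_time
```

With the defaults (alpha=1, beta=0) this reduces to the base criterion of new problems per unit time, and it reproduces the ranking examples given in the text.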

### **6.2 Experimental Results**

We evaluated the behavior of the previously proposed techniques using three time budgets: 16 000 Mi (≈8 s), 64 000 Mi (≈32 s), and 256 000 Mi (≈2 min).

*Optimal Schedule Constructor.* In the existing approaches to the construction of strategy schedules [7,24], it is common to encode the SCP (see Sect. 5) as a mixed-integer program and use a MIP solver to find an exact solution. We implemented such an optimal schedule construction (OSC) by encoding the problem<sup>11</sup> in Gurobi [5] (ver. 10.0.3) and compared OSC to the base greedy schedule construction (Algorithm 1) on 10 random 80 : 20 splits.

For the budget of 256 000 Mi, it takes Gurobi over 16 h to find an optimal schedule, whereas the greedy algorithm finds a schedule in less than a minute. The optimal schedule solves, on average, 45.0 (resp. 8.5) more problems than the greedy schedule on $P_{\mathit{train}}$ (resp. on $P_{\mathit{test}}$) when re-scaled to |P|. For the 16 000 Mi and 64 000 Mi budgets, Gurobi does not find an optimal schedule within a reasonable time limit. For example, after 24 h, the relative gaps between the lower and upper objective bounds are 5.38% and 1.43%, respectively. This makes OSC impractical to use as a baseline for our regularization experiments.<sup>12</sup>

*Regularization of the Greedy Algorithm.* To estimate the performance of the proposed regularization methods, we evaluated each variant on 50 random splits (10 times 5-fold cross-validation). We assessed the algorithm's response to each regularization parameter in isolation. For each parameter, we evaluated regularly spaced values from a promising interval covering the default value (b = 0, w = 1, α = 1, β = 0). Figure 4 demonstrates the effect of these variations on the train and test performance for the budget 64 000 Mi.<sup>13</sup>

Temporal reward adjustment was the most powerful of the regularizations, improving test performance for all the evaluated values of α between 1.1 and 2.0. Surprisingly, the values 1.1 and 1.2 also improved the train performance, suggesting that the default greedy algorithm is too time-aggressive on our dataset.

Table 1 compares the performance of notable configurations of the greedy algorithm. Specifically, we include evaluations of the base greedy algorithm and the best of the evaluated parameter values for each of the regularizations. The table also illustrates the effect of regularizations on the computational time of the greedy schedule construction: β > 0 slows the procedure down and α > 1 speeds it up.

In a subsequent experiment, we searched for a strong combination of regularizations by local search from the strongest single-parameter regularization (α = 1.7). This yielded a negligible improvement over α = 1.7: The best observed test performance was 5707 (α = 1.7 and b = 30), compared to 5704 of α = 1.7.

Finally, we briefly explored the interactions between the budget and the optimal values of the regularization parameters. For each of the three budgets of interest and each of the regularization parameters, we identified the best param-

<sup>11</sup> We used a straightforward encoding similar to the encoding described by Holden and Korovin [7].

<sup>12</sup> A better encoding and solver settings may improve on this. However, we suspect the problem to be hard; we tried some modifications with similar (or worse) results.

<sup>13</sup> This budget seems to be the most practically relevant (e.g., for the application in interactive theorem provers). The other two budgets are detailed in Appendix A.

**Fig. 4.** Train and test performance of various regularizations of the greedy schedule construction algorithm for the budget 64 000 Mi. Performance is the mean number of problems solved out of 7866 across 50 splits. The label of each point denotes the value of the respective regularization parameter.

eter value from the evaluation grid. Table 2 shows that the best configurations of all the regularizations except multiplicative slack vary across budgets.<sup>14</sup>

# **7 Related Work**

Outside the realm of theorem proving, strategy discovery belongs to the topic of *algorithm configuration* [22], where the task is to look for a strong configuration of a parameterized algorithm automatically. Prominent general-purpose algorithm configuration procedures include ParamILS [8] and SMAC [17].

To gather a portfolio of complementary configurations, Hydra [36] searches for them in rounds, trying to maximize the marginal contribution against all the configurations identified previously. Cedalion [25] is interesting in that it maximizes such contribution *per unit time*, similarly to our heuristic for greedy schedule construction. Both have in common that they, a priori, consider all the input problems in their criterion. BliStr and related approaches [7,9,11,12,33], on the other hand, combine strategy improvement with problem clustering to breed strategies that are "local experts" on similar problems. Spider [34] is even more radical in this direction and optimizes each strategy on a single problem.<sup>15</sup>

<sup>14</sup> See a more detailed comparison in Appendix A.

<sup>15</sup> Although the preference for default option values as a secondary criterion, at the same time, helps to push for good general performance (see Sect. 3.4).


**Table 1.** Comparison of regularizations of the greedy schedule construction algorithm for the budget 64 000 Mi. Performance is the mean number of problems solved out of 7866 across 50 splits. Time to fit is the mean time to construct a schedule in seconds.

**Table 2.** Best observed values of regularization parameters for various budgets


Once a portfolio of strategies is known, it may be used in one of several ways to solve a new input problem: execute all strategies in parallel [36], select a single strategy [9], select one of pre-computed schedules [7], construct a custom strategy schedule [19], schedule strategies dynamically [16], or use a pre-computed static schedule [12,24]. The latter is the approach we explored in this work.

A popular approach to construct a static schedule (besides solving SCP optimally [7,24]) is to greedily stack uniformly-timed slices [12].<sup>16</sup> Regularization in this context is discussed by Jakubuv et al. [10]. Finally, a different greedy approach to schedule construction was already proposed in *p-SETHEO* [35].

# **8 Conclusion**

In this work, we conducted an independent evaluation of Spider-style [34] strategy discovery and schedule creation. Focusing on the FOF fragment of the TPTP library, we collected over a thousand Vampire proving strategies, each a priori optimized to perform well on a single problem. Using these strategies, it is easy to construct a single monolithic schedule which covers most of the problems

<sup>16</sup> However, uniformly-timed slices only get close to the performance of our greedy schedule at a small region depending on the slice time limit used.

known to be solvable within the budget used by the CASC competition. This suggests that, for CASC not to be mainly a competition in memorization, it is essential to use a substantial set of previously unseen problems each year.

To construct strong schedules using the discovered strategies, we proposed a greedy schedule construction procedure, which can compete with optimal approaches. For a time budget of approximately 2 min, the greedy algorithm takes less than a minute to produce a schedule that solves more than 99.0% as many problems as an optimal schedule, which takes more than 16 h to generate. For shorter time budgets, optimal schedule construction is no longer feasible, while greedy construction still produces relatively strong schedules.

This surprising strength of the greedy scheduler can be further reinforced by various regularization mechanisms, which constitute the main contribution of this work. An appropriately chosen regularization allows us to outperform the optimal schedule on unseen problems. Finally, the runtime speed and simplicity of the greedy schedule construction algorithm and the regularization techniques make them attractive for reuse and further experimentation.

**Acknowledgment.** This work was supported by the Czech Science Foundation project no. 20-06390Y (JUNIOR grant), the European Regional Development Fund under the Czech project AI&Reasoning no. CZ.02.1.01/0.0/0.0/15 003/0000466, the project RICAIP no. 857306 under the EU-H2020 programme, the Grant Agency of the Czech Technical University in Prague, grant no. SGS20/215/OHK3/3T/37, and the Czech Science Foundation project no. 24-12759S.

# **A Experimental Results on Various Budgets**

In addition to the budget of 64 000 Mi (approx. 32 s), which we discussed in Sect. 6.2, we evaluated the schedule construction algorithms on the budgets of 16 000 Mi (approx. 8 s) and 256 000 Mi (approx. 2 min). Figure 5 shows the results of these evaluations. In particular, it shows that temporal reward adjustment is the most powerful of the regularizations under consideration for all of these budgets, and that the optimal values of most of the regularization parameters vary across budgets.

To demonstrate the effect of the slice extension trick described in Sect. 5, we also include two weaker versions of the base greedy algorithm: one without slice extension and one with the slice extension restricted to the most recent slice ("conservative slice extension"). Both of these modifications allow including any single strategy in the schedule more than once, which is implemented in a straightforward fashion.

**Fig. 5.** Train and test performance of various regularizations of the greedy schedule construction algorithm for the budgets 256 000 Mi (*top*), 64 000 Mi (*middle*), and 16 000 Mi (*bottom*). Performance is the mean number of problems solved out of 7866 across 50 splits. The label of each point denotes the value of the respective regularization parameter.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Lemma Discovery and Strategies for Automated Induction

Sólrún Halla Einarsdóttir<sup>1(B)</sup>, Márton Hajdu<sup>2</sup>, Moa Johansson<sup>1</sup>, Nicholas Smallbone<sup>1</sup>, and Martin Suda<sup>3</sup>

<sup>1</sup> Chalmers University of Technology, Gothenburg, Sweden
{slrn,jomoa,nicsma}@chalmers.se
<sup>2</sup> TU Wien, Vienna, Austria
marton.hajdu@tuwien.ac.at
<sup>3</sup> Czech Technical University in Prague, Prague, Czech Republic
martin.suda@cvut.cz

Abstract. We investigate how the automated inductive proof capabilities of the first-order prover Vampire can be improved by adding lemmas conjectured by the QuickSpec theory exploration system and by training strategy schedules specialized for inductive proofs. We find that adding lemmas improves performance (measured in number of proofs found for benchmark problems) by 40% compared to Vampire's plain structural induction as baseline. Strategy training alone increases the number of proofs found by 130%, and the two methods in combination provide an increase of 183%. By combining strategy training and lemma discovery we can prove more inductive benchmarks than previous state-of-the-art inductive proof systems (HipSpec and CVC4).

Keywords: Induction · Theory Exploration · Lemma Discovery · Strategies · Vampire

# 1 Introduction

We have experimented with augmenting Vampire's capabilities for induction by injecting extra lemmas suggested by the theory exploration system QuickSpec [25] and by training strategy schedules specialized for inductive proofs. Our aim is to improve on the state of the art in automating proofs by induction.

Proofs by induction provide a challenge for automated theorem provers. Not only are there typically many choices of which induction scheme to use, but a proof may also require the conjecture to be generalized to strengthen the inductive hypothesis, or require additional auxiliary lemmas, themselves needing another induction to prove. For example, suppose we have a recursively defined function *rev* for reversing lists, defined using the append function ++:

$$\begin{aligned} rev\left[\right] &= \left[\right] \\ rev\left(x:xs\right) &= \left(rev\ xs\right) ++ \left(x:\left[\right]\right) \\ \end{aligned}$$

where ++ is defined as follows:

$$\begin{aligned} \left[\,\right] ++ xs &= xs \\ \left( x : xs \right) ++ ys &= x : (xs ++ ys) \end{aligned}$$

and want to prove that *rev*(*rev*(*xs*)) = *xs* for any list *xs*. When we ask Vampire to prove this using structural induction, it is unable to find a proof, even when given a long time. The induction hypothesis *rev*(*rev*(*xs*)) = *xs* is not strong enough to prove that *rev*(*rev*(*x* : *xs*)) = *x* : *xs*: we are missing some *lemmas*.

QuickSpec [25] is a system that produces equational conjectures from function definitions. Suppose we use QuickSpec to conjecture some lemmas about the *rev* and ++ functions. In under 1.5 s (running on a regular laptop<sup>1</sup>) QuickSpec outputs the following 9 equations as unproved conjectures:

1. *rev* [ ] = [ ]
2. *x* ++ [ ] = *x*
3. [ ] ++ *x* = *x*
4. *rev* (*rev x*) = *x*
5. *rev* (*x* : [ ]) = *x* : [ ]
6. (*x* ++ *y*) ++ *z* = *x* ++ (*y* ++ *z*)
7. *x* : (*y* ++ *z*) = (*x* : *y*) ++ *z*
8. *rev x* ++ *rev y* = *rev* (*y* ++ *x*)
9. *xs* ++ (*y* : (*z* : [ ])) = *rev* (*z* : (*y* : (*rev xs*)))

Now suppose we add these equations to the input we give to Vampire, marking them as conjectured lemmas. Vampire may use such a lemma in a proof, but only if it also proves the lemma itself (e.g. by induction). Vampire instantly (in 6 ms, running on the same laptop) finds a proof of the original property, using (2), (6), and (8) above as lemmas, as well as proofs for the lemmas that were used. A closer investigation shows that only (6) and (8) are necessary to find a proof: (8) is used in the proof of the original goal and (6) is used to prove (8).
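To see concretely how lemma (8) unblocks the induction, here is a sketch (our reconstruction) of the step case; (6) is needed, via a separate induction, to prove (8):

$$\begin{aligned}
rev\,(rev\,(x:xs)) &= rev\,((rev\ xs) ++ (x:[\,])) && \text{definition of } rev\\
&= rev\,(x:[\,]) ++ rev\,(rev\ xs) && \text{lemma (8)}\\
&= (x:[\,]) ++ rev\,(rev\ xs) && \text{unfolding } rev \text{ and } ++\\
&= x : rev\,(rev\ xs) && \text{definition of } ++\\
&= x : xs && \text{induction hypothesis}
\end{aligned}$$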

Coming up with lemmas is a non-trivial task, and has sparked research into various lemma discovery techniques (see [17] for an overview). Lemma discovery can broadly be divided into two categories: *Top-down* techniques include attempting to generalize the current subgoal, or analyzing failed proof attempts to suggest a missing lemma. *Bottom-up* techniques focus on discovering potentially interesting lemmas about the definitions and concepts available, without considering any particular ongoing proof attempts. Bottom-up techniques can find a wider class of lemmas, but have the disadvantage that the system spends time working with conjectures that are not relevant to the goal. For example, the earlier system HipSpec [5] would first run QuickSpec (just as in the example above) but then attempt to prove *all* discovered conjectures before working on the main goal.

<sup>1</sup> The same laptop the experiments in Sect. 4 were run on, see more precise description there.

In this work, we use theory exploration, a bottom-up technique, in a more goal-directed manner. We use QuickSpec to suggest useful lemmas, but we will not prove *all* the suggestions, only those that are useful in the proof of the main goal. To do this we leverage Vampire's AVATAR architecture [20,29], which allows us to attempt (speculatively, in parallel) the proof of the main goal using any subset of the candidate lemmas. Lemmas that are used must also be independently proved, but if that turns out to be hard (or even impossible), other ways of finishing the proof remain open. Non-useful conjectures can be ignored and need not be proved, saving time. Since automatic theorem provers (ATPs) like Vampire and cvc5 now natively support automated induction [11,22], it is no longer necessary to use a specialized prover to apply induction before sending the resulting proof obligations to an ATP, as HipSpec did, and we examine the differences between the two approaches.

The performance of ATPs like Vampire is heavily influenced by the use of proving *strategies* and their combinations into *schedules* [15,27,28,30]. In addition to investigating the influence of adding lemmas from theory exploration, we also experiment with various learned strategies tailored for inductive proofs. A specialized strategy may allow Vampire to invent some easy lemmas itself, by generalizing a suitable subterm in a goal, lessening the need for theory exploration. However, finding strong targeted strategies is a time-consuming endeavour which requires a set of problems with similar characteristics to those we are interested in proving. For regular users, who typically just want to apply Vampire out of the box, this might not be an option.

# 2 Background

We propose the following design for an inductive theorem proving system:

1. Run QuickSpec on the function definitions in the problem to conjecture candidate lemmas.
2. Add the conjectures to the problem as claims and run Vampire on the result.
Using our tools these two steps can be performed fully automatically, taking a problem file in the TIP [7] format as input and returning the proof found by Vampire as output.

### 2.1 QuickSpec

As seen in Sect. 1, QuickSpec is a system that produces equational *conjectures* about a theory. The conjectures are not guaranteed to be true, but have been tested to hold on 1000 randomly-generated test cases.<sup>2</sup> QuickSpec was originally designed to make conjectures about Haskell programs, but has been adapted to problems in inductive theorem proving.

<sup>2</sup> In automated reasoning terms, this means that 1000 ground instances of the conjecture have been shown to hold.

Conjecturing equations is difficult because of combinatorial explosion: even if we consider only quite simple equations and theories, there are many millions of possible conjectures. For example, if we identify a set of *n* = 10,000 interesting *terms*, then there are *n*<sup>2</sup> = 100,000,000 candidate equations which could be built from those terms. Generating and testing all of them is out of the question.

QuickSpec uses a more sophisticated approach which scales with the number of *terms* (e.g. 10,000) rather than the number of possible *equations* (e.g. 100,000,000). We enumerate terms in order of size (these terms may end up being the left or right hand side of an equational conjecture). We consider each term one by one, building up two sets as we go:

– a set of discovered *equations* (the conjectures produced so far);
– a set of *representative terms*, containing one term from each equivalence class encountered so far.
Each time we consider a new term *t*, we answer the following question: *Is it equal to any representative term?* We do this in two steps:

1. Check whether the discovered equations already imply *t* = *r* for some representative term *r*.
2. Otherwise, evaluate *t* on the random test cases; if it gives the same results as some representative term *r*, conjecture the equation *t* = *r*.
If neither case holds, we add *t* as a representative term. The idea here is that, in case (1), the equation *t* = *r* is redundant – we knew it already – whereas in case (2), it is new information and hence a potentially useful conjecture.

For example, suppose we take the list append function ++ and consider the following terms, where *x*, *y* and *z* range over lists: [ ], *x*, *y*, *z*, *x* ++ [ ], *y* ++ [ ], *x* ++ (*y* ++ *z*), *x* ++ (*x* ++ *y*), (*x* ++ *y*) ++ *z*, (*x* ++ *x*) ++ *y*.<sup>3</sup> Initially, the set of discovered equations and the set of representative terms are empty. The algorithm proceeds as follows:

– [ ], *x*, *y* and *z* are distinguished by testing and each becomes a representative term.
– *x* ++ [ ] evaluates the same as *x* on all test cases, giving conjecture (1) *x* ++ [ ] = *x*.
– *y* ++ [ ] = *y* already follows from (1), so it is discarded as redundant.
– *x* ++ (*y* ++ *z*) and *x* ++ (*x* ++ *y*) are not equal to any representative and become representatives.
– (*x* ++ *y*) ++ *z* evaluates the same as *x* ++ (*y* ++ *z*), giving conjecture (2) (*x* ++ *y*) ++ *z* = *x* ++ (*y* ++ *z*).
– (*x* ++ *x*) ++ *y* = *x* ++ (*x* ++ *y*) already follows from (2), so it is discarded.
<sup>3</sup> In reality we enumerate terms in a systematic way, and a further refinement to the algorithm, *schemas*, eliminates many terms that differ from an existing term only in choice of variables.


At the end we have produced the conjectures (1) *x* ++ [ ] = *x* and (2) (*x* ++ *y*) ++ *z* = *x* ++ (*y* ++ *z*). Note that these conjectures are *complete* with respect to the enumerated terms, in the sense that any true equation between two such terms follows from the conjectures. In general, the QuickSpec algorithm produces a complete set of equations in this sense (though not necessarily sound, i.e. we may have false equations if we are unlucky in the testing).
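The exploration loop can be sketched as a small Python program (a toy illustration, not QuickSpec itself: it performs only the testing step, case (2), and omits the redundancy check of case (1), so variable-renamed duplicates are reported as well):

```python
import random

def app(xs, ys):  # list append, ++
    return xs + ys

# candidate terms: (printed form, evaluation function on an environment)
TERMS = [
    ("[]",            lambda e: []),
    ("x",             lambda e: e["x"]),
    ("y",             lambda e: e["y"]),
    ("z",             lambda e: e["z"]),
    ("x ++ []",       lambda e: app(e["x"], [])),
    ("y ++ []",       lambda e: app(e["y"], [])),
    ("x ++ (y ++ z)", lambda e: app(e["x"], app(e["y"], e["z"]))),
    ("x ++ (x ++ y)", lambda e: app(e["x"], app(e["x"], e["y"]))),
    ("(x ++ y) ++ z", lambda e: app(app(e["x"], e["y"]), e["z"])),
    ("(x ++ x) ++ y", lambda e: app(app(e["x"], e["x"]), e["y"])),
]

def random_env(rng):
    # assign each variable a random short list of small integers
    return {v: [rng.randrange(3) for _ in range(rng.randrange(4))]
            for v in "xyz"}

def explore(n_tests=1000, seed=0):
    rng = random.Random(seed)
    envs = [random_env(rng) for _ in range(n_tests)]
    reps, conjectures = {}, []   # fingerprint -> representative term
    for name, fun in TERMS:
        fp = tuple(tuple(fun(e)) for e in envs)  # results on all test cases
        if fp in reps:           # equal to a representative on every test
            conjectures.append((name, reps[fp]))
        else:
            reps[fp] = name      # new representative term
    return conjectures
```

Running `explore()` reports four equations, including *x* ++ [ ] = *x* and (*x* ++ *y*) ++ *z* = *x* ++ (*y* ++ *z*); adding the redundancy check of case (1) would discard the two renamed duplicates, leaving exactly conjectures (1) and (2).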

It is perhaps not obvious why this algorithm should be fast. We point out the following reasons:

– Testing a new term is cheap: we evaluate it on the stored test cases and look the result up among the representatives (e.g. in a hash table), so most terms are dispatched in near-constant time.
– Work beyond this per-term cost is only incurred when a term actually matches a representative, i.e. when a conjecture is discovered or shown redundant.
Therefore, the runtime of QuickSpec largely grows in proportion to the *number of discovered conjectures*, plus a small amount which is proportional to the number of explored terms. In practice, QuickSpec is able to handle theories with ≈ 20 functions and generate equations having ≈ 10 symbols on each side, after which the number of discovered conjectures typically becomes too large.

### 2.2 Induction in Vampire

Vampire supports induction over both term algebras and integers. The former, used in this work, is based on a constructor-style and two infinite descent-style schemas [21] in addition to ad hoc schemas generated from well-founded recursive functions in the search space [13]. When inducting on a term in a unit clause (a literal), an instance of a schema with the negation of the unit clause is added to the search space. A stronger (and also more explosive) feature is non-unit induction, which inducts on arbitrarily many occurrences of a term, possibly across many literals and clauses.
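For illustration (our schematic rendering, not Vampire's internal clausal form), the constructor-style schema for the list term algebra, instantiated with the literal *L*[·] being inducted on, is the familiar structural induction axiom:

$$\big(\, L[\,[\,]\,] \;\wedge\; \forall x,\,xs.\ (L[xs] \rightarrow L[x:xs]) \,\big) \;\rightarrow\; \forall xs.\ L[xs]$$

An instance of this schema, with *L* the negation of the unit clause being inducted on, is what gets added to the search space.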

Some basic lemma generation techniques such as generalizations over complex terms and occurrences [12] as well as active occurrence heuristics are also supported. In the presence of function definitions or induction hypotheses, (unordered) paramodulation may be used to reach lemmas otherwise not reachable with ordered superposition [13]. For a more detailed description of induction in Vampire we refer to [11].

Lemma Generation in Vampire. Vampire uses the traditional top-down backward reasoning approach to generate lemmas. It tries to reduce goals into subgoals and apply inferences on them, interleaved with induction inferences applied to all intermediate consequences that result from this process. A new lemma may be conjectured by generalizing over one of the terms in a subgoal. This lemma generation approach in Vampire usually derives different lemmas than QuickSpec's bottom-up theory exploration approach.

Simplifications and Orderings in Superposition. As superposition is tailored for first-order reasoning, it does not come as a surprise that some techniques that increase the efficiency of first-order reasoning are incompatible with inductive reasoning or higher-order reasoning in general. In particular, simplifications and orderings can affect a built-in induction within superposition.

Simplifications are inferences where one of the premises becomes redundant for further first-order reasoning and can be removed. For example, demodulation rewrites a clause into a smaller clause with an unconditional (unit) equation, and removes the original clause. In inductive reasoning things are not as simple, and any clause (even if it follows from smaller clauses) can be useful to generate interesting lemmas. For example, we might simplify a clause that would give rise to a crucial generalized lemma into a clause that does not give the same generalization anymore. Interestingly, given that simplification steps take up most of the inferences in a saturation run, in our experience this affects inductive reasoning less than expected.

# 3 Implementation

In order to perform our experiments we needed to integrate the lemmas conjectured by QuickSpec into Vampire's proof search, and choose a promising proof search strategy.

### 3.1 Conjectured Lemmas, AVATAR, and Vampire's Claims

Integrating conjectured lemmas into proof search poses a technical challenge as they must be proven before they can be soundly used in a proof. At the same time, trying to prove each suggested lemma before the main goal is even attempted can create a great deal of unnecessary work. As has been noted before [8,21], this challenge can be smoothly overcome in the presence of the AVATAR architecture for clause splitting [20,29].

AVATAR keeps track of information about which clause has been derived from which splitting assumption, and soundly propagates it through inferences. Deriving the empty clause conditioned on some assumptions then does not necessarily mean the search is successfully concluded, but merely signifies that the conjunction of the attached assumptions can no longer be maintained. (AVATAR then updates its propositional model to reflect this newly derived information through a call to an underlying SAT or SMT solver.)

Let us assume we want to accommodate a speculative proof with lemmas *L*<sub>1</sub>*,...,L*<sub>*n*</sub> under AVATAR, where each lemma *L*<sub>*i*</sub> is a closed formula. As a first approximation to explaining how this can be done, let us imagine introducing and immediately splitting the tautologies *L*<sub>*i*</sub> ∨ ¬*L*<sub>*i*</sub> for *i* = 1*,...,n*.<sup>4</sup> Each clause in the search then carries (independently for each *i*) the information whether it depends on: 1) the assumption corresponding to *L*<sub>*i*</sub> (proving with the help of lemma *L*<sub>*i*</sub>), 2) ¬*L*<sub>*i*</sub> (trying to prove lemma *L*<sub>*i*</sub>), or 3) neither of these (currently ignoring lemma *L*<sub>*i*</sub>). Depending on the order in which (conditional) empty clauses get derived, the whole power set of possible scenarios is played out as if in parallel, in which some lemmas may already have been shown to suffice for proving the main conjecture, while themselves waiting to be proven (possibly with the help of other lemmas). The underlying SAT/SMT solver orchestrates the whole endeavour, decides which compatible subset of assumptions will be worked on next, and declares the proof attempt successful as soon as the first such scenario is complete. We remark that cyclic reasoning is automatically avoided by treating the assumptions of *L*<sub>*i*</sub> and ¬*L*<sub>*i*</sub> as mutually exclusive.

It is surprisingly easy to get access to this feature of speculative lemma use in Vampire under AVATAR. In fact, we can rely on a small adjustment of just the parser, added by Andrei Voronkov already in 2011. In the TPTP language [26], this adjustment introduced a new custom formula role called the *claim*. Precisely as in our use case, a claim is a formula that most likely follows from the surrounding axioms and has a high chance of being useful for proving the given conjecture, but must itself also be proven by the system in a valid proof. For this work we extended Vampire's SMT-LIB parser in an analogous way and added a custom construct assert-claim with the same semantics.
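Schematically, a QuickSpec conjecture attached to a problem in SMT-LIB syntax might look as follows (the datatype and function declarations are illustrative; only the assert-claim construct is the extension described above):

```smt2
(declare-datatype lst ((nil) (cons (head Int) (tail lst))))
(declare-fun app (lst lst) lst)

; conjectured lemma (2): x ++ [] = x, to be used speculatively and proved
(assert-claim (forall ((x lst)) (= (app x nil) x)))
```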

<sup>4</sup> In reality, both *L*<sub>*i*</sub> and ¬*L*<sub>*i*</sub> must also be skolemized and clausified, which in the prover happens before splitting. We return to this aspect further below.

Technically, when the parser reads a claim formula *L*, it picks a fresh propositional symbol *p*<sub>*L*</sub> and passes on the equivalence *p*<sub>*L*</sub> ↔ *L* as a standard axiom.<sup>5</sup> The equivalence *p*<sub>*L*</sub> ↔ *L* is then clausified to

{¬*p*<sub>*L*</sub> ∨ *C* | *C* ∈ *CNF*(*L*)} and {*p*<sub>*L*</sub> ∨ *D* | *D* ∈ *CNF*(¬*L*)}*.*

AVATAR recognizes *p*<sub>*L*</sub> and ¬*p*<sub>*L*</sub> as complementary ground components and will then always assert either *p*<sub>*L*</sub> or ¬*p*<sub>*L*</sub>. Thus the first-order part of the prover must work with either the clauses from *CNF*(*L*) or from *CNF*(¬*L*), while AVATAR keeps track of the respective dependencies.
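As a concrete (schematic) instance, take the claim *L* = ∀*x*. *x* ++ [ ] = *x*. Clausifying *p*<sub>*L*</sub> ↔ *L* gives one clause from each direction:

$$\neg p_L \vee \big( x ++ [\,] = x \big) \qquad\text{and}\qquad p_L \vee \big( sk ++ [\,] \neq sk \big)$$

where *sk* is a fresh Skolem constant. Asserting *p*<sub>*L*</sub> makes the lemma available for use, while asserting ¬*p*<sub>*L*</sub> turns the Skolemized negation into a proof obligation.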

### 3.2 Proving Strategies and a New Induction Schedule

A theorem prover typically has many parameters (in Vampire called *options*) that can be changed to adjust the proof search characteristics. In Vampire, there are more than 100 options for configuring the preprocessing steps, the saturation algorithm, generating and simplification rules, proof search heuristics and also induction. By a *strategy* we mean a concrete assignment of values to such options. It has long been known [27,31] that the success rate of an ATP can be dramatically improved by arranging a number of different proving strategies of complementary characteristics into a strategy *schedule*, a sequence of strategies with assigned time budgets, to be executed in sequence (or in parallel).

In this work, we constructed a strategy schedule specifically targeting inductive theorem proving on the TIP benchmarks (see Sect. 4.1). We followed the strategy discovery recipe pioneered by the Spider system [30]. This consists of

– repeatedly sampling random strategies from the space of option values and evaluating them on the training problems, and
– keeping (after local optimization) those strategies that solve problems not covered by the strategies collected so far.
In our case, we sampled strategies from a space defined by a total of 115 base Vampire options and 19 dedicated induction options. Most of these options are Boolean, many are finite enumerations of discrete values and a few are numeric. It is clear that the totality of all strategies is astronomically large and random sampling is a way to have access to all the strategies, at least in principle. We searched for strategies in parallel on 60 cores of our server<sup>6</sup> for several days. In

<sup>5</sup> The only extra effort is to mark and protect the new symbol *p*<sub>*L*</sub> against potential elimination during preprocessing, as, after all, the new equivalence would otherwise qualify as an unused predicate definition and could be discarded.

<sup>6</sup> Equipped with Intel® Xeon® Gold 6140 CPU @ 2.3 GHz and 500 GB RAM.

the end, we collected 246 strategies covering 236 of the 486 TIP benchmarks that we used for training.

Once a sufficiently large set of strategies has been discovered (or when the rate of solving new problems becomes too low to make the search for additional strategies worth the effort), schedule construction can be formulated as an integer programming task, in which running times are assigned to individual strategies to cover the union of as many problems as possible while not exceeding a given overall time bound [15,23]. We instead adapted a greedy algorithm [4] to a weighted set cover formulation of the problem: starting from an empty schedule, we iteratively add a new strategy *s* for an additional *t* units of time if this step is currently the best in terms of 1*/t* · (the number of problems that will additionally get covered). This greedy approach does not guarantee an optimal result, but runs in polynomial time and is easy to implement. (See also [3].)
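A minimal Python sketch of this greedy construction (the data layout, candidate time points, and tie-breaking below are our own illustrative choices, not Vampire's implementation):

```python
def greedy_schedule(solve_times, budget):
    """Greedy weighted set cover for schedule construction.

    solve_times: {strategy: {problem: time the strategy needs to solve it}}
    Returns (schedule, covered), where schedule is a list of (strategy, t).
    """
    covered, schedule, used = set(), [], 0.0
    while True:
        best = None  # (score, strategy, t, newly_covered)
        for s, times in solve_times.items():
            for t in sorted(set(times.values())):  # candidate time budgets
                if used + t > budget:
                    continue
                newly = {p for p, pt in times.items() if pt <= t} - covered
                if newly and (best is None or len(newly) / t > best[0]):
                    best = (len(newly) / t, s, t, newly)
        if best is None:  # no step adds coverage within the budget
            break
        _, s, t, newly = best
        schedule.append((s, t))
        covered |= newly
        used += t
    return schedule, covered

# toy example: two strategies with complementary strengths
sched, covered = greedy_schedule(
    {"A": {"p1": 1, "p2": 5}, "B": {"p2": 2, "p3": 2}}, budget=10)
# A is scheduled for 1 unit (covers p1), then B for 2 (covers p2, p3)
```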

Our final schedule makes use of 66 of the discovered strategies and should be able to solve all of the covered 236 problems in under 12 s (per problem). For our later experiment we prepared a second schedule, specialized to also take into consideration the versions of the TIP benchmarks with the added lemmas (cf. label T in Sect. 4 below). This schedule makes use of 86 strategies, aims to cover a total of 522 problems (version with and without lemmas counted separately) and runs to completion after approximately 24 s.

# 4 Evaluation

In our evaluation we compare several variants of Vampire. We start with two baseline versions without strategy scheduling:

– (V): Vampire with the following flags for structural induction:

-ind struct -indoct on -nui on -to lpo -drc off

The option -ind struct enables using structural induction (constructor-based induction axioms for term algebras), and with -indoct on these axioms are based on generalizing over any term, not just Skolem constants. Moreover, the induction axioms are generated from any clause set using -nui on. Finally, -to lpo and -drc off enable a simplification ordering which is well-suited for handling recursive functions.

– (V + L): Vampire with the same flags active as in (V), plus conjectures from QuickSpec added to the problem files as claims as described in Sect. 3.1.

The idea is that (V) serves as a baseline for what kinds of inductive proofs Vampire is capable of. By comparing (V) with (V + L), we see whether the lemmas discovered by QuickSpec help Vampire.

Next we add versions of Vampire with specialized strategy schedules:

– (S): Vampire with the specialized strategy schedule for inductive problems described in Sect. 3.2.

– (S + L): Vampire with both the strategy schedule as in (S) above and conjectures from QuickSpec added to the problem files as claims.

By comparing (S) with (V+L), we can see the relative importance of strategy scheduling and lemmas. In (S+L) we can see whether the two strategies complement each other. We expect that (S) may see less benefit from lemmas than (V) because the learned strategies may be better at e.g. generalizing subgoals. Note that the strategy schedule is tuned without seeing the lemmas, so it is even possible that (S+L) might perform worse than (S) due to the extra lemmas disturbing the strategy scheduler.

Since the strategy schedule is tuned without seeing the lemmas, (S+L) illustrates what we can get by taking an existing prover with a built-in strategy schedule, and adding new lemmas to it. Notice also that any problems that require lemmas will not be proved during training, so will not influence the schedule. We can use these problems as a kind of test set for S+L, as they are effectively unseen during training.

To investigate the limits of our approach we add a third family:

– (T): Vampire with the second strategy schedule from Sect. 3.2, trained on the TIP problems both with and without the added QuickSpec lemmas.
– (T + L): Vampire with the same strategy schedule as in (T) and conjectures from QuickSpec added to the problem files as claims.
Note that the strategy used by (T) and (T+L) may be prone to overfitting, as all the test problems are seen during training and influence the schedule.<sup>7</sup> The results for (T) and (T+L) are useful as a benchmark to compare the other provers against, and an indication of what a perfectly-tuned strategy schedule could do.

We evaluated our methods on the TIP benchmark set. For all methods the time limit was set to 30 s. Since the strategy schedules are randomized and may not find the same proofs every time they are run, we ran each one 5 times on each problem. The experiments were run on a Dell Inc. Latitude 5320 with an 11th Gen Intel® Core™ i5-1145G7 @ 2.60 GHz × 8 processor and 16 GB RAM. Scripts used to run experiments and process results are available at https://github.com/solrun/vampspec.

#### 4.1 TIP Benchmarks

TIP is a collection of benchmarks specifically for inductive theorem provers [6]. The problems are expressed in a syntax very similar to SMT-LIB [2], and come

<sup>7</sup> Why not use a training/test split? Because there are not very many problems in total, and more importantly, because many problems are related, which makes it hard to design an uncontaminated test set, since we need to avoid having related problems where one is in the training set and one is in the test set.

with tools to translate the problems into various formats (including standard SMT-LIB) as well as built-in support for lemma generation using QuickSpec.

TIP consists of several subsets: the *prod* set contains 50 theorems and 24 lemmas about lists and natural numbers defined in [16], while the *IsaPlanner* set defined in [18] contains 86 properties originally designed to test provers that use the rippling heuristic. The *prod* and *IsaPlanner* problems have previously been used to evaluate a number of inductive theorem provers [5,9,22], so experiments with them enable comparison to previous work.

The *TIP2015* set contains a further 326 problems and was added as many existing provers, like HipSpec, could solve almost all problems in the previous two sets. It includes a variety of problems such as various sorting algorithms with correctness properties expressed in alternative ways, properties of regular expressions, binary search trees, integers implemented on top of natural numbers, natural numbers in binary representation, and properties of various functions on lists and natural numbers. Some of the problems were not known to have been automated at the time of their publication [6] and are offered as challenges.

### 4.2 Results

Table 1 shows the number of proofs found for the 486 TIP benchmarks by the different methods that we described previously. We count a proof as found if it was found in any of the 5 proof attempts using that strategy. We found that Vampire with structural induction enabled, (V), finds proofs for 102 of the problems, which increases to 143 with the addition of lemmas from QuickSpec. The specialized strategy schedule, (S), finds 236 proofs, more than twice as many as (V). The specialized strategy schedule finds some more proofs with the addition of lemmas, but the increase is not as great, from 236 to 263. The strategy schedule trained on problems already containing lemmas, (T), finds 237 proofs (the same proofs as (S) plus one additional proof), which increases to 288 with the addition of lemmas. The bottom line of Table 1 shows the number of proofs found by each method with or without lemmas added.


| | (V) | (S) | (T) |
|---|---|---|---|
| without lemmas | 102 | 236 | 237 |
| with lemmas (+L) | 143 | 263 | 288 |

Table 1. The number of proofs found for the 486 TIP benchmarks when testing the different proof methods in the presence and absence of generated lemmas.

Although all methods find more proofs with lemmas than without, a number of proofs can only be found without the additional lemmas and are lost after lemmas are added. As is often the case with ATPs, different strategies or parameterizations both gain and lose some proofs rather than one simply being strictly better than the other. Table 2 shows this for our three strategies, with and without added lemmas. As mentioned above, there are a small number of proofs (10 for (V), 6 for (S) and 1 for (T)) which are found by each strategy without lemmas, but not after lemmas are added. Since the added lemmas increase the size of the proof search space, we are not surprised that they may in some cases prevent a strategy from finding a proof in time. In the case of the specialized strategy schedule (T), which has already seen the problems with added lemmas in its training, the added lemmas only hinder it from finding a proof in one instance. In the case where (T+L) loses the proof (T) found, *TIP2015/regexp\_RecAtom*, the number of conjectured lemmas added to the problem file is very large (459), which probably causes the search space to explode. For this particular problem the strategies (V) and (S) also found a proof, but no strategy found a proof for the problem with lemmas added.

Note that (T+L) finds 53 proofs not found by (S), showing the improvement achievable by adding QuickSpec's lemma conjecturing and using a strategy schedule specialized to make use of those lemmas, compared to only using strategy schedule training as with (S). Of these 53 problems, (S+L) finds proofs for 33 of them, making use of the added lemmas without having seen them in its training. As mentioned above, the strategy schedule of (S+L) is effectively not trained on any problems that require lemmas, so we can view the 53 problems as test problems, unseen in the training data, all solvable with a perfect strategy, and say that (S+L) solves 62% of those problems. Thus we get an indication that the strategy schedule is generalizing to unseen problems.


Table 2. Here each column shows the number of unique proofs found by the respective method (column label) but not by one of the other methods (row label).

In some cases, one of our methods performs strictly better than another, namely (S) and (T) are strictly better than (V), (T) is strictly better than (S), and (T+L) is strictly better than (S+L). Since (S) and (T) are specialized strategy schedules that can execute many different strategies for each proof attempt, including the strategy used by (V), it is unsurprising that they subsume (V). Since (S) and (T) are strategy schedules trained in the same manner, with the only difference being that (T) is trained on a superset of the problems (S) is trained on, and both evaluated on problems they have encountered in their training, we expect them to achieve a similar performance. Note that (T) only finds one proof not found by (S), so their performance is nearly equivalent. Since (T+L) is evaluated on problems with added lemmas that it has already seen during training while (S+L) is given previously unseen lemmas in its input problems, it would be surprising if (S+L) found a proof that (T+L) could not. In all cases these pairs of methods are evaluated on the same input problems (both are evaluated on problems with additional lemmas or both on the problems without lemmas).

Fig. 1. Time taken to find a proof with lemmas versus without them using the same strategy (on a log scale).

In cases where the same strategy could find a proof both with and without lemmas added to the problem file, we compare the time taken to find a proof as a metric of how easy the proof is to find. Figure 1 shows the plots of the time taken to find a proof with lemmas versus without them using the same strategy (on a log scale). The points around the edges indicate that the respective method did not find a proof within the given time limit (30 s), so points along the right-hand edge indicate that a proof was found with lemmas and not without them, while points along the top edge indicate a proof was found without lemmas and lost after they were added.

For problems where both (V) and (V+L) found a proof, the average time for (V) was 0.67 s with a standard deviation of 3.56 s, while the average time for (V+L) was 1.04 s with a standard deviation of 4.30 s. We can see how in most cases where a proof was found both with and without lemmas added, the proof search went faster without them, indicated by how most of the points are to the left of the diagonal. For problems where both (S) and (S+L) found a proof, the average time for (S) was 0.20 s with a standard deviation of 0.49 s while the average time for (S+L) was 1.53 s with a standard deviation of 4.16 s. We see many points clustered around the diagonal, indicating both proof searches took a similar amount of time to find a proof, though many more points lie to the left of the diagonal than to its right, indicating a faster proof search without lemmas. For problems where both (T) and (T+L) found a proof, the average time for (T) was 0.59 s with a standard deviation of 2.39 s while the average time for (T+L) was 0.37 s with a standard deviation of 1.16 s, so as opposed to methods (V) and (S), the proof time goes down with the addition of lemmas. Since the schedule here was trained on problems containing lemmas, it prioritizes strategies that make use of the available lemmas, thus finding the proofs more efficiently with lemmas.

Table 3. The number of proofs found in different subsets of the TIP benchmarks, along with results for CVC4 and HipSpec for the same subsets.


The problems from the *IsaPlanner* and *prod* subsets of TIP were also used for the evaluation of HipSpec in [5]<sup>8</sup> and of inductive reasoning with CVC4 in [22]. The numbers of proofs they found are included in Table 3 along with the results of our experiments for those subsets.<sup>9</sup> We see a clear difference in results on the *prod*-

<sup>8</sup> In order to investigate whether the numbers for HipSpec would be better on a modern machine, we re-ran it on the *prod* benchmark. We found that it solved *fewer* problems, 44 in all, as a result of slight changes in HipSpec since the publication of [5]. No problems were solved by HipSpec today that were not solved back then.

<sup>9</sup> We tried but failed to run HipSpec on the *TIP2015* problems. HipSpec's input format is a limited dialect of Haskell and, while TIP problems can be converted to Haskell, the dialect is not the same as HipSpec's. As HipSpec is unmaintained, we were unable to go further.

subset, which is designed such that more complicated lemmas are needed for most proofs, and the *IsaPlanner*-subset, which contains easier problems which can often be solved without external lemmas (or with just lemmas coming from generalizations of a subgoal). On both subsets we only achieve results competitive with either CVC4 or HipSpec when we combine a specialized strategy schedule with lemmas, as in (S+L) and (T+L). On the *TIP2015* subset, none of the methods we tested found proofs for even half of the problems, and we leave a closer examination of what is required to achieve better results there as future work.

As described in Sect. 3.2, the strategy schedules may not find the same proofs in every run. In our experiments we ran each schedule 5 times and found a handful of problems where the same strategy schedule would sometimes find a proof and sometimes not; the exact numbers are shown in Table 4. In the results shown in Tables 1–3 we count a proof as found if it was found in at least one of the five runs.

Table 4. Number of inconsistently found proofs by each strategy schedule with and without added lemmas.


# 5 Discussion

Modern day ATPs like Vampire have many moving parts. Slight changes in configuration often lead to some extra proofs being found while others are lost. This is particularly true when considering also proofs by induction, as here the potential for exploding the search space in unproductive directions is even larger. It is often difficult to know in advance what parameters and strategies will affect the capabilities of finding a proof within reasonable time.

Our first experiments tested the effect of adding lemma candidates for inductive proofs to a standard out-of-the-box variant of Vampire, simulating what a regular user might have at hand. Here, we see a clear improvement in the number of proofs for the TIP *prod*-subset, where more complicated lemmas are needed for most of the proofs, and a modest improvement on other subsets. Still, the results are well below both CVC4 and HipSpec. We conclude that simply adding lemmas from QuickSpec to Vampire (with a default induction strategy) is not sufficient to reach state-of-the-art performance.

Secondly, we experimented with training specialized strategies, customized to inductive proofs on TIP problems; this seems to be necessary for top performance. We trained two strategies: the first on proofs of TIP problems without added lemmas (starting from the out-of-the-box Vampire setup), and the second, to obtain an upper bound on how well Vampire could perform, on proofs *with the added lemmas* from QuickSpec. With a customized strategy for induction, Vampire performs much better than before even without lemmas. Notably, the increase from the customized strategy is larger than the increase from adding lemmas to the standard Vampire version! We conclude that specialized strategies have a larger effect on the number of proofs than just adding auxiliary lemmas.

Finally, we added the QuickSpec lemmas. Both strategies now improved even more, especially on the TIP *prod*-subset, where they both beat the previous state of the art, proving 46 and 49 problems, respectively. As expected, the strategy trained on proofs with lemmas (T) showed the larger increase, as it can use the available lemmas more effectively. Interestingly, on the TIP *IsaPlanner*-subset, only the (T) strategy beat the state of the art. We conclude that in the presence of auxiliary lemmas and together with specialized strategies, Vampire can indeed outperform the previous state-of-the-art systems CVC4 and HipSpec.

However, one might argue that the comparison is biased: to get state-of-the-art performance from Vampire requires a strategy optimised by already seeing and trying the problems! This might not be a viable option for all users. We do not know how well these schedules would perform on other types of inductive problems, as they are likely overfitted to TIP to some degree. We could have divided the TIP problems into training and testing sets to try to avoid overfitting, but the TIP set is not very large, only 486 problems (of which only 60% could be solved using any method we tried), and many problems are similar to each other, so it is not clear that this would solve the problem. In short, there is too little data to train a general-purpose strategy, and we cannot say how well the learned strategy generalizes to problems outside of TIP.

Even so, domain-specific strategies are reasonable in many applications. For example, in program verification, it is reasonable to run the prover over a set of problems multiple times and find a strategy that works for just those kinds of problems one is interested in verifying. Our results show that specially tuned strategies are highly effective and compatible with lemma discovery.

The search space for proofs where induction is allowed is inherently enormous and becomes particularly explosive when the ATP itself has to decide when to apply induction. Trained strategies seem to be necessary for competitive performance. HipSpec, on the other hand, was developed before CVC4 and Vampire supported induction; it therefore handled the induction step outside the ATP and only outsourced the resulting subgoals to it. One benefit of doing so is that the search space is much less explosive, which contributes to HipSpec's good performance. We thus leave the question of how to best implement automated induction partially unsolved: we either need highly specialized strategies trained on many proof attempts, or we need to keep the application of induction under strict control.

### 5.1 Future Work

We have many ideas for improvements when it comes to generating lemmas for inductive proofs. QuickSpec is limited to discovering equational conjectures that may have a predicate as a condition (if the theory being explored contains a function that returns a boolean value, the value of that function may be used as a predicate). However, many inductive proofs require more complex conditional lemmas. In [10] we presented RoughSpec, a system that generates conjectures matching a user-defined input template. This could be used to conjecture lemmas for inductive proofs, using lemma templates learned from proof libraries. Another idea is to develop better methods for providing only lemmas likely to be useful, limiting the number of lemmas given to the prover so that the search space does not explode. For example, simple syntactic conditions [14] have been used inside the theorem prover, and machine learning [19] before its invocation, to mitigate this issue.

Acknowledgments. This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation. Martin Suda was supported by the Czech Science Foundation project no. 24-12759S and the project RICAIP no. 857306 under the EU-H2020 programme.

# References



# **Control-Flow Refinement for Complexity Analysis of Probabilistic Programs in KoAT (Short Paper)**

Nils Lommen, Eléanore Meyer, and Jürgen Giesl

RWTH Aachen University, Aachen, Germany
{lommen,eleanore.meyer,giesl}@cs.rwth-aachen.de

**Abstract.** Recently, we showed how to use control-flow refinement (CFR) to improve automatic complexity analysis of integer programs. While up to now CFR was limited to classical programs, in this paper we extend CFR to *probabilistic* programs and show its soundness for complexity analysis. To demonstrate its benefits, we implemented our new CFR technique in our complexity analysis tool KoAT.

# **1 Introduction**

There exist numerous tools for complexity analysis of (non-probabilistic) programs, e.g., [2–6,10,11,15,16,18,19,24,25,28,30,32]. Our tool KoAT infers upper runtime and size bounds for (non-probabilistic) integer programs in a modular way by analyzing subprograms separately and lifting the obtained results to global bounds on the whole program [10]. Recently, we developed several improvements of KoAT [18,24,25] and showed that incorporating control-flow refinement (CFR) [13,14] increases the power of automated complexity analysis significantly [18].

There are also several approaches for complexity analysis of *probabilistic* programs, e.g., [1,7,9,21–23,27,29,31,34]. In particular, we also adapted KoAT's approach for runtime and size bounds, and introduced a modular framework for automated complexity analysis of probabilistic integer programs in [27]. However, the improvements of KoAT from [18,24,25] had not yet been adapted to the probabilistic setting. In particular, we are not aware of any existing technique to combine CFR with complexity analysis of probabilistic programs.

Thus, in this paper, we develop a novel CFR technique for probabilistic programs which can be used as a black box by every complexity analysis tool. Moreover, to reduce the overhead of CFR, we integrated CFR natively into KoAT by calling it on-demand in a modular way. Our experiments show that CFR increases the power of KoAT for complexity analysis of probabilistic programs substantially.

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 235950644 (Project GI 274/6-2) and DFG Research Training Group 2236 UnRAVeL.

© The Author(s) 2024

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 233–243, 2024. https://doi.org/10.1007/978-3-031-63498-7_14

The idea of CFR is to gain information on the values of program variables and to sort out infeasible program paths. For example, consider the probabilistic **while**-loop (1). Here, we flip a (fair) coin and either set x to 0 or do nothing.

$$\textbf{while}\ x > 0\ \textbf{do}\ x \leftarrow 0 \oplus_{1/2} \texttt{noop}\ \textbf{end} \tag{1}$$

The update $x \leftarrow 0$ is in a loop. However, after setting $x$ to 0, the loop cannot be executed again. To simplify its analysis, CFR "unrolls" the loop, resulting in (2).

$$\begin{aligned} &\textbf{while}\ x > 0\ \textbf{do}\ \texttt{break} \oplus_{1/2} \texttt{noop}\ \textbf{end}\\ &\textbf{if}\ x > 0\ \textbf{then}\ x \leftarrow 0\ \textbf{end} \end{aligned} \tag{2}$$

Here, $x$ is updated in a separate, *non-probabilistic* **if**-statement and the loop does not change variables. Thus, we sorted out paths where $x \leftarrow 0$ was executed repeatedly. Now, techniques for probabilistic programs can be used for the **while**-loop. The rest of the program can be analyzed by techniques for non-probabilistic programs. In particular, this is important if (1) is part of a larger program.
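To make the effect of the unrolling concrete, the following Python sketch (our own illustration, not part of KoAT) simulates both the original loop (1) and its unrolled version (2). In both, the number of coin flips before the loop exits is geometrically distributed, so about two iterations are needed in expectation.

```python
import random

def original(x, rng):
    """Program (1): while x > 0, flip a fair coin; heads sets x to 0."""
    steps = 0
    while x > 0:
        steps += 1
        if rng.random() < 0.5:
            x = 0
    return steps

def unrolled(x, rng):
    """Program (2): the coin only decides when to leave the loop; the
    update x <- 0 happens once, in a separate non-probabilistic branch."""
    steps = 0
    while x > 0:
        steps += 1
        if rng.random() < 0.5:
            break
    if x > 0:
        x = 0
    return steps

rng = random.Random(42)
n = 100_000
avg = sum(original(5, rng) for _ in range(n)) / n
# The iteration count is geometric with p = 1/2: close to 2 on average.
```

Running `unrolled` instead of `original` yields the same distribution of iteration counts, which is the behavioral equivalence that the refinement exploits.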

We present necessary preliminaries in Sect. 2. In Sect. 3, we introduce our new control-flow refinement technique and show how to combine it with automated complexity analysis of probabilistic programs. We conclude in Sect. 4 by an experimental evaluation with our tool KoAT. We refer to [26] for further details on probabilistic programs and the soundness proof of our CFR technique.

# **2 Preliminaries**

Let $\mathcal{V}$ be a set of variables. An *atom* is an inequation $p_1 < p_2$ for polynomials $p_1, p_2 \in \mathbb{Z}[\mathcal{V}]$, and the set of all atoms is denoted by $\mathcal{A}(\mathcal{V})$. A *constraint* is a (possibly empty) conjunction of atoms, and $\mathcal{C}(\mathcal{V})$ denotes the set of all constraints. In addition to "$<$", we also use "$\geq$", "$=$", etc., which can be simulated by constraints (e.g., $p_1 \geq p_2$ is equivalent to $p_2 < p_1 + 1$ for integers).

For *probabilistic integer programs (PIPs)*, as in [27] we use a formalism based on transitions, which also allows us to represent **while**-programs like (1) easily. A PIP is a tuple $(\mathcal{PV}, \mathcal{L}, \ell_0, \mathcal{GT})$ with a finite set of program variables $\mathcal{PV} \subseteq \mathcal{V}$, a finite set of locations $\mathcal{L}$, a fixed initial location $\ell_0 \in \mathcal{L}$, and a finite set of general transitions $\mathcal{GT}$. A *general transition* $g \in \mathcal{GT}$ is a finite set of transitions which share the same start location $\ell_g$ and the same guard $\varphi_g$. A *transition* is a 5-tuple $(\ell, \varphi, p, \eta, \ell')$ with a *start location* $\ell \in \mathcal{L}$, *target location* $\ell' \in \mathcal{L} \setminus \{\ell_0\}$, *guard* $\varphi \in \mathcal{C}(\mathcal{V})$, *probability* $p \in [0, 1]$, and *update* $\eta : \mathcal{PV} \to \mathbb{Z}[\mathcal{V}]$. The probabilities of all transitions in a general transition add up to 1. We always require that general transitions are pairwise disjoint and let $\mathcal{T} = \bigcup_{g \in \mathcal{GT}} g$ denote the set of all transitions. PIPs may have *non-deterministic branching*, i.e., the guards of several transitions can be satisfied. Moreover, we also allow *non-deterministic (temporary) variables* $\mathcal{V} \setminus \mathcal{PV}$. To simplify the presentation, we do not consider transitions with individual costs and updates which use probability distributions, but the approach can easily be extended accordingly. From now on, we fix a PIP $\mathcal{P} = (\mathcal{PV}, \mathcal{L}, \ell_0, \mathcal{GT})$.
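As a concrete illustration of this definition, the following Python sketch encodes transitions as plain records and checks the defining conditions of a general transition (a shared start location, a shared guard, and probabilities summing to 1). The encoding and all names are our own, chosen for illustration; they are not KoAT's internal representation.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Transition:
    start: str       # start location
    guard: str       # guard, a constraint over the variables
    prob: Fraction   # probability in [0, 1]
    update: tuple    # update as (variable, polynomial) pairs, e.g. (("x", "0"),)
    target: str      # target location (never the initial location)

def is_general_transition(ts):
    """A finite set of transitions with one start location, one guard,
    and probabilities adding up to 1."""
    return (len({t.start for t in ts}) == 1
            and len({t.guard for t in ts}) == 1
            and sum(t.prob for t in ts) == 1)

# The general transition {t1a, t1b} of the coin-flip loop (1):
t1a = Transition("l1", "x > 0", Fraction(1, 2), (), "l1")
t1b = Transition("l1", "x > 0", Fraction(1, 2), (("x", "0"),), "l1")
assert is_general_transition({t1a, t1b})
assert not is_general_transition({t1a})  # probabilities only sum to 1/2
```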

**Fig. 1.** A Probabilistic Integer Program

*Example 1.* The PIP in Fig. 1 has $\mathcal{PV} = \{x, y\}$, $\mathcal{L} = \{\ell_0, \ell_1, \ell_2\}$, and four general transitions $\{t_0\}$, $\{t_{1a}, t_{1b}\}$, $\{t_2\}$, $\{t_3\}$. The transition $t_0$ starts at the initial location $\ell_0$ and sets $x$ to a non-deterministic positive value $u \in \mathcal{V} \setminus \mathcal{PV}$, while $y$ is unchanged. (In Fig. 1, we omitted unchanged updates like $\eta(y) = y$, the guard $\texttt{true}$, and the probability $p = 1$ to ease readability.) If a general transition is a singleton, we often use transitions and general transitions interchangeably. Here, only $t_{1a}$ and $t_{1b}$ form a non-singleton general transition, which corresponds to the program (1). We denote such (probabilistic) transitions by dashed arrows in Fig. 1. We extended (1) by a loop of $t_2$ and $t_3$ which is only executed if $y > 0 \wedge x = 0$ (due to $t_2$'s guard) and decreases $y$ by 1 in each iteration (via $t_3$'s update).

A *state* is a function $\sigma : \mathcal{V} \to \mathbb{Z}$, $\Sigma$ denotes the set of all states, and a *configuration* is a pair of a location and a state. To extend finite sequences of configurations to infinite ones, we introduce a special location $\ell_\bot$ (indicating termination) and a special transition $t_\bot$ (and its general transition $g_\bot = \{t_\bot\}$) to reach the configurations of a run after termination. Let $\mathcal{L}_\bot = \mathcal{L} \uplus \{\ell_\bot\}$, $\mathcal{T}_\bot = \mathcal{T} \uplus \{t_\bot\}$, $\mathcal{GT}_\bot = \mathcal{GT} \uplus \{g_\bot\}$, and let $\mathrm{Conf} = \mathcal{L}_\bot \times \Sigma$ denote the set of all configurations. A *path* has the form $c_0 \to_{t_1} \cdots \to_{t_n} c_n$ for $c_0, \ldots, c_n \in \mathrm{Conf}$ and $t_1, \ldots, t_n \in \mathcal{T}_\bot$ for an $n \in \mathbb{N}$, and a *run* is an infinite path $c_0 \to_{t_1} c_1 \to_{t_2} \cdots$. Let $\mathrm{Path}$ and $\mathrm{Run}$ denote the sets of all paths and all runs, respectively.

We use Markovian schedulers $\mathfrak{S} : \mathrm{Conf} \to \mathcal{GT}_\bot \times \Sigma$ to resolve all non-determinism. For $c = (\ell, \sigma) \in \mathrm{Conf}$, a *scheduler* $\mathfrak{S}$ yields a pair $\mathfrak{S}(c) = (g, \tilde{\sigma})$ where $g$ is the next general transition to be taken (with $\ell = \ell_g$) and $\tilde{\sigma}$ chooses values for the temporary variables such that $\tilde{\sigma} \models \varphi_g$ and $\sigma(v) = \tilde{\sigma}(v)$ for all $v \in \mathcal{PV}$. If $\mathcal{GT}$ contains no such $g$, we obtain $\mathfrak{S}(c) = (g_\bot, \sigma)$. For the formal definition of Markovian schedulers, we refer to [26].

For every $\mathfrak{S}$ and $\sigma_0 \in \Sigma$, we define a probability mass function $\mathrm{pr}_{\mathfrak{S},\sigma_0}$. For all $c \in \mathrm{Conf}$, $\mathrm{pr}_{\mathfrak{S},\sigma_0}(c)$ is the probability that a run with scheduler $\mathfrak{S}$ and initial state $\sigma_0$ starts in $c$. So $\mathrm{pr}_{\mathfrak{S},\sigma_0}(c) = 1$ if $c = (\ell_0, \sigma_0)$ and $\mathrm{pr}_{\mathfrak{S},\sigma_0}(c) = 0$ otherwise.

For all $c, c' \in \mathrm{Conf}$ and $t \in \mathcal{T}_\bot$, let $\mathrm{pr}_{\mathfrak{S}}(c \to_t c')$ be the probability that one goes from $c$ to $c'$ via the transition $t$ when using the scheduler $\mathfrak{S}$ (see [26] for the formal definition of $\mathrm{pr}_{\mathfrak{S}}$). Then for any path $f = (c_0 \to_{t_1} \cdots \to_{t_n} c_n) \in \mathrm{Path}$, let $\mathrm{pr}_{\mathfrak{S},\sigma_0}(f) = \mathrm{pr}_{\mathfrak{S},\sigma_0}(c_0) \cdot \mathrm{pr}_{\mathfrak{S}}(c_0 \to_{t_1} c_1) \cdot \ldots \cdot \mathrm{pr}_{\mathfrak{S}}(c_{n-1} \to_{t_n} c_n)$. Here, all paths $f$ which are not "admissible" (e.g., guards are not fulfilled, transitions start or end in wrong locations, etc.) have probability $\mathrm{pr}_{\mathfrak{S},\sigma_0}(f) = 0$.
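The path probability just defined is a plain product, which the following small sketch makes explicit. The step probabilities would come from the scheduler-dependent step function; here (as our own simplification) they are passed in directly.

```python
from fractions import Fraction
from functools import reduce

def path_probability(start_prob, step_probs):
    """Probability of a path: the start configuration's probability times
    the probability of each step c_{i-1} -> c_i along the path."""
    return reduce(lambda acc, p: acc * p, step_probs, start_prob)

# A path that starts in the initial configuration (probability 1) and takes
# the fair-coin transition of program (1) three times:
p = path_probability(Fraction(1), [Fraction(1, 2)] * 3)
assert p == Fraction(1, 8)
```

A non-admissible step would simply contribute probability 0 to the product, giving the whole path probability 0.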

The semantics of PIPs can be defined via a corresponding probability space, obtained by a standard cylinder construction. Let $\mathbb{P}_{\mathfrak{S},\sigma_0}$ denote the probability measure which lifts $\mathrm{pr}_{\mathfrak{S},\sigma_0}$ to cylinder sets: For any $f \in \mathrm{Path}$, we have $\mathrm{pr}_{\mathfrak{S},\sigma_0}(f) = \mathbb{P}_{\mathfrak{S},\sigma_0}(\mathrm{Pre}_f)$ for the set $\mathrm{Pre}_f$ of all infinite runs with prefix $f$. So $\mathbb{P}_{\mathfrak{S},\sigma_0}(\Theta)$ is the probability that a run from $\Theta \subseteq \mathrm{Run}$ is obtained when using the scheduler $\mathfrak{S}$ and starting in $\sigma_0$. Let $\mathbb{E}_{\mathfrak{S},\sigma_0}$ denote the associated expected value operator. So for any random variable $X : \mathrm{Run} \to \overline{\mathbb{N}} = \mathbb{N} \cup \{\infty\}$, we have $\mathbb{E}_{\mathfrak{S},\sigma_0}(X) = \sum_{n \in \overline{\mathbb{N}}} n \cdot \mathbb{P}_{\mathfrak{S},\sigma_0}(X = n)$. For a detailed construction, see [26].

**Definition 2 (Expected Runtime).** *For* $g \in \mathcal{GT}$*,* $\mathcal{R}_g : \mathrm{Run} \to \overline{\mathbb{N}}$ *is a random variable with* $\mathcal{R}_g(c_0 \to_{t_1} c_1 \to_{t_2} \cdots) = |\{i \in \mathbb{N} \mid t_i \in g\}|$*, i.e.,* $\mathcal{R}_g(\vartheta)$ *is the number of times that a transition from* $g$ *was applied in the run* $\vartheta \in \mathrm{Run}$*. Moreover, the random variable* $\mathcal{R} : \mathrm{Run} \to \overline{\mathbb{N}}$ *denotes the number of transitions that were executed before termination, i.e., for all* $\vartheta \in \mathrm{Run}$ *we have* $\mathcal{R}(\vartheta) = \sum_{g \in \mathcal{GT}} \mathcal{R}_g(\vartheta)$*. For a scheduler* $\mathfrak{S}$ *and* $\sigma_0 \in \Sigma$*, the* expected runtime *of* $g$ *is* $\mathbb{E}_{\mathfrak{S},\sigma_0}(\mathcal{R}_g)$ *and the* expected runtime *of the program is* $\mathcal{R}_{\mathfrak{S},\sigma_0} = \mathbb{E}_{\mathfrak{S},\sigma_0}(\mathcal{R})$*.*

The goal of complexity analysis for a PIP is to compute a bound on its *expected runtime complexity*. The set of *bounds* $\mathcal{B}$ consists of all functions $\Sigma \to \mathbb{R}_{\geq 0}$.

**Definition 3 (Expected Runtime Bound and Complexity** [27]**).** *The function* $\mathcal{RB} : \mathcal{GT} \to \mathcal{B}$ *is an* expected runtime bound *if* $(\mathcal{RB}(g))(\sigma_0) \geq \sup_{\mathfrak{S}} \mathbb{E}_{\mathfrak{S},\sigma_0}(\mathcal{R}_g)$ *for all* $\sigma_0 \in \Sigma$ *and all* $g \in \mathcal{GT}$*. Then* $\sum_{g \in \mathcal{GT}} \mathcal{RB}(g)$ *is a bound on the* expected runtime complexity *of the whole program, i.e.,* $\sum_{g \in \mathcal{GT}} ((\mathcal{RB}(g))(\sigma_0)) \geq \sup_{\mathfrak{S}} \mathcal{R}_{\mathfrak{S},\sigma_0}$ *for all* $\sigma_0 \in \Sigma$*.*

# **3 Control-Flow Refinement for PIPs**

We now introduce our novel CFR algorithm for *probabilistic* integer programs, based on the partial evaluation technique for non-probabilistic programs from [13,14,18]. In particular, our algorithm coincides with the classical CFR technique when the program is non-probabilistic. The goal of CFR is to transform a program $\mathcal{P}$ into a program $\mathcal{P}'$ which is "easier" to analyze. Thm. 4 shows the soundness of our approach, i.e., that $\mathcal{P}$ and $\mathcal{P}'$ have the same expected runtime complexity.

Our CFR technique considers "abstract" evaluations which operate on sets of states. These sets are characterized by conjunctions $\tau$ of constraints from $\mathcal{C}(\mathcal{PV})$, i.e., $\tau$ stands for all states $\sigma \in \Sigma$ with $\sigma \models \tau$. We now label locations $\ell$ by formulas $\tau$ which describe (a superset of) those states $\sigma$ which can occur in $\ell$, i.e., where a configuration $(\ell, \sigma)$ is reachable from some initial configuration $(\ell_0, \sigma_0)$. We begin with labeling every location by the constraint $\texttt{true}$. Then we add new copies of the locations with refined labels $\tau$ by considering how the updates of transitions affect the constraints of their start locations and their guards. The labeled locations become the new locations in the refined program.

Since a location might be reachable by different paths, we may construct several variants $\langle \ell, \tau_1 \rangle, \ldots, \langle \ell, \tau_n \rangle$ of the same original location $\ell$. Thus, the formulas $\tau$ are not necessarily invariants that hold for *all* evaluations that reach a location $\ell$, but we perform a case analysis and split up a location according to the different sets of states that may reach $\ell$. Our approach ensures that a labeled location $\langle \ell, \tau \rangle$ can only be reached by configurations $(\ell, \sigma)$ where $\sigma \models \tau$.

We apply CFR only *on-demand* on a (sub)set of transitions $\mathcal{S} \subseteq \mathcal{T}$ (thus, CFR can be performed in a *modular* way for different subsets $\mathcal{S}$). In practice, we choose $\mathcal{S}$ heuristically and use CFR only on transitions where our currently inferred runtime bounds are "not yet good enough". Then, for $\mathcal{P} = (\mathcal{PV}, \mathcal{L}, \ell_0, \mathcal{GT})$, the result of the CFR algorithm is the program $\mathcal{P}' = (\mathcal{PV}, \mathcal{L}', \langle \ell_0, \texttt{true} \rangle, \mathcal{GT}')$ where $\mathcal{L}'$ and $\mathcal{GT}'$ are the smallest sets satisfying the properties (3), (4), and (5) below.

First, we require that for all $\ell \in \mathcal{L}$, all "original" locations $\langle \ell, \texttt{true} \rangle$ are in $\mathcal{L}'$. In these locations, we do not have any information on the possible states yet:

$$\forall \; \ell \in \mathcal{L}. \; \langle \ell, \mathtt{true} \rangle \in \mathcal{L}' \tag{3}$$

If we already introduced a location $\langle \ell, \tau \rangle \in \mathcal{L}'$ and there is a transition $(\ell, \varphi, p, \eta, \ell') \in \mathcal{S}$, then (4) requires that we also add the location $\langle \ell', \tau_{\varphi,\eta,\ell'} \rangle$ to $\mathcal{L}'$. The formula $\tau_{\varphi,\eta,\ell'}$ over-approximates the set of states that can result from states that satisfy $\tau$ and the guard $\varphi$ of the transition when applying the update $\eta$. More precisely, $\tau_{\varphi,\eta,\ell'}$ has to satisfy $(\tau \wedge \varphi) \models \eta(\tau_{\varphi,\eta,\ell'})$. For example, if $\tau = (x = 0)$, $\varphi = \texttt{true}$, and $\eta(x) = x - 1$, then we might have $\tau_{\varphi,\eta,\ell'} = (x = -1)$.

To ensure that every $\ell' \in \mathcal{L}$ only gives rise to *finitely* many new labeled locations $\langle \ell', \tau_{\varphi,\eta,\ell'} \rangle$, we perform *property-based abstraction*: For every location $\ell'$, we use a finite so-called *abstraction layer* $\alpha_{\ell'} \subset \{p_1 \sim p_2 \mid p_1, p_2 \in \mathbb{Z}[\mathcal{PV}] \text{ and } {\sim} \in \{<, \leq, =\}\}$ (see [14] for heuristics to compute $\alpha_{\ell'}$). Then we require that $\tau_{\varphi,\eta,\ell'}$ must be a conjunction of constraints from $\alpha_{\ell'}$ (i.e., $\tau_{\varphi,\eta,\ell'} \subseteq \alpha_{\ell'}$ when regarding sets of constraints as their conjunction). This guarantees termination of our CFR algorithm, since for every location there are only finitely many possible labels.

$$\begin{aligned} &\forall \langle \ell, \tau \rangle \in \mathcal{L}'.\ \forall (\ell, \varphi, p, \eta, \ell') \in \mathcal{S}.\ \langle \ell', \tau_{\varphi, \eta, \ell'} \rangle \in \mathcal{L}'\\ &\text{where } \tau_{\varphi, \eta, \ell'} = \{ \psi \in \alpha_{\ell'} \mid (\tau \wedge \varphi) \models \eta(\psi) \} \end{aligned} \tag{4}$$
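The label computation of (4) can be illustrated with a toy sketch: it collects all properties from the abstraction layer whose update-image is entailed by the current label together with the guard. Real implementations discharge the entailment with an SMT solver; here we brute-force a small integer range over a single variable, which suffices only for this illustration (all names and the range are our own choices).

```python
RANGE = range(-50, 51)  # toy domain; a real tool would query an SMT solver

def entails(premise, conclusion):
    """Check premise(x) => conclusion(x) over the sampled range."""
    return all(conclusion(x) for x in RANGE if premise(x))

def label(tau, phi, eta, alpha):
    """The subset of the abstraction layer alpha entailed after update eta;
    eta(psi) is realized here by composing psi with the update function."""
    return {name for name, psi in alpha.items()
            if entails(lambda x: tau(x) and phi(x), lambda x: psi(eta(x)))}

alpha1 = {"x = 0": lambda x: x == 0}   # abstraction layer of the loop location
tau    = lambda x: True                # current label: true
phi    = lambda x: x > 0               # guard of the coin-flip transitions

# The identity update keeps x > 0, so no property is entailed: label "true".
assert label(tau, phi, lambda x: x, alpha1) == set()
# The update x <- 0 makes "x = 0" entailed, yielding a refined location.
assert label(tau, phi, lambda x: 0, alpha1) == {"x = 0"}
```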

Finally, we have to ensure that $\mathcal{GT}'$ contains all "necessary" (general) transitions. To this end, we consider all $g \in \mathcal{GT}$. The transitions $(\ell, \varphi, p, \eta, \ell')$ in $g \cap \mathcal{S}$ now have to connect the appropriately labeled locations. Thus, for all labeled variants $\langle \ell, \tau \rangle \in \mathcal{L}'$, we add the transition $(\langle \ell, \tau \rangle, \tau \wedge \varphi, p, \eta, \langle \ell', \tau_{\varphi,\eta,\ell'} \rangle)$. In contrast, the transitions $(\ell, \varphi, p, \eta, \ell')$ in $g \setminus \mathcal{S}$ only reach the location where $\ell'$ is labeled with $\texttt{true}$, i.e., here we add the transition $(\langle \ell, \tau \rangle, \tau \wedge \varphi, p, \eta, \langle \ell', \texttt{true} \rangle)$.

$$\begin{aligned} &\forall \langle \ell, \tau \rangle \in \mathcal{L}'.\ \forall g \in \mathcal{GT}.\\ &\quad \{ (\langle \ell, \tau \rangle, \tau \wedge \varphi, p, \eta, \langle \ell', \tau_{\varphi, \eta, \ell'} \rangle) \mid (\ell, \varphi, p, \eta, \ell') \in g \cap \mathcal{S} \} \ \cup\\ &\quad \{ (\langle \ell, \tau \rangle, \tau \wedge \varphi, p, \eta, \langle \ell', \texttt{true} \rangle) \mid (\ell, \varphi, p, \eta, \ell') \in g \setminus \mathcal{S} \} \ \in \mathcal{GT}' \end{aligned} \tag{5}$$

**Fig. 2.** Result of Control-Flow Refinement with $\mathcal{S} = \{t_{1a}, t_{1b}, t_2, t_3\}$

$\mathcal{L}'$ and $\bigcup_{g \in \mathcal{GT}'} g$ are finite due to the property-based abstraction, as there are only finitely many possible labels for each location. Hence, repeatedly "unrolling" transitions by (5) leads to the (unique) least fixpoint. Moreover, (5) yields proper general transitions, i.e., their probabilities still add up to 1. In practice, we remove transitions with unsatisfiable guards, and locations that are not reachable from $\langle \ell_0, \texttt{true} \rangle$. Thm. 4 shows the soundness of our approach (see [26] for its proof).

**Theorem 4 (Soundness of CFR for PIPs).** *Let* $\mathcal{P}' = (\mathcal{PV}, \mathcal{L}', \langle \ell_0, \texttt{true} \rangle, \mathcal{GT}')$ *be the PIP such that* $\mathcal{L}'$ *and* $\mathcal{GT}'$ *are the smallest sets satisfying* (3)*,* (4)*, and* (5)*. Let* $\mathcal{R}^{\mathcal{P}}_{\mathfrak{S},\sigma_0}$ *and* $\mathcal{R}^{\mathcal{P}'}_{\mathfrak{S},\sigma_0}$ *be the expected runtimes of* $\mathcal{P}$ *and* $\mathcal{P}'$*, respectively. Then for all* $\sigma_0 \in \Sigma$ *we have* $\sup_{\mathfrak{S}} \mathcal{R}^{\mathcal{P}}_{\mathfrak{S},\sigma_0} = \sup_{\mathfrak{S}} \mathcal{R}^{\mathcal{P}'}_{\mathfrak{S},\sigma_0}$*.*

*CFR Algorithm and its Runtime:* To implement the fixpoint construction of Thm. 4 (i.e., to compute the PIP $\mathcal{P}'$), our algorithm starts by introducing all "original" locations $\langle \ell, \texttt{true} \rangle$ for $\ell \in \mathcal{L}$ according to (3). Then it iterates over all labeled locations $\langle \ell, \tau \rangle$ and all transitions $t \in \mathcal{T}$. If the start location of $t$ is $\ell$, then the algorithm extends $\mathcal{GT}'$ by a new transition according to (5). Moreover, it also adds the corresponding labeled target location to $\mathcal{L}'$ (as in (4)), if $\mathcal{L}'$ did not contain this labeled location yet. Afterwards, we mark $\langle \ell, \tau \rangle$ as finished and proceed with a previously computed labeled location that is not marked yet. So our implementation iteratively "unrolls" transitions by (5) until no new labeled locations are obtained (this yields the least fixpoint mentioned above). Thus, unrolling steps with transitions from $\mathcal{T} \setminus \mathcal{S}$ do not invoke further computations.
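The worklist structure of this fixpoint computation can be sketched as follows. The label computation of (4) is abstracted into a caller-supplied `refine` function, guards and probabilities are omitted, and unsatisfiable transitions are not pruned; all names are our own illustration, not KoAT's implementation.

```python
from collections import deque, namedtuple

T = namedtuple("T", "name start target")  # a transition, stripped to the essentials

def cfr(initial, transitions, refined_set, refine):
    """Unroll transitions until no new labeled location appears (least fixpoint)."""
    locations = {(l, "true")
                 for l in {initial}
                 | {t.start for t in transitions}
                 | {t.target for t in transitions}}          # rule (3)
    new_transitions = []
    worklist = deque(locations)
    while worklist:
        loc = worklist.popleft()
        l, tau = loc
        for t in transitions:
            if t.start != l:
                continue
            # refined transitions get a computed label, all others get "true":
            lab = refine(tau, t) if t in refined_set else "true"
            target = (t.target, lab)                          # rules (4) / (5)
            new_transitions.append((loc, t.name, target))
            if target not in locations:
                locations.add(target)
                worklist.append(target)
    return locations, new_transitions

# The probabilistic self-loop {t1a, t1b} of Fig. 1 at location l1:
t1a, t1b = T("t1a", "l1", "l1"), T("t1b", "l1", "l1")
# t1b sets x to 0, so its refined label is "x=0"; t1a keeps the old label.
refine = lambda tau, t: "x=0" if t.name == "t1b" else tau
locs, _ = cfr("l1", [t1a, t1b], {t1a, t1b}, refine)
assert locs == {("l1", "true"), ("l1", "x=0")}
```

Since each labeled location enters the worklist exactly once and every location admits only finitely many labels, termination follows just as argued for the abstract algorithm.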

To over-approximate the runtime of this algorithm, note that for every location $\ell \in \mathcal{L}$, there can be at most $2^{|\alpha_\ell|}$ many labeled locations of the form $\langle \ell, \tau \rangle$. So if $\mathcal{L} = \{\ell_0, \ldots, \ell_n\}$, then the overall number of labeled locations is at most $2^{|\alpha_{\ell_0}|} + \ldots + 2^{|\alpha_{\ell_n}|}$. Hence, the algorithm performs at most $|\mathcal{T}| \cdot (2^{|\alpha_{\ell_0}|} + \ldots + 2^{|\alpha_{\ell_n}|})$ unrolling steps.

*Example 5.* For the PIP in Fig. 1 and $\mathcal{S} = \{t_{1a}, t_{1b}, t_2, t_3\}$, by (3) we start with $\mathcal{L}' = \{\langle \ell_i, \texttt{true} \rangle \mid i \in \{0, 1, 2\}\}$. We abbreviate $\langle \ell_i, \texttt{true} \rangle$ by $\ell_i$ in the final result of the CFR algorithm in Fig. 2. As $t_0 \in \{t_0\} \setminus \mathcal{S}$, by (5) $t_0$ is redirected such that it starts at $\langle \ell_0, \texttt{true} \rangle$ and ends in $\langle \ell_1, \texttt{true} \rangle$, resulting in $t_0'$. We always use primes to indicate the correspondence between new and original transitions.

Next, we consider $\{t_{1a}, t_{1b}\} \subseteq \mathcal{S}$ with the guard $\varphi = (x > 0)$ and start location $\langle \ell_1, \texttt{true} \rangle$. We first handle $t_{1a}$ which has the update $\eta = \mathrm{id}$. We use the abstraction layers $\alpha_{\ell_0} = \emptyset$, $\alpha_{\ell_1} = \{x = 0\}$, and $\alpha_{\ell_2} = \{x = 0\}$. Thus, we have to find all $\psi \in \alpha_{\ell_1} = \{x = 0\}$ such that $(\texttt{true} \wedge x > 0) \models \eta(\psi)$. Hence, $\tau_{x>0,\mathrm{id},\ell_1}$ is the empty conjunction $\texttt{true}$, as no $\psi$ from $\alpha_{\ell_1}$ satisfies this property. We obtain

$$t_{1a}' : (\langle \ell_1, \texttt{true} \rangle,\ x > 0,\ 1/2,\ \mathrm{id},\ \langle \ell_1, \texttt{true} \rangle).$$

In contrast, $t_{1b}$ has the update $\eta(x) = 0$. To determine $\tau_{x>0,\eta,\ell_1}$, again we have to find all $\psi \in \alpha_{\ell_1} = \{x = 0\}$ such that $(\texttt{true} \wedge x > 0) \models \eta(\psi)$. Here, we get $\tau_{x>0,\eta,\ell_1} = (x = 0)$. Thus, by (4) we create the location $\langle \ell_1, x = 0 \rangle$ and obtain

$$t_{1b}' : (\langle \ell_1, \texttt{true} \rangle,\ x > 0,\ 1/2,\ \eta(x) = 0,\ \langle \ell_1, x = 0 \rangle).$$

As $t_{1a}$ and $t_{1b}$ form one general transition, by (5) we obtain $\{t_{1a}', t_{1b}'\} \in \mathcal{GT}'$.

Now, we consider transitions resulting from $\{t_{1a}, t_{1b}\}$ with the start location $\langle \ell_1, x = 0 \rangle$. However, $\tau = (x = 0)$ and the guard $\varphi = (x > 0)$ are conflicting, i.e., the transitions would have an unsatisfiable guard $\tau \wedge \varphi$ and are thus omitted.

Next, we consider transitions resulting from $t_2$ with $\langle \ell_1, \texttt{true} \rangle$ or $\langle \ell_1, x = 0 \rangle$ as their start location. Here, we obtain two (general) transitions $\{t_2'\}, \{t_2''\} \in \mathcal{GT}'$:

$$\begin{aligned} t_2' &: (\langle \ell_1, x = 0 \rangle,\ y > 0 \wedge x = 0,\ 1,\ \mathrm{id},\ \langle \ell_2, x = 0 \rangle)\\ t_2'' &: (\langle \ell_1, \texttt{true} \rangle,\ y > 0 \wedge x = 0,\ 1,\ \mathrm{id},\ \langle \ell_2, x = 0 \rangle) \end{aligned}$$

However, $t_2''$ can be ignored since $x = 0$ contradicts the invariant $x > 0$ at $\langle \ell_1, \texttt{true} \rangle$. KoAT uses Apron [20] to infer invariants like $x > 0$ automatically. Finally, $t_3$ leads to the transition $t_3' : (\langle \ell_2, x = 0 \rangle,\ x = 0,\ 1,\ \eta(y) = y - 1,\ \langle \ell_1, x = 0 \rangle)$. Thus, we obtain $\mathcal{L}' = \{\langle \ell_i, \texttt{true} \rangle \mid i \in \{0, 1\}\} \cup \{\langle \ell_i, x = 0 \rangle \mid i \in \{1, 2\}\}$.

KoAT infers a bound $\mathcal{RB}(g)$ for each $g \in \mathcal{GT}'$ individually (thus, non-probabilistic program parts can be analyzed by classical techniques). Then $\sum_{g \in \mathcal{GT}'} \mathcal{RB}(g)$ is a bound on the expected runtime complexity of the whole program, see Definition 3.

*Example 6.* We now infer a bound on the expected runtime complexity of the PIP in Fig. 2. Transition $t_0'$ is not on a cycle, i.e., it can be evaluated at most once. So $\mathcal{RB}(\{t_0'\}) = 1$ is an (expected) runtime bound for the general transition $\{t_0'\}$.

For the general transition $\{t_{1a}', t_{1b}'\}$, KoAT infers the expected runtime bound 2 via probabilistic linear ranking functions (PLRFs, see e.g., [27]). More precisely, KoAT finds the *constant* PLRF $\{\ell_1 \mapsto 2,\ \langle \ell_1, x = 0 \rangle \mapsto 0\}$. In contrast, in the original program of Fig. 1, $\{t_{1a}, t_{1b}\}$ is not decreasing w.r.t. any constant PLRF, because $t_{1a}$ and $t_{1b}$ have the same target location. So here, every PLRF where $\{t_{1a}, t_{1b}\}$ decreases in expectation depends on $x$. However, such PLRFs do not yield a finite runtime bound in the end, as $t_0$ instantiates $x$ with the non-deterministic value $u$. Therefore, KoAT fails on the program of Fig. 1 without using CFR.

For the program of Fig. 2, KoAT infers $\mathcal{RB}(\{t_2'\}) = \mathcal{RB}(\{t_3'\}) = y$. By adding all runtime bounds, we obtain the bound $3 + 2 \cdot y$ on the expected runtime complexity of the program in Fig. 2 and thus, by Theorem 4, also of the program in Fig. 1.
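As a sanity check (our own illustration, not part of the paper's evaluation), the following sketch simulates the PIP of Fig. 1 and compares the average number of executed transitions against the inferred bound: one step for $t_0$, two steps in expectation for the coin-flip loop, and two steps ($t_2$ followed by $t_3$) per decrement of $y$, so the bound $3 + 2 \cdot y$ is exact here.

```python
import random

def run_pip(x0, y0, rng):
    """Count the transitions taken by the PIP of Fig. 1, starting after t0
    has set x to the (non-deterministically chosen) positive value x0."""
    steps, x, y = 1, x0, y0           # 1 step for t0 itself
    while True:
        if x > 0:                     # general transition {t1a, t1b}
            steps += 1
            if rng.random() < 0.5:    # t1b: x <- 0 with probability 1/2
                x = 0
        elif y > 0:                   # t2 (to l2) followed by t3 (y <- y - 1)
            steps += 2
            y -= 1
        else:
            return steps              # no guard is satisfied: termination

rng = random.Random(0)
y0 = 10
avg = sum(run_pip(7, y0, rng) for _ in range(20_000)) / 20_000
assert abs(avg - (3 + 2 * y0)) < 0.2  # matches the expected runtime 3 + 2y
```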


**Table 1.** Evaluation of CFR on Probabilistic Programs

# **4 Implementation, Evaluation, and Conclusion**

We presented a novel control-flow refinement technique for probabilistic programs and proved that it does not modify the program's expected runtime complexity. This allows us to combine CFR with approaches for complexity analysis of probabilistic programs. Compared to its variant for non-probabilistic programs, the soundness proof of Theorem 4 for probabilistic programs is considerably more involved.

Up to now, our complexity analyzer KoAT used the tool iRankFinder [13] for CFR of non-probabilistic programs [18]. To demonstrate the benefits of CFR for complexity analysis of probabilistic programs, we now replaced the call to iRankFinder in KoAT by a native implementation of our new CFR algorithm. KoAT is written in OCaml and it uses Z3 [12] for SMT solving, Apron [20] to generate invariants, and the Parma Polyhedra Library [8] for computations with polyhedra.

We used all 75 probabilistic benchmarks from [27,29] and added 15 new benchmarks including our leading example and problems adapted from the *Termination Problem Data Base* [33], e.g., a probabilistic version of McCarthy's 91 function. Our benchmarks also contain examples where CFR is useful even if it cannot separate probabilistic from non-probabilistic program parts as in our leading example.

Table 1 shows the results of our experiments. We compared the configuration of KoAT with CFR ("KoAT+CFR") against KoAT without CFR. Moreover, as in [27], we also compared with the main other recent tools for inferring upper bounds on the expected runtimes of probabilistic integer programs (Absynth [29] and eco-imp [7]). As in the *Termination Competition* [17], we used a timeout of 5 min per example. The first entry in every cell is the number of benchmarks for which the tool inferred the respective bound. In brackets, we give the corresponding number when only regarding our new examples. For example, KoAT+CFR finds a finite expected runtime bound for 84 of the 90 examples. A linear expected bound (i.e., in O(n)) is found for 56 of these 84 examples, where 12 of these benchmarks are from our new set. AVG(s) is the average runtime in seconds on all benchmarks and AVG⁺(s) is the average runtime on all successful runs.

The experiments show that similar to its benefits for non-probabilistic programs [18], CFR also increases the power of automated complexity analysis for probabilistic programs substantially, while the runtime of the analyzer may become longer since CFR increases the size of the program. The experiments also indicate that a related CFR technique is not available in the other complexity analyzers. Thus, we conjecture that other tools for complexity or termination analysis of PIPs would also benefit from the integration of our CFR technique.

KoAT's source code, a binary, and a Docker image are available at:

### https://koat.verify.rwth-aachen.de/prob_cfr

The website also explains how to use our CFR implementation separately (without the rest of KoAT), in order to access it as a black box by other tools. Moreover, the website provides a *web interface* to directly run KoAT online, and details on our experiments, including our benchmark collection.

**Acknowledgements.** We thank Yoann Kehler for helping with the implementation of our CFR technique in KoAT.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **On the (In-)Completeness of Destructive Equality Resolution in the Superposition Calculus**

Uwe Waldmann

MPI for Informatics, Saarland Informatics Campus, Saarbrücken, Germany uwe@mpi-inf.mpg.de

**Abstract.** Bachmair's and Ganzinger's abstract redundancy concept for the Superposition Calculus justifies almost all operations that are used in superposition provers to delete or simplify clauses, and thus to keep the clause set manageable. Typical examples are tautology deletion, subsumption deletion, and demodulation, and with a more refined definition of redundancy, joinability and connectedness can be covered as well. The notable exception is Destructive Equality Resolution, that is, the replacement of a clause *x* ≉ *t* ∨ *C* with *x* ∉ vars(*t*) by *C*{*x* → *t*}. This operation is implemented in state-of-the-art provers, and it is clearly useful in practice, but little is known about how it affects refutational completeness. We demonstrate on the one hand that the naive addition of Destructive Equality Resolution to the standard abstract redundancy concept renders the calculus refutationally incomplete. On the other hand, we present several restricted variants of the Superposition Calculus that are refutationally complete even with Destructive Equality Resolution.

**Keywords:** Automated theorem proving · First-order logic · Superposition calculus

# **1 Introduction**

Bachmair's and Ganzinger's Superposition Calculus [2] comes with an abstract redundancy concept that describes under which circumstances clauses can be simplified away or deleted during a saturation without destroying the refutational completeness of the calculus. Typical concrete simplification and deletion techniques that are justified by the abstract redundancy concept are tautology deletion, subsumption deletion, and demodulation, and with a more refined definition of redundancy (Duarte and Korovin [4]), joinability and connectedness can be covered as well.

There is one simplification technique left that is not justified by Bachmair's and Ganzinger's redundancy criterion, namely Destructive Equality Resolution (DER), that is, the replacement of a clause x ≉ t ∨ C with x ∉ vars(t) by C{x → t}. This operation is for instance implemented in the E prover (Schulz [6]), and it has been shown to be useful in practice: it increases the number of problems that E can solve and it also reduces E's runtime per solved problem. The question how it affects the refutational completeness of the calculus, both in theory and in practice, has been open, though (except for the special case that t is also a variable, where DER is equivalent to selecting the literal x ≉ t, so that Equality Resolution becomes the only possible inference with this clause).
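To make the operation concrete, here is a minimal sketch of DER on a toy clause representation (terms as nested tuples, variables as strings, literals as (positive?, lhs, rhs) triples). The representation and all function names are illustrative, not the data structures of any actual prover.

```python
# Toy sketch of Destructive Equality Resolution:
# replace  x != t \/ C  (with x not in vars(t))  by  C{x -> t}.

def is_var(t):
    return isinstance(t, str)

def vars_of(t):
    if is_var(t):
        return {t}
    return set().union(*(vars_of(a) for a in t[1:]), set())

def subst(t, sigma):
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(subst(a, sigma) for a in t[1:])

def der(clause):
    """Apply DER to the first eligible negative literal, if any."""
    for i, (pos, lhs, rhs) in enumerate(clause):
        for x, t in ((lhs, rhs), (rhs, lhs)):
            if not pos and is_var(x) and x not in vars_of(t):
                sigma = {x: t}
                return [(p, subst(l, sigma), subst(r, sigma))
                        for j, (p, l, r) in enumerate(clause) if j != i]
    return clause  # no eligible literal: clause unchanged

# Example:  x != b \/ g(x) ~ d   becomes   g(b) ~ d
b, d = ("b",), ("d",)
C = [(False, "x", b), (True, ("g", "x"), d)]
print(der(C))  # [(True, ('g', ('b',)), ('d',))]
```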

In this paper we demonstrate on the one hand that the naive addition of DER to the standard abstract redundancy concept renders the calculus refutationally incomplete. On the other hand, we present several restricted variants of the Superposition Calculus that are refutationally complete even with DER.

For lack of space, some proofs had to be omitted from this version of the paper; they can be found in the technical report [7].

# **2 Preliminaries**

**Basic Notions.** We refer to (Baader and Nipkow [1]) for basic notations and results on orderings, multiset operations, and term rewriting.

We use standard set operation symbols like ∪ and ∈ and curly braces also for finite multisets. The union S ∪ S′ of the multisets S and S′ over some set M is defined by (S ∪ S′)(x) = S(x) + S′(x) for every x ∈ M.

Without loss of generality we assume that all most general unifiers that we consider are idempotent. Note that if σ is an idempotent most general unifier and θ is a unifier then θ ◦ σ = θ.
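The remark about idempotent most general unifiers can be checked on a small example. The {variable: term} encoding of substitutions and the helper names below are illustrative:

```python
# Toy sketch: for an idempotent mgu sigma and any unifier theta of the same
# unification problem, theta o sigma = theta.

def is_var(t):
    return isinstance(t, str)

def subst(t, sigma):
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(subst(a, sigma) for a in t[1:])

def compose(theta, sigma):
    """(theta o sigma)(x) = theta(sigma(x)): apply sigma first, then theta."""
    result = {x: subst(t, theta) for x, t in sigma.items()}
    for x, t in theta.items():
        result.setdefault(x, t)
    return result

# Unify f(x, b) with f(y, y): an idempotent mgu is {x -> b, y -> b}.
sigma = {"x": ("b",), "y": ("b",)}                 # idempotent mgu
theta = {"x": ("b",), "y": ("b",), "z": ("c",)}    # some other unifier
assert compose(sigma, sigma) == sigma              # idempotency
assert compose(theta, sigma) == theta              # theta o sigma = theta
```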

A clause is a finite multiset of equational literals s ≈ t or s ≉ t, written as a disjunction. The empty clause is denoted by ⊥. We call a literal L in a clause C ∨ L maximal w.r.t. a strict literal ordering, if there is no literal in C that is larger than L; we call it strictly maximal, if there is no literal in C that is larger than or equal to L.

We write a rewrite rule as u → v. Semantically, a rule u → v is equivalent to an equation u ≈ v. If R is a rewrite system, that is, a set of rewrite rules, we write s →R t to indicate that the term s can be reduced to the term t by applying a rule from R. A rewrite system is called left-reduced, if there is no rule u → v ∈ R such that u can be reduced by a rule from R \ {u → v}.
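For ground rewrite systems, left-reducedness amounts to a simple subterm check, sketched here on the toy tuple representation of ground terms (the names are illustrative):

```python
# Toy sketch: a ground rewrite system {lhs: rhs} is left-reduced iff no
# left-hand side contains another left-hand side as a subterm.

def subterms(t):
    yield t
    for a in t[1:]:
        yield from subterms(a)

def left_reduced(R):
    """No lhs is reducible by a rule other than itself."""
    return all(
        all(s not in set(R) - {l} for s in subterms(l))
        for l in R
    )

b = ("b",)
R = {("f", b): b, ("g", ("g", b)): b}
print(left_reduced(R))  # True: neither lhs occurs inside the other
print(left_reduced({("f", b): b, b: ("c",)}))  # False: f(b) contains the lhs b
```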

**The Superposition Calculus.** We summarize the key elements of Bachmair's and Ganzinger's Superposition Calculus [2].

Let ≻ be a reduction ordering that is total on ground terms. We extend ≻ to an ordering on literals, denoted by ≻L,¹ by mapping positive literals s ≈ t to multisets {s, t} and negative literals s ≉ t to multisets {s, s, t, t} and by comparing the resulting multisets using the multiset extension of ≻. We extend the literal ordering ≻L to an ordering on clauses, denoted by ≻C, by comparing the multisets of literals in these clauses using the multiset extension of ≻L.

¹ There are several equivalent ways to define ≻L.
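The literal ordering just described can be sketched directly: map each literal to its term multiset and compare via the multiset extension of the term ordering. The weight-based `gt` below is a simplistic stand-in for a full reduction ordering, and all names are illustrative:

```python
# Toy sketch of the literal ordering: s ~ t  maps to {s, t},
# s !~ t maps to {s, s, t, t}; literals are compared by the multiset
# extension of the term ordering gt.
from collections import Counter

def mult_gt(gt, M, N):
    """Multiset extension: M > N iff M != N and every element occurring more
    often in N is dominated by some element occurring more often in M."""
    M, N = Counter(M), Counter(N)
    if M == N:
        return False
    return all(any(gt(m, n) for m in (M - N)) for n in (N - M))

def lit_ms(lit):
    pos, s, t = lit
    return [s, t] if pos else [s, s, t, t]

def lit_gt(gt, l1, l2):
    return mult_gt(gt, lit_ms(l1), lit_ms(l2))

def maximal(gt, lit, rest):
    """No literal in rest is larger than lit."""
    return not any(lit_gt(gt, l, lit) for l in rest)

def strictly_maximal(gt, lit, rest):
    """No literal in rest is larger than or equal to lit (equality is tested
    on the term multisets, which identifies s ~ t with t ~ s)."""
    return not any(lit_gt(gt, l, lit) or Counter(lit_ms(l)) == Counter(lit_ms(lit))
                   for l in rest)

# Ground example with KBO-style weights: in  c !~ b \/ f(c) ~ d  the positive
# literal is (strictly) maximal, since f(c) dominates c and b.
w = {"c": 1, "b": 2, "f(c)": 3, "d": 1}
gt = lambda s, t: w[s] > w[t]
neg = (False, "c", "b")
pos = (True, "f(c)", "d")
print(maximal(gt, pos, [neg]), strictly_maximal(gt, pos, [neg]))  # True True
print(maximal(gt, neg, [pos]))  # False
```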

The inference system of the Superposition Calculus consists of the rules Superposition, Equality Resolution, and Equality Factoring.²

Superposition:

$$\frac{D' \lor t \approx t' \qquad C' \lor L[u]}{(D' \lor C' \lor L[t'])\sigma}$$

where u is not a variable; σ = mgu(t, u); (C′ ∨ L[u])σ ⋠C (D′ ∨ t ≈ t′)σ; (t ≈ t′)σ is strictly maximal in (D′ ∨ t ≈ t′)σ; either L[u] is a positive literal s[u] ≈ s′ and L[u]σ is strictly maximal in (C′ ∨ L[u])σ, or L[u] is a negative literal s[u] ≉ s′ and L[u]σ is maximal in (C′ ∨ L[u])σ; tσ ⋠ t′σ; and sσ ⋠ s′σ.

$$\text{Equality Resolution:} \qquad \frac{C' \lor s \not\approx s'}{C'\sigma}$$

where σ = mgu(s, s′) and (s ≉ s′)σ is maximal in (C′ ∨ s ≉ s′)σ.

$$\text{Equality Factoring:} \qquad \frac{C' \lor r \approx r' \lor s \approx s'}{(C' \lor s' \not\approx r' \lor r \approx r')\sigma}$$

where σ = mgu(s, r); sσ ⋠ s′σ; and (s ≈ s′)σ is maximal in (C′ ∨ r ≈ r′ ∨ s ≈ s′)σ.

The ordering restrictions can be overridden using *selection functions* that determine for each clause a subset of the negative literals that are available for inferences. For simplicity, we leave out this refinement in the rest of this paper. We emphasize, however, that all results that we present here hold also in the presence of selection functions; the required modifications of the proofs are straightforward.

A ground clause C is called (classically) redundant w.r.t. a set of ground clauses N, if it follows from clauses in N that are smaller than C w.r.t. ≺C. A clause is called (classically) redundant w.r.t. a set of clauses N, if all its ground instances are redundant w.r.t. the set of ground instances of clauses in N.³ A ground inference with conclusion C0 and right (or only) premise C is called redundant w.r.t. a set of ground clauses N, if one of its premises is redundant w.r.t. N, or if C0 follows from clauses in N that are smaller than C. An inference is called redundant w.r.t. a set of clauses N, if all its ground instances are redundant w.r.t. the set of ground instances of clauses in N.

Redundancy of clauses and inferences as defined above is a redundancy criterion in the sense of (Waldmann et al. [8]). It justifies typical deletion and simplification techniques such as the deletion of tautological clauses, subsumption deletion (i.e., the deletion of a clause Cσ ∨ D in the presence of a clause C), or demodulation (i.e., the replacement of a clause C[sσ] by C[tσ] in the presence of a unit clause s ≈ t, provided that sσ ≻ tσ).

² The Equality Factoring rule can be replaced by the Ordered Factoring and the Merging Paramodulation rule. Our results hold also for this variant.

³ Note that "redundancy" is called "compositeness" in Bachmair and Ganzinger's *J. Log. Comput.* article [2]. In later papers the standard terminology has changed.

# **3 Incompleteness**

There are two special cases where Destructive Equality Resolution (DER) is justified by the classical redundancy criterion: First, if t is the smallest constant in the signature, then every ground instance (x ≉ t ∨ C)θ follows from the smaller ground instance C{x → t}θ. Second, if t is another variable y, then every ground instance (x ≉ y ∨ C)θ follows from the smaller ground instance C{x → y}{y → s}θ, where s is the smaller of xθ and yθ.

But it is easy to see that this does not work in general: Let ≻ be a Knuth-Bendix ordering with weights w(f) = w(b) = 2, w(c) = w(d) = 1, and w(z) = 1 for all variables z, and let C be the clause x ≉ b ∨ f(x) ≈ d. Then DER applied to C yields D = f(b) ≈ d. Now consider the substitution θ = {x → c}. The ground instance Cθ = c ≉ b ∨ f(c) ≈ d is a logical consequence of D, but since it is smaller than D itself, D makes neither Cθ nor C redundant.
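The weight comparison behind this counterexample can be checked mechanically. The sketch below computes KBO weights of ground terms on a toy tuple representation with the weight function from the text (all names are illustrative):

```python
# Toy sketch: with w(f) = w(b) = 2 and w(c) = w(d) = 1, every term of the
# ground instance C{x -> c} = c !~ b \/ f(c) ~ d is lighter than f(b),
# the maximal term of D = f(b) ~ d.

W = {"f": 2, "b": 2, "c": 1, "d": 1}

def weight(t):
    """KBO weight of a ground term: sum of the weights of its symbols."""
    return W[t[0]] + sum(weight(a) for a in t[1:])

b, c, d = ("b",), ("c",), ("d",)
f = lambda t: ("f", t)

print(weight(f(b)))  # 4
print(weight(f(c)))  # 3
# All terms of C{x -> c} are lighter than f(b), so Cθ is KBO-smaller than D:
assert weight(f(b)) > max(weight(s) for s in [c, b, f(c), d])
```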

Moreover, the following example demonstrates that the Superposition Calculus indeed becomes incomplete if we add DER as a simplification rule, i.e., if we extend the definition of redundancy in such a way that the conclusion of DER renders the premise redundant.

**Example 1.** Let ≻ be a Knuth-Bendix ordering with weights w(f) = 4, w(g) = 3, w(b) = 4, w(b′) = 2, w(c) = w(c′) = w(d) = 1, and w(z) = 1 for all variables z, and let N be the set of clauses

$$\begin{array}{ll}
C_1 = \underline{f(x,d)} \approx x & C_4 = \underline{g(b')} \not\approx g(c') \\
C_2 = \underline{f(x,y)} \not\approx b \lor g(x) \approx d & C_5 = \underline{g(c)} \not\approx d \\
C_3 = \underline{b} \approx c \lor b' \approx c' &
\end{array}$$

where all the maximal terms in maximal literals are underlined.

At this point, neither demodulation nor subsumption is possible. The only inference that must be performed is Superposition between C1 and C2, yielding

$$C_6 = x \not\approx b \lor g(x) \approx d$$

and by using DER, C6 is replaced by

$$C_7 = g(b) \approx d$$

We could now continue with a Superposition between C3 and C7, followed by a Superposition with C5, followed by Equality Resolution, and obtain

$$C_8 = b' \approx c'$$

from which we can derive the empty clause by Superposition with C4 and once more by Equality Resolution. However, clause C7 is in fact redundant: The ground clauses C3 and C4 imply b ≈ c; therefore C7 follows from C3, C4, and the ground instances

$$\begin{array}{rcl} C\_1 \{ x \mapsto c \} &=& f(c, d) \approx c \\ C\_2 \{ x \mapsto c, \, y \mapsto d \} &=& f(c, d) \not\approx b \lor g(c) \approx d \end{array}$$

Because all terms in these clauses are smaller than the maximal term g(b) of C7, all these clauses are smaller than C7. Since C7 is redundant, we are allowed to delete it, and then no further inferences are possible anymore. Therefore the clause set N = {C1,...,C5} is saturated, even though it is inconsistent and does not contain the empty clause, which implies that the calculus is not refutationally complete anymore.

# **4 Completeness, Part I: The Horn Case**

# **4.1 The Idea**

On the one hand, Example 1 demonstrates that we cannot simply extend the standard redundancy criterion of the Superposition Calculus with DER without destroying refutational completeness, and that this holds even if we impose a particular strategy on simplification steps (say, that simplifications must be performed eagerly and that demodulation and subsumption have a higher precedence than DER). On the other hand, Example 1 is of course highly unrealistic: Even though clause C7 is redundant w.r.t. the clauses C1, C2, C3, and C4, no reasonable superposition prover would ever detect this, in particular since doing so would require inventing the instance C2{x → c, y → d} of C2, which is not in any way syntactically related to C7.⁴

This raises the question whether DER still destroys refutational completeness when we restrict the other deletion and simplification techniques to those that are typically implemented in superposition provers, such as tautology detection, demodulation, or subsumption. Are there alternative redundancy criteria that are refutationally complete together with the Superposition Calculus and that justify DER as well as (all/most) commonly implemented deletion and simplification techniques? Given the usual structure of the inductive completeness proofs for saturation calculi, developing such a redundancy criterion would in particular mean finding a suitable clause ordering with respect to which certain clauses have to be smaller than others. The following example illustrates a fundamental problem that we have to deal with:

**Example 2.** Let ≻ be a Knuth-Bendix ordering with weights w(f) = w(g) = w(h) = w(c) = 1, w(b) = 2, and w(z) = 1 for all variables z. Consider the following set of clauses:

⁴ In fact, a prover might use SMT-style heuristic grounding of non-ground clauses, but then finding the contradiction turns out to be easier than proving the redundancy of C7.

$$\begin{array}{ll}
D_1 = h(x) \approx x & C_1 = h(x) \not\approx b \lor f(g(x)) \approx c \\
 & C_2 = x \not\approx b \lor f(g(x)) \approx c \\
D_3 = h(c) \not\approx b \lor g(b) \approx g(c) & C_3 = f(g(b)) \approx c \\
 & C_4 = h(c) \not\approx b \lor f(g(c)) \approx c
\end{array}$$

Demodulation of C1 using D1 yields C2, and if we want Demodulation to be a simplification, then every ground instance C1θ should be larger than the corresponding ground instance C2θ in the clause ordering.

DER of C2 yields C3, and if we want DER to be a simplification, then every ground instance C2θ should be larger than C3θ = C3.

A Superposition inference between D3 and C3 yields C4. The inductive completeness proof for the calculus relies on the fact that the conclusion of an inference is smaller than the largest premise, so C3 should be larger than C4.

By transitivity we obtain that every ground instance C1θ should be larger than C4 in the clause ordering. The clause C4, however, *is* a ground instance of C1, which is clearly a contradiction.

On the other hand, a closer inspection reveals that, depending on the limit rewrite system R∗ that is produced in the completeness proof for the Superposition Calculus, the Superposition inference between D3 and C3 is only needed when D3 produces the rewrite rule g(b) → g(c) ∈ R∗, and that the only critical case for DER is the one where b can be reduced by some rule in R∗. Since the limit rewrite system R∗ is by construction left-reduced, these two conditions are mutually exclusive. This observation indicates that we might be able to find a suitable clause ordering if we choose it depending on R∗.

# **4.2 Ground Case**

**The Normalization Closure Ordering.** Let ≻ be a reduction ordering that is total on ground terms. Let R be a left-reduced ground rewrite system contained in ≻.

For technical reasons that will become clear later, we design our ground superposition calculus in such a way that it operates on ground closures (C · θ). Logically, a ground closure (C · θ) is equivalent to a ground instance Cθ, but an ordering may treat two closures that represent the same ground instance in different ways. We consider closures up to α-renaming and ignore the behavior of θ on variables that do not occur in C, that is, we treat closures (C1 · θ1) and (C2 · θ2) as equal whenever C1 and C2 are equal up to bijective variable renaming and C1θ1 = C2θ2. We also identify (⊥ · θ) and ⊥.

Intuitively, in order to compare ground closures C · θ, we normalize all terms occurring in Cθ with R, we compute the multiset of all the redexes occurring during the normalization and all the resulting normal forms, and we compare these multisets using the multiset extension of . Since we would like to give redexes and normal forms in negative literals a slightly larger weight than redexes and normal forms in positive literals, and redexes in positive literals below the top a slightly larger weight than redexes at the top, we combine each of these terms with a label (0 for positive at the top, 1 for positive below the top, 2 for negative). Moreover, whenever some term t occurs several times in C as a subterm, we want to count the redexes resulting from the normalization of tθ only once (with the maximum of the labels). The reason for this is that DER can produce several copies of the same term t in a clause if the variable to be eliminated occurs several times in the clause; by counting all redexes stemming from t only once, we ensure that this does not increase the total number of redexes. Formally, we first compute the set (not multiset!) of all subterms t of C, so that duplicates are deleted, and then compute the multiset of redexes for all terms tθ (and analogously for terms occurring at the top of a literal).

**Definition 3.** We define the subterm sets ss⁺>ε(C) and ss⁻(C) and the topterm sets ts⁺(C) and ts⁻(C) of a clause C by

$$\begin{array}{l}
\mathrm{ss}^-(C) = \{ t \mid C = C' \lor s[t]_p \not\approx s' \} \\
\mathrm{ss}^+_{>\varepsilon}(C) = \{ t \mid C = C' \lor s[t]_p \approx s',\ p > \varepsilon \} \\
\mathrm{ts}^-(C) = \{ t \mid C = C' \lor t \not\approx t' \} \\
\mathrm{ts}^+(C) = \{ t \mid C = C' \lor t \approx t' \}
\end{array}$$

We define the labeled subterm set lss(C) and the labeled topterm set lts(C) of a clause C by

$$\begin{array}{l}
\mathrm{lss}(C) = \{ (t,2) \mid t \in \mathrm{ss}^-(C) \} \\
\qquad\qquad \cup\; \{ (t,1) \mid t \in \mathrm{ss}^+_{>\varepsilon}(C) \setminus \mathrm{ss}^-(C) \} \\
\qquad\qquad \cup\; \{ (t,0) \mid t \in \mathrm{ts}^+(C) \setminus (\mathrm{ss}^+_{>\varepsilon}(C) \cup \mathrm{ss}^-(C)) \} \\
\mathrm{lts}(C) = \{ (t,2) \mid t \in \mathrm{ts}^-(C) \} \cup \{ (t,0) \mid t \in \mathrm{ts}^+(C) \setminus \mathrm{ts}^-(C) \}
\end{array}$$

We define the R-redex multiset rmR(t, m) of a labeled ground term (t, m) with m ∈ {0, 1, 2} by

rmR(t, m) = ∅ if t is R-irreducible;
rmR(t, m) = {(u, m)} ∪ rmR(t′, m) if t →R t′ using the rule u → v ∈ R at position p, and p = ε or m > 0;
rmR(t, m) = {(u, 1)} ∪ rmR(t′, m) if t →R t′ using the rule u → v ∈ R at position p, and p > ε and m = 0.

**Lemma 4.** *For every left-reduced ground rewrite system* R *contained in* ≻*,* rmR(t, m) *is well-defined.*

**Definition 5.** We define the R-normalization multiset nmR(C · θ) of a ground closure (C · θ) by

$$\mathrm{nm}_R(C \cdot \theta) = \bigcup_{(f(t_1,\ldots,t_n),\,m)\,\in\,\mathrm{lss}(C)} \mathrm{rm}_R(f(t_1\theta{\downarrow_R},\ldots,t_n\theta{\downarrow_R}),\,m) \;\cup \bigcup_{(x,\,m)\,\in\,\mathrm{lss}(C)} \mathrm{rm}_R(x\theta,\,m) \;\cup \bigcup_{(t,\,m)\,\in\,\mathrm{lts}(C)} \{(t\theta{\downarrow_R},\,m)\}$$

**Example 6.** Let C = h(g(g(x))) ≈ f(f(b)); let θ = {x → b}. Then lss(C) = {(h(g(g(x))), 0), (g(g(x)), 1), (g(x), 1), (x, 1), (f(f(b)), 0), (f(b), 1), (b, 1)} and lts(C) = {(h(g(g(x))), 0), (f(f(b)), 0)}.

Let R = {f(b) → b, g(g(b)) → b}. Then nmR(C · θ) = {(g(g(b)), 1), (f(b), 1), (f(b), 0), (h(b), 0), (b, 0)}, where the first element is a redex from the normalization of g(g(x))θ, the second from the normalization of f(b)θ, the third from the normalization of f(f(b))θ. The remaining elements are the normal forms of h(g(g(x)))θ and f(f(b))θ.
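The redex-multiset computation can be sketched on the toy tuple representation of ground terms. The rule choice below is leftmost-outermost; by Lemma 4 the resulting multiset does not depend on this choice for left-reduced R. All function names are illustrative:

```python
# Toy sketch of rm_R from the definition above, for ground terms as nested
# tuples and a ground rewrite system R given as a {lhs: rhs} map.
from collections import Counter

def step(t, R):
    """One R-rewrite step: return (redex, redex_is_at_top, reduced_term), or
    None if t is R-irreducible."""
    if t in R:
        return t, True, R[t]
    for i, a in enumerate(t[1:], start=1):
        res = step(a, R)
        if res is not None:
            u, _, a2 = res
            return u, False, t[:i] + (a2,) + t[i + 1:]
    return None

def rm(t, m, R):
    """R-redex multiset of the labeled term (t, m)."""
    res = step(t, R)
    if res is None:
        return Counter()                    # t is R-irreducible
    u, at_top, t2 = res
    label = m if (at_top or m > 0) else 1   # below-top redexes of label-0 terms get label 1
    return Counter({(u, label): 1}) + rm(t2, m, R)

b = ("b",)
g = lambda t: ("g", t)
f = lambda t: ("f", t)
R = {f(b): b, g(g(b)): b}

# The redexes from Example 6: f(f(b)) with label 0 yields (f(b), 1) and (f(b), 0),
# and g(g(b)) with label 1 yields (g(g(b)), 1).
print(rm(f(f(b)), 0, R))
print(rm(g(g(b)), 1, R))
```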

The R-normalization closure ordering ≻≻R compares ground closures (C · θ1) and (D · θ2) using a lexicographic combination of three orderings:


**Lemma 7.** *If* (C · θ) *and* (Cσ · θ′) *are ground closures, such that* Cθ = Cσθ′*, and* C *and* Cσ *are not equal up to bijective renaming, then* (C · θ) ≻≻R (Cσ · θ′)*.*

**Example 8.** Let C = h(f(x)) ≈ f(y); let θ = {x → b, y → b}; let θ′ = {x → b}; let σ = {y → x}. Let R = {f(b) → b}.

Then nmR(C · θ) = {(f(b), 1), (f(b), 0), (h(b), 0), (b, 0)} and nmR(Cσ · θ′) = {(f(b), 1), (h(b), 0), (b, 0)}, and therefore (C · θ) ≻≻R (Cσ · θ′). The subterm f(x) occurs twice in Cσ (with labels 0 and 1), but only once in lss(Cσ) (with the larger of the two labels), and the same holds for the redex f(b) stemming from f(x)θ′ in nmR(Cσ · θ′).

**Parallel Superposition.** In the normalization closure ordering, redexes and normal forms stemming from several occurrences of the same term u in a closure (C · θ) are counted only once. When we perform a Superposition inference, this fact leads to a small problem: Consider a closure (C[u, u] · θ). In the R-normalization multiset of this closure, the redexes stemming from the two copies of uθ are counted only once. Now suppose that one of the two copies of u is replaced by a smaller term v in a Superposition inference. The resulting closure (C[v, u] · θ) should be smaller than the original one, but it is not: The redexes stemming from uθ are still counted once, and additionally, the R-normalization multiset now contains the redexes stemming from vθ.

There is an easy fix for this problem, though: We have to replace the ordinary Superposition rule by a Parallel Superposition rule, in which *all* copies of a term u in a clause C are replaced whenever one copy occurs in a maximal side of a maximal literal. Note that this is a well-known optimization that superposition provers implement (or should implement) anyhow.

We need one further modification of the inference rule: The side conditions of the superposition calculus use the traditional clause ordering ≻C, but our completeness proof and redundancy criterion will be based on the orderings ≻≻R. The difference between these orderings becomes relevant in particular when we consider (Parallel) Superposition inferences where the clauses overlap at the top of a positive literal. In this case, the ≻≻R-smaller of the two premises may actually be the ≻C-larger one. Therefore, the usual condition that the left premise of a (Parallel) Superposition inference has to be the ≺C-minimal premise has to be dropped for these inferences.

Parallel Superposition:

$$\frac{D' \lor t \approx t' \qquad C}{(D' \lor C[t', \ldots, t']_{p_1,\ldots,p_k})\sigma}$$

where u is not a variable; σ = mgu(t, u); p1,...,pk are all the occurrences of u in C; if one of the occurrences of u in C is in a negative literal or below the top in a positive literal, then Cσ ⋠C (D′ ∨ t ≈ t′)σ; (t ≈ t′)σ is strictly maximal in (D′ ∨ t ≈ t′)σ; either one of the occurrences of u in C is in a positive literal L[u] = s[u] ≈ s′ such that L[u]σ is strictly maximal in Cσ, or one of the occurrences of u in C is in a negative literal L[u] = s[u] ≉ s′ such that L[u]σ is maximal in Cσ; tσ ⋠ t′σ; and sσ ⋠ s′σ.

**Ground Closure Horn Superposition.** We will show that our calculus is refutationally complete for Horn clauses by lifting a similar result for ground closure Horn superposition. We emphasize that our calculus is not a basic or constraint calculus such as (Bachmair et al. [3]) or (Nieuwenhuis and Rubio [5]). Even though the ground version that we present here operates on closures, it is essentially a rephrased version of the standard ground Superposition Calculus. This explains why we also have to consider superpositions below variable positions.

The ground closure calculus uses the following three inference rules. We assume that in binary inferences the variables in the premises (D · θ2) and (C · θ1) are renamed in such a way that C and D do not share variables. We can then assume without loss of generality that the substitutions θ2 and θ1 agree.

Parallel Superposition I:

$$\frac{(D' \lor t \approx t' \cdot \theta) \qquad (C \cdot \theta)}{((D' \lor C[t', \ldots, t']_{p_1,\ldots,p_k})\sigma \cdot \theta)}$$

where u is not a variable; tθ = uθ; σ = mgu(t, u); p1,...,pk are all the occurrences of u in C; if one of the occurrences of u in C is in a negative literal or below the top in a positive literal, then (D′ ∨ t ≈ t′)θ ≺C Cθ; one of the occurrences of u in C is either in a positive literal s[u] ≈ s′ such that (s[u] ≈ s′)θ is strictly maximal in Cθ, or in a negative literal s[u] ≉ s′ such that (s[u] ≉ s′)θ is maximal in Cθ; s[u]θ ≻ s′θ; (t ≈ t′)θ is strictly maximal in (D′ ∨ t ≈ t′)θ; and tθ ≻ t′θ.

Parallel Superposition II:

$$\frac{(D' \lor t \approx t' \cdot \theta) \qquad (C \cdot \theta)}{((D' \lor C) \cdot \theta[x \mapsto u[t'\theta]])}$$

where x is a variable of C; xθ = u[tθ]; if one of the occurrences of x in C is in a negative literal or below the top in a positive literal, then (D′ ∨ t ≈ t′)θ ≺C Cθ; one of the occurrences of x in C is either in a positive literal s[x] ≈ s′ such that (s[x] ≈ s′)θ is strictly maximal in Cθ, or in a negative literal s[x] ≉ s′ such that (s[x] ≉ s′)θ is maximal in Cθ; s[x]θ ≻ s′θ; (t ≈ t′)θ is strictly maximal in (D′ ∨ t ≈ t′)θ; and tθ ≻ t′θ.

$$\text{Equality Resolution:} \qquad \frac{(C' \lor s \not\approx s' \cdot \theta)}{(C'\sigma \cdot \theta)}$$

where sθ = s′θ; σ = mgu(s, s′); and (s ≉ s′)θ is maximal in (C′ ∨ s ≉ s′)θ.

The following lemmas compare the conclusion concl(ι) of an inference ι with its right or only premise:

**Lemma 9.** *Let* ι *be a ground* Equality Resolution *inference. Then* concl(ι) *is* ≺≺R*-smaller than its premise.*

**Lemma 10.** *Let* ι *be a ground* Parallel Superposition *inference*

$$\frac{(D' \lor t \approx t' \cdot \theta) \qquad (C \cdot \theta)}{((D' \lor C[t', \ldots, t']_{p_1,\ldots,p_k})\sigma \cdot \theta)}$$

*with* tθ = uθ *and* σ = mgu(t, u)*, or*

$$\frac{(D' \lor t \approx t' \cdot \theta) \qquad (C \cdot \theta)}{((D' \lor C) \cdot \theta[x \mapsto u[t'\theta]])}$$

*with* xθ = u[tθ]*. If* (tθ → t′θ) ∈ R*, then* concl(ι) *is* ≺≺R*-smaller than* (C · θ)*.*

*Proof.* Since tθ is replaced by t′θ at all occurrences of u, or at or below all occurrences of x in C, one copy of the redex tθ is removed from nmR(C · θ). Moreover, all terms in D′θ are smaller than tθ, and consequently all redexes stemming from D′θ are smaller than tθ. Therefore nmR(C · θ) is larger than nmR(D′ ∨ C[t′,...,t′]p1,...,pk · θ) or nmR(D′ ∨ C · θ[x → u[t′θ]]). In the second case, this implies (C · θ) ≻≻R concl(ι) immediately. In the first case, it implies (C · θ) ≻≻R (D′ ∨ C[t′,...,t′]p1,...,pk · θ), and (C · θ) ≻≻R concl(ι) follows using Lemma 7.

**Redundancy.** We will now construct a redundancy criterion for ground closure Horn superposition that is based on the orderings ≻≻R.

**Definition 11.** Let N be a set of ground closures. A ground closure (C · θ) is called redundant w.r.t. N, if for every left-reduced ground rewrite system R contained in ≻ we have (i) R |= (C · θ) or (ii) there exists a ground closure (D · θ′) ∈ N such that (D · θ′) ≺≺R (C · θ) and R ⊭ (D · θ′).

**Definition 12.** Let N be a set of ground closures. A ground inference ι with right or only premise (C · θ) is called redundant w.r.t. N, if for every left-reduced ground rewrite system R contained in ≻ we have (i) R |= concl(ι), or (ii) there exists a ground closure (C′ · θ′) ∈ N such that (C′ · θ′) ≺≺R (C · θ) and R ⊭ (C′ · θ′), or (iii) ι is a Superposition inference with left premise (D′ ∨ t ≈ t′ · θ) where tθ ≻ t′θ and (tθ → t′θ) ∉ R, or (iv) ι is a Superposition inference where the left premise is not the ≺≺R-minimal premise.

Intuitively, a redundant closure cannot be a minimal counterexample, i.e., a minimal closure that is false in R. A redundant inference is either irrelevant for the completeness proof (cases (iii) and (iv)), or its conclusion (and thus its right or only premise) is true in R, provided that all closures that are ≺≺_R-smaller than the right or only premise are true in R (cases (i) and (ii)) – which means that the inference can be used to show that the right or only premise cannot be a minimal counterexample.

We denote the set of redundant closures w.r.t. N by *Red*_C(N) and the set of redundant inferences by *Red*_I(N).

**Example 13.** Let ≻ be a KBO where all symbols have weight 1. Let C = g(b) ≉ c ∨ f(c) ≉ d and C′ = f(g(b)) ≉ d. Then the closure (C · ∅) is redundant w.r.t. {(C′ · ∅)}: Let R be a left-reduced ground rewrite system contained in ≻. Assume that C is false in R. Then g(b) and c have the same R-normal form. Consequently, every redex or normal form in nm_R(C′ · ∅) was already present in nm_R(C · ∅). Moreover, the labeled normal form (c↓_R, 2) that is present in nm_R(C · ∅) is missing in nm_R(C′ · ∅). Therefore (C · ∅) ≻≻_R (C′ · ∅). Besides, if (C · ∅) is false in R, then (C′ · ∅) is false as well.

Note that C ≺_C C′, therefore C is not classically redundant w.r.t. {C′}.

**Lemma 14.** *(Red_I, Red_C) is a redundancy criterion in the sense of Waldmann et al. [8], that is, (1) if N |= ⊥, then N \ Red_C(N) |= ⊥; (2) if N ⊆ N′, then Red_C(N) ⊆ Red_C(N′) and Red_I(N) ⊆ Red_I(N′); (3) if N′ ⊆ Red_C(N), then Red_C(N) ⊆ Red_C(N \ N′) and Red_I(N) ⊆ Red_I(N \ N′); and (4) if ι is an inference with conclusion in N, then ι ∈ Red_I(N).*

*Proof.* (1) Suppose that N \ *Red*_C(N) ⊭ ⊥. Then there exists a left-reduced ground rewrite system R contained in ≻ such that R |= N \ *Red*_C(N). We show that R |= N (which implies N ⊭ ⊥). Assume that R ⊭ N. Then there exists a closure (C · θ) ∈ N ∩ *Red*_C(N) such that R ⊭ (C · θ). By well-foundedness of ≺≺_R there exists a ≺≺_R-minimal closure (C · θ) with this property. By definition of *Red*_C(N), there must be a ground closure (D · θ) ∈ N such that (D · θ) ≺≺_R (C · θ) and R ⊭ (D · θ). By minimality of (C · θ), we get (D · θ) ∈ N \ *Red*_C(N), contradicting the initial assumption.

(2) Obvious.

(3) Let N′ ⊆ *Red*_C(N) and let (C · θ) ∈ *Red*_C(N). We show that (C · θ) ∈ *Red*_C(N \ N′). Choose R arbitrarily. If R |= (C · θ), we are done. Otherwise there exists a ground closure (D · θ) ∈ N such that (D · θ) ≺≺_R (C · θ) and R ⊭ (D · θ). By well-foundedness of ≺≺_R there exists a ≺≺_R-minimal closure (D · θ) with this property. If (D · θ) were contained in N′ and hence in *Red*_C(N), there would exist a ground closure (D′ · θ′) ∈ N such that (D′ · θ′) ≺≺_R (D · θ) and R ⊭ (D′ · θ′), contradicting minimality. Therefore (D · θ) ∈ N \ N′ as required. The second part of (3) is proved analogously.

(4) Let ι be an inference with concl(ι) ∈ N. Choose R arbitrarily. We have to show that ι satisfies part (i), (ii), (iii), or (iv) of Definition 12. Assume that (i), (iii), and (iv) do not hold. Then R ⊭ concl(ι), and by Lemmas 9 and 10, concl(ι) is ≺≺_R-smaller than the right or only premise of ι; therefore part (ii) is satisfied if we take concl(ι) as (C′ · θ′).

**Constructing a Candidate Interpretation.** In usual completeness proofs for superposition-like calculi, one constructs a candidate interpretation (a set of ground rewrite rules) for a saturated set of ground clauses by induction over the *clause ordering*. In our case, this is impossible since the limit *closure ordering* depends on the generated set of rewrite rules itself. We can still construct the candidate interpretation by induction over the *term ordering*, though: Instead of inspecting ground closures one by one as in the classical construction, we inspect all ground closures (C · θ) for which Cθ contains the maximal term s simultaneously, and if for at least one of them the usual conditions for productivity are satisfied, we choose the ≺≺_{R_s}-smallest one of these to extend R_s.

Let N be a set of ground closures. For every ground term s we define R_s = ⋃_{t ≺ s} E_t. Furthermore, we define E_s = {s → s′} if (C · θ) is the ≺≺_{R_s}-smallest closure in N such that C = C′ ∨ u ≈ u′, s = uθ is a strictly maximal term in Cθ, occurs only in a positive literal of Cθ, and is irreducible w.r.t. R_s, s′ = u′θ, Cθ is false in R_s, and s ≻ s′, provided that such a closure (C · θ) exists. We say that (C · θ) *produces* s → s′. If no such closure exists, we define E_s = ∅. Finally, we define R∗ = ⋃_t E_t.
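The small-to-large production discipline can be illustrated by a drastically simplified sketch for ground *unit* equations only (the paper works with full closures and the orderings ≺≺_{R_s}; here terms are nested tuples and the ordering is plain symbol-count weight, both our assumptions): candidate left-hand sides are visited in ascending weight, and s → s′ is produced only if s is irreducible w.r.t. the rules R_s collected so far.

```python
def size(t):
    """Term weight under a KBO where every symbol has weight 1."""
    return 1 + sum(size(a) for a in t[1:])

def reducible(t, rules):
    """Is some subterm of t the left-hand side of a rule?"""
    if any(t == lhs for lhs, _ in rules):
        return True
    return any(reducible(a, rules) for a in t[1:])

def candidate_rules(equations):
    rules = []
    # visit candidate left-hand sides s in ascending term weight
    for s, s_prime in sorted(equations, key=lambda e: size(e[0])):
        if size(s) > size(s_prime) and not reducible(s, rules):
            rules.append((s, s_prime))   # E_s = {s -> s'}
    return rules
```

For example, with the equations f(b) ≈ b and f(f(b)) ≈ c, only f(b) → b is produced: by the time f(f(b)) is considered, it is already reducible.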

The following two lemmas are proved as usual:

**Lemma 15.** *Let* s *be a ground term, let* (C · θ) *be a closure. If every term that occurs in negative literals of* Cθ *is smaller than* s *and every term that occurs in positive literals of* Cθ *is smaller than or equal to* s*, and if* R_s |= (C · θ)*, then* R∗ |= (C · θ)*.*

**Lemma 16.** *If a closure* (C′ ∨ u ≈ u′ · θ) *produces* uθ → u′θ*, then* R∗ |= (C′ ∨ u ≈ u′ · θ) *and* R∗ ⊭ (C′ · θ)*.*

**Lemma 17.** *Let* (C_1 · θ) *and* (C_2 · θ) *be two closures. If* s *is a strictly maximal term and occurs only positively in both* C_1θ *and* C_2θ*, then* (C_1 · θ) ≻≻_{R_s} (C_2 · θ) *if and only if* (C_1 · θ) ≻≻_{R∗} (C_2 · θ)*.*

**Lemma 18.** *Let* (D · θ) = (D′ ∨ t ≈ t′ · θ) *and* (C · θ) *be two closures in* N*. If* (D · θ) *produces* tθ → t′θ *in* R∗*, and* tθ *occurs in* Cθ *in a negative literal or below the top of a term in a positive literal, then* (D · θ) ≺≺_{R∗} (C · θ) *and* Dθ ≺_C Cθ*.*

**Lemma 19.** *Let* (D · θ) = (D′ ∨ t ≈ t′ · θ) *and* (C · θ) *be two closures in* N*. If* (D · θ) *produces* tθ → t′θ *in* R∗*,* tθ *occurs in* Cθ *at the top of the strictly maximal side of a positive maximal literal, and* R∗ ⊭ (C · θ)*, then* (D · θ) ≺≺_{R∗} (C · θ)*.*

We can now show that the Ground Closure Horn Superposition Calculus is refutationally complete:

**Theorem 20.** *Let* N *be a saturated set of ground closures that does not contain* (⊥ · θ)*. Then* R∗ |= N*.*

*Proof.* Suppose that R∗ ⊭ N. Let (C · θ) be the ≺≺_{R∗}-smallest closure in N such that R∗ ⊭ (C · θ).

Case 1: C = C′ ∨ s ≉ s′ and sθ ≉ s′θ is maximal in Cθ. By assumption, R∗ ⊭ sθ ≉ s′θ, hence sθ↓_{R∗} = s′θ↓_{R∗}.

Case 1.1: sθ = s′θ. Then there is an Equality Resolution inference from (C · θ) with conclusion (C′σ · θ′), where θ′ ◦ σ = θ. By saturation the inference is redundant, and by minimality of (C · θ) w.r.t. ≺≺_{R∗} this implies R∗ |= (C′σ · θ′). But then R∗ |= (C · θ), contradicting the assumption.

Case 1.2: sθ ≠ s′θ. W.l.o.g. let sθ ≻ s′θ. Then sθ must be reducible by a rule tθ → t′θ ∈ R∗, which has been produced by a closure (D · θ) = (D′ ∨ t ≈ t′ · θ) in N. By Lemma 18, (D · θ) ≺≺_{R∗} (C · θ) and Dθ ≺_C Cθ. If sθ and tθ overlap at a nonvariable position of s, there is a Parallel Superposition I inference ι between (D · θ) and (C · θ); otherwise they overlap at or below a variable position of s and there is a Parallel Superposition II inference ι with premises (D · θ) and (C · θ). By Lemma 16, R∗ ⊭ (D′ · θ). By saturation the inference is redundant, and by minimality of (C · θ) w.r.t. ≺≺_{R∗} we know that R∗ |= concl(ι). Since R∗ ⊭ (D′ · θ), this implies R∗ |= (C · θ), contradicting the assumption.

Case 2: Cθ = C′θ ∨ sθ ≈ s′θ and sθ ≈ s′θ is maximal in Cθ. By assumption, R∗ ⊭ sθ ≈ s′θ, hence sθ↓_{R∗} ≠ s′θ↓_{R∗}. W.l.o.g. let sθ ≻ s′θ.

Case 2.1: sθ is reducible by a rule tθ → t′θ ∈ R∗, which has been produced by a closure (D · θ) = (D′ ∨ t ≈ t′ · θ) in N. By Lemmas 18 and 19, we obtain (D · θ) ≺≺_{R∗} (C · θ), and, provided that tθ occurs in sθ below the top, also Dθ ≺_C Cθ. Therefore there is a Parallel Superposition (I or II) inference ι with left premise (D · θ) and right premise (C · θ), and we can derive a contradiction analogously to Case 1.2.

Case 2.2: It remains to consider the case that sθ is irreducible by R∗. Then sθ is also irreducible by R_{sθ}. Furthermore, by Lemma 17, ≻≻_{R_{sθ}} and ≻≻_{R∗} agree on all closures in which sθ is a strictly maximal term and occurs only positively. Therefore (C · θ) satisfies all conditions for productivity, hence R∗ |= (C · θ), contradicting the assumption.

# **4.3 Lifting**

It remains to lift the refutational completeness result for ground closure Horn superposition to the non-ground case.

If C is a general clause, we call every ground closure (C ·θ) a ground instance of C. If

$$\frac{C_n \quad \dots \quad C_1}{C_0}$$

is a general inference and

$$\frac{(C_n \cdot \theta) \quad \dots \quad (C_1 \cdot \theta)}{(C_0 \cdot \theta)}$$

is a ground inference, we call the latter a ground instance of the former. The function G maps every general clause C and every general inference ι to the set of its ground instances. We extend G to sets of clauses N and sets of inferences I by defining G(N) := ⋃_{C ∈ N} G(C) and G(I) := ⋃_{ι ∈ I} G(ι).

**Lemma 21.** G *is a grounding function, that is, (1)* G(⊥) = {⊥}*; (2) if* ⊥ ∈ G(C)*, then* C = ⊥*; and (3) for every inference* ι*,* G(ι) ⊆ *Red*_I(G(concl(ι)))*.*

The grounding function G induces a lifted redundancy criterion (*Red*_I^G, *Red*_C^G), where ι ∈ *Red*_I^G(N) if and only if G(ι) ⊆ *Red*_I(G(N)), and C ∈ *Red*_C^G(N) if and only if G(C) ⊆ *Red*_C(G(N)).

**Lemma 22.** *Every ground inference from closures in* G(N) *is a ground instance of an inference from* N *or contained in Red*_I(G(N))*.*

*Proof.* Let ι be a ground inference from closures in G(N). If ι is a Parallel Superposition I or Equality Resolution inference, then it is a ground instance of a Parallel Superposition or Equality Resolution inference with premises in N. It remains to consider Parallel Superposition II inferences

$$\frac{(D' \lor t \approx t' \cdot \theta) \qquad (C \cdot \theta)}{(D' \lor C \cdot \theta[x \mapsto u[t'\theta]])}$$

with xθ = u[tθ]. Let R be a left-reduced ground rewrite system contained in ≻. If (tθ → t′θ) ∉ R, then ι satisfies case (iii) of Definition 12. Otherwise (C · θ[x ↦ u[t′θ]]) is a ground instance of the clause C ∈ N and ≺≺_R-smaller than the premise (C · θ). If R |= (C · θ[x ↦ u[t′θ]]), then R |= concl(ι), so ι satisfies case (i) of Definition 12; otherwise it satisfies case (ii) of Definition 12.

**Theorem 23.** *The Horn Superposition Calculus (using* Parallel Superposition*) together with the lifted redundancy criterion* (*Red*_I^G, *Red*_C^G) *is refutationally complete.*

*Proof.* This follows immediately from Lemma 22 and Theorem 32 of Waldmann et al. [8].

### **4.4 Deletion and Simplification**

It turns out that DER, as well as most concrete deletion and simplification techniques that are implemented in state-of-the-art superposition provers, is in fact covered by our abstract redundancy criterion. There are some unfortunate exceptions, however.

**DER.** Destructive Equality Resolution, that is, the replacement of a clause x ≉ t ∨ C with x ∉ vars(t) by C{x ↦ t}, is covered by the redundancy criterion. To see this, consider an arbitrary ground instance (x ≉ t ∨ C · θ) of x ≉ t ∨ C. Let R be a left-reduced ground rewrite system contained in ≻. Assume that the instance is false in R. Then xθ and tθ have the same R-normal form. Consequently, any redex or normal form in nm_R(C{x ↦ t} · θ) was already present in nm_R(x ≉ t ∨ C · θ) (possibly with a larger label, if x occurs only positively in C). Moreover, the labeled normal form (xθ↓_R, 2) that is present in nm_R(x ≉ t ∨ C · θ) is missing in nm_R(C{x ↦ t} · θ). Therefore (x ≉ t ∨ C · θ) ≻≻_R (C{x ↦ t} · θ). Besides, both closures clearly have the same truth value in R, that is, false.
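One DER step can be sketched as follows (our encoding, not the paper's data structures: terms as nested tuples, variables as plain strings, a clause as a pair of lists of negative and positive literals):

```python
def vars_of(t):
    """Free variables of a term (variables are plain strings)."""
    if isinstance(t, str):
        return {t}
    return {v for a in t[1:] for v in vars_of(a)}

def subst(t, x, r):
    """t with every occurrence of variable x replaced by r."""
    if t == x:
        return r
    if isinstance(t, str):
        return t
    return (t[0],) + tuple(subst(a, x, r) for a in t[1:])

def der(neg_lits, pos_lits):
    """Replace  x ≉ t ∨ C  (with x ∉ vars(t)) by  C{x -> t},  if possible."""
    for i, (l, r) in enumerate(neg_lits):
        for x, t in ((l, r), (r, l)):
            if isinstance(x, str) and x not in vars_of(t):
                rest = neg_lits[:i] + neg_lits[i + 1:]
                app = lambda lit: (subst(lit[0], x, t), subst(lit[1], x, t))
                return [app(lit) for lit in rest], [app(lit) for lit in pos_lits]
    return neg_lits, pos_lits
```

Applied to x ≉ f(b) ∨ g(x) ≈ c, this yields the unit clause g(f(b)) ≈ c.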

**Subsumption.** Propositional subsumption, that is, the deletion of a clause C ∨ D with nonempty D in the presence of a clause C, is covered by the redundancy criterion. This follows from the fact that every ground instance ((C ∨ D) · θ) of the deleted clause is entailed by a smaller ground instance (C · θ) of the subsuming clause. This extends to all simplifications that replace a clause by a subsuming clause in the presence of certain other clauses, for instance the replacement of a clause tσ ≈ t′σ ∨ C by C in the presence of a clause t ≉ t′, or the replacement of a clause u[tσ] ≉ u[t′σ] ∨ C by C in the presence of a clause t ≈ t′.
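Propositional subsumption as described here amounts to a sub-multiset test on literals, which can be sketched as (our representation: literals as hashable tokens):

```python
from collections import Counter

def subsumes(c, d):
    """True iff clause c is a sub-multiset of clause d; then d = c ∨ D
    (with D the leftover literals) may be deleted in the presence of c."""
    need = Counter(c)
    have = Counter(d)
    return all(have[lit] >= n for lit, n in need.items())
```
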

First-order subsumption, that is, the deletion of a clause Cσ ∨ D in the presence of a clause C, is not covered, however. This is due to the fact that ≻≻_R makes the instance Cσ smaller than C, rather than larger (see Lemma 7).

**Tautology Deletion.** The deletion of (semantic or syntactic) tautologies is obviously covered by the redundancy criterion.

**Parallel Rewriting with Condition Literals.** Parallel rewriting with condition literals, that is, the replacement of a clause t ≉ t′ ∨ C[t, …, t]_{p_1,…,p_k}, where t ≻ t′ and p_1, …, p_k are all the occurrences of t in C, by t ≉ t′ ∨ C[t′, …, t′]_{p_1,…,p_k}, is covered by the redundancy criterion. This can be shown analogously to DER.

**Demodulation.** Parallel demodulation is the replacement of a clause C[tσ, …, tσ]_{p_1,…,p_k} by C[t′σ, …, t′σ]_{p_1,…,p_k} in the presence of another clause t ≈ t′ where tσ ≻ t′σ. In general, this is *not* covered by our redundancy criterion. For instance, if ≻ is a KBO where all symbols have weight 1 and if R = {f(b) → b, g(g(b)) → b}, then replacing f(f(f(b))) by g(g(b)) in some clause f(f(f(b))) ≉ c yields a clause with a larger R-normalization multiset, since the labeled redexes {(f(b), 2), (f(b), 2), (f(b), 2)} are replaced by {(g(g(b)), 2)} and g(g(b)) ≻ f(b).
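The counterexample can be made concrete with a small sketch (our encoding, not from the paper: ground terms as nested tuples, rewriting at the first matching position w.r.t. R = {f(b) → b, g(g(b)) → b} while collecting the contracted redexes):

```python
B = ('b',)
RULES = [(('f', B), B), (('g', ('g', B)), B)]

def step(t, redexes):
    """One rewrite step at the first matching position; records the redex."""
    for lhs, rhs in RULES:
        if t == lhs:
            redexes.append(t)
            return rhs, True
    head, *args = t
    for i, a in enumerate(args):
        new_a, fired = step(a, redexes)
        if fired:
            return (head, *args[:i], new_a, *args[i + 1:]), True
    return t, False

def normalize(t):
    """Normal form of t together with the redexes contracted on the way."""
    redexes = []
    fired = True
    while fired:
        t, fired = step(t, redexes)
    return t, redexes
```

Normalizing f(f(f(b))) contracts the redex f(b) three times, while g(g(b)) is itself a single redex of greater KBO weight (three symbols versus two), so the replacement indeed enlarges the R-normalization multiset.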

A special case is supported, though: If t′σ is a proper subterm of tσ, then the R-normalization multiset either remains the same or becomes smaller, since every redex in the normalization of t′σ occurs also in the normalization of tσ.

# **5 Completeness, Part II: The Non-horn Case**

In the non-Horn case, the construction that we have seen in the previous section fails for (Parallel) Superposition inferences at the top of positive literals. Take an LPO with precedence f ≻ c_6 ≻ c_5 ≻ c_4 ≻ c_3 ≻ c_2 ≻ c_1 ≻ b and consider the ground closures (f(x_1) ≈ c_1 ∨ f(x_2) ≈ c_2 ∨ f(x_3) ≈ c_3 · θ) and (f(x_4) ≈ c_4 ∨ f(x_5) ≈ c_5 ∨ f(x_6) ≈ c_6 · θ), where θ maps all variables to the same constant b. Assume that the first closure produces the rewrite rule (f(b) → c_3) ∈ R∗. The R∗-normalization multisets of both closures are dominated by three occurrences of the labeled redex (f(b), 0). However, a Superposition inference between the closures yields (f(x_1) ≈ c_1 ∨ f(x_2) ≈ c_2 ∨ f(x_4) ≈ c_4 ∨ f(x_5) ≈ c_5 ∨ c_3 ≈ c_6 · θ), whose R∗-normalization multiset contains four occurrences of the labeled redex (f(b), 0); hence the conclusion of the inference is larger than both premises. If we want to change this, we must ensure that the weight of positive literals depends primarily on their larger sides, and, if the larger sides are equal, on their smaller sides. That means that in the non-Horn case, the clause ordering must treat positive literals as the traditional clause ordering ≺_C does. But that has two important consequences: First, DER may no longer be used to eliminate variables that occur also in positive literals (since DER might now increase the weight of these literals). On the other hand, unrestricted demodulation becomes possible for positive literals.

We sketch the key differences between the non-Horn and the Horn case; for the details, we refer to the technical report [7].

We define the subterm set ss−(C) and the topterm set ts−(C) of a clause C as in the Horn case:

$$\begin{aligned} \text{ss}^-(C) &= \{ t \mid C = C' \lor s[t]_p \not\approx s' \} \\ \text{ts}^-(C) &= \{ t \mid C = C' \lor t \not\approx t' \} \end{aligned}$$

We do not need labels anymore. Instead, for every redex or normal form u that appears in negative literals we include the two-element multiset {u, u} in the R-normalization multiset, to ensure that a redex or normal form u in negative literals has a larger weight than a positive literal u ≈ v with u ≻ v. We define the R-redex multiset rm_R(t) of a ground term t by

$$\begin{array}{l} \text{rm}_R(t) = \emptyset \text{ if } t \text{ is } R\text{-irreducible};\\ \text{rm}_R(t) = \{\{u, u\}\} \cup \text{rm}_R(t') \text{ if } t \to_R t' \text{ using the rule } u \to v \in R. \end{array}$$
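For instance (our illustration, not from the paper), with R = {f(b) → b} the term f(f(b)) is normalized in two steps, both using the rule f(b) → b, so

$$\text{rm}_R(f(f(b))) = \{\{f(b), f(b)\}, \{f(b), f(b)\}\}, \qquad \text{rm}_R(b) = \emptyset.$$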

The R-normalization multiset nmR(C · θ) of a ground closure (C · θ) is

$$\begin{aligned} \text{nm}_R(C \cdot \theta) = {} & \bigcup_{f(t_1, \ldots, t_n) \in \text{ss}^-(C)} \text{rm}_R(f(t_1\theta{\downarrow_R}, \ldots, t_n\theta{\downarrow_R})) \\ & \cup \bigcup_{x \in \text{ss}^-(C)} \text{rm}_R(x\theta) \\ & \cup \bigcup_{t \in \text{ts}^-(C)} \{\{t\theta{\downarrow_R}, t\theta{\downarrow_R}\}\} \\ & \cup \bigcup_{(s \approx s') \in C} \{\{s\theta, s'\theta\}\} \end{aligned}$$

Once more, the R-normalization closure ordering ≻≻_R compares ground closures (C · θ_1) and (D · θ_2) using a lexicographic combination of three orderings, defined analogously to the Horn case.
With this ordering, we can again prove Lemmas 9 and 10 and their analogue for Equality Factoring, which implies Lemma 14. In the construction of a candidate interpretation, we define again R_s = ⋃_{t ≺ s} E_t for every ground term s and R∗ = ⋃_t E_t. We define E_s = {s → s′} if (C · θ) is the ≺≺_{R_s}-smallest closure in N such that C = C′ ∨ u ≈ u′, uθ ≈ u′θ is a strictly maximal literal in Cθ, s = uθ, s′ = u′θ, s ≻ s′, s is irreducible w.r.t. R_s, Cθ is false in R_s, and C′θ is false in R_s ∪ {s → s′}, provided that such a closure (C · θ) exists. If no such closure exists, we define E_s = ∅.

We can then reprove Theorem 20 for the non-Horn case. The only difference in the proof is one additional subcase before Case 2.1: If sθ ≈ s′θ is maximal, but not strictly maximal, in Cθ, or if C′θ is true in R_{sθ} ∪ {sθ → s′θ}, then there is an Equality Factoring inference with the premise (C · θ). This inference must be redundant, which again yields a contradiction.

The lifting to non-ground clauses works as in Sect. 4.3.

# **6 Discussion**

We have demonstrated that the naive addition of Destructive Equality Resolution (DER) to the standard abstract redundancy concept destroys the refutational completeness of the calculus, but that there exist restricted variants of the Superposition Calculus that are refutationally complete even with DER (restricted to negative literals in the non-Horn case). The key tool for the completeness proofs is a closure ordering that is structurally very different from the classical ones – it is not a multiset extension of some literal ordering – but that still has the property that the redundancy criterion induced by it is compatible with the Superposition Calculus.

Of course there is a big gap between the negative result and the positive results. The new redundancy criterion justifies DER as well as most deletion and simplification techniques found in realistic saturation provers, but only propositional subsumption and only a very restricted variant of demodulation. The question whether the Superposition Calculus is refutationally complete together with a redundancy criterion that justifies both DER (in full generality, even in the non-Horn case) and *all* deletion and simplification techniques found in realistic saturation provers (including unrestricted demodulation and first-order subsumption) is still open. Our work is an intermediate step towards a solution to this problem. There may exist a more refined closure ordering that allows us to prove the completeness of such a calculus. On the other hand, if the combination is really incomplete, a counterexample must make use of those operations that our proof fails to handle, that is, DER on positive literals in non-Horn problems, first-order subsumption, or demodulation with unit equations that are contained in the usual term ordering but yield closures that are larger w.r.t. ≻≻_{R∗}.

**Acknowledgement.** I thank the anonymous IJCAR reviewers for their helpful comments and Stephan Schulz for drawing my attention to the problem at the CASC dinner in 2013.

**Disclosure of Interests.** The author has no competing interests to declare that are relevant to the content of this article.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **SAT, SMT and Quantifier Elimination**

# **Model Completeness for Rational Trees**

Silvio Ghilardi and Lia M. Poidomani

Dipartimento di Matematica, Università degli Studi di Milano, Milan, Italy silvio.ghilardi@unimi.it http://users.mat.unimi.it/users/ghilardi/

**Abstract.** We analyze the theory of rational trees with finitely many constructors, infinitely many atoms and an atomicity predicate. We design a new decision procedure, proving in addition that this theory is model-complete. We also show that the enrichment of the language with selectors and simultaneous parametric fixpoints enjoys quantifier elimination.

**Keywords:** Infinite Trees · Model Completeness · Decision Procedures

# **1 Introduction**

The theory of finite and infinite trees deserves special attention in computer science applications for its capability of representing both algebraic and co-algebraic datatypes, including for instance (acyclic or even cyclic) lists, streams, etc. The theory has been largely investigated by the logic programming community since its early days [3,8] and various results have been proved about it, including decidability in the full elementary language [12], albeit within very high (actually non-elementary) complexity bounds [21]. The automated reasoning community gave some contributions too, focusing especially on the more restricted constraint solving fragment [20].

Deciding satisfiability of the elementary theory of finite and infinite trees cannot be obtained via quantifier elimination, as pointed out in the comprehensive and detailed paper [4], which improves and extends classical results from [12]. In fact, the normalizations performed by the algorithms proposed in these papers prove that every formula is equivalent to a Boolean combination of special kinds of primitive formulae (thus giving, in terms of prenex forms, a reduction to the Δ₂⁰ = ∃∗∀∗ ∩ ∀∗∃∗ class), but not to a quantifier-free formula. The impossibility of full quantifier elimination can indeed be shown by easy counterexamples. However, quantifier elimination is a nice and desirable property; for instance, it facilitates the computation of interpolants [10], whose employment in verification applications is widely recognized [15].

From the point of view of 'abstract logical nonsense', quantifier elimination can in principle always be obtained, by the trivial enlargement of the language naming every formula by a fresh atomic predicate. More concretely, it often happens that manageable enrichments of the language are sufficient to achieve quantifier elimination: a classical example in this sense is the theory of real closed fields, which becomes quantifier-eliminable after adding to the language the ordering predicate (this is easily seen to be

The first author is supported by INdAM's GNSAGA group.

definable in the theory, because 'being positive' turns out to be equivalent to 'being a square'). A more complicated, but still manageable example is linear integer arithmetic, where in order to obtain quantifier elimination it is sufficient to enrich the language with the infinitely many definable binary predicates expressing 'equivalence modulo *n*'.
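Two sample equivalences behind these classical enrichments (our illustrations, not taken from the paper):

$$\exists x\,(x \cdot x = y) \;\leftrightarrow\; 0 \le y \quad \text{(real closed fields)}, \qquad \exists x\,(y = x + x) \;\leftrightarrow\; y \equiv_2 0 \quad \text{(linear integer arithmetic)}.$$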

We take into consideration the variant of the theory of trees whose signature has *finitely many constructors, infinitely many constant symbols and an atomicity predicate*. We feel that this variant of the theory is rather natural to consider, especially in verification applications, where constants can represent infinite data (maybe constrained by richer theories describing their internal structure). The problem we are addressing in this paper is: what is the price to pay (in terms of language enrichments) to achieve quantifier elimination for the above theory? In [14] quantifier elimination is obtained by introducing *ad hoc* syntactic entities (called 'terms with pointers') and applying to them an extension of the classical Mal'cev quantifier elimination algorithm for finite trees [13]. In this paper we stay inside the native first-order language and we prove that our theory is *model-complete* by introducing a novel technique based on *definability analysis*; as a by-product, we achieve full quantifier elimination by enriching the language with extra operation symbols for *selectors* and *simultaneous parametric fixpoints*.

The paper is structured as follows: Sects. 2, 3 introduce preliminary definitions and our variant of the theory of trees and establish some basic properties; Sect. 4 gives a new decision procedure for constraint solving problems using *graph representations, bisimulations and congruence closure*. Section 5 introduces our main technical tool, namely *definability in existential formulae*; Sect. 6 proves first model-completeness and then shows how to enrich the language to achieve full quantifier elimination. Section 7 concludes and discusses further improvements. The paper is meant to be self-contained; we moved to the extended version available online at

### http://users.mat.unimi.it/users/ghilardi/allegati/GP_IJCAR24.pdf

the proofs of a few results which are either well-known or rather straightforward. The extended version also contains a thorough comparison with the approach of some relevant papers from the literature, such as [4] and [14].

# **2 Preliminaries**

We adopt the usual first-order syntactic notions of signature, term, atom, literal, (ground) formula, and so on; our signatures always include equality. We compactly represent a tuple of distinct variables as *x*. The notation *t*(*x*), φ(*x*) means that the term *t*, resp. the formula φ, has free variables included in the tuple *x*. Since our *tuples of variables* are assumed to be formed by *distinct* elements, we underline that when we write e.g. φ(*x,y*), we mean that the tuples *x,y* are made of distinct variables and are also disjoint from each other. Notations like φ(*t/x*) are used to denote substitutions.

A formula is said to be *universal* (resp., *existential*) if it has the form ∀*x*φ(*x,y*) (resp., ∃*x*φ(*x,y*)), where φ is quantifier-free. Formulae with no free variables are called *sentences*. A *constraint* is a conjunction of literals; a *primitive* formula is obtained from a constraint by prefixing it a finite string of existential quantifiers.

From the semantic side, given a signature Σ, we use the standard notion of a Σ-structure M and of truth of a formula in a Σ-structure under a free variables assignment. A Σ*-theory* T is a set of Σ-sentences; a *model* of T is a Σ-structure M where all sentences in T are true. We use the standard notation T |= φ to say that φ is T*-valid*, i.e. true in all models of T for every assignment to the variables occurring free in φ. We say that two formulae φ and ψ are T-equivalent iff φ ↔ ψ is T-valid. A theory T is *complete* iff for every sentence φ, either φ or ¬φ is T-valid. Complete theories can be obtained from any Σ-structure M by taking the set of the Σ-sentences that are true in M (such a theory is denoted by *Th*(M)).

We say that φ is T *-satisfiable* iff there is a model M of T and an assignment to the variables occurring free in φ making φ true in M. The *elementary* (resp. *constraint*) *satisfiability problem* for T is the following: we are given a formula (resp. a constraint) φ(*x*) and we are asked whether there exist a model M of T and an assignment α to the free variables *x* such that M*,*α |= φ(*x*) holds, i.e. such that φ is true in M under such assignment.

A theory T has *quantifier elimination* iff for every formula φ(*x*) in the signature of T there is a quantifier-free formula φ′(*x*) such that T |= φ(*x*) ↔ φ′(*x*). We shall also be interested in a condition that is weaker than quantifier elimination, namely model-completeness. Such a notion is usually defined semantically (as the fact that an embedding between two models of the theory is elementary); however, there is a well-known equivalent syntactic definition that we are going to use extensively (see [2], Thm. 3.5.1). This equivalent definition is supplied by the following:

**Definition 1.** *A theory* T *is said to be* model-complete *iff every existential formula* ∃*x* φ(*x,y*) *in the signature of* T *is* T*-equivalent to a universal formula* ∀*x*′ φ′(*x*′*,y*)*.*

Notice that, keeping in mind prenex normal forms and the interdefinability of universal and existential quantifiers, it turns out that in a model-complete theory *every formula whatsoever* is T-equivalent to *both* a universal and an existential formula; consequently, *model-completeness reduces the elementary satisfiability problem to the constraint satisfiability problem*.

#### **3** Σ**-Trees**

In the whole paper we fix a signature (let us call it Σ) containing: (i) *infinitely many individual constants*, (ii) *finitely many function symbols h*1*,...,hN* of respective arities *ar*1*,...,arN* ≥ 1 and (iii) *a unary predicate At*. Symbols *h*1*,...,hN* are called Σ*constructors* or just *constructors*. We use letters *f,g,...* for constructors or constants of Σ. We now define Σ-labelled trees.

A *tree* *T* is a set of finite lists of positive natural numbers satisfying the following three conditions: (1) the empty list ε belongs to *T*; (2) *T* is prefix-closed, i.e. σ ∗ τ ∈ *T* implies σ ∈ *T* (here ∗ is list concatenation); (3) if for some positive numbers *n, m* we have *n < m* and σ ∗ *m* ∈ *T*, then σ ∗ *n* ∈ *T*.

The elements of a tree *T* are called *nodes* and the node ε is called the *root* of *T*; given a node σ ∈ *T*, the set *T*σ = {τ | σ ∗ τ ∈ *T*} is a tree itself, called the σ-subtree of *T*. A *subtree* of *T* is a σ-subtree of *T* for some σ ∈ *T*.

A Σ*-labelled tree* (or just a Σ*-tree*) is a pair (*T,*Λ) such that *T* is a tree and Λ is a map Λ : *T* −→ Σ that associates with every node σ of *T* a constructor or a constant in such a way that if Λ(σ) has arity *n*, then σ has precisely *n* successors. The latter means that for every *i* ≥ 1, we have σ ∗ *i* ∈ *T* iff *i* ≤ *n* (as a special case, σ ∗ *i* never belongs to *T* in case Λ(σ) is a constant).

Given a node σ in a Σ-tree (*T,*Λ), the σ-subtree *T*σ of *T* can be endowed with a Σ-tree structure (*T*σ*,*Λσ) by putting Λσ(τ) = Λ(σ ∗ τ) for every τ ∈ *T*σ. The Σ*-subtrees* of (*T,*Λ) are the Σ-trees of the form (*T*σ*,*Λσ), varying σ among the nodes of *T*. If Λ(σ) is a constructor whose arity is *n*, we call *T*-*successors* of σ the Σ-trees (*T*σ∗1*,*Λσ∗1)*,...,*(*T*σ∗*n*,Λσ∗*n*); if Λ(σ) is a constant, σ does not have *T*-successors and is said to be a *leaf* of *T* and of (*T,*Λ).

*Remark 1.* If (*T,*Λ) is a Σ-tree, we use the notation (σ*, f*) ∈ (*T,*Λ) to mean that σ ∈ *T* and Λ(σ) = *f*. The above notation can be used in order to *define a* Σ*-tree* (*T,*Λ) *by specifying by induction on the length of* σ *whether* (σ*, f*) ∈ (*T,*Λ) holds or not, for every σ and for every (constructor or constant) *f*.

A Σ-tree (*T,*Λ) is *finite* iff the set of nodes of *T* is finite, and it is said to be *infinite* otherwise. (*T,*Λ) is *rational* iff the set of Σ-subtrees of (*T,*Λ) is finite; notice that a rational tree can be infinite. Now comes the important definition: the sets of Σ-trees, of finite Σ-trees and of rational Σ-trees give rise to Σ-structures in the following way. A constant *c* ∈ Σ is interpreted as the one-node Σ-tree whose root is labelled *c*, and the predicate *At* is interpreted as the set of such one-node trees; the constructor *hi* (of arity *n* = *ari*), applied to Σ-trees (*T*1*,*Λ1)*,...,*(*Tn,*Λ*n*), yields the Σ-tree (*T,*Λ), where


$$T = \{\varepsilon\} \cup \bigcup\_{j=1}^{n} \{j \ast \sigma \mid \sigma \in T\_j\},$$

and Λ is defined by

$$\Lambda(\varepsilon) = h\_i, \qquad \Lambda(j \ast \sigma) = \Lambda\_j(\sigma)$$

(thus (*T,*Λ) is the Σ-tree whose root is labelled *hi* and such that the *T*-successors of the root are (*T*1*,*Λ1)*,...,*(*Tn,*Λ*n*)).

It is well-known [12,14] that the Σ-structure of all Σ-trees and the Σ-structure of all rational Σ-trees are elementarily equivalent (i.e. the same Σ-sentences are true in them). So it makes no difference whether we work with rational trees or with all Σ-trees; in this paper we choose to work with the structure of *rational* Σ-trees. We call this structure R. The related complete theory *Th*(R) will also be called R (so, for simplicity, *we use the same letter* R *for the set of rational* Σ*-trees, the related* Σ*-structure and the associated complete theory*).

There are some important formulae valid in R that we want to list below. In order to have a compact and intuitive notation, we need some abbreviations. If *x* is the tuple *x*1*,...,xn*, let us use ∀*x*φ for ∀*x*1 ··· ∀*xn*φ (a similar convention is used for ∃*x*φ). If *t* = *t*1*,...,tn* is a tuple of terms of the same length as *x*, then *x* = *t* stands for $\bigwedge\_{i=1}^{n} x\_i = t\_i$; moreover, ∃!*x*φ(*x,y*) is used for ∃*x*φ(*x,y*) ∧ ∀*x*∀*x*′ (φ(*x,y*) ∧ φ(*x*′*,y*) → *x* = *x*′)*.*

It is useful to have abbreviations for formulas expressing that "*x* is rooted by *hi*", these are

$$\mathcal{R}\_{h\_i}(x) \ :\equiv \ \exists x\_1 \cdots \exists x\_{ar\_i} (x = h\_i(x\_1, \dots, x\_{ar\_i})) \tag{1}$$

for every *i* = 1*,...,N* (recall that *h*1*,...,hN* are our constructors). The claims of the following Proposition are either immediate or well-known [4]:

**Proposition 1.** *The following sentences are* R*-valid:*

(i) ∀*x*∀*x*′ (*hi*(*x*) = *hi*(*x*′) → *x* = *x*′) *(for i* = 1*,...,N);*
(ii) ∀*x* ¬(*Rhi*(*x*) ∧ *Rhj*(*x*)) *(for i* ≠ *j and i, j* = 1*,...,N);*
(iii) ∀*x* ¬(*Rhi*(*x*) ∧ *At*(*x*)) *(for i* = 1*,...,N);*
(iv) ∀*x* (*At*(*x*) ∨ $\bigvee\_{i=1}^{N}$ *Rhi*(*x*))*;*
(v) *At*(*a*) ∧ *a* ≠ *b (for distinct constants a,b* ∈ Σ*);*
(vi) ∀*z*∃!*x* (*x* = *t*(*x,z*))*, where t*(*x,z*) *are proper flat terms.*

Above, a *flat* term is a constant, a variable, or a constructor applied to variables; a flat term is *proper* iff it is not a variable.

We make a comment on formula (vi) above. To correctly interpret it, recall that *x,z* must be distinct tuples of variables including all variables occurring in *t* according to our conventions. Formula (vi) expresses the *existence and uniqueness of simultaneous parametric fixpoints*: the fixpoints are simultaneous because *x* is a tuple of variables (it is not a single variable) and the *z* are the parameters on which the fixpoints depend.

We underline the following important fact, which is a consequence of the above Proposition: the formula *Rhi* (*x*) is *existential* according to its definition (1); however, because of Proposition 1(ii)–(iv), it is also logically equivalent to a *universal* formula, namely

$$\neg At(x) \land \bigwedge\_{j \neq i} \neg \mathcal{R}\_{h\_j}(x) \tag{2}$$

(this is typical of model-complete theories, see Definition 1).

A *flat literal* is a literal of one of the following forms

$$\mathbf{x} = t, \; \mathbf{x} \neq \mathbf{y}, \; At(\mathbf{x}), \; \neg At(\mathbf{x})$$

where *x,y* are variables and *t* is a flat term; a flat literal is *proper* iff it is of the kind *At*(*x*)*, x* ≠ *y, x* = *t*, where *x,y* are variables and the term *t* is flat and proper. A proper flat literal of the kind *At*(*x*) is said to be an *atomicity* proper flat literal, a proper flat literal of the kind *x* ≠ *y* is said to be a *disequality* proper flat literal, and a proper flat literal of the kind *x* = *t* is said to be an *equality* proper flat literal with *head variable x*.

**Definition 2.** *A constraint C*(*x*) *is in* solved form *iff it is a conjunction of atomicity proper flat literals, disequality proper flat literals and equality proper flat literals whose head variables are* pairwise different*. We also require that if a constraint C in solved form contains an atomicity proper flat literal of the kind At*(*v*)*, then the variable v is not a head variable of any equality proper flat literal v* = *t of C, unless t is a constant.*
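A solved-form check following Definition 2 can be sketched as follows; the data encoding (bare strings for constants and variables, tuples for constructor terms) and all names are ours, not from the paper:

```python
# A minimal solved-form check for Definition 2 (our own encoding).
def in_solved_form(atomicity, disequalities, equalities, constants):
    """atomicity: set of variables v with At(v);
       disequalities: set of pairs (x, y);
       equalities: list of (head_variable, term), where term is either a
       constant (str) or a pair (constructor, (arg_variables...))."""
    heads = [h for h, _ in equalities]
    if len(heads) != len(set(heads)):    # head variables must be pairwise different
        return False
    for h, t in equalities:
        # an At-variable may only head an equality v = a with a a constant
        if h in atomicity and not (isinstance(t, str) and t in constants):
            return False
    return True
```

For example, `At(v) ∧ v = a` (with `a` a constant) is in solved form, while `At(v) ∧ v = h1(w)` is not.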

We have an analogous notion for primitive formulae. A conjunction of variable equalities *E*(*x*′*,x*) of the kind *x*′ = *x* (for *x*′ ∈ *x*′ and *x* ∈ *x*) is an *equivalence diagram* (*e-diagram* for short) iff for every *x*′ ∈ *x*′ there is exactly one equality *x*′ = *x* with *x* ∈ *x* occurring as a conjunct of *E* (that is, *E* is a 'logical representation' of an equivalence relation over *x* ∪ *x*′, having the *x* as representative elements of the equivalence classes).

The role of the e-diagrams in Definition 3 and in the algorithm of Proposition 2 below follows a choice of representative variables for the equivalence classes: once the choice is made, e-diagrams neutralize unquantified non-representative variables (these variables cannot be eliminated) by moving them outside the scope of the existential quantifiers (in that position they cannot play any role in future manipulations).

**Definition 3.** *A primitive formula is in* solved form *iff it can be written (up to logical equivalence) in the form*

$$E(\underline{\mathbf{x}}', \underline{\mathbf{x}}) \land \exists \underline{\mathbf{y}} \mathcal{C}(\underline{\mathbf{x}}, \underline{\mathbf{y}}) \tag{3}$$

*where C*(*x,y*) *is a constraint in solved form and E*(*x*′*,x*) *is an e-diagram.*

A primitive formula in solved form (3) has the following important property: any assignment to the free variables *x* satisfying *C* can be extended to an assignment to the *x*′ satisfying *E*(*x*′*,x*) ∧ ∃*y C*(*x,y*). The following Proposition shows that satisfiability of existential formulae can be reduced to satisfiability of primitive formulae in solved form:

**Proposition 2.** *Every existential formula is* R*-equivalent to a disjunction of primitive formulae in solved form.*

*Proof.* We first flatten all terms occurring in our formula using fresh existentially quantified variables and the logical equivalence φ(*x/t*) ↔ ∃*x* (*x* = *t* ∧ φ) (here *x* is supposed not to occur in *t*). Then we remove negative atomicity statements via the R-equivalence ¬*At*(*x*) ↔ $\bigvee\_{i=1}^{N} \mathcal{R}\_{h\_i}(x)$. Applying DNF conversion and distributing the existential quantifiers over disjunctions, we obtain a disjunction of primitive formulae whose matrices are conjunctions of flat literals.
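The flattening step can be sketched as follows, under our own term encoding (nested subterms are abstracted bottom-up by fresh variables, as in the equivalence above; constants are treated like variables for brevity):

```python
import itertools

# Fresh-variable supply for the flattening sketch (name scheme is ours).
fresh = (f"_y{i}" for i in itertools.count())

def flatten(term, out):
    """term: a variable/constant (str) or a pair (f, (subterms...)).
    Returns a variable naming `term`, appending flat equalities v = f(...)
    to the list `out`."""
    if isinstance(term, str):
        return term
    f, args = term
    flat_args = tuple(flatten(a, out) for a in args)
    v = next(fresh)
    out.append((v, (f, flat_args)))   # the flat literal v = f(flat_args)
    return v
```

For instance, flattening `h2(h1(x), x)` produces two flat equalities, one for the inner `h1(x)` and one for the outer `h2` term.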

Now we exhaustively apply to each disjunct of the form ∃*y*φ(*x,y*) the rewrite rules below (disjuncts containing an inconsistency *v* ≠ *v* are removed in the end). The rules modify the whole disjunct or some of its conjuncts, as indicated. When specifying the rules, we use letters *xi* ∈ *x* for the free variables *x* of ∃*y*φ(*x,y*), the letter *y* ∈ *y* for a quantified variable of ∃*y*φ(*x,y*) and letters *v,vi,wj* ∈ *x* ∪ *y* for any variable occurring in ∃*y*φ(*x,y*); we also fix a total order ≺ on the free variables *x*.


When applying Rule (4) and moving the equality *x*1 = *x*2 outside the existential quantifiers ∃*y*, we simultaneously replace any free variable equality of the kind *x* = *x*2 by *x* = *x*1 (this ensures that the free variable equalities outside the existential quantifiers form an e-diagram). The above rules are justified either as logical equivalences or by Proposition 1. Rules (2)–(3) adjust equality proper flat literals whose head variables are the same, whereas Rules (4)–(5) eliminate variable equalities inside the scope of the existential quantifiers. For termination, we consider pairs of natural numbers (*k*1*,k*2), where *k*1 is the number of occurrences of constructor symbols and *k*2 is the number of literals inside the existential quantifiers. All rules decrease *k*2 (keeping *k*1 unchanged), except Rule (3), which decreases *k*1. <sup>1</sup>

*Remark 2.* From the point of view of complexity, it is clear that DNF conversion produces an exponential blow-up; however Rules (0)–(5) can be exhaustively applied in polynomial (actually quadratic) time.

#### **4** Σ**-Graphs and Bisimulations**

In this section, we introduce a procedure to test R-satisfiability of constraints in solved form. The procedure is based on Σ-graphs (we shall essentially use Σ-graphs as alternative representations of Σ-trees):

**Definition 4.** *A* Σ-graph G = (*G,*{*Ri*}*i*≥1*,*λ) *is a set G endowed with an infinite family of unary partial functions Ri* (*i* ≥ 1) *and with a labeling function* λ : *G* −→ Σ *satisfying the following condition: for every i* ≥ 1*, the domain of the partial function Ri is the set of g* ∈ *G such that the arity of* λ(*g*) *is larger than or equal to i.*

Notice that the domains of the *Ri* exclude the nodes of a Σ-graph labelled by a constant symbol. We use the notation *g Ri g*′ to say that *g* ∈ *dom*(*Ri*) and *Ri*(*g*) = *g*′. We adapt to our context the classical notion of bisimulation:

**Definition 5.** *A* bisimulation *between* Σ*-graphs* G = (*G,*{*Ri*}*i*≥1*,*λ) *and* G′ = (*G*′*,*{*R*′*i*}*i*≥1*,*λ′) *is a relation* ρ ⊆ *G* × *G*′ *satisfying the following conditions:*

(i) ρ(*g,g*′) *implies* λ(*g*) = λ′(*g*′)*;*
(ii) *if* ρ(*g,g*′) *and g Ri h, then there is h*′ *such that g*′ *R*′*i h*′ *and* ρ(*h,h*′)*;*
(iii) *if* ρ(*g,g*′) *and g*′ *R*′*i h*′*, then there is h such that g Ri h and* ρ(*h,h*′)*.*
*A bisimulation relation which is a total function* ρ : *G* −→ *G*′ *is said to be a* p-morphism *of* G *into* G′*; if* ρ *is also surjective, we say that* G′ *is a* quotient *of* G*.*

Since our *Ri* are partial functions and (i) holds, conditions (ii) and (iii) in the definition of a bisimulation are trivially seen to be equivalent to each other (so one of them is redundant).

<sup>1</sup> It is essential that Rule (3) is applied to flat constraints; otherwise it might cause non-termination if naively formulated [16].

### **Fig. 1.** Rules of Pseudo Congruence Closure Algorithm

Bisimulations are closed under unions, so that there is a *biggest bisimulation* between two given Σ-graphs G and G′. We write *g* ∼*b g*′ to mean that there exists a bisimulation relating the nodes *g* and *g*′ of G and G′ (equivalently, that we have ρ(*g,g*′) for the biggest bisimulation between G and G′). Also, since the identity relation is a bisimulation, the converse of a bisimulation is a bisimulation and the composition of bisimulations is a bisimulation, the relation ∼*b* restricted to the nodes of a single Σ-graph G turns out to be an equivalence relation. Bisimulation relations between a Σ-graph and itself which are equivalence relations (called *bisimulation equivalences*) produce quotients, in the sense explained by the following Proposition:

**Proposition 3.** *Suppose that* ρ *is a bisimulation equivalence on a* Σ*-graph* G = (*G,*{*Ri*}*i*≥1*,*λ)*. Then there are a* Σ*-graph* G′ *and a quotient q* : G −→ G′ *such that we have* ρ(*g,g*′) *iff q*(*g*) = *q*(*g*′)*, for all g,g*′ *in* G*.*

*Proof.* We let *G* be the quotient set of *G* under the equivalence relation ρ and *q* be the map associating with *g* its equivalence class [*g*]. We then put

$$
\lambda'([g]) = \lambda(g), \qquad [g]\,R'\_i\,[g'] \text{ iff } \exists g'' \text{ s.t. } \rho(g', g'') \text{ and } g\,R\_i\,g'' \quad (\text{for all } i \ge 1).
$$

In this way G′ = (*G*′*,*{*R*′*i*}*i*≥1*,*λ′) is a Σ-graph and *q* is the desired quotient.

Given a *finite* Σ-graph G = (*G,*{*Ri*}*i*≥1*,*λ) and a relation ρ0 ⊆ *G* × *G*, we want to know whether there is a bisimulation equivalence ρ such that ρ0 ⊆ ρ. This is a special case of a classical problem well-studied in the literature; since our relations *Ri* are partial functions, we can give an alternative solution in a congruence closure style: our 'Pseudo Congruence Closure' (PCC) algorithm first computes the smallest equivalence relation ρ extending ρ0 and then exhaustively applies to it the two rules of Fig. 1. The following Proposition is clear:

**Proposition 4.** *There is a bisimulation equivalence extending a relation* ρ0 *on a finite* Σ*-graph* G = (*G,*{*Ri*}*i*≥1*,*λ) *iff PCC, applied to* ρ0*, terminates without failure. In particular, for g*1*,g*2 ∈ *G we have that g*1 ∼*b g*2 *iff PCC does not fail on input* G*,* ρ0 = {(*g*1*,g*2)}*.*
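A plausible union-find rendering of such a procedure is sketched below (the concrete rules of Fig. 1 may be stated differently; the encoding and the name `pcc` are ours): merging two nodes fails when their labels differ, and otherwise their *Ri*-successors are merged pairwise.

```python
# A hedged PCC-style sketch: union-find with downward propagation to successors.
def pcc(labels, succ, rho0):
    """labels: node -> label; succ: node -> tuple of R_i-successors (position
    i-1 holds R_i(node); equal labels are assumed to give equal arities);
    rho0: iterable of node pairs. Returns the parent map on success, None on failure."""
    parent = {g: g for g in labels}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]   # path halving
            g = parent[g]
        return g

    def union(g1, g2):
        r1, r2 = find(g1), find(g2)
        if r1 == r2:
            return True
        if labels[r1] != labels[r2]:        # label clash: failure rule
            return False
        parent[r2] = r1
        # merging rule: merged nodes must have pairwise merged successors
        return all(union(h1, h2) for h1, h2 in zip(succ[r2], succ[r1]))

    for g1, g2 in rho0:
        if not union(g1, g2):
            return None
    return parent
```

A successful run returns the computed equivalence (as a parent map), witnessing a bisimulation equivalence extending ρ0.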

*Remark 3.* PCC can be simulated by an ordinary congruence closure problem (thus PCC inherits the complexity bound *O*(*n* log *n*) [9,17,18] of congruence closure). To see this, it is sufficient to consider the signature comprising unary function symbols *Fi* (*i* ≥ 1) and unary predicates *Pf* (varying *f* among the function and constant symbols of Σ). Notice that such a signature is infinite, but only finitely many symbols are used when handling a finite Σ-graph. Then consider the congruence closure problem given by the following literals: *g*1 = *g*2 (varying (*g*1*,g*2) ∈ ρ0), *Fi*(*g*) = *g*′ (for *g Ri g*′ in G), *Pf*(*g*) (if λ(*g*) = *f*), ¬*Pf*(*g*) (if λ(*g*) ≠ *f*). Now it is easy to see that, despite the fact that the *Ri* are partial and the *Fi* are total functions, PCC and standard congruence closure run in the same way.

An important (infinite) Σ-graph is the Σ*-graph of all rational trees*. This is the Σ-graph *Rat* = (R*,*{R*i*}*i*,λR), whose underlying set is the set R of rational trees and where we have

λR((*T,*Λ)) = Λ(ε), and (*T,*Λ) R*i* (*T*′*,*Λ′) iff Λ(ε) has arity ≥ *i* and (*T*′*,*Λ′) is the *i*-th *T*-successor (*Ti,*Λ*i*) of (*T,*Λ).
Inside *Rat*, bisimulation is trivial:

**Proposition 5.** *In the* Σ*-graph Rat, we have* (*T,*Λ) ∼*b* (*T*′*,*Λ′) *iff* (*T,*Λ) = (*T*′*,*Λ′)*.*

*Proof.* Suppose that (*T,*Λ) ∼*b* (*T*′*,*Λ′). Recall from Remark 1 that if (*T,*Λ) is a Σ-tree, we write (σ*, f*) ∈ (*T,*Λ) to mean that σ ∈ *T* and Λ(σ) = *f*; recall moreover that (*T,*Λ) = (*T*′*,*Λ′) iff we have (σ*, f*) ∈ (*T,*Λ) iff (σ*, f*) ∈ (*T*′*,*Λ′), for all (σ*, f*). This is what we are going to prove, by induction on the length of σ.

If σ = ε, then from (*T,*Λ) ∼*b* (*T*′*,*Λ′) it follows that the root-labelling symbols of (*T,*Λ) and (*T*′*,*Λ′) are the same.

Suppose now that σ = *i* ∗ τ and that (*i* ∗ τ*, f*) ∈ (*T,*Λ); then Λ(ε) is a function symbol of arity at least *i*, so that the *i*-th successor (*Ti,*Λ*i*) of (*T,*Λ) exists and we have (*T,*Λ) R*i* (*Ti,*Λ*i*) and (τ*, f*) ∈ (*Ti,*Λ*i*). Since (*T,*Λ) and (*T*′*,*Λ′) are bisimilar, the symbols labelling their roots are the same, so for the *i*-th successor (*T*′*i*,Λ′*i*) of (*T*′*,*Λ′) we have (*T*′*,*Λ′) R*i* (*T*′*i*,Λ′*i*) and also (*Ti,*Λ*i*) ∼*b* (*T*′*i*,Λ′*i*) (again by bisimilarity). By the induction hypothesis, (τ*, f*) ∈ (*T*′*i*,Λ′*i*), that is, (*i* ∗ τ*, f*) ∈ (*T*′*,*Λ′), as required.

The importance of *Rat* is due to the fact that every finite Σ-graph compares with it:

**Proposition 6.** *For every finite* Σ*-graph* G = (*G,*{*Ri*}*i*≥1*,*λ) *there is a p-morphism p*G : G −→ *Rat.*

*Proof.* Given *g* ∈ *G*, let us define a rational tree *p*G(*g*) by specifying under which conditions we have (σ*, f*) ∈ *p*G(*g*), for a finite list of positive numbers σ and *f* ∈ Σ (recall Remark 1). We stipulate that (*i*1 ···*ik, f*) ∈ *p*G(*g*) iff there are *g*1*,...,gk* ∈ *G* such that *g Ri*1 *g*1 ··· *Rik gk* and λ(*gk*) = *f* (for *k* = 0, this means that λ(*g*) = *f*). From this definition, it is easy to see that *p*G(*g*) is indeed a Σ-tree and that *p*G is a p-morphism. The fact that *p*G(*g*) is rational follows from the *p*-morphism condition, because the Σ-subtrees of *p*G(*g*) turn out to be of the form *p*G(*g*′) for *g*′ ∈ *G*, and *G* is finite.

We now relate Σ-graphs and constraint satisfiability in R. We know from Proposition 2 that in order to check satisfiability of primitive formulae we can restrict to primitive formulae in solved form. To any constraint in solved form φ(*x*1*,...,xn*) we associate a Σ-graph

$$\mathcal{G}\_{\varphi} = (G\_{\varphi}, \{R^{i}\_{\varphi}\}\_{i}, \lambda\_{\varphi})$$

as follows. The nodes of *G*φ are *x*1*,...,xn*; we have *xk Ri xj* iff φ contains an equality of the kind *xk* = *f*(··· *,xj,*···) (with *xj* as *i*-th argument), and in such a case we put λφ(*xk*) = *f*. If φ contains an equality of the kind *xk* = *a* for a constant *a*, we put λφ(*xk*) = *a*. Notice that an equality like *xk* = *f*(··· *,xj,*···) or *xk* = *a* is uniquely identified for a variable *xk*, according to Definition 2. The variables *xk* for which there are no equalities of the kind *xk* = *f*(···) or *xk* = *a* in φ are called *external* in φ. For such *xk* we put λφ(*xk*) = *ak*, where the *ak* are *fresh distinct constants* in Σ (recall that Σ has infinitely many constants).
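The construction of Gφ can be sketched as follows, under our own encoding (a solved form is given as a map from head variables to flat right-hand sides; all names are ours):

```python
import itertools

# A minimal sketch of the Sigma-graph G_phi associated with a solved form.
def sigma_graph(variables, equalities):
    """equalities: dict head_variable -> either a constant (str) or a pair
    (constructor, (arg_variables...)). Returns (labels, succ): the labeling
    function and, per node, the tuple of R_i-successors."""
    fresh_consts = (f"_a{i}" for i in itertools.count())
    labels, succ = {}, {}
    for v in variables:
        t = equalities.get(v)
        if t is None:                       # external variable: fresh constant label
            labels[v], succ[v] = next(fresh_consts), ()
        elif isinstance(t, str):            # v = a, a constant
            labels[v], succ[v] = t, ()
        else:                               # v = f(x_j1, ..., x_jm)
            f, args = t
            labels[v], succ[v] = f, tuple(args)
    return labels, succ
```

For instance, the solved form of Example 1 below yields a three-node graph with labels *h*2, *h*1, *h*2.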

**Theorem 1.** *A constraint* φ(*x*1*,...,xn*) *in solved form is* R*-satisfiable iff* φ *does not contain a disequality xk* ≠ *xj such that we have xk* ∼*b xj in* Gφ*.*

*Proof.* Consider the *p*-morphism *p*Gφ : Gφ −→ *Rat* of Proposition 6 and let us assign *p*Gφ(*xk*) to every variable *xk* occurring in φ; notice that this R-assignment (called the *canonical* R*-assignment*) satisfies all equalities and all atomicity statements in φ (but it might not satisfy the disequalities).

Suppose that there is no disequality *xk* ≠ *xj* in φ such that we have *xk* ∼*b xj* in Gφ. If *xk* ≠ *xj* occurs in φ, we cannot have *p*Gφ(*xk*) = *p*Gφ(*xj*), otherwise from *xk* ∼*b p*Gφ(*xk*) = *p*Gφ(*xj*) ∼*b xj* we would get *xk* ∼*b xj* (this is because *p*-morphisms and their converses are bisimulations, and bisimulations compose). So the canonical R-assignment is indeed a satisfying assignment for φ.

Vice versa, suppose that φ is satisfiable. By the next lemma, the canonical R-assignment is then a satisfying assignment for φ. If we had *xk* ∼*b xj* for a disequality *xk* ≠ *xj* occurring in φ, then we would have *p*Gφ(*xk*) ∼*b p*Gφ(*xj*) (because *p*-morphisms are bisimulations and bisimulations compose), which contradicts *p*Gφ(*xk*) ≠ *p*Gφ(*xj*) and Proposition 5.

**Lemma 1.** *If a constraint* φ(*x*1*,...,xn*) *in solved form is* R*-satisfiable, then it is satisfied by the canonical* R*-assignment.*

*Proof.* Let α be a satisfying R-assignment for φ and let α*C* be the canonical R-assignment. Since the latter can only fail to satisfy the disequalities from φ, it is sufficient to prove that for *xk,xj* ∈ {*x*1*,...,xn*} we have α(*xk*) ≠ α(*xj*) ⇒ α*C*(*xk*) ≠ α*C*(*xj*). In other words, we prove that *if there is* (σ*, f*) *such that* (σ*, f*) ∈ α(*xk*) *and* (σ*, f*) ∉ α(*xj*)*, then* α*C*(*xk*) ≠ α*C*(*xj*). We argue by induction on the length of such a σ.

Independently of the inductive argument, we notice that if either *xk* or *xj* (or both) is external in φ, the claim is trivial because, by the construction of Gφ, external variables are mapped by α*C* to singleton trees labelled by fresh distinct constants. If *xk,xj* are both non-external, φ contains equalities *xk* = *l*(*v*1*,...,vm*) and *xj* = *l*(*w*1*,...,wm*) with *v*1*,...,vm,w*1*,...,wm* ∈ {*x*1*,...,xn*} (in the case where φ contains equalities *xk* = *l*(···) and *xj* = *l*′(···) with *l* ≠ *l*′, we again trivially have α*C*(*xk*) ≠ α*C*(*xj*), because α*C* satisfies the equalities in φ).

**Fig. 2.** Σ-graph from Example 1 (left) and its quotient under maximum bisimulation (right).

Let us now argue by induction on σ, restricting to the case where *xk,xj* are as above. The case σ = ε is trivial: in that case we have *f* = *l* and (ε*,l*) belongs to both α(*xk*) and α(*xj*), contrary to our assumption that (σ*, f*) ∈ α(*xk*) and (σ*, f*) ∉ α(*xj*) (α is a satisfying assignment, so (ε*,l*) ∈ α(*xj*)). So assume that σ = *i* ∗ τ; then (*i* ∗ τ*, f*) ∈ α(*xk*) implies (τ*, f*) ∈ α(*vi*), because α satisfies *xk* = *l*(*v*1*,...,vm*). For the same reason we have (τ*, f*) ∉ α(*wi*), because otherwise (*i* ∗ τ*, f*) ∈ α(*xj*) would hold, since α satisfies *xj* = *l*(*w*1*,...,wm*). We can then apply the induction hypothesis to (τ*, f*) and conclude that α*C*(*vi*) ≠ α*C*(*wi*). Since α*C* nevertheless satisfies the equalities from φ, from the facts that *xk* = *l*(*v*1*,...,vm*) and *xj* = *l*(*w*1*,...,wm*) are true under α*C* and that α*C*(*vi*) ≠ α*C*(*wi*), we conclude that α*C*(*xk*) ≠ α*C*(*xj*).

*Remark 4.* As a consequence of Theorem 1 and Proposition 4, in order to solve an R-constraint satisfiability problem for a constraint φ(*x*) in solved form, it is sufficient to run PCC on inputs Gφ*,* ρ0 = {(*x*1*,x*2)} for every disequality *x*1 ≠ *x*2 occurring in φ. *The constraint is satisfiable iff PCC ends in failure for all such disequalities*. This gives an *O*(*n*<sup>2</sup> log *n*) complexity bound for constraints in solved form. As an alternative, one can directly compute the maximum bisimulation equivalence on the associated Σ-graph using some known efficient procedure [5–7].

*Example 1.* [*In all examples, we assume that we have only two constructors: h*<sup>1</sup> *(with arity 1) and h*<sup>2</sup> *(with arity 2)*]. Consider the constraint

$$\mathbf{x} = h\_2(\mathbf{z}, \mathbf{y}) \land \mathbf{y} = h\_1(\mathbf{y}) \land \mathbf{z} = h\_2(\mathbf{x}, \mathbf{y}) \land \mathbf{x} \neq \mathbf{y} \land \mathbf{x} \neq \mathbf{z}$$

whose Σ-graph is depicted in Fig. 2. Nodes *x* and *y* are not bisimilar in this Σ-graph, but the nodes *x* and *z* are bisimilar. Hence the constraint is unsatisfiable.
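Example 1 can be checked mechanically. The following self-contained sketch (encoding and names are ours) rebuilds the Σ-graph of Fig. 2 and tests each disequality with a PCC-style merge:

```python
# Self-contained bisimilarity check via union-find with successor propagation.
def bisimilar(labels, succ, g1, g2):
    parent = {g: g for g in labels}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return True
        if labels[ra] != labels[rb]:   # label clash: not bisimilar
            return False
        parent[rb] = ra
        return all(union(h1, h2) for h1, h2 in zip(succ[ra], succ[rb]))

    return union(g1, g2)

# x = h2(z, y), y = h1(y), z = h2(x, y); disequalities x != y and x != z
labels = {"x": "h2", "y": "h1", "z": "h2"}
succ = {"x": ("z", "y"), "y": ("y",), "z": ("x", "y")}

assert not bisimilar(labels, succ, "x", "y")   # h2 vs h1: immediate failure
assert bisimilar(labels, succ, "x", "z")       # so x != z cannot be satisfied
```

The run confirms the analysis in the text: the disequality *x* ≠ *z* relates bisimilar nodes, hence the constraint is unsatisfiable.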

### **5 Definability**

In this section we investigate what it means for a set of variables to be 'definable' (via selectors and simultaneous parametric fixpoints) from the free variables *x* of an existential formula ∃*y*φ(*y,x*). Definable and undefinable variables will be subject to different symbolic manipulations when converting an existential formula to a universal one (the possibility of such conversion is precisely the content of model completeness, see Definition 1).

Below, when we say that "φ is φ′ ∧ ψ" we mean that φ is a conjunction and that it can be written as specified (modulo associativity and commutativity of ∧).

**Definition 6.** *Let* ∃*y*φ(*y,x*) *be an existential formula with free variables x and bound variables y. The set of* definable variables of ∃*y*φ(*y,x*) *is the smallest subset D* ⊆ *x* ∪ *y satisfying the following conditions:*

(o) *x* ⊆ *D;*
(i) *if u* ∈ *D and* φ *is* φ′ ∧ *u* = *t*(*v*) *for a proper flat term t, then v* ⊆ *D;*
(ii) *if u* ⊆ *D and* φ *is* φ′ ∧ *v* = *t*(*v,u*) *for proper flat terms t, then v* ⊆ *D.*

Intuitively, the condition of Definition 6(i) says that the *v* are reachable from *u* via selectors, whereas the condition of Definition 6(ii) says that the *v* are reachable via some fixpoints having the *u* as parameters. The next couple of lemmas show how to handle definable and non-definable variables: the idea is that definable variables can be converted into universally quantified variables and that non-definable variables can be removed for validity/invalidity reasons.

**Lemma 2.** *Suppose that the existential formula* π *is of the kind*

$$
\exists \underline{w}\, \exists \underline{u}\, \varphi(\underline{w}, \underline{u}, \underline{x}),
$$

*and that the variables u are definable in it. Then* π *is* R*-equivalent to a formula of the kind*

$$\forall \underline{u}'\,(\psi(\underline{u}', \underline{x}) \land (\theta(\underline{u}', \underline{x}) \to \exists \underline{w}\, \varphi(\underline{w}, \underline{u}, \underline{x})))\tag{4}$$

*where u*′ ⊇ *u (i.e. u*′ *possibly extends u by some fresh variables) and* ψ*,*θ *are quantifier-free. In particular,* π is R-equivalent to a universal formula *in the case where w* = ∅ *(i.e. in the case where all quantified variables of* π *are definable).*

*Proof.* We view Definition 6 as a recursive definition and use the equivalences

$$\left[\exists \underline{v}'\,(u = t(\underline{v}, \underline{v}') \wedge \varphi')\right] \leftrightarrow \left[\mathcal{R}\_{h\_i}(u) \wedge \forall \underline{w}\, \forall \underline{v}'\,(u = t(\underline{w}, \underline{v}') \rightarrow (\underline{v} = \underline{w} \wedge \varphi'))\right] \tag{5}$$

(where *t* is a proper flat term whose root symbol is the constructor *hi*) and

$$\left[\exists \underline{v}\,(\underline{v} = t(\underline{v}, \underline{u}) \wedge \varphi')\right] \leftrightarrow \left[\forall \underline{v}\,(\underline{v} = t(\underline{v}, \underline{u}) \rightarrow \varphi')\right] \tag{6}$$

Syntactic details are straightforward (they can be found in the online available extended version of the paper).

*Example 2.* Consider the formula

$$\exists w\_1 \exists w\_2 \exists w'\_1 \exists w'\_2 \exists u\, (x\_1 = h\_2(w\_1, w\_2) \land x\_2 = h\_2(w'\_1, w'\_2) \land u = h\_1(w\_1) \land u \neq w\_1) \tag{7}$$

whose existentially quantified variables are all definable. In particular, the variables *w*1*,w*2*,w*′1*,w*′2 fall within case (i) of Definition 6, hence their conversion into universal variables follows schema (5). On the contrary, *u* is converted into a universal variable using schema (6), because *u* falls within case (ii) of Definition 6: in fact, *u* is recursively definable from *w*1 via the equality *u* = *h*1(*w*1) (the latter is a fixpoint equation, indeed a trivial one, where the variable on the left-hand side of the equality symbol does not occur in the right-hand side term).

**Lemma 3.** *Suppose that C*(*x,y*) *is a constraint in solved form and that in each literal from C there is at least one occurrence of a variable from y; suppose also that in the existential formula* ∃*yC*(*x,y*) *the variables y are not definable. Then we have*

$$\mathcal{R} \models \forall \underline{x}\,(\mathtt{Distinct}(\underline{x}) \to \left[ \exists \underline{y}\, C(\underline{x}, \underline{y}) \leftrightarrow \exists \underline{x}\, \exists \underline{y}\, C(\underline{x}, \underline{y}) \right]) \tag{8}$$

*(here* Distinct(*x*) *stands for the conjunction of the disequalities x*′ ≠ *x*″*, varying the conjuncts over pairwise different x*′*,x*″ ∈ *x).*

*Proof.* First a comment: what the lemma says is that, in contexts where all parameters *x* are interpreted as distinct trees, ∃*yC*(*x,y*) is equivalent to the *sentence* ∃*x*∃*yC*(*x,y*), which in turn is always equivalent to either ⊤ or ⊥ by Theorem 1.

The implication from left to right of ↔ is trivial, so let us assume that in the rational tree structure R the sentence

$$
\exists \underline{\mathfrak{x}} \exists \underline{\mathfrak{y}} C(\underline{\mathfrak{x}}, \underline{\mathfrak{y}}) \tag{9}
$$

is true, and let us assign to *x* = *x*1*,...,xn* arbitrary but distinct rational trees α(*x*1)*,...,*α(*xn*). We want to show that this assignment satisfies the formula ∃*yC*(*x,y*). Since our trees are rational, the trees α(*x*1)*,...,*α(*xn*) have only finitely many subtrees; let their number be *n* + *n*′ and let us list them as

$$
\alpha(\mathbf{x}\_1), \dots, \alpha(\mathbf{x}\_n), \alpha(\mathbf{x}\_1'), \dots, \alpha(\mathbf{x}\_{n'}') \,. \tag{10}
$$

We now write down a constraint *C*′(*x,x*′) (where *x*′ = *x*′1*,...,x*′*n*′) whose *unique* solution is precisely the (*n* + *n*′)-tuple of trees (10).<sup>2</sup> We shall show that the constraint

$$\exists \underline{\underline{x}} \, \exists \underline{\underline{x}}' \, \exists \underline{\underline{y}} \, (C(\underline{x}, \underline{\underline{y}}) \land C'(\underline{x}, \underline{x}')) \tag{11}$$

is satisfiable; this proves our claim, because the satisfiability of (11) implies in particular that α(*x*1)*,...,*α(*xn*) satisfy ∃*yC*(*x,y*). We now build the Σ-graphs

$$\mathcal{G}\_{(9)} = (G\_{(9)}, \{R\_{(9)}^{i}\}\_{i}, \lambda\_{(9)}) \quad \text{and} \quad \mathcal{G}\_{(11)} = (G\_{(11)}, \{R\_{(11)}^{i}\}\_{i}, \lambda\_{(11)})$$

associated with the formulae (9) and (11), respectively. In both cases, *when choosing the constants labeling external variables, we take fresh constants* not appearing in (9) and (11): we can do that because our signature Σ contains infinitely many constants and the trees α(*x*1)*,...,*α(*xn*) are rational.

We make a couple of observations. Since the *y* occur in each literal from *C* and the *y* are not definable, the equality flat literals from *C* must be of the kind *v* = *t*(*x,y*) where *v* ∈ *y*, so *the x are not head variables in C*. Moreover, *every head variable v* ∈ *y has a path*<sup>3</sup> *to an external variable w* ∈ *y* in G(9) (and consequently also in the bigger Σ-graph G(11)): if this were not so, considering the equality proper flat literals from *C* reachable

<sup>2</sup> For instance, if α(*x*1) is rooted by *h*2 and has as immediate successors the pair formed by α(*x*′4)*,*α(*x*1), then *C*′ contains as a conjunct the equality *x*1 = *h*2(*x*′4*, x*1), etc. Here we are using Proposition 1(vi) (with an empty parameter tuple *z*).

<sup>3</sup> By a path in a Σ-graph G = (*G*, {*R*<sup>*i*</sup>}<sub>*i*≥1</sub>, λ) from a node *g* ∈ *G* to a node *g*′ ∈ *G* we mean a chain of the kind *g* = *g*<sub>0</sub> *R*<sup>*i*<sub>1</sub></sup> *g*<sub>1</sub> ··· *g*<sub>*k*−1</sub> *R*<sup>*i*<sub>*k*</sub></sup> *g*<sub>*k*</sub> = *g*′.

from *v*, we could obtain a set of proper flat equalities witnessing the definability of a subset of variables that includes *v* (see Definition 6(ii)). By the equality proper flat literals *reachable from v*, we mean the smallest set of literals from *C* containing the equality proper flat literal headed by *v* and such that, if it contains *v* = *t*(*u*, *x*), then for every *u* ∈ *u* it also contains the equality proper flat literal headed by *u* (if it exists).

Let us now consider *a bisimulation relation in* G<sub>(11)</sub>: we claim that it must be of the kind ρ ∪ id<sub>*G*<sub>(11)</sub></sub>, where ρ is a set of pairs (*v*<sub>1</sub>, *v*<sub>2</sub>) formed by nodes which are *both head nodes belonging to y*. In fact, external nodes from *y* are not bisimilar to each other (they are labeled by different constants), nor can they be bisimilar to the *x*, *x*′ (the constants labeling them are disjoint from the constants labeling the *x*, *x*′), nor to head nodes from *y* (the latter are not labeled by constants). Similarly, the nodes *x*, *x*′ are not bisimilar to each other (the sub-Σ-graph of G<sub>(11)</sub> involving the *x*, *x*′ is a sub-Σ-graph of *Rat*, where the only bisimulation is the identity, see Proposition 5; recall that the *x*, *x*′ represent in G<sub>(11)</sub> the pairwise *different* trees (10)), nor to head nodes from *y* (because the latter have a path to some external node taken from the *y*, as explained above, whereas the *x*, *x*′ only have paths inside themselves).

Suppose now that (11) is not satisfiable; this means that there is a disequality *u*<sub>1</sub> ≠ *u*<sub>2</sub> from *C*′ (recall that *C* does not contain disequalities) such that PCC does not fail when initialized to ρ = {(*u*<sub>1</sub>, *u*<sub>2</sub>)} in G<sub>(11)</sub>. Since this entails that *u*<sub>1</sub> and *u*<sub>2</sub> are bisimilar in G<sub>(11)</sub>, we have that *u*<sub>1</sub>, *u*<sub>2</sub> are both head nodes belonging to *y*. By induction, we show that PCC initialized to ρ = {(*u*<sub>1</sub>, *u*<sub>2</sub>)} in G<sub>(9)</sub> makes precisely the same steps and halts after precisely the same number of steps as PCC initialized to ρ = {(*u*<sub>1</sub>, *u*<sub>2</sub>)} in G<sub>(11)</sub> (this means that (9) is not satisfiable either, a contradiction). Suppose in fact that PCC in G<sub>(11)</sub> added to ρ a pair of head nodes *v*<sub>1</sub>, *v*<sub>2</sub> belonging to *y* (only such nodes can be bisimilar and not identical to each other) and that we have *v*<sub>1</sub> *R*<sup>*i*</sup> *v*′<sub>1</sub> and *v*<sub>2</sub> *R*<sup>*i*</sup> *v*′<sub>2</sub> in G<sub>(11)</sub>. Then *v*′<sub>1</sub> and *v*′<sub>2</sub> are bisimilar and so, if they are not identical to each other, they are also head nodes belonging to *y*. In such a case, we have *v*<sub>1</sub> *R*<sup>*i*</sup> *v*′<sub>1</sub> and *v*<sub>2</sub> *R*<sup>*i*</sup> *v*′<sub>2</sub> in G<sub>(9)</sub> as well, and so PCC in G<sub>(9)</sub> merges their equivalence classes precisely as PCC does in G<sub>(11)</sub>. Moreover, when PCC halts in G<sub>(11)</sub>, so must PCC in G<sub>(9)</sub>, because nodes in *y* can see via the *R*<sup>*i*</sup> the same nodes in both graphs: thus if PCC halted in G<sub>(11)</sub> and in G<sub>(9)</sub> for some (*v*<sub>1</sub>, *v*<sub>2</sub>) ∈ ρ we have *v*<sub>1</sub> *R*<sup>*i*</sup> *v*′<sub>1</sub> and *v*<sub>2</sub> *R*<sup>*i*</sup> *v*′<sub>2</sub> in G<sub>(9)</sub>, then we also have *v*<sub>1</sub> *R*<sup>*i*</sup> *v*′<sub>1</sub> and *v*<sub>2</sub> *R*<sup>*i*</sup> *v*′<sub>2</sub> in G<sub>(11)</sub>, and (*v*′<sub>1</sub>, *v*′<sub>2</sub>) has already been added by PCC to ρ (actually in both Σ-graphs) because PCC halted in G<sub>(11)</sub>.
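The inductive argument above tracks how PCC propagates an initial pair of nodes through a Σ-graph, failing as soon as it hits a label clash. As an illustration only (the graph encoding, the pair-based closure, and all names below are our own; the actual PCC merges equivalence classes, union-find style), the check can be sketched as:

```python
class SigmaGraph:
    """A finite Sigma-graph: every node has a label (a constructor symbol or
    a constant) and an ordered tuple of successors (relations R^1, R^2, ...).
    Equally labeled nodes are assumed to have equally many successors."""

    def __init__(self):
        self.label = {}
        self.succ = {}

    def add(self, node, label, succ=()):
        self.label[node] = label
        self.succ[node] = tuple(succ)


def pcc(graph, u1, u2):
    """Close the initial pair (u1, u2) under corresponding successors,
    failing on a label clash.  Returns the set of added pairs if a
    bisimulation containing (u1, u2) exists, or None if PCC fails."""
    rho, todo = set(), [(u1, u2)]
    while todo:
        v1, v2 = todo.pop()
        if v1 == v2 or (v1, v2) in rho or (v2, v1) in rho:
            continue
        if graph.label[v1] != graph.label[v2]:
            return None  # label clash: u1 and u2 cannot be bisimilar
        rho.add((v1, v2))
        # propagate the pair to corresponding successors
        todo.extend(zip(graph.succ[v1], graph.succ[v2]))
    return rho
```

For instance, the two rational trees solving *x* = *h*<sub>1</sub>(*x*) and *y* = *h*<sub>1</sub>(*y*) are recognized as bisimilar, while a node labeled by a constant clashes with one labeled by a constructor.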

# **6 Model Completeness**

We improve Proposition 2. Let us say that a primitive formula is in *reduced solved form* iff it can be written in the form *E*(*x*′, *x*) ∧ ∃*y* *C*(*x*, *y*) (see (3)) and, in addition to the requirements of Definition 3, the variables *y* are definable in ∃*y* *C*(*x*, *y*).

**Theorem 2.** *Every existential formula is* R*-equivalent to a disjunction of primitive formulae in reduced solved form and also to a universal formula. As a consequence,* R *is model-complete.*

*Proof.* Consider an existential formula π(*x*) of the form ∃*w* φ(*v*, *w*). Given an equivalence relation *E* over *v* ∪ *w*, let us call the 'full display of *E*' the conjunction of all the equalities *z*<sub>1</sub> = *z*<sub>2</sub> (for (*z*<sub>1</sub>, *z*<sub>2</sub>) ∈ *E*) and of all the disequalities *z*<sub>1</sub> ≠ *z*<sub>2</sub> (for (*z*<sub>1</sub>, *z*<sub>2</sub>) ∉ *E*). We conjoin the matrix φ(*v*, *w*) of our existential formula ∃*w* φ(*v*, *w*) with the disjunction of the full displays of all the possible equivalence relations over *v* ∪ *w*; then we convert to DNF, distribute ∃*w* over disjunctions and let each disjunct be handled by the algorithm of Proposition 2. As a result, our π is equivalent to a disjunction of primitive solved forms *E*(*x*′, *x*) ∧ ∃*y* *C*(*x*, *y*), whose matrix *C*(*x*, *y*) is of the kind Distinct(*x*, *y*) ∧ φ(*x*, *y*). We treat these disjuncts separately.
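The guessing step of this proof enumerates all equivalence relations over *v* ∪ *w* together with their full displays; there are as many of them as set partitions of *v* ∪ *w* (the Bell number). A sketch of this enumeration, with illustrative names of our own:

```python
def partitions(vars_):
    """Enumerate all set partitions of the list vars_; each partition
    induces one equivalence relation E (variables in the same block are
    equated, variables in different blocks are kept distinct)."""
    if not vars_:
        yield []
        return
    first, rest = vars_[0], vars_[1:]
    for p in partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        # ... or into a fresh singleton block
        yield [[first]] + p


def full_display(partition):
    """The 'full display' of the induced equivalence relation: equalities
    inside each block, disequalities across blocks (as two lists of pairs)."""
    eqs, neqs = [], []
    for i, block in enumerate(partition):
        eqs += [(block[0], v) for v in block[1:]]
        for other in partition[i + 1:]:
            neqs += [(v, w) for v in block for w in other]
    return eqs, neqs
```

For three variables this yields the 5 partitions {xyz}, {x}{yz}, {xy}{z}, {y}{xz}, {x}{y}{z}, which is one source of the blow-up discussed in the conclusions.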

Fix then such a disjunct. Let us split *y* as *y*′, *y*″, where the *y*′ are definable and the *y*″ are not definable in ∃*y* *C*(*x*, *y*). Thus we can rewrite ∃*y* *C*(*x*, *y*) as

$$
\exists \underline{y}' \, \big( C'(\underline{x}, \underline{y}') \land \exists \underline{y}'' \, C''(\underline{x}, \underline{y}', \underline{y}'') \big) \tag{12}
$$

where in *C*″ all literals contain at least one occurrence of some of the *y*″, the *y*′ are definable in ∃*y*′ *C*′(*x*, *y*′), the *y*″ are not definable in ∃*y*″ *C*″(*x*, *y*′, *y*″) and the constraint *C*′ contains the literals Distinct(*x*, *y*′) (this is because Distinct(*x*, *y*′, *y*″) entails Distinct(*x*, *y*′)). Now we can apply Lemma 3, according to which the subformula ∃*y*″ *C*″(*x*, *y*′, *y*″) can be replaced by its existential closure ∃*x* ∃*y*′ ∃*y*″ *C*″(*x*, *y*′, *y*″), and the latter simplifies either to ⊤ or to ⊥.

In conclusion, our initial existential formula π(*x*) is equivalent to a formula of the required shape, namely to a disjunction of primitive formulae in reduced solved form

$$\bigvee\_{k} \big[ E\_{k}(\underline{x}\_{k}', \underline{x}\_{k}) \wedge \exists \underline{y}\_{k} \, C\_{k}(\underline{y}\_{k}, \underline{x}\_{k}) \big], \tag{13}$$

where the *y*<sub>k</sub> are definable in ∃*y*<sub>k</sub> *C*<sub>k</sub>(*y*<sub>k</sub>, *x*<sub>k</sub>) (these are the disjuncts surviving the above simplifications). Such a formula is also equivalent to a universal formula by Lemma 2.

Theorem 2 (together with Theorem 1) *gives an algorithm for deciding* R*-satisfiability of any first-order sentence*, because this theorem supplies an effective procedure for rewriting an existential formula into a universal one (see the remark after Definition 1).

*Example 3.* Consider the formula

$$\exists w\_1 \exists w\_2 \forall z \forall u \, \big( x\_1 = h\_2(w\_1, w\_2) \land x\_2 \neq h\_1(z) \land \neg \mathrm{At}(x\_2) \land \neg (h\_1(u) = u \land u = w\_2) \big) \, . \tag{14}$$

To bring it to the form (13) (and then check its satisfiability), we first rewrite it as ∃*w*<sub>1</sub>∃*w*<sub>2</sub>¬∃*z*∃*u* ψ (here ψ is the negated matrix of (14)), then convert ∃*z*∃*u* ψ to a universal formula θ; in this way ∃*w*<sub>1</sub>∃*w*<sub>2</sub>¬θ will again be existential. To the latter, we can apply the procedure of Theorem 2 and obtain a disjunction of primitive formulae in reduced solved form. After simplifications, in our case we get just one disjunct, namely

$$\exists w\_1 \exists w\_2 \exists w\_1' \exists w\_2' \exists u \, \big( x\_1 = h\_2(w\_1, w\_2) \land x\_2 = h\_2(w\_1', w\_2') \land u = h\_1(u) \land u \neq w\_1 \big) \quad (15)$$

(this is the same formula as (7) of Example 2, which is, by the way, satisfiable according to Theorem 1).

The proof of Theorem 2 shows how to build a definitional extension of R enjoying quantifier elimination. To this aim, it is sufficient to add to the language new operation symbols for selectors and simultaneous parametric fixpoints.

For selectors, we need to take care of the fact that selectors are not totally defined. We adopt the classical solution of [11]: a badly applied selector just returns its argument. In other words, for every constructor *h*<sub>i</sub> of arity *ar*<sub>i</sub>, we add to the signature of R new unary function symbols Sel<sup>*j*</sup><sub>*i*</sub> (for *j* = 1, …, *ar*<sub>i</sub>), to be interpreted in rational trees so that the following formula is true for every assignment to *x*, *y*:

$$\mathrm{Sel}\_i^j(x) = y \leftrightarrow \big( \exists z\_1 \dots \exists z\_{ar\_i} \, (x = h\_i(z\_1, \dots, z\_{ar\_i}) \land y = z\_j) \big) \lor \big( \neg R\_{h\_i}(x) \land y = x \big) \tag{16}$$
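The total-selector convention of (16) is easy to exercise on finite trees encoded as nested tuples; the encoding and the helper name below are our own illustration, not part of the formal development:

```python
def sel(constructor, j, tree):
    """Total selector in the style of axiom (16): return the j-th immediate
    subtree if `tree` is rooted by `constructor`, and `tree` itself
    otherwise (the classical 'badly applied selector' convention of [11]).

    Trees are nested tuples: ('h2', left, right); constants are ('a',)."""
    root, children = tree[0], tree[1:]
    if root == constructor and 1 <= j <= len(children):
        return children[j - 1]
    return tree  # badly applied selector: identity
```

For example, `sel('h2', 1, ('h2', ('a',), ('b',)))` yields the first subtree `('a',)`, while applying a selector for a different constructor returns the argument unchanged.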

For simultaneous parametric fixpoints, we need infinitely many extra functions. Consider two tuples of variables *x*⁻ = *x*⁻<sub>1</sub>, …, *x*⁻<sub>n</sub> and *y*∗ = *y*∗<sub>1</sub>, …, *y*∗<sub>m</sub> and proper flat literals *x*⁻ = *t*(*x*⁻, *y*∗): below, we write *t*(−, ∗) for *t*(*x*⁻, *y*∗) and *t*(*x*, *y*) for the substitution instance *t*(*x*/*x*⁻, *y*/*y*∗). We introduce *n* new function symbols Fix<sup>*i*</sup><sub>*t*(−,∗)</sub> (*i* = 1, …, *n*), all having arity *m*, interpreted in rational trees so that the following formula is true for every assignment to the variable *x* and to the *m*-tuple of variables *y*:

$$\mathrm{Fix}\_{t(-,\*)}^{i}(\underline{y}) = x \leftrightarrow \exists \underline{x} \, (\underline{x} = \underline{t}(\underline{x}, \underline{y}) \land x = x\_i) \tag{17}$$

where *x*<sub>i</sub> denotes the *i*-th component of the tuple *x*.
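The semantics (17) can be illustrated on rational trees represented as labelled graphs with cycles: solving the flat system *x* = *t*(*x*, *y*) amounts to tying a knot in the graph. The representation and all names below are our own sketch:

```python
class Node:
    """A node of a rational tree, given as a labelled graph (cycles allowed)."""

    def __init__(self, label):
        self.label = label
        self.children = []


def fix(equations, i, params):
    """Sketch of the semantics of (17): `equations` maps each bound variable
    x_k to a pair (constructor, operands), where an operand is either another
    bound variable or a key of `params` (a parameter tree y_j, as a Node).
    The unique rational-tree solution is built by tying the knot; the Node
    for x_i is returned."""
    nodes = {x: Node(sym) for x, (sym, _) in equations.items()}
    for x, (_, operands) in equations.items():
        for op in operands:
            # bound variables point back into the system; others are parameters
            nodes[x].children.append(nodes[op] if op in nodes else params[op])
    return nodes[i]
```

The fixpoint of *z* = *h*<sub>1</sub>(*z*) from footnote 4, the infinite unary *h*<sub>1</sub>-tree, becomes a single node whose only child is itself.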

Let us denote with R<sup>+</sup> the Σ-structure of rational trees enriched with the interpretation of the above extra function symbols; we also denote by R<sup>+</sup> the theory whose axioms are the sentences that are true in this expanded structure. From Theorem 2 and (16), (17), we have the following result (see the extended version of the paper, available online, for the straightforward syntactic details):

# **Theorem 3.** <sup>R</sup><sup>+</sup> *enjoys quantifier elimination.*

*Example 4.* The formula *R*<sub>*h*<sub>i</sub></sub>(*x*) (which can be written both as a universal and as an existential formula in R) can be rewritten in R<sup>+</sup> as

$$\mathbf{x} = h\_l(\mathbf{Se1}\_l^1(\mathbf{x}), \dots, \mathbf{Se1}\_l^{ar\_l}(\mathbf{x})) \tag{18}$$

without using quantifiers.

*Example 5.* Eliminating quantifiers from (15), we obtain

$$x\_1 = h\_2(\mathrm{Sel}\_{h\_2}^1(x\_1), \mathrm{Sel}\_{h\_2}^2(x\_1)) \land x\_2 = h\_2(\mathrm{Sel}\_{h\_2}^1(x\_2), \mathrm{Sel}\_{h\_2}^2(x\_2)) \land \mathrm{Sel}\_{h\_2}^1(x\_1) \neq \mathrm{Fix}\_{h\_1(-)}^1 ; \tag{19}$$

this formula says that *x*<sub>1</sub>, *x*<sub>2</sub> are both rooted with *h*<sub>2</sub> and that applying the first selector to *x*<sub>1</sub> we get something different from the fixed point of *h*<sub>1</sub>;<sup>4</sup> the latter can be represented as the infinite term

$$h\_1(h\_1(h\_1\cdots))$$

(it is the infinite unary tree labelled by *h*1).

<sup>4</sup> Formally, Fix<sup>1</sup><sub>*h*<sub>1</sub>(−)</sub> is the constant expressing the fixpoint of *z* = *h*<sub>1</sub>(*z*): this is the simultaneous fixpoint of a tuple of functions with empty parameters, formed by the single function *h*<sub>1</sub>(*z*) (the parameters ∗ are missing and the superscript 1 means the projection to the first and unique component of such a tuple of functions).

### **7 Conclusions, Related and Further Work**

In this paper we introduced a variant of the theory of finite and infinite trees whose signature has finitely many constructors, infinitely many constants and an atomicity predicate. We proved that this theory is model-complete, showed that every formula in this theory is equivalent to a disjunction of primitive formulae in reduced solved form, and proved that one can achieve full quantifier elimination by adding selectors and simultaneous parametric fixpoints to the language.

One of the main insights given by the above results is the fact that every formula of our theory essentially expresses a Boolean combination of *algebraic dependency relations involving constructors, selectors and fixpoints* (see Example 5 to understand what we mean). We believe that this is the contribution lying behind statements like those of Theorems 2 and 3. The 'explicit solved forms' of [4] and the 'terms with pointers' of [14] ultimately carry this information too, but perhaps in a less transparent way (a thorough comparison requires technical details that we cannot report here for space reasons; see the extended version of the paper, available online).

One may wonder whether our analysis extends to other variants of the theory of trees. Clearly, if for instance we have infinitely many constructors in the signature, we lose the possibility of expressing formulae like *R*<sub>*h*<sub>i</sub></sub>(*x*) both as existential and universal formulae (see (1) and (2)), so model-completeness is likely to fail. Other partial results might still be recoverable; this has to be investigated in future work.

We also wonder whether our techniques can be effective for extensions of the theory of trees; we leave this too for future work. For instance, adding a finiteness predicate (as in [4,22]) is worth investigating.

Concerning our algorithms, we underline that the constraint solving algorithm of Sect. 4 (based on bisimulations and congruence closure) is rather efficient and its complexity is comparable with analogous constraint solving algorithms from the literature [19]. The situation is different for our decision procedure once extended to all elementary formulae. We believe that the algorithm of Theorem 2, relying on model-completeness, is direct and intuitive; however, important improvements still need to be designed.

A first improvement would avoid the guessing of an equivalence relation between variables in the proof of Theorem 2: this improvement however requires some care and involves a strengthening of the technical Lemma 3, so we leave it for future work. The strengthening should be able to compute a disjunction of e-diagrams over *x* equivalent to ∃*y* *C*(*x*, *y*), for every existential formula ∃*y* *C*(*x*, *y*) satisfying the peculiar hypotheses of Lemma 3.

The other source of complexity of the algorithm deciding satisfiability of all first-order formulae is the need for DNF conversions every time a quantifier alternation is removed. Although this problem is somewhat unavoidable (see the lower bound mentioned in the introduction), we believe that many redundancies arising during computations can be removed with suitable heuristics and simplification routines. Another possibility is to exploit the rich algebraic structure of R<sup>+</sup> in order to directly design a quantifier elimination algorithm in R<sup>+</sup>: here the rich algebraic structure should permit the adoption of 'testing points' methods, similar to those employed in quantifier elimination for numerical domains [1].

# **References**



# **Certifying Phase Abstraction**

Nils Froleyks1(B), Emily Yu<sup>2</sup>, Armin Biere<sup>3</sup>, and Keijo Heljanko4,5

<sup>1</sup> Johannes Kepler University, Linz, Austria

N.froleyks@gmail.com

<sup>2</sup> Institute of Science and Technology Austria, Klosterneuburg, Austria

<sup>3</sup> Albert–Ludwigs–University, Freiburg, Germany

<sup>4</sup> University of Helsinki, Helsinki, Finland

<sup>5</sup> Helsinki Institute for Information Technology, Helsinki, Finland

**Abstract.** Certification helps to increase trust in formal verification of safety-critical systems which require assurance on their correctness. In hardware model checking, a widely used formal verification technique, phase abstraction is considered one of the most commonly used preprocessing techniques. We present an approach to certify an extended form of phase abstraction using a generic certificate format. As in earlier works, our approach involves constructing a witness circuit with an inductive invariant property that certifies the correctness of the entire model checking process, which is then validated by an independent certificate checker. We have implemented and evaluated the proposed approach, including certification for various preprocessing configurations, on hardware model checking competition benchmarks. As an improvement on previous work in this area, the proposed method is able to complete certification efficiently, with an overhead that is only a fraction of the model checking time.

# **1 Introduction**

Over the past few decades, symbolic model checking [2,22,23] has been put forward as one of the most effective techniques in formal verification. A lot of trust is placed in model checking tools when assessing the correctness of safety-critical systems. However, model checkers themselves, and the symbolic reasoning tools they rely on, are exceedingly complex, both in the theory of their algorithms and in their practical implementation. They often run for multiple days, distributed across hundreds of interacting threads, ultimately yielding a single bit of information signaling the verification result. To increase trust in these tools, several approaches have attempted to implement fully verified model checkers in a theorem proving environment such as Isabelle [1,27,54]. However, the scalability as well as the versatility of those tools is often rather limited. For example, a technique update tends to require the entire tool to be re-verified.

An alternative is to make model checkers provide machine-checkable proofs as certificates that can be validated by independent checkers [8–10,31,32,39, 42,47]. This is already a successful approach in SAT [34,35] (proofs have been mandatory in the SAT competition since 2016 [3]), and a very hot topic in SMT [4,5,36,51] and beyond [4]. Crucially, these certificates need to be simple enough to allow the implementation of a fully verified proof checker [33,37,41], and preferably verifiable "end-to-end", i.e., certifying all stages of the model checking process, including all forms of preprocessing steps.

The approach in [15,56,57] introduces a generic certificate format that can be directly generated from hardware model checkers via book-keeping. More specifically, the certificate is in the form of a Boolean circuit that comes with an inductive invariant, such that it can be verified by six simple SAT checks. So far, it has been shown to be effective across several model checking techniques, but it has not covered phase abstraction [16]. The experimental results from [15,56,57] also show performance challenges on more complex model checking problems. In this paper, we focus on refining the format for smaller certificates while accommodating additional techniques such as cone-of-influence reduction [22].

Phase abstraction [16] is a popular preprocessing technique which tries to simplify a given model checking problem by detecting and removing periodic signals that exhibit clock-like behaviors. These signals are essentially the clocks embedded in circuit designs, often due to the design style of multi-phase clocking [46]. Phase abstraction helps reduce circuit complexity, thereby making the backend model checking task easier. Differently from [6,7], where the concept was first suggested and required syntactic analysis and user inputs, phase abstraction [16] makes use of ternary simulation to automatically identify a group of clock-like latches. Besides this, ternary simulation has also been utilized in the context of temporal decomposition [20] for detecting transient signals.

In industrial settings, due to the use of complex reset logic as well as circuit synthesis optimizations, clock signals are sometimes delayed by a number of initialization steps [19]. To further optimize the verification procedure, we extend phase abstraction by exploiting the power of ternary simulation to capture different classes of periodic signals, including those that behave as clocks only partially, as well as equivalent signals [26]. An optimal phase number is computed based on globally extracted patterns, which is then used to unfold the circuit multiple times. The resulting unfolded circuit further undergoes rewriting and cone-of-influence reduction before it is passed on to a base model checker for final verification. To summarize, our contributions are as follows:


After background in Sect. 2, Sect. 3 introduces the notion of periodic signals. In Sect. 4 we present an extended variant of phase abstraction that simplifies the original model with periodic signals. In Sect. 5 we define a refined certificate format and present a general certification approach for phase abstraction. In Sect. 6 we describe the implementation of mc2 and then show the effectiveness of our new certification approach in Sect. 7.

# **2 Background**

Given a set of Boolean variables V, a literal l is either a variable v ∈ V or its negation ¬v. A *cube* is a non-contradictory set of literals. Let c be such a cube over a set of variables L and assume L′ are copies of L, i.e., each l ∈ L corresponds bijectively to an l′ ∈ L′. Then we write c(L′) to denote the resulting cube after replacing the variables in c with their corresponding variables in L′. For a Boolean formula f, we write f|<sub>l</sub> and f|<sub>¬l</sub> to denote the formula after substituting all occurrences of the literal l with ⊤ and ⊥, respectively. We use the equality symbols ≅ [24] and ≡ to denote syntactic and semantic equivalence, and similarly → and ⇒ to denote syntactic and semantic logical implication.
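As a toy illustration of the cofactor notation f|<sub>l</sub> and f|<sub>¬l</sub> (the formula encoding and all names below are our own, unrelated to the tool described later):

```python
def cofactor(f, var, value):
    """Substitute a variable by a Boolean constant, as in f|l (value=True)
    and f|not-l (value=False).  Formulas are nested tuples:
    ('var', name), ('const', b), ('not', g), ('and', g, h), ('or', g, h)."""
    op = f[0]
    if op == 'var':
        return ('const', value) if f[1] == var else f
    if op == 'const':
        return f
    if op == 'not':
        return ('not', cofactor(f[1], var, value))
    return (op, cofactor(f[1], var, value), cofactor(f[2], var, value))


def evaluate(f, env):
    """Evaluate a formula under a (total) assignment env: name -> bool."""
    op = f[0]
    if op == 'var':
        return env[f[1]]
    if op == 'const':
        return f[1]
    if op == 'not':
        return not evaluate(f[1], env)
    a, b = evaluate(f[1], env), evaluate(f[2], env)
    return (a and b) if op == 'and' else (a or b)
```

For f = a ∧ ¬b, the cofactor f|<sub>¬b</sub> fixes b to ⊥ and the resulting formula depends on a alone.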

**Definition 1 (Circuit).** *A circuit* C *is represented by a quintuple* (I, L, R, F, P)*, where* I *and* L *are (finite) sets of input and latch variables. The reset functions are given as* R = {r<sub>l</sub>(I, L) | l ∈ L}*, where the individual reset function* r<sub>l</sub>(I, L) *for a latch* l ∈ L *is a Boolean formula over the inputs* I *and latches* L*. Similarly, the set of transition functions is given as* F = {f<sub>l</sub>(I, L) | l ∈ L}*. Finally,* P(I, L) *denotes a safety property corresponding to the set of good states, again encoded as a Boolean formula over the inputs and latches.*

This notion can be extended to more general circuits involving, for instance, word-level semantics or even continuous variables by replacing in this definition Boolean formulas by corresponding predicates and terms in first-order logic modulo theories. For simplicity of exposition we focus in this work on Boolean semantics, which matches the main application area we are targeting, i.e., industrial-scale gate-level hardware model checking. We claim that extensions to "circuits modulo theories" are quite straightforward.

A concrete state is an assignment to the variables I ∪ L. Therefore the set of reset states of a circuit is the set of satisfying assignments to R(L) = ⋀<sub>l∈L</sub> (l ≅ r<sub>l</sub>(I, L)).

Note the use of syntactic equality "≅" in this definition.

As in previous work [15] we assume acyclic reset functions. Therefore R(L) is always satisfiable. A circuit with acyclic reset functions is called *stratified*.

As in bounded model checking [11], with I<sup>i</sup> and L<sup>i</sup> "temporal" copies of I and L at time step i, the *unrolling* of a circuit up to length k is expressed as:

$$U\_k = \bigwedge\_{i \in [0,k)} (L\_{i+1} \cong F(I\_i, L\_i)).$$

Cube simulation [57] subsumes ternary simulation in the sense that a lasso found by ternary simulation can also be found via cube simulation. A cube simulation is a sequence of cubes c<sub>0</sub>, …, c<sub>δ</sub>, …, c<sub>δ+ω</sub> over latches L such that (1) R(L) ⇒ c<sub>0</sub>; (2) c<sub>i</sub> ∧ (L′ ≅ F(I, L)) ⇒ c′<sub>i+1</sub> for all i ∈ [0, δ + ω), where c′<sub>i+1</sub> is the primed copy of c<sub>i+1</sub>. It is called a cube lasso if c<sub>δ+ω</sub> ∧ (L′ ≅ F(I, L)) ⇒ c′<sub>δ</sub>. In this case δ is the stem length and ω is the loop length. For δ = 0, the initial cube is already part of the loop, and for ω = 0, the lasso ends in a self-loop.
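For illustration, classical ternary simulation, which cube simulation subsumes, can be sketched as follows: three-valued states (each corresponding to one cube, with X-valued latches unconstrained) are iterated until a state repeats, yielding a stem length δ and a loop length ω. The encoding and names are our own sketch:

```python
X = 'x'  # the unknown value of three-valued (ternary) simulation


def t_not(a):
    """Three-valued negation: X stays X."""
    return X if a == X else (not a)


def t_and(a, b):
    """Three-valued conjunction: False dominates, otherwise X is absorbing."""
    if a is False or b is False:
        return False
    if a == X or b == X:
        return X
    return True


def ternary_lasso(step, state):
    """Iterate a ternary transition function `step` from a ternary reset
    `state` (a hashable tuple) until a state repeats; return
    (stem_length, loop_length, states) with the lasso convention of the
    text: states[delta + omega] steps back to states[delta]."""
    seen, states = {}, []
    while state not in seen:
        seen[state] = len(states)
        states.append(state)
        state = step(state)
    delta = seen[state]              # index where the loop starts
    omega = len(states) - delta - 1  # loop length
    return delta, omega, states
```

A single latch with transition a′ = ¬a and reset ⊥ is found to be a clock with δ = 0 and ω = 1.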

### **3 Periodic Signals**

In sequential hardware designs, signals that eventually stabilize to a constant, i.e., to ⊤ or ⊥, after certain initialization steps are called *transient* signals [20,57], whereas oscillating signals have clock-like or periodic behaviors. The simplest example of a clock is a latch that always oscillates between ⊤ and ⊥.

Since hardware designs typically contain complex initialization logic, there are occurrences of delayed oscillating signals, like clocks that start ticking only after several reset steps, combining transient and clock behaviors. We generalize this concept to categorize latches as periodic signals associated with a *duration* (i.e., the number of time steps for which a signal is delayed) and a *phase number* (i.e., the period length of a periodic behavior). Moreover, our generalization also captures equivalent and antivalent signals [26], as well as those that exhibit partial periodic behaviors. See Fig. 1 for an example.

**Fig. 1.** An example of a cube lasso over the latches <sup>l</sup> <sup>∈</sup> <sup>L</sup> <sup>=</sup> {a, b, c, d}. In the example the tall rectangles represent cubes as partial assignments, i.e., the second cube from the left is (¬a) <sup>∧</sup> <sup>b</sup> <sup>∧</sup> <sup>d</sup>. Phase 0 and 1 are marked on top of the cubes. As shown, duration d = 1 and phase number n = 2 yield a high number of useful signals for this cube lasso. Each latch l is associated with a periodic pattern λ*<sup>l</sup>* on the right describing its behaviors for phase 0 and 1. If a latch is missing from a cube for a certain phase, it has no constraint thus we use the equality of the latch to itself in the signal. Latch a turns out to be a simple clock delayed by one step. Latches b and d behave clock-like but only in phase 0. Latch c always has the opposite value of latch b in phase 1. Note that we could also have <sup>¬</sup><sup>c</sup> in phase 1 of signal <sup>λ</sup>*<sup>b</sup>* but choosing a single representative for a set of equivalent signals is beneficial for the following simplification steps.

**Definition 2 (Periodic Signal).** *Given a circuit* C = (I, L, R, F, P) *and a cube lasso* c<sub>0</sub>, …, c<sub>δ</sub>, …, c<sub>δ+ω</sub>*. A periodic signal* λ<sub>l</sub> *for a latch* l ∈ L *is defined as* λ<sub>l</sub> = (d, [v<sub>0</sub>, …, v<sub>n−1</sub>]) *where* d ∈ N, n ∈ N<sup>+</sup> *and each* v<sub>i</sub> *is a latch literal or a constant, with* d ≤ δ*. We further require that there exist* k<sub>0</sub>, k<sub>1</sub> ∈ N<sup>+</sup> *with* k<sub>0</sub> · n + d = δ *and* k<sub>1</sub> · n = ω + 1 *such that for all* i ∈ [0, n) *and* j ∈ [0, k<sub>0</sub> + k<sub>1</sub>) *we have* c<sub>d+i+j·n</sub> ⇒ (l ≅ v<sub>i</sub>)*.*

For a signal λ<sub>l</sub> = (d, [v<sub>0</sub>, …, v<sub>n−1</sub>]) we write λ<sup>i</sup><sub>l</sub> to refer to the i-th element of [v<sub>0</sub>, …, v<sub>n−1</sub>], which we call its phase. See Fig. 1 for an example where k<sub>0</sub> = 1 and k<sub>1</sub> = 2.
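Extracting the constant part of a candidate periodic signal from a given cube lasso can be sketched as follows; this is a simplification of Definition 2 for fixed d and n (the encoding and helper name are ours, and the alignment conditions k<sub>0</sub>·n + d = δ and k<sub>1</sub>·n = ω + 1 are assumed to hold):

```python
def periodic_pattern(cubes, d, n, latch):
    """Given a cube lasso `cubes` (each cube a dict latch -> bool; an
    absent latch is unconstrained), a duration d and a phase number n,
    collect the value forced on `latch` in each phase i in [0, n): the
    cubes c_{d+i+j*n} must agree on one constant, otherwise the phase
    entry is None (no constant behavior in that phase)."""
    pattern = []
    for i in range(n):
        vals = {c.get(latch) for j, c in enumerate(cubes[d:]) if j % n == i}
        vals.discard(None)  # unconstrained cubes force nothing
        pattern.append(vals.pop() if len(vals) == 1 else None)
    return pattern
```

For a lasso where latch a is unconstrained initially and then alternates ⊥, ⊤, ⊥, ⊤, choosing d = 1 and n = 2 recovers the delayed-clock pattern [⊥, ⊤], as for latch a in Fig. 1.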

**Fig. 2.** Circuit transformation using phase abstraction.

### **4 Extending Phase Abstraction**

In this section, we revisit and extend phase abstraction by defining it as a sequence of preprocessing steps, as illustrated in Fig. 2. Differently from the approach in [16], we present phase abstraction as part of a compositional framework that handles a more general class of periodic signals. As our approach subsumes temporal decomposition, adopted from the framework in [57], we first apply *circuit forwarding* [57] for the duration d (i.e., unrolling the reset states of a circuit by d steps) before unfolding is performed.

Figure 2 illustrates the flow of phase abstraction. The process begins by using cube simulation to identify a set of periodic signals as defined in Sect. 3 and computing an optimal duration and phase number based on a selected cube lasso. Once the circuit is unfolded n times, factoring is performed by assigning constant values to the clock-like signals as well as replacing latches with their equivalent or antivalent representative latches in each phase. Next, the property is rewritten by applying structural rewriting techniques and the circuit is further simplified using cone-of-influence reduction. Finally, the simplified circuit (C<sub>n+4</sub> in Fig. 2) is checked using a base model checking approach such as IC3/PDR [17], or is preprocessed further.

In Fig. 3, we show an example of a circuit with 4-bit states representing 0, …, 9, where the initial state is 0. After forwarding the circuit by one step (d = 1), the initial state becomes 1. Subsequently, with an unfolding of n = 3, every state (marked with rectangles) in the unfolded circuit consists of three states of the original circuit. We introduce the formal definitions below.

Unfolding a circuit simply means to copy the transition function multiple times to compute n steps of the original circuit at once. Each copy of the transition function only has to deal with a single phase and can therefore be optimized.

**Definition 3 (Unfolded circuit).** *Given a circuit* C = (I, L, R, F, P) *and a phase number* n ∈ N<sup>+</sup>*. The unfolded circuit* C′ = (I′, L′, R′, F′, P′) *is:*

**Fig. 3.** An example of a forwarded (d = 1) and unfolded (n = 3) circuit. The circles denote states in the original circuit (0 is the initial state). The rectangles are states in the unfolded circuit.

1. I′ = I<sup>0</sup> ∪ ··· ∪ I<sup>n−1</sup>; L′ = L<sup>0</sup> ∪ ··· ∪ L<sup>n−1</sup>.
2. R′ = {r′<sub>l</sub> | l ∈ L′}: for l<sup>0</sup> ∈ L<sup>0</sup>, r′<sub>l<sup>0</sup></sub> = r<sub>l</sub>; for i ∈ (0, n) and l<sup>i</sup> ∈ L<sup>i</sup>, r′<sub>l<sup>i</sup></sub> = f<sub>l</sub>(I<sup>i</sup>, L<sup>i−1</sup>).
3. F′ = {f′<sub>l</sub> | l ∈ L′}: for l<sup>0</sup> ∈ L<sup>0</sup>, f′<sub>l<sup>0</sup></sub> = f<sub>l</sub>(I<sup>0</sup>, L<sup>n−1</sup>); for i ∈ (0, n) and l<sup>i</sup> ∈ L<sup>i</sup>, f′<sub>l<sup>i</sup></sub> = f<sub>l</sub>(I<sup>i</sup>, L<sup>i−1</sup>).
4. P′ = ⋀<sub>i∈[0,n)</sub> P(I<sup>i</sup>, L<sup>i</sup>).
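Operationally, the unfolding can be sketched with transition functions as plain callables; this is our own illustration of Definition 3, not the implementation described later (resets and the property are handled analogously):

```python
def unfold(latches, f, n):
    """Sketch of the unfolded circuit: it computes n steps of the original
    circuit at once.  `f[l](inputs, latch_values)` is the transition
    function of latch l.  Copy 0 reads the last copy of the previous
    superstep (L^{n-1}); every copy i > 0 reads copy i-1 of the same
    superstep, i.e., the copies chain combinationally."""
    def step(inputs_per_copy, prev_last_copy):
        state, copies = prev_last_copy, []
        for i in range(n):
            state = {l: f[l](inputs_per_copy[i], state) for l in latches}
            copies.append(state)
        return copies  # values of the copies L^0, ..., L^{n-1}
    return step
```

With n = 3 and a single toggling latch (f<sub>a</sub> = ¬a), one superstep starting from a = ⊥ yields the three copies ⊤, ⊥, ⊤, matching three steps of the original circuit.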

We obtain a simplified circuit by replacing the periodic signals with constants and equivalent/antivalent latches in the unfolded circuit.

**Definition 4 (Factor circuit).** *For a fixed duration* d *and phase number* n*, given a* d*-forwarded and* n*-unfolded circuit* C = (I, L, R, F, P) *and a periodic signal with duration* d *and phase number* n *for each latch, the factor circuit* C′ = (I, L, R′, F′, P) *is defined by:*

R′ = {r′<sub>l</sub> | l ∈ L}:
*–* r′<sub>l<sup>i</sup></sub> = λ<sup>i</sup><sub>l</sub>, *if* λ<sup>i</sup><sub>l</sub> ∈ {⊥, ⊤};
*–* r′<sub>l<sup>i</sup></sub> = r<sub>λ<sup>i</sup><sub>l</sub></sub>, *if* λ<sup>i</sup><sub>l</sub> ∈ L;
*–* r′<sub>l<sup>i</sup></sub> = ¬r<sub>¬λ<sup>i</sup><sub>l</sub></sub>, *otherwise.*

F′ = {f′<sub>l</sub> | l ∈ L}:
*–* f′<sub>l<sup>i</sup></sub> = λ<sup>i</sup><sub>l</sub>, *if* λ<sup>i</sup><sub>l</sub> ∈ {⊥, ⊤};
*–* f′<sub>l<sup>i</sup></sub> = f<sub>λ<sup>i</sup><sub>l</sub></sub>, *if* λ<sup>i</sup><sub>l</sub> ∈ L;
*–* f′<sub>l<sup>i</sup></sub> = ¬f<sub>¬λ<sup>i</sup><sub>l</sub></sub>, *otherwise.*

Replaced latches will be removed by a combination of rewriting and coneof-influence reduction introduced in the following definitions. There are various rewriting techniques also including SAT sweeping [30,38,43–45,59].

**Definition 5 (Rewrite circuit).** *Given a circuit* C = (I, L, R, F, P)*, a rewrite circuit* C' = (I, L, R, F, P') *satisfies* P ≡ P'*.*

For a given circuit, we apply cone-of-influence reduction to obtain a reduced circuit such that latches and inputs outside the cone of influence are removed.

**Definition 6 (Reduced circuit).** *Given a circuit* C = (I, L, R, F, P)*, the reduced circuit* C' = (I', L', R', F', P) *is defined as follows: –* I' = I ∩ coi(P)*; –* L' = L ∩ coi(P)*; –* R' = {r<sub>l</sub> | l ∈ L'}*; –* F' = {f<sub>l</sub> | l ∈ L'}*,*

*where the cone of influence of the property* coi(P) ⊆ (I ∪ L) *is defined as the smallest set of inputs and latches such that* vars(P) ⊆ coi(P) *as well as* vars(rl) ⊆ coi(P) *and* vars(fl) ⊆ coi(P) *for all latches* l ∈ coi(P)*.*

# **5 Certification**

We define a revised certificate format that allows smaller and more optimized certificates. We then propose a method for producing certificates for phase abstraction. The proofs for our main theorems can be found in the Appendix.

# **5.1 Restricted Simulation**

In the following, we define a new variant of the stratified simulation relation [15], which we call *restricted simulation*, that considers the intersection of latches shared between two given circuits as a common component.

**Definition 7 (Restricted Simulation).** *Given stratified circuits* C' *and* C *with* C' = (I', L', R', F', P') *and* C = (I, L, R, F, P)*. We say* C' simulates C *under the* restricted simulation relation *iff*


This new simulation relation differs from [15,56], where inputs were required to be identical in both circuits (I = I') and the latches of C had to form a subset of the latches of C' (L ⊆ L'). Under those previous "combinational" [56] or "stratified" [15] simulation relations, the simulating circuit C' therefore cannot have fewer latches than C. Allowing this is a feature we need, for instance, when incorporating certificates for cone-of-influence reduction [22], a common preprocessing technique. It opens up the possibility of reducing certificate sizes substantially.

Still, as for stratified simulation, restricted simulation can be verified by three simple SAT checks, i.e., separately for each of the three requirements in Definition 7.

**Definition 8 (Semantic independence).** *Let* V *be a set of variables and* v ∈ V*. Then a formula* f(V) *is said to be semantically independent of* v *iff*

$$f(\mathcal{V})|\_v \equiv f(\mathcal{V})|\_{\neg v}.$$
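Definition 8 can be checked by brute force on small examples: f is independent of v iff its two cofactors agree on every assignment to the remaining variables. The sketch below is ours; in practice such checks would be discharged by a SAT solver or BDDs rather than enumeration.

```python
# Semantic independence (Definition 8): f is independent of v iff the
# cofactors f|_v and f|_{¬v} are equivalent. Checked here by enumerating
# all assignments to the remaining variables (feasible only for toy sizes).

from itertools import product

def independent_of(f, v, variables):
    rest = [x for x in variables if x != v]
    for bits in product([False, True], repeat=len(rest)):
        env = dict(zip(rest, bits))
        if f({**env, v: True}) != f({**env, v: False}):
            return False
    return True

# (a ∧ b) ∨ (v ∧ ¬v) mentions v syntactically but never depends on it...
assert independent_of(lambda e: (e["a"] and e["b"]) or (e["v"] and not e["v"]),
                      "v", ["a", "b", "v"])
# ...while (a ∨ v) does depend on v.
assert not independent_of(lambda e: e["a"] or e["v"], "v", ["a", "b", "v"])
```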

Semantic independence [28,40,49,53] allows us to assume that a formula depends only on a subset of variables, which simplifies, without loss of generality, the proofs used in the rest of this section. The stratifiedness assumption on reset functions rules out cyclic dependencies; thus R'(L') is satisfiable. A reset state of a circuit is simply a satisfying assignment of the reset predicate R(L). Based on the reset condition (Definition 7.1), it is however necessary to show that every reset state of C can be extended to a reset state of C', with the common variables assigned identically in both circuits. This is stated in the lemma below; the proofs can be found in the Appendix.

**Lemma 1.** *Let* C = (I, L, R, F, P) *and* C' = (I', L', R', F', P') *be two stratified circuits satisfying the reset condition defined in Definition 7.1. Then* R' *is semantically dependent only on the common variables* L ∩ L'*.*

In fact, semantic independence is a direct consequence of restricted simulation; thus no separate check is required. We make a further remark that if the reset function is dependent on an input variable, then it has to be an input variable common to both circuits.

Based on this, we conclude with the main theorem for restricted simulation, stating that C is safe if C' is safe (i.e., no bad state that violates the property is reachable from any initial state).

**Theorem 1.** *Let* C = (I, L, R, F, P) *and* C' = (I', L', R', F', P') *be two stratified circuits, where* C' *simulates* C *under restricted simulation. If* C' *is safe, then* C *is also safe.*

Intuitively, if there is an unsafe trace in C, Definition 7.1 together with Lemma 1 allows us to find a simulating reset state, and Definition 7.2 lets us transition it to a simulating state that, by Definition 7.3, also violates the property in C'. Here a state in C' simulates a state in C if they match on all common variables. Building on this, we present witness circuits as a format for certificates. Verifying the restricted simulation relation requires three SAT checks, and another three SAT checks are needed for validating the inductive invariant [56]. Therefore certification requires in total six SAT checks as well as a polynomial-time check for reset stratification.

**Definition 9 (Witness circuit).** *Let* C = (I, L, R, F, P) *be a stratified circuit. A witness circuit* W = (J, M, S, G, Q) *of* C *satisfies the following:*


The witness circuit format subsumes [15,57], thus every witness circuit in their format is also valid under Definition 9.

**Fig. 4.** Certification for (extended) phase abstraction. Base model checking is performed on circuit C*n*+4, which produces a witness circuit W*n*+2 that certifies C*n*+2, C*n*+3, and C*n*+4. We construct witnesses step-wise to obtain W0, which is a certificate for the entire model checking procedure.

### **5.2 Certifying Phase Abstraction**

The certificate format is generic, subsumes [57], and is designed to potentially be used as a standard in future hardware model checking competitions. We proceed to demonstrate how a certificate can be constructed for a model checking pipeline that includes phase abstraction. The theorems in this section state that this construction guarantees that a certificate will be produced. We illustrate our certification pipeline in Fig. 4. After phase abstraction and base model checking, we can build a certificate backwards based on the certificate produced by the base model checker. The following theorem states that the witness circuit of the reduced circuit serves as a witness circuit for the original circuit too.

**Theorem 2.** *Given a circuit* C = (I, L, R, F, P) *and its reduced circuit* C' = (I', L', R', F', P)*. A witness circuit of* C' *is also a witness circuit of* C*.*

The outcome of rewriting is a circuit with a simplified property that maintains semantic equivalence with the original property. Therefore in our framework, the certificate for the simplified property is also valid for the original property. Furthermore, certificates can be optimized by rewriting at any stage. We summarize this in the following proposition.

**Proposition 1.** *Given a circuit* C *and its rewrite circuit* C'*. A witness circuit of* C' *is also a witness circuit of* C*.*

We define the composite witness circuit to combine the certificates for cube simulation and the factor circuit.

**Definition 10 (Composite witness circuit).** *Given a stratified circuit* C = (I, L, R, F, P) *and its factor circuit* C' = (I', L', R', F', P')*, and the unfolded loop invariant* φ = ⋁<sub>i∈[0,m)</sub> ⋀<sub>j∈[0,n)</sub> c<sup>i·n+j+d</sup>*, with* m = (δ + ω − d + 1)/n*, obtained from the cube lasso. Let* W' = (J', M', S', G', Q') *be a witness circuit of* C'*. The composite witness circuit* W = (J, M, S, G, Q) *is defined as follows:*

*1.* J = I ∪ J'*.*
*2.* M = L ∪ (M' \ L')*.*
*3.* S = {s<sub>l</sub> | l ∈ M}: *(a) for* l ∈ L, s<sub>l</sub> = r<sub>l</sub>*; (b) for* l ∈ M' \ L', s<sub>l</sub> = s'<sub>l</sub>*.*
*4.* G = {g<sub>l</sub> | l ∈ M}: *(a) for* l ∈ L, g<sub>l</sub> = f<sub>l</sub>*; (b) for* l ∈ M' \ L', g<sub>l</sub> = g'<sub>l</sub>*.*
*5.* Q = φ(L) ∧ Q'(J', M')*.*

**Theorem 3.** *Given circuit* C = (I, L, R, F, P) *and factor circuit* C' = (I', L', R', F', P')*. Let* W' = (J', M', S', G', Q') *be a witness circuit of* C'*, and* W = (J, M, S, G, Q) *constructed as in Definition 10. Then* W *is a witness circuit of* C*.*

**Fig. 5.** Every fully initialized state of a 3-folded witness circuit contains 3 original states that form an unfolded state. Two consecutive 3-folded states contain either the same unfolded states or two states consecutive in the unfolded circuit.

In the construction of an n-folded witness circuit from the unfolded witness W', only a single instance of the latches N of W' is used, but multiple copies of the original latches L. As illustrated in Fig. 5, these copies of L record a history, in contrast to their role in the unfolded circuit, where they compute multi-step transitions.

**Definition 11 (**n**-folded witness circuit).** *Given a circuit* C = (I, L, R, F, P) *with a phase number* n ∈ N<sup>+</sup>*, and its unfolded circuit* C' = (I', L', R', F', P')*. Let* W' = (J', M', S', G', Q') *be the witness circuit of* C'*. The* n*-folded witness circuit* W = (J, M, S, G, Q) *is defined as follows:*

*1.* J = I<sup>0</sup> ∪ J<sup>0</sup>*, where* I<sup>0</sup> *and* J<sup>0</sup> *are copies of* I *and* J' *respectively.*
*2.* M = I<sup>1</sup> ∪ ⋯ ∪ I<sup>m</sup> ∪ L<sup>0</sup> ∪ ⋯ ∪ L<sup>m</sup> ∪ N ∪ J<sup>1</sup> ∪ {b<sup>0</sup>, …, b<sup>m</sup>, e<sup>0</sup>, …, e<sup>n−2</sup>}*, where* m = 2 × n − 2*,* N = M' \ L'*,* I<sup>i</sup>, L<sup>i</sup> *are copies of* I *and* L*, and* J<sup>1</sup> *is a copy of* J'*.*
*3.* S = {s<sub>l</sub> | l ∈ M}: *(a)* s<sub>b<sup>0</sup></sub> = ⊤*. (b) For* i ∈ (0, m], s<sub>b<sup>i</sup></sub> = ⊥*. (c) For* i ∈ [0, n − 1), s<sub>e<sup>i</sup></sub> = ⊥*. (d) For* l ∈ L<sup>0</sup>, s<sub>l</sub> = r'<sub>l</sub>*. (e) For* l ∈ (I<sup>1</sup> ∪ ⋯ ∪ I<sup>m</sup> ∪ L<sup>1</sup> ∪ ⋯ ∪ L<sup>m</sup> ∪ J<sup>1</sup>), s<sub>l</sub> = l*. (f) For* l ∈ N, s<sub>l</sub> = s'<sub>l</sub>*.*
*4.* G = {g<sub>l</sub> | l ∈ M}: *(a)* g<sub>b<sup>0</sup></sub> = ⊤*. (b) For* i ∈ [1, m], g<sub>b<sup>i</sup></sub> = b<sup>i−1</sup>*. (c)* g<sub>e<sup>0</sup></sub> = b<sup>n−1</sup> ∧ ¬e<sup>n−2</sup>*. (d) For* i ∈ [1, n − 1), g<sub>e<sup>i</sup></sub> = e<sup>i−1</sup> ∧ ¬e<sup>n−2</sup>*. (e) For* l ∈ L<sup>0</sup>, g<sub>l</sub> = f<sub>l</sub>*. (f) For* l<sup>1</sup> ∈ J<sup>1</sup>, g<sub>l<sup>1</sup></sub> = l<sup>0</sup>*. (g) For* i ∈ [1, m], l<sup>i</sup> ∈ (I<sup>i</sup> ∪ L<sup>i</sup>), g<sub>l<sup>i</sup></sub> = l<sup>i−1</sup>*. (h) For* l ∈ N, g<sub>l</sub> = ite(e<sup>n−2</sup>, g'<sub>l</sub>(J<sup>1</sup>, M ∩ (I<sup>m−n+1</sup> ∪ ⋯ ∪ I<sup>m</sup> ∪ L<sup>m−n+1</sup> ∪ ⋯ ∪ L<sup>m</sup> ∪ N)), l)*.*
*5.* Q = ⋀<sub>i∈[0,8]</sub> q<sup>i</sup>: *(a)* q<sup>0</sup> = P(I<sup>0</sup>, L<sup>0</sup>)*. (b)* q<sup>1</sup> = b<sup>0</sup>*. (c)* q<sup>2</sup> = ⋀<sub>i∈[1,m]</sub> (b<sup>i</sup> → b<sup>i−1</sup>)*. (d)* q<sup>3</sup> = ⋀<sub>i∈[1,m]</sub> (b<sup>i</sup> → (L<sup>i</sup> ≡ F(I<sup>i−1</sup>, L<sup>i−1</sup>)))*. (e)* q<sup>4</sup> = ⋀<sub>i∈[1,m]</sub> ((¬b<sup>i</sup> ∧ b<sup>i−1</sup>) → (R(L<sup>i−1</sup>) ∧ S'(N)))*. (f)* q<sup>5</sup> = b<sup>m</sup> → ⋁<sub>i∈[0,n)</sub> ((⋀<sub>j∈[i,n−1)</sub> ¬e<sup>j</sup>) ∧ (⋀<sub>j∈[0,i)</sub> e<sup>j</sup>) ∧ Q'(J<sup>0</sup>, M ∩ (L<sup>i</sup> ∪ ⋯ ∪ L<sup>i+n−1</sup> ∪ N)))*. (g)* q<sup>6</sup> = ⋀<sub>i∈[1,n−2]</sub> (e<sup>i</sup> → e<sup>i−1</sup>)*. (h)* q<sup>7</sup> = ⋀<sub>i∈[0,n−2]</sub> (e<sup>i</sup> → b<sup>n+i</sup>)*. (i)* q<sup>8</sup> = ⋀<sub>i∈[0,n−2)</sub> ((¬b<sup>m</sup> ∧ b<sup>n+i</sup>) → e<sup>i</sup>)*.*

The bits b<sup>i</sup> encode initialization, so that inductiveness is ensured even when not all copies are initialized. The n − 1 bits e<sup>i</sup> determine which set of n consecutive original states forms an unfolded state (a state in the unfolded circuit). This information determines on which copies the unfolded property needs to hold and when to transition the latches in N (the part of the witness circuit added by the backend model checker), once every n steps.

**Theorem 4.** *Given a circuit* C = (I, L, R, F, P) *with a phase number* n ∈ N<sup>+</sup>*, its unfolded circuit* C' = (I', L', R', F', P') *with a witness circuit* W' = (J', M', S', G', Q')*. Let* W = (J, M, S, G, Q) *be the circuit constructed as in Definition 11. Then* W *is a witness circuit of* C*.*

After the witness circuit has been folded, the same construction from [57] can be used to construct the backward witness. With that, the pipeline outlined in Fig. 4 is completed. If phase abstraction is the first technique applied by the model checker, a final witness is obtained. Otherwise, further witness processing steps still need to be performed. An example of the entire process is illustrated in Fig. 6.

# **6 Implementation**

In this section, we present mc2, a certifying model checker implementing phase abstraction and IC3. We implemented our own IC3 since no existing model checker supports reset functions or produces certificates in the desired format. We used fuzzing to increase trust in our tool: the version of mc2 used for the evaluation was tested on over 25 million randomly generated circuits [14] in combination with random parameter configurations. All produced certificates were verified.

To extract periodic signals we perform ternary simulation [52], using a forward-subsumption algorithm based on a one-watch-literal data structure [58] to identify supersets of previously visited cubes, and thereby a set of cube lassos. For each cube lasso we consider every factor of the loop length ω as a phase number candidate n. We also consider every duration d that renders the leftover tail length (δ − d) divisible by n. To keep the circuit sizes manageable, we limit both n and d to a maximum of 8. We call each pair (d, n) an unfolding candidate and compute the corresponding periodic signal (Definition 2) for each latch.
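The candidate enumeration just described can be sketched as follows. This is an illustrative reconstruction of the selection logic (the function name and exact ranges are ours), not mc2's code.

```python
# Enumeration of unfolding candidates (d, n) as described above: n ranges
# over the factors of the loop length omega, d over durations that leave
# a tail length (delta - d) divisible by n, both capped at 8.

def unfolding_candidates(delta, omega, limit=8):
    candidates = []
    for n in range(1, min(omega, limit) + 1):
        if omega % n != 0:
            continue  # n must be a factor of the loop length
        for d in range(0, min(delta, limit) + 1):
            if (delta - d) % n == 0:
                candidates.append((d, n))
    return candidates

# A cube lasso with tail length delta = 3 and loop length omega = 4:
cands = unfolding_candidates(3, 4)
assert (3, 4) in cands                       # forwarding the whole tail
assert all(d <= 8 and n <= 8 for d, n in cands)
```

For each candidate, the circuit is then forwarded, unfolded, and factored, and the candidate yielding the fewest latches is kept.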

For each phase, equivalences are identified by inserting a bit string corresponding to the signs of each latch into a hash table. After identifying the signals, forwarding and unfolding are performed on a copy of the circuit, followed by rudimentary rewriting. Currently the rewriting does not include structural hashing and is mostly limited to constant propagation. Afterwards a sequential cone-of-influence analysis starting from the property is performed. After performing these steps for each candidate, we pick the duration-phase pair that yields a circuit with the fewest latches and give it to a backend model checker.

**Fig. 6.** A concrete example of the model checking and certification pipeline. The original circuit has two latches; the bottom latch alternates and the top copies the previous value of this clock. The property is that at least one bit is unset. Bad states are marked gray. After unfolding with phase number two, the size of the state space is squared. Since the bottom bit is periodic, we can replace it with a constant in each phase (factor). On this circuit, terminal model checking is performed: since the property is already inductive (no transition from good to bad), the circuit serves as its own witness. To produce the final witness circuit, the clock is added back as a latch, and the property is extended with the loop invariant asserting that the clock has the correct value for each phase. Lastly, the circuit is *folded* to match the speed of the original circuit. Three initialization bits b*<sup>i</sup>* are introduced and one additional bit e<sup>0</sup> that determines which pair of consecutive states needs to fulfill the property (0 for the right pair and 1 for the left). This check is only part of the property once full initialization is reached. For this final witness circuit, only the good states are depicted. Also, the first two states represent sets of good states with the same behavior.

We evaluated the preprocessor on three backend model checkers: the open-source k-induction-based model checker McAiger [12] (Kind in the following), the state-of-the-art IC3 implementation in ABC [18], and our own version of IC3, which supports reset functions and produces certificates. Since ABC does not support reset functions, it is not able to model check any forwarded circuit (note that implementing this feature in ABC is also a non-trivial task). Therefore, for this configuration we only ran phase abstraction without forwarding, and thus without temporal decomposition.

Our IC3 implementation in mc2 does feature cube minimization via ternary simulation [25]; however, it is missing proof-obligation rescheduling. In fact, we currently use a simple stack of proof obligations as opposed to a priority queue. Despite using one SAT solver instance per frame, we also do not feature cones-on-demand, but instead always translate the entire circuit using the Tseitin encoding [55].

Lastly, we also modified the open-source implementation of Certifaiger [21] to support certificates based on restricted simulation. For a witness circuit C' of C, the new certificate checker encodes the following six checks as combinational AIGER circuits and then uses aigtocnf to translate them to SAT:


The first three checks are unchanged and encode the standard check for P' being an inductive invariant in C'. Since P' is both the inductive invariant and the property we are checking, this check can technically be omitted. However, in our implementation, the inductiveness checker is a component independent of the simulation checker and would also work in scenarios where the inductive invariant is a strengthening of the property in C'.

### **7 Experimental Evaluation**

This section presents experimental results for evaluating the impact of preprocessing on the different backends, as well as the effectiveness of our proposed certification approach. The experiments were run in parallel on a cluster of 32 nodes. Each node was equipped with two 8-core Intel Xeon E5-2620 v4 CPUs running at 2.10 GHz and 128 GB of main memory. We allocated 8 instances to each node, with a timeout of 5000 s for model checking and 10 000 s for certificate checking. Memory was limited to 8 GB per instance in both cases.

The benchmarks are obtained from HWMCC2010 [13], which contains a good number of industrial problems. As we observe from the experiments, preprocessing is usually fast: ignoring one outlier in our benchmark set, it completes within an average of 0.07 seconds and evaluates no more than 17 unfolding candidates per benchmark. Interestingly, for the outlier "bobsmnut1", 3019 unfoldings are computed for 179 different cube lassos within 34 seconds.

Table 1 presents the effect of our preprocessing on different backends, further illustrated in Fig. 7. Our preprocessor was able to improve the performance of the sophisticated IC3/PDR implementation in ABC, allowing us to solve five more instances, all from the intel family. For each benchmark from this family, our heuristic computed an optimal phase number of 2. A likely explanation for this is that real-world industrial designs tend to contain strict two-phase clocks [6]. The positive effect of phase abstraction is also clear in combination with the k-induction (Kind) backend. Circuit forwarding provides a further improvement that is especially notable on the prodcell benchmarks. These also illustrate how forwarding enables more successful unfolding. Without forwarding, preprocessing only unfolds 61 out of the 818 benchmarks with an average phase number of 2; with forwarding, 152 circuits are unfolded with an average of 4.

Even though our prototype implementation of IC3 is missing a number of important features present in ABC, it still solves a large number of benchmarks. **Table 1.** We present the effect of preprocessing in combination with different backend engines on model checking time. We compare no preprocessing to only phase abstraction without forwarding (PA) and full preprocessing (Full). Note that ABC does not support reset functions and can therefore not be combined with full preprocessing. For each model we present the phase number without forwarding n for PA and the duration d and phase number n corresponding to Full. Models where the property holds are marked as safe. The first two rows present the number of solved instances and the PAR2 score [29] over all 818 benchmarks. The table shows all instances where preprocessing had either a positive or negative impact on model checking success, with the exception of those instances rendered unsolvable for our IC3 implementation by forwarding.


However, as opposed to ABC, it does lose a number of benchmarks with phase abstraction. This can be explained by the lack of sophisticated rewriting that could exploit the unfolded circuit's structure. The addition of forwarding is highly

**Fig. 7.** Comparison of model checking performance. We compare four pairs of configurations; the three backend engines with and without phase abstraction (with fixed duration 0) and for Kind we present the effect of additionally allowing forwarding. The size of the markers represents n + d. The dots represent instances where the preprocessing heuristic decided not to alter the circuit. The red lines mark the timeout of 5000 s. Markers beyond that line represent instances solved by one configuration but not the other. (Color figure online)

detrimental to performance, losing 115 instances. This is due to our implementation following the PDR design outlined in [25]: it requires any blocked cube not to intersect the initial states after generalization. If only a single reset state exists, this check is linear in the size of the cube; in the presence of reset functions, however, it is implemented with a SAT call. Besides being slower, the main problem is that the reset-intersection check is then also more likely to block generalization. On the 115 lost benchmarks, generalization failed 96% of the time, while it only failed in 1.8% of the cases without forwarding. We leave the optimization of our IC3 implementation in the presence of reset functions to future work.

Figure 8 displays certification results for mc2 in comparison to model checking time. IC3 provides certificates that are easily verifiable, as confirmed by our experiments with a cumulative overhead of only 3%. The addition of phase abstraction (i.e., including constructing n-folded witnesses as in Fig. 4, without the backward witness construction) does not bring significant additional overhead. When forwarding is allowed, the certification overhead increases to 10%. The run time of certificate generation and encoding to SAT is negligible for all configurations. The certification time is dominated by the SAT solving time for the transition (Definition 7.2) and consecution checks. Overall, this is a significant improvement over related work [57], which reported 1154% overhead on the same set of benchmarks using a k-induction engine as the backend.

**Fig. 8.** Certification vs. model checking time for three configurations of our IC3 engine. The legend shows the cumulative overhead of including certification for all solved instances. The size of the markers represents n+d. The dots represent instances where preprocessing did not alter the circuit.

# **8 Conclusion**

In this paper, we present a certificate format that can be effectively validated by an independent certificate checker. We demonstrate its versatility by applying it to an extended version of phase abstraction, which we introduce as one of the contributions of this paper. We have implemented the proposed approach in a new certifying model checker, mc2. The experimental results on HWMCC instances show that our approach is effective and yields very small certification overhead, a vast improvement over related work. Our certificate format allows for smaller certificates and is designed to be usable in hardware model checking competitions as a standardized format.

Beyond increasing trust in model checking, certificates can be utilized in many other scenarios. For instance, such certificates will allow the use of model checkers as additional hammers in interactive theorem provers such as Isabelle [48] via Sledgehammer [50], with the potential of significantly reducing the effort needed for using theorem provers in domains where model checking is essential, such as formal hardware verification, our main application of interest. Currently, Sledgehammer allows encoding the current goal in Isabelle for automatic theorem provers or SMT solvers and then calls one of many tools to solve the problem. The tool then provides a certificate which is lifted to a proof that can be replayed in Isabelle. We plan to add our model checker as an additional hammer to increase the automatic proof capability of Isabelle. This further motivates us to investigate certificate trimming via SAT proofs.

**Acknowledgements.** This work is supported by the Austrian Science Fund (FWF) under the project W1255-N23, the LIT AI Lab funded by the State of Upper Austria, the ERC-2020-AdG 101020093, the Academy of Finland under the project 336092 and by a gift from Intel Corporation.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Verifying a Realistic Mutable Hash Table Case Study (Short Paper)**

Samuel Chassot(B) and Viktor Kunčak

EPFL, Lausanne, Switzerland *{*samuel.chassot,viktor.kuncak*}*@epfl.ch

**Abstract.** In this work, we verify, using the Stainless program verifier, the mutable LongMap from the Scala standard library, a hash table using open addressing within a single array. As an executable specification, we write an immutable map based on a list of tuples and verify it against the mathematical definition of a map. We then show that LongMap's operations correspond to operations of this association list. To express the resizing of the hash table array, we introduce a new reference-swapping construct in Stainless. This allows us to apply the decorator design pattern without introducing aliasing. Our verification effort led us to find and fix a bug in the original implementation that manifests for large hash tables. Our performance analysis shows the verified version to be within a factor of 1.5 of the original data structure.

**Keywords:** Formal verification · Hash table · LongMap · Scala

# **1 Introduction**

With the improvements in effectiveness and expanding user base of proof assistants such as Isabelle/HOL [22] and Coq [27], we are witnessing systematic verification of many purely functional data structures. The verification of these data structures is highly effective using those tools. In the Scala language ecosystem, such verification efforts were carried out using the Stainless verifier [13] and its predecessor Leon [19]. However, verification of mutable data structures remains more challenging. As an example for hash table validation on the JVM platform, a recent attempt [8] provided a proof with interactive steps and an incomplete proof based on bounded model checking for one function. We consider such efforts very valuable. At the same time, our verification led us to discover a bug that bounded model checking would have likely missed due to the large arrays required. This illustrates the limitations of bounded checks and the need for complete formal verification.

In this work, we verify a data structure from the Scala standard library: the mutable LongMap[V], a hash table with keys of type Long and values of a generic type V, implemented with open addressing (with all data stored in the arrays). We verify it using Stainless, a verification framework for a subset of Scala. To our knowledge, this is the first verified mutable map in Scala and the first verified hash table with open addressing and non-linear probing. Our implementation closely follows the existing implementation of the Scala library [26], which was implemented with efficiency in mind and withstood the test of usability. This is the fastest hash table implementation we know of in the Scala ecosystem. Our experience helped us further assess the use of Stainless for imperative code, following recent verification of the QOI compression algorithm [5] and file system components [12]. Our paper includes the following contributions:


### **1.1 Related Work**

Hash tables have been of interest in verification from the early days of the field. Guttag [11] explores the use of algebraic specifications for reasoning about hash tables, though without a formal connection to executable implementations. A hash table is one of the case studies [17] in the Jahob verification system [18,29]. The version in Jahob does not use open addressing but separate chaining with linked list buckets. Furthermore, that case study uses, as an unverified assumption, the fact that the hash function is pure (total, without side effects, terminating, and deterministic). The Eiffel2 project offers a collection of verified data containers, impressive in its diversity [23]. They implemented and verified a hash table implementation using chaining. These containers are, however, simpler in their implementations than what appears in the Scala and Java standard libraries. We could not explore this collection in more depth because the tools used are unavailable. De Boer et al. verified JDK's IdentityHashTable, based on open addressing and *linear* probing, in their case study [8]. The verification was done using KeY [1] and JJBMC [4], both accepting JML specifications. They notably did not manage to provide a deductive proof for the remove method and one of its auxiliary methods, but instead used bounded model checking for a map of up to four elements. The KeY deductive proofs required interactive steps for the more complex methods, up to 1'655 for the put method. Hance et al. also proposed techniques to verify distributed systems interacting with an asynchronous environment, in particular file systems [14]. In this work, they developed and verified a hash table with open addressing and *linear* probing in Dafny. They implemented two versions of the hash table, one immutable and one mutable. This separates the functional correctness and correct heap manipulation proofs but requires implementing the hash table twice.

# **2 LongMap: From Scala Library to Stainless**

A LongMap[V] (called LongMap in this work) is a data structure implementing a *map* behavior with keys of type Long (signed 64-bit machine integers) and values of generic type V. The mutable LongMap of the Scala standard library [26] is a hash table employing open addressing and non-linear probing.

We implement a subset of the original LongMap interface (outlined in Sect. 3). This subset corresponds to the functions implementing the map functionality (we omit functions specific to the Scala collections hierarchy). The apply function returns the value stored for a given key, or a default value if the key is absent. The remove, update (to add or update pairs), and repack (to resize and rebalance the map) functions return a Boolean value indicating whether the operation succeeded.

The keys and values are stored in two arrays called \_keys and \_values respectively. Both are of size *N* = 2<sup>*n*</sup> for some 3 ≤ *n* ≤ 30. The index of a given key is computed using a hash function. The corresponding value is stored in the second array at the same index as its key. We define mask = *N* − 1.
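Since *N* is a power of two, reducing a hash to an index modulo *N* amounts to a bitwise AND with mask. A small illustrative sketch (in Python, for exposition only; the library itself is Scala):

```python
# Illustrative sketch: with N = 2**n, (h & mask) equals h % N for
# non-negative h, so index computation avoids an integer division.
n = 6
N = 2 ** n          # table capacity, here 64
mask = N - 1        # binary 0b111111

def index_of(hash_value: int) -> int:
    """Map a (non-negative) hash to a slot index in [0, N)."""
    return hash_value & mask

for h in (0, 1, 63, 64, 12345, 2**31 - 1):
    assert index_of(h) == h % N
```

This is exactly why the implementation stores mask = *N* − 1 rather than *N* itself.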

There are two special values in \_keys: 0 and Long.MinValue. The value 0 indicates a free spot, while Long.MinValue is a *tombstone* value, indicating that a key was removed at this spot.

We use open addressing with non-linear probing to resolve collisions. Following the original Scala implementation [7,26], the probing function is *index*<sub>*x*+1</sub> = (*index*<sub>*x*</sub> + 2 · (*x* + 1) · *x* − 3) & *mask*, resulting in cubic index growth. Our verification is independent of the particular probing function but checks that the implementation is pure (i.e., deterministic, total, terminating, and without side effects), free of runtime errors, and returns an index within range.

All operations rely on two elementary ones: 1) looking for a key (seekEntry), and 2) looking for a key or an empty spot (seekEntryOrOpen). These two operations use non-linear probing, with the special values 0 and Long.MinValue in \_keys. As an example, update(k: Long, v: V) starts out by computing i = seekEntryOrOpen(k). If k is at index i, it writes \_values(i) = v; if the function returns an open spot, it writes \_keys(i) = k and \_values(i) = v.
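To illustrate the interplay of probing, tombstones, and the two seek operations, here is a simplified Python model (our sketch with our own names, not the verified Scala code; it ignores the special handling of the keys 0 and Long.MinValue themselves, and already includes the bounded iteration counter described in Sect. 2.1):

```python
FREE = 0             # sentinel: empty slot
TOMBSTONE = -2**63   # models Long.MinValue, left behind by remove

class ToyLongMap:
    """Simplified model of an open-addressing map with non-linear probing."""
    def __init__(self, n=4):
        self.N = 2 ** n
        self.mask = self.N - 1
        self._keys = [FREE] * self.N
        self._values = [None] * self.N

    def _probe(self, index, x):
        # Cubic probing step: index' = (index + 2*(x+1)*x - 3) & mask
        return (index + 2 * (x + 1) * x - 3) & self.mask

    def _seek_entry_or_open(self, k):
        """Return ('found', i), ('open', i), or ('full', -1)."""
        index = k & self.mask
        first_open = -1
        for x in range(self.N):        # bounded counter ensures termination
            key = self._keys[index]
            if key == k:
                return ('found', index)
            if key == FREE:            # prefer an earlier tombstone, if seen
                return ('open', index if first_open < 0 else first_open)
            if key == TOMBSTONE and first_open < 0:
                first_open = index
            index = self._probe(index, x)
        return ('full', -1)

    def update(self, k, v):
        kind, i = self._seek_entry_or_open(k)
        if kind == 'full':
            return False
        self._keys[i], self._values[i] = k, v
        return True

    def apply(self, k, default=None):
        kind, i = self._seek_entry_or_open(k)
        return self._values[i] if kind == 'found' else default

    def remove(self, k):
        kind, i = self._seek_entry_or_open(k)
        if kind != 'found':
            return False
        self._keys[i], self._values[i] = TOMBSTONE, None
        return True
```

Note how remove leaves a tombstone rather than a free slot, so that later seeks keep probing past it; seekEntry in the real implementation is the same loop without the open-slot tracking.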

### **2.1 Adapting for Verification**

Next, we present the changes we made to the original code to comply with the supported subset of Scala, improve the SMT solver performance, make writing specifications easier, and simplify termination checking.

**Tail recursion (to ease verification)**. We replace while loops with tail-recursive functions. Stainless can perform this transformation internally, but we have better control over specifications if we transform the source manually. For example, a loop invariant must hold both before and after the loop, making it impossible to state different pre- and post-conditions. The Scala compiler transforms tail-recursive functions back into loops during compilation, so no performance is lost.

**Loop counters (to prove termination)**. We introduce a counter and a condition that stops while loops (implemented as tail recursion) in seekEntry and seekEntryOrOpen after a fixed number of iterations. We need this counter as we do not know whether this probing function covers the space of all indices. Moreover, it allows the proof to be agnostic to the probing function. It has a negligible impact on performance, as shown in Sect. 4.

**Data representation (for SMT performance)**. The original implementation uses the MSBs (most significant bits) of the index returned by the seeking functions to indicate whether the index points to the key, a 0, or a tombstone. We replace this with an ADT for better code readability and an improved verification experience, as bitwise operations are often slow in SMT solvers.

**Typing and initialization of arrays (to comply with subset)**. In the original implementation, the array storing values (\_values) is an Array[AnyRef], containing null by default, and using casts to store and access values. In our verified implementation, \_values contains boxed values because Stainless does not support null, and the Array.fill function (used to instantiate new arrays) does not support generically typed arrays. The boxing is implemented using case classes (i.e., ADTs).

**Refactoring (to ease verification)**. We split the implementation into two classes, following the decorator design pattern (DP), as detailed in Sect. 3.1.

# **3 Specification and Verification**

We first implement ListLongMap, an immutable map backed by a strictly ordered list of pairs (Long, V), and verify it against the mathematical specification of a map. It serves as the executable specification of the mutable LongMap: we specify the mutable LongMap as behaving like the corresponding ListLongMap. A ghost method map() (not executed at runtime) of LongMap returns an instance of ListLongMap with the same content and is used in contracts. For example, update is specified as follows: old(this).map() + (k, v) == this.map(). Figure 1 shows the LongMap interface and specification. Postconditions, expressed using ensuring calls, are lambda functions taking the return value as a parameter (i.e., res). The method valid is the data structure's representation invariant, stating, among other things, that inserted elements can be found when subsequently searching with the same probing function. Table 1 shows the lines of code for the program, specification, and proofs for both maps.
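The role of the executable specification can be illustrated with a Python sketch of a ListLongMap-style structure (ours, not Stainless syntax): a strictly key-ordered list of pairs whose operations are trivially correct, against which the hash map's behavior can be stated.

```python
# Sketch of a ListLongMap-style executable specification: a strictly
# key-ordered list of (key, value) pairs.
def list_map_add(pairs, k, v):
    """Insert or overwrite (k, v), keeping keys strictly sorted."""
    out = [(key, val) for key, val in pairs if key != k]
    out.append((k, v))
    out.sort(key=lambda kv: kv[0])
    return out

def list_map_get(pairs, k, default=None):
    for key, val in pairs:
        if key == k:
            return val
    return default

# A contract of the shape old(this).map() + (k, v) == this.map() can be
# checked on this model directly:
spec = []
spec = list_map_add(spec, 5, 'x')
spec = list_map_add(spec, 2, 'y')
spec = list_map_add(spec, 5, 'z')     # overwrite existing key
assert spec == [(2, 'y'), (5, 'z')]
assert list_map_get(spec, 5) == 'z'
```

Because insertion rebuilds a sorted duplicate-free list, the specification is correct by construction, which is what makes it a usable reference point for the mutable implementation.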

### **3.1 Decorator Design Pattern for Modular Proofs**

Following the decorator DP, we split the LongMap implementation into two classes to better modularize the proof. First, LongMapFixedSize implements

**Fig. 1.** Mutable LongMap interface and specification (we omit preconditions in this figure; where present, they only check the invariant *valid*).

the LongMap specification depicted in Fig. 1 without resizing (with arrays of a given fixed length). Then, we implement the LongMap class as a decorator of LongMapFixedSize. It implements the same interface and adds the resizing operation (the repack function). Being a decorator, this class forwards all operations to an underlying instance of LongMapFixedSize, except for the repack function. A key observation about the original implementation of repack is that it inserts all pairs in much the same way as the update function; only some checks are omitted because the array is assumed to be fresh (containing no tombstone values and, initially, no keys). This observation allows us to use update to implement repack without significantly impacting performance, while simplifying the proof.

# **3.2 Swap Operation for More Expressive Unique Reference**

As discussed in Sect. 3.1, the LongMap class relies on an underlying instance of the LongMapFixedSize class. The underlying instance must be replaced by a new one during calls to repack. The repacking process first computes the new array size, then creates a new instance of LongMapFixedSize with this size, inserts all pairs, and finally replaces the current underlying instance with this new one. Aliasing appears during this replacement, yet Stainless disallows it. We can, however, observe that aliasing is not actually needed, because the reference to the newly created instance is not accessed after the replacement. We thus introduced a *swap* operation [15] into the Stainless verifier. In addition to array element swap [12], Stainless now offers a Cell class that encapsulates a mutable variable and provides a swap operation to exchange the contents of two cells. This construct extends the expressiveness of Stainless without the need for aliasing and enables the implementation of a resizable data structure on top of a fixed-size one.


**Table 1.** Lines of code for the program, specification, and proof. We use many ghost functions to express induction proofs. When a function has many arguments, we typically typeset each argument on a separate line, which contributes to the line counts.

### **3.3 Finding and Confirming a Bug in the Original Implementation**

During the verification, we discovered that the repack function does not satisfy the specification stated in its documentation. The documentation states that the map can accommodate up to 2<sup>30</sup> values (preferably not more than 2<sup>29</sup>) [25]. However, a number of keys greater than or equal to 2<sup>28</sup> makes repack loop forever. The bug arises in the computation of the new mask and is an integer overflow: a mask candidate is reduced while it is > \_size \* 8 (where \_size is the number of keys stored in the array). However, if \_size \* 8 overflows, i.e., \_size ≥ 2<sup>28</sup>, the mask candidate is reduced below \_size. The new array then cannot accommodate all the keys. We fix the bug by modifying the loop condition and then prove that the function always returns a large enough and valid mask. Despite the small scope hypothesis [16] and claims in [8], we do not expect that bounded model checking would have discovered this bug, given that it occurs only after inserting so many key-value pairs.
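The overflow is easy to reproduce with 32-bit arithmetic. The sketch below (our reconstruction of the bug pattern in Python, not the library source) wraps the multiplication to a signed 32-bit value, as Scala's Int does; the step guard is added only so the demonstration terminates:

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, as Scala's Int does."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def buggy_new_mask(size, max_steps=64):
    """Reconstruction of the bug pattern: shrink a mask candidate while it
    exceeds size * 8, with the product computed in 32-bit arithmetic.
    For size >= 2**28, size * 8 wraps to a negative value, so the real
    loop never terminates; max_steps is our guard for demonstration only."""
    mask = 2**30 - 1
    steps = 0
    while mask > to_int32(size * 8):
        if steps >= max_steps:
            return mask          # the unguarded loop would spin forever
        mask >>= 1
        steps += 1
    return mask

# Below the overflow threshold the candidate stays large enough ...
assert buggy_new_mask(2**27) >= 2**27
# ... but at 2**28 keys the product wraps negative and the mask collapses.
assert to_int32(2**28 * 8) == -2**31
assert buggy_new_mask(2**28) < 2**28
```

With a negative right-hand side, every non-negative mask satisfies the loop condition, which matches both symptoms reported above: the loop never exits, and any candidate it could return is far below \_size.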

### **3.4 SMT Queries**

Stainless generates verification conditions (VCs), which are solved by Inox [28] using SMT solvers (here, CVC4 [3], cvc5 [2], and Z3 [9]) and incremental function unrolling. A sequence of calls to SMT solvers therefore happens for each VC, and the solvers run in parallel in a race. The generated SMT queries [6] use algebraic datatypes and sets; they do not contain a *set-logic* directive. Only the query corresponding to the winning solver is recorded for each VC, as the others might be incomplete. Stainless uses generalized arrays [21] with non-constant default values to encode generic arrays, among other things. This feature is unavailable in CVC4 and not implemented in Stainless for cvc5. Hence, VCs using it can be solved only by Z3. This partially explains why Z3 solves most queries (Fig. 2 (right)).

# **4 Evaluation**

We run the benchmarks on an Ubuntu 20.04.6 LTS server with an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80 GHz, 64 GB RAM.

Verification takes around 400 s when running from scratch (around 100 s when re-running with a populated verification-condition cache [10]). Figure 2 shows the VC solving times (with the cache completely disabled). Most VCs are solved quickly,

**Fig. 2.** Left: VC solving time distribution with the Stainless cache disabled. Right: number of queries solved by each solver. Both use a logarithmic scale.

**Fig. 3.** Lookup of N keys in a map prepopulated with 2<sup>22</sup> pairs (left) (time normalized per operation) and insertion of 2<sup>22</sup> pairs (initial capacity of 16) followed by lookup of N keys (right). Horizontal lines show the average. The black vertical lines show 2<sup>22</sup>. The error bars show the 95% CI. The time on the y-axis is normalized with respect to the first data point of the original map.

with a mean and a median of around 0.16 and 0.1 s, respectively. The cumulative solving time is 1'937 s. Only 3 of the 12'122 VCs need more than 10 s in Stainless, with the slowest at 29 s. When calling the solvers directly on the generated SMT-LIB files, the cumulative solving time falls to 407 s. This likely reflects the overhead of the unrolling in Inox [28], which is especially visible for fast VCs. Figure 2 also shows the distribution of VCs solved by each solver.

We compare the performance of our verified implementation to the original [26], to the general HashMap of the Scala standard library [24], and to an optimized version (denoted Opti) that changes the verified implementation to use Array[AnyRef] for \_values. We use Long as the type of stored values. We consider three scenarios: (1) looking up keys in a pre-populated map; (2) populating the map first, then looking up keys; and (3) populating the map with all pairs, removing half of the keys, and inserting all pairs again before looking up keys. Results are in Fig. 3 and Fig. 4. Our verified LongMap is around 1.5× slower than the original implementation for lookups only, see Fig. 3 (left). The performance gap is similar

**Fig. 4.** Total time to lookup *N* keys and: (**left**) insert 2<sup>15</sup> pairs with initial capacity 2<sup>17</sup>, or (**right**) insert 2<sup>22</sup> pairs, remove 2<sup>21</sup>, and insert 2<sup>22</sup> again, with initial capacity of 16. The black vertical line shows 2<sup>15</sup> (left) and 2<sup>22</sup> (right). The error bars show the 95% CI. The time on the y-axis is normalized with respect to the original map.

when taking the population process into account (Fig. 3 (right)). We argue that this is acceptable. Indeed, the LongMap is the fastest map we know in the Scala ecosystem. As shown by Fig. 4, the performance of our verified implementation is comparable to the Scala HashMap (better in some scenarios).

**Consequences of Adapting for Verifiability.** To understand the impact of pointer indirection in the \_values array (Sect. 2.1), we modified our verified implementation to use Array[AnyRef] like the original (abandoning the proof). The results are shown as Opti in the figures, with performance close to the original, indicating that this indirection was indeed responsible for the overhead. Similarly, in our version, creating the \_values and \_keys arrays relies on Array.fill, which writes all values and is slower than constructing an array of nulls as in the original implementation. Therefore, the verified repack operation is slower than the original, see Fig. 4 (right). As shown by Fig. 4 (left), without resizing, the performance is similar to HashMap, suggesting the impact of Array.fill. Calls to repack are infrequent, so this performance loss is limited. Finally, as witnessed by the performance of the Opti implementation being close to the original, the way the seek functions pass information to the caller and the counter checks for loop termination (Sect. 2.1) have limited performance impact.

## **5 Conclusion**

We verified LongMap from the Scala standard library, a mutable hash table with Long keys, employing open addressing and non-linear probing. This led us to identify and fix a bug in the original library implementation. The performance evaluation of our verified implementation against the original shows a slowdown of around 1.5×. The changes we needed to perform for verifiability point to directions for further improving verification support for efficient Scala constructs.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Booleguru, the Propositional Polyglot (Short Paper)

Maximilian Heisinger(B) , Simone Heisinger , and Martina Seidl

Johannes Kepler University Linz, Linz 4040, Austria {maximilian.heisinger,simone.heisinger,martina.seidl}@jku.at

Abstract. Recent approaches in verification and reasoning solve SAT and QBF encodings using state-of-the-art SMT solvers, as this "makes implementation much easier". The ease of use of these solvers makes dedicated SAT and QBF solvers less visible to users, who may come from different research communities and thus potentially do not exploit the power of state-of-the-art tools. In this work, we motivate the need to build bridges over the widening solver gap and introduce Booleguru, a tool to convert between formats for logic formulas. It makes SAT and QBF solvers more accessible by adopting techniques known from SMT solvers, such as advanced Python interfaces like Z3Py and easily generatable languages like SMT-LIB, and integrating them into our conversion tool. We then introduce a language to manipulate and combine multiple formulas, optionally applying transformations, for quickly prototyping encodings. Booleguru's advanced scripting capabilities form a programming environment specialized for Boolean logic, offering a more efficient way to develop novel problem encodings.

Keywords: SAT · QBF · SMT · DIMACS · QCIR · SMTLIB2 · AIGER

# 1 Introduction

Numerous recent publications with encodings of problems into SAT and QBF do not use SAT or QBF solvers directly [6,16,18]. SMT solvers, often the feature-rich and popular state-of-the-art solver Z3 [8], are used instead, as it "makes implementation much simpler" [17], although no theory reasoning is involved. Z3's programming API and the Common Lisp compatible SMT-LIB standard [3] are well documented and regarded by many as easy to use. While this ease of use leads to wide adoption and fast results, adapting encodings to use less general solving backends that are potentially more efficient for the problem at hand remains hard, e.g., switching from SMT solving to a less general SAT solver. Researchers focus on optimizing their encodings against an SMT solver's performance characteristics, instead of testing them against many different (also

This work was supported by the LIT AI Lab funded by the State of Upper Austria.

© The Author(s) 2024 C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 315–324, 2024. https://doi.org/10.1007/978-3-031-63498-7\_19

Table 1. Formula formats that are optimized (✓), usable (≈), or unusable (✗) for encoding the respective problem


Table 2. File formats and their capabilities.


non-SMT) solvers. We consider the transformation of formulas into conjunctive normal form (CNF), required for SAT and QBF solvers, to be non-trivial, especially for beginners. Seemingly bad intermediate results are discarded, as the effort required to re-encode the problem so that it is solvable with SAT or QBF solvers is too large to be spent during prototyping, effectively forming a *solver gap*. We want to bridge this gap and reduce the friction involved in testing solvers outside the SMT world, without extensive modifications to encoding generators. In this work, we analyze what features are required to build this bridge and develop Booleguru, a polyglot for (quantified) Boolean logic. Our tool is available under the permissive MIT license at:

https://github.com/maximaximal/booleguru

# 1.1 Bridging the Solver Gap with Propositional Logic

In order to build a bridge over the solver gap described above, we first have to identify it. When encoding problems into logic, one also has to decide which format to encode into. The encoding itself is then typically accomplished with an encoder program, which is closely tied to the problem to be encoded. Changing the output format of an encoder involves considerable effort, as encoders are typically tailored to the output formats they were designed to support. As encoders grow in complexity, communities form around them, while still relying on the original output format. The format decision made at the beginning of an encoder's development thus continues to constrain the community that forms around it. We

Fig. 1. Booleguru Architecture, Transformers may be arbitrarily combined.

therefore identify gaps in the solving landscape opening between the different input formats of different solvers for different problems. Table 1 lists a selection of different problems with their associated dominant file formats. Table 2 lists features offered by each format. All of them offer ways to encode propositional logic in CNF, with some of them extending it with quantifiers over variables (∀, ∃). More advanced formats also allow encoding formulas in Non-CNF, i.e., formulas built from expressions and more complex logical operators. If a format supports quantifiers, they may always be added as a prefix to the formula. While every formula with embedded quantifiers can be prenexed to be in such a prenex form, some formats also allow encoding formulas in non-prenex form, extending a format's expressiveness. Some formats also allow structure-sharing, which lets problems reference sub-expressions multiple times, without repeating them. The overlapping capabilities of different formats suggest that a conversion tool has to be able to process all of them, and to serialize complex features into less complex formats, where supported.

### 1.2 Related Work

Other communities have already gone through this bridge-building effort in order to reduce duplicated work and advance their fields. One of the biggest examples is the machine learning community, whose researchers commonly use libraries like PyTorch [15], SciPy [21], and Pandas [20]. This allows others to use new innovations in these libraries, such as newly added learning algorithms or properties in PyTorch, or better storage formats in Pandas.

Multiple conversion tools already exist for QBF [9,13,19]. While all of these tools convert between specific formats, no tool aims to encompass multiple conversion or combination capabilities. Some SMT solvers are able to read multiple input formats [2,7,8], with all of them supporting SMT-LIB2, the format favored by the SMT-Competition [22]. SMT solvers do not, however, offer a way to combine multiple formulas seamlessly. Booleguru fills this niche and provides such a convert-and-combine capability, while also enabling a unique development environment to create new formulas. It enables previously tedious comparisons between different solvers solving similar problems, like SMT and QBF solvers, as shown in Fig. 2.

### 2 Booleguru, the Propositional Multitool

Having discussed the overlaps between file formats and solving communities, we now introduce our conversion tool: Booleguru. As shown in the architecture diagram in Fig. 1, it consists of readers, transformers, and serializers for propositional logic and extensions. Inputs in arbitrary formats may be read, modified, and then serialized in the same or a different file format. This section first describes how Booleguru stores propositional formulas in memory, so that all capabilities described in Table 2 can be provided. We then introduce features intended for working with formulas.

# 2.1 Representing Propositional Formulas in Memory

For Booleguru to be accepted as an efficient tool for working with propositional formulas, the formulas have to be both fast to create and fast to traverse. While this is true for a tool in any problem domain, the lower expressiveness of propositional logic compared to bit-vectors or more complex theories leads to large formulas with many nodes, demanding especially high throughput. For this, formulas are stored in a directed acyclic graph (DAG), enabling structure-sharing. Each node in the DAG is either a variable, a unary operation (negation), or a binary operation (and, or, etc.). A node is stored in a struct of 16 B, which (on most architectures) exactly fits into a single cache line. The remaining bits that fill the cache line are occupied by structural information about the expression and user-writable extra data to be stored within the DAG, which speeds up transformers that need to store temporary data on nodes. Besides the user-writable data, nodes stay constant over the whole execution.

References between nodes are stored as 32-bit unsigned integers, which are resolved relative to the nodes of the whole formula. When creating a formula, a hash table is used to check whether a given node already exists; if it does not, a new node is appended at the back of the node array. References to previous nodes are immutable, which enables cheaply appending new expressions that are composed of others. Expressions may only reference other expressions with IDs smaller than their own; cycles would imply a malformed formula. Traversing this DAG involves no hash lookups or pointer indirections, as every reference can be resolved directly through the child's index in the array. Information about a subexpression is collected during insertion of a new node, removing the need to scan the DAG to check for commonly required structural information. The 32-bit references make traversal very efficient but limit formula sizes to 2<sup>32</sup> − 1 nodes. We will provide a compile-time switch to increase reference sizes in a future version.
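The hash-consed DAG can be sketched as follows (a simplified Python model with our own names; the actual implementation is a compact 16-byte C++ struct per node):

```python
class ExprManager:
    """Hash-consed expression DAG: each structurally distinct node is stored
    once; references are integer indices into the node array, so children
    always have smaller IDs than their parents."""
    def __init__(self):
        self.nodes = []   # node i is a tuple (op, left_id, right_id)
        self.index = {}   # structural hash table: node tuple -> id

    def _mk(self, op, l=-1, r=-1):
        node = (op, l, r)
        existing = self.index.get(node)
        if existing is not None:
            return existing            # structure sharing: reuse the node
        self.nodes.append(node)
        self.index[node] = len(self.nodes) - 1
        return len(self.nodes) - 1

    def var(self, name):  return self._mk(('var', name))
    def not_(self, a):    return self._mk('not', a)
    def and_(self, a, b): return self._mk('and', a, b)
    def or_(self, a, b):  return self._mk('or', a, b)

m = ExprManager()
a, b = m.var('a'), m.var('b')
f1 = m.and_(a, b)
f2 = m.and_(a, b)
assert f1 == f2   # shared subexpression: the same ID is returned
g = m.or_(f1, m.not_(a))
# Acyclicity invariant: every child ID is smaller than its parent's ID.
assert all(l < i and r < i for i, (_, l, r) in enumerate(m.nodes))
```

The hash table is consulted only at creation time; traversal afterwards is pure array indexing, mirroring the design described above.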

### 2.2 Parsing Formulas

We already implemented several parsers:


The readers are mostly implemented using the ANTLR parser generator [14]. While slower than hand-written parsers, the library allowed us to iterate faster during development and to add more input languages with a shared base. Some parsers are hand-written, and performance-critical ones are incrementally optimized into specialized implementations. Each parser produces a reference to an expression inside a shared expression manager. Multiple expressions can then be combined into new ones, regardless of their source format. The command-line parsing is also fully done using an ANTLR parser producing an expression, in order to provide a language for composing formulas from multiple files or scripts.

### 2.3 Transforming Formulas

*Transformers* is the umbrella term for functions that work on one or more expressions passed to them. They may return new expressions built from their inputs and may be chained together. They are implemented either in native C++, Lua [12], or Fennel<sup>1</sup>, with Lua and Fennel scripts possibly supplied at runtime by a user without re-compiling Booleguru. Several transformations are already implemented, with more being added in future releases. The list below uses the *Colon Operator* (:op) notation, which is transformed into Fennel function calls during CLI parsing. Each such transformation can be supplied to the Booleguru CLI, where it binds strongly (more strongly than binary operators) to the expression preceding it. Generating transformers, which take no expressions as input, are written without a preceding expression, akin to variables.

```
:eliminate-implication
```
Converts *a* ⇒ *b* to ¬*a* ∨ *b*.

:eliminate-equivalence

Converts *a* ⇔ *b* to (*a* ∨ ¬*b*) ∧ (*b* ∨ ¬*a*).

:eliminate-xor

Converts *a* ⊕ *b* to (¬*a* ∧ *b*) ∨ (*a* ∧ ¬*b*).

### :distribute-ors

Distributes ∨ into the formula and removes it from the outermost context.

### :distribute-nots

Distributes ¬ into the formula, removing it from the outermost context and from applying to subexpressions.

### :distribute-to-cnf

Distributes operations in the formula until it is in conjunctive normal form (CNF). This often entails an exponential size increase.

<sup>1</sup> https://fennel-lang.org.

### :tseitin

Tseitin-encode a (sub-) expression into CNF, without the exponential blowup involved with :distribute-to-cnf.

### :rename

Rename one or more variables in a (sub-) expression. Can take multiple arguments.

### :solve

Solve a (sub-) expression in CNF. Returns a conjunction of variables.

### :quanttree

Draw a formula's quantifiers.

### :unquantified

Print all variables that are not quantified by some quantifier.

### :prefixtract

Extract statistics from a formula's quantifier prefix (if it is in prenex form).

Fig. 2. Z3 and selected QBF solvers solving the QCIR track of QBFGallery 2023

:quantlist

Print the quantifier prefix (if available), merging multiple quantifiers of same type into sorted blocks.

# :counterfactuals

Generate a counterfactual (parameterized).

:eqkbkf

Generate a KBKF combined with an equality formula (parameterized).

:dotter

Outputs the in-memory DAG of the formula as a .dot file that can be processed using *GraphViz*.

# :assignment-tree

Expands the quantifier prefix into a tree over all possible variable assignments and solves each leaf assignment using a SAT solver. Outputs a .dot file.
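To illustrate how such transformers work and compose, here is a minimal Python sketch (ours, not Booleguru code) of :eliminate-implication followed by :distribute-nots over a tuple-based AST:

```python
# Expressions as nested tuples: ('var', name), ('not', e),
# ('and', l, r), ('or', l, r), ('impl', l, r).

def eliminate_implication(e):
    """a => b becomes !a | b, applied recursively."""
    op = e[0]
    if op == 'var':
        return e
    if op == 'not':
        return ('not', eliminate_implication(e[1]))
    l, r = eliminate_implication(e[1]), eliminate_implication(e[2])
    if op == 'impl':
        return ('or', ('not', l), r)
    return (op, l, r)

def distribute_nots(e):
    """Push negations to the leaves (negation normal form) via De Morgan.
    Assumes implications were eliminated first."""
    op = e[0]
    if op == 'var':
        return e
    if op != 'not':
        return (op, distribute_nots(e[1]), distribute_nots(e[2]))
    inner = e[1]
    if inner[0] == 'var':
        return e
    if inner[0] == 'not':                        # double negation: !!a -> a
        return distribute_nots(inner[1])
    dual = 'or' if inner[0] == 'and' else 'and'  # De Morgan's laws
    return (dual, distribute_nots(('not', inner[1])),
                  distribute_nots(('not', inner[2])))

# a => (b & c)  -->  !a | (b & c)
f = ('impl', ('var', 'a'), ('and', ('var', 'b'), ('var', 'c')))
g = distribute_nots(eliminate_implication(f))
assert g == ('or', ('not', ('var', 'a')),
                   ('and', ('var', 'b'), ('var', 'c')))
```

Chaining the two functions mirrors how colon-operator transformations are chained on the Booleguru command line.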

# 2.4 Serializing Formulas

After reading, combining, and transforming input formulas, they can be printed in different output formats. For this, several serializers have been developed, which are listed below. (Q)DIMACS relies on the provided Tseitin encoding by default, but one may use other methods that arrive at an expression tagged as being in CNF. This made Booleguru a helpful tool in the QBFGallery 2023<sup>2</sup>.


# 3 Booleguru, the Programming Environment

During development of new problem encodings or crafted formulas, there is usually a step where an encoding tool or a formula generator is written [1,4,11]. In our experience, these tools often rely on similar primitives:


<sup>2</sup> https://qbf23.pages.sai.jku.at/gallery/.

– Write the formula in the desired output format.

The Lua, Fennel, and Python APIs offered by Booleguru abstract over multiple output formats and over the concept of writing formulas into files. When writing a generator for a new formula, the Booleguru primitives let the user work directly with the formula's AST, instead of having to generate syntax describing the AST in the target language. This makes generators more amenable to change, as they are always composed of (nested) functions, each generating sub-expressions.

In addition to using Booleguru as a first-class execution environment for formula generators, it may be used to reduce SMT encodings developed using Z3Py to a SAT or QBF solving problem. Booleguru optionally generates a Python module that emulates the popular interface of Z3.

*Lua and Fennel Interface.* The Lua and Fennel interfaces are both accessible through an embedded interpreter. Lua and Fennel scripts that are already distributed as a part of Booleguru are compiled ahead of time by LuaJIT<sup>3</sup>. This makes both initialization and execution of scripts as efficient as possible in the Release build of Booleguru. User-supplied scripts are compiled at runtime using LuaJIT.

*Python Interface.* In addition to the specialized Lua and Fennel interfaces, Booleguru provides the pybooleguru interface, which is directly modelled after the widely used *Z3Py* Python library. We observed that when Z3Py is used during development of a new encoding, the jump from the SMT solver Z3 to more fundamental SAT or QBF solving becomes harder. This interface is intended to be a drop-in replacement for Z3Py, enabling the conversion of a complex Z3-specific encoder written in Python into an encoder capable of producing additional output in a different format. Python scripts may be read as inputs, or a script may import pybooleguru instead of z3py.
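The drop-in idea can be sketched as follows: an encoder written against Z3Py-style constructor names (Bool, And, Or, Not) does not care which module provides them. The tiny stand-in classes below are our own illustration of that principle, not the actual pybooleguru code:

```python
# A minimal stand-in for a Z3Py-style constructor API. Encoder code written
# against these names can be retargeted by swapping which module defines
# them (z3 vs. a pybooleguru-like module); the class here is our own sketch.
class E:
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __repr__(self):
        if self.op == 'var':
            return self.args[0]
        if self.op == 'not':
            return '!' + repr(self.args[0])
        sep = ' & ' if self.op == 'and' else ' | '
        return '(' + sep.join(map(repr, self.args)) + ')'

def Bool(name): return E('var', name)
def Not(a):     return E('not', a)
def And(*xs):   return E('and', *xs)
def Or(*xs):    return E('or', *xs)

# The encoder body stays unchanged regardless of the backing module:
a, b = Bool('a'), Bool('b')
formula = And(Or(a, b), Not(And(a, b)))   # exactly one of a, b
print(formula)                            # -> ((a | b) & !(a & b))
```

The real interface additionally serializes the built AST to the chosen output format; only the construction side is modelled here.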

*C++ Interface.* In addition to the scripting interfaces, the C++ interface itself may also be used. Functions are provided to easily build expressions using C++, which can be useful when developing new tools in systems languages.

### 3.1 Command-Line Interactive Interface

The command-line interface of Booleguru is also considered a programming environment, as it seamlessly merges a grammar for propositional logic in infix notation with the Scheme-based Fennel, a programming language written in prefix-notation. Each Fennel expression has to return a logical expression, which it builds using the provided primitives. By combining transformations and working with expressions, new functionality can be implemented using the CLI alone. For example, for the formula (*a* ∧ *b*) ∨ (*a* ∧ ¬*b*), both solutions can be extracted using the CLI:

<sup>3</sup> https://luajit.org/luajit.html.

```
$ booleguru test.boole :solve
a & !b
$ booleguru test.boole ":(b-and ** (b-not (solve **)))" :solve
a&b
```
It can also be used to combine formulas while renaming e.g. *a* to *aa*:

```
$ booleguru "test.boole :rename@a@aa <-> test.boole"
(#aa ?b (aa & b)) <-> (#a ?b (a & b))
```
The CLI can also invoke parameterized custom binary operators. The # comment below shows the contents of the file my-bin.fnl in the current working directory.

```
# (lambda my-bin [m] ((. _G m) (b-and *l* *r*) (b-or *l* *r*)))
$ ./booleguru a ::my-bin@equi b
a & b <-> a | b
```
### 3.2 Developing Booleguru

Our tool is implemented in C++, with some helpers provided to ease development. The build system is realized using CMake, which makes Booleguru easily embeddable into other projects. All modules have test suites to be run during development. They test basic features like the expression tree, but also transformers and serializers. Defining new tests is easy, and running all tests is quick and parallelizable. The build system offers a fast Release build, a slower Debug build without optimization options, and a Sanitized build with address and undefined-behavior sanitizers enabled. If available, LuaJIT is used for executing Lua and Fennel scripts; otherwise, the regular Lua distribution serves as a fallback. Booleguru also natively supports being built as a WebAssembly executable, making it runnable in browsers, including Lua and Fennel scripting.

*Embedded Fuzzer.* Transformers that process operations may also be fuzzed using the LLVM *libFuzzer* integration. Booleguru has to be built in fuzzing mode, which creates the specialized booleguru-fuzzer binary. The command line has to be provided via the environment variable BOOLEGURU\_ARGS, with a fuzz file that serves as the injection point for fuzzed inputs. Arbitrary transformations may then be performed on the expressions, which are called with every iteration of the fuzzer. Booleguru's embedded mutator (inspired by the mutator of Google Protocol Buffers) randomly creates arbitrarily many input structures until unexpected stops are encountered. The fuzzing capability was used extensively during development of the transformers. The resulting inputs can be displayed in Limboole format using the booleguru-print-corpus tool.

# 4 Conclusion

We developed the propositional polyglot Booleguru, which can be used to convert between several widely used logic formats, to transform or combine formulas, and to develop new encodings more efficiently. We discussed the requirements our tool has to fulfill and introduced the implementation based on these requirements. Finally, we explained how Booleguru is used to generate new encodings, using the embedded Lua, Fennel, and Python scripting support. Booleguru has already proven itself a valuable tool during the QBFGallery 2023, for revisiting quantifier shifting in QBF [10], and in other projects.

# References



The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Quantifier Shifting for Quantified Boolean Formulas Revisited

Simone Heisinger<sup>1(B)</sup>, Maximilian Heisinger<sup>1</sup>, Adrian Rebola-Pardo<sup>1,2</sup>, and Martina Seidl<sup>1</sup>

<sup>1</sup> Institute for Symbolic Artificial Intelligence, JKU Linz, Linz, Austria {simone.heisinger,maximilian.heisinger,adrian.rebola\_pardo, martina.seidl}@jku.at

<sup>2</sup> Institute for Logic and Computation, TU Vienna, Vienna, Austria

Abstract. Modern solvers for quantified Boolean formulas (QBFs) process formulas in prenex form, which divides each QBF into two parts: the quantifier prefix and the propositional matrix. While this representation does not cover the full language of QBF, every non-prenex formula can be transformed to an equivalent formula in prenex form. This transformation offers several degrees of freedom and blurs structural information that might be useful for the solvers. In a case study conducted 20 years ago, it was shown that the applied transformation strategy heavily impacts solving time. We revisit this work and investigate how sensitive recent QBF solvers are to various prenexing strategies.

Keywords: Quantified Boolean Formulas · Prenexing · Normal Form Transformation

# 1 Introduction

Quantified Boolean formulas (QBFs), the extension of propositional formulas with quantifiers over the Boolean variables, have many applications in formal verification, synthesis, and artificial intelligence [28]. Over the last 25 years, many efficient QBF solvers have been developed [2], with a clear tendency towards QBFs in prenex conjunctive normal form (PCNF). A QBF in PCNF has the form $\mathsf{Q}_1x_1 \ldots \mathsf{Q}_nx_n.\phi$ where $\mathsf{Q}_i \in \{\forall, \exists\}$ and $\phi$ is a propositional formula in conjunctive normal form. In general, encodings do not result in formulas of this structure, because of recursive definitions in the encoding or because of optimizations that try to minimize the scope of variables. A non-CNF structure can originate, for example, from the use of equivalences or XORs in the encoding. Therefore, two transformations are required: (1) *prenexing*, which shifts the quantifiers outside of the formula, and (2) transformation of the quantifier-free formula to CNF. The latter is efficiently achieved by applying the QBF variant of the well-known Tseitin transformation [30] or the optimized Plaisted-Greenbaum transformation [24]. In this work, we focus on prenexing.

© The Author(s) 2024

<sup>\*</sup>This work was supported by the LIT AI Lab funded by the state of Upper Austria and by the Vienna Science and Technology Fund (WWTF) [10.47379/VRG11005].

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 325–343, 2024. https://doi.org/10.1007/978-3-031-63498-7\_20

Without loss of generality, formulas can be assumed to be in negation normal form (i.e. negation symbols only occur in front of variables) and cleansed (i.e. no variable occurs both bound and free, and every variable is quantified at most once). Under these conditions, prenexing is achieved by the following two rules:

$$(\mathsf{Q}x.\varphi)\circ\varphi'\Leftrightarrow\mathsf{Q}x.(\varphi\circ\varphi')\qquad\varphi\circ(\mathsf{Q}x.\varphi')\Leftrightarrow\mathsf{Q}x.(\varphi\circ\varphi')$$

with $\mathsf{Q} \in \{\forall, \exists\}$ and $\circ \in \{\land, \lor\}$. The formula structure imposes an ordering of the quantifiers based on the subformula relation. Quantifiers from independent parts of a formula can be freely ordered. For example, the formula $\forall x.(\exists y.(y \lor x)) \land (\forall z.(z \lor \neg x))$ has prenex forms $\forall x.\exists y.\forall z.\phi$ and $\forall x.\forall z.\exists y.\phi$ where $\phi = (y \lor x) \land (z \lor \neg x)$. Hence, a prenex form is not uniquely determined.

Egly et al. [6] suggested four different prenexing strategies that minimize the number of quantifier alternations in the prefix. Empirically, they showed that the selected prenexing strategy impacts solving performance. In this work, we revisit those prenexing strategies and give a concise formalization, which the original work lacked. We show that the original four strategies disambiguate into six unique prenexing strategies when a minimal number of quantifier alternations is enforced, and we present a tool that implements those strategies. To evaluate the impact of prenexing on modern solvers, we reimplemented the generator for encoding nested counterfactuals and performed extensive experiments with these formulas and with formulas from QBFEval'08, which featured a non-prenex track.

# 2 Preliminaries

The set of *quantified Boolean formulas* $QF(X)$ over variables $X$ is defined as follows: (1) $\top, \bot, x, \neg x, \neg\top, \neg\bot \in QF(X)$ if $x \in X$; (2) $\varphi \lor \varphi', \varphi \land \varphi' \in QF(X)$ if $\varphi, \varphi' \in QF(X)$; (3) if $\varphi \in QF(X)$, then $\forall x.\varphi, \exists x.\varphi \in QF(X)$ with $x \in X$.<sup>1</sup> In a QBF $\mathsf{Q}x.\varphi$ with $\mathsf{Q} \in \{\forall, \exists\}$, the subformula $\varphi$ is called the scope of variable $x \in X$, and $x$ is said to be bound by quantifier $\mathsf{Q}$. If variable $x$ occurs in QBF $\varphi$, but $\varphi$ contains neither $\exists x$ nor $\forall x$, then $x$ is free in $\varphi$. The set of all free variables of a QBF $\varphi$ is denoted by $\mathit{free}(\varphi)$. A QBF without free variables is called closed. In the following, we assume that each variable is in the scope of at most one quantifier and that each variable occurs either free or bound, but not both, in a formula. We call such formulas *cleansed*. The semantics of a QBF $\varphi$ is defined by the interpretation function $[\cdot]_\sigma : QF(X) \to \mathbb{B}$, where $\mathbb{B} = \{\mathbf{1}, \mathbf{0}\}$ and $\sigma : \mathit{free}(\varphi) \to \mathbb{B}$ is an assignment to the free variables of $\varphi$. Then $[\top]_\sigma = \mathbf{1}$, $[\bot]_\sigma = \mathbf{0}$, $[x]_\sigma = \sigma(x)$ for any $x \in X$, and $[\neg v]_\sigma = \mathbf{1} - [v]_\sigma$ for $v \in X \cup \{\top, \bot\}$. Furthermore, $[\varphi_1 \land \varphi_2]_\sigma = \min\{[\varphi_1]_\sigma, [\varphi_2]_\sigma\}$ and $[\varphi_1 \lor \varphi_2]_\sigma = \max\{[\varphi_1]_\sigma, [\varphi_2]_\sigma\}$.
Finally, $[\exists x.\varphi]_\sigma = \max\{[\varphi[x|\top]]_\sigma, [\varphi[x|\bot]]_\sigma\}$ and $[\forall x.\varphi]_\sigma = \min\{[\varphi[x|\top]]_\sigma, [\varphi[x|\bot]]_\sigma\}$, where $\varphi[x|t]$ denotes the QBF obtained by substituting variable $x$ by a truth constant $t \in \{\top, \bot\}$ in $\varphi$. Two QBFs $\varphi, \varphi' \in QF(X)$ are equivalent if $[\varphi]_\sigma = [\varphi']_\sigma$ for any assignment $\sigma$.

<sup>1</sup> For simplicity, we assume negations only in front of variables and truth constants.

Definition 1. *The propositional skeleton* $\varphi_{\rm psk}$ *of QBF* $\varphi \in QF(X)$ *is defined as follows:*

$$
\varphi\_{\rm psk} = \begin{cases}
x & \text{if } \varphi = x, \\
\neg x & \text{if } \varphi = \neg x, \\
\varphi'\_{\rm psk} \circ \varphi''\_{\rm psk} & \text{if } \varphi = \varphi' \circ \varphi'', \circ \in \{\land, \lor\} \\
\varphi'\_{\rm psk} & \text{if } \varphi = \mathsf{Q}V.\varphi', \ \mathsf{Q} \in \{\forall, \exists\} \\
\end{cases}
$$

We say that a QBF $\varphi$ is of the form $\mathsf{Q}V.\varphi'$ for a set of variables $V$ whenever $\varphi = \mathsf{Q}x_1.\cdots.\mathsf{Q}x_n.\varphi'$ for some enumeration $x_1, \ldots, x_n$ of $V$. A QBF $\varphi \in QF(X)$ is in prenex form if it has the structure $\Pi.\phi$, where $\Pi = \mathsf{Q}_1x_1 \ldots \mathsf{Q}_nx_n$ is a quantifier prefix with $\mathsf{Q}_i \in \{\forall, \exists\}$, $x_i \in X$, and $x_i \neq x_j$ for $i \neq j$, and $\phi$ is a propositional formula. If $\phi$ is also in conjunctive normal form (CNF), then $\varphi$ is in *prenex conjunctive normal form* (PCNF). A propositional formula is in CNF if it is a conjunction of clauses. A clause is a disjunction of literals, and a literal is a variable or a negated variable. Obviously, $(\Pi.\phi)_{\rm psk} = \phi$ for a PCNF formula $\Pi.\phi$.
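Definition 1 is directly executable. The following Python sketch (an illustration of the definition, not part of the authors' tool; the nested-tuple encoding of formulas is our own assumption) computes the propositional skeleton by dropping every quantifier node:

```python
# Illustrative encoding of QBFs as nested tuples:
# ('var', 'x'), ('not', f), ('and', f, g), ('or', f, g),
# ('forall', 'x', f), ('exists', 'x', f).

def skeleton(phi):
    """Propositional skeleton (Definition 1): remove all quantifiers."""
    op = phi[0]
    if op == 'var':
        return phi
    if op == 'not':
        return ('not', skeleton(phi[1]))
    if op in ('and', 'or'):
        return (op, skeleton(phi[1]), skeleton(phi[2]))
    if op in ('forall', 'exists'):
        return skeleton(phi[2])
    raise ValueError(f"unknown connective: {op}")

# forall x. (x or exists y. not y)  ->  x or not y
phi = ('forall', 'x', ('or', ('var', 'x'),
                       ('exists', 'y', ('not', ('var', 'y')))))
print(skeleton(phi))  # ('or', ('var', 'x'), ('not', ('var', 'y')))
```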

Proposition 1. *Consider QBFs* $\varphi, \varphi'$*, a quantifier* $\mathsf{Q} \in \{\forall, \exists\}$*, a connective* $\circ$*, and variables* $x, y$*.*


*Forests, trees and partial orders.* We will oftentimes regard trees and forests as partially ordered sets. In particular, we define a *forest* as a set $T$ equipped with a partial ordering $\le$ such that for all elements $x \in T$, the subset $\{y \in T \mid y \le x\}$ is totally ordered. When $T$ is finite, this definition appropriately models the recursive concept of a forest: the elements of $T$ are nodes, and $x \le y$ if $y$ is a descendant of $x$. We say that $x$ is *covered* by $y$ whenever $x < y$ and there is no $z \in T$ with $x < z < y$. When regarding $T$ as a forest, this means that $x$ is the parent of $y$. A forest $T$ is a tree if, additionally, for any two elements $x, y \in T$ there is another element $z \in T$ with $z \le x$ and $z \le y$. For a finite $T$, this implies that $T$ has a least element, which corresponds to its root.

Given a forest $T$, we call a list $x_1, \ldots, x_n$ a *path* in $T$ if for all $1 \le i < j \le n$ we have $x_i, x_j \in T$ and $x_i < x_j$. The *height* of a forest $T$ is defined as

$$\text{ht}(T) = \max\{n \ge 0 \mid \text{there is a path } x\_1, \dots, x\_n \text{ in } T\}.$$
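The poset view of forests can be made concrete with a parent-pointer representation (an illustrative encoding of our own, not the paper's): $x \le y$ holds exactly when $x$ lies on the ancestor chain of $y$, and the height is the longest chain of nodes.

```python
# Forest as parent pointers: node -> parent, None for roots.
forest = {'r': None, 'a': 'r', 'b': 'r', 'c': 'a'}

def leq(parent, x, y):
    """x <= y iff y is a descendant of x (or x == y)."""
    while y is not None:
        if y == x:
            return True
        y = parent[y]
    return False

def height(parent):
    """ht(T): maximal number of nodes on a chain (root-to-leaf path)."""
    def depth(node):
        d = 0
        while node is not None:
            d, node = d + 1, parent[node]
        return d
    return max((depth(n) for n in parent), default=0)

print(leq(forest, 'r', 'c'), leq(forest, 'a', 'b'), height(forest))
# True False 3
```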

For a node $x \in T$, we define its *lower bounds* as $T_x^{\downarrow} = \{y \in T \mid y \le x\}$ and its *upper bounds* as $T_x^{\uparrow} = \{y \in T \mid y \ge x\}$.

*Parity-based functions.* We will use a parity-based version of the floor and ceiling functions. Intuitively, $\lfloor n \rfloor_k$ (resp. $\lceil n \rceil_k$) rounds $n$ down (resp. up) to the closest integer with the same parity as $k$. Formally, for integers $n, k \in \mathbb{Z}$ we define:

$$\begin{aligned} \lfloor n \rfloor\_k &= \max \{ m \in \mathbb{Z} \mid m \le n \text{ and } m - k \text{ is even} \} \\ \lceil n \rceil\_k &= \min \{ m \in \mathbb{Z} \mid m \ge n \text{ and } m - k \text{ is even} \} \end{aligned}$$
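These functions have a simple closed form: if $n - k$ is already even, both return $n$; otherwise the parity floor steps down by one and the parity ceiling steps up by one. A minimal Python sketch:

```python
def parity_floor(n: int, k: int) -> int:
    """⌊n⌋_k: largest m <= n such that m - k is even."""
    return n if (n - k) % 2 == 0 else n - 1

def parity_ceil(n: int, k: int) -> int:
    """⌈n⌉_k: smallest m >= n such that m - k is even."""
    return n if (n - k) % 2 == 0 else n + 1

print(parity_floor(5, 0), parity_ceil(5, 0))  # 4 6
print(parity_floor(5, 1), parity_ceil(5, 1))  # 5 5
```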

*Direction-parametric operators.* At several points our ordering-based definitions will depend on a *direction* parameter $\dagger \in \{\uparrow, \downarrow\}$. Intuitively, $\uparrow$-labeled operators use a reversed ordering, while $\downarrow$-labeled operators use an unmodified ordering. In particular, we define the following operators:


# 3 Related Work

Already 20 years ago it was empirically shown that quantifier shifting has a severe impact on the solving performance of QBF solvers [6]. In that work, four different prenexing strategies were introduced that intuitively result in the smallest possible number of quantifier alternations. The authors noted that the presented strategies "leave room for different [prenexing] variants". In this work, we close this gap by providing a concise formalization of quantifier shifting.

The observation that the prenexing strategy impacts solving performance motivated the development of several non-prenex non-CNF solvers [7,8,15,29]. With the rise of efficient preprocessing for PCNF formulas and a focus on applications with few quantifier alternations, however, solver development focused on formulas in prenex form. To deal with the information loss induced by quantifier shifting, solvers were introduced that employ dependency schemes [27] to (re-)discover and exploit variable independencies [16,22], i.e., those solvers recover information on quantifier dependencies that is hidden in the prefix. Reeves et al. presented an approach to move Tseitin variables from the innermost quantifier block to the outermost possible position in the quantifier prefix [26]. The exact position is determined by the variables occurring in the formula defined by the Tseitin variable. With this reordering, they observe a considerable speed-up in solver performance. Lonsing and Egly evaluated the impact of the number of quantifier alternations on recent QBF solvers [18]. In their experiments, they established a correlation between different solving paradigms like expansion or QCDCL (see [2] for a detailed discussion of such proof systems) and the number of quantifier alternations. Also, proof-theoretical investigations [1] identify the number of quantifier alternations as a source of hardness for practical solving. However, to the best of our knowledge, there is no recent study that investigates the impact of quantifier shifting on the solving behavior of state-of-the-art solvers for formulas in prenex normal form.

Nowadays there is also much interest in dependency quantified Boolean formulas (DQBF) which allow for an explicit specification of quantifier dependencies. The decision problem of these formulas is NEXPTIME-complete [23], in contrast to the PSPACE-completeness of QBFs.

# 4 Quantifier Shifting

In this work we aim to transform arbitrary QBF formulas $\varphi$ (which are not in general prenex or in CNF) into equivalent prenex QBF formulas of the form $\mathsf{Q}_1x_1.\cdots.\mathsf{Q}_nx_n.\varphi_{\rm psk}$. The formula $\varphi_{\rm psk}$ is not necessarily in CNF, although this can be easily achieved through the well-known Tseitin procedure.

The method we propose can be roughly summarized as follows. First, a *quantifier tree* reflecting the scope hierarchy of quantifiers in $\varphi$ is constructed. Each node in this quantifier tree will then be assigned a rank with some restrictions to guarantee soundness; we call this assignment a *linearization*. Finally, the formula $\mathsf{Q}_1x_1.\cdots.\mathsf{Q}_nx_n.\varphi_{\rm psk}$ is constructed by enumerating the bound variables $x_1, \ldots, x_n$ by rank; thanks to the restrictions on linearizations, this formula will be equivalent to $\varphi$.

*Example 1.* Throughout our work we will use the following QBF η as a running example:

$$\begin{aligned} \eta = \exists x_1.\; x_1 \land \Bigl( \Bigl( & \bigl( \forall y_1.\, (\exists z_1.\, (\neg y_1 \lor z_1)) \land {} \\ & \quad (\forall u_1. \exists v_1.\, (y_1 \lor \neg u_1 \lor v_1) \land (\neg y_1 \lor u_1 \lor \neg v_1)) \bigr) \land {} \\ & \bigl( \forall z_2.\, (\exists u_2.\, \neg z_2 \lor u_2) \land (\forall u_3.\, x_1 \lor z_2 \lor u_3) \bigr) \Bigr) \lor {} \\ & \bigl( \exists y_4.\, (y_4 \land \forall z_4. \exists u_4.\, z_4 \land u_4) \lor (\neg y_4 \land \exists z_5. \forall u_5.\, z_5 \land u_5) \bigr) \Bigr) \end{aligned}$$

Its propositional skeleton $\eta_{\rm psk}$ is then given by:

$$\begin{aligned} x_1 \land \Bigl( & \bigl( (\neg y_1 \lor z_1) \land (y_1 \lor \neg u_1 \lor v_1) \land (\neg y_1 \lor u_1 \lor \neg v_1) \land {} \\ & \quad (\neg z_2 \lor u_2) \land (x_1 \lor z_2 \lor u_3) \bigr) \lor (y_4 \land z_4 \land u_4) \lor (\neg y_4 \land z_5 \land u_5) \Bigr) \end{aligned}$$

Consider the following two quantifier shifts for η:

$$\begin{aligned} \eta' &= \exists x\_1. \exists y\_4. \exists z\_5. \forall y\_1. \forall z\_2. \forall u\_3. \forall z\_4. \forall u\_5. \exists z\_1. \exists u\_2. \exists u\_4. \forall u\_1. \exists v\_1. \eta\_{\text{psk}} \\ \eta'' &= \exists x\_1. \exists y\_4. \exists z\_5. \exists v\_1. \forall y\_1. \forall z\_2. \forall u\_3. \forall z\_4. \forall u\_5. \exists z\_1. \exists u\_2. \exists u\_4. \forall u\_1. \eta\_{\text{psk}} \end{aligned}$$

While $\eta'$ is equivalent to $\eta$, the QBF formula $\eta''$ is not. The intuitive reason is that in $\eta''$ the quantifier $\exists v_1$ has been pushed across quantifier alternation boundaries. This is exactly the situation our formalization will prevent.

Our formalization associates to each QBF a forest obtained by removing from its syntax tree all non-quantifier nodes. The remaining nodes are thus uniquely determined by a bound variable and a quantifier, and this forest contains all the information needed for quantifier shifting. Hence, we first define the abstract concept of quantifier forests, and then we will show how to construct a quantifier forest from a QBF as above.

A *quantifier forest* is a triple $(T, \le, q)$ where $(T, \le)$ is a finite forest regarded as a partially ordered set (see Sect. 2) and $q : T \to \{\forall, \exists\}$. We call it a *quantifier tree* or *quantree* whenever $(T, \le)$ is a tree. If $(T, \le, q)$ is a nonempty quantree, we also define $q^\star(x) = 1$ if $q(x) = q(\min(T))$ and $q^\star(x) = 0$ otherwise. Given a QBF formula $\varphi$, its *associated quantifier forest* is a triple $(T_\varphi, \le_\varphi, q_\varphi)$, where $T_\varphi$ is the set of bound variables in $\varphi$, and $\le_\varphi$ and $q_\varphi$ are defined recursively:


Proposition 2. *Let* $(T_\varphi, \le_\varphi, q_\varphi)$ *be the quantifier forest of QBF* $\varphi$*. If* $\varphi = \mathsf{Q}x.\varphi_0$ *for a quantifier* $\mathsf{Q}$*, then* $(T_\varphi, \le_\varphi, q_\varphi)$ *is a quantree.*

*Example 2.* Figure 1 shows the quantree associated to the QBF $\eta$ from Example 1. In general, we can only guarantee that the quantifier forest associated to a QBF is a tree when the QBF is of the form $\mathsf{Q}x.\varphi$. For example, the quantifier forest associated with $(\forall x.x) \land (\exists y.y)$ is a forest with two incomparable elements $x$ and $y$.

# 4.1 Linearizations over Quantrees

We now formalize the main object of this paper, namely the different ways the quantifiers in a formula can be rearranged into the quantifier prefix of an equivalent prenex formula. Given a QBF of the form $\mathsf{Q}x.\varphi$ for a quantifier $\mathsf{Q}$, we consider its associated quantree $T$. We aim to construct an equivalent prenex QBF $\mathsf{Q}_1V_1.\cdots.\mathsf{Q}_NV_N.\varphi_{\rm psk}$ where $\mathsf{Q}_i \neq \mathsf{Q}_{i+1}$ for $1 \le i < N$. To do so, each node in $T$ (i.e. each bound variable in $\mathsf{Q}x.\varphi$) must be mapped to a single quantifier block $\mathsf{Q}_iV_i$. We call this $i$ its *rank*. However, as shown in Example 1, assigning arbitrary ranks is unsound (i.e. the obtained prenex QBF is not equivalent to $\mathsf{Q}x.\varphi$). We show how bound variables can be ranked while preserving soundness.

Let us consider an arbitrary quantree $T$. A map $f : T \to \{1,\ldots,N\}$ for some $N \ge 0$ is called a *linearization* if:

1. $f(x) \le f(y)$ for all quantree nodes $x, y \in T$ with $x \le y$.

2. For all quantree nodes $x \in T$, $f(x)$ is odd if and only if $q^\star(x) = 1$.

Consider now a QBF of the form $\mathsf{Q}y.\varphi$ where $\mathsf{Q}$ is a quantifier and $y$ is a variable, and its associated quantree $(T, \le, q)$. In this case, since $T$ is the set of bound variables in $\mathsf{Q}y.\varphi$, a linearization $f : T \to \{1,\ldots,N\}$ maps each bound variable $x \in T$ to an integer $f(x)$ we call its *rank*. A QBF $\psi$ is called a *prenexation* of $\mathsf{Q}y.\varphi$ via $f$ if $\psi$ is of the form $\mathsf{Q}_1V_1.\cdots.\mathsf{Q}_NV_N.\varphi_{\rm psk}$ where $V_i = \{x \in T \mid f(x) = i\}$ and $\mathsf{Q}_i = \mathsf{Q}$ (resp. the complementary quantifier $\overline{\mathsf{Q}}$) if $i$ is odd (resp. even) for $1 \le i \le N$.
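Both linearization conditions and the block construction $V_i$ are easy to check mechanically. A sketch (the parent-pointer encoding and 'A'/'E' quantifier labels are our own illustrative choices; checking condition 1 on parent-child pairs suffices, since the order is generated by them):

```python
def is_linearization(parent, q, root, f):
    """parent: node -> parent (None for root); q: node -> 'A'/'E';
    f: node -> rank. Checks the two linearization conditions."""
    for x in parent:
        p = parent[x]
        if p is not None and not f[p] <= f[x]:   # condition 1 (monotone)
            return False
        q_star = 1 if q[x] == q[root] else 0     # q*(x) relative to root
        if (f[x] % 2 == 1) != (q_star == 1):     # condition 2 (parity)
            return False
    return True

def blocks(f, n):
    """V_i = {x | f(x) = i} for i = 1..N, in prefix order."""
    return [sorted(x for x in f if f[x] == i) for i in range(1, n + 1)]

parent = {'x1': None, 'y1': 'x1', 'z1': 'y1'}
q = {'x1': 'E', 'y1': 'A', 'z1': 'E'}
f = {'x1': 1, 'y1': 2, 'z1': 3}
print(is_linearization(parent, q, 'x1', f))  # True
print(blocks(f, 3))  # [['x1'], ['y1'], ['z1']]
```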


Figure 1 shows the quantree associated to $\eta$ together with the optimal linearizations for this quantree for each strategy. In each column, the variables mapped to each rank are shown; the quantifier of each block appears in the header. Note that the optimal linearizations for strategies $\mathsf{Q}\dagger\uparrow$ and $\mathsf{Q}\dagger\downarrow$ assign the same rank to $\mathsf{Q}$-quantified variables; this is a consequence of Lemma 2.

Theorem 1. *Let* $T$ *be the quantree associated to a QBF of the form* $\mathsf{Q}y.\varphi$*. Consider a prenexation* $\psi$ *of* $\mathsf{Q}y.\varphi$ *via some linearization* $f : T \to \{1,\ldots,N\}$*. Then* $\mathsf{Q}y.\varphi$ *is equivalent to* $\psi$*.*

To guarantee this form $\mathsf{Q}y.\varphi$ for an arbitrary QBF $\varphi$, we can simply introduce a fresh variable $y$ that does not occur in $\varphi$. Obviously, $\varphi$ is equivalent to $\mathsf{Q}y.\varphi$.

*Example 3.* Figure 1 shows six linearizations for the quantree associated to the QBF $\eta$ from Example 1 and Example 2. In that example, the quantifier shift $\eta'$ is the prenexation of $\eta$ via the linearization $f_1$. Note that the mapping $f$ that would produce $\eta''$ is not a linearization, since that would violate Theorem 1. In particular, $u_1 \le v_1$ but $f(v_1) < f(u_1)$.

# 4.2 Alternation Height of Quantrees

So far we have not shown that linearizations even exist. Given the theoretical and empirical impact of the number of quantifier alternations on QBF solving, we are not just interested in their existence, but rather in linearizations that minimize the maximum rank $N$. We will now show how to compute the minimal value of $N$ for which linearizations exist; in fact, this value will be extremely useful to extend the ideas from [6] to arbitrary QBFs in Sect. 5.2.

Consider an arbitrary quantree $T$, and a path $x_1, \ldots, x_n$ in $T$. We call this path *alternating* whenever $q(x_i) \neq q(x_{i+1})$ for $1 \le i < n$. Then we can define the *alternation height* of $T$ as

$$\text{aht}(T) = \max\{n \ge 0 \mid \text{there is an alternating path } x_1, \ldots, x_n \text{ in } T\}.$$

Intuitively, the alternation height of $T$ is the height of the tree that results from "clumping" together all adjacent nodes with the same quantifier. It then becomes apparent that any linearization $f$ over $T$ must have $N \ge \text{aht}(T)$, since for any alternating path $x_1, \ldots, x_n$ we have $f(x_1) < \cdots < f(x_n)$. The following result shows that this lower bound can indeed be realized:

Theorem 2. *Let* $T$ *be a quantree. Then a linearization* $f : T \to \{1,\ldots, \text{aht}(T)\}$ *exists. Furthermore, there exists no linearization* $g : T \to \{1,\ldots,N\}$ *such that* $0 \le N < \text{aht}(T)$*.*
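Because an alternating chain may skip over equal-quantifier runs, along each root-to-leaf path the longest alternating subchain has one node per maximal block of equal quantifiers, so aht can be computed by a single traversal that increments the chain length only at quantifier changes. A sketch (the children-map encoding is our own):

```python
def aht(children, q, root):
    """Alternation height of the quantree rooted at `root`:
    count one node per maximal run of equal quantifiers on each path."""
    best, stack = 1, [(root, 1)]
    while stack:
        node, length = stack.pop()
        best = max(best, length)
        for child in children.get(node, []):
            step = 1 if q[child] != q[node] else 0
            stack.append((child, length + step))
    return best

children = {'a': ['b', 'c'], 'b': ['d']}
q = {'a': 'E', 'b': 'A', 'c': 'E', 'd': 'E'}
print(aht(children, q, 'a'))  # 3  (alternating path a:∃ -> b:∀ -> d:∃)
```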

Observe that the number of quantifier alternations in a prenexation via a linearization grows with the value N. In the following sections, we will restrict our scope to linearizations that minimize this value, i.e. linearizations in the set

$$\text{Lin}(T) = \{f : T \to \{1,\ldots,N\} \mid f \text{ is a linearization and } N = \text{aht}(T)\}.$$

# 5 Linearization Strategies

We now follow up on the ideas from [6] and formalize them. In particular, we aim to obtain a formal definition of when a linearization follows a given strategy, to ascertain whether strategies determine a unique linearization for each quantree, and to find simple algorithms to compute this linearization.

# 5.1 Strategies as Preferences over Linearizations

Here we take a non-constructive approach. For each prenexing strategy, we define a preference relation between the linearizations in Lin(T); linearizations that follow the strategy "better" than others are preferred. As we will show, this induces a strategy-based partial order between linearizations. The desired output of a strategy must then be a maximal element w.r.t. this partial order.

The strategies from [6] are based on the idea of pushing quantifiers of a given polarity as high or as low as possible in the quantifier hierarchy. This lends itself to a natural definition of preference.

Consider an arbitrary quantree $T$. Given a direction $\dagger \in \{\uparrow, \downarrow\}$ and a quantifier $\mathsf{Q}$, we define the *semi-preference* relation $\preceq_{\mathsf{Q}\dagger}$ over $\text{Lin}(T)$ given by $f \preceq_{\mathsf{Q}\dagger} g$ iff $f(x) \le^\dagger g(x)$ for all $x \in T$ with $q(x) = \mathsf{Q}$. In other words, $g$ is preferred over $f$ whenever $g$ assigns ranks to $\mathsf{Q}$-nodes further in the direction $\dagger$ than $f$ does.

*Example 4.* Consider the linearizations $f_1$, $f_2$ and $f_6$ from Fig. 1. All universal variables are assigned lower ranks by both $f_1$ and $f_2$ than by $f_6$, so $f_6 \preceq_{\forall\downarrow} f_1$ and $f_6 \preceq_{\forall\downarrow} f_2$ hold. Note also that $f_1$ and $f_2$ assign the same ranks to universal variables, so both $f_1 \preceq_{\forall\downarrow} f_2$ and $f_2 \preceq_{\forall\downarrow} f_1$ hold. Note that this does not imply $f_1 = f_2$.

Example 4 shows that the antisymmetry property of partial orders does not hold for $\preceq_{\mathsf{Q}\dagger}$. This is intuitive: two linearizations might be just as good as each other regarding $\mathsf{Q}$-nodes, while wildly differing for other nodes. Hence, for strategies to uniquely determine linearizations we need to provide preferences for both quantifiers. While this was proposed by [6], it was also noted there that uniqueness is not attained.

*Example 5.* The linearizations $f_2$ and $f_3$ from Fig. 1 are both good candidates for linearizations for the strategy $\exists\downarrow\forall\uparrow$ from [6]: both assign high ranks to existential variables and low ranks to universal variables. However, there is no apparent criterion why $f_2$ should or should not be preferred to $f_3$ under that strategy.

We solve this problem by giving one quantifier priority over the other. Our *strategies* are of the form $\mathsf{Q}\dagger\ddagger$, where $\mathsf{Q}$ is a quantifier and $\dagger, \ddagger$ are directions. A good linearization under this strategy pushes quantifiers $\mathsf{Q}$ in the $\dagger$ direction and quantifiers $\overline{\mathsf{Q}}$ in the $\ddagger$ direction; when in conflict, the former should prevail.

To formalize this idea, we liberally borrow from the somewhat similar notion of lexicographic orderings. Let us define $f \approx_{\mathsf{Q}} g$ whenever $f(x) = g(x)$ for all $x \in T$ with $q(x) = \mathsf{Q}$. In other words, $f \approx_{\mathsf{Q}} g$ holds whenever both $f \preceq_{\mathsf{Q}\dagger} g$ and $g \preceq_{\mathsf{Q}\dagger} f$ hold, regardless of the choice of direction $\dagger$. We define the *preference* relation $\preceq_{\mathsf{Q}\dagger\ddagger}$ over $\text{Lin}(T)$ given by $f \preceq_{\mathsf{Q}\dagger\ddagger} g$ iff $f \preceq_{\mathsf{Q}\dagger} g$ holds, and whenever $f \approx_{\mathsf{Q}} g$ holds then $f \preceq_{\overline{\mathsf{Q}}\ddagger} g$ holds as well.

Proposition 3. $\preceq_{\mathsf{Q}\dagger\ddagger}$ *defines a partial order on* $\text{Lin}(T)$*.*

It is easy to check that $\forall\dagger\dagger$ and $\exists\dagger\dagger$ are the same preference relation for $\dagger \in \{\uparrow, \downarrow\}$. Hence, our approach defines 6 unique strategies, while [6] only proposed 4 strategies. On the one hand, the $\exists\dagger\forall\dagger$ strategy from [6] corresponds to our $\exists\dagger\dagger$ (or, equivalently, $\forall\dagger\dagger$) strategy. On the other hand, our strategies $\exists\dagger\ddagger$ and $\forall\ddagger\dagger$ with $\dagger \neq \ddagger$ are both covered by the $\exists\dagger\forall\ddagger$ strategy from [6], which is not uniquely determined.

*Example 6.* Although we cannot yet convince the reader of this, the linearizations given in Fig. 1 are the maximum elements of $\text{Lin}(T)$ for each of the six (unique) preference orderings $\preceq_{\mathsf{Q}\dagger\ddagger}$; the corresponding strategy is shown in the rightmost column. For now, we can foreshadow that strategies $\mathsf{Q}\dagger\uparrow$ and $\mathsf{Q}\dagger\downarrow$ assign the same ranks to $\mathsf{Q}$-nodes. As shown later in Lemma 1, this holds in general.

# 5.2 Optimal Linearizations over a Strategy

Proposition 3 suggests this is a good direction to formalize the idea of strategies: since $\text{Lin}(T)$ is finite, there exist optimal linearizations w.r.t. the preference ordering $\preceq_{\mathsf{Q}\dagger\ddagger}$. We call these linearizations $\mathsf{Q}\dagger\ddagger$*-optimal*; linearizing a quantree $T$ through the strategy $\mathsf{Q}\dagger\ddagger$ then means computing a $\mathsf{Q}\dagger\ddagger$-optimal linearization in $\text{Lin}(T)$.

Some hurdles remain, though. For one, we have not determined whether $\mathsf{Q}\dagger\ddagger$-optimal linearizations are unique (i.e. whether maximal elements w.r.t. $\preceq_{\mathsf{Q}\dagger\ddagger}$ are maxima as well). This is of interest because, to empirically test the performance effect of quantifier shifting strategies, the outcome of applying a strategy must be reproducible at worst, and uniquely determined by definition at best. A second issue is a consequence of our non-constructive approach: we have yet to provide a procedure that computes a $\mathsf{Q}\dagger\ddagger$-optimal linearization for a given quantree.

The rest of this section is devoted to deriving a closed-form expression for $\mathsf{Q}\dagger\ddagger$-optimal linearizations. Since this expression is deterministic and computable, it solves both aforementioned issues.

*Overview.* As we mentioned above, $\sqsubseteq_{\mathsf{Q}\dagger\ddagger}$ is somewhat similar to a lexicographic ordering in two components, where the first component is ordered by $\preceq_{\mathsf{Q}\dagger}$ and the second component is ordered by $\preceq_{\overline{\mathsf{Q}}\ddagger}$. We exploit this intuition to construct $\mathsf{Q}\dagger\ddagger$-optimal linearizations: we first optimize the first component (in our case, the ranks of $\mathsf{Q}$-nodes), and then optimize the second component (the ranks of $\overline{\mathsf{Q}}$-nodes) while keeping the first component fixed.

To optimize the first component, we find a linearization $\Gamma_\dagger$, defined below in (1), that is optimal for $\preceq_{\mathsf{Q}\dagger}$. This is expressed more precisely in Lemma 1: $\Gamma_\dagger$ pushes $\mathsf{Q}$-nodes further in direction $\dagger$ than any other linearization.

Interestingly, $\Gamma_\dagger$ does not depend on $\mathsf{Q}$: $\Gamma_\dagger$ actually optimizes *all* nodes in the $\dagger$ direction. The second part of our method optimizes the ranks assigned to $\overline{\mathsf{Q}}$-nodes in the $\ddagger$ direction while keeping the $\mathsf{Q}$-nodes constant. For a general linearization $f$, this procedure results in a new linearization $[f]^{\mathsf{Q}\ddagger}$ defined below in (2). Lemma 2 shows that $[f]^{\mathsf{Q}\ddagger}$ is optimal for $\sqsubseteq_{\mathsf{Q}\dagger\ddagger}$ among the linearizations that assign the same ranks as $f$ to $\mathsf{Q}$-nodes. These two results are combined in Theorem 3: the unique $\mathsf{Q}\dagger\ddagger$-maximal linearization is $[\Gamma_\dagger]^{\mathsf{Q}\ddagger}$.

*Theoretical Results.* Let us consider a quantree $(T, \le, q)$ and a strategy $\mathsf{Q}\dagger\ddagger$. We define the mapping $\Gamma_\dagger : T \to \{1, \ldots, \mathrm{aht}(T)\}$ given by

$$\Gamma_{\dagger}(x) = \left\lfloor \left| \max{}^{\dagger}\{1, \ldots, \mathrm{aht}(T)\} - \mathrm{aht}(T_x^{\dagger}) \right| + 1 \right\rfloor_{q^{\star}(x)}^{\dagger}. \tag{1}$$

Furthermore, for $f \in \mathrm{Lin}(T)$, we define the mapping $[f]^{\mathsf{Q}\ddagger} : T \to \{1, \ldots, \mathrm{aht}(T)\}$ given by

$$[f]^{\mathsf{Q}\ddagger}(x) = \left\lfloor \min{}^{\ddagger}\{f(y) \mid y \in T_x^{\ddagger} \text{ and } q(y) = \mathsf{Q}\} \right\rfloor_{q^{\star}(x)}^{\ddagger}. \tag{2}$$

In (2), $\min^{\ddagger}$ is taken over a subset of $\{1, \ldots, \mathrm{aht}(T)\}$; we follow the convention that $\min^{\ddagger}(\emptyset) = \max^{\ddagger}\{1, \ldots, \mathrm{aht}(T)\}$.

**Lemma 1.** $\Gamma_\dagger \in \mathrm{Lin}(T)$*. Furthermore, for any* $g \in \mathrm{Lin}(T)$*, we have* $g \preceq_{\mathsf{Q}\dagger} \Gamma_\dagger$*.*

**Lemma 2.** $[f]^{\mathsf{Q}\ddagger} \in \mathrm{Lin}(T)$ *for all* $f \in \mathrm{Lin}(T)$*. Furthermore,* $[f]^{\mathsf{Q}\ddagger} \approx_{\mathsf{Q}} f$*, and for any* $g \in \mathrm{Lin}(T)$ *with* $g \approx_{\mathsf{Q}} f$*, we have* $g \sqsubseteq_{\mathsf{Q}\dagger\ddagger} [f]^{\mathsf{Q}\ddagger}$*.*

**Theorem 3.** *Let* $f \in \mathrm{Lin}(T)$ *be a* $\mathsf{Q}\dagger\ddagger$*-optimal linearization. Then,* $f = [\Gamma_\dagger]^{\mathsf{Q}\ddagger}$*. In particular,* $[\Gamma_\dagger]^{\mathsf{Q}\ddagger}$ *is the maximum element in* $(\mathrm{Lin}(T), \sqsubseteq_{\mathsf{Q}\dagger\ddagger})$*.*

*Example 7.* Let us check, for a few values, that $[\Gamma_\downarrow]^{\exists\uparrow}$ is indeed $f_3$ for the quantree in Fig. 1. First note that $[\Gamma_\downarrow]^{\exists\uparrow}$ only depends on the values of $\Gamma_\downarrow$ for existential nodes, so we only need to compute these. In this case, $\max^{\downarrow}\{1, \ldots, \mathrm{aht}(T)\} = \mathrm{aht}(T) = 5$, $q^\star(x) = 1$, and $\mathrm{aht}(T_x^{\downarrow}) = \mathrm{aht}(T_x)$ is simply the maximum number of quantifier alternations below $x$. $\Gamma_\downarrow$ respects the tree ordering, so we obtain

$$[\Gamma_\downarrow]^{\exists\uparrow}(z_1) = \Gamma_\downarrow(z_1) = \left\lfloor \left|5 - \mathrm{aht}(T_{z_1})\right| + 1 \right\rfloor_1^{\downarrow} = \lfloor 4 \rfloor_1^{\downarrow} = 3 = f_3(z_1).$$

Furthermore, we can compute $[\Gamma_\downarrow]^{\exists\uparrow}(u_1)$ by checking only $\Gamma_\downarrow(z_1)$, since $z_1$ realizes the $\min^{\uparrow}$ operator in (2). Then,

$$[\Gamma_\downarrow]^{\exists\uparrow}(u_1) = \left\lfloor \Gamma_\downarrow(z_1) \right\rfloor_{q^\star(u_1)}^{\uparrow} = \lfloor 3 \rfloor_0^{\uparrow} = 4 = f_3(u_1).$$

Example 7 suggests that $[f]^{\mathsf{Q}\ddagger}$ can be computed recursively. Indeed, the rank of a node can be computed based on the ranks of its children or parent.

**Corollary 1.** *Let* $x \in T$ *such that* $q(x) \neq \mathsf{Q}$*. Then,*

$$[f]^{\mathsf{Q}\ddagger}(x) = \left\lfloor \min{}^{\ddagger}\{[f]^{\mathsf{Q}\ddagger}(y) \mid x \text{ is covered by } y \in T \text{ w.r.t. } \le^{\ddagger}\} \right\rfloor_{q^{\star}(x)}^{\ddagger}.$$

# 6 Implementation and Evaluation

We implemented the optimal linearization $[\Gamma_\dagger]^{\mathsf{Q}\ddagger}$ for each strategy $\mathsf{Q}\dagger\ddagger$ described in Sect. 5. Our implementation uses the Booleguru framework [10], which is designed for efficiently working with propositional formulas and QBFs. Booleguru provides a convenient parsing and serialization infrastructure for widely used formats, as well as helper functions for writing formula transformations. Our extension is licensed under the MIT license and is publicly available<sup>2</sup>.

Our implementation computes a quantifier shift on an input QBF $\varphi$ based on a strategy $\mathsf{Q}\dagger\ddagger$ by traversing the abstract syntax tree of the parsed QBF $\varphi$ twice in a depth-first fashion. In the first pass, the propositional skeleton $\varphi_{\mathrm{psk}}$ and the quantree $T$ are extracted. Furthermore, the values $\mathrm{aht}(T^x)$ and $\mathrm{aht}(T_x)$, which we call the *height* and *depth* of $x$, are computed for each node $x \in T$.

The second pass is applied only to the quantree. For each node $x \in T$, we compute its rank $[\Gamma_\dagger]^{\mathsf{Q}\ddagger}(x)$. For $\mathsf{Q}$-nodes, this rank is given by $\Gamma_\dagger(x)$, which is trivial to compute from the height and depth of $x$; for $\overline{\mathsf{Q}}$-nodes, Corollary 1 allows a recursive computation. Based on their ranks, quantifier nodes in the quantree are collected in a quantifier prefix which is prepended to $\varphi_{\mathrm{psk}}$.

To apply a linearization strategy to an arbitrary formula, Booleguru needs to be called with the options `:linearize-quants-{E,A}{up,down}-{up,down}`, using the quantifier E (∃) or A (∀) and the two directions up (↑) and down (↓). Overall, there are eight different combinations that we evaluate in the following. However, from the discussion above, it becomes obvious that only six of

<sup>2</sup> https://github.com/maximaximal/booleguru.

those eight strategies are different. In the implementation, all quantifiers are first extracted from an expression and processed in a separate tree. Each node contains the quantifier type, the quantified variables, and dependent quantifier nodes.

After computing the linearization, the extracted quantifiers are inserted piecewise as new expressions that wrap the originally transformed expression. This ensures the ordering of variables within the quantifier blocks stays the same. The fully quantified expression is then returned from the transformer and can either be printed using one of Booleguru's serializers, or processed further.
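The extract-and-rewrap procedure can be illustrated on a toy formula representation (our own simplification; Booleguru's actual data structures and rank-based prefix ordering are more involved):

```python
# Toy prenexing: pull quantifiers out of a formula AST and re-wrap the
# propositional skeleton with the collected prefix. Formulas are nested
# tuples ("exists"/"forall", var, sub) and ("and"/"or", lhs, rhs), or a
# variable name. We assume all bound variables are distinct, and we order
# the prefix in plain DFS order; the actual tool orders it by the
# computed ranks instead.

def extract(node, prefix):
    if isinstance(node, str):
        return node
    op = node[0]
    if op in ("exists", "forall"):
        prefix.append((op, node[1]))
        return extract(node[2], prefix)
    return (op, extract(node[1], prefix), extract(node[2], prefix))

def prenex(formula):
    prefix = []
    skeleton = extract(formula, prefix)  # first pass: strip quantifiers
    for q, v in reversed(prefix):        # wrap skeleton, outermost first
        skeleton = (q, v, skeleton)
    return skeleton

phi = ("and", ("exists", "x", ("or", "x", "y")), ("forall", "z", "z"))
assert prenex(phi) == \
    ("exists", "x", ("forall", "z", ("and", ("or", "x", "y"), "z")))
```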

# 6.1 Benchmarks

As most solvers only process formulas in prenex (conjunctive) normal form, hardly any non-prenex benchmarks are currently available. To test our implementation, we considered the benchmark set from QBFEval 2008, and we reimplemented a generator for nested counterfactuals as described below. All used formulas and corresponding experimental logs are available at [11].

*Nested Counterfactuals.* We developed a novel generator for nested counterfactuals (NCFs) based on a Lua script that is integrated into Booleguru. The full encoding is described in [6]. To generate NCFs, five arguments must be provided: the number of formulas in the background theory, the number of variables, the number of clauses per formula, the number of variables per clause, and the nesting depth. Optionally, a sixth argument fixes the seed value for random choices. A counterfactual φ>ψ is true over a background theory T iff the minimal change of T to incorporate φ entails ψ. In a nested counterfactual, φ or ψ may themselves be (nested) counterfactuals; for details see [6]. We chose the range of arguments based on the description in Egly et al. [6]. We assume that the background theory T always consists of 5 randomly generated formulas. Each of these formulas consists of 2 to 10 clauses, where each clause is a disjunction of 3 variables. The clauses contain randomly chosen atoms from a set of 5 variables; each atom has a 50 percent chance of being negated. No clause may contain the same literal more than once, and the clauses are non-tautological. The nesting depth of the counterfactuals ranges from 2 to 6. All possible combinations of these selected parameters result in 45 different classes. For each of these classes, 100 instances were generated to ensure that both satisfiable and unsatisfiable results are represented. With the 8 strategies, we obtain 36 000 prenexed formulas, either in the non-CNF QCIR format or in QDIMACS.
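The random background theories described above can be re-created with a few lines of code. The sketch below is our own minimal reconstruction of the described parameters (the actual generator is a Lua script integrated into Booleguru):

```python
import random

def random_theory(num_formulas=5, num_vars=5, min_clauses=2, max_clauses=10,
                  clause_width=3, seed=0):
    """Generate a random background theory with the parameters described
    in the text: each formula is a list of clauses, each clause a
    disjunction of `clause_width` literals over distinct variables (which
    makes clauses duplicate-free and non-tautological), and each atom is
    negated with probability 1/2."""
    rng = random.Random(seed)
    theory = []
    for _ in range(num_formulas):
        clauses = []
        for _ in range(rng.randint(min_clauses, max_clauses)):
            variables = rng.sample(range(1, num_vars + 1), clause_width)
            clauses.append(tuple(v if rng.random() < 0.5 else -v
                                 for v in variables))
        theory.append(clauses)
    return theory

theory = random_theory(seed=42)
assert len(theory) == 5
assert all(2 <= len(formula) <= 10 for formula in theory)
# distinct variables per clause: no duplicate literals, no tautologies
assert all(len({abs(lit) for lit in clause}) == 3
           for formula in theory for clause in formula)
```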

*Non-Prenex-Non-CNF Benchmarks from QBFEval 2008.* In QBFEval 2008, a non-prenex, non-CNF track was organized [19]. The benchmarks are available at QBFLib<sup>3</sup>. This set consists of 492 formulas in the outdated Boole format. To transform these formulas into prenex form, we first rewrote them into

<sup>3</sup> http://www.qbflib.org.

Table 1. Number of solved formulas per strategy and solver of QBFEval'08 set (QCIR). Diff indicates the difference between the best and the worst strategy. Each strategy has 492 formulas.


the related Limboole<sup>4</sup> format that is processable by Booleguru. Again, we considered all eight options, resulting in 3936 prenexed formulas.

# 6.2 Experimental Setup

All experiments were run with a timeout of 15 minutes on a cluster of dual-socket AMD EPYC 7313 @ 3.7 GHz machines running Ubuntu 22.04, with an 8 GB memory limit per task. We split the experiments into two parts: on the one hand, we consider solvers that process formulas in prenex conjunctive normal form (PCNF), and on the other hand, we consider solvers that process formulas in prenex non-CNF form. In the first group, which accepts formulas in the QDIMACS format, we consider the following solvers: DepQBF (version 6.03) is a conflict/solution-driven clause/cube learning (QCDCL) solver that integrates several advanced inprocessing techniques and reasoning under the standard dependency scheme [17]. Qute [21] is also a QCDCL solver; it employs dynamic dependency learning. This solver is also able to process QCIR formulas, i.e., it is also included in the second group. The solver CAQE [25] (version 4.0.2) implements clausal selection. The solver RAReQS [14] (version 1.1) implements variable expansion in CEGAR style. Finally, dynQBF [4,5] (version 1.1.1) is a BDD-based solver. For pre-processing, we used Bloqqer [3] and HQSpre [31]. For testing the encodings in the non-CNF QCIR format, we include the solvers QuAbS [9,29] and CQESTO [13] (version v00.0) (*sic*), which lift clausal selection to circuits, QFUN [12] (version v00.0) (*sic*), a solver that employs machine learning techniques, and Qute, which was already mentioned above.

# 6.3 Experimental Results

In the following, we first discuss the results of the solvers that process formulas in QCIR (i.e., formulas in prenex form but not in CNF). Second, we report on our experiments with QDIMACS formulas for the PCNF solvers.

<sup>4</sup> http://fmv.jku.at/limboole/.


Table 2. Number of different prefixes generated from the 2008 non-CNF benchmark set with all strategy combinations. Each strategy has 492 formulas.

*Prenexed Formulas in QCIR.* The nested counterfactual benchmarks were easily solved by the QCIR solvers, i.e., they could exploit the formula structure to solve these formulas quickly (all were solved in less than a second). Therefore, we focus on the formulas of the QBFEval'08 benchmark set in the following. Table 1 shows the results for the QCIR solvers, and Table 2 shows the number of different prefixes that were generated with all strategy combinations. For QuAbS, QFUN, and CQESTO, we see a clear difference between the best and worst shifting strategy. In contrast, Qute seems to be less sensitive to the prenexing, which might be related to its dynamic dependency learning. The detailed solving behavior of QFUN and CQESTO is shown in Fig. 2. For QFUN, we observe that ∃↑↑ and ∀↑↑ clearly perform best, while ∀↓↑ and ∃↑↓ seem to be less beneficial.

*Prenexed Formulas in QDIMACS.* Table 3 shows the results of the QDIMACS solvers on the encodings of the nested counterfactuals, and Table 4 shows the number of different prefixes that were generated with all strategy combinations. DepQBF solves all formulas for 4 of the 8 strategies and most of the others, dynQBF is able to solve most of the formulas, and Qute solves about one quarter of the formulas, while RAReQS and CAQE hardly solve any of them. This could be connected to the observation that those solvers perform better on formulas with few quantifier alternations. For all solvers, we observe

Fig. 2. Solving time of the QBFEVAL'08 set with QFUN (left) and CQESTO (right).

Table 3. Number of solved formulas per strategy and solver of NCFs. Diff indicates the difference between the best and the worst strategy. Each strategy has 4500 formulas.


Table 4. Number of different prefixes generated from the NCF benchmark set with all strategy combinations. Each strategy has 4500 formulas.


Fig. 3. Solving time of nested counterfactuals with DepQBF (left) and dynQBF (right).

that the chosen shifting strategy impacts the number of solved formulas. Details of the runs of DepQBF and dynQBF are shown in Fig. 3. For DepQBF, we observe that strategies ∃↓↓ and ∀↓↓ are clearly less preferable than strategies ∃↑↑ and ∀↑↑, while dynQBF prefers existential quantifiers shifted down. The QBFEval'08 benchmarks are very challenging for recent QDIMACS solvers with our encoding. Out of the 492 formulas, DepQBF solves up to 128 formulas with the best strategy (∀↓↑), dynQBF solves around 60 formulas, and the other tools solve fewer than 30 formulas. Enabling preprocessing is beneficial for all solvers: when the preprocessors Bloqqer or HQSpre simplify the formulas, almost all formulas can be solved. Both with and without preprocessing, the shifting strategies have little impact on this benchmark set. Note that more than two thirds of these formulas have five or fewer quantifier alternations.

# 7 Conclusion and Future Work

This paper analyzes and extends previous work from 2003 on quantifier shifting for quantified Boolean formulas. Since then, much progress has been made in the development of QBF solvers by introducing novel solving paradigms, applying efficient preprocessing techniques, and exploiting quantifier (in-)dependence. However, most of these approaches assume formulas in prenex normal form. As a consequence, most encodings are provided in this form, which unnecessarily restricts solvers to a particular design choice. In this work, we not only formalized prenexing in a concise manner, but also provided an efficient, publicly available tool that implements the discussed prenexing strategies and the Tseitin transformation. In extensive experiments with state-of-the-art prenex CNF and non-CNF solvers, we showed that the choice of prenexing strategy impacts solving runtime on many instances. Since different solvers perform best under different strategies, it was not possible to identify a single best strategy. We therefore think it is important that solver developers, as well as developers of QBF encodings, exploit the information available in the problem structure and do not introduce artificial restrictions.

In future work, we plan to design and evaluate further prenexing strategies, and we will also revisit more non-prenex QBF encodings to obtain larger benchmark sets. At the moment, hardly any formulas in non-prenex form are available; our generator for encodings of nested counterfactuals is a first step towards changing this. Many of the considered formulas are either too hard or too easy for recent solvers, hence more effort is necessary to obtain a larger variety of interesting benchmarks (also in light of upcoming QBF evaluations). Finally, we want to explore how prenexing strategies affect the generation of certificates and solutions in terms of Herbrand and Skolem functions. From first-order logic, it is well known that it is beneficial to move quantifiers as far inwards as possible to minimize the arity of the Skolem functions [20].

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Satisfiability Modulo Exponential Integer Arithmetic**

Florian Frohn and Jürgen Giesl(B)

RWTH Aachen University, Aachen, Germany {florian.frohn,giesl}@informatik.rwth-aachen.de

**Abstract.** SMT solvers use sophisticated techniques for polynomial (linear or non-linear) integer arithmetic. In contrast, non-polynomial integer arithmetic has mostly been neglected so far. However, in the context of program verification, polynomials are often insufficient to capture the behavior of the analyzed system without resorting to approximations. In recent years, *incremental linearization* has been applied successfully to satisfiability modulo real arithmetic with transcendental functions. We adapt this approach to an extension of polynomial integer arithmetic with exponential functions. Here, the key challenge is to compute suitable *lemmas* that eliminate the current model from the search space if it violates the semantics of exponentiation. An empirical evaluation of our implementation shows that our approach is highly effective in practice.

# **1 Introduction**

Traditionally, automated reasoning techniques for integers focus on polynomial arithmetic. This is not only true in the context of SMT, but also for program verification techniques, since the latter often search for polynomial invariants that imply the desired properties. As invariants are over-approximations, they are well suited for proving "universal" properties like safety, termination, or upper bounds on the worst-case runtime that refer to all possible program runs. However, proving dual properties like unsafety, non-termination, or lower bounds requires under-approximations, so that invariants are of limited use here.

For lower bounds, an *infinite set* of witnesses is required, as the runtime w.r.t. a finite set of (terminating) program runs is always bounded by a constant. Thus, to prove non-constant lower bounds, *symbolic under-approximations* are required, i.e., formulas that describe an infinite subset of the reachable states. However, polynomial arithmetic is often insufficient to express such approximations. To see this, consider the program

$$x \leftarrow 1; \ y \leftarrow \mathtt{nondet}(0, \infty); \text{ while } y > 0 \text{ do } x \leftarrow 3 \cdot x; \ y \leftarrow y - 1 \text{ done}$$

where nondet(0,∞) returns a natural number non-deterministically. Here, the set of reachable states after execution of the loop is characterized by the formula

$$\exists n \in \mathbb{N}. \ x = 3^n \land y = 0. \tag{1}$$
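Formula (1) can be sanity-checked by direct simulation of the program, fixing the nondeterministic choice to each n:

```python
def run_loop(n):
    """Run the loop with the nondeterministic choice nondet(0, ∞) fixed to n."""
    x, y = 1, n
    while y > 0:
        x, y = 3 * x, y - 1
    return x, y

# Every terminating run ends in a state satisfying x = 3^n ∧ y = 0,
# exactly as formula (1) describes.
for n in range(10):
    assert run_loop(n) == (3 ** n, 0)
```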

© The Author(s) 2024

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 235950644 (Project GI 274/6-2).

C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 344–365, 2024. https://doi.org/10.1007/978-3-031-63498-7_21

In recent work, *acceleration techniques* have successfully been used to deduce lower runtime bounds automatically [17,18]. While they can easily derive a formula like (1) from the code above, this is of limited use, as most<sup>1</sup> SMT solvers cannot handle terms of the form $3^n$. Besides lower bounds, acceleration has also been used successfully for proving non-termination [15,18,19] and (un)safety [3,6,7,20,28,29], where its strength is finding long counterexamples that are challenging for other techniques.

Importantly, exponentiation is not just "yet another function" that can result from applying acceleration techniques. There are well-known, important classes of loops where polynomials and exponentiation *always* suffice to represent the values of the program variables after executing a loop [16,26]. Thus, the lack of support for integer exponentiation in SMT solvers is a major obstacle for the further development of acceleration-based verification techniques.

In this work, we first define a novel SMT theory for integer arithmetic with exponentiation. Then we show how to lift standard SMT solvers to this new theory, resulting in our novel tool SwInE (SMT with Integer Exponentiation).

Our technique is inspired by *incremental linearization*, which has been applied successfully to *real* arithmetic with transcendental functions, including the natural exponential function $\exp_e(x) = e^x$, where $e$ is Euler's number [11]. In this setting, incremental linearization considers $\exp_e$ as an uninterpreted function. If the resulting SMT problem is unsatisfiable, then so is the original problem. If it is satisfiable and the model that was found for $\exp_e$ coincides with the semantics of exponentiation, then the original problem is satisfiable. Otherwise, *lemmas* about $\exp_e$ that rule out the current model are added to the SMT problem, and then its satisfiability is checked again. The name "incremental linearization" is due to the fact that these lemmas only contain linear arithmetic.

The main challenge in adapting this approach to integer exponentiation is to generate suitable lemmas, see Sect. 4.2. Except for so-called *monotonicity lemmas*, none of the lemmas from [11] carry over easily to our setting. In contrast to [11], we do not restrict ourselves to linear lemmas, but also use non-linear, polynomial lemmas. This is because we consider a binary version $\lambda x, y.\ x^y$ of exponentiation, whereas [11] fixes the base to $e$. Thus, in our setting, one obtains *bilinear* lemmas that are linear w.r.t. $x$ as well as $y$, but may contain multiplication between $x$ and $y$ (i.e., they may contain the subterm $x \cdot y$). More precisely, bilinear lemmas arise from *bilinear interpolation*, which is a crucial ingredient of our approach, as it allows us to eliminate *any* model that violates the semantics of exponentiation (Theorem 23). Therefore, the name "incremental linearization" does not fit our approach, which is rather an instance of "counterexample-guided abstraction refinement" (CEGAR) [13].

To summarize, our contributions are as follows: We first propose the new SMT theory EIA for integer arithmetic with exponentiation (Sect. 3). Then, based on novel techniques for generating suitable lemmas, we develop a CEGAR approach for EIA (Sect. 4). We implemented our approach in our novel open-

<sup>1</sup> CVC5 uses a dedicated solver for integer exponentiation with base 2.

source tool SwInE [22,23] and evaluated it on a collection of 4627 EIA benchmarks that we synthesized from verification problems. Our experiments show that our approach is highly effective in practice (Sect. 6). All proofs can be found in [21].

# **2 Preliminaries**

We are working in the setting of *SMT-LIB logic* [4], a variant of many-sorted first-order logic with equality. We now introduce a reduced variant of [4], where we only explain those concepts that are relevant for our work.

In SMT-LIB logic, there is a dedicated Boolean sort **Bool**, and hence formulas are just terms of sort **Bool**. Similarly, there is no distinction between predicates and functions, as predicates are simply functions of type **Bool**.

So in SMT-LIB logic, a *signature* $\Sigma = (\Sigma^S, \Sigma^F, \Sigma^R)$ consists of a set $\Sigma^S$ of *sorts*, a set $\Sigma^F$ of *function symbols*, and a *ranking function* $\Sigma^R : \Sigma^F \to (\Sigma^S)^+$. The meaning of $\Sigma^R(f) = (s_1, \ldots, s_k)$ is that $f$ is a function which maps arguments of the sorts $s_1, \ldots, s_{k-1}$ to a result of sort $s_k$. We write $f : s_1 \ldots s_k$ instead of "$f \in \Sigma^F$ and $\Sigma^R(f) = (s_1, \ldots, s_k)$" if $\Sigma$ is clear from the context. We always allow to implicitly extend $\Sigma$ with arbitrarily many constant function symbols (i.e., function symbols $x$ where $|\Sigma^R(x)| = 1$). Note that SMT-LIB logic only considers closed terms, i.e., terms without free variables, and we are only concerned with quantifier-free formulas, so in our setting, all formulas are ground. Therefore, we refer to these constant function symbols as *variables* to avoid confusion with other, predefined constant function symbols like **true**, $0, \ldots$, see below.

Every SMT-LIB signature is an extension of $\Sigma_{\mathbf{Bool}}$, where $\Sigma^S_{\mathbf{Bool}} = \{\mathbf{Bool}\}$ and $\Sigma^F_{\mathbf{Bool}}$ consists of the following function symbols:

$$\mathbf{true}, \mathbf{false} : \mathbf{Bool} \qquad \neg : \mathbf{Bool}\ \mathbf{Bool} \qquad \wedge, \vee, \Longrightarrow, \Longleftrightarrow : \mathbf{Bool}\ \mathbf{Bool}\ \mathbf{Bool}$$

Note that SMT-LIB logic only considers well-sorted terms. A Σ*-structure* **A** consists of a *universe* $A = \biguplus_{s \in \Sigma^S} A_s$ and an *interpretation function* that maps each function symbol $f : s_1 \ldots s_k$ to a function $f^{\mathbf{A}} : A_{s_1} \times \cdots \times A_{s_{k-1}} \to A_{s_k}$. SMT-LIB logic only considers structures where $A_{\mathbf{Bool}} = \{\mathbf{true}, \mathbf{false}\}$ and all function symbols from $\Sigma_{\mathbf{Bool}}$ are interpreted as usual.

A Σ*-theory* is a class of Σ-structures. For example, consider the extension $\Sigma_{\mathbf{Int}}$ of $\Sigma_{\mathbf{Bool}}$ with the additional sort **Int** and the following function symbols:

$$0, 1, \ldots : \mathbf{Int} \qquad +, -, \cdot, \mathrm{div}, \mathrm{mod} : \mathbf{Int}\ \mathbf{Int}\ \mathbf{Int} \qquad <, \leq, >, \geq, =, \neq : \mathbf{Int}\ \mathbf{Int}\ \mathbf{Bool}$$

Then the $\Sigma_{\mathbf{Int}}$-theory *non-linear integer arithmetic* (NIA)<sup>2</sup> contains all $\Sigma_{\mathbf{Int}}$-structures where $A_{\mathbf{Int}} = \mathbb{Z}$ and all symbols from $\Sigma_{\mathbf{Int}}$ are interpreted as usual.

<sup>2</sup> As we only consider quantifier-free formulas, we omit the prefix "QF\_" in theory names and write, e.g., NIA instead of QF\_NIA. In [4], QF\_NIA is called an *SMT-LIB logic*, which restricts the (first-order) *theory* of integer arithmetic to the quantifier-free fragment. For simplicity, we do not distinguish between SMT-LIB logics and theories.

If **A** is a Σ-structure and Σ′ is a subsignature of Σ, then the *reduct* of **A** to Σ′ is the unique Σ′-structure that interprets its function symbols like **A**. So the theory *linear integer arithmetic* (LIA) consists of the reducts of all elements of NIA to $\Sigma_{\mathbf{Int}} \setminus \{\cdot, \mathrm{div}, \mathrm{mod}\}$.

Given a Σ-structure **A** and a Σ-term $t$, the *meaning* $t^{\mathbf{A}}$ of $t$ results from interpreting all function symbols according to **A**. For function symbols $f$ whose interpretation is fixed by a Σ-theory $\mathcal{T}$, we denote $f$'s interpretation by $f^{\mathcal{T}}$. Given a Σ-theory $\mathcal{T}$, a Σ-formula $\varphi$ (i.e., a Σ-term of type **Bool**) is *satisfiable in* $\mathcal{T}$ if there is an $\mathbf{A} \in \mathcal{T}$ such that $\varphi^{\mathbf{A}} = \mathbf{true}$. Then **A** is called a *model* of $\varphi$, written $\mathbf{A} \models \varphi$. If *every* $\mathbf{A} \in \mathcal{T}$ is a model of $\varphi$, then $\varphi$ is $\mathcal{T}$*-valid*, written $\models_{\mathcal{T}} \varphi$. We write $\psi \equiv_{\mathcal{T}} \varphi$ for $\models_{\mathcal{T}} \psi \Longleftrightarrow \varphi$.

We sometimes also consider *uninterpreted functions*. Then the signature may not only contain the function symbols of the theory under consideration and variables, but also additional non-constant function symbols.

We write "term", "structure", "theory", ... instead of "Σ-term", "Σ-structure", "Σ-theory", ... if Σ is irrelevant or clear from the context. Similarly, we just write "≡" and "valid" instead of "$\equiv_{\mathcal{T}}$" and "$\mathcal{T}$-valid" if $\mathcal{T}$ is clear from the context. Moreover, we use unary minus and $t^c$ (where $t$ is a term of sort **Int** and $c \in \mathbb{N}$) as syntactic sugar, and we use infix notation for binary function symbols.

In the sequel, we use x, y, z, . . . for variables, s, t, p, q, . . . for terms of sort **Int**, ϕ, ψ, . . . for formulas, and a, b, c, d, . . . for integers.

### **3 The SMT Theory EIA**

We now introduce our novel SMT theory for *exponential integer arithmetic*. To this end, we define the signature $\Sigma^{\exp}_{\mathbf{Int}}$, which extends $\Sigma_{\mathbf{Int}}$ with

$$\exp : \mathbf{Int}\ \mathbf{Int}\ \mathbf{Int}.$$

If the 2nd argument of exp is non-negative, then its semantics is as expected, i.e., we are interested in structures **A** such that $\exp^{\mathbf{A}}(c,d) = c^d$ for all $d \geq 0$. However, if the 2nd argument is negative, then we have to use a different semantics. The reason is that we may have $c^d \notin \mathbb{Z}$ if $d < 0$. Intuitively, exp should be a partial function, but all functions are total in SMT-LIB logic. We solve this problem by interpreting $\exp(c,d)$ as $c^{|d|}$. This semantics has previously been used in the literature, and the resulting logic admits a known decidable fragment [5].

**Definition 1 (EIA).** *The theory* exponential integer arithmetic (EIA) *contains all* $\Sigma^{\exp}_{\mathbf{Int}}$*-structures* **A** *with* $\exp^{\mathbf{A}}(c,d) = c^{|d|}$ *whose reduct to* $\Sigma_{\mathbf{Int}}$ *is in* NIA*.*
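A direct transcription of this semantics (our own illustration) also confirms the law $\exp(\exp(x,y),z) = \exp(x, y \cdot z)$ discussed below, since $(x^{|y|})^{|z|} = x^{|y| \cdot |z|} = x^{|y \cdot z|}$:

```python
def eia_exp(c, d):
    """Semantics of exp in EIA: exp(c, d) = c^{|d|} (Definition 1)."""
    return c ** abs(d)

assert eia_exp(3, 4) == 81
assert eia_exp(2, -3) == 8      # negative exponents use the absolute value
assert eia_exp(0, -1) == eia_exp(0, -2) == 0

# The law exp(exp(x, y), z) = exp(x, y·z) is valid under this semantics.
for x in range(-3, 4):
    for y in range(-3, 4):
        for z in range(-3, 4):
            assert eia_exp(eia_exp(x, y), z) == eia_exp(x, y * z)
```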

Alternatively, one could treat $\exp(c,d)$ like an uninterpreted function if $d$ is negative. Doing so would be analogous to the treatment of division by zero in SMT-LIB logic. Then, e.g., $\exp(0,-1) \neq \exp(0,-2)$ would be satisfied by a structure **A** with $\exp^{\mathbf{A}}(c,d) = c^d$ if $d \geq 0$ and $\exp^{\mathbf{A}}(c,d) = d$, otherwise. However, the drawback of this approach is that important laws of exponentiation like

$$\mathsf{exp}(\mathsf{exp}(x,y),z) = \mathsf{exp}(x,y\cdot z)$$

would not be valid. Thus, we focus on the semantics from Definition 1.

### **Algorithm 1:** CEGAR for EIA

```
Input: a Σ_Int^exp-formula ϕ

   // Preprocessing
 1 do
 2     ϕ' ← ϕ;
 3     ϕ ← FoldConstants(ϕ);
 4     ϕ ← Rewrite(ϕ);
 5 while ϕ ≠ ϕ';

   // Refinement Loop
 6 while there is a NIA-model A of ϕ do
 7     if A is a counterexample then
 8         L ← ∅;
 9         for kind ∈ {Symmetry, Monotonicity, Bounding, Interpolation} do
10             L ← L ∪ ComputeLemmas(ϕ, kind);
11         ϕ ← ϕ ∧ ⋀ {ψ ∈ L | A ⊭ ψ};
12     else return sat
13 return unsat
```
# **4 Solving EIA Problems via CEGAR**

We now explain our technique for solving EIA problems, see Algorithm 1. Our goal is to (dis)prove satisfiability of ϕ in EIA. The loop in Line 6 is a CEGAR loop which lifts an SMT solver for NIA (called in Line 6) to EIA. So the *abstraction* consists of using NIA-models instead of EIA-models. Hence, exp is considered to be an uninterpreted function in Line 6, i.e., the SMT solver also searches for an interpretation of exp. If the model found by the SMT solver is a *counterexample* (i.e., if $\exp^{\mathbf{A}}$ conflicts with $\exp^{\mathrm{EIA}}$), then the formula under consideration is refined by adding suitable lemmas in Lines 9–11, and the loop is iterated again.

**Definition 2 (Counterexample).** *We call a NIA-model* **A** *of* ϕ *a* counterexample *if there is a subterm* exp(s, t) *of* ϕ *such that* exp(s, t)<sup>**A**</sup> ≠ (s<sup>**A**</sup>)<sup>|t<sup>**A**</sup>|</sup>.

In the sequel, we first discuss our preprocessings (first loop in Algorithm 1) in Sect. 4.1. Then we explain our refinement (Lines 9–11) in Sect. 4.2. Here, we first introduce the different kinds of lemmas that are used by our implementation in Sect. 4.2.1–4.2.4. If implemented naively, the number of lemmas can get quite large, so we explain how to generate lemmas *lazily* in Sect. 4.2.5. Finally, we conclude this section by stating important properties of Algorithm 1.
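Algorithm 1's control flow can be mimicked in a few lines. The following Python sketch is a deliberately simplified model, not the actual implementation: the "NIA solver" is a brute-force search over a finite domain, there is a single subterm exp(x, y) abstracted by a free value e, and the only lemmas are point lemmas in the spirit of the first interpolation step of Sect. 4.2.4.

```python
from itertools import product

def eia_exp(c, d):
    """EIA semantics of exp (Definition 1)."""
    return c ** abs(d)

def toy_cegar(phi, dom, e_dom):
    """Minimal sketch of the refinement loop of Algorithm 1 for a formula with
    a single subterm exp(x, y), abstracted by a free value e."""
    lemmas = []                            # (c, d, v): x = c and y = d implies e = v
    while True:
        model = next(((x, y, e) for x, y, e in product(dom, dom, e_dom)
                      if phi(x, y, e)
                      and all(x != c or y != d or e == v for c, d, v in lemmas)),
                     None)
        if model is None:
            return "unsat"                 # refined abstraction is unsatisfiable
        x, y, e = model
        if e == eia_exp(x, y):             # not a counterexample: genuine EIA-model
            return {"x": x, "y": y}
        lemmas.append((x, y, eia_exp(x, y)))   # refine: rule out this counterexample

# exp(2, y) > 8 has no EIA-model with 0 <= y <= 2:
assert toy_cegar(lambda x, y, e: x == 2 and 0 <= y <= 2 and e > 8,
                 dom=range(4), e_dom=range(20)) == "unsat"
```

The point lemmas used here are far weaker than the lemmas of Sect. 4.2, so this toy loop may need one refinement per refuted point; it only illustrates the control flow of abstraction, counterexample check, and refinement.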

*Example 3 (Leading Example).* To illustrate our approach, we show how to prove

$$\forall x, y.\ |x| > 2 \land |y| > 2 \implies \mathsf{exp}(\mathsf{exp}(x, y), y) \neq \mathsf{exp}(x, \mathsf{exp}(y, y))$$

by encoding absolute values suitably<sup>3</sup> and proving unsatisfiability of its negation:

$$x^2 > 4 \land y^2 > 4 \land \mathsf{exp}(\mathsf{exp}(x, y), y) = \mathsf{exp}(x, \mathsf{exp}(y, y))$$

#### **4.1 Preprocessings**

In the first loop of Algorithm 1, we preprocess ϕ by alternating *constant folding* (Line 3) and *rewriting* (Line 4) until a fixpoint is reached. Constant folding evaluates subexpressions without variables, where subexpressions exp(c, d) are evaluated to c<sup>|d|</sup>, i.e., according to the semantics of EIA. Rewriting reduces the number of occurrences of exp via the following (terminating) rewrite rules:

$$\begin{aligned} \mathsf{exp}(x, c) &\to x^{|c|} & \text{if } c \in \mathbb{Z} \\ \mathsf{exp}(\mathsf{exp}(x, y), z) &\to \mathsf{exp}(x, y \cdot z) \\ \mathsf{exp}(x, y) \cdot \mathsf{exp}(z, y) &\to \mathsf{exp}(x \cdot z, y) \end{aligned}$$

In particular, the 1st rule allows us to rewrite<sup>4</sup> exp(s, 0) to s<sup>0</sup> = 1 and exp(s, 1) to s<sup>1</sup> = s. Note that the rule

$$\mathsf{exp}(x,y) \cdot \mathsf{exp}(x,z) \to \mathsf{exp}(x,y+z)$$

would be unsound, as the right-hand side would need to be exp(x, |y| + |z|) instead. We leave the question whether such a rule is beneficial to future work.
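The three sound rewrite rules above can be prototyped on a small term representation. The following Python sketch (constant folding omitted; the tuple encoding is ours and purely illustrative) applies them bottom-up:

```python
def rw(t):
    """Apply the rewrite rules of Sect. 4.1 bottom-up.
    Terms: ints, variable names (strings), or tuples ("exp"|"mul"|"pow", a, b)."""
    if not isinstance(t, tuple):
        return t
    op, a, b = t[0], rw(t[1]), rw(t[2])
    if op == "exp" and isinstance(b, int):          # exp(x, c) -> x^|c|
        return ("pow", a, abs(b))
    if op == "exp" and isinstance(a, tuple) and a[0] == "exp":
        return rw(("exp", a[1], ("mul", a[2], b)))  # exp(exp(x, y), z) -> exp(x, y*z)
    if (op == "mul" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] == "exp" and a[2] == b[2]):
        return ("exp", ("mul", a[1], b[1]), a[2])   # exp(x, y)*exp(z, y) -> exp(x*z, y)
    return (op, a, b)

# The rewrite step of Example 4: exp(exp(x, y), y) -> exp(x, y*y)
assert rw(("exp", ("exp", "x", "y"), "y")) == ("exp", "x", ("mul", "y", "y"))
```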

*Example 4 (Preprocessing).* For our leading example, applying the 2nd rewrite rule at the underlined position yields:

$$\begin{aligned} x^2 > 4 \land y^2 > 4 \land \underline{\exp(\exp(x,y),y)} = \exp(x, \exp(y,y))\\ \to x^2 > 4 \land y^2 > 4 \land \underline{\exp(x,y^2)} = \exp(x, \exp(y,y)) \end{aligned} \tag{2}$$

**Lemma 5.** *We have* <sup>ϕ</sup> <sup>≡</sup>EIA FoldConstants(ϕ) *and* <sup>ϕ</sup> <sup>≡</sup>EIA Rewrite(ϕ)*.*

#### **4.2 Refinement**

Our refinement (Lines 9–11 of Algorithm 1) is based on the four kinds of lemmas named in Line 9: *symmetry lemmas*, *monotonicity lemmas*, *bounding lemmas*, and *interpolation lemmas*. In the sequel, we explain how we compute a set L of such lemmas. Then our refinement conjoins

$$\{\psi \in \mathcal{L} \mid \mathbf{A} \not\models \psi\}$$

to ϕ in Line 11. As our lemmas allow us to eliminate *any* counterexample, this set is never empty, see Theorem 23. To compute L, we consider all terms that are *relevant* for the formula ϕ.

<sup>3</sup> We tested several encodings, but surprisingly, this non-linear encoding worked best.

<sup>4</sup> Note that we have exp<sup>EIA</sup>(0, 0) = 0<sup>0</sup> = 1.

**Definition 6 (Relevant Terms).** *A term* exp(s, t) *is* relevant *if* ϕ *has a subterm of the form* exp(±s, ±t)*.*

*Example 7 (Relevant Terms).* For our leading example (2), the relevant terms are all terms of the form exp(±x, ±y<sup>2</sup>), exp(±y, ±y), or exp(±x, ±exp(y, y)).

While the formula ϕ is changed in Line 11 of Algorithm 1, we only conjoin new lemmas to ϕ, and thus, relevant terms can never become irrelevant. Moreover, by construction our lemmas only contain exp-terms that were already relevant before. Thus, the set of relevant terms is not changed by our CEGAR loop.

As mentioned in Sect. 1, our approach may also compute lemmas with non-linear polynomial arithmetic. However, our lemmas are linear if s is an integer constant and t is linear for all subterms exp(s, t) of ϕ. Here, despite the fact that the function "mod" is not contained in the signature of LIA, we also consider literals of the form s mod c = 0 where c ∈ ℕ<sub>+</sub> = ℕ \ {0} as linear. The reason is that, according to the SMT-LIB standard, LIA contains a function<sup>5</sup> divisible<sub>c</sub> : **Int** → **Bool** for each c ∈ ℕ<sub>+</sub>, which yields **true** iff its argument is divisible by c, and hence we have s mod c = 0 iff divisible<sub>c</sub>(s).

In the sequel, ⟦…⟧ means …<sup>**A**</sup>, where **A** is the model from Line 6 of Algorithm 1.

**4.2.1 Symmetry Lemmas** *Symmetry lemmas* encode the relation between terms of the form exp(±s, ±t). For each relevant term exp(s, t), the set L contains the following symmetry lemmas:

$$t \bmod 2 = 0 \implies \mathbf{exp}(s, t) = \mathbf{exp}(-s, t) \tag{SYM\_1}$$

$$t \bmod 2 = 1 \implies \exp(s, t) = -\exp(-s, t) \tag{SYM\_2}$$

$$\mathsf{exp}(s,t) = \mathsf{exp}(s,-t) \tag{SYM\_3}$$

Note that sym<sub>1</sub> and sym<sub>2</sub> are just implications, not equivalences, as, for example, c<sup>|d|</sup> = (−c)<sup>|d|</sup> does not imply d mod 2 = 0 if c = 0.

*Example 8 (Symmetry Lemmas).* For our leading example (2), the following symmetry lemmas would be considered, among others:

$$\text{SYM}\_1: \qquad -y \bmod 2 = 0 \implies \mathsf{exp}(-y, -y) = \mathsf{exp}(y, -y) \tag{3}$$

$$\text{SYM}\_2: \qquad -y \bmod 2 = 1 \implies \mathsf{exp}(-y, -y) = -\mathsf{exp}(y, -y) \tag{4}$$

$$\text{SYM}\_3: \qquad \mathsf{exp}(x, \mathsf{exp}(y, y)) = \mathsf{exp}(x, -\mathsf{exp}(y, y)) \tag{5}$$

$$\text{SYM}\_3: \qquad \mathsf{exp}(y, y) = \mathsf{exp}(y, -y) \tag{6}$$

Note that, e.g., (3) results from the term exp(−y, −y), which is relevant (see Definition 6) even though it does not occur in ϕ.
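Since the symmetry lemmas are statements about the concrete function (c, d) ↦ c<sup>|d|</sup>, their validity can be tested exhaustively on a small grid. The following Python check is our illustration (note that Python's % agrees with SMT-LIB's mod for the positive divisor 2):

```python
def eia_exp(c, d):
    return c ** abs(d)  # EIA semantics of exp

for c in range(-6, 7):
    for d in range(-6, 7):
        if d % 2 == 0:                              # SYM_1
            assert eia_exp(c, d) == eia_exp(-c, d)
        if d % 2 == 1:                              # SYM_2
            assert eia_exp(c, d) == -eia_exp(-c, d)
        assert eia_exp(c, d) == eia_exp(c, -d)      # SYM_3
```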

<sup>5</sup> We excluded these functions from Σ**Int**, as they can be simulated with mod.

To show soundness of our refinement, we have to show that our lemmas are EIA-valid.

**Lemma 9.** *Let* s, t *be terms of sort* **Int***. Then* sym<sub>1</sub>–sym<sub>3</sub> *are EIA-valid.*

**4.2.2 Monotonicity Lemmas** *Monotonicity lemmas* are of the form

$$s_2 \ge s_1 > 1 \land t_2 \ge t_1 > 0 \land (s_2 > s_1 \lor t_2 > t_1) \implies \mathsf{exp}(s_2, t_2) > \mathsf{exp}(s_1, t_1) \tag{MON}$$

i.e., they prohibit violations of monotonicity of exp.

*Example 10 (Monotonicity Lemmas).* For our leading example (2), we obtain, e.g., the following lemmas:

$$x > 1 \land \exp(y, y) > y^2 > 0 \implies \exp(x, \exp(y, y)) > \exp(x, y^2) \tag{7}$$

$$x > 1 \land -\mathsf{exp}(y, y) > y^2 > 0 \implies \mathsf{exp}(x, -\mathsf{exp}(y, y)) > \mathsf{exp}(x, y^2) \tag{8}$$

So for each pair of two different relevant terms exp(s<sub>1</sub>, t<sub>1</sub>), exp(s<sub>2</sub>, t<sub>2</sub>) where ⟦s<sub>2</sub>⟧ ≥ ⟦s<sub>1</sub>⟧ > 1 and ⟦t<sub>2</sub>⟧ ≥ ⟦t<sub>1</sub>⟧ > 0, the set L contains mon.

**Lemma 11.** *Let* s<sub>1</sub>, s<sub>2</sub>, t<sub>1</sub>, t<sub>2</sub> *be terms of sort* **Int***. Then* mon *is EIA-valid.*
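Like the symmetry lemmas, mon can be validated by exhaustive checking on a small grid (illustrative Python, not part of the paper's formal proof):

```python
def eia_exp(c, d):
    return c ** abs(d)  # EIA semantics of exp

# MON: s2 >= s1 > 1, t2 >= t1 > 0, and (s2 > s1 or t2 > t1)
# imply exp(s2, t2) > exp(s1, t1).
for s1 in range(2, 7):
    for s2 in range(s1, 7):
        for t1 in range(1, 7):
            for t2 in range(t1, 7):
                if s2 > s1 or t2 > t1:
                    assert eia_exp(s2, t2) > eia_exp(s1, t1)
```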

**4.2.3 Bounding Lemmas** *Bounding lemmas* provide bounds on relevant terms exp(s, t) where s and t are non-negative. Together with symmetry lemmas, they also give rise to bounds for the cases where s or t are negative.

For each relevant term exp(s, t) where s and t are non-negative, the following lemmas are contained in L:

$$t = 0 \implies \mathsf{exp}(s, t) = 1 \tag{BND\_1}$$

$$t = 1 \implies \exp(s, t) = s \tag{\text{BND}\_2}$$

$$s = 0 \land t \neq 0 \iff \mathsf{exp}(s, t) = 0 \tag{\text{BND}\_3}$$

$$s = 1 \implies \mathsf{exp}(s, t) = 1 \tag{BND\_4}$$

$$s+t>4\land s>1\land t>1 \implies \mathsf{exp}(s,t)>s\cdot t+1\tag{\text{BND}\_5}$$

The cases t ∈ {0, 1} are also addressed by our first rewrite rule (see Sect. 4.1). However, this rewrite rule only applies if t is an integer constant. In contrast, the first two lemmas above apply if t evaluates to 0 or 1 in the current model.
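The five bounding lemmas are likewise machine-checkable under the EIA semantics (illustrative Python):

```python
def eia_exp(c, d):
    return c ** abs(d)  # EIA semantics of exp

for s in range(0, 12):
    for t in range(0, 12):
        if t == 0:
            assert eia_exp(s, t) == 1                       # BND_1
        if t == 1:
            assert eia_exp(s, t) == s                       # BND_2
        assert (s == 0 and t != 0) == (eia_exp(s, t) == 0)  # BND_3
        if s == 1:
            assert eia_exp(s, t) == 1                       # BND_4
        if s + t > 4 and s > 1 and t > 1:
            assert eia_exp(s, t) > s * t + 1                # BND_5
```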

*Example 12 (Bounding Lemmas).* For our leading example (2), the following bounding lemmas would be considered, among others:


**Lemma 13.** *Let* s, t *be terms of sort* **Int***. Then* bnd<sub>1</sub>–bnd<sub>5</sub> *are EIA-valid.*

The bounding lemmas are defined in such a way that they provide lower bounds for exp(s, t) for almost all non-negative values of s and t. The reason why we focus on lower bounds is that polynomials can only bound exp(s, t) from above for finitely many values of s and t. The missing (lower and upper) bounds are provided by *interpolation lemmas*.

**4.2.4 Interpolation Lemmas** In addition to bounding lemmas, we use *interpolation lemmas* that are constructed via *bilinear interpolation* to provide bounds. Here, we assume that the arguments of exp are positive, as negative arguments are handled by symmetry lemmas, and bounding lemmas yield tight bounds if at least one argument of exp is 0. The correctness of interpolation lemmas relies on the following observation.

**Lemma 14.** *Let* f : ℝ<sub>+</sub> → ℝ<sub>+</sub> *be convex,* w<sub>1</sub>, w<sub>2</sub> ∈ ℝ<sub>+</sub>*, and* w<sub>1</sub> < w<sub>2</sub>*. Then*

$$\forall x \in [w\_1, w\_2]. \ f(x) \le f(w\_1) + \frac{f(w\_2) - f(w\_1)}{w\_2 - w\_1} \cdot (x - w\_1) \qquad and$$

$$\forall x \in \mathbb{R}_+ \setminus (w_1, w_2).\ f(x) \ge f(w_1) + \frac{f(w_2) - f(w_1)}{w_2 - w_1} \cdot (x - w_1).$$

Here, [w<sub>1</sub>, w<sub>2</sub>] and (w<sub>1</sub>, w<sub>2</sub>) denote closed and open real intervals. Note that the right-hand side of the inequations above is the linear interpolant of f between w<sub>1</sub> and w<sub>2</sub>. Intuitively, it corresponds to the secant of f between the points (w<sub>1</sub>, f(w<sub>1</sub>)) and (w<sub>2</sub>, f(w<sub>2</sub>)), and thus the lemma follows from convexity of f.

Let exp(s, t) be relevant, ⟦s⟧ = c > 0, ⟦t⟧ = d > 0, and ⟦exp⟧(c, d) ≠ c<sup>d</sup>, i.e., we want to prohibit the current interpretation of exp(s, t).

**Interpolation Lemmas for Upper Bounds.** First assume ⟦exp⟧(c, d) > c<sup>d</sup>, i.e., to rule out this counterexample, we need a lemma that provides a suitable upper bound for exp(c, d). Let c′, d′ ∈ ℕ<sub>+</sub> and:

$$\begin{aligned} c^- := \min(c, c') & \quad c^+ := \max(c, c') & \quad d^- := \min(d, d') & \quad d^+ := \max(d, d')\\ [c^\pm] & := [c^- \dots c^+] & \quad [d^\pm] & := [d^- \dots d^+] \end{aligned}$$

Here, [a .. b] denotes a closed integer interval. Then we first use d<sup>−</sup>, d<sup>+</sup> for linear interpolation w.r.t. the 2nd argument of λx, y. x<sup>y</sup>. To this end, let

$$\operatorname{ip}\_2^{[d^\pm]}(x,y) := x^{d^-} + \frac{x^{d^+} - x^{d^-}}{d^+ - d^-} \cdot (y - d^-),$$

where <sup>a</sup>⁄<sub>b</sub> is the usual quotient if b ≠ 0, and we define <sup>a</sup>⁄<sub>0</sub> := 0. So if d<sup>−</sup> < d<sup>+</sup>, then ip<sub>2</sub><sup>[d±]</sup>(x, y) corresponds to the linear interpolant of x<sup>y</sup> w.r.t. y between d<sup>−</sup> and d<sup>+</sup>. Then ip<sub>2</sub><sup>[d±]</sup>(x, y) is a suitable upper bound, as

$$\forall x \in \mathbb{N}\_+, y \in [d^\pm]. \ x^y \le \text{ip}\_2^{[d^\pm]}(x, y) \tag{11}$$

follows from Lemma 14. Hence, we could derive the following EIA-valid lemma:<sup>6</sup>

$$s > 0 \land t \in [d^{\pm}] \implies \mathsf{exp}(s, t) \le \mathrm{ip}_2^{[d^{\pm}]}(s, t) \tag{IP\_1}$$

*Example 15 (Linear Interpolation w.r.t.* y*).* Let ⟦exp(s, t)⟧ = ⟦exp⟧(3, 9) > 3<sup>9</sup>, i.e., we have c = 3 and d = 9. Moreover, assume c′ = d′ = 1, i.e., we get c<sup>−</sup> = 1, c<sup>+</sup> = 3, d<sup>−</sup> = 1, and d<sup>+</sup> = 9. Then

$$\mathrm{ip}_2^{[d^\pm]}(x,y) = \mathrm{ip}_2^{[1..9]}(x,y) = x^1 + \frac{x^9 - x^1}{9 - 1} \cdot (y - 1) = x + \frac{x^9 - x}{8} \cdot (y - 1).$$

Hence, ip<sub>1</sub> corresponds to

$$s > 0 \land t \in [1, 9] \implies \exp(s, t) \le s + \frac{s^9 - s}{8} \cdot (t - 1).$$

This lemma would be violated by our counterexample, as we have

$$\left[\!\!\left[ s + \frac{s^9 - s}{8} \cdot (t - 1) \right]\!\!\right] = 3 + \frac{3^9 - 3}{8} \cdot 8 = 3^9 < [\![\mathsf{exp}]\!](3, 9) = [\![\mathsf{exp}(s, t)]\!].$$
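The interpolant of Example 15 and the upper-bound property (11) can be reproduced numerically; the sketch below (our illustration) uses `Fraction` to keep the division by d<sup>+</sup> − d<sup>−</sup> exact:

```python
from fractions import Fraction

def ip2(d_lo, d_hi, x, y):
    """Linear interpolant of x**y w.r.t. y between d_lo and d_hi (d_lo < d_hi)."""
    return x**d_lo + Fraction(x**d_hi - x**d_lo, d_hi - d_lo) * (y - d_lo)

# Tight at the counterexample point of Example 15 ...
assert ip2(1, 9, 3, 9) == 3**9
# ... and an upper bound on x**y for x >= 1 and y in [1, 9], as in (11):
for x in range(1, 6):
    for y in range(1, 10):
        assert x**y <= ip2(1, 9, x, y)
```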

However, the degree of ip<sub>2</sub><sup>[d±]</sup>(s, t) depends on d<sup>+</sup>, which in turn depends on the model that was found by the underlying SMT solver. Thus, the degree of ip<sub>2</sub><sup>[d±]</sup>(s, t) can get very large, which is challenging for the underlying solver.

So we next use c<sup>−</sup>, c<sup>+</sup> for linear interpolation w.r.t. the 1st argument of λx, y. x<sup>y</sup>, resulting in

$$\operatorname{ip}\_1^{[c^\pm]}(x,y) := (c^-)^y + \frac{(c^+)^y - (c^-)^y}{c^+ - c^-} \cdot (x - c^-).$$

Then due to Lemma 14, ip<sub>1</sub><sup>[c±]</sup>(x, y) is also an upper bound on the exponentiation function, i.e., we have

$$\forall y \in \mathbb{N}\_+, x \in [c^{\pm}]. \ x^y \le \text{ip}\_1^{[c^{\pm}]}(x, y). \tag{12}$$

Note that we have (y − d<sup>−</sup>) / (d<sup>+</sup> − d<sup>−</sup>) ∈ [0, 1] for all y ∈ [d<sup>±</sup>], and thus

$$\operatorname{ip}\_2^{[d^\pm]}(x,y) = x^{d^-} \cdot \left(1 - \frac{y - d^-}{d^+ - d^-}\right) + x^{d^+} \cdot \frac{y - d^-}{d^+ - d^-}$$

<sup>6</sup> Strictly speaking, this lemma is not a Σ<sup>exp</sup><sub>**Int**</sub>-term if d<sup>+</sup> > d<sup>−</sup>, as the right-hand side makes use of division in this case. However, an equivalent Σ<sup>exp</sup><sub>**Int**</sub>-term can clearly be obtained by multiplying with the divisor.

is monotonically increasing in both x<sup>d−</sup> and x<sup>d+</sup>. Hence, in the definition of ip<sub>2</sub><sup>[d±]</sup>, we can approximate x<sup>d−</sup> and x<sup>d+</sup> with their upper bounds ip<sub>1</sub><sup>[c±]</sup>(x, d<sup>−</sup>) and ip<sub>1</sub><sup>[c±]</sup>(x, d<sup>+</sup>) that can be derived from (12). Then (11) yields

$$\forall x \in [c^{\pm}], y \in [d^{\pm}]. \ x^{y} \le \text{ip}^{[c^{\pm}][d^{\pm}]}(x, y) \tag{13}$$

where

$$\mathrm{ip}^{[c^{\pm}][d^{\pm}]}(x,y) := \mathrm{ip}_{1}^{[c^{\pm}]}(x,d^{-}) + \frac{\mathrm{ip}_{1}^{[c^{\pm}]}(x,d^{+}) - \mathrm{ip}_{1}^{[c^{\pm}]}(x,d^{-})}{d^{+} - d^{-}} \cdot (y - d^{-}).$$

So the set L contains the lemma

$$s \in [c^{\pm}] \land t \in [d^{\pm}] \implies \mathsf{exp}(s, t) \le \mathrm{ip}^{[c^{\pm}][d^{\pm}]}(s, t), \tag{IP\_2}$$

which is valid due to (13), and rules out any counterexample with ⟦exp⟧(c, d) > c<sup>d</sup>, as ip<sup>[c±][d±]</sup>(c, d) = c<sup>d</sup>.

*Example 16 (Bilinear Interpolation, Example* 15 *continued).* In our example, we have:

$$\begin{aligned} \mathrm{ip}\_1^{[c^\pm]}(x,y) &= \mathrm{ip}\_1^{[1..3]}(x,y) = 1^y + \frac{3^y - 1^y}{3 - 1} \cdot (x - 1) = 1 + \frac{3^y - 1}{2} \cdot (x - 1) \\ \mathrm{ip}\_1^{[c^\pm]}(s, d^-) &= \mathrm{ip}\_1^{[1..3]}(s, 1) = 1 + \frac{3 - 1}{2} \cdot (s - 1) = s \\ \mathrm{ip}\_1^{[c^\pm]}(s, d^+) &= \mathrm{ip}\_1^{[1..3]}(s, 9) = 1 + \frac{3^9 - 1}{2} \cdot (s - 1) = 1 + 9841 \cdot (s - 1) \end{aligned}$$

Hence, we obtain the lemma

$$s \in [1, 3] \land t \in [1, 9] \implies \exp(s, t) \le s + \frac{1 + 9841 \cdot (s - 1) - s}{8} \cdot (t - 1).$$

This lemma is violated by our counterexample, as we have

$$\left[\!\!\left[ s + \frac{1 + 9841 \cdot (s - 1) - s}{8} \cdot (t - 1) \right]\!\!\right] = 3^9 < [\![\mathsf{exp}]\!](3, 9) = [\![\mathsf{exp}(s, t)]\!].$$

ip<sub>2</sub> relates exp(s, t) with the *bilinear* function ip<sup>[c±][d±]</sup>(s, t), i.e., this function is linear w.r.t. both s and t, but it multiplies s and t. Thus, if s is an integer constant and t is linear, then the resulting lemma is linear, too.
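The bilinear interpolation of Example 16 can also be checked directly. The sketch below (our notation, not the paper's implementation) confirms property (13) and tightness at (c, d) = (3, 9):

```python
from fractions import Fraction

def ip1(c_lo, c_hi, x, y):
    """Linear interpolant of x**y w.r.t. x between c_lo and c_hi."""
    return c_lo**y + Fraction(c_hi**y - c_lo**y, c_hi - c_lo) * (x - c_lo)

def ip_bilin(c_lo, c_hi, d_lo, d_hi, x, y):
    """ip^{[c±][d±]}: interpolate w.r.t. y between the bounds ip1(..., d_lo/d_hi)."""
    lo, hi = ip1(c_lo, c_hi, x, d_lo), ip1(c_lo, c_hi, x, d_hi)
    return lo + (hi - lo) * Fraction(y - d_lo, d_hi - d_lo)

assert ip_bilin(1, 3, 1, 9, 3, 9) == 3**9   # tight at (c, d) = (3, 9)
for x in range(1, 4):                        # upper bound on [1,3] x [1,9], cf. (13)
    for y in range(1, 10):
        assert x**y <= ip_bilin(1, 3, 1, 9, x, y)
```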

To compute interpolation lemmas, a second point (c′, d′) is needed. In our implementation, we store all points (c, d) where interpolation has previously been applied and use the one which is closest to the current one. The same heuristic is used to compute *secant lemmas* in [11]. For the 1st interpolation step, we use (c′, d′) = (c, d). In this case, ip<sub>2</sub> simplifies to s = c ∧ t = d ⟹ exp(s, t) ≤ c<sup>d</sup>.

**Lemma 17.** *Let* c<sup>+</sup> ≥ c<sup>−</sup> > 0 *and* d<sup>+</sup> ≥ d<sup>−</sup> > 0*. Then* ip<sub>2</sub> *is EIA-valid.*

**Interpolation Lemmas for Lower Bounds.** While bounding lemmas already yield lower bounds, the bounds provided by bnd<sub>5</sub> are not exact, in general. Hence, if ⟦exp⟧(c, d) < c<sup>d</sup>, then we also use bilinear interpolation to obtain a precise lower bound for exp(c, d). Dually to (11) and (12), Lemma 14 implies:

$$\forall x, y \in \mathbb{N}\_{+}. \ x^{y} \ge \mathrm{ip}\_{2}^{[d..d+1]}(x, y) \quad \text{(14)} \quad \forall x, y \in \mathbb{N}\_{+}. \ x^{y} \ge \mathrm{ip}\_{1}^{[c..c+1]}(x, y) \quad \text{(15)}$$

Additionally, we also obtain

$$\forall x, y \in \mathbb{N}\_{+} . x^{y+1} - x^{y} \ge \mathrm{ip}\_{1}^{[c..c+1]}(x, y+1) - \mathrm{ip}\_{1}^{[c..c+1]}(x, y) \tag{16}$$

from Lemma 14. The reason is that for f(x) := x<sup>y+1</sup> − x<sup>y</sup>, the right-hand side of (16) is equal to the linear interpolant of f between c and c + 1. Moreover, f is convex, as f(x) = x<sup>y</sup> · (x − 1) where for any fixed y ∈ ℕ<sub>+</sub>, both x<sup>y</sup> and x − 1 are non-negative, monotonically increasing, and convex on ℝ<sub>+</sub>.

If y ≥ d, then ip<sub>2</sub><sup>[d..d+1]</sup>(x, y) = x<sup>d</sup> + (x<sup>d+1</sup> − x<sup>d</sup>) · (y − d) is monotonically increasing in the first occurrence of x<sup>d</sup>, and in x<sup>d+1</sup> − x<sup>d</sup>. Thus, by approximating x<sup>d</sup> and x<sup>d+1</sup> − x<sup>d</sup> with their lower bounds from (15) and (16), (14) yields

$$\forall x \in \mathbb{N}_+, y \ge d.\ x^y \ge \mathrm{ip}_1^{[c..c+1]}(x, d) + (\mathrm{ip}_1^{[c..c+1]}(x, d+1) - \mathrm{ip}_1^{[c..c+1]}(x, d)) \cdot (y - d)$$

$$= \mathrm{ip}^{[c..c+1][d..d+1]}(x, y). \tag{17}$$

So dually to ip<sub>2</sub>, the set L contains the lemma

$$s \ge 1 \land t \ge d \implies \mathsf{exp}(s, t) \ge \mathrm{ip}^{[c..c+1][d..d+1]}(s, t) \tag{IP\_3}$$

which is valid due to (17) and rules out any counterexample with ⟦exp⟧(c, d) < c<sup>d</sup>, as ip<sup>[c..c+1][d..d+1]</sup>(c, d) = c<sup>d</sup>.

*Example 18 (Interpolation, Lower Bounds).* Let ⟦exp(s, t)⟧ = ⟦exp⟧(3, 9) < 3<sup>9</sup>, i.e., we have c = 3 and d = 9. Then

$$\begin{aligned} \mathrm{ip}_1^{[3..4]}(x,9) &= 3^9 + (4^9 - 3^9) \cdot (x - 3) = 19683 + 242461 \cdot (x - 3) \\ \mathrm{ip}_1^{[3..4]}(x,10) &= 3^{10} + (4^{10} - 3^{10}) \cdot (x - 3) = 59049 + 989527 \cdot (x - 3) \\ \mathrm{ip}^{[3..4][9..10]}(x,y) &= \mathrm{ip}_1^{[3..4]}(x,9) + (\mathrm{ip}_1^{[3..4]}(x,10) - \mathrm{ip}_1^{[3..4]}(x,9)) \cdot (y - 9) \end{aligned}$$

and thus we obtain the lemma

$$s \ge 1 \land t \ge 9 \implies \mathsf{exp}(s, t) \ge 747066 \cdot s \cdot t - 6481133 \cdot s - 2201832 \cdot t + 19108788.$$

It is violated by our counterexample, as we have

$$[\![747066 \cdot s \cdot t - 6481133 \cdot s - 2201832 \cdot t + 19108788]\!] = 3^9 > [\![\mathsf{exp}]\!](3, 9) = [\![\mathsf{exp}(s, t)]\!].$$
**Lemma 19.** *Let* c, d ∈ ℕ<sub>+</sub>*. Then* ip<sub>3</sub> *is EIA-valid.*
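The concrete lemma of Example 18 can be double-checked numerically (illustrative Python): the bilinear lower bound is exact at the interpolation points and stays below x<sup>y</sup> on the region s ≥ 1, t ≥ 9.

```python
def lower(s, t):
    """Right-hand side of the lemma derived in Example 18."""
    return 747066*s*t - 6481133*s - 2201832*t + 19108788

assert lower(3, 9) == 3**9   # exact at the counterexample point (3, 9)
assert lower(4, 9) == 4**9   # ... and at the second interpolation point
for s in range(1, 8):        # sampled check of the lower-bound property (17)
    for t in range(9, 15):
        assert s**t >= lower(s, t)
```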

**4.2.5 Lazy Lemma Generation** In practice, it is not necessary to compute the entire set of lemmas L. Instead, we can stop as soon as L contains a single lemma which is violated by the current counterexample. However, such a strategy would result in a quite fragile implementation, as its behavior would heavily depend on the order in which lemmas are computed, which in turn depends on low-level details like the order of iteration over sets, etc. So instead, we improve Lines 9–11 of Algorithm 1 and use the following precedence on our four kinds of lemmas:

symmetry ≻ monotonicity ≻ bounding ≻ interpolation

Then we compute all lemmas of the same kind, starting with symmetry lemmas, and we only proceed with the next kind if none of the lemmas computed so far is violated by the current counterexample. The motivation for the order above is as follows: Symmetry lemmas obtain the highest precedence, as the other kinds of lemmas depend on them for restricting exp(s, t) in the case that s or t is negative. As the coefficients in interpolation lemmas for exp(s, t) grow exponentially w.r.t. t (see, e.g., Example 18), interpolation lemmas get the lowest precedence. Finally, we prefer monotonicity lemmas over bounding lemmas, as monotonicity lemmas are linear (if the arguments of exp are linear), whereas bnd<sub>5</sub> may be non-linear.
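The lazy strategy can be summarized in a few lines; the helper names in the following Python sketch are ours and purely illustrative:

```python
KINDS = ("symmetry", "monotonicity", "bounding", "interpolation")  # precedence order

def refine(phi, model, compute_lemmas, violated):
    """Lazy variant of Lines 9-11 of Algorithm 1: generate lemmas kind by kind
    and stop at the first kind contributing a lemma that the current
    counterexample violates."""
    for kind in KINDS:
        new = [psi for psi in compute_lemmas(phi, kind) if violated(model, psi)]
        if new:
            return phi + new  # conjoin only the violated lemmas
    raise AssertionError("unreachable by the Progress Theorem (Theorem 23)")
```

Because lemmas of an earlier kind are preferred whenever they suffice, the behavior no longer depends on the iteration order within a single kind.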

*Example 20 (Leading Example Finished).* We now finish our leading example which, after preprocessing, looks as follows (see Example 4):

$$x^2 > 4 \land y^2 > 4 \land \mathsf{exp}(x, y^2) = \mathsf{exp}(x, \mathsf{exp}(y, y)) \tag{2}$$

Then our implementation generates 12 symmetry lemmas, 4 monotonicity lemmas, and 8 bounding lemmas before proving unsatisfiability, including

$$(3), (4), (5), (6), (7), (8), (9), \text{ and } (10).$$

These lemmas suffice to prove unsatisfiability for the case x > 2 (the cases x ∈ [−2 .. 2] or y ∈ [−2 .. 2] are trivial). For example, if y < −2 and −y mod 2 = 0, we get

$$\begin{aligned} y < -2 &\overset{(10)}{\implies} \mathsf{exp}(-y, -y) > y^2 + 1 \overset{(3)}{\implies} \mathsf{exp}(y, -y) > y^2 + 1 \\ &\overset{(6)}{\implies} \mathsf{exp}(y, y) > y^2 + 1 \overset{(7)}{\implies} \mathsf{exp}(x, \mathsf{exp}(y, y)) > \mathsf{exp}(x, y^2) \overset{(2)}{\implies} \mathsf{false} \end{aligned}$$

and for the cases y > 2 and y < −2 ∧ −y mod 2 = 1, unsatisfiability can be shown similarly. For the case x < −2, 5 more symmetry lemmas, 2 more monotonicity lemmas, and 3 more bounding lemmas are used. The remaining 3 symmetry lemmas and 3 bounding lemmas are not used in the final proof of unsatisfiability.

While our leading example can be solved without interpolation lemmas, in general, interpolation lemmas are a crucial ingredient of our approach.

*Example 21.* Consider the formula

$$1 < x < y \land 0 < z \land \mathsf{exp}(x, z) < \mathsf{exp}(y, z).$$

Our implementation first rules out 33 counterexamples using 7 bounding lemmas and 42 interpolation lemmas in ∼0.1 seconds, before finding the model x = 21, y = 721, and z = 4. Recall that interpolation lemmas are only used if a counterexample cannot be ruled out by any other kind of lemma. So without interpolation lemmas, our implementation could not solve this example.

Our main soundness theorem follows from soundness of our preprocessings (Lemma 5) and the fact that all of our lemmas are EIA-valid (Lemmas 9, 11, 13, 17, and 19).

**Theorem 22 (Soundness of Algorithm** 1**).** *If Algorithm 1 returns* **sat***, then* ϕ *is satisfiable in EIA. If Algorithm 1 returns* **unsat***, then* ϕ *is unsatisfiable in EIA.*

Another important property of Algorithm 1 is that it can eliminate *any* counterexample, and hence it makes progress in every iteration.

**Theorem 23 (Progress Theorem).** *If* **A** *is a counterexample and* L *is computed as in Algorithm 1, then*

$$\mathbf{A} \not\models \bigwedge \mathcal{L}.$$

Despite Theorems 22 and 23, EIA is of course undecidable, and hence Algorithm 1 is incomplete. For example, it does not terminate for the input formula

$$y \neq 0 \land \mathsf{exp}(2, x) = \mathsf{exp}(3, y). \tag{18}$$

Here, to prove unsatisfiability, one needs to know that 2<sup>|x|</sup> is 1 or even, but 3<sup>|y|</sup> is odd and greater than 1 (unless y = 0). This cannot be derived from the lemmas used by our approach. Thus, Algorithm 1 would refine the formula (18) infinitely often.

Note that monotonicity lemmas are important, even though they are not required to prove Theorem 23. The reason is that *all* (usually infinitely many) counterexamples must be eliminated to prove **unsat**. For instance, reconsider Example 20, where the monotonicity lemma (7) eliminates infinitely many counterexamples with ⟦exp(x, exp(y, y))⟧ ≤ ⟦exp(x, y<sup>2</sup>)⟧. In contrast, Theorem 23 only guarantees that every single counterexample can be eliminated. Consequently, our implementation does not terminate on our leading example if monotonicity lemmas are disabled.

# **5 Related Work**

The most closely related work applies *incremental linearization* to NIA, or to non-linear real arithmetic with transcendental functions (NRAT). Like our approach, incremental linearization is an instance of the CEGAR paradigm: An initial abstraction (where certain predefined functions are considered as uninterpreted functions) is refined via linear lemmas that rule out the current counterexample.

Our approach is inspired by, but differs significantly from the approach for linearization of NRAT from [11]. There, non-linear polynomials are linearized as well, whereas we leave the handling of polynomials to the backend solver. Moreover, [11] uses linear lemmas only, whereas we also use bilinear lemmas. Furthermore, [11] fixes the base to Euler's number e, whereas we consider a binary version of exponentiation.

The only lemmas that easily carry over from [11] are monotonicity lemmas. While [11] also uses symmetry lemmas, they express properties of the sine function, i.e., they are fundamentally different from ours. Our bounding lemmas are related to the "lower bound" and "zero" lemmas from [11], but there, λx. e<sup>x</sup> is trivially bounded by 0. Interpolation lemmas are related to the "tangent" and "secant" lemmas from [11]. However, tangent lemmas make use of first derivatives, so they are not expressible with integer arithmetic in our setting, as we have ∂⁄∂y x<sup>y</sup> = x<sup>y</sup> · ln x. Secant lemmas are essentially obtained by linear interpolation, so our interpolation lemmas can be seen as a generalization of secant lemmas to binary functions. A preprocessing by rewriting is not considered in [11].

In [10], incremental linearization is applied to NIA. The lemmas that are used in [10] are similar to those from [11], so they differ fundamentally from ours, too.

Further existing approaches for NRAT are based on interval propagation [14,24]. As observed in [11], interval propagation effectively computes a piecewise *constant* approximation, which is less expressive than our bilinear approximations.

Recently, a novel approach for NRAT based on the *topological degree test* has been proposed [12,30]. Its strength is finding irrational solutions more often than other approaches for NRAT. Hence, this line of work is orthogonal to ours.

EIA could also be tackled by combining NRAT techniques with branch-and-bound, but the following example shows that doing so is not promising.

*Example 24.* Consider the formula x = exp(3, y) ∧ y > 0. To tackle it with existing solvers, we have to encode it using the natural exponential function:

$$e^z = 3 \land x = e^{y \cdot z} \land y > 0 \tag{19}$$

Here x and y range over the integers and z ranges over the reals. Any model of (19) satisfies z = ln 3, where ln 3 is irrational. As finding such models is challenging, the leading tools MathSat [9] and CVC5 [2] fail for e<sup>z</sup> = 3.

MetiTarski [1] integrates decision procedures for real closed fields and approximations for transcendental functions into the theorem prover Metis [27] to prove theorems about the reals. In a related line of work, iSAT3 [14] has been coupled with SPASS [35]. Clearly, these approaches differ fundamentally from ours.

Recently, the complexity of a decidable extension of linear integer arithmetic with exponentiation has been investigated [5]. It is equivalent to EIA without the functions "·", "div", and "mod", and where the first argument of all occurrences of exp must be the same constant. Integrating decision procedures for fragments like this one into our approach is an interesting direction for future work.

# **6 Implementation and Evaluation**

**Implementation.** We implemented our approach in our novel tool SwInE. It is based on SMT-Switch [31], a library that offers a unified interface for various SMT solvers. SwInE uses the backend solvers Z3 4.12.2 [32] and CVC5 1.0.8 [2]. It supports incrementality and can compute models for variables, but not yet for uninterpreted functions, due to limitations inherited from SMT-Switch.

The backend solver (which defaults to Z3) can be selected via command-line flags. For more information on SwInE and a precompiled release, we refer to [22,23].

**Benchmarks.** To evaluate our approach, we synthesized a large collection of EIA problems from verification benchmarks for safety, termination, and complexity analysis. More precisely, we ran our verification tool LoAT [18] on the benchmarks for *linear Constrained Horn Clauses (CHCs)* with *linear integer arithmetic* from the *CHC Competitions* 2022 and 2023 [8] as well as on the benchmarks for *Termination* and *Complexity of Integer Transition Systems* from the *Termination Problems Database (TPDB)* [34], the benchmark set of the *Termination and Complexity Competition* [25], and extracted all SMT problems with exponentiation that LoAT created while analyzing these benchmarks. Afterwards, we removed duplicates.

The resulting benchmark set consists of 4627 SMT problems, which are available at [22].


**Evaluation.** We ran SwInE with both supported backend solvers (Z3 and CVC5). To evaluate the impact of the different components of our approach, we also tested with configurations where we disabled rewriting, symmetry lemmas, bounding lemmas, interpolation lemmas, or monotonicity lemmas. All experiments were performed on StarExec [33] with a wall clock timeout of 10s and a memory limit of 128GB per example. We chose a small timeout, as LoAT usually has to discharge many SMT problems to solve a single verification task. So in our setting, each individual SMT problem should be solved quickly.

The results can be seen in Tables 1, 2, 3, and 4, where VB means "virtual best". All but 48 of the 4627 benchmarks can be solved, and all unsolved benchmarks are Complexity Problems. All CHC Comp Problems can be solved with both backend solvers. Considering Complexity and Termination Problems, Z3 and CVC5 perform almost equally well on unsatisfiable instances, but Z3 solves more satisfiable instances.

**Table 1.** CHC Comp '22 – Results

**Table 2.** CHC Comp '23 – Results

**Fig. 1.** CHC Comp '22 – Runtime

**Fig. 2.** CHC Comp '23 – Runtime

**Table 3.** Complexity – Results

**Fig. 3.** Complexity – Runtime

| backend | configuration           | **sat** | **unsat** | **unknown** |
|---------|-------------------------|---------|-----------|-------------|
| Z3      | default                 | 223     | 431       | 0           |
| CVC5    | default                 | 208     | 430       | 16          |
| VB      | default                 | 223     | 431       | 0           |
| Z3      | no rewriting            | 223     | 431       | 0           |
| Z3      | no symmetry             | 223     | 431       | 0           |
| Z3      | no bounding             | 177     | 429       | 48          |
| Z3      | no interpolation        | 15      | 429       | 210         |
| Z3      | no monotonicity         | 223     | 431       | 0           |
| Z3      | no rewriting, no lemmas | 7       | 428       | 219         |
| CVC5    | no rewriting            | 208     | 430       | 16          |
| CVC5    | no symmetry             | 208     | 430       | 16          |
| CVC5    | no bounding             | 171     | 428       | 55          |
| CVC5    | no interpolation        | 10      | 428       | 216         |
| CVC5    | no monotonicity         | 208     | 430       | 16          |
| CVC5    | no rewriting, no lemmas | 7       | 428       | 219         |

**Table 4.** Termination – Results

**Fig. 4.** Termination – Runtime

Regarding the different components of our approach, our evaluation shows that the impact of rewriting is quite significant. For example, it enables Z3 to solve 81 additional Complexity Problems. Symmetry lemmas enable Z3 to solve more Complexity Problems, but they are less helpful for CVC5. In fact, symmetry lemmas are needed for most of the examples where Z3 succeeds but CVC5 fails, so they seem to be challenging for CVC5, presumably due to the use of "mod". Bounding and interpolation lemmas are crucial for proving satisfiability. In particular, disabling interpolation lemmas harms more than disabling any other feature, which shows their importance. For example, Z3 can only prove satisfiability of 3 CHC Comp Problems without interpolation lemmas.

Interestingly, only CVC5 benefits from monotonicity lemmas, which enable it to solve more Complexity Problems. From our experience, CVC5 explores the search space in a more systematic way than Z3, so that subsequent candidate models often have a similar structure. Then monotonicity lemmas can help CVC5 to find structurally different candidate models.

Remarkably, disabling a single component does not reduce the number of **unsat**'s significantly. Thus, we also evaluated configurations where *all* components were disabled, so that exp is just an uninterpreted function. This reduces the number of **sat** results dramatically, but most **unsat** instances can still be solved. Hence, most of them do not require reasoning about exponentials, so it would be interesting to obtain instances where proving **unsat** is more challenging.

The runtime of SwInE can be seen in Figs. 1, 2, 3, and 4. Most instances can be solved in a fraction of a second, as desired for our use case. Moreover, CVC5 can solve more instances in the first half second, but Z3 can solve more instances later on. We refer to [22] for more details on our evaluation.

**Validation.** We implemented sanity checks for both **sat** and **unsat** results. For **sat**, we evaluate the input problem using EIA semantics for exp, and the current model for all variables. For **unsat**, assume that the input problem ϕ contains the subterms exp(s<sub>0</sub>, t<sub>0</s</sub>), ..., exp(s<sub>n</sub>, t<sub>n</sub>). Then we enumerate all SMT problems

$$\varphi \wedge \bigwedge_{i=0}^{n} \left( t_i = c_i \wedge \exp(s_i, t_i) = s_i^{c_i} \right) \qquad \text{where } c_0, \ldots, c_n \in [0 \,..\, k] \text{ for some } k \in \mathbb{N}$$

(we used k = 10). If any of them is satisfiable in NIA, then ϕ is satisfiable in EIA. None of these checks revealed any problems.
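Both sanity checks can be illustrated by a minimal pure-Python sketch on a toy constraint with a single exponentiation subterm; the formula `phi`, the model, and the brute-force enumeration are illustrative stand-ins, not SwInE's implementation:

```python
# Toy instance: phi constrains x = exp(s, t) under EIA semantics exp(b, e) = b**e.
def exp(b, e):
    # EIA semantics of integer exponentiation (for e >= 0)
    return b ** e

def phi(s, t, x):
    # hypothetical EIA problem with one exponentiation subterm x = exp(s, t)
    return 100 < x <= 1000 and s == 2 and x == exp(s, t)

# "sat" check: re-evaluate phi under a claimed model, using true exp semantics.
model = {"s": 2, "t": 7, "x": 128}
assert phi(**model)

# "unsat" check: enumerate t = c for c in [0 .. k]; each instance is then a
# pure NIA problem, since exp(s, t) collapses to s**c.
def bounded_sat(k=10):
    return any(phi(2, c, 2 ** c) for c in range(k + 1))

print(bounded_sat())  # True: c = 7 gives 2^7 = 128
```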

# **7 Conclusion**

We presented the novel SMT theory EIA, which extends the theory of *non-linear integer arithmetic* with integer exponentiation. Moreover, inspired by *incremental linearization* for similar extensions of *non-linear real arithmetic*, we developed a CEGAR approach to solve EIA problems. The core idea of our approach is to regard exponentiation as an uninterpreted function and to eliminate counterexamples, i.e., models that violate the semantics of exponentiation, by generating suitable *lemmas*. Here, the use of *bilinear interpolation* turned out to be crucial, both in practice (see our evaluation in Sect. 6) and in theory, as interpolation lemmas are essential for being able to eliminate *any* counterexample (see Theorem 23). Finally, we evaluated the implementation of our approach in our novel tool SwInE on thousands of EIA problems that were synthesized from verification tasks using our verification tool LoAT. Our evaluation shows that SwInE is highly effective for our use case, i.e., as a backend for LoAT. Hence, we will couple SwInE and LoAT in future work.
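The CEGAR loop sketched above can be made concrete in a deliberately naive form: below, `e` plays the role of the uninterpreted value of exp(s, t), the "solver" merely enumerates small candidate models, and counterexamples are eliminated by blocking clauses, which are a crude stand-in for the symmetry, bounding, interpolation, and monotonicity lemmas of our actual approach; all names are illustrative.

```python
from itertools import product

def cegar_toy(constraint, k=5):
    """Toy CEGAR: search for a model of `constraint` over s, t, e in [0 .. k]
    in which e actually equals s**t (the EIA semantics of exp)."""
    lemmas = []
    while True:
        candidates = (dict(zip("ste", v)) for v in product(range(k + 1), repeat=3))
        model = next((m for m in candidates
                      if constraint(m) and all(l(m) for l in lemmas)), None)
        if model is None:
            return None                           # unsat
        if model["e"] == model["s"] ** model["t"]:
            return model                          # genuine EIA model
        bad = dict(model)                         # violates exp's semantics:
        lemmas.append(lambda m, b=bad: m != b)    # block this counterexample

print(cegar_toy(lambda m: m["e"] == 4 and m["s"] == 2))  # {'s': 2, 't': 2, 'e': 4}
```

A real solver would of course learn general lemmas rather than block single models, which is precisely where the interpolation lemmas become essential.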

With SwInE, we provide an SMT-LIB compliant open-source solver for EIA [23]. In this way, we hope to attract users with applications that give rise to challenging benchmarks, as our evaluation suggests that our benchmarks are relatively easy to solve. Moreover, we hope that other solvers with support for integer exponentiation will follow, with the ultimate goal of standardizing EIA.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **SAT-Based Learning of Computation Tree Logic**

Adrien Pommellet(B) , Daniel Stan , and Simon Scatton

LRE, EPITA, Le Kremlin-Bicêtre, France {adrien,simon.scatton}@lrde.epita.fr, daniel.stan@epita.fr https://www.lrde.epita.fr/ adrien/, https://www.tudo.re/daniel.stan/

**Abstract.** The CTL learning problem consists in finding, for a given sample of positive and negative Kripke structures, a distinguishing CTL formula that is verified by the former but not by the latter. Further constraints may bound the size and shape of the desired formula or even ask for its minimality in terms of syntactic size. This synthesis problem is motivated by explanation generation for dissimilar models, e.g. comparing a faulty implementation with the original protocol. We devise a SAT-based encoding for a fixed size CTL formula, then provide an incremental approach that guarantees minimality. We further report on a prototype implementation whose contribution is twofold: first, it allows us to assess the efficiency of various output fragments and optimizations; second, we can experimentally evaluate this tool by randomly mutating Kripke structures or syntactically introducing errors in higher-level models, then learning CTL distinguishing formulas.

**Keywords:** Computation Tree Logic · Passive learning · SAT solving

# **1 Introduction**

*Passive learning* is the act of computing a theoretical model of a system from a given set of data, without being able to acquire further information by actively querying said system. The input data may have been gathered through monitoring, collecting executions and outputs of systems. Automata and logic formulas tend to be the most common models, as they allow one to better explain systems of complex or even entirely opaque design.

*Linear-time Temporal Logic* LTL [19] remains one of the most widely used formalisms for specifying temporal properties of reactive systems. It applies to finite or infinite execution traces, and for that reason fits the passive learning framework very well: an LTL formula is a concise way to distinguish between correct and incorrect executions. The LTL learning problem, however, is anything but trivial: even simple fragments on finite traces are NP-complete [10], and consequently recent algorithms tend to leverage SAT solvers [16].

*Computation Tree Logic* CTL [8] is another relevant formalism that applies to execution trees instead of isolated linear traces. It is well-known [1, Thm. 6.21] that LTL and CTL are incomparable: the former is solely defined on the resulting runs of a system, whereas the latter depends on its branching structure.

However, the CTL passive learning problem has seldom been studied in as much detail as LTL. In this article, we formalize it on *Kripke structures* (KSs): finite graph-like representations of programs. Our goal is to find a CTL formula (said to be *separating*) that is verified by every state in a positive set S<sup>+</sup> yet rejected by every state in a negative set S−.

We first prove that an explicit formula can always be computed and we bound its size, assuming the sample is devoid of contradictions. However, said formula may not be minimal. The next step is therefore to solve the bounded learning problem: finding a separating CTL formula of size smaller than a given bound n. We reduce it to an instance Φ<sup>n</sup> of the Boolean satisfiability problem whose answer can be computed by a SAT solver; to do so, we encode CTL's bounded semantics, as the usual semantics on infinite execution trees can lead to spurious results. Finally, we use a bottom-up approach to pinpoint the minimal answer by solving a series of satisfiability problems. We show that a variety of optimizations can be applied to this iterative algorithm. These various approaches have been implemented in a C++ tool and benchmarked on a test sample.

*Related Work.* Bounded model checking harnesses the efficiency of modern SAT solvers to iteratively look for a witness of bounded size that would contradict a given logic formula, knowing that there exists a completeness threshold after which we can guarantee no counter-example exists. First introduced by Biere et al. [3] for LTL formulas, it was later applied to CTL formulas [17,24,25].

This approach inspired Neider et al. [16], who designed a SAT-based algorithm that can learn an LTL formula consistent with a sample of *ultimately periodic* words by computing propositional Boolean formulas that encode both the semantics of LTL on the input sample and the syntax of its Directed Acyclic Graph (DAG) representation. This work spurred further SAT-based developments such as learning formulas in the property specification language PSL [21] or LTL<sup>f</sup> [7], applying MaxSAT solving to noisy datasets [11], or constraining the shape of the formula to be learnt [15]. Our article extends this method to CTL formulas and Kripke structures. It subsumes the original LTL learning problem: one can trivially prove that it is equivalent to learning CTL formulas consistent with a sample of lasso-shaped KSs that consist of a single linear sequence of states followed by a single loop.

Fijalkow et al. [10] have studied the complexity of learning LTL<sup>f</sup> formulas of size smaller than a given bound and consistent with a sample of *finite* words: it is already NP-complete for fragments as simple as LTLf(∧,X), LTLf(∧, <sup>F</sup>), or LTLf(F,X,∧,∨). However, their proofs cannot be directly extended to samples of infinite but ultimately periodic words.

Browne et al. [6] proved that KSs could be characterized by CTL formulas and that conversely bisimilar KSs verified the same set of CTL formulas. As we will show in Sect. 3, this result guarantees that a solution to the CTL learning problem actually exists if the input sample is consistent.

Wasylkowski et al. [23] mined CTL specifications in order to explain preconditions of Java functions beyond pure state reachability. However, their learning algorithm consists in enumerating CTL templates of the form <sup>∀</sup><sup>F</sup> <sup>a</sup>, <sup>∃</sup><sup>F</sup> <sup>a</sup>, <sup>∀</sup>G(<sup>a</sup> <sup>=</sup>⇒ ∀X∀<sup>F</sup> <sup>b</sup>) and <sup>∀</sup>G(<sup>a</sup> <sup>=</sup>⇒ ∃X∃<sup>F</sup> <sup>b</sup>) where a, b <sup>∈</sup> AP for each function, using model checking to select one that is verified by the Kripke structure representing the aforementioned function.

Two very recent articles, yet to be published, have addressed the CTL learning problem as well. Bordais et al. [5] proved that the passive learning problem for LTL formulas on ultimately periodic words is NP-hard, assuming the size of the alphabet is given as an input; they then extend this result to CTL passive learning, using a straightforward reduction of ultimately periodic words to lasso-shaped Kripke structures. Roy et al. [22] used a SAT-based algorithm, arriving independently of our own research at an encoding similar to the one outlined in Sect. 4. However, our explicit solution to the learning problem, the embedding of the negations in the syntactic DAG, the approximation of the recurrence diameter as a semantic bound, our implementation of this algorithm, its test suite, and the experimental results are entirely novel contributions.

# **2 Preliminary Definitions**

### **2.1 Kripke Structures**

Let AP be a finite set of atomic propositions. A Kripke structure is a finite directed graph whose vertices (called *states*) are labelled by subsets of AP.

**Definition 1 (Kripke Structure).** *A* Kripke structure *(KS)* K *on* AP *is a tuple* K = (Q, δ, λ) *such that:*

*–* Q *is a finite set of states;*
*–* δ : Q → 2<sup>Q</sup> *is a transition function mapping each state to its non-empty set of successors;*
*–* λ : Q → 2<sup>AP</sup> *is a labelling function.*
An infinite *run* <sup>r</sup> of <sup>K</sup> starting from a state <sup>q</sup> <sup>∈</sup> <sup>Q</sup> is an infinite sequence <sup>r</sup> = (si) <sup>∈</sup> <sup>Q</sup><sup>ω</sup> of consecutive states such that <sup>s</sup><sup>0</sup> <sup>=</sup> <sup>q</sup> and <sup>∀</sup> <sup>i</sup> <sup>≥</sup> 0, <sup>s</sup>i+1 <sup>∈</sup> <sup>δ</sup>(si). <sup>R</sup>K(q) is the set of all infinite runs of <sup>K</sup> starting from <sup>q</sup>.

The *recurrence diameter* α<sub>K</sub>(q) of state q in K is the length of the longest finite run (s<sub>i</sub>)<sub>i=0,...,α<sub>K</sub>(q)</sub> starting from q such that ∀ i, j ∈ [0 .. α<sub>K</sub>(q)], if i ≠ j then s<sub>i</sub> ≠ s<sub>j</sub> (i.e. the longest *simple* path in the underlying graph structure). We may omit the index K whenever contextually obvious.
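For intuition, the recurrence diameter can be computed by exhaustively searching for the longest simple path; the following sketch (exponential in general, with an illustrative three-state successor map) merely mirrors the definition:

```python
# Longest simple path from q via exhaustive DFS over the successor map delta.
def recurrence_diameter(delta, q):
    best = 0
    def dfs(state, visited, length):
        nonlocal best
        best = max(best, length)          # length counts edges, as in the definition
        for succ in delta[state]:
            if succ not in visited:       # keep the run simple (no repeated state)
                dfs(succ, visited | {succ}, length + 1)
    dfs(q, {q}, 0)
    return best

delta = {0: [1], 1: [2], 2: [0]}          # a three-state loop (illustrative)
print(recurrence_diameter(delta, 0))      # 2: the simple path 0 -> 1 -> 2
```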

Note that two states may generate the same runs despite their computation trees not corresponding. It is therefore necessary to define an equivalence relation on states of KSs that goes further than mere run equality.

**Definition 2 (Bisimulation Relation).** *Let* K = (Q, δ, λ) *be a KS on* AP*. The* canonical bisimulation *relation* ∼ ⊆ Q × Q *is the coarsest (i.e. the most general) equivalence relation such that for any* q<sub>1</sub> ∼ q<sub>2</sub>*,* λ(q<sub>1</sub>) = λ(q<sub>2</sub>) *and* ∀ q′<sub>1</sub> ∈ δ(q<sub>1</sub>), ∃ q′<sub>2</sub> ∈ δ(q<sub>2</sub>) *such that* q′<sub>1</sub> ∼ q′<sub>2</sub>*.*

Bisimilarity entails not only equality of runs, but also similarity of shape: two bisimilar states have corresponding computation trees at any depth. A partition refinement algorithm allows one to compute ∼ by refining a sequence of equivalence relations (∼<sub>i</sub>)<sub>i≥0</sub> on Q × Q inductively, where for every q<sub>1</sub>, q<sub>2</sub> ∈ Q:

$$\begin{aligned} q\_1 \sim\_0 q\_2 &\iff \lambda(q\_1) = \lambda(q\_2) \\ q\_1 \sim\_{i+1} q\_2 &\iff (q\_1 \sim\_i q\_2) \land (\{ [q'\_1]\_{\sim i} \mid q'\_1 \in \delta(q\_1) \} = \{ [q'\_2]\_{\sim i} \mid q'\_2 \in \delta(q\_2) \}) \end{aligned}$$

where [q]<sub>∼<i>i</i></sub> stands for the equivalence class of q ∈ Q according to the equivalence relation ∼<sub>i</sub>. Intuitively, q<sub>1</sub> ∼<sub>i</sub> q<sub>2</sub> if their computation trees are corresponding up to depth i. The next theorem is a well-known result [1, Alg. 31]:

**Theorem 1 (Characteristic Number).** *Given a KS* <sup>K</sup>*, there exists* <sup>i</sup><sup>0</sup> <sup>∈</sup> <sup>N</sup> *such that* <sup>∀</sup> <sup>i</sup> <sup>≥</sup> <sup>i</sup>0*,* <sup>∼</sup> <sup>=</sup> <sup>∼</sup>i*. The smallest integer* <sup>i</sup><sup>0</sup> *verifying that property is known as the* characteristic number C<sup>K</sup> *of* K*.*

Note that Browne et al. [6] introduced an equivalent definition: the characteristic number of a KS is also the smallest integer <sup>C</sup><sup>K</sup> <sup>∈</sup> <sup>N</sup> such that any two states are not bisimilar if and only if their labelled computation trees of depth C<sup>K</sup> are not corresponding.
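The partition refinement computation of ∼ described above can be sketched as follows; the successor map and labelling are illustrative, and classes are refined by the pair (current class, set of successor classes) until a fixpoint is reached:

```python
# Partition refinement: cls maps each state to its current equivalence class.
def bisimulation_classes(states, delta, label):
    cls = {q: label[q] for q in states}  # ~_0: group states by label
    while True:
        # ~_{i+1}: refine by current class and the set of successor classes
        sig = {q: (cls[q], frozenset(cls[s] for s in delta[q])) for q in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values()), key=str))}
        new = {q: ids[sig[q]] for q in states}
        if len(set(new.values())) == len(set(cls.values())):
            return new  # no class was split: fixpoint, i.e. ~ itself
        cls = new

delta = {"a": ["a"], "b": ["c"], "c": ["c"]}   # illustrative KS
labels = {"a": "p", "b": "p", "c": "q"}
print(bisimulation_classes(["a", "b", "c"], delta, labels))
```

Here "a" and "b" carry the same label but are separated at the first refinement step, since their successors fall into different classes.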

### **2.2 Computation Tree Logic**

**Definition 3 (Computation Tree Logic).** Computation Tree Logic *(CTL) is the set of formulas defined by the following grammar, where* <sup>a</sup> <sup>∈</sup> AP *is any atomic proposition and* † ∈ {∀ , ∃ } *a quantifier:*

ϕ ::= a | ¬ϕ | ϕ ∧ ϕ | ϕ ∨ ϕ | † X ϕ | † F ϕ | † G ϕ | † ϕUϕ

*Given* <sup>E</sup> ⊆ {¬,∧,∨, <sup>∀</sup>X, <sup>∃</sup>X, <sup>∀</sup>F, <sup>∃</sup>F, <sup>∀</sup>G, <sup>∃</sup>G, <sup>∀</sup>U, <sup>∃</sup>U}*, we define the (syntactic)* fragment *CTL*(E) *as the subset of CTL formulas featuring only operators in* E*.*

CTL formulas are verified against states of KSs (a process known as *model checking*). Intuitively, <sup>∀</sup> (*all*) means that all runs starting from state <sup>q</sup> must verify the property that follows, ∃ (*exists*), that at least one run starting from q must verify the property that follows, X ϕ (*next*), that the next state of the run must verify ϕ, F ϕ (*finally*), that there exists a state of run verifying ϕ, G ϕ (*globally*), that each state of the run must verify ϕ, and ϕUψ (*until*), that the run must keep verifying ϕ at least until ψ is eventually verified.

More formally, for a state q ∈ Q of a KS K = (Q, δ, λ) and a CTL formula ϕ, we write (q |=<sub>K</sub> ϕ) when q *satisfies* ϕ. CTL's semantics are defined inductively on ϕ (see [1, Def. 6.4] for a complete definition); we recall below the *until* case:

**Definition 4 (Semantics of** <sup>∀</sup>U, <sup>∃</sup>U**).** *Let* <sup>K</sup> = (Q, δ, λ) *be a KS,* <sup>ϕ</sup> *and* <sup>ψ</sup> *two CTL formulas,* <sup>q</sup> <sup>∈</sup> <sup>Q</sup>*, and* † ∈ {∀ , ∃ }*. Then:*

$$q \models_K \dagger\,\varphi\,\mathsf{U}\,\psi \iff \dagger\,(s_i) \in \mathcal{R}_K(q),\ \exists\, i \geq 0,\ (s_i \models_K \psi) \wedge (\forall\, j < i,\ s_j \models_K \varphi)$$

Bisimilarity and CTL equivalence coincide [1, Thm. 7.20] on finite KSs. The proof relies on the following concept:

**Theorem 2 (Browne et al.** [6]**).** *Given a KS* K = (Q, δ, λ) *and a state* q ∈ Q*, there exists a CTL formula* ϕ<sub>q</sub> ∈ *CTL*({¬,∧,∨, ∀X, ∃X}) *known as the* master formula *of state* q *such that, for any* q′ ∈ Q*,* q′ |=<sub>K</sub> ϕ<sub>q</sub> *if and only if* q ∼ q′*.*

To each CTL formula ϕ, we associate a syntactic tree T. For brevity's sake, we consider a syntactic *directed acyclic graph* (DAG) D obtained by coalescing identical subtrees in the original syntactic tree T, as shown in Fig. 1. The *size* |ϕ| of a CTL formula ϕ is then defined as the number of nodes of its smallest syntactic DAG. As an example, |¬a ∧ ∀X a| = 4.

**Fig. 1.** The syntactic tree and indexed DAG of the CTL formula ¬a ∧ ∀Xa.
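The size-via-DAG definition can be checked mechanically. In this illustrative sketch, formulas are encoded as nested tuples (a hypothetical encoding, with `'AX'` standing for ∀X), and identical subterms are counted only once:

```python
# Collect the set of distinct subterms of a tuple-encoded formula.
def subterms(phi, acc=None):
    acc = set() if acc is None else acc
    acc.add(phi)
    if isinstance(phi, tuple):
        for child in phi[1:]:     # phi[0] is the operator label
            subterms(child, acc)
    return acc

def dag_size(phi):
    # identical subterms are shared in the DAG, hence counted once
    return len(subterms(phi))

phi = ('and', ('not', 'a'), ('AX', 'a'))   # the formula ¬a ∧ ∀X a
print(dag_size(phi))  # 4: the nodes a, ¬a, ∀X a, and the conjunction
```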

### **2.3 Bounded Semantics**

We introduce the bounded temporal operators <sup>∀</sup>F<sup>u</sup>, <sup>∃</sup>F<sup>u</sup>, <sup>∀</sup>G<sup>u</sup>, <sup>∃</sup>G<sup>u</sup>, <sup>∀</sup>U<sup>u</sup>, and <sup>∃</sup>U<sup>u</sup>, whose semantics only applies to the first <sup>u</sup> steps of a run. Formally:

**Definition 5 (Bounded Semantics of CTL).** *Let* <sup>K</sup> = (Q, δ, λ) *be a KS,* <sup>ϕ</sup> *and* <sup>ψ</sup> *two CTL formulas,* <sup>u</sup> <sup>∈</sup> <sup>N</sup> *and* <sup>q</sup> <sup>∈</sup> <sup>Q</sup>*. The bounded semantics of CTL of rank* <sup>u</sup> *with regards to* <sup>K</sup> *are defined as follows for the quantifier* † ∈ {∀ , ∃ }*:*

$$\begin{aligned} q \models_K \dagger\,\mathsf{F}^u\,\varphi &\iff \dagger\,(s_i) \in \mathcal{R}_K(q),\ \exists\, i \in [0\,..\,u],\ s_i \models_K \varphi \\ q \models_K \dagger\,\mathsf{G}^u\,\varphi &\iff \dagger\,(s_i) \in \mathcal{R}_K(q),\ \forall\, i \in [0\,..\,u],\ s_i \models_K \varphi \\ q \models_K \dagger\,\varphi\,\mathsf{U}^u\,\psi &\iff \dagger\,(s_i) \in \mathcal{R}_K(q),\ \exists\, i \in [0\,..\,u],\ (s_i \models_K \psi) \wedge (\forall\, j < i,\ s_j \models_K \varphi) \end{aligned}$$

Intuitively, the rank <sup>u</sup> of the bounded semantics acts as a timer: (<sup>q</sup> <sup>|</sup>=<sup>K</sup> <sup>∀</sup>G<sup>u</sup> <sup>ϕ</sup>) means that <sup>ϕ</sup> must hold for the next <sup>u</sup> computation steps; (<sup>q</sup> <sup>|</sup>=<sup>K</sup> <sup>∀</sup>F<sup>u</sup> <sup>ϕ</sup>), that <sup>q</sup> must always be able to reach a state verifying <sup>ϕ</sup> within <sup>u</sup> steps; (<sup>q</sup> <sup>|</sup>=<sup>K</sup> <sup>∀</sup> <sup>ϕ</sup>U<sup>u</sup> <sup>ψ</sup>), that q must always be able to reach a state verifying ψ within u steps, and that ϕ must hold until it does; etc. This intuition results in the following properties:

*Property 1 (Base case).* (q |=<sub>K</sub> ψ) ⟺ (q |=<sub>K</sub> † F<sup>0</sup> ψ) ⟺ (q |=<sub>K</sub> † ϕU<sup>0</sup> ψ) ⟺ (q |=<sub>K</sub> † G<sup>0</sup> ψ).

*Property 2 (Induction).*

$$\begin{aligned} (q \models_K \dagger\,\mathsf{F}^{u+1}\,\varphi) &\iff (q \models_K \varphi) \vee \bigodot_{q' \in \delta(q)} (q' \models_K \dagger\,\mathsf{F}^{u}\,\varphi), \\ (q \models_K \dagger\,\varphi\,\mathsf{U}^{u+1}\,\psi) &\iff (q \models_K \psi) \vee \Big[ (q \models_K \varphi) \wedge \bigodot_{q' \in \delta(q)} (q' \models_K \dagger\,\varphi\,\mathsf{U}^{u}\,\psi) \Big], \text{ and} \\ (q \models_K \dagger\,\mathsf{G}^{u+1}\,\varphi) &\iff (q \models_K \varphi) \wedge \bigodot_{q' \in \delta(q)} (q' \models_K \dagger\,\mathsf{G}^{u}\,\varphi), \end{aligned}$$

where $\bigodot = \bigwedge$ if † = ∀ and $\bigodot = \bigvee$ if † = ∃.

*Property 3 (Spread).* (q |=<sub>K</sub> † F<sup>u</sup> ϕ) ⟹ (q |=<sub>K</sub> † F<sup>u+1</sup> ϕ), (q |=<sub>K</sub> † G<sup>u+1</sup> ϕ) ⟹ (q |=<sub>K</sub> † G<sup>u</sup> ϕ), and (q |=<sub>K</sub> † ϕU<sup>u</sup> ψ) ⟹ (q |=<sub>K</sub> † ϕU<sup>u+1</sup> ψ).
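Properties 1 and 2 translate directly into a recursive evaluator for the bounded semantics; a sketch for the ∀F<sup>u</sup> case, on an illustrative successor map, where `sat_phi` is the set of states satisfying the inner formula:

```python
# Evaluator for q |= forall F^u phi, by induction on the rank u.
def holds_AF(delta, sat_phi, q, u):
    if q in sat_phi:          # phi holds in q, so AF^u phi holds immediately
        return True
    if u == 0:                # base case: AF^0 phi is equivalent to phi
        return False
    # induction: every successor must satisfy AF^{u-1} phi
    return all(holds_AF(delta, sat_phi, s, u - 1) for s in delta[q])

delta = {0: [1, 2], 1: [1], 2: [0]}
print(holds_AF(delta, {1, 2}, 0, 1))  # True: both successors of 0 satisfy phi
```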

Bounded model checking algorithms [25] rely on the following result, as one can then restrict the study of CTL semantics to finite and fixed length paths.

**Theorem 3.** *Given* q ∈ Q*, for* † ∈ {∀, ∃} *and* ◦ ∈ {F, G}*,* q |=<sub>K</sub> † ◦ ϕ *(resp.* q |=<sub>K</sub> † ϕUψ*) if and only if* q |=<sub>K</sub> † ◦<sup>α(q)</sup> ϕ *(resp.* q |=<sub>K</sub> † ϕU<sup>α(q)</sup> ψ*).*

A full proof of this result is available in Appendix A.

# **3 The Learning Problem**

We consider the synthesis problem of a distinguishing CTL formula from a sample of positive and negative states of a given KS.

### **3.1 Introducing the Problem**

First and foremost, the sample must be self-consistent: a state in the positive sample cannot verify a CTL formula while another bisimilar state in the negative sample does not.

**Definition 6 (Sample).** *Given a KS* K = (Q, δ, λ)*, a* sample *of* K *is a pair* (S<sub>+</sub>, S<sub>−</sub>) ∈ 2<sup>Q</sup> × 2<sup>Q</sup> *such that* ∀ q<sub>+</sub> ∈ S<sub>+</sub>*,* ∀ q<sub>−</sub> ∈ S<sub>−</sub>*,* q<sub>+</sub> ≁ q<sub>−</sub>*.*

We define the *characteristic number* C<sub>K</sub>(S<sub>+</sub>, S<sub>−</sub>) of a sample as the smallest integer c ∈ N such that for every q<sub>+</sub> ∈ S<sub>+</sub> and q<sub>−</sub> ∈ S<sub>−</sub>, q<sub>+</sub> ≁<sub>c</sub> q<sub>−</sub>.

**Definition 7 (Consistent Formula).** *A CTL formula* ϕ *is said to be* consistent *with a sample* (S<sub>+</sub>, S<sub>−</sub>) *of* K *if* ∀ q<sub>+</sub> ∈ S<sub>+</sub>*,* q<sub>+</sub> |=<sub>K</sub> ϕ *and* ∀ q<sub>−</sub> ∈ S<sub>−</sub>*,* q<sub>−</sub> ⊭<sub>K</sub> ϕ*.*

The rest of our article focuses on the following passive learning problems:

**Definition 8 (Learning Problem).** *Given a sample* (S+, S−) *of a KS* <sup>K</sup> *and* <sup>n</sup> <sup>∈</sup> <sup>N</sup>∗*, we introduce the following instances of the CTL learning problem:*

*L*<sub>CTL(E)</sub>(K, S<sub>+</sub>, S<sub>−</sub>)**.** *Is there* ϕ ∈ *CTL*(E) *consistent with* (S<sub>+</sub>, S<sub>−</sub>)*?*

*L*<sup>≤n</sup><sub>CTL(E)</sub>(K, S<sub>+</sub>, S<sub>−</sub>)**.** *Is there* ϕ ∈ *CTL*(E)*,* |ϕ| ≤ n*, consistent with* (S<sub>+</sub>, S<sub>−</sub>)*?*

*ML*<sub>CTL(E)</sub>(K, S<sub>+</sub>, S<sub>−</sub>)**.** *Find the* smallest ϕ ∈ *CTL*(E) *consistent with* (S<sub>+</sub>, S<sub>−</sub>)*.*

**Theorem 4.** *<sup>L</sup>CTL*(K, S+, S−) *and MLCTL*(K, S+, S−) *always admit a solution.*

*Proof.* Consider ψ = ⋁<sub>q<sub>+</sub>∈S<sub>+</sub></sub> ϕ<sub>q<sub>+</sub></sub>, the disjunction of the master formulas of the positive states. This formula ψ is consistent with (S<sub>+</sub>, S<sub>−</sub>) by design: each q<sub>+</sub> ∈ S<sub>+</sub> satisfies its own master formula ϕ<sub>q<sub>+</sub></sub>, and no q<sub>−</sub> ∈ S<sub>−</sub> satisfies any ϕ<sub>q<sub>+</sub></sub>, as the sample guarantees q<sub>+</sub> ≁ q<sub>−</sub>. Thus L<sub>CTL</sub>(K, S<sub>+</sub>, S<sub>−</sub>) always admits a solution, and so does the problem ML<sub>CTL</sub>(K, S<sub>+</sub>, S<sub>−</sub>), although ψ is unlikely to be the minimal solution.

Bordais et al. [5] proved that L<sup>≤</sup><sup>n</sup> CTL(K, S+, S−) is NP-hard, assuming the set of atomic propositions AP is given as an input as well.

# **3.2 An Explicit Solution**

We must find a formula consistent with the sample (S<sub>+</sub>, S<sub>−</sub>); this is an easier problem than Theorem 2, whose answer by Browne et al. [6] characterizes bisimilarity with respect to an entire KS. As we know that every state in S<sub>−</sub> is dissimilar to every state in S<sub>+</sub>, we will encode this fact in CTL form, then use said encoding to design a formula consistent with the sample.

**Definition 9 (Separating Formula).** *Let* (S<sub>+</sub>, S<sub>−</sub>) *be a sample of a KS* K = (Q, δ, λ)*. Assuming that* AP *and* Q *are ordered, and given* q<sub>1</sub>, q<sub>2</sub> ∈ Q *such that* q<sub>1</sub> ≁ q<sub>2</sub>*, formula* D<sub>q<sub>1</sub>,q<sub>2</sub></sub> *is defined inductively w.r.t.* c = C<sub>K</sub>({q<sub>1</sub>}, {q<sub>2</sub>}) *as follows:*

*– if* c = 0*, then* λ(q<sub>1</sub>) ≠ λ(q<sub>2</sub>)*: picking the smallest atomic proposition* a *in the symmetric difference of* λ(q<sub>1</sub>) *and* λ(q<sub>2</sub>)*,* D<sub>q<sub>1</sub>,q<sub>2</sub></sub> = a *if* a ∈ λ(q<sub>1</sub>)*, and* D<sub>q<sub>1</sub>,q<sub>2</sub></sub> = ¬a *otherwise;*

*– else if* c ≠ 0 *and* ∃ q′<sub>1</sub> ∈ δ(q<sub>1</sub>)*,* ∀ q′<sub>2</sub> ∈ δ(q<sub>2</sub>)*,* q′<sub>1</sub> ≁<sub>c−1</sub> q′<sub>2</sub>*, then* D<sub>q<sub>1</sub>,q<sub>2</sub></sub> = ∃X ⋀<sub>q′<sub>2</sub>∈δ(q<sub>2</sub>)</sub> D<sub>q′<sub>1</sub>,q′<sub>2</sub></sub>*, picking the smallest* q′<sub>1</sub> *verifying this property;*

*– else if* c ≠ 0 *and* ∃ q′<sub>2</sub> ∈ δ(q<sub>2</sub>)*,* ∀ q′<sub>1</sub> ∈ δ(q<sub>1</sub>)*,* q′<sub>1</sub> ≁<sub>c−1</sub> q′<sub>2</sub>*, then* D<sub>q<sub>1</sub>,q<sub>2</sub></sub> = ∀X ⋁<sub>q′<sub>1</sub>∈δ(q<sub>1</sub>)</sub> ¬D<sub>q′<sub>2</sub>,q′<sub>1</sub></sub>*, picking the smallest* q′<sub>2</sub> *verifying this property.*

*The formula* S<sub>K</sub>(S<sub>+</sub>, S<sub>−</sub>) = ⋁<sub>q<sub>+</sub>∈S<sub>+</sub></sub> ⋀<sub>q<sub>−</sub>∈S<sub>−</sub></sub> D<sub>q<sub>+</sub>,q<sub>−</sub></sub> ∈ *CTL*({¬,∧,∨, ∀X, ∃X}) *is then called the* separating formula *of sample* (S<sub>+</sub>, S<sub>−</sub>)*.*

Intuitively, the CTL formula D<sub>q<sub>1</sub>,q<sub>2</sub></sub> merely expresses that states q<sub>1</sub> and q<sub>2</sub> are dissimilar by negating Definition 2; it is such that q<sub>1</sub> |=<sub>K</sub> D<sub>q<sub>1</sub>,q<sub>2</sub></sub> but q<sub>2</sub> ⊭<sub>K</sub> D<sub>q<sub>1</sub>,q<sub>2</sub></sub>. Either q<sub>1</sub> and q<sub>2</sub> have different labels, q<sub>1</sub> admits a successor that is dissimilar to all of q<sub>2</sub>'s successors, or q<sub>2</sub> admits a successor that is dissimilar to all of q<sub>1</sub>'s. The following result is proven in Appendix B:

**Theorem 5.** *The separating formula* <sup>S</sup>K(S+, S−) *is consistent with* (S+, S−)*.*

As proven in Appendix C, we can bound the size of <sup>S</sup>K(S+, S−):

**Corollary 1.** *Assume the KS* <sup>K</sup> *has degree* <sup>k</sup> *and* <sup>c</sup> <sup>=</sup> <sup>C</sup>K(S+, S−)*, then:*

*– if* k ≥ 2*, then* |S<sub>K</sub>(S<sub>+</sub>, S<sub>−</sub>)| ≤ (5 · k<sup>c</sup> + 1) · |S<sub>+</sub>| · |S<sub>−</sub>|*;*
*– if* k = 1*, then* |S<sub>K</sub>(S<sub>+</sub>, S<sub>−</sub>)| ≤ (2 · c + 3) · |S<sub>+</sub>| · |S<sub>−</sub>|*.*

# **4 SAT-Based Learning**

The *universal* fragment CTL<sup>∀</sup> <sup>=</sup> CTL({¬,∧,∨, <sup>∀</sup>X, <sup>∀</sup>F, <sup>∀</sup>G, <sup>∀</sup>U}) of CTL happens to be as expressive as the full logic [1, Def. 6.13]. For that reason, we will reduce a learning instance of CTL<sup>∀</sup> of rank <sup>n</sup> to an instance of the SAT problem. A similar reduction has been independently found by Roy et al. [22].

**Lemma 1.** *There exists a Boolean propositional formula* Φ<sup>n</sup> *such that the instance L*<sup>≤</sup><sup>n</sup> *CTL*<sup>∀</sup> (K, S+, S−) *of the learning problem admits a solution* <sup>ϕ</sup> *if and only if the formula* Φ<sup>n</sup> *is satisfiable.*

#### **4.1 Modelling the Formula**

Assume that there exists a syntactic DAG <sup>D</sup> of size smaller than or equal to <sup>n</sup> representing the desired CTL formula <sup>ϕ</sup>. Let us index <sup>D</sup>'s nodes in [1 .. n] in such a fashion that each node has a higher index than its children, as shown in Fig. 1. Hence, n always labels a root and 1 always labels a leaf.

Let L = AP ∪ {⊤, ¬, ∧, ∨, ∀X, ∀F, ∀G, ∀U} be the set of labels that decorate the DAG's nodes. For each i ∈ [1 .. n] and o ∈ L, we introduce a Boolean variable τ<sub>i</sub><sup>o</sup> such that τ<sub>i</sub><sup>o</sup> = 1 if and only if the node of index i is labelled by o.

For all <sup>i</sup> <sup>∈</sup> [1 .. n] and <sup>j</sup> <sup>∈</sup> [0 .. i <sup>−</sup> 1], we also introduce a Boolean variable li,j (resp. ri,j ) such that li,j = 1 (resp. ri,j = 1) if and only if j is the left (resp. right) child of i. Having a child of index 0 stands for having no child at all in the actual syntactic DAG D.

Three mutual exclusion clauses guarantee that each node of the syntactic DAG has exactly one label and at most one left child and one right child. Moreover, three other clauses ensure that a node labelled by an operator of arity x has exactly x actual children (by convention, if x = 1 then its child is to the right). These simple clauses are similar to Neider et al.'s encoding [16] and for that reason are not detailed here.
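Such mutual exclusion constraints can be generated as standard pairwise CNF clauses; a small sketch with DIMACS-style integer variables (the numbering is illustrative, not the encoding's actual variable layout):

```python
# Pairwise CNF encoding of "exactly one of xs is true".
def exactly_one(xs):
    clauses = [list(xs)]  # at least one variable is true
    # at most one: for every pair, at least one of the two is false
    clauses += [[-a, -b] for i, a in enumerate(xs) for b in xs[i + 1:]]
    return clauses

print(exactly_one([1, 2, 3]))
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
```

The pairwise encoding needs O(|xs|²) clauses; commutative encodings or sequential counters can reduce this, but the sets of labels and children per node are small here.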

#### **4.2 Applying the Formula to the Sample**

For all i ∈ [1 .. n] and q ∈ Q, we introduce a Boolean variable ϕ<sub>i</sub><sup>q</sup> such that ϕ<sub>i</sub><sup>q</sup> = 1 if and only if state q verifies the sub-formula ϕ<sub>i</sub> rooted in node i. The next clauses implement the semantics of the true symbol ⊤, the atomic propositions, and the CTL operator ∀X.

$$\bigwedge\_{\substack{i \in [1..n] \\ q \in Q}} (\tau\_i^\top \implies \varphi\_i^q) \tag{8.62}$$

$$\bigwedge\_{\substack{i \in [1..n] \\ q \in Q}} \left[ \left( \bigwedge\_{a \in \lambda(q)} (\tau\_i^a \implies \varphi\_i^q) \right) \wedge \left( \bigwedge\_{a \notin \lambda(q)} (\tau\_i^a \implies \neg \varphi\_i^q) \right) \right] \tag{\mathsf{sem}\_a}$$

$$\bigwedge\_{\substack{i \in [2..n] \\ k \in [1..i-1]}} \left[ (\tau\_i^{\forall \mathsf{X}} \wedge \mathfrak{r}\_{i,k}) \implies \bigwedge\_{q \in Q} \left( \varphi\_i^q \iff \bigwedge\_{q' \in \delta(q)} \varphi\_k^{q'} \right) \right] \tag{\mathsf{sem}\_{\forall \mathsf{X}}}$$

Semantic clauses are structured as follows: an antecedent stating node i's label and its possible children implies a consequent expressing ϕ_i^q's semantics for each q ∈ Q. Clause sem_⊤ states that q ⊨ ⊤ is always true; clause sem_a, that q ⊨ a if and only if q is labelled by a; and clause sem_∀X, that q ⊨ ∀Xψ if and only if all of q's successors verify ψ. Similar straightforward clauses encode the semantics of the Boolean connectors ¬, ∧ and ∨.
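To make the intent of clause sem_∀X concrete, here is a direct (non-SAT) check of q ⊨ ∀Xψ over a successor map δ; the state names and the representation of ψ by its satisfying set are illustrative assumptions:

```python
def holds_AX(delta, sat_psi, q):
    """q |= ∀X ψ iff every successor of q satisfies ψ.

    delta: successor map of the Kripke structure (state -> list of states);
    sat_psi: the set of states satisfying the sub-formula ψ.
    """
    return all(s in sat_psi for s in delta[q])

# Toy Kripke structure: 0 -> {1, 2}, 1 -> {1}, 2 -> {1}.
delta = {0: [1, 2], 1: [1], 2: [1]}
```

The SAT clause expresses exactly this universal quantification over δ(q), but as a Boolean equivalence between the variables ϕ_i^q and ϕ_k^{q′}.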

CTL semantics are also characterized by fixed points, whose naive encoding might however capture spurious (least vs. greatest) sets: we resort to the bounded semantics ∀F^u, ∀U^u, and ∀G^u. For all i ∈ [1 .. n], q ∈ Q, and u ∈ [0 .. α(q)], we introduce a Boolean variable ρ_{i,q}^u such that ρ_{i,q}^u = 1 if and only if q verifies the sub-formula rooted in i according to the CTL bounded semantics of rank u (e.g. q ⊨ ∀F^u ψ, assuming the sub-formula ∀F ψ is rooted in node i).

Thanks to Theorem 3 we can introduce the following equivalence clause:

$$\bigwedge\_{i \in [2..n]} \left[ \left( \bigvee\_{o \in \{\forall \mathsf{F}, \forall \mathsf{G}, \forall \mathsf{U}\}} \tau\_i^o \right) \implies \bigwedge\_{q \in Q} (\varphi\_i^q \iff \rho\_{i,q}^{\alpha(q)}) \right] \tag{\mathsf{equiv}\_\rho}$$

Property 3 yields two other clauses, whose inclusion is not mandatory (they were left out by Roy et al. [22]), that further constrain the bounded semantics:

$$\bigwedge\_{i \in [2..n]} \left[ (\tau\_i^{\forall \mathsf{F}} \vee \tau\_i^{\forall \mathsf{U}}) \implies \bigwedge\_{\substack{q \in Q \\ u \in [1..\alpha(q)]}} (\rho\_{i,q}^{u-1} \implies \rho\_{i,q}^{u}) \right] \tag{\mathsf{ascent}\_\rho}$$

$$\bigwedge\_{i \in [2..n]} \left[ \tau\_i^{\forall \mathsf{G}} \implies \bigwedge\_{\substack{q \in Q \\ u \in [1..\alpha(q)]}} (\rho\_{i,q}^{u} \implies \rho\_{i,q}^{u-1}) \right] \tag{\mathsf{descent}\_\rho}$$

The next clause enables the variables ρ_{i,q}^u for temporal operators only:

$$\bigwedge\_{i \in [2..n]} \left[ \left( \bigwedge\_{o \in \{\forall \mathsf{F}, \forall \mathsf{G}, \forall \mathsf{U}\}} \neg \tau\_i^o \right) \implies \bigwedge\_{\substack{q \in Q \\ u \in [0..\alpha(q)]}} \neg \rho\_{i,q}^{u} \right] \tag{\mathsf{off}\_\rho}$$

Properties 1 and 2 yield an inductive definition of the bounded semantics. We only make explicit the base case base_ρ and the semantics sem_∀U of ∀U^u, but we also implement semantic clauses for the temporal operators ∀F and ∀G.

$$\bigwedge\_{\substack{i \in [2..n] \\ k \in [1..i-1]}} \left[ \left( \left( \bigvee\_{o \in \{\forall \mathsf{F}, \forall \mathsf{U}\}} \tau\_i^o \right) \wedge \mathfrak{r}\_{i,k} \right) \implies \bigwedge\_{q \in Q} (\rho\_{i,q}^{0} \iff \varphi\_k^{q}) \right] \tag{\mathsf{base}\_\rho}$$
 
$$\bigwedge\_{\substack{i \in [2..n] \\ j,k \in [1..i-1]}} \left[ (\tau\_i^{\forall \mathsf{U}} \wedge \mathfrak{l}\_{i,j} \wedge \mathfrak{r}\_{i,k}) \implies \bigwedge\_{\substack{q \in Q \\ u \in [1..\alpha(q)]}} \left( \rho\_{i,q}^{u} \iff \left[ \varphi\_k^{q} \vee \left( \varphi\_j^{q} \wedge \bigwedge\_{q' \in \delta(q)} \rho\_{i,q'}^{\min(\alpha(q'), u-1)} \right) \right] \right) \right] \tag{\mathsf{sem}\_{\forall \mathsf{U}}}$$
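The clauses base_ρ and sem_∀U mirror a simple recursion, sketched below as a direct evaluator of the bounded semantics of ∀(ϕ U ψ); the successor map, the per-state bound α, and the set-based representation of ϕ and ψ are illustrative assumptions, not the Boolean encoding itself:

```python
from functools import lru_cache

def bounded_AU(delta, alpha, sat_phi, sat_psi):
    """Rank-u bounded semantics of ∀(ϕ U ψ).

    delta: successor map; alpha: recurrence-diameter bound per state;
    sat_phi / sat_psi: sets of states satisfying ϕ and ψ.
    rho(q, u) follows the base case (rank 0) and the inductive step of
    the encoding: either ψ holds now, or ϕ holds and every successor q'
    satisfies the property at rank min(alpha[q'], u - 1).
    """
    @lru_cache(maxsize=None)
    def rho(q, u):
        if u == 0:
            return q in sat_psi
        return q in sat_psi or (
            q in sat_phi
            and all(rho(s, min(alpha[s], u - 1)) for s in delta[q])
        )
    return rho

# Toy chain 0 -> 1 -> 2 with a self-loop on 2, where ψ holds on state 2 only.
delta = {0: [1], 1: [2], 2: [2]}
alpha = {0: 2, 1: 1, 2: 0}
rho = bounded_AU(delta, alpha, sat_phi={0, 1}, sat_psi={2})
```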

Finally, the last clause ensures that the full formula ϕ (rooted in node n) is verified by the positive sample but not by the negative sample.

$$\left(\bigwedge\_{q^{+} \in S^{+}} \varphi\_{n}^{q^{+}}\right) \wedge \left(\bigwedge\_{q^{-} \in S^{-}} \neg \varphi\_{n}^{q^{-}}\right) \tag{\mathsf{sem}\_{\varphi}}$$

### **4.3 Solving the SAT Instance**

We finally define the formula Φ_n as the conjunction of all the aforementioned clauses. Assuming an upper bound d on the KS's recurrence diameter, this encoding requires O(n² + n·|AP| + n·|Q|·d) variables and O(n·|AP| + n³·|Q|·d + n·|AP|·|Q|) clauses, not taking the transformation to conjunctive normal form into account. By design, Lemma 1 holds.

*Proof.* The syntactic clauses allow one to infer the DAG of a formula ϕ ∈ CTL of size smaller than or equal to n from the valuations taken by the variables (τ_i^o), (l_{i,j}), and (r_{i,j}). Clauses sem_⊤ to sem_ϕ guarantee that the sample is consistent with said formula ϕ, thanks to Theorem 3 and Properties 1, 2, and 3.

### **5 Algorithms for the Minimal Learning Problem**

We introduce in this section an algorithm to solve the minimal learning problem MLCTL∀(K, S⁺, S⁻). Recall that, by Theorem 4, it admits a solution if and only if the state sample is consistent.

### **5.1 A Bottom-Up Algorithm**

By Theorem 4, there exists a rank n_0 such that the problem L^{≤n_0}_{CTL∀}(K, S) admits a solution. Starting from n = 0, we can therefore try to solve L^{≤n}_{CTL∀}(K, S) incrementally until a (minimal) solution is found, in a similar manner to Neider and Gavran [16]. Algorithm 1 terminates, n_0 being an upper bound on the number of required iterations.

```
Input: a KS K and a sample S.
Output: the smallest CTL∀ formula ϕ
         consistent with S.
n ← 0;
repeat
   n ← n + 1;
   compute Φn;
until Φn is satisfiable by some valuation v;
from v build and return ϕ.
Algorithm 1: Solving MLCTL∀ (K, S).
```
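The loop of Algorithm 1 can be sketched as follows; `encode` and `solve` are hypothetical stand-ins for the construction of Φ_n and the SAT solver call:

```python
def learn_minimal(encode, solve, n_max):
    """Bottom-up search in the spirit of Algorithm 1.

    encode(n) builds the constraint system Φ_n for formulas of size <= n;
    solve(...) returns a truthy valuation iff the system is satisfiable.
    The first satisfiable rank yields a minimal formula; n_max is a
    safety cap for this sketch (the real loop relies on Theorem 4 for
    termination).
    """
    n = 0
    while n < n_max:
        n += 1
        valuation = solve(encode(n))
        if valuation:
            return n, valuation
    return None

# Toy oracle pretending that sizes >= 3 are satisfiable.
result = learn_minimal(lambda n: n, lambda phi: phi >= 3 and {"size": phi}, 10)
```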
### **5.2 Embedding Negations**

The CTL formula ∃F a is equivalent to the CTL∀ formula ¬∀G¬a, yet the former remains more succinct, being of size 2 instead of 4. While CTL∀ has been proven to be as expressive as CTL, the sheer number of negations needed to express an equivalent formula can significantly burden the syntactic DAG. A possible optimization is to no longer consider the negation ¬ as an independent operator but instead embed it in the nodes of the syntactic DAG, as shown in Fig. 2.

Note that such a definition of the syntactic DAG alters one's interpretation of a CTL formula's size: as a consequence, under this optimization, Algorithm 1


**Fig. 2.** The syntactic DAG of ¬ ∃ ⊤ U ¬ ∀X ¬a, before and after embedding negations.

may yield a formula with many negations that is no longer minimal under the original definition of size outlined in Sect. 2.2.

Formally, for each i ∈ [1 .. n], we introduce a new variable ν_i such that ν_i = 0 if and only if the node of index i is negated. As an example, in Fig. 2, ν_1 = ν_3 = ν_4 = 0, but ν_2 = 1 and the sub-formula rooted in node 3 is ¬∀X¬a.

We then change the SAT encoding of CTL∀'s semantics accordingly. We remove the ¬ operator from the syntactic DAG clauses and the set L of labels. We delete its semantics and update the semantic clauses of the other operators. Indeed, the right side of each equivalence expresses the semantics of the operator rooted in node i *before* applying the embedded negation; we must therefore change the left side of the semantic equivalence accordingly, replacing the Boolean variable ϕ_i^q with the formula ϕ̃_i^q = (¬ν_i ∧ ¬ϕ_i^q) ∨ (ν_i ∧ ϕ_i^q), which is equivalent to ϕ_i^q if ν_i = 1 and to ¬ϕ_i^q if ν_i = 0.
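A four-line truth-table check confirms that ϕ̃ collapses to ϕ when ν = 1 and to ¬ϕ when ν = 0 (the function below is purely illustrative):

```python
def phi_tilde(nu, phi):
    """Embedded-negation semantics: (¬ν ∧ ¬ϕ) ∨ (ν ∧ ϕ)."""
    return (not nu and not phi) or (nu and phi)

# Exhaustive check over the four Boolean valuations of (ν, ϕ).
ok = all(phi_tilde(nu, phi) == (phi if nu else not phi)
         for nu in (False, True) for phi in (False, True))
```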

### **5.3 Optimizations and Alternatives**

*Minimizing the Input KS.* In order to guarantee that an input S is indeed a valid sample, one has to ensure that no state in the positive sample is bisimilar to a state in the negative sample. To do so, one has to at least partially compute the bisimilarity relation ∼ on K = (Q, δ, λ). But refining it to completion can be efficiently performed in O(|Q|·|AP| + |δ|·log(|Q|)) operations [1, Thm. 7.41], yielding a bisimilar KS K_min of minimal size.

Minimizing the input KS is advantageous as the size of the semantic clauses depends on the size of K, and the SAT solving step is likely to be the computational bottleneck. As a consequence, we always fully compute the bisimulation relation ∼ on K and minimize said KS.

*Approximating the Recurrence Diameter.* Computing the recurrence diameter of a state q is unfortunately an NP-hard problem that is known to be hard to approximate [4]. A coarse upper bound is α(q) ≤ |Q| − 1: it may however result in a significant number of unnecessary variables and clauses. Fortunately, the decomposition of a KS K into strongly connected components (SCCs) yields a finer over-approximation, shown in Fig. 3, that relies on the ease of computing α in a DAG. It is also more generic and suitable to CTL than existing approximations dedicated to LTL bounded model checking [14].

Contracting each SCC to a single vertex yields a DAG known as the *condensation* of K. We weight each vertex of this DAG G with the number of vertices in the matching SCC. Then, to each state q in the original KS K, we associate the weight β(q) of the heaviest path in the DAG G starting from q's SCC, minus one (in order not to count q). Intuitively, our approximation assumes that a simple path entering an SCC can always visit every single one of its states once before exiting, a property that obviously does not hold for two of the SCCs shown here.
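The computation of β can be sketched in a self-contained way: find the SCCs, build the weighted condensation, then take the heaviest downstream path by memoized recursion. This is an illustrative implementation (recursive Tarjan), not LearnCTL's actual code:

```python
def compute_beta(delta):
    """Over-approximation β of the recurrence diameter α via SCCs."""
    # 1. Tarjan's algorithm: map each state to an SCC identifier.
    index, low, scc_of = {}, {}, {}
    stack, on_stack = [], set()
    counters = {"order": 0, "sccs": 0}

    def dfs(v):
        index[v] = low[v] = counters["order"]; counters["order"] += 1
        stack.append(v); on_stack.add(v)
        for w in delta[v]:
            if w not in index:
                dfs(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            cid = counters["sccs"]; counters["sccs"] += 1
            while True:
                w = stack.pop(); on_stack.discard(w)
                scc_of[w] = cid
                if w == v:
                    break

    for v in delta:
        if v not in index:
            dfs(v)

    # 2. Weighted condensation: each vertex weighs its SCC's size.
    weight = [0] * counters["sccs"]
    for v, c in scc_of.items():
        weight[c] += 1
    succ = [set() for _ in range(counters["sccs"])]
    for v in delta:
        for w in delta[v]:
            if scc_of[v] != scc_of[w]:
                succ[scc_of[v]].add(scc_of[w])

    # 3. Heaviest path in the DAG, minus one so that q itself is not counted.
    memo = {}

    def heaviest(c):
        if c not in memo:
            memo[c] = weight[c] + max((heaviest(d) for d in succ[c]), default=0)
        return memo[c]

    return {q: heaviest(scc_of[q]) - 1 for q in delta}

# Two-state SCC {0, 1} leading to a sink SCC {2} with a self-loop.
beta = compute_beta({0: [1], 1: [0, 2], 2: [2]})
```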

*Encoding the Full Logic.* CTL∀ is semantically exhaustive, but the existential temporal operators commonly appear in the literature; we can therefore consider the learning problem on the full CTL logic by integrating the operators ∃X, ∃F, ∃G, and ∃U and providing a Boolean encoding of their semantics. We also consider the fragment CTL_U = {¬, ∨, ∃X, ∃G, ∃U} used by Roy et al. [22].

**Fig. 3.** An approximation β of the recurrence diameter α relying on SCC decomposition that improves upon the coarse upper bound α(q) ≤ |Q| − 1 = 6.

# **6 Experimental Implementation**

We implement our learning algorithm in a C++ prototype tool LearnCTL<sup>1</sup>, relying on Microsoft's Z3 due to its convenient C++ API. It takes as input a sample of positive and negative KSs with initial states, which are then coalesced into a single KS and a sample of states compatible with the theoretical framework we described. It finally returns a separating CTL∀, CTL, or CTL_U formula, after a sanity check performed by model-checking the input KSs against the learnt formula using a simple algorithm based on Theorem 3.

### **6.1 Benchmark Collection**

We intend to use our tool to generate formulas that can explain flaws in faulty implementations of known protocols. To do so, we consider structures generated by higher-level formalisms such as program graphs: a single mutation in the program graph results in several changes in the resulting KS. This process has been carried out manually according to the following criteria:


<sup>1</sup> publicly available at https://gitlab.lre.epita.fr/adrien/learnctl.

We collected program graphs in a toy specification language for a CTL model-checker class implemented in Java. Furthermore, we also considered PROMELA models from the Spin model checker's [12] repository. Translations were then performed through the Python interface of Spot/LTSmin [9,13].

*Example 1.* Consider the mutual exclusion protocol proposed by [18], specified in PROMELA in Fig. 4, that generates a KS with 55 states. We generate mutants by deleting no more than one line of code at a time, ignoring the variable and process declarations (as they are necessary for the model to compile) and the two assertion lines (which are discarded by our KS generator), our reasoning being that subtle changes yield richer distinguishing formulas.

Furthermore, removing the instruction ncrit-- alone would lead to an infinite state space; thus, its deletion is only considered together with the instruction ncrit++. Finally, we set some atomic propositions of interest: c stands for at least one process being in the critical section (ncrit>0), m for both processes (ncrit>1), and t for process 0's turn. An extra *dead* atomic proposition is added by Spot/LTSMin to represent deadlocked states.

As summarized in Fig. 4, every mutated model, once compared with the original KS, leads to distinguishing formulas characterizing Peterson's protocol: mutations m1, m2, and m3 yield a mutual exclusion property; m4, a liveness property; m5, a fairness property; and m6, a global liveness formula.

**Fig. 4.** Peterson's mutual exclusion protocol in PROMELA and learnt formulas for each deleted instruction.

#### **6.2 Quantitative Evaluation**

We quantitatively assess the performance of the various optimizations and CTL fragments discussed previously. To do so, we further enrich the benchmark series with random mutations of hard-coded KSs: these mutations may alter some states, re-route some edges, and spawn new states. We consider a total of 234 test samples, ranging in size from 11 to 698 after minimization. We perform the benchmarks on a GNU/Linux Debian machine (bullseye) with 24 cores (Intel(R) Xeon(R) CPU E5-2620 @ 2.00 GHz) and 256 GB of RAM, using version 4.8.10 of libz3 and version 1.0 of LearnCTL.

Table 1 displays a summary of these benchmarks: β stands for the refined approximation of the recurrence diameter described in Sect. 5.3; ¬, for the embedding of negations in the syntactic tree introduced in Sect. 5.2. The average size of the syntactic DAGs learnt is 4.14.

Option β yields the greatest improvement, being on average at least 6 times faster than the default configuration; option ¬ further divides the average runtime by at least 2. These two optimizations alone speed up the average runtime by a factor of 12 to 20. The CTL fragment used, all other options being equal, does not influence the average runtime as much (less than twofold in the worst case scenario); (CTLU, β,¬) is the fastest option, closely followed by (CTL∀, β,¬).

Intuitively, approximating the recurrence diameter aggressively cuts down the number of SAT variables needed: assuming that α has upper bound d, we only need n·|Q|·d Boolean variables (ρ_{i,q}^u) instead of n·|Q|². Moreover, embedding negations, despite requiring more complex clauses, results in smaller syntactic DAGs with "free" negations, hence faster computations, keeping in mind that the last SAT instances, being the largest, are the most expensive to solve.


**Table 1.** Number of timeouts at ten minutes | arithmetic mean (in milliseconds) over the 178 samples that never timed out, for various options and fragments.

Figure 5 further displays a log-log plot comparing the runtime of the most relevant fragments and options to (CTLU, β,¬). For a given set of parameters, each point stands for one of the 234 test samples. Points above the black diagonal favour (CTLU, β,¬); points below, the aforementioned option. Points on the second dotted lines at the edges of the figure represent timeouts.

Unsurprisingly, (CTL∀, β,¬) and (CTL, β,¬) outperform (CTLU, β,¬) when a minimal distinguishing formula using the operator ∀U exists: the duality between ∀U and ∃U is complex and, unlike the other operators, cannot be handled at no cost by the embedded negation as it depends on the *release* operator.

**Fig. 5.** Comparing (CTLU, β, ¬) to other options on every sample.

# **7 Conclusion and Further Developments**

We explored in this article the CTL learning problem: we first provided a direct explicit construction before relying on a SAT encoding inspired by bounded model-checking to iteratively find a minimal answer. We also introduced in Sect. 3 an explicit answer to the learning problem that belongs to the fragment CTL(¬,∧,∨, <sup>∀</sup>X, <sup>∃</sup>X). It remains to be seen if a smaller formula can be computed using a more exhaustive selection of CTL operators. A finer grained explicit solution could allow one to experiment with a top-down approach as well.

Moreover, we provided a dedicated C++ implementation, and evaluated it on models of higher-level formalisms such as PROMELA. Since the resulting KSs have large state spaces, further symbolic approaches are to be considered for future work, when dealing with program graphs instead of Kripke structures. In this setting, one might also consider the synthesis problem of the relevant atomic propositions from the exposed program variables. Nevertheless, the experiments on Kripke structures already showcase the benefits of the approximated recurrence diameter computation and of our extension of the syntactic DAG definition, as well as the limited relevance of the target CTL fragment.

Another avenue for optimizations can be inferred from the existing SAT-based LTL learning literature: in particular, Riener et al. [20] relied on a topology-guided approach by explicitly enumerating the possible shapes of the syntactic DAG and solving the associated SAT instances in parallel. Given the small average size of the formulas learnt so far and the quadratic factor impacting the number of semantic clauses such as sem_∀U due to the structural variables l_{i,j} and r_{i,k}, this approach could yield huge performance gains in CTL's case as well.

We relied on Z3's convenient C++ API, but intuit that we would achieve better performance with state-of-the-art SAT solvers such as the winners of the yearly SAT competition [2]. We plan on converting our Boolean encoding to the DIMACS CNF format in order to interface our tool with modern SAT solvers.

Finally, it is known that the bounded learning problem is NP-complete, but we would also like to find the exact complexity class of the minimal CTL learning problem. We intuit that it is not NP-complete, Kripke structures being a denser encoding of information than lists of linear traces: as an example, one can trivially compute an LTL formula (resp. a CTL formula) of polynomial size that distinguishes a sample of ultimately periodic words (resp. of finite computation trees with lasso-shaped leaves), but the same cannot be said of a sample of Kripke structures. It remains to be seen if this intuition can be confirmed or refuted by a formal proof.

# **A Proof of Theorem 3**

*∀F.* Assume that q ⊨ ∀F ϕ. Let us prove that q ⊨ ∀F^{α(q)} ϕ. Consider a run r = (s_i) ∈ R(q). By hypothesis, we can define the index j = min{i ∈ ℕ | s_i ⊨ ϕ}.

Now, assume that j > α(q). By definition of the recurrence diameter α, there exist k_1, k_2 ∈ [0 .. j − 1] such that k_1 < k_2 and s_{k_1} = s_{k_2}. Consider the finite runs u = (s_i)_{i ∈ [0..k_1]} and v = (s_i)_{i ∈ [k_1+1..k_2]}. We define the infinite, ultimately periodic run r′ = u · v^ω = (s′_i). Every state of r′ occurs in r before index j; hence, by minimality of j, ∀ i ∈ ℕ, s′_i ⊭ ϕ. Yet r′ ∈ R(q) and q ⊨ ∀F ϕ: a contradiction. Therefore j ≤ α(q), and (q ⊨ ∀F ϕ) ⟹ (q ⊨ ∀F^{α(q)} ϕ) holds.

Trivially, (q ⊨ ∀F^{α(q)} ϕ) ⟹ (q ⊨ ∀F ϕ) holds. Hence, we have proven both sides of the desired equivalence for ∀F.

*∀G.* Assume that q ⊨ ∀G^{α(q)} ϕ. Let us prove that q ⊨ ∀G ϕ. Consider a run r = (s_i) ∈ R(q) and j ∈ ℕ. Let us prove that s_j ⊨ ϕ.

State s_j is obviously reachable from q. Let us consider a finite run without repetition u = (s′_i)_{i ∈ [0..k]} such that s′_0 = q and s′_k = s_j. By definition of the recurrence diameter, k ≤ α(q). We define the infinite runs v = (s_i)_{i>j} and r′ = u · v. Since r′ ∈ R(q) and q ⊨ ∀G^{α(q)} ϕ, we have s′_k ⊨ ϕ, hence s_j ⊨ ϕ. As a consequence, (q ⊨ ∀G^{α(q)} ϕ) ⟹ (q ⊨ ∀G ϕ).

Trivially, (q ⊨ ∀G ϕ) ⟹ (q ⊨ ∀G^{α(q)} ϕ) holds. Hence, we have proven both sides of the desired equivalence for ∀G.

*∃F and ∃G.* Formula ∃F ϕ (resp. ∃G ϕ) being equivalent to the dual formula ¬∀G¬ϕ (resp. ¬∀F¬ϕ), the previous proofs immediately yield the desired equivalences.

*∀U and ∃U.* We can handle the case of ∀ ϕUψ in a manner similar to ∀F: we prove by contradiction that the first occurrence of ψ always happens in at most α(q) steps. The semantic equivalence for ∃ ϕUψ can be handled in a fashion similar to ∀G: an existing infinite run yields a conforming finite prefix without repetition of length less than or equal to α(q).

# **B Proof of Theorem 5**

Given two dissimilar states q_1, q_2 ∈ Q, let us prove by induction on the characteristic number c_{q_1,q_2} = C_K({q_1}, {q_2}) that q_1 ⊨ D_{q_1,q_2} and q_2 ⊭ D_{q_1,q_2}.


By induction hypothesis, D_{q′_1,q′_2} is well-defined, q′_1 ⊨ D_{q′_1,q′_2}, and q′_2 ⊭ D_{q′_1,q′_2}.

$$\text{As a consequence, } D\_{q\_1, q\_2} \text{ is well-defined, } q\_1 \models \exists \mathsf{X} \left( \bigwedge\_{q\_2' \in \delta(q\_2)} D\_{q\_1', q\_2'} \right), \text{ and } q\_2 \not\models \exists \mathsf{X} \left( \bigwedge\_{q\_2' \in \delta(q\_2)} D\_{q\_1', q\_2'} \right).$$

We handle the case where ∃ q′_2 ∈ δ(q_2), ∀ q′_1 ∈ δ(q_1), q′_1 ≁_c q′_2 in a similar fashion. As a consequence, the property holds at rank c + 1.

Therefore, for each q⁺ ∈ S⁺ and each q⁻ ∈ S⁻, q⁺ ⊨ D_{q⁺,q⁻} and q⁻ ⊭ D_{q⁺,q⁻}. Hence, q⁺ ⊨ S_K(S⁺, S⁻) and q⁻ ⊭ S_K(S⁺, S⁻).

# **C Proof of Corollary 1**

First, given q⁺ ∈ S⁺ and q⁻ ∈ S⁻, let us bound the size of D_{q⁺,q⁻} based on their characteristic number c_{q⁺,q⁻} = C_K({q⁺}, {q⁻}).

$$\begin{aligned} c\_{q^+, q^-} = 0 &\implies |D\_{q^+, q^-}| \le 2 \quad \text{as } \lambda(q^+) \ne \lambda(q^-), \\ c\_{q^+, q^-} \ge 1 &\implies |D\_{q^+, q^-}| \le (k+1) + \sum\_{q\_2' \in \delta(q^-)} |D\_{q\_1', q\_2'}| \quad \text{for some } q\_1' \in \delta(q^+) \\ &\quad\ \ \text{or } |D\_{q^+, q^-}| \le (k+1) + \sum\_{q\_1' \in \delta(q^+)} |D\_{q\_1', q\_2'}| \quad \text{for some } q\_2' \in \delta(q^-) \end{aligned}$$

We are looking for an upper bound (U_n)_{n≥0} such that ∀ n ∈ ℕ, ∀ q⁺ ∈ S⁺, ∀ q⁻ ∈ S⁻, if c_{q⁺,q⁻} ≤ n, then |D_{q⁺,q⁻}| ≤ U_n. We define it inductively:

$$\begin{aligned} U\_0 &= 2\\ U\_{n+1} &= k \cdot U\_n + k + 1 \end{aligned}$$

Assuming k ≥ 2, we can make the bound explicit: U_n = (2 + (k+1)/(k−1)) · k^n − (k+1)/(k−1) ≤ 5 · k^n. As ({q⁺}, {q⁻}) is a sub-sample of S, c_{q⁺,q⁻} ≤ c and |D_{q⁺,q⁻}| ≤ U_c ≤ 5 · k^c. We can finally bound the size of S_K(S⁺, S⁻):

$$\begin{aligned} |\mathcal{S}\_{\mathcal{K}}(S^+, S^-)| &\leq (|S^+| - 1) \cdot (|S^-| - 1) + \sum\_{\substack{q^+ \in S^+ \\ q^- \in S^-}} |D\_{q^+, q^-}| \\ &\leq |S^+| \cdot |S^-| + |S^+| \cdot |S^-| \cdot U\_c \\ &\leq (5 \cdot k^c + 1) \cdot |S^+| \cdot |S^-| \end{aligned}$$

This yields the aforementioned upper bound. If k = 1, then U_n = 2 · n + 2 and the rest of the proof is similar to the previous case.
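The recurrence defining (U_n) and the closed form used for k ≥ 2 can be cross-checked numerically; the sketch below uses exact rational arithmetic to avoid any rounding concern:

```python
from fractions import Fraction

def u_rec(k, n):
    """U_n through the inductive definition U_0 = 2, U_{n+1} = k·U_n + k + 1."""
    u = 2
    for _ in range(n):
        u = k * u + k + 1
    return u

def u_closed(k, n):
    """Closed form (2 + (k+1)/(k-1)) · k^n - (k+1)/(k-1), assuming k >= 2."""
    f = Fraction(k + 1, k - 1)
    return (2 + f) * k ** n - f
```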

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### MCSat-Based Finite Field Reasoning in the **Yices2** SMT Solver (Short Paper)

Thomas Hader<sup>1</sup>(B), Daniela Kaufmann<sup>1</sup>, Ahmed Irfan<sup>2</sup>, Stéphane Graham-Lengrand<sup>2</sup>, and Laura Kovács<sup>1</sup>

<sup>1</sup> TU Wien, Vienna, Austria {thomas.hader,daniela.kaufmann,laura.kovacs}@tuwien.ac.at <sup>2</sup> SRI International, Menlo Park, CA, USA {ahmed.irfan,stephane.graham-lengrand}@sri.com

Abstract. This system description introduces an enhancement to the Yices2 SMT solver, enabling it to reason over non-linear polynomial systems over finite fields. Our reasoning approach fits into the modelconstructing satisfiability (MCSat) framework and is based on zero decomposition techniques, which find finite basis explanations for theory conflicts over finite fields. As the MCSat solver within Yices2 can support (and combine) several theories via theory plugins, we implemented our reasoning approach as a new plugin for finite fields and extended Yices2's frontend to parse finite field problems, making our implementation the first MCSat-based reasoning engine for finite fields. We present its evaluation on finite field benchmarks, comparing it against cvc5. Additionally, our work leverages the modular architecture of the MCSat solver in Yices2 to provide a foundation for the rapid implementation of further reasoning techniques for this theory.

Keywords: SMT solving · MCSat · finite fields · polynomial arithmetic

# 1 Introduction

Satisfiability Modulo Theories (SMT) solving plays a crucial role in automated reasoning as it combines the power of Boolean satisfiability (SAT) with various mathematical background theories [3]. This connection enables the automated verification and synthesis of systems [15] that require reasoning in more expressive logical theories, for example real/integer arithmetic.

State-of-the-art SMT solvers employ a combination of Boolean level reasoning and theory-specific algorithms. This is achieved either through the use of the CDCL(T) paradigm [16] or the model-constructing satisfiability (MCSat) approach [11,14]. The MCSat algorithm lifts the Boolean-level CDCL algorithm to the theory level, while keeping the search theory independent. This approach is particularly effective for handling complex arithmetic theories. For instance, Yices2 [5] uses the MCSat approach to handle non-linear arithmetic constraints.

Finite fields offer an ideal framework for modeling bounded machine arithmetic, particularly relevant in the context of contemporary cryptosystems utilized in system security and post-quantum cryptography. Current methodologies, for instance, develop private and secure systems using zero-knowledge (ZK) proofs [7] or authenticate blockchain technologies like smart contracts [19]. Verifying applications in such areas requires efficient SMT solvers that support reasoning over finite field arithmetic, e.g., for the verification of a compiler for ZK proofs [18].

*Related Work.* Currently, related work on SMT solving in finite field arithmetic is still rather limited. Our own theoretical work [9] on MCSat approaches based on finding zero decompositions comes with a proof-of-concept implementation that supports only a very basic MCSat algorithm, has only limited support for Boolean propagation, and is unable to parse SMT-LIB 2 [2].

The only other SMT solver that we are aware of being capable of reasoning over finite fields is cvc5 [1,17], which uses a classical CDCL(T) approach. As its theory engine, it applies Gröbner basis [4] reasoning over a set of polynomial equalities. If the derived Gröbner basis contains the constant 1, then the system is unsatisfiable and a conflict core for the CDCL(T) search can be found. Otherwise, a guided enumeration of all possible solutions is performed to search for a model.

Note that both approaches [9,17] use complementary techniques. On the one hand, Gröbner bases are highly engineered to find conflicts in the polynomial input, which tends to help for unsatisfiable instances [17]. On the other hand, a model constructing approach tends to be fast whenever there is a solution, especially when there is a high number of models [9].

We further note that at the moment our implementation in Yices2, as well as cvc5, is restricted to finite fields where the order (i.e. size) is prime. This limitation is sufficient for many applications in cryptosystems and ZK proofs.

Besides using dedicated finite field solvers, problems over prime fields can be encoded in integer arithmetic using the modulo operator. Further, since terms are bounded, encoding as bit-vectors for subsequent bit-blasting is possible. However, prior experiments have shown that those encodings perform poorly on existing solvers [17].

*Contributions.* We present an integration of the theory of non-linear finite field arithmetic in the Yices2 SMT solver [5], enabling it to reason over finite field problems. This includes the following contributions which we will further explain throughout the rest of this paper:


To the best of our knowledge, our work is currently the only finite field instantiation of MCSat. While our initial theory reasoning approach follows closely the explanation generation procedure of our previous work [9], our implementation allows easy drop-in of an improved explanation procedure in the future.

# 2 Preliminaries

In mathematics, a finite field is a field that contains a finite number of elements. A finite field F<sub>p</sub> of prime order p can roughly be seen as the representation of the integers modulo the prime p. We refer to [9,17] for details on the theory and representation of finite fields. Since there is no inherent order on finite fields, polynomial constraints are either equalities p = 0 or disequalities p ≠ 0 for a finite field polynomial p.

For SMT solving in finite fields, we are interested in the following problem:

Given a finite field F<sub>p</sub>, where p is a prime number, let X = x<sub>1</sub>,...,x<sub>n</sub>, let F be a set of polynomial constraints in F<sub>p</sub>[X], and let ℱ be a formula following the logical structure:

$$\mathcal{F} = \bigwedge\_{C \subseteq F} \bigvee\_{f \in C} f \quad = \bigwedge\_{C \subseteq F} \bigvee\_{f \in C} \mathsf{poly}(f) \rhd 0 \quad \text{with } \rhd \in \{=, \neq\} .$$

*SMT solving over finite fields:* Does an assignment ν : {x<sub>1</sub>,...,x<sub>n</sub>} → F<sub>p</sub> exist that satisfies ℱ?

For example, the formula ℱ<sub>1</sub> = (x − 1 = 0 ∨ y − 1 = 0) ∧ (xy − 1 = 0) is satisfied by the assignment x ↦ 1, y ↦ 1 in F<sub>3</sub>; whereas the formula ℱ<sub>2</sub> = (x − 1 = 0 ∨ y − 1 = 0) ∧ (xy − 1 = 0) ∧ (x − 2 = 0) is unsatisfiable in F<sub>3</sub>.
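For intuition, the decision problem can be checked by brute force for very small fields. The sketch below (illustrative only; the representation and function names are ours, not part of Yices2) enumerates all assignments over F<sub>p</sub> and evaluates the clause structure on the two example formulas:

```python
from itertools import product

def satisfying_assignment(clauses, variables, p):
    """Brute-force SMT check over F_p: each clause is a list of
    (poly, op) pairs, where poly maps an assignment dict to an
    integer and op is '=' or '!='; the formula is a conjunction of
    clauses, each a disjunction of polynomial constraints."""
    for values in product(range(p), repeat=len(variables)):
        nu = dict(zip(variables, values))
        if all(any((poly(nu) % p == 0) == (op == '=')
                   for poly, op in clause)
               for clause in clauses):
            return nu
    return None

# F1 = (x - 1 = 0 \/ y - 1 = 0) /\ (x*y - 1 = 0) over F_3
f1 = [[(lambda a: a['x'] - 1, '='), (lambda a: a['y'] - 1, '=')],
      [(lambda a: a['x'] * a['y'] - 1, '=')]]
print(satisfying_assignment(f1, ['x', 'y'], 3))   # {'x': 1, 'y': 1}

# F2 additionally asserts (x - 2 = 0), making the formula unsatisfiable
f2 = f1 + [[(lambda a: a['x'] - 2, '=')]]
print(satisfying_assignment(f2, ['x', 'y'], 3))   # None
```

Enumeration is of course exponential in n; the point of the MCSat machinery described below is to avoid exactly this blow-up.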

*Yices2 and MCSat.* Yices2 contains two main solvers, one based on the traditional CDCL(T) approach [16] and one based on the MCSat approach [13,14]. Yices2's common API and front-ends can automatically select which solver to use at runtime, depending on the given SMT-LIB 2 logic. In particular, when non-linear real or integer arithmetic constraints are present, Yices2 selects the MCSat solver. The MCSat solver in Yices2 currently supports the theories of non-linear real arithmetic (QF\_NRA) [13] and integer arithmetic (QF\_NIA) [10], bit-vectors (QF\_BV) [8], equality and uninterpreted functions (QF\_EUF), arrays [6], and combinations thereof.

In contrast to CDCL(T) that *complements* CDCL with theory reasoning, MCSat applies CDCL-like mechanisms to *perform* theory reasoning. Specifically, it explicitly and incrementally constructs models with first-order variable assignments—maintained in a *trail*—while maintaining the invariant that none of the constraints evaluate to false. MCSat decides upon such assignments when there is choice, it can propagate them when there is not, and it backtracks upon conflict. The lemmas learned upon backtracking are based on theory-specific explanations of conflicts and propagations. This theory-specific reasoning is implemented through *plugins* that provide interfaces to make decisions, perform propagations, and produce explanations.

# 3 Usability of SMT Solving in Finite Fields

Support for finite field reasoning in Yices2 is available on the master branch<sup>1</sup> and will be included in the next release (2.7). The theory of finite fields can be accessed using a not-yet-official extension of the SMT-LIB 2 language (.smt2).

*SMT-LIB 2 Parsing.* Extending the parser to handle finite field problems was our first extension to Yices2. Currently, polynomials over finite fields are not an official theory in SMT-LIB 2 [2]. However, when implementing finite field support in cvc5 [1], an extension was proposed in [17]. In the interest of keeping inputs and benchmarks comparable, we aimed for a compatible implementation. Standardization efforts to create an official SMT-LIB 2 theory for finite field arithmetic are currently ongoing.

In the SMT-LIB 2 extension, the theory of (quantifier-free) non-linear finite field arithmetic is denoted as QF\_FFA. Elements can be defined using the sort FiniteField. The sort is indexed by the order of the finite field, which is required to be a prime number. For instance, (\_ FiniteField 3) defines the finite field of order 3. Constants are indexed with the field order to indicate which finite field they belong to, e.g., (\_ ff2 3). Note that the integer following ff is interpreted modulo the field order. As a shortcut to avoid rewriting the field order over and over again, the as keyword can be used to interpret the constant in the correct field type: (as ff2 FF3) for a defined finite field sort FF3. To specify finite field operations, ff.mul and ff.add are available for multiplication and addition of finite field values, respectively. Both support an arbitrary number of operands. Atoms over finite field terms can be built with =, with its usual meaning. For example, an encoding of ℱ<sub>1</sub> can be seen in Fig. 1.

Fig. 1. Example of an SMT-LIB 2 encoding of the finite field problem ℱ<sub>1</sub>.
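A sketch of such an encoding, assuming the cvc5-compatible syntax described above (the defined sort alias FF3 is our own illustrative choice):

```smt2
(set-logic QF_FFA)
(define-sort FF3 () (_ FiniteField 3))
(declare-const x FF3)
(declare-const y FF3)
; (x - 1 = 0) or (y - 1 = 0), i.e., x = 1 or y = 1
(assert (or (= x (as ff1 FF3)) (= y (as ff1 FF3))))
; x * y - 1 = 0, i.e., x * y = 1
(assert (= (ff.mul x y) (as ff1 FF3)))
(check-sat)
```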

<sup>1</sup> Available at https://github.com/SRI-CSL/yices2.

# 4 Implementation Details

*The Implementation of MCSat in* Yices2*.* The MCSat solver in Yices2 supports multiple theories via a notion of theory *plugin* that builds upon an earlier architecture [11]. An MCSat theory plugin in Yices2 implements a number of functionalities that are given to the main MCSat solver as function pointers. The main MCSat loop calls these functions for theory-specific operations such as deciding or propagating the value of variables or getting explanation lemmas, or upon certain events such as the creation of new terms and lemmas. In return, a theory plugin can access theory-generic mechanisms for, e.g., inspecting the MCSat trail, creating variables and requesting to be notified of certain events like variable assignments, as well as raising conflicts. A theory plugin is not required to implement mechanisms for propagating theory assignments and explaining them, but for the current theories in Yices2, such propagations have provided noticeable speed-ups (see, e.g., [8]).

*The Finite Field MCSat Plugin.* Before handling constraints in the finite field MCSat plugin, the input assertions are represented as polynomial constraints. Limited preprocessing (e.g., constant propagation) is performed at this step. Internally, the plugin only handles polynomial equalities and disequalities. The implementation of the finite field plugin follows an approach similar to the plugin for non-linear arithmetic [10].

Using the MCSat trail, the finite field plugin reads which constraints must be fulfilled at any given time (as decided or deduced by the Boolean plugin) and tracks the assignment of values to polynomial variables. It also tracks, for each polynomial variable, the *set of feasible values* that the variable can be assigned without any of the polynomial constraints evaluating to false: Using watch lists, it detects when any of the constraints becomes *unit*, i.e. when all of its variables but one have been assigned values. Upon such detection, it computes how the constraint restricts the set of feasible values for the last remaining variable, using regular univariate polynomial factorization. When that set becomes empty, the plugin reports a theory conflict to the main MCSat engine. Given a conflict core and the current assignment, the *explanation procedure* in the plugin generates a (globally valid) lemma that explains the conflict in that it excludes a class of assignments (including the current one) that all violate the conflict core. The MCSat engine performs conflict analysis using theory explanations and Boolean resolutions, and either backtracks if it can or concludes unsatisfiability. On the other hand, the instance is satisfiable once all variables are assigned a value.
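The feasible-set computation for a unit constraint can be illustrated as follows. This is a simplified sketch with names of our own choosing: it enumerates candidate values, whereas the actual plugin factors univariate polynomials instead of enumerating.

```python
def feasible_values(constraints, assignment, var, p):
    """Compute how unit constraints restrict the one remaining
    unassigned variable `var` over F_p. Each constraint is a
    (poly, op) pair with poly a function of a full assignment and
    op either '=' or '!='."""
    feasible = set(range(p))
    for poly, op in constraints:
        roots = {v for v in range(p)
                 if poly({**assignment, var: v}) % p == 0}
        feasible &= roots if op == '=' else set(range(p)) - roots
    return feasible

# With y := 2 in F_5, the unit constraints x*y - 1 = 0 and x - 3 != 0
# restrict x: 2x = 1 mod 5 forces x = 3, which the disequality removes.
cons = [(lambda a: a['x'] * a['y'] - 1, '='),
        (lambda a: a['x'] - 3, '!=')]
print(feasible_values(cons, {'y': 2}, 'x', 5))  # set() -> theory conflict
```

An empty result corresponds exactly to the situation in which the plugin reports a theory conflict to the main MCSat engine.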

*Finite Field Explanations.* In our earlier work [9], we presented an explanation procedure for finite fields. This approach employs subresultant regular subchains (SRS) [20] between conflicting polynomials to provide new polynomial constraints that can be propagated. In a nutshell, SRS can be used to construct a generalized greatest common divisor (GCD) of polynomials that takes into account the current partial variable assignment on the trail. The computed GCD is utilized in a zero decomposition procedure to reduce the degree of the conflicting polynomials until we can learn a polynomial constraint that excludes the current partial assignment. This constraint is added as an explanation clause to resolve the conflict. We implemented the procedure of [9] in the current version of Yices2 using LibPoly. However, it is important to note that there are other solving techniques for polynomial systems over finite fields that could potentially be utilized to develop an explanation method suitable for an MCSat-based search. Furthermore, it is still an open question how different techniques perform in an MCSat environment. That is why we have kept the explanation procedure encapsulated in our implementation, allowing for easy extension in order to support the development and evaluation of future explanation procedures.
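The degree-reduction idea can be illustrated with the ordinary Euclidean GCD of univariate polynomials over F<sub>p</sub>. This is a much-simplified stand-in for the SRS-based generalized GCD of [9] (which additionally respects the partial assignment on the trail); the coefficient-list representation and names are ours.

```python
def gcd_mod_p(f, g, p):
    """Monic GCD of univariate polynomials over F_p (p prime),
    coefficients listed from the constant term up."""
    def trim(h):
        while h and h[-1] % p == 0:
            h.pop()
        return h
    def poly_mod(a, b):
        a, b = trim(list(a)), trim(list(b))
        inv = pow(b[-1], p - 2, p)            # inverse of leading coefficient
        while len(a) >= len(b) > 0:
            coef = a[-1] * inv % p
            shift = len(a) - len(b)
            for i, bi in enumerate(b):        # subtract coef * x^shift * b
                a[i + shift] = (a[i + shift] - coef * bi) % p
            a = trim(a)
        return a
    f, g = trim(list(f)), trim(list(g))
    while g:
        f, g = g, poly_mod(f, g)
    inv = pow(f[-1], p - 2, p)                # normalize to a monic polynomial
    return [c * inv % p for c in f]

# f = x^2 - 1 = (x-1)(x+1) and g = x^2 + x - 2 = (x-1)(x+2) over F_5:
# the shared root x = 1 shows up as the GCD x - 1.
print(gcd_mod_p([4, 0, 1], [3, 1, 1], 5))  # [4, 1], i.e., x + 4 = x - 1 mod 5
```

Each remainder step strictly reduces the degree, mirroring how the zero decomposition procedure lowers the degree of the conflicting polynomials until a small excluding constraint remains.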

# 5 Evaluation

Since finite field solving is a rather new endeavor in the world of SMT, no extensive set of SMT-LIB 2 benchmarks exists yet. For the evaluation, we selected the benchmark sets presented in the papers describing the theory behind the implementations in Yices2 and cvc5:


*Experimental Setup.* Our experiments were run on an AMD EPYC 7502 CPU with a timeout of 300 seconds per benchmark instance. We compared our implementation in Yices2 against cvc5 version 1.1.1, the latest released version at the time of writing. We are not aware of any further SMT solvers supporting the theory of non-linear finite field arithmetic that could be included in the comparison.

*Experimental Results.* The performance comparison between the two solvers on the first benchmark set can be seen in Fig. 2 and Fig. 3 (left). It is clear that the random instances are harder to solve than the crafted instances (which have significantly more variables). We believe that this is due to the lack of internal structure in random polynomials. This makes symbolic handling of those polynomial systems hard, both for Gröbner basis computation (in cvc5) and for SRS computation (in Yices2).

Note that cvc5 performs most symbolic computation upfront (when generating the Gröbner basis) and enumerates potential solutions in a second step (using auxiliary Gröbner basis calls). The MCSat approach in Yices2, on the other hand, interleaves model generation and symbolic computation during the search. This tends to be an advantage for harder polynomial systems, especially together with small finite field orders. When the finite field order increases, this advantage seems to vanish. For the crafted polynomial benchmarks, Yices2 tends to be faster. We believe that this is due to the fact that the polynomials tend to be large (in the number of monomials), but rather easy to solve. Generating a full Gröbner basis upfront might add significant overhead.

Fig. 2. Runtime comparison for benchmarks from [9] (in seconds, timeout 300 s); result: sat ○, unsat ×; finite field order: 3 (blue), 13 (green), and 211 (orange).

For the second benchmark set, many instances can be solved by both solvers immediately (cf. Fig. 3, right). We believe that those instances can be solved without extensive finite field reasoning, as the benchmark set contains Boolean structure. This enables both solvers to successfully solve benchmarks even with vast field orders. However, once extensive algebraic reasoning is required in finite fields of vast order (the majority of the benchmarks), the purely symbolic approach of cvc5 in proving unsatisfiability seems to be advantageous. An MCSat approach requires picking actual values in a gigantic search space, so especially strong lemmas need to be learned in order to prune the search space efficiently. Improving the explanation procedure is part of our future work.

Fig. 3. Instances solved over time (timeout 300s) by Yices2 and cvc5 from [9] (left) and [17] (right).

# 6 Summary and Outlook

In this system description we have presented the first implementation of an MCSat-based decision procedure for non-linear finite field polynomials. We have shown that MCSat is a feasible way of solving SMT instances over finite fields and it compares well with SMT approaches using Gröbner bases for many instances.

The presented tool implementation is well suited for future experiments and the rapid development of more advanced explanation procedures that will eliminate the current bottlenecks with regard to large finite fields.

Acknowledgments. This work was conducted during the first author's stay at SRI International. We acknowledge funding from the ERC Consolidator Grant ARTIST 101002685, the TU Wien SecInt Doctoral College, the FWF grants SFB 10.55776/F8504 and ESPRIT 10.55776/ESP666, the WWTF ICT22-007 project ForSmart, the NSF award CCRI-2016597, the Amazon Research Award 2024 QuAT, and from SRI Internal Research And Development funds. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the US Government or NSF.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Certified MaxSAT Preprocessing

Hannes Ihalainen1(B) , Andy Oertel2,3 , Yong Kiam Tan<sup>4</sup> , Jeremias Berg<sup>1</sup> , Matti Järvisalo<sup>1</sup> , Magnus O. Myreen<sup>5</sup> , and Jakob Nordström2,3

<sup>1</sup> Department of Computer Science, University of Helsinki, Helsinki, Finland — {hannes.ihalainen,jeremias.berg,matti.jarvisalo}@helsinki.fi
<sup>2</sup> Lund University, Lund, Sweden — andy.oertel@cs.lth.se, jn@di.ku.dk
<sup>3</sup> University of Copenhagen, Copenhagen, Denmark
<sup>4</sup> Institute for Infocomm Research (I2R), A\*STAR, Singapore, Singapore — tanyk1@i2r.a-star.edu.sg
<sup>5</sup> Chalmers University of Technology, Gothenburg, Sweden — myreen@chalmers.se

Abstract. Building on the progress in Boolean satisfiability (SAT) solving over the last decades, maximum satisfiability (MaxSAT) has become a viable approach for solving NP-hard optimization problems. However, ensuring correctness of MaxSAT solvers has remained a considerable concern. For SAT, this is largely a solved problem thanks to the use of proof logging, meaning that solvers emit machine-verifiable proofs to certify correctness. However, for MaxSAT, proof logging solvers have started being developed only very recently. Moreover, these nascent efforts have only targeted the core solving process, ignoring the preprocessing phase where input problem instances can be substantially reformulated before being passed on to the solver proper.

In this work, we demonstrate how pseudo-Boolean proof logging can be used to certify the correctness of a wide range of modern MaxSAT preprocessing techniques. By combining and extending the VeriPB and CakePB tools, we provide formally verified end-to-end proof checking that the input and preprocessed output MaxSAT problem instances have the same optimal value. An extensive evaluation on applied MaxSAT benchmarks shows that our approach is feasible in practice.

Keywords: maximum satisfiability · preprocessing · proof logging · formally verified proof checking

# 1 Introduction

The development of Boolean satisfiability (SAT) solvers is arguably one of the true success stories of modern computer science—today, SAT solvers are routinely used as core engines in many types of complex automated reasoning systems. One example of this is SAT-based optimization, usually referred to as *maximum satisfiability (MaxSAT) solving.* The improved performance of SAT solvers, coupled with increasingly sophisticated techniques for using SAT solver calls to reason about optimization problems, has made MaxSAT solvers a powerful tool for tackling real-world NP-hard optimization problems [8].

However, modern MaxSAT solvers are quite intricate pieces of software, and it has been shown repeatedly in the MaxSAT evaluations [51] that even the best solvers sometimes report incorrect results. This was previously a serious issue also for SAT solvers (see, e.g., [13]), but the SAT community has essentially eliminated this problem by requiring that solvers should be *certifying* [1,53], i.e., not only report whether a given formula is satisfiable or unsatisfiable but also produce a machine-verifiable proof that this conclusion is correct. Many different SAT proof formats such as RUP [33], TraceCheck [7], GRIT [17], and LRAT [16] have been proposed, with DRAT [35,36,74] established as the de facto standard; for the last ten years, proof logging has been compulsory in the (main track of the) SAT competitions [66]. It is all the more striking, then, that until recently no similar developments have been observed in MaxSAT solving.

### 1.1 Previous Work

A first natural question to ask—since MaxSAT solvers are based on repeated calls to SAT solvers—is why we cannot simply use SAT proof logging also for MaxSAT. The problem is that DRAT can only reason about clauses, whereas MaxSAT solvers argue about costs of solutions and values of objective functions. Translating such claims to clausal form would require an external tool to certify correctness of the translation. Also, such clausal translations incur a significant overhead and do not seem well-adapted for, e.g., counting arguments in MaxSAT.

While there have been several attempts to design proof systems specifically for MaxSAT solving [11,23,39,45,57,58,63–65], none of these have come close to providing a general proof logging solution, because they apply only for very specific algorithm implementations and/or fail to capture the full range of techniques used. Recent papers have instead proposed using pseudo-Boolean proof logging with VeriPB [9,32] to certify correctness of so-called solution-improving solvers [72] and core-guided solvers [4]. Although these works demonstrate, for the first time, practical proof logging for modern MaxSAT solving, the methods developed thus far only apply to the core solving process. This ignores the preprocessing phase, where the input formula can undergo major reformulation. State-of-the-art solvers sometimes use stand-alone preprocessor tools, or sometimes integrate preprocessing-style reasoning more tightly within the MaxSAT solver engine, to speed up the search for optimal solutions. Some of these preprocessing techniques are lifted from SAT to MaxSAT, but there are also native MaxSAT preprocessing methods that lack analogies in SAT solving.

### 1.2 Our Contribution

In this paper, we show, for the first time, how to use pseudo-Boolean proof logging with VeriPB to produce proofs of correctness for a wide range of preprocessing techniques used in modern MaxSAT solvers. VeriPB proof logging has previously been successfully used not only for core MaxSAT search as discussed above, but also for advanced SAT solving techniques (including symmetry breaking) [9,27,32], subgraph solving [28–30], constraint programming [22,31,54,55], and 0–1 ILP presolving [37], and we add MaxSAT preprocessing to this list.

In order to do so, we extend the VeriPB proof format to include an *output section* where a reformulated output can be presented, and where the pseudo-Boolean proof establishes that this output formula and the input formula are *equioptimal*, i.e., have optimal solutions of the same value. We also enhance CakePB [10,29]—a verified proof checker for pseudo-Boolean proofs—to handle proofs of reformulation. In this way, we obtain an end-to-end formally verified toolchain for certified preprocessing of MaxSAT instances.

It is worth noting that although preprocessing is also a critical component in SAT solving, we are not aware of any tool for certifying reformulations even for the restricted case of decision problems, i.e., showing that formulas are *equisatisfiable*—the DRAT format and tools support proofs that satisfiability of an input CNF formula F implies satisfiability of an output CNF formula G but not the converse direction (except in the special case where F is a subset of G). To the best of our knowledge, our work presents the first practical tool for proving (two-way) equisatisfiability or equioptimality of reformulated problems.

We have performed computational experiments running a MaxSAT preprocessor with proof logging and proof checking on benchmarks from the MaxSAT evaluations [51]. Although there is certainly room for improvements in performance, these experiments provide empirical evidence for the feasibility of certified preprocessing for real-world MaxSAT benchmarks.

### 1.3 Organization of This Paper

After reviewing preliminaries in Sect. 2, we explain our pseudo-Boolean proof logging for MaxSAT preprocessing in Sect. 3, and Sect. 4 discusses verified proof checking. We present results from a computational evaluation in Sect. 5, after which we conclude with a summary and outlook for future work in Sect. 6.

# 2 Preliminaries

We write ℓ to denote a literal, i.e., a {0, 1}-valued Boolean variable x or its negation x̄ = 1 − x. A *clause* C = ℓ<sub>1</sub> ∨ ... ∨ ℓ<sub>k</sub> is a disjunction of literals, where a *unit clause* consists of only one literal. A formula in *conjunctive normal form (CNF)* F = C<sub>1</sub> ∧ ... ∧ C<sub>m</sub> is a conjunction of clauses, where we think of clauses and formulas as sets so that there are no repetitions and order is irrelevant.

A *pseudo-Boolean (PB) constraint* is a 0–1 linear inequality Σ<sub>j</sub> a<sub>j</sub>ℓ<sub>j</sub> ≥ b, where, when convenient, we can assume all literals ℓ<sub>j</sub> to refer to distinct variables and all integers a<sub>j</sub> and b to be positive (so-called *normalized form*). A *pseudo-Boolean formula* is a conjunction of such constraints. We identify the clause C = ℓ<sub>1</sub> ∨ ··· ∨ ℓ<sub>k</sub> with the pseudo-Boolean constraint PB(C) = ℓ<sub>1</sub> + ··· + ℓ<sub>k</sub> ≥ 1, so a CNF formula F is just a special type of PB formula PB(F) = {PB(C) | C ∈ F}.

A *(partial) assignment* ρ, mapping variables to {0, 1} and extended to literals by respecting the meaning of negation, satisfies a PB constraint Σ<sub>j</sub> a<sub>j</sub>ℓ<sub>j</sub> ≥ b if Σ<sub>j : ρ(ℓ<sub>j</sub>)=1</sub> a<sub>j</sub> ≥ b (assuming normalized form). A PB formula is satisfied by ρ if all constraints in it are. We also refer to total satisfying assignments ρ as *solutions*. In a *pseudo-Boolean optimization (PBO)* problem we ask for a solution minimizing a given *objective* function O = Σ<sub>j</sub> c<sub>j</sub>ℓ<sub>j</sub> + W, where c<sub>j</sub> and W are integers and W represents a trivial lower bound on the minimum cost.
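For illustration, normalization and satisfaction checking can be sketched as follows (the `(var, is_positive)` literal encoding and all names are ours, not part of VeriPB):

```python
def normalize(terms, b):
    """Put sum(a_j * lit_j) >= b into normalized form: distinct
    variables, positive coefficients. A literal is (var, is_positive);
    using ~x = 1 - x, a term a * ~x equals a - a * x, so flipping a
    literal also adjusts the degree b."""
    combined = {}
    for a, (var, pos) in terms:
        if not pos:                       # a * ~x  ->  -a * x, b -> b - a
            a, b = -a, b - a
        combined[var] = combined.get(var, 0) + a
    out = []
    for var, a in combined.items():
        if a < 0:                         # flip back to a positive literal
            out.append((-a, (var, False)))
            b -= a
        elif a > 0:
            out.append((a, (var, True)))
    return out, max(b, 0)

def satisfies(norm_terms, b, rho):
    """Check a normalized constraint under a total assignment rho."""
    return sum(a for a, (v, pos) in norm_terms
               if rho[v] == (1 if pos else 0)) >= b

# x - 2y >= 0 normalizes to x + 2*~y >= 2
terms, b = normalize([(1, ('x', True)), (-2, ('y', True))], 0)
print(terms, b)   # [(1, ('x', True)), (2, ('y', False))] 2
print(satisfies(terms, b, {'x': 1, 'y': 1}))  # False: 1 < 2
```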

### 2.1 Pseudo-Boolean Proof Logging Using Cutting Planes

The pseudo-Boolean proof logging in VeriPB is based on the *cutting planes* proof system [15] with extensions as discussed briefly next. We refer the reader to [14] for an in-depth discussion of cutting planes and to [9,26,37,73] for more detailed information about the VeriPB proof system and format.

A pseudo-Boolean proof maintains two sets of *core constraints* C and *derived constraints* D under which the objective O should be minimized. At the start of the proof, C is initialized to the constraints in the input formula F. Any constraints derived by the rules described below are placed in D, from where they can later be moved to C (but not vice versa). The proof system semantics preserves the invariant that the optimal value of any solution to C and to the original input problem F is the same. New constraints can be derived from C ∪ D by performing *addition* of two constraints or *multiplication* of a constraint by a positive integer, and *literal axioms* ℓ ≥ 0 can be used at any time. Additionally, we can apply *division* to Σ<sub>j</sub> a<sub>j</sub>ℓ<sub>j</sub> ≥ b by a positive integer d followed by rounding up to obtain Σ<sub>j</sub> ⌈a<sub>j</sub>/d⌉ ℓ<sub>j</sub> ≥ ⌈b/d⌉, and *saturation* to yield Σ<sub>j</sub> min{a<sub>j</sub>, b} · ℓ<sub>j</sub> ≥ b (where we again assume normalized form).
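The division and saturation rules are easy to state in code. The sketch below operates on normalized constraints represented as lists of (coefficient, literal) pairs; the representation and names are ours.

```python
def divide(terms, b, d):
    """Cutting-planes division: divide a normalized constraint
    sum(a_j * lit_j) >= b by a positive integer d, rounding up
    (-(-n // d) is ceiling division in Python)."""
    return [(-(-a // d), lit) for a, lit in terms], -(-b // d)

def saturate(terms, b):
    """Saturation: cap every coefficient at the degree b."""
    return [(min(a, b), lit) for a, lit in terms], b

# 3x + 3y + 2z >= 4 divided by 3 gives x + y + z >= 2
print(divide([(3, 'x'), (3, 'y'), (2, 'z')], 4, 3))

# 5x + 2y + z >= 3 saturates to 3x + 2y + z >= 3
print(saturate([(5, 'x'), (2, 'y'), (1, 'z')], 3))
```

Both rules are sound over 0–1 variables: any assignment satisfying the premise also satisfies the rounded or saturated conclusion.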

The negation of a constraint C = Σ<sub>j</sub> a<sub>j</sub>ℓ<sub>j</sub> ≥ b is ¬C = Σ<sub>j</sub> a<sub>j</sub>ℓ<sub>j</sub> ≤ b − 1. For a (partial) assignment ρ we write C↾<sub>ρ</sub> for the *restricted constraint* obtained by replacing literals in C assigned by ρ with their values and simplifying. We say that C *unit propagates* ℓ *under* ρ if C↾<sub>ρ</sub> cannot be satisfied unless ℓ is assigned to 1. If repeated unit propagation on all constraints in C ∪ D ∪ {¬C}, starting with the empty assignment ρ = ∅, leads to contradiction in the form of an unsatisfiable constraint, we say that C follows by *reverse unit propagation (RUP)* from C ∪ D. Such (efficiently verifiable) RUP steps are allowed in VeriPB proofs as a convenient way to avoid writing out an explicit cutting planes derivation. We use the same notation C↾<sub>ω</sub> to denote the result of applying to C a *(partial) substitution* ω, which can map variables not only to {0, 1} but also to literals, and extend this notation to sets of constraints by taking unions.
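Reverse unit propagation can be sketched with a simple slack-based propagator. This is illustrative only (production checkers use watched-literal schemes and a more careful representation); the `(var, is_positive)` encoding and names are ours.

```python
def unit_propagate(constraints, rho):
    """Slack-based unit propagation over normalized PB constraints
    (terms, b) with terms = [(coef, (var, is_positive))]. Returns
    False on contradiction, else the extended partial assignment."""
    rho = dict(rho)
    changed = True
    while changed:
        changed = False
        for terms, b in constraints:
            # slack: best still-achievable LHS minus the degree b
            slack = sum(a for a, (v, pos) in terms
                        if rho.get(v) != (0 if pos else 1)) - b
            if slack < 0:
                return False                  # constraint unsatisfiable
            for a, (v, pos) in terms:
                if v not in rho and a > slack:
                    rho[v] = 1 if pos else 0  # literal is forced to true
                    changed = True
    return rho

def rup(constraints, c):
    """C follows by RUP if propagating on the constraints together with
    the negation of C yields contradiction. The negation of
    sum(a_j lit_j) >= b is sum(a_j ~lit_j) >= sum(a_j) - b + 1."""
    terms, b = c
    neg = ([(a, (v, not pos)) for a, (v, pos) in terms],
           sum(a for a, _ in terms) - b + 1)
    return unit_propagate(constraints + [neg], {}) is False

# From the clauses x + y >= 1 and x + ~y >= 1, the unit clause
# x >= 1 follows by RUP (resolution on y).
cs = [([(1, ('x', True)), (1, ('y', True))], 1),
      ([(1, ('x', True)), (1, ('y', False))], 1)]
print(rup(cs, ([(1, ('x', True))], 1)))  # True
```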

In addition to the above rules, which derive semantically implied constraints, there is a *redundance-based strengthening rule*, or just *redundance rule* for short, that can derive non-implied constraints C as long as they do not change the feasibility or optimal value. This can be guaranteed by exhibiting a *witness substitution* ω such that for any total assignment α satisfying C ∪ D but violating C, the composition α ∘ ω is another total assignment that satisfies C ∪ D ∪ {C} and yields an objective value that is at least as good. Formally, C can be derived from C ∪ D by exhibiting ω and subproofs for

$$\mathcal{C}\cup\mathcal{D}\cup\{\neg C\}\vdash(\mathcal{C}\cup\mathcal{D}\cup\{C\})\restriction\_{\omega}\cup\{O\geq O\restriction\_{\omega}\},\tag{1}$$

using the previously discussed rules (where the notation C<sub>1</sub> ⊢ C<sub>2</sub> means that the constraints C<sub>2</sub> can be derived from the constraints C<sub>1</sub>).

During preprocessing, constraints in the input formula are often deleted or replaced by other constraints, in which case the proof should establish that these deletions maintain equioptimality. Removing constraints from the derived set D is unproblematic, but unrestricted deletion from the core set C can clearly introduce spurious better solutions. Therefore, removing C from C can only be done by the *checked deletion rule*, which requires a proof that the redundance rule can be used to rederive C from C \ {C} (see [9] for a more detailed explanation).

Finally, it turns out to be useful to allow replacing O by a new objective O′ using an *objective function update rule*, as long as this does not change the optimal value of the problem. Formally, updating the objective from O to O′ requires derivations of the two constraints O′ ≥ O and O ≥ O′ from the core set C, which shows that any satisfying solution to C has the same value for both objectives. More details on this rule can be found in [37].

### 2.2 Maximum Satisfiability

A WCNF instance of (weighted partial) maximum satisfiability F<sup>W</sup> = (F<sub>H</sub>, F<sub>S</sub>) is a conjunction of two CNF formulas F<sub>H</sub> and F<sub>S</sub> with *hard* and *soft* clauses, respectively, where soft clauses C ∈ F<sub>S</sub> have positive weights w<sub>C</sub>. A solution ρ to F<sup>W</sup> must satisfy F<sub>H</sub> and has value cost(F<sub>S</sub>, ρ) equal to the sum of weights of all soft clauses not satisfied by ρ. The optimum opt(F<sup>W</sup>) of F<sup>W</sup> is the minimum of cost(F<sub>S</sub>, ρ) over all solutions ρ, or ∞ if no solution exists.

State-of-the-art MaxSAT preprocessors such as MaxPre [39,44] take a slightly different *objective-centric* view [5] of MaxSAT instances ℱ = (F, O) as consisting of a CNF formula F and an objective function O = Σ<sub>j</sub> c<sub>j</sub>ℓ<sub>j</sub> + W to be minimized under assignments ρ satisfying F. A WCNF MaxSAT instance F<sup>W</sup> = (F<sub>H</sub>, F<sub>S</sub>) is converted into objective-centric form ObjMaxSAT(F<sup>W</sup>) = (F, O) by letting the formula F = F<sub>H</sub> ∪ {C ∨ b<sub>C</sub> | C ∈ F<sub>S</sub>, |C| > 1} of ObjMaxSAT(F<sup>W</sup>) consist of the hard clauses of F<sup>W</sup> and the non-unit soft clauses in F<sub>S</sub>, each extended with a fresh variable b<sub>C</sub> that does not appear in any other clause. The objective O = Σ<sub>(ℓ)∈F<sub>S</sub></sub> w<sub>(ℓ)</sub> · ℓ̄ + Σ<sub>C</sub> w<sub>C</sub> · b<sub>C</sub> contains literals ℓ̄ for all unit soft clauses (ℓ) in F<sub>S</sub> as well as literals for all new variables b<sub>C</sub>, with coefficients equal to the weights of the corresponding soft clauses. In other words, each unit soft clause (ℓ) ∈ F<sub>S</sub> of weight w is transformed into the term w · ℓ̄ in the objective function O, and each non-unit soft clause C is transformed into the hard clause C ∨ b<sub>C</sub> paired with the unit soft clause (b̄<sub>C</sub>) with the same weight as C. The following observation summarizes the properties of ObjMaxSAT(F<sup>W</sup>) that are central to our work.

**Observation 1.** *For any solution $\rho$ to a WCNF MaxSAT instance $\mathcal{F}^W$ there exists a solution $\rho'$ to $(F, O) = \mathrm{ObjMaxSAT}(\mathcal{F}^W)$ with $O(\rho') = \mathrm{cost}(\mathcal{F}^W, \rho)$. Conversely, if $\rho'$ is a solution to $\mathrm{ObjMaxSAT}(\mathcal{F}^W)$, then there exists a solution $\rho$ of $\mathcal{F}^W$ for which $\mathrm{cost}(\mathcal{F}^W, \rho) \le O(\rho')$.*

For the second part of the observation, the reason $O(\rho')$ is only an upper bound on $\mathrm{cost}(\mathcal{F}^W, \rho)$ is that the encoding forces $b_C$ to be true whenever $C$ is not satisfied by an assignment, but not vice versa.

An objective-centric MaxSAT instance $(F, O)$, in turn, clearly has the same optimum as the pseudo-Boolean optimization problem of minimizing $O$ subject to $\mathrm{PB}(F)$. For end-to-end formal verification, the fact that this coincides with $\mathrm{opt}(\mathcal{F}^W)$ needs to be formalized into theorems, as shown in Fig. 4.

# 3 Proof Logging for MaxSAT Preprocessing

We now discuss how pseudo-Boolean proof logging can be used to reason about the correctness of MaxSAT preprocessing steps. Our approach maintains the invariant that the current working instance in the preprocessor is synchronized with the PB constraints in the core set $\mathcal{C}$ as described in Sect. 2.2. At the end of each preprocessing step (i.e., each application of a preprocessing technique), the set of derived constraints $\mathcal{D}$ is empty: all constraints derived in the proof as described in this section are moved to the core set, and constraints are always removed from the core set by checked deletion. Full details are in the online appendix [40].

### 3.1 Overview

All our preprocessing steps maintain *equioptimality*, which means that if preprocessing of the WCNF MaxSAT instance $\mathcal{F}^W$ yields the output instance $\mathcal{F}^W_P$, then the equality $\mathrm{opt}(\mathcal{F}^W) = \mathrm{opt}(\mathcal{F}^W_P)$ is guaranteed to hold. Our preprocessing is *certified*, meaning that we provide a machine-verifiable proof justifying this claimed equality. Our discussion below focuses on input instances that have solutions, but our techniques also handle the—arguably less interesting—case of $\mathcal{F}^W$ not having solutions; details are in the online appendix [40].

An overview of the workflow of our certifying MaxSAT preprocessor is shown in Fig. 1. Given a WCNF instance $\mathcal{F}^W$ as input, the preprocessor proceeds in five stages (illustrated on the left in Fig. 1), and then outputs a preprocessed MaxSAT instance $\mathcal{F}^W_P$ together with a pseudo-Boolean proof that $\mathrm{opt}(\mathrm{ObjMaxSAT}(\mathcal{F}^W)) = \mathrm{opt}(\mathrm{ObjMaxSAT}(\mathcal{F}^W_P))$. For certified MaxSAT preprocessing, this proof can then be fed to a formally verified checker as in Sect. 4 to verify that (a) the initial core constraints in the proof correspond exactly to the clauses in $\mathrm{ObjMaxSAT}(\mathcal{F}^W)$, (b) each step in the proof is valid, and (c) the final core constraints in the proof correspond exactly to the clauses in $\mathrm{ObjMaxSAT}(\mathcal{F}^W_P)$. Below, we provide more details on the five stages of the preprocessing flow.


Fig. 1. Overview of the five stages of certified MaxSAT preprocessing of a WCNF instance $\mathcal{F}^W$. The middle column contains the state of the working MaxSAT instance as a WCNF instance and a lower bound on its optimum cost (Stages 1–2), or as an objective-centric instance (Stages 3–5). The right column contains the pair $(\mathcal{C}, O)$ with the set $\mathcal{C}$ of core constraints and the objective $O$, respectively, of the proof after each stage.

Stage 1: Initialization. An input WCNF instance $\mathcal{F}^W$ is transformed to pseudo-Boolean format by converting it to an objective-centric representation $(F_0, O_0) = \mathrm{ObjMaxSAT}(\mathcal{F}^W)$ and then representing all clauses in $F_0$ as pseudo-Boolean constraints as described in Sect. 2.2. The VeriPB proof starts out with core constraints $\mathrm{PB}(F_0)$ and objective $O_0$. The preprocessor maintains a lower bound on the optimal cost of the working instance, which is initialized to $0$ for the input $\mathcal{F}^W$.

Stage 2: Preprocessing on the Initial WCNF Representation. During preprocessing on the WCNF representation, a (very limited) set of simplification techniques is applied to the working formula. At this stage the preprocessor removes duplicate, tautological, and blocked clauses [43]. Additionally, hard unit clauses are unit propagated and clauses subsumed by hard clauses are removed. Importantly, the preprocessor performs these simplifications on a WCNF MaxSAT instance in which it deals with hard and soft clauses. As the pseudo-Boolean proof has no concept of hard or soft clauses, the reformulation steps must be expressed in terms of the constraints in the proof. The next example illustrates how reasoning with different types of clauses is logged in the proof.

*Example 1.* Suppose the working instance has two duplicate clauses $C$ and $D$. If both are hard, then the proof has two identical constraints $\mathrm{PB}(C)$ and $\mathrm{PB}(D)$ in the core set, and $\mathrm{PB}(D)$ can be deleted since it follows from $\mathrm{PB}(C)$ by reverse unit propagation (RUP). If $D$ is instead a non-unit soft clause, the proof has the constraint $\mathrm{PB}(D \vee b_D)$ and the term $w_D b_D$ in the objective, where $b_D$ does not appear in any other constraint. Then in the proof we (1) remove the RUP constraint $\mathrm{PB}(D \vee b_D)$, (2) introduce $\overline{b}_D \ge 1$ by redundance-based strengthening using the witness $\{b_D \to 0\}$, (3) remove the term $w_D b_D$ from the objective, and (4) delete $\overline{b}_D \ge 1$ with the witness $\{b_D \to 0\}$.

Stage 3: Conversion to Objective-Centric Representation. In order to apply more simplification rules in a cost-preserving way, the working instance $\mathcal{F}^W_1 = (F^1_H, F^1_S)$ at the end of Stage 2 is converted into the corresponding objective-centric representation, taking into account the lower bound $\mathit{lb}$ inferred during Stage 2. More specifically, the preprocessor next converts its working MaxSAT instance into the objective-centric instance $\mathcal{F}_2 = (F_2, O_2 + \mathit{lb})$ where $(F_2, O_2) = \mathrm{ObjMaxSAT}(\mathcal{F}^W_1)$.

Here it is important to note that at the end of Stage 2, the core constraints $\mathcal{C}_1$ and objective $O_1$ of the proof are not necessarily $\mathrm{PB}(F_2)$ and $O_2 + \mathit{lb}$, respectively. Specifically, consider a unit soft clause $(\ell)$ of $\mathcal{F}^W_1$ obtained by shrinking a non-unit soft clause $C \supseteq (\ell)$ of the input instance, with weight $w_C$. Then the objective function $O_2$ in the preprocessor will include the term $w_C \overline{\ell}$ that does not appear in the objective function $O_1$ in the proof. Instead, $O_1$ contains the term $w_C b_C$ and $\mathcal{C}_1$ the constraint $\ell + b_C \ge 1$, where $b_C$ is the fresh variable added to $C$ in Stage 1. In order to "sync up" the working instance and the proof we (1) introduce $\overline{\ell} + \overline{b}_C \ge 1$ to the proof with the witness $\{b_C \to 0\}$, (2) update $O_1$ by adding $w_C \overline{\ell} - w_C b_C$, (3) remove the constraint $\overline{\ell} + \overline{b}_C \ge 1$ with the witness $\{b_C \to 0\}$, and (4) remove the constraint $\ell + b_C \ge 1$ with the witness $\{b_C \to 1\}$. The same steps are logged for all unit soft clauses of $\mathcal{F}^W_1$ obtained during Stage 2. In the following stages, the preprocessor operates on an objective-centric MaxSAT instance whose clauses correspond exactly to the core constraints of the proof.

Stage 4: Preprocessing on the Objective-Centric Representation. During preprocessing on the objective-centric representation, more simplification techniques are applied to the working objective-centric instance and logged to the proof. We implemented proof logging for a wide range of preprocessing techniques. These include MaxSAT versions of rules commonly used in SAT solving like bounded variable elimination (BVE) [20,68], bounded variable addition [49], blocked clause elimination [43], subsumption elimination, self-subsuming resolution [20,60], failed literal elimination [24,46,75], and equivalent literal substitution [12,48,71]. We also cover MaxSAT-specific preprocessing rules like TrimMaxSAT [61], (group)-subsumed literal (or label) elimination (SLE) [6,44], intrinsic at-most-ones [38,39], binary core removal (BCR) [25,44], label matching [44], and hardening [2,39,56]. Here we give examples for BVE, SLE, label matching, and BCR—the rest are detailed in the online appendix [40]. In the following descriptions, let (F, O) be the current objective-centric working instance.

*Bounded Variable Elimination (BVE)* [20,68]. BVE eliminates from $F$ a variable $x$ that does not appear in the objective by replacing all clauses in which either $x$ or $\overline{x}$ appears with the non-tautological clauses in $\{C \vee D \mid C \vee x \in F,\ D \vee \overline{x} \in F\}$.
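
The clause-level transformation (without the proof logging described next) can be sketched as follows; the helper names are ours:

```python
def tautological(clause):
    """A clause containing both l and ~l is a tautology."""
    return any(-l in clause for l in clause)

def eliminate(formula, x):
    """BVE: replace all clauses containing x or ~x by the
    non-tautological resolvents {C v D | C v x in F, D v ~x in F}."""
    pos = [c for c in formula if x in c]
    neg = [c for c in formula if -x in c]
    rest = [c for c in formula if x not in c and -x not in c]
    resolvents = [(c - {x}) | (d - {-x}) for c in pos for d in neg]
    return rest + [r for r in resolvents if not tautological(r)]
```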

An application of BVE is logged as follows: (1) each non-tautological constraint $\mathrm{PB}(C \vee D)$ is added by summing the existing constraints $\mathrm{PB}(C \vee x)$ and $\mathrm{PB}(D \vee \overline{x})$ and saturating, after which (2) each constraint of the form $\mathrm{PB}(C \vee x)$ and $\mathrm{PB}(D \vee \overline{x})$ is deleted with the witness $\{x \to 1\}$ or $\{x \to 0\}$, respectively.

*Label Matching* [44]. Label matching allows merging pairs of objective variables that can be deduced to not both be set to 1 by optimal solutions. Assume that (i) $F$ contains the clauses $C \vee b_C$ and $D \vee b_D$, (ii) $b_C$ and $b_D$ are objective variables with the same coefficient $w$ in $O$, and (iii) $C \vee D$ is a tautology. Then label matching replaces $b_C$ and $b_D$ with a fresh variable $b_{CD}$, i.e., replaces $C \vee b_C$ and $D \vee b_D$ with $C \vee b_{CD}$ and $D \vee b_{CD}$, and adds $-w b_C - w b_D + w b_{CD}$ to $O$.
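
The rewrite itself (with the preconditions assumed to have been checked by the caller, and the proof logging omitted) can be sketched as follows; all names are ours:

```python
def label_match(formula, objective, bC, bD, fresh):
    """Merge labels bC and bD into the fresh label `fresh`, assuming
    preconditions (i)-(iii): clauses C v bC and D v bD exist, the labels
    have equal objective coefficient w, and C v D is a tautology."""
    w = objective[bC]
    assert objective[bD] == w, "labels must have equal coefficients"
    C = next(c for c in formula if bC in c)
    D = next(c for c in formula if bD in c)
    assert any(-l in D for l in C - {bC}), "C v D must be a tautology"
    formula = [c for c in formula if c not in (C, D)]
    formula += [(C - {bC}) | {fresh}, (D - {bD}) | {fresh}]
    objective = {k: v for k, v in objective.items() if k not in (bC, bD)}
    objective[fresh] = w
    return formula, objective
```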

As $C \vee D$ is a tautology, there is some literal $\ell$ such that $\ell \in C$ and $\overline{\ell} \in D$. Label matching is logged via the following steps: (1) introduce the constraint $\overline{b}_C + \overline{b}_D \ge 1$ with the witness $\{b_C \to \overline{\ell}, b_D \to \ell\}$, (2) introduce the constraints $b_{CD} + \overline{b}_C + \overline{b}_D \ge 2$ and $\overline{b}_{CD} + b_C + b_D \ge 1$ by redundance; these correspond to $b_{CD} = b_C + b_D$, which holds even though the variables are binary due to the constraint added in the first step, (3) update the objective by adding $-w b_C - w b_D + w b_{CD}$ to it, (4) introduce the constraints $\mathrm{PB}(C \vee b_{CD})$ and $\mathrm{PB}(D \vee b_{CD})$, which are RUP, (5) delete $\mathrm{PB}(C \vee b_C)$ and $\mathrm{PB}(D \vee b_D)$ with the witness $\{b_C \to \overline{\ell}, b_D \to \ell\}$, (6) delete the constraint $b_{CD} + \overline{b}_C + \overline{b}_D \ge 2$ with the witness $\{b_C \to 0, b_D \to 0\}$ and $\overline{b}_{CD} + b_C + b_D \ge 1$ with the witness $\{b_C \to 1, b_D \to 0\}$, and (7) delete $\overline{b}_C + \overline{b}_D \ge 1$ with the witness $\{b_C \to 0\}$.

*Subsumed Literal Elimination (SLE)* [6,39]. Given two non-objective variables $x$ and $y$ such that (i) $\{C \mid C \in F, y \in C\} \subseteq \{C \mid C \in F, x \in C\}$ and (ii) $\{C \mid C \in F, \overline{x} \in C\} \subseteq \{C \mid C \in F, \overline{y} \in C\}$, subsumed literal elimination (SLE) allows fixing $x = 1$ and $y = 0$. This is proven by (1) introducing $x \ge 1$ and $\overline{y} \ge 1$, both with the witness $\{x \to 1, y \to 0\}$, (2) simplifying the constraint database via propagation, and (3) deleting the constraints introduced in the first step, as neither $x$ nor $y$ appears in any other constraint after simplification.

If $x$ and $y$ are objective variables, the application of SLE additionally requires that (iii) the coefficient of $x$ in the objective is at most as high as the coefficient of $y$. Then the value of $x$ is not fixed, as doing so could incur cost. Instead, only $y = 0$ is fixed and $y$ is removed from the objective. Intuitively, conditions (i) and (ii) establish that the values of $x$ and $y$ can always be flipped to 1 and 0, respectively, without falsifying any clauses. If neither of the variables is in the objective, this flip does not increase the cost of any solution. Otherwise, condition (iii) ensures that the flip does not make the solution worse, i.e., increase its cost.
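
Conditions (i) and (ii) amount to simple containment checks over the clause database; a sketch (with our own names) for the non-objective case:

```python
def sle_applicable(formula, x, y):
    """SLE preconditions for non-objective variables: every clause
    containing y also contains x (i), and every clause containing ~x
    also contains ~y (ii); then x = 1 and y = 0 can be fixed."""
    with_y = [c for c in formula if y in c]
    with_neg_x = [c for c in formula if -x in c]
    return all(x in c for c in with_y) and all(-y in c for c in with_neg_x)
```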

*Binary Core Removal (BCR)* [25,44]. Assume that the following four prerequisites hold: (i) $F$ contains a clause $(b_C \vee b_D)$ for two objective variables $b_C$ and $b_D$, (ii) $b_C$ and $b_D$ have the same coefficient $w$ in $O$, (iii) the negations $\overline{b}_C$ and $\overline{b}_D$ do not appear in any clause of $F$, and (iv) both $b_C$ and $b_D$ appear in at least one other clause of $F$, but not together in any other clause of $F$. Binary core removal replaces all clauses containing $b_C$ or $b_D$ with the non-tautological clauses in $\{C \vee D \vee b_{CD} \mid C \vee b_C \in F,\ D \vee b_D \in F\}$, where $b_{CD}$ is a fresh variable, and modifies the objective function by adding $-w b_C - w b_D + w b_{CD} + w$ to it.

BCR is logged as a combination of the so-called *intrinsic at-most-ones* technique [38,39] and BVE. Applying intrinsic at-most-ones on the variables $b_C$ and $b_D$ introduces a new clause $(\overline{b}_C \vee \overline{b}_D \vee b_{CD})$ and adds $-w b_C - w b_D + w b_{CD} + w$ to the objective. Our proof for intrinsic at-most-ones is the same as the one presented in [4]. As this step removes $b_C$ and $b_D$ from the objective, both can now be eliminated via BVE.

Stage 5: Constant Removal and Output. After objective-centric preprocessing, the final objective-centric instance $(F_3, O_3)$ is converted back to a WCNF instance. Before doing so, the constant term $W_3$ of $O_3$ is removed by introducing a fresh variable $b_{W_3}$ and setting $F_4 = F_3 \wedge (b_{W_3})$ and $O_4 = O_3 - W_3 + W_3 b_{W_3}$. This step is straightforward to prove.

Finally, the preprocessor outputs the WCNF instance $\mathcal{F}^W_P = (F_4, F^P_S)$ that has $F_4$ as hard clauses. The set $F^P_S$ of soft clauses consists of a unit soft clause $(\overline{\ell})$ of weight $c$ for each term $c \cdot \ell$ in $O_4$. The preprocessor also outputs the final proof of the fact that the minimum cost of solutions to the pseudo-Boolean formula $\mathrm{PB}(F_0)$ under $O_0$ is the same as that of $\mathrm{PB}(F_4)$ under $O_4$, i.e., that $\mathrm{opt}(\mathrm{ObjMaxSAT}(\mathcal{F}^W)) = \mathrm{opt}(\mathrm{ObjMaxSAT}(\mathcal{F}^W_P))$.
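
Stage 5 and the final output step can be sketched together as follows (our own names, with the clause and objective encodings as illustrative sets and literal-to-coefficient maps):

```python
def to_wcnf(clauses, objective, constant, fresh):
    """Remove the objective constant W via a fresh variable b forced to 1
    by a hard unit clause (b), then emit one unit soft clause (~l) of
    weight c for each remaining objective term c * l."""
    hard = list(clauses)
    if constant > 0:
        hard.append(frozenset({fresh}))  # hard unit clause (b)
        objective = dict(objective)
        objective[fresh] = constant      # term W * b replaces constant W
    soft = [(frozenset({-l}), c) for l, c in objective.items()]
    return hard, soft
```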

### 3.2 Worked Example of Certified Preprocessing

We give a worked-out example of certified preprocessing of the instance $\mathcal{F}^W = (F_H, F_S)$ where $F_H = \{(x_1 \vee x_2), (\overline{x}_2)\}$ and $F_S$ contains three soft clauses: $(\overline{x}_1)$ with weight $1$, $(x_3 \vee x_4)$ with weight $2$, and $(\overline{x}_4 \vee x_5)$ with weight $3$. The proof for one possible execution of the preprocessor on this input instance is detailed in Table 1.

During Stage 1 (Steps 1–4 in Table 1), the core constraints of the proof are initialized to contain the four constraints corresponding to the hard and non-unit soft clauses of $\mathcal{F}^W$ (IDs (1)–(4) in Table 1), and the objective to $x_1 + 2b_1 + 3b_2$, where $b_1$ and $b_2$ are fresh variables added to the non-unit soft clauses of $\mathcal{F}^W$.

During Stage 2 (Steps 5–9), the preprocessor fixes $x_2 = 0$ via unit propagation by removing $x_2$ from the clause $(x_1 \vee x_2)$ and then removing the unit clause $(\overline{x}_2)$. The justification for fixing $x_2 = 0$ is given in Steps 5–7. Next the preprocessor fixes $x_1 = 1$, which (i) removes the hard clause $(x_1)$, and (ii) increases the lower bound on the optimal cost by $1$. The justification for fixing $x_1 = 1$ is given in Steps 8 and 9 of Table 1. At this point, at the end of Stage 2, the working instance $\mathcal{F}^W_1 = (F^1_H, F^1_S)$ has $F^1_H = \{\}$ and $F^1_S = \{(x_3 \vee x_4), (\overline{x}_4 \vee x_5)\}$.

Table 1. Example proof produced by a certifying preprocessor. The column (ID) refers to constraint IDs in the pseudo-Boolean proof. The column (Step) indexes all proof logging steps and is used when referring to the steps in the discussion. The letter ω is used for the witness substitution in redundance-based strengthening steps.


In Stage 3, the preprocessor converts its working instance into the objective-centric representation $(F, O)$ where $F = \{(x_3 \vee x_4 \vee b_1), (\overline{x}_4 \vee x_5 \vee b_2)\}$ and $O = 2b_1 + 3b_2 + 1$, which exactly matches the core constraints and objective of the proof after Step 9. Thus, in this instance, the conversion does not result in any proof logging steps. Afterwards, during Stage 4 (Steps 10–17), the preprocessor applies BVE to eliminate $x_4$ (Steps 10–12) and SLE to fix $b_2$ to $0$ (Steps 13–17). Finally, Steps 18 and 19 represent Stage 5, i.e., the removal of the constant $1$ from the objective. After these steps, the preprocessor outputs the preprocessed instance $\mathcal{F}^W_P = (F^P_H, F^P_S)$, where $F^P_H = \{(x_3 \vee x_5 \vee b_1), (b_3)\}$ and $F^P_S$ contains two clauses: $(\overline{b}_1)$ with weight $2$, and $(\overline{b}_3)$ with weight $1$.
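
As a sanity check on the worked example, a brute-force computation (under our own illustrative encoding, with variables 6 and 7 standing for $b_1$ and $b_3$) confirms that the input and output instances are equioptimal with optimum 1:

```python
from itertools import product

def satisfies(clause, rho):
    return any(rho[abs(l)] == (l > 0) for l in clause)

def opt(hard, soft, variables):
    """Brute-force minimum cost over solutions; None if unsatisfiable."""
    best = None
    for vals in product([False, True], repeat=len(variables)):
        rho = dict(zip(variables, vals))
        if all(satisfies(c, rho) for c in hard):
            c = sum(w for cl, w in soft if not satisfies(cl, rho))
            best = c if best is None or c < best else best
    return best

# Input: hard (x1 v x2), (~x2); soft (~x1):1, (x3 v x4):2, (~x4 v x5):3.
opt_in = opt([{1, 2}, {-2}],
             [({-1}, 1), ({3, 4}, 2), ({-4, 5}, 3)], [1, 2, 3, 4, 5])
# Output: hard (x3 v x5 v b1), (b3); soft (~b1):2, (~b3):1.
opt_out = opt([{3, 5, 6}, {7}], [({-6}, 2), ({-7}, 1)], [3, 5, 6, 7])
assert opt_in == opt_out == 1
```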

# 4 Verified Proof Checking for Preprocessing Proofs

This section presents our new workflow for formally verified, end-to-end proof checking of MaxSAT preprocessing proofs based on pseudo-Boolean reasoning; an overview of this workflow is shown in Fig. 2. To realize this workflow, we extended the VeriPB tool and its proof format to support a new *output section* for declaring (and checking) reformulation guarantees between input and output PBO instances (Sect. 4.1); we similarly modified the formally verified proof checker CakePB [29] to support the updated proof format (Sect. 4.2); finally, we built a verified frontend, CakePBwcnf, which mediates between MaxSAT WCNF instances and PBO instances (Sect. 4.3). Our formalization is carried out in the HOL4 proof assistant [67] using CakeML tools [34,59,70] to obtain a verified executable implementation of CakePBwcnf.

Fig. 2. Workflow for end-to-end verified MaxSAT preprocessing proof checking.

In the workflow in Fig. 2, the MaxSAT preprocessor produces a reformulated output WCNF together with a proof of equioptimality with the input WCNF. This proof is elaborated by VeriPB and then checked by CakePBwcnf, resulting in a verified *verdict*—in case of success, the input and output WCNFs are equioptimal. This workflow also supports verified checking of WCNF MaxSAT solving proofs (where the output parts of the flow are omitted).

### 4.1 Output Section for Pseudo-Boolean Proofs

Given an input PBO instance $(F, O)$, the VeriPB proof system as described in Sect. 2.1 maintains the invariant that the core constraints $\mathcal{C}$ (and the current objective) are equioptimal to the input instance. Utilizing this invariant, the new *output section* for VeriPB proofs allows users to optionally specify an output PBO instance $(F', O')$ at the end of a proof. This output instance is claimed to be a reformulation of the input which is either: (i) *derivable*, i.e., satisfiability of $F$ implies satisfiability of $F'$, (ii) *equisatisfiable* to $F$, or (iii) *equioptimal* to $(F, O)$. These are increasingly stronger claims about the relationship between the input and output instances. After checking a pseudo-Boolean derivation, VeriPB runs reformulation checking which, e.g., for equioptimality, checks that $\mathcal{C} \subseteq F'$, $F' \subseteq \mathcal{C}$, and that the respective objective functions are syntactically equal after normalization; other reformulation guarantees are checked analogously.

The VeriPB tool supports an *elaboration* mode [29], where in addition to checking the proof it also converts it from *augmented format* to *kernel format*. The augmented format contains syntactic sugar rules to facilitate proof logging for solvers and preprocessors like MaxPre, while the kernel format is supported by the formally verified proof checker CakePB. The new output section is passed unchanged from augmented to kernel format during elaboration.

### 4.2 Verified Proof Checking for Reformulations

There are two main verification tasks involved in extending CakePB with support for the output section. The first task is to verify soundness of all cases of reformulation checking. Formally, the equioptimality of an input PBO instance (*fml*, *obj*) and its output counterpart (*fml′*, *obj′*) is specified as follows:

> sem\_output *fml obj* None *fml′ obj′* Equioptimal =<sub>def</sub>
> ∀ *v*. (∃ *w*. satisfies *w fml* ∧ eval\_obj *obj w* ≤ *v*) ⇐⇒ (∃ *w′*. satisfies *w′ fml′* ∧ eval\_obj *obj′ w′* ≤ *v*)

This definition says that, for all values *v*, the input instance has a satisfying assignment with objective value at most *v* iff the output instance also has such an assignment; note that this implies (as a special case) that *fml* is satisfiable iff *fml′* is satisfiable. The verified correctness theorem for CakePB says that *if* CakePB successfully checks a pseudo-Boolean proof in kernel format and prints a verdict declaring equioptimality, then the input and output instances are indeed equioptimal as defined in sem\_output.
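
For intuition, the Equioptimal case of sem\_output can be mirrored by a brute-force Python predicate over finite instances; this is purely illustrative (our own names) and unrelated to the HOL4 formalization:

```python
from itertools import product

def satisfies(clause, rho):
    return any(rho[abs(l)] == (l > 0) for l in clause)

def eval_obj(obj, rho):
    """Objective value: sum of coefficients of literals that are true."""
    return sum(c for l, c in obj.items() if rho[abs(l)] == (l > 0))

def costs(fml, obj, variables):
    """All objective values achieved by satisfying assignments."""
    result = set()
    for vals in product([False, True], repeat=len(variables)):
        rho = dict(zip(variables, vals))
        if all(satisfies(c, rho) for c in fml):
            result.add(eval_obj(obj, rho))
    return result

def equioptimal(fml1, obj1, vars1, fml2, obj2, vars2):
    """For every v, a solution of value <= v exists for one instance iff
    one exists for the other; equivalently, the minima coincide (with an
    empty solution set playing the role of infinity)."""
    c1, c2 = costs(fml1, obj1, vars1), costs(fml2, obj2, vars2)
    return min(c1, default=None) == min(c2, default=None)
```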

The second task is to develop verified optimizations to speed up proof steps that occur frequently in preprocessing proofs; some code hotspots were also identified by profiling the proof checker against proofs generated by MaxPre. Similar (unverified) versions of these optimizations are also used in VeriPB. These optimizations turned out to be necessary in practice—they mostly target steps which, when naïvely implemented, have time complexity quadratic (or worse) in the size of the constraint database.

*Optimizing Reformulation Checking.* The most expensive step in reformulation checking for the output section is to ensure that the core constraints $\mathcal{C}$ are included in the output formula and vice versa (up to permutation and duplication). Here, CakePB normalizes all pseudo-Boolean constraints involved to a canonical form and then copies both $\mathcal{C}$ and the output formula into respective array-backed hash tables for fast membership tests.

*Optimizing Redundance and Checked Deletion Rules.* A naïve implementation of these two rules would require iterating over the entire constraint database when checking all subproofs in (1) for the right-hand-side constraints $(\mathcal{C} \cup \mathcal{D} \cup \{C\}){\restriction}_{\omega} \cup \{O \ge O{\restriction}_{\omega}\}$. An important observation here is that preprocessing proofs frequently use substitutions $\omega$ that only involve a small number of variables (often a single variable, which in addition is fresh in the important special case of *reification* constraints $z \Leftrightarrow C$ encoding that $z$ is true precisely when the constraint $C$ is satisfied). Consequently, most of the constraints in $(\mathcal{C} \cup \mathcal{D} \cup \{C\}){\restriction}_{\omega}$ can be skipped when checking redundance because they are unchanged by the substitution. Similarly, the constraint $O \ge O{\restriction}_{\omega}$ is expensive to construct when the objective $O$ contains many terms, but this construction can be skipped if no variable being substituted occurs in $O$. CakePB stores a lazily updated mapping of variables to their occurrences in the constraint database and the objective, which it uses to detect these cases.

The occurrence mapping just discussed is crucial for performance due to the frequency of steps involving witnesses in preprocessing proofs, but it incurs some memory overhead in the checker. More precisely, every variable occurrence in any constraint in the database corresponds to exactly one ID in the mapping. Thus, the overhead of storing the mapping is in the worst case quadratic in the number of constraints, but it is still linear in the total space usage of the constraint database.

### 4.3 Verified WCNF Frontend

The CakePBwcnf frontend mediates between MaxSAT WCNF problems and the pseudo-Boolean optimization problems native to CakePB. Accordingly, the correctness of CakePBwcnf is stated in terms of MaxSAT semantics, i.e., the encoding, the underlying pseudo-Boolean semantics, and the proof system are all formally verified. In order to trust CakePBwcnf, one *only* has to carefully inspect the formal definition of MaxSAT semantics shown in Fig. 3 to make sure that it matches the informal definition in Sect. 2.2. Here, each clause *C* is paired with a natural number *n*, where *n* = 0 indicates a hard clause and, when *n* > 0, it is the weight of *C*. The optimal cost of a weighted CNF formula *wfml* is None (representing ∞) if no satisfying assignment to the hard clauses exists; otherwise, it is the minimum cost among all satisfying assignments to the hard clauses.

Fig. 3. Formalized semantics for MaxSAT WCNF problems.

*There and Back Again.* CakePBwcnf contains a verified WCNF-to-PB encoder implementing the encoding described in Sect. 2.2. Its correctness theorems are shown in Fig. 4, where the two lemmas in the top row relate the satisfiability and cost of the WCNF to its PB optimization counterpart after running wcnf\_to\_pbf (and vice versa); cf. Observation 1. Using these lemmas, the final theorem (bottom row) shows that equioptimality for two (encoded) PB optimization problems can be *translated* back to equioptimality for the input and preprocessed WCNFs.


Fig. 4. Correctness theorems for the WCNF-to-PB encoding.

*Putting Everything Together.* The final verification step is to specialize the end-to-end machine-code correctness theorem for CakePB to the new frontend. The resulting theorem for CakePBwcnf is shown abridged in Fig. 5; a detailed explanation of similar CakeML-based theorems is available elsewhere [29,69], so we do not go into details here. Briefly, the theorem says that whenever the verdict string "s VERIFIED OUTPUT EQUIOPTIMAL" is printed (as a suffix) to the standard output by an execution of CakePBwcnf, then the two input files given on the command line parsed to equioptimal MaxSAT WCNF instances.

Fig. 5. Abridged final correctness theorem for CakePBwcnf.

# 5 Experiments

We updated the MaxSAT preprocessor MaxPre 2.1 [39,42,44] to MaxPre 2.2, which now produces proof logs in the VeriPB format [10]. MaxPre 2.2 is available at the MaxPre 2 repository [50]. The generated proofs were elaborated using VeriPB [73] and then checked by the verified proof checker CakePBwcnf. As benchmarks we used the 558 weighted and 572 unweighted MaxSAT instances from the MaxSAT Evaluation 2023 [52].

The experiments were conducted on 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60 GHz CPUs with 16 GB of memory, a solid state drive as storage, and Rocky Linux 8.5 as operating system. Each benchmark ran exclusively on a node and the memory was limited to 14 GB. The time for MaxPre was limited to 300 s. There is an option to let MaxPre know about this time limit, but we did not use this option since MaxPre then decides which techniques to try based on how much time remains. This would have made it very hard to get reliable measurements of the overhead when proof logging is switched on in the preprocessor. The time limits for both VeriPB and CakePBwcnf were set to 6000 s to get as many instances checked as possible.

The main focus of our evaluation was the default setting of MaxPre, which does not use some of the techniques mentioned in Sect. 3 (or the online appendix [40]). We also conducted experiments with all techniques enabled to check the correctness of the proof logging implementation for all preprocessing techniques. The data and source code from our experiments can be found in [41].

The goal of the experiments was to answer the following questions:

RQ1. How much extra time is required to write the proof for the preprocessor?
RQ2. How long does proof checking take compared to proof generation?

Fig. 6. Proof logging overhead for MaxPre.

Fig. 7. MaxPre vs. combined proof checking running time.

To answer the first question, in Fig. 6 we compare MaxPre with and without proof logging. In total, 1081 instances were successfully preprocessed by MaxPre without proof logging. With proof logging enabled, 8 fewer instances were preprocessed due to either time- or memory-outs. For the successfully preprocessed instances, the proof logging overhead is 46% of the running time in the geometric mean, and 95% of the instances were preprocessed with proof logging in at most twice the time required without proof logging.

Our comparison between proof generation and proof checking is based on the 1073 instances for which preprocessing with proof logging was successful. Out of these, 1021 instances were successfully checked and elaborated by VeriPB. For 991 instances the verdicts were confirmed by the formally verified proof checker CakePBwcnf, with the remaining instances being time-outs or memory-outs. This shows the practical viability of our approach, as the vast majority of preprocessing proofs were checked within the resource limits.

A scatter plot comparing the running time of MaxPre with proof logging enabled against the combined checking process is shown in Fig. 7. For the combined checking time, we only consider the instances that were successfully checked by CakePBwcnf. In the geometric mean, the time for the combined verified checking pipeline of VeriPB elaboration followed by CakePBwcnf checking is 113× the preprocessing time of MaxPre. A general reason for this overhead is that the preprocessor has more MaxSAT application-specific context than the pseudo-Boolean checker: the preprocessor can log proof steps without performing the actual reasoning, while the checker must ensure that those steps are sound in an application-agnostic way. An example of this is reification: as the preprocessor knows its reification variables are fresh, it can easily emit redundance steps that witness on those variables, but the checker has to verify freshness against its own database. Similar behaviour has been observed in other applications of pseudo-Boolean proof logging [27,37].

To analyse further the causes of proof checking overhead, we also compared VeriPB to CakePBwcnf. The checking of the elaborated kernel proof with CakePBwcnf is 6.7× faster than checking and elaborating the augmented proof with VeriPB. This suggests that the bottleneck for proof checking is VeriPB; VeriPB *without* elaboration is about 5.3× slower than CakePBwcnf. As elaboration is a necessary step before running CakePBwcnf, improving the performance of VeriPB would benefit the performance of the pipeline as a whole. One specific feature that seems desirable would be to augment RUP rule applications with LRAT-style hints [16], so that VeriPB would not need to perform unit propagation to elaborate RUP steps into cutting planes derivations. Though these engineering challenges are important to address, they are beyond the scope of the current paper and we leave them as future work.

# 6 Conclusion

In this work, we show how to use pseudo-Boolean proof logging to certify correctness of the MaxSAT preprocessing phase, extending previous work for the main solving phase in unweighted model-improving solvers [72] and general core-guided solvers [4]. As a further strengthening of previous work, we present a fully formally verified toolchain which provides end-to-end verification of correctness.

In contrast to SAT solving, there is a rich variety of techniques in maximum satisfiability solving, and it still remains to design pseudo-Boolean proof logging methods for general, weighted, model-improving MaxSAT solvers [21,47,62] and *implicit hitting set (IHS)* MaxSAT solvers [18,19] with *abstract cores* [3]. Nevertheless, our work adds further weight to the conclusion that pseudo-Boolean reasoning seems like a very promising foundation for MaxSAT proof logging. We are optimistic that this work is another step on the path towards general adoption of proof logging in the context of SAT-based optimization.

Acknowledgments. This work has been financially supported by the University of Helsinki Doctoral Programme in Computer Science DoCS, the Research Council of Finland under grants 342145 and 346056, the Swedish Research Council grants 2016-00782 and 2021-05165, the Independent Research Fund Denmark grant 9040-00389B, the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by A\*STAR, Singapore. Part of this work was carried out while some of the authors participated in the extended reunion of the semester program *Satisfiability: Theory, Practice, and Beyond* in the spring of 2023 at the Simons Institute for the Theory of Computing at UC Berkeley. We also acknowledge useful discussions at the Dagstuhl workshops 22411 *Theory and Practice of SAT and Combinatorial Solving* and 23261 *SAT Encodings and Beyond*. The computational experiments were enabled by resources provided by LUNARC at Lund University.

# References


2004. LNCS, vol. 3542, pp. 276–291. Springer, Heidelberg (2005). https://doi.org/ 10.1007/11527695\_22



# **A Formal Model to Prove Instantiation Termination for E-matching-Based Axiomatisations**

Rui Ge, Ronald Garcia, and Alexander J. Summers

Department of Computer Science, University of British Columbia, Vancouver, BC, Canada {rge,rxg}@cs.ubc.ca, alex.summers@ubc.ca

**Abstract.** SMT-based program analysis and verification often involve reasoning about program features that have been specified using quantifiers; incorporating quantifiers into SMT-based reasoning is, however, known to be challenging. If quantifier instantiation is not carefully controlled, then runtime and outcomes can be brittle and hard to predict. In particular, uncontrolled quantifier instantiation can lead to unexpected incompleteness and even non-termination. E-matching is the most widely-used approach for controlling quantifier instantiation, but when axiomatisations are complex, even experts cannot tell whether or not their use of E-matching guarantees completeness or termination.

This paper presents a new formal model that facilitates the proof, once and for all, that giving a complex E-matching-based axiomatisation to an SMT solver such as Z3 or cvc5, cannot cause non-termination. Key to our technique is an operational semantics for solver behaviour that models how the E-matching rules common to most solvers are used to determine when quantifier instantiations are enabled, but abstracts over irrelevant details of individual solvers. We demonstrate the effectiveness of our technique by presenting a termination proof for a set theory axiomatisation adapted from those used in the Dafny and Viper verifiers.

**Keywords:** SMT solving · Quantifiers · Termination proofs · E-matching

# **1 Introduction**

SMT-based program analysis and verification have advanced dramatically in the past two decades. These advances have been partly fuelled by major improvements in SAT and SMT solving techniques, as well as their implementations in state-of-the-art solvers such as Z3 [22] and cvc5 [2]. Leveraging these advances in SMT, a huge number of program analysis and verification tools have been based on SMT, including for example Dafny [17], Why3 [12] and Viper [24].

Such tools must translate a wide range of problem features into SMT queries that model these domain-specific concerns. While some theories relevant to problem features (e.g. linear arithmetic [22]) are natively supported by SMT solvers, most problem features must be modelled by *axiomatisation*.

Axiomatising problem features involves introducing uninterpreted sorts, uninterpreted functions on these sorts, and (crucially) *quantifiers*<sup>1</sup> that define the intended meaning of these features. For instance, one can model sets of integers by introducing a sort *Set* for sets, uninterpreted functions *member* and *diff* to represent set membership and set difference respectively, and quantifiers such as $\forall s_1, s_2 : \mathit{Set}, x : \mathit{Int}.\ \mathit{member}(x, s_2) \to \neg\mathit{member}(x, \mathit{diff}(s_1, s_2))$.

Such modelling to SMT is expressive, but makes heavy use of quantifiers that must be instantiated during SMT solving. Quantifier instantiation in SMT notoriously presents challenges, potentially causing slow performance and even non-termination, as well as unexpectedly-failing proofs [4,19]. Worse still, latent quantifier instantiation issues may not surface on all runs, but cause a "butterfly effect" [16]: unrelated changes to an input problem may lead to substantial changes in solver behaviour along these lines.

To manage these issues, solvers allow quantifiers to be annotated with instantiation *triggers* (a.k.a. instantiation *patterns*). Triggers specify (possibly multiple) shapes of ground terms that must be *known* (occur in the current proof context, modulo known equalities) to enable a quantifier instantiation. This method of guiding quantifier instantiation is referred to as *E-matching* [8,25] and is supported by virtually all modern SMT solvers.
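As a concrete illustration, here is a minimal sketch of the purely syntactic core of trigger matching (ignoring equalities for the moment): a first-order matcher that instantiates a trigger term against a ground term. The term encoding and all names are ours, purely for illustration.

```python
# Terms are nested tuples ('f', arg1, ...); constants are 0-ary applications
# like ('b',); quantified variables are strings prefixed with '?'.

def match_term(pattern, ground, subst):
    """Try to extend `subst` so that pattern instantiated by subst equals
    ground. Returns the extended substitution, or None on failure.
    Purely syntactic: no equality reasoning."""
    if isinstance(pattern, str) and pattern.startswith('?'):
        if pattern in subst:                      # variable already bound
            return subst if subst[pattern] == ground else None
        return {**subst, pattern: ground}         # bind the variable
    # function application: heads and arities must agree
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    for p, g in zip(pattern[1:], ground[1:]):
        subst = match_term(p, g, subst)
        if subst is None:
            return None
    return subst

# Trigger term diff(s1, s2) against the ground term diff(b, c):
s = match_term(('diff', '?s1', '?s2'), ('diff', ('b',), ('c',)), {})
print(s)  # {'?s1': ('b',), '?s2': ('c',)}
```

Real E-matching additionally matches modulo the solver's known equalities, as the examples below show.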

However, selecting appropriate triggers is an art. The choice requires expertise in managing a fine balance: not too restrictive, to avoid insufficient quantifier instantiations for proofs, and not too permissive, to prevent excessive instantiations. Subtle issues can easily lead to the same hard-to-debug problems even for the most talented of SMT artists [16,19], and even when successful it is unclear how one can *know* that the chosen triggers are guaranteed to work in the future.

The ideal aim is to achieve both instantiation completeness and instantiation termination. *Instantiation completeness* means that all necessary quantifier instantiations for a proof can be made by the solver. *Instantiation termination* means that the solver will never endlessly explore infinitely many quantifier instantiations. In this paper, we focus on instantiation termination.<sup>2</sup>

Failures of instantiation termination stem from *matching loops*: the problematic scenario of a quantifier instantiation (possibly indirectly) leading to learning new terms that cause further instantiations of the same quantifier, potentially creating an endless loop. Matching loops *can* cause non-termination, but (problematically, for debugging) may only do so on some runs (in case heuristics in the solver arrive at the facts necessary to complete a proof "in time").

<sup>1</sup> We use the term *quantifier* (also) as a synonym for quantified formula.

<sup>2</sup> Instantiation termination can be trivially achieved by pathological trigger choices that prevent all instantiations (similar to proving a function terminating under a false precondition). However, such axiomatisations are not useful (or used) in practice.

Our paper enables proving that matching loops have been avoided altogether. We present a high-level formal model of E-matching-based quantifier instantiation that suffices to prove *once and for all* that a given set of trigger-annotated quantifiers, when combined with *any possible* ground facts, guarantees instantiation termination, thereby ensuring the absence of matching loops. Our model is designed to be broadly applicable because it models the core E-matching rules common to most solvers, but abstracts over implementation details where individual solvers make different choices. Our model enables formal termination proofs based on familiar concepts from program reasoning, with manageable complexity, allowing axiomatisation practitioners to independently construct these proofs and confidently seek terminating responses to ground theory queries.

Our main technical contributions are as follows:


Our research draws inspiration from Dross et al.'s [11] prior formalism for quantifier instantiation via E-matching. To the best of our knowledge, their work represents the sole formal attempt in this space before ours. However, we find their formalism incompatible with our goals: we elaborate on this point in Sect. 5.

Full details and supporting proofs are available in our technical report (TR hereafter) [13].

# **2 Problem Statement**

We begin with a basic grounding in E-matching, and use this to lay out the most important challenges a formal model needs to address to be useful in practice.

### **2.1 Quantifier Instantiation via E-matching**

Quantifiers are crucial for effectively modelling external problem features as an SMT problem. However, when determining whether such a first-order problem is satisfiable, an SMT solver must contend with quantifiers ranging over infinite sorts. A successful proof will (and need) involve only finitely many instantiations of the quantifiers, but selecting these is in general undecidable. Most solvers provide *E-matching* as the main means of guiding instantiation.

E-matching requires each quantifier to be associated with instantiation *triggers* (a.k.a. instantiation *patterns*). Triggers consist of terms containing the quantified variables, and prescribe that instantiations should only be made when ground terms of matching shape(s) arise in the current proof search.

During a proof search, SMT solvers maintain and update the currently-known ground terms and (dis)equalities on them in an efficient congruence-closure data structure called an *E-graph*. This information enables *E-matching* [21,25] (matching modulo currently-known equalities) of known terms against quantifier triggers, which enables new instantiations, and of potential instantiations against previous ones, which prevents redundant instantiations.

*Example 1.* Consider the set theory axiom presented early in Sect. 1, now annotated with triggers (written comma-separated inside square brackets)<sup>3</sup>:

$$\forall s_1, s_2, x.\ [\mathit{diff}(s_1, s_2), \mathit{member}(x, s_2)]\\ \mathit{member}(x, s_2) \to \neg\mathit{member}(x, \mathit{diff}(s_1, s_2))$$

The trigger consists of two terms, $\mathit{diff}(s_1, s_2)$ and $\mathit{member}(x, s_2)$; a multi-term trigger prescribes that terms matching *all* (here, both) patterns must be known for some instantiation of the quantified variables. If so, the corresponding instantiation of the quantifier *itself* will be made: the instantiated quantifier body<sup>4</sup> will be treated as a newly-derived fact (typically, a *clause*), and the solver will also record that this instantiation has been made (to avoid doing so again).

Suppose that an E-graph represents the congruence closure of the facts: $\mathit{member}(t, a) = \top$, $\mathit{diff}(b, c) = b$ and $a = c$. E-matching will find a successful match against the trigger above; although it might seem that there is no consistent pair of terms here, the equality $a = c$ means that (modulo equalities) we can consider the terms $\mathit{member}(t, a)$ and $\mathit{diff}(b, a)$ as known in the E-graph, which match the triggers under the instantiation $s_1 \mapsto b$, $s_2 \mapsto a$ and $x \mapsto t$. The corresponding instantiation of the quantifier body yields $\neg\mathit{member}(t, a) \lor \neg\mathit{member}(t, \mathit{diff}(b, a))$. Subsequently, the same quantifier cannot be instantiated with e.g. $s_1 \mapsto b$, $s_2 \mapsto c$ and $x \mapsto t$ since, again modulo equalities, this is an equivalent instantiation.
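To make this example concrete, the following sketch hand-builds the E-graph's equivalence classes for these facts and runs a naive E-matching search over them. The encoding (class ids, the `apps` table, and all names) is ours, purely for illustration; real solvers use far more efficient indexing.

```python
# A hand-built E-graph for the example facts (encoding ours):
# member(t, a) = TRUE, diff(b, c) = b, a = c.
# Each ground term is assigned the id of its equivalence class.
classes = {'t': 0, 'a': 1, 'c': 1, 'b': 2, 'diff(b,c)': 2,
           'member(t,a)': 3, 'TRUE': 3}
# Known function applications, with their argument classes and own class:
apps = [('diff', (2, 1), 2),        # diff(b, c): args in classes {b}, {a, c}
        ('member', (0, 1), 3)]      # member(t, a)

def ematch(pattern, cls, subst):
    """Match a trigger pattern against equivalence class `cls`, yielding
    substitutions from quantified variables (plain strings) to class ids."""
    if isinstance(pattern, str):                   # a quantified variable
        if pattern in subst:
            if subst[pattern] == cls:
                yield subst
        else:
            yield {**subst, pattern: cls}
        return
    head, args = pattern[0], pattern[1:]
    for (h, arg_classes, c) in apps:               # matching modulo equalities
        if h == head and c == cls and len(arg_classes) == len(args):
            substs = [subst]
            for p, ac in zip(args, arg_classes):
                substs = [s2 for s in substs for s2 in ematch(p, ac, s)]
            yield from substs

# The multi-term trigger [diff(s1, s2), member(x, s2)]: both must match.
all_classes = sorted(set(classes.values()))
matches = [s2
           for c1 in all_classes
           for s1 in ematch(('diff', 's1', 's2'), c1, {})
           for c2 in all_classes
           for s2 in ematch(('member', 'x', 's2'), c2, s1)]
print(matches)  # one match: s1 -> class of b, s2 -> class of {a, c}, x -> class of t
```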

*Example 2.* Consider a variant of the previous quantifier, modified with a different trigger, and in the context of a different E-graph that represents instead the congruence closure of the facts: $\mathit{member}(t, a) = \top$ and $\mathit{member}(t, b) = \top$.

$$\forall s_1, s_2, x.\ [\mathit{member}(x, s_1), \mathit{member}(x, s_2)]\\ \mathit{member}(x, s_2) \to \neg\mathit{member}(x, \mathit{diff}(s_1, s_2))$$

Now four instantiations are enabled: one for each pair of *member* applications in our current model (and E-graph): e.g. instantiating $s_1 \mapsto a$, $s_2 \mapsto b$ and $x \mapsto t$, or $s_1 \mapsto b$, $s_2 \mapsto a$ and $x \mapsto t$. All four will be made: they are different choices since we don't know that $a = b$. The second, for example, causes the new clause (rewritten as a disjunction) $\neg\mathit{member}(t, a) \lor \neg\mathit{member}(t, \mathit{diff}(b, a))$ to be assumed. This doesn't change the E-graph (which is populated only by assumed *literals*);

<sup>3</sup> For brevity, sorts on quantified variables are omitted in this example and hereafter.

<sup>4</sup> *Quantifier body* refers to the subformula that falls within the scope of a quantifier.

clauses are kept separately in the prover state. However, case-splitting on this clause may lead to the literal $\neg\mathit{member}(t, \mathit{diff}(b, a))$ being added. At this point, five *new* quantifier instantiations will be enabled; the number of pairs of *member* applications has increased. In fact, by alternately instantiating this quantifier and case-splitting on newly-learned clauses, we can uncover new instantiations indefinitely, in a so-called *matching loop*.
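The growth in Example 2 can be simulated by counting the ordered pairs of *member* applications that the two-pattern trigger can match; each round below stands for one instantiate-then-case-split step that adds one new set term. The encoding is ours and deliberately simplistic.

```python
# Simulating the matching loop of Example 2: each round, case-splitting on a
# clause learned from an instantiation adds one new member(t, _) literal,
# enabling new instantiations of the trigger [member(x, s1), member(x, s2)].

def enabled_instantiations(set_terms):
    """All (s1, s2) choices matching the two-pattern trigger with x -> t."""
    return {(s1, s2) for s1 in set_terms for s2 in set_terms}

set_terms = ['a', 'b']   # from member(t, a) and member(t, b)
made = set()
for round_no in range(4):
    new = enabled_instantiations(set_terms) - made
    # Round 0 reports the four initial instantiations; round 1 the five new
    # ones enabled once ~member(t, diff(b, a)) has been assumed; and so on.
    print(f"round {round_no}: {len(new)} new instantiations")
    made |= new
    # Case-splitting on a learned clause yields a literal of the form
    # ~member(t, diff(<latest term>, ...)), i.e. one more known set term:
    set_terms.append(f"diff({set_terms[-1]},a)")
```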

These first examples show that the choice of triggers affects instantiation behaviour, and that modelling instantiations requires considering not only initial terms, but also facts learned during proof search and case-splitting choices.

*Example 3.* Consider the following "subset elimination" axiom (from the set theory axiomatisation we tackle later) with nested quantifiers:

$$\forall s_1, s_2.\ [\mathit{subset}(s_1, s_2)]\ \mathit{subset}(s_1, s_2) \to (\forall x.\ [\mathit{member}(x, s_1)]\,[\mathit{member}(x, s_2)]\ \mathit{member}(x, s_1) \to \mathit{member}(x, s_2))$$

The inner quantifier has *two* triggers, defining *alternative* conditions for instantiation (a term of either shape is sufficient). Note that these triggers depend on the outer-quantified variables $s_1$ and $s_2$, and thus on their instantiations.

Instantiating an outer quantifier expands the set of quantifiers available for instantiation. In this example, instantiating the outer quantifier ($\forall s_1, s_2.\ldots$) results in a clause that includes a copy of the inner quantifier ($\forall x.\ldots$); case-splitting on this clause can cause the copy to be assumed, effectively adding one more quantifier for future potential instantiations. As such, the instantiation of outer quantifiers *dynamically* introduces new quantifiers, adding complexity to establishing termination arguments: one must be able to identify and predict the quantifiers (and their instantiations) that will be dynamically introduced.
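A minimal sketch of this dynamic introduction of quantifiers follows, with our own hypothetical encoding of the subset axiom: instantiating the outer quantifier produces a clause whose second disjunct is a fresh copy of the inner quantifier, which case-splitting then adds to the current quantifiers.

```python
# Sketch of how instantiating the outer "subset elimination" quantifier
# dynamically introduces copies of the inner quantifier (encoding ours).

outer = ('forall', ('s1', 's2'),
         'subset(s1,s2) -> (forall x. member(x,s1) -> member(x,s2))')

def instantiate_outer(s1, s2):
    """Instantiating the outer quantifier yields a clause whose second
    disjunct is a *copy* of the inner quantifier, specialised to s1, s2."""
    inner = ('forall', ('x',), f'member(x,{s1}) -> member(x,{s2})')
    return [f'~subset({s1},{s2})', inner]

W = {outer}                        # current quantifiers
clause = instantiate_outer('a', 'b')
# Case-splitting on the second disjunct assumes the inner quantifier,
# adding it to the current quantifiers:
W.add(clause[1])
print(len(W))  # 2: the outer quantifier plus one specialised inner copy
```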

### **2.2 Objectives for a Formal Model of E-matching**

Given the difficulty of choosing quantifier triggers and *knowing* that their instantiations can *never* continue forever, our objective is to provide formal and usable means of constructing such E-matching *termination proofs* once and for all. Rather than attempt to capture the precise behaviour of a specific solver and its configuration, we want a model that abstracts over the behaviours of *any* reasonable implementation of E-matching, while still being sufficiently precise for the proofs to work and be reasonable to construct in practice.

The design of a model for E-matching must address multiple challenges:


We present our model, designed to address these challenges, in the next section; we demonstrate its applicability for termination proofs in Sect. 4.

# **3 An Operational Semantics for E-matching**

We develop our formal model in the style of a *small-step operational semantics*, a popular choice for programming languages. In this operational style, states represent intermediate points of a proof search, while transitions represent solver steps; non-determinism abstracts over choices specific solvers make. With this design, our desired notion of instantiation termination can be recast as a familiar style of termination proof, albeit against a semantics with novel core details.

### **3.1 Preliminaries**

Our syntax for formulas is based around a generalisation of conjunctive normal form, used internally in SMT algorithms; we assume all formulas are preconverted to this form (existential quantifiers are eliminated by Skolemisation).

**Definition 1 (Formula Syntax).** *We assume a pre-defined set of* atoms<sup>5</sup>*, including equalities on terms* $t_1 = t_2$*. A* (simple) literal *l is either an atom or its negation. The grammars of* extended literals *φ,* extended clauses *C and* extended conjunctive normal form (ECNF) formulas *A are as follows:*

$$\phi ::= l \mid (\forall \vec{x}.\,\overrightarrow{[T]}\,A)^{\alpha} \qquad\quad C ::= \phi \mid C \lor C \qquad\quad A ::= C \mid A \land A$$

*Here,* $(\forall \vec{x}.\,\overrightarrow{[T]}\,A)^{\alpha}$ *denotes a* tagged quantifier*: the (possibly-multiple) variables* $\vec{x}$ *are bound, the (possibly-multiple) trigger sets* $\overrightarrow{T}$ *are each marked with square brackets and positioned before the quantifier body A, and α is a* tag *used to uniquely identify this particular quantifier (see also Sect. 3.6).*

As presented in Example 1, a trigger set *T* is a (non-empty) set of terms, written comma-separated. There are additional requirements: each trigger set must contain each quantified variable at least once, and each term must contain at least one quantified variable. Furthermore, each term must contain at least one uninterpreted function application and no interpreted function symbols such as equalities. These restrictions are common for SMT solvers.
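These well-formedness restrictions are mechanical enough to check automatically. The following sketch validator enforces them for trigger sets over tuple-encoded terms; the encoding and the list of interpreted symbols are ours, purely for illustration.

```python
# Terms are nested tuples ('f', arg, ...); quantified variables are plain
# strings; we treat '=' and basic arithmetic symbols as interpreted.

INTERPRETED = {'=', '+', '-', '*', '<', '<='}

def vars_of(term):
    if isinstance(term, str):
        return {term}
    return set().union(*(vars_of(a) for a in term[1:])) if len(term) > 1 else set()

def symbols_of(term):
    if isinstance(term, str):
        return set()
    rest = (symbols_of(a) for a in term[1:])
    return {term[0]}.union(*rest) if len(term) > 1 else {term[0]}

def valid_trigger_set(trigger_set, bound_vars):
    covered = set().union(*(vars_of(t) for t in trigger_set))
    if not set(bound_vars) <= covered:
        return False            # must mention every quantified variable
    for t in trigger_set:
        if not (vars_of(t) & set(bound_vars)):
            return False        # each term needs a quantified variable
        syms = symbols_of(t)
        if not syms or syms & INTERPRETED:
            return False        # uninterpreted applications only
    return True

# Trigger [diff(s1, s2), member(x, s2)] for forall s1, s2, x:
ok = valid_trigger_set([('diff', 's1', 's2'), ('member', 'x', 's2')],
                       ['s1', 's2', 'x'])
print(ok)  # True
```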

When quantifier tags are not relevant, we omit them for brevity.

<sup>5</sup> The pre-defined atoms come from the first-order signature of the problem in question.

### **3.2 States**

As illustrated in Examples 1 and 2, both case-splitting and quantifier instantiation steps are crucial to our problem; we define our semantics around these two kinds of transitions. Furthermore, we must abstractly capture information relevant for deciding E-matching questions, tracking in particular which terms and equalities are known (modulo currently known equalities), and which quantifier instantiations have already been made.

**Definition 2 (States).** *States s* <sup>∈</sup> State*are defined as follows:*

$$s ::= \langle W, A, E \rangle \mid \Diamond \mid \bot$$

*where* ♦ *and* ⊥ *are distinguished symbols for* saturated *and* inconsistent *states, W (the* current quantifiers*) is a set of tagged quantifiers, A (the* current clauses*) is a set of extended clauses, and E (the* current E-state*) is explained below.*

For simple applications of our semantics, the set of current quantifiers remains fixed, but for problems with nested quantifiers (e.g. Example 3), it may grow as a solver runs. As we show, which instantiations are immediately enabled is definable in terms of the current quantifiers and the current E-state. The current clauses, through case-splitting, contribute new quantifiers to the current quantifiers and new literals to the E-state; new extended *clauses* may be added as a consequence of quantifier instantiations.

The inconsistent and saturated states represent two different termination conditions for traces in our semantics: the former due to logical inconsistency, and the latter due to all quantifier instantiations having been exhausted.

### **3.3 E-interfaces**

Each solver maintains its own implementation of E-graphs to efficiently represent and query the currently-known ground terms modulo congruences and known equalities. Rather than formalising such an implementation, we devise an abstraction called an *E-interface*, capturing the operations and expected mathematical properties of E-graph implementations.

**Definition 3 (E-interface Judgements).** *An E-interface* $E^{\mathrm{I}}$ *is a set of equalities and disequalities on terms.*<sup>6</sup> *We write* $E^{\mathrm{I}} \Vdash_{\mathrm{kn}} t$ *to express that the ground term t is* known *in the E-interface* $E^{\mathrm{I}}$*; we write* $E^{\mathrm{I}} \Vdash t_1 \sim t_2$ *to express that the ground terms* $t_1$ *and* $t_2$ *are* known equal *in* $E^{\mathrm{I}}$*. These two judgements are (mutually recursively) defined by (the least fixed-point of) the derivation rules:*

$$\frac{t_1 \sim t_2 \in E^{\mathrm{I}}}{E^{\mathrm{I}} \Vdash t_1 \sim t_2}\ \text{(eq-in)} \qquad \frac{E^{\mathrm{I}} \Vdash t_2 \sim t_1}{E^{\mathrm{I}} \Vdash t_1 \sim t_2}\ \text{(eq-sym)} \qquad \frac{E^{\mathrm{I}} \Vdash t_1 \sim t_2 \quad E^{\mathrm{I}} \Vdash t_2 \sim t_3}{E^{\mathrm{I}} \Vdash t_1 \sim t_3}\ \text{(eq-tran)} \qquad \frac{E^{\mathrm{I}} \Vdash_{\mathrm{kn}} t}{E^{\mathrm{I}} \Vdash t \sim t}\ \text{(eq-kn-refl)}$$

<sup>6</sup> A positive or negative non-equational literal, *P*, is added to the E-interface via $P = \top$ or $P = \bot$, respectively; $\top \neq \bot$ is preloaded into all E-interfaces.


$$\frac{E^{\mathrm{I}} \Vdash t_i \sim t_i' \quad E^{\mathrm{I}} \Vdash_{\mathrm{kn}} g(t_1, \ldots, t_i, \ldots, t_n)}{E^{\mathrm{I}} \Vdash g(t_1, \ldots, t_i, \ldots, t_n) \sim g(t_1, \ldots, t_i', \ldots, t_n)}\ \text{(eq-cong)} \qquad \frac{E^{\mathrm{I}} \Vdash t_1 \sim t_2}{E^{\mathrm{I}} \Vdash_{\mathrm{kn}} t_1}\ \text{(kn-eq)} \qquad \frac{E^{\mathrm{I}} \Vdash_{\mathrm{kn}} g(\ldots, t_i, \ldots)}{E^{\mathrm{I}} \Vdash_{\mathrm{kn}} t_i}\ \text{(kn-sub)}$$

*The judgement* $E^{\mathrm{I}} \Vdash t_1 \nsim t_2$ *represents* $t_1$ *and* $t_2$ *being* known disequal *in* $E^{\mathrm{I}}$*; the judgement* $E^{\mathrm{I}} \Vdash \bot$ *represents that* $E^{\mathrm{I}}$ *is* inconsistent *(in the logical sense); cf. App. A of the TR.*
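The known-equal judgement can be approximated by a small union-find congruence closure over a fixed, growing set of terms. This is an illustrative sketch, not the paper's least-fixed-point definition: querying `known_equal` first adds the queried terms (and their subterms) to the structure and then closes under congruence. Class and method names are ours.

```python
# A toy congruence closure, standing in for E^I |- t1 ~ t2 restricted to the
# terms supplied. Terms are nested tuples; constants are 0-ary applications.

class EInterface:
    def __init__(self):
        self.parent = {}
        self.terms = set()

    def find(self, t):                  # union-find with path halving
        self.parent.setdefault(t, t)
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]
            t = self.parent[t]
        return t

    def add(self, t):                   # register a term and its subterms
        if t in self.terms:
            return
        self.terms.add(t)
        self.find(t)
        for a in t[1:]:
            self.add(a)
        self.close()

    def union(self, t1, t2):            # assume the equality t1 = t2
        self.add(t1); self.add(t2)
        self.parent[self.find(t1)] = self.find(t2)
        self.close()

    def close(self):
        """Merge applications with congruent arguments until fixpoint."""
        changed = True
        while changed:
            changed = False
            for t1 in self.terms:
                for t2 in self.terms:
                    if (t1[0] == t2[0] and len(t1) == len(t2)
                            and self.find(t1) != self.find(t2)
                            and all(self.find(a) == self.find(b)
                                    for a, b in zip(t1[1:], t2[1:]))):
                        self.parent[self.find(t1)] = self.find(t2)
                        changed = True

    def known_equal(self, t1, t2):
        self.add(t1); self.add(t2)
        return self.find(t1) == self.find(t2)

# Example 1's facts: diff(b, c) = b and a = c.
a, b, c = ('a',), ('b',), ('c',)
E = EInterface()
E.union(('diff', b, c), b)
E.union(a, c)
print(E.known_equal(('diff', b, a), ('diff', b, c)))  # True, since a ~ c
```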

E-interfaces are equivalent if they agree on these judgements in all cases. When a proof step adds new literals, we must be able to extend our E-interfaces.

**Definition 4 (E-interface Extension).** *For a set of equality and disequality literals L, the* update of an E-interface $E^{\mathrm{I}}$ with *L, denoted* $E^{\mathrm{I}} \lhd L$*, is a minimal E-interface which satisfies all E-interface judgements that* $E^{\mathrm{I}}$ *does, while also satisfying* $E^{\mathrm{I}} \lhd L \Vdash l$ *for all* $l \in L$*.*

We call a set of terms a *basis* of $E^{\mathrm{I}}$ if each element is a representative of a different equivalence class<sup>7</sup> induced by the $t_1 \sim t_2$ relation on the terms known in $E^{\mathrm{I}}$. As we shall see in the next subsection, equivalence classes are relevant for defining which quantifier instantiations can be made after which.

#### **3.4 E-histories, E-states, E-matching**

As illustrated in Example 1, E-matching against triggers does not suffice to determine whether a quantifier instantiation should be considered *enabled*; we must also determine whether the instantiation is considered redundant given *previous* ones. We record previous instantiations using our next formal ingredient:

**Definition 5 (E-histories and E-states).** *An* E-history $E^{\mathrm{H}}$ *is a set of pairs (each denoted* $(\alpha : \vec{r}\,)$*) in our formalism: the first element is a tag (identifying a quantifier), and the second is a vector of ground terms (representing an instantiation of the corresponding quantifier).*

*An* E-state *(cf. Definition 2) E is a pair* $(E^{\mathrm{I}}, E^{\mathrm{H}})$ *of E-interface and E-history.*

Recall that E-states are a part of the states in our formalism. E-states consist of an E-interface component, which captures the current known terms and equality information, and an E-history component, which records the history of instantiations, in particular representing sufficient information to reject redundant instantiations.

**Definition 6 (History-Enabled E-matches).** *Given a candidate match pair* $(\alpha : \vec{r}\,)$ *(of tag α and vector of terms* $\vec{r}$*), the* E-state *E* enables $(\alpha : \vec{r}\,)$*, written* $E \Vdash_{\mathrm{hist}} (\alpha : \vec{r}\,)$*, if: for every instantiation pair* $(\alpha : \vec{r}\,') \in E^{\mathrm{H}}$*, at least one of the pointwise equalities* $r_i \sim r_i'$ *is* not *known in* $E^{\mathrm{I}}$*.*

<sup>7</sup> What we refer to as an *equivalence class* in this paper is also known as a *congruence class* in the literature: an equivalence class modulo known equalities.

*Example 4.* Revisiting Example 1, suppose the tag of the quantifier is τ, and *E* is the E-state whose E-interface component contains the example literals. The first instantiation $s_1 \mapsto b$, $s_2 \mapsto a$ and $x \mapsto t$ is represented in our formal model by adding $(\tau : (b, a, t))$ to the E-history, resulting in a new E-state, say $E'$. The second candidate match $s_1 \mapsto b$, $s_2 \mapsto c$ and $x \mapsto t$ is not enabled in $E'$ since the three pointwise equalities between instantiated terms are all known in $E'$.
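Definition 6 can be read directly as code. In the sketch below, `known_equal` is a stand-in for the E-interface judgement (here just syntactic equality plus one hard-coded known equality, mirroring a ∼ c from Example 1); all names and the encoding are ours.

```python
# A sketch of Definition 6: a candidate instantiation is enabled unless some
# recorded instantiation of the same quantifier is pointwise known-equal.

EQUAL = {frozenset(p) for p in [('a', 'c')]}   # a ~ c is known

def known_equal(r1, r2):
    """Stand-in for E^I |- r1 ~ r2 (illustration only)."""
    return r1 == r2 or frozenset((r1, r2)) in EQUAL

def hist_enables(history, tag, r):
    """E ||-hist (tag : r): for every previous instantiation of `tag`,
    at least one pointwise equality must NOT be known."""
    return all(
        any(not known_equal(ri, pi) for ri, pi in zip(r, prev))
        for (t, prev) in history if t == tag
    )

history = [('tau', ('b', 'a', 't'))]           # Example 4's first instantiation
print(hist_enables(history, 'tau', ('b', 'c', 't')))  # False: c ~ a is known
print(hist_enables(history, 'tau', ('b', 'b', 't')))  # True: b is not known equal to a
```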

With the help of the above ingredients, we formally characterise E-matching:

**Definition 7 (E-matching).** *For a given state* $\langle W, A, E \rangle$*, the judgement* $W, A, E \Vdash_{\mathrm{match}} (\forall \vec{x}.\,\overrightarrow{[T]}\,A')^{\alpha} \rightsquigarrow \vec{r}$ *defines which instantiations (using terms* $\vec{r}$*) of which quantifiers* $(\forall \vec{x}.\,\overrightarrow{[T]}\,A')^{\alpha}$ *are enabled by E-matching rules, as follows:*

$$\frac{(\forall \vec{x}.\,\overrightarrow{[T]}\,A')^{\alpha} \in W \qquad \vec{t} \text{ is one trigger set of } \overrightarrow{[T]} \qquad E^{\mathrm{I}} \Vdash_{\mathrm{kn}} \vec{t}\,[\vec{r}/\vec{x}\,] \qquad E \Vdash_{\mathrm{hist}} (\alpha : \vec{r}\,)}{W, A, E \Vdash_{\mathrm{match}} (\forall \vec{x}.\,\overrightarrow{[T]}\,A')^{\alpha} \rightsquigarrow \vec{r}}$$

*We write* $W, A, E \nVdash_{\mathrm{match}}$ *to mean* no *instantiations are enabled in this state.*

E-matching ($\Vdash_{\mathrm{match}}$) requires (1) a quantifier in the current state, (2) a trigger set $\vec{t}$ with replacement terms $\vec{r}$ for the quantified variables $\vec{x}$ to be known in $E^{\mathrm{I}}$, and (3) that this potential match is enabled by the E-state *E*. Note that (2) requires the instantiated terms $\vec{t}\,[\vec{r}/\vec{x}\,]$ of one trigger set $\vec{t}$ to be known in the current E-interface $E^{\mathrm{I}}$.

### **3.5 State Transitions**

The last main ingredient of our formal model is the definition of state transitions.

**Definition 8 (State Transitions).** *The* (single step) state transition relation −→ ⊆ State × State*is defined by the union of the following cases:*

$$\begin{aligned} \emptyset \subset \Phi \subseteq \left\{ \phi\_i \mid C \in A; \ W\_1, E\_1^1 \mid \forall\_{\text{sat}} \ C; \ C \text{ is } \cdots \vee \phi\_i \vee \cdots \right\} \\\hline W\_2 = W\_1 \cup \text{filter}\_{\forall} (\Phi) \quad E\_2^1 = E\_1^1 \lhd \text{filter}\_{\text{ilt}}(\Phi) \quad E\_2^{\text{H}} = E\_1^{\text{H}} \\\hline \langle W\_1, A, E\_1 \rangle \longrightarrow \langle W\_2, A, E\_2 \rangle \\\hline \end{aligned} \text{(SPI.T)}$$
 
$$\frac{E^{\mathrm{I}} \Vdash \bot}{\langle W, A, E \rangle \longrightarrow \bot} \ \text{(bot)}$$
 
$$\frac{E^{\mathrm{I}} \nVdash \bot \qquad \forall C \in A.\ W, E^{\mathrm{I}} \Vdash_{\mathrm{sat}} C \qquad \langle W, A, E \rangle \nVdash_{\mathrm{match}}}{\langle W, A, E \rangle \longrightarrow \Diamond} \ \text{(sat)}$$

$$\frac{\begin{array}{c} \langle W_1, A_1, E_1 \rangle \Vdash_{\mathrm{match}} (\forall \vec{x}.\,\vec{[T]}\,A_{11})^{\alpha} \rightsquigarrow \vec{r} \qquad A_{12} = A_{11}[\vec{r}/\vec{x}] \\ A'_{12} = \mathrm{filter}_{\forall}(A_{12}) \cup \mathrm{filter}_{\mathrm{lit}}(A_{12}) \qquad A_2 = A_1 \cup (A_{12} \setminus A'_{12}) \\ W_2 = W_1 \cup \mathrm{filter}_{\forall}(A_{12}) \qquad E_2^{\mathrm{I}} = E_1^{\mathrm{I}} \lhd \mathrm{filter}_{\mathrm{lit}}(A_{12}) \qquad E_2^{\mathrm{H}} = E_1^{\mathrm{H}} \lhd (\alpha : \vec{r}) \end{array}}{\langle W_1, A_1, E_1 \rangle \longrightarrow \langle W_2, A_2, E_2 \rangle} \ \text{(inst)}$$

*where the overloaded operators* filter$_{\forall}$ *and* filter$_{\mathrm{lit}}$ *select quantifiers and simple literals, respectively, from any provided set of extended literals, or from unit clauses of any provided set of extended clauses; the judgement $W, E^{\mathrm{I}} \Vdash_{\mathrm{sat}} C$ holds if, for some disjunct $\phi_i$ of $C$, either $\phi_i$ is a tagged quantifier from $W$, or $\phi_i$ is a simple literal that $E^{\mathrm{I}}$ knows.*

Our state transition relation −→ consists of case-splitting steps, steps that deduce the inconsistent state, steps that deduce the saturated state, and quantifier instantiation steps, corresponding to the rules (split), (bot), (sat) and (inst) respectively.

We allow a case-splitting transition to non-deterministically select *any* nonempty subset of the disjuncts in the *unsatisfied* current clauses—those that have not yet been made true in the current state. A case-splitting transition must make progress towards satisfying the clauses. We do not impose restrictions on the order in which unsatisfied current clauses are chosen, nor on the number of disjuncts assumed within a clause, provided that progress is being made.<sup>8</sup>

We model case-splitting as non-deterministic. Recall Example 2, where the clause <sup>¬</sup>*member*(*t, a*) ∨ ¬*member*(*t, diff*(*b, a*)) is learnt. Subsequently, the solver can choose to assume either one or both of the disjuncts; generally, it can choose to assume neither disjunct as long as it selects at least one disjunct from some other unsatisfied clause. Here, the disjuncts are ground simple literals (which are added to the E-state); in general, some could be new quantifiers to record.

Our sat judgement checks if a provided clause is satisfied (i.e. at least one disjunct is assumed in the current state). If all current clauses are satisfied, and the E-interface is not inconsistent, and there are no enabled instantiations, the (sat) rule applies and transitions to the saturated state (♦). Conversely, if the current E-interface is inconsistent, the (bot) rule transitions to the inconsistent state (⊥); if there are enabled instantiations, the (inst) rule applies.

The instantiation rule (inst) relies on the $\Vdash_{\mathrm{match}}$ judgement to select an instantiation enabled by E-matching rules. The effect of an instantiation transition involves adding quantifiers and simple literals occurring as unit clauses in the quantifier body to the current quantifiers $W_1$ and E-interface $E_1^{\mathrm{I}}$, respectively; any remaining non-unit clauses are added to the current clauses $A_1$. Finally, the E-history $E_1^{\mathrm{H}}$ is updated to record this instantiation.
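The routing performed by an instantiation transition can be sketched as follows, under a simplified encoding of our own (clauses as tuples of literals, the E-interface and E-history as plain Python sets), rather than the paper's formal definitions:

```python
def is_quantifier(lit):
    """Our toy convention: a quantifier literal is a tuple headed 'forall'."""
    return isinstance(lit, tuple) and len(lit) > 0 and lit[0] == 'forall'

def apply_inst(state, tag, args, instantiated_body):
    """Effect of one (inst) transition: unit quantifiers join W, unit simple
    literals join the E-interface, non-unit clauses join the current
    clauses A, and the E-history records the pair (tag : args)."""
    W, A, E_I, E_H = state
    units = [clause[0] for clause in instantiated_body if len(clause) == 1]
    W2 = W | {l for l in units if is_quantifier(l)}
    E_I2 = E_I | {l for l in units if not is_quantifier(l)}
    A2 = A | {clause for clause in instantiated_body if len(clause) > 1}
    E_H2 = E_H | {(tag, args)}
    return (W2, A2, E_I2, E_H2)

# An instantiated body with one unit quantifier, one unit simple literal,
# and one non-unit clause (all names hypothetical):
body = [(('forall', 'inner'),),
        (('member', 't', 'a'),),
        (('member', 't', 'a'), ('member', 't', 'b'))]
state2 = apply_inst((set(), set(), set(), set()),
                    'union-elim', ('a', 'b', 't'), body)
```
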

In practice, common SMT solvers such as cvc5 [2] perform quantifier instantiation both (1) up-front and (2) in phases interleaved with other solver steps. In particular, the latter is essential for many applications: most quantifier instantiations lead to e.g. clauses requiring context-aware case-splitting via DPLL/CDCL. Our model effectively captures both processes through its unrestricted interleavings of quantifier instantiation and case-splitting steps.

In retrospect, Sects. 3.2 to 3.5 have tackled design challenges #1 and #2 (cf. Sect. 2.2). We address #3 and #4 in the next two subsections, respectively.

<sup>8</sup> Our model allows simulating efficient propagation-based restrictions of case-splitting, but does not require it; restricting to this case would be possible if needed.

### **3.6 Nested Quantifiers**

Example 3 demonstrates that instantiating outer quantifiers in nested structures of quantifiers can introduce new quantifiers on the fly. To effectively argue for termination regarding these instantiations (as will be discussed in Sect. 4), one must be able to identify and predict these dynamically introduced quantifiers. To facilitate this, we employ a tagging system that is capable of handling nested structures (cf. App. A of the TR for details). Each quantifier in an axiomatisation is labelled with a distinct tag. The tag for any non-nested quantifier (including the outermost quantifier in any nested structure of quantifiers) is not parameterised. A nested quantifier has its tag parameterised by all of its outer-quantified variables. Instantiating an outer quantifier produces a copy of the quantifier body in which (among other changes) tags of all inner quantifiers that are parameterised by this outer-quantifier are updated to reflect this instantiation. In Example 3, we label the outer and inner quantifiers with tags union-elim and union-elim(*s*1*, s*<sup>2</sup>), respectively. Instantiating the outer quantifier with *s*<sup>1</sup>→*<sup>a</sup>* and *<sup>s</sup>*<sup>2</sup>→*<sup>b</sup>* introduces a copy of the quantifier body in which the inner quantifier is tagged with union-elim(*a, b*).
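The tag specialisation step can be illustrated in a few lines of Python; the encoding of tags as name/parameter pairs is our own, and the paper's full tagging system is in App. A of the TR:

```python
def specialise_tag(tag, subst):
    """Replace each parameter of a tag according to the substitution chosen
    when instantiating the enclosing outer quantifier."""
    name, params = tag
    return (name, tuple(subst.get(p, p) for p in params))

outer = ('union-elim', ())               # outermost tag: unparameterised
inner = ('union-elim', ('s1', 's2'))     # nested tag: parameterised by s1, s2

# Instantiating the outer quantifier with s1 -> a, s2 -> b specialises
# the inner quantifier's tag:
print(specialise_tag(inner, {'s1': 'a', 's2': 'b'}))  # ('union-elim', ('a', 'b'))
```
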

To further mitigate redundancy in quantifier instantiation, our semantics supports two additional optimisations. First, a quantifier is only permitted to join the current quantifiers *W* if its tag is known to be *distinct* from the tags of existing quantifiers in *W*, *modulo equivalence on the parameters of the tags*, as assessed in the current E-interface. This criterion prevents adding redundant quantifiers into *<sup>W</sup>*. Second, the relation of history-enabled E-matches hist leverages the current E-interface to verify the uniqueness of tags—once again, modulo equivalence on tag parameters—before enabling an E-match. An E-match is enabled only if no quantifier with an equivalent tag has been instantiated with an equivalent match previously. (cf. App. A of the TR for related definitions.)

### **3.7 Theory-Specific Reasoning**

Although our rules do not yet account for (interpreted) theory reasoning (as performed by theory solvers in a typical SMT solver design), our small-step semantics is intentionally chosen to easily accommodate future extensions: "hot-plugging" new kinds of primitive transitions is straightforward, and will not disturb the existing formal rules (e.g. for quantifier instantiation or case-splitting). Similarly to our E-interfaces, which abstract over E-graph details, we plan to do this in a way which abstracts over the *effects* of theory deduction steps, without exposing the solver-specific internals. For example, we can add deduction steps which extend the E-interface with new terms and/or (dis)equalities, based on a valid deduction within, say, an integer theory.

Just as for quantifier instantiations, it may be necessary for some applications to guarantee that theory reasoning is performed under some fairness conditions (e.g. that inconsistencies detectable by a theory solver are not infinitely postponed). Imposing custom fairness constraints on the traces of our semantics for specific examples can be achieved in a standard way for small-step semantics.

While it is clear that extensions to theory solving will be straightforward, we choose the case study for this paper to be a complex and practically-relevant axiomatisation which nonetheless does not rely on external theory solvers.

# **4 Proving Instantiation Termination for E-matching**

We now apply our model to prove instantiation termination for a practical E-matching-based axiomatisation. First, we briefly present our set theory axiomatisation, adapted from Dafny and Viper. We then demonstrate our methodology for constructing instantiation termination proofs using our model.

### **4.1 Axiomatisation for Set Theory**

To assess our formal model, we tackle formal proofs of instantiation termination for axiomatisations currently employed by state-of-the-art verification tools, specifically targeting set theory in this paper. Set theory, despite the known challenges associated with its quantifier instantiation, is extensively used in verifiers.

Drawing from the axioms used by Dafny [18] and Viper [27], we aim to construct an axiomatisation that (1) faithfully models the core of set theory, (2) supports various encodings of set theory used by verifiers, and (3) strives to maintain a balance on triggers to ensure instantiation termination without harming instantiation completeness.

Our axiomatisation involves 12 uninterpreted functions, representing a wider range of set operations than the counterparts in Dafny and Viper. Cardinality operators are, however, removed due to their dependency on external linear arithmetic solvers (cf. Sect. 3.7 for explanation). Refer to App. C.1 and C.2 of the TR for a full presentation of our axiomatisation and comparison with theirs.

Dafny and Viper typically use complex "iff" formulas to define set operations, restricting trigger flexibility as they must apply in both directions of the "iff". Inspired by proof systems for formal logic, we redefine set operations using analogues of introduction and elimination axioms, introducing independent triggers for each implication direction and thereby enhancing trigger flexibility.

*Example 5.* Below is our elimination rule for set union, named (union-elim), allowing more alternative triggers than the counterparts from Dafny and Viper.

$$\begin{array}{l} \forall s_1, s_2, x.\ [member(x, union(s_1, s_2))]\ [union(s_1, s_2), member(x, s_1)]\ [union(s_1, s_2), member(x, s_2)] \\ \quad member(x, union(s_1, s_2)) \rightarrow member(x, s_1) \vee member(x, s_2) \end{array}$$

Our axiomatisation overall has more permissive triggers, which provides more flexibility for instantiation, but also increases the risk of non-termination. That instantiation termination holds for our axiomatisation means that Dafny and Viper's more restrictive triggers are not necessary to ensure termination.

### **4.2 Progress Measure**

To prove *instantiation termination* for an axiomatisation, it suffices to prove that querying *any* set of ground literals on the axiomatisation cannot lead to an infinite trace in our formal semantics. The proof argument is parametric with respect to the ground literals in the initial state.<sup>9</sup> Drawing inspiration from program reasoning [7,26], we identify a suitable measure on solver states and then establish its decrease at appropriate steps in a well-founded manner.

This method leverages the specific features of the axioms under consideration. We analyse our set theory axioms and classify them by two criteria: (1) whether instantiating the axiom would potentially generate new quantifiers or new equivalence classes of terms, i.e. new terms modulo equalities, and (2) whether the axiom contains nested quantifiers.

*Non-generative Quantifiers.* We call a quantifier *non-generative* if its instantiations yield neither new quantifiers nor new equivalence classes of terms. The majority of our set theory axioms are non-generative.

For instance, the (union-elim) axiom from Example 5, when instantiated with $s_1 \mapsto a$, $s_2 \mapsto b$ and $x \mapsto t$, yields $\neg member(t, union(a, b)) \vee member(t, a) \vee member(t, b)$, without the potential (via case-splitting) to introduce new quantifiers or new equivalence classes of terms. The absence of new terms is because all of $t$, $a$, $b$ and $union(a, b)$ are subterms of the matched trigger and hence known. $Bool$-sorted terms never add new equivalence classes (cf. Definition 3).

Instantiating a non-generative quantifier reduces the number of enabled E-matches by at least one: on the one hand, history-enabled E-matches prevent instantiating the same quantifier with equivalent matches; on the other hand, instantiating a non-generative quantifier does not introduce new quantifiers or equivalence classes, and thereby does not expand the match pool. This suggests:

**Idea 1.** *Define the progress measure in terms of the number of enabled E-matches.*

*Generative Quantifiers.* A quantifier is *generative* if its instantiations may introduce new quantifiers or new equivalence classes of terms. Among our set theory axioms *without nested quantifiers*, four are generative, with each potentially creating new applications of Skolem functions upon instantiation.

For instance, the following (subset-intro) axiom, when instantiated, may create a new term $Sk_{ss}(s_1, s_2)$ for some sets $s_1$ and $s_2$:

$$\begin{array}{l} \forall s_1, s_2.\ [subset(s_1, s_2)]\ \left( subset(s_1, s_2) \vee member(Sk_{ss}(s_1, s_2), s_1) \right) \wedge \\ \quad \left( subset(s_1, s_2) \vee \neg member(Sk_{ss}(s_1, s_2), s_2) \right) \end{array}$$

<sup>9</sup> In fact, it would be straightforward to generalise the termination proof argument, including the termination theorem, to the ground *clauses* in the initial state.

Similarly, axioms for introducing extensional equality on sets, set disjointness, and set emptiness—namely (equal-sets-intro), (disjoint-intro), and (isEmpty-intro-1), respectively—can each produce new applications of Skolem functions: $Sk_{eq}(s_1, s_2)$, $Sk_{dj}(s_1, s_2)$, and $Sk_{ie}(s)$, respectively (cf. App. C.1 of the TR).

Generative quantifiers, by introducing new equivalence classes of terms, may expand the pool of E-matches, including those enabled. We thereby suggest:

**Idea 2.** *Predict new equivalence classes of terms introduced by instantiating generative quantifiers; incorporate these forecasts to estimate enabled E-matches.*

Set theory axioms *with nested quantifiers* are all generative because their instantiations can potentially create new quantifiers. Such axioms include (subset-elim) from Example 3, and axioms (disjoint-elim) and (isEmpty-elim-1) for eliminating set disjointness and emptiness, respectively (cf. App. C.1 of the TR).

Instantiating these three axioms does not introduce new equivalence classes of ground terms. However, since they contain nested quantifiers, their instantiations can create new quantifiers—each with its own set of enabled E-matches, effectively raising the total number of enabled E-matches. We therefore propose:

**Idea 3.** *Incorporate predicted effects from instantiating generative quantifiers with nested quantifier structures to refine estimates of enabled E-matches.*

In practice, provided that these ideas are respected, one can often define simpler termination measures via *over-approximations* of these candidate instantiations (provided this over-approximation remains finite and decreasing).

*Formalising a Practical Progress Measure.* A basis of an E-interface is a representation of the known equivalence classes. We define its overapproximation to include potential new equivalence classes introduced by generative quantifiers.

**Definition 9 (Overapproximation of Basis for Set Theory).** *Suppose $B$ is a basis of an E-interface. The functions $O_1(B)$ and $O_2(B)$ denote overapproximations for the $Set(T)$-sorted and $T$-sorted elements within basis $B$, respectively, to accommodate new expected equivalence classes of terms.*

$$\begin{aligned} O\_1(B) &= \text{filter}\_{Set(T)}(B) \\ O\_2(B) &= \text{filter}\_T(B) \cup \widehat{Sk\_{ss}}(O\_1(B), O\_1(B)) \cup \widehat{Sk\_{eq}}(O\_1(B), O\_1(B)) \\ &\cup \widehat{Sk\_{dj}}(O\_1(B), O\_1(B)) \cup \widehat{Sk\_{ie}}(O\_1(B)) \end{aligned}$$

*Here* filter$_{Set(T)}$ *and* filter$_T$ *take a basis and select its $Set(T)$-sorted and $T$-sorted elements, respectively; each $\widehat{Sk}$ is lifted from the corresponding $Sk$ to support sets.*

The potential new terms introduced by generative quantifiers are all *T*-sorted Skolem terms. Thus predictions are solely performed by *O*<sup>2</sup>(*B*), not by *<sup>O</sup>*<sup>1</sup>(*B*).

Note that the results of these two overapproximations are guaranteed to be finite. E-interface bases always remain finite: elements are added (at most) for the new terms introduced in a step. Since our construction filters and e.g. maps Skolem functions over these finite sets, its results are finite. Leveraging this overapproximation of equivalence classes, we estimate enabled E-matches.
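As a sanity check of this finiteness argument, Definition 9 can be transcribed executably. The encoding of a basis as a set of `(sort, term)` pairs, and all names below, are ours, not the paper's:

```python
def O1(B):
    """Set(T)-sorted elements of the basis: no predictions needed."""
    return {t for (sort, t) in B if sort == 'Set'}

def O2(B):
    """T-sorted elements, extended with all predicted Skolem applications
    over the Set(T)-sorted elements."""
    sets, elems = O1(B), {t for (sort, t) in B if sort == 'T'}
    for sk in ('Sk_ss', 'Sk_eq', 'Sk_dj'):   # binary Skolem functions
        elems |= {(sk, s1, s2) for s1 in sets for s2 in sets}
    elems |= {('Sk_ie', s) for s in sets}    # unary Skolem function
    return elems

B = {('Set', 'a'), ('Set', 'b'), ('T', 't')}
# Finiteness is manifest: |O2(B)| = 1 + 3*2*2 + 2 = 15 elements here.
print(len(O2(B)))  # 15
```
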

**Definition 10 (Overestimation of Enabled E-matches for Set Theory).** *Consider an arbitrary state $s = \langle W, A, E\rangle$. Let $B$ be a basis of the E-interface $E^{\mathrm{I}}$. Define an overestimation of the enabled E-matches for $s$ from $B$ as follows:*

$$P(\langle W, A, E \rangle, B) = \left\{ \dots,\ p_{\tau_i},\ \dots,\ p_{\tau_j(\vec{r})},\ \dots \right\}$$

*where $p_{\tau_i}$ and $p_{\tau_j(\vec{r})}$ each denote a set of tuples that overapproximate the enabled E-matches from the basis $B$ to the quantifiers with tags $\tau_i$ and $\tau_j(\vec{r})$, respectively; each tag $\tau_i$ identifies an original quantifier from $W$, and each $\tau_j(\vec{r})$ identifies a quantifier introduced by instantiating an original quantifier $\tau_j$ from $W$ with terms $\vec{r}$ from approximations $O_1(B)$ or $O_2(B)$. Original quantifiers from $W$ are those from the axiomatisation, not those introduced at runtime.*

*To clarify, examples for each category are presented as follows; the remaining quantifiers shall adhere to the same pattern.*


$$p_{\text{subset-elim}(a,b)} = \left\{\, x \mid x \in O_2(B);\ E \Vdash_{\mathrm{hist}} (\text{subset-elim}(a, b) : x) \,\right\} \quad \text{where } a, b \in O_1(B).$$

We define a progress measure for our set theory axiomatisation. The first and foremost ingredient of our progress measure is an overestimation of the number of enabled E-matches. We anticipate that this overestimation strictly decreases after each instantiation step and does not increase after each case-splitting step. The second ingredient is the number of unsatisfied current clauses, which we expect to decrease by at least one after each case-splitting step. The result of the progress measure is a lexicographically ordered pair of the above two ingredients.

**Definition 11 (Progress Measure for Set Theory).** *We define the progress measure $M : \text{State} \longrightarrow (\mathbb{N} \cup \{-1\})^2$ as follows, where $|\cdot|$ denotes cardinality.*

$$M(s) = \begin{cases} \left( \sum_{p \in P(\langle W, A, E \rangle, B)} |p|,\ \left| \left\{ C \in A \mid W, E^{\mathrm{I}} \nVdash_{\mathrm{sat}} C \right\} \right| \right) & \text{if } s = \langle W, A, E \rangle \text{ and } B \text{ is a basis for } E^{\mathrm{I}} \\ (-1, -1) & \text{if } s = \bot \text{ or } \Diamond \end{cases}$$

Inconsistent or saturated states are assigned the smallest measure $(-1, -1)$. The order on $\mathbb{N} \cup \{-1\}$ is the natural extension of that on $\mathbb{N}$.
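The well-foundedness argument rests only on lexicographic comparison of the two components, which a few lines of Python illustrate (the numeric values below are made up purely for illustration):

```python
def measure(state):
    """state is either ('run', overestimate, unsat_count), or the terminal
    markers 'bot' / 'sat'; Python tuples compare lexicographically, matching
    the order used in Definition 11."""
    if state in ('bot', 'sat'):
        return (-1, -1)
    _, overestimate, unsat_count = state
    return (overestimate, unsat_count)

before = measure(('run', 7, 3))
after_inst = measure(('run', 6, 5))    # (inst): first component decreases,
                                       # even if new clauses are added
after_split = measure(('run', 7, 2))   # (split): first equal, second drops
assert after_inst < before and after_split < before
assert measure('bot') < after_inst     # terminal states are smallest
```
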

### **4.3 Invariants and Termination Theorem**

Drawing on program reasoning, we anticipate that classical techniques such as inductive invariants can be employed in termination proofs. We maintain two kinds of invariants: general-purpose and problem-specific invariants.

General-purpose invariants uphold the integrity of our formal semantics, remaining valid across all applications. For example, the E-history $E^{\mathrm{H}}$ of an arbitrary state $s = \langle W, A, E\rangle$ must be up to date w.r.t. the current quantifiers $W$ and E-interface $E^{\mathrm{I}}$. That is, for every pair $(\tau : \vec{r})$ from $E^{\mathrm{H}}$, there exists a quantifier $\forall \vec{x}.\,\vec{[T]}\,A$ from $W$ whose tag is $\tau$, the dimension of $\vec{x}$ is equal to that of $\vec{r}$, $E^{\mathrm{I}} \Vdash_{\mathrm{kn}} \vec{r}$, and $E^{\mathrm{I}} \Vdash_{\mathrm{kn}} \vec{t}\,[\vec{r}/\vec{x}]$ for some trigger set $\vec{t}$ from $\vec{[T]}$. (cf. App. A of the TR for more invariants.)
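Under a simplified encoding of our own (quantifiers as `(tag, variables, trigger_sets)` triples, the E-interface as a plain set of known ground terms), this invariant is directly checkable; the helper names below are ours:

```python
def instantiate(term, subst):
    """Apply a variable substitution to a (nested-tuple) trigger term."""
    if isinstance(term, str):
        return subst.get(term, term)
    f, *args = term
    return (f, *(instantiate(a, subst) for a in args))

def history_up_to_date(W, known, history):
    """Check that every recorded pair (tag : r) points at a quantifier in W
    with matching tag and arity, that the terms r are known, and that some
    instantiated trigger set t[r/x] is known."""
    for tag, args in history:
        q = next((q for q in W if q[0] == tag), None)
        if q is None or len(q[1]) != len(args):
            return False
        subst = dict(zip(q[1], args))
        if not all(a in known for a in args):
            return False
        if not any(all(instantiate(t, subst) in known for t in ts)
                   for ts in q[2]):
            return False
    return True

W = [('union-elim', ('s1', 's2', 'x'),
      [[('member', 'x', ('union', 's1', 's2'))]])]
known = {'a', 'b', 't', ('union', 'a', 'b'),
         ('member', 't', ('union', 'a', 'b'))}
assert history_up_to_date(W, known, {('union-elim', ('a', 'b', 't'))})
```
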

Problem-specific invariants are tailored to the distinct features of each problem, focusing on properties of solver states reachable from specified initial states, and tracing the origins of terms in intermediate states. For example, consider an arbitrary intermediate state $\langle W, A, E\rangle$: for each extended clause in $A$ of the form $\neg member(t, union(a, b)) \vee member(t, a) \vee member(t, b)$, $(\text{union-elim} : (a, b, t)) \in E^{\mathrm{H}}$ holds; the tag being for the axiom (union-elim) discussed in Example 5. This invariant concerns the origins of the extended clauses in the current clauses $A$. Case-splitting on a current clause (e.g. the one above) may seem to introduce a new term, but this invariant indicates that this term is not new—it is equal to a known term that triggered a prior instantiation, as tracked by the E-history $E^{\mathrm{H}}$. This ensures a traceable lineage for each clause, linking it back to a specific quantifier in the E-history. (cf. App. B of the TR for more invariants.)

We finally state the instantiation termination theorem for our set theory axiomatisation, proven by induction on traces, leveraging both general-purpose and set-theory-specific invariants. Note that termination is proved against an *arbitrary* set of ground literals; this works because our progress measure and invariants are defined parametrically in the current state. Given the right invariants and termination measure, the proof is straightforward (cf. App. B of the TR). This theorem guarantees the absence of matching loops in this axiomatisation; practitioners using this axiomatisation can hence confidently seek terminating answers to ground theory queries.

**Theorem 1 (Instantiation Termination for Set Theory).** *Suppose $L$ is an arbitrary set of ground literals. The initial state is $s_0 = \langle W_0, A_0, E_0\rangle$, where $W_0$ is our axiomatisation for set theory with tags, $A_0 = \emptyset$, $E_0^{\mathrm{I}} = \emptyset \lhd L$, and $E_0^{\mathrm{H}} = \emptyset$. Any sequence of transitions from the initial state $s_0$, where $\longrightarrow$ as defined in Sect. 3.5 is the transition relation, has a finite length.*

# **5 Related Work**

For the purpose of program verification, where SMT solvers are used to prove unsatisfiability, E-matching is widely used to handle quantifiers. The idea of E-matching dates back to Nelson [25], which was first put into practice in Simplify [8]. Since then, efficient handling of E-matching-based quantifier instantiation has been studied by, e.g. de Moura and Bjørner [21] for Z3, Ge et al. [14] for CVC3, Bansal et al. [1] for Z3 and CVC4, and Moskal et al. [20] for Fx7. When satisfiable results and their models are of interest, model-based quantifier instantiation (MBQI) [15] can be used to handle quantifiers.

Dross et al. [9–11] formally define and reason about instantiation termination in a similar context. They define a novel *logic* with first-class triggers, introduce *instantiation trees* as algebraic objects to help define termination, and provide an ingenious technique for showing, for their implementation in Alt-Ergo, that finding a *single* finite instantiation tree is sufficient for termination.

While instantiation trees are a powerful tool for numerous deep meta-theoretic results [9], we believe that *applying* a formal inductive construction of instantiation trees to larger examples would be complex in practice: existing examples focus instead on bounds for the sets of terms ever generatable by a solver run. These arguments closely relate to our inductive termination proofs over traces. Our work enables detailed formal proofs based directly on such familiar notions from program reasoning, including inductive invariants and well-founded measures.

The approach of this prior work also requires restrictions on solver behaviour, including *fairness* of quantifier instantiation, and *eager* application of theory deductions (via entailments in their custom logic)<sup>10</sup>. Our operational model and termination proofs do not require or build in such assumptions. Still, *restricting* our traces (e.g. with fairness constraints) would be simple to do if desired for specific applications. Our weak assumptions make our approach (extended with appropriate theory deduction steps) applicable to SMT solvers broadly; solvers such as Z3 [22] and cvc5 [2] commonly interleave theory reasoning and quantifier instantiation in (bounded or exhaustive) rounds of multiple steps.

The Axiom Profiler [4] leverages Z3 log files to provide comprehensive support for analysing quantifier instantiations. The tool focuses on helping users effectively understand and debug problematic solver runs, rather than proving their absence. It was validated by empirical evidence rather than formal proofs.

Existing works on the termination of SMT transition systems [3,5,6,23] demonstrate that divergence is prevented by ensuring all new terms derive from a finite basis. In contrast, in our work, a finite basis does not imply termination: the basis can grow. At a high level these works prove that certain solver aspects always terminate. However, E-matching cannot have this property; instead it places the onus on the author of an axiomatisation to achieve termination through careful selection of axioms and triggers, motivating a user-facing model.

<sup>10</sup> We explain how to simply add theory steps to our operational model in Sect. 3.7.

# **6 Conclusion and Future Work**

We have presented a novel model for E-matching as widely employed in SMT solvers, abstracting over solver details while enabling detailed and formal proofs of instantiation termination. Our model has been shown to apply directly and rigorously to the kinds of axiomatisations used in practical verification tools.

In future work, we would like to explore axiomatisations that rely on more restricted characteristics of a solver, such as fairness of instantiation selection or theory reasoning steps. Similarly to our E-interfaces, we will investigate suitable abstractions over theory solver interactions incorporated into a proof search.

While instantiation termination is a much sought-after property, the complementary problem of guaranteed instantiation completeness is a natural next target to investigate with our novel operational model, which may require us to also explore various fairness restrictions of our model's transition relation.

**Acknowledgments.** We thank the anonymous reviewers, Mark R. Greenstreet and Yanze Li for their detailed and constructive suggestions. We are very grateful to Claire Dross for putting generous time and energy into thoughtful feedback for us. This work has been partly funded by NSERC Discovery Grants held by Garcia and Summers.

# **References**




**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Fast and Verified UNSAT Certificate Checking

Peter Lammich

University of Twente, Enschede, Netherlands p.lammich@utwente.nl

Abstract. We describe a formally verified checker for unsatisfiability certificates in the LRAT format, which can be run in parallel with the SAT solver, processing the certificate while it is being produced. It is implemented time and memory efficiently, thus increasing the trust in the SAT solver at low additional cost.

The verification is done w.r.t. a grammar of the DIMACS format and a semantics of CNF formulas, down to the LLVM code of the checker. In this paper, we report on the checker and its design process using the Isabelle-LLVM stepwise refinement approach.

Keywords: UNSAT certificates · LRAT · Isabelle-LLVM · Verified Software

# 1 Introduction

SAT solvers are highly complex and highly optimized programs, which are used to verify critical properties of other systems. To increase the trust in them, SAT solvers produce certificates that can be independently checked by formally verified checkers [5,9,10,16,23,34,35]. Here, the focus is on certificates for unsatisfiability, as certificates for satisfiability are (considered) trivial.

Typically, certificate checking proceeds in two phases: An unverified *elaborator* adds additional information to the certificate produced by the SAT solver, and then a formally verified *checker* checks the elaborated certificate against the original formula. This approach moves some complicated and computationally expensive tasks into the unverified elaborator, making checking of the elaborated certificate simpler and less expensive.

However, the elaborator has to recompute information which is, in principle, known to the solver, and elaboration typically takes as long as solving. More recent techniques accelerate elaboration by including this information in the certificate [2]. The most recent development is solvers that directly produce elaborated certificates [29]. This allows for *streaming* the certificates from the solver into the checker: solving and checking are done in parallel, and the potentially large certificates need not be stored on disk. When implemented appropriately, the memory footprint of the checker is similar to that of the solver.

There are different formats for elaborated unsatisfiability certificates, such as PB [4] and GRAT [23]. The de-facto standard is the LRUP format [10], and its backwards compatible generalizations LRAT [9] and LPR [35]. These correspond to the non-elaborated DRUP [17], DRAT [36], and DPR [35] formats. With an exception in 2023, LRUP is sufficient for all top performing solvers in the SAT competitions of the last years [29].
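The following toy Python sketch conveys why elaborated, hint-carrying formats in the LRUP family make checking cheap: each clause addition comes with the exact clauses to propagate, so the checker never searches. It is not the paper's verified algorithm, and for brevity it silently skips (rather than rejects) hints that are neither unit nor conflicting, which a sound checker must not do.

```python
def check_rup(clause_db, new_clause, hints):
    """Check one hint-driven clause addition: assume the negation of the
    new clause, then replay the hinted clauses in order; each should become
    unit (extending the assignment) or falsified (the conflict that
    justifies the new clause). Literals are non-zero ints, DIMACS-style."""
    assignment = {-l for l in new_clause}          # assume not-C
    for cid in hints:
        pending = [l for l in clause_db[cid] if -l not in assignment]
        if not pending:
            return True                            # conflict reached: C checks
        if len(pending) == 1:
            assignment.add(pending[0])             # unit propagation
    return False

# The formula {1,2} {-1,2} {1,-2} {-1,-2} is unsatisfiable:
db = {1: [1, 2], 2: [-1, 2], 3: [1, -2], 4: [-1, -2]}
assert check_rup(db, [2], [1, 2])     # clause (2) follows by unit propagation
db[5] = [2]
assert check_rup(db, [], [5, 3, 4])   # the empty clause: UNSAT certified
```
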

In this paper, we present a formally verified checker that can stream LRUP certificates. We benchmark our tool on the CaDiCaL solver [29], where it only causes a minimal additional computation overhead, and has a memory usage similar to that of the solver. Our checker is as fast as the highly optimized unverified lrat-trim checker [29], and at least one order of magnitude faster than any other verified checker we know of. Using the Isabelle Refinement Framework [22], our checker is verified down to the LLVM intermediate representation [26] of its code, and against a formal grammar of the DIMACS CNF format, which is the standard for representing CNF formulas [32]. To the best of our knowledge, our checker is the first that comes with a verified parser. Our tool and benchmark data are available at https://github.com/lammich/lrat\_isa.

In the rest of this paper, we describe our formal specification (Sect. 2), the abstract certificate checking algorithm (Sect. 3), and its implementation (Sect. 4). We then report on our benchmark results (Sect. 5). Finally, we conclude the paper and discuss related and future work (Sect. 6).

# 2 Specification

We prove soundness of our checker, i.e., it accepts a string only if it is a representation of an unsatisfiable formula in DIMACS CNF format<sup>1</sup>. In this section we present the formalization of this specification.

### 2.1 Conjunctive Normal Form

Throughout this paper, we will use some simplified Isabelle/HOL notation, and explain unusual notations where they first occur. For definitions we use ≡. Data types are written in postfix notation, e.g., *lit set* for a set of literals. Function application is denoted as *f x*₁ *... x*ₙ.

The following is the abstract syntax and semantics of CNF, taken from the GRAT tool [23] and slightly adapted to our needs:

typedef *var* ≡ {*v::nat. v* ≠ *0*}
*lit* ≡ *Pos var* | *Neg var*
*clause* ≡ *lit set*
*cnf* ≡ *clause set*
*valuation* ≡ *var* ⇒ *bool*
*sem lit :: lit* ⇒ *valuation* ⇒ *bool*

<sup>1</sup> Note that proving completeness is less interesting: even if we show that our checker accepts all valid certificates, the elaborator or solver may still fail to produce one. We verify completeness empirically on a large set of benchmarks.

*sem lit* (*Pos v*) σ ≡ σ *v*
*sem lit* (*Neg v*) σ ≡ ¬ σ *v*

*sem cnf :: cnf* ⇒ *valuation* ⇒ *bool*
*sem cnf F* σ ≡ ∀*C*∈*F.* ∃*l*∈*C. sem lit l* σ

*sat :: cnf* ⇒ *bool*
*sat F* ≡ ∃σ*. sem cnf F* σ

A *variable* is a positive natural number, a *literal* is a positive or negative variable, a *clause* is a set of literals, and a *cnf-formula* is a set of clauses. A *valuation* assigns truth values to variables. For a valuation σ, the *semantics* assigns truth values to literals (*sem lit*) and formulas (*sem cnf*): a positive literal is true iff its variable is true, and a negative literal is true iff its variable is false. A formula is true iff every clause contains a true literal, and it is *satisfiable* if there is a valuation for which it is true.
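These definitions can be mirrored executably. The following Python sketch (our illustration; the names and the brute-force enumeration are not part of the Isabelle formalization) represents literals as *Pos*/*Neg* tags and decides satisfiability by trying all valuations:

```python
from itertools import product

# Literals: ('Pos', v) or ('Neg', v); a clause is a frozenset of literals;
# a cnf is a set of clauses; a valuation is a dict from variables to bool.
def sem_lit(lit, sigma):
    kind, v = lit
    return sigma[v] if kind == 'Pos' else not sigma[v]

def sem_cnf(cnf, sigma):
    # A formula is true iff every clause contains a true literal.
    return all(any(sem_lit(l, sigma) for l in clause) for clause in cnf)

def sat(cnf):
    # Brute force over all valuations of the occurring variables.
    vs = sorted({v for clause in cnf for (_, v) in clause})
    return any(sem_cnf(cnf, dict(zip(vs, bits)))
               for bits in product([False, True], repeat=len(vs)))
```

The exponential enumeration only serves to make the semantics concrete; the point of certificates is precisely to avoid re-deciding satisfiability.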

### 2.2 Specification of the DIMACS CNF Format

DIMACS CNF is the de-facto standard format for representing CNF formulas. Figure 1 displays an example: the file can start with optional comment lines, indicated by a leading 'c'. After the comments, there is a header of the form p cnf n m, where n is the maximum variable, and m is the number of clauses. Then the clauses follow, encoded as zero-terminated sequences of integers, where a positive integer represents a positive literal, and a negative integer represents a negative literal.

Fig. 1. Example formula in DIMACS CNF

We need to specify how a word in DIMACS format corresponds to a formula. While a language is a set of words, we use a relation between words and the corresponding abstract syntax. By slight abuse of naming, we call such relations *grammars*. We shallowly embed regular grammars into Isabelle/HOL's logic:

$$\begin{array}{l}
('a, 'r)\ gM \equiv ('a\ \mathit{list} \times 'r)\ \mathit{set}\\
\mathsf{return}\ x \equiv \{([], x)\} \qquad \langle C\rangle \equiv \{\,([c], c) \mid c \in C\,\}\\
\mathsf{bind}\ g\ f \equiv \{\,(w_1 \mathbin{@} w_2, r) \mid \exists x.\ (w_1, x) \in g \land (w_2, r) \in f\ x\,\}
\end{array}$$

Here, (*w, r*) ∈ *g* means that the grammar *g* relates the word *w* to the result *r*. The empty relation {} corresponds to the empty language. The relation return *x* relates the empty word to the result *x*. It corresponds to the language {[]} of only the empty word. The relation ⟨*C*⟩ relates single-character words to the corresponding character from the set *C*. Finally, the relation bind *g f* relates a word *w*₁*w*₂ to a result *r*, if *g* relates *w*₁ to some intermediate result *x*, and *f x* relates *w*₂ to *r*. This corresponds to concatenation of languages.
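This shallow embedding is easy to replay outside Isabelle. A minimal Python sketch (ours; words are modelled as tuples, and all relations are finite sets of pairs):

```python
# Grammars as relations: sets of (word, result) pairs.
def ret(x):
    # return x: relates the empty word to the result x.
    return {((), x)}

def char(cs):
    # <C>: relates each single-character word to that character.
    return {((c,), c) for c in cs}

def bind(g, f):
    # Concatenation: split the word into w1 ++ w2,
    # feed g's intermediate result into f.
    return {(w1 + w2, r) for (w1, x) in g for (w2, r) in f(x)}
```

For example, `bind(char({'a', 'b'}), lambda x: bind(char({'c'}), lambda y: ret(x + y)))` relates the two-character words "ac" and "bc" to the strings `'ac'` and `'bc'`, mirroring language concatenation.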

The type *gM* is a monad, and we use the usual shortcut notation for bind:

*x*←*g; f x* ≡ bind *g* (λ*x. f x*)    *g*₁*; g*₂ ≡ bind *g*₁ (λ*\_. g*₂)

We also define shortcuts to apply a function to the result of a monad, to lift a binary function into a monad, and to concatenate two grammars, ignoring the result of the latter:

$$a \mathbin{\langle\&\rangle} f \equiv x \leftarrow a;\ \mathsf{return}\ (f\,x) \qquad \mathit{lift2}\ f\ a\ b \equiv x \leftarrow a;\ y \leftarrow b;\ \mathsf{return}\ (f\,x\,y) \qquad a \ll b \equiv r \leftarrow a;\ b;\ \mathsf{return}\ r$$

We then define the relational versions of the power function and the Kleene star:

*g pow g 0* ≡ return []    *g pow g* (*n+1*) ≡ *lift2* (*#*) *g* (*g pow g n*)    *g*∗ ≡ ⋃*n::nat. g pow g n*

where *x#xs* prepends the element *x* to the list *xs*. That is, *g pow g n* and *g*∗ relate the input to lists, the elements being the results produced by *g*. We also define *g*? ≡ (*g* ⟨&⟩ *Some*) ∪ (return *None*).

Using the grammar monad, we specify a grammar for the simplified DIMACS format as used by SAT competitions since 2009 [32]. We start with defining sets of ASCII characters:

```
whitespace, digits1, digits :: 8 word set
whitespace ≡ {' ', '\t', '\n', '\v', '\f', '\r'}
digits1 ≡ {'1', ... , '9'} digits ≡ {'0', ... , '9'}
```
Here, *8 word* is the 8-bit word type from Isabelle's machine word library [3,11]. Note that *whitespace* includes all 6 ASCII whitespace characters. Based on this, we define a grammar *g dimacs ::* (*8 word list* × *cnf*) *set*:

```
g ws ≡ ⟨whitespace⟩∗; return ()    g ws1 ≡ ⟨whitespace⟩; g ws

g variable ≡ x←⟨digits1⟩; xs←⟨digits⟩∗; return (nat of str (x#xs))
g literal ≡ (⟨{'-'}⟩; g variable ⟨&⟩ Neg) ∪ (g variable ⟨&⟩ Pos)
g clause ≡ (g literal ≪ g ws1)∗ ⟨&⟩ set ≪ ⟨{'0'}⟩
g cnf ≡ (return {})
      ∪ (c←g clause; cs←(g ws1; g clause)∗; return ({c} ∪ set cs))
```
```
g comment ≡ ⟨{'c'}⟩; ⟨−{'\n'}⟩∗; ⟨{'\n'}⟩; return ()
g p header ≡ ⟨{'p'}⟩; ⟨−{'\n'}⟩∗; ⟨{'\n'}⟩; return ()
g comments ≡ (g ws ∪ g comment)∗; return ()
g dimacs ≡ g comments; g p header?; g ws; g cnf ≪ g ws
```

Here, *nat of str :: 8 word list* ⇒ *nat* converts a string to a natural number, and *set :: a list* ⇒ *a set* yields the set of elements in a list.

Note that we do not check the contents of the header, which contains auxiliary information for parsing, but does not affect the represented formula. We also accept multiple clauses per line and clauses spanning several lines, as well as extra whitespace anywhere in the file. Many SAT solvers support similar relaxations of the format, and we wanted this flexibility in our tool, too.
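A conventional parser accepting such a relaxed format can be sketched in Python (our illustration, even more permissive than *g dimacs*: it tokenizes on whitespace, skips comment and header lines wherever they occur, and reads zero-terminated clauses, so clauses may share a line or span several lines):

```python
def parse_dimacs(text):
    """Parse relaxed DIMACS CNF: skip 'c' comment lines and the 'p' header
    (whose contents are not checked), then read zero-terminated clauses."""
    tokens = []
    for line in text.splitlines():
        s = line.strip()
        if s.startswith('c') or s.startswith('p'):
            continue  # comment or header line
        tokens.extend(s.split())
    cnf, clause = [], []
    for t in tokens:
        n = int(t)
        if n == 0:
            cnf.append(frozenset(clause))  # '0' terminates the clause
            clause = []
        else:
            clause.append(n)  # positive int = positive literal, negative = negated
    if clause:
        raise ValueError("clause not zero-terminated")
    return cnf
```

For instance, `parse_dimacs("c example\np cnf 2 2\n1 2 0\n-1 0\n")` yields the two clauses {1, 2} and {−1}.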

As a sanity check, we prove that our grammar is unambiguous, i.e., that it relates the same word to at most one formula:

(*w, f*₁) ∈ *g dimacs* ∧ (*w, f*₂) ∈ *g dimacs* =⇒ *f*₁ *= f*₂

#### 2.3 Correctness Specification

At this point, we can formalize the postcondition for our checker's specification: ∃F. (w, F) ∈ g dimacs ∧ ¬sat F means that the sequence of bytes w is a valid DIMACS CNF representation of an unsatisfiable formula.

# 3 Certificates for Unsatisfiability

RUP (reverse unit propagation) certificates contain the clauses learned by the solver. The checker justifies that addition of each clause preserves satisfiability. For an unsatisfiable formula, the last learned clause is the empty clause. Adding the empty clause yields an unsatisfiable formula, and, as each clause addition is justified to preserve satisfiability, the original formula is unsatisfiable, too.

Justification is done by *reverse unit propagation* [14]: a clause C can be added to the formula F if the formula F ∧ ¬C is unsatisfiable, and if this can be shown by generating an empty clause via unit propagation. For RUP, the checker has to implement unit propagation itself, for example with a two-watched-literals data structure [28]. LRUP (linear RUP) certificates annotate each clause addition with the relevant unit clauses, in the order in which they become unit, and the final conflict clause. This makes the checker simpler and more efficient, as it only needs to check whether clauses are unit, rather than find unit clauses.

The certificates also contain clauses deleted by the solver. This allows the checker to also delete those clauses from its data structures, freeing up memory. Note that deleting a clause trivially preserves satisfiability.

The actual LRUP format uses natural numbers to identify clauses, rather than spelling them out whenever they are referenced. The n clauses of the initial formula implicitly get the ids [1,...,n]. A clause addition has the form <id> <literal>\* 0 <id>+ 0. It consists of the id under which this clause shall be added, a zero-terminated list of the literals of the clause, and a zero-terminated list of the unit clauses and the conflict clause that justify the addition. A clause deletion has the form <id>+ 0, and consists of a zero-terminated list of the ids of the clauses to be deleted. There is an ASCII encoding and a more compact binary encoding for LRUP certificates.
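For illustration, decoding a single ASCII addition line of the form <id> <literal>\* 0 <id>+ 0 can be sketched in Python (the function name and error handling are our choices, not part of the format):

```python
def parse_addition(line):
    """Decode an LRUP addition '<id> <lit>* 0 <id>+ 0': the new clause id,
    the clause's literals, and the justifying clause ids (the hinted unit
    clauses followed by the conflict clause)."""
    toks = [int(t) for t in line.split()]
    cid, rest = toks[0], toks[1:]
    z = rest.index(0)                 # first zero terminates the literal list
    lits, hints = rest[:z], rest[z + 1:]
    if not hints or hints[-1] != 0:
        raise ValueError("hint list must be zero-terminated")
    return cid, lits, hints[:-1]
```

For example, `parse_addition("5 -1 2 0 1 3 0")` yields clause id 5, literals [−1, 2], and hint ids [1, 3], where 3 is the conflict clause.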

#### 3.1 Abstract Checker

In this section, we present our formalization of the abstract checker algorithm. We start with defining some basic concepts:

− *:: lit* ⇒ *lit*
−*Pos v* ≡ *Neg v*    −*Neg v* ≡ *Pos v*
*pan* ≡ *lit* ⇒ *bool*
*consistent* (*A::pan*) ≡ ∀*l.* ¬ (*A l* ∧ *A* (−*l*))
*sat wrt F A* ≡ ∃σ*. sem cnf F* σ ∧ (∀*l. A l* =⇒ *sem lit l* σ)
*conflict A C* ≡ ∀*l*∈*C. A* (−*l*)
*is uot A C l* ≡ *l*∈*C* ∧ ¬*A*(−*l*) ∧ (∀*l'*∈*C*−{*l*}*. A*(−*l'*))
*taut C* ≡ ∃*l. l*∈*C* ∧ −*l*∈*C*

The literal −l is the negation of the literal l. A *partial assignment* (*pan*) characterizes a set of literals that are assigned (to true). It is *consistent* if it does not assign both a literal and its negation. A formula F is *satisfiable w.r.t.* a partial assignment A (*sat wrt F A*), if A can be extended to a satisfying valuation; A is in *conflict* with a clause C (*conflict A C*), if the negations of all the clause's literals are assigned. The clause C is *unit or true* w.r.t. A and a literal l (*is uot A C l*), if l is the only literal in C whose negation is not assigned. A clause is a *tautology* (*taut C*), if it contains both a literal and its negation.

Correctness of a RUP step adding C to F is implied by the following lemmas:

(1) Let C be a non-tautological clause. Then, the *initial assignment* λ*l.* −*l*∈*C*, which assigns the negated version of each literal in C, is consistent, and F ∧ ¬C is satisfiable iff F is satisfiable w.r.t. the initial assignment:

¬*taut C* =⇒ *consistent* (λ*l.* −*l*∈*C*) ∧ *sat* (*F* ∧ ¬*C*) *= sat wrt F* (λ*l.* −*l*∈*C*)

(2) If the formula contains a unit or true clause, assigning its literal preserves consistency and does not change satisfiability:

*consistent A* ∧ *C*∈*F* ∧ *is uot A C l* =⇒ *consistent* (*A*(*l := True*)) ∧ *sat wrt F A = sat wrt F* (*A*(*l := True*))

(3) If the formula contains a conflict clause, it is unsatisfiable:

*consistent A* ∧ *C*∈*F* ∧ *conflict A C* =⇒ ¬*sat wrt F A*

Note that lemma (1) requires that the learned clause is not a tautology. While adding tautologies trivially preserves satisfiability, they yield an inconsistent initial assignment. Instead of spending computation time on detecting tautologies, we let our checker run with the inconsistent assignment: should it succeed, we add the clause, which is safe.

We formalize the abstract checker as a transition system over the state:

*checker state* ≡ *CNF formula* | *CLS formula clause pan* | *PRF formula clause pan* | *PDN formula clause* | *UNSAT* | *FAIL*

The transition relation → is the least relation that satisfies the following rules:


(*add conflict*) *uC*∈*F* ∧ *conflict A uC* =⇒ *PRF F C A* → *PDN F C*
(*finish proof*) *C*≠{} =⇒ *PDN F C* → *CNF* ({*C*} ∪ *F*)
(*finish proof unsat*) *PDN F* {} → *UNSAT*
(*to fail*) *s* → *FAIL*

The checker starts in state *CNF F*, with some formula F. To delete clauses (*del clauses*), they are removed from F. A clause addition is split into multiple smaller steps: First, we initiate adding a clause by going to state *CLS* (*start clause*). We also maintain a partial assignment, starting with the empty assignment λ*\_. False*. We then add the literals of the clause, one by one (*add lit*). For each added literal l, we assign the negated literal −l. When all literals have been added, we start the proof (*start proof*), going to state *PRF*. During the proof, we add unit clauses, assigning the unit literal (*add unit*). When we have added enough unit clauses, we add a conflict clause (*add conflict*), going to state *PDN* (proof done). From there, we either go to state *UNSAT* if we have proved the empty clause (*finish proof unsat*), or back to state *CNF* with the new clause added to the formula (*finish proof*). We can always go to *FAIL* (*to fail*), indicating that the proof failed.
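Collapsing *start clause*, *add lit*, *add unit*, and *add conflict* into a single function, the check performed for one clause addition can be sketched in Python (our illustration of the abstract transitions, not the verified algorithm; literals are signed integers, and the formula is a dict from clause ids to literal lists):

```python
def check_addition(F, clause, hints):
    """One abstract clause addition: negate the clause (start_clause/add_lit),
    replay the hinted unit-or-true clauses (add_unit), and require that the
    last hint is a conflict clause (add_conflict)."""
    A = {-l for l in clause}          # initial (possibly inconsistent) assignment
    for cid in hints[:-1]:
        C = F[cid]
        nonfalse = [l for l in C if -l not in A]
        if len(nonfalse) != 1:
            return False              # hinted clause is neither unit nor true
        A.add(nonfalse[0])            # assign the unit literal
    conflict = F[hints[-1]]
    return all(-l in A for l in conflict)  # every literal of the conflict is false
```

For example, with F = {1: [1], 2: [−1]}, adding the empty clause with hints [1, 2] succeeds: clause 1 forces literal 1, after which clause 2 is a conflict.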

With the above Lemmas 1–3, some bookkeeping that *add lit* steps construct the correct initial assignment, and a special case for tautologies, we prove:

Theorem 1 (Soundness of Abstract Checker). *If the abstract checker can reach UNSAT from the initial state CNF F, then the formula* F *is unsatisfiable: CNF* F →<sup>∗</sup> *UNSAT* =⇒ ¬*sat* F

Note that we do not yet model clause identifiers on this level. They will be introduced in a later refinement step.

# 4 Implementation

We have specified a grammar to relate strings in DIMACS format to formulas, a semantics to define satisfiability of formulas, and an abstract certificate checker. We now refine these to the actual implementation of a certificate checker.

We use the Isabelle Refinement Framework [24], which supports refinement in multiple steps and in a modular fashion. Each step focuses on a different aspect of the algorithm, thus structuring the correctness proof and making it manageable in the first place. In this section, we first describe the data structures that we use in our implementation to represent abstract concepts such as literals and clauses (cf. Sect. 2.1). We then describe how we implement the abstract checker algorithm (cf. Sect. 3.1), using these data structures. Finally, we describe how we integrate the checker with the parser, to obtain the actual verified tool.

### 4.1 Basic Concepts and Data Structures

We use data structures such as arrays, dynamic arrays, and array lists from Isabelle LLVM's library [22]. For technical reasons, sizes and counters are implemented as non-negative *signed* 64-bit integers, or, equivalently, as unsigned 64-bit integers less than 2<sup>63</sup>. Formally, *refinement relations* between concrete and abstract types are used. For example, *size rel ::* (*64 word* × *nat*) *set* relates non-negative 64-bit signed integers to natural numbers. Similarly, Booleans are implemented by 1-bit words, via the relation *bool1 rel ::* (*1 word* × *bool*) *set*.

Clause identifiers are modelled as 64-bit unsigned integers less than <sup>2</sup><sup>63</sup> <sup>−</sup> <sup>1</sup>, via the relation *cid rel ::* (*64 word* × *nat*) *set*. This bound allows us to use clause identifiers as indexes into an array whose length is represented by a size.

Literals are first refined to natural numbers via *nlit rel ::* (*nat* × *lit*) *set*, where a number n > 1 represents the variable n/2, and the literal is negative iff n is odd. The natural numbers are further refined to unsigned 32-bit integers, via *u32 rel ::* (*32 word* × *nat*) *set*. When we compose the two refinement relations, we get a relation between 32-bit integers and literals: *ulit rel* ≡ *u32 rel O nlit rel*. Using 0 for *None*, we can also refine optional literals to 32-bit integers via the relation *ulito rel ::* (*32 word* × *lit option*) *set*. For each operation on the abstract data type, we define a corresponding operation on the concrete data type. For example, we define:

*nlit neg :: nat* ⇒ *nat*    *nlit neg n* ≡ if *even n* then *n+1* else *n*−*1*
*ulit neg :: 32 word* ⇒ *32 word*    *ulit neg w* ≡ *w XOR 1*

We show that the concrete operations refine their abstract counterparts:

*ulit neg, nlit neg :: u32 rel* → *u32 rel*    *nlit neg,* (−) *:: nlit rel* → *nlit rel*

Here, *f,g :: R*₁ → *R*₂ is a shorthand notation for ∀(*x,y*)∈*R*₁*.* (*f x, g y*) ∈ *R*₂. Combining these refinement theorems yields *ulit neg,* (−) *:: ulit rel* → *ulit rel*.
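As a concrete illustration of this encoding (using DIMACS-style signed integers for abstract literals, an assumption of this sketch):

```python
def nlit_of(lit):
    """Encode the literal +v / -v as 2*v or 2*v + 1: the variable is n // 2,
    and the literal is negative iff n is odd (nlit_rel)."""
    v = abs(lit)
    return 2 * v + (1 if lit < 0 else 0)

def nlit_neg(n):
    # Negation flips the parity bit; on 32-bit words this is `w XOR 1`.
    return n ^ 1
```

For example, the literal 3 is encoded as 6, its negation −3 as 7, and `nlit_neg` is an involution.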

Clauses are implemented as zero-terminated arrays of 32-bit words, via the relation *zcl assn :: 32 word ptr* ⇒ *clause* ⇒ *assn*. As arrays are stored on the heap, this relation is expressed as a separation logic assertion (*assn*). By convention, pure refinement relations have the suffix *rel*, while those that use the heap have the suffix *assn*.

A *clause database cdb* ≡ *nat* ⇒ *clause option* is a partial function from clause identifiers to clauses. It is implemented by a dynamic array of pointers to clauses *cdbi* ≡ *32 word ptr larray*, via *cdb assn :: cdbi* ⇒ *cdb* ⇒ *assn*. The array is indexed by the clause identifier. For clause identifiers not in the database, the array contains a null pointer. Consider the abstract operation *cdb ins cid C db* that inserts a clause C with identifier cid into the database db, its concrete version *cdb ins impl*, and the corresponding refinement theorem:


The concrete operation destructively updates the array, thus the abstract *cdb* parameter no longer corresponds to any concrete value. Also, the ownership of the inserted clause is transferred to the clause database, thus the abstract clause parameter no longer corresponds to any (visible) concrete value. We call such parameters *destroyed*, indicated by a *<sup>d</sup>* in the refinement theorem [21].
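A Python sketch of such a clause database (ours; `None` plays the role of the null pointer for absent identifiers):

```python
class ClauseDB:
    """Clause database: a growable array indexed by clause id."""
    def __init__(self):
        self.arr = []
    def insert(self, cid, clause):
        if cid >= len(self.arr):                       # grow on demand
            self.arr.extend([None] * (cid + 1 - len(self.arr)))
        self.arr[cid] = clause                         # destructive update
    def lookup(self, cid):
        return self.arr[cid] if cid < len(self.arr) else None
    def delete(self, cid):
        if cid < len(self.arr):
            self.arr[cid] = None                       # free the slot
```

The bound on clause identifiers mentioned above ensures that `cid + 1` still fits into a size, so the grow step cannot overflow.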

### 4.2 Data Structures with Capacity Bounds

Several data structures in our checker use counters. For example, during parsing, the literals of a clause are collected in an array list, which uses a counter for its size. We prove non-overflow of these counters from the bounded size of the CNF file, and a limit on how many literals we can read from the certificate before the checker rejects it<sup>2</sup>. While we elide the details, we note that some abstract data structures have a capacity bound field. This is a *ghost field*, i.e., it is not present in the implementation.

The *clause builder* uses a dynamic array to store the literals of the clause that is currently parsed, and also keeps track of the maximum literal encountered so far. Its abstract type is *cbld* ≡ *nat* × *lit list* × *nat*. A clause builder (*ml,ls,bnd*) *:: cbld* consists of the maximum encountered literal *ml*, the current list of literals *ls*, and a (ghost) bound *bnd* that limits the length of *ls*. We define a *data type invariant cbld inv :: cbld* ⇒ *bool* that characterizes valid clause builders (i.e., the bound and maximum literal are consistent with the list of literals). The relation *cbld assn ::* (*32 word* × *32 word array list*) ⇒ *cbld* ⇒ *assn* implements clause builders.

A partial assignment (cf. Sect. 3.1) is implemented by an array of bits indexed by the literals, as well as an array list that contains all set literals. This array list allows for efficiently resetting the assignment in between proof steps. We use the type *rpani :: 1 word larray* × *32 word array list* for the implementation, and *rpan :: bool list* × *nat list* × *nat* for the functional representation, related by *rpan assn :: rpani* ⇒ *rpan* ⇒ *assn*. The last field of *rpan* is a (ghost) capacity bound. The type *rpan* comes with an invariant *rpan inv*, and an abstraction function *rpan* α *:: rpan* ⇒ *pan* to the encoded partial assignment.
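The idea of pairing the bit array with a trail of set literals can be sketched in Python (our illustration; literals are assumed to be already encoded as array indexes):

```python
class PartialAssignment:
    """Bit array indexed by encoded literals, plus a trail of the literals
    that were set, so a reset is linear in the trail, not in the array."""
    def __init__(self, nlits):
        self.bits = [False] * nlits
        self.trail = []
    def assign(self, nlit):
        self.bits[nlit] = True
        self.trail.append(nlit)
    def is_set(self, nlit):
        return self.bits[nlit]
    def reset(self):
        # Clear only the literals that were actually set between proof steps.
        for nlit in self.trail:
            self.bits[nlit] = False
        self.trail.clear()
```

Without the trail, resetting would have to clear the whole bit array after every proof step, which would dominate the running time for short steps.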

### 4.3 Proof Checker Implementation

We implement the abstract checker state (Sect. 3.1) by the following types:


All data structures start with an error flag, which indicates a failed proof (abstract state *FAIL*). Outside a proof, i.e., in abstract states *CNF* and *UNSAT*, the checker state is represented by a tuple (*err,unsat,db,bld,A*) *:: cs op*, where *unsat* indicates that the formula has been proved unsatisfiable and *db* is the clause database holding the formula. The builder state *bld* and assignment *A* are unused here, but threaded through such that they can be reused when the next proof begins. When building a clause (abstract state *CLS*), the state is represented as (*err,A,bld,db*) *:: cs bc*. Finally, inside a proof (abstract states *PRF* and *PDN*), the state is (*err,confl,A,bld,db*) *:: cs ip*. Here, *confl* indicates that a conflict clause has been found.

<sup>2</sup> The size of the formula plus the number of literals in the certificate cannot exceed 2<sup>63</sup>. We don't expect this limit ever to be hit in practice.

We define invariants *cs op inv*, *cs bc inv*, *cs ip inv*; and abstraction functions *cs op* α, *cs bc* α, *cs ip* α to the abstract checker state. We then show that the functions on the concrete states preserve the invariants and implement the transition relation →<sup>∗</sup> on the corresponding abstract states. For example, the following function handles a proof step, adding a unit or a conflict clause:

$$\begin{array}{l}
\mathit{cs\ prf\ step} :: \mathit{nat} \Rightarrow \mathit{cs\ ip} \Rightarrow \mathit{cs\ ip}\ \mathit{nres}\\
\mathit{cs\ ip\ inv}\ \mathit{cap}\ \mathit{cs} \land \mathit{cap} > 0\\
\quad \implies \mathit{cs\ prf\ step}\ \mathit{cid}\ \mathit{cs} \le \mathsf{spec}\ \mathit{cs'}.\ \mathit{cs\ ip\ inv}\ (\mathit{cap}-1)\ \mathit{cs'} \land (\mathit{cs\ ip}\ \alpha\ \mathit{cs}) \rightarrow (\mathit{cs\ ip}\ \alpha\ \mathit{cs'})
\end{array}$$

Here, *'a nres* is the Isabelle Refinement Framework's type of programs that return a result of type *'a*, and *P* =⇒ *c* ≤ spec *r. Q r* is a Hoare triple with precondition *P*, program *c*, and postcondition *Q* [24]. That is, if the concrete checker state *cs* has some capacity left, then the *cs prf step* function preserves the invariant *cs ip inv* and implements the abstract transition relation →. The available capacity of the checker state decreases by one.

Fig. 2. Function to check if a clause is unit, true, or conflict.

The implementation of *cs prf step* uses a function to check if a clause is unit, true, or a conflict. It is displayed in Fig. 2. It first checks (l. 3) if the clause identifier is valid, and looks it up in the database (l. 4). Then (l. 6), it loops over the literals of the clause, maintaining a state consisting of an optional literal and an error flag *(ul,err)*. Initially (l. 11), the state is *(None,False)*. The loop

assigns to *ul* the first literal that is not false (l. 9). If a second non-false literal is encountered, the error flag is set (l. 10). The function returns the state after the loop, or *(None,True)* if the clause was invalid (l. 12). Note that we *assume* (l. 5) a finite clause. On the abstract level, we can use this to justify termination of the loop. When implementing the function, we have to prove finiteness, which is trivial, as the clause is stored in an array. Dually, we *assert* (l. 7) that the literals of the clause are in bounds of the assignment. This has to be proved on the abstract level. When implementing, we can use it to justify that the array access for looking up the literal is in bounds. This way, assertions and assumptions are used to pass proof obligations up and down the refinement chain, proving them at the most convenient abstraction level.

The loop in *check uot* is the innermost loop of the checker, and special care has been taken to optimize it: while an actual certificate always contains unit clauses, we also allow clauses with one true literal (cf. *is uot* in Sect. 3.1). This avoids indexing both A(l) and A(−l) to distinguish between the two cases.

The correctness theorem for *check uot* is as follows:

*rpan inv A* ∧ *cdb cid = Some C* ∧ *cdb vars cdb* ⊆ *rpan dom A* =⇒ *check uot cdb cid A* <sup>≤</sup> spec (*ul,err*)*.* <sup>¬</sup> *err* −→ case *ul* of *Some l* ⇒ *is uot* (*rpan* α *A*) *C l* | *None* ⇒ *conflict* (*rpan* α *A*) *C*

I.e., if the partial assignment satisfies its invariant, the clause identifier identifies clause C, and the clause database contains only variables within the bounds of the partial assignment, then the function will either return an error, or some literal l such that C is unit or true w.r.t. l, or *None* and C is a conflict clause.

### 4.4 A Verified DIMACS Parser

We present the parsing function's signature and correctness theorem. Its implementation is elided due to page limit constraints:

*read dimacs cs :: 8 word list* ⇒ (*cs op* × *nat*) *nres read dimacs cs str* <sup>≤</sup> spec (*cs,cap*)*.* <sup>∃</sup> *F. cs op inv cap cs* ∧ (*cs op* α *cs = FAIL* ∨ (*str,F*) ∈ *g dimacs* ∧ *cs op* α *cs = CNF F*)

This function parses a string, and returns a checker state. On a parsing error, the checker state corresponds to the abstract state *FAIL*. Otherwise, it corresponds to *CNF F* for the formula F parsed from the string. The function also returns the capacity left for the certificate after parsing the formula.

#### 4.5 Assembling the Whole Program

Having implemented functions for the proof steps, we combine them with a parser (details elided) for LRAT proofs, resulting in a function that reads an LRAT proof from a buffered reader (*brd rs*), performs the corresponding transitions on the proof state, and finally checks if the proof state has reached *UNSAT*:

*main checker loop :: cs op* ⇒ *brd rs* ⇒ (*bool* × *brd rs*) *nres*

The certificate checker, displayed in Fig. 3, combines the main checker loop with the DIMACS parser. It takes a string *cnf*, parses it as a formula (l. 3), initializes a buffered reader for the certificate stream (l. 5), and runs the main checker loop with that reader (l. 6).

Fig. 3. The checker program.

From the correctness of the parser (Sect. 4.4), the fact that all proof steps in *main checker loop* implement the abstract checker, and the fact that the abstract checker is sound (Theorem 1), we prove:

Theorem 2 (Soundness of Functional Checker). *If read check lrat cnf returns true, then cnf is a valid representation of an unsatisfiable formula:*

*read check lrat cnf* <sup>≤</sup> spec *r. r* <sup>=</sup>⇒ ∃*F.* (*cnf, F*) <sup>∈</sup> *<sup>g</sup> dimacs* ∧ ¬ *sat F*

## 4.6 Refinement to LLVM Code

In Sect. 4.1 and Sect. 4.2 we have indicated how we implement the basic data structures of our checker. Then, we have mostly presented functional code. Given implementations of the data structures, refining this functional code to imperative code is mostly straightforward. Actually, much of this process can be automated by the Sepref tool [22], which we use to generate implementations for each data structure and algorithm. For example, for the function *check uot* (cf. Sect. 4.3):

sepref def *check uot impl* is *check uot :: rpan assn* → *cdb assn* → *cid rel* → *ulito rel* × *bool1 rel* unfolding *check uot def* by *sepref*

This generates the function *check uot impl* and proves the refinement theorem:

*check uot impl, check uot :: rpan assn* → *cdb assn* → *cid rel* → *ulito rel* × *bool1 rel*

To read the certificate, we use an external C function based on *fread*:

```
size_t fread_from_certificate(void *p, size_t n) {
  if (!cert_file) return 0;
  return fread(p, 1, n, cert_file);
}
```

Inside Isabelle, this function is specified by:

*htriple* (*arr assn xsi xs* ∗ *size rel ni n* ∗ *n* ≤ *length xs*)
  (*fread from certificate xsi ni*)
  (λ*ri.* ∃*r ys. size rel ri r* ∗ *arr assn xsi ys* ∗ *r*≤*n* ∗ *length ys = length xs* ∗ *drop r ys = drop r xs*)

Here, *htriple* is the Hoare triple for LLVM programs and ∗ is the separating conjunction. This matches the specification of POSIX's fread function [30], except that we do not specify what data is read. This is sound, as it is a valid over-approximation of the actual behaviour.

### 4.7 Soundness Theorem

Finally, we generate an implementation of *read check lrat* (Sect. 4.5), obtaining:

*read check lrat impl :: 8 word ptr* × *64 word* ⇒ *1 word llM*
*read check lrat impl, read check lrat :: inp assn* → *bool1 rel*

Here, *inp assn* implements the input string by an array and its length. In order to smoothly interface this function from C/C++, we eliminate the tuple type and return a byte instead of a bit. We define:

*lrat checker :: 8 word ptr* ⇒ *64 word* ⇒ *8 word llM*
*lrat checker p n* ≡ if *read check lrat impl* (*p,n*) then return *1* else return *0*

Isabelle LLVM's code generator creates LLVM code, and a matching header file:

export llvm *lrat checker* is *uint8 t lrat checker*(*uint8 t* ∗*, int64 t*) file *../code/lrat checker export.*{*ll,h*}

We link this with a small C program that reads the command line, memory-maps the formula file, provides the function *fread from certificate* (cf. Sect. 4.6), calls the verified checker, and prints the result.

Chaining together the correctness of the functional checker (Theorem 2) and the refinement theorem for *read check lrat*, and unfolding some definitions yields:

Theorem 3 (Soundness of Implementation). *When we pass the checker a pointer cp to an array of size cszi holding the bytes c, then the checker will terminate with the array being unchanged, and if the result is not zero, the bytes* c *in the array are a syntactically correct encoding of an unsatisfiable CNF:*

*htriple* (*arr assn cp c* ∗ *size rel cszi csz* ∗ *csz=length c*) (*lrat checker cp cszi*) (λ*r. arr assn cp c* ∗ (*r*≠*0* =⇒ (∃*F.* (*c,F*)∈*g dimacs* ∧ ¬*sat F*)))

Note that this theorem does not depend on any complex data structures or refinements. Apart from the basic notions of Hoare triples, separation logic, machine words, and pointers to arrays, it only depends on our semantics of formulas (Sect. 2.1), and our grammar for the DIMACS format (Sect. 2.2).

# 5 Benchmarks

For our benchmarks, we have used the latest stable versions of the tools available at the time of writing: CaDiCaL 1.9.4 [7], lrat-trim 0.2.0 [27], cake lpr 7a207e9 [8], gratchk dc6dd9d [15], lrat-check 9ee016c [12], and lrat-acl2 (incremental) 8.5 [1] on gcl 2.6.13pre [13]. We used an AMD Ryzen 9 7950X3D machine with 128 GB DDR5 RAM and a 2.0 TB Samsung 990 Pro SSD disk.

We have used problems from the 2022 SAT competition<sup>3</sup> [33]: out of the 156 problems proved unsatisfiable in the main track, CaDiCaL timed out on 5 after 5000 s. The remaining 151 problems form our benchmark set.

<sup>3</sup> We did not choose the 2023 competition, because the problems there are biased towards checkers that use techniques not available for direct LRAT generation in CaDiCaL.

First, we let the checker run in parallel to CaDiCaL, streaming the certificate directly into the checker. We used our checker and cake lpr<sup>4</sup>.

Table 1. Benchmark results in streaming mode. The table displays the averages over the successfully certified problems (*n*).

We measure the computing times (the sum of user and system time) that were allocated to the SAT solver (t*s*) and checker (t*c*). The ratio t*c*/t*s* indicates how the work is distributed between solver and checker: the smaller this ratio, the less time the checker needs in comparison to the solver. Next, we measure the average CPU loads allocated to the solver (l*s*) and checker (l*c*). A solver load of less than 100% indicates that the solver was slowed down. The less load the checker produces, the less additional computing power is needed for checking. We also measure the peak memory consumption (maximum resident set size) of the solver (m*s*) and checker (m*c*). The ratio m*c*/m*s* indicates the additional memory required for checking. Finally, we measure the wall-clock time until certification finishes (w), and compare it to the time required by the solver to solve the problem and write the certificate to a file (w*f*), and to the solving time without producing a certificate at all (w*b*). The ratios w/w*f* and w/w*b* indicate the observed extra time required for certification.

The results are displayed in Table 1: our checker verified all problems, adding about 6% more computation time and 80% more memory on top of solving and certificate production. It does not significantly slow down the solver, which runs at 97% CPU load. Compared to writing the certificate to a file, streaming it directly to the checker is 2% slower, and the overhead added by the whole certification process is 10%.

The cake lpr checker failed to certify 13 problems<sup>5</sup>. For the remaining problems, it added 61% of computation time, and the solver only ran at 85% load. Streaming the certificate to cake lpr is 30% slower than writing the certificate to a file, and 43% slower than solving without producing a certificate. Moreover, for each cake lpr run, maximum heap and stack sizes have to be determined upfront, and cake lpr is likely to use all available heap<sup>6</sup>.
Without prior knowledge of the problem, it is impossible to guess good sizes. For our experiments, we used 8 GiB stack and 16 GiB heap, based on the maximum of 11 GiB that our tool needed. With this, cake lpr ran out of memory for six problems, and maxed out at around 16 GiB memory usage for most of the remaining problems (131/138). On average, it needed 162 times more memory than the solver.

<sup>4</sup> We did not include a Coq-based LRAT checker [9], nor an ACL2-based one [16]: the former is reportedly less efficient than cake lpr [35], and the latter supports, to the best of our knowledge, no streaming of the certificate.

<sup>5</sup> 6 memouts, 6 parsing errors, most likely due to benchmarks incompatible with CakeLPR's strict interpretation of the DIMACS CNF format, and one timeout at 5000 s.

<sup>6</sup> We assume that the garbage collector only becomes active when available memory has filled up.


Table 2. Benchmark results in file mode.

To measure the performance of just the checker, we ran it on certificates stored in files. For this experiment we also included the gratchk tool, which is reported to be faster than cake lpr [35], the lrat-acl2 tool, and the unverified checker implementations lrat-trim (forward) and lrat-check, to compare our verified tool against unverified but highly optimized implementations.

For the garbage-collected tools (gratchk, cake lpr, lrat-acl2), we set a heap limit of 16 GiB. Where possible, we used the binary LRAT encoding (our tool and lrat-trim), and did not include the conversion time from LRAT to GRAT (gratchk). Using our tool as baseline (100%), we display the ratio of the total computation times over all problems (t<sub>tot</sub>), and the average ratios of computation time and peak memory usage per problem (t<sub>avg</sub> and m<sub>avg</sub>). The results are displayed in Table 2: our tool is slightly faster but uses slightly more memory than lrat-trim. It is significantly faster and uses less memory than any other verified or unverified tool we tested. After 14:30 h, lrat-acl2 had processed 66 problems and succeeded on 57. The same problems took 3:25 min to check with our tool. We aborted the experiment at that point, as, by extrapolation, it would have taken 5 more days to complete.

# 6 Conclusion

We have used the Isabelle LLVM framework to formally verify the soundness of an unsatisfiability certificate checker. Our checker is verified w.r.t. a grammar of the DIMACS format and a semantics of CNF, down to the LLVM code that implements the checker. Completeness of the checker has been empirically validated by showing that it accepts a large set of benchmarks. Our checker accepts the LRUP fragment of the LRAT format, which makes it suitable for checking certificates from many top-performing SAT solvers. For solvers that support streaming of LRAT certificates, our tool can run in parallel with the solver, eliminating the need to store the potentially large certificate and delivering the certification result the moment the solver finishes. For CaDiCaL, this is only 10% slower than running just the solver, and 2% slower than writing the certificate to a file without checking it. Our implementation is slightly faster and uses only 4% more memory than the unverified and highly optimized lrat-trim checker. It is significantly faster and more memory-efficient than any other LRAT checker we know of, verified or unverified. This makes it feasible to routinely run the checker alongside the solver, increasing confidence in the result at low cost.

To design our checker, we first implemented and profiled prototypes in C++ to determine the important optimizations. This took roughly 40 person-hours. We then used the Isabelle Refinement Framework to produce a verified version of the checker in a top-down refinement process, guided by the experience from the unverified prototypes. This took another 200 hours.

### 6.1 Related Work

The closest work to ours is the verified cake lpr checker [34,35]. It supports streaming certificates<sup>7</sup> and the full LPR format. The cake lpr checker is verified down to assembly code (with a thin C wrapper around it), while our checker is verified down to LLVM intermediate code. While verifying an LLVM compiler is orthogonal to this project, we would immediately profit from such a verified compiler, further reducing our trusted code base. Moreover, our checker is verified w.r.t. a grammar of DIMACS CNF, while cake-lpr's parser is not verified. It only comes with a sanity check, showing that the parser is left inverse to a pretty printer. Our checker is significantly faster than cake lpr, and only allocates as much memory as needed, while cake-lpr's memory size has to be set upfront, making it uncontrollable without background information about the problem. In particular in streaming mode, such information is not available. Finally, cake lpr uses the ASCII encoding of LRAT, while our checker uses the more compact binary encoding.<sup>8</sup>

There are other verified certificate checkers [5,9,16,23], which, however, do not support streaming or are significantly slower than cake lpr.

### 6.2 Future Work

There are no obstacles in principle to extending our tool to the more powerful LRAT and LPR formats. We leave this to future work, as we are not aware of any solver that supports streaming these formats.

While our parser was manually implemented and then verified, there is work on verified parser generators [6,18–20,25,31]. We leave it to future work to integrate similar techniques into the Isabelle LLVM workflow.

While faster than parsing the ASCII encoding, decompressing the binary encoding is a hot spot in our checker. In streaming mode, we could probably use a less compact but faster-to-read format, which we leave to future work.
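For context, binary LRAT compresses literals with the variable-byte scheme of the binary DRAT format: a literal l is mapped to the unsigned number 2|l| + (1 if l is negative), which is split into 7-bit groups, least significant first, with the high bit marking continuation. A minimal Python sketch of this codec (our illustration of the standard scheme; the function names are ours, and this is not the paper's verified implementation):

```python
def encode_lit(lit: int) -> bytes:
    """Map a literal l to the unsigned value 2*|l| + (1 if l < 0 else 0),
    then emit 7-bit groups, least significant first; a set high bit
    means more bytes follow."""
    u = 2 * abs(lit) + (1 if lit < 0 else 0)
    out = bytearray()
    while True:
        group, u = u & 0x7F, u >> 7
        if u:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)

def decode_lits(data: bytes) -> list[int]:
    """Decode a concatenation of encoded literals back to signed form."""
    lits, u, shift = [], 0, 0
    for byte in data:
        u |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # last group of this literal
            lits.append(-(u >> 1) if u & 1 else u >> 1)
            u, shift = 0, 0
    return lits
```

Unpacking these bit groups byte by byte is precisely the kind of decompression work that shows up as a hot spot.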

# References


<sup>7</sup> Surprisingly, we have not found reports on using cake lpr in streaming mode. In particular, Pollitt et al. [29] did not consider this possibility when they extended CaDiCaL to directly produce LRUP certificates.

<sup>8</sup> Conversion between the encodings is easy, and we leave native support of the ASCII encoding in our checker to future work.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Generalized Optimization Modulo Theories

Nestan Tsiskaridze<sup>1(B)</sup>, Clark Barrett<sup>1</sup>, and Cesare Tinelli<sup>2</sup>

<sup>1</sup> Stanford University, Stanford, CA, USA
{nestan,barrett}@cs.stanford.edu
<sup>2</sup> The University of Iowa, Iowa City, IA, USA
cesare-tinelli@uiowa.edu

Abstract. Optimization Modulo Theories (OMT) has emerged as an important extension of the highly successful Satisfiability Modulo Theories (SMT) paradigm. The OMT problem requires solving an SMT problem with the restriction that the solution must be optimal with respect to a given objective function. We introduce a generalization of the OMT problem where, in particular, objective functions can range over partially ordered sets. We provide a formalization of and an abstract calculus for the Generalized OMT problem and prove their key correctness properties. Generalized OMT extends previous work on OMT in several ways. First, in contrast to many current OMT solvers, our calculus is theory-agnostic, enabling the optimization of queries over any theories or combinations thereof. Second, our formalization unifies both single- and multi-objective optimization problems, allowing us to study them both in a single framework and facilitating the use of objective functions that are not supported by existing OMT approaches. Finally, our calculus is sufficiently general to fully capture a wide variety of current OMT approaches (each of which can be realized as a specific strategy for rule application in the calculus) and to support the exploration of new search strategies. Much like the original abstract DPLL(T) calculus for SMT, our Generalized OMT calculus is designed to establish a theoretical foundation for understanding and research and to serve as a framework for studying variations of and extensions to existing OMT methodologies.

Keywords: Optimization Modulo Theories (OMT) · Optimization · Satisfiability Modulo Theories (SMT) · Abstract Calculus

# 1 Introduction

Over the past decade, the field of Optimization Modulo Theories (OMT) has emerged, inspiring the interest of researchers and practitioners alike. OMT builds on the highly successful Satisfiability Modulo Theories (SMT) [3] paradigm and extends it: while the latter focuses solely on finding a theory model for a first-order formula, the former adds an objective term that must be optimized with respect to some total ordering over the term's domain.

The development of OMT solvers has fostered research across an expanding spectrum of applications, including scheduling and planning with resources [7, 13,17,20,26,30,35,38,48,58], formal verification and model checking [37,49], program analysis [10,23,25,28,69], requirements engineering and specification synthesis [21,41–43], security analysis [4,18,46,61], system design and configuration [14,15,29,34,47,51,63,68], machine learning [59,62], and quantum annealing [5].

Various OMT procedures have been developed for different types of optimization objectives (e.g., single- and multi-objective problems), underlying theories (e.g., arithmetic and bitvectors), and search strategies (e.g., linear and binary search). We provide an overview of established OMT techniques in Sect. 5. An extensive survey can be found in Trentin [64].

We introduce a proper generalization of the OMT problem and an abstract calculus for this generalization whose main goal is similar to that of the DPLL(T) calculus for SMT [45]: to provide both a foundation for theoretical understanding and research and a blueprint for practical implementations. Our approach is general in several ways. First, in contrast to previous work in OMT, it is parameterized by the optimization order, which does not need to be total, and it is not specific to any theory or optimization technique, making the calculus easily applicable to new theories or objective functions. Second, it encompasses both single- and multi-objective optimization problems, allowing us to study them in a single, unified framework and enabling combinations of objectives not covered in previous work. Third, it captures a wide variety of current OMT approaches, which can be realized as instances of the calculus together with specific strategies for rule application. Finally, it provides a framework for the exploration of new optimization strategies.

Contributions. To summarize, our contributions include:


The rest of the paper is organized as follows. Section 2 introduces background and notation. Section 3 defines the Generalized OMT problem. Section 4 presents the calculus, provides an illustrative example of its use and addresses its correctness<sup>1</sup>. Finally, Sect. 5 discusses related work, and Sect. 6 concludes.

# 2 Background

We assume the standard many-sorted first-order logic setting for SMT, with the usual notions of signature, term, formula, and interpretation. We write I |= φ

<sup>1</sup> Full proofs and an additional example are provided in the longer version of this paper [67].


Table 1. Theory-specific notation.

to mean that formula φ holds in or is *satisfied* by an interpretation I. A *theory* is a pair T = (Σ, **I**), where Σ is a signature and **I** is a class of Σ-interpretations. We call the elements of **I** T*-interpretations*. We write Γ |=<sub>T</sub> φ, where Γ is a formula (or a set of formulas), to mean that Γ T*-entails* φ, i.e., every T-interpretation that satisfies (each formula in) Γ satisfies φ as well. For convenience, for the rest of the paper, *we fix a background theory* T *with equality and with signature* Σ. We also fix an infinite set X of sorted variables with sorts from Σ and assume ≺<sub>X</sub> is some total order on X. We assume that all terms and formulas are Σ-terms and Σ-formulas with free variables from X. Since the theory T is fixed, we will often abbreviate |=<sub>T</sub> as |= and consider only interpretations that are T-interpretations assigning a value to every variable in X. At various places in the paper, we use sorts and operators from standard SMT-LIB theories such as integers, bitvectors, strings,<sup>2</sup> or data types [2]. We assume that every T-interpretation interprets them in the same (standard) way. Table 1 lists the theory symbols used in this paper and their meanings. A Σ-formula φ is *satisfiable* (resp., *unsatisfiable*) *in* T if it is satisfied by some (resp., no) T-interpretation.

Let s be a Σ-term. We denote by s<sup>I</sup> the value of s in an interpretation I, defined as usual by recursively determining the values of sub-terms. We denote by *FV*(s) the set of all variables occurring in s. Similarly, we write *FV*(φ) to denote the set of all the free variables occurring in a formula φ. If *FV*(φ) = {v<sub>1</sub>,...,v<sub>n</sub>}, where v<sub>i</sub> ≺<sub>X</sub> v<sub>i+1</sub> for each i ∈ [1, n), then the relation *defined by* φ (in T) is {(v<sub>1</sub><sup>I</sup>,...,v<sub>n</sub><sup>I</sup>) | I |= φ for some T-interpretation I}. A relation is *definable in* T if there is some formula that defines it. Let *v* be a tuple of variables (v<sub>1</sub>,...,v<sub>n</sub>),

<sup>2</sup> For simplicity, we assume strings are over characters ranging only from 'a' to 'z'.

and let *t* = (t<sub>1</sub>,...,t<sub>n</sub>) be a tuple of Σ-terms such that t<sub>i</sub> and v<sub>i</sub> are of the same sort for i ∈ [1, n]; then, we denote by s[*v* ← *t*] the term obtained from s by simultaneously replacing each occurrence of the variable v<sub>i</sub> in s with the term t<sub>i</sub>.

If S is a finite *sequence* (s<sub>1</sub>,...,s<sub>n</sub>), we write Top(S) to denote s<sub>1</sub>, the first element of S, and Pop(S) to denote the subsequence (s<sub>2</sub>,...,s<sub>n</sub>) of S. We use ∅ to denote both the empty set and the empty sequence. We write s ∈ S to mean that s occurs in the sequence S, and write S ◦ S′ for the sequence obtained by appending S′ at the end of S.

We adopt the standard notion of *strict partial order* ≺ on a set A, that is, a relation in A × A that is irreflexive, asymmetric, and transitive. The relation ≺ is a *strict total order* if, in addition, a<sub>1</sub> ≺ a<sub>2</sub> or a<sub>2</sub> ≺ a<sub>1</sub> for every pair a<sub>1</sub>, a<sub>2</sub> of distinct elements of A. As usual, we will call ≺ *well-founded* over a subset A′ of A if A′ contains no infinite descending chains. An element m ∈ A is *minimal (with respect to* ≺*)* if there is no a ∈ A such that a ≺ m. If A has a unique minimal element, it is called a *minimum*.
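To illustrate the difference between minimal elements and a minimum (our own toy example, not from the paper), the following Python sketch computes the minimal elements of a finite set under a strict order given as a predicate:

```python
def minimal(elements, lt):
    """Return the elements m with no a such that lt(a, m), i.e., the
    minimal elements w.r.t. the strict order lt; a unique minimal
    element is the minimum."""
    return [m for m in elements if not any(lt(a, m) for a in elements)]

# Strict divisibility order on {2, 3, 4, 12}: a below b iff a properly divides b.
def divides(a, b):
    return a != b and b % a == 0
```

Under divisibility, both 2 and 3 are minimal but neither is a minimum; under the usual total order <, the unique minimal element 2 is the minimum.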

# 3 Generalized Optimization Modulo Theories

We introduce a formalization of the *Generalized Optimization Modulo Theories* problem which unifies single- and multi-objective optimization problems and lays the groundwork for the calculus presented in Sect. 4.

### 3.1 Formalization

For the rest of the paper, we fix a theory T with some signature Σ.

Definition 1 (Generalized Optimization Modulo Theories (GOMT)). *A Generalized Optimization Modulo Theories problem is a tuple* GO := ⟨t, ≺, φ⟩*, where:*


For any GOMT problem GO and T-interpretations I and I′, we say that:


Informally, the term t represents the *objective function*, whose value we want to optimize. The order ≺ is used to compare values of t, with a value a being considered *better* than a value a′ if a ≺ a′. Finally, the formula φ imposes constraints on the values that t can take. It is easy to see that the value t<sup>I</sup> assigned by a GO-solution I is always minimal. As a special case, if ≺ is a total order, then t<sup>I</sup> is also unique (i.e., it is a minimum). Once we have fixed a GOMT problem GO, we will informally refer to a GO-consistent interpretation as a *solution (of* φ*)* and to a GO-solution as an *optimal solution*.

Our notion of Generalized OMT is closely related to one by Bigarella et al. [6], which defines a notion of OMT for a generic background theory using a predicate that corresponds to a total order in that theory. Definition 1 generalizes this in two ways. First, we allow partial orders, with total orders being a special case. One useful application of this generalization is the ability to model multiobjective problems as single-objective problems over a suitable partial order, as we explain below. Second, we do not restrict ≺ to correspond to a predicate symbol in the theory. Instead, any partial order *definable* in the theory can be used. This general framework captures a large class of optimization problems.

*Example 1.* Suppose T is the theory of real arithmetic with the usual signature. Let GO := ⟨x + y, ≺, 0 < x ∧ x · y = 1⟩, where x and y are variables of sort Real and ≺ is defined by the formula v<sub>1</sub> <<sub>Real</sub> v<sub>2</sub> (where v<sub>1</sub> ≺<sub>X</sub> v<sub>2</sub>). A GO-solution is any interpretation that interprets x and y as 1.

*Example 2.* With T now being the theory of integer arithmetic, let GO = ⟨x, ≺, x<sup>2</sup> < 20⟩, where x is of sort Int, and ≺ is defined by v<sub>1</sub> ><sub>Int</sub> v<sub>2</sub> (where v<sub>1</sub> ≺<sub>X</sub> v<sub>2</sub>). A GO-solution must interpret x as the maximum integer satisfying x<sup>2</sup> < 20 (i.e., x must have value 4).

The examples above are both instances of what previous work refers to as single-objective optimization problems [64], with the first example being a minimization and the second a maximization problem. The next example illustrates a less conventional ordering.

Note that from now on, to keep the exposition simple, we define partial orders ≺ appearing in GO problems only *semantically*, i.e., formally, but without giving a specific defining formula. However, it is easy to check that all orders used in this paper are, in fact, definable in a suitable T .

*Example 3.* Let GO = ⟨x, ≺, x<sup>2</sup> < 20⟩ be a variation of Example 2, where now, for any integers a and b, a ≺ b iff |b| <<sub>Int</sub> |a|. A GO-solution can interpret x either as 4 or −4. Neither solution dominates the other since their absolute values are equal.
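Since the relevant candidate values in Examples 2 and 3 lie in a finite range, both examples can be checked by brute force. The following Python sketch (our illustration; `go_solutions` is a hypothetical helper) returns the feasible points whose objective value is minimal with respect to the given strict order:

```python
def go_solutions(domain, t, lt, phi):
    """All x in domain with phi(x) whose objective value t(x) is minimal
    w.r.t. the strict order lt over the feasible objective values."""
    feasible = [x for x in domain if phi(x)]
    values = [t(x) for x in feasible]
    return [x for x in feasible if not any(lt(v, t(x)) for v in values)]

domain = range(-10, 11)
phi = lambda x: x * x < 20
# Example 2: a "better than" b iff a >_Int b (maximization); unique solution 4.
ex2 = go_solutions(domain, lambda x: x, lambda a, b: a > b, phi)
# Example 3: a "better than" b iff |b| <_Int |a|; both 4 and -4 are minimal.
ex3 = go_solutions(domain, lambda x: x, lambda a, b: abs(b) < abs(a), phi)
```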

We next show how multi-objective problems are also instances of Definition 1.

### 3.2 Multi-objective Optimization

We use the term *multi-objective optimization* to refer to an optimization problem consisting of several sub-problems, each of which is also an optimization problem. A multi-objective optimization may also require specific interrelations among its sub-problems. In this section, we define several varieties of multi-objective optimization problems and show how each can be realized using Definition 1. For each, we also state a correctness proposition which follows straightforwardly from the definitions.

In the following, given a strict ordering ≺, we will denote its reflexive closure by ⪯. We start with a multi-objective optimization problem which requires that the sub-problems be prioritized in lexicographic order [8,9,53,56,64].

Definition 2 (Lexicographic Optimization (LO)). *A lexicographic optimization problem is a sequence of GOMT problems* LO = (GO<sub>1</sub>,..., GO<sub>n</sub>)*, where* GO<sub>i</sub> := ⟨t<sub>i</sub>, ≺<sub>i</sub>, φ<sub>i</sub>⟩ *for* i ∈ [1, n]*. For* T*-interpretations* I *and* I′*, we say that:*


An LO problem can be solved by converting it into an instance of Definition 1.

Definition 3 (GO<sub>LO</sub>). *Given an* LO *problem* (GO<sub>1</sub>,..., GO<sub>n</sub>)*, with* GO<sub>i</sub> := ⟨t<sub>i</sub>, ≺<sub>i</sub>, φ<sub>i</sub>⟩ *for* i ∈ [1, n]*, the corresponding* GO *instance is defined as* GO<sub>LO</sub>(GO<sub>1</sub>,..., GO<sub>n</sub>) := ⟨t, ≺<sub>LO</sub>, φ⟩*, where:*

*–* t = *tup*(t<sub>1</sub>,...,t<sub>n</sub>); φ = φ<sub>1</sub> ∧ ··· ∧ φ<sub>n</sub>;
*– if* t *is of sort* σ*, then* ≺<sub>LO</sub> *is the lexicographic extension of* (≺<sub>1</sub>,...,≺<sub>n</sub>) *to* σ<sup>T</sup>*: for* (a<sub>1</sub>,...,a<sub>n</sub>), (b<sub>1</sub>,...,b<sub>n</sub>) ∈ σ<sup>T</sup>*,* (a<sub>1</sub>,...,a<sub>n</sub>) ≺<sub>LO</sub> (b<sub>1</sub>,...,b<sub>n</sub>) *iff for some* j ∈ [1, n]*: (i)* a<sub>i</sub> = b<sub>i</sub> *for all* i ∈ [1, j)*; and (ii)* a<sub>j</sub> ≺<sub>j</sub> b<sub>j</sub>*.*

Here and in other definitions below, we use the data type theory constructor *tup* to construct the objective term t. This is a convenient mechanism for keeping an ordered list of the sub-objectives and keeps the overall theoretical framework simple. In practice, if using a solver that does not support tuples or the theory of data types, other implementation mechanisms could be used. Note that if each sub-problem uses a total order, then ≺LO will also be total.
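The lexicographic extension in Definition 3 can be sketched in a few lines of Python (our illustration; each component order ≺<sub>i</sub> is passed as a strict-comparison predicate):

```python
def lex_lt(orders):
    """Lexicographic extension of the component strict orders: a <_LO b
    iff the components agree up to some position j where a_j is strictly
    below b_j in the j-th order."""
    def lt(a, b):
        for lt_i, ai, bi in zip(orders, a, b):
            if lt_i(ai, bi):
                return True   # strictly better in this component
            if ai != bi:
                return False  # strictly worse: later components are ignored
        return False          # all components equal
    return lt

# A two-component order: minimize the first component, maximize the second.
lt = lex_lt([lambda a, b: a < b, lambda a, b: a > b])
```

Reading two-bit bitvectors as integers, this order gives (1, 3) <<sub>LO</sub> (1, 2) <<sub>LO</sub> (3, 2), i.e., later components are only consulted on ties in earlier ones.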

Proposition 1. *Let* I *be a* GOLO*-solution. Then* I *is also a solution to the corresponding* LO *problem as defined in Definition 2.*

*Example 4 (*LO*).* Let GO<sub>1</sub> := ⟨x, ≺<sub>1</sub>, *True*⟩ and GO<sub>2</sub> := ⟨y +<sub>[2]</sub> z, ≺<sub>2</sub>, *True*⟩, where x, y, z are variables of sort *BV*<sub>[2]</sub>, a ≺<sub>1</sub> b iff a ≺<sub>[2]</sub> b, and a ≺<sub>2</sub> b iff a ≻<sub>[2]</sub> b. Now, let GO = GO<sub>LO</sub>(GO<sub>1</sub>, GO<sub>2</sub>) = ⟨t, ≺<sub>LO</sub>, *True*⟩. Then, t = *tup*(x, y +<sub>[2]</sub> z) and (a<sub>1</sub>, a<sub>2</sub>) ≺<sub>LO</sub> (b<sub>1</sub>, b<sub>2</sub>) iff a<sub>1</sub> ≺<sub>[2]</sub> b<sub>1</sub> or (a<sub>1</sub> = b<sub>1</sub> and a<sub>2</sub> ≻<sub>[2]</sub> b<sub>2</sub>).

Now, let I, I′, and I″ be such that: x<sup>I</sup> = 11, y<sup>I</sup> = 00, z<sup>I</sup> = 10, and t<sup>I</sup> = (11, 10); x<sup>I′</sup> = 01, y<sup>I′</sup> = 01, z<sup>I′</sup> = 01, and t<sup>I′</sup> = (01, 10); x<sup>I″</sup> = 01, y<sup>I″</sup> = 01, z<sup>I″</sup> = 10, and t<sup>I″</sup> = (01, 11). Then, I″ <<sub>GO</sub> I′ <<sub>GO</sub> I, since (01, 11) ≺<sub>LO</sub> (01, 10) ≺<sub>LO</sub> (11, 10).

We can also accommodate Pareto optimization [8,9,64] in our framework.

Definition 4 (Pareto Optimization (PO)). *A Pareto optimization problem is a sequence of GOMT problems* PO = (GO<sub>1</sub>,..., GO<sub>n</sub>)*, where* GO<sub>i</sub> := ⟨t<sub>i</sub>, ≺<sub>i</sub>, φ<sub>i</sub>⟩ *for* i ∈ [1, n]*. For any* T*-interpretations* I *and* I′*, we say that:*

*–* I PO-*dominates* I′*, denoted* I <<sub>PO</sub> I′*, if: (i)* t<sub>i</sub><sup>I</sup> ⪯<sub>i</sub> t<sub>i</sub><sup>I′</sup> *for all* i ∈ [1, n]*; and (ii) for some* j ∈ [1, n]*,* t<sub>j</sub><sup>I</sup> ≺<sub>j</sub> t<sub>j</sub><sup>I′</sup>*.*

Definition 5 (GO<sub>PO</sub>). *Given a PO problem* PO = (GO<sub>1</sub>,..., GO<sub>n</sub>)*, we define* GO<sub>PO</sub>(GO<sub>1</sub>,..., GO<sub>n</sub>) := ⟨t, ≺<sub>PO</sub>, φ⟩*, where:*


Proposition 2. *Let* I *be a* GOPO*-solution. Then* I *is also a solution to the corresponding* PO *problem as defined in Definition 4.*

Next, consider a PO example with two sub-problems: one minimizing the length of a string w, and the other maximizing a substring x of w lexicographically.

*Example 5 (*PO*).* Let T be the SMT-LIB theory of strings and let GO<sub>1</sub> := ⟨len(w), ≺<sub>1</sub>, len(w) < 4⟩ and GO<sub>2</sub> := ⟨x, ≺<sub>2</sub>, contains(w, x)⟩, where w, x are variables of sort Str, ≺<sub>1</sub> is ≺<sub>Int</sub>, and ≺<sub>2</sub> is ≻<sub>str</sub>. Now, let GO = GO<sub>PO</sub>(GO<sub>1</sub>, GO<sub>2</sub>) = ⟨t, ≺<sub>PO</sub>, len(w) < 4 ∧ contains(w, x)⟩. Then, t = *tup*(len(w), x) and (a<sub>1</sub>, a<sub>2</sub>) ≺<sub>PO</sub> (b<sub>1</sub>, b<sub>2</sub>) iff a<sub>1</sub> ⪯<sub>Int</sub> b<sub>1</sub>, a<sub>2</sub> ⪰<sub>str</sub> b<sub>2</sub>, and (a<sub>1</sub> ≺<sub>Int</sub> b<sub>1</sub> or a<sub>2</sub> ≻<sub>str</sub> b<sub>2</sub>). Now, let I, I′, and I″ be such that: I := {w → "aba", x → "ab"} and t<sup>I</sup> = (3, "ab"); I′ := {w → "z", x → "z"} and t<sup>I′</sup> = (1, "z"); and I″ := {w → "", x → ""} and t<sup>I″</sup> = (0, ""). Then, I′ <<sub>GO</sub> I, since (1, "z") ≺<sub>PO</sub> (3, "ab"); but both I and I′ are incomparable with I″. Both I′ and I″ are optimal solutions.
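Pareto domination amounts to "at least as good in every component, strictly better in at least one". A Python sketch of this test (our illustration; component orders are strict predicates, and we assume Python's string comparison agrees with the lexicographic order on strings over 'a'–'z'):

```python
def pareto_dominates(a, b, orders):
    """a <_PO b: a is at least as good as b in every component (strictly
    below, or equal) and strictly better in at least one component."""
    weakly = all(lt(ai, bi) or ai == bi for lt, ai, bi in zip(orders, a, b))
    strictly = any(lt(ai, bi) for lt, ai, bi in zip(orders, a, b))
    return weakly and strictly

# Two components: minimize an integer length, maximize a string.
orders = [lambda a, b: a < b, lambda a, b: a > b]
```

On these orders, (1, "z") dominates (3, "ab"), while (0, "") is incomparable with both.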

Though we omit them for space reasons, we can similarly capture the MinMax and MaxMin optimization problems [56,64] as corresponding GO<sub>MINMAX</sub> and GO<sub>MAXMIN</sub> instances of Definition 1.<sup>3</sup>

Note that except for degenerate cases, the orders used for MinMax and MaxMin, as well as the order ≺PO above, are always partial orders. Being able to model these multi-objective optimization problems in a clean and simple way is a main motivation for using a partial instead of a total order in Definition 1.

Another problem in the literature is the *multiple-independent* (or *boxed*) optimization problem [8,9,64]. It simultaneously solves several independent GOMT problems. We show how to realize this as a single GO instance.

<sup>3</sup> Details of these formulations can be found in the longer version of this paper [67].

Definition 6 (Boxed Optimization (BO)). *A boxed optimization problem is a sequence of GOMT problems,* BO = (GO1,..., GO*n*)*, where* GO*<sup>i</sup>* := t*i*, ≺*i*, φ*i for* i ∈ [1, n]*. We say that:*

*– A sequence of interpretations* (I<sub>1</sub>,..., I<sub>n</sub>) BO-*dominates* (I′<sub>1</sub>,..., I′<sub>n</sub>)*, denoted by* (I<sub>1</sub>,..., I<sub>n</sub>) <<sub>BO</sub> (I′<sub>1</sub>,..., I′<sub>n</sub>)*, if* I<sub>i</sub> *and* I′<sub>i</sub> *are* GO<sub>i</sub>*-consistent for each* i ∈ [1, n]*, and: (i)* t<sub>i</sub><sup>I<sub>i</sub></sup> ⪯<sub>i</sub> t<sub>i</sub><sup>I′<sub>i</sub></sup> *for all* i ∈ [1, n]*; and (ii) for some* j ∈ [1, n]*,* t<sub>j</sub><sup>I<sub>j</sub></sup> ≺<sub>j</sub> t<sub>j</sub><sup>I′<sub>j</sub></sup>*.*
*–* (I<sub>1</sub>,..., I<sub>n</sub>) *is a solution to* BO *iff* I<sub>i</sub> *is* GO<sub>i</sub>*-consistent for each* i ∈ [1, n] *and no* (I′<sub>1</sub>,..., I′<sub>n</sub>) BO-*dominates* (I<sub>1</sub>,..., I<sub>n</sub>)*.*

Note that in previous work, there is an additional assumption that φ*<sup>i</sup>* = φ*<sup>j</sup>* for all i, j ∈ [1, n]. Below, we show how to solve the more general case without this assumption. We first observe that the above definition closely resembles Definition 4 for Pareto optimization (PO) problems. Leveraging this similarity, we show how to transform an instance of a BO problem into a PO problem.

Definition 7 (GO<sub>BO</sub>). *Let* BO = (GO<sub>1</sub>,..., GO<sub>n</sub>)*, where* GO<sub>i</sub> := ⟨t<sub>i</sub>, ≺<sub>i</sub>, φ<sub>i</sub>⟩ *for* i ∈ [1, n]*. Let* V<sub>i</sub> *be the set of all free variables in the* i*-th sub-problem that also appear in at least one other sub-problem:*

$$V_i = \left( FV(t_i) \cup FV(\phi_i) \right) \cap \bigcup_{j \in [1, n],\, j \neq i} \left( FV(t_j) \cup FV(\phi_j) \right)$$

*Let v<sub>i</sub>* = (v<sub>i,1</sub>,...,v<sub>i,m</sub>) *be some ordering of the variables in* V<sub>i</sub> *(say, by* ≺<sub>X</sub>*), and for each* j ∈ [1, m]*, let* v′<sub>i,j</sub> *be a fresh variable of the same sort as* v<sub>i,j</sub>*, and let v′<sub>i</sub>* = (v′<sub>i,1</sub>,...,v′<sub>i,m</sub>)*. Then, let* t′<sub>i</sub> = t<sub>i</sub>[*v<sub>i</sub>* ← *v′<sub>i</sub>*]*,* φ′<sub>i</sub> = φ<sub>i</sub>[*v<sub>i</sub>* ← *v′<sub>i</sub>*]*, and* GO′<sub>i</sub> = ⟨t′<sub>i</sub>, ≺<sub>i</sub>, φ′<sub>i</sub>⟩*. Then we define* GO<sub>BO</sub> := GO<sub>PO</sub>(GO′<sub>1</sub>,..., GO′<sub>n</sub>)*.*

Proposition 3. *Let* I *be a solution to* GO<sub>BO</sub> *as defined in Definition 7. Then* (I<sub>1</sub>,..., I<sub>n</sub>) *is a solution to the corresponding* BO *problem as defined in Definition 6, where for each* i ∈ [1, n]*,* I<sub>i</sub> *is the same as* I *except that each variable* v<sub>i,j</sub> ∈ V<sub>i</sub> *is interpreted as* (v′<sub>i,j</sub>)<sup>I</sup>*.*

In practice, solvers for BO problems can be implemented without variable renaming (see, e.g., [8,36,53]). Variable renaming, while a useful theoretical construct, also adds generality to our definition of BO. An interesting direction for future experimental work would be to compare the two approaches in practice.
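The shared-variable computation and renaming of Definition 7 can be sketched as follows (a simplified illustration operating directly on free-variable sets; the fresh-name convention is our own choice):

```python
def rename_shared(fv_sets):
    """For each sub-problem i, given its set of free variables
    FV(t_i) ∪ FV(phi_i), compute the shared variables V_i and a renaming
    map sending each v in V_i to a fresh primed copy."""
    renamings = []
    for i, fv in enumerate(fv_sets):
        others = set().union(*(s for j, s in enumerate(fv_sets) if j != i))
        shared = fv & others                      # V_i from Definition 7
        renamings.append({v: f"{v}_{i}'" for v in sorted(shared)})
    return renamings
```

For instance, if sub-problems 1 and 2 both use `y`, each gets its own primed copy of `y`, decoupling the sub-problems as in the definition.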

Compositional Optimization. GOMT problems can also be combined by functional composition of multiple objective terms, possibly of different sorts, yielding *compositional optimization problems* [12,62,64]. Our framework handles them naturally by simply constructing an objective term capturing the desired compositional relationship. For example, compositional objectives can address the (partial) MaxSMT problem [64], where some formulas are *hard* constraints and others are *soft* constraints. The goal is to satisfy all hard constraints and as many soft constraints as possible. The next example is inspired by Cimatti et al. [12] and Teso et al. [62].

*Example 6 (*MaxSMT*).* Let x ≥ 0 and y ≥ 0 be hard constraints and 4x + y − 4 ≥ 0 and 2x + 3y − 6 ≥ 0 soft constraints. We can formalize this as GO<sub>CO</sub> = ⟨t, ≺, φ⟩, where: t = *ite*(4x + y − 4 ≥ 0, 0, 1) + *ite*(2x + 3y − 6 ≥ 0, 0, 1), ≺ is ≺<sub>Int</sub>, and φ = x ≥ 0 ∧ y ≥ 0. An optimal solution must satisfy both hard constraints and, by minimizing the objective term t, as many soft constraints as possible.
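The objective of Example 6 can be evaluated directly: each *ite* contributes one unit of penalty per violated soft constraint. A small Python sketch (our illustration; the function names are ours):

```python
def objective(x, y):
    """Example 6's term t: one penalty unit per violated soft constraint."""
    return (0 if 4 * x + y - 4 >= 0 else 1) + (0 if 2 * x + 3 * y - 6 >= 0 else 1)

def hard(x, y):
    """Example 6's formula phi: the hard constraints must always hold."""
    return x >= 0 and y >= 0
```

For instance, (x, y) = (1, 2) satisfies both hard constraints and both soft constraints, so the minimum objective value 0 is attainable.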

MaxSMT has various variants including generalized, partial, weighted, and partial weighted MaxSMT [64], all of which our framework can handle similarly.

Next, we show a different compositional example that combines two different orders, one on strings and the other on integers. This example also illustrates a theory combination not present in the OMT literature.

*Example 7 (Composition of Str and Int).* Let T be again the theory of strings.<sup>4</sup> Let GO<sub>CO</sub> = ⟨*tup*(x, len(x)), ≺, contains(x, "a") ∧ len(x) > 1⟩, where x is of sort Str and (a<sub>1</sub>, b<sub>1</sub>) ≺ (a<sub>2</sub>, b<sub>2</sub>) iff b<sub>1</sub> ≺<sub>Int</sub> b<sub>2</sub> or (b<sub>1</sub> = b<sub>2</sub> and a<sub>1</sub> ≻<sub>str</sub> a<sub>2</sub>). The order ≺ prioritizes minimizing the length, but then maximizes the string with respect to the lexicographic order. An optimal solution must interpret x as the string "za" of length 2: x must be of length at least 2 and contain "a", making "za" the largest such string of minimum length.

Based on the definitions given in this section, we see that our formalism can capture any combination of GO (including compositional), GO<sub>LO</sub>, GO<sub>PO</sub>, GO<sub>MINMAX</sub>, GO<sub>MAXMIN</sub>, and GO<sub>BO</sub> problems. Note that the last four all make use of the partial-order feature of Definition 1.

# 4 The GOMT Calculus

We introduce a calculus for solving the GOMT problem, presented as a set of *derivation rules*. We fix a GOMT problem GO = ⟨t, ≺, φ⟩ where φ is satisfiable (optimizing does not make sense otherwise). We start with a few definitions.

Definition 8 (State). *A* state *is a tuple* Ψ = ⟨I, Δ, τ⟩*, where* I *is an interpretation,* Δ *is a formula, and* τ *is a sequence of formulas.*

The set of all states forms the state space for the GOMT problem. Intuitively, the proof procedure of the calculus is a search procedure over this state space which maintains at all times a *current state* ⟨I, Δ, τ⟩ storing a candidate solution and additional search information. In the current state, I is the best solution found so far in the search; Δ is a formula describing the remaining, yet unexplored, part of the state space, where a better solution might exist; and τ contains formulas that divide up the search space described by Δ into *branches*, represented by the individual formulas in τ, maintaining the invariant that the disjunction of all the formulas τ<sub>1</sub>,...,τ<sub>p</sub> in τ is equivalent to Δ modulo φ, that is, φ |= ((τ<sub>1</sub> ∨ ··· ∨ τ<sub>p</sub>) ⇔ Δ).

Note that states contain T -interpretations, which are possibly infinite mathematical structures. This is useful to keep the calculus simple. In practice, it

<sup>4</sup> The SMT-LIB theory of strings includes the theory of integers to support constraints over string length.

is enough just to keep track of the interpretations of the (finitely-many) symbols without fixed meanings (variables and uninterpreted functions and sorts) appearing in the state, much as SMT solvers do in order to produce models.

Definition 9 (Solve). Solve *is a function that takes a formula and returns a satisfying interpretation if the formula is satisfiable and a distinguished value* ⊥ *otherwise.*

Definition 10 (Better). BetterGO *is a function that takes a* GO*-consistent interpretation* I *and returns a formula* BetterGO(I) *with the property that for every* GO*-consistent interpretation* I *,*

$$I' \models \mathrm{Better}_{GO}(I) \quad \text{iff} \quad I' <_{GO} I.$$

The function above is specific to the given optimization problem GO or, put differently, is parametrized by t, ≺, and φ. When GO is clear, however, we simply write Better, for conciseness.

The calculus relies on the existence and computability of Solve and Better. Solve can be realized by any standard SMT solver. Better relies on a defining formula for ≺, as discussed below. Intuitively, Better(I) is simply a (possibly unsatisfiable) formula characterizing the solutions of φ that are better than I. Assuming α≺ is a formula with free variables v1 and v2 defining the order v1 ≺ v2, if the value t^I can be represented by some constant c (e.g., if t^I is a rational number), then Better(I) = α≺[(v1, v2) ← (t, c)] satisfies Definition 10. On the other hand, it could be that t^I is not representable as a constant (e.g., it could be an algebraic real number); then, a more sophisticated formula (involving, say, a polynomial and an interval specifying a particular root) may be required.
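To make Definition 10 concrete, the following is a minimal sketch under toy assumptions not made in the paper: the objective t is a single integer variable, ≺ is the usual integer order (minimization), and formulas are represented as Python predicates over candidate objective values. The name `make_better` is illustrative.

```python
def make_better(c):
    """Better(I) for integer minimization: a 'formula' (here, a predicate)
    satisfied exactly by objective values strictly smaller than t^I = c,
    i.e. the instantiation alpha_<[(v1, v2) <- (t, c)] of Definition 10."""
    return lambda v: v < c

# Interpretations whose objective value improves on t^I = 7:
better = make_better(7)
assert better(3) and not better(7) and not better(10)
```

In a real instantiation, `make_better` would return an SMT formula over the problem's signature rather than a Python predicate.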

Definition 11 (Initial State). *The* initial state *of the GOMT problem* GO = ⟨t, ≺, φ⟩ *is* ⟨I0, Δ0, τ0⟩*, where* I0 = Solve(φ)*,* Δ0 = Better(I0)*, and* τ0 = (Δ0)*.*

Note that I0 ≠ ⊥ since we assume that φ is satisfiable. The search for an optimal solution to the GOMT problem in our calculus starts with an arbitrary solution of the constraint φ and continues until it finds an optimal one.

### 4.1 Derivation Rules

Figure 1 presents the derivation rules of the GOMT calculus. The rules are given in guarded assignment form, where the rule premises describe the conditions on the current state that must hold for the rule to apply, and the conclusion describes the resulting modifications to the state. State components not mentioned in the conclusion of a rule are unchanged.

A derivation rule *applies* to a state if (i) the conditions in the premise are satisfied by the state and (ii) the resulting state is different. A state is *saturated* if no rules apply to it. A GO-*derivation* is a sequence of states, possibly infinite, where the first state is the initial state of the GOMT problem GO, and each state

$$\begin{array}{cc}
\text{F-Split} &
\dfrac{\tau \neq \emptyset \quad \psi = \mathrm{Top}(\tau) \quad \phi \models \psi \Leftrightarrow \bigvee_{j=1}^{k} \psi_j, \ k \geq 1}
{\tau := (\psi_1, \dots, \psi_k) \circ \mathrm{Pop}(\tau)}
\\[3ex]
\text{F-Sat} &
\dfrac{\tau \neq \emptyset \quad \psi = \mathrm{Top}(\tau) \quad \mathrm{Solve}(\phi \wedge \psi) = \mathcal{I}' \quad \mathcal{I}' \neq \bot \quad \Delta' = \Delta \wedge \mathrm{Better}(\mathcal{I}')}
{\mathcal{I} := \mathcal{I}', \ \Delta := \Delta', \ \tau := (\Delta')}
\\[3ex]
\text{F-Close} &
\dfrac{\tau \neq \emptyset \quad \psi = \mathrm{Top}(\tau) \quad \mathrm{Solve}(\phi \wedge \psi) = \bot}
{\Delta := \Delta \wedge \neg\psi, \ \tau := \mathrm{Pop}(\tau)}
\end{array}$$

Fig. 1. The derivation rules of the GOMT Calculus.

in the sequence is obtained by applying one of the rules to the previous state. The *solution sequence* of a derivation is the sequence made up of the solutions (i.e., the interpretations) in each state of the derivation.

The calculus starts with a solution for φ and improves on it until an optimal solution is found. During a derivation, the best solution found so far is maintained in the I component of the current state. A search for a better solution can be organized into branches through the use of the F-Split rule. Progress toward a better solution is enforced by the formula Δ which, by construction, is falsified by all the solutions found so far. We elaborate on the individual rules next.

F-Split. F-Split divides the branch of the search space represented by the top formula ψ = Top(τ) in τ into k sub-branches (ψ1, ..., ψk), ensuring their disjunction is equivalent to ψ modulo the constraint φ: φ |= ψ ⇔ (ψ1 ∨ ··· ∨ ψk). The rest of the state remains unchanged. F-Split is applicable whenever τ is non-empty. The rule does not specify how the formulas ψ1, ..., ψk are chosen. However, a pragmatic implementation should aim to generate them so that they are *irredundant* in the sense that no formula is entailed modulo φ by the (disjunction of the) other formulas. This way, each branch potentially contains a solution that the others do not. Note, however, that this is not a requirement.

F-Sat. The F-Sat rule applies when there is a solution in the branch represented by the top formula ψ in τ. The rule selects a solution I' = Solve(φ ∧ ψ) from that branch. One can prove that, by the way the formulas in τ are generated in the calculus, I' necessarily improves on the current solution I, moving the search closer to an optimal solution.<sup>5</sup> Thus, F-Sat switches to the new solution (with I := I') and directs the search to seek an even better solution by updating Δ to Δ' = Δ ∧ Better(I'). Note that F-Sat resets τ to the singleton sequence (Δ'), discarding any other formulas in τ. This is justified, as any better solutions in the discarded branches must also be in the space defined by Δ'.

F-Close. The F-Close rule eliminates the first element ψ of a non-empty τ if the corresponding branch contains no solutions (i.e., Solve(φ ∧ ψ) = ⊥).

<sup>5</sup> See Lemma 6 in Appendix B of a longer version of this paper [67].

The rule further updates the state by adding the negation of ψ to Δ as a way to eliminate from further consideration the interpretations satisfying ψ.

Note that rules F-Sat and F-Close both update Δ to reflect the remaining search space, whereas F-Split refines the division of the current search space.
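To make the interplay of these rules concrete, here is a minimal, self-contained sketch of a GO-derivation using only F-Sat and F-Close (i.e., a linear search) on a toy problem: minimize t(v) = v over a fixed finite list standing in for the models of φ. Formulas are Python predicates and Solve is brute force; the names `MODELS`, `solve`, and `better` are illustrative and not part of the calculus.

```python
MODELS = [9, 4, 7, 2, 11]          # toy models of phi

def solve(pred):
    """Solve(phi ∧ pred): return the first model found, or None (⊥)."""
    for v in MODELS:
        if pred(v):
            return v
    return None

def better(value):                  # Better(I) for integer minimization
    return lambda v, c=value: v < c

# Initial state <I0, Delta0, tau0> (Definition 11)
I = solve(lambda v: True)
Delta = better(I)
tau = [Delta]

trace = [I]
while tau:                          # run until the state is saturated
    psi = tau[0]                    # psi = Top(tau)
    I_new = solve(psi)
    if I_new is not None:           # F-Sat: strictly better solution found
        I = I_new
        old = Delta
        Delta = lambda v, o=old, b=better(I_new): o(v) and b(v)
        tau = [Delta]               # reset tau to the singleton (Delta')
        trace.append(I)
    else:                           # F-Close: pop the empty branch
        old = Delta
        Delta = lambda v, o=old, p=psi: o(v) and not p(v)
        tau = tau[1:]

print(trace, I)   # → [9, 4, 2] 2
```

Each entry of `trace` strictly improves on the previous one, and the final I (here 2) is the optimal solution, illustrating Theorem 1 on this toy instance.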

### 4.2 Search Strategies

The GOMT calculus provides the flexibility to support different search strategies. Here, we give some examples, including both notable strategies from the OMT literature as well as new strategies enabled by the calculus, and explain how they work at a conceptual level.

Divergence of Strategies: The strategies discussed below, with the exception of Hybrid search, may diverge if an optimal solution does not exist or if there is a *Zeno-style* [54,55] infinite chain of increasingly better solutions, all dominated by an optimal one. We discuss these issues and termination in general in Sect. 4.4.

Linear Search: A linear search strategy is obtained by never using the F-Split rule. Instead, the F-Sat rule is applied to completion (that is, repeatedly until it no longer applies). As we show later (see Theorem 2), in the absence of Zeno chains, τ eventually becomes empty, terminating the search. At that point, I is guaranteed to be an optimal solution.

Binary Search: A binary search strategy is achieved by using the F-Split rule to split the search space represented by ψ = Top(τ ) into two subspaces, represented by two formulas ψ<sup>1</sup> and ψ2, with φ |= ψ ⇔ (ψ<sup>1</sup> ∨ ψ2). In a strict binary search strategy, ψ<sup>1</sup> and ψ<sup>2</sup> should be chosen so that the two subspaces are disjoint and, to the extent possible, of equal size. A typical binary strategy alternates applications of F-Split with applications of either F-Sat or F-Close until τ becomes empty, at which point I is guaranteed to be an optimal solution. A smart strategy would aim to find an optimal solution as soon as possible by arranging for solutions in ψ<sup>1</sup> (which will be checked first) to be better than solutions in ψ2, if this is easy to determine. Note that an unfortunate choice of ψ<sup>1</sup> by F-Split, containing no solutions at all, is quickly remedied by an application of F-Close which removes ψ1, allowing ψ<sup>2</sup> to be considered next. The same problem of Zeno-style infinite chains can occur in this strategy.
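Under the same toy assumptions (a fixed list `MODELS` standing in for the models of φ, brute-force Solve, integer minimization), a strict binary strategy can be sketched as follows: F-Split halves the value range of the current branch, the better half is explored first, and branches without models are discarded as F-Close does. All names are illustrative.

```python
MODELS = [9, 4, 7, 2, 11]          # toy models of phi

def solve(pred):
    """Solve(phi ∧ pred): return some model satisfying pred, or None (⊥)."""
    for v in MODELS:
        if pred(v):
            return v
    return None

def search(lo, hi):
    """Minimize v over MODELS ∩ [lo, hi] by repeated binary F-Split."""
    tau = [(lo, hi)]                # a branch is a range of objective values
    while tau:
        a, b = tau.pop(0)           # psi = Top(tau)
        if solve(lambda v: a <= v <= b) is None:
            continue                # F-Close: the branch has no model
        if a == b:
            return a                # nonempty singleton branch: the minimum
        mid = (a + b) // 2          # F-Split: psi <=> psi1 ∨ psi2
        tau = [(a, mid), (mid + 1, b)] + tau   # better half explored first
    return None
```

Here `search(0, 20)` returns 2, the minimal model: every range fully below the answer is closed before any singleton branch succeeds, so the first singleton that contains a model is the optimum.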

Multi-directional Exploration: For multi-objective optimization problems, a search strategy can be defined to simultaneously direct the search space towards any or all objectives. Formally, if n is the number of objectives, then the F-Split rule can be instantiated in such a way that ψj = ψj1 ∧ ··· ∧ ψjn, where ψji is a formula describing a part of the search space for the i-th objective term in the j-th branch.

Search Order: We formalize τ as a sequence to enforce exploring the branches in τ in a specific order, assuming such an order can be determined at the time of applying F-Split. Often, this is the case. For example, in binary search, it is typically best to explore the section of the search space with better objective values first. If a solution is found in this section, a larger portion of the search space is pruned. Conversely, if the branches are explored in another order, even finding a solution necessitates continued exploration of the space corresponding to the remaining branches.

Alternatively, τ can be implemented as a set, by redefining the Top and Pop functions accordingly to select and remove a desired element of τ. With τ defined as a set, additional search strategies are possible, including parallel exploration of the search space and the ability to arbitrarily switch between branches.

Hybrid Search: For some objectives and orders, there exist off-the-shelf external optimization procedures (e.g., Simplex for linear real arithmetic). One way to integrate such a procedure into our calculus is to replace a call to the Solve function in F-Sat with a call to an external optimization procedure Optimize that is sort- and order-compatible with the GOMT problem. We pass to Optimize as parameters the constraint φ ∧ Top(τ ) and the objective t and obtain an optimal solution in the current branch Top(τ ). <sup>6</sup> The call can be viewed as an accelerator for a linear search on the current branch. This approach incorporates theory-specific optimization solvers in much the same way as is done in the OMT literature. However, our calculus extends previous approaches with the ability to blend theory-specific optimization with theory-agnostic optimization by interleaving applications of F-Sat using Solve with applications using Optimize. For example, we may want to alternate between expensive calls to an external optimization solver and calls to a standard solver that are guided by a custom branching heuristic.
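The hybrid idea can be sketched in the same toy setting: on a branch, a stand-in Optimize returns the branch optimum directly, playing the role of a theory-specific procedure such as simplex, so a single call replaces a whole chain of F-Sat improvements. The function `optimize` below is an illustrative stand-in, not an actual external solver.

```python
MODELS = [9, 4, 7, 2, 11]          # toy models of phi

def optimize(pred):
    """Stand-in for an external optimization procedure: return the best
    (minimal) model satisfying pred, or None if the branch is empty."""
    candidates = [v for v in MODELS if pred(v)]
    return min(candidates) if candidates else None

# One F-Sat step on the branch psi = "v < 9", using Optimize instead of
# Solve: the branch optimum is reached in a single call.
best_in_branch = optimize(lambda v: v < 9)
assert best_in_branch == 2
```

Interleaving such accelerated steps with ordinary Solve-based F-Sat steps is exactly the blending of theory-specific and theory-agnostic optimization described above.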

Other Strategies: The calculus enables us to mix and match the above strategies arbitrarily, as well as to model other notable search techniques like cutting planes [16] by integrating a cut formula into Solve. And, of course, one advantage of an abstract calculus is that its generality provides a framework for the exploration of new strategies. Such an exploration is a promising direction for future work.

### 4.3 New Applications

A key feature of our framework is that it is theory-agnostic, that is, it can be used with any SMT theory or combination of theories. This is in contrast to most of the OMT literature in which a specific theory is targeted. It also fully supports arbitrary composition of GOMT problems using the multi-objective approaches described in Sect. 3.2. Thus, our framework enables OMT to be extended to new

<sup>6</sup> This assumes there exists an optimal solution in the current branch. If not (i.e., if the problem is unbounded), a suitable error can be raised and the search terminated.

application areas requiring either combinations of theories or multi-objective formulations that are unsupported by previous approaches. We illustrate this (and the calculus itself) using a Pareto optimization problem over the theories of strings and integers (a combination of theories and objectives unsupported by any existing OMT approach or solver).

*Example 8 (*GOPO*).* Let GO1 := ⟨len(w), ≺1, len(s) < len(w)⟩ and GO2 := ⟨x, ≺2, x = s·w·s⟩, where w, x, s are of sort Str, len(w) and len(s) are of sort Int, ≺1 ≡ ≺Int, and ≺2 ≡ >Str. Then, let GOPO(GO1, GO2) := ⟨t, ≺PO, φ⟩, where t is *tup*(len(w), x), φ is x = s · w · s ∧ len(s) < len(w), and (a1, a2) ≺PO (b1, b2) iff a1 ≼1 b1, a2 ≼2 b2, and either a1 ≺1 b1 or a2 ≺2 b2 (or both). Suppose initially:

$$\begin{array}{l} \mathcal{I}_0 = \{x \mapsto \text{"aabaa"}, \ s \mapsto \text{"a"}, \ w \mapsto \text{"aba"}\}, \quad \tau_0 = (\Delta_0),\\ \Delta_0 = (\mathsf{len}(w) \le 3 \land x >_{\mathsf{str}} \text{"aabaa"}) \vee (\mathsf{len}(w) < 3 \land x \ge_{\mathsf{str}} \text{"aabaa"}). \end{array}$$

The initial objective term value is (3, "aabaa").

1. We can first apply F-Split to split the top-level disjunction in τ . And suppose we want to work on the second disjunct first. This results in:

$$\tau_1 = (\mathsf{len}(w) < 3 \land x \ge_{\mathsf{str}} \text{"aabaa"}, \ \mathsf{len}(w) \le 3 \land x >_{\mathsf{str}} \text{"aabaa"})$$

while the other elements of the state are unchanged.

2. Now, suppose we want to do binary search on the length objective. This can be done by again applying the F-Split rule with the disjunction (len(w) < 2 ∧ x ≥str "aabaa") ∨ (2 ≤ len(w) < 3 ∧ x ≥str "aabaa") to get:

$$\begin{array}{rl} \tau_2 = & (\mathsf{len}(w) < 2 \land x \ge_{\mathsf{str}} \text{"aabaa"}, \ 2 \le \mathsf{len}(w) < 3 \land x \ge_{\mathsf{str}} \text{"aabaa"},\\ & \ \mathsf{len}(w) \le 3 \land x >_{\mathsf{str}} \text{"aabaa"}). \end{array}$$

3. Both F-Split and F-Sat are applicable, but we follow the strategy of applying F-Sat after a split. Suppose we get the new solution I' = {x ↦ "b", s ↦ ε, w ↦ "b"}. Then we have:

$$\begin{array}{rl} \mathcal{I}_3 = & \{x \mapsto \text{"b"}, \ s \mapsto \epsilon, \ w \mapsto \text{"b"}\}, \quad \tau_3 = (\Delta_3),\\ \Delta_3 = & (\mathsf{len}(w) \le 1 \land x >_{\mathsf{str}} \text{"b"}) \vee (\mathsf{len}(w) < 1 \land x \ge_{\mathsf{str}} \text{"b"}). \end{array}$$

4. Both F-Split and F-Sat are again applicable. Suppose that we switch now to linear search and thus again apply F-Sat, and suppose the new solution is I' = {x ↦ "z", s ↦ ε, w ↦ "z"}. This brings us to the state:

$$\begin{array}{rl} \mathcal{I}_4 = & \{x \mapsto \text{"z"}, \ s \mapsto \epsilon, \ w \mapsto \text{"z"}\}, \quad \tau_4 = (\Delta_4),\\ \Delta_4 = & (\mathsf{len}(w) \le 1 \land x >_{\mathsf{str}} \text{"z"}) \vee (\mathsf{len}(w) < 1 \land x \ge_{\mathsf{str}} \text{"z"}). \end{array}$$

5. Now, Solve(φ ∧ ((len(w) ≤ 1 ∧ x >str "z") ∨ (len(w) < 1 ∧ x ≥str "z"))) = ⊥. Indeed, len(w) ≠ 0, since 0 ≤ len(s) < len(w), so the second disjunct is unsatisfiable; and if len(w) = 1, then len(s) = 0 and len(x) = 1, thus x ≯str "z". Now F-Close can derive the state:

$$\langle \mathcal{I}_5, \Delta_5, \tau_5 \rangle = \langle \mathcal{I}_4, \ \Delta_4 \wedge \neg\Delta_4, \ \emptyset \rangle$$

6. We have reached a saturated state, and I<sup>5</sup> is a Pareto optimal solution.
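The Pareto order ≺PO used in this example can be sketched as a predicate, assuming (as the Δ formulas above suggest) that smaller lengths and lexicographically larger strings are preferred; `pareto_better` is an illustrative name, not from the paper.

```python
def pareto_better(a, b):
    """(a1, a2) <_PO (b1, b2): weakly at least as good in both
    components, and strictly better in at least one."""
    (a1, a2), (b1, b2) = a, b
    weakly = a1 <= b1 and a2 >= b2      # ≼1 on lengths, ≽ lexicographic on strings
    strictly = a1 < b1 or a2 > b2
    return weakly and strictly

# (1, "b") Pareto-dominates the initial value (3, "aabaa"):
assert pareto_better((1, "b"), (3, "aabaa"))
# (1, "z") in turn dominates (1, "b"): same length, larger string.
assert pareto_better((1, "z"), (1, "b"))
# (5, "zz") does not dominate (3, "aabaa"): the length got worse.
assert not pareto_better((5, "zz"), (3, "aabaa"))
```

This matches the derivation above: the solutions improve from (3, "aabaa") to (1, "b") to (1, "z"), and the final value is Pareto optimal.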

Optimization of objectives involving strings and integers (or strings and bitvectors) could be especially useful in security applications such as those mentioned in [60]. Optimization could be used in such applications to ensure that a counter-example is as simple as possible, for example.

Examples of multi-objective problems unsupported by existing solvers include multiple Pareto problems with a single min/max query, Pareto-lexicographic multi-objective optimization, and single Pareto queries involving MinMax and MaxMin optimization (see, for example, [1,32,52]). Our framework offers immediate solutions to these problems.

As has repeatedly been the case in SMT research, when new capabilities are introduced, new applications emerge. We expect that will happen also for the new capabilities introduced in this paper. One possible application is the optimization of emerging technology circuit designs [22].

### 4.4 Correctness

In this section, we establish correctness properties for GO-derivations. Initially, we demonstrate that upon reaching a saturated state, the interpretation I in that state is optimal.<sup>7</sup>

Theorem 1. *(Solution Soundness) Let* ⟨I, Δ, τ⟩ *be a saturated state in a derivation for a GOMT problem* GO*. Then,* I *is an optimal solution to* GO*.*

*Proof.* (Sketch) We show that in a saturated state τ = ∅, and that when τ = ∅, φ |= ¬Δ. Then, we establish that I is GO-consistent and that, for any GO-consistent T-interpretation J, J |= Δ iff J <GO I. This implies there is no J such that J |= φ and J <GO I, confirming I as an optimal solution to GO.

In general, the calculus does not always have complete derivation strategies, for a variety of reasons. It could be that the problem is unbounded, i.e., no optimal solutions exist along some branch. Another possibility is that the order is not well-founded, and thus, an infinite sequence of improving solutions can be generated without ever reaching an optimal solution. For the former, various checks for unboundedness can be used. These are beyond the scope of this work, but some approaches are discussed in Trentin [64]. The latter can be overcome using a hybrid strategy when an optimization procedure exists (see Theorem 4). It is also worth observing that any derivation strategy is in effect an *anytime procedure*: forcibly stopping a derivation at any point yields (in the final state) the best solution found so far. When an optimal solution exists and is unique, stopping early provides the best approximation up to that point of the optimal solution.

There are also fairly general conditions under which solution complete derivation strategies do exist. We present them next.

<sup>7</sup> Full proofs for the theorems in this section can be found in a longer version of this paper [67].

Definition 12. *A derivation strategy is* progressive *if it (i) never halts in a non-saturated state and (ii) only uses* F-Split *a finite number of times in any derivation.*

Let us again fix a GOMT problem GO = ⟨t, ≺, φ⟩. Consider the set At = {t^I | I is GO-consistent}, collecting all values of t in interpretations satisfying φ.

Theorem 2. *(Termination) If* ≺ *is well-founded over* A*t, any progressive strategy reaches a saturated state.*

*Proof.* (Sketch) We show that any derivation using a progressive strategy terminates when ≺ is well-founded. Subsequently, based on the definition of progressive, the final state must be saturated.

Theorem 3. *(Solution Completeness) If* ≺ *is well-founded over* A*<sup>t</sup> and* GO *has one or more optimal solutions, every derivation generated by a progressive derivation strategy ends with a saturated state containing one of them.*

*Proof.* The proof is a direct consequence of Theorem 1 and Theorem 2.

Solution completeness can also be achieved using an appropriate hybrid strategy.

Theorem 4. *If* GO *has one or more optimal solutions and is not unbounded along any branch, then every derivation generated by a progressive hybrid strategy, where* Solve *is replaced by* Optimize *in* F-Sat*, ends with a saturated state containing one of them.*

*Proof.* (Sketch) If D is such a derivation, we note that F-Split can only be applied a finite number of times in D and consider the suffix of D after the last application of F-Split. In that suffix, F-Close can only be applied a finite number of times in a row, after which F-Sat must be applied. We then show that due to the properties of Optimize, this must be followed by either an application of F-Close or a single application of F-Sat followed by F-Close. Both cases result in saturated states. The theorem then follows from Theorem 1.

# 5 Related Work

Various approaches for solving the OMT problem have been proposed. We summarize the key ideas below and refer the reader to Trentin [64] for a more thorough survey.

The *offline schema* employs an SMT solver as a black box for optimization search through incremental calls [54,55], following linear- or binary-search strategies. Initial bounds on the objective function are given and iteratively tightened after each call to the SMT solver. In contrast, the *inline schema* conducts the optimization search within the SMT solver itself [54,55], integrating the optimization criteria into its internal algorithm. While the inline schema can be more efficient than the offline counterpart, it necessitates invasive changes to the solver and may not be possible for every theory.

*Symbolic Optimization* optimizes multiple independent linear arithmetic objectives simultaneously [36], seeking optimal solutions for each corresponding objective. This approach improves performance by sharing SMT search effort. It exists in both offline and inline versions, with the latter demonstrating superior performance. Other arithmetic schemas combine simplex, branch-and-bound, and cutting-plane techniques within SMT solvers [44,50]. A polynomial constraint extension has also been introduced [33].

Theory-specific techniques address objectives involving pseudo-Booleans [11, 54,55,57], bitvectors [40,65], bitvectors combined with floating-point arithmetic [66], and nonlinear arithmetic [6]. Other related work includes techniques for lexicographic optimization [8], Pareto optimization [8,24], MaxSMT [19], and All-OMT [64].

Our calculus is designed to capture all of these variations. It directly corresponds to the offline schema, can handle both single- and multi-objective problems, and can integrate solvers with inline capabilities (including theory-specific ones) using the hybrid solving strategy. Efficient MaxSMT approaches [19] can also be mimicked in our calculus. These approaches systematically explore the search space by iteratively processing segments derived from unsat cores. Our calculus can instantiate these branches using the F-Split rule, by first capturing unsat cores from calls to F-Close, and then using these cores to direct the search in the F-Split rule.

# 6 Conclusion and Future Work

This paper introduces the Generalized OMT problem, a proper extension of the OMT problem. It also provides a general setting for formalizing various approaches for solving the problem in terms of a novel calculus for GOMT and proves its key correctness properties. As with previous work on abstract transition systems for SMT [27,31,39,45], this work establishes a framework for both theoretical exploration and practical implementations. The framework is general in several aspects: (i) it is parameterized by the optimization order, which does not need to be total; (ii) it unifies single- and multi-objective optimization problems in a single definition; (iii) it is theory-agnostic, making it applicable to any theory or combination of theories; and (iv) it provides a formal basis for understanding and exploring search strategies for Generalized OMT.

In future work, we plan to explore an extension of the calculus to the generalized All-OMT problem. We also plan to develop a concrete implementation of the calculus in a state-of-the-art SMT solver and evaluate it experimentally against current OMT solvers.

Acknowledgements. This work was funded in part by the Stanford Agile Hardware Center and by the National Science Foundation (grant 2006407).

# References


Systems Ph.D Workshop 2019, CPS Summer School "Designing Cyber-Physical Systems - From concepts to implementation", Alghero, Italy, pp. 51–59 (2019)


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Author Index**

### **A**

Acclavio, Matteo II-216 Amrollahi, Daneshvar I-154 Arrial, Victor II-338 Avigad, Jeremy I-3 Ayala-Rincón, Mauricio II-317

### **B**

Baader, Franz II-279 Balbiani, Philippe II-78 Barragán, Andrés Felipe González II-317 Barrett, Clark I-458 Bártek, Filip I-194 Berg, Jeremias I-396 Bhayat, Ahmed I-75 Biere, Armin I-284 Bonsangue, Marcello II-401 Bozec, Tanguy II-157 Bromberger, Martin I-133 Brown, Chad E. I-86 Bruni, Alessandro II-61

### **C**

Cerna, David M. II-317 Chassot, Samuel I-304 Chvalovský, Karel I-194 Ciabattoni, Agata II-176 Coopmans, Tim II-401

#### **D**

Das, Anupam II-237 De Lon, Adrian I-105 De, Abhishek II-237 Dixon, Clare II-3

#### **E**

Ehling, Georg II-381 Einarsdóttir, Sólrún Halla I-214

### **F**

Férée, Hugo II-43 Fernández Gil, Oliver II-279 Ferrari, Mauro II-24 Fiorentini, Camillo II-24 Frohn, Florian I-344 Froleyks, Nils I-284 Fruzsa, Krisztina II-114

### **G**

Gao, Han II-78 Garcia, Ronald I-419 Ge, Rui I-419 Gencer, Çiğdem II-78 Ghilardi, Silvio I-265 Giesl, Jürgen I-233, I-344, II-360 Giessen, Iris van der II-43 Gool, Sam van II-43 Graham-Lengrand, Stéphane I-386 Guerrieri, Giulio II-338

### **H**

Hader, Thomas I-386 Hajdu, Márton I-21, I-115, I-154, I-214 Heisinger, Maximilian I-315, I-325 Heisinger, Simone I-315, I-325 Heljanko, Keijo I-284 Heuer, Jan I-172 Hozzová, Petra I-21, I-154 Hustadt, Ullrich II-3

### **I**

Ihalainen, Hannes I-396 Irfan, Ahmed I-386

### **J**

Järvisalo, Matti I-396 Johansson, Moa I-214

© The Editor(s) (if applicable) and The Author(s) 2024 C. Benzmüller et al. (Eds.): IJCAR 2024, LNAI 14739, pp. 481–482, 2024. https://doi.org/10.1007/978-3-031-63498-7

482 Author Index

### **K**

Kaliszyk, Cezary I-86 Kassing, Jan-Christoph II-360 Kaufmann, Daniela I-386 Kesner, Delia II-338 Khalid, Zain I-53 Kotthoff, Lars I-53 Kovács, Laura I-21, I-115, I-154, I-386 Kozen, Dexter II-257 Krasnopol, Florent I-133 Kunčak, Viktor I-304 Kutsia, Temur II-317, II-381 Kuznets, Roman II-114

#### **L**

Laarman, Alfons II-401 Lammich, Peter I-439 Lommen, Nils I-233

### **M**

Mei, Jingyi II-401 Meyer, Éléanore I-233 Middeldorp, Aart II-298 Mitterwallner, Fabian II-298 Möhle, Sibylle I-133 Myreen, Magnus O. I-396

#### **N**

Nalon, Cláudia II-3, II-97 Niederhauser, Johannes I-86 Nordström, Jakob I-396

### **O**

Oertel, Andy I-396 Olivetti, Nicola II-78

#### **P**

Papacchini, Fabio II-3 Pattinson, Dirk II-97 Peltier, Nicolas II-157 Perrault, C. Raymond I-53 Petitjean, Quentin II-157 Platzer, André II-196

Poidomani, Lia M. I-265 Pommellet, Adrien I-366 Prebet, Enguerrand II-196

#### **R**

Rawson, Michael I-115 Rebola-Pardo, Adrian I-325 Ritter, Eike II-61 Rooduijn, Jan II-257 Ruess, Harald II-137

#### **S**

Scatton, Simon I-366 Schmid, Ulrich II-114 Schöpf, Jonas II-298 Schürmann, Carsten II-61 Seidl, Martina I-315, I-325 Shillito, Ian II-43 Sighireanu, Mihaela II-157 Silva, Alexandra II-257 Smallbone, Nicholas I-214 Stan, Daniel I-366 Suda, Martin I-75, I-194, I-214 Summers, Alexander J. I-419 Sutcliffe, Geoff I-30, I-53 Suttner, Christian I-53

### **T**

Tan, Yong Kiam I-396 Tesi, Matteo II-176 Tinelli, Cesare I-458 Tsiskaridze, Nestan I-458

### **V**

van Ditmarsch, Hans II-114 Vartanyan, Grigory II-360 Voronkov, Andrei I-21, I-115, I-154

#### **W**

Wagner, Eva Maria I-154 Waldmann, Uwe I-244 Weidenbach, Christoph I-133 Wernhard, Christoph I-172

### **Y**

Yu, Emily I-284