
Karlsruher Schriften zur Anthropomatik

Band 58

Editor: Prof. Dr.-Ing. habil. Jürgen Beyerer

An overview of all volumes published in this series so far can be found at the end of the book.

# **Distributed Planning for Self-Organizing Production Systems**

by Julius Pfrommer


Karlsruher Institut für Technologie

Institut für Anthropomatik und Robotik

Distributed Planning for Self-Organizing Production Systems

Dissertation accepted by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) in fulfillment of the requirements for the degree of Doktor-Ingenieur

von Julius Pfrommer

Date of the oral examination: 22 July 2019

Reviewer: Prof. Dr.-Ing. habil. Jürgen Beyerer

Second reviewer: Prof. Dr.-Ing. Michael Weyrich

**Impressum**

Karlsruher Institut für Technologie (KIT), KIT Scientific Publishing, Straße am Forum 2, D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding the cover, pictures and graphs – is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2024 – Printed on FSC-certified paper

ISSN 1863-6489 · ISBN 978-3-7315-1253-0 · DOI 10.5445/KSP/1000152028

# **Kurzfassung**

For automated production plants, there is a fundamental tradeoff between efficiency and flexibility. In most cases, the processes are fixed not only by the physical structure of the production plant, but also by the specially tailored programming of the plant control. Changes have to be laboriously propagated through a multitude of systems. This makes the production of small lot sizes unprofitable.

This dissertation develops an approach to automatically adapt the behavior of production plants to changing orders and operating conditions. For this, the principle of self-organization through distributed planning is employed. The results of the dissertation, each building on the previous, are as follows:

By exploiting additional structure in the models, the approach also scales to large scenarios.

Summarizing the results, foundations are laid for flexible automated production systems – on a single shopfloor, but also distributed across sites and organizations – which can purposefully use their inherent degrees of freedom through runtime planning and agent-based coordination. The connection to practice is established through application examples. The feasibility of the approach was demonstrated with real machines as part of the EU project SkillPro and in a simulation environment with further scenarios.

# **Abstract**

There is a fundamental tradeoff between automation and flexibility in production systems. Large lot sizes can be produced efficiently with automated production systems. Many machines and equipment, like a 5-axis CNC mill, are in principle capable of producing many different kinds of parts. Similarly, (intra-)logistics systems exist for automated transport and warehousing. Integrating these flexible components into an overall production system that is equally flexible has so far been prevented by the limits of automation technology in dealing with the ensuing complexity. Most production processes are rigid not only by way of the physical layout of machines and their integration, but also by the custom programming of the control logic that integrates the components into a production system. Changes are time- and resource-expensive. This makes the production of small lot sizes of customized products economically challenging.

This thesis develops solutions for the automated adaptation of production systems based on self-organisation and distributed planning. The main results are the following:


…production processes. For this, the technique of Monte-Carlo Tree Search is extended. By exploiting algebraic structure in the models, the approach can be scaled to large scenarios.


In summary, the results of this work enable future production systems that are both efficient and adaptive. For this, the components of the production system are enabled to use the degrees of freedom that are available to them. Through the use of self-organization for coordination, the components of the overall system can react to changes in the system topology and external conditions. The approach was tested in application scenarios, in simulation as well as in physical production systems, for example as part of the EU project SkillPro.

# **Acknowledgments**

Let me first thank Prof. Dr.-Ing. habil. Jürgen Beyerer for his mentorship, guidance and shared passion for agent-based and distributed systems throughout the development of this thesis. The time at the IES chair and the interactions were formative and invaluable. Furthermore I want to thank Prof. Dr.-Ing. Michael Weyrich for the discussions and his role as a reviewer of this dissertation.

Fraunhofer IOSB, and particularly the ILT department led by Thomas Usländer, was a great environment to conduct both the scientific research leading up to this thesis and high-impact engineering projects in the automation environment. The colleagues at IOSB, and especially my former group leaders Miriam Schleipen and Ljiljana Stojanovic, taught me the ropes of the craft of applied research. This was an experience from which I still profit immensely.

Besides the results on distributed planning for production control, a "byproduct" of this thesis was that it enabled me to contribute to the open62541 open source implementation of the OPC UA standard for industrial communication. This background ensures a clear technological path from the theoretical work to the application scenarios. The open62541 development team proved very inspirational and productive, so let me thank Florian Palm and Sten Grüner, Prof. Leon Urbas, Chris Iatrou, Stefan Profanter, and Andreas Ebner.

Another thank you goes to Joseph Warrington and Georg Schildbach for their guidance during my first steps into research during my time at the Automatic Control Laboratory at ETH Zurich.

Most importantly, this thesis would not have been possible without the encouragement and support of my family. *This thesis would not have been possible without all your support, Ingrid!*

Karlsruhe, July 2019 *Julius Pfrommer*

# **Table of Contents**



# **List of Figures**



# **List of Tables**


# **List of Symbols**

#### **General Notation**


#### **General Sets**


#### **Probability**


#### **Production System Model**


#### **Distributed Production System Model**


#### **Trace Theory**


#### **Planning**



#### **Miscellaneous**


# **List of Acronyms**



# **1 Introduction**

*There has been a great deal of talk, much of it well founded, that the effect of science on economics and on the economy has not only been very large but that something like a second industrial revolution is impending. Illustrating this are the enormous advances in communications – physical and informational –, advances in automatization and in the domain of information and control, and finally, atomic energy.*

*John von Neumann [Neu55]*

## **1.1 Production and Logistics in a Global Economy**

After introducing the assembly line for the production of the Model T automobile in 1908, Henry Ford sought to make his company self-sufficient. For the supply of raw material, his Ford Motor Company bought 700,000 acres of forest, rubber plantations, iron and coal mines, and so on. The Ford River Rouge Complex near Detroit was designed to transform the incoming raw material into fully assembled cars. The manufacturing operations performed at River Rouge included coal coking, steel forging, sheet metal stamping, engine casting, lumber milling, tire making, the production of sheet glass from molten sand, and many more. All of these led up to the final production step: the assembly of the car on Ford's assembly line [Bri03]. Since Ford's only product at the time was the Model T, all production processes were highly specialized to maximize efficiency and reduce costs. This made the Model T the most affordable car of its time. Ford's competitor Chevrolet took a different approach. They used generic manufacturing equipment that could produce parts for several models at the same time. This also enabled frequent updates of the car models. Innovations of the 1920s, such as motors with electric starters that require no hand crank, made the Model T appear increasingly outdated.

In 1927, Ford finally introduced a successor: the Model A. However, since the production processes were tailored towards the Model T, the changeover proved difficult. The River Rouge site came to a standstill for six months until production could slowly be resumed [Hou85]. In the following years, Ford gave up the model of the highly integrated manufacturing site. Today, like all automotive companies, Ford operates a range of production sites, each specialized in a range of parts (e.g. the internal combustion engine) that are used across several car models. The final assembly lines, too, can switch between car models to adjust to changing customer demands. All production sites are connected by logistics networks and rely heavily on external material and component suppliers.

Ford's change from an integrated manufacturing site to a network of interconnected production sites is exemplary for the evolution of many manufacturing industries. Initially, customized craft production is replaced by mass production, resulting in large efficiency gains and opening up new markets. When the markets are divided up, companies diversify their product portfolios to cater to individual customer groups. This leads to smaller order sizes, reducing the efficiency of the mass production approach. New methods have been developed to enable customization without losing all the efficiency gains of mass production. Among the most popular are Lean Production [Ohn88] (also known as the Toyota Production System) and the Just in Sequence (JIS) inventory strategy [WS11]. They enable production sites to reduce the minimum order size that is still economical to produce. Sometimes lot sizes are reduced to the absolute minimum: a single customized product.

While automotive brands rely on a network of suppliers, these relationships are relatively stable. Building up the capacity to produce a specific part in high quantities and in the expected quality takes time and investment in automated processes. This only makes economic sense when years of high demand are expected. For many retail goods, a similar division of labour happens in a complex supply chain. But these relationships can be established and dissolved literally overnight. Cheap long-range shipping, the reduction of tariffs and easy communication have enabled world-wide supply chains. Take the example of global supply chains in the apparel industry [Ger99]. The supply chain for a cotton shirt comprises the provision of raw cotton, spinning of the yarn, dyeing of the yarn, weaving of the textile, cutting and sewing to make the shirt, design printing and stitching of brand logos. All of these production steps are typically executed by different companies in different countries. And the supply chain is dynamically reconfigured for individual orders based on availability and price. A shirt bought today may have taken a wholly different route around the world than the same shirt bought a week earlier in the same store [Chr00]. Many apparel retailers even forego central warehousing for their stores. Instead, products are delivered directly from the last link in the supply chain to the store. So the retailers do not bind capital in stock for an entire season. And they can react within weeks to data showing good or bad sales of a specific product [CM15]. On the downside, most western apparel brands do not control their supply chain and rely on sourcing agents from overseas.

The biggest sourcing agent is Li&Fung Limited. Operating out of Hong Kong, Li&Fung describes its core business as "managing the supply chain for high volume, time sensitive goods" [FFW07]. Li&Fung owns no factory, warehouse or inventory and can still source nearly any retail good. Its database contains factories throughout Asia with their support for different manufacturing processes, available capacity and logistics options, as well as the availability and prices of raw materials and commodity components. Orchestrating the supply chain is a profitable business. In 2015, Li&Fung achieved a turnover of \$18.8 billion and a healthy \$2.2 billion profit [Lim15]. The leverage the sourcing agent has over the supply chain is increasingly seen as critical by the companies who rely on its services. In 2015, Wal-Mart announced a plan to reduce its reliance specifically on Li&Fung. Relying too much on a single provider had become a strategic weakness for the world's biggest retailer [WS15].

So why are dynamic reconfigurations of the supply chain possible for the apparel industry, but not for automotive? In the apparel supply chain, suppliers provide generic access to manufacturing processes that can be used for many different customers with low changeover costs and fast production ramp-up:

• Standardized commodity goods enable a high degree of automation. For example, the objective of textile plants is to run their power looms with as little downtime as possible. The automated equipment allows the configuration of different product types: a Jacquard loom, for example, can be configured with different weaving patterns. The differences between product types can then be handled entirely by the automated production system.


• On the other end of the spectrum, the cutting and sewing of apparel is highly dependent on human labour. Lot sizes are generally smaller and the type of product can change drastically between orders (e.g. switching from jeans to dress shirts). Handling of pliable textile material is difficult for automated equipment and requires custom machines and long changeover times. This makes automated equipment uneconomical as long as cheap human labour is available in overseas countries.

Globalization and the decentralization of supply chains lead to increased requirements for flexibility in production [Abe+06]. In practice, however, flexibility in production is a goal that conflicts with efficiency. Automated systems provide the increased efficiency required for mass production. But they require considerable investment for the initial setup and for the changeover between products. In recent years, many countries with a large industrial base have set up research programmes to renew industrial production through the increased use of information processing and communication technology. Among these programmes are "Industrie 4.0" in Germany [KWH13], "Made in China 2025" [Ken15] in China and "Industrie du Futur" [FD16] in France. A major part of these efforts is the creation of new automation technology that improves on the tradeoff between efficiency and flexibility in production.

## **1.2 The Structure of Automated Production Systems**

The vast majority of control systems for production systems are organized as a hierarchy. This is true both for discrete manufacturing and for continuous production processes (e.g. chemicals, pharmaceuticals, beverages). Figure 1.1 shows the automation pyramid, a frame of reference for the hierarchical design of most automated systems in discrete manufacturing. Figure 1.2 depicts the typical automation hierarchy from the process industry (e.g. chemicals and pharmaceuticals). Decisions are made hierarchically and data is aggregated more and more as it is forwarded to the upper levels of the automation hierarchy. Hierarchical control follows the principle of subsidiarity, where upper levels make high-level decisions that are gradually refined as they are passed down the automation hierarchy. Subsidiarity is a necessary consequence of the fact that information is aggregated as it moves up the automation hierarchy. The models used for decision making in the upper levels become increasingly coarse and abstract away low-level details. But the low-level details have to be considered eventually. The lower levels in the automation hierarchy take the decisions from one level above and "fill the gaps".

**Figure 1.1:** The automation hierarchy in discrete manufacturing. The IEC 62264 / ISA-95 standard does not include the enterprise level. It was added here to include interfaces to systems outside the shopfloor.

In many systems, the control levels are tightly coupled. Changes to a component of a production system usually require changes in many adjacent systems, both vertically and horizontally. The subsystems are interwoven, and implicit assumptions about adjacent systems are represented only in custom control code. This makes modifications to automated systems costly and time-intensive. Reducing this effort is the focus of an entire research community.

The remainder of this section gives an overview of the most important planning and optimization methods for decision-making on the different levels of the automation hierarchy. While it is not possible to provide a complete enumeration, the examples from this chapter set the frame for the modeling and planning techniques that are introduced later on.

**Figure 1.2:** Typical automation hierarchy of a chemical plant [Sko04].

#### **The Control Level**

Modern control theory and its implementation on computers can be traced back to work done at MIT in the 1940s. There, Norbert Wiener first coined the term "cybernetics" for the conjunction of control and communication [Wie48]. At the same time, project Whirlwind, conducted at Jay Forrester's Servomechanisms Laboratory, developed digital "feedback control" for numerically controlled (NC) manufacturing processes [Rei91]. Both modern Programmable Logic Controllers (PLCs) and the application of feedback control methods on digital computers descend from this work in a direct lineage. The fundamental difference between the two lies in the type of decisions they have to make. Programmable Logic Controllers are generally driven by a state machine with discrete transitions or events. Feedback (optimal) control is mostly concerned with physical systems with continuous dynamics.

**Figure 1.3:** Example for a PLC program in ladder logic

**Programmable Logic Controllers** Programmable Logic Controllers (PLCs, see [Joh87; Wal12]) are directly interfaced with a physical process via sensors and actuators. In the automation of discrete manufacturing, PLCs couple the physical system with digital control and communication. This coupling requires custom program code to accommodate the specific details of the physical system and its intended functionality. As the control level often deals with safety-critical functionality, hard bounds on realtime reactiveness have to be guaranteed. The IEC-61131 languages [Com93] are the standard for PLC programming and mandate a programming style that is idiomatic to industrial controllers. Figure 1.3 shows an example of one of the IEC-61131 languages, ladder logic, which is directly descended from the analog circuits originally used for industrial control before the PLC.

The logic coded into a PLC is mostly reactive. Sensor inputs are read and translated into actuation commands. This loop is repeated at a fast pace (many hundreds of hertz) for realtime operation. Lengthy planning procedures do not usually fit into the constraints of the control loop in a PLC. The IEC-61499 standard is intended as a modernization of IEC-61131 [Vya11]. It adds an event-based control flow to the strictly cyclic operations of previous PLC generations. But even with IEC-61499, PLC-based control is mostly reactive. Computationally expensive planning is generally not performed in a safety-critical control environment.
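To make the cyclic, reactive character of PLC control concrete, the following is a minimal sketch of a scan loop. The I/O functions, the control logic and the 2 ms cycle time are illustrative assumptions; a real controller executes IEC-61131 code on dedicated realtime hardware.

```python
import time

# Minimal sketch of a cyclic PLC-style scan loop. The process-image
# functions and the cycle time are hypothetical placeholders.
CYCLE_TIME = 0.002  # 2 ms scan cycle, i.e. 500 Hz (illustrative)

def read_inputs():
    """Placeholder: read the input process image from sensors/fieldbus."""
    return {"start_button": True, "part_present": False}

def control_logic(inputs, state):
    """Purely reactive logic: translate inputs into outputs, no planning."""
    state["running"] = inputs["start_button"] and not inputs["part_present"]
    return {"conveyor_motor": state["running"]}

def write_outputs(outputs):
    """Placeholder: write the output process image to the actuators."""

state = {"running": False}
for _ in range(1000):  # a real controller loops indefinitely
    cycle_start = time.monotonic()
    write_outputs(control_logic(read_inputs(), state))
    # Sleep for the remainder of the cycle to keep a fixed scan rate.
    time.sleep(max(0.0, CYCLE_TIME - (time.monotonic() - cycle_start)))
```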

Automated production systems are typically integrated from components that come with their own control hardware. The integration requires communication between the individual controllers and custom control software to react to cyclically transmitted status messages and events. On the control level, traditional fieldbuses are still common today [Zur14]. Fieldbuses are even mandatory if safety-critical functionality relies on digital communication between controllers or between a PLC and field devices with sensors and actuators. The programming of PLCs is often finished on-site as part of the integration of system components. As many automation systems are custom solutions, code reuse for PLCs is difficult even if systems are built from standard components. This leads to a tight coupling that also increases the time required to make changes in an existing system.

**Feedback Control** The canonical definition of an optimal control problem is as follows [Lib11]:

$$
\dot{x} = f(t, x, u), \quad x(t\_0) = x\_0 \tag{1.1}
$$

The system dynamics is described by an ordinary differential equation (ODE) $f$. The system state at time $t$ is $x(t) \in \mathbb{R}^n$ and its evolution depends on the control input $u(t) \in \mathbb{R}^m$. The initial condition is given by $x_0$. A cost functional for the state evolution assigns costs to the state and control effort between times $t_0$ and $t_f$, plus an additional terminal cost on the final state $x(t_f)$.

$$C(u) = \int_{t_0}^{t_f} L(t, x(t), u(t))\, dt + K(t_f, x(t_f)) \tag{1.2}$$

The problem of optimal control is to compute $u^* = \arg\min_u C(u)$. In this general framework, computing $u^*$ is a variational problem, as $u$ is a (vectorial) function over time [Lue69]. A popular approach is to discretize the time domain $T = \{t_0, t_1, \ldots, t_f\}$ so that computing $u^*$ becomes the problem of finding the sequence of control inputs $u$ that minimizes $C$. Applying the resulting $u$ blindly until $t_f$ is called *open-loop* control. Repeating the optimization after every time period with updated state information is called Model Predictive Control (MPC) or Receding Horizon Control [ML99; Mac02].

Traditionally, feedback control has been implemented as analog electrical circuits. But these are restricted to relatively simple solutions, such as PID controllers [Ben93]. With the increase in computational power available in control devices, Model Predictive Control (MPC) has become possible for many applications. Here, the control problem is stated as an optimization problem that is solved repetitively at a high frequency to incorporate sensor measurements for "feedback" control. Optimal control as an optimization problem originates from the association of dynamical systems with control inputs.
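The receding-horizon scheme can be stated compactly in code. The following is a minimal sketch; the discretized double-integrator dynamics, the quadratic costs and the horizon length are assumptions chosen purely for illustration, and a production MPC would use a structured solver instead of a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Discretized double-integrator dynamics x' = Ax + Bu; the dynamics,
# costs and horizon below are illustrative, not taken from the thesis.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.005, 0.1])
HORIZON = 20

def rollout_cost(u_seq, x0):
    """Sum of stage costs L(t, x, u) plus terminal cost K(t_f, x(t_f))."""
    x, cost = x0, 0.0
    for u in u_seq:
        cost += x @ x + 0.1 * u * u   # stage cost L
        x = A @ x + B * u             # system dynamics f
    return cost + 10.0 * (x @ x)      # terminal cost K

def mpc_step(x0):
    """Solve the finite-horizon problem, return only the first input."""
    res = minimize(rollout_cost, np.zeros(HORIZON), args=(x0,))
    return res.x[0]

# Receding-horizon loop: re-optimize after every measurement update.
x = np.array([1.0, 0.0])
for _ in range(30):
    u = mpc_step(x)
    x = A @ x + B * u   # in practice: the measured next state
print(x)                # state regulated towards the origin
```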

## **The Operations Level**


or a buildup of residual oil from the metal coating in the stamp tool. It is often up to a skilled process expert to adjust the process parameters at runtime to ensure the required performance.

### **The Plant Management Level**

On the plant management level, an entire shopfloor is considered. Typically the planning horizon is between several hours and several days of operation [She03].


### **The Enterprise Level**

**Enterprise Resource Planning** Enterprise Resource Planning (ERP) describes a class of software systems to assist enterprise-wide management [Jac+07]. The ERP products with the highest market share are SAP ERP (previously SAP R/3) and the Oracle E-Business Suite [Gar18]. In many aspects, ERP functionality mirrors tasks performed at the plant management level. But the timeframes are generally much longer and several production and warehousing sites are considered jointly. Enterprise-wide optimization (EWO) aims at optimizing the operations of supply, manufacturing and distribution activities of a company to reduce overall costs and inventories [Gro05].

**Supply-Chain Management** Supply-Chain Management (SCM) [Ali05; GF08] is concerned with production scenarios that include several layers of suppliers. SCM is most common in industries where suppliers deliver specialized parts rather than commodity products. The integration with suppliers is often very tight. This enables the reduction of buffer storage at the production site through just-in-time and just-in-sequence delivery. The so-called bullwhip effect [LPW97] describes how small fluctuations in customer demand lead to large fluctuations in demand at suppliers that are several tiers removed. A big motivator for digitalisation and information sharing in the supply chain is the reduction of the bullwhip effect by enabling better forecasts for the suppliers.

## **1.3 Approaches for Flexible Production Systems**

This section discusses approaches to render automated production systems flexible. The possibility of flexible production is one of the driving motivations for Industrie 4.0 [Wey+14]. The scope of this thesis is automated production. Organizational methods that focus on the human element in production, such as Lean Manufacturing [Ohn88], are therefore not considered in depth. Several authors have developed frameworks to characterize flexibility in production systems [Bro+84; BS88; GG89; SS90; Ger93; DT98; Wie+07].

The scientific literature uses specific terms to describe flexibility in a production context. See Figure 1.4 for a common nomenclature by Wiendahl [Wie+07]. Even though specific terms exist, the term flexibility is deliberately used with its colloquial meaning in this thesis: the models and planning algorithms developed here apply to all levels of the automation hierarchy, whereas the specific terms for flexibility from the scientific literature are mostly tied to one level of the automation hierarchy. We aim to avoid misunderstandings by not using the specific terms outside of their commonly understood definitions.

**Figure 1.4:** Classes of manufacturing changeability from [Wie+07].

**Service-Oriented Production Systems** The principle of service-orientation is used in computer science to develop system architectures where components are loosely coupled [Mac+06]. A specific service provider can be exchanged as long as the interfaces for interaction remain identical and the underlying functionality is still provided. Discovery mechanisms are used to find and select appropriate service providers. In the context of Industrie 4.0, service-orientation is regarded as an enabler for future control system architectures that dissolve the classical automation hierarchy. See Figure 1.5 for a popular depiction. The DIN SPEC 16593-1 standard [DIN18] defines a reference model with basic principles for service-based architectures in the context of Industrie 4.0. This is the common basis for technical realisations of the vision from Figure 1.5.

**Figure 1.5:** Decomposition of the automation hierarchy with distributed services [Mes13; Mon14].

A range of research projects has translated service-orientation to production control. The authors of the SOCRADES project [JS05; De +08; Cân+11] and Shen et al. [She+07] develop a service-oriented manufacturing system architecture where semantic technologies are used to match possible providers of functionality in a manufacturing system. Loskyll et al. [Los+11; Los+12] expand the concept of semantic service discovery to the parameterization and orchestration of services. For this, they develop a domain-specific ontology for use in semantic reasoning tools. Puttonen et al. [PLM13] describe a set of specialized web services for composing and invoking semantically enriched automated procedures in a manufacturing setting. They also present an algorithm to identify the steps required to reach a predefined goal state. Dürkop et al. [Dür+14] discuss the use of service-oriented architectures in reconfigurable manufacturing systems (see Figure 1.4) and the technical challenges that need to be overcome. [SZW17] use a model-based approach for the service development and a modular architecture to reduce the complexity of service-based production systems. [LV15] combine service-oriented manufacturing control with a multi-agent architecture. The Smart Factory Web testbed in the Industrial Internet Consortium (IIC) uses web services for planning and control in global supply chains [Jun+17].

**Agent-Based Production Systems** An even more radical departure from the classical automation hierarchy is investigated with agent-based distributed control of production systems. Agent-based systems in manufacturing and logistics are the topic of a dedicated research community that has been active since the 1980s [DP87; LS92]. The survey papers [MVK06; LK08; Lei09; LMV13; LK15] give an account of the history of agent-based control and an overview of the focus of current work. Notably, the IEEE-IES Technical Committee on Industrial Agents (TC-IA)¹ brings together researchers on an international level.

Software frameworks have been developed to assist the development of agent-based systems, for example the well-known JADE project [BCG07]. The frameworks for software agents, however, do not provide abstractions specific to the production domain and are used for the development of distributed software systems in general.

The core challenge of agent-based control is the coordination of individual decision making across agents. The remainder of this paragraph discusses the most common approaches. A common coordination mechanism for agent-based control is negotiation [ZR89]. The Contract Net Protocol (CNP) [Smi80] replaces the market with a negotiation scheme in order to decompose and distribute tasks between agents. The CNP has been applied to agent-based manufacturing systems in a range of research projects and industrial installations [Par87; LL94; SKB97; Oue+99]. While the CNP is mostly used for greedy decision making, other authors have integrated scheduling theory with agent-based control [SWH06; Agn+14; Bad11]. Other coordination mechanisms are nature-inspired and derived from the behavior of animals [XL08].
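The announce-bid-award cycle of the CNP can be illustrated with a short sketch. The agent names and the cost model below are invented for illustration and do not follow a specific implementation from the literature.

```python
import random

# Minimal sketch of one Contract Net Protocol round [Smi80]: a manager
# announces a task, contractors answer with bids, and the task is
# awarded to the cheapest bidder. Names and costs are hypothetical.
class Contractor:
    def __init__(self, name, workload):
        self.name, self.workload = name, workload

    def bid(self, task):
        """Bid with the estimated completion cost (assumed cost model)."""
        return self.workload + task["effort"]

    def execute(self, task):
        self.workload += task["effort"]
        print(f"{self.name} executes {task['id']}")

def contract_net_round(task, contractors):
    """One announce-bid-award cycle; returns the awarded contractor."""
    bids = [(c.bid(task), c) for c in contractors]  # announcement + bidding
    _, winner = min(bids, key=lambda b: b[0])       # award to cheapest bid
    winner.execute(task)
    return winner

agents = [Contractor(f"machine-{i}", random.uniform(0.0, 5.0)) for i in range(3)]
contract_net_round({"id": "order-42", "effort": 2.0}, agents)
```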

Holonic production control is a special case of agent-based control. The term *holon*, originally coined in [Koe68], refers to systems made up of components that encapsulate both physical assets and virtual functionality [GLK98; Fis99; MB00]. The key idea is that the system components are themselves holons. This goes beyond the usual system-of-systems approach, as holons are self-similar in the sense that the structure and functionality of the constituent parts are governed by the same principles as their parent. Taken to its extreme, this self-similarity in manufacturing systems has led to the concept of the fractal factory [War93].

¹https://tcia.ieee-ies.org/

**Figure 1.6:** Building blocks of a holonic manufacturing system according to the PROSA reference architecture [Van+98].

The PROSA project has proposed a reference architecture for holonic manufacturing systems [Van+98]. See Figure 1.6 for the building blocks defined by PROSA.

What is currently lacking in the field of agent-based and holonic manufacturing control are widely used benchmark scenarios to quantitatively compare the proposed coordination mechanisms. Compared to other scientific fields, this has led to many competing approaches without a clear winner and to uncertainty about how well the different approaches can cope with aspects outside of their original scope, for example when unforeseen events are introduced in a stochastic environment. For practitioners, this has led to an overwhelming range of choices. For researchers, years of effort have so far not amalgamated into a unified theory of agent-based production control.

### **Plug and Produce**

The idea of Plug & Produce is derived from the plug-and-play functionality known from the USB interface for computer hardware. There, well-known device classes with standardized functionality remove the need for custom software drivers for the hardware integration. Arai et al. [Ara+00] first translated plug-and-play to the production domain and coined the term Plug & Produce. Onori et al. [Ono+12] use the concept for a self-configuring assembly system at the shop-floor level.

The integration of machines and equipment with Plug & Produce encompasses the following aspects: First, basic connectivity is established for an existing (industrial) communication infrastructure [Dür+12; Rei+10]. Second, the new component announces its presence to a central controller or directly to the adjacent components with a discovery mechanism [Pro+17]. Third, in production there exists a wide range of machines and equipment. This heterogeneity cannot be reduced to a small number of device classes. A way to enable Plug & Work scenarios in the face of device heterogeneity is the use of self-description languages for the integration [OHN14; Sch+15a]. The fourth and most challenging aspect is the functional integration. Lepuschitz et al. [Lep+11] show the reconfiguration of manufacturing resources based on distributed IEC 61499 function blocks and a semantic description of the manufacturing setting and the expected behavior.

Many of the published Plug & Produce implementations use a dedicated interconnector module that acts as a facade for manufacturing equipment: it provides a uniform interface and generates the low-level commands for the underlying device [NWS07; Dor+17].

One approach for the functional integration in Plug & Produce is the modeling of the skills of technical equipment. For this, see the review of the state of the art in Chapter 5.

## **1.4 The Missing Hierarchy of Production Theories**

Scientists and engineers use models on the level of abstraction that is most useful for the phenomena under investigation [Gie04]. It often occurs that a more accurate model is available in principle. But working with the increased level of accuracy would overburden the analysis with unnecessary complexity. For example, an electrical engineer laying out the power grid of a city will not use Maxwell's equations for a power-flow study. Many technical fields have arranged these model approaches (theories) in a hierarchy. This hierarchy has evolved over time and – in the natural sciences – its development is closely related to the process of scientific discovery [Kuh62; Car84]. As an example, Figure 1.7 shows the model hierarchy established in the field of optics. If some phenomena cannot be explained, one can resort to a more detailed (and computationally or analytically more expensive) model until the first principles from Maxwell's equations and quantum physics are reached [MW59].

**Figure 1.7:** The model hierarchy in optics. (Reproduced from http://www.argmin.net/2018/01/25/optics.)

Different academic fields have produced "theories of production", for example economics [Sch34; Dan66; She71], business administration [Sch86; Fär88] and production management [Dyc06]. Around the year 2000, prominent authors called for a unified theory of production that provides a common foundation integrating the existing results [Dyc03; Sch04; WNH10]. Recent years have seen a range of proposals to fulfill this need [NW10; Sch+11; Sch+15b]. The authors of [Sch+17] provide a comprehensive review and classification of such production theories. But the cited work remains mostly conceptual and does not model the detailed control of automated systems on the lower levels of the control hierarchy.

Contrasting the model hierarchies in other academic fields with the control hierarchy in automation, one could aim for a high-fidelity model at the bottom layers that is abstracted more and more as one moves up the control hierarchy. While that is the case, we want to stress that the underlying modeling principles should stay the same for all levels of the control hierarchy.

The work on a theory of production is not only of academic interest but also relevant to practitioners who have to cope with the increasing complexity of production systems. Currently, each layer of the control hierarchy works with dedicated system models that are tailored to the tasks at hand. But the models are so task-specific that they become mutually incompatible. This is problematic at vertical as well as horizontal interfaces of the control hierarchy. Assumptions about the behavior of adjacent components are implicit in the custom rules for the interaction between subsystems and encoded in custom control program code. The lack of a common core to translate between subsystems leads to constant manual effort for system integrators and to a loss of flexibility due to the task-specific hard coupling of components.

## **1.5 Scientific Contributions and Thesis Organization**

The original title at the very beginning of the work leading to this thesis was "Agent-based production control using the paradigm of self-organisation in the context of Industrie 4.0". The path to the final thesis led through scientific fields that are not directly associated with that original title, for example Probabilistic Graphical Models, Convex Optimisation, Description Logics, Reinforcement Learning, and many more. Some of these detours did not pay off as intended. But many did. While not always visible, the thesis has stayed true to the original topic, and the results and techniques from the different fields finally tie together into one body of work.

The thesis is organized into six chapters. See Figure 1.8 for an overview and recommended reading order. The scientific contributions are closely related to the thesis structure. In general, prior results are summarized in dedicated "background" sections. Beyond the background sections, all results without an explicit reference to the literature were developed as part of this thesis. A summary of the key developments follows.

**Figure 1.8:** Structure and recommended reading order of the thesis.

**Chapter 2: A Model of Concurrent Production Systems** A novel model representation for production systems is introduced. The goal of the model is to provide a uniform representation of both discrete and continuous production processes on all levels of the automation hierarchy. To achieve this, the model combines the following aspects:


The following four chapters aim to show that the model representation not only presents a common core for higher-level models but is also practical to use in flexible production systems. For this, a tailored planning algorithm is developed, further scaling of planning is achieved by exploiting additional model structure, the planning algorithm is extended to distributed planning by independent agents, and finally models of the skills of machines and equipment are used as high-level abstractions from which executable low-level representations are derived for runtime control.

#### **Chapter 3: Simulation-Based Planning for Concurrent Production Systems**

The production system model implies a planning problem: maximizing the expected reward that is generated. Existing planning algorithms from the production domain do not apply to the model from Chapter 2: they make limiting assumptions on the planning problem structure that no longer hold. To solve the planning problem, an algorithm based on Monte-Carlo Tree Search (MCTS) is developed.
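For orientation, the following sketches the generic MCTS scheme (selection with a UCT rule, expansion, rollout, backpropagation) that Chapter 3 builds on. The problem interface (actions, step, reward) is an assumption for illustration; the extensions developed in the thesis are not shown.

```python
import math
import random

# Generic MCTS skeleton with the UCT selection rule; a sketch of the
# textbook algorithm, not the extended variant developed in Chapter 3.
class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCT upper confidence bound."""
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(root_state, actions, step, reward, iterations=1000, depth=20):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            node = uct_select(node)
        # 2. Expansion: try one action that has no child node yet.
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Rollout: random simulation to a terminal/horizon state.
        state = node.state
        for _ in range(depth):
            acts = actions(state)
            if not acts:
                break
            state = step(state, random.choice(acts))
        outcome = reward(state)
        # 4. Backpropagation of the simulated outcome.
        while node is not None:
            node.visits += 1
            node.value += outcome
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

# Toy usage: reward counts how often action "a" was chosen within 5 steps.
acts = lambda s: ["a", "b"] if s[1] < 5 else []
step = lambda s, a: (s[0] + (a == "a"), s[1] + 1)
print(mcts((0, 0), acts, step, lambda s: s[0], iterations=300))  # -> 'a'
```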


**Chapter 4: Distributed Planning for Self-Organizing Production Systems** The planning complexity depends on the number of individual components that partake in the system. By reducing the planning scope to a subset of the system components, it is much easier to explore the solution space and converge to good solutions. We develop an algorithm that combines Monte-Carlo Tree Search with message passing approaches known from Belief Propagation. Thereby, the system can be compartmentalized into individual agents who coordinate their actions. The agent coordination mechanism is shown to improve the planning solution quality compared to uncoordinated individual planning. The same implementation of our novel planning algorithm can be used to solve hitherto separate planning and optimization concerns from all levels of the automation hierarchy.


**Chapter 5: Modeling of Production Skills** In order to achieve flexibility in automation with runtime planning, an accurate system model is required. It can be quite resource-intensive to keep the physical system and its model representation synchronized, especially if new manufacturing operations are frequently introduced in a flexible production environment. To reduce this effort, higher-level descriptions are developed from which low-level representations for runtime control are generated. We introduce a formal model for the technical skills of components in an automated system, based on semantic modeling and deductive inference in second-order logic.

The modeled skills are used to generate executable action representations for specific operations. The generation of executable actions takes as input the capabilities of the system components, the topological layout of the system and a description of the requested operation. The capability model is therefore used as a high-level descriptive language that can be compiled into executable actions for the participating system components. The generated action representations contain all preconditions and effects required for detailed planning.

Already published results that were created in preparation for this thesis are:


The following publications are technical reports or publications in adjacent fields. They contributed to the thesis indirectly by using similar techniques or by working on implementation technologies in industrial automation (e.g. OPC UA and AutomationML) that are relevant for bringing the results of this thesis into practice.


# **2 A Model of Concurrent Production Systems**

*The value of [Lagrange's book "Mécanique Analytique"] consists in the exposition of a general method by which every mechanical question may be stated in a single algebraic equation. The entire history of any mechanical system, as for example, the solar system, may thus be condensed into a single sentence.*

*Robert S. Woodward [Woo95]*

This chapter introduces a model for production systems that combines continuous and discrete system dynamics with concurrency, i.e. parallel operations and the synchronization of system components. The model is intended as the basis for a "theory of production systems". For this, the model needs to be able to represent the system dynamics on all levels of the control hierarchy. Examples from different levels of the control hierarchy are used for demonstration and to substantiate this claim. The core postulate of this chapter is the following:

> *The same set of modeling principles can represent the relevant properties of production systems on all levels of the control hierarchy.*

## **2.1 State, Actions and Action Sequences**

**Definition 2.1.** *A system is a set of components c* ∈ *C.*

Examples for components on a manufacturing shopfloor are machines and logistics equipment, such as forklifts. Depending on the level of abstraction of the model, components can also represent parts inside a machine, as well as entire production plants and warehouses. Assume for now that components do not contain other components in a hierarchy.

**Definition 2.2.** *Each component $c$ has a state $s \in S_c$. The set $S_c$ contains all possible states of the component $c$.*

The global state space $S = \times_{c \in C} S_c$ is the Cartesian product of the possible states for every component. A global state is a vector $s \in S$ with elements $s_c$. The state space of a subset of the components $Q \subseteq C$ is $S_Q = \times_{c \in Q} S_c$. The projection $\Pi_Q(s) = (s_c : c \in Q)$ extracts the state of the components $Q$ from the global state. In general, a subscript denoting a set of components indicates projection, so $s_Q = \Pi_Q(s)$. The inverse projection is $\Pi_Q^{-1}(s_Q) = \{u \in S : \Pi_Q(u) = s_Q\}$.

**Example 2.1.** Consider the ABB IRB140 robotic manipulator from Figure 2.1. The green sphere represents the Tool Center Point (TCP). The TCP has six degrees of freedom (three each for translation and rotation). But some positions are not reachable due to the physical constraints of the manipulator. The constraints are encoded in the set of reachable positions $\Psi \subset \mathbb{R}^6$. In addition, the IRB140 can be fitted with different tools. In this example, the possible tools are $\Upsilon = \{\text{gripper}, \text{welder}, \text{drill}, \text{none}\}$. The overall state space of the manipulator is $S_{\text{irb140}} = \Psi \times \Upsilon$.

**Figure 2.1:** IRB140 robotic manipulator with inverse kinematics control (Coppelia V-Rep).

Many manufacturing operations define trajectories where the manipulator starts and finishes at fixed locations. As an alternative to modeling a continuous state space, one can limit the possible positions to a discrete set of predefined positions. A continuous configuration space for the position is however better suited to model the physical movement dynamics of the robot.

Components can change their state over time. The following nomenclature is taken from [Fuj98]: *Physical time* refers to time in the physical system that is represented by the model. *Simulation time* refers to a time representation, for example a real value that corresponds to a physical clock by scaling and an epoch-date for the origin. *Wallclock time* refers to the time when the simulation is executed. In distributed systems, it is generally impossible to assign an absolute order to events [Lam78]. We make the simplifying assumption that clocks are synchronized to absolute precision. So the physical time of the system components is always identical. In the model representation, however, the simulation time can differ between components. That is, components can evolve their state independently from one another until synchronization forces their simulation time to coincide again. If not stated otherwise, time refers to simulation time in the remainder of the text.

**Definition 2.3.** *The time-indexed state of a component c is*

$$
\sigma = (s, t) \in \Sigma\_c, \quad \Sigma\_c = S\_c \times \mathbb{R} \,. \tag{2.1}
$$

*It indicates the state $s$ of the component at time $t$. The time $t$ represents the offset from some epoch-date in seconds.*

The time-indexed state of the entire system is $\sigma \in \Sigma$ for $\Sigma = \times_{c \in C} \Sigma_c$. In the context of a system state $\sigma$, the time-indexed state of a component $c$ is referred to as $\sigma_c = (s_c, t_c)$. Again, subscript-based notation for projection refers to the definitions from the surrounding context: $\Sigma_Q$ denotes the joint time-indexed state space for a subset of the components $Q \subseteq C$ and $\sigma_Q = \Pi_Q(\sigma)$ is the projection of a global time-indexed state into $\Sigma_Q$.

Components change their time-indexed state by executing actions. In the model, actions skip the time-indexed state ahead to the time after the execution. Components have no well-defined state during the execution of an action. If the model is linked to a physical instance of the system, then the relation between simulation time and physical time is as follows. If the time $t_c$ of some component $c$ is earlier than or equal to the physical time, then the component has been idle since $t_c$ and is immediately available. If $t_c$ is later than the physical time, then the component is occupied with the execution of one or more actions until $t_c$.

**Definition 2.4.** *Actions are possible state transitions, represented as a tuple*

$$a = \left(C\_a, \bar{\Sigma}\_a, \mathbf{e}\_a, d\_a\right). \tag{2.2}$$

*The tuple describes the action via its participating components, preconditions and effects. It consists of*

• *the set of participating components $C_a \subseteq C$,*

• *the set of feasible time-indexed initial states $\bar{\Sigma}_a \subseteq \Sigma_{C_a}$, which encodes the preconditions,*

• *the effect function $\mathrm{e}_a$, which maps an initial state of the participating components to their resulting states, and*

• *the duration function $d_a$, which gives the execution durations for the participating components.*
The set of feasible time-indexed initial states $\bar{\Sigma}_a$ encodes the preconditions of the action $a$ for the participating components. The shorthand notation for the state of the participating components is $\sigma_a = \Pi_{C_a}(\sigma)$. The feasible global initial states for the action $a$ are $\Sigma_a = \{\sigma \in \Sigma : \sigma_a \in \bar{\Sigma}_a\}$. Every action is an operator on the time-indexed global state $a : \Sigma_a \to \Sigma$.

From the initial global state $\sigma$ it follows that the earliest possible starting time of the action $a$ is $t_a^{\text{start}}(\sigma_a) = \max_{c \in C_a} t_c$. Let $\sigma' = a(\sigma)$ denote the global state following the execution. The new global time-indexed state is then comprised of the elements

$$\sigma\_c' = \begin{cases} \left( e\_{a,c}(\sigma\_a), \ t\_a^{\text{start}}(\sigma\_a) + d\_a(\sigma\_a) \right), & \text{if } c \in C\_a \\ \sigma\_c, & \text{else} \end{cases} \tag{2.3}$$

Equation 2.3 defines the operational semantics of all actions. It implies the *Markov assumption*: The outcome of an action depends only on the previous system state. Components that are not participating in an action do not partake in the preconditions and their state is left unchanged as well.

Equation 2.3 defines how components with different simulation times are synchronized: the participating component with the highest simulation time defines the starting time of the action. The other participating components idle until they join the execution.¹ The feasible initial states $\bar{\Sigma}_a$ can also encode preconditions for the component timing. Consider a physical system whose dynamics is described by a differential equation. Such systems typically cannot idle without any effect on their state. To adequately cover components that cannot idle in the sense of Equation 2.3, the preconditions of $\bar{\Sigma}_a$ can require identical simulation times for all participating components: $\forall \sigma_a \in \bar{\Sigma}_a,\ \forall (c, c') \in C_a \times C_a,\ t_c = t_{c'}$. Alternatively, as $\mathrm{e}_a$ takes the time-indexed state of all participating components as input, the state evolution of the individual components until the starting time of $a$ can be modeled as part of the effect function.

In discrete manufacturing, the major concern is the movement and transformation of products within the system. Many modeling approaches represent products as objects with individual lifecycles [Sal+10] or even as independent agents who negotiate and make independent decisions [KBT17]. To model products as individual objects, they could be represented as system components *c*. But, in this text, we follow a different approach that enables algorithmic improvements for planning later on.

Let $P$ denote the set of different product types, with individual types $\rho \in P$. A product type not only represents marketable output, but also raw material and semi-finished workpieces that occur between production steps. Products are not considered individually, so products of the same product type are indistinguishable. Instead of representing each product individually, it suffices to track the number of products of each product type present at every component. A product is always contained in some component. Take for example a workpiece mounted inside a machine, a crate of material sitting on a forklift, or a finished product stored in a high-bay warehouse.

¹This corresponds to the use of the popular Max-Plus algebra to describe the time-evolution of discrete event systems [Bac+92].

**Definition 2.5.** *Some components can physically contain products. In that case, the component state decomposes into the component configuration $\xi \in \Xi_c$ and a vector $\mathrm{p}$ with the number of contained products for each product type:*

$$s = (\xi, \mathrm{p}) \in S_c, \quad S_c \subseteq \left(\Xi_c \times \mathbb{N}_0^{|P|}\right).$$

For convenience, we denote a vector with just a single product of type $\rho$ as the basis vector $\nu_\rho$. The number of products of type $\rho$ in component $c$ is $(\mathrm{p}_c)_\rho$. A component $c$ with no contained products has $\mathrm{p}_c = \mathbf{0}$, where $\mathbf{0}$ denotes the null vector of appropriate dimensionality.

**Example 2.2.** This example introduces a minimal manufacturing scenario that will be used throughout this text. The scenario, shown in Figure 2.2, is comprised of a machine tool (mt) that mills piston rods out of steel bars, a lattice box (box) and a robotic manipulator that packages the piston rods for transport (manip).

**Figure 2.2:** Minimal scenario for discrete manufacturing

Four actions have been defined for the scenario. The actions produce and package have only one participating component, the machine tool and the manipulator respectively. The actions put and take require the participation of two components to model the transition of a product between them. We do not consider single piston rods, but only orders of 100 parts with the product type order. Every component can only hold a single order-product at once. Therefore, once the machine tool has completed an order, the parts need to be put into the lattice box before the next order can be produced. The manipulator then has to take the parts of the previous order out of the lattice box before the machine tool can put in the next one. Products "appear" and "disappear" when they leave the scope of the modeled system.

The complete definition of the action put is as follows:

• $C_{\text{put}} = \{\text{mt}, \text{box}\}$

• $\bar{\Sigma}_{\text{put}} = \{\sigma_{\text{put}} \in \Sigma_{C_{\text{put}}} : \mathrm{p}_{\text{mt}} = \nu_{\text{order}},\ \mathrm{p}_{\text{box}} = \mathbf{0}\}$

• $\mathrm{e}_{\text{put}}(\sigma_{\text{put}}) = ((\xi_{\text{mt}}, \mathbf{0}), (\xi_{\text{box}}, \nu_{\text{order}}))$

• $d_{\text{put}}(\sigma_{\text{put}}) = (5\,\mathrm{s}, 5\,\mathrm{s})$²
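To make Definitions 2.3–2.4 and Equation (2.3) concrete, here is a minimal executable sketch of the action semantics, instantiated for the put action above. The encoding of component states as product counts and the concrete Python types are illustrative assumptions, not the implementation used in the thesis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Time-indexed component states sigma_c = (s_c, t_c), cf. Definition 2.3.
TState = Tuple[int, float]  # (simplified component state, simulation time)

@dataclass
class Action:
    components: Tuple[str, ...]                        # participating C_a
    precondition: Callable[[Dict[str, TState]], bool]  # encodes Sigma-bar_a
    effect: Callable[[Dict[str, TState]], Dict[str, int]]  # e_a
    duration: float   # d_a (constant here; may be state-dependent in general)

    def apply(self, sigma: Dict[str, TState]) -> Dict[str, TState]:
        """Execute the action according to Equation (2.3)."""
        local = {c: sigma[c] for c in self.components}
        assert self.precondition(local), "precondition violated"
        start = max(t for (_, t) in local.values())  # t_a_start = max t_c
        new_states = self.effect(local)
        out = dict(sigma)
        for c in self.components:                    # only C_a is updated
            out[c] = (new_states[c], start + self.duration)
        return out

# The action 'put' from Example 2.2: move one order from mt into the box.
put = Action(
    components=("mt", "box"),
    precondition=lambda s: s["mt"][0] == 1 and s["box"][0] == 0,
    effect=lambda s: {"mt": 0, "box": 1},
    duration=5.0,
)

sigma = {"mt": (1, 10.0), "box": (0, 0.0), "manip": (0, 0.0)}
print(put.apply(sigma))  # mt: (0, 15.0), box: (1, 15.0), manip unchanged
```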

Actions can be chained to form action sequences. The set of actions $A$ is the alphabet for the free monoid (Kleene star, [HU79]) $A^*$, which contains all words (sequences) of finite length. The empty sequence is written as $\varepsilon$. The sequence elements $w^k$ are indexed as $\mathrm{w} = w^1 w^2 \ldots w^{|\mathrm{w}|}$.

**Definition 2.6.** *An action sequence $\mathrm{w} \in A^*$ is itself an action resulting from the composition of the constituent actions.*

For a valid initial state $\sigma \in \Sigma_{\mathrm{w}}$ (defined in the next paragraph), the resulting state after the execution of $\mathrm{w}$ is $\mathrm{w}(\sigma) = (w^{|\mathrm{w}|} \circ \cdots \circ w^1)(\sigma)$ according to the dynamics of action execution from Equation (2.3). The multiplication notation for sequences is preferred to the $\circ$-notation for function composition to keep the notation light, to write the sequence elements in order, and because we will take on an algebraic perspective on composition later on. Note that, due to the way concurrency is represented with the time-indexing of component states, an action at a later sequence index may actually start at an earlier simulation time than one of its predecessors in the sequence. But this is possible only if the two actions do not share participating components.

The subsequence of the first $k$ elements is $\mathrm{w}_{:k} = \prod_{l=1}^{k} w^l$. The subsequence starting at the $k$-th element is $\mathrm{w}_{k:} = \prod_{l=k}^{|\mathrm{w}|} w^l$. The domain of $\mathrm{w}$ as an operator is $\Sigma_{\mathrm{w}} = \{\sigma \in \Sigma : \forall k \in \{1, \ldots, |\mathrm{w}|\},\ \mathrm{w}_{:k-1}(\sigma) \in \Sigma_{w^k}\}$. This ensures that the preconditions of all actions are satisfied when the sequence is executed in order.

The composition of action operators to a sequence is always possible. But the resulting sequence might be infeasible, with an empty domain $\Sigma_{\mathrm{w}} = \emptyset$.

The set of all action sequences, from now on denoted as $W = A^*$, implies a tree graph. In the context of an initial system state $\sigma$, infeasible sequences are removed from the tree. The feasible sequences starting at $\sigma$ are $W_\sigma = \{\mathrm{w} \in A^* : \sigma \in \Sigma_{\mathrm{w}}\}$. Obviously, the pruned tree is a subset $W_\sigma \subseteq W$ and is still a tree, since for every sequence $\mathrm{w} \in W_\sigma$ all subsequences of the first $k \in \{0, \ldots, |\mathrm{w}|\}$ elements are also contained: $\mathrm{w}_{:k} \in W_\sigma$. After the first action $a$ has been executed, the system state becomes $a(\sigma)$ and the sequence tree becomes $a(W) = \{\mathrm{w} : a\mathrm{w} \in W\}$.

**Example 2.3.** This extends the scenario from Example 2.2. Assume an initial system state σ where no component contains any products. The four defined actions with their preconditions and effects yield a sequence tree *W*<sup>σ</sup> of feasible sequences. All feasible sequences up to five actions are shown in Figure 2.3.

**Figure 2.3:** Sequence tree for Example 2.2.

Note that the order of the actions in the sequence does not require that the start times of the actions have the same ordering. Consider the action sequence (produce put produce take). The last two actions produce and take can start at the same time. It is even possible that an action occurring later in a sequence starts before a preceding action in terms of absolute time.

**Table 2.1:** Evolution of the system state for the sequence (produce, put, produce, take). The component state *s* is simplified and describes only the number of products currently at the component. All actions are assumed to have a duration of five time units.

Action sequences can be the result of a planning procedure and executed only once for the current situation. But sequences can also be used as reusable macro-actions. Many manufacturing systems are organized in lines, within which work-pieces undergo a fixed sequence of actions. In Example 2.3, the sequence (produce put take package) could be such a macro-action. System parts with little flexibility can thus be modeled with comparatively few macro-actions. These can be integrated seamlessly with more fine-grained actions where more behavioral flexibility is required.

## **2.2 Parameterized Actions**

The action definition from Section 2.1 has been accompanied by examples from discrete manufacturing. In the process industry (e.g. chemicals, pharmaceuticals, food and beverages), many decisions have to be made on a continuous domain. In principle, the set of actions *A* could represent continuous decisions with an uncountably infinite set of actions. But then we could no longer maintain explicit representations of every action in computer-based simulations. Instead, we allow actions to be parameterized.

**Definition 2.7.** *The preconditions, effects and durations of a parameterized action a depend on the choice of action parameter θ* ∈ Θ*a.*

$$a^{\theta} = (C\_a, \bar{\Sigma}\_a^{\theta}, \mathbf{e}\_a^{\theta}, d\_a^{\theta}) \tag{2.4}$$

There are no particular restrictions on possible parameter spaces $\Theta_a$. Of course, parameters can also be vectorial or sets, even though we use scalar notation for parameters in general. Actions $a$ that do not define parameters have $\Theta_a = \emptyset$. In that case, the parameter can be omitted in the notation. The following list gives examples for parameters on different scales of measure and sizes of the parameter space.

**Nominal and Ordinal Parameters** In the simplest case, parameterized actions simply group actions that, in some sense, belong together. For example the action paint with categorical parameters $\Theta_{\mathrm{paint}} = \{\text{blue}, \text{green}\}$. There can also be a natural order among the parameters on an ordinal scale, such as $\{\text{cold}, \text{warm}, \text{hot}\}$.


**Example 2.4.** Take the example of a storage tank in a process control setting. The tank can be filled with liquid and drained afterward. The fill level is controlled by a pump and a valve at the bottom of the tank.

**Figure 2.4:** Piping and instrumentation diagram (P&ID) of a storage tank. (Reproduced with permission from https://commons.wikimedia. org/wiki/File:Pump\_with\_tank\_pid\_en.svg.)

Consider the action drain acting on the tank. Let $v \in [0, 1]$ be the control value for the valve and $\tau$ the duration of the action, with $\theta_{\mathrm{drain}} = (v, \tau)$. The valve is closed for $v = 0$ and fully open for $v = 1$. In addition, the liquid flow depends on the pressure at the valve and hence on the fill level described by the state of the tank $s_{\mathrm{tank}} \in \mathbb{R}_+$. The flow of the (incompressible) fluid out of the tank can then be described by a differential equation $\dot{s}_{\mathrm{tank}} = g(s_{\mathrm{tank}}, v)$ according to Bernoulli's principle [Ber38], and $s_{\mathrm{tank}}(t_0)$ is known from the initial state $\sigma_{\mathrm{drain}}$. The effect of the action drain on the tank is

$$e^{(v,\tau)}_{\mathrm{drain}}(\sigma_{\mathrm{drain}})_{\mathrm{tank}} = s_{\mathrm{tank}}(t_0) + \int_{t=t_0}^{t_0+\tau} g(s_{\mathrm{tank}}(t), v)\, dt\ .$$
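The integral can be evaluated numerically in a forward simulation. The following is a minimal sketch, assuming Torricelli's outflow law $\dot{s}_{\mathrm{tank}} = -v\,\kappa\,\sqrt{s_{\mathrm{tank}}}$ (a consequence of Bernoulli's principle) with a hypothetical valve constant $\kappa$ and simple forward-Euler integration.

```python
import math

def drain_effect(s_tank, v, tau, kappa=0.1, dt=0.01):
    """Fill level after executing drain with parameters theta = (v, tau)."""
    t = 0.0
    while t < tau:
        flow = -v * kappa * math.sqrt(s_tank)  # g(s_tank, v), Torricelli outflow
        s_tank = max(0.0, s_tank + dt * flow)  # forward-Euler step, clamped at 0
        t += dt
    return s_tank

print(drain_effect(s_tank=1.0, v=0.5, tau=10.0))
```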

With the introduction of parametric actions, problems from control theory can be represented in the model. In Model Predictive Control (MPC) [ML99], a dynamical system is approximated by discretizing the time domain. The problem of optimal control is then posed as an explicit optimization problem that selects control values for every time period. By taking the (continuous) control decisions as action parameters, optimal control problems can be represented with a single action that encompasses all system components. Control with mixed discrete-continuous control values can be represented either by a more complex parameter space $\Theta$ or by representing discrete choices with different actions. But only with several actions can aspects of concurrency be represented, where the system components are synchronized via joint participation in an action and evolve their state independently otherwise.

So far, the model presented in this chapter unifies the treatment of discrete events, concurrency and continuous system dynamics.

## **2.3 Uncertainty and Observations**

Until now, it was implicitly assumed that actions are deterministic. But virtually no production system actually is. There is always an interaction with a stochastic environment, from logistics influenced by weather conditions to the human operator returning late from a break. In addition, even fully automated manufacturing processes are inherently stochastic. This is illustrated by the fact that very few production processes achieve zero-defects production.² It is desirable to represent this uncertainty also in the model used for control or planning. With an explicit representation of the uncertainty in the evolution of the system state, plans of higher quality can be achieved by optimizing for the expected reward. In general, uncertainty in the model is represented by belief distributions over the system state and actions with probabilistic outcomes.

This section uses the notation of [Hem66]: Underlined symbols denote random variables whose realizations follow some probability distribution. The same symbol may occur underlined and non-underlined. Here, this can be the distinction between a random variable and a sampled realization of the random variable. Another case is the distinction between an action used as an identifier and the same action used to sample stochastic state transitions.

The system state following an initial state $\sigma$ and an action with uncertainty $\underline{a}$ is in effect the realization of a random variable $\sigma' \sim \underline{\sigma}' = \underline{a}(\sigma)$. The probability (density) of the possible outcomes $\sigma' \in \Sigma$ is $\mathbb{P}(\underline{\sigma}' = \sigma')$. We also allow the application of actions to uncertain states with a belief distribution. The resulting state has the belief distribution $\mathbb{P}(\underline{a}(\underline{\sigma}) = \sigma') = \int_\Sigma \mathbb{P}(\underline{a}(\nu) = \sigma')\, \mathbb{P}(\underline{\sigma} = \nu)\, d\nu$. We do not consider the case where $a$ is not applicable to the realization of the initial state $\underline{\sigma}$.

By Equation 2.3, non-participating components $c \notin C_a$ are not affected by the action execution. Actions with uncertainty are not exempt from this rule. All state transitions where a non-participating component changes its time-indexed state via execution of an action $\underline{a}$ must have zero probability.

$$\forall \sigma \in \Sigma\_a, \forall \sigma' \in \Sigma, \forall c \notin C\_a: \sigma\_c \neq \sigma'\_c \Rightarrow \mathbb{P}(\underline{a}(\sigma) = \sigma') = 0 \tag{2.5}$$
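A minimal sketch of a transition sampler that satisfies Equation (2.5) by construction, reusing the hypothetical `Action` type from the earlier sketch; the uniform perturbation of the participants' timestamps is an illustrative assumption.

```python
import random

def sample_transition(state, action, noise=0.5):
    """Draw sigma' ~ a(sigma): only participating components are resampled."""
    successor = dict(state)           # non-participants are copied unchanged
    nominal = action.effect(state)    # deterministic nominal effect e_a
    for c in action.components:
        t_nominal, product = nominal[c]
        # hypothetical noise model: the duration is prolonged by up to `noise`
        successor[c] = (t_nominal + random.uniform(0.0, noise), product)
    return successor
```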

If the state following the execution of an action with uncertainty is not immediately and fully knowable, we have to indirectly infer probabilities over the resulting state via incomplete or noisy observations. This is called the *partially observable* setting [KLC98]. With partial observability, the components *c* ∈ *C* each generate

²The most common reasons for delays and unforeseen events are, according to [VHL03], machine failure, urgent job arrival, job cancellation, due date change, shortage or delay in the arrival of material, change in job priority, rework or quality problems, over- or underestimation of process time, and operator absenteeism.

observations $o_c \in O_c$. An action results in observations from the participating components $O_a = \times_{c \in C_a} O_c$. With a slight abuse of notation, the next state and the observations are both drawn by sampling the action $(\sigma', o_a) \sim \underline{a}(\sigma)$. The distribution of observations conditionally depends on the action and the state transition $\mathbb{P}(o_a \mid \sigma, a, \sigma')$.

Obviously, for every action $a$, the observations $o_a$ are conditionally independent of the state of components that do not participate in $a$. Otherwise, if an observation were conditionally dependent on the state of a non-participating component, then the observation could depend on the outcome of an action that occurs earlier in the action sequence but actually has a later starting time in simulation. This must not be possible. Therefore, given the state of the participating components $\sigma_a$, the resulting state of the participating components and the observations are conditionally independent of the non-participating components.

$$(\sigma'\_a, o\_a \mid \sigma\_a) \perp \sigma\_{C \backslash C\_a} \tag{2.6}$$

Now we extend the treatment of uncertainty to action sequences. We write $\Theta = \bigcup_{a \in A} \Theta_a$ for the set of possible action parameters and $O = \bigcup_{a \in A} O_a$ for the set of possible observations. The action subscript for parameters and observations is dropped when the relation is clear from context.

**Definition 2.8.** *A history* $\mathbf{h}$ *is a sequence of episodes* $h^k$*. Each episode consists of the selected action and action parameters, as well as the generated observation.*

$$h = \underbrace{a^1 \theta^1 o^1}\_{h^1} \underbrace{a^2 \theta^2 o^2}\_{h^2} \dots \underbrace{a^{|h|} \theta^{|h|} o^{|h|}}\_{h^{|h|}}$$

The set of possible histories $H \subset (A \times \Theta \times O)^*$ implies a tree-graph. The histories in $H$ have finite length, either because the scenario is *done* when a specific state is reached or by a cutoff at a maximum history depth. A history $\mathbf{h}$ can be appended with the next action $a$, parameters $\theta$ and observation $o$ to form $\mathbf{h}' = \mathbf{h}a\theta o$. In the case of a partially observable system, the current system state can only be inferred with uncertainty. Every history yields a probability distribution for the belief over the final system state that is conditioned on the observations. Recall that $\tilde{a}$ denotes the already parameterized actions.

$$\sigma' \sim \underline{\mathbf{h}}(\sigma), \quad \underline{\mathbf{h}}(\sigma) = \mathbb{P}\left( \left(\underline{a}^{1,\theta^1} \cdots\, \underline{a}^{|\mathbf{h}|,\theta^{|\mathbf{h}|}}\right)(\sigma) \,\middle|\, o^1, \dots, o^{|\mathbf{h}|} \right) \tag{2.7}$$

The computation of the belief distribution for the resulting state can be performed with iterative Bayes updates [Jay03] for the intermediary system states between actions. The state belief distribution following from an uncertain initial state is $\underline{\mathbf{h}}(\underline{\sigma})$. In general, this computation cannot be performed with the available computational resources. The planning algorithms from the later chapters therefore rely only on samples from forward simulations of the system.
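Where exact Bayes updates are out of reach, the belief can be approximated by such forward simulations. The following is a minimal rejection-sampling sketch, assuming discrete observations and a hypothetical `sample_action` simulator that draws $(\sigma', o)$ for a state, action, and parameter.

```python
import random

def belief_samples(initial_states, history, sample_action, n=1000):
    """Empirical belief over the final state: keep only those forward
    simulations whose observations reproduce the observed history."""
    matches = []
    for _ in range(n):
        state = random.choice(initial_states)  # draw from the initial belief
        consistent = True
        for action, theta, observed in history:
            state, o = sample_action(state, action, theta)
            if o != observed:                  # rejection step
                consistent = False
                break
        if consistent:
            matches.append(state)
    return matches
```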

## **2.4 Reward and Policies**

So far, only the dynamics of actions and action sequences were discussed. Now we begin to express preferences between action sequences.

**Definition 2.9.** *Executing action $a$ with parameters $\theta$, inducing a transition of the system state from $\sigma$ to $\sigma'$, generates a reward $\mathfrak{r}(\sigma, a, \theta, \sigma')$. The reward function is*

$$
\mathfrak{r}: \Sigma \times A \times \Theta \times \Sigma \to \mathbb{R} \,. \tag{2.8}
$$

With the reward function, each state defines a planning problem with the goal to maximize the future (expected) reward. This is known as the decision-theoretic planning problem [De 70; BDH99]. The following hierarchy of planning problems is distinguished in the literature [LaV06]:


The application of decision-theoretic planning for the (feedback) control of dynamical systems is discussed in [DW91; Ber+95; BG01]. See [LP12] for an application of MDP solvers to model manufacturing scenarios under uncertainty.

Algorithms for solving decision-theoretic planning problems belong to two distinct groups: online planning algorithms and policy-constructing algorithms. Online planning algorithms are executed at runtime for the selection of the next action [Ros+08; SV10]. Policy-constructing algorithms are executed ahead of time. They compute a fixed policy function that maps the current system state (or observed history) to the next action [KHL08].

**Definition 2.10.** *A policy is a mechanism to select the next action during runtime. In fully observable settings, policies are represented as functions π* : Σ → *A that map from the current system state to the next action. The policy function becomes π* : Σ → *A* × Θ *in settings with parameterized actions.*

*In POMDP, policies π* : *H* → *A or π* : *H* → *A* × Θ *map from the observed history to the next action. Internally, the POMDP policy may decide the next action and parameters based on a belief distribution over the current system state conditioned on the observed history* P(σ ∈ Σ | h ∈ *H*)*.*

For simplicity of the exposition, consider fully observable settings without parameterized actions. For a fixed policy $\pi$ and a discount rate $\gamma \in [0, 1)$, the value of an initial state $\sigma^0$ is the expected discounted reward.³

$$\mathfrak{v}^{\pi}(\sigma^{0}) = \mathbb{E}\left[\sum\_{k=0}^{\infty} \gamma^{k} \mathfrak{r}(\sigma^{k}, a^{k}, \sigma^{k+1}) \, \Big|\, a^{k} = \pi(\sigma^{k}), \, \sigma^{k+1} \sim a^{k}(\sigma^{k})\right] \tag{2.9}$$

As an alternative to the discount factor, we can limit the number of considered periods. The optimal value $\mathfrak{v}$ of a state is the expected reward resulting from optimal action selection in every step. Based on Bellman's principle of optimality [Bel57], the V-value can be written as a recursive formula.

$$\mathfrak{v}(\sigma) = \max\_{a \in A} \underset{\sigma' \sim a(\sigma)}{\mathbb{E}} \left[ \mathfrak{r}(\sigma, a, \sigma') + \gamma \mathfrak{v}(\sigma') \right] \tag{2.10}$$

³The V-value and Q-value are longstanding terms in the literature. As both are scalar values we denote them as v and q in the mathematical notation.

The corresponding *Q*-value is the expected reward for selecting action *a* in state σ and optimal decision making thereafter.

$$\mathfrak{q}(\sigma, a) = \underset{\sigma' \sim a(\sigma)}{\mathbb{E}} \left[ \mathfrak{r}(\sigma, a, \sigma') + \gamma \mathfrak{v}(\sigma') \right] \tag{2.11}$$
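For small, fully observable scenarios, Equations (2.10) and (2.11) can be computed directly. The sketch below shows tabular value iteration under the assumption of an explicit (hypothetical) transition model mapping each state-action pair to a list of weighted outcomes.

```python
def value_iteration(states, actions, transitions, gamma=0.95, sweeps=100):
    """transitions[(s, a)] = list of (probability, reward, successor)."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            # q(s, a) per Equation (2.11), for every applicable action
            q = [sum(p * (r + gamma * v[s2])
                     for p, r, s2 in transitions[(s, a)])
                 for a in actions if (s, a) in transitions]
            v[s] = max(q, default=0.0)  # Equation (2.10)
    return v
```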

Selecting the action according to $\max_{a \in A} \mathfrak{q}(\sigma, a)$ is the optimal policy. But computing the optimal policy is generally computationally intractable. The remainder of this thesis is concerned with ways of rendering optimization and planning in the framework introduced in this chapter computationally feasible. This is achieved a) with tailored planning algorithms, b) with so-called rollout policies that exploit known structure in the planning problem, and c) with the decomposition of the planning problem into a coupled set of smaller problems that are jointly optimized by cooperating agents.

# **3 Simulation-Based Planning for Concurrent Production Systems**

*Programming, or program planning, may be defined as the construction of a schedule of actions by means of which an economy, organisation, or other complex of activities may move from one defined state to another, or from a defined state toward some specifically defined objective.*

*Marshall K. Wood and George B. Dantzig [WD51]*

The model from Chapter 2 is generic and can be used to represent many types of systems. Very few constraints are imposed on the system dynamics that can be represented. This chapter develops an algorithm for sequential decision making that does not make additional limiting assumptions. But this richness with respect to possible system dynamics is a drawback when it comes to planning. Most planning algorithms impose a much more limited model structure, which they exploit to reduce the computational effort. The core postulate of this chapter is the following:

> *The same algorithm can be used for planning and runtime control on all levels of the control hierarchy – ranging from continuous dynamics of a physical system to global supply-chain operations – and for both continuous and discrete production.*

This chapter develops a planning algorithm for the full model from Chapter 2 without additional assumptions. In the first two sections, two techniques are used to reduce the number of action sequences that are visited for planning with action sequences in deterministic scenarios. In Section 3.1 the search tree is explicitly pruned by removing equivalent action sequences (this will have a precise definition). Section 3.2 further speeds up planning by implicitly pruning less promising parts of the search tree via Monte-Carlo Tree Search (MCTS). Section 3.3 extends planning to the full model with parametric actions and uncertainty with partial observability. In order to scale to larger scenarios, Section 3.4 develops a custom rollout policy that uses a relaxation of the planning problem to a Mixed-Integer Linear Program (MILP).

## **3.1 Tree Search with Backtracking**

The model from Chapter 2 can represent concurrency, i.e. parallelism and the synchronization of system components. A consequence of the model is that many action sequences are equivalent with respect to their overall preconditions and effects.

**Example 3.1.** Take the scenario from Example 2.2 with the modification that the robotic manipulator initially contains a product. The actions produce and package do not share a participating component as $C_{\mathrm{produce}} = \{\mathrm{mt}\}$ and $C_{\mathrm{package}} = \{\mathrm{manip}\}$. It is easy to see that the sequence (produce package) is equivalent to the sequence (package produce) in terms of their preconditions and effects. In that sense, the two actions are independent of each other.

This notion of action sequence equivalence is made rigorous based on established results from Trace Theory (TT) [CF69; Maz77]. The theoretical prerequisites are summarized in the following. Full proofs can be found in [DM97a]. Some elementary definitions from Group Theory [DF04] are assumed to be known.

### **Background: Trace Theory**

Previous nomenclature from Chapter 2 applies to TT by taking $A$ as a set of letters of action-identifiers that are concatenated to words corresponding to action sequences. Let $A$ be a finite alphabet of letters and $A^*$ the set of all finite-length words over $A$. The composition of letters forms a free monoid with generating set $A$ and the empty word, denoted $\varepsilon$, as the unit element. Some letters commute and are said to be mutually independent. This is captured in the set of independence relations $Z \subseteq A \times A$. Independence relations are symmetric, $(a, b) \in Z \Rightarrow (b, a) \in Z$, and irreflexive, $\forall a \in A,\ (a, a) \notin Z$. We also write $a \perp_Z b$ to denote independence between $a$ and $b$. From $Z$ follows an equivalence relation between words $\equiv_Z$: two words $\mathbf{w}$ and $\mathbf{v}$ from $A^*$ are equivalent according to $\equiv_Z$ if $\mathbf{v}$ is a permutation of $\mathbf{w}$ that can be reached by successive reordering of adjacent elements that commute according to $Z$. In algebraic terms, the trace monoid $\mathbb{M}(A, Z)$ is the quotient of $A^*$ by the congruence $\equiv_Z$. Its elements, called traces, are pairwise disjoint subsets of $A^*$. Each trace contains the words that are mutually equivalent according to $\equiv_Z$. Let $[\mathbf{w}]_Z$ denote the trace generated from the word $\mathbf{w}$. We can now rephrase equivalency as $\mathbf{w}$ and $\mathbf{v}$ generating the same trace: $\mathbf{w} \equiv_Z \mathbf{v} \Leftrightarrow [\mathbf{w}]_Z = [\mathbf{v}]_Z$. When it is clear from context which independence relations apply, we simply write $[\mathbf{w}]$ for the trace and $\mathbf{w} \equiv \mathbf{v}$ to denote equivalence.

The independence relations define a partial order between some of the letters in a word. This partial order is a "must appear before" binary relation ≺.

$$w^k \prec w^l \Leftrightarrow k < l \land \neg (w^k \perp w^l) \tag{3.1}$$

The transformation between equivalent words w ' v is described by a permutation of the element indices *τ* . The partial order of the word elements is invariant to this permutation and therefore

$$\forall k, l \in \{1, \ldots, |w|\}, \; w^k \prec w^l \Rightarrow v^{\tau(k)} \prec v^{\tau(l)}.\tag{3.2}$$

The partial order yields the same *dependence graph* and *Hasse diagram* for every word from the same trace.¹ The dependence graph $G = (V, E)$ is a directed acyclic graph with nodes $V = \{w^1, \dots, w^{|\mathbf{w}|}\}$ and edges $E = \{(w, v) \in V^2 : w \prec v\}$.

¹Graphs being equal refers to the existence of an isomorphism between the graphs that takes node labels into account.

**Figure 3.1:** The Dependence Graph (left) and Hasse Diagram (right) for the trace $[abcaed]$ according to the trace monoid $\mathbb{M}(A, Z)$ over the alphabet $A = \{a, b, c, d, e\}$ and independencies $Z = \{(a, c), (c, a), (b, d), (d, b), (d, e), (e, d)\}$.

The Hasse diagram $G' = (V, E')$ is the dependence graph with redundancies removed, $E' = \{(w, v) \in E : \nexists u,\ w \prec u \prec v\}$. See Figure 3.1 for an example.
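A minimal sketch of both constructions for the example of Figure 3.1, assuming the independence relation $Z$ is given as a set of letter pairs; word positions stand in for the labeled nodes.

```python
from itertools import combinations

def dependence_graph(word, Z):
    """Edges (k, l) with k < l where w^k 'must appear before' w^l."""
    return {(k, l) for k, l in combinations(range(len(word)), 2)
            if (word[k], word[l]) not in Z}

def hasse(edges):
    """Remove edges implied transitively through an intermediate position u."""
    return {(k, l) for (k, l) in edges
            if not any((k, u) in edges and (u, l) in edges
                       for u in range(k + 1, l))}

Z = {("a","c"), ("c","a"), ("b","d"), ("d","b"), ("d","e"), ("e","d")}
print(sorted(hasse(dependence_graph("abcaed", Z))))
```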

Besides their dependence graph and Hasse diagram, traces are uniquely represented by a word in a normal form. That is, exactly one word from every trace is in the normal form. There exist several normal forms with this property. We consider the Lexicographical Normal Form (LNF): Assume a total ordering of the letters $A$ and a resulting lexicographical ordering over words. Longer words have a larger order than shorter words regardless of their elements. For two words $\mathbf{w}, \mathbf{v}$ of the same length, $\mathbf{w} < \mathbf{v}$ if $w^k < v^k$ for the smallest index $k$ where $w^k \neq v^k$. A word is in LNF if it has the smallest lexicographical order among all words of the trace.

#### **Tree-Search with Trace-Based Pruning**

With the prerequisites in place, we now take a look back at Chapter 2. Remember that action sequences are operators on the system state with defined preconditions (the operator domain) and effects.

**Definition 3.1.** *Two action sequences* $\mathbf{w}$ *and* $\mathbf{v}$ *are equivalent* $\mathbf{w} \equiv \mathbf{v}$ *if they have the same domain of initial states* $\Sigma_{\mathbf{w}} = \Sigma_{\mathbf{v}}$ *and yield identical results* $\forall \sigma \in \Sigma_{\mathbf{w}},\ \mathbf{w}(\sigma) = \mathbf{v}(\sigma)$*.*

The independence of actions is defined based on the possibility to commute them if they occur in adjacent positions in any action sequence.

**Definition 3.2.** *Two actions $a$ and $b$ are independent $a \perp b$ if commuting them yields an equivalent sequence:* $a \perp b \Leftrightarrow \forall \mathbf{w}, \mathbf{v} \in A^*,\ \mathbf{w}ab\mathbf{v} \equiv \mathbf{w}ba\mathbf{v}$*.*

It is generally intractable to show the equivalence of two action sequences by enumerating the set of valid initial system states and the effect of the action sequences on those states. For some action pairs, however, independence can be shown without evaluating the preconditions and effects:

**Proposition 3.3.** *Any two actions a, b that do not share a participating component are independent.*

$$C_a \cap C_b = \emptyset \Rightarrow a \perp b \tag{3.3}$$

*Proof.* For all action sequences $\mathbf{w}, \mathbf{v}$, the sequence $\mathbf{w}ab\mathbf{v}$ must be equivalent to $\mathbf{w}ba\mathbf{v}$. This follows directly from the operational semantics of actions defined in Equation 2.3: since $a$ and $b$ share no participating component, each of them reads and writes disjoint component states, so their preconditions and effects are unchanged by the swap.

**Example 3.2.** Figure 3.2 shows which sequences from Example 2.3 are equivalent and can be pruned. Only the sequences up to a length of five are shown. The initial lexicographic ordering of the actions for the LNF is produce *<* put *<* take *<* package.

**Figure 3.2:** Sequence tree from Example 2.3 before (left) and after the pruning of equivalent sequences (right).


**Algorithm 1** Is the action sequence w*a* in LNF given that w is in LNF?

Some sequences may additionally be equivalent due to some lucky alignment of the action effects. Such additional equivalences are not considered in this text. Now that the independence of any two actions is easily computed, we use this information to speed up the search for the best action sequence. Only one sequence from each trace is considered. The sequences from the same trace are equivalent also with respect to the reward that they generate. To consider only one sequence per trace, we evaluate only the action sequences in LNF.

**Proposition 3.4.** *If a word* w *is not in LNF, no word* wu *starting with the prefix* w *is in LNF.*

*Proof.* If w is not in LNF, then a word v ' w must exist so that v *<* w. For all u ∈ *A*<sup>∗</sup> there is vu ' wu and vu *<* wu.

From Proposition 3.4, we know that a sequence $\mathbf{w}a$ can only be in LNF if $\mathbf{w}$ is in LNF. During a depth-first traversal of the search tree, entire subtrees can thus be pruned away. If the action sequence $\mathbf{w}$ leading to the current position in the search tree is known to be in LNF, the candidate sequence $\mathbf{w}a$ can be tested for LNF very quickly. The test is performed by Algorithm 1. It has a worst-case runtime that is linear in the sequence length. In practice, the algorithm performs much faster, as a breaking condition is usually found within the first few elements of the sequence.
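A minimal Python sketch of this test, following condition (3.4) from the proof of Proposition 3.5: scan the contiguous block of predecessors that commute with the new action and require that each of them is smaller. The `indep` predicate is assumed to implement $a \perp b$.

```python
def test_lnf(w, a, indep):
    """Given that w is in LNF, decide whether w + [a] is still in LNF."""
    for b in reversed(w):
        if not indep(a, b):
            return True   # a cannot commute past b; w a stays in LNF
        if a < b:
            return False  # a commutes past the larger b: the order can be lowered
    return True

# Example with the independencies from Figure 3.1:
Z = {("a","c"), ("c","a"), ("b","d"), ("d","b"), ("d","e"), ("e","d")}
indep = lambda x, y: (x, y) in Z
print(test_lnf(["b", "c"], "a", indep))  # False: a commutes with c and a < c
```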

**Proposition 3.5.** *If the action sequence* w *is in LNF, then for all a* ∈ *A Algorithm 1 returns true if and only if the sequence* w*a is also in LNF.*

*Proof.* For a sequence $\mathbf{v}$, denote with $Z(\mathbf{v}, k) = \{l : l < k \land \forall m \in \{l, \dots, k-1\},\ v^k \perp v^m\}$ the indices of the contiguous elements before $v^k$ that all commute with $v^k$. In the following we show that an action sequence $\mathbf{v}$ is in Lexicographical Normal Form (LNF) if and only if

$$\forall k \in \{1, \ldots, |\mathbf{v}|\}, \forall l \in Z(\mathbf{v}, k), \ v^l < v^k. \tag{3.4}$$

First, we show that (3.4) must be satisfied for every sequence $\mathbf{v}$ in LNF. Assume $\mathbf{v}$ is in LNF and does not satisfy condition (3.4). Then there exists a combination of indices $l < k$ where $v^l > v^k$ and $\forall m \in \{l, \dots, k-1\},\ v^m \perp v^k$. The decomposition $\mathbf{v} = \mathbf{u} v^l \mathbf{q} v^k \mathbf{z}$ can be rearranged as $\mathbf{y} = \mathbf{u} v^k v^l \mathbf{q} \mathbf{z}$ with possibly empty subsequences $\mathbf{u}, \mathbf{q}$ and $\mathbf{z}$. This contradicts $\mathbf{v}$ being in LNF since $\mathbf{v} \equiv \mathbf{y}$ and $\mathbf{y} < \mathbf{v}$.

Second, we show that condition (3.4) is sufficient for $\mathbf{v}$ to be in LNF. Assume $\mathbf{v}$ satisfies (3.4) and is not in LNF. Then there must exist an equivalent sequence $\mathbf{y} \equiv \mathbf{v}$ in LNF. Equivalent sequences are permutations and $|\mathbf{y}| = |\mathbf{v}|$. So for $\mathbf{y}$ to have a smaller order, there must be an index $l$ where $\mathbf{y}$ and $\mathbf{v}$ first differ, $\mathbf{y}_{:l-1} = \mathbf{v}_{:l-1}$ and $y^l < v^l$. Due to $\mathbf{y} \equiv \mathbf{v}$, the action $y^l$ must appear in $\mathbf{v}$ at an index greater than $l$ and $\mathbf{v}$ can be decomposed as $\mathbf{v} = \mathbf{u} v^l \mathbf{q} y^l \mathbf{z}$, with possibly empty subsequences $\mathbf{u}, \mathbf{q}$ and $\mathbf{z}$, where $y^l$ commutes with $v^l$ and all elements of $\mathbf{q}$. This contradicts (3.4).

Since Equation 3.4 only refers to the preceding elements in the action sequence and w is known to be in LNF, it suffices to test (3.4) for the new element.

**Proposition 3.6.** *For any action a and LNF action sequence* w*, Algorithm 2 returns an action sequence that is in LNF and equivalent to* w*a.*

*Proof.* When the condition in line 3 of Algorithm 2 is not true, then $a$ commutes with $w^k$ and $a < w^k$. Therefore, once the condition in line 3 is true, $\mathbf{w}_{:k} a$ is in LNF and the lexicographical order of the sequence cannot be improved by moving an element from $\mathbf{w}_{k+1:}$ before $a$. Since $\mathbf{w}$ is in LNF, no permutation of the subsequence $\mathbf{w}_{k+1:}$ has a smaller lexicographical order. Therefore $\mathbf{w}_{:k} a \mathbf{w}_{k+1:}$ is in LNF.
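A sketch of this repair step under the same assumptions as the `test_lnf` sketch above: the new action is moved to the left past every larger element it commutes with, which yields the LNF representative of the trace $[\mathbf{w}a]$.

```python
def to_lnf(w, a, indep):
    """Insert a into the LNF sequence w so that the result is in LNF
    and equivalent to w + [a] (Proposition 3.6)."""
    k = len(w)
    while k > 0 and indep(a, w[k - 1]) and a < w[k - 1]:
        k -= 1  # a commutes past w[k-1] and improves the lexicographic order
    return w[:k] + [a] + w[k:]

Z = {("a","c"), ("c","a"), ("b","d"), ("d","b"), ("d","e"), ("e","d")}
indep = lambda x, y: (x, y) in Z
print(to_lnf(["b", "c"], "a", indep))  # ['b', 'a', 'c']
```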

A trivial approach to search for good sequences is to cast the search space as a tree structure and to enumerate the solutions at the leaf nodes of the tree. Every node in the tree represents a (partial) action sequence. Depth-first search completely enumerates the solutions at the leaf nodes without having to hold the entire tree in memory. Algorithm 3, called DFS for depth-first search, walks over the tree in recursive fashion. The algorithm backtracks to a prior partial action

**Algorithm 3** Depth-First Search

```
 1: vmax ← −∞, wmax ← ε
 2: procedure DFS(ω, v, w)
 3:     if |w| = kmax then
 4:         if v > vmax then
 5:             vmax ← v
 6:             wmax ← w
 7:         return
 8:     for a ∈ A : ω ∈ Σa do
 9:         ω′ ← a(ω)
10:         v′ ← v + r(ω, a, ω′)
11:         DFS(ω′, v′, wa)
12: DFS(σ, 0, ε)
13: return wmax
```
**Algorithm 4** Pruned Depth-First Search

```
 1: vmax ← −∞, wmax ← ε
 2: procedure DFSᴸᴺF(ω, v, w)
 3:     if |w| = kmax then
 4:         if v > vmax then
 5:             vmax ← v
 6:             wmax ← w
 7:         return
 8:     for a ∈ A : ω ∈ Σa ∧ TestLNF(w, a) do
 9:         ω′ ← a(ω)
10:         v′ ← v + r(ω, a, ω′)
11:         DFSᴸᴺF(ω′, v′, wa)
12: DFSᴸᴺF(σ, 0, ε)
13: return wmax
```
**Algorithm 5** Branch & Bound

```
 1: vmax ← −∞, wmax ← ε
 2: procedure B&B(ω, v, w)
 3:     if |w| = kmax then
 4:         if v > vmax then
 5:             vmax ← v
 6:             wmax ← w
 7:         return
 8:     for a ∈ A : ω ∈ Σa do
 9:         ω′ ← a(ω)
10:         v′ ← v + r(ω, a, ω′)
11:         if v′ + β(ω′, wa) > vmax then
12:             B&B(ω′, v′, wa)
13: B&B(σ, 0, ε)
14: return wmax
```
```
**Algorithm 6** Pruned Branch & Bound

```
 1: vmax ← −∞, wmax ← ε
 2: procedure B&BᴸᴺF(ω, v, w)
 3:     if |w| = kmax then
 4:         if v > vmax then
 5:             vmax ← v
 6:             wmax ← w
 7:         return
 8:     for a ∈ A : ω ∈ Σa ∧ TestLNF(w, a) do
 9:         ω′ ← a(ω)
10:         v′ ← v + r(ω, a, ω′)
11:         if v′ + β(ω′, wa) > vmax then
12:             B&BᴸᴺF(ω′, v′, wa)
13: B&BᴸᴺF(σ, 0, ε)
14: return wmax
```


sequence once all solutions starting at the current position in the tree have been enumerated. The tree of depth $k_{\max}$ contains up to $\sum_{k=0}^{k_{\max}} |A|^k$ nodes and $|A|^{k_{\max}}$ leaf nodes. Enumerating all possible combinations is intractable for all but the most trivial scenarios. Usually the branching factor of the tree is reduced because not all actions from $A$ are applicable in the intermediary system states. But the problem of combinatorial explosion of the search space remains. Proposition 3.4 can be used to test at every node whether the entire subtree starting at the node is not in LNF and can be skipped (pruned). Since every trace contains exactly one sequence in LNF, it is ensured that traversing the tree without the pruned branches still visits every trace once. Algorithm 4, called DFSᴸᴺF, extends DFS with the pruning of sequences that are not in LNF. Note that only line 8 of Algorithm 4 has changed compared to DFS.

Another technique to speed up tree search is Branch & Bound [Lit+63]. Branch & Bound is a well-known technique for combinatorial optimization in the presence of so-called admissible heuristics [Pea84]. A heuristic is called admissible if it never underestimates the performance that can still result from a partial solution. Here, we express the admissible heuristic as a function $\beta(\sigma, \mathbf{w})$ with the current system state $\sigma$ and the current partial action sequence $\mathbf{w}$ as arguments. If the admissible heuristic shows that the best performance starting from the current partial solution is worse than the best solution that was already encountered, then the entire subtree behind the current partial solution can be pruned. Branch & Bound is implemented in Algorithm 5. In Algorithm 6, Branch & Bound is combined with trace-based pruning. Again, only line 8 is changed to prune out sequences that are not in LNF. The commonality between DFS and Branch & Bound is the use of backtracking to return to a previous partial solution, either when all nodes in a subtree have been visited or when the solutions that remain in the subtree are known to be suboptimal.

#### **Evaluation**

The effect of the pruning techniques for tree-search methods is evaluated on the Jobshop Scheduling Problem (JSP) [Pin08]. Scheduling is one of the most important planning problems on the shopfloor. It is also computationally challenging. The jobs $j \in J$ have to be processed on the machines $M$. Every job consists of operations $o \in O_j = \{1, 2, \dots, |M|\}$. The operations need to be processed in order for every job and are each assigned to a specific machine $m_{j,o} \in M$. The operation duration is $d_{j,o}$. The goal is to find a schedule that assigns operations to machines in order to minimize the finishing time of the last job. See Table 3.2c for a benchmark JSP from the literature. There are $(|J|!)^{|M|}$ possible schedules for a JSP with $|J|$ jobs and $|M|$ machines where every job needs to visit every machine once [JM98]. So a JSP with $|J| = 20$ and $|M| = 10$ has $7.2651 \times 10^{183}$ possible solutions. Compare this to the mass of the observable universe, currently estimated as the equivalent of $10^{80}$ hydrogen atoms. Even though the number of possible solutions is huge, current solution techniques can compute near-optimal solutions for JSPs with hundreds of machines and jobs. Many benchmark scheduling problems have even been solved with certified optimality. However, the worst-case analysis of the JSP shows that it is NP-hard for instances where $|J| \geq 3$ and $|M| \geq 3$ [GJS76]. Therefore, unless P = NP, all algorithms for solving the JSP are either heuristic or require a long running time for some (synthetic) JSP problems.

We cast the JSP in the model from Chapter 2 in terms of actions with preconditions and effects. Every machine and every job in a JSP is represented as a component in the system model $C = C_{\text{machines}} \cup C_{\text{jobs}}$. The operations for each job are represented as actions in which both one machine and one job participate. The time-indexed state of the machine components is trivial. It consists only of the time when the machine is available next. The state of the job components is the number of operations that were already performed for the job. The precondition of every action is that the associated operation is next in line for the job. The action effect simply increases the number of finished operations for the job by one. The duration of each action (operation in the JSP) is deterministic. The cost generated by an action is the increase in the maximum timestamp of the components.



**(c)** The 6 × 6 benchmark JSP ft06 from [FT63].

**Table 3.2:** Example JSP problems.

So the cost of the entire sequence is the makespan of the solution.

$$\mathfrak{r}(\sigma, a, \sigma') = \left(\max_{c \in C} t_c\right) - \left(\max_{c \in C} t'_c\right) \tag{3.5}$$

Here, $t_c$ refers to the simulation time of component $c$ in the system state $\sigma$ and $t'_c$ refers to the simulation time of the component in the system state $\sigma'$.

Given a partial schedule as an action sequence $\mathbf{w}$, the following trivial heuristic computes an upper bound for the reward that can be generated following the execution of $\mathbf{w}$. For every component (representing a job or a machine), the durations of the remaining operations are summed up and added to its current time.

$$\beta^{JSP}(\sigma, w) = -\max\_{c \in C} \left( t\_c + \sum\_{\substack{a \in A : a \notin w, \\ c \in C\_a}} d\_a \right) \tag{3.6}$$
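A minimal sketch of this bound; the dictionary-based encoding of timestamps, remaining operations, and durations is an illustrative assumption.

```python
def beta_jsp(timestamps, remaining, durations):
    """Admissible bound from Equation (3.6).

    timestamps: component -> current time t_c
    remaining:  component -> actions not yet in the partial sequence w
    durations:  action -> duration d_a
    """
    return -max(t + sum(durations[a] for a in remaining[c])
                for c, t in timestamps.items())
```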

The ft06 benchmark problem from Table 3.2c illustrates the importance of optimization. An unoptimized solution can require more than 150 seconds to complete all jobs. The optimal solution requires just 55s. Finding the optimal solution for larger problems is not possible without the help of computers. In addition to the ft06 benchmark, two additional minimal problems are considered. See Table 3.2 for their exact definition. The minimal examples are too small to be of any practical value. But they give an indication of how fast the problem complexity increases. A sequence of 9 unique elements has 9! = 362,880 permutations. The minimal $3 \times 3$ JSP depicted in Table 3.2 also defines 9 actions. But the full scenario tree for the JSP has only 1,680 leaf nodes corresponding to valid schedules. By additionally pruning branches that are not in the LNF, only 63 leaf nodes remain. Using the makespan lower bound from Equation 3.6, nodes can be pruned where the bound is equal to or worse than the best solution encountered so far. With the Branch & Bound pruning, only 3 leaf nodes are actually visited. However, more nodes were visited overall compared to trace-based pruning for full depth-first search. This indicates that mostly branches close to the leaves were pruned with Branch & Bound only. By combining trace pruning with Branch & Bound, only 65 nodes are visited overall. This is a decrease in the search running time by a factor of more than 50 compared to Branch & Bound search without pruning.

On the $4 \times 4$ JSP depicted in Table 3.2, an even smaller fraction of the scenario tree is visited by combining trace-based and makespan-based pruning. However, the number of visited nodes still grows fast with the size of the JSP. On the ft06 $6 \times 6$ JSP, 682,508 nodes are expanded with both bound and trace pruning enabled. Our implementation is capable of visiting over 1,000,000 nodes per second during Branch & Bound search (on a Lenovo T480 laptop computer). However, with only either trace-based pruning or Branch & Bound enabled, we could not solve ft06 in over a day of computation. A $10 \times 10$ JSP could not be solved within several days of compute time even with both trace-based and bound-based pruning activated.



**(a)** Visited nodes for the minimal 3 × 3 JSP


**(b)** Visited nodes for the minimal 4 × 4 JSP


**(c)** Visited nodes for the ft06 6 × 6 JSP

**Table 3.3:** Benchmarking of pruning techniques for DFS and Branch & Bound. Question marks indicate that an optimization did not terminate after two weeks of computation; the corresponding numbers are therefore unknown.

In summary, trace-based pruning reduces the number of visited action sequences to a small fraction. For the $4 \times 4$ JSP, the number of action sequences (visited leaf nodes) was reduced by a factor of more than 5,600. The improvement grows with the length of the action sequences.

We have shown that LNF pruning leads to a dramatic reduction of the search space. The reduction already spans many orders of magnitude for the comparatively small benchmark problems that were considered. However, even with trace-based pruning, naive tree search does not scale up to scenarios of relevant size. In practice, genetic algorithms are often used to solve combinatorial problems. Modern heuristic solvers find good solutions for JSPs with thousands of jobs [Dim15]. But these solvers exploit the specific structure of the JSP. Handling complex action preconditions is near-impossible for genetic algorithms. The genetic crossover and mutation of two partial solutions will almost always lead to an infeasible action sequence where the preconditions of an action are not met. Repairing an infeasible plan is usually quite difficult, for example when a single product is missing to complete an order of many hundreds of products.

## **3.2 Planning for Discrete Action Sequences**

The algorithms from Section 3.1 use backtracking to return to previous partial solutions. This requires either that the system state is fully known so that it can be stored in a computer (to return to previous states) or that action sequences can be repeated deterministically. For experiments carried out in the physical world, neither of these assumptions holds. In this section, we further speed up planning by "implicitly pruning" less promising branches with Monte-Carlo Tree Search (MCTS). Many recent breakthroughs in Artificial Intelligence were made possible by MCTS, including the AlphaGo system [Sil+17], which plays the game of Go with superhuman performance. In short, MCTS enables planning in scenarios with many combinatorial variations where backtracking tree search cannot cover any significant portion of the search space. In addition, MCTS uses only forward simulation without backtracking. This reduces the coupling between the planning algorithm and the simulator to the point where real-world executions could be used to generate samples for the planning algorithm.

#### **Background: Monte-Carlo Tree Search**

In MCTS [Bro+12; Mun+14], a scenario tree is explored by iterative playouts. A playout is essentially one run of the scenario with sequential decision making in a series of steps. Every playout starts at the root of the sequence tree and evolves by "forward simulation". In contrast to Branch & Bound, there is no backtracking to previous states within a playout. Historically, MCTS evolved from research on the multi-armed bandit problem. A multi-armed bandit is an idealized version of a slot machine. The following short exposition follows [Mun+14] and adapts the notation to the conventions of this text. Consider a multi-armed bandit in a casino where the player can choose a different arm during each of $k_{\max}$ rounds (this will be extended to sequential decision making later on). The reward of the different arms $a \in A$ is random according to the distribution $P_a$ with a support on $[0, 1]$. But the player is initially unaware of the reward distribution of every arm. In each round $k$, the player chooses an arm $a^k \in A$ and collects a reward $r^k \sim P_{a^k}$. The rewards generated at the other arms are not observed. The goal is to find a strategy that maximizes the expected sum of payouts during the $k_{\max}$ rounds. This leads to the so-called *Exploration/Exploitation Tradeoff*: The player could select an arm with high expected reward. But this might leave the player stuck at an arm with suboptimal expected reward. So the player wants to explore other options without losing too much in the process. The expected reward of the different arms $a \in A$ is $\mu_a = \mathbb{E}[P_a]$. The expected reward of the best arm (there can be several best arms) is $\mu^* = \max_{a \in A} \mu_a$. Possible strategies for repeated play are analyzed with respect to their *expected regret*. The cumulative regret after $k$ rounds compares the actual reward with the expected reward of choosing an arm with the highest expected reward every time.

$$b\_k = k\mu^\* - \sum\_{l=1}^k r^l$$

The expected cumulative regret is therefore

$$\mathbb{E}[b\_k] = k\mu^\* - \sum\_{l=1}^k \mathbb{E}[r^l] = \sum\_{a \in A} \mathbb{E}[n\_a(k)](\mu^\* - \mu\_a)\,,$$

where $n_a(k)$ denotes the number of draws from $a$ after the first $k$ rounds.

An important tool for bounding the expected cumulative regret is the Chernoff-Hoeffding inequality. Let $(y^i)_{i=1,\dots,n}$ be i.i.d. samples of a probability distribution with support $[0, 1]$ and mean $\mu$. The empirical mean estimator $\underline{\hat{\mu}} = \frac{1}{n} \sum_{i=1}^{n} y^i$ is a function of the $y^i$ and therefore a random variable as well. The Chernoff-Hoeffding inequality gives probability bounds for the distance between the true and the estimated mean.

$$\mathbb{P}\left(\underline{\hat{\mu}}-\mu \geq \epsilon\right) \leq e^{-2n\epsilon^{2}} \quad \text{and} \quad \mathbb{P}\left(\underline{\hat{\mu}}-\mu \leq -\epsilon\right) \leq e^{-2n\epsilon^{2}}$$

The bound is independent of the underlying distribution of the $y^i$. So it can be applied even when the distribution of the $y^i$ is not known!

There exists a variety of approaches for selecting the next draw in the multi-armed bandit setting. Auer et al. [ACF02] proposed the so-called Upper Confidence Bound (UCB) algorithm. Let $\hat{\mu}_{a,k}$ denote the mean return of arm $a$ sampled during the first $k$ rounds. UCB always selects the next arm as follows:

$$a^k = \operatorname\*{arg\,max}\_{b \in A} \hat{\mu}\_{b, k-1} + \sqrt{\frac{3 \log k}{2n\_b(k-1)}}$$

Selecting the best arm according to the UCB gives an upper bound for the expected cumulative regret by application of the Chernoff-Hoeffding inequality.

$$\mathbb{E}[b_k] \le 6 \sum_{\substack{a \in A, \\ \mu^* > \mu_a}} \frac{\log k}{\mu^* - \mu_a} + |A| \left( \frac{\pi^2}{3} + 1 \right),$$

The expected cumulative regret grows at most logarithmically in *k*.
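A minimal sketch of the UCB rule with the $\sqrt{3 \log k / (2 n_b)}$ confidence term from above; the Beta-distributed arms are a hypothetical stand-in for the unknown reward distributions $P_a$ on $[0, 1]$.

```python
import math
import random

def ucb_play(arms, rounds):
    """arms: name -> (alpha, beta) parameters of a Beta reward distribution."""
    n = {a: 0 for a in arms}     # visit counts n_a(k)
    mu = {a: 0.0 for a in arms}  # empirical means
    for k in range(1, rounds + 1):
        untried = [a for a in arms if n[a] == 0]
        if untried:
            a = untried[0]       # play every arm once before applying UCB
        else:
            a = max(arms, key=lambda b:
                    mu[b] + math.sqrt(3 * math.log(k) / (2 * n[b])))
        reward = random.betavariate(*arms[a])  # sample r^k ~ P_a, support [0, 1]
        n[a] += 1
        mu[a] += (reward - mu[a]) / n[a]       # incremental mean update
    return mu, n

print(ucb_play({"left": (2, 5), "right": (5, 2)}, rounds=1000))
```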

The algorithm of Auer et al. [ACF02] selects actions from a flat list of options. Monte-Carlo Tree Search (MCTS) is a family of algorithms that uses the same principle of iterative exploration for planning in sequential decision settings. A popular variation of MCTS, called Upper Confidence bounds applied to Trees (UCT) [KS06], uses the UCB decision rule for every action choice during a playout. See Figure 3.3 for an overview. Algorithm 7 shows UCT (but with some modifications compared to [KS06] that will be explained in the following). MCTS can be seen

**Figure 3.3:** Outline of a Monte-Carlo Tree Search. Reproduced from [Cha+08].

as "simulation-based" planning, as no backtracking is used. Instead, the results from the last simulation are incorporated into statistics about the expected reward for the possible next actions at the current position in the scenario tree. The updated empirical reward statistics is used for decision-making in the following playouts.

The assumption that the reward in every step is in the range $[0, 1]$ is not required for UCT to converge to optimal decisions in the limit. On the downside, there is currently no good characterization of the convergence speed achieved by UCT that does not make strong assumptions on the underlying scenario. It is only known that, for a large enough number of playouts, every branch of the scenario tree is visited. Since only a small subset of the tree is explored in practice before the algorithm is terminated, MCTS struggles with scenarios where the reward is "unevenly distributed". That is, if a big reward can only be found after a long sequence of actions and small changes to the action sequence lead to much worse results, it is unlikely that the reward is encountered at all, and MCTS will not assign a correct value estimate to the choices. A workaround is Reward Shaping [NHR99], where the reward is redistributed such that it may be encountered early on in the action sequences.

UCT may require tuning of the parameter $\alpha$ which regulates the importance of the upper confidence bias. Setting $\alpha$ corresponds to making a choice for the Exploration/Exploitation tradeoff between the two extremes "always explore" and "always exploit". Note that UCB was developed for action selection in stochastic scenarios with unknown reward distributions for the different actions. Here we apply the same principle to tree search for deterministic scenarios. There is an informal argument for this: Many rollout policies are stochastic, similar to the uniform sampling policy described above. So the rollout takes samples from the reward distribution that is implied by applying the current rollout policy to the subsequent steps. When nodes are visited several times, the underlying distribution for the policy changes. So past experience does not necessarily match the samples taken later on. Nevertheless, the UCB-based selection rule shows good performance in practice.

### **Monte-Carlo Tree Search for Discrete Action Sequences**

We introduce a modification of UCT called UCTᴸᴺF; see Algorithm 8 for details. UCTᴸᴺF is inspired by UCT but deviates in a few key aspects. First, the original UCT algorithm gradually builds up a tree structure where the nodes correspond to partial action sequences, but only one new node is added with every playout. Once the edge of the current tree is reached, the remaining steps are "rolled out" and only the sum of rewards for the rollout is considered. UCTᴸᴺF does not use rollouts and adds nodes for all selected actions in the sequence. Second, the original UCT stores statistics about the average reward that was achieved after selecting a node (action). Instead, we perform a maximization in every step of the Update procedure. Therefore, the value estimate of every node is the maximum reward that can be achieved by following the best known sequence of actions after selecting the node (action). Third, we do not update only a single sequence for every playout. Instead, when we arrive at a sequence that is not in LNF, we permute the sequence to the equivalent LNF sequence. But we store both the original sequence and the LNF sequence and run the Update procedure on both of them. All three modifications to the original UCT algorithm are interrelated. This is explained next.

The easiest way to restrict the search to sequences in LNF only is to simply remove actions leading to non-LNF sequences at the current node in the search tree. But when choosing one action at a time, this often leads to dead-ends where no action can be chosen without breaking the LNF constraint. Since we do not want to backtrack to previous system states during a playout, virtually no playout would complete. Therefore we use the LNF algorithm to repair the action sequence as we go along. However, simply permuting the sequence to LNF after every action selection is not compatible with the UCT approach either: Soon, some nodes are always selected because they have a very low visit count. But as the sequence is permuted to LNF, these nodes continue to not have their visit count $n$ increased. Suppose an example with three actions $a < b < c$ that all commute and where every action can be chosen only once in every sequence. A sequence starting with the action $c$ would always become the LNF sequence $ac$ or $bc$ after the second action choice. So the node for the first action $c$ is never updated. But the UCT algorithm will always choose $c$ first as long as the counter $n[c]$ is not increased. This is solved by updating all sequences that were encountered in the current playout, regardless of whether they are in LNF or not. The counter $z$ for the node updates within the current playout ensures that no node is updated twice in one playout. Note that we use the algorithm LNF in a slightly different fashion: Given the action sequence $\mathbf{w}$ and the reward history $\mathbf{r}$ as input, both are permuted to yield an LNF action sequence where the index of the actions still matches the index of the corresponding rewards.

#### **Evaluation**

We evaluate the algorithms UCT and UCTᴸᴺF on benchmark Jobshop Scheduling Problems (JSP): the ft06 benchmark with 6 jobs on 6 machines from [FT63] and the abz5 benchmark with 10 jobs on 10 machines from [ABZ88]. However, we change the classical JSP in one important aspect: Instead of optimizing the makespan, the time when the last job finishes, we target the sum of the finishing times (the tardiness if the jobs are immediately due) of all jobs. The reason for the change is to ensure that the immediate reward of actions depends only on a small number of participating components $C_a$. When optimizing the makespan, it is of course possible to return the negative increase of the maximum simulation time over all components. But then all components $C$ would participate in every action. We prefer to have only a small subset of the components participating in each action, so that the trace-based equivalence of action sequences as defined in Section 3.1 can be used to prune the sequence tree. The rollout policy $\pi$ is set to select randomly among the possible choices with a uniform distribution.

As can be seen in Figure 3.4, MCTS makes big improvements to the best-known solution early on. Depending on the exploration parameter $\alpha$, the algorithm then "converges" towards a reward value that is no longer improved upon, even with very long running times. This is not a true convergence, however, since we know that all branches of the search tree are eventually explored with the UCB rule. The more complex abz5 example shows an interesting phenomenon where a higher $\alpha$ leads to better results, but it takes longer until "convergence" is reached. This observation makes sense with regard to the UCB action selection: Once all actions at a given node have been explored several times, the UCB rule will predominantly choose actions with a known high reward. So the observed convergence is explained by a shift from exploration to exploitation in the action selection. Pruning sequences that are not in LNF speeds up the convergence to some final best reward. This was expected, as the pruning effectively reduces the size of the search tree.


**(a)** Benchmark of UCT with the JSP example ft06.

**(b)** Benchmark of UCTᴸᴺF with the JSP example ft06.

**(c)** Benchmark of UCT with the JSP example abz5.

**(d)** Benchmark of UCTᴸᴺF with the JSP example abz5.

**Figure 3.4:** JSP benchmarks for Monte-Carlo Tree Search. Every benchmark was run 10 times. The lines give the average empirical reward. The standard error is indicated by tick marks and the variance is shown in a lighter shade.

#### **Algorithm 8** UCT for Deterministic Actions with Trace-Based Pruning

```
 1: procedure UCTᴸᴺF(σ⁰)
 2:     n[·] ← 0, q[·] ← 0
 3:     while enough time do
 4:         Y ← PlayᴸᴺF(σ⁰)
 5:         UpdateᴸᴺF(Y)
 6:     return arg max_{a∈A} q[a]

 1: procedure UpdateᴸᴺF(Y)
 2:     z[·] ← 0
 3:     for (w, r) ∈ Y do
 4:         for k = |w|, …, 1 do
 5:             g ← w:k
 6:             if z[g] > 0 then
 7:                 break
 8:             n[g] ← n[g] + 1, z[g] ← 1
 9:             q[g] ← r^k + max_{a∈A, n[ga]>0} q[LNF(ga)]

 1: procedure PlayᴸᴺF(σ)
 2:     w ← ε, r ← ε, Y ← ∅
 3:     while ¬done(σ) do
 4:         B ← {b ∈ A : n[wb] = 0}
 5:         if B ≠ ∅ then
 6:             a ← π_B(σ)
 7:         else
 8:             a ← arg max_{b∈A, σ∈Σ_b} [ q[LNF(wb)] + α·√(log(n[w] + 1) / n[LNF(wb)]) ]
 9:         (σ, v) ← a(σ)
10:         Y ← Y ∪ {(wa, rv), LNF(wa, rv)}
11:         (w, r) ← LNF(wa, rv)
12:     return Y
```
## **3.3 Planning with Uncertainty and Continuous Action Parameters**

As defined in Section 2.2, parameterized actions take parameters from the set $\Theta_a$. But the algorithms from the previous Sections 3.1 and 3.2 can only handle discrete choices of deterministic actions. If the set of parameters is finite, we can simply extend the tree-search techniques to search over the joint space of actions and their parameters. This is, however, not possible if the action parameters are uncountable, for example when a parameter is defined on a continuous domain. Furthermore, while the UCB rule for bandit problems is defined for decision making in stochastic unknown environments, UCT assumes deterministic actions. This section deals with the extension of the simulation-based planning approach to actions that can both be stochastic and have parameters on a continuous parameter space.

### **Background: MCTS under Uncertainty**

For MDPs, where the outcome depends not only on decision-making but also on the response of the stochastic system model, it has been known for some time that a randomly sampled subset of the scenario tree that covers only a vanishing fraction of the full scenario is enough to compute near-optimal actions from any state [KMN02]. If $\epsilon$ is the admissible error for the estimation of the V-value of the current state (see Section 2.4), then the number of sample playouts grows exponentially as $\epsilon$ decreases. It does, however, not depend on the size and complexity of the state representation. With the advent of MCTS, sampling-based methods have also been used for sequential decision-making under uncertainty [KS06].

The POMCP algorithm (Partially Observable Monte-Carlo Planning) extends MCTS to scenarios with partial observability (POMDP) [SV10]. The authors write with regard to the performance of POMCP:

[On a benchmark problem], POMCP achieved the same performance with 4 seconds of online computation to the state-of-the-art solver SARSOP with 1000 seconds of offline computation.

While the current system state σ is known to the simulator used to sample state transitions, the decision-making relies only on the observations resulting from the actions and on the reward statistics collected in prior playouts. See Algorithm 9 for details. The search tree now not only consists of nodes representing actions, but is a bipartite graph of actions and the resulting observations. The rollout policy *π* takes the full history of actions and observations as input. A trivial rollout policy is to sample uniformly from the previously unexplored actions. This policy is a popular choice as it does not depend on prior knowledge and assumptions about the scenario and the state representation.
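The following sketch shows this trivial rollout policy in Python, assuming visit counts keyed by histories of actions and observations; the names are illustrative, not part of the original POMCP formulation.

```
import random

def uniform_rollout_policy(history, actions, n):
    """Sample uniformly among the actions not yet tried after this history.

    history: tuple of alternating actions and observations seen so far.
    n: visit counts keyed by history tuples; untried actions have no entry.
    """
    untried = [a for a in actions if n.get(history + (a,), 0) == 0]
    return random.choice(untried) if untried else random.choice(actions)
```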

## **Background: Optimistic Optimization**

Optimization of functions on a continuous domain has a long history. Very efficient solvers exist today for the optimization of convex functions where gradient information is available [BV04]. Gradient-free global optimization of non-convex functions remains challenging. Classical solution techniques are the Dividing Rectangles (DIRECT) algorithm [JPS93] as well as heuristic genetic algorithms [SP97]. While the latter provide no lower bounds on performance, the convergence speed of DIRECT depends on an upper bound of the function's gradient known as the Lipschitz bound. In recent years, the ideas for optimization in discrete bandit settings and MCTS have been translated to the optimization of continuous functions. The resulting techniques are known as Optimistic Optimization (OO). See [Mun+14] for a comprehensive account.

**Figure 3.5:** Example for Optimistic Optimization on the unit cube. The unit cube is split into cells based on a hierarchical partitioning of the domain. Every cell has an index (ℏ, *i*) that also represents a node in the partitioning tree: ℏ is the height of the tree at that node (the number of splits) and *i* is the index within the set of cells of height ℏ. The initial node (0, 1) represents the entire domain, here depicted as the unit cube. Leaf nodes are marked gray in the partitioning tree.

Suppose we want to maximize some function *f* : *X* → R over the domain *X*.

$$x^\* = \underset{x \in X}{\text{arg}\max} \, f(x) \tag{3.7}$$

If we can make assumptions of convexity, then a range of established methods and commercial tools can be used to solve the optimization problem [BV04]. OO does not require an assumption of convexity. The basic idea is to iteratively dissect *X* into disjoint cells of decreasing size that cover the entire function domain (Figure 3.5a). The hierarchy of cells forms a tree structure (Figure 3.5b). An "optimistic" upper bound is computed for every cell. This upper bound guides the continuing selection and splitting of the cells. The breakthrough of recent work is to not require a Lipschitz bound for *f* to compute the upper bound. Instead, the function is assumed to be smooth around the optimizer with respect to a semi-metric.² This is a much weaker assumption.

**Algorithm 10** Deterministic Optimistic Optimization (DOO) for the optimization of an unknown function *f* : *X* → R. This formulation is for the special case where the search domain is the *n*-dimensional unit cube *X* = [0*,* 1]*<sup>n</sup>* and cells are split into three children.


**Deterministic Optimistic Optimization** The first considered OO algorithm is Deterministic Optimistic Optimization (DOO). Algorithm 10 shows a simplified version of DOO. It assumes that the domain of *f* is the *n*-dimensional unit cube and that cells are split into three children. In every iteration, the cell with the highest upper bound is selected and split into three children. The cells are denoted as (ℏ, *i*), where ℏ is the depth of the tree and *i* is an index for the cells at the same depth.³ The set *L* contains the leaf nodes of the current search tree (cf. Figure 3.5b). The midpoint of a visited cell (ℏ, *i*) is stored as x[ℏ, *i*]. The upper bound of each cell is computed from an evaluation at the midpoint and the bias *δ*(ℏ) that depends on the depth of the cell. The choice of *δ* depends on the target function *f* and the semi-metric *ℓ*; more detail follows in the next paragraph. The dimension along which to split is determined from the tree-depth ℏ at which the cell is situated.

²A semi-metric has *ℓ*(x, y) = *ℓ*(y, x) and *ℓ*(x, y) = 0 ⇒ x = y. Different from a metric, the triangle inequality (which for norms induced by an inner product follows from the Cauchy-Schwarz inequality) is not required to hold.
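To make the selection-and-splitting loop concrete, here is a compact Python sketch of DOO on the one-dimensional unit interval with a user-supplied bias function δ(ℏ). It follows the structure of Algorithm 10 but is an illustrative simplification, not the thesis implementation.

```
def doo_maximize(f, delta, iterations=100):
    """Deterministic Optimistic Optimization on the domain X = [0, 1].

    f: deterministic target function to maximize.
    delta: bias delta(h) that upper-bounds the cell diameter under the
    semi-metric at height h. Cells are kept as (h, lo, hi) triples and,
    as in Algorithm 10, split into three children per iteration.
    """
    leaves = [(0, 0.0, 1.0)]              # the root cell covers the whole domain
    values = {(0.0, 1.0): f(0.5)}         # every cell is evaluated at its midpoint
    for _ in range(iterations):
        # Select the leaf with the highest optimistic bound f(midpoint) + delta(h).
        h, lo, hi = max(leaves, key=lambda c: values[(c[1], c[2])] + delta(c[0]))
        leaves.remove((h, lo, hi))
        third = (hi - lo) / 3.0
        for j in range(3):                # split the selected cell into three children
            clo, chi = lo + j * third, lo + (j + 1) * third
            values[(clo, chi)] = f((clo + chi) / 2.0)
            leaves.append((h + 1, clo, chi))
    (lo, hi), best = max(values.items(), key=lambda kv: kv[1])
    return (lo + hi) / 2.0, best
```

A bias such as δ(ℏ) = 3⁻ℏ matches cells whose diameter shrinks by a factor of three with every split.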

**Example 3.3.** Figure 3.6 shows the graph of two example functions we seek to maximize on the domain *X* = [0*,* 1]. Notably, the Garland function is not differentiable at some points on the domain and has no Lipschitz constant. The function is also not differentiable at the optimizer *π/*6. But there exists a semi-metric for which the Garland function is locally smooth around the optimizer.

**(a)** Sine and quadratic: $f_1(x) = 0.25 \sin(50x)\sin(10x) - (x - 0.75)^2$ with a scaled Euclidean metric fitted to the optimizer.

**(b)** Garland function: $f_2(x) = 4x(1-x)\left(\tfrac{3}{4} + \tfrac{1}{4}\big(1 - \sqrt{|\sin(60x)|}\big)\right)$ and the semi-metric $\ell(x, y) = \beta \|x - y\|^{1/2}$ fitted to the optimizer.

**Figure 3.6:** Example functions that are locally smooth around the optimizer for a semi-metric *ℓ*.

³Symbols with a crossing bar as in ℏ are used to denote height-indices in a tree.

To show convergence of DOO, the following assumptions are made for *f* and its domain [Mun+14].

1. *The function is locally smooth around the optimizer with respect to the semi-metric ℓ:* $f^* - f(\mathsf{x}) \le \ell(\mathsf{x}, \mathsf{x}^*)$ *for all* $\mathsf{x} \in X$.

2. *The bias bounds the cell diameters: for every cell* $(ℏ, i)$ *and every* $\mathsf{x} \in X_{ℏ,i}$ *it holds that* $\ell(\mathsf{x}_{ℏ,i}, \mathsf{x}) \le \delta(ℏ)$*, with* $\delta(ℏ)$ *decreasing in* ℏ.

For every region $(ℏ, i)$ containing the optimizer $\mathsf{x}^* \in X_{ℏ,i}$, we have $f(\mathsf{x}_{ℏ,i}) + \delta(ℏ) \ge f(\mathsf{x}_{ℏ,i}) + \ell(\mathsf{x}_{ℏ,i}, \mathsf{x}^*) \ge f^*$. Since the leaf nodes always cover the entire function domain, cells with $f(\mathsf{x}_{ℏ,i}) + \delta(ℏ) < f^*$ are never expanded, as they are dominated by the leaf node containing the optimizer. So the cells potentially expanded at depth ℏ are $I_{ℏ} = \{(ℏ, i) : f(\mathsf{x}_{ℏ,i}) + \delta(ℏ) \ge f^*\}$.

**(Stochastic) Simultaneous Optimistic Optimization** DOO requires no global Lipschitz bound of the target function. But it requires knowledge of a semi-metric *ℓ* for which the function is smooth around the optimizer. In many cases, *ℓ* is not known. The Simultaneous Optimistic Optimization (SOO) algorithm [Mun11] adopts ideas from the DIRECT algorithm [JPS93] to achieve nearly the same convergence results as DOO even without knowledge of *ℓ*. SOO still assumes the existence of such a semi-metric *ℓ*. But it suffices to show the existence of any such semi-metric for the convergence analysis without actually using it in the algorithm. In many cases, the function *f* itself can be used to construct a suitable semi-metric! Take any norm ‖·‖ for the function domain *X*. The semi-metric *ℓ*(x, y) for points x, y on *X* is constructed as follows. With *η* = ‖x − y‖ the distance in the domain norm, the distance under *ℓ* is the difference between the optimum and the worst value in the *η*-ball around the optimizer:

$$\tilde{\ell}(\eta) = \sup_{\|\mathsf{x} - \mathsf{x}^*\| \le \eta} \left(f^* - f(\mathsf{x})\right), \qquad \ell(\mathsf{x}, \mathsf{y}) = \tilde{\ell}(\|\mathsf{x} - \mathsf{y}\|) \tag{3.8}$$

We forego full convergence proofs at this point and refer to the original paper [Mun11]. Stochastic SOO (StoSOO) [VCM13] extends SOO to the optimization of stochastic functions. Cells are split only after *κ* evaluations. The mean of the sampled values at the cell midpoint is used for the evaluation. The authors of [VCM13] provide a convergence bound for the expected regret on the order of $O(\log^2(\iota)/\sqrt{\iota})$, where *ι* is the number of samples taken from the stochastic function *f*. The tuning parameter *η* controls how much emphasis the algorithm puts on exploration, i.e. the tradeoff between exploration and exploitation.

Algorithm 11 shows a simplified version of StoSOO for the *n*-dimensional unit cube, where cells are split into three child cells after *κ* evaluations. We now explain the major changes compared to DOO. First, the visited nodes are split into leaf nodes *L* and internal nodes *N*. The internal nodes have been sampled *κ* times and are no longer evaluated. Second, since we do not know the semi-metric *ℓ* (and hence the cell diameter *δ*), an upper confidence bias is added to the cell evaluation for the selection. Third, the algorithm iterates over the depth-levels ℏ of the cells. One cell is selected at every level (if there is an improvement compared to the previous levels) and the cell is either sampled again or split.

### **Planning for Parameterized Action Sequences**

To integrate StoSOO with MCTS, we introduce so-called hybrid trees. Hybrid trees contain nodes for actions and parameters. They are bipartite, as a parameter selection must follow an action selection and vice versa. Partially observable hybrid trees (POHT) use three types of nodes: actions, parameters and the resulting observations. See Figure 3.7 for an example. Note that hybrid trees not only grow at the leaf nodes. The parameter-nodes represent a cell in the parameter space Θ*<sup>a</sup>* of the associated action *a*. As the cells of a continuous domain can be partitioned indefinitely often, a hybrid tree can grow new branches at the parameter-nodes during planning.

**Algorithm 11** Stochastic Simultaneous Optimistic Optimization (StoSOO) on the [0*,* 1]*<sup>n</sup>* cube. Cells are split after having been sampled *κ* times.

```
1: procedure StoSOO(f, κ, ι, η)
2:   L ← {(0, 1)}, x[0, 1] ← (½, …, ½)
3:   q[0, 1] ∼ f(x[0, 1]), n[0, 1] ← 1
4:   while less than ι samples taken do
5:     q_max ← −∞
6:     for ℏ = 0, …, min(depth(L), ℏ_max) do
7:       i ← arg max_{j : (ℏ,j)∈L} q[ℏ, j] + √(log(ι²/η) / (2·n[ℏ, j]))
8:       r ← q[ℏ, i] + √(log(ι²/η) / (2·n[ℏ, i]))
9:       if r ≥ q_max then
10:        q_max ← r
11:        if n[ℏ, i] < κ then
12:          SampleStoSOO(ℏ, i)
13:        else
14:          SplitStoSOO(ℏ, i)
15:  return arg max_{(ℏ,i) : n[ℏ,i]>0} q[ℏ, i]

1: procedure SampleStoSOO(ℏ, i)
2:   y ∼ f(x[ℏ, i])
3:   n[ℏ, i] ← n[ℏ, i] + 1
4:   q[ℏ, i] ← q[ℏ, i] + (y − q[ℏ, i]) / n[ℏ, i]

1: procedure SplitStoSOO(ℏ, i)
2:   v ← max{l : x[ℏ+1, l] ≠ ∅}
3:   d ← mod(ℏ, n) + 1
4:   o ← ν_d · (1/3)^(⌊ℏ/n⌋+1)
5:   for j ∈ {1, 2, 3} do
6:     L ← L ∪ {(ℏ+1, v+j)}
7:     x[ℏ+1, v+j] ← x[ℏ, i] + (j − 2)·o
8:     SampleStoSOO(ℏ+1, v+j)
9:   L ← L \ {(ℏ, i)}
```
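The arithmetic of lines 3, 4 and 7 of SplitStoSOO can be seen in isolation in the following small Python helper that computes the midpoints of the three children of a cell. It is a minimal sketch with illustrative names.

```
def child_midpoints(x, h, n):
    """Midpoints of the three children of a cell of the [0, 1]^n cube.

    x: midpoint of the parent cell at height h (list of n coordinates).
    The split dimension cycles with the height; the offset shrinks by a
    factor of three for every full round of splits, as in SplitStoSOO.
    """
    d = h % n                              # dimension to split (0-based)
    offset = (1.0 / 3.0) ** (h // n + 1)
    children = []
    for j in (1, 2, 3):
        child = list(x)
        child[d] = x[d] + (j - 2) * offset # left, center and right child
        children.append(child)
    return children
```

For the root cell of the unit interval (`x = [0.5]`, `h = 0`, `n = 1`), this yields the midpoints 1/6, 1/2 and 5/6 of the three thirds of the domain.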

**Figure 3.7:** Partially Observable Hybrid Tree. The rectangle encompassing several circular nodes denotes a subtree for the optimization of an action parameter with optimistic optimization. The tree is not fully explored for better visual representation.

For planning in POHT, we want to combine OO with MCTS. The upper confidence bound is used to select discrete actions and OO is used to select and iteratively refine the choice of action parameters. The StoSOO algorithm samples from the stochastic target function *f* several times within the SplitStoSOO procedure. This prevents its unmodified use for simulation-based planning – without backtracking – in sequential decision-making settings. Instead, we want every call to the OO subproblem to return exactly one parameter combination *θ* to continue the playout with the following steps in the scenario. The accumulated reward is then used to update the statistics for the involved branches of the scenario tree.

Previous authors have used OO for sequential decision-making with a continuous action-space [MWL11; Bus+13; BPM18]. The TrailBlazer algorithm of [GVM16] combines discrete action selection with continuous search. But it uses backtracking to return to a previous position in the scenario tree. In this thesis, we want to avoid storing the system state for backtracking search, so that the algorithms can equally be performed with playouts in physical experiments only.

We now develop the novel Partially Observable Hybrid Tree Planning (POHTP) algorithm that combines MCTS – and in particular the approach for partially observable planning from the POMCP algorithm – with Optimistic Optimization for parameter selection on continuous domains. See Algorithm 12 for the full details. In every step of sequential decision making, the algorithm is presented with the choice of several discrete actions that are each parameterized from a continuous domain. If there is only one action, POHTP reduces to StoSOO as a special case. On the other hand, if the actions have no parameters, POHTP reduces to a variation of POMCP. The differences to the original POMCP are the update mechanism, which maximizes over possible choices to compute the V-value estimate (instead of taking the empirical reward from the previous plays), full playouts, and a switch between exploitation and exploration within each playout that is explained in the next paragraph.

In contrast to POMCP, no rollouts are used that aggregate the reward beyond the previously constructed tree. Instead, the full history of every playout is recorded and used to update the value estimates for the nodes in a second Update procedure. Furthermore, the upper confidence bound is not used for decision-making at every step. Instead, the algorithm initially takes optimal decisions (for the current value estimates) and switches to an explorative regime at the depth *d̄* of the decision tree. In the StoSOO algorithm, the dimension along which to split the current cell is determined by the depth in the search tree. The important ingredient of StoSOO for achieving fast convergence is to select cells from a specific depth in each iteration. Similar to StoSOO, POHTP selects for every iteration a depth at which the "exploration" (splitting in StoSOO) begins. The depth *d̄* for this switch of the action-selection regime is iterated together with the number of performed playouts. Every action contributes one level to the depth of the decision tree. The parameters of the action *a* contribute according to the depth of the embedded tree for the parameter selection *L*[h*a*] for the action *a* after an observed history h. The depth of the parameter-selection tree depth(*L*[h*a*]) is the number of times the smallest cell represented in the tree has been split.

#### **Algorithm 12** Partially Observable Hybrid Tree Planning (POHTP)

```
1: procedure POHTP(σ⁰)
2:   n[·] ← 0, q[·] ← 0, e[·] ← 0, L[·] ← {(0, 1)}, x[·; 0, 1] ← (½, …, ½), d̄ ← 1
3:   while enough time do
4:     h ← Play(σ⁰, d̄)
5:     Update(h)
6:     d̄ ← d̄ + 1
7:     if d̄ > log₂(n[ε]) then
8:       d̄ ← 1
9:   return arg max_{(a,θ) : n[aθ]>0} q[aθ]

1: procedure Param(ha, d̄)
2:   if d̄ > depth(L[ha]) then
3:     return arg max_{θ : ∃(ℏ,i)∈L[ha], x[ha; ℏ,i]=θ} q[haθ]
4:   else if d̄ < 1 then
5:     G ← L[ha]
6:   else
7:     G ← L[ha; d̄]
8:   (ℏ, i) ← arg max_{(ℏ,j)∈G, θ=x[ha; ℏ,j]} [ q[haθ] + α·√(log(n[ha]+1) / n[haθ]) ]
9:   θ ← x[ha; ℏ, i]
10:  if n[haθ] < κ ∧ ℏ = ℏ_max then
11:    return θ
12:  u[ha; ℏ, i] ← u[ha; ℏ, i] + 1
13:  µ ← u[ha; ℏ, i]
14:  n ← dim(Θ_a)
15:  δ ← mod(ℏ, n) + 1
16:  ξ ← |L[ha; ℏ+1]| + 1
17:  x[ha; ℏ+1, ξ] ← θ + (µ − 2)·ν_δ·(1/3)^(⌊ℏ/n⌋+1)
18:  L[ha] ← L[ha] ∪ {(ℏ+1, ξ)}
19:  if µ = 3 then
20:    L[ha] ← L[ha] \ {(ℏ, i)}
21:  return x[ha; ℏ+1, ξ]

1: procedure Update(h)
2:   for k = |h|, …, 1 do
3:     g ← h:k−1
4:     (a, θ, o, r) ← h_k
5:     n[gaθ] ← n[gaθ] + 1
6:     e[gaθ] ← e[gaθ] + (r − e[gaθ]) / n[gaθ]
7:     q[gaθ] ← e[gaθ] + ( Σ_{o∈O} q[gaθo]·n[gaθo] ) / n[gaθ]
8:     n[ga] ← n[ga] + 1
9:     q[ga] ← max_{φ∈Θ_a, n[gaφ]>0} q[gaφ]
10:    n[g] ← n[g] + 1
11:    q[g] ← max_{b∈A, n[gb]>0} q[gb]

1: procedure Play(σ, d̄)
2:   h ← ε
3:   while ¬done(σ) do
4:     B ← {b ∈ A : n[hb] = 0}
5:     if d̄ > 1 ∧ n[h] > 0 then
6:       a ← arg max_{b∈A, n[hb]>0} q[hb]
7:       d̄ ← d̄ − 1
8:       θ ← Param(ha, d̄)
9:     else if B ≠ ∅ then
10:      (a, θ) ← π_B(σ)
11:      d̄ ← d̄ − 1
12:    else
13:      a ← arg max_{b∈A} [ q[hb] + α·√(log(n[h]+1) / n[hb]) ]
14:      d̄ ← d̄ − 1
15:      θ ← Param(ha, d̄)
16:    d̄ ← d̄ − depth(L[ha])
17:    (σ′, o, r) ∼ a^θ(σ)
18:    h ← haθor, σ ← σ′
19:  return h
```

The POHTP procedure initializes the algorithm and then iterates over a series of playouts. Importantly, the exploration depth for decision-making *d̄* is cut off at a maximum decision-making depth of log₂ of the number of playout iterations.

The Play procedure simulates steps until the current scenario is "done". If a node in the search tree has not been encountered before, the policy *π* is used to select the action and its parameters. Otherwise, the action is selected via a UCB evaluation and the parameter is selected via OO.

The Update procedure stores the empirical mean reward that is directly generated by an action-parameter combination as *e*[h*aθ*]. The Q-value associated with the action-parameter combination additionally takes the expected following reward into account. This is again the empirical mean over the observations following the action-parameter combination. The Q-values for the action and for the observation maximize over the respective choices.

So the periods in the observed history are $h^k = (a^k, \theta^k, \mathsf{o}^k, r^k)$. Updating the parameterization-nodes is very similar to the updating of action-nodes in POMCP. The considered statistic simply keeps count of how often the node was visited and of the empirical reward generated in the ensuing subtree. Since action parameters are now selected as well, the rollout policy *π*(*s*, *A*) for choosing an action also returns a matching parameterization. Choosing a good rollout policy is crucial, since the optimization of parameterized actions leads to a large number of branches and the search encounters a previously unknown part of the scenario tree in most iterations.

The Param procedure returns exactly one parameter combination for the currently selected action *a*. Again, if *d̄* is higher than the depth of *L*[h*a*], then the best leaf node is returned. Otherwise, Param is in the explorative regime. If *d̄* is smaller than one, then the upper confidence bound is used to select the best parameter among all parameters in the tree. If instead *d̄* points to a depth-level inside *L*[h*a*], then a node at this depth is selected. Note that a level can become empty when all nodes in the level have been split. If *L*[h*a*] contains the indices of the leaf cells, then *L*[h*a*; *d̄*] denotes the first level containing leaf cells above or at depth *d̄*.

$$L[\mathsf{h}a; d] = \left\{ (c, i) \in L[\mathsf{h}a] : c \ge d,\ \nexists (e, j) \in L[\mathsf{h}a] \text{ with } e \in \{d, \dots, c - 1\} \right\}$$

If the thus selected parameter has been sampled fewer than *κ* times, it is returned. Otherwise, the selected cell in the parameter space is split and the current cell is removed from the list of leaf cells.

The POHTP algorithm can be combined with the pruning of equivalent sequences according to Section 3.1. The equivalence then applies not only to action sequences, but to histories where every step consists of an action, an action parameter and the resulting observation. Entire steps can be commuted if the respective actions are independent. Note that no problem-specific structure is exploited in the POHTP algorithm developed in this chapter.

### **Evaluation**

### **Swing-Up of an Inverted Pendulum**

Planning for parameterized actions is evaluated for an inverted pendulum displayed in Figure 3.8. The inverted pendulum is one of the canonical problems in the literature on optimal control [Lib11]. A pendulum is attached to a cart that is free to move in the horizontal plane. The goal is to perform a swing-up to bring the pendulum into an upright position – and keep it there – by the precise application of an accelerating force to the cart.

The full problem definition is as follows. Assume a single system component pend for the pendulum and a single action acc with a parameter Θacc = [−2 N, 2 N] for the force applied to accelerate the cart. The cart and the pendulum are approximated by point-masses *m<sub>c</sub>* and *m<sub>p</sub>* of 1 kg each. The length of the pendulum *l* is one meter. The angle of the pendulum *β* gives the difference from the upright position. The position of the cart *x* is in meters from the point of origin. The initial state of the pendulum is $s^0_{\mathsf{pend}} = (\beta = 0.5,\ \dot\beta = 0,\ x = 0,\ \dot x = 0)$. The state evolution of the pendulum is described by two coupled differential equations for the cart position *x* and the pendulum angle *β* [FYK92]:

$$\begin{aligned} \ddot{x} &= \frac{-m_p g \sin(\beta) + m_p l \sin(\beta) \dot{\beta}^2 + u}{m_c + m_p \sin^2(\beta)}\\ \ddot{\beta} &= \frac{(m_p + m_c) g \sin(\beta) - m_p l \sin(\beta) \cos(\beta) \dot{\beta}^2 - \cos(\beta) u}{l(m_c + m_p \sin^2(\beta))} \end{aligned} \tag{3.9}$$

**Figure 3.8:** The inverted pendulum problem. The goal is to perform a swing-up of the pendulum to an angle *β* = 0 and to keep the pendulum in the upright position.

The simulation time is discretized into periods of 0.2 s. The effect of the action acc is the (numerical) solution of the forward simulation of Equation (3.9) for the control value *u* given by the action parameter. The reward returned by acc is a cost term associated with the resulting pendulum state *s*′ and the energy expenditure for the control.

$$r\_{\mathbf{acc}}(u, s') = 2||\beta'|| + x + u^2 \tag{3.10}$$

By k · k we denote the angular distance from the upright position.

The control (acceleration) applied in each period is the result of POHTP with 512 playouts over a horizon of 15 steps with a 0.2s time discretization. Even though the step length is discretized, the simulation uses the Runge-Kutta method for precise forward-simulation of the underlying differential equation.

After the 512 playouts, the acceleration parameter with the best Q-value is selected and applied. The optimization is then repeated for the resulting system state. This resembles optimal control based on MPC [ML99]. But as we directly optimize on the model from Chapter 2, we make fewer limiting assumptions than traditional MPC based on convex optimization.
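A minimal sketch of the forward simulation used inside the playouts: the right-hand side of Equation (3.9) together with one classical fourth-order Runge-Kutta step. The constants follow the problem definition above; the value of g is an assumption (9.81 m/s²), and in practice one would subdivide the 0.2 s period into several integration steps.

```
import math

M_C, M_P, L, G = 1.0, 1.0, 1.0, 9.81   # cart/pendulum mass, length; g assumed

def dynamics(state, u):
    """Right-hand side of Equation (3.9) for the state (x, x_dot, beta, beta_dot)."""
    x, x_dot, beta, beta_dot = state
    s, c = math.sin(beta), math.cos(beta)
    denom = M_C + M_P * s * s
    x_ddot = (-M_P * G * s + M_P * L * s * beta_dot ** 2 + u) / denom
    beta_ddot = ((M_P + M_C) * G * s - M_P * L * s * c * beta_dot ** 2
                 - c * u) / (L * denom)
    return (x_dot, x_ddot, beta_dot, beta_ddot)

def rk4_step(state, u, dt=0.2):
    """One Runge-Kutta (RK4) step with the control u held constant."""
    def shift(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = dynamics(state, u)
    k2 = dynamics(shift(state, k1, dt / 2), u)
    k3 = dynamics(shift(state, k2, dt / 2), u)
    k4 = dynamics(shift(state, k3, dt), u)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```

A full playout then alternates `rk4_step` with the accumulation of the per-period cost of Equation (3.10).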

**Figure 3.9:** Swing-up of an inverted pendulum. The initial angle of the pendulum is at 160 degrees (*β* = 160π/180). The upper time series shows the penalization of |*β*|. The lower time series shows the penalization of |atan2(sin *β*, cos *β*)|, where "loopovers", i.e. angles at a multiple of 2π, are not penalized.

As can be seen in Figure 3.9, the POHTP algorithm achieves the swing-up and balancing of the inverted pendulum. Note that the angle plateaus at multiples of 2π due to the use of the angular distance in the cost function. Adapting the cost function with a higher penalty for the angle leads to a speedier swing-up, but at a greater cost for the control energy *u*.

### **Optimal Order Quantity under Uncertainty**

The scenario is concerned with the operations of a pencil factory. A customer places an order for 50,000 pencils with his company logo on the casing. The customer is willing to pay \$2 for each pencil. No payment is made if the order is incomplete or arrives too late. The plastic pencil casings with the logo printing are bought from a supplier that demands \$0.50 for each casing. The production costs in our factory are \$1 per pencil. So, in theory, the pencils are sold to the customer with a margin of \$0.50 each. However, some pencils do not make it through quality control. Every pencil has a 10% chance of being sorted out by an *inline quality control* system that verifies every single product. Due to the time constraints, it is not possible to reorder additional pencil casings from the supplier once production has started. The question now is: how many pencil casings should be ordered from the supplier initially in order to maximize the expected earnings?

**Figure 3.10:** Expected reward for different numbers of ordered pencil casings. The average was computed over 100 simulation runs each. The shaded area indicates the empirical standard deviation.

In order to compare the results of our tree-search planning, we implemented a simple "brute-force" solution technique: Monte-Carlo simulations for all possible solutions in the relevant range. The simulation model is as follows. Assume that *n* pencil casings have been delivered by the supplier. Production runs until either 50,000 pencils are done or more than *n* − 50,000 casings have been discarded due to quality problems (otherwise, we would continue losing money during production without ever being able to complete the order). Figure 3.10 shows the results of the Monte-Carlo simulation in the range between 55,000 and 56,000 ordered pencil casings. The results show a phase transition from virtually no samples completing the order to virtually all samples completing the order. In the transition range, only some orders are completed and this number differs between simulations. Therefore, the standard deviation of the reward is much larger in the transition range. According to the Monte-Carlo simulations, the optimal order quantity is 55,750, resulting in an expected reward of \$16,575. With fewer ordered pencil casings, there is a high likelihood of the order not completing. Every additional pencil casing incurs a higher cost than the additional likelihood of completing the order justifies.
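The brute-force baseline is easily reproduced. The following Python sketch implements one simulation run under the stated assumptions (\$2 revenue per pencil paid only on completion, \$0.50 per casing, \$1 production cost per started pencil, 10% scrap rate); the names are illustrative.

```
import random

PRICE, CASING_COST, PROD_COST = 2.0, 0.5, 1.0
ORDER, SCRAP = 50_000, 0.1

def simulate(n_casings):
    """One Monte-Carlo run: produce until the order is done or cannot complete."""
    cost = n_casings * CASING_COST
    good = used = 0
    # Stop as soon as fewer casings remain than pencils are still missing.
    while good < ORDER and n_casings - used >= ORDER - good:
        used += 1
        cost += PROD_COST
        if random.random() >= SCRAP:     # 10% of pencils fail quality control
            good += 1
    revenue = ORDER * PRICE if good == ORDER else 0.0
    return revenue - cost

def expected_reward(n_casings, runs=100):
    return sum(simulate(n_casings) for _ in range(runs)) / runs
```

Averaging `simulate` over many runs per candidate order quantity reproduces the kind of phase transition visible in Figure 3.10.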

**Figure 3.11:** Convergence speed of the optimization for the order quantity under uncertainty example.

Figure 3.11 shows the empirical reward of the best parameter after a given number of playouts.

## **3.4 Planning with Linear Actions**

Most manufacturing systems perform repetitive tasks. While lot sizes have generally become smaller, most products are still produced in bulk. When many product instances are considered individually, the action sequences can become very long. This section identifies a large class of actions where reasoning and planning for action repetition is simplified.

**Example 3.4.** Consider a stamping press that takes in raw material from an aluminum coil. The action stamp puts the produced work-pieces of type p1 into a lattice box adjacent to the press. Every execution of stamp increases the number of parts in the lattice box by one and reduces the length of the remaining aluminum coil by 2.5cm. Suppose that 20m of coil are loaded initially. How often can stamp be repeated and how many additional parts will be in the lattice box afterward? The answer is of course trivial. But how can this type of reasoning be made accessible to a planning algorithm that operates on the actions from Definition 2.4?

### **Linear Actions and Action Repetition**

Reasoning about the effects of the action stamp in Example 3.4 is easy and intuitive. The action has a fixed effect, and repeating the action *n* times multiplies the effect by *n*. We can compute the maximum number of repetitions of stamp that are possible starting from the described initial state. It is implicit in Example 3.4 that if the action stamp can be repeated *n* times, any number of repetitions between zero and *n* is also possible. We now spell out these implicit assumptions in the form of conditions that so-called *linear actions* have to conform to in addition to Definition 2.4.

**Definition 3.7.** *An action* ā *is linear if the following conditions hold.*

*1. The action has the same fixed effect for every feasible initial state.*

*2. If the action can be repeated n times, then any smaller number of repetitions is also feasible:*

$$
\sigma \in \Sigma_{\bar{a}} \land \bar{a}^n(\sigma) \in \Sigma_{\bar{a}} \Rightarrow \forall k \in \{0, \dots, n\}, \; \bar{a}^k(\sigma) \in \Sigma_{\bar{a}} \tag{3.11}
$$

*3. The action duration is identical for all feasible initial states.*

$$\forall (\sigma, \sigma') \in \Sigma_{\bar{a}} \times \Sigma_{\bar{a}},\ d_{\bar{a}}(\sigma) = d_{\bar{a}}(\sigma') \tag{3.12}$$

*4. A constant reward* r*a*¯ *is generated for every action repetition.*

Linear actions have advantages over general actions: first, once the maximum number of repetitions has been established, the preconditions do not need to be verified for every repetition. Second, the effect of repeatedly applying the action can be computed with analytical shortcuts instead of the *n*-fold composition of the effect function.
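For Example 3.4, both shortcuts amount to a single division and a single multiplication, as the following minimal Python sketch illustrates (the units and names are illustrative assumptions):

```
COIL_MM_PER_PART = 25.0   # stamp consumes 2.5 cm of coil per repetition

def max_repetitions(coil_mm):
    """Maximum feasible repetitions, computed in closed form from the precondition."""
    return int(coil_mm // COIL_MM_PER_PART)

def effect_of_repetitions(parts_in_box, coil_mm, n):
    """Effect of n repetitions as one multiplication instead of n-fold composition."""
    assert 0 <= n <= max_repetitions(coil_mm)
    return parts_in_box + n, coil_mm - n * COIL_MM_PER_PART

# With 20 m of coil loaded, stamp can run 800 times and adds 800 parts to the box.
assert max_repetitions(20_000.0) == 800
```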

The notation for the repeated application of an action resembles the notation for action parameterization. This is intentional. Repetition of linear actions is a special case in the general framework of parameterized actions. If a parameterized action is also linear, the notation $a^{\theta,n}$ indicates that the same parameter $\theta \in \Theta_{\bar a}$ is applied for each of the *n* repetitions.

The following joke from the mathematical folklore [RD05] sets the frame for discussing the composition of linear actions and the superposition of the effects.

A biologist, a physicist, and a mathematician sit in a street café watching the crowd. Across the street they see a man and a woman entering a building. Ten minutes later they reappear together with a third person.

> **Biologist:** They have reproduced.
>
> **Physicist:** The measurement wasn't accurate.
>
> **Mathematician:** If exactly one person enters the building now, it will be empty again.

The mathematician treats the operators "person entering the house" and "person leaving the house" as elements of an algebraic group (entering is the inverse of leaving). The group is indeed closed under composition. But the operator resulting from the composition does not apply to all situations. Obviously, there can be no negative number of people in the house. Translated to our case, components cannot contain a negative number of products. This universal constraint has ramifications for the definition of linear actions.

Suppose that for the considered linear action ā, the participating components $C_{\bar a}$ can hold products inside the component. So their state is described by a tuple $s = (\xi, \mathsf{p})$ for the configuration ξ and the number of contained products for every product type (cf. Section 2.1). Definition 3.7 implies a linear effect on the products contained in the components. This effect can be expressed by a fixed change vector $\boldsymbol{\delta}^{\bar a}_c \in \mathbb{Z}^{|P|}$. So for a state transition $\sigma' = \bar a(\sigma)$ and a component $c \in C_{\bar a}$, the state transition is between $s_c = (\xi, \mathsf{p})$ and $s'_c = (\xi', \mathsf{p}')$ with the product change $\mathsf{p}' = \mathsf{p} + \boldsymbol{\delta}^{\bar a}_c$.

Since the number of products in a component cannot be negative, there is a universal positivity constraint for all linear actions. The feasible states all conform to the positivity constraint $\Sigma_{\bar a} \subseteq \Sigma^{+}_{\bar a}$.

$$\Sigma^{+}_{\bar{a}} = \left\{ \sigma \in \Sigma : \forall c \in C_{\bar{a}},\ \sigma_c = ((\xi, \mathsf{p}), t),\ \mathsf{p} + \boldsymbol{\delta}^{\bar{a}}_{c} \succeq \mathbf{0} \right\} \tag{3.13}$$

For many linear actions, the condition (3.11) can be shown to hold with a convexity argument. This is illustrated by the following example.

**Example 3.5.** This example builds on the previous Example 3.4. Suppose stamp is a linear action with the participating components $C_{\mathtt{stamp}} = \{\mathsf{box}, \mathsf{press}\}$. The lattice box has no particular configuration state, so $\Xi_{\mathsf{box}} = \emptyset$. The configuration of the press $\xi_{\mathsf{press}} = (\xi_{\mathsf{tool}}, \xi_{\mathsf{coil}})$ consists of the press tooling for either product p1 or p2 and the remaining length of the coil. So the configuration space of the press is $\Xi_{\mathsf{press}} = \{\mathsf{p1}, \mathsf{p2}\} \times \mathbb{R}_{+}$. An additional condition of the press is that at least 50 cm of coil must remain after the action in order to facilitate replenishing. The condition of the lattice box is that the maximum load of 300 kg shall not be exceeded. The effect on the products in the lattice box $\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{box}} = \boldsymbol{\nu}_{\mathsf{p1}}$ contains mostly zeros with a single one-entry at the p1 position. There is no effect on the lattice box configuration. So the effect of the *n*-fold repetition on the box is $(n e_{\mathtt{stamp}})(\sigma_{\mathtt{stamp}})_{\mathsf{box}} = (\emptyset,\ \mathsf{p}_{\mathsf{box}} + n\boldsymbol{\nu}_{\mathsf{p1}})$. Let the vector µ describe the weight of every product. Then the initial states that are feasible for the lattice box are

$$\Sigma_{\mathtt{stamp};\mathsf{box}} = \left\{ \sigma \in \Sigma : \boldsymbol{\mu}^{\top} (\mathsf{p}_{\mathsf{box}} + \boldsymbol{\nu}_{\mathsf{p1}}) \le 300 \right\}.$$

The action has no effect on the products contained in the press, so $\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{press}} = \mathbf{0}$. (As before, $\mathbf{0}$ is the null-vector of appropriate size.) For simplicity, we refer to the state after executing stamp as σ′ (with analogous notation for its components). The action has no effect on the tooling of the press, $\xi'_{\mathsf{tool}} = \xi_{\mathsf{tool}}$, and the remaining coil length is reduced by a fixed amount, $\xi'_{\mathsf{coil}} = \xi_{\mathsf{coil}} - 2.5$. So the *n*-fold repetition has the effect $(n e_{\mathtt{stamp}})(\sigma_{\mathtt{stamp}})_{\mathsf{press}} = ((\xi_{\mathsf{tool}}, \xi_{\mathsf{coil}} - 2.5n),\ \mathbf{0})$. At least 50 cm of coil need to remain in the press and

$$\Sigma_{\mathtt{stamp};\mathsf{press}} = \{ \sigma \in \Sigma : \xi_{\mathsf{coil}} - 2.5 \ge 50 \}.$$

The valid initial states for stamp must lie in the feasible sets for both the press and the lattice box. In addition, the positivity constraint $\Sigma^{+}_{\mathtt{stamp}}$ demands that no component contains a negative number of products. In total, the feasible states for beginning the action are

$$\Sigma_{\mathtt{stamp}} = \Sigma_{\mathtt{stamp};\mathsf{box}} \cap \Sigma_{\mathtt{stamp};\mathsf{press}} \cap \Sigma^{+}_{\mathtt{stamp}}\ .$$

The following proof sketch shows that (3.11) holds for stamp. Let $\psi(\sigma) = (\mathsf{p}_{\mathsf{box}}, \xi_{\mathsf{coil}})$ be a projection and $X = \{\psi(\sigma) : \sigma \in \Sigma_{\mathtt{stamp}}\}$ the set of projected feasible initial states. The three constraints $\Sigma_{\mathtt{stamp};\mathsf{box}}$, $\Sigma_{\mathtt{stamp};\mathsf{press}}$ and $\Sigma^{+}_{\mathtt{stamp}}$ are then expressed as a system of linear inequalities for all $\mathsf{x} \in X$.

$$
\begin{bmatrix}
-\boldsymbol{\mu}^{\top} & 0\\
\mathbf{0}^{1\times|P|} & 1\\
\mathbf{I}^{|P|} & \mathbf{0}
\end{bmatrix} \mathsf{x} \succeq
\begin{bmatrix}
\boldsymbol{\mu}^{\top}\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{box}} - 300\\
52.5\\
\mathbf{0}
\end{bmatrix} \tag{3.14}
$$

Since *X* is equivalent to $\{\mathsf{x} \in (\mathbb{N}_0^{|P|} \times \mathbb{R}) : \text{condition (3.14) holds}\}$, the space of projected valid initial states is convex (in the sense of the integer points of a convex polyhedron). The effect of stamp on *X* is described by the affine operator $f(\mathsf{x}) = \mathsf{x} + (\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{box}}, -2.5)$. The operators stamp and *f* are related as $\psi \circ \mathtt{stamp} = f \circ \psi$. The *n*-fold application of *f* is again an affine expression in *n*. Due to the described convexity property of *X*, for all $n \in \mathbb{N}_0$ and initial state representations $\mathsf{x} \in X$, there is

$$\left(\mathsf{x} + n\begin{bmatrix}\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{box}}\\ -2.5\end{bmatrix}\right) \in X \Rightarrow \forall k \in \{0, \dots, n\},\ \left(\mathsf{x} + k\begin{bmatrix}\boldsymbol{\delta}^{\mathtt{stamp}}_{\mathsf{box}}\\ -2.5\end{bmatrix}\right) \in X\ .$$

Since for every $\mathsf{x} \in X$ there exists at least one $\sigma \in \Sigma_{\mathtt{stamp}}$ such that $\mathsf{x} = \psi(\sigma)$, the linear action stamp satisfies Equation (3.11).

#### **MILP Relaxation of the Planning Problem**

Linear actions were introduced with the promise of simplifying reasoning and planning for action sequences with many repetitions. Now we relax the planning problem with linear actions so that it can be solved as a Mixed-Integer Linear Program (MILP) [BW05]. The MILP formulation can be solved with off-the-shelf solvers [Gur16]. In contrast to MCTS, the planning complexity for the relaxed planning problem is mostly independent of the number of repetitions of each action. Used as part of a rollout policy, the MILP relaxation allows scaling to scenarios with hundreds of individual products that are considered at once. On the downside, it imposes limits on the model dynamics that can be represented.

**Assumption 3.8.** *In the remainder of this chapter, the following two assumptions are made.*

*1. All actions are linear according to Definition 3.7.*

*2. Actions can be executed "in parallel" on the same components: only the number of repetitions of each action is planned, not the order of their execution.*

In Example 3.5, suppose a second action take that takes one finished piece out of the lattice box. In order to take out 500 pieces, we need to run stamp 500 times as well. Now that we assume actions can run "in parallel" on the same components, how can the preconditions for the feasible initial states be represented? So far we have worked with the feasible initial states Σ*a*. For linear actions, this can be transformed to the set of feasible post-states Γ*<sup>a</sup>* ⊆ Σ.

$$\Gamma\_a = \{ \sigma \in \Sigma : \exists \omega \in \Sigma\_a, \ \sigma = a(\omega) \}\tag{3.15}$$

For linear actions with fixed effect vectors $\boldsymbol{\delta}^{a}_{c}$, the conversion between $\Sigma_a$ and $\Gamma_a$ is achieved by a simple translation of the constraints describing the set $\Sigma_a$. Instead of tracking the feasible initial states before the execution of the actions, we only demand that the final system state, when each action has been repeated the desired number of times, is a feasible post-state for all actions.

Every action and every component is assigned an index from {1, …, |*A*|} and {1, …, |*C*|} respectively. Let $\mathsf{x} = (\mathsf{p}_c)_{c \in C}$ denote the concatenated column vector of the products initially contained in the different components. So x is a vector with |*C*||*P*| elements. The second statement of Definition 3.7 demands that a feasible *n*-fold repetition implies that any number of repetitions between zero and *n* is feasible. Since the effect on the number of contained products (and only these are considered here) is linear, the constraints for the feasible initial states must be the intersection of a convex set with the set of integer vectors. Otherwise, it would be possible to find a system state σ where the action *a* in question can be executed *n* times but not *n* − 1 times. As the constraints encoded in $\Sigma_a$ (and therefore also $\Gamma_a$) are convex in that sense, they can be represented as an intersection of half-spaces via a set of linear inequalities [BV04], defined by a matrix $\mathsf{H}^a$ and a vector $\mathsf{g}^a$ such that $\mathsf{H}^a \mathsf{x} \succeq \mathsf{g}^a$.

In a production scenario, most actions are associated with costs for material, energy, workers' wages, and so on. But some actions have a positive reward, such as finishing an order for a customer. After completing the order, we could make more products. But if the customer won't pay for them, they only incur costs. This is represented in the MILP as follows: denote with the vector n the number of repetitions for each action. The vector r contains the per-repetition reward of every action, where costs enter as negative rewards.

Goals are defined by a number of target repetitions $\mathsf{n}^g \in \mathbb{N}_0^{|A|}$ for every action. Each repetition of the action *a* up to $n^g_a$ yields the additional goal reward $\mathsf{r}^g_a$. The total goal reward for the repetitions n – in addition to the reward generated from each action's fixed reward $\mathsf{r}_a$ – is

$$\sum\_{a \in A} \left[ \min \{ n\_a, n\_a^g \} \text{ } \mathbf{r}\_a^g \right]. \tag{3.16}$$

The MILP computes (3.16) by the introduction of a slack variable v that counts the missing repetitions for each action according to the goal definition. The objective function takes the reward for reaching all goals and subtracts the missing repetitions according to the slack variable.

The column vector $\boldsymbol{\delta}_a \in \mathbb{Z}^{|C||P|}$ describes the effect of action *a* on the products contained in all components. These effect vectors are assembled into a matrix $\Delta \in \mathbb{Z}^{|C||P| \times |A|}$ for the effects across all actions.

$$\delta\_a = (\delta\_c^a)\_{c \in C}, \quad \Delta = [\delta\_1, \dots, \delta\_{|A|}] \tag{3.17}$$

The post-state after executing all action repetitions n is $\mathsf{x}' = \mathsf{x} + \Delta \mathsf{n}$. Optimizing the repetitions to maximize the reward under the defined constraints then gives the MILP formulation:

$$\bar{V}(\mathsf{x}, \mathsf{n}^g, \mathsf{r}^g) = \max_{\mathsf{n} \in \mathbb{N}_0^{|A|}} \left[ \mathsf{n}^\top \mathsf{r} - \mathsf{v}^\top \mathsf{r}^g \right] + \mathsf{n}^{g\top} \mathsf{r}^g \tag{3.18}$$

such that

$$x + \Delta n = x'\tag{3.19}$$

$$H^a x' \succeq g^a, \qquad \forall a \in A \tag{3.20}$$

$$n + v \succeq n^g \tag{3.21}$$

$$x' \succeq \mathbf{0}, \quad n \succeq \mathbf{0}, \quad v \succeq \mathbf{0} \tag{3.22}$$

$$\mathbf{x}' \in \mathbb{R}\_0^{|C||P|}, \mathbf{n} \in \mathbb{N}\_0^{|A|}, \quad \mathbf{v} \in \mathbb{R}\_0^{|A|} \tag{3.23}$$

The function $\bar V$ approximates the value of the system state x (considering only the contained products) for optimal decision making in the MILP relaxation. The goal is to maximize the reward, including the goal reward. The constant additive term $\mathsf{n}^{g\top} \mathsf{r}^g$ can be removed for the actual optimization. But it is required to recover the actual *V*-value of the relaxed planning problem including the goal reward. Lagrangian relaxation is used to penalize when an action *a* is repeated fewer than $n^g_a$ times. The slack vector v gives the number of repetitions lacking for every action. (If the goal is met, the slack variable is zeroed out by the optimizer.) In practice, the dimensionality of x and ∆ can be reduced by considering only the products that are actually referred to by the linear actions. The formulation can further be extended with a maximum number of repetitions $n_{\max}$ for each action; this bound is only introduced to model binary values, such as a vector m of binary indicators for a fixed reward that is incurred if an action is repeated at least once.

The constraints of the optimization are as follows. The post-state x′ after all repetitions have been executed is computed in (3.19). The post-conditions of all actions must hold simultaneously for x′ according to (3.20). In (3.21), the slack v is set to the number of missing repetitions according to the goal definition. The constraints in (3.22) ensure that the number of products in the final state, the number of repetitions and the slack repetitions are all non-negative. In (3.23), the repetitions n are required to be integers. The number of remaining products x′ and the slack v are real-valued. But they will only take on integer values, since the repetitions are natural numbers and the system dynamics in ∆ lead to integer changes.
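As an illustration, the following sketch states the relaxation (3.18)–(3.23) with the open-source pulp modeling library (the thesis uses Gurobi instead); the data layout and function name are illustrative assumptions.

```
import pulp

def milp_relaxation(x0, delta, r, H, g, n_goal, r_goal):
    """Relaxed planning problem, Equations (3.18)-(3.23), solved as a MILP.

    x0:     initial products per component/product entry (length |C||P|).
    delta:  effect matrix with delta[p][a] = change of entry p per repetition.
    r:      per-repetition reward of every action (costs are negative).
    H, g:   post-state constraints H x' >= g, one row per inequality.
    n_goal, r_goal: target repetitions and per-repetition goal reward.
    """
    A, P = range(len(r)), range(len(x0))
    prob = pulp.LpProblem("relaxed_planning", pulp.LpMaximize)
    n = [pulp.LpVariable(f"n_{a}", lowBound=0, cat="Integer") for a in A]
    v = [pulp.LpVariable(f"v_{a}", lowBound=0) for a in A]       # slack, (3.21)/(3.22)
    x1 = [pulp.LpVariable(f"x_{p}", lowBound=0) for p in P]      # post-state, (3.22)
    # Objective (3.18) without the constant goal-reward term.
    prob += (pulp.lpSum(r[a] * n[a] for a in A)
             - pulp.lpSum(r_goal[a] * v[a] for a in A))
    for p in P:                                                  # post-state (3.19)
        prob += x1[p] == x0[p] + pulp.lpSum(delta[p][a] * n[a] for a in A)
    for row, rhs in zip(H, g):                                   # post-conditions (3.20)
        prob += pulp.lpSum(row[p] * x1[p] for p in P) >= rhs
    for a in A:                                                  # missing repetitions (3.21)
        prob += n[a] + v[a] >= n_goal[a]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    constant = sum(n_goal[a] * r_goal[a] for a in A)             # recover (3.18)
    return [int(pulp.value(n[a])) for a in A], pulp.value(prob.objective) + constant
```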

For every action selection in the rollout policy, the relaxed planning problem is solved. Actions that are slated for zero repetitions by the relaxed solution and actions that are not immediately executable due to their preconditions are ignored. A sensitivity analysis is performed for the remaining actions: for every action, a modified version of the original MILP is solved where the number of repetitions of that action is fixed to one less than in the original solution. The difference in the objective function grades the "importance" of that action. The rollout policy then returns the action with the highest importance and the number of repetitions chosen for that action by the MILP relaxation.

**Algorithm 13** Rollout policy for linear actions. Takes as input the current state and returns a linear action for the next step and its repetitions.

```
1: procedure π_linear(σ, h, n^g, r^g)
2:   x ← (p_c)_{c∈C}                    ▷ State vector of contained products in all components
3:   n* ← V̄(x, n^g, r^g)                ▷ Optimal repetitions in the MILP relaxation
4:   B ← {b ∈ A : σ ∈ Σ_b ∧ n*_b > 0 ∧ TestLNF(h, b)}
5:   b ∼ U(B)                           ▷ Uniform sampling among the eligible actions
6:   n′ ← max{m ∈ {1, …, n*_b} : b^(m−1)(σ) ∈ Σ_b}
7:   return (b, n′)
```

The above description of the setting can be translated into a MILP. Since all actions are linear, no additional relaxations are required besides the assumption that actions can execute "in parallel". By using the solution of the linear program to guide the rollout, the global optimum is found already in the first rollout.

### **Evaluation**

Consider a simplified supply chain for the production of mobile phones. See Figure 3.12 for an overview. The OEM (Original Equipment Manufacturer) owns the phone brand as well as production sites for soldering, assembly and packaging. Parts are bought from suppliers. The final phone is assembled from a case, a battery, a screen and a PCB (printed circuit board) with a chipset soldered on. If the production capacity of the OEM is insufficient, assembled phones can be bought from an external contract manufacturer. The costs for soldering, assembly and packaging are \$10 each. Transportation costs are not considered in the example. The following prices are demanded by the suppliers.


• Case: \$10

• Battery: \$30

• Screen: \$30

• Chipset: \$20

• Assembled Phone: \$150

The supplier PCB1 has limited stock and can deliver at most 1,000 PCBs. The chipset supplier has a limited stock of 5,000 remaining chipsets. As a consequence, the first 1,000 phones cost \$122 each to make (bill of materials and production costs). The PCBs for additional phones have to be bought from the alternative supplier PCB2 at a higher price. These phones then cost \$125 each to make. For more than 5,000 phones, the required chipset is no longer available. But assembled phones can still be bought from the contract manufacturer. This comes at an increased cost of \$160 per phone: \$150 for the phone and \$10 for branding and packaging. So buying from the contract manufacturer is more expensive than producing the phones in the OEM's production facilities. It might however be required to buy assembled phones in order to complete a large order.

Consider now a scenario where a customer orders 6,000 phones for the price of \$175 each. What are the maximum earnings (revenue minus cost) the OEM can achieve in this scenario? For this supply-chain example, the optimization problem was solved exactly by the MILP. Hence the decisions by the rollout policy immediately led to the globally optimal action and parameter sequence. (This is not the case for all planning problems, for example when only a subset of the actions is linear.) The MILP was solved with the commercial solver Gurobi [Gur16]. For details, refer to the literature on the optimization of convex functions and optimization over integers [BV04; BW05]. The phones are sold for \$1,050,000 and were produced at a cost of \$762,000. This leaves a profit margin of \$288,000.

**Figure 3.12:** Supply chain example. The arrows denote the possible number of transported products between production sites and suppliers.

# **4 Distributed Planning for Self-Organizing Production Systems**

*Outside the firm, price movements direct production, which is co-ordinated through a series of exchange transactions on the market. Within a firm, these market transactions are eliminated and in place of the complicated market structure with exchange transactions is substituted the entrepreneurcoordinator, who directs production. It is clear that these are alternative methods of coordinating production. Yet, having regard to the fact that, if production is regulated by price movements, production could be carried on without any organization at all, might we ask, why is there any organization?*

*Ronald H. Coase [Coa37]*

The coordination of industrial production is historically performed either by a central planner or by market mechanisms. The former is fraught with the problem of keeping the model used for planning up-to-date and with the complexity of planning itself. The latter has the problem of suboptimal solutions arising from market-based coordination. The core idea of markets is to have selfish participants maximize their personal gain. Under some technical conditions, markets are "efficient" for the incorporation of information into prices and for the allocation of goods according to a preference function of the buyers [MF70]. From Game Theory, we know of the existence of suboptimal equilibria in situations with competing agents, where no participant has an incentive to change his strategy even though an equilibrium with higher overall welfare exists [Nas51]. In this thesis we instead assume cooperating agents that aim to jointly maximize the overall reward.

This chapter extends the model from Chapter 2 to include multiple agents that coordinate their actions in a distributed fashion. Afterwards, the POHTP algorithm from Chapter 3 is adapted to the distributed setting. The result is a distributed planning algorithm where agents exchange messages for coordination via "utility propagation". The postulate for this chapter is the following:

> *Independent agents can jointly perform planning in a production scenario where every agent only has a simulation model of the system part in his visible scope.*

## **4.1 Background: The Generalized Distributive Law**

Judea Pearl introduced Belief Propagation (BP) as a way to efficiently compute inference tasks on (conditional) probability distributions [Pea88]. The algorithms that perform BP have become known as "message passing" algorithms, since they are based on the exchange of messages representing conditional distributions between nodes in a graph [KF09]. In the years following the publication of [Pea88], the similarities between BP and preexisting techniques in different scientific communities were discovered. The common core of these techniques has been developed into the Generalized Distributive Law (GDL) family of algorithms [AM00; KFL01]. The GDL comprises as special cases the Baum-Welch algorithm for state estimation in Hidden Markov Models [Wel03], the Max-Plus algorithm for finding the maximum a-posteriori event in a probability distribution, Turbo codes for error correction on noisy communication channels, and many more. We now summarize message passing for the distributed optimization of a function. Full proofs are omitted here. They can be found, together with more pointers to the literature, in [AM00; KF09].

The function $g : X \to \mathbb{R}$ is defined on the domain *X*. For simplicity, let *X* contain only a finite number of members. The domain decomposes into variables *v* such that $X = \times_{v \in V} X_v$. We write x ∈ *X* for the vector with entries $x_v \in X_v$.

**Figure 4.1:** Factor graph for a problem decomposition. Factors that share a variable are neighbors and connected by an edge. The subscript indicates the variables in the domain of the factor.

The function *g* decomposes into a sum of factors *f* ∈ *F*.

$$g(x) = \sum\_{f \in F} f(x\_f)$$

Every factor depends on a subset of the variables $V_f \subseteq V$ with $X_f = \times_{v \in V_f} X_v$. In the context of a vector x, the projection of x onto the variables in the domain of *f* is written as $\mathsf{x}_f$. The goal is to find the maximizer $\mathsf{x}^* = \arg\max_{\mathsf{x} \in X} g(\mathsf{x})$. Usually, the optimization for each individual factor $\arg\max_{\mathsf{x}_f \in X_f} f(\mathsf{x}_f)$ is much easier, as it only has to consider a fraction of the full domain *X*.

Now we constrain the domain of the optimization by fixing some of the variables. Assume that the variables in the scope of the factor *f* have been fixed to some $\mathsf{y}_f \in X_f$. The optimization problem with this additional constraint is said to be conditioned on $\mathsf{y}_f$:

$$\underset{x \in X \mid y\_f}{\text{arg}\max} \, g(x) \tag{4.1}$$

Since the domain *X<sup>f</sup>* is finite, we can write a table with the results of Equation 4.1 for each y*<sup>f</sup>* ∈ *X<sup>f</sup>* . This changes the perspective of the optimization. We can ask which y*<sup>f</sup>* ∈ *X<sup>f</sup>* is best, knowing what the optimal "reaction" will be. Tabular representations of this kind (for finite domains) are the messages that are exchanged in the message passing algorithms.

If the factors form a tree-graph, then the computational effort for solving the optimization problem can be reduced drastically. Figure 4.1 represents the factors of an example problem as nodes. The factor name indicates the variables in the factor domain. So the factor $f_{bd}$ has the domain $X_{f_{bd}} = X_b \times X_d$. This tree structure is a so-called junction tree with respect to the variables of each factor [Cow+99]. In a junction tree, nodes can be connected (are neighbors) only if they share at least one variable. Nodes that share a variable do not have to be connected directly. But all nodes on the path between two nodes that share a variable must also refer to that variable. An example in Figure 4.1 are the nodes $f_{ab}$ and $f_{bd}$. They share the variable *b*. So all nodes on a path between $f_{ab}$ and $f_{bd}$ must have *b* in their domain for the graph to be a junction tree.

On a tree-graph, every edge separates the tree into two otherwise unconnected halves. Take the edge $(f_{ab}, f_{bc})$ in Figure 4.1. Cutting the edge splits the set of factor functions into disjoint sets $F = F_{ab} \cup F_{bc}$. This results in two smaller optimization problems $g_{ab}(\mathsf{x}) = \sum_{f \in F_{ab}} f(\mathsf{x}_f)$ and $g_{bc}(\mathsf{x}) = \sum_{f \in F_{bc}} f(\mathsf{x}_f)$. Note that, given a fixed assignment to $x_b$, the two subproblems are "conditionally independent" from one another. That is, for a fixed $x_b$, the overall optimization problem can be solved by optimizing each subproblem individually and merging the partial solutions.

By convention, we denote the nodes representing factor functions as *i* and *j* ∈ *N*(*i*). The set *N*(*i*) contains the neighbors with a direct edge to *i*. The subtree behind the edge (*i*, *j*) on the side of *i* contains the factor functions $F_{i \to j} \subseteq F$. The factors *i* and *j* share the variables $V_{ij} = V_i \cap V_j$ with the domain $\mathsf{x}_{ij} \in X_{ij}$. The message sent from *i* to *j* then is (a tabular representation of) a function $\mathfrak{m}_{i \to j} : X_{ij} \to \mathbb{R}$. It contains the value of the best-possible solution for the subproblem with the factors $F_{i \to j}$, conditioned on the shared domain.

$$\mathfrak{m}_{i \to j}(\mathsf{y}_{ij}) = \max_{\mathsf{x} \in X \mid \mathsf{y}_{ij}} \sum_{f \in F_{i \to j}} f(\mathsf{x}_f) \tag{4.2}$$

In the junction tree, the messages $\mathfrak{m}_{i \to j}$ can be computed in such a way that the computation at every node *i* only considers the domain $X_i$. The result of Equation 4.2 can then be computed by only considering the local factor and the received messages:

$$\mathfrak{m}_{i \to j}(\mathsf{y}_{ij}) = \max_{\mathsf{x}_i \in X_i \mid \mathsf{y}_{ij}} \left[ f_i(\mathsf{x}_i) + \sum_{l \in N(i) \setminus \{j\}} \mathfrak{m}_{l \to i}(\mathsf{x}_{il}) \right] \tag{4.3}$$

In a so-called forward-backward pass, the messages are first sent out by the edge nodes with only one neighbor. Here, computing the message with Equation 4.2 is straightforward, as it only requires access to the factor function of the node itself. Other nodes compute and send out a message to their neighbor *j* as soon as a message has arrived from all other neighbors (not considering the receiving neighbor *j*). In a tree-graph, the exchanged messages converge after a forward-backward pass, when a message has been sent over every edge in both directions. Every node *i* then chooses the solution

$$\mathop{\arg\max}\_{\mathbf{x}\_{i}\in X\_{i}}\left[f\_{i}(\mathbf{x}\_{i}) + \sum\_{l\in N(i)} \mathfrak{m}\_{l\to i}(\mathbf{x}\_{il})\right] \tag{4.4}$$

from his domain $X_i$. If the solution of Equation 4.4 is unique at every node, then the nodes agree with respect to the assignment of the shared variables and the joint assignment of values to x is optimal. Additional communication is required to break ties. In practice, small random disturbances added to the values of the exchanged messages prevent ties effectively. Message passing generally works on loopy graphs as well. Convergence is then not guaranteed. Still, the results are often surprisingly good. If convergence in a loopy graph is achieved, the solution quality can be characterized in terms of the so-called Bethe free energy [YFW01].
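A compact Python sketch of this computation for Max-Plus on a junction tree of factors over finite domains is given below. The recursive `message` function implements Equation 4.3; all names and the two-factor usage example are illustrative.

```
from itertools import product

def shared_vars(scopes, i, j):
    """Variables shared between the factors i and j, in a canonical order."""
    return sorted(set(scopes[i]) & set(scopes[j]))

def assignments(variables, domain):
    """Enumerate all assignments of the finite domain to the given variables."""
    for values in product(domain, repeat=len(variables)):
        yield dict(zip(variables, values))

def message(i, j, factors, scopes, nbrs, domain, cache):
    """Max-Plus message m_{i->j} (Equation 4.3): the best value of the subtree
    behind i, conditioned on each assignment of the variables shared with j."""
    if (i, j) not in cache:
        table = {}
        for x in assignments(sorted(scopes[i]), domain):
            value = factors[i](x) + sum(
                message(l, i, factors, scopes, nbrs, domain, cache)[
                    tuple(x[v] for v in shared_vars(scopes, l, i))]
                for l in nbrs[i] if l != j)
            key = tuple(x[v] for v in shared_vars(scopes, i, j))
            table[key] = max(table.get(key, float("-inf")), value)
        cache[(i, j)] = table
    return cache[(i, j)]

# Two factors sharing the variable b: f_ab(a, b) and f_bc(b, c) on binary domains.
scopes = {0: ("a", "b"), 1: ("b", "c")}
factors = {0: lambda x: float(x["a"] != x["b"]),   # illustrative payoffs
           1: lambda x: float(x["b"] == x["c"])}
nbrs = {0: {1}, 1: {0}}
m = message(1, 0, factors, scopes, nbrs, (0, 1), {})
# Node 0 then maximizes its local factor plus the incoming message (Equation 4.4).
best = max(assignments(("a", "b"), (0, 1)),
           key=lambda x: factors[0](x) + m[(x["b"],)])
```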

The algorithm just presented is known under the name "Max-Plus" or "Max-Sum", according to the operations for joining partial solutions and for marginalization. The general approach works in any algebraic semiring where the operators max and + are replaced by their respective counterparts. The underlying principle of the message-passing algorithm is traced back to the distributive property of the two operators of the semiring, hence the name Generalized Distributive Law. In this text, we only consider the GDL for finite domains. See the publication [WJ+08] for the application of the GDL to inference on continuous probability distributions from the exponential family.

# **4.2 A Model of Distributed Concurrent Production Systems**

The established models for multi-agent coordination decompose the planning problem in such a way that the sets of actions available to each agent are disjoint. Compare for example the popular MA-STRIPS [BD08] and DEC-POMDP [BZI00] models. We take a different route. The actions available to the agents and the components visible in their scopes overlap. This overlap is the common language required for coordination. Informally speaking, the overlap acts as the "hinge" between the per-agent models.

**Definition 4.1.** *A distributed planning problem is represented by a tuple*

$$\left(C, A, \sigma^0, I, \{C_i, A_i, \mathfrak{r}_i\}_{i \in I}\right).$$

The definitions of the set of components $C$, the actions $A$ and the initial system state $\sigma^0$ are identical to the central planning problem from Chapter 3. The additional agents $i \in I$ each have a limited scope with regard to the part of the overall system that is visible to them. This is reflected in the visible components $C_i \subseteq C$ and the actions $A_i \subseteq A$ with parameters from $\Theta_i$. The joint state of the components in the scope of $i$ is $\sigma_i$ with the state-space $\Sigma_i = \Sigma_{C_i}$. Each agent has a private reward function $\mathfrak{r}_i : \Sigma_i \times A_i \times \Theta_i \times \Sigma_i \to \mathbb{R}$. The components and actions in the scope of two agents $i$ and $j$ can overlap. The shared components and actions are $C_{ij} = C_i \cap C_j$ and $A_{ij} = A_i \cap A_j$. Two agents are considered neighbors if they share a component in their scope. The set $N(i)$ contains the neighbors of the agent $i$. The neighbor relation is of course symmetric: $j \in N(i) \Leftrightarrow i \in N(j)$.

$$N(i) = \{ j \in I \setminus \{i\} : C_i \cap C_j \neq \emptyset \}$$
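As a minimal illustration, the neighbor relation can be computed directly from the component scopes. The scopes below are hypothetical and loosely follow the upcoming Example 4.1:

```python
# Hypothetical agent scopes; agents are neighbors iff their scopes overlap.
scopes = {"prod": {"machine", "box"}, "pack": {"box", "robot"}}

def neighbors(i):
    """N(i): all other agents with an overlapping component scope."""
    return {j for j in scopes if j != i and scopes[i] & scopes[j]}

print(neighbors("prod"))  # {'pack'} -- the shared component is the box
```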

**Example 4.1.** Consider again the manufacturing scenario from Example 2.2. Now, two agents jointly control the system. One agent is responsible for production and the other for packaging. Each agent has only a subset of the system components in his local scope. The lattice box in the middle is in the scope of both agents. See Figure 4.2 for details.

The agents see all actions in which a component from their scope participates. Therefore, $A_{\text{prod}} = \{\text{produce}, \text{put}, \text{take}\}$ and $A_{\text{pack}} = \{\text{put}, \text{take}, \text{package}\}$. But since the agent prod has no visibility of the packaging robot, he can only have a partial view on the action take. The same is true for the pack agent and the action put.

Every agent $i$ is equipped with a simulation model of the components and actions in his scope. An agent may have actions $a$ in his scope where some of the components participating in the action are outside of the scope of $i$, so $C_a \not\subseteq C_i$. It would be preferable that all actions fit entirely into the scope of every participating agent. But then we could not correctly model interactions across the boundaries of an agent scope. Take the situation of Example 4.1. Products are moved from the machine tool to the packaging robot. For this, the products leave the scope of the agent prod and enter the scope of the agent pack. The component of the lattice box is shared by both agents. More formally, $\text{box} \in (C_{\text{prod}} \cap C_{\text{pack}})$. The agent prod must be able to predict – in his private model of the system dynamics – when the lattice box will be free again. But he does not see the packaging robot that takes out products. If the action representations for the individual agents were to include all participating components, and the agents know all actions that act on components in their scope, then the agents would have to keep all components $C$ in their scope.

To overcome this, each agent has an internal representation of the actions that only describes the preconditions and effects on components in the agent's scope. In Example 4.1, the agent prod has a partial representation of the action take so that he can work with action sequences where multiple products are sequentially moved to pack. From the perspective of the agent prod, the products simply disappear when they are actually moved to the component package. The remainder of this section spells out the assumptions required for the agents' individual partial system models to be mutually compatible. This is required for the distributed planning methods introduced in the later sections of this chapter.

**Definition 4.2.** *Based on Definition 2.4, the projection of an action a to the scope of an agent i is*

$$a\_i = (C\_{a,i}, \bar{\Sigma}\_{a,i}, \mathbf{e}\_{a,i}, d\_{a,i})$$


Similar to the global action definition, per-agent actions $a_i$ are operators on the domain $\Sigma_{a,i} = \{\sigma_i \in \Sigma_i : \Pi_{C_{a,i}}(\sigma_i) \in \bar{\Sigma}_{a,i}\}$. The full operator signature is $a_i : \Sigma_{a,i} \to \Sigma_i$.

**Assumption 4.3.** *If a component in the scope of an agent i participates in an action a, then the projected action a<sup>i</sup> is in the scope of that agent.*

$$\forall i \in I, \ \forall a \in A: \ C_{a,i} \neq \emptyset \Rightarrow a_i \in A_i$$

The participating agents of the action $a$ are those with at least one participating component in their scope, $I_a = \{i \in I : C_{a,i} \neq \emptyset\}$. From the definition of $\bar{\Sigma}_{a,i}$, the projected action imposes fewer constraints on the feasible initial states. The inverse projection of the feasible set is $\Pi^{-1}_{C_{a,i}}(\bar{\Sigma}_{a,i}) = \{\sigma \in \Sigma : \Pi_{C_{a,i}}(\sigma) \in \bar{\Sigma}_{a,i}\}$. Since fewer constraints are imposed on the feasible initial states, $\Sigma_a \subseteq \Pi^{-1}_{C_{a,i}}(\bar{\Sigma}_{a,i})$. So there may be global states $\sigma$ where agent $i$ believes $\sigma_i$ to be feasible for his projected action $a_i$, so that $\sigma_i \in \Sigma_{a,i}$, but which are not feasible for the original action, $\sigma \notin \Sigma_a$. In order to prevent the agents from jointly selecting an action that is infeasible for the current global system state $\sigma$ (and possibly damaging equipment or endangering human operators), we assume that the action definitions and the decomposition into agents do not lead to infeasible action selections if the agents jointly agree on the feasibility.

**Assumption 4.4.** *If the participating agents i* ∈ *I<sup>a</sup> agree that action a is feasible based on their individual projected action a<sup>i</sup> with preconditions* Σ*a,i, then the action is also globally feasible.*

$$\left(\bigcap\_{i \in I\_a} \Pi\_{C\_{a,i}}^{-1}(\bar{\Sigma}\_{a,i})\right) \subseteq \Sigma\_a$$

Assumption 4.4 implies restrictions for the possible preconditions with respect to timing and synchronization between agents. The feasibility of an action cannot depend on the simulation time of a participating component outside the scope of a participating agent. Otherwise, it would be possible to construct situations where all agents jointly, but incorrectly, believe an action to be feasible. Suppose a situation where molten iron ore is transferred from a component exclusively in the scope of agent $i$ to a component exclusively in the scope of agent $j$. The global action for the transfer rightly imposes constraints on the simulation time to ensure that the component with the molten ore does not idle for too long. But this timing constraint cannot be represented in the projected action of either agent, violating Assumption 4.4. If timing conditions are critical, then all participating components need to be in the scope of all participating agents.

In addition to the assumptions for the feasible initial states, we limit the effect so that the post-state of the components $C_{a,i}$ is correctly predicted by $a_i$. For deterministic scenarios this means that the effect on the components $C_{a,i}$ follows from the initial state of the $C_{a,i}$ alone. In stochastic scenarios, the distribution over the post-states of the components $C_{a,i}$ is conditionally independent from components outside of $C_{a,i}$.

**Assumption 4.5.** *If an action a has a participating agent i, then the effect on the components Ca,i is determined by the initial state of the components Ca,i only. If the action is deterministic, then*

$$\forall \sigma, \omega \in \Sigma_a: \ \sigma_{a,i} = \omega_{a,i} \Rightarrow \Pi_{C_{a,i}}(\mathbf{e}_a(\sigma)) = \Pi_{C_{a,i}}(\mathbf{e}_a(\omega)) \,.$$

*If the action is stochastic, so that the post-state and observations are sampled as* $(\sigma', \mathbf{o}_a) \sim a(\sigma)$ *with* $s'$ *the untimed state of the components from* $\sigma'$*, then*

$$(s'_{a,i}, \mathbf{o}_{a,i} \mid \sigma_{a,i}) \perp \sigma_{C \setminus C_i} \,.$$

Note that Assumption 4.5 restricts the effect on the resulting state $s_{a,i}$, but not on the resulting simulation time of the components. This allows the time synchronization of components beyond the scope of a single agent.

Last, we require that the reward generated by the actions does not depend on the simulation time of the components. This will become important later on, when the agents predict their reward based on an internal simulation model that is restricted to their scope.

**Assumption 4.6.** *For any two system states* $\sigma$ *and* $\sigma'$ *where the untimed component states are identical,* $s = s'$*, the reward from any feasible action* $a$ *is identical:*

$$\mathbf{r}(\sigma, a, \theta, a(\sigma)) = \mathbf{r}(\sigma', a, \theta, a(\sigma')) \,. \tag{4.5}$$

If an action $a$ is completely outside the scope of agent $i$, then the projection is the identity action $\varepsilon$ – also used to denote an empty action sequence. The identity action can simply be omitted in an action sequence.

$$C_a \cap C_i = \emptyset \Rightarrow a_i = \varepsilon$$

Action sequences $\mathbf{w}$ are projected to the scope of agent $i$ as

$$\Pi_i(\mathbf{w}) = \mathbf{w}_i = (w_i^k)_{k \in \{1, \ldots, |w|\}} \,.$$

Since actions project to the identity $\varepsilon$ for an agent $i$ that is not participating in them, the sequence $\mathbf{w}_i$ may contain fewer elements than $\mathbf{w}$. We continue to use the same index notation $k$ for both global and per-agent action sequences and make the number of sequence members explicit only when this is needed.
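A small sketch of this sequence projection, assuming actions are records with a set of participating components (the names follow Example 4.1; the representation itself is an assumption for illustration):

```python
# Projecting an action sequence to an agent scope: actions without a
# participating component in C_i project to the identity and are dropped.
def project_sequence(w, C_i):
    return [a for a in w if set(a["components"]) & set(C_i)]

w = [{"name": "produce", "components": {"machine"}},
     {"name": "put",     "components": {"machine", "box"}},
     {"name": "take",    "components": {"box", "robot"}},
     {"name": "package", "components": {"robot"}}]

C_prod = {"machine", "box"}  # scope of the agent prod
print([a["name"] for a in project_sequence(w, C_prod)])
# -> ['produce', 'put', 'take']; package projects to the identity
```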

As described in Section 2.1, the set $W = A^*$ contains all action sequences of finite length generated from a set of base actions $A$. It implies a tree-graph where every edge denotes an action that is appended to the previous sequence. The sequence tree $W^\sigma$ with a defined initial state $\sigma$ contains only feasible sequences starting from the initial state. The sequence tree $W_i = A_i^*$ considered by the agent $i$ contains all sequences formed from the actions in $i$'s scope. The sequence tree $W_i^\sigma$ contains the sequences from $W_i$ that agent $i$ believes to be feasible starting from the initial state $\sigma_i$. The inverse projection of the per-agent sequence tree $\Pi_i^{-1}(W_i^\sigma)$ contains all global sequences that are compatible with (project to) a sequence from $W_i^\sigma$ and that are also feasible for some compatible initial state $\omega$ with $\omega_i = \sigma_i$.

$$\Pi_i^{-1}(W_i^{\sigma}) = \{ \mathbf{w} \in W : \exists \omega \in \Sigma, \ \omega_i = \sigma_i, \ \mathbf{w} \in W^{\omega}, \ \mathbf{w}_i \in W_i^{\sigma} \} \tag{4.6}$$

Since the considered system dynamics takes concurrency into account, the index of an action in the action sequence need no longer coincide with the order in which the actions are executed according to the simulation time. The state of component $c$ after executing the first $k$ actions is $\mathbf{w}^{:k}(\sigma)_c = \sigma_c^k = (s_c^k, t_c^k)$.

**Proposition 4.7.** *From Assumptions 4.3, 4.4 and 4.5 it follows that for any global state* $\sigma \in \Sigma$ *and agent* $i \in I$

$$\left(\bigcap\_{j\in I} \Pi\_j^{-1}(W\_j^{\sigma})\right) = W^{\sigma} \subseteq \Pi\_i^{-1}(W\_i^{\sigma})\,.$$

*Proof.* Consider the subset relation $W^\sigma \subseteq \Pi_i^{-1}(W_i^\sigma)$. Assume there exists a sequence $\mathbf{v} \in W^\sigma$ for which an agent $i$ believes that the projected sequence is not feasible for his scope, so that $\mathbf{v} \notin \Pi_i^{-1}(W_i^\sigma)$. Let $k$ be the largest index in $\mathbf{v}$ where $\mathbf{v}^{:k} \in \Pi_i^{-1}(W_i^\sigma)$. Such a $k$ must exist since the empty sequence is always feasible. From Equation 4.6, the agent $i$ regards the shortened sequence as feasible: $\Pi_i(\mathbf{v}^{:k}) \in W_i^\sigma$. A consequence of Assumption 4.3 and Assumption 4.5 is that the agent $i$ correctly predicts the (untimed) system state of the components in his scope based on the projected sequence $\mathbf{v}_i^{:k}$:

$$\forall \sigma \in \Sigma_{\mathbf{v}^{:k}}, \ \sigma' = \mathbf{v}^{:k}(\sigma), \ \omega_i = \mathbf{v}_i^{:k}(\sigma_i), \ s_i^{\sigma'} = s_i^{\omega_i}$$

Here, $s_i^{\sigma'}$ and $s_i^{\omega_i}$ denote the untimed state of the components in $i$'s scope in the respective timed state vectors $\sigma'$ and $\omega_i$. Since $\mathbf{v} \in W^\sigma$ we know that $\sigma' \in \Sigma_{v^{k+1}}$. From the preconditions of the projected action from Definition 4.2 it must be that $\omega_i \in \Sigma_{v^{k+1},i}$. This contradicts the initial assumption.

We now show that $\bigcap_{j \in I} \Pi_j^{-1}(W_j^\sigma) \subseteq W^\sigma$. Assume an action sequence $\mathbf{u}$ where $\mathbf{u} \notin W^\sigma$ and $\mathbf{u} \in \bigcap_{j \in I} \Pi_j^{-1}(W_j^\sigma)$. There exists an index $l$ such that the subsequence $\mathbf{u}^{:l}$ is contained in $W^\sigma$ but $\mathbf{u}^{:l+1}$ is not. Let $\sigma' = \mathbf{u}^{:l}(\sigma)$. The agents agree that their projection of $u^{l+1}$ is feasible: $\forall i \in I, \ \sigma'_i \in \Sigma_{u^{l+1},i}$. (The identity action is always feasible.) But $\sigma' \notin \Sigma_{u^{l+1}}$. This contradicts Assumption 4.4. The equality $\bigcap_{j \in I} \Pi_j^{-1}(W_j^\sigma) = W^\sigma$ is then a direct consequence of the previously established fact that for all agents $j \in I$ the set $\Pi_j^{-1}(W_j^\sigma)$ is a superset of $W^\sigma$.

Proposition 4.7 summarizes the first important result for distributed planning from this section. Every feasible global sequence projects to a feasible sequence from the standpoint of the individual agents. On the other hand, an individual agent could assume a sequence $\mathbf{w}_i$ to be valid that has no feasible global counterpart. If the agents however jointly agree on a sequence, by each considering the projection to their scope, then the sequence is globally feasible.

# **4.3 Distributed Planning for Deterministic Action Sequences**

The sequence tree shared between two agents $i$ and $j \in N(i)$ is written as $W_{ij}$. It contains all sequences of the shared actions, $W_{ij} = (A_i \cap A_j)^*$. Note that the sequences in $W_{ij}$ may be infeasible. They are partial sequences, and each agent has to "fill the holes" for the components in his scope. Based on a (partial) sequence $\mathbf{w}_{ij} \in W_{ij}$, the sequence tree of the individual agent $i$ can be conditioned to contain only sequences that are in accordance with the shared sequence $\mathbf{w}_{ij}$.

**Definition 4.8.** *For an action sequence* $\mathbf{v}_{ij} \in W_{ij}$ *shared by the agents* $i$ *and* $j \in N(i)$*, the conditional tree* $W_i^\sigma|\mathbf{v}_{ij}$ *contains only those sequences for agent* $i$ *whose projection to* $W_{ij}$ *is compatible with* $\mathbf{v}_{ij}$ *in the following sense:*

$$W\_i^{\sigma}|\boldsymbol{v}\_{ij} = \{\boldsymbol{w}\_i \in W\_i^{\sigma} : \boldsymbol{w}\_{ij} = \boldsymbol{v}\_{ij}\}$$

**Example 4.2.** Consider the two agents from Example 4.1 and an initial state σ where no component contains products.

**Figure 4.3:** Conditioned sequence trees of two agents.

Every agent internally considers the sequence tree $W^\sigma_{\text{prod}}$ and $W^\sigma_{\text{pack}}$, respectively. By imposing the shared sequence $\mathbf{v}_{\text{prod},\text{pack}}$, the agents' sequence trees are pruned to the conditional sequence trees shown in Figure 4.3b. Also compare with the sequence tree for the overall scenario from Figure 2.3.
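The pruning of Definition 4.8 can be sketched as a filter over an enumerated sequence tree. The sequences and the shared action set below are illustrative assumptions:

```python
# Conditioning an agent's sequence tree on a shared sequence v_ij:
# keep only the sequences whose projection to the shared actions matches.
def conditional(W_i, v_ij, shared_actions):
    project = lambda w: tuple(a for a in w if a in shared_actions)
    return [w for w in W_i if project(w) == tuple(v_ij)]

W_prod = [("produce",), ("produce", "put"), ("produce", "put", "produce")]
print(conditional(W_prod, ["put"], {"put", "take"}))
# -> [('produce', 'put'), ('produce', 'put', 'produce')]
```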

For a given initial state $\sigma$, the global reward generated by a sequence $\mathbf{w}$ is

$$\mathfrak{r}(\sigma, w) = \sum\_{k=1}^{|w|} \mathfrak{r}\left(w^{:k-1}(\sigma), w^k, w^{:k}(\sigma)\right).$$

The local reward for an agent $i \in I$ (who knows the initial state of the components in his scope, $\sigma_i$) is

$$\mathfrak{r}\_i(\sigma\_i, w\_i) = \sum\_{k=1}^{|w|} \mathfrak{r}\_i(w\_i^{:k-1}(\sigma\_i), w\_i^k, w\_i^{:k}(\sigma\_i)) \,.$$

Now we can state the planning problem for action sequences as a factorized optimization problem with factors $\mathfrak{r}_i$ and overlapping factor domains $W_i$.

**Proposition 4.9.** *If the reward generated by the actions* $a$ *factorizes into per-agent rewards as* $\mathfrak{r}(\sigma, a, \sigma') = \sum_{i \in I} \mathfrak{r}_i(\sigma_i, a_i, \sigma'_i)$*, then the reward* $\mathfrak{r}(\sigma, \mathbf{w})$ *for a global action sequence* $\mathbf{w}$ *factorizes into the sum of the per-agent reward functions* $\mathfrak{r}_i : W_i \to \mathbb{R}$*.*

$$\mathfrak{r}(\sigma, w) = \sum\_{i \in I} \mathfrak{r}\_i(\sigma\_i, w\_i)$$

*Proof.* Take the following sequence of equations. The gist of the proof lies in the equality between Equations 4.7 and 4.8.

$$\mathfrak{r}(\sigma, \mathbf{w}) = \sum_{k=1}^{|w|} \mathfrak{r}\left(\mathbf{w}^{:k-1}(\sigma), w^k, \mathbf{w}^{:k}(\sigma)\right) \tag{4.7}$$

$$= \sum_{k=1}^{|w|} \sum_{i \in I} \mathfrak{r}_i\left(\mathbf{w}_i^{:k-1}(\sigma_i), w_i^k, \mathbf{w}_i^{:k}(\sigma_i)\right) \tag{4.8}$$

$$= \sum_{i \in I} \sum_{k=1}^{|w|} \mathfrak{r}_i\left(\mathbf{w}_i^{:k-1}(\sigma_i), w_i^k, \mathbf{w}_i^{:k}(\sigma_i)\right) = \sum_{i \in I} \mathfrak{r}_i(\sigma_i, \mathbf{w}_i)$$

First, Proposition 4.7 guarantees that for any system state $\sigma$ and feasible sequence $\mathbf{w}$, the projected sequence $\mathbf{w}_i$ is feasible for the projected initial state $\sigma_i$. Second, from Assumption 4.5 it follows that knowledge of $\mathbf{w}_i$ suffices to determine the untimed state of the components $C_i$. Third, the agents' reward depends only on the untimed state of the components in their scope according to Assumption 4.6. Therefore, the reward of the individual agents is determined by just the components in their scope and the action sequence projected to their scope.

**Definition 4.10.** *The V-value of the agent* $i$ *with agent-state* $\sigma_i$ *is the sum of rewards generated by the best feasible action sequence from the perspective of the agent:*

$$\mathfrak{v}_i(\sigma_i) = \max_{\mathbf{w}_i \in W_i^{\sigma}} \mathfrak{r}_i(\sigma_i, \mathbf{w}_i) \,.$$

We do not discount later reward when computing the V-value. Instead, it is implied that the tree $W$ has a maximum height, either because the scenario is "done" after a finite number of actions or because of a cutoff height of the sequence tree.

**Definition 4.11.** *The conditional V-value of an agent i for the action sequence* v*ij shared with the neighbor j* ∈ *N*(*i*) *is the best reward the agent can achieve with a sequence that projects to* v*ij .*

$$\mathfrak{v}_i(\sigma_i \mid \mathbf{v}_{ij}) = \begin{cases} \max\limits_{\mathbf{w}_i \in W_i^{\sigma}|\mathbf{v}_{ij}} \mathfrak{r}_i(\sigma_i, \mathbf{w}_i), & W_i^{\sigma}|\mathbf{v}_{ij} \neq \emptyset \\ -\infty, & \text{else} \end{cases}$$

The local V-value is defined as $-\infty$ if the shared sequence $\mathbf{v}_{ij}$ is infeasible for the initial state of the components in the agent's scope $\sigma_i$. This becomes important later on, when the conditional V-values are used by the agents to signal preferences for shared sequences to their neighbors.

**Proposition 4.12.** *In a setting with just two agents i and j, the global V-value can be decomposed as follows.*

$$\mathfrak{v}(\sigma) = \max\_{w \in W^{\sigma}} \mathfrak{r}(\sigma, w) = \max\_{w\_{ij} \in W\_{ij}} \left[ \mathfrak{v}\_i(\sigma\_i \mid w\_{ij}) + \mathfrak{v}\_j(\sigma\_j \mid w\_{ij}) \right]$$

*Proof.* Let $\mathbf{u} \in W^\sigma$ be an optimal global action sequence, so the reward generated by $\mathbf{u}$ is $\mathfrak{r}(\sigma, \mathbf{u}) = \mathfrak{v}(\sigma)$. (There may be several optimal action sequences.) This reward decomposes into the private rewards of the agents $i$ and $j$:

$$\mathfrak{r}(\sigma, \mathbf{u}) = \mathfrak{r}_i(\sigma_i, \mathbf{u}_i) + \mathfrak{r}_j(\sigma_j, \mathbf{u}_j)$$

Given the shared sequence $\mathbf{u}_{ij}$, agent $j$ can choose any action sequence from $W_j^\sigma|\mathbf{u}_{ij}$ without impacting the reward for $i$. We know that $\mathfrak{r}_j(\sigma_j, \mathbf{u}_j) = \max_{\mathbf{v}_j \in W_j^\sigma|\mathbf{u}_{ij}} \mathfrak{r}_j(\sigma_j, \mathbf{v}_j)$. Otherwise, if the agent $j$ could find a better sequence than $\mathbf{u}_j$, then $\mathbf{u}$ could not be a maximizer for the global sequence. Since $\mathbf{u}_j \in W_j^\sigma|\mathbf{u}_{ij}$, we know that $W_j^\sigma|\mathbf{u}_{ij}$ is nonempty and therefore, with Definition 4.11,

$$\mathfrak{r}(\sigma, \mathbf{u}) = \mathfrak{r}_i(\sigma_i, \mathbf{u}_i) + \mathfrak{v}_j(\sigma_j \mid \mathbf{u}_{ij}) \,.$$

The same line of reasoning can be followed for the agent $i$, so that $\mathfrak{r}(\sigma, \mathbf{u}) = \mathfrak{v}_i(\sigma_i \mid \mathbf{u}_{ij}) + \mathfrak{v}_j(\sigma_j \mid \mathbf{u}_{ij})$.

In the case with two agents, given precomputed conditional V-value functions $\mathfrak{v}_i(\sigma_i \mid \mathbf{w}_{ij})$ (e.g., available as a lookup table), the optimization problem of finding the optimal reward for a given system state $\sigma$ is simplified from a search over $W^\sigma$ to a search over just $W_{ij}$. Now assume a case with three agents $i, j, l$. The agents $i$ and $j$ have shared components and the agents $j$ and $l$ have shared components. But $i$ and $l$ do not share any components. Conditioned on the action sequence $\mathbf{w}_{jl}$, the agent $l$ is independent not only of the actions of $j$ but also of the actions of $i$. This mechanism is used for "utility propagation" between agents that form a tree-graph. The difference to the original GDL is that the overall domain $W$ is not a Cartesian product of the $W_i$.
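A toy numeric illustration of Proposition 4.12, with hypothetical conditional V-values stored as lookup tables over a handful of candidate shared sequences:

```python
# Decomposed search over the shared sequence tree W_ij only:
# v(sigma) = max over w_ij of v_i(sigma_i | w_ij) + v_j(sigma_j | w_ij).
NEG_INF = float("-inf")  # marks infeasible conditionals (Definition 4.11)

v_i = {(): 0.0, ("put",): 5.0, ("put", "take"): 9.0}
v_j = {(): 0.0, ("put",): -1.0, ("put", "take"): 4.0}

shared = set(v_i) | set(v_j)
best = max(shared, key=lambda w: v_i.get(w, NEG_INF) + v_j.get(w, NEG_INF))
print(best, v_i[best] + v_j[best])  # ('put', 'take') 13.0
```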

**Assumption 4.13.** *The agents and their neighbor relations form a tree-graph. The components in the scope of agent* $i$ *are in the shared scope with at most one of* $i$*'s neighboring agents.*

$$\forall j \in N(i), \ \forall c \in C_{ij}: \ \forall l \in N(i) \setminus \{j\}, \ c \notin C_{il}$$

For a given initial system state $\sigma$, the messages $\mathfrak{m}_{i \to j} : W_{ij} \to \mathbb{R}$ exchanged between neighboring agents are computed as follows:

$$\mathfrak{m}_{i \to j}(\mathbf{v}_{ij}) = \max_{\mathbf{w}_i \in W_i^{\sigma} | \mathbf{v}_{ij}} \left[ \mathfrak{r}_i(\sigma_i, \mathbf{w}_i) + \sum_{l \in N(i) \setminus \{j\}} \mathfrak{m}_{l \to i}(\mathbf{w}_{il}) \right] \tag{4.9}$$

The messages $\mathfrak{m}_{i \to j} : W_{ij} \to \mathbb{R}$ describe the best reward that the subgraph of agents "behind" the edge $i \to j$ can achieve for a given shared sequence $\mathbf{w}_{ij}$. Computing the messages $\mathfrak{m}$ quickly becomes intractable. The number of possible shared sequences grows exponentially with the length of the shared sequence. Furthermore, the optimization performed for each message $\mathfrak{m}_{i \to j}$ and shared sequence $\mathbf{w}_{ij}$ requires in itself an optimization over $W_i^\sigma|\mathbf{v}_{ij}$ that takes the messages received from the other neighbors $N(i) \setminus \{j\}$ into account. The conditional tree $W_i^\sigma|\mathbf{v}_{ij}$ also grows exponentially with the tree depth (although pruning non-conforming sequences can drastically reduce the overall size).

The Max-Plus algorithm from the GDL family is used to efficiently solve the maximum a-posteriori (MAP) problem of finding the event with the highest probability. It has been used for "utility propagation" in multi-agent decision making by [KV05]. To overcome the combinatorial explosion of the search space, we develop a novel combination of Max-Plus with MCTS. See Algorithm 14 for the full algorithm specification. Similar to standard UCT, the Distributed Upper Confidence on Trees (DUCT) algorithm performs iterative playouts and generates statistics that guide decision-making in later playouts. Every agent $i \in I$ stores the reward he can achieve after a sequence $\mathbf{w}_i$ in a hashmap $v_i[\mathbf{w}_i]$. As before, the hashmap returns zero when no entry has been set before. Every agent performs independent playouts based on his internal simulator. Actions are selected according to the UCT rule, with the addition that the conditional reward signaled by the neighbors is considered as well. For the updates, the computed V-value estimate for a sequence $\mathbf{w}_i$ considers only the reward the agent himself receives. But the actions $a_i^*$ are selected to maximize the reward for all agents. Afterwards, the messages to the neighboring agents are computed. Here the action $a_i^*$ is implied by the use of the local V-value $v_i$. Note that the reward signaled by the agent $j$ to $i$ is not mirrored back in the messages from $i$ to $j$. Here, the agents jointly explore the scenario tree. Initially, all messages are set to zero. Therefore, if rewards are generally negative, the agents will initially overestimate the value of sequences for which no empirical reward estimate from the neighbors exists.

**Algorithm 14** Distributed Upper Confidence on Trees (DUCT) Algorithm

```
 1: procedure DUCT(σ⁰)
 2:   m_{i→j}[·] ← 0   ∀i ∈ I, j ∈ N(i)
 3:   n_i[·] ← 0, v_i[·] ← 0   ∀i ∈ I
 4:   while enough time do
 5:     for i ∈ I do
 6:       (w_i, r_i) ← Play_i(σ⁰_i)
 7:       Update_i(w_i, r_i)
 8:   return arg max_{a ∈ A} [ Σ_{i ∈ I: a_i ∈ A_i} v_i[a_i] ]

 1: procedure Play_i(σ_i)
 2:   w_i ← ε, r_i ← ε
 3:   while ¬done(σ_i) do
 4:     B ← {b_i ∈ A_i : n[w_i b_i] = 0}
 5:     if B ≠ ∅ then
 6:       a_i ← π^B_i(σ_i)
 7:     else
 8:       a_i ← arg max_{b_i ∈ A_i} [ v_i[w_i b_i] + α·√(log(n[w_i]+1) / n[w_i b_i])
                                      + Σ_{j ∈ N(i)} m_{j→i}(Π_{ij}(w_i b_i)) ]
 9:     (σ_i, e_i) ← a_i(σ_i)
10:     w_i ← w_i a_i,  r_i ← (r¹_i, r²_i, …, e_i)
11:   return w_i, r_i

 1: procedure Update_i(w_i, r_i)
 2:   for k = |w_i|, …, 1 do
 3:     u_i ← w_i^{:k}
 4:     n[u_i] ← n[u_i] + 1
 5:     a*_i ← arg max_{a_i ∈ A_i: n[u_i a_i] > 0} [ v_i[u_i a_i] + Σ_{j ∈ N(i)} m_{j→i}[Π_{ij}(u_i a_i)] ]
 6:     v_i[u_i] ← r^k_i + v_i[u_i a*_i]
 7:     for j ∈ N(i) do
 8:       e ← v_i[u_i] + Σ_{l ∈ N(i)\{j}} m_{l→i}(Π_{il}(u_i))
 9:       if e > m[Π_{ij}(u_i)] then
10:         m[Π_{ij}(u_i)] ← e
```
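For illustration, the UCT selection rule with signaled neighbor rewards (line 8 of Play) might look as follows in Python. The hashmap layout and the `project` callback are assumptions for this sketch, not the thesis implementation:

```python
import math
from collections import defaultdict

alpha = 1.0
v = defaultdict(float)  # empirical V-value per own action sequence
n = defaultdict(int)    # visit count per own action sequence

def select_action(w_i, actions, neighbors, messages, project):
    """Line 8 of Play: UCT score plus the reward signaled by neighbors.

    Assumes every candidate has been visited before (unvisited actions
    are handled by the separate exploration branch in lines 4-6).
    messages[j] maps projected shared sequences to signaled values."""
    def score(b):
        wb = w_i + (b,)
        bonus = alpha * math.sqrt(math.log(n[w_i] + 1) / n[wb])
        signaled = sum(messages[j].get(project(j, wb), 0.0)
                       for j in neighbors)
        return v[wb] + bonus + signaled
    return max(actions, key=score)
```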

## **4.4 Distributed Planning under Uncertainty**

In stochastic scenarios, following the description from Section 2.3, the current state is not known with absolute certainty. Instead, the system state can only be inferred from indirect observations. Recall that histories $\mathbf{h}$ are comprised of episodes $h^k = (a^k \theta^k \mathbf{o}^k)$ with an action, action parameters and observations. The set of all possible global histories of finite length is $H = (A \times \Theta \times O)^*$. The set of histories $H$ implies a tree-graph with edges between observations and actions, actions and parameters, and parameters and observations if they can occur in sequence in a history. From every history, the current system state can be inferred as $P(\sigma \mid \mathbf{h})$. For this, a belief distribution for the initial state $\sigma^0$ is updated with the received observations. We now make the additional distinction between histories and *complete histories*. Complete histories are the leaves in the tree-graph of $H$. They denote histories after which the scenario is "done". This can also be enforced by a maximum history depth. The set of complete histories is $\overline{H} \subseteq H$.

The algorithm enhances our prior work in [Pfr16a] in several regards. Most importantly, every agent participating in the decision making has a local simulator to predict the evolution of the system state based on his restricted local knowledge. An important inspiration came from [AO15]. However, they use Variable Elimination [KF09] for selecting joint actions instead of message passing. In addition, they rely on a central simulator for the global system to generate sample plays. Since MCTS is an online algorithm, they would also require a simulator for the global system at runtime for the agent coordination.

As the agents $i \in I$ have a limited scope, they can only observe a portion of the full history. In particular, the projection to the agent scope $\Pi_i(\mathbf{h}) = \mathbf{h}_i$ contains only the episodes with actions $a \in A_i$. The observed action parameters are from the full parameter space of the action, $\theta \in \Theta_a$. The observations received by the agent $i$ are from components that participate in the action $a$ and are in the scope of the agent: $\mathbf{o}_i \in O_{a,i} = (\times_{c \in C_{a,i}} O_c)$.

**Definition 4.14.** *A history* h ∈ *H projects to the scope of an agent i* ∈ *I as*

$$\mathbf{h}_i = \underbrace{a_i^1 \theta_i^1 \mathbf{o}_i^1}_{h_i^1} \dots \underbrace{a_i^{|h|} \theta_i^{|h|} \mathbf{o}_i^{|h|}}_{h_i^{|h|}} \,.$$

*The set of histories for agent* $i$ *is* $H_i = (A_i \times \Theta_i \times O_i)^*$ *where* $\Theta_i = \bigcup_{a \in A_i} \Theta_a$ *and* $O_i = \bigcup_{a \in A_i} O_{a,i}$*.*

Similar to the projection of deterministic action sequences from Section 4.3, the episodes are indexed with $k$. If an episode of the global sequence refers to an action outside of agent $i$'s scope, then the action projects to the identity operator ($a \notin A_i \Rightarrow a_i = \varepsilon$) and the episode does not occur in the agent's local history. The global history thus may contain more episodes than are visible to the individual agent. The difference in the index $k$ will be made explicit only when the meaning is not clear from context.

Agents have a local decision-making policy $\pi_i$. Since the scopes of the agents overlap, neighboring agents need to coordinate to select the next action in their shared scope. Disagreement would lead to incompatible action selections for the components in the shared scope. This needs to be avoided. So the local decision-making policy of an agent $i$ is no longer a deterministic result of the local history $\mathbf{h}_i$ alone. It also depends on the coordination with neighboring agents – and hence on the observations the agents $j \in N(i)$ have made that are not necessarily in the scope of $i$. This lack of information from the limited viewpoint of agent $i$ is expressed by taking the policy as a random variable as well. The next action and action parameters are sampled as $(a_i, \theta_i) \sim \pi_i(\mathbf{h}_i)$.

The values for optimal decision making in each node (from the viewpoint of agent $i$), the V-value and the Q-value, can be computed recursively with Bellman's equation [Put94]. Note that the decision-making step is split into action selection and parameter selection. To simplify the notation, we refer to both the V-value and the Q-value of an agent history as $\mathfrak{q}$.

**Definition 4.15.** *The Q-value of a selfish agent* $i$ *assumes optimal decision making by* $i$*. The other agents* $I \setminus \{i\}$ *coordinate with* $i$ *by agreeing to* $i$*'s decisions for the components in the shared scope.*

$$\mathfrak{q}\_{i}(\hbar\_{i}) = \begin{cases} 0, & \hbar\_{i} \in \overline{H}\_{i} \\ \max\_{a\_{i} \in A\_{i}} \mathfrak{q}\_{i}(\hbar\_{i}a\_{i}), & else \end{cases} \tag{4.10}$$

$$\mathfrak{q}\_{i}(\boldsymbol{h}\_{i}\boldsymbol{a}\_{i}) = \max\_{\theta\_{i}\in\Theta\_{a,i}} \mathfrak{q}\_{i}(\boldsymbol{h}\_{i}\boldsymbol{a}\_{i}\boldsymbol{\theta}\_{i})\tag{4.11}$$

$$\mathfrak{q}_i(\mathbf{h}_i a_i \theta_i) = \mathop{\mathbb{E}}_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{o}_i \in O_{a,i}}} \left[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \mathfrak{q}_i(\mathbf{h}_i a_i \theta_i \mathbf{o}_i) \,\Big|\, \mathbf{h}_i a_i \theta_i \right] \tag{4.12}$$

A distinction by cases is made depending on whether the last episode contains only an action, an action with action parameters, or an action with parameters and the resulting observations. When action, parameters and observation are appended to a history, as in $\mathbf{h}'_i = \mathbf{h}_i a_i \theta_i \mathbf{o}_i$, then $\mathbf{h}'_i$ matches Equation 4.10, which is defined for histories with complete episodes. Equations 4.11 and 4.12 take histories with an incomplete last episode as input. The case distinction in Equation 4.10 is required so that the recursive formulation terminates at the leaf nodes of $H_i$.
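The recursion of Definition 4.15 can be traced on a tiny explicit model. The outcome table below is a hypothetical stand-in for the agent's local simulator (the numbers mirror the later supply-chain example), and the parameter maximization of Equation 4.11 is trivial here:

```python
# Q-value recursion on an explicit history tree. outcomes maps
# (history, action, parameter) to a list of (observation, prob, reward).
outcomes = {
    ((), "prod", None): [("ok", 0.9, 20.0), ("fail", 0.1, -35.0)],
    ((), "pass", None): [("-", 1.0, 0.0)],
}

def actions(h):
    return {"prod", "pass"} if h == () else set()  # one decision only

def q_history(h):
    """Equation 4.10: 0 at complete histories, else maximize over actions."""
    acts = actions(h)
    if not acts:
        return 0.0
    return max(q_param(h, a, None) for a in acts)  # Eq. 4.11 is trivial here

def q_param(h, a, th):
    """Equation 4.12: expected reward plus the Q-value of the continuation."""
    return sum(p * (r + q_history(h + ((a, th, o),)))
               for o, p, r in outcomes[(h, a, th)])

print(q_history(()))  # 0.9*20 + 0.1*(-35) = 14.5
```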

A shared history $\mathbf{h}_{ij} \in H_{ij} = (A_{ij} \times \Theta_{ij} \times O_{ij})^*$ between an agent $i$ and his neighbor $j \in N(i)$ contains only the episodes with a shared action $a \in A_{ij}$. The agents $i$ and $j$ both observe the complete action parameters for the shared actions. The observations in the shared history contain only the observations from components in the shared scope, $O_{ij} = \bigcup_{a \in A_{ij}} (\times_{c \in C_a \cap C_{ij}} O_c)$. The V-value and Q-value are now conditioned on a shared history for the future episodes. Observations that are not compatible with the shared history are marginalized out in the probabilistic expectation. Incompatible action and parameter choices are disallowed in the maximization steps.

**Definition 4.16.** *A conditional Q-value for a selfish agent i assumes optimal decision making under the constraint that future episodes (after the initial history* h*i) project to the partial history* g*ij shared with the neighbor j* ∈ *N*(*i*)*.*

$$\mathfrak{q}_i(\mathbf{h}_i \mid \mathbf{g}_{ij}) = \begin{cases} 0, & \mathbf{h}_i \in \overline{H}_i \\ \max\limits_{a_i \in (A_i \setminus A_{ij}) \cup \{a_{ij}^1\}} \mathfrak{q}_i(\mathbf{h}_i a_i \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.13}$$

$$\mathfrak{q}_i(\mathbf{h}_i a_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathfrak{q}_i(\mathbf{h}_i a_i \theta_{ij}^1 \mid \mathbf{g}_{ij}), & a_i = a_{ij}^1 \\ \max\limits_{\theta_i \in \Theta_{a,i}} \mathfrak{q}_i(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.14}$$

$$\mathfrak{q}_i(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{u}_i \in O_{a,i}, \ \mathbf{u}_{ij} = \mathbf{o}_{ij}^1}} \left[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \mathfrak{q}_i(\mathbf{h}_i a_i \theta_i \mathbf{u}_i \mid \mathbf{g}_{ij}^{2:}) \,\Big|\, \mathbf{h}_i a_i \theta_i \right], & a_i = a_{ij}^1 \\ \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{o}_i \in O_{a,i}}} \left[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \mathfrak{q}_i(\mathbf{h}_i a_i \theta_i \mathbf{o}_i \mid \mathbf{g}_{ij}) \,\Big|\, \mathbf{h}_i a_i \theta_i \right], & \text{else} \end{cases} \tag{4.15}$$

The action, action parameters and observations of the first episode of the partial history $\mathbf{g}_{ij}$ are referred to as $a_{ij}^1$, $\theta_{ij}^1$ and $\mathbf{o}_{ij}^1$, respectively. As episodes are added to the per-agent history $\mathbf{h}_i$, the remaining shared history for future episodes gets shorter. Once no shared history for the conditioning remains, with $\mathbf{g}_{ij} = \varepsilon$, the Q-value and V-value fall back to the formulations from Equations 4.10 to 4.12. The action choice is constrained to agree with the remaining partial history $\mathbf{g}_{ij}$. Actions without participating components from $C_{ij}$ can be chosen freely as they are not constrained by the partial history. The case distinction in Equation 4.14 requires that the matching parameters from the partial history $\mathbf{g}_{ij}$ are taken if the action is from the partial history. Finally, Equation 4.15 also makes a case distinction for actions from the constraining partial history $\mathbf{g}_{ij}$. Equation 4.15 then recurses by adding the expected reward from future episodes, constrained on the remaining partial history. If the current action is from the shared history, then the remaining shared history $\mathbf{g}_{ij}^{2:}$ has the current episode removed.

The agents in Definition 4.16 are self-interested. In order to have collaborative agents jointly optimize the global reward (across all agents), each individual agent has to assess the impact of his choices on the expected future reward for himself as well as for the other agents. Analogous to the previous section on distributed planning for deterministic action sequences, the agents are assumed to form a tree-graph with their neighborhood relations. (Cf. the discussion of Assumption 4.13.) To coordinate, the agents exchange messages $\mathfrak{m}_{i \to j} : H_{ij} \to \mathbb{R}$ with their neighbors $j \in N(i)$. The value of the message $\mathfrak{m}_{i \to j}(\mathbf{g}_{ij})$, evaluated for a given shared history $\mathbf{g}_{ij}$, describes the expected Q-value (the reward for optimal play in the remaining episodes) for the agents behind the edge $i \to j$, conditional on the given shared partial history. See the later Definition 4.18 for the messages. The per-agent history is projected to the shared scope with $\mathbf{h}_{ij} = \Pi_{ij}(\mathbf{h}_i)$. The expected immediate reward in the subtree behind the edge $j \to i$

$$\bar{\mathfrak{r}}\_{j \to i}(h\_i, a\_i \theta\_i \mathfrak{o}\_i) = \mathfrak{m}\_{j \to i}(h\_{ij}) - \mathfrak{m}\_{j \to i}(\Pi\_{ij}(h\_i a\_i \theta\_i \mathfrak{o}\_i))$$

refers to the reward the agents in the subtree expect to make when they "fill the gaps" between the partial histories $\mathbf{h}_{ij}$ and $\Pi_{ij}(\mathbf{h}_i a_i \theta_i \mathbf{o}_i)$.

**Definition 4.17.** *The global Q-value for an unselfish agent* $i$*, conditioned on the future partial shared history* $\mathbf{g}_{ij}$*, assumes optimal decision making by the agent* $i$ *with respect to the expected global reward and a fixed policy followed by the other agents.*

$$\mathfrak{q}^*_i(\mathbf{h}_i \mid \mathbf{g}_{ij}) = \begin{cases} -\infty, & \mathbf{h}_i \in \overline{H}_i, \ \mathbf{g}_{ij} \neq \varepsilon \\ 0, & \mathbf{h}_i \in \overline{H}_i, \ \mathbf{g}_{ij} = \varepsilon \\ \max\limits_{a_i \in (A_i \setminus A_{ij}) \cup \{a_{ij}^1\}} \mathfrak{q}^*_i(\mathbf{h}_i a_i \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.16}$$

$$\mathfrak{q}^*_i(\mathbf{h}_i a_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_{ij}^1 \mid \mathbf{g}_{ij}), & a_i = a_{ij}^1 \\ \max\limits_{\theta_i \in \Theta_{a,i}} \mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.17}$$

$$\mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{u}_i \in O_{a,i}, \ \mathbf{u}_{ij} = \mathbf{o}_{ij}^1}} \Big[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \sum\limits_{l \in N(i)} \bar{\mathfrak{r}}_{l \to i}(\mathbf{h}_i, a_i \theta_i \mathbf{u}_i) + \mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_i \mathbf{u}_i \mid \mathbf{g}_{ij}^{2:}) \,\Big|\, \mathbf{h}_i a_i \theta_i \Big], & a_i = a_{ij}^1 \\ \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{o}_i \in O_{a,i}}} \Big[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \sum\limits_{l \in N(i)} \bar{\mathfrak{r}}_{l \to i}(\mathbf{h}_i, a_i \theta_i \mathbf{o}_i) + \mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_i \mathbf{o}_i \mid \mathbf{g}_{ij}) \,\Big|\, \mathbf{h}_i a_i \theta_i \Big], & \text{else} \end{cases} \tag{4.18}$$

Again, Equation 4.16 returns negative infinity in the case where the scenario is "done" but the constraint to fulfill the partial history $\mathbf{g}_{ij}$ has not been met. The use of the messages $\mathfrak{m}_{l \to i}(\mathbf{h}_{il})$ relies on Assumption 4.13, so that at most two agents share a component in their scope. Otherwise, for a given future shared history $\mathbf{g}_{ij}$, the expected reward for the agents in the subtree behind the edge $l \to i$ could also be conditional on a portion of the shared history $\mathbf{g}_{ij}$ that also applies to $l$, i.e. $\Pi_{il}(\mathbf{g}_{ij})$. The messages $\mathfrak{m}$ would then have to be conditioned on this as $\mathfrak{m}_{l \to i}(\Pi_{il}(\mathbf{h}_i) \mid \Pi_{il}(\mathbf{g}_{ij}))$. Assumption 4.13 removes this source of further complexity.

Now a word on the difference between the global Q-value $\mathfrak{q}^*_i$ estimated by the agent $i$ and the messages $\mathfrak{m}_{i \to j}$ between neighboring agents $i$ and $j$. In accordance with the principles of the GDL described in Section 4.1, it has to be avoided that the agents "mirror back" expected reward that was signaled to them by a neighbor $j$. Only the expected reward from the other neighbors $N(i) \setminus \{j\}$ is forwarded in the messages. In essence, the message contains the expected reward generated in the subtree (according to the agents' neighbor relation) behind the edge $i \to j$, under the assumption of globally optimal decision making by the agents according to their respective Q-value estimates.

**Definition 4.18.** *The messages exchanged over the edge* $i \to j$ *between neighboring agents are computed for optimal decisions based on* $\mathfrak{q}^*_i$*.*

$$\mathfrak{q}_{i \to j}(\mathbf{h}_i \mid \mathbf{g}_{ij}) = \begin{cases} -\infty, & \mathbf{h}_i \in \overline{H}_i, \ \mathbf{g}_{ij} \neq \varepsilon \\ 0, & \mathbf{h}_i \in \overline{H}_i, \ \mathbf{g}_{ij} = \varepsilon \\ \mathfrak{q}_{i \to j}(\mathbf{h}_i a_i^* \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.19}$$

*where* $a_i^* = \arg\max_{a_i \in (A_i \setminus A_{ij}) \cup \{a_{ij}^1\}} \mathfrak{q}^*_i(\mathbf{h}_i a_i \mid \mathbf{g}_{ij})$*,*

$$\mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \theta_{ij}^1 \mid \mathbf{g}_{ij}), & a_i = a_{ij}^1 \\ \mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \theta_i^* \mid \mathbf{g}_{ij}), & \text{else} \end{cases} \tag{4.20}$$

*where* $\theta_i^* = \arg\max_{\theta_i \in \Theta_{a,i}} \mathfrak{q}^*_i(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij})$*,*

$$\mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \theta_i \mid \mathbf{g}_{ij}) = \begin{cases} \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{u}_i \in O_{a,i}, \ \mathbf{u}_{ij} = \mathbf{o}_{ij}^1}} \Big[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \sum\limits_{l \in N(i) \setminus \{j\}} \bar{\mathfrak{r}}_{l \to i}(\mathbf{h}_i, a_i \theta_i \mathbf{u}_i) + \mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \theta_i \mathbf{u}_i \mid \mathbf{g}_{ij}^{2:}) \,\Big|\, \mathbf{h}_i a_i \theta_i \Big], & a_i = a_{ij}^1 \\ \mathop{\mathbb{E}}\limits_{\substack{\sigma_i, \sigma'_i \in \Sigma_i, \\ \mathbf{o}_i \in O_{a,i}}} \Big[ \mathfrak{r}_i(\sigma_i, a_i, \theta_i, \sigma'_i) + \sum\limits_{l \in N(i) \setminus \{j\}} \bar{\mathfrak{r}}_{l \to i}(\mathbf{h}_i, a_i \theta_i \mathbf{o}_i) + \mathfrak{q}_{i \to j}(\mathbf{h}_i a_i \theta_i \mathbf{o}_i \mid \mathbf{g}_{ij}) \,\Big|\, \mathbf{h}_i a_i \theta_i \Big], & \text{else} \end{cases} \tag{4.21}$$

*Finally the message from i to the neighbor j is*

$$\mathfrak{m}\_{i \to j}(\mathfrak{g}\_{ij}) = \mathfrak{q}\_{i \to j}(\varepsilon \mid \mathfrak{g}\_{ij}) \,. \tag{4.22}$$

Decision-making by the agent $i$ depends on optimal decision-making by his neighbors, as communicated in the messages $\mathfrak{m}_{j \to i}$, and vice versa. Astute readers will have noticed a circular dependency between Definition 4.17 and Definition 4.18. Taken together, the values are nonetheless well-defined. Suppose that $H$ is only one level deep. Then the messages $\mathfrak{m}$ can be computed with a forward-backward pass similar to the standard Max-Plus algorithm. Now let $H$ allow two actions in a row. With the same argument, the messages that evaluate the second action choice can be generated. Once these messages are known, the agents can compute the Q-values for the possible first actions.

**Example 4.3.** Suppose that an Original Equipment Manufacturer (OEM) owns two production sites, one in Germany and the other in China. A customer orders a lot of 1000 items to be made to order. The German site can produce the items for \$40 a piece. The Chinese site can fulfill the order for \$35 a piece. But due to uncertainties in the long transport, there is a 10% chance that the products will not reach the OEM headquarters in time for packaging and delivery. For this, the scenario defines three components $C = \{\text{oem}, \text{de}, \text{cn}\}$. Every component also represents an agent, $C = I$. There are six actions defined: $A = \{\text{pass\_de}, \text{prod\_de}, \text{prod\_cn}, \text{pass\_cn}, \text{pass\_oem}, \text{deliver}\}$. The "pass" actions terminate the scenario for the respective agent. The participants of the actions are $C_{\text{pass\_de}} = \{\text{de}\}$, $C_{\text{prod\_de}} = \{\text{oem}, \text{de}\}$, $C_{\text{pass\_cn}} = \{\text{cn}\}$, $C_{\text{prod\_cn}} = \{\text{oem}, \text{cn}\}$, $C_{\text{deliver}} = C_{\text{pass\_oem}} = \{\text{oem}\}$. The shared actions are $A_{\text{oem},\text{de}} = \{\text{prod\_de}\}$ and $A_{\text{oem},\text{cn}} = \{\text{prod\_cn}\}$. The action prod\_cn returns an observation that is either $o_\mathrm{f}$ or $o_\mathrm{s}$, indicating failure or success. None of the actions take a parameter.


**Table 4.1:** Reward generated by the actions in the supply-chain example.

The possible complete histories are (prod\_de, deliver), (prod\_cn $o_\mathrm{f}$) and (prod\_cn $o_\mathrm{s}$, deliver), as well as the sub-histories with only the first action. The empty action parameters and observations are omitted for readability. From the perspective of global optimization, the deterministic reward for going with the German production site is $(-40 + 55) = 15$. The expected reward with the production in China is $0.9 \cdot (-35 + 55) + 0.1 \cdot (-35) = 14.5$.
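A quick numeric check of the two alternatives (values taken from the example):

```python
price, cost_de, cost_cn, p_success = 55, 40, 35, 0.9

r_de = -cost_de + price                                            # 15
r_cn = p_success * (-cost_cn + price) - (1 - p_success) * cost_cn  # 14.5
print(r_de, r_cn)  # 15 14.5 -> producing in Germany is (barely) preferable
```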

We now show selected examples for the Q-values and the messages between the agents from Definitions 4.17 and 4.18. Once the production sites have either produced or passed on the production, the scenario is "done" for them and no further rewards are generated. As de and cn have only one neighbor, no received messages are "mirrored back" in $\mathfrak{q}_{\text{de} \to \text{oem}}$ and $\mathfrak{q}_{\text{cn} \to \text{oem}}$.

$$\mathfrak{q}_{\text{de} \to \text{oem}}(\text{prod\_de} \mid \cdot\,) = 0, \quad \mathfrak{q}_{\text{de} \to \text{oem}}(\text{pass\_de} \mid \cdot\,) = 0$$

$$\mathfrak{q}_{\text{cn} \to \text{oem}}(\text{prod\_cn} \mid \cdot\,) = 0, \quad \mathfrak{q}_{\text{cn} \to \text{oem}}(\text{pass\_cn} \mid \cdot\,) = 0$$

Given either the production action or the pass action as a conditional, the optimization of the actions is trivial for de and cn, as there is only one action that can be selected in accordance with the conditional. Equations 4.19 through 4.21 return the following:

$$\begin{aligned} \mathfrak{m}_{\text{de} \to \text{oem}}(\text{prod\_de}) &= \mathfrak{q}_{\text{de} \to \text{oem}}(\varepsilon \mid \text{prod\_de}) = -40 \\ \mathfrak{m}_{\text{de} \to \text{oem}}(\text{pass\_de}) &= \mathfrak{q}_{\text{de} \to \text{oem}}(\varepsilon \mid \text{pass\_de}) = 0 \\ \mathfrak{m}_{\text{cn} \to \text{oem}}(\text{prod\_cn}) &= \mathfrak{q}_{\text{cn} \to \text{oem}}(\varepsilon \mid \text{prod\_cn}) = -35 \\ \mathfrak{m}_{\text{cn} \to \text{oem}}(\text{pass\_cn}) &= \mathfrak{q}_{\text{cn} \to \text{oem}}(\varepsilon \mid \text{pass\_cn}) = 0 \end{aligned}$$

With these messages transferred, the agent oem can continue with the computation of $\mathfrak{q}^*_{\text{oem}}$.

$$\begin{aligned} \mathfrak{q}^*_{\text{oem}}(\text{prod\_cn}) &= \mathbb{P}(o_\mathrm{s}) \Big[ \mathfrak{r}_{\text{oem}}(\text{prod\_cn}) + \bar{\mathfrak{r}}_{\text{de} \to \text{oem}}(\varepsilon, \text{prod\_cn}\, o_\mathrm{s}) \\ &\qquad\qquad + \bar{\mathfrak{r}}_{\text{cn} \to \text{oem}}(\varepsilon, \text{prod\_cn}\, o_\mathrm{s}) + \mathfrak{q}^*_{\text{oem}}(\text{prod\_cn}\, o_\mathrm{s}) \Big] \\ &\quad + \mathbb{P}(o_\mathrm{f}) \Big[ \mathfrak{r}_{\text{oem}}(\text{prod\_cn}) + \bar{\mathfrak{r}}_{\text{de} \to \text{oem}}(\varepsilon, \text{prod\_cn}\, o_\mathrm{f}) \\ &\qquad\qquad + \bar{\mathfrak{r}}_{\text{cn} \to \text{oem}}(\varepsilon, \text{prod\_cn}\, o_\mathrm{f}) + \mathfrak{q}^*_{\text{oem}}(\text{prod\_cn}\, o_\mathrm{f}) \Big] \\ &= 0.9 \cdot [0 + 0 - 35 + \mathfrak{r}_{\text{oem}}(\text{deliver})] + 0.1 \cdot [0 + 0 - 35 + 0] = 14.5 \end{aligned}$$

The messages from oem to its neighbors do not contain the reward that was signaled by the respective neighbor itself.

$$\mathfrak{m}_{\text{oem} \to \text{de}}(\text{prod\_de}) = 55, \quad \mathfrak{m}_{\text{oem} \to \text{de}}(\text{pass\_de}) = 14.5,$$

$$\mathfrak{m}_{\text{oem} \to \text{cn}}(\text{prod\_cn}) = 49.5, \quad \mathfrak{m}_{\text{oem} \to \text{cn}}(\text{pass\_cn}) = 15$$

With that, the agents can jointly maximize the expected global reward even though each of them has only a limited view of the system state and the possible actions.

The messages $\mathfrak{m}_{i \to j}$ assign a value to every node of the shared history tree $H_{ij}$. Computing the messages $\mathfrak{m}_{i \to j}$ exactly is computationally challenging. Instead of computing the message values directly, they are approximated via Monte-Carlo sampling. See Algorithm 15 for the full specification of the Distributed Partially-Observable Hybrid Tree Planning (DPOHTP) algorithm.

Similar to the DUCT algorithm, the agents $I$ exchange messages as they jointly explore the solution space. Every agent can use his private model for the components in his scope, so DPOHTP does not rely on a central simulator. The agents optimize the global reward with their action choices. But they keep the rewards of the different agents separated in the statistics on expected reward. The Play$_i$ procedure generates playout histories by sampling from the stochastic simulator and using the current reward statistics, exchanged messages and policy $\pi$ for decision making. The Param$_i$ procedure is used to select parameters for the current action. It uses Optimistic Optimization similar to StoSOO. But every call to Param$_i$ returns exactly one parameter vector. So several actions, with one parameter vector each, can be selected in one playout. The Update$_i$ procedure takes the last playout of the agent $i$ and updates his internal reward statistics as well as the messages sent to the neighbors. Every parameter node stores the empirical direct reward that was generated by the action-parameter combination in a hashmap $e_i$. The Q-value of the parameter additionally considers the expected reward for optimal decision-making later on. Optimal decision-making here refers to the maximization of the global reward, taking the reward signaled by the neighboring agents via messages into account. Note that the messages are updated with an averaging procedure (lines 14–18) instead of maximizing over the actions and parameters. This is due to the fact that several histories $\mathbf{h}_i$ project to the same $\mathbf{h}_{ij}$. A maximization under the premise that only the current $\mathbf{h}_i$ is considered in the message would distort the message, which represents an expectation over the future reward.
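The averaging update just described can be sketched as a running mean per shared history; the hashmap names below are assumptions for illustration:

```python
from collections import defaultdict

m = defaultdict(float)  # message value per shared history h_ij
c = defaultdict(int)    # number of samples per shared history h_ij

def update_message(h_ij, value):
    """Average the sampled playout value into m[h_ij] instead of maximizing,
    since several local histories h_i project to the same h_ij."""
    c[h_ij] += 1
    m[h_ij] += (value - m[h_ij]) / c[h_ij]
```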

**Algorithm 15** The Distributed Partially-Observable Hybrid Tree Planning (DPO-HTP) algorithm

```
 1: procedure DPOHTP(σ⁰)
 2:   m_{i→j}[·] ← 0   ∀i ∈ I, j ∈ N(i)
 3:   n_{i→j}[·] ← 0   ∀i ∈ I, j ∈ N(i)
 4:   n_i[·] ← 0, q_i[·] ← 0   ∀i ∈ I
 5:   e_i[·] ← 0   ∀i ∈ I
 6:   L_i[·] ← {(0, 1)}   ∀i ∈ I
 7:   d̄ ← 1
 8:   while enough time do
 9:     for i ∈ I do
10:       σ_i ∼ σ⁰_i
11:       (h_i, r_i) ← Play_i(σ_i, d̄)
12:       Update_i(h_i, r_i)
13:     d̄ ← d̄ + 1
14:     if d̄ > log₂(n_i[ε]) then
15:       d̄ ← 1
16:   return arg max_{a ∈ A, θ ∈ Θ_a, ∀i ∈ I_a: n_i[aθ] > 0} [ Σ_{i ∈ I} q_i[Π_i(aθ)] ]

 1: procedure Play_i(σ_i, d̄)
 2:   h_i ← ε
 3:   r_i ← ε
 4:   while ¬done(σ_i) do
 5:     B ← {b_i ∈ A_i : n[h_i b_i] = 0}
 6:     if B ≠ ∅ then
 7:       (a_i, θ_i) ← π_i(h_i, B)
 8:       d̄ ← d̄ − 1
 9:     else
10:       a_i ← arg max_{b_i ∈ A_i} [ q_i[h_i b_i] + α·√(log(n_i[h_i]+1) / n_i[h_i b_i])
                                      + Σ_{j ∈ N(i)} m_{j→i}[Π_{ij}(h_i b_i)] ]
11:       d̄ ← d̄ − 1
12:       θ_i ← Param_i(h_i a_i, d̄)
13:       d̄ ← d̄ − depth(L_i[h_i a_i])
14:     (σ_i, o_i, v_i) ∼ a_i^{θ_i}(σ_i)
15:     h_i ← h_i a_i θ_i o_i
16:     r_i ← (r_i, v_i)
17:   return h_i, r_i
```

```
 1: procedure Param_i(h_i a_i, d̄)
 2:   if d̄ > depth(L[h_i a_i]) then
 3:     return arg max_{θ: ∃(u,v) ∈ L_i[h_i a_i], x_i[h_i a_i; u, v] = θ} q[h_i a_i θ]
 4:   else if d̄ < 1 then
 5:     G ← L_i[h_i a_i]
 6:   else
 7:     G ← L_i[h_i a_i; d̄]
 8:   (u, v) ← arg max_{(c,j) ∈ G, x_i[h_i a_i; c, j] = θ} [ q_i[h_i a_i θ] + α·√(log(n_i[h_i a_i]+1) / n_i[h_i a_i θ])
                                                            + Σ_{j ∈ N(i)} m_{j→i}(Π_{ij}(h_i a_i θ)) ]
 9:   θ ← x_i[h_i a_i; u, v]
10:   if n_i[h_i a_i θ] < κ ∧ l = l^max then
11:     return θ
12:   u[h a; l, i] ← u[h a; l, i] + 1
13:   µ ← u[h a; l, i]
14:   n ← dim(Θ_a)
15:   δ ← mod(l, n) + 1
16:   ξ ← |L[h a; l+1]| + 1
17:   x[h a; l+1, ξ] ← θ + (µ − 2)·ν_δ·(1/3)^{⌊l/n⌋+1}
18:   L[h a] ← L[h a] ∪ {(l+1, ξ)}
19:   if µ = 3 then
20:     L[h a] ← L[h a] \ {(l, i)}
21:   return x[h a; l+1, ξ]
```

1: procedure Update$_i$($\mathbf{h}_i$, $\mathbf{r}_i$): walk backwards over the episodes of $\mathbf{h}_i$; increment the visit counts $n_i$; update the empirical direct reward $e_i$ and the Q-values $q_i$ for maximizing the global reward; average the resulting value estimates into the messages $m_{i \to j}$ (lines 14–18).

## **4.5 Evaluation**

A preliminary version of DPOHTP was evaluated in [Pfr16a]. The main difference is that the version presented here allows each agent to run simulations in a private simulator. So the agents are completely decoupled apart from their message exchange. Existing benchmark examples from the Dec-POMDP literature assume a central simulator.

The scenario from autonomous driving was chosen since it enables a good visual understanding of the effect of agent coordination. Autonomous driving can be considered one of the activities on the lower levels of the control hierarchy for supply-chain logistics. Furthermore, the scenario shows the ability of the approach to adjust to changing system topologies with agents entering and leaving at runtime. Note that the example does not claim to be a physically accurate simulation of autonomous driving. It is a highly stylized benchmark example and not intended for practical use. Still, the model is based on results from the relevant literature that uses MCTS for the planning of car maneuvers [LKK16]. If required, the simulation model can be readily replaced with something more accurate without changing the (implementation of the) DPOHTP algorithm used for planning and coordination.

Two versions of the driving scenario are shown in Figures 4.4 and 4.5. Both start from the same initial situation. Three cars (blue, green and red) are driving close to each other at the same speed. They head toward the grey car that blocks the left lane after an accident. The cars have only a limited radius of visibility (assume fog or heavy rain blocking the view). The cars seeing each other are marked with blue and green lines respectively.

The timeline of the scenario is discretized into 100 millisecond periods. In every period the cars may choose between five actions: nothing, accelerate, brake, moveLeft and moveRight. The cars receive a reward in every period. A negative reward of −1 is assigned for every action besides nothing. This gives an incentive not to over-react. An additional negative reward of −50 is assigned to every car that crosses onto the shoulder of the road. Cars touching each other receive a reward of −1000 each. Each car has an internal simulator to predict future situations given joint actions for all cars in its visibility scope. Cars outside the visibility scope are not considered in the simulator. The simulator is the basis for the MCTS-based planning.
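The per-period reward of the driving benchmark, written out as a minimal sketch (the collision and shoulder predicates are assumed to come from the simulator):

```python
def step_reward(action, on_shoulder, collided):
    r = 0.0
    if action != "nothing":
        r -= 1.0      # incentive not to over-react
    if on_shoulder:
        r -= 50.0     # crossing onto the shoulder of the road
    if collided:
        r -= 1000.0   # cars touching each other
    return r

print(step_reward("moveLeft", False, False))  # -1.0
```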

In the first figure, the red car "sees" the accident at a very late point, when the green car has already nearly passed. Until then, the red car stays in its lane and keeps the original speed. Then, to prevent the crash that can now be predicted by the red car, the red car moves in behind the blue car. The second figure was produced with parameters identical to the first figure. The only change is the exchange of messages between the agents, i.e. they use our DPOMCP algorithm instead of plain MCTS. It can be seen that the red car reacts much sooner to the accident, even before the grey car moves into its scope of visibility. The reason for this change of behavior is as follows: In the second line of Figure 4.5, the green car already sees the accident. Therefore the potential crash of the red and grey cars is predicted by the green car. When the green car exchanges a message with preferences with the red car, the danger of the crash is marginalized into the expected reward starting from the possible shared action sequences. Thus, the red car learns about the potentially strong negative consequences of staying in its lane. The red car therefore moves to the right lane and merges in even before the blue car. The exchange of messages leads to a very different – and safer – outcome compared to the scenario without coordination.

The advantages of our work compared to previous results [FB10] are threefold. First, it does not require cars to explicitly define coordination groups. Second, the complexity of the coordination algorithm grows exponentially only in the number of neighbors of every agent. Since every agent has a limited scope (and just a few neighbors), scaling to very large systems is easily possible. Third, when agents enter and leave the system at runtime, this has only a very local impact, as just the direct neighbors need to coordinate.
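A back-of-the-envelope calculation illustrates the second point. With |A| actions per car, an agent coordinates over the joint actions of itself and its deg(i) neighbors instead of over the joint actions of all n cars:

$$|A|^{n} \quad\text{vs.}\quad |A|^{1+\deg(i)}, \qquad\text{e.g. } 5^{100} \approx 8 \cdot 10^{69} \quad\text{vs.}\quad 5^{4} = 625 \;\;\text{for } n = 100 \text{ and } \deg(i) = 3.$$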

**Figure 4.4:** Autonomous driving example without coordination. The blue lines indicate neighbor relations.

**Figure 4.5:** Autonomous driving example with coordination. The green lines indicate neighbor relations that are also communication relations.

# **5 Modeling of Production Skills**

*The tape was a small loop that fed continuously between magnetic pickups. On it were recorded the movements of a master machinist turning out a shaft. […] Rudy Hertz, an old timer, who had been about ready to retire. And here, now, this little loop […] was Rudy as Rudy had been to his machine that afternoon. Rudy, the turner-on of power, the setter of speeds, the controller of the cutting tool. This was the essence of Rudy as far as his machine was concerned.*

*Kurt Vonnegut – Player Piano [Von52]*

The model of production systems developed in Chapter 2 demands that every possible behavior of the manufacturing resources is captured in the form of actions with well-defined preconditions and effects. This lays a heavy burden on plant operators to manage the set of available actions, especially when changes are made to the plant topology and the targeted product lineup. While the actions are conceptually uniform, the complexity of a flexible manufacturing plant manifests itself in the large number of actions required to model the relevant aspects. For tens of final products with tens of manufacturing steps each, plus tooling and transportation, plant operators need to model (and keep up to date) thousands of actions.

In addition to just defining the actions, their execution requires custom control code or the parameterization of reusable functionality for the product at hand. This custom effort to foresee, implement and deploy actions that are possibly needed at runtime is then a bottleneck for system integrators and operators who want to introduce changes at a later time. To reduce this burden, we introduce higher-level abstractions for the skills of the system components. Together with a similar description of the production requirements, the low-level actions are then generated automatically or with tool assistance.

*Executable actions for planning and runtime control in a production system can be generated from higher-level descriptions of the production system and the requirements.*

Note: Our work on skill modeling [PSB13a; Pfr+14c; PSB13b; Pfr+15] was published starting in 2013. The more recent literature already references and incorporates these results. This chapter provides a synthesis of our published work together with a novel representation of the concepts in Description Logics. Our own publications are not referenced in the summary of the state-of-the-art.

## **5.1 Background: Skill Models for Production**

Sussman was one of the first to define a "theory of skill" that was not focused on the skilled human but on the computer [Sus73]. He writes:

[…] a skill is a set of answer procedures, each indexed by a description of the problem types for which it is appropriate, along with a set of pitfalls to avoid when it is necessary to construct a new answer procedure. A skill is acquired by the construction of such a store of "runnable" knowledge – canned answers to problems – by "compiling" it from knowledge of the problem domain supplied in a more "intelligible" form – a form designed more for communication than for use as answers to problems.

This early work already talks about representations for the skill and the problem description, as well as the "compilation" of their combination into an executable form. Since then, work on technical skill definitions has continued mainly in the robotics domain. Many authors have worked on methods for learning particular robot skills [GFB94; MK97; FRD98]. But we are rather interested in skill definitions that apply across different application domains.

In the last ten years, interest in using skill definitions specifically for the production domain has grown considerably. Some authors from the production community differentiate between the terms skill and capability. In this text, we use the two terms interchangeably.

The authors of the SIARAS project [Mal+07] use ontologies to store skills and their relations for production. This information is then used for the automatic reconfiguration of production systems. Naumann et al. [NWS07] use an ontology to model robot skills and use state-charts for sequences within manufacturing processes. Järvenpää et al. [Jar+11] define an ontology for skills in manufacturing and use it to map resources to manufacturing steps. Kluge [Klu11] uses skill models for assembly planning. Huckaby and Christensen [HC12] provide a taxonomy for assembly tasks and related skill primitives; constraints specify whether they are executable in a certain situation. Björkelund et al. [Bjö+12] first make use of the PPR (Product, Process, and Resource) concepts in the context of skills. They relate skills to all three views of PPR and represent skills as finite state machines. Keddis et al. [KKZ14] define a description vocabulary for skills of production resources and an accompanying algorithm for production planning and scheduling. Legat et al. [LSV14] use a description of the resource skills to guide engineers in the implementation of field control code. Backhaus et al. [BUR14] present a classification of manufacturing skills to enable task-oriented programming. This concept is expanded in [BR15] and used to generate executable tasks for a welding robot. Malakuti et al. discuss challenges of skill-based production control for practitioners and possible solutions [Mal+18]. The authors of [Jär+18] use semantic inference to determine the capabilities of resource combinations, for example a robotic manipulator combined with a gripper.

Many interesting approaches to describe the skills of production resources via ontologies and high-level description schemas have been proposed in the literature. All authors report successful application in the demonstration scenarios for which their description schemas were developed. We argue that modeling semantic knowledge about production systems facilitates the integration of resources in a Plug & Produce fashion. But it is not enough. First, the more detailed an ontology for describing manufacturing skills becomes, the less general it is. This leads to the difficult situation where

• general skill description schemas require additional information to cover implementation-specific edge-cases and constraints, and

• detailed skill description schemas are applicable to only a narrow subset of resources and will have difficulties in getting the required support across vendors and integration tooling suppliers. A possible scenario is the emergence of a standardized core skill description schema from which domain-specific and sometimes overlapping schemas will branch off.

Second, semantic reasoners were developed to infer further information from an existing knowledge base. But they are not equipped for reasoning about numerical optimization problems, such as the resource- and time-efficient production of many products on concurrent resources. Therefore, we see the role of semantic skill descriptions primarily in integration and configuration tasks. To make the flexibility of generic resource skills available on the level of large-scale systems, the skills need to be propagated to dedicated planning and runtime control systems that may use different representations. This also offers the possibility to integrate overlapping and domain-specific skill description schemas, provided they can be interfaced to a unified abstraction used for planning and runtime control.

## **5.2 Background: Description Logics**

Description Logics (DL) [Baa03] are a family of formal knowledge representation languages developed to represent hierarchical and relational structures and to enable reasoning over these structures. DL are the formal basis of semantic models with ontologies. DL models are defined in terms of *constants*, *concepts* and *roles*. The semantics of a DL is defined in terms of first-order logic. Hence, no DL is more powerful than first-order logic. Let the domain ∆ be a fixed countably infinite set. The interpretation I is a function ·<sup>I</sup> that assigns an element of ∆ to every constant, a subset of ∆ to every concept, and a binary relation over ∆ × ∆ to every role.


The interpretation has to be consistent with the assertions that are defined for the model.

Table 5.1 gives an overview of the syntax of the EL DL [KKS14], with the addition of concrete domains to express and reason about numerical attributes [BH91]. More expressive DL than EL exist; EL was selected for the exposition due to its relative simplicity and the availability of efficient solvers. By convention, we write the elements of DL models as follows: constants in typewriter font, Concepts in a sans-serif font with a leading uppercase letter, and hasRole definitions in a sans-serif font with a leading lowercase letter.

**Example 5.1.** Begin with the constants alice and bob. Both of them are humans and therefore Human(alice) and Human(bob). Every human is either male or female. Therefore Female ⊑ Human, Male ⊑ Human and Female ⊓ Male ⊑ ⊥. Suppose alice is the daughter of bob. The role childOf describes the parent-child relation. We assert this relation between our two humans as childOf(alice, bob). Daughters are the female children of a human. This can be stated as daughterOf ⊑ childOf with the domain dom(daughterOf) ⊑ Human and the range ran(daughterOf) ⊑ Human ⊓ Female. From this we can infer that in effect daughterOf(alice, bob).

Attributes from concrete domains are also assigned via role relations. Since alice is 11 years old, she has the attributes hasAge(alice, 11) and hasName(alice, "Alice"). Queries over concrete domains are stated in the form of predicates. The concept Human ⊓ ∃hasAge.(<, 12) ⊓ ∃hasName.(=, "Alice") contains all humans that are named Alice and less than 12 years old.
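To make the query semantics concrete, the following Python sketch encodes the assertions of Example 5.1 as plain data structures and evaluates the concept membership test directly. It is only an illustration of the set semantics, not a DL reasoner, and all identifiers are our own.

```
# Assertions of Example 5.1 as plain Python data (illustrative sketch).
concepts = {"Human": {"alice", "bob"}, "Female": {"alice"}}
roles = {"childOf": {("alice", "bob")}}
attributes = {"hasAge": {"alice": 11}, "hasName": {"alice": "Alice"}}

def in_query_concept(x: str) -> bool:
    """Membership test for Human ⊓ ∃hasAge.(<, 12) ⊓ ∃hasName.(=, "Alice")."""
    return (x in concepts["Human"]
            and attributes["hasAge"].get(x, float("inf")) < 12
            and attributes["hasName"].get(x) == "Alice")

print([x for x in sorted(concepts["Human"]) if in_query_concept(x)])  # ['alice']
```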


**Table 5.1:** Syntax and semantics of the EL<sup>+</sup><sub>⊥</sub> Description Logic.

## **5.3 The PPRS Model for Production Skills**

We begin with an informal characterization of the PPRS model. In production, *processes* are used to create *products*. Processes stand for a type of operation, such as welding. The processes are performed by the *resources* of a production system, such as machines and technical equipment in general. *Skills* describe what a component is capable of in general. *Transformations* describe a specific production step that changes the attributes of a workpiece, consumes ingoing workpieces and material to create some output, and so on. Transformations are associated with one or more processes. *Actions* are realizations of a skill, for example to perform a specific transformation of a workpiece.

**Figure 5.1:** Outline of the relations between the PPRS concepts.

Figure 5.1 gives a high-level overview of the six concepts and their relations. More precise definitions are now given, together with their representation in the language of DL.
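As a minimal illustration of the six concepts and their relations, the following Python sketch models them as plain data classes. All class and field names are our own simplifications; in particular, closed intervals stand in for the powerset concrete domains used below.

```
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Process:                        # a type of production operation, e.g. welding
    name: str
    parent: Optional["Process"] = None    # processes form a hierarchy

@dataclass
class Product:                        # a discrete (countable) product type
    name: str

@dataclass
class Skill:                          # what a resource can do, as attribute ranges
    name: str
    process: Process
    attributes: dict = field(default_factory=dict)  # e.g. {"hasPower": (2500, 5000)}

@dataclass
class Resource:                       # a machine or technical asset
    name: str
    skills: list = field(default_factory=list)

@dataclass
class Transformation:                 # a production step with process requirements
    name: str
    process: Process
    inputs: list = field(default_factory=list)      # consumed products
    outputs: list = field(default_factory=list)     # created products
    attributes: dict = field(default_factory=dict)  # required attribute ranges

@dataclass
class Action:                         # an executable realization of a skill
    name: str
    implements: Transformation
    uses: Skill
```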

**Example 5.2.** The example used for the remainder of this chapter comes from the production of automotive battery systems. It is heavily simplified to serve as an educational example. From a high-level perspective, battery systems consist of battery cells that are welded to a conductive busbar. See [Das+18] for a detailed account of the different joining techniques in battery production.

**Definition 5.1** (Product)**.** *A product is a type of marketable good, raw material or intermediate workpiece between production steps. Products describe discrete (countable) entities. Bulk material and fluids must be "packaged" to be considered a discrete product. Whenever product types and product instances need to be differentiated, product instances are denoted as* workpieces *to make the distinction clear.*

In DL, products are represented as constants. The properties of the product type are defined via attributes from concrete domains. The definition corresponds to the product definition from Chapter 2. Workpieces (product instances) of the same product type are interchangeable.

**Example 5.3.** The products defined for the battery production example are cell, busbar and battery.

```
Product(cell)
Product(busbar)
Product(battery)
```

**Definition 5.2** (Process)**.** *Processes denote a type of production operation. Every process defines a set of process attributes that are used to characterize instances derived from the process. Processes form a hierarchy where attributes are inherited from parent processes.*

The term process is overloaded and understood differently in the fields of computer science, statistics, workflow management, production, and many more. The definition used here corresponds to the use of the term *manufacturing process* in DIN 8580 [DIN8580]. In DL, processes are represented as concepts. Processes form a hierarchy of concepts derived from the topmost concept Process.

**Example 5.4.** The processes defined for the battery production example are welding and the more specialized laser-welding. The processes are described by the power used for welding and, for the laser-welding process, the wavelength of the laser.

```
LaserWelding ⊑ Welding ⊑ Process
dom(hasPower) ⊑ Welding
ran(hasPower) = P(R+)
dom(hasWavelength) ⊑ LaserWelding
ran(hasWavelength) = P(R+)
```

The use of the powerset *P*(R+) as the concrete domain of the hasPower role allows instances of a welding process to refer to ranges of possible power values instead of a single scalar. Note that the concrete domains for process attributes may become quite complex, e.g. to describe the possible tool positions and rotations of a 5-axis CNC mill. Implementations may restrict concrete domains, for example to one-dimensional ranges, for which the representation and verification of predicates on a computer is trivial.

**Figure 5.2:** Excerpt from a hierarchy of production processes based on DIN 8580 [DIN8580]. The process attributes are from the indicated concrete domains. Arrows denote an inheritance relation, which is expressed in DL via concept inclusion, such as CountersinkDrilling ⊑ Drilling.


Figure 5.2 shows a more complete example process hierarchy, specializing processes for drilling from DIN 8580. Processes usually originate from the production and logistics domain, but they can also be auxiliary to the core production operations, for example processes for machine maintenance.

**Definition 5.3** (Resource)**.** *Resources denote machines and technical assets in general. Resources are components in the nomenclature of Chapter 2.*

Renaming the components from Chapter 2 to resources may seem unnecessary. It is done in this chapter only to allow a direct comparison with existing production skill models from the literature.

In DL, resources are represented as constants. The set of all resources is denoted with the concept Resource. In addition, derived concepts can be used to group resources, for example 5AxisManipulator ⊑ Manipulator ⊑ Resource. This is useful to query for specific resources, but the grouping of resources has no further purpose in this text.

**Example 5.5.** In the battery production example, we consider only one resource: the (imaginary) LaserWelder200.

```
Resource(LaserWelder200)
```

**Definition 5.4** (Transformation)**.** *Transformations are production operations that transition one or several input products into one or several output products by the application of production processes.*

In DL, transformations are represented as constants that are instances of the Transformation concept. Two new roles, hasInput and hasOutput, both with the domain Transformation and the range Product, are used to model which ingoing products are transformed into which output. The input/output relations between transformations can be used to draw the graph of a bill of processes. Note, however, that transformations must act on a product. Auxiliary processes, such as machine tooling and transportation steps, are not considered.

In addition, transformations are also instances of one or more process concepts.

Transformation ⊑ Process

With that, the transformations can assign values to the respective process attributes. The attributes of a transformation describe requirements that need to be fulfilled by implementations of the transformation. As described earlier, process attributes can have powersets as their domain in order to describe subsets, such as ranges. For example, a transformation for welding two workpieces together could indicate the range of admissible power values in reference to the power attribute specified for the welding process.

**Example 5.6.** To produce a battery, cells are welded together with a busbar. The welding process requires a power from the range between 2 kW and 3 kW.

```
Transformation(JoinBatteryCells)
hasInput(JoinBatteryCells, cell)
hasInput(JoinBatteryCells, busbar)
hasOutput(JoinBatteryCells, battery)
Welding(JoinBatteryCells)
hasPower(JoinBatteryCells, [2.000, 3.000])
```
**Definition 5.5** (Skill)**.** *Skills describe that a resource can execute a process under constraints defined in terms of the process attributes.*

In DL, skills are represented as constants. Similar to transformations, skills are also instances of processes.

Skill ⊑ Process

Therefore, skills refer to the same attributes that are used to define the transformation requirements. But in the case of skills, the attributes describe the possibility to realize a process for the given attribute values.

**Example 5.7.** The LaserWelder200 has the skill to weld with a power between 2.5 kW and 5 kW. The CO2 laser used has a fixed wavelength of 10.6 µm.

```
Skill(LW200LaserWeld)
hasSkill(LaserWelder200, LW200LaserWeld)
LaserWelding(LW200LaserWeld)
hasPower(LW200LaserWeld, [2.500, 5.000])
hasWavelength(LW200LaserWeld, 10.6)
```
Now we can formulate a query to find all skills that match the JoinBatteryCells transformation in terms of supported processes and process attributes. This leads us to the resources that possess said skills. The ∩ predicate is used to find attributes whose overlap with the indicated range is nonempty.

```
SkillMatch ⊑ Welding ⊓ ∃hasPower.(∩, [2.000, 3.000])
ResourceMatch ⊑ ∃hasSkill.SkillMatch
```
The skill representations cannot capture all details of a production system. Therefore, we can only test whether a resource can perform a transformation in general. For some domains, the attributes might be sufficient to guarantee the feasibility of a match. In other domains, additional checks need to be performed, either via detailed descriptions of the machine and product geometry or by manual intervention.
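Building on the data-class sketch from the beginning of this section, the matching test can be sketched in Python as follows: a skill matches a transformation if its process equals the required process or specializes it, and if every required attribute range overlaps the range supported by the skill. The helper names are again our own.

```
def overlaps(a: tuple, b: tuple) -> bool:
    """Nonempty intersection of two closed intervals, i.e. the (∩, range) predicate."""
    return max(a[0], b[0]) <= min(a[1], b[1])

def process_matches(skill_process: Process, required: Process) -> bool:
    """The skill's process equals the required process or is a specialization of it."""
    p = skill_process
    while p is not None:
        if p is required:
            return True
        p = p.parent
    return False

def skill_matches(skill: Skill, t: Transformation) -> bool:
    return process_matches(skill.process, t.process) and all(
        attr in skill.attributes and overlaps(skill.attributes[attr], rng)
        for attr, rng in t.attributes.items())

# The battery example from above (power values in watts):
welding = Process("Welding")
laser_welding = Process("LaserWelding", parent=welding)
join = Transformation("JoinBatteryCells", welding,
                      attributes={"hasPower": (2000, 3000)})
lw200_weld = Skill("LW200LaserWeld", laser_welding,
                   attributes={"hasPower": (2500, 5000)})
assert skill_matches(lw200_weld, join)  # overlap [2500, 3000] is nonempty
```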

However, once a skill has been specialized for a transformation or auxiliary operation, implemented, tested and deployed to the production system, e.g. in the form of PLC control code, then we know the exact conditions under which the concrete operation can be applied.

**Definition 5.6** (Action)**.** *An action is an executable process instance with well-defined preconditions and effects. Actions are defined for one or several participating resources. The DL concept* Action *contains instance elements that correspond to the formal action definition from Chapter 2.*

Actions are assumed to encapsulate all the information required for execution on the resource, for example in the form of IEC 61131 PLC code, configuration parameters, and so on. As a consequence, an action can be triggered simply by reference to its identifier, provided that all its preconditions are fulfilled.

**Example 5.8.** The action LW200BatteryWeld is an implementation of the JoinBatteryCells transformation. It makes use of the LW200LaserWeld skill.

```
Action(LW200BatteryWeld)
implements(LW200BatteryWeld, JoinBatteryCells)
uses(LW200BatteryWeld, LW200LaserWeld)
```
## **5.4 Assisted Generation of Executable Actions**

In a Plug & Produce scenario, at some point the generic skills of machines and equipment need to be specialized into executable actions with known preconditions and effects. This splits into two steps: first, matching the requirements of a transformation against the skills of the available resources (cf. Section 5.3); second, generating, testing and deploying the executable actions that realize the matched skills.


The assisted generation of actions was considered as part of the architecture of the SkillPro project. See Figure 5.3 for an overview.

The machines and equipment are assigned to a *Skill Execution Engine* (SEE). The SEEs provide smart wrappers for the physical resources, which range from conveyor belts with little configurability to complex machine tools and even human workers. Facing towards the underlying resources, the SEEs implement domain-specific connectivity, e.g. based on a fieldbus protocol or OPC UA. In the case of a human worker, this is accomplished with a tablet-based graphical user interface. The SEEs may also contain domain-specific knowledge on how to derive actions from high-level skill-based descriptions.

The *Manufacturing Execution System* (MES) is responsible for orchestrating the available resources in order to achieve short- to mid-term manufacturing goals. For this, the MES implements two main features working in lockstep: computing a fine-grained execution plan that accomplishes the manufacturing goals, and orchestrating the manufacturing resources at runtime.

The *Asset Management System* (AMS) constitutes the central knowledge base of a manufacturing facility and provides this information to the adjacent components. It contains semantic descriptions of the available resources and their skills, as well as a detailed plant model including topological relations (e.g. how resources are arranged in work cells) and topographical relations (e.g. the layout and position of the resources).

**Figure 5.3:** Architecture of the SkillPro project [Pfr+15]

The AMS also holds product models, including drawings, bills of material and bills of processes. It furthermore manages customer orders on a long-term horizon and ensures that the required resources with the right skills are available. Lastly, it interfaces the SkillPro framework with enterprise-level ERP (Enterprise Resource Planning) and PDM (Product Data Management) systems.

Assume that a list of product transformations (a bill of processes) is initially provided; it could also be inferred from the product description itself [TMP92]. The task of determining the right operations to produce a specific product was originally investigated as Computer-Aided Process Planning (CAPP) [ElM93; Kir95]. CAPP deals with finding "a way through the production system". CAPP is thus different from production planning and scheduling, where the production of many products on a given plant layout is considered.

The generation of executable actions from high-level skill-based descriptions requires communication between the AMS and the SEE framework components. The SEEs may internally implement actions with a variety of technologies. Possible ways of implementing actions include:

• Customization of predefined procedures by parametrization, where the defined parameters are either resource- or production-domain specific [ON15]. Note that a description of the entire product in an appropriate format may constitute a parameter in this context. Further examples of reasoning on the execution of a high-level task model can be found in the RoboEarth project [TB09; Wai+11].


The architecture from Figure 5.3 was implemented in the SkillPro project. At runtime, every SEE is represented by an OPC UA server that provides a uniform interface. Clients can connect to the OPC UA server and discover the current state of the component as well as the available actions. From the perspective of a higher-level control system, the SEEs are thus all uniform, even if they represent very different types of production equipment. If coordination between components is required for an action (for example, the time-synchronized start of a procedure), they can also use OPC UA or rely on some other communication technology in the background. The latter is discouraged, since it introduces a technically rigid solution for which quick adaptations in the sense of Plug & Work lack tool support.
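As an illustration of such an SEE facade, the following sketch uses the open-source python-opcua library to expose a component state and one action as an OPC UA object with a callable method. The endpoint, namespace URI and behavior are illustrative assumptions; the actual SkillPro interface differs in detail.

```
# Hypothetical SEE facade as an OPC UA server (python-opcua,
# https://github.com/FreeOpcUa/python-opcua). All names are illustrative.
from opcua import Server, ua

server = Server()
server.set_endpoint("opc.tcp://0.0.0.0:4840/see/")
idx = server.register_namespace("http://example.org/see")

see = server.get_objects_node().add_object(idx, "LaserWelder200")
state = see.add_variable(idx, "State", "Idle")

def lw200_battery_weld(parent):
    # A real SEE would trigger the deployed PLC procedure here and return
    # only once the action's effects have been established.
    state.set_value("Welding")
    state.set_value("Idle")
    return [ua.Variant(True, ua.VariantType.Boolean)]

# Clients discover the method via browsing and trigger the action by its identifier.
see.add_method(idx, "LW200BatteryWeld", lw200_battery_weld,
               [], [ua.VariantType.Boolean])

server.start()
```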

# **6 Conclusion**

Automated production systems face a tradeoff between efficiency and flexibility. This thesis aims to improve the flexibility of automated production systems by the use of a unified model representation on all levels of the control hierarchy. Recall the postulate from the outset of Chapter 2:

The same set of modeling principles can represent the relevant properties of production systems on all levels of the control hierarchy.

The thesis developed modeling principles that are able to represent both continuous and discrete production on all levels of the control hierarchy, including concurrency (i.e. parallel operations and the synchronization of system components) as well as uncertainty in stochastic scenarios. To our knowledge, no prior approach was able to encompass all of these properties. In order to make use of such a model for planning and runtime control, appropriate decisions need to be derived. Chapter 3 started with the following postulate:

The same algorithm can be used for planning and runtime control on all levels of the control hierarchy – ranging from continuous dynamics of a physical system to global supply-chain operations – and for both continuous and discrete production.

Accordingly, an algorithm for optimal sequential decision making was developed based on forward-simulation of the model from Chapter 2. Now that a model for all levels of the control hierarchy and a matching planning algorithm exist, the remaining challenge is to speed up planning for production operations at an industrial scale. A range of measures was developed in this thesis for this purpose:

• Pruning action sequences that are equivalent in a precise sense (Section 3.1).


A way to speed up planning that is orthogonal to the techniques just mentioned is to decompose the planning problem into smaller subproblems that are solved by independent agents. For this, the algorithm from Chapter 3 was extended for multi-agent coordination in Chapter 4, which set out to achieve the following target result:

Independent agents can jointly perform planning in a production scenario where every agent only has a simulation model of the system part in its visible scope.

The core idea of the developed planning algorithm is to use "utility propagation" for the agent coordination, similar to "belief propagation" in probabilistic graphical models. The decomposition into independent agents not only reduces the planning complexity. Companies in a supply chain can also jointly optimize their actions without a central entity that has access to all private information.

But in order to optimize production with planning algorithms, the model representations of the production system need to be accurate. Furthermore, if the resulting plans are to be used for automated control, then the action abstractions used in the model need to be available as automated procedures on the actual machines and equipment. Keeping the model and the physical production system synchronized can become quite resource-intensive when the production is flexible and changes over time. It is therefore preferable to automate much of the configuration work as well. The beginning of Chapter 5 postulates the following:

Executable actions for planning and runtime control in a production system can be generated from higher-level descriptions of the production system and the requirements.

We put forward a framework to describe the skills of production equipment based on Description Logics (i.e. semantic modeling). The framework also encompasses the description of product transformations and the manufacturing processes involved. These descriptions are then used to match production steps with machines and equipment capable of performing them. Then, executable actions are generated and deployed for use by the runtime control systems. This is a key enabler for flexible production: instead of manually programming a PLC, higher-level abstractions can be used to generate and parameterize the required control code.

The thesis has answered all four postulates in the affirmative. Examples from different production scenarios and from all levels of the control hierarchy have been used to substantiate the claims. But surely this work can only be a stepping stone towards future automated production systems that are both efficient and flexible. To realize this goal, a large research programme is necessary that goes beyond the scope of a single thesis. We now enumerate open research questions and some of the key results that need to be obtained in the future for the work from different domains to coalesce into a coherent whole.

**Derive existing models for production and logistics from a common core** The production and logistics domain has developed numerous approaches for modeling operations with varying degrees of detail. This text has proposed a high-fidelity model as the common basis for all such models used in production and logistics. To support that claim, examples from several production domains and on several levels of the automation hierarchy have been used for the exposition.

But a more comprehensive treatment beyond mere examples is required. It is an open research question which limiting assumptions need to be made on top of the high-fidelity model to recover the more coarse-grained models that are already in productive use today. Such an investigation is the basis for an automated treatment of the interfaces where subsystems that are controlled by different modeling approaches connect.

**Using Machine Learning to Guide Exploration** Monte-Carlo Tree Search (MCTS) as a planning algorithm is essentially blind once it reaches a point in the scenario tree that was never visited before. Machine Learning methods can be used to enable MCTS to "learn to see". The Q-value of the system state (or of a belief distribution over the system state) can be approximated from similar experiences in the past that were reached with a different sequence of actions. In that sense, the model structure is not explicitly represented a priori but learned from the interaction with the scenario over a series of plays. This has become known as Approximate Dynamic Programming [Ber+05; Pow07]. The combination of MCTS with neural networks as function approximators has famously been used in the Deepmind AlphaGo system [Sil+16]. The expectation is that a further reduction of the sample complexity for simulation-based planning with concurrent production system models can be achieved.

There is currently an active community working on Multi-Agent Reinforcement Learning (MARL) [BBD08], and recent advances in deep learning are being integrated with MARL. Recent work also lets the agents learn how to communicate instead of predefining the information flow [Foe+16]. The "utility propagation" used in Chapter 4 could be combined with recent MARL techniques, for example by sending neural networks for Q-value estimation as messages instead of an explicit Q-value representation on a search tree.

**Explicit representation of product attributes** The model from Chapter 2 assumes that products are interchangeable. Instead of representing them individually, only a count of the products of the same type at the possible locations is considered. But there are of course differences between products of the same product type. This becomes apparent, for example, when quality issues come up. Quality (the definition of which depends on the case at hand) is then conditionally dependent on the attributes of incoming raw material and semi-finished products, the process settings and the state of the physical process. A complete "theory of production" should therefore encompass product attributes as well.

**Open systems with agents entering and leaving at runtime** The decentralized planning approach proposed in Chapter 4 allows agents to enter and leave the system at runtime. Such changes do not have to be communicated within the entire system; only neighboring agents have to be informed. A remaining question is how fast the overall system can adjust to the changes.

**Domain-specific action generation** The generation of executable actions from skill definitions requires domain-specific tool support. In some domains, commercial products are already available that can transform higher-level descriptions into executable definitions, notably in the robotics domain where clear categories of equipment exist (cf. https://www.artiminds.com and https://www.keba.com). Another example is the standardized "G-code" for numerically controlled machine tools [ISO82]. G-code can be exported by virtually all CAD/CAM tools that target product design. The control synthesis solutions for the individual application domains then have to be integrated, for example to assist the integration of systems that combine robotics for loading and unloading, a machine tool that transforms the workpiece, and visual inspection for quality control. To us, high-level descriptions based on skill definitions are the most promising inroad for control synthesis in domains with more variety in the types of machines and equipment as well as a larger range of products to consider.

# **Bibliography**


























