The Karlsruhe Series on Software Design and Quality

**32**

**Markus Kilian Frank**

### **Model-Based Performance Prediction for Concurrent Software on Multicore Architectures**

A Simulation-Based Approach

### **The Karlsruhe Series on Software Design and Quality, Volume 32**

Dependability of Software-intensive Systems Group, Faculty of Computer Science, Karlsruhe Institute of Technology

and

Software Engineering Division, Research Center for Information Technology (FZI), Karlsruhe

Editor: Prof. Dr. Ralf Reussner

### **Model-Based Performance Prediction for Concurrent Software on Multicore Architectures**

A Simulation-Based Approach

by Markus Kilian Frank

Dissertation, University of Stuttgart, Faculty 5: Computer Science, Electrical Engineering and Information Technology

Date of the oral examination: 15 March 2021

First reviewer (Hauptberichter): Prof. Dr.-Ing. Steffen Becker

Second reviewer (Mitberichter): Prof. Dr. Ralf H. Reussner

### **Imprint**

Karlsruher Institut für Technologie (KIT), KIT Scientific Publishing, Straße am Forum 2, D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2022 – Printed on FSC-certified paper

ISSN 1867-0067 · ISBN 978-3-7315-1146-5 · DOI: 10.5445/KSP/1000139935

### **Model-Based Performance Prediction for Concurrent Software on Multicore Architectures—A Simulation-Based Approach**

Dissertation accepted by the Faculty of Computer Science, Electrical Engineering and Information Technology of the University of Stuttgart in fulfilment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

### Submitted by Markus Kilian Frank

born in Bensheim

First reviewer: Prof. Dr.-Ing. Steffen Becker

Second reviewer: Prof. Dr. Ralf H. Reussner

Date of the oral examination: 15 March 2021

Institute of Software Engineering, Software Quality and Architecture Group, 2021

### **Abstract**

Model-based performance prediction is a well-known concept for ensuring the quality of software. Software architects create abstract architectural models and specify the software behaviour, the hardware characteristics, and the users' interaction. They enrich these models with performance-relevant characteristics and use performance models to solve the models analytically or to simulate the software behaviour. In this way, software architects can predict quality attributes such as the system's response time. They can thus detect violations of service-level objectives early, at design time, and alter the software design until it meets the requirements.

Current state-of-the-art tools like Palladio have proven useful for over a decade now and provide accurate performance predictions not only for sophisticated but also for distributed cloud systems. However, they are built upon the assumption of single-core CPU architectures and consider only the clock rate as a single metric for CPU performance. Current processor architectures, in contrast, have multiple cores and a far more complex design. The use of a single-metric model therefore leads to inaccurate performance predictions for parallel applications on multicore systems.

In the course of this thesis, we face the challenges for model-based performance prediction which arise from multicore processors and present multiple strategies to extend performance prediction models. In detail, we (1) discuss the use of multicore CPU simulators employed by CPU vendors; (2) conduct an extensive experiment to understand the effect of performance-influencing factors on the performance of parallel software; (3) research multi-metric models to better reflect the characteristics of multicore CPUs; and finally, (4) investigate the capabilities of software modelling languages to express massively parallel behaviour.

As a contribution of this work, we show that (1) multicore CPU simulators simulate the behaviour of CPUs in detail and accurately; however, when architectural models are used as input, the simulation results are very inaccurate. (2) Based on extensive experiments, we present a set of performance curves that reflect the behaviour of characteristic demand types. We integrated these performance curves into Palladio and improved the prediction accuracy significantly. (3) We present an enhanced multi-metric hardware model which reflects the memory architecture of modern multicore CPUs. (4) We provide a parallel architectural pattern catalogue which includes four of the most common parallelisation patterns (i.e., parallel loops, pipes and filters, fork/join, master-worker). Through this catalogue, we enable software architects to model the parallel behaviour of software faster and with fewer errors.

### **Zusammenfassung**

Model-based performance predictions are a well-known concept for assuring the quality of software. Software architects create abstract architectural models and specify the software behaviour, the hardware characteristics, and the users' interaction. They enrich the models with performance-relevant properties and use performance models to simulate the software behaviour or to determine it analytically. In this way, software architects can predict quality attributes such as the system's response time to user requests. They can thus detect violations of service-level objectives from the design alone and revise the software design until it meets the requirements.

Palladio is a state-of-the-art tool that has proven itself for more than a decade. It provides accurate performance predictions not only for sophisticated but also for distributed systems. However, Palladio is built on the assumption of single-core CPU architectures and considers only the clock rate as a single metric. Current processor architectures, in contrast, have multiple cores and a more complex design. The use of a single-metric model therefore leads to inaccurate performance predictions for parallel applications on multicore systems.

In the course of this thesis, we address the challenges for model-based performance prediction that arise from multicore processors and present several strategies for extending performance prediction models. In detail, we (1) discuss the use of multicore CPU simulators employed by CPU vendors; (2) conduct an extensive experiment to understand the influence of performance-influencing factors on the performance of parallel software; (3) research multi-metric models to better reflect the characteristics of multicore CPUs; and finally, (4) investigate the capabilities of software modelling languages to express massively parallel behaviour.

As a contribution of this work, we show that (1) multicore CPU simulators simulate the behaviour of CPUs in detail and accurately; however, when architectural models are used as input for the simulators, the simulation results are of poor quality. (2) Based on the extensive experiments, we present a set of reference curves that reflect the behaviour of characteristic workloads. We integrated these reference curves into Palladio and can improve the performance predictions considerably. (3) We present an enhanced multi-metric hardware model that reflects the memory architecture of modern multicore CPUs. (4) We provide a catalogue of parallel architectural patterns containing four of the most common parallelisation patterns. Through this catalogue, we enable software architects to model the parallel behaviour of software considerably faster and with fewer errors.

### **Acknowledgements**

Writing a doctoral thesis is a long and arduous journey with many ups and downs. Mastering the road alone is almost impossible. Therefore, I would like to take a moment to thank all the people who supported me during the past five years.

First of all, I would like to thank my amazing wife, Judith, for her unconditional support, her warm words, and her trust and belief in me. Her constant support not only eased my mind when I needed to relax, but also gave me the strength to push forward. Deciding to continue my PhD and follow Steffen to Stuttgart forced us back into a long-distance relationship. But this did not stop you from loving me, giving me comfort, and finally marrying me. Judith, I love you more than anything!

No fewer thanks go to my parents: my father Klaus, who is a bright inspiration for me. If it were not for him, I would never have started my PhD and followed his example. Now I hope I am one step closer to being as great as you, Dad. My beloved mother, Anne, was there for me from the beginning. She has supported me throughout my life and always wants only the best for me. Mum, Dad, you are the best!

Next in line, I have to show my gratitude to all my workmates, first of all my supervisor Steffen. He moved mountains and fought giants to make it possible for me to start at the university in the first place, after my application e-mail fell victim to a spam filter and therefore missed the application deadline. But that is not all: his great feedback always guided me through the foggy journey of my PhD. Thank you for that!

In this context, I would like to thank all my colleagues in Chemnitz. Especially Marcus, who translated and taught me the rules of research as a young researcher. Long discussions with him helped me to understand the disastrous reviewer comments of rejected papers and to improve. Marcus is someone to push deadlines with until the last minute, even if that means working all night—or at least as long as there is coffee.

Of course, I also thank all my "Doktorgeschwister": Marina, Thomas, Henning, Floriment, Alireza, Vijayshree, Sebastian F., Sandro, Angelika, Stephan, Max, Maximilian, Sebastian K., and Heiko, as well as the rest of my colleagues André, Jörg, Robert, and Erik, for their many hours of discussions, advice, reviews, feedback, and research meetings. In particular, I would like to highlight the contributions of Floriment and Alireza—your feedback on the thesis helped me a lot.

Excellent research is not done alone. Thus, great thanks go to all the students I had the pleasure to work with in the past five years, in particular Denis, Julian, Kim, Lukas, Philipp, and Sebastian. Your high interest in the topics was great, and it was a pleasure to work with you. Finally, I have to admit that writing a doctoral thesis can be hard not only for the writer, but also for his whole environment. In this sense, I would also like to thank all my dear friends for their support, motivational words, feedback, and the time they spent drinking beer with me and listening to me grumbling about my work.

Thank you.

### **Contents**








### **Part I.**

### **Introduction and Foundation**

### **1. Introduction**

Manufacturers had been doubling the density of components per integrated circuit at regular intervals, and they would continue to do so as far as the eye could see.

Gordon E. Moore – 1965

Software sells. This slogan is true for almost all areas of today's business. Software no longer has a supporting role, but is a core feature and enabler of technology, features, usability, and business. Autonomous cars, smartphones, smart homes, legal tech companies, and multimedia streaming services are only a few examples of successful applications that dominate our daily life and depend heavily on sophisticated software. This software is so complex that it contains thousands or even millions of lines of code; it cannot be developed by a single person anymore, and it has to fulfil high quality standards to meet its Service-level Objectives (SLOs). Due to the complexity of the software and the immense cost of software failures and bugs, such software is developed in an engineering-like way to ensure high quality standards [KBAW94; KKB+98]. This engineering-like way includes a structured method of collecting requirements, creating architectural designs, as well as evaluating and testing. In the following, we focus on the evaluation of architectural designs in the early design phase. Here, model-based performance prediction approaches are used to simulate and evaluate the quality attributes of an architectural design (e.g., response time). To use such approaches, the Software Architect (SA) must create an architectural model of the software (i.e., the software model), specify the users' behaviour (i.e., the user model), and create a description of the hardware characteristics (i.e., the hardware model). In the next step, the SA uses simulation-based or analytic

solvers to evaluate the different quality attributes of the architectural design. State-of-the-art approaches like the Palladio Bench<sup>1</sup> or CloudSim<sup>2</sup> achieve accurate predictions for complex, distributed, and cloud systems.

Nevertheless, all current approaches consider only a single metric in the hardware model—the CPU speed—as relevant for estimating the performance of the system. This assumption is appropriate for hardware powered by CPUs with up to four cores. However, today's common CPUs have more than four cores. By now, multicore processors have been widely used for more than a decade in all types of devices, such as smartphones, laptops, and desktop PCs. While smartphones have up to 8 cores, desktop PCs with 16 cores and servers with more than 100 cores are a common sight today.

Moving from single-core CPUs to multicore CPUs brings a range of new challenges to the software engineering domain. First of all, to use the full potential of multicore processors, software developers must write software that supports parallelism on multiple levels. Writing parallel software is even more challenging when developers must consider livelocks and deadlocks, synchronisation, concurrent data access, etc.

Different domains tackle the multicore challenge in their own ways: in safety-critical embedded systems like aeroplanes or cars, it is important to prove the correctness of the application and to guarantee deadlines (e.g., detecting and reacting before crashing into an obstacle). Because parallelism significantly increases the complexity, it was common practice in the embedded domain to disable all but one core and continue to use sequential applications [KSS+17]. However, due to the increased amount of software (and thus, hardware requirements), manufacturers are now forced to develop new approaches not only to develop parallel applications but also to specify and verify them.

In the HPC domain, parallel execution has been researched for years. HPC focuses on low-level and algorithmic optimisation. It is common practice to use programming languages like Fortran and to optimise each instruction. Developers in HPC search for potential optimisations and count every byte to, e.g., fit their instructions into a single cache page. That way, they can gain

<sup>1</sup>http://www.palladio-simulator.com/

<sup>2</sup>http://www.cloudbus.org/cloudsim/

massive performance boosts, since each instruction is executed millions of times.

Developers of Business Information Systems (BIS) usually have neither the expert knowledge of HPC developers nor the resource limitations of embedded systems. A common practice today is therefore to slice applications into (micro-)services and to avoid parallelism within these services. Instead, parallelism is achieved by running multiple instances of a service and handling user requests in isolation. For example, to coordinate or exchange data in a Kubernetes cluster, key-value databases like etcd.io are used, even though a shared in-memory solution like Redis might be much faster, albeit more complex when it comes to handling parallel accesses.

Slicing applications along the user requests (jobs) has the advantage that each job can be handled independently. However, the benefits are limited. With multiple jobs accessing the main memory at the same time—even if they have isolated memory spaces—shared resources like the memory bus become a bottleneck. Further, slicing is often not possible due to the domain. For example, in data analysis, the whole data set is evaluated, and the algorithms are complex, time- and resource-intensive. Thus, to use the full potential of today's hardware, the code within the services also needs to be parallelised.

Today, it is still common practice for people from High Performance Computing (HPC) and BIS to follow a trial-and-error approach to see if the software under development fulfils the SLOs. This approach is not only cost- and time-intensive, but simply not applicable to large-scale systems like Facebook, Netflix, or Twitter any more. These systems are so large and have such a high number of user requests that it is simply not possible to generate the load for testing any more [WS03].

Thus, having reliable software performance predictions for parallel applications in multicore environments is more critical than ever. We need to enable SAs to factor in parallel behaviour during the early design phase, which is challenging, since commonly used languages for designing software (e.g., UML 2.5<sup>3</sup>) have only limited capabilities to express parallel behaviour, and the SA needs to model each behaviour manually. Next, we

<sup>3</sup>UML 2.5 Specification: https://www.omg.org/spec/UML/2.5.1/PDF

must re-evaluate the model-based performance prediction methods to show their accuracy and suitability for parallel software.

In the course of this thesis, we will not focus on the challenges of coding parallel applications, but on the modelling and performance prediction aspects.

### **1.1. Requirements to Enable Model-Based Performance Predictions for Parallel Software in Multicore Environments**

Given this background, we can identify major requirements to successfully perform model-based performance predictions for parallel software running in multicore environments:


### **1.2. Problem Statement**

Unfortunately, no approach exists which fulfils all of the above requirements [FHLB17]. However, approaches exist that meet at least one requirement, although none of them focuses on model-based performance predictions. In what follows, we give a brief overview (see Chapter 4 for a full discussion of the related work):


Due to the paucity of all-encompassing approaches, SAs are currently limited in their ability to model parallel behaviour, and the process of modelling is highly error-prone and time-consuming [FH16; FSH17]. Furthermore, when it comes to performance predictions, SAs are currently not able to make reliable Quality of Service (QoS) predictions for parallel applications running in multicore environments. Thus, an engineering-like approach to developing highly parallel applications currently suffers from single-metric hardware models, incomplete performance models, inaccurate solvers, and the absence of language support for modelling parallel software behaviour.

### **1.3. Solution Overview & Contributions**

To overcome the shortcomings named above, we propose an approach containing four individual contributions combined in the Palladio Bench.


**Figure 1.1.:** Overview of the solution and contributions presented in this thesis

Figure 1.1 gives an overview of the contributions. To provide them to the SA as one combined approach, we integrate them into the Palladio Bench. In this way, the SA can benefit from the full potential of all contributions.

CB1: First, we provide a parallel architectural template catalogue based on the AT method to offer SAs a set of easy-to-use common parallel design patterns. Thus, we can significantly reduce the time an SA needs to model parallel behaviour—while keeping the number of errors low. At the same time, we increase user acceptance and improve the user experience. In total, we support four abstract parallel design patterns (Master-Worker, Parallel Loops, Fork & Join, Pipes & Filters), which the SA can use to model the behaviour of 33 common parallel design patterns.

CB2: We conduct extensive experiments to analyse the impact of performance-influencing factors on the response time. We use the measurements to derive performance curves, which we integrate into Palladio to increase the prediction accuracy. As a result, we provide the SA with a set of performance curves for common types of software behaviour. These performance curves can improve the prediction accuracy without detailed modelling of all performance-relevant aspects.

CB3: We extend the Domain-Specific Language (DSL) of the Palladio Bench, the Palladio Component Model (PCM) [BKR09], and include characteristic elements to reflect the memory hierarchy in performance models. In doing so, we also extend the current state-of-the-art simulator (SimuLizar) to handle the models. As a result, we present a memory hierarchy model, implement the approach in the PCM and SimuLizar, and are now able to simulate cache behaviour and memory bandwidth utilisation to a certain extent.

CB4: We connect multicore CPU simulators used by hardware architects and CPU vendors to Palladio. We use the PCM models as input for the simulators, simulate them, and play the results back. We eventually provide two strategies: a trace-driven and a source code-driven approach. We evaluate both methods and are able to show that CPU simulators cannot be used for realistic model-based performance predictions, due to the low-level information needed in the input model. This information is absent in our architectural input models.

In the context of this doctoral project, we published a number of peer-reviewed publications, including conference papers, journal articles, workshop papers, and posters. Further, a number of student theses were supervised by the author of this thesis. Appendix A.1 gives a detailed overview of the publications and topics.

### **1.4. Thesis Structure**

The remainder of this thesis is structured in three parts: Introduction & Foundations, Contributions, and Summary. Table 1.1 gives an outline of the remaining chapters.


**Table 1.1.:** Overview of the thesis structure

### **2. Foundations**

In the following section, we introduce the fundamental concepts needed to understand and follow the rest of the thesis.

First of all, we lay out the basics of parallel and concurrent software. In the same section, we introduce two different taxonomies to categorise concurrent and parallel software: a categorisation based on memory usage and a categorisation based on information exchange.

After we understand the software characteristics of concurrent and parallel software, we expound the hardware characteristics of multicore CPUs, focusing on the high-level concepts needed to follow the rest of the thesis.

In the latter portion of this section, we use that knowledge to elaborate on common parallelisation patterns, approaches to predict the behaviour of multicore CPUs, and model-based approaches to predict quality attributes of software designs.

### **2.1. Parallel and Concurrent Software**

In this section, we elaborate on the characteristics of parallel and concurrent software, focusing only on the software view (the hardware view is illuminated in Section 2.2).

### **2.1.1. Parallel vs. Concurrent**

Parallelism and concurrency are often used as synonyms in the literature. However, they are not the same thing.

**Figure 2.1.:** Concurrent vs. Parallel Execution

In computing, concurrency was first used to better utilise or share resources within a computer (cf. [MSM04]). For that, the computing task is partitioned into smaller subsets, and, with the help of the operating system's schedulers, tasks can be swapped quickly. This has the benefit that one task does not have to lock the processor while idling (i.e., while waiting for I/O). By quickly swapping many tasks, it appears to the user as if the tasks were executed in parallel. However, this need not be the case. Figure 2.1a exemplifies a concurrent execution of multiple tasks.

Compared to concurrency, parallelism describes the behaviour of two tasks being executed at the same time, in parallel. Figure 2.1b exemplifies a parallel execution.

Finally, we can conclude with the following definitions for concurrency and parallelism from [Sun08]:

"Concurrency: A condition that exists when at least two threads are making progress. A more generalised form of parallelism that can include time-slicing as a form of virtual parallelism.

Parallelism: A condition that arises when at least two threads are executing simultaneously."

In addition, Table 2.1 summarises the different characteristics.

While we use concurrency to utilise a single core more efficiently, parallelism needs real multicore systems to execute different threads in parallel. Thus, we use multicore systems to improve the throughput of a system. In this thesis, we focus on parallelism, parallel software, and multicore architectures.


**Table 2.1.:** Comparison of Concurrency and Parallelism (cf. [Tec17])
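To make the distinction concrete, the following minimal Java sketch (our own illustration; class and method names are hypothetical) submits the same two CPU-bound tasks first to a single-threaded pool, which processes them one after the other, and then to a two-threaded pool, which executes them simultaneously:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentVsParallel {

    // A task that keeps one core busy for roughly the given time.
    static Runnable busyTask(String name, long millis) {
        return () -> {
            long end = System.currentTimeMillis() + millis;
            while (System.currentTimeMillis() < end) { /* simulate work */ }
            System.out.println(name + " finished on " + Thread.currentThread().getName());
        };
    }

    static void runOn(ExecutorService pool, String label) throws InterruptedException {
        long start = System.currentTimeMillis();
        pool.submit(busyTask("task A", 500));
        pool.submit(busyTask("task B", 500));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(label + " took ~" + (System.currentTimeMillis() - start) + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        // One worker thread: the tasks share the core and are processed
        // one after the other (time-slicing would interleave them; either
        // way, there is no parallel speed-up).
        runOn(Executors.newFixedThreadPool(1), "concurrent (1 thread)");
        // Two worker threads: the tasks run simultaneously on 2+ cores.
        runOn(Executors.newFixedThreadPool(2), "parallel (2 threads)");
    }
}
```

On a machine with at least two free cores, the second run should take roughly half the wall-clock time of the first, which is exactly the difference between mere concurrency and true parallelism.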

### **2.1.2. Shared vs. Distributed Memory**

In parallel systems, it can be necessary to exchange data among the individual tasks. The most common approaches are based on either shared or distributed memory. In this context, the terms shared and distributed memory do not refer to the physical location or layout of the memory, but rather to how the memory is presented to the parallel applications (cf. also shared and distributed memory computer architectures):

- **Shared memory:** all parallel tasks see a single common address space and exchange data by reading and writing shared variables.
- **Distributed memory:** each task has its own private address space; data is exchanged explicitly, typically by passing messages.
### **2.1.3. Means to Parallelise**

Depending on the given memory access method (shared or distributed), different parallelisation paradigms have to be used to support the access method or to ensure it. In the following section, we explain two general means of achieving parallelisation: thread-based and message-based. For each of these two methods, we give examples of commonly used implementations. The list of examples is far from complete and only serves to explain the basic concepts.

### **2.1.3.1. Thread-Based Approaches**

In thread-based approaches, parallelisation is achieved by spawning new threads. The operating system then schedules the new threads to the processors and cores. Data exchange follows the principle of shared memory, which makes it necessary for the developer to take care of mutual exclusion. In the next three paragraphs, we explain pure threads and stream processing, as well as OpenMP.

**Threads:** Threads are the most basic means of achieving parallelisation. Figure 2.2a exemplifies the approach. To achieve thread-based parallelism within an application, the main thread of the application forks new threads and assigns tasks to them. Each thread executes its subroutine, and by scheduling the threads to individual cores (done by the operating system), the threads run in parallel. This approach is often also called task parallelism, because each task is separated into an individual thread [Rei07]. To use this approach successfully, it is essential that the individual tasks have no, or only limited and well-designed, inter-thread communication. If they share the same data, the developer needs to take care of data access restrictions (i.e., locks and mutual exclusion). Thread-based means to parallelise are the foundation for design patterns (like the master-worker pattern) and parallelisation patterns (like fork-join). We discuss these patterns in more detail in Section 2.3.
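As a minimal Java sketch of this idea (our own illustration; names are hypothetical), the main thread forks two worker threads that operate on a shared variable, protects the shared access with a lock, and finally joins the threads:

```java
public class TaskParallelism {

    private static int sharedCounter = 0;
    private static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                synchronized (lock) { // mutual exclusion on the shared variable
                    sharedCounter++;
                }
            }
        };

        // Fork: the main thread spawns two worker threads with the same task.
        Thread worker1 = new Thread(task);
        Thread worker2 = new Thread(task);
        worker1.start();
        worker2.start();

        // Join: the main thread waits until both subroutines have finished.
        worker1.join();
        worker2.join();

        System.out.println("counter = " + sharedCounter); // always 200000 thanks to the lock
    }
}
```

Without the `synchronized` block, the two threads would race on the shared counter and the final value would be unpredictable, which illustrates why data access restrictions are the developer's responsibility in this model.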

**Stream Processing:** Stream processing, or data-flow programming (sometimes also referred to by its architectural style: pipes and filters), is a programming paradigm well known from Linux command-line shells (pipes) or graphical calculations

**Figure 2.2.:** Abstract overview of threads and stream processing

within GPUs. The basic concept can be explained using Figure 2.2b. In stream processing, there is a sequence of data (the stream) and a series of operations. The operations are applied in a specific order to the stream to obtain the desired result set [Bea15]. Each operation is independent of the others and only needs specific input data. Thus, it is possible to run the operations in parallel and even to have multiple instances of each operation. Typically, each operation instance runs in its own thread to achieve parallel execution.

While stream processing traditionally used kernel functions as operations and was optimised for particular processors (e.g., GPUs), the concept is widely adopted nowadays, used in common programming languages (e.g., Java Streams), and runs on general-purpose CPUs [GR04].
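The following self-contained sketch (our own illustration) uses the Java Streams API mentioned above: a sequence of data flows through a series of independent operations, and a single `parallel()` call lets the runtime execute the pipeline on multiple threads:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamProcessing {

    public static void main(String[] args) {
        // The stream: a sequence of data items.
        List<Integer> squaresOfEvens = IntStream.rangeClosed(1, 1_000)
                .parallel()                    // run the operations below on multiple threads
                .filter(n -> n % 2 == 0)       // operation 1: keep even numbers
                .map(n -> n * n)               // operation 2: square each remaining item
                .boxed()
                .collect(Collectors.toList()); // collect the processed stream

        System.out.println("processed " + squaresOfEvens.size() + " items");
    }
}
```

Each operation only depends on its input, so the runtime is free to process different chunks of the data on different worker threads.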

**OpenMP:** The OpenMP Application Programming Interface (API) is a precompiler-based API that was designed by a group of software and hardware manufacturers. Both interest groups agreed on specifications to create a uniform standard for programming parallel computers with a shared address space. The three main components of OpenMP are compiler directives, runtime libraries, and environment variables. Implementations are available for almost all common programming languages, which makes OpenMP a popular API for developers.

The parallel programming model of OpenMP is based on parallel threads which have a shared and a private address space. Every program starts with a single master thread. Based on the fork-join execution model, it creates a so-called thread team. The compiler directive triggers the creation of the team at the beginning of the program section that is about to be parallelised. All threads of the team execute this section in parallel. The exchange of information between the threads takes place via shared variables. These variables are kept in the address space shared by all threads concerned. Private variables, however, are stored on each thread's stack and are therefore only held for the duration of the execution of the parallel section. The shared and private variables are specified in the compiler directives. When parallel processing is completed, the created threads terminate, leaving only the master thread.

OpenMP provides various mechanisms for coordinating the threads. It is possible to implement critical sections in which only one thread may work at a time. To synchronise the threads of a team, OpenMP uses the barrier directive to wait for all threads and to synchronise the workflow. The barrier directive causes all threads reaching it to pause until all threads of the team have reached it. The programming model also provides locking mechanisms in the form of simple and nestable lock variables. Their use and further concepts for thread coordination are described in detail in [RR12] (pp. 369–373).
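The barrier concept can be illustrated independently of OpenMP; the following minimal Java sketch (a rough analogue using `java.util.concurrent.CyclicBarrier`, not OpenMP itself) shows a team of threads pausing at a barrier until all members have arrived:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {

    public static void main(String[] args) {
        int teamSize = 4;
        // The optional action runs once, after all team members reached the barrier.
        CyclicBarrier barrier = new CyclicBarrier(teamSize,
                () -> System.out.println("-- all threads reached the barrier --"));

        for (int i = 0; i < teamSize; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("thread " + id + ": phase 1 done");
                    barrier.await(); // pause until the whole team arrives, like OpenMP's barrier directive
                    System.out.println("thread " + id + ": phase 2 starts");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```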

One of the most important aspects of the underlying programming model is the possibility of establishing parallelism on the loop level. Within a parallel section, loops can be parallelised using the for-directive. For this purpose, the loop iterations, and thus the computing work, are distributed to the threads of the team. This distribution can be done in different ways, e.g., by assigning a fixed number of iterations to each thread in the team. Another variant is to assign iteration blocks dynamically: whenever the processing of a block is completed, a new one is assigned. To use OpenMP parallel loops, the loop must fulfil certain conditions. One is that the total number of iterations must be known before entering the loop. Furthermore, the individual calculations of the iterations must be independent of each other and must not change the running index of the loop (cf. [HL08; RR07]).

Due to its relatively simple use, OpenMP is frequently employed to speed up and parallelise legacy software by merely annotating for-loops so that they run in parallel.

### **2.1.3.2. Message-Based Approaches**

Message-based parallelism is characterised by the explicit intercommunication of a set of concurrent tasks. These tasks may reside on the same physical device or across an arbitrary number of devices. To exchange data with each other, the tasks communicate by sending and receiving messages. This data exchange usually requires the cooperation of each process [GHK+13]. Even though message-based parallelism can be used on a single machine, message passing is often associated with distributed memory models and distributed computing.

In the following, we briefly explain two common frameworks for message-based parallelisation: MPI and Actors.

**MPI:** The Message Passing Interface (MPI) is a specification for developing parallel programs that communicate with each other by exchanging messages [BVS13]. It is a standard interface for message-passing calls and is powerful, flexible, and usable [SAB18]. One property of MPI is that it is very explicit, meaning that the programmer can control many details of the data flow [Eij17]. Interface specifications have been defined and implemented for C/C++ and Fortran [BVS13]. Nowadays, MPI has become the standard for developing message-passing applications [BVS13].

**Actors:** The Actors Model (Actors) is an abstract model for parallel processing. It was first presented in [HBS73], which introduced the basic concept of actors. There are numerous programming languages and partly identically named implementations which use the axioms of the actors model to implement parallelism but differ in detail. In the following, we therefore only deal with the core axioms of the actors model:

Actors are considered to be basic, abstract units which comprise processing, memory, and communication. Actors follow the principles of object-oriented design. Accordingly, actors can be considered as objects and are encapsulated from each other. The encapsulation also means that no two actors share the same memory. Thus, the exchange of information between actors must take place via explicit communication. Explicit communication happens through the asynchronous exchange of messages (many implementations also offer a synchronous option). An actor can react to a message with only three actions:

- send a finite number of messages to other actors,
- create a finite number of new actors, and
- designate the behaviour to be used for the next message it receives.
Actors have a message queue in which incoming messages are held (see Figure 2.3), since they can only be processed one after the other. Messages are taken from the queue and processed according to the first-in-first-out principle. Also, the concept of state machines is supported: the state of the actor after processing a message determines the behaviour for processing the next one [Ver15]. Due to encapsulation and independence, actors can be executed in parallel. However, each actor itself operates like a sequential application. A manual implementation of locks and mutexes is not necessary, because each actor has its own memory space [Cli81].

When it comes to determining potential actors in an application, Storti made the following statement: "Everything is an actor" [Sto15]. In practice, however, this leads to too much complexity and performance losses for a fine-granular actor system. Therefore, one tends to represent each functional task by an actor [Ver15].
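The following deliberately small Java sketch (our own illustration; production frameworks such as Akka are far richer) captures the core axioms: an actor with private state, a FIFO mailbox, asynchronous sends, and strictly sequential message processing:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal actor: private state, a FIFO mailbox, sequential message processing.
public class MiniActor implements Runnable {

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private int processedMessages = 0; // private state, never shared with other actors

    public void send(String message) {
        mailbox.add(message); // asynchronous: the sender does not wait
    }

    @Override
    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // messages are handled one after another (FIFO)
                if (message.equals("stop")) {
                    break;
                }
                processedMessages++;
                System.out.println("actor processed '" + message + "' (#" + processedMessages + ")");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MiniActor actor = new MiniActor();
        Thread actorThread = new Thread(actor);
        actorThread.start();

        actor.send("hello");
        actor.send("world");
        actor.send("stop");
        actorThread.join();
    }
}
```

Because the actor's state is touched only by its own thread, no locks are needed, which is exactly the property the actors model promises.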

### **2.1.4. Thread-Based vs. Message-Based**

Thread-based parallelism is characterised by the shared memory model. Each thread has its local memory, but also shares a global set of variables. Communication between the threads is achieved by updates of and accesses to memory in the same address space [GHK+13]. Thread-based approaches can be faster than message-based approaches because of the more convenient access to the shared memory address space. However, this shared access can lead to problems, such as race conditions. Message-based approaches scale better than thread-based approaches because of the distributed memory model, which enables the simple addition of new parallel tasks. Also, since each task has its isolated memory, race conditions are a much

**Figure 2.3.:** Example of an Actor System (cf. [Doy14])

smaller threat. A disadvantage of message passing is the necessity of implementing an interface that is responsible for the data transfer between the tasks [Pie16].

### **2.2. Single- and Multicore Architectures**

In practice, a wide variety of multicore CPU architectures exists. The variance ranges from very specialised architectures (like GPUs or embedded control units), over network clusters, symmetric multiprocessors, and massively parallel supercomputer CPUs, to off-the-shelf CPUs [MSM04].

In the following, we will give an overview of the most common CPU architectures. A basic understanding of the hardware will later help to understand performance characteristics and performance issues of parallel applications.

### **2.2.1. Architectural Design**

While there are multiple taxonomies to categorise CPU architectures, by far the most common one is the taxonomy introduced by Flynn [Fly72], which we will follow in this section. Flynn categorises all CPU architectures by the

**Figure 2.4.:** Example of SISD (cf. [MSM04])

number of instruction streams and data streams. Thereby, a stream is a sequence of instructions or data that a CPU processes. Flynn distinguishes between four different types: SISD, SIMD, MISD, and MIMD [Fly72; MSM04].

- **SISD (Single Instruction, Single Data):** a single processing unit executes one instruction stream on one data stream—the classic sequential computer (see Figure 2.4).
- **SIMD (Single Instruction, Multiple Data):** one instruction stream is applied to multiple data streams in lockstep, as in vector units and GPUs (see Figure 2.5).
- **MISD (Multiple Instruction, Single Data):** multiple instruction streams operate on the same data stream; this class has hardly any practical representatives.
- **MIMD (Multiple Instruction, Multiple Data):** multiple instruction streams operate on multiple data streams; today's multicore CPUs fall into this class (see Figure 2.6).
Considering only Flynn's taxonomy is a good start. However, it is not sufficient for understanding multicore architectures as a whole. In particular,

**Figure 2.5.:** Example of SIMD (cf. [MSM04])

**Figure 2.6.:** Example of MIMD (cf. [MSM04])

the memory hierarchies and the CPU core interactions are not detailed enough. Thus, Mattson et al. [MSM04] specified additional subcategories of MIMD: Symmetric Multiprocessor (SMP) and Non-Uniform Memory Access (NUMA) architectures.

**Figure 2.7.:** Exemplification of SMP (cf. [MSM04])

**SMP:** Figure 2.7 shows the composition of an SMP system. It is a subclass of shared memory systems. Each CPU accesses the same memory, while only

**Figure 2.8.:** Exemplification of NUMA (cf. [MSM04])

one memory exists in the architecture. Furthermore, all CPUs share the same connection (the memory bus) and can access the memory at the same speed. SMP architectures are the easiest for the programmer, because there is no need to consider the location of the data. In this kind of architecture, the memory bus often becomes a bottleneck, because the utilisation of the bus increases with an increasing number of cores. Therefore, this architecture does not scale well and only works for a limited number of CPUs.

**NUMA:** A more complex architecture is the NUMA architecture, which Figure 2.8 illustrates. As in SMP architectures, the memory is shared, and each processor can access all blocks in the memory. However, some blocks of memory may be more closely associated with some CPU cores than with others. Thus, cores can access data located in a closer memory faster, and therefore the access times for data located in different memories can differ significantly. To compensate for these effects, a hierarchical cache system is often used [KTJ06], together with a strategy to maintain cache coherence. Hence, these architectures are also called cache-coherent non-uniform memory access (ccNUMA) systems.

For the sake of completeness, we also have to mention the subcategories of distributed-memory architectures. In a distributed-memory architecture, each processing unit has its own memory and address space (see Figure 2.9). Communication with the other processors is done by message passing. Depending on the topology, the communication speed can range from as fast as shared memory to rather slow (e.g., communicating over an Ethernet

**Figure 2.9.:** Exemplification of a Distributed Memory Architecture (cf. [MSM04])

network) [MSM04]. Even though these kinds of systems attract high research interest, especially in the HPC domain, we focus in this thesis on general-purpose CPUs, since the business information applications we are interested in use this kind of hardware architecture.

### **2.2.2. Common CPU Architecture Example**

To foster understanding, we briefly describe in this section the architecture of a common general-purpose CPU with a hierarchical memory (like an Intel i7), using Figure 2.10. In Figure 2.10, multiple processors are depicted. Each processor contains multiple cores. Common desktop processors currently have 2 to 32 cores per processor (e.g., the AMD Ryzen Threadripper 3970X<sup>1</sup>).

Each core contains a Central Processing Unit (CPU) and two types of Level 1 Cache (L1)—one for instructions (L1 instruction cache) and one for data (L1 data cache). The L1 cache is directly accessible by the CPU and guarantees fast access to data in the case of a cache hit. Further, each core has its own Level 2 Cache (L2), which is, in comparison to the L1 cache, slightly larger, but whose access times are slower.

Depending on the system's architecture and the mainboard used, multiple processors can be installed. The memory bus connects the individual processors with the Last Level Cache (L3) and the main memory. If there is too much communication between processors, or between processors and main memory, the bus can become a bottleneck—similar to network links.

<sup>1</sup>https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-3970x

**Figure 2.10.:** Example of a Common Hierarchical Multicore Processor (cf. [Sch08])

Mainboards often support prioritised access from one processor to a specific segment of the main memory (a RAM module), which improves the access rates for data in that segment.

Besides architectures with hierarchical memories, there are also architectures with a pipeline or array design [RR07]. However, since they are not common, we skip explaining them at this point.
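To give a rough feeling for why this memory hierarchy matters for performance, the following Java micro-benchmark (a sketch that assumes a typical 64-byte cache line and ignores JIT warm-up and other benchmarking pitfalls, so the numbers are only indicative) touches the same number of array elements with different strides; cache-friendly sequential access is typically several times faster than large-stride access that defeats the caches:

```java
public class CacheEffects {

    // Touch every element of the array exactly once, but in stride order.
    static long touch(int[] data, int stride) {
        long sum = 0;
        for (int offset = 0; offset < stride; offset++) {
            for (int i = offset; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024]; // 64 MiB, larger than typical caches

        // stride 1: sequential, ~16 ints per 64-byte cache line are reused.
        // stride 16: every access starts a new cache line.
        // stride 4096: large jumps, poor spatial locality throughout.
        for (int stride : new int[]{1, 16, 4096}) {
            long start = System.nanoTime();
            long sum = touch(data, stride);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("stride %5d: %5d ms (sum=%d)%n", stride, ms, sum);
        }
    }
}
```

The total number of memory accesses is identical in all three runs; only the locality of the access pattern, and thus the behaviour of the cache hierarchy, differs.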

### **2.3. Parallel Programming Patterns**

In the past years, not only specialised domains like HPC, but also standard application developers and researchers have had to face the need for efficient parallel software. However, developing such software is complex, challenging, and error-prone [MMG+09]. Therefore, a broad range of best practices and patterns has arisen to guide developers when realising parallel software.

In this section, we introduce fundamental parallelisation patterns. Understanding the patterns will help to comprehend the core concepts of parallel programming. Following this section will also help to elucidate Contribution 1 (see Chapter 6), in which we introduce a parallel pattern catalogue for common modelling languages, such as UML2.

First, we look at the definition of a pattern. Afterwards, we introduce different categories of parallel patterns and explain the main concepts behind them.

### **2.3.1. Patterns for Parallel Programming**

Mattson et al. define a pattern as follows:

"A (design) pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important. It can form the basis for a domain-specic vocabulary that can signicantly enhance communication between designers in the same area." [MSM04, p. 11]

Starting from this, defining the characteristics of a pattern is tricky and fuzzy, and in practice, a pattern description and its implementation can differ significantly. Further, the same pattern often goes by different names in different communities. Therefore, we performed a literature review in [SWD19] to categorise common parallel patterns and find synonyms. The main results of this review are shown in Figure 2.12, and a more detailed discussion is given in Section 6.5.1.

After collecting parallel patterns from the literature, we extracted the descriptions and grouped similar patterns together, naming each pattern by its most common name (e.g., fork & join). Further, we categorised the patterns by their level of abstraction into three groups: algorithmic, architectural, and design patterns (Figure 2.12 groups the latter two for reasons of simplification). For each pattern, Figure 2.12 lists synonyms or implementation variants. This list is far from complete and is intended only to provide a rough overview.

In the following, we describe each main pattern in more detail.

### **2.3.2. Parallel Architectural & Design Patterns**

"[An architectural] pattern provides a scheme for rening elements of a software system or the relationships between them. It describes a commonly recurring structure of interacting roles that solves a general design problem within a particular context." [BHS07, p. 392]

### **2.3.2.1. Master-Worker Pattern**

According to [Eij17], the master-worker pattern is one of the most well-known patterns in parallel programming and is supported by a broad set of programming languages. The basic idea behind the master-worker pattern is simple: one mammoth task is split into multiple subtasks that can run in parallel. It is essential that the subtasks are as isolated as possible in order to avoid interdependences.

The master is in charge of distributing the work to the workers, as well as coordinating them.

Since it is a design pattern, the master and the workers are often designed as individual components. While each worker component has a specific task, the master component takes over the role of a facade and a load balancer or task manager. The calling instances only call a function on the master component, which also provides the result to the calling instances.

On a lower abstraction level, this pattern behaves similarly to the fork/join pattern.
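A minimal Java sketch of the pattern (our own illustration with a hypothetical summation task) shows the master splitting the mammoth task, distributing the subtasks to a pool of workers, and collecting the partial results for the caller:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterWorker {

    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);

        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers); // the worker team

        // The master splits the mammoth task into isolated subtasks ...
        List<Future<Long>> results = new ArrayList<>();
        int chunk = data.length / workers;
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = (w == workers - 1) ? data.length : from + chunk;
            Callable<Long> subtask = () -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            };
            results.add(pool.submit(subtask)); // ... and distributes them to the workers.
        }

        // The master collects and combines the partial results for the caller.
        long total = 0;
        for (Future<Long> partial : results) {
            total += partial.get();
        }
        pool.shutdown();
        System.out.println("total = " + total); // 1000000
    }
}
```

The subtasks operate on disjoint index ranges and are therefore fully isolated, which is exactly the precondition the pattern description asks for.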

### **2.3.2.2. Message Passing**

We already discussed the basic idea of message passing in Section 2.1.3.2: each acting instance has its own memory and can only interact with other instances by sending messages. Often these messages are sent asynchronously, and each acting instance has a message queue to store messages until they can be processed [Erb12].

To implement the message passing pattern, languages that support these features are required. Options include using object-oriented programming languages to implement the pattern manually, frameworks like Akka Actors (for Scala)<sup>2</sup>, or specific languages like Erlang<sup>3</sup>.

Choosing a message-passing approach is a fundamental design decision and is therefore categorised here as an architectural pattern.

### **2.3.3. Algorithmic Patterns**

Algorithmic patterns are, in contrast to design and architectural patterns, on a much lower abstraction level and focus on a solution strategy for one concrete implementation problem. An algorithmic pattern, therefore, describes a solution strategy with one or multiple subroutines.

In the following three paragraphs, we describe three parallel algorithmic patterns. All of them are based on shared memory and follow a thread-based approach.

### **2.3.3.1. Parallel Loops & Sections**

Parallel loops are an efficient way to realise parallelism for programs that require many repetitions of the same calculation without dependencies between loop cycles [MSM04].

The parallel loops pattern is defined by its ease of achieving parallelism; it requires a set of independent data that can be split into smaller subsets. Each data subset is initially passed to an individual loop. Consider, e.g., a list of 800 entries where the same operation is performed for each entry. The parallel loop pattern would split the list into, e.g., four subsets—each subset containing 200 entries. Instead of one single loop iterating over 800 elements, we now have four loops iterating over 200 elements each. By separating each loop into an individual thread, parallelism can be achieved. To achieve the best results, splitting the data set into equal and independent parts is critical.
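The 800-entry example can be expressed directly in Java. In the following sketch (our own illustration), a parallel IntStream takes the role of the manually split loops and lets the runtime distribute the index subsets to worker threads:

```java
import java.util.stream.IntStream;

public class ParallelLoop {

    // The operation applied to every entry; independent across iterations.
    static double process(int entry) {
        return Math.sqrt(entry) * Math.sin(entry);
    }

    public static void main(String[] args) {
        int entries = 800;

        // Sequential loop: one thread iterates over all 800 entries.
        double sequentialSum = 0;
        for (int i = 0; i < entries; i++) {
            sequentialSum += process(i);
        }

        // Parallel loop: the runtime splits the index range into subsets
        // (conceptually, e.g., 4 x 200) and processes them on worker threads.
        double parallelSum = IntStream.range(0, entries)
                .parallel()
                .mapToDouble(ParallelLoop::process)
                .sum();

        // The results may differ in the last bits, since floating-point
        // addition is not associative and the summation order differs.
        System.out.printf("sequential=%f parallel=%f%n", sequentialSum, parallelSum);
    }
}
```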

<sup>2</sup>https://doc.akka.io/docs/akka/current/typed/index.html

<sup>3</sup>https://www.erlang.org/

**Figure 2.11.:** Example Stream Application (cf. [Mic17])

### **2.3.3.2. Streams or Pipes and Filters**

The pipes & filters pattern is rather common as well. The following description is a summary of the official Microsoft Azure documentation [Mic17].

The pipes & filters pattern can be used as a parallelism approach based on data streams. A stream consists of filters, which are processing steps, and pipes, which represent the connections between filters.

The pipes & filters pattern works by separating a set of data into streams and applying pipelines of pipes and filters in a predetermined order to these streams. Since each filter is independent of the others and only relies on its input stream, parallelisation can be achieved by executing different filters in parallel. Slow filters can have multiple instances to process the input stream faster. In the end, the processed data stream is collected. Figure 2.11 illustrates this approach.
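The following Java sketch (our own illustration; the filter logic is hypothetical) implements a tiny pipeline in this spirit: two filters run in their own threads, and blocking queues act as the pipes between them:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PipesAndFilters {

    static final String END = "END"; // poison pill marking the end of the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pipe1 = new LinkedBlockingQueue<>(); // source -> filter A
        BlockingQueue<String> pipe2 = new LinkedBlockingQueue<>(); // filter A -> filter B

        // Filter A: turns items to upper case; runs in its own thread.
        Thread filterA = new Thread(() -> {
            try {
                String item;
                while (!(item = pipe1.take()).equals(END)) {
                    pipe2.put(item.toUpperCase());
                }
                pipe2.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Filter B (sink): prints the processed items; also runs in parallel.
        Thread filterB = new Thread(() -> {
            try {
                String item;
                while (!(item = pipe2.take()).equals(END)) {
                    System.out.println("processed: " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        filterA.start();
        filterB.start();

        // Source: feeds the data stream into the first pipe.
        for (String item : new String[]{"alpha", "beta", "gamma"}) {
            pipe1.put(item);
        }
        pipe1.put(END);

        filterA.join();
        filterB.join();
    }
}
```

While one filter processes item n, the other can already work on item n+1, which is the pipeline parallelism the pattern aims for.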

### **2.3.3.3. Fork-Join**

The following content is based on information found in [Eij17]. The fork-join pattern is very similar to the master-worker pattern. Even though the abstraction level is much lower, the idea is the same: through the logical identification of subtasks, one mammoth task is split into subtasks which can be executed in parallel.

In the best case, the subtasks are independent. However, this is often not the case. Therefore, locks and synchronisation mechanisms are used to implement barriers, mutually exclusive data access, and waiting conditions.
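Java ships a direct realisation of this pattern in its ForkJoinPool. The following sketch (our own illustration with a hypothetical array-summation task) recursively forks subtasks until they are small enough and then joins their results:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum extends RecursiveTask<Long> {

    private static final int THRESHOLD = 10_000; // below this, compute sequentially
    private final long[] data;
    private final int from;
    private final int to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) / 2;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                     // fork: run the left half in parallel
        long rightSum = right.compute(); // compute the right half in this thread
        return rightSum + left.join();   // join: wait for the forked subtask
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 2L);
        long sum = ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
        System.out.println("sum = " + sum); // 2000000
    }
}
```

Here the subtasks are fully independent, so the join itself is the only synchronisation point; with shared mutable data, additional locks or barriers would be needed, as described above.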

**Figure 2.12.:** Categorisation of Parallel Patterns

### **2.4. Analyses and Prediction of Quality of Service Attributes**

The analysis and prediction of QoS attributes (e.g., response time) is a major part of Software Performance Engineering (SPE). C. Smith and L. Williams define SPE as follows: "SPE is a model-based approach that uses deliberately simple models of software processing with the goal of using the simplest possible model that identifies problems with the system architecture, design, or implementation plans. These models are easily constructed and solved to provide feedback on whether the proposed software is likely to meet performance goals." [SW03]

In this section, we first introduce an approach using CPU simulators to estimate QoS attributes. Second, we focus on model-based QoS predictions at the architectural level (e.g., the Palladio approach).

### **2.4.1. CPU Simulators**

CPU simulators are often used by hardware vendors to evaluate the quality attributes of new CPU architectures. However, they can also fulfil various other duties. The primary duty we are interested in is the estimation of the quality attributes of parallel software running on a target environment without deploying it. One of the biggest challenges is the consideration of

different types of CPU architectures. Even though common architectures are supported by now, and CPU simulators deliver reliable results, the simulation takes much time.

In the following, we describe the main characteristics of CPU simulators. CPU simulators are relevant for following the work on Contribution 4 described in Chapter 9. Parts of this section originated from a collaboration with a student, S. Graef [Gra18].

### **2.4.1.1. Foundations of CPU Simulators**

Hardware architects have researched CPU simulators for years. The simulators differ mainly in the type of input, the calculation method, and the application scope [AS16].

**Figure 2.13.:** Simulation of target components [Carl J. Mauer – Computer Sciences Department – University of Wisconsin]

Figure 2.13 shows the evaluation process when using simulators. The interesting part here is the target application, which runs on the target system. While the target application is known, the target system needs to be simulated (or emulated) on the host machine (i.e., the computer running the simulation) [AS16].

In the following sections, we describe the different dimensions according to [SK13a] in detail:

**Section 2.4.1.2:** Functional vs. Timing Simulators

**Section 2.4.1.3:** Cycle-Driven vs. Event-Driven Simulators

**Section 2.4.1.4:** Trace-Driven vs. Execution-Driven Simulators

**Section 2.4.1.5:** User-Level vs. Full-System Simulators

In Chapter 9, we perform a literature survey and use the above dimensions to classify the CPU simulators. Each simulator thus receives a trade-off (spider-web) diagram which briefly describes its characteristics.

### **2.4.1.2. Functional vs. Timing Simulators**

Functional simulators are used for functionality testing only [AS16]. Thus, they are not relevant in this thesis, because we do not research the correctness of applications, but their behaviour and performance.

In contrast to functional simulators, timing simulators focus on the exact timing behaviour. They can simulate the hardware and software under study to an extent that makes it possible to obtain performance counters for any point in time. Most timing simulators are also called cycle-level simulators [AS16], because they track every clock cycle. The cycle-level accuracy, however, comes at the cost of time: the simulation times of cycle-level simulators are up to 25 times longer than those of functional simulators, and they use more compute resources [AS16].

### **2.4.1.3. Cycle-Driven vs. Event-Driven Simulators**

Drilling further down into the group of timing simulators, we can distinguish between two additional subgroups: cycle-driven (cycle-accurate, or cycle-level) and event-driven simulators.

While cycle-driven approaches are relatively slow, event-driven simulators reduce the time consumption. One particular kind of event-driven simulator is the interval simulator [GEE10]. Interval simulators combine the feature sets of functional and timing simulators, but they do not simulate at cycle level; they simulate in intervals. The idea is that miss events such as branch mispredictions and cache misses divide the normal command flow through the pipeline into intervals. These intervals are then evaluated separately. This combination can reduce the simulation time [AS16].

### **2.4.1.4. Trace-Driven vs. Execution-Driven Simulators**

CPU simulators use different kinds of input. We distinguish between trace-driven and execution-driven simulators. Trace-driven simulators use a trace as input. The traces contain detailed, low-level information about the execution. One drawback is that trace files can grow very large. On the plus side, it is not necessary to emulate the Instruction Set Architecture (ISA) with this type of simulator [AS16].

In contrast, execution-driven simulators use an executable application as input. When it comes to accuracy, execution-driven simulators are very accurate, since they emulate the ISA and also take errors that occur into account (e.g., an incorrectly specified code path) [AS16]. Thus, this type of simulator is best suited to predict the behaviour of an application.

### **2.4.1.5. User-Level vs. Full-System Simulators**

User-level (or application-level) simulators do not consider operating system calls. In contrast, full-system simulators take the system calls into account. The predictive power for system-call-intensive applications is therefore better with full-system simulators. The disadvantage is that these simulators become heavyweight [AS16].

### **2.4.2. Model-Based Quality-of-Service Predictions at the Architectural Level**

In model-driven software development, models are used to develop the software on a high abstraction level, which abstracts from the software's complexity to ease understanding and analysability. As a result, models become a central artefact and are used, e.g., for code generation and automatic deployment.

In an early design phase, models are used to analyse and improve the software before it is realised. SPE is such an analysis method. SPE aims to predict the software's quality attributes, such as response time (performance), cost of operation (costs), and the range of achievable performance (scalability) [BDIS04]. Later, this approach was adopted in model-driven performance engineering, which allows software developers to design performance models in

a DSL [BDIS04; Hap08]. However, to derive performance metrics from such models, it is necessary to combine model-based hardware descriptions with software descriptions and environment/usage descriptions.

The main advantage of SPE is that software developers are able to evaluate the performance requirements of the system at an early stage. In this phase, decisions and design can easily be altered, because no realisation has to be adapted. Different design alternatives can thus be evaluated and compared, and trade-off decisions can be made in an informed and engineering-like manner, saving both time and money [WS03].

SPE also enables complex load tests. These tests can cover usage scenarios, e.g., for highly dynamic cloud systems with worldwide deployment and multiple millions of users. Running such tests on a real installation can be nearly impossible, due to the required load generation, the substantial expenses, and hardware that is not yet available. With SPE, such tests can be realised for dozens of different design variants at lower costs [BDIS04].

Currently, two approaches can be named as state-of-the-art approaches for model-based quality-of-service prediction and analysis: CloudSim<sup>4</sup> and Palladio<sup>5</sup> . While the former focuses especially on cloud applications and elasticity, the latter is a general-purpose approach that works for all kinds of component-based systems. For this reason, we will focus on the Palladio approach in the following.

### **2.4.2.1. The Palladio Approach**

Palladio is a model- and software component-based modelling approach that focuses on the prediction of quality attributes, and is therefore an example of a model-based analysis method on an architectural level [BKR09; RBH+16]. Palladio supports a variety of quality attributes, such as performance (i.e., response time) [BKR09], cost-efficiency [LE15], reliability [BKBR11], energy-efficiency [ÖGW+14], security [HFL16], and recently also scalability and elasticity [LB14]. Palladio uses its own DSL, which follows the example of UML. Therefore, it has a short adoption phase, and it is expected to have a high acceptance rate among software architects [BKR09].

<sup>4</sup>http://www.cloudbus.org/cloudsim/

<sup>5</sup>https://www.palladio-simulator.com/home/

To analyse an architectural design, the software architect has to specify a software and a hardware model, as well as the usage behaviour. Within the software model, the architect describes the behaviour, structure, and characteristics of the software. In the hardware model, the given hardware environment is described (for instance the HDDs, CPUs, and the system landscape). Finally, the usage behaviour describes the behaviour of the user: how often a function is called, how many users are active at the same time, etc.

In the remainder of this section, we will continue explaining the details of the PCM, describe standard solvers to analyse the architectural models, and introduce the AT extension, which will be used for contribution 1 (see Chapter 6).

### **2.4.2.2. PCM**

Figure 2.14 gives an overview of the PCM and its elements. The PCM contains multiple main aspects, which are explained as follows:

**Repository Diagram:** In the repository diagram, the software architect models the components and their types. Further, he defines the required and provided interfaces of components here. Each component has a type, which is defined by (a) the provided interfaces of the type, and (b) the required interfaces of the type. The syntax and semantics used in the diagram are similar to the UML2 Component Diagram [RQZ07].

Further, each component specifies a particular behaviour for each operation inherited from the provided interface. Within this behaviour specification, the software architect can model the behaviour of this operation, i.e., calling other operations or consuming resource demands, such as CPU or hard disk demands. In the PCM, the behaviour specification is called Service Effect Specification (SEFF). The SEFF is similar to a UML2 Activity Diagram; it can use, e.g., loops, branches, internal actions (to demand hardware resources like CPU cycles), and external actions (calls to other components via the required interfaces of the component).

**System Diagram:** In the system diagram, the components from the repository diagram are instantiated. The instances of components are called assembly contexts. Further, the system in the system diagram provides its interfaces, which represent the external access points called by a user. These interfaces are forwarded to a provided interface of an assembly context. Also, a system can require interfaces, i.e., if an assembly requires external services (cf. [Leh18]).

**Figure 2.14.:** Overview of the PCM (cf. [Leh18])


Each of the above models represents a part of a complete PCM model. The whole model can serve as input for different solvers, described in the following.

### **2.4.2.3. Solver**

To analyse a PCM model, a set of analytic or simulative solvers can be used (as shown in Figure 2.14). The result is a behaviour analysis of the complete system. This behaviour can be further analysed to identify limitations of the system, such as bottlenecks or SLO violations. Afterwards, the model can be altered, and the consequences of the changes can be analysed. The analysis allows the software architect to evaluate different versions of a system before the first line of code is written.

Palladio offers a set of solvers, which we briefly characterise in the following. We will give more detailed information about the solvers needed for this thesis in the next section.

**SimuCom:** SimuCom is a simulation-based solver for the PCM. Its engine works based on a model-to-text (m2t) transformation, and, during the simulation, SimuCom can take measurements for a set of default metrics (e.g., response time).


While analytical solvers are a lot faster in analysing the input model, they provide only information about mean values. Simulation-based solvers, in turn, offer more flexibility and freedom to the software architect, but can result in long simulation times, even for smaller systems.

To follow this thesis, it is necessary to have a more detailed understanding of SimuCom (for Chapter 9), SimuLizar (for Chapter 8), and ProtoCom (for Chapters 7 and 9), which we give in the following.

**Figure 2.15.:** Overview of the SimuCom Solver [Bec08]

**Figure 2.16.:** Detailed View of SimuCom [Bec08]

### **2.4.2.4. SimuCom**

Figure 2.15 shows the basic approach of the SimuCom solver. First, SimuCom takes a full PCM instance as input. Afterwards, it uses model-to-text transformations to generate the simulation code, which is then executed by the SimuCom Platform [Bec08]. The SimuCom framework uses Discrete-Event-Simulation Modelling in Java (DESMO-J)<sup>6</sup> .

To get a better understanding of the m2t transformation, Figure 2.16 gives a more detailed view.

<sup>6</sup> http://desmoj.sourceforge.net/home.html

The whole SimuCom simulation approach is based on the simulation of resources (see Figure 2.16). Each resource is handled and simulated as a G/G/1 queue. A simulated workload component generates the load for the queues, and for each user, a thread is spawned that traverses the simulated system [Bec08]. When passing through the SEFF simulation, the resource demands, given in the form of stochastic expressions, are evaluated to determine the concrete demands. In general, there are two types of resources: the CommunicationLinkResource and the ProcessingResources. The latter are subdivided again into active resources (e.g., CPU or HDD) and passive resources (e.g., thread pools).
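To convey the core idea in a few lines, the following self-contained sketch simulates a single active resource as a FCFS single-server queue; it is a deliberately simplified illustration of the G/G/1 principle, not SimuCom's actual DESMO-J-based implementation, and the arrival times and demands are assumed example values.

```java
import java.util.List;

/** Minimal illustration of an active resource simulated as a single-server FCFS queue. */
public class Gg1Sketch {
    record Job(double arrival, double demand) {}

    public static void main(String[] args) {
        // Assumed example workload: (arrival time, CPU demand), both in seconds.
        List<Job> jobs = List.of(new Job(0.0, 1.2), new Job(0.5, 0.8), new Job(0.6, 2.0));

        double resourceFreeAt = 0.0;
        for (Job j : jobs) {
            double start  = Math.max(j.arrival, resourceFreeAt); // wait while resource busy
            double finish = start + j.demand;                    // serve the full demand
            resourceFreeAt = finish;
            System.out.printf("arrival=%.1f wait=%.1f response=%.1f%n",
                    j.arrival, start - j.arrival, finish - j.arrival);
        }
    }
}
```

In a real simulation run, the arrival times and demands would be drawn from the evaluated stochastic expressions rather than fixed values.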

### **2.4.2.5. SimuLizar**

SimuLizar is the next-generation simulator and replaces the SimuCom simulator [BBM13; Bec17]. Like SimuCom, SimuLizar is based on the SimuCom core framework. In addition to SimuCom's capabilities, SimuLizar supports the analysis of self-adaptive systems, e.g., systems that scale dynamically depending on environmental factors, such as workload changes or service-level objective violations. Further, SimuLizar gives more freedom when specifying the monitoring points. In contrast to SimuCom, the SimuLizar simulator does not generate simulation code; it follows an interpreter-based approach instead. Meyer [Mey11] argues that a generator-based approach is faster for non-adaptive systems. For adaptive systems, however, the generative approach is unsuited, because the generated code must be modified each time an adaptation occurs. In interpreter-based approaches, the simulator traverses the PCM instance and interprets the model elements it encounters. For the simulation logic, SimuLizar uses the core SimuCom framework. The simulation and interpretation process of SimuLizar consists of two steps:

1. In the first step—the SimulizarRuntimeState—the setup and configuration take place. Thereby, ModelObservers are run on the different model instances; e.g., the ResourceEnvironmentSyncer is called, which creates a SimulatedResourceContainer or SimulatedLinkResourceContainer for each ResourceContainer or NetworkLink that is modelled inside the resourceEnvironment model. Next, the containers are stored in the resource registry of the SimuComModel.

2. In the second step, the simulation run, the PCM model interpreter traverses each user request and navigates through the various Palladio models. Thereby, the interpreter calls the correct interpretation for each model element. For example, first the usage scenario model is interpreted; then all system calls in the usage scenario are identified and interpreted. That way, the interpreter traverses the models until it reaches the resource demands. Additionally, SimuLizar can consider self-adaptive behaviour [Bec17].
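The interpreter idea can be sketched with a toy behaviour model: a dispatch on the concrete element type that accumulates the simulated resource demand. The model classes below are hypothetical stand-ins, not actual PCM or SimuLizar types.

```java
import java.util.List;

/** Toy interpreter traversing a behaviour model and accumulating CPU demand. */
public class InterpreterSketch {
    sealed interface Action permits InternalAction, LoopAction {}
    record InternalAction(double cpuDemand) implements Action {}
    record LoopAction(int iterations, List<Action> body) implements Action {}

    // Dispatch on the concrete model element, as an interpreter-based simulator does.
    static double interpret(Action a) {
        return switch (a) {
            case InternalAction ia -> ia.cpuDemand();
            case LoopAction la -> la.iterations()
                    * la.body().stream().mapToDouble(InterpreterSketch::interpret).sum();
        };
    }

    public static void main(String[] args) {
        Action seff = new LoopAction(10, List.of(new InternalAction(0.05)));
        System.out.println("total CPU demand: " + interpret(seff)); // prints 0.5
    }
}
```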

### **2.4.2.6. ProtoCom**

Like the above two solvers, ProtoCom is also a Palladio analyser. A common method of design evaluation is performance prototyping. For this purpose, ProtoCom offers a method for generating runnable Java applications from PCM instances. Thereby, it uses model-to-code (m2c) transformations [KL14]. These applications can be run in realistic environments, and the software developer can check the monitoring data against the SLOs.

**ProtoCom Transformation** The process of the m2c transformation is shown in Figure 2.17. The input of the transformation is a PCM instance. The transformation generates a runnable performance prototype. The prototype consists of the generated code and the ProtoCom framework.

During the m2c transformation, ProtoCom traverses the PCM instance and transforms the processing resource demands into synthetic resource demands (e.g., calculating Fibonacci numbers)<sup>7</sup> .

To match the specified resource demands in the model, ProtoCom needs to run a calibration on the target platform. The calibration step is required only once. Afterwards, the target platform is no longer needed. Moreover, it is possible to run all experiments on a host machine.

<sup>7</sup>A full description of all available demands is given in Chapter 5

**Figure 2.17.:** Overview of the ProtoCom M2C Transformation [KL14]

**Java SE RMI Prediction Prototype** ProtoCom can provide various target applications (like Java SE or Java EE). In the following, we have a closer look at the Java SE RMI prediction prototype. This will become most relevant in Chapter 9.

Figure 2.18 shows the architectural view of a JavaSE performance prototype. As one can see, the prototype consists of two parts: first, the prototype (above the dotted line) and second, the ProtoCom framework (below the dotted line). The latter is the same for each prototype and contains the ProtoCom logic. The former varies and reflects the PCM input instance directly.

Especially interesting for us is the AbstractResourceEnvironment. This component contains all the different resource demands. By default, ProtoCom uses a Fibonacci demand to represent the load on the CPU (for CPU-intensive load). However, other demands, such as the sorting-array demand (for I/O-intensive tasks), are available.

**Resource Demand Mapping** Following the work of Becker [Bec08], there are two ways to map platform-independent resource demands to hardware-dependent ones.

The first approach involves the introduction of a constant scaling factor. This requires knowledge of the hardware's capabilities. For example, one work unit could correspond to the calculation of 100,000 Fibonacci numbers. However, obtaining this factor is difficult, and the accuracy of this approach is highly questionable.

**Figure 2.18.:** Architectural View of JavaSE Performance Prototype [KL14]

Therefore, the second approach is based on an automated detection of this factor. Thereby, a benchmark is run on the target machine to determine the factor. The output of this benchmark is a calibration table. This table includes two columns: the first column shows the time, and the second the input parameter for the Fibonacci function (e.g., how many numbers should be calculated).
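The following sketch illustrates this calibration idea under simplifying assumptions: it times several Fibonacci iteration counts on the host, stores the (time, input) pairs in a table, and looks up the iteration count for a requested demand. It is our own conceptual illustration, not ProtoCom's actual calibration code.

```java
import java.util.TreeMap;

/** Illustrative calibration: map measured run times to Fibonacci iteration counts. */
public class CalibrationSketch {
    static long sink; // consumes results so the benchmark loop is not optimised away

    static long fib(long n) { // iterative busy-work load generator
        long a = 0, b = 1;
        for (long i = 0; i < n; i++) { long t = a + b; a = b; b = t; }
        return a;
    }

    public static void main(String[] args) {
        TreeMap<Double, Long> table = new TreeMap<>(); // time in ms -> iteration count
        for (long n = 100_000; n <= 100_000_000L; n *= 10) {
            long start = System.nanoTime();
            sink = fib(n);
            table.put((System.nanoTime() - start) / 1e6, n);
        }
        double requestedMs = 5.0; // demand taken from the model
        var hit = table.ceilingEntry(requestedMs); // closest calibrated point above
        long iterations = (hit != null) ? hit.getValue() : table.lastEntry().getValue();
        System.out.println("to burn ~" + requestedMs + " ms, run fib(" + iterations + ")");
    }
}
```

A real calibration would repeat the measurements and interpolate between table entries instead of picking the next larger one.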

We will explain the resource demands and the approach behind them in more detail in Chapter 5. For more information on the ProtoCom approach, we refer to [Bec08].

### **2.4.2.7. Architectural Templates**

S. Lehrig proposed the AT approach in [Leh18] to enable software architects to easily reuse architectural knowledge in the form of reusable ATs in the context of architectural analysis. Lehrig included a proof of concept of the AT method in Palladio. We will use the AT method in Chapter 6 to build a parallel architectural catalogue. Therefore, we will explain the details of the AT method in the course of this section. To follow Chapter 6, it is necessary to understand the basics explained in the following.

**Figure 2.19.:** Process of the AT Application Process (cf. [LHB18])

### **The Architectural Template Method**

"The AT method is a software engineering method with which software architects can reuse architectural knowledge from pre-specied templates—ATs—for architectural modelling and architectural analyses. AT engineers specify the AT, that is, implemented, quality-assured, and provided within catalogues. In applying ATs from such catalogues, software architects become more eective and ecient in their architectural analysis tasks." [LHB18]

The AT method distinguishes between two views: (a) the view of the software architect who wants to use ATs, and (b) the view of the AT engineer who creates the ATs. In the following, we will explain the use of ATs and their creation in detail, as proposed by Lehrig [Leh18; LHB18].

**Usage of an Architectural Template** To use an AT, the software architect first needs to model the software architecture of the desired system. During this process, the software architect can choose and apply suitable ATs from the provided AT catalogue. The catalogue provides different QoS-specific templates. For example, a load balancer template can improve response time. To use the AT, the software architect binds the corresponding roles to the software architecture and configures the parameters of the AT. The AT tool prevents misconfigurations or violations of AT constraints (e.g., illegal connections between model elements). Before the model is analysed by the solver, the AT engine performs a m2m transformation and weaves AT completions into the architectural model (e.g., a load balancer [Leh18]).

**Creation of an Architectural Template** The AT engineer creates ATs and provides them via an AT catalogue to the software architect. An AT catalogue contains ATs for a specific topic, like architectural styles or parallel architectural patterns.

The first step in the creation of a new AT is the identification of the need, the corresponding QoS properties (e.g., response time), the metrics that need to be measured, and a suitable analysis approach (e.g., Palladio). In the next step, the AT engineer needs to gather and extract the reusable architectural knowledge and formalise it within an AT. Thereby, the AT engineer needs to specify roles, completions, and constraints, and bind the former to architectural elements. In the last step, AT engineers ensure the quality and correctness of the AT, e.g., by testing.

### **2.5. Hierarchical Queueing Petri Nets**

Especially for the first contribution of this thesis (see Chapter 6), we use Hierarchical Queueing Petri Nets (HQPNs) to formally describe the dynamic behaviour of the parallel language elements in the PCM. Therefore, we briefly describe the foundations of HQPNs here.

HQPNs build on several extensions of conventional Petri Nets (PNs). These extensions include the Coloured Petri Net (CPN), the Generalised Stochastic Petri Net (GSPN), the Coloured Generalised Stochastic Petri Net (CGSPN), and the QPN. In the following, we assume the reader is familiar with PNs. Therefore, we only give a brief introduction to PNs and HQPNs. Thereby, we follow the definitions given by [BK02] for PNs and by [Jen13] for the various extensions. A more detailed overview is provided in [Koz08].

### **2.5.1. Petri Nets**

An ordinary PN is a 5-tuple PN = (P, T, I<sup>−</sup>, I<sup>+</sup>, M<sub>0</sub>), where:

- P is a finite and non-empty set of places,
- T is a finite and non-empty set of transitions, with P ∩ T = ∅,
- I<sup>−</sup>, I<sup>+</sup> : P × T → ℕ₀ are the backward and forward incidence functions,
- M<sub>0</sub> : P → ℕ₀ is the initial marking.
PNs cannot differentiate between token types. A CPN allows the user to bind a type (colour) to each token. Each place is restricted to a set of colours. Furthermore, the transitions of CPNs can fire in different modes, based on the colour of the token.

In addition, using Stochastic Petri Nets (SPNs), we can include temporal aspects. An SPN assigns an exponentially distributed firing delay to each transition. This delay defines the time a transition waits after being enabled until it fires [Koz08].
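Concretely, if a transition t is assigned the firing rate λ<sub>t</sub> (a standard formulation; the rate symbol is our notation), its firing delay X<sub>t</sub> follows the exponential distribution:

```latex
P(X_t \le x) = 1 - e^{-\lambda_t x}, \qquad x \ge 0,
\qquad \mathbb{E}[X_t] = \frac{1}{\lambda_t}
```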

### **2.5.2. Queuing Petri Nets**

Bause et al. [BK02] introduced QPNs. QPNs are based on CGSPNs and integrate the concept of queues into places. QPNs thus make it possible to express queueing behaviour, as known from SPE, within PNs. In a QPN, there is a queueing place, where tokens are queued and served, and a depository for tokens that have completed their service.

Models in QPNs can become quite large. To mitigate the size problem of monolithic QPNs, it is convenient to divide them into smaller interacting subnets. For this purpose, HQPNs are used. They consist of several QPN subnets and additionally contain subnet places. Each subnet has a dedicated input and output place, as well as another place counting the active population of the subnet, which is the number of tokens fired into the subnet that have not yet left it.

According to Bause et al. [BK02], a Hierarchical Queueing PN is a 4-tuple whose components include a set J of subnets, such that:

- a) each j ∈ J is a non-hierarchical QPN (P<sub>j</sub>, T<sub>j</sub>, C<sub>j</sub>, I<sub>j</sub><sup>−</sup>, I<sub>j</sub><sup>+</sup>, M<sub>0,j</sub>, Q<sub>j</sub>, W<sub>j</sub>), and
- b) the sets of net elements are pairwise disjoint: ∀ j₁, j₂ ∈ J : [ j₁ ≠ j₂ ⇒ (P<sub>j₁</sub> ∪ T<sub>j₁</sub>) ∩ (P<sub>j₂</sub> ∪ T<sub>j₂</sub>) = ∅ ]


### **2.6. Summary of Foundations**

In this chapter, we presented the foundations needed to follow the course of the thesis. Since not all foundations are necessary to understand every contribution, Figure 2.20 provides an overview of the contributions and of the sections of the foundations required to follow them.

**Figure 2.20.:** Mapping of Foundations to Contributions

In the next chapter, we will continue by outlining the research design.

### **3. Research Design**

This section introduces the research design followed in this thesis. It clarifies the overall research goal, the research questions to be answered in the course of this thesis, and the process followed to answer the questions.

In model-based QoS prediction, the highest goal is accuracy: the predictions of the desired quality attribute should be as close as possible to the behaviour of the real system.

Since the focus of this thesis is performance prediction, we will look only at the quality attribute performance. The current state-of-the-art model-based performance prediction approaches focus only on a single metric—CPU speed—when specifying the characteristics of processor architectures. Single-metric models might be fine for most single-core architectures. However, recent experiments have shown that current performance models produce insufficiently accurate predictions when analysing parallel applications in multicore environments [FH16; FSH17]. Therefore, we formulate the following hypotheses, on which this thesis is based:

### Hypothesis 1 (H0.1):

There exist additional CPU architecture and memory hierarchy related performance-influencing factors—besides CPU speed—which have an impact on the performance of parallel applications.

### Hypothesis 2 (H0.2):

When considering the additional performance-influencing factors in performance prediction models in an abstract form, in architectural models, and during design time, we can improve the accuracy of the model-based performance predictions for parallel applications.

Validating these hypotheses gives rise to a set of emerging research questions: How can parallel applications be modelled; how far off are current predictions; what are the relevant performance-influencing factors; do other approaches exist to predict the performance of parallel applications; etc.

Since model-based performance prediction is often used during the early design phase, it is important that predictions based on abstract software architectures are reliable—also for parallel applications—to ensure a high level of quality and to foster the use of engineering-like approaches. Therefore, the overall goal is defined as follows:

### Research Goal (G):

Improving the accuracy, usability, and applicability of model-based QoS predictions of the performance of parallel applications in multicore environments.

### **3.1. Research Method**

To achieve its goal, this thesis follows the design science approach in combination with the method for experiment-based performance model derivation proposed by Jens Happe [Hap08]. According to this method, the performance model is extended in steps. First, a minimum set of additional attributes is identified in a goal-oriented manner. Second, the additional attributes are added to the performance model. Third, the performance model is evaluated and checked to see whether it meets the requirements. If so, the model derivation terminates. If not, further performance attributes are identified, and steps two and three are repeated. The evaluation—checking whether the requirements are met—is based on an experimental validation. One chooses a concrete scenario, sets up an experimental environment, and uses the experiment's results to evaluate the altered performance model [Hap08].

In contrast to behavioural science, whose goal is truth, the outcome of this thesis is one or multiple useful artefacts. Therefore, the design science approach, whose goal is utility (cf. [HC10]), is most suitable and is applied in the course of this thesis. Figure 3.1 shows the design science approach we have chosen. The core artefact, in the middle of the figure, is the improved performance prediction prototype, including improved prediction models, enhanced tooling, and adjusted processes. We evaluate this artefact using the experiment-based performance model derivation method mentioned above. The environment provides the requirements for the artefact, particularly the requirements of software architects and performance engineers, who have a real-world need for accurate parallel performance predictions, and therefore also for the evaluation. The environment also defines the use case scenarios and provides further insights from expert interviews.

**Figure 3.1.:** Applied Design Science Framework to Achieve the Research Goal G ([HC10])

The insights gained during the evaluation of the artefacts can not only be used to improve the artefacts further, but can also add to the knowledge base. Vice versa, the artefact builds on and is improved by the current state-of-the-art techniques, methods, and knowledge. As a last step, the environment is used to conduct a field test and confirm the quality of the artefact in productive or semi-productive environments.

To find the relevant metrics for the evaluation, to break down the overall goal G, and to reveal additional contributions, the formulation of research questions helps. In the following, the research questions for this thesis are introduced and explained based on the thesis process overview given in Figure 3.2.

### **3.2. Research Questions**

Given the research method described above, a concrete process can be extracted. In this section, we will describe the research process (see Figure 3.2) step by step. While doing so, we will break down the overall hypotheses H0.1 and H0.2 and the goal G into smaller and easier-to-evaluate research questions and assign them to the process steps. In so doing, we identify four main questions, which we break down into subquestions. For each question, we will give a detailed explanation as well as introduce the hypothesis on which we have based the research question.

The first step shown in Figure 3.2 is to verify or falsify the base hypotheses H0.1 and H0.2 and to identify the research need. We verified the hypotheses in [FH16; FSH17], where we performed a scenario evaluation of the capabilities of a state-of-the-art performance prediction approach (Palladio). For this, we used two different parallelisation paradigms—Java threads and AKKA Actors—to implement and parallelise two standard parallel applications, namely a matrix multiplication and a bank transaction scenario. Further, the scenarios were modelled and analysed with Palladio. Finally, we compared the Palladio analysis results with the measured execution times of the applications. Simply put, the results show that the predictions are off by up to 63%.

The next logical step is to perform a SLR to identify all related research in the eld and to discover possible solution strategies unknown to our community. The SLR is described in Chapter 4.

Further, the lessons we learned from the experiment led us to hypotheses H1 to H4, explained next.

### **3.2.1. RQ1: Performance Modelling of Parallel Behaviour**

**Software Behaviour:** When we talk about software behaviour in the following, we refer to the performance-influencing aspects of the behaviour. Thus, we model abstract elements of the control flow of the software and the path the application takes through the program. The model's pragmatism is based on the idea of reflecting the program's performance as well as possible with respect to wall-clock time. We do not focus on the formal semantics of the behaviour. Rather, we refer to the performance characteristics and the performance-relevant demands a software behaviour executes on its hardware, especially in multicore environments. Relevant aspects are (but not exclusively) the forking and synchronising of threads, data read and write operations, and resource-demanding operations.

Keeping that in mind, our first hypothesis H1 relates to the abilities of current modelling languages to represent the needs and performance characteristics of parallel software behaviour. Often, parallel behaviours perform the same task, but with different data (e.g., parallel loops—Section 2.3—or SIMD—Section 2.2). Therefore, modelling the same behaviour over and over again is time-consuming, error-prone, and simply not possible for highly parallel systems.

So, to verify or falsify H1, we raised RQ1.1 and RQ1.2. Further, RQ1.3 was defined to answer the question of how to improve modelling languages if H1 is verified.

These research questions relate to the Actions 4.1 to 4.3 in Figure 3.2, where first the current modelling languages are evaluated; next, an extension in the form of a parallel AT catalogue is created; and last, the extension is evaluated based on a user study to prove its effectiveness.

### RQ1: Modelling of parallel performance-relevant behaviour in massively parallel environments:


are captured and expressed, and (b) all necessary information for performance evaluation is covered?

**RQ1.3:** How can software architects be supported in the task of creating accurate performance prediction models efficiently?

### **3.2.2. RQ2: Behaviour of Highly Parallel Applications**

The second research question focuses on the performance behaviour characteristics of highly parallel applications in parallel environments (multicore architectures). The assumption here is that the selected parallelisation paradigm, as well as the architecture characteristics, have a high impact on the performance of an application and therefore need to be considered in the performance predictions (H2.1).

RQ2.1 therefore focuses on observing the parallel application execution, while RQ2.2 aims to identify the most relevant performance-influencing factors (Action 5.1). RQ2.3 covers the observation from [FSH17], in which we noticed that the selection of the parallelisation paradigm may have an impact; a validation of this hypothesis (H2.1) is needed. Finally, RQ2.4 aims to identify common characteristics in the execution of parallel behaviours, which can be described in characteristic curves (Action 5.3.1). These curves can be included in the model predictions to increase accuracy.

### RQ2: Performance behaviour of highly parallel applications in massively parallel environments:


### **3.2.3. RQ3: Performance Prediction Models**

This research question deals with performance prediction models for parallel applications. H3 is the baseline hypothesis here, and RQ3.1 is designed to verify it.

Based on RQ2.2, RQ3.2 aims to answer the question of which performance-influencing factors need to be included in the prediction model (Action 5.2.1). At the same time, RQ3.3 covers the evaluation of the altered performance prediction models (Action 5.2.2).

### RQ3: Performance prediction models:

**H3:** Current model-based performance prediction models fail to consider relevant performance-influencing factors for parallel systems, and thus their predictions are off.


### **3.2.4. RQ4: CPU Simulators**

As explained in Section 2.4.1, CPU simulators can simulate the behaviour of parallel applications in multicore environments based on a given implementation. Therefore, the hypothesis here is that these CPU simulators, included in the performance prediction process, can help improve the quality of prediction (H4). The significant challenge will be to find suitable simulators that work with architectural designs (RQ4.1) and to integrate them into the existing approaches and tooling (RQ4.2 and Action 6.1). Finally, RQ4.3 evaluates the quality of the integrated approach (Action 6.2).

### RQ4: CPU simulators for architectural performance predictions:


**RQ4.3:** Does the use of CPU simulators increase the performance prediction accuracy for parallel applications in multicore environments?

### **3.3. Research Design Evaluation**

Addressing the RQs satisfactorily, and providing an artefact that is beneficial for the given use cases, is essential for a design science approach [HC10]. Therefore, we will lay out the evaluation of the contributions in this section and follow the concepts pointed out by [SV12]. Sonnenberg and vom Brocke argue for a continuous evaluation of artefacts and sub-artefacts throughout the whole research project. As Figure 3.2 shows in Actions 4.3, 5.2.2, 5.3.2, and 6.2, each research question (contribution) is evaluated separately. While in Actions 5.2.2, 5.3.2, and 6.2 the artefacts are compared against the current state-of-the-art approaches, Action 4.3 is evaluated by a user study to prove the usability of the artefact empirically. After this, an individual evaluation and an additional integrated evaluation of the combined artefacts is planned (Action 7). Further details of the specific evaluation methods are provided in the corresponding chapters of the contributions.

### **3.4. Design Science Guidelines**

To perform an adequate design science experiment, Hevner et al. provide seven guidelines, which they recommend addressing in a project-specific manner [HC10]. The guidelines are listed below, and we briefly describe how we have addressed them in this thesis:

<sup>1</sup> **Design as an artefact:** This thesis will result in multiple artefacts. First, it provides a modelling language extension to specify parallel behaviour within models (see Chapter 6). Second, it provides a model or model extension that captures the relevant characteristics of multicore architectures (see Chapter 8), which can be used for performance predictions. Third, it provides a model that captures the characteristic speedup behaviour of highly parallel applications, including the relevant performance-influencing factors (see Chapter 7). This can be used to estimate the maximum speedup of applications. Fourth, it provides a method to include CPU simulators in the process of predicting performance (see Chapter 9) to get very accurate parallel behaviour predictions.

<sup>2</sup> **Problem relevance:** As [FH16; FSH17] showed, the need for accurate performance predictions is highly relevant, as current prediction models are far off. Moreover, these papers only considered a moderately parallel multicore system with 16 cores; the current state of the art is already 32 to 64 cores for desktop PCs.

<sup>3</sup> **Design evaluation:** The utility, quality, and efficacy of the design artefacts are rigorously demonstrated by use case evaluations and comparison against the state of the art. If an artefact does not achieve better accuracy than the current state of the art, it is discarded.

<sup>4</sup> **Research contributions:** The main contribution of this work is an improved performance prediction for parallel applications in multicore environments. An added benefit is its contribution to the knowledge base.

<sup>5</sup> **Research rigour:** Strict, rigorous, and peer-reviewed methods are used to achieve the research goal, e.g., SLRs, the experiment-based performance model derivation proposed in [Hap08], and guidelines for user studies and experiment evaluations (e.g., GQM).

<sup>6</sup> **Design as a search process:** It is necessary to satisfy existing laws and best practices in the application area of the artefacts' domain. Identifying laws and best practices is achieved by a SLR covering this and neighbouring domains, as well as by expert interviews from academia and industry.

<sup>7</sup> **Communication of research:** The results are communicated to both industry and academia, who will both benefit from this information, by means of various peer-reviewed conference and workshop papers. The publications are summarised in Appendix A.1.

Given that, we will continue in the next chapter with step (3) of the research process and describe the related work.

### **4. Related Work**

Following the research design described above, we came to ask ourselves whether existing research had faced challenges similar to those we faced in [FH16; FSH17]. To answer this question as extensively as possible, we decided to perform a full SLR according to Kitchenham [KBB+09]. A SLR has two advantages: First, we may find useful approaches that we can use to overcome our challenges. Second, at the same time, we delineate the research area and cover related work.

In this section, we elaborate on step (3) in the research process and present the SLR design and results. This SLR was successfully peer-reviewed and published in [FHLB17]. For the sake of being up to date, we re-executed the SLR for this thesis and added the delta of newly found sources. In the re-execution, we focused only on the second SLR research question (see Section 4.2.1), which is especially relevant for this thesis.

### **4.1. SLR Overview**

Even though performing a SLR comes with additional overhead, it also brings a set of advantages. First, Kitchenham [KBB+09; KDJ04] provides a detailed reference process to follow step by step. Second, if the review protocol (search method) is well designed, the outcome of the search is reproducible and, more importantly, scientifically elaborated and reusable.

Figure 4.1 shows the process we followed during the SLR. We split the whole process into three phases: planning, conducting, and reporting. In brackets, we indicate how many sources remain for further processing. The first number (red) shows the sources from the 2016 run, and the second number (blue) those from the 2020 re-execution.

In the course of the chapter, we elaborate on each phase and each step in detail. Finally, we give a conclusion and an overview of the related work.

**Figure 4.1.:** Overview of the Systematic Literature Review Process (cf. [KC07])

### **4.2. SLR Planning**

The first phase, the planning phase, will result in the review protocol, which is the most important artefact of the SLR. It defines the whole process, containing the search strategy, inclusion and exclusion criteria, and the data extraction process. Further, we define the search goal and research questions here. In the following, we report on each step in detail.

### **4.2.1. Research Questions**

As we have already elaborated on the need for research [FH16; FSH17], we skip this step and start with the research questions, which set the primary direction of the search.

Given our domain, we focus on software developers and architects. We search for modelling approaches that enable software architects to analyse and predict the performance of parallel software in multicore environments during the design phase. Thus, we aim to answer two concrete research questions [FKB18]:


### **4.2.2. Review Protocol**

Given the research questions, we create the review protocol, which is the central artefact created during the first phase. All further steps are aligned to the definitions established in the review protocol. Thus, its quality is crucial for the SLR. To guarantee high quality, we develop the review protocol iteratively. At the end of each iteration, we validate the review protocol against a set of sources that we want to ensure are included, and a set of sources that we want to ensure are excluded. We pick these sources manually upfront.

For the sake of simplicity, we describe only the final version here.

### **4.2.2.1. Search Strategy**

In the search strategy, we define which search engines we use and how we construct the search terms to create the search phrases.

Our first decision here is to use Google Scholar<sup>1</sup> , as suggested by Kitchenham [KC07], because Google Scholar is a meta-search engine and includes most sources of scientific publications—also from other relevant databases.

Next, we derive the search terms. We gain the initial set of search terms from our own knowledge and the already-known related work. During the iterations,

<sup>1</sup> https://scholar.google.com/

we update the collection of search terms, based on the results, and also include synonyms.

The final list contains the following search terms—synonyms not listed: Parallel Programming, Many Core, Multicore, Modeling, and Software Performance Engineering. To get the search phrases, we combine these terms using "AND" and "OR" operators. This leads to the following search phrases [FHLB17]:


We also use our synonym list to replace keywords by synonyms. This way, we can cover a more extensive range. An example, based on T<sup>3</sup> and synonyms, is:

**T3.1:** ("Many Core") AND ("Performance Modeling" OR "Software Design") AND ("ACTORS")

Further, we created a blacklist with terms we expected to come up in our search that are outside our specific domain. For example, we blacklisted weather, since we expected to find sources focused on weather prediction models. The full list of synonyms, the blacklist, and the keywords, along with all results, are available online<sup>2</sup> .
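Mechanising the phrase construction is straightforward; the sketch below (with illustrative fragments of the term and synonym lists, not the full lists from our repository) generates AND/OR-combined phrases in the style of T<sup>3</sup> and its variants.

```java
import java.util.List;
import java.util.Map;

/** Illustrative generation of search phrases from keywords and their synonyms. */
public class PhraseBuilder {
    public static void main(String[] args) {
        // Assumed excerpt of the keyword/synonym lists.
        Map<String, List<String>> synonyms = Map.of(
                "Multicore", List.of("Multicore", "Many Core"),
                "Modeling",  List.of("Performance Modeling", "Software Design"));

        for (String core : synonyms.get("Multicore")) {
            // One core term AND the OR-joined modelling synonyms, as in T3 and T3.1.
            String phrase = "(\"" + core + "\") AND (\""
                    + String.join("\" OR \"", synonyms.get("Modeling")) + "\")";
            System.out.println(phrase);
        }
    }
}
```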

### **4.2.2.2. In- and Exclusion Criteria**

After agreeing on the search strategy, we define in- and exclusion criteria to filter identified sources and to focus on relevant documents. In our case, we consider all sources that fulfil one of the following statements [FHLB17]:

<sup>2</sup> https://doi.org/10.5281/zenodo.3972806

### **Inclusion Criteria**


Additionally, we define the following exclusion criteria. We will not consider sources that fulfil one of the following criteria [FHLB17]:

### **Exclusion Criteria**


To decide whether to consider a source or not, we first apply the inclusion criteria. Next, we apply the exclusion criteria to all sources passing the inclusion criteria check. If a source fulfils any exclusion criterion, we eliminate it from consideration.

### **4.2.2.3. Quality Indicators:**

The next step is to evaluate the remaining sources. For this, we use quality indicators, which we introduce in the following. To pass the evaluation step, a source has to at least partly fulfil at least one quality indicator:


### **4.2.2.4. Data Extraction**

Next, we need to define how the data is extracted from the remaining sources. For this, we define a three-step process. First, we collect bibliographic information about the source (e.g., authors and date of publication). Based on this information, along with the absolute number of keyword hits in the title, we rank the sources. Second, we extract and summarise the sources by evaluating the abstract, introduction, and conclusion (in order of ranking). In the process, we re-evaluate the in- and exclusion criteria. Third, we perform a full paper review for the remaining papers. During the review, we again double-check the in- and exclusion criteria.


**Table 4.1.:** Characteristics Used for Categorising [FHLB17]

### **4.2.2.5. Data Analysis**

After data extraction, we evaluate and interpret the extracted data. We categorise the data using the four characteristics shown in Table 4.1.

The first dimension of categorisation is the domain. Here we distinguish between sources contributing to the domains of Embedded Systems, HPC, or SPE. Sources that target software engineering in general are assigned to General Software Engineering.

The second dimension is the source type: Problem Statements focus on open issues; Solution Introductions provide an approach; Experience Reports describe practical realisations (e.g., case studies); and Knowledge Accumulations summarise a wide field of knowledge (e.g., surveys).

We expect numerous sources to target parallelisation patterns or techniques. Thus, the third and fourth dimensions split these source groups according to the pattern or technique each focuses on. In case no pattern or technique is described, we tag it as Not Available.

### **4.2.3. Evaluate Review Protocol**

To evaluate the review protocol, we execute two evaluations. First, as mentioned, we perform multiple iterations. In each iteration, we execute a small test search and check that pre-defined sources are included or excluded correctly.

Second, we form a review board within our group, including experts from SPE, Model-driven Software Development (MDSD), and HPC domains, which are the most relevant domains for our search. Within this board, we review each iteration run. In total, we had three iterations within the full group and several discussions in groups of two.

### **4.3. SLR Conducting**

Once the SLR is planned, the implementation phase begins (see Figure 4.1). In this section, we describe how we performed the SLR as defined in the review protocol (Section 4.2.2). We perform the actual search, apply the filters to the sources found, and analyse the data retrieved. For the sake of simplicity, we give only a summary of the results. The complete raw data and documentation are available in our repository<sup>3</sup> .

### **4.3.1. Executing the Search**

To evaluate our search phrases and terms, we performed test searches with strict automatic filtering based on our blacklist rules. Due to the small number of results (three), we relaxed the blacklist rules by removing the term weather forecast, which led to the expected result that the sources found covered a more comprehensive range. Therefore, we decided to manually preselect sources based on title only, evaluating the titles of the sources one by one. Only those sources that passed the evaluation were considered in further steps. On December 11, 2016, we conducted the search of the first run and obtained 54 sources after the manual pre-selection. On June 14, 2020, we reran the SLR. We focused only on the second SLR research question and executed only the query T<sup>3</sup> with its variations. We received a delta of 15 new papers.

### **4.3.2. Applying the Filters**

With the initial result set at hand, we apply our filter criteria step by step. As mentioned above, we performed the first evaluation during the search.

<sup>3</sup> https://doi.org/10.5281/zenodo.3972806

We manually evaluated all sources based on the title. To minimise personal bias, we ensured that only sources from other areas were excluded, and only if the title provided sufficient evidence for exclusion. Figure 4.2 shows the filtering process and the number of sources after each step.

**Figure 4.2.:** Filtering Process

During the second evaluation, we check that the search terms are mentioned in the title, the keyword section, or at least in the abstract. All sources not mentioning at least one of our search terms were excluded. We ended up with 47 sources after this step. The rejected sources mentioned the keywords only in the full text, where we assume they had no significant relevance.

In the third evaluation step, we read the abstract of each source and apply our in- and exclusion criteria, leaving us with 38 sources relevant for the SLR.

In addition, we ranked the remaining sources according to the ranking criteria "date", "keyword hit rate in the title", and "number of citations". For each ranking criterion, we introduced a corresponding metric that assigns a source to a rank on an ordinal scale from "A" (high relevance) to "D" (low relevance). For example, we assigned the rank "A" to sources with over 1,000 citations. Finally, we ranked all sources based on the mean value of the assigned ranks.
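The sketch below illustrates this ranking scheme. Only the 1,000-citation boundary for rank "A" is stated above; the remaining cut-offs are invented for illustration. Each criterion maps to an ordinal rank (A = 1 to D = 4), and the final rank is the mean of the three.

```java
/** Illustrative ranking: map criteria to ordinal ranks (A=1 .. D=4) and average them. */
public class SourceRanking {
    static int citationRank(int citations) { // "A" for over 1,000 citations, as above;
        if (citations > 1000) return 1;      // the lower thresholds are assumed
        if (citations > 100)  return 2;
        if (citations > 10)   return 3;
        return 4;
    }
    static int dateRank(int year)     { return year >= 2015 ? 1 : year >= 2010 ? 2 : year >= 2005 ? 3 : 4; }
    static int titleHitRank(int hits) { return hits >= 3 ? 1 : hits == 2 ? 2 : hits == 1 ? 3 : 4; }

    public static void main(String[] args) {
        double mean = (citationRank(1500) + dateRank(2016) + titleHitRank(2)) / 3.0;
        System.out.printf("mean rank = %.2f (A=1 .. D=4)%n", mean); // 1.33 here
    }
}
```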

After the ranking, we conducted a full review of each paper for a detailed analysis. We evaluated the quality of the papers and re-evaluated the in- and exclusion criteria, eliminating four additional sources. So in total, 34 (plus seven from the second round) sources made it into the final paper list, which was passed on to the next step.

### **4.3.3. Extracting the Data**

The extraction of data followed the three-step plan defined in the protocol. First, we collected meta-information (e.g., authors and date of publication). Then, we summarised the problems the sources faced by reading the abstracts. In the third step, we performed a full review of the filtered sources.

### **4.3.4. Analysing the Data**

After the full paper reviews, we categorised the sources into the pre-defined dimensions (see Section 4.2.2.5). Tables 4.2 and 4.3 show the sources and their categories.

Also, we added a column "Model for" to the table. Whenever a source targets a modelling approach, the purpose of the model is noted in this column.


ER – Experience Report; KA – Knowledge Accumulation; \*\* Sources from 2020

**Table 4.2.:** Classification of Sources for General Software Engineering


**Table 4.3.:** Classification of Sources for HPC, Embedded Systems, and SPE

### **4.4. SLR Reporting**

In this section, we report on the results of the SLR in detail. Thus, we first give a summary of each paper. The purpose of these summaries is not to fully explain each approach (for this, we refer to the source), but to give an overview of the areas where active research is being done. After the report, we extract valuable lessons learned, summarise the findings, and highlight sources that are particularly relevant as related work for this thesis.

### **4.4.1. Report Results**

Tables 4.2 and 4.3 summarise the complete set of sources we found during the search. In the following, we report the results as we reported them in [FHLB17]. Further, we mark every source that came up while re-performing the SLR with the keyword "[2020]".

The report follows the structure of the domain category.

### **4.4.1.1. General Software Engineering**

"Sources we found for general software engineering address dierent challenges, which focus mainly on the design and implementation phase in the software development process. For example, in their problem statement, Hwu et al. [HKM08] describe the challenges that arise from concurrent programming and claim that software developers need to apply engineering approaches to handle the complexity involved.

Mehrara et al. [MJU+09] give an overview of parallelism and compiler technology to understand the software development challenge. To ease the development process, the book by Mattson et al. [MSM04] presents a methodical approach for creating parallel programs and gives an overview of patterns. "Finding Concurrency", "Algorithm Structure", "Supporting Structures", and "Implementation Mechanisms" are the four groups of patterns, systematised according to the stage of the software development process and reflecting the different abstraction levels during the process. Following this approach, Pankratius et al. [PSJT08] present an experience report on four case studies on developing multicore software for general-purpose applications, where each case study uses a different programming language and hardware specification. The report shows that parallelising software is an individual task, and the speed-up can vary. A reason for varying speed-ups is the different hardware specification (i.e., number of cores, cache architecture), which motivates auto-tuners. Auto-tuners are used for source-code-based parallelisation and are addressed by [KP11; PH11; SPT10; ZP12].

Pankratius et al. give another experience report in the form of a case study [PJT09]. Different groups of software developers were asked to parallelise BZip2. Lessons learned are that the use of parallelisation patterns on higher abstraction levels increases the speed-up.

Haller et al. [HO09] give another approach, in which thread-based and event-based models are unified with the help of an abstract ACTOR that provides different kinds of operations to receive messages.

In the work of Iwainsky et al. [ISC+15], the authors automatically generate empirical performance models for OpenMP. They perform tests on different hardware and show that the overhead of OpenMP grows linearly or superlinearly with the number of threads. Further, they show that the chosen compiler has a major impact on the performance of the application.

To illustrate the hardware impact, Stürmer et al. [SWH+09] compare two different system architectures. In their work, they show that not only the number of cores, but also the memory controller and the caches have a significant impact on performance.

To avoid low-level synchronisation defects during the software development, new programming languages are proposed. For example, XJava, which preserves the object-oriented approach while simplifying the expression of parallelism, is presented by [OPT09]. To support the development of new programming languages, an automated usability evaluation for the design of parallel programming languages was introduced by Pankratius [Pan11].

Rodrigues et al. [RGD11a] utilise a meta-model extension on MARTE profiles to specify the task and data allocation in the memory hierarchy for GPU architectures.

We also found an experience report by Rodrigues et al. [RGD11b] that describes a case study where the authors use UML and the MARTE profile to specify and generate OpenCL code with the help of model-driven engineering approaches. They claim that the model-driven engineering approach is well suited for programmers to create parallel programs and that the MARTE profile has a high potential for modelling parallel programs.

The paedagogically-oriented contribution of Brown et al. [BSA+10] focuses on the education of 'new generations of students'. They identify a list of recommendations to improve students' knowledge of parallel programming.

The work of Sagardui et al. [SEPA13] needs to be highlighted because it focuses on verification and validation of multicore systems in early design phases. In their related work, they show that in the embedded systems domain, there are approaches for modelling multicore systems with the help of MARTE profiles. Their contribution is a high-level process, which recommends the use of three models to represent multicore systems and their software: an application model, a platform model, and an allocation model."[FHLB17]

[2020] An additional three sources came up in this category when re-performing the SLR in 2020: In his doctoral dissertation [TMCB16], C. Terboven describes the high relevance of data locality to the performance of parallel applications. He develops an approach to optimise the data locality in NUMA systems for the OpenMP paradigm. For this purpose, he creates a thread-affinity model. In [ADKT17], the authors face the issue of the absence of 'good high-level programming tools'. To overcome this problem, they introduce FastFlow, a framework that uses a stream-based paradigm to parallelise. It enables software architects to model their systems using cyclic graphs. Finally, [IDSM05] proposes a deep learning approach to estimate the performance of a parallel application by using multilayer neural networks.

### **4.4.1.2. HPC**

"All sources we found in the HPC domain focus on techniques to enable parallelism in HPC applications. Diaz et al. [DMN12] performed a survey. They comprehensively described dierent concepts, libraries, and languages to bring parallelism to applications. They show that distributed memory is the most commonly used programming approach for parallel programming in the HPC domain. Further, the work from Rabenseifner et al. [RHJ09] focuses on the potentials and challenges of this dominant programming model.

Other sources we found introduce problem-specific solutions to handle parallelism. Hadjidoukas et al. [HPD09] introduce a user-level thread library called PSTHREADS, which allows the use of fine-grained parallelism with large numbers of threads. Luebke et al. [Lue08] explain the CUDA programming model and argue for its use in the biomedical imaging community. Martinez et al. [MGF11] propose a source-to-source translator from CUDA to OpenCL."[FHLB17]

[2020] During the re-execution, we found four additional sources, all highly relevant. Two of them [CGIP16; SEE19] use a Queuing Network (QN)-based approach. More specifically, [CGIP16] uses QNs along with both analytical and simulation-based solvers to optimise parameters for parallel execution in systems with CPUs and GPUs. In contrast, [SEE19] uses QNs along with non-linear solvers to estimate the message communication for Message Passing Interface (MPI)-based applications in cloud environments. They focus on the interaction delay in distributed systems caused by the network delay.

In [EB16], the authors use a statistical approach to estimate the performance of parallel applications. They perform small-scale experiments, measure and analyse them, and use these data to estimate the performance in large-scale scenarios. Finally, the authors of [PF05] combine discrete event simulations and mathematical modelling to create a performance model for parallel and distributed systems. Further, they use UML activity diagrams to model the low-level (close-to-code) behaviour of the application and enrich it with additional performance-relevant information.

### **4.4.1.3. Embedded Systems**

"The majority of the sources we found in the domain of embedded systems introduce an approach to handle parallelism within a program. Bini et al. [BBE+11] present the approach developed in the ACTORS project. They show that the ACTORS approach is useful in handling time-sensitive applications with variable load. A problem statement paper by Gray et al. [GA12] describes the challenges of multicores in the embedded domain on the model-driven software engineering level. They identify problems within the whole development spectrum (i.e., system modelling, programming models of software languages, analysis and verication, toolchains support, and sophisticated hardware implementations).

Llopard et al. [LCFH14] introduce a modelling approach that combines hierarchical state machines (HSMs) with data parallelism and operations on compound data.

Lin et al. [LLL+11] propose a framework to generate program code for multicore embedded systems out of SysML models."[FHLB17]

### **4.4.1.4. SPE**

"After the nal evaluation, ve sources in the SPE domain remained. As one would expect, all the sources focus on improving the accuracy of performance prediction for multicore systems by either adopting an existing performance model or proposing a new model.

In [THW09], Treibig et al. improve the predictive power by including properties of the cache hierarchy design with the use of a simple balance metric. Further, the authors published a problem statement paper [THW12] and discuss the sensible use of hardware performance counters in a structured performance engineering approach. Additionally, typical performance patterns and their respective metric signatures are defined.

Xu et al. [XCDM10] propose a new performance model called CAMP for shared memory on multicore systems. The model uses non-linear equilibrium equations.

Van Craeynest et al. [VE11] also propose a new model, MPPM, for estimating multi-program multicore performance. It employs a method to model the performance entanglement between co-executing programs with shared caches.

Samuel Williams uses a roofline function to determine the correlation between floating-point operations and bytes transferred from DRAM to estimate the peak performance of a CPU in [Wil09]." [FHLB17]
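For reference, the roofline bound can be stated as follows. This is the standard formulation of the model; the symbol names are our notation and not taken verbatim from [Wil09]:

```
P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{DRAM}}\bigr),
\qquad
I \;=\; \frac{\text{floating-point operations}}{\text{bytes transferred from DRAM}}
```

Here $P_{\text{peak}}$ is the peak floating-point performance of the CPU, $B_{\text{DRAM}}$ the memory bandwidth, and $I$ the operational intensity of the code; a kernel with low $I$ is bandwidth-bound, one with high $I$ is compute-bound.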

### **4.4.2. Evaluate Report**

In the previous section, we presented insights into our search results. To sum up our findings, we derive the critical lessons learned during the SLR [FKB18]:


In addition to the above insights, we can answer the SLR's research questions as follows:


### **4.5. Threats to Validity**

During the design of the SLR, we made several decisions according to our scope. Each one brings certain trade-offs, which we discuss in the following, as reported in [FHLB17]:


### **4.6. Summary**

The SLR revealed useful insights in the areas of parallel programming, parallel modelling, and parallel performance prediction. Even though none of the approaches satisfies our requirements, we gained a lot of insights and knowledge in this area. On top of that, we acknowledge the work in three areas, especially:

**Parallel Modelling:** It becomes clear that the expression of parallel behaviour in software models is more and more relevant. Therefore, [RGD11b] aims to use UML MARTE profiles to enrich software models with multicore information. Even though they do not focus on performance predictions, but on code generation for OpenCL, and therefore focus on a low abstraction viewpoint, their ideas and methods to express parallel behaviour might be adaptable.

Even more relevant is the work from [PF05], in which the authors use UML activity diagrams to specify software performance models for parallel applications. Again, they focus on low abstraction levels and assume an implementation already exists, but these insights should be used to create performance prediction models on the architectural level during the design phase.


In addition to the references revealed by the SLR, there exist other related works that are not directly related to performance prediction of parallel applications. The exact schedulers from J. Happe [Hap08] and the CloudSim project [CRB+11] are the two most relevant contributions.

J. Happe included a concept of exact scheduling in the performance prediction approach. This approach takes several effects (like the overhead for context switches) of specific scheduling approaches into account. The approach is designed to work mainly for single cores and does not address the challenges of multicore systems. However, it supports concurrent software, and can therefore be used for parallel applications as well. Even though this increases the prediction accuracy for parallel applications, the impact is rather small [FH16].

Like Palladio, the CloudSim approach is a system simulator for cloud environments. Similar to Palladio, CloudSim uses a specification of a hardware, software, and usage model to simulate quality attributes like response time and elasticity of cloud environments. Due to this characteristic, both support basic parallel execution of containers. However, they do not consider multicore aspects and assume a linear speedup.

Since none of the related work satisfies our requirements or answers our research question, we take the insights from the SLR into account and continue with our research process (see Figure 3.2).

### **5. Running Example: Resource Demands**

Through the course of the thesis, we refer to different kinds of representative examples. In this section, we introduce each example, give a brief description and an implementation example, and characterise it. We categorise the examples into two groups: Resource Demanding Examples and Complex Examples.

### **5.1. Resource Demanding Examples**

The group of resource-demanding examples represents very low-level, algorithmic examples, each of which represents a special kind of resource-demanding behaviour. Most of the resource demands can be classified as processor-intensive demands (which mainly consume CPU time), I/O-intensive tasks (which perform many reads and writes, access memory frequently, and consume memory bandwidth), or a characteristic combination of both. We will not focus on an optimised implementation of the given problems for specific hardware. Instead, we are interested in the characteristics of resource-demanding behaviours, since these will be relevant for the performance predictions later. In the following, we will briefly explain each resource demand and give an implementation example. Further, the realisation of each resource demand in Protocom is provided in Appendix A.2. The general implementation example will help to understand the core problem. In contrast, the implementation from Protocom will help in following Contribution 4 (see Chapter 7).

### **5.1.1. Fibonacci Numbers**

**Description:** In mathematics, the Fibonacci numbers (or Fibonacci sequence) form a well-known sequence in which each value is the sum of the two preceding values. The first element of the sequence is F<sub>0</sub> = 0 and the second is F<sub>1</sub> = 1. For all other elements, F<sub>n</sub> = F<sub>n−1</sub> + F<sub>n−2</sub> must hold [Knu97].

**Implementation:** Implementing a sequential version of the Fibonacci sequence is straightforward, and a Java implementation is given in Lst. 5.1. In this implementation, recursion is used to calculate all the other numbers up to the given position. This implementation shows the core problem and is not optimised for the most performant execution.

```
/* Returns the Fibonacci number at position n
 * in the sequence. */
static int fibonacci(int position) {
    if (position <= 1) {
        return position;
    } else {
        return fibonacci(position - 1) + fibonacci(position - 2);
    }
}
```
**Listing 5.1:** Sample implementation of the Fibonacci number in Java

Since the Fibonacci number of position n is based on the two preceding numbers, a parallelisation of this problem is complex and exceeds the scope of this work.

**Characterisation:** The actual work of the Fibonacci number calculation is a simple addition. Therefore, the Fibonacci demand is a processor-intensive demand [FBKK19]. Storing the preceding values causes very little overhead and can be done efficiently in the L1 cache.

### **5.1.2. Mandelbrot Set**

**Description:** The Mandelbrot Set is another mathematical object, named after the French-American mathematician Benoit Mandelbrot (cf. [DH84]). It is the set of complex numbers c for which the iteration defined by z<sub>0</sub> = 0 and z<sub>n+1</sub> = z<sub>n</sub><sup>2</sup> + c remains bounded.

Geometrically interpreted as a part of the Gaussian number plane, the Mandelbrot set is a fractal. Images of it can be generated by placing a pixel grid on the number plane and assigning a value of c to each pixel. If the sequence is bounded for the corresponding c, i.e., if it belongs to the Mandelbrot set, the pixel will be coloured (e.g., black), and otherwise not. If the colour is determined by how many elements of the sequence have to be calculated until it is clear that the sequence is not bounded, a so-called speed picture of the Mandelbrot set is created: The colour of each pixel indicates how fast the sequence with the respective c is heading towards infinity.

**Implementation:** The following implementation (Lst. 5.2) plots a region (size by size) of the Mandelbrot set. The variables xc and yc represent the centre of the region, while size gives the size dimension and max defines the maximum number of iterations.

```
public class Mandelbrot {

    // return the number of iterations needed to decide whether
    // c = a + ib is in the Mandelbrot set
    public static int mand(Complex z0, int max) {
        Complex z = z0;
        for (int t = 0; t < max; t++) {
            if (z.abs() > 2.0) return t;
            z = z.times(z).plus(z0);
        }
        return max;
    }

    public static void main(String[] args) {
        double xc = Double.parseDouble(args[0]);
        double yc = Double.parseDouble(args[1]);
        double size = Double.parseDouble(args[2]);

        int n = 512;   // create n-by-n image
        int max = 255; // maximum number of iterations

        Picture picture = new Picture(n, n);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double x0 = xc - size/2 + size*i/n;
                double y0 = yc - size/2 + size*j/n;
                Complex z0 = new Complex(x0, y0);
                int gray = max - mand(z0, max);
                Color color = new Color(gray, gray, gray);
                picture.set(i, n-1-j, color);
            }
        }
        picture.show();
    }
}
```

**Listing 5.2:** Sample implementation of the Mandelbrot Set in Java [SW17]

**Characterisation:** Creating a graphical representation of the Mandelbrot set is not only a computation-intensive task; a lot of (complex) numbers have to be calculated, stored, and re-accessed as well. Therefore, the Mandelbrot Set demand can be characterised as an I/O-intensive task.

### **5.1.3. Sorting Arrays**

The sorting array demand is characterised by a lot of data accesses and swap-or-switch operations. In practice, a lot of different sorting algorithms are known, each with different pros and cons. In the following, we will focus on the Dual Pivot Quicksort algorithm, since this one is also implemented in the Java base class library.

**Description:** The Dual Pivot Quicksort algorithm [Yar09] is an improved version of Quicksort. It is characterised by using two pivot elements, one at the left end of the array and one at the right end of the array. In this algorithm, the left element must be smaller than or equal to the right element; otherwise, they will be swapped. After that, the set is split into three subsets: values smaller than the left pivot element, values larger than the right pivot element, and values between the left and right pivot element. After that, the three sets are partitioned, and step one is repeated until all partitions contain only one element. At the last step, they are merged.
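Since the library code shown below is heavily optimised and hard to read, the following minimal sketch (our own simplification, not the JDK implementation) may help to follow the three-way partitioning just described:

```
/* A deliberately simple (non-optimised) dual-pivot quicksort. */
static void dualPivotQuicksort(int[] a, int left, int right) {
    if (left >= right) return;
    // ensure the left pivot is <= the right pivot
    if (a[left] > a[right]) { int t = a[left]; a[left] = a[right]; a[right] = t; }
    int p = a[left], q = a[right];
    int lt = left + 1;  // boundary of the "< p" partition
    int gt = right - 1; // boundary of the "> q" partition
    int i = left + 1;
    while (i <= gt) {
        if (a[i] < p) {         // smaller than the left pivot
            int t = a[i]; a[i] = a[lt]; a[lt] = t; lt++; i++;
        } else if (a[i] > q) {  // larger than the right pivot
            int t = a[i]; a[i] = a[gt]; a[gt] = t; gt--;
        } else {                // between the two pivots
            i++;
        }
    }
    // move the pivots between the three partitions
    lt--; gt++;
    int t = a[left]; a[left] = a[lt]; a[lt] = t;
    t = a[right]; a[right] = a[gt]; a[gt] = t;
    // recursively sort the three partitions
    dualPivotQuicksort(a, left, lt - 1);
    dualPivotQuicksort(a, lt + 1, gt - 1);
    dualPivotQuicksort(a, gt + 1, right);
}
```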

**Implementation:** The code in Lst. 5.3 is an implementation of the algorithm described above, taken from the Java base class library. The code is highly optimised and hard to read. An easily comprehensible version, along with detailed explanations, can be found in [Yar09].

```
static void sort(int[] a, int left, int right,
                 int[] work, int workBase, int workLen) {
    // Use Quicksort on small arrays
    if (right - left < QUICKSORT_THRESHOLD) {
        sort(a, left, right, true);
        return;
    }

    /*
     * Index run[i] is the start of i-th run
     * (ascending or descending sequence).
     */
    int[] run = new int[MAX_RUN_COUNT + 1];
    int count = 0; run[0] = left;

    // Check if the array is nearly sorted
    for (int k = left; k < right; run[count] = k) {
        if (a[k] < a[k + 1]) { // ascending
            while (++k <= right && a[k - 1] <= a[k]);
        } else if (a[k] > a[k + 1]) { // descending
            while (++k <= right && a[k - 1] >= a[k]);
            for (int lo = run[count] - 1, hi = k; ++lo < --hi; ) {
                int t = a[lo]; a[lo] = a[hi]; a[hi] = t;
            }
        } else { // equal
            for (int m = MAX_RUN_LENGTH; ++k <= right && a[k - 1] == a[k]; ) {
                if (--m == 0) {
                    sort(a, left, right, true);
                    return;
                }
            }
        }

        /*
         * The array is not highly structured,
         * use Quicksort instead of merge sort.
         */
        if (++count == MAX_RUN_COUNT) {
            sort(a, left, right, true);
            return;
        }
    }
    // ... (remainder of the method, the merging of runs, omitted in this excerpt)
```
**Listing 5.3:** Implementation of the sort method of the DualPivotQuicksort from the Java base class library

**Characterisation:** Due to the heavy interaction with memory and the enormous number of read and write operations, the Dual Pivot Quicksort algorithm is a highly I/O-intensive task.

### **5.1.4. Calculating Primes**

**Description:** In mathematics, a prime number is a natural number greater than one that cannot be formed by multiplying two smaller natural numbers.

Prime numbers are of high interest in computer science, especially in cryptography, where large prime numbers are used for encryption.

Even though there are different approaches to find a prime number, e.g., trial division (i.e., brute force) or the Sieve of Eratosthenes [One09], it remains a resource-intensive task. The current largest known prime number is 2<sup>82,589,933</sup> − 1 and was discovered by Patrick Laroche in 2018 [Lar18].
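As a contrast to the trial-division implementation below, a minimal Sieve of Eratosthenes could look as follows. This is our own sketch, not taken from [One09]:

```
import java.util.ArrayList;
import java.util.List;

/* Returns all primes up to upperBound by crossing out multiples. */
public static List<Integer> sievePrimes(final int upperBound) {
    boolean[] composite = new boolean[upperBound + 1];
    List<Integer> primes = new ArrayList<>();
    for (int i = 2; i <= upperBound; i++) {
        if (!composite[i]) {
            primes.add(i);
            // mark all multiples of i, starting at i*i, as composite
            for (long j = (long) i * i; j <= upperBound; j += i) {
                composite[(int) j] = true;
            }
        }
    }
    return primes;
}
```

Compared to trial division, the sieve trades CPU work for memory: it performs far fewer divisions but touches a large boolean array, which shifts the demand towards the memory architecture.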

**Implementation:** The implementation in Lst. 5.4 shows a trial-division approach to find prime numbers. It simply checks whether each number is divisible by any smaller number greater than one.

```
public static List<Integer> getPrimeNumbers(final int upperBound) {
    List<Integer> resultSet = new ArrayList<>();
    for (int i = 2; i <= upperBound; i++) {
        if (isPrime(i)) {
            resultSet.add(i);
        }
    }
    return resultSet;
}

public static boolean isPrime(final int numberToCheck) {
    boolean result = true;
    for (int i = 2; i < numberToCheck; i++) {
        if (numberToCheck % i == 0) {
            result = false;
        }
    }
    return result;
}
```
**Listing 5.4:** Sample implementation of a trial-division prime number search in Java
**Characterisation:** The base characterisation of the above calculating-primes resource demand is defined by the method isPrime, which performs a high number of divisions. This leads to a high load on the CPU. The I/O interaction is comparably low; the few values involved can even be kept in the caches. Thus, the calculating-primes demand is a CPU-intensive demand.

### **5.1.5. Counting Numbers**

**Description:** Counting numbers is a straightforward algorithm to count numbers from zero upwards toward a limit. This example is a synthetic demand, which is added here because it can put much pressure on the memory architecture.

**Implementation:** The implementation of the counting numbers example is given in Lst. 5.5. It shows a for-loop which iterates until the given upper limit is reached. In each iteration, the current counter is added to a counting variable k. In this Java implementation, k must be a class variable to prevent the just-in-time compiler from removing it during code execution as part of the just-in-time code optimisation.

```
// needed to stop the JIT compiler from removing the code in execute
private long k;

private void countNumbers(final double countTo) {
    for (long j = 0; j < countTo; j++) {
        if (k > 100000) {
            k = 0;
        }
        k += j;
    }
}
```
**Listing 5.5:** Implementation of the counting numbers demand from Protocom

**Characterisation:** The characteristics of the counting numbers demand are rather simple but at the same time interesting. The demand produces both CPU demand from the addition and I/O demand from fetching the numbers from memory. The latter can be neglected when executing the code sequentially, because the numbers will be stored in the L1 cache or in registers.

### **5.1.6. Matrix Multiplication**

**Description:** In mathematics, matrix multiplication is a multiplicative combination of matrices. To multiply two matrices with each other, the number of columns in the first matrix must match the number of rows in the second matrix. The result of matrix multiplication is again a matrix. The entries of the new matrix are calculated by multiplying and summing the entries of the rows of the first matrix, component by component, with the columns of the second matrix.

Matrix multiplication is often used in linear algebra and the natural sciences. Each entry of the matrix product is calculated by c<sub>ij</sub> = Σ<sup>n</sup><sub>k=1</sub> a<sub>ik</sub> · b<sub>kj</sub>. In this equation, a<sub>ik</sub> and b<sub>kj</sub> are the corresponding entries of the matrices A and B when c<sub>ij</sub> is calculated.

**Implementation:** The implementation in Lst. 5.6 shows an example of a matrix multiplication. The number of columns of matrixA must be equal to the number of rows of matrixB.

```
public static int[][] multiplyMatrix(final int[][] matrixA,
                                     final int[][] matrixB) {
    int[][] result = new int[matrixA.length][matrixB[0].length];
    for (int i = 0; i < matrixA.length; i++) {
        for (int j = 0; j < matrixB[0].length; j++) {
            for (int k = 0; k < matrixA[0].length; k++) {
                result[i][j] = result[i][j] + matrixA[i][k] * matrixB[k][j];
            }
        }
    }
    return result;
}
```
**Listing 5.6:** Example implementation of a matrix multiplication in Java

**Characterisation:** Matrix multiplication is a good example of an I/O-intensive task, because for each multiplication, two values have to be loaded from memory, and one value has to be written. The multiplication itself has only a moderate impact on the CPU. Further, the order of the three for-loops has a significant impact on performance. Arranging the i, j, k loops properly can exploit caching effects, because array data are stored in main memory in such a way that the next value lies within the same cache page and is proactively loaded (see page cache for more details). Arranging them in the wrong order will result in a lot of cache misses and main memory accesses, which degrades performance. The difference between the best and worst version can impact the performance by a factor of eight (the worst combination is eight times slower than the best combination) [FH16].
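To make the loop-order effect concrete, the following sketch (our own illustration) contrasts a cache-friendly and a cache-hostile ordering; both compute the same product:

```
/* Cache-friendly (i,k,j) order: b and result are traversed row-wise. */
static void multiplyIKJ(int[][] a, int[][] b, int[][] result) {
    for (int i = 0; i < a.length; i++)
        for (int k = 0; k < b.length; k++) {
            int aik = a[i][k]; // reused across the whole inner loop
            for (int j = 0; j < b[0].length; j++)
                result[i][j] += aik * b[k][j];
        }
}

/* Cache-hostile (j,k,i) order: a and result are traversed column-wise,
 * so nearly every access touches a different cache page. */
static void multiplyJKI(int[][] a, int[][] b, int[][] result) {
    for (int j = 0; j < b[0].length; j++)
        for (int k = 0; k < b.length; k++)
            for (int i = 0; i < a.length; i++)
                result[i][j] += a[i][k] * b[k][j];
}
```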

### **5.1.7. Summary**

The examples given in this section will be used throughout the further course of this thesis, each example representing a unique resource demand. Table 5.1 summarises the characteristics of the individual demands and gives the CPU intensity and I/O intensity of each demand.


**Table 5.1.:** Summary of Resource Demand Characteristics

### **5.2. Complex Examples**

In the upcoming section, we will describe more complex examples which produce a more extensive resource demand. The first example, Bank Transaction, is a common one when it comes to interaction between multiple threads or actors.

The second example is taken from the SPEC Benchmark Suite. It consists of multiple combined low-level demands (like the ones explained before). SPEC benchmarks are often used to evaluate the performance of hardware systems. Thus, they can be used as a substitute for more complex real-world examples. The advantage of using SPEC benchmarks instead of real examples is that they are (a) more comfortable to set up, and (b) easier to compare across different setups.

### **5.2.1. Bank Transaction Example**

**Description:** The bank transaction example is a common example used in the literature to describe various problems in parallel execution [Lin10a]. Its underlying data model consists of a simplified version of the bank domain. Figure 5.1 shows the domain represented by a UML class diagram.

**Figure 5.1.:** Domain View of the Bank Transaction Example (cf. [Lin10a])

In this example, a bank consists of a set of accounts. Each account has a balance and methods to deposit or withdraw money. Further, there are transactions which transfer a specific amount of money from one account to another. A transaction is successful if the balance of the source account is higher than the amount of the transaction and the money can be transferred. Vice versa, a transaction will fail if the balance is insufficient.

The scenarios arising from the example are complex—especially for parallel executions—because the order in which the transactions are executed is important. Additionally, it must be guaranteed that only one transaction at a time is executed on a bank account, to prevent simultaneous write operations.

**Implementation:** For this example, a variety of instances can be found across the literature. However, in the course of this thesis, we will refer to the version conceived by J. Link [Lin10b]. Link uses Akka Actors to implement the scenario. As introduced in Chapter 2, Actors are used as a means to parallelise. In the example presented, each bank account is represented by an actor with its own message queue, in which the incoming transactions are stored. Further, Link uses a transaction actor, which manages the individual transactions. Thereby, each transaction is executed in the following order: (1) get the source and target account, (2) check the account balance, (3) withdraw money, and (4) deposit money. Given the use of the actor paradigm, the example implementation can be executed in parallel, and multiple transactions are processed at once.

The full implementation can be found in [Lin10b].
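To illustrate the actor idea (this is our own sketch, not Link's code; the message types Deposit and Withdraw are hypothetical), an account actor in Akka's classic Java API might look like this:

```
import akka.actor.AbstractActor;

public class AccountActor extends AbstractActor {

    public static final class Deposit  { final int amount; public Deposit(int amount)  { this.amount = amount; } }
    public static final class Withdraw { final int amount; public Withdraw(int amount) { this.amount = amount; } }

    private int balance = 0;

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            // each actor processes one message at a time,
            // so the balance needs no explicit locking
            .match(Deposit.class, msg -> balance += msg.amount)
            .match(Withdraw.class, msg -> {
                if (balance >= msg.amount) {
                    balance -= msg.amount;
                    getSender().tell("ok", getSelf());
                } else {
                    getSender().tell("insufficient funds", getSelf());
                }
            })
            .build();
    }
}
```

The sketch shows why the paradigm parallelises well: correctness is guaranteed per account by the message queue, while independent accounts can process transactions concurrently.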

**Characterisation:** The primary work in this example is subtraction or addition. However, the use of Actors puts much additional overhead on top: every message utilises the memory bus and uses additional memory. So, if we consider the example using an Actor implementation, we can expect a low to medium demand on the CPU and a comparably high demand on the memory architecture.

### **5.2.2. SPEC Benchmarks**

In two seminar theses, conducted in collaboration with P. Gruber [Gru20] and A. Yoon [Yoo19], we evaluated the suitability of performance benchmarks as use case examples. The following sections are based on these works:

Performance benchmarks (e.g., from SPEC<sup>1</sup>) are designed to evaluate the performance of computer systems. Further, they can be used to make different computer systems comparable. To ensure comparability, a benchmark is standardised and portable, which means the benchmark has the minimum possible dependencies on specific hardware. Additionally, benchmarks are not designed to stress the operating system. Depending on the benchmark set, it stresses the graphics card, the I/O bus, or—most commonly—the CPU.

<sup>1</sup> https://www.spec.org/

According to SPEC (the Standard Performance Evaluation Corporation)—a non-profit organisation whose goal is to establish, maintain, and endorse a standardised set of relevant benchmarks for computer systems—a benchmark is "a standard of measurement or evaluation" [SPE20]. A computer benchmark refers to a computer program which executes a set of operations to produce a metric that represents the performance of a computer environment. A benchmark typically measures execution speed and throughput as metrics. These metrics are used to analyse the performance of a system [SPE20].

Running the same benchmark on different hardware enables us to compare the performance of the different systems [SPE20]. According to the IBM Knowledge Centre, benchmark testing can help to determine current performance (issues) and help to improve performance [IBM18].

In the following, we will focus on the SPEC Benchmark sets, as they are very commonly used. However, other benchmark sets are suitable as well.

When the user runs the benchmarks from SPEC, they usually get a base and a peak value for the specific task. The main difference between base and peak is that peak results from using task-specific optimisations, while the base value uses the same optimisation settings for all tasks [MVL+10]. In general, both reflect the time the task has run. Further, the benchmark outputs the ratio between the execution time and the run time of the benchmark on a reference system, which the benchmark creators choose. This ratio gives an impression, at first glance, of whether the system under test is faster, slower, or as fast as the reference system. In the end, the SPEC benchmarks deliver a specification which ideally gives an impression of how well a system performs. The general specification is the median value over all applications.
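As a simple illustration with hypothetical numbers, such a ratio is the reference runtime divided by the measured runtime:

```
\text{ratio} = \frac{t_{\text{reference}}}{t_{\text{measured}}},
\qquad \text{e.g.}\quad \frac{1000\,\text{s}}{250\,\text{s}} = 4
```

so the system under test would be four times faster than the reference system.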

In the following, we will give a brief overview of the SPEC benchmarks, focusing on those targeting parallel execution, as they are suited to testing multicore systems.

**SPEC MPI 2007:** SPEC MPI consists of 13 different applications [MVL+10]. All 13 applications are examples from a scientific background; they are used to perform weather predictions or to simulate fluids, and they are all implemented in FORTRAN or C(++). In contrast to the above resource-demanding examples, these tasks are neither low in terms of complexity nor created synthetically. The SPEC MPI benchmark uses MPI calls as a means to parallelise. That means the independent processor cores need to communicate with each other regularly. Müller et al. [MVL+10] give additional details about the message sizes, implementation, and number of message calls.


**Part II.**

### **Contributions**

### **6. C1: Parallel Architectural Pattern Catalogue**

In the previous sections we learned about the foundations and state of the art of parallel computing, hardware architectures, and parallelisation paradigms, defined the research approach and the research questions to be answered in this thesis, and followed the research design. In the next four chapters we lay out the individual contributions (numbered C1 to C4, according to research questions 1 to 4) in detail.

The first contribution picks up the requirement introduced in Chapter 1. The requirement is that software architects should be able to express concurrency in software models in a way that characterises the behaviour of the software. The specification also includes highly concurrent software with multiple thousands of concurrently executed threads. As a result of this chapter, we can present an answer to research question 1, validate hypothesis 1, and present a parallel architectural pattern catalogue, which contains reusable knowledge. Given that pattern catalogue, the software architect can easily and efficiently model the behaviour of parallel software.

Figure 6.1 lays out the process followed to produce the first contribution. As the first step of this process, we analyse the current state of the art and establish why this requirement is currently not fulfilled. Next, we define a set of challenges to overcome and goals to meet in order to fulfil the requirement. To evaluate the quality of a parallel modelling language enhancement, we propose a set of evaluation metrics next. After that, we investigate different strategies to enhance current modelling languages, pick the most suitable one for our scenario, and execute the approach using the example of OpenMP parallel loops.

**Figure 6.1.:** Overview of the Research Method for Contribution 1

After confirming that the strategy is suitable, we identify a list of the most useful parallel patterns, which we implement and combine in the parallel architectural template catalogue. Finally, we execute an empirical study to evaluate this catalogue.

As a result, we present a parallel architectural pattern catalogue containing three of the most frequently occurring parallelisation patterns. We can show that the use of the pattern catalogue increases the efficiency of the SAs significantly. Furthermore, we are able to increase the accuracy of the performance predictions with the help of overhead functions.

Please note that significant parts of the work from step 1 are reviewed and published in [FH16]. Additionally, the results from steps two through four are summarised, published, and reviewed in [FKHB19]. Finally, the specification of the pattern behaviour (described in Section 6.6) is reviewed and published in [FHB20].

All raw data, implementations, and accompanying resources are publicly available:

**Section 6.1** Performance Prediction for Matrix Multiplications: https://zenodo.org/badge/latestdoi/250200347

**Section 6.5** Parallel Architectural Pattern Catalogue: https://github.com/PalladioSimulator/Palladio-Addons-ParallelPerformanceCatalogue

**Section 6.7.2** User Study Data: https://doi.org/10.5281/zenodo.3755339

### **6.1. Problem Space**

To emphasise the issues with current modelling approaches, we will first briefly report on a controlled experiment we performed in [FH16]<sup>1</sup>. In this work we used a matrix multiplication example (see Section 5.1.6). Later we will leverage the same example to evaluate our enhancements to existing modelling languages.

### **6.1.1. General Information**

In the controlled experiment, we evaluate the multicore and multi-threading capabilities of current state-of-the-art performance modelling tools. In this specific case, we focus on Palladio and raise the following research questions:

1. Is it possible to model multicore systems with Palladio?

2. How precise are the predictions?

<sup>1</sup>The full experiment description can be found in [FH16], and all data are available at https://zenodo.org/badge/latestdoi/250200347

To answer these questions, we confront the problem from two sides. On the one side, we implement a matrix multiplication as a parallelised code example and measure the execution time (response time) on dedicated hardware. On the other side, we model the same instance with Palladio and perform a simulation.

As a metric, we focus only on the execution (or response) time of the actual matrix multiplication. To evaluate its accuracy, we compare measurements to our simulation result.

### **6.1.2. Implementation**

Listing 6.1 shows the implementation we used for the matrix multiplication. The implementation follows the explanation in Section 5.1.6 and uses three for-loops to multiply and add up the respective matrix elements of matrixA and matrixB. The provisional sum is stored in matrixC. When all iterations are finished, matrixC holds the result of the matrix multiplication. The order of the three for-loops can be altered without changing the result, but this impacts the performance greatly. We tested all variants on our target hardware and chose the fastest variant, as described in [FH16].

For parallelisation, we used the omp4j<sup>2</sup> framework. It provides basic OpenMP functionalities like parallel sections and loops for the Java environment, and supports up to 16 worker threads. To use this framework, we simply had to add line 5 to the code and use the omp4j pre-compiler (see Lst. 6.1). Please note that threadNum tells omp4j how many threads to use; this parameter is optional, and when it is not specified, the default is the number of available CPU cores. The scheduling parameter is optional as well: static scheduling tells omp4j to do the scheduling while pre-compiling and not during runtime (dynamic).

```
1 /* Requires: matrixA, matrixB, matrixC != null;
2  * Requires: matrixA.getWidth == matrixB.getHeight;
3  * Ensures: matrixC = matrixA x matrixB;
4  */
5 // omp parallel for schedule(static) threadNum(2)
6 for (int i = 0; i < matrixA.getWidth(); i++) {
7     for (int k = 0; k < matrixB.getHeight(); k++) {
8         for (int j = 0; j < matrixA.getHeight(); j++) {
9             matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
10        } } }
```
**Listing 6.1:** Sample implementation of a matrix multiplication in Java with OpenMP annotations

<sup>2</sup> See http://omp4j.org and https://github.com/omp4j/omp4j

### **6.1.3. Modelling**

While implementing the matrix multiplication is straightforward, the modelling part is more challenging. To model the software behaviour in Palladio, we need to know some characteristics of our software; for example, the resource demand (i.e., the CPU time) for a specific task like a single multiplication, and how often this action is performed. Tasks that demand a single resource are called actions in Palladio.

To gather the additional characteristics, we first measure a sequential matrix multiplication and estimate the resource demand for a single multiply-add (line 9). We compute the number of multiply-add operations from the input matrices' dimensions.

With this information at hand, we begin to model the use case, starting with a sequential version. Figure 6.2 shows the PCM's Service Effect Specification (SEFF), which we use to model the software behaviour. The SEFF consists of only one action, which includes the resource demand for one multiplication (0.00000069) multiplied by the number of multiplication operations needed (indicated by the input matrices' dimensions). We took the resource demand from the measurements; it represents the time it takes to perform a single multiply-add operation.
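For illustration only, such a demand annotation could take the following stoex-like form. The parameter name matrixSize and the exact characterisation syntax are assumptions on our part; the constant is the measured per-operation demand from above:

```
0.00000069 * matrixSize.VALUE * matrixSize.VALUE * matrixSize.VALUE
```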

We could also have used three nested PCM loop-actions and only annotated the actual resource demand in the internal action, which would be a more natural approach. However, we chose the rst approach because it abstracts the actual algorithm and greatly improves performance during analysis [FH16].

**Figure 6.2.:** SEFF Definition of the Sequential Model

After creating the sequential model, we adapt it to fit the parallel scenario. This process involves much manual modelling, since the parallel constructs in the PCM are aligned to UML and are therefore very basic (e.g., they do not support massive parallel behaviour). We model each thread as a separate branch of a fork, where each branch gets the same amount of work. This is a valid assumption because the OpenMP parallel loop construct is implemented in the same way<sup>3</sup>.

Therefore, depending on the number of threads needed, it is necessary to model not one but n threads as n branches with internal actions and divide the resource demand into equal shares. This process is labour-intensive and error-prone.

### **6.1.4. Experiment Evaluation**

Table 6.1 shows the measurements and simulation results we collected by executing the program 500 times and by running the Palladio simulation. We computed the mean for both—the execution and simulation time. As one can see, the accuracy of the simulations drops when the number of worker threads (viz., the number of used cores) increases. One reason for the decreasing accuracy is that the simulation only considers CPU speed as a relevant metric, which leads to a linear speedup, while the measurements show that this is not the case.

<sup>3</sup> see OpenMP Specication: https://www.openmp.org/wp-content/uploads/OpenMP-API -Specification-5.0.pdf

There are several known reasons for not reaching an ideal speedup with a parallel program. We assume that the reason with the most significant impact is the additional overhead created by the threads, like synchronisation. In most cases, the matrices are not read or written directly from memory, but from the caches of the CPU cores. So, every time the result matrix is updated, the cache entries of other CPU cores become invalid and have to be synchronised or invalidated, which is expensive.

Regarding our research questions, we summarise our findings:


Following thesis hypothesis 3, we assume that the inaccuracy is due to additional performance-influencing factors like cache sizes, memory size, and memory bandwidth, which are not yet considered in the model. The investigation of these factors follows in Chapter 7.

Having the results of the controlled experiment at hand, we can use them to define challenges and goals in the next section. Afterwards, we will use these goals to evaluate different modelling approaches and language extension strategies.


**Table 6.1.:** Simulations and Measurements Summary

### **6.2. Problem Specification - Challenges and Goals**

In this section we use the insights from the controlled experiment described above and the lessons learned from the SLR (see Chapter 4). In the process, we identified challenges in modelling the behaviour of parallel software and in the performance prediction for multicore systems. As the matrix multiplication use case makes clear, there are two major challenges [FKHB19]:


### **6.2.1. Goals**

According to the challenges identified, we aim for the following goals [FKHB19]:


### **6.2.2. Evaluation Metrics**

To be able to evaluate different language extension approaches, we define the following evaluation metrics [FKHB19], based on the goals identified in Section 6.2.1:


**4. Understandable:** describes how intuitively one can use the approach. An approach is intuitively usable if (a) it can be used without much training and (b) the syntax supports the underlying semantics. A more understandable approach is desirable.

### **6.3. Modelling Language Extension**

With the goals (1 to 4) in mind, in this section we evaluate different variants to enrich existing modelling languages with parallel constructs. We first determine which diagram types we consider relevant (see Section 6.3.1). Second, we propose different concepts (see Section 6.3.2), and third, we evaluate whether each combination of diagram type and concept meets the evaluation metrics 1 to 4 (see Section 6.3.3). During the evaluation, we continue to use the running example, matrix multiplication, in combination with OpenMP, and regularly refer to it. Even though we use this specific example, we claim that the approach is transferable and works for different examples and parallelisation paradigms as well. We discuss this in the next section in detail.

### **6.3.1. Diagram Types**

The PCM (see Section 2.4.2.1) provides dierent diagram types, which are candidates for an extension. We now have a closer look at the diagram types and their suitability for expressing parallel software behaviour:


set the whole component as parallel executable. That way two instances of the same component run in parallel, very much like in function-oriented architectures or micro-services.


### **6.3.2. Extension Concepts**

Using a model means abstracting real-world objects and behaviour for a specific purpose [Sta73]. The challenge is finding the right level of abstraction as well as the relevant objects to represent in the model. In the following, we introduce three relevant elements (objects) of software characteristics, which are candidates to take into account while modelling the software behaviour. These concepts are independent of the above-described diagram types and can be included in any of them.

**Overhead:** The concept of overhead modelling considers overhead caused by parallel execution (i.e., thread initialisation, synchronisation, etc.). For example, if we parallelise a program using threads, the additional overhead for creating, running, terminating, and synchronising the threads needs to be represented to adapt the speedup correctly.


### **6.3.3. Diagram and Concept Evaluation**

Now that we know the relevant extension points (view types) and the possible concepts, we evaluate each combination based on the evaluation goals 1 to 4 (see Section 6.2.2). Afterwards, we will take the combination which seems most promising, evaluate it based on the use case example, and propose it as a reference approach to create the parallel AT catalogue. This plan also means that we neglect the other combinations for now, but keep them in mind so that we can return to them if the chosen solution is not satisfactory.

The process to evaluate the combinations is based on expert opinions. For this purpose, we conducted multiple review rounds within the Reliable Software Systems Group in Stuttgart, the Software Engineering Chair in Chemnitz, the Software Design and Quality Group in Karlsruhe, and with various external experts. The invited experts, mostly from German universities, work in different domains (Model-based Performance Prediction, HPC, Cloud Computing, and Parallel Programming).


**Table 6.2.:** Summary for Dierent Extension Strategies

Table 6.2 summarises our evaluation and shows in the left column the three diagrams we selected as entry points. For each diagram type we used metrics 1 to 4 as evaluation criteria (second column). The third to fifth columns show the three concepts, and an individual cell gives our final rating for a concept in combination with a diagram type based on the evaluation criteria. In the following, we enter a detailed discussion.


require much effort to realise. But even then, the benefit is more than questionable, because to use the concept, the software architect needs detailed knowledge of the resources and their access, which should be abstracted during the design phase. Further, including a complex concept (like locks) in a design model dramatically decreases the understandability and increases the effort needed.


### **6.3.4. Enhancement Process**

Having the evaluation of the diagram types and the concepts at hand, in the following section we present the process for including parallel concepts (e.g., parallel loops) into a modelling language (like the PCM).

### **6.3.4.1. Choosing a Starting Point**

After listing and evaluating all available options, we decide to focus on the SEFF diagram with an overhead concept in the first run. Deciding for or against the repository diagram is a matter of abstraction level. Defining a component as parallel-capable abstracts the parallelisation to the component level, whereas focusing on low-abstraction concepts like loops or sections means that the SA must already have an accurate idea of the software system during the design phase, which might not be the case. However, focusing first on the SEFF brings another advantage: the inclusion of the overhead model is better supported than in the repository diagram.

### **6.3.4.2. SEFF Language Extension**

After choosing a concept and a diagram type, we now propose an approach to extend the language. This is a two-step approach: We first design a language construct to represent massive parallelism on the CPU level (like OpenMP parallel loops); second, we add the overhead concept to the language to increase the prediction accuracy.

**Step 1 (Modelling Aspect):** In the following, we focus on goals 1 to 3, which means we want to ease the modelling process for multi-threading and support parallel behaviour in the models. For proof of concept, we focus on the running example of the matrix multiplication in combination with OpenMP-like behaviour in our models. Since UML2 Activity Diagrams, as well as the PCM, already support loop-actions, we focus on this action first.

The first question to answer is which additional information is required to enrich a loop-action into a parallel loop-action. To answer that question, we follow the method for experiment-based performance model derivation [Hap08]. According to this method, the performance model is extended in steps.


To identify the minimum set of attributes, we look again at the OpenMP parallel loop as a reference. As shown in Listing 6.1, the parallel loop only takes information about the number of worker threads used and the scheduling method (for a full discussion of performance-influencing factors, see Section 7.2). Additionally, the scheduling method can already be set as a parameter of the CPU in the Resource Diagram of the PCM. For the sake of simplicity, we start with the number of worker threads. Figure 6.3a shows the result of this first step.

Figure 6.3a shows a loop action annotated as a parallel loop based on the PCM language. There are only two differences from a regular loop action: the applied role @Parallel, which indicates that everything in the loop behaviour can be executed in parallel, and the number-of-worker-threads attribute (threadPoolSize).

**Step 2 (Accuracy Aspect):** In the following, we focus on goal 4. For this, we decided to use the concept of overhead modelling first and include this concept in the PCM modelling language. To that end, we add the overhead attribute to the parallel loop action from above. Figure 6.3 shows the parallel loop with the new overhead attribute. By allowing the attribute to be a dynamic value (as indicated by the sample value 50\*threadPoolSize), we can achieve two things at once. First, we enable the modelling of overhead, which can be either fixed or dynamic and equal for all threads (like thread initiation or synchronisation overhead). Second, we give the software architect the freedom to use this attribute to include a speedup function or, to be more precise, slow-down functions. For this, we allowed the specification of any kind of stochastic expression (in PCM called stoex). In theory, this enables the software architect to model any type of behaviour here.

**(a)** Annotated Parallel Loop Including Thread Pool Size **(b)** Annotated Parallel Loop Including Thread Pool Size and Overhead Function

**Figure 6.3.:** Stepwise extension of a loop to a parallel loop

For clarity, in Figure 6.4 we show what a parallel loop would look like when using only existing concepts in the PCM, for a threadPoolSize of two. Figure 6.4 shows the instantiation of the parallel loop with two threads. It uses a fork action to fork two separate threads. Each thread has an internal action, which needs CPU time; the resource demand is split equally among the two threads. Both threads end in a synchronisation point, which means they are synchronised after execution. In each thread, we add an internal action to describe the additional overhead.
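As an executable analogue of this unfolded model, the following sketch (our own illustration, not part of the PCM tooling) forks a fixed-size pool of threads, gives each an equal share of the iteration space, and joins them at a synchronisation point:

```
import java.util.function.IntConsumer;

/* Splits [0, iterations) into equal shares, one thread per share. */
static void parallelLoop(int iterations, int threadPoolSize, IntConsumer body)
        throws InterruptedException {
    Thread[] workers = new Thread[threadPoolSize];
    int share = iterations / threadPoolSize; // equal share per worker
    for (int t = 0; t < threadPoolSize; t++) {
        final int from = t * share;
        // the last worker also takes the remainder of the iterations
        final int to = (t == threadPoolSize - 1) ? iterations : from + share;
        workers[t] = new Thread(() -> {
            for (int i = from; i < to; i++) body.accept(i);
        });
        workers[t].start(); // thread creation: part of the modelled overhead
    }
    for (Thread w : workers) w.join(); // synchronisation point
}
```

The thread creation and the final join correspond to the per-thread overhead actions in the unfolded SEFF.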

### **6.3.4.3. Enhance the Modelling Language**

Now that we have introduced the conceptual idea, we discuss in the following how the concept can be realised and integrated into existing models and analyses. First, we describe two different ways (Meta-Model Extension vs. UML Profiles) to extend modelling languages in general. Afterwards, we sketch the process of how to integrate them.

**Figure 6.4.:** SEFF Representation of the Unfolded Parallel Loop Example from Figure 6.3

**Architectural Templates & Meta-Model Extension:** There are two known ways to extend a modelling language like the PCM. The first way includes a full meta-model extension. In our case, this would mean extending the PCM directly and adding new meta-model elements and attributes.

The second approach is a profiling strategy. A UML Profile uses stereotypes and profiles to extend the meta-model without changing the actual meta-model. For the PCM there is a similar approach, the AT Method [LHB17], which uses the AT Language. Within the AT Method, new language elements can be added, as long as there is a way to map the new language constructs to already-existing elements in the meta-model.

**ATs vs. Meta Model Extension:** With our scenario in mind, we identify the advantages and disadvantages of the inclusion strategies.

Using the AT method has many advantages. Once an AT is defined, it is easy and fast to use, and since every AT model extension has to be representable within the PCM, it is guaranteed that the simulation and analysis tools can handle the AT model extension. Thus we will not break any existing system. At the same time, this advantage becomes a disadvantage, because mapping everything to existing meta-model elements also means limited expressive power. Therefore, it might still be necessary to use a meta-model extension to achieve the intended outcome.

On the other hand, using a full meta-model extension is the most flexible option and gives us the freedom to integrate any kind of extension. However, this freedom comes at the cost of effort. Using a full meta-model extension means we would also have to adapt the performance prediction model and analysis tools to guarantee that the new language elements are supported.

In our case, we decided to use the AT method because it fits the use case best. As shown in Section 6.3.4.2, we are able to represent the new language extensions (see Figure 6.3b) with the help of existing meta-model elements (see Figure 6.4). Note that other use cases may still require a full meta-model extension.

**Architectural Template Extension Process:** To use the AT method for our needs, we have to create a new AT. We can create a new AT by following three basic steps as described in [Leh18].


**III. Register AT:** In the last step, we have to add the newly created AT to the AT catalogue to make it available to the software architect via the Palladio tooling.

A full explanation of how the use case example is realised, along with a definition of additional relevant patterns, can be found in Section 6.5.

**Figure 6.5.:** AT Profile for Parallel Loop Extension

### **6.4. Proof of Concept Evaluation**

In this section we present a proof-of-concept evaluation, using the running example. We apply the new parallel AT and evaluate it based on the simulation results, the predefined goals 1 to 4, and the evaluation metrics 1 to 4. If the evaluation is positive, we will use the approach to build a full parallel architectural template catalogue (see Section 6.5).

### **6.4.1. Result-based Evaluation**

First of all, we evaluate the approach based on the prediction accuracy of the simulation. Here we use our use case example and compare the results of the altered model with the model we used in Section 6.1.3. More specifically, we use the newly introduced language extension concepts and remodel the use case using the new AT. So instead of using the fork action and modelling all the individual worker threads manually, we use a loop action and apply the parallel loop AT. Instead of creating different models for each number of worker threads, we were able to use the threadPoolSize attribute to configure the model.

The most challenging part, however, was to find a function to represent the overhead. For this evaluation, we want to keep the process of finding a good representation for the overhead as simple as possible. Thus we use the measurements we took from the implementation (see Table 6.1) for one, two, and four worker threads. We calculate the difference between a linear speedup and the actual measurements. Next, we extract a simple linear curve based on the number of threads as x and the difference between linear speedup and actual measurements as y. We ended up with the following equation, because it best fit the observations: overhead = 900 - 50 \* threadPoolSize.

At first, this seems unnatural because we decrease the per-thread overhead while increasing the thread pool size. However, the total overhead across all worker threads still grows with the thread pool size. For two threads we have a total overhead of 1,600 (800 for worker thread one plus 800 for worker thread two), and for four worker threads, we have a total overhead of 2,800 (compare with Figure 6.3a).
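Spelled out, the fitted per-thread overhead and the resulting total overhead are:

```
\text{overhead}_{\text{thread}}(n) = 900 - 50n,
\qquad
\text{overhead}_{\text{total}}(n) = n \cdot (900 - 50n)
```

For $n = 2$ this yields $2 \cdot 800 = 1{,}600$, and for $n = 4$ it yields $4 \cdot 700 = 2{,}800$, matching the values above.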

Table 6.3 shows the simulation results when using a parallel loop action, configured as described above. The most noticeable outcome is that we achieve better accuracy in all cases. For one to eight threads we reach 99 % precision.

The high precision is not surprising, because we used the measurements from the real execution to calibrate the model. If we had used all measurements, we would have achieved an accuracy of 99 % for all cases. This, however, would have been a case of model overfitting.


**Table 6.3.:** Simulations and Measurements Summary Using a Parallel-Loop-Action

Nevertheless, the evaluation shows us two things. First, the overhead modelling approach can be used to significantly increase the performance model prediction accuracy—if used correctly. Second, finding an overhead function without having measurements from an implementation is an extremely challenging task, which requires much experience in parallel computing and is still error-prone. Therefore, we propose to use characteristic performance curves to estimate the overhead function (see Contribution 4 in Chapter 7).

### **6.4.2. Goal-based Evaluation**

In the next step, we evaluate the approach by its goal-fulfilment rate. We anticipate that we will reach all the goals 1 to 4. A detailed discussion follows:

**1. Effort:** Our first goal was to reduce the modelling effort so that it is no longer necessary to model every worker thread. In the proposed language extension, the software architect can just define the number of worker threads. Within the parallel loop AT, a completion is used to automatically generate the needed model and distribute the workload equally among all worker threads (as an OpenMP loop would do). However, this only works if the threads are identical. By definition, introducing automation reduces the overall effort.


### **6.4.3. Metrics-based Evaluation**

Finally, we evaluate the approach using the evaluation criteria 1 to 4. For this, we use the insights gained from the expert community. We consulted multiple experts from different German universities (e.g., TU Dresden - Department VDR and ZIH, TU Chemnitz - Department of Software Engineering and Operating Systems Group, HPI Potsdam, FZI Karlsruhe, and KIT - Department of Software Design and Quality). In the following, we discuss the evaluation metrics in detail based on the results of the expert interviews:

**1. Configurable:** Due to the parameterisable character of the parallel loop extension, the approach is highly flexible and straightforward to change. Therefore, the software architect can evaluate sets of configurations quickly.


### **6.5. Building a Pattern Catalogue**

After evaluating the above approach, we will use it in the upcoming section to create a parallel architectural template catalogue containing the parallel behaviour patterns most often needed by software architects. In the first part of the section, we focus on the research, collection, and identification of such relevant patterns. In the next section, we give a detailed behaviour description for each pattern. Finally, we present the empirical study we used to evaluate the usability of the template catalogue. We will not further discuss the implementation details of the individual patterns, but we will follow the approach described above. The full pattern catalogue, along with the source code and further documentation of the individual patterns, is available in the parallel AT catalogue repository on GitHub<sup>4</sup>.

### **6.5.1. Pattern Identification**

The first question we have to ask when building a pattern catalogue is: which are the relevant patterns? To answer this question, we formulate two sub-research questions: (1.1.1) Which parallel patterns already exist in practice, and (1.1.2) do they have similarities which allow them to be categorised? To answer these questions, we performed a structured literature review in [SWD19]. The results of this study are presented in the course of this section.

### **6.5.1.1. Search Method**

To answer question 1.1.1, we performed a structured literature review and followed this process:


<sup>4</sup> https://github.com/PalladioSimulator/Palladio-Addons-ParallelPerformanceCatalogue

**4. Abortion:** Due to the large number of search results, we decided to continue step 3 until we encountered 20 consecutive papers with no new patterns. The danger of this approach is that we most likely will not find all existing patterns. However, we can be quite certain that we cover the most relevant ones. This is good enough for building a first version of a parallel architectural template catalogue. Adding further patterns later will not require much overhead.

After conducting the search, we ended up with 35 patterns. As we had assumed, many of them follow the same concept but are named differently.

### **6.5.1.2. Pattern Description**

In the following, we give a short overview and a brief description of the 35 patterns we found, as reported in [SWD19]:


of the generic Actor model, and as such will not be considered an approach.


The following patterns are found in [MRR12] and are duplicates of already named patterns:


The remaining four hits are SISD, SIMD, MIMD, and MISD; these are not software behaviour patterns but hardware architecture styles. Thus we ignore them as we build the taxonomy.

### **6.5.2. Pattern Categorisation**

After collecting patterns and extracting characteristics, we went through the result set again and started to group similar patterns. We named each group according to the most common name and also introduced an additional dimension, the abstraction level. We added three levels of abstraction: Algorithmic, Architectural, and Design Patterns. Figure 6.6 shows the result, grouping architectural and design patterns for simplification. For each pattern, Figure 6.6 lists synonyms or implementation variants based on the findings of the structured literature review. This list is not complete and provides only an overview.

For a detailed explanation of the individual groups, see Section 2.3.

### **6.5.3. Pattern Selection**

**Figure 6.6.:** Categorisation of Parallel Patterns

After we successfully categorised all patterns, we extracted the core behaviour from each group of patterns. For three out of the four groups, we decided to realise a parallel AT; we decided against the message-passing pattern for several reasons. The message-passing paradigm follows concepts and assumptions that are fundamentally different from the other three patterns, as well as from the concept Palladio is based on (especially as represented by Actors). Palladio builds upon the assumption of passive and stateless components. However, an actor is a stateful and active component. Ignoring this fact would lead to a violation of the Markovian properties on which the Palladio simulations and analyses are based. Therefore, we decided against a realisation of the message-passing paradigm in an AT [SWD19].

For all the other patterns, we followed the proposed approach and realised a corresponding parallel AT. We published the complete parallel pattern catalogue along with the source code in a Palladio sub-repository on GitHub<sup>5</sup>.

### **6.6. Formal Semantics for Parallel Behaviour in the PCM**

To create or use parallel modelling language elements, it is crucial to understand the semantics of their behaviour. Therefore, in the course of this section we explain the semantics of the most relevant parallel language elements in the PCM and the semantics of the parallel ATs. To do so, we use a formal specification with the help of Hierarchical Queuing Petri Nets (HQPNs, see Section 2.5).

We start by explaining the mapping of fundamental PCM components to HQPNs, which was developed by Koziolek in [Koz08].

<sup>5</sup> https://github.com/PalladioSimulator/Palladio-Addons-ParallelPerformanceCatalogue

Koziolek defined semantic behaviour for most of the PCM elements. However, we will discuss only the loop and the asynchronous fork at this point, which we later reuse for our parallel ATs. For a full definition of all fundamental elements of the PCM, we refer to Koziolek's dissertation [Koz08].

Second, we introduce a mapping for asynchronous loops, which was not covered by Koziolek.

Third, we discuss mapping the parallel behaviour to QPNs in general. Based on that, we will evaluate and compare the semantic behaviour of the parallel ATs (from [FH18]) to the expected parallel behaviour.

### **6.6.1. Mapping of general PCM Components**

All elements used in the following are part of the Palladio SEFF, which describes the behaviour of the software model. For the sake of simplicity, we only use the subnets (QPNs) of the HQPN.

Within our HQPN, each token represents a single user or request within our system. The token's colour is a complex data type named TokenData (see Lst. 6.2). It contains:


In the following, we adhere to Koziolek's semantics and refer to the TokenList as *a*. For further details on mapping the processing resources, stochastic expressions, and distributions, see [Koz08].

```
color VarSpec = product string * string;
color VarList = list VarSpec;
color CompParList = list VarSpec;
color LoopList = list int;
color GuardList = list string;
color TokenId = int;
color TokenData = product VarList * CompParList *
      LoopList * GuardList * TokenId;
```
**Listing 6.2:** Colour of a token, called TokenData (cf. [Koz08])

### **6.6.1.1. PCM Loop**

Figure 6.7a shows the mapping of a PCM Loop Component (on top, as PCM description) to a QPN (below). The QPN contains the loop head and the loop body. After entering the loop, the first transition t₁ evaluates the loop iteration count (in case it is not a constant value, but a distribution or stochastic expression). The transition t₁ adds the loop iteration integer to the LoopList as a list element instead of a plain integer. The reason for this is that loops can be nested recursively, and the token needs to memorise all the loop counters. The head of the list gives the current iteration count.

Based on that value, either transition t₃ (counter = 0) or t₂ (counter > 0) fires. If t₂ fires, the token is fired into a subnet s₂, which represents the loop body. As soon as the token returns from the subnet, t₄ fires, decreases the loop counter, and the token re-enters the loop head. Finally, once the counter reaches zero, t₃ fires, removes the counter from the list of loop iteration integers, and the token is placed in the successor place of the loop (i.e., p₃).
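To illustrate the stack-like LoopList handling for nested loops, the following minimal Java sketch (our own illustration, not part of the QPN formalism; the comments refer to the transitions of our reconstruction above) mimics how a token pushes, decrements, and pops its loop counters:

```
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the LoopList semantics: entering a nested loop pushes
// a counter, each completed iteration decrements the head, leaving pops it.
public class LoopListSketch {
    private final Deque<Integer> loopList = new ArrayDeque<>();

    void enterLoop(int iterations) { loopList.push(iterations); }       // t1: evaluate and push
    boolean bodyFires()            { return loopList.peek() > 0; }      // t2 vs. t3: counter > 0?
    void iterationDone()           { loopList.push(loopList.pop() - 1); } // decrement the head
    void exitLoop()                { loopList.pop(); }                  // remove counter, continue

    public static void main(String[] args) {
        LoopListSketch token = new LoopListSketch();
        token.enterLoop(2);                       // outer loop, two iterations
        while (token.bodyFires()) {
            token.enterLoop(3);                   // inner (nested) loop, three iterations
            while (token.bodyFires()) { token.iterationDone(); }
            token.exitLoop();
            token.iterationDone();
        }
        token.exitLoop();                         // token proceeds to the loop's successor
    }
}
```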

### **6.6.1.2. PCM Asynchronous Fork**

Asynchronous forks spawn new threads without synchronising them at the end. Each thread terminates independently of the others. Figure 6.7b illustrates the behaviour for the given PCM specification (above).

First, the transition t₁ fires a copy of the current token into multiple places of the QPN, each representing a forked behaviour. During t₁, the values of the current token are modified in such a way that the token ID stays unique: a number is added for each forked behaviour, while the rest of the values stay the same. At the end of each forked behaviour, the transitions t₂ to tₙ flush the copied tokens. To continue, the transition t₁ fires an additional token to the successor, represented here by pₙ₊₁.

### **6.6.1.3. PCM Synchronous Fork**

In contrast to asynchronous forks, in synchronous forks the control flow spawns threads and waits for them to finish before continuing with the next steps. Figure 6.7b illustrates the behaviour alongside the PCM specification.

In general, the QPN looks very similar to that of the asynchronous forks, so in the following we only go into the two main differences.

First, instead of the transitions t₂ to tₙ (in asynchronous forks), which flush the tokens after the forked behaviours have finished, for synchronous forks we have one transition t₂, which only fires if a token is available in each of the places p₂ to pₙ. If that is true, t₂ fires and places a token in the successor of the synchronous fork, in our case pₙ₊₁. The token that is placed in the successor place is a merged copy of the tokens from p₂ to pₙ. Further, the token ID is modified so that the added identifier is removed. Thus the ID is reset to its original value before entering the fork, and remains unique.

The second difference to the asynchronous forks is when and how the token is passed to the successor. While for the asynchronous forks the transition t₁ immediately passes a token to the successor, the corresponding transition in the synchronous forks does not, and only passes tokens into the forked behaviours. The successor is served at the end, when the transition t₂ triggers it. In that way, we ensure that all forked behaviours have finished before continuing.
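In implementation terms, the synchronous fork corresponds to a barrier at the join point: the successor only runs once every forked behaviour has delivered its token. A minimal Java analogue (our own illustration, not part of the PCM tooling) uses a CountDownLatch:

```
import java.util.concurrent.CountDownLatch;

public class SynchronousForkSketch {
    public static void main(String[] args) throws InterruptedException {
        final int forks = 3;
        CountDownLatch join = new CountDownLatch(forks); // one "token" expected per forked behaviour

        for (int i = 0; i < forks; i++) {
            final int id = i;
            new Thread(() -> {
                // the forked behaviour runs here
                join.countDown(); // place a token for the join transition
            }, "fork-" + id).start();
        }

        join.await(); // the successor continues only after all forked behaviours finished
        System.out.println("all forked behaviours finished, continuing with successor");
    }
}
```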

### **6.6.2. Mapping of Parallel Behaviour to QPN**

In this section, we discuss the behaviour of parallel loops, sections, and blocks. Since no native PCM elements represent these concepts, we give the PCM descriptions based on the parallel AT extensions introduced above.


**Figure 6.8.:** Mapping PCM2QPN: (a) asynchronous parallel loop, (b) synchronous parallel loop

The descriptions should reflect the way common frameworks like OpenMP<sup>6</sup> implement these concepts.

### **6.6.2.1. Parallel Loops**

**Behaviour:** Parallel loops are a parallelisation concept known from different parallel programming paradigms like OpenMP. Put simply, a parallel loop executes each loop iteration in a separate thread. With the help of a thread pool, the scheduler assigns each worker thread to a physical core, so the threads can execute in parallel. A requirement in many scenarios is that the threads are data-independent or that the dependence is explicitly defined. Data-independent means that the read and write operations of each thread do not influence the others. A typical example to illustrate the behaviour of parallel loops is our running example of a matrix multiplication [FH16]. Assuming we have two matrices (10x10) we want to multiply, this results in a total of 1000 multiplications to perform. Using, for example, OpenMP parallel loops with a thread pool size of 8 splits the workload equally among the threads, resulting in 125 calculations per thread.

A parallel loop can either be synchronous (often used when distributing workloads and realising a master-worker pattern [MSM04]) or asynchronous (e.g., when implementing an observer pattern).
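As an illustration of the described behaviour, the following Java sketch mimics a synchronous parallel loop in the style of an OpenMP parallel for, splitting the 1000 multiplications of the running example evenly across a thread pool of size 8. All names are illustrative and not part of Palladio or the AT catalogue; an asynchronous variant would simply submit the chunks and return without waiting.

```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelLoopSketch {

    public static void main(String[] args) throws InterruptedException {
        final int iterations = 1000; // 10x10 matrix multiplication: 1000 scalar multiplications
        final int poolSize = 8;      // thread pool size as in the running example

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Callable<Void>> chunks = new ArrayList<>();

        // Split the workload equally: 1000 / 8 = 125 iterations per worker thread.
        int chunkSize = iterations / poolSize;
        for (int w = 0; w < poolSize; w++) {
            final int start = w * chunkSize;
            final int end = (w == poolSize - 1) ? iterations : start + chunkSize;
            chunks.add(() -> {
                for (int i = start; i < end; i++) {
                    multiply(i); // one scalar multiplication of the matrix product
                }
                return null;
            });
        }

        // invokeAll blocks until all chunks are done -> synchronous parallel loop.
        pool.invokeAll(chunks);
        pool.shutdown();
    }

    private static void multiply(int i) {
        double ignored = i * 42.0; // placeholder for one multiply-accumulate step
    }
}
```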

**PCM Instance:** Given the above behaviour description, a parallel loop is similar to a fork action in the PCM. It has a successor and a forked behaviour. Since the forked behaviours are all equal, specifying the behaviour once is enough. In addition to the fork action, information about the thread pool size and the number of iterations is required. For synchronous forks, a passive resource is needed as well. A passive resource can be used to implement require and release behaviours, e.g., for mutexes [Koz08].
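In implementation terms, a passive resource with a capacity of one behaves like a counting semaphore used as a mutex. A minimal Java analogue of the require and release actions (illustrative only, not the PCM implementation):

```
import java.util.concurrent.Semaphore;

public class PassiveResourceSketch {
    public static void main(String[] args) throws InterruptedException {
        Semaphore passiveResource = new Semaphore(1); // capacity 1 -> mutex

        passiveResource.acquire();       // PCM "require" action
        try {
            // critical section guarded by the passive resource
        } finally {
            passiveResource.release();   // PCM "release" action
        }
    }
}
```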

**Mapping:** For the mapping of the behaviour description to QPNs, we distinguish between two different kinds of parallel loops: synchronous and asynchronous loops, which are shown in Figure 6.8.

<sup>6</sup>OpenMP – https://www.openmp.org/

Asynchronous Parallel Loop: The QPN for asynchronous parallel loops is a combination of a loop and an asynchronous fork. It starts similarly to a fork with the transition t₁, which fires two tokens. One token is fired into the place of the successor pₙ₊₁, which can then continue, and another token is fired into the place of the loop behaviour; the ID of this token is altered and increased. Following the description of a loop (see Figure 6.7a), the next step evaluates the loop iterations. In this case, two evaluations are done. The first is for the outer loop, which forks the new threads; here, the value equals the given thread pool size. The second evaluates the iteration literal and divides it by the thread pool size, to share the workload equally; the result is added to the LoopList. Based on the former value, the loop either continues or finally proceeds to p₄. If the loop continues, t₃ fires two tokens, one into the subnet s, with an adjusted ID (cf. Section 6.6.1.2), and one to p₃ with an adjusted loop counter. After that, the loop condition is re-evaluated. The subnet s itself represents a normal loop as characterised in Section 6.6.1.1. Finally, when a subnet has finished, t₅ destroys the token.

Synchronous Parallel Loop: In contrast to the asynchronous parallel loop, the synchronous one does not continue until all tokens have returned from all subnets. For that reason, there is no fork action at the beginning, and the QPN starts with the evaluation of the loop iterations, which again equals the thread pool size. The loop execution behaves the same way as in the asynchronous loop. In contrast to asynchronous loops, where tokens are flushed after returning from the subnets, in the synchronous loop the tokens are passed on. The transition t₄ fires a token into two places: p₅ and p₆. Here, p₆ acts as a passive resource and counts the created tokens. Whenever a subnet finishes and its token returns, t₄ fires and increases the number of tokens in these places. The original token with the corresponding colour is placed in p₅, and the loop iteration counter is removed from the token's colour. Finally, transition t₅ fires if there are n tokens in the place p₆, where n is equal to the thread pool size. Thus, the transition t₅ fires once all subnets have returned. Further, the transition t₅ adjusts the ID field: it removes the identifier added for the subnet and restores the ID to its original value.

Please note that to provide a useful example, we modelled the passive resource (p₆) along with the require and release actions explicitly. It is also possible to combine it with t₅.

### **6.6.2.2. Parallel Sections and Blocks**

**Behaviour:** Parallel sections or blocks refer to a specific area in the source code that is either explicitly marked for parallel execution (e.g., parallel sections in OpenMP) or implicitly allows multiple executions of the same block. The former behaves similarly to a loop; most of the time, a parallel section is used to split the workload based on a task set or data structure. A block is specified by the same behaviour but can have different input parameters; it can be, for example, a method that is called by multiple threads.

**PCM Instance:** In the PCM, a block which can be called multiple times from different threads is modelled with a simple fork action and can therefore be either synchronous or asynchronous. Due to the similarity of a parallel section to a parallel loop, there is no additional concept in the PCM, and on an abstract level it can be handled in the same way as a parallel loop.

**Mapping:** The mapping of PCM instances for parallel sections to QPNs is very similar to the mapping of parallel loops. The only difference is that the subnets are not of type loop but of arbitrary type. This means that it is not the loop characterisation that is passed to the subnet, but an adjusted version of the VarList, describing the workload for the specific subnet. For blocks, the mapping is the same as for forks. Due to these highly similar concepts, we skip a full description at this point.

### **6.6.3. Evaluation of the Mapping of Parallel ATs to QPN**

In the following, we evaluate the correctness of the behaviour of the parallel loop ATs based on the running example. As described in Section 6.5, the parallel ATs need to map all elements to the given PCM instances. Since loops, sections, and blocks are very similar, the parallel AT method maps all kinds of parallel behaviour (loops, sections, or blocks) to a fork-join scenario (see Figure 6.3). Therefore, we can use the existing mapping of forks to QPNs to express the formal semantics. To show that this is a valid approach, we elaborate on a thought experiment. For that, we assume a synchronous parallel loop which calculates a matrix multiplication with matrices of size 10x10, so in total 1000 multiplications have to be performed. Further, we assume each multiplication takes 1 ms on a two-core system. In theory, sequentially executing the multiplications takes 1 s. Using a synchronous parallel loop (as described in Section 6.6.2) requires additional information about the number of worker threads. Assume we use two worker threads for the two-core system. The synchronous loop splits the work into two separate threads, which share the workload equally. That means each worker thread needs to perform 500 multiplications and needs 500 ms. Since we assume two cores, the overall execution time is 500 ms, because both threads can run in parallel. Now let us consider the parallel AT: here we use the parallel loop action (see Figure 6.3a) and specify the number of replications to be 1000, the thread pool size to be two, and the resource demand for one calculation to be 1 ms on the CPU. The parallel AT approach now maps this to a fork behaviour with two parallel threads, which is synchronised at the end. The resource demand for each internal action is still the same 1 ms on the CPU, but this time it is multiplied by the number of repetitions divided by the number of worker threads (i.e., it shares the workload equally). In this case, each internal action takes 500 ms, and the total run-time is 500 ms.
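The arithmetic of the thought experiment can be summarised as follows, with N multiplications, a resource demand of d per multiplication, and k worker threads on k cores (units as assumed above):

```
T_{\mathrm{seq}} = N \cdot d = 1000 \cdot 1\,\mathrm{ms} = 1\,\mathrm{s},
\qquad
T_{\mathrm{par}}(k) = \frac{N}{k} \cdot d = \frac{1000}{2} \cdot 1\,\mathrm{ms} = 500\,\mathrm{ms}
```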

This demonstrates that the response time behaviour is the same. In future work, we plan to provide a mathematical proof based on QPNs.

### **6.6.4. Upshot**

In this section, we formally defined the semantic behaviour of the fundamental parallel language concepts fork and parallel loop. This will not only help to create and use new parallel language concepts, but it also helps to understand the parallel ATs. At this point, we only explained fork and parallel loop, since the other two parallel ATs—Master-Worker-Pattern and Pipes and Filters—are mapped to and build upon the same basic constructs as the parallel loop.

### **6.7. Empirical Evaluation of the Parallel AT Catalogue**

Now that we have all the parts of a parallel AT catalogue complete (a process to enhance modelling languages, pattern selection, behaviour descriptions, and finally the catalogue itself), we still need to evaluate RQ 1.3: Does the architectural template catalogue support software architects in the task of creating accurate performance prediction models efficiently?

We have already shown how we can use an overhead function to increase accuracy. In this section, we evaluate the efficiency and usability of the approach. Since both quality aspects are hard to determine, we set up an empirical user study. This study was part of the work we conducted in [Zah20], and we present a summary in the following subsections.

### **6.7.1. Experiment Design**

To conduct a user study, we decided to go with a controlled user experiment. The controlled experiment gives us the advantage of minimising variance and disturbing side effects and gives us the opportunity to change the experiment variables according to our needs [RH09]. Further, it allows us to perform statistical analyses on our measurements [WRH+12]. To determine and specify the necessary metrics, we use a Goal-Question-Metric (GQM) plan [CR94] to define goals, questions, and metrics.

In total, we derive four goals from RQ 1.3. Figure 6.9 shows the GQM tree.

For each goal, we formulate the corresponding question, the metric we want to measure to answer the question, and our hypotheses regarding the outcome. With questions two to four, we determine which metrics to measure during the user study: the time participants need to fulfil a task, the number of errors they make, and the time they need to fix mistakes. In contrast, we answer question one by evaluating a questionnaire that each participant completes.

**Figure 6.9.:** Goals, Hypotheses, Questions, and Metrics of the User Study

### **6.7.1.1. Conduction Process**

Given the above GQM-Plan, we developed an experiment design and study process. Figure 6.10 shows the experiment design. It contains three phases:

**Phase 0 – Warm-up:** During this phase, we first recruit participants. To get the most reliable results, we aim for a mix of diverse participants whose experience with performance engineering ranges from none to expert. Finding experts will be more difficult since they are rare. However, if we can show that beginners using the parallel AT catalogue perform better (in terms of the above questions) than experts who are not using it, we can make a strong statement even with only a moderate sample size.

The next step during warm-up is to train the participants. During this step, we teach each participant the requirements to fulfil the task and educate them on the tool we want to use. Since we do not want to measure how well participants can learn new tools, we do not monitor this step in any way. However, we provide feedback, answer questions, and ensure that all participants complete the training.

The last step is to split the participants into two test groups. Both groups should be equal in size and experience level. Hence, each group should contain the same number of experts, advanced users, and beginners.

**Phase 1:** In the first phase, the participants are assigned to groups and scenarios. Each group has to complete the same scenarios; however, the order in which they use the parallel AT catalogue differs. Group A needs to complete scenario I with the standard toolkit, while group B uses the parallel AT catalogue to do so.

During the execution of scenario I, we measure the overall time, the number of errors, and the time each participant spends on errors (Appendix A.3 shows the sheet we use to take the measurements). After completing the task, each participant has to fill out a questionnaire (see Section 6.7.1.3).

**Phase 2:** The last phase is similar to the first one. This time the participants get a second scenario, and we switch tasks for the groups: group A has to use the parallel AT catalogue and group B the standard toolkit. This way we can rule out any learning effects participants may show during the completion of the first scenario. We again measure the times and errors. Afterwards, participants fill out a questionnaire again, and finally we interview them.

### **6.7.1.2. Scenario Selection**

In addition to the above-formulated GQM plan and process, we also need scenarios. Each scenario is presented to the participants, who then have to solve the corresponding task.

The first scenario involves the running example of the matrix multiplications and is fully described in Appendix A.3, Scenario II.

The second scenario describes a parallel search strategy to find literature in a literature database (see Appendix A.3, Scenario I, for a full description).

Both scenarios have in common that they need to fork multiple threads that perform a similar task. For each thread, there is some overhead for forking and synchronisation. Besides that, the threads are independent of each other.

**Figure 6.10.:** Overview of the User Study

We ensured that both scenarios can be modelled in Palladio with and without the parallel AT catalogue.

### **6.7.1.3. Questionnaire**

To capture general information about the participants and to rate the usability of the parallel AT catalogue, we designed a three-part questionnaire that each participant has to fill out. In the first part, we ask for general information about the participant, like their current degree, their level of expertise with performance engineering, and their experience with Palladio. Based on this information, we compose the user groups A and B and aim for balanced groups.

The second part contains four short questions, which have to be filled in twice by the participants, once after each scenario. Here we ask about the difficulty of the scenario, how they would rate their own performance, the amount of work they had to do, and how they would rate the usability of the standard toolkit or the parallel AT catalogue for the scenario.

The third part contains a total of six questions. The first three are about the usability and speed of the parallel AT catalogue in comparison to the standard toolkit; for these, the participants are asked to use a scale from one to seven. The latter three are free-text fields, where the participants give their final thoughts about general aspects of the experiment. Appendix A.3 shows the full experiment leaflet with all questions, scenario descriptions, and information provided to the participants.

### **6.7.1.4. Analysis Process**

To answer questions two to four, we can consider the measurements of time and number of errors we took during the experiments. However, to answer question one, on usability, we have to consider participant feedback. In the questionnaire, the participants can rate the usability of different items using a scale divided into seven levels. We translate the levels into a numerical schema ranging from one to seven and calculate the mean value for each question.

Now that we have numerical values for all questions, and thus our metrics for the hypotheses and goals, we can analyse them directly. We perform a t-test with a confidence level of 95% for each hypothesis, which allows the confident acceptance or rejection of the respective hypothesis.
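For reference, a sketch of the two-sample t statistic underlying such an analysis (the standard pooled-variance form; that equal variances hold for our samples is an assumption): with group means x̄_A and x̄_B, group sizes n_A and n_B, and sample variances S²_A and S²_B,

```
t = \frac{\bar{x}_A - \bar{x}_B}{S_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}},
\qquad
S_p^2 = \frac{(n_A - 1) S_A^2 + (n_B - 1) S_B^2}{n_A + n_B - 2}
```

The null hypothesis is rejected at the 95% level if t exceeds the critical value from a one-sided table with n_A + n_B - 2 degrees of freedom.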

### **6.7.2. Study Conduction**

In conducting the controlled user study, we strictly followed the experiment design. We were able to recruit 16 participants from different areas and with varying levels of experience: nine beginners, five advanced users, and two experts. We split the 16 participants into two groups of eight people each and tried to balance the groups as best as possible. After that, we trained the participants. Due to time conflicts, we were not able to train all participants at once and had to conduct several sessions.

We conducted the actual experiment with the two scenarios in separate sessions, to which we invited the participants individually. The individual sessions gave us the chance to monitor the participants better and to take the personal measurements more accurately.

In the next section, we will elaborate on the results.

### **6.7.3. Study Results & Reporting**

In the following, we briefly report on the results of the study and only give the relevant information. However, we have made all raw data publicly available<sup>7</sup>.

After conducting the study, we were confronted with a set of measurements. First, we look at the measurements we took during the study; Table 6.4 summarises them. The table shows all participants (first column), the measurements we took for the task with the standard toolkit (second to fourth columns), and the measurements for the parallel AT catalogue (columns five to seven).

In the summary section at the bottom of the table, the following characteristics are immediately noticeable, even without a detailed analysis:


<sup>7</sup>Raw Data: https://doi.org/10.5281/zenodo.3755339




**Errors:** The mean number of errors is almost identical with and without the parallel AT catalogue. This indicates that the extension is not helping to reduce the number of errors.

**Time spent on errors:** However, when considering the mean time spent on errors, we can assume that errors are easier to fix when using the parallel AT catalogue.

Next, we look at questions five to seven of the questionnaire. Figure 6.11 displays the results in a Likert plot.

All three plots show a strong tendency toward the parallel AT catalogue.

We found that 74.5% of all participants rated their performance with the parallel AT catalogue as fast or better, while only 12% would say the same of the standard toolkit. At the same time, 69% rated their performance as slow when using the standard toolkit.

Additionally, 81% of the participants rate the amount of work required to fulfil the task as "little" when using the extension, and none says it is too much. In contrast, all participants agree that the amount of work with the standard toolkit is much (19%) or too much (81%).

Finally, 94% of the participants rate the usability of the parallel AT catalogue as good and only 6% rate it as somewhat bad. In contrast, the majority of the participants rate the usability of the standard toolkit as bad (13%) or very bad (69%) when it comes to parallel behaviour.

In addition to the evaluation by sight, we also performed a t-test evaluation for all of the goals, research questions, and corresponding hypotheses (see Figure 6.9), even though we are aware that the validity of t-tests is very limited given the small sample size of 16 participants. To perform the t-test, we followed the definition given in [WRH+12], formulated all H₀ hypotheses, and used a confidence interval of 95% in combination with a one-sided distribution table<sup>8</sup>.

After performing the t-test, we can reject the H₀ hypotheses for goal I (improved usability, measured by the questionnaire) and goal II (increased efficiency, measured by the time needed). Thus, we have significant evidence that the parallel AT catalogue increases the usability of Palladio when it comes to

<sup>8</sup> http://math.mit.edu/~vebrunel/Additional%20lecture%20notes/t%20(Student's)%20table.pdf

**Figure 6.11.:** Likert Plots of Questions Five to Seven (rounded to integers) [Zah20]

modelling parallel behaviour and increases the efficiency (reduces the time needed) of SAs in creating models that include parallel behaviour. Regarding goal III (making Palladio less error-prone) and goal IV (reducing the time spent on errors), we were not able to reject the H₀ hypotheses.

### **6.8. Transferability and Limitations**

### **6.8.1. Transferability of the Parallel AT Catalogue**

The parallel architectural template catalogue provides a set of the most common parallel patterns. It enables software architects to use parallel constructs in their software models quickly, easily, and efficiently.

Even though we focus on model-based performance prediction, and therefore on languages like the PCM, we think that the approach is highly transferable. The PCM uses a UML-like syntax and semantics, and the AT method uses UML profiles to include the language extensions. Thus, transferring the approach to pure UML or to any other UML-like language is easily doable.

Additionally, we did not do any domain-specific pattern selection. Therefore, all of the identified, characterised, and realised patterns are of high value not only for software performance prediction, but for computer science in general.

On the downside, we have to say that we included performance-specific attributes, like the overhead function modelling, in our patterns. These domain-specific characteristics are a valuable contribution for software performance engineers; however, they might not be of high relevance for other domains.

### **6.8.2. Limitations of the Parallel AT Catalogue**

Even though the parallel AT catalogue, the parallel pattern taxonomy, and the formal semantics for parallel behaviour are of great benefit to software architects, we need to consider the limitations of this approach as well.


Further, we did not include high-level parallelisation approaches, which lie above the SEFF (software behaviour), and we explicitly excluded parallel components (e.g., containers or services executed in parallel).


### **6.9. Summary of CB1**

In this chapter, we described the contributions we made with respect to the corresponding requirement. To do so, we first identified the research need. Second, we showed that the current process of modelling parallel behaviour for performance prediction with state-of-the-art performance prediction tools (e.g., Palladio) is not only error-prone but also time-consuming. In addition, the predictions are inaccurate as well.

In the next step, we formulated the research goal: to support software architects with an efficient way to express parallel behaviour in software models along with the necessary characteristics. Next, we created a method to enhance current modelling languages to include parallel patterns with the help of the architectural template method [Leh18]. While creating the method, we carefully evaluated different diagrams, view types, and enhancement concepts. As a proof of concept, we used our running example (the matrix multiplication) and created the first parallel AT for the PCM. The evaluation of the working example verified the approach, and we were able to:


Testing the approach encouraged us to continue building a full parallel architectural template catalogue. To do so, we performed a structured literature search and found 35 parallel patterns. We extracted the core characteristics of these patterns and created a taxonomy with five root patterns (see Figure 6.6). Out of this, we successfully created a parallel AT catalogue which supports four out of five root patterns.

Finally, we conducted a controlled user study, in which we were able to empirically and significantly confirm that the parallel AT catalogue increases the efficiency and usability of the Palladio approach for modelling parallel software behaviour.

To wrap up, we can answer our research questions as follows:

**RQ 1.1:** Are software architects able to model even simple parallel concepts of highly parallel systems in an efficient way? Thereby, SAs need to focus on abstract, performance-relevant attributes at the architectural level during early design time.

**Answer:** In an empirical user study using a controlled experiment, we were able to show that current state-of-the-art tools do not support SAs in an efficient way.

**RQ 1.2:** Are software architects able to model the parallel software behaviour of an application with the help of current modelling languages, so that (a) the relevant performance characteristics are captured and expressed, and (b) all necessary information for performance evaluation is covered?

**Answer:** SAs are currently not able to model (a) all relevant characteristics of parallel software, which results in (b) inaccurate performance predictions for parallel software in multicore environments.

**RQ 1.3:** How can software architects be supported in the task of creating accurate performance prediction models efficiently?

**Answer:** With the help of a parallel AT catalogue, SAs can be supported in creating performance prediction models more quickly and with a higher user acceptance (usability). Furthermore, they can use the concept of overhead modelling to increase the accuracy of the predictions.

**RQ 3.1:** Are current simulation-based performance prediction approaches capable of predicting the performance of parallel and highly parallel systems accurately?

**Answer:** The experiments we performed in [FH16; FSH17] show that current state-of-the-art performance prediction approaches are up to 80% off when trying to predict the response time of parallel applications in multicore environments.

With the parallel AT catalogue presented in this chapter, we make a significant contribution for SAs who want to make more accurate performance predictions for parallel software more quickly. The contribution also resolves and fulfils the corresponding requirement.

### **7. CB2: Performance Curves for Parallel Behaviour**

In this chapter, we continue the research from contribution CB<sup>1</sup> (see Chapter 6) and still focus on the same research questions and requirements.

In CB1, we presented a pattern catalogue extension for Palladio, providing the most relevant parallel patterns. We included a concept in the modelling process which allows the SA to model the overhead and speedup behaviour with the help of performance curves. The biggest challenge here is to specify the overhead model, since this task requires a lot of experience and additional knowledge of the software and hardware.

Therefore, in this chapter, we investigate parallel performance-influencing factors (PPiFs), set up an experiment-based performance evaluation, and extract performance curves for parallel applications.

The overall goal is to extract and cluster characteristic performance curves, which can be provided to the SA. With the help of the performance curves, we want to enable SAs to easily define overhead functions and thereby further increase the performance prediction accuracy.

Figure 7.1 shows the structure of and the research method followed in this chapter.

First of all, we define the problem space, followed by the definition of the research goals and evaluation criteria. Next, we investigate PPiFs, which we use in the subsequent steps to design the experiment setup. We analyse the results from the experiment executions to extract performance curves, which we integrate into Palladio. Finally, we evaluate the approach using SPEC benchmarks.

**Figure 7.1.:** Overview of the Research Method for Contribution CB<sup>2</sup>

As a result of this contribution, we present (1) 14 lessons learned from the experiments and (2) twelve performance curves for the SA. The performance curves represent the six most relevant software behaviours and increase the predictive power of Palladio. Thereby, we are able to increase the prediction accuracy by up to 72% for the benchmark applu311.

Please note that significant parts of the work from steps one to three have been reviewed and published in [FBKK19]. The remaining steps are currently under review in [FSK+20].

Further, all results, raw data, and implementation details have been made available online:

**Section 7.3** Load Test Generator Based on ProtoCom: https://doi.org/10.5281/zenodo.3828432

**Section 7.4** Experiment Raw Data: https://doi.org/10.5281/zenodo.3855492

**Section 7.5** Performance Curves: https://github.com/PalladioSimulator/Palladio-Addons-ParallelPerformanceCatalogue

**Section 7.7** Performance Curve Evaluation: https://doi.org/10.5281/zenodo.4081091
### **7.1. Problem Space**

As we have learned so far, the performance of parallel applications relies on a complex set of factors. Often these factors are interconnected, and therefore it is a tricky task to tell how PPiFs will affect the overall performance of an application without executing and measuring it. But even given the measurements, it is still a challenging and time-consuming task to determine the effect of each Parallel Performance-influencing Factor (PPiF).

In Chapter 6, we proposed an abstract approach to include the speedup behaviour of parallel applications in the performance prediction models with the help of performance curves, by defining an overhead function. At the same time, we realised that defining these performance curves is a time-consuming and challenging task, which requires experience and additional knowledge of the software and hardware.

### **7.1.1. Idea**

To save the SA the effort of specifying the overhead function, we want to provide the SA with performance curves which capture the relevant PPiFs.

Figure 7.2 shows an example of a speedup curve based on the PPiFs worker threads and resource demand type (see Chapter 5 for detailed information on the resource demands).

The diagram contains five different examples, with an individual speedup behaviour characteristic for each case. This example can be mapped one-to-one to a two-dimensional performance curve.

**Figure 7.2.:** Measurements of Speedup Functions for Different Resource Demands on a 40-Core System with Enabled Hyper-threading

Our idea is to integrate such performance curves into Palladio. That way, the SA only needs to specify, e.g., the thread number and the resource demand type; the solver takes the performance curves into account and calculates the speedup behaviour based on the reference curve. In Section 5.1.7, we discussed a set of algorithms which exemplify the resource demand types. We use these algorithms in the course of this chapter to investigate the resource demand types.
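A minimal sketch of this lookup (all names hypothetical, not the actual Palladio implementation): the solver keeps one reference curve per resource demand type and scales the sequential resource demand by the curve's speedup value for the configured number of worker threads.

```
import java.util.EnumMap;
import java.util.Map;

public class PerformanceCurveLookup {

    // Hypothetical resource demand types, following the ProtoCom demands.
    enum DemandType { MANDEL_SET, SORT_ARRAY, COUNT_NUMBERS, PRIMES, FIBONACCI, MATRIX_MULT }

    // One reference speedup curve per demand type: speedup[i] = speedup with (i + 1) worker threads.
    private final Map<DemandType, double[]> curves = new EnumMap<>(DemandType.class);

    PerformanceCurveLookup() {
        // Illustrative values only; real curves are extracted from measurements (Section 7.5).
        curves.put(DemandType.MANDEL_SET, new double[] {1.0, 1.9, 2.8, 3.6});
    }

    /** Predicted execution time: sequential demand divided by the curve's speedup value. */
    double predict(DemandType type, int workerThreads, double sequentialDemand) {
        double[] curve = curves.get(type);
        int idx = Math.min(workerThreads, curve.length) - 1; // clamp to the measured range
        return sequentialDemand / curve[idx];
    }

    public static void main(String[] args) {
        PerformanceCurveLookup solver = new PerformanceCurveLookup();
        // 1000 ms of sequential work on 4 worker threads -> roughly 278 ms predicted.
        System.out.println(solver.predict(DemandType.MANDEL_SET, 4, 1000.0));
    }
}
```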

### **7.1.2. Problem Specification**

Having a closer look at the topic, it becomes clear that defining performance curves is not a straightforward task, and we have to overcome a set of challenges:


…continues to increase performance, while the speedup of CountingNumbers decreases after a while. Identifying and clustering adequate types is a challenge.

**3. Types of PPiFs:** The variety of PPiFs ranges from fixed hardware-specific influencing factors, such as the L1 cache, to flexible software-specific influencing factors, like the thread pool size. Finding and selecting the right set of PPiFs is a major challenge.

Given these challenges, we derive the following goals:


Given these goals, we can derive two metrics to evaluate the final performance curves:


Taking the challenges, goals, and evaluation metrics into account, we define the research method next.

### **7.1.3. Research Method**

To estimate the performance curves, we combine the approach of experiment-based performance model derivation proposed by [Hap08] and the process of extracting performance curves by [WHW12]. This process is displayed in Figure 7.1. Concretely, we first determine a set of relevant PPiFs by scanning the literature and conducting expert interviews. Next, we rank the PPiFs and start to build performance curves for the most relevant ones. If we are satisfied, we continue; if not, we consider additional PPiFs.

For each PPiF, we set up an experimental design to monitor and measure the behaviour of the software performance. In our case, we focus only on the execution time, specifically the speedup behaviour. From the measurements, we perform statistical analysis and clustering to determine a set of relevant performance curves. Finally, we integrate them into Palladio, utilising overhead functions, and evaluate their accuracy.

The research method, along with the collection of the PPiFs, was published, reviewed, and accepted in [FBKK19]. Besides that, major portions of the measurements were obtained in collaboration with student projects [Gre19].

### **7.2. Parallel Performance-influencing Factors**

The first step towards performance curves is to identify a list of potential PPiFs. To do so, we perform a literature review and interview experts from different domains, like SPE, HPC, and operating systems. Next, we prioritise the PPiFs based on the results from the expert interviews.

In the following, we first present the outcome of the PPiF collection and the interviews. Afterwards, we rank the list based on the insights we gained during the discussions.

### **7.2.1. PPiFs Collection**

The following list of PPiFs represents the outcome of a literature review [Söh18] and expert interviews we performed. For the latter, we interviewed four software performance experts within our department, seven HPC experts from the University of Dresden, the Hasso-Plattner Institute, and the Karlsruhe Institute of Technology (KIT), and three experts on parallel execution in embedded systems from the University of Chemnitz.

The following list is quoted verbatim from [FBKK19]; it is categorised into two groups (configurable and fixed PPiFs) and contains the subset of all PPiFs that the experts agreed on:

### **7.2.1.1. Configurable PPiFs**

Configurable factors are properties which can be directly configured or influenced by the software developer and can therefore be adjusted to the given hardware or scenario. Often, auto-tuners are used to find the best configuration for these properties on a given system.


### **7.2.1.2. Fixed PPiFs**

In contrast to configurable PPiFs, fixed PPiFs are given by the considered application or the infrastructure used, and cannot be influenced by the software developer.


We do not claim this list to be complete, but it does contain the relevant factors for parallel execution that we located in the literature and obtained from the expert interviews.

### **7.2.2. Prioritising**

Now that we have the list of PPiFs at hand, we need to prioritise it. The prioritisation is essential to decide which factor to take into account first. Considering all factors at once increases the effort significantly and makes both the extraction of performance curves and the decision for the SA more complex.

So we not only take into consideration the effect of the factors, but also the challenge for the SA to retrieve this information. Table 7.1 shows the prioritised list worked out with the expert board.

Highest ranked are the threads and the thread pool size. It seems logical that these two factors influence performance the most and most directly. We could also add the number of hardware cores here, but we included that in the thread pool size: if there is no multicore hardware available, considering threads would not make sense. Even though context switches are a relevant factor as well, this topic is already covered by J. Happe [Hap08].

**Table 7.1.:** Prioritised list of PPiFs after ranking by experts

Next, we rank the type of resource demand, because the board agreed that the kind of operation has a direct impact on the parallelisability of the problem, and therefore on its speedup. In contrast, the decision regarding the parallelisation strategy is not as clear. The board agreed that the paradigm used to parallelise an application affects performance, but it could not decide on the level of impact. The main argument against a high ranking was that, correctly implemented, all paradigms result in a good speedup behaviour.

For factors five to eight, the board again agreed on their impact, especially that data locality and caches have a high impact on the speedup behaviour. However, we rank data locality low, because it is hard for the SA to consider it in architectural models. Further, we ignored software caches for now.

### **7.3. Experiment-Based Performance Evaluation**

In this section, we describe the experimental design and setup, the hardware environments, the experiment results, and the extraction of the performance curves.

### **7.3.1. Experiment Design**

The outline of the experiment is sketched in Figure 7.3. The first step is to obtain and generate typical resource demands. For this purpose, we use ProtoCom<sup>1</sup> (see Section 2.4.2.1).

**Figure 7.3.:** Overview of Experiment Setup using ProtoCom as Resource Demand Factory [FBKK19]

**ProtoCom:** ProtoCom provides five different types of basic resource demands: Mandel set, sorting arrays, counting numbers, calculating primes, and calculating Fibonacci numbers. In addition, we implemented one additional demand, multiplying matrices, and adjusted other demands, like sorting arrays, to be able to specify the array size. All implementations of the resource demands are given in Appendix A.2.

ProtoCom enables us to generate work packages of the six specific primitive resource demands. The advantage of using ProtoCom is that we can specify the exact runtime (e.g., five seconds) of these packages in a given environment [BDH08]. We use this characteristic to generate several independent work packages of the same resource demand, which have zero

<sup>1</sup> https://sdqweb.ipd.kit.edu/wiki/ProtoCom

interdependencies. Thus, we can guarantee a pure workload on the CPU without communication, waiting, or locking side effects.

**Parallelisation:** In the next step, we take these generated packages, add them to a queue, and build a parallelisation approach around it. In total, we support four parallel paradigms: Java Threads, Java Streams, OpenMP, and AKKA Actors.

Each paradigm can take the queue and execute it in parallel. Thereby, we can specify the thread pool size and measure the pure execution time of the queue-execution step.
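The following sketch shows the core of such a load driver for the Java-threads paradigm (a simplification under our own naming, not the published load test generator): independent work packages are placed in a queue, drained by a configurable thread pool, and only the queue-execution step is timed.

```
import java.util.concurrent.*;

public class QueueExecutionSketch {

    public static void main(String[] args) throws InterruptedException {
        final int packages = 3 * Runtime.getRuntime().availableProcessors(); // 3x cores, as in the setup
        final int poolSize = Integer.parseInt(args.length > 0 ? args[0] : "8");

        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < packages; i++) {
            queue.add(QueueExecutionSketch::burnCpu); // independent CPU-bound package
        }

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime(); // measure only the queue-execution step
        Runnable task;
        while ((task = queue.poll()) != null) {
            pool.execute(task);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.printf("execution time: %.3f s%n", (System.nanoTime() - start) / 1e9);
    }

    // Stand-in for a ProtoCom work package with a calibrated runtime.
    private static void burnCpu() {
        long deadline = System.nanoTime() + 200_000_000L; // ~0.2 s of busy work
        double x = 0;
        while (System.nanoTime() < deadline) { x += Math.sqrt(x + 1); }
    }
}
```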

Finally, we can generate a runnable JAR file, which can be executed with the desired parameter set on the target platform. The complete source code is available online<sup>2</sup>.

**Experiment Execution:** In the last step, we take the runnable JAR file and deploy it on the target platform. For each platform, we perform multiple runs, always changing only one parameter: the thread pool size or the parallelisation paradigm. We run each configuration multiple times and vary the thread pool size from one to three times the number of physical cores available on that platform.

While performing each run, we measured not only the runtime but also the cache behaviour. Since measuring low-level metrics in this way is not supported by the JVM, we used the PAPI API<sup>3</sup> and perf<sup>4</sup>.

### **7.3.2. Experiment Environment**

To investigate the behaviour of different hardware environments, we performed our experiment on multiple target platforms. The characteristics of all machines used are displayed in Table 7.2. We use three dedicated servers of different dimensions; the smallest has 12 physical cores and the largest 96 physical cores.

<sup>2</sup>Load Test Generator: https://doi.org/10.5281/zenodo.3828432

<sup>3</sup> http://icl.cs.utk.edu/papi/

<sup>4</sup> https://perf.wiki.kernel.org/index.php/Main_Page


**Table 7.2.:** Overview of the hardware environments and their configuration

### **7.4. Measurements and Results**

We execute the above-described experiment setting for all 96 variations. The variations comprise the six resource demands, the four parallelisation paradigms, and the four different hardware settings. Thereby, we measure the execution time as well as the L2, L3, and main memory accesses where possible<sup>5</sup>.

We execute all experiments for all demands with a package execution time of 0.2 s. Thereby, we configure the total number of packages for each hardware environment individually, always three times the number of available cores. Using the same number of packages for all four hardware settings would mean having to pick the highest value, which would result in very long execution times on the smaller hardware environments.

In total, we end up with over 70,000 measurements in over 800 experiment runs. Due to this extensive amount of data, we are not able to show and discuss all results in detail. In this section, we present the results for the server in Stuttgart only, which are exemplary. The results for the hardware in Potsdam and the multi-node cluster (cloudbw) are attached in Appendix A.4. Even though we only show the results from Stuttgart here, we discuss noteworthy results of all the experiments.

A full description of the experiment setup, execution, and discussion is available in the supervised student thesis [Gre19]. Further, all results and raw data are publicly available online<sup>6</sup>.

### **7.4.1. Result Report Server Stuttgart**

For the sake of understanding, we first separate the performance/speedup aspect and the memory behaviour aspect. Thus, we first report the performance of the individual experiment runs with respect to the thread pool size. Then, we have a closer look at the memory behaviour, and finally we bring both aspects together.

<sup>5</sup> Not all hardware supports reading the performance counters for the L1, L2, and L3 caches.
<sup>6</sup> Experiment results raw data: https://doi.org/10.5281/zenodo.3855492

### **7.4.1.1. Performance Behaviour**

Figure 7.4 shows the measurements as speedup charts for the different parallelisation paradigms and resource demands. The x-axis indicates the number of used worker threads, i.e., the number of active threads (the thread pool size). Each worker thread is directly mapped to a processing unit, while the threads in the system are assigned to worker threads via the thread pool.

The y-axis displays the speedup. We calculate this value based on the execution time of a single-threaded application run (i.e., using only one worker thread). To increase the readability of the diagrams, only every sixth data point is displayed; the line between the data points represents the skipped values.
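Formally, the plotted speedup for n worker threads is simply the ratio of the single-worker execution time to the execution time with n workers:

```
S(n) = \frac{T(1)}{T(n)}
```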

The first area from the left (from 0 to 96 worker threads) indicates the field where each worker thread can be mapped to a physical core. The second area (from 97 to 192) shows the field where, due to hyper-threading, each worker thread can be mapped to a virtual core. The third area (from 193 to 576) represents the area where we increased the number of worker threads even further. In this area, not all worker threads can be directly mapped to cores, which means that the scheduler either has to switch tasks, and therefore handle context switches, or suspend worker threads until a core is free.

At this point, we notice three characteristics:


**Figure 7.4.:** Speedup for Different Parallelisation Paradigms [Gre19]; panel (d): Speedup Curve for all Demands Using AKKA Actors

…write operations) can benefit from hyper-threading and continue to speed up. Even though the speedup is not as great as before, it is still linear. On the other hand, processor-intensive tasks, like calculating primes or Fibonacci numbers, cannot benefit much from hyper-threading and stay constant. Further, very I/O-intensive tasks, like sorting arrays or calculating matrices, show rather bad performance in area two compared to hardware environments with smaller core numbers. A hypothesis here is that due to cold caches and unfortunate memory architectures, the hyper-threading effect is nullified. Noteworthy is the decreasing performance of the counting numbers demand as well. In the third area (from 193), we can see a performance stagnation with a slight tendency towards performance degradation.

3. Ignoring AKKA Actors, the speedup behaviour of the individual demands does not differ much between the paradigms. For example, the speedup curve for the Mandel set demand is similar for Java threads, Java streams, and Pyjama. Thus, we can say that the paradigm used does not have a great impact on the speedup behaviour. An outlier here is the sorting arrays demand, but only for Java streams.

### **7.4.1.2. Memory Behaviour**

Besides the performance of the parallelisation paradigms and resource demands, we also measure the memory behaviour. Here we measure the L2 and L3 cache miss ratios and the total number of cache accesses. Again, due to the extensive amount of data, we focus in the following only on the measurements taken from the dedicated server in Stuttgart and the parallelisation paradigm Java threads, and limit the scope to the L2 and L3 cache accesses and miss rates.

Figures 7.5a and 7.5b show the cache miss ratios for the L2 and L3 caches. The x-axis again shows the number of used worker threads; the y-axis shows the percentage of cache misses (a lower number is better). In addition, Figures 7.6a and 7.6b show the total number of cache accesses for L2 and L3 on the y-axis.

**Figure 7.5.:** Cache Miss Rate for Java Threads on the Server in Stuttgart [Gre19]; panel (b): L3 Cache Hit/Miss Rate for all Demands

**Figure 7.6.:** Cache Accesses for Java Threads on the Server in Stuttgart [Gre19]; panel (b): Total L3 Cache Accesses for all Demands

The measurements give us detailed insights into the memory behaviour. We highlight the following characteristics:


### **7.4.2. Comparison of Parallelisation Paradigms**

In the next step, we compare the performance across all hardware environments, resource demands, and parallelisation paradigms. For this, we focus on each resource demand type and compare the performance of the parallelisation paradigms, making the comparison for each hardware environment separately.

We notice that for each resource demand, the speedup behaviour is similar no matter which parallelisation paradigm we use. Here we have to note the inexplicable behaviour of the AKKA Actors implementation again, which we neglect in the comparison.

On the one hand, this is surprising, because we assumed that the parallelisation paradigm has an impact on the speedup behaviour. On the other hand, we do not compare the absolute performance. That means the parallelisation paradigm can have an impact on the absolute performance, while all paradigms scale similarly.

Further, we notice that we achieved a very good overall speedup. This is because we used the packages from ProtoCom, which are independent, and placed the parallelisation paradigm on top.

### **7.4.3. Comparison of the Servers**

Next, we are interested in how the hardware setting influences the speedup behaviour. To analyse this, we first need to normalise the results for all machines. Normalising means we divide the number of worker threads by the number of available physical cores. Further, we describe the speedup as a relative value in percent. In theory, a speedup of 100% is possible. As an example, the large server in Stuttgart has 96 physical and, due to hyper-threading, 192 virtual cores; a speedup of 100% would mean utilising all virtual cores optimally and achieving an absolute speedup of 192. In Figure 7.7 we use the results from the Mandel set (best speedup) and the count number (worst speedup) demands to exemplify the results of the comparison while using Java threads.

**Figure 7.7.:** Comparison of the Four Hardware Environments Exemplified by Using Java Threads, the Mandel Set Demand (a), and the Count Number Demand (b)

First, we focus on Figure 7.7a, which shows the speedup behaviour of the four different hardware environments using the Mandel set demand and Java threads. This demand performed best in all the experiments and shows the best parallelisation characteristics. As we can see, the server in Stuttgart and the small server in Potsdam show the best and almost identical behaviour. The big server in Potsdam shows a slightly better behaviour before the number of physical cores is reached (up to one). However, in area two (from one to two), it cannot benefit as much from hyper-threading. The multi-node server (bwCluster) shows the weakest performance. However, in all three areas, all machines show the same characteristics; only the gradient of the curves differs.

Next, we focus on Figure 7.7b. This figure shows the speedup behaviour of the four different hardware environments using the count number demand and Java threads. This demand showed the worst speedup behaviour in all the experiments. Having a look at the diagram, we notice four peculiarities: First, the speedup in area one (zero to one) is almost alike for all environments. Second, on the hardware in Stuttgart, the speedup already flattens at around 0.8, i.e., 80 cores (see also Figure 7.4a). Third, three out of four environments show decreasing performance in area two (one to two): while the small hardware in Potsdam and the multi-node cluster show similar behaviour, the server in Stuttgart underperforms, and the big server in Potsdam shows no performance decrease at all. Fourth, in the third area, all hardware shows the same behaviour again.

In summary, we can state that there are slight differences when considering speedup behaviour among all the different hardware environments. However, the essential characteristic is mostly the same. This is an important observation, because it allows us to extract performance curves from our measurements regardless of the hardware used. Further, we will be able to generalise the performance curves for all kinds of general-purpose hardware environments using a similar architecture.

### **7.4.4. Lessons Learned**

After we conducted the experiments and displayed the key results of the measurements, we can state interesting insights. These insights are not only relevant for our research question and the next step—extracting performance curves—but also show informative facts about parallel computing in general. In the following, we list all relevant aspects:


### **7.5. Extracting Performance Curves**

In the course of this section, we describe the process of extracting performance curves from the measurements. Due to the massive amount of measurements, we follow a structured method to extract the data. This process consists of four steps: normalisation, clustering, staging, and extraction. We describe each step in detail in the following.

### **7.5.1. Normalisation**

First of all, we decide to abstract from the actual measurements. To do so, we create speedup curves for each experiment run. As a reference for the speedup curve, we always use measures from the single-thread run. That way, we do not need to compare actual measurements with each other, but have a more abstract view on the data.

Next, we face the challenge of comparing measurements from different machines. Since each hardware environment has distinct characteristics and a different number of cores, the maximal possible speedup differs as well. To still be able to compare measurements from different machines, we need to normalise the data. As a normalisation factor, we used the number of cores available in each setting. As described in Section 7.4.3, we divide both the speedup and the number of worker threads by the number of available cores in the system. As a result, we get normalised values for all the machines, which we are able to compare. Figure 7.7 gives one example for the parallelisation paradigm Java threads and the resource demand Mandel Set. As depicted in the figure, the speedup of the machine in Stuttgart is almost the same as that of the small machine in Potsdam. For example, the x-axis value of 2 stands for the use of 192 worker threads in Stuttgart and 24 worker threads in Potsdam. In both cases, this is twice as much as the number of physical cores. Both achieve a relative speedup of 85%, which is an absolute speedup of 160 in Stuttgart and 20 in Potsdam.
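To make the normalisation concrete, the following minimal Java sketch shows how a single measured data point could be normalised. The types and method names are illustrative and not taken from our tooling:

```java
/** One measured data point of an experiment run (illustrative type). */
record SpeedupPoint(int workerThreads, double absoluteSpeedup) {}

/** The same point, normalised by the core counts of the machine. */
record NormalisedPoint(double x, double relativeSpeedup) {}

static NormalisedPoint normalise(SpeedupPoint p, int physicalCores, int virtualCores) {
    // x-axis: worker threads relative to the physical cores (x = 2 means twice as many threads as cores)
    double x = (double) p.workerThreads() / physicalCores;
    // y-axis: speedup relative to the maximal possible speedup, i.e., the number of virtual cores
    double relativeSpeedup = p.absoluteSpeedup() / virtualCores;
    return new NormalisedPoint(x, relativeSpeedup);
}
```

For the example above, `normalise(new SpeedupPoint(192, 160), 96, 192)` yields $x = 2$ and a relative speedup of about 0.83, matching the roughly 85% read off the charts.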

### **7.5.2. Clustering**

After we are able to compare all measurements with each other, we have to perform clustering to get the curves which behave similarly. In our case, we perform a manual clustering based on the observations of the speedup curves. As shown in Figure 7.4, all demands have unique behaviour. Therefore, the first cluster criterion is the resource demand type. Next, we compare the speedup behaviour for the given hardware environment and parallelisation paradigm for each demand type. As stated in Lesson 5, we can confirm that the choice of the parallelisation paradigm has no significant impact on the speedup behaviour. Thus, we do not group by parallelisation paradigm. However, as stated in Lesson 4, we assume a bug in the AKKA Actors framework caused the unnatural behaviour. Therefore, we exclude these measurements from further consideration.

The choice of the hardware environment has a greater impact on the behaviour, as illustrated in Figure 7.7. For all but the counting number demand, the difference between the four environments lies in a corridor of at most 30%. Thereby, the dedicated servers behave similarly, and only the virtualised bwUniCluster behaves differently. Thus, we decide to separate virtualised and dedicated systems.

### **7.5.3. Staging**

Besides clustering, we noticed that the speedup behaviour differs sharply when reaching specific numbers of worker threads. Therefore, we introduced three stages. The three stages align with the three areas in the previously shown diagrams. Stage one starts with one worker thread and goes up to the number of physical cores, the second stage goes from there to the number of virtual cores, and the third stage goes from there to infinity.
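Expressed as code, the staging is a simple classification by core counts. The following helper is a sketch; the method name is ours:

```java
/** Stage of a configuration, following the three areas in the diagrams above. */
static int stage(int workerThreads, int physicalCores, int virtualCores) {
    if (workerThreads <= physicalCores) return 1; // stage 1: up to the number of physical cores
    if (workerThreads <= virtualCores)  return 2; // stage 2: up to the number of virtual cores
    return 3;                                     // stage 3: beyond the virtual cores
}
```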

### **7.5.4. Extraction**

In the final step, we extract the performance curves from each cluster and stage. To do so, we use linear regression. Thereby, we consider all speedup


**Table 7.3.:** Extracted Performance Curves for Dedicated Machines Based on the Speedup Behaviour of the Demands

curves in a cluster and stage, take the average, and extract a linear function using regression. For the first two stages, we obtain well-fitting curves (r-values above 0.90 at a confidence interval of 0.95). For the third stage, the variance of the measurements is higher. Thus, the resulting curves do not fit as well (r-values between 0.3 and 0.87). Table 7.3 shows the performance curves for dedicated machines for each demand type and stage (Appendix A.2 shows the performance curves for virtualised hardware). Additionally, Figure 7.8 visualises the performance curves. The x-value is the normalised number of worker threads ($n_{threads}/n_{physicalCores}$). The y-value gives the relative speedup with respect to the maximal possible speedup ($S/S_{max}$).
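The extraction itself is an ordinary least-squares fit over the averaged, normalised points of one cluster and stage. The following sketch shows the computation; it is a minimal stand-in for the regression tooling we used:

```java
/** Least-squares fit of y = a*x + b over one cluster and stage; returns {a, b}. */
static double[] fitLine(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double a = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope
    double b = (sy - a * sx) / n;                          // intercept
    return new double[] { a, b };
}
```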

### **7.5.5. Using Performance Curves: An Example**

The SA can now use the above performance curves to correct the performance predictions—not only from Palladio, but from any performance prediction tool. To utilise the performance curves, the SA needs information about the available cores in the system, the number of worker threads, and the kind of resource demand. For example, assume we have a dedicated machine with 30 physical cores, using 45 worker threads, and a resource demand type which is close to the sorting array demand. First, we have to calculate the normalised x-value: $x = 45/30 = 1.5$. After checking Table 7.3, we pick the following performance curve: $f(x) = 0.131x + 0.250$.

**Figure 7.8.:** Visualisation of the Extracted Performance Curves for Dedicated Machines

Inserting the above values, we end up with $f(1.5) = 0.131 \cdot 1.5 + 0.250 \approx 0.45$. This is the relative speedup calculated by the performance curve (the absolute speedup is 27).

In contrast, Palladio assumes a linear speedup, which in our example is 45 (absolute) or 0.75 (relative). So we can now correct any performance prediction obtained from Palladio by the factor 0.6. For example, imagine our Palladio simulation predicts a response time of 200. We multiply the Palladio result by the factor and end up with a corrected prediction of 120.
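The whole correction can be condensed into a few lines of Java. The following sketch reproduces the example above; the method name and parameter list are illustrative:

```java
/**
 * Corrects a prediction that assumes linear speedup, using a performance
 * curve f(x) = slope * x + intercept from Table 7.3.
 */
static double correctPrediction(double palladioResult, int workerThreads,
                                int physicalCores, int virtualCores,
                                double slope, double intercept) {
    double x = (double) workerThreads / physicalCores;            // e.g., 45 / 30 = 1.5
    double curveSpeedup = slope * x + intercept;                  // e.g., ~0.45 (relative)
    double linearSpeedup = (double) workerThreads / virtualCores; // e.g., 45 / 60 = 0.75 (relative)
    double factor = curveSpeedup / linearSpeedup;                 // e.g., ~0.6
    return palladioResult * factor;                               // e.g., ~120 for a prediction of 200
}
```

Calling `correctPrediction(200, 45, 30, 60, 0.131, 0.250)` yields approximately 120, as in the example.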

Of course, this is a lot of manual effort. Therefore, in the next section we discuss integrating performance curves into Palladio for automated calculations.

### **7.6. Palladio Integration**

To integrate the performance curves into Palladio, we need to alter the performance predictions. One way of doing so is to include the performance curves into the simulators. Another way, which we follow here, is to use the overhead concept introduced in CB1.

**Figure 7.9.:** Profile Example for the Parallel For-Loop

In short, we alter the parallel patterns to include the performance curves directly into the patterns. Further, we use the QVT-o transformations to automatically estimate the right overhead, add it to the model, and run the simulations.

In the following, we briefly discuss changes made to the profiles and the difference in the workflow for the SA. The full implementation details and the source code are available in the git repository of the parallel AT catalogue.

### **7.6.1. Profile Extension**

To include the performance curves in the parallel ATs, we first need to alter the profiles of each AT. Figure 7.9 shows the final profile, using the parallel loop AT as an example.

We include three new enum types, enabling the SA to choose whether to use a custom overhead function, a custom performance curve, a pre-defined performance curve, or no overhead model at all. The first enum defines


**Figure 7.10.:** Property View of the Applied Parallel Loop AT

whether to use a performance curve or not, the second enum specifies which demand-type curve to choose, and the third one whether to use the performance curves for virtual or dedicated hardware. Further, we add the required fields for a custom performance curve.

### **7.6.2. Workflow Adaptation**

To use the performance curves, the SA first needs to model the software, hardware, and usage model as normal. Next, the SA needs to apply a parallel AT from the parallel pattern catalogue. Figure 7.10 shows the property view of the applied AT. Here, the SA can choose to use a performance curve and picks the desired curve for his resource demand and hardware type. If desired, he can also input his own performance curve.

After setting all properties, the SA can run the simulation using experiment automation. Within the QVT-o transformation, the properties are interpreted, the correct performance curve is picked, and the overhead is added.

### **7.6.3. QVT-o Transformation**

Running the simulations with the AT method extension will call the QVT-o script of the parallel AT and trigger the m2m transformation before the actual simulation takes place.

We altered the QVT-o scripts to automatically calculate the correct overhead by picking the right overhead function for the given configuration. The calculation of the overhead happens according to the example given above (see Section 7.5.5). We transform the time units into a resource demand and add it as overhead by adding an internal action to the model.

The source code of the QVT-o implementation and the code for the performance curves are available online in our git repository<sup>7</sup>.

### **7.7. Evaluation**

In the following section, we evaluate the performance curves using a set of SPEC benchmarks. To do so, we describe the experimental setup and the method in the first part. Later, we report on the results.

### **7.7.1. Method**

To research the usability of the performance curves, we compare the performance predictions to the measurements taken from real executions. To cover a broad set of scenarios, we use SPEC benchmarks. SPEC offers three benchmark suites for parallel applications: MPI2007, OMP2012, and ACCEL. OMP2012 uses OpenMP implementations of 13 different applications, which cover a comprehensive set of application types. ACCEL focuses on GPUs, and therefore uses OpenCL implementations. MPI2007 uses MPI as a means to parallelise and focuses on HPC systems. Thus, ACCEL and MPI2007 do not fit our domain, and we decide to use OMP2012.

To compare the measurements with the predictions, we first group the applications within the benchmark suite according to the demand type we assume they have. Thereby, we use the documentation provided by SPEC. To give an example, the documentation of the benchmark bt331 reads as follows: "BT is a simulated CFD application that uses an implicit algorithm to solve 3-dimensional (3-D) compressible Navier-Stokes equations.

<sup>7</sup> https://github.com/PalladioSimulator/Palladio-Addons-ParallelPerformanceCata logue

The finite differences solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are Block-Tridiagonal of 5x5 blocks and are solved sequentially along each dimension."<sup>8</sup> Because the characteristics are similar to the MatrixMultiplication demand, we assign bt331 to the group of MatrixMultiplication. Table 7.4 shows the mapping of the benchmark applications to the expected demands.


**Table 7.4.:** Mapping of benchmark applications to expected demand types

After the mapping, we execute all benchmarks on our hardware. Thereby, we increase the number of worker threads step by step, from one up to twice the number of physical cores<sup>9</sup>. Figure 7.11 shows the speedup curves for all benchmark applications within the benchmark suite. We can see that the maximum speedup for each application varies from 7 to 44. Further, we can see different behaviour characteristics for all applications.

<sup>8</sup>https://www.spec.org/auto/omp2012/Docs/357.bt331.html

<sup>9</sup>Unfortunately, we were not able to run the benchmark ilbdc due to technical issues.

**Figure 7.11.:** Speedup Curves for the Applications from the OMP2012 Benchmark Suite

Next, we model the scenario using the parallel architectural template catalogue and the performance curves in Palladio. Our models consist solely of a single parallel loop and one internal action. We use the measurements from the sequential run to calibrate the CPU resource demand for the internal action, and specify the parameters of the parallel loop accordingly (e.g., the number of worker threads and the demand type). In a final step, we compare the measurements from the execution with the simulation results.

Due to the extensive runtime of the benchmarks, we are only able to test one hardware setting. Therefore, we choose the more comprehensive system in Potsdam (40 physical cores—see Table 7.2), since it is a mid-range system and covers the characteristics of the smaller system and the machine in Stuttgart.

In the following, we present the results. For the sake of understandability and presentation, we show only the accuracy of the predictions and not the actual runtimes. All measurements, simulation results, raw data, and performance curves are available online<sup>10</sup>.

<sup>10</sup>https://doi.org/10.5281/zenodo.4081091

### **7.7.2. Results**

In the following, we discuss the results of the experiment and the simulation. We will not present any raw data, but rather focus on the processed data. Table 7.5 shows the individual benchmarks from left to right. From top to bottom, the different numbers of worker threads are displayed. Further, we distinguish between the pure Palladio approach (top) and the Palladio approach using ATs and performance curves.

Each cell contains information about the inaccuracy of the approach. Thereby, we compare the simulated runtime with the measurements and calculate the prediction error as follows:

$$P_{Error} = \frac{t_{predictionTime} - t_{runtime}}{t_{runtime}} \cdot 100 \tag{7.1}$$

The closer the number is to zero, the more precise the prediction. For example, if we measure a runtime of 100 and have a prediction of 80, the prediction error is −20%. The minus sign indicates that the prediction underestimates the runtime.
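Expressed as code, the metric is a one-liner; the example from the text serves as a sanity check:

```java
/** Prediction error in percent (Equation 7.1); negative values indicate underestimation. */
static double predictionError(double predictionTime, double runtime) {
    return (predictionTime - runtime) / runtime * 100.0;
}

// predictionError(80, 100) == -20.0, matching the example above.
```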

In our goal 3, we aim for performance predictions that do not differ by more than 40% from the actual measurements. Thus, we colour cells with an inaccuracy below 40% green. As we see, the pure Palladio approach is accurate for a low number of worker threads. However, it becomes more inaccurate for higher numbers.

In contrast, the performance curves approach is able to satisfy our 40% limit in half the cases. Further, Table 7.6 shows the increase in accuracy compared to the pure Palladio approach. We calculated these values by the following equation:

$$
\Delta P_{Error} = |P_{Error}(Palladio)| - |P_{Error}(PerfCurve)| \tag{7.2}
$$

We can use simple subtraction to calculate the delta in the prediction error since the divisor is the same for $P_{Error}(Palladio)$ and $P_{Error}(PerfCurve)$. However, for the same reason, we can only compare the results within a column with each other and cannot compare values from different columns.

As we see, we can increase the accuracy significantly for a large number of worker threads (more than ten), and even for a low number we are more precise in eight out of twelve cases. In addition to the tables, we provide a visualisation of the values for the best and the worst scenario in Appendix A.4.4.

Please note that the values in Table 7.6 are only intended to show in which cases the performance curves perform better and in which cases they perform worse than the pure Palladio approach. Due to the nature of relative values, and the fact that each column has a different divisor, a comparison of the values is not valid.

Overall, the measurements show that the use of performance curves substantially improves the accuracy of performance predictions. However, they also show that we have not yet captured all PPiFs. Especially demands which show a low speedup, and thus are hard to parallelise, are not yet fully captured by the performance curves. Identifying additional PPiFs, measuring their influence, and deriving more precise performance curves is still an open challenge and remains for future work.

At this point, we can present a total of twelve performance curves which already greatly improve the performance prediction capabilities of tools like Palladio. Further, we provide integration into Palladio. Thus, we enable the SA to efficiently use the performance curves and benefit from more accurate prediction results.

### **7.8. Assumptions & Threats to Validity**

To conclude our results from the evaluation and to put the results in perspective, we discuss assumptions made and threats to validity in the following. Therefore, we list each assumption or threat and discuss it in detail.

**Monitoring Overhead:** During the execution of all experiments, we monitored only one PPiF at a time (e.g., response time, L1, L2, L3, etc.). Thereby, we use different tools to monitor the runtime (e.g., perf or PAPI). The usage of these tools puts overhead on the system and might influence performance factors, or even have an impact on the


**Table 7.5.:** Shows the inaccuracy of the pure Palladio approach in comparison to the Palladio + Performance Curve approach for the SPEC OMP2012 Benchmark set.


**Table 7.6.:**Shows the accuracy gain of the performance curve approach in comparison to the pure Palladio approach

PPiFs under review. Thus, we were able to observe higher runtime and worse speedup when monitoring memory behaviour.


because the overhead could also come from waiting conditions. Thus the simulations might show higher CPU utilisation than the actual system.

**Over-interpretation of Results:** In Table 7.6, we indicated the accuracy gain of the performance curves in contrast to the pure Palladio approach. Thereby, we subtracted the relative values in Table 7.5, which is theoretically possible because the divisor of both values is the same. We decided to display the values in Table 7.6 in this way to give an impression of the number of cases in which the performance curves perform better. However, a comparison of the values from different columns would lead to wrong conclusions or over-interpretation of the results, because each column has a different divisor (see Section 7.7.2).

### **7.9. Summary of CB2**

In this chapter, we researched performance curves for parallel applications in multicore environments. Thereby, we worked on the fulfilment of requirements , , and .

In the course of the chapter, we first performed a structured literature review in combination with expert interviews to identify the most relevant PPiFs. Next, we conducted extensive experiments to (a) evaluate the impact of PPiFs on performance and (b) collect measurements to extract performance curves. During the experiments, we researched different hardware environments and parallelisation paradigms as well.

As a result, we present 14 lessons learned from the experiments. Additionally, we deliver a set of twelve performance curves to the SA. The performance curves represent the most relevant software behaviours. Combining the performance curves with performance prediction approaches such as the PCM, we show that the accuracy of parallel application predictions increases greatly. Thus, we provide an instrument to the SA that helps to improve the accuracy of model-based performance predictions on an architectural level for parallel applications in multicore environments.

To evaluate the performance curves, we use a standardised benchmark suite—SPEC OMP2012—and compare the predictions from Palladio (containing the

performance curves) with the measurements we took from executing the benchmark on a medium-sized multicore environment. We show that the performance curves increase the accuracy for all cases in which we use a high number of worker threads (equal to the number of virtual cores) and, in 19 out of 24 cases, for a low number of worker threads—when compared to the default Palladio approach.

In a nutshell, we are able to answer our research question as follows:

**2.1:** How do highly parallel applications behave in massive parallel environments (multicore systems) regarding response time (speedup), memory access rates (L1, L2, L3, RAM usage), and memory bandwidth utilisation?

**Answer:** In over 800 experiments we took 70,000 measurements. Thereby, we monitored the response time and memory accesses of the systems. Using these measurements we extracted the twelve performance curves given in Table 7.3 to describe the behaviour.

**2.2:** What factors influence performance the most in highly parallel applications?

**Answer:** In Table 7.1 we listed the top eight performance-influencing factors we identified via structured literature review, expert interviews, and our experiments.

**2.3:** Does the choice of parallelisation strategy have a significant impact on behaviour?

**Answer:** The experiments show slight differences in the performance of the individual parallelisation paradigms. However, these differences are not significant for all thread-based paradigms. The only paradigm that diverges is the AKKA Actors implementation. Here, we assume issues in the framework's implementation.

**2.4:** Do highly parallel applications show similar behaviour, which can be described by one or multiple performance curves?

**Answer:** In Table 7.3 we present performance curves for all the researched resource demands. We used linear regression to extract the curves from the measurements. Thus, the curves describe the average behaviour for each demand type on all the tested machines.

**3.2:** What are the missing characteristics of software behaviour that must be included in performance prediction models (performance-influencing factors) to enable simulation-based performance prediction approaches to accurately predict the performance of parallel applications?

**Answer:** Table 7.1 shows the top eight performance-influencing factors, obtained from a structured literature review, expert interviews, and our experiments.

Finally, we can verify or falsify our hypothesis as follows:

**2.1:** The speedup and performance behaviour of highly parallel applications depends heavily on the chosen parallelisation strategy or paradigm.

**Reject:** The choice of parallelisation strategy does not have a high impact on behaviour.

**2.2:** The hardware architecture (e.g., number of CPU cores, memory bandwidth, memory hierarchies) of the execution environment has a strong impact on the performance of the parallel applications.

**Accept:** We measured differences in the normalised speedup for all the machines. Thus, we can verify that the hardware architecture has an impact on the performance. The biggest noticeable difference is between virtualised hardware and dedicated systems. Virtualised hardware shows worse performance.

**2.3:** The speedup of a parallel application is not only influenced by the number of cores available in a system but also by additional hardware-specific performance-influencing factors.

**Accept:** In Table 7.1 we listed the top eight performance-influencing factors we identified.

### **8. CB3: Meta-Model Extension for the PCM to Include Memory Architectures**

Single-metric hardware performance models, which only consider CPU speed as a relevant characteristic, have proven insufficient. Therefore, we research the effect of additional metrics like memory architecture, hierarchies, and bandwidth in this chapter and focus on the requirements , , and .

In the previous chapter, we researched the influence of worker threads, core utilisation, resource demand type, and parallelisation paradigms. Thereby, we observed the cache behaviour and cache accesses. In this section, we continue to research the next PPiFs: memory design and memory bandwidth (see Table 7.1).

To do so, we extend the PCM, adapt the solvers, and update the editors of the Palladio bench. Thereby, we follow the research process illustrated in Figure 8.1.

In the course of this chapter, we first define the problem space and the research goal, introduce the idea behind the approach, and set the evaluation criteria. Next, we research the problem space of memory hierarchies to identify relevant elements to include in the meta-model. Afterwards, we discuss meta-model extension strategies and perform the extension. To support the new meta-model features, we extend the editors (tree editor and Sirius editor) and the simulators (SimuLizar). Finally, we evaluate the approach in an experiment-based manner and compare the new performance predictions to the earlier ones without consideration of memory bandwidth.

**Figure 8.1.:** Overview of the Research Method for Contribution CB<sup>3</sup>

As a contribution, we (1) give detailed insights into the behaviour of parallel applications, (2) provide a meta-model extension for the PCM, which includes memory hierarchies, (3) provide a SimuLizar extension to simulate memory hierarchies, and (4) lay out four modelling approaches with different strengths and weaknesses.

As a result, we can show that the four memory model approaches increase the performance prediction accuracy of Palladio. Each model works exceptionally well under certain circumstances. Overall, we present the cache-line model, which has the best overall performance prediction power and increases the accuracy by up to 57%. Thus, in the best case, the prediction error is below 15%.

Significant results of this chapter were acquired while collaborating on the supervision of the student theses [Gru19] and [Tru20]. Further, insights from the first two steps of the research method were reviewed and published in [GF19].

We make all data, meta-models, code, and plugin extensions available online:

**Section 8.3** Meta-Models and Palladio Plugin: https://github.com/PalladioSimulator/Palladio-Addon-MemoryHierarchy

**Section 8.4** Evaluation and Results: https://doi.org/10.5281/zenodo.4094588
### **8.1. Problem Space**

In this chapter, we research the impact of the memory architectures of a multicore CPU on the overall performance. Thereby, we follow the hypothesis that for modern complex multicore CPUs, not only the clock rate but also the memory bus is a bottleneck. Further, we assume that the sizes and utilisation of the caches have a significant impact on the overall performance of highly parallel applications [BDH08; BKR09; FBKK19; FH16].

**Prototype—Using Network-Links as Memory Bandwidth Model:** In [GF19], we researched the impact of a simple memory model by using network links to emulate the data transfer. By observing and measuring the memory utilisation of a real application, we were able to calibrate the model and to improve the performance predictions by up to 26% for a 16-core machine.

The insights from the prototype encourage us to further investigate memory hierarchies and to properly include them into the PCM.

**Research Goal and Idea:** The research in this chapter focuses on the fulfilment of two goals:

**Goal 1:** SAs shall be able to model memory hierarchies in the hardware model and specify the memory access behaviour in the software model.

**Goal 2:** Given these additional model elements, the solver shall give more accurate performance predictions for parallel applications.

Since we know that the memory bus has a great impact on the performance [GF19], we aim to include a concept very similar to network links for the memory bus. However, we need to face and overcome the following challenges:


**Research Method:** To achieve our goals, we continue to follow the experiment-based performance model derivation method [Hap08] and iteratively extend the meta-model. To evaluate the results, we compare the current PCM and solvers with the extended ones and compare the simulation results to the results from the experiments for our running use cases. Thereby, we only focus on a single evaluation criterion:

**1:** The accuracy of the new performance predictions needs to be better than the current ones—better meaning closer to the real measurements from the experiments.

### **8.2. Meta-Model Extension**

In the following, we describe the meta-model extension in detail. To achieve the final model, we follow the method of experiment-based performance prediction [Hap08]. That way, we go through the process of modelling four times, adapting solvers and editors, and evaluating the results [Tru20]. Nevertheless, we describe only the final result in the following.

First, we will research the required model elements. Next, we discuss different strategies to extend a meta-model, along with their advantages and disadvantages. Finally, we describe the changes to the PCM in detail.

### **8.2.1. Meta-Model Elements**

To identify the required model elements, we use the identified PPiFs (see Table 7.1) as a starting point and choose to include caches, main memory, and the memory bus. At first, it seems reasonable to include all PPiFs, along with all attributes, and thus have a model which is as close as possible to the real-world objects. However, such a meta-model would increase the complexity considerably, and the SA would not be able to handle the architectural design.

Thus, we follow the general definition of modelling by [Sta73] and the goal-driven modelling approach by Koziolek for qualitative modelling [RBH+16]. Therefore, we define the following three properties for our modelling approach:


In the following, we describe each of the PPiFs we consider, and give the first set of relevant attributes for the meta-model:

	- **Size:** The cache sizes are an important factor for the cache effectiveness. However, considering the size for performance prediction would lead to a full cache simulator, which can easily become very complex, not only to implement, but also for the SA to specify. Therefore, we decided not to consider the cache size in our models, but to focus on the cache hit or miss rate. Thus, we abstract the cache behaviour.
	- **Hit-rates:** The cache hit-rate gives the probability that a cache request will be fulfilled (e.g., 40%). In case of a cache hit, we assume an immediate delivery of the results with no delay. In case of a cache miss, the next cache has to be queried, and the cache updates its cache page, which puts additional demand on the bus.
	- **Page-size:** The size of the cache page is relevant to specify because in case of a cache miss, the whole cache page will be updated and needs to be fetched from the main memory. Thus, each cache miss puts additional demand on the memory bus (see the sketch after this list).
	- **Type:** Caches can be shared or private. Common architectures have private L1 and L2 caches, while the L3 cache is shared [Sch08]. Only a single core can access a private cache. Shared caches can be accessed by multiple or all cores.
	- **Latency:** Memory latency is the time between initiating a request for data and the beginning of the actual data transfer. In our models, we neglect the latency, because we assume that latencies are very low and do not have a major impact on the overall performance. However, we include it in the meta-model for use in the future.
	- **Bandwidth/Throughput:** Describes the maximum throughput of the fully utilised bus—the burst rate (e.g., 12 GB/s).
	- **Dynamic:** Due to the architecture and composition of cores, caches, memory, and bandwidth, the maximum throughput of the memory bus can vary with the number of cores used. In general, the overall throughput increases with the usage of more cores, due to additional resources (e.g., buses) becoming available. This is especially true if a new NUMA node is utilised (see Section 2.2). Since we do not consider unique architectures like ccNUMA, but want to provide an abstract model, which the SA can use for all kinds of architectures, we need to provide an abstract attribute to specify this behaviour.
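To illustrate how the hit-rate and page-size attributes interact, the following sketch derives the expected additional bus demand that a single memory request places on the link behind a cache. It is a simplified back-of-the-envelope model, not our simulator code:

```java
/**
 * Expected additional demand (in bytes) that one memory request puts on the
 * bus behind a cache: a miss transfers a full cache page.
 */
static double expectedBusDemandBytes(double hitRate, long pageSizeBytes) {
    return (1.0 - hitRate) * pageSizeBytes;
}

// Example: a 40% hit rate and 64-byte pages yield 0.6 * 64 = 38.4 bytes per request on average.
```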

To clarify the choice of PPiFs and attributes and to further follow the argumentation line, we have to explain a set of assumptions we made.


Since we consider the modelling of NUMA nodes too complex, we ignore these effects for now. Later, we will have to re-evaluate this decision.


### **8.2.2. Meta-Model Extension Strategies**

Before we start to model the above elements into the meta-model, we first need to discuss extension strategies. In general, there are two possible ways to extend the PCM.

The first is a full PCM extension. In a full meta-model extension, we model the changes and new elements directly into the PCM. After altering the PCM, we need to release a new version. Advantages of this approach are that all model elements are in one place, and it is straightforward and easy to follow. However, on the downside, a new release of the PCM has a long-range impact. For example, we cannot guarantee that all solvers and tools can handle the new version.

Therefore, we favour a second approach, in which we use Profiles and Stereotypes [FV04] to extend the PCM. That way, we can model our memory model and all elements in a separate model. Using profiles and stereotypes, we can link our new model elements into the PCM. The advantages of this approach are that we do not need to release a new PCM version, but can provide the memory models and profiles as a plugin. Solvers and tools which cannot handle the new elements simply ignore them. Thus, we can guarantee backward compatibility. On the downside, this approach becomes unusable if we need to alter many already existing elements. However, in our scenario, this is not the case.

### **8.2.3. Hardware Model Extension**

In the following, we detail the extension of the meta-model. For this, we first look at the extension of the meta-model to enable the SA to model the hardware characteristics in the hardware model. To do so, we pick an entry point and lay out the meta-model. In the next section, we explain the changes to the workflow and the adaptations of the software model. In the software model, the SA needs to specify the memory behaviour, e.g., memory accesses.

### **8.2.3.1. Entry Point**

Given the current version of the PCM (version 4.2.0), we identify multiple elements which we can use as an entry point for the extension:

**ProcessingResourceSpecification:** The ProcessingResourceSpecification contains the information on the processing resources. This entry point is suitable because we can reuse the predefined resources CPU, HDD, and Delay and add our processing type for the memory hierarchy. However, to fully support the characteristics of memory hierarchies, we need to add elements for the hierarchical structure. Further, a ProcessingResourceSpecification requires a processing rate and a scheduling policy, which do not apply to memory elements. To avoid ambiguity in our models, we decided against the ProcessingResourceSpecification.

**ResourceContainer:** The resource container is the more general model element. It can contain a ProcessingResourceSpecification and other hardware-related characteristics. From the modelling aspect it has no disadvantages. Thus, we choose it as an extension point.

As described in the previous section, we choose a profile-based extension strategy. Given the ResourceContainer as a starting point, we can now start to model our meta-model extension. Figure 8.2 shows the profile applied to the ResourceContainer. It maps our meta-model extension (MemoryHierachyMetamodel) into the already existing PCM element (ResourceContainer). In the next section, we explain the memory meta-model mapped into the resource container.

**Figure 8.2.:** Overview of the Profile Extension of the ResourceContainer

### **8.2.3.2. Modelling the Memory Hierarchy**

**Design Rationale** During the design of the memory meta-model, we follow the Palladio design principles and approaches. Therefore, we align our modelling to existing elements or reuse them if possible. This brings two benets: First, the SA is familiar with the modelling concept; second, the simulation logic of existing elements can be reused or adapted only slightly.

When analysing the PCM (Version 4.2), we identified two elements which we can reuse:


In the following, we introduce the meta-model extension. Thereby, we focus only on the extension part—the memory architecture.

**Memory Meta-Model** Figure 8.3 shows the final meta-model extension. At the top of the figure, the MemoryHierarchyContainer represents the top-level element and the entry point. Each container can have multiple MemoryHierarchyResourceEnviroments. Each environment consists of multiple memory elements (i.e., caches or main memory) and a set of connections (i.e., the memory bus). Further, each environment has an entry point, which defines the entry point of the memory architecture. Both the starting point and the memory elements are of the type LinkableMemoryHierarchyResources.

**Figure 8.3.:** Meta-Model Extension Containing the New Elements for the Memory Hierarchy

We decided on this way of modelling because (a) it aligns with the network link layout, and (b) we are more flexible for extensions and further adaptations.

The memory element has two attributes. The cacheHitRate describes the probability that a request results in a cache hit. The isPrivateCache defines whether the cache is private or shared with other elements in the architecture. A memory element is connected to another element via a MemoryHierarchyLinkingResource. A linking resource always connects two linkable memory resources—one as the successor and one as the predecessor. Further, the linking resource has a specification.

The MemoryHierarchyLinkingResourceSpecification has in total four attributes, which are all adapted from the network linking resource. The number of replicas defines the total number of buses available. The latency describes the time between the initial request for data and the data transfer; the throughput describes the maximum data transfer capacity of the link.
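For illustration, the extension can be pictured as the following plain Java structure. The actual extension is defined as an Ecore meta-model; the classes below merely mirror the attributes described above and are not the real implementation:

```java
/** A cache level or the main memory. */
class MemoryElement {
    double cacheHitRate;    // probability that a request hits this level
    boolean isPrivateCache; // private to a single core, or shared
}

/** The bus connecting two linkable memory resources. */
class MemoryHierarchyLinkingResource {
    MemoryElement predecessor;
    MemoryElement successor;
    MemoryHierarchyLinkingResourceSpecification specification;
}

class MemoryHierarchyLinkingResourceSpecification {
    int numberOfReplicas; // total number of buses available
    double latency;       // time between the request and the start of the transfer
    double throughput;    // maximum data transfer capacity of the link
}
```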

With this extension, the SA is now able to specify the memory characteristics in the hardware model. Next, the SA needs to specify the memory behaviour in the software model.

### **8.2.4. Modelling Memory Behaviours**

To utilise the memory architecture specified in the hardware model, the SA has to set the memory behaviour in the software model as well. In the following, we discuss the extension of the software model and the adaptation of the workflow.

### **8.2.4.1. Resource Demanding Calls**

To specify resource demands (e.g., CPU, HDD, or memory demands), the SA requires a model element that can name specific resources. In the PCM, these elements are named Calls and are specified in the SEFF (see Section 2.4.2.1). For the memory resource demand, we evaluate existing calls to check their reusability. In total, we evaluate six call actions:


in Java applications. However, since we are not interested in the memory consumption of the system, but more in the delay generated by memory architectures, we do not consider these calls further.


Given the discussion above, we have to choose the abstraction level on which we want the SA to model the memory behaviour:


<sup>1</sup> https://jira.palladio-simulator.com/browse/PALLADIO-32

**high:** If we want the SA to specify the memory demand abstractly and only enable her to define the demand in a parametric manner, the internal action is the best choice.

Because we are convinced that the separation of reads and writes is essential when researching the performance impact of memory architectures, we chose in favour of the resource call.

### **8.2.4.2. Integration of Memory Calls into the SEFF**

In the current version of Palladio, it is not possible to modify an existing resourceCall action to handle customised behaviour inside the simulation, so we have to implement a workaround. Thus, we use child extenders (or sub-classing)<sup>2</sup>, which we can use non-invasively (i.e., without changing the PCM) to create a clone of the resourceCall, which we use within the simulation to implement our custom behaviour. At the same time, we propose a code change<sup>3</sup>, which enables the customisation of resourceCalls. Therefore, this workaround should be just a temporary solution.

### **8.3. Adaptation of PCM Solvers**

Enabling the SA to model the memory hierarchies in the hardware model and the memory behaviour in the software model is only part of the solution. In the next step, we need to adapt the PCM solvers so that they can interpret and analyse the new model elements. Palladio contains a number of different solvers (see Section 2.4.2.1). In the following, we briefly describe the adaptation of SimuLizar, the current default simulation-based solver. We give only a high-level description of the process and the implemented behaviour. We do not give detailed information on the implementation. For this, we refer to [Tru20] and the code<sup>4</sup>.

<sup>2</sup> https://ed-merks.blogspot.com/2008/01/creating-children-you-didnt-know.html 3 https://jira.palladio-simulator.com/projects/SIMUCOM/issues/SIMUCOM-97?filte r=allopenissues

<sup>4</sup> https://github.com/PalladioSimulator/Palladio-Addon-MemoryHierarchy

To realise the recognition of the memory hierarchy, we use the observer extension point. All observers, like ResourceEnvironment, ResourceContainer, NetworkLinks, and ProcessingResources, are called, and representative Java classes are created. These Java classes are all stored in the model registry class, which can be used to look up and access these elements during the simulation.

The PCMStartInterpretationJob—the simulation entry point inside SimuLizar—consists of two phases: (1) set-up and (2) simulation. During the set-up, the initialise() method of all classes that use the model observer extension point is called. In this phase, the MemoryHierarchyObserver class looks for ResourceContainer elements that have the ResourceContainerWithMemoryHierarchy stereotype applied. Next, it searches, creates, and stores objects representing the modelled memory hierarchy structure in a MemoryHierarchyRegister class. The MemoryHierarchyRegister stores all necessary information about the memory hierarchy structure. Therefore, we use the register to look up all the required memory hierarchy information during the simulation.

In the simulation, the memory demand is specified with the InternalActionWithMemory model element, which is a subclass of the InternalAction. The only difference is defined in the memory hierarchy ecore model—not the PCM ecore model. That way, and with the help of the rdseff-switch extension point, which can delegate the interpretation of a call that is not inside the SeffPackage to other plugins that support this extension point (e.g., the InternalActionWithMemory), the call is delegated to the MemoryHierachyCallAwareSwitch of the memory hierarchy. There, we can process the call as we desire.

In short, the MemoryHierarchyObserver is used to search for ResourceContainers that contain memory hierarchy elements. Additionally, this class is responsible for creating the necessary objects for the simulation and for storing them in the MemoryHierarchyRegister. During the simulation, the memory demand is reduced based on the hit/miss rate, and the updated demand is then simulated through the next MemoryHierarchyLinkingResource. Unfortunately, we cannot reuse the NetworkLink implementations, because SimuLizar has no support for them. Thus, we use the NetworkLink code from SimuCom to implement the MemoryHierarchyLinkingResource.
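The essence of the simulated behaviour can be sketched as follows: at each level, the hit rate reduces the demand, and the remainder travels over the link towards the next level. This is a deliberately simplified view of the interpreter logic; all names are illustrative:

```java
import java.util.List;

/** Simplified level of the hierarchy (e.g., L1, L2, L3); the link fields describe the bus towards the next level. */
record MemoryLevel(double hitRate, double linkLatency, double linkThroughput) {}

/** Returns the simulated transfer time for one memory demand (in bytes). */
static double simulateMemoryDemand(double demandBytes, List<MemoryLevel> hierarchy) {
    double remaining = demandBytes;
    double time = 0.0;
    for (MemoryLevel level : hierarchy) {
        // the fraction hitting this level is served immediately with no delay;
        // the rest is forwarded over the link to the next level
        remaining *= (1.0 - level.hitRate());
        if (remaining <= 0) {
            break;
        }
        time += level.linkLatency() + remaining / level.linkThroughput();
    }
    return time;
}
```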

In contrast to NetworkLinks, the MemoryHierarchyLinkingResource does not do round trips. To model the memory hierarchy, each core has its own link to the L1 and L2. For example, if we consider a 96-core server, a total of 192 linking objects are created during the simulation phase. To guarantee performant simulations, we added a modified version of the FCFS scheduler, which can simultaneously handle multiple instances, instead of only one at a time.

### **8.4. Adaptation of Modelling Editors**

While the default Eclipse tree editors are part of Eclipse EMF and provide an out-of-the-box approach to edit the memory model, we aim to include the modelling in the Palladio workflow. Because Palladio uses Sirius<sup>5</sup> to visualise the PCM graphically, we need to adapt the Sirius editors as well. In the following, we briefly describe the changes we made. Thereby, we follow the Palladio style guides<sup>6</sup>.

To extend the editors, we create two new plugin projects. One contains the .odesign file. The other has additional Java code to perform more complex actions.

In the .odesign file, we have to specify two elements (see Figure A.22): the graphical elements and the tools. The graphical elements contain the nodes and edges. Here we define, e.g., the memory cache element and the memory predecessor and successor links. The tools define the actions the editor can perform on the model elements, e.g., double-click, creating new elements, etc. (see Figure A.23).

Additionally, we use external Java actions to provide more complex editor features. For example, we use a dialog view to let the SA specify the throughput with the help of a stochastic expression (see Figure A.24).

To use the extended editor and diagram, the SA needs to enable the correct viewpoint (i.e., SeffWithMemoryHierarchy).

<sup>5</sup> https://www.eclipse.org/sirius/

<sup>6</sup> https://sdqweb.ipd.kit.edu/wiki/PCM\_Development/Sirius\_Editors

### **8.5. Evaluation of PCM Extension**

To assess the usability of the memory model extension, we use an experiment-based approach based on the matrix multiplication example (see Chapter 5). Figure 8.4 gives an overview of the evaluation approach.

**Figure 8.4.:** Overview of the Evaluation Process for CB<sup>3</sup>

In a first step, we execute Memtest86<sup>7</sup> on our target hardware. This way, we get information about the memory bandwidth, which we use to calibrate our performance model (i.e., the hardware specification). Next, we implement and execute the matrix multiplication example in our test environment. Thereby, we execute different versions with different matrix sizes. In this step, we also monitor the cache hit rates, which we use to calibrate our models further. In a common performance prediction process, the SA does not have this information at hand and needs to estimate the cache hits. But since we want to evaluate the performance models, we decide to use the information at hand to reduce the error-proneness. Finally, we simulate the models and compare the measurements with the predictions from the simulations.

<sup>7</sup> https://www.memtest86.com/

Since we are following the method of experiment-based performance prediction [Hap08], we iterate multiple times over the process of performance model creation. In total, we have four iterations, and in each one we create a performance model with specific properties. We present and discuss all models in the course of the evaluation.

We provide all results, raw data, and performance models in an online repository<sup>8</sup>. Further, we provide the extension as a Palladio plugin<sup>9</sup>.

### **8.5.1. Experiment Setup**

To set up the experiment, we first implement the matrix multiplication use case given the characteristics in Chapter 5 and the implementation we used in [FH16]. To parallelise the application, we use Pyjama [GS13], as we did in the previous chapter. We decided against using the synthetic demands from ProtoCom, because this time we want to provoke as many inter-thread communications as possible.

We executed the implementation on the three dedicated systems described in Table 7.2. We did not use the bwUniCluster: due to the virtualised environment, it is not possible there to run perf or to collect the performance properties we need for calibrating the model.

On each system, we perform multiple runs of the experiment. In each run, we change the number of worker threads, starting with one (sequential run) and increasing the number stepwise, up to twice the number of physical cores. Additionally, we consider two different matrix sizes. In the first scenario, we multiply matrices with a dimension of 3000x3000. In the second scenario, we consider a much larger matrix of 7000x7000. We use this scenario to guarantee that a matrix does not fit into the caches. The system in Stuttgart has a particularly large cache space, so to force main memory accesses, we use larger matrix sizes.
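Assuming double-precision matrix elements of 8 bytes each (an assumption, since the element type is not essential here), a rough estimate of the per-matrix footprint shows why the two sizes were chosen:

$$3000^2 \cdot 8\,\text{B} \approx 72\,\text{MB} \qquad\text{vs.}\qquad 7000^2 \cdot 8\,\text{B} \approx 392\,\text{MB}$$

Even a single small matrix exceeds typical L3 cache sizes; the large matrices rule out cache residency even for the particularly large caches of the Stuttgart system.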

For each configuration and scenario, we execute multiple runs (100 for the small and 50 for the large scenario) to eliminate variances and side effects. Due to the low standard deviation, we only consider the mean value in

<sup>8</sup> https://doi.org/10.5281/zenodo.4094588

<sup>9</sup> https://github.com/PalladioSimulator/Palladio-Addon-MemoryHierarchy

the following. Further, we recorded all performance counters during the execution using perf.

### **8.5.2. Model Calibration**

For modelling and simulating the use case, we use the PCM nightly version (pre-release PCM 4.3.0), Eclipse 2019-09 Modelling Tools, and OpenJDK 11.0.2 on a Windows 10 machine with 16 GB RAM and a 4 × 3.2 GHz Intel CPU.

Further, we reused the model from [FH16] and [Gru19], made slight modifications, and applied the required calibration. In the following, we describe the modifications and the calibration of the memory hierarchy model:

**Repository Model:** Since we use the same example as in [FH16], we can reuse the repository diagram completely. The most relevant element in the repository model is the MatrixMultiplicationComponent, which provides the method multiplyMatrix. As we use the resourceCall, we additionally need to specify the resourceCallRole for the MatrixMultiplicationComponent. We store the required resource for the call in the MemoryHierarchyPlugin, and we can access it via the pathmap mechanism. Figure 8.5 shows the model for the repository diagram.

**Figure 8.5.:** Repository Model for the Matrix Multiplication Use Case

**SEFF Model:** Inside the SEFF diagram, we specify the actual behaviour of the multiplication. Since we use different hardware and thread numbers than in [FH16], we need to remodel this diagram—but keep the concept. We assume that the Pyjama implementation of OpenMP splits the load equally on all threads. Thus, we use a fork action, which contains 192 ForkBehaviours. Each behaviour includes a fraction of the actual load. Since the manual modelling of all 192 ForkBehaviours is time-intensive and error-prone, we can use the parallel loop AT from the parallel pattern catalogue (see Chapter 6). We use the measurements from the sequential run to calibrate the CPU demand. Thereby, we separated CPU demands as well as possible from memory hierarchy demands. To achieve this, we also used the measurements we gained from perf (see Appendix A.5.2 for more information).

Additionally, we specify the resource call for the memory behaviour here and use the values provided by perf. Figure 8.6 shows the model for a two-threaded application using the fork action.

**ResourceEnvironment Model:** The modelling of the resource environment is straightforward and follows the example of [FH16]. However, we decided against using the exact schedulers from [Hap08], because for short response times, the exact scheduler implementation always adds a constant of 100 to the simulation results. That can affect the simulation accuracy too much, especially for low prediction values.

Most important is that we add the stereotype for the memory hierarchy here.

**MemoryHierarchy Model:** The memory hierarchy model contains the new diagram type we included to model the memory hierarchy. Thus, we need to model it from scratch. Figure 8.7 shows the final model.

We have to specify all attributes for the model elements identied in Section 8.2.1. We calibrate the values as follows:

**Cache hit rate:** To calculate the hit rate, we use the measurements from perf. Since the cache hit rate varies for each configuration of worker threads, we have to adjust the value for each experiment.

**Figure 8.6.:** SEFF Model for the Matrix Multiplication Use Case with Two Threads.

**Cache isPrivate attribute:** We establish whether a cache is private or shared from the CPU specification. In our case, L1 and L2 are private and L3 is shared.

**Figure 8.7.:** Memory Hierarchy Model for the Matrix Multiplication Use Case


With the calibrated model we described, we can perform the simulations. In the next section, we compare the simulation results and the measurements and discuss the outcome.

### **8.5.3. Results**

To make the process of how we obtained the results and the designs comprehensible, we discuss the outcome of the modelling and simulation for each iteration. In total, we have four iterations; iteration 0 describes the state before we include the memory model, and iteration 3 a complex model with only a low accuracy increase. After the discussion of the four iterations and models, we compare all models in the next section. To better follow the result description for the individual iterations, we refer to the figures of the comparison in the next paragraph (see Figure 8.8).

### **8.5.3.1. Iteration 0: Default Palladio Model**

**Overview:** The default Palladio model contains no memory attributes and represents our starting point. Modelling a parallel system (e.g., the matrix multiplication) follows the example in [FH16]. Here, we use a fork action to specify the software behaviour of each OpenMP thread individually. To calibrate the CPU demand, we use the measurements from the sequential run.

### **Model Modifications:** None

### **Results:**


### **8.5.3.2. Iteration 1: Read-Data Model**

**Overview:** The read-data model takes the amount of data required for the matrix multiplication into account. Further, it takes into account the different cache levels and the time needed to transfer the data from the caches or main memory to the core.

Using this model, we have two options. First, we apply the memory hierarchy values to the model and do not adapt the values for the CPU demand. However, the execution of the sequential run already includes data transfer and cache hit rates—even if not explicitly. Thus, the second and more appropriate option is to also adjust the CPU demand by deducting the data transfer demand.

### **Model Modifications:**

**Knowledge:** Information about total data transferred.

**ResourceCalls** with memory demand in each ForkBehaviour and memory hierarchy model.

**Memory Hierarchy:** Setting of all attributes in the memory model diagram.

**Results:** For all systems and all use cases, the read-data model gives a more accurate prediction. However, the overall accuracy is only slightly better than the pure Palladio approach.

### **8.5.3.3. Iteration 2: Cache-Line Model**

**Overview:** Because the accuracy of the read-data model is low, we further investigate a more fine-grained memory model. In the cache-line model, we consider the fact that a cache miss will not only fetch the required data from the next memory level, but will also load a full cache line. In the above model, we assume a data transfer of 4 bytes in case of a miss. In this model, we assume a transfer of a full 64-byte cache line instead.
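Assuming the per-miss transfer sizes named above, the bus demand caused by $n_{miss}$ cache misses consequently differs by a factor of 16 between the two models:

$$D_{read\text{-}data} = n_{miss} \cdot 4\,\text{B} \qquad\qquad D_{cache\text{-}line} = n_{miss} \cdot 64\,\text{B}$$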

### **Model Modifications:**

**Knowledge:** Pure CPU demand.


**L3 Cache** is set to private.

### **Results:**


### **8.5.3.4. Iteration 3: Cache-Line-Scaling-DRAM Model**

**Overview:** Beyond the cache-line model, we investigated further and also included the scaling effects of the memory bus between L3 and main memory. The bandwidth scaling depends on the number of threads used. Therefore, we used the measurements taken by perf to calibrate the model further and adjusted the throughput of the memory link accordingly.

**Model Modification:** The throughput between L3 and DRAM is modified depending on the number of worker threads.

### **Results:**



### **8.5.4. Result Summary**

**Table 8.1.:** Mean Prediction Error for the Different Use Cases and Modelling Approaches

Figure 8.8 shows all the above models in direct comparison for the two use cases. The diagrams show the prediction error in percent. The closer a value is to zero, the more accurate the prediction. In addition, we provide more detailed diagrams and the speedup curves in Appendix A.5. For example, we provide models where we did not limit the memory links to the physical core count. Thus, we assumed that hyper-threading also increases the number of replicas for a memory link.

As we can see from Figure 8.8, different models behave best in different scenarios. Thus, there is no single model that beats all others. However, we are interested in the prediction of highly parallel applications. So if we ignore low numbers of worker threads (e.g., lower than the number of physical cores), we can see that the cache-line model shows the best accuracy for all scenarios (except the 12-core large use case). For example, the cache-line model increases the accuracy for the maximum thread number, and lies between 32% in the worst case (96-core system and small use case) and 89% in the best case (40-core system and large use case).

**Figure 8.8.:** Comparison of Prediction Models: Prediction Error in % for all Machines and Use Cases

Table 8.1 gives a full overview of the mean prediction error. The smaller the value, the more accurate the prediction. Bold values mark the most accurate model in each row. We neglect the values for the read-data model because, as explained above, it uses a misleading calibration and considers memory effects twice. The mean prediction error is averaged over all thread variations. Thus, we cannot compare the error of the 96-core server with the 40-core server directly, because from 96 threads upwards, measurements are only taken in steps of eight.

As we can see in the table, for each scenario we can find a model with a mean error below 40%, and all models are more accurate than the default Palladio approach—except the cache-line model with the large use case. However, we are more interested in high numbers of worker threads. Considering the diagrams, we see that especially the accuracy for the 96-core system is poor. This can have two reasons: (1) we did not capture all relevant characteristics in the memory model, or (2) there are other PPiFs that influence the performance. Given our current state of research, we believe the latter to be true. Especially effects such as data access, locks, and sequential parts of applications impact the parallelisation capabilities of highly parallel applications, and are not considered in the current models.

### **8.5.5. Discussion and Lessons Learned**

During the creation of the meta-model extension and the execution of the experiments, we gained valuable insights that we want to share. Thus, we describe noteworthy lessons in the following.

**Exact Scheduler:** At first, we tried to use the exact scheduler developed by [Hap08]. However, we noticed that the implementation adds an arbitrary but constant value of 100 to the predictions for short runtimes (below 200). This interferes with the speedup results for a large number of worker threads. At the other end, for very long execution times, the exact scheduler seems not to add any demand at all.


**Prediction Quality:** For most systems, the predictions are of good quality. Thus, it is all the more interesting that for the 96-core system, the prediction accuracy is so low. Obviously, we are missing performance-relevant factors. These factors might have something to do with memory bandwidth. For example, we did not research prefetchers, memory latencies, or inter-core connections, which certainly have an impact on performance. However, it is more likely that they are of another nature and not bound to the memory hierarchy. Future work will have to look into that.

### **8.6. Threats to Validity & Limitations**

To put the results in the right perspective, we discuss assumptions made and threats to validity in the following. Thereby, we distinguish between internal and external validity.

**Internal Validity:** Internal validity describes the validity of the specific experiment setting on which the response time prediction depends. Thus, we need to name three factors: the execution and measuring of the memory hierarchy utilisation, the experiment execution time, and the implementation of the simulation.

Multiple unforeseeable factors influence the execution time of the matrix multiplication. For example, we generate the matrix with random numbers. However, a matrix with many zeros can be calculated faster due to internal processor optimisations. Also, we did not pin threads to cores but relied on the operating system's scheduler. So, threads could switch to other cores—which results in cold caches. Further, operating system interruptions can influence the execution time. To minimise these effects, we executed each run multiple times and used mean values.

When taking the measurements, we increased the number of worker threads successively. Executing the experiments for all thread numbers would have resulted in very long execution times. Thus, we increased the thread number in steps, choosing a step size of four for numbers below 96 and a step size of eight for numbers above 96. This is reflected in the calculation of the mean error. Thus, we cannot directly compare the mean prediction error for the 96-core machine with the other machines.

Another threat to internal validity is the use of perf. Perf reads low-level performance counters from the hardware to get, e.g., cache access rates. These performance counter events can vary between hardware vendors. For example, we are not able to read the L1-dCache-store event. Even though the use of performance counters was our only chance to get low-level information and is a common approach, the measurements might not be comparable between different hardware. Further, the use of monitoring applications puts additional overhead on the system, and can influence performance in general.

The next aspect we need to consider, but have no influence on, is Turbo Boost or auto-throttling. Depending on the core's temperature, modern CPUs throttle down the CPU clock frequency. Thus, the CPU becomes slower. We did not investigate these effects, but monitored the CPU temperature during the experiments.

Finally, we have to discuss the model itself. During the meta-model creation process, we abstracted the architecture of the multicore CPUs significantly. Thereby, we neglected characteristics to make the model easier to handle. However, it is possible that we neglected performance-relevant aspects (e.g., prefetching or cache optimisation). One result of this might be the low accuracy for large multicore systems (e.g., a 96-core system). Following up on these PPiFs is a task for future research.

**External Validity:** External validity describes whether the findings can be generalised outside the scope of this work.

For now, we assume that the memory hierarchy model we developed can be generalised, because we analysed various CPU architectures upfront. The generalisation includes not only CPUs but also GPUs, even though we only focused on CPUs.

A more relevant threat is the evaluation scenario. In this work, we provide a proof-of-concept evaluation of the memory hierarchy model and the solvers. Thereby, we used only one use case (i.e., the matrix multiplication), one programming language (i.e., Java), and one parallelisation paradigm (i.e., Pyjamas). However, a broader set of use cases, algorithms, languages, and complex applications is required to make more generalisable statements.

Finally, there are some minor threats that come with any controlled experiment. For example, we did not research how a system under load, with a complex application stack and multiple services running, would impact the memory architecture.

### **8.7. Summary of CB<sup>3</sup>**

In the course of this chapter, we focused on the requirements , , and and researched an approach to consider memory hierarchies in performance predictions. To do so, we first identified PPiFs for memory hierarchies and their attributes. Next, we mapped the PPiFs to model elements and attributes, and included a memory hierarchy model in the PCM using a profile-based extension. Afterwards, we extended the editors to enable the SA to utilise the new model elements. Finally, we extended the current default simulator SimuLizar to interpret the added model elements and to take them into account during the simulations.

To evaluate the meta-model extension, we executed a matrix multiplication use case with different matrix sizes. At the same time, we modelled and simulated the use case and compared the measurements with the predictions. As a contribution of this chapter, we present:


As a result of the contribution, we can show that the four memory model approaches increase the performance prediction accuracy of Palladio. Each model works exceptionally well under certain circumstances. Overall, we favour the cache-line model, which has the best overall performance prediction power and increases the accuracy by up to 57%. Thus, in the best case, the prediction error is below 11%.

However, the overall prediction for large systems and a high number of worker threads is still poor, with over 60% prediction error. We assume that further PPiFs have an impact here, e.g., effects like data access, locks, and sequential parts of the application. Investigating these PPiFs is a challenge for future work.

To sum up, we can answer our research question:

### 3.3: Can modelling the additional performance-influencing factors improve the overall accuracy of performance prediction?

**Answer:** We introduced a memory hierarchy model and included it in the PCM for evaluation. The results show that modelling the memory hierarchy helps in all cases to improve the performance predictions compared to the pure Palladio approach. For systems of up to 40 cores, we even gained results that satisfied our requirements.

### **9. CB<sup>4</sup>: CPU Simulators**

In this chapter, we introduce a different approach to tackle the requirements and by using and integrating already available CPU simulators into the Palladio approach.

CPU simulators are often used by hardware vendors to benchmark their architectures [AS16]. CPU simulators have the advantage of reflecting the exact behaviour of specific CPUs, ranging from the CPU times up to the utilisation of the individual CPU registers. At the same time, this precise prediction of the behaviour comes at the cost of very long simulation times. Further, to utilise the simulators, we either need to provide a runnable application or the trace files of an execution.

Nevertheless, we are convinced that researching the integration of CPU simulators into the Palladio approach is beneficial, worth the effort, and can reveal new insights into the characteristics of parallel applications in multicore environments. Figure 9.1 shows the research approach and the structure of this chapter.

In the next section, we first explain the problem space, identify challenges and research questions, and set the goals. After that, we perform a structured literature search to identify available multicore CPU simulators, followed by an evaluation of all simulators. The assessment also includes the selection of suitable simulators. In the next step, we investigate extension strategies for Palladio. In combination with the selected CPU simulator, we prototype an extension process. Finally, we perform a use case evaluation and discuss the results and future work.

As a result, we provide a proof-of-concept approach, which we evaluate with the help of the bank account use case example (see Section 5.2.1). We are able to show that by using CPU simulators, the non-linear speedup behaviour is present in the performance predictions. However, the predictions underestimate the performance by far. This indicates that the input model is missing relevant characteristics.

**Figure 9.1.:** Overview of the Research Method for Contribution CB<sup>4</sup>

Please note that significant contributions described in this chapter were part of collaborative student research projects [Det20; Gra18]<sup>1</sup>.

Additionally, we published all accompanying data (e.g., documentation on CPU simulator's docker les, implementations, evaluation data) online:

https://zenodo.org/badge/latestdoi/282948837

### **9.1. Problem Space**

As we learned from the research in CB<sup>1</sup> to CB<sup>3</sup>, predicting the behaviour of parallel applications is highly complex and depends on many PPiFs. In CB<sup>3</sup>, we researched the impact of PPiFs. Thereby, we looked at only one PPiF at a time, knowing that the PPiFs influence each other.

<sup>1</sup>Please check them as well for further information, especially on implementation details.

Therefore, in the following, we research the integration of CPU simulators into the Palladio approach. CPU simulators predict the behaviour of an application on specific hardware in detail [AR06] and also consider side and cross effects of PPiFs.

### **9.1.1. Idea and Goal**

To reect the complex interaction of multiple PPiFs, we integrate existing exact multicore CPU simulators into the Palladio approach and utilise them as third-party model solvers.

To do so, we use Palladio's software, hardware, and usage models as input for the CPU simulators. Once we fetch the results from the simulators, we play them back into the Palladio Bench for further analysis.

In detail, we research and evaluate two different approaches:


### **9.1.2. Problem Specification**

To successfully reach our research goal, we have to answer a set of questions:


To answer these research questions, we follow the research method illustrated and explained above (see Figure 9.1).

### **9.2. Overview of Multicore CPU Simulators**

In this section, we give an overview of multicore CPU simulators. In a first step, we define the research strategy to find simulators in the literature. Second, we give a short overview of all the simulators found, including their strengths and weaknesses. Finally, we present an overview, categorisation, and analysis of the simulators.

To follow this section, we recommend reading the section on CPU simulators in Section 2.4.1 first.

### **9.2.1. Search Strategy**

To answer the research question 1, we conduct a structured literature search. Since we assume the number of available CPU simulators to be low to moderate, we perform a simple keyword search using five databases (Google Scholar, IEEE Xplore, ResearchGate, ScienceDirect, and IBS BW). The keywords we use are multicore, cpu, and simulator, which we combine into the single search term multicore cpu simulator.

In a second step, we perform snowballing to reveal additional simulators taken from related work.

We limit our result set to (a) multicore simulators, which are (b) not older than ten years (last update). Further, we have a set of requirements. So we are looking for CPU simulators that:


After conducting the search strategy, we retained ten multicore CPU simulators.

Figure 9.2 gives an overview of the simulators found, categorising them based on their capability to simulate Java applications and x86 ISA architectures.

**Figure 9.2.:** Overview of multicore CPU simulators [Gra18]

In the following, we briefly characterise all remaining simulators. Thereby, we start with trace-based simulators and continue with source code-based simulators.

To characterise all simulators, we used the available literature and set up and ran example projects for all simulators. Thereby, we used Docker to handle dependencies and guarantee simple reuse. A description of how to run the Docker files is available in [Gra18], and all files are publicly available online<sup>2</sup>.

### **9.2.2. Trace-based Simulators**

We only found one CPU simulator that takes trace files as input.

**Tejas:** Tejas<sup>3</sup> is a multicore simulator designed by the Indian Institute of Technology (IIT). It is entirely written in Java and was released in 2015 [SKK+15].

Figure 9.3 gives an overview of the main characteristics of Tejas. The dimensions of the spiderweb diagram are explained in detail in Section 2.4.1. The more a simulator fulfils a dimension, the closer the point is to the outer circle.

**Figure 9.3.:** Tejas Feature Net [Gra18]

The developers follow a cycle-accurate, trace-driven approach. However, the core Tejas implementation requires two input files: first, the configuration file; second, an executable file. This makes the core Tejas implementation a source code-based simulator. Further, like most other simulators, the Tejas approach uses the Intel PinTool. However, this tool only works with C++ code.

<sup>2</sup>https://doi.org/10.5281/zenodo.3961930

<sup>3</sup> http://www.cse.iitd.ac.in/tejas/

To support Java code, there exists an extension called Tejas Java<sup>4</sup>. Instead of the Intel PinTool, it uses the common Jikes RVM<sup>5</sup>. With the help of the Jikes RVM, it is possible to provide a trace file as input. Tejas Java can create statistics and an output trace, which can be used as an input file for the original Tejas simulator.

### **9.2.3. Source Code-based Simulators**

In the following, we briefly characterise the remaining CPU simulators. All of these are source code-based, and they need at least two input files: first, the simulator's configuration file, and second, a compiled Executable and Linking Format (ELF) file.

**Sniper:** Sniper<sup>6</sup> is developed in a cooperation between Ghent University and the Intel ExaScience Lab. Like most CPU simulators, it relies on the Intel PinTool and thus supports only C++ applications.

Figure 9.4 shows the characteristics of Sniper.

Sniper is a timing-based simulator, using a hybrid cycle simulation model. The hybrid model enables Sniper to skip specific cycles, which gives a performance gain. Sniper is well suited to simulating OpenMP applications.

**zsim:** Another CPU simulator, zsim<sup>7</sup>, was developed by the Massachusetts Institute of Technology and the Trustees of Stanford University, and further modified by MIT.

Figure 9.5 gives an overview of the characteristics of zsim.

<sup>4</sup> http://www.cse.iitd.ac.in/tejas/tejas\_java/

<sup>5</sup> https://www.jikesrvm.org/

<sup>6</sup> http://snipersim.org/

<sup>7</sup> https://github.com/s5z/zsim

**Figure 9.4.:** Sniper Feature Net [Gra18]

**Figure 9.5.:** zsim Feature Net [Gra18]

Zsim aims to simulate systems with up to 1,000 cores; therefore, its developers chose an execution-driven, user-level approach [SK13b]. zsim can simulate multi-threaded and client-server applications, and supports C++, Java, Scala, and Python.

**MaxSim:** MaxSim<sup>8</sup> is a simulator built upon the Maxine VM and the zsim simulator. Therefore, the feature net looks similar (see Figure 9.6).

In contrast to most other simulators, MaxSim uses the Maxine VM instead of the Intel PinTool. This enables MaxSim to simulate Java applications as well. However, the Maxine VM is only capable of interpreting Java files up to JDK 7.

<sup>8</sup> https://github.com/beehive-lab/MaxSim

**Figure 9.6.:** MaxSim Feature Net [Gra18]

**Gem5:** Gem5<sup>9</sup> is the fusion of the previous projects Michigan m5 and the Wisconsin GEMS. Scientists mainly use it for performance measurements and analysing computer architectures [BGOS12].

**Figure 9.7.:** Gem5 Feature Net [Gra18]

Figure 9.7 shows the feature net of Gem5 and indicates that Gem5 is an emulation-based simulator for x86 ISA architectures. Gem5 offers a set of ARM ISA options, which gives much freedom. The emulation comes at the cost of performance and accuracy, since Gem5 is not cycle-accurate.

<sup>9</sup> http://www.gem5.org/

However, Gem5 offers direct support of Java benchmarks.

**MARSSx86:** In contrast to Gem5, MARSSx86<sup>10</sup> is a cycle-accurate full-system simulator for x86 multicore ISAs.

The purpose of MARSSx86 is to provide an efficient and straightforward complete system simulator [PAG11b]. Even though the full source code is available on GitHub, it is written in C code, and development ended in 2012.

Figure 9.8 shows the full feature net of the simulator.

**Figure 9.8.:** MARSSx86 Feature Net [Gra18]

**Multi2Sim:** The purpose of Multi2Sim<sup>11</sup> is to support computer architects in the task of developing new architectures. Its primary goal is to verify the correctness and feasibility of new hardware designs [UJM+12].

Figure 9.9 shows the feature net of the simulator. It indicates that Multi2Sim is very versatile. Besides the capability of simulating x86 ISA, it can also simulate ARM and GPUs.

<sup>10</sup>http://marss86.org/

<sup>11</sup>http://www.multi2sim.org/

**Figure 9.9.:** Multi2Sim Feature Net [Gra18]

### **9.2.4. Evaluation and Selection**

After conducting the search, we execute an assessment of the simulators. Thereby, we evaluate nine criteria. We are able to determine five of them by reading the documentation or studying the corresponding literature. For the remaining four, we set up the simulators and use benchmark testing. In the following, we explain each criterion and how it is assessed.


**Processor Model:** This attribute describes the supported processor models. We distinguish between in-order (IO) and out-of-order (OOO). We gain this information from the documentation.

The following four characteristics cannot be extracted from the literature, and are gained from setting up the simulator and running a benchmark example. All characteristics are derived from the execution of a single use case, and therefore have limited power and may be biased. Nevertheless, we provide them as an indicator and for internal comparison.


For all simulators and characteristics, Table 9.1 provides an overview. Given that overview of available multicore CPU simulators, we are able to answer 1. Moreover, the overview of their advantages and disadvantages answers 2.

<sup>12</sup>https://www.spec.org/cpu2006/

Given our description, evaluation, and testing, we nominate the simulator MaxSim for source code analysis and the simulator Tejas Java for trace file analysis as promising candidates.

In the following, we sketch the process for including source code-based and trace file-based simulators into the PCM workflow.

### **9.3. Palladio Extension Strategies**

In the last section, we provided an overview of all available multicore CPU simulators. We sketched their characteristics and briefly described their advantages and disadvantages.

With this knowledge, we will design two strategies to include trace-driven and source code-driven CPU simulators into the Palladio approach. For both procedures, we will first theoretically describe how the inclusion could work. Next, we provide a proof-of-concept evaluation with the CPU simulator most suited to the scenario. Finally, we will discuss the limitations of each strategy and further challenges to tackle.

### **9.4. Trace-driven Strategy**

Since the Palladio models contain information on an abstract architectural level, the trace-driven inclusion strategy sounds most promising. The general idea is to extract the stack traces from one of Palladio's solver engines. In the next step, we use the traces as input files for the CPU simulators and run the simulations. Finally, we play back the results from the simulators to the solver.

Figure 9.10 exemplifies this process. As shown, we do not use any additional information besides the already existing Palladio models. As a solver engine, we propose SimuCom, because SimuCom uses m2t transformations to generate simulation code, which provides resource demand traces. In the following, we have a detailed look at the SimuCom solver and the ways to extract traces.



**Figure 9.10.:** Inclusion Strategy Using SimuCom and Trace-driven Multicore CPU Simulators [Gra18]

### **9.4.1. SimuCom**

As explained in Section 2.4.2.4, SimuCom follows a model-to-text (m2t) transformation approach to generate simulation code out of PCM instances. The SimuCom framework uses and executes the simulation code to simulate the system.

As part of the suitability check and analysis of SimuCom, we identified two possible extension points in the source code of the SimuCom framework (see Appendix A.6.1).

The first extension point (see Listing A.7 in Appendix A.6.1) hooks into the getScheduledResource method. At this point, the processed-demand traces are available.

The second extension point (see Listing A.8) hooks into the ExperimentRunner. At this point, the stochastic simulation starts. Here, the idea is to get the event traces and hand them over to the CPU simulator.
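To illustrate the first extension point, the following hypothetical Java sketch records the processed-demand traces as they become available; the class, the callback name, and the tuple format are our assumptions for illustration and do not reproduce the actual SimuCom API.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical trace recorder for the first SimuCom extension point. */
public class DemandTraceRecorder {

    private final PrintWriter out;

    public DemandTraceRecorder(String traceFile) throws IOException {
        this.out = new PrintWriter(Files.newBufferedWriter(Paths.get(traceFile)));
    }

    /** Assumed hook: called whenever a demand is scheduled on a resource. */
    public void onDemandScheduled(double simTime, String resourceId, double demand) {
        // One tuple per scheduled demand: simulation time, resource, demand.
        out.printf("%f;%s;%f%n", simTime, resourceId, demand);
    }

    public void close() {
        out.close();
    }
}
```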

### **9.4.2. Discussion**

Unfortunately, we are not able to implement the trace-driven approach without immense effort and without changing either SimuCom or Tejas significantly. The main reason for this is that most multicore CPU simulators do not accept trace files. The only exception is the Tejas simulator.

However, the trace files provided by SimuCom are not suitable for analysis with Tejas, because they lack detailed information about software and hardware. SimuCom only provides resource-demand traces, but Tejas needs additional information about the CPU architecture, memory addresses, and operations.

In a nutshell, we believe that the trace-driven approach is still worthy of future research. However, CPU simulators are designed to help CPU designers evaluate the design of a CPU architecture, and therefore require much low-level information and return very detailed information about the status and behaviour of the CPU. For our purposes, this information is too detailed, and, at the same time, we are not able to provide the amount of input data required, since we use Palladio to look at architectural design.

So, for future research, we propose having a look at high-level multicore thread simulators, if available, or extending the current state-of-the-art Palladio simulator, SimuLizar. Thereby, we can use the insights of the CPU simulators and also use their libraries, such as the Jikes RVM or the Maxine VM.

### **9.5. Source Code-Driven Strategy**

Realising that the trace-driven approach does not work out of the box, we take a closer look at the source code-based approach. Figure 9.11 lays out this approach. As in the trace-driven approach, we use a PCM instance as a starting point. This time, however, we do not use a simulator to generate the trace, but use ProtoCom to create a runnable Java SE performance prototype.

We feed the performance prototype to the CPU simulator and play the results back to the Palladio Bench. In Figure 9.11, we show the removal of all Java RMI calls and other overhead. This step is required, since most CPU simulators cannot handle RMI calls well.

**Figure 9.11.:** Inclusion Strategy Using ProtoCom and Source Code-based Multicore CPU Simulators [Gra18]

### **9.5.1. Removal of Java RMI Communication**

Removing the Java RMI communication from the ProtoCom performance prototype meant a manual adaptation of the generated source code and the elimination of all the features coming with Java RMI calls (e.g., the simulation of distributed systems).

However, this step is necessary, because none of the remaining CPU simulators (which support Java files) were able to successfully run RMI calls. The underlying engines, the Jikes RVM and the Maxine VM, do not support Java RMI calls, and even with the help of the engine developers, we were not able to include this feature in a reasonable amount of time. Thus, we have to remove all RMI calls to proceed.

To still be able to run the prototype, we unravel the RMI communication stack trace and include a new class that calls the required methods not via remote method invocation, but by simply calling them in the specific order (for further implementation details, please see Appendix A.6.2).
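As an illustration of this unravelling step, the following sketch replaces a remote invocation by a plain local call in the same order; the class and method names are invented and do not reproduce the generated ProtoCom code.

```java
/** Invented stand-in for a component formerly reached via Java RMI. */
class AccountImpl {
    void transfer(String from, String to, int amount) {
        // ... business logic formerly invoked remotely ...
    }
}

public class UnravelledCaller {
    public static void main(String[] args) {
        AccountImpl account = new AccountImpl();
        // Before: IAccount stub = (IAccount) Naming.lookup("//host/Account");
        //         stub.transfer("a", "b", 100);   // remote method invocation
        // After: the same method, called directly and in the same order.
        account.transfer("a", "b", 100);
    }
}
```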

The removal of the RMI calls is only possible due to our simple use case, and would become a complex task for larger, distributed, or more advanced use cases.

### **9.5.2. ProtoCom Calibration**

To be able to run the simulations locally but still get the correct results for the target system, we first need to calibrate the ProtoCom performance prototype.

Therefore, we execute a calibration run, which includes the simulation of a Java prototype. The prototype performs a fixed number of, e.g., Fibonacci demand operations on the target system. We use the measurements to create a calibration table. With the help of the calibration table, we are now able to execute the simulations locally, while getting the correct results for the target system.
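A minimal sketch of this idea, assuming an iterative Fibonacci loop as the synthetic demand; the class and the single-entry calibration table are our own illustration, not ProtoCom's actual implementation.

```java
import java.util.TreeMap;

/** Illustrative calibration: a synthetic Fibonacci demand is timed on the
 *  target system; the table maps measured time to iteration counts. */
public class FibonacciCalibration {

    /** Synthetic CPU demand: a fixed number of iterative Fibonacci steps. */
    static long fibDemand(long iterations) {
        long a = 0, b = 1;
        for (long i = 0; i < iterations; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a; // returned so the JIT cannot remove the loop
    }

    public static void main(String[] args) {
        long iterations = 1_000_000_000L;  // fixed demand for the calibration run
        long start = System.nanoTime();
        fibDemand(iterations);
        double seconds = (System.nanoTime() - start) / 1e9;

        // Calibration table: measured execution time -> iteration count.
        TreeMap<Double, Long> calibrationTable = new TreeMap<>();
        calibrationTable.put(seconds, iterations);
        System.out.printf("%d iterations took %.1f s%n", iterations, seconds);
    }
}
```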

### **9.5.3. Discussion**

In the above sections, we have sketched a method to use a PCM instance as input for ProtoCom, generate a runnable performance prototype, and use the prototype as input for multicore CPU simulators.

However, this process is not straightforward, contains a lot of manual adaptations, and only works for specific use cases. One of the major drawbacks is the lack of support for Java RMI calls by the CPU simulators' engines. Further, the benefit of the simulators themselves is questionable, since only two simulators (MaxSim/zsim and Tejas) support Java applications at all, and their accuracy (11.2% and 18.77%) is moderate for real applications, which means we can assume that the accuracy drops even further when generating performance prototypes out of abstract architectural models.

Nevertheless, we were able to sketch the process of including CPU simulators into the Palladio workflow, and therefore successfully answered 3.

### **9.6. Execution and Use Case Evaluation**

To answer the final research question 4, we perform a use case evaluation of the source code-based approach using the multicore simulator MaxSim. All results, configuration files, Docker containers, and measurements are publicly available<sup>13</sup>.

### **9.6.1. Use Case and Process**

As a use case, we use a complex example of the bank account use case (see Section 5.2.1). We decide to use this example for multiple reasons:


The process we follow to evaluate the CPU simulator approach is straightforward. We use the measurements taken in [FSH17] as ground truth. This gives us (a) the measurements from the implementation and execution of the use case, (b) the results of the Palladio simulation without any extensions, and (c) the PCM models for the use case.

In the next step, we use the PCM models to generate the performance prototype using ProtoCom. Next, we adapt the prototype as described above, remove all RMI calls, and perform the ProtoCom calibration process to run the simulations for the target system locally. After that, we feed the prototype to the MaxSim simulator, and finally, we compare the results from the simulator to the measurements and Palladio simulation results from [FSH17].

### **9.6.2. Setup**

The setup phase contains two actions: the setup of the simulator and the calibration of ProtoCom.

<sup>13</sup>https://zenodo.org/badge/latestdoi/282948837

### **9.6.2.1. MaxSim Setup**

During the setup, we have to configure the CPU simulator. The configuration includes specifying the characteristics of the CPU architecture. The listing in Appendix A.9 shows the full configuration file for MaxSim. It includes the specification of the L1, L2, and L3 caches, as well as the specification of the number of cores and the clock rates.

### **9.6.2.2. ProtoCom Calibration**

To calibrate the local ProtoCom instance for the target system, we created a sample calibration project with a synthetic ProtoCom resource demand (e.g., calculating Fibonacci numbers). In this project, we set the number of calculation iterations to 1,000,000,000 and executed the project on the target system. The execution takes around 25.7 (see Appendix A.6.4 for more detailed information).

With this information, we can adjust ProtoCom's calibration table and include the information in the performance prototype.

### **9.6.3. Execution & Measurements**

Due to a version change in Palladio and Java, we re-executed the simulations with Palladio using the same values and experiment setup as in [FSH17]. We got the same results and continued with the execution. Unfortunately, we are not able to re-run the experiments on the hardware, since it is no longer available. Therefore, we have to rely on our previous measurements.

Table 9.2 shows the measurements from [FSH17], the simulation results using SimuCom, and the simulation results using MaxSim for one to sixteen worker threads. The upper part of the table contains the results for 500 transactions (small use case) and the lower part the results for one million transactions (large use case).

Further, Figure 9.12 visualises the results using bar and scatter charts.


**Table 9.2.:** Overview of all measurements and simulation results from SimuCom and the MaxSim approach

**Figure 9.12.:** Chart-based visualisation of the measurements for the small and large use cases

### **9.6.4. Discussion**

Given the results of the experiment, the first thing to notice is the overall poor accuracy of MaxSim. Even for sequential scenarios, the results are very inaccurate. Overall, MaxSim performs a lot better for the small use case (accuracy of up to 76%), but performs very poorly for the large use case (accuracy of only up to 2.50%). Hence, we can answer the question 4 and are not able to provide more accurate results with the use of CPU simulators.


Second, we noticed, as pointed out in [FSH17], the super-linear speedup of the real execution for two and four worker threads.

Third, when looking at the speedup behaviour (cf. Figure 9.12), we can see that the CPU simulator captures the behaviour of the real application, but is off by a factor of 10 to 20. In contrast, we see that SimuCom applies a linear speedup.

To wrap it up, CPU simulators are used to benchmark a CPU architecture design. To do so, they give very detailed information on the behaviour and characteristics of a CPU. In the past, they have been shown to work with high accuracy. However, to work properly, they need detailed information and runnable source code. So we assume that the reasons we got such inaccurate results are the following:


**CPU Model Creation:** We had to create the CPU models for the simulator manually, in the absence of preconfigured models. This process is time-consuming and prone to errors. Small changes can lead to significant changes in the simulation results. To avoid errors, we used the information provided by the CPU vendors and tried our best to create accurate CPU models.

**Artificial Load:** The CPU load generated by ProtoCom is artificial. ProtoCom supports five different demand types. In our case, we used the default setting and created the performance prototype with a Fibonacci demand. Each demand type has specific characteristics (processor-intensive vs. I/O-intensive). However, for the complex use case, a single demand type might not be sufficient.

### **9.7. Summary of CB<sup>4</sup>**

In this chapter, we discussed the possibilities of integrating multicore CPU simulators, as used by hardware engineers, into the Palladio approach. To do so, we first executed a structured literature review to find the current state of the art in CPU simulators. Next, we evaluated each CPU simulator, carved out its strengths and weaknesses, presented the results in an overview table (see Table 9.1), and showed how they can be used by the SA for performance predictions. In a second step, we sketched out the integration process of both trace-driven and source code-driven CPU simulators into the Palladio workflow.

Finally, we implemented and executed the source code-driven approach by using the CPU simulator MaxSim. Unfortunately, the results we received were very inaccurate and on average even worse than before. In the above section, we discussed the reasons for the inaccuracy. The two reasons we think have the most substantial influence are (a) the example use case used, and (b) the abstract input model.

When continuing the research, we first need to evaluate the results by the use of a second scenario. Further, we will try to use another CPU simulator based on another engine (Jikes RVM vs. Maxine VM). However, there is an even more significant challenge to face: none of the CPU simulators we tested can handle Java files built with Java 1.8 or above. This technical limitation, and the fact that the CPU simulator engines cannot simulate Java RMI calls, makes it close to impossible to continue the research at the moment.

In conclusion, we answer our research questions (see Chapter 3) as follows:

### 4.1: Can CPU Simulators be used by software architects to evaluate the response time of parallel architectural designs?

**Answer:** We were able to show that it is possible to transform the architectural models into a performance prototype, which we can, in turn, use as input for multicore CPU simulators to determine the response or execution time of a parallel application.

### 4.2: How would the integration of CPU simulators alter the process of performance predictions?

**Answer:** In Section 9.3, we sketched two approaches to include CPU simulators into the performance prediction workflow: (1) a trace-driven approach and (2) a source code-driven approach. In both cases, we use the PCM without additional information as a starting point. Next, we transform the PCM by the use of solvers into either a trace file or a performance prototype, which we finally use as input for the multicore simulators.

### 4.3: Does the use of CPU Simulators increase the performance prediction accuracy for parallel applications in multicore environments?

**Answer:** We implemented the source code-driven approach to evaluate the accuracy of the performance prediction using multicore CPU simulators. Thereby, we used a complex use case example, the Bank Transaction Example (see Section 5.2.1). The predictions of this approach for the given example were very inaccurate, with an accuracy from 2.50% to 15.29%, up to 54% worse than the pure Palladio approach.

Therefore, we have to reject our hypothesis 4: CPU simulators—used in other domains (e.g., hardware vendors)—can help to improve the predictions for parallel applications on multicore CPUs.

### **Part III.**

### **Evaluation and Summary**

### **10. Evaluation**

In the previous four chapters, we presented in detail the four contributions of this thesis. Along with a detailed description, we provided an extensive discussion about the benefits and limitations, and evaluated each contribution individually. In this chapter, we pick up our overall research goal (see Chapter 3), show how the contributions can be combined, give an overview of the research questions we answered, and show the contribution of this work given the requirements from Chapter 1.

### **10.1. Combination of Contributions**

Even though we previously considered each contribution individually, a combination of the contributions is possible and even desirable. Thus, we will discuss whether and how a combination is possible.

### **10.1.1. Combination of CB<sup>1</sup>**

In CB<sup>1</sup> (Chapter 6), we researched the capabilities of the PCM language to express parallel behaviour. As a result, we provide a lightweight meta-model extension using the AT method and provide a pattern catalogue to quickly include common parallel patterns into the software models. The main characteristic of the lightweight extension is that we do not alter the core meta-model, and can map all new language elements to already existing ones. Thus, we ensure that all existing simulators and extensions can still handle the models. Further, this makes it theoretically possible to combine the parallel architectural pattern catalogue with all the other contributions.

In the following, we briefly sketch what a combination would look like.

**Combination with CB**<sup>2</sup> In CB<sup>2</sup> (Chapter 7), we researched the behaviour of parallel applications and the influence of PPiFs on performance. Further, we extracted performance curves to capture the characteristics of different types of resource demands. We included the performance curves in Palladio to enable the SA to increase the performance prediction accuracy without modelling the characteristics of parallel applications in detail.

In Section 7.6, we show how we integrated the performance curves into Palladio using the parallel pattern catalogue. Thus, this indicates that the combination of the two contributions is not only easily possible, but is even necessary in order to use the performance curves in Palladio.

**Combination with CB**<sup>3</sup> In CB<sup>3</sup> (Chapter 8), we extended the PCM to include the memory architectures of CPUs. Thereby, we extended the software and hardware models, as well as the simulator SimuLizar.

For a combination of CB<sup>1</sup> and CB<sup>3</sup>, we have to take a detailed look at the SEFF diagram: to consider memory accesses, we altered the internal action element so that we can specify the memory accesses needed. To successfully use the pattern catalogue in combination, we have to ensure that during the QVT-O transformation (1) the internal action is copied with all attributes, and (2) the memory access demand is adjusted for each copied instance. Currently, the first requirement is fulfilled. The latter has not yet been implemented. However, an adaptation is easily possible if we assume that the total memory access demand is spread equally amongst all threads spawned by the parallel AT.
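Under this assumption, the adjustment is a simple split of the total memory access demand $D$ of the internal action over the $n$ threads spawned by the parallel AT (our notation):

$$d_i = \frac{D}{n}, \qquad i = 1, \dots, n$$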

**Combination with CB**<sup>4</sup> In CB<sup>4</sup>, we present a prototype approach to use a multicore CPU simulator as a solver for the PCM. Even though we achieved predictions of low accuracy with CPU simulators, a combination of CB<sup>1</sup> and CB<sup>4</sup> is possible without further actions.

As described in Chapter 6 (CB<sup>1</sup>), we ensure that all solvers still work, due to the lightweight meta-model extension. Therefore, we can use both sketched strategies (trace-based and source code-based) in combination with the parallel pattern catalogue. Since the pattern catalogue focuses on the software models, using the extension will lead to faster creation of the models, but will not affect accuracy.

### **10.1.2. Combination of CB<sup>2</sup>**

We can use the developed performance curves to adjust the performance predictions, e.g., by adding additional resource demands to the model or by calculating the difference to the linear speedup. To gain the performance curves, we performed extensive experiments and used the measurements to extract performance curves using linear regression. Thus, the performance curves implicitly include many of the effects occurring during parallel execution.

Given that, we have a look at the combination of the remaining two contributions and discuss whether a combination makes sense.

**Combination with CB**<sup>3</sup> While extracting the performance curves, we looked at various attributes: the number of worker threads, the number of physical and virtual cores, the performance (i.e., speedup), and the type of resource demand. While using the measurements from the experiments to extract the performance curves, we implicitly captured effects such as synchronisation, caching, or idling. Thus, a combination of the performance curves with the memory bandwidth model is, in theory, possible.

In Section 8.5, we conclude that the cache-line memory model is the most fitting one. In the following, we briefly describe the results when combining the cache-line model with the performance curves. Thereby, we use the matrix multiplication example as a reference use case. To gain the combined values, we first simulate the cache-line model as described in Chapter 8. Afterwards, we apply the performance curves manually, as described in Section 7.5.5.
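A sketch of this manual combination, under the assumption (our notation) that the performance curve from Section 7.5.5 can be expressed as a thread-count-dependent overhead factor $c(n) \geq 1$ applied to the simulated response time of the cache-line model:

$$t_{\text{combined}}(n) = t_{\text{cache-line}}(n) \cdot c(n)$$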

Figure 10.1 shows the prediction error when combining the cache-line model with the matrix multiplication performance curve. Further, the figure shows the error for the different hardware and use case settings.

In addition to that, the following Table 10.1 shows the mean prediction error.

Looking at the pure values shows that the combined model works for the 40-core system and the 96-core system. However, it brings an accuracy decrease for the 12-core system. The interpretation of this observation is as follows: the performance curve always assumes an additional overhead. In the case of the 12-core system, the cache-line model was already underestimating the performance. Thus, by adding the performance curve, we increased the underestimation and made the predictions worse. For the other two cases, the opposite is true. The cache-line model overestimated the performance of the system under test. So, by adding the performance curves, we added additional overhead and increased the accuracy of the prediction.

**Figure 10.1.:** Prediction Error for the Combined Approach: Matrix Multiplication Performance Curves and Cache-line Memory Model

**Table 10.1.:** Comparison of the Cache-Line Model and the Cache-Line Model with Performance Curves

In general, we do not suggest combining the memory model with the performance curve. The main reason for this is that while taking the measurements for the performance curves, we measured memory effects as well—even though the measurement was implicit, via the overall performance. Thus, both models, the performance curves and the memory model, include memory effects. Combining them would mean taking this effect into account twice. The increase in accuracy for the larger systems was only a lucky coincidence resulting from adding two inaccurate prediction approaches.

Instead of a combination of the two approaches, we suggest investigating the effects of PPiFs more in-depth, and making either approach more accurate.

**Combination with CB**<sup>4</sup> More interesting is a combination of the performance curves with the CPU simulator approach. Even though the performance curves do include most characteristics we want the CPU simulators to evaluate, we learned that our current input models are too abstract for the multicore CPU simulators to provide accurate results. Here, the performance curves can give a boost. Using the performance curves with the parallel architectural pattern catalogue will result in adding the additional overhead as an internal action to the model.

Evaluating whether these models will result in more accurate predictions using the multicore CPU simulators is an open task and remains for future work.

### **10.1.3. Combination of CB<sup>3</sup> and CB<sup>4</sup>**

The remaining combination is the combination of the PCM extension for memory hierarchies and the use of multicore CPU simulators.

Unfortunately, a combination is currently not possible, because neither the SimuCom solver (used for the trace-driven approach) nor ProtoCom (used for the source code-based approach) supports the interpretation of the memory hierarchy extension. Thus, there is no method to feed the memory models into the CPU simulators.


**Table 10.2.:** Summary of working combinations

Nevertheless, researching performance prototypes such as those created by ProtoCom, which would include the information from the memory hierarchy model and therefore memory accesses and cache behaviour, sounds promising and is an open challenge for future work.

To summarise the possible combinations, Table 10.2 gives an overview of which combinations are suitable.

### **10.2. Research Goal Evaluation**

In the introduction (see Chapter 1), we motivated the performance prediction problem arising from multicore CPUs and highly parallel software. We identified five requirements that we need to fulfil to enable accurate performance predictions for parallel applications in multicore environments. In Chapter 3, we defined the following research goal of this thesis:

Research Goal (RG): Improving the accuracy, usability, and applicability of model-based QoS predictions concerning the performance of parallel applications in multicore environments.

Next, we refined the requirements given the RG and raised four research questions.

In this section, we evaluate whether we achieved the research goal. Therefore, we will first answer the research questions, discuss whether the requirements were satisfied, and finally assess whether the RG was achieved.

### **10.2.1. Answering the Research Questions**

Because the research questions map to the contributions, each research question has already been discussed in the corresponding chapter. Therefore, we will not discuss them here again. However, in Appendix A.7, we provide a condensed version of the questions and our answers.

### **10.2.2. Assess Requirement Fulfilment**

After going through the research questions and their answers, we revisit the following requirements we initially set up. In this step, we show which contributions fulfilled the requirements. Also, we lay out open tasks and challenges for future work.

### **10.2.2.1. Assess**

: Software architects shall be able to express concurrency in software models, which describe software behaviour. This includes highly concurrent software, which can consist of multiple hundreds or even thousands of concurrently executed threads.

With the help of the parallel architectural template catalogue (see Chapter 6), we provide an easy-to-use approach for the SA to quickly include massively parallel behaviour. Thereby, the parallel AT catalogue includes four abstract design patterns. The SA can use the four patterns to model the behaviour of 32 out of 35 common parallelisation patterns we identified in a structured literature review.

Further, we introduced a PCM extension to enable the SA to specify the memory accesses and memory data consumption (see Chapter 8).

Open Tasks: The remaining three patterns are based on message passing, which we have not yet considered. Therefore, two open tasks are: (1) include message-passing concepts (e.g., MPI or Actors); (2) include inter-thread communication. When designing the pattern catalogue, we focused on the specification of the thread behaviour. Up to now, we have not included inter-thread communication, which can influence the software behaviour, e.g., due to waiting conditions.

### **10.2.2.2. Assess**

: In case the single metric—CPU speed—is no longer sufficient to cover all the performance-relevant aspects of multicore systems, the software architect shall be able to specify the additional performance-influencing factors (e.g., memory bandwidth, cache behaviour, or the memory architecture) needed.

: The performance prediction models shall include relevant performance-influencing factors and reflect the additional complexity.

To tackle this requirement, we provide two solution strategies. First, we extended the PCM to include the memory architecture (see Chapter 8). That way, the SA is now able to specify the L1, L2, and L3 caches, the main memory, and the memory bandwidth in the hardware model. Further, he can define the memory accesses and memory consumption in the software model.

The second strategy is to use performance curves. The performance curves we extracted from extensive experimentation (see Chapter 7) include additional PPiFs in an abstract way. The SA can use one of six predefined performance curves to consider additional PPiFs in the performance predictions.

Open Task: As shown in Chapter 8, considering memory architectures in the performance predictions already helps improve accuracy. However, to be even more precise, we need to consider additional metrics. So, open for future work is investigating the PPiFs that have not yet been considered, and including the most relevant ones step by step.

### **10.2.2.3. Assess**

 : The solvers, used to interpret and analyse the models, need to be capable of processing and evaluating the adapted software, hardware, and performance models.

In CB<sup>3</sup> (see Chapter 8), we adapted the solver SimuLizar so that it can interpret and analyse the memory architecture model. For CB<sup>1</sup> and CB<sup>2</sup>, no adaptation of the solver was required, since we did not alter the PCM there.

Open Tasks: Currently, no open tasks remain here. If we tackle the previously stated open tasks, we might need to alter the solvers again.

### **10.2.2.4. Assess**

: The performance predictions need to align with the real and measurable behaviour of the software to an extent that is useful for the software architect.

In CB<sup>2</sup>, CB<sup>3</sup>, and CB<sup>4</sup>, we faced the requirement and aimed for an improvement of the performance predictions. With both CB<sup>2</sup> and CB<sup>3</sup>, we can provide an approach that greatly increases the accuracy of performance predictions for parallel applications—up to 98% in the best case when using performance curves, and up to 93% accuracy in the best case when using memory modelling.

Open Task: Even though we can increase the prediction accuracy, there is still room for improvement. On the one hand, we need to include further PPiFs in the memory models and consider pre-fetching, inter-core communication, and latencies. On the other hand, we need more fine-grained performance curves.

### **10.2.3. Assess the Research Goal Fulfilment**

Given the answers to the research questions and the requirement assessment, we can state that this work has contributed to the improvement of performance predictions for parallel applications in multicore environments. Thereby, we have provided better software models (i.e., including memory accesses), hardware models (i.e., including memory hierarchies), and performance prediction models (i.e., adapting SimuLizar, using CPU simulators, and providing performance curves). Further, we have contributed to the usability aspect by providing a parallel architectural template catalogue.

Even though we have identified several open questions for future work, we did achieve our research goal, contribute to the domain of SPE, and enable (and improve) performance predictions for parallel applications in multicore environments.

### **11. Conclusion & Future Work**

In this final chapter, we recap the most important insights from the contributions CB<sup>1</sup> to CB<sup>4</sup>. Thereby, we briefly summarise the method, findings, and outcome. Further, we discuss the open challenges and remaining tasks for future work in detail. We do not discuss threats to validity separately. However, we did discuss the threats to validity for each contribution in detail in the corresponding chapters, and refer to Sections 6.8 (CB<sup>1</sup>), 7.8 (CB<sup>2</sup>), 8.6 (CB<sup>3</sup>), and 9.6.4 (CB<sup>4</sup>).

### **11.1. Conclusion**

Software-rich applications dominate our daily life more and more. These applications fulfil complex and safety-critical tasks. Therefore, it is essential that the applications comply with high quality standards and meet SLOs. To ensure high quality standards, we have to develop software in an engineering-like manner.

One aspect of software engineering is model-based performance prediction, in which software architects model software architectures, enrich the models with performance-relevant information, and use analytical or simulation-based solvers to predict quality attributes, such as performance, on architectural drafts during the early design phase. Current state-of-the-art model-based performance prediction approaches can give accurate predictions even for complex systems. To do so, they consider the user's behaviour, the software behaviour, and the hardware characteristics. For the latter, they only consider CPU speed as a single metric.

However, modern processor architectures consist of multiple CPU cores, complex memory architectures, and extensive optimisation mechanisms. To fully utilise such multicore architectures, software developers have to develop the software in a parallel manner, which is even more complicated and makes an engineering-like approach more relevant than ever. However, since model-based performance prediction approaches only consider CPU speed—which by now is no longer the only limiting factor—the accuracy of predictions for parallel applications in multicore environments suffers greatly.

To support the SA in making accurate performance predictions for parallel applications, we researched approaches for parallel performance predictions in this thesis. Thereby, we faced the requirements , , , , and (see Chapter 1).

As a contribution regarding the requirement , we present a parallel performance pattern catalogue to the SA (see Chapter 6). The pattern catalogue enables the SA to (a) specify the behaviour of highly parallel applications in software models, and (b) reduce the time and effort needed.

As a contribution regarding the requirement , we present a memory meta-model which includes the most relevant memory hierarchy characteristics (see Chapter 8). Further, we included the meta-model as a meta-model extension in the PCM, and provided graphical editors to the SA to model memory hierarchies in the hardware model, and memory behaviour in the software models.

As a contribution regarding the requirement , we present a set of performance curves to the SA (see Chapter 7). The performance curves reflect the speedup behaviour of the six most common resource demand types. Thus, with the help of the performance curves, the SA can consider the speedup behaviour of a parallel application in the prediction models. Thereby, the performance curves can be quickly added and provide a high-level view of complex correlations.

As a contribution regarding the requirement , we extended the performance prediction solver SimuLizar to interpret and analyse the memory meta-model (see Chapter 8). Thus, we enabled the SA to analyse the complex memory hierarchies typical of multicore CPUs. Further, we give a proof-of-concept approach on how to include CPU simulators in the workflow of performance predictions (see Chapter 9).

Finally, we evaluated the performance curves, the memory hierarchy modelling, and the CPU simulators against various use cases. As a result, we are able to show that both the memory hierarchy modelling and the performance curves improve the predictive power. Thereby, both approaches work best in specific scenarios. We can achieve an accuracy of up to 98% in the best case when using performance curves, and up to 93% accuracy in the best case when using memory modelling.

So, to wrap it up, we provide new tools for the SA's toolbox. These tools enable SAs to model the behaviour of highly parallel systems in software performance models, let them specify the characteristics of multicore environments in the hardware performance model, and give them enhanced model-based performance solvers to achieve more accurate performance predictions for parallel applications in multicore environments.

These tools help SAs to create and evaluate high-quality software architectures that meet the SLOs already during the design phase.

### **11.2. Future Work**

In the course of this thesis, we researched multiple approaches to enable the software architect to better handle performance prediction for parallel applications. Even though we answered all our research questions and made a significant step towards fulfilling the requirements, we also raised new questions and research ideas that could bring us even closer to full requirement fulfilment. In the following, we briefly sketch the open challenges left for future work. Thereby, we group the items according to the contributions.

**CB1: Parallel Architectural Template Catalogue** In CB1, we researched the modelling language capabilities regarding their suitability for modelling parallel behaviour. As a result, we introduced a parallel pattern catalogue based on the AT method. In the first step, we only focused on thread-based patterns.

Thus, a challenge for future work is to investigate other patterns that also represent parallelisation paradigms, such as message passing (e.g., MPI or Actors).

Further, the current approach supports an abstract method to include the overhead of parallel applications (e.g., forking or synchronisation). With performance curves, we give the SA a tool to make the overhead estimation simple. However, additional tool and language support is desirable to reduce the abstraction level and to make the estimation more precise.

Additionally, the current approaches neglect inter-thread communication, even though inter-thread communication is a relevant PPiF as well. Thus, an additional challenge is to include concepts to simplify the complex patterns of inter-thread communication, and to fit them into the modelling languages.

When it comes to evaluation, the empirical study already gives strong evidence. However, further studies with larger sample sizes and more complex use cases could help to collect additional insights.

**CB2: Parallel Performance Curves** In the evaluation of CB2, we saw that performance curves already improve the predictive power of performance prediction approaches. However, depending on the scenario, the prediction error is still higher than our overall goal of 20%. Thus, we need to reconsider the choice of PPiFs and the use of synthetic demands in future work. Further, a more fine-grained categorisation or other performance curves could contribute to a better result.

Further, a model which allows the SA to specify the sequential and parallel parts of an application (e.g., following Amdahl's law) and the specification of the I/O- and processor-intensive shares (e.g., a demand type which contains 20% I/O-intensive and 80% processor-intensive demands) would be beneficial for a better characterisation of resource demands.
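
As a hedged sketch of what such a model could express, Amdahl's law already captures the sequential/parallel split. With a parallelisable fraction $p$ and $n$ cores, the achievable speedup is bounded by

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}}$$

Treating, for illustration only, the 80% processor-intensive share of the example above as perfectly parallelisable gives $S(16) = 1/(0.2 + 0.8/16) = 4$ on 16 cores, and no core count can push the speedup beyond $1/(1 - p) = 5$.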

Another aspect is evaluation. We evaluate the approach using the SPEC benchmark, which covers a comprehensive set of representative demands. However, using a real-world example, e.g., simulations for material science, might offer further insights.

**CB3: Memory Model Extension for the PCM** In CB3, we extended the PCM to include memory hierarchies and memory behaviour. Due to the very complex characteristics of memory behaviour, this is one of the most challenging endeavours. The approach we present in CB3 is a first step, in which we simplified some of the complex interactions of hardware, software, and controller.

By simplifying (abstracting) the memory effects, we did not consider data locality, workload balance, and NUMA nodes. To give an example for the latter: each NUMA node can access its own memory with a fast local bandwidth. However, when accessing data from another NUMA node, a much slower bandwidth is used. This can greatly affect performance (see the example after this paragraph). Next, we included the concept of latency in our models, but did not further investigate latency effects in memory. We also did not explore snooping or cache-coherency effects. Thus, the performance predictions could benefit from setting memory latencies and considering cache-coherency effects. Additionally, current CPUs use pre-fetchers to avoid cache misses and to boost performance. We considered these effects in the abstract form of cache hit rates, but a more proactive approach might be needed. Finally, we have not yet combined the memory model with the parallel AT catalogue. A combination would give the SA additional comfort and freedom.
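
To make the NUMA point concrete, a small back-of-the-envelope example with assumed, purely illustrative numbers: suppose a thread reads 1 GB of data, the local NUMA node delivers 40 GB/s, and remote access over the interconnect delivers only 10 GB/s. Then

$$t\_{local} = \frac{1\,\text{GB}}{40\,\text{GB/s}} = 25\,\text{ms}, \qquad t\_{remote} = \frac{1\,\text{GB}}{10\,\text{GB/s}} = 100\,\text{ms}$$

so a prediction model that only knows the local bandwidth would underestimate the memory time of remotely placed data by a factor of four.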

When it comes to evaluation, we conducted a level 1 (proof-of-concept) evaluation using one use case. This evaluates the memory model extension, but the scientific evidence regarding the prediction accuracy is relatively weak. Thus, further comparisons with more complex examples will help to build more fine-grained models, and give a better understanding of the predictive power.

**CB4: CPU Simulators** In CB4, we adapted the Palladio Bench workflow to transform the PCM models into a running performance prototype, which we then fed into multicore CPU simulators to gain more accurate performance predictions. The approach seems very promising. However, the results we achieved were often highly inaccurate.

A significant challenge for future work is adaptability. Only a few CPU simulators support native Java applications as input, and the ones that do require a Java version below 1.8. This results in significant compatibility issues.

Nevertheless, we rate the insights we gained as very relevant. Therefore, transferring the concepts from CPU simulators at least to some extent into performance simulators, such as SimuLizar, sounds very promising.

Further, a factor contributing to the low accuracy could be the simplified PCM models. Thus, including additional model elements, as we did in CB3, might lead to better results when also adapting ProtoCom.

Also, the evaluation was carried out with a single use case, and served as a proof-of-concept evaluation. We might achieve better results and further insights by using additional use cases.

### **A. Appendix**

### **A.1. Publications & Supervised Theses**

In the context of this doctoral project, we published a number of peer-reviewed publications, including conference papers, journal articles, workshop papers, and posters. Figure A.1 indicates (in blue) the publications for each area of the thesis.

**Figure A.1.:** Publications:[FBKK19; FH16; FH18; FHLB17; FKB18; FKHB19; FSH17; GF19], Theses: [Det20; Gra18; Gre19; Gru19; Gru20; Söh18; Sta17; SWD19; Tru20; Yoo19; Zah20]


Further, a number of student theses were supervised by the author of this thesis. We highlight the supervised theses (in grey) and map each one to the areas it addresses.

### **A.2. Implementations of Resource Demands in ProtoCom**

### **A.2.1. Fibonacci Numbers**

In comparison to the example given in Chapter 5, ProtoCom uses an iterative approach (see Lst. A.1). This implementation does not focus on a specific Fibonacci number, but on the number of Fibonacci calculations performed (given by the iterationCount).

```
private long fibonacci(double iterationCount) {
    long i1 = 1;
    long i2 = 1;
    long i3 = 0;
    // iteratively compute Fibonacci numbers; iterationCount controls the demand size
    for (long i = 0; i < iterationCount; i++) {
        i3 = i1 + i2;
        i2 = i1;
        i1 = i3;
    }
    return i3;
}
```
**Listing A.1:** Implementation of the Fibonacci demand in ProtoCom

### **A.2.2. Mandel Set**

```
private void drawMandelbrot(long init) {
    // Date d1 = new Date();
    int n = (int) init;
    float m = n;
    int x, y;
    for (y = -n; y < n; y++) {
        // System.out.print("\n");
        for (x = -n; x < n; x++) {
            if (iterate(x / m, y / m) == 0) {
                // System.out.print("*");
            } else {
                // System.out.print(" ");
            }
        }
    }
    // Date d2 = new Date();
    // long diff = d2.getTime() - d1.getTime();
    // System.out.println("\nJava Elapsed " + diff / 1000.0f);
}

// BAILOUT and MAX_ITERATIONS are constants of the surrounding class
private int iterate(float x, float y) {
    float cr = y - 0.5f;
    float ci = x;
    float zi = 0.0f;
    float zr = 0.0f;
    int i = 0;
    while (true) {
        i++;
        float temp = zr * zi;
        float zr2 = zr * zr;
        float zi2 = zi * zi;
        zr = zr2 - zi2 + cr;
        zi = temp + temp + ci;
        if (zi2 + zr2 > BAILOUT) {
            return i;
        }
        if (i > MAX_ITERATIONS) {
            return 0;
        }
    }
}
```
**Listing A.2:** Implementation of the Mandel Set demand in ProtoCom

### **A.2.3. Sorting Arrays**

```
// arraySize, values, SEED, and DEFAULT_ARRAY_SIZE are members of the
// surrounding ProtoCom demand class; getArray(n) returns an array of
// n random doubles.
public SortArrayDemand(final int arraySize) {
    super(-3, 0, 3, 10000, 50);
    this.arraySize = arraySize;
    this.values = new double[this.arraySize];
    final Random r = new Random(SEED);
    for (int i = 0; i < this.values.length; i++) {
        this.values[i] = r.nextDouble();
    }
}

public SortArrayDemand() {
    this(DEFAULT_ARRAY_SIZE);
}

private void sortArray(final int amountOfNumbers) {
    final int iterations = amountOfNumbers / this.arraySize;
    final int rest = amountOfNumbers % this.arraySize;
    for (int i = 0; i < iterations; i++) {
        final double[] lotsOfDoubles = getArray(this.arraySize);
        Arrays.sort(lotsOfDoubles);
    }
    final double[] lotsOfDoubles = getArray(rest);
    Arrays.sort(lotsOfDoubles);
}
```
**Listing A.3:** Implementation of the Sorting Array demand in ProtoCom

### **A.2.4. Calculate Prime Demand**

```
// number is a member of the surrounding class and holds the first
// candidate to test for primality
private long calculatePrime(double numberNextPrimes) {

    boolean isPrime = true;
    long currentNumber = number;
    long primesFound = 0;
    long currentDivisor;
    long upperBound;

    while (primesFound < numberNextPrimes) {
        // test primality of currentNumber
        currentDivisor = 2;
        upperBound = currentNumber / 2;
        while ((currentDivisor < upperBound) && (isPrime)) {
            isPrime = currentNumber % currentDivisor != 0;
            currentDivisor++;
        }
        // count primes and continue
        if (isPrime) {
            primesFound++;
        }
        // prepare for next iteration
        isPrime = true;
        currentNumber++;
    }
    return currentNumber;
}
```
**Listing A.4:** Implementation of the Calculate Prime demand in ProtoCom

### **A.2.5. Counting Numbers Demand**

```
// k is a member of the surrounding class
private void countNumbers(double countTo) {
    for (long j = 0; j < countTo; j++) {
        if (k > 100000) {
            k = 0;
        }
        k += j;
    }
}
```

**Listing A.5:** Implementation of the Counting Numbers demand in ProtoCom

### **A.2.6. Matrix Multiplication Demand**

```
private static final int DEFAUL_MATRIX_SIZE = 500;
private final double matrixA[][];
private final double matrixB[][];
private final int matrixSize;

public MultiplyMatrixDemand(int matrixSize) {
    super(-3, 0, 3, 10000, 50);
    this.matrixSize = matrixSize;

    matrixA = new double[matrixSize][matrixSize];
    matrixB = new double[matrixSize][matrixSize];

    // fillMatrixRandom(...) is a helper of the surrounding class
    fillMatrixRandom(matrixA);
    fillMatrixRandom(matrixB);
}

public MultiplyMatrixDemand() {
    this(DEFAUL_MATRIX_SIZE);
}

private void multiplyMatrix(final long numberOfMultiplications) {
    double resultMatrix[][] = new double[matrixSize][matrixSize];
    long numberOfPerformedMultiplications = 0;

    while (numberOfPerformedMultiplications < numberOfMultiplications) {
        for (int i = 0; i < matrixA.length; i++) {
            for (int k = 0; k < matrixB.length; k++) {
                for (int j = 0; j < matrixA.length; j++) {
                    if (numberOfPerformedMultiplications < numberOfMultiplications) {
                        resultMatrix[i][j] = resultMatrix[i][j] + matrixA[i][k] * matrixB[k][j];
                        numberOfPerformedMultiplications++;
                    } else {
                        return;
                    }
                }
            }
        }
    }
}
```
**Listing A.6:** Implementation of the Matrix Multiplication demand in ProtoCom

### **A.3. User Study Protocols**

### **A.3.1. Blank User Study Leaflet—Group A**

### **Controlled User Study: Usability and Efficiency Evaluation of the Parallel Performance Catalogue Extension for the Palladio-Bench**

User Study Leaflet

General Information:

In this experiment you will be modeling parallel behaviors in Palladio. The experiment contains two use case scenarios and each scenario contains one modeling task. For each task you will have 30 minutes. In order for your participation to be successful, you have to work on both tasks. Your modeling solution is correct when a simulation of the model starts and finishes successfully. Even if you are not able to achieve a working model in the given time, your submission still counts and your participation will be counted as successful. While you are completing the modeling tasks, your task completion time, number of errors, and time spent in errors will be recorded and noted. At certain points during the study, you will encounter questions from the questionnaire which you have to answer before proceeding with the next task.

### **Introductory questions:**


none □ □ □ □ □ □ □ expert

3. How would you rate your experience with Palladio before the conduction of this experiment?

none □ □ □ □ □ □ □ expert

## Consent Form

**DESCRIPTION:** You are invited to participate in **a research study** on **different modeling tools in the Palladio-Bench tool**.

**TIME INVOLVEMENT:** Your participation will take approximately **60 minutes.**

**DATA COLLECTION:** For this study you will model use case scenarios in Palladio. During the modeling process, metrics such as task completion time, number of errors, and time spent in errors will be measured. Also, you will need to fill in a questionnaire.

**RISKS AND BENEFITS:** There is no risk associated with this study. The collected data is securely stored. We guarantee that no data misuse occurs and that privacy is completely preserved. Your decision whether or not to participate in this study will not affect your grade in school.

**PARTICIPANT'S RIGHTS:** If you have read this form and have decided to participate in this project, please understand your **participation is voluntary** and you have the **right to withdraw your consent or discontinue participation at any time without penalty or loss of benefits to which you are otherwise entitled**. **The alternative is not to participate.** The results of this research study may be presented at scientific or professional meetings or published in scientific journals. Your identity is not disclosed unless we directly inform you and ask for your permission.

**CONTACT INFORMATION:** If you have any questions, concerns or complaints about this research, its procedures, risks and benefits, contact the following persons: Denis Zahariev (denis.zahariev95@gmail.com), Markus Frank (markus.frank@iste.uni-stuttgart.de).

*By signing this document I confirm that I agree to the terms and conditions.*

*Name: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Signature, Date: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_* 

# Use Case Scenarios and Modeling Tasks

## **Use Case Scenario 1**

**Start with reading the use case description and then proceed with the task.**

### **Use Case Description:**

The software in this use case is used to search for a list of literature in various scientific databases. The search is executed in parallel where each database is searched in a separate thread. For the purpose of this scenario, the number of databases is limited to 16. The software consists of one component and one providing interface. The interface declares the search method and the component implements it. In the specification of the method, all of the threads responsible for the search are created. The searching operation for one list of literature in a single database requires 100 CPU resources. Each thread also requires 5 CPU resources for the synchronization overhead resulting from the creation and the start of the thread. Exactly one instance of the component and the interface are present in the software system. The resource environment where the system is deployed has a CPU with a processing rate of 200 and 4 replicas, and the whole system is deployed on a single container. In the usage scenario, a single call of the search method is started with a closed workload of one user and no think time.

## **Task A (Standard toolkit):**

In the project that you receive, every diagram is complete except the SEFF Diagram of the basic component. Your task is to complete the SEFF Diagram.

### Questionnaire

## **Questions regarding Use Case Scenario 1:**

4. How would you rate the difficulty of the task in Use Case Scenario 1?

very easy □ □ □ □ □ □ □ very hard

5. How would you rate your performance regarding the task in Use Case Scenario 1?

very slow □ □ □ □ □ □ □ very fast

6. How would you rate the amount of work required for completing the task in Use Case Scenario 1?

too little □ □ □ □ □ □ □ too much

7. How would you rate the usability of the standard toolkit regarding the modeling of parallel behaviors and your user experience with it?

very bad □ □ □ □ □ □ □ very good 

# Use Case Scenarios and Modeling Tasks

## **Use Case Scenario 2**

**Start with reading the use case description and then proceed with the task.**

## **Use Case Description:**

The software in this use case is used in machine learning in order to speed up complex calculations. It multiplies two 16x16 matrices and the multiplication is executed in parallel where each row of the resulting matrix is calculated in a separate thread. With the given size of the matrices, this results in 16 threads. The software consists of one component and one providing interface. The interface declares the multiply method and the component implements it. The multiplication operation for one of the resulting rows requires 125 CPU resources. Each thread also requires 5 CPU resources for the synchronization overhead resulting from the creation and the start of the thread. Exactly one instance of the component and the interface are present in the software system. The resource environment where the system is deployed has a CPU with a processing rate of 250 and 4 replicas, and the whole system is deployed on a single container. In the usage scenario, a single call of the multiply method is started with a closed workload of one user and no think time.

# **Task B (Parallel Performance Catalogue):**

In the project that you receive, every diagram is complete except the SEFF Diagram of the basic component. The files required for the experiment automation are also complete. Your task is to complete the SEFF Diagram and to apply the Parallel Loops AT.

### Questionnaire

## **Questions regarding Use Case Scenario 2:**

1. How would you rate the difficulty of the task in Use Case Scenario 2?

very easy □ □ □ □ □ □ □ very hard

2. How would you rate your performance regarding the task in Use Case Scenario 2?

very slow □ □ □ □ □ □ □ very fast

3. How would you rate the amount of work required for completing the task in Use Case Scenario 2?

too little □ □ □ □ □ □ □ too much

4. How would you rate the usability of the Parallel Performance Catalogue regarding the modeling of parallel behaviors and your user experience with it?

very bad □ □ □ □ □ □ □ very good

# **Questions regarding the Parallel Performance Catalogue:**

5. How would you rate the usability of the Parallel Performance Catalogue in comparison to the standard toolkit?


### **A.3.2. Blank User Study Leaflet—Group B**

### **Controlled User Study: Usability and Efficiency Evaluation of the Parallel Performance Catalogue Extension for the Palladio-Bench**

User Study Leaflet

General Information:

In this experiment you will be modeling parallel behaviors in Palladio. The experiment contains two use case scenarios and each scenario contains one modeling task. For each task you will have 30 minutes. In order for your participation to be successful, you have to work on both tasks. Your modeling solution is correct when a simulation of the model starts and finishes successfully. Even if you are not able to achieve a working model in the given time, your submission still counts and your participation will be counted as successful. While you are completing the modeling tasks, your task completion time, number of errors, and time spent in errors will be recorded and noted. At certain points during the study, you will encounter questions from the questionnaire which you have to answer before proceeding with the next task.

### **Introductory questions:**


none □ □ □ □ □ □ □ expert

3. How would you rate your experience with Palladio before the conduction of this experiment?

none □ □ □ □ □ □ □ expert

## Consent Form

**DESCRIPTION:** You are invited to participate in **a research study** on **different modeling tools in the Palladio-Bench tool**.

**TIME INVOLVEMENT:** Your participation will take approximately **60 minutes.**

**DATA COLLECTION:** For this study you will model use case scenarios in Palladio. During the modeling process, metrics such as task completion time, number of errors, and time spent in errors will be measured. Also, you will need to fill in a questionnaire.

**RISKS AND BENEFITS:** There is no risk associated with this study. The collected data is securely stored. We guarantee that no data misuse occurs and that privacy is completely preserved. Your decision whether or not to participate in this study will not affect your grade in school.

**PARTICIPANT'S RIGHTS:** If you have read this form and have decided to participate in this project, please understand your **participation is voluntary** and you have the **right to withdraw your consent or discontinue participation at any time without penalty or loss of benefits to which you are otherwise entitled**. **The alternative is not to participate.** The results of this research study may be presented at scientific or professional meetings or published in scientific journals. Your identity is not disclosed unless we directly inform you and ask for your permission.

**CONTACT INFORMATION:** If you have any questions, concerns or complaints about this research, its procedures, risks and benefits, contact the following persons: Denis Zahariev (denis.zahariev95@gmail.com), Markus Frank (markus.frank@iste.uni-stuttgart.de).

*By signing this document I confirm that I agree to the terms and conditions.*

*Name: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Signature, Date: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_* 

# Use Case Scenarios and Modeling Tasks

## **Use Case Scenario 1**

**Start with reading the use case description and then proceed with the task.**

### **Use Case Description:**

The software in this use case is used to search for a list of literature in various scientific databases. The search is executed in parallel where each database is searched in a separate thread. For the purpose of this scenario, the number of databases is limited to 16. The software consists of one component and one providing interface. The interface declares the search method and the component implements it. In the specification of the method, all of the threads responsible for the search are created. The searching operation for one list of literature in a single database requires 100 CPU resources. Each thread also requires 5 CPU resources for the synchronization overhead resulting from the creation and the start of the thread. Exactly one instance of the component and the interface are present in the software system. The resource environment where the system is deployed has a CPU with a processing rate of 200 and 4 replicas, and the whole system is deployed on a single container. In the usage scenario, a single call of the search method is started with a closed workload of one user and no think time.

### **Task B (Parallel Performance Catalogue):**

In the project that you receive, every diagram is complete except the SEFF Diagram of the basic component. The files required for the experiment automation are also complete. Your task is to complete the SEFF Diagram and to apply the Parallel Loops AT.

### Questionnaire

## **Questions regarding Use Case Scenario 1:**

1. How would you rate the difficulty of the task in Use Case Scenario 1?

very easy □ □ □ □ □ □ □ very hard

2. How would you rate your performance regarding the task in Use Case Scenario 1?

very slow □ □ □ □ □ □ □ very fast

3. How would you rate the amount of work required for completing the task in Use Case Scenario 1?

too little □ □ □ □ □ □ □ too much

4. How would you rate the usability of the Parallel Performance Catalogue regarding the modeling of parallel behaviors and your user experience with it?

very bad □ □ □ □ □ □ □ very good 

# Use Case Scenarios and Modeling Tasks

## **Use Case Scenario 2**

**Start with reading the use case description and then proceed with the task.**

## **Use Case Description:**

The software in this use case is used in machine learning in order to speed up complex calculations. It multiplies two 16x16 matrices and the multiplication is executed in parallel where each row of the resulting matrix is calculated in a separate thread. With the given size of the matrices, this results in 16 threads. The software consists of one component and one providing interface. The interface declares the multiply method and the component implements it. The multiplication operation for one of the resulting rows requires 125 CPU resources. Each thread also requires 5 CPU resources for the synchronization overhead resulting from the creation and the start of the thread. Exactly one instance of the component and the interface are present in the software system. The resource environment where the system is deployed has a CPU with a processing rate of 250 and 4 replicas, and the whole system is deployed on a single container. In the usage scenario, a single call of the multiply method is started with a closed workload of one user and no think time.

## **Task A (Standard toolkit):**

In the project that you receive, every diagram is complete except the SEFF Diagram of the basic component. Your task is to complete the SEFF Diagram.

### Questionnaire

## **Questions regarding Use Case Scenario 2:**

5. How would you rate the difficulty of the task in Use Case Scenario 2?

very easy □ □ □ □ □ □ □ very hard

6. How would you rate your performance regarding the task in Use Case Scenario 2?

very slow □ □ □ □ □ □ □ very fast

7. How would you rate the amount of work required for completing the task in Use Case Scenario 2?

too little □ □ □ □ □ □ □ too much

8. How would you rate the usability of the standard toolkit regarding the modeling of parallel behaviors and your user experience with it?

very bad □ □ □ □ □ □ □ very good

# **Questions regarding the Parallel Performance Catalogue:**

9. How would you rate the usability of the Parallel Performance Catalogue in comparison to the standard toolkit?



### **A.3.3. Blank Measurement Protocol**


## **Measurement Protocol**

Date: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

**Use Case Scenario 1:**

1. Start time: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

2. Finish time: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

3. Number of errors and time spent in errors:

**Use Case Scenario 2:**

4. Start time: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

5. Finish time: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

6. Number of errors and time spent in errors:

Total number of errors: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ Total time spent in errors: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

### **A.4. Additional Performance Factor Measurements**

### **A.4.1. Speedup Behaviour**

### **A.4.1.1. Server Potsdam Small**

**(b)** Speedup Curve for all Demands Using Pyjama (OpenMP)

**Figure A.2.:** Speedup for Threads and OpenMP [Gre19]

**(b)** Speedup Curve for all Demands Using AKKA Actors

**Figure A.3.:** Speedup for Streams and Actors [Gre19]

### **A.4.1.2. Server Potsdam Large**

**(b)** Speedup Curve for all Demands Using Pyjama (OpenMP)

**Figure A.4.:** Speedup for Threads and OpenMP [Gre19]

**(b)** Speedup Curve for all Demands Using AKKA Actors

**Figure A.5.:** Speedup for Streams and Actors [Gre19]

### **A.4.1.3. Multi Node Cluster – BW Cloud**

**(b)** Speedup Curve for all Demands Using Pyjama (OpenMP)

**Figure A.6.:** Speedup for Threads and OpenMP [Gre19]

**(b)** Speedup Curve for all Demands Using AKKA Actors

**Figure A.7.:** Speedup for Streams and AKKA Actors [Gre19]

### **A.4.2. Cache Behaviour**

### **A.4.2.1. Uni Stuttgart – L2 Cache**

**(c)** L2 Cache Behaviour for AKKA Actors

**Figure A.8.:** L2 Cache Behaviour [Gre19]

### **A.4.2.2. Uni Stuttgart – L3 Cache**

**(c)** L3 Cache Behaviour for AKKA Actors

**Figure A.9.:** L3 Cache Behaviour [Gre19]

**A.4.2.3. Server Potsdam Large – L2 Cache**

**Figure A.10.:** L2 cache behaviour for Threads and OpenMP [Gre19]

**(b)** L2 Cache Behaviour for AKKA Actors

### **A.4.2.4. Server Potsdam Large – L3 Cache**

**(b)** L3 Cache Behaviour for AKKA Actors

**Figure A.13.:** L3 Cache Behaviour for Streams and AKKA Actors [Gre19]

### **A.4.2.5. Server Potsdam Small – L2 Cache**

**(b)** L2 Cache Behaviour for AKKA Actors

**Figure A.15.:** L2 Cache Behaviour for Streams and AKKA Actors [Gre19]

### **A.4.2.6. Server Potsdam Small – L3 Cache**

**(b)** L3 Cache Behaviour for Pyjama (OpenMP)

**(b)** L3 Cache Behaviour for AKKA Actors

**Figure A.17.:** L3 cache behaviour for Streams and AKKA Actors [Gre19]

### **A.4.2.7. Multi Node Cluster (BW Cloud) – L3 Cache**

**(b)** L3 Cache Behaviour for AKKA Actors

**Figure A.19.:** L3 Cache Behaviour for Streams and AKKA Actors [Gre19]

### **A.4.3. Performance Curves**

### **A.4.3.1. Performance Curve for Dedicated Hardware**


**Table A.1.:** Extracted Performance Curves for Dedicated Machines based on the Speedup Behaviour of the Demands

### **A.4.3.2. Performance Curves for Virtualised Hardware**


**Table A.2.:** Extracted Performance Curves for Virtualised Machines Based on the Speedup Behaviour of the Demands

### **A.4.4. Performance Prediction Error**

**(a)** Prediction of the Palladio and Performance Curves Approaches in Comparison to the Measurements for the Best Case imagick

**(b)** Prediction of the Palladio and Performance Curves Approaches in Comparison to the Measurements for the Worst Case md

**(a)** Prediction of the Speedup for the Approaches Palladio and Performance Curves in Comparison to the Measured Speedup for the Best Case imagick

**(b)** Prediction of the Speedup for the Approaches Palladio and Performance Curves in Comparison to the Measured Speedup for the Worst Case md

### **A.5. Memory Hierarchy Models**

### **A.5.1. Sirius Extension for Memory Hierarchy Model**

**Figure A.22.:** Screenshot of the .odesign File for the Memory Hierarchy [Tru20]

**Figure A.23.:** Screenshot of the Memory Hierarchy Editor with Palette Showing Elements That Can Be Added to the Diagram [Tru20]


**Figure A.24.:** Screenshot of the Memory Hierarchy Editor with an Edit Dialog [Tru20]

**Figure A.25.:** Screenshot of the .odesign File for the SeWithMemoryHierarchy Viewpoint [Tru20]


**Figure A.26.:** Screenshot of the Sirius Viewpoint Setting with the Viewpoints SEFF and SeWithMemoryHierarchy Activated [Tru20]

### **A.5.2. CPU and Memory Demand Calibration**

To get the pure CPU demand (without memory hierarchy demand), we used the measurements we took from the sequential execution and the perf measurements. The intention in extracting the pure CPU demand is that the measurements from a sequential run contain both the CPU demand and the memory hierarchy demand. So, if we had used the measurements from a sequential run also for the multicore models, we would have implicitly considered memory hierarchy demands. Thus, by modelling memory hierarchy demands explicitly (as we do in CB3) and not using the pure CPU demands, we would have considered memory hierarchy demands twice.

To extract the pure CPU demands from the sequential measurements, we use the perf measurements and calculate the demand with the following formula:

$$Demand\_{CPU} = time\_{singleThreaded} - time\_{memoryHierarchy} \tag{A.1}$$

To estimate the $time\_{memoryHierarchy}$, we use two different formulas: one for non-cache-line models and one for cache-line models.

### **A.5.2.1. Memory Time for Non-Cache-Line Models**

For non-cache-line models, we assume the transfer of Java integers, i.e., 4 bytes. We multiply the measured cache access counts (load operations) from perf with 4 bytes and divide by the respective memory bandwidth. The formula is the following:

$$time\_{memoryHierarchy} = \frac{load\_{dcache} \times 4}{bandwidth\_{L1}} + \frac{load\_{L2} \times 4}{bandwidth\_{L2}} + \frac{load\_{L3} \times 4}{bandwidth\_{L3}} + \frac{load\_{DRAM} \times 4}{bandwidth\_{DRAM}} \tag{A.2}$$

or:

$$time\_{memoryHierarchy} = 4 \times \left( \frac{load\_{dcache}}{bandwidth\_{L1}} + \frac{load\_{L2}}{bandwidth\_{L2}} + \frac{load\_{L3}}{bandwidth\_{L3}} + \frac{load\_{DRAM}}{bandwidth\_{DRAM}} \right) \tag{A.3}$$

### **A.5.2.2. Memory Time for Cache-Line Models**

In case we consider cache-line models, we do not multiply by 4 bytes but use the cache-line size. Only for the data transfer between the CPU registers and the L1 cache do we assume the lower data rate of the actual values (i.e., 4-byte integers). In all the hardware systems we consider, the cache-line size is 64 bytes.

Thus, the following formula is used:

$$time\_{memoryHierarchy} = \frac{load\_{dcache} \times 4}{bandwidth\_{L1}} + size\_{cacheLine} \times \left( \frac{load\_{L2}}{bandwidth\_{L2}} + \frac{load\_{L3}}{bandwidth\_{L3}} + \frac{load\_{DRAM}}{bandwidth\_{DRAM}} \right) \tag{A.4}$$
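
For illustration, the following Java sketch (our own helper for this appendix, not part of ProtoCom or the calibration tooling; all names are ours) computes Equations A.1 and A.4 from perf load counts and measured bandwidths:

```
/**
 * Minimal sketch of the CPU demand calibration (Eq. A.1 and A.4).
 * Inputs are assumed to be in consistent units, e.g., loads as counts,
 * bandwidths in bytes per second, times in seconds.
 */
public class DemandCalibration {

    static final int INTEGER_SIZE = 4;     // bytes per Java int (registers <-> L1)
    static final int CACHE_LINE_SIZE = 64; // bytes, on all considered machines

    /** Eq. A.4: memory hierarchy time for cache-line models. */
    static double memoryHierarchyTime(double loadDcache, double loadL2,
                                      double loadL3, double loadDram,
                                      double bwL1, double bwL2,
                                      double bwL3, double bwDram) {
        return (loadDcache * INTEGER_SIZE) / bwL1
                + CACHE_LINE_SIZE * (loadL2 / bwL2 + loadL3 / bwL3 + loadDram / bwDram);
    }

    /** Eq. A.1: pure CPU demand = single-threaded time minus memory time. */
    static double pureCpuDemand(double singleThreadedTime, double memoryTime) {
        return singleThreadedTime - memoryTime;
    }
}
```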

### **A.5.3. Results HPI Small (12 Cores)**

**(a)** Comparison of Prediction Models: Prediction Error in % for the 12-Core Machine and Small Use Case

**(b)** Comparison of Prediction Models: Prediction Error in % for the 12-Core Machine and Large Use Case

**(c)** Comparison of Prediction Models: Speedup Diagram for the 12-Core Machine and Small Use Case

**(d)** Comparison of Prediction Models: Speedup Diagram for the 12-Core Machine and Large Use Case

### **A.5.4. Results HPI Large (40 Cores)**

**(a)** Comparison of Prediction Models: Prediction Error in % for the 40-Core Machine and Small Use Case

**(b)** Comparison of Prediction Models: Prediction Error in % for the 40-Core Machine and Large Use Case

**(c)** Comparison of Prediction Models: Speedup Diagram for the 40-Core Machine and Small Use Case

**(d)** Comparison of Prediction Models: Speedup Diagram for the 40-Core Machine and Large Use Case

### **A.5.5. Results Stuttgart (96 Cores)**

**(a)** Comparison of Prediction Models: Prediction Error in % for the 96-Core Machine and Small Use Case

**(b)** Comparison of Prediction Models: Prediction Error in % for the 96-Core Machine and Large Use Case

**(c)** Comparison of Prediction Models: Speedup Diagram for the 96-Core Machine and Small Use Case

**(d)** Comparison of Prediction Models: Speedup Diagram for the 96-Core Machine and Large Use Case

### **A.6. CPU Simulator**

### **A.6.1. Extension Points to Connect Trace-Driven CPU Simulators to Palladio**

### **A.6.1.1. SimuCom Extension Point A**

**Listing A.7:** de.uka.ipd.sdq.simucomframework.resources.ScheduledResource – getScheduledResource()

```
private IActiveResource getScheduledResource(final SimuComModel simuComModel,
        final String sensorDescription) {

    IActiveResource scheduledResource = null;
    // active resources scheduled by standard scheduling techniques
    if ((getSchedulingStrategyID().equals(SchedulingStrategy.FCFS))
            || (getSchedulingStrategyID().equals(SchedulingStrategy.PROCESSOR_SHARING))
            || (getSchedulingStrategyID().equals(SchedulingStrategy.DELAY))) {
        // ...
    } else {
        scheduledResource = getModel().getSchedulingFactory().createResourceFromExtension(
                getSchedulingStrategyID(), getNextResourceId(), getNumberOfInstances());
    }

    if (scheduledResource instanceof SimuComExtensionResource) {
        // The resource takes additional configuration that is available in the SimuComModel object.
        // As the scheduler project is currently SimuCom-agnostic, we use the
        // SimuComExtensionResource class to initialize the resource with a SimuCom-related object.
        ((SimuComExtensionResource) scheduledResource).initialize(simuComModel);
    }
    return scheduledResource;
}
```
### **A.6.1.2. SimuCom Extension Point B**

**Listing A.8:** de.uka.ipd.sdq.simucomframework.ExperimentRunner – run()

```
public static double run(SimuComModel model, long simTime) {
    // ...
    setupStopConditions(model);

    // measure elapsed time for the simulation
    double startTime = System.nanoTime();

    ISimulationControl simulationControl = model.getSimulationControl();
    simulationControl.start();

    return System.nanoTime() - startTime;
}
```
**Figure A.30.:** PCM Influence on the SE RMI Prediction Prototype [Gra18]

**Figure A.31.:** Sequence Diagram for Initialisation and Assembly using RMI [Gra18]

### **A.6.2. SimulatorBuilder Class**

### **A.6.3. MaxSim Config File**

**Listing A.9:** MaxSim: Hardware Configuration – 8 Cores [Gra18]

**Figure A.32.:** Sequence Diagram for Prototype without RMI [Gra18]

```
sim = {
    maxTotalInstrs = 1000000000000L;
    phaseLength = 10000;
    statsPhaseInterval = 10000;
    pointerTagging = true;
    ffReinstrument = true;
    logToFile = true;
};
sys = {
    caches = {
        l1d = {
            array = {
                type = "SetAssoc";
                ways = 8;
            };
            caches = 8;
            latency = 4;
            size = 32768;
        };
        l1i = {
            array = {
                type = "SetAssoc";
                ways = 4;
            };
            caches = 8;
            latency = 3;
            size = 32768;
        };
        l2 = {
            array = {
                type = "SetAssoc";
                ways = 8;
            };
            caches = 8;
            latency = 6;
            children = "l1i|l1d";
            size = 262144;
            MAProfCacheGroupId = 0;
        };
        l3 = {
            array = {
                hash = "H3";
                type = "SetAssoc";
                ways = 16;
            };
            banks = 8;
            caches = 1;
            latency = 30;
            children = "l2";
            size = 33554432;
            MAProfCacheGroupId = 1;
        };
        MAProfCacheGroupNames = "l2|l3";
    };
    cores = {
        haswell = {
            cores = 16;
            dcache = "l1d";
            icache = "l1i";
            type = "OOO";
        };
    };
};
[...]
l3 = {
    banks = 16;
    caches = 1;
    latency = 30;
    children = "l2";
    size = 67108864;
};
```
### **A.6.4. ProtoCom Calibration**

**Listing A.10:** MaxSim: Calibration Run-Config [Gra18]

```
process0 = {
    command = "./maxine/com.oracle.max.vm.native/generated/linux/maxvm \
    -XX:+MaxSimExitFFOnVMEnter \
    -XX:+MaxSimEnterFFOnVMExit \
    -XX:+MaxSimProfiling \
    -XX:+MaxSimPrintProfileOnVMExit \
    -cp /usr/local/src/calibrationTool.jar
        me.graef.sebastian.bachelor.thesis.Main";
    startFastForwarded = true;
    syncedFastForward = "Never";
};
```
**Listing A.11:** MaxSim: Calibration Results [Gra18]

```
# zsim stats
===
root: # Stats
  contention: # Contention simulation stats
    domain-0: # Domain stats
      time: 25707115262 # Weave simulation time
  time: # Simulator time breakdown
    init: 5369536005
    bound: 8900122799609
    weave: 1629998320947
    ff: 2072018500
[...]
  phase: 5500137 # Simulated phases
  haswell: # Core stats
    haswell-0: # Core stats
      cycles: 55001375142 # Simulated unhalted cycles
      [...]
    haswell-1: # Core stats
      cycles: 0 # Simulated unhalted cycles
      cCycles: 0 # Cycles due to contention stalls
[...]
```
### **A.7. Research Questions and Answers**

Because the research questions map to the contributions, we already discussed each research question in the corresponding chapter of the contribution. For the sake of a better overview, we briefly summarise the outcome and the answer to each research question in the following.

### **A.7.0.1. RQ 1: Modelling of parallel performance-relevant behaviour in massive parallel environments**

RQ 1.1: Are software architects able to model even simple parallel concepts of highly parallel systems in an efficient way?

**Answer:** We could show, in an empirical user study using a controlled experiment, that current state-of-the-art tools do not support SAs in an efficient way.

RQ 1.2: Are software architects able to model the parallel software behaviour of an application with the help of current modelling languages, so that (a) the relevant performance characteristics are captured and expressed, and (b) all necessary information for performance evaluation is covered?

**Answer:** SAs are currently not able to model (a) all relevant characteristics of parallel software, which results in (b) inaccurate performance predictions for parallel software in multicore environments.

RQ 1.3: How can software architects be supported in the task to create accurate performance prediction models efficiently?

**Answer:** With the help of a parallel AT catalogue, SAs can be supported to create performance prediction models faster and with a higher user acceptance (usability). Further, they can use the concept of overhead modelling to increase the accuracy of the predictions.

### **A.7.0.2. RQ 2: Performance behaviour of highly parallel applications in massive parallel environments**

RQ 2.1: How do highly parallel applications behave in massive parallel environments (multicore systems) regarding response time (speedup), memory access rates (L1, L2, L3, RAM usage), and memory bandwidth utilisation?

**Answer:** In over 800 experiments we took 70,000 measurements. Thereby, we monitored the response time and memory accesses of the systems. Using these measurements we extracted the twelve performance curves given in Table 7.3 to describe the behaviour.

RQ 2.2: What factors influence performance the most in highly parallel applications?

**Answer:** In Table 7.1, we list the top eight performance-influencing factors we identified through a structured literature review, expert interviews, and the experiments.

RQ 2.3: Does the choice of parallelisation strategy have a significant impact on behaviour?

**Answer:** The experiments show slight differences in the performance of the individual parallelisation paradigms. However, these differences are not significant among the thread-based paradigms. The only paradigm that diverges is the AKKA Actors implementation. Here, we assume issues in the coding of the framework.

RQ 2.4: Do highly parallel applications show similar behaviour, which can be described by one or multiple performance curves?

**Answer:** In Table 7.3, we present performance curves for all the researched resource demands. We used linear regression to extract the curves from the measurements. Thus, the curves describe the average behaviour for each demand type on all the tested machines.
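
As a minimal sketch of this extraction step (illustrative only; the class and the purely linear model are our assumptions, not the tooling used in the thesis), an ordinary least-squares fit of speedup over core count looks as follows:

```
/**
 * Minimal sketch: fit speedup = a + b * cores by ordinary least squares,
 * as an illustration of extracting a linear performance curve from
 * (cores, speedup) measurement pairs.
 */
public class CurveFit {

    /** Returns {a, b} for the model speedup = a + b * cores. */
    static double[] fitLine(double[] cores, double[] speedup) {
        int n = cores.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += cores[i];
            sy += speedup[i];
            sxx += cores[i] * cores[i];
            sxy += cores[i] * speedup[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope
        double a = (sy - b * sx) / n;                         // intercept
        return new double[] { a, b };
    }
}
```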

Finally, we can verify or falsify our hypotheses as follows:

Hypothesis 2.1: The speedup and performance behaviour of highly parallel applications depends heavily on the chosen parallelisation strategy or paradigm.

**Reject:** The choice of the parallelisation strategy does not have a high impact on the behaviour.

Hypothesis 2.2: The hardware architecture (e.g., number of CPU cores, memory bandwidth, memory hierarchies) of the execution environment has a strong impact on the performance of parallel applications.

**Accept:** We measured differences in the normalised speedup for all the machines. Thus, we can confirm that the hardware architecture has an impact on the performance. The biggest noticeable difference is between virtualised hardware and dedicated systems: virtualised hardware shows worse performance.

Hypothesis 2.3: The speedup of a parallel application is not only influenced by the number of cores available in a system but also by additional hardware-specific performance-influencing factors.

**Accept:** In Table 7.1, we list the top eight performance-influencing factors we identified.

### **A.7.0.3. RQ 3: Performance Prediction Models**

RQ 3.1: Are current simulation-based performance prediction approaches capable of predicting the performance of parallel and highly parallel systems accurately?

**Answer:** The experiments we performed in [FH16; FSH17] show that current state-of-the-art performance prediction approaches are up to 80% off when trying to predict the response time for parallel applications in multicore environments.

RQ 3.2: If not, what are the missing characteristics of software behaviour that must be included in performance prediction models (performance-influencing factors)?

**Answer:** Table 7.1 shows the top eight performance-influencing factors, which we gained from a structured literature review, expert interviews, and experimenting.

RQ 3.3: Can modelling the additional performance-influencing factors improve the overall accuracy of performance prediction?

**Answer:** We showed that both the use of performance curves, which are an abstract representation of the PPiFs, and the modelling of memory hierarchies help to improve the performance predictions for parallel applications in multicore environments. Thereby, we achieve an accuracy of up to 89% for certain scenarios. That result is 57% more accurate than the pure Palladio approach.

### **A.7.0.4. RQ 4: CPU Simulators**

RQ 4.1: Can CPU simulators be used by software architects to evaluate the response time of parallel architectural designs?

**Answer:** We were able to show that it is possible to transform the architectural models into a performance prototype, which we can then use as input for multicore CPU simulators to determine the response or execution time of a parallel application.

RQ 4.2: How would the integration of CPU simulators alter the process of performance predictions?

**Answer:** In Section 9.3, we sketched two approaches to include CPU simulators into the performance prediction workflow: (1) a trace-driven approach, and (2) a source code-driven approach. In both cases, we use the PCM without additional information as the starting point. Next, we transform the PCM with the help of solvers either into a trace file or a performance prototype, which we finally use as input for the multicore simulators.

RQ 4.3: Does the use of CPU simulators increase the performance prediction accuracy for parallel applications in multicore environments?

**Answer:** We implemented the source code-driven approach to evaluate the accuracy of the performance prediction using multicore CPU simulators. Thereby, we used a complex use case, the Bank Transaction Example (see Sec. 5.2.1). For the given example, the prediction accuracy of this approach ranged from 2.50% to 15.29%, which is very inaccurate and up to 54% worse than the pure Palladio approach.

Therefore, we have to reject our Hypothesis 4, which stated that CPU simulators, as used in other domains (e.g., by hardware vendors), can help to improve the predictions for parallel applications on multicore CPUs.

### **List of Abbreviations**


CGSPN Coloured Generalised Stochastic Petri Net


OOO Out-Of-Order

PCM Palladio Component Model

PN Petri Net


### **Literature References**






















[RR07] T. Rauber and G. Rünger. Multicore: Parallele Programmierung. Springer-Verlag, 2007 (cit. on pp. 16, 24).






All URLs were last checked on May 24, 2022

### **Publications of the Author**






[VFB20] V. Vijayshree, M. Frank, and S. Becker. "Extended Abstract of Performance Analysis and Prediction of Model Transformation". In: Companion of the ACM/SPEC International Conference on Performance Engineering. 2020, pp. 8–9.

All URLs were last checked on May 24, 2022

### **Supervised Theses**



### **The Karlsruhe Series on Software Design and Quality**

ISSN 1867-0067





### Band 25 Sebastian Michael Lehrig

Efficiently Conducting Quality-of-Service Analyses by Templating Architectural Knowledge. ISBN 978-3-7315-0756-7

### Band 26 Georg Hinkel

Implicit Incremental Model Analyses and Transformations. ISBN 978-3-7315-0763-5

### Band 27 Christian Stier

Adaptation-Aware Architecture Modeling and Analysis of Energy Efficiency for Software Systems. ISBN 978-3-7315-0851-9

### Band 28 Lukas Märtin

Entwurfsoptimierung von selbst-adaptiven Wartungsmechanismen für software-intensive technische Systeme. ISBN 978-3-7315-0852-6

### Band 29 Axel Busch

Quality-driven Reuse of Model-based Software Architecture Elements. ISBN 978-3-7315-0951-6

### Band 30 Kiana Busch

An Architecture-based Approach for Change Impact Analysis of Software-intensive Systems. ISBN 978-3-7315-0974-5

### Band 31 Misha Strittmatter

A Reference Structure for Modular Metamodels of Quality-Describing Domain-Specific Modeling Languages. ISBN 978-3-7315-0982-0

### Band 32 Markus Frank

Model-Based Performance Prediction for Concurrent Software on Multicore Architectures. A Simulation-Based Approach. ISBN 978-3-7315-1146-5

The volumes are freely available as PDFs at www.ksp.kit.edu or can be ordered as print editions.

### The Karlsruhe Series on Software Design and Quality

**Edited by Prof. Dr. Ralf Reussner**

Model-based performance prediction is a well-known concept to ensure the quality of software. Current state-of-the-art tools like Palladio provide accurate performance prediction for sophisticated and distributed cloud systems. However, they are built upon the assumption of single-core CPU architectures and consider only the clock rate as a single metric for CPU performance. Current processor architectures have multiple cores and a more complex design. Therefore, the use of a single-metric model leads to inaccurate performance predictions for parallel applications in multicore systems.

In this book, we present multiple strategies to extend performance prediction models to support multicore architectures. We performed extensive experiments and present a set of performance curves that reflect the behaviour of characteristic demand types. We included the performance curves in Palladio and increased the accuracy of the performance predictions significantly. Further, we provide a parallel architectural pattern catalogue. This catalogue enables the software architect to model the parallel behaviour of software faster and with fewer errors.
